### SENTIMENT ANALYSIS

**Highlights:**
 <br> 1. A binary classification model
 <br> 2. An object-oriented approach
 <br> 3. An endpoint API that accepts English text and responds with the predicted sentiment
 <br> 4. Early stopping and dropout to reduce overfitting
 
 Note: 
 <br> A. The deep learning models are trained for only 3 epochs here; training for more epochs (for example, 30-50) may improve accuracy.
 <br> B. When the text sent to the API contains quotes, use single quotes inside the enclosing double quotes, for example: "   'bad'  "

#### 1. Environment setup

##### 1.a Install the necessary packages

In [1]:
!pip install pandas
!pip install scikit-learn
!pip install nltk
!pip install tensorflow
!pip install fastapi
!pip install uvicorn
!pip install nest_asyncio requests
!pip install deta



In [3]:
cd D:\\చదువు మరియు సర్టిఫికేట్లు\\చ-22\\చ-22-MachineLearning-NeuralNets Projects\\7. Sentiment Analysis API

D:\చదువు మరియు సర్టిఫికేట్లు\చ-22\చ-22-MachineLearning-NeuralNets Projects\7. Sentiment Analysis API


##### 1.b Import Libraries

In [4]:
# The general library packages

import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split
from sklearn import metrics

# The necessary library packages for text pre-processing

import nltk 
nltk.download('punkt') 
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import re
import string

# The required library packages for Naive Bayes Classification

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# The required library packages for deep learning models

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D
from tensorflow.keras.layers import Flatten, LSTM, Dense, Embedding, Bidirectional
from tensorflow.keras.callbacks import EarlyStopping

# The required library packages for end-point API
import nest_asyncio
from fastapi import FastAPI
import uvicorn 
import requests

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\vpara\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\vpara\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\vpara\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\vpara\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


##### 1.c Fetch the Data set and do required modifications

In [5]:
pd.set_option('display.max_colwidth', 6000)                           # show long tweets without truncation
data = pd.read_csv('airline_sentiment_analysis.csv', usecols=['airline_sentiment', 'text'])
data['airline_sentiment'] = data['airline_sentiment'].map({'positive': 1, 'negative': 0})  # encode labels: positive -> 1, negative -> 0
data.head(3)                                                          # displays the first 3 rows of the dataframe

| | airline_sentiment | text |
|---|---|---|
| 0 | 1 | @VirginAmerica plus you've added commercials to the experience... tacky. |
| 1 | 0 | @VirginAmerica it's really aggressive to blast obnoxious "entertainment" in your guests' faces &amp; they have little recourse |
| 2 | 0 | @VirginAmerica and it's a really big bad thing about it |


In [318]:
# data.info()   # 11541 entries, 2 columns
# data.isna().sum() # no missing values

#### 2. Sentiment Analysis Object-Oriented Design  

##### The class defines the following methods:
<br> I. Text pre-processing: cleans a single text by expanding contractions, removing URLs, mentions, punctuation, and stopwords, and applying lemmatization.
<br> II. Model data: applies the text pre-processing to the entire 'text' column of the original dataset.
<br> III. The Naive Bayes classifier: binary classification with a machine learning model based on Bayes' theorem.
<br> IV. The CNN classifier: a convolutional deep learning model built with TensorFlow and Keras.
<br> V. The LSTM classifier: a recurrent deep learning model built with TensorFlow and Keras.
<br> VI. The default representation method (`__repr__`): displays the model metrics.


For all three models, the dataset's 'text' column is pre-processed and converted to match the model's input format; the 'airline_sentiment' column supplies the labels.
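
The pre-processing steps in I and II can be illustrated without NLTK. The following is a minimal sketch, not the class's implementation: `CONTRACTIONS` and `STOPWORDS` are tiny hypothetical stand-ins for the full lists used in the class, `simple_preprocess` is an illustrative name, and lemmatization is left out.

```python
import re
import string

# Hypothetical, abbreviated stand-ins for the notebook's contraction list
# and for NLTK's English stopword list.
CONTRACTIONS = {"can't": "can not", "it's": "it is", "don't": "do not"}
STOPWORDS = {"the", "is", "it", "a", "to", "and", "do", "can"}

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")   # links
MENTION_PATTERN = re.compile(r"@\S+")                # @mentions

def simple_preprocess(text):
    """Lowercase, expand contractions, strip URLs/mentions/punctuation, drop stopwords."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = URL_PATTERN.sub("", text)
    text = MENTION_PATTERN.sub("", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Keep tokens that are not stopwords and are longer than one character.
    tokens = [w for w in text.split() if w not in STOPWORDS and len(w) > 1]
    return " ".join(tokens)

cleaned = simple_preprocess("@VirginAmerica it's a really big bad thing http://t.co/x")
print(cleaned)
```

Contractions are expanded before punctuation is removed, because the apostrophe is what identifies them. Note that NLTK's English stopword list contains "not", so in the class itself the negations produced by the expansion are dropped again at the stopword step.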


In [6]:
class sentimentanalysis:
    def __init__(self,data):
        self.data = data        
        
    # Method for preprocessing the dataset review text
    
    def textpreprocessing(self,inp):                                 
        
        Pattern1 = r"((http://)[^ ]*|(https://)[^ ]*|( www\.)[^ ]*)" # matches URLs
        Pattern2 = r'@[^\s]+'                                        # matches @mentions
        stopword = set(stopwords.words('english'))                   # words to remove from the review text
        inp = inp.lower()                                            # converts everything to lowercase
        
        # replace shortcuts and contractions with their full English forms
        inp = re.sub(r"yrs", "years", inp)
        inp = re.sub(r"hrs", "hours", inp)
        inp = re.sub(r"bday", "b-day", inp)
        inp = re.sub(r"mother's", "mother", inp)
        inp = re.sub(r"mom's", "mom", inp)
        inp = re.sub(r"dad's", "dad", inp)
        inp = re.sub(r"hahah|hahaha|hahahaha", "haha", inp)
        
        inp = re.sub(r"can't", "can not", inp)
        inp = re.sub(r"wasn't", "was not", inp)
        inp = re.sub(r"don't", "do not", inp)
        inp = re.sub(r"aren't", "are not", inp)
        inp = re.sub(r"isn't", "is not", inp)
        inp = re.sub(r"won't", "will not", inp)
        inp = re.sub(r"shouldn't", "should not", inp)
        inp = re.sub(r"wouldn't", "would not", inp)       
        inp = re.sub(r"haven't", "have not", inp)
        inp = re.sub(r"hasn't", "has not", inp)        
        inp = re.sub(r"couldn't", "could not", inp)
        inp = re.sub(r"weren't", "were not", inp)
        inp = re.sub(r"didn't", "did not", inp)
        inp = re.sub(r"ain't", "am not", inp)
        inp = re.sub(r"doesn't", "does not", inp)

        inp = re.sub(r"he's", "he is", inp)
        inp = re.sub(r"here's", "here is", inp)
        inp = re.sub(r"what's", "what is", inp)
        inp = re.sub(r"there's", "there is", inp)
        inp = re.sub(r"it's", "it is", inp)
        inp = re.sub(r"we're", "we are", inp)
        inp = re.sub(r"that's", "that is", inp)
        inp = re.sub(r"who's", "who is", inp)
        inp = re.sub(r"where's", "where is", inp)
        
        inp = re.sub(r"they're", "they are", inp)
        inp = re.sub(r"you're", "you are", inp)
        inp = re.sub(r"i'm", "I am", inp)
        inp = re.sub(r"you've", "you have", inp)
        inp = re.sub(r"we've", "we have", inp)
        inp = re.sub(r"y'all", "you all", inp)
        inp = re.sub(r"would've", "would have", inp)
        inp = re.sub(r"it'll", "it will", inp)
        inp = re.sub(r"we'll", "we will", inp)
        inp = re.sub(r"he'll", "he will", inp)
        inp = re.sub(r"they'll", "they will", inp)
        inp = re.sub(r"they'd", "they would", inp) 
        inp = re.sub(r"they've", "they have", inp)
        inp = re.sub(r"i'd", "i would", inp)
        inp = re.sub(r"should've", "should have", inp)
        inp = re.sub(r"we'd", "we would", inp)
        inp = re.sub(r"i'll", "I will", inp)
        inp = re.sub(r"let's", "let us", inp)
        inp = re.sub(r"i've", "I have", inp)
        inp = re.sub(r"you'll", "you will", inp)
        inp = re.sub(r"you'd", "you would", inp)
        inp = re.sub(r"could've", "could have", inp)
        inp = re.sub(r"youve", "you have", inp)  

        inp = re.sub(Pattern1,'',inp)
        inp = re.sub(Pattern2,'', inp) 
        inp = inp.translate(str.maketrans("","",string.punctuation)) # removes punctuation from the review text
        
        
        tokens = word_tokenize(inp)                                 # review text words tokenization
        my_tokens = [w for w in tokens if w not in stopword]
        wordLemm = WordNetLemmatizer()                              # Lemmatization, the morphological analysis of the review text
        words=[]
        for w in my_tokens:
            if len(w)>1:
                ele = wordLemm.lemmatize(w)
                words.append(ele)

        return ' '.join(words)                                     # the review text after pre-processing
    
    def model_data(self):                                          # applies the pre-processing to the whole 'text' column
        self.data['text'] = self.data['text'].apply(self.textpreprocessing)   # use self, not the global obj
        return self.data
        
    # MODEL 1 NAIVE BAYES 
    
    def model_NB(self):                                           
        self.model_data()                                                         # filters the dataset text column
        
        self.count_vect =  CountVectorizer(max_features= 1000)                    # bag-of-words vectorizer limited to the 1000 most frequent words
        self.feature_vector = self.count_vect.fit(self.data.text)                 # learns the vocabulary from the text column
        self.data_features =  self.count_vect.transform(self.data.text)           # transforms the text column into word-count vectors for the model
        # split the dataset into train data and test data
        self.train_x_m1, self.test_x_m1, self.train_y_m1, self.test_y_m1 =  train_test_split(self.data_features, self.data.airline_sentiment,test_size = 0.3, random_state = 42)
        self.model_1 = MultinomialNB()                                            # constructs a Naive Bayes model; the multinomial variant suits word-count features
        self.model_1.fit(self.train_x_m1.toarray(), self.train_y_m1)              # fit the model
        self.predicted_model_1 = self.model_1.predict(self.test_x_m1.toarray())   # predict the test data for understanding the model metrics 
        
        return self.count_vect, self.model_1                                      # these will be used in the API for respective model prediction
  
    # MODEL 2 CONVOLUTIONAL NEURAL NETWORK 
    
    def model_CNN(self):          
        self.model_data()                                                        # filters the dataset text column
        
        self.text = self.data['text'].to_numpy()                                 # converts the text column to n-dimensional array
        self.sentiment = self.data['airline_sentiment'].to_numpy()               # converts the label column to n-dimensional array
        # split the dataset into train data and test data
        self.train_x_m2, self.test_x_m2, self.train_y_m2, self.test_y_m2  = train_test_split(self.text, self.sentiment, test_size=0.3,random_state = 42)
        
        self.vocab_size = 10000                                                  # model parameters
        self.sequence_length = 1000
        self.embedding_dim = 16
        
        self.tokenizer = Tokenizer(num_words=self.vocab_size, oov_token="<OOV>")                                                # tokenization
        self.tokenizer.fit_on_texts(self.train_x_m2)                                                                            # fit with train data
        self.train_sequences = self.tokenizer.texts_to_sequences(self.train_x_m2)                                               # train data - convert to sequence       
        self.train_padded = pad_sequences(self.train_sequences, maxlen=self.sequence_length, padding='post', truncating='post') # pad the train data sequence
        self.test_sequences = self.tokenizer.texts_to_sequences(self.test_x_m2)                                                 # test data - convert to sequence 
        self.test_padded = pad_sequences(self.test_sequences, maxlen=self.sequence_length, padding='post', truncating='post')   # pad the test data sequence
        
        self.model_2 = Sequential()                                                                         # Construct a convolutional model using keras sequential API
        self.model_2.add(Embedding(self.vocab_size, self.embedding_dim, input_length=self.sequence_length)) # An embedded layer for input text
        self.model_2.add(Conv1D(filters=128, kernel_size=5, activation='relu'))                             # convolutional layer
        self.model_2.add(MaxPooling1D(pool_size=2))                                                         # max-pool layer
        self.model_2.add(Flatten())                                                                         # flatten the output from max-pool layer
        self.model_2.add(Dense(1, activation='sigmoid'))                                                    # output layer: probability of the positive class
        self.model_2.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])            # compile the model
        self.callbacks = [EarlyStopping(patience=2)]                                                        # define a model callback, in this case only early stopping is used
        # fit the model; only a few epochs are used (for a quick API demonstration), in practice more epochs may be needed for a more accurate classifier
        self.history_model_2 = self.model_2.fit(self.train_padded, self.train_y_m2, epochs=3, validation_data=(self.test_padded, self.test_y_m2), callbacks=self.callbacks)

        return self.tokenizer, self.model_2                                                                 # these will be used in the API for respective model prediction
    
    # MODEL 3 LSTM RECURRENT NEURAL NETWORK  
    
    def model_LSTM(self):                                          # requires model_CNN() to have run first: reuses its parameters, padded data, and callbacks

        lstm_out = 32
        self.model_3 = Sequential()                                                                         # Construct a recurrent model using keras sequential API
        self.model_3.add(Embedding(self.vocab_size, self.embedding_dim, input_length=self.sequence_length)) # An embedding layer for input text        
        self.model_3.add(Bidirectional(LSTM(lstm_out)))                                                     # Recurrent layer
        self.model_3.add(Dense(10, activation='relu'))                                                      # Hidden dense layer
        self.model_3.add(Dense(1, activation='sigmoid'))                                                    # Output layer: probability of the positive class
        self.model_3.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])            # compile the model
        # fit the model; only a few epochs are used (for a quick API demonstration), in practice more epochs may be needed for a more accurate classifier
        self.history_model_3 = self.model_3.fit(self.train_padded, self.train_y_m2, epochs=3, validation_data=(self.test_padded, self.test_y_m2), callbacks=self.callbacks)

        return self.model_3                                                                                 # this will be used in the API for the LSTM model prediction

    # The method for populating the performance metrics of all the three models
    def __repr__(self):
        # All three model metrics
        self.model_NB()                                                                                    # calls the naive bayes model method
        self.model_CNN()                                                                                   # calls the CNN model method
        self.model_LSTM()                                                                                  # calls the LSTM model method

        model1_metrics = metrics.classification_report(self.test_y_m1, self.predicted_model_1)             # Pulls the naive bayes model performance metrics
        model2_metrics = pd.DataFrame(self.history_model_2.history)                                        # Pulls the CNN model performance metrics
        model3_metrics = pd.DataFrame(self.history_model_3.history)                                        # Pulls the LSTM performance metrics
        
        Output = ['THE METRICS FOR NAIVE BAYES CLASSIFIER: ', model1_metrics,'THE METRICS FOR CNN CLASSIFIER: ', str(model2_metrics), 'THE METRICS FOR LSTM CLASSIFIER: ',repr(model3_metrics)]

        return   '\n\n'.join(Output)                                                                       # returns all of the model metrics as string
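
In `model_CNN`, `Tokenizer` and `pad_sequences` turn each cleaned text into a fixed-length row of integers. The pure-Python sketch below mimics that idea on a toy corpus. `build_index`, `to_sequence`, and `pad_post` are hypothetical helpers written for illustration, not Keras functions, and the index assignment is a simplification of what Keras does.

```python
from collections import Counter

def build_index(texts, oov_token="<OOV>"):
    """Map each training word to an integer; 0 is reserved for padding, 1 for unknown words."""
    counts = Counter(word for text in texts for word in text.split())
    index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        index[word] = rank
    return index

def to_sequence(text, index, oov_token="<OOV>"):
    """Replace each word by its integer, falling back to the out-of-vocabulary index."""
    return [index.get(word, index[oov_token]) for word in text.split()]

def pad_post(sequence, maxlen):
    """Truncate at the end and pad with zeros at the end (padding='post', truncating='post')."""
    trimmed = sequence[:maxlen]
    return trimmed + [0] * (maxlen - len(trimmed))

train_texts = ["flight late", "flight cancelled again"]
index = build_index(train_texts)
padded = pad_post(to_sequence("flight delayed", index), maxlen=5)
print(padded)   # "delayed" was never seen in training, so it maps to the OOV index
```

The index is built from the training texts only, which is why unseen words at prediction time need the out-of-vocabulary entry. Because the texts here are short tweets, most of each 1000-element row produced with `sequence_length = 1000` is padding; a smaller length may be worth trying.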

#### 3. The Model Metrics 

By default, displaying an object of the sentiment analysis class calls its `__repr__` method, which trains the three models and displays their metrics.
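
The mechanism can be seen in isolation with a toy class. This is only a sketch: `Report` is a hypothetical stand-in for the sentiment analysis class, and the counter replaces the model training that happens inside its `__repr__`.

```python
class Report:
    """Hypothetical stand-in that counts how often __repr__ runs."""

    def __init__(self):
        self.repr_calls = 0

    def __repr__(self):
        # In the notebook's class, this is the point where the three models are trained.
        self.repr_calls += 1
        return f"metrics (computed {self.repr_calls} time(s))"

report = Report()        # creating the object does no work yet
first = repr(report)     # roughly what happens when `report` ends a notebook cell
second = repr(report)    # displaying the object again repeats the work
print(first)
print(second)
```

Since every display repeats the work, a separate method for computing the metrics might be preferable when training is expensive.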

In [11]:
obj = sentimentanalysis(data)
obj

Epoch 1/3
Epoch 2/3
Epoch 3/3
Epoch 1/3
Epoch 2/3
Epoch 3/3


THE METRICS FOR NAIVE BAYES CLASSIFIER: 

              precision    recall  f1-score   support

           0       0.94      0.94      0.94      2771
           1       0.74      0.75      0.75       692

    accuracy                           0.90      3463
   macro avg       0.84      0.84      0.84      3463
weighted avg       0.90      0.90      0.90      3463


THE METRICS FOR CNN CLASSIFIER: 

       loss  accuracy  val_loss  val_accuracy
0  0.397175  0.824462  0.272987      0.892001
1  0.187712  0.928943  0.227260      0.909616
2  0.102438  0.963481  0.246673      0.906151

THE METRICS FOR LSTM CLASSIFIER: 

       loss  accuracy  val_loss  val_accuracy
0  0.403052  0.837336  0.230674      0.907017
1  0.168591  0.934390  0.206804      0.920878
2  0.096254  0.966081  0.244344      0.912792
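
To see how `EarlyStopping(patience=2)` interacts with these numbers, the patience rule can be traced by hand. The function below is a simplified model of that rule, not the Keras implementation (it ignores options such as `min_delta`); `early_stopping_trace` is a hypothetical name, and the `val_loss` values are copied from the two tables above.

```python
def early_stopping_trace(val_losses, patience):
    """Return (epoch at which training would stop, or None; epoch with the best val_loss)."""
    best_loss = float("inf")
    best_epoch = None
    wait = 0                       # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch

cnn_val_loss = [0.272987, 0.227260, 0.246673]
lstm_val_loss = [0.230674, 0.206804, 0.244344]

cnn_result = early_stopping_trace(cnn_val_loss, patience=2)
lstm_result = early_stopping_trace(lstm_val_loss, patience=2)
print(cnn_result, lstm_result)
```

Under this model, neither run is stopped early: with 3 epochs and a patience of 2, the rule can only trigger at the final epoch. For both models the lowest validation loss occurs at epoch index 1, and the training loss keeps falling afterwards, which hints at the start of overfitting. With longer training, Keras's `restore_best_weights=True` option on `EarlyStopping` may be worth considering.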

#### 4. Sentiment Analysis API

<br> I. An endpoint API is built with FastAPI.
<br> II. It accepts the text to classify as a string.
<br> III. Each of the three models predicts the sentiment.
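
The endpoint cell that follows calls `obj.model_NB()`, `obj.model_CNN()`, and `obj.model_LSTM()` inside `predict`, so the models are retrained on every request (visible as repeated `Epoch` lines in the server log). One possible alternative is to train once and reuse the result. The sketch below shows that pattern with the standard library only; `train_models` and its keyword rule are hypothetical stand-ins for the real training methods.

```python
from functools import lru_cache

train_calls = 0   # counts how often the (stand-in) training routine runs

def train_models():
    """Hypothetical stand-in for the three model_* training methods."""
    global train_calls
    train_calls += 1

    def classify(text):
        # A trivial keyword rule replaces the real classifiers in this sketch.
        return "Negative" if "late" in text.lower() else "Positive"

    return classify

@lru_cache(maxsize=1)
def get_model():
    """Train on first use, then return the cached result on later calls."""
    return train_models()

def predict(validation_data):
    return get_model()(validation_data)

first = predict("Late flight again")
second = predict("Great crew")
print(first, second, train_calls)
```

In the notebook, the same effect could probably be achieved by calling the three training methods once before `app` is created and having `predict` use the stored vectorizer, tokenizer, and models.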

In [None]:
nest_asyncio.apply()                                               # lets uvicorn run inside the notebook's already-running event loop

obj = sentimentanalysis(data)                                      # object declaration

app = FastAPI(debug=True)                                          # creates the FastAPI application

@app.get("/Sentiment Analysis")
def predict(Validation_Data):                                      # API predict function for the text to be classified
    # note: the model_* calls below retrain each model on every request
    processed_data = obj.textpreprocessing(Validation_Data)        # text pre-processing
    
    x,y = obj.model_NB()                                           # Naive Bayes model prediction
    trans_data_1 = x.transform([processed_data])    
    model_prediction_1 = y.predict(trans_data_1.toarray())
    
    df = pd.DataFrame({"input_data":[processed_data]})             # Data preparation for next models 
    trans_data_2_3 = df["input_data"].to_numpy()
    
    m,n = obj.model_CNN()                                          # CNN model prediction
    trans_data_2_3_seq = m.texts_to_sequences(trans_data_2_3)
    trans_data_2_3_padded = pad_sequences(trans_data_2_3_seq, maxlen=1000, padding='post', truncating='post')
    model_predict_2 = n.predict(trans_data_2_3_padded)             # the output is a float value
    model_prediction_2 = 1 if model_predict_2[0][0] >= 0.70 else 0 # binary value as per probability   

    a = obj.model_LSTM()                                           # LSTM model prediction
    model_predict_3 = a.predict(trans_data_2_3_padded)             # the output is a float value
    model_prediction_3 = 1 if model_predict_3[0][0] >= 0.70 else 0 # binary value as per probability  
    
    output = [model_prediction_1,model_prediction_2,model_prediction_3] # A list of predicted values   
    output = ["Positive" if ele==1 else "Negative" for ele in output]   # translates to original sentiment category
            
    api_out = {'Naive Bayes Classifier': output[0],                     # one entry per model, returned as JSON
               'CNN Classifier': output[1],
               'LSTM Classifier': output[2]}
    
    return api_out
    

if __name__ == '__main__':
    uvicorn.run(app)


INFO:     Started server process [14104]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)


INFO:     127.0.0.1:57996 - "GET / HTTP/1.1" 404 Not Found
INFO:     127.0.0.1:57997 - "GET /docs HTTP/1.1" 200 OK
INFO:     127.0.0.1:57997 - "GET /openapi.json HTTP/1.1" 200 OK
Epoch 1/3
Epoch 2/3
Epoch 3/3
Epoch 1/3
Epoch 2/3
Epoch 3/3
INFO:     127.0.0.1:58003 - "GET /Sentiment%20Analysis?Validation_Data=%40united%20Late%20Flight%20to%20Denver%2C%20%40%21%22xyzr%20%22%20Late%20Flight%20to%20Newark...let%27s%20not%20even%20get%20into%20the%20disaster%20that%20was%20checking%20bags.%20Unacceptable.%40%21%28h HTTP/1.1" 200 OK
Epoch 1/3
Epoch 2/3
Epoch 3/3
Epoch 1/3
Epoch 2/3
Epoch 3/3
INFO:     127.0.0.1:58071 - "GET /Sentiment%20Analysis?Validation_Data=%40virginair%20the%20flight%20was%20not%20in%20good%20condition.%20effected%20departure.%20delayed%20business%20meeting.%20bye.%40%22xzuigqb%21%40%22 HTTP/1.1" 200 OK
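
The request lines in the log show the input text percent-encoded (`%40` for `@`, `%20` for a space, `%27` for a single quote). A client can build such a URL with the standard library, as in the sketch below. `BASE_URL` is taken from the uvicorn log, `build_request_url` is a hypothetical helper, and the sample text is a shortened version of the first logged request.

```python
from urllib.parse import quote

BASE_URL = "http://127.0.0.1:8000"   # address shown in the uvicorn log
ENDPOINT = "/Sentiment Analysis"     # route registered with @app.get

def build_request_url(text):
    """Percent-encode the route and the Validation_Data query parameter."""
    path = quote(ENDPOINT)                               # the space becomes %20, "/" is kept
    query = "Validation_Data=" + quote(text, safe="")    # encode every reserved character
    return f"{BASE_URL}{path}?{query}"

url = build_request_url("@united Late Flight to Denver, let's not")
print(url)
```

Encoding the text this way sidesteps the quoting concern for special characters in the URL itself. HTTP client libraries such as `requests` also handle the escaping when the text is passed through the `params` argument.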


#### 5. Deploying API
An end-point API can be deployed using deta.


# THE END