<h1> Sentiment Analysis (general purpose) - Vishaal Yalamanchali</h1>
<h4> Importing packages </h4>
<p> The purpose of this Jupyter notebook is to build a production-level sentiment analysis machine learning API that is served through the cloud. </p>

In [12]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle
import settings
import gensim
import gensim.models.keyedvectors as word2vec

from tokenization import tokenize
from evaluation import evaluate

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.pipeline import Pipeline
from sklearn import svm, datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from reduce_header_df import reduce_mem_usage
from sklearn.preprocessing import LabelEncoder
from keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding,Dense,LSTM,Bidirectional
from tensorflow.keras.layers import BatchNormalization,Dropout
from tensorflow.keras import Sequential
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

<h1> Helper Functions </h1>
<p> Create helper functions and variables for future use </p>

In [13]:
cols = ["sentiment", "text"]
encoding = 'latin'
decode_map = {0: "negative", 2: "neutral", 4: "positive"}
n = 100
num_lines = 1600000
# skip_idx = [x for x in range(1, num_lines) if x % n != 0]
TRAIN_SIZE = 0.8   
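
The commented-out `skip_idx` line suggests that every `n`-th row of the large CSV was meant to be sampled, presumably by passing the list to `skiprows` in `pd.read_csv`. That intent is an assumption; the notebook itself reads the first rows with `nrows` instead. A minimal sketch on a tiny in-memory file (the column name `value` and the small `n` and `num_lines` are illustrative only):

```python
import io
import pandas as pd

# Small stand-ins for the notebook's n = 100 and num_lines = 1600000.
n = 3
num_lines = 10

# A ten-line CSV whose single column holds the line number 0..9.
csv_text = "\n".join(str(i) for i in range(num_lines))

# Same expression as the commented-out line: skip every row whose
# index is not a multiple of n (row 0 is always kept).
skip_idx = [x for x in range(1, num_lines) if x % n != 0]

sample = pd.read_csv(io.StringIO(csv_text), header=None,
                     names=["value"], skiprows=skip_idx)
print(sample["value"].tolist())
```

Unlike `nrows`, this keeps rows spread across the whole file, which matters when a file is sorted by label.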

In [14]:
### define helper functions for later use

In [15]:
def decode_sentiment(label):
    return decode_map[int(label)]

def decode_sentimentC(label):
    return label.lower()

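def predict(vectoriser, model, text):
    """Predict the sentiment of the strings in `text` and return a DataFrame."""
    # NOTE: str(text) stringifies the whole list, so this is reliable only when
    # `text` holds a single string (which is how the loop below calls it).
    listD = tokenize(str(text).lower())
    textdata = vectoriser.transform(listD)
    sentiment = model.predict(textdata)

    # Pair each input string with its predicted label
    # (avoids shadowing the `text` argument inside a loop).
    data = list(zip(text, sentiment))

    # Convert the pairs into a pandas DataFrame and decode the integer labels.
    # LabelEncoder sorts classes alphabetically: 0=negative, 1=neutral, 2=positive.
    df = pd.DataFrame(data, columns=['text', 'sentiment'])
    df['sentiment'] = df['sentiment'].replace([0, 1, 2], ["Negative", "Neutral", "Positive"])
    print(df.sentiment)
    return df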

<h1> Data Preprocessing </h1>
<p> To combine multiple datasets, the sentiment values of each dataset must be converted to a common set of labels. The following code does that by reading only two columns from each dataset, the sentiment value and the tweet text, and mapping the numeric labels to their string equivalents. </p>

In [None]:
### --- DATA PREPROCESSING --- ###
df = pd.read_csv('/Users/vishaalyalamanchali/Desktop/twitter-sentiment-analysis/data/training.1600000.processed.noemoticon.csv', encoding=encoding, names=cols, 
nrows = 16000, usecols=[0,5])
nf = pd.read_csv('/Users/vishaalyalamanchali/Desktop/twitter-sentiment-analysis/data/Tweets.csv', encoding=encoding, names=cols, usecols=[1,10])

# map the numeric labels of the first dataset to the shared string labels
df.sentiment = df.sentiment.apply(decode_sentiment)
# combine both datasets into a single dataframe
frames = [df, nf]
df = pd.concat(frames)
# tokenize all tweets from the dataset
df.text = df.text.apply(tokenize)
print(type(df.text))
# check whether any column's datatype can be changed to reduce memory usage
df, NAlist = reduce_mem_usage(df)
# shuffle the dataframe
df = df.sample(frac=1).reset_index(drop=True)
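
The label harmonisation described above can be illustrated on toy data. This is a minimal sketch, not the notebook's pipeline: the two small frames are hypothetical stand-ins for the real CSV files, and lower-casing the second frame's labels mirrors what `decode_sentimentC` would do if it were applied.

```python
import pandas as pd

decode_map = {0: "negative", 2: "neutral", 4: "positive"}

# Hypothetical rows standing in for the two datasets (not real data).
numeric = pd.DataFrame({"sentiment": [0, 4],
                        "text": ["bad day", "great day"]})
named = pd.DataFrame({"sentiment": ["Neutral", "negative"],
                      "text": ["it was ok", "late again"]})

# Bring both label columns to the same lower-case string vocabulary.
numeric["sentiment"] = numeric["sentiment"].map(decode_map)
named["sentiment"] = named["sentiment"].str.lower()

# ignore_index=True gives the combined frame a fresh 0..n-1 index.
combined = pd.concat([numeric, named], ignore_index=True)
print(combined)
```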

<h6> split the data into training and test data </h6>

In [18]:
train_data, test_data = train_test_split(df, test_size=1-TRAIN_SIZE,
                                         random_state=42) # Splits Dataset into Training and Testing set
print("Training data size:", len(train_data))
print("Test data size:", len(test_data)) 
# documents = [_text.split() for _text in df_train.text] 
train_data.head(10)

Training data size: 24512
Test data size: 6128
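
The two sizes printed above add up to 30,640 rows, split 80/20. As a quick check, the same call on a placeholder list of that length reproduces the counts; the list of integers only stands in for the real dataframe.

```python
from sklearn.model_selection import train_test_split

TRAIN_SIZE = 0.8

# Placeholder data: only the number of rows matters for this check.
rows = list(range(30640))

train_rows, test_rows = train_test_split(rows, test_size=1 - TRAIN_SIZE,
                                         random_state=42)
print(len(train_rows), len(test_rows))
```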


Unnamed: 0,sentiment,text
23872,negative,"('oh no morning suppose better get ready work',)"
9061,negative,('The Company I work shuts Thursday Joblessvil...
26137,negative,('USER please help regarding PNR A3ZZ0F Why I ...
16599,negative,('USER Audi A6 fender gone insurance claim 300...
21992,positive,"('USER If follow I able DM Thanks',)"
18672,negative,('USER think bust 2day late overslept But I st...
9320,negative,('USER Thank U It opened time I would love I F...
25837,negative,"('USER know TFA seemed totally disappear',)"
22013,negative,('Trying new Eucerin lotion hand It pretty awe...
20489,negative,"('USER I wan na see shoot able see',)"


<h6> detect labels using unique row values from training data </h6>

In [19]:
labels = train_data.sentiment.unique().tolist()
print(labels)

['negative', 'positive', 'neutral']


In [20]:
encoder = LabelEncoder()
encoder.fit(train_data.sentiment.tolist())

LabelEncoder()
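
`LabelEncoder` assigns integer codes in sorted order of the class names, not in order of first appearance, which is why the `predict` helper maps 0, 1, 2 to Negative, Neutral, Positive. A small sketch with the three labels from this notebook:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
# The order in which labels are seen during fit does not affect the codes.
encoder.fit(["negative", "positive", "neutral", "negative"])

classes = encoder.classes_.tolist()
encoded = encoder.transform(["positive", "negative", "neutral"]).tolist()
decoded = encoder.inverse_transform([0, 1, 2]).tolist()

print(classes)   # class names in sorted order
print(encoded)   # integer code for each input label
print(decoded)   # labels recovered from the integer codes
```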

In [21]:
X_train = train_data.text
X_test = test_data.text

y_train = encoder.transform(train_data.sentiment.tolist())
y_test = encoder.transform(test_data.sentiment.tolist())
# y_train = y_train.reshape(-1,1)
y_test = y_test.reshape(-1,1)

<h6> Load the fitted TF-IDF vectoriser used to create n-gram features </h6>

In [22]:
# load the previously fitted vectoriser; the with-block closes the file automatically
with open('vectoriser.pkl', 'rb') as file:
    vectoriser = pickle.load(file)

In [23]:
X_train = vectoriser.transform(X_train)
X_test  = vectoriser.transform(X_test)
print(X_train.shape)

(24512, 500000)
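
The settings of the pickled vectoriser are not shown in this notebook. The 500,000-column output above suggests a feature cap, but `ngram_range=(1, 2)` and `max_features=500000` in the sketch below are assumptions, and the three sentences are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus; the real vectoriser was fitted on the tweet data.
corpus = [
    "i love this flight",
    "i hate this delay",
    "the flight was fine",
]

# Assumed settings: unigrams and bigrams, capped at 500,000 features.
demo_vectoriser = TfidfVectorizer(ngram_range=(1, 2), max_features=500000)
X_demo = demo_vectoriser.fit_transform(corpus)

# One row per document, one column per n-gram kept in the vocabulary.
print(X_demo.shape)
print(sorted(demo_vectoriser.vocabulary_)[:5])
```

On a corpus this small the vocabulary stays far below the cap, so the number of columns equals the number of distinct n-grams.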


In [24]:
print(y_train.shape)

(24512,)


<h6> create model and evaluate on test data </h6>

In [25]:
LRmodel = LogisticRegression(C = 2, max_iter = 1000, n_jobs=-1)
LRmodel.fit(X_train, y_train)
evaluate(LRmodel,X_test,y_test)

              precision    recall  f1-score   support

           0       0.89      0.98      0.93      5044
           1       0.68      0.33      0.44       619
           2       0.82      0.50      0.62       465

    accuracy                           0.88      6128
   macro avg       0.80      0.60      0.67      6128
weighted avg       0.86      0.88      0.86      6128
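
Assuming `evaluate` prints a scikit-learn style classification report, the two average rows can be reproduced from the per-class rows. The sketch below does this for recall, using the rounded values printed above: the macro average weights every class equally, while the weighted average weights each class by its support and therefore matches the accuracy.

```python
# Per-class recall and support copied from the report above.
recall = {0: 0.98, 1: 0.33, 2: 0.50}
support = {0: 5044, 1: 619, 2: 465}

total = sum(support.values())

# Macro average: plain mean over classes.
macro_recall = sum(recall.values()) / len(recall)

# Weighted average: mean over classes weighted by support.
weighted_recall = sum(recall[c] * support[c] for c in recall) / total

print(f"{macro_recall:.2f} {weighted_recall:.2f}")  # 0.60 0.88
```

The gap between 0.60 and 0.88 reflects the class imbalance: class 0 accounts for 5044 of the 6128 test rows, so its high recall dominates the weighted figure.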



In [28]:
text = []
inputT = ""
# Enter "0" to quit; note that the "0" itself is still evaluated once before the loop exits.
while inputT != "0":
    inputT = input("enter text that you want to evaluate: ")
    text.append(inputT.lower())
    dfN = predict(vectoriser, LRmodel, text)
#     dfT = predict(vectoriser, RFmodel, text)
    print(dfN.head())
    text.clear()

enter text that you want to evaluate: i want you to know how much i hate you
0    Negative
Name: sentiment, dtype: object
                                     text sentiment
0  i want you to know how much i hate you  Negative
enter text that you want to evaluate: i love you
0    Positive
Name: sentiment, dtype: object
         text sentiment
0  i love you  Positive
enter text that you want to evaluate: 0
0    Negative
Name: sentiment, dtype: object
  text sentiment
0    0  Negative
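
In the session above, the quit value "0" is itself scored as Negative before the loop ends. One way to avoid that is to check for the sentinel before predicting. The sketch below isolates that logic in a hypothetical helper, `texts_until_sentinel`, which takes any iterable of lines so it can be tried without a keyboard; in the notebook the lines would come from `input(...)`.

```python
def texts_until_sentinel(lines, sentinel="0"):
    """Yield lower-cased lines until the sentinel is seen (sentinel excluded)."""
    for line in lines:
        if line.strip() == sentinel:
            break  # stop before the sentinel reaches the model
        yield line.lower()


# Simulated user input; anything after the sentinel is never read.
typed = ["I love you", "i want you to know how much i hate you", "0", "ignored"]
to_score = list(texts_until_sentinel(typed))
print(to_score)
```

Each yielded string could then be passed to `predict(vectoriser, LRmodel, [s])` as in the loop above.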


<h6> Save and serialize the model and tfidf vectoriser </h6>

In [29]:
# file = open('vectoriser.pkl','wb')
# pickle.dump(vectoriser, file)
# file.close()

# file = open('LR.pkl','wb')
# pickle.dump(LRmodel, file)
# file.close()
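
A minimal sketch of the save and reload round trip, using `with` blocks so the files are closed automatically. The toy texts, labels and temporary directory are illustrative assumptions; the notebook writes `vectoriser.pkl` and `LR.pkl` to the working directory instead.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data standing in for the tweet corpus.
texts = ["i love this", "i hate this", "love it", "hate it"]
labels = [1, 0, 1, 0]

demo_vec = TfidfVectorizer()
demo_model = LogisticRegression().fit(demo_vec.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp:
    vec_path = os.path.join(tmp, "vectoriser.pkl")
    model_path = os.path.join(tmp, "LR.pkl")

    # Serialise both objects; they must be saved and loaded as a pair.
    with open(vec_path, "wb") as fh:
        pickle.dump(demo_vec, fh)
    with open(model_path, "wb") as fh:
        pickle.dump(demo_model, fh)

    # Reload them as a later session would.
    with open(vec_path, "rb") as fh:
        loaded_vec = pickle.load(fh)
    with open(model_path, "rb") as fh:
        loaded_model = pickle.load(fh)

before = demo_model.predict(demo_vec.transform(["love this"])).tolist()
after = loaded_model.predict(loaded_vec.transform(["love this"])).tolist()
print(before == after)
```

Pickle files should only be loaded from trusted sources, and they are generally tied to the library version that wrote them.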

In [30]:
# reload the saved model and vectoriser; the with-blocks close the files automatically
with open('LR.pkl', 'rb') as file:
    LRmodel = pickle.load(file)

with open('vectoriser.pkl', 'rb') as file:
    vectoriser = pickle.load(file)

## The vectoriser has been deprecated; please use this code to run the while loop.