## Introduction

* **Natural Language Processing (NLP):** The discipline of computer science, artificial intelligence and linguistics that is concerned with the creation of computational models that process and understand natural language. Typical tasks include making the computer understand the semantic grouping of words (e.g. cat and dog are semantically more similar than cat and spoon), text to speech, language translation and many more.

* **Sentiment Analysis:** It is the interpretation and classification of emotions (positive or negative) within text data using text analysis techniques. 

In this notebook, we'll develop a **Sentiment Analysis model** to categorize a tweet as **Positive or Negative.**

In [None]:

import re
import pickle
import numpy as np
import pandas as pd

import seaborn as sns
from wordcloud import WordCloud
import matplotlib.pyplot as plt

from nltk.stem import WordNetLemmatizer
from nltk import download
download('stopwords')
download('wordnet')
from nltk.corpus import stopwords

from sklearn.svm import LinearSVC
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier


from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix, classification_report

## <a name="p2">Importing dataset</a>
The dataset used here contains 1,600,000 tweets extracted using the **Twitter API**. The tweets have been annotated **(0 = Negative, 4 = Positive)** and can be used to detect sentiment.
 
The training data isn't perfectly categorised, because the labels were created automatically from the emoticons present in each tweet. Any model built on this dataset may therefore have lower accuracy than expected.
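
To make the source of the label noise concrete, here is a minimal sketch of emoticon-based labelling. The emoticon sets and the `weak_label` helper are hypothetical and chosen only for illustration; they are not the rules used to build this dataset.

```python
# Hypothetical emoticon sets, chosen for illustration only.
POSITIVE_EMOTICONS = {":)", ":-)", ":D", "=)"}
NEGATIVE_EMOTICONS = {":(", ":-("}

def weak_label(tweet):
    """Return 4 (positive), 0 (negative), or None if emoticons are absent or conflicting."""
    has_pos = any(e in tweet for e in POSITIVE_EMOTICONS)
    has_neg = any(e in tweet for e in NEGATIVE_EMOTICONS)
    if has_pos == has_neg:
        return None
    return 4 if has_pos else 0

examples = [
    "Finally got my tickets :)",
    "Missed the bus again :(",
    "Great, my phone died :)",   # sarcastic, but still labelled positive
    "No emoticon here",
]
labels = [weak_label(t) for t in examples]
print(labels)
```

The third tweet shows how a sarcastic message ends up with the wrong label, which is the kind of noise a model trained on such data has to cope with.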

**It contains the following 6 fields:**
1. **sentiment**: the polarity of the tweet *(0 = negative, 4 = positive)*
2. **ids**: The id of the tweet 
3. **date**: the date of the tweet 
4. **flag**: The query (lyx). If there is no query, then this value is NO_QUERY.
5. **user**: the user that tweeted 
6. **text**: the text of the tweet 

We require only the **sentiment** and **text** fields, so we discard the rest.

Furthermore, we're changing the **sentiment** field so that it has new values to reflect the sentiment. **(0 = Negative, 1 = Positive)**
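
As a small sketch of that relabelling in plain Python (the notebook itself does it with pandas further down), the mapping can be written as a dictionary. The `to_binary` helper is hypothetical; it also fails loudly if a label other than 0 or 4 shows up.

```python
# Mapping from the original labels to the binary labels used in this notebook.
LABEL_MAP = {0: 0, 4: 1}

def to_binary(labels):
    """Map 0 -> 0 and 4 -> 1, raising an error for any other label."""
    unexpected = set(labels) - set(LABEL_MAP)
    if unexpected:
        raise ValueError(f"unexpected labels: {sorted(unexpected)}")
    return [LABEL_MAP[v] for v in labels]

binary = to_binary([0, 4, 4, 0])
print(binary)  # [0, 1, 1, 0]
```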

In [None]:

import os
for dirname, _, filenames in os.walk('/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Without `names`, pandas treats the first tweet as the header row; the next cell fixes this.
encoding_used = "ISO-8859-1"
tweets_df = pd.read_csv("input/sentiment140/training.1600000.processed.noemoticon.csv", encoding = encoding_used)
tweets_df.head()

In [None]:
data_columns  = ["target", "ids", "date", "flag", "user", "text"]
encoding_used = "ISO-8859-1"

tweets_df = pd.read_csv("input/sentiment140/training.1600000.processed.noemoticon.csv", encoding = encoding_used, names = data_columns )
tweets_df.head()

## <a name="p2-a">Exploratory Data Analysis</a>

In [None]:
target_group = tweets_df.groupby('target').count()['text']
target_group

In [None]:
# Plotting the distribution for dataset.
ax = target_group.plot(kind='bar', title='Distribution of data', legend=False)
ax.set_xticklabels(['Negative','Positive'], rotation = 0)

* **target: the polarity of the tweet. The dataset defines 0 = negative, 2 = neutral and 4 = positive.**
* **Only 0 and 4 occur in this file, so we only have negative and positive labels.**
* **All positive labels will be changed from 4 to 1.**

## Keeping only relevant values

In [None]:
tweets_df.columns

In [None]:
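# Copy the slice so that the later column assignments don't trigger pandas' SettingWithCopyWarning.
data = tweets_df[['target', 'text']].copy()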
data.head()

## Renaming Column

In [None]:
data.columns = ["sentiment", "text"]
data.head()

## Replacing value

In [None]:
data[data['sentiment'] != 0].head()

In [None]:
data['sentiment'] = data['sentiment'].replace(4,1)

## Finding the missing values

In [None]:
data.isnull().sum()

## Converting data to a list data structure

In [None]:
text, sentiment = list(data['text']), list(data['sentiment'])

## Checking Output of Data Structure

In [None]:
text[0:16]

In [None]:
sentiment[0:16]

## <a name="p3">Preprocess Text</a>
**Text Preprocessing** is traditionally an important step for **Natural Language Processing (NLP)** tasks.

**The Preprocessing steps taken are:**
1. **Lower Casing:** Each text is converted to lowercase, which keeps the vocabulary normalized.
2. **Replacing URLs:** Links starting with **"http" or "https" or "www"** are replaced by **"URL"**.
3. **Replacing Emojis:** Replace emojis by using a pre-defined dictionary containing emojis along with their meaning. *(eg: ":)" to "EMOJIsmile")*
4. **Replacing Usernames:** Replace @Usernames with word **"USER"**. *(eg: "@Kaggle" to "USER")*
5. **Removing Non-Alphanumerics:** Characters other than digits and letters are replaced with a space.
6. **Removing Consecutive letters:** 3 or more consecutive identical letters are replaced by 2 letters. *(eg: "Heyyyy" to "Heyy")*
7. **Removing Short Words:** Words with 2 or fewer characters are removed, along with any token that is not purely alphabetic.
8. **Removing Stopwords:** Stopwords are English words which do not add much meaning to a sentence. They can safely be ignored without sacrificing the meaning of the sentence. *(eg: "the", "he", "have")* In this notebook they are filtered out later by the TF-IDF vectoriser rather than inside the preprocessing function.
9. **Lemmatizing:** Lemmatization is the process of converting a word to its base form. *(e.g: "cats" to "cat")*
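
The first six steps can be sketched on a single tweet using only the standard library. This is a simplified illustration, not the notebook's `preprocess` function: the `clean_tweet` helper is hypothetical, it handles a single emoji, and it leaves out short-word removal, stopword removal and lemmatization.

```python
import re

def clean_tweet(tweet):
    """Apply steps 1 to 6 to one tweet (simplified, single emoji only)."""
    tweet = tweet.lower()                                        # 1. lower casing
    tweet = re.sub(r"(https?://\S+|www\.\S+)", " URL", tweet)    # 2. replace URLs
    tweet = tweet.replace(":)", " EMOJIsmile ")                  # 3. replace one emoji
    tweet = re.sub(r"@\S+", " USER", tweet)                      # 4. replace usernames
    tweet = re.sub(r"[^a-zA-Z0-9]", " ", tweet)                  # 5. drop other characters
    tweet = re.sub(r"(.)\1\1+", r"\1\1", tweet)                  # 6. squeeze repeated letters
    return " ".join(tweet.split())                               # normalize whitespace

cleaned = clean_tweet("@Kaggle Heyyyy check https://example.com :) sooo good!!!")
print(cleaned)  # USER heyy check URL EMOJIsmile soo good
```

The order matters: URLs and usernames have to be replaced before the non-alphanumeric characters are stripped, otherwise the `@` and `://` markers that identify them are already gone.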

## Defining Emojis and their meanings

In [None]:
# Defining dictionary containing all emojis with their meanings.
emojis = {':)': 'smile', ':-)': 'smile', ';d': 'wink', ':-E': 'vampire', ':(': 'sad', 
          ':-(': 'sad', ':-<': 'sad', ':P': 'raspberry', ':O': 'surprised',
          ':-@': 'shocked', ':@': 'shocked',':-$': 'confused', ':\\': 'annoyed', 
          ':#': 'mute', ':X': 'mute', ':^)': 'smile', ':-&': 'confused', '$_$': 'greedy',
          '@@': 'eyeroll', ':-!': 'confused', ':-D': 'smile', ':-0': 'yell', 'O.o': 'confused',
          '<(-_-)>': 'robot', 'd[-_-]b': 'dj', ":'-)": 'sadsmile', ';)': 'wink', 
          ';-)': 'wink', 'O:-)': 'angel','O*-)': 'angel','(:-D': 'gossip', '=^.^=': 'cat'}

## Defining Stop words in English

In [None]:
mystopwordlist = ['a', 'about', 'above', 'after', 'again', 'ain', 'all', 'am', 'an',
             'and','any','are', 'as', 'at', 'be', 'because', 'been', 'before',
             'being', 'below', 'between','both', 'by', 'can', 'd', 'did', 'do',
             'does', 'doing', 'down', 'during', 'each','few', 'for', 'from', 
             'further', 'had', 'has', 'have', 'having', 'he', 'her', 'here',
             'hers', 'herself', 'him', 'himself', 'his', 'how', 'i', 'if', 'in',
             'into','is', 'it', 'its', 'itself', 'just', 'll', 'm', 'ma',
             'me', 'more', 'most','my', 'myself', 'now', 'o', 'of', 'on', 'once',
             'only', 'or', 'other', 'our', 'ours','ourselves', 'out', 'own', 're',
             's', 'same', 'she', "shes", 'should', "shouldve",'so', 'some', 'such',
             't', 'than', 'that', "thatll", 'the', 'their', 'theirs', 'them',
             'themselves', 'then', 'there', 'these', 'they', 'this', 'those', 
             'through', 'to', 'too','under', 'until', 'up', 've', 'very', 'was',
             'we', 'were', 'what', 'when', 'where','which','while', 'who', 'whom',
             'why', 'will', 'with', 'won', 'y', 'you', "youd","youll", "youre",
             "youve", 'your', 'yours', 'yourself', 'yourselves']

In [None]:
english_stop_words =  stopwords.words('english')
english_stop_words[:10]

In [None]:
stopwordlist = stopwords.words('english') + mystopwordlist
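
Concatenating the two lists leaves duplicates in `stopwordlist`, which is harmless but easy to avoid. The sketch below uses small stand-in lists (assumptions, not the real NLTK list) to show a de-duplicated merge and how such a list filters tokens.

```python
# Stand-ins for stopwords.words('english') and mystopwordlist (illustrative only).
nltk_like = ["the", "is", "and", "a"]
custom = ["the", "youre", "shouldve"]

# A set union removes the duplicated entries; sorting keeps the result reproducible.
combined = sorted(set(nltk_like) | set(custom))
print(combined)  # ['a', 'and', 'is', 'shouldve', 'the', 'youre']

# Filtering tokens against the combined list.
stop_set = set(combined)
tokens = "the weather is lovely and youre late".split()
kept = [t for t in tokens if t not in stop_set]
print(kept)  # ['weather', 'lovely', 'late']
```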

## Preprocessing Function

In [None]:
def preprocess(textdata):
    processedText = []
    
    #creating a Lemmatizer
    wordLemma = WordNetLemmatizer() #define the imported library
    
    # Regular expression patterns for elements commonly found in tweets.
    # Raw strings are used so that backslashes such as \s reach the regex engine unchanged.
    
    urlPattern        = r"((http://)[^ ]*|(https://)[^ ]*|( www\.)[^ ]*)" # e.g check out https://dot.com for more
    userPattern       = r'@[^\s]+' # e.g @FagbamigbeK check this out
    alphaPattern      = r"[^a-zA-Z0-9]" # e.g I am *10 better!
    sequencePattern   = r"(.)\1\1+"  # e.g Heyyyyyyy, I am back!
    seqReplacePattern = r"\1\1" # e.g Replace Heyyyyyyy with Heyy
    
    
    for tweet in textdata:
        tweet = tweet.lower() #normalizing all text to a lower case
        
        
        # Replace all URLs with 'URL'
        tweet = re.sub(urlPattern,' URL',tweet) # re.sub substitutes every match of the pattern
        
        
        # Replace all emojis with their meaning. The tweet is already lowercase, so the keys
        # are lowercased too; otherwise keys such as ':P' or ':-D' would never match.
        for emoji, meaning in emojis.items():
            tweet = tweet.replace(emoji.lower(), "EMOJI" + meaning)
            
            
        # Replace @USERNAME to 'USER'.
        tweet = re.sub(userPattern,' USER', tweet)  #To hide Personal Information, we can replace all usernames with User
        
        
        # Replace all non alphabets.
        tweet = re.sub(alphaPattern, " ", tweet) # e.g I am *10 better!
        
        
        # Replace 3 or more consecutive letters by 2 letter.
        tweet = re.sub(sequencePattern, seqReplacePattern, tweet) # e.g Replace Heyyyyyyy with Heyy
        
        
        tweetwords = ''
        for word in tweet.split():
            if len(word) > 2 and word.isalpha():
                word = wordLemma.lemmatize(word)
                tweetwords += (word + ' ')
        
        processedText.append(tweetwords)
        
    return processedText

## Noting the time text preprocessing took

In [None]:
import time
t = time.time()
preprocessedtext = preprocess(text) #the preprocess function at work
print(f'Text Processing Done.')
print(f'Time taken for text processing: {round(time.time()-t)} seconds')

## <a name="p4">Analysing the data</a>
Now we're going to analyse the preprocessed data to get an understanding of it. We'll plot **Word Clouds** for **Positive and Negative** tweets from our dataset and see which words occur the most.
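
A word cloud is essentially a picture of word counts, so the same information can be checked numerically. The sketch below counts words with `collections.Counter` on three made-up tweets; the toy data is an assumption for illustration, not part of the dataset.

```python
from collections import Counter

# Toy stand-in for a list of preprocessed tweets.
tweets = ["good morning world", "good day", "bad day"]

# Count every word across all tweets.
counts = Counter(word for tweet in tweets for word in tweet.split())

# The most frequent words are the ones a word cloud would draw largest.
top = counts.most_common(2)
print(top)  # [('good', 2), ('day', 2)]
```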

## Before Processing

In [None]:
text[0:11]

## After Processing

In [None]:
preprocessedtext[0:11]

In [None]:
negative_sentiments = preprocessedtext[:800000]
negative_sentiments[0:10]

In [None]:
data_neg = []
for words in negative_sentiments:
    words = words.lower().replace("user","")
    words = words.lower().replace("url","")
    data_neg.append(words)
    
data_neg[0:10]

In [None]:
word_cloud = WordCloud(max_words = 1000 , width = 1600 , height = 800, collocations=False)

In [None]:
plt.figure(figsize = (20,20))
negative_wc = word_cloud.generate(" ".join(data_neg))
plt.imshow(negative_wc)

### Word-Cloud for Positive tweets.

In [None]:
positive_sentiments = preprocessedtext[800000:]
positive_sentiments[0:10]

In [None]:
data_pos = []
for words in positive_sentiments:
    words = words.lower().replace("user","")
    words = words.lower().replace("url","")
    data_pos.append(words)
    
data_pos[0:10]

In [None]:
plt.figure(figsize = (20,20))
positive_wc = word_cloud.generate(" ".join(data_pos))
plt.imshow(positive_wc)

### <a name="p5">Splitting the Data</a>
The preprocessed data is divided into two sets: **training data** (95%), on which the models are trained, and **test data** (5%), on which they are evaluated.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(preprocessedtext, sentiment,
                                                    test_size = 0.05, random_state = 0)
print(f'Data Split done.')
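
To see what a 5% hold-out means in numbers, here is a minimal standard-library sketch of a shuffled split. The `split_indices` helper is hypothetical and is not how `train_test_split` is implemented internally; it only illustrates the idea of shuffling with a fixed seed and cutting off a test portion.

```python
import random

def split_indices(n_samples, test_size, seed):
    """Shuffle the row indices reproducibly and cut off a test portion."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_size))
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(1_600_000, 0.05, seed=0)
print(len(train_idx), len(test_idx))  # 1520000 80000
```

With 1,600,000 tweets, a 5% test set still contains 80,000 examples, which is why such a small fraction is enough here.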

## <a name="p6">TF-IDF Vectoriser</a>
##### Term Frequency Inverse Document Frequency
**TF-IDF indicates what the importance of the word is in order to understand the document or dataset.**

**TF-IDF Vectoriser** converts a collection of raw documents to a **matrix of TF-IDF features**. The **Vectoriser** is usually trained on only the **X_train** dataset. 

**ngram_range**  is the range of number of words in a sequence. *[e.g "very expensive" is a 2-gram that is considered as an extra feature separately from "very" and "expensive" when you have a n-gram range of (1,2)]*
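
The effect of an n-gram range of (1,2) can be illustrated with a few lines of plain Python. The `word_ngrams` helper is a hypothetical sketch of the idea, not the vectoriser's own code.

```python
def word_ngrams(tokens, min_n, max_n):
    """Return all word n-grams with min_n <= n <= max_n, shortest first."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

grams = word_ngrams("very expensive phone".split(), 1, 2)
print(grams)
# ['very', 'expensive', 'phone', 'very expensive', 'expensive phone']
```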

**max_features** specifies the maximum number of features to keep. When the vocabulary is larger than that, the terms with the highest frequency across the corpus are kept.

- TF-IDF gives a weight to each word that indicates how important the word is.
- The importance increases proportionally to the number of times a word appears in the sentence, but is penalized by the frequency of the word across all the sentences.
- The weight is the product of the term frequency (how often a word occurs in a sentence) and the inverse document frequency (how rare the word is across all sentences).
- weight = term frequency * inverse document frequency
- ngram_range=(1,2) means the vectoriser considers single words and pairs of consecutive words.
- ngram_range=(2,2) means pairs of words only.
- strip_accents='unicode' normalizes accented characters, which protects against unwanted encoding variants.
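
The weight formula can be worked through by hand on a toy corpus. The sketch below uses the smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization, which are scikit-learn's default settings for `TfidfVectorizer`. The three tiny documents and the `tfidf` helper are assumptions for illustration.

```python
import math

# Toy corpus: each document is a list of tokens.
docs = [["good", "movie"], ["bad", "movie"], ["good", "good", "day"]]
n_docs = len(docs)
vocab = sorted({word for doc in docs for word in doc})

# Document frequency: in how many documents does each word occur?
df = {word: sum(word in doc for doc in docs) for word in vocab}

# Smoothed inverse document frequency.
idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocab}

def tfidf(doc):
    """Return L2-normalized TF-IDF weights for one document."""
    raw = {word: doc.count(word) * idf[word] for word in set(doc)}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {word: value / norm for word, value in raw.items()}

weights = tfidf(docs[1])
# "bad" appears in one document and "movie" in two, so "bad" gets the larger weight.
print(weights["bad"] > weights["movie"])  # True
```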

In [None]:
vectoriser = TfidfVectorizer(ngram_range=(1,2),stop_words = stopwordlist, strip_accents = 'unicode', max_features = 500000)
vectoriser.fit(X_train) #fit the training data
print(f'Vectoriser fitted.')
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
print('No. of feature_words: ', len(vectoriser.get_feature_names_out()))

### <a name="p7">Transforming the dataset</a>
Transforming the **X_train** and **X_test** datasets into a matrix of **TF-IDF Features** by using the **TF-IDF Vectoriser**. These datasets will be used to train the model and test against it.

In [None]:
#transform the training and test data
X_train = vectoriser.transform(X_train)
X_test  = vectoriser.transform(X_test)
print(f'Data Transformed.')

## <a name="p8">Creating and Evaluating Models</a>

We're creating the following types of model for our sentiment analysis problem: 
* **Bernoulli Naive Bayes (BernoulliNB)**
* **Linear Support Vector Classification (LinearSVC)**
* **Logistic Regression (LR)**
* **Gradient Boosting (GradientBoostingClassifier)**
* **Multinomial Naive Bayes (MultinomialNB)**
* **Categorical Boosting (CatBoost)** and **Light Gradient Boosting Machine (LightGBM)**, whose cells are commented out below

We're choosing **Accuracy** as our evaluation metric. Furthermore, we're plotting the **Confusion Matrix** to get an understanding of how our model is performing on both classification types.
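
Both quantities are simple counts, which the following sketch computes by hand on eight made-up predictions. The `confusion_counts` helper is hypothetical; its row and column layout (rows are actual classes, columns are predicted classes) matches the one `confusion_matrix` from scikit-learn uses.

```python
def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels 0 and 1."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix

# Made-up labels for illustration.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]

matrix = confusion_counts(y_true, y_pred)
accuracy = (matrix[0][0] + matrix[1][1]) / len(y_true)
print(matrix)    # [[3, 1], [1, 3]]
print(accuracy)  # 0.75
```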

### Evaluate Model Function

In [None]:
def model_Evaluate(model):
    
    # Predict values for Test dataset
    y_pred = model.predict(X_test) #Xtest is not used in model training

    # Print the evaluation metrics for the dataset.
    print(classification_report(y_test, y_pred))
    
    # Compute and plot the Confusion matrix
    cf_matrix = confusion_matrix(y_test, y_pred)

    categories  = ['Negative','Positive']
    group_names = ['True Neg','False Pos', 'False Neg','True Pos'] # cell layout of a binary confusion matrix
    group_percentages = ['{0:.2%}'.format(value) for value in cf_matrix.flatten() / np.sum(cf_matrix)] #converting confusion matrix value to percentage in 2 decimal places.

    labels = [f'{v1}\n{v2}' for v1, v2 in zip(group_names,group_percentages)]
    labels = np.asarray(labels).reshape(2,2)

    sns.heatmap(cf_matrix, annot = labels, cmap = 'Blues',fmt = '',
                xticklabels = categories, yticklabels = categories)

    plt.xlabel("Predicted values", fontdict = {'size':14}, labelpad = 10)
    plt.ylabel("Actual values"   , fontdict = {'size':14}, labelpad = 10)
    plt.title ("Confusion Matrix", fontdict = {'size':18}, pad = 20)
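
The figures printed by `classification_report` for the positive class can be derived directly from the four cells of the confusion matrix. The sketch below does this for made-up counts; the `binary_report` helper is hypothetical and assumes every denominator is non-zero.

```python
def binary_report(matrix):
    """Derive accuracy, precision, recall and F1 from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": (tn + tp) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Made-up counts for illustration.
report = binary_report([[50, 10], [5, 35]])
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```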

### <a name="p8-1">BernoulliNB Model</a>

* [BernoulliNB Model](#p8-1)
* [LinearSVC Model](#p8-2)
* [Logistic Regression Model](#p8-3)
* [Gradient Boosting Model](#p8-6)
* [Naive Bayes Model](#p8-7)

In [None]:
BNBmodel = BernoulliNB(alpha = 2)
BNBmodel.fit(X_train, y_train)
model_Evaluate(BNBmodel)

### <a name="p8-2">LinearSVC Model</a>

In [None]:
SVCmodel = LinearSVC()
SVCmodel.fit(X_train, y_train)
model_Evaluate(SVCmodel)

### <a name="p8-3">Logistic Regression Model</a>

In [None]:
LRmodel = LogisticRegression(C = 2, max_iter = 1000, n_jobs=-1)
LRmodel.fit(X_train, y_train)
model_Evaluate(LRmodel)

### <a name="p8-4">Cat Boost Model</a>

In [None]:
# #categorical_features_indices = np.where(X.dtypes != np.float)[0]
# CatBoostmodel = CatBoostClassifier(cat_features = preprocessedtext, eval_metric = (X_test, y_test))
# CatBoostmodel.fit(X_train, y_train)
# model_Evaluate(CatBoostmodel)

### <a name="p8-5">LightGBM Model</a>

In [None]:
# LGBMmodel = LGBMClassifier()
# LGBMmodel.fit(X_train, y_train)
# model_Evaluate(LGBMmodel)

### <a name="p8-6">Gradient Boosting Model</a>

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
GRDBCmodel = GradientBoostingClassifier()
GRDBCmodel.fit(X_train, y_train)
model_Evaluate(GRDBCmodel)

### <a name="p8-7">Multinomial Naive Bayes Model</a>

In [None]:
from sklearn.naive_bayes import MultinomialNB
NBClassifier_model = MultinomialNB()
NBClassifier_model.fit(X_train, y_train)
model_Evaluate(NBClassifier_model)

We can clearly see that the **Logistic Regression Model** performs the best out of all the different models that we tried. It achieves nearly **80% accuracy** while classifying the sentiment of a tweet.

Note, however, that the **BernoulliNB Model** is the fastest to train and predict on. It also achieves **78% accuracy** while classifying.

## <a name="p9">Saving the Models</a>
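We're using **pickle** to save the **Vectoriser** and the **BernoulliNB**, **Logistic Regression** and **Linear Support Vector Classification** models for later use. The code that would save the CatBoost and LightGBM models is commented out, because those models are not trained above.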
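
As a minimal sketch of the save-and-load round trip, the example below pickles a small dictionary that stands in for a fitted model (an assumption for illustration) and reads it back from a temporary directory. The `with` statement closes each file even if an error occurs.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted vectoriser or model (illustrative only).
artifact = {"name": "stand-in model", "coef": [0.1, -0.2]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "artifact.pickle")

    # Save the object in binary mode.
    with open(path, "wb") as fh:
        pickle.dump(artifact, fh)

    # Load it back.
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

print(restored == artifact)  # True
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.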

In [None]:
# Vectoriser
with open('vectoriser-ngram-(1,2).pickle','wb') as file:
    pickle.dump(vectoriser, file)


# Bernoulli Naive Bayes
with open('Sentiment-BNB.pickle','wb') as file:
    pickle.dump(BNBmodel, file)


# Logistic Regression
with open('Sentiment-LR.pickle','wb') as file:
    pickle.dump(LRmodel, file)


# LinearSVC
with open('Sentiment-SVCmodel.pickle','wb') as file:
    pickle.dump(SVCmodel, file)


# CatBoost (not trained above)
# with open('Sentiment-CatBoost.pickle','wb') as file:
#     pickle.dump(CatBoostmodel, file)


# LightGBM (not trained above)
# with open('Sentiment-LightGBM.pickle','wb') as file:
#     pickle.dump(LGBMmodel, file)

## <a name="p10">Using the Model.</a>

To use the model for **Sentiment Prediction** we need to import the **Vectoriser** and **LR Model** using **Pickle**.

The vectoriser transforms text into a matrix of TF-IDF features, and the model predicts the sentiment of the transformed data.
The text whose sentiment is to be predicted must first be preprocessed in the same way as the training data.

In [None]:
def load_models():
    '''
    Replace '..path/' by the path of the saved models.
    '''
    
    # Load the vectoriser.
    with open('..path/vectoriser-ngram-(1,2).pickle', 'rb') as file:
        vectoriser = pickle.load(file)
    
    
    # Load the LR Model (same file name as used when saving it).
    with open('..path/Sentiment-LR.pickle', 'rb') as file:
        LRmodel = pickle.load(file)
    
    return vectoriser, LRmodel

def predict(vectoriser, model, text):
    # Preprocess the raw tweets, transform them with the vectoriser and predict the sentiment.
    textdata = vectoriser.transform(preprocess(text))
    sentiment = model.predict(textdata)
    
    # Pair each original text with its prediction (without shadowing the `text` argument).
    data = list(zip(text, sentiment))
        
    # Convert the list into a Pandas DataFrame.
    df = pd.DataFrame(data, columns = ['text','sentiment'])
    # Map the classes 0 and 1 to Negative and Positive respectively.
    df['sentiment'] = df['sentiment'].map({0: "Negative", 1: "Positive"})
    return df

## <a name="p11">Model Testing.</a>

In [None]:

if __name__=="__main__":
    # Loading the models.
    #vectoriser, LRmodel = load_models()
    
    # Text to classify should be in a list.
    text = ["I Love Google!",
            "May the Good Lord be with you.", "I hate peanuts!",
            "Mr. Kehinde, what are you doing next? this is great!"]
    
    df = predict(vectoriser, LRmodel, text)
    print(df.head())