# **US Airlines Tweets Sentiment Classification**

# **1. Problem Statement**

<div style="text-align: justify">Sentiment analysis is a text analysis technique that detects polarity (e.g. a positive or negative opinion) within text, whether a whole document, paragraph, sentence, or clause. Sentiment analysis is also known as opinion mining. Understanding people’s emotions is essential for businesses, since customers express their thoughts and feelings more openly than ever before. Twitter is one of the most commonly used social media platforms for expressing opinions and emotions. </div> <br>

<div style="text-align: justify">Performing analysis on customer feedback, such as opinions in survey responses and social media conversations, allows brands to listen attentively to their customers and tailor products and services to meet their needs. However, the opinionated data from Twitter is text, which is unstructured. Sentiment analysis helps businesses make sense of this unstructured text by automatically understanding, processing, and tagging it.</div> <br>

<div style="text-align: justify">The objective of this project is to perform sentiment analysis on tweets about six US airlines. The scraped tweets contain positive, negative, or neutral sentiments about the airlines from their respective customers. The task is to analyze how travelers in February 2015 expressed their feelings on Twitter about six major US airlines. Some of the algorithms used for sentiment analysis are Naive Bayes, SVM, Logistic Regression and LSTM. In this project, a Naïve Bayes classifier is used to build the sentiment analysis model for the US airline tweets. The classifier is implemented from scratch in Python, without using any libraries that provide built-in classifiers. </div> <br>

The tasks involved in this project are as follows:

* Build a dictionary based on your training corpus. Calculate conditional probability of each token for each class (this is also called unigram probability). Then evaluate on test data and report accuracy.
* Try to improve your algorithm. Some suggestions: <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;i. Remove STOP words from the vocabulary that appear very frequently but are not related to the attitude or opinion of the writer. <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;ii. Reduce the size of your vocabulary further by taking only top-k frequent word types that appear in the training dataset. Vary k and compare performance.
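
For reference, the unigram probability mentioned in the first task can be written as the usual maximum-likelihood estimate. The notation is introduced here for illustration and is not taken from the original assignment: $\mathrm{count}(w, c)$ is the number of times token $w$ occurs in the training tweets of class $c$, and $V$ is the vocabulary.

$$P(w \mid c) = \frac{\mathrm{count}(w, c)}{\sum_{w' \in V} \mathrm{count}(w', c)}$$

Under the Naive Bayes independence assumption, a tweet with tokens $w_1, \dots, w_n$ is then assigned the class

$$\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)$$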

# **2. Importing Libraries & Loading Data**

In [None]:
#importing the required libraries & packages
import pandas as pd
import numpy as np
import time
import argparse
import string
from sklearn.model_selection import train_test_split
from nltk.tokenize import regexp_tokenize
from datetime import datetime
import pytz

In [None]:
#loading the US Airline Sentiment data
data_frame = pd.read_csv('../input/Tweets.csv')
data_frame.head()

In [None]:
data_frame.shape

# **3. Data Preprocessing**

The US Airline Sentiment dataset has many features. In this project we work with only two of them: 'text' and 'airline_sentiment'. Here, 'text' is the feature (X) and 'airline_sentiment' is the target label (y) for the classifier.

The values in the target data are categorical: 'neutral', 'positive' and 'negative'. For ease of computation, we replace 'neutral', 'positive' and 'negative' with 0, 1 and 2 respectively.

In [None]:
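#replacing the categorical values of 'airline_sentiment' with numeric values
#(assigning the result back avoids relying on inplace modification of a column selection)
data_frame['airline_sentiment'] = data_frame['airline_sentiment'].map({'neutral': 0, 'positive': 1, 'negative': 2})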
data_frame['airline_sentiment'].value_counts()

In [None]:
#forming the feature & label variables
data = data_frame['text'].values.tolist()
labels = data_frame['airline_sentiment'].values.tolist()

In [None]:
#First five samples text
data[:5]

In [None]:
#first 5 samples label
labels[:5]

# **4. Splitting the data for Classification**

The data is split 80-20 into training and test sets using the train_test_split method of sklearn.

In [None]:
#splitting the data into 80 and 20 split
train_X, test_X, y_train, y_test = train_test_split(data, labels, test_size=0.2, 
                                                    random_state=42, shuffle=True)

print(f'Number of training examples: {len(train_X)}')
print(f'Number of testing examples: {len(test_X)}')

# **5. Text Preprocessing**

Text preprocessing is the process of transforming text into something an algorithm can digest. This includes:
* &nbsp;tokenizing the data
* &nbsp;removing the punctuation
* &nbsp;removing the stopwords
* &nbsp;stemming 
* &nbsp;lemmatization

For now, we only tokenize the data and work with it as is, without removing punctuation or stop words and without applying any other text processing method.
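
To make the tokenization step concrete, here is a minimal sketch using only the standard library. The pattern `SIMPLE_PATTERN` and the function `simple_tokenize` are hypothetical names with a deliberately simplified rule set; they are not the pattern used in this notebook, which relies on NLTK's `regexp_tokenize`.

```python
import re

# Simplified, assumed pattern: a word (optionally joined by - or '),
# or any single character that is neither a word character nor whitespace.
SIMPLE_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")

def simple_tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return SIMPLE_PATTERN.findall(text.lower())

sample = "@united I love flying, but the delay wasn't fun!"
tokens = simple_tokenize(sample)
print(tokens)
```

Note that punctuation marks and the `@` sign come out as separate tokens, which is why the later punctuation-removal step can change the vocabulary noticeably.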

In [None]:
# Here is a default pattern for tokenization
default_pattern =  r"""(?x)                  
                        (?:[A-Z]\.)+          
                        |\$?\d+(?:\.\d+)?%?    
                        |\w+(?:[-']\w+)*      
                        |\.\.\.               
                        |(?:[.,;"'?():-_`])    
                    """

In [None]:
#function for tokenizing the data
def tokenize(text, pattern = default_pattern):
    """ Tokenize sentence with specific pattern
    Arguments: text {str} -- sentence to be tokenized, such as "I love NLP"
    Keyword Arguments: pattern {str} -- reg-expression pattern for tokenizer (default: {default_pattern})
    Returns: list -- list of lowercased tokens, such as ['i', 'love', 'nlp'] """
    text = text.lower()
    return regexp_tokenize(text, pattern)

In [None]:
# Tokenize training text into tokens
tokenized_text = []
for i in range(0, len(train_X)):
    tokenized_text.append(tokenize(train_X[i]))

X_train = tokenized_text

# Tokenize testing text into tokens
tokenized_text = []
for i in range(0, len(test_X)):
    tokenized_text.append(tokenize(test_X[i]))

X_test = tokenized_text

In [None]:
#tokenized train & test data
print(X_train[0], X_train[1])
print(X_test[0])

# **6. Building Dictionary**

Building dictionary of the training data.
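
As a point of comparison, the same token-count dictionary can be sketched with `collections.Counter` from the standard library. The function name `build_dictionary` and the toy data are assumptions made for illustration only.

```python
from collections import Counter

def build_dictionary(tokenized_samples):
    """Count every token and return a dict ordered by descending count."""
    counts = Counter()
    for sample in tokenized_samples:
        counts.update(sample)
    # most_common() with no argument returns all (token, count) pairs,
    # sorted from the most frequent to the least frequent
    return dict(counts.most_common())

toy_samples = [
    ["the", "flight", "was", "late"],
    ["the", "crew", "was", "great"],
    ["late", "again"],
]
toy_dictionary = build_dictionary(toy_samples)
print(toy_dictionary)
```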

In [None]:
#building dictionary
def createDictionary(data):
    """ Function: To create a dictionary of tokens from the data
    Arguments: data in the type - list
    Returns: Sorted dictionary of the tokens and their count in the data """
    dictionary = dict()
    for sample in data:
        for token in sample:
            dictionary[token] = dictionary.get(token, 0) + 1
    
    #sorting the dictionary based on the values
    sorted_dict = sorted(dictionary.items(), key=lambda x: x[1], reverse=True)
    return dict(sorted_dict)

In [None]:
bog = createDictionary(X_train)
#top 10 items in the dictionary
print("Top 10 tokens in the training dictionary:\n")
list(bog.items())[:10]

# **7. Building the Naive Bayes Classifier**

Now, we define a text classifier class called NBClassifier, which comprises four methods:
* createDictionary()
* fit()
* predict()
* score()

**createDictionary():** This function takes in the tokenized text data and gives out the dictionary or the bag of words of the data.

**fit():** This function computes the class priors and the per-class word counts required for the Naive Bayes probabilities, which fits the classifier on our training data.

**predict():** The test data is passed to this function, which determines the sentiment label of each tweet using the word counts computed during training (by the fit function). In this step, Laplace smoothing is applied while computing the Naïve Bayes probabilities for the test data.

**score():** Determine how many tweets are classified correctly and measures the performance of the model in terms of accuracy.
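
The Laplace-smoothed probability used in predict() is (count of the token in the class + 1) / (total tokens in the class + vocabulary size). The sketch below illustrates that formula on toy counts and sums log-probabilities instead of multiplying raw probabilities, a common variant that avoids numerical underflow on long texts. The function `predict_log`, its parameters, and the toy numbers are assumptions for illustration; this is not the NBClassifier implementation itself, and skipping out-of-vocabulary tokens is one possible convention.

```python
import math

def predict_log(tokens, word_counts, class_token_totals, priors):
    """Return the index of the most probable class for one tokenized tweet."""
    vocab_size = len(word_counts)
    # start from the log prior of each class
    scores = [math.log(p) for p in priors]
    for token in tokens:
        if token not in word_counts:
            continue  # assumed convention: ignore out-of-vocabulary tokens
        for c, count in enumerate(word_counts[token]):
            # Laplace (add-one) smoothed log-probability of the token in class c
            scores[c] += math.log((count + 1) / (class_token_totals[c] + vocab_size))
    return max(range(len(scores)), key=scores.__getitem__)

# toy per-class counts in the order [neutral, positive, negative]
word_counts = {"flight": [3, 2, 4], "great": [0, 5, 0], "delayed": [1, 0, 6]}
class_token_totals = [4, 7, 10]   # total tokens per class
priors = [0.2, 0.3, 0.5]          # share of tweets per class

print(predict_log(["great", "flight"], word_counts, class_token_totals, priors))
print(predict_log(["delayed", "flight", "again"], word_counts, class_token_totals, priors))
```

With these toy counts the first tweet is assigned class 1 (positive) and the second class 2 (negative).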

In [None]:
#Naive Bayes Classifier 
class NBClassifier:

    def __init__(self, X_train, y_train, size):
        tz_NY = pytz.timezone('America/New_York') 
        print("Model Start Time:", datetime.now(tz_NY).strftime("%H:%M:%S"))
        self.X_train = X_train
        self.y_train = y_train
        self.size = size

    def createDictionary(self):
        dictionary = dict()
        #use the training data passed to this instance, not the global X_train
        for sample in self.X_train:
            for token in sample:
                dictionary[token] = dictionary.get(token, 0) + 1
        #sorting the dictionary based on the values
        sorted_dict = sorted(dictionary.items(), key=lambda x: x[1], reverse=True)
        return dict(sorted_dict)
    
    def fit(self):
        """ Function: To compute the class priors and the class-wise token counts of the training data
        Arguments: self (training data & size of dictionary are set in the constructor)
        Returns: self, with the class-wise counts of each token stored in self.words_count """
      
        X_train_dict = self.createDictionary()
        if self.size == 'full':
            self.words_list = list(X_train_dict.keys())
            self.words_count = dict.fromkeys(self.words_list, None)
        else:
            self.words_list = list(X_train_dict.keys())[:int(self.size)]
            self.words_count = dict.fromkeys(self.words_list, None)
            
        #DataFrame of training data (taken from the instance, not from global variables)
        train = pd.DataFrame(columns = ['X_train', 'y_train'])
        train['X_train'] = self.X_train
        train['y_train'] = self.y_train

        train_0 = train.copy()[train['y_train'] == 0]
        train_1 = train.copy()[train['y_train'] == 1]
        train_2 = train.copy()[train['y_train'] == 2]

        #computing the prior of each class
        Pr0 = train_0.shape[0]/train.shape[0]
        Pr1 = train_1.shape[0]/train.shape[0]
        Pr2 = train_2.shape[0]/train.shape[0]

        self.Prior = np.array([Pr0, Pr1, Pr2])

        #converting list of lists into a list
        def flatList(listOfList):
            flatten = []
            for elem in listOfList:
                flatten.extend(elem)
            return flatten
  
        #Creating the data list for each class - tokens of each class
        X_train_0 = flatList(train[train['y_train'] == 0]['X_train'].tolist())
        X_train_1 = flatList(train[train['y_train'] == 1]['X_train'].tolist())
        X_train_2 = flatList(train[train['y_train'] == 2]['X_train'].tolist())
    
        self.X_train_len = np.array([len(X_train_0), len(X_train_1), len(X_train_2)])

        for token in self.words_list:
            #list to store three word counts of a token
            res = []

            #inserting count of token in class 0: Neutral
            res.insert(0, X_train_0.count(token))

            #inserting count of token in class 1: Positive
            res.insert(1, X_train_1.count(token))

            #inserting count of token in class 2: Negative
            res.insert(2, X_train_2.count(token))

            #assigning the count list to its token in the dictionary 
            self.words_count[token] = res
        return self

    def predict(self, X_test):
        """ Function: Predicts the label of the data
        Arguments: self and the test data
        Returns: List of predicted labels for the test data """     
        pred = []
        vocab_count = len(self.words_list)
        for sample in X_test:
            mul = np.ones(3)
            for token in sample:
                #tokens outside the vocabulary are skipped
                if token in self.words_count:
                    #Laplace (add-one) smoothed probability of the token in each class
                    prob = ((np.array(self.words_count[token])+1) / (self.X_train_len + vocab_count))
                    mul = mul * prob
            val = mul * self.Prior
            pred.append(np.argmax(val))
        tz_NY = pytz.timezone('America/New_York') 
        print("Model End Time:", datetime.now(tz_NY).strftime("%H:%M:%S"))
        return pred
    
    def score(self, pred, labels):
        """ Function: To compute the performance of the model
        Arguments: self, predicted labels and actual labels of the test data
        Returns: Number of labels correctly predicted and the accuracy of the model """
        correct = (np.array(pred) == np.array(labels)).sum()
        accuracy = correct/len(pred)
        return correct, accuracy

# **8. Naive Bayes Classifier Training and Evaluation**

The Naive Bayes classifier, NBClassifier, takes three arguments:
* X_train: Features of training dataset
* y_train: Labels of training dataset
* size: Size of vocabulary to be used in the model ('full' or the number of top tokens to keep)

All three arguments are needed for the model to work.

In [None]:
# Creating holders to store the model performance results
attributes = []
corr = []
acc = []

#function to call for storing the results
def storeResults(attr, cor,ac):
    attributes.append(attr)
    corr.append(round(cor, 3))
    acc.append(round(ac, 3))

In [None]:
#training the classifier     
nb = NBClassifier(X_train, y_train, 'full')  
nb.fit()

#predicting the labels for test samples
y_pred = nb.predict(X_test)

#Checking
print("NBClassifier Model miss any prediction???", len(X_test) != len(y_pred))

In [None]:
#Performance of the classifier
cor1, acc1 = nb.score(y_pred, y_test)
print("Count of Correct Predictions:", cor1)
print("Accuracy of the model: %i / %i = %.4f " %(cor1, len(y_pred), acc1))

In [None]:
#storing the results. The below mentioned order of parameter passing is important.
#Caution: Execute only once to avoid duplications.
storeResults('Unprocessed Data', cor1, acc1)

The Naive Bayes classifier that we trained on the unprocessed data predicted 78.24% of the test samples correctly in our run. To improve on this number, a few more text processing methods are applied to the data, and the classifier is then trained on the modified data to predict the sentiment of the test samples.



# **9. Trying to improve the NBClassifier**
To improve the performance of the NBClassifier, we:
* apply other text processing methods
* reduce the size of the dictionary

## **9.1. Further Processing Text Data**

In this step, we are going to apply two text processing methods on the previously tokenized data:
* remove the punctuation 
* remove stop words
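
The two steps listed above can be sketched in a single pass over one tokenized tweet. The helper `clean_tokens`, the translation table and the tiny stop-word set are assumptions for illustration; the notebook's own functions in the following cells work on lists of samples and use a different stop-word list.

```python
import string

# Translation table that deletes every character found in string.punctuation
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

# Small illustrative stop-word set (an assumption, not a standard list)
STOPWORDS = frozenset({"the", "a", "is", "i", "for"})

def clean_tokens(tokens):
    """Strip punctuation characters, then drop empty tokens and stop words."""
    stripped = (tok.translate(PUNCT_TABLE) for tok in tokens)
    return [tok for tok in stripped if tok and tok not in STOPWORDS]

cleaned = clean_tokens(["@", "united", "the", "flight", "is", "late", "!", "wasn't"])
print(cleaned)
```

Using a set for the stop words keeps each membership check constant-time, which matters when the same filter runs over every token of every tweet.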

### **9.1.1. Remove Punctuation**

In [None]:
#string of punctuation characters
string.punctuation

In [None]:
#Removing the punctuation
def removePunctuation(data):
    '''Function: Removes the punctuation from the tokens
       Arguments: list of text data samples
       Returns: list of tokens of each sample without punctuation '''
    update = []
    for sample in data:
        #removing punctuation from the tokens
        re_punct = [''.join(char for char in word if char not in string.punctuation) for word in sample]
        #removes the empty strings
        re_punct = [word for word in re_punct if word]
       
        update.append(re_punct)
    return update

In [None]:
#Removing punctuation from training data text tokens  
X_train_P = removePunctuation(X_train)

#Removing punctuation from testing data text tokens
X_test_P = removePunctuation(X_test)

#train & test data after removing punctuation
print(X_train_P[0])
print(X_test_P[0])

In [None]:
#training the classifier     
nb_punct = NBClassifier(X_train_P, y_train, 'full')
nb_punct.fit()

#predicting the labels for test samples
y_pred_P = nb_punct.predict(X_test_P)

#Checking
print("NBClassifier Model miss any prediction???", len(X_test) != len(y_pred_P))

In [None]:
#Performance of the classifier
cor2, acc2 = nb_punct.score(y_pred_P, y_test)
print("Count of Correct Predictions:", cor2)
print("Accuracy of the model: %i / %i = %.4f " %(cor2, len(y_pred_P), acc2))

In [None]:
#storing the results. The below mentioned order of parameter passing is important.
#Caution: Execute only once to avoid duplications.
storeResults('No Punctuation Data', cor2, acc2)

### **9.1.2. Remove Stopwords**

In [None]:
def removeStopWords(data):
    '''Function: Removes the stopwords from the tokens
       Arguments: list of text data samples
       Returns: list of tokens of each sample without stopwords '''
    update = []
    stopwords = ['the', 'at','i', 'of', 'us', 'have', 'a', 'you','ours', 'themselves', 
                 'that', 'this', 'be', 'is', 'for']
    for sample in data:
        #removing stopwords from tokenized data
        re_stop = [word for word in sample if word not in stopwords]
        
        update.append(re_stop)
    return update

In [None]:
#Removing stopwords from training data text tokens  
X_train_S = removeStopWords(X_train)

#Removing stopwords from testing data text tokens
X_test_S = removeStopWords(X_test)

#train & test data after removing stopwords
print(X_train_S[0])
print(X_test_S[0])

In [None]:
#training the classifier     
nb_stop = NBClassifier(X_train_S, y_train, 'full')
nb_stop.fit()

#predicting the labels for test samples
y_pred_S = nb_stop.predict(X_test_S)

#Checking
print("NBClassifier Model miss any prediction???", len(X_test) != len(y_pred_S))

In [None]:
#Performance of the classifier
cor3, acc3 = nb_stop.score(y_pred_S, y_test)
print("Count of Correct Predictions:", cor3)
print("Accuracy of the model: %i / %i = %.4f " %(cor3, len(y_pred_S), acc3))

In [None]:
#storing the results. The below mentioned order of parameter passing is important.
#Caution: Execute only once to avoid duplications.
storeResults('Removed few Stopwords', cor3, acc3)

### **9.1.3. Removing both Punctuation & Few Stopwords**

In [None]:
#Removing stopwords from training data text tokens  
X_train_PS = removeStopWords(X_train_P)

#Removing stopwords from testing data text tokens
X_test_PS = removeStopWords(X_test_P)

#train & test data after removing stopwords
print(X_train_PS[0])
print(X_test_PS[0])

In [None]:
#training the classifier     
nb_PS = NBClassifier(X_train_PS, y_train, 'full')
nb_PS.fit()

#predicting the labels for test samples
y_pred_PS = nb_PS.predict(X_test_PS)

#Checking
print("NBClassifier Model miss any prediction???", len(X_test) != len(y_pred_PS))

In [None]:
#Performance of the classifier
cor4, acc4 = nb_PS.score(y_pred_PS, y_test)
print("Count of Correct Predictions:", cor4)
print("Accuracy of the model: %i / %i = %.4f " %(cor4, len(y_pred_PS), acc4))

In [None]:
#storing the results. The below mentioned order of parameter passing is important.
#Caution: Execute only once to avoid duplications.
storeResults('Removed both Punctuation & Few Stopwords', cor4, acc4)

## **9.2. Reducing the Dictionary Size**

To improve the model performance, we reduce the size of training dictionary further by taking only top-k frequent word types that appear in it. Here, we vary the value of k and compare the model performance.
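
As a rough illustration of what top-k truncation does, the sketch below keeps the k most frequent tokens of a toy corpus and shows which tokens of a new sample survive. The helper names `top_k_vocabulary` and `restrict_to_vocabulary` are hypothetical and are not part of NBClassifier, which performs the truncation inside fit() via its `size` argument.

```python
from collections import Counter

def top_k_vocabulary(tokenized_samples, k):
    """Return the set of the k most frequent tokens in the samples."""
    counts = Counter(tok for sample in tokenized_samples for tok in sample)
    return {tok for tok, _ in counts.most_common(k)}

def restrict_to_vocabulary(sample, vocab):
    """Keep only the tokens of a sample that belong to the vocabulary."""
    return [tok for tok in sample if tok in vocab]

toy_train = [["flight", "late", "again"], ["flight", "great"], ["late", "flight"]]
toy_vocab = top_k_vocabulary(toy_train, k=2)
kept = restrict_to_vocabulary(["great", "flight", "late"], toy_vocab)
print(sorted(toy_vocab), kept)
```

Rare tokens such as "great" fall outside a small vocabulary even when they carry sentiment, which is one reason to compare several values of k rather than fix one in advance.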



In [None]:
#total tokens in training dictionary
print('Total tokens in the dictionary:', len(bog))

### **9.2.1. Considering Top 5k Tokens**

**5k Tokens of Vocabulary - Unprocessed data**

In [None]:
#training the classifier - 5000 tokens 
nb_5k = NBClassifier(X_train, y_train, '5000')
nb_5k.fit()

#predicting the labels for test samples
y_pred_5k = nb_5k.predict(X_test)

#Checking
print("NBClassifier Model miss any prediction???", len(X_test) != len(y_pred_5k))

In [None]:
#Performance of the classifier
cor5, acc5 = nb_5k.score(y_pred_5k, y_test)
print("Count of Correct Predictions:", cor5)
print("Accuracy of the model: %i / %i = %.4f " %(cor5, len(y_pred), acc5))

In [None]:
#storing the results. The below mentioned order of parameter passing is important.
#Caution: Execute only once to avoid duplications.
storeResults('5k Tokens of Vocab - Unprocessed Data', cor5, acc5)

**5k Tokens of Vocabulary - No Punctuation Data**

In [None]:
#training the classifier - 5000 tokens 
nb_5k_P = NBClassifier(X_train_P, y_train, '5000')
nb_5k_P.fit()

#predicting the labels for test samples
y_pred_5k_P = nb_5k_P.predict(X_test_P)

#Checking
print("NBClassifier Model miss any prediction???", len(X_test) != len(y_pred_5k_P))

In [None]:
#Performance of the classifier
cor6, acc6 = nb_5k_P.score(y_pred_5k_P, y_test)
print("Count of Correct Predictions:", cor6)
print("Accuracy of the model: %i / %i = %.4f " %(cor6, len(y_pred), acc6))

In [None]:
#storing the results. The below mentioned order of parameter passing is important.
#Caution: Execute only once to avoid duplications.
storeResults('5k Tokens of Vocab - No Punctuation Data', cor6, acc6)

**5k Tokens of Vocabulary - Removed few Stopwords**




In [None]:
#training the classifier - 5000 tokens 
nb_5k_S = NBClassifier(X_train_S, y_train, '5000')
nb_5k_S.fit()

#predicting the labels for test samples
y_pred_5k_S = nb_5k_S.predict(X_test_S)

#Checking
print("NBClassifier Model miss any prediction???", len(X_test) != len(y_pred_5k_S))

In [None]:
#Performance of the classifier
cor7, acc7 = nb_5k_S.score(y_pred_5k_S, y_test)
print("Count of Correct Predictions:", cor7)
print("Accuracy of the model: %i / %i = %.4f " %(cor7, len(y_pred), acc7))

In [None]:
#storing the results. The below mentioned order of parameter passing is important.
#Caution: Execute only once to avoid duplications.
storeResults('5k Tokens of Vocab - Removed few Stopwords', cor7, acc7)

**5k Tokens of Vocabulary - Removed both Punctuation & Few Stopwords**




In [None]:
#training the classifier - 5000 tokens 
nb_5k_PS = NBClassifier(X_train_PS, y_train, '5000')
nb_5k_PS.fit()

#predicting the labels for test samples
y_pred_5k_PS = nb_5k_PS.predict(X_test_PS)

#Checking
print("NBClassifier Model miss any prediction???", len(X_test) != len(y_pred_5k_PS))

In [None]:
#Performance of the classifier
cor8, acc8 = nb_5k_PS.score(y_pred_5k_PS, y_test)
print("Count of Correct Predictions:", cor8)
print("Accuracy of the model: %i / %i = %.4f " %(cor8, len(y_pred), acc8))

In [None]:
#storing the results. The below mentioned order of parameter passing is important.
#Caution: Execute only once to avoid duplications.
storeResults('5k Tokens of Vocab - Removed both Punctuation & Few Stopwords', cor8, acc8)

### **9.2.2. Considering Top 10k Tokens**
**10k Tokens of Vocabulary - Unprocessed data**

In [None]:
#training the classifier - 10000 tokens 
nb_10k = NBClassifier(X_train, y_train, '10000')
nb_10k.fit()

#predicting the labels for test samples
y_pred_10k = nb_10k.predict(X_test)

#Checking
print("NBClassifier Model miss any prediction???", len(X_test) != len(y_pred_10k))

In [None]:
#Performance of the classifier
cor9, acc9 = nb_10k.score(y_pred_10k, y_test)
print("Count of Correct Predictions:", cor9)
print("Accuracy of the model: %i / %i = %.4f " %(cor9, len(y_pred), acc9))

In [None]:
#storing the results. The below mentioned order of parameter passing is important.
#Caution: Execute only once to avoid duplications.
storeResults('10k Tokens of Vocab - Unprocessed Data', cor9, acc9)

**10k Tokens of Vocabulary - No Punctuation Data**

In [None]:
#training the classifier - 10000 tokens 
nb_10k_P = NBClassifier(X_train_P, y_train, '10000')
nb_10k_P.fit()

#predicting the labels for test samples
y_pred_10k_P = nb_10k_P.predict(X_test_P)
  
#Checking
print("NBClassifier Model miss any prediction???", len(X_test) != len(y_pred_10k_P))

In [None]:
#Performance of the classifier
cor10, acc10 = nb_10k_P.score(y_pred_10k_P, y_test)
print("Count of Correct Predictions:", cor10)
print("Accuracy of the model: %i / %i = %.4f " %(cor10, len(y_pred), acc10))

In [None]:
#storing the results. The below mentioned order of parameter passing is important.
#Caution: Execute only once to avoid duplications.
storeResults('10k Tokens of Vocab - No Punctuation Data', cor10, acc10)

**10k Tokens of Vocabulary - Removed few Stopwords**




In [None]:
#training the classifier - 10000 tokens 
nb_10k_S = NBClassifier(X_train_S, y_train, '10000')
nb_10k_S.fit()

#predicting the labels for test samples
y_pred_10k_S = nb_10k_S.predict(X_test_S)
  
#Checking
print("NBClassifier Model miss any prediction???", len(X_test) != len(y_pred_10k_S))

In [None]:
#Performance of the classifier
cor11, acc11 = nb_10k_S.score(y_pred_10k_S, y_test)
print("Count of Correct Predictions:", cor11)
print("Accuracy of the model: %i / %i = %.4f " %(cor11, len(y_pred), acc11))

In [None]:
#storing the results. The below mentioned order of parameter passing is important.
#Caution: Execute only once to avoid duplications.
storeResults('10k Tokens of Vocab - Removed few Stopwords', cor11, acc11)

**10k Tokens of Vocabulary - Removed both Punctuation & Few Stopwords**




In [None]:
#training the classifier - 10000 tokens 
nb_10k_PS = NBClassifier(X_train_PS, y_train, '10000')
nb_10k_PS.fit()

#predicting the labels for test samples
y_pred_10k_PS = nb_10k_PS.predict(X_test_PS)
  
#Checking
print("NBClassifier Model miss any prediction???", len(X_test) != len(y_pred_10k_PS))

In [None]:
#Performance of the classifier
cor12, acc12 = nb_10k_PS.score(y_pred_10k_PS, y_test)
print("Count of Correct Predictions:", cor12)
print("Accuracy of the model: %i / %i = %.4f " %(cor12, len(y_pred), acc12))

In [None]:
#storing the results. The below mentioned order of parameter passing is important.
#Caution: Execute only once to avoid duplications.
storeResults('10k Tokens of Vocab - Removed both Punctuation & Few Stopwords', cor12, acc12)

# **10. Comparing the Results**

In [None]:
#creating dataframe
results = pd.DataFrame({ 'Data Modification': attributes,    
    'Correct Predictions': corr,
    'Model Accuracy': acc})

In [None]:
results.sort_values(by=['Model Accuracy', 'Correct Predictions'], ascending=False)

**NOTE: Detailed description & analysis of each step are mentioned in the report @ https://github.com/shreyagopal/US-Airlines-Sentiment-Classification/blob/main/Report_US%20Airlines%20Sentiment%20Classification.pdf.**