# iLykei Lecture Series   
# Text Analytics (MLDS 414)   
# Assignment: Sentiment Analysis with Naive Bayes Bag-of-Words Model 

### Y.Balasanov, M. Tselishchev, &copy; iLykei 2023

## Preparing the data    

Data for this project are in the form of a corpus of documents.   
Each document is a tweet regarding an airline service.    
The goal is to predict the sentiment of each document: +1 for positive, 0 for neutral, and -1 for negative.   
The training set includes a sentiment column whose labels were assigned by human annotators.    
The vocabulary for this project is built from all words that occur in the corpus of documents.    
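
As a minimal sketch of the bag-of-words idea (the two documents below are made up, and the helper names `count_vector` and `binary_vector` are hypothetical), the vocabulary can be built from all distinct words, and each document then becomes a vector over that vocabulary. Binary presence features are what a Bernoulli Naive Bayes model uses, while word counts are the natural input for a multinomial model.

```python
from collections import Counter

# Two made-up, already tokenized documents (not taken from the assignment data).
docs = [
    "great service great crew",
    "lost bag terrible service",
]

# Vocabulary: every distinct word in the corpus, sorted for a stable column order.
vocabulary = sorted({word for doc in docs for word in doc.split()})

def count_vector(doc):
    """Number of occurrences of each vocabulary word in the document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

def binary_vector(doc):
    """1 if the vocabulary word occurs in the document, 0 otherwise."""
    return [1 if c > 0 else 0 for c in count_vector(doc)]

print(vocabulary)             # ['bag', 'crew', 'great', 'lost', 'service', 'terrible']
print(count_vector(docs[0]))  # [0, 1, 2, 0, 1, 0]
print(binary_vector(docs[0])) # [0, 1, 1, 0, 1, 0]
```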

Install the necessary libraries.

In [1]:
#!pip install -q matplotlib numpy pandas scikit-learn nltk

Install `protobuf` following the instructions [here](https://github.com/protocolbuffers/protobuf/blob/main/src/README.md). Then run the following line; it should not produce any error messages.

In [2]:
!protoc --python_out=./ *.proto

In [3]:
import re
import joblib
from datetime import datetime
import pickle

import numpy as np
import pandas as pd
import nltk
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.metrics import log_loss

Download the NLTK resources: stopwords, the Punkt tokenizer models, and WordNet. 

In [4]:
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

[nltk_data] Downloading package stopwords to /home/yuri/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/yuri/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/yuri/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Add corpus-specific stopwords: the airline account names and common flight-related words.   
Create a lemmatizer based on WordNet.

In [5]:
aircompanies_accounts = ['VirginAmerica', 'United', 'SouthwestAir', 'JetBlue', 
                         'Delta', 'USAirways', 'AmericanAir']
other_stopwords = ['fly', 'flying', 'flight', 'flights', 'plane']

eng_stopwords = stopwords.words('english')
eng_stopwords.extend([w.lower() for w in aircompanies_accounts])
eng_stopwords.extend(other_stopwords)

lemmatizer = WordNetLemmatizer()

Create a function that prepares bag-of-words documents.

In [6]:
def my_tokenizer(tweet):
    # Remove everything but letters:
    tweet = re.sub("[^a-zA-Z]", " ", tweet)
    # Make lower-case:
    tweet = tweet.lower()
    # Tokenize tweet:
    tokens = nltk.word_tokenize(tweet)
    # Remove stop-words:
    tokens = list(filter(lambda token: token not in eng_stopwords, tokens))
    # Lemmatize all tokens:
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return ' '.join(tokens)
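
To see what the cleaning steps do to a tweet without downloading the NLTK data, here is a simplified stand-in that uses only the standard library. It approximates `my_tokenizer` rather than reproducing it: it splits on whitespace instead of calling `nltk.word_tokenize`, uses a tiny hypothetical stop list (`STAND_IN_STOPWORDS`) instead of the full NLTK list, and skips lemmatization, so its output can differ from the real function.

```python
import re

# Hypothetical, deliberately tiny stop list; the notebook uses NLTK's English
# list plus the airline account names and flight-related words.
STAND_IN_STOPWORDS = {"united", "on", "a"}

def simple_clean(tweet):
    """Simplified approximation of my_tokenizer (no NLTK, no lemmatization)."""
    # Replace everything but letters with spaces, then lower-case.
    tweet = re.sub("[^a-zA-Z]", " ", tweet).lower()
    # Whitespace split stands in for nltk.word_tokenize.
    tokens = [t for t in tweet.split() if t not in STAND_IN_STOPWORDS]
    return " ".join(tokens)

cleaned = simple_clean("@united   Pushing 2 hours on hold. Priceless. http://t.co/thS10LDY2a")
print(cleaned)  # pushing hours hold priceless http t co ths ldy
```

Note that the letters-only regex breaks the URL into fragments such as `http`, `co`, and `ths`, which then enter the bag of words unless they are filtered out.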

Load the pretrained models.   
Note that this notebook uses only one model: Bernoulli (binary) Naive Bayes.    
To improve the results, train and load a multinomial model as well.    
Experiment with model ensembling if necessary.
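
A minimal sketch of how a multinomial model might be trained and saved is shown below. The training documents and labels are made up for illustration, the file names `vectorizer_multinomial.joblib` and `nb_multinomial.joblib` are hypothetical, and the files are written to a temporary directory rather than to `models/`. In the assignment, the inputs would be the tokenized training tweets and their human-assigned sentiments.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up, already tokenized training documents and labels (-1, 0, +1).
train_docs = [
    "lost bag terrible service",
    "delayed hour rude agent",
    "gate change information",
    "question baggage allowance",
    "great crew thank",
    "love friendly service",
]
train_labels = [-1, -1, 0, 0, 1, 1]

# Word counts (not binary indicators) are the features for the multinomial model.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)

model = MultinomialNB(alpha=1.0)  # alpha=1.0 is Laplace smoothing
model.fit(X_train, train_labels)

# The columns of predict_proba follow model.classes_, i.e. the sorted labels
# [-1, 0, 1], which matches the order negative, neutral, positive.
probs = model.predict_proba(vectorizer.transform(["terrible rude crew"]))[0]

# Save and reload the fitted objects (hypothetical file names, temporary directory).
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "nb_multinomial.joblib")
    joblib.dump(vectorizer, os.path.join(tmp_dir, "vectorizer_multinomial.joblib"))
    joblib.dump(model, model_path)
    reloaded_model = joblib.load(model_path)
```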

In [7]:
vectorizer_model = joblib.load('models/vectorizer_bernoulli.joblib')
nb_model = joblib.load('models/nb_bernoulli.joblib')

## Prepare the process of responding to the tweets in real time   

Define global variables for the process.   
Initialize the data frame for received tweets.

In [8]:
tweet_counter = 0     # tweet event counter
BUF_SIZE = 1000       # no need to change this buffer size

# we create buffers in advance:
tweets_df = pd.DataFrame(index=range(0, BUF_SIZE),
                         columns=['time', 'tweet_id', 'text', 
                                  'prob_neg', 'prob_neutral', 'prob_positive'])
start_time = datetime.now()

Define the event handler.     
The event handler is a function that executes the response logic for incoming tweet messages.    
This function is called automatically every time a new message is received from the server. It performs the following steps:   

- Record the time stamp;   
- Update the data frame with the received tweet;    
- Tokenize the tweet and convert it to a bag of words;   
- Predict the class probabilities. This step uses the pre-fitted model loaded into memory. The best model must be selected, or ensembling logic over several models must be defined here;   
- Update the data frame with the predicted probabilities.   

In [9]:
def tweet_handler(tweet_id, text):
    global tweets_df, tweet_counter
    now = datetime.now()
    # update tweets_df dataframe:
    tweets_df.loc[tweet_counter] = [now, tweet_id, text, np.nan, np.nan, np.nan]
    tweet_counter += 1
    # process new tweet
    print(tweet_id, text)
    tokens =  my_tokenizer(text)
    matrix_model = vectorizer_model.transform([tokens]).toarray()
    model_proba = nb_model.predict_proba(matrix_model)
    probs = list(model_proba[0])
    print(f'{probs=}')
    tweets_df.loc[tweet_counter - 1, ['prob_neg', 'prob_neutral', 'prob_positive']] = probs
    return probs
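
The handler above uses a single model. One simple way to ensemble several models, sketched here under the assumption that each model returns probabilities in the same class order (negative, neutral, positive), is a weighted average of their probability vectors. The function name `average_probs` and the example numbers are hypothetical.

```python
def average_probs(prob_vectors, weights=None):
    """Weighted average of per-model class-probability vectors."""
    if weights is None:
        weights = [1.0] * len(prob_vectors)
    total_weight = sum(weights)
    n_classes = len(prob_vectors[0])
    return [
        sum(w * p[c] for w, p in zip(weights, prob_vectors)) / total_weight
        for c in range(n_classes)
    ]

# Made-up outputs of two models for one tweet: [negative, neutral, positive].
bernoulli_probs = [0.70, 0.20, 0.10]
multinomial_probs = [0.50, 0.10, 0.40]

ensemble_probs = average_probs([bernoulli_probs, multinomial_probs])
print(ensemble_probs)  # roughly [0.6, 0.15, 0.25]
```

Because each input vector sums to 1 and the weights are normalized, the averaged vector also sums to 1 and could be returned from the handler in place of `probs`.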

## Run the real-time process   

Connect to the server using your credentials stored in `my_credentials.txt`. The file must contain 2 lines: login name (email address) and the streaming password.    
Connect and see how the handler with your model classifies the documents.   
The score reflecting the accuracy of the classification of the test sample will appear in the log at the end of the session.
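
The two-line credentials file described above could be read a little more defensively, for example so that an extra blank line at the end of the file does not break the unpacking. This is only a sketch: the helper name `read_credentials` is hypothetical, and the demonstration writes made-up credentials to a temporary file.

```python
import os
import tempfile

def read_credentials(path):
    """Return (login, password) from a two-line credentials file."""
    with open(path, "r") as f:
        # Strip whitespace and drop empty lines, e.g. a blank line at the end.
        lines = [line.strip() for line in f if line.strip()]
    if len(lines) != 2:
        raise ValueError(f"expected 2 non-empty lines in {path}, found {len(lines)}")
    return lines[0], lines[1]

# Demonstration with a temporary file and made-up credentials.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "my_credentials.txt")
    with open(demo_path, "w") as f:
        f.write("student@example.com\nnot-a-real-password\n\n")
    login, password = read_credentials(demo_path)
```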

In [10]:
from AirTweet_connection import connect

with open("my_credentials.txt",'r') as f:
    lines = f.readlines()
login, password = map(str.strip, lines)

# server options; do not change
host = 'datastream.ilykei.com'      
port = 30019
stream_name = 'AirTweet'
catch_handler_errors = True  # we recommend True during the test and False during preparation

# make connection with your personal handlers
result = connect(host, port, login, password, stream_name,
                 tweet_handler, catch_handler_errors)

Connecting to datastream.ilykei.com:30019
Sending login message
Logged in successfully as  m@ts
0 @americanair I finally got someone on the phone so no worries!
probs=[0.9493228800612477, 0.01584377173418462, 0.034833348204570305]
1 @SouthwestAir Flying South by Southwest from SJC to SNA tiday http://t.co/KJKuVJ6CMo
probs=[0.024508783933002945, 0.9166847824766486, 0.058806433590349524]
2 @united  of course this morning I see a non-stop from IAH to SFO but that was not available yesterday. #UnitedHatesUsAll
probs=[0.9222129532223743, 0.07700070913043905, 0.0007863376471874148]
3 @JetBlue For one way, not letting me select TO city....strange...and now some fairs look "normal" ????
probs=[0.9673330808472415, 0.0321620784140334, 0.0005048407387240523]
4 @united caught earlier flight to ORD. Gate checked bag, and you've lost it at O'Hare. original flight lands in 20minutes. #frustrating!
probs=[0.9999715198824672, 2.6709808771873222e-05, 1.7703087557714555e-06]
5 @SouthwestAir really this i

45 @united @JedediahBila Why is United voted every year as one of the worst airlines? Do enjoy that title? You should give Jedediah free passes
probs=[0.9848803525151636, 0.004382975282915597, 0.010736672201918677]
46 @united Not appropriate to ask in public (hence the dm). each united employee, each a new answer. your process was such a hassle i Cancelled Flighted.
probs=[0.9988328845786547, 0.0011362776374024576, 3.0837783937465415e-05]
47 @SouthwestAir Buying Early Bird was pointless. You moved me to a diff flight b/c first one was delayed, so I lost my boarding position.
probs=[0.9839092498132073, 0.014482921640292766, 0.0016078285464986575]
48 @SouthwestAir great example of customer service this morning at MSY headed to ATL. Alison and Bobbi were fantastic! Gate B8. Thank you.
probs=[0.0007957896515506072, 1.7164842187178954e-05, 0.9991870455062626]
49 @united Hopefully my baggage fees will be waived tomorrow when I actually get on a flight, as well as compensation for my hotel ro

90 @USAirways @Truthh4 they didn't give you the wifi password? Smh
probs=[0.9383640974916291, 0.05933691633620467, 0.002298986172162959]
91 @VirginAmerica so loyal that I'm driving to #NYC from #PA, to fly Virgin,  since you cut #Philly flights ;)
probs=[0.9522180156017781, 0.04157547139691773, 0.006206513001301588]
92 @USAirways I wasnt  flying your airline tonight, however a friend was and I was present for her help. I flying United and they could learn.
probs=[0.9086221773656376, 0.08603796644368372, 0.005339856190679819]
93 @AmericanAir I did
probs=[0.8437706485881202, 0.12926985073497002, 0.026959500676910452]
94 @united   Pushing 2 hours on hold. Priceless. http://t.co/thS10LDY2a
probs=[0.9491511616435871, 0.05002757014234199, 0.0008212682140672908]
95 @JetBlue Understood but I watched as fam heading to FL was expedited when they had 25m to takeoff. I am speaking at conference in SF
probs=[0.29826630482525285, 0.10699857779339657, 0.5947351173813467]
96 @AmericanAir why is does i

Check the result

In [11]:
result

{'problems': [],
 'n_signals': 100,
 'penalty': 0.7703130801826984,
 'missed_id': [],
 'score': 92}

In [12]:
# remove empty values from buffers
tweets_df = tweets_df.head(tweet_counter)
tweets_df

| | time | tweet_id | text | prob_neg | prob_neutral | prob_positive |
|---|---|---|---|---|---|---|
| 0 | 2023-09-18 14:21:25.183644 | 0 | @americanair I finally got someone on the phon... | 0.949323 | 0.015844 | 0.034833 |
| 1 | 2023-09-18 14:21:26.200747 | 1 | @SouthwestAir Flying South by Southwest from S... | 0.024509 | 0.916685 | 0.058806 |
| 2 | 2023-09-18 14:21:26.211786 | 2 | @united of course this morning I see a non-st... | 0.922213 | 0.077001 | 0.000786 |
| 3 | 2023-09-18 14:21:31.395553 | 3 | @JetBlue For one way, not letting me select TO... | 0.967333 | 0.032162 | 0.000505 |
| 4 | 2023-09-18 14:21:31.444474 | 4 | @united caught earlier flight to ORD. Gate che... | 0.999972 | 0.000027 | 0.000002 |
| ... | ... | ... | ... | ... | ... | ... |
| 95 | 2023-09-18 14:22:49.082134 | 95 | @JetBlue Understood but I watched as fam headi... | 0.298266 | 0.106999 | 0.594735 |
| 96 | 2023-09-18 14:22:49.797636 | 96 | @AmericanAir why is does it seem to be impossi... | 0.999851 | 0.000148 | 0.000002 |
| 97 | 2023-09-18 14:22:50.104544 | 97 | @SouthwestAir got it squared away. Thank you. ... | 0.121028 | 0.085417 | 0.793555 |
| 98 | 2023-09-18 14:22:52.523432 | 98 | @JetBlue Thanks! See you soon! | 0.345754 | 0.274332 | 0.379914 |


Save the log.

In [13]:
# dump the data and results so they can be analyzed later
with open('results.pkl', 'wb') as output_f:
    pickle.dump([tweets_df, result], output_f)

## Penalty Function    

The penalty for this project is the log loss, which measures the accuracy of the predicted sentiment probabilities:
$$LogLoss=-\frac{1}{N} \sum_{i=1}^N \left( y_{i,neg} \log(p_{i,neg}) + y_{i,neut} \log(p_{i,neut}) + y_{i,pos} \log(p_{i,pos}) \right),$$
where $y_{i,c}=1$ when the tweet belongs to class $c$, and 0 otherwise; $p_{i,c}$ are predicted probabilities of classes. 
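
The formula can be checked by hand with a short sketch. Since $y_{i,c}$ is 1 only for the true class, each tweet contributes the log of the probability assigned to its true class. The function name `log_loss_3class`, the example labels and probabilities, and the clipping constant `eps` are assumptions made for illustration; whether and how the server clips probabilities is not stated here.

```python
import math

CLASSES = [-1, 0, 1]  # negative, neutral, positive

def log_loss_3class(true_labels, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class of each tweet."""
    total = 0.0
    for label, p in zip(true_labels, probs):
        p_true = p[CLASSES.index(label)]
        # Clip to avoid log(0); the value of eps is an assumption of this sketch.
        p_true = min(max(p_true, eps), 1.0 - eps)
        total += math.log(p_true)
    return -total / len(true_labels)

# Two made-up tweets: the first is truly negative, the second truly positive.
labels = [-1, 1]
predicted = [[0.9, 0.05, 0.05], [0.2, 0.3, 0.5]]
loss = log_loss_3class(labels, predicted)
print(round(loss, 4))  # 0.3993, i.e. -(ln 0.9 + ln 0.5) / 2

# Reference point: predicting 1/3 for every class gives ln 3, about 1.0986.
uniform_loss = log_loss_3class(labels, [[1 / 3, 1 / 3, 1 / 3]] * 2)
```

A confident wrong prediction is penalized heavily, because the log of a probability near zero is a large negative number.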