# iLykei Lecture Series   
# Text Analytics (MLDS 414)   
# Assignment: Sentiment Analysis with Naive Bayes Bag-of-Words Model 

### Y.Balasanov, M. Tselishchev, &copy; iLykei 2023

## Preparing the data    

Data for this project are in the form of a corpus of documents.   
Each document is a tweet regarding an airline service.    
The goal is to identify (predict) the sentiment of the document: +1 for positive, 0 for neutral, and -1 for negative.   
The training set contains the sentiment column in which allocation of sentiments was done by humans.    
Vocabulary for this project is created from the table of all words in the corpus of documents.    
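
As a rough illustration of how a vocabulary and binary bag-of-words vectors can be derived from a corpus, here is a minimal pure-Python sketch. The three sample documents and the helper names `build_vocabulary` and `to_binary_vector` are hypothetical and not part of the course materials; the notebook itself imports scikit-learn's `CountVectorizer` for this step.

```python
from collections import Counter

# Hypothetical, already-cleaned sample documents (not taken from the course data).
corpus = [
    "thanks great service",
    "late again worst service",
    "great crew thanks",
]

def build_vocabulary(docs, min_count=1):
    """Map each word seen at least `min_count` times to a column index."""
    counts = Counter(word for doc in docs for word in doc.split())
    words = sorted(w for w, c in counts.items() if c >= min_count)
    return {word: idx for idx, word in enumerate(words)}

def to_binary_vector(doc, vocabulary):
    """Binary bag-of-words: 1 if the word occurs in the document, else 0."""
    present = set(doc.split())
    # Dict iteration follows insertion order, which matches the column indices.
    return [1 if word in present else 0 for word in vocabulary]

vocabulary = build_vocabulary(corpus)
vectors = [to_binary_vector(doc, vocabulary) for doc in corpus]
print(vocabulary)
print(vectors)
```

Raising `min_count` drops rare words, which is one common way to keep the vocabulary small.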

Install necessary libraries

In [1]:
# !pip install -q matplotlib numpy pandas scikit-learn nltk

Install `protobuf` following the instructions [here](https://github.com/protocolbuffers/protobuf/blob/main/src/README.md). Then run the following line; it should not produce any error messages.

In [1]:
!protoc --python_out=./ *.proto

In [2]:
import re
import joblib
from datetime import datetime
import pickle

import numpy as np
import pandas as pd
import nltk
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.metrics import log_loss

Download the NLTK resources: stopwords, the Punkt tokenizer, and WordNet. 

In [3]:
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\nuke2\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\nuke2\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\nuke2\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Add some specific stopwords for this corpus.   
Create a lemmatizer based on WordNet.

In [4]:
aircompanies_accounts = ['VirginAmerica', 'United', 'SouthwestAir', 'JetBlue', 
                         'Delta', 'USAirways', 'AmericanAir']
other_stopwords = ['fly', 'flying', 'flight', 'flights', 'plane']

eng_stopwords = stopwords.words('english')
eng_stopwords.extend([w.lower() for w in aircompanies_accounts])
eng_stopwords.extend(other_stopwords)

lemmatizer = WordNetLemmatizer()

Create a function that prepares documents for the bag-of-words model.

In [5]:
def my_tokenizer(tweet):
    # Remove everything but letters:
    tweet = re.sub("[^a-zA-Z]", " ", tweet)
    # Make lower-case:
    tweet = tweet.lower()
    # Tokenize tweet:
    tokens = nltk.word_tokenize(tweet)
    # Remove stop-words:
    tokens = list(filter(lambda token: token not in eng_stopwords, tokens))
    # Lemmatize all tokens:
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return ' '.join(tokens)

Load the pretrained models.   
Note that this document uses only one model: Bernoulli (binary) Naive Bayes.    
Train and load a multinomial model to improve the results.    
Experiment with model ensembling if necessary.
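
One simple way to ensemble is to average the class probabilities produced by several models. The sketch below is a minimal illustration under stated assumptions: the function name `average_probabilities`, the weights, and the two probability vectors are hypothetical, and every model is assumed to return probabilities in the order negative, neutral, positive.

```python
def average_probabilities(prob_lists, weights=None):
    """Weighted average of several [p_neg, p_neutral, p_pos] vectors."""
    if weights is None:
        weights = [1.0] * len(prob_lists)
    total = sum(weights)
    n_classes = len(prob_lists[0])
    averaged = [
        sum(w * probs[c] for w, probs in zip(weights, prob_lists)) / total
        for c in range(n_classes)
    ]
    # Renormalize to guard against rounding drift.
    norm = sum(averaged)
    return [p / norm for p in averaged]

# Hypothetical outputs of a Bernoulli and a multinomial model for one tweet.
bernoulli_probs = [0.65, 0.15, 0.20]
multinomial_probs = [0.25, 0.15, 0.60]

ensemble_probs = average_probabilities([bernoulli_probs, multinomial_probs])

# Map the most probable class back to the sentiment encoding (-1, 0, +1).
labels = [-1, 0, 1]
predicted = labels[ensemble_probs.index(max(ensemble_probs))]
print(ensemble_probs, predicted)
```

Weights could be chosen from each model's log loss on a held-out validation set, with more weight given to the model with the lower loss.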

In [6]:
vectorizer_model = joblib.load('models/vectorizer_bernoulli.joblib')
nb_model = joblib.load('models/nb_bernoulli.joblib')

## Prepare the process of responding to the tweets in real time   

Define global variables for the process.   
Initialize the data frame for received tweets.

In [7]:
tweet_counter = 0     # tweet event counter
BUF_SIZE = 1000       # no need to change this buffer size

# we create buffers in advance:
tweets_df = pd.DataFrame(index=range(0, BUF_SIZE),
                         columns=['time', 'tweet_id', 'text', 
                                  'prob_neg', 'prob_neutral', 'prob_positive'])
start_time = datetime.now()

Define the event handler.     
The event handler is a function that executes the logic of responding to incoming messages with tweets.    
This function is called automatically every time a new message is received from the server. The function has the following steps:   

- Identify time stamp;   
- Update the data frame with received tweets;    
- Tokenize the tweet and make it a bag of words;   
- Predict probabilities of classes. This step uses the pre-fitted model loaded into memory. Select the best model here, or define ensembling logic that combines several models;   
- Update the data frame with the predicted probabilities.   

In [8]:
def tweet_handler(tweet_id, text):
    global tweets_df, tweet_counter
    now = datetime.now()
    # update tweets_df dataframe:
    tweets_df.loc[tweet_counter] = [now, tweet_id, text, np.nan, np.nan, np.nan]
    tweet_counter += 1
    # process new tweet
    print(tweet_id, text)
    tokens =  my_tokenizer(text)
    matrix_model = vectorizer_model.transform([tokens]).toarray()
    model_proba = nb_model.predict_proba(matrix_model)
    probs = list(model_proba[0])
    print(f'{probs=}')
    tweets_df.loc[tweet_counter - 1, ['prob_neg', 'prob_neutral', 'prob_positive']] = probs
    return probs

## Run the real-time process   

Connect to the server using your credentials stored in `my_credentials.txt`. The file must contain 2 lines: login name (email address) and the streaming password.    
Connect and see how the handler with your model classifies the documents.   
The score reflecting the accuracy of the classification of the test sample will appear in the log at the end of the session.
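
Reading the two-line credentials file can be made tolerant of stray blank lines. The following is a minimal sketch, not part of the course code: the helper name `read_credentials` is hypothetical, and the demonstration writes placeholder values to a temporary file instead of using real credentials.

```python
import os
import tempfile

def read_credentials(path):
    """Return (login, password) from a file with one value per line."""
    with open(path, "r") as f:
        # Strip surrounding whitespace and skip blank lines (e.g. a trailing newline).
        lines = [line.strip() for line in f if line.strip()]
    if len(lines) != 2:
        raise ValueError(f"expected 2 non-empty lines, found {len(lines)}")
    return lines[0], lines[1]

# Demonstration with a temporary file and placeholder values.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "my_credentials.txt")
    with open(demo_path, "w") as f:
        f.write("student@example.com\nplaceholder-password\n\n")
    login, password = read_credentials(demo_path)

print(login)
```

Failing early with a clear message is easier to debug than an unpacking error raised just before connecting.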

In [9]:
from AirTweet_connection import connect

with open("my_credentials.txt",'r') as f:
    lines = f.readlines()
login, password = map(str.strip, lines)

# server options; do not change
host = 'datastream.ilykei.com'      
port = 30019
stream_name = 'AirTweet'
catch_handler_errors = True  # we recommend using TRUE during the test and FALSE during preparation

# make connection with your personal handlers
result = connect(host, port, login, password, stream_name,
                 tweet_handler, catch_handler_errors)

Connecting to datastream.ilykei.com:30019
Sending login message
Logged in successfully as  samuelswain2023@u.northwestern.edu
0 @AmericanAir can I DM you info?
probs=[0.6790637267872803, 0.3069015772997881, 0.014034695912933248]
1 @united u Cancelled Flighted my flight from IAD to JAX. Was supposed to use plane from BNA but u used that plane for another destination instead. 1/2
probs=[0.990016673142036, 0.00998227395500848, 1.0529029530790213e-06]
2 @united has once again earned a place as the worst airline in the business
probs=[0.9990209099989454, 0.0007452479221496575, 0.00023384207890638346]
3 @jetblue thanks
probs=[0.6522716941313366, 0.14727615085724804, 0.2004521550114162]
4 @SouthwestAir: #VIPLiveintheVieyard - first time we tried to redeem pts for *anything*, it really did not go well. #disappointed
probs=[0.9984974821979196, 0.0014773698263571865, 2.5147975723277468e-05]
5 @AmericanAir Thank you for being so responsive on Twitter. Truly impressive.
probs=[0.020175063986686587

Check the result

In [10]:
result

{'problems': [],
 'n_signals': 500,
 'penalty': 0.7775988497017884,
 'missed_id': [],
 'score': 91}

In [11]:
# remove empty values from buffers
tweets_df = tweets_df.head(tweet_counter)
tweets_df

|     | time | tweet_id | text | prob_neg | prob_neutral | prob_positive |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 2023-10-14 18:13:03.259260 | 0 | @AmericanAir can I DM you info? | 0.679064 | 0.306902 | 0.014035 |
| 1 | 2023-10-14 18:13:04.441764 | 1 | @united u Cancelled Flighted my flight from IA... | 0.990017 | 0.009982 | 0.000001 |
| 2 | 2023-10-14 18:13:06.776018 | 2 | @united has once again earned a place as the w... | 0.999021 | 0.000745 | 0.000234 |
| 3 | 2023-10-14 18:13:07.404867 | 3 | @jetblue thanks | 0.652272 | 0.147276 | 0.200452 |
| 4 | 2023-10-14 18:13:09.475422 | 4 | @SouthwestAir: #VIPLiveintheVieyard - first ti... | 0.998497 | 0.001477 | 0.000025 |
| ... | ... | ... | ... | ... | ... | ... |
| 495 | 2023-10-14 18:21:06.439764 | 495 | @SouthwestAir live in Atlanta but cant enroll ... | 0.194262 | 0.800462 | 0.005276 |
| 496 | 2023-10-14 18:21:07.058681 | 496 | @united Hey, thanks again for helping me miss ... | 0.424402 | 0.004039 | 0.571559 |
| 497 | 2023-10-14 18:21:08.515750 | 497 | @AmericanAir I was flying from Ft Lauderdale F... | 0.07763 | 0.918104 | 0.004266 |
| 498 | 2023-10-14 18:21:08.975469 | 498 | @JetBlue 2 aisles of empty #evermoreroom seats... | 0.999135 | 0.000763 | 0.000102 |


Save the log.

In [12]:
# when the session is over, dump the data and results to analyze them later
with open('results.pkl', 'wb') as output_f:
    pickle.dump([tweets_df, result], output_f)

## Penalty Function    

The penalty for this project is the log loss, which measures the accuracy of the sentiment classification:
$$LogLoss=-\frac{1}{N} \sum_{i=1}^N \left( y_{i,neg} \log(p_{i,neg}) + y_{i,neut} \log(p_{i,neut}) + y_{i,pos} \log(p_{i,pos}) \right),$$
where $y_{i,c}=1$ when the tweet belongs to class $c$, and 0 otherwise; $p_{i,c}$ are predicted probabilities of classes. 
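
The formula above can be evaluated directly. Here is a minimal sketch with made-up labels and probabilities; the function name `log_loss_3class` is hypothetical, and clipping probabilities at `eps` is an assumption added to avoid `log(0)`, not something the course materials specify for the server-side penalty.

```python
import math

def log_loss_3class(y_true, probs, labels=(-1, 0, 1), eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    Because y_{i,c} is one-hot, only the true-class term of each
    inner sum in the formula is non-zero.
    """
    total = 0.0
    for label, p in zip(y_true, probs):
        p_true = p[labels.index(label)]
        # Assumption: clip to [eps, 1] so that log(0) never occurs.
        p_true = min(max(p_true, eps), 1.0)
        total -= math.log(p_true)
    return total / len(y_true)

# Hypothetical true sentiments and predicted [p_neg, p_neut, p_pos] vectors.
y_true = [-1, 0, 1, -1]
probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
    [0.3, 0.3, 0.4],
]

loss = log_loss_3class(y_true, probs)
print(round(loss, 4))
```

A useful baseline: predicting the uniform vector `[1/3, 1/3, 1/3]` for every tweet gives a log loss of $\log 3 \approx 1.0986$, so a useful model should score below that value.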