# Sentiment analysis of Tweets

The directory `data/Tweets` contains a dataset of tweets by airlines customers. Tweets can contain positive (Label=1) or negative (Label=0) comments. 

The `train.csv` contains annotated data, while `test.csv` includes only the textual data.

Your task is to train a classifier to predict the class of the tweets in `test.csv`, you should generate a comma separated file `tweet_results.csv` with a **single column** named `Label` containing the predicted class of the corresponding tweet.

The test data was originally annotated and the accuracy of your prediction will be evaluated w.r.t. this ground truth.

This notebook should contain all the description of your experiments, the code to generate the classifier **and** the result file, as well as the rationale for your choices. If you prefer to split your work in different source files and/or notebooks, then use this notebook as a guide to the rest of your submitted material.

# Add your code below

***


#### Libraries

In [1]:
# pip install wordsegment
# pip install gensim

# Basics
import collections, itertools, joblib, re
import gensim as gs
import pandas as pd

from nltk.stem import WordNetLemmatizer
from gensim.corpora import Dictionary
from wordsegment import load, segment
from nltk.tokenize.treebank import TreebankWordDetokenizer as Detok
from nltk.tokenize import word_tokenize, TweetTokenizer

# Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer

# Models and Evaluation
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report, plot_confusion_matrix

# pip install imblearn
from imblearn.pipeline import make_pipeline as make_pipeline_with_sampler
from imblearn.under_sampling import RandomUnderSampler

# Files
import json
# Text
import nltk
nltk.download('omw-1.4')


[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\samue\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

#### Reading files

In [2]:
# Reading data for training and testing and generate dataframes

tweets_train_path = r'.\data\Tweets\train.csv'
tweets_test_path = r'.\data\Tweets\test.csv'

tweets_train_df = pd.read_csv(tweets_train_path, header=0)
tweets_test_df = pd.read_csv(tweets_test_path, header=0)

print('Number of tweets\nTrain:', len(
    tweets_train_df), 'Test:', len(tweets_test_df))
print('\nTweets for training')
tweets_train_df.head(10)


Number of tweets
Train: 8007 Test: 890

Tweets for training


Unnamed: 0,Label,text
0,1,@AmericanAir well Done all of you xx
1,0,@united is the worst airline. Lost my luggage ...
2,0,@united Why don't you respond to my e-mails of...
3,0,@AmericanAir our flight AA 1338 is delayed out...
4,0,@virginAmerica Other carriers are less than ha...
5,0,@USAirways you guys have my luggage in San Jos...
6,0,@USAirways #usairways lost a passenger today f...
7,0,@united OMG!!! you just bumped me off the last...
8,1,@SouthwestAir Finally! Integration w/ passbook...
9,0,@AmericanAir stranded in Miami because your au...


#### Preprocessing

1. Cleaning
2. Tokenize
3. Split words that are together (e.g. *datascience -> data science*)
4. Removing stop words 
5. Omitting terms with length lower than 3

In [3]:
def cleaning(tweet):
    '''
    1. Remove HTTP/HTTPS and URLs
    2. Remove mentions
    3. Remove hashtag symbols
    4. Split words 
    5. Remove digits
    6. Convert text to lower case
'''

    # to remove links that start with HTTP/HTTPS in the tweet
    tweet = re.sub(r"http\S+", "", tweet)

    # to remove other url links
    tweet = re.sub(r'[-a-zA-Z0–9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0–9@:%_\+.~#?&//=]*)', ' ',
                   tweet, flags=re.MULTILINE)

    # remove mentions
    tweet = re.sub(r"@\S+", "", tweet)

    # remove hashtags symbols
    tweet = re.sub(r"#", "", tweet)

    # to split words that are together
    tweet = ''.join([a for a in re.split('([A-Z][a-z]+)', tweet) if a])

    # to remove digits
    tweet = re.sub(r"\d", "", tweet)

    # to lower the characters in a text
    tweet = tweet.lower()

    return tweet


In [4]:
def createTokens(tweet):
    '''
    Given a tweet as input, it returns list with all tokens
    '''

    tknzr = TweetTokenizer()

    return tknzr.tokenize(tweet)


In [5]:
def splitWordsLemma(tokens):
    '''
    Given a list of tokens it divides words that are together (e.g. datascience -> data science)
    and lemmatize words: "rocks" -> "rock", "corpora" -> corpus
    '''
    lemmatizer = WordNetLemmatizer()

    splitted = []

    for term in tokens:
        term = ' '.join(segment(term))
        splitted.append(term)

    tknzr = TweetTokenizer()
    splitted_tokens = [tknzr.tokenize(st) for st in splitted]
    tokens = []

    for doc_tokens in splitted_tokens:
        for word in doc_tokens:
            word_lemma = lemmatizer.lemmatize(word)
            tokens.append(word_lemma)

    return tokens


In [6]:
def removeStopWords(terms):
    '''
    Given as input a list of terms and the number of stop words to be removed, 
    it gives a list of terms removing those stop words.
    '''
    terms_no_stopwords = []
    stop_words = nltk.corpus.stopwords.words('english')

    for term in terms:
        if term not in stop_words:
            terms_no_stopwords.append(term)

    return terms_no_stopwords


In [7]:
def omittingShortTerms(terms):
    '''
    if the words are short this will be omitted from the text.
    '''

    # the stemmer requires a language parameter

    terms_kept = []

    for term in terms:
        if len(term) >= 3:
            terms_kept.append(term)

    return terms_kept


In [8]:
def preprocessing(tweet):
    '''
    given a document it preprocess it following these steps:
        - tokenize
        - split words
        - lemmatizer
        - remove stop words
        - omitting short terms
    '''
    final = []

    # Cleaning
    cleaned_tweet = cleaning(tweet)

    # Tokenize
    tokens = createTokens(cleaned_tweet)

    # Split words and lemmatize
    split = splitWordsLemma(tokens)

    # remove stop words
    no_sw = removeStopWords(split)

    # omitting terms with lenght = 1
    final = omittingShortTerms(no_sw)

    return final


In [9]:
# Preprocessing
load()
terms_train = [preprocessing(x) for x in tweets_train_df['text']]
terms_test = [preprocessing(x) for x in tweets_test_df['text']]


#### Saving files

In [10]:
# Save files, in case it is necessary
with open("vocabulary/vocabulary_train_tweeter", "w") as fp:
    json.dump(terms_train, fp)

with open("vocabulary/vocabulary_test_tweeter", "w") as fp:
    json.dump(terms_test, fp)


#### Loading files and summmaryzing the information

In [11]:
with open("vocabulary/vocabulary_train_tweeter", "r") as fp:
    terms_train = json.load(fp)

with open("vocabulary/vocabulary_test_tweeter", "r") as fp:
    terms_test = json.load(fp)

detokenizer = Detok()
sentences_train = [detokenizer.detokenize(
    terms_doc) for terms_doc in terms_train]
sentences_test = [detokenizer.detokenize(
    terms_doc) for terms_doc in terms_test]

dct_train = Dictionary(terms_train)
dct_test = Dictionary(terms_test)
print('Train/Val - We have '+str(dct_train.num_docs)+' documents and ' +
      str(len(dct_train.token2id))+' terms in our collection.')
print('Test - We have '+str(dct_test.num_docs)+' documents and ' +
      str(len(dct_test.token2id))+' terms in our collection.')


Train/Val - We have 8007 documents and 6325 terms in our collection.
Test - We have 890 documents and 2064 terms in our collection.


In [12]:
# List of all words across tweets for training
words_in_tweets = list(itertools.chain(*terms_train))
# Create counter
words_freq = collections.Counter(words_in_tweets)
words_freq.most_common(5)


[('flight', 3441),
 ('hour', 947),
 ('service', 880),
 ('customer', 819),
 ('get', 813)]

#### Document-term matrix

In [13]:
vectorizer = CountVectorizer()
doc_term_train = vectorizer.fit_transform(sentences_train) #fit and transform
doc_term_test = vectorizer.transform(sentences_test) #test set only transform
print('Document-term matrix_shape:\nTrain:',doc_term_train.shape,' Test:', doc_term_test.shape)

Document-term matrix_shape:
Train: (8007, 6325)  Test: (890, 6325)


#### Obtain the dataset for training, validating and testing

In [14]:
X_train, X_test = doc_term_train, doc_term_test
y_train = tweets_train_df['Label']

print('Number of rows/Labels\nTrain:', X_train.shape[0], '=', len(y_train))

Number of rows/Labels
Train: 8007 = 8007


In [15]:
# These data will be used in cases when it is necessary to adjust the hypermarameters
X_train_75, X_val_25, y_train_75, y_val_25 = train_test_split(X_train, y_train, test_size=0.25, random_state=42)

print('Number of rows/Labels\nTrain:', (X_train_75.shape[0]), '=', len(y_train_75))
print('Val:', (X_val_25.shape[0]), '=', len(y_val_25))

Number of rows/Labels
Train: 6005 = 6005
Val: 2002 = 2002


#### First Classifier - KNeighborsClassifier

In [139]:
# Choose the classifier
base_estimator = KNeighborsClassifier(algorithm='brute')

param_grid = {'n_neighbors': [1, 3, 5, 7, 10]}
clf = GridSearchCV(base_estimator, param_grid, cv=5)

%time clf.fit(X_train_75, y_train_75)

pd.concat([pd.DataFrame(clf.cv_results_["params"]), pd.DataFrame(
    clf.cv_results_["mean_test_score"], columns=["Accuracy"])], axis=1)

Wall time: 8.17 s


Unnamed: 0,n_neighbors,Accuracy
0,1,0.835803
1,3,0.869942
2,5,0.869276
3,7,0.858618
4,10,0.869942


In [140]:
predicted_label = clf.predict(X_val_25)

In [141]:
confusion_mat = confusion_matrix(y_val_25, predicted_label)
class_report = classification_report(y_val_25, predicted_label)

print(clf.best_estimator_)
print("\n\nConfusion Matrix:\n")
print(confusion_mat)
print("\nClassification Report:\n")
print(class_report)

KNeighborsClassifier(algorithm='brute', n_neighbors=3)


Confusion Matrix:

[[1562   99]
 [ 119  222]]

Classification Report:

              precision    recall  f1-score   support

           0       0.93      0.94      0.93      1661
           1       0.69      0.65      0.67       341

    accuracy                           0.89      2002
   macro avg       0.81      0.80      0.80      2002
weighted avg       0.89      0.89      0.89      2002



The classes of this model are imbalanced, the model tends to learn the most common class (class 0) and it doesn’t abstract information from the other class.

Proposed solutions for imbalanced data:

- Stratified train-test split
- Stratified k-fold cross-validation
- Metric to be included 'balanced accuracy'
- Classifiers + Class weight
- Re-sampling techniques
- Techniques of bagging and boosting

#### Working with Unbalanced data

A dictionary will be used to store the results of the models with 2 scores: accuracy and balanced accuracy

In [206]:
# List and dictionary to store the results of our analysis
index = []
scores = {"Accuracy": [], "Balanced accuracy": []}
scoring = ["accuracy", "balanced_accuracy"]


**Basic classifiers**

Models to be analysed: Logistic Regression, Decision Forest and Decision Tree

In [226]:
lr_clf = make_pipeline(StandardScaler(with_mean=False), LogisticRegression(max_iter=1000))
rf_clf = make_pipeline(RandomForestClassifier(random_state=42))
dt_clf = make_pipeline(DecisionTreeClassifier(random_state=42))

for model_name, model in zip(['Logistic Regression', 'Random Forest', 'Decision Tree'], [lr_clf, rf_clf, dt_clf]):
    # Cross_validate: for int/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used.
    cv_result = cross_validate(model, X_train, y_train, scoring=scoring)
    index.append(model_name)
    scores["Accuracy"].append(round(cv_result["test_accuracy"].mean(), 2))
    scores["Balanced accuracy"].append(
        round(cv_result["test_balanced_accuracy"].mean(), 2))
    df_scores = pd.DataFrame(scores, index=index)
    joblib.dump(model, ("models\_" + model_name.replace(" ", "") + "_classifier.joblib"))

df_scores


Unnamed: 0,Accuracy,Balanced accuracy
Logistic Regression,0.93,0.87
Random Forest,0.93,0.87
Decision Tree,0.89,0.83


**Classifiers + Class weight**

Most of the models in scikit-learn have a parameter class_weight. This parameter will affect the computation of the loss in linear model or the criterion in the tree-based model to penalize differently a false classification from the minority and majority class. Therefore, it will be set class_weight="balanced" such that the weight applied is inversely proportional to the class frequency. 

In [227]:
lr_clf.set_params(logisticregression__class_weight="balanced")
rf_clf.set_params(randomforestclassifier__class_weight="balanced")
dt_clf.set_params(decisiontreeclassifier__class_weight="balanced")

for model_name, model in zip(['Logistic regression with balanced class weights', 'Random forest with balanced class weights', 'Decision tree with balanced class weights'], [lr_clf, rf_clf, dt_clf]):
    # Cross_validate: for int/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used.
    cv_result = cross_validate(model, X_train, y_train, scoring=scoring)
    index.append(model_name)
    scores["Accuracy"].append(round(cv_result["test_accuracy"].mean(), 2))
    scores["Balanced accuracy"].append(
        round(cv_result["test_balanced_accuracy"].mean(), 2))
    df_scores = pd.DataFrame(scores, index=index)
    joblib.dump(model, ("models\_" + model_name.replace(" ", "") + "_w_classifier.joblib"))
df_scores


Unnamed: 0,Accuracy,Balanced accuracy
Logistic Regression,0.93,0.87
Random Forest,0.93,0.87
Decision Tree,0.89,0.83
Logistic regression with balanced class weights,0.92,0.88
Random forest with balanced class weights,0.93,0.86
Decision tree with balanced class weights,0.87,0.84


**Resample the training set during learning** 

One way of addressing imbalanced data is by re-sampling the dataset as to offset this imbalance.   
It consists of removing samples from the majority class (under-sampling) or adding more examples from the minority class (over-sampling).   
However, the implementation of over-sampling duplicates random records from the minority class, which can cause overfitting.
Using under-sampling, removing random records from the majority class, wich can cause loss of information.   

In [229]:
lr_clf_u = make_pipeline_with_sampler(RandomUnderSampler(
    random_state=42), LogisticRegression(max_iter=1000),)
rf_clf_u = make_pipeline_with_sampler(RandomUnderSampler(
    random_state=42), RandomForestClassifier(random_state=42, n_jobs=2),)
dt_clf_u = make_pipeline_with_sampler(RandomUnderSampler(
    random_state=42), DecisionTreeClassifier(random_state=42))

for model_name, model in zip(['Under-sampling + Logistic regression', 'Under-sampling + Random forest', 'Under-sampling + Decision Tree'], [lr_clf_u, rf_clf_u, dt_clf_u]):
    # Cross_validate: for int/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used.
    cv_result = cross_validate(model, X_train, y_train, scoring=scoring)
    index.append(model_name)
    scores["Accuracy"].append(round(cv_result["test_accuracy"].mean(), 2))
    scores["Balanced accuracy"].append(
        round(cv_result["test_balanced_accuracy"].mean(), 2))
    df_scores = pd.DataFrame(scores, index=index)
    joblib.dump(model, ("models\_" + model_name.replace(" ", "") + "_u_classifier.joblib"))
df_scores

Unnamed: 0,Accuracy,Balanced accuracy
Logistic Regression,0.93,0.87
Random Forest,0.93,0.87
Decision Tree,0.89,0.83
Logistic regression with balanced class weights,0.92,0.88
Random forest with balanced class weights,0.93,0.86
Decision tree with balanced class weights,0.87,0.84
Under-sampling + Logistic regression,0.91,0.91
Under-sampling + Random forest,0.88,0.89
Under-sampling + Decision Tree,0.8,0.83


**Use of specific balanced algorithms from imbalanced-learn, bagging and boosting**

Tree models of bagging and bossting will be used:

- Balanced random forest: randomly under-samples each bootstrap sample to balance it.
- Histogram-based Gradient Boosting Classification Tree: similar to gradient boosting trees, reduces the number of splitting points to consider, faster than Gradient Boosting Classifier
- BalancedBaggingClassifier:  Bagging, it includes an additional step to balance the training set at fit time using a given sampler.


In [230]:
from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier
from imblearn.ensemble import BalancedBaggingClassifier

rf_clf_b = make_pipeline(BalancedRandomForestClassifier(random_state=42, n_jobs=2),)
bag_clf = make_pipeline(BalancedBaggingClassifier(base_estimator=HistGradientBoostingClassifier(
    random_state=42), n_estimators=10, random_state=42, n_jobs=2,),)

for model_name, model in zip(['Balanced random forest', 'Balanced bag of histogram gradient boosting', 'Balanced bag of histogram gradient boosting'], [rf_clf_b, bag_clf]):
    # Cross_validate: for int/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used.
    cv_result = cross_validate(model, X_train.toarray(), y_train, scoring=scoring)
    index.append(model_name)
    scores["Accuracy"].append(round(cv_result["test_accuracy"].mean(), 2))
    scores["Balanced accuracy"].append(
        round(cv_result["test_balanced_accuracy"].mean(), 2))
    df_scores = pd.DataFrame(scores, index=index)
    joblib.dump(model, ("models\_" + model_name.replace(" ", "") + "_classifier.joblib"))
df_scores


Unnamed: 0,Accuracy,Balanced accuracy
Logistic Regression,0.93,0.87
Random Forest,0.93,0.87
Decision Tree,0.89,0.83
Logistic regression with balanced class weights,0.92,0.88
Random forest with balanced class weights,0.93,0.86
Decision tree with balanced class weights,0.87,0.84
Under-sampling + Logistic regression,0.91,0.91
Under-sampling + Random forest,0.88,0.89
Under-sampling + Decision Tree,0.8,0.83
Balanced random forest,0.88,0.9


#### Selecting the best model

In [253]:
rf = joblib.load('models\_RandomForest_classifier.joblib')
rf_bal = joblib.load('models\_Balancedrandomforest_classifier.joblib')
lr_u =  joblib.load('models\_Under-sampling+Logisticregression_u_classifier.joblib')


In [254]:
rf.fit(X_train, y_train)
rf_bal.fit(X_train, y_train)
lr_u.fit(X_train, y_train)

Pipeline(steps=[('randomundersampler', RandomUnderSampler(random_state=42)),
                ('logisticregression', LogisticRegression(max_iter=1000))])

#### Predicting and saving results

In [257]:
y_predict_rf = rf.predict(X_test)
y_predict_rf_bal = rf_bal.predict(X_test)
y_predict_lr_u = lr_u.predict(X_test)
print (X_test.shape, len (y_predict_rf), len(y_predict_rf_bal), len (y_predict_lr_u ))

(890, 6325) 890 890 890


In [258]:
df_results_rf = pd.DataFrame()
df_results_rf['Label'] = y_predict_rf
df_results_rf.to_csv('tweet_results_rf.csv', index=False)

df_results_rf_bal = pd.DataFrame()
df_results_rf_bal['Label'] = y_predict_rf_bal
df_results_rf_bal.to_csv('tweet_results_rf_bal.csv', index=False)

df_results_lr_u = pd.DataFrame()
df_results_lr_u['Label'] = y_predict_lr_u
df_results_lr_u.to_csv('tweet_results_lr_u.csv', index=False)

Based on the results, We selected basic Random Forest as the mains solution because undersampling can remove important information from the data.  