In [101]:
import warnings
warnings.filterwarnings("ignore")

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import re
import string
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn import metrics

Read the data set from the CSV file and print the first five rows to see what it looks like

In [102]:
df = pd.read_csv('Tweets.csv')
df.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


Check for missing data and see what data types are available

In [103]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14640 entries, 0 to 14639
Data columns (total 15 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   tweet_id                      14640 non-null  int64  
 1   airline_sentiment             14640 non-null  object 
 2   airline_sentiment_confidence  14640 non-null  float64
 3   negativereason                9178 non-null   object 
 4   negativereason_confidence     10522 non-null  float64
 5   airline                       14640 non-null  object 
 6   airline_sentiment_gold        40 non-null     object 
 7   name                          14640 non-null  object 
 8   negativereason_gold           32 non-null     object 
 9   retweet_count                 14640 non-null  int64  
 10  text                          14640 non-null  object 
 11  tweet_coord                   1019 non-null   object 
 12  tweet_created                 14640 non-null  object 
 13  t

Remove tweets that have less than 100% confidence in the provided sentiment. We made this decision to help model accuracy: training on low-confidence labels could teach the model to predict incorrectly.

In [104]:
# keep only tweets whose sentiment label has full (1.0) confidence
df = df[df["airline_sentiment_confidence"] >= 1].copy()

# number of tweets that remain
print(len(df))

10445


See how many of the remaining tweets are positive, neutral, and negative

In [106]:
df.airline_sentiment.value_counts()

negative    7382
neutral     1548
positive    1515
Name: airline_sentiment, dtype: int64
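
To put these counts in perspective, the class shares can be computed directly from the numbers printed above. This is a small illustrative sketch; the names counts, shares, and majority_baseline are ours and not part of the notebook. It also gives the accuracy a model would reach on this filtered data by always predicting the largest class, which is a useful reference point for the scores reported later.

```python
# Counts copied from the value_counts() output above.
counts = {"negative": 7382, "neutral": 1548, "positive": 1515}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label:<9}{share:.1%}")

# Accuracy of a trivial model that always predicts the most common class.
majority_baseline = max(shares.values())
print(f"majority-class baseline: {majority_baseline:.1%}")
```

Roughly 71% of the remaining tweets are negative, so the classes are clearly imbalanced.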

Set X to tweets and y to sentiment

In [107]:
X = df.text
y = df.airline_sentiment

It is clear the data needs to be cleaned. We opted to do this manually because the steps are fairly simple and it can be hard to say exactly what the available data-cleaning tools are doing. The code below iterates through each tweet in the data set. First, retweet markers, links, and emojis are removed. Next, the string is split into individual words and all punctuation is stripped from each word. Then all characters are converted to lowercase. Finally, the words are joined back into a single string.

We mostly knew what needed to be done here, but the following resource was used to help figure out the correct syntax in Python: https://machinelearningmastery.com/clean-text-machine-learning-python/

In [108]:
table = str.maketrans('', '', string.punctuation)

#remove emojis
emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           "]+", flags=re.UNICODE)

for i, value in X.items():
    value = re.sub(r'RT\s@[A-Za-z]+[A-Za-z0-9_-]+', '', value)  # remove re-tweet markers
    value = re.sub(r'http\S+', '', value)   # remove http links
    value = re.sub(r'bit\.ly/\S+', '', value)  # remove bitly links
    value = emoji_pattern.sub('', value)  # remove emojis
    value = value.split()
    value = [w.translate(table) for w in value]  # strip punctuation from each word
    value = [w.lower() for w in value]
    X[i] = ' '.join(value)
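
The cleaning steps can also be sketched as a single function and tried on one string at a time. The helper name clean_tweet, the constant names, and the sample tweet are hypothetical; the regular expressions and emoji ranges mirror the cell above.

```python
import re
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

# Same emoji ranges as the notebook cell above.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags
    "]+"
)


def clean_tweet(text):
    """Apply the cleaning steps to a single tweet (hypothetical helper)."""
    text = re.sub(r'RT\s@[A-Za-z]+[A-Za-z0-9_-]+', '', text)  # retweet markers
    text = re.sub(r'http\S+', '', text)       # http/https links
    text = re.sub(r'bit\.ly/\S+', '', text)   # bitly links
    text = EMOJI_PATTERN.sub('', text)        # emojis
    words = [w.translate(PUNCT_TABLE).lower() for w in text.split()]
    return ' '.join(words)


sample = "@VirginAmerica it's GREAT! http://t.co/abc"  # made-up tweet
print(clean_tweet(sample))  # virginamerica its great
```

A token made up only of punctuation becomes an empty string, so the joined result can contain a double space. With a function like this, the whole column could be cleaned with X.apply(clean_tweet) instead of assigning element by element.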

Split the data into train and test sets. Only 10,445 tweets remain, so we opted for a 70% training set and a 30% test set. We also set the random_state parameter so our results can be reproduced.

In [109]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2)

Print the number of positive, negative, and neutral tweets in train and test sets.

In [110]:
print(y_train.value_counts())
print('#########')
print(y_test.value_counts())

negative    5173
neutral     1082
positive    1056
Name: airline_sentiment, dtype: int64
#########
negative    2209
neutral      466
positive     459
Name: airline_sentiment, dtype: int64
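
The split above is random, so the class proportions in the two sets match only approximately. If exact proportions matter, train_test_split accepts a stratify argument. The notebook does not use it, so the following is an optional variation shown on made-up labels rather than the tweets.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy labels with a 70/20/10 imbalance (made-up data, not the tweets).
labels = ["negative"] * 70 + ["neutral"] * 20 + ["positive"] * 10
texts = [f"tweet {i}" for i in range(len(labels))]

# stratify=labels keeps each class's share the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.3, random_state=2, stratify=labels
)

print(Counter(y_tr))
print(Counter(y_te))
```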


Print the training set tweets to ensure the tweets have been cleaned before vectorizing and setting up tfidf

In [111]:
X_train

1568     united deceptive marketing practices promised ...
8331     jetblue pooling and gifting are completely dif...
10552    usairways what happens if the flight takes off...
1629     united everytime i fly ur airline i hate you e...
10834    usairways call volumes are high so the best an...
                               ...                        
1578     united is it possible to add a known traveler ...
3532     united unavailable leg that registered hours a...
9695     usairways americanair have the most rude unrel...
3626     united omg where is my bag yyzua70435  enough ...
10547    usairways enormous lines at customer service a...
Name: text, Length: 7311, dtype: object

TfidfVectorizer combines the count vectorizer and TF-IDF steps for simplicity. We limit the vocabulary to the 500 most frequent terms and remove English stop words. We fit the vectorizer on the training tweets only, then use it to transform both the training and test tweets.

In [112]:
tfidf_vectorizer = TfidfVectorizer(max_features=500, stop_words='english')

X_train = tfidf_vectorizer.fit_transform(X_train)
X_test  = tfidf_vectorizer.transform(X_test)
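
To illustrate why the vectorizer is fitted on the training tweets only, here is a minimal sketch on a made-up corpus. The tweets and variable names are hypothetical. The point is that transform reuses the vocabulary learned during fitting, so a word that appears only in the test data gets no column.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up, already-cleaned tweets standing in for the real data.
train_tweets = [
    "united flight delayed again",
    "jetblue great flight crew",
    "usairways lost my bag",
]
test_tweets = ["united lost my luggage"]

vectorizer = TfidfVectorizer(stop_words="english")
train_matrix = vectorizer.fit_transform(train_tweets)  # learns vocabulary + IDF
test_matrix = vectorizer.transform(test_tweets)        # reuses them unchanged

print(sorted(vectorizer.vocabulary_))
print(train_matrix.shape, test_matrix.shape)
print("luggage" in vectorizer.vocabulary_)  # word seen only in the test tweet
```

Both matrices have the same number of columns, which is what lets the model trained on one be applied to the other.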

Create a logistic regression model with 5-fold cross-validation for classification. Fit the model with the training data.

In [113]:
LRmodel = LogisticRegressionCV(cv=5, max_iter = 2500, n_jobs=-1)
LRmodel.fit(X_train, y_train)

LogisticRegressionCV(cv=5, max_iter=2500, n_jobs=-1)

Lastly, we test the model on the training and test sets, printing accuracy, precision, recall, and F1 scores as well as a confusion matrix. The model is likely not badly overfitted because the training and test accuracy scores (about 88% and 83%) are not wildly different. The training score is expected to be somewhat higher because the model was fitted on the training data. Overall, the model correctly predicts 83% of the test tweets.

Further inspection with precision, recall, and F1 shows the model performs best on the negative tweets, which makes sense because there are far more negative tweets than positive or neutral tweets. Precision (true positives / predicted positives) is 0.86 for negative tweets and 0.80 for positive tweets, but much lower, 0.62, for neutral tweets. This is harder to explain, but there seems to be a fine line between neutral and positive or negative, and neutral tweets have limited training data, so it follows that this would be the hardest category for the model to predict.

Recall (true positives / actual positives) is very good for negative tweets (0.94), steps down for positive tweets (0.65), and drops further for neutral tweets (0.46). Recall is probably the best measure here if the airline intends to follow up with disgruntled customers or make changes based on negative feedback. Fortunately, the class of interest is the negative tweets, which have a 94% recall score, so our model is quite effective at capturing the tweets that are actually negative. The confusion matrix was added to help visualize precision and recall across the classes.

We used all of this in assignments throughout the semester, so no resources were needed for this section.

In [114]:
# make predictions on training and test data
predicted_train = LRmodel.predict(X_train)
predicted_test = LRmodel.predict(X_test)

# accuracy is the fraction of predictions that match the true labels
LRmodel_train_accuracy = np.mean(predicted_train == y_train)
print("Training accuracy: ", LRmodel_train_accuracy)
LRmodel_test_accuracy = np.mean(predicted_test == y_test)
print("Test accuracy: ", LRmodel_test_accuracy)
print()

# print precision and recall statistics
print(metrics.classification_report(y_test, predicted_test))

# print confusion matrix
print("Confusion Matrix:\n")
pd.DataFrame(
    metrics.confusion_matrix(y_test, predicted_test),
    index=['actual:negative', 'actual:neutral', 'actual:positive'], 
    columns=['pred:negative', 'pred:neutral', 'pred:positive']
)


Training accuracy:  0.8756668034468609
Test accuracy:  0.8270580727504786

              precision    recall  f1-score   support

    negative       0.86      0.94      0.90      2209
     neutral       0.62      0.46      0.53       466
    positive       0.80      0.65      0.72       459

    accuracy                           0.83      3134
   macro avg       0.76      0.68      0.72      3134
weighted avg       0.82      0.83      0.82      3134

Confusion Matrix:



                 pred:negative  pred:neutral  pred:positive
actual:negative           2080            88             41
actual:neutral             219           215             32
actual:positive            121            41            297
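
The precision, recall, and accuracy figures in the report can be recomputed by hand from the confusion matrix, which makes the definitions concrete. This is an illustrative sketch in plain Python; the variable names are ours, and the matrix values are copied from the output above.

```python
labels = ["negative", "neutral", "positive"]

# Rows are actual classes, columns are predicted classes (copied from above).
cm = [
    [2080, 88, 41],
    [219, 215, 32],
    [121, 41, 297],
]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(3)) / total  # diagonal = correct predictions

precision, recall = {}, {}
for i, label in enumerate(labels):
    predicted_as_label = sum(cm[r][i] for r in range(3))  # column sum
    actually_label = sum(cm[i])                           # row sum
    precision[label] = cm[i][i] / predicted_as_label
    recall[label] = cm[i][i] / actually_label
    print(f"{label:<9} precision={precision[label]:.2f} recall={recall[label]:.2f}")

print(f"accuracy={accuracy:.4f}")
```

For example, negative recall is 2080 / 2209, the share of actually negative tweets that the model labels as negative. The first row also shows that 2209 of the 3134 test tweets are negative, so always predicting negative would already score about 70% accuracy.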
