
The Twitter US Airline Sentiment dataset was scraped in February 2015 and comes from Crowdflower’s Data for Everyone library. Tweets are classified as either “positive”, “neutral”, or “negative”. Other information is also provided, such as the reason for a negative classification and the airline the tweet refers to.

This notebook consists of three parts:
1. Data exploration
2. Feature engineering
3. Prediction models


In [None]:
#import needed libraries
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt

from nltk.corpus import stopwords
import string
from collections import Counter 
from wordcloud import WordCloud,STOPWORDS


from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from mlxtend.plotting import plot_confusion_matrix

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

from tensorflow.keras.layers import Embedding
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import LSTM

In [None]:
#list the files available in the Kaggle input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
#read data 
data= pd.read_csv("/kaggle/input/twitter-airline-sentiment/Tweets.csv")


### Explore Data

In [None]:
#let's see samples of the data 
data.head()

In [None]:
#data size
data.shape


There are 14,640 rows and 15 columns. The features are: tweet_id, airline_sentiment, airline_sentiment_confidence, negativereason, negativereason_confidence, airline, airline_sentiment_gold, name, negativereason_gold, retweet_count, text, tweet_coord, tweet_created, tweet_location, and user_timezone.


Using the info() method, we can explore each feature's data type and check whether any data is missing.

In [None]:
data.info()

We know that the sentiment takes one of three values (positive, neutral, or negative). In the next cell, we explore how they are distributed in the data. It's clear that "negative" tweets dominate the data. But why? This may be due to the human tendency to complain when there is an issue but to stay quiet when everything is going fine.

In [None]:
data['airline_sentiment'].value_counts()
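
To put numbers on the imbalance, the counts can also be expressed as shares. The following is a minimal sketch on a small made-up frame; the `toy` data is hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy labels standing in for data['airline_sentiment'].
toy = pd.DataFrame(
    {"airline_sentiment": ["negative"] * 6 + ["neutral"] * 3 + ["positive"] * 1}
)

# normalize=True returns fractions of the total instead of raw counts.
shares = toy["airline_sentiment"].value_counts(normalize=True)
print(shares.round(2))
```

On the real data, the equivalent call would be `data['airline_sentiment'].value_counts(normalize=True)`.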

There is a feature called negativereason, which is filled in when the sentiment is negative and is NaN for the other sentiments. Let's look at the reasons given for negative tweets. The cell below shows 10 reasons; the most common are customer service issues, late flights, and "can't tell". The distribution of the reasons is shown in the figure below, sorted in descending order.

Note: the counts of all negative reasons sum to the number of negative tweets, so each negative tweet has a reason; no data is missing.

In [None]:
negativereason_values_count= data['negativereason'].value_counts()
print(negativereason_values_count)
print("all negative reasons:", negativereason_values_count.sum())


In [None]:
#plot negative reasons distribution
x =(data['negativereason']).value_counts().keys()
x_pos = np.arange(len(x))
y = (data['negativereason'].value_counts()).values


barlist = plt.bar(x_pos, y, align='center')
plt.xticks(x_pos, x,rotation=90)
plt.xlabel('Negative Reason')
plt.ylabel('Count')
plt.xlim(-1,len(x) )
plt.ylim(0,3000)

plt.show()

All the tweets are about US airlines, so let's explore these airlines first.
The following cell shows that there are 6 airlines, along with the number of tweets for each.

In [None]:
airline_values= data['airline'].value_counts()
print(airline_values)


Let's visualize it to make this clearer.

In [None]:
#plot airlines with tweets distribution
x =(data['airline']).value_counts().keys()
x_pos = np.arange(len(x))
y = (data['airline'].value_counts()).values


barlist = plt.bar(x_pos, y, align='center')
plt.xticks(x_pos, x,rotation=90)
plt.xlabel('Airline')
plt.ylabel('Count')
plt.xlim(-1,len(x) )
plt.ylim(0,4000)

plt.show()

Which airline received the most negative tweets? From the figure below, it's clear that the airline with the most tweets is also the airline with the most negative tweets.

In [None]:
#plot the number of negative tweets per airline
data_neg = data[data['airline_sentiment']=='negative']

x =(data_neg['airline']).value_counts().keys()
x_pos = np.arange(len(x))
y = (data_neg['airline'].value_counts()).values


barlist = plt.bar(x_pos, y, align='center')
plt.xticks(x_pos, x,rotation=45)
plt.xlabel('Airline')
plt.ylabel('Negative Count')
plt.xlim(-1,len(x) )
plt.ylim(0,3000)

plt.show()

What about positive tweets for each airline? The figure below shows that the airline with the most negative tweets does not necessarily have the fewest positive tweets.

In [None]:
#plot the number of positive tweets per airline
data_pos = data[data['airline_sentiment']=='positive']

x =(data_pos['airline']).value_counts().keys()
x_pos = np.arange(len(x))
y = (data_pos['airline'].value_counts()).values



barlist = plt.bar(x_pos, y, align='center')
plt.xticks(x_pos, x,rotation=45)
plt.xlabel('Airline')
plt.ylabel('Positive Count')
plt.xlim(-1,len(x) )
plt.ylim(0,700)

plt.show()

The same applies to neutral tweets: the number of neutral tweets an airline receives has no direct relationship with its number of negative or positive tweets.

In [None]:
#plot the number of neutral tweets per airline
data_neut= data[data['airline_sentiment']=='neutral']

x =(data_neut['airline']).value_counts().keys()
x_pos = np.arange(len(x))
y = (data_neut['airline'].value_counts()).values



barlist = plt.bar(x_pos, y, align='center')
plt.xticks(x_pos, x,rotation=45)
plt.xlabel('Airline')
plt.ylabel('Neutral Count')
plt.xlim(-1,len(x) )
plt.ylim(0,800)

plt.show()

To sum up, let's look at the distribution of the three sentiments for each airline separately, as the following figures show. In general, most of each airline's tweets are negative, followed by neutral, with positive tweets the least common.

In [None]:
def plotAirlineSentiment(airline):
    data_air= data[data['airline']==airline]

    x =(data_air['airline_sentiment']).value_counts().keys()
    x_pos = np.arange(len(x))
    y = (data_air['airline_sentiment'].value_counts()).values



    barlist = plt.bar(x_pos, y, align='center')
    plt.xticks(x_pos, x)
    plt.xlabel('Sentiment')
    plt.ylabel('Sentiment Count')
    plt.xlim(-1,len(x) )
    plt.ylim(0, 3000)
    plt.title(airline)

In [None]:
plt.figure(1,figsize=(15, 15))
plt.subplot(231)
plotAirlineSentiment('United')
plt.subplot(232)
plotAirlineSentiment('US Airways')
plt.subplot(233)
plotAirlineSentiment('American')
plt.subplot(234)
plotAirlineSentiment('Southwest')
plt.subplot(235)
plotAirlineSentiment('Delta')
plt.subplot(236)
plotAirlineSentiment('Virgin America')
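
Because the six panels share a y-axis limit of 3000, airlines with few tweets are hard to compare by eye. One alternative view, sketched here on hypothetical toy rows rather than the real data, is a cross-tabulation of airline against sentiment, both as counts and as per-airline shares.

```python
import pandas as pd

# Hypothetical toy rows standing in for the 'airline' and 'airline_sentiment' columns.
toy = pd.DataFrame({
    "airline": ["United", "United", "United", "Delta", "Delta", "Delta"],
    "airline_sentiment": ["negative", "negative", "positive",
                          "negative", "neutral", "positive"],
})

# Raw counts: one row per airline, one column per sentiment.
counts = pd.crosstab(toy["airline"], toy["airline_sentiment"])

# normalize="index" divides each row by its total, giving per-airline shares.
shares = pd.crosstab(toy["airline"], toy["airline_sentiment"], normalize="index")

print(counts)
print(shares.round(2))
```

With the notebook's frame, the same calls would take `data['airline']` and `data['airline_sentiment']`.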

If we want to visualize the negative reasons for each airline, the following function helps. For example, for United, the most frequently mentioned reason for negative tweets is customer service.

In [None]:
def plotAirlineNegativeReason(airline):
    data_air= data[data['airline']==airline]

    x =(data_air['negativereason']).value_counts().keys()
    x_pos = np.arange(len(x))
    y = (data_air['negativereason'].value_counts()).values



    barlist = plt.bar(x_pos, y, align='center')
    plt.xticks(x_pos, x,rotation=90)
    plt.xlabel('Negative Reason')
    plt.ylabel('Tweets Count')
    plt.xlim(-1,len(x) )
    plt.ylim(0, 1000)
    plt.title(airline)

In [None]:
plotAirlineNegativeReason('United')


Now for the most important part of the data, the tweet texts. Let's see whether there is anything interesting about them.

In [None]:
#separate the tweets of each sentiment
tweets = data["text"]
neg_tweets = data[data["airline_sentiment"] =="negative"]["text"]
neut_tweets = data[data["airline_sentiment"] =="neutral"]["text"]
pos_tweets = data[data["airline_sentiment"] =="positive"]["text"]

In [None]:
stopwordslist = set(stopwords.words('english'))
airline_names =["united", "usairways", "americanair" ,"southwestair", "deltaair", "virginamericair" ,"flight"]
allunWantedwords= list(stopwordslist) + airline_names


In [None]:
# function to lowercase tweets and remove stop words, airline names, words starting with @, and punctuation
def pre_processData(data):
    new_data = []

    for line in data:
        # lowercase and drop @mentions
        words = [word for word in line.lower().split() if not word.startswith('@')]
        # drop stop words and airline names
        words = [word for word in words if word not in stopwordslist and word not in airline_names]
        # strip punctuation last, so a word with attached punctuation (e.g. "flight.") is not filtered above
        newline = ' '.join(words).translate(str.maketrans('', '', string.punctuation))
        new_data.append(newline)

    return new_data

To understand the tweet texts, the pre_processData function is created; it lowercases the text, strips punctuation, and removes stop words, airline names, and any word starting with the symbol @.
Airline names are removed because, regardless of the sentiment, the user who wrote the tweet will tag or mention the airline's official profile. Words starting with @ are removed because they are tags of other Twitter accounts.
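
To make the cleaning steps concrete, here is a small worked example. It is a sketch of a variant, not the notebook's function: it strips punctuation before filtering words, whereas pre_processData strips it last. The word sets `STOP_WORDS` and `AIRLINE_WORDS` and the helper `clean_tweet` are hypothetical stand-ins for the NLTK stop-word list and the airline-name list.

```python
import string

# Hypothetical, tiny stand-ins for the NLTK stop-word list and the airline-name list.
STOP_WORDS = {"the", "is", "my", "was", "a", "to"}
AIRLINE_WORDS = {"united", "flight"}
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_tweet(text):
    """Lowercase, drop @mentions, strip punctuation, then drop unwanted words."""
    tokens = [t for t in text.lower().split() if not t.startswith("@")]
    tokens = [t.translate(PUNCT_TABLE) for t in tokens]
    return " ".join(
        t for t in tokens if t and t not in STOP_WORDS and t not in AIRLINE_WORDS
    )

print(clean_tweet("@united my flight. was cancelled, thanks a lot!"))
```

With this ordering, a token such as "flight." is reduced to "flight" before the filter sees it. The trade-off is that stop words containing an apostrophe would no longer match after stripping, so the better order depends on the stop-word list in use.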

In [None]:
#pre_process tweets from the three categories
neg_tweets= pre_processData(neg_tweets)
neut_tweets= pre_processData(neut_tweets)
pos_tweets= pre_processData(pos_tweets)

From the following cells, it's clear that, on average, negative tweets are longer (measured in characters) than positive and neutral tweets.

In [None]:
#function to find some statistics about tweets texts
def tweetStat(tweetsList):
    min_len= min(len(x) for x in tweetsList) 
    max_len= max(len(x) for x in tweetsList) 
    avg_len= sum(len(x) for x in tweetsList) / len(tweetsList)

    return min_len, max_len, avg_len

In [None]:
print("Negative tweets stats (min, max, avg): ", tweetStat(neg_tweets))
print("Neutral tweets stats (min, max, avg): ", tweetStat(neut_tweets))
print("Positive tweets stats (min, max, avg): ", tweetStat(pos_tweets))
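
Note that tweetStat measures length in characters. For models that consume word sequences, the length in words may be the more relevant figure. The sketch below compares the two units on made-up tweets; `length_stats` is a hypothetical helper, not part of the notebook.

```python
def length_stats(texts, unit="chars"):
    """Return (min, max, mean) length of the texts, in characters or in words."""
    lengths = [len(t) if unit == "chars" else len(t.split()) for t in texts]
    return min(lengths), max(lengths), sum(lengths) / len(lengths)

# Hypothetical cleaned tweets.
sample = ["late again", "thanks", "worst customer service ever"]

print(length_stats(sample, "chars"))
print(length_stats(sample, "words"))
```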

In [None]:
#function to get the k most frequent words in a set of tweets
def getFrequentWords(textList, k):  #k frequent words
    texts = ' '.join(textList)
    split_it = texts.split() 
    count = Counter(split_it) 
    most_occur = count.most_common(k) 

    print (most_occur)
    return most_occur

The cell below shows the most frequent words in negative tweets.
Words such as canceled, time, service, and help are clearly connected with complaints.

In [None]:
print("The 10 most frequent words in negative tweets are:" )
negwords = getFrequentWords(neg_tweets,10)

The cell below shows the most frequent words in neutral tweets.
Words such as please, need, and thanks are fairly neutral compared with the words used in positive tweets, as the next cell will show.

In [None]:
print("The 10 most frequent words in neutral tweets are:" )
neut_words =getFrequentWords(neut_tweets,10)

The cell below shows the most frequent words in positive tweets.
Words such as thanks, great, and love are clearly connected with satisfaction.

In [None]:
print("The 10 most frequent words in positive tweets are:" )
pos_words= getFrequentWords(pos_tweets,10)

To discover the frequent words in each tweet category in a more interesting way,
the following cells show the word cloud for each category, where the most frequent words appear in the largest size.

In [None]:
#function to get wordcloud 
def getWordCloud(texts):
    texts =" ".join(texts)
    wordcloud = WordCloud(stopwords=allunWantedwords,
                          background_color='white',
                          width=2000,
                          height=2000
                         ).generate(texts)
    
    return wordcloud

In [None]:
#negative tweets word cloud 
neg_wordcloud = getWordCloud(neg_tweets)
plt.figure(1,figsize=(7, 7))
plt.imshow(neg_wordcloud)
plt.axis('off')
plt.show()


In [None]:
#neutral tweets word cloud 
neut_wordcloud = getWordCloud(neut_tweets)
plt.figure(2,figsize=(7, 7))
plt.imshow(neut_wordcloud)
plt.axis('off')
plt.show()

In [None]:
#positive tweets word cloud 
pos_wordcloud = getWordCloud(pos_tweets)
plt.figure(3,figsize=(7, 7))
plt.imshow(pos_wordcloud)
plt.axis('off')
plt.show()

## Feature Engineering 

The aim of the previous part was to understand the data and gain insight into each feature. To build a sentiment prediction model, however, we will use only the tweet text. So all features are dropped except the tweet text, and we focus on the tweet text and the target label (sentiment).


In [None]:
#store the required two columns in a new data frame 
cleanedTweets= pre_processData(data["text"])  # apply preprocessing step 
finalData= pd.DataFrame()
finalData['text']= cleanedTweets
finalData['airline_sentiment']= data['airline_sentiment']
finalData.head()

In [None]:
#convert airline_sentiment values (negative, neutral, and positive) to numerical values (0,1,and 2) and map the data to it
sentiments = data['airline_sentiment'].astype('category').cat.categories.tolist()
replace_map_comp = {'airline_sentiment' : {k: v for k,v in zip(sentiments,list(range(0,len(sentiments))))}}
print(replace_map_comp)

finalData.replace(replace_map_comp, inplace=True)
finalData.head()

In [None]:
# store tweets text in x & the target label in y
x=  finalData['text'].values
y= finalData['airline_sentiment'].values

In [None]:
#split data into training (85%) and test (15%) sets
x_train, x_test, y_train, y_test = train_test_split( x, y, test_size=0.15, random_state=42)

print("x_train:" ,x_train.shape, ", y_train: ", len(y_train))
print("x_test: ",x_test.shape, ", y_test: ", len(y_test))
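
Since the classes are unbalanced, one option worth considering is a stratified split, which keeps the class proportions the same in the training and test sets. The sketch below uses the `stratify` parameter of `train_test_split` on hypothetical toy labels; it is an alternative to the split above, not what the notebook does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60% class 0, 25% class 1, 15% class 2.
y_toy = np.array([0] * 60 + [1] * 25 + [2] * 15)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy asks for the same class proportions in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(np.bincount(y_tr) / len(y_tr))
print(np.bincount(y_te) / len(y_te))
```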


# Prediction Models

## A. Machine Learning Models

To build a machine learning (ML) model, we can't work with words directly, so it's necessary to encode the text (i.e. convert the words to numbers).
There are different methods for that, among them CountVectorizer and TfidfVectorizer.
We will use both methods with different ML models.

There are many ML algorithms; in this kernel, we use Logistic regression and Random forest.
SVM, Naive Bayes, decision tree, KNN, and ensemble classifiers such as AdaBoost and a voting classifier will be added later.

#### 1- using CountVectorizer
The idea of CountVectorizer is to tokenize the texts and build a vocabulary of known words, then encode each text against this vocabulary as an integer count of the number of times each word appears in the document.
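
As a small illustration of this encoding, the sketch below fits a CountVectorizer on two made-up documents; the documents and variable names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical documents.
docs = ["late flight late bag", "great flight"]

toy_vectorizer = CountVectorizer(analyzer="word")
counts = toy_vectorizer.fit_transform(docs)

# vocabulary_ maps each word to its column index; sort words by that index.
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(vocab)
print(counts.toarray())
```

Each row is one document and each column one vocabulary word, so the word "late" contributes a count of 2 to the first row.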


In [None]:
v = CountVectorizer(analyzer = "word")
train_features= v.fit_transform(x_train)
test_features=v.transform(x_test)

In [None]:
#convert from sparse (contain a lot of zeros) to dense
final_train_features=train_features.toarray()
final_test_features= test_features.toarray()
print(final_train_features.shape)
print(final_test_features.shape)
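
The dense conversion is optional: scikit-learn estimators such as LogisticRegression and RandomForestClassifier accept SciPy sparse matrices directly, which can save a lot of memory on a large vocabulary. A minimal sketch on hypothetical toy texts:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy texts and labels (0 = negative, 1 = positive).
toy_texts = ["late bag lost", "great crew thanks",
             "lost bag again", "thanks great service"]
toy_labels = [0, 1, 0, 1]

# fit_transform returns a SciPy sparse matrix; no .toarray() call is needed.
sparse_features = CountVectorizer().fit_transform(toy_texts)
toy_clf = LogisticRegression(max_iter=200).fit(sparse_features, toy_labels)

print(type(sparse_features).__name__, sparse_features.shape)
print(toy_clf.predict(sparse_features))
```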

#### 1.1 - CountVectorizer with Logistic regression model

In [None]:
print('training model (this could take some time)...')
clf = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=200)
clf.fit(final_train_features, y_train)

print('calculating results...')
predictions_train = clf.predict(final_train_features)
predictions_test = clf.predict(final_test_features)

accuracy = accuracy_score(predictions_train,y_train)
print(" Logistic Regression Train accuracy is: {:.4f}".format(accuracy))

accuracy = accuracy_score(predictions_test,y_test)
print(" Logistic Regression Test accuracy is: {:.4f}".format(accuracy))

In [None]:
#print other performance measures, especially since the data is unbalanced
#classification_report expects the true labels first, then the predictions
print(classification_report(y_test, predictions_test))


In [None]:
#calculate the confusion matrix and plot it

#true labels first, so rows are true classes and columns are predicted classes
cm=confusion_matrix(y_test, predictions_test)
class_names =  ['Negative', 'Neutral', 'Positive']
fig, ax = plot_confusion_matrix(conf_mat=cm,
                                colorbar=True,
                                show_absolute=True,
                                show_normed=True,
                                class_names=class_names)
plt.show()
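
When reading such a matrix, the argument order matters. scikit-learn's `confusion_matrix` takes the true labels first, so rows are true classes and columns are predicted classes; passing the arguments in the other order transposes the matrix. A small sketch with hypothetical labels:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for three classes.
y_true_toy = [0, 0, 0, 1, 1, 2]
y_pred_toy = [0, 0, 1, 1, 0, 2]

# Rows are true classes, columns are predicted classes.
cm_toy = confusion_matrix(y_true_toy, y_pred_toy)

# Swapping the arguments yields the transpose.
cm_swapped = confusion_matrix(y_pred_toy, y_true_toy)

print(cm_toy)
print(cm_swapped)
```

In `cm_toy`, the entry in row 0, column 1 counts tweets whose true class is 0 but which were predicted as class 1.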

So, using CountVectorizer for encoding tweet texts and the Logistic regression algorithm as a prediction model for tweet sentiment achieved 79% accuracy. As the confusion matrix shows, most errors are in predicting 'neutral' as 'negative', followed by predicting 'neutral' as 'positive' and vice versa. This is logical, since neutral tweets sometimes combine both negative and positive words.

#### 1.2 - CountVectorizer with Random Forest model

In [None]:
print('training model (this could take some time)...')
clf = RandomForestClassifier(n_estimators=10)
clf.fit(final_train_features, y_train)

print('calculating results...')
predictions_train = clf.predict(final_train_features)
predictions_test = clf.predict(final_test_features)

accuracy = accuracy_score(predictions_train,y_train)
print("Random Forest Train accuracy is: {:.4f}".format(accuracy))

accuracy = accuracy_score(predictions_test,y_test)
print("Random Forest Test accuracy is: {:.4f}".format(accuracy))


In [None]:
#print other performance measures, especially since the data is unbalanced
#classification_report expects the true labels first, then the predictions
print(classification_report(y_test, predictions_test))

In [None]:
#calculate the confusion matrix and plot it

#true labels first, so rows are true classes and columns are predicted classes
cm=confusion_matrix(y_test, predictions_test)
class_names =  ['Negative', 'Neutral', 'Positive']
fig, ax = plot_confusion_matrix(conf_mat=cm,
                                colorbar=True,
                                show_absolute=True,
                                show_normed=True,
                                class_names=class_names)
plt.show()

Using CountVectorizer for encoding tweet texts and the Random Forest algorithm as a prediction model for tweet sentiment achieved 76% accuracy. As the confusion matrix shows, most errors are in predicting 'neutral' as 'negative' and 'negative' as 'positive'. Most errors involve 'negative' tweets, but don't forget that negative tweets are the largest class in the actual data.

#### 2- Using TfidfVectorizer
The TF-IDF method assigns each word a score based on its frequency: a word that appears often within a tweet gets a higher weight, but the weight is scaled down for words that appear in many tweets across the corpus.
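
To illustrate the inverse-document-frequency part, the sketch below computes a smoothed IDF by hand on three hypothetical documents. The formula ln((1 + n) / (1 + df)) + 1 is the one scikit-learn documents for TfidfVectorizer with its default smooth_idf=True; the helper `smoothed_idf` itself is hypothetical.

```python
import math

# Three hypothetical tokenized documents.
docs = [["flight", "late"], ["flight", "great"], ["flight", "late", "bag"]]
n_docs = len(docs)

def smoothed_idf(term):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1, with df = documents containing term."""
    df = sum(term in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + df)) + 1

for term in ["flight", "late", "bag"]:
    print(term, round(smoothed_idf(term), 3))
```

The word that appears in every document ("flight") gets the smallest IDF, and the rarest word ("bag") gets the largest, so common words are down-weighted relative to their raw counts.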


In [None]:
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
tfidf_vect.fit(x_train)
train_features =  tfidf_vect.transform(x_train)
test_features =  tfidf_vect.transform(x_test)


In [None]:
#convert from sparse (contain a lot of zeros) to dense
final_train_features=train_features.toarray()
final_test_features= test_features.toarray()


#### 2.1 - TfidfVectorizer with Logistic regression model

In [None]:
print('training model (this could take some time)...')
clf = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=200)
clf.fit(final_train_features, y_train)

print('calculating results...')

predictions_train = clf.predict(final_train_features)
predictions_test = clf.predict(final_test_features)

accuracy = accuracy_score(predictions_train,y_train)
print("Logistic regression Train accuracy is: {:.4f}".format(accuracy))

accuracy = accuracy_score(predictions_test,y_test)
print("Logistic regression Test accuracy is: {:.4f}".format(accuracy))



#### 2.2 - TfidfVectorizer with Random Forest model


In [None]:
print('training model (this could take some time)...')
clf = RandomForestClassifier(n_estimators=10)
clf.fit(final_train_features, y_train)

print('calculating results...')
predictions_train = clf.predict(final_train_features)
predictions_test = clf.predict(final_test_features)

accuracy = accuracy_score(predictions_train,y_train)
print("Random forest Train accuracy is: {:.4f}".format(accuracy))

accuracy = accuracy_score(predictions_test,y_test)
print("Random forest Test accuracy is: {:.4f}".format(accuracy))

The results of the logistic regression model with CountVectorizer and with TfidfVectorizer are nearly the same. The same applies to the Random forest algorithm with the two encoding methods.

**Notice that it's possible to achieve higher performance with more parameter tuning.**
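
One common way to do such tuning is a cross-validated grid search. The sketch below wraps a TfidfVectorizer and a LogisticRegression in a scikit-learn Pipeline and searches over the regularization strength C. The toy texts, the grid values, and the choice of macro F1 as the score are all assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy texts and labels (0 = negative, 1 = positive).
toy_texts = ["late bag lost", "lost bag again", "delayed and rude", "rude crew late",
             "great crew thanks", "thanks great service", "love the service",
             "great flight love"]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Vectorizer and classifier are tuned together, so each CV fold refits the vocabulary.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=200)),
])

# "clf__C" addresses the C parameter of the pipeline step named "clf".
param_grid = {"clf__C": [0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="f1_macro")
search.fit(toy_texts, toy_labels)

print(search.best_params_)
```

Keeping the vectorizer inside the pipeline means the vocabulary is learned only from the training part of each fold, which avoids leaking validation text into the features.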


## B. Deep Learning Model

There are different deep learning (DL) techniques that can be applied to this data, such as recurrent neural networks (RNN) and convolutional neural networks (CNN).

In this kernel I will use a Long Short-Term Memory (LSTM) network, which is a special kind of RNN, since RNNs are widely used with text (as sequence data). More DL techniques will be added later.

The first step is to choose how to represent the texts. I will use GloVe embeddings, which come from an unsupervised learning algorithm for obtaining vector representations of words. Each word is mapped to a 300-dimensional vector that captures some of its semantics.

In [None]:
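embeddings_index = {}
# Some tokens in this GloVe file contain spaces, so take the last 300 fields as
# the vector and everything before them as the word.
with open('../input/glove840b300dtxt/glove.840B.300d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.rstrip().split(' ')
        word = ' '.join(values[:-300])
        coefs = np.asarray(values[-300:], dtype='float32')
        embeddings_index[word] = coefs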

print('GloVe data loaded')
print('Loaded %s word vectors.' % len(embeddings_index))
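
Each line of a GloVe text file holds a token followed by its vector components, separated by spaces. Large GloVe files can contain tokens that themselves include spaces, which breaks a parser that assumes the word is always the first field. The sketch below parses from the right instead; `parse_glove_line` and the 3-dimensional example lines are hypothetical.

```python
import numpy as np

EMBED_DIM = 3  # hypothetical tiny dimension for the example; the real file uses 300

def parse_glove_line(line, dim=EMBED_DIM):
    """Split a GloVe line into (word, vector), taking the last `dim` fields as the vector."""
    parts = line.rstrip().split(" ")
    word = " ".join(parts[:-dim])
    vector = np.asarray(parts[-dim:], dtype="float32")
    return word, vector

print(parse_glove_line("flight 0.1 -0.2 0.3"))
print(parse_glove_line(". . . 0.5 0.5 0.5"))
```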

In [None]:
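#encode train and test texts using a tokenizer
MAX_NUM_WORDS = 1000
MAX_SEQUENCE_LENGTH = 135 #from the stats found previously (those are in characters, so this is generous for word counts)
tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
tokenizer.fit_on_texts(x_train) #fit on the training texts only, so the test set does not influence the vocabulary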

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

sequences_train = tokenizer.texts_to_sequences(x_train)
x_train_seq = pad_sequences(sequences_train, maxlen=MAX_SEQUENCE_LENGTH)

sequences_test = tokenizer.texts_to_sequences(x_test)
x_test_seq = pad_sequences(sequences_test, maxlen=MAX_SEQUENCE_LENGTH)

#convert labels to one hot vectors
labels_train = to_categorical(np.asarray(y_train))
labels_test = to_categorical(np.asarray(y_test))

print("train data :")
print(x_train_seq.shape)
print(labels_train.shape)

print("test data :")
print(x_test_seq.shape)
print(labels_test.shape)
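
With its default settings (padding='pre', truncating='pre'), pad_sequences left-pads short sequences with zeros and keeps the last maxlen ids of long ones. The NumPy function below is a hypothetical re-implementation written only to show that behavior; it is not the Keras code.

```python
import numpy as np

def pad_pre(sequences, maxlen):
    """Left-pad with zeros and keep the last `maxlen` ids of each sequence."""
    out = np.zeros((len(sequences), maxlen), dtype="int32")
    for row, seq in enumerate(sequences):
        kept = seq[-maxlen:]  # truncate from the front if too long
        if kept:              # an empty sequence stays all zeros
            out[row, -len(kept):] = kept
    return out

print(pad_pre([[5, 8], [1, 2, 3, 4, 5, 6]], maxlen=4))
```

Because index 0 is used for padding, real words must start at index 1, which is why the vocabulary size later needs one extra row.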

In [None]:
# Vocabulary size for the embedding layer: number of unique words in our tweets
vocab_size = len(word_index) + 1 # +1 because index 0 is reserved for padding

In [None]:
# Define size of embedding matrix: number of unique words x embedding dim (300)
embedding_matrix = np.zeros((vocab_size, 300))

# fill in matrix
for word, i in word_index.items():  # dictionary
    embedding_vector = embeddings_index.get(word) # gets embedded vector of word from GloVe
    if embedding_vector is not None:
        # add to matrix
        embedding_matrix[i] = embedding_vector # each row of matrix
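
Words that are missing from GloVe keep an all-zero row in the matrix, so it can be useful to measure how much of the vocabulary is actually covered. The sketch below does this on a hypothetical three-word vocabulary with 4-dimensional toy vectors.

```python
import numpy as np

# Hypothetical tokenizer vocabulary and toy pre-trained vectors.
toy_word_index = {"flight": 1, "late": 2, "zzzunknown": 3}
toy_embeddings = {
    "flight": np.ones(4, dtype="float32"),
    "late": np.full(4, 0.5, dtype="float32"),
}

# Row 0 is left for padding; rows of missing words stay at zero.
toy_matrix = np.zeros((len(toy_word_index) + 1, 4))
missing = []
for word, i in toy_word_index.items():
    vec = toy_embeddings.get(word)
    if vec is None:
        missing.append(word)
    else:
        toy_matrix[i] = vec

coverage = 1 - len(missing) / len(toy_word_index)
print(f"coverage: {coverage:.2f}, missing: {missing}")
```

A low coverage on the real data could suggest that the preprocessing (lowercasing, punctuation stripping) produces tokens that do not match the GloVe vocabulary.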

In [None]:
#DL model: pass the encoded data to an embedding layer that uses the GloVe pre-trained weights, then pass the
# output to an LSTM layer followed by 2 dense layers.
# the optimizer used is Adam, since it usually achieves higher accuracy.

cell_size= 256
deepLModel1 = Sequential()
embedding_layer = Embedding(input_dim=vocab_size, output_dim=300, weights=[embedding_matrix],
                           input_length = MAX_SEQUENCE_LENGTH, trainable=False)
deepLModel1.add(embedding_layer)
deepLModel1.add(LSTM(cell_size, dropout = 0.2))
deepLModel1.add(Dense(64,activation='relu'))
deepLModel1.add(Flatten())
deepLModel1.add(Dense(3, activation='softmax'))
deepLModel1.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
deepLModel1.summary()

In [None]:
#train the model
deepLModel1_history = deepLModel1.fit(x_train_seq, labels_train, validation_split = 0.15,
                    epochs=100, batch_size=256)

In [None]:
# Find train and test accuracy
loss, accuracy = deepLModel1.evaluate(x_train_seq, labels_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))

loss, accuracy = deepLModel1.evaluate(x_test_seq, labels_test, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy))

In [None]:
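#predict_classes is not available in recent Keras versions; take the argmax of the predicted probabilities instead
predictions_test = np.argmax(deepLModel1.predict(x_test_seq), axis=1)
#print other performance measures, especially since the data is unbalanced
print(classification_report(y_test, predictions_test))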

In [None]:
#calculate the confusion matrix and plot it

#true labels first, so rows are true classes and columns are predicted classes
cm=confusion_matrix(y_test, predictions_test)
class_names =  ['Negative', 'Neutral', 'Positive']
fig, ax = plot_confusion_matrix(conf_mat=cm,
                                colorbar=True,
                                show_absolute=True,
                                show_normed=True,
                                class_names=class_names)
plt.show()

**Notice that it's possible to achieve a higher performance with more parameter-tuning.**

### Overall Summary

* The best test accuracy among the techniques used is 80%, achieved by the logistic regression model with either CountVectorizer or TfidfVectorizer.
* ML models, especially logistic regression, outperform the DL model used. This may be because the DL model needs more training or more data.
* In all models, overfitting is clearly visible: training accuracy reaches 95%-98% while test accuracy is 77%-80%.
* In all models, negative tweets are the most often classified correctly, which is logical since more than half of the data consists of negative tweets. The 'neutral' category is the least often classified correctly, possibly because neutral tweets use words that fall between negativity and positivity, which causes confusion.
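
One common response to the overfitting noted above is early stopping: halt training once the validation loss stops improving and keep the best epoch. Keras offers this through its EarlyStopping callback; the pure-Python sketch below only illustrates the patience rule on a made-up loss curve, and `best_epoch_with_patience` is a hypothetical helper.

```python
def best_epoch_with_patience(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 0-based, for a patience rule:
    stop once the validation loss has not improved for `patience` epochs."""
    best_loss, best_epoch = float("inf"), -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    # Never ran out of patience: training used every epoch.
    return len(val_losses) - 1, best_epoch

# Hypothetical validation losses that improve, then degrade.
toy_val_losses = [0.90, 0.70, 0.62, 0.65, 0.66, 0.70, 0.75]
print(best_epoch_with_patience(toy_val_losses, patience=3))
```

Applied to the LSTM training, a rule like this would likely stop well before 100 epochs if the validation loss starts rising early.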

**What's next?**
* Study the misclassified tweets, and adjust the preprocessing based on that.
* Try other ML models: Decision tree, SVM, Naive Bayes, and KNN.
* Try grid search CV technique to search for the best parameter for each ML model.
* Try ensemble methods: voting, adaboost, and gradient boost.
* Try another pre-trained embedding.
* Try more DL techniques such as CNN.

