**Introduction**

Suppose you are planning a trip to or from the United States. You will want to book your tickets with the best airline so that nothing spoils your mood during your vacation.
Advertisements can be tricky and misleading, so the best way to form an impression of an airline is through people's feedback and reviews. Thankfully, social media has made accessing people's opinions easier than ever.

In this notebook, we are going to analyze about 15k tweets about 6 American airlines, in order to get a better intuition about the quality of their services and about the positive and negative reviews from their customers.
We will also train machine learning and deep learning models to classify a given tweet as a negative, positive or neutral review.

**Starting by importing the necessary libraries**

In [None]:
#starting by importing the necessary libraries
import numpy as np #numerical arrays and matrix operations
import pandas as pd #reading and manipulating tabular data
import matplotlib.pyplot as plt #for data visualization

from sklearn.naive_bayes import MultinomialNB #based on Bayes' theorem; simple but has its advantages
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression #a simple linear classifier
#ensemble models are worth trying
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier #has no real training phase; it compares against stored examples at prediction time
from sklearn.svm import SVC #a popular machine learning classifier
from sklearn.metrics import accuracy_score,confusion_matrix, f1_score #for accuracy and scores

**Reading the data**

Starting by reading the data and exploring it

In [None]:
#reading the data from CSV file using pandas
data = pd.read_csv('../input/Tweets.csv')
#have a look at the data
#data.head(5)

**Data visualization**

In this section, we do some exploratory analysis and visualization of the data.
Pandas and matplotlib are great libraries for data visualization and statistics.
We will start by showing the number of negative, neutral and positive reviews in the data.

In [None]:
#count the tweets per sentiment in a fixed order, so the bars always match the tick labels
sentiments = ['negative','neutral','positive']
counter = data.airline_sentiment.value_counts().reindex(sentiments)
index = [1,2,3]
plt.figure(1,figsize=(12,6))
plt.bar(index,counter,color=['green','red','blue'])
plt.xticks(index,sentiments,rotation=0)
plt.xlabel('Sentiment Type')
plt.ylabel('Sentiment Count')
plt.title('Count of Type of Sentiment')

Looks like people are not having pleasant flights these days.
It is important to know which airlines please their customers the most and which the least, so we will look at the percentage of negative reviews for each airline.

In [None]:
#negative tweets and all tweets per airline, aligned by airline name rather than by position
neg_tweets = data[data.airline_sentiment == 'negative'].groupby('airline').size()
total_tweets = data.groupby('airline')['airline_sentiment'].count()

#share of negative tweets per airline, in percent
perc = (100 * neg_tweets / total_tweets).to_frame('Percent Negative')
print(perc)
ax = perc.plot(kind = 'bar', rot=0, colormap = 'Greens_r', figsize = (15,6))
ax.set_xlabel('Airlines')
ax.set_ylabel('Percentage of negative tweets')
plt.show()
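
The same per-airline share can also be read off a normalized cross-tabulation. The following is only a minimal sketch on a hypothetical toy frame (the airline names `A` and `B` and the rows are made up); it assumes the two column names used in this notebook.

```python
import pandas as pd

# Hypothetical toy data with the same two columns as Tweets.csv
toy = pd.DataFrame({
    "airline": ["A", "A", "A", "A", "B", "B", "B"],
    "airline_sentiment": ["negative", "negative", "positive", "neutral",
                          "negative", "positive", "positive"],
})

# normalize="index" turns every row (airline) into shares that sum to 1
shares = pd.crosstab(toy["airline"], toy["airline_sentiment"], normalize="index")
negative_share = shares["negative"]
print(negative_share)
```

Because the result is indexed by airline name, no positional bookkeeping is needed when airlines or sentiment classes are added or missing.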

Next time I am flying with either Virgin or Delta, but definitely not with US Airways.
Looking on the bright side, the next graph shows the number of each type of review for each airline.

In [None]:
figure_2 = data.groupby(['airline', 'airline_sentiment']).size()
figure_2.unstack().plot(kind='bar', stacked=True, figsize=(15,10))

In [None]:
print(figure_2)

Last but not least, people complain about their flights for many reasons; 10 reasons, to be specific. In the next figure, we break down these reasons for each airline.

In [None]:
negative_reasons = data.groupby('airline')['negativereason'].value_counts(ascending=True)
negative_reasons.groupby(['airline','negativereason']).sum().unstack().plot(kind='bar',figsize=(22,12))
plt.xlabel('Airline Company')
plt.ylabel('Number of Negative reasons')
plt.title("Number of negative reasons per airline")
plt.show()

Except for Delta, customer service is the leading reason for people's complaints on Twitter.
US Airways should really consider a fundamental change in their customer service.

**Data cleaning and Preprocessing**

Cleaning and preprocessing are essential before training any machine learning or deep learning model, and the results differ significantly without them. Computers are not as smart as humans (so far), so we need to spoon-feed them the data in a form they can use.

In [None]:
import re
import nltk
from nltk.corpus import stopwords
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from mlxtend.plotting import plot_confusion_matrix
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report
from sklearn.feature_extraction.text import CountVectorizer

The preprocessing and cleaning stage consists of removing stopwords, removing mentions such as @VirginAirlines, removing links and punctuation, and converting the text to lower case, all in preparation for word representation.

In [None]:
#build the stopword set once; negations matter for sentiment analysis, so we keep them
keep_words = {"not", "no"}
stopwords_set = set(stopwords.words('english')) - keep_words

def remove_stopwords(input_text):
    #drop stopwords and single-character tokens from one tweet
    words = input_text.split()
    clean_words = [word for word in words if word not in stopwords_set and len(word) > 1]
    return " ".join(clean_words)

#the functions below receive a whole pandas column (Series) and return the cleaned column
def remove_mentions(input_text):
    return input_text.str.replace(r'@\w+', '', regex=True)

def lower_case(input_text):
    return input_text.str.lower()

def remove_http(input_text):
    return input_text.str.replace(r'http\S+', '', regex=True)

def remove_punctuation(input_text):
    return input_text.str.replace(r'[^\w\s]', '', regex=True)
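
To see what the cleaning steps do to a single tweet, here is a small self-contained sketch. The tweet and the tiny stopword set are made up for illustration (the notebook itself uses NLTK's English list), and `clean_tweet` is a hypothetical helper that applies the steps in the same order as the pipeline.

```python
import re

# Small illustrative stopword set (assumption); the notebook uses NLTK's English list
STOPWORDS = {"the", "is", "a", "to", "my", "was", "and", "on", "not", "no"}
KEEP = {"not", "no"}  # negations are kept because they carry sentiment

def clean_tweet(text):
    text = re.sub(r"@\w+", "", text)     # drop mentions
    text = re.sub(r"http\S+", "", text)  # drop links
    text = re.sub(r"[^\w\s]", "", text)  # drop punctuation
    text = text.lower()
    words = [w for w in text.split()
             if (w not in STOPWORDS or w in KEEP) and len(w) > 1]
    return " ".join(words)

example = "@VirginAmerica my flight was NOT on time! http://t.co/abc"
cleaned = clean_tweet(example)
print(cleaned)
```

Note that the negation survives the stopword filter, which is the point of the keep list.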


After preprocessing the two columns we need, the sentiment and the tweet text, we map the sentiment labels to numbers (because, again, computers only understand numbers) and then split the data into training (90%) and testing (10%) sets.

In [None]:
data_2 = data[['text', 'airline_sentiment']]
preprocessed_data = data_2.apply(remove_mentions).apply(remove_http).apply(remove_punctuation).apply(lower_case)
clean_text = []
for tweet in preprocessed_data.text:
    clean = remove_stopwords(tweet)
    clean_text.append(clean)

X = clean_text
Y = preprocessed_data['airline_sentiment']
from sklearn.model_selection import train_test_split
Y = Y.map({'negative':0, 'positive':1, 'neutral':2}).astype(int)
X_train,X_test,y_train,y_test = train_test_split(X, Y, test_size=0.1, random_state=42)

**Word Representation**

Word representation is a powerful and important stage in natural language processing. The move from one-hot encodings to more efficient and meaningful word representations dramatically improved the performance of NLP models. In fact, in [a paper](https://arxiv.org/abs/1901.10444) published in 2019, researchers from CMU and Facebook attempted sentence classification without training the encoders, relying only on pretrained embeddings, and achieved pretty good results.
In this section, we will explore different word representation techniques.

Term frequency-inverse document frequency (TF-IDF) remains a powerful word representation technique to this day: it weights a word by how often it appears in a document, discounted by how many documents contain it. [This blog explains it brilliantly](https://www.freecodecamp.org/news/how-to-process-textual-data-using-tf-idf-in-python-cd2bbc0a94a3/).

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
text_features_train = vectorizer.fit_transform(X_train)
text_features_test = vectorizer.transform(X_test)
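
To make the weighting concrete, the sketch below computes the textbook form of TF-IDF (raw term count times log(N / document frequency)) by hand on three made-up, already-cleaned tweets. This is an illustration only: `TfidfVectorizer` uses a smoothed IDF and normalizes each document vector by default, so its numbers will differ.

```python
import math
from collections import Counter

# Hypothetical, already-cleaned tweets
docs = ["flight late", "flight great", "great crew great food"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears
df = Counter(term for doc in tokenized for term in set(doc))

def tfidf(doc):
    counts = Counter(doc)
    # textbook weighting: raw term count * log(N / df)
    return {t: c * math.log(n_docs / df[t]) for t, c in counts.items()}

weights = [tfidf(doc) for doc in tokenized]
for d, w in zip(docs, weights):
    print(d, "->", w)
```

In the first tweet, the rarer word "late" gets a higher weight than "flight", which appears in two of the three documents.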


Another breakthrough in the field of NLP is [Word2vec](https://arxiv.org/abs/1301.3781), proposed by Google. It learns a representation for each word based on the contexts in which the word appears. On the other hand, it has some limitations, such as out-of-vocabulary words, but since we are using a small dataset, this will not be an issue here.
The following code trains a word2vec model on our data. You can also [download pretrained](https://mccormickml.com/2016/04/12/googles-pretrained-word2vec-model-in-python/) word2vec embeddings for many languages.

In [None]:
from gensim.models import Word2Vec
sentences = [line.split() for line in clean_text]
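#argument names for gensim < 4.0; in gensim >= 4.0 they are vector_size and epochs instead of size and iter
w2v = Word2Vec(sentences, size=50, min_count = 0, window = 5,workers=4,iter=500)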

The following visualization shows what word2vec actually learns: you can notice that similar words end up close to each other in the space.

In [None]:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
X = w2v[w2v.wv.vocab]
tsne = TSNE(n_components=2)
X_tsne = tsne.fit_transform(X[0:100])
plt.rcParams["figure.figsize"] = (20,20)
plt.scatter(X_tsne[:, 0], X_tsne[:, 1])
labels = list(w2v.wv.vocab.keys())
for label, x, y in zip(labels, X_tsne[:, 0], X_tsne[:, 1]):
    plt.annotate(
        label,
        xy=(x, y), xytext=(-1, -1),
        textcoords='offset points', ha='right', va='bottom')

plt.show()

For deep learning techniques, each word is replaced by an integer token, and a lookup table stores the embedding vector for every token.

In [None]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

t = Tokenizer()
t.fit_on_texts(clean_text)
vocab_size = len(t.word_index) + 1
encoded_docs = t.texts_to_sequences(clean_text)
padded_docs = pad_sequences(encoded_docs, maxlen=20, padding='post')
embedding_dict = dict()
for i in w2v.wv.vocab:
    embedding_dict[i] = w2v[i]

embedding_matrix = np.zeros((vocab_size, 50))
for word, i in t.word_index.items():
    embedding_vector = embedding_dict.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
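
The mapping from text to padded token ids to embedding rows can be sketched with plain NumPy. Everything here is a toy stand-in: the 3-dimensional vectors, the `word_index` dictionary and the `encode` helper are hypothetical, and `encode` simply keeps the first `maxlen` ids, which is a simplification of what `pad_sequences` does.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors standing in for the word2vec output
vectors = {"flight": np.array([0.1, 0.2, 0.3]),
           "late":   np.array([0.9, 0.8, 0.7])}

# Index 0 is reserved for padding, so real words start at 1
word_index = {"flight": 1, "late": 2, "crew": 3}

def encode(text, maxlen=5):
    ids = [word_index[w] for w in text.split() if w in word_index]
    ids = ids[:maxlen]                        # simplified truncation: keep the first ids
    return ids + [0] * (maxlen - len(ids))    # pad at the end ('post')

embedding_matrix = np.zeros((len(word_index) + 1, 3))
for word, i in word_index.items():
    if word in vectors:                       # words without a vector stay all-zero
        embedding_matrix[i] = vectors[word]

padded = encode("flight late")
embedded = embedding_matrix[padded]           # shape: (maxlen, embedding_dim)
print(padded, embedded.shape)
```

Row 0 of the matrix stays all-zero, so padding positions contribute a zero vector to the model input.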

After word2vec, companies and universities entered a competition in the field of word embeddings. Facebook proposed fastText, which overcomes the out-of-vocabulary issue, while Samsung and the University of Edinburgh proposed Byte Pair Encoding (BPE). More recently, researchers made another breakthrough by proposing contextual word embeddings with ELMo, which addresses another limitation of word2vec: it does not take into account that a word may have multiple meanings.

**Machine Learning Techniques**

Classical machine learning techniques perform very well as classifiers. In fact, they often match or surpass deep learning techniques when the training set is relatively small. In this section we will explore some of these techniques.
We will be using sklearn to implement these techniques.
Also, for machine learning techniques, we will use TF-IDF as word representation.

In [None]:
def logistic_regression(training_features, labels_train, test_features, labels_test):
    for c in [0.01, 0.05, 0.25, 0.5, 1, 5]:
    #changing the parameter C to get the optimal classification
        lr = LogisticRegression(C=c)
        lr.fit(training_features, labels_train)
        print ("Accuracy of logistic regression for C=%s: %s" 
           % (c, accuracy_score(labels_test, lr.predict(test_features))))
        results(labels_test, lr.predict(test_features))
    
  
def svm(training_features, labels_train, test_features, labels_test):
    for c in [1, 5, 10, 50]:
    #changing the parameter C to get the optimal classification
        model = LinearSVC(C=c)
        model.fit(training_features, labels_train)
        #report accuracy on the held-out test set, like the other classifiers
        print ("Accuracy of svm for C=%s: %s" 
           % (c, accuracy_score(labels_test, model.predict(test_features))))
        results(labels_test, model.predict(test_features))
    
def naiive_bayes(training_features, labels_train, test_features, labels_test):
    clf = MultinomialNB()
    clf.fit(training_features, labels_train)
    print ("Accuracy of Naive Bayes: %s" 
         % ( accuracy_score(labels_test, clf.predict(test_features))))
    results(labels_test, clf.predict(test_features))

def GB(training_features, labels_train, test_features, labels_test):
    model = GradientBoostingClassifier()
    #use the function arguments rather than the global y_train / y_test
    model.fit(training_features, labels_train)
    print ("Accuracy of GBM: %s" 
          % (accuracy_score(labels_test, model.predict(test_features))))
    results(labels_test, model.predict(test_features))
        
def RF(training_features, labels_train, test_features, labels_test):
    model = RandomForestClassifier(n_estimators=200)
    model.fit(training_features, labels_train)
    print ("Accuracy of Random Forest: %s" 
          % (accuracy_score(labels_test, model.predict(test_features))))
    results(labels_test, model.predict(test_features))

To get better insight into the results, we calculate the confusion matrix, accuracy, precision, recall and F1 score for each classifier.

In [None]:
def results(labels, pred):
    print(confusion_matrix(labels,pred))  
    print(classification_report(labels,pred))  
    print(accuracy_score(labels, pred))

**Logistic Regression**

To start with, logistic regression is a simple yet often powerful classification technique. For multiclass classification it can use either a one-vs-rest scheme or a multinomial (softmax) formulation; which one scikit-learn picks by default depends on the library version and solver.
The C parameter is varied in order to get the best result.

In [None]:
logistic_regression(text_features_train,y_train, text_features_test,y_test)

**Support Vector Machines**

Support vector machines are a very popular technique for classification. A linear SVM tries to find the hyperplane that best separates the classes, that is, the one with the largest margin.
As before, the C parameter is varied to get the best results.

In [None]:
svm(text_features_train,y_train, text_features_test,y_test)

**Naive Bayes**

Naive Bayes relies entirely on Bayes' theorem of probability for classification. It is simple and fast, has a low computational cost, and can work on large datasets. On the other hand, its assumption of independent features is a drawback, since in practice it is almost impossible to get a set of predictors that are entirely independent.

In [None]:
naiive_bayes(text_features_train,y_train, text_features_test,y_test)

**Gradient Boosting algorithm**

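Gradient boosting is an ensemble method. It combines a set of weak learners, typically shallow decision trees, that are fitted one after another so that each new learner corrects the errors of the previous ones, which usually improves prediction accuracy. Because the learners are built sequentially, training is harder to parallelize than for a random forest.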

In [None]:
GB(text_features_train,y_train, text_features_test,y_test)

**Random Forest**

Random forest builds many decision trees and combines their predictions. It is accurate and robust and is less prone to overfitting than a single tree, but prediction is slower because every tree has to be evaluated.

In [None]:
RF(text_features_train,y_train, text_features_test,y_test)

**Deep Learning techniques**

This section was made possible by the great efforts of Geoffrey Hinton and his colleagues, who popularized backpropagation, the algorithm by which neural networks actually learn.
In this section, we will explore the feed-forward neural network, Long Short-Term Memory (LSTM), and Convolutional Neural Networks (CNN) for text classification.
For deep learning, the embeddings extracted by word2vec will be used.

First, the labels need to be one-hot encoded.


In [None]:
import tensorflow as tf
y_one_hot = tf.keras.utils.to_categorical(
    Y,
    num_classes=3,
    dtype='int32'
)
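
For intuition, one-hot encoding of integer labels can be reproduced in a couple of lines of NumPy. This is a sketch on four made-up labels, using the same label mapping as above (0 = negative, 1 = positive, 2 = neutral); it is not a replacement for `to_categorical`.

```python
import numpy as np

labels = np.array([0, 2, 1, 0])             # 0 = negative, 1 = positive, 2 = neutral
one_hot = np.eye(3, dtype="int32")[labels]  # pick row k of the identity matrix for label k
print(one_hot)

# argmax along each row recovers the original integer labels
recovered = one_hot.argmax(axis=1)
```

The `argmax` step is the inverse mapping that is needed later to compare predictions with the true classes.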

In [None]:
x_train_DL, x_test_DL,y_train_DL,y_test_DL = train_test_split(padded_docs,y_one_hot,test_size=0.1,random_state=42)

**Feed Forward Neural Networks**

Feed-forward neural networks can classify reliably given a moderate amount of data.
In this model, we use the trained word2vec embeddings to initialize the first (embedding) layer, followed by two dense layers with 128 hidden nodes each, plus dropout to reduce overfitting. Finally, a softmax activation is used to predict the most probable class.

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Embedding,LSTM, Flatten, Dropout
from keras.initializers import Constant
from sklearn.metrics import accuracy_score
model = Sequential()
embedding_layer = Embedding(vocab_size,50,embeddings_initializer = Constant(embedding_matrix),input_length=20, trainable=False)
model.add(embedding_layer)
model.add(Dense(128,activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(128,activation='relu'))
model.add(Dropout(0.2))
model.add(Flatten())
model.add(Dense(3,activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
print(model.summary())
model.fit(x_train_DL,y_train_DL,epochs = 50, batch_size=128)
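ans = model.predict(x_test_DL)
labels = np.argmax(y_test_DL, axis=1) #true class indices recovered from the one-hot labels
ans = np.argmax(ans,axis=1) #predicted class indices
#sklearn metrics expect the true labels first, then the predictions
results(labels, ans)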

**Long Short Term Memory (LSTM)**

Humans don’t start their thinking from scratch every second. As you read this sentence, you understand each word based on your understanding of the previous words. You don’t throw everything away and start thinking from scratch again; your thoughts have persistence. Feed-forward neural networks cannot do this, which gave rise to recurrent neural networks (RNNs). Plain RNNs, however, suffer from the vanishing gradient problem, which in turn led researchers to propose the Gated Recurrent Unit (GRU) and LSTM, which are better at capturing the meaning of a whole sequence. This makes LSTMs sound like a great fit for NLP problems, which they really are. In this section we will explore them for this task.

In [None]:
model = Sequential() 
embedding_layer = Embedding(vocab_size,50,embeddings_initializer = Constant(embedding_matrix),input_length=20, trainable=True)
model.add(embedding_layer)
model.add(LSTM(32, activation = 'relu',return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(32, activation = 'relu'))
model.add(Dropout(0.2))
model.add(Dense(3,activation = 'softmax'))
print(model.summary())
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.fit(x_train_DL,y_train_DL,epochs = 20,batch_size=256)
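ans = model.predict(x_test_DL)
labels = np.argmax(y_test_DL, axis=1) #true class indices recovered from the one-hot labels
ans = np.argmax(ans,axis=1) #predicted class indices
#sklearn metrics expect the true labels first, then the predictions
results(labels, ans)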

**Convolutional Neural Networks (CNN)**

CNNs are best known for image processing tasks, but they also perform pretty well in NLP. In fact, Facebook proposed a CNN architecture for machine translation, and it was considered one of the state-of-the-art architectures, achieving high results on the benchmarks and leading on some language pairs. CNNs are also well known for being easy to parallelize. In this section, we explore the ability of CNNs for this task, using 1D filters. [This blog](https://medium.com/jatana/report-on-text-classification-using-cnn-rnn-han-f0e887214d5f) explains how text classification with CNNs works.

In [None]:
from keras.layers import Conv1D, GlobalMaxPooling1D
from keras.layers import Dropout
model = Sequential()
embedding_layer = Embedding(vocab_size,50,embeddings_initializer = Constant(embedding_matrix),input_length=20, trainable=False)
model.add(embedding_layer)
#Keras 2 argument names: filters, kernel_size, padding, strides
model.add(Conv1D(filters=100,
                 kernel_size=5,
                 padding='valid',
                 activation='relu',
                 strides=1))
model.add(GlobalMaxPooling1D())
model.add(Dense(128,activation = 'relu'))
model.add(Dropout(0.2))
model.add(Dense(3,activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.fit(x_train_DL,y_train_DL,epochs=20)
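ans = model.predict(x_test_DL)
labels = np.argmax(y_test_DL, axis=1) #true class indices recovered from the one-hot labels
ans = np.argmax(ans,axis=1) #predicted class indices
#sklearn metrics expect the true labels first, then the predictions
results(labels, ans)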
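
To see what the convolution and pooling layers compute, here is a rough NumPy sketch of a 'valid' 1D convolution followed by global max pooling, using the same sizes as the model above (20 tokens, 50-dimensional embeddings, 100 filters of width 5). The input and the filter weights are random stand-ins, and the bias term is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, emb_dim, n_filters, kernel = 20, 50, 100, 5

x = rng.normal(size=(seq_len, emb_dim))            # one embedded, padded tweet
w = rng.normal(size=(kernel, emb_dim, n_filters))  # filter weights (random stand-ins)

# 'valid' convolution: slide a window of 5 tokens with no padding -> 16 positions
windows = np.stack([x[i:i + kernel] for i in range(seq_len - kernel + 1)])
feature_maps = np.maximum(np.einsum("pke,kef->pf", windows, w), 0)  # ReLU

# global max pooling keeps the strongest response of each filter over all positions
pooled = feature_maps.max(axis=0)
print(feature_maps.shape, pooled.shape)
```

Each filter therefore acts like a detector for a 5-token pattern, and pooling reports whether that pattern occurred anywhere in the tweet.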

**Results Summary, Discussion and Error Analysis**

Eight machine learning and deep learning techniques were implemented for sentiment analysis of tweets about 6 US airlines. The training set contains about 13.2k tweets and the test set 1,464.

The machine learning techniques implemented were logistic regression, Support Vector Machines (SVM), Naive Bayes, gradient boosting and random forest. TF-IDF was used as the word representation for these techniques.
The results for the best model of each technique are shown in the table below.

| Technique  | Accuracy | Precision | Recall | F1Score|
| ---------- | -------- |-------- |-------- |-------- |
| Logistic Regression  | 79.5%  | 0.78 | 0.8 | 0.78|
| Support Vector Machines  | 79%  | 0.78 | 0.79| 0.78|
| Naive Bayes  | 69.6%  | 0.75 | 0.7 | 0.62|
| Gradient Boosting | 72%  | 0.73 | 0.73 |0.67 |
| Random Forest| 78.3%  | 0.77 |0.78 |0.77 |

For further analysis, confusion matrices are useful to gain insight into the results. In the tables below, rows are the true classes and columns are the predicted classes.

Logistic regression

| | negative | positive | neutral|
| ---------- | -------- |-------- |--------|
| negative | 862 | 22 | 40 |
| positive  |50  | 158 | 30|
| neutral  | 127 | 31| 144 |

Support Vector Machines

| | negative | positive | neutral|
| ---------- | -------- |-------- |--------|
| negative | 847 | 24 | 53 |
| positive  |44  | 161 | 33|
| neutral  | 115 | 38| 149 |

Naive Bayes

| | negative | positive | neutral|
| ---------- | -------- |-------- |--------|
| negative | 920 | 0 | 44 |
| positive  |178 | 55 | 5|
| neutral  | 253 | 4| 45 |

Gradient Boosting

| | negative | positive | neutral|
| ---------- | -------- |-------- |--------|
| negative | 888 | 28 | 8 |
| positive  | 96  | 138 | 4|
| neutral  | 240 | 25| 37 |

Random Forest 

| | negative | positive | neutral|
| ---------- | -------- |-------- |--------|
| negative | 864 | 18 | 42 |
| positive  |72  | 140 | 26|
| neutral  | 134 | 25| 143 |
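
Per-class figures can be derived directly from any of these matrices. As a sketch, the snippet below takes the logistic regression matrix reported above (assuming rows are true classes and columns are predicted classes, in the order negative, positive, neutral) and computes accuracy, recall and precision with NumPy.

```python
import numpy as np

# Logistic regression confusion matrix copied from the table above
# (rows: true class, columns: predicted class; order: negative, positive, neutral)
cm = np.array([[862,  22,  40],
               [ 50, 158,  30],
               [127,  31, 144]])

accuracy = np.trace(cm) / cm.sum()
recall = np.diag(cm) / cm.sum(axis=1)      # correct / all tweets truly in the class
precision = np.diag(cm) / cm.sum(axis=0)   # correct / all tweets predicted as the class

for name, r, p in zip(["negative", "positive", "neutral"], recall, precision):
    print(f"{name}: recall={r:.2f} precision={p:.2f}")
print(f"accuracy={accuracy:.3f}")
```

The overall accuracy matches the 79.5% in the summary table, while the neutral class has the lowest recall of the three, at a little under 50%.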



The deep learning techniques implemented were a feed-forward neural network, an LSTM model and a CNN model. Word2vec embeddings were used for these techniques.
The results for each technique are shown in the table below.

| Technique  | Accuracy | Precision | Recall | F1Score|
| ---------- | -------- |-------- |-------- |-------- |
| Feed Forward | 78.7%  | 0.79 | 0.79 | 0.79|
| LSTM  | 78%  | 0.78 | 0.78| 0.78|
| CNN  |74.6%  | 0.75 | 0.74 | 0.74|

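Confusion matrices. In the three tables below, rows are the predicted classes and columns are the true classes, so each column sums to the number of test tweets in that class.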

Feed Forward Neural Network

| | negative | positive | neutral|
| ---------- | -------- |-------- |--------|
| negative | 813 | 35 | 97 |
| positive  | 37  | 170 | 37|
| neutral  | 74 | 33| 168 |

Long Short Term Memory (LSTM)

| | negative | positive | neutral|
| ---------- | -------- |-------- |--------|
| negative | 800 | 23 | 93 |
| positive  |42  | 179 | 45|
| neutral  | 82 | 36| 164 |

Convolutional Neural Networks (CNN)

| | negative | positive | neutral|
| ---------- | -------- |-------- |--------|
| negative | 777 | 38 | 95 |
| positive  |60 | 165 | 61|
| neutral  | 87 | 35| 146 |


**Error Analysis**

* All the implemented techniques were better at classifying negative reviews, because the training data contains more negative reviews than neutral or positive ones.
* The highest error rate is in the neutral class, especially for the ML techniques, where only around 50% of neutral tweets were classified correctly; the deep learning models were better at the neutral class.
* The classifiers were much better at classifying sentences that contain obvious sentiment words, such as good, bad, awesome, terrible, and so on.
* Performance on short sentences was better overall than on long ones; however, deep learning techniques were better than ML algorithms at classifying long sentences.
* Machine learning techniques, especially logistic regression and SVM, performed at the same level as the deep learning techniques. The dataset is small, and deep learning techniques have more trainable parameters and hence require more data to perform better.
* Overfitting appears more in the deep learning techniques than in the machine learning techniques.
* Deep learning handles errors in the training set better than machine learning.
* Deep learning models need more tuning, as they have more parameters to vary and tune.
* Deep learning models, especially the sequential LSTM models, were better at handling words whose meaning varies with context, such as "blow" in "blow my mind" and "blow up".
* Padding the sequences negatively affects the learning of the DL models.

**Future Work**

* Trying BERT for sentence classification.
* Replacing word2vec with ELMo or fastText, which can handle context and subwords respectively.
* Optimizing and fine-tuning the deep learning techniques.