## Abstract

The airline industry is a highly competitive market that has grown rapidly over the past two decades. Airlines still rely on traditional customer feedback forms, which are tedious and time-consuming to collect and process. Twitter data is a good alternative source: customer tweets can be gathered and run through sentiment analysis.

### Problem statement

Here I look at Twitter data about US airlines and predict which sentiment class (negative, positive, or neutral) each customer tweet belongs to, then measure how accurate those predictions are. This is a classic sentiment analysis task in machine learning.

In [1]:
# Standard library
import os
import re
import string

# Third-party libraries
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import WordPunctTokenizer
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC, LinearSVC
from textblob import TextBlob
from textblob.classifiers import NaiveBayesClassifier
%matplotlib inline

# Style
sns.set(font_scale=1.5)
plt.style.use('seaborn-pastel')
plt.style.use('seaborn-poster')

In [2]:
df = pd.read_csv('tweets.csv')

FileNotFoundError: [Errno 2] No such file or directory: 'tweets.csv'

In [None]:
df.head(13)

## Explore Data 

## Counting tweets for each sentiment class

The counts below show that negative is the most common sentiment class in tweets about US airlines.

In [None]:
df1 = df['airline_sentiment'].value_counts()
df1

## Overall sentiments for US airline

The bar chart confirms that negative tweets are the most common.

In [None]:
# Take the labels from the index of value_counts() so each bar gets its own class name
plt.bar(df1.index, df1.values)
plt.ylabel('sentiment count')
plt.xlabel('sentiment class')

As the graph above shows, the sentiments shared about US airlines are largely negative. This could be due to a number of factors.

## Most Sentiments shared for each airline

In [None]:
pd.crosstab(df.airline, df.airline_sentiment)

In [None]:

#Pie chart of tweets frequency for each airline
labels = ['United','US Airways','American','Southwest','Delta','Virgin America']
sizes = [0.261, 0.198, 0.188, 0.165, 0.152, 0.0344]
colors = ['#ff9999','#66b3ff','#99ff99','#ffcc99', '#33ccff', '#ff6600']
fig1, ax1 = plt.subplots(figsize=(6.5, 6))
ax1.pie(sizes, colors=colors, labels=labels, autopct='%1.1f%%', startangle=90)

centre_circle = plt.Circle((0, 0), 0.70, fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)

ax1.axis('equal')  
plt.tight_layout()
plt.title('Tweets Frequency by Airline', fontsize=18)
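The pie chart above uses hard-coded shares. As a hedged alternative, the shares could be derived from the data itself, so that labels and sizes cannot drift apart. The sketch below uses a small made-up table (`toy`) in place of the real tweets; only the `airline` column name comes from the notebook.

```python
import pandas as pd

# Toy stand-in for the tweets table; the notebook would use df['airline'] instead
toy = pd.DataFrame({'airline': ['United', 'United', 'Delta', 'American']})

# normalize=True returns each airline's share of all tweets, largest first
shares = toy['airline'].value_counts(normalize=True)
labels = shares.index.tolist()
sizes = shares.tolist()

print(dict(zip(labels, sizes)))
```

The resulting `labels` and `sizes` can be passed to `ax1.pie` in place of the literal lists.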

### Observation 
United Airlines has the most negative tweets, but it is also the airline people tweet about most often. One possible reason for the volume of negative feedback is therefore that United is the most frequently used airline in the United States (possibly due to affordability). We can further hypothesise that, because the airline is affordable, it offers fewer services; the chart of negative reasons later in the notebook is consistent with this.

## Reason for people sharing negative sentiments for each airline

In [None]:
# Count negative tweets per (reason, airline) pair
types = df.groupby("negativereason")['airline'].value_counts(normalize=False).sort_index()
# stacked expects a boolean, not the string 'True'
types.unstack().plot(kind='barh', stacked=True)
plt.legend(bbox_to_anchor=(1.5, 1), loc='upper right')
plt.xlabel('Number of tweets')
plt.title('Number of negative tweets by reason and airline')

### Observation 

As seen previously, United Airlines receives the most negative tweets. My hypothesis was that it is an affordable airline and therefore does not offer many customer-friendly services. The graph above shows that United accounts for a large share of the negative reasons, and that its biggest issue is customer service. This agrees with my initial hypothesis that United is the most frequently used airline, possibly because of its affordability, and that as a result it offers fewer customer-friendly services. Customer service issues include waiting a long time on the phone to speak to a consultant, not getting a timely response to an email about an issue, and customer support allocating tickets to the wrong department.

## Top 5 negative reasons 

In [None]:
df.negativereason.value_counts().sort_values(ascending=False).head(5)

## Preprocessing Data 

In this code block I wrote a script to clean the data. Cleaner input should help the models generalise and improve their performance.

The column that needs cleaning is the text feature (the tweets in which customers share their sentiment), as it contains many stop words and unnecessary punctuation.

In [None]:
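# Patterns for Twitter handles (which may contain underscores), URLs and bare "www." links
username = r'@[A-Za-z0-9_]+'
url = r'https?://[^ ]+'
link = r'www\.[^ ]+'
combined_p = '|'.join((username, url, link))
neg_dic = {"isn't":"is not", "aren't":"are not", "wasn't":"was not", "weren't":"were not",
                "haven't":"have not","hasn't":"has not","hadn't":"had not","won't":"will not",
                "wouldn't":"would not", "don't":"do not", "doesn't":"does not","didn't":"did not",
                "can't":"can not","couldn't":"could not","shouldn't":"should not","mightn't":"might not",
                "mustn't":"must not"}
# Raw strings so that \b is a regex word boundary rather than a backspace character
neg = re.compile(r'\b(' + '|'.join(neg_dic.keys()) + r')\b')
tok = WordPunctTokenizer()
# Build the stop-word set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))

def tweet_cleaner(text):
    """Strip handles and links, expand negations, and drop punctuation and stop words."""
    stripped = re.sub(combined_p, '', text)
    lower_case = stripped.lower()
    neg_handled = neg.sub(lambda x: neg_dic[x.group()], lower_case)
    # Replace everything that is not a letter with a space
    letters = re.sub("[^a-zA-Z]", " ", neg_handled)
    words = [x for x in tok.tokenize(letters) if len(x) > 1]
    # Note: NLTK's English stop-word list contains 'not', so expanded negations are dropped here
    drop_stopwords = [x for x in words if x not in stop_words]
    return (" ".join(drop_stopwords)).strip()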
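Word boundaries are easy to get wrong in Python regular expressions: in an ordinary string literal `'\b'` is the backspace character, and only in a raw string is `r'\b'` passed through to the regex engine as a word boundary. The following sketch, which uses a two-entry subset of the negation dictionary and a made-up sentence, shows the difference.

```python
import re

neg_dic = {"isn't": "is not", "can't": "can not"}
keys = '|'.join(neg_dic.keys())

plain = re.compile('\b(' + keys + ')\b')   # '\b' here is a literal backspace character
raw = re.compile(r'\b(' + keys + r')\b')   # r'\b' is a regex word boundary

text = "this isn't fun and i can't wait"
plain_out = plain.sub(lambda m: neg_dic[m.group(1)], text)
raw_out = raw.sub(lambda m: neg_dic[m.group(1)], text)

print(plain_out)  # unchanged, because the text contains no backspace characters
print(raw_out)    # negations expanded
```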

In [None]:
# Apply the cleaner to every tweet; the result stays aligned with the DataFrame index
df['clean_text'] = df['text'].apply(tweet_cleaner)

In [None]:
df

In [None]:
list(df)

In [None]:
df = df.drop(['text'], axis =1)

In [None]:
df

## Analysing the amount of missing values


A data set with many missing values makes it hard for a model to generalise. To deal with this, I computed the percentage of missing values in each column and dropped the columns with the highest percentages, since they carry too little information to be useful. For columns with few missing values, I used imputation.
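The drop-or-impute decision can be sketched on a toy table. The column names and the 50% cut-off below are assumptions for illustration only; the notebook itself picks the columns to drop by inspecting the percentages.

```python
import numpy as np
import pandas as pd

# Toy table with one mostly-empty column, one lightly-missing column and one complete column
toy = pd.DataFrame({
    'mostly_missing': [np.nan, np.nan, np.nan, 1.0],
    'few_missing': [0.5, np.nan, 0.7, 0.9],
    'complete': ['a', 'b', 'c', 'd'],
})

# Percentage of missing values per column
percent_missing = toy.isnull().mean() * 100

# Hypothetical cut-off: drop columns that are more than half empty
threshold = 50
to_drop = percent_missing[percent_missing > threshold].index
cleaned = toy.drop(columns=to_drop)

# Impute the remaining numeric gaps with the column mean
cleaned['few_missing'] = cleaned['few_missing'].fillna(cleaned['few_missing'].mean())

print(percent_missing)
print(cleaned)
```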

In [None]:
df.shape

## Percentage of missing values 

In [None]:
percent_missing = df.isnull().sum() * 100 / len(df)
missing_value_df = pd.DataFrame({'percent_missing': percent_missing})
                                 

In [None]:
missing_value_df

In [None]:
df = df.drop(['airline_sentiment_gold','negativereason_gold','tweet_coord'],axis =1 )

In [None]:
df.head()

In [None]:
# Assign the result back instead of using inplace=True on a column selection
df['negativereason_confidence'] = df['negativereason_confidence'].fillna(df['negativereason_confidence'].mean())

In [None]:
from numpy import nan
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=nan, strategy='most_frequent')
df.negativereason = imputer.fit_transform(df['negativereason'].values.reshape(-1,1))[:,0]
df.tweet_location = imputer.fit_transform(df['tweet_location'].values.reshape(-1,1))[:,0]
df.user_timezone  = imputer.fit_transform(df['user_timezone'].values.reshape(-1,1))[:,0]

In [None]:
percent_missing = df.isnull().sum() * 100 / len(df)
missing_value_df = pd.DataFrame({'percent_missing': percent_missing})
missing_value_df

In [None]:
df

In [None]:
df2 = df.copy()

In [None]:
from sklearn.preprocessing import LabelEncoder
# Create an instance of LabelEncoder
labelencoder = LabelEncoder()
# Assign a numerical code to each sentiment class and store it in a new column
df2['airline_sentiment_en'] = labelencoder.fit_transform(df2['airline_sentiment'])
df2

## Train - validation split 

The encoder assigns the codes in alphabetical order of the class names: negative = 0, neutral = 1, positive = 2.
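The mapping between sentiment names and integer codes can be read back from the encoder rather than memorised. A minimal sketch, using a short made-up list of labels:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(['positive', 'negative', 'neutral', 'negative'])

# classes_ holds the original labels, sorted; the position of each label is its code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}

print(mapping)
print(codes.tolist())
```

`encoder.inverse_transform` converts predicted codes back to the sentiment names when reporting results.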

In [None]:
# Separate features and target variable
y = df2['airline_sentiment_en']
X = df2['clean_text']

In [None]:
vectorizer = TfidfVectorizer(ngram_range=(1,2), min_df=2, stop_words="english")
X_vectorized = vectorizer.fit_transform(X)

In [None]:
# Split the train data to create validation dataset
X_train,X_val,y_train,y_val = train_test_split(X_vectorized,y,test_size=.1,shuffle=True, stratify=y, random_state=11)#changed test size to 0.1 from 0.3
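Fitting the TF-IDF vectorizer on all tweets before splitting lets the validation tweets influence the vocabulary and the IDF weights. One common way to avoid this, sketched here as an alternative rather than as the notebook's method, is to split the raw text first and wrap the vectorizer and the model in a `Pipeline`. The texts and labels below are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy corpus: four tweets per class (0 = negative, 1 = neutral, 2 = positive)
texts = [
    'flight delayed again terrible', 'lost my luggage awful service',
    'rude staff cancelled flight', 'worst airline experience ever',
    'what time does boarding start', 'is there wifi on this flight',
    'which gate for the flight', 'how do i change my seat',
    'great crew lovely flight', 'thanks for the smooth trip',
    'amazing service love this airline', 'best flight ever thank you',
]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]

# Split the raw text first so the vectorizer never sees the validation tweets
train_texts, val_texts, y_tr, y_va = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=11)

pipe = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2))),
    ('clf', LinearSVC()),
])
pipe.fit(train_texts, y_tr)        # vocabulary and IDF weights come from training data only
val_pred = pipe.predict(val_texts)

print(list(val_pred))
```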

## Predict how customers feel about US airlines

In [None]:
!pip install scikit-plot

## Random Forest 

Random forest is a supervised machine learning algorithm that is commonly used for classification problems. It builds multiple decision trees during training and takes the majority decision of the trees as the final prediction.
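The majority-vote idea can be illustrated without any library. The per-tree predictions below are made up; note also that scikit-learn's `RandomForestClassifier` averages the trees' class probabilities rather than counting hard votes, so this is a sketch of the concept, not of that implementation.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the most common label among the per-tree predictions for one sample."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical predictions of five trees for a single tweet
votes = ['negative', 'neutral', 'negative', 'positive', 'negative']
winner = majority_vote(votes)

print(winner)
```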

In [None]:
from scikitplot.metrics import plot_roc, plot_confusion_matrix
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score, classification_report, confusion_matrix
import time
modelstart = time.time()
# bootstrap must be a boolean, not the string 'False';
# 'sqrt' is what 'auto' meant for classifiers, and recent scikit-learn releases reject 'auto'
rf = RandomForestClassifier(n_estimators=100, max_features='sqrt', bootstrap=False)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_val)
rf_f1 = round(f1_score(y_val, y_pred, average='weighted'),2)
# Metrics take the true labels first, then the predictions
print('Accuracy %s' % accuracy_score(y_val, y_pred))
print("Model Runtime: %0.2f seconds"%((time.time() - modelstart)))
report = classification_report(y_val, y_pred, output_dict=True)
pd.DataFrame(report).transpose()

Recall: of all the tweets that truly belong to a class, the fraction we predicted correctly. It should be as high as possible. We got the highest recall for neutral sentiments, which is a good sign for our class prediction; the lowest was for negative sentiments.

Precision: of all the tweets we predicted as belonging to a class, the fraction that actually belong to it. It should also be as high as possible.

F1-Score: the harmonic mean of precision and recall, which summarises both in a single number.
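To make the three definitions concrete, the sketch below computes them by hand for one class at a time on a small made-up set of true and predicted labels (0 = negative, 1 = neutral, 2 = positive). The helper name `per_class_metrics` is hypothetical; `classification_report` reports the same per-class quantities.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # correctly predicted as label
    fp = sum(t != label and p == label for t, p in pairs)  # wrongly predicted as label
    fn = sum(t == label and p != label for t, p in pairs)  # label missed by the model
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 2, 2, 2]

for label in (0, 1, 2):
    print(label, per_class_metrics(y_true, y_pred, label))
```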

In [None]:
plot_confusion_matrix(y_val, y_pred, normalize=True,figsize=(8,8),cmap='winter_r')
plt.show()

## Decision Tree

A decision tree builds a classification or regression model as a tree structure: the data set is split into smaller and smaller subsets, with each internal node testing a feature and each branch leading to a further split or a final prediction. Decision trees can handle both categorical and numerical data.



In [None]:
from sklearn.tree import DecisionTreeClassifier
modelstart = time.time()
# Use a separate name so the random forest model is not overwritten
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
y_pred = dt.predict(X_val)
dt_f1 = round(f1_score(y_val, y_pred, average='weighted'),2)
print('Accuracy %s' % accuracy_score(y_val, y_pred))
print("Model Runtime: %0.2f seconds"%((time.time() - modelstart)))
report = classification_report(y_val, y_pred, output_dict=True)
pd.DataFrame(report).transpose()

In [None]:
plot_confusion_matrix(y_val, y_pred, normalize=True,figsize=(8,8),cmap='winter_r')
plt.show()

## Gradient Boost 

Gradient boosting is a supervised learning algorithm in which a strong predictor is built in an additive, sequential manner from weak predictors, typically decision trees. It is used for both classification and regression tasks.

It is an ensemble learning algorithm, based on the well-founded idea of sequentially combining weak predictors, each one correcting the errors of those before it, to build a strong predictor.
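The sequential idea can be sketched in a few lines. For simplicity this illustration uses regression with squared error on synthetic data, where each stump (a depth-1 tree) is fitted to the residuals of the current ensemble; `GradientBoostingClassifier` follows the same additive scheme but with a classification loss. The data, learning rate and number of rounds are arbitrary choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-dimensional regression problem
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0])

learning_rate = 0.5
prediction = np.full_like(y, y.mean())        # start from a constant model
errors = [np.mean((y - prediction) ** 2)]

for _ in range(20):
    residuals = y - prediction                 # what the ensemble still gets wrong
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residuals)
    prediction += learning_rate * stump.predict(X)   # add the weak learner's correction
    errors.append(np.mean((y - prediction) ** 2))

print(round(errors[0], 4), round(errors[-1], 4))
```

Each round adds a weak predictor, and the training error shrinks as the corrections accumulate.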

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
modelstart = time.time()
gb_model = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1)
gb_model.fit(X_train, y_train)
y_pred = gb_model.predict(X_val)
gb_model_f1 = round(f1_score(y_val, y_pred, average='weighted'),2)
print('Accuracy %s' % accuracy_score(y_val, y_pred))
print("Model Runtime: %0.2f seconds"%((time.time() - modelstart)))
report = classification_report(y_val, y_pred, output_dict=True)
pd.DataFrame(report).transpose()

In [None]:
plot_confusion_matrix(y_val, y_pred, normalize=True,figsize=(8,8),cmap='winter_r')
plt.show()

## LinearSVC 

The idea behind an SVM is simple: the algorithm finds a line or hyperplane that separates the data into classes.
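For a linear SVM the hyperplanes are available directly on the fitted model. The sketch below, on synthetic three-class data from `make_blobs`, reproduces the model's scores from `coef_` and `intercept_`: with more than two classes `LinearSVC` learns one hyperplane per class and predicts the class with the highest score.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

# Synthetic data: three clusters in two dimensions
X, y = make_blobs(n_samples=90, centers=3, n_features=2, random_state=0)
clf = LinearSVC().fit(X, y)

# One row of coef_ and one intercept per class: score = w . x + b
scores = X @ clf.coef_.T + clf.intercept_
manual_pred = clf.classes_[np.argmax(scores, axis=1)]

print(clf.coef_.shape, scores.shape)
```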

In [None]:
from sklearn.svm import LinearSVC
modelstart = time.time() 
linsvc = LinearSVC()
linsvc.fit(X_train, y_train)
y_pred = linsvc.predict(X_val)
linsvc_f1 = round(f1_score(y_val, y_pred, average='weighted'),2)
print('Accuracy %s' % accuracy_score(y_val, y_pred))
print("Model Runtime: %0.2f seconds"%((time.time() - modelstart)))
report = classification_report(y_val, y_pred, output_dict=True)
results = pd.DataFrame(report).transpose()

results

In [None]:
plot_confusion_matrix(y_val, y_pred, normalize=True,figsize=(8,8),cmap='winter_r')
plt.show()

## Performance Evaluation 

The models are compared on their weighted F1-scores.

In [None]:
# Compare Weighted F1-Scores Between Models
fig,axis = plt.subplots(figsize=(10, 5))
# Keep the labels in the same order as the scores
model_names = ['Random Forest Classifier','DecisionTreeClassifier','Linear SVC','Gradient Boosting Classifier']
f1_scores = [rf_f1,dt_f1,linsvc_f1,gb_model_f1]
ax = sns.barplot(x=model_names, y=f1_scores,palette=("Blues_d"))
plt.title('Weighted F1-Score Per Classification Model',fontsize=14)
plt.xticks(rotation=90)
plt.ylabel('Weighted F1-Score')
for p in ax.patches:
    ax.text(p.get_x() + p.get_width()/2, p.get_y() + p.get_height(), round(p.get_height(),2), fontsize=12, ha="center", va='bottom')
    
plt.show()

## Hyperparameter tuning 

## Random forest 

n_estimators = number of trees in the forest


max_features = max number of features considered for splitting a node


max_depth = max number of levels in each decision tree


min_samples_split = min number of data points placed in a node before the node is split


min_samples_leaf = min number of data points allowed in a leaf node


bootstrap = method for sampling data points (with or without replacement)

The Scikit-Learn documentation on random forests tells us that the most important settings are the number of trees in the forest (n_estimators) and the number of features considered when splitting a node (max_features). So we are going to try a wide range of values, see what works best, and then adjust the set of parameters.

## RandomizedSearchCV vs GridSearchCV 

RandomizedSearchCV: evaluates only a fixed number of hyperparameter combinations (n_iter), sampled at random from the search space


GridSearchCV: evaluates every possible combination of the listed hyperparameter values
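The difference in cost is easy to see by enumerating the candidates each strategy would try. A small sketch using scikit-learn's `ParameterGrid` and `ParameterSampler` helpers on an illustrative search space:

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

space = {
    'n_estimators': [10, 50, 100],
    'max_features': ['sqrt', 'log2'],
    'bootstrap': [True, False],
}

# Grid search: every combination, 3 * 2 * 2 candidates
grid_candidates = list(ParameterGrid(space))

# Randomized search: only n_iter combinations drawn at random from the same space
random_candidates = list(ParameterSampler(space, n_iter=5, random_state=42))

print(len(grid_candidates), len(random_candidates))
```

With cross-validation, each candidate is fitted once per fold, so the number of model fits is the candidate count multiplied by the number of folds.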

## K-fold Cross-validation

In K-fold cross-validation the training data set is divided into K equally sized folds. Each fold is used once as validation data while the remaining K-1 folds are used for training, and the K scores are averaged. This is a way to use the training data we have as fully as possible.
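The fold mechanics can be checked on a toy label vector. The sketch below uses `StratifiedKFold`, which also keeps the class proportions similar in every fold; the labels are made up.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels: 10 negative, 5 neutral, 5 positive
y = np.array([0] * 10 + [1] * 5 + [2] * 5)
X = np.arange(len(y)).reshape(-1, 1)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
validation_counts = np.zeros(len(y), dtype=int)
fold_sizes = []

for train_idx, val_idx in skf.split(X, y):
    validation_counts[val_idx] += 1            # track how often each sample is held out
    fold_sizes.append((len(train_idx), len(val_idx)))

print(fold_sizes)
print(validation_counts)
```

Every sample ends up in the validation part of exactly one fold and in the training part of the other four.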



In [None]:
from sklearn.ensemble import RandomForestClassifier

# Run RandomizedSearchCV to tune the hyper-parameters
from sklearn.model_selection import RandomizedSearchCV
# The target is a sentiment class, so tune a classifier rather than a regressor
rfc = RandomForestClassifier()
k_fold_cv = 5 # an integer cv gives stratified 5-fold cross validation for classifiers
params = {
 # n_estimators must be an integer, so None is not a valid choice
 'n_estimators' : [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110],
 # 'auto' is no longer accepted by recent scikit-learn releases
 'max_features' : ['sqrt', 'log2'],
 'bootstrap' : [True, False]
 }
random_search = RandomizedSearchCV(rfc, param_distributions=params, cv=k_fold_cv,
 n_iter = 5, scoring='accuracy',verbose=2, random_state=42,
 n_jobs=-1, return_train_score=True)
random_search.fit(X_train, y_train)
print('Best hyper parameter:', random_search.best_params_)

## GridSearchCV 

In [None]:
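# Run GridSearch to tune the hyper-parameters
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
# The target is a sentiment class, so tune a classifier rather than a regressor
rfc = RandomForestClassifier()
k_fold_cv = 5 # an integer cv gives stratified 5-fold cross validation for classifiers
grid_params = {
 'n_estimators' : [10, 50,100],
 # 'auto' is no longer accepted by recent scikit-learn releases
 'max_features' : ['sqrt', 'log2'],
 'bootstrap' : [True, False]
 }
grid = GridSearchCV(rfc, param_grid = grid_params, cv=k_fold_cv, scoring='accuracy', verbose=0, n_jobs =1, return_train_score=True)

grid.fit(X_train, y_train)

print('Best hyper parameter:', grid.best_params_)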

Using RandomizedSearchCV, we observed that the random forest classifier improved by about one percentage point (1.05%): the initial model accuracy was 72%, and after applying the new hyperparameters it moved up to 73%. Although, from research, we know that RandomizedSearchCV works especially well with large data sets, the concept of hyperparameter tuning was put to the test and yielded a positive outcome.