# US Airline Twitter Sentiment Analysis using Logistic Regression

#### In this notebook, we will implement a simple logistic regression model to classify a tweet as negative, positive, or neutral. 

This is a sentiment analysis task about the problems travelers reported with each major U.S. airline. The Twitter data was scraped in February 2015, and contributors were asked first to classify tweets as positive, negative, or neutral, and then to categorize the reasons behind negative tweets (such as "late flight" or "rude service").

The dataset is publicly available on [Kaggle](https://www.kaggle.com/crowdflower/twitter-airline-sentiment) and [Data.world](https://data.world/crowdflower/airline-twitter-sentiment).

The goal is not to achieve the highest accuracy but to demonstrate a logistic regression model.


In [1]:
import numpy as np 
import pandas as pd
import csv 
%matplotlib inline

## Step 1: Load the data

In [157]:
datapath = ("/Users/susanchen/Desktop/Airline_sentiment/Airline-Sentiment-2.csv")


In [158]:
# Custom function that reads a CSV file and extracts the tweets and labels
def preprocess_data(datafile):
    df = pd.read_csv(datafile)
    tweets = np.array(df.text)
    labels = np.array(df.airline_sentiment)
    return tweets, labels 

In [159]:
tweets, labels = preprocess_data(datapath)

In [217]:
tweets

array(['@VirginAmerica What @dhepburn said.',
       "@VirginAmerica plus you've added commercials to the experience... tacky.",
       "@VirginAmerica I didn't today... Must mean I need to take another trip!",
       ...,
       '@AmericanAir Please bring American Airlines to #BlackBerry10',
       "@AmericanAir you have my money, you change my flight, and don't answer your phones! Any other suggestions so I can make my commitment??",
       '@AmericanAir we have 8 ppl so we need 2 know how many seats are on the next flight. Plz put us on standby for 4 people on the next flight?'],
      dtype=object)
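
Before cleaning, it can help to check how the labels are distributed. The following is a minimal sketch that uses a small made-up DataFrame in place of the real CSV. The rows are hypothetical; only the column names `text` and `airline_sentiment` come from the notebook.

```python
import pandas as pd

# Hypothetical stand-in for the real CSV, with the two columns used in this notebook.
toy_df = pd.DataFrame({
    "text": ["@united thanks!", "@united late again", "@united ok", "@united lost my bag"],
    "airline_sentiment": ["positive", "negative", "neutral", "negative"],
})

# Count how many tweets carry each label.
label_counts = toy_df["airline_sentiment"].value_counts()
print(label_counts)

# Share of each label, useful for spotting class imbalance.
label_share = toy_df["airline_sentiment"].value_counts(normalize=True)
print(label_share.round(2))
```

Running the same two calls on the real DataFrame would show whether one sentiment dominates the dataset.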

## Step 2: Clean up the data

#### We start by removing links, symbols, and @mentions from each tweet

In [17]:
import re
# Removes @mentions, links, and symbols (hashtag words are kept, without the '#')
def cleanTweet(tweetArray):

    punctuation = '!"#$%&\'()*+,-./:;<=>?[\\]^_`{|}~'
    # Join all tweets so punctuation can be stripped in one pass, then split again
    all_tweets = 'separator'.join(tweetArray)
    all_tweets = all_tweets.lower()
    all_text = ''.join([t for t in all_tweets if t not in punctuation])

    tweets_split = all_text.split('separator')

    clean_tweets = []
    for t in tweets_split:
        # Remove any links. Note: punctuation was stripped above, so 'https://'
        # no longer appears and links survive as e.g. 'httpstco...'
        clean_tweet = re.sub(r'https?://[A-Za-z0-9./]+', '', t)
        # Remove @mentions (including @Airline handles)
        clean_tweet = re.sub(r'@\w*', '', clean_tweet)
        clean_tweets.append(clean_tweet)

    return clean_tweets
    

In [161]:
clean_tweets = cleanTweet(tweets)

In [218]:
# look at the first ten clean tweets 
clean_tweets[:10]

[' what  said',
 ' plus youve added commercials to the experience tacky',
 ' i didnt today must mean i need to take another trip',
 ' its really aggressive to blast obnoxious entertainment in your guests faces amp they have little recourse',
 ' and its a really big bad thing about it',
 ' seriously would pay 30 a flight for seats that didnt have this playing\nits really the only bad thing about flying va',
 ' yes nearly every time i fly vx this ‰ыпear worm‰ыќ won‰ыєt go away ',
 ' really missed a prime opportunity for men without hats parody there httpstcomwpg7grezp',
 ' well i didnt‰ыbut now i do d',
 ' it was amazing and arrived an hour early youre too good to me']
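
The output above still contains a link remnant (`httpstcomwpg7grezp`), because punctuation is stripped before the link pattern runs. One possible alternative, sketched here under the assumption that links and mentions should be removed first, is shown below. The function name `clean_tweet_sketch` and the sample tweet are hypothetical and not part of the original notebook, and this version was not used to produce any of the results shown.

```python
import re
import string

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")

def clean_tweet_sketch(tweet):
    """Lowercase a tweet and strip links, @mentions, and ASCII punctuation."""
    text = tweet.lower()
    # Strip links and mentions first, while ':', '/' and '@' are still present.
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    # Drop ASCII punctuation; this removes the '#' of a hashtag but keeps its word.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse the runs of whitespace left behind by the removals.
    return " ".join(text.split())

sample = "@VirginAmerica Really missed it! http://t.co/mWpG7grEZP #fail"
print(clean_tweet_sketch(sample))  # really missed it fail
```

Working on one tweet at a time also avoids joining all tweets with a `'separator'` marker, which would split incorrectly if a tweet happened to contain that word.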

#### Stopwords (such as 'the', 'a', 'an', 'in', 'for') carry little sentiment and take up processing time and memory, so it is common practice to remove them.

#### NLTK (Natural Language Toolkit) for Python has lists of stopwords stored in 16 languages. We only need the English list, as our tweets are in English.

In [160]:
import nltk
# to download stopwords
nltk.download('stopwords')
from nltk.corpus import stopwords


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/susanchen/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [83]:
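def removeStopWords(tweetArray):
    # Build the stopword set once instead of on every iteration
    stop_words = set(stopwords.words('english'))
    # Note: each whole tweet string is compared against the stopword set,
    # so stopwords inside a tweet are left in place (see the output below)
    tokens = [t for t in tweetArray if t not in stop_words]
    return np.array(tokens)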

In [383]:
tokens = removeStopWords(clean_tweets)

In [384]:
tokens

array([' what  said',
       ' plus youve added commercials to the experience tacky',
       ' i didnt today must mean i need to take another trip', ...,
       ' please bring american airlines to blackberry10',
       ' you have my money you change my flight and dont answer your phones any other suggestions so i can make my commitment',
       ' we have 8 ppl so we need 2 know how many seats are on the next flight plz put us on standby for 4 people on the next flight'],
      dtype='<U168')
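
The output above still contains words such as "to", "the", and "i", because the function compares whole tweets rather than individual words. A word-level version could look like the sketch below. The names `STOPWORDS_SKETCH` and `remove_stopwords_sketch` are hypothetical, and the tiny hard-coded set stands in for `stopwords.words('english')` so that the example runs without downloading anything. This version was not used for the results in this notebook.

```python
# A tiny illustrative stopword set; NLTK's English list could be substituted here.
STOPWORDS_SKETCH = {"the", "a", "an", "in", "for", "to", "i", "and"}

def remove_stopwords_sketch(tweets, stopword_set=STOPWORDS_SKETCH):
    """Drop stopwords from each tweet, keeping exactly one output string per tweet."""
    cleaned = []
    for tweet in tweets:
        # Split on whitespace, filter word by word, then rebuild the tweet.
        kept = [word for word in tweet.split() if word not in stopword_set]
        cleaned.append(" ".join(kept))
    return cleaned

result = remove_stopwords_sketch([" plus youve added commercials to the experience tacky"])
print(result)  # ['plus youve added commercials experience tacky']
```

Keeping one output string per input tweet matters here, because the tweets must stay aligned with their labels.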

#### Encode the labels as -1 (negative), 0 (neutral), 1 (positive)

In [164]:
# 1=positive, 0=neutral, -1=negative label conversion
encoded_labels = []
for label in labels:
    if label == 'neutral':
        encoded_labels.append(0)
    elif label == 'negative':
        encoded_labels.append(-1)
    else:
        encoded_labels.append(1)

encoded_labels = np.asarray(encoded_labels)
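
The same mapping can be written as a dictionary lookup. This is only a sketch: the name `LABEL_TO_INT` and the sample labels are made up for illustration.

```python
import numpy as np

# Explicit mapping from sentiment string to integer code.
LABEL_TO_INT = {"negative": -1, "neutral": 0, "positive": 1}

sample_labels = np.array(["neutral", "negative", "positive", "negative"])
sample_encoded = np.array([LABEL_TO_INT[label] for label in sample_labels])
print(sample_encoded)  # [ 0 -1  1 -1]
```

One difference in behavior: the loop above treats any label other than 'neutral' or 'negative' as positive, while the dictionary raises a `KeyError` on an unexpected label, which makes typos in the data easier to catch.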


#### Encode tweets

In [268]:
from sklearn.feature_extraction.text import CountVectorizer
def vect(token):
    vectorizer = CountVectorizer(analyzer='word', lowercase=True)
    features = vectorizer.fit_transform(token).toarray()
    return features


In [332]:
features = vect(tokens)
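
To make the encoding concrete, here is a small sketch of what `CountVectorizer` produces on three made-up documents. The documents and the variable names are hypothetical. Each column corresponds to one word of the learned vocabulary, and each cell holds how often that word occurs in the document.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["late flight again", "great flight", "late late bag"]

toy_vectorizer = CountVectorizer(analyzer="word", lowercase=True)
toy_counts = toy_vectorizer.fit_transform(toy_docs)  # sparse matrix, one row per document

# vocabulary_ maps each word to its column index; sort the words by that index.
toy_vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(toy_vocab)             # ['again', 'bag', 'flight', 'great', 'late']
print(toy_counts.toarray())  # rows: [1 0 1 0 1], [0 0 1 1 0], [0 1 0 0 2]
```

Note that the column layout depends entirely on the documents the vectorizer was fitted on, so a vectorizer fitted on different text produces columns with a different meaning.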

## Step 3: Training a Logistic Regression Model on the data

### Part 1

In [307]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test  = train_test_split(features, encoded_labels, train_size=0.80, test_size =.20, random_state=1)

#### Use GridSearchCV to find the optimal parameters 

In [308]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
param_grid = {'C': [0.001, 0.01, 0.1, .5, 1, 10]}
grid = GridSearchCV(LogisticRegression(), param_grid, cv=5, scoring= 'accuracy')
grid.fit(X_train, y_train)

print("Best cross-validation score: {:.2f}".format(grid.best_score_))
print("Best parameters: ", grid.best_params_)
print("Best estimator: ", grid.best_estimator_)

Best cross-validation score: 0.80
Best parameters:  {'C': 0.5}
Best estimator:  LogisticRegression(C=0.5)


#### Using 5-fold cross-validation, we find that the optimal C is 0.5 and the best cross-validation accuracy is 80%. We could stop here, but for demonstration purposes, I will also show how to fit the model without Grid Search.

### Part 2 
#### Start by instantiating a Logistic Regression model and then training it. Here we set C to the optimal value of 0.5. C is the inverse of the regularization strength, so a smaller value means stronger regularization.
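
The effect of C can be illustrated on synthetic data. The sketch below is not part of the original analysis: the dataset comes from `make_classification` and the variable names are made up. It fits the same model with three values of C and compares the size of the learned coefficients.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, used only to illustrate the role of C.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=1)

coef_norms = {}
for C in (0.01, 0.5, 10):
    clf = LogisticRegression(C=C, max_iter=1000).fit(X_demo, y_demo)
    # The L2 norm of the coefficients shrinks as regularization gets stronger.
    coef_norms[C] = float(np.linalg.norm(clf.coef_))
    print("C =", C, "coefficient norm =", round(coef_norms[C], 3))
```

With the default L2 penalty, the smallest C should give the smallest coefficient norm, which is what "stronger regularization" means in practice.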

In [310]:
log_model = LogisticRegression(C=.5, random_state =1, multi_class = 'multinomial', solver = 'lbfgs')
model = log_model.fit(X=X_train, y=y_train)

In [311]:
y_pred = model.predict(X_test)

In [312]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))

0.7971311475409836


#### Our accuracy score on the test set is 0.7971, or about 80%. This is consistent with the cross-validation score from Grid Search.

### Part 3: Using Pipelines and Cross-Validation

In [313]:
from sklearn.pipeline import Pipeline
my_pipeline = Pipeline(steps=[("model", log_model)])
my_pipeline.fit(X_train, y_train)

Pipeline(steps=[('model',
                 LogisticRegression(C=0.5, multi_class='multinomial',
                                    random_state=1))])

In [314]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
log_model2 = LogisticRegression(C=.5, random_state =1, multi_class = 'multinomial', solver = 'lbfgs')
#model2 = log_model2.fit(X=X_train, y=y_train)
#y_pred2 = model2.predict(X_test)

pipe = Pipeline(steps=[("model", log_model2)])
#pipe.fit(X_train, y_train)
#pipe.score(X_test, y_test)

In [315]:
scores2 = cross_val_score(pipe, X=features, y=encoded_labels, cv=5, scoring='accuracy')

In [316]:
print('Accuracy scores: \n \n', scores2)
print("Model 2 Average accuracy (across experiments):")
print(scores2.mean().round(4))

Accuracy scores: 
 
 [0.78995902 0.78722678 0.79849727 0.81352459 0.79678962]
Model 2 Average accuracy (across experiments):
0.7972


#### Again, the accuracy scores are all very close to 0.8.
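
A pipeline becomes more useful when it also contains the vectorizer, because each cross-validation fold then learns its vocabulary from its own training portion only. The following is a rough sketch on twelve made-up tweets. The texts, labels, step names, and the choice of `cv=2` are all assumptions made to keep the example small.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Hypothetical toy data: four tweets per class.
toy_texts = [
    "flight delayed again", "lost my bag", "rude service at the gate", "worst airline ever",
    "what time is boarding", "is there wifi on board", "which gate for my flight", "do you fly to denver",
    "great crew thank you", "love this airline", "smooth flight thanks", "best service ever",
]
toy_labels = ["negative"] * 4 + ["neutral"] * 4 + ["positive"] * 4

text_pipe = Pipeline(steps=[
    ("vectorizer", CountVectorizer()),                    # raw text -> word counts
    ("model", LogisticRegression(C=0.5, max_iter=1000)),  # word counts -> label
])

# The vectorizer is refitted inside each fold, on that fold's training texts only.
toy_scores = cross_val_score(text_pipe, toy_texts, toy_labels, cv=2, scoring="accuracy")
print(toy_scores)
```

The scores on such a tiny dataset are not meaningful; the point is only the structure of the pipeline.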

## Step 4: Make predictions on random tweets

### Randomly pick tweets from the test set and predict their labels

In [318]:
import random
j = random.randint(0,len(X_test)-5)
for i in range(j,j+5):
    print('Prediction Label:','{:d}'.format(y_pred[i]))
    # Find the original row with the same feature vector (first match if several tweets share it)
    ind = features.tolist().index(X_test[i].tolist())
    print(tweets[ind].strip())
    print('\n')

Prediction Label: -1
@USAirways What is with your lost &amp; found.  My wife lost her phone in Philly, &amp; followed your web instructions to call an 800#.  No answer.


Prediction Label: 1
@united thx for update


Prediction Label: 0
@JetBlue #489. Flight #589 is departing before we even board


Prediction Label: -1
@AmericanAir your a LIAR, no precipitation all day, 47 degrees, don't pacify, you look like a fool.......thats the problem, admit you suck


Prediction Label: 1
@AmericanAir thanks... I finally got through this afternoon.  :)
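
Looking up each test row by searching the full feature matrix works, but it is slow and returns the first match when two tweets have identical features. An alternative, sketched here on made-up arrays, is to pass an array of row numbers through `train_test_split` so that every test row carries its original index. All names in this block are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows with 2 features each.
toy_features = np.arange(20).reshape(10, 2)
toy_targets = np.array([0, 1] * 5)
row_ids = np.arange(len(toy_features))

# train_test_split accepts any number of arrays and splits them consistently.
X_tr, X_te, y_tr, y_te, idx_tr, idx_te = train_test_split(
    toy_features, toy_targets, row_ids, test_size=0.2, random_state=1
)

# idx_te[i] is the original row number of X_te[i].
for i, original_row in enumerate(idx_te):
    print(original_row, X_te[i], toy_features[original_row])
```

With this approach, the original tweet for test row `i` would be `tweets[idx_te[i]]`, with no search needed.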




#### As you can see, the model predicts most of these tweets correctly, though not all. The third tweet, "@JetBlue #489. Flight #589 is departing before we even board", was labeled neutral, which could be right depending on the context. Most readers, however, would say that this tweet expresses more negative than neutral feeling.
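
A single accuracy number hides which classes get confused with each other. A confusion matrix makes this visible. The sketch below uses made-up true and predicted labels, not results from this notebook.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, using the same -1 / 0 / 1 encoding as above.
demo_true = [-1, -1, -1, 0, 0, 1, 1, -1]
demo_pred = [-1, -1, 0, 0, -1, 1, 1, -1]

# Rows are true classes, columns are predicted classes, in the order given by `labels`.
demo_cm = confusion_matrix(demo_true, demo_pred, labels=[-1, 0, 1])
print(demo_cm)
# [[3 1 0]
#  [1 1 0]
#  [0 0 2]]
```

In this toy example, the off-diagonal entries show one negative tweet predicted as neutral and one neutral tweet predicted as negative. Calling `confusion_matrix(y_test, y_pred, labels=[-1, 0, 1])` would give the same breakdown for the real model.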
 

### Test the Model on made-up tweets

In [337]:
def pad_features(features, seq_length):
    ''' Return features of test tweet, where each tweet feature is padded with 0's 
        or truncated to the input seq_length. 
    '''
    # getting the correct rows x cols shape
    features2 = np.zeros((len(features), seq_length), dtype=int)

    # for each test feature, fill with the correct values
    for i, row in enumerate(features):
        features2[i, -len(row):] = np.array(row)[:seq_length]
    
    return features2

In [395]:
test_tweets =[]
pos = 'this flight was on time and I just love flying with #Delta :)'
neg = '@AmericanAirlines do not fly with American Airlines they are the worst airline and their flights are always late'
neut= "time to board in 30 minutes"
test_tweets.append(pos)
test_tweets.append(neg)
test_tweets.append(neut)

test_clean = cleanTweet(test_tweets)
test_token = removeStopWords(test_clean)
test_features = pad_features(vect(test_token), 16107)


print("prediction: {}". format(model.predict(test_features)))


prediction: [1 1 0]


#### Running the prediction on made-up tweets, we can see that the model labeled the negative tweet as positive rather than negative. The most likely reason is the way the test tweets are encoded: `vect` fits a brand-new CountVectorizer on just these three tweets, so its columns correspond to a different vocabulary than the one the model was trained on, and padding the result to 16107 columns does not line the words up with the training features. Class imbalance can also skew predictions toward the majority class, so it is worth checking the label counts as well.
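
For predictions on new text to be meaningful, the new tweets need to be encoded with the same vocabulary the model was trained on. A minimal sketch of that idea follows, assuming the fitted vectorizer is kept around and reused through `transform`. The training texts, labels, and variable names are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical training data.
fit_texts = [
    "worst airline always late", "lost my bag again", "rude crew and late flight",
    "love flying with you", "great flight on time", "thanks for the smooth flight",
]
fit_labels = [-1, -1, -1, 1, 1, 1]

# Fit the vectorizer once, on the training texts, and keep it.
kept_vectorizer = CountVectorizer()
X_fit = kept_vectorizer.fit_transform(fit_texts)
toy_model = LogisticRegression(C=0.5, max_iter=1000).fit(X_fit, fit_labels)

# Encode new tweets with transform (not fit_transform), so the columns match.
new_texts = ["they are the worst and always late", "i just love this flight"]
X_new = kept_vectorizer.transform(new_texts)
print(X_new.shape[1] == X_fit.shape[1])  # True

predicted = toy_model.predict(X_new)
print(predicted)
```

Words that never appeared in the training texts are simply ignored by `transform`, so no padding step is needed.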