# Homework 4

In our lab, we trained a model to distinguish negative tweets from positive tweets.

What if you worked for an airline and wanted to know *why* your airline is getting bad word of mouth on Twitter? You can probably track late flights and damaged luggage using the airline's own tracking system. But Twitter might give you clues otherwise unavailable about changes in customer service.

How reliably can we identify the reason for a negative tweet?

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_validate
from pathlib import Path

In [2]:
tweetpath = Path('../data/tweets/flight_sentiment.tsv')
tweets = pd.read_csv(tweetpath, sep ='\t')
print(tweets.shape)
tweets.head()

(14640, 5)


Unnamed: 0,airline_sentiment,negativereason,retweet_count,airline,text
0,neutral,,0,Virgin America,@VirginAmerica What @dhepburn said.
1,positive,,0,Virgin America,@VirginAmerica plus you've added commercials t...
2,neutral,,0,Virgin America,@VirginAmerica I didn't today... Must mean I n...
3,negative,Bad Flight,0,Virgin America,@VirginAmerica it's really aggressive to blast...
4,negative,Can't Tell,0,Virgin America,@VirginAmerica and it's a really big bad thing...


In [3]:
tweets = tweets.loc[~pd.isnull(tweets['negativereason']), : ]
tweets = tweets.reset_index(drop = True)  # try commenting this line out, and see how it
                                          # makes your task more difficult
tweets.shape

(9178, 5)

In [4]:
tweets['negativereason'].value_counts()

Customer Service Issue         2910
Late Flight                    1665
Can't Tell                     1190
Cancelled Flight                847
Lost Luggage                    724
Bad Flight                      580
Flight Booking Problems         529
Flight Attendant Complaints     481
longlines                       178
Damaged Luggage                  74
Name: negativereason, dtype: int64
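
To see how imbalanced the target class is, the counts above can be turned into a share. This is a minimal sketch that hard-codes the numbers from the `value_counts()` output rather than reading the data file; the variable names are illustrative only.

```python
# Counts copied from the value_counts() output above.
reason_counts = {
    "Customer Service Issue": 2910,
    "Late Flight": 1665,
    "Can't Tell": 1190,
    "Cancelled Flight": 847,
    "Lost Luggage": 724,
    "Bad Flight": 580,
    "Flight Booking Problems": 529,
    "Flight Attendant Complaints": 481,
    "longlines": 178,
    "Damaged Luggage": 74,
}

total = sum(reason_counts.values())
positive_share = reason_counts["Customer Service Issue"] / total

# Accuracy of a trivial model that always predicts "not a customer service issue".
majority_baseline = 1 - positive_share

print(total)
print(round(positive_share, 3), round(majority_baseline, 3))
```

Roughly a third of the negative tweets are customer service issues, so a model that never predicts that class would already be right about two thirds of the time. That is one reason to look at F1 and not only accuracy.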

## Assignment 1

Our mission is to construct a model that can identify "customer service issues" in future tweets about our airline.

We'll do that by building the most accurate LogisticRegression model we can on this dataset.

To be confident about accuracy, we need to go through all the steps we went through in the lab:

1. Create a term-doc matrix reporting the counts of the 4,000 most common words.

2. Factor out the length of tweets and make ```tweetlen``` a separate column.

3. Shuffle row order. Then set aside a test set of, say, 1,500 tweets for final evaluation of the model, and use the remaining tweets as the training set.

4. Scale the training set and test set separately, using StandardScaler.

5. Cross-validate a model on the training set, evaluating by f1 score, because we have imbalanced classes.

6. When we're confident that we have the best C parameter (increasing or reducing the parameter produces lower f1 scores), train a model on the *whole* training set using our C parameter.

7. Finally, report

     a. the F1 score on the test set,

     b. the accuracy on the test set, and

     c. the 10 features with the largest positive coefficients and the 10 features with the most negative coefficients in the model.


### 1. Create a term-doc matrix

In [5]:
vectorizer = CountVectorizer(max_features = 4000)
sparse_wordcounts = vectorizer.fit_transform(tweets.text)
wordcounts = sparse_wordcounts.toarray()
tweetwords = pd.DataFrame(wordcounts, columns = vectorizer.get_feature_names_out())
tweetwords.head()

Unnamed: 0,00,000,0016,00pm,02,03,05,05pm,08,10,...,yourselves,yousuck,yr,yrs,yuma,yyz,zero,zkatcher,zone,zurich
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### 2. Factor out the length of the tweets and make it a separate column

In [6]:
tweetlengths = tweets['text'].str.len()
tweetlengths[0:10]

0    126
1     55
2    135
3     45
4    137
5    130
6    139
7    112
8    140
9    125
Name: text, dtype: int64

In [7]:
wordfreqs = tweetwords.divide(tweetlengths, axis = 'rows')
wordfreqs['#tweetlen'] = tweetlengths
wordfreqs.head()

Unnamed: 0,00,000,0016,00pm,02,03,05,05pm,08,10,...,yousuck,yr,yrs,yuma,yyz,zero,zkatcher,zone,zurich,#tweetlen
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,126
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,55
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,135
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,45
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,137
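
The division step relies on pandas matching each row of counts to the length with the same index label. Here is a small sketch with made-up numbers; note that `.str.len()` counts characters, so the resulting values are word counts per character rather than per word.

```python
import pandas as pd

# Made-up counts for two words in three tweets.
toy_counts = pd.DataFrame({"hold": [1, 2, 0], "bag": [0, 0, 1]})
# Made-up tweet lengths in characters, like the output of .str.len().
toy_lengths = pd.Series([20, 40, 10])

# axis=0 pairs each row of counts with the length that has the same index label.
toy_freqs = toy_counts.divide(toy_lengths, axis=0)
toy_freqs["#tweetlen"] = toy_lengths
print(toy_freqs)
```

The first two toy tweets end up with the same relative frequency for "hold" even though the raw counts differ, which is the point of factoring length out.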


### 3a. Shuffle row order

In [8]:
wordfreqs = wordfreqs.sample(frac = 1)
wordfreqs.head()

Unnamed: 0,00,000,0016,00pm,02,03,05,05pm,08,10,...,yousuck,yr,yrs,yuma,yyz,zero,zkatcher,zone,zurich,#tweetlen
2751,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,74
4551,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,75
4525,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,140
1695,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,18
4781,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,133


In [9]:
reorderedtweets = tweets.loc[wordfreqs.index, : ]
reorderedtweets.head()

Unnamed: 0,airline_sentiment,negativereason,retweet_count,airline,text
2751,negative,Late Flight,0,United,@united I just need to get to RIC tonight. I'v...
4551,negative,Late Flight,0,Delta,"@JetBlue No, the flight wasn't until 9:51pm, b..."
4525,negative,Late Flight,0,Delta,@JetBlue this is awful! flight out of jfk for ...
1695,negative,Late Flight,0,United,@united no u don't
4781,negative,Damaged Luggage,0,Delta,@JetBlue so why do you put this at the bottom ...
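
The shuffle only works because the feature matrix and the tweet table share the same index labels, which is why the earlier `reset_index` matters. A minimal sketch with made-up rows follows; `random_state` is an addition here, not something the notebook sets, and it just makes the shuffle repeatable.

```python
import pandas as pd

# Made-up features and labels that share the default 0..3 index.
toy_features = pd.DataFrame({"hold": [0.05, 0.0, 0.02, 0.0]})
toy_labels = pd.DataFrame({
    "negativereason": [
        "Customer Service Issue",
        "Late Flight",
        "Customer Service Issue",
        "Lost Luggage",
    ]
})

# random_state is an assumption added for reproducibility.
shuffled = toy_features.sample(frac=1, random_state=0)

# .loc looks rows up by index label, so each label follows its features.
reordered = toy_labels.loc[shuffled.index, :]

print(list(shuffled.index) == list(reordered.index))
```

If the two tables had different index labels, `.loc` would either raise a `KeyError` or silently pick up the wrong rows.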


### 3b. Separate test and train sets

In [10]:
testfreqs = wordfreqs.iloc[0: 1500, : ]
test_y = (reorderedtweets['negativereason'].iloc[0: 1500] == 'Customer Service Issue').astype(int)  # try taking the last part out
test_y[0:10]

2751    0
4551    0
4525    0
1695    0
4781    0
94      0
4251    0
7312    0
2597    1
5865    0
Name: negativereason, dtype: int64

In [11]:
trainfreqs = wordfreqs.iloc[1500 : , : ]
train_y = (reorderedtweets['negativereason'].iloc[1500: ] == 'Customer Service Issue').astype(int)
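
As an aside, scikit-learn can shuffle and split in one call. This is a different route from the manual `sample` plus `iloc` approach used here, shown on made-up arrays; `stratify` and `random_state` are optional extras that the notebook does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 rows, 2 features, 6 positive labels.
X = np.arange(40).reshape(20, 2)
y = np.array([1] * 6 + [0] * 14)

# test_size=5 plays the role of the 1500 held-out tweets.
# stratify=y keeps the class mix similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=5, stratify=y, random_state=0
)

print(X_train.shape, X_test.shape)
```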

### 4. Scale test and training sets (separately)

In [12]:
trainscaler = StandardScaler()
trainXscaled = trainscaler.fit_transform(trainfreqs)
trainXscaled = pd.DataFrame(trainXscaled, columns = trainfreqs.columns)
trainXscaled.head()

Unnamed: 0,00,000,0016,00pm,02,03,05,05pm,08,10,...,yousuck,yr,yrs,yuma,yyz,zero,zkatcher,zone,zurich,#tweetlen
0,-0.03726,-0.045296,0.0,-0.019662,-0.023934,-0.016043,-0.029554,-0.015938,-0.018771,-0.117487,...,-0.016139,-0.04798,-0.021619,-0.018555,-0.036434,-0.058939,-0.015979,-0.025458,-0.019641,-2.703839
1,-0.03726,-0.045296,0.0,-0.019662,-0.023934,-0.016043,-0.029554,-0.015938,-0.018771,-0.117487,...,-0.016139,-0.04798,-0.021619,-0.018555,-0.036434,-0.058939,-0.015979,-0.025458,-0.019641,0.750763
2,-0.03726,-0.045296,0.0,-0.019662,-0.023934,-0.016043,-0.029554,-0.015938,-0.018771,-0.117487,...,-0.016139,-0.04798,-0.021619,-0.018555,-0.036434,-0.058939,-0.015979,-0.025458,-0.019641,-2.019436
3,-0.03726,-0.045296,0.0,-0.019662,-0.023934,-0.016043,-0.029554,-0.015938,-0.018771,-0.117487,...,-0.016139,-0.04798,-0.021619,-0.018555,-0.036434,-0.058939,-0.015979,-0.025458,-0.019641,0.718173
4,-0.03726,-0.045296,0.0,-0.019662,-0.023934,-0.016043,-0.029554,-0.015938,-0.018771,-0.117487,...,-0.016139,-0.04798,-0.021619,-0.018555,-0.036434,-0.058939,-0.015979,-0.025458,-0.019641,-0.031411


In [13]:
testscaler = StandardScaler()
testXscaled = testscaler.fit_transform(testfreqs)
testXscaled = pd.DataFrame(testXscaled, columns = testfreqs.columns)
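
For reference, `StandardScaler` subtracts each column's mean and divides by its standard deviation. The sketch below checks that on a made-up array. It also includes a constant column, which comes out as 0.0 after scaling; that is probably why the `0016` column shows 0.0 in the scaled training set above (presumably no training tweet contains that token).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up matrix; the third column is constant, like an all-zero word column.
toy = np.array([
    [1.0, 10.0, 0.0],
    [2.0, 10.0, 0.0],
    [3.0, 40.0, 0.0],
])

scaled = StandardScaler().fit_transform(toy)

# Manual z-scores for the two non-constant columns (population std, ddof=0).
manual = (toy[:, :2] - toy[:, :2].mean(axis=0)) / toy[:, :2].std(axis=0)

print(np.allclose(scaled[:, :2], manual))
print(scaled[:, 2])
```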

### 5. Cross-validate a model to infer the best C parameter

In [14]:
for c_param in [.0000001, .000001, .00001, .0001, .001, .01, .1, 1]:
    logist = LogisticRegression(C = c_param, max_iter = 1000, class_weight = 'balanced') 
    results = cross_validate(logist, trainXscaled, train_y, cv = 5, scoring = 'f1')
    print('C parameter:', c_param)
    print('Mean f1:', np.mean(results['test_score']))
    print()

C parameter: 1e-07
Mean f1: 0.6738497027636158

C parameter: 1e-06
Mean f1: 0.6863629781388656

C parameter: 1e-05
Mean f1: 0.68610337714619

C parameter: 0.0001
Mean f1: 0.686384863486676

C parameter: 0.001
Mean f1: 0.6593552843445292

C parameter: 0.01
Mean f1: 0.62146665663984

C parameter: 0.1
Mean f1: 0.5895596231571372

C parameter: 1
Mean f1: 0.569826780156079



### 6. Train a model using that parameter on the whole training set, and apply it to the test set.

In [15]:
logist = LogisticRegression(C = .000001, max_iter = 1000, class_weight = 'balanced') 
logist.fit(trainXscaled, train_y)

# Now apply it to test

predictions = logist.predict(testXscaled)

tp = sum((test_y == 1) & (predictions == 1))
fp = sum((test_y == 0) & (predictions == 1))
fn = sum((test_y == 1) & (predictions == 0))

precision = tp / (tp + fp)
recall = tp / (tp + fn)

F1 = 2 * (precision * recall) / (precision + recall)

print("precision: ", round(precision, 4))
print("recall: ", round(recall, 4))
print("F1: ", round(F1, 4))

precision:  0.7267
recall:  0.7034
F1:  0.7149
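
The manual precision, recall, and F1 arithmetic can be cross-checked against `f1_score`, which was imported at the top but not used. This sketch uses made-up labels and predictions, not the model's actual output.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up true labels and predictions.
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0, 1, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))
fp = np.sum((y_true == 0) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))

precision = tp / (tp + fp)
recall = tp / (tp + fn)
manual_f1 = 2 * (precision * recall) / (precision + recall)

# f1_score scores the positive class (label 1) by default.
library_f1 = f1_score(y_true, y_pred)

print(manual_f1, library_f1)
```

On the real data, the equivalent call would be `f1_score(test_y, predictions)`.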


### 7a. F1 Score

The F1 score on the test set is 0.7149.

In [16]:
accuracy = sum(test_y == predictions) / len(predictions)
accuracy

0.8133333333333334

### 7b. Accuracy

The accuracy on the test set is 0.8133.

In [18]:
logist = LogisticRegression(C = .000001, max_iter = 1000, class_weight = 'balanced')
logist.fit(trainXscaled, train_y)
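# get_feature_names_out() replaces get_feature_names(), which recent scikit-learn releases removed.
# zip() stops at the shorter input, so the coefficient for the extra '#tweetlen' column is left out.
coefficients = [x for x in zip(logist.coef_[0], vectorizer.get_feature_names_out())]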
coefficients.sort()

### 7c. Here are the ten words with the most negative coefficients (most predictive that a tweet is *not* a "customer service" issue), followed by the ten with the largest positive coefficients.

In [19]:
coefficients[0:10]

[(-0.0006963088170583416, 'flight'),
 (-0.0005041535424915817, 'delayed'),
 (-0.00046407500899668826, 'plane'),
 (-0.0004277034703547659, 'cancelled'),
 (-0.0003772824009206396, 'bag'),
 (-0.00037669505892600814, 'flightled'),
 (-0.00035140748839154035, 'delay'),
 (-0.00034053505721752997, 'luggage'),
 (-0.0003163258532086148, 'bags'),
 (-0.00031359433958540314, 'gate')]

In [20]:
coefficients[-10: ]

[(0.00036662389298425626, 'calling'),
 (0.00039501262374505987, 'answer'),
 (0.0004244331204068788, 'response'),
 (0.000433050034459522, 'speak'),
 (0.00048301917573489973, 'hung'),
 (0.0007296293558995294, 'call'),
 (0.0007672187685251073, 'phone'),
 (0.0010248655906291584, 'hold'),
 (0.0010800412304275667, 'service'),
 (0.001155599700703033, 'customer')]
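
An alternative way to rank coefficients is to label them with the training columns in a pandas Series, which keeps every feature (including `#tweetlen`) in the ranking. The names and values below are hypothetical stand-ins for `trainXscaled.columns` and `logist.coef_[0]`.

```python
import pandas as pd

# Hypothetical column names and coefficients, for illustration only.
toy_columns = ["customer", "flight", "hold", "delayed", "#tweetlen"]
toy_coefs = [0.9, -0.7, 0.8, -0.5, 0.1]

# Sort ascending: most negative first, most positive last.
ranked = pd.Series(toy_coefs, index=toy_columns).sort_values()

print(ranked.head(2))  # most negative coefficients
print(ranked.tail(2))  # largest positive coefficients
```

With the real model this would look like `pd.Series(logist.coef_[0], index=trainXscaled.columns).sort_values()`, assuming the model was fit on `trainXscaled` as above.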