In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report

# NLP Techniques Lab

In this lab, we'll be practicing a set of advanced NLP techniques using tweets on airline satisfaction ([originally from Kaggle](https://www.kaggle.com/crowdflower/twitter-airline-sentiment/data)).

The first section asks you to perform LDA on the dataset to summarize the body of tweets. The second section will focus on using this data to predict the sentiment of a given tweet.

Import the data as follows:

In [2]:
df = pd.read_csv('datasets/Tweets.csv')
print(df.shape)
df.head()

(14640, 15)


Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


Use this data to do the following:

#### 1. Use LDA to identify topics in the tweets

Pick a number of topics between 5 and 20 and use LDA to summarize the corpus of tweets. Print out the top 25 words (ranked by topic weight) in each topic. Do the topics appear cohesive to you? What predominant trends can you find?

In [3]:
cv = CountVectorizer(stop_words='english')
cv.fit(df['text'])
X = cv.transform(df['text'])
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = cv.get_feature_names_out()

lda = LatentDirichletAllocation(n_components=10)
lda.fit(X)

results = pd.DataFrame(lda.components_,
                      columns=feature_names)

for topic in range(10):
    print('Topic', topic)
    word_list = results.T[topic].sort_values(ascending=False).index
    print(' '.join(word_list[0:25]), '\n')



Topic 0
ve trying change days seats 30 10 book times doesn miles website info time air minutes make ago credit online travel way need called available 

Topic 1
guys aa jfk want response line said better great delays free http status making terrible hung try year san start looking gave away option asked 

Topic 2
gate tonight waiting agent answer clt understand isn departure time ground delayed sitting worse 200 busy waited tarmac wouldn counting changes cool automated hope las 

Topic 3
united plane new hrs crew won http airlines rude fleek does phl able sitting come say update voucher sorry fail planes stop american fly horrible 

Topic 4
americanair flight usairways cancelled united help hours hold flightled thanks amp just need bag flights got late today phone tomorrow weather did number flighted day 

Topic 5
united airline home bags worst http bad customers staff amp time sure says tell fleet pay really ve experience app ok money flying helpful good 

Topic 6
jetblue southwestair

```
Instructor answer:

There do appear to be some trends: some topics discuss baggage or flight delays, while others seem to focus on just one or two specific airlines.
```

#### Bonus LDA Question (Tackle if you have time / interest)

Calling the `.transform()` method of the fitted LDA model on the data you fed it returns a numpy array of shape `(n_rows, n_topics)`. The value in each column is the estimated probability that the row in question belongs to that topic, so each row sums to 1. For example, if we were looking at a single row of data and an LDA model with three topics, we might see the following:

```python
lda.transform(row_of_data)
>> [[ 0.02, 0.97, 0.01 ]]
```

This would suggest that the row of data most likely belongs to the second topic (rather than the first or third).
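To make the shape and meaning of the `.transform()` output concrete, here is a minimal sketch on a made-up corpus. The documents and the names `toy_docs`, `toy_cv`, `toy_lda`, and `doc_topics` are hypothetical and only stand in for the tweets; the point is that each row is a distribution over topics, so `argmax` along each axis answers "which topic for this row?" and "which row for this topic?".

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy corpus standing in for the tweets
toy_docs = [
    "flight delayed two hours at the gate",
    "lost my bag and nobody can find the bag",
    "great crew and friendly service thanks",
    "flight cancelled and delayed again",
    "bag missing after the flight landed",
    "thanks for the great service today",
]

toy_cv = CountVectorizer(stop_words='english')
toy_X = toy_cv.fit_transform(toy_docs)

toy_lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topics = toy_lda.fit_transform(toy_X)  # shape (n_rows, n_topics)

# Each row is a probability distribution over the topics
row_sums = doc_topics.sum(axis=1)

# Most likely topic for each row (argmax across columns)
dominant_topic = doc_topics.argmax(axis=1)

# Row that loads most heavily on each topic (argmax down rows)
best_row_per_topic = doc_topics.argmax(axis=0)

print(doc_topics.shape)
print(dominant_topic)
print(best_row_per_topic)
```

With only six short documents the topics themselves are not meaningful; the sketch is only meant to show how to read the array.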

As a bonus challenge, try the two following questions:

1. For each topic, which tweet most exemplifies it (that is, which tweet is most likely to belong to that topic)?
2. Find a recent tweet at an airline that you have used. Can you use your current model to identify which topic it belongs to?

In [4]:
# Bonus question 1

topic_names = ['topic %s' % topic for topic in range(10)]

results = pd.DataFrame(lda.transform(X),
                      columns=topic_names)
joined = df[['tweet_id', 'text']].join(results)
for topic in topic_names:
    print(topic)
    print(joined.sort_values(by=topic, ascending=False)[['text', topic]].head(1).values)
    print('\n')

topic 0
[[ '@VirginAmerica did you know that suicide is the second leading cause of death among teens 10-24'
  0.9249971729973999]]


topic 1
[[ '@VirginAmerica you guys messed up my seating.. I reserved seating with my friends and you guys gave my seat away ... 😡 I want free internet'
  0.9399980799833922]]


topic 2
[[ '@united - tick, tock, tick, tock, it is rapidly approaching the next dream departure of 1:45pm. When is the next fantasy departure time??'
  0.78405427730716]]


topic 3
[[ '@united how can you not know the weight of our plane after us sitting on the plane for 2.5 hrs? Not convinced your company is safe for flt.'
  0.9249936866887448]]


topic 4
[[ '@VirginAmerica can u help this 👸 @FreyaBevan_Fund needs urgent treatment in🇺🇸2y old battling cancer could u help with flights 💗#freyasfund'
  0.9357105009036262]]


topic 5
[[ '@VirginAmerica has getaway deals through May, from $59 one-way. Lots of cool cities http://t.co/tZZJhuIbCH #CheapFlights #FareCompare'
  0.93076406

In [5]:
# Bonus question 2

tweet = ['@SouthwestAir screwed up my booking and now my gramma can\'t make it home for Thanksgiving . We booked a Senior Fare in her name and it got swtiched to mine. Plase help us get maw-maw home #southwestairlines']

tweet_transformed = cv.transform(tweet)
results = pd.DataFrame(lda.transform(tweet_transformed),
                      columns=topic_names)
print(results)

# In the output shown here, topic 4 has the largest weight (about 0.45).
# LDA is not seeded above, so topic numbering can differ between runs.

    topic 0   topic 1   topic 2   topic 3   topic 4  topic 5   topic 6  \
0  0.007694  0.007692  0.161533  0.100286  0.453558  0.16154  0.084616   

    topic 7   topic 8   topic 9  
0  0.007694  0.007694  0.007692  


#### 2. Use NLP to predict the sentiment of tweets

In this section, please use any of the NLP techniques that we have covered over the last two days to best predict whether a tweet has a negative sentiment or not. Transformation code for your target variable is below.

**Bonus Consideration**: Outside of the text itself, do other factors in the dataset have an effect? Do your results change if you include features like the airline or the timezone of the tweet?
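One possible way to approach the bonus consideration is to combine the text features with one-hot encoded metadata in a single feature matrix. The sketch below uses scikit-learn's `ColumnTransformer` on a small made-up DataFrame; the rows, the name `features`, and the choice to fill missing timezones with the placeholder `'unknown'` are all assumptions for illustration, not part of the lab's solution.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows that mimic three columns of the tweets dataset
toy = pd.DataFrame({
    'text': ["flight delayed again", "great crew thanks",
             "lost my bag", "worst airline ever"],
    'airline': ['United', 'Virgin America', 'United', 'Delta'],
    'user_timezone': ['Eastern Time (US & Canada)', None,
                      'Pacific Time (US & Canada)',
                      'Eastern Time (US & Canada)'],
})

# Assumption: treat a missing timezone as its own category
toy_filled = toy.fillna({'user_timezone': 'unknown'})

features = ColumnTransformer([
    # A bare column name passes a 1-D Series, which the vectorizer expects
    ('text', TfidfVectorizer(stop_words='english'), 'text'),
    # A list of names passes a 2-D frame, which the encoder expects
    ('meta', OneHotEncoder(handle_unknown='ignore'),
     ['airline', 'user_timezone']),
])

X_combined = features.fit_transform(toy_filled)
print(X_combined.shape)
```

On the real data, the transformer would be fit on the training rows only and then applied to the test rows, so that the comparison with the text-only model stays fair.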

Don't forget to create a training and test set to compare your results. 
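As a small illustration of the split itself, the sketch below uses made-up labels to show two optional arguments of `train_test_split`: `stratify`, which keeps the class balance roughly equal in both sets, and `random_state`, which makes the split repeatable. The arrays `toy_texts` and `toy_labels` are hypothetical, and the seed value is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 60 negative (1) and 40 non-negative (0) examples
toy_texts = np.array(['tweet %d' % i for i in range(100)])
toy_labels = np.array([1] * 60 + [0] * 40)

txt_train, txt_test, lab_train, lab_test = train_test_split(
    toy_texts, toy_labels,
    test_size=0.25,        # same as the default used when omitted
    stratify=toy_labels,   # preserve the share of negatives in both sets
    random_state=42,       # arbitrary seed so the split is reproducible
)

# The share of negative labels should be similar in both sets
print(len(txt_train), len(txt_test))
print(lab_train.mean(), lab_test.mean())
```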

In [6]:
df['negative'] = df['airline_sentiment'].apply(lambda x: 1 if x=='negative' else 0)

In [7]:
X_train, X_test, y_train, y_test = train_test_split(df['text'],
                                                   df['negative'])

tfidf = TfidfVectorizer(stop_words='english')
tfidf.fit(X_train)
X_train_tf = tfidf.transform(X_train)

tsvd = TruncatedSVD(n_components=100)
tsvd.fit(X_train_tf)
X_train_tf_tsvd = tsvd.transform(X_train_tf)

rfc = RandomForestClassifier(n_estimators=100, max_depth=5)
rfc.fit(X_train_tf_tsvd, y_train)
train_predictions = rfc.predict(X_train_tf_tsvd)
print(rfc.score(X_train_tf_tsvd, y_train))
print(confusion_matrix(y_train, train_predictions))
print(classification_report(y_train, train_predictions))

X_test_tf = tfidf.transform(X_test)
X_test_tf_tsvd = tsvd.transform(X_test_tf)
test_predictions = rfc.predict(X_test_tf_tsvd)
print(rfc.score(X_test_tf_tsvd, y_test))
print(confusion_matrix(y_test, test_predictions))
print(classification_report(y_test, test_predictions))

0.755009107468
[[1687 2390]
 [ 300 6603]]
             precision    recall  f1-score   support

          0       0.85      0.41      0.56      4077
          1       0.73      0.96      0.83      6903

avg / total       0.78      0.76      0.73     10980

0.738524590164
[[ 556  829]
 [ 128 2147]]
             precision    recall  f1-score   support

          0       0.81      0.40      0.54      1385
          1       0.72      0.94      0.82      2275

avg / total       0.76      0.74      0.71      3660
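
One way to put the test accuracy in context is to compare it with a baseline that always predicts the majority class. The sketch below copies the test-set confusion matrix printed above (rows are true labels, columns are predictions) and derives both numbers from it; the variable names are illustrative.

```python
import numpy as np

# Test-set confusion matrix copied from the printed output
# (rows: true class 0 / 1, columns: predicted class 0 / 1)
cm_test = np.array([[556, 829],
                    [128, 2147]])

total = cm_test.sum()

# Accuracy: correct predictions sit on the diagonal
accuracy = np.trace(cm_test) / total

# Majority-class baseline: always predict the most common true label
majority_baseline = cm_test.sum(axis=1).max() / total

print(round(accuracy, 4), round(majority_baseline, 4))
```

The gap between the two values gives a rough sense of how much the model adds beyond simply guessing "negative" for every tweet.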

