# NLP Techniques Lab

In this lab, we'll be practicing a set of advanced NLP techniques using tweets on airline satisfaction ([originally from Kaggle](https://www.kaggle.com/crowdflower/twitter-airline-sentiment/data)).

The first section asks you to perform LDA (Latent Dirichlet Allocation) on the dataset to summarize the body of tweets. The second section focuses on using this data to predict the sentiment of a given tweet.

Import the data as follows:

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report


In [4]:
# pandas is already imported as pd in the cell above
df = pd.read_csv('datasets/Tweets.csv')
print(df.shape)
df.head()

(14640, 15)


| | tweet_id | airline_sentiment | airline_sentiment_confidence | negativereason | negativereason_confidence | airline | airline_sentiment_gold | name | negativereason_gold | retweet_count | text | tweet_coord | tweet_created | tweet_location | user_timezone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 570306133677760513 | neutral | 1.0 | | | Virgin America | | cairdin | | 0 | @VirginAmerica What @dhepburn said. | | 2015-02-24 11:35:52 -0800 | | Eastern Time (US & Canada) |
| 1 | 570301130888122368 | positive | 0.3486 | | 0.0 | Virgin America | | jnardino | | 0 | @VirginAmerica plus you've added commercials t... | | 2015-02-24 11:15:59 -0800 | | Pacific Time (US & Canada) |
| 2 | 570301083672813571 | neutral | 0.6837 | | | Virgin America | | yvonnalynn | | 0 | @VirginAmerica I didn't today... Must mean I n... | | 2015-02-24 11:15:48 -0800 | Lets Play | Central Time (US & Canada) |
| 3 | 570301031407624196 | negative | 1.0 | Bad Flight | 0.7033 | Virgin America | | jnardino | | 0 | @VirginAmerica it's really aggressive to blast... | | 2015-02-24 11:15:36 -0800 | | Pacific Time (US & Canada) |
| 4 | 570300817074462722 | negative | 1.0 | Can't Tell | 1.0 | Virgin America | | jnardino | | 0 | @VirginAmerica and it's a really big bad thing... | | 2015-02-24 11:14:45 -0800 | | Pacific Time (US & Canada) |


Use this data to do the following:

#### 1. Use LDA to identify topics in the tweets

Pick a number of topics between 5 and 20 and use LDA to summarize the corpus of tweets. Print out the 25 highest-weighted words in each topic. Do the topics appear cohesive to you? What predominant trends can you find?

In [5]:
cv = CountVectorizer(stop_words='english')
cv.fit(df['text'])
X = cv.transform(df['text'])
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = cv.get_feature_names_out()

lda = LatentDirichletAllocation(n_components=10)
lda.fit(X)

results = pd.DataFrame(lda.components_,
                      columns=feature_names)

for topic in range(10):
    print('Topic', topic)
    word_list = results.T[topic].sort_values(ascending=False).index
    print(' '.join(word_list[0:25]), '\n')



Topic 0
usairways service customer air thank dca night team amp http travel ve rep amazing great twitter little awful usair flying heard request world complaint mco 

Topic 1
fly http love yes jfk good thanks right trying home won virginamerica info nice book able flying say looking ok flight sorry work dallas year 

Topic 2
weather united getting seats line way 10 times help ve amp flt morning having available reservation booked helpful hung instead upgrade like http 24 wife 

Topic 3
flight united hours late wait people did dm worst hrs phone time problems does boarding waiting min flightr just hour miss check refund booking email 

Topic 4
don guys ll staff want doesn amp called use seat trip free finally hope haven http just half gave possible question soon row plus wifi 

Topic 5
jetblue southwestair flight cancelled flightled flights change just flighted tomorrow online sent hours miles need think hotel like rebook new credit yesterday way phone rebooked 

Topic 6
united thanks d

#### Bonus LDA Question (Tackle if you have time / interest)

Calling `.transform()` on the fitted LDA model with the data you fed it returns a NumPy array of shape `(n_rows, n_topics)`. Each column holds the estimated probability that the row in question belongs to that topic, so the values in a row sum to 1. For example, if we were looking at a single row of data and an LDA model with three topics, we might see the following:

```python
lda.transform(row_of_data)
>> [[ 0.02, 0.97, 0.01 ]]
```

This would suggest that for that row of data, it is most likely to be in the second topic (compared to the first or third topic).

As a bonus challenge, try the following two questions:

1. For each topic, which tweet most exemplifies that topic (that is, which tweet is most likely to belong to it)?
2. Find a recent tweet directed at an airline that you have used. Can you use your current model to identify which topic it belongs to?
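One possible way to approach both bonus questions is sketched below. It is a minimal illustration, not a reference solution: the six `tweets` are made-up stand-ins for `df['text']`, and `n_components=3` and `random_state=0` are arbitrary choices for the toy data. The idea is to take the `argmax` down each column of the document-topic matrix for question 1, and to reuse the already-fitted vectorizer and model on new text for question 2.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy corpus standing in for df['text'] (not real tweets)
tweets = [
    "flight delayed two hours waiting at gate",
    "flight cancelled and delayed again waiting hours",
    "lost my bag baggage claim never found bag",
    "baggage lost again bag missing at claim",
    "great crew friendly service thanks",
    "thanks for the friendly service great crew",
]

cv = CountVectorizer(stop_words='english')
X = cv.fit_transform(tweets)

lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topics = lda.fit_transform(X)  # shape (n_tweets, n_topics)

# Question 1: for each topic (column), find the row with the largest weight
best_rows = doc_topics.argmax(axis=0)
for topic, row in enumerate(best_rows):
    print('Topic', topic, '->', tweets[row])

# Question 2: score an unseen tweet with the fitted vectorizer and model
new_tweet = ["my bag is lost at baggage claim"]  # made-up example text
new_topics = lda.transform(cv.transform(new_tweet))
print('Most likely topic:', new_topics.argmax(axis=1)[0])
```

On the real data, `tweets` would be replaced by `df['text']`, and the row indices in `best_rows` could be looked up with `df['text'].iloc[row]`.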

#### 2. Use NLP to predict the sentiment of tweets

In this section, please use any of the NLP techniques that we have covered over the last two days to best predict whether a tweet has a negative sentiment or not. Transformation code for your target variable is below.

**Bonus Consideration**: Outside of the text itself, do other factors in the dataset have an effect? Do your results change if you include features like the airline or the timezone of the tweet?
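For the bonus consideration, one way to combine the text with a categorical column is a `ColumnTransformer` that vectorizes `text` and one-hot encodes `airline`. The sketch below is only an illustration under assumptions: the `toy` DataFrame is invented data that reuses the column names from `Tweets.csv`, and the model settings are arbitrary.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame; column names mirror Tweets.csv, values are made up
toy = pd.DataFrame({
    'text': [
        "flight delayed three hours and no updates",
        "lost my bag and nobody helped",
        "great crew and smooth flight thanks",
        "thanks for the quick rebooking",
        "worst customer service ever",
        "loved the new seats and wifi",
        "cancelled flight and no refund",
        "friendly staff at the gate",
    ],
    'airline': ['United', 'United', 'Delta', 'Delta',
                'United', 'Virgin America', 'Delta', 'Virgin America'],
    'negative': [1, 1, 0, 0, 1, 0, 1, 0],
})

features = ColumnTransformer([
    # A bare column name passes a 1-D Series, which text vectorizers expect
    ('text', TfidfVectorizer(stop_words='english'), 'text'),
    # A list of names passes a 2-D frame, which OneHotEncoder expects
    ('airline', OneHotEncoder(handle_unknown='ignore'), ['airline']),
])

model = Pipeline([
    ('features', features),
    ('rfc', RandomForestClassifier(n_estimators=50, random_state=0)),
])

model.fit(toy[['text', 'airline']], toy['negative'])
preds = model.predict(toy[['text', 'airline']])
print(preds)
```

Comparing this model's test score against a text-only model would indicate whether the extra column helps. A column such as `user_timezone` could be added the same way, although it contains missing values in the preview above and would likely need those filled first.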

Don't forget to create a training and test set to compare your results. 

In [6]:
df['negative'] = (df['airline_sentiment'] == 'negative').astype(int)

In [7]:
X_train, X_test, y_train, y_test = train_test_split(df['text'],
                                                   df['negative'])


In [8]:
tfidf = TfidfVectorizer(stop_words='english')
tfidf.fit(X_train)
X_train_tf = tfidf.transform(X_train)


In [9]:
tsvd = TruncatedSVD(n_components=100)
tsvd.fit(X_train_tf)
X_train_tf_tsvd = tsvd.transform(X_train_tf)

In [10]:
rfc = RandomForestClassifier(n_estimators=100, max_depth=5)
rfc.fit(X_train_tf_tsvd, y_train)
train_predictions = rfc.predict(X_train_tf_tsvd)
print(rfc.score(X_train_tf_tsvd, y_train))
print(confusion_matrix(y_train, train_predictions))
print(classification_report(y_train, train_predictions))

0.757103825137
[[1749 2348]
 [ 319 6564]]
             precision    recall  f1-score   support

          0       0.85      0.43      0.57      4097
          1       0.74      0.95      0.83      6883

avg / total       0.78      0.76      0.73     10980



In [12]:
X_test_tf = tfidf.transform(X_test)
X_test_tf_tsvd = tsvd.transform(X_test_tf)


In [13]:
test_predictions = rfc.predict(X_test_tf_tsvd)

In [14]:
test_predictions

array([0, 1, 1, ..., 1, 0, 1])
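
The cells above stop at the raw test predictions, so the comparison between training and test performance is still open. The sketch below shows one way that comparison might look, using the same three steps (TF-IDF, truncated SVD, random forest) wrapped in a pipeline. It runs on a made-up toy corpus so that it is self-contained; the texts, labels, `n_components=2`, and the `random_state` values are all assumptions for illustration, and the numbers it prints say nothing about the airline data.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 1 = negative sentiment, 0 = not negative
texts = [
    "flight delayed for hours terrible service",
    "lost my bag again awful experience",
    "cancelled flight and rude staff",
    "worst airline delayed and cancelled",
    "terrible wait on hold for hours",
    "awful service lost luggage",
    "great flight friendly crew thanks",
    "thanks for the smooth trip",
    "love the new seats great service",
    "friendly staff and quick boarding",
    "amazing crew thanks for the help",
    "smooth flight love this airline",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0)

# The pipeline fits every step on the training split only,
# then reuses the fitted steps when scoring the test split
model = make_pipeline(
    TfidfVectorizer(stop_words='english'),
    TruncatedSVD(n_components=2, random_state=0),
    RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0),
)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print('train accuracy:', train_acc)
print('test accuracy: ', test_acc)

test_cm = confusion_matrix(y_test, model.predict(X_test), labels=[0, 1])
print(test_cm)
```

In the notebook itself, the equivalent check would be `rfc.score(X_test_tf_tsvd, y_test)` together with `confusion_matrix(y_test, test_predictions)` and `classification_report(y_test, test_predictions)`. A large gap between training and test scores would point to overfitting.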