# Exercise Notebook on Natural Language Processing

`nltk` also provides access to a dataset of tweets from Twitter, it includes a set of tweets already classified as negative or positive.

In this exercise notebook we would like to replicate the sentiment analysis classification performed on the movie reviews corpus on this dataset.

## Exercise 1: Download and inspect the twitter_samples dataset

First we want to download the dataset and inspect it:

In [1]:
import nltk

In [2]:
# DO NOT MODIFY

nltk.download("twitter_samples")
from nltk.corpus import twitter_samples

[nltk_data] Downloading package twitter_samples to
[nltk_data]     /Users/smd/nltk_data...
[nltk_data]   Unzipping corpora/twitter_samples.zip.


First let's check the common `fileids` method of `nltk` corpora:

In [3]:
twitter_samples.fileids()

['negative_tweets.json', 'positive_tweets.json', 'tweets.20150430-223406.json']

The twitter_samples object has a `tokenized()` method that returns all tweets from a fileid already individually tokenized. Read its documentation and use it to find the number of positive and negative tweets.

In [15]:
number_of_positive_tweets = None
### BEGIN SOLUTION
number_of_positive_tweets = len(twitter_samples.tokenized('positive_tweets.json'))
number_of_positive_tweets
### END SOLUTION

5000

In [16]:
number_of_negative_tweets = None
### BEGIN SOLUTION
number_of_negative_tweets = len(twitter_samples.tokenized('negative_tweets.json'))
number_of_negative_tweets
### END SOLUTION

5000

In [17]:
# DO NOT MODIFY

assert number_of_positive_tweets == 5000, "Make sure you are counting the number of tweets, not the number of words"

## Exercise 2: Build a bag-of-words model function

As in the lecture, we can build a bag-of-words model to train our machine learning algorithm.

In [19]:
import string

First step we define a list of words that we want to filter out of our dataset:

In [25]:
useless_words = nltk.corpus.stopwords.words("english") + list(string.punctuation)

In [84]:
test_words = ['name','age','testable','new']

def build_bag_of_words_features_filtered(words):
    """Build a bag of words model"""
    ### BEGIN SOLUTION
    bag = {word:True for word in words if word not in useless_words}
    return bag
    ### END SOLUTION

In [30]:
assert len(build_bag_of_words_features_filtered(["what", "the", "?", ","]))==0, "Make sure we are filtering out both stopwords and punctuation"

0


## Exercise 3: create a list of all words

Before performing sentiment analysis, let's first inspect the dataset a little bit more by creating a list of all words.

In [31]:
# create a list of all the words from each data set, and append them to a master list of words
words = []
for dataset in ["positive_tweets.json", "negative_tweets.json"]:
    for tweet in twitter_samples.tokenized(dataset):
        words.extend(tweet)

In [50]:
L1 = [1,2,3,4]
L2 = [5,6,7,8,9]
L1.extend(L2)
print(L1)
del L1[1]
print(L1)

[1, 2, 3, 4, 5, 6, 7, 8, 9]
[1, 3, 4, 5, 6, 7, 8, 9]


Study the code above, see that it is a case of nested loop, for each dataset we are looping through each tweet. Also notice we are using `extend`, how does it differ from `append`? Try it on a simple case, or read the documentation or Google for it!

Now let's filter out punctuation and stopwords:

In [70]:
filtered_words = []
### BEGIN SOLUTION
filtered_words = [w for w in words if w not in useless_words]
print(len(filtered_words))
### END SOLUTION

84672


First we want to filter out `useless_words` as defined in the previous section, this will reduce the lenght of the dataset by more than a factor of 2:

In [71]:
# DO NOT MODIFY 

assert len(filtered_words) == 84672, "Make sure that the filtering is applied correctly"

## Exercise 4: find the most common words


The `collection` package of the standard library contains a `Counter` class that is handy for counting frequencies of words in our list:

In [72]:
# DO NOT MODIFY 

from collections import Counter

counter = Counter(filtered_words)

It also has a `most_common()` method to access the words with the higher count:

In [75]:
most_common_words = None
### BEGIN SOLUTION
most_common_words = counter.most_common()[:10]
print(most_common_words)
### END SOLUTION

[(':(', 4586), (':)', 3693), ('I', 2477), (':-)', 701), (':D', 658), ('...', 622), (':-(', 501), ("I'm", 456), ('like', 402), ('u', 392)]


In [76]:
assert most_common_words[0][0] == ":(", "The most common word should be :("
assert len(most_common_words) == 10, "Make sure you are only getting the first 10"

## Exercise 5: Build the features for machine learning

Using our `build_bag_of_words_features` function we can build separately the negative and positive features.

The format of the positive features should be:

    [
        ( { "here":1, "some":1, "words":1 }, "pos" ),
        ( { "another":1, "tweet":1}, "pos" )
    ]
    
It is a list of tuples, the first element is a dictionary of the words with 1 if that word appears, the second the "pos" or "neg" string.

In [85]:
negative_features = None
### BEGIN SOLUTION
negative_features = [(build_bag_of_words_features_filtered(tweet), 'neg')\
                     for tweet in twitter_samples.tokenized('negative_tweets.json')]
### END SOLUTION

In [86]:
positive_features = None
### BEGIN SOLUTION
positive_features = [(build_bag_of_words_features_filtered(tweet), 'pos')\
                    for tweet in twitter_samples.tokenized('positive_tweets.json')]
### END SOLUTION

In [87]:
positive_features[0][0]

{'#FollowFriday': True,
 '@France_Inte': True,
 '@PKuchly57': True,
 '@Milipol_Paris': True,
 'top': True,
 'engaged': True,
 'members': True,
 'community': True,
 'week': True,
 ':)': True}

In [88]:
assert positive_features[0][1] == "pos", "Make sure the feature is a list of tuples whose second element is pos or neg"
assert positive_features[0][0]["engaged"] == 1, "Make sure that the first element of each tuple is a dictionary of words"

## Exercise 6: Train a NaiveBayesClassifier

In [89]:
from nltk.classify import NaiveBayesClassifier

Let's use 80% of the data for training, the rest for validation:

In [90]:
split = int(len(positive_features) * 0.8)

In [91]:
split

4000

In [92]:
classifier = NaiveBayesClassifier.train(positive_features[:split]+negative_features[:split])

Let's check the accuracy on the training and on the test sets, make sure to turn those into a percent value

In [94]:
training_accuracy = None
### BEGIN SOLUTION
nltk.classify.util.accuracy(classifier, positive_features[:split] + negative_features[:split])
### END SOLUTION

0.999625

In [96]:
test_accuracy = None
### BEGIN SOLUTION
nltk.classify.util.accuracy(classifier, positive_features[split:]+negative_features[:split])
### END SOLUTION

0.9982

It looks like the accuracy for the test is very high compared to the movie review dataset, check the most informative features below to understand why:

In [97]:
classifier.show_most_informative_features()

Most Informative Features
                      :( = True              neg : pos    =   2362.3 : 1.0
                      :) = True              pos : neg    =   1139.0 : 1.0
                     See = True              pos : neg    =     37.7 : 1.0
                     TOO = True              neg : pos    =     36.3 : 1.0
                  THANKS = True              neg : pos    =     35.0 : 1.0
                    THAT = True              neg : pos    =     27.7 : 1.0
                    miss = True              neg : pos    =     26.4 : 1.0
                     sad = True              neg : pos    =     25.0 : 1.0
                     x15 = True              neg : pos    =     23.7 : 1.0
                   Thank = True              pos : neg    =     22.3 : 1.0
