# Exercise Notebook on Natural Language Processing

`nltk` also provides access to a dataset of tweets from Twitter. It includes a set of tweets already classified as negative or positive.

In this exercise notebook we replicate, on this dataset, the sentiment analysis classification that was performed on the movie reviews corpus.

## Exercise 1: Download and inspect the twitter_samples dataset

First we want to download the dataset and inspect it:

In [1]:
import nltk

In [2]:
# DO NOT MODIFY

nltk.download("twitter_samples")
from nltk.corpus import twitter_samples

[nltk_data] Downloading package twitter_samples to
[nltk_data]     C:\Users\Sam\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\twitter_samples.zip.


First let's check the common `fileids` method of `nltk` corpora:

In [8]:
print(type(twitter_samples))
twitter_samples.fileids()

<class 'nltk.corpus.reader.twitter.TwitterCorpusReader'>


['negative_tweets.json', 'positive_tweets.json', 'tweets.20150430-223406.json']

The `twitter_samples` object has a `tokenized()` method that returns all tweets from a fileid, each already tokenized individually. Read its documentation and use it to find the number of positive and negative tweets.

The `.tokenized()` method returns the given file as a list of tweets, where each tweet is itself a list of words, screen names, hashtags, URLs and punctuation symbols.
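
Because each tweet comes back as its own list of tokens, `len()` on the outer list counts tweets, while summing the inner lengths counts tokens. The following is a minimal sketch on a hand-made stand-in for the `tokenized()` output; `toy_tweets` is a hypothetical name and the corpus itself is not loaded here.

```python
# Hypothetical stand-in for the list returned by twitter_samples.tokenized(...)
toy_tweets = [
    ["@97sides", "CONGRATS", ":)"],
    ["hopeless", "for", "tmr", ":("],
]

# Outer list: one entry per tweet
number_of_tweets = len(toy_tweets)
# Inner lists: one entry per token
number_of_tokens = sum(len(tweet) for tweet in toy_tweets)

print(number_of_tweets, number_of_tokens)
```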

In [21]:
number_of_positive_tweets = None
### BEGIN SOLUTION
positive_tweets = twitter_samples.tokenized(fileids='positive_tweets.json')
### END SOLUTION
number_of_positive_tweets = len(positive_tweets)
positive_tweets

[['#FollowFriday',
  '@France_Inte',
  '@PKuchly57',
  '@Milipol_Paris',
  'for',
  'being',
  'top',
  'engaged',
  'members',
  'in',
  'my',
  'community',
  'this',
  'week',
  ':)'],
 ['@Lamb2ja',
  'Hey',
  'James',
  '!',
  'How',
  'odd',
  ':/',
  'Please',
  'call',
  'our',
  'Contact',
  'Centre',
  'on',
  '02392441234',
  'and',
  'we',
  'will',
  'be',
  'able',
  'to',
  'assist',
  'you',
  ':)',
  'Many',
  'thanks',
  '!'],
 ['@DespiteOfficial',
  'we',
  'had',
  'a',
  'listen',
  'last',
  'night',
  ':)',
  'As',
  'You',
  'Bleed',
  'is',
  'an',
  'amazing',
  'track',
  '.',
  'When',
  'are',
  'you',
  'in',
  'Scotland',
  '?',
  '!'],
 ['@97sides', 'CONGRATS', ':)'],
 ['yeaaaah',
  'yippppy',
  '!',
  '!',
  '!',
  'my',
  'accnt',
  'verified',
  'rqst',
  'has',
  'succeed',
  'got',
  'a',
  'blue',
  'tick',
  'mark',
  'on',
  'my',
  'fb',
  'profile',
  ':)',
  'in',
  '15',
  'days'],
 ['@BhaktisBanter',
  '@PallaviRuhail',
  'This',
  'one',
  '

In [20]:
number_of_negative_tweets = None
### BEGIN SOLUTION
negative_tweets = twitter_samples.tokenized(fileids='negative_tweets.json')
### END SOLUTION
number_of_negative_tweets = len(negative_tweets)
negative_tweets

[['hopeless', 'for', 'tmr', ':('],
 ['Everything',
  'in',
  'the',
  'kids',
  'section',
  'of',
  'IKEA',
  'is',
  'so',
  'cute',
  '.',
  'Shame',
  "I'm",
  'nearly',
  '19',
  'in',
  '2',
  'months',
  ':('],
 ['@Hegelbon',
  'That',
  'heart',
  'sliding',
  'into',
  'the',
  'waste',
  'basket',
  '.',
  ':('],
 ['“',
  '@ketchBurning',
  ':',
  'I',
  'hate',
  'Japanese',
  'call',
  'him',
  '"',
  'bani',
  '"',
  ':(',
  ':(',
  '”',
  'Me',
  'too'],
 ['Dang', 'starting', 'next', 'week', 'I', 'have', '"', 'work', '"', ':('],
 ['oh',
  'god',
  ',',
  'my',
  'babies',
  "'",
  'faces',
  ':(',
  'https://t.co/9fcwGvaki0'],
 ['@RileyMcDonough', 'make', 'me', 'smile', ':(', '('],
 ['@f0ggstar',
  '@stuartthull',
  'work',
  'neighbour',
  'on',
  'motors',
  '.',
  'Asked',
  'why',
  'and',
  'he',
  'said',
  'hates',
  'the',
  'updates',
  'on',
  'search',
  ':(',
  'http://t.co/XvmTUikWln'],
 ['why',
  '?',
  ':(',
  '"',
  '@tahuodyy',
  ':',
  'sialan',
  ':(',


In [19]:
# DO NOT MODIFY

assert number_of_positive_tweets == 5000, "Make sure you are counting the number of tweets, not the number of words"
assert number_of_negative_tweets == 5000, "Make sure you are counting the number of tweets, not the number of words"

## Exercise 2: Build a bag-of-words model function

As in the lecture, we can build a bag-of-words model to train our machine learning algorithm.

In [22]:
import string

As a first step, we define a list of words that we want to filter out of our dataset. We do this by combining the English stopwords from `nltk.corpus` with the punctuation characters from the `string.punctuation` attribute.

In [23]:
useless_words = nltk.corpus.stopwords.words("english") + list(string.punctuation)
useless_words

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each
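
Membership in this list is an exact, case-sensitive string match, and `list(string.punctuation)` only contributes single characters. The sketch below illustrates the consequence on a shortened, hypothetical stopword list (`toy_stopwords` is not the real `nltk` list): multi-character tokens such as emoticons and capitalised words are not removed.

```python
import string

# Hypothetical, shortened stand-in for nltk.corpus.stopwords.words("english")
toy_stopwords = ["i", "me", "my", "the", "for"]
toy_useless_words = set(toy_stopwords + list(string.punctuation))

tokens = ["the", "!", "...", ":(", "I", "i"]

# Exact match only: "..." and ":(" are not single punctuation characters,
# and "I" differs from the lowercase stopword "i"
kept = [token for token in tokens if token not in toy_useless_words]
print(kept)
```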

The function below returns a dictionary that maps each word in the input list to 1 (standing for True), provided the word is not in `useless_words`.

In [97]:
def build_bag_of_words_features_filtered(words):
    """Build a bag of words model"""
    ### BEGIN SOLUTION
    # Keep every token that is neither a stopword nor punctuation
    return {word: 1 for word in words if word not in useless_words}
    ### END SOLUTION

In [75]:
#assert len(build_bag_of_words_features_filtered(["what", "the", "?", ","]))==0, "Make sure we are filtering out both stopwords and punctuation"
build_bag_of_words_features_filtered(["what", "the", "?", ",", "yolo"])

{'yolo': 1}
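
Since the result is a dictionary, repeated words collapse into a single key: the model records presence, not counts. Here is a small self-contained sketch of that behaviour, using a hypothetical `toy_useless_words` set and a hypothetical `toy_bag_of_words` function in place of the notebook's own.

```python
import string

# Hypothetical, shortened stand-in for the notebook's useless_words
toy_useless_words = set(["what", "the", "so"] + list(string.punctuation))

def toy_bag_of_words(words):
    """Map every kept token to 1; repeated tokens collapse into one key."""
    return {word: 1 for word in words if word not in toy_useless_words}

features = toy_bag_of_words(["so", "sad", "so", "sad", ":(", "!"])
print(features)
```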

## Exercise 3: create a list of all words

Before performing sentiment analysis, let's first inspect the dataset a little bit more by creating a list of all words.

The Python list method `.extend()` extends a list by appending elements from an iterable. 

In the example below, `for tweet in twitter_samples.tokenized(dataset)` iterates through the tokenized tweets of each dataset, and `extend` adds the elements (strings) of each tweet list to the `words` list.

In [76]:
words = []
for dataset in ["positive_tweets.json", "negative_tweets.json"]:
    for tweet in twitter_samples.tokenized(dataset):
        words.extend(tweet)
len(words)

130099

The code above uses a nested loop: for each dataset, we loop through each tweet. Also notice that we use `extend` rather than `append`. `append` would add each tweet to `words` as a single element (a list of strings), whereas `extend` adds the individual strings of each tweet, so `words` ends up as one flat list.
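
The difference is easy to see on two hand-written tweets. This is only an illustrative sketch; `toy_tweets`, `appended` and `extended` are hypothetical names.

```python
toy_tweets = [["hopeless", "for", "tmr", ":("], ["@97sides", "CONGRATS", ":)"]]

appended = []
extended = []
for tweet in toy_tweets:
    appended.append(tweet)   # adds the whole tweet list as a single element
    extended.extend(tweet)   # adds each token of the tweet individually

print(len(appended))  # 2 (a list of lists)
print(len(extended))  # 7 (a flat list of tokens)
```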

Now let's filter out punctuation and stopwords. We can do this using a list comprehension.

In [77]:
filtered_words = None
### BEGIN SOLUTION
filtered_words = [word for word in words if word not in useless_words]
### END SOLUTION
len(filtered_words)

84672

Filtering out `useless_words`, as defined in the previous section, reduces the length of the word list from 130099 to 84672, a reduction of about 35%:

In [78]:
# DO NOT MODIFY 

assert len(filtered_words) == 84672, "Make sure that the filtering is applied correctly"
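
One optional design note: testing `word not in useless_words` against a list scans the list for every word. Converting the list to a `set` first should give the same result with faster lookups. The sketch below checks that equivalence on hypothetical toy data rather than on the corpus.

```python
import string

# Hypothetical, shortened stand-in for the notebook's useless_words list
toy_useless_words = ["i", "the", "for"] + list(string.punctuation)
# A set hashes its members, so each lookup is O(1) on average
toy_useless_set = set(toy_useless_words)

toy_words = ["hopeless", "for", "tmr", ":(", "the", "kids", "!"]

filtered_with_list = [w for w in toy_words if w not in toy_useless_words]
filtered_with_set = [w for w in toy_words if w not in toy_useless_set]

print(filtered_with_set)
```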

## Exercise 4: find the most common words


The `collections` module of the standard library contains a `Counter` class that is handy for counting the frequencies of words in our list.

Remember that `Counter` is a `dict` subclass for counting hashable items, sometimes called a bag or multiset. Elements (words) are stored as dictionary keys and their counts are stored as dictionary values.

In [79]:
# DO NOT MODIFY 

from collections import Counter

counter = Counter(filtered_words)
counter

Counter({'#FollowFriday': 25,
         '@France_Inte': 1,
         '@PKuchly57': 1,
         '@Milipol_Paris': 1,
         'top': 34,
         'engaged': 7,
         'members': 16,
         'community': 30,
         'week': 109,
         ':)': 3693,
         '@Lamb2ja': 1,
         'Hey': 67,
         'James': 8,
         'How': 46,
         'odd': 3,
         ':/': 16,
         'Please': 44,
         'call': 39,
         'Contact': 2,
         'Centre': 1,
         '02392441234': 1,
         'able': 30,
         'assist': 1,
         'Many': 13,
         'thanks': 196,
         '@DespiteOfficial': 1,
         'listen': 20,
         'last': 93,
         'night': 101,
         'As': 11,
         'You': 123,
         'Bleed': 1,
         'amazing': 59,
         'track': 5,
         'When': 39,
         'Scotland': 4,
         '@97sides': 1,
         'CONGRATS': 2,
         'yeaaaah': 1,
         'yippppy': 1,
         'accnt': 2,
         'verified': 1,
         'rqst': 1,
         'succ

`Counter` objects also have the `.most_common()` method to access the words with the highest counts:

In [80]:
most_common_words = None
### BEGIN SOLUTION
most_common_words = counter.most_common(10)
### END SOLUTION
most_common_words

[(':(', 4586),
 (':)', 3693),
 ('I', 2477),
 (':-)', 701),
 (':D', 658),
 ('...', 622),
 (':-(', 501),
 ("I'm", 456),
 ('like', 402),
 ('u', 392)]

In [81]:
assert most_common_words[0][0] == ":(", "The most common word should be :("
assert len(most_common_words) == 10, "Make sure you are only getting the first 10"
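
For reference, here is a minimal sketch of the two `Counter` behaviours used above, on a hypothetical toy word list: indexing returns the count of a word (0 if it never occurred), and `.most_common(n)` returns the `n` most frequent `(word, count)` pairs in descending order.

```python
from collections import Counter

# Hypothetical toy word list, not taken from the corpus
toy_words = [":(", ":)", "I", ":(", "like", ":(", ":)"]
toy_counter = Counter(toy_words)

print(toy_counter[":("])            # count of a word that occurs
print(toy_counter["missing"])       # unseen words count as 0, no KeyError
print(toy_counter.most_common(2))   # list of (word, count) tuples
```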

## Exercise 5: Build the features for machine learning

Using our `build_bag_of_words_features_filtered` function, we can build the negative and positive features separately.

The format of the positive features should be:

    [
        ( { "here":1, "some":1, "words":1 }, "pos" ),
        ( { "another":1, "tweet":1}, "pos" )
    ]
    
It is a list of tuples: the first element of each tuple is a dictionary that maps each word appearing in the tweet to 1, and the second is the string "pos" or "neg".

Note on the code below: we build a list of tuples. The first element of each tuple is the dictionary returned by `build_bag_of_words_features_filtered` for one tweet, obtained from the `tokenized()` method, which assigns 1 to each unique word in that tweet. The second element, in the same tuple, is the sentiment label of the tweet.

Thus `negative_features`, for example, is a list of tuples.

In [102]:
negative_features = None
### BEGIN SOLUTION
negative_features = [(build_bag_of_words_features_filtered(tweet), "neg")
                     for tweet in twitter_samples.tokenized("negative_tweets.json")
                    ]
### END SOLUTION
negative_features

[({'hopeless': 1, 'tmr': 1, ':(': 1}, 'neg'),
 ({'Everything': 1,
   'kids': 1,
   'section': 1,
   'IKEA': 1,
   'cute': 1,
   'Shame': 1,
   "I'm": 1,
   'nearly': 1,
   '19': 1,
   '2': 1,
   'months': 1,
   ':(': 1},
  'neg'),
 ({'@Hegelbon': 1,
   'That': 1,
   'heart': 1,
   'sliding': 1,
   'waste': 1,
   'basket': 1,
   ':(': 1},
  'neg'),
 ({'“': 1,
   '@ketchBurning': 1,
   'I': 1,
   'hate': 1,
   'Japanese': 1,
   'call': 1,
   'bani': 1,
   ':(': 1,
   '”': 1,
   'Me': 1},
  'neg'),
 ({'Dang': 1, 'starting': 1, 'next': 1, 'week': 1, 'I': 1, 'work': 1, ':(': 1},
  'neg'),
 ({'oh': 1,
   'god': 1,
   'babies': 1,
   'faces': 1,
   ':(': 1,
   'https://t.co/9fcwGvaki0': 1},
  'neg'),
 ({'@RileyMcDonough': 1, 'make': 1, 'smile': 1, ':(': 1}, 'neg'),
 ({'@f0ggstar': 1,
   '@stuartthull': 1,
   'work': 1,
   'neighbour': 1,
   'motors': 1,
   'Asked': 1,
   'said': 1,
   'hates': 1,
   'updates': 1,
   'search': 1,
   ':(': 1,
   'http://t.co/XvmTUikWln': 1},
  'neg'),
 ({':(': 

In [105]:
positive_features = None
### BEGIN SOLUTION
positive_features = [(build_bag_of_words_features_filtered(tweet), "pos")
                     for tweet in twitter_samples.tokenized("positive_tweets.json")
                    ]
### END SOLUTION

In [109]:
positive_features[0]

({'#FollowFriday': 1,
  '@France_Inte': 1,
  '@PKuchly57': 1,
  '@Milipol_Paris': 1,
  'top': 1,
  'engaged': 1,
  'members': 1,
  'community': 1,
  'week': 1,
  ':)': 1},
 'pos')

In [107]:
assert positive_features[0][1] == "pos", "Make sure the feature is a list of tuples whose second element is pos or neg"
assert positive_features[0][0]["engaged"] == 1, "Make sure that the first element of each tuple is a dictionary of words"
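
The same `(dictionary, label)` structure can be reproduced on a couple of hand-written tweets. This is only a sketch: `toy_useless_words`, `toy_bag_of_words` and the toy tweet lists are hypothetical stand-ins for the notebook's objects.

```python
# Hypothetical, tiny stand-in for the notebook's useless_words
toy_useless_words = {"for", "the", "!"}

def toy_bag_of_words(words):
    """Same idea as the notebook's function, on the toy filter list."""
    return {word: 1 for word in words if word not in toy_useless_words}

toy_positive_tweets = [["@97sides", "CONGRATS", ":)"]]
toy_negative_tweets = [["hopeless", "for", "tmr", ":("]]

# One (feature dictionary, label) tuple per tweet
toy_features = (
    [(toy_bag_of_words(t), "pos") for t in toy_positive_tweets]
    + [(toy_bag_of_words(t), "neg") for t in toy_negative_tweets]
)
for feature_dict, label in toy_features:
    print(label, feature_dict)
```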

## Exercise 6: Train a NaiveBayesClassifier

The source code and docstrings of the classifier are available here:

https://www.nltk.org/_modules/nltk/classify/naivebayes.html

In [110]:
from nltk.classify import NaiveBayesClassifier

Let's use 80% of the data for training, the rest for validation:

In [111]:
split = int(len(positive_features) * 0.8)

In [112]:
split

4000
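
The `split` index is then used to slice both feature lists, so that the training and test sets each contain the same number of positive and negative examples. The sketch below shows the slicing on hypothetical toy lists of 10 examples per class.

```python
# Hypothetical toy feature lists, 10 examples per class
toy_positive = [({"good": 1}, "pos")] * 10
toy_negative = [({"bad": 1}, "neg")] * 10

split = int(len(toy_positive) * 0.8)

# First 80% of each class for training, the remaining 20% for testing
train_set = toy_positive[:split] + toy_negative[:split]
test_set = toy_positive[split:] + toy_negative[split:]

print(len(train_set), len(test_set))
```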

In [135]:
classifier = NaiveBayesClassifier.train(positive_features[:split]+negative_features[:split])
classifier.labels()

['pos', 'neg']

Let's check the accuracy on the training and test sets, converting each value to a percentage.

We do this using `nltk.classify.util.accuracy(classifier, gold)`, where `gold` is a list of `(features, label)` tuples carrying the correct labels.

In [129]:
training_accuracy = None
### BEGIN SOLUTION
training_accuracy = nltk.classify.util.accuracy(classifier, positive_features[:split]+negative_features[:split])*100
### END SOLUTION
training_accuracy

99.9625

In [130]:
test_accuracy = None
### BEGIN SOLUTION
test_accuracy = nltk.classify.util.accuracy(classifier, positive_features[split:]+negative_features[split:])*100
### END SOLUTION
test_accuracy

99.35000000000001
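
Conceptually, accuracy is the fraction of labelled examples for which the classifier's prediction matches the gold label. The following sketch spells that out in plain Python; `ToyClassifier` and `toy_accuracy` are hypothetical names for illustration and are not part of `nltk`.

```python
class ToyClassifier:
    """Hypothetical stand-in: predicts 'neg' when ':(' is present, else 'pos'."""

    def classify(self, featureset):
        return "neg" if ":(" in featureset else "pos"

def toy_accuracy(classifier, gold):
    """Fraction of (featureset, label) pairs whose prediction matches the label."""
    correct = sum(1 for featureset, label in gold
                  if classifier.classify(featureset) == label)
    return correct / len(gold)

gold = [
    ({"hopeless": 1, ":(": 1}, "neg"),
    ({"CONGRATS": 1, ":)": 1}, "pos"),
    ({"miss": 1, "sad": 1}, "neg"),   # no emoticon, so the toy rule gets it wrong
    ({"thanks": 1, ":)": 1}, "pos"),
]

accuracy_percent = toy_accuracy(ToyClassifier(), gold) * 100
print(accuracy_percent)  # 75.0
```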

The test accuracy is very high compared to the movie review dataset. Check the most informative features below to understand why.

Remember, `.show_most_informative_features()` prints the 'most informative' features used by a given classifier. For the purpose of this function, the informativeness of a feature `(fname, fval)` is equal to the highest value of P(fname=fval|label), for any label, divided by the lowest value of P(fname=fval|label), for any label:

        informativeness(fname, fval) = max[ P(fname=fval|label1) / P(fname=fval|label2) ]

In [131]:
classifier.show_most_informative_features()

Most Informative Features
                      :( = 1                 neg : pos    =   2362.3 : 1.0
                      :) = 1                 pos : neg    =   1139.0 : 1.0
                     See = 1                 pos : neg    =     37.7 : 1.0
                     TOO = 1                 neg : pos    =     36.3 : 1.0
                  THANKS = 1                 neg : pos    =     35.0 : 1.0
                    THAT = 1                 neg : pos    =     27.7 : 1.0
                    miss = 1                 neg : pos    =     26.4 : 1.0
                     sad = 1                 neg : pos    =     25.0 : 1.0
                     x15 = 1                 neg : pos    =     23.7 : 1.0
                   Thank = 1                 pos : neg    =     22.3 : 1.0
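
The ratio itself is simple to compute once the per-label probabilities are known. The sketch below uses made-up probabilities, not values from the trained classifier, and a hypothetical `informativeness` helper, to show how a feature that is common under one label and rare under the other gets a large ratio.

```python
def informativeness(prob_by_label):
    """Ratio of the largest to the smallest P(fname=fval | label)."""
    return max(prob_by_label.values()) / min(prob_by_label.values())

# Made-up probabilities for illustration, not values from the trained classifier
toy_probs = {
    ":(": {"neg": 0.90, "pos": 0.001},
    "week": {"neg": 0.012, "pos": 0.010},
}

# Rank features from most to least informative
ranked = sorted(toy_probs, key=lambda f: informativeness(toy_probs[f]), reverse=True)
print(ranked)
```

In the output above, `:(` and `:)` have by far the largest ratios, which suggests that the emoticons alone nearly separate the two classes.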
