# Natural Language Processing (NLP)

Examples below are adapted from ["How To Work with Language Data in Python 3 using the Natural Language Toolkit (NLTK)"](https://www.digitalocean.com/community/tutorials/how-to-work-with-language-data-in-python-3-using-the-natural-language-toolkit-nltk)

In [1]:
#pip install nltk

In [2]:
#from gensim import corpora, models

In [3]:
import nltk
print(nltk.__version__) 
# You should have version 3.2.1 or later installed,
# since we'll use NLTK's Twitter package, which requires it

3.4.1


In [4]:
#nltk.download()

## Load the data



In [5]:
nltk.download('twitter_samples')

[nltk_data] Downloading package twitter_samples to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


True

In [6]:
from nltk.corpus import twitter_samples

In [7]:
twitter_samples # has a specific type: TwitterCorpusReader

<TwitterCorpusReader in '/home/jovyan/nltk_data/corpora/twitter_samples'>

NLTK's Twitter corpus currently contains a sample of 20,000 tweets retrieved from the Twitter Streaming API. Full tweets are stored as line-separated JSON. We can list the JSON files in the corpus using the `twitter_samples.fileids()` method:

In [8]:
twitter_samples.fileids()

['negative_tweets.json', 'positive_tweets.json', 'tweets.20150430-223406.json']
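To make "line-separated JSON" concrete: each line of such a file is one complete JSON object. The sketch below parses two made-up records with the standard `json` module. The records and the field names `id` and `text` are illustrative assumptions, not the actual content of the corpus files, where each tweet object carries many more fields.

```python
import json

# Two hypothetical records, one JSON object per line (not real corpus data)
raw = '{"id": 1, "text": "Hello :)"}\n{"id": 2, "text": "Good morning!"}\n'

# Parse every non-empty line into a Python dict
records = [json.loads(line) for line in raw.splitlines() if line.strip()]

# Pull out just the tweet text, which is what we work with in this notebook
texts = [record["text"] for record in records]
print(texts)
```

With the real corpus there is no need to parse the files by hand, since the corpus reader does this for us.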

Using those file IDs we can then return the tweet strings. 
Let's look at just the first few:

In [9]:
twitter_samples.strings('tweets.20150430-223406.json')[0:3]

['RT @KirkKus: Indirect cost of the UK being in the EU is estimated to be costing Britain £170 billion per year! #BetterOffOut #UKIP',
 'VIDEO: Sturgeon on post-election deals http://t.co/BTJwrpbmOY',
 'RT @LabourEoin: The economy was growing 3 times faster on the day David Cameron became Prime Minister than it is today.. #BBCqt http://t.co…']

We now know our corpus was downloaded successfully, so we can start processing the tweets.

Let's count how many adjectives and nouns appear in the positive subset of the `twitter_samples` corpus:

* A **noun** is usually defined as a person, place, or thing (e.g., a movie, a book, or a burger). Counting nouns can help determine how many different _topics_ are being discussed.
* An **adjective** is a word that modifies a noun (or pronoun), for example: a _horrible_ movie, a _funny_ book, or a _delicious_ burger. Counting adjectives can indicate what type of language is being used; for example, opinions tend to include more adjectives than facts.

We could later count positive adjectives (great, awesome, happy, etc.) versus negative adjectives (boring, lame, sad, etc.), which could be used to analyze the sentiment of tweets or reviews about a product or movie, for example. This script provides data that can in turn inform decisions related to that product or movie.
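As a rough illustration of that idea, the sketch below counts words from two tiny hand-made word lists. The lists reuse the example words from the paragraph above, the sample sentence is invented, and splitting on whitespace stands in for real tokenization and tagging, so treat it as a toy under those assumptions rather than a sentiment analyzer.

```python
# Hypothetical mini-lexicons built from the example words in the text
POSITIVE = {"great", "awesome", "happy"}
NEGATIVE = {"boring", "lame", "sad"}

def polarity_counts(text):
    """Return (positive, negative) word counts for a piece of text."""
    positive = negative = 0
    for word in text.lower().split():
        if word in POSITIVE:
            positive += 1
        elif word in NEGATIVE:
            negative += 1
    return positive, negative

# Invented sample sentence, not taken from the corpus
sample = "What a great movie but such a boring sad ending"
print(polarity_counts(sample))
```

For the invented sentence this prints `(1, 2)`: one positive word and two negative ones.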


Let's create a `tweets` variable and assign to it the list of tweet strings from the `positive_tweets.json` file.

In [10]:
tweets = twitter_samples.strings('positive_tweets.json')

In [11]:
tweets[0:3]

['#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)',
 '@Lamb2ja Hey James! How odd :/ Please call our Contact Centre on 02392441234 and we will be able to assist you :) Many thanks!',
 '@DespiteOfficial we had a listen last night :) As You Bleed is an amazing track. When are you in Scotland?!']

## Tokenizing Sentences

**Tokenization** is the act of breaking up a string into pieces such as words, keywords, phrases, symbols, and other elements, which are called _tokens_. Let's create a new variable called `tweets_tokens`, to which we will assign the tokenized list of tweets:

In [12]:
# Load tokenized tweets
tweets_tokens = twitter_samples.tokenized('positive_tweets.json')

This new variable, `tweets_tokens`, is a list in which each element is itself a list of tokens, one inner list per tweet.
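To show what a tweet-aware tokenizer has to handle, here is a deliberately simplified regex-based sketch. It is not NLTK's implementation: the pattern and the `simple_tokenize` name are assumptions made for illustration, and it only covers hashtags, mentions, a few emoticons, plain words, and single punctuation marks.

```python
import re

TOKEN_PATTERN = re.compile(
    r"[:;=][-']?[)(DPp/\\|]"   # a few simple emoticons such as :) ;-) :/
    r"|[@#]\w+"                # @mentions and #hashtags kept as one token
    r"|\w+(?:'\w+)?"           # words, optionally with an inner apostrophe
    r"|[^\w\s]"                # any other single non-space symbol
)

def simple_tokenize(text):
    """Split a tweet into tokens with a toy, tweet-aware pattern."""
    return TOKEN_PATTERN.findall(text)

tweet = ('#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being '
         'top engaged members in my community this week :)')
print(simple_tokenize(tweet))
```

The point of the sketch is that hashtags, mentions, and emoticons stay together as single tokens instead of being split at the punctuation. A toy pattern like this will mishandle URLs and many other cases.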

We can compare the list of tokens to the original tweet that the tokens came from:

In [13]:
print(tweets[0:1])
tweets_tokens[0:1]

['#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)']


[['#FollowFriday',
  '@France_Inte',
  '@PKuchly57',
  '@Milipol_Paris',
  'for',
  'being',
  'top',
  'engaged',
  'members',
  'in',
  'my',
  'community',
  'this',
  'week',
  ':)']]

Now that we have the tokens of each tweet, we can tag them with the appropriate part-of-speech (POS) tags.


To access NLTK's POS tagger, we'll need to import it. (By convention, import statements go at the beginning of a script or notebook.)

In [14]:
from nltk.tag import pos_tag_sents
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

Now we can tag each of our tokens. NLTK lets us tag every tweet in one call with `pos_tag_sents()`. We will store the tagged lists in a new variable, `tweets_tagged`.

In [15]:
tweets_tagged = pos_tag_sents(tweets_tokens)

To get an idea of what tagged tokens look like, here is the first element of our `tweets_tagged` list:

In [16]:
tweets_tagged[0:1]

[[('#FollowFriday', 'JJ'),
  ('@France_Inte', 'NNP'),
  ('@PKuchly57', 'NNP'),
  ('@Milipol_Paris', 'NNP'),
  ('for', 'IN'),
  ('being', 'VBG'),
  ('top', 'JJ'),
  ('engaged', 'VBN'),
  ('members', 'NNS'),
  ('in', 'IN'),
  ('my', 'PRP$'),
  ('community', 'NN'),
  ('this', 'DT'),
  ('week', 'NN'),
  (':)', 'NN')]]

## Tagging parts of speech (POS)

We can see that our tweet is represented as a list, and for each token we have information about its POS tag. Each token/tag pair is saved as a [tuple](https://www.digitalocean.com/community/tutorials/understanding-tuples-in-python-3). NLTK's default tagger, which is used by both `nltk.pos_tag()` and `pos_tag_sents()`, uses the Penn Treebank tag set. You can check an [alphabetical list of part-of-speech tags used in the Penn Treebank Project](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).
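Because each pair is a tuple, we can read it either by index or by unpacking it into two names. A short sketch, using a few pairs copied by hand from the output above (the variable names are only illustrative):

```python
# A few (token, tag) pairs copied by hand from the first tagged tweet
tagged_tweet = [('#FollowFriday', 'JJ'), ('top', 'JJ'), ('members', 'NNS'),
                ('community', 'NN'), (':)', 'NN')]

# Index access: position 0 holds the token, position 1 holds the tag
first_pair = tagged_tweet[0]
print(first_pair[0], first_pair[1])

# Tuple unpacking gives both parts a readable name
adjectives = [token for token, tag in tagged_tweet if tag == 'JJ']
print(adjectives)
```

Both styles are equivalent. Unpacking tends to read better inside loops.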

In the Penn Treebank tag set, the tag for **adjective** is `JJ` (comparative and superlative adjectives have their own tags, `JJR` and `JJS`).

The NLTK tagger marks **singular nouns** (`NN`) with different tags than **plural nouns** (`NNS`). To simplify, we will only count singular nouns by keeping track of the `NN` tag.
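If we wanted to count every kind of noun rather than only `NN`, one option is to group tags by their prefix, since the Penn Treebank noun tags (`NN`, `NNS`, `NNP`, `NNPS`) all start with `NN` and the adjective tags (`JJ`, `JJR`, `JJS`) all start with `JJ`. The sketch below is a variation on the approach in this notebook, not a replacement for it, and `coarse_class` is a hypothetical helper name.

```python
from collections import Counter

def coarse_class(tag):
    """Map a Penn Treebank tag to 'noun', 'adjective', or 'other'."""
    if tag.startswith('NN'):    # NN, NNS, NNP, NNPS
        return 'noun'
    if tag.startswith('JJ'):    # JJ, JJR, JJS
        return 'adjective'
    return 'other'

# Tags of the first tweet, copied from the tagger output shown earlier
first_tweet_tags = ['JJ', 'NNP', 'NNP', 'NNP', 'IN', 'VBG', 'JJ', 'VBN',
                    'NNS', 'IN', 'PRP$', 'NN', 'DT', 'NN', 'NN']

class_counts = Counter(coarse_class(tag) for tag in first_tweet_tags)
print(class_counts['noun'], class_counts['adjective'], class_counts['other'])
```

For this one tweet the grouped view gives 7 nouns, 2 adjectives, and 6 other tokens, whereas counting only `NN` would find 3 nouns.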

In the next step we will count how many times `JJ` and `NN` appear throughout our corpus.

We will keep track of how many times `JJ` and `NN` appear using two accumulator (count) variables, one per tag, incrementing the matching one every time we find a tag.

After we create the variables, we'll write two nested `for` loops. The outer loop will iterate through each tweet in the list. The inner loop will iterate through each token/tag pair in that tweet. For each pair, we will look up the tag using the appropriate tuple index.

We will then use conditional statements to check whether the tag matches the string `'JJ'` or `'NN'`. If the tag is a match, we will add one (`+= 1`) to the appropriate accumulator.

In [17]:
# Set accumulators
JJ_count = 0
NN_count = 0

# Loop through list of tweets
for tweet in tweets_tagged:
    for pair in tweet:
        tag = pair[1]
        if tag == 'JJ':
            JJ_count += 1
        elif tag == 'NN':
            NN_count += 1

After the two loops are complete, we should have the total counts of adjectives (`JJ`) and singular nouns (`NN`) in our corpus.

In [18]:
print('Total number of adjectives = ', JJ_count)
print('Total number of nouns = ', NN_count)

Total number of adjectives =  6094
Total number of nouns =  13180
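These totals should be read with some care. In the tagged tweet shown earlier, the hashtag `#FollowFriday` was tagged `JJ` and the emoticon `:)` was tagged `NN`, so the counts include tokens that are not really adjectives or nouns. One possible refinement, sketched below, is to count only purely alphabetic tokens. The `count_tag` helper and its `words_only` flag are assumptions introduced for illustration, and the sample is the first tagged tweet copied by hand.

```python
# First tagged tweet, copied by hand from the output shown earlier
first_tweet = [('#FollowFriday', 'JJ'), ('@France_Inte', 'NNP'),
               ('@PKuchly57', 'NNP'), ('@Milipol_Paris', 'NNP'),
               ('for', 'IN'), ('being', 'VBG'), ('top', 'JJ'),
               ('engaged', 'VBN'), ('members', 'NNS'), ('in', 'IN'),
               ('my', 'PRP$'), ('community', 'NN'), ('this', 'DT'),
               ('week', 'NN'), (':)', 'NN')]

def count_tag(tagged_tweets, wanted_tag, words_only=False):
    """Count tokens carrying wanted_tag, optionally skipping non-words."""
    total = 0
    for tweet in tagged_tweets:
        for token, tag in tweet:
            if tag != wanted_tag:
                continue
            # str.isalpha() is False for hashtags, mentions, and emoticons
            if words_only and not token.isalpha():
                continue
            total += 1
    return total

print(count_tag([first_tweet], 'JJ'), count_tag([first_tweet], 'JJ', words_only=True))
print(count_tag([first_tweet], 'NN'), count_tag([first_tweet], 'NN', words_only=True))
```

On this single tweet the filter lowers the `JJ` count from 2 to 1 and the `NN` count from 3 to 2. Whether such filtering is appropriate depends on the analysis, since `isalpha()` also drops legitimate tokens that contain digits or hyphens.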
