<a href="https://colab.research.google.com/github/rachelpopa/2016WoW/blob/master/nlp_dev_day_july_2021.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing Dev Day

Welcome to the NLP Dev Day. 

Natural Language Processing (NLP) is a subfield of AI which explores how to make computers "understand" natural languages, such as English. 

This tutorial is meant to walk you through some of the basic concepts used in practice to process text documents. The first section is about pre-processing data, the second is mostly about classifying data. We will be using Python NLTK (Natural Language Toolkit) and scikit-learn (or sklearn), a machine learning library. 

Start by making your own copy of this notebook in Google Colab so that you can edit and experiment with any of the code snippets (you will need a Google Drive account to do this). Reach out in Teams if you have questions or notice a mistake!

After you've gone through the tutorial, you can spin up a project of your own with a dataset of your choice.  

### Getting Started

Run the snippet below to get set up with some of the libraries & datasets we will be using. 


In [None]:
import nltk
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("brown")
nltk.download("names")
nltk.download('movie_reviews')
nltk.download('averaged_perceptron_tagger')
nltk.download('tagsets')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package names to /root/nltk_data...
[nltk_data]   Unzipping corpora/names.zip.
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package tagsets to /root/nltk_data...
[nltk_data]   Unzipping help/tagsets.zip.


True

## Section One: Pre-processing text data

### Tokenization

Tokenization means splitting text into pieces (or 'tokens') that we are interested in analyzing. Tokens might be sentences, or they might be individual words. 

Feel free to experiment with some of the tokenizers below. For the most part, `WordPunctTokenizer` will split special characters (such as apostrophes) into separate tokens, while `word_tokenize` will try to keep them attached to the relevant words. See [here](https://stackoverflow.com/questions/50240029/nltk-wordpunct-tokenize-vs-word-tokenize) for more about why. 

In [None]:
from nltk.tokenize import sent_tokenize, \
        word_tokenize, WordPunctTokenizer

input_text = "Here's some input text, we can use it to see what tokenization is and how it works." 

print("\nSentence tokenizer:")
print(sent_tokenize(input_text))

print("\nWord tokenizer:")
print(word_tokenize(input_text))

print("\nWord punct tokenizer:")
print(WordPunctTokenizer().tokenize(input_text))


Sentence tokenizer:
["Here's some input text, we can use it to see what tokenization is and how it works."]

Word tokenizer:
['Here', "'s", 'some', 'input', 'text', ',', 'we', 'can', 'use', 'it', 'to', 'see', 'what', 'tokenization', 'is', 'and', 'how', 'it', 'works', '.']

Word punct tokenizer:
['Here', "'", 's', 'some', 'input', 'text', ',', 'we', 'can', 'use', 'it', 'to', 'see', 'what', 'tokenization', 'is', 'and', 'how', 'it', 'works', '.']
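To make the difference concrete, here is a minimal sketch of punctuation-splitting tokenization using only Python's `re` module. The pattern is, to the best of my knowledge, the one `WordPunctTokenizer` is built on; the helper name `wordpunct_split` is made up for this illustration.

```python
import re

def wordpunct_split(text):
    # Match either a run of word characters or a run of punctuation
    return re.findall(r"\w+|[^\w\s]+", text)

tokens = wordpunct_split("Here's some input text.")
print(tokens)
```

Because the apostrophe is not a word character, it ends up as its own token, which is why `Here's` is split into three pieces.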


### Removing Stop Words

**Stop words** are words that are so commonly used that they are useless for most applications. Words such as *the*, *of*, and *is* tell us very little about what a document is about. We'd often like to simply remove them.

There is a pre-defined list of stop words available in NLTK.



In [None]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
 
example_sent = "Like most sentences, this sentence contains a few stop words that aren't very interesting."

tokens = word_tokenize(example_sent)
 
stop_words = set(stopwords.words('english'))
 
filtered_sentence = [w for w in tokens if w.lower() not in stop_words]
 
print(tokens)
print(filtered_sentence)

['Like', 'most', 'sentences', ',', 'this', 'sentence', 'contains', 'a', 'few', 'stop', 'words', 'that', 'are', "n't", 'very', 'interesting', '.']
['Like', 'sentences', ',', 'sentence', 'contains', 'stop', 'words', "n't", 'interesting', '.']
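Notice that punctuation and the fragment `n't` survive the filter above. If that is unwanted, one possible approach is to also drop tokens that are not purely alphabetic. The sketch below uses plain Python, the tokens copied from the output above, and a hand-picked stop list (`toy_stop_words` is hypothetical and much shorter than NLTK's real list).

```python
# Tokens copied from the word_tokenize output above
tokens = ['Like', 'most', 'sentences', ',', 'this', 'sentence', 'contains',
          'a', 'few', 'stop', 'words', 'that', 'are', "n't", 'very',
          'interesting', '.']

# Hypothetical, hand-picked stop list (NLTK's real list is much longer)
toy_stop_words = {'most', 'this', 'a', 'few', 'that', 'are', 'very'}

def keep_content_words(tokens, stop_words):
    # str.isalpha() is False for ',', '.' and "n't", so those are dropped too
    return [t for t in tokens if t.isalpha() and t.lower() not in stop_words]

content_words = keep_content_words(tokens, toy_stop_words)
print(content_words)
```

Whether dropping `n't` is a good idea depends on the task: for sentiment analysis, negation can carry a lot of meaning.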


### Stemming

In linguistics, the **stem** of a word is the part of a word responsible for its lexical meaning. It's the part of the word that's left over when you remove prefixes and suffixes, or when you de-conjugate a verb. For example, the stem of *walking* is *walk*, and the stem of *quickly* is *quick*. In English, the stem of a word is often also a word, but not always.

Stemming is a common preprocessing step when working with text data. It's useful to ignore prefixes, suffixes, and verb tense in a lot of applications; if someone is searching for documents about "organizing", we might as well return documents that are about "organize", "organized", "organizer", etc. 

`PorterStemmer`, `LancasterStemmer`, and `SnowballStemmer` are three stemmers available in NLTK. Feel free to compare them below. Check out this [stackoverflow article](https://stackoverflow.com/questions/10554052/what-are-the-major-differences-and-benefits-of-porter-and-lancaster-stemming-alg) for more on the differences between the Porter, Lancaster, and Snowball algorithms. 


In [None]:
from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.snowball import SnowballStemmer

input_words = ['chocolate', 'hat', 'walking', 'landed', 'growth', 'messenger', 
        'possibly', 'provision', 'building', 'kept', 'scratchy', 'code', 'lying']

porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer('english')


stemmer_names = ['Porter', 'Lancaster', 'Snowball']
formatted_text = '{:>16}' * (len(stemmer_names) + 1)
print('\n', formatted_text.format('Input', *stemmer_names), 
        '\n', '='*68)

for word in input_words:
    output = [word, porter.stem(word), 
            lancaster.stem(word), snowball.stem(word)]
    print(formatted_text.format(*output))


            Input          Porter       Lancaster        Snowball 
       chocolate          chocol          chocol          chocol
             hat             hat             hat             hat
         walking            walk            walk            walk
          landed            land            land            land
          growth          growth            grow          growth
       messenger         messeng         messeng         messeng
        possibly         possibl            poss         possibl
       provision          provis          provid          provis
        building           build           build           build
            kept            kept            kept            kept
        scratchy        scratchi        scratchy        scratchi
            code            code             cod            code
           lying             lie           lying             lie
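To get a feel for what a stemmer does, here is a deliberately naive suffix stripper in plain Python. It is not the Porter, Lancaster, or Snowball algorithm (those apply many more rules); the function name `toy_stem` and its suffix list are assumptions made for this sketch.

```python
def toy_stem(word, min_stem_len=3):
    # Strip the first matching suffix, as long as enough of the word remains.
    # This is NOT the Porter algorithm, just an illustration of the idea.
    for suffix in ("ing", "ed", "ly", "s"):
        stem = word[:-len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem_len:
            return stem
    return word

for w in ["walking", "landed", "quickly", "hat", "organized"]:
    print(w, "->", toy_stem(w))
```

Note that `organized` becomes `organiz`, which is not a word. Real stemmers produce similar non-words (see `chocol` in the table above), and that is usually fine as long as related words map to the same stem.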


### Lemmatization

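We can take stemming one step further by making sure the result is actually a real word. This is known as **lemmatization**. Lemmatization is slower than stemming, but it is useful when the output needs to be readable words (for example, *woman* rather than a truncated stem like *chocol*).

The `WordNetLemmatizer` removes prefixes and suffixes only if the resulting word is in its dictionary. It also tries to remove tenses from verbs and convert plural nouns to singular. It needs to know the part of speech to do this, which is why the example below runs each word through it once as a noun (`pos='n'`) and once as a verb (`pos='v'`). 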


In [None]:
from nltk.stem import WordNetLemmatizer

input_words = ['chocolate', 'hats', 'walking', 'landed', 'women', 'messengers', 
        'possibly', 'provision', 'building', 'kept', 'scratchy', 'code', 'lying', 'Frisco']

lemmatizer = WordNetLemmatizer()

lemmatizer_names = ['Noun Lemmatizer', 'Verb Lemmatizer']
formatted_text = '{:>24}' * (len(lemmatizer_names) + 1)
print('\n', formatted_text.format('Input', *lemmatizer_names), 
        '\n', '='*75)

for word in input_words:
    output = [word, lemmatizer.lemmatize(word, pos='n'), lemmatizer.lemmatize(word, pos='v')]
    print(formatted_text.format(*output))


                    Input         Noun Lemmatizer         Verb Lemmatizer 
               chocolate               chocolate               chocolate
                    hats                     hat                     hat
                 walking                 walking                    walk
                  landed                  landed                    land
                   women                   woman                   women
              messengers               messenger              messengers
                possibly                possibly                possibly
               provision               provision               provision
                building                building                   build
                    kept                    kept                    keep
                scratchy                scratchy                scratchy
                    code                    code                    code
                   lying                   lying
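The "only if the result is in the dictionary" idea can be sketched in a few lines of plain Python. Everything here is a stand-in: `TOY_DICTIONARY` replaces WordNet, and the suffix rules in `TOY_RULES` are a tiny hypothetical subset of what a real lemmatizer uses.

```python
# Hypothetical mini-dictionary standing in for WordNet
TOY_DICTIONARY = {"hat", "walk", "land", "messenger", "build"}

# Suffix rules per part of speech: (suffix to strip, replacement)
TOY_RULES = {
    "n": [("s", "")],
    "v": [("ing", ""), ("ed", ""), ("s", "")],
}

def toy_lemmatize(word, pos="n"):
    for suffix, replacement in TOY_RULES[pos]:
        if word.endswith(suffix):
            candidate = word[:-len(suffix)] + replacement
            # Only accept the candidate if it is a known word
            if candidate in TOY_DICTIONARY:
                return candidate
    return word

for w in ["hats", "walking", "landed", "scratchy"]:
    print(w, toy_lemmatize(w, "n"), toy_lemmatize(w, "v"))
```

As in the table above, `walking` is left alone when treated as a noun but becomes `walk` when treated as a verb, because the `ing` rule only exists for verbs.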

### Part-of-Speech Tagging

A part-of-speech tagger (AKA a **POS tagger**) attaches part-of-speech tags to words, meaning it labels nouns as nouns, verbs as verbs, etc. Try out NLTK's POS tagger below. Under the hood, a tagger is a machine learning model. When you give it a word in context, it predicts what type of word it is. 

POS tags are useful in a number of ways. For instance, suppose NLTK runs into a word it's never seen before: *He was scrobbling*. Even though it has no idea of the meaning, it's likely to guess that *scrobbling* is a verb. Additionally, POS tags help us distinguish between homonyms. Consider this sentence: *They refuse to permit us to obtain the refuse permit*. The first *refuse* is a verb, the second *refuse* is a noun. Depending on how picky we are, we might want to consider them as completely different words in our system. 


The example below uses NLTK's Averaged Perceptron Tagger (a *perceptron* is a neural network consisting of only one layer). If you're interested in how it works, [this article](https://explosion.ai/blog/part-of-speech-pos-tagger-in-python) explains how to write an averaged perceptron tagger.  


In [None]:
import nltk
from nltk.tokenize import word_tokenize

# Uncomment this to see descriptions of all the parts of speech in the tagger
# notice how some of the verbs include extra information, like verb tense (present, progressive, past, etc)
# nltk.help.upenn_tagset()

tokens = word_tokenize("Let's look at part-of-speech tagging.")

print(nltk.pos_tag(tokens))

[('Let', 'VB'), ("'s", 'POS'), ('look', 'VB'), ('at', 'IN'), ('part-of-speech', 'JJ'), ('tagging', 'NN'), ('.', '.')]
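One practical use of these tags is choosing the `pos` argument for the lemmatizer from the previous section. The tagger emits Penn Treebank tags (`VB`, `NN`, `JJ`, ...) while WordNet expects single letters, so a small mapping is needed. The helper below is a common rough heuristic rather than an official NLTK function, and the name `penn_to_wordnet` is made up for this sketch.

```python
def penn_to_wordnet(tag):
    # WordNet uses single letters: 'a' adjective, 'v' verb, 'r' adverb, 'n' noun
    if tag.startswith("J"):
        return "a"
    if tag.startswith("V"):
        return "v"
    if tag.startswith("R"):
        return "r"
    return "n"  # fall back to noun for everything else

# Tagged tokens copied from the output above
tagged = [('Let', 'VB'), ("'s", 'POS'), ('look', 'VB'), ('at', 'IN'),
          ('part-of-speech', 'JJ'), ('tagging', 'NN'), ('.', '.')]

wordnet_tags = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(wordnet_tags)
```

Each pair could then be passed along as `lemmatizer.lemmatize(word, pos=wordnet_pos)`, so that verbs are lemmatized as verbs and nouns as nouns.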


### Count Vectorizer

 `CountVectorizer` (from the [sklearn](https://scikit-learn.org/stable/) library) converts documents into "vectors" of term/token counts.

 CountVectorizer is useful for creating a [document-term matrix](https://en.wikipedia.org/wiki/Document-term_matrix). A document-term matrix is handy when you want to represent your data numerically, and it is often passed to machine learning algorithms (read: we will be using CountVectorizer in later examples). 

 CountVectorizer does a few handy things by default, including: 

*   converts your text to lowercase
*   does word tokenization for you
*   gets rid of single characters (meaning words like 'a' and 'I' are discarded)






 

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Each sentence here is considered a 'document'
cat_in_the_hat_docs=[
       "One Cent, Two Cents, Old Cent, New Cent: All About Money (Cat in the Hat's Learning Library)",
       "Inside Your Outside: All About the Human Body (Cat in the Hat's Learning Library)",
       "Oh, The Things You Can Do That Are Good for You: All About Staying Healthy (Cat in the Hat's Learning Library)",
       "On Beyond Bugs: All About Insects (Cat in the Hat's Learning Library)",
       "There's No Place Like Space: All About Our Solar System (Cat in the Hat's Learning Library)" 
      ]

# Create the vectorizer with default settings (the documents are passed to .fit, not to the constructor)
cv = CountVectorizer()
# .fit creates a vocabulary, that is, picks out all the unique words across the documents and assigns each one an index
vectorizer = cv.fit(cat_in_the_hat_docs)
# .fit_transform creates a document-term matrix, meaning it picks out all the unique words and returns a 2D array where 
# each row represents a document & each column represents a term/word in the vocabulary
count_vector=cv.fit_transform(cat_in_the_hat_docs)

# Print unique words with their indices
print("Vocabulary: ", vectorizer.vocabulary_)

# Print the document-term matrix
print(count_vector.toarray())

Vocabulary:  {'one': 28, 'cent': 8, 'two': 40, 'cents': 9, 'old': 26, 'new': 23, 'all': 1, 'about': 0, 'money': 22, 'cat': 7, 'in': 16, 'the': 37, 'hat': 13, 'learning': 19, 'library': 20, 'inside': 18, 'your': 42, 'outside': 30, 'human': 15, 'body': 4, 'oh': 25, 'things': 39, 'you': 41, 'can': 6, 'do': 10, 'that': 36, 'are': 2, 'good': 12, 'for': 11, 'staying': 34, 'healthy': 14, 'on': 27, 'beyond': 3, 'bugs': 5, 'insects': 17, 'there': 38, 'no': 24, 'place': 31, 'like': 21, 'space': 33, 'our': 29, 'solar': 32, 'system': 35}
[[1 1 0 0 0 0 0 1 3 1 0 0 0 1 0 0 1 0 0 1 1 0 1 1 0 0 1 0 1 0 0 0 0 0 0 0
  0 1 0 0 1 0 0]
 [1 1 0 0 1 0 0 1 0 0 0 0 0 1 0 1 1 0 1 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
  0 2 0 0 0 0 1]
 [1 1 1 0 0 0 1 1 0 0 1 1 1 1 1 0 1 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0
  1 2 0 1 0 2 0]
 [1 1 0 1 0 1 0 1 0 0 0 0 0 1 0 0 1 1 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
  0 1 0 0 0 0 0]
 [1 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 1 1 1 0 0 1 0 0 0 0 1 0 1 1 1 0 1
  0 1 1 0 0 0 0]]
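If the matrix above is hard to read, it may help to build a tiny one by hand. The sketch below uses only the standard library and two made-up documents. It splits on whitespace, which is cruder than what `CountVectorizer` does, but the resulting structure is the same: an alphabetically indexed vocabulary and one row of counts per document.

```python
from collections import Counter

docs = ["the cat saw the mouse", "the mouse ran away"]

# Tokenize naively on whitespace (CountVectorizer's tokenizer is smarter)
tokenized = [doc.lower().split() for doc in docs]

# Vocabulary: each unique word gets an index, in alphabetical order
unique_words = sorted(set(word for doc in tokenized for word in doc))
vocabulary = {word: i for i, word in enumerate(unique_words)}

# One row per document, one column per vocabulary word
matrix = []
for doc in tokenized:
    counts = Counter(doc)
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
print(matrix)
```

The last column of the first row is 2 because *the* appears twice in the first document.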


### Keyword Extraction (TF-IDF)

Keyword extraction is a common pre-processing step and a common standalone task in NLP. It means picking out important words from a document that describe what the document is about. 

**Term Frequency-Inverse Document Frequency (TF-IDF)** is essentially a statistic assigned to a word that indicates how important it is to a document. Words with a high TF-IDF score are considered to be keywords.  

[The first 5 minutes of this video](https://www.youtube.com/watch?v=RPMYV-eb6lI) give a pretty good explanation of how TF-IDF is computed. 

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer 

# We should note that TF-IDF works a lot better on larger datasets
docs=["the house had a tiny little mouse", 
"the cat saw the mouse", 
"the mouse ran away from the house", 
"the cat finally ate the mouse", 
"the end of the mouse story"
]
 
tfidf_vectorizer=TfidfVectorizer(use_idf=True) 
 
tfidf_vectorizer_vectors=tfidf_vectorizer.fit_transform(docs)

# Get the vector for the first document
first_vector_tfidfvectorizer=tfidf_vectorizer_vectors[0] 

# Using a pandas dataframe to pretty print
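# Note: get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is its replacement
df = pd.DataFrame(first_vector_tfidfvectorizer.T.todense(), index=tfidf_vectorizer.get_feature_names_out(), columns=["tfidf"]) 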
df.sort_values(by=["tfidf"],ascending=False)



           tfidf
had     0.493562
little  0.493562
tiny    0.493562
house   0.398203
mouse   0.235185
the     0.235185
ate     0.000000
away    0.000000
cat     0.000000
end     0.000000
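For the curious, the scores above can be reproduced by hand. The sketch below assumes the formula that, as far as I know, scikit-learn uses by default: raw term counts multiplied by a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, with each document vector then scaled to unit length. The helper name `tfidf_weights` is made up for this example.

```python
import math
from collections import Counter

docs = ["the house had a tiny little mouse",
        "the cat saw the mouse",
        "the mouse ran away from the house",
        "the cat finally ate the mouse",
        "the end of the mouse story"]

# Mimic the default tokenizer by dropping single-character tokens such as 'a'
tokenized = [[w for w in doc.split() if len(w) > 1] for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for doc in tokenized for word in set(doc))

def tfidf_weights(doc_tokens):
    counts = Counter(doc_tokens)
    # Term count times smoothed IDF: ln((1 + n) / (1 + df)) + 1
    raw = {w: tf * (math.log((1 + n_docs) / (1 + doc_freq[w])) + 1)
           for w, tf in counts.items()}
    # Scale the vector to unit (L2) length
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: v / norm for w, v in raw.items()}

weights = tfidf_weights(tokenized[0])
for word, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{word:>8} {weight:.6f}")
```

Words that appear in every document (*the*, *mouse*) get the lowest non-zero score, while words unique to the first document (*had*, *little*, *tiny*) get the highest.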


### Task: Pick a dataset you want to work with

Before reading into the next section, it might be helpful to pick a dataset you want to experiment with. Go ahead and search online for a dataset you are interested in using. You'll want to find one that contains raw text data. Keep your dataset in mind when going through the examples in the next section. Which (if any) of the tasks below are applicable to it? 

NLTK contains a set of [built-in datasets](http://www.nltk.org/nltk_data/) for experimentation and learning which might be a good starting point (they are mostly geared towards very specific tasks). There's also [Kaggle](https://www.kaggle.com/datasets) and [Google Dataset Search](https://datasetsearch.research.google.com/). 


## Section Two: Classification and Modeling

In this section we will walk through a few examples of classification and one example of topic modeling. 

All of the classification examples below use a Naive Bayes model for classification. For the purposes of this dev day, the model itself isn't important. A lot of what we are learning today is about how to get text data into a useful format for passing to a classifier like Naive Bayes.
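Even though the model isn't the focus, a rough picture of what it does may help when reading the examples. The sketch below is a bare-bones multinomial Naive Bayes in plain Python, trained on four made-up sentences. It is an illustration under simplifying assumptions (whitespace tokenization, add-one smoothing, unseen words ignored), not how scikit-learn or NLTK implement it.

```python
import math
from collections import Counter, defaultdict

# Tiny made-up training set: (text, label)
train = [("the engine and brakes of the car", "autos"),
         ("new tires for my car", "autos"),
         ("the goalie stopped the puck", "hockey"),
         ("the team scored in overtime", "hockey")]

word_counts = defaultdict(Counter)  # label -> word -> count
label_counts = Counter()            # label -> number of documents
for text, label in train:
    word_counts[label].update(text.split())
    label_counts[label] += 1

vocabulary = {w for counts in word_counts.values() for w in counts}

def predict(text):
    scores = {}
    for label in label_counts:
        total_words = sum(word_counts[label].values())
        # Start from the log prior: how common is this label overall?
        score = math.log(label_counts[label] / len(train))
        for word in text.split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("car engine"))
print(predict("puck team"))
```

The model simply asks which label makes the observed words most likely, which is why getting the text into good word counts matters so much.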

### Category Prediction

The example of category prediction below uses the 20 Newsgroups dataset. It contains around 18,000 newsgroup posts on 20 topics, although we will only train on five of them. The data has already been split into two subsets, one to train our model and one for testing the output of the model. For fun, we are using our own tiny set of test data instead of the provided test data.

A detailed description of the dataset is available [here](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html).

Feel free to play around with the test input data. Although it does work a lot of the time, it's still pretty easy to trick the model. As you might expect, if you write something that isn't in one of the 5 categories it's trained on, it will spit out something that just looks random. 

In [None]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

category_map = {'talk.politics.misc': 'Politics', 'rec.autos': 'Autos', 
        'rec.sport.hockey': 'Hockey', 'sci.electronics': 'Electronics', 
        'sci.med': 'Medicine'}

# Get the training dataset
# Shuffling training data is standard practice in ML, to create more general models and prevent a common problem called overfitting. The thread linked below has a more thorough discussion
# https://datascience.stackexchange.com/questions/24511/why-should-the-data-be-shuffled-for-machine-learning-tasks
training_data = fetch_20newsgroups(subset='train', 
        categories=list(category_map), shuffle=True, random_state=5)

# Get a document-term matrix
count_vectorizer = CountVectorizer()
train_term_counts = count_vectorizer.fit_transform(training_data.data)

# We can pass a document-term matrix to tfidf.fit_transform() to get the TF-IDF weights of each word
# Notice we didn't worry about stop words? They are going to have a very low tf-idf weight anyways.
tfidf = TfidfTransformer()
train_tfidf = tfidf.fit_transform(train_term_counts)

# Train the model 
# For each row in our document-term matrix, we have a corresponding category in training_data.target
classifier = MultinomialNB().fit(train_tfidf, training_data.target)

# Erase my test data and create your own. Keep in mind the model is going to try to classify in one of the 5 categories in category_map
input_data = [
    'You should always be careful if you are driving a car', 
    'A lot of devices are not as secure as you might think',
    'The sports cup was won by a team because they scored the most points at the super cup game, yay',
    'Big election has politicians doing all sorts of stuff to get votes', 
    'Medical experts warn Burrata cheese sold in Quebec is not safe'
]


# Transform input data using count vectorizer
input_term_counts = count_vectorizer.transform(input_data)

# Transform again to get the tf-idf weights
input_tfidf = tfidf.transform(input_term_counts)

# With our data in this format, we can pass it to the classification model and see what it predicts.
predictions = classifier.predict(input_tfidf)

# Print the outputs
for sent, category in zip(input_data, predictions):
    print('\nInput:', sent, '\nPredicted category:', 
            category_map[training_data.target_names[category]])



Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)



Input: You should always be careful if you are driving a car 
Predicted category: Autos

Input: A lot of devices are not as secure as you might think 
Predicted category: Electronics

Input: The sports cup was won by a team because they scored the most points at the super cup game, yay 
Predicted category: Hockey

Input: Big election has politicians doing all sorts of stuff to get votes 
Predicted category: Politics

Input: Medical experts warn Burrata cheese sold in Quebec is not safe 
Predicted category: Medicine


### Gender Identifier

Gender identification is a well-studied task in NLP with many different approaches. In the example below, we will test if the model is able to accurately identify gender given the last couple letters of a first name. 

In classification problems (such as gender identification and category prediction), we often create the model like so: 

`model = whateverModelIAmUsing.fit(X, y)`

or 

`training_data = [({featureName: feature}, target), ({featureName: feature}, target)...]`

`model = whateverModelIAmUsing.train(training_data)`

In the first example, `X` is the set of **features** we think will help the model make accurate predictions and `y` is the set of **targets**, AKA the answers that the model should ideally come up with. There are many methods and heuristics out there for choosing good features (if you're interested in learning more about this, there's a good tutorial [here](https://www.kaggle.com/learn/feature-engineering)). 

For our purposes, let's simply compare the accuracy between a few different sets of features. We'll train the model based on the last letter of a name, the last two letters, the last three letters, and so on.


In [None]:
import random

from nltk import NaiveBayesClassifier
from nltk.classify import accuracy as nltk_accuracy
from nltk.corpus import names

# This time, we are only going to pass the last N letters of the word to the model. 
def extract_features(word, N=2):
    last_n_letters = word[-N:]
    return {'lastLetters': last_n_letters.lower()}

if __name__=='__main__':
    # Create training data using labeled names available in NLTK
    # Unfortunately the dataset doesn't yet contain a list of gender-neutral names
    male_list = [(name, 'male') for name in names.words('male.txt')]
    female_list = [(name, 'female') for name in names.words('female.txt')]
    data = (male_list + female_list)

    # Shuffle the data
    random.seed(5)
    random.shuffle(data)

    # Create test data
    input_names = ['Yash', 'Shrimanti', 'Sai Ram', 'Riley', 'Brooke', 'Ashley', 'Robin']

    # Define the number of samples used for train and test
    # It's typical to use an 80/20 split
    num_train = int(0.8 * len(data))

    # Iterate through different lengths to compare the accuracy
    for i in range(1, 6):
        print('\nNumber of end letters:', i)
        features = [(extract_features(n, i), gender) for (n, gender) in data]
        train_data, test_data = features[:num_train], features[num_train:]
        classifier = NaiveBayesClassifier.train(train_data)

        # Compute the accuracy of the classifier 
        accuracy = round(100 * nltk_accuracy(classifier, test_data), 2)
        print('Accuracy = ' + str(accuracy) + '%')

        # Predict outputs for input names using the trained classifier model
        for name in input_names:
            print(name, '=>', classifier.classify(extract_features(name, i)))




Number of end letters: 1
Accuracy = 75.02%
Yash => female
Shrimanti => female
Sai Ram => male
Riley => female
Brooke => female
Ashley => female
Robin => male

Number of end letters: 2
Accuracy = 78.35%
Yash => male
Shrimanti => female
Sai Ram => male
Riley => female
Brooke => male
Ashley => female
Robin => male

Number of end letters: 3
Accuracy = 76.02%
Yash => male
Shrimanti => female
Sai Ram => male
Riley => male
Brooke => female
Ashley => male
Robin => male

Number of end letters: 4
Accuracy = 69.35%
Yash => female
Shrimanti => female
Sai Ram => female
Riley => female
Brooke => female
Ashley => female
Robin => male

Number of end letters: 5
Accuracy = 65.07%
Yash => female
Shrimanti => female
Sai Ram => female
Riley => male
Brooke => female
Ashley => female
Robin => female
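The experiment above uses a single feature per name. A natural next step is to combine several cheap clues in one feature dictionary. The sketch below shows only the data format; the function name `name_features`, the choice of features, and the two hand-written labeled names are all assumptions for illustration, not results from the names corpus.

```python
def name_features(name):
    # Hypothetical feature set: several cheap clues instead of just one
    lowered = name.lower()
    return {
        'first_letter': lowered[0],
        'last_letter': lowered[-1],
        'last_two': lowered[-2:],
        'length': len(lowered),
    }

# A few hand-written (name, target) pairs standing in for the NLTK names corpus
labeled_names = [('Brooke', 'female'), ('Robin', 'male')]

# Same shape as the training data above: a list of (feature dict, target) pairs
training_pairs = [(name_features(name), target) for name, target in labeled_names]
print(training_pairs[0])
```

A list in this shape can be handed to `NaiveBayesClassifier.train(...)` just like `train_data` in the cell above. Whether the extra features actually improve accuracy is something to measure rather than assume.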


### Sentiment Analyzer

Sentiment analysis, or opinion mining, is the practice of creating models that determine the tone of a piece of text (or voice) data, such as whether a review was positive or negative. 

Below is an example of a sentiment analyzer using NLTK's Movie Review toy dataset. 

If you're interested and have time, sentiment analysis of tweets can be a fun project. Here's a [tutorial](https://towardsdatascience.com/how-to-scrape-tweets-from-twitter-59287e20f0f1) on how to use the Twitter API to get a dataset of tweets into Python. 


In [None]:
from nltk.corpus import movie_reviews 
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy as nltk_accuracy
 
# Extract features from the input list of words
# The format we are using for the features looks like this:
# [({'here': True, 'are': True, 'all': True, 'the': True, 'words': True, 'in': True, 'review': True}, 'Positive')] 
def extract_features(words):
    return dict([(word, True) for word in words])
 
if __name__=='__main__':
    # Load the reviews from the corpus 
    fileids_pos = movie_reviews.fileids('pos')
    fileids_neg = movie_reviews.fileids('neg')
     
    # Extract the features from the reviews
    features_pos = [(extract_features(movie_reviews.words(
            fileids=[f])), 'Positive') for f in fileids_pos]
    features_neg = [(extract_features(movie_reviews.words(
            fileids=[f])), 'Negative') for f in fileids_neg]
     
    # This is our 80/20 train/test split
    threshold = 0.8
    num_pos = int(threshold * len(features_pos))
    num_neg = int(threshold * len(features_neg))
  
    features_train = features_pos[:num_pos] + features_neg[:num_neg]
    features_test = features_pos[num_pos:] + features_neg[num_neg:]  
     
    # Train a Naive Bayes classifier & get the accuracy
    classifier = NaiveBayesClassifier.train(features_train)
    print('\nAccuracy of the classifier:', nltk_accuracy(
            classifier, features_test))

    # NaiveBayesClassifier can get us the most informative words, that is, words that strongly influence the model
    top_ten_words = classifier.most_informative_features()[:10]
    print('\nTop ten most informative words: ')
    for i, item in enumerate(top_ten_words):
        print(str(i+1) + '. ' + item[0])

    # Let's make up our own test data again
    input_reviews = [
        'I liked the cinematography', 
        'This was a terrible movie, the characters were so dumb',
        'This movie has one of my favorite actors! I loved it!', 
        'This is such an boring movie. Would not recommend.',
        'This movie contains Nicolas Cage'
    ]

    print("\nMovie review predictions:")
    for review in input_reviews:
        print("\nReview:", review)

        # Compute the probabilities
        probabilities = classifier.prob_classify(extract_features(review.split()))

        # Pick the maximum value
        predicted_sentiment = probabilities.max()

        # Print outputs
        print("Predicted sentiment:", predicted_sentiment)
        print("Probability:", round(probabilities.prob(predicted_sentiment), 2))


Accuracy of the classifier: 0.735

Top ten most informative words: 
1. outstanding
2. insulting
3. vulnerable
4. ludicrous
5. uninvolving
6. astounding
7. avoids
8. fascination
9. symbol
10. seagal

Movie review predictions:

Review: I liked the cinematography
Predicted sentiment: Positive
Probability: 0.69

Review: This was a terrible movie, the characters were so dumb
Predicted sentiment: Negative
Probability: 0.92

Review: This movie has one of my favorite actors! I loved it!
Predicted sentiment: Positive
Probability: 0.56

Review: This is such an boring movie. Would not recommend.
Predicted sentiment: Negative
Probability: 0.76

Review: This movie contains Nicolas Cage
Predicted sentiment: Negative
Probability: 0.53
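One thing that may explain some of the low probabilities above: the made-up reviews are split with `review.split()`, which keeps capital letters and leaves punctuation glued to words (`it!`, `movie,`). My understanding is that the words in the movie review corpus are lowercase with punctuation as separate tokens, so those features would never match anything the classifier has seen. The sketch below shows one possible way to normalize the input first; `review_features` is a hypothetical helper, and the assumption about the corpus is worth verifying.

```python
import re

def review_features(text):
    # Lowercase, then split into words and single punctuation marks
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    return {token: True for token in tokens}

features = review_features("This movie has one of my favorite actors! I loved it!")
print(sorted(features))
```

Swapping this in for `extract_features(review.split())` when classifying new reviews is an easy experiment to try.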


### Topic Modeling

So far, we have seen examples of classification, where we have some data and we'd like to make a specific conclusion about it: positive or negative, about sports or about politics, etc. We have predetermined the categories that we want to fit our data into. 

Suppose we want to learn something about some given text data without having any pre-determined categories. One thing we can do is topic modeling, where we generate a statistical model that tells us what a document is about. 

**Latent Dirichlet Allocation** (LDA) is an algorithm for creating *topic vectors*. A topic vector is a set of words which represent an abstract topic. If you are interested in a full description of the algorithm, there is one [here](https://www.youtube.com/watch?v=DWJYZq_fQ2A). 

We need to pass a parameter to the LDA model function that tells it how many topics we want it to return. There is no reliable way of determining in advance how many noteworthy LDA topic vectors a document has; it's far from an exact science and requires some trial and error.

Feel free to play around with the example below.

In [None]:
from nltk.tokenize import RegexpTokenizer  
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from gensim import models, corpora

# More of our own test data. 
def get_data():
  return [
          'The recorded history of Scotland begins with the arrival of the Roman Empire in the 1st century.',
          'Then the Viking invasions began, forcing the Picts and Gaels to unite, forming the Kingdom of Scotland.',
          'The Kingdom of Scotland was united under the House of Alpin, whose members fought among each other during frequent disputed successions.',
          'England would take advantage of this questioned succession to launch a series of conquests, resulting in the Wars of Scottish Independence',
          'During the Scottish Enlightenment and Industrial Revolution, Scotland became one of the powerhouses of Europe.',
          'Giraffes usually inhabit savannahs and open woodlands.' ,
          'The giraffe\'s chief distinguishing characteristics are its extremely long neck and legs and its distinctive coat pattern.',
          'Giraffes may be preyed on by lions, leopards, spotted hyenas and African wild dogs.',
          'It is classified as vulnerable to extinction, and has been extirpated from many parts of its former range.',
          'The elongation of the neck appears to have started early in the giraffe lineage.',
  ]


def preprocess(input_text):
    # Regular expression tokenizer: \w+ drops punctuation (digits are still kept, so a token like '1st' survives)
    tokenizer = RegexpTokenizer(r'\w+') 
    # A set makes the membership checks below fast
    stop_words = set(stopwords.words('english'))
    stemmer = SnowballStemmer('english')

    tokens = tokenizer.tokenize(input_text.lower()) 
    tokens = [x for x in tokens if x not in stop_words]
    tokens_stemmed = [stemmer.stem(x) for x in tokens]

    return tokens_stemmed
    
if __name__=='__main__':
    data = get_data()

    # Create a list for sentence tokens
    tokens = [preprocess(x) for x in data]
    
    # Create document-term matrix
    # In this case, we are taking the tokenized words and using a bag-of-words format to create the doc-term matrix, because there is always more than one way of doing things
    # doc2bow => given a document, we would like a bag of words, meaning for each token create a tuple with a token ID and the number of times it occurs in the document. 
    # https://en.wikipedia.org/wiki/Bag-of-words_model
    dict_tokens = corpora.Dictionary(tokens)
    doc_term_matrix = [dict_tokens.doc2bow(doc_tokens) for doc_tokens in tokens]

    # The number of topics we want the LDA model to give us, I chose 2 because it already looks like there are two topics in the dataset
    # For most real-world applications, the dataset would be too large to guess at the 'right' number of topics. You end up just picking a number. 
    num_topics = 2

    # Generate the LDA model 
    ldamodel = models.ldamodel.LdaModel(doc_term_matrix, 
            num_topics=num_topics, id2word=dict_tokens, passes=25)

    num_words = 5
    print('\nTop ' + str(num_words) + ' contributing words to each topic:')
    for item in ldamodel.print_topics(num_topics=num_topics, num_words=num_words):
        print('\nTopic', item[0])

        # Print the contributing words along with their relative contributions 
        list_of_strings = item[1].split(' + ')
        for text in list_of_strings:
            weight = text.split('*')[0]
            word = text.split('*')[1]
            print(word, '==>', str(round(float(weight) * 100, 2)) + '%')


Top 5 contributing words to each topic:

Topic 0
"giraff" ==> 4.1%
"neck" ==> 2.9%
"scottish" ==> 1.7%
"success" ==> 1.7%
"seri" ==> 1.7%

Topic 1
"scotland" ==> 4.9%
"kingdom" ==> 2.7%
"unit" ==> 2.7%
"success" ==> 1.6%
"disput" ==> 1.6%

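The printing loop above pulls weights and words out of each topic string with `split`, which leaves the quotation marks attached to the words. A regular expression is one alternative. The sketch below works on an example string written in the same `weight*"word" + ...` format that the loop above takes apart; `parse_topic` is a hypothetical helper, not part of gensim.

```python
import re

def parse_topic(topic_string):
    # Each term looks like 0.041*"giraff"; capture the weight and the word
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

# Example string in the format the cell above splits on ' + ' and '*'
example = '0.041*"giraff" + 0.029*"neck" + 0.017*"scottish"'
for word, weight in parse_topic(example):
    print(word, '==>', str(round(weight * 100, 2)) + '%')
```

I believe gensim's `LdaModel` also has a `show_topic` method that returns word and probability pairs directly, which would avoid string parsing altogether; check the gensim documentation for the exact signature.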

## Now go ahead and try things out on a dataset of your choice :)

The tasks explained above are by no means exhaustive; there are a ton of other things you can do in NLTK. Here are some examples of other tasks/tools you might want to google: 


*   Chunking 
*   Bag of words (even though I kind of snuck it in anyways)
*   Word2Vec (for learning associations between words)
*   Creating summaries
*   Visualizing text data (if you really aren't sure what you want to do, start by making a word cloud)
*   Similarity matching
*   Natural language translation
*   Lots of other stuff



# References

https://kavita-ganesan.com/how-to-use-countvectorizer/#.YOiNuxNKj6M

https://kavita-ganesan.com/tfidftransformer-tfidfvectorizer-usage-differences/#.YO4XURNKj6Y

https://www.udemy.com/course/understand-and-practice-ai-natural-language-processing-in-python

https://www.nltk.org/book/ch05.html


