# Natural Language Processing

In this homework, you will apply the TF-IDF technique to text classification and use a word2vec model to generate dense word embeddings for other NLP tasks.

## Text Classification
The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of my knowledge, it was originally collected by Ken Lang, probably for his paper "Newsweeder: Learning to filter netnews", though he does not explicitly mention this collection. The 20 Newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.

In this lab, we will experiment with different feature extraction methods on the 20 Newsgroups dataset, including count vectors and TF-IDF vectors. We will also apply a Naive Bayes classifier to this dataset and report the prediction accuracy.

### Load and explore the 20 Newsgroups data

The 20 Newsgroups data is available through the sklearn library. We can load it directly using the following command.

In [3]:
# load the training data and test data
import numpy as np
from sklearn.datasets import fetch_20newsgroups

twenty_train = fetch_20newsgroups(subset='train', shuffle=False)
twenty_test = fetch_20newsgroups(subset='test', shuffle=False)

# print the number of training documents and the number of categories
print("Number of training data:" + str(len(twenty_train.data)))
print("Number of categories:" + str(len(twenty_train.target_names)))

# print the first text and its category
print(twenty_train.data[0])
print(twenty_train.target[0])

# You can check the target variable by printing all the categories
twenty_train.target_names

Number of training data:11314
Number of categories:20
From: cubbie@garnet.berkeley.edu (                               )
Subject: Re: Cubs behind Marlins? How?
Article-I.D.: agate.1pt592$f9a
Organization: University of California, Berkeley
Lines: 12
NNTP-Posting-Host: garnet.berkeley.edu


gajarsky@pilot.njin.net writes:

morgan and guzman will have era's 1 run higher than last year, and
 the cubs will be idiots and not pitch harkey as much as hibbard.
 castillo won't be good (i think he's a stud pitcher)

       This season so far, Morgan and Guzman helped to lead the Cubs
       at top in ERA, even better than THE rotation at Atlanta.
       Cubs ERA at 0.056 while Braves at 0.059. We know it is early
       in the season, we Cubs fans have learned how to enjoy the
       short triumph while it is still there.

9


['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

### Build a Naive Bayes Model 

Your task is to build a simple TF-IDF (Term Frequency - Inverse Document Frequency) feature extractor followed by a Naive Bayes classifier. Note that you can connect the feature generation and model training steps into one by using the pipeline API from sklearn.

Try to use grid search to find the best hyperparameters from the following settings (feel free to explore other options as well):

* Different n-gram ranges
* Whether or not to remove stop words
* Whether or not to apply IDF

I am intentionally keeping the requirements vague to encourage you to explore different options and find the best solution. After identifying the best model, we use that model to make predictions on the test data and report its accuracy.


### Raw count features from text
We can convert the raw text into a vector of counts before feeding it into an ML model. In sklearn, an API called `CountVectorizer` does the job for us.


In [4]:
# Extracting features from text files
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(
    stop_words = 'english',
    max_features = None,
    ngram_range = (1, 1)
)

X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

(11314, 129796)
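To make the idea of a count vector concrete, here is a minimal pure-Python sketch of the same idea on two made-up sentences. The documents, the `tokenize` helper, and the variable names are illustrative assumptions; the regular expression mirrors the default token pattern of `CountVectorizer` (tokens of two or more word characters), but the sketch does not remove stop words and is not the library's implementation.

```python
import re
from collections import Counter

# Two toy documents (hypothetical, for illustration only)
docs = ["the cubs won the game", "the braves lost"]

def tokenize(text):
    # Lowercase the text and keep tokens of 2+ word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: each unique token gets a column index, in sorted order
vocab = {tok: i for i, tok in enumerate(sorted({t for d in docs for t in tokenize(d)}))}

# One row per document, one column per vocabulary entry
rows = []
for d in docs:
    counts = Counter(tokenize(d))
    rows.append([counts.get(tok, 0) for tok in vocab])

print(list(vocab))
print(rows)
```

Each row has one entry per vocabulary word, which is why the real matrix above has as many columns as there are distinct tokens in the training set.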

### TF-IDF features from text
Similar to the raw count vector, sklearn has an API called `TfidfTransformer` which converts raw counts to a TF-IDF feature representation.

In [5]:
# TF-IDF
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer(
    norm = 'l2',
    use_idf = True,
    smooth_idf = True
)

X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(11314, 129796)
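The settings used above (`use_idf=True`, `smooth_idf=True`, `norm='l2'`) correspond to the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The sketch below applies that formula by hand to a tiny, made-up count matrix; the numbers and variable names are assumptions chosen for illustration.

```python
import math

# Toy count matrix: 2 documents x 6 terms (hypothetical values)
counts = [
    [0, 1, 1, 0, 2, 1],
    [1, 0, 0, 1, 1, 0],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: number of documents containing each term
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

# Weight the counts by idf, then scale each row to unit L2 norm
tfidf = []
for row in counts:
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    tfidf.append([v / norm for v in weighted])

print([round(w, 3) for w in idf])
print([[round(v, 3) for v in row] for row in tfidf])
```

The term in column 4 appears in both documents, so it receives the smallest idf weight, while terms that appear in only one document are weighted up.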

### Train Naive Bayes Model

Given the TF-IDF features for the data, we are ready to train the Naive Bayes classifier.

In [6]:
# Training Naive Bayes (NB) classifier on training data.
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)
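For intuition about what a multinomial Naive Bayes classifier estimates, the following is a minimal from-scratch sketch with additive (Laplace) smoothing. The functions `train_nb` and `predict_nb`, the toy data, and the labels are hypothetical and exist only for illustration; this is not the sklearn implementation.

```python
import math

def train_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from count rows X."""
    log_prior, log_lik = {}, {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # Per-feature totals for this class, plus the smoothing term alpha
        totals = [sum(col) + alpha for col in zip(*rows)]
        denom = sum(totals)
        log_lik[c] = [math.log(t / denom) for t in totals]
    return log_prior, log_lik

def predict_nb(model, x):
    """Return the class with the highest log posterior score for row x."""
    log_prior, log_lik = model
    scores = {c: log_prior[c] + sum(v * w for v, w in zip(x, log_lik[c]))
              for c in log_prior}
    return max(scores, key=scores.get)

# Toy data: 4 documents, 4 vocabulary terms, 2 classes (all made up)
X = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]]
y = ["baseball", "baseball", "space", "space"]

model = train_nb(X, y)
print(predict_nb(model, [1, 1, 0, 0]))
print(predict_nb(model, [0, 0, 3, 0]))
```

The smoothing term keeps the likelihood of an unseen word from becoming zero, which would otherwise rule out a class on the basis of a single word.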

### Build a pipeline to connect all the components together
Another way to connect the feature generation and model training steps is to use the pipeline API.

In [7]:
# Building a pipeline: we can write less code and do all of the above by chaining the steps in a Pipeline.
# The names 'vect', 'tfidf' and 'clf' are arbitrary but will be used later.
# We will be using 'text_clf' going forward.
from sklearn.pipeline import Pipeline

text_clf = Pipeline(
    [
        ('vect', CountVectorizer()), 
        ('tfidf', TfidfTransformer()), 
        ('clf', MultinomialNB())
    ]
)

text_clf = text_clf.fit(twenty_train.data, twenty_train.target)

In [8]:
# Performance of NB Classifier
import numpy as np
predicted = text_clf.predict(twenty_test.data)
np.mean(predicted == twenty_test.target)

0.7738980350504514

### Grid Search
To find the best set of hyperparameters, we can apply grid search over different TF-IDF settings.

In [9]:
# Grid Search
# Here, we create a dictionary of the parameters we would like to tune.
# Every parameter name starts with the name of a pipeline step (remember the arbitrary names we gave),
# followed by two underscores and the name of that step's parameter.
# E.g. vect__ngram_range: here we try unigrams only and unigrams plus bigrams, and keep whichever is optimal.

from sklearn.model_selection import GridSearchCV
parameters = {'vect__ngram_range': [(1, 1), (1, 2)], 'tfidf__use_idf': (True, False)}
parameters

{'vect__ngram_range': [(1, 1), (1, 2)], 'tfidf__use_idf': (True, False)}
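Grid search evaluates every combination of the listed values, so the number of candidate models grows multiplicatively with each added setting. As a rough sketch, the grid below adds the stop-word option from the task list to the two settings above and enumerates the combinations with the standard library. The name `param_grid` and the choice of values are assumptions; the keys assume pipeline steps named `vect` and `tfidf`, as in `text_clf`.

```python
from itertools import product

# Hypothetical search space covering all three suggested settings.
# Keys follow the Pipeline convention <step name>__<parameter name>.
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "vect__stop_words": [None, "english"],
    "tfidf__use_idf": [True, False],
}

# Enumerate every combination, which is what an exhaustive grid search evaluates
names = sorted(param_grid)
candidates = [
    dict(zip(names, values))
    for values in product(*(param_grid[name] for name in names))
]

print(len(candidates))
print(candidates[0])
```

With cross-validation, each candidate is fitted once per fold, so a grid like this one with `cv=2` would mean 16 fits before the final refit on the full training set.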

In [10]:
# Next, we create an instance of the grid search by passing the classifier and the parameters.
# n_jobs=1 runs the search on a single core; set n_jobs=-1 to use all cores of the machine.
# cv=2 uses 2-fold cross-validation.
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=1, cv=2)
gs_clf = gs_clf.fit(twenty_train.data, twenty_train.target)

In [11]:
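# To see the best mean cross-validation score and the best parameters, run the following code.
# Only the last expression of a notebook cell is displayed, so the output below shows best_params_ only;
# wrap each line in print() to see both values.

gs_clf.best_score_
gs_clf.best_params_

# Note: the grid above tunes only vect__ngram_range and tfidf__use_idf, so best_params_ contains just these two keys.
# With clf__alpha also in the grid (e.g. 0.01), an accuracy of ~90.6% has been reported for the NB classifier;
# that parameter is not tuned here.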

{'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}

In [12]:
# TODO: make predictions using NB classifier and evaluate its accuracy on the test data

predicted = gs_clf.predict(twenty_test.data)
np.mean(predicted == twenty_test.target)

0.765400955921402

---------

## Word Embedding with word2vec

In this assessment, we will experiment with the [word2vec](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) model from the [gensim](https://radimrehurek.com/gensim/) package and generate word embeddings from a review dataset.

In [6]:
import gzip
import logging
import warnings
from gensim.models import Word2Vec

warnings.simplefilter(action='ignore', category=FutureWarning)
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

### Load the review data

In [7]:
import gensim
def read_input(input_file):
    """This method reads the input file which is in gzip format"""
    print("reading file {0}...this may take a while".format(input_file))
    with gzip.open(input_file, 'rb') as f:
        for i, line in enumerate(f):
 
            if (i % 10000 == 0):
                print("read {0} reviews".format(i))
            # do some pre-processing and return a list of words for each review text
            yield gensim.utils.simple_preprocess(line)
            
documents = list(read_input('reviews_data.txt.gz'))
logging.info("Done reading data file")

reading file reviews_data.txt.gz...this may take a while
read 0 reviews
read 10000 reviews
read 20000 reviews
read 30000 reviews
read 40000 reviews
read 50000 reviews
read 60000 reviews
read 70000 reviews
read 80000 reviews
read 90000 reviews
read 100000 reviews
read 110000 reviews
read 120000 reviews
read 130000 reviews
read 140000 reviews
read 150000 reviews
read 160000 reviews
read 170000 reviews
read 180000 reviews
read 190000 reviews
read 200000 reviews
read 210000 reviews
read 220000 reviews
read 230000 reviews
read 240000 reviews
read 250000 reviews


2019-07-30 00:04:50,676 : INFO : Done reading data file


### Train the word2vec model

The word2vec algorithms include the skip-gram and CBOW models, trained with either hierarchical softmax or negative sampling, as introduced in "Efficient Estimation of Word Representations in Vector Space" and "Distributed Representations of Words and Phrases and their Compositionality". A word2vec tutorial can be found [here](https://rare-technologies.com/word2vec-tutorial/).
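As a rough illustration of the skip-gram idea, the sketch below lists the (center word, context word) pairs that a fixed context window produces for one short sentence. The function `skipgram_pairs` and the example sentence are hypothetical; actual word2vec training adds further details (such as negative sampling or hierarchical softmax) that are omitted here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(["the", "staff", "was", "polite"], window=1)
for center, context in pairs:
    print(center, "->", context)
```

The skip-gram model learns to predict the context word from the center word, while CBOW goes the other way and predicts the center word from its surrounding context.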

In [8]:
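# Build the vocabulary and train the model.
# Passing `documents` to the constructor already builds the vocabulary and
# runs an initial round of training with the default number of epochs.
# Note: in gensim >= 4.0 the `size` parameter is named `vector_size`.
model = gensim.models.Word2Vec(
    documents,
    size=150,
    window=10,
    min_count=2,
    workers=10)
# Continue training on the same corpus for 3 more epochs.
model.train(documents, total_examples=len(documents), epochs=3)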

2019-07-30 00:05:23,079 : INFO : collecting all words and their counts
2019-07-30 00:05:23,082 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-07-30 00:05:23,338 : INFO : PROGRESS: at sentence #10000, processed 1655714 words, keeping 25777 word types
2019-07-30 00:05:23,594 : INFO : PROGRESS: at sentence #20000, processed 3317863 words, keeping 35016 word types
2019-07-30 00:05:23,941 : INFO : PROGRESS: at sentence #30000, processed 5264072 words, keeping 47518 word types
2019-07-30 00:05:24,240 : INFO : PROGRESS: at sentence #40000, processed 7081746 words, keeping 56675 word types
2019-07-30 00:05:24,631 : INFO : PROGRESS: at sentence #50000, processed 9089491 words, keeping 63744 word types
2019-07-30 00:05:25,010 : INFO : PROGRESS: at sentence #60000, processed 11013726 words, keeping 76786 word types
2019-07-30 00:05:25,297 : INFO : PROGRESS: at sentence #70000, processed 12637528 words, keeping 83199 word types
2019-07-30 00:05:25,555 : INFO : PROG

2019-07-30 00:06:08,807 : INFO : worker thread finished; awaiting finish of 8 more threads
2019-07-30 00:06:08,824 : INFO : worker thread finished; awaiting finish of 7 more threads
2019-07-30 00:06:08,826 : INFO : worker thread finished; awaiting finish of 6 more threads
2019-07-30 00:06:08,827 : INFO : worker thread finished; awaiting finish of 5 more threads
2019-07-30 00:06:08,863 : INFO : worker thread finished; awaiting finish of 4 more threads
2019-07-30 00:06:08,864 : INFO : worker thread finished; awaiting finish of 3 more threads
2019-07-30 00:06:08,865 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-07-30 00:06:08,877 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-07-30 00:06:08,880 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-07-30 00:06:08,882 : INFO : EPOCH - 1 : training on 41519358 raw words (30346683 effective words) took 36.4s, 833654 effective words/s
2019-07-30 00:06:09,911 : INFO : EPOCH 2 

2019-07-30 00:07:06,013 : INFO : EPOCH 3 - PROGRESS: at 63.22% examples, 836800 words/s, in_qsize 19, out_qsize 0
2019-07-30 00:07:07,015 : INFO : EPOCH 3 - PROGRESS: at 65.90% examples, 833941 words/s, in_qsize 18, out_qsize 1
2019-07-30 00:07:08,024 : INFO : EPOCH 3 - PROGRESS: at 68.47% examples, 830445 words/s, in_qsize 18, out_qsize 1
2019-07-30 00:07:09,036 : INFO : EPOCH 3 - PROGRESS: at 70.98% examples, 828620 words/s, in_qsize 18, out_qsize 1
2019-07-30 00:07:10,067 : INFO : EPOCH 3 - PROGRESS: at 74.08% examples, 829280 words/s, in_qsize 19, out_qsize 0
2019-07-30 00:07:11,070 : INFO : EPOCH 3 - PROGRESS: at 77.02% examples, 832993 words/s, in_qsize 19, out_qsize 0
2019-07-30 00:07:12,070 : INFO : EPOCH 3 - PROGRESS: at 80.01% examples, 836247 words/s, in_qsize 19, out_qsize 0
2019-07-30 00:07:13,072 : INFO : EPOCH 3 - PROGRESS: at 83.02% examples, 838568 words/s, in_qsize 19, out_qsize 0
2019-07-30 00:07:14,073 : INFO : EPOCH 3 - PROGRESS: at 85.98% examples, 841122 words/s,

2019-07-30 00:08:01,031 : INFO : EPOCH 5 - PROGRESS: at 12.40% examples, 805064 words/s, in_qsize 19, out_qsize 0
2019-07-30 00:08:02,056 : INFO : EPOCH 5 - PROGRESS: at 14.78% examples, 802156 words/s, in_qsize 19, out_qsize 0
2019-07-30 00:08:03,059 : INFO : EPOCH 5 - PROGRESS: at 17.44% examples, 822000 words/s, in_qsize 19, out_qsize 0
2019-07-30 00:08:04,083 : INFO : EPOCH 5 - PROGRESS: at 19.79% examples, 827897 words/s, in_qsize 19, out_qsize 0
2019-07-30 00:08:05,085 : INFO : EPOCH 5 - PROGRESS: at 22.46% examples, 836665 words/s, in_qsize 19, out_qsize 0
2019-07-30 00:08:06,093 : INFO : EPOCH 5 - PROGRESS: at 24.58% examples, 835791 words/s, in_qsize 17, out_qsize 2
2019-07-30 00:08:07,093 : INFO : EPOCH 5 - PROGRESS: at 27.98% examples, 842405 words/s, in_qsize 19, out_qsize 0
2019-07-30 00:08:08,144 : INFO : EPOCH 5 - PROGRESS: at 30.83% examples, 836115 words/s, in_qsize 16, out_qsize 3
2019-07-30 00:08:09,161 : INFO : EPOCH 5 - PROGRESS: at 33.75% examples, 834499 words/s,

2019-07-30 00:09:02,577 : INFO : EPOCH 1 - PROGRESS: at 55.34% examples, 651786 words/s, in_qsize 19, out_qsize 0
2019-07-30 00:09:03,580 : INFO : EPOCH 1 - PROGRESS: at 57.91% examples, 655364 words/s, in_qsize 17, out_qsize 2
2019-07-30 00:09:04,602 : INFO : EPOCH 1 - PROGRESS: at 60.81% examples, 660891 words/s, in_qsize 19, out_qsize 0
2019-07-30 00:09:05,607 : INFO : EPOCH 1 - PROGRESS: at 64.14% examples, 669738 words/s, in_qsize 19, out_qsize 0
2019-07-30 00:09:06,610 : INFO : EPOCH 1 - PROGRESS: at 67.34% examples, 679053 words/s, in_qsize 19, out_qsize 0
2019-07-30 00:09:07,620 : INFO : EPOCH 1 - PROGRESS: at 70.09% examples, 684239 words/s, in_qsize 19, out_qsize 0
2019-07-30 00:09:08,631 : INFO : EPOCH 1 - PROGRESS: at 71.36% examples, 674740 words/s, in_qsize 19, out_qsize 0
2019-07-30 00:09:11,140 : INFO : EPOCH 1 - PROGRESS: at 72.12% examples, 632853 words/s, in_qsize 19, out_qsize 0
2019-07-30 00:09:12,884 : INFO : EPOCH 1 - PROGRESS: at 72.46% examples, 605099 words/s,

2019-07-30 00:10:01,296 : INFO : EPOCH 3 - PROGRESS: at 2.84% examples, 880617 words/s, in_qsize 19, out_qsize 0
2019-07-30 00:10:02,311 : INFO : EPOCH 3 - PROGRESS: at 6.05% examples, 923536 words/s, in_qsize 18, out_qsize 1
2019-07-30 00:10:03,321 : INFO : EPOCH 3 - PROGRESS: at 8.82% examples, 902782 words/s, in_qsize 18, out_qsize 1
2019-07-30 00:10:04,332 : INFO : EPOCH 3 - PROGRESS: at 11.45% examples, 918782 words/s, in_qsize 19, out_qsize 0
2019-07-30 00:10:05,340 : INFO : EPOCH 3 - PROGRESS: at 14.20% examples, 929096 words/s, in_qsize 19, out_qsize 0
2019-07-30 00:10:06,351 : INFO : EPOCH 3 - PROGRESS: at 17.04% examples, 938424 words/s, in_qsize 19, out_qsize 0
2019-07-30 00:10:07,354 : INFO : EPOCH 3 - PROGRESS: at 19.61% examples, 942431 words/s, in_qsize 18, out_qsize 1
2019-07-30 00:10:08,371 : INFO : EPOCH 3 - PROGRESS: at 22.40% examples, 942629 words/s, in_qsize 19, out_qsize 2
2019-07-30 00:10:09,373 : INFO : EPOCH 3 - PROGRESS: at 24.86% examples, 940841 words/s, in

(91048370, 124558074)

### Find similar words for a given word
Once the model is built, you can find interesting patterns in it. For example, can you find the 5 words most similar to the word `polite`?

In [9]:
# TODO: look up top 5 words similar to 'polite' using most_similar function
model.wv.most_similar(positive=["polite"], topn=5)

2019-07-30 00:10:32,108 : INFO : precomputing L2-norms of word weight vectors


[('courteous', 0.9084916114807129),
 ('cordial', 0.8302523493766785),
 ('friendly', 0.8226535320281982),
 ('curteous', 0.8103367686271667),
 ('courtious', 0.7951087355613708)]

### Compare the word embedding by comparing their similarities
We can also find the similarity between two words in the embedding space. Can you find the similarities between the word `great` and `good`/`horrible`, and also between `dirty` and `clean`/`smelly`? Feel free to play around with the word embeddings you just learned and see if they make sense.

In [12]:
# TODO: find similarities between two words using similarity function
print(model.wv.similarity(w1="great", w2="good"))
print(model.wv.similarity(w1="great", w2="horrible"))
print(model.wv.similarity(w1="dirty", w2="smelly"))
print(model.wv.similarity(w1="dirty", w2="clean"))

0.81034255
0.30029735
0.76013786
0.267225
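The scores above are cosine similarities between the two word vectors. As a minimal sketch of that computation, the function below works on plain Python lists; the function name `cosine_similarity` and the toy 2-dimensional vectors are assumptions for illustration and are not taken from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (hypothetical): same direction, orthogonal, opposite
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))
```

Because the measure depends only on direction and not on vector length, values close to 1 indicate words used in similar contexts, while values near 0 indicate unrelated usage.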
