# Natural Language Processing

In this homework, you will apply the TF-IDF technique to text classification and use a word2vec model to generate dense word embeddings for other NLP tasks.

## Text Classification
The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of my knowledge, it was originally collected by Ken Lang, probably for his paper "Newsweeder: Learning to filter netnews", though he does not explicitly mention this collection. The 20 Newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.

In this lab, we will experiment with different feature extraction methods on the 20 Newsgroups dataset, including count vectors and TF-IDF vectors. We will also apply a Naive Bayes classifier to this dataset and report the prediction accuracy.
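To make the idea of a count vector concrete, here is a minimal sketch in plain Python. The toy corpus, the whitespace tokenizer, and the helper name `count_vector` are assumptions made for illustration; sklearn's `CountVectorizer` uses a more careful tokenizer and returns a sparse matrix.

```python
from collections import Counter

# Hypothetical toy corpus, not taken from the 20 Newsgroups data.
docs = [
    "the cubs won the game",
    "the braves lost the game",
    "new graphics card for sale",
]

# Tokenize by lowercasing and splitting on whitespace (a simplification).
tokenized = [doc.lower().split() for doc in docs]

# The vocabulary is the sorted set of all tokens; each word gets one column.
vocabulary = sorted({token for tokens in tokenized for token in tokens})


def count_vector(tokens, vocabulary):
    """Return the raw term counts of one document, one entry per vocabulary word."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


count_matrix = [count_vector(tokens, vocabulary) for tokens in tokenized]

print(vocabulary)
for doc, row in zip(docs, count_matrix):
    print(row, "<-", doc)
```

Each row has one entry per vocabulary word, so most entries are zero once the vocabulary grows, which is why real implementations store these vectors sparsely.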

### Load and explore the 20 Newsgroups data

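The 20 Newsgroups data can be loaded through the sklearn library: `fetch_20newsgroups` downloads the data on first use and caches it locally. We can load the training and test splits using the following command.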

In [1]:
# load the training data and test data
import numpy as np
from sklearn.datasets import fetch_20newsgroups

twenty_train = fetch_20newsgroups(subset='train', shuffle=False)
twenty_test = fetch_20newsgroups(subset='test', shuffle=False)

# print the number of training documents and the number of categories
print("Number of training data:" + str(len(twenty_train.data)))
print("Number of categories:" + str(len(twenty_train.target_names)))

# print the first text and its category
print(twenty_train.data[0])
print(twenty_train.target[0])

# You can check the target variable by printing all the categories
twenty_train.target_names

Number of training data:11314
Number of categories:20
From: cubbie@garnet.berkeley.edu (                               )
Subject: Re: Cubs behind Marlins? How?
Article-I.D.: agate.1pt592$f9a
Organization: University of California, Berkeley
Lines: 12
NNTP-Posting-Host: garnet.berkeley.edu


gajarsky@pilot.njin.net writes:

morgan and guzman will have era's 1 run higher than last year, and
 the cubs will be idiots and not pitch harkey as much as hibbard.
 castillo won't be good (i think he's a stud pitcher)

       This season so far, Morgan and Guzman helped to lead the Cubs
       at top in ERA, even better than THE rotation at Atlanta.
       Cubs ERA at 0.056 while Braves at 0.059. We know it is early
       in the season, we Cubs fans have learned how to enjoy the
       short triumph while it is still there.

9


['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

### Build a Naive Bayes Model 

Your task is to build an ML model that classifies the newsgroup data into different categories. You will try both raw counts and TF-IDF for feature extraction, followed by a Naive Bayes classifier. Note that you can chain the feature generation and model training steps into a single estimator by using the [pipeline API](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) in sklearn.
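As a rough illustration of what a multinomial Naive Bayes classifier does with count features, the sketch below implements it from scratch on a tiny corpus. The training data, the function names `train_multinomial_nb` and `predict`, and the smoothing value `alpha=1.0` are assumptions for this example; it is not sklearn's `MultinomialNB` implementation.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods from token lists."""
    vocabulary = {token for doc in docs for token in doc}
    class_doc_counts = Counter(labels)

    # Count how often each token occurs in the documents of each class.
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)

    log_prior = {c: math.log(n / len(docs)) for c, n in class_doc_counts.items()}
    log_likelihood = {}
    for c in class_doc_counts:
        total = sum(token_counts[c].values()) + alpha * len(vocabulary)
        log_likelihood[c] = {
            w: math.log((token_counts[c][w] + alpha) / total) for w in vocabulary
        }
    return log_prior, log_likelihood


def predict(doc, log_prior, log_likelihood):
    """Return the class with the highest log score; unseen tokens are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in doc if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)


# Hypothetical training documents, already tokenized.
train_docs = [
    "cubs pitcher era season".split(),
    "braves rotation era season".split(),
    "graphics card driver windows".split(),
    "windows driver install card".split(),
]
train_labels = ["baseball", "baseball", "hardware", "hardware"]

log_prior, log_likelihood = train_multinomial_nb(train_docs, train_labels)
print(predict("cubs era".split(), log_prior, log_likelihood))
print(predict("install graphics driver".split(), log_prior, log_likelihood))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing term keeps a word that never appeared in a class from forcing that class's probability to zero.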

Try to use grid search to find the best hyperparameters from the following settings (feel free to explore other options as well):

* Different n-gram ranges
* Whether or not to remove stop words
* Whether or not to apply IDF

After building the best model from the training set, we apply that model to make predictions on the test data and report its accuracy.
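To get a feel for the size of such a search, the settings listed above can be expanded into the individual candidates a grid search would evaluate. This is only a sketch: the step names `vect` and `tfidf` are assumed names for the pipeline steps, and the two values per setting are example choices.

```python
from itertools import product

# One entry per setting; keys follow sklearn's "<step>__<parameter>" convention.
# The step names "vect" and "tfidf" are assumptions for this sketch.
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "vect__stop_words": [None, "english"],
    "tfidf__use_idf": [True, False],
}

# The grid is the Cartesian product of all value lists.
names = list(param_grid)
candidates = [
    dict(zip(names, values)) for values in product(*param_grid.values())
]

print(len(candidates))
for candidate in candidates:
    print(candidate)
```

With two values for each of three settings there are 2 x 2 x 2 = 8 candidates, and with k-fold cross-validation each candidate is fit k times, so the cost of the search grows quickly as options are added.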

In [15]:
# TODO
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV

params = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "vect__stop_words": [None, 'english'],
    "tfidf__use_idf": [True, False]
}
text_clf = Pipeline(
    [
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('naive_bayes', MultinomialNB())
    ]
)
# Applying gridsearch
grid_search = GridSearchCV(estimator=text_clf, param_grid=params)
grid_search.fit(twenty_train.data, twenty_train.target)

# creating predictions and comparing results
test_predictions = grid_search.predict(twenty_test.data)

np.mean(test_predictions == twenty_test.target)

[10 16 10 ...  3  3  7]
[10 16 14 ...  4  6  7]


0.8074880509824748

---------

## Word Embedding with word2vec

Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers. 

In this assessment, we will experiment with the [word2vec](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) model from the [gensim](https://radimrehurek.com/gensim/) package and generate word embeddings from a review dataset. You can then explore those word embeddings and see if they make sense semantically.

In [13]:
import gzip
import logging
import warnings
from gensim.models import Word2Vec

warnings.simplefilter(action='ignore', category=FutureWarning)
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

### Load the review data

In [14]:
import gensim

def read_input(input_file):
    """This method reads the input file which is in gzip format"""
    print("reading file {0}...this may take a while".format(input_file))
    with gzip.open(input_file, 'rb') as f:
        for i, line in enumerate(f):
 
            if (i % 10000 == 0):
                print("read {0} reviews".format(i))
            # do some pre-processing and return a list of words for each review text
            yield gensim.utils.simple_preprocess(line)
            
documents = list(read_input('reviews_data.txt.gz'))
logging.info("Done reading data file")

reading file reviews_data.txt.gz...this may take a while
read 0 reviews
read 10000 reviews
read 20000 reviews
read 30000 reviews
read 40000 reviews
read 50000 reviews
read 60000 reviews
read 70000 reviews
read 80000 reviews
read 90000 reviews
read 100000 reviews
read 110000 reviews
read 120000 reviews
read 130000 reviews
read 140000 reviews
read 150000 reviews
read 160000 reviews
read 170000 reviews
read 180000 reviews
read 190000 reviews
read 200000 reviews
read 210000 reviews
read 220000 reviews
read 230000 reviews
read 240000 reviews
read 250000 reviews


2021-02-27 15:02:53,134 : INFO : Done reading data file


### Train the word2vec model

The word2vec algorithms include the skip-gram and CBOW models, trained with either hierarchical softmax or negative sampling. They were introduced in the papers "Efficient Estimation of Word Representations in Vector Space" and "Distributed Representations of Words and Phrases and their Compositionality". A word2vec tutorial can be found [here](https://rare-technologies.com/word2vec-tutorial/).
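As a small illustration of the skip-gram idea, the sketch below generates the (center word, context word) pairs that such a model learns to predict. The function name `skipgram_pairs`, the example sentence, and the window size are assumptions for illustration; actual implementations add details such as frequent-word subsampling and negative sampling that are omitted here.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs for every token within `window` positions."""
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                yield center, tokens[j]


# Hypothetical review sentence, already tokenized.
sentence = "the staff was polite and friendly".split()
pairs = list(skipgram_pairs(sentence, window=1))

for pair in pairs:
    print(pair)
```

Skip-gram predicts each context word from the center word, while CBOW goes the other way and predicts the center word from its surrounding context words.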

In [20]:
# TODO build vocabulary and train model
model = gensim.models.Word2Vec(min_count=1)
model.build_vocab(documents)
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)

2021-02-27 15:19:59,670 : INFO : collecting all words and their counts
2021-02-27 15:19:59,671 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-02-27 15:19:59,876 : INFO : PROGRESS: at sentence #10000, processed 1655714 words, keeping 25777 word types
2021-02-27 15:20:00,107 : INFO : PROGRESS: at sentence #20000, processed 3317863 words, keeping 35016 word types
2021-02-27 15:20:00,361 : INFO : PROGRESS: at sentence #30000, processed 5264072 words, keeping 47518 word types
2021-02-27 15:20:00,576 : INFO : PROGRESS: at sentence #40000, processed 7081746 words, keeping 56675 word types
2021-02-27 15:20:00,816 : INFO : PROGRESS: at sentence #50000, processed 9089491 words, keeping 63744 word types
2021-02-27 15:20:01,069 : INFO : PROGRESS: at sentence #60000, processed 11013726 words, keeping 76786 word types
2021-02-27 15:20:01,277 : INFO : PROGRESS: at sentence #70000, processed 12637528 words, keeping 83199 word types
2021-02-27 15:20:01,483 : INFO : PROG

(152196375, 207596790)

### Find similar words for a given word
Once the model is built, you can find interesting patterns in it. For example, can you find the 5 most similar words to the word `polite`?

In [21]:
# TODO: look up top 5 words similar to 'polite' using most_similar function
# Feel free to try other words and see if it makes sense.
# the word vectors live on model.wv; calling most_similar on the model itself is deprecated
ms = model.wv.most_similar(positive='polite', topn=5)
print(ms)


2021-02-27 15:26:50,492 : INFO : precomputing L2-norms of word weight vectors


[('courteous', 0.9390735626220703), ('curteous', 0.9053069949150085), ('cordial', 0.8964544534683228), ('curtious', 0.8819586634635925), ('friendly', 0.8801696300506592)]


### Compare word embeddings by their similarities
We can also find the similarity between two words in the embedding space. Can you find the similarities between the word `great` and `good`/`horrible`, and also between `dirty` and `clean`/`smelly`? Feel free to play around with the word embeddings you just learned and see if they make sense.

In [22]:
# TODO: find similarities between two words using similarity function
print(model.wv.similarity('great', 'good'))
print(model.wv.similarity('great', 'horrible'))
print(model.wv.similarity('dirty', 'clean'))
print(model.wv.similarity('dirty', 'smelly'))

0.82164234
0.3833545
0.37658313
0.8059554
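The similarity scores above are cosine similarities between word vectors. A minimal sketch of that computation follows; the helper name `cosine_similarity` and the 3-dimensional vectors are made up for illustration and are not taken from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical low-dimensional embeddings; real word2vec vectors are much longer.
vectors = {
    "great": [0.9, 0.8, 0.1],
    "good": [0.8, 0.9, 0.2],
    "horrible": [-0.7, 0.1, 0.9],
}

print(cosine_similarity(vectors["great"], vectors["good"]))
print(cosine_similarity(vectors["great"], vectors["horrible"]))
```

Cosine similarity depends only on the direction of the vectors, not their length, and ranges from -1 (opposite directions) to 1 (same direction).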


  


# Conceptual Overview
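Term Frequency-Inverse Document Frequency (TF-IDF) weights each word's frequency in a document (TF) by its inverse document frequency (IDF), computed as log(number of documents / number of documents containing the word), so words that appear in almost every document count for less.
The vectorizer maps text to numerical vectors, which is the input format ML models require. The word2vec algorithm trains a shallow neural network to predict words from their context; words that occur in similar contexts end up with similar vectors, which is why near-synonyms show up as nearest neighbors. Like the vectorizer, it converts words to numerical data, but as dense rather than sparse vectors.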
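The IDF formula above can be worked through on a tiny example. The sketch below uses the unsmoothed form log(docs / docs with word) on a made-up corpus; the corpus and the helper name `tfidf` are assumptions. Note that sklearn's default differs: it smooths the IDF and L2-normalizes each document vector, so its numbers will not match these.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already lowercased.
docs = [
    "the cubs won the game",
    "the braves lost the game",
    "the graphics card is for sale",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word occur?
doc_freq = Counter(token for tokens in tokenized for token in set(tokens))


def tfidf(tokens):
    """Return {word: tf * log(n_docs / df)} for one tokenized document."""
    term_freq = Counter(tokens)
    return {
        word: count * math.log(n_docs / doc_freq[word])
        for word, count in term_freq.items()
    }


weights = [tfidf(tokens) for tokens in tokenized]
for word, weight in sorted(weights[0].items()):
    print(f"{word:>6}: {weight:.3f}")
```

In the first document, "the" occurs twice but appears in every document, so its weight is zero, while "cubs" occurs once in a single document and gets the largest weight.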