# Week 10 Problem 2

A few things you should keep in mind when working on assignments:

1. Make sure you fill in any place that says YOUR CODE HERE. Do not write your answer anywhere other than where it says YOUR CODE HERE. Anything you write elsewhere will be removed or overwritten by the autograder.

2. Before you submit your assignment, make sure everything runs as expected. In the menubar, select Kernel, then restart the kernel and run all cells (Restart & Run All).

3. Do not change the title (i.e. file name) of this notebook.

4. Make sure that you save your work (in the menubar, select File → Save and Checkpoint).

5. When you are ready to submit your assignment, go to Dashboard → Assignments and click the Submit button. Your work is not submitted until you click Submit.

6. You are allowed to submit an assignment multiple times, but only the most recent submission will be graded.

7. If your code does not pass the unit tests, it will not pass the autograder.

**Note:** **Start early since Validation will take some time**

## Author: Apurv Garg
### Primary Reviewer: Radhir Kothuri


# Due Date: 6 PM, April 02, 2018

In [1]:
# Set up Notebook

%matplotlib inline

# Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_20newsgroups
from sklearn import metrics
from sklearn.preprocessing import normalize
from sklearn.decomposition import NMF
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import LatentDirichletAllocation
from gensim import models
from gensim import matutils as mat
from gensim import models as md
from gensim import corpora
from gensim.corpora.dictionary import Dictionary
from nose.tools import assert_equal, assert_true, assert_false, assert_almost_equal

# We do this to ignore several specific Pandas warnings
import warnings
warnings.filterwarnings("ignore")


Using TensorFlow backend.


In [2]:
def get_topics(cv, model):
    # Number of terms per topic to display
    max_topics = 10

    # Number of terms per topic to retain
    max_labels = 5

    topics = []
    feature_names = cv.get_feature_names()

    # Iterate through the matrix components
    for idx, topic in enumerate(model.components_):

        # First we sort the terms in descending order ([::-1])
        # and then retain only the top terms
        top_topics_idx = topic.argsort()[::-1][:max_topics]

        top_topics = [feature_names[jdx] for jdx in top_topics_idx]

        # Now extract out the terms themselves and display
        top_features = " ".join(top_topics)
        # print('Topic {0:2d}: {1}'.format(idx, top_features))
        topics.append(", ".join(top_topics[:max_labels]))
        
    return(topics)

# Dataset

We will analyze the twenty newsgroups data set, with the headers, quotes, and footers removed. For the first two problems we analyze four-grams only.

The cell below creates a subdirectory under home called `temp_data`. *If you want to delete the `temp_data` directory at any point, run this code in a new cell.*  
``` bash
! rm -rf /home/data_scientist/temp_data
```

In [3]:
! mkdir ~/temp_data
HOME = '/home/data_scientist/temp_data'

mkdir: cannot create directory ‘/home/data_scientist/temp_data’: File exists
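
The `mkdir` message above only means that the directory already exists, which is harmless here. As a minimal sketch (not part of the assignment), the same directory could be created from Python with `os.makedirs` and `exist_ok=True`, which does not complain on a second run. The helper name `ensure_dir` is made up for illustration, and the example uses a throwaway temporary directory rather than the real `temp_data` path.

```python
import os
import tempfile

def ensure_dir(path):
    """Create `path` (and any missing parents) if it does not exist yet."""
    os.makedirs(path, exist_ok=True)
    return path

# Demonstrate on a throwaway location instead of the real home directory
base = tempfile.mkdtemp()
demo_dir = os.path.join(base, 'temp_data')

ensure_dir(demo_dir)   # creates the directory
ensure_dir(demo_dir)   # second call is a no-op, no error is raised
print(os.path.isdir(demo_dir))
```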


In [4]:
train = fetch_20newsgroups(HOME, subset='train', remove=('quotes', 'headers', 'footers'), random_state=40)
# Note: `test` is also loaded from the 'train' subset, so the assertions below expect 11314 predictions
test = fetch_20newsgroups(HOME, subset='train', remove=('quotes', 'headers', 'footers'), random_state=40)


In [5]:
# Use TF-IDF on newsgroup data.
from sklearn.feature_extraction.text import TfidfVectorizer
# We are taking four-grams only
cv = TfidfVectorizer(stop_words = 'english',ngram_range = (4,4),
                     lowercase=True,
                     min_df=2,
                     max_features=1500)
                     
train_data = cv.fit_transform(train['data'])
test_data = cv.transform(test['data'])
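
To make "four-grams" concrete: with `ngram_range = (4,4)`, each feature is a run of four consecutive tokens rather than a single word. The sketch below illustrates the idea in plain Python. The helper `word_ngrams` and the tiny stop-word set are assumptions for illustration only; `TfidfVectorizer` uses its own tokenizer and its own English stop-word list, so its exact features will differ.

```python
def word_ngrams(text, n=4, stop_words=frozenset()):
    """Return all runs of n consecutive tokens, after dropping stop words."""
    tokens = [tok for tok in text.lower().split() if tok not in stop_words]
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy stop-word set (hypothetical; not scikit-learn's list)
toy_stop = frozenset({'the', 'a', 'of', 'is'})

sentence = "The quick brown fox jumps over the lazy dog"
grams = word_ngrams(sentence, n=4, stop_words=toy_stop)
for gram in grams:
    print(gram)
```

A document needs at least four remaining tokens to yield any four-gram, and `min_df=2` keeps only four-grams that occur in two or more documents, so many documents may end up with few or no non-zero features.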

# Problem 1

For this problem, complete the function `nmf_rf`, which takes `num_topics`, `cv`, `train_dataset`, `test_dataset` and `target` as parameters and returns `nmf_topics` (data clusters after applying `get_topics`), `td_norm` (train dataset after transformation and l1 normalization), `rfc` (RF model after fitting it to the training data) and `ts_preds` (predictions for the test dataset using this model).

- Apply non-negative matrix factorization (NMF) to compute topics in a corpus. The parameters to be used inside the NMF are `n_components`=num_topics and `max_iter`=5. Other parameters should be kept as default. Fit this on `train_dataset`.
- Identify data clusters (or topics) in the corpus using `get_topics`.
- Transform the dataset and normalize it using the l1-norm.
- Create a RandomForestClassifier with parameters: `max_features='auto', min_samples_split=4, random_state=23`. Other parameters are left at their defaults.
- Fit the estimator to the scaled training data (`target` is the response variable) and predict for the test dataset using this model. (Don't forget to apply the NMF transformation to the test dataset before prediction.)
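
The transform-and-normalize steps above can be sketched on a toy matrix. This is only an illustration under stated assumptions: the matrix `X` is made up, and `max_iter=200` and `random_state=0` are chosen for a repeatable demo rather than taken from the assignment.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.preprocessing import normalize

# Toy non-negative "document-term" matrix: 4 documents, 5 terms (made up)
X = np.array([[1.0, 2.0, 0.0, 0.0, 1.0],
              [2.0, 1.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 3.0, 1.0, 2.0],
              [0.0, 1.0, 2.0, 2.0, 1.0]])

# Factorize into 2 topics; random_state is set only to make the demo repeatable
nmf = NMF(n_components=2, max_iter=200, random_state=0).fit(X)
W = nmf.transform(X)            # document-topic weights, shape (4, 2)

# L1 normalization rescales each row so its absolute values sum to 1
W_norm = normalize(W, norm='l1', axis=1)

print(W.shape, nmf.components_.shape)
print(W_norm.sum(axis=1))
```

After l1 normalization each non-zero row can be read as a set of topic proportions for that document; a row that is all zeros stays all zeros.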


In [6]:
def nmf_rf(num_topics, cv, train_dataset, test_dataset, target):
    '''           
    Parameters
    ----------
    num_topics : Number of topics
    cv : TF-IDF vectorizer Object
    train_dataset : transformed train-data after TF-IDF vectorization
    test_dataset : transformed test-data after TF-IDF vectorization
    target: target/dependent variable to be used in RF
    
    Returns
    -------
    A tuple of 4 containing nmf_topics(data clusters after applying get_topics), 
    td_norm(train dataset after transformation and l1 normalization), 
    rfc(RF model after fitting it to train) and 
    ts_preds(predictions for the test dataset using this model)
    '''    
    # YOUR CODE HERE

    nmf = NMF(n_components = num_topics, max_iter = 5).fit(train_dataset)
    nmf_topics = get_topics(cv, nmf)
    
    td = nmf.transform(train_dataset)
    td_norm = normalize(td, norm='l1', axis=1)
    
    rf = RandomForestClassifier(max_features='auto', min_samples_split=4, random_state=23)
    rfc = rf.fit(td_norm, target)
    
    tst_data = nmf.transform(test_dataset)
    ts_preds = rfc.predict(tst_data)
    
    return nmf_topics, td_norm, rfc, ts_preds

In [7]:
nmf_topics, td_norm, rfc1, ts_preds = nmf_rf(30, cv, train_data, test_data, train['target'])
assert_equal(isinstance(rfc1, RandomForestClassifier), True)
assert_equal(len(nmf_topics), 30)
assert_equal(rfc1.min_samples_split, 4)
assert_equal(isinstance(ts_preds, np.ndarray), True)
assert_equal(len(ts_preds), 11314)
assert_equal(ts_preds[[1]][0], 15)

In [8]:
print(metrics.classification_report(test['target'], ts_preds,
    target_names = test['target_names']))

                          precision    recall  f1-score   support

             alt.atheism       0.93      0.05      0.10       480
           comp.graphics       0.10      0.01      0.02       584
 comp.os.ms-windows.misc       0.59      0.02      0.03       591
comp.sys.ibm.pc.hardware       0.00      0.00      0.00       590
   comp.sys.mac.hardware       0.03      0.00      0.00       578
          comp.windows.x       0.28      0.02      0.03       593
            misc.forsale       0.15      0.01      0.03       585
               rec.autos       0.00      0.00      0.00       594
         rec.motorcycles       0.42      0.03      0.06       598
      rec.sport.baseball       0.38      0.03      0.05       597
        rec.sport.hockey       0.00      0.00      0.00       600
               sci.crypt       0.77      0.07      0.13       595
         sci.electronics       0.81      0.03      0.06       591
                 sci.med       0.63      0.14      0.23       594
         

# Problem 2

For this problem, complete the function `lda_rf`, which takes `num_topics`, `cv`, `train_dataset`, `test_dataset` and `target` as parameters and returns `lda_topics` (data clusters after applying `get_topics`), `lda_train_norm` (train dataset after transformation and l1 normalization), `rfc` (RF model after fitting it to the training data) and `ts_preds` (predictions for the test dataset using this model).

- Apply Latent Dirichlet Allocation (LDA) to compute topics in a corpus. The parameters to be used inside the LDA are `n_topics`=num_topics, `max_iter`=5, `learning_method`='online', `learning_offset`=5 and `random_state`=23. Other parameters should be kept as default. Fit this on `train_dataset`.
- Identify data clusters (or topics) in the corpus using `get_topics`.
- Transform the dataset and normalize it using the l1-norm.
- Create a RandomForestClassifier with parameters: `max_features='auto', min_samples_split=4, random_state=23`. Other parameters are left at their defaults.
- Fit the estimator to the scaled training data (`target` is the response variable) and predict for the test dataset using this model. (Don't forget to apply the LDA transformation to the test dataset before prediction.)
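
The LDA steps above can be sketched on a toy count matrix. This is a hedged illustration, not the graded solution: the matrix `X` is made up, and the parameter name `n_components` is an assumption about the installed scikit-learn version (newer releases use it in place of the `n_topics` name that this notebook's environment accepts).

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.preprocessing import normalize

# Toy count matrix: 4 documents, 5 terms (made up)
X = np.array([[2, 1, 0, 0, 1],
              [1, 2, 0, 1, 0],
              [0, 0, 3, 1, 2],
              [0, 1, 2, 2, 1]])

# n_components is assumed here; older scikit-learn releases called it n_topics
lda = LatentDirichletAllocation(n_components=2, max_iter=5,
                                learning_method='online',
                                learning_offset=5, random_state=23).fit(X)

doc_topics = lda.transform(X)                      # shape (4, 2)
doc_topics_norm = normalize(doc_topics, norm='l1', axis=1)

print(doc_topics_norm.shape)
print(doc_topics_norm.sum(axis=1))
```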


In [11]:
def lda_rf(num_topics, cv, train_dataset, test_dataset, target):
    '''           
    Parameters
    ----------
    num_topics : Number of topics
    cv : TF-IDF vectorizer Object
    train_dataset : transformed train-data after TF-IDF vectorization
    test_dataset : transformed test-data after TF-IDF vectorization
    target: target/dependent variable to be used in RF
    
    Returns
    -------
    A tuple of 4 containing lda_topics(data clusters after applying get_topics), 
    lda_train_norm(train dataset after transformation and l1 normalization), 
    rfc(RF model after fitting it to train) and 
    ts_preds(predictions for the test dataset using this model)
    '''    
    # YOUR CODE HERE
    
    lda = LatentDirichletAllocation(n_topics=num_topics, max_iter=5, learning_method='online',
                                    learning_offset=5, random_state=23).fit(train_dataset)
    
    lda_topics = get_topics(cv, lda)
    
    td = lda.transform(train_dataset)
    lda_train_norm = normalize(td, norm='l1', axis=1)
    
    rf = RandomForestClassifier(max_features='auto', min_samples_split=4, random_state=23)
    rfc = rf.fit(lda_train_norm, target)
    
    tst_data = lda.transform(test_dataset)
    ts_preds = rfc.predict(tst_data)
    
    return lda_topics, lda_train_norm, rfc, ts_preds

In [12]:
lda_topics, lda_train_norm, rfc2, ts_preds2 = lda_rf(30, cv, train_data, test_data, train['target'])

assert_equal(isinstance(rfc2, RandomForestClassifier), True)
assert_equal(rfc2.min_samples_split, 4)
assert_equal(isinstance(ts_preds2, np.ndarray), True)
assert_equal(len(ts_preds2), 11314)
assert_equal(ts_preds2[[1]][0], 15)
assert_equal(ts_preds2[[0]][0], 6)
assert_almost_equal(lda_train_norm[1][1], 0.033333333333333333, 3)
assert_almost_equal(lda_train_norm[0][1], 0.016666666666667971, 3)

In [13]:
print(metrics.classification_report(test['target'], ts_preds2,
    target_names = test['target_names']))

                          precision    recall  f1-score   support

             alt.atheism       0.88      0.09      0.16       480
           comp.graphics       0.54      0.04      0.07       584
 comp.os.ms-windows.misc       0.67      0.08      0.14       591
comp.sys.ibm.pc.hardware       0.82      0.02      0.05       590
   comp.sys.mac.hardware       0.79      0.02      0.04       578
          comp.windows.x       0.73      0.10      0.18       593
            misc.forsale       0.75      0.03      0.06       585
               rec.autos       0.75      0.02      0.04       594
         rec.motorcycles       0.95      0.06      0.11       598
      rec.sport.baseball       0.97      0.05      0.10       597
        rec.sport.hockey       0.91      0.10      0.17       600
               sci.crypt       0.90      0.13      0.23       595
         sci.electronics       0.85      0.03      0.06       591
                 sci.med       0.94      0.16      0.27       594
         

## Dataset Creation (Problem 3)

In [14]:
doc_a = "Brocolli is good to eat. My brother likes to eat good brocolli, but not my mother."
doc_b = "My mother spends a lot of time driving my brother around to baseball practice."
doc_c = "Some health experts suggest that driving may cause increased tension and blood pressure."
doc_d = "I often feel pressure to perform well at school, but my mother never seems to drive my brother to do better."
doc_e = "Health professionals say that brocolli is good for your health." 

# compile sample documents into a list
doc_set = [doc_a, doc_b, doc_c, doc_d, doc_e]

en_stop = set('my but for a of the and to in on an is that'.split())

txts = [[word for word in sentence.lower().split() if word not in en_stop]
        for sentence in doc_set]

# Count how often each word appears across all documents
from collections import Counter
frequency = Counter([word for txt in txts for word in txt])

# Keep only tokens that appear more than once
tokens = [[token for token in txt if frequency[token] > 1]
          for txt in txts]
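
One consequence of splitting on whitespace is that punctuation stays attached to a word, so `pressure` and `pressure.` are counted as different tokens. The sketch below illustrates this on two of the sentences above; it is only an illustration and does not change the dataset used in Problem 3.

```python
from collections import Counter

doc_c = "Some health experts suggest that driving may cause increased tension and blood pressure."
doc_d = "I often feel pressure to perform well at school, but my mother never seems to drive my brother to do better."

# Same whitespace tokenization as in the cell above (stop words omitted for brevity)
words = [word for doc in (doc_c, doc_d) for word in doc.lower().split()]
counts = Counter(words)

# 'pressure' and 'pressure.' are distinct keys, each seen once
print(counts['pressure'], counts['pressure.'])
```

With this tokenization neither form of `pressure` passes the "more than once" filter.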



# Problem 3

For this problem, complete the function `gensim_tm`, which takes `token`, `model` and `topics` as parameters and returns the computed LDA or LSI model for the corpus.
- The parameters to be used inside the LdaModel/LsiModel are `corpus`, `id2word` and `num_topics`. Other parameters should be kept as default.
- Compute a dictionary mapping for the given text corpus (the `id2word` parameter for the model).
- Create the corpus as a bag of words (you can use `doc2bow`) and convert it to a TF-IDF representation (the `corpus` parameter for the model).
- Construct a Latent Dirichlet Allocation / Latent Semantic Analysis model of the documents using the dictionary mapping object.

If the `model` parameter equals 'lda', return an LDA model; if it equals 'lsi', return an LSA model.
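
As a rough illustration of the dictionary and bag-of-words steps, the plain-Python sketch below builds a token-to-id mapping and turns each document into (id, count) pairs, which is the shape of data that `doc2bow` produces. The helpers `build_id_map` and `to_bow` are hypothetical stand-ins written for this example, not gensim functions, and gensim may assign ids in a different order.

```python
from collections import Counter

def build_id_map(docs):
    """Assign an integer id to every distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for tok in doc:
            token2id.setdefault(tok, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs; tokens missing from the map are skipped."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

toy_docs = [['good', 'brother', 'good'], ['mother', 'brother']]
token2id = build_id_map(toy_docs)
bows = [to_bow(doc, token2id) for doc in toy_docs]
print(token2id)
print(bows)
```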

In [17]:
def gensim_tm(token, model, topics):
    '''           
    Parameters
    ----------
    token : tokens for which dictionary mapping has to be created
    model : the model which is to be created, i.e. 'lda' or 'lsi'
    topics : Number of requested factors(num_topics parameter for Lda/Lsi model) 
    
    Returns
    -------
    The computed LDA/LSI model for corpus    
    '''    
    # YOUR CODE HERE
    # Compute a dictionary mapping for the given tokens
    # (use the `token` argument rather than the global variables)
    dict_gensim = corpora.Dictionary(token)
    
    # Convert each tokenized document to a bag of words
    crps = [dict_gensim.doc2bow(txt) for txt in token]
    tfidf = models.TfidfModel(crps)
        
    # Apply the TF-IDF weighting to the bag-of-words corpus
    crps_tfidf = tfidf[crps]
    
    # Build the requested model on the TF-IDF corpus
    if model == 'lda':
        mdl = models.LdaModel(corpus=crps_tfidf, id2word=dict_gensim, num_topics=topics)
    else:
        mdl = md.lsimodel.LsiModel(corpus=crps_tfidf, id2word=dict_gensim, num_topics=topics)
      
    return mdl


In [18]:
lsi_gs=gensim_tm(tokens, 'lsi', 5)
lda_gs=gensim_tm(tokens, 'lda', 5)
assert_equal(lsi_gs.num_topics, 5)
assert_equal(lda_gs.num_topics, 5)
assert_equal(lsi_gs.num_terms, 6)
assert_equal(lda_gs.num_terms, 6)
lsi_gs2=gensim_tm(tokens, 'lsi', 8)
lda_gs2=gensim_tm(tokens, 'lda', 8)
assert_equal(lsi_gs2.num_topics, 8)
assert_equal(lda_gs2.num_topics, 8)
assert_equal(lsi_gs2.num_terms, 6)
assert_equal(lda_gs2.num_terms, 6)

In [19]:
lsi_gs.print_topics(2)

[(0,
  '0.494*"mother" + 0.441*"good" + 0.440*"driving" + 0.388*"health" + 0.346*"brother" + 0.313*"brocolli"'),
 (1,
  '0.575*"mother" + -0.571*"good" + -0.409*"brocolli" + 0.253*"driving" + -0.244*"health" + 0.230*"brother"')]

In [20]:
lda_gs.print_topics(2)

[(0,
  '0.296*"mother" + 0.295*"driving" + 0.195*"brother" + 0.072*"health" + 0.072*"good" + 0.071*"brocolli"'),
 (3,
  '0.384*"good" + 0.230*"brocolli" + 0.162*"brother" + 0.075*"mother" + 0.075*"health" + 0.075*"driving"')]