# Natural Language Processing: Latent Semantic Analysis

Background: 

In NLP, one important application is topic modeling, which tries to capture the underlying themes that appear in a collection of documents. There are two common approaches:

- Latent Dirichlet Allocation (LDA) - generates k topics by first assigning each word to a random topic, then iteratively updating the assignments based on the parameters $\alpha$, the mix of topics per document, and $\beta$, the distribution of words per topic.

- Latent Semantic Analysis (LSA) - identifies patterns using TF-IDF scores and reduces the data to k dimensions through SVD. In other words, given a corpus of articles, we build a document-term matrix and apply a (truncated) SVD to it.

### In this project, we want to employ ***LSA***.



In [6]:
import pickle
import os
import time

import numpy as np
import pandas as pd
import scipy.sparse.csr as csr
import scipy.sparse as sparse
from sklearn.base import clone
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.neighbors import KNeighborsClassifier

import matplotlib.pyplot as plt
%matplotlib inline

# I. Representing a corpus of documents as a matrix
## Bag of Words matrix

Following the <a href="http://scikit-learn.org/stable/modules/feature_extraction.html">Sklearn Feature-extraction documentation page</a>

- we start with a given **corpus** of $D$ documents.
- we preprocess each document and convert it into a list of terms (features)
    - by lowercasing first
    - accepting only word patterns (defined via regex)
- then we form the $CV$ Count-Vectorizer term-frequency matrix defined as:

$$
\text{tf}(t, d)\equiv CV_{d,t} = \text{number of times term }t\text{ occurs in document }d
$$
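The preprocessing steps listed above (lowercase, keep only regex matches, count) can be sketched by hand for a single document. This is only an illustration of the idea, not how scikit-learn implements it internally; the regex is the default `token_pattern` quoted in the next cell, and the variable names are chosen for this sketch.

```python
import re
from collections import Counter

# First document of the toy corpus used in this notebook
doc = 'This is the the first first document abra.'

# Default CountVectorizer token pattern: words of two or more characters
token_pattern = r'(?u)\b\w\w+\b'

# Step 1: lowercase, step 2: keep only the regex matches
tokens = re.findall(token_pattern, doc.lower())

# Step 3: tf(t, d) is the number of times each term occurs in the document
tf = Counter(tokens)

print(tokens)
print(tf['the'], tf['first'], tf['abra'])
```

The counts for 'the' and 'first' are the two 2s that appear in the first row of the document-term matrix built further down.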


In [7]:
from sklearn.feature_extraction.text import CountVectorizer
# preprocessing
# candidate token patterns (regexes that define what counts as a term)
tpatterns = [
    '(?u)\\b\\w\\w+\\b', #default tpatterns[0]
    '(?u)\\b[a-zA-Z]\\w+\\b', #tpatterns[1]
    '\\w',#tpatterns[2]
    '\\w+',#tpatterns[3]
    '(?u)\\b[a-zA-Z]\\w+\\b|\\b[0-9]\\b'#tpatterns[4]
]

# instantiate a CountVectorizer, which only does the TF (term count) transformation
vectorizer = CountVectorizer(token_pattern=tpatterns[0])
vectorizer

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

### Reference for ***regular expression***
https://www.ntu.edu.sg/home/ehchua/programming/howto/Regexe.html

In [8]:
# start with four documents
corpus = [
    'This is the the first first document abra.',
    'This is the second second document cadabra.',
    'And the third one 3.',
    'Is this the first document 4?',
]
# list of string
# map corpus onto a matrix 4x11: 4 documents, 11 unique words/terms
X_corpus_docterm = vectorizer.fit_transform(corpus)
X_corpus_docterm

<4x11 sparse matrix of type '<class 'numpy.int64'>'
	with 21 stored elements in Compressed Sparse Row format>

In [9]:
X_corpus_docterm.toarray()

array([[1, 0, 0, 1, 2, 1, 0, 0, 2, 0, 1],
       [0, 0, 1, 1, 0, 1, 0, 2, 1, 0, 1],
       [0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1]], dtype=int64)

In [10]:
# note the effect of modifying the token_pattern above....
features = vectorizer.get_feature_names() 
features

['abra',
 'and',
 'cadabra',
 'document',
 'first',
 'is',
 'one',
 'second',
 'the',
 'third',
 'this']

In [11]:
pd.DataFrame(X_corpus_docterm.toarray(),columns = np.array(features))

Unnamed: 0,abra,and,cadabra,document,first,is,one,second,the,third,this
0,1,0,0,1,2,1,0,0,2,0,1
1,0,0,1,1,0,1,0,2,1,0,1
2,0,1,0,0,0,0,1,0,1,1,0
3,0,0,0,1,1,1,0,0,1,0,1


In [12]:
corpus = [
    'This is the the first first document abra.',
    'This is the second second document cadabra.',
    'And the third one 3.',
    'Is this the first document 4?',
]

for tp in tpatterns:
    vectorizer.token_pattern = tp
    print(vectorizer.token_pattern)
    X_corpus_docterm = vectorizer.fit_transform(corpus)
    features = vectorizer.get_feature_names()
    print(features)
    print('\n')

(?u)\b\w\w+\b
['abra', 'and', 'cadabra', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']


(?u)\b[a-zA-Z]\w+\b
['abra', 'and', 'cadabra', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']


\w
['3', '4', 'a', 'b', 'c', 'd', 'e', 'f', 'h', 'i', 'm', 'n', 'o', 'r', 's', 't', 'u']


\w+
['3', '4', 'abra', 'and', 'cadabra', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']


(?u)\b[a-zA-Z]\w+\b|\b[0-9]\b
['3', '4', 'abra', 'and', 'cadabra', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']




In [13]:
vectorizer.token_pattern = tpatterns[0] #back to the default
X_corpus_docterm = vectorizer.fit_transform(corpus)
features = vectorizer.get_feature_names()
print(features)

['abra', 'and', 'cadabra', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']


In [14]:
X_corpus_docterm

<4x11 sparse matrix of type '<class 'numpy.int64'>'
	with 21 stored elements in Compressed Sparse Row format>

In [15]:
corpus

['This is the the first first document abra.',
 'This is the second second document cadabra.',
 'And the third one 3.',
 'Is this the first document 4?']

In [16]:
# This is our document-term matrix CV:
CV = X_corpus_docterm.toarray()
CV
# the first document contains 'first' twice: the 2 in the fifth column of the first row

array([[1, 0, 0, 1, 2, 1, 0, 0, 2, 0, 1],
       [0, 0, 1, 1, 0, 1, 0, 2, 1, 0, 1],
       [0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1]], dtype=int64)

## What if we want to vectorize a document in the test set?

E.g. this could be a document containing words that were never encountered in the training corpus.

In [17]:
# just call the .transform() method of the trained vectorizer

# what if we see a term in the test set
# that never appeared in the training set?
def docs2vec(docs, vectorizer):
    return vectorizer.transform(docs)

docs_test = [
    'rain',
    'The a yoghurt',
    'three two one hurray',
    'and abra document is first cadabra',
    'one-third abra is the first cadabra and this document a second',
]

print(vectorizer.get_feature_names())

for doc in docs_test:
    print('\n')
    print(doc)
    print(docs2vec([doc], vectorizer).toarray())

X_docterm_test = docs2vec(docs_test, vectorizer)
print('\n')
print(X_docterm_test.toarray())

['abra', 'and', 'cadabra', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']


rain
[[0 0 0 0 0 0 0 0 0 0 0]]


The a yoghurt
[[0 0 0 0 0 0 0 0 1 0 0]]


three two one hurray
[[0 0 0 0 0 0 1 0 0 0 0]]


and abra document is first cadabra
[[1 1 1 1 1 1 0 0 0 0 0]]


one-third abra is the first cadabra and this document a second
[[1 1 1 1 1 1 1 1 1 1 1]]


[[0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 1 0 0]
 [0 0 0 0 0 0 1 0 0 0 0]
 [1 1 1 1 1 1 0 0 0 0 0]
 [1 1 1 1 1 1 1 1 1 1 1]]


### Problems with Term-Document matrix:

- The document vectors are **not normalized**
    - so longer documents get larger vectors, and raw counts can't be compared fairly across documents
    
- The document vectors contain many common English words that carry **no information**
    - ideally we want to remove those, e.g. 'the', 'is', etc.
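To make the normalization problem concrete, here is a small sketch with two made-up documents (they are not part of the notebook's corpus): the second is just the first repeated three times. Their raw count vectors are far apart, yet after L2 normalization they coincide.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

# Hypothetical documents: same content, different length
short_doc = 'cocoa exports rise'
long_doc = ' '.join([short_doc] * 3)

counts = CountVectorizer().fit_transform([short_doc, long_doc]).toarray().astype(float)

# Distance between the raw count vectors
raw_distance = np.linalg.norm(counts[0] - counts[1])

# Distance after scaling each row to unit L2 norm
unit = normalize(counts, norm='l2')
unit_distance = np.linalg.norm(unit[0] - unit[1])

print(counts)
print(raw_distance, unit_distance)
```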


### Solution: TF-IDF vectorizer (see <a href=http://scikit-learn.org/stable/modules/feature_extraction.html> The TF-idf section in the Scikit-Learn feature extraction manual</a>)

Instead, let's consider the following matrix

$$
\begin{align}
\text{tf-idf}(t,d) &\equiv{\text{tf}}(t,d)\times\text{idf}(t)\\
\text{idf}(t)&\equiv\log\frac{1+n_d}{1+\text{df}(t)} + 1
\end{align}
$$

where 

- $n_d$ is the total number of documents in the corpus
- $\text{df}(t)$ is the number of documents containing feature $t$
- the rows of the tf-idf matrix are normalized to have unit norm (either $L_1$ or $L_2$)
    - this way we can compare two documents by the dot product of their vectors; with $L_2$ normalization this is their cosine similarity
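As a sanity check on the definition, the tf-idf matrix can be rebuilt by hand from raw counts and compared with what `TfidfVectorizer` returns. The sketch below reuses the four-document toy corpus from Section I and assumes the vectorizer's default settings (smoothed idf, L2 norm); `tfidf_manual` and the other variable names are introduced only for this illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

toy_corpus = [
    'This is the the first first document abra.',
    'This is the second second document cadabra.',
    'And the third one 3.',
    'Is this the first document 4?',
]

# Raw term counts tf(t, d)
counts = CountVectorizer().fit_transform(toy_corpus).toarray()

n_docs = counts.shape[0]              # total number of documents
doc_freq = (counts > 0).sum(axis=0)   # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# tf * idf, then scale every row to unit L2 norm
tfidf_manual = counts * idf
tfidf_manual = tfidf_manual / np.linalg.norm(tfidf_manual, axis=1, keepdims=True)

tfidf_sklearn = TfidfVectorizer(norm='l2', use_idf=True).fit_transform(toy_corpus).toarray()

print(np.allclose(tfidf_manual, tfidf_sklearn))
```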
    
Let's see this in practice

In [20]:
vectorizer = TfidfVectorizer(
    stop_words='english',
    norm='l2', # each output vector has l2 norm equal to 1
    use_idf=True)

# create a corpus with various levels of repetition of its terms
N = 1000
corpus = np.reshape(['blah', 'abra', 'cadabra'] * N, (N,3))
corpus[2:,2] = ''
corpus[int(N/2):,1] = ''
corpus = [' '.join(corpus[i]) for i in range(N)]

print('\n'.join(corpus[:10]))
corpus

blah abra cadabra
blah abra cadabra
blah abra 
blah abra 
blah abra 
blah abra 
blah abra 
blah abra 
blah abra 
blah abra 


['blah abra cadabra',
 'blah abra cadabra',
 'blah abra ',
 'blah abra ',
 'blah abra ',
 ...]

In [21]:
X_corpus_tfidf=vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names()) 

# what happens to the weight of 'cadabra' when we vary N?
# what happens to the weight of 'blah' and 'abra'?
# can you explain?
X_corpus_tfidf.toarray()[:10,:]

# whatever N is, 'blah' gets the lowest weight among the terms present in a document:
# it appears in every document, so it is the least informative word to look at,
# whereas 'cadabra' is rare and really means something when it occurs
pd.DataFrame(X_corpus_tfidf.toarray(), columns = vectorizer.get_feature_names())

['abra', 'blah', 'cadabra']


Unnamed: 0,abra,blah,cadabra
0,0.238730,0.141081,0.960783
1,0.238730,0.141081,0.960783
2,0.860906,0.508765,0.000000
3,0.860906,0.508765,0.000000
4,0.860906,0.508765,0.000000
...,...,...,...
995,0.000000,1.000000,0.000000
996,0.000000,1.000000,0.000000
997,0.000000,1.000000,0.000000
998,0.000000,1.000000,0.000000
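
The weights in this table can be reproduced by hand, which also helps with the questions posed in the cell. A worked calculation for $N = 1000$, using the natural logarithm and rounded values ('blah' occurs in all 1000 documents, 'abra' in 500, 'cadabra' in 2):

$$
\begin{align}
\text{idf}(\text{blah}) &= \ln\frac{1001}{1001} + 1 = 1\\
\text{idf}(\text{abra}) &= \ln\frac{1001}{501} + 1 \approx 1.692\\
\text{idf}(\text{cadabra}) &= \ln\frac{1001}{3} + 1 \approx 6.810
\end{align}
$$

Row 0 contains each term once, so its unnormalized vector is roughly $(1.692,\ 1,\ 6.810)$ with $L_2$ norm $\approx 7.088$. Dividing by the norm gives $(0.239,\ 0.141,\ 0.961)$, which matches the first row of the table. As $N$ grows, 'cadabra' still occurs in only 2 documents, so its idf grows like $\ln N$; 'abra' always occurs in half of the documents, so its idf stays near $\ln 2 + 1$; and 'blah' stays at 1.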


# II. Latent Semantic Analysis

## Or Truncated SVD on the TF-IDF matrix

The following code is based on

- <a href="http://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html">Scikit-Learn's Reuters Dataset TF-IDF + K-NN classification example</a>
- along with <a href="http://mccormickml.com/2016/03/25/lsa-for-text-classification-tutorial/">Chris McCormick's LSA tutorial</a>
- and his <a href="https://github.com/chrisjmccormick/LSA_Classification">github page</a>.

The original Reuters-21578 dataset is part of the <a href="http://archive.ics.uci.edu/ml/machine-learning-databases">UCI-ML repository</a> and can be found <a href="http://archive.ics.uci.edu/ml/machine-learning-databases/reuters21578-mld/reuters21578.tar.gz">here</a>. However, in this demo we are using the already <a href="https://github.com/chrisjmccormick/LSA_Classification/tree/master/data">pre-processed version</a> from Chris McCormick's github page.


### Let's look at some real data - the Reuters Articles Corpus

In [26]:
# fname = "LSA_Recommender_DS7\\raw_text_dataset.pickle"
# filepath = os.getcwd() + '\\' + fname
filepath = r"C:\Users\xxxli\Desktop\ResumeProjects\Data_Science_Applications\LSA_Recommender_DS7\raw_text_dataset.pickle"
raw_text_dataset = pickle.load(open(filepath, "rb"))
corpus_train, labels_train = raw_text_dataset[0], raw_text_dataset[1] 
corpus_test, labels_test = raw_text_dataset[2], raw_text_dataset[3]
print(type(corpus_train)) # list of string
print('Number of train docs:', len(corpus_train), 'Number of test docs:', len(corpus_test))
print('\nExample train labels:', labels_train[:4])

<class 'list'>
Number of train docs: 4743 Number of test docs: 4858

Example train labels: [['cocoa', 'el-salvador', 'usa', 'uruguay'], ['usa'], ['usa'], ['usa', 'brazil']]


In [27]:
# n = 4610
# n = 1238
n = np.random.choice(len(corpus_train))

print('\nThis is how a article ', n,' looks like:\n\n', 
      corpus_train[n][:500])

print('\nAnd these are its topic labels or tags:\n\n', 
      labels_train[n][:500])


This is how a article  616  looks like:

 GELCO GEL> SEES FLAT 1987 PRETAX OPERATING NET

Gelco Corp said that, excluding the effects of a restructuring plan, it expects pre-tax operating earnings for the year to end July 31, 1987, to be about the same as those of last year. For the year ended July 31, 1986, Gelco reported pre-tax operating earnings of 14.8 mln dlrs, or 1.08 dlrs a share. However, final results will be affected by certain charges including legal and investment advisors fees, preferred stock dividends and other costs of 

And these are its topic labels or tags:

 ['earn', 'usa']


### TF-IDF vectorizer step:
The TfidfVectorizer below does the following:
- TF Step
    - Strips out "stop words", i.e. frequently occurring English words (stop_words='english')
    - Filters out terms that occur in more than half of the docs
    (max_df=0.5)
    - Filters out terms that occur in only one document (min_df=2).
    - Keeps the 10,000 most frequently occurring terms in the corpus (max_features=10000).
- IDF Step
    - Reweights each term count by its inverse document frequency (use_idf=True).
    - Normalizes each document vector to account for the effect of document
    length on the tf-idf values. The cell below uses norm='l2', so every row
    has unit Euclidean length.
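
The effect of the `min_df` and `max_df` filters is easy to see on a tiny corpus. The five documents below are invented for this sketch and are not taken from the Reuters data; 'oil' appears in four of the five documents and several words appear only once.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus for illustration only
toy_docs = [
    'oil prices rise',
    'oil exports fall',
    'oil cocoa exports',
    'oil wheat harvest',
    'cocoa harvest',
]

# No filtering: every token ends up in the vocabulary
unfiltered = CountVectorizer().fit(toy_docs)

# Drop terms found in fewer than 2 documents or in more than half of them
filtered = CountVectorizer(min_df=2, max_df=0.5).fit(toy_docs)

print(sorted(unfiltered.vocabulary_))
print(sorted(filtered.vocabulary_))
```

Only 'cocoa', 'exports' and 'harvest' survive: 'oil' is too common to discriminate between documents, and the single-occurrence words are too rare to link any two documents.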

In [28]:
vectorizer = TfidfVectorizer(
    max_df=0.5, # ignore terms which occur in more than half of the documents
    max_features=10000,
    min_df=2, # ignore terms which occur in less than 2 documents
    stop_words='english',
    norm='l2',
    use_idf=True, 
    analyzer='word',
#     token_pattern='(?u)\\b\\w\\w+\\b'
    token_pattern = '(?u)\\b[a-zA-Z]\\w+\\b'
    )
# play around to see what kind of token pattern works best for this corpus.
# note how changing the token_pattern changes the output below

# train a vectorizer using our corpus training set
X_train_tfidf = vectorizer.fit_transform(corpus_train)
# print(vectorizer.idf_)

# further below we look for an article in X_train_tfidf that contains the word 'cocoa'
print(X_train_tfidf.shape)
print('first 10 features:', vectorizer.get_feature_names()[:10])
print('last 10 features:', vectorizer.get_feature_names()[-10:])

pd.DataFrame(X_train_tfidf.toarray(), columns = vectorizer.get_feature_names())

(4743, 10000)
first 10 features: ['a300', 'a330', 'a340', 'aa', 'aaa', 'aapl', 'ab', 'abandon', 'abandoned', 'abandonment']
last 10 features: ['zinc', 'zntl', 'zoete', 'zone', 'zones', 'zorinsky', 'zortman', 'zuckerman', 'zurich', 'zy']


Unnamed: 0,a300,a330,a340,aa,aaa,aapl,ab,abandon,abandoned,abandonment,...,zinc,zntl,zoete,zone,zones,zorinsky,zortman,zuckerman,zurich,zy
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.050365,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4738,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
4739,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
4740,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
4741,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0


In [29]:
# let's look at documents that contain a target word
# target_word = 'bullish'
target_word = 'cocoa'

# find which column of the matrix the word belongs to,
# then collect the documents with a nonzero entry in that column
doc_idx = X_train_tfidf[
    :, vectorizer.vocabulary_.get(target_word)].nonzero()[0].tolist()

print(len(doc_idx), ' documents found for target word: ', target_word)
# pick one of the matching documents at random
i = np.random.choice(len(doc_idx))

# print the chosen document
print('document ', doc_idx[i], ':')
print(corpus_train[doc_idx[i]])

10  documents found for target word:  cocoa
document  266 :
INDONESIAN TEA, COCOA EXPORTS SEEN UP, COFFEE DOWN

Indonesia's exports of tea and cocoa will continue to rise in calendar 1987 but coffee exports are forecast to dip slightly in 1987/88 (April-March) as the government tries to improve quality, the U.S. Embassy said. The embassy's annual report on Indonesian agriculture forecast coffee output in 1986/87 would be 5.77 mln bags of 60 kilograms each. That is slightly less than the 5.8 mln bags produced in 1985/86. In 1987/88 coffee production is forecast to rise again to 5.8 mln bags, but exports to dip to 4.8 mln from around 5.0 mln in 1986/87. Exports in 1985/86 were 4.67 mln bags. The embassy report says coffee stocks will rise to 1.3 mln tonnes in 1987/88 from 1.15 mln in 1986/87. It bases this on a fall in exports as a result of the "probable" re-introduction of quotas by the International Coffee Organisation. Cocoa production and exports are forecast to rise steadily as the

In [30]:
X_train_tfidf

<4743x10000 sparse matrix of type '<class 'numpy.float64'>'
	with 217725 stored elements in Compressed Sparse Row format>

### Truncated SVD

In [31]:
print("\nPerforming dimensionality reduction using LSA")
t0 = time.time()

# Project the tfidf vectors onto the first 200 singular vectors (latent factors).
# This is far fewer features than the original 10,000-dimensional tfidf vectors,
# but the factors are denser and less noisy, which often helps accuracy.
svd = TruncatedSVD(
    n_components=200,
    random_state=42,
    algorithm='arpack' 
    # random_state matters most for the default 'randomized' solver;
    # keeping it fixed here is harmless and makes reruns reproducible
)

# make an LSA pipeline that transforms X_train_tfidf into X_train_lsa
lsa = make_pipeline(
    svd, 
#     Normalizer(copy=False) # try enabling this step. Do you get a better result?
)

# Run SVD on the training data, then project the training data.
X_train_lsa = lsa.fit_transform(X_train_tfidf)

print("  done in %.3fsec" % (time.time() - t0))


Performing dimensionality reduction using LSA
  done in 4.357sec


In [32]:
X_train_lsa.shape 
# 4743 articles x 200 latent factors (the n_components we requested);
# each row holds one document's coordinates on those factors

(4743, 200)
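
What `fit_transform` and `transform` compute here can be checked on a small random matrix. Writing the rank-$k$ truncated SVD as $X \approx U_k \Sigma_k V_k^T$, the reduced representation is $U_k \Sigma_k$, which equals the projection $X V_k$. The sketch below assumes a dense random stand-in for the tf-idf matrix; `X_toy` and `svd_toy` are names made up for this example.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
X_toy = rng.random((20, 10))  # stand-in for a small document-term matrix

svd_toy = TruncatedSVD(n_components=3, algorithm='arpack', random_state=0)
X_toy_lsa = svd_toy.fit_transform(X_toy)  # U_k * Sigma_k

# Projecting onto the right singular vectors (rows of components_)
projected = X_toy @ svd_toy.components_.T

print(X_toy_lsa.shape)
print(np.allclose(X_toy_lsa, projected, atol=1e-6))

# Each column of the reduced matrix has norm equal to its singular value
col_norms = np.linalg.norm(X_toy_lsa, axis=0)
print(np.allclose(col_norms, svd_toy.singular_values_, atol=1e-6))
```

This is why new documents only need `transform`: they are projected onto the factors learned from the training set, without refitting.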

In [33]:
# how much of the variance do the 200 factors explain?
explained_variance = svd.explained_variance_ratio_.sum()
print("  Explained variance of the SVD step: {}%".format(int(explained_variance * 100)))


# Now apply the transformations to the test data as well.
# note that we are using the transform method only
X_test_tfidf = vectorizer.transform(corpus_test)
X_test_lsa = lsa.transform(X_test_tfidf)

  Explained variance of the SVD step: 40%


In [34]:
X_test_lsa.shape, X_test_tfidf.shape
# the tf-idf matrix (right) has 50 times more columns than the LSA matrix (left).
# It is also far sparser, so similarity scores computed from it
# tend to be noisier than those computed in the 200-dimensional LSA space.

((4858, 200), (4858, 10000))

### Use this approach and SVD to create a recommender

-----------------------------------------------------------
# My LSA-based Recommender:

In [35]:
import pickle
import os
import time

import numpy as np
import pandas as pd
import scipy.sparse.csr as csr
import scipy.sparse as sparse
from sklearn.base import clone
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.neighbors import KNeighborsClassifier

import matplotlib.pyplot as plt
%matplotlib inline

### 1. LSA-based Recommender.
In this assignment, you can use and modify the LSA notebook that we covered in class.
Start by downloading the Reuters 10K article [corpus raw_text_dataset.pickle] from https://github.com/chrisjmccormick/LSA_Classification.

In [37]:
# Read Data
filepath = r"C:\Users\xxxli\Desktop\ResumeProjects\Data_Science_Applications\LSA_Recommender_DS7\raw_text_dataset.pickle"
with open(filepath, "rb") as f:
    raw_text_dataset = pickle.load(f)
corpus_train, labels_train = raw_text_dataset[0], raw_text_dataset[1] 
corpus_test, labels_test = raw_text_dataset[2], raw_text_dataset[3]

corpus = corpus_train + corpus_test
corpus_train = corpus # use all the 10K articles corpus

### (a) Create a ``doc2vec(doc, tfidf_vectorizer)`` function corresponding to a TFIDF vectorizer where:
INPUTS: ``doc``, ``tfidf_vectorizer``
* ``doc``: any string
* ``tfidf_vectorizer``: a TfidfVectorizer instance

OUTPUTS: ``vec``, ``doc_features``, ``doc_counts``
* ``vec``: a vector with L2 norm of 1
* ``doc_features``: the features after tokenization and pre-processing
* ``doc_counts``: the counts of each feature in this document

Train your ``tfidf_vectorizer`` on the Reuters 10K article corpus.

#### To get the counts of each feature/word, I also instantiate a ``CountVectorizer``:

In [38]:
####################### to count terms
count_vectorizer = CountVectorizer(
    max_df=0.5,
    max_features=10000,
    min_df=2, 
    stop_words='english',
    analyzer='word',
    token_pattern = '(?u)\\b[a-zA-Z]\\w+\\b')

X_train_count = count_vectorizer.fit_transform(corpus_train)
features_counter = count_vectorizer.get_feature_names()
counts_counter = X_train_count.toarray()

######################### TF-IDF
my_tfidf_vectorizer = TfidfVectorizer(
    max_df=0.5, # ignore terms which occur in more than half of the documents
    max_features=10000,
    min_df=2, # ignore terms which occur in less than 2 documents
    stop_words='english',
    norm='l2',
    use_idf=True, 
    analyzer='word',
    token_pattern = '(?u)\\b[a-zA-Z]\\w+\\b')

# train a vectorizer using our corpus training set
X_train_tfidf = my_tfidf_vectorizer.fit_transform(corpus_train)
features_tfidf = my_tfidf_vectorizer.get_feature_names()

# print results
print('There is '+str(len(corpus_train))+' documents in training set.')
print('-'*20)
print('From the term matrix shape: '+str(X_train_tfidf.shape)+ '\n there is '+str(X_train_tfidf.shape[0])+' documents \n with '+str(X_train_tfidf.shape[1])+' terms or features')
print('-'*20)
print('Some examples: \nfirst 10 features:', features_tfidf[:10])
print('last 10 features:', features_tfidf[-10:])
print('-'*20)
print('Explicitly in pandas:')
counts_df = pd.DataFrame(counts_counter, columns = features_counter)
#counts_df = pd.DataFrame(counts_tfidf, columns = features_tfidf)
counts_df.index.name = 'doc idx'
print(counts_df.shape)
counts_df.head(20)

There is 9601 documents in training set.
--------------------
From the term matrix shape: (9601, 10000)
 there is 9601 documents 
 with 10000 terms or features
--------------------
Some examples: 
first 10 features: ['a300', 'a320', 'a330', 'a340', 'aa', 'aaa', 'aapl', 'ab', 'abandon', 'abandoned']
last 10 features: ['zim', 'zimbabwe', 'zinc', 'ziyang', 'zoete', 'zone', 'zones', 'zorinsky', 'zuckerman', 'zurich']
--------------------
Explicitly in pandas:
(9601, 10000)


doc idx,a300,a320,a330,a340,aa,aaa,aapl,ab,abandon,abandoned,...,zim,zimbabwe,zinc,ziyang,zoete,zone,zones,zorinsky,zuckerman,zurich
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [39]:
'''
X_words = my_tfidf_vectorizer.inverse_transform(X_train_tfidf) ## this will give you words instead of tfidf where tfidf > 0

tokenizer = my_tfidf_vectorizer.build_tokenizer() ## return tokenizer function used in tfidfvectorizer

for idx,words in enumerate(X_words):
    for word in words:
        count = tokenizer(corpus_train[idx]).count(word)
        print(idx,word,count)
'''
pass

In [40]:
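def docs2vec(doc, tfidf_vectorizer):
    '''
    doc: a list of strings (wrap a single string as [doc])
    tfidf_vectorizer: a fitted TfidfVectorizer

    Returns the sparse tf-idf matrix (each row has unit L2 norm, or is all
    zeros when no term is in the vocabulary), the feature names, and the
    raw term counts. The counts come from the module-level
    count_vectorizer fitted above with the same settings.
    '''
    vec = tfidf_vectorizer.transform(doc)

    # the tf-idf matrix does not keep raw counts, so use the CountVectorizer
    doc_features = features_counter
    doc_counts = count_vectorizer.transform(doc).toarray()

    return vec, doc_features, doc_counts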

### (b) For each of the following doc strings, calculate their corresponding vectors
* doc1: “Jabberwocky”
* doc2: “buy MSFT sell AAPL hold Brent”
* doc3: “bullish stocks”
* doc4: “Some random forests produce deterministic losses”

In [42]:
test_sample = ['Jabberwocky',
               'buy MSFT sell AAPL hold Brent',
               'bullish stocks',
               'Some random forests produce deterministic losses']

# print results for each test doc:
for doc in test_sample:
    print('-'*20+doc+'-'*20)
    v,f,c = docs2vec([doc], my_tfidf_vectorizer)
    print('L2 norm vector is')
    print(v)
    
    df2 = pd.DataFrame(c,columns=f,index=['Counts'])
    df3 = df2.sort_values(by='Counts',axis=1,ascending=False)
    print(df3[df3.columns[:7]])

--------------------Jabberwocky--------------------
L2 norm vector is

        a300  plough  pledge  pledged  pledges  plenmeer  plenty
Counts     0       0       0        0        0         0       0
--------------------buy MSFT sell AAPL hold Brent--------------------
L2 norm vector is
  (0, 8065)	0.29271261821048594
  (0, 4172)	0.358000284990265
  (0, 1210)	0.281467434058663
  (0, 1091)	0.5774997449206991
  (0, 6)	0.611085302775489
        sell  brent  aapl  hold  buy  pnc  plessis
Counts     1      1     1     1    1    0        0
--------------------bullish stocks--------------------
L2 norm vector is
  (0, 8609)	0.5821471771382473
  (0, 1176)	0.8130834300057836
        bullish  stocks  a300  plessis  pleased  pledge  pledged
Counts        1       1     0        0        0       0        0
--------------------Some random forests produce deterministic losses--------------------
L2 norm vector is
  (0, 7169)	0.5780527421462479
  (0, 6899)	0.3849996487707055
  (0, 5262)	0.33474244935

In [43]:
# store the result from part b
b_output_vec = docs2vec(test_sample, my_tfidf_vectorizer)[0]

### (c) Implement a function ``recommend(vec, X_model, X_corpus, lsa_instance)`` which projects any document vector onto a given ``X_model`` where 
* ``X_model = {X_train_tfidf, X_train_lsa}``

and returns ``doc_vec``, ``idx_top10``, ``sim_top10``, ``X_top10`` where
* ``doc_vec``: the (sparse) vector of similarity scores of ``vec`` and members of ``X_model``. This vector should have size D × 1, where D is the number of documents in the training set (4743 in the original split, 9601 here because we train on the full corpus)
* ``idx_top10``: the indices of the top-10 similarity scores
* ``sim_top10``: the top-10 similarity scores
* ``X_top10``: the top-10 corpus articles most similar to the input vector

#### Remark: what we have now are
* ``my_tfidf_vectorizer`` trained.
* ``b_output_vec``: ``docs2vec(test_sample, my_tfidf_vectorizer)[0]`` or ``my_tfidf_vectorizer.transform(test_sample)``


**Our target: find the top 10 docs in corpus_train (9601 docs) that are most similar to our ``test_sample`` of 4 strings.**

##### TF-IDF:
* Model_X: ``TrainVectorizationArray_tfidf``
* Test_vec: ``TestVectorizationArray_tfidf``

In [44]:
TrainVectorizationArray_tfidf = X_train_tfidf.toarray()
TestVectorizationArray_tfidf = b_output_vec.toarray()

##### LSA:

In the cell below, we perform **dimensionality reduction using LSA**:
* Project the tfidf vectors ``X_train_tfidf`` onto the first $N=200$ singular vectors (latent factors).
* Though this is significantly fewer features (200) than the original tfidf vector (10000), the features are denser and less noisy, which often improves accuracy.


* Model_X: ``TrainVectorizationArray_lsa``
* Test_vec: ``TestVectorizationArray_lsa``

In [45]:
# instantiate svd instance
svd = TruncatedSVD(
    n_components=200,
    random_state=42,
    algorithm='arpack' 
)

# make an LSA pipeline that transforms X_train_tfidf into X_train_lsa
lsa = make_pipeline(
    svd, 
#   Normalizer(copy=False) # try enabling this step. Do you get a better result?
)

# Run SVD on the training data, then project the training data.
X_train_lsa = lsa.fit_transform(X_train_tfidf)
TrainVectorizationArray_lsa = X_train_lsa
TestVectorizationArray_lsa = lsa.transform(b_output_vec) # i.e. lsa.transform(vec)

In [46]:
print(np.shape(TrainVectorizationArray_tfidf))
print(np.shape(TrainVectorizationArray_lsa))
print(np.shape(TestVectorizationArray_tfidf))
print(np.shape(TestVectorizationArray_lsa))
print(np.shape(b_output_vec))

(9601, 10000)
(9601, 200)
(4, 10000)
(4, 200)
(4, 10000)


##### Put together and construct the ``recommend()`` method:
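* Assume we use the **cosine similarity score**. For vectors normalized to unit length it reduces to a dot product:
$$ score(A,B) = \frac{A\cdot B}{||A||\cdot ||B||} = A\cdot B$$
* The TF-IDF rows have unit $L_2$ norm, so this holds exactly for them. The LSA projections are not re-normalized here (the ``Normalizer`` step is commented out), so for them the dot product is an unnormalized similarity rather than a true cosine.
* To calculate/project any doc vector ``vec`` onto a given ``X_model`` (one document per row):

``score = vec.dot(X_model.T)``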
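
A quick numerical check of that identity, using random matrices as stand-ins for query and corpus vectors (the arrays `A` and `B` are assumptions of this sketch, not notebook variables): once the rows are scaled to unit length, a plain matrix product reproduces scikit-learn's `cosine_similarity`, while the product of the raw rows does not.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
A = rng.random((3, 5))  # 3 hypothetical query vectors
B = rng.random((4, 5))  # 4 hypothetical corpus vectors

# Scale every row to unit L2 norm, then take dot products
A_unit = normalize(A, norm='l2')
B_unit = normalize(B, norm='l2')
scores_dot = A_unit @ B_unit.T

# Reference: cosine similarity computed on the raw rows
scores_cos = cosine_similarity(A, B)

print(np.allclose(scores_dot, scores_cos))

# Without normalization the dot product is not the cosine
raw_dot = A @ B.T
print(np.allclose(raw_dot, scores_cos))
```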

In [47]:
X_model = [X_train_tfidf, X_train_lsa]

#### Recommend:
What we already have:
* trained tfidf: ``my_tfidf_vectorizer``
* trained lsa: ``lsa``, passed in as ``lsa_instance``
* ``vec``: ``b_output_vec = my_tfidf_vectorizer.transform(test_sample)``
* fitted and transformed training sets: ``X_model = [X_train_tfidf, X_train_lsa]``

In [48]:
def recommend(vec, X_model, X_corpus, lsa_instance):
    '''
    Args
    ----------
    vec = b_output_vec = my_tfidf_vectorizer.transform(test_sample)
    X_model = [X_train_tfidf, X_train_lsa]
    X_corpus = corpus_train
    lsa_instance: lsa trained above
    
    '''
    TrainVectorizationArray_tfidf = X_model[0].toarray()
    TrainVectorizationArray_lsa = X_model[1]
    
    TestVectorizationArray_tfidf = vec.toarray()
    TestVectorizationArray_lsa = lsa_instance.transform(vec)
    
    # scores
    score_tfidf = TestVectorizationArray_tfidf.dot(TrainVectorizationArray_tfidf.T)
    score_tfidf_vec = score_tfidf.sum(0)
    score_lsa = TestVectorizationArray_lsa.dot(TrainVectorizationArray_lsa.T)
    score_lsa_vec = score_lsa.sum(0)
    
    df1 = pd.DataFrame(score_tfidf_vec, columns=['score tfidf'])
    df2 = pd.DataFrame(score_lsa_vec, columns=['score lsa'])
    
    ########## tfidf
    # top10 index
    res1 = df1.sort_values(by='score tfidf',ascending=False).head(10)
    idx1 = res1.index.values
    
    # top10 values
    s1 = res1['score tfidf'].values
    
    # top10 articles
    art1 = []
    for i in idx1:
        #print('\n -------This is how a article index ', i,' looks like:-------\n\n', corpus_train[i][:100])
        art1.append(X_corpus[i][:200])
    
    ########## LSA
    res2 = df2.sort_values(by='score lsa',ascending=False).head(10)
    idx2 = res2.index.values
    s2 = res2['score lsa'].values
    art2 = []
    for i in idx2:
        #print('\n -------This is how a article index ', i,' looks like:-------\n\n', corpus_train[i][:100])
        art2.append(X_corpus[i][:200])
    
    '''
    Returns
    ---------
    doc_vec: (df1, df2), DataFrames of similarity scores between vec and every training document, one row per document (D rows, D = number of documents in the training set)
    idx_top10: (idx1, idx2), the indices of the top-10 similarity scores
    sim_top10: (s1, s2), the top-10 similarity scores
    X_top10: (art1, art2), the first 200 characters of the top-10 corpus articles most similar to the input

    Each tuple holds the TF-IDF result first and the LSA result second.
    '''
    doc_vec = (df1, df2)
    idx_top10 = (idx1, idx2)
    sim_top10 = (s1, s2)
    X_top10 = (art1, art2)
    return doc_vec, idx_top10, sim_top10, X_top10
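
The top-10 selection inside ``recommend()`` goes through a sorted ``DataFrame``. A minimal sketch of the same step in plain NumPy, assuming a 1-D score vector; ``top_k`` and ``demo_scores`` are hypothetical names, not part of the notebook:

```python
import numpy as np

def top_k(scores, k=10):
    """Return (indices, values) of the k largest entries, largest first."""
    scores = np.asarray(scores).ravel()
    k = min(k, scores.size)
    # argsort is ascending, so take the last k positions and reverse them.
    idx = np.argsort(scores, kind="stable")[-k:][::-1]
    return idx, scores[idx]

# Made-up scores for illustration.
demo_scores = np.array([0.0, 0.05, 0.31, 0.0, 0.25, 0.34])
idx, vals = top_k(demo_scores, k=3)
print(idx, vals)
```

Tied scores may come out in a different order than with ``sort_values``, so the two approaches can disagree on which of several equal-scoring documents is listed first.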

#### What does your ``recommend()`` function output for the ``doc`` vectors in (b)? 

In [49]:
print(test_sample)

['Jabberwocky', 'buy MSFT sell AAPL hold Brent', 'bullish stocks', 'Some random forests produce deterministic losses']


In [50]:
doc_vec, idx_top10, sim_top10, X_top10 =  recommend(b_output_vec, X_model, corpus_train, lsa)
print('---------- similarity scores tfidf ----------')
print(doc_vec[0])
print('---------- similarity scores lsa ----------')
print(doc_vec[1])

---------- similarity scores tfidf ----------
      score tfidf
0        0.000000
1        0.000000
2        0.000000
3        0.052666
4        0.000000
...           ...
9596     0.000000
9597     0.000000
9598     0.000000
9599     0.000000
9600     0.000000

[9601 rows x 1 columns]
---------- similarity scores lsa ----------
      score lsa
0      0.007706
1      0.017369
2     -0.005140
3      0.044632
4      0.009103
...         ...
9596  -0.003198
9597  -0.002799
9598   0.001968
9599  -0.005570
9600   0.012080

[9601 rows x 1 columns]


In [51]:
print('---------- top 10 scores indices tfidf ----------')
print(idx_top10[0])
print('---------- top 10 scores indices lsa ----------')
print(idx_top10[1])

---------- top 10 scores indices tfidf ----------
[1542 4417 4467 1687 3751 7680 6782 8509 9102 7244]
---------- top 10 scores indices lsa ----------
[1682 3751 1687 3747 9219 1190 8659 6013 1202 6020]


In [52]:
print('---------- top 10 similarity scores tfidf ----------')
print(sim_top10[0])
print('---------- top 10 similarity scores lsa ----------')
print(sim_top10[1])

---------- top 10 similarity scores tfidf ----------
[0.34321789 0.31331889 0.31331889 0.25290315 0.24933593 0.24118304
 0.21381491 0.2108215  0.20563376 0.20121979]
---------- top 10 similarity scores lsa ----------
[0.14191987 0.14012116 0.13971319 0.1388139  0.138448   0.13833028
 0.13302639 0.13302639 0.12306137 0.12255294]
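
Note that these scores are pooled over the four test samples, because ``recommend()`` applies ``.sum(0)`` to the query-by-document score matrix. A small sketch, on a made-up 2 x 5 score matrix, of how a pooled ranking differs from ranking each query on its own (all names and numbers here are illustrative assumptions):

```python
import numpy as np

# Made-up score matrix: 2 queries (rows) x 5 documents (columns).
score = np.array([
    [0.0, 0.4, 0.1, 0.0, 0.2],
    [0.3, 0.0, 0.0, 0.5, 0.15],
])
k = 2

# Pooled ranking, as in recommend(): add the rows, then rank once.
pooled = score.sum(0)
pooled_top = np.argsort(pooled)[::-1][:k]

# Per-query ranking: rank each row separately.
per_query_top = np.argsort(score, axis=1)[:, ::-1][:, :k]

print(pooled_top)
print(per_query_top)
```

In the pooled ranking a document only needs to match one of the queries strongly, which is worth keeping in mind when reading the top-10 lists.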


#### Compare the articles recommended by each model:

In [53]:
print('---------- top 10 articles tfidf ----------')
for t in X_top10[0]:
    print('\n'+'*'*5)
    print(t)
print('\n\n')
print('---------- top 10 articles lsa ----------')
for t in X_top10[1]:
    print('\n'+'*'*5)
    print(t)
pd.DataFrame(X_top10, index=['tfidf','lsa'])

---------- top 10 articles tfidf ----------

*****
ANALYST REITERATES BUY ON SOME DRUG STOCKS

Merrill Lynch and Co analyst Richard Vietor said he reiterated a buy recommendation on several drug stocks today. The stocks were Bristol-Myers Co BMY>, whi

*****
SUBROTO SEES OIL MARKET CONTINUING BULLISH

Indonesian Energy Minister Subroto said he sees the oil market continuing bullish, with underlying demand expected to rise later in the year. He told a pres

*****
SUBROTO SEES OIL MARKET CONTINUING BULLISH

Indonesian Energy Minister Subroto said he sees the oil market continuing bullish, with underlying demand expected to rise later in the year. He told a pres

*****
EIA SAYS DISTILLATE, GAS STOCKS OFF IN WEEK

Distillate fuel stocks held in primary storage fell by 3.4 mln barrels in the week ended Feb 27 to 128.4 mln barrels, the Energy Information Administration

*****
EIA SAYS DISTILLATE, GAS STOCKS OFF IN WEEK

Distillate fuel stocks held in primary storage fell by 8.8 mln barrels i

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| tfidf | ANALYST REITERATES BUY ON SOME DRUG STOCKS\n\n... | SUBROTO SEES OIL MARKET CONTINUING BULLISH\n\n... | SUBROTO SEES OIL MARKET CONTINUING BULLISH\n\n... | EIA SAYS DISTILLATE, GAS STOCKS OFF IN WEEK\n\... | EIA SAYS DISTILLATE, GAS STOCKS OFF IN WEEK\n\... | WALL STREET SURVIVES TRIPLE EXPIRATIONS\n\nThe... | PARIS TO ADD THREE STOCKS TO CONTINUOUS QUOTAT... | DRAWDOWN SEEN IN U.S. DISTILLATE STOCKS\n\nTon... | TALKING POINT/TOBACCO STOCKS\n\nStocks of toba... | TROPICAL FOREST DEATH COULD SPARK NEW DEBT CRI... |
| lsa | EIA SAYS DISTILLATE STOCKS OFF 3.4 MLN BBLS, G... | EIA SAYS DISTILLATE, GAS STOCKS OFF IN WEEK\n\... | EIA SAYS DISTILLATE, GAS STOCKS OFF IN WEEK\n\... | EIA SAYS DISTILLATE STOCKS OFF 8.8 MLN, GASOLI... | EIA SAYS DISTILLATE STOCKS OFF 3.2 MLN BBLS, G... | API SAYS DISTILLATE STOCKS OFF 4.4 MLN BBLS, G... | API SAYS DISTILLATE STOCKS OFF 4.07 MLN BBLS, ... | API SAYS DISTILLATE STOCKS OFF 7.35 MLN BBLS, ... | API SAYS DISTILLATE, GAS STOCKS OFF IN WEEK\n\... | API SAYS DISTILLATE, GAS STOCKS OFF IN WEEK\n\... |


##### Compare time cost:
* Time each model from instantiation to the final top-10 results.
* The code above is copied into the cells below so that each model can be timed separately.
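
One way to take such timings without repeating the ``t0 = time.time()`` bookkeeping is a small context manager. This is only a sketch; ``timed`` and ``timings`` are hypothetical names and the workload is a stand-in, not one of the models:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results=None):
    """Print (and optionally record) the wall-clock time of a code block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print("%s is done in %.3fsec" % (label, elapsed))

timings = {}
with timed("toy workload", timings):
    # Stand-in for fitting a model and ranking the results.
    total = sum(i * i for i in range(100_000))
```

Collecting the numbers in a dictionary makes it easy to compare several runs side by side afterwards.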

In [54]:
t0 = time.time()
######################### TF-IDF
my_tfidf_vectorizer = TfidfVectorizer(max_df=0.5,max_features=10000,min_df=2,stop_words='english',norm='l2',
    use_idf=True, analyzer='word',token_pattern = '(?u)\\b[a-zA-Z]\\w+\\b')
X_train_tfidf = my_tfidf_vectorizer.fit_transform(corpus_train)
features_tfidf = my_tfidf_vectorizer.get_feature_names()
b_output_vec = docs2vec(test_sample, my_tfidf_vectorizer)[0]
TrainVectorizationArray_tfidf = X_train_tfidf.toarray()
TestVectorizationArray_tfidf = b_output_vec.toarray()
score_tfidf = TestVectorizationArray_tfidf.dot(TrainVectorizationArray_tfidf.T)
score_tfidf_vec = score_tfidf.sum(0)
df1 = pd.DataFrame(score_tfidf_vec, columns=['score tfidf'])
res1 = df1.sort_values(by='score tfidf',ascending=False).head(10)
idx1 = res1.index.values
s1 = res1['score tfidf'].values
art1 = []
for i in idx1:
    #print('\n -------This is how a article index ', i,' looks like:-------\n\n', corpus_train[i][:100])
    art1.append(corpus_train[i][:100])

print("TF-IDF is done in %.3fsec" % (time.time() - t0))

TF-IDF is done in 2.497sec


In [55]:
t0 = time.time()
######################### LSA/SVD
svd = TruncatedSVD(n_components=200,random_state=42,algorithm='arpack' )
lsa = make_pipeline(svd)
#   Normalizer(copy=False) # try commenting this out. Do you get a better result?

X_train_lsa = lsa.fit_transform(X_train_tfidf)
TrainVectorizationArray_lsa = X_train_lsa
TestVectorizationArray_lsa = lsa.transform(b_output_vec) # lsa.transform(vec)
score_lsa = TestVectorizationArray_lsa.dot(TrainVectorizationArray_lsa.T)
score_lsa_vec = score_lsa.sum(0)
df2 = pd.DataFrame(score_lsa_vec, columns=['score lsa'])
res2 = df2.sort_values(by='score lsa',ascending=False).head(10)
idx2 = res2.index.values
s2 = res2['score lsa'].values
art2 = []
for i in idx2:
    #print('\n -------This is how a article index ', i,' looks like:-------\n\n', corpus_train[i][:100])
    art2.append(corpus_train[i][:100])
print("Truncated SVD/LSA is done in %.3fsec" % (time.time() - t0))

Truncated SVD/LSA is done in 7.914sec
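
The commented-out ``Normalizer(copy=False)`` line in the cell above matters for how the scores are read. A minimal sketch on a made-up five-document corpus, assuming ``n_components=2`` purely because the example is tiny: without the normalizer the LSA rows have arbitrary lengths, and with it every row has unit length, so the dot product becomes a cosine similarity in [-1, 1].

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Toy corpus, for illustration only.
toy_corpus = [
    "oil stocks fell this week",
    "gasoline stocks fell in the week",
    "analyst says buy drug stocks",
    "drug company shares rise",
    "oil market seen bullish",
]
X_toy = TfidfVectorizer().fit_transform(toy_corpus)

# n_components=2 is an arbitrary choice for this tiny example.
lsa_raw = make_pipeline(TruncatedSVD(n_components=2, random_state=42))
lsa_unit = make_pipeline(
    TruncatedSVD(n_components=2, random_state=42),
    Normalizer(copy=False),
)

Z_raw = lsa_raw.fit_transform(X_toy)
Z_unit = lsa_unit.fit_transform(X_toy)

print(np.linalg.norm(Z_raw, axis=1))   # row lengths before normalisation
print(np.linalg.norm(Z_unit, axis=1))  # all numerically equal to 1

# Dot products of unit-length rows are cosine similarities.
sims = Z_unit @ Z_unit.T
```

With unnormalised rows, longer LSA vectors tend to receive larger scores regardless of direction, which may be one reason the LSA scores above sit on a different scale from the TF-IDF scores.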


### Conclusion:
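* Compared to TF-IDF, LSA improves the recommendations and returns more relevant articles,
* but it does not improve running time: the LSA step took 7.914 s, on top of the 2.497 s for TF-IDF, whose output matrix it takes as input.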

### (d) Extra credit: Repeat the same exercise but instead of the Reuters 10K dataset, use the following corpus of 200K English plaintext jokes: https://github.com/taivop/joke-dataset. Does your recommender system actually find similar jokes? Give examples of good and bad recommendations. Provide a list of suggestions of how one could improve upon this recommender.

#### Import data from GitHub:

In [57]:
import json

import requests

# Raw JSON files from the taivop/joke-dataset repository
url1="https://raw.githubusercontent.com/taivop/joke-dataset/master/reddit_jokes.json"
url2='https://raw.githubusercontent.com/taivop/joke-dataset/master/stupidstuff.json'
url3='https://raw.githubusercontent.com/taivop/joke-dataset/master/wocka.json'

urls=[url1,url2,url3]
jokes = []
for idx, url in enumerate(urls):
    r=requests.get(url)
    t=json.loads(r.content)
    # each file has its own record structure, so build the joke text per source
    for joke in t:
        if idx == 0:
            # reddit_jokes.json: title + body
            jokes.append(joke['title']+'\n\n'+joke['body'])
        elif idx == 1:
            # stupidstuff.json: category + body
            jokes.append(joke['category']+'\n\n'+joke['body'])
        else:
            # wocka.json: title + body + category
            jokes.append(joke['title']+'\n\n'+joke['body'] +'\n\n'+joke['category'])
print('we have totally ',len(jokes), 'jokes')

we have totally  208345 jokes


##### Let's split the dataset into
* ``joke_universe``: ``jokes[:-1]``
* ``joke_target``: ``jokes[-1:]``

In [58]:
joke_universe = jokes[:-1]
joke_target = jokes[-1:]
print('Here, we have a universe of ',len(joke_universe),' jokes in total.\n And we want to find 10 jokes from the universe that are the most similar to our ',len(joke_target),' target jokes.')
for j in joke_target:
    print('\n'+'*'*15)
    print(j)

Here, we have a universe of  208344  jokes in total.
 And we want to find 10 jokes from the universe that are the most similar to our  1  target jokes.

***************
... And We Wonder Why Everyone Hates Us

Customer: "Are you Hispanic?"

Me: "No."

Customer: "Middle Eastern?"

Me: "No."

Customer: "Egyptian?"

Me: "No."

Customer: "What are you?"

Me: "Chinese."

(customer puts on offended face)

Customer: "I don't appreciate you treating me like I'm dumb."

Me: "Excuse me? I'm being honest."

Customer: "NO CHINESE PERSON WOULD EVER HAVE EYES AS BIG AS YOURS!!!"

Me: *mouth wide open*

Insults


#### We want to find the 10 jokes that are most similar to our target joke.

#### TF-IDF:

In [60]:
joke_tfidf_vectorizer = TfidfVectorizer(
    max_df=0.5,
    max_features=3000,
    min_df=2,
    stop_words='english',
    norm='l2',
    use_idf=True, 
    analyzer='word',
    token_pattern = '(?u)\\b[a-zA-Z]\\w+\\b'
    )

joke_X_train_tfidf = joke_tfidf_vectorizer.fit_transform(joke_universe)
joke_features_tfidf = joke_tfidf_vectorizer.get_feature_names()
joke_b_output_vec = joke_tfidf_vectorizer.transform(joke_target)
joke_TrainVectorizationArray_tfidf = joke_X_train_tfidf.toarray()
joke_TestVectorizationArray_tfidf = joke_b_output_vec.toarray()

joke_score_tfidf = joke_TestVectorizationArray_tfidf.dot(joke_TrainVectorizationArray_tfidf.T)
joke_score_tfidf_vec = joke_score_tfidf.sum(0)

joke_df1 = pd.DataFrame(joke_score_tfidf_vec, columns=['score tfidf'])

joke_res1 = joke_df1.sort_values(by='score tfidf',ascending=False).head(10)
joke_idx1 = joke_res1.index.values
joke_s1 = joke_res1['score tfidf'].values
joke_art1 = []
for i in joke_idx1:
    print('\n -------This is how a joke index ', i,' looks like:-------\n\n', joke_universe[i])
    joke_art1.append(joke_universe[i])


 -------This is how a joke index  208332  looks like:-------

 Paging Leonidas To The Front Desk

Customer: "Look! My friend told me I could get this type of hammer at your store! Now go get it for me!"

Cashier: "Sir, I already told you... we don't have ANY hammers back here that aren't already stocked on the shelves."

Customer: "LOOK HERE. F**K YOU! I KNOW YOU'RE TRYING TO SAVE MONEY BY SWITCHING OUT YOUR STOCKS! GET ME THIS HAMMER!"

(At this point, I come to the front of the store, overhearing what's going on; note that I'm the manager.)

Me: "Is there a problem?"

Customer: "Yes sir! Your employee here is not doing what I tell her to!"

Me: "Well, you need to calm down and understand that we don't have what you're looking for. So maybe you should go back to shelves and check..."

Customer: "F**K THAT!!! IT'S NOT THERE, OKAY?! YOU NEED TO F**KING GET ME WHAT I ASK FOR!"

Me: "That's it. Get out of my store."

Customer: "What? NO!"

Me: "Sir, get out, or I have to take you out."



#### Truncated SVD/LSA:

In [61]:
joke_svd = TruncatedSVD(n_components=200,random_state=42,algorithm='arpack')
joke_lsa = make_pipeline(joke_svd) 
#   Normalizer(copy=False) # try commenting this out. Do you get a better result?

joke_X_train_lsa = joke_lsa.fit_transform(joke_X_train_tfidf)
joke_TrainVectorizationArray_lsa = joke_X_train_lsa
joke_TestVectorizationArray_lsa = joke_lsa.transform(joke_b_output_vec)
joke_score_lsa = joke_TestVectorizationArray_lsa.dot(joke_TrainVectorizationArray_lsa.T)
joke_score_lsa_vec = joke_score_lsa.sum(0)

joke_df2 = pd.DataFrame(joke_score_lsa_vec, columns=['score lsa'])

joke_res2 = joke_df2.sort_values(by='score lsa',ascending=False).head(10)
joke_idx2 = joke_res2.index.values
joke_s2 = joke_res2['score lsa'].values
joke_art2 = []
for i in joke_idx2:
    print('\n -------This is how a joke index ', i,' looks like:-------\n\n', joke_universe[i])
    joke_art2.append(joke_universe[i])


 -------This is how a joke index  39623  looks like:-------

 what do you call a Chinese person with down syndrome?

Som ting wong

 -------This is how a joke index  22833  looks like:-------

 How do you blindfold a chinese person?

With dental floss

 -------This is how a joke index  22040  looks like:-------

 What do you call a Chinese Millionaire?

Cha Ching

 -------This is how a joke index  137715  looks like:-------

 What do you call a foreigner who is obsessed with Chinese culture?

A zhuologist

 -------This is how a joke index  119795  looks like:-------

 What do you call a Chinese Podiatrist?

Hee Lan To

 -------This is how a joke index  121680  looks like:-------

 What part of your punctuality emancipates the Chinese?

Your Ti"ming"!

 -------This is how a joke index  25922  looks like:-------

 What do you call a Chinese millionaire?

Cha Ching

 -------This is how a joke index  117910  looks like:-------

 What do Chinese lumberjacks do?

Chopsticks

 -------This is

### Conclusion:
* By accident, my target joke touches on a sensitive topic; it comes from the ``Insults`` category. Sorry about that.
* Suppose, though, that the goal is to find the jokes that are most impolite towards Chinese people so that they can be deleted immediately. Which model gives better delete recommendations?
* Judging from the output printed above, the ``SVD/LSA`` model does better: the results it shows are all jokes about Chinese people, whereas the ``TF-IDF`` results shown match the customer-dialogue format of the target instead.
----
* My **target joke** is:

    ... And We Wonder Why Everyone Hates Us
    
    Customer: "Are you Hispanic?"
    
    Me: "No."
    
    Customer: "Middle Eastern?"
    
    Me: "No."
    
    Customer: "Egyptian?"
    
    Me: "No."
    
    Customer: "What are you?"
    
    Me: "Chinese."
    
    (customer puts on offended face)
    
    Customer: "I don't appreciate you treating me like I'm dumb."
    
    Me: "Excuse me? I'm being honest."
    
    Customer: "NO CHINESE PERSON WOULD EVER HAVE EYES AS BIG AS YOURS!!!"
    
    Me: *mouth wide open*
    
    Insults


* A **bad recommendation** from ``TF-IDF``:

    -------This is how a joke index  173384  looks like:-------

     A customer asked me for a good reliable printer...


* A **good recommendation** from ``SVD/LSA``:

    -------This is how a joke index  117910  looks like:-------

     What do Chinese lumberjacks do?

     Chopsticks

### Suggestions for improving this recommender:
* As the data-import cell shows, there are three data sources and hence three different JSON structures. The recommender already puts the 'body', 'title' and 'category' fields into ``joke_universe`` wherever they exist. There are further fields to consider, such as 'id' and 'rating', and one option is to bring these into the joke universe as well.
* Another option is to do our own classification first. ``stupidstuff.json`` and ``wocka.json`` provide a 'category' field, but ``reddit_jokes.json`` does not. We could import all fields and train a classifier that re-assigns every joke to a set of self-defined, more detailed categories. Taking this new category into account when training the model may improve the recommender.
* A simple but possibly important step is to try different parameters for the initial model objects, ``TfidfVectorizer`` and ``TruncatedSVD``.
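
The last suggestion can be sketched as a small sweep over ``n_components``. This is a hypothetical illustration on six made-up jokes, not a tuning recipe; the candidate values (1, 2, 3) only make sense for such a tiny matrix, and ``explained_variance_ratio_`` is just one possible signal to look at:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy jokes, for illustration only.
toy_jokes = [
    "why did the chicken cross the road to get to the other side",
    "a customer asked for a reliable printer",
    "what do you call a fish with no eyes a fsh",
    "the customer asked the cashier for a hammer",
    "why did the programmer quit he did not get arrays",
    "what do you call a bear with no teeth a gummy bear",
]
X_toy = TfidfVectorizer(stop_words="english").fit_transform(toy_jokes)

explained = {}
for n_components in (1, 2, 3):
    svd = TruncatedSVD(n_components=n_components, random_state=42)
    svd.fit(X_toy)
    # Share of the variance captured by the retained components.
    explained[n_components] = float(svd.explained_variance_ratio_.sum())

for n_components, ratio in explained.items():
    print(n_components, round(ratio, 3))
```

On the real data the same loop could be combined with a manual look at the top-10 jokes for a few target jokes, since explained variance alone does not say whether the recommendations are good.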