# Natural Language Processing: Topic Modeling with Latent Semantic Analysis

Background: 

In NLP, one important application is topic modeling, which tries to capture the underlying themes that appear in a collection of documents. Two common approaches are:

- Latent Dirichlet Allocation (LDA): generates k topics by first assigning each word to a random topic, then iteratively updating the assignments based on the parameters $\alpha$, which controls the mix of topics per document, and $\beta$, which controls the distribution of words per topic.

- Latent Semantic Analysis (LSA): identifies patterns using TF-IDF scores and reduces the data to k dimensions through singular value decomposition (SVD). In other words, given a corpus of articles, we build a document-term matrix and apply SVD to it.

Goal:

In this project, we employ ***LSA*** to train LSA-based recommender systems, and see whether they can surface what we are looking for in a collection of articles and in a collection of jokes :)



### Environment Setup

In [1]:
import pickle
import os
import time

import numpy as np
import pandas as pd
import scipy.sparse.csr as csr
import scipy.sparse as sparse
from sklearn.base import clone
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.neighbors import KNeighborsClassifier

import matplotlib.pyplot as plt
%matplotlib inline



Let's do some preliminary work to understand NLP first:
## Bag-of-Words matrix: by term frequency

Following the <a href="http://scikit-learn.org/stable/modules/feature_extraction.html">Sklearn Feature-extraction documentation page</a>

- I start with a given *corpus* of $D = 4$ documents.
- I preprocess each document and convert it into a list of terms (features)
    - by lowercasing first
    - accepting only word patterns (defined via [RegEx](https://www.ntu.edu.sg/home/ehchua/programming/howto/Regexe.html))
- then I form the ***Count-Vectorizer term-frequency*** matrix defined as:

$$
\text{tf}(t, d)\equiv{CF}_{d,t} = \text{number of times that the term }t\text{ occurs in document }d
$$
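To make the definition concrete, the same counts can be reproduced without scikit-learn. This is only a minimal sketch, assuming lowercasing and the default token pattern; `term_frequency` is a hypothetical helper name, not part of any library.

```python
import re
from collections import Counter

# Default CountVectorizer token pattern: words of two or more characters.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def term_frequency(document):
    """Count how often each term occurs in one document (hypothetical helper)."""
    tokens = re.findall(TOKEN_PATTERN, document.lower())
    return Counter(tokens)

tf = term_frequency("This is the the first first document abra.")
print(tf["first"], tf["the"], tf["abra"])
```

Each document yields one such counter, and stacking the counters over a shared, sorted vocabulary gives the rows of the term-frequency matrix.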


In [2]:
from sklearn.feature_extraction.text import CountVectorizer
# preprocessing
# candidate token patterns (regexes that define what counts as a term)
tpatterns = [
    '(?u)\\b\\w\\w+\\b', #default tpatterns[0]
    '(?u)\\b[a-zA-Z]\\w+\\b', #tpatterns[1]
    '\\w',#tpatterns[2]
    '\\w+',#tpatterns[3]
    '(?u)\\b[a-zA-Z]\\w+\\b|\\b[0-9]\\b'#tpatterns[4]
]

# instantiate a CountVectorizer, which just does the TF transformation
vectorizer = CountVectorizer(token_pattern=tpatterns[0])
vectorizer

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [3]:
# start with four documents
corpus = [
    'This is the the first first document abra.',
    'This is the second second document cadabra.',
    'And the third one 3.',
    'Is this the first document 4?',
]
# list of string
# map corpus onto a matrix 4x11: 4 documents, 11 unique words/terms
X_corpus_docterm = vectorizer.fit_transform(corpus)
X_corpus_docterm

<4x11 sparse matrix of type '<class 'numpy.int64'>'
	with 21 stored elements in Compressed Sparse Row format>

In [4]:
X_corpus_docterm.toarray()

array([[1, 0, 0, 1, 2, 1, 0, 0, 2, 0, 1],
       [0, 0, 1, 1, 0, 1, 0, 2, 1, 0, 1],
       [0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1]], dtype=int64)

In [5]:
# note the effect of modifying the token_pattern above....
features = vectorizer.get_feature_names() 
features

['abra',
 'and',
 'cadabra',
 'document',
 'first',
 'is',
 'one',
 'second',
 'the',
 'third',
 'this']

In [6]:
# visualize the TF matrix I just made
pd.DataFrame(X_corpus_docterm.toarray(),columns = np.array(features))

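| | abra | and | cadabra | document | first | is | one | second | the | third | this |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 2 | 1 | 0 | 0 | 2 | 0 | 1 |
| 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 2 | 1 | 0 | 1 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
| 3 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 |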


In [7]:
# For each defined regex pattern, take a look the outputs
for tp in tpatterns:
    vectorizer.token_pattern = tp
    print(vectorizer.token_pattern)
    X_corpus_docterm = vectorizer.fit_transform(corpus)
    features = vectorizer.get_feature_names()
    print(features)
    print('\n')

(?u)\b\w\w+\b
['abra', 'and', 'cadabra', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']


(?u)\b[a-zA-Z]\w+\b
['abra', 'and', 'cadabra', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']


\w
['3', '4', 'a', 'b', 'c', 'd', 'e', 'f', 'h', 'i', 'm', 'n', 'o', 'r', 's', 't', 'u']


\w+
['3', '4', 'abra', 'and', 'cadabra', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']


(?u)\b[a-zA-Z]\w+\b|\b[0-9]\b
['3', '4', 'abra', 'and', 'cadabra', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']




In [8]:
# So, this is what we want to do here:
vectorizer.token_pattern = tpatterns[0] #back to the default
X_corpus_docterm = vectorizer.fit_transform(corpus)
features = vectorizer.get_feature_names()
CV = X_corpus_docterm.toarray()
print('Our TF matrix: \n', CV)
print('Each column corresponds:\n',features)

Our TF matrix: 
 [[1 0 0 1 2 1 0 0 2 0 1]
 [0 0 1 1 0 1 0 2 1 0 1]
 [0 1 0 0 0 0 1 0 1 1 0]
 [0 0 0 1 1 1 0 0 1 0 1]]
Each column corresponds:
 ['abra', 'and', 'cadabra', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']


For example, the first document "This is the the first first document abra." contains the word ***first*** twice, so entry (0,4) is 2.

### What if we want to vectorize a document from the test set that contains words never encountered in the training set?

- Call the `.transform()` method of the trained vectorizer on the testing set
- We see that new words do not appear in our matrix

In [9]:
def docs2vec(docs, vectorizer):
    """ 
    docs: testing set of documents
    vectorizer: trained CountVectorizer
    """
    return vectorizer.transform(docs)

# testing set
docs_test = [
    'rain',
    'The a yoghurt',
    'three two one hurray',
    'and abra document is first cadabra',
    'one-third abra is the first cadabra and this document a second',
]

# for each testing doc, take a look at the result
for doc in docs_test:
    print("doc:",doc)
    print(docs2vec([doc], vectorizer).toarray()[0])
    
# transform the whole testing set with the already-fitted vectorizer (no refitting)
X_docterm_test = docs2vec(docs_test, vectorizer)
print('\n Resulting test TF matrix:\n', X_docterm_test.toarray())

doc: rain
[0 0 0 0 0 0 0 0 0 0 0]
doc: The a yoghurt
[0 0 0 0 0 0 0 0 1 0 0]
doc: three two one hurray
[0 0 0 0 0 0 1 0 0 0 0]
doc: and abra document is first cadabra
[1 1 1 1 1 1 0 0 0 0 0]
doc: one-third abra is the first cadabra and this document a second
[1 1 1 1 1 1 1 1 1 1 1]

 Resulting test TF matrix:
 [[0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 1 0 0]
 [0 0 0 0 0 0 1 0 0 0 0]
 [1 1 1 1 1 1 0 0 0 0 0]
 [1 1 1 1 1 1 1 1 1 1 1]]


## TF-IDF: term frequency inverse document frequency
### Problems with Bag-of-Words TF matrix:

- The document vectors are **not normalized**
    - so we can't fairly compare documents of different lengths
- The document vectors contain many common English words that carry **no information**
    - ideally we want to remove those, e.g. 'the', 'is', etc.


### Solution: TF-IDF vectorizer ([The TF-idf section in the Scikit-Learn feature extraction manual](http://scikit-learn.org/stable/modules/feature_extraction.html))

TF-IDF matrix is defined as (when `smooth_idf=True`)

$$
\begin{align}
\text{tf-idf}(t,d) &\equiv{\text{tf}}(t,d)\times\text{idf}(t)\\
\text{idf}(t)&\equiv\log\frac{1+n_d}{1+\text{df}(d,t)} + 1
\end{align}
$$

where 

- $\text{df}(d,t)$ is the number of documents containing feature $t$
- $n_d$ is the number of documents in our corpus
- the rows of the tf-idf matrix are normalized to have unit norm (either $L_1$ or $L_2$)
    - this way we can compare documents by the dot product of their vectors, regardless of document length

Let's do this in practice:

Firstly, create a corpus with various levels of repetition of its terms:
- $N$ is the number of documents in our corpus. We can play with it to see why TF-IDF is a useful tool!!

In [10]:
N = 10
corpus = np.reshape(['blah', 'abra', 'cadabra'] * N, (N,3))
corpus[2:,2] = ''
corpus[int(N/2):,1] = ''
corpus = [' '.join(corpus[i]) for i in range(N)]
# print('\n'.join(corpus[:10]))
corpus[:10]

['blah abra cadabra',
 'blah abra cadabra',
 'blah abra ',
 'blah abra ',
 'blah abra ',
 'blah  ',
 'blah  ',
 'blah  ',
 'blah  ',
 'blah  ']

Initiate TF-IDF Vectorizer:

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(
    stop_words='english', # add stop words
    norm='l2', # each output vector has l2 norm equal to 1
    use_idf=True)

Train:

In [12]:
X_corpus_tfidf=vectorizer.fit_transform(corpus)

Let's see what the difference is:

In [13]:
pd.DataFrame(X_corpus_tfidf.toarray(), columns = vectorizer.get_feature_names()).head(10)

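| | abra | blah | cadabra |
|---|---|---|---|
| 0 | 0.539398 | 0.335836 | 0.772181 |
| 1 | 0.539398 | 0.335836 | 0.772181 |
| 2 | 0.848908 | 0.528541 | 0.0 |
| 3 | 0.848908 | 0.528541 | 0.0 |
| 4 | 0.848908 | 0.528541 | 0.0 |
| 5 | 0.0 | 1.0 | 0.0 |
| 6 | 0.0 | 1.0 | 0.0 |
| 7 | 0.0 | 1.0 | 0.0 |
| 8 | 0.0 | 1.0 | 0.0 |
| 9 | 0.0 | 1.0 | 0.0 |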


#### What we found
- When we vary the number of documents $N$:
    - When $N$ is larger, the weight of 'cadabra' becomes larger.
    - When $N$ is smaller, the weight of 'cadabra' becomes smaller.

- Similarly, the weights of 'blah' and 'abra' also change with $N$. 
    - This is what TF-IDF does! It measures the importance of a word not only by how often it occurs in one document, but also by how many documents in the corpus contain it.

- Whatever $N$ is, 'blah' always gets the lowest weight among the terms of a document that contains other terms. It is the least relevant word to look at since every document has it, whereas 'cadabra' really means something when it occurs, since only 2 documents contain it.
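These observations can be checked by hand from the smoothed idf formula. The sketch below is an illustration under stated assumptions rather than what scikit-learn does internally: it hard-codes the document frequencies implied by how the toy corpus is built ('blah' in every document, 'abra' in the first `N // 2`, 'cadabra' in 2), and `tfidf_first_row` is a hypothetical helper name.

```python
import math

def tfidf_first_row(n_docs):
    """L2-normalised tf-idf weights of the document 'blah abra cadabra'
    in the toy corpus (hypothetical helper, assumes smooth_idf=True, n_docs >= 4)."""
    # Document frequencies implied by the construction of the toy corpus.
    df = {"abra": n_docs // 2, "blah": n_docs, "cadabra": 2}
    # idf(t) = ln((1 + n_d) / (1 + df(t))) + 1; every tf in this row equals 1.
    raw = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

for n in (10, 100, 1000):
    row = tfidf_first_row(n)
    print(n, {t: round(w, 6) for t, w in row.items()})
```

For `n = 10` the weights should agree with the first row of the table above, and as `n` grows the weight of 'cadabra' rises while the weight of 'blah' falls.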

---------- 
It's time to construct our Recommendation System! Boom-ya!

## Latent Semantic Analysis (Truncated SVD on the TF-IDF matrix)

References:

- [Scikit-Learn's Reuters Dataset TF-IDF + K-NN classification example](http://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html)
- [Chris McCormick's LSA tutorial](http://mccormickml.com/2016/03/25/lsa-for-text-classification-tutorial/)
- [Chris McCormick's GitHub Page](https://github.com/chrisjmccormick/LSA_Classification)

Dataset: 

- the Reuters Articles Corpus
- The original Reuters-21578 dataset is part of the [UCI-ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases) and can be found [here](http://archive.ics.uci.edu/ml/machine-learning-databases/reuters21578-mld/reuters21578.tar.gz). 
- However, in this notebook, I will use the already pre-processed version from [Chris McCormick's GitHub page](https://github.com/chrisjmccormick/LSA_Classification/tree/master/data). Note that each document has several labels. With this dataset, we can easily get features and labels that are already split into training and testing sets, as follows.

### Import data & Explore:

In [14]:
filepath = r"C:\Users\xxxli\Desktop\ResumeProjects\Data_Science_Applications\LSA_Recommender_DS7\raw_text_dataset.pickle"
# open the pickle inside a context manager so the file handle is closed afterwards
with open(filepath, "rb") as f:
    raw_text_dataset = pickle.load(f)
corpus_train, labels_train = raw_text_dataset[0], raw_text_dataset[1] 
corpus_test, labels_test = raw_text_dataset[2], raw_text_dataset[3]
print('Data Importing ...')
print(type(corpus_train), type(labels_train)) # list of string
print('Number of train docs:', len(corpus_train), '\nNumber of test docs:', len(corpus_test))
print('Number of train labels:', len(labels_train), '\nNumber of test labels:', len(labels_test))

Data Importing ...
<class 'list'> <class 'list'>
Number of train docs: 4743 
Number of test docs: 4858
Number of train labels: 4743 
Number of test labels: 4858


Randomly pick a document and take a look at its content and labels/tags:

In [15]:
n = np.random.choice(len(corpus_train))

print('\nThis is how an article ', n,' looks like:\n\n', 
      corpus_train[n][:500])

print('\nAnd these are its topic labels or tags:\n\n', 
      labels_train[n][:500])


This is how an article  2759  looks like:

 EC FARM LIBERALISATION SEEN HURTING THAI TAPIOCA

Any European Community decision to liberalise farm trade policy would hurt Thailand's tapioca industry, said Ammar Siamwalla, an agro-economist at the Thailand Development Research Institute (TDRI). He told a weekend trade seminar here that any EC move to cut tariff protection for EC grains would make many crops more competitive than tapioca in the European market. The EC is the largest buyer of Thai tapioca, absorbing more than two thirds of the

And these are its topic labels or tags:

 ['tapioca', 'meal-feed', 'thailand', 'ec']


### Reconstruct train-test sets:
We want to use all ~9.6k articles (train and test combined) as the training set, and then
- make recommendations based on key words
- make recommendations based on a self-input list of documents

### Setup train set:

In [16]:
corpus = corpus_train + corpus_test
corpus_train = corpus
print('number of documents:', len(corpus_train))

number of documents: 9601


### Setup target word:

In [17]:
target_word = "cocoa"

### Setup our target list of documents:
* doc1: “Jabberwocky”
* doc2: “buy MSFT sell AAPL hold Brent”
* doc3: “bullish stocks”
* doc4: “Some random forests produce deterministic losses”

For each of the following doc strings, calculate their corresponding vectors:

In [18]:
test_sample = ['Jabberwocky',
               'buy MSFT sell AAPL hold Brent',
               'bullish stocks',
               'Some random forests produce deterministic losses']

### TF-IDF vectorizer step:
The TfidfVectorizer below does the following:

- **`stop_words='english'`**: Strips out English “stop words”, i.e. frequently occurring English words.
- **`max_df=0.5`**: Filters out terms that occur in more than half of the docs.
- **`min_df=2`**: Filters out terms that occur in only one document.
- **`max_features=10000`**: Keeps only the 10,000 most frequently occurring terms in the corpus.
- **`norm='l2'`**: Normalizes the vector to account for the effect of document length on the tf-idf values. In other words, each output row will have unit norm, either: 
    * 'l2': Sum of squares of vector elements is 1. The cosine similarity between two vectors is their dot product when l2 norm has been applied. 
    * 'l1': Sum of absolute values of vector elements is 1.
- **`use_idf=True`**: Enable inverse-document-frequency reweighting. 
- **`analyzer='word'`**: Whether the features should be made of word or character n-grams. 
- **`token_pattern='(?u)\\b[a-zA-Z]\\w+\\b'`**: Regular expression denoting what constitutes a “token”, only used if analyzer == 'word'. The default regexp, `r"(?u)\b\w\w+\b"`, selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).

Note: Play around to see what kind of ***`token_pattern`*** actually works best for this corpus, and how it changes the output below!
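A quick way to see what the two patterns do, before refitting the whole vectorizer, is to run them through `re.findall` on a lowercased string. This is a small sketch; the `sample` sentence is made up for illustration and is not taken from the corpus.

```python
import re

patterns = {
    "default": r"(?u)\b\w\w+\b",             # any token of 2+ word characters
    "letter-first": r"(?u)\b[a-zA-Z]\w+\b",  # token must start with a letter
}

# Made-up example text, lowercased as the vectorizer would do.
sample = "Boeing sold 3 A320 jets in 1987".lower()

tokens = {name: re.findall(pattern, sample) for name, pattern in patterns.items()}
for name, found in tokens.items():
    print(name, found)
```

The letter-first pattern keeps mixed tokens such as `a320` but drops purely numeric ones such as `1987`, which is consistent with the feature list of the fitted vectorizer starting at `a300` rather than at digits.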

In [19]:
vectorizer = TfidfVectorizer(
    max_df=0.5,
    max_features=10000,
    min_df=2,
    stop_words='english',
    norm='l2',
    use_idf=True, 
    analyzer='word',
    # token_pattern='(?u)\\b\\w\\w+\\b'
    token_pattern = '(?u)\\b[a-zA-Z]\\w+\\b'
    )

Train:

In [20]:
X_train_tfidf = vectorizer.fit_transform(corpus_train)
print('IDF vector:', vectorizer.idf_) # The inverse document frequency (IDF) vector; only defined if use_idf is True.
# X_train_tfidf

IDF vector: [8.56028878 8.78343233 8.09028515 ... 8.09028515 8.78343233 7.46167649]


In [21]:
print('first 10 features:', vectorizer.get_feature_names()[:10])
print('last 10 features:', vectorizer.get_feature_names()[-10:])
print('trained TFIDF matrix shape:', X_train_tfidf.shape)
pd.DataFrame(X_train_tfidf.toarray(), columns = vectorizer.get_feature_names()).head()

first 10 features: ['a300', 'a320', 'a330', 'a340', 'aa', 'aaa', 'aapl', 'ab', 'abandon', 'abandoned']
last 10 features: ['zim', 'zimbabwe', 'zinc', 'ziyang', 'zoete', 'zone', 'zones', 'zorinsky', 'zuckerman', 'zurich']
trained TFIDF matrix shape: (9601, 10000)


Unnamed: 0,a300,a320,a330,a340,aa,aaa,aapl,ab,abandon,abandoned,...,zim,zimbabwe,zinc,ziyang,zoete,zone,zones,zorinsky,zuckerman,zurich
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.046785,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Let's find an article in `X_train_tfidf` that contains the word "cocoa":

In [22]:
print('target word:', target_word)
# find which column of this matrix the word belongs to, and the documents with a non-zero entry there
doc_idx = X_train_tfidf[:, vectorizer.vocabulary_.get(target_word)].nonzero()[0].tolist()

# count the number of documents that contain the target word
print(len(doc_idx), 'documents found.')

# then randomly pick one of the documents that actually contain that word
i = np.random.choice(len(doc_idx))
print('Example:\ndocument index:', doc_idx[i],)
print('-----')
print(corpus_train[doc_idx[i]][:500])

target word: cocoa
34 documents found.
Example:
document index: 6767
-----
COCOA DELEGATES OPTIMISTIC ON BUFFER STOCK RULES

Hopes mounted for an agreement on cocoa buffer stock rules at an International Cocoa Organization, ICCO, council meeting which opened here today, delegates said. Both producer and consumer ICCO members said after the opening session that prospects for an agreement on the cocoa market support mechanism were improving. "The chances are very good as of now of getting buffer stock rules by the end of next week," Ghanaian delegate and producer spokesm


#### What we want to notice is that the ***tf-idf matrix is really sparse!***

In [23]:
X_train_tfidf

<9601x10000 sparse matrix of type '<class 'numpy.float64'>'
	with 446797 stored elements in Compressed Sparse Row format>

#### Therefore, when we do topic modeling, i.e. extract the most important words from our corpus, we want to perform ***dimension reduction*** on this matrix, and this step uses ***SVD*** as follows.

### Truncated SVD:
- Project the tfidf vectors onto the first k singular vectors (the SVD analogue of principal components; k is `n_components` below). 
- Though this gives significantly fewer features than the original tfidf vectors, they tend to be stronger features for comparing documents.


Initiate TruncatedSVD, make a pipeline and train:

- Dimensionality reduction using truncated SVD.
- This transformer performs linear dimensionality reduction by means of truncated singular value decomposition (SVD).
- Contrary to PCA, this estimator does not center the data before computing the singular value decomposition. This means it can work with sparse matrices efficiently.
- **`n_components`**: int. Desired dimensionality of output data.
- **`algorithm`**: SVD solver to use. {‘arpack’, ‘randomized’}
    - This estimator supports two algorithms: a fast randomized SVD solver due to Halko (2009), 
    - Or, “naive” algorithm that uses ARPACK wrapper in SciPy (scipy.sparse.linalg.svds) as an eigensolver on XX' or X'X, whichever is more efficient.
- **`n_iter`**: int,default=5. Number of iterations
    - For randomized SVD solver. Not used by ARPACK. 
    - The default is larger than the default in randomized_svd to handle sparse matrices that may have large slowly decaying spectrum.
- **`random_state`**: default=None. Used during randomized svd. 
- **`tol`**: float, default=0, means machine precision. 
    - Tolerance for ARPACK. Ignored by randomized SVD solver.
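The projection that truncated SVD performs can be sketched on a tiny dense matrix. This is an illustration only: it uses NumPy's full dense SVD instead of the ARPACK or randomized solvers that `TruncatedSVD` relies on, and the matrix values are made up.

```python
import numpy as np

# Toy document-term matrix: 4 documents x 5 terms (made-up numbers).
X = np.array([
    [1.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.9, 0.1, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0, 0.8],
    [0.0, 0.1, 0.9, 1.0, 1.0],
])
k = 2  # number of latent dimensions to keep

# Full SVD: X = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep the k leading right singular vectors and project the documents onto them.
components = Vt[:k]            # shape (k, n_terms)
X_reduced = X @ components.T   # shape (n_docs, k), equals U[:, :k] * s[:k]

# A new document is projected with the same components, without refitting.
new_doc = np.array([[0.0, 0.0, 1.0, 0.0, 1.0]])
new_reduced = new_doc @ components.T

print(X_reduced.shape, new_reduced.shape)
```

Note that no centering happens anywhere in this sketch, which is the difference from PCA mentioned above and the reason the sparse input can stay sparse.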


In [24]:
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
print("Train TFIDF before SVD:", X_train_tfidf.shape)
print("\nPerforming dimensionality reduction on TFIDF matrix using SVD...")
t0 = time.time()

# use ARPACK
svd = TruncatedSVD(n_components=200, algorithm='arpack', tol=0)

# build an LSA pipeline that maps X_train_tfidf to the reduced matrix X_train_lsa
lsa = make_pipeline(svd, 
#     Normalizer(copy=False) # try enabling this line. Do you get a better result?
)

# Run SVD on the training data, then project the training data.
X_train_lsa = lsa.fit_transform(X_train_tfidf)

print("  done in %.3fsec\n" % (time.time() - t0))
print("Train TFIDF after SVD:",X_train_lsa.shape)

Train TFIDF before SVD: (9601, 10000)

Performing dimensionality reduction on TFIDF matrix using SVD...
  done in 8.142sec

Train TFIDF after SVD: (9601, 200)


#### How much variance do we explain with 200 components?

In [25]:
explained_variance = svd.explained_variance_ratio_.sum()
print("  Explained variance of the SVD step: {}%".format(int(explained_variance * 100)))

  Explained variance of the SVD step: 37%


Project the test sample:

In [26]:
X_test_tfidf = vectorizer.transform(test_sample)
pd.DataFrame(X_test_tfidf.toarray(), columns=vectorizer.get_feature_names())

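| | a300 | a320 | a330 | a340 | aa | aaa | aapl | ab | abandon | abandoned | ... | zim | zimbabwe | zinc | ziyang | zoete | zone | zones | zorinsky | zuckerman | zurich |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.611085 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |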


In [27]:
X_test_lsa = lsa.transform(X_test_tfidf)
print("Test TFIDF before SVD:",X_test_tfidf.shape)
print("Test TFIDF after SVD:",X_test_lsa.shape)

Test TFIDF before SVD: (4, 10000)
Test TFIDF after SVD: (4, 200)


### **Our target: find the top 10 docs in corpus_train that are most similar to our ``test_sample`` of 4 strings.**

### Recommender Implementation:

Function: **``recommend(vec, X_model, X_corpus, lsa_instance)``**
- projects any document vector `vec` onto a given ``X_model`` and returns ``doc_vec``, ``idx_top10``, ``sim_top10``, ``X_top10`` where
    * `X_model`: {X_train_tfidf, X_train_lsa}
    * `doc_vec`: the vector of similarity scores of ``vec`` and members of ``X_model``. This vector should be of size D × 1, where D = 9601 is the number of documents in the training set
    * ``idx_top10``: the indices of the top-10 similarity scores
    * ``sim_top10``: the top-10 similarity scores
    * ``X_top10``: the top-10 corpus articles most similar to the input

### TF-IDF:
* Model_X: ``X_train_tfidf``
* Test_vec: ``X_test_tfidf``
### LSA:

* Model_X: ``X_train_lsa``
* Test_vec: ``X_test_lsa``

In [28]:
print(np.shape(X_train_tfidf))
print(np.shape(X_train_lsa))
print(np.shape(X_test_tfidf))
print(np.shape(X_test_lsa))

(9601, 10000)
(9601, 200)
(4, 10000)
(4, 200)


### Similarity Scores:
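* Assume we use the **cosine similarity score on unit-normalized vectors**
$$ \text{score}(A,B) = \frac{A\cdot B}{||A||\cdot ||B||} = A\cdot B $$
  where the second equality holds only when $||A|| = ||B|| = 1$. The TF-IDF rows are unit-normalized by `norm='l2'`, whereas the LSA vectors are unit-normalized only if the `Normalizer` step is enabled in the pipeline.
* To score any doc vector ``vec`` against every row of a given ``X_model``:
    ```
    score = vec.dot(X_model.T)
    ```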
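The identity behind this shortcut is easy to verify numerically. The following is a minimal pure-Python sketch with made-up vectors; `l2_normalize`, `dot` and `cosine` are hypothetical helper names.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale v to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm else list(v)

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0, 0.0], [1.0, 2.0, 2.0]
cos_ab = cosine(a, b)
dot_unit = dot(l2_normalize(a), l2_normalize(b))
print(cos_ab, dot_unit, dot(a, b))
```

The first two printed values coincide, while the raw dot product of the unnormalized vectors does not, which is why scores computed on unnormalized LSA vectors are not strictly cosine similarities.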

In [29]:
def recommend(vec, X_model, X_corpus, lsa_instance):
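    '''
    Score every corpus article against the query documents, once with TF-IDF and once with LSA.
    Scores are summed over all rows of vec, so the ranking reflects the queries as a group.

    Args
    ----------
    vec: sparse TF-IDF matrix of the queries, e.g. trained_tfidf_vectorizer.transform(test_sample)
    X_model: pair [X_train_tfidf, X_train_lsa]
    X_corpus: list of raw documents, e.g. corpus_train
    lsa_instance: the LSA pipeline trained above
    
    Returns (each a pair: TF-IDF result first, LSA result second)
    ---------
    doc_vec: DataFrames of similarity scores between vec and each member of X_model (D rows)
    idx_top10: the indices of the top-10 similarity scores
    sim_top10: the top-10 similarity scores
    X_top10: the first 500 characters of the top-10 corpus articles most similar to the input

    '''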
    X_train_tfidf_array = X_model[0].toarray()
    X_train_lsa_array = X_model[1]
    
    X_test_tfidf_array = vec.toarray()
    X_test_lsa_array = lsa_instance.transform(vec)

    ########## tfidf
    # score
    score_tfidf = X_test_tfidf_array.dot(X_train_tfidf_array.T)
    score_tfidf_vec = score_tfidf.sum(0)
    df1 = pd.DataFrame(score_tfidf_vec, columns=['score tfidf'])
    # top10 index
    res1 = df1.sort_values(by='score tfidf',ascending=False).head(10)
    idx1 = res1.index.values
    # top10 values
    s1 = res1['score tfidf'].values
    # top10 articles
    art1 = []
    for i in idx1:
        art1.append(X_corpus[i][:500])
    
    ########## LSA
    # score
    score_lsa = X_test_lsa_array.dot(X_train_lsa_array.T)
    score_lsa_vec = score_lsa.sum(0)
    df2 = pd.DataFrame(score_lsa_vec, columns=['score lsa'])

    # top10 index
    res2 = df2.sort_values(by='score lsa',ascending=False).head(10)
    idx2 = res2.index.values

    # top10 values
    s2 = res2['score lsa'].values

    # top10 articles
    art2 = []
    for i in idx2:
        art2.append(X_corpus[i][:500])

    doc_vec = (df1, df2)
    idx_top10 = (idx1, idx2)
    sim_top10 = (s1, s2)
    X_top10 = (art1, art2)
    return doc_vec, idx_top10, sim_top10, X_top10
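Because `recommend()` sums the scores over all query rows (`score.sum(0)`), it ranks articles against the queries as a group. The sketch below contrasts that with ranking each query separately. It is a pure-Python illustration with made-up scores, and `top_k` is a hypothetical helper, not part of the notebook.

```python
import heapq

def top_k(scores, k=10):
    """Return (index, score) pairs of the k largest scores, best first (hypothetical helper)."""
    return heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])

# Made-up similarity scores: one row per query, one column per corpus article.
score_matrix = [
    [0.10, 0.80, 0.05, 0.30],   # query 0
    [0.60, 0.00, 0.70, 0.20],   # query 1
]

# Ranking each query on its own.
per_query = [top_k(row, k=2) for row in score_matrix]

# Ranking the column sums, which mirrors what score.sum(0) does in recommend().
summed = [sum(col) for col in zip(*score_matrix)]
combined = top_k(summed, k=2)

print(per_query)
print(combined)
```

With several unrelated queries, the combined ranking can be dominated by whichever query happens to produce the largest scores, so per-query rankings may be easier to interpret.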

### Run the ``recommend()`` function on the target ``test_sample`` vectors:

In [30]:
test_sample

['Jabberwocky',
 'buy MSFT sell AAPL hold Brent',
 'bullish stocks',
 'Some random forests produce deterministic losses']

In [31]:
vec = X_test_tfidf
X_model = [X_train_tfidf, X_train_lsa]
doc_vec, idx_top10, sim_top10, X_top10 = recommend(vec, X_model, corpus_train, lsa)

In [32]:
print('LSA: similarity scores between test_sample and each article in corpus')
pd.DataFrame(doc_vec[1]).head(10)

LSA: similarity scores between test_sample and each article in corpus


| | score lsa |
|---|---|
| 0 | 0.007706 |
| 1 | 0.017369 |
| 2 | -0.00514 |
| 3 | 0.044632 |
| 4 | 0.009103 |
| 5 | 0.003794 |
| 6 | -0.00419 |
| 7 | -0.004525 |
| 8 | 0.002533 |
| 9 | 0.007176 |


In [33]:
print('LSA: indices of the top 10 articles with highest scores')
pd.DataFrame({'article_idx':idx_top10[1], 'similarity_score': sim_top10[1]}).set_index('article_idx')

LSA: indices of the top 10 articles with highest scores


| article_idx | similarity_score |
|---|---|
| 1682 | 0.14192 |
| 3751 | 0.140121 |
| 1687 | 0.139713 |
| 3747 | 0.138814 |
| 9219 | 0.138448 |
| 1190 | 0.13833 |
| 8659 | 0.133026 |
| 6013 | 0.133026 |
| 1202 | 0.123061 |
| 6020 | 0.122553 |


### Make 10 recommendations using LSA:


In [34]:
print('LSA: Top 10 articles')
for t in X_top10[1]:
    print('\n'+'*'*5)
    print(t)

LSA: Top 10 articles

*****
EIA SAYS DISTILLATE STOCKS OFF 3.4 MLN BBLS, GASOLINE OFF 100,000, CRUDE UP 3.2 MLN




*****
EIA SAYS DISTILLATE, GAS STOCKS OFF IN WEEK

Distillate fuel stocks held in primary storage fell by 8.8 mln barrels in the week ended March six to 119.6 mln barrels, the Energy Information Administration (EIA) said. In its weekly petroleum status report, the Department of Energy agency said gasoline stocks were off 500,000 barrels in the week to 251.0 mln barrels and refinery crude oil stocks fell 1.2 mln barrels to 331.8 mln. The EIA said residual fuel stocks fell 1.5 mln barrels to 36.4 mln ba

*****
EIA SAYS DISTILLATE, GAS STOCKS OFF IN WEEK

Distillate fuel stocks held in primary storage fell by 3.4 mln barrels in the week ended Feb 27 to 128.4 mln barrels, the Energy Information Administration (EIA) said. In its weekly petroleum status report, the Department of Energy agency said gasoline stocks were off 100,000 barrels in the week to 251.5 mln barrels and ref

### Furthermore, let's compare these LSA results with the articles recommended by TF-IDF only:

In [35]:
print('TFIDF: Top 10 articles')
for t in X_top10[0]:
    print('\n'+'*'*5)
    print(t)

TFIDF: Top 10 articles

*****
ANALYST REITERATES BUY ON SOME DRUG STOCKS

Merrill Lynch and Co analyst Richard Vietor said he reiterated a buy recommendation on several drug stocks today. The stocks were Bristol-Myers Co BMY>, which rose 2-1/4 to 101, Schering-Plough Corp SGP> 2-7/8 to 97 and Syntex Corp SYN> 1-3/8 to 82. Vietor described these stocks as a "middle group" of performers. Vietor said the prices of these stocks, "look pretty cheap relative to the leading performers in the drug group, such as Upjohn Co UPJ>, Merc

*****
SUBROTO SEES OIL MARKET CONTINUING BULLISH

Indonesian Energy Minister Subroto said he sees the oil market continuing bullish, with underlying demand expected to rise later in the year. He told a press conference in Jakarta at the end of a two-day meeting of South-East Asian Energy Ministers that he saw prices stabilizing around 18 dlrs a barrel. "The sentiment in the market is bullish and I think it will continue that way as demand will go up in the third o

### Compared to TF-IDF, LSA improves the recommendations by giving more relevant articles!!!!

-------
## Recommender For Jokes

Data/Corpus:
- 200K English plaintext jokes: https://github.com/taivop/joke-dataset. 

Goal:
- Construct a recommender system that can find similar jokes. 
- Give examples of good and bad recommendations. 
- Provide a list of suggestions of how one could improve upon this recommender.

### Import data:

In [36]:
import requests
import json
import urllib

url1="https://raw.githubusercontent.com/taivop/joke-dataset/master/reddit_jokes.json"
url2='https://raw.githubusercontent.com/taivop/joke-dataset/master/stupidstuff.json'
url3='https://raw.githubusercontent.com/taivop/joke-dataset/master/wocka.json'

urls=[url1,url2,url3]
jokes = []
for idx, url in enumerate(urls):
    r=requests.get(url)
    t=json.loads(r.content)
    for i in range(len(t)):
        if idx == 0:
            jokes.append(t[i]['title']+'\n\n'+t[i]['body'])
        elif idx ==1:
            jokes.append(t[i]['category']+'\n\n'+t[i]['body'])
        else:
            jokes.append(t[i]['title']+'\n\n'+t[i]['body'] +'\n\n'+t[i]['category'])
print('Total number of jokes:',len(jokes))

Total number of jokes: 208345


Let's read some:

In [37]:
for i,j in enumerate(jokes[:3]):
    print('\n'+'*'*15+'\n',j,'\n')


***************
 I hate how you cant even say black paint anymore

Now I have to say "Leroy can you please paint the fence?" 


***************
 What's the difference between a Jew in Nazi Germany and pizza ?

Pizza doesn't scream when you put it in the oven .

I'm so sorry. 


***************
 I recently went to America....

...and being there really helped me learn about American culture. So I visited a shop and as I was leaving, the Shopkeeper said "Have a nice day!" But I didn't so I sued him. 



### Setup:
* ``N``: number of recommendations we want to make
* ``joke_universe``: our database for searching
* ``joke_target``: we want some joke recommendations that are similar to this one

In [38]:
N = 10
joke_universe = jokes[:-1]
joke_target = jokes[-1:]
print(f'We have {len(joke_universe)} jokes to recommend from.\nWe want to find {N} jokes that are the most similar to our \ntarget joke below:\n')
print('*'*15)
print(joke_target[0])

We have 208344 jokes to recommend from.
We want to find 10 jokes that are the most similar to our 
target joke below:

***************
... And We Wonder Why Everyone Hates Us

Customer: "Are you Hispanic?"

Me: "No."

Customer: "Middle Eastern?"

Me: "No."

Customer: "Egyptian?"

Me: "No."

Customer: "What are you?"

Me: "Chinese."

(customer puts on offended face)

Customer: "I don't appreciate you treating me like I'm dumb."

Me: "Excuse me? I'm being honest."

Customer: "NO CHINESE PERSON WOULD EVER HAVE EYES AS BIG AS YOURS!!!"

Me: *mouth wide open*

Insults


### Recommender System:
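- Score the target joke against the joke universe with plain TF-IDF as a baseline
- Perform LSA (TruncatedSVD on TF-IDF) on the joke universe
- Project the test set (the target joke) into the same LSA space and score it against every joke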

In [39]:
joke_tfidf_vectorizer = TfidfVectorizer(
    max_df=0.5,
    max_features=3000,
    min_df=2,
    stop_words='english',
    norm='l2',
    use_idf=True, 
    analyzer='word',
    token_pattern = '(?u)\\b[a-zA-Z]\\w+\\b'
    )

joke_X_train_tfidf = joke_tfidf_vectorizer.fit_transform(joke_universe)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() on newer versions
joke_features_tfidf = joke_tfidf_vectorizer.get_feature_names()
joke_X_test_tfidf = joke_tfidf_vectorizer.transform(joke_target)

# Rows are L2-normalized (norm='l2'), so each dot product is a cosine similarity.
# Multiplying the sparse matrices directly avoids building a dense
# (n_jokes x 3000) array; only the (1 x n_jokes) score row is made dense.
joke_score_tfidf = joke_X_test_tfidf.dot(joke_X_train_tfidf.T).toarray()
joke_score_tfidf_vec = joke_score_tfidf.sum(0)  # shape (1, n_jokes) -> (n_jokes,)

joke_df1 = pd.DataFrame(joke_score_tfidf_vec, columns=['score tfidf'])
joke_res1 = joke_df1.sort_values(by='score tfidf', ascending=False).head(N)  # top N
joke_idx1 = joke_res1.index.values  # indices of the top N jokes
joke_s1 = joke_res1['score tfidf'].values  # scores of the top N jokes
joke_art1 = [joke_universe[i] for i in joke_idx1]  # text of the top N jokes

Visualize TF-IDF recommendations:

In [40]:
for j in joke_art1:
    print('\n'+'*'*15+'\n',j,'\n')


***************
 Paging Leonidas To The Front Desk

Customer: "Look! My friend told me I could get this type of hammer at your store! Now go get it for me!"

Cashier: "Sir, I already told you... we don't have ANY hammers back here that aren't already stocked on the shelves."

Customer: "LOOK HERE. F**K YOU! I KNOW YOU'RE TRYING TO SAVE MONEY BY SWITCHING OUT YOUR STOCKS! GET ME THIS HAMMER!"

(At this point, I come to the front of the store, overhearing what's going on; note that I'm the manager.)

Me: "Is there a problem?"

Customer: "Yes sir! Your employee here is not doing what I tell her to!"

Me: "Well, you need to calm down and understand that we don't have what you're looking for. So maybe you should go back to shelves and check-"

Customer: "F**K THAT!!! IT'S NOT THERE, OKAY?! YOU NEED TO F**KING GET ME WHAT I ASK FOR!"

Me: "That's it. Get out of my store."

Customer: "What? NO!"

Me: "Sir, get out, or I have to take you out."

Customer: "Then do it!"

(I go around the coun

#### Truncated SVD/LSA:

In [41]:
joke_svd = TruncatedSVD(n_components=200, random_state=42, algorithm='arpack')
# To compare cosine similarities instead of raw dot products, try
# make_pipeline(joke_svd, Normalizer(copy=False)). Do you get a better result?
joke_lsa = make_pipeline(joke_svd)

joke_X_train_lsa = joke_lsa.fit_transform(joke_X_train_tfidf)
joke_X_test_lsa = joke_lsa.transform(joke_X_test_tfidf)

# Without the Normalizer this is an unnormalized inner product in LSA space
joke_score_lsa = joke_X_test_lsa.dot(joke_X_train_lsa.T)
joke_score_lsa_vec = joke_score_lsa.sum(0)  # shape (1, n_jokes) -> (n_jokes,)

joke_df2 = pd.DataFrame(joke_score_lsa_vec, columns=['score lsa'])
joke_res2 = joke_df2.sort_values(by='score lsa', ascending=False).head(N)  # top N
joke_idx2 = joke_res2.index.values  # indices of the top N jokes
joke_s2 = joke_res2['score lsa'].values  # scores of the top N jokes
joke_art2 = [joke_universe[i] for i in joke_idx2]  # text of the top N jokes

Visualize LSA recommendations:

In [42]:
for j in joke_art2:
    print('\n'+'*'*15+'\n',j,'\n')


***************
 what do you call a Chinese person with down syndrome?

Som ting wong 


***************
 How do you blindfold a chinese person?

With dental floss 


***************
 What do you call a Chinese Millionaire?

Cha Ching 


***************
 What do you call a foreigner who is obsessed with Chinese culture?

A zhuologist 


***************
 What do you call a Chinese Podiatrist?

Hee Lan To 


***************
 What part of your punctuality emancipates the Chinese?

Your Ti"ming"! 


***************
 What do you call a Chinese millionaire?

Cha Ching 


***************
 What do Chinese lumberjacks do?

Chopsticks 


***************
 what do you call a chinese millionaire?

Cha-Ching 


***************
 What do you call a chinese millionaire?

Cha-Ching! 



### Conclusion:
* As it happens, my target joke touches on a sensitive topic; I guess it comes from the ``Insults`` category. Sorry about that.
* But suppose I want to find the most impolite jokes so that I can delete them right away: which model gives me better deletion recommendations?
* Judging from the output printed above, the ``SVD/LSA`` model does better: all ten of its recommendations are on the same topic as the target joke, while the top ``TF-IDF`` result shares little with it beyond the customer-dialogue format.
----
* My **target joke** is:

    ... And We Wonder Why Everyone Hates Us
    
    Customer: "Are you Hispanic?"
    
    Me: "No."
    
    Customer: "Middle Eastern?"
    
    Me: "No."
    
    Customer: "Egyptian?"
    
    Me: "No."
    
    Customer: "What are you?"
    
    Me: "Chinese."
    
    (customer puts on offended face)
    
    Customer: "I don't appreciate you treating me like I'm dumb."
    
    Me: "Excuse me? I'm being honest."
    
    Customer: "NO CHINESE PERSON WOULD EVER HAVE EYES AS BIG AS YOURS!!!"
    
    Me: *mouth wide open*
    
    Insults


* A **bad recommendation** from ``TF-IDF``:

    -------This is how a joke index  173384  looks like:-------

     A customer asked me for a good reliable printer...


* A **good recommendation** from ``SVD/LSA``:

    -------This is how a joke index  117910  looks like:-------

     What do Chinese lumberjacks do?

     Chopsticks

### Suggestions for improving this recommender:
* As the data-import cell shows, there are three data sources and hence three different JSON dictionary structures. When building this recommender, I already put 'body', 'title', and 'category' into ``joke_universe`` wherever they exist. There are still other fields to consider, such as 'id' and 'rating', and one option is to bring these into the joke universe as well.
* Another option is to run our own classification first. We have a 'category' field from stupidstuff.json and wocka.json, but not from reddit_jokes.json. We could therefore import all fields and re-classify every joke into our own, more detailed categories. Taking this new category into account when training the model may improve the recommender's performance.
* The LSA results also contain duplicated jokes, so the dataset needs cleaning first, i.e. repeated jokes should be deleted.
* A simple but potentially important step is to try different parameters for the initial model objects: ``TfidfVectorizer`` and ``TruncatedSVD``.
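
The de-duplication idea in the list above can be sketched as follows. This is only a minimal illustration, not part of the notebook's pipeline: the helper names `normalize_joke` and `drop_near_duplicates` and the sample strings are hypothetical, and the rule (ignore case, spacing, and punctuation) is one assumption about what counts as "the same joke".

```python
import re

def normalize_joke(text):
    """Lowercase and drop everything except letters and digits, so that
    variants differing only in case, spacing, or punctuation share a key."""
    return re.sub(r'[^a-z0-9]+', '', text.lower())

def drop_near_duplicates(jokes):
    """Keep the first occurrence of each normalized joke, preserving order."""
    seen = set()
    unique = []
    for joke in jokes:
        key = normalize_joke(joke)
        if key not in seen:
            seen.add(key)
            unique.append(joke)
    return unique

# Hypothetical sample: three spellings of one joke plus one different joke
sample = [
    "Why did the chicken cross the road?\n\nTo get to the other side",
    "why did the chicken cross the road?\n\nTo get to the other side!",
    "Why did the chicken cross the road?\n\nTo get to the other-side",
    "What do lumberjacks eat with?\n\nChopsticks",
]
unique = drop_near_duplicates(sample)
print(len(unique))  # 2
```

Applied before fitting the vectorizer, a step like this would stop one joke from filling several of the N recommendation slots. Rewordings of the same joke would still need a similarity threshold.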

-------------------------------------------------
## Appendix: More about NLP

NLP transforms human language into a machine-usable form.

### Processing Techniques & Terms

- Tokenization: splits text into individual words (tokens)
- Lemmatization: reduces a word to its base form using its dictionary definition (am, is, are -> be)
- Stemming: reduces a word to its base form by stripping affixes, without using context (ended -> end)
- Stop words: common words that carry little meaning (the, is), which are usually removed
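
The four steps above can be sketched in plain Python. Everything here is a toy assumption made for illustration: the stop-word set, the lemma lookup table, and the suffix rules are hypothetical and far smaller than what a real NLP library ships with.

```python
import re

STOP_WORDS = {"the", "a", "an", "and", "of", "to"}   # tiny illustrative list
LEMMAS = {"am": "be", "is": "be", "are": "be"}       # toy dictionary lookup
SUFFIXES = ("ing", "ed", "s")                        # hypothetical stemming rules

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def lemmatize(token):
    """Look the token up in the dictionary; unknown tokens are returned as-is."""
    return LEMMAS.get(token, token)

def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

tokens = tokenize("The meeting ended and the jokes started")
content = remove_stop_words(tokens)
stems = [naive_stem(t) for t in content]
print(content)           # ['meeting', 'ended', 'jokes', 'started']
print(stems)             # ['meet', 'end', 'joke', 'start']
print(lemmatize("are"))  # be
```

The contrast between `lemmatize` and `naive_stem` is the point: the first needs a dictionary, while the second only applies string rules and so can produce non-words on other inputs.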

`n-gram`: a contiguous sequence of n terms. An n-gram model predicts the next term from the preceding n-1 terms, based on Markov chains (stochastic, memoryless processes that predict future events from the current state only)
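
As a rough sketch of the Markov idea, the simplest case is a bigram model (n = 2), which predicts the next word from the current word alone. The function names and the toy corpus below are hypothetical.

```python
from collections import Counter, defaultdict

def train_bigram_model(tokens):
    """Count how often each word follows each other word."""
    follow_counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        follow_counts[prev][nxt] += 1
    return follow_counts

def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if it was never seen as a predecessor."""
    if word not in model:
        return None
    return model[word].most_common(1)[0][0]

corpus = "the cat sat on the mat and the cat slept".split()
model = train_bigram_model(corpus)
print(predict_next(model, "the"))    # cat
print(predict_next(model, "slept"))  # None
```

The prediction depends only on the current word, which is the "memoryless" property mentioned above.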

`bag-of-words`: represents text using word frequencies without context or order

`tf-idf`: term-frequency inverse-document-frequency. Measures how important a word is to a document within a collection of documents (corpus) by multiplying the term frequency (occurrences of a term in a document) by the inverse document frequency, which penalizes terms that are common across the corpus.
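
A minimal sketch of this weighting, assuming the plain textbook form tf * log(N / df). The function name `tf_idf` and the three toy documents are hypothetical. scikit-learn's ``TfidfVectorizer`` uses a smoothed idf and, with ``norm='l2'``, also normalizes each row, so its numbers will differ from these.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document,
    using the unsmoothed weighting tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts, i.e. the bag-of-words for this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
w = tf_idf(docs)
print(w[0]["the"])                # 0.0, because "the" appears in every document
print(w[0]["mat"] > w[0]["cat"])  # True, because "mat" is rarer than "cat"
```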

`Cosine Similarity`: measures the similarity between two vectors as the cosine of the angle between them, calculated as $$\cos(\theta) = \frac{A\cdot B}{\lVert A\rVert\,\lVert B\rVert}$$
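
The formula can be checked with a short pure-Python sketch (the helper names and the example vectors are hypothetical). It also illustrates why a plain dot product is enough once vectors are L2-normalized, as the ``TfidfVectorizer`` in this notebook does with ``norm='l2'``.

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (||a|| * ||b||); returns 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def l2_normalize(v):
    """Scale v to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

a, b = [3.0, 4.0], [4.0, 3.0]
cos_ab = cosine_similarity(a, b)
cos_scaled = cosine_similarity(a, [10 * y for y in b])  # vector length does not matter
dot_unit = sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))
print(round(cos_ab, 2), round(cos_scaled, 2), round(dot_unit, 2))  # 0.96 0.96 0.96
```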

### Applications & Models

#### Word Embedding
Maps words and phrases to numerical vectors

- **word2vec**: 
    - trains iteratively over local word context windows, places similar words close together, and embeds sub-relationships directly into vectors. 
    - E.g. *king - man + woman ≈ queen*
    - Two approaches:
        - Continuous bag-of-words (CBOW), predicting the word given its context
        - skip-gram, predicting the context given a word

- **GloVe**:
    - combines both global and local word co-occurrence data to learn word similarity

- **BERT**:
    - accounts for word order and trains on subwords; unlike word2vec and GloVe, BERT outputs different vectors for different uses of the same word
    - E.g. *cell* in *cell phone* vs. *blood cell*
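
To make the two word2vec approaches listed above more concrete, here is a sketch of how training examples could be cut out of a sentence with a context window. It only builds the examples and trains no model. The function names and the sample sentence are hypothetical.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: (center, context) pairs, i.e. predict the context given a word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """CBOW: (context_words, target) examples, i.e. predict the word given its context."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window): i] + tokens[i + 1: i + window + 1]
        examples.append((context, target))
    return examples

sentence = "the quick brown fox jumps".split()
pairs = skipgram_pairs(sentence, window=1)
examples = cbow_examples(sentence, window=1)
print(pairs[:3])    # [('the', 'quick'), ('quick', 'the'), ('quick', 'brown')]
print(examples[2])  # (['quick', 'fox'], 'brown')
```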

#### Sentiment Analysis
Extracts the attitudes and emotions from text.

- **Polarity**: measures positive, negative, or neutral opinions
    - Valence shifters: capture amplifiers or negators such as "*really* fun" or "*hardly* fun"

- **Sentiment**: measures emotional states such as happy or sad
- **Subject-Object Identification**: classifies sentences as either subjective (S) or objective (O)
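
The effect of valence shifters on polarity can be sketched with a toy lexicon scorer. All word lists and numeric weights below are made-up values for illustration, not taken from any real sentiment lexicon, and the rule that a shifter affects only the next word is a simplifying assumption.

```python
# Hypothetical lexicons: every weight here is invented for the example
POLARITY = {"fun": 1.0, "great": 2.0, "boring": -1.0, "awful": -2.0}
AMPLIFIERS = {"really": 1.5, "very": 1.5}
NEGATORS = {"not": -1.0, "hardly": -0.5}

def polarity_score(text):
    """Sum word polarities, letting each shifter scale the word right after it."""
    score = 0.0
    multiplier = 1.0
    for token in text.lower().split():
        if token in AMPLIFIERS:
            multiplier *= AMPLIFIERS[token]
        elif token in NEGATORS:
            multiplier *= NEGATORS[token]
        else:
            score += multiplier * POLARITY.get(token, 0.0)
            multiplier = 1.0  # shifters only affect the next word
    return score

print(polarity_score("fun"))         # 1.0
print(polarity_score("really fun"))  # 1.5
print(polarity_score("hardly fun"))  # -0.5
```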


#### Topic Modeling
Captures the underlying themes that appear in documents.

- **Latent Dirichlet Allocation (LDA)**: generates k topics by first assigning each word to a random topic, then iteratively updating assignments based on parameters $\alpha$, the mix of topics per document, and $\beta$, the distribution of words per topic.

- **Latent Semantic Analysis (LSA)**: identifies patterns using TF-IDF scores and reduces data to k dimensions through SVD. In other words, given a corpus of articles, we build a term-document matrix and apply SVD to it.