In [None]:
!rm -rf dso-560-nlp-text-analytics && git clone https://github.com/ychennay/dso-560-nlp-text-analytics

Cloning into 'dso-560-nlp-text-analytics'...
remote: Enumerating objects: 2891, done.
remote: Counting objects: 100% (2891/2891), done.
remote: Compressing objects: 100% (2725/2725), done.
remote: Total 2891 (delta 259), reused 2765 (delta 162), pack-reused 0
Receiving objects: 100% (2891/2891), 70.70 MiB | 26.05 MiB/s, done.
Resolving deltas: 100% (259/259), done.


In [None]:
%cd dso-560-nlp-text-analytics

/content/dso-560-nlp-text-analytics/dso-560-nlp-text-analytics


In [None]:
len(open("datasets/good_amazon_toy_reviews.txt").readlines()) + len(
    open("datasets/poor_amazon_toy_reviews.txt").readlines())

114917

# Problem Statement

We want to understand what types of topics (subject matter) make up the content of our documents. We can use **topic modelling**: statistical methods for discovering the abstract/latent “topics” in a particular corpus.

## Load Data

### BBC News Dataset

We will load a BBC News dataset with documents divided between politics, entertainment, business, sport, and tech.

### BBC Sport Dataset
We will also load the BBC Sport news dataset. It is divided among 5 distinct sports topics: football, athletics, cricket, rugby, and tennis.

Both datasets are courtesy of *D. Greene and P. Cunningham. "Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering", Proc. ICML 2006.* ([Link](http://mlg.ucd.ie/datasets/bbc.html))


In [None]:
import pandas as pd
from typing import List, Tuple

#### Define Function to Load in BBC Datasets

In [None]:
def load_bbc_corpus(directory: str, topics: List[str], num_docs: int) -> pd.DataFrame:
  # build (path, topic) pairs; files are named 001.txt, 002.txt, ... inside each topic folder
  articles: List[Tuple[str, str]] = [(f"datasets/{directory}/{topic}/{str(i).zfill(3)}.txt", topic) for i in range(1, num_docs + 1) for topic in topics]

  data = []
  for path, topic in articles:
      with open(path, encoding="latin1") as article_file: # open each article
        content = article_file.read()
        data.append({"topic": topic, "text": content})

  # generate a dataframe
  df = pd.DataFrame(data)
  df["text"] = df["text"].apply(lambda text: text.replace("\n", " "))  # flatten line breaks
  return df

#### Load BBC Sport Corpus

In [None]:
TOPICS = ["football", "athletics", "cricket", "rugby", "tennis"]
sports_corpus_df = load_bbc_corpus("bbcsport", TOPICS, num_docs=100)
sports_corpus_df.head(5)

| | topic | text |
|---|---|---|
| 0 | football | Man Utd stroll to Cup win Wayne Rooney made a... |
| 1 | athletics | Claxton hunting first major medal British hur... |
| 2 | cricket | Hayden sets up Australia win Second one-day i... |
| 3 | rugby | Hodgson shoulders England blame Fly-half Char... |
| 4 | tennis | Henman overcomes rival Rusedski Tim Henman sa... |


#### Load in BBC News Corpus

In [None]:
documents = []
TOPICS = ["business", "sport", "entertainment", "tech", "politics"]
news_corpus_df = load_bbc_corpus("bbc", TOPICS, num_docs=350)
news_corpus_df.head()

| | topic | text |
|---|---|---|
| 0 | business | Ad sales boost Time Warner profit Quarterly p... |
| 1 | sport | Claxton hunting first major medal British hur... |
| 2 | entertainment | Gallery unveils interactive tree A Christmas ... |
| 3 | tech | Ink helps drive democracy in Asia The Kyrgyz ... |
| 4 | politics | Labour plans maternity pay rise Maternity pay... |


### Technique 1: Non-Negative Matrix Factorization

We can think of our corpus as a two-dimensional table `X`: rows are the documents, and columns are the features (i.e. in a count-based vectorizer, each column is a unique token).

In Non-Negative Matrix Factorization (NMF), we try to find two matrices `W` and `H` that contain only non-negative values and, when multiplied together, approximately reconstruct `X`.

We need to select a variable `K`, which is the number of components/topics we wish to use.

If we want to model topics for an $N \times M$ matrix `X`, where each value is non-negative, then NMF will produce an $N \times K$ matrix `W` and a $K \times M$ matrix `H`, so that $X \approx W \times H$.
![NMF](https://raw.githubusercontent.com/ychennay/dso-560-nlp-text-analytics/main/images/nmf.png)
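
To make the shapes concrete, here is a minimal sketch on a tiny made-up matrix. The toy data, the `_toy` variable names, and the choice of `K = 2` are assumptions for illustration only; they are not part of the BBC datasets.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy non-negative "document-term" matrix: N = 6 documents, M = 4 terms.
X_toy = np.array([
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [4, 1, 0, 1],
    [0, 0, 2, 3],
    [0, 1, 3, 2],
    [0, 0, 4, 4],
], dtype=float)

K = 2  # number of topics (an arbitrary choice for this toy example)
toy_nmf = NMF(n_components=K, init="random", random_state=0, max_iter=1000)
W_toy = toy_nmf.fit_transform(X_toy)  # N x K: topic weights per document
H_toy = toy_nmf.components_           # K x M: term weights per topic

print(W_toy.shape, H_toy.shape)
print(np.round(W_toy @ H_toy, 1))     # approximate reconstruction of X_toy
```

Both factors contain only non-negative values, and their product should land close to `X_toy` without matching it exactly.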

#### Step 1: Vectorize The Corpus

In [None]:
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(ngram_range=(2,2),
                             min_df=0.01, max_df=0.4, stop_words="english")

news_vectorizer = TfidfVectorizer(ngram_range=(3,3), min_df=3,
                            max_df=0.4, stop_words="english")

X_sport, sports_terms = vectorizer.fit_transform(sports_corpus_df.text), vectorizer.get_feature_names_out()
X_news, news_terms = news_vectorizer.fit_transform(news_corpus_df.text), news_vectorizer.get_feature_names_out()
sport_tf_idf = pd.DataFrame(X_sport.toarray(), columns=sports_terms)
news_tf_idf = pd.DataFrame(X_news.toarray(), columns=news_terms)
print(f"News TF-IDF: {news_tf_idf.shape}")
print(news_tf_idf.head(5))
print(f"Sports TF-IDF: {sport_tf_idf.shape}")
sport_tf_idf.head(5)

News TF-IDF: (1750, 3354)
   000 133 000  000 300 000  ...  yukos filed chapter  yulia tymoshenko said
0          0.0          0.0  ...                  0.0                    0.0
1          0.0          0.0  ...                  0.0                    0.0
2          0.0          0.0  ...                  0.0                    0.0
3          0.0          0.0  ...                  0.0                    0.0
4          0.0          0.0  ...                  0.0                    0.0

[5 rows x 3354 columns]
Sports TF-IDF: (500, 866)


Unnamed: 0,10 000m,10 days,10 minutes,100m champion,100m silver,12 august,12 march,12 months,12th man,13 march,17 december,17 january,17 year,18 february,18 year,19 13,2002 2003,2003 world,2005 season,200m champion,21 year,22 year,23 year,24 year,25 year,26 year,27 year,28 year,29 year,30 year,31 year,32 year,34 year,45 minutes,4x100m relay,50 overs,60m hurdles,800m 1500m,ab villiers,abdul razzaq,...,won gold,won grand,won race,won toss,worked hard,world anti,world champion,world champions,world championships,world class,world cricket,world cross,world cup,world indoor,world junior,world number,world rankings,world record,world said,world year,wrist injury,yann delaigue,yannick nyanga,yasir hameed,year ago,year ban,year deal,year old,year said,year tournament,years ago,years time,yelena isinbayeva,york marathon,young players,younis khan,yousuf youhana,yuvraj singh,zaheer khan,zurich premiership
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.094351,0.0,0.0,0.0,0.0,0.0,0.0,0.232726,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.434564,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.419498,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.244613,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.198341,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.183094,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.186653,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.208396,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Step 2: Fit NMF Model

In [None]:
nmf = NMF(n_components=5)
W_sport = nmf.fit_transform(X_sport)
H_sport = nmf.components_
print(f"Original shape of X sports is {X_sport.shape}")
print(f"Decomposed W sports matrix is {W_sport.shape}")
print(f"Decomposed H sports matrix is {H_sport.shape}")

Original shape of X sports is (500, 866)
Decomposed W sports matrix is (500, 5)
Decomposed H sports matrix is (5, 866)




In [None]:
W_news = nmf.fit_transform(X_news)
H_news = nmf.components_
print(f"Original shape of X news is {X_news.shape}")
print(f"Decomposed W news matrix is {W_news.shape}")
print(f"Decomposed H news matrix is {H_news.shape}")



Original shape of X news is (1750, 3354)
Decomposed W news matrix is (1750, 5)
Decomposed H news matrix is (5, 3354)




In [None]:
W_sport

array([[0.        , 0.        , 0.00342976, 0.        , 0.28288182],
       [0.        , 0.        , 0.00940626, 0.21608535, 0.01120029],
       [0.28590209, 0.        , 0.        , 0.        , 0.        ],
       ...,
       [0.06027493, 0.03948009, 0.00446075, 0.02547794, 0.01791412],
       [0.00203575, 0.        , 0.        , 0.        , 0.17144832],
       [0.        , 0.        , 0.10406689, 0.        , 0.        ]])

In [None]:
H_sport

array([[0.        , 0.        , 0.01354525, ..., 0.03774792, 0.07550214,
        0.00515541],
       [0.        , 0.00230093, 0.00776469, ..., 0.00707621, 0.00834391,
        0.00135614],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.15286071, 0.01231997, 0.        , ..., 0.00046535, 0.        ,
        0.00049971],
       [0.        , 0.05316587, 0.04089902, ..., 0.0019962 , 0.        ,
        0.03927989]])

#### Step 3: Report Results For Each Topic

In [None]:
from typing import List
import numpy as np

def get_top_tf_idf_tokens_for_topic(H: np.ndarray, feature_names: List[str], num_top_tokens: int = 5):
  """
  Uses the H matrix (K components x M original features) to identify for each
  topic the tokens with the highest weights.
  """
  for topic, vector in enumerate(H):
    print(f"TOPIC {topic}\n")
    total = vector.sum()
    top_scores = vector.argsort()[::-1][:num_top_tokens]  # indices of the largest weights
    token_names = [feature_names[idx] for idx in top_scores]
    strengths = [vector[idx] / total for idx in top_scores]  # share of the topic's total weight

    for strength, token_name in zip(strengths, token_names):
      print(f"{token_name} ({round(strength * 100, 1)}%)\n")
    print("=" * 50)

get_top_tf_idf_tokens_for_topic(H_sport, sport_tf_idf.columns.tolist(), 5)
print(f"BBC News Topics:\n\n")
get_top_tf_idf_tokens_for_topic(H_news, news_tf_idf.columns.tolist(), 10)

TOPIC 0

new zealand (7.4%)

ricky ponting (1.6%)

sri lanka (1.4%)

stephen fleming (1.4%)

brett lee (1.4%)

TOPIC 1

south africa (7.9%)

michael vaughan (2.4%)

graeme smith (2.2%)

nicky boje (2.0%)

ab villiers (2.0%)

TOPIC 2

australian open (4.0%)

world number (2.7%)

davis cup (2.6%)

grand slam (2.4%)

french open (1.9%)

TOPIC 3

cross country (2.8%)

year old (2.1%)

lewis francis (1.9%)

european indoor (1.8%)

world record (1.6%)

TOPIC 4

champions league (3.1%)

world cup (3.1%)

manchester united (2.4%)

told bbc (1.7%)

fa cup (1.6%)

BBC News Topics:


TOPIC 0

told bbc news (14.4%)

bbc news website (12.8%)

bbc news online (1.6%)

spokesman told bbc (1.1%)

personal digital assistants (0.8%)

new york state (0.7%)

university new york (0.7%)

senior vice president (0.6%)

vice president technology (0.6%)

digital music players (0.6%)

TOPIC 1

told bbc sport (22.2%)

coach andy robinson (3.1%)

referee jonathan kaplan (2.6%)


#### Get the Top Documents For Each Topic

We can also use the `W` matrix to grab the top documents per topic (i.e. the documents in which each topic accounts for the greatest share of the total topic weight).

In [None]:
import numpy as np
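def get_top_documents_for_each_topic(W: np.ndarray, documents: List[str], num_docs: int = 5):
  sorted_docs = W.argsort(axis=0)[::-1]  # document indices per topic, highest weight first
  top_docs = sorted_docs[:num_docs].T  # shape (K, num_docs)
  per_document_totals = W.sum(axis=1)  # total topic weight of each document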
  for topic, top_documents_for_topic in enumerate(top_docs):
    print(f"Topic {topic}")
    for doc in top_documents_for_topic:
      score = W[doc][topic]
      percent_about_topic = round(score / per_document_totals[doc] * 100, 1)
      print(f"{percent_about_topic}%", documents[doc])
    print("=" * 50)

In [None]:
get_top_documents_for_each_topic(W_sport, sports_corpus_df.text.tolist())

In [None]:
get_top_documents_for_each_topic(W_news, news_corpus_df.text.tolist(), num_docs=10)

Topic 0
100.0% Online games play with politics  After bubbling under for some time, online games broke through onto the political arena in 2004.  The US presidential election provided a showcase for many, aimed at talking directly to a generation that has grown up with joysticks and gamepads. Experts say this reflects how video games are becoming a mainstream part of culture and society. The first official political campaign game was technically launched during the last week of 2003: the Iowa Game, commissioned by the Democrat hopeful Howard Dean. More than 20 followed suit, including Frontrunner, eLections, President Forever and The Political Machine, which allowed players to run an entire presidential campaign, including having to cope with the media. Others helped raise the stakes during the Bush/Kerry contest by highlighting a candidate's virtues or his vices.  The phenomenon has astonished the forefathers of political games, a handful of multi-discipline games enthusiasts keen to 

### Approach 2: LSA (Latent Semantic Analysis)

We can also leverage a dimensionality reduction technique that we've encountered before, **Singular Value Decomposition (SVD)**, to perform topic modelling.

The following diagram and code snippet are from *Blueprints for Text Analytics Using Python*, Albrecht et al.

Remember that for SVD, we can take our original matrix and decompose it into three matrices.

$$
X = U \times \Sigma \times V^{\star}
$$
We can use these three decomposed matrices for different purposes:
![SVD Topic Modelling](https://raw.githubusercontent.com/ychennay/dso-560-nlp-text-analytics/main/images/svd_topic_modeling.png) 

The $U$ matrix will provide a signal for what the topic composition of our documents is.

The diagonal elements of the $\Sigma$ matrix can be used to estimate the "strength" of each topic. 

The $V^{\star}$ matrix can be used to find the top associated words with each topic.
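
As a small illustration of the three factors, the sketch below decomposes a made-up matrix with NumPy and keeps only the first `K` components. The toy data and the `_toy` names are assumptions for illustration; the notebook itself uses `TruncatedSVD` on the real corpus.

```python
import numpy as np

# Toy document-term matrix: 6 documents, 4 terms (made-up values).
X_toy = np.array([
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [4, 1, 0, 1],
    [0, 0, 2, 3],
    [0, 1, 3, 2],
    [0, 0, 4, 4],
], dtype=float)

# Thin SVD: U_toy is 6 x 4, sigma holds the 4 singular values, V_star_toy is 4 x 4.
U_toy, sigma, V_star_toy = np.linalg.svd(X_toy, full_matrices=False)

# Using all components reconstructs the original matrix (up to rounding).
print(np.allclose(U_toy @ np.diag(sigma) @ V_star_toy, X_toy))

# Keeping only the K strongest components gives a low-rank approximation.
K = 2
X_approx = U_toy[:, :K] @ np.diag(sigma[:K]) @ V_star_toy[:K, :]
print(np.round(sigma, 2))     # "strength" of each component, largest first
print(np.round(X_approx, 1))
```

Note that, unlike the NMF factors, `U_toy` and `V_star_toy` may contain negative entries.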

In [None]:
from sklearn.decomposition import TruncatedSVD

# we need to select a K (the number of topics)
K = 5

svd = TruncatedSVD(n_components=K)
U = svd.fit_transform(X_news)
V_star = svd.components_

In [None]:
print(f"U shape is {U.shape}")
get_top_documents_for_each_topic(U, news_corpus_df.text.tolist())

U shape is (1750, 5)
Topic 0
109.3% Putting a face to 'Big Brother'  Literally putting a face on technology could be one of the keys to improving our interaction with hi-tech gadgets.  Imagine a surveillance system that also presents a virtual embodiment of a person on a screen who can react to your behaviour, and perhaps even alert you to new e-mails. Basic versions of these so-called avatars already exist. Together with speech and voice recognition systems, they could replace the keyboard and mouse in the near future. Some of these ideas have been showcased at the London's Science Museum, as part of its Future Face exhibition.  One such avatar is Jeremiah. It is a virtual man, which you can download for free and install in your computer.  His creator, Richard Bowden, lecturer at the Centre for Vision, Speech and Signal Processing at the University of Surrey, refers to Jeremiah as "him", rather than it. "Jeremiah is a virtual face that attempts to emulate humans in the way it responds
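
A value above 100% is possible here because, unlike the `W` matrix from NMF, the `U` matrix from SVD can contain negative entries, so dividing one entry by the row sum no longer yields a share between 0% and 100%. One possible workaround, sketched below, is to normalize by absolute values. The helper `topic_shares` and the toy array are hypothetical and not taken from the cited book.

```python
import numpy as np

def topic_shares(U, use_abs=True):
    """Divide each row by its (absolute) total so entries read as per-topic shares."""
    weights = np.abs(U) if use_abs else U
    return weights / weights.sum(axis=1, keepdims=True)

# Toy U matrix: the first document has a negative weight on the second topic.
U_toy = np.array([[1.2, -0.1],
                  [0.5,  0.5]])

raw_shares = topic_shares(U_toy, use_abs=False)  # first row exceeds 100% for topic 0
abs_shares = topic_shares(U_toy, use_abs=True)   # every row sums to 100%

print(np.round(raw_shares * 100, 1))
print(np.round(abs_shares * 100, 1))
```

Taking absolute values discards the sign, so this is a display convenience only, not a probability.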

### Approach 3: Latent Dirichlet Allocation

The final approach uses a probabilistic sampling method that views each document as consisting of a mixture of different topics, which are themselves mixtures of different words.

Each of the mixtures (documents as mixtures of topics, topics as mixtures of words) is modelled using a [Dirichlet distribution](https://en.wikipedia.org/wiki/Dirichlet_distribution).

The algorithm creates initial Dirichlet distributions from each topic and word and tries to recreate the original words used for a document using sampling.

It first attempts to construct the representative words for a topic ([A Beginner's Guide to Latent Dirichlet Allocation, Ria Kulshrestha](https://towardsdatascience.com/latent-dirichlet-allocation-lda-9d1cd064ffa2)):
![https://miro.medium.com/max/1222/1*NjeMT281GMduRYvPIS8IjQ.png](https://miro.medium.com/max/1222/1*NjeMT281GMduRYvPIS8IjQ.png)

The algorithm looks like this:

![LDA](https://miro.medium.com/max/494/1*VTHd8nB_PBsDtd2hd87ybg.png) 

(from [LDA Topic Modeling: An Explanation](https://towardsdatascience.com/lda-topic-modeling-an-explanation-e184c90aadcd))

* $\alpha$ is the per-document topic distributions
* $\phi$ is the word distribution for a given topic
* $\beta$ is the per-topic word distributions
* $\theta$ is the topic distribution for the m-th document
* $Z$ is the topic assigned to the n-th word of the m-th document

We can only observe $W$, the words in each document. $\beta$ is the table in the above screenshot (each row is a topic, each column is a word).
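
The generative story behind these symbols can be sketched with NumPy. This sketch assumes the common smoothed-LDA convention in which `alpha` and `beta` act as Dirichlet concentration parameters; the vocabulary, the number of topics, and the value 0.5 are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["goal", "match", "wicket", "bowler", "serve", "racket"]  # hypothetical vocabulary
n_topics = 3
alpha = np.full(n_topics, 0.5)    # assumed Dirichlet prior over topics per document
beta = np.full(len(vocab), 0.5)   # assumed Dirichlet prior over words per topic

phi = rng.dirichlet(beta, size=n_topics)  # one word distribution per topic (K x V)
theta = rng.dirichlet(alpha)              # topic distribution for a single document

n_words = 8
Z = rng.choice(n_topics, size=n_words, p=theta)                  # topic for each word slot
W_words = [vocab[rng.choice(len(vocab), p=phi[z])] for z in Z]   # the observed words

print(np.round(theta, 2))
print(Z)
print(W_words)
```

Fitting LDA runs this story in reverse: given only the words, it infers plausible values for `theta`, `phi`, and `Z`.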

We randomly initialize the topic distributions and update them iteratively until we converge to a solution or exceed the maximum number of iterations.

The New York Times [highlighted an example of a recommendation system based on LDA](https://open.blogs.nytimes.com/2015/08/11/building-the-next-new-york-times-recommendation-engine/).

#### Assumptions of LDA
* Bag of words - each document is a bag of words where sequence, part of speech, etc. are not considered. 
* The number of topics is pre-determined and known (or guesstimated).
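
Because the bag-of-words assumption is stated in terms of word counts, one option is to feed LDA raw term counts rather than TF-IDF weights. The sketch below does this on a tiny made-up corpus; `toy_docs` and the choice of two topics are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical mini-corpus: two football-like and two cricket-like sentences.
toy_docs = [
    "the striker scored a late goal in the cup final",
    "the goalkeeper saved a penalty in the cup match",
    "the bowler took five wickets in the test match",
    "the batsman scored a century in the test series",
]

count_vectorizer = CountVectorizer(stop_words="english")
X_counts = count_vectorizer.fit_transform(toy_docs)  # documents x tokens, raw counts

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = toy_lda.fit_transform(X_counts)          # documents x topics

print(np.round(doc_topic, 2))
print(doc_topic.sum(axis=1))  # each row is a distribution over topics
```

With a corpus this small the topic split is not meaningful; the point is only the shape of the pipeline and the fact that each document's topic weights sum to 1.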

In [None]:
from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=5)
W = lda.fit_transform(X_sport)
get_top_documents_for_each_topic(W, sports_corpus_df.text.tolist(), 10)

Topic 0
89.3% Flintoff fit to bowl at Wanderers  Fourth Test, Wanderers: South Africa v England  Plays starts Thursday, 0830 GMT  There had been concerns his rib muscle injury would restrict him to playing as a specialist batsman in the match. Captain Michael Vaughan said: "He's had a bowl and came out fine so he is fully fit to play as an all-rounder. "We will see how he bowls. In Cape Town I thought he was our best bowler and in Durban probably our second best." Flintoff sent down around 20 deliveries at three-quarter pace during Wednesday's practice session. The 27-year-old incurred a side strain during the 196-run defeat in Cape Townlast week and did not bowl again until the eve of the Johannesburg Test. Vaughan said he would not necessarily shield Flintoff from bowling the same heavy workload he endured in the first three Tests. The skipper commented: "We will just have to judge who is bowling well on any given day and on the given surface to see how much we use him.  "But as a fa

### Collaborative Filters

We can also leverage NLP models as part of a collaborative filter in order to generate product recommendations for a given user/customer.

A common approach is constructing a **User-Item** Interaction Matrix, in which most elements are missing (sparse). We can then iteratively compute the User and Item matrices that minimize the squared error on the observed entries. This class of algorithms is called **Alternating Least Squares**.
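
A minimal NumPy sketch of the alternating idea follows. It is not the Spark implementation; the function `als`, the ratings in `R_toy`, the rank `k`, and the regularization strength `reg` are all assumptions for illustration.

```python
import numpy as np

def als(R, mask, k=2, reg=0.1, n_iters=20, seed=0):
    """Alternating least squares on the observed entries of R (mask == True)."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    users = rng.random((n_users, k))
    items = rng.random((n_items, k))
    ridge = reg * np.eye(k)  # keeps each small linear system well-conditioned
    for _ in range(n_iters):
        # Fix the item factors and solve a ridge regression for every user.
        for u in range(n_users):
            seen = mask[u]
            A = items[seen]
            users[u] = np.linalg.solve(A.T @ A + ridge, A.T @ R[u, seen])
        # Fix the user factors and solve a ridge regression for every item.
        for i in range(n_items):
            seen = mask[:, i]
            A = users[seen]
            items[i] = np.linalg.solve(A.T @ A + ridge, A.T @ R[seen, i])
    return users, items

# Toy ratings (1-5); 0 marks an unobserved user-item pair.
R_toy = np.array([
    [5, 4, 0, 1, 0],
    [4, 0, 0, 1, 1],
    [1, 1, 0, 5, 4],
    [0, 1, 5, 4, 0],
], dtype=float)
observed = R_toy > 0

user_factors, item_factors = als(R_toy, observed)
predictions = user_factors @ item_factors.T
rmse = np.sqrt(np.mean((predictions[observed] - R_toy[observed]) ** 2))
print(np.round(predictions, 1))  # includes estimates for the unobserved pairs
print(round(float(rmse), 3))     # error on the observed ratings only
```

Each inner step is an ordinary regularized least-squares problem, which is why the two factor matrices can be updated in turn with a closed-form solve.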

See [Spark example]().

![https://miro.medium.com/max/1400/1*xMxQL_V9CWeLggrk-Uyzmg.png](https://miro.medium.com/max/1400/1*xMxQL_V9CWeLggrk-Uyzmg.png)

#### Appendix: Mathematical Theory for Non-Negative Matrix Factorization

Derivations are from [Source Separation Tutorial Mini-Series II: Introduction to Non-Negative Matrix Factorization](https://ccrma.stanford.edu/~njb/teaching/sstutorial/part2.pdf).

We attempt to minimize the divergence $D$ between the original matrix $V$ (called `X` above) and its approximation $\hat{V}$, the product of the factor matrices $W$ and $H$:
$$
\min D(V||\hat{V})
$$
For NMF, this means
$$
\min_{W, H \geq 0} D(V||W\times H)
$$
This is read as *we want to select non-negative values for $W$ and $H$ that will minimize $D$*.

There are many functions we can choose for $D$, for example the squared **Euclidean Distance**:
$$
D(V||\hat{V}) = \sum_{i,j}{(V_{ij} - \hat{V}_{ij})^{2}}
$$
However, in practice, we commonly select the **[Kullback-Leibler Divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence)** as the divergence metric:
$$
D(V||\hat{V}) = \sum_{i,j}{(V_{ij}\log(\frac{V_{ij}}{\hat{V}_{ij}}) - V_{ij} + \hat{V}_{ij})}
$$
We can rewrite this as (by substituting $\hat{V}_{ij}$ with $(W\times H)_{ij}$):
$$
D(V||\hat{V}) = \sum_{i,j}{(V_{ij}\log(\frac{V_{ij}}{(W\times H)_{ij}}) - V_{ij} + (W\times H)_{ij})}
$$

From here, we can use Jensen's Inequality to derive an update rule for $H$:
$$
H^{\star}_{kj} = \frac{\sum_{i}{V_{ij}\pi_{ijk}}}{\sum_{i}{W_{ik}}}
$$
Here, $\pi_{ijk}$ is how much of the component $k$ to assign to the i-th document's j-th feature.
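
A minimal NumPy sketch of this update, together with the symmetric update for $W$, is given below. It assumes the standard choice $\pi_{ijk} = W_{ik}H_{kj} / \sum_{k'} W_{ik'}H_{k'j}$, under which the rule becomes a multiplicative update. The function names, the toy matrix, and the small constant `eps` (added for numerical safety) are illustrative assumptions.

```python
import numpy as np

def kl_divergence(V, V_hat, eps=1e-10):
    """Generalized KL divergence D(V || V_hat), summed over all entries."""
    return np.sum(V * np.log((V + eps) / (V_hat + eps)) - V + V_hat)

def nmf_kl(V, k, n_iters=200, seed=0, eps=1e-10):
    """NMF with multiplicative updates for the KL objective."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + 0.1  # strictly positive starting point
    H = rng.random((k, m)) + 0.1
    history = []
    for _ in range(n_iters):
        # H_kj <- H_kj * sum_i W_ik V_ij / (WH)_ij / sum_i W_ik
        ratio = V / (W @ H + eps)
        H *= (W.T @ ratio) / (W.sum(axis=0)[:, None] + eps)
        # W_ik <- W_ik * sum_j H_kj V_ij / (WH)_ij / sum_j H_kj
        ratio = V / (W @ H + eps)
        W *= (ratio @ H.T) / (H.sum(axis=1)[None, :] + eps)
        history.append(kl_divergence(V, W @ H))
    return W, H, history

# Toy non-negative matrix: 6 documents x 4 terms (made-up values).
V_toy = np.array([
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [4, 1, 0, 1],
    [0, 0, 2, 3],
    [0, 1, 3, 2],
    [0, 0, 4, 4],
], dtype=float)

W_kl, H_kl, history = nmf_kl(V_toy, k=2)
print(round(float(history[0]), 4), round(float(history[-1]), 4))  # divergence after first and last iteration
```

Because every update multiplies by a non-negative ratio, `W_kl` and `H_kl` stay non-negative without any explicit constraint handling.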
