# Lab 06 - Topic Modelling
In this lab we will look into building topic models, but will also examine dimensionality reduction and other relevant subjects.

## Latent Semantic Analysis (LSA)
Based on: [Text Mining 101: A Stepwise Introduction to Topic Modeling using Latent Semantic Analysis (using Python)](https://www.analyticsvidhya.com/blog/2018/10/stepwise-guide-topic-modeling-latent-semantic-analysis/)

### Data reading and inspection
Let’s load the required libraries:

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option("display.max_colwidth", 200)

We will use the 20 Newsgroups dataset, which is available through sklearn.

OPTIONALLY: You can also download the dataset [here](https://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups) if you want to inspect the raw files.

In [2]:
from sklearn.datasets import fetch_20newsgroups

dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'))
documents = dataset.data
len(documents)

11314

Let's look at the 20 labels in the dataset:

In [3]:
dataset.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

### Data Preprocessing
To start with, we will try to clean our text data as much as possible.

### Challenge 01

In [4]:
news_df = pd.DataFrame({'document':documents})

# TODO: Remove everything except alphabetic characters
news_df['clean_doc'] = ...

# TODO: Remove short words (words with len < 4) from `clean_doc` column.
news_df['clean_doc'] = ...

# TODO: Make all text lowercase
news_df['clean_doc'] = ...

Next, tokenise the text and remove the stop-words. Afterwards we will stitch the remaining tokens back together into strings.
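
As a rough illustration of the cleaning steps on toy data, here is a minimal sketch in plain Python rather than pandas. The two sentences, the tiny stop-word set, and the helper name `clean` are made up for this example; the lab itself works on the newsgroup documents and NLTK's English stop-word list.

```python
import re

# Hypothetical toy inputs; the lab uses the 20 Newsgroups documents instead.
toy_docs = ["The shuttle launched at 9:30am!!", "Hockey fans were cheering loudly."]
# Tiny stand-in for NLTK's English stop-word list (an assumption for this sketch).
toy_stop_words = {"were", "that", "this", "with"}


def clean(text):
    # Keep letters only, replacing every other character with a space.
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    # Lowercase, tokenise on whitespace, drop short words and stop-words.
    tokens = [w for w in letters_only.lower().split()
              if len(w) >= 4 and w not in toy_stop_words]
    # Stitch the surviving tokens back into one string.
    return " ".join(tokens)


cleaned = [clean(d) for d in toy_docs]
print(cleaned)
```

Short words such as "the" and "at" disappear through the length filter alone, which is why the stop-word list mostly matters for longer function words.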

### Challenge 02

In [6]:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords', quiet=True)  # one-time download of the stop-word corpus
stop_words = stopwords.words('english')

# TODO: Tokenization
tokenized_doc = ...

# TODO: Remove stop-words
tokenized_doc = ...

# De-tokenization: join each list of tokens back into a single string
detokenized_doc = [' '.join(tokens) for tokens in tokenized_doc]

news_df['clean_doc'] = detokenized_doc

### TF-IDF Document-Term Matrix
This is the first step towards topic modeling. We will use sklearn’s TfidfVectorizer to create a document-term matrix limited to 1,000 terms.

### Challenge 03

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english',
                             max_features=1000,  # keep the 1,000 most frequent terms
                             max_df=0.5,         # ignore terms found in more than 50% of documents
                             smooth_idf=True)

# TODO: Use fit_transform func to transform `clean_doc` with TfidfVectorizer
X = ...

X.shape # check shape of the document-term matrix

We could have used all the terms to create this matrix, but that would require considerably more computation time and memory, so we have restricted the number of features to 1,000. If you have the computational power, try using all the terms.
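
For intuition about what the entries of this matrix are built from, here is a minimal sketch of the smoothed inverse document frequency that TfidfVectorizer uses when `smooth_idf=True`, computed by hand on a made-up four-document corpus. The toy documents and the helper name `smoothed_idf` are assumptions for illustration; the real vectorizer additionally multiplies by term counts and, by default, L2-normalises each row.

```python
import math
from collections import Counter

# Hypothetical, already-tokenised toy corpus (not the newsgroups data).
toy_corpus = [
    ["space", "shuttle", "launch"],
    ["space", "orbit"],
    ["hockey", "game"],
    ["game", "score", "hockey"],
]


def smoothed_idf(corpus):
    n_docs = len(corpus)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


idf = smoothed_idf(toy_corpus)
for term in sorted(idf, key=idf.get, reverse=True):
    print(f"{term:8s} {idf[term]:.3f}")
```

Terms that occur in a single document (such as "shuttle") receive a larger weight than terms that occur in two (such as "space"), which is the property that lets rare, topic-specific words stand out.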

### Topic Modeling
The next step is to represent every term and every document as a vector. To do so, we decompose the document-term matrix into a product of smaller matrices using sklearn’s TruncatedSVD.

Since the data comes from 20 different newsgroups, let’s try to have 20 topics for our text data. The number of topics can be specified by using the n_components parameter.
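
The decomposition described above can be sketched with plain NumPy on a tiny made-up matrix. This is not how TruncatedSVD is implemented internally (the lab configures a randomized solver), but the shapes correspond: keeping `k` singular vectors gives a `k` by `n_terms` topic matrix analogous to `components_`, and an `n_docs` by `k` document representation analogous to the output of `fit_transform`. The matrix values and the choice `k = 2` are arbitrary assumptions.

```python
import numpy as np

# Hypothetical 4-document x 5-term matrix standing in for the TF-IDF matrix X.
rng = np.random.default_rng(0)
toy_X = rng.random((4, 5))
k = 2  # number of topics to keep (n_components in TruncatedSVD)

# Thin SVD: toy_X = U @ diag(S) @ Vt, with singular values sorted in descending order.
U, S, Vt = np.linalg.svd(toy_X, full_matrices=False)

topic_term = Vt[:k]           # shape (k, n_terms): one row of term weights per topic
doc_topic = U[:, :k] * S[:k]  # shape (n_docs, k): each document in topic space

# Rank-k approximation of the original matrix.
approx = doc_topic @ topic_term
print(topic_term.shape, doc_topic.shape)
print("reconstruction error:", np.linalg.norm(toy_X - approx))
```

Increasing `k` lowers the reconstruction error, at the cost of more topics to interpret.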

In [10]:
from sklearn.decomposition import TruncatedSVD

# SVD represents documents and terms as vectors
svd_model = TruncatedSVD(n_components=20, algorithm='randomized', n_iter=100, random_state=122)

# TODO: Fit svd_model on X
...

len(svd_model.components_)

The components of svd_model are our topics, and we can access them using svd_model.components_. Finally, let’s print the seven most important words in each of the 20 topics and see how our model has done.

In [11]:
terms = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

for i, comp in enumerate(svd_model.components_):
    # Pair each term with its weight in this topic and keep the 7 largest
    sorted_terms = sorted(zip(terms, comp), key=lambda x: x[1], reverse=True)[:7]
    print("Topic " + str(i) + ": " + " ".join(term for term, _ in sorted_terms))

### Topics Visualization
To find out how distinct our topics are, we should visualize them. We cannot plot more than three dimensions directly, but techniques such as t-SNE can project high-dimensional data into two or three dimensions. Here we will use a relatively new technique called UMAP (Uniform Manifold Approximation and Projection).

In [12]:
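# Requires the umap-learn package: pip install umap-learn
import umap

# Project every document into the 20-dimensional topic space
X_topics = svd_model.fit_transform(X)
# Reduce the 20 topic dimensions to 2 for plotting
embedding = umap.UMAP(n_neighbors=150, min_dist=0.5, random_state=12).fit_transform(X_topics)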

plt.figure(figsize=(14,10))
plt.scatter(embedding[:, 0], embedding[:, 1], 
            c = dataset.target, # colour each point by its true newsgroup label
            s = 10, # size
            edgecolor='none')
plt.show()