# LSA Exercise

The purpose of this exercise is to use LSA (latent semantic analysis) to run unsupervised topic extraction on texts and to compare the results with the target variable. We will not use the target variable to train a model; we use it only to assess whether the topics found by LSA resemble the classes that would have been used for supervised classification.

1. Let's begin by importing the libraries we will be using

In [81]:
import numpy as np
import pandas as pd

import plotly.figure_factory as ff
from sklearn.metrics import confusion_matrix

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

import en_core_web_sm
from spacy.lang.en.stop_words import STOP_WORDS

2. Load the 20 newsgroups dataset into an object news

In [66]:
news = fetch_20newsgroups()

3. Display the data description using the DESCR key

In [67]:
print(news["DESCR"])

.. _20newsgroups_dataset:

The 20 newsgroups text dataset
------------------------------

The 20 newsgroups dataset comprises around 18000 newsgroups posts on
20 topics split in two subsets: one for training (or development)
and the other one for testing (or for performance evaluation). The split
between the train and test set is based upon a messages posted before
and after a specific date.

This module contains two loaders. The first one,
:func:`sklearn.datasets.fetch_20newsgroups`,
returns a list of the raw texts that can be fed to text feature
extractors such as :class:`~sklearn.feature_extraction.text.CountVectorizer`
with custom parameters so as to extract feature vectors.
The second one, :func:`sklearn.datasets.fetch_20newsgroups_vectorized`,
returns ready-to-use features, i.e., it is not necessary to use a feature
extractor.

**Data Set Characteristics:**

Classes                     20
Samples total            18846
Dimensionality               1
Features                  text

4. Store the object news.data in a DataFrame and call the column text. Extract a sample of 5000 rows to begin with. Add the target variable to this dataframe in order to run analysis later.

In [68]:
corpus = pd.DataFrame(news.data, columns=["text"]).sample(5000)
corpus["target"] = news.target[corpus.index]
corpus

Unnamed: 0,text,target
4977,From: engp2254@nusunix1.nus.sg (SOH KAM YUNG)\...,12
8241,From: brandt@cs.unc.edu (Andrew Brandt)\nSubje...,7
7059,From: wade@nb.rockwell.com (Wade Guthrie)\nSub...,7
5647,From: Rick_Granberry@pts.mot.com (Rick Granber...,15
4355,From: frank@D012S658.uucp (Frank O'Dwyer)\nSub...,0
...,...,...
5427,From: Howard Frederick <hfrederick@igc.apc.org...,17
1496,From: shz@mare.att.com (Keeper of the 'Tude)\n...,8
2555,From: simon@dcs.warwick.ac.uk (Simon Clippingd...,0
7910,From: lvc@cbnews.cb.att.com (Larry Cipriani)\n...,16
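
The sampling step relies on the sampled DataFrame keeping its original row labels, which is what lets the target array be indexed with the DataFrame index. The sketch below illustrates this on toy data; `texts` and `targets` are hypothetical stand-ins for `news.data` and `news.target`, and `random_state` is an optional addition that makes the sample reproducible.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for news.data and news.target (hypothetical values).
texts = [f"document {i}" for i in range(10)]
targets = np.arange(10) % 3

# random_state fixes the random draw so the same rows come back on every run.
sample = pd.DataFrame(texts, columns=["text"]).sample(4, random_state=0)

# The sampled frame keeps the original row labels, so indexing the target
# array with sample.index picks the label that belongs to each sampled text.
sample["target"] = targets[sample.index]
print(sample)
```

Without `random_state`, every run of the notebook draws a different sample, so the row indices and counts shown in the outputs will differ from one run to the next.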


5. Create a column text_clean that keeps only the text after the string "Subject:", replaces non-alphanumeric characters with spaces, and converts everything to lowercase

In [69]:
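# Keep everything after the first "Subject:" header (maxsplit=1 preserves later occurrences)
corpus["text_clean"] = corpus["text"].apply(lambda x: x.split("Subject:", 1)[1])
# Replace every run of non-alphanumeric characters (except spaces) with a single space
corpus["text_clean"] = corpus["text_clean"].str.replace(r"[^A-Za-z0-9 ]+", " ", regex=True)
# Lowercase the result
corpus["text_clean"] = corpus["text_clean"].str.lower()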
corpus.head()

Unnamed: 0,text,target,text_clean
4977,From: engp2254@nusunix1.nus.sg (SOH KAM YUNG)\...,12,re protection of serial rs232 lines organi...
8241,From: brandt@cs.unc.edu (Andrew Brandt)\nSubje...,7,4runner and pathfinder recent changes organiz...
7059,From: wade@nb.rockwell.com (Wade Guthrie)\nSub...,7,re curious about the porsche i drove organiz...
5647,From: Rick_Granberry@pts.mot.com (Rick Granber...,15,pastoral authority reply to rick granberry p...
4355,From: frank@D012S658.uucp (Frank O'Dwyer)\nSub...,0,re societally acceptable behavior organizati...
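
To see what the cleaning does to a single message, here is a minimal standard-library sketch of the same steps. `clean_text` and the `example` string are hypothetical and used only for illustration; the function passes `maxsplit=1` to `split` so that a later "Subject:" in the body does not truncate the text, and it falls back to the full string when no header is found.

```python
import re

def clean_text(raw: str) -> str:
    """Keep what follows the first 'Subject:', strip punctuation, lowercase."""
    # maxsplit=1 keeps the whole remainder even if "Subject:" appears again.
    parts = raw.split("Subject:", 1)
    # Fall back to the full text when there is no "Subject:" header at all.
    body = parts[1] if len(parts) == 2 else raw
    # Replace each run of characters that are neither alphanumeric nor a space.
    body = re.sub(r"[^A-Za-z0-9 ]+", " ", body)
    return body.lower()

example = "From: a@b.org\nSubject: Re: RS232 lines\nSee Subject: above."
print(clean_text(example).split())
```

Because punctuation and newlines are replaced by spaces rather than removed, the cleaned text can contain runs of several spaces, which is visible in the outputs of this notebook.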


6. Create an object nlp with ```en_core_web_sm.load```

In [70]:
nlp = en_core_web_sm.load()

7. Tokenize the cleaned sentences, lemmatize the tokens, and remove English stopwords

In [71]:
corpus["text_tokenized"] = corpus["text_clean"].apply(lambda x: [token.lemma_ for token in nlp(x) if token.text not in STOP_WORDS])
corpus.head()

Unnamed: 0,text,target,text_clean,text_tokenized
4977,From: engp2254@nusunix1.nus.sg (SOH KAM YUNG)\...,12,re protection of serial rs232 lines organi...,"[ , , protection, serial, , rs232, , line, ..."
8241,From: brandt@cs.unc.edu (Andrew Brandt)\nSubje...,7,4runner and pathfinder recent changes organiz...,"[ , 4runner, pathfinder, recent, change, organ..."
7059,From: wade@nb.rockwell.com (Wade Guthrie)\nSub...,7,re curious about the porsche i drove organiz...,"[ , , curious, porsche, drive, organization, ..."
5647,From: Rick_Granberry@pts.mot.com (Rick Granber...,15,pastoral authority reply to rick granberry p...,"[ , pastoral, authority, reply, , rick, granb..."
4355,From: frank@D012S658.uucp (Frank O'Dwyer)\nSub...,0,re societally acceptable behavior organizati...,"[ , , societally, acceptable, behavior, organ..."
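
The token lists above still contain whitespace-only tokens, which come from the runs of spaces left by the cleaning step. The sketch below shows one possible way to drop them along with stop words, using the `is_space` and `is_stop` token attributes. It is an illustration under assumptions: it uses `spacy.blank("en")`, a tokenizer-only pipeline, so that no model download is needed, and it therefore keeps `token.text` instead of lemmas. `nlp_blank` and `tokenize` are hypothetical names.

```python
import spacy

# Tokenizer-only English pipeline: no tagger or lemmatizer is loaded,
# so this sketch keeps the raw token text rather than token.lemma_.
nlp_blank = spacy.blank("en")

def tokenize(text: str) -> list:
    """Return tokens that are neither whitespace nor English stop words."""
    return [
        tok.text
        for tok in nlp_blank(text)
        if not tok.is_space and not tok.is_stop
    ]

tokens = tokenize(" re  protection of serial  rs232 lines")
print(tokens)
```

With the full `en_core_web_sm` pipeline used in this notebook, the same two conditions could be combined with `token.lemma_` in the list comprehension.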


8. Detokenize the tokenized sentences and store them in an ```nlp_ready``` column

In [72]:
corpus["nlp_ready"] = corpus["text_tokenized"].apply(lambda x: ' '.join(x))
corpus.head()

Unnamed: 0,text,target,text_clean,text_tokenized,nlp_ready
4977,From: engp2254@nusunix1.nus.sg (SOH KAM YUNG)\...,12,re protection of serial rs232 lines organi...,"[ , , protection, serial, , rs232, , line, ...",protection serial rs232 line organizat...
8241,From: brandt@cs.unc.edu (Andrew Brandt)\nSubje...,7,4runner and pathfinder recent changes organiz...,"[ , 4runner, pathfinder, recent, change, organ...",4runner pathfinder recent change organizatio...
7059,From: wade@nb.rockwell.com (Wade Guthrie)\nSub...,7,re curious about the porsche i drove organiz...,"[ , , curious, porsche, drive, organization, ...",curious porsche drive organization rockw...
5647,From: Rick_Granberry@pts.mot.com (Rick Granber...,15,pastoral authority reply to rick granberry p...,"[ , pastoral, authority, reply, , rick, granb...",pastoral authority reply rick granberry pt...
4355,From: frank@D012S658.uucp (Frank O'Dwyer)\nSub...,0,re societally acceptable behavior organizati...,"[ , , societally, acceptable, behavior, organ...",societally acceptable behavior organizatio...


9. Use sklearn to compute the TF-IDF matrix

In [73]:
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(corpus["nlp_ready"])
X

<5000x70371 sparse matrix of type '<class 'numpy.float64'>'
	with 509638 stored elements in Compressed Sparse Row format>

10. Use the TruncatedSVD model to create a topic model with 20 topics

In [74]:
svd = TruncatedSVD(n_components=20)
lsa = svd.fit_transform(X)

In [75]:
list_topic = ["topic_"+str(i) for i in range(1, 21)]

topics = pd.DataFrame(abs(lsa), columns=list_topic, index=corpus.index)
topics["text"] = corpus['nlp_ready']
topics.head()

Unnamed: 0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8,topic_9,topic_10,...,topic_12,topic_13,topic_14,topic_15,topic_16,topic_17,topic_18,topic_19,topic_20,text
4977,0.142929,0.053448,0.035099,0.023163,0.002161,0.029262,0.012541,0.042809,0.017865,0.019203,...,0.004814,0.040882,0.031298,0.007641,0.047676,0.011411,0.034943,0.009218,0.017512,protection serial rs232 line organizat...
8241,0.111785,0.041383,0.048258,0.020143,0.000175,0.037299,0.023966,0.011719,0.016638,0.010117,...,0.022951,0.023598,0.016123,0.022858,0.007519,0.016313,5.1e-05,0.015427,0.022912,4runner pathfinder recent change organizatio...
7059,0.112317,0.015785,0.019795,0.017478,0.039096,0.028607,0.034628,0.061418,0.049407,0.050359,...,0.041106,0.028901,0.03481,0.004494,0.044431,0.029328,0.015747,0.032578,0.046997,curious porsche drive organization rockw...
5647,0.085716,0.035191,0.005349,0.035218,0.016343,0.011214,0.000624,0.008806,0.030857,0.005104,...,0.002823,0.014532,0.024135,0.019111,0.030874,0.018221,0.004062,0.030102,0.043387,pastoral authority reply rick granberry pt...
4355,0.182962,0.1092,0.011292,0.012677,0.033767,0.00793,0.001253,0.021094,0.090732,0.024778,...,0.05231,0.01056,0.056753,0.029792,0.023864,0.15611,0.012823,0.003593,0.003614,societally acceptable behavior organizatio...


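11. Assign each document to the topic it is most strongly linked to. Note that ```class_pred``` is 0-indexed, so class 0 corresponds to ```topic_1```, and that the argmax is taken over the signed values in ```lsa```, not over the absolute values stored in ```topics```: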

In [None]:
topics["class_pred"] = [np.argmax(topic) for topic in lsa]
topics["class_pred"].value_counts()

class_pred
0     4021
2      124
1      119
6      111
10      94
5       82
12      78
4       51
14      48
16      48
11      47
18      41
13      40
17      33
15      22
3       12
7       11
9        9
19       9
Name: count, dtype: int64
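
The weight of class 0 in these counts is related to a general property of SVD on non-negative data: the scores on the first component all share the same sign and mostly reflect the overall magnitude of each row. The sketch below checks this on a hypothetical random positive matrix (not the newsgroups data), using `np.linalg.svd` in place of `TruncatedSVD` to keep the example dependency-light.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for a TF-IDF matrix: 50 "documents", 30 "terms",
# with strictly positive entries.
toy_X = rng.random((50, 30)) + 0.01

# Full SVD, then keep the first 5 components (the truncation used by LSA).
U, s, Vt = np.linalg.svd(toy_X, full_matrices=False)
toy_scores = (U * s)[:, :5]

# Scores on the first component are either all positive or all negative.
first = toy_scores[:, 0]
same_sign = bool(np.all(first > 0) or np.all(first < 0))
print("first component has a single sign:", same_sign)

# Share of rows whose largest absolute score falls on component 0.
share = float(np.mean(np.argmax(np.abs(toy_scores), axis=1) == 0))
print("share assigned to component 0:", share)
```

On real TF-IDF data the effect is less extreme than on this toy matrix, but it suggests why a plain argmax tends to favour the first component. Ignoring the first component before taking the argmax is one possible variation to experiment with.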

12. Add the target variable to the topic model DataFrame and plot the confusion matrix of the predicted topics against the target variable:

In [80]:
topics["target"] = news.target[corpus.index]
topics.head()

Unnamed: 0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8,topic_9,topic_10,...,topic_14,topic_15,topic_16,topic_17,topic_18,topic_19,topic_20,text,class_pred,target
4977,0.142929,0.053448,0.035099,0.023163,0.002161,0.029262,0.012541,0.042809,0.017865,0.019203,...,0.031298,0.007641,0.047676,0.011411,0.034943,0.009218,0.017512,protection serial rs232 line organizat...,0,12
8241,0.111785,0.041383,0.048258,0.020143,0.000175,0.037299,0.023966,0.011719,0.016638,0.010117,...,0.016123,0.022858,0.007519,0.016313,5.1e-05,0.015427,0.022912,4runner pathfinder recent change organizatio...,0,7
7059,0.112317,0.015785,0.019795,0.017478,0.039096,0.028607,0.034628,0.061418,0.049407,0.050359,...,0.03481,0.004494,0.044431,0.029328,0.015747,0.032578,0.046997,curious porsche drive organization rockw...,0,7
5647,0.085716,0.035191,0.005349,0.035218,0.016343,0.011214,0.000624,0.008806,0.030857,0.005104,...,0.024135,0.019111,0.030874,0.018221,0.004062,0.030102,0.043387,pastoral authority reply rick granberry pt...,0,15
4355,0.182962,0.1092,0.011292,0.012677,0.033767,0.00793,0.001253,0.021094,0.090732,0.024778,...,0.056753,0.029792,0.023864,0.15611,0.012823,0.003593,0.003614,societally acceptable behavior organizatio...,0,0


In [82]:
# Normalize by the number of documents so that the cells sum to 1
cm = confusion_matrix(y_true=topics["target"], y_pred=topics["class_pred"]) / len(topics)

fig = ff.create_annotated_heatmap(cm.round(2), x=list(range(20)), y=list(range(20)))
fig.update_layout(width = 1200)

fig.show()
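
Two remarks on this matrix can be sketched on toy labels. First, dividing the counts by the number of documents is equivalent to passing `normalize="all"` to `confusion_matrix`. Second, topic numbers have no built-in correspondence with target ids, so the diagonal is not meaningful by itself. A score that ignores label numbering, such as `adjusted_rand_score`, is one way to summarize the agreement; using that metric is an assumption of this sketch and not part of the original exercise. The arrays `y_true` and `y_pred` are hypothetical.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, confusion_matrix

# Hypothetical labels: the predicted ids are mostly a renumbering of the true ones.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([1, 1, 0, 0, 2, 0])

# Manual normalization and normalize="all" give the same matrix.
cm_manual = confusion_matrix(y_true, y_pred) / len(y_true)
cm_all = confusion_matrix(y_true, y_pred, normalize="all")
print(np.allclose(cm_manual, cm_all))

# The adjusted Rand index is unchanged when cluster ids are renumbered:
# a perfect grouping with swapped ids still scores 1.0.
ari_swapped = adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0])
print(ari_swapped)
```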

Conclusion: the topics found by LSA are very different from the target! Topic 0 is assigned to the large majority of documents (4021 of 5000) and spans many of the target categories.
LSA is a convenient way to find structure in a text corpus, but it usually produces topics that differ noticeably from the categories a human would have defined.

Reminder: unlike supervised classification and unsupervised clustering, LSA is based on the hypothesis that a given document can be related to several topics. This makes the model's output harder to interpret, but it makes it possible to build topic models that are more realistic, because in real life a document is often related to several topics!