# HW03: Distance and Topic Model

Remember that these homework work as a completion grade. **You can skip one section without losing credit.**

## Load and Pre-process Text

In [307]:
# Import the AG news dataset (same as hw01)
# Download them from here
# !wget https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv

import pandas as pd
import nltk
df = pd.read_csv('train.csv')

df.columns = ["label", "title", "lead"]
label_map = {1:"world", 2:"sport", 3:"business", 4:"sci/tech"}
def replace_label(x):
	return label_map[x]
df["label"] = df["label"].apply(replace_label) 
df["text"] = df["title"] + " " + df["lead"]
df.head()

Unnamed: 0,label,title,lead,text
0,business,Carlyle Looks Toward Commercial Aerospace (Reu...,Reuters - Private investment firm Carlyle Grou...,Carlyle Looks Toward Commercial Aerospace (Reu...
1,business,Oil and Economy Cloud Stocks' Outlook (Reuters),Reuters - Soaring crude prices plus worries\ab...,Oil and Economy Cloud Stocks' Outlook (Reuters...
2,business,Iraq Halts Oil Exports from Main Southern Pipe...,Reuters - Authorities have halted oil export\f...,Iraq Halts Oil Exports from Main Southern Pipe...
3,business,"Oil prices soar to all-time record, posing new...","AFP - Tearaway world oil prices, toppling reco...","Oil prices soar to all-time record, posing new..."
4,business,"Stocks End Up, But Near Year Lows (Reuters)",Reuters - Stocks ended slightly higher on Frid...,"Stocks End Up, But Near Year Lows (Reuters) Re..."


In [308]:
import spacy
df = df.sample(500)
nlp = spacy.load('en_core_web_sm')

In [309]:
# Tokenize the text of each document in the corpus, the tokenized version gets saved into the `tokenized` column of the dataframe
df["tokenized"] = df["text"].apply(lambda x: nlp(x))
df.head()

Unnamed: 0,label,title,lead,text,tokenized
118634,sport,Mets challenge intrigues Martinez,The former Red Sox ace formalized a 53 (m) mil...,Mets challenge intrigues Martinez The former R...,"(Mets, challenge, intrigues, Martinez, The, fo..."
90336,sci/tech,How Science Abuses Politics,Patrick J. Michaels is a senior fellow at the ...,How Science Abuses Politics Patrick J. Michael...,"(How, Science, Abuses, Politics, Patrick, J., ..."
50212,business,Ofgem exposes gas supply problems,Energy regulator Ofgem blames high oil prices ...,Ofgem exposes gas supply problems Energy regul...,"(Ofgem, exposes, gas, supply, problems, Energy..."
30395,business,TI plans a buyback and boosts dividend,No.1 maker of chips used in cell phones announ...,TI plans a buyback and boosts dividend No.1 ma...,"(TI, plans, a, buyback, and, boosts, dividend,..."
64632,world,Quarter of All Afghan Votes Counted; Karzai Ahead,KABUL (Reuters) - Afghan President Hamid Karz...,Quarter of All Afghan Votes Counted; Karzai Ah...,"(Quarter, of, All, Afghan, Votes, Counted, ;, ..."


In [310]:
# TODO print the first sentence of the first document in your sample
print (list(df.iloc[0]["tokenized"].sents)[0])

Mets challenge intrigues


In [311]:
# TODO pre-process text as you did in HW02 ->
# tokenized the text into sentences, convert each token into lowercase (x.lower()), without punctuation tokens (x.is_punct) nor stopwords (x.is_stop) nor a number
def pre_process(tokenized_text):
	return [w.lemma_.lower() for w in tokenized_text if not w.is_stop and not w.is_punct and w.pos_ != 'NUM']

df["preprocessed"] = df["tokenized"].apply(lambda x: pre_process(x))

In [312]:
# Save the dataframe, as the pre_processing takes a long time to compute
df.to_pickle('cleaned_news.pkl', compression='gzip')

In [313]:
# Check out the results of the pre-processing
df.head()

Unnamed: 0,label,title,lead,text,tokenized,preprocessed
118634,sport,Mets challenge intrigues Martinez,The former Red Sox ace formalized a 53 (m) mil...,Mets challenge intrigues Martinez The former R...,"(Mets, challenge, intrigues, Martinez, The, fo...","[met, challenge, intrigue, martinez, red, sox,..."
90336,sci/tech,How Science Abuses Politics,Patrick J. Michaels is a senior fellow at the ...,How Science Abuses Politics Patrick J. Michael...,"(How, Science, Abuses, Politics, Patrick, J., ...","[science, abuses, politics, patrick, j., micha..."
50212,business,Ofgem exposes gas supply problems,Energy regulator Ofgem blames high oil prices ...,Ofgem exposes gas supply problems Energy regul...,"(Ofgem, exposes, gas, supply, problems, Energy...","[ofgem, expose, gas, supply, problem, energy, ..."
30395,business,TI plans a buyback and boosts dividend,No.1 maker of chips used in cell phones announ...,TI plans a buyback and boosts dividend No.1 ma...,"(TI, plans, a, buyback, and, boosts, dividend,...","[ti, plan, buyback, boost, dividend, no.1, mak..."
64632,world,Quarter of All Afghan Votes Counted; Karzai Ahead,KABUL (Reuters) - Afghan President Hamid Karz...,Quarter of All Afghan Votes Counted; Karzai Ah...,"(Quarter, of, All, Afghan, Votes, Counted, ;, ...","[quarter, afghan, vote, count, karzai, ahead, ..."


In [314]:
# TODO vectorize the pre-processed text using CountVectorizer
# See https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html for documentation
from sklearn.feature_extraction.text import CountVectorizer

# CountVectorizer takes a text as input, so we need to convert our pre-processed list back to a string
corpus = df['preprocessed'].apply(lambda x: ' '.join(x))

vectorizer = CountVectorizer(ngram_range=(1, 1))
X = vectorizer.fit_transform(corpus.values)

In [315]:
# despite filtering out numbers, the first few entries (0 up to 19) are still consisting of numbers (dates, etc.)
vectorizer.get_feature_names_out()

array(['000', '06', '10', ..., 'zimbabwe', 'zone', 'zoom'], dtype=object)

## Cosine Similarity and PCA

In [317]:
from sklearn.metrics.pairwise import cosine_similarity

# TODO compute the cosine similarity for the first 200 snippets and for the first snippet, show the three most similar snippets and their respective cosine similarity scores
# This will give us a 500x500 matrix
cosine_sim = cosine_similarity(X)
cosine_sim[:5] # print the first column

array([[1.        , 0.        , 0.        , ..., 0.0758098 , 0.04166667,
        0.03608439],
       [0.        , 1.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 1.        , ..., 0.        , 0.        ,
        0.        ],
       [0.06375767, 0.        , 0.        , ..., 0.        , 0.        ,
        0.08282364],
       [0.        , 0.        , 0.        , ..., 0.09284767, 0.        ,
        0.        ]])

In [318]:
import numpy as np

def get_argmax_ignore_self(c, index):

	if index == 0:
		return np.argmax(c[1:])

	if index == len(c) - 1:
		return np.argmax(c[:-1])

	# skip pairing with itself
	argmax1 = np.argmax(c[:index])
	argmax2 = np.argmax(c[(index + 1):])

	return argmax1 if c[argmax1] >= c[argmax2] else argmax2

def print_most_similar(data_corpus, similarity_matrix):

	ranked_sentences = []

	for i, col in enumerate(similarity_matrix):
		argmax = get_argmax_ignore_self(col, i)

		if i == argmax: continue

		if col[argmax] > 0:
			ranked_sentences.append([col[argmax], i, argmax])

	# sort by similarity
	ranked_sentences.sort(key=lambda triple: triple[0], reverse=True)

	print('Similar sentences ranked:')
	for s in ranked_sentences:
		print('\tSimilarity: ', s[0])
		print('\t ', s[1], ': ', data_corpus.iloc[s[1]])
		print('\t ', s[2], ': ', data_corpus.iloc[s[2]])
		print('\n')

In [319]:
print_most_similar(corpus, similarity_matrix=cosine_sim)

Similar sentences ranked:
	Similarity:  0.6009819973837497
	  113 :  india ruling congress party win key state election india ruling congress party retain power legislative election country second large state maharashtra
	  107 :  india congress set form govt key state   bombay reuters india rule congress party win power   key state saturday emerge large   group election major political test   victory national poll may.


	Similarity:  0.583115436899835
	  396 :  rio fights rampant street crime tourists fly lt;p&gt;&lt;/p&gt;&lt;p&gt andrei khalip&lt;/p&gt;&lt;p&gt rio de janeiro brazil reuters police arm riflesand wear flak jacket manned checkpoint rio dejaneiro thursday prevent mugging violence astourist flock ocean city holiday season.&lt;/p&gt
	  166 :  supreme court pass riaa network sharing case u.s. supreme court decline tuesday hear appeal case concern right entertainment industry subpoena file trader telecommunication company share network startup competitors.&lt;p&gt;advertis

In [320]:
# PCA does not support sparse input. See TruncatedSVD for a possible alternative.

from sklearn.decomposition import TruncatedSVD
pca = TruncatedSVD(n_components=3)

##TODO reduce the vectorized data using PCA
X_reduced = pca.fit_transform(X)
X_reduced

array([[ 0.42751035, -0.88025744, -0.02949573],
       [ 0.01639626, -0.05535975, -0.02922931],
       [ 0.20217027, -0.29618551, -0.33268   ],
       ...,
       [ 0.491484  , -0.55278025, -0.8018407 ],
       [ 0.29619728, -0.34988223,  0.11253941],
       [ 0.33695398, -0.48354134, -0.20780809]])

In [321]:
# TODO compute again cosine similarity with the reduced version for the first 200 snippets

from sklearn.metrics.pairwise import cosine_similarity

# This will give us a 200x200 matrix
cosine_sim = cosine_similarity(X_reduced)

##TODO for the first snippet, show again its three most similar snippets
print_most_similar(corpus, similarity_matrix=cosine_sim)

Similar sentences ranked:
	Similarity:  0.9999998777175254
	  142 :  greek bus hijacker demand ransom plane armed hijacker release man bus athens fourth batch bring total number release hostage include man woman police say wednesday
	  57 :  karaoke creator win ig nobel prize description daisuke inoue inventor karaoke award ig nobel peace prize night   quot;for invent karaoke provide entirely new way people learn tolerate


	Similarity:  0.9999998281371473
	  144 :  sasser writer get job german teenager write sasser worm train security software programmer give job sven jaschan take securepoint computer outfit northern germany
	  49 :  valero buy kaneb services pipe line pipeline operator valero lp say monday agree purchase kaneb services llc kaneb pipe line partners lp \$2.8 create large terminal operator second large petroleum liquid pipeline operator united


	Similarity:  0.9999994080672646
	  398 :  doctor offers assurances astronauts will hungry nasa official ask astronaut aboard 

Compare the cosine similarity before and after PCA reduction. Did the results change?

Yes, the results changed drastically, we now get similarities in the 99% area, where as before the max similarity was between 50% and 60%

## Clustering

In [322]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

from sklearn.feature_extraction.text import CountVectorizer

# CountVectorizer takes a text as input, so we need to convert our pre-processed list back to a string
corpus = df['preprocessed'].apply(lambda x: ' '.join(x))

vectorizer = CountVectorizer(ngram_range=(1, 3))
X = vectorizer.fit_transform(corpus.values)

# TODO create the clusters found with k-means clustering nd 10 clusters
kmeans = KMeans(n_clusters=10, random_state=12)
kmeans.fit(X)
clustered_X = kmeans.predict(X)

for i, doc in enumerate(clustered_X):

	print('\nLabel: ', df.iloc[i]['label'])
	print('Cluster: ', doc)
	print('Label: ', df.iloc[i]['text'])


Label:  sport
Cluster:  4
Label:  Mets challenge intrigues Martinez The former Red Sox ace formalized a 53 (m) million dollar, four-year contract with the New York Mets today (Thursday) and embraced the idea of helping rebuild a team that has fallen on hard times.

Label:  sci/tech
Cluster:  4
Label:  How Science Abuses Politics Patrick J. Michaels is a senior fellow at the Cato Institute and author of Meltdown: The Predictable Distortion of Global Warming by Scientists, Politicians, and the Media.

Label:  business
Cluster:  4
Label:  Ofgem exposes gas supply problems Energy regulator Ofgem blames high oil prices and problems with the UK's supply of gas for a steep rise in wholesale gas prices.

Label:  business
Cluster:  4
Label:  TI plans a buyback and boosts dividend No.1 maker of chips used in cell phones announces \$1B stock repurchase and raises payout nearly 18. NEW YORK (CNN/Money) - Texas Instruments, the world #39;s largest maker of chips used in cell phones, announced 

La

In [323]:
##TODO find the optimal number of clusters in a range from 2 to 50 using the silhouette score

range_n_clusters = range(2, 51)

scores = []

for n_clusters in range_n_clusters:

	# Initialize the clusterer with n_clusters value and a random generator
	# seed of 10 for reproducibility.
	clusterer = KMeans(n_clusters=n_clusters, random_state=12)
	cluster_labels = clusterer.fit_predict(X)

	# The silhouette_score gives the average value for all the samples.
	# This gives a perspective into the density and separation of the formed
	# clusters
	silhouette_avg = silhouette_score(X, cluster_labels)
	scores.append([silhouette_avg, n_clusters])

scores.sort(key=lambda pair: pair[0], reverse=True)

print('silhouette_score', 'n_clusters')
scores

silhouette_score n_clusters


[[0.1645036746873116, 2],
 [0.032145666012992694, 3],
 [0.019813300805131075, 5],
 [0.013888919071575868, 16],
 [0.013707021884542094, 11],
 [0.009837686319537955, 12],
 [0.007202515109708647, 17],
 [0.007086835662338002, 7],
 [0.00425035930763325, 9],
 [0.004195991360808, 14],
 [0.003350143400377345, 6],
 [0.0027585887397284293, 19],
 [-0.0013584385494960307, 13],
 [-0.00211653008543583, 21],
 [-0.006083926120016455, 31],
 [-0.0063007647810396465, 29],
 [-0.006696605515677224, 50],
 [-0.007005319940733917, 4],
 [-0.007314620211392795, 43],
 [-0.0073451851733283635, 41],
 [-0.007853666274786234, 30],
 [-0.008729190778641961, 18],
 [-0.008898006210370008, 35],
 [-0.00986452479883211, 38],
 [-0.020808230768674454, 10],
 [-0.021187899316840623, 27],
 [-0.022399444977435255, 26],
 [-0.03028397567771695, 42],
 [-0.03440994272689798, 36],
 [-0.04029871971999248, 22],
 [-0.04253979679363504, 8],
 [-0.044472789320857874, 37],
 [-0.04529242398961809, 45],
 [-0.04807764600003183, 28],
 [-0.04932

In [324]:
# TODO create the clusters using the opitmal number of clusters obtained before

kmeans = KMeans(n_clusters=2, random_state=12)
kmeans.fit(X)
clustered_X = kmeans.predict(X)

for i, doc in enumerate(clustered_X):

	print('\nLabel: ', df.iloc[i]['label'])
	print('Cluster: ', doc)
	print('Label: ', df.iloc[i]['text'])

# TODO compare the documents in cluster "1" under the two specifications, does the cluster look cleaner after having searched for the optimal number of clusters?

# Now we get two clear clusters, one containing nearly all documents article and a second one containing all the documents with encoding errors and URLs
# This is a nice separation but not the result we hopped for!


Label:  sport
Cluster:  1
Label:  Mets challenge intrigues Martinez The former Red Sox ace formalized a 53 (m) million dollar, four-year contract with the New York Mets today (Thursday) and embraced the idea of helping rebuild a team that has fallen on hard times.

Label:  sci/tech
Cluster:  1
Label:  How Science Abuses Politics Patrick J. Michaels is a senior fellow at the Cato Institute and author of Meltdown: The Predictable Distortion of Global Warming by Scientists, Politicians, and the Media.

Label:  business
Cluster:  1
Label:  Ofgem exposes gas supply problems Energy regulator Ofgem blames high oil prices and problems with the UK's supply of gas for a steep rise in wholesale gas prices.

Label:  business
Cluster:  1
Label:  TI plans a buyback and boosts dividend No.1 maker of chips used in cell phones announces \$1B stock repurchase and raises payout nearly 18. NEW YORK (CNN/Money) - Texas Instruments, the world #39;s largest maker of chips used in cell phones, announced 

La

## Topic Modeling: LDA

For this part you will need to use LDA Mallet. If you cannot have Mallet run, you can use the simple LDA algorithm 

In [327]:
from gensim.corpora import Dictionary
from gensim.models.wrappers import LdaMallet
from gensim.models.coherencemodel import CoherenceModel

##TODO create a dictionary with the pre-processed tokenized text and filter it according to frequencies and keeping 1000 vocabularies

df

##TODO create the doc_term_matrix


ModuleNotFoundError: No module named 'gensim.models.wrappers'

In [None]:
##TODO train a LDA Mallet model with 5, 10 and 15 topics
##TODO compute the coherence score for each of these model and print the topics from the model with highest coherence score

In [None]:
import pyLDAvis.gensim
##TODO using LDAvis visualize the topics using the optimal number of topics