## Creating the distance matrix

In this notebook we do the following:
* Build a dictionary **D1** of words with gensim, using the 20 Newsgroups corpus as the input dataset
* Create a gensim bag-of-words representation of the 20 Newsgroups corpus based on this new dictionary
* Save the results using pickle, and the sampled documents as CSV


In [1]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

### First, clean up the 20 Newsgroups dataset. We will use it to fit LDA.

In [2]:
from string import punctuation
from nltk import RegexpTokenizer
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from sklearn.datasets import fetch_20newsgroups
import numpy as np
import pandas as pd

newsgroups = fetch_20newsgroups()

Downloading 20news dataset. This may take a few minutes.
2021-05-27 11:26:25,928 : INFO : Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)
2021-05-27 11:26:25,931 : INFO : Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


In [3]:
targets_to_keep = [0, 1, 2, 13, 15]
nb_of_documents_class = 200
newsgroups.target

array([7, 4, 4, ..., 3, 1, 8])

In [4]:
[newsgroups.target_names[t] for t in targets_to_keep]

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'sci.med',
 'soc.religion.christian']

In [5]:
data = pd.DataFrame({"text": newsgroups.data, "target": newsgroups.target})

In [6]:
data.head()

| | text | target |
|---|---|---|
| 0 | From: lerxst@wam.umd.edu (where's my thing)\nS... | 7 |
| 1 | From: guykuo@carson.u.washington.edu (Guy Kuo)... | 4 |
| 2 | From: twillis@ec.ecn.purdue.edu (Thomas E Will... | 4 |
| 3 | From: jgreen@amber (Joe Green)\nSubject: Re: W... | 1 |
| 4 | From: jcm@head-cfa.harvard.edu (Jonathan McDow... | 14 |


In [7]:
corpus_5classes = data[data["target"].isin(targets_to_keep)]

In [8]:
corpus_5classes.shape

(2848, 2)

In [9]:
corpus_5classes.target.value_counts()

15    599
13    594
2     591
1     584
0     480
Name: target, dtype: int64

In [10]:
corpus_5classes_1000docs = corpus_5classes.groupby("target").sample(n=nb_of_documents_class)  # 200 documents per class
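
As an illustration of the per-class sampling in the cell above, here is a minimal sketch on a toy frame. The frame, the name `n_per_class`, and `random_state=0` are assumptions made for the example; the notebook itself samples without a fixed seed, so its rows will differ between runs.

```python
import pandas as pd

# Toy stand-in for the newsgroups frame: three classes of unequal size.
toy = pd.DataFrame({
    "text": [f"doc {i}" for i in range(12)],
    "target": [0] * 3 + [1] * 4 + [2] * 5,
})

# Draw the same number of rows from every class. random_state is an
# addition here (the original cell does not set one) so reruns match.
n_per_class = 2
balanced = toy.groupby("target").sample(n=n_per_class, random_state=0)

# Each class now contributes exactly n_per_class rows.
print(balanced["target"].value_counts().sort_index().to_dict())
```

Passing a seed is optional, but it is probably worth considering if the saved corpus needs to be regenerated identically later.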

In [11]:
corpus_5classes_1000docs

| | text | target |
|---|---|---|
| 9688 | From: naren@tekig1.PEN.TEK.COM (Naren Bala)\nS... | 0 |
| 2046 | From: keith@cco.caltech.edu (Keith Allan Schne... | 0 |
| 4204 | Subject: Re: A visit from the Jehovah's Witnes... | 0 |
| 1325 | From: ingles@engin.umich.edu (Ray Ingles)\nSub... | 0 |
| 3885 | From: bcash@crchh410.NoSubdomain.NoDomain (Bri... | 0 |
| ... | ... | ... |
| 4381 | From: reedr@cgsvax.claremont.edu\nSubject: Re:... | 15 |
| 11093 | From: marka@hcx1.ssd.csd.harris.com (Mark Ashl... | 15 |
| 6019 | From: sun075!Gerry.Palo@uunet.uu.net (Gerry Pa... | 15 |
| 1972 | From: ss6349@csc.albany.edu (Steven H. Schimmr... | 15 |


In [12]:
corpus_5classes_1000docs = corpus_5classes_1000docs.sample(frac=1)

In [13]:
corpus_5classes_1000docs

| | text | target |
|---|---|---|
| 3493 | From: cpage@two-step.seas.upenn.edu (Carter C.... | 15 |
| 1410 | From: Petch@gvg47.gvg.tek.com (Chuck Petch)\nS... | 15 |
| 9069 | From: jaeger@buphy.bu.edu (Gregg Jaeger)\nSubj... | 0 |
| 6389 | From: mhembruc@tsegw.tse.com (Mattias Hembruch... | 2 |
| 9033 | From: rolfe@junior.dsu.edu (Tim Rolfe)\nSubjec... | 15 |
| ... | ... | ... |
| 51 | From: dlecoint@garnet.acns.fsu.edu (Darius_Lec... | 15 |
| 4343 | From: hodge@iccgcc.decnet.ab.com\nSubject: Re:... | 2 |
| 4713 | From: vbv@r2d2.eeap.cwru.edu (Virgilio (Dean) ... | 15 |
| 7568 | From: jbrandt@NeoSoft.com (J Brandt)\nSubject:... | 1 |


In [65]:
eng_stopwords = set(stopwords.words('english'))  # requires a one-time nltk.download('stopwords')

tokenizer = RegexpTokenizer(r'\s+', gaps=True)  # split on runs of whitespace
stemmer = PorterStemmer()  # only needed if the stemming lines below are re-enabled
translate_tab = {ord(p): " " for p in punctuation}  # map each punctuation character to a space

def text2tokens(raw_text):
    """Split the raw_text string into a list of lowercase tokens (stemming is currently disabled)."""
    clean_text = raw_text.lower().translate(translate_tab)
    tokens = [token.strip() for token in tokenizer.tokenize(clean_text)]
    tokens = [token for token in tokens if token not in eng_stopwords]
    # Optional stemming step, left commented out:
    # stemmed_tokens = [stemmer.stem(token) for token in tokens]
    # return [token for token in stemmed_tokens if len(token) > 2]  # skip short tokens
    return [token for token in tokens if len(token) > 2]  # skip short tokens

dataset = [text2tokens(txt) for txt in corpus_5classes_1000docs['text']]  # convert each document to a list of tokens

from gensim.corpora import Dictionary
dictionary = Dictionary(documents=dataset, prune_at=None)
dictionary.filter_extremes(no_below=5, no_above=0.3, keep_n=None)  # drop tokens found in fewer than 5 documents or in more than 30% of them
dictionary.compactify()


2021-02-12 16:59:18,370 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2021-02-12 16:59:18,713 : INFO : built Dictionary(30750 unique tokens: ['allan', 'alters', 'animal', 'animals', 'argue']...) from 1000 documents (total 192272 corpus positions)
2021-02-12 16:59:18,763 : INFO : discarding 26356 tokens: [('alters', 1), ('behaviors', 2), ('bred', 2), ('com', 401), ('domesticated', 2), ('domestication', 1), ('edu', 683), ('exhibit', 2), ('host', 353), ('lines', 997)]...
2021-02-12 16:59:18,764 : INFO : keeping 4394 tokens which were in no less than 5 and no more than 300 (=30.0%) documents
2021-02-12 16:59:18,777 : INFO : resulting dictionary: Dictionary(4394 unique tokens: ['allan', 'animal', 'animals', 'argue', 'atheists']...)
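
The log above reports the rule applied by the filtering step: a token is kept if it appears in no fewer than 5 and no more than 300 (30% of 1000) documents. The following is a simplified, pure-Python sketch of that document-frequency rule, not gensim's own implementation. The helper name `filter_by_document_frequency` and the toy documents are hypothetical.

```python
from collections import Counter

# Hypothetical tokenized documents (not taken from the corpus).
docs = [
    ["god", "exists", "argument"],
    ["windows", "driver", "argument"],
    ["god", "faith", "argument"],
    ["windows", "file", "argument"],
]

def filter_by_document_frequency(docs, no_below, no_above):
    """Keep tokens found in at least `no_below` documents and in at most
    `no_above` (a fraction) of all documents."""
    # Count each token once per document, however often it repeats inside it.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    return sorted(t for t, df in doc_freq.items() if no_below <= df <= max_docs)

# "argument" is in every document (too frequent); "exists", "driver",
# "faith" and "file" are each in one document only (too rare).
kept = filter_by_document_frequency(docs, no_below=2, no_above=0.5)
print(kept)  # ['god', 'windows']
```

In the notebook's run, the same kind of rule discards very common header tokens such as `com`, `edu`, and `lines` along with tokens seen in only one or two documents.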


In [66]:
d2b_dataset = [dictionary.doc2bow(doc) for doc in dataset]  # convert each list of tokens to its bag-of-words representation
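
To make the shape of the bag-of-words output concrete, here is a small pure-Python sketch of the idea behind `doc2bow`: each document becomes a sorted list of `(token_id, count)` pairs, and tokens missing from the dictionary are dropped. The mapping `token2id` and the helper `to_bow` are hypothetical stand-ins, not gensim objects.

```python
from collections import Counter

# Hypothetical token-to-id mapping standing in for a fitted dictionary.
token2id = {"animal": 0, "argue": 1, "atheists": 2}

def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, ignoring unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# "com" is not in the mapping, so it does not appear in the result.
bow = to_bow(["atheists", "argue", "animal", "argue", "com"], token2id)
print(bow)  # [(0, 1), (1, 2), (2, 1)]
```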

## Save the corpus and the dictionary

In [67]:
import pickle
# Save the corpus representation in gensim format
# and the corresponding dictionary
with open("./20newsgroup_corpus_gensim_5classes_1000docs.pickle", 'wb') as f:
    pickle.dump(d2b_dataset, f)

with open("./dictionary_20newsgroups_5classes_1000docs.pickle", 'wb') as f:
    pickle.dump(dictionary, f)
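
For completeness, a later notebook would read these files back with `pickle.load`. The sketch below shows the round trip on a stand-in corpus written to a temporary directory; the file name and the data are illustrative assumptions, not the files saved above.

```python
import os
import pickle
import tempfile

# Stand-in for d2b_dataset: one list of (token_id, count) pairs per document.
bow_corpus = [[(0, 1), (3, 2)], [(1, 4)]]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "corpus.pickle")  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(bow_corpus, f)
    # Reading back uses binary mode as well; only unpickle files you trust.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == bow_corpus)  # True
```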

In [69]:
corpus_5classes_1000docs.to_csv("./20newsgroup_corpus_5classes_1000docs.csv")
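
By default `to_csv` also writes the row index as an unnamed first column, so reading the file back with default settings yields an extra column. The sketch below, on a toy frame and an in-memory buffer (both assumptions for the example), shows one way to restore the original index with `index_col=0`.

```python
import io

import pandas as pd

# Toy frame with a non-default index, similar in shape to the sampled corpus.
frame = pd.DataFrame({"text": ["a", "b"], "target": [15, 0]}, index=[3493, 9069])

buffer = io.StringIO()
frame.to_csv(buffer)  # the index is written as an unnamed first column

buffer.seek(0)
naive = pd.read_csv(buffer)  # the index comes back as a regular column

buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)  # first column used as the index

print(list(naive.columns))     # ['Unnamed: 0', 'text', 'target']
print(list(restored.columns))  # ['text', 'target']
```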