## Creating the distance matrix

In this notebook we do the following:
* Build a dictionary **D1** of words with gensim, using the 20 Newsgroups corpus as the input dataset
* Create a gensim bag-of-words representation of the 20 Newsgroups corpus based on this new dictionary
* Save the results using pickle


In [1]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

### First, clean up the 20 Newsgroups dataset. We will use it to fit LDA.

In [2]:
from string import punctuation
from nltk import RegexpTokenizer
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from sklearn.datasets import fetch_20newsgroups
import numpy as np
import pandas as pd

newsgroups = fetch_20newsgroups()

In [3]:
newsgroups.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

We will keep the following 6 groups:
* comp.graphics (1)
* comp.os.ms-windows.misc (2)
* rec.autos (7)
* rec.motorcycles (8)
* sci.med (13)
* sci.space (14)

In [4]:
targets_to_keep = [1, 2, 7, 8, 13, 14]
nb_of_documents_class = 100
newsgroups.target

array([7, 4, 4, ..., 3, 1, 8])

In [5]:
[newsgroups.target_names[t] for t in targets_to_keep]

['comp.graphics',
 'comp.os.ms-windows.misc',
 'rec.autos',
 'rec.motorcycles',
 'sci.med',
 'sci.space']
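
The hard-coded indices can also be derived from the group names, which makes the selection easier to check by eye. A minimal sketch, assuming the `target_names` list printed above (copied here as a literal so the snippet runs on its own); `names_to_keep` and `derived_targets` are names introduced only for illustration:

```python
# Group names as printed by newsgroups.target_names above
target_names = [
    'alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc',
    'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x',
    'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball',
    'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med',
    'sci.space', 'soc.religion.christian', 'talk.politics.guns',
    'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc',
]

# Hypothetical helper list: the six groups to keep, by name
names_to_keep = ['comp.graphics', 'comp.os.ms-windows.misc', 'rec.autos',
                 'rec.motorcycles', 'sci.med', 'sci.space']

# Look up each name's position in target_names
derived_targets = [target_names.index(name) for name in names_to_keep]
print(derived_targets)  # [1, 2, 7, 8, 13, 14]
```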

In [6]:
data = pd.DataFrame({"text": newsgroups.data, "target": newsgroups.target})


In [7]:
data.head()

| | text | target |
|---|---|---|
| 0 | From: lerxst@wam.umd.edu (where's my thing)\nS... | 7 |
| 1 | From: guykuo@carson.u.washington.edu (Guy Kuo)... | 4 |
| 2 | From: twillis@ec.ecn.purdue.edu (Thomas E Will... | 4 |
| 3 | From: jgreen@amber (Joe Green)\nSubject: Re: W... | 1 |
| 4 | From: jcm@head-cfa.harvard.edu (Jonathan McDow... | 14 |


In [8]:
corpus_6classes = data[data["target"].isin(targets_to_keep)]

In [9]:
corpus_6classes.shape

(3554, 2)

In [10]:
corpus_6classes.target.value_counts()

8     598
7     594
13    594
14    593
2     591
1     584
Name: target, dtype: int64

In [11]:
# Take the first nb_of_documents_class (100) documents from each class
corpus_6classes_600docs = corpus_6classes.groupby("target").head(n=nb_of_documents_class)
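
Note that `groupby("target").head(n)` keeps the first `n` rows of each group in their original order; it does not draw a random sample. The selection rule can be sketched in plain Python as follows. The `rows` list and the `first_n_per_group` helper are made up for illustration and are not part of pandas:

```python
from collections import defaultdict

def first_n_per_group(rows, n):
    """Keep the first n (index, label) pairs seen for each label, in original order."""
    seen = defaultdict(int)  # label -> number of rows kept so far
    kept = []
    for index, label in rows:
        if seen[label] < n:
            kept.append((index, label))
            seen[label] += 1
    return kept

# Toy (row index, target) pairs
rows = [(0, 7), (3, 1), (4, 14), (6, 7), (8, 1), (9, 7)]
selected = first_n_per_group(rows, n=2)
print(selected)  # [(0, 7), (3, 1), (4, 14), (6, 7), (8, 1)]
```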

In [12]:
corpus_6classes_600docs.head()

| | text | target |
|---|---|---|
| 0 | From: lerxst@wam.umd.edu (where's my thing)\nS... | 7 |
| 3 | From: jgreen@amber (Joe Green)\nSubject: Re: W... | 1 |
| 4 | From: jcm@head-cfa.harvard.edu (Jonathan McDow... | 14 |
| 6 | From: bmdelane@quads.uchicago.edu (brian manni... | 13 |
| 8 | From: holmes7000@iscsvax.uni.edu\nSubject: WIn... | 2 |


In [13]:
from gensim.corpora import Dictionary

eng_stopwords = set(stopwords.words('english'))

tokenizer = RegexpTokenizer(r'\s+', gaps=True)  # split on runs of whitespace
stemmer = PorterStemmer()  # only needed if the optional stemming step below is enabled
translate_tab = {ord(p): " " for p in punctuation}  # map each punctuation character to a space

def text2tokens(raw_text):
    """Split the raw_text string into a list of lowercase tokens (stemming is currently disabled)."""
    clean_text = raw_text.lower().translate(translate_tab)
    tokens = [token.strip() for token in tokenizer.tokenize(clean_text)]
    tokens = [token for token in tokens if token not in eng_stopwords]
    # Optional stemming step, disabled here:
    # stemmed_tokens = [stemmer.stem(token) for token in tokens]
    # return [token for token in stemmed_tokens if len(token) > 2]  # skip short tokens
    return [token for token in tokens if len(token) > 2]  # skip short tokens

# Convert each document to a list of tokens
dataset = [text2tokens(txt) for txt in corpus_6classes_600docs['text']]

dictionary = Dictionary(documents=dataset, prune_at=None)
# Drop tokens that appear in fewer than 5 documents or in more than 30% of the documents
dictionary.filter_extremes(no_below=5, no_above=0.3, keep_n=None)
dictionary.compactify()


2021-06-05 15:36:26,377 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2021-06-05 15:36:26,653 : INFO : built Dictionary(28969 unique tokens: ['60s', '70s', 'addition', 'anyone', 'body']...) from 600 documents (total 108075 corpus positions)
2021-06-05 15:36:26,710 : INFO : Dictionary lifecycle event {'msg': "built Dictionary(28969 unique tokens: ['60s', '70s', 'addition', 'anyone', 'body']...) from 600 documents (total 108075 corpus positions)", 'datetime': '2021-06-05T15:36:26.654238', 'gensim': '4.0.1', 'python': '3.8.8 (default, Apr 13 2021, 19:58:26) \n[GCC 7.3.0]', 'platform': 'Linux-5.8.0-53-generic-x86_64-with-glibc2.10', 'event': 'created'}
2021-06-05 15:36:26,768 : INFO : discarding 26488 tokens: [('60s', 3), ('70s', 4), ('bricklin', 2), ('edu', 368), ('enlighten', 4), ('funky', 4), ('host', 268), ('lerxst', 2), ('lines', 597), ('neighborhood', 4)]...
2021-06-05 15:36:26,769 : INFO : keeping 2481 tokens which were in no less than 5 and no more than 180 (=30.0%
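
The log above reports how many tokens survive the document-frequency filter. As a rough sketch of the rule (not gensim's actual implementation), a token is kept when the number of documents containing it is at least `no_below` and at most `no_above` times the total number of documents. The `filter_by_doc_freq` helper and the toy documents below are hypothetical:

```python
from collections import Counter

def filter_by_doc_freq(docs, no_below, no_above):
    """Return the set of tokens whose document frequency lies within the bounds."""
    # Count each token at most once per document
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = int(no_above * len(docs))  # absolute upper bound on document frequency
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

toy_docs = [
    ["car", "engine", "lines"],
    ["car", "bike", "lines"],
    ["space", "orbit", "lines"],
    ["space", "engine", "lines"],
]
kept = filter_by_doc_freq(toy_docs, no_below=2, no_above=0.75)
print(sorted(kept))  # ['car', 'engine', 'space']
```

With 600 documents and `no_above=0.3`, the upper bound is 180 documents, which matches the value shown in the log.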

In [14]:
d2b_dataset = [dictionary.doc2bow(doc) for doc in dataset]  # convert each list of tokens to a bag-of-words representation
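
A bag-of-words vector is a list of `(token_id, count)` pairs covering only the tokens of a document that are present in the dictionary. A plain-Python sketch of the idea, using a hypothetical `token2id` mapping and a `to_bow` helper that only mimics the shape of what `doc2bow` returns:

```python
from collections import Counter

def to_bow(tokens, token2id):
    """Count known tokens and return (token_id, count) pairs sorted by id."""
    counts = Counter(token2id[tok] for tok in tokens if tok in token2id)
    return sorted(counts.items())

token2id = {"car": 0, "engine": 1, "space": 2}  # toy vocabulary
doc = ["car", "engine", "car", "bike"]  # "bike" is not in the vocabulary
bow = to_bow(doc, token2id)
print(bow)  # [(0, 2), (1, 1)]
```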

## Save the corpus and the dictionary

In [15]:
import pickle
# Save the corpus representation in gensim format
# and the corresponding dictionary
with open("./20newsgroup_corpus_gensim_6classes_600docs.pickle", 'wb') as f:
    pickle.dump(d2b_dataset, f)

with open("./dictionary_20newsgroups_6classes_600docs.pickle", 'wb') as f:
    pickle.dump(dictionary, f)
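
To reuse the saved objects in another notebook, they can be read back with `pickle.load`. A small round-trip sketch on toy data, written to a temporary directory; the file name `toy_corpus.pickle` and the `toy_corpus` list are hypothetical stand-ins:

```python
import os
import pickle
import tempfile

toy_corpus = [[(0, 2), (1, 1)], [(2, 3)]]  # stand-in for d2b_dataset

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_corpus.pickle")
    # Write the object in binary mode, then read it back
    with open(path, "wb") as f:
        pickle.dump(toy_corpus, f)
    with open(path, "rb") as f:
        loaded_corpus = pickle.load(f)

print(loaded_corpus == toy_corpus)  # True
```

Loading the pickled `Dictionary` additionally requires gensim to be importable in the loading environment, and pickle files should only be loaded from trusted sources.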

In [16]:
corpus_6classes_600docs.to_csv("./20newsgroup_corpus_6classes_600docs.csv")