# Other Tools: Gensim

DS 5001 Text as Data

# Overview

Gensim -- which bills itself as "topic modeling for humans" -- provides a number of useful modeling tools, but (unfortunately) its data model is built from plain Python lists and dictionaries.

This notebook shows how to work with Gensim's data model in order to use its host of tools, and how to convert the results to Pandas data frames.
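To make the data model concrete, here is a minimal plain-Python sketch of the three structures the rest of the notebook builds: a corpus as a list of token lists, a dictionary mapping terms to integer IDs, and a bag-of-words as a list of `(term_id, count)` pairs per document. The toy tokens and the `toy_` variable names are illustrative assumptions, and Gensim itself is not used here.

```python
# Toy illustration of the list-and-dictionary data model (no Gensim required).
toy_texts = [
    ["human", "machine", "interface"],
    ["machine", "learning", "machine"],
]

# A "dictionary" maps each distinct term to an integer ID,
# assigned here in order of first appearance.
toy_token2id = {}
for text in toy_texts:
    for token in text:
        toy_token2id.setdefault(token, len(toy_token2id))

# A bag-of-words document is a sorted list of (term_id, count) pairs.
toy_bow = []
for text in toy_texts:
    counts = {}
    for token in text:
        term_id = toy_token2id[token]
        counts[term_id] = counts.get(term_id, 0) + 1
    toy_bow.append(sorted(counts.items()))

print(toy_token2id)
print(toy_bow)
```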

# Set Up

## Config

In [1]:
import configparser
config = configparser.ConfigParser()
config.read("../../../env.ini")
data_home = config['DEFAULT']['data_home']
output_dir = config['DEFAULT']['output_dir']
local_lib = config['DEFAULT']['local_lib']
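
The cell above expects an `env.ini` file with a `[DEFAULT]` section holding three keys. As a sketch of what such a file might contain, the example below parses an equivalent string with `configparser`; the paths are made-up placeholders, not the course's actual values.

```python
import configparser

# Hypothetical contents of env.ini; the three paths are placeholders.
example_ini = """
[DEFAULT]
data_home = /path/to/data
output_dir = /path/to/output
local_lib = /path/to/lib
"""

example_config = configparser.ConfigParser()
example_config.read_string(example_ini)

# Values are read back the same way as in the cell above.
print(example_config['DEFAULT']['data_home'])
```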

In [2]:
num_topics = 100
data_dir = f"{data_home}/newsgroups/20news-18828" # Get the archive 20news-18828.tar.gz from Dropbox

## Imports

In [3]:
import pandas as pd
import numpy as np
from gensim import corpora, models
from collections import defaultdict
import plotly_express as px
from glob import glob
import re 

## Import Data

We import data from a newsgroup collection.

Get the archive `20news-18828.tar.gz` from the course's Dropbox site.

In [4]:
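def import_data():
    """Read every newsgroup post under data_dir into a data frame."""
    data = []
    for d in glob(data_dir+"/*"):
        label = d.split("/")[-1]  # The directory name is the newsgroup label
        print(label)
        for f in glob(d+"/*"):
            fid = f.split("/")[-1]  # The file name is the numeric document ID
            # Use a context manager so each file is closed after reading
            with open(f, 'r', encoding="latin-1") as infile:
                flines = infile.read().split("\n")
            # The first two lines hold the sender and subject; keep what follows the first colon
            from_line = ':'.join(flines[0].split(':')[1:])
            subj_line = ':'.join(flines[1].split(':')[1:])
            data.append((fid, label, from_line, subj_line, ' '.join(flines[2:])))
    LIB = pd.DataFrame(data, columns=['doc_id','doc_label','doc_from', 'doc_subj', 'doc_content'])
    LIB.doc_id = LIB.doc_id.astype('int')
    LIB = LIB.set_index(['doc_label','doc_id'])
    return LIB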
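
The header handling in `import_data()` can be seen in isolation on a single message. The sketch below uses a made-up post (the sender, subject, and body are invented for illustration) and shows why the function rejoins the pieces on `:`: a subject such as `Re: ...` contains a colon of its own.

```python
# A made-up message laid out like the files read above:
# a sender line, a subject line, then the body.
sample = (
    "From: someone@example.com (Some One)\n"
    "Subject: Re: a sample subject\n"
    "\n"
    "Body line one.\n"
    "Body line two."
)

sample_lines = sample.split("\n")

# Drop the field name before the first colon, but keep any later colons.
sample_from = ':'.join(sample_lines[0].split(':')[1:])
sample_subj = ':'.join(sample_lines[1].split(':')[1:])
sample_content = ' '.join(sample_lines[2:])

print(repr(sample_from))
print(repr(sample_subj))
print(repr(sample_content))
```

Note that both header values keep the space that follows the colon; calling `.strip()` on them would remove it.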

In [5]:
LIB = import_data()

alt.atheism
talk.politics.mideast
rec.autos
comp.os.ms-windows.misc
rec.motorcycles
talk.politics.guns
sci.electronics
rec.sport.baseball
rec.sport.hockey
sci.med
comp.graphics
sci.space
comp.windows.x
misc.forsale
comp.sys.mac.hardware
talk.religion.misc
sci.crypt
comp.sys.ibm.pc.hardware
talk.politics.misc
soc.religion.christian


In [6]:
LIB

doc_label,doc_id,doc_from,doc_subj,doc_content
alt.atheism,53400,"acooper@mac.cc.macalstr.edu (Turin Turambar, ...",Re: free moral agency,In article <735295730.25282@minster.york.ac.u...
alt.atheism,53099,sandvik@newton.apple.com (Kent Sandvik),Re: some thoughts.,In article <bissda.4.734849678@saturn.wwc.edu...
alt.atheism,53363,"""Robert Knowles"" <p00261@psilink.com>",Re: Amusing atheists and agnostics,>DATE: 20 Apr 93 05:23:15 GMT >FROM: Bake...
alt.atheism,53314,cjhs@minster.york.ac.uk,Re: free moral agency,: Are you saying that their was a physical Ad...
alt.atheism,54243,ed@wente.llnl.gov (Ed Suranyi),Re: Asimov stamp,In article <C61H4H.8D4@dcs.ed.ac.uk> pdc@dcs....
...,...,...,...,...
soc.religion.christian,20818,shellgate!llo@uu4.psi.com (Larry L. Overacker),Re: Easter: what's in a name? (was Re: New Te...,In article <Apr.14.03.09.10.1993.5497@athos.r...
soc.religion.christian,20977,dxf12@po.cwru.edu (Douglas Fowler),Re: Christian Parenting,"Sorry for posting this, but my e-mail k..."
soc.religion.christian,20784,marka@hcx1.ssd.csd.harris.com (Mark Ashley),Re: hearing sinners,In article <Apr.21.03.24.19.1993.1271@geneva....
soc.religion.christian,20871,news@cbnewsk.att.com,Re: An agnostic's question,In article <Apr.17.01.11.16.1993.2265@geneva....


In [7]:
LIB.to_csv(f"{output_dir}/newsgroups-LIB.csv")

# Pre-Process the Gensim Way

## Stopwords

We create a small set of frequent words to serve as stopwords. Of course, we could also grab a premade list from somewhere else, such as NLTK.

In [8]:
stoplist = set('for a of the and to in is i that it you this be on are'.split(' '))
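
Instead of typing the list by hand, a stoplist can also be derived from the data by taking the most frequent words. The following is a minimal sketch on three made-up sentences; the cutoff `k` is an arbitrary assumption, and on a real corpus the result should be inspected before use.

```python
from collections import Counter

# Made-up documents for illustration.
toy_docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat and the dog on a log",
]

# Count every lowercased, whitespace-delimited word in the collection.
word_counts = Counter(word for doc in toy_docs for word in doc.lower().split())

# Treat the k most frequent words as stopwords; k is an arbitrary choice.
k = 2
toy_stoplist = {word for word, _ in word_counts.most_common(k)}
print(toy_stoplist)
```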

## Corpus

A corpus is just **a list of lists of words.**

We loop through the list of docs and do some parsing and shaping on the fly. 

Again, we could do better with tools from NLTK.

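Here we lowercase each document, split it on whitespace, drop stopwords, and strip non-alphanumeric characters from the remaining tokens. Because the stopword test runs before punctuation is stripped, a token such as `the,` is not caught, and tokens made up entirely of punctuation become empty strings, which is why a blank term shows up in the topic table later on.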

In [9]:
texts = [[re.sub(r"[\W_]+", "", word) for word in document.lower().split() if word not in stoplist]
         for document in LIB.doc_content.values]
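
A variant worth considering strips punctuation first and only then removes stopwords and empty strings, so that tokens like `the,` or `--` do not slip through. This is a sketch rather than a drop-in replacement: the helper name `tokenize` and the toy stoplist are assumptions, and using it on the newsgroup data would change the vocabulary and therefore the outputs shown in this notebook.

```python
import re

# A shortened stoplist for the example.
toy_stoplist = set("for a of the and to in is".split())

def tokenize(document, stoplist):
    """Lowercase, split on whitespace, strip non-alphanumerics, then filter."""
    stripped = (re.sub(r"[\W_]+", "", word) for word in document.lower().split())
    # Drop empty strings (tokens that were all punctuation) and stopwords.
    return [token for token in stripped if token and token not in stoplist]

print(tokenize("The cat -- and THE dog, in the end.", toy_stoplist))
```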

## Term Frequencies

We count term frequencies by iteration.

We use `defaultdict` to allow for the dynamic creation of keys.

We do this now in order to filter out low-frequency words.

In [10]:
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1
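
The same counts can be obtained with `collections.Counter`, which is a dictionary subclass built for this purpose. A minimal sketch on toy data (the token lists are made up):

```python
from collections import Counter

# Made-up token lists standing in for `texts`.
toy_texts = [["apple", "banana", "apple"], ["banana", "cherry"]]

# Flatten the list of lists and count each token in one pass.
toy_frequency = Counter(token for text in toy_texts for token in text)

print(toy_frequency.most_common())
# Like defaultdict(int), a Counter returns 0 for unseen keys.
print(toy_frequency["missing"])
```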

## Filtered Corpus

Now we filter our corpus by frequency, removing words that appear only once in the whole collection.

We could of course use any threshold we want.

In [11]:
filtered_corpus = [[token for token in text if frequency[token] > 1] for text in texts]

## Dictionary

The vocabulary table (VOCAB) is a "dictionary" which associates a term string with a numeric identifier.

Gensim provides a function to create this.

In [12]:
dictionary = corpora.Dictionary(filtered_corpus)

In [41]:
# dictionary.most_common()

## BOW

Gensim also provides a function to create a bag-of-words.

Note that since Gensim has no notion of an OHCO, the bag is always the document: each inner list in the list of lists becomes one bag.

We create the BOW corpus from the text using the dictionary.

In [13]:
bow_corpus = [dictionary.doc2bow(text) for text in filtered_corpus]
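
Each BOW document is a list of `(term_id, count)` pairs, which is hard to read without the vocabulary. The sketch below decodes one such document using toy stand-ins for the dictionary and the BOW; the terms and counts are made up. With a real Gensim dictionary, `dictionary[term_id]` performs the same lookup.

```python
# Toy stand-ins for dictionary.token2id and one entry of bow_corpus.
toy_token2id = {"god": 0, "writes": 1, "article": 2}
toy_bow_doc = [(0, 17), (2, 3)]

# Invert the mapping so term IDs can be turned back into strings.
toy_id2token = {term_id: token for token, term_id in toy_token2id.items()}

# Replace each term ID with its string, keeping the count.
readable = [(toy_id2token[term_id], count) for term_id, count in toy_bow_doc]
print(readable)
```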

In [14]:
# bow_corpus[0]

# Train Models

Now we can apply Gensim's models.

## TFIDF

In [15]:
tfidf = models.TfidfModel(bow_corpus)
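
With its default settings, `TfidfModel` is documented to weight each raw count by log2(N / df) and then scale each document vector to unit length. The sketch below reproduces that arithmetic in plain Python on a toy BOW corpus. It is an illustration under that assumption about the defaults, not Gensim's implementation, and the names `toy_bow_corpus` and `toy_tfidf` are hypothetical.

```python
import math

# Toy BOW corpus: three documents over term IDs 0, 1 and 2.
toy_bow_corpus = [
    [(0, 2), (1, 1)],
    [(0, 1), (2, 3)],
    [(0, 1), (1, 1), (2, 1)],
]
n_docs = len(toy_bow_corpus)

# Document frequency: the number of documents each term appears in.
df = {}
for doc in toy_bow_corpus:
    for term_id, _ in doc:
        df[term_id] = df.get(term_id, 0) + 1

def toy_tfidf(doc):
    """Weight counts by log2(N / df), then scale the vector to unit length."""
    weights = [(term_id, count * math.log2(n_docs / df[term_id]))
               for term_id, count in doc]
    norm = math.sqrt(sum(w * w for _, w in weights))
    if norm == 0:
        return []
    # A term found in every document has weight 0 and is left out.
    return [(term_id, w / norm) for term_id, w in weights if w > 0]

for doc in toy_bow_corpus:
    print(toy_tfidf(doc))
```

Term 0 occurs in all three toy documents, so its weight is log2(3/3) = 0 and it disappears from every output vector.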

## LDA

In [17]:
lda_model = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=num_topics)

## HDP

In [18]:
hdp_model = models.HdpModel(bow_corpus, id2word=dictionary)

# Convert to Pandas

### VOCAB

In [19]:
VOCAB = pd.DataFrame(list(dictionary.token2id.items()), columns=['term_str','term_id'])
VOCAB['n'] = VOCAB.term_str.map(lambda x: frequency[x])
VOCAB = VOCAB.set_index('term_id').sort_index()

In [20]:
VOCAB.sample(5)

term_id,term_str,n
23434,seeker,9
20551,shtendal,3
55560,lightwave,24
56132,mannikin,4
62409,suppposed,2


### TFIDF

In [21]:
tfidf_data = []
for doc_id, doc in enumerate(bow_corpus):
    for term in tfidf[doc]:
        tfidf_data.append((doc_id, term[0], term[1]))
TFIDF = pd.DataFrame(tfidf_data, columns=['doc_id','term_id', 'tfidf']).set_index(['doc_id','term_id'])

In [22]:
TFIDF.tfidf.unstack(fill_value=0)

doc_id / term_id,0,1,2,3,4,5,6,7,8,9,...,79154,79155,79156,79157,79158,79159,79160,79161,79162,79163
0,0.071297,0.119306,0.21114,0.195145,0.329207,0.158006,0.045368,0.022169,0.018710,0.040743,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000
1,0.029987,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000,0.016722,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000
2,0.033071,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000
3,0.023712,0.134912,0.00000,0.220670,0.248178,0.178674,0.051302,0.025069,0.000000,0.046073,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000
4,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.024819,0.041894,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18823,0.006743,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.023761,0.010027,0.021835,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.243452,0.000000
18824,0.027113,0.000000,0.00000,0.000000,0.000000,0.000000,0.032589,0.031849,0.000000,0.029267,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.163157
18825,0.017462,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.018460,0.015580,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000
18826,0.000000,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.011747,0.009914,0.021589,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000


### BOW

In [23]:
bow_data = []
for i, doc in enumerate(bow_corpus):
    for term in doc:
        bow_data.append((i, term[0], term[1]))
BOW = pd.DataFrame(bow_data, columns=['doc_id','term_id', 'n']).set_index(['doc_id','term_id'])     
DTM = BOW.n.unstack(fill_value=0)
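
The step from the long BOW table to the wide document-term matrix is easier to see on a tiny example. This sketch uses three made-up `(doc_id, term_id, n)` triples and the same `set_index` and `unstack` calls as above; the `toy_` names are assumptions.

```python
import pandas as pd

# Three made-up (doc_id, term_id, n) triples in long format.
toy_bow_data = [(0, 0, 2), (0, 1, 1), (1, 1, 3)]
toy_BOW = pd.DataFrame(toy_bow_data, columns=['doc_id', 'term_id', 'n'])\
    .set_index(['doc_id', 'term_id'])

# unstack() moves term_id from the row index to the columns;
# fill_value=0 supplies a count for document-term pairs that never occurred.
toy_DTM = toy_BOW.n.unstack(fill_value=0)
print(toy_DTM)
```

Document 1 never uses term 0, so that cell is filled with 0 rather than left missing.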

In [24]:
BOW.head()

doc_id,term_id,n
0,0,17
0,1,1
0,2,1
0,3,1
0,4,3


In [25]:
DTM.head()

doc_id / term_id,0,1,2,3,4,5,6,7,8,9,...,79154,79155,79156,79157,79158,79159,79160,79161,79162,79163
0,17,1,1,1,3,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
1,8,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,5,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,5,1,0,1,2,1,1,1,0,1,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,1,2,0,...,0,0,0,0,0,0,0,0,0,0


### LDA

#### PHI

In [26]:
PHI = pd.DataFrame(lda_model.get_topics()).T
PHI.index.name = 'term_id'
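
`get_topics()` returns a matrix with one row per topic and one column per term, where each row is a probability distribution over the vocabulary. Transposing it, as above, gives a term-by-topic table whose columns each sum to 1. A plain-Python sketch with made-up probabilities:

```python
import math

# Toy stand-in for lda_model.get_topics(): 2 topics (rows) by 3 terms (columns).
toy_topics = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
]

# Transpose so that rows are terms and columns are topics, like PHI.
toy_phi = [list(column) for column in zip(*toy_topics)]
print(toy_phi)

# Each topic column should still sum to 1 after the transpose.
column_totals = [sum(row[t] for row in toy_phi) for t in range(len(toy_topics))]
print(column_totals)
```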

In [27]:
PHI

term_id,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,0.005223,0.012919,6.400142e-02,5.819460e-02,0.002032,0.002667,1.923255e-02,5.289441e-03,0.006496,2.934387e-01,...,0.009142,0.003940,0.005411,2.043424e-02,0.003967,0.004276,5.863382e-02,0.004488,0.014855,7.094335e-03
1,0.000007,0.000002,1.736738e-04,9.071235e-07,0.000006,0.001498,1.567797e-07,8.412390e-08,0.000006,9.698412e-06,...,0.000002,0.000002,0.000001,2.772506e-08,0.000005,0.000005,2.513385e-06,0.027087,0.000003,7.884551e-04
2,0.000007,0.000002,2.092211e-07,9.071235e-07,0.000006,0.000006,1.539227e-07,8.412390e-08,0.000006,2.791612e-08,...,0.000002,0.000002,0.000001,2.772506e-08,0.000005,0.000005,3.403703e-07,0.000007,0.000003,5.500826e-07
3,0.000009,0.000002,2.092211e-07,9.071235e-07,0.000006,0.000006,1.565939e-07,8.412390e-08,0.000006,2.791612e-08,...,0.000002,0.000002,0.000001,2.772506e-08,0.000005,0.000005,3.235131e-07,0.000007,0.000003,5.694360e-07
4,0.000008,0.000038,2.337896e-05,8.724458e-06,0.000017,0.000020,2.878167e-06,6.345465e-07,0.000006,3.573125e-05,...,0.000097,0.000063,0.000001,9.630675e-07,0.000005,0.000011,1.090347e-03,0.000022,0.000005,5.349657e-06
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
79159,0.000007,0.000002,2.092211e-07,9.071235e-07,0.000006,0.000006,1.539227e-07,8.412390e-08,0.000006,2.791612e-08,...,0.000002,0.000002,0.000001,2.772506e-08,0.000005,0.000005,9.501137e-08,0.000007,0.000003,1.479872e-07
79160,0.000007,0.000002,2.092211e-07,9.071235e-07,0.000006,0.000006,1.539227e-07,8.412390e-08,0.000006,2.791612e-08,...,0.000002,0.000002,0.000001,2.772506e-08,0.000005,0.000005,9.501137e-08,0.000007,0.000003,1.479872e-07
79161,0.000007,0.000002,2.092211e-07,9.071235e-07,0.000006,0.000006,1.539227e-07,8.412390e-08,0.000006,2.791612e-08,...,0.000002,0.000002,0.000001,2.772506e-08,0.000005,0.000005,9.501137e-08,0.000007,0.000003,1.479872e-07
79162,0.000007,0.000002,2.092211e-07,9.071235e-07,0.000006,0.000006,1.539227e-07,8.412390e-08,0.000006,2.791612e-08,...,0.000002,0.000002,0.000001,2.772506e-08,0.000005,0.000005,9.501137e-08,0.000007,0.000003,1.479872e-07


#### THETA

In [28]:
theta_data = []
for doc_id, doc_bow in enumerate(bow_corpus):
    for topic in lda_model.get_document_topics(doc_bow):
        theta_data.append((doc_id, topic[0], topic[1]))
THETA = pd.DataFrame(theta_data, columns=['doc_id', 'topic_id', 'topic_weight']).set_index(['doc_id','topic_id']).unstack(fill_value=0)
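
`get_document_topics()` returns a sparse list of `(topic_id, weight)` pairs and, by default, omits topics whose weight falls below the model's `minimum_probability` threshold. That is why most cells in THETA are filled with 0 and why a row need not sum to exactly 1. A plain-Python sketch of the same sparse-to-dense step, with made-up weights and hypothetical `toy_` names:

```python
toy_num_topics = 4

# Made-up sparse output for two documents: only topics above the threshold appear.
toy_doc_topics = [
    [(1, 0.62), (3, 0.35)],
    [(0, 0.97)],
]

# Expand each sparse list into a dense row, filling missing topics with 0.
toy_theta = []
for pairs in toy_doc_topics:
    row = [0.0] * toy_num_topics
    for topic_id, weight in pairs:
        row[topic_id] = weight
    toy_theta.append(row)

print(toy_theta)
```

Passing a smaller `minimum_probability` to `get_document_topics()` should return more topics per document, at the cost of a denser table.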

In [29]:
THETA

doc_id / topic_id (topic_weight),0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.000,0.0,0.253033,...,0.000000,0.0,0.0,0.000000,0.0,0.0,0.283775,0.013299,0.000000,0.000000
1,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.000,0.0,0.176715,...,0.000000,0.0,0.0,0.670422,0.0,0.0,0.000000,0.000000,0.000000,0.000000
2,0.0,0.0,0.644239,0.0,0.0,0.0,0.0,0.000,0.0,0.000000,...,0.028767,0.0,0.0,0.000000,0.0,0.0,0.083418,0.000000,0.000000,0.000000
3,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.000,0.0,0.000000,...,0.000000,0.0,0.0,0.000000,0.0,0.0,0.318804,0.019107,0.000000,0.000000
4,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.000,0.0,0.000000,...,0.000000,0.0,0.0,0.203613,0.0,0.0,0.125377,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18823,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.000,0.0,0.000000,...,0.000000,0.0,0.0,0.242695,0.0,0.0,0.000000,0.000000,0.000000,0.000000
18824,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.000,0.0,0.024523,...,0.000000,0.0,0.0,0.000000,0.0,0.0,0.059752,0.000000,0.000000,0.013536
18825,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.000,0.0,0.000000,...,0.000000,0.0,0.0,0.030481,0.0,0.0,0.312731,0.000000,0.010349,0.000000
18826,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.014,0.0,0.000000,...,0.000000,0.0,0.0,0.067477,0.0,0.0,0.000000,0.000000,0.000000,0.020792


#### TOPIC

In [30]:
topic_data = []
for t in range(num_topics):
    for term_rank, term in enumerate(lda_model.get_topic_terms(t)):
        term_id = term[0]
        # Look terms up with dictionary[term_id]; the id2token attribute is only filled in lazily
        topic_data.append((t, term_rank, dictionary[term_id]))

In [31]:
TOPIC = pd.DataFrame(topic_data, columns=['topic_id', 'term_rank', 'term_str'])\
    .set_index(['topic_id','term_rank']).term_str.unstack()

In [32]:
TOPIC.head(20)

topic_id / term_rank,0,1,2,3,4,5,6,7,8,9
0,host,cents,gates,1st,155,bright,gate,protector,nights,3rd
1,pope,blue,rc,authentic,conceived,shadow,,beauty,dump,genetic
2,,sin,have,writes,with,trinity,or,not,my,if
3,x,,tower,c,301,0,calvin,amen,games,your
4,ports,assembly,external,alignment,thompson,wandering,na,programme,hatch,implementation
5,la,550,baptist,desperately,holly,lodge,freemasonry,170,73,disappointment
6,,god,by,as,not,with,who,spiritual,we,have
7,not,have,if,they,or,we,as,but,an,one
8,cables,rd,descent,french,operational,allocated,teh,sink,commissioner,jobs
9,,writes,with,have,article,if,not,or,but,my


# Dataflow

In [34]:
!dot -Tpng gensim.dot -o gensim.png

![](gensim.png)