# Topic Modeling with Gensim

A **topic model** is an abstraction of the major topics contained in a corpus of texts. "Topic" in this context simply means a pattern of co-occurring words. The assumption is that if there are clearly identified patterns of co-occurring words, those patterns of co-occurring words reveal a latent structure in the corpus of texts. In short, a topic model is a representation of the major themes or structures of a corpus of texts.

`Gensim` is a popular Python library for building topic models. In this notebook we will use `Gensim` to build a topic model of Gibbon's _Decline and Fall of the Roman Empire_. After building a topic model, we will then use `pyLDAvis` to visualize the model so we can evaluate its usefulness.

I highly recommend that you read through `Gensim`'s [documentation](https://radimrehurek.com/gensim/auto_examples/index.html#core-tutorials-new-users-start-here). Much of the code below is adapted from that source.

## Set up

**NOTE**: one of the Python libraries we are using (`pyLDAvis`) can cause problems. Be sure to do the installations in the order that you see them below.

In [1]:
! pip install funcy

Collecting funcy
  Downloading funcy-2.0-py2.py3-none-any.whl (30 kB)
Installing collected packages: funcy
Successfully installed funcy-2.0


In [2]:
! pip install tzdata



In [3]:
! pip install --no-dependencies pyLDAvis

Collecting pyLDAvis
  Downloading pyLDAvis-3.4.1-py3-none-any.whl (2.6 MB)
                                              0.0/2.6 MB ? eta -:--:--
                                              0.0/2.6 MB ? eta -:--:--
                                              0.0/2.6 MB ? eta -:--:--
                                              0.0/2.6 MB 326.8 kB/s eta 0:00:08
     --                                       0.2/2.6 MB 1.2 MB/s eta 0:00:03
     ------                                   0.4/2.6 MB 2.4 MB/s eta 0:00:01
     -----------                              0.7/2.6 MB 3.2 MB/s eta 0:00:01
     ----------------                         1.1/2.6 MB 4.2 MB/s eta 0:00:01
     ----------------------                   1.5/2.6 MB 5.2 MB/s eta 0:00:01
     ---------------------------              1.8/2.6 MB 5.4 MB/s eta 0:00:01
     -------------------------------          2.0/2.6 MB 5.6 MB/s eta 0:00:01
     ---------------------------------------  2.6/2.6 MB 6.6 MB/s eta 0:00:01
     ---

In [4]:
! pip install wget



In [7]:
! pip install gensim

Collecting gensim
  Downloading gensim-4.3.2-cp311-cp311-win_amd64.whl (24.0 MB)
                                              0.0/24.0 MB ? eta -:--:--
                                              0.0/24.0 MB ? eta -:--:--
                                              0.0/24.0 MB ? eta -:--:--
                                             0.0/24.0 MB 326.8 kB/s eta 0:01:14
                                              0.2/24.0 MB 1.3 MB/s eta 0:00:19
                                              0.6/24.0 MB 3.3 MB/s eta 0:00:08
     -                                        1.0/24.0 MB 4.4 MB/s eta 0:00:06
     --                                       1.5/24.0 MB 5.9 MB/s eta 0:00:04
     ---                                      2.2/24.0 MB 7.6 MB/s eta 0:00:03
     ----                                     2.9/24.0 MB 8.8 MB/s eta 0:00:03
     ------                                   3.7/24.0 MB 10.2 MB/s eta 0:00:02
     -------                                  4.6/24.0 MB 11.9 MB/s e

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pyldavis 3.4.1 requires numexpr, which is not installed.


In [30]:
from collections import defaultdict
import wget
from gensim import corpora, models
import pandas as pd
import pyLDAvis.gensim
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
import requests
import json
import math

## Upload data

### Class example

In [8]:
# url = 'https://raw.githubusercontent.com/msaxton/nlp-data/main/gibbon_sections.csv'
# file_name = wget.download(url)
# df = pd.read_csv(file_name)
# df.head()

### Upload your own data sets

If you are using Google Colab:

In [None]:
# uploaded = files.upload()

In [None]:
# file_name = # <-- for example: 'data.csv'
# df = pd.read_csv(io.BytesIO(uploaded[file_name]))

If you are using Jupyter Notebooks:

In [26]:
url = 'https://chroniclingamerica.loc.gov/search/pages/results/?state=&date1=1770&date2=1963&proxtext="China"+"Chinese"&x=0&y=0&dateFilterType=yearRange&rows=20&searchType=basic&format=json'
response = requests.get(url)
raw = response.text
results = json.loads(raw)
results.keys()

dict_keys(['totalItems', 'endIndex', 'startIndex', 'itemsPerPage', 'items'])

In [28]:
print(results['items'][0])

{'sequence': 62, 'county': [None], 'edition': None, 'frequency': 'Daily', 'id': '/lccn/sn83045462/1944-08-13/ed-1/seq-62/', 'subject': ['Washington (D.C.)--fast--(OCoLC)fst01204505', 'Washington (D.C.)--Newspapers.'], 'city': ['Washington'], 'date': '19440813', 'title': 'Evening star. [volume]', 'end_year': 1972, 'note': ['"From April 25 through May 24, 1861 one sheet issues were published intermittently owing to scarcity of paper." Cf. Library of Congress, Photoduplication Service.', 'Also issued on microfilm from Microfilming Corp. of America and the Library of Congress, Photoduplication Service.', 'Archived issues are available in digital format as part of the Library of Congress Chronicling America online collection.', 'Publisher varies: Noyes, Baker & Co., <1867>; Evening Star Newspaper Co., <1868->', "Suspended Jan. 1-6, 1971 because of a machinists' strike."], 'state': ['District of Columbia'], 'section_label': '', 'type': 'page', 'place_of_publication': 'Washington, D.C.', 'sta

In [31]:
total_pages = math.ceil(results['totalItems'] / results['itemsPerPage'])
print(total_pages)

3158


In [32]:
data = []

In [33]:
start_date = '1770'
end_date = '1865'
search_term = "China"+"Chinese"
state = ''

In [34]:
for i in range(1, 10):  
    url = (f'https://chroniclingamerica.loc.gov/search/pages/results/?state={state}&date1={start_date}'
           f'&date2={end_date}&proxtext={search_term}&x=16&y=8&dateFilterType=yearRange&rows=20'
           f'&searchType=basic&format=json&page={i}')  # f-string
    response = requests.get(url)
    raw = response.text
    print(f'page {i} status code:', response.status_code)  # checking for errors
    results = json.loads(raw)
    items_ = results['items']
    for item_ in items_:
        row_data = {}
        try:
          row_data['title'] = item_['title_normal']
        except:
          row_data['city'] = "none"
        try:
          row_data['city'] = item_['city']
        except:
          row_data['city'] = "none"
        try:
          row_data['date'] = item_['date']
        except:
          row_data['date'] = "none"
        try:
          row_data['raw_text'] = item_['ocr_eng']
        except:
          row_data['raw_text'] = 'none'
    data.append(row_data)

page 1 status code: 200
page 2 status code: 200
page 3 status code: 200
page 4 status code: 200
page 5 status code: 200
page 6 status code: 200
page 7 status code: 200
page 8 status code: 200
page 9 status code: 200


In [35]:
df = pd.DataFrame.from_dict(data)

In [40]:
df.head()

Unnamed: 0,title,city,date,raw_text
0,davenport gazette.,[Davenport],18530707,"mr""®\nSANDERS k MYIS,\nFBOPBIETOBS' BUSINESS C..."
1,yazoo democrat.,[Yazoo City],18530105,THE YAZOO DEMOCRAT\n1'\nf\nPublished Weekly\nV...
2,washington sentinel.,[Washington],18540405,fetal aitb personal.\nThe Hippodrome.?The prep...
3,lancaster gazette.,[Lancaster],18480324,n i\n' S. I:\nill\nn\n' ' 1\n;!:!)\n111\n! i\n...
4,yazoo democrat.,[Yazoo City],18520825,TRY AGAIN.\nTV CORBETT takes leave to return h...


## Prepare data for topic model
The Python library we are going to use to make our topic model requires the data to be in a form of a list. Within that list, each "document" is also a list. So it looks something like this:

`[
  ['This is document 1'],
  ['This is document 2'],
  ['This is document 3']
]`

In [41]:
# extract the data out of the DataFrame
documents = df['raw_text'].to_list()

In [43]:
len(documents[0])

45959

`Gensim` needs each document to be tokenized. We can use [list comprehension](https://www.w3schools.com/python/python_lists_comprehension.asp) to quickly achieve this result. When complete, our data will now look like this:

`[
  ['This', 'is', 'document', '1'],
  ['This', 'is', 'document', '2'],
  ['This', 'is', 'document', '3'],
]`

In [44]:
# tokenize - the syntax below will create a list of lists
texts =[
    [word for word in document.lower().split()]
    for document in documents
]

It takes a lot of preparation to build a useful topic model. An important part of that preparation is to eliminate "noise" from you model. One way to do this is to remove pieces of data that are irrelevant. Here we will remove tokens that only occur once. **You may want to adjust this as you refine your topic model.**

In [45]:
# create a count of each token
frequency = defaultdict(int)
for text in texts:
  for token in text:
    frequency[token] += 1

In [46]:
# remove words that appear only 1 time
texts = [
    [token for token in text if frequency[token] > 1]
    for text in texts
]

## Build topic model

`Gensim` is built around [four core concepts](https://radimrehurek.com/gensim/auto_examples/core/run_core_concepts.html#core-concepts):
- **document**: an individual text. In our case, this is an individual section from Gibbon.
- **corpus**: a collection of documents. In our case, this is all the sections from Gibbon put together.
- **vector**: a mathematically convenient representation of a document. Basically, each word in the document is given a numerical id. This allows `Gensim` to do faster calculations behind the scene.
- **model**: an algorithm for transforming vectors from one representation to another. In our case, this will be the LDA model we build.


### Basic topic model



In [47]:
# create a dictionary based off our texts
# The dictionary maps each token to a unique integer id
dictionary = corpora.Dictionary(texts)

In [48]:
# create a corpus based off our dictionary and our texts
corpus = [dictionary.doc2bow(text) for text in texts]

In [49]:
# build LDA model
lda_model = models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=20, passes=50)

In [50]:
# explore topics
lda_model.print_topics()

[(0,
  '0.000*"the" + 0.000*"of" + 0.000*"and" + 0.000*"to" + 0.000*"in" + 0.000*"for" + 0.000*"a" + 0.000*"be" + 0.000*"at" + 0.000*"that"'),
 (1,
  '0.000*"of" + 0.000*"the" + 0.000*"and" + 0.000*"in" + 0.000*"to" + 0.000*"a" + 0.000*"for" + 0.000*"be" + 0.000*"on" + 0.000*"at"'),
 (2,
  '0.000*"the" + 0.000*"of" + 0.000*"and" + 0.000*"to" + 0.000*"in" + 0.000*"a" + 0.000*"be" + 0.000*"at" + 0.000*"for" + 0.000*"by"'),
 (3,
  '0.000*"the" + 0.000*"of" + 0.000*"and" + 0.000*"to" + 0.000*"in" + 0.000*"a" + 0.000*"for" + 0.000*"at" + 0.000*"that" + 0.000*"is"'),
 (4,
  '0.076*"the" + 0.051*"of" + 0.034*"and" + 0.027*"to" + 0.023*"in" + 0.014*"a" + 0.012*"be" + 0.011*"at" + 0.010*"on" + 0.010*"."'),
 (5,
  '0.083*"the" + 0.036*"of" + 0.032*"and" + 0.028*"a" + 0.027*"to" + 0.019*"in" + 0.014*"was" + 0.010*"his" + 0.010*"with" + 0.010*"that"'),
 (6,
  '0.000*"the" + 0.000*"of" + 0.000*"and" + 0.000*"to" + 0.000*"in" + 0.000*"a" + 0.000*"for" + 0.000*"be" + 0.000*"that" + 0.000*"at"'),
 (7,

In [51]:
# Find topics in each document
lda_model.get_document_topics(corpus[0])

[(5, 0.99984163)]

In [59]:
# visualize
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, dictionary, mds='mmds')
vis



### Tf-idf topic model

In [53]:
# initialize a tfidf model
tfidf = models.TfidfModel(corpus)

In [54]:
# make a new corpus based on the tfidf model
corpus_tfidf = tfidf[corpus]

In [55]:
# here we build our topic model
lda_model_tfidf = models.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=20, passes=50)
corpus_lda = lda_model_tfidf[corpus_tfidf]

In [56]:
lda_model_tfidf.print_topics()

[(0,
  '0.000*"ship," + 0.000*"sheriff" + 0.000*"skin" + 0.000*"sixty" + 0.000*"simple" + 0.000*"signed" + 0.000*"sible" + 0.000*"shore" + 0.000*"society" + 0.000*"sheriffs"'),
 (1,
  '0.001*"“" + 0.001*"■" + 0.001*"’" + 0.001*"|" + 0.001*"fairfax" + 0.001*"confederate" + 0.001*"‘" + 0.001*"buren," + 0.001*"capt." + 0.001*"van"'),
 (2,
  '0.000*"ship," + 0.000*"sheriff" + 0.000*"skin" + 0.000*"sixty" + 0.000*"simple" + 0.000*"signed" + 0.000*"sible" + 0.000*"shore" + 0.000*"society" + 0.000*"sheriffs"'),
 (3,
  '0.000*"ship," + 0.000*"sheriff" + 0.000*"skin" + 0.000*"sixty" + 0.000*"simple" + 0.000*"signed" + 0.000*"sible" + 0.000*"shore" + 0.000*"society" + 0.000*"sheriffs"'),
 (4,
  '0.002*"yazoo" + 0.001*"1852." + 0.001*"probate" + 0.001*"township" + 0.001*"quarter" + 0.001*"east-half" + 0.001*"1852," + 0.001*"south-west" + 0.001*"north-east" + 0.001*"quarter,"'),
 (5,
  '0.000*"ship," + 0.000*"sheriff" + 0.000*"skin" + 0.000*"sixty" + 0.000*"simple" + 0.000*"signed" + 0.000*"sible"

In [57]:
# Find topics in each document
lda_model_tfidf.get_document_topics(corpus_tfidf[0])

[(0, 0.031738747),
 (1, 0.031738747),
 (2, 0.031738747),
 (3, 0.031738747),
 (4, 0.031738747),
 (5, 0.031738747),
 (6, 0.031738747),
 (7, 0.031738747),
 (8, 0.03173875),
 (9, 0.031738747),
 (10, 0.3969638),
 (11, 0.031738747),
 (12, 0.031738747),
 (13, 0.031738747),
 (14, 0.031738747),
 (15, 0.031738747),
 (16, 0.031738747),
 (17, 0.03173875),
 (18, 0.031738747),
 (19, 0.031738747)]

In [60]:
# visualize
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model_tfidf, corpus_tfidf, dictionary, mds='mmds')
vis

