# Exploratory Topic Modeling

## Introduction

### Summary
In this notebook, I'm going to try to build a topic model using [legislation proposed in Maryland in 2018](). 

I relied heavily on a [Topic Modeling tutorial I found on MachineLearningPlus](https://www.machinelearningplus.com/nlp/topic-modeling-python-sklearn-examples/). Thanks to everyone in the community who is putting out information to help others learn!

### Gathering the data

I built a web scraper that crawls the [Maryland General Assembly's website](http://mgaleg.maryland.gov/webmga/frmLegislation.aspx?pid=legisnpage&tab=subject3) and collects information about each bill proposed in the 2018 session. To build the topic model, I used the "Purpose" from each bill; this was a really interesting challenge! If you're interested in my approach, you can read through my tutorial [here]().

### Requirements

You can download the entire project on my GitHub, but if you want to walk through this yourself, be sure you have the following packages:

* pandas
* gensim
* spacy
* scikit-learn (imported as `sklearn`)
* numpy

## Data Preparation

### Loading in the data

As I mentioned above, I've already collected the data myself. If you're interested in learning more about that, you can look [here]().


In [17]:
import pandas as pd

data = pd.read_csv(
    'C:\\Users\\switkowski\\Documents\\Projects\\Topic-Model-MD-Legislation\\legislation_scraper\\data\\bill_data.csv')
data.head(1)

Unnamed: 0,bill_name,bill_number,broad_subjects,committee,narrow_subjects,purpose,sponsor,status,url
0,Harford County Sheriff - Deputy Sheriffs and C...,HB0015,Courts and Court Personnel - Local,Appropriations,"Collective Bargaining,Contracts -see also- Lan...",FOR the purpose of providing that certain depu...,Delegate Lisanti,In the House - Unfavorable Report by Appropria...,http://mgaleg.maryland.gov/2018RS/bills/hb/hb0...


There's a lot of information in this file. Briefly, here's what's contained in `bill_data.csv`.

* bill_name
* bill_number
* broad_subjects
* committee
* narrow_subjects
* purpose
* sponsor
* status
* url

To build the topic model, we're going to use `purpose`.

The first thing I'm going to do is convert `purpose` to a list.

In [2]:
purpose = data.purpose.tolist()

### Preparing and processing the text data

Next, I'm going to process the text a little bit. To build an accurate model, I need to make the text as uniform as possible.

First, I'm going to use `gensim.utils.simple_preprocess` to lowercase the text, remove accents, and [tokenize the data](https://www.techopedia.com/definition/13698/tokenization). Tokenizing converts each purpose string into a list of individual words.

In [21]:
import gensim

purpose_words = [gensim.utils.simple_preprocess(
    str(text), deacc=True) for text in purpose]

purpose_words[1]

['for', 'the', 'purpose', 'of', 'authorizing', 'the', 'creation', 'of', 'state', 'debt', 'not', 'to', 'exceed', 'the', 'proceeds', 'to', 'be', 'used', 'as', 'grant', 'to', 'the', 'county', 'executive', 'and', 'county', 'council', 'of', 'howard', 'county', 'for', 'certain', 'development', 'or', 'improvement', 'purposes', 'providing', 'for', 'disbursement', 'of', 'the', 'loan', 'proceeds', 'subject', 'to', 'requirement', 'that', 'the', 'grantee', 'provide', 'and', 'expend', 'matching', 'fund', 'establishing', 'deadline', 'for', 'the', 'encumbrance', 'or', 'expenditure', 'of', 'the', 'loan', 'proceeds', 'and', 'providing', 'generally', 'for', 'the', 'issuance', 'and', 'sale', 'of', 'bonds', 'evidencing', 'the', 'loan', 'section', 'be', 'it', 'enacted']
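
To make the preprocessing step concrete, here is a rough standard-library approximation of what it does: lowercase the text, strip accents, and split it into word tokens. This is only a sketch, not gensim's implementation. The function name `rough_preprocess`, the regular expression, and the sample sentence are assumptions made for illustration, and the 2 to 15 character limits are meant to mirror the default `min_len` and `max_len` of `simple_preprocess`.

```python
import re
import unicodedata


def rough_preprocess(text, min_len=2, max_len=15):
    """Approximate lowercasing, accent removal, and tokenization."""
    # Decompose accented characters, then drop the combining accent marks.
    decomposed = unicodedata.normalize("NFD", text.lower())
    no_accents = "".join(ch for ch in decomposed
                         if unicodedata.category(ch) != "Mn")
    # Keep runs of ASCII letters only, so digits and punctuation act as separators.
    tokens = re.findall(r"[a-z]+", no_accents)
    # Discard tokens that are too short or too long.
    return [tok for tok in tokens if min_len <= len(tok) <= max_len]


# Made-up sample sentence, not taken from the bill data.
sample = "FOR the purpose of authorizing a $500,000 Café grant."
print(rough_preprocess(sample))
```

The dollar amount and the one-letter word "a" disappear, which matches the gaps visible in the real output above (for example, "not to exceed the proceeds").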

The text looks a lot better now, but there's one more critical step I need to take before we can move forward. You'll notice that some words appear in both singular and plural form in this text (for example, "purpose" and "purposes"). These effectively convey the same meaning, but they would be treated as different words.

To improve my model, I'm going to reduce each word to its [*lemma*](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html) using `spacy`. Using the example of "purpose" and "purposes", both words would be converted to "purpose".

In [25]:
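# `nlp` is the spaCy pipeline loaded in the next cell, which was executed first (see the execution counts)
nlp('purpose')[0].lemma_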

'purpose'
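
To see why lemmatizing matters for a count-based model, the sketch below counts words before and after mapping inflected forms to a shared lemma. The tiny `TOY_LEMMAS` lookup table is a hand-written stand-in for spaCy, not a description of how spaCy works, and the token list is made up for the example.

```python
from collections import Counter

# Hypothetical hand-written lemma table; spaCy uses its own lemmatizer
# rather than this table.
TOY_LEMMAS = {"purposes": "purpose", "bonds": "bond", "providing": "provide"}

# Made-up tokens for illustration.
tokens = ["purpose", "purposes", "bonds", "bond", "providing", "purpose"]

raw_counts = Counter(tokens)
# Replace each token with its lemma when the table has one.
lemma_counts = Counter(TOY_LEMMAS.get(tok, tok) for tok in tokens)

print(raw_counts["purpose"], raw_counts["purposes"])  # counted separately
print(lemma_counts["purpose"])                        # counted together
```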

In [4]:
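import spacy


def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    """Join each token list into a sentence and return its lemmas as one string."""
    texts_out = []
    for text in texts:
        doc = nlp(" ".join(text))
        # Keep only the allowed parts of speech; blank out spaCy 2.x's
        # '-PRON-' placeholder lemma for pronouns.
        texts_out.append(" ".join(
            "" if token.lemma_ == '-PRON-' else token.lemma_
            for token in doc if token.pos_ in allowed_postags))
    return texts_out


# 'en' is a spaCy 2.x shortcut link; spaCy 3+ needs the full model name
# (for example 'en_core_web_sm'). The parser and NER are not needed here.
nlp = spacy.load('en', disable=['parser', 'ner'])
purpose_lemmatized = lemmatization(purpose_words)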



In [5]:
purpose_lemmatized[0]

'purpose provide certain deputy sheriff correctional officer office sheriff harford county have right organize negotiate harford county executive harford county sheriff regard certain wage employee health care premium share require right organize negotiate be conduct accordance certain provision harford county code otherwise provide act require term agreement regard certain wage employee health care premium share be set memorandum agreement enter sheriff county executive employee organization provide agreement regard certain wage employee health care premium share be not effective agreement be ratify sheriff county executive employee organization provide modification exist memorandum agreement be not valid certain circumstance require certain procedure set harford county code apply certain party be unable reach certain agreement provide construction act generally relate salary negotiation right swear law enforcement officer correctional officer harford county sheriff office'

In [6]:
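from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

vectorizer = CountVectorizer(analyzer = 'word',
                            min_df = 10,                        # ignore words found in fewer than 10 bills
                            stop_words = 'english',             # drop common English stop words
                            lowercase = True,
                            token_pattern = '[a-zA-Z0-9]{3,}')  # tokens of 3+ alphanumeric characters

# Build the sparse document-word count matrix
purpose_vectorized = vectorizer.fit_transform(purpose_lemmatized)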
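
The vectorizer turns each lemmatized purpose into a row of word counts. A minimal pure-Python sketch of that idea follows, assuming a toy corpus and a hypothetical `bag_of_words` helper. It imitates `token_pattern` and `min_df` only, and leaves out stop-word removal and the sparse matrix that scikit-learn returns.

```python
import re
from collections import Counter


def bag_of_words(docs, min_df=2, token_pattern=r"[a-zA-Z0-9]{3,}"):
    """Return (vocabulary, count rows) for a list of documents."""
    tokenized = [re.findall(token_pattern, doc.lower()) for doc in docs]
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    # Keep only words that appear in at least min_df documents.
    vocabulary = sorted(w for w, df in doc_freq.items() if df >= min_df)
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


# Made-up documents for illustration.
toy_docs = [
    "county sheriff wage agreement",
    "county council loan proceeds",
    "loan proceeds for county grant",
]
vocabulary, rows = bag_of_words(toy_docs)
print(vocabulary)
for row in rows:
    print(row)
```

Words that occur in only one toy document are dropped from the vocabulary, which is the same pruning that `min_df = 10` applies to the real bills at a larger scale.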

In [7]:
purpose_dense = purpose_vectorized.todense()

# Percentage of cells in the document-word matrix that are non-zero
print("Sparsity: " + str(((purpose_dense > 0).sum()/purpose_dense.size)*100) + "%")

Sparsity: 2.43933850340896%
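
The percentage printed by the cell above is the share of cells in the document-word matrix that hold a non-zero count. The same calculation on a small hand-made matrix looks like this; `percent_nonzero` and the toy matrix are assumptions for illustration only.

```python
def percent_nonzero(rows):
    """Percentage of cells in a count matrix that are greater than zero."""
    total = sum(len(row) for row in rows)
    nonzero = sum(1 for row in rows for value in row if value > 0)
    return 100.0 * nonzero / total


# Made-up 3 x 4 count matrix: 3 of its 12 cells are non-zero.
toy_matrix = [
    [1, 0, 0, 0],
    [0, 2, 0, 0],
    [0, 0, 0, 3],
]
print(percent_nonzero(toy_matrix))
```

A low percentage means most bills use only a small slice of the vocabulary, which is typical for bag-of-words data.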


In [8]:
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD

lda_model = LatentDirichletAllocation(n_components=8,
                                     max_iter=10,
                                     learning_method='online',
                                     random_state=100,
                                     batch_size=50,
                                     evaluate_every=-1,
                                     n_jobs=-1)

lda_output = lda_model.fit_transform(purpose_vectorized)

print(lda_model)

LatentDirichletAllocation(batch_size=50, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='online', learning_offset=10.0,
             max_doc_update_iter=100, max_iter=10, mean_change_tol=0.001,
             n_components=8, n_jobs=-1, n_topics=None, perp_tol=0.1,
             random_state=100, topic_word_prior=None,
             total_samples=1000000.0, verbose=0)


In [9]:
print("Log Likelihood: ", lda_model.score(purpose_vectorized))

Log Likelihood:  -1263931.107496432


In [10]:
print("Perplexity: ", lda_model.perplexity(purpose_vectorized))

Perplexity:  256.1582111918381
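
Log-likelihood and perplexity are two views of the same fit: perplexity is, roughly, the exponential of the negative log-likelihood per word, so a higher log-likelihood and a lower perplexity both indicate a better model. The sketch below uses made-up numbers, and `perplexity_from_log_likelihood` is a hypothetical helper, not a scikit-learn function. scikit-learn works from an approximate bound on the log-likelihood, so treat this as the general definition rather than its exact computation.

```python
import math


def perplexity_from_log_likelihood(log_likelihood, n_words):
    """Perplexity as exp of the negative per-word log-likelihood."""
    return math.exp(-log_likelihood / n_words)


# Toy case: 10 words, each assigned probability 1/4 by the model.
toy_log_likelihood = 10 * math.log(0.25)
print(perplexity_from_log_likelihood(toy_log_likelihood, 10))
```

In the toy case the model is as uncertain as a uniform choice among four words, so the perplexity comes out at about 4.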


In [11]:
from pprint import pprint  # without this import, IPython falls back to its %pprint magic, which only toggles pretty printing

pprint(lda_model.get_params())


In [12]:
from sklearn.model_selection import GridSearchCV

search_params = {'n_components': [5, 10, 15, 20, 25, 30],
                'learning_decay': [0.5, 0.7, 0.9]}

lda = LatentDirichletAllocation()

model = GridSearchCV(lda, param_grid = search_params)

model.fit(purpose_vectorized)




In [13]:
best_lda_model = model.best_estimator_

print("Best Model's Params: ", model.best_params_)

print("Best Log Likelihood Score: ", model.best_score_)

print("Model Perplexity: ", best_lda_model.perplexity(purpose_vectorized))

Best Model's Params:  {'learning_decay': 0.9, 'n_components': 15}
Best Log Likelihood Score:  -459277.68657627713
Model Perplexity:  218.16156096221894
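
`GridSearchCV` fits one model per combination of the values in `search_params` (and per cross-validation fold), then keeps the combination with the best average held-out score. The sketch below only enumerates those combinations with the standard library to show the size of the search. The number of folds is not stated in the notebook, so it is left out.

```python
from itertools import product

# Same grid as in the notebook cell above.
search_params = {'n_components': [5, 10, 15, 20, 25, 30],
                 'learning_decay': [0.5, 0.7, 0.9]}

names = list(search_params)
# One dict per combination of parameter values.
candidates = [dict(zip(names, values))
              for values in product(*(search_params[name] for name in names))]

print(len(candidates))   # 6 * 3 combinations
print(candidates[0])
```

Because `best_score_` is averaged over held-out folds, it is not directly comparable to the log-likelihood computed on the full dataset earlier.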


In [14]:
topic_keywords = pd.DataFrame(best_lda_model.components_)

topic_keywords.columns = vectorizer.get_feature_names()  # scikit-learn 1.2+ removed this method; use get_feature_names_out() there

topic_keywords.head()

Unnamed: 0,abandon,ability,absentee,abuse,academic,accept,access,accessible,accordance,account,...,work,worker,workforce,workgroup,write,writing,year,young,youth,zone
0,0.072929,0.102199,0.073529,15.024464,0.082069,0.072303,27.79773,0.073751,8.966334,0.073478,...,0.330395,0.073103,0.085247,0.074435,21.134991,0.076543,0.109063,0.073831,0.072555,7.832677
1,0.074297,0.087642,0.120497,0.077659,0.083216,0.09276,0.092979,0.0982,6.652856,25.879754,...,1.601368,0.104675,0.074457,0.073684,4.861161,2.092029,92.294292,0.072936,0.081779,1.126161
2,0.287013,9.127958,0.073814,0.074949,0.073602,0.077562,8.828429,0.077769,1.590058,16.628538,...,0.090366,14.122516,0.072565,0.154756,21.587233,5.692054,7.382479,0.072305,0.073227,0.083909
3,0.074918,2.264629,0.074712,44.643841,0.088491,6.451913,60.996098,0.081971,40.47811,0.093005,...,12.070804,0.738668,0.078097,43.320254,15.189809,0.087651,67.616249,6.529728,59.466995,0.073886
4,0.073047,0.087885,0.073795,0.082269,13.379773,0.200664,0.118826,0.074274,0.101041,10.252645,...,41.500187,0.115481,48.128957,0.080863,0.108026,7.181339,1.267744,0.07819,0.082567,0.07467


In [15]:
import numpy as np


def show_topics(vectorizer = vectorizer,
               lda_model = lda_model,
               n_words = 20):
    """Return the n_words highest-weighted keywords for each topic."""
    keywords = np.array(vectorizer.get_feature_names())
    topic_keywords = []
    for topic_weights in lda_model.components_:
        # Negate the weights so argsort lists the largest first
        topic_keyword_locs = (-topic_weights).argsort()[:n_words]
        topic_keywords.append(keywords.take(topic_keyword_locs))
    return topic_keywords

topic_keywords = show_topics(lda_model = best_lda_model)

In [None]:
pd.DataFrame(topic_keywords)
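
`show_topics` picks the highest-weighted vocabulary entries in each row of `components_`. Here is the same ranking idea on a toy example using only the standard library. The weights, the four-word vocabulary, and the `top_words` helper are invented for illustration.

```python
def top_words(vocabulary, topic_weights, n_words=2):
    """Return the n_words vocabulary entries with the largest weights."""
    # Sort word positions by weight, largest first.
    ranked = sorted(range(len(topic_weights)),
                    key=lambda i: topic_weights[i], reverse=True)
    return [vocabulary[i] for i in ranked[:n_words]]


# Made-up vocabulary and topic-word weights (one row per topic).
toy_vocabulary = ["abuse", "access", "account", "year"]
toy_components = [
    [15.0, 27.8, 0.1, 0.1],
    [0.1, 0.1, 25.9, 92.3],
]
for weights in toy_components:
    print(top_words(toy_vocabulary, weights))
```

Reading the top words of each row side by side is usually the quickest way to judge whether a topic is coherent enough to label.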