# Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) represents each document as a mixture of topics, where each topic is a probability distribution over words. The approach assumes a generative probabilistic model for a collection of discrete data (in this case, text documents). In the formulation of Blei et al., each document is modeled as a finite mixture over an underlying set of topics, and each topic is in turn modeled as an "infinite mixture over an underlying set of topic probabilities" [1]. The number of words in a document follows a certain distribution (Poisson in [1]), and the topic composition of each document follows a Dirichlet distribution over a fixed set of topics.
<br>
<img src="http://deliveryimages.acm.org/10.1145/1860000/1859210/figs/uf1.jpg" align="center" height="500" width="500">
<br>
The purpose of LDA is to infer, from the words in the documents themselves, a set of topics that is likely to have generated those documents [2]. One shortcoming of the LDA model is its "bag of words" assumption: the order of words in a document is ignored, and the documents in the corpus are likewise treated as exchangeable. As a result, the features the model sees are essentially word counts per document.

<br>
[1] D. M. Blei, A. Y. Ng, M. I. Jordan, 2003. "Latent Dirichlet Allocation." *Journal of Machine Learning Research*. pp. 993-1022. [Online](https://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf).<br>
[2] E. Chen, 2013. "Introduction to Latent Dirichlet Allocation." [Online](http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/). <br>
[3] G. Anthes, 2010. "Topic Models vs. Unstructured Data." *Communications of the ACM* 53(12): pp. 16-18. DOI: 10.1145/1859204.1859210.
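
The generative story described in the introduction can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the procedure used later in this notebook: the vocabulary, the two topics, and the helper names (`sample_dirichlet`, `sample_index`, `generate_document`) are hypothetical, and the document length is fixed rather than drawn from a Poisson distribution.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_index(probs, rng):
    """Draw an index according to a discrete probability vector."""
    r = rng.random()
    cumulative = 0.0
    for idx, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return idx
    return len(probs) - 1  # guard against floating-point round-off

def generate_document(topic_word, alpha, doc_length, vocab, rng):
    """Draw topic proportions for a document, then a topic and a word per position."""
    theta = sample_dirichlet(alpha, rng)        # per-document topic proportions
    words = []
    for _ in range(doc_length):
        z = sample_index(theta, rng)            # topic assigned to this position
        w = sample_index(topic_word[z], rng)    # word drawn from that topic
        words.append(vocab[w])
    return theta, words

rng = random.Random(0)
toy_vocab = ["court", "tax", "income", "jury", "trial"]
# Two hypothetical topics: one about taxation, one about criminal trials.
toy_topic_word = [
    [0.2, 0.4, 0.4, 0.0, 0.0],
    [0.2, 0.0, 0.0, 0.4, 0.4],
]
theta, words = generate_document(toy_topic_word, [0.5, 0.5], 8, toy_vocab, rng)
print(theta)
print(words)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topic-word distributions and the per-document proportions that plausibly produced them.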

In [1]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
import time
pd.set_option("display.width", 500)
pd.set_option("display.max_columns", 100)
pd.set_option("display.notebook_repr_html", True)
import seaborn as sns
sns.set_style("darkgrid")
sns.set_context("poster")

### Import and Clean Data

Using the methods defined in the Data Cleaning notebook, we can transform the text into training and test corpora that are segmented by word type (noun, adjective, verb, foreign, and precedent).

In [2]:
sample_df = pd.read_csv("sample_cases.csv")

In [3]:
sample_df.head()

Unnamed: 0,full_cite,text,url,us_cite,year,case,case_id,caseId,caseOriginState,dateArgument,decisionDirection,decisionType,docket,docketId,issueArea,jurisdiction,lawType,majOpinWriter,majVotes,minVotes,usCite
0,United States v. Jimenez Recio 537 U.S. 270 (2...,"OCTOBER TERM, 2002 Syllabus UNITED STATES v. J...",https://supreme.justia.com/cases/federal/us/53...,537 U.S. 270,2003,United States v. Jimenez Recio,10400,2002-016,,11/12/2002,1,1,1/1/1184,2002-016-01,1,1,,110.0,8,1,537 U.S. 270
1,United States v. Jones 345 U.S. 377 (1953),United States v. Jones No. 556 Decided April 1...,https://supreme.justia.com/cases/federal/us/34...,345 U.S. 377,1953,United States v. Jones,927,1952-078,12.0,,1,2,556,1952-078-01,9,2,6.0,,9,0,345 U.S. 377
2,"Joy Oil Co., Ltd. v. State Tax Commission 337 ...","Joy Oil Co., Ltd. v. State Tax Commission No. ...",https://supreme.justia.com/cases/federal/us/33...,337 U.S. 286,1949,"Joy Oil Co., Ltd. v. State Tax Commission",461,1948-090,27.0,1/6/1949,2,1,223,1948-090-01,8,1,1.0,80.0,6,3,337 U.S. 286
3,Witte v. United States 515 U.S. 389 (1995),"OCTOBER TERM, 1994 Syllabus WITTE v. UNITED ST...",https://supreme.justia.com/cases/federal/us/51...,515 U.S. 389,1995,Witte v. United States,9663,1994-076,,4/17/1995,1,1,94-6187,1994-076-01,1,1,2.0,104.0,8,1,515 U.S. 389
4,Warth v. Seldin 422 U.S. 490 (1975),"Warth v. Seldin No. 73-2024 Argued March 17, 1...",https://supreme.justia.com/cases/federal/us/42...,422 U.S. 490,1975,Warth v. Seldin,5850,1974-140,,3/17/1975,1,1,73-2024,1974-140-01,9,1,,101.0,5,4,422 U.S. 490


In [24]:
# training, test data split 
trainingcoln = pd.read_csv('traintestarray.csv',sep=',',header=None).values.ravel()
sample_df['training'] = trainingcoln

The following cleaning functions can also be found in the Data Cleaning notebook.

In [4]:
import re 
regex1 = r"\(.\)" 

In [6]:
from pattern.en import parse
from pattern.en import pprint
from pattern.en import conjugate, lemma, lexeme
from pattern.vector import stem, PORTER, LEMMA
from sklearn.feature_extraction import text
import string

#stopwords and punctuation
stopwords=text.ENGLISH_STOP_WORDS
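#the apostrophe is escaped so that it is actually part of the list (two adjacent quotes just concatenate two strings)
punctuation = list('.,;:!?()[]{}`\'"@#$^&*+-|=~_')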

def get_parts(opinion):
    oplow = opinion.lower()
    #REMOVING CHARACTERS: the raw text is messy, so we remove unnecessary characters.
    oplow = unicode(oplow, 'ascii', 'ignore') #drop non-ASCII characters
    oplow = str(oplow).translate(string.maketrans("\n\t\r", "   ")) #replace newlines, tabs and carriage returns with spaces
    #justices (eg, Justice Breyer) are referred to as J. (eg, Breyer, J.); we remove the J., also JJ. for plural
    oplow = oplow.replace('jj.','') #remove JJ. first; removing 'j.' first would leave a stray 'j'
    oplow = oplow.replace('j.','')
    oplow = oplow.replace('c.','') #remove C. for chief justice 
    oplow = oplow.replace('pp.','') #page numbers
    oplow = oplow.replace('  ',' ') #collapse double spaces (replacing them with '' would fuse adjacent words)
    oplow = ''.join([i for i in oplow if not i.isdigit()]) #remove digits 
    oplow=re.sub(regex1, ' ', oplow)
    #Remove the Justia disclaimer at the end of the case, if it appears in the string
    justiadisclaimer = "disclaimer: official"
    if justiadisclaimer in oplow: 
        optouse = oplow.split(justiadisclaimer)[0]
    else:
        optouse = oplow
    
    #GET A LIST OF PRECEDENTS CITED IN THE OPINION 
    wordslist = optouse.split()
    #find precedents based on string 'v.' (eg, 'Brown v. Board')
    #skip the first and last positions so that wordslist[i-1] and wordslist[i+1] always exist
    indices = [i for i in range(1, len(wordslist)-1) if wordslist[i]=='v.']
    precedents = [wordslist[i-1]+ ' ' + wordslist[i]+ ' ' + wordslist[i+1] for i in indices]
    
    #remove precedents, as we have already accounted for these
    for precedent in precedents:
        optouse = optouse.replace(precedent,'')
    
    #PARSE INTO LIST OF LISTS --> GET WORDS
    parsed = parse(optouse,tokenize=True,chunks=False,lemmata=True).split()
    verbs = [] 
    nouns = [] 
    adjectives = [] 
    foreign = [] 
    i=0
    #Create lists of lists of verbs, nouns, adjectives and foreign words in each sentence.
    for sentence in parsed: #for each sentence 
        verbs.append([])
        nouns.append([])
        adjectives.append([])
        foreign.append([])
        for token in sentence: #for each word in the sentence 
            if token[0] in punctuation or token[0] in stopwords or len(token[0])<=2:
                continue
            wordtouse = token[0]
            for x in punctuation:
                wordtouse = wordtouse.replace(x,' ') #if punctuation in word, take it out
            if token[1] in ['VB','VBZ','VBP','VBD','VBN','VBG']:
                verbs[i].append(lemma(wordtouse)) #append the lemmatized verb (we relemmatize because lemmata in parse does not seem to always work)
            if token[1] in ['NN','NNS','NNP','NNPS']:
                nouns[i].append(lemma(wordtouse))
            if token[1] in ['JJ','JJR','JJS']:
                adjectives[i].append(lemma(wordtouse)) #append to the current sentence's list, as for verbs and nouns
            if token[1] in ['FW']:
                foreign[i].append(wordtouse)
        i+=1  
    #Zip together lists so each tuple is a sentence. 
    out=zip(verbs,nouns,adjectives,foreign)
    verbs2 = []
    nouns2 = []
    adjectives2 = []
    foreign2 = []
    for sentence in out: 
        if sentence[0]!=[] and sentence[1]!=[]: #if the sentence has at least one verb and noun, keep it. Otherwise, drop it.
            if type(sentence[0])==list: 
                verbs2.append(sentence[0])
            else: 
                verbs2.append([sentence[0]]) #if verb is a string rather than a list, put string in list
            if type(sentence[1])==list:
                nouns2.append(sentence[1])
            else:
                nouns2.append([sentence[1]])
            if type(sentence[2])==list:
                adjectives2.append(sentence[2])
            else:
                adjectives2.append([sentence[2]])
            if type(sentence[3])==list:
                foreign2.append(sentence[3])
            else:
                foreign2.append([sentence[3]])
    return(verbs2,nouns2,adjectives2,foreign2,precedents)

### Vocabulary Creation

Now that we have cleaned our text corpus, we want to create vocabularies segmented by word type. We use the **get_parts** function defined above to return all verbs, nouns, adjectives, foreign words, and precedents. This word list will be useful for creating matrices of word frequencies. 

In [7]:
%%time 
verbwords = []
nounwords = []
adjwords = []
forwords = []
precedents_all = []
for op in sample_df.text:
    verbs,nouns,adjectives,foreign,precedents = get_parts(op)
    verbwords.append(verbs)
    nounwords.append(nouns)
    adjwords.append(adjectives)
    forwords.append(foreign)
    precedents_all.append(precedents)

CPU times: user 1min 39s, sys: 316 ms, total: 1min 39s
Wall time: 1min 40s


In [8]:
issue_areas = sample_df.issueArea.tolist()

In [9]:
#create precedents vocab
precedents_vocab = list(set([precedent for sublist in precedents_all for precedent in sublist]))
#create other vocabs
verbvocab = list(set([word for sublist in verbwords for subsublist in sublist for word in subsublist]))
nounvocab = list(set([word for sublist in nounwords for subsublist in sublist for word in subsublist]))
adjvocab = list(set([word for sublist in adjwords for subsublist in sublist for word in subsublist]))
forvocab = list(set([word for sublist in forwords for subsublist in sublist for word in subsublist]))

In [10]:
#dictionaries: id --> word
id2prec = dict(enumerate(precedents_vocab))
id2verb = dict(enumerate(verbvocab))
id2noun = dict(enumerate(nounvocab))
id2adj = dict(enumerate(adjvocab))
id2for = dict(enumerate(forvocab))
#dictionaries: word --> id
prec2id = dict(zip(id2prec.values(),id2prec.keys()))
verb2id = dict(zip(id2verb.values(),id2verb.keys()))
noun2id = dict(zip(id2noun.values(),id2noun.keys()))
adj2id = dict(zip(id2adj.values(),id2adj.keys()))
for2id = dict(zip(id2for.values(),id2for.keys()))

In [11]:
#this function takes a list of words, and outputs a list of unique (word, count) tuples
counter = lambda x:list(set([(i,x.count(i)) for i in x]))

#corpus_creator takes a list of lists of lists like verbwords, or a list of lists like precedents_all. 
#It also takes a word2id dictionary.
def corpus_creator(sentence_word_list,word2id):
    counter = lambda x:list(set([(word2id[i],x.count(i)) for i in x]))
    op_word_list = []
    if type(sentence_word_list[0][0])==list: #if list of lists of lists 
        for opinion in sentence_word_list: 
            #for each list (which corresponds to an opinion) in sentence_word_list, get a list of the words
            op_word_list.append([word for sublist in opinion for word in sublist])
    else: #if list of lists 
        op_word_list = sentence_word_list
    corpus = []
    for element in op_word_list: 
        corpus.append(counter(element))
    return(corpus)
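
To make the bag-of-words format concrete, here is a small self-contained sketch of the same idea on toy data. It is an illustration rather than a replacement for `corpus_creator`: the names `build_vocab`, `to_bow`, and the `toy_*` variables are hypothetical, and it counts words with `collections.Counter` instead of repeated calls to `list.count`.

```python
from collections import Counter

def build_vocab(docs):
    """Map each distinct word to an integer id (sorted for reproducibility)."""
    vocab = sorted(set(word for doc in docs for word in doc))
    word2id = {word: i for i, word in enumerate(vocab)}
    id2word = {i: word for word, i in word2id.items()}
    return word2id, id2word

def to_bow(doc, word2id):
    """Convert a list of words to a sorted list of (word_id, count) tuples."""
    counts = Counter(doc)
    return sorted((word2id[word], n) for word, n in counts.items())

toy_docs = [["court", "tax", "tax"], ["court", "jury"]]
toy_word2id, toy_id2word = build_vocab(toy_docs)
toy_corpus = [to_bow(doc, toy_word2id) for doc in toy_docs]
print(toy_corpus)  # [[(0, 1), (2, 2)], [(0, 1), (1, 1)]]
```

Each inner list is one document, and each tuple pairs a word id with the number of times that word occurs in the document. `Counter` makes a single pass over the words, whereas calling `x.count(i)` for every word is quadratic in document length.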

### First Pass (Untransformed)

To adjust the parameters of our model, we will use the corpus of nouns to check for model accuracy, since nouns are most strongly indicative of topics. The first pass at LDA will use the untransformed word matrices created above. We will adjust the number of topic categories in order to fine-tune the model output.

In [56]:
import gensim

In [13]:
corpus = corpus_creator(nounwords,noun2id)

In [59]:
%%time 
# model with noun corpus, 5 topics
lda1a = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=id2noun, num_topics=5, update_every=1, chunksize=200, passes=1)



CPU times: user 4.23 s, sys: 39.7 ms, total: 4.27 s
Wall time: 4.39 s


In [60]:
lda1a.print_topics()

[u'0.033*court + 0.014*petitioner + 0.013*state + 0.013*sentence + 0.011*respondent + 0.009*opinion + 0.009*amendment + 0.009*evidence + 0.008*case + 0.007*offense',
 u'0.040*court + 0.017*opinion + 0.017*state + 0.015*trial + 0.014*rule + 0.013*petitioner + 0.013*case + 0.012*respondent + 0.011*post + 0.010*evidence',
 u'0.063*court + 0.036*state + 0.021*district + 0.013*act + 0.012*appeal + 0.011*petitioner + 0.011*law + 0.010*case + 0.010*respondent + 0.009*claim',
 u'0.018*court + 0.014*act + 0.012*state + 0.009*opinion + 0.008*district + 0.008*agency + 0.008*congres + 0.007*regulation + 0.007*government + 0.007*provision',
 u'0.025*act + 0.020*court + 0.014*employee + 0.012*union + 0.011*contract + 0.010*respondent + 0.009*employer + 0.009*labor + 0.009*petitioner + 0.008*board']

In [61]:
%%time 
# model with noun corpus, 10 topics
lda1b = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=id2noun, num_topics=10, update_every=1, chunksize=200, passes=1)



CPU times: user 4.28 s, sys: 77.6 ms, total: 4.35 s
Wall time: 4.5 s


In [62]:
lda1b.print_topics()

[u'0.024*child + 0.018*bankruptcy + 0.017*court + 0.013*debtor + 0.012*performance + 0.010*children + 0.010*parent + 0.009*opinion + 0.008*petitioner + 0.008*state',
 u'0.063*court + 0.031*state + 0.025*district + 0.015*act + 0.012*opinion + 0.012*appeal + 0.011*respondent + 0.010*law + 0.010*case + 0.010*action',
 u'0.048*court + 0.041*state + 0.021*counsel + 0.017*petitioner + 0.015*claim + 0.015*rule + 0.015*jurisdiction + 0.014*habea + 0.014*trial + 0.013*tribe',
 u'0.032*state + 0.030*tax + 0.021*court + 0.014*property + 0.012*regulation + 0.010*congres + 0.010*government + 0.010*law + 0.010*clause + 0.009*bank',
 u'0.023*court + 0.018*amendment + 0.017*respondent + 0.014*officer + 0.011*search + 0.011*speech + 0.010*government + 0.010*opinion + 0.010*information + 0.009*police',
 u'0.018*water + 0.017*act + 0.016*patent + 0.016*u s  + 0.015*court + 0.012*decree + 0.011*land + 0.010*merger + 0.009*guideline + 0.009*rate',
 u'0.030*employee + 0.029*union + 0.023*board + 0.021*emplo

### Second Pass (tf-idf)

We can see from the words and corresponding coefficients above that some frequent words recur across topics (e.g., court, case, action, opinion, respondent). In order to refine the words that are generated by each topic, we can use term frequency–inverse document frequency (tf-idf) weighting, which scales each word's count in a document by how rare that word is across the corpus, so that words appearing in most documents are down-weighted. Each document in the corpus is reduced to a fixed-length list of numbers, one weight per word in the vocabulary. This should help adjust for the fact that some words appear across all documents at a consistently high frequency. <br>

In order to implement this, we need several helper functions that turn our corpus of words (represented as a list of list of tuples) into a matrix (weighted frequencies) and then back into a corpus that can be inputted into the model.
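
Before building those helpers, a small worked example may help show what the weighting does. The sketch below is a standard-library illustration on made-up counts, assuming the smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 row normalization, which is what scikit-learn's `TfidfTransformer` applies with its default settings. The function `tfidf_rows` and the toy matrix are hypothetical.

```python
import math

def tfidf_rows(count_rows):
    """Apply smoothed idf and L2 row normalization to a list of count rows."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # document frequency: number of documents in which each term occurs
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1.0 + n_docs) / (1.0 + d)) + 1.0 for d in df]
    weighted = []
    for row in count_rows:
        scaled = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in scaled))
        weighted.append([v / norm if norm > 0 else 0.0 for v in scaled])
    return idf, weighted

# Columns: "court", "tax", "jury"; "court" occurs in every toy document.
toy_counts = [
    [5, 3, 0],
    [4, 0, 2],
    [6, 0, 0],
]
toy_idf, toy_tfidf = tfidf_rows(toy_counts)
print(toy_idf)
print(toy_tfidf[0])
```

In the first toy document, "court" has the higher raw count (5 versus 3), but because it occurs in every document its idf is the minimum value of 1, so "tax" ends up with the larger weight after scaling.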

In [63]:
# takes a corpus and a number of words, and returns a matrix in which the element at row i and column j is the number of
# occurrences of word j in document i.
def corpus_to_mat(corpus, num_words):
    n = len(corpus)
    M = np.zeros((n, num_words))
    for i,doc in enumerate(corpus):
        for word,count in doc:
            M[i][word] = count
    return M

In [25]:
%%time
#get noun corpus
nouncorpus = corpus_creator(nounwords,noun2id)
noun_train_corpus = [nouncorpus[i] for i in range(len(nouncorpus)) if sample_df['training'][i]==1]
noun_test_corpus = [nouncorpus[i] for i in range(len(nouncorpus)) if sample_df['training'][i]==0]

#get verb corpus
verbcorpus = corpus_creator(verbwords,verb2id)
verb_train_corpus = [verbcorpus[i] for i in range(len(verbcorpus)) if sample_df['training'][i]==1]
verb_test_corpus = [verbcorpus[i] for i in range(len(verbcorpus)) if sample_df['training'][i]==0]

#get adjective corpus
adjcorpus = corpus_creator(adjwords,adj2id)
adj_train_corpus = [adjcorpus[i] for i in range(len(adjcorpus)) if sample_df['training'][i]==1]
adj_test_corpus = [adjcorpus[i] for i in range(len(adjcorpus)) if sample_df['training'][i]==0]

#get foreign corpus
forcorpus = corpus_creator(forwords,for2id)
for_train_corpus = [forcorpus[i] for i in range(len(forcorpus)) if sample_df['training'][i]==1]
for_test_corpus = [forcorpus[i] for i in range(len(forcorpus)) if sample_df['training'][i]==0]

#get precedents corpus
preccorpus = corpus_creator(precedents_all,prec2id)
prec_train_corpus = [preccorpus[i] for i in range(len(preccorpus)) if sample_df['training'][i]==1]
prec_test_corpus = [preccorpus[i] for i in range(len(preccorpus)) if sample_df['training'][i]==0]

CPU times: user 2.29 s, sys: 48.8 ms, total: 2.34 s
Wall time: 2.34 s


In [26]:
from sklearn.feature_extraction.text import TfidfTransformer
#this function takes a training matrix of size n_documents_training*vocab_size and a test matrix
#of size n_documents_test*vocab_size. The function outputs the corresponding tfidf matrices.
#Note that we fit on the training data, and then apply that fit to the test data.
def tfidf_mat_creator(trainmatrix,testmatrix):
    tf_idf_transformer=TfidfTransformer()
    tfidf_fit = tf_idf_transformer.fit(trainmatrix)
    tfidf_train = tfidf_fit.transform(trainmatrix).toarray()
    tfidf_test = tfidf_fit.transform(testmatrix).toarray()
    return(tfidf_train,tfidf_test)

In [49]:
noun_tfidf_mat_train,noun_tfidf_mat_test = tfidf_mat_creator(corpus_to_mat(noun_train_corpus, len(nounvocab)),
                                   corpus_to_mat(noun_test_corpus, len(nounvocab)))

In [50]:
# takes a tfidf matrix and returns the corresponding corpus (a list of lists of (word_id, weight) tuples)
def tfidf_to_corpus(tfidf_mat): #takes as input: matrix of size n_documents*vocabulary size
    tfidfcorpus = []
    for doc in tfidf_mat: #for each case
        #for each word in the vocabulary, store the tuple (word_id, tfidf_weight); zero weights are kept
        tfidfcorpus.append([(j, weight) for j, weight in enumerate(doc)])
    return(tfidfcorpus)
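
The conversion above emits one tuple per vocabulary word for every document, including words whose weight is zero. Because the bag-of-words format is a sparse list of `(word_id, value)` pairs, a variant that drops the zeros is sketched below. This is an untested suggestion rather than what the notebook runs: the function name `tfidf_to_sparse_corpus` and the toy rows are hypothetical, and whether it shortens training time on this dataset has not been measured here.

```python
def tfidf_to_sparse_corpus(rows):
    """Convert rows of tf-idf weights to lists of (word_id, weight), skipping zeros."""
    return [
        [(word_id, weight) for word_id, weight in enumerate(row) if weight > 0]
        for row in rows
    ]

toy_rows = [[0.0, 0.6, 0.8], [1.0, 0.0, 0.0]]
toy_sparse = tfidf_to_sparse_corpus(toy_rows)
print(toy_sparse)  # [[(1, 0.6), (2, 0.8)], [(0, 1.0)]]
```

With a vocabulary of many thousands of nouns and only a few hundred distinct nouns per opinion, the sparse form holds far fewer tuples per document.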

In [51]:
tfidf_noun_corpus = tfidf_to_corpus(noun_tfidf_mat_train)

In [64]:
%%time
# model with noun corpus (tf-idf weighted), 5 topics
lda2a = gensim.models.ldamodel.LdaModel(corpus=tfidf_noun_corpus, id2word=id2noun, num_topics=5, update_every=1, chunksize=200, passes=1)



CPU times: user 2min 4s, sys: 6.17 s, total: 2min 10s
Wall time: 2min 19s


In [65]:
lda2a.print_topics()

[u'0.001*election + 0.001*tort + 0.001*house + 0.001*applicant + 0.001*religion + 0.001*deportation + 0.001*appellant + 0.001*court + 0.001*negligence + 0.001*vote',
 u'0.001*solicitation + 0.001*contempt + 0.001*copyright + 0.001*racketeer + 0.000*teacher + 0.000*school + 0.000*desegregation + 0.000*rent + 0.000*rico + 0.000*court',
 u'0.009*court + 0.006*state + 0.005*act + 0.004*respondent + 0.004*petitioner + 0.003*district + 0.003*case + 0.003*tax + 0.003*u s  + 0.003*opinion',
 u'0.000*deposit + 0.000*licensee + 0.000*tribe + 0.000*allotment + 0.000*apportionment + 0.000*land + 0.000*injury therefore + 0.000*court + 0.000*coleman + 0.000*secretary',
 u'0.001*veteran + 0.000*liquor + 0.000*dna + 0.000*sate + 0.000*tort + 0.000*hospital + 0.000*wholesaler + 0.000*gene + 0.000*fault + 0.000*schedule']

In [67]:
%%time
# model with noun corpus (tf-idf weighted), 10 topics
lda2b = gensim.models.ldamodel.LdaModel(corpus=tfidf_noun_corpus, id2word=id2noun, num_topics=10, update_every=1, chunksize=200, passes=1)



CPU times: user 1min 57s, sys: 4.75 s, total: 2min 2s
Wall time: 2min 4s


In [68]:
lda2b.print_topics()

[u'0.006*contractor + 0.003*veteran + 0.003*tribe + 0.003*franchise + 0.002*nonresident + 0.001*marijuana + 0.001*division + 0.001*naturalization + 0.001*gift + 0.001*prima',
 u'0.012*court + 0.009*state + 0.008*act + 0.005*petitioner + 0.005*respondent + 0.005*statute + 0.004*case + 0.004*government + 0.004*district + 0.004*u s ',
 u'0.002*narcotic + 0.002*agriculture + 0.001*surveillance + 0.001*malice + 0.001*bro + 0.001*airport + 0.001*badge + 0.001*rent + 0.001*store + 0.000*contact',
 u'0.005*court + 0.004*ordinance + 0.003*amendment + 0.003*school + 0.003*opinion + 0.003*city + 0.003*market + 0.003* page + 0.003*child + 0.003*district',
 u'0.005*sentence + 0.005*trial + 0.004*court + 0.004*confession + 0.003*prison + 0.003*conviction + 0.003*death + 0.003*habea + 0.003*counsel + 0.003*capital',
 u'0.016*tax + 0.006*property + 0.004*revenue + 0.003*income + 0.003*peace + 0.003*taxpayer + 0.002*bankruptcy + 0.002*code + 0.002*ir + 0.002*railway',
 u'0.002*solicitation + 0.002*liqu

In [91]:
for bow in corpus[0:100:5]:
    print lda2a.get_document_topics(bow)
    print " ".join([id2noun[e[0]] for e in bow])
    print "=========================================="

[(1, 0.64807654967956374), (2, 0.062449968248195029), (3, 0.23414863256490431), (4, 0.04324581438729512)]
chertoff emphasi shannon time center lopez meza commission ginsburg public law driver commission both evil colby misbehavior term trial respondent resemblance precedent individual al  tuskey solicitor cruz seizure act attorney thornton purpose agreement convict sting probability court truck crime language jury feldman certiorari fact change evidence november car judge state cause acquaintance john operation general view dreeben alan breyer entrapment path reason criminality scalia circuit use jay goal help shurtliff enforcement opinion danger justice origin recio appeal january contact force objective police kennedy drug jimenez liability souter conspiracy government brief c ombination f d thoma majority conviction defeat steven threat conspirator rule object example olson case marcu text post
[(0, 0.014573436036213812), (1, 0.63551267022168179), (3, 0.12648805317715925), (4, 0.150

Although we can see that tf-idf improved the model insofar as it diluted the weights of words like "court" and "action," the model still classifies most cases as belonging to Topic 1, which is characterized by the all-purpose legal terms "court," "state," "act," "petitioner," "respondent," "statute," "case," "government," "district," and "u s." So, although it is possible to glean what sort of topics might be represented by the other categories (e.g., Topic 9 contains predictor words such as "search warrant" and "arrest" and therefore might be mapped to the issue area of privacy, while Topic 5 contains words like "tax," "property," and "revenue" and therefore might be mapped to the issue area of taxation), the model is biased towards assigning cases to the general-purpose Topic 1.
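
One way to quantify this bias is to tally the most probable topic for each document. The sketch below works on lists of `(topic_id, probability)` pairs in the format printed above. The values in `toy_doc_topics` are made up for illustration, and `dominant_topic` is a hypothetical helper.

```python
from collections import Counter

def dominant_topic(doc_topics):
    """Return the topic id with the highest probability for one document."""
    return max(doc_topics, key=lambda pair: pair[1])[0]

# Hypothetical per-document output in (topic_id, probability) format.
toy_doc_topics = [
    [(1, 0.65), (2, 0.06), (3, 0.23), (4, 0.04)],
    [(0, 0.01), (1, 0.64), (3, 0.13)],
    [(1, 0.20), (5, 0.75)],
]
tally = Counter(dominant_topic(d) for d in toy_doc_topics)
print(tally.most_common())  # [(1, 2), (5, 1)]
```

Applied to the real model, the input list could be built by calling `get_document_topics` on each bag-of-words document, and a tally dominated by a single topic id would confirm the pattern described above.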

### Third Pass (LSI)

Another model we might try before moving on to supervised approaches (which might perform better because they do not assume that there are no interactions among words) is Latent Semantic Indexing (LSI). The gensim implementation of LSI uses fast truncated SVD (Singular Value Decomposition) and performs well for text corpora much larger than RAM, since only constant memory is needed. Since our text corpus is not terribly large, it is likely that LSI models will not output substantially better topic clusters than LDA, but it is worth trying this implementation.
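
For reference, the truncated SVD behind LSI can be written as follows. The notation is introduced here for illustration and is not taken from the notebook: $X$ is the $m \times n$ term-document matrix of tf-idf weights ($m$ vocabulary terms, $n$ documents) and $k$ is the number of topics.

$$
X \approx U_k \Sigma_k V_k^{\top}, \qquad U_k \in \mathbb{R}^{m \times k},\quad \Sigma_k \in \mathbb{R}^{k \times k},\quad V_k \in \mathbb{R}^{n \times k}
$$

Each column of $U_k$ holds one topic's loadings over the vocabulary, the diagonal of $\Sigma_k$ holds the $k$ largest singular values, and each row of $V_k$ gives one document's coordinates in topic space. Singular vectors are defined only up to sign and are not constrained to be non-negative, so LSI topic weights can be negative and do not sum to one, unlike the word probabilities in an LDA topic.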

In [95]:
lsi1a = gensim.models.lsimodel.LsiModel(corpus=tfidf_noun_corpus, id2word=id2noun, num_topics=5, chunksize=200)

In [96]:
lsi1a.print_topics()

[u'0.398*"court" + 0.245*"state" + 0.194*"petitioner" + 0.163*"act" + 0.161*"respondent" + 0.146*"district" + 0.129*"appeal" + 0.127*"case" + 0.125*"opinion" + 0.122*"trial"',
 u'-0.333*"union" + -0.323*"employee" + -0.311*"labor" + -0.241*"employer" + -0.233*"board" + 0.206*"trial" + -0.203*"act" + -0.178*"bargain" + 0.143*"jury" + -0.142*"relation"',
 u'0.613*"tax" + -0.184*"trial" + 0.159*"state" + -0.157*"union" + -0.148*"jury" + -0.146*"labor" + 0.145*"property" + -0.134*"petitioner" + -0.133*"employee" + 0.128*"income"',
 u'-0.487*"tax" + 0.243*"school" + -0.233*"jury" + -0.191*"trial" + -0.179*"union" + -0.144*"employee" + -0.140*"employer" + -0.134*"labor" + 0.134*"district" + -0.131*"sentence"',
 u'-0.446*"decree" + -0.276*"gas" + -0.225*"commission" + -0.200*"jurisdiction" + -0.153*"u s " + 0.152*"amendment" + -0.144*"court" + -0.140*"tribe" + 0.135*"school" + -0.128*"order"']

In [99]:
lsi1b = gensim.models.lsimodel.LsiModel(corpus=tfidf_noun_corpus, id2word=id2noun, num_topics=10, chunksize=200)

In [100]:
lsi1b.print_topics()

[u'0.398*"court" + 0.246*"state" + 0.194*"petitioner" + 0.163*"act" + 0.161*"respondent" + 0.147*"district" + 0.129*"appeal" + 0.127*"case" + 0.125*"opinion" + 0.122*"trial"',
 u'-0.328*"union" + -0.323*"employee" + -0.318*"labor" + -0.244*"employer" + -0.228*"board" + 0.207*"trial" + -0.203*"act" + -0.178*"bargain" + 0.148*"jury" + -0.147*"relation"',
 u'0.622*"tax" + -0.175*"union" + 0.167*"state" + -0.163*"trial" + -0.157*"labor" + 0.144*"commerce" + 0.142*"property" + -0.142*"petitioner" + -0.136*"employee" + -0.131*"jury"',
 u'-0.502*"tax" + -0.223*"jury" + -0.212*"trial" + -0.161*"labor" + 0.156*"school" + -0.155*"union" + -0.142*"employer" + -0.140*"employee" + -0.136*"petitioner" + 0.122*"district"',
 u'0.309*"school" + 0.222*"state" + 0.209*"amendment" + -0.186*"court" + -0.183*"act" + -0.158*"decree" + 0.147*"post" + 0.137*"opinion" + 0.129*"plan" + -0.128*"appeal"']

As we predicted above, the LSI model did not output more discerning topics than the LDA model. Therefore, we can now move on to supervised models.