# NLP Modelling on Legal Court Text

*Yi Yin*

## Table of Contents

1. Environment Information
2. Read PDF files
3. Unsupervised Learning using LDA


### 1. Environment Information

- Show the Python version and operating system used to run this notebook.
- Help others who use this code reproduce the results.

In [1]:
import IPython

# Print the Python version and system information for this machine
print(IPython.sys_info())

{'commit_hash': '523ed2fe5',
 'commit_source': 'installation',
 'default_encoding': 'UTF-8',
 'ipython_path': '/anaconda/envs/nlp/lib/python3.6/site-packages/IPython',
 'ipython_version': '7.2.0',
 'os_name': 'posix',
 'platform': 'Darwin-18.5.0-x86_64-i386-64bit',
 'sys_executable': '/anaconda/envs/nlp/bin/python',
 'sys_platform': 'darwin',
 'sys_version': '3.6.8 |Anaconda, Inc.| (default, Dec 29 2018, 19:04:46) \n'
                '[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]'}
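
`IPython.sys_info()` is only available where IPython is installed. As a minimal sketch, the same kind of information can be collected with the standard library alone; the helper name `environment_summary` is hypothetical and not part of the original notebook.

```python
import platform
import sys


def environment_summary():
    """Collect a few interpreter and OS details as a plain dict."""
    return {
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "executable": sys.executable,
    }


summary = environment_summary()
for key, value in summary.items():
    print(f"{key}: {value}")
```

The printed values depend on the machine, so they will differ from the output shown above.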


### 2. Read PDF files

In [2]:
# import pdfminer to read PDF files
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

# import io (input and output); BytesIO is an in-memory buffer that collects the extracted text as bytes
from io import BytesIO

# list all file names in a folder, for convenience when reading the PDF files
import glob

# used later to store the cleaned strings in a JSON file
import simplejson as json

# re (regular expressions) to find strings matching certain patterns
import re

#### A function to read a PDF file:
 
    pdf_file: the file name of the PDF, including its path (i.e. location)
    return: content of the PDF as a bytes object
    (the text is collected in a BytesIO buffer, so it must be decoded later)


In [4]:
def read_pdf(pdf_file):
    """Extract the text of a PDF and return it as UTF-8 encoded bytes."""
    resource_mgr = PDFResourceManager()
    retstr = BytesIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(resource_mgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(resource_mgr, device)
    maxpages = 0      # 0 means no page limit
    caching = True
    pagenos = set()   # an empty set means all pages

    # the with-block closes the file even if a page fails to parse
    with open(pdf_file, 'rb') as fp:
        for page in PDFPage.get_pages(fp, pagenos,
                                      maxpages=maxpages,
                                      caching=caching,
                                      check_extractable=True):
            interpreter.process_page(page)

    # read the buffer before closing it
    result_str = retstr.getvalue()

    device.close()
    retstr.close()

    return result_str

In [6]:
case_list = glob.glob('./case_test/*.pdf')
# a list to store the names of the PDFs
pdf_name = []
# a list to store the contents of the PDFs
pdf_content = []

# read every PDF and store its name and content in the respective list
for case in case_list:
    pdf_name.append(case.replace('./case_test/', '').replace('.pdf', ''))
    pdf_content.append(read_pdf(case))
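
The loop above derives each case name by string replacement, which would also strip any `.pdf` that happens to appear in the middle of a file name. A minimal alternative sketch uses `pathlib`, whose `stem` attribute drops the directory and only the final extension. The helper `case_name` and the example paths are hypothetical.

```python
from pathlib import Path


def case_name(pdf_path):
    """Return the file name without its directory or final extension."""
    return Path(pdf_path).stem


# Hypothetical paths, used only to illustrate the behaviour.
example_paths = [
    "./case_test/example_case_a.pdf",
    "./case_test/sub.dir/example_case_b.pdf",
]
names = [case_name(p) for p in example_paths]
print(names)
```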

In [7]:
pdf_content[0][:300]

b'Intellectual Ventures I LLC v. Motorola Mobility LLC, 870 F.3d 1320 (2017)\n124 U.S.P.Q.2d 1129\n\n870 F.3d 1320\n\nUnited States Court of Appeals,\n\nFederal Circuit.\n\nINTELLECTUAL VENTURES I LLC, Intellectual Ventures II LLC, Plaintiffs\xe2\x80\x93Appellees\n\nMOTOROLA MOBILITY LLC, fka Motorola Mobility, INC., Def'

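Decode the PDF contents. The output above is a bytes object (note the `b'...'` prefix and escapes such as `\xe2\x80\x93`), so each document is decoded into a string.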

In [8]:
court_text = []
for content in pdf_content:
    # decode every PDF's content, specifying the "utf-8" encoding
    # p.s. "utf-8" is the most common encoding today
    decoded_content = content.decode("utf-8") 
    court_text.append(decoded_content) 

In [17]:
type(court_text[0])

str
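
To make the bytes-versus-string distinction concrete, here is a small self-contained sketch. It writes a fragment taken from the output shown earlier into a `BytesIO` buffer and decodes it; the three bytes `\xe2\x80\x93` become a single en dash character.

```python
from io import BytesIO

# Write raw UTF-8 bytes into an in-memory buffer, as the PDF converter does.
buffer = BytesIO()
buffer.write(b"Plaintiffs\xe2\x80\x93Appellees")
raw = buffer.getvalue()   # must be read before the buffer is closed
buffer.close()

# Decoding turns the byte sequence into a Python str.
text = raw.decode("utf-8")

print(type(raw).__name__, len(raw))
print(type(text).__name__, len(text))
```

The byte length is larger than the string length because the en dash occupies three bytes in UTF-8 but counts as one character once decoded.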

### 3. Unsupervised Learning using Latent Dirichlet Allocation (LDA) 

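Latent Dirichlet Allocation (LDA) is a popular topic modelling technique. It treats each document as a mixture of topics and each topic as a probability distribution over words, and it learns both from the corpus without any labels.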
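
For reference, the standard generative formulation of LDA can be written as follows. The symbols are notation introduced here rather than taken from the notebook: $K$ is the number of topics (the model trained later in this section uses `num_topics=2`), $\alpha$ and $\beta$ are Dirichlet hyperparameters, $\theta_d$ is the topic mixture of document $d$, $\phi_k$ is the word distribution of topic $k$, and $w_{d,n}$ is the $n$-th word of document $d$.

```latex
\begin{aligned}
\phi_k   &\sim \mathrm{Dirichlet}(\beta),  & k &= 1, \dots, K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1, \dots, D \\
z_{d,n}  &\sim \mathrm{Categorical}(\theta_d) \\
w_{d,n}  &\sim \mathrm{Categorical}(\phi_{z_{d,n}})
\end{aligned}
```

Training estimates $\theta_d$ and $\phi_k$ from the observed words alone; the topic assignments $z_{d,n}$ are latent.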

In [32]:
from nltk.tokenize import RegexpTokenizer
from nltk.stem.snowball import EnglishStemmer

In [23]:
import nltk
# English stop words such as "the", "a", "and", which are removed below
from nltk.corpus import stopwords

In [24]:
stopwords = set(stopwords.words('english'))

In [38]:
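# A custom regex tokenizer to get rid of all punctuation
tokenizer = RegexpTokenizer(r'\w+')
# create the stemmer once instead of once per token
stemmer = EnglishStemmer()
token_court = []
for text in court_text:
    # remove all hyperlinks
    text = re.sub(r'http\S+', '', text, flags=re.MULTILINE)
    # split into word tokens, dropping punctuation
    toks = tokenizer.tokenize(text)
    # drop stop words, then stem the remaining tokens
    # note: the stop-word check is case-sensitive, so capitalized words
    # such as "The" pass the filter and are lowercased by the stemmer
    stemd = [stemmer.stem(tok) for tok in toks if tok not in stopwords]
    # collect the tokens of each document
    token_court.append(stemd)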
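
The cleaning steps above can be sketched without NLTK to show what each one does. This is an illustration under stated assumptions, not the notebook's pipeline: `EXAMPLE_STOPWORDS` is a tiny hypothetical stop-word set, stemming is left out, and tokens are lowercased before the stop-word check so that capitalized stop words are removed as well.

```python
import re

# Hypothetical, tiny stop-word set used only for this illustration.
EXAMPLE_STOPWORDS = {"the", "a", "and", "of", "at"}


def tokenize(text, stopwords=EXAMPLE_STOPWORDS):
    """Strip URLs, split on non-word characters, lowercase, drop stop words."""
    text = re.sub(r"http\S+", "", text)
    tokens = re.findall(r"\w+", text)
    return [tok.lower() for tok in tokens if tok.lower() not in stopwords]


sample = "The Court held the patent claims invalid; see http://example.com/opinion"
print(tokenize(sample))
```

Lowercasing before the comparison is a design choice: with a case-sensitive check, the leading "The" in the sample would survive the filter.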


Prepare the Bag of Words for LDA model

In [50]:
from gensim.corpora.dictionary import Dictionary
# Create a dictionary and a bag-of-words corpus from the tokenized court text
gen_dictionary = Dictionary(token_court)
common_corpus = [gen_dictionary.doc2bow(text) for text in token_court]
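
A bag of words represents each document as pairs of (token id, count), discarding word order. The sketch below imitates that idea in plain Python. It is not gensim's implementation, the helpers `build_vocabulary` and `to_bow` are hypothetical, and the ids gensim assigns may differ.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct token to an integer id in order of first appearance."""
    vocab = {}
    for doc in documents:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for the tokens found in vocab."""
    counts = Counter(vocab[tok] for tok in doc if tok in vocab)
    return sorted(counts.items())


# Two toy documents, already tokenized.
docs = [["patent", "claim", "patent"], ["claim", "court"]]
vocab = build_vocabulary(docs)
bows = [to_bow(doc, vocab) for doc in docs]
print(vocab)
print(bows)
```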

In [47]:
from gensim.models import LdaModel
# Train the model on the corpus.
lda = LdaModel(common_corpus, num_topics=2, id2word=gen_dictionary, passes=10)

In [48]:
topics = lda.print_topics(num_words=5)
for topic in topics:
    print(topic)

(0, '0.014*"patent" + 0.013*"claim" + 0.011*"s" + 0.010*"the" + 0.009*"u"')
(1, '0.014*"patent" + 0.009*"claim" + 0.008*"f" + 0.008*"ericsson" + 0.008*"d"')
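
Each topic is printed as a string of `weight*"word"` terms. As a minimal sketch, such a string can be parsed into (word, weight) pairs with a regular expression, which makes it easy to spot single-letter tokens such as "s" and "u" that are probably fragments of abbreviations rather than meaningful terms. The helper `parse_topic` is hypothetical; the input string is copied from the output above.

```python
import re

# Matches terms of the form 0.014*"patent"
TERM_PATTERN = re.compile(r'([0-9.]+)\*"([^"]+)"')


def parse_topic(topic_string):
    """Turn a printed topic string into a list of (word, weight) pairs."""
    return [(word, float(weight))
            for weight, word in TERM_PATTERN.findall(topic_string)]


topic_0 = '0.014*"patent" + 0.013*"claim" + 0.011*"s" + 0.010*"the" + 0.009*"u"'
terms = parse_topic(topic_0)
short_terms = [word for word, _ in terms if len(word) < 2]
print(terms)
print(short_terms)
```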


In [54]:
# note: newer pyLDAvis releases rename this module to pyLDAvis.gensim_models
import pyLDAvis.gensim
lda_display = pyLDAvis.gensim.prepare(lda, common_corpus, gen_dictionary, sort_topics=False)
pyLDAvis.display(lda_display)

