**Relevant Source Materials**

https://medium.com/@bedigunjit/simple-guide-to-text-classification-nlp-using-svm-and-naive-bayes-with-python-421db3a72d34

ETA Module 6, Vectorization with SciKit Learn

Stat. Learning Final Project

### Import

In [1]:
import os
from glob import glob
import pandas as pd
import numpy as np
import re
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import model_selection, svm
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

### SetUp

In [2]:
## MODIFY THIS
# get path to your folder that holds the txt files
source_files = "C:/Users/jacqu/Downloads/Court Case PDFs/Court Case TXTs"
# outputs a list of all the txt files in the folder
source_file_list = sorted(glob(f"{source_files}/*.txt"))

# creates a list of tuples with an element for the source path and
# for the file title
file_data = []
for source_file_path in source_file_list:
    # os.path handles the path separator of the current platform, so this
    # does not depend on '\\'; still worth checking with INFO.sample() or .head()
    file_title = os.path.splitext(os.path.basename(source_file_path))[0]
    file_data.append((source_file_path, file_title))

# creating df with the file title as the index and source path as a col
INFO = pd.DataFrame(file_data, columns=['txt_path','file_title'])\
    .set_index('file_title').sort_index()
INFO.head()

file_title,txt_path
"A.B. v. Shilo Inn, Salem, LLC, 2023 U.S. Dist. LEXIS 143289",C:/Users/jacqu/Downloads/Court Case PDFs/Court...
"A.D. v. Best Western Int_l, Inc., 2023 U.S. Dist. LEXIS 150376",C:/Users/jacqu/Downloads/Court Case PDFs/Court...
"A.D. v. Choice Hotels Int_l, Inc., 2023 U.S. Dist. LEXIS 150380",C:/Users/jacqu/Downloads/Court Case PDFs/Court...
B.M. v. Wyndham Hotels,C:/Users/jacqu/Downloads/Court Case PDFs/Court...
"Bacon v. Marshall, 2023 U.S. App. LEXIS 32309",C:/Users/jacqu/Downloads/Court Case PDFs/Court...


In [3]:
# making the CORPUS
## CORPUS df: multiindex = doc name/index, sent. num, token num
## columns = pos tag, token str, term str (token str normalized)

narratives_list = []
for doc_idx, txt_path in enumerate(INFO['txt_path']):
    with open(txt_path, 'r',  encoding='utf-8') as file:
        narrative = file.read()
    narratives_list.append({"title": INFO.index[doc_idx], "narrative": narrative})

# Convert the list of dictionaries to a DataFrame indexed by title
narratives = pd.DataFrame(narratives_list).set_index("title")
narratives.head()

title,narrative
"A.B. v. Shilo Inn, Salem, LLC, 2023 U.S. Dist. LEXIS 143289",OPINION AND ORDER GRANTING DEFENDANT SUMMIT H...
"A.D. v. Best Western Int_l, Inc., 2023 U.S. Dist. LEXIS 150376",OPINION AND ORDER This matter comes before the...
"A.D. v. Choice Hotels Int_l, Inc., 2023 U.S. Dist. LEXIS 150380",OPINION AND ORDER This matter comes before the...
B.M. v. Wyndham Hotels,ORDER GRANTING IN PART AND DENYING IN PART DE...
"Bacon v. Marshall, 2023 U.S. App. LEXIS 32309",[*1] ORDER AND JUDGMENT* _____________________...


In [4]:
# one row per sentence: split each narrative into sentences, then explode
df = pd.DataFrame(index=narratives.index)
df['sent_str'] = [nltk.sent_tokenize(text) for text in narratives.narrative]
df = df.explode('sent_str')
# number the sentences within each document (0, 1, 2, ...)
s1 = df.index.to_series()
s2 = s1.groupby(s1).cumcount()
df.index = [df.index, s2]
df.index.names = ['title','sent_num']
# one row per token: whitespace-tokenize and POS-tag each sentence, then explode
# (iterating over the values avoids integer lookups on a label-based index)
# nltk.word_tokenize(sent) is an alternative tokenizer
df['token_pos'] = [nltk.pos_tag(nltk.WhitespaceTokenizer().tokenize(sent)) for sent in df.sent_str]
df = df.explode('token_pos')
# number the tokens within each (document, sentence) pair
s1 = df.index.to_series()
s2 = s1.groupby(s1).cumcount()
df.index = [df.index.get_level_values(level=0), df.index.get_level_values(level=1), s2]
df.index.names = ['title','sent_num', 'token_num']
df.drop(columns=['sent_str'], inplace=True)
df.head()

title,sent_num,token_num,token_pos
"A.B. v. Shilo Inn, Salem, LLC, 2023 U.S. Dist. LEXIS 143289",0,0,"(OPINION, NN)"
"A.B. v. Shilo Inn, Salem, LLC, 2023 U.S. Dist. LEXIS 143289",0,1,"(AND, CC)"
"A.B. v. Shilo Inn, Salem, LLC, 2023 U.S. Dist. LEXIS 143289",0,2,"(ORDER, NNP)"
"A.B. v. Shilo Inn, Salem, LLC, 2023 U.S. Dist. LEXIS 143289",0,3,"(GRANTING, NNP)"
"A.B. v. Shilo Inn, Salem, LLC, 2023 U.S. Dist. LEXIS 143289",0,4,"(DEFENDANT, NNP)"

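The explode-then-number pattern used in the cell above can be tried in isolation. The following is a minimal sketch on made-up data (the `toy` frame and its titles are hypothetical, and `nltk` is left out so the example runs without any downloads); it uses `set_index(..., append=True)` as one possible way to add the counter as an index level.

```python
import pandas as pd

# Toy stand-in for the sentence table (titles and sentences are made up).
toy = pd.DataFrame(
    {"sent_str": [["First sentence.", "Second one."], ["Only sentence."]]},
    index=pd.Index(["doc_a", "doc_b"], name="title"),
)

# explode() yields one row per list element and repeats the index label.
toy = toy.explode("sent_str")

# cumcount() numbers the rows within each group of identical titles: 0, 1, 2, ...
sent_num = toy.groupby(level="title").cumcount().rename("sent_num")

# append=True keeps 'title' and adds the counter as a second index level.
toy = toy.set_index(sent_num, append=True)
print(toy)
```

Repeating the same two steps on a token column would give the third `token_num` level.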

In [5]:
df['token_str'] = df.token_pos.apply(lambda x: x[0].strip())
df['term_str'] = df.token_pos.apply(lambda x: x[0].lower().strip())
df['pos_tag'] = df.token_pos.apply(lambda x: x[1])
CORPUS = df.drop(columns="token_pos")
CORPUS.head()

title,sent_num,token_num,token_str,term_str,pos_tag
"A.B. v. Shilo Inn, Salem, LLC, 2023 U.S. Dist. LEXIS 143289",0,0,OPINION,opinion,NN
"A.B. v. Shilo Inn, Salem, LLC, 2023 U.S. Dist. LEXIS 143289",0,1,AND,and,CC
"A.B. v. Shilo Inn, Salem, LLC, 2023 U.S. Dist. LEXIS 143289",0,2,ORDER,order,NNP
"A.B. v. Shilo Inn, Salem, LLC, 2023 U.S. Dist. LEXIS 143289",0,3,GRANTING,granting,NNP
"A.B. v. Shilo Inn, Salem, LLC, 2023 U.S. Dist. LEXIS 143289",0,4,DEFENDANT,defendant,NNP


In [6]:
np.random.seed(3418)

**Here's a question:** Do we want to remove stop words, perhaps using a custom stop-word list? And do we want to lemmatize the words? Consult Brain??? 0.0

If the answer is no to one or both questions, we can work from a df with index: title and columns: raw_narrative, n_tokens, and go straight to TfidfVectorizer. Most of the steps above then become unnecessary: go directly from the narratives df to tfidf_engine.fit_transform(narratives.narrative).

### Vectorization with SciKit Learn, TFIDF

In [7]:
## DOC df: index = doc name/index
## columns = narrative str, num tokens

In [8]:
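def gather_docs(CORPUS, ohco_level, term_col='term_str'):
    # OHCO = names of the index levels (title, sent_num, token_num)
    OHCO = CORPUS.index.names
    # work on a copy of the term column so the caller's CORPUS is not modified
    terms = CORPUS[[term_col]].astype('str')
    # join the terms of each group (ohco_level=1 -> one string per document)
    DOC = terms.groupby(OHCO[:ohco_level])[term_col].apply(lambda x:' '.join(x)).to_frame('doc_str')
    return DOC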

In [9]:
DOC = gather_docs(CORPUS, 1)
DOC['n_tokens'] = DOC.doc_str.apply(lambda x: len(x.split()))
DOC.head()

title,doc_str,n_tokens
"A.B. v. Shilo Inn, Salem, LLC, 2023 U.S. Dist. LEXIS 143289",opinion and order granting defendant summit ho...,3608
"A.D. v. Best Western Int_l, Inc., 2023 U.S. Dist. LEXIS 150376",opinion and order this matter comes before the...,2940
"A.D. v. Choice Hotels Int_l, Inc., 2023 U.S. Dist. LEXIS 150380",opinion and order this matter comes before the...,2926
B.M. v. Wyndham Hotels,order granting in part and denying in part def...,7354
"Bacon v. Marshall, 2023 U.S. App. LEXIS 32309",[*1] order and judgment* _____________________...,1933


In [10]:
ngram_range = (2,2)
n_terms = 1000

**Applying TFIDF Vectorization**

In [11]:
tfidf_engine = TfidfVectorizer(
    stop_words = 'english',
    ngram_range = ngram_range,
    max_features = n_terms,
    norm = 'l2', 
    use_idf = True)
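For reference, the weights this configuration produces can be written out as follows. This assumes the parameters not set in the cell keep their scikit-learn defaults (`smooth_idf=True`, `sublinear_tf=False`). Here $n$ is the number of documents, $\mathrm{df}(t)$ is the number of documents containing bigram $t$, and $\mathrm{tf}(t,d)$ is the raw count of $t$ in document $d$.

$$\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1$$

$$w_{t,d} = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t), \qquad \mathrm{tfidf}(t,d) = \frac{w_{t,d}}{\sqrt{\sum_{t'} w_{t',d}^{2}}}$$

The second step is the `norm = 'l2'` setting: each document's row is scaled to unit length. `max_features = n_terms` keeps only the `n_terms` bigrams with the highest total count across the corpus.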

**Vectorized data**

In [12]:
X = tfidf_engine.fit_transform(DOC.doc_str)
print(X[:1])

  (0, 607)	0.017124985771566223
  (0, 367)	0.017124985771566223
  (0, 141)	0.015460942182795936
  (0, 761)	0.013840263039999507
  (0, 516)	0.017124985771566223
  (0, 470)	0.012887562575125163
  (0, 49)	0.015460942182795936
  (0, 819)	0.015958974638220037
  (0, 115)	0.03781512146259613
  (0, 933)	0.0463828265483878
  (0, 527)	0.012082065809330791
  (0, 860)	0.011606703175799815
  (0, 137)	0.015460942182795936
  (0, 533)	0.015958974638220037
  (0, 736)	0.013186237336323806
  (0, 746)	0.030921884365591873
  (0, 934)	0.06849994308626489
  (0, 487)	0.0463828265483878
  (0, 237)	0.010966058748964207
  (0, 181)	0.04376406044771847
  (0, 695)	0.017124985771566223
  (0, 120)	0.02637247467264761
  (0, 252)	0.025775125150250325
  (0, 44)	0.01350302437129094
  (0, 42)	0.016509523975400443
  :	:
  (0, 759)	0.013840263039999507
  (0, 461)	0.05483029374482104
  (0, 662)	0.017822738807826602
  (0, 302)	0.015460942182795936
  (0, 247)	0.06383589855288015
  (0, 362)	0.040509073113872814
  (0, 415)	0.019

**Learned vocabulary**

In [13]:
import itertools
print(dict(itertools.islice(tfidf_engine.vocabulary_.items(), 5)))

{'motion dismiss': 660, 'district judge': 393, 'plaintiff filed': 745, 'residence inn': 821, 'inn portland': 556}


In [14]:
TFIDF = pd.DataFrame(X.toarray(), columns=tfidf_engine.get_feature_names_out(), index=DOC.index)
TFIDF.head()

title,07 2023,07057 07057,07057 page,10 11,10th cir,11 12,11th cir,11th ed,12 07,12 13,...,wyndham hotels,xxx 5542,xxx xxx,year old,years life,years old,york jurisdiction,york state,zero point,zte phone
"A.B. v. Shilo Inn, Salem, LLC, 2023 U.S. Dist. LEXIS 143289",0.0,0.0,0.0,0.0,0.0,0.0,0.011384,0.0,0.0,0.0,...,0.015461,0.0,0.0,0.0,0.0,0.011171,0.0,0.0,0.0,0.0
"A.D. v. Best Western Int_l, Inc., 2023 U.S. Dist. LEXIS 150376",0.0,0.0,0.0,0.0,0.0,0.0,0.023652,0.016578,0.0,0.0,...,0.016061,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"A.D. v. Choice Hotels Int_l, Inc., 2023 U.S. Dist. LEXIS 150380",0.0,0.0,0.0,0.0,0.0,0.0,0.027196,0.019062,0.0,0.0,...,0.018468,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
B.M. v. Wyndham Hotels,0.0,0.0,0.0,0.030476,0.0,0.0,0.0,0.0,0.0,0.0,...,0.062301,0.0,0.0,0.0,0.0,0.009003,0.0,0.0,0.0,0.0
"Bacon v. Marshall, 2023 U.S. App. LEXIS 32309",0.31267,0.0,0.0,0.0,0.368692,0.0,0.0,0.0,0.295097,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [15]:
TFIDF.stack().nlargest(20).to_frame('score')

title,,score
"United States v. Wilkins, 538 F. Supp. 3d 49",mr wilkins,0.923909
"State v. Chatman, 2023 La. App. LEXIS 2102",ms jones,0.89628
"United States v. Woods, 2021 U.S. Dist. LEXIS 81858",cornelius galloway,0.868035
"McSean v. LeMons, 2023 U.S. Dist. LEXIS 217918",8th cir,0.839268
"Kocher v. Sec_y VA., 2023 U.S. App. LEXIS 32355",federal sector,0.819387
"Rice Enters., LLC v. RSUI Indem. Co., 2023 U.S. Dist. LEXIS 217212",umbrella policy,0.768822
"D.B. v. IE Hotel Grp., LLC, 2023 U.S. Dist. LEXIS 17945",g6 defendants,0.747694
United States v. Robinson,jane doe,0.743498
"United States v. Smith, 2023 U.S. Dist. LEXIS 217838",minor victim,0.726214
"State v. Taylor-Hollingsworth, 2023 Ohio App. LEXIS 4249",aggravated robbery,0.696958


### VOCAB DF
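Making a vocabulary list of significant bigrams based on tfidf. Since ngram_range is (2,2), the vocabulary contains bigrams only; setting it to (1,2) would include unigrams as well.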

In [16]:
VOCAB = TFIDF.mean().to_frame('tfidf_mean')
VOCAB.sort_values('tfidf_mean', ascending=False).head(20)

,tfidf_mean
sex trafficking,0.073895
united states,0.067361
trial court,0.061735
district court,0.060788
dist lexis,0.038739
criminal history,0.028859
8th cir,0.028184
4th cir,0.026022
commercial sex,0.024677
supp 3d,0.024112


### Logistic Regression
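
This section is still a stub, so the following is only a sketch of how the imported `LogisticRegression`, `model_selection`, and `accuracy_score` might be combined. The notebook does not define class labels, so `toy_docs` and `toy_labels` ("civil" / "criminal") are entirely made up for illustration. The sketch also assumes the vectorizer should be fit on the training split only, which differs from fitting on all of `DOC` as done above.

```python
from sklearn import model_selection
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Made-up documents and HYPOTHETICAL labels; the real labels are not defined yet.
toy_docs = [
    "plaintiff alleges hotel negligence and seeks damages",
    "plaintiff filed a civil complaint against the franchisor",
    "motion to dismiss the civil claim is denied",
    "plaintiff seeks damages under the civil statute",
    "defendant was convicted and sentenced by the trial court",
    "the jury convicted the defendant of aggravated robbery",
    "defendant appeals his criminal sentence",
    "the trial court sentenced the defendant after conviction",
]
toy_labels = ["civil"] * 4 + ["criminal"] * 4

# Hold out 25% of the documents, keeping the class balance in both splits.
train_docs, test_docs, y_train, y_test = model_selection.train_test_split(
    toy_docs, toy_labels, test_size=0.25, random_state=3418, stratify=toy_labels
)

# Fit the vocabulary and IDF weights on the training documents only.
toy_vec = TfidfVectorizer(stop_words="english")
X_train = toy_vec.fit_transform(train_docs)
X_test = toy_vec.transform(test_docs)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
acc = accuracy_score(y_test, pred)
print("accuracy:", acc)
```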

### SVM
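
This section is also a stub. As a starting point, here is a hedged sketch of a linear-kernel SVM wrapped in a pipeline and scored with cross-validation. The labels are again hypothetical, `toy_docs`, `toy_labels`, and `svm_pipe` are made-up names, and the linear kernel and `cv=2` are assumptions chosen only so the toy example runs.

```python
from sklearn import model_selection, svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Made-up documents and HYPOTHETICAL labels; the real labels are not defined yet.
toy_docs = [
    "plaintiff alleges hotel negligence and seeks damages",
    "plaintiff filed a civil complaint against the franchisor",
    "motion to dismiss the civil claim is denied",
    "plaintiff seeks damages under the civil statute",
    "defendant was convicted and sentenced by the trial court",
    "the jury convicted the defendant of aggravated robbery",
    "defendant appeals his criminal sentence",
    "the trial court sentenced the defendant after conviction",
]
toy_labels = ["civil"] * 4 + ["criminal"] * 4

# The pipeline refits the vectorizer inside each fold, so the held-out
# documents never influence the vocabulary or the IDF weights.
svm_pipe = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    svm.SVC(kernel="linear", C=1.0),
)

# Two stratified folds; each score is the accuracy on one held-out half.
scores = model_selection.cross_val_score(svm_pipe, toy_docs, toy_labels, cv=2)
print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())
```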