## Word statistics

In [1]:
import spacy
import pandas as pd
import prettytable

In [32]:
train_df = pd.read_csv('./Data/train.csv')

idx = 0
first_party = train_df.loc[idx, 'first_party'].lower()
second_party = train_df.loc[idx, 'second_party'].lower()
fact = train_df.loc[idx, 'facts'].lower()

nlp = spacy.load('en_core_web_sm')
doc = nlp(fact)

print(fact)
print(first_party)
print(second_party)

on june 27, 1962, phil st. amant, a candidate for public office, made a television speech in baton rouge, louisiana.  during this speech, st. amant accused his political opponent of being a communist and of being involved in criminal activities with the head of the local teamsters union.  finally, st. amant implicated herman thompson, an east baton rouge deputy sheriff, in a scheme to move money between the teamsters union and st. amant’s political opponent. 
thompson successfully sued st. amant for defamation.  louisiana’s first circuit court of appeals reversed, holding that thompson did not show st. amant acted with “malice.”  thompson then appealed to the supreme court of louisiana.  that court held that, although public figures forfeit some of their first amendment protection from defamation, st. amant accused thompson of a crime with utter disregard of whether the remarks were true.  finally, that court held that the first amendment protects uninhibited, robust debate, rather tha

In [33]:
list(doc.sents)

[on june 27, 1962, phil st. amant, a candidate for public office, made a television speech in baton rouge, louisiana.  ,
 during this speech, st. amant accused his political opponent of being a communist and of being involved in criminal activities with the head of the local teamsters union.  ,
 finally, st. amant implicated herman thompson, an east baton rouge deputy sheriff, in a scheme to move money between the teamsters union and st. amant’s political opponent. ,
 thompson successfully sued st. amant for defamation.  ,
 louisiana’s first circuit court of appeals reversed, holding that thompson did not show st. amant acted with “malice.”  ,
 thompson then appealed to the supreme court of louisiana.  ,
 that court held that, although public figures forfeit some of their first amendment protection from defamation, st. amant accused thompson of a crime with utter disregard of whether the remarks were true.  ,
 finally, that court held that the first amendment protects uninhibited, robu

In [34]:
sen_table = prettytable.PrettyTable()
sen_table.field_names = ['Token', 'Lemma', 'POS', 'Tag', 'Dep', "HEAD"]

for token in doc:
    sen_table.add_row([token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.head.text])

print(sen_table)

+--------------+--------------+-------+------+----------+------------+
|    Token     |    Lemma     |  POS  | Tag  |   Dep    |    HEAD    |
+--------------+--------------+-------+------+----------+------------+
|      on      |      on      |  ADP  |  IN  |   prep   |    made    |
|     june     |     june     | PROPN | NNP  |   pobj   |     on     |
|      27      |      27      |  NUM  |  CD  |  nummod  |    june    |
|      ,       |      ,       | PUNCT |  ,   |  punct   |    june    |
|     1962     |     1962     |  NUM  |  CD  |  nummod  |    june    |
|      ,       |      ,       | PUNCT |  ,   |  punct   |    made    |
|     phil     |     phil     | PROPN | NNP  | compound |     st     |
|      st      |      st      | PROPN | NNP  | compound |   amant    |
|      .       |      .       | PROPN | NNP  | compound |   amant    |
|    amant     |    amant     | PROPN | NNP  |  nsubj   |    made    |
|      ,       |      ,       | PUNCT |  ,   |  punct   |   amant    |
|     

In [35]:
verb_table = prettytable.PrettyTable()
verb_table.field_names = ['Token', 'Lemma', 'POS', 'Tag', 'Dep', "HEAD"]

for token in doc:
    if token.pos_ == 'VERB':
        verb_table.add_row([token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.head.text])

print(verb_table)

+------------+-----------+------+-----+-------+------------+
|   Token    |   Lemma   | POS  | Tag |  Dep  |    HEAD    |
+------------+-----------+------+-----+-------+------------+
|    made    |    make   | VERB | VBD |  ROOT |    made    |
|  accused   |   accuse  | VERB | VBD |  ROOT |  accused   |
|  involved  |  involve  | VERB | VBN | pcomp |     of     |
| implicated | implicate | VERB | VBD |  ROOT | implicated |
|    move    |    move   | VERB |  VB |  acl  |   scheme   |
|    sued    |    sue    | VERB | VBD |  ROOT |    sued    |
|  reversed  |  reverse  | VERB | VBD |  ROOT |  reversed  |
|  holding   |    hold   | VERB | VBG | csubj |    show    |
|    show    |    show   | VERB |  VB | ccomp |  reversed  |
|   acted    |    act    | VERB | VBD | ccomp |    show    |
|  appealed  |   appeal  | VERB | VBD |  ROOT |  appealed  |
|    held    |    hold   | VERB | VBD |  ROOT |    held    |
|  accused   |   accuse  | VERB | VBD | ccomp |    held    |
|    held    |    hold  

In [36]:
import spacy
import pandas as pd

train_df = pd.read_csv('./Data/train.csv')

nlp = spacy.load('en_core_web_sm')

verb_list = []
verb_info_list = []

for fact in train_df['facts']:
    doc = nlp(fact.lower())
    for token in doc:
        if token.pos_ == 'VERB':
            verb_list.append(token.lemma_)
            verb_info = {'lemma': token.lemma_, 'dependency': token.dep_}  # Save as a dictionary
            verb_info_list.append(verb_info)

In [37]:
print(verb_list)

['make', 'accuse', 'involve', 'implicate', 'move', 'sue', 'reverse', 'hold', 'show', 'act', 'appeal', 'hold', 'accuse', 'hold', 'protect', 'shoot', 'happen', 'ride', 'suffer', 'identify', 'try', 'convict', 'carry', 'crack', 'rule', 'try', 'knock', 'find', 'sentence', 'file', 'violate', 'argue', 'base', 'infer', 'deny', 'appeal', 'reverse', 'hold', 'violate', 'adjudicate', 'base', 'present', 'convict', 'sentence', 'grant', 'uphold', 'instruct', 'look', 'mitigate', 'resentence', 'resentence', 'sentence', 'file', 'argue', 'apply', 'lack', 'grant', 'vacate', 'reverse', 'hold', 'raise', 'raise', 'hold', 'fail', 'raise', 'decide', 'convict', 'obtain', 'concern', 'apply', 'state', 'deny', 'obtain', 'argue', 'base', 'break', 'struggle', 'seize', 'flee', 'have', 'identify', 'apprehend', 'hold', 'question', 'place', 'limit', 'confess', 'lead', 'confess', 'question', 'admit', 'testify', 'regard', 'surround', 'rule', 'subject', 'convict', 'sentence', 'affirm', 'allow', 'construct', 'operate', 'hel

In [38]:
from collections import Counter

stat_verb_list = Counter(verb_list).most_common()
stat_verb_list[:25]

[('hold', 1652),
 ('file', 1160),
 ('find', 1099),
 ('argue', 1057),
 ('affirm', 1042),
 ('violate', 981),
 ('deny', 896),
 ('have', 839),
 ('reverse', 753),
 ('grant', 741),
 ('require', 692),
 ('sue', 642),
 ('dismiss', 630),
 ('allege', 611),
 ('appeal', 574),
 ('rule', 574),
 ('convict', 552),
 ('claim', 539),
 ('make', 520),
 ('seek', 505),
 ('apply', 504),
 ('use', 487),
 ('provide', 454),
 ('base', 422),
 ('allow', 407)]

In [39]:
verb_info_df = pd.DataFrame(verb_info_list)
verb_info_df

Unnamed: 0,lemma,dependency
0,make,ROOT
1,accuse,ROOT
2,involve,pcomp
3,implicate,ROOT
4,move,acl
...,...,...
56316,infringe,ccomp
56317,direct,ROOT
56318,track,xcomp
56319,use,advcl


In [40]:
verb_info_df['dependency'].value_counts()

dependency
ROOT         17421
conj          6936
advcl         6762
ccomp         6194
acl           5044
xcomp         3805
relcl         3607
pcomp         2566
amod          2308
prep           658
pobj           223
csubj          149
acomp          133
compound       125
nsubj          104
dobj            92
parataxis       48
nmod            28
dep             27
poss            19
oprd            13
nsubjpass       12
punct           10
appos            9
csubjpass        5
dative           5
attr             4
advmod           3
npadvmod         3
meta             3
cc               2
case             2
auxpass          1
Name: count, dtype: int64

In [41]:
verb_root_df = verb_info_df[verb_info_df['dependency'] == 'ROOT']
verb_root_df

Unnamed: 0,lemma,dependency
0,make,ROOT
1,accuse,ROOT
3,implicate,ROOT
5,sue,ROOT
6,reverse,ROOT
...,...,...
56306,maintain,ROOT
56309,use,ROOT
56315,find,ROOT
56317,direct,ROOT


In [42]:
verb_root_stat = verb_root_df['lemma'].value_counts()
print(verb_root_stat[:25])

lemma
file         797
affirm       767
hold         754
argue        555
find         525
sue          489
deny         487
appeal       439
reverse      422
grant        417
rule         331
convict      314
move         261
dismiss      252
seek         226
claim        206
agree        199
reject       191
challenge    186
require      170
have         167
charge       164
bring        164
allege       163
arrest       160
Name: count, dtype: int64
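
The dependency and ROOT counts above can be reproduced on a small scale without pandas. This is only a sketch: it uses the first five rows of `verb_info_df` shown earlier, and `dependency_counts` and `root_lemma_counts` are illustrative names that do not appear in the notebook.

```python
from collections import Counter

# First five rows of verb_info_df as displayed above.
verb_info_list = [
    {"lemma": "make", "dependency": "ROOT"},
    {"lemma": "accuse", "dependency": "ROOT"},
    {"lemma": "involve", "dependency": "pcomp"},
    {"lemma": "implicate", "dependency": "ROOT"},
    {"lemma": "move", "dependency": "acl"},
]

# Counterpart of verb_info_df['dependency'].value_counts()
dependency_counts = Counter(row["dependency"] for row in verb_info_list)

# Counterpart of filtering on ROOT and then counting lemmas
root_lemma_counts = Counter(
    row["lemma"] for row in verb_info_list if row["dependency"] == "ROOT"
)

print(dependency_counts.most_common())
print(root_lemma_counts.most_common())
```
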


## Party Encoding

In [43]:
import re
import pandas as pd
import string

In [44]:
train_df = pd.read_csv("./Data/train.csv")
test_df = pd.read_csv("./Data/test.csv")

In [45]:
train_df['first_party'] = train_df['first_party'].str.lower()
train_df['second_party'] = train_df['second_party'].str.lower()
train_df['facts'] = train_df['facts'].str.lower()

test_df['first_party'] = test_df['first_party'].str.lower()
test_df['second_party'] = test_df['second_party'].str.lower()
test_df['facts'] = test_df['facts'].str.lower()

In [46]:
def get_name_re(name, fact_token: spacy.tokens.doc.Doc, first=True):
    """Replace mentions of one party in a parsed fact with a placeholder token."""
    # Escape punctuation so it is always read literally inside the character class.
    punct = re.escape(string.punctuation)

    # Split the party name into lowercase words and drop fragments of one or two characters.
    name = re.sub(rf'[ {punct}]+', ' ', name.lower())
    name_list = [n for n in name.split() if len(n) > 2]

    # Pick up parenthesised aliases such as "name (alias)". Iterate over a copy so
    # that extending name_list does not change the sequence being looped over.
    for n in list(name_list):
        changed_name = re.findall(rf"{re.escape(n)} ?\([a-z]+\)", fact_token.text)
        if changed_name:
            name_list.extend([re.sub(rf'({re.escape(n)}|[ {punct}])', '', cn) for cn in changed_name])

    # Substitute the placeholder for every noun-tagged token that matches a name word.
    abbrev = "firstparty" if first else "secondparty"
    fact_subj = []
    for token in fact_token:
        if 'NN' in token.tag_ and token.text in name_list:
            fact_subj.append(abbrev)
        else:
            fact_subj.append(token.text)
    fact_subj = ' '.join(fact_subj)

    # Collapse runs of consecutive placeholders into a single one.
    fact_subj = re.sub(rf"({abbrev} ?)+", f'{abbrev} ', fact_subj)
    return fact_subj
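
The name-cleaning and alias steps at the top of `get_name_re` can be tried without spaCy. The sketch below is an illustration rather than the notebook's code: `party_name_words` and `find_aliases` are hypothetical helper names, and the sentence about a labor board is made up, not a row from `train.csv`.

```python
import re
import string

# Escaped so every punctuation character is literal inside the character class.
PUNCT = re.escape(string.punctuation)


def party_name_words(name):
    """Hypothetical helper: lowercase name words longer than two characters."""
    cleaned = re.sub(rf"[ {PUNCT}]+", " ", name.lower())
    return [word for word in cleaned.split() if len(word) > 2]


def find_aliases(name_words, text):
    """Hypothetical helper: parenthesised aliases that directly follow a name word."""
    aliases = []
    for word in name_words:
        for match in re.findall(rf"{re.escape(word)} ?\([a-z]+\)", text):
            # Strip the name word, spaces and punctuation, leaving only the alias.
            aliases.append(re.sub(rf"({re.escape(word)}|[ {PUNCT}])", "", match))
    return aliases


# Party names taken from the training rows shown in this notebook.
print(party_name_words("phil a. st. amant"))
print(party_name_words("tony patterson, warden, et al."))

# Made-up sentence, used only to exercise the alias pattern.
board = party_name_words("national labor relations board")
print(find_aliases(board, "the national labor relations board (nlrb) filed a complaint."))
```

Words of one or two characters, such as "a" and "st", are dropped by the length filter, so they are never candidates for replacement.
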

In [47]:
def replace_name(first_party, second_party, fact_token: spacy.tokens.doc.Doc):
    """Replace mentions of both parties in a parsed fact with placeholder tokens."""
    # Escape punctuation so it is always read literally inside the character class.
    punct = re.escape(string.punctuation)

    def name_words(party):
        # Lowercase words of the party name, dropping fragments of one or two characters.
        cleaned = re.sub(rf'[ {punct}]+', ' ', party.lower())
        words = [n for n in cleaned.split() if len(n) > 2]
        # Add parenthesised aliases such as "name (alias)" found in the fact text.
        aliases = []
        for n in words:
            for cn in re.findall(rf"{re.escape(n)} ?\([a-z]+\)", fact_token.text):
                aliases.append(re.sub(rf'({re.escape(n)}|[ {punct}])', '', cn))
        return words + aliases

    fp_name_list = name_words(first_party)
    sp_name_list = name_words(second_party)

    fact_subj = []
    for token in fact_token:
        # Only noun-tagged tokens are candidates; the first party takes precedence.
        if 'NN' in token.tag_ and token.text in fp_name_list:
            fact_subj.append('firstparty')
        elif 'NN' in token.tag_ and token.text in sp_name_list:
            fact_subj.append('secondparty')
        else:
            fact_subj.append(token.text)

    # Collapse runs of consecutive placeholders into a single one.
    fact_subj = ' '.join(fact_subj)
    fact_subj = re.sub(r"(firstparty ?)+", 'firstparty ', fact_subj)
    fact_subj = re.sub(r"(secondparty ?)+", 'secondparty ', fact_subj)
    return fact_subj
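
The substitute-then-collapse logic of `replace_name` can be checked on a plain token list. This is a simplified sketch under stated assumptions: `encode_tokens` is a hypothetical helper, the tokens are pre-split by hand, the `'NN'` tag check is omitted because no spaCy model is loaded, and the final `rstrip()` is an addition that the notebook does not perform.

```python
import re


def encode_tokens(tokens, first_names, second_names):
    """Hypothetical helper: swap party words for placeholders, then collapse repeats."""
    encoded = []
    for token in tokens:
        if token in first_names:
            encoded.append("firstparty")
        elif token in second_names:
            encoded.append("secondparty")
        else:
            encoded.append(token)
    text = " ".join(encoded)
    # Each substitution leaves exactly one space after the placeholder.
    text = re.sub(r"(firstparty ?)+", "firstparty ", text)
    text = re.sub(r"(secondparty ?)+", "secondparty ", text)
    # Not in the notebook: trim the space left when a placeholder ends the text.
    return text.rstrip()


# Hand-made token list using names from the first training row.
tokens = ["phil", "amant", "sued", "herman", "thompson"]
print(encode_tokens(tokens, {"phil", "amant"}, {"herman", "thompson"}))
```

Without the collapse step, a multi-word name would produce one placeholder per word, which would inflate placeholder counts in any later token statistics.
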

In [48]:
from pandarallel import pandarallel

pandarallel.initialize(progress_bar=True, nb_workers=2)

# train_df['facts_token'] = train_df['facts'].parallel_apply(nlp)
train_df['new_facts'] = train_df.parallel_apply(lambda x: replace_name(x['first_party'], x['second_party'], nlp(x['facts'])), axis=1)

INFO: Pandarallel will run on 2 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.



In [49]:
train_df.head()

Unnamed: 0,ID,first_party,second_party,facts,first_party_winner,new_facts
0,TRAIN_0000,phil a. st. amant,herman a. thompson,"on june 27, 1962, phil st. amant, a candidate ...",1,"on june 27 , 1962 , firstparty st . firstparty..."
1,TRAIN_0001,stephen duncan,lawrence owens,ramon nelson was riding his bike when he suffe...,0,ramon nelson was riding his bike when he suffe...
2,TRAIN_0002,billy joe magwood,"tony patterson, warden, et al.",an alabama state court convicted billy joe mag...,1,an alabama state court convicted billy firstpa...
3,TRAIN_0003,linkletter,walker,victor linkletter was convicted in state court...,0,victor firstparty was convicted in state court...
4,TRAIN_0004,william earl fikes,alabama,"on april 24, 1953 in selma, alabama, an intrud...",1,"on april 24 , 1953 in selma , secondparty , an..."


In [50]:
nlp = spacy.load('en_core_web_sm')

enc_verb_list = []
enc_verb_info_list = []

for fact in train_df['new_facts']:
    doc = nlp(fact.lower())
    for token in doc:
        if token.pos_ == 'VERB':
            enc_verb_list.append(token.lemma_)
            verb_info = {'lemma': token.lemma_, 'dependency': token.dep_}  # Save as a dictionary
            enc_verb_info_list.append(verb_info)

In [51]:
from collections import Counter

stat_enc_verb_list = Counter(enc_verb_list).most_common()
stat_enc_verb_list[:25]

[('hold', 1656),
 ('file', 1160),
 ('affirm', 1127),
 ('find', 1106),
 ('argue', 1057),
 ('violate', 981),
 ('deny', 896),
 ('have', 838),
 ('reverse', 758),
 ('grant', 741),
 ('require', 691),
 ('sue', 643),
 ('dismiss', 630),
 ('allege', 618),
 ('appeal', 579),
 ('rule', 574),
 ('convict', 553),
 ('claim', 535),
 ('make', 520),
 ('seek', 506),
 ('apply', 504),
 ('use', 486),
 ('provide', 454),
 ('base', 422),
 ('allow', 407)]
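
Comparing this list with the earlier top 25 by eye is error-prone, so a small diff helper may be useful. The sketch below copies only a handful of counts from the two lists; `count_changes` is a hypothetical helper, and in the notebook it could be fed `Counter(verb_list)` and `Counter(enc_verb_list)` instead.

```python
from collections import Counter

# A few counts copied from the two top-25 lists (before and after party encoding).
before = Counter({"hold": 1652, "affirm": 1042, "find": 1099, "claim": 539, "file": 1160})
after = Counter({"hold": 1656, "affirm": 1127, "find": 1106, "claim": 535, "file": 1160})


def count_changes(old, new):
    """Hypothetical helper: per-lemma change in count, largest absolute change first."""
    changes = {lemma: new[lemma] - old[lemma] for lemma in set(old) | set(new)}
    changed = [(lemma, delta) for lemma, delta in changes.items() if delta != 0]
    # Ties on absolute change are broken alphabetically to keep the order stable.
    return sorted(changed, key=lambda item: (-abs(item[1]), item[0]))


for lemma, delta in count_changes(before, after):
    print(f"{lemma:>8} {delta:+d}")
```

On these sample counts the largest shift is for `affirm`; the sketch only reports the differences and does not explain them.
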

In [52]:
enc_verb_info_df = pd.DataFrame(enc_verb_info_list)
enc_verb_info_df

Unnamed: 0,lemma,dependency
0,make,ROOT
1,accuse,ROOT
2,involve,pcomp
3,implicate,ROOT
4,move,acl
...,...,...
56557,direct,ROOT
56558,track,xcomp
56559,use,advcl
56560,affirm,pcomp


In [53]:
enc_verb_info_df['dependency'].value_counts()

dependency
ROOT         17598
conj          7028
advcl         6799
ccomp         6007
acl           5059
xcomp         3817
relcl         3557
pcomp         2744
amod          2374
prep           674
csubj          161
pobj           115
compound       107
acomp          104
nsubj           94
dobj            91
punct           43
dep             42
parataxis       35
nmod            30
nsubjpass       16
oprd            14
appos           11
poss             9
npadvmod         7
csubjpass        7
case             6
meta             4
dative           3
advmod           3
prt              1
auxpass          1
intj             1
Name: count, dtype: int64

In [54]:
enc_verb_root_df = enc_verb_info_df[enc_verb_info_df['dependency'] == 'ROOT']
enc_verb_root_df

Unnamed: 0,lemma,dependency
0,make,ROOT
1,accuse,ROOT
3,implicate,ROOT
5,sue,ROOT
8,show,ROOT
...,...,...
56546,maintain,ROOT
56549,use,ROOT
56555,find,ROOT
56557,direct,ROOT


In [55]:
print(train_df['facts'][0])
print(train_df['new_facts'][0])

on june 27, 1962, phil st. amant, a candidate for public office, made a television speech in baton rouge, louisiana.  during this speech, st. amant accused his political opponent of being a communist and of being involved in criminal activities with the head of the local teamsters union.  finally, st. amant implicated herman thompson, an east baton rouge deputy sheriff, in a scheme to move money between the teamsters union and st. amant’s political opponent. 
thompson successfully sued st. amant for defamation.  louisiana’s first circuit court of appeals reversed, holding that thompson did not show st. amant acted with “malice.”  thompson then appealed to the supreme court of louisiana.  that court held that, although public figures forfeit some of their first amendment protection from defamation, st. amant accused thompson of a crime with utter disregard of whether the remarks were true.  finally, that court held that the first amendment protects uninhibited, robust debate, rather tha

In [58]:
# Placeholder tokens whose sentences we want to collect
words_of_interest = ['firstparty', 'secondparty']

sentences_with_words = []

for fact in train_df['new_facts']:
    doc = nlp(fact.lower())
    for sent in doc.sents:  # spacy splits the document into sentences
        if any(word in sent.text for word in words_of_interest):
            sentences_with_words.append(sent.text)
            
sentences_with_words
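
The filter above uses substring matching, so any longer string that happens to contain a placeholder would also count as a mention. If that ever matters for this data, a whole-word match is one alternative. This is a sketch only: `woi_pattern` and `mentions_party` are hypothetical names, and both test sentences are made up.

```python
import re

words_of_interest = ["firstparty", "secondparty"]

# \b anchors each placeholder at word boundaries, so longer words that merely
# contain it are not counted as mentions.
woi_pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, words_of_interest)) + r")\b")


def mentions_party(sentence):
    """Hypothetical helper: True if the sentence contains a whole-word placeholder."""
    return woi_pattern.search(sentence) is not None


# Made-up sentences for illustration.
print(mentions_party("firstparty sued secondparty for defamation ."))
print(mentions_party("the court discussed the nonfirstpartyclaim ."))
```
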



In [61]:
woi_df = pd.DataFrame(sentences_with_words, columns=['sentences'])
woi_verb_list = []

# Collect verb lemmas only from sentences that mention a party placeholder
for sentence in woi_df['sentences']:
    doc = nlp(sentence.lower())
    for token in doc:
        if token.pos_ == 'VERB':
            woi_verb_list.append(token.lemma_)

In [62]:
stat_woi_verb_list = Counter(woi_verb_list).most_common()
stat_woi_verb_list[:25]

[('file', 895),
 ('hold', 820),
 ('argue', 740),
 ('violate', 651),
 ('find', 627),
 ('sue', 526),
 ('deny', 522),
 ('allege', 468),
 ('have', 456),
 ('convict', 436),
 ('affirm', 434),
 ('appeal', 418),
 ('claim', 393),
 ('grant', 390),
 ('seek', 361),
 ('require', 356),
 ('rule', 342),
 ('dismiss', 331),
 ('reverse', 319),
 ('make', 293),
 ('use', 285),
 ('base', 260),
 ('provide', 250),
 ('apply', 244),
 ('fail', 240)]