# Processing provided test data

This notebook processes the three provided test sets and converts them into a usable format.

Loading the three test sets

In [232]:
import pandas as pd
from itertools import chain
import re
import json
import os
from ast import literal_eval

In [233]:
test_df1 = pd.read_csv("/home/workboots/Datasets/LLPE/raw/statute_pred_45_cases_without_exp.csv")
test_df2 = pd.read_csv("/home/workboots/Datasets/LLPE/raw/statute_pred_100_cases_without_exp.csv")
test_df3 = pd.read_csv("/home/workboots/Datasets/LLPE/raw/statute_pred_100_cases_without_exp-gender_religion_bias.csv")

## Getting targets

Each datapoint in the provided test data (100 + 100 + 45 = 245 cases) has statutes associated with it. For 45 of the 245 cases the statutes are not named explicitly: they use AILA identifiers (e.g. 'S1' instead of 'Constitution_226'). This section combines the data from the three CSV files to get an overview of the targets a model has to be trained on.

### First DataFrame

The first DataFrame contains indirect statute labels (AILA identifiers such as 'S3') that need to be resolved to explicit statute names.

In [84]:
test_df1.head()

Unnamed: 0,Query ID,Query,Annoted_Query,Actual Statutes,Predicted Statutes
0,AILA_Q1,"AILA_Q1||The appellant on February 9, 1961 was...","AILA_Q1||$S52 The appellant on February 9, 196...",['S3'],"['S1', 'S2', 'S3', 'S5', 'S6', 'S9', 'S11', 'S..."
1,AILA_Q2,AILA_Q2||The appellant before us was examined ...,AILA_Q2||The appellant before us was examined ...,"['S127', 'S27']","[""S2"", ""S9"", ""S11"", ""S12"", ""S13"", ""S15"", ""S19""..."
2,AILA_Q5,AILA_Q5||This appeal is preferred against the ...,AILA_Q5||This appeal is preferred against the ...,"['S19', 'S6', 'S13', 'S24', 'S64']","['S2', 'S5', 'S6', 'S9', 'S11', 'S13', 'S15', ..."
3,AILA_Q6,"AILA_Q6||On 19.3.1999, SI P1 along Ct. P2 went...","AILA_Q6||On 19.3.1999, SI P1 along Ct. P2 went...","['S13', 'S2']","[""S2"", ""S5"", ""S6"", ""S9"", ""S11"", ""S12"", ""S13"", ..."
4,AILA_Q7,AILA_Q7||This criminal appeal is directed agai...,AILA_Q7||This criminal appeal is directed agai...,"['S27', 'S127']","[""S2"", ""S9"", ""S11"", ""S12"", ""S13"", ""S15"", ""S19""..."


Getting the statute values into a single set

In [179]:
test_df1_statutes = test_df1["Actual Statutes"].apply(literal_eval).values
test_df1_statutes = chain.from_iterable(test_df1_statutes)

In [180]:
test_df1_statutes = set(test_df1_statutes)

In [181]:
test_df1_statutes

{'S1',
 'S11',
 'S12',
 'S127',
 'S13',
 'S15',
 'S19',
 'S2',
 'S21',
 'S24',
 'S27',
 'S3',
 'S4',
 'S43',
 'S5',
 'S6',
 'S64',
 'S9'}

In [182]:
len(test_df1_statutes)

18
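The parse, flatten and deduplicate steps above are repeated for each of the three DataFrames, so they could be wrapped in a small helper. The following is a minimal sketch; the function name `unique_statutes` and the toy values are assumptions made for illustration and are not part of the original notebook.

```python
from ast import literal_eval
from itertools import chain


def unique_statutes(cells):
    """Return the distinct labels found in an iterable of stringified lists."""
    # Each cell holds text such as "['S13', 'S2']"; literal_eval turns it into a list.
    parsed = (literal_eval(cell) for cell in cells)
    # Flatten the per-row lists and deduplicate them in one step.
    return set(chain.from_iterable(parsed))


# Toy cells standing in for a real "Actual Statutes" column (illustrative values).
toy_cells = ["['S3']", "['S127', 'S27']", "['S27', 'S3']"]
toy_statutes = unique_statutes(toy_cells)
print(sorted(toy_statutes))
```

Because iterating over a pandas Series yields its values, the same helper should also accept a column directly, e.g. `unique_statutes(test_df1["Actual Statutes"])`.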

#### Attempting to map ALL AILA 2019 statutes automatically (FAILING)

Mapping all statutes of the AILA dataset to their actual statute names

In [56]:
with open("/home/workboots/Datasets/IndiaCode/new/CentralActs/act_chapter_section_info/186045.json", 'r') as f:
    ipc = json.load(f)

Going through the AILA 2019 statutes and mapping accordingly

In [62]:
statute_mapping = {}
statute_text = {}

In [58]:
statute_path = "/home/workboots/Datasets/LLPE/raw/Object_statutes/"

In [63]:
for flpath in os.listdir(statute_path):
    with open(os.path.join(statute_path, flpath), 'r') as f:
        statute_text[os.path.splitext(flpath)[0]] = f.readline()

In [66]:
statute_text = {
    k: v.replace("Title: ", "").replace("\n", "") for k, v in statute_text.items()}

In [78]:
statute_text

{'S20': 'Cheating and dishonestly inducing delivery of property',
 'S146': 'Report of police-officer',
 'S147': 'Rules of court, etc.',
 'S22': 'Dismissal, removal or reduction in rank of persons employed in civil capacities under the Union or a State',
 'S82': 'Powers to control production, supply, distribution, etc., of essential commodities',
 'S177': 'Public charities',
 'S170': 'Order for maintenance of wives, children and parents',
 'S23': 'Saving of inherent power of High Court',
 'S159': 'Application for setting aside arbitral award',
 'S171': 'Questions to be determined by the Court executing decree',
 'S55': 'Lists of common and special jurors',
 'S54': 'Punishment for criminal intimidation',
 'S44': 'Protection in respect of conviction for offences',
 'S123': 'Arbitration agreement or award to be contested by application',
 'S165': 'Dishonestly receiving stolen property',
 'S77': 'Restrictions as to imposition of tax on the sale or purchase of goods',
 'S95': 'Public servant

Finding the statute numbers from their titles

In [79]:
for statute, text in statute_text.items():
    for s_num, data in ipc.items():
        # Compare titles case-insensitively after dropping everything except letters and spaces
        if re.sub(r"[^A-Za-z ]", "", data["title"]).lower() == re.sub(r"[^A-Za-z ]", "", text).lower():
            statute_mapping[statute] = s_num
            break

In [80]:
len(statute_mapping)

48

In [81]:
statute_mapping

{'S20': '420',
 'S54': '506',
 'S165': '411',
 'S92': '364',
 'S48': '304B',
 'S19': '323',
 'S63': '306',
 'S127': '5',
 'S25': '498A',
 'S197': '396',
 'S28': '376',
 'S34': '326',
 'S131': '114',
 'S13': '307',
 'S129': '397',
 'S107': '143',
 'S117': '395',
 'S41': '471',
 'S118': '427',
 'S2': '302',
 'S189': '4',
 'S94': '193',
 'S49': '467',
 'S108': '366',
 'S168': '477A',
 'S112': '304A',
 'S110': '379',
 'S53': '406',
 'S12': '120B',
 'S119': '504',
 'S136': '392',
 'S155': '447',
 'S24': '324',
 'S21': '147',
 'S51': '304',
 'S15': '148',
 'S40': '468',
 'S80': '342',
 'S76': '341',
 'S141': '363',
 'S43': '300',
 'S152': '380',
 'S88': '452',
 'S86': '3',
 'S64': '325',
 'S11': '149',
 'S6': '34',
 'S125': '465'}
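The title matching above can also be written as a dictionary lookup, which avoids scanning the whole act once per statute. This is only a sketch under assumptions: the helper names are hypothetical, the toy act entry is an illustrative stand-in for the real JSON file, and whitespace is additionally collapsed, which the original comparison does not do.

```python
import re


def normalise_title(title):
    """Lower-case a title, keep only letters, and collapse runs of spaces."""
    letters_only = re.sub(r"[^A-Za-z ]", "", title)
    return " ".join(letters_only.lower().split())


def map_by_title(statute_titles, act_sections):
    """Map statute IDs to section numbers whose normalised titles match exactly."""
    # Index the act once; setdefault keeps the first section seen for a title,
    # which mirrors the `break` on the first match in the nested loop.
    title_to_section = {}
    for section, data in act_sections.items():
        title_to_section.setdefault(normalise_title(data["title"]), section)
    mapping = {}
    for statute, title in statute_titles.items():
        key = normalise_title(title)
        if key in title_to_section:
            mapping[statute] = title_to_section[key]
    return mapping


# Toy inputs (illustrative only, not the contents of the real files).
toy_titles = {
    "S20": "Cheating and dishonestly inducing delivery of property",
    "S146": "Report of police-officer",
}
toy_act = {"420": {"title": "Cheating and dishonestly inducing delivery of property."}}
toy_mapping = map_by_title(toy_titles, toy_act)
print(toy_mapping)
```

Statutes that belong to other acts (here the toy `S146`) simply stay unmapped, which is the same limitation that makes the automatic approach unsatisfactory.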

#### Manually annotating the relevant sections for the provided dataset

As automatic annotation of all the AILA 2019 statutes does not yield satisfactory results, it is preferable to manually annotate only the 18 statutes that appear in the first test set.

In [183]:
test_df1_statute_mapping = {
    'S1': 'Constitution_226',
    'S11': 'Indian Penal Code, 1860_149',
    'S12': 'Indian Penal Code, 1860_120B',
    'S127': 'Indian Penal Code, 1860_5',
    'S13': 'Indian Penal Code, 1860_307',
    'S15': 'Indian Penal Code, 1860_148',
    'S19': 'Indian Penal Code, 1860_321',
    'S2': 'Indian Penal Code, 1860_302',
    'S21': 'Indian Penal Code, 1860_147',
    'S24': 'Indian Penal Code, 1860_324',
    'S27': 'Special Courts Act, 1979_5',
    'S3': 'Constitution_14',
    'S4': 'Constitution_136',
    'S43': 'Indian Penal Code, 1860_300',
    'S5': 'Constitution_32',
    'S6': 'Indian Penal Code, 1860_34',
    'S64': 'Indian Penal Code, 1860_325',
    'S9': 'Constitution_21'
}

In [184]:
len(test_df1_statute_mapping)

18
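With the manual mapping in place, the indirect labels of the first DataFrame could be translated row by row. The notebook does not show this step, so the following is a hedged sketch: `resolve_labels` is a hypothetical helper, and the mapping below is a three-entry subset of the dictionary above.

```python
from ast import literal_eval


def resolve_labels(cell, mapping):
    """Translate a stringified list of AILA IDs into explicit statute names."""
    labels = literal_eval(cell)
    # A KeyError here points at an ID that was never annotated, so it is not caught.
    return [mapping[label] for label in labels]


# Subset of the manual mapping, enough for the two sample cells below.
toy_mapping = {
    "S13": "Indian Penal Code, 1860_307",
    "S2": "Indian Penal Code, 1860_302",
    "S3": "Constitution_14",
}
toy_cells = ["['S3']", "['S13', 'S2']"]
resolved = [resolve_labels(cell, toy_mapping) for cell in toy_cells]
print(resolved)
```

On the real data this would presumably be applied as `test_df1["Actual Statutes"].apply(lambda cell: resolve_labels(cell, test_df1_statute_mapping))`.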

### Second DataFrame

In [185]:
test_df2.head()

Unnamed: 0,index,Statement,Actual Statutes,Predicted Statutes
0,1953_L_8,0 [DATE] C.A. no.[CARDINAL] of [DATE].The [ORG...,"['Constitution_226', 'Constitution_136']","['Constitution_226', 'Constitution_136""]"
1,1954_A_6,[ORG] of [GPE] 9 [DATE] C.A. no.[CARDINAL] of ...,"['Constitution_226', 'Constitution_136', 'Cons...","['Constitution_226', 'Constitution_136""]"
2,1954_R_3,of India 4 [DATE] Civil Appeal no.[CARDINAL] o...,"['Constitution_226', 'Constitution_136']","['Constitution_226', 'Constitution_136""]"
3,1954_T_53,kur Raghuraj Singh and [ORG] [GPE] 19 [DATE] C...,"['Constitution_226', 'Constitution_136']","['Representation of the People Act, 1951_81""]"
4,1956_E_1,C.A. no.[CARDINAL] of 1954.The [ORG] was deliv...,"['Constitution_226', 'Constitution_136', 'Cons...","['Constitution_226', 'Constitution_136""]"


In [186]:
test_df2["Actual Statutes"].apply(literal_eval)

0                  [Constitution_226, Constitution_136]
1     [Constitution_226, Constitution_136, Constitut...
2                  [Constitution_226, Constitution_136]
3                  [Constitution_226, Constitution_136]
4     [Constitution_226, Constitution_136, Constitut...
                            ...                        
95                 [Constitution_226, Constitution_136]
96                 [Constitution_226, Constitution_136]
97                 [Constitution_226, Constitution_136]
98                 [Constitution_226, Constitution_136]
99    [Constitution_226, Constitution_136, Constitut...
Name: Actual Statutes, Length: 100, dtype: object

In [187]:
test_df2_statutes = test_df2["Actual Statutes"].apply(literal_eval).values
test_df2_statutes = chain.from_iterable(test_df2_statutes)

In [188]:
test_df2_statutes = set(test_df2_statutes)

In [189]:
test_df2_statutes

{'Code of Civil Procedure, 1882_151',
 'Code of Criminal Procedure, 1973_2',
 'Constitution_1',
 'Constitution_12',
 'Constitution_13',
 'Constitution_132',
 'Constitution_133',
 'Constitution_136',
 'Constitution_14',
 'Constitution_141',
 'Constitution_142',
 'Constitution_15',
 'Constitution_16',
 'Constitution_161',
 'Constitution_162',
 'Constitution_19',
 'Constitution_191',
 'Constitution_2',
 'Constitution_20',
 'Constitution_21',
 'Constitution_22',
 'Constitution_225',
 'Constitution_226',
 'Constitution_227',
 'Constitution_246',
 'Constitution_25',
 'Constitution_3',
 'Constitution_300',
 'Constitution_301',
 'Constitution_309',
 'Constitution_31',
 'Constitution_311',
 'Constitution_32',
 'Constitution_39',
 'Constitution_4',
 'Constitution_5',
 'Constitution_6',
 'Indian Penal Code, 1860_1',
 'Indian Penal Code, 1860_120',
 'Indian Penal Code, 1860_323',
 'Indian Penal Code, 1860_324',
 'Indian Penal Code, 1860_34',
 'Indian Penal Code, 1860_342',
 'Indian Penal Code, 186

### Third DataFrame

In [190]:
test_df3.head()

Unnamed: 0,Fact ID,Actual Statutes,Relevant statutes,Predicted Statutes
0,2017_Z_1,"['Constitution_226', 'Constitution_14', 'Const...","['Constitution_226', 'Constitution_14', 'Const...","['Constitution_226', 'Constitution_136', 'Ind..."
1,2016_C_27,"['Constitution_226', 'Constitution_227', 'Indi...","['Constitution_226', 'Constitution_227', 'Indi...","['Constitution_226', 'Constitution_136', 'Ind..."
2,2003_S_647,"['Constitution_226', 'Constitution_311', 'Indi...","['Constitution_226', 'Indian Penal Code, 1860_...","['Constitution_226', 'Constitution_136', 'Ind..."
3,2015_R_96,"['Constitution_226', 'Constitution_14', 'India...","['Constitution_226', 'Constitution_14', 'Const...","['Constitution_226', 'Constitution_136', 'Ind..."
4,2011_C_23,"['Constitution_226', 'Constitution_136', 'Indi...","['Constitution_226', 'Constitution_136', 'Indi...","['Constitution_226', 'Constitution_136', 'Ind..."


In [191]:
test_df3["Actual Statutes"].apply(literal_eval)

0     [Constitution_226, Constitution_14, Constituti...
1     [Constitution_226, Constitution_227, Indian Pe...
2     [Constitution_226, Constitution_311, Indian Pe...
3     [Constitution_226, Constitution_14, Indian Pen...
4     [Constitution_226, Constitution_136, Indian Pe...
                            ...                        
95    [Constitution_226, Constitution_227, Code of C...
96    [Constitution_226, Code of Civil Procedure, 18...
97    [Constitution_226, Constitution_227, Code of C...
98    [Constitution_226, Constitution_136, Code of C...
99    [Constitution_136, Indian Penal Code, 1860_376...
Name: Actual Statutes, Length: 100, dtype: object

In [192]:
test_df3_statutes = test_df3["Actual Statutes"].apply(literal_eval).values
test_df3_statutes = chain.from_iterable(test_df3_statutes)

In [193]:
test_df3_statutes = set(test_df3_statutes)

In [194]:
test_df3_statutes

{'Arms Act, 1959_25',
 'Arms Act, 1959_27',
 'Code of Civil Procedure, 1882_115',
 'Code of Civil Procedure, 1882_151',
 'Code of Criminal Procedure, 1973_161',
 'Code of Criminal Procedure, 1973_2',
 'Code of Criminal Procedure, 1973_313',
 'Code of Criminal Procedure, 1973_482',
 'Constitution_1',
 'Constitution_12',
 'Constitution_13',
 'Constitution_133',
 'Constitution_136',
 'Constitution_14',
 'Constitution_141',
 'Constitution_142',
 'Constitution_15',
 'Constitution_16',
 'Constitution_161',
 'Constitution_19',
 'Constitution_191',
 'Constitution_2',
 'Constitution_20',
 'Constitution_21',
 'Constitution_22',
 'Constitution_225',
 'Constitution_226',
 'Constitution_227',
 'Constitution_246',
 'Constitution_25',
 'Constitution_3',
 'Constitution_300',
 'Constitution_309',
 'Constitution_311',
 'Constitution_32',
 'Constitution_39',
 'Constitution_4',
 'Constitution_5',
 'Constitution_6',
 'Indian Penal Code, 1860_1',
 'Indian Penal Code, 1860_109',
 'Indian Penal Code, 1860_120

### Overall Targets

Combining the set of targets found in all three test sets

In [222]:
overall = set(test_df1_statute_mapping.values())

In [223]:
overall.update(test_df2_statutes, test_df3_statutes)

In [224]:
len(overall)

81

Therefore, the three test sets have 81 statute targets.

The section 'Indian Penal Code, 1860_120B' is renamed to 'Indian Penal Code, 1860_120'. Since the latter already appears among the targets of the second test set, the total drops by one.

In [225]:
overall.remove("Indian Penal Code, 1860_120B")
overall.update({"Indian Penal Code, 1860_120"})

In [226]:
len(overall)

80

Thus, the three test sets contain a total of 80 distinct statutes, which are the targets that statute prediction models are to be trained on.
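If more renames of this kind were needed, they could be collected in an alias table and applied in one pass. A minimal sketch, assuming a hypothetical helper `apply_aliases` and a toy target set:

```python
def apply_aliases(labels, aliases):
    """Return the label set with every aliased label replaced by its canonical name."""
    return {aliases.get(label, label) for label in labels}


# Alias table holding the single rename described above.
aliases = {"Indian Penal Code, 1860_120B": "Indian Penal Code, 1860_120"}

# Toy target set in which both spellings occur.
toy_targets = {
    "Indian Penal Code, 1860_120B",
    "Indian Penal Code, 1860_120",
    "Constitution_226",
}
merged = apply_aliases(toy_targets, aliases)
print(len(toy_targets), len(merged))
```

The set shrinks by one element because both spellings collapse into the same target, which matches the drop from 81 to 80 seen above.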

### Verification of the existence of targets provided in Supreme Court Data

In [228]:
with open("/home/workboots/Datasets/SC_50k/common/act_chapter_section_info/section_case_num.json", 'r') as f:
    sc_section_info = json.load(f)

In [229]:
statute_existence = {}
for statute in overall:
    statute_existence[statute] = sc_section_info.get(statute, -1)

In [230]:
not_found = [k for k, v in statute_existence.items() if v == -1]

In [231]:
not_found

[]

Thus, all 80 statutes exist in the Supreme Court dataset.
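The same check can be expressed with a membership test instead of a sentinel value, which stays correct even if a stored value happened to equal the sentinel. This is a sketch only: `missing_targets` is a hypothetical name, and the toy index with its values is an assumption, since the contents of the real JSON file are not shown here.

```python
def missing_targets(targets, section_info):
    """Return the targets without an entry in the section index, sorted for stable output."""
    return sorted(target for target in targets if target not in section_info)


# Toy index keyed by statute name; the values are placeholders.
toy_index = {"Constitution_226": 10, "Constitution_14": -1}
toy_targets = {"Constitution_226", "Constitution_14", "Arms Act, 1959_25"}
missing = missing_targets(toy_targets, toy_index)
print(missing)
```

An empty list corresponds to the `[]` output above, meaning every target has an entry.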

## Mapping unknown queries to queries in the SC dataset to avoid leakage

The queries in the first DataFrame do not name the cases they come from, but they are Supreme Court cases. Taking training cases from the Supreme Court dataset could therefore result in data leakage. To avoid this, the queries in the AILA 2019 dataset need to be mapped to datapoints in the Supreme Court data (as far as possible).

On inspection, the AILA 2019 queries do exist in the dataset, but the versions in the dataset are aggressively masked, whereas the test queries are not. Given this difference, there is no need to check for data leakage.
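Should an automated mapping be wanted instead of manual inspection, one option is to compare word sets while ignoring mask tokens. The sketch below is not the method used in this notebook: the mask pattern (bracketed upper-case tags such as `[DATE]`), the helper names, the document IDs and the toy texts are all assumptions.

```python
import re

# Assumed mask format: bracketed upper-case tags, e.g. [DATE] or [ORG].
MASK = re.compile(r"\[[A-Z_]+\]")


def tokens(text):
    """Return the lower-cased word set of a text with mask tokens removed."""
    return set(re.findall(r"[a-z]+", MASK.sub(" ", text).lower()))


def jaccard(a, b):
    """Jaccard similarity between the word sets of two texts."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def best_match(query, documents):
    """Return (doc_id, score) for the most similar entry in {doc_id: text}."""
    scored = ((doc_id, jaccard(query, text)) for doc_id, text in documents.items())
    return max(scored, key=lambda pair: pair[1])


# Toy data: an unmasked query and two masked candidate documents.
toy_query = "The appellant was appointed as an officer in February 1961"
toy_docs = {
    "doc_a": "The appellant was appointed as an officer in [DATE]",
    "doc_b": "This [ORG] appeal is directed against the judgment",
}
best_id, best_score = best_match(toy_query, toy_docs)
print(best_id, round(best_score, 2))
```

A query whose best score exceeds a chosen threshold could then be treated as a likely duplicate and excluded from training; the threshold would have to be tuned on the real data.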