## Act and Sections Exploration
This notebook explores the `acts_sections.csv` data file and combines it with the lookup tables `keys/section_key.csv` and `keys/act_key.csv`.

In [2]:
import pandas as pd

In [2]:
pd.set_option('display.max_rows', 50)

In [None]:
pd.reset_option('all')

### Setup
We set up the `pd.DataFrame` objects and other global variables.

Because the data file is large, we read it in chunks of `CHUNK_SIZE` rows, process each chunk independently, and then combine the results.

In [3]:
CHUNK_SIZE = 1_000_000

In [4]:
case_law_df = pd.read_csv('../data/acts_sections.csv',
                          chunksize=CHUNK_SIZE,
                          iterator=True,
                          low_memory=False)
sections_df = pd.read_csv('../data/keys/section_key.csv')
acts_df = pd.read_csv('../data/keys/act_key.csv')
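
The chunk-then-combine pattern described above can be sketched on a toy file. This is a minimal illustration, not the notebook's own aggregation: the `case_id` column and all values are made up, and only the `act` and `section` column names come from the merges later in this notebook.

```python
import io
import pandas as pd

# Toy stand-in for acts_sections.csv (hypothetical values; `case_id` is assumed).
toy_csv = io.StringIO(
    "case_id,act,section\n"
    "a,1,10\n"
    "b,1,11\n"
    "c,2,10\n"
    "d,1,10\n"
    "e,3,12\n"
)

# Process each chunk on its own, keeping only a small partial result.
partials = []
for part in pd.read_csv(toy_csv, chunksize=2):
    partials.append(part["act"].value_counts())

# Combine the partial counts: the same act can appear in several chunks,
# so sum the entries that share an index label.
act_totals = pd.concat(partials).groupby(level=0).sum()
print(act_totals)
```

Only the per-chunk counts are held in memory, so the same approach scales to files that do not fit in RAM.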

### Exploration and Initial Analysis

In [36]:
sections_df.sort_values(by='count', ascending=False)

Unnamed: 0,section_s,count,section
0,,4834909,
213800,138,3939583,213800.0
1182576,506,2315311,1182576.0
647330,279,1527325,647330.0
965693,379,1485923,965693.0
...,...,...,...
847412,"324r/w34IPCalteredintosec.302,201r/w34IPCandal...",1,847412.0
847411,324r/w34IPCSec.3-1x,1,847411.0
847410,324r/w34IPCAND3-I-X,1,847410.0
847409,324r/w34IPC3ixscst,1,847409.0


In [82]:
acts_df2 = acts_df.sort_values(by='count', ascending=False)

In [104]:
# total count across all acts, used as the denominator for the percentages
total_act_count = acts_df2['count'].sum()  # 76765611.0

In [84]:
acts_df2['percentages'] = acts_df2['count'].div(total_act_count).mul(100)

In [85]:
acts_df2.head()

Unnamed: 0,act_s,count,act,percentages
17353,The Indian Penal Code,20900000.0,17353.0,27.225733
4759,Code of Criminal Procedure,8630668.0,4759.0,11.242883
10581,Motor Vehicles Act,3124278.0,10581.0,4.069893
4747,Code of Civil Procedure,2679956.0,4747.0,3.491089
4650,Civil Procedure Code,1746442.0,4650.0,2.275032
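
As a quick check on the percentages, the five rows displayed above can be recomputed by hand and accumulated. This sketch copies the counts and the total as they are displayed in this notebook (the displayed values may be rounded) and adds a `cumulative_pct` column, which is an assumption of this example and not part of `acts_df2`.

```python
import pandas as pd

total_act_count = 76765611.0  # total noted in the cell above

# The five rows shown in the output above, copied as displayed.
top5 = pd.DataFrame({
    "act_s": [
        "The Indian Penal Code",
        "Code of Criminal Procedure",
        "Motor Vehicles Act",
        "Code of Civil Procedure",
        "Civil Procedure Code",
    ],
    "count": [20900000.0, 8630668.0, 3124278.0, 2679956.0, 1746442.0],
})

# Same formula as the notebook: share of the overall total, in percent.
top5["percentages"] = top5["count"].div(total_act_count).mul(100)

# Running share covered by the acts listed so far (hypothetical extra column).
top5["cumulative_pct"] = top5["percentages"].cumsum()
print(top5.round(2))
```

On these figures the five most frequent act names together cover about 48% of all act references.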


### Mutate DataFrame
Here, I merge each chunk of `case_law_df` with `sections_df` and `acts_df` to replace the integer `act` and `section` columns with their string counterparts.

I believe this makes the data easier to read and understand. To keep the count values from both key tables, I first rename each `count` column so the two do not collide after the merge.
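
A minimal sketch of the idea on toy frames, assuming the key tables have the columns shown earlier (`act_s`, `count`, `act` and `section_s`, `count`, `section`). The `case_id` column and every value here are hypothetical.

```python
import pandas as pd

# Toy stand-ins for one chunk of the case data and the two key tables.
cases = pd.DataFrame({
    "case_id": ["a", "b", "c"],
    "act": [1, 2, 1],
    "section": [10, 11, 10],
})
acts = pd.DataFrame({
    "act_s": ["The Indian Penal Code", "Motor Vehicles Act"],
    "count": [2, 1],
    "act": [1, 2],
})
sections = pd.DataFrame({
    "section_s": ["302", "279"],
    "count": [2, 1],
    "section": [10, 11],
})

# Give each `count` column a distinct name so both survive the merges.
acts = acts.rename(columns={"count": "act_count"})
sections = sections.rename(columns={"count": "section_count"})

# Swap the integer keys for their string counterparts.
readable = (
    cases.merge(acts, how="inner", on="act").drop(columns="act")
         .merge(sections, how="inner", on="section").drop(columns="section")
)
print(readable)
```

Without the renames, pandas would disambiguate the two `count` columns with `_x` and `_y` suffixes, which is harder to read later.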

In [101]:
# rename by column name rather than by position
acts_df = acts_df.rename(columns={'count': 'act_count'})

In [102]:
# rename by column name rather than by position
sections_df = sections_df.rename(columns={'count': 'section_count'})
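
The merges in this section use `how='inner'`, which silently drops any row whose key is missing from the key table. One way to see how many rows that would affect is a left merge with `indicator=True`. This is a sketch on made-up data, with a hypothetical `case_id` column and an `act` value of 99 that has no entry in the key table.

```python
import pandas as pd

cases = pd.DataFrame({"case_id": ["a", "b", "c"], "act": [1, 2, 99]})
acts = pd.DataFrame({
    "act": [1, 2],
    "act_s": ["The Indian Penal Code", "Motor Vehicles Act"],
})

# A left merge keeps every case row; `_merge` records whether a key matched.
checked = cases.merge(acts, how="left", on="act", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]

# The inner merge keeps only the rows that found a match.
kept = cases.merge(acts, how="inner", on="act")
print(f"kept {len(kept)} of {len(cases)} rows; unmatched keys: {unmatched['act'].tolist()}")
```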

In [103]:
chunk = 1
for df in case_law_df:
    # merge with acts
    df_acts = pd.merge(df, acts_df, how='inner', on=['act'])
    df_acts.drop(columns=['act'], inplace=True)

    # merge with sections
    df_acts_sections = pd.merge(df_acts, sections_df, how='inner', on=['section'])
    df_acts_sections.drop(columns=['section'], inplace=True)

    # write df_acts_sections to a data file
    df_acts_sections.to_csv('../data/_baked/case_law.csv',
                            header=(chunk == 1),
                            mode='a',
                            index=False)

    print(f'written_chunk: {chunk}')
    chunk += 1

# `chunk` was incremented after the last write, so subtract one
print(f'total_chunks: {chunk - 1}')

written_chunk: 1
written_chunk: 2
written_chunk: 3
written_chunk: 4
written_chunk: 5
written_chunk: 6
written_chunk: 7
written_chunk: 8
written_chunk: 9
written_chunk: 10
written_chunk: 11
written_chunk: 12
written_chunk: 13
written_chunk: 14
written_chunk: 15
written_chunk: 16
written_chunk: 17
written_chunk: 18
written_chunk: 19
written_chunk: 20
written_chunk: 21
written_chunk: 22
written_chunk: 23
written_chunk: 24
written_chunk: 25
written_chunk: 26
written_chunk: 27
written_chunk: 28
written_chunk: 29
written_chunk: 30
written_chunk: 31
written_chunk: 32
written_chunk: 33
written_chunk: 34
written_chunk: 35
written_chunk: 36
written_chunk: 37
written_chunk: 38
written_chunk: 39
written_chunk: 40
written_chunk: 41
written_chunk: 42
written_chunk: 43
written_chunk: 44
written_chunk: 45
written_chunk: 46
written_chunk: 47
written_chunk: 48
written_chunk: 49
written_chunk: 50
written_chunk: 51
written_chunk: 52
written_chunk: 53
written_chunk: 54
written_chunk: 55
written_chunk: 56
...
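
Because the loop above opens the output file in append mode, re-running the cell adds the same rows to the file a second time. A sketch of one way to guard against that, which also counts chunks with `enumerate`, is shown here on a toy file; the temporary output path and the column names are assumptions of this example.

```python
import io
import os
import tempfile
import pandas as pd

toy_csv = io.StringIO("case_id,act\na,1\nb,2\nc,1\nd,3\ne,2\n")
out_path = os.path.join(tempfile.mkdtemp(), "case_law_toy.csv")

# Start from a clean file so a re-run does not append duplicate rows.
if os.path.exists(out_path):
    os.remove(out_path)

n_chunks = 0
for n_chunks, part in enumerate(pd.read_csv(toy_csv, chunksize=2), start=1):
    # Write the header with the first chunk only, then append.
    part.to_csv(out_path, header=(n_chunks == 1), mode="a", index=False)

written = pd.read_csv(out_path)
print(f"total_chunks: {n_chunks}, rows written: {len(written)}")
```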

### Type, Disposition, Purpose Exploration

In [3]:
types_df = pd.read_csv('../data/keys/type_name_key.csv')
purposes_df = pd.read_csv('../data/keys/purpose_name_key.csv')
dispositions_df = pd.read_csv('../data/keys/disp_name_key.csv')

In [21]:
types_df.shape

(62714, 4)

In [8]:
types_df.groupby(by='type_name_s').sum().sort_values('count', ascending=False).head(50)

type_name_s,year,type_name,count
cc,18126,8087.0,6120886
cri. case,18126,16135.0,3150883
c.c.,18126,6879.0,3105428
s.c.c.,18126,50170.0,3042465
os,18126,42643.0,1956871
st,18126,56008.0,1739822
rct,18126,46788.0,1615621
criminal case,18126,17306.0,1557543
crl.misc.,18126,18978.0,1254416
complaint cases,18126,13313.0,1140676
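
In the output above, `year` is 18126 in every row, which equals 2010 + 2011 + ... + 2018. It appears that `sum()` was applied to every numeric column, so the `year` and `type_name` totals carry no meaning. A sketch that sums only `count`, using made-up rows with the same column names as `types_df`:

```python
import pandas as pd

# Hypothetical rows; the column names follow types_df.
types = pd.DataFrame({
    "year": [2017, 2018, 2017, 2018],
    "type_name": [11.0, 12.0, 21.0, 22.0],
    "type_name_s": ["cc", "cc", "os", "os"],
    "count": [40, 60, 25, 5],
})

# Select `count` before aggregating so ids and years are not summed.
totals = (
    types.groupby("type_name_s")["count"]
         .sum()
         .sort_values(ascending=False)
)
print(totals)
```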


In [20]:
types_df[types_df['type_name_s'].str.contains('motor', na=False)].sort_values('count', ascending=False)

Unnamed: 0,year,type_name,type_name_s,count
59962,2018,4782.0,motor vehc act,136236
52450,2017,4826.0,motor vehicle act,91298
59964,2018,4784.0,motor vehicle act,72492
52447,2017,4823.0,motor vehc act,62878
44991,2016,4750.0,motor vehicle act,48698
...,...,...,...,...
9471,2011,4020.0,motor vehicles act,1
16072,2012,4392.0,motor accident claim cases,1
16074,2012,4394.0,motor accident claim tribunal old,1
23078,2013,4580.0,motor accidents,1
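
The search above surfaces several spellings of what looks like the same case type (`motor vehc act`, `motor vehicle act`, `motor vehicles act`). A sketch of folding such variants into one label with a regular expression follows. The pattern, the `type_group` column, and the counts are assumptions of this example, and whether these labels really denote the same case type would need checking against the data.

```python
import pandas as pd

# Hypothetical counts; the labels are spellings seen in the output above.
types = pd.DataFrame({
    "type_name_s": [
        "motor vehc act",
        "motor vehicle act",
        "motor vehicles act",
        "motor accidents",
    ],
    "count": [5, 3, 1, 1],
})

# Match "vehc", "vehicle", or "vehicles" between "motor" and "act".
pattern = r"^motor\s+veh(?:c|icles?)\s+act$"
types["type_group"] = types["type_name_s"].str.replace(
    pattern, "motor vehicles act", regex=True
)

grouped = types.groupby("type_group")["count"].sum()
print(grouped)
```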


In [22]:
purposes_df.groupby(by='purpose_name_s').sum().sort_values('count', ascending=False).head(50)

purpose_name_s,year,purpose_name,count
evidence,18126,24822.0,9152327
hearing,18126,37970.0,6513926
appearance,18126,4043.0,6395119
order,18126,48023.0,5789525
judgement,18126,41925.0,4430629
argument,18126,6932.0,4361497
summons,18126,63652.0,3698835
issue,18126,41650.0,2446104
appereance,18126,6394.0,2196235
misc,18126,43622.0,1609892
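
The output above lists both `appearance` and the misspelling `appereance` as separate purposes. A small sketch of merging known misspellings with an explicit mapping before aggregating, using three counts copied from the table above (the `corrections` dict is an assumption of this example):

```python
import pandas as pd

# Three rows copied from the grouped output above.
purposes = pd.DataFrame({
    "purpose_name_s": ["evidence", "appearance", "appereance"],
    "count": [9152327, 6395119, 2196235],
})

# Hypothetical correction table: misspelling -> canonical label.
corrections = {"appereance": "appearance"}
purposes["purpose_clean"] = purposes["purpose_name_s"].replace(corrections)

merged = (
    purposes.groupby("purpose_clean")["count"]
            .sum()
            .sort_values(ascending=False)
)
print(merged)
```

An explicit mapping is slower to build than a pattern, but it makes every correction visible and easy to review.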


In [25]:
dispositions_df.groupby(by='disp_name_s').sum().sort_values('count', ascending=False).head(100)

disp_name_s,year,disp_name,count
disposition var missing,18126,237,21453111
dismissed,18126,201,6837637
allowed,18126,45,6439842
disposed,18126,219,5331080
acquitted,18126,36,4219597
transferred,18126,435,3634252
referred to lok adalat,18126,381,3445350
convicted,18126,174,3041181
fine,18126,264,3012723
judgement,18126,273,2964114


In [24]:
dispositions_df[dispositions_df['disp_name_s'].str.contains('bail', na=False)]

Unnamed: 0,year,disp_name,disp_name_s,count
7,2010,8,bail granted,553
8,2010,9,bail refused,110
9,2010,10,bail rejected,6
58,2011,8,bail granted,597
59,2011,9,bail refused,216
60,2011,10,bail rejected,7
109,2012,8,bail granted,574
110,2012,9,bail refused,56
111,2012,10,bail rejected,348
160,2013,8,bail granted,2051
