## Cases Exploration
This notebook explores data in court cases with files like `cases/cases_[year].csv`.

In [1]:
import pandas as pd

In [184]:
pd.set_option('display.max_rows', 200)

In [None]:
pd.reset_option('all')

In [3]:
# next, we import helper tables
types_df = pd.read_csv('../data/keys/type_name_key.csv')
purposes_df = pd.read_csv('../data/keys/purpose_name_key.csv')
dispositions_df = pd.read_csv('../data/keys/disp_name_key.csv')

### Exploration
First I explore types, purposes and dispositions for valuable and interesting data.

In [109]:
types_df.sort_values(by='count', ascending=False)

Unnamed: 0,year,type_name,type_name_s,count
56157,2018,977.0,cc,900758
48605,2017,981.0,cc,899445
41170,2016,929.0,cc,895703
33719,2015,956.0,cc,787564
26503,2014,915.0,cc,766294
...,...,...,...,...
30102,2014,4514.0,misscellaneous,1
18311,2012,6631.0,u.p.u.b. misc,1
56551,2018,1371.0,cm 41,1
56550,2018,1370.0,cm 38,1


In [188]:
domestic_filt = types_df['type_name_s'].str.contains('domestic violence', na=False)
types_df[domestic_filt].head()

Unnamed: 0,year,type_name,type_name_s,count
1744,2010,1745.0,domestic violence,81
1745,2010,1746.0,domestic violence ac,12
1746,2010,1747.0,domestic violence act 2005,850
1747,2010,1748.0,domestic violence act.,122
1748,2010,1749.0,domestic violence cases,38


In [38]:
purposes_df.sort_values(by='count', ascending=False)

Unnamed: 0,year,purpose_name,purpose_name_s,count
64405,2018,4551.0,hearing,1479733
54274,2017,3206.0,evidence,1288248
46164,2016,2813.0,evidence,1287232
60304,2018,450.0,appearance,1274278
55957,2017,4889.0,hearing,1270379
...,...,...,...,...
49664,2016,6313.0,report of mediation,1
24751,2013,5377.0,points for determination,1
7736,2011,2470.0,ex- party hearing,1
49660,2016,6309.0,report of i/o in final form,1


In [110]:
dispositions_df.sort_values(by='count', ascending=False)

Unnamed: 0,year,disp_name,disp_name_s,count
436,2018,27,disposition var missing,6338472
384,2017,27,disposition var missing,4521012
332,2016,27,disposition var missing,3260488
280,2015,26,disposition var missing,2259042
229,2014,26,disposition var missing,1607071
...,...,...,...,...
418,2018,9,bail order,30
60,2011,10,bail rejected,7
9,2010,10,bail rejected,6
366,2017,9,bail order,6


In [128]:
# notice, most dispositions are missing, we drop these values
filt = ~dispositions_df['disp_name_s'].str.contains('var missing')
dispositions_df = dispositions_df[filt]

plead_guilty_filt = dispositions_df['disp_name_s'] == 'plead guilty'
dispositions_df[plead_guilty_filt].sort_values(by='count', ascending=False)

Unnamed: 0,year,disp_name,disp_name_s,count
343,2016,38,plead guilty,283364
291,2015,37,plead guilty,274034
395,2017,38,plead guilty,265563
447,2018,38,plead guilty,233732
240,2014,37,plead guilty,211634
189,2013,37,plead guilty,110164
138,2012,37,plead guilty,52895
87,2011,37,plead guilty,33190
36,2010,37,plead guilty,24872


### Initials
There is a `judge_position` column in `cases_[year].csv`, but which judge does it refer to? This is not well explained, so I will drop the column. `judge_position` is already present in `judges_clean.csv` and does not need to be in the cases CSV.

#### Analysis Ideas
* <s>Maybe we can identify patterns in judges that fill in for other judges.</s>
* Finding which states have a greater density of criminal cases vs non-criminal cases. #
* Distribution of cases that haven't reached a decision over the years. #
* Which months generally see more crime (the distribution of crime across the year) #
* Pleading guilty. (disp_name == 38 || disp_name == 37) #
* Domestic Violence Cases.

### Analysis No Verdict Cases
We analyse cases that have not yet reached a decision, i.e. cases with no `date_of_decision`.

### Result Store
Distribution of no verdict cases
```
{'2010': 570668,
 '2011': 676752,
 '2012': 885657,
 '2013': 1225055,
 '2014': 1598947,
 '2015': 2248229,
 '2016': 3259272,
 '2017': 4520630,
 '2018': 6327528}
```

Notice that the number of cases without a decision date grows with each year's file, from 570,668 for 2010 to 6,327,528 for 2018.

In [57]:
CHUNK_SIZE = 1_000_000
years = ['2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018']

In [69]:
%%time
cases_noverdict = {year: 0 for year in years}  # year[str]: number of cases with no date_of_decision
for year in years:
    cases_df = pd.read_csv(f'../data/cases/cases_{year}.csv',
                           chunksize=CHUNK_SIZE,
                           iterator=True,
                           low_memory=False)

    for df in cases_df:
        cases_noverdict[year] += df.shape[0] - df['date_of_decision'].count()

CPU times: total: 4min 13s
Wall time: 4min 14s


In [71]:
cases_noverdict

{'2010': 570668,
 '2011': 676752,
 '2012': 885657,
 '2013': 1225055,
 '2014': 1598947,
 '2015': 2248229,
 '2016': 3259272,
 '2017': 4520630,
 '2018': 6327528}

### Analysis States with the highest criminal activity
Finding the states with the highest number of criminal cases lets us roughly compare the safety of states.

We group by `state_code` and get the count on the aggregate. To do this, we first merge the cases csv with the act_sections csv.
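A minimal sketch of the merge-then-aggregate step on toy data. The column names `ddl_case_id`, `state_code` and `criminal` follow the cells in this notebook; the values are invented for illustration, and the single named aggregation is just one possible way to get both counts in one pass.

```python
import pandas as pd

# Toy stand-ins for one chunk of cases and one chunk of acts_sections
# (values are invented for illustration).
cases = pd.DataFrame({
    'ddl_case_id': ['a', 'b', 'c', 'd'],
    'state_code': [1, 1, 2, 2],
})
acts_sections = pd.DataFrame({
    'ddl_case_id': ['a', 'b', 'c'],
    'criminal': [1, 0, 1],
})

# An inner merge keeps only cases that have a row in acts_sections.
merged = pd.merge(cases, acts_sections, how='inner', on='ddl_case_id')

# One groupby gives both the total and the criminal count per state.
per_state = merged.groupby('state_code').agg(
    total=('criminal', 'count'),
    criminal=('criminal', 'sum'),
)
print(per_state)
```

Case `d` has no row in the toy `acts_sections`, so it drops out of the merge and state 2 ends up with a single case.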

Note: Although it is good practice to clean the data first (dropping missing values and so on), I had already imported this data into Kaggle for a quick overview with `value_counts` and confirmed there were no missing values before applying pandas functions.

#### Results
Distribution of total cases and criminal cases across states:
```
{1: {'total': 9279577, 'criminal': 3079828.0},
 2: {'total': 1971164, 'criminal': 694513.0},
 3: {'total': 6396418, 'criminal': 1698992.0},
 4: {'total': 4447322, 'criminal': 1695785.0},
 5: {'total': 696932, 'criminal': 214842.0},
 6: {'total': 998950, 'criminal': 508030.0},
 7: {'total': 924529, 'criminal': 683236.0},
 8: {'total': 2745276, 'criminal': 2135724.0},
 9: {'total': 3269526, 'criminal': 1364226.0},
 10: {'total': 2770935, 'criminal': 805615.0},
 11: {'total': 1291388, 'criminal': 529585.0},
 13: {'total': 9557727, 'criminal': 3685990.0},
 12: {'total': 255890, 'criminal': 28259.0},
 14: {'total': 2919978, 'criminal': 664429.0},
 15: {'total': 654724, 'criminal': 87924.0},
 16: {'total': 1977734, 'criminal': 896541.0},
 17: {'total': 4853837, 'criminal': 1468420.0},
 18: {'total': 1025179, 'criminal': 394552.0},
 20: {'total': 79503, 'criminal': 41456.0},
 21: {'total': 29429, 'criminal': 6403.0},
 22: {'total': 2525279, 'criminal': 715912.0},
 23: {'total': 4113830, 'criminal': 1905263.0},
 19: {'total': 6420, 'criminal': 1782.0},
 26: {'total': 1479683, 'criminal': 432133.0},
 27: {'total': 181632, 'criminal': 28857.0},
 29: {'total': 1436693, 'criminal': 664579.0},
 30: {'total': 155042, 'criminal': 36031.0},
 32: {'total': 7755, 'criminal': 4155.0},
 25: {'total': 76671, 'criminal': 34014.0},
 31: {'total': 10789, 'criminal': 3194.0},
 24: {'total': 24536, 'criminal': 11714.0},
 33: {'total': 842, 'criminal': 135.0}}
 ```

In [88]:
%%time
# :warning: this cell takes 3h 29min 19s to run
statewise_criminal_cases = {}  # state_code: { total: [int], criminal: [int] }
for year in years:
    cases_df = pd.read_csv(f'../data/cases/cases_{year}.csv',
                           chunksize=CHUNK_SIZE,
                           iterator=True,
                           low_memory=False)

    for df in cases_df:
        case_law_df = pd.read_csv('../data/acts_sections.csv',
                                  chunksize=CHUNK_SIZE,
                                  iterator=True,
                                  low_memory=False)

        for part_df in case_law_df:
            merge_df = pd.merge(df, part_df, how='inner', on=['ddl_case_id'])

            for state_code, row in merge_df.groupby(by='state_code').sum(numeric_only=True).iterrows():
                criminal_cases = row['criminal']

                if state_code not in statewise_criminal_cases:
                    statewise_criminal_cases[state_code] = {
                        'total': 0,
                        'criminal': criminal_cases
                    }
                else:
                    statewise_criminal_cases[state_code]['criminal'] += criminal_cases

            for state_code, row in merge_df.groupby(by='state_code').count().iterrows():
                total_cases = row['criminal']

                if state_code not in statewise_criminal_cases:
                    statewise_criminal_cases[state_code] = {
                        'total': total_cases,
                        'criminal': 0
                    }
                else:
                    statewise_criminal_cases[state_code]['total'] += total_cases

    print('After year ' + year)
    print(statewise_criminal_cases)

print()
print()
print('Done.')
statewise_criminal_cases

After year 2010
{1: {'total': 707222, 'criminal': 268981.0}, 2: {'total': 103824, 'criminal': 30467.0}, 3: {'total': 150582, 'criminal': 29966.0}, 4: {'total': 172041, 'criminal': 52732.0}, 5: {'total': 2026, 'criminal': 834.0}, 6: {'total': 29521, 'criminal': 23730.0}, 7: {'total': 20174, 'criminal': 14674.0}, 8: {'total': 105968, 'criminal': 85061.0}, 9: {'total': 34795, 'criminal': 15976.0}, 10: {'total': 53457, 'criminal': 11506.0}, 11: {'total': 19611, 'criminal': 10637.0}, 13: {'total': 129871, 'criminal': 55532.0}, 12: {'total': 4462, 'criminal': 60.0}, 14: {'total': 19854, 'criminal': 8770.0}, 15: {'total': 13339, 'criminal': 1933.0}, 16: {'total': 28604, 'criminal': 12709.0}, 17: {'total': 261221, 'criminal': 79978.0}, 18: {'total': 9704, 'criminal': 4925.0}, 20: {'total': 1512, 'criminal': 1267.0}, 21: {'total': 851, 'criminal': 231.0}, 22: {'total': 15772, 'criminal': 4793.0}, 23: {'total': 178408, 'criminal': 81492.0}, 19: {'total': 52, 'criminal': 22.0}, 26: {'total': 4684

{1: {'total': 9279577, 'criminal': 3079828.0},
 2: {'total': 1971164, 'criminal': 694513.0},
 3: {'total': 6396418, 'criminal': 1698992.0},
 4: {'total': 4447322, 'criminal': 1695785.0},
 5: {'total': 696932, 'criminal': 214842.0},
 6: {'total': 998950, 'criminal': 508030.0},
 7: {'total': 924529, 'criminal': 683236.0},
 8: {'total': 2745276, 'criminal': 2135724.0},
 9: {'total': 3269526, 'criminal': 1364226.0},
 10: {'total': 2770935, 'criminal': 805615.0},
 11: {'total': 1291388, 'criminal': 529585.0},
 13: {'total': 9557727, 'criminal': 3685990.0},
 12: {'total': 255890, 'criminal': 28259.0},
 14: {'total': 2919978, 'criminal': 664429.0},
 15: {'total': 654724, 'criminal': 87924.0},
 16: {'total': 1977734, 'criminal': 896541.0},
 17: {'total': 4853837, 'criminal': 1468420.0},
 18: {'total': 1025179, 'criminal': 394552.0},
 20: {'total': 79503, 'criminal': 41456.0},
 21: {'total': 29429, 'criminal': 6403.0},
 22: {'total': 2525279, 'criminal': 715912.0},
 23: {'total': 4113830, 'crim

### Revelations
The cell above took far too long because it re-reads `acts_sections.csv` for every chunk of every year. Instead, I am going to merge all valid cases (the ones that have a row in `acts_sections`) with `acts_sections` once and write the result to one large file, rather than repeatedly working through several 3 GB files.

Although the dataset claims ~80 million recorded cases, in reality we only have ~66 million cases, as is clear from the line count in the `wc` output:
```
$ wc cases_recorded.csv
   66165191   345111674 12967268763 cases_recorded.csv
```

In [None]:
# :warning: this cell takes very long to run
chunk = 1
for year in years:
    cases_df = pd.read_csv(f'../data/cases/cases_{year}.csv',
                           chunksize=CHUNK_SIZE,
                           iterator=True,
                           low_memory=False)

    for df in cases_df:
        case_law_df = pd.read_csv('../data/acts_sections.csv',
                                  chunksize=CHUNK_SIZE,
                                  iterator=True,
                                  low_memory=False)

        for part_df in case_law_df:
            merge_df = pd.merge(df, part_df, how='inner', on=['ddl_case_id'])

            # the merge_df now contains valid cases as rows,
            # we can write this to a file.
            merge_df.to_csv('../data/_baked/cases_recorded.csv',
                            header=(chunk == 1),
                            mode='a',
                            index=False)

            print(f'written_chunk: {chunk}')
            chunk += 1

    print()
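    # chunk is incremented after each write, so chunk - 1 chunks have been written so far
    print(f'chunks_written_through_{year}: {chunk - 1}')

print(f'total_chunks: {chunk - 1}')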

### Analysis Monthly Distribution of Criminal Cases
We bucket cases by the month of `date_of_filing` and count total and criminal cases per month.
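A small sketch of per-month counting accumulated across chunks, on toy data. The dates and flags are invented, and the `YYYY-MM-DD` date format is an assumption; `monthly` and `chunks` are hypothetical names used only for this illustration.

```python
import pandas as pd

# Toy chunks standing in for pieces of a large cases file
# (dates and flags are invented; the date format is assumed to be YYYY-MM-DD).
chunks = [
    pd.DataFrame({'date_of_filing': ['2010-01-15', '2010-07-02'], 'criminal': [1, 0]}),
    pd.DataFrame({'date_of_filing': ['2011-07-20', '2011-12-31'], 'criminal': [1, 1]}),
]

# Running totals for every month, starting at zero.
monthly = pd.DataFrame(0, index=range(1, 13), columns=['total', 'crime'])
for chunk in chunks:
    # Parse the date once and take the month number (1-12).
    month = pd.to_datetime(chunk['date_of_filing'], errors='coerce').dt.month
    stats = chunk.groupby(month)['criminal'].agg(total='count', crime='sum')
    # Months absent from this chunk contribute zero.
    monthly = monthly.add(stats, fill_value=0).astype(int)

print(monthly.loc[[1, 7, 12]])
```

Parsing the date column avoids splitting strings by hand, and the `fill_value=0` addition keeps months that a chunk does not contain.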

#### Sidenote
Using `_baked/cases_recorded.csv` drastically reduced the time taken for this analysis: execution took `5min 32s`.
It would have taken a lot longer if we had needed to chunk `acts_sections.csv` and merge as well.
Saving intermediate results is therefore a simple way to work with very large files.

### Results
First, here is the data
```
{1: {'total': 5214333, 'crime': 1875800},
 2: {'total': 5025822, 'crime': 1781486},
 3: {'total': 5417506, 'crime': 1916943},
 4: {'total': 5312099, 'crime': 1901371},
 5: {'total': 4941842, 'crime': 1965947},
 6: {'total': 5071289, 'crime': 2015290},
 7: {'total': 6312769, 'crime': 2308302},
 8: {'total': 5822311, 'crime': 2184314},
 9: {'total': 5780851, 'crime': 2134357},
 10: {'total': 5403726, 'crime': 2007111},
 11: {'total': 5724909, 'crime': 2094397},
 12: {'total': 6137733, 'crime': 2336801}}
 ```
 
Interestingly, the number of cases (total and criminal) varies across the months, with two clear peaks: one in July and one in December.

In [106]:
%%time
recorded_cases_df = pd.read_csv('../data/_baked/cases_recorded.csv',
                                chunksize=CHUNK_SIZE,
                                iterator=True,
                                low_memory=False)

# month_no: { 'total': [int], 'crime': [int] } ; total cases and criminal cases filed in that month
criminal_distribtion_monthwise = {m: {'total': 0, 'crime': 0} for m in range(1, 12 + 1)}
for df in recorded_cases_df:
    # first we augment the df with an additional column called month_of_filing
    df['month_of_filing'] = df['date_of_filing'].apply(lambda d: d.split('-')[1])
    df['month_of_filing'] = pd.to_numeric(df['month_of_filing'])
    
    # we can then group by month and add the total criminal cases
    for m, row in df.groupby(by='month_of_filing').sum(numeric_only=True).iterrows():
        criminal_distribtion_monthwise[m]['crime'] += row['criminal']
        
    # here we group by month and add the total cases
    for m, row in df.groupby(by='month_of_filing').count().iterrows():
        criminal_distribtion_monthwise[m]['total'] += row['criminal']

    print('.', end='')

print()
print(criminal_distribtion_monthwise)

...................................................................
{1: {'total': 5214333, 'crime': 1875800.0}, 2: {'total': 5025822, 'crime': 1781486.0}, 3: {'total': 5417506, 'crime': 1916943.0}, 4: {'total': 5312099, 'crime': 1901371.0}, 5: {'total': 4941842, 'crime': 1965947.0}, 6: {'total': 5071289, 'crime': 2015290.0}, 7: {'total': 6312769, 'crime': 2308302.0}, 8: {'total': 5822311, 'crime': 2184314.0}, 9: {'total': 5780851, 'crime': 2134357.0}, 10: {'total': 5403726, 'crime': 2007111.0}, 11: {'total': 5724909, 'crime': 2094397.0}, 12: {'total': 6137733, 'crime': 2336801.0}}
CPU times: total: 5min 31s
Wall time: 5min 32s


### Analysis Pleading Guilty
We look at cases in which the defendant pleaded guilty and try to find interesting patterns. In the disposition key, 'plead guilty' has `disp_name` 37 for 2010-2015 and 38 for 2016-2018.

#### Parts
1. The regional distribution (where did people plead guilty the most?)
2. Did men plead guilty more often than women?
3. Were cases in which the defendant pleaded guilty resolved faster (`date_of_decision - date_of_filing`)?
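The disposition key shown earlier lists 'plead guilty' under code 37 in some years and 38 in others, so the same code may stand for a different disposition in another year. A sketch of a year-aware filter follows. It assumes the cases data carries a `year` column that lines up with the key; that column is an assumption and is not confirmed by the cells in this notebook.

```python
import pandas as pd

# Slice of the disposition key (year, disp_name, disp_name_s) from the table above.
disp_key = pd.DataFrame({
    'year': [2015, 2016],
    'disp_name': [37, 38],
    'disp_name_s': ['plead guilty', 'plead guilty'],
})

# Toy cases chunk; the `year` column is assumed to exist in the real files.
cases = pd.DataFrame({
    'year': [2015, 2015, 2016, 2016],
    'disp_name': [37, 38, 37, 38],
})

# Match on (year, disp_name) so that code 37 in 2016 or 38 in 2015 is not counted.
pg_keys = disp_key.loc[disp_key['disp_name_s'] == 'plead guilty', ['year', 'disp_name']]
flagged = cases.merge(pg_keys, on=['year', 'disp_name'], how='left', indicator=True)
pg_filt = (flagged['_merge'] == 'both').to_numpy()

print(cases[pg_filt])
```

Only the first and last toy rows pass the filter, whereas a plain `(disp_name == 37) | (disp_name == 38)` check would accept all four.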

In [196]:
%%time
# Regional Analysis Cell
recorded_cases_df = pd.read_csv('../data/_baked/cases_recorded.csv',
                                chunksize=CHUNK_SIZE,
                                iterator=True,
                                low_memory=False)

# pg prefixed variables are short for plead_guilty
pg_regional = {}  # state_code: [int]
regional_total = {}
for df in recorded_cases_df:
    # print(df.columns)

    pg_filt = (df['disp_name'] == 37) | (df['disp_name'] == 38)
    for state_code, row in df[pg_filt].groupby(by='state_code').count().iterrows():
        if state_code in pg_regional:
            pg_regional[state_code] += row['disp_name']
        else:
            pg_regional[state_code] = row['disp_name']

    for state_code, row in df.groupby(by='state_code').count().iterrows():
        if state_code in regional_total:
            regional_total[state_code] += row['disp_name']
        else:
            regional_total[state_code] = row['disp_name']

    print('.', end='')

print()
print(pg_regional)
print(regional_total)

...................................................................
{1: 189948, 3: 534996, 4: 175978, 6: 5832, 7: 7805, 8: 164, 9: 101225, 11: 720, 13: 1675, 14: 235, 15: 6614, 16: 36691, 17: 668975, 18: 2631, 20: 422, 21: 905, 22: 329, 26: 10065, 32: 292, 12: 12, 24: 273, 27: 49, 25: 10, 31: 1, 33: 1}
{1: 9279577, 2: 1971164, 3: 6396418, 4: 4447322, 5: 696932, 6: 998950, 7: 924529, 8: 2745276, 9: 3269526, 10: 2770935, 11: 1291388, 12: 255890, 13: 9557727, 14: 2919978, 15: 654724, 16: 1977734, 17: 4853837, 18: 1025179, 19: 6420, 20: 79503, 21: 29429, 22: 2525279, 23: 4113830, 24: 24536, 25: 76671, 26: 1479683, 27: 181632, 29: 1436693, 30: 155042, 31: 10789, 32: 7755, 33: 842}
CPU times: total: 4min 36s
Wall time: 4min 38s


#### Regional Analysis Results
```
Cases in which the defendant plead guilty
{1: 189948, 3: 534996, 4: 175978, 6: 5832, 7: 7805, 8: 164, 9: 101225, 11: 720, 13: 1675, 14: 235, 15: 6614, 16: 36691, 17: 668975, 18: 2631, 20: 422, 21: 905, 22: 329, 26: 10065, 32: 292, 12: 12, 24: 273, 27: 49, 25: 10, 31: 1, 33: 1}

Total Cases Statewise
{1: 9279577, 2: 1971164, 3: 6396418, 4: 4447322, 5: 696932, 6: 998950, 7: 924529, 8: 2745276, 9: 3269526, 10: 2770935, 11: 1291388, 12: 255890, 13: 9557727, 14: 2919978, 15: 654724, 16: 1977734, 17: 4853837, 18: 1025179, 19: 6420, 20: 79503, 21: 29429, 22: 2525279, 23: 4113830, 24: 24536, 25: 76671, 26: 1479683, 27: 181632, 29: 1436693, 30: 155042, 31: 10789, 32: 7755, 33: 842}
 ```

In [141]:
%%time
# Gender Analysis Cell (we use the defendant gender column)
# we normalize these values by calculating the total number of cases
# of men and women alongside those plead guilty
recorded_cases_df = pd.read_csv('../data/_baked/cases_recorded.csv',
                                chunksize=CHUNK_SIZE,
                                iterator=True,
                                low_memory=False)

pg_gender = { 'male': 0, 'female': 0 }
gender_total = { 'male': 0, 'female': 0 }
for df in recorded_cases_df:
    # setting up the filters this way sidesteps the issue of missing values
    # or unclear names: such rows match neither filter and are not counted
    pg_filt = (df['disp_name'] == 37) | (df['disp_name'] == 38)
    male_filt = (df['female_defendant'] == '0 male')
    female_filt = (df['female_defendant'] == '1 female')
    
    gender_total['male'] += df[male_filt].shape[0]
    gender_total['female'] += df[female_filt].shape[0]

    pg_gender['male'] += df[pg_filt & male_filt].shape[0]
    pg_gender['female'] += df[pg_filt & female_filt].shape[0]
    
    print('.', end='')

print()
print(pg_gender)

...................................................................
{'male': 1385294, 'female': 164112}
CPU times: total: 4min 11s
Wall time: 4min 13s


In [146]:
print(
    'Men', round(pg_gender['male']/gender_total['male'] * 100, 3),
    'Women', round(pg_gender['female']/gender_total['female'] * 100, 3),
)

Men 3.121 Women 2.112


#### Gender Analysis Result
```
Men 3.121% Women 2.112%

plead_guilty: {'male': 1385294, 'female': 164112}
total: {'male': 44380705, 'female': 7768874}
```

Men pleaded guilty slightly more often than women: 3.121% versus 2.112%, a difference of about one percentage point.
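The rates can be recomputed directly from the counts in the result block. The short sketch below also separates the absolute gap (in percentage points) from the relative ratio of the two rates; the variable names are illustrative only.

```python
# Counts copied from the gender analysis result above.
plead_guilty = {'male': 1385294, 'female': 164112}
total = {'male': 44380705, 'female': 7768874}

# Share of each gender's cases that ended in a guilty plea, in percent.
rate = {g: plead_guilty[g] / total[g] * 100 for g in total}

# Absolute gap in percentage points versus the relative ratio of the two rates.
gap_pp = rate['male'] - rate['female']
ratio = rate['male'] / rate['female']

print(f"male {rate['male']:.3f}%, female {rate['female']:.3f}%")
print(f"gap {gap_pp:.2f} percentage points, ratio {ratio:.2f}x")
```

The gap in percentage points is small, while the ratio shows how the two rates compare relative to each other.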

In [173]:
%%time
# Time Analysis Cell
recorded_cases_df = pd.read_csv('../data/_baked/cases_recorded.csv',
                                chunksize=CHUNK_SIZE,
                                iterator=True,
                                low_memory=False)

days_deciding = { 'pg': 0, 'npg': 0 }
total_case_types = { 'pg': 0, 'npg': 0 }
for df in recorded_cases_df:
    df['date_of_decision'] = pd.to_datetime(df['date_of_decision'], infer_datetime_format=True, errors='coerce')
    df['date_of_filing'] = pd.to_datetime(df['date_of_filing'], infer_datetime_format=True, errors='coerce')

    df.dropna(subset=['date_of_decision', 'date_of_filing'], inplace=True)

    pg_filt = (df['disp_name'] == 37) | (df['disp_name'] == 38)
    ndf = df[~pg_filt]
    df = df[pg_filt]

    days_deciding['pg'] += (df['date_of_decision'] - df['date_of_filing']).dt.days.sum()
    days_deciding['npg'] += (ndf['date_of_decision'] - ndf['date_of_filing']).dt.days.sum()

    total_case_types['pg'] += df.shape[0]
    total_case_types['npg'] += ndf.shape[0]

    print('.', end='')

print()
print(days_deciding)
print(total_case_types)

...................................................................
{'pg': 170973300, 'npg': 17848592029}
{'pg': 1745821, 'npg': 45204640}
CPU times: total: 4min 40s
Wall time: 4min 42s


In [175]:
print(
    round(days_deciding['pg'] / total_case_types['pg'], 3),
    round(days_deciding['npg'] / total_case_types['npg'], 3)
)

97.933 394.84


#### Time Analysis Results
```
Total days taken to reach a decision {'pg': 170973300, 'npg': 17848592029}
Number of cases {'pg': 1745821, 'npg': 45204640}
```

| Statistic                      | Pleaded guilty | Did not plead guilty |
| ------------------------------ | -------------- | -------------------- |
| Total days to reach a decision | 170973300      | 17848592029          |
| Number of cases                | 1745821        | 45204640             |
| Mean days to reach a decision  | 97.933         | 394.84               |

Notice that cases in which the defendant pleaded guilty were decided on average about 297 days sooner (97.933 vs 394.84 days) than cases in which the defendant did not.

### Analysis Domestic Violence Cases
We find data on cases, defendants and judges for domestic violence cases.

#### Parts
1. Which gender was more often the accused in domestic violence cases?
2. What is the regional distribution of domestic violence cases?

In [192]:
%%time
# Gender Analysis Cell
recorded_cases_df = pd.read_csv('../data/_baked/cases_recorded.csv',
                                chunksize=CHUNK_SIZE,
                                iterator=True,
                                low_memory=False)

# dv_gender_stats[gen] = number of cases of domestic violence where
# the gen is the defendant
dv_gender_stats = { 'male': 0, 'female': 0 }
for df in recorded_cases_df:
    merged_df = pd.merge(df, types_df, on='type_name', how='inner')
    
    # na=False treats missing type names as non-matches, as in the regional cell
    domestic_violence_filt = merged_df['type_name_s'].str.contains('domestic violence', na=False)
    male_defendant_filt = merged_df['female_defendant'] == '0 male'
    female_defendant_filt = merged_df['female_defendant'] == '1 female'
    
    dv_gender_stats['male'] += merged_df[domestic_violence_filt & male_defendant_filt].shape[0]
    dv_gender_stats['female'] += merged_df[domestic_violence_filt & female_defendant_filt].shape[0]

    print('.', end='')

print()
print(dv_gender_stats)

...................................................................
{'male': 1013611, 'female': 144007}
CPU times: total: 10min 18s
Wall time: 10min 21s


In [195]:
print(
    round(dv_gender_stats['male'] / gender_total['male'], 3) * 100,
    round(dv_gender_stats['female'] / gender_total['female'], 3) * 100
)

2.3 1.9


#### Gender Analysis Report
```
{'male': 1013611, 'female': 144007}
```

So there is no strong link between domestic violence and defendant gender in this data: domestic violence cases make up about 2.3% of cases with a male defendant and 1.9% of cases with a female defendant.

In [201]:
%%time
# regional Analysis Cell
recorded_cases_df = pd.read_csv('../data/_baked/cases_recorded.csv',
                                chunksize=CHUNK_SIZE,
                                iterator=True,
                                low_memory=False)

dv_regional = {}
for df in recorded_cases_df:
    merged_df = pd.merge(df, types_df, on='type_name', how='inner')
    domestic_violence_filt = merged_df['type_name_s'].str.contains('domestic violence', na=False)

    for state_code, row in merged_df[domestic_violence_filt].groupby(by='state_code').count().iterrows():
        if state_code in dv_regional:
            dv_regional[state_code] += row['type_name']
        else:
            dv_regional[state_code] = row['type_name']

    print('.', end='')

print()
print(dv_regional)

...................................................................
{1: 38076, 2: 64343, 3: 19065, 5: 16752, 6: 6708, 7: 1701, 8: 75267, 9: 411739, 11: 5144, 12: 3894, 13: 429904, 15: 4284, 16: 12602, 17: 16393, 18: 5692, 22: 24831, 20: 272, 25: 1669, 31: 77, 32: 116, 4: 76525, 10: 58473, 14: 18871, 23: 8791, 29: 36823, 30: 1464, 21: 110, 26: 14620, 27: 1165, 19: 197, 24: 550, 33: 29}
CPU times: total: 9min 51s
Wall time: 9min 53s


#### Region Analysis Report
```
{1: 38076, 2: 64343, 3: 19065, 5: 16752, 6: 6708, 7: 1701, 8: 75267, 9: 411739, 11: 5144, 12: 3894, 13: 429904, 15: 4284, 16: 12602, 17: 16393, 18: 5692, 22: 24831, 20: 272, 25: 1669, 31: 77, 32: 116, 4: 76525, 10: 58473, 14: 18871, 23: 8791, 29: 36823, 30: 1464, 21: 110, 26: 14620, 27: 1165, 19: 197, 24: 550, 33: 29}
```