# Classification
This is my first attempt at an interesting classification problem. During data exploration and analysis, I noticed many patterns concerning cases in which the defendant plead guilty such as

* shorter cases
* gender distribution
* state disposition
* type of crime

These are not enough to determine whether a defendant pleads guilty or not of course, but there does seem to be some correlation. I hope to explore this pattern in this notebook and find out whether it is possible to predict if a defendant pleads guilty just from the data given to us.

## Outcomes
If we find that such a pattern does exist in cases then it is possible to predict a lot of court data, it could help courts speed up resolving cases, it could help prosecution with techniques to increase the chances of the defendant pleading guilty and so on.

## Technical Aspects
This is a binary classification problem which avoids the overhead of `ovo` or `ova` solutions to multiclassification however we still have to clean the data.

We will use the sklearn classification report to gauge the accuracy of the models, specifically because of the recall.

#### Features
For the above classification, we choose the following features,

* the `state` at which the case was held
* the `court` at which the case was held
* the `judge` position
* the `gender` of the defendant
* the `gender` of the judge
* the `type` of case
* the `purpose` of the case
* the `act`
* whether the case was of `criminal` nature or not
* `time` taken to decide on the case

### Pre-processing
Our data consists of `../data/_baked/cases_recorded.csv` and helper tables in `../data/keys/`.

Before we can train models, we have to prepare the data. This is done with,

* merging with judge details (judge gender and position likely impacts a defendant's decision to plead guilty)
* dropping columns and rows with missing or conflicting data and working with dates
* one hot encoding non numerical data with no inherent order like purpose_name
* transforming numerical columns to have the same weight

In [2]:
import pandas as pd

### Dropping Columns or Rows
Below, we clean our data of unhelpful columns and rows. We also clean out missing values and incorrectly formatted data.

* We remove rows whose date isn't in any valid format.
* We remove columns whose values are mostly `na`.
* Clean up columns such as the gender columns.

The following changes to the DataFrame has been made,
1. column 'bailable_ipc' is dropped (mostly na)
2. column 'dist_code' is dropped (district is too specific and unrelated to affect results)
3. column 'cino', 'judge_id', and 'ddl_case_id' is dropped (id rows are unimportant to a classifier)
4. drop rows with unclear gender and convert gender to a numerical column (0 if male, 1 if female)

In [4]:
CHUNK_SIZE = 100_000

In [5]:
# decision judge with case id
judge_case_df = pd.read_csv('../data/keys/judge_case_merge_key.csv')
judge_case_df.drop(['ddl_filing_judge_id'], axis=1, inplace=True)

In [6]:
# rename column
judge_case_cols = list(judge_case_df.columns)
judge_case_cols[1] = 'ddl_judge_id'
judge_case_df.columns = judge_case_cols

In [7]:
# judges df
judges_df = pd.read_csv('../data/judges_clean.csv')
judges_df = judges_df[['ddl_judge_id', 'female_judge']]

In [8]:
# merge the two dfs
judge_data_df = pd.merge(judge_case_df, judges_df, on='ddl_judge_id', how='inner')

In [9]:
judge_data_df

Unnamed: 0,ddl_case_id,ddl_judge_id,female_judge
0,01-01-01-201900000022018,5.0,0 nonfemale
1,01-01-01-201900000032017,5.0,0 nonfemale
2,01-01-01-201900000042016,5.0,0 nonfemale
3,01-01-01-201900000052018,5.0,0 nonfemale
4,01-01-01-201900000072016,5.0,0 nonfemale
...,...,...,...
12290700,30-02-05-204000000162014,98405.0,1 female
12290701,30-02-06-201400000682011,98452.0,0 nonfemale
12290702,30-02-06-201400000992011,98452.0,0 nonfemale
12290703,30-02-06-201400001092011,98452.0,0 nonfemale


In [10]:
%%time
# in this cell we bake our data to extract features
cases_df = pd.read_csv('../data/_baked/cases_recorded.csv',
                iterator=True,
                chunksize=CHUNK_SIZE,
                low_memory=False)

chunk = 1
for df in cases_df:
    # PART1: merging to obtain decision judge gender
    df = pd.merge(df, judge_data_df, on='ddl_case_id', how='inner')

    # PART2: working with irrelevant columns
    df.drop(['bailable_ipc', 'dist_code', 'ddl_case_id', 'section',
             'cino', 'ddl_judge_id', 'number_sections_ipc',
             'female_adv_pet', 'female_adv_def', 'female_petitioner',
             'date_first_list', 'date_last_list', 'date_next_list',
             'year'
            ], axis=1, inplace=True)

    # PART3: working with dates
    date_columns = ['date_of_decision', 'date_of_filing']
    # parse date columns as dates
    for date_col in date_columns:
        df[date_col] = pd.to_datetime(df[date_col], infer_datetime_format=True, errors='coerce')
    
    # drop rows whose dates could not be parsed
    df.dropna(subset=date_columns, inplace=True)
    
    # add duration column
    df['duration_days'] = (df['date_of_decision'] -df['date_of_filing']).dt.days
    # drop the date columns
    df.drop(date_columns, axis=1, inplace=True)
    # take valid duration rows
    df = df[df['duration_days'] >= 0]

    # PART4: working with gender columns
    gender_columns = ['female_judge', 'female_defendant']
    # filter out unclear genders
    for gender_col in gender_columns:
        # filter on valid data
        df[gender_col] = df[gender_col].astype(str).transform(lambda gen: gen[0])
        gender_valid_filt = (df[gender_col] == '0') | (df[gender_col] == '1')
        df = df[gender_valid_filt]

        # convert to numeric column
        df[gender_col] = pd.to_numeric(df[gender_col])
    
    # write df_acts_sections to a data file
    df.to_csv('../data/_baked/ml/plead_guilty.csv',
              header=(chunk == 1),
              mode='a',
              index=False)

    print('.', end='')
    chunk += 1

print('Done.')

......................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................Done.
CPU times: total: 1h 57min 51s
Wall time: 1h 58min 7s


In [11]:
# we now load the dataset
features_df = pd.read_csv('../data/_baked/ml/plead_guilty.csv', low_memory=False) 

In [12]:
features_df['court_no'].value_counts()

3     1157680
1     1104544
2      935136
4      450974
5      338894
6      198620
8      140178
7      124446
9       88557
10      84810
11      84339
14      75792
13      71535
15      69483
12      46337
17      43064
18      41779
21      36739
16      36451
20      35629
19      31144
22      20297
23      15302
26      14130
24      12142
53      10866
29      10474
27      10328
32      10286
25       4688
28       4421
42       3417
31       3123
38       2121
36       1588
30       1523
39       1325
34       1176
35       1056
33       1007
41        922
40        825
43        790
46        553
49        551
45        379
44        306
58        285
48        194
65        167
59        150
37        109
66         95
47         10
60          5
50          1
Name: court_no, dtype: int64

In [13]:
features_df['purpose_name'].value_counts()

4551.0    198889
4889.0    162445
4342.0    125443
4922.0    109011
3229.0    102826
           ...  
4510.0         1
3115.0         1
2983.0         1
3081.0         1
6239.0         1
Name: purpose_name, Length: 7321, dtype: int64

In [14]:
features_df['type_name'].value_counts()

929.0     95227
956.0     59577
981.0     54944
736.0     54385
915.0     53202
          ...  
5939.0        1
4158.0        1
1141.0        1
1055.0        1
7532.0        1
Name: type_name, Length: 6233, dtype: int64

In [15]:
features_df['act'].value_counts()

17353.0    3659901
4759.0     1408500
14451.0      89587
14409.0      60679
14134.0      30074
            ...   
4245.0           1
5688.0           1
17170.0          1
8105.0           1
7735.0           1
Name: act, Length: 288, dtype: int64

In [17]:
features_df.columns

Index(['state_code', 'court_no', 'judge_position', 'female_defendant',
       'type_name', 'purpose_name', 'disp_name', 'act', 'criminal',
       'female_judge', 'duration_days'],
      dtype='object')

In [23]:
pg_filt = (features_df['disp_name'] == 37) | (features_df['disp_name'] == 38)
features_df[pg_filt]['act'].value_counts()

17353.0    92267
14134.0    27411
14451.0    10507
4759.0      4003
14124.0     1200
7530.0       534
14409.0      356
18295.0      265
25795.0      241
14411.0      177
530.0        166
490.0        159
14454.0      149
481.0        118
423.0        112
14452.0      104
7896.0        71
14405.0       69
4241.0        59
29715.0       51
14133.0       51
24083.0       50
25049.0       44
492.0         43
483.0         37
14442.0       36
28735.0       29
28742.0       21
28741.0       21
25798.0       21
7899.0        13
5084.0        12
24752.0        9
525.0          8
24078.0        7
24200.0        6
24079.0        6
17509.0        4
24742.0        3
486.0          3
7581.0         2
7531.0         2
12587.0        2
2301.0         1
16907.0        1
21648.0        1
24743.0        1
7765.0         1
5115.0         1
24072.0        1
19980.0        1
4236.0         1
14130.0        1
533.0          1
Name: act, dtype: int64

In [30]:
# we remove every act from '7896.0        71' and below (top)
keep_acts = list(features_df[pg_filt]['act'].value_counts().index[:16])

In [33]:
act_filt = features_df['act'].isin(keep_acts)
features_df2 = features_df[act_filt]

In [64]:
keep_courts = features_df2['court_no'].value_counts().index[:22]

In [65]:
court_filt = features_df2['court_no'].isin(keep_courts)
features_df3 = features_df2[court_filt]

In [76]:
pg_filt3 = (features_df3['disp_name'] == 37) | (features_df3['disp_name'] == 38)
keep_types = features_df3[pg_filt3]['type_name'].value_counts().index[:20]

In [77]:
type_filt = features_df3['type_name'].isin(keep_types)
features_df4 = features_df3[type_filt]

In [85]:
pg_filt4 = (features_df4['disp_name'] == 37) | (features_df4['disp_name'] == 38)
keep_purposes = features_df4['purpose_name'].value_counts().index[:25]

In [87]:
purpose_filt = features_df4['purpose_name'].isin(keep_purposes)
features_df5 = features_df4[purpose_filt]

In [88]:
pg_filt5 = (features_df5['disp_name'] == 37) | (features_df5['disp_name'] == 38)
features_df5[pg_filt5]

Unnamed: 0,state_code,court_no,judge_position,female_defendant,type_name,purpose_name,disp_name,act,criminal,female_judge,duration_days
869593,17,2,civil court,0,929.0,4512.0,37,17353.0,1,1,1348
869619,17,2,civil court,0,929.0,4512.0,37,17353.0,1,1,1381
869627,17,2,civil court,0,929.0,4512.0,37,17353.0,1,1,1429
869644,17,2,civil court,0,929.0,4512.0,37,17353.0,1,1,1309
870973,17,2,civil court,0,929.0,4512.0,37,17353.0,1,1,1472
...,...,...,...,...,...,...,...,...,...,...,...
5198509,17,2,civil court,0,977.0,6208.0,38,17353.0,1,1,3
5198517,17,2,civil court,0,977.0,6208.0,38,17353.0,1,1,0
5198521,17,2,civil court,0,977.0,6208.0,38,17353.0,1,1,5
5200341,17,2,civil court,0,977.0,6208.0,38,17353.0,1,1,515


In [98]:
keep_positions = features_df5['judge_position'].value_counts().index[:22]

In [99]:
position_filt = features_df5['judge_position'].isin(keep_positions)
features_df6 = features_df5[position_filt]

In [101]:
features_df6

Unnamed: 0,state_code,court_no,judge_position,female_defendant,type_name,purpose_name,disp_name,act,criminal,female_judge,duration_days
692381,2,4,civil judge junior division,1,929.0,4512.0,4,17353.0,1,0,697
692382,2,4,civil judge junior division,0,929.0,4512.0,4,17353.0,1,0,676
692390,2,4,civil judge junior division,1,929.0,4512.0,19,17353.0,1,0,903
692405,2,6,civil judge junior division,1,929.0,4512.0,4,17353.0,1,1,528
692409,2,9,civil judge junior division,0,929.0,4512.0,4,17353.0,1,0,594
...,...,...,...,...,...,...,...,...,...,...,...
5330708,29,17,civil judge junior division,0,977.0,7698.0,2,17353.0,1,0,7
5330709,29,17,civil judge junior division,0,977.0,5053.0,4,17353.0,1,0,10
5330710,29,17,civil judge junior division,0,977.0,5053.0,4,17353.0,1,0,19
5330711,29,17,civil judge junior division,0,977.0,5053.0,4,17353.0,1,0,19


In [102]:
pg_filt6 = (features_df6['disp_name'] == 37) | (features_df6['disp_name'] == 38)
features_df6[pg_filt6]

Unnamed: 0,state_code,court_no,judge_position,female_defendant,type_name,purpose_name,disp_name,act,criminal,female_judge,duration_days
869593,17,2,civil court,0,929.0,4512.0,37,17353.0,1,1,1348
869619,17,2,civil court,0,929.0,4512.0,37,17353.0,1,1,1381
869627,17,2,civil court,0,929.0,4512.0,37,17353.0,1,1,1429
869644,17,2,civil court,0,929.0,4512.0,37,17353.0,1,1,1309
870973,17,2,civil court,0,929.0,4512.0,37,17353.0,1,1,1472
...,...,...,...,...,...,...,...,...,...,...,...
5198509,17,2,civil court,0,977.0,6208.0,38,17353.0,1,1,3
5198517,17,2,civil court,0,977.0,6208.0,38,17353.0,1,1,0
5198521,17,2,civil court,0,977.0,6208.0,38,17353.0,1,1,5
5200341,17,2,civil court,0,977.0,6208.0,38,17353.0,1,1,515


In [103]:
features_df6.to_csv('../data/_baked/ml/plead_guilty_features.csv', index=False)

In [109]:
y = features_df6['disp_name'].transform(lambda x: 1 if x == 37 or x == 38 else 0)

In [111]:
features_df6['y'] = y

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  features_df6['y'] = y


In [114]:
features_df6.drop(['disp_name'], axis=1, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  features_df6.drop(['disp_name'], axis=1, inplace=True)


In [116]:
features_df6.to_csv('../data/_baked/ml/plead_guilty_features.csv', index=False)

In [3]:
dataset_df = pd.read_csv('../data/_baked/ml/plead_guilty_features.csv')

### Encoding
We one hot encode columns `purpose_name`, `type_name`, `state_code`, `court_no`, `act`. We cannot ordinal encode these values as they have no relevant ordering.

We face the problem of quantity, if we were to brute force one hot encode these columns we would have too many columns, what we could do is choose values of `type_name`, `purpose_name`, `act` that make up 90% to 95% of the data. This allows us to remove many scarcely used values.

In [4]:
from sklearn.preprocessing import OneHotEncoder

In [5]:
categorical_columns = ['state_code', 'court_no', 'judge_position', 'type_name', 'purpose_name', 'act']

In [6]:
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

In [7]:
categorical_cols_df = dataset_df[categorical_columns]
numerical_df = dataset_df.drop(categorical_columns, axis=1)

In [8]:
OH_categorical_cols_df = pd.DataFrame(OH_encoder.fit_transform(categorical_cols_df))

In [9]:
OH_categorical_cols_df.index = categorical_cols_df.index

In [10]:
encoded_merged_df = pd.concat([OH_categorical_cols_df, numerical_df], axis=1)

In [11]:
# we normalize values so that all values lie in (-1, 1)
normalized_df = (encoded_merged_df-encoded_merged_df.min())/(encoded_merged_df.max()-encoded_merged_df.min())

In [12]:
normalized_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,118,119,120,121,122,female_defendant,criminal,female_judge,duration_days,y
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.284490,0.0
1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.275918,0.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.368571,0.0
3,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,0.215510,0.0
4,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.242449,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
260192,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.002857,0.0
260193,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.004082,0.0
260194,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.007755,0.0
260195,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.007755,0.0


### Split
We split the dataset into a test and train dataset, since we only have about 50,000 plead guilty cases, we can take 70% of them to train the model and 30% to test.

In [13]:
pg_filt_final = (normalized_df['y'] == 1)
not_pg_filt_final = (normalized_df['y'] == 0)
y_1_df = normalized_df[pg_filt_final]
y_0_df = normalized_df[not_pg_filt_final]

In [14]:
split_index_1 = int(0.7 * len(y_1_df))
split_index_0 = int(0.7 * len(y_0_df))

In [15]:
X_train = pd.concat([y_0_df[:split_index_0], y_1_df[:split_index_1]])
X_test = pd.concat([y_0_df[split_index_0:], y_1_df[split_index_1:]])

In [16]:
y_train = X_train['y']
y_test = X_test['y']

In [17]:
X_train.drop(['y'], axis=1, inplace=True)
X_test.drop(['y'], axis=1, inplace=True)

In [18]:
X_train = X_train.sample(frac=1)
y_train = y_train.sample(frac=1)
X_test = X_test.sample(frac=1)
y_test = y_test.sample(frac=1)

## Testing Out Models

### SGDClassifier

In [164]:
from sklearn.linear_model import SGDClassifier

In [175]:
sgd_clf = SGDClassifier(random_state=42)

In [176]:
sgd_clf.fit(X_train.values, y_train.values)

In [26]:
from sklearn.model_selection import cross_val_score

In [178]:
cross_val_score(sgd_clf, X_train.values, y_train.values, cv=3, scoring="accuracy")

array([0.78637195, 0.78636843, 0.78636843])

In [182]:
cross_val_score(sgd_clf, X_test.values, y_test.values, cv=3, scoring="accuracy")

array([0.78639508, 0.78635665, 0.78635665])

In [183]:
y_train[y_train == 1]

3465124    1.0
3519156    1.0
4503165    1.0
4435835    1.0
4514250    1.0
          ... 
2554704    1.0
4449781    1.0
2607865    1.0
2311502    1.0
2324547    1.0
Name: y, Length: 38910, dtype: float64

In [184]:
y_train[y_train == 0]

3223847    0.0
1701177    0.0
3430803    0.0
2331680    0.0
3523734    0.0
          ... 
2357173    0.0
2582815    0.0
3519177    0.0
3171620    0.0
2670341    0.0
Name: y, Length: 143227, dtype: float64

In [187]:
from sklearn.metrics import classification_report

In [193]:
y_test_pred = sgd_clf.predict(X_test.values)

In [201]:
print(classification_report(y_test.values, y_test_pred))

              precision    recall  f1-score   support

         0.0       0.79      1.00      0.88     61384
         1.0       0.00      0.00      0.00     16676

    accuracy                           0.79     78060
   macro avg       0.39      0.50      0.44     78060
weighted avg       0.62      0.79      0.69     78060



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


### RandomForestClassifier

In [204]:
from sklearn.ensemble import RandomForestClassifier

In [206]:
clf = RandomForestClassifier(random_state=0)

In [208]:
clf.fit(X_train.values, y_train.values)

In [209]:
cross_val_score(clf, X_test.values, y_test.values, cv=3, scoring="accuracy")

array([0.72252114, 0.72517294, 0.72229055])

In [210]:
y_test_pred = clf.predict(X_test.values)

In [211]:
print(classification_report(y_test.values, y_test_pred))

              precision    recall  f1-score   support

         0.0       0.79      0.96      0.87     61384
         1.0       0.21      0.03      0.06     16676

    accuracy                           0.77     78060
   macro avg       0.50      0.50      0.46     78060
weighted avg       0.66      0.77      0.69     78060



### Neural Networks \[MLPClassifier\]

In [21]:
from sklearn.neural_network import MLPClassifier

In [23]:
clf = MLPClassifier(random_state=1)

In [24]:
clf.fit(X_train.values, y_train.values)

In [27]:
cross_val_score(clf, X_test.values, y_test.values, cv=3, scoring="accuracy")



array([0.78397387, 0.78239816, 0.77886241])

In [218]:
y_test_pred = clf.predict(X_test.values)

In [219]:
print(classification_report(y_test.values, y_test_pred))

              precision    recall  f1-score   support

         0.0       0.79      0.97      0.87     61384
         1.0       0.21      0.03      0.05     16676

    accuracy                           0.77     78060
   macro avg       0.50      0.50      0.46     78060
weighted avg       0.66      0.77      0.69     78060



### Deep Learning
I will train a neural network using keras in hopes of increasing our accuracy

Click here for the google colab link
https://colab.research.google.com/drive/1xVzuiZL9wB8Ie6_i3ciwx3PBIlqYkxtT?usp=sharing

## Results
From the classification report(s) generated above, there is seemingly little (RandomForestClassifier) to no (Keras and every other model) correlation between the features given and cases in which the defendant plead guilty.

Thus, one cannot infer whether a defendant plead guilty based on the features chosen above.