## Project goal: forecasting the probability that individuals recidivated in year 1

## Data description:
#### The Challenge uses data on roughly 26,000 individuals from the State of Georgia released from Georgia prisons on discretionary parole to the custody of the Georgia Department of Community Supervision (GDCS) for the purpose of post-incarceration supervision between January 1, 2013 and December 31, 2015. This dataset is split into two sets, training and test. We used a 70/30 split: 70% of the data is in the training dataset and 30% is in the test dataset. The training dataset includes the four dichotomous dependent variables measuring whether an individual recidivated in the three-year follow-up period (yes/no) as well as whether they recidivated by time period (year 1, year 2, or year 3). Recidivism is measured as an arrest for a new felony or misdemeanor crime within three years of the supervision start date. The test dataset does not include the four dependent variables. The initial test dataset will include all individuals selected in the 30% test dataset. After the first Challenge period (forecasting the probability that individuals recidivated in year 1) concludes, a second test dataset will be released containing only those individuals who did not recidivate in year 1. The same will be done after the second Challenge period. It should also be noted that the test dataset will contain variables that describe supervision activities, such as drug testing and employment. These data will not appear in the test dataset until the second Challenge period (i.e., year 2 dataset). We believe this is more reflective of practice, where activities must accrue and correctional officers must become aware of them prior to a recidivism event. The additional data released at the second Challenge period will not change at the third Challenge period release (i.e., year 3 dataset); they are measures of supervision activities during the entire time people were under supervision or until the date of recidivism for those arrested. 
The only thing that changes with the third Challenge period release is the removal of those individuals that did recidivate in year 2. 
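
The staged release described above can be illustrated on a toy frame. This is only a sketch of the filtering logic as the description states it, not the organizers' actual procedure: the four records are made up, and the column name `Recidivism_Arrest_Year2` is an assumption made by analogy with `Recidivism_Arrest_Year1` (in the real test files the outcome columns are withheld).

```python
import pandas as pd

# Toy records (made up); the Year2 column name is an assumption.
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Recidivism_Arrest_Year1": [False, True, False, False],
    "Recidivism_Arrest_Year2": [False, False, True, False],
})

# Period 1: everyone in the test split.
year1_set = toy
# Period 2: only those who did not recidivate in year 1.
year2_set = year1_set[~year1_set["Recidivism_Arrest_Year1"]]
# Period 3: additionally remove those who recidivated in year 2.
year3_set = year2_set[~year2_set["Recidivism_Arrest_Year2"]]

print(list(year1_set["ID"]), list(year2_set["ID"]), list(year3_set["ID"]))
```

Each later set is a subset of the previous one, so a model for year 2 or year 3 is conditioned on individuals who have not yet been rearrested.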
## Data Source:
#### Both the GDCS and the Georgia Bureau of Investigation provided data.

In [1]:
import csv
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import re
import string
import nltk

In [2]:
#Load train and test data
data_train = pd.read_csv(r'C:\NIJ_s_Recidivism_Challenge_Training_Dataset.csv')
data_test = pd.read_csv(r'C:\NIJ_s_Recidivism_Challenge_Test_Dataset1.csv')

In [3]:
data_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18028 entries, 0 to 18027
Data columns (total 53 columns):
 #   Column                                             Non-Null Count  Dtype  
---  ------                                             --------------  -----  
 0   ID                                                 18028 non-null  int64  
 1   Gender                                             18028 non-null  object 
 2   Race                                               18028 non-null  object 
 3   Age_at_Release                                     18028 non-null  object 
 4   Residence_PUMA                                     18028 non-null  int64  
 5   Gang_Affiliated                                    15811 non-null  object 
 6   Supervision_Risk_Score_First                       17698 non-null  float64
 7   Supervision_Level_First                            16816 non-null  object 
 8   Education_Level                                    18028 non-null  object 
 9   Depend

In [4]:
data_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7807 entries, 0 to 7806
Data columns (total 33 columns):
 #   Column                                             Non-Null Count  Dtype  
---  ------                                             --------------  -----  
 0   ID                                                 7807 non-null   int64  
 1   Gender                                             7807 non-null   object 
 2   Race                                               7807 non-null   object 
 3   Age_at_Release                                     7807 non-null   object 
 4   Residence_PUMA                                     7807 non-null   int64  
 5   Gang_Affiliated                                    6857 non-null   object 
 6   Supervision_Risk_Score_First                       7662 non-null   float64
 7   Supervision_Level_First                            7299 non-null   object 
 8   Education_Level                                    7807 non-null   object 
 9   Dependen

In [5]:
# Remove rows with null values
data_train=data_train.dropna(axis=0)
data_test=data_test.dropna(axis=0)
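
Calling `dropna(axis=0)` on the full training frame also discards rows whose only missing values sit in columns that are not used as features later. A possible alternative, sketched here on made-up data, is to restrict the null check with the `subset` argument. The column `Drug_Test` and the list `shared_cols` are hypothetical stand-ins for a training-only column and for the features shared with the test data.

```python
import numpy as np
import pandas as pd

# Toy frame: "Drug_Test" stands in for a column that exists only in the training data.
toy_train = pd.DataFrame({
    "Gender": ["M", "F", "M", "M"],
    "Gang_Affiliated": [False, np.nan, True, False],
    "Drug_Test": [0.1, 0.2, np.nan, np.nan],
})
shared_cols = ["Gender", "Gang_Affiliated"]  # hypothetical list of shared features

# Drop a row if ANY column is null (what dropna(axis=0) does).
dropped_all = toy_train.dropna(axis=0)
# Drop a row only if one of the shared feature columns is null.
dropped_shared = toy_train.dropna(axis=0, subset=shared_cols)

print(len(dropped_all), len(dropped_shared))
```

Whether this recovers a meaningful number of rows on the real data depends on where the nulls are, which the toy example does not show.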


In [6]:
# Drop the ID column from both datasets and the last 4 columns (the y values) from the train data; *1 converts the boolean year-1 label to 0/1
X_train=data_train[data_train.columns[1:-4]]
Y_train=data_train[data_train.columns[-3]]*1
X_test=data_test[data_test.columns[1:]]

In [7]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9838 entries, 0 to 18010
Data columns (total 48 columns):
 #   Column                                             Non-Null Count  Dtype  
---  ------                                             --------------  -----  
 0   Gender                                             9838 non-null   object 
 1   Race                                               9838 non-null   object 
 2   Age_at_Release                                     9838 non-null   object 
 3   Residence_PUMA                                     9838 non-null   int64  
 4   Gang_Affiliated                                    9838 non-null   object 
 5   Supervision_Risk_Score_First                       9838 non-null   float64
 6   Supervision_Level_First                            9838 non-null   object 
 7   Education_Level                                    9838 non-null   object 
 8   Dependents                                         9838 non-null   object 
 9   Prison_

In [8]:
Y_train=data_train[data_train.columns[-3]]*1
Y_train

0        0
1        0
2        0
3        0
4        1
        ..
17984    0
17990    0
18004    0
18008    0
18010    0
Name: Recidivism_Arrest_Year1, Length: 9838, dtype: int32

In [9]:
Y_train.describe()

count    9838.000000
mean        0.281765
std         0.449882
min         0.000000
25%         0.000000
50%         0.000000
75%         1.000000
max         1.000000
Name: Recidivism_Arrest_Year1, dtype: float64

In [10]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5568 entries, 0 to 7787
Data columns (total 32 columns):
 #   Column                                             Non-Null Count  Dtype  
---  ------                                             --------------  -----  
 0   Gender                                             5568 non-null   object 
 1   Race                                               5568 non-null   object 
 2   Age_at_Release                                     5568 non-null   object 
 3   Residence_PUMA                                     5568 non-null   int64  
 4   Gang_Affiliated                                    5568 non-null   object 
 5   Supervision_Risk_Score_First                       5568 non-null   float64
 6   Supervision_Level_First                            5568 non-null   object 
 7   Education_Level                                    5568 non-null   object 
 8   Dependents                                         5568 non-null   object 
 9   Prison_O

In [11]:
# Keep only the 32 columns that also appear in the test data (the train data goes from 48 to 32 columns)
X_train=X_train[X_test.columns]
X_train.head(3)

Unnamed: 0,Gender,Race,Age_at_Release,Residence_PUMA,Gang_Affiliated,Supervision_Risk_Score_First,Supervision_Level_First,Education_Level,Dependents,Prison_Offense,...,Prior_Conviction_Episodes_Prop,Prior_Conviction_Episodes_Drug,Prior_Conviction_Episodes_PPViolationCharges,Prior_Conviction_Episodes_DomesticViolenceCharges,Prior_Conviction_Episodes_GunCharges,Prior_Revocations_Parole,Prior_Revocations_Probation,Condition_MH_SA,Condition_Cog_Ed,Condition_Other
0,M,BLACK,43-47,16,False,3.0,Standard,At least some college,3 or more,Drug,...,2,2 or more,False,False,False,False,False,True,True,False
1,M,BLACK,33-37,16,False,6.0,Specialized,Less than HS diploma,1,Violent/Non-Sex,...,0,2 or more,True,True,True,False,False,False,False,False
2,M,BLACK,48 or older,24,False,7.0,High,At least some college,3 or more,Drug,...,1,2 or more,False,True,False,False,False,True,True,False


In [12]:
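# Stack train and test rows so both get the same one-hot columns (DataFrame.append was removed in pandas 2.0)
X_total=pd.concat([X_train, X_test])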
X_total


Unnamed: 0,Gender,Race,Age_at_Release,Residence_PUMA,Gang_Affiliated,Supervision_Risk_Score_First,Supervision_Level_First,Education_Level,Dependents,Prison_Offense,...,Prior_Conviction_Episodes_Prop,Prior_Conviction_Episodes_Drug,Prior_Conviction_Episodes_PPViolationCharges,Prior_Conviction_Episodes_DomesticViolenceCharges,Prior_Conviction_Episodes_GunCharges,Prior_Revocations_Parole,Prior_Revocations_Probation,Condition_MH_SA,Condition_Cog_Ed,Condition_Other
0,M,BLACK,43-47,16,False,3.0,Standard,At least some college,3 or more,Drug,...,2,2 or more,False,False,False,False,False,True,True,False
1,M,BLACK,33-37,16,False,6.0,Specialized,Less than HS diploma,1,Violent/Non-Sex,...,0,2 or more,True,True,True,False,False,False,False,False
2,M,BLACK,48 or older,24,False,7.0,High,At least some college,3 or more,Drug,...,1,2 or more,False,True,False,False,False,True,True,False
3,M,WHITE,38-42,16,False,7.0,High,Less than HS diploma,1,Property,...,3 or more,2 or more,False,False,False,False,True,True,True,False
4,M,WHITE,33-37,16,False,4.0,Specialized,Less than HS diploma,3 or more,Violent/Non-Sex,...,0,1,False,False,False,False,False,True,True,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7783,M,BLACK,43-47,11,False,2.0,Standard,High School Diploma,3 or more,Other,...,0,0,False,False,False,False,False,True,False,False
7784,M,BLACK,23-27,2,False,7.0,Standard,At least some college,3 or more,Drug,...,0,0,False,False,False,False,False,True,False,False
7785,M,BLACK,28-32,23,False,4.0,Standard,At least some college,3 or more,Drug,...,0,1,False,False,False,False,False,True,False,False
7786,M,BLACK,28-32,8,False,6.0,Standard,High School Diploma,1,Property,...,1,0,False,False,False,False,False,True,False,True


In [13]:
X_total=pd.get_dummies(X_total, columns=["Gender","Race", "Age_at_Release", "Prior_Arrest_Episodes_Felony","Prior_Arrest_Episodes_Property","Prior_Arrest_Episodes_Drug",
                                         "Prior_Conviction_Episodes_Viol","Prior_Conviction_Episodes_Prop",
                                         "Prior_Arrest_Episodes_DVCharges","Prior_Conviction_Episodes_Misd",
                                         "Prior_Conviction_Episodes_Felony","Prior_Arrest_Episodes_GunCharges", 
                                         "Prior_Arrest_Episodes_PPViolationCharges", "Gang_Affiliated", 
                                         "Supervision_Level_First", "Education_Level", "Dependents", "Prison_Offense",
                                         "Prison_Years", "Prior_Arrest_Episodes_Misd" ,"Prior_Arrest_Episodes_Violent",
                                         "Prior_Conviction_Episodes_Drug", "Prior_Conviction_Episodes_PPViolationCharges",
                                         "Prior_Conviction_Episodes_DomesticViolenceCharges",
                                         "Prior_Conviction_Episodes_GunCharges", "Prior_Revocations_Parole",
                                         "Prior_Revocations_Probation", "Condition_MH_SA", "Condition_Cog_Ed",
                                         "Condition_Other"])
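
The reason for encoding the stacked frame rather than each split separately can be seen on a toy example: a category that is missing from one split would otherwise produce mismatched columns. The sketch below uses made-up values for `Prison_Offense`, and the `keys=["train", "test"]` labels are an illustrative choice that lets the two parts be recovered without hard-coded row counts.

```python
import pandas as pd

# Made-up values; "Property" appears only in train and "Other" only in test.
toy_train = pd.DataFrame({"Prison_Offense": ["Drug", "Property", "Drug"]})
toy_test = pd.DataFrame({"Prison_Offense": ["Drug", "Other"]})

# Encoding each split on its own yields different column sets.
sep_train = pd.get_dummies(toy_train, columns=["Prison_Offense"])
sep_test = pd.get_dummies(toy_test, columns=["Prison_Offense"])
print(list(sep_train.columns) == list(sep_test.columns))

# Encoding the stacked frame yields one shared column set.
stacked = pd.concat([toy_train, toy_test], keys=["train", "test"])
encoded = pd.get_dummies(stacked, columns=["Prison_Offense"])

# The outer index level labels each row's origin, so the split is by label.
enc_train = encoded.loc["train"]
enc_test = encoded.loc["test"]
print(enc_train.shape, enc_test.shape)
```

Splitting by key avoids slicing with literal row numbers, which go stale whenever an earlier cell changes how many rows survive.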


In [15]:
#X_total.to_csv(r'C:\Users\sshir\TEST.csv', index=False, header=True)

In [16]:
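# Split the encoded data back by position: the first len(Y_train) rows are the training rows
n_train=len(Y_train)
X_train=X_total.iloc[:n_train]
X_test=X_total.iloc[n_train:]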

X_train.head()

Unnamed: 0,Residence_PUMA,Supervision_Risk_Score_First,Gender_M,Race_BLACK,Race_WHITE,Age_at_Release_18-22,Age_at_Release_23-27,Age_at_Release_28-32,Age_at_Release_33-37,Age_at_Release_38-42,...,Prior_Revocations_Parole_False,Prior_Revocations_Parole_True,Prior_Revocations_Probation_False,Prior_Revocations_Probation_True,Condition_MH_SA_False,Condition_MH_SA_True,Condition_Cog_Ed_False,Condition_Cog_Ed_True,Condition_Other_False,Condition_Other_True
0,16,3.0,1,1,0,0,0,0,0,0,...,1,0,1,0,0,1,0,1,1,0
1,16,6.0,1,1,0,0,0,0,1,0,...,1,0,1,0,1,0,1,0,1,0
2,24,7.0,1,1,0,0,0,0,0,0,...,1,0,1,0,0,1,0,1,1,0
3,16,7.0,1,0,1,0,0,0,0,1,...,1,0,0,1,0,1,0,1,1,0
4,16,4.0,1,0,1,0,0,0,1,0,...,1,0,1,0,0,1,0,1,0,1


In [17]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
# The labels and the training rows must have the same length, so split X_total at len(Y_train)
Y_train=data_train[data_train.columns[-3]]*1
X_train=X_total.iloc[:len(Y_train)]
X_test=X_total.iloc[len(Y_train):]
# Standardize the features, then fit an SVM classifier
clf = make_pipeline(StandardScaler(), SVC(gamma='auto'))
clf.fit(X_train, Y_train)
# Predict the year-1 labels for the test rows
print(clf.predict(X_test))

In [18]:
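from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
# Make sure the training rows line up with the labels
X_train=X_total.iloc[:len(Y_train)]
# The test data has no labels, so hold out part of the training data for validation
X_fit, X_val, Y_fit, Y_val = train_test_split(
    X_train, Y_train, test_size=0.3, random_state=0, stratify=Y_train)
# Train an SVM classifier for each C and print every new best validation accuracy
C_range = [0.001,0.01,0.1,1.,10.] 
best_acc=0
for x in C_range:
    clf = SVC(C=x)
    clf.fit(X_fit, Y_fit)
    preds =  clf.predict(X_val)
    acc = accuracy_score(Y_val, preds)
    if acc > best_acc:
        print('New best accuracy {:.2f}'.format(acc))
        best_acc = acc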

In [47]:
### Step 4: Train a Random Forest Classifier and predict the labels of the test data
from sklearn.ensemble import RandomForestClassifier

# Use the one-hot encoded rows: fitting on raw string categories such as '10 or more' raises a ValueError
X_train=X_total.iloc[:len(Y_train)]
X_test=X_total.iloc[len(Y_train):]
RFC = RandomForestClassifier(n_estimators=200, random_state=0)
RFC.fit(X_train, Y_train)
y_pred = RFC.predict(X_test)
# The test data has no labels, so accuracy and the classification report cannot be computed on it
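
Since the stated project goal is a probability rather than a hard label, `predict_proba` may be a better fit than `predict`. The following is a minimal sketch on synthetic data, not on the Challenge data: the features, the outcome, and the 70/30 hold-out split are all made up for illustration. The Brier score is used here as one common way to score probability forecasts.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 400 rows, 5 numeric features.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(400, 5))
# Binary outcome loosely driven by the first feature plus noise.
y_toy = (X_toy[:, 0] + rng.normal(scale=1.0, size=400) > 0.5).astype(int)

# Hold out 30% of the labeled rows to score the forecasts.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy)

rfc = RandomForestClassifier(n_estimators=200, random_state=0)
rfc.fit(X_fit, y_fit)

# Column 1 of predict_proba is the estimated probability of class 1.
proba = rfc.predict_proba(X_val)[:, 1]
brier = brier_score_loss(y_val, proba)
print("Brier score: {:.3f}".format(brier))
```

Lower Brier scores indicate better-calibrated, sharper forecasts; a constant forecast equal to the base rate gives a useful baseline to compare against.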