# Create Train/Test Matrices

Taking the bulk recidivism dataset with its 3-year recidivism flag, I will construct train/test matrices in the format Triage expects. The full data table is 1.7 GB on disk in pickle format, so I will subsample the rows within each time window to keep modelling time down.

In [3]:
import pandas as pd
import numpy as np

In [4]:
target = "RECIDIVATED_3_YR"
offset = 3

In [5]:
recid_data = pd.read_pickle('../../../../nc_recidivism/data/preprocessed/final_recid_data.pkl')

In [6]:
recid_data.head()

Unnamed: 0,INMATE_DOC_NUMBER,INMATE_COMMITMENT_PREFIX,SENTENCE_END,PAROLE_DAYS,AGE_AT_RELEASE,NUMBER_OF_COUNTS,COUNTY_ALAMANCE,COUNTY_BEAUFORT,COUNTY_BRUNSWICK,COUNTY_BUNCOMBE,...,INMATE_RACE_CODE_BLACK,INMATE_RACE_CODE_INDIAN,INMATE_RACE_CODE_OTHER,INMATE_RACE_CODE_UNKNOWN,INMATE_RACE_CODE_WHITE,INMATE_RACE_CODE_nan,PREVIOUS_COMMITMENTS,RECIDIVATED_1_YR,RECIDIVATED_2_YR,RECIDIVATED_3_YR
0,4,AA,1984-07-11,0.0,22,2,0.0,0.0,0.0,0.0,...,0,0,0,0,1,0,1.0,0,0,0
1,6,AA,1973-03-28,0.0,21,1,0.0,0.0,0.0,0.0,...,0,0,0,0,1,0,1.0,1,1,1
2,6,AB,1975-08-18,0.0,24,27,0.0,0.0,0.0,0.0,...,0,0,0,0,1,0,2.0,0,0,0
3,8,AA,1990-05-17,0.0,26,1,0.0,0.0,0.0,0.0,...,0,0,0,0,1,0,1.0,0,0,0
4,8,AB,1994-01-26,0.0,30,1,0.0,0.0,0.0,0.0,...,0,0,0,0,1,0,2.0,1,1,1


First, I need to find out which years have full data, because the supplied data becomes sparser further back. Coverage appears to begin in earnest in 1973.

In [7]:
recid_data['SENTENCE_END_YEAR'] = recid_data.SENTENCE_END.dt.year

In [8]:
recid_data.groupby('SENTENCE_END_YEAR').agg({target:["count", 'mean']})[15:45]

SENTENCE_END_YEAR,RECIDIVATED_3_YR count,RECIDIVATED_3_YR mean
1970,4,0.0
1971,11,0.0
1972,5,0.0
1973,8377,0.26907
1974,8622,0.263512
1975,8370,0.263919
1976,10911,0.246082
1977,10668,0.228534
1978,9068,0.235002
1979,8927,0.26403
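
The same cutoff can be found programmatically rather than by eye. The sketch below assumes a hypothetical helper `first_full_year` and a hypothetical `min_rows` threshold, neither of which is part of the original notebook, and runs on toy data rather than the real table.

```python
import pandas as pd

def first_full_year(df, date_col="SENTENCE_END", min_rows=1000):
    """Return the earliest year whose row count reaches min_rows (hypothetical helper)."""
    # Count rows per calendar year, ordered by year
    counts = df[date_col].dt.year.value_counts().sort_index()
    full = counts[counts >= min_rows]
    if full.empty:
        raise ValueError("no year reaches min_rows")
    return int(full.index[0])

# Toy data, not the real dataset: 3 releases in 1972, 5 in 1973, 6 in 1974
toy = pd.DataFrame({
    "SENTENCE_END": pd.to_datetime(
        ["1972-06-01"] * 3 + ["1973-03-28"] * 5 + ["1974-01-15"] * 6
    )
})
start_year = first_full_year(toy, min_rows=5)
```

On the real data, a threshold anywhere between the handful of rows seen before 1973 and the several thousand seen from 1973 onward would presumably pick out the same year.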


### Getting the Dataset in the Proper Format

In [9]:
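# Feature columns: everything except identifiers, dates, and the recidivism labels
# (note that columns.difference returns the remaining names in sorted order)
features = recid_data[recid_data.columns.difference(['INMATE_DOC_NUMBER', 'INMATE_COMMITMENT_PREFIX', 'RECIDIVATED_1_YR', 
                                                     'RECIDIVATED_2_YR', 'RECIDIVATED_3_YR', 'SENTENCE_START', 
                                                     'SENTENCE_END', 'SENTENCE_END_YEAR'])]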

In [10]:
# Layout: id and date first, then the features, then the label as the last column
proper_format = pd.concat([recid_data[['INMATE_DOC_NUMBER','SENTENCE_END']], features, recid_data[target]], axis=1)

In [11]:
proper_format.sort_values(by='SENTENCE_END', inplace=True)

In [12]:
proper_format.rename(columns={'INMATE_DOC_NUMBER': 'entity_id', 'SENTENCE_END':"as_of_date"}, inplace=True)
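
The cells in this section produce a table with `entity_id` and `as_of_date` first, the features in between, and the label last, sorted by date. A small check like the one below can confirm that layout before any matrices are written. The helper name `count_feature_columns` is hypothetical, and the toy frame borrows a few values from the `head()` output with only two feature columns.

```python
import pandas as pd

def count_feature_columns(df, label):
    """Validate the id/date/features/label layout and return the feature count."""
    cols = list(df.columns)
    if cols[:2] != ["entity_id", "as_of_date"]:
        raise ValueError("expected entity_id and as_of_date as the first two columns")
    if cols[-1] != label:
        raise ValueError("expected the label as the last column")
    if not df["as_of_date"].is_monotonic_increasing:
        raise ValueError("rows are not sorted by as_of_date")
    # Everything except the id, the date, and the label is a feature
    return len(cols) - 3

toy = pd.DataFrame({
    "entity_id": [6, 6, 4],
    "as_of_date": pd.to_datetime(["1973-03-28", "1975-08-18", "1984-07-11"]),
    "AGE_AT_RELEASE": [21, 24, 22],
    "NUMBER_OF_COUNTS": [1, 27, 2],
    "RECIDIVATED_3_YR": [1, 0, 0],
})
n_features = count_feature_columns(toy, "RECIDIVATED_3_YR")
```

The returned count excludes the id, date, and label columns, so it will be three lower than `shape[1]` of the same frame.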

### Breaking into Matrices

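I will use 1980 through 1995 as the testing timeframe, with one test matrix per calendar year. Training data always starts in 1975 and ends `offset` (3) years before the test year begins, which means the 3-year follow-up window behind every training label has closed before the test period starts. Each train and test set is a 33% random sample of the rows in its window.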
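
To make the date arithmetic concrete, here is a minimal sketch of the window boundaries for each matrix index. The function name `split_windows` and its parameter names are illustrative assumptions; the dates themselves follow the description above.

```python
import pandas as pd

def split_windows(i, first_test_year=1980, train_start="1975-01-01", label_years=3):
    """Return (train_start, train_end, test_start, test_end) for matrix index i."""
    test_start = pd.Timestamp(year=first_test_year, month=1, day=1) + pd.DateOffset(years=i)
    test_end = test_start + pd.DateOffset(years=1)
    # Stop training label_years before the test window opens
    train_end = test_start - pd.DateOffset(years=label_years)
    return pd.Timestamp(train_start), train_end, test_start, test_end

# One tuple of boundaries per test year, 1980 through 1995
windows = [split_windows(i) for i in range(16)]
```

All windows are half-open: a row belongs to a window when its date is on or after the start and strictly before the end.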

In [13]:
matrix_keys = range(16)

In [14]:
seed = 12412

train_matrix_uuids = []
train_end_times = []
test_matrix_uuids = []
evaluation_start_times = []
evaluation_end_times = []
model_configs = []
num_features = []


# create and save matrices
for i in matrix_keys:
    test_start_date = pd.to_datetime('1980-01-01') + pd.DateOffset(years = i)
    test_end_date = test_start_date + pd.DateOffset(years = 1)
    
    # Training always starts in 1975 and stops `offset` years before the test window opens
    train_start_date = pd.to_datetime('1975-01-01')
    train_end_date = test_start_date - pd.DateOffset(years = offset)
    
    # Keep a reproducible 33% sample of the rows in the training window
    train_df = proper_format[(proper_format.as_of_date >= train_start_date) &
                             (proper_format.as_of_date < train_end_date)].sample(frac=0.33, random_state=seed)
    
    test_df = proper_format[(proper_format.as_of_date >= test_start_date) &
                            (proper_format.as_of_date < test_end_date)].sample(frac=0.33, random_state=seed)
    
    train_uuid = 'train_{}_{}'.format(i, target)
    test_uuid = 'test_{}_{}'.format(i, target)
    
    
    train_matrix_uuids.append(train_uuid)
    train_end_times.append(train_end_date)
    test_matrix_uuids.append(test_uuid)
    evaluation_start_times.append(test_start_date)
    evaluation_end_times.append(test_end_date)
    model_configs.append("")
    # shape[1] counts every column, including entity_id, as_of_date, and the label
    num_features.append(train_df.shape[1])
    
    train_df.to_csv('../train_matrices/' + train_uuid + '.csv', index=False)
    test_df.to_csv('../test_matrices/'  + test_uuid  + '.csv', index=False)
    
    print('{} - {} \n\t Train Size: {} \n\t Test Size: {} \n\t Recid Rate: {} \n'.format(test_uuid, 
                                                                test_start_date, train_df.shape[0], 
                                                                test_df.shape[0], test_df[target].mean()))

test_0_RECIDIVATED_3_YR - 1980-01-01 00:00:00 
	 Train Size: 6363 
	 Test Size: 3183 
	 Recid Rate: 0.27898209236569277 

test_1_RECIDIVATED_3_YR - 1981-01-01 00:00:00 
	 Train Size: 9883 
	 Test Size: 3612 
	 Recid Rate: 0.30841638981173863 

test_2_RECIDIVATED_3_YR - 1982-01-01 00:00:00 
	 Train Size: 12876 
	 Test Size: 4025 
	 Recid Rate: 0.3063354037267081 

test_3_RECIDIVATED_3_YR - 1983-01-01 00:00:00 
	 Train Size: 15822 
	 Test Size: 5162 
	 Recid Rate: 0.2960092987214258 

test_4_RECIDIVATED_3_YR - 1984-01-01 00:00:00 
	 Train Size: 19004 
	 Test Size: 4299 
	 Recid Rate: 0.32775063968364737 

test_5_RECIDIVATED_3_YR - 1985-01-01 00:00:00 
	 Train Size: 22616 
	 Test Size: 4503 
	 Recid Rate: 0.3217854763491006 

test_6_RECIDIVATED_3_YR - 1986-01-01 00:00:00 
	 Train Size: 26641 
	 Test Size: 4539 
	 Recid Rate: 0.32077550121172066 

test_7_RECIDIVATED_3_YR - 1987-01-01 00:00:00 
	 Train Size: 31803 
	 Test Size: 4449 
	 Recid Rate: 0.33086086761069905 

test_8_RECIDIVATED_3_

In [15]:
# Create the paired_matrices_raw info file (saved below without a header row)
paired_matrices_raw = pd.DataFrame(np.column_stack([train_matrix_uuids, train_end_times, 
                                                    test_matrix_uuids, evaluation_start_times,
                                                    evaluation_end_times, model_configs, num_features]),
                                   columns=['train_matrix', 'train_end_time', 'test_matrix',
                                            'evaluation_start_time', 'evaluation_end_time',
                                            'model_config', 'num_features'])

paired_matrices_raw.to_csv('../paired_matrices_raw.csv', index=False, header=False)

In [16]:
# Write the training and testing matrix names to text files, one name per line
with open('../trainings.txt', 'w') as f:
    for item in train_matrix_uuids:
        f.write("%s\n" % item)
        
with open('../testings.txt', 'w') as f:
    for item in test_matrix_uuids:
        f.write("%s\n" % item)


In [17]:
paired_matrices_raw

Unnamed: 0,train_matrix,train_end_time,test_matrix,evaluation_start_time,evaluation_end_time,model_config,num_features
0,train_0_RECIDIVATED_3_YR,1977-01-01 00:00:00,test_0_RECIDIVATED_3_YR,1980-01-01 00:00:00,1981-01-01 00:00:00,,292
1,train_1_RECIDIVATED_3_YR,1978-01-01 00:00:00,test_1_RECIDIVATED_3_YR,1981-01-01 00:00:00,1982-01-01 00:00:00,,292
2,train_2_RECIDIVATED_3_YR,1979-01-01 00:00:00,test_2_RECIDIVATED_3_YR,1982-01-01 00:00:00,1983-01-01 00:00:00,,292
3,train_3_RECIDIVATED_3_YR,1980-01-01 00:00:00,test_3_RECIDIVATED_3_YR,1983-01-01 00:00:00,1984-01-01 00:00:00,,292
4,train_4_RECIDIVATED_3_YR,1981-01-01 00:00:00,test_4_RECIDIVATED_3_YR,1984-01-01 00:00:00,1985-01-01 00:00:00,,292
5,train_5_RECIDIVATED_3_YR,1982-01-01 00:00:00,test_5_RECIDIVATED_3_YR,1985-01-01 00:00:00,1986-01-01 00:00:00,,292
6,train_6_RECIDIVATED_3_YR,1983-01-01 00:00:00,test_6_RECIDIVATED_3_YR,1986-01-01 00:00:00,1987-01-01 00:00:00,,292
7,train_7_RECIDIVATED_3_YR,1984-01-01 00:00:00,test_7_RECIDIVATED_3_YR,1987-01-01 00:00:00,1988-01-01 00:00:00,,292
8,train_8_RECIDIVATED_3_YR,1985-01-01 00:00:00,test_8_RECIDIVATED_3_YR,1988-01-01 00:00:00,1989-01-01 00:00:00,,292
9,train_9_RECIDIVATED_3_YR,1986-01-01 00:00:00,test_9_RECIDIVATED_3_YR,1989-01-01 00:00:00,1990-01-01 00:00:00,,292
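
As a sanity check on a table like this, each row's `train_end_time` should sit exactly three years before its `evaluation_start_time`, and each evaluation window should span one year. The sketch below assumes a hypothetical helper `pairs_are_consistent` and uses the first two rows of the output above as toy input.

```python
import pandas as pd

def pairs_are_consistent(pairs, label_years=3):
    """Check the train/test date gaps in a paired-matrices table (hypothetical helper)."""
    train_end = pd.to_datetime(pairs["train_end_time"])
    eval_start = pd.to_datetime(pairs["evaluation_start_time"])
    eval_end = pd.to_datetime(pairs["evaluation_end_time"])
    # Training must stop label_years before evaluation starts
    gap_ok = (train_end + pd.DateOffset(years=label_years)) == eval_start
    # Each evaluation window covers exactly one year
    width_ok = (eval_start + pd.DateOffset(years=1)) == eval_end
    return bool((gap_ok & width_ok).all())

# First two rows of the table above, kept as strings the way they appear there
pairs = pd.DataFrame({
    "train_end_time": ["1977-01-01 00:00:00", "1978-01-01 00:00:00"],
    "evaluation_start_time": ["1980-01-01 00:00:00", "1981-01-01 00:00:00"],
    "evaluation_end_time": ["1981-01-01 00:00:00", "1982-01-01 00:00:00"],
})
ok = pairs_are_consistent(pairs)
```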
