# Feature Engineering

- 1. [Introduction](#1.-Introduction)
- 2. [Imports](#2.-Imports)
- 3. [Loading the data](#3.-Loading-the-data)
- 4. [Partitioning Our Dataset](#4.-Partitioning-Our-Dataset)
- 5. [Defining Our Preprocessing Pipeline](#5.-Defining-Our-Preprocessing-Pipeline)
    - 5.1 Pipeline Definition
- 6. [Exporting Preprocessed Data](#6.-Exporting-Preprocessed-Data)

## 1. Introduction
This notebook develops the features that our later models will ingest. It covers imputation, scaling, one-hot encoding, transformation, outlier handling, and data splitting. Re-sampling the dataset to handle class imbalance is not done in this phase; it is deferred to the model evaluation phase.

## 2. Imports

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import os
import seaborn as sns

from category_encoders import OrdinalEncoder

from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.pipeline import Pipeline

from sklearndf.transformation import SimpleImputerDF, StandardScalerDF, FunctionTransformerDF, VarianceThresholdDF, OneHotEncoderDF

from src.data import load_dataset as ld
from src.features.icd9 import icd9_to_classification, is_diabetes_mellitus, icd9_to_category
from src.features.age import age_to_index

from src.features.transformer import PandasFeatureUnion, RowFilter, ColumnSelector, ColumnFilter, \
DiagnosisMapper, EncodeNaNCategoricalImputer, MostFrequentCategoricalImputer, CategoryCollapseThreshold, \
CategoricalHomogeneityThreshold, FeatureCombiner

sns.set()
pd.options.display.max_columns = 100

RANDOM_SEED = 0

## 3. Loading the data

In [3]:
df = ld.load_interim_pickle('00_diabetes.pkl')
df.head()

encounter_id,patient_nbr,race,gender,age,admission_type_id,discharge_disposition_id,admission_source_id,days_in_hospital,payer_code,medical_specialty,num_lab_procedures,num_procedures,num_medications,number_outpatient,number_emergency,number_inpatient,diag_1,diag_2,diag_3,number_diagnoses,max_glu_serum,A1Cresult,metformin,repaglinide,nateglinide,chlorpropamide,glimepiride,acetohexamide,glipizide,glyburide,tolbutamide,pioglitazone,rosiglitazone,acarbose,miglitol,troglitazone,tolazamide,examide,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,is_readmitted_early
12522,48330783,Caucasian,Female,[80-90),2,1,4,13,,Not Available,68,2,28,0,0,0,398.0,427,38,8,Not Available,Not Available,No,No,No,No,No,No,Steady,No,No,No,No,No,No,No,No,No,No,Steady,No,No,No,No,No,Ch,Yes,0
15738,63555939,Caucasian,Female,[90-100),3,3,4,12,,InternalMedicine,33,3,18,0,0,0,434.0,198,486,8,Not Available,Not Available,No,No,No,No,No,No,No,No,No,No,Steady,No,No,No,No,No,No,Steady,No,No,No,No,No,Ch,Yes,0
16680,42519267,Caucasian,Male,[40-50),1,1,7,1,,Not Available,51,0,8,0,0,0,197.0,157,250,5,Not Available,Not Available,No,No,No,No,No,No,Steady,No,No,No,No,No,No,No,No,No,No,Steady,No,No,No,No,No,Ch,Yes,0
28236,89869032,AfricanAmerican,Female,[40-50),1,1,7,9,,Not Available,47,2,17,0,0,0,250.7,403,996,9,Not Available,Not Available,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,Steady,No,No,No,No,No,No,Yes,0
35754,82637451,Caucasian,Male,[50-60),2,1,2,3,,Not Available,31,6,16,0,0,0,414.0,411,250,9,Not Available,Not Available,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,No,Steady,No,No,No,No,No,No,Yes,0


## 4. Partitioning Our Dataset
We must first partition the data into training and test sets. The split is stratified on the target so that both sets contain approximately the same proportion of positive and negative observations. We will fit our preprocessor on the training set only and then use it to transform both the training and test sets.

In [4]:
X = df.drop(columns=['is_readmitted_early'])
y = df.is_readmitted_early

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RANDOM_SEED, stratify=y)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((57214, 46), (14304, 46), (57214,), (14304,))

In [5]:
y_train.value_counts(normalize=True)

0    0.912015
1    0.087985
Name: is_readmitted_early, dtype: float64

In [6]:
y_test.value_counts(normalize=True)

0    0.911983
1    0.088017
Name: is_readmitted_early, dtype: float64
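
The two ideas in this section (a stratified split, and fitting on the training partition only) can be illustrated on synthetic data. This is a minimal sketch, not the notebook's pipeline: `X_toy`, `y_toy`, and the roughly 9% positive rate are made up for illustration, and `StandardScaler` stands in for the full preprocessor.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): 1000 rows, roughly 9% positives
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))
y_toy = (rng.random(1000) < 0.09).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

# Stratification keeps the positive rate nearly identical in both partitions
train_rate, test_rate = y_tr.mean(), y_te.mean()

# Fit on the training partition only, then apply the same transform to both
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)
```

Because the scaler's statistics come from the training partition alone, the test partition is centered only approximately, which is the intended behavior when avoiding leakage.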

## 5. Defining Our Preprocessing Pipeline
We will be building two versions of the preprocessed data: one that includes one-hot encoding and one that doesn't. Tree-based algorithms don't require that categorical variables be one-hot encoded, so we will use the latter to speed up training for those kinds of models.

In [7]:
# Columns that we found in EDA to be homogeneous or to have a high proportion of NaN values
columns_to_drop = [
    'payer_code',
    'examide',
    'citoglipton',
    'glimepiride-pioglitazone'
]

categorical_features = X.select_dtypes(exclude=[np.number]).drop(columns=columns_to_drop)
nominal_features = categorical_features.drop(columns=['age']).columns.tolist()
ordinal_features = ['age']
numerical_features = X.select_dtypes(include=[np.number]).columns.tolist()
diagnosis_features = ['diag_1', 'diag_2', 'diag_3']
target_feature = ['is_readmitted_early']


# Features where we want null/nan values to be encoded as their own category
encode_nan_as_category_features = ['diag_1', 'diag_2', 'diag_3', 'gender', 'max_glu_serum', 'medical_specialty', 'A1Cresult']

# We created custom categories for the following columns to indicate unavailability
encode_nan_as_category_special_cases = {
    'admission_type_id': 5,
    'discharge_disposition_id': 25,
    'admission_source_id': 15,
    'race': 'Unknown/Invalid'
}

# For the remaining nominal features, we will just use the most frequent category in our training set
most_frequent_category_features = list(set(nominal_features) - set(encode_nan_as_category_features))

# When merging small categories into one category, we have label specifications for the following columns
merge_categories_special_cases = {
    'discharge_disposition_id': 30,
    'admission_type_id': 9,
    'admission_source_id': 27  
}

row_filters = {
    # Patients who expired or were discharged to hospice are not likely to be readmitted. Neonatal discharges are also removed
    'discharge_disposition_id': lambda s: ~s.isin([11, 13, 14, 19, 20, 21, 10]),
    # We remove observations related to birth or infancy
    'admission_source_id': lambda s: ~s.isin([11, 12, 13, 14])
}
 
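def compute_entropy(d1, d2, d3):
    """Diversity score over the three diagnosis categories.

    Sums p * log(p) once per position, so the result is 0 when all three
    match, about 0.907 when two match, and log(3) (about 1.099) when all differ.
    """
    diagnoses = [d1, d2, d3]
    num_diagnoses = len(diagnoses)
    prob_d1 = diagnoses.count(d1) / num_diagnoses
    prob_d2 = diagnoses.count(d2) / num_diagnoses
    prob_d3 = diagnoses.count(d3) / num_diagnoses
    return -(prob_d1 * np.log(prob_d1) + prob_d2 * np.log(prob_d2) + prob_d3 * np.log(prob_d3))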

def diagnosis_diversity(r):
    d1 = icd9_to_category(r.diag_1)
    d2 = icd9_to_category(r.diag_2)
    d3 = icd9_to_category(r.diag_3)
    return compute_entropy(d1, d2, d3)

combined_numerical_features = [
    (('number_inpatient', 'number_outpatient', 'number_emergency'), np.sum, 'service_utilization')
]

combined_nominal_features = [
    (('diag_1', 'diag_2', 'diag_3'), diagnosis_diversity, 'diagnosis_diversity')
]
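
As a quick check of what `diagnosis_diversity` can produce, the sketch below re-implements the same per-position sum under a hypothetical name (`positional_entropy`) and compares it with a textbook Shannon entropy over distinct categories (`shannon_entropy`, also hypothetical and not used by the notebook). The category strings are placeholders.

```python
import numpy as np

def positional_entropy(d1, d2, d3):
    # Mirrors compute_entropy: one p * log(p) term per diagnosis position
    diagnoses = [d1, d2, d3]
    probs = [diagnoses.count(d) / len(diagnoses) for d in diagnoses]
    return -sum(p * np.log(p) for p in probs)

def shannon_entropy(d1, d2, d3):
    # Textbook Shannon entropy: one term per distinct category (comparison only)
    diagnoses = [d1, d2, d3]
    probs = [diagnoses.count(d) / len(diagnoses) for d in set(diagnoses)]
    return -sum(p * np.log(p) for p in probs)

all_same = positional_entropy("Circulatory", "Circulatory", "Circulatory")
two_same = positional_entropy("Circulatory", "Circulatory", "Other")
all_diff = positional_entropy("Circulatory", "Respiratory", "Other")
two_same_shannon = shannon_entropy("Circulatory", "Circulatory", "Other")
```

The feature therefore takes only three distinct values (0, about 0.907, and about 1.099), which matches the `diagnosis_diversity` values visible in the transformed output. The per-position sum differs from textbook Shannon entropy only when two categories match (about 0.907 versus about 0.637); the ordering of the three cases is the same either way.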


### 5.1 Pipeline Definition
1. **ColumnFilter**: Drop unwanted columns identified during our prior exploratory data analysis. These columns are listed in `columns_to_drop`.
2. **RowFilter**: Drop observations that have invalid values. The filters are defined in `row_filters`.
3. **Numerical Features Sub-pipeline:**
    - **ColumnSelector**: Select only the numerical features (i.e. `numerical_features`).
    - **SimpleImputer**: Impute missing values with the median of each column.
    - **Log1pTransformer**: Log-transform all columns (`log(1 + x)`, via `FunctionTransformerDF(np.log1p)`) so that multiplicative effects become additive and the distributions are closer to normal.
    - **StandardScaler**: Standardize each column to zero mean and unit standard deviation.
    - **FeatureCombiner**: Create compound features composed of various existing features. These are defined in `combined_numerical_features`.
    - **VarianceThreshold**: Remove near-constant features whose variance is lower than a threshold. I chose 0.1 for this project.
4. **Ordinal Features Sub-pipeline:**
    - **ColumnSelector**: Select only the ordinal features (i.e. `ordinal_features`).
    - **SimpleImputer**: Impute missing values with the most frequently occurring value of each feature.
    - **OrdinalEncoder**: Map each category to an integer index.
5. **Nominal Features Sub-pipeline:**
    - **ColumnSelector**: Select only the nominal features (i.e. `nominal_features`).
    - **EncodeNaNCategoricalImputer**: For some features (`encode_nan_as_category_features`), we encode NaN as its own category because the fact that a value is missing may itself carry useful information.
    - **MostFrequentCategoricalImputer**: Impute missing values with the most frequently occurring value of each feature. This custom imputer only does this for a subset of features (`most_frequent_category_features`).
    - **FeatureCombiner**: Create compound features composed of various existing features. These are defined in `combined_nominal_features`. We mainly use this to create `diagnosis_diversity`, which measures the heterogeneity of the primary, secondary, and tertiary diagnoses reported.
    - **DiagnosisMapper**: Convert ICD9 codes into two tiers of classifications (t1, t2) that differ in specificity. Judging by the transformed output, t2 is the more specific tier (for example, a t1 value of `Endocrine` can correspond to a t2 value of `Diabetes`).
    - **CategoricalHomogeneityThreshold**: Remove features that are too homogeneous, as measured by Shannon entropy. Features with an entropy below 0.05 are dropped.
    - **CategoryCollapseThreshold**: Collapse feature values that occur less often than a given fraction of the time into a single value. I set this threshold to 5%.
    - **OneHotEncoder**: One-hot encode all features (optional). In this notebook it is applied after the pipeline with `pd.get_dummies`.
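
The category-collapsing step can be sketched in plain pandas. This is not the implementation of `CategoryCollapseThreshold` (that class lives in `src.features.transformer` and is not shown here); the function name `collapse_rare_categories`, the `"Other"` label, and the toy series are assumptions for illustration only.

```python
import pandas as pd

def collapse_rare_categories(train_col, threshold=0.05, other_label="Other"):
    """Learn frequent categories from training data; map everything else to other_label."""
    freqs = train_col.value_counts(normalize=True)
    keep = set(freqs[freqs >= threshold].index)

    def transform(col):
        # Values that are rare or unseen in the training data become other_label
        return col.where(col.isin(keep), other_label)

    return transform

# Toy data (hypothetical): "C" and "D" each occur in under 5% of training rows
train = pd.Series(["A"] * 60 + ["B"] * 37 + ["C"] * 2 + ["D"] * 1)
test = pd.Series(["A", "C", "E", "B"])

collapse = collapse_rare_categories(train)
collapsed_train = collapse(train)
collapsed_test = collapse(test)
```

Learning the set of frequent categories from the training data and reusing it for the test data also handles categories that appear only at test time, such as `"E"` above.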

In [8]:
def create_pipeline():
    nominal_transformers = [
        ('column_selector', ColumnSelector(nominal_features)),
        ('encode_nan_as_category_imputer', EncodeNaNCategoricalImputer(encode_nan_as_category_features, special_cases=encode_nan_as_category_special_cases)),
        ('most_frequent_imputer', MostFrequentCategoricalImputer(most_frequent_category_features)),
        ('feature_combiner', FeatureCombiner(combined_nominal_features)),
        ('diagnosis_mapper', DiagnosisMapper(diagnosis_features)),
        ('homogeneity_threshold', CategoricalHomogeneityThreshold(threshold=0.05, verbose=True)),
        ('category_collapse_threshold', CategoryCollapseThreshold(threshold=0.05, special_cases=merge_categories_special_cases, verbose=True))
    ]
    
    return Pipeline([
        ('column_filter', ColumnFilter(columns_to_drop)),
        ('row_filter', RowFilter(row_filters)),
        ('features', PandasFeatureUnion([
            ('numerical', Pipeline([
                ('column_selector', ColumnSelector(numerical_features)),
                ('simple_imputer', SimpleImputerDF(strategy='median')),
                ('log1p_transformer', FunctionTransformerDF(np.log1p)),
                ('standard_scaler', StandardScalerDF()),
                ('feature_combiner', FeatureCombiner(combined_numerical_features)),
                ('variance_threshold', VarianceThresholdDF(0.1))
            ])),
            ('ordinal', Pipeline([
                ('column_selector', ColumnSelector(ordinal_features)),
                ('simple_imputer', SimpleImputerDF(strategy='most_frequent')),
                ('ordinal_encoder', OrdinalEncoder(return_df=True))
            ])),
            ('nominal', Pipeline(nominal_transformers))
        ]))
    ])


preprocessor = create_pipeline()
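
For readers without the project's custom transformers, the numerical branch can be approximated with plain scikit-learn. This is a rough analogue under stated assumptions: it omits `ColumnSelector` and `FeatureCombiner`, returns a NumPy array rather than a DataFrame, and runs on a made-up `toy` frame.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Toy frame (hypothetical values), including one missing value and one constant column
toy = pd.DataFrame({
    "num_medications": [8.0, 17.0, np.nan, 28.0, 16.0],
    "number_emergency": [0.0, 0.0, 1.0, 0.0, 3.0],
    "constant_column": [2.0, 2.0, 2.0, 2.0, 2.0],
})

numerical = Pipeline([
    ("simple_imputer", SimpleImputer(strategy="median")),
    ("log1p_transformer", FunctionTransformer(np.log1p)),
    ("standard_scaler", StandardScaler()),
    ("variance_threshold", VarianceThreshold(threshold=0.1)),
])

result = numerical.fit_transform(toy)
```

One consequence of this ordering is worth noting: after standardization every non-constant column has unit variance, so in this analogue a 0.1 threshold only removes constant columns.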

In [9]:
_ = preprocessor.fit(X_train)

[CategoricalHomogeneityThreshold] Column: nateglinide, Entropy: 0.042513186153034996. Dropping column
[CategoricalHomogeneityThreshold] Column: chlorpropamide, Entropy: 0.008258907529230738. Dropping column
[CategoricalHomogeneityThreshold] Column: acetohexamide, Entropy: -0.0. Dropping column
[CategoricalHomogeneityThreshold] Column: tolbutamide, Entropy: 0.0023252227478738605. Dropping column
[CategoricalHomogeneityThreshold] Column: acarbose, Entropy: 0.020277208580227035. Dropping column
[CategoricalHomogeneityThreshold] Column: miglitol, Entropy: 0.0026028569966102732. Dropping column
[CategoricalHomogeneityThreshold] Column: troglitazone, Entropy: 0.00021325502467218777. Dropping column
[CategoricalHomogeneityThreshold] Column: tolazamide, Entropy: 0.0036159107434984078. Dropping column
[CategoricalHomogeneityThreshold] Column: glyburide-metformin, Entropy: 0.043473537232382196. Dropping column
[CategoricalHomogeneityThreshold] Column: glipizide-metformin, Entropy: 0.001087389871
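
To give a sense of why these columns fall under the 0.05 cutoff, the sketch below computes the Shannon entropy of a categorical column. It is an assumption that `CategoricalHomogeneityThreshold` uses the natural logarithm (the source does not state the base), and `categorical_entropy` and the toy series are hypothetical.

```python
import numpy as np
import pandas as pd

def categorical_entropy(col):
    """Shannon entropy (natural log) of a column's value distribution."""
    probs = col.value_counts(normalize=True).to_numpy()
    return float(-(probs * np.log(probs)).sum())

# A column where 999 of 1000 rows share one value, and a perfectly balanced one
nearly_constant = pd.Series(["No"] * 999 + ["Steady"])
balanced = pd.Series(["No"] * 500 + ["Steady"] * 500)

low = categorical_entropy(nearly_constant)
high = categorical_entropy(balanced)
```

A column dominated by a single value scores close to zero and would be dropped at a 0.05 threshold, while a balanced two-category column scores `log(2)`.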

In [10]:
X_train_enc = preprocessor.transform(X_train)
X_train_one_hot_enc = pd.get_dummies(X_train_enc, drop_first=True)

In [11]:
X_test_enc = preprocessor.transform(X_test)
X_test_enc

encounter_id,patient_nbr,days_in_hospital,num_lab_procedures,num_procedures,num_medications,number_outpatient,number_emergency,number_inpatient,number_diagnoses,service_utilization,age,race,gender,admission_type_id,discharge_disposition_id,admission_source_id,medical_specialty,max_glu_serum,A1Cresult,metformin,repaglinide,glimepiride,glipizide,glyburide,pioglitazone,rosiglitazone,insulin,change,diabetesMed,diagnosis_diversity,diag_1_t1,diag_1_t2,diag_2_t1,diag_2_t2,diag_3_t1,diag_3_t2
103798830,107100180,-1.550549,-0.190032,-0.984288,-0.779941,-0.352189,-0.263139,-0.340827,-0.427040,-0.956154,3,Caucasian,Female,2,1,1,InternalMedicine,Not Available,Not Available,Steady,No,No,No,No,No,No,Up,Ch,Yes,0.906824,Circulatory,Circulatory,Other,Other,Other,Other
79821444,442683,0.166681,0.385846,1.708494,0.273885,-0.352189,-0.263139,-0.340827,0.802368,-0.956154,1,Caucasian,Male,2,1,1,InternalMedicine,Not Available,Not Available,No,No,No,No,No,No,No,Steady,No,Yes,0.906824,Circulatory,Circulatory,Endocrine,Diabetes,Circulatory,Circulatory
293632478,93097674,-1.550549,-0.001945,0.057423,-2.156531,-0.352189,-0.263139,2.000504,0.802368,1.385177,1,Caucasian,Female,3,1,1,Not Available,Not Available,Not Available,No,No,No,No,No,No,No,No,No,No,1.098612,Musculoskeletal,Musculoskeletal,Other,Other,Endocrine,Diabetes
37967076,15228135,0.508371,0.661379,0.057423,1.334572,-0.352189,-0.263139,-0.340827,0.439205,-0.956154,5,AfricanAmerican,Female,3,1,1,Other,Not Available,Other,Steady,No,No,No,No,No,No,Steady,Ch,Yes,-0.000000,Other,Other,Other,Other,Other,Other
77890116,88434351,-0.251515,0.681833,0.057423,-2.599693,-0.352189,2.759990,3.370095,-1.586812,5.777895,3,Caucasian,Male,1,1,7,Not Available,Not Available,Not Available,Steady,No,No,No,No,No,No,No,No,Yes,-0.000000,Other,Other,Other,Other,Other,Other
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
132221820,37045449,1.268258,-3.637553,-0.984288,0.153485,-0.352189,-0.263139,-0.340827,0.802368,-0.956154,4,Caucasian,Male,3,3,27,Not Available,Not Available,Not Available,Steady,No,No,No,No,No,No,No,No,Yes,0.906824,Circulatory,Circulatory,Other,Other,Circulatory,Circulatory
2736744,42470721,0.166681,0.460591,-0.984288,0.494779,-0.352189,-0.263139,-0.340827,0.033223,-0.956154,4,Caucasian,Female,1,6,7,InternalMedicine,Not Available,Not Available,No,No,No,No,No,No,No,Up,Ch,Yes,1.098612,Respiratory,Respiratory,Circulatory,Circulatory,Other,Other
35358396,91801431,1.957415,0.834484,0.057423,1.591187,-0.352189,-0.263139,-0.340827,0.802368,-0.956154,4,Caucasian,Male,2,3,7,Not Available,Not Available,Not Available,No,No,No,No,No,No,No,Steady,No,Yes,0.906824,Injury/Poison,Injury,Circulatory,Circulatory,Circulatory,Circulatory
276079278,58301361,-0.790663,0.721767,-0.984288,-0.590656,-0.352189,-0.263139,-0.340827,-1.586812,-0.956154,5,AfricanAmerican,Male,1,1,7,Not Available,Not Available,Norm,Steady,No,No,No,No,No,No,Down,Ch,Yes,0.906824,Endocrine,Diabetes,Respiratory,Respiratory,Respiratory,Respiratory
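
If `RowFilter` drops rows during `transform` (an assumption about the custom class, which is not shown here), the transformed features will have fewer rows than the original labels. A minimal sketch of re-aligning labels through the shared index, using made-up data:

```python
import pandas as pd

# Toy features and labels sharing an encounter_id index (hypothetical values)
X_toy = pd.DataFrame(
    {"discharge_disposition_id": [1, 11, 3, 13, 6]},
    index=pd.Index([101, 102, 103, 104, 105], name="encounter_id"),
)
y_toy = pd.Series([0, 0, 1, 0, 1], index=X_toy.index, name="is_readmitted_early")

# Stand-in for a row filter: drop two of the excluded discharge dispositions
X_filtered = X_toy[~X_toy["discharge_disposition_id"].isin([11, 13])]

# Keep only the labels whose rows survived the filter
y_filtered = y_toy.loc[X_filtered.index]
```

The same pattern would apply to `y_train` and `y_test` whenever the transformed frames keep the original index.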


In [12]:
X_test_one_hot_enc = pd.get_dummies(X_test_enc, drop_first=True)
X_test_one_hot_enc

encounter_id,patient_nbr,days_in_hospital,num_lab_procedures,num_procedures,num_medications,number_outpatient,number_emergency,number_inpatient,number_diagnoses,service_utilization,age,race_Caucasian,race_Other,gender_Male,gender_Other,admission_type_id_2,admission_type_id_3,admission_type_id_5,admission_type_id_9,discharge_disposition_id_3,discharge_disposition_id_6,discharge_disposition_id_30,admission_source_id_7,admission_source_id_15,admission_source_id_27,medical_specialty_Emergency/Trauma,medical_specialty_Family/GeneralPractice,medical_specialty_InternalMedicine,medical_specialty_Not Available,medical_specialty_Other,max_glu_serum_Other,A1Cresult_Norm,A1Cresult_Not Available,A1Cresult_Other,metformin_Other,metformin_Steady,repaglinide_Other,glimepiride_Other,glipizide_Other,glipizide_Steady,glyburide_Other,glyburide_Steady,pioglitazone_Other,pioglitazone_Steady,rosiglitazone_Other,rosiglitazone_Steady,insulin_No,insulin_Up,insulin_Down,change_No,diabetesMed_Yes,diagnosis_diversity_0.9068242403669224,diagnosis_diversity_1.0986122886681096,diag_1_t1_Digestive,diag_1_t1_Endocrine,diag_1_t1_Injury/Poison,diag_1_t1_Musculoskeletal,diag_1_t1_Other,diag_1_t1_Respiratory,diag_1_t2_Diabetes,diag_1_t2_Digestive,diag_1_t2_Injury,diag_1_t2_Musculoskeletal,diag_1_t2_Other,diag_1_t2_Respiratory,diag_2_t1_Endocrine,diag_2_t1_Genitourinary,diag_2_t1_Other,diag_2_t1_Respiratory,diag_2_t2_Diabetes,diag_2_t2_Endocrine,diag_2_t2_Genitourinary,diag_2_t2_Other,diag_2_t2_Respiratory,diag_3_t1_Endocrine,diag_3_t1_Genitourinary,diag_3_t1_Other,diag_3_t1_Respiratory,diag_3_t2_Diabetes,diag_3_t2_Endocrine,diag_3_t2_Genitourinary,diag_3_t2_Other,diag_3_t2_Respiratory
103798830,107100180,-1.550549,-0.190032,-0.984288,-0.779941,-0.352189,-0.263139,-0.340827,-0.427040,-0.956154,3,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0
79821444,442683,0.166681,0.385846,1.708494,0.273885,-0.352189,-0.263139,-0.340827,0.802368,-0.956154,1,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
293632478,93097674,-1.550549,-0.001945,0.057423,-2.156531,-0.352189,-0.263139,2.000504,0.802368,1.385177,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0
37967076,15228135,0.508371,0.661379,0.057423,1.334572,-0.352189,-0.263139,-0.340827,0.439205,-0.956154,5,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0
77890116,88434351,-0.251515,0.681833,0.057423,-2.599693,-0.352189,2.759990,3.370095,-1.586812,5.777895,3,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
132221820,37045449,1.268258,-3.637553,-0.984288,0.153485,-0.352189,-0.263139,-0.340827,0.802368,-0.956154,4,1,0,1,0,0,1,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
2736744,42470721,0.166681,0.460591,-0.984288,0.494779,-0.352189,-0.263139,-0.340827,0.033223,-0.956154,4,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0
35358396,91801431,1.957415,0.834484,0.057423,1.591187,-0.352189,-0.263139,-0.340827,0.802368,-0.956154,4,1,0,1,0,1,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
276079278,58301361,-0.790663,0.721767,-0.984288,-0.590656,-0.352189,-0.263139,-0.340827,-1.586812,-0.956154,5,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,1
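
Because the training and test frames are passed to `pd.get_dummies` separately, their one-hot columns can differ if a category is missing from one partition. The sketch below shows one way to guard against that on made-up frames (`train_enc`, `test_enc` are hypothetical): encode the test frame without `drop_first`, then reindex it to the training columns.

```python
import pandas as pd

# Toy frames (hypothetical): the test partition contains only one race category
train_enc = pd.DataFrame({"race": ["Caucasian", "AfricanAmerican", "Other"], "age": [3, 1, 5]})
test_enc = pd.DataFrame({"race": ["Caucasian", "Caucasian"], "age": [4, 2]})

train_dummies = pd.get_dummies(train_enc, drop_first=True)

# Encoded independently with drop_first=True, the test frame loses its only race column
test_naive = pd.get_dummies(test_enc, drop_first=True)
mismatch = list(train_dummies.columns) != list(test_naive.columns)

# Encode every level, then keep exactly the training columns, filling absent ones with 0
test_aligned = pd.get_dummies(test_enc).reindex(columns=train_dummies.columns, fill_value=0)
```

Reindexing against the training columns drops the baseline level chosen during training and any level unseen in training, so both frames end up with the same schema. With partitions as large as the ones in this notebook the columns may well already agree, but comparing `X_train_one_hot_enc.columns` with `X_test_one_hot_enc.columns` is a cheap check.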


## 6. Exporting Preprocessed Data

In [13]:
X_train_enc.to_pickle(ld.find_preprocessed_path('X_train.pkl'))
X_test_enc.to_pickle(ld.find_preprocessed_path('X_test.pkl'))
X_train_one_hot_enc.to_pickle(ld.find_preprocessed_path('X_train_one_hot.pkl'))
X_test_one_hot_enc.to_pickle(ld.find_preprocessed_path('X_test_one_hot.pkl'))