# Data Preparation
**Author:** Robert Smith  
**Date:** 06-18-2020

This notebook builds on the first notebook and focuses on data preparation for machine learning. For the final deployed model, we'll use a scikit-learn pipeline. Ideally, the pipeline would take in raw records/observations and output a prediction. However, there are two main challenges with deploying a scikit-learn pipeline model on Google Cloud:

* Converting the three numerically coded features `cp`, `restecg`, and `thal` into their respective categorical equivalents
* Converting the numeric target feature into a binary value

The first of these can be implemented using a custom `FunctionTransformer` in the pipeline. However, using a custom transformer requires additional legwork to deploy it on Google Cloud. You can read more about this [here](https://cloud.google.com/ai-platform/prediction/docs/exporting-for-prediction#custom-pipeline-code).
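As a rough illustration of the first challenge, the sketch below wraps the code-to-label mappings in a `FunctionTransformer` and places it as the first pipeline step. The names `CATEGORY_MAPS` and `decode_categories` and the two-row sample are assumptions made for this example; the mappings themselves are copied from the `data_transformer` function used later in this notebook.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Code-to-label mappings, copied from this notebook's data_transformer function.
CATEGORY_MAPS = {
    "cp": {1: "typical angina", 2: "atypical angina",
           3: "non-anginal pain", 4: "asymptomatic"},
    "restecg": {0: "normal", 1: "wave abnormality",
                2: "ventricular hypertrophy"},
    "thal": {3: "normal", 6: "fixed defect", 7: "reversable defect"},
}


def decode_categories(X):
    """Return a copy of X with the coded columns replaced by their labels."""
    X = X.copy()
    for col, mapping in CATEGORY_MAPS.items():
        X[col] = X[col].replace(mapping)
    return X


# The decoder is the first step; encoding and model steps would follow it.
pipeline = Pipeline([("decode", FunctionTransformer(decode_categories))])

# Hypothetical two-row sample, used only to exercise the transformer.
sample = pd.DataFrame({"cp": [1, 4], "restecg": [0, 2], "thal": [3, 7]})
decoded = pipeline.fit_transform(sample)
print(decoded)
```

Because a pickled pipeline stores only a reference to `decode_categories`, the function has to be importable wherever the pipeline is loaded, which is the source of the extra deployment work.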

The second challenge isn't readily solvable in the scikit-learn API, because pipeline transformers operate on the features and leave the target untouched. It therefore requires a pre-processing step before the scikit-learn pipeline.

To get to the modeling stage right now, we'll do a little data pre-processing before feeding the observations into the scikit-learn pipeline. To do this, we'll use the `data_transformer` function from the previous script. Eventually, we will revisit this challenge and write a couple of custom Python modules that we can package together with a pickled model to create a custom prediction routine. This deployment strategy is the most flexible approach and is likely needed with real-world raw data. More information about custom prediction routines can be found [here](https://cloud.google.com/ai-platform/prediction/docs/custom-prediction-routines).
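To give a sense of where this is heading, here is a minimal sketch of what such a predictor module might look like. It assumes the interface described in the linked documentation: a class exposing `predict` and a `from_path` class method. The class name `HeartDiseasePredictor`, the file name `model.pkl`, and the constants `FEATURES` and `CATEGORY_MAPS` are hypothetical choices for this example, not the eventual implementation.

```python
import os
import pickle

import pandas as pd

# Feature order of one raw record (this notebook's column names minus the target).
FEATURES = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
            "thalach", "exang", "oldpeak", "slope", "ca", "thal"]

# Code-to-label mappings, copied from this notebook's data_transformer function.
CATEGORY_MAPS = {
    "cp": {1: "typical angina", 2: "atypical angina",
           3: "non-anginal pain", 4: "asymptomatic"},
    "restecg": {0: "normal", 1: "wave abnormality",
                2: "ventricular hypertrophy"},
    "thal": {3: "normal", 6: "fixed defect", 7: "reversable defect"},
}


class HeartDiseasePredictor:  # hypothetical class name
    def __init__(self, model):
        self._model = model

    def predict(self, instances, **kwargs):
        """Decode raw records, then delegate to the wrapped model."""
        frame = pd.DataFrame(instances, columns=FEATURES)
        for col, mapping in CATEGORY_MAPS.items():
            frame[col] = frame[col].replace(mapping)
        return self._model.predict(frame).tolist()

    @classmethod
    def from_path(cls, model_dir):
        # "model.pkl" is an assumed file name for the pickled pipeline.
        with open(os.path.join(model_dir, "model.pkl"), "rb") as f:
            return cls(pickle.load(f))
```

Keeping the decoding inside `predict` means the service can accept raw records, while the pickled pipeline only ever sees tidy data.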

## Import Tools of the Trade

In [1]:
import numpy as np
import pandas as pd

## Load Data

We've already downloaded the data set and saved it in the data folder. If you would like to download it directly, the data set can be found [here](https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data).

In [2]:
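# The file has no header row, so the column names are supplied explicitly
col_names = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
             "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]

# Missing values are recorded as "?" in the file; read them in as NaN
df = pd.read_csv("../data/processed.cleveland.data", names = col_names, na_values = "?")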
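To see what `na_values = "?"` does, the sketch below reads a tiny in-memory sample laid out like the data file. The two rows are made up for illustration and are not taken from the data set; `sample` and `sample_df` are names introduced only for this example.

```python
import io

import pandas as pd

col_names = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
             "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]

# Two made-up records in the file's layout; the second one is missing thal.
sample = io.StringIO(
    "60.0,1.0,1.0,140.0,230.0,1.0,2.0,150.0,0.0,2.0,3.0,0.0,6.0,0\n"
    "55.0,0.0,3.0,130.0,220.0,0.0,0.0,160.0,0.0,0.5,1.0,0.0,?,2\n"
)

sample_df = pd.read_csv(sample, names=col_names, na_values="?")

# The "?" is read as NaN, so thal stays numeric instead of becoming text.
print(sample_df["thal"])
print(sample_df.isna().sum())
```

Without `na_values`, a column containing `?` would be read as strings, and the integer keys in the mapping dictionaries would no longer match its values.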

## Transform Data

In [3]:
def data_transformer(df):
    """
    Accepts the raw heart disease dataframe and returns it with cp, restecg, and thal
    transformed into categorical features and target converted to a binary value.
    The input dataframe is modified in place.
    """
    
    cp_dict = {1: "typical angina",
               2: "atypical angina",
               3: "non-anginal pain", 
               4: "asymptomatic"}

    restecg_dict = {0: "normal", 
                    1: "wave abnormality", 
                    2: "ventricular hypertrophy"}

    thal_dict = {3 : "normal",
                 6 : "fixed defect",
                 7 : "reversable defect"}
    
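    # Assign the result back instead of calling replace(..., inplace = True) on a
    # column selection, which may act on a temporary copy in recent pandas versions
    df["cp"] = df["cp"].replace(cp_dict)
    df["restecg"] = df["restecg"].replace(restecg_dict)
    df["thal"] = df["thal"].replace(thal_dict)
    
    # Collapse the target to binary: 0 stays 0, any value above 0 becomes 1
    df["target"] = (df["target"] > 0).astype(int)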
    
    return df

In [4]:
df_tidy = data_transformer(df)

## Output Tidy DataFrame

In [5]:
df_tidy.to_csv("../data/02_df_tidy.csv", index = False)