## Goal
The goal of this assignment is to build a simple, modular, and extensible machine learning pipeline in Python. The pipeline will have functions that perform the following tasks:

1. Read/Load Data
2. Explore Data
3. Pre-Process and Clean Data
4. Generate Features/Predictors
5. Build Machine Learning Classifier
6. Evaluate Classifier

_____


## Data 
The data set below is a modified version of data from https://www.kaggle.com/c/GiveMeSomeCredit
- Data set
- Data Dictionary 
    
___

## Problem

The task is to predict **who will experience financial distress** in the next two years. The outcome variable (label) in the data is `SeriousDlqin2yrs`. The remaining columns hold other information about each person (as described in the data dictionary). Your assignment is to take this data and build a machine learning pipeline that trains *one* machine learning model on it.
The primary goal is to build a skeleton code pipeline that has the components described above.
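
As a rough illustration of where the pipeline is headed, the sketch below fits and scores a single classifier. It is not the assignment's solution: the helper name `build_and_evaluate`, the choice of scikit-learn's `LogisticRegression`, the accuracy metric, and the synthetic stand-in data are all assumptions made for this example. Only the label name `SeriousDlqin2yrs` and the two feature names come from the data set.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def build_and_evaluate(df, label="SeriousDlqin2yrs", test_size=0.25, seed=0):
    """Hypothetical helper: fit one classifier and return (model, test accuracy)."""
    X = df.drop(columns=[label])
    y = df[label]
    # Hold out part of the data so the score reflects unseen rows
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y
    )
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    return model, accuracy_score(y_test, model.predict(X_test))


# Synthetic stand-in data (NOT the real credit data), used only to make the sketch runnable
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "DebtRatio": rng.uniform(0, 1, n),
    "NumberOfTimes90DaysLate": rng.integers(0, 4, n),
})
toy["SeriousDlqin2yrs"] = (toy["NumberOfTimes90DaysLate"] >= 2).astype(int)

model, accuracy = build_and_evaluate(toy)
print("Test accuracy:", accuracy)
```

On the real data, accuracy alone is likely a weak metric if the label is imbalanced, so precision, recall, or AUC may be worth reporting as well.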

## 1. Read Data

In [1]:
def read_csv(file_path):
    """
    general description: Load a CSV file into a pandas dataframe and print its dimensions
    input: path to the CSV file
    print-output: success message and dataframe dimensions
    output: pandas dataframe
    """
    import ntpath
    import pandas as pd

    df = pd.read_csv(file_path)

    # ntpath.basename extracts the file name from both "/" and "\" separated paths
    file_name = ntpath.basename(file_path)
    print("**", file_name, "**", "has been loaded successfully!")
    print("_____________________________")
    print("")
    print("# of Rows:", df.shape[0])
    print("# of Columns:", df.shape[1])
    print("_____________________________")

    return df

In [2]:
df = read_csv('/Users/schzcas/Documents/github/machine-learning-public-policy/assignments/credit-data.csv')

** credit-data.csv ** has been loaded successfully!
_____________________________

# of Rows: 150000
# of Columns: 13
_____________________________


## 2. Explore Data

In [3]:
def explore_df(df):
    """
    general description: Print the column types of a dataframe and return its first rows
    input: dataframe
    output: 
            1. Column types (printed)
            2. First 2 rows of the dataframe (returned)
    """
    print("Column types:")
    print("_____________________________")
    print(df.dtypes)
    print("_____________________________")
    print("")

    # The function returns df.head(2), so the heading names the first rows
    print("First 2 rows:")
    print("_____________________________")

    return df.head(2)


In [4]:
explore_df(df)

Column types:
_____________________________
PersonID                                  int64
SeriousDlqin2yrs                          int64
RevolvingUtilizationOfUnsecuredLines    float64
age                                       int64
zipcode                                   int64
NumberOfTime30-59DaysPastDueNotWorse      int64
DebtRatio                               float64
MonthlyIncome                           float64
NumberOfOpenCreditLinesAndLoans           int64
NumberOfTimes90DaysLate                   int64
NumberRealEstateLoansOrLines              int64
NumberOfTime60-89DaysPastDueNotWorse      int64
NumberOfDependents                      float64
dtype: object
_____________________________

First 2 rows:
_____________________________


| | PersonID | SeriousDlqin2yrs | RevolvingUtilizationOfUnsecuredLines | age | zipcode | NumberOfTime30-59DaysPastDueNotWorse | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60-89DaysPastDueNotWorse | NumberOfDependents |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0.766127 | 45 | 60644 | 2 | 0.802982 | 9120.0 | 13 | 0 | 6 | 0 | 2.0 |
| 1 | 2 | 0 | 0.957151 | 40 | 60637 | 0 | 0.121876 | 2600.0 | 4 | 0 | 0 | 0 | 1.0 |
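
Column types and the first rows say little about distributions or missing values. A minimal sketch of a fuller exploration step follows, assuming a hypothetical helper `summarize_df` (not part of the original notebook) and a small toy frame in place of the real data. It relies only on the standard pandas calls `describe`, `isnull`, and `value_counts`.

```python
import numpy as np
import pandas as pd


def summarize_df(df, label="SeriousDlqin2yrs"):
    """Hypothetical helper: summary statistics, missing counts, and label shares."""
    return {
        "summary": df.describe(),                  # count, mean, std, quartiles per numeric column
        "missing": df.isnull().sum(),              # number of NaN values per column
        "label_share": df[label].value_counts(normalize=True),  # class balance
    }


# Toy stand-in data (not the real credit data)
toy = pd.DataFrame({
    "SeriousDlqin2yrs": [1, 0, 0, 0],
    "age": [45, 40, 38, 30],
    "MonthlyIncome": [9120.0, 2600.0, np.nan, 3300.0],
})

report = summarize_df(toy)
print(report["missing"])
print(report["label_share"])
```

Checking the missing counts before the pre-processing step shows which columns the fill strategy will actually touch.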


## 3. Pre-Process Data

In [5]:
def fill_na(df):
    """
    general description: Replace every missing value (NaN) in the dataframe with 0
    input: dataframe
    output: new dataframe with no NaN values

    Why handle NaN? NaN propagates: most arithmetic involving NaN also returns NaN.
    Dropping is the alternative to filling, but single values cannot be dropped from a
    DataFrame, only full rows or full columns. By default, dropna() drops every row that
    contains any null value; dropna(axis=1) drops every column that contains one.
    """
    # fillna returns a new dataframe; the input is left unchanged
    dataframe = df.fillna(0)
    return dataframe
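
The trade-off between dropping and filling described in the docstring can be seen on a small toy frame. The sketch below is illustrative only: the toy values are made up, and median filling is shown as one possible alternative to filling with 0, not as the notebook's chosen method.

```python
import numpy as np
import pandas as pd

# Toy stand-in data with one NaN in each of two columns
toy = pd.DataFrame({
    "MonthlyIncome": [9120.0, np.nan, 3042.0],
    "NumberOfDependents": [2.0, 1.0, np.nan],
    "age": [45, 40, 38],
})

rows_dropped = toy.dropna()               # keeps only rows without any NaN
cols_dropped = toy.dropna(axis=1)         # keeps only columns without any NaN
zero_filled = toy.fillna(0)               # what fill_na does
median_filled = toy.fillna(toy.median())  # alternative: per-column median

print(rows_dropped.shape)
print(cols_dropped.shape)
print(median_filled)
```

Filling `MonthlyIncome` with 0 treats an unknown income as no income, which may distort a model; a median fill keeps the value inside the observed range.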

In [7]:
data = fill_na(df)
data.head(7)

| | PersonID | SeriousDlqin2yrs | RevolvingUtilizationOfUnsecuredLines | age | zipcode | NumberOfTime30-59DaysPastDueNotWorse | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60-89DaysPastDueNotWorse | NumberOfDependents |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0.766127 | 45 | 60644 | 2 | 0.802982 | 9120.0 | 13 | 0 | 6 | 0 | 2.0 |
| 1 | 2 | 0 | 0.957151 | 40 | 60637 | 0 | 0.121876 | 2600.0 | 4 | 0 | 0 | 0 | 1.0 |
| 2 | 3 | 0 | 0.65818 | 38 | 60601 | 1 | 0.085113 | 3042.0 | 2 | 1 | 0 | 0 | 0.0 |
| 3 | 4 | 0 | 0.23381 | 30 | 60601 | 0 | 0.03605 | 3300.0 | 5 | 0 | 0 | 0 | 0.0 |
| 4 | 5 | 0 | 0.907239 | 49 | 60625 | 1 | 0.024926 | 63588.0 | 7 | 0 | 1 | 0 | 0.0 |
| 5 | 6 | 0 | 0.213179 | 74 | 60629 | 0 | 0.375607 | 3500.0 | 3 | 0 | 1 | 0 | 1.0 |
| 6 | 7 | 0 | 0.305682 | 57 | 60637 | 0 | 5710.0 | 0.0 | 8 | 0 | 3 | 0 | 0.0 |


## 4. Generate Features/Predictors

In [None]:
def discretize():
    """
    Discretize a continuous column
    """
    

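The `discretize` stub above has no body yet. One possible way to fill it in is sketched below, assuming pandas' `pd.cut` with explicit bin edges. The helper name `discretize_sketch`, its parameters, the `_bin` column suffix, and the age bands are all hypothetical choices for this example.

```python
import pandas as pd


def discretize_sketch(df, column, bins, labels=None):
    """Hypothetical helper: return a copy of df with a `<column>_bin` column added."""
    out = df.copy()
    # pd.cut assigns each value to an interval; intervals are right-closed by default
    out[column + "_bin"] = pd.cut(out[column], bins=bins, labels=labels)
    return out


# Toy stand-in data (ages taken from the preview rows)
toy = pd.DataFrame({"age": [30, 38, 45, 74]})
binned = discretize_sketch(
    toy, "age", bins=[0, 35, 60, 120], labels=["young", "middle", "senior"]
)
print(binned)
```

Returning a copy keeps the input frame untouched, which matches how `fill_na` behaves.
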
In [None]:
def to_dummy():
    """
    Take a categorical variable and create binary/dummy variables from it
    """
    
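
The `to_dummy` stub could be completed along the following lines, assuming pandas' `pd.get_dummies`. The helper name `to_dummy_sketch` and the use of `zipcode` as the categorical column are assumptions made for illustration.

```python
import pandas as pd


def to_dummy_sketch(df, column):
    """Hypothetical helper: replace `column` with one binary column per distinct value."""
    # get_dummies drops the original column and adds `<column>_<value>` indicators
    return pd.get_dummies(df, columns=[column], prefix=column)


# Toy stand-in data (zip codes taken from the preview rows)
toy = pd.DataFrame({
    "zipcode": [60644, 60637, 60601, 60637],
    "age": [45, 40, 38, 57],
})
dummies = to_dummy_sketch(toy, "zipcode")
print(dummies.columns.tolist())
```

Although `zipcode` is stored as `int64`, it is a category rather than a quantity, so indicator columns are usually a better encoding than the raw number.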