# Introduction: Home Credit Default Risk


# Data

The data is provided by [Home Credit](http://www.homecredit.net/about-us.aspx), a service dedicated to providing lines of credit (loans) to the unbanked population. Predicting whether or not a client will repay a loan or have difficulty is a critical business need, and Home Credit is hosting this competition on Kaggle to see what sort of models the machine learning community can develop to help them in this task.

There are 7 different sources of data:

* application_train/application_test: the main training and testing data with information about each loan application at Home Credit. Every loan has its own row and is identified by the feature `SK_ID_CURR`. The training application data comes with the `TARGET` indicating 0: the loan was repaid or 1: the loan was not repaid. 
* bureau: data concerning clients' previous credits from other financial institutions. Each previous credit has its own row in bureau, but one loan in the application data can have multiple previous credits.
* bureau_balance: monthly data about the previous credits in bureau. Each row is one month of a previous credit, and a single previous credit can have multiple rows, one for each month of the credit length. 
* previous_application: previous applications for loans at Home Credit of clients who have loans in the application data. Each current loan in the application data can have multiple previous loans. Each previous application has one row and is identified by the feature `SK_ID_PREV`. 
* POS_CASH_BALANCE: monthly data about previous point of sale or cash loans clients have had with Home Credit. Each row is one month of a previous point of sale or cash loan, and a single previous loan can have many rows.
* credit_card_balance: monthly data about previous credit cards clients have had with Home Credit. Each row is one month of a credit card balance, and a single credit card can have many rows.
* installments_payments: payment history for previous loans at Home Credit. There is one row for every made payment and one row for every missed payment.

This diagram shows how all of the data is related:

![image](https://storage.googleapis.com/kaggle-media/competitions/home-credit/home_credit.png)

Moreover, we are provided with the definitions of all the columns (in `HomeCredit_columns_description.csv`) and an example of the expected submission file. 
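
Because one loan in the application data can have many rows in a child table, a child table has to be collapsed to one row per `SK_ID_CURR` before it can be joined. The following is a minimal sketch with invented toy values; the column `AMT_CREDIT_SUM` and the aggregate names `bureau_count` and `bureau_credit_mean` are chosen for illustration only and are not prescribed by the competition.

```python
import pandas as pd

# Toy stand-ins for application_train and bureau (all values are invented)
app = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0]})
bureau = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "SK_ID_BUREAU": [10, 11, 12],
    "AMT_CREDIT_SUM": [1000.0, 500.0, 2000.0],
})

# Collapse bureau to one row per current loan
bureau_agg = (
    bureau.groupby("SK_ID_CURR")
    .agg(bureau_count=("SK_ID_BUREAU", "count"),
         bureau_credit_mean=("AMT_CREDIT_SUM", "mean"))
    .reset_index()
)

# A left join keeps every application, including those with no bureau history
merged = app.merge(bureau_agg, on="SK_ID_CURR", how="left")
print(merged)
```

Loans without any previous credit end up with missing values in the aggregated columns, which is one reason the missing-value handling later in the notebook matters.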

## Metric: ROC AUC

Once we have a grasp of the data (reading through the [column descriptions](https://www.kaggle.com/c/home-credit-default-risk/data) helps immensely), we need to understand the metric by which our submission is judged. In this case, it is a common classification metric known as the [Receiver Operating Characteristic Area Under the Curve (ROC AUC, also sometimes called AUROC)](https://stats.stackexchange.com/questions/132777/what-does-auc-stand-for-and-what-is-it).

The ROC AUC may sound intimidating, but it is relatively straightforward once you get your head around the two individual concepts. The [Receiver Operating Characteristic (ROC) curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) graphs the true positive rate versus the false positive rate:

![image](http://www.statisticshowto.com/wp-content/uploads/2016/08/ROC-curve.png)

A single line on the graph indicates the curve for a single model, and movement along a line indicates changing the threshold used for classifying a positive instance. The threshold starts at 0 in the upper right and goes to 1 in the lower left. A curve that is to the left and above another curve indicates a better model. For example, the blue model is better than the red model, which is better than the black diagonal line, which indicates a naive random guessing model.
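
To make the threshold movement concrete, here is a small sketch that computes the true positive rate and false positive rate by hand at a few thresholds. The labels and scores are invented, and `tpr_fpr` is a hypothetical helper written only for this illustration.

```python
import numpy as np

# Invented labels and predicted probabilities
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

def tpr_fpr(y_true, y_score, threshold):
    """Classify as positive when score >= threshold, then compute both rates."""
    y_pred = y_score >= threshold
    tp = np.sum(y_pred & (y_true == 1))
    fp = np.sum(y_pred & (y_true == 0))
    tpr = tp / np.sum(y_true == 1)
    fpr = fp / np.sum(y_true == 0)
    return tpr, fpr

# Each threshold gives one point on the ROC curve
for t in [0.0, 0.3, 0.5, 1.0]:
    tpr, fpr = tpr_fpr(y_true, y_score, t)
    print(f"threshold={t:.1f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

A threshold of 0 labels everything positive (the upper right corner), while a threshold of 1 labels nothing positive here (the lower left corner).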

The [Area Under the Curve (AUC)](http://gim.unmc.edu/dxtests/roc3.htm) explains itself by its name! It is simply the area under the ROC curve. (This is the integral of the curve.) This metric is between 0 and 1 with a better model scoring higher. A model that simply guesses at random will have an ROC AUC of 0.5.
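
As a quick check of these two statements, the sketch below computes the area from the curve points with scikit-learn and compares it with `roc_auc_score`, then scores a purely random guesser. All data here is invented for illustration.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc, roc_auc_score

# Invented labels and predicted probabilities
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

# Area under the curve points (trapezoidal rule) vs. the direct metric
fpr, tpr, thresholds = roc_curve(y_true, y_score)
area_from_curve = auc(fpr, tpr)
area_direct = roc_auc_score(y_true, y_score)
print(area_from_curve, area_direct)

# Random scores carry no information about the labels
rng = np.random.default_rng(0)
random_labels = rng.integers(0, 2, size=10_000)
random_scores = rng.random(10_000)
auc_random = roc_auc_score(random_labels, random_scores)
print(auc_random)  # close to 0.5
```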

When we measure a classifier according to the ROC AUC, we do not generate 0 or 1 predictions, but rather a probability between 0 and 1. This may be confusing because we usually like to think in terms of accuracy, but when we get into problems with imbalanced classes (we will see this is the case), accuracy is not the best metric. For example, if I wanted to build a model that could detect terrorists with 99.9999% accuracy, I would simply make a model that predicted every single person was not a terrorist. Clearly, this would not be effective (the recall would be zero), so we use more advanced metrics such as ROC AUC or the [F1 score](https://en.wikipedia.org/wiki/F1_score) to more accurately reflect the performance of a classifier. A model with a high ROC AUC will often also have a high accuracy, but the [ROC AUC is a better representation of model performance.](https://datascience.stackexchange.com/questions/806/advantages-of-auc-vs-standard-accuracy)
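
The accuracy trap described above can be reproduced in a few lines. This is a sketch on synthetic labels with an invented positive rate of about 1%; the "model" simply predicts the majority class for everyone.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

rng = np.random.default_rng(0)
n = 10_000
y_true = (rng.random(n) < 0.01).astype(int)  # roughly 1% positives

# A "model" that always predicts the negative class
y_pred_const = np.zeros(n, dtype=int)
score_const = np.zeros(n)  # the same probability for every row

acc = accuracy_score(y_true, y_pred_const)
rec = recall_score(y_true, y_pred_const)
auc_const = roc_auc_score(y_true, score_const)

print(f"accuracy={acc:.4f}  recall={rec:.4f}  ROC AUC={auc_const:.4f}")
```

Accuracy looks excellent, yet recall is zero and the ROC AUC sits at the random-guessing level, which is why the competition does not score on accuracy.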

## List of Checkpoints
- CP #1: Preparation
- CP #2: Data Quick Look
- CP #3: Miscellaneous Handling
- CP #4: Selection & Subset
- CP #5: Transformation
- CP #6: Joining Tables
- CP #7: Aggregation & Sorting
- CP #8: Visualization
- CP #9: Dummy Var & Split Data
- CP #10: Modeling


# 1. Preparation

In [6]:
## Import Libraries Here

In [7]:
# Read Dataset

# 2. Data Quick Look

## Data Shape

## Concatenate Training and Testing Datasets

## Data View and Data Type

## Target Counts

## Descriptive Statistics

# 2.5 Exploratory Data Analysis

### Target Feature

### Number of Unique Categories in Each Column

### Days Birth

### Days Employed

### Risk Rate of Anomalous Values

### Correlations

# 3. Miscellaneous Handling

## Check Missing Data

## Checking Duplicate Values

## Drop Columns with Missing Values

In [8]:
import pandas as pd

def missing_values_stat_in_columns(df):
    """Return a table of missing-value counts and percentages per column."""
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * mis_val / len(df)
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing Values In Column', 1: '% of Total Values In Column'})
    # Keep only columns with missing values, sorted from most to least missing
    mis_val_table_ren_columns = (
        mis_val_table_ren_columns[mis_val_table_ren_columns.iloc[:, 1] != 0]
        .sort_values('% of Total Values In Column', ascending=False)
        .round(1))
    print("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(mis_val_table_ren_columns.shape[0]) +
          " columns that have missing values.")
    return mis_val_table_ren_columns

def missing_value_manipulation(df, missing_percentage=30.0):
    """Return the set of columns to drop: those at or above the missing-value threshold."""
    missing_percentage_df = missing_values_stat_in_columns(df)
    cols_to_drop = set()
    for index, row in missing_percentage_df.iterrows():
        if row['% of Total Values In Column'] >= float(missing_percentage):
            cols_to_drop.add(index)
    # EXT_SOURCE_1 is kept even when it exceeds the threshold
    cols_to_drop.discard('EXT_SOURCE_1')
    print('There are ' + str(len(cols_to_drop)) + ' columns to drop (at least '
          + str(missing_percentage) + '% missing values).')
    return cols_to_drop
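
The threshold logic of the two functions above can also be sketched more compactly on a toy frame. The column names and values below are invented, and this is only one possible way to express the same idea, not a replacement for the functions in the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; column names are placeholders
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan],  # 75% missing
    "b": [1.0, 2.0, np.nan, 4.0],        # 25% missing
    "c": [1, 2, 3, 4],                   # complete
})

threshold = 30.0  # same default as missing_value_manipulation

# Percentage of missing values per column
missing_pct = df.isnull().mean() * 100

# Columns at or above the threshold are candidates for dropping
cols_to_drop = set(missing_pct[missing_pct >= threshold].index)
reduced = df.drop(columns=list(cols_to_drop))

print(missing_pct)
print(reduced.columns.tolist())
```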

## Filling Missing Values

In [9]:
def filling_na(df):
    # Put your function here
    return df
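
The body of `filling_na` is left open in this notebook. One common baseline, shown here purely as an assumption and under the hypothetical name `fill_na_sketch`, is to fill numeric columns with the median and all other columns with the most frequent value.

```python
import numpy as np
import pandas as pd

def fill_na_sketch(df):
    """Hypothetical imputer: median for numeric columns, mode for the rest."""
    out = df.copy()
    for col in out.columns:
        if not out[col].isnull().any():
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            mode = out[col].mode()
            # A column that is entirely missing has no mode and is left as-is
            if not mode.empty:
                out[col] = out[col].fillna(mode.iloc[0])
    return out

# Toy data with invented values
toy = pd.DataFrame({
    "income": [100.0, np.nan, 300.0],
    "gender": ["F", None, "F"],
})
filled = fill_na_sketch(toy)
print(filled)
```

In a real workflow the fill values should be learned from the training rows only and then applied to the test rows, so that no information leaks from the test set.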

# 4. Dummy Variable & Selecting Table

# 5. Transformation & Aggregation

# 6. Joining Tables

## Checking Missing Values

# 7. Training and Testing Split

## Splitting X and Y

## Dropping ID

# 8. Modeling

## Baseline

## Feature Importances by Random Forest

In [10]:
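# Function for creating a feature importance DataFrame
def imp_df(column_names, importances):
    # Pair each feature with its importance, most important first
    # (the column labels used here are an assumption)
    df = pd.DataFrame({'feature': column_names,
                       'feature_importance': importances})
    df = df.sort_values('feature_importance', ascending=False).reset_index(drop=True)
    return df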
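
For context on where the `importances` argument would come from, here is a sketch that fits a random forest on synthetic data and tabulates its `feature_importances_`. The feature names `f0` to `f5` and the table layout are assumptions made for this example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the prepared training features
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           random_state=123)
feature_names = [f"f{i}" for i in range(X.shape[1])]

forest = RandomForestClassifier(n_estimators=50, random_state=123)
forest.fit(X, y)

# One row per feature, sorted from most to least important
importance_table = (
    pd.DataFrame({"feature": feature_names,
                  "feature_importance": forest.feature_importances_})
    .sort_values("feature_importance", ascending=False)
    .reset_index(drop=True)
)
print(importance_table)
```

The importances of a fitted scikit-learn forest sum to 1, so each value can be read as a share of the total.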

## Feature Selection

In [11]:
# Imports are placed at the top of the cell rather than inside the function
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                              BaggingClassifier, GradientBoostingClassifier)
from xgboost import XGBClassifier

def modeling(x, y):
    """Print the cross-validated ROC AUC of several baseline classifiers."""
    # A fixed random_state on every model keeps the comparison reproducible
    classifiers = {
        'Logistic Regression': LogisticRegression(random_state=123),
        'Decision Tree': DecisionTreeClassifier(random_state=123),
        'Random Forest': RandomForestClassifier(random_state=123),
        'Bagging': BaggingClassifier(random_state=123),
        'AdaBoost': AdaBoostClassifier(random_state=123),
        'GradBoost': GradientBoostingClassifier(random_state=123),
        'XGBoost': XGBClassifier(random_state=123),
    }
    # The same stratified 3-fold split is used for every model
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=123)
    for label, clf in classifiers.items():
        scores = cross_val_score(clf, x, y, cv=cv, scoring='roc_auc')
        print("ROC AUC: %0.4f (+/- %0.4f) [%s]"
              % (scores.mean(), scores.std(), label))
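
The reason for using `StratifiedKFold` on an imbalanced target can be seen in a small sketch: each fold keeps roughly the same share of positives as the full data, so every fold yields a meaningful ROC AUC. The synthetic data and the 8% positive rate below are invented for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data standing in for the prepared training set
X, y = make_classification(n_samples=3000, n_features=10,
                           weights=[0.92, 0.08], random_state=123)

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=123)

# Share of positives in each validation fold
fold_positive_rates = [y[test_idx].mean() for _, test_idx in cv.split(X, y)]
print("overall positive rate:", y.mean())
print("per-fold positive rates:", fold_positive_rates)

# Cross-validated ROC AUC for a single model, as done per model in modeling()
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="roc_auc")
print("ROC AUC: %0.4f (+/- %0.4f)" % (scores.mean(), scores.std()))
```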

## Evaluate Overall Model with Testing Dataset

# 8a. Imbalanced Dataset

# 9. Predicting Test Data

In [12]:
def make_submission(clf,name_file = 'submission.csv'):
    # input submission function here
    return df_valid_id
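
The submission function is left open above. As a sketch under stated assumptions, the hypothetical helper `make_submission_sketch` below takes the test features and IDs as explicit arguments (the template's signature does not have them) and writes a file with the two columns `SK_ID_CURR` and `TARGET`, where `TARGET` is the predicted probability of class 1. The training and test data are synthetic.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def make_submission_sketch(clf, X_test, test_ids, name_file="submission.csv"):
    """Hypothetical helper: save SK_ID_CURR with the predicted default probability."""
    # Column 1 of predict_proba is the probability of class 1 for a 0/1 target
    proba = clf.predict_proba(X_test)[:, 1]
    submission = pd.DataFrame({"SK_ID_CURR": test_ids, "TARGET": proba})
    submission.to_csv(name_file, index=False)
    return submission

# Synthetic stand-ins for the prepared training and test sets
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
y_train = (X_train[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)
X_test = rng.normal(size=(5, 3))
test_ids = np.arange(100001, 100006)  # invented IDs

clf = LogisticRegression().fit(X_train, y_train)
out_path = os.path.join(tempfile.mkdtemp(), "submission.csv")
submission = make_submission_sketch(clf, X_test, test_ids, out_path)
print(submission)
```

Submitting probabilities rather than 0 or 1 labels matters here because the ROC AUC is computed from the ranking of the scores.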

# 10. SUBMIT!!!