> # From Data to Features & Classification

In this notebook, we will take a basic tour from data and feature engineering to classification for the Home Credit Default Risk competition. Machine learning models learn from data automatically and often perform better when given more useful, relevant features. However, we still need to pay attention to the correlations between the features and the target. In this competition, several dataframes are available:
![image.png](https://storage.googleapis.com/kaggle-media/competitions/home-credit/home_credit.png)

Note that this is my first Kaggle notebook, and I draw heavily on the work of [Will Koehrsen](https://www.kaggle.com/willkoehrsen) [here](https://www.kaggle.com/willkoehrsen/start-here-a-gentle-introduction). This notebook is meant as a tutorial for getting familiar with the competition data, building on his excellent work.

In addition to the `application` data, we also have more datasets such as the `bureau` and `bureau_balance` data:

* **bureau**: data concerning a client's previous credits from other financial institutions. Each previous credit has its own row in bureau and is identified by SK_ID_BUREAU. Each loan in the application data can have multiple previous credits.
* **bureau_balance**: monthly data about the previous credits in bureau. Each row is one month of a previous credit, and a single previous credit can have multiple rows, one for each month of the credit length.
* **previous_application**: previous applications for loans at Home Credit of clients who have loans in the application data. Each current loan in the application data can have multiple previous loans. Each previous application has one row and is identified by the feature SK_ID_PREV.
* **POS_CASH_BALANCE**: monthly data about previous point of sale or cash loans clients have had with Home Credit. Each row is one month of a previous point of sale or cash loan, and a single previous loan can have many rows.
* **credit_card_balance**: monthly data about previous credit cards clients have had with Home Credit. Each row is one month of a credit card balance, and a single credit card can have many rows.
* **installments_payments**: payment history for previous loans at Home Credit. There is one row for every made payment and one row for every missed payment.
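
Because most of these tables hold many rows per client, they have to be collapsed to one row per `SK_ID_CURR` before they can be joined to the application data. Here is a minimal sketch of that idea on a tiny made-up `bureau`-like table. The values, the new column names `PREV_CREDIT_COUNT` and `PREV_CREDIT_MEAN`, and the choice of aggregations are illustrative assumptions, not results from the competition data:

```python
import pandas as pd

# Toy stand-ins for the application and bureau tables (values are made up)
app = pd.DataFrame({'SK_ID_CURR': [1, 2, 3]})
bureau = pd.DataFrame({
    'SK_ID_CURR':     [1, 1, 2],
    'SK_ID_BUREAU':   [10, 11, 12],
    'AMT_CREDIT_SUM': [1000.0, 3000.0, 500.0],
})

# Collapse the many previous credits of each client into one row
bureau_agg = (
    bureau.groupby('SK_ID_CURR')
          .agg(PREV_CREDIT_COUNT=('SK_ID_BUREAU', 'count'),
               PREV_CREDIT_MEAN=('AMT_CREDIT_SUM', 'mean'))
          .reset_index()
)

# Left join keeps clients without any bureau record (their new columns are NaN)
app_with_bureau = app.merge(bureau_agg, on='SK_ID_CURR', how='left')
print(app_with_bureau)
```

The left join matters here: a client with no previous credits still needs a row in the training data.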

If you don't have domain knowledge of loans and credit, no worries: neither do I. In this notebook, we'll take a basic tour from data to features and classification, and go through some basic concepts along the way. This notebook also draws on several other notebooks, so we don't have to reinvent the wheel.

Now let's begin.

In [None]:
# File system management
import os

# pandas and numpy for data manipulation
import pandas as pd
import numpy as np

# matplotlib and seaborn for visualizing
import matplotlib.pyplot as plt
import seaborn as sns

># 1. Look at the data
Let's have a look at where we are and what we have:

In [None]:
print('where we are?')
!pwd

print('\nwhat we have/access to?')
!ls ../ -l

print('\nwhat data we can use?')
! ls ../input -l

Uncomment the following cell if you want to see the column descriptions:

In [None]:
# # Items of the data
# HC_desc = pd.read_csv('../input/HomeCredit_columns_description.csv', encoding='latin')

# # columns and rows
# print(HC_desc.count())

# HC_desc.head()

What the training data look like:

In [None]:
# Training data
app_train = pd.read_csv('../input/application_train.csv')
print('Training data shape: ', app_train.shape, \
      '\nTarget/Label:\n', app_train['TARGET'].value_counts())
app_train.head()

We can see that `TARGET` includes two classes {0, 1}. These are the labels used to train the machine learning model and the values it will predict. The following histogram shows the clients who repaid (0) or defaulted on (1) their loans:

In [None]:
app_train['TARGET'].astype(int).plot.hist(bins=app_train['TARGET'].value_counts().shape[0]+1, alpha=0.75 )
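
A histogram only gives a visual impression of the class balance. As a sketch of how to put a number on it, the snippet below uses a small made-up label series (`toy_target` is a hypothetical stand-in for `app_train['TARGET']`, and its 80/20 split is invented for illustration):

```python
import pandas as pd

# Hypothetical stand-in for app_train['TARGET'] (0 = repaid, 1 = defaulted)
toy_target = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# Share of each class instead of raw counts
class_share = toy_target.value_counts(normalize=True)
print(class_share)

# For a 0/1 label, the mean is the fraction of positives
default_rate = toy_target.mean()
print('Default rate: %.1f%%' % (100 * default_rate))
```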

In [None]:
# Testing data features
app_test = pd.read_csv('../input/application_test.csv')
print('Testing data shape: ', app_test.shape, \
      '\nTesting data example: \n')
app_test.head(5) 

># 2. Clean & prepare data
Before starting with the really cool part, machine learning, we have to clean the data up. Let's look at how clean the data is, then prepare it for feature engineering and machine learning.

> ### 2.1 Check missing values

In [None]:
# Function to calculate missing values by column
def missing_values_table(df):
        # Total missing values
        mis_val = df.isnull().sum()
        
        # Total values
        tot_val = df.count() # len(df)-mis_val
        
        # Percentage of missing values
        mis_val_percent = 100 * df.isnull().sum() / len(df)
        
        # Make a table with the results
        mis_val_table = pd.concat([mis_val, tot_val, mis_val_percent], axis=1)
        
        # Rename the columns
        mis_val_table_ren_columns = mis_val_table.rename(
        columns = {0 : 'Missing Values', 1 : 'Valid Values', 2 : '% of Total Values'})
        
        # Sort the table by percentage of missing descending
        mis_val_table_ren_columns = mis_val_table_ren_columns[
            mis_val_table_ren_columns.iloc[:,2] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
        
        # Print some summary information
        print ("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"      
            "There are " + str(mis_val_table_ren_columns.shape[0]) +
              " columns that have missing values.")
        
        # Return the dataframe with missing information
        return mis_val_table_ren_columns
    
missing_values = missing_values_table(app_train)
missing_values.head(20)

> ### 2.2 Categorical variable encoding
Briefly, we need to make the data "computable" by converting categories to numbers. 
Let's see the types of the entries:

In [None]:
app_train.dtypes.value_counts()

`float64` and `int64` entries are already usable for machine learning. Let's now look at the `object` columns, so that we can encode these categorical variables. In the result below, we can see that most of these variables have a small number of unique entries. By encoding, we use numbers to represent the categories so that they can be fed to machine learning algorithms:

* label encoding for {2-category object}    - eg. {0, 1}
* one-hot encoding for {>2-category object} - eg.
  - 1 0 0 0
  - 0 1 0 0
  - 0 0 1 0
  - 0 0 0 1
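
To make the two encodings concrete, here is a small sketch on a made-up dataframe. The column names `OWNS_CAR` and `HOUSING` and their values are hypothetical and only serve as an illustration:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up columns: one with two categories, one with three
toy = pd.DataFrame({
    'OWNS_CAR': ['Y', 'N', 'Y', 'N'],
    'HOUSING':  ['rent', 'own', 'parents', 'own'],
})

# Label encoding: each category becomes one integer
# (classes are sorted, so N -> 0 and Y -> 1)
le_toy = LabelEncoder()
toy['OWNS_CAR'] = le_toy.fit_transform(toy['OWNS_CAR'])

# One-hot encoding: one indicator column per category of HOUSING
toy_encoded = pd.get_dummies(toy, columns=['HOUSING'])
print(toy_encoded)
```

Label encoding keeps a single column, while one-hot encoding adds one column per category and so avoids implying an order among the categories.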

In [None]:
app_train.select_dtypes('object').apply(pd.Series.nunique, axis=0)

In [None]:
# sklearn preprocessing for dealing with categorical variables
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

In [None]:
# new label encoder object
le = LabelEncoder()
le_count = 0

# Iterate by object columns
for col in app_train:
    if app_train[col].dtype == 'object':
        # 2-category object
        if len(list(app_train[col].unique())) <= 2:
            # train on the column
            le.fit(app_train[col])
            
            # transform both training & test data
            # - note that this only works if every category in the test data also occurs in the training data
            app_train[col] = le.transform(app_train[col])
            app_test[col] = le.transform(app_test[col])
            
            # count the number of 2-category objects
            le_count += 1
            
            
print('%d columns were label-encoded.' % le_count)

# one-hot encoding of categorical variables
app_train = pd.get_dummies(app_train)
app_test = pd.get_dummies(app_test)

print('Training Features shape: ', app_train.shape)
print('Testing Features shape: ', app_test.shape)
            

In [None]:
# initialize list of lists 
data_list = [['tom', 't'], ['peter', 'p'], ['paul', 'p']]
  
# Create the pandas DataFrame 
dataFrm = pd.DataFrame(data_list, columns = ['Name', 'alias']) 

dataFrm_ohe = pd.get_dummies(dataFrm)

#

print('This is how one-hot encoding works:')
print('Originally: \n', dataFrm.head())

print('\nOne-Hot Encoded: \n', dataFrm_ohe.head())

Align the dataframes to make sure that only columns present in both the training and the testing data are kept. We also save the `TARGET` column, which exists only in the training data, to use as training labels.

In [None]:
train_labels = app_train['TARGET']

# align
app_train, app_test = app_train.align(app_test, join='inner', axis = 1)

# put back the TARGET column - creating a new column
app_train['TARGET'] = train_labels

print('Training Features shape: ', app_train.shape)
print('Testing Features shape: ', app_test.shape)
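
As a small illustration of what `align` with `join='inner'` does, consider two made-up one-hot encoded frames in which one category appears only in the training data and another only in the test data (all names and values here are hypothetical):

```python
import pandas as pd

# Made-up one-hot encoded frames: CAT_C only occurs in train, CAT_B only in test
toy_train = pd.DataFrame({'ID': [1, 2], 'CAT_A': [1, 0], 'CAT_C': [0, 1]})
toy_test = pd.DataFrame({'ID': [3, 4], 'CAT_A': [0, 1], 'CAT_B': [1, 0]})

# join='inner' with axis=1 keeps only the columns present in both frames
toy_train_aligned, toy_test_aligned = toy_train.align(toy_test, join='inner', axis=1)

print(list(toy_train_aligned.columns))
print(list(toy_test_aligned.columns))
```

Both frames end up with the same columns, which is what a model trained on one and applied to the other requires.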

> ### 2.3 Check anomaly 
It's almost impossible to record, document, or collect perfect data. Anomalies, noise, dirt (you name it) are therefore quite common, so checking for anomalies is necessary.

#### - Age (by years)

In [None]:
# Age of each client in years
print( (app_train['DAYS_BIRTH']/-365).describe() )
print('\nAdults, between 20 and 69 - looks reasonable')

(app_train['DAYS_BIRTH']/-365).hist(bins=20)

#### - Years employed
1000 years of employment is clearly anomalous. The histogram shows that the value `365243` (about 1000 years in days) is very likely the anomaly:

In [None]:
# Years employed

print(  (app_train['DAYS_EMPLOYED']).describe() , '\n')

print(  (app_train['DAYS_EMPLOYED']/365).describe()  )

print('\nAnomalous!')

app_train['DAYS_EMPLOYED'].hist(bins=55)

In [None]:
# Create an anomalous flag column
app_train['DAYS_EMPLOYED_ANOM'] = app_train["DAYS_EMPLOYED"] == 365243

# Replace the anomalous values with nan
app_train['DAYS_EMPLOYED'] = app_train['DAYS_EMPLOYED'].replace({365243: np.nan})

(app_train['DAYS_EMPLOYED']/-365).hist(bins=20)
plt.title('DAYS_EMPLOYED Histogram')
plt.xlabel('Years Employed')
plt.ylabel('Count')

Apply the same change to the test data as well:

In [None]:
app_test['DAYS_EMPLOYED_ANOM'] = app_test["DAYS_EMPLOYED"] == 365243
app_test["DAYS_EMPLOYED"] = app_test["DAYS_EMPLOYED"].replace({365243: np.nan})

print('There are %d anomalies in the test data out of %d entries' % (app_test["DAYS_EMPLOYED_ANOM"].sum(), len(app_test)))

>### 2.4 Correlation : feature - class (TARGET)
The correlation coefficient does not always represent the "relevance" of a feature, but it does give us an idea of possible relationships within the data. Its value ranges from -1 to 1. The greater the absolute value of the correlation coefficient, the stronger the linear relationship between the two variables. In the following, we'll use several examples to show how to check and select features with regard to correlation.
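
To see why correlation and "relevance" are not the same thing, here is a minimal sketch with made-up numbers: one variable depends linearly on `x`, the other depends on `x` through a square. Both are fully determined by `x`, but only the linear one shows up in the Pearson correlation coefficient:

```python
import numpy as np

# A feature with values symmetric around zero
x = np.arange(-5, 6, dtype=float)

# Linear dependence: the correlation is perfect and negative
y_linear = -2.0 * x + 1.0
r_linear = np.corrcoef(x, y_linear)[0, 1]

# Quadratic dependence: y is fully determined by x, yet the correlation vanishes
y_quadratic = x ** 2
r_quadratic = np.corrcoef(x, y_quadratic)[0, 1]

print('linear:    r = %.2f' % r_linear)
print('quadratic: r = %.2f' % r_quadratic)
```

A feature with a correlation close to zero can therefore still be useful to a model that handles non-linear relationships.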

In [None]:
# Find correlations with the target and sort
correlations = app_train.corr()['TARGET'].sort_values()

In [None]:
print('Most Positive Correlations:\n', correlations.tail(15))
print('\nMost Negative Correlations:\n', correlations.head(15))

A negative correlation coefficient (CC) means that the target tends to decrease as the feature increases. Take age as an example: `DAYS_BIRTH` is stored as negative values, so the CC of the actual age with the target is about -0.078. In other words, clients are less likely to default on their loans as they get older. Let's visualize the effect of age on the target with kernel density estimation (a smoothed histogram).

In [None]:
plt.style.use('fivethirtyeight')

plt.figure(figsize = (10, 8))

# KDE plot of loans that were repaid on time
sns.kdeplot(  abs(app_train.loc[app_train['TARGET'] == 0, 'DAYS_BIRTH'] / 365), label = 'target == 0 (repaid)', linewidth=3)

# KDE plot of loans which were not repaid on time
sns.kdeplot(  abs(app_train.loc[app_train['TARGET'] == 1, 'DAYS_BIRTH'] / 365), label = 'target == 1 (defaulted)', linewidth=3)

# Labeling of plot
plt.xlabel('Age (years)', fontsize=14); plt.ylabel('Density', fontsize=14); plt.title('Distribution of Ages', fontsize=14)

Among clients who defaulted, the age distribution is skewed towards the younger end, while it is fairly flat for clients who repaid their loans. To further illustrate how different age groups behave, we split the clients into 5-year age bins from the youngest to the oldest and calculate the default rate in each bin.

In [None]:
import warnings
warnings.filterwarnings('ignore')


# Age information into a separate dataframe
age_data = app_train[['TARGET', 'DAYS_BIRTH']].copy()
age_data['YEARS_BIRTH'] = age_data['DAYS_BIRTH'] / -365

# Bin the age data
age_data['YEARS_BINNED'] = pd.cut(age_data['YEARS_BIRTH'], bins = np.linspace(20, 70, num = 11))
age_data.head(20)

Let's see how each age group behaved by calculating their mean default rate

In [None]:
# Group by the bin and calculate averages
age_groups  = age_data.groupby('YEARS_BINNED').mean()
print(age_groups)

plt.figure(figsize = (8, 8))

# Graph the age bins and the average of the target as a bar plot
plt.bar(age_groups.index.astype(str), 100 * age_groups['TARGET'])

# Plot labeling
plt.xticks(rotation = 75); plt.xlabel('Age Group (years)'); plt.ylabel('Failure to Repay (%)')
plt.title('Failure to Repay by Age Group');

There is a clear trend: younger applicants are more likely to not repay the loan! The rate of failure to repay is above 10% for the youngest three age groups and below 5% for the oldest age group.

This is information that could be directly used by the bank: because younger clients are less likely to repay the loan, maybe they should be provided with more guidance or financial planning tips. This doesn't mean that the bank should discriminate against younger clients by over-generalizing the above observation. In practice, there are many other factors that should be taken into account. Therefore, we should use more information than any single feature provides. Machine learning is good at doing this organically.

Let's also have a look at the external sources (`EXT_SOURCE_1`, `EXT_SOURCE_2`, and `EXT_SOURCE_3`), since they have the strongest (negative) correlations with the target.

> [(see this notebook)](https://www.kaggle.com/willkoehrsen/introduction-to-manual-feature-engineering) According to the documentation, these features represent a "normalized score from external data source". I'm not sure what this exactly means, but it may be a cumulative sort of credit rating made using numerous sources of data

Like the previous example of age, let's start with the correlations:


In [None]:
# Extract the EXT_SOURCE variables and show correlations
ext_data = app_train[['TARGET', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH']]
ext_data_corrs = ext_data.corr()
print(ext_data_corrs)

plt.figure(figsize=(8,6))

# Heatmap of correlations
sns.heatmap(ext_data_corrs, cmap = plt.cm.RdYlBu_r, vmin = -0.25, annot = True, vmax = 0.6)
plt.title('Correlation Heatmap')

Take a look at the first column of the heatmap: the color represents the correlation between each EXT_SOURCE feature and TARGET (1 = defaulted, 0 = repaid). All three features have a negative correlation with TARGET, suggesting that as their value increases, the client is more likely to repay the loan. Let's further visualize this information with KDE (smoothed histogram):

In [None]:
plt.figure(figsize = (10, 12))

# iterate through the sources
for i, source in enumerate(['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']):
    
    # create a new subplot for each source
    plt.subplot(3, 1, i + 1)
    # plot repaid loans
    sns.kdeplot(app_train.loc[app_train['TARGET'] == 0, source], label = 'target == 0')
    # plot loans that were not repaid
    sns.kdeplot(app_train.loc[app_train['TARGET'] == 1, source], label = 'target == 1')
    
    # Label the plots
    plt.title('Distribution of %s by Target Value' % source)
    plt.xlabel('%s' % source); plt.ylabel('Density');
    
plt.tight_layout(h_pad = 2.5)

Similar observations can be made for the three EXT_SOURCE features as for the age feature. EXT_SOURCE_3 displays the greatest difference between the distributions of the two target values.

So far so good, or too much information? To be concise, we could use a pairs plot to visualize all of the above results:

In [None]:
# Copy the data for plotting
plot_data = ext_data.drop(columns = ['DAYS_BIRTH']).copy()

# Add in the age of the client in years
plot_data['YEARS_BIRTH'] = age_data['YEARS_BIRTH']

# Drop na values and limit to rows with index labels up to 10000
plot_data = plot_data.dropna().loc[:10000, :]

# Function to calculate correlation coefficient between two columns
def corr_func(x, y, **kwargs):
    r = np.corrcoef(x, y)[0][1]
    ax = plt.gca()
    ax.annotate("r = {:.2f}".format(r),
                xy=(.2, .8), xycoords=ax.transAxes,
                size = 20)

# Create the pairgrid object
grid = sns.PairGrid(data = plot_data, height = 3, diag_sharey=False,
                    hue = 'TARGET', 
                    vars = [x for x in list(plot_data.columns) if x != 'TARGET'])

# Upper is a scatter plot
grid.map_upper(plt.scatter, alpha = 0.2)

# Diagonal is a KDE plot (smoothed histogram)
grid.map_diag(sns.kdeplot)

# Bottom is density plot
grid.map_lower(sns.kdeplot, cmap = plt.cm.OrRd_r);

plt.suptitle('Ext Source and Age Features Pairs Plot', size = 32, y = 1.05);

One more thing to talk about
> Should we always use as many features as we can? Well, not really. A good decision, judgement, or plan is based on a balanced consideration of multiple factors. Imagine that several features represent the same factor: together, these features may end up playing a dominant role in the decision.

So, how could we avoid this happening? 
> An easy way, usually useful but not always, is to also have a look at the correlations among your selected features.

For example, in the examples above, we don't see a very strong positive or negative correlation between the EXT_SOURCE features and age, so it is probably safe to use all of them in our machine learning model.
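
The check described above can be sketched as follows. The data is made up (`f2` is deliberately built as a near copy of `f1`), and the threshold of 0.9 is an arbitrary assumption, not a recommended value:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Made-up features: f2 is almost a copy of f1, f3 is independent noise
f1 = rng.normal(size=n)
toy_features = pd.DataFrame({
    'f1': f1,
    'f2': f1 + rng.normal(scale=0.05, size=n),
    'f3': rng.normal(size=n),
})

# Absolute correlations; keep only the upper triangle so each pair appears once
corr_abs = toy_features.corr().abs()
upper = corr_abs.where(np.triu(np.ones(corr_abs.shape, dtype=bool), k=1))

# Pairs above the threshold are candidates for dropping one of the two features
threshold = 0.9
high_pairs = upper.stack()
high_pairs = high_pairs[high_pairs > threshold]
print(high_pairs)
```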

A more advanced method is feature engineering. Kaggle competitions are often won by feature engineering: the winners are those who can create the most useful features out of the data. This is also the case in many other competitions, such as those on FlyAI and Tianchi (Aliyun).

> ### 2.5 Feature Engineering
Given a suitable model and optimized parameters, we still need to feed the model with data of good quality. This is where feature engineering plays a significant role. It would take a thick book to cover feature engineering even in general terms, but don't panic: one major reason is simply the diversity of data and application scenarios. Basically, feature engineering is a set of techniques for constructing features from the original data. It is often followed by feature selection (choosing the most useful features, which is often subjective, or reducing dimensionality).

> Many mature feature engineering routines are already available, along with algorithms, methods and tricks. There are also some toolboxes for semi-automated or automated feature engineering. Some of them are open source and can be used and built upon; others charge reasonable fees. You can choose a suitable one depending on the case.

> In this case, we use two simple but popular methods of feature construction:
  - Polynomial features
  - Domain knowledge features
 
#### 2.5.1 Polynomial Features
As the name suggests, polynomial features are polynomial combinations of existing features. A polynomial calculation [\[involves only the operations of addition, subtraction, multiplication, and non-negative integer exponentiation of variables\]](https://en.wikipedia.org/wiki/Polynomial). Terms that multiply different variables capture the interaction between them and are called interaction terms. One useful observation: while two variables by themselves may not have a strong influence on the target, combining them into a single interaction variable might show a relationship with the target.

The good news is that the Scikit-Learn library has a useful class called `PolynomialFeatures`, so we don't have to code this ourselves. Try not to overuse polynomial features, because they may lead to overfitting (a model that generalizes less well to new data).


In [None]:
# Make a new dataframe for polynomial features
poly_features = app_train[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH', 'TARGET']]
poly_features_test = app_test[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH']]

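# imputer for handling missing values
# (SimpleImputer replaces the Imputer class removed from sklearn.preprocessing in scikit-learn 0.22)
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy = 'median')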

poly_target = poly_features['TARGET']

poly_features = poly_features.drop(columns = ['TARGET'])

# Need to impute missing values
poly_features = imputer.fit_transform(poly_features)
poly_features_test = imputer.transform(poly_features_test)

from sklearn.preprocessing import PolynomialFeatures
                                  
# Create the polynomial object with specified degree
poly_transformer = PolynomialFeatures(degree = 3)

In [None]:
# Train the polynomial features
poly_transformer.fit(poly_features)

# Transform the features
poly_features = poly_transformer.transform(poly_features)
poly_features_test = poly_transformer.transform(poly_features_test)
print('Polynomial Features shape: ', poly_features.shape)

In [None]:
poly_transformer.get_feature_names_out(input_features = ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH'])[:27]

There are 35 features with individual features raised to powers up to degree 3 and interaction terms. Now, we can see whether any of these new features are correlated with the target.

In [None]:
# Create a dataframe of the features 
poly_features = pd.DataFrame(poly_features, 
                             columns = poly_transformer.get_feature_names_out(['EXT_SOURCE_1', 'EXT_SOURCE_2', 
                                                                               'EXT_SOURCE_3', 'DAYS_BIRTH']))

# Add in the target
poly_features['TARGET'] = poly_target

# Find the correlations with the target
poly_corrs = poly_features.corr()['TARGET'].sort_values()

# Display most negative and most positive
print(poly_corrs.head(10), '\n')
print(poly_corrs.tail(10))

As seen in the result above, some of the combined features have a higher correlation with TARGET than the original individual features. This does not directly decide which features to use in machine learning, but it does provide clues. When we build machine learning models, we can try with and without these features to determine whether they actually help the model learn. Often in machine learning, the only way to choose a better approach is to try it out.
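
One way to "try it out" without submitting each time is cross-validation on the training data. The sketch below uses made-up data in which the label depends on the product of two features, and compares a model with and without that interaction term. The data, the model settings, and the use of ROC AUC as the score are assumptions for illustration only:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n = 400

# Made-up data where the label depends on the product of the two features
X_base = rng.normal(size=(n, 2))
y_toy = (X_base[:, 0] * X_base[:, 1] > 0).astype(int)

# Candidate engineered feature: the interaction term
X_extended = np.column_stack([X_base, X_base[:, 0] * X_base[:, 1]])

model = LogisticRegression(max_iter=1000)

# 5-fold cross-validated ROC AUC with and without the extra feature
auc_base = cross_val_score(model, X_base, y_toy, cv=5, scoring='roc_auc').mean()
auc_extended = cross_val_score(model, X_extended, y_toy, cv=5, scoring='roc_auc').mean()

print('AUC without interaction: %.3f' % auc_base)
print('AUC with interaction:    %.3f' % auc_extended)
```

On real data the difference is usually much smaller than in this constructed case, so it is worth looking at the spread across folds as well as the mean.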

Now let's see what polynomial features we have for machine learning:

In [None]:
# Put test features into dataframe
poly_features_test = pd.DataFrame(poly_features_test, 
                                  columns = poly_transformer.get_feature_names_out(['EXT_SOURCE_1', 'EXT_SOURCE_2', 
                                                                                    'EXT_SOURCE_3', 'DAYS_BIRTH']))

# Merge polynomial features into training dataframe
poly_features['SK_ID_CURR'] = app_train['SK_ID_CURR']
app_train_poly = app_train.merge(poly_features, on = 'SK_ID_CURR', how = 'left')

# Merge polynomial features into testing dataframe
poly_features_test['SK_ID_CURR'] = app_test['SK_ID_CURR']
app_test_poly = app_test.merge(poly_features_test, on = 'SK_ID_CURR', how = 'left')

# Align the dataframes
app_train_poly, app_test_poly = app_train_poly.align(app_test_poly, join = 'inner', axis = 1)

# Print out the new shapes
print('Training data with polynomial features shape: ', app_train_poly.shape)
print('Testing data with polynomial features shape:  ', app_test_poly.shape)

#### 2.5.2 Domain Knowledge Features
Now let's move on to the domain knowledge features. What to build here really depends on the "domain knowledge" of finance of the person doing the work. In this case, let's pick a few features to pass on to machine learning after the long journey. Four features are suggested by [this script](https://www.kaggle.com/willkoehrsen/start-here-a-gentle-introduction/) and [this one](https://www.kaggle.com/jsaguiar/lightgbm-with-simple-features):
>
> - `CREDIT_INCOME_PERCENT`: the percentage of the credit amount relative to a client's income
> - `ANNUITY_INCOME_PERCENT`: the percentage of the loan annuity relative to a client's income
> - `CREDIT_TERM`: the ratio of the annuity to the credit amount, which is the inverse of the length of the payment in months (since the annuity is the monthly amount due)
> - `DAYS_EMPLOYED_PERCENT`: the percentage of the days employed relative to the client's age



In [None]:
app_train_domain = app_train.copy()
app_test_domain = app_test.copy()

app_train_domain['CREDIT_INCOME_PERCENT'] = app_train_domain['AMT_CREDIT'] / app_train_domain['AMT_INCOME_TOTAL']
app_train_domain['ANNUITY_INCOME_PERCENT'] = app_train_domain['AMT_ANNUITY'] / app_train_domain['AMT_INCOME_TOTAL']
app_train_domain['CREDIT_TERM'] = app_train_domain['AMT_ANNUITY'] / app_train_domain['AMT_CREDIT']
app_train_domain['DAYS_EMPLOYED_PERCENT'] = app_train_domain['DAYS_EMPLOYED'] / app_train_domain['DAYS_BIRTH']


app_test_domain['CREDIT_INCOME_PERCENT'] = app_test_domain['AMT_CREDIT'] / app_test_domain['AMT_INCOME_TOTAL']
app_test_domain['ANNUITY_INCOME_PERCENT'] = app_test_domain['AMT_ANNUITY'] / app_test_domain['AMT_INCOME_TOTAL']
app_test_domain['CREDIT_TERM'] = app_test_domain['AMT_ANNUITY'] / app_test_domain['AMT_CREDIT']
app_test_domain['DAYS_EMPLOYED_PERCENT'] = app_test_domain['DAYS_EMPLOYED'] / app_test_domain['DAYS_BIRTH']
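
Since the same four ratios are computed for both the training and the test data, they could also be wrapped in one function. This is only a sketch: `add_domain_features` is a hypothetical helper name, and the single client below is made up to check the arithmetic:

```python
import pandas as pd

def add_domain_features(df):
    """Return a copy of df with the four ratio features added."""
    out = df.copy()
    out['CREDIT_INCOME_PERCENT'] = out['AMT_CREDIT'] / out['AMT_INCOME_TOTAL']
    out['ANNUITY_INCOME_PERCENT'] = out['AMT_ANNUITY'] / out['AMT_INCOME_TOTAL']
    out['CREDIT_TERM'] = out['AMT_ANNUITY'] / out['AMT_CREDIT']
    out['DAYS_EMPLOYED_PERCENT'] = out['DAYS_EMPLOYED'] / out['DAYS_BIRTH']
    return out

# Made-up single client to check the arithmetic
toy_client = pd.DataFrame({
    'AMT_CREDIT': [200000.0],
    'AMT_INCOME_TOTAL': [100000.0],
    'AMT_ANNUITY': [10000.0],
    'DAYS_EMPLOYED': [-1000.0],
    'DAYS_BIRTH': [-10000.0],
})
toy_domain = add_domain_features(toy_client)
print(toy_domain.T)
```

Using one function for both dataframes makes it harder for the training and test features to drift apart when a formula is changed.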

**Visualize New Variables**

We should explore these domain knowledge variables visually. For all of them, we will make the same KDE plot colored by the value of TARGET:

In [None]:
plt.figure(figsize = (12, 20))
# iterate through the new features
for i, feature in enumerate(['CREDIT_INCOME_PERCENT', 'ANNUITY_INCOME_PERCENT', 'CREDIT_TERM', 'DAYS_EMPLOYED_PERCENT']):
    
    # create a new subplot for each source
    plt.subplot(4, 1, i + 1)
    # plot repaid loans
    sns.kdeplot(app_train_domain.loc[app_train_domain['TARGET'] == 0, feature], label = 'target == 0')
    # plot loans that were not repaid
    sns.kdeplot(app_train_domain.loc[app_train_domain['TARGET'] == 1, feature], label = 'target == 1')
    
    # Label the plots
    plt.title('Distribution of %s by Target Value' % feature)
    plt.xlabel('%s' % feature); plt.ylabel('Density');
    
plt.tight_layout(h_pad = 2.5)

> ## 3. Machine learning - Classification
Finally! Let's do the machine learning work ;). 

> ### 3.1 Machine learning using original features
In this section, we train machine learning models on the original features first; the features from feature engineering follow in the next sections. Several models will be tried, as said previously, to see which works best.

#### * Logistic Regression

In [None]:
from sklearn.preprocessing import MinMaxScaler
# SimpleImputer replaces the Imputer class removed from sklearn.preprocessing in scikit-learn 0.22
from sklearn.impute import SimpleImputer

# Drop the target from the training data
if 'TARGET' in app_train:
    train = app_train.drop(columns = ['TARGET'])
else:
    train = app_train.copy()
    
# Feature names
features = list(train.columns)

# Copy of the testing data
test = app_test.copy()

# Median imputation of missing values
imputer = SimpleImputer(strategy = 'median')

# Scale each feature to 0-1
scaler = MinMaxScaler(feature_range = (0, 1))

# Fit on the training data
imputer.fit(train)

# Transform both training and testing data
train = imputer.transform(train)
test = imputer.transform(test)

# Repeat with the scaler
scaler.fit(train)
train = scaler.transform(train)
test = scaler.transform(test)

print('Training data shape: ', train.shape)
print('Testing data shape: ', test.shape)
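
The same two steps can also be chained in a scikit-learn pipeline, which makes it harder to accidentally fit a step on the test data. This is a sketch on made-up arrays, assuming a scikit-learn version that provides `SimpleImputer` (0.20 or later); it is an alternative formulation, not what the cell above does:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Made-up data with missing values
X_train_toy = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan], [5.0, 40.0]])
X_test_toy = np.array([[np.nan, 30.0], [4.0, np.nan]])

# Median imputation followed by scaling to [0, 1]
prep = make_pipeline(SimpleImputer(strategy='median'),
                     MinMaxScaler(feature_range=(0, 1)))

# Medians, minima and maxima are learned from the training data only
X_train_prep = prep.fit_transform(X_train_toy)
X_test_prep = prep.transform(X_test_toy)

print(X_train_prep)
print(X_test_prep)
```

The missing value in the first test row is filled with the training median of that column (3), which the scaler then maps to 0.5.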

Train the model

In [None]:
from sklearn.linear_model import LogisticRegression

# Make the model with the specified regularization parameter
log_reg = LogisticRegression(C = 0.0001)

# Train on the training data
log_reg.fit(train, train_labels)

In [None]:
log_reg_pred = log_reg.predict_proba(test)[:, 1]

# Submission dataframe
submit = app_test[['SK_ID_CURR']]
submit['TARGET'] = log_reg_pred

# Save the submission to a csv file
submit.to_csv('log_reg_baseline.csv', index = False)

submit.head(10)

The logistic regression baseline should score around 0.671 when submitted
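
The leaderboard score for this competition is the area under the ROC curve (ROC AUC). A score can be estimated locally before submitting by holding out part of the labelled data. The sketch below does this on synthetic data from `make_classification`; the data, the 25% split and the class weights are assumptions, so the printed value says nothing about the real competition data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the real features and labels
X_syn, y_syn = make_classification(n_samples=2000, n_features=10,
                                   weights=[0.9, 0.1], random_state=0)

# Hold out 25% of the labelled data, keeping the class ratio in both parts
X_tr, X_val, y_tr, y_val = train_test_split(
    X_syn, y_syn, test_size=0.25, stratify=y_syn, random_state=0)

clf = LogisticRegression(C=0.0001)
clf.fit(X_tr, y_tr)

# AUC is computed from predicted probabilities, not hard 0/1 predictions
val_auc = roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1])
print('Validation ROC AUC: %.3f' % val_auc)
```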

#### * Random Forest
Random forest has been a popular machine learning model for regression and classification problems over the last two decades. It is fairly parameter friendly: the main setting is the number of trees (which together form the 'forest'). Here we use 100 trees.


In [None]:
from sklearn.ensemble import RandomForestClassifier

# Make the random forest classifier
random_forest = RandomForestClassifier(n_estimators = 100, random_state = 50, verbose = 1, n_jobs = -1)

In [None]:
# Train on the training data
random_forest.fit(train, train_labels)

# Extract feature importances
feature_importance_values = random_forest.feature_importances_
feature_importances = pd.DataFrame({'feature': features, 'importance': feature_importance_values})

# Make predictions on the test data
predictions = random_forest.predict_proba(test)[:, 1]

In [None]:
# Make a submission dataframe
submit = app_test[['SK_ID_CURR']]
submit['TARGET'] = predictions

# Save the submission dataframe
submit.to_csv('random_forest_baseline.csv', index = False)

These predictions will also be available when we run the entire notebook.

This model should score around 0.678 when submitted.

#### * Support Vector Machine
Like random forest, SVM is another popular machine learning model, invented a bit earlier, in the 1990s. Both RF and SVM used to be state-of-the-art models in many tasks before deep learning took over many fields. Still, RF, SVM and similar traditional models remain among the best choices for specific tasks. Important parameters are the penalty coefficient C and the kernel coefficient gamma (for kernel functions such as RBF).

In [None]:
## I commented this cell out, guess why? Training an SVM directly on the full data is slow

# from sklearn.svm import SVC
# # probability=True is required for predict_proba below
# svm = SVC(C=1.0, probability=True)
# svm.fit(train, train_labels)

In [None]:
# svm_pred = svm.predict_proba(test)[:, 1]

# # Submission dataframe
# submit = app_test[['SK_ID_CURR']]
# submit['TARGET'] = svm_pred

# # Save the submission to a csv file
# submit.to_csv('svm_baseline.csv', index = False)

# submit.head(10)
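
One possible way to keep the SVM tractable is to fit it on a random subsample of the training data. The sketch below shows the idea on synthetic data; the subsample size of 500 and the model settings are arbitrary assumptions, and this was not run on the competition data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in for the scaled training features and labels
X_big, y_big = make_classification(n_samples=5000, n_features=10, random_state=0)

# Fit on a random subsample, since kernel SVM training time grows quickly
# with the number of samples
rng = np.random.default_rng(0)
subset = rng.choice(len(X_big), size=500, replace=False)

# probability=True is needed for predict_proba
svm_small = SVC(C=1.0, gamma='scale', probability=True, random_state=0)
svm_small.fit(X_big[subset], y_big[subset])

svm_proba = svm_small.predict_proba(X_big[:5])[:, 1]
print(svm_proba)
```

Subsampling trades accuracy for speed, so the result should be checked against a validation set before it is trusted.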

### 3.2 Machine learning using engineered features 

#### * Random Forest

In [None]:
poly_features_names = list(app_train_poly.columns)

# Impute the polynomial features
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy = 'median')

poly_features = imputer.fit_transform(app_train_poly)
poly_features_test = imputer.transform(app_test_poly)

# Scale the polynomial features
scaler = MinMaxScaler(feature_range = (0, 1))

poly_features = scaler.fit_transform(poly_features)
poly_features_test = scaler.transform(poly_features_test)

random_forest_poly = RandomForestClassifier(n_estimators = 100, random_state = 50, verbose = 1, n_jobs = -1)

In [None]:
# Train on the training data
random_forest_poly.fit(poly_features, train_labels)

# Make predictions on the test data
predictions = random_forest_poly.predict_proba(poly_features_test)[:, 1]

In [None]:
# Make a submission dataframe
submit = app_test[['SK_ID_CURR']]
submit['TARGET'] = predictions

# Save the submission dataframe
submit.to_csv('random_forest_baseline_engineered.csv', index = False)

This model scored 0.678 when submitted to the competition, exactly the same as that without the engineered features. Given these results, it does not appear that our feature construction helped in this case.

### 3.3 Machine learning with Domain Features

#### * Random Forest

In [None]:
app_train_domain = app_train_domain.drop(columns = 'TARGET')

domain_features_names = list(app_train_domain.columns)

# Impute the domain features
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy = 'median')

domain_features = imputer.fit_transform(app_train_domain)
domain_features_test = imputer.transform(app_test_domain)

# Scale the domain features
scaler = MinMaxScaler(feature_range = (0, 1))

domain_features = scaler.fit_transform(domain_features)
domain_features_test = scaler.transform(domain_features_test)

random_forest_domain = RandomForestClassifier(n_estimators = 100, random_state = 50, verbose = 1, n_jobs = -1)

# Train on the training data
random_forest_domain.fit(domain_features, train_labels)

# Extract feature importances
feature_importance_values_domain = random_forest_domain.feature_importances_
feature_importances_domain = pd.DataFrame({'feature': domain_features_names, 'importance': feature_importance_values_domain})

# Make predictions on the test data
predictions = random_forest_domain.predict_proba(domain_features_test)[:, 1]

In [None]:
# Make a submission dataframe (copy so we do not assign into a slice of app_test)
submit = app_test[['SK_ID_CURR']].copy()
submit['TARGET'] = predictions

# Save the submission dataframe
submit.to_csv('random_forest_baseline_domain.csv', index = False)

This scores 0.679 when submitted, which suggests that the engineered domain features do not help much in this model (however, they do help in the Gradient Boosting Model at the end of the notebook).

> ## 4 Model Interpretation: Feature Importances
As a simple way to see which variables are the most relevant, we can look at the feature importances of the random forest. Given the correlations we saw in the exploratory data analysis, we should expect the most important features to be the `EXT_SOURCE` variables and `DAYS_BIRTH`. We may use these feature importances as a method of dimensionality reduction in future work.

In [None]:
def plot_feature_importances(df):
    """
    Plot importances returned by a model. This can work with any measure of
    feature importance provided that higher importance is better. 
    
    Args:
        df (dataframe): feature importances. Must have the features in a column
        called `feature` and the importances in a column called `importance`
        
    Returns:
        shows a plot of the 15 most important features
        
        df (dataframe): feature importances sorted by importance (highest to lowest) 
        with a column for normalized importance
    """
    
    # Sort features according to importance
    df = df.sort_values('importance', ascending = False).reset_index()
    
    # Normalize the feature importances to add up to one
    df['importance_normalized'] = df['importance'] / df['importance'].sum()

    # Make a horizontal bar chart of feature importances
    plt.figure(figsize = (10, 6))
    ax = plt.subplot()
    
    # Need to reverse the index to plot most important on top
    ax.barh(list(reversed(list(df.index[:15]))), 
            df['importance_normalized'].head(15), 
            align = 'center', edgecolor = 'k')
    
    # Set the yticks and labels
    ax.set_yticks(list(reversed(list(df.index[:15]))))
    ax.set_yticklabels(df['feature'].head(15))
    
    # Plot labeling
    plt.xlabel('Normalized Importance'); plt.title('Feature Importances')
    plt.show()
    
    return df

In [None]:
# Show the feature importances (from random forest) for the default features
feature_importances_sorted = plot_feature_importances(feature_importances)

As expected, the most important features are those dealing with `EXT_SOURCE` and `DAYS_BIRTH`. I'll quote further from this [notebook](https://www.kaggle.com/willkoehrsen/start-here-a-gentle-introduction/), since its explanation is really easy to follow:
> We see that there are only a handful of features with a significant importance to the model, which suggests we may be able to drop many of the features without a decrease in performance (and we may even see an increase in performance.) Feature importances are not the most sophisticated method to interpret a model or perform dimensionality reduction, but they let us start to understand what factors our model takes into account when it makes predictions.
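
One simple way to act on this is to keep only the features needed to reach a chosen share of the total importance. The following is a minimal sketch, not the method used in this notebook: `select_by_cumulative_importance` and the `threshold` parameter are hypothetical names, and the toy importances are invented for illustration. It assumes a dataframe with `feature` and `importance` columns, like the one passed to `plot_feature_importances`.

```python
import pandas as pd

def select_by_cumulative_importance(fi, threshold=0.95):
    """Return the most important features whose normalized importances
    together reach `threshold` (hypothetical helper)."""
    fi = fi.sort_values('importance', ascending=False).reset_index(drop=True)
    fi['importance_normalized'] = fi['importance'] / fi['importance'].sum()
    fi['cumulative_importance'] = fi['importance_normalized'].cumsum()

    # Keep every feature strictly below the threshold, plus the one that crosses it
    n_keep = int((fi['cumulative_importance'] < threshold).sum()) + 1
    n_keep = min(n_keep, len(fi))
    return list(fi['feature'].head(n_keep))

# Toy importances, invented for illustration only
toy_importances = pd.DataFrame({'feature': ['a', 'b', 'c', 'd'],
                                'importance': [50.0, 30.0, 15.0, 5.0]})

kept = select_by_cumulative_importance(toy_importances, threshold=0.9)
print(kept)
```

Whether dropping the remaining features hurts or helps would still need to be checked by retraining the model on the reduced set.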

> ## 5. Feature Engineering does help
After so much discussion of features, it is worth showing that feature engineering can actually improve performance. With other classification models, the engineered features noticeably improve predictions of client behavior.

#### * Light Gradient Boosting Machine - with original features

In [None]:
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder
import lightgbm as lgb
import gc

def model(features, test_features, encoding = 'ohe', n_folds = 5):
    
    """Train and test a light gradient boosting model using
    cross validation. 
    
    Parameters
    --------
        features (pd.DataFrame): 
            dataframe of training features to use 
            for training a model. Must include the TARGET column.
        test_features (pd.DataFrame): 
            dataframe of testing features to use
            for making predictions with the model. 
        encoding (str, default = 'ohe'): 
            method for encoding categorical variables. Either 'ohe' for one-hot encoding or 'le' for integer label encoding
        n_folds (int, default = 5): 
            number of folds to use for cross validation
        
    Return
    --------
        submission (pd.DataFrame): 
            dataframe with `SK_ID_CURR` and `TARGET` probabilities
            predicted by the model.
        feature_importances (pd.DataFrame): 
            dataframe with the feature importances from the model.
        metrics (pd.DataFrame): 
            dataframe with training and validation metrics (ROC AUC) for each fold and overall.
        
    """
    
    # Extract the ids
    train_ids = features['SK_ID_CURR']
    test_ids = test_features['SK_ID_CURR']
    
    # Extract the labels for training
    labels = features['TARGET']
    
    # Remove the ids and target
    features = features.drop(columns = ['SK_ID_CURR', 'TARGET'])
    test_features = test_features.drop(columns = ['SK_ID_CURR'])
    
    
    # One Hot Encoding
    if encoding == 'ohe':
        features = pd.get_dummies(features)
        test_features = pd.get_dummies(test_features)
        
        # Align the dataframes by the columns
        features, test_features = features.align(test_features, join = 'inner', axis = 1)
        
        # No categorical indices to record
        cat_indices = 'auto'
    
    # Integer label encoding
    elif encoding == 'le':
        
        # Create a label encoder
        label_encoder = LabelEncoder()
        
        # List for storing categorical indices
        cat_indices = []
        
        # Iterate through each column
        for i, col in enumerate(features):
            if features[col].dtype == 'object':
                # Map the categorical features to integers
                features[col] =  label_encoder.fit_transform(np.array(features[col].astype(str)).reshape((-1,)))
                test_features[col] = label_encoder.transform(np.array(test_features[col].astype(str)).reshape((-1,)))

                # Record the categorical indices
                cat_indices.append(i)
    
    # Catch error if label encoding scheme is not valid
    else:
        raise ValueError("Encoding must be either 'ohe' or 'le'")
        
    print('Training Data Shape: ', features.shape)
    print('Testing Data Shape: ', test_features.shape)
    
    # Extract feature names
    feature_names = list(features.columns)
    
    # Convert to np arrays
    features = np.array(features)
    test_features = np.array(test_features)
    
    # Create the kfold object
    k_fold = KFold(n_splits = n_folds, shuffle = True, random_state = 50)
    
    # Empty array for feature importances
    feature_importance_values = np.zeros(len(feature_names))
    
    # Empty array for test predictions
    test_predictions = np.zeros(test_features.shape[0])
    
    # Empty array for out of fold validation predictions
    out_of_fold = np.zeros(features.shape[0])
    
    # Lists for recording validation and training scores
    valid_scores = []
    train_scores = []
    
    # Iterate through each fold
    for train_indices, valid_indices in k_fold.split(features):
        
        # Training data for the fold
        train_features, train_labels = features[train_indices], labels[train_indices]
        # Validation data for the fold
        valid_features, valid_labels = features[valid_indices], labels[valid_indices]
        
        # Create the model
        model = lgb.LGBMClassifier(n_estimators=10000, objective = 'binary', 
                                   class_weight = 'balanced', learning_rate = 0.05, 
                                   reg_alpha = 0.1, reg_lambda = 0.1, 
                                   subsample = 0.8, n_jobs = -1, random_state = 50)
        
        # Train the model, stopping once validation AUC has not improved for 100 rounds
        model.fit(train_features, train_labels, eval_metric = 'auc',
                  eval_set = [(valid_features, valid_labels), (train_features, train_labels)],
                  eval_names = ['valid', 'train'], categorical_feature = cat_indices,
                  callbacks = [lgb.early_stopping(stopping_rounds = 100),
                               lgb.log_evaluation(period = 200)])
        
        # Record the best iteration
        best_iteration = model.best_iteration_
        
        # Record the feature importances
        feature_importance_values += model.feature_importances_ / k_fold.n_splits
        
        # Make predictions
        test_predictions += model.predict_proba(test_features, num_iteration = best_iteration)[:, 1] / k_fold.n_splits
        
        # Record the out of fold predictions
        out_of_fold[valid_indices] = model.predict_proba(valid_features, num_iteration = best_iteration)[:, 1]
        
        # Record the best score
        valid_score = model.best_score_['valid']['auc']
        train_score = model.best_score_['train']['auc']
        
        valid_scores.append(valid_score)
        train_scores.append(train_score)
        
        # Clean up memory
        gc.enable()
        del model, train_features, valid_features
        gc.collect()
        
    # Make the submission dataframe
    submission = pd.DataFrame({'SK_ID_CURR': test_ids, 'TARGET': test_predictions})
    
    # Make the feature importance dataframe
    feature_importances = pd.DataFrame({'feature': feature_names, 'importance': feature_importance_values})
    
    # Overall validation score
    valid_auc = roc_auc_score(labels, out_of_fold)
    
    # Add the overall scores to the metrics
    valid_scores.append(valid_auc)
    train_scores.append(np.mean(train_scores))
    
    # Needed for creating dataframe of validation scores
    fold_names = list(range(n_folds))
    fold_names.append('overall')
    
    # Dataframe of validation scores
    metrics = pd.DataFrame({'fold': fold_names,
                            'train': train_scores,
                            'valid': valid_scores}) 
    
    return submission, feature_importances, metrics

In [None]:
submission, fi, metrics = model(app_train, app_test)
print('Baseline metrics')
print(metrics)
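
With the default `encoding = 'ohe'`, `model` one-hot encodes the training and testing data separately and then aligns them with `join = 'inner'`, so only columns present in both survive. A small sketch of why that step matters, using a made-up two-column frame (the column names `amount` and `housing` are placeholders, not competition columns):

```python
import pandas as pd

# Toy data: the category 'parents' appears only in the training frame
train = pd.DataFrame({'amount': [1.0, 2.0, 3.0],
                      'housing': ['rent', 'own', 'parents']})
test = pd.DataFrame({'amount': [4.0, 5.0],
                     'housing': ['own', 'rent']})

# One-hot encode each frame on its own, as `model` does
train_ohe = pd.get_dummies(train)
test_ohe = pd.get_dummies(test)
print(train_ohe.shape, test_ohe.shape)

# Keep only the columns the two frames share
train_aligned, test_aligned = train_ohe.align(test_ohe, join='inner', axis=1)
print(list(train_aligned.columns))
```

Without the alignment, the training frame would carry a `housing_parents` column that the test frame lacks, and the fitted model could not be applied to the test data.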

In [None]:
fi_sorted = plot_feature_importances(fi)


In [None]:
submission.to_csv('baseline_lgb.csv', index = False)

This submission should score about 0.735 on the leaderboard.
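
The `overall` validation score in `metrics` comes from out-of-fold predictions: every training row is scored exactly once, by the model that did not see it during training. The sketch below shows the idea in isolation. It is an illustration under assumptions, not the notebook's model: it swaps in scikit-learn's `LogisticRegression` for LightGBM to stay lightweight and uses synthetic data from `make_classification`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

# Synthetic stand-in for the competition data (assumption)
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           n_clusters_per_class=1, random_state=50)

k_fold = KFold(n_splits=5, shuffle=True, random_state=50)

# One prediction slot per training row, filled fold by fold
out_of_fold = np.zeros(X.shape[0])
fold_scores = []

for train_idx, valid_idx in k_fold.split(X):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X[train_idx], y[train_idx])

    # Score only the rows held out of this fold's training data
    out_of_fold[valid_idx] = clf.predict_proba(X[valid_idx])[:, 1]
    fold_scores.append(roc_auc_score(y[valid_idx], out_of_fold[valid_idx]))

# A single AUC over all held-out predictions
overall_auc = roc_auc_score(y, out_of_fold)
print('per-fold:', np.round(fold_scores, 3), 'overall:', round(overall_auc, 3))
```

The overall figure is generally not identical to the mean of the per-fold scores, because AUC is computed over the ranking of all rows at once rather than averaged fold by fold.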

#### * Light Gradient Boosting Machine - with domain features

In [None]:
app_train_domain['TARGET'] = train_labels

# Test the domain knowledge features
submission_domain, fi_domain, metrics_domain = model(app_train_domain, app_test_domain)
print('Baseline with domain knowledge features metrics')
print(metrics_domain)

In [None]:
fi_sorted = plot_feature_importances(fi_domain)

> Again, we see that some of our features made it into the most important. Going forward, we will need to think about what other domain knowledge features may be useful for this problem (or we should consult someone who knows more about the financial industry!)

In [None]:
submission_domain.to_csv('baseline_lgb_domain_features.csv', index = False)

> This model scores about 0.754 when submitted to the public leaderboard, indicating that the domain features do improve the performance! Feature engineering is going to be a critical part of this competition (as it is for all machine learning problems).