#Introduction : Manual Feature Engineering

In this notebook, we will explore making features by hand for the Home Credit Default Risk competition.

In order to improve our score, we will have to include more information from the other dataframes.

- bureau : information about a client's previous loans with other financial institutions reported to Home Credit.

- bureau_balance : monthly information about the previous loans. Each month has its own row.


In [None]:
# pandas and numpy for data manipulation

import pandas as pd
import numpy as np

#matplotlib and seaborn for plotting
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#Suppress warnings from pandas
import warnings
warnings.filterwarnings('ignore')

plt.style.use('fivethirtyeight')

##Example : Counts of a client's previous loans

To illustrate the general process of manual feature engineering, we will first simply get the count of a client's previous loans at other financial institutions.

This requires a number of pandas operations we will make heavy use of throughout the notebook :

- groupby : group a df by a column. In this case we will group by the unique client id, sk_id_curr.
- agg : perform a calculation on the grouped data, such as taking the mean of columns. We can either call the function directly (for example .mean() on the grouped data) or pass a list of functions to agg.
- merge : match the aggregated statistics to the appropriate client. We need to merge the original training data with the calculated stats on the sk_id_curr column, which will insert NaN in any cell for which the client does not have the corresponding statistic.

We also use the rename function quite a bit specifying the cols to be renamed as a dict. This is useful in order to keep track of the new variables we create.
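As a minimal sketch of how these operations fit together, the toy example below counts loans per client and joins the counts back to a client table. The data and the names `loans`, `clients`, and `loan_counts` are made up for illustration; only the pattern is meant to carry over to the real tables.

```python
import pandas as pd

# Hypothetical toy data: three loans for client 1, one for client 2, none for client 3
loans = pd.DataFrame({'sk_id_curr': [1, 1, 1, 2],
                      'sk_id_bureau': [10, 11, 12, 13]})
clients = pd.DataFrame({'sk_id_curr': [1, 2, 3]})

# groupby + count: number of loan rows per client, renamed to a descriptive name
loan_counts = (loans.groupby('sk_id_curr', as_index=False)['sk_id_bureau']
                    .count()
                    .rename(columns={'sk_id_bureau': 'loan_count'}))

# A left merge keeps every client; clients with no loans get NaN
clients = clients.merge(loan_counts, on='sk_id_curr', how='left')

# A missing count means "no previous loans", so fill with 0
clients['loan_count'] = clients['loan_count'].fillna(0)
print(clients)
```

Client 3 has no rows in the loan table, so the left merge leaves its count missing until `fillna(0)` is applied.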

In [None]:
#Read in bureau

bureau = pd.read_csv('../input/bureau.csv')

bureau.head()

In [None]:
#Lowercase the column names, group by the client id, count the number of previous loans, and rename the column
bureau.columns = bureau.columns.str.lower()
previous_loan_counts = bureau.groupby('sk_id_curr', as_index=False)['sk_id_bureau'].count().rename(columns={'sk_id_bureau' : 'previous_loan_counts'})

previous_loan_counts.head()

In [None]:
#join to the training df (column names lowercased to match the bureau df)
train = pd.read_csv('../input/application_train.csv').rename(columns=str.lower)
train = train.merge(previous_loan_counts, on='sk_id_curr', how='left')

#fill the missing values with 0

train['previous_loan_counts'] = train['previous_loan_counts'].fillna(0)
train.head()

##Assessing Usefulness of new Variable with r value

To determine whether the new variable is useful, we can calculate the Pearson correlation coefficient (r value) between this variable and the target.

We can also visually inspect the relationship with the target using a Kernel Density Estimate (KDE) plot.
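For reference, the Pearson coefficient is the covariance of two variables divided by the product of their standard deviations. The sketch below, on made-up values, checks that `Series.corr` (which uses the Pearson method by default) agrees with that definition; `x` and `y` are hypothetical stand-ins for a feature and the target.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only
x = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([0, 0, 1, 0, 1, 1])

# pandas computes the Pearson coefficient by default
r_pandas = x.corr(y)

# The same quantity from its definition: cov(x, y) / (std(x) * std(y))
x_c, y_c = x - x.mean(), y - y.mean()
r_manual = (x_c * y_c).sum() / np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum())

print(round(r_pandas, 4), round(r_manual, 4))
```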

KDE

In [None]:
#Plots the distribution of a variable colored by value of the target
def kde_target(var_name, df) :

  #Calculate the correlation coefficient between the new variable and the target
  corr = df['target'].corr(df[var_name])

  #Calculate medians for repaid vs not repaid
  #(.ix has been removed from pandas, so rows are selected with .loc)
  avg_repaid = df.loc[df['target'] == 0, var_name].median()
  avg_not_repaid = df.loc[df['target'] == 1, var_name].median()

  plt.figure(figsize=(12,6))

  #Plot the distribution for target == 0 and target == 1

  sns.kdeplot(df.loc[df['target'] == 0, var_name], label = 'target == 0')
  sns.kdeplot(df.loc[df['target'] == 1, var_name], label = 'target == 1')


  #Label the plot
  plt.xlabel(var_name)
  plt.ylabel('density')
  plt.title('%s distribution' % var_name)
  plt.legend()

  #Print out the correlation
  print('The correlation between %s and the target is %0.4f' % (var_name, corr))
  print('median value for loan that was not repaid = %0.4f' % avg_not_repaid)
  print('median value for loan that was repaid = %0.4f' % avg_repaid)
  

We can test this function using the ext_source_3 variable which we found to be one of the most important variables according to a random forest and gradient boosting machine.

In [None]:
kde_target('ext_source_3', train)

In [None]:
kde_target('previous_loan_counts', train)

###Aggregating Numeric Columns

In [None]:
#Group by the client id, calculate aggregation statistics on the numeric columns

bureau_agg = bureau.drop(columns=['sk_id_bureau']).select_dtypes('number').groupby('sk_id_curr').agg(['count','mean','max','min','sum']).reset_index()

bureau_agg.head()

In [None]:
#List of col names
columns = ['sk_id_curr']

#Iterate through the variables names
for var in bureau_agg.columns.levels[0] :
  #Skip the id name
  if var != 'sk_id_curr' :
    #Iterate through the stat names
    for stat in bureau_agg.columns.levels[1][:-1] :
      #make a new column name for the variable and stat
      columns.append('bureau_%s_%s' %(var, stat))

In [None]:
#Assign the list of column names as the df column names
bureau_agg.columns = columns
bureau_agg.head()

In [None]:
#Merge with the training data
train = train.merge(bureau_agg, on='sk_id_curr', how='left')
train.head()

Correlations of Aggregated Values with Target

We can calculate the correlation of all new values with the target. Again, we can use these as an approximation of the variables which may be important for modeling.

In [None]:
#list of new correlations
new_corrs = []

#Iterate through the columns
for col in columns :
  #Calculate correlation with the target
  corr = train['target'].corr(train[col])

  #Append the list as a tuple

  new_corrs.append((col, corr))
  

In the code below, we sort the correlations by magnitude (absolute value) using the sorted function. We also make use of an anonymous lambda function.

In [None]:
#sort the correlations by the abs values
#make sure to reverse to put the largest values at the front of the list
new_corrs = sorted(new_corrs, key= lambda x : abs(x[1]), reverse=True)

new_corrs[:15]

None of the new variables has a significant correlation with the target. We can look at the KDE plot of the variable with the highest absolute correlation, bureau_days_credit_mean.

In [None]:
kde_target('bureau_days_credit_mean', train)

The Multiple Comparisons problem

##Function for Numeric Aggregations

Let's encapsulate all of the previous work into a function. This will allow us to compute aggregate stats for numeric columns across any df. We will reuse this function when we want to apply the same operations to other dfs.

In [None]:
def agg_numeric(df, group_var, df_name) :
  '''Aggregates the numeric values in a df. This can be used to create features for each instance of the grouping variable.

  parameters :
  - df : the df to calculate the statistics on
  - group_var : the variable by which to group df
  - df_name : the name used as a prefix when renaming the columns

  return
  --------
  agg : a df with one row per value of group_var and one column per variable and statistic
  '''
  #Remove id variables other than grouping variable
  for col in df :
    if col != group_var and 'sk_id' in col :
      df = df.drop(columns=col)

  #Keep the numeric columns and make sure the grouping variable is among them
  group_ids = df[group_var]
  numeric_df = df.select_dtypes('number').copy()
  numeric_df[group_var] = group_ids

  #Group by the specified variable and calculate the statistics
  agg = numeric_df.groupby(group_var).agg(['count','mean','max','min','sum']).reset_index()

  #Need to create new column names
  columns = [group_var]

  #Iterate through the variable names
  for var in agg.columns.levels[0] :
    #Skip the grouping variable
    if var != group_var :
      #Iterate through the stat names
      for stat in agg.columns.levels[1][:-1] :
        #Make a new column name for the variable and stat
        columns.append('%s_%s_%s' % (df_name, var, stat))

  agg.columns = columns
  return agg

In [None]:
bureau_agg_new = agg_numeric(bureau.drop(columns= ['sk_id_bureau']), group_var = 'sk_id_curr', df_name = 'bureau')
bureau_agg_new.head()

In [None]:
#Function to calculate correlations with the target for a df

def target_corrs(df) :

  corrs = []

  for col in df.columns :
    print(col)
    if col != 'target' :
      corr = df['target'].corr(df[col])

      #Append the column name and the correlation as a tuple
      corrs.append((col, corr))

  #Each entry is (column name, correlation value), so x[1] sorts by the absolute correlation
  corrs = sorted(corrs, key=lambda x : abs(x[1]), reverse=True)
  return corrs

In [None]:
categorical = pd.get_dummies(bureau.select_dtypes('object'))
categorical['sk_id_curr'] = bureau['sk_id_curr']
categorical.head()

In [None]:
categorical_grouped = categorical.groupby('sk_id_curr').agg(['sum','mean'])
categorical_grouped.head()

In [None]:
categorical_grouped.columns.levels[0][:10]
#levels[0] holds the variable names, levels[1] holds the stat names

In [None]:
categorical_grouped.columns.levels[1]
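After `agg` is called with a list of functions, the columns form a two-level index of (variable, statistic) pairs. The sketch below shows that structure on a made-up frame and flattens it by iterating over the column tuples directly, which is an alternative to looping over `levels` and does not depend on the order in which the levels are stored. The frame `toy` and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame: one indicator column per category value
toy = pd.DataFrame({'sk_id_curr': [1, 1, 2],
                    'status_Active': [1, 0, 1],
                    'status_Closed': [0, 1, 0]})

grouped = toy.groupby('sk_id_curr').agg(['sum', 'mean'])

# Each column label is a (variable, statistic) tuple
print(list(grouped.columns))

# Join the two levels into a single flat name per column
grouped.columns = ['%s_%s' % (var, stat) for var, stat in grouped.columns]
print(grouped.columns.tolist())
```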

In [None]:
group_var = 'sk_id_curr'

#need to create new column names
columns = []

#iterate through the variable names
for var in categorical_grouped.columns.levels[0] :
  #skip the grouping variable
  if var != group_var :
    #the sum of an indicator is a count, the mean is a normalized count
    for stat in ['count','count_norm'] :
      #make a new column name for the variable and stat
      columns.append('%s_%s' % (var, stat))

#rename the columns
categorical_grouped.columns = columns
categorical_grouped.head()

In [None]:
train = train.merge(categorical_grouped, left_on = 'sk_id_curr', right_index = True, how = 'left')

train.head()

In [None]:
train.shape

In [None]:
train.iloc[:10, 123:]

Function to handle categorical variables

To make the code more efficient, we can now write a function to handle the categorical variables for us. This will take the same form as the agg_numeric function in that it accepts a df and a grouping variable. Then it will calculate the counts and normalized counts of each category for all categorical variables in the df.
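To make "counts" and "normalized counts" concrete, here is a small sketch on made-up data. After one-hot encoding, the sum of an indicator column within a group is the number of rows in that category, and the mean is the share of the group's rows in that category. The frame `toy` and the category values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: client 1 has two active loans and one closed, client 2 has one closed
toy = pd.DataFrame({'sk_id_curr': [1, 1, 1, 2],
                    'credit_active': ['Active', 'Active', 'Closed', 'Closed']})

# One-hot encode the categorical column, then put the id back
dummies = pd.get_dummies(toy[['credit_active']]).astype(int)
dummies['sk_id_curr'] = toy['sk_id_curr']

# sum of an indicator = count of that category; mean = share of the client's rows
counts = dummies.groupby('sk_id_curr').agg(['sum', 'mean'])
print(counts)
```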

In [None]:
def count_categorical(df, group_var, df_name) :
  ''' parameters
  - df : df to calculate the value counts for
  - group_var : the variable by which to group the df. For each unique value of this variable, the final df will have one row
  - df_name : variable added to the front of column names to keep track of columns
  return
  -------
  categorical : a df with counts and normalized counts of each unique category in every categorical variable, with one row for every unique value of the group_var
  '''

  #select the categorical columns
  categorical = pd.get_dummies(df.select_dtypes('object'))

  #make sure to put the identifying id on the column
  categorical[group_var] = df[group_var]

  #group by the group var and calculate the sum and mean
  categorical = categorical.groupby(group_var).agg(['sum','mean'])

  column_names = []

  #iterate through the columns in level 0
  for var in categorical.columns.levels[0] :
    #the sum of an indicator is a count, the mean is a normalized count
    for stat in ['count','count_norm'] :
      #make a new column name
      column_names.append('%s_%s_%s' % (df_name, var, stat))


  categorical.columns = column_names

  return categorical

In [None]:
bureau_counts = count_categorical(bureau, group_var = 'sk_id_curr', df_name = 'bureau')
bureau_counts.head()

###Applying Operations to another dataframe

We will now turn to the bureau_balance df. This df has monthly information about **each client's previous loans with other financial institutions**.

Instead of grouping this df by sk_id_curr, which is the client id, we will first group the df by sk_id_bureau, which is the id of the previous loan. This will give us ***one row*** of the df for each loan. Then, we can group by sk_id_curr and calculate the aggregations across the loans of each client. The final result will be a df with one row for each client, with stats calculated for their loans.

In [None]:
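#read in bureau_balance with lowercase column names, as for the other dfs
bureau_balance = pd.read_csv('../input/bureau_balance.csv').rename(columns=str.lower)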

bureau_balance.head()

In [None]:
bureau_balance_counts = count_categorical(bureau_balance, group_var = 'sk_id_bureau', df_name = 'bureau_balance')
bureau_balance_counts.head()

Now we can handle the one numeric column. The months_balance column has the 'months of balance relative to application date.' This might not necessarily be that important as a numeric variable, and in future work we might want to consider this as a time variable. For now, we can just calculate the same aggregation statistics as previously.

In [None]:
#calculate aggregation statistics for each sk_id_bureau
bureau_balance_agg = agg_numeric(bureau_balance, group_var = 'sk_id_bureau', df_name = 'bureau_balance')

bureau_balance_agg.head()

The above dataframes have the calculations done on each loan. Now we need to aggregate these for each client. We can do this by merging the dataframes together first and then, since all the variables are numeric, we just need to aggregate the statistics again, this time grouping by sk_id_curr.

In [None]:
#df grouped by the loan
bureau_by_loan = bureau_balance_agg.merge(bureau_balance_counts, right_index= True, left_on='sk_id_bureau', how='outer')

#merge to include the sk_id_curr
bureau_by_loan = bureau_by_loan.merge(bureau[['sk_id_bureau','sk_id_curr']], on='sk_id_bureau', how='left')

bureau_by_loan.head()

In [None]:
bureau_balance_by_client = agg_numeric(bureau_by_loan.drop(columns=['sk_id_bureau']), group_var = 'sk_id_curr', df_name = 'client')
bureau_balance_by_client.head()

To recap, for the bureau_balance df we :
1. Calculated numeric stats grouping by each loan
2. Made value counts of each categorical variable grouping by loan
3. Merged the stats and the value counts on the loans
4. Calculated numeric stats for the resulting dataframe grouping by the client id

The final resulting df has one row for each client, with statistics calculated for all of their loans with monthly balance information.

Some of these variables are a little confusing, so let's try to explain a few :

- client_bureau_balance_months_balance_mean_mean : for each loan, calculate the mean value of months_balance. Then for each client, calculate the mean of this value for all of their loans.
- client_bureau_balance_status_X_count_norm_sum : for each loan, calculate the number of occurrences of status == X divided by the total number of status values for the loan. Then for each client, add up the values for each loan.
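The two-stage aggregation behind a name such as a "mean of means" can be sketched on made-up data as follows. The frames `monthly` and `loans` and the values in them are hypothetical; the steps mirror the loan-level and client-level grouping described above.

```python
import pandas as pd

# Hypothetical monthly records: loans 10 and 11 belong to client 1, loan 12 to client 2
monthly = pd.DataFrame({'sk_id_bureau': [10, 10, 11, 11, 11, 12],
                        'months_balance': [-1, -3, -2, -4, -6, -5]})
loans = pd.DataFrame({'sk_id_bureau': [10, 11, 12],
                      'sk_id_curr': [1, 1, 2]})

# Stage 1: one row per loan (mean of months_balance over that loan's months)
by_loan = (monthly.groupby('sk_id_bureau', as_index=False)['months_balance']
                  .mean()
                  .rename(columns={'months_balance': 'months_balance_mean'}))

# Attach the client id to each loan
by_loan = by_loan.merge(loans, on='sk_id_bureau', how='left')

# Stage 2: one row per client (mean of the per-loan means)
by_client = (by_loan.groupby('sk_id_curr')['months_balance_mean']
                    .mean()
                    .rename('months_balance_mean_mean'))
print(by_client)
```

Note that a mean of per-loan means weights every loan equally, so it generally differs from the mean over all monthly rows of a client: for client 1 here the pooled mean is -3.2, while the mean of means is -3.0.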

#Putting the Functions Together




We now have all the pieces in place to take the information from the previous loans at other institutions and the monthly payments information about these loans and put them into the main training df. Let's do a reset of all the variables and then use the functions we built to do this from the ground up. This demonstrates the benefit of using functions for repeatable workflows.

In [None]:
#Free up memory by deleting old objects
import gc
gc.enable()

del train, bureau, bureau_balance, bureau_agg, bureau_agg_new, bureau_balance_agg, bureau_balance_counts, bureau_by_loan, bureau_balance_by_client, bureau_counts
gc.collect()

In [None]:
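#read in new copies of all the dfs, with lowercase column names as before
train = pd.read_csv('../input/application_train.csv').rename(columns=str.lower)
bureau = pd.read_csv('../input/bureau.csv').rename(columns=str.lower)
bureau_balance = pd.read_csv('../input/bureau_balance.csv').rename(columns=str.lower)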


Counts of Bureau DF

In [None]:
bureau_counts = count_categorical(bureau, group_var ='sk_id_curr', df_name = 'bureau')
bureau_counts.head()

Aggregated stats of bureau dataframe

In [None]:
bureau_agg = agg_numeric(bureau.drop(columns=['sk_id_bureau']), group_var='sk_id_curr', df_name = 'bureau')
bureau_agg.head()

Value counts of Bureau Balance df by loan

In [None]:
bureau_balance_counts = count_categorical(bureau_balance, group_var='sk_id_bureau', df_name = 'bureau_balance')
bureau_balance_counts.head()

Aggregated stats of Bureau Balance df by client

In [None]:
#numeric stats of bureau_balance by loan (recomputed because the earlier copy was deleted)
bureau_balance_agg = agg_numeric(bureau_balance, group_var='sk_id_bureau', df_name='bureau_balance')

#df grouped by the loan
bureau_by_loan = bureau_balance_agg.merge(bureau_balance_counts, right_index=True, left_on='sk_id_bureau', how='outer')

#merge to include the sk_id_curr
bureau_by_loan = bureau[['sk_id_curr', 'sk_id_bureau']].merge(bureau_by_loan, on='sk_id_bureau', how='left')

#Aggregate the stats for each client
bureau_balance_by_client = agg_numeric(bureau_by_loan.drop(columns=['sk_id_bureau']), group_var='sk_id_curr', df_name='client')



###Insert Computed Features into Training Data

In [None]:
original_features = list(train.columns)
print('original number of features : ', len(original_features))

In [None]:
#merge with the value counts of bureau
train = train.merge(bureau_counts, on = 'sk_id_curr', how='left')

#merge with the stats of bureau
train = train.merge(bureau_agg, on='sk_id_curr', how='left')

#merge with the monthly information grouped by client
train = train.merge(bureau_balance_by_client, on='sk_id_curr', how='left')

In [None]:
new_features = list(train.columns)
print('# features using previous loans from other institutions data : ', len(new_features))

#Feature Engineering Outcomes

After all that work, we now want to take a look at the variables we have created. We can look at the **percentage of missing values**, the **correlations** of variables with the target, and also the correlation of variables with the other variables. The correlations between variables can show if we have collinear variables, that is, variables that are highly correlated with one another. Often, we want to remove one in a pair of collinear variables because having both variables would be redundant. We can also use the percentage of missing values to remove features with a substantial majority of values that are not present.

Feature selection will be an important focus going forward, because reducing the number of features can help the model learn during training and also generalize better to the testing data. The 'curse of dimensionality' is the name given to the issues caused by having too many features (too high of a dimension). As the number of variables increases, the number of datapoints needed to learn the relationship between these variables and the target value increases exponentially.

##Missing values

An important consideration is the missing values in the df. Columns with too many missing values might have to be dropped.
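As a minimal sketch of the idea, the percentage of missing values per column can be computed with `isnull().mean()` and compared against a cutoff. The frame `toy` and the 50% threshold below are made up for illustration and are not the values used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with different amounts of missing data per column
toy = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                    'b': [np.nan, np.nan, np.nan, 1.0],
                    'c': [1.0, np.nan, 3.0, 4.0]})

# isnull().mean() is the fraction of missing rows per column
missing_percent = toy.isnull().mean() * 100
print(missing_percent)

# Illustrative threshold: drop columns with more than half of their values missing
threshold = 50
to_drop = list(missing_percent.index[missing_percent > threshold])
reduced = toy.drop(columns=to_drop)
print(reduced.columns.tolist())
```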

In [None]:
# Function to calculate missing values by column

def missing_values_table(df) :
  #Total missing values
  mis_val = df.isnull().sum()

  #Percentage of missing values
  mis_val_percent = (mis_val / len(df)) * 100

  #Make a table with the results
  mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)

  #Rename the columns
  mis_val_table_ren_columns = mis_val_table.rename(columns={0 : 'Missing Values', 1 : '% of total values'})

  #Keep only the columns with missing values and sort by percentage of missing descending
  mis_val_table_ren_columns = mis_val_table_ren_columns[mis_val_table_ren_columns.iloc[:,1] != 0].sort_values('% of total values', ascending=False).round(1)

  #Print some summary information
  print('Your selected df has ' + str(df.shape[1]) + ' columns.\n' +
        'There are ' + str(mis_val_table_ren_columns.shape[0]) + ' columns that have missing values.')

  return mis_val_table_ren_columns

In [None]:
missing_train = missing_values_table(train)

missing_train

In [None]:
missing_train_vars = list(missing_train.index[missing_train['% of total values'] > 90])

len(missing_train_vars)

Calculate information for testing data

In [None]:
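#read in the test data with lowercase column names, as for the training data
test = pd.read_csv('../input/application_test.csv').rename(columns=str.lower)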

#merge with the value counts of bureau
test = test.merge(bureau_counts, on = 'sk_id_curr', how='left')

#merge with the stats of bureau
test = test.merge(bureau_agg, on='sk_id_curr', how='left')

#merge with the value counts of bureau balance
test = test.merge(bureau_balance_by_client, on='sk_id_curr', how='left')

In [None]:
print('shape of testing data : ', test.shape)

In [None]:
train_labels = train['target']

train, test = train.align(test, join='inner', axis=1)

train['target'] = train_labels

In [None]:
missing_test = missing_values_table(test)
missing_test.head()

In [None]:
missing_test_vars = list(missing_test.index[missing_test['% of total values'] > 90])

len(missing_test_vars)

In [None]:
missing_columns = list(set(missing_test_vars + missing_train_vars))
print('There are %d columns with more than 90%% missing in either the train or testing data.' % len(missing_columns))

In [None]:
#drop the missing columns (pass the list itself, not its name as a string)
train = train.drop(columns=missing_columns)
test = test.drop(columns=missing_columns)

##Correlations



In [None]:
#correlations are only defined for the numeric columns
corrs = train.select_dtypes('number').corr()

In [None]:
corrs = corrs.sort_values('target', ascending=False)

#Ten most positive correlations
pd.DataFrame(corrs['target'].head(10))

In [None]:
#Ten most negative correlations (undefined correlations sort last, so drop them first)
pd.DataFrame(corrs['target'].dropna().tail(10))

The variable most highly correlated with the target is a variable we created. However, just because the variable is correlated does not mean that it will be useful, and we have to remember that if we generate hundreds of new variables, some are going to be correlated with the target simply because of random noise.

Viewing the correlations skeptically, it does appear that several of the newly created variables may be useful. To assess the 'usefulness' of variables, we will look at the feature importances returned by the model.
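The point about random noise can be illustrated with a small simulation. The sketch below uses made-up sizes (200 rows, 500 features) and generates features that are pure noise by construction, then checks how large their correlation with a random target can look when only the best one is reported.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# 200 rows, a random binary target, and 500 features of pure noise
n_rows, n_features = 200, 500
target = pd.Series(rng.integers(0, 2, size=n_rows))
noise = pd.DataFrame(rng.normal(size=(n_rows, n_features)))

# Absolute correlation of every noise feature with the target
noise_corrs = noise.corrwith(target).abs()

print('largest |r| among noise features: %0.3f' % noise_corrs.max())
print('median |r| among noise features: %0.3f' % noise_corrs.median())
```

The largest value is several times the typical one even though no feature carries any signal, which is why the top of a sorted correlation list should be read with caution.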

In [None]:
kde_target(var_name = 'client_bureau_balance_counts_mean', df=train)

This variable represents the average number of monthly records per loan for each client. Based on the distribution, clients with a greater number of average monthly records per loan were more likely to repay their loans with Home Credit.

In [None]:
kde_target(var_name='bureau_credit_active_Active_count_norm', df=train)

###Collinear Variables

We can calculate not only the correlations of the variables with the target, but also the correlation of each variable with every other variable. This will allow us to see if there are highly collinear variables that should perhaps be removed from the data.
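One common way to find such pairs, sketched here on made-up data, is to look only at the upper triangle of the absolute correlation matrix so that each pair is considered once. This is an alternative formulation rather than the loop used in this notebook, and the frame `toy` and its columns are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=100)

# Hypothetical features: 'b' is nearly a copy of 'a', 'c' is independent
toy = pd.DataFrame({'a': base,
                    'b': base + rng.normal(scale=0.01, size=100),
                    'c': rng.normal(size=100)})

corr_matrix = toy.corr().abs()

# Keep only the entries above the diagonal so each pair appears once
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1))

# Drop the second member of any pair whose correlation exceeds the threshold
threshold = 0.8
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
print(to_drop)
```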

In [None]:
#set the threshold
threshold= 0.8

above_threshold_vars= {}

for col in corrs :
  above_threshold_vars[col] = list(corrs.index[corrs[col] > threshold])

In [None]:
#track columns to remove and columns already examined
cols_to_remove = []
cols_seen = []
cols_to_remove_pair=[]

#iterate through columns and correlated columns
for key, value in above_threshold_vars.items() :
  cols_seen.append(key)
  for x in value :
    if x == key :
      #a column is always perfectly correlated with itself
      continue
    else:
      #only want to remove one in a pair
      if x not in cols_seen :
        cols_to_remove.append(x)
        cols_to_remove_pair.append(key)

cols_to_remove = list(set(cols_to_remove))

print('# columns to remove', len(cols_to_remove))

In [None]:
train_corrs_removed = train.drop(columns=cols_to_remove)
test_corrs_removed = test.drop(columns=cols_to_remove)



In [None]:
train_corrs_removed.to_csv('train_bureau_corrs_removed.csv', index=False)
test_corrs_removed.to_csv('test_bureau_corrs_removed.csv', index=False)

#Modeling

To actually test the performance of these new datasets, we will try using them for machine learning!

We use the model shown below for all datasets :
- control : only the data in the application files.
- test one : the data in the application files with all of the data recorded from the bureau and bureau_balance files.
- test two : the data in the application files with all of the data recorded from the bureau and bureau_balance files with highly correlated variables removed.
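Each dataset is scored with K-fold cross validation, where every row is predicted by a model that never saw it during training. The sketch below illustrates that out-of-fold bookkeeping with plain numpy on made-up data. The "model" is a deliberately trivial threshold rule standing in for the gradient boosting model, and the fold splitting is done by hand rather than with scikit-learn; both are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: one feature and a binary label that depends on it
x = rng.normal(size=100)
y = (x + rng.normal(scale=0.5, size=100) > 0).astype(int)

# Shuffle the row indices and split them into folds
n_folds = 5
indices = rng.permutation(len(x))
folds = np.array_split(indices, n_folds)

out_of_fold = np.full(len(x), np.nan)

for valid_idx in folds:
    train_idx = np.setdiff1d(indices, valid_idx)
    x_train, y_train = x[train_idx], y[train_idx]

    # Stand-in "model": a cut halfway between the class means of the training rows
    cut = (x_train[y_train == 1].mean() + x_train[y_train == 0].mean()) / 2

    # Predict only on the rows this fold's model did not see
    out_of_fold[valid_idx] = (x[valid_idx] > cut).astype(float)

accuracy = (out_of_fold == y).mean()
print('out-of-fold accuracy: %0.2f' % accuracy)
```

Because every entry of `out_of_fold` comes from a model fit without that row, a score computed on it is an estimate of performance on unseen data.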

In [None]:
import lightgbm as lgb
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder

import gc

import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
def model(features, test_features, encoding='ohe', n_folds=5):
  '''Train and test a light gradient boosting model using cross validation.'''
  #Extract the ids
  train_ids = features['sk_id_curr']
  test_ids = test_features['sk_id_curr']

  #Extract the labels for training (as an array so they can be indexed by position)
  labels = np.array(features['target'])

  #Remove the ids and target
  features = features.drop(columns=['sk_id_curr','target'])
  test_features = test_features.drop(columns=['sk_id_curr'])

  #one hot encoding
  if encoding == 'ohe' :
    features = pd.get_dummies(features)
    test_features = pd.get_dummies(test_features)

    #Align the dfs by the columns
    features, test_features = features.align(test_features, join='inner', axis=1)

    #No categorical indices to record
    cat_indices = 'auto'

  #integer label encoding
  elif encoding == 'le' :
    label_encoder = LabelEncoder()

    #List for storing categorical indices
    cat_indices = []

    #Iterate through each column
    for i, col in enumerate(features) :
      if features[col].dtype == 'object' :
        #map the categorical features to integers
        features[col] = label_encoder.fit_transform(np.array(features[col].astype(str)).reshape((-1,)))
        test_features[col] = label_encoder.transform(np.array(test_features[col].astype(str)).reshape((-1,)))

        #record the categorical indices
        cat_indices.append(i)

  #catch an invalid encoding scheme
  else :
    raise ValueError("Encoding must be either 'ohe' or 'le'")

  print('Training data shape : ', features.shape)
  print('Test data shape : ', test_features.shape)

  #Extract the feature names before converting to arrays
  feature_names = list(features.columns)

  #Convert to np arrays
  features = np.array(features)
  test_features = np.array(test_features)

  #Create the kfold object (random_state only takes effect when shuffle=True)
  k_fold = KFold(n_splits=n_folds, shuffle=True, random_state = 42)

  #empty array for feature importances
  feature_importance_values = np.zeros(len(feature_names))

  #empty array for test predictions
  test_predictions = np.zeros(test_features.shape[0])

  #empty array for out of fold validation predictions
  out_of_fold = np.zeros(features.shape[0])

  #Lists for recording validation and training scores
  valid_scores = []
  train_scores = []

  for train_indices, valid_indices in k_fold.split(features) :

    #Training data for the fold
    train_features, train_labels = features[train_indices], labels[train_indices]
    #Validation data for the fold
    valid_features, valid_labels = features[valid_indices], labels[valid_indices]

    #Create the model
    model = lgb.LGBMClassifier(n_estimators=100, objective = 'binary', class_weight = 'balanced', learning_rate = 0.05, reg_alpha = 0.1, reg_lambda = 0.1, random_state=42)

    #Train the model
    #(recent LightGBM releases take early stopping and logging as callbacks; older ones accepted early_stopping_rounds and verbose here)
    model.fit(train_features, train_labels, eval_metric = 'auc',
              eval_set = [(train_features, train_labels), (valid_features, valid_labels)],
              eval_names = ['train','valid'], categorical_feature = cat_indices,
              callbacks = [lgb.early_stopping(100), lgb.log_evaluation(100)])

    #Record the best iteration
    best_iteration = model.best_iteration_

    #Record the feature importances
    feature_importance_values += model.feature_importances_ / k_fold.n_splits

    #Make predictions
    test_predictions += model.predict_proba(test_features, num_iteration = best_iteration)[:,1] / k_fold.n_splits

    #Record the out of fold predictions
    out_of_fold[valid_indices] = model.predict_proba(valid_features, num_iteration = best_iteration)[:,1]

    #Record the best score
    valid_score = model.best_score_['valid']['auc']
    train_score = model.best_score_['train']['auc']

    valid_scores.append(valid_score)
    train_scores.append(train_score)

    #clean up memory
    gc.enable()

    del model, train_features, valid_features
    gc.collect()

  #make the submission df
  submission = pd.DataFrame({'sk_id_curr' : test_ids, 'target' : test_predictions})

  #make the feature importance df
  feature_importances = pd.DataFrame({'feature' : feature_names, 'importance' : feature_importance_values})

  #overall validation score
  valid_auc = roc_auc_score(labels, out_of_fold)

  #add the overall scores to the metrics
  valid_scores.append(valid_auc)
  train_scores.append(np.mean(train_scores))

  #needed for creating df of validation scores
  fold_names = list(range(n_folds))
  fold_names.append('overall')

  #df of validation scores
  metrics = pd.DataFrame({'fold' : fold_names,
                          'train' : train_scores,
                          'valid' : valid_scores})

  return submission, feature_importances, metrics

In [None]:
def plot_feature_importances(df) :
  df = df.sort_values('importance', ascending=False).reset_index()

  df['importance_normalized'] = df['importance'] / df['importance'].sum()

  #make a barh of feature importances
  plt.figure(figsize=(10,6))
  ax= plt.subplot()

  ax.barh(list(reversed(list(df.index[:15]))),
          df['importance_normalized'].head(15),
          align = 'center', edgecolor = 'k')
  
  #Set the yticks and labels
  ax.set_yticks(list(reversed(list(df.index[:15]))))
  ax.set_yticklabels(df['feature'].head(15))

  #plot labeling
  plt.xlabel('normalized importance')
  plt.title(' feature importance')
  plt.show()

  return df

Control

The first step in any experiment is establishing a control. For this we will use the function defined above (which implements a gradient boosting machine model) and the single main data source (application).

In [None]:
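#read the application files with lowercase column names, as expected by the model function
train_control = pd.read_csv('../input/application_train.csv').rename(columns=str.lower)
test_control = pd.read_csv('../input/application_test.csv').rename(columns=str.lower)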

In [None]:
submission, fi , metrics = model(train_control, test_control)

In [None]:
metrics

In [None]:
fi_sorted = plot_feature_importances(fi)

In [None]:
submission.to_csv('control.csv', index=False)

Test One

In [None]:
submission_raw, fi_raw, metrics_raw = model(train, test)

In [None]:
metrics_raw

In [None]:
fi_raw_sorted = plot_feature_importances(fi_raw)

In [None]:
top_100 = list(fi_raw_sorted['feature'])[:100]
new_features = [x for x in top_100 if x not in list(fi['feature'])]

print('%% of top 100 features created from the bureau data = %d.00' %len(new_features))

In [None]:
submission_raw.t