# Introduction

This is a simple XGBoost-based notebook for predicting customer transactions without any exploratory data analysis (EDA). It follows a simple pipeline: loading the data, balancing the imbalanced classes, and tuning with grid-search cross-validation. You can use a similar approach to find a first quick-and-dirty but robust solution for similar problems. The solution can then be refined iteratively as you learn more about the problem and about the data science techniques that apply to it.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

### Data Description

You are provided with an anonymized dataset containing numeric feature variables, the binary target column, and a string ID_code column.

The task is to predict the value of the target column in the test set.

Training data contains:

• ID_code (string);

• target;

• 200 numerical variables, named from var_0 to var_199;

Test data contains:

• ID_code (string);

• 200 numerical variables, named from var_0 to var_199;

There are no missing values in the train and test datasets. Let's check the numerical values in both.
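The missing-value claim can be checked directly once the data is loaded. The snippet below is a minimal sketch on a small synthetic frame: `demo` and its values are made up for illustration and are not the competition data. On the real data, the same two calls would be run on `df_train` and `df_test`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_train: a target column plus a few var_* features.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(6, 3)), columns=['var_0', 'var_1', 'var_2'])
demo.insert(0, 'target', [0, 0, 1, 0, 0, 1])

# Total number of missing cells in the whole frame.
n_missing = int(demo.isnull().sum().sum())
print('Missing values:', n_missing)

# Summary statistics (count, mean, std, min, quartiles, max) per numeric column.
summary = demo.describe()
print(summary)
```

A non-zero `n_missing` would mean the data needs imputation before training.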

### File descriptions

train.csv - the training set.

test.csv - the test set. 

The test set contains some rows which are not included in scoring.

sample_submission.csv - a sample submission file in the correct format.


# Load Data

In [None]:
def load_data():
    df_train = pd.read_csv('../input/santander-customer-transaction-prediction/train.csv', index_col='ID_code')
    df_test = pd.read_csv('../input/santander-customer-transaction-prediction/test.csv', index_col='ID_code')
    return df_train, df_test


In [None]:
# Label-encode the selected categorical columns, leaving the other columns as they are
from sklearn import preprocessing

def label_encoding(sel_cat,inpX):
    for col in sel_cat:
        if col in inpX.columns:
            le = preprocessing.LabelEncoder()
            le.fit(list(inpX[col].astype(str).values))
            inpX[col] = le.transform(list(inpX[col].astype(str).values))
    return inpX


In [None]:
# Returns list of categorical columns, and part of dataset with only categorical columns
def categorical_cols(input_df):
    # Selecting categorical (object dtype) columns in the input DataFrame
    print(input_df.select_dtypes('object').columns)
    sel_train = input_df.select_dtypes('object').columns.values
    #print(type(sel_train))

    train = input_df[sel_train]
    #print(train.describe())
    return sel_train, train

# Dealing with Imbalanced Sampling

In [None]:
from sklearn.model_selection import train_test_split

def balanced_sampling(input_df, factor):
    """Undersample the majority class to `factor` negatives per positive,
    then return a stratified 75/25 train/test split."""
    train = numeric_cols(input_df)

    # Split rows by class: target == 1 (minority) and target == 0 (majority)
    X_target = train[train.target == 1]
    X_notarget = train[train.target == 0]
    n_target = X_target.shape[0]
    print("Target size (rows, columns):", X_target.shape)

    # Keep `factor` majority rows per minority row, capped at the rows available
    n_notarget = min(factor * n_target, X_notarget.shape[0])
    X_notarget1 = X_notarget.sample(n_notarget)
    X = pd.concat([X_target, X_notarget1], ignore_index=True)
    print(X.shape)
    print(X.sample(10))

    # Separate the label from the features
    y = X['target']
    X = X.drop(columns=['target'])

    # Train-test split with stratification.
    # Note: the held-out split is drawn from the undersampled data, so its
    # class ratio is 1:factor rather than the original ratio.
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.25)
    return X_train, X_test, y_train, y_test


In [None]:
def numeric_cols(input_df):
    # Selecting numeric columns in df_train
    print(input_df.select_dtypes('number').columns)
    sel_train = input_df.select_dtypes('number').columns.values
    print(type(sel_train))

    train = input_df[sel_train]
    print(train.describe())
    return train

In [None]:
def preprocess(inp):
    # Replace NaN with 0.0 (modifies the DataFrame in place)
    inp.fillna(0.0, inplace=True)
    return inp

# XGBoost Pipeline with Cross Validation

In [None]:
df_train,df_test = load_data()
print(f'Train dataset has {df_train.shape[0]} rows and {df_train.shape[1]} columns.')
print(f'Test dataset has {df_test.shape[0]} rows and {df_test.shape[1]} columns.')

In [None]:
X_train, X_test, y_train, y_test = balanced_sampling(df_train,3)

In [None]:
import xgboost as xgb
from sklearn.model_selection import GridSearchCV

In [None]:
# 'gpu_hist' needs a GPU; XGBoost 2.0+ deprecates it in favour of tree_method='hist', device='cuda'
model = xgb.XGBClassifier(tree_method='gpu_hist', random_state=1, learning_rate=0.01,
                          max_depth=4, subsample=0.8, colsample_bytree=1, gamma=1)
#model.fit(X_train, y_train)
#model.score(X_test,y_test)

In [None]:
param_grid = {
    'n_estimators': [2000,4000]   
}

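# With no `scoring` argument, GridSearchCV ranks candidates by the
# estimator's default score, which is accuracy for classifiers
gbm = GridSearchCV(model, param_grid, cv=3)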
gbm.fit(X_train, y_train)

In [None]:
print("Best parameters set found on development set:")
print()
print(gbm.best_params_)
print()
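print("Grid scores on development set:")
for mean, std, params in zip(gbm.cv_results_['mean_test_score'],
                             gbm.cv_results_['std_test_score'],
                             gbm.cv_results_['params']):
    print(f"{mean:.3f} (+/- {2 * std:.3f}) for {params}")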


In [None]:
# prediction
y_pred=gbm.predict(X_test)

In [None]:
from sklearn import metrics
def eval2(y_test,y_pred):
    # Model Accuracy, how often is the classifier correct?
    print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
    return 0

In [None]:
eval2(y_test,y_pred)

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score 
from sklearn.metrics import classification_report 
def performance_analysis(y_test, y_pred):
    results = confusion_matrix(y_test, y_pred)
    print('Confusion Matrix :')
    print(results)
    print('Accuracy Score :', accuracy_score(y_test, y_pred))
    print('Report : ')
    print(classification_report(y_test, y_pred))

performance_analysis(y_test,y_pred)

In [None]:
def sub3(inpt,clf):
    # Use df_test with selected columns for final submission
    y_preds = clf.predict_proba(inpt)[:,1] 
    sample_submission = pd.read_csv('../input/santander-customer-transaction-prediction/sample_submission.csv', index_col='ID_code')
    sample_submission['target'] = y_preds
    sample_submission.to_csv('santander_xgcv_2.csv')
    return 0


In [None]:
sub3(df_test,gbm)

## Conclusion

The classification report shows poor recall and F-scores for the minority class (target = 1). This can be addressed by increasing the representation of the minority class, for example through SMOTE or another suitable oversampling technique.
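To illustrate the oversampling idea, here is a minimal SMOTE-style sketch in NumPy. It is not the imbalanced-learn implementation (that library provides `SMOTE` in `imblearn.over_sampling`), and the names `smote_sketch`, `X_min`, `n_new`, `k` and `seed` are hypothetical. Each synthetic row is a random interpolation between a minority sample and one of its `k` nearest minority neighbours. The pairwise-distance step is only practical for small arrays, so treat this as a way to see the mechanics rather than something to run on the full training set.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority rows by interpolating between a
    minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    if n < 2:
        raise ValueError("need at least two minority samples to interpolate")
    k = min(k, n - 1)

    # Pairwise Euclidean distances between minority samples (O(n^2) memory).
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)                   # base sample per new row
    nb = neighbours[base, rng.integers(0, k, size=n_new)]   # one of its k neighbours
    gap = rng.random((n_new, 1))                            # interpolation weight in [0, 1)
    return X_min[base] + gap * (X_min[nb] - X_min[base])

# Toy minority class: the four corners of the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_sketch(X_min, n_new=6, k=2)
print(X_new.shape)  # (6, 2)
```

Any oversampling of this kind should be applied to the training split only, so that the held-out data still contains only real rows.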
