# Problem Statement

__Credit Card Lead Prediction__

Happy Customer Bank is a mid-sized private bank that deals in all kinds of banking products, such as savings accounts, current accounts, investment products, and credit products, among other offerings.

The bank also cross-sells products to its existing customers, using different channels of communication such as tele-calling, e-mails, recommendations on net banking, and mobile banking.

In this case, Happy Customer Bank wants to cross-sell its credit cards to its existing customers. The bank has identified a set of customers who are eligible for these credit cards.

Now, the bank is looking for your help in identifying customers who could show higher intent towards a recommended credit card, given:

* Customer details (gender, age, region, etc.)
* Details of the customer's relationship with the bank (Channel_Code, Vintage, Avg_Asset_Value, etc.)

# Table of Contents

* __Step 1: Importing the Relevant Libraries__
    
* __Step 2: Data Inspection__
    
* __Step 3: Data Cleaning__
    
* __Step 4: Exploratory Data Analysis__
    
* __Step 5: Building Models__

### Step 1: Importing the Relevant Libraries 

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn import metrics

import warnings
# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

### Step 2: Data Inspection

In [None]:
test = pd.read_csv("../input/credit-card/test_mSzZ8RL.csv")
train = pd.read_csv("../input/credit-card/train_s3TEQDk.csv")

In [None]:
train.shape,test.shape

* __We have 245725 rows and 11 columns in Train set whereas Test set has 105312 rows and 10 columns.__

In [None]:
#percentage of null values per column
train.isnull().sum()/train.shape[0] *100

In [None]:
#percentage of null values per column
test.isnull().sum()/test.shape[0] *100

* __About 12% of the values in the Credit_Product column are missing.__
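
The two percentage checks above can also be combined into one table, which makes it easier to compare the train and test sets side by side. The following is a minimal sketch, not part of the original notebook: the helper name `missing_summary` is hypothetical and the toy frames only stand in for the real data.

```python
import pandas as pd

def missing_summary(train_df, test_df):
    """Return the percentage of missing values per column for both sets."""
    summary = pd.DataFrame({
        "train_missing_pct": train_df.isnull().mean() * 100,
        "test_missing_pct": test_df.isnull().mean() * 100,
    })
    # Keep only columns with at least one missing value in either set
    return summary[(summary > 0).any(axis=1)].round(2)

# Toy frames standing in for the real data (values are made up)
toy_train = pd.DataFrame({
    "Age": [30, 45, 52, 61],
    "Credit_Product": ["No", None, "Yes", "No"],
    "Is_Lead": [0, 1, 0, 0],
})
toy_test = pd.DataFrame({
    "Age": [28, 39],
    "Credit_Product": [None, "No"],
})

print(missing_summary(toy_train, toy_test))
```

On the real data this would be called as `missing_summary(train, test)`.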

In [None]:
#categorical features
categorical = train.select_dtypes(include=['object'])  # np.object was removed from recent NumPy
print("Categorical Features in Train Set:",categorical.shape[1])

#numerical features
numerical= train.select_dtypes(include =[np.float64,np.int64])
print("Numerical Features in Train Set:",numerical.shape[1])

In [None]:
#categorical features
categorical = test.select_dtypes(include=['object'])  # np.object was removed from recent NumPy
print("Categorical Features in Test Set:",categorical.shape[1])

#numerical features
numerical= test.select_dtypes(include =[np.float64,np.int64])
print("Numerical Features in Test Set:",numerical.shape[1])

### Step 3: Data Cleaning 

In [None]:
train.isnull().sum()

In [None]:
test.isnull().sum()

* __Credit_Product has some missing values in both the train and test data.__

* __We need to fill the missing values in the Credit_Product column using the mode, because the column contains categorical data.__

In [None]:
train['Credit_Product'].isnull().sum(),test['Credit_Product'].isnull().sum()

In [None]:
print(train['Credit_Product'].value_counts())
print('******************************************')
print(test['Credit_Product'].value_counts())

In [None]:
#Imputing with Mode
train['Credit_Product']= train['Credit_Product'].fillna(train['Credit_Product'].mode()[0])
test['Credit_Product']= test['Credit_Product'].fillna(test['Credit_Product'].mode()[0])

In [None]:
train['Credit_Product'].isnull().sum(),test['Credit_Product'].isnull().sum()

__We have successfully imputed the missing values in the Credit_Product column.__
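
A possible variation on the imputation above is to take the fill value from the train set only and reuse it for the test set, or to keep missingness visible as its own category. The sketch below illustrates both options. The helper `impute_categorical` and the `"Unknown"` label are assumptions for illustration, not something the original notebook uses.

```python
import pandas as pd

def impute_categorical(train_df, test_df, column, strategy="train_mode"):
    """Fill missing values in `column` for both frames and return copies."""
    train_out, test_out = train_df.copy(), test_df.copy()
    if strategy == "train_mode":
        # Use the most frequent train value for both sets
        fill_value = train_out[column].mode()[0]
    else:
        # Treat missingness as a category of its own
        fill_value = "Unknown"
    train_out[column] = train_out[column].fillna(fill_value)
    test_out[column] = test_out[column].fillna(fill_value)
    return train_out, test_out

# Toy frames standing in for the real data (values are made up)
toy_train = pd.DataFrame({"Credit_Product": ["No", "No", "Yes", None]})
toy_test = pd.DataFrame({"Credit_Product": [None, "Yes"]})

mode_train, mode_test = impute_categorical(toy_train, toy_test, "Credit_Product")
unk_train, unk_test = impute_categorical(
    toy_train, toy_test, "Credit_Product", strategy="unknown"
)
print(mode_test["Credit_Product"].tolist())
print(unk_test["Credit_Product"].tolist())
```

Whether a separate category helps depends on whether the missing values carry signal, which would need to be checked against the validation score.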

### Step 4: Exploratory Data Analysis

In [None]:
train.columns

In [None]:
train.head()

__1. Gender__

In [None]:
train['Gender'].value_counts()

In [None]:
plt.figure(figsize=(8,5))
sns.countplot(x='Gender',data=train,color='lightseagreen')

__From the above graph it is clear that the bank has more male customers than female customers.__

__2. Region_Code__

In [None]:
train['Region_Code'].value_counts()

In [None]:
plt.figure(figsize=(30,5))
sns.countplot(x='Region_Code',data=train,palette='ocean')

__The largest numbers of customers come from region codes RG268, RG283, RG254, RG284, RG277 and RG280, in that order.__

__3. Occupation__

In [None]:
train['Occupation'].value_counts()

In [None]:
plt.figure(figsize=(8,5))
sns.countplot(x='Occupation',data=train,palette='spring')

__Most customers are self-employed.__

__4. Channel_Code__

In [None]:
train['Channel_Code'].value_counts()

In [None]:
plt.figure(figsize=(8,5))
sns.countplot(x='Channel_Code',data=train,palette='ocean')

__Most customers have channel code X1.__

__5. Credit_Product__

In [None]:
train['Credit_Product'].value_counts()

In [None]:
plt.figure(figsize=(8,5))
sns.countplot(x='Credit_Product',data=train,palette='ocean')

__Only a minority of customers have an active credit product.__

__6. Is_Active__

In [None]:
train['Is_Active'].value_counts()

In [None]:
plt.figure(figsize=(8,5))
sns.countplot(x='Is_Active',data=train,color='purple')

__Fewer customers are active than inactive, so the bank needs to increase its customer interaction so that customers stay informed about its products, which should in turn increase the number of active customers.__

__7. Is_Lead__

In [None]:
train['Is_Lead'].value_counts()

In [None]:
plt.figure(figsize=(8,5))
sns.countplot(x='Is_Lead',data=train,color='darkorange')

__From the above graph it is clear that leads are the minority class, so the bank needs to take corrective measures to increase leads.__
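
The imbalance seen in the plot can be quantified with counts and shares per class. This is a small sketch on made-up labels. The helper name `class_balance` is hypothetical, and on the real data it would be called as `class_balance(train['Is_Lead'])`.

```python
import pandas as pd

def class_balance(target):
    """Return the count and percentage share of each class in `target`."""
    counts = target.value_counts()
    shares = target.value_counts(normalize=True).mul(100).round(2)
    return pd.DataFrame({"count": counts, "share_pct": shares})

# Toy target standing in for train['Is_Lead'] (values are made up)
toy_target = pd.Series([0, 0, 0, 1, 0, 1, 0, 0], name="Is_Lead")
balance = class_balance(toy_target)
print(balance)
```

With an imbalanced target like this, plain accuracy can look good even for a model that never predicts a lead, which is one reason to evaluate with ROC AUC.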

__8. Age__

In [None]:
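print("Histogram of Age for leads")

plt.figure(figsize = (8 , 6))
# distplot is deprecated in recent seaborn; histplot with kde=True draws a comparable density histogram
sns.histplot(train.query('Is_Lead == 1').Age, bins = 20, color="green", kde=True, stat="density")
# vertical line at the mean age of all customers, not only leads
mean_age = train.Age.mean()
plt.axvline(mean_age,0,1, color = "blue")

__Most of the customers flagged as leads are between 45 and 55 years old.__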
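
The histogram shows where leads are concentrated, but not how likely a customer in each age range is to be a lead. One way to check that is to bin the ages and compute the lead rate per bin. The sketch below assumes the column names `Age` and `Is_Lead` from the data. The helper `lead_rate_by_age_band` and the bin edges are arbitrary choices for illustration.

```python
import pandas as pd

def lead_rate_by_age_band(df, bins=(20, 35, 45, 55, 65, 90)):
    """Return the number of customers and the share of leads per age band."""
    # Left-closed bands, e.g. [45, 55) holds ages 45 to 54
    bands = pd.cut(df["Age"], bins=list(bins), right=False)
    rates = df.groupby(bands, observed=False)["Is_Lead"].agg(["count", "mean"])
    return rates.rename(columns={"mean": "lead_rate"})

# Toy frame standing in for the train set (values are made up)
toy = pd.DataFrame({
    "Age": [25, 30, 47, 50, 52, 70],
    "Is_Lead": [0, 0, 1, 1, 0, 0],
})
rates = lead_rate_by_age_band(toy)
print(rates)
```

Bands with no customers show a count of 0 and a missing lead rate.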

### Step 5: Building Models

In [None]:
train.dtypes

In [None]:
train.columns

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
var_mod = ['Gender','Region_Code','Occupation','Channel_Code','Credit_Product','Is_Active']
for i in var_mod:
    # fit on the categories of both sets so that train and test share one mapping
    le.fit(pd.concat([train[i], test[i]]))
    train[i] = le.transform(train[i])
    test[i] = le.transform(test[i])
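
Label encoding only helps if a given category maps to the same integer in both train and test. An alternative sketch, assuming scikit-learn's `OrdinalEncoder` is acceptable here, fits the mapping on the train set and sends any category unseen in train to -1. The helper `encode_categoricals` and the toy values are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

def encode_categoricals(train_df, test_df, columns):
    """Encode `columns` with a mapping learned from the train set only."""
    encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
    train_out, test_out = train_df.copy(), test_df.copy()
    train_out[columns] = encoder.fit_transform(train_out[columns])
    # Categories not seen during fit are encoded as -1 instead of raising
    test_out[columns] = encoder.transform(test_out[columns])
    return train_out, test_out, encoder

# Toy frames standing in for the real data (values are made up)
toy_train = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Occupation": ["Salaried", "Other", "Self_Employed"],
})
toy_test = pd.DataFrame({
    "Gender": ["Female", "Male"],
    "Occupation": ["Entrepreneur", "Other"],
})

enc_train, enc_test, enc = encode_categoricals(
    toy_train, toy_test, ["Gender", "Occupation"]
)
print(enc_train)
print(enc_test)
```

The integer codes follow the sorted order of the train categories, so they carry no meaning of their own. Tree-based models generally tolerate that.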

In [None]:
# separating the independent and dependent variables

# storing all the independent variables as X
X = train.drop(['Is_Lead','ID'], axis=1)

# storing the dependent variable as y
y = train['Is_Lead']

In [None]:
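from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

def cross_val(X, y, model, params, folds=9):
    """Train `model` with stratified K-fold CV and print each fold's ROC AUC.

    Returns the model fitted on the last fold only.
    """
    skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=21)
    for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
        print(f"Fold: {fold}")
        x_train, y_train = X.iloc[train_idx], y.iloc[train_idx]
        x_test, y_test = X.iloc[test_idx], y.iloc[test_idx]

        alg = model(**params)
        # Stop adding trees once the held-out fold has not improved for 100 rounds.
        # Note: newer XGBoost releases expect early_stopping_rounds in the
        # constructor rather than in fit().
        alg.fit(x_train, y_train,
                eval_set=[(x_test, y_test)],
                early_stopping_rounds=100,
                verbose=400)

        # score the held-out fold on the predicted probability of the positive class
        pred = alg.predict_proba(x_test)[:, 1]
        roc_score = roc_auc_score(y_test, pred)
        print(f"roc_auc_score: {roc_score}")
        print("-"*50)
    
    return alg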
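
The helper above prints one score per fold. To compare models it is often handier to collect the fold scores and summarise them. The sketch below shows that idea on synthetic data. `LogisticRegression` is only a fast stand-in for the boosted model, and the function name `cross_val_scores` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def cross_val_scores(X, y, make_model, folds=5, seed=21):
    """Return the ROC AUC of each stratified fold as a NumPy array."""
    skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
    scores = []
    for train_idx, valid_idx in skf.split(X, y):
        model = make_model()  # fresh, unfitted model for every fold
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict_proba(X[valid_idx])[:, 1]
        scores.append(roc_auc_score(y[valid_idx], pred))
    return np.array(scores)

# Synthetic, imbalanced data standing in for the real features and target
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, weights=[0.75, 0.25], random_state=0
)
scores = cross_val_scores(X_toy, y_toy, lambda: LogisticRegression(max_iter=1000))
print(f"AUC per fold: {np.round(scores, 3)}")
print(f"mean={scores.mean():.3f}, std={scores.std():.3f}")
```

The spread across folds gives a rough sense of how much a difference between two models' mean scores can be trusted.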

* __XGBClassifier__

In [None]:
xgb_params= {'n_estimators': 20000, 
             'max_depth': 6, 
             'learning_rate': 0.0201, 
             'reg_lambda': 29.326, 
             'subsample': 0.818, 
             'colsample_bytree': 0.235, 
             'colsample_bynode': 0.820, 
             'colsample_bylevel': 0.453}

In [None]:
from xgboost import XGBClassifier
xgb_model = cross_val(X, y, XGBClassifier, xgb_params)

In [None]:
submission = pd.read_csv('../input/credit-card/sample_submission_eyYijxG.csv')
test = test.drop('ID', axis=1)
final_predictions = xgb_model.predict_proba(test)[:,1]
submission['Is_Lead'] = final_predictions
#predict_proba already returns values in [0, 1]; this guard only clamps any negative value to 0
submission['Is_Lead'] = submission['Is_Lead'].apply(lambda x: 0 if x<0 else x)
submission.to_csv('my_submission.csv', index=False)
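
Before uploading, it can be worth checking that the submission lines up with the test set and holds valid probabilities. The following is a minimal sketch under stated assumptions: that the sample submission has an `ID` column next to `Is_Lead`, and that the test IDs were saved before the `ID` column was dropped. The helper `validate_submission` is hypothetical.

```python
import pandas as pd

def validate_submission(submission, expected_ids, id_col="ID", target="Is_Lead"):
    """Return a list of problems found in a submission frame (empty if none)."""
    problems = []
    if len(submission) != len(expected_ids):
        problems.append("row count differs from the test set")
    elif list(submission[id_col]) != list(expected_ids):
        problems.append("IDs differ from the test set or are out of order")
    if submission[target].isnull().any():
        problems.append("target contains missing values")
    elif not submission[target].between(0, 1).all():
        problems.append("target contains values outside [0, 1]")
    return problems

# Toy data standing in for the real submission (values are made up)
toy_ids = ["A1", "B2", "C3"]
good = pd.DataFrame({"ID": toy_ids, "Is_Lead": [0.12, 0.80, 0.33]})
bad = pd.DataFrame({"ID": toy_ids, "Is_Lead": [0.12, 1.40, 0.33]})

print(validate_submission(good, toy_ids))
print(validate_submission(bad, toy_ids))
```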

As a learner, this is my first experience with this type of competition. I tried to apply here whatever I learned from Analytics Vidhya.

1. First of all, I found that there are missing values in the data. The column with missing values is categorical, so I filled it with the mode. Then I did visualization, from which I found that every column except ID is important for the analysis.

2. Then I started model building. I first chose DecisionTreeClassifier because this is a classification problem, but I got poor results, so I switched to logistic and linear regression. Linear regression gave me a better result than logistic regression and the decision tree, but it was still not up to the mark. So I decided to use ensemble learning, where I tried CatBoost, random forest, XGBoost, LightGBM, stacking, and averaging. Of all these, CatBoost and XGBoost gave the better results. Finally, I decided to use XGBoost over CatBoost because of its better AUC.

It was a great experience.
Thank you.


__END__