# **Problem Statement**

## *Janatahack Cross-sell Prediction*

The problem statement is available [here](https://datahack.analyticsvidhya.com/contest/janatahack-cross-sell-prediction/#ProblemStatement).

**Data Description**
	              
- id: Unique ID for the customer
- Gender: Gender of the customer
- Age: Age of the customer
- Driving_License: 0 = customer does not have a driving license (DL), 1 = customer already has a DL
- Region_Code: Unique code for the region of the customer
- Previously_Insured: 1 = customer already has vehicle insurance, 0 = customer doesn't have vehicle insurance
- Vehicle_Age: Age of the vehicle
- Vehicle_Damage: 1 = customer's vehicle was damaged in the past, 0 = customer's vehicle was not damaged in the past
- Annual_Premium: The amount the customer needs to pay as premium in the year
- Policy_Sales_Channel: Anonymised code for the channel of outreach to the customer, i.e. different agents, over mail, over phone, in person, etc.
- Vintage: Number of days the customer has been associated with the company
- Response: 1 = customer is interested, 0 = customer is not interested

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

import warnings
warnings.filterwarnings('ignore')

In [None]:
df=pd.read_csv('../input/janatahack-crosssell-prediction/train.csv')
df.head()

In [None]:
del df['id']

In [None]:
df.shape

In [None]:
df.isnull().sum()

In [None]:
df_test=pd.read_csv('../input/janatahack-crosssell-prediction/test.csv')
df_test.head()

In [None]:
df_test.shape

In [None]:
df.info()

In [None]:
df_test.info()

### Visualizing train data

In [None]:
# Recent seaborn versions expect the column to be passed by keyword
sns.countplot(x='Gender',data=df);

In [None]:
sns.countplot(x='Gender',hue='Driving_License',data=df);

- Most customers already have a driving license.

In [None]:
# catplot creates its own figure, so the size is set with height rather than plt.figure
sns.catplot(x="Vehicle_Age", y="Response", hue="Gender", kind="bar", data=df, height=10);

- People whose vehicle age is < 2 years tend to buy insurance more often than others.

In [None]:
sns.catplot(x="Vehicle_Age", y="Response", hue="Previously_Insured", kind="bar", data=df);

- Previously_Insured
- 1 : Customer already has Vehicle Insurance, 0 : Customer doesn't have Vehicle Insurance

- Most customers who are already insured do not want to buy insurance again.

In [None]:
sns.catplot(x="Vehicle_Age", y="Response", hue="Driving_License", kind="bar", data=df);

In [None]:
sns.catplot(x="Gender", y="Response", hue="Previously_Insured", kind="bar", data=df.query("Vehicle_Age == '> 2 Years'"))

- Subset: Vehicle_Age > 2 years.
- In this subset, the previously insured group shows a very low bar.

In [None]:
sns.catplot(x="Gender", y="Response", hue="Previously_Insured", kind="bar", data=df.query("Vehicle_Age == '1-2 Year'"))

- Subset: Vehicle_Age between 1 and 2 years.
- In this subset, the previously insured group again shows a very low bar.

In [None]:
# x and y are passed by keyword, as recent seaborn versions require
sns.lineplot(x='Policy_Sales_Channel',y='Vintage',hue='Gender',data=df)

- Policy_Sales_Channel: Anonymised code for the channel of outreach to the customer, i.e. different agents, over mail, over phone, in person, etc.
- Vintage: Number of days the customer has been associated with the company

In [None]:
sns.boxplot(x=df['Age'])

In [None]:
df[df['Age']>=80]['Response'].value_counts()

In [None]:
# index=df[df['Age']>=80].index

In [None]:
# df.drop(labels=index,inplace=True)

In [None]:
df['Annual_Premium'].max(),df['Annual_Premium'].min()

In [None]:

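def outliers(df,features):
  # Collect the index labels of rows outside 1.5*IQR for ANY of the listed features.
  # (Returning inside the loop would stop after the first feature.)
  idx=set()
  for c in features:
    Q1=np.percentile(df[c],25)
    Q3=np.percentile(df[c],75)
    IQR=Q3-Q1
    mask=(df[c] < (Q1-1.5 * IQR)) | (df[c] > (Q3 + 1.5 * IQR))
    idx.update(df.index[mask])
  return pd.Index(sorted(idx))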

In [None]:
outliers(df,['Annual_Premium'])

In [None]:
df.drop(labels=outliers(df,['Annual_Premium','Age','Vintage']),inplace=True)
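
The IQR rule used above can be checked by hand on a tiny example. The following is only a sketch on made-up numbers; the helper name `iqr_outlier_positions` and the toy values are assumptions for illustration and are not part of the competition data.

```python
import numpy as np

def iqr_outlier_positions(values, k=1.5):
    """Return positions of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    mask = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return np.flatnonzero(mask).tolist()

# Toy premiums (made-up numbers): only the last value is far from the rest
toy_premiums = [10, 12, 11, 13, 12, 100]
outlier_positions = iqr_outlier_positions(toy_premiums)
print(outlier_positions)
```

Here Q1 = 11.25 and Q3 = 12.75, so the accepted range is [9.0, 15.0] and only the value 100 falls outside it.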

In [None]:
sns.countplot(x='Response',data=df);

- The dataset is imbalanced.
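
To put a number on the imbalance rather than reading it off the plot, one option is to compute the share of the minority class. This is a small sketch on toy labels; `minority_share` and `toy_labels` are hypothetical names, and on the real data `df['Response']` could be passed instead.

```python
from collections import Counter

def minority_share(labels):
    """Fraction of rows that belong to the least frequent class."""
    counts = Counter(labels)
    return min(counts.values()) / sum(counts.values())

# Toy labels (made up): nine "not interested" rows and one "interested" row
toy_labels = [0] * 9 + [1] * 1
share = minority_share(toy_labels)
print(Counter(toy_labels), share)
```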

In [None]:
df['Gender']=df['Gender'].replace(['Male','Female'],[1,0])
df['Vehicle_Age']=df['Vehicle_Age'].replace(['< 1 Year','1-2 Year','> 2 Years'],[1,2,3])
df['Vehicle_Damage']=df['Vehicle_Damage'].replace(['Yes','No'],[1,0])


df_test['Gender']=df_test['Gender'].replace(['Male','Female'],[1,0])
df_test['Vehicle_Age']=df_test['Vehicle_Age'].replace(['< 1 Year','1-2 Year','> 2 Years'],[1,2,3])
df_test['Vehicle_Damage']=df_test['Vehicle_Damage'].replace(['Yes','No'],[1,0])
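
The same encoding can also be written once and applied to both train and test, with a check that no unexpected category slips through. This is a sketch under assumptions: `ENCODINGS`, `encode_categoricals` and the toy frame are illustrative names and data, not part of the original notebook.

```python
import pandas as pd

# Explicit lookup tables that mirror the replace() calls above
ENCODINGS = {
    'Gender': {'Male': 1, 'Female': 0},
    'Vehicle_Age': {'< 1 Year': 1, '1-2 Year': 2, '> 2 Years': 3},
    'Vehicle_Damage': {'Yes': 1, 'No': 0},
}

def encode_categoricals(frame):
    """Return a copy of frame with the three text columns mapped to integers."""
    out = frame.copy()
    for col, mapping in ENCODINGS.items():
        out[col] = out[col].map(mapping)
        # map() yields NaN for labels missing from the mapping, so fail loudly
        if out[col].isnull().any():
            raise ValueError("Unexpected category in column {}".format(col))
    return out

# Toy rows (made up) covering every category
toy = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male'],
    'Vehicle_Age': ['< 1 Year', '1-2 Year', '> 2 Years'],
    'Vehicle_Damage': ['Yes', 'No', 'Yes'],
})
encoded = encode_categoricals(toy)
print(encoded)
```

Calling the same function on both frames keeps the train and test encodings from drifting apart.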

In [None]:
plt.figure(figsize=(10,10))
sns.heatmap(df.corr(),annot=True);

In [None]:
X=df.drop(['Response'],axis=1)
y=df.Response.values

In [None]:
X.columns

In [None]:
y


In [None]:
from imblearn.over_sampling import SMOTE
sm=SMOTE()

In [None]:
# fit_sample was removed from imbalanced-learn; fit_resample is the current name
X_res,y_res=sm.fit_resample(X,y)

In [None]:
X_res.shape,y_res.shape

In [None]:
from collections import Counter
print("Original dataset shape {}".format(Counter(y)))
print("Dataset shape after applying SMOTE {}".format(Counter(y_res)))
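
For intuition about where the extra minority rows come from, the core idea of SMOTE is to place new points on the line segment between a minority sample and one of its nearest minority neighbours. The code below is a simplified sketch of that interpolation step only, not imblearn's implementation; `smote_like_samples`, its parameters and the toy points are assumptions for illustration.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        # Distances from the chosen point to every minority point
        dist = np.linalg.norm(minority - minority[i], axis=1)
        # The closest entry is the point itself (distance 0), so skip it
        neighbours = np.argsort(dist)[1:k + 1]
        j = rng.choice(neighbours)
        gap = rng.random()  # how far to move along the segment, in [0, 1)
        synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(synthetic)

# Toy minority class (made up): four distinct points on a unit square
toy_minority = [[1, 1], [2, 1], [1, 2], [2, 2]]
new_rows = smote_like_samples(toy_minority, n_new=5)
print(new_rows)
```

Every synthetic row is a weighted mix of two real minority rows, so it stays inside the region the minority class already occupies.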

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X_res,y_res,test_size=0.3,random_state=23,stratify=y_res,shuffle=True)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

In [None]:
from sklearn.preprocessing import StandardScaler,MinMaxScaler
sc=StandardScaler()


In [None]:
X_train_res=sc.fit_transform(X_train)
X_test_res=sc.transform(X_test)
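
The scaler is fitted on the training split only and then reused on the test split. A minimal NumPy sketch of what that means, on made-up numbers (`toy_train` and `toy_test` are hypothetical arrays, and this only mirrors the default behaviour of `StandardScaler`):

```python
import numpy as np

toy_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
toy_test = np.array([[2.0, 40.0]])

# Statistics are estimated from the training rows only
mean = toy_train.mean(axis=0)
std = toy_train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

toy_train_scaled = (toy_train - mean) / std
# The test rows reuse the training mean and std instead of their own
toy_test_scaled = (toy_test - mean) / std
print(toy_test_scaled)
```

The scaled training columns have mean 0, while the scaled test values can fall outside the training range, as the second test column does here.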

In [None]:
X_train_res

In [None]:
# !pip install xgboost

In [None]:
# from xgboost import  XGBClassifier
# xg=XGBClassifier()

In [None]:
# model=xg.fit(X_train_res,y_train)

In [None]:
# y_pred=model.predict(X_test_res)

- XGBoost could be used here as well (see the commented-out cells above).

In [None]:
!pip install catboost

In [None]:
from catboost import CatBoostClassifier
cb=CatBoostClassifier(task_type='GPU',loss_function='Logloss',iterations=9500,l2_leaf_reg=8,depth=8)

In [None]:
model_cb=cb.fit(X_train_res,y_train)

In [None]:
y_pred=model_cb.predict(X_test_res)

In [None]:
from sklearn.metrics import accuracy_score,roc_auc_score,roc_curve,precision_score,f1_score,classification_report,confusion_matrix,ConfusionMatrixDisplay

In [None]:
print("Accuracy on training", accuracy_score(y_train,model_cb.predict(X_train_res)))
print("Accuracy on testing",accuracy_score(y_test,model_cb.predict(X_test_res)))

In [None]:
precision_score(y_test,y_pred)

In [None]:
print(classification_report(y_test,y_pred))

In [None]:
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
ConfusionMatrixDisplay.from_predictions(y_test,y_pred,values_format='.1f',cmap='Blues');
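
The accuracy, precision and recall figures reported above all come from the four cells of the confusion matrix. The sketch below recomputes them by hand on made-up labels; `confusion_counts` and the toy lists are illustrative names, not part of the notebook.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

# Toy labels (made up): 3 interested customers, 5 not interested
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 0, 1, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true_toy, y_pred_toy)
toy_accuracy = (tp + tn) / (tp + fp + fn + tn)
toy_precision = tp / (tp + fp)  # of the predicted positives, how many are right
toy_recall = tp / (tp + fn)     # of the actual positives, how many are found
print(tp, fp, fn, tn, toy_accuracy, toy_precision, toy_recall)
```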

In [None]:
cb_probs=model_cb.predict_proba(X_test_res)[:,1]
cb_probs

In [None]:
cb_auc=roc_auc_score(y_test,cb_probs)
print("cb_area",cb_auc)
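
One way to read the AUC value: it is the probability that a randomly chosen positive row gets a higher score than a randomly chosen negative row, with ties counted as half. The following is a brute-force sketch of that definition on toy scores; `pairwise_auc` is a hypothetical helper, and `roc_auc_score` is the function to use in practice.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

# Toy scores (made up): one positive is ranked below one negative
toy_auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_auc)
```

Three of the four positive/negative pairs are ordered correctly here, which gives 0.75.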

In [None]:
fpr,tpr,th=roc_curve(y_test,cb_probs)

In [None]:
plt.plot([0,1],[0,1],linestyle='--',color='red')
# The curve belongs to the CatBoost model, so the legend is labelled accordingly
plt.plot(fpr,tpr,marker='*',label='CatBoost AUC {}'.format(cb_auc.round(2)))
plt.legend()

In [None]:
feat_importances = pd.Series(model_cb.feature_importances_, index=X.columns)
feat_importances.nlargest(15).plot(kind='barh')
#feat_importances.nsmallest(20).plot(kind='barh')
plt.show()

In [None]:
# parameters={"learning_rate"    : [0.10, 0.15, 0.20,0.30] ,
#  "max_depth"        : [ 3,5,8,10, 12, 15],
#  "min_child_weight" : [ 1, 3, 5, 7 ],
#  "gamma"            : [ 0.0, 0.1, 0.2 , 0.3],
#  "colsample_bytree" : [ 0.3, 0.4, 0.5 , 0.7 ]
#  }

# parameters={'depth':[4,6,7,8],
#             'learning_rate':[0.14,0.16,0.18,0.2],
#             'iterations':[5000,5500,6000,7000],
#             'l2_leaf_reg': [2,6,10,14]
           
#             }

In [None]:
# from sklearn.model_selection import RandomizedSearchCV
# import time

In [None]:
# rd_obj=RandomizedSearchCV(model_cb,parameters,scoring='accuracy',cv=20)

In [None]:
# rd_obj

In [None]:
# start=time.time()
# rd_obj.fit(X_train_res,y_train)
# end=time.time()
# print("Total time taken is {}".format(end-start))

In [None]:
# rd_obj.best_params_

In [None]:
# best_split=rd_obj.best_estimator_
# best_split

In [None]:
#  model_cv=best_split.fit(X_train_res,y_train)

In [None]:
# from sklearn.model_selection import cross_val_score
# print(cross_val_score(model_cv,X_train_res,y_train,cv=10,scoring='accuracy').mean())

In [None]:
df_test.head()

In [None]:
df_test_copy=df_test.copy()

In [None]:
df_test_copy.drop('id',axis=1,inplace=True)

In [None]:
df_test_copy.columns

In [None]:
df_test_copy=sc.transform(df_test_copy)

In [None]:
df_test_copy

In [None]:
predictions=model_cb.predict_proba(df_test_copy)[:,1]

In [None]:
predictions

In [None]:
final=pd.DataFrame()
final['id']=df_test['id']
final['Response']=predictions

In [None]:
final

In [None]:
final.to_csv('final_cb.csv',index=False)
