# Customer Churn Prediction :

In this project we'll see how to perform data preprocessing and prediction at an intermediate level.
We train on a dataset that has multiple numerical and categorical features and then predict on the test data.


![](https://miro.medium.com/max/750/1*8_Md5Ns2OKeW9F8XRRCMKg.jpeg)


## UPVOTE if you like this notebook :)
You can see my other work on [sagnik1511](https://kaggle.com/sagnik1511/notebooks) or on [github](https://github.com/sagnik1511)

## Libraries :

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Data Gathering & Primary Visualization:

First we read the data.
Then we take a quick look at it and plan how to process it.

In [None]:
train_df=pd.read_csv('../input/churn-risk-rate-hackerearth-ml/train.csv')
train_df.head()

In [None]:
test_df=pd.read_csv('../input/churn-risk-rate-hackerearth-ml/test.csv')
test_df.head()

In [None]:
train_df.info()

In [None]:
test_df.info()

Now we check which distinct values the target column takes.

In [None]:
train_df['churn_risk_score'].unique()

The output shows that there are 6 target values, so this is a multi-class classification problem.

Predicting these classes is the final **objective**.
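
Beyond listing the distinct values, it can help to see how many rows fall into each class. The snippet below is a minimal sketch on a made-up toy column (the values are hypothetical and not taken from the dataset); on the real data the same calls would be applied to `train_df['churn_risk_score']`.

```python
import pandas as pd

# Hypothetical toy labels standing in for train_df['churn_risk_score'].
toy = pd.DataFrame({'churn_risk_score': [3, 4, 5, 3, 0, 1, 2, 3, 4, 5]})

# Count rows per class, ordered by label instead of by frequency.
class_counts = toy['churn_risk_score'].value_counts().sort_index()

# Fraction of rows in each class.
class_share = class_counts / len(toy)

print(class_counts)
print(class_share)
```

A strongly uneven distribution would be a reason to look at per-class metrics and not only at overall accuracy.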

# Data Preprocessing :

After looking at the data, we planned a few steps to process it.

During this inspection we saw that both the train and test data contain missing values.

So we take the following steps to prepare data that can be trained on and predicted over.

#### Filling missing values :

As the data have missing values, we have to fill them.
We use two techniques:

|Data Types|Will be filled with|
|---|---|
|Categorical data|a string named 'None'|
|Numerical Data|Mean of the present values|


In [None]:
for col in test_df.columns:
    # Only touch columns that have gaps in either split
    if train_df[col].isnull().sum()!=0 or test_df[col].isnull().sum()!=0:
        if train_df[col].dtype=='int64':
            # Integer column: fill with the truncated training mean
            value=int(np.mean(train_df[col]))
        elif train_df[col].dtype=='float64':
            # Float column: fill with the training mean (NaNs are skipped)
            value=np.mean(train_df[col])
        else:
            # Categorical column: fill with the placeholder string 'None'
            value='None'
        # Assign the result back instead of using fillna(inplace=True) on a column selection
        train_df[col]=train_df[col].fillna(value)
        test_df[col]=test_df[col].fillna(value)

In [None]:
train_df.isnull().sum()

In [None]:
test_df.isnull().sum()
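
The same filling rule can be wrapped in a small function that leaves its inputs untouched. This is only a sketch: `fill_missing` and the toy frames are hypothetical names and data introduced here for illustration, and the key point it shows is that the fill value is computed from the training split and reused for the test split.

```python
import numpy as np
import pandas as pd

def fill_missing(train, test):
    """Return filled copies of both frames, using statistics from train only."""
    train, test = train.copy(), test.copy()
    for col in test.columns:
        if pd.api.types.is_numeric_dtype(train[col]):
            value = train[col].mean()   # mean of the present training values
        else:
            value = 'None'              # placeholder string for categorical gaps
        train[col] = train[col].fillna(value)
        test[col] = test[col].fillna(value)
    return train, test

# Hypothetical toy data with gaps in both splits.
toy_train = pd.DataFrame({'age': [20.0, np.nan, 40.0],
                          'region': ['City', None, 'Town']})
toy_test = pd.DataFrame({'age': [np.nan, 30.0],
                         'region': [None, 'City']})

filled_train, filled_test = fill_missing(toy_train, toy_test)
print(filled_train)
print(filled_test)
```

Working on copies makes it easy to compare the frames before and after filling.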

#### Dropping Unnecessary Features:

The dataset contains some name and ID features. These only help to label the rows and will not help at prediction time, so we drop them.

In [None]:
train_df.shape

In [None]:
cl_train_df=train_df.drop(labels=['customer_id','Name','security_no','referral_id','feedback'],axis=1)
cl_test_df=test_df.drop(labels=['customer_id','Name','security_no','referral_id','feedback'],axis=1)

In [None]:
cl_train_df.shape

# Change the time series data :

In this part we convert the date and time strings to numbers and create a separate column for each component.

In [None]:
type(cl_train_df['joining_date'][0])

As the values are not of a datetime type, we have to split them manually.

In [None]:
def add_dates(data):
    df=data
    day=[]
    month=[]
    year=[]
    for i in range(len(data)):
        year.append(int(data['joining_date'][i][:4]))
        month.append(int(data['joining_date'][i][5:7]))
        day.append(int(data['joining_date'][i][8:10]))
    df['day']=day
    df['month']=month
    df['year']=year
    return df

In [None]:
train_1=add_dates(cl_train_df)
test_1=add_dates(cl_test_df)

In [None]:
train_1.drop('joining_date',axis=1,inplace=True)
test_1.drop('joining_date',axis=1,inplace=True)

In [None]:
train_1.head()
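
As an alternative to slicing the strings in a loop, pandas can parse the column once and expose the components through the `.dt` accessor. The sketch below assumes the dates are formatted as `YYYY-MM-DD`, which is what the slice positions in `add_dates` imply; the toy values are made up.

```python
import pandas as pd

# Hypothetical toy column in the same YYYY-MM-DD layout.
toy = pd.DataFrame({'joining_date': ['2017-08-17', '2016-11-02', '2015-01-30']})

# Parse the whole column at once, then read the components.
parsed = pd.to_datetime(toy['joining_date'], format='%Y-%m-%d')
toy['day'] = parsed.dt.day
toy['month'] = parsed.dt.month
toy['year'] = parsed.dt.year

# Drop the original string column, as the notebook does.
toy = toy.drop(columns='joining_date')
print(toy)
```

Parsing with an explicit `format` also raises an error on malformed dates instead of silently producing wrong numbers.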
    

In [None]:
def add_time(data):
    df=data
    hour=[]
    mint=[]
    second=[]
    for i in range(len(data)):
        hour.append(int(data['last_visit_time'][i][:2]))
        mint.append(int(data['last_visit_time'][i][3:5]))
        second.append(int(data['last_visit_time'][i][6:8]))
        
    data['minute']=mint
    data['hour']=hour
    data['sec']=second
    data.drop('last_visit_time',axis=1,inplace=True)
    return df

In [None]:
train=add_time(train_1)
test=add_time(test_1)

In [None]:
train.head()
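
The time column can be split in a vectorized way as well. This is a minimal sketch assuming `HH:MM:SS` strings, as the slice positions in `add_time` suggest; the toy values are made up.

```python
import pandas as pd

# Hypothetical toy column in HH:MM:SS layout.
toy = pd.DataFrame({'last_visit_time': ['16:08:02', '12:38:13', '22:53:21']})

# Split on ':' into three columns and convert them to integers.
parts = toy['last_visit_time'].str.split(':', expand=True).astype(int)
parts.columns = ['hour', 'minute', 'sec']

# Replace the string column with the three numeric columns.
toy = pd.concat([toy.drop(columns='last_visit_time'), parts], axis=1)
print(toy)
```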

#### Feature Encoding :

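Now we check each categorical feature. If it has more than 20 unique values we drop it, because too much variety in the data simply makes the dataset harder to predict correctly. The remaining categorical features are replaced by integer codes ordered by how often each value occurs in the training data.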

In [None]:
for col in test_1.columns:
    if train_1[col].dtype=='object':
        print(col)

In [None]:
for col in test.columns:
    if train[col].dtype=='object':
        if train[col].nunique() >20:
            # Too many distinct values: drop the column from both splits
            train.drop(col,axis=1,inplace=True)
            test.drop(col,axis=1,inplace=True)
        else:
            # Replace each category by its frequency rank (0 = most common in train)
            k=0
            for val in train[col].value_counts().index:
                train[col]=train[col].replace(val,k)
                test[col]=test[col].replace(val,k)
                k+=1
            
            
        
    

In [None]:
train.head()

In [None]:
test.head()
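
The frequency-rank encoding can also be expressed as an explicit mapping that is learned on the training column and then applied to both splits. The sketch below uses a hypothetical helper `fit_rank_mapping` and made-up toy data. It also handles a case the `replace` loop leaves open: a test category that never appears in training, which is given the sentinel `-1` here (an arbitrary choice for this sketch).

```python
import pandas as pd

def fit_rank_mapping(series):
    """Map each category to its frequency rank (0 = most common)."""
    ordered = series.value_counts().index
    return {value: rank for rank, value in enumerate(ordered)}

# Hypothetical toy columns; 'Other' occurs only in the test split.
toy_train = pd.DataFrame({'gender': ['F', 'M', 'F', 'F', 'M', 'Unknown']})
toy_test = pd.DataFrame({'gender': ['M', 'F', 'Other']})

mapping = fit_rank_mapping(toy_train['gender'])
toy_train['gender'] = toy_train['gender'].map(mapping)

# Unseen categories map to NaN; fill them with the sentinel -1.
toy_test['gender'] = toy_test['gender'].map(mapping).fillna(-1).astype(int)

print(mapping)
print(toy_test)
```

Keeping the mapping as a dictionary makes it easy to inspect and to reuse on new data.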

## Train-Test splitting :

In [None]:
X_train=train.drop('churn_risk_score',axis=1)
y_train=train['churn_risk_score']
X_test=test

In [None]:
X_train.shape,X_test.shape,y_train.shape

#### P.S. We can encode the feedback column too, which I missed previously. The following cells use NLP to encode this feature.

In [None]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import re

In [None]:
paragraph=[]
for line in train_df['feedback']:
    paragraph.append(line)

In [None]:
paragraph

In [None]:
wordnet=WordNetLemmatizer()

In [None]:
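corpus=[]
# Build the stop-word set once instead of once per word
stop_words=set(stopwords.words('english'))

for i in range(len(paragraph)):
    # Keep letters only, lowercase, and split into words
    review=re.sub('[^a-zA-Z]',' ',paragraph[i])
    review=review.lower()
    review=review.split()
    # Lemmatize every word that is not a stop word
    review=[wordnet.lemmatize(word) for word in review if word not in stop_words]
    review=' '.join(review)
    corpus.append(review)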
    
corpus

In [None]:
xx=pd.DataFrame(corpus)

In [None]:
xx.columns=['name']

In [None]:
xx.head()

In [None]:
xx['name'].nunique()

In [None]:
feedback=xx['name'].unique()
feedback

In [None]:
for i in range(9):
    xx.replace(feedback[i],i,inplace=True)
xx.head()

In [None]:
df1=pd.DataFrame({'1':xx['name'],'2':train_df['feedback']})
df1.head(15)

In [None]:
X_train['feedback']=xx['name']

In [None]:
paragraph=[]
for line in test_df['feedback']:
    paragraph.append(line)

In [None]:
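corpus=[]
# Build the stop-word set once instead of once per word
stop_words=set(stopwords.words('english'))

for i in range(len(paragraph)):
    # Keep letters only, lowercase, and split into words
    review=re.sub('[^a-zA-Z]',' ',paragraph[i])
    review=review.lower()
    review=review.split()
    # Lemmatize every word that is not a stop word
    review=[wordnet.lemmatize(word) for word in review if word not in stop_words]
    review=' '.join(review)
    corpus.append(review)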
    
corpus

In [None]:
xx=pd.DataFrame({'name':corpus})

In [None]:
for i in range(9):
    xx.replace(feedback[i],i,inplace=True)
xx.head()

In [None]:
df1=pd.DataFrame({'1':xx['name'],'2':test_df['feedback']})
df1.head(15)

In [None]:
X_test['feedback']=xx['name']

In [None]:
X_train.head()

In [None]:
X_test.head()

Basically, what we did here is lemmatize every sentence and then encode each distinct result as an integer.
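
The clean-then-encode idea can be sketched compactly as follows. To keep the example free of NLTK downloads it omits lemmatization and uses a tiny hand-written stop-word set, so it is a simplified stand-in and not the exact pipeline above. The strings, `STOP_WORDS`, `clean`, and the `-1` sentinel for unseen text are all assumptions made for illustration.

```python
import re
import pandas as pd

# Tiny illustrative stop-word set; the notebook uses NLTK's English list.
STOP_WORDS = {'the', 'a', 'is', 'of', 'to', 'too', 'no'}

def clean(text):
    """Keep letters only, lowercase, and drop stop words."""
    tokens = re.sub('[^a-zA-Z]', ' ', text).lower().split()
    return ' '.join(t for t in tokens if t not in STOP_WORDS)

# Hypothetical toy feedback for both splits.
train_text = pd.Series(['Poor Website', 'Too many ads',
                        'poor website!', 'No reason specified'])
test_text = pd.Series(['Too many ads', 'Poor website', 'Great service'])

# Learn integer codes from the cleaned training text.
cleaned_train = train_text.map(clean)
codes, uniques = pd.factorize(cleaned_train)
lookup = {text: code for code, text in enumerate(uniques)}

# Apply the same codes to the test text; unseen text becomes -1.
encoded_test = test_text.map(clean).map(lookup).fillna(-1).astype(int)

print(lookup)
print(encoded_test.tolist())
```

Learning the codes on the training text and reusing them on the test text keeps the two encodings consistent.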

# Data Prediction :

Now we are going to predict on the dataset using two types of classifiers.


1. RandomForestClassifier    ( from sklearn.ensemble )
2. CatBoostClassifier        ( from catboost )

In [None]:
train_1=pd.concat([X_train,train_df['churn_risk_score']],axis=1)
train_1.head()

In [None]:
def create_submission(test,model,file_name):
    # Predict on the features that were passed in
    y_pred=model.predict(test)
    y_pred=y_pred.reshape(y_pred.shape[0])
    y_pred=y_pred.tolist()
    # customer_id was dropped from the features, so it is read from test_df
    subs=pd.DataFrame({'customer_id':test_df['customer_id'],'churn_risk_score':y_pred})
    # Write to the path given in file_name
    subs.to_csv(file_name,index=False)
    
    
    
def train_model(epochs,df,model):
    # Each iteration refits the model on a fresh 80% sample of df
    for i in range(epochs):
        print('Epoch ',i+1,' initiated................................')
        sample=df.sample(frac=0.8)
        x_train=sample.drop('churn_risk_score',axis=1)
        y_train=sample['churn_risk_score']
        model.fit(x_train,y_train)
        # Accuracy on the rows just used for fitting, not on held-out data
        print('Accuracy over this random data : ',model.score(x_train,y_train) )
    return model

In [None]:
X_test.head()

In [None]:
from catboost import CatBoostClassifier as cbtc
from sklearn.ensemble import RandomForestClassifier as rfc

Now we use CatBoost on the train and test data. Let's see what accuracy we can achieve. Note that the score printed below is computed on the training data itself.

After hours of experimentation we selected a subset of features for our dataset that gave the best predictions.

In [None]:
df1=X_train.drop(labels=['joined_through_referral','year','month' ,  'minute', 'sec',],axis=1)
df2=X_test.drop(labels=['joined_through_referral','year', 'month', 'minute', 'sec',],axis=1)

In [None]:
model=cbtc(verbose=0)
model.fit(df1,y_train)
print(model.score(df1,y_train))


In [None]:

model=cbtc()
model.fit(df1,y_train)
print(model.score(df1,y_train))
y_pred=model.predict(df2)
y_pred=y_pred.reshape(y_pred.shape[0])
y_pred=y_pred.tolist()
subs=pd.DataFrame({'customer_id':test_df['customer_id'][:10],'churn_risk_score':y_pred[:10]})
subs.to_csv('submission_catboost.csv',index=False)

Now we do the same with RandomForestClassifier. Let's see how much accuracy we can achieve.

In [None]:
model=rfc(random_state=0,n_jobs=2,n_estimators=700,verbose=2)
model.fit(df1,y_train)
print(model.score(df1,y_train))
y_pred=model.predict(df2)
subs=pd.DataFrame({'customer_id':test_df['customer_id'][:10],'churn_risk_score':y_pred[:10]})
subs.to_csv('submission_rfc.csv',index=False)
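
The scores printed above are computed on the same rows the models were fitted on, so they say little about performance on unseen data. One common way to get a more honest estimate is to hold out part of the training data. The sketch below illustrates this on synthetic data from `make_classification` (not on the churn dataset), with an arbitrary 80/20 split and a smaller forest than the one used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic multi-class data standing in for the prepared features.
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# Keep 20% of the rows aside; stratify so class shares stay similar.
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_fit, y_fit)

train_acc = clf.score(X_fit, y_fit)   # accuracy on rows seen during fitting
val_acc = clf.score(X_val, y_val)     # accuracy on held-out rows

print('train accuracy:', train_acc)
print('validation accuracy:', val_acc)
```

A large gap between the two numbers is a sign of overfitting.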

HURRAH !!!!!

We've completed the whole project.

If you like this, do not forget to upvote.

If you have any questions or feedback, do comment.

# Thank You for visiting :)