## **Machine Learning Project for Customer Churn Dataset**



**customer_history** (one row per customer per month)

*   cust_id: Unique customer ID
*   date: Month over which transactions are aggregated (in YYYY-MM-DD format). Each row summarizes one customer's transactions for that month.
*   mobile_eft_all_cnt: Number of mobile electronic funds transfer (EFT) transactions made by the customer during the month.
*   active_product_category_nbr: Number of active product categories the customer has during the month.
*   mobile_eft_all_amt: Total amount (€) of the customer's mobile EFT transactions during the month.
*   cc_transaction_all_amt: Total amount (€) of the customer's credit card transactions during the month.
*   cc_transaction_all_cnt: Number of the customer's credit card transactions during the month.

**customers** (one row per customer)

*   gender: Customer’s gender
*   age: Customer’s age in years
*   province: Region code of the customer’s residence
*   religion: Customer’s religion
*   work_type: Customer’s employment status
*   work_sector: Sector in which the customer works
*   tenure: Customer’s tenure in months

**referance_data** (the test file, referance_data_test, has no churn column)

*   ref_date: Reference date at which the churn label is assigned
*   churn: Customer’s churn status within 6 months after the reference date (1 = churned, 0 = not churned)
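The column descriptions above imply three tables keyed by `cust_id`. As a minimal sketch on invented toy rows (not the real data), the following shows one way the monthly history might be restricted to the months before each customer's `ref_date` and aggregated into one feature row per customer. The `build_features` helper, the aggregate names, and the choice to exclude the `ref_date` month itself are assumptions made for illustration.

```python
import pandas as pd

# Toy rows that mimic the column layout described above (all values are invented).
history = pd.DataFrame({
    "cust_id": [0, 0, 0, 1, 1],
    "date": pd.to_datetime(["2018-01-01", "2018-02-01", "2018-03-01",
                            "2018-01-01", "2018-02-01"]),
    "mobile_eft_all_cnt": [1.0, 2.0, 4.0, None, 3.0],
    "mobile_eft_all_amt": [10.0, 20.0, 40.0, None, 30.0],
})
reference = pd.DataFrame({
    "cust_id": [0, 1],
    "ref_date": pd.to_datetime(["2018-03-01", "2018-03-01"]),
    "churn": [0, 1],
})


def build_features(history, reference):
    """Hypothetical helper: aggregate each customer's history before ref_date."""
    merged = history.merge(reference[["cust_id", "ref_date"]], on="cust_id", how="inner")
    # Assumption: months on or after ref_date are dropped to avoid label leakage.
    past = merged[merged["date"] < merged["ref_date"]]
    feats = past.groupby("cust_id").agg(
        n_months=("date", "count"),
        eft_cnt_sum=("mobile_eft_all_cnt", "sum"),    # NaN months are skipped
        eft_amt_mean=("mobile_eft_all_amt", "mean"),  # NaN months are skipped
    ).reset_index()
    # Left join keeps customers that have no usable history rows.
    return reference.merge(feats, on="cust_id", how="left")


features = build_features(history, reference)
print(features)
```

The left join in the last step is a design choice: a customer with no history before `ref_date` still gets a row, with missing feature values, rather than being silently dropped.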


### **Data Load Steps**


*   CSV files are retrieved from Google Drive




In [2]:
# Importing libraries
import builtins
import pandas as pd
import numpy as np

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
customer_history = pd.read_csv('/content/drive/MyDrive/Customer Churn/customer_history.csv')
customers = pd.read_csv('/content/drive/MyDrive/Customer Churn/customers.csv')
referance_data_test = pd.read_csv('/content/drive/MyDrive/Customer Churn/referance_data_test.csv')
referance_data = pd.read_csv('/content/drive/MyDrive/Customer Churn/referance_data.csv')
sample_submission = pd.read_csv('/content/drive/MyDrive/Customer Churn/sample_submission.csv')
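The `info()` output further down shows that `date` is loaded as `object` (plain strings). As a hedged sketch, the dates could instead be parsed at load time with `parse_dates`. The `load_history` helper is hypothetical, the in-memory CSV stands in for the real file (its two rows are taken from the notebook's `head()` output), and the `int32` downcast assumes customer IDs stay well below the 32-bit limit.

```python
import io
import pandas as pd

# Stand-in for customer_history.csv so the example runs without Google Drive.
sample_csv = io.StringIO(
    "cust_id,date,mobile_eft_all_cnt,active_product_category_nbr,"
    "mobile_eft_all_amt,cc_transaction_all_amt,cc_transaction_all_cnt\n"
    "0,2016-01-01,1.0,2,151.20,,\n"
    "0,2016-02-01,1.0,2,178.70,,\n"
)


def load_history(source):
    """Hypothetical loader: parse `date` as datetime and store cust_id as int32."""
    return pd.read_csv(
        source,
        parse_dates=["date"],        # datetime64 instead of object
        dtype={"cust_id": "int32"},  # assumption: IDs fit comfortably in 32 bits
    )


history_sample = load_history(sample_csv)
print(history_sample.dtypes)
```

With real files, `source` would be the Drive path used in the cell above; the same call works for both paths and file-like objects.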

In [5]:
print(customer_history.head())
print(customers.head())
print(referance_data_test.head())
print(referance_data.head())
print(sample_submission.head())

   cust_id        date  mobile_eft_all_cnt  active_product_category_nbr  \
0        0  2016-01-01                 1.0                            2   
1        0  2016-02-01                 1.0                            2   
2        0  2016-03-01                 2.0                            2   
3        0  2016-04-01                 4.0                            2   
4        0  2016-05-01                 3.0                            3   

   mobile_eft_all_amt  cc_transaction_all_amt  cc_transaction_all_cnt  
0              151.20                     NaN                     NaN  
1              178.70                     NaN                     NaN  
2               37.38                     NaN                     NaN  
3              100.90                     NaN                     NaN  
4              132.28                     NaN                     NaN  
   cust_id gender  age province religion      work_type work_sector  tenure
0        0      F   64      NOH        U 

In [7]:
# Keep an untouched copy of each DataFrame for later use.
customer_history_copy = customer_history.copy()
customers_copy = customers.copy()
referance_data_test_copy = referance_data_test.copy()
referance_data_copy = referance_data.copy()
sample_submission_copy = sample_submission.copy()

### **Data Cleaning and Preprocessing**

In [8]:
df_list = [customer_history, customers, referance_data_test, referance_data, sample_submission]

In [9]:
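def desc_df(dfs):
    # Walk the notebook's global variables and pick out the DataFrames listed in dfs.
    for name, obj in globals().items():
        if isinstance(obj, pd.DataFrame) and any(obj is df for df in dfs):
            # info() prints its report and returns None, so print() also emits "None".
            print(obj.info())

desc_df(df_list)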

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5359609 entries, 0 to 5359608
Data columns (total 7 columns):
 #   Column                       Dtype  
---  ------                       -----  
 0   cust_id                      int64  
 1   date                         object 
 2   mobile_eft_all_cnt           float64
 3   active_product_category_nbr  int64  
 4   mobile_eft_all_amt           float64
 5   cc_transaction_all_amt       float64
 6   cc_transaction_all_cnt       float64
dtypes: float64(4), int64(2), object(1)
memory usage: 286.2+ MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 176293 entries, 0 to 176292
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   cust_id      176293 non-null  int64 
 1   gender       176293 non-null  object
 2   age          176293 non-null  int64 
 3   province     176293 non-null  object
 4   religion     176293 non-null  object
 5   work_type    176293 non-null  ob

In [10]:
def check_null(dfs):
    """Return (name, summary) pairs for every DataFrame in dfs that contains nulls."""
    null_list = []
    for name, obj in globals().items():
        if isinstance(obj, pd.DataFrame) and any(obj is df for df in dfs):
            n_rows = obj.shape[0]  # avoids shadowing the built-in len
            nulls = obj.isnull().sum()
            if nulls.any():
                # '%null' holds the fraction of rows that are null (0.02 means 2%).
                null_ratio = round(nulls[nulls > 0] / n_rows, 2)
                summary = pd.DataFrame({
                    'null_count': nulls[nulls > 0],
                    '%null': null_ratio
                })
                null_list.append((name, summary))
    return null_list

check_null(df_list)

[('customer_history',
                          null_count  %null
  mobile_eft_all_cnt          112334   0.02
  mobile_eft_all_amt          112334   0.02
  cc_transaction_all_amt      166746   0.03
  cc_transaction_all_cnt      166746   0.03),
 ('customers',
               null_count  %null
  work_sector       30134   0.17)]
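The ratios above can be checked by hand from the reported null counts and the row totals shown in the `info()` output (5359609 rows in `customer_history`, 176293 in `customers`). The sketch below repeats that arithmetic and then shows, on an invented toy frame, that `isna().mean()` yields the same kind of fraction in one call.

```python
import pandas as pd

# (null_count, total_rows) pairs as reported in the notebook output.
reported = {
    "mobile_eft_all_cnt": (112334, 5359609),
    "cc_transaction_all_amt": (166746, 5359609),
    "work_sector": (30134, 176293),
}
ratios = {col: round(nulls / total, 2) for col, (nulls, total) in reported.items()}
print(ratios)
# {'mobile_eft_all_cnt': 0.02, 'cc_transaction_all_amt': 0.03, 'work_sector': 0.17}

# Shortcut on a toy frame: the mean of a boolean null mask is the null fraction.
toy = pd.DataFrame({"a": [1.0, None, 3.0, None], "b": [1, 2, 3, 4]})
null_fraction = toy.isna().mean()
print(null_fraction[null_fraction > 0])
```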

In [11]:
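def check_duplicates(dfs):
    """Return (name, count) pairs for every DataFrame in dfs with fully duplicated rows."""
    duplicates_list = []
    for name, obj in globals().items():
        if isinstance(obj, pd.DataFrame) and any(obj is df for df in dfs):
            # duplicated().sum() is a single integer: the number of repeated rows.
            dups = int(obj.duplicated().sum())
            if dups > 0:
                duplicates_list.append((name, dups))
    return duplicates_list

check_duplicates(df_list)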

[]
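The three helpers above recover each DataFrame's name by scanning `globals()`. As an alternative sketch, the names can be passed in explicitly through a dict, which keeps the helper independent of notebook state. The `summarize` function and the toy tables are hypothetical and only illustrate the pattern.

```python
import pandas as pd


def summarize(named_dfs):
    """Hypothetical helper: row count, null cells, and duplicated rows per table."""
    rows = []
    for name, df in named_dfs.items():
        rows.append({
            "table": name,
            "rows": df.shape[0],
            "null_cells": int(df.isna().sum().sum()),
            "duplicate_rows": int(df.duplicated().sum()),
        })
    return pd.DataFrame(rows).set_index("table")


# Invented toy tables; with the real data the dict would map names to the loaded frames.
toy_tables = {
    "customers": pd.DataFrame({
        "cust_id": [0, 1, 1, 2],
        "work_sector": ["Finance", "Tech", "Tech", None],
    }),
    "reference": pd.DataFrame({"cust_id": [0, 1], "churn": [0, 1]}),
}
summary = summarize(toy_tables)
print(summary)
```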

In [21]:
duplicated_test = customers.cust_id.value_counts()
duplicated_test[duplicated_test > 1]

| cust_id | count |
|---|---|


In [28]:
duplicated_test2 = customer_history.cust_id.value_counts()
duplicated_test2[duplicated_test2 > 1].size

176293
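The count above matches the number of rows in `customers` (176293), which is what a monthly history table would produce: every customer appears in more than one month. Repeated `cust_id` values are therefore expected here. A stricter check, sketched below on invented toy rows, looks for repeats of the `(cust_id, date)` pair instead; treating that pair as the table's key is an assumption based on the column descriptions.

```python
import pandas as pd

# Toy history: customer 1 has two rows for 2016-01-01 with different values.
toy_history = pd.DataFrame({
    "cust_id": [0, 0, 1, 1, 1],
    "date": ["2016-01-01", "2016-02-01", "2016-01-01", "2016-01-01", "2016-02-01"],
    "mobile_eft_all_cnt": [1.0, 1.0, 2.0, 5.0, 3.0],
})

key = ["cust_id", "date"]  # assumed composite key: one row per customer per month
# keep=False marks every row that shares its key with another row.
dup_mask = toy_history.duplicated(subset=key, keep=False)
n_dup = int(dup_mask.sum())
print(toy_history[dup_mask])

# A full-row check misses these rows because the count column differs.
n_full_row_dup = int(toy_history.duplicated().sum())
```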

In [16]:
customers.head()

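| | cust_id | gender | age | province | religion | work_type | work_sector | tenure |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | F | 64 | NOH | U | Part-time | Technology | 135 |
| 1 | 1 | F | 57 | ZUI | O | Full-time | Finance | 65 |
| 2 | 2 | F | 62 | NOB | M | Self-employed | Healthcare | 224 |
| 3 | 3 | F | 22 | ZUI | C | Student | NaN | 47 |
| 4 | 5 | M | 27 | ZUI | U | Full-time | Finance | 108 |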


In [17]:
customer_history.head()

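| | cust_id | date | mobile_eft_all_cnt | active_product_category_nbr | mobile_eft_all_amt | cc_transaction_all_amt | cc_transaction_all_cnt |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 2016-01-01 | 1.0 | 2 | 151.2 | NaN | NaN |
| 1 | 0 | 2016-02-01 | 1.0 | 2 | 178.7 | NaN | NaN |
| 2 | 0 | 2016-03-01 | 2.0 | 2 | 37.38 | NaN | NaN |
| 3 | 0 | 2016-04-01 | 4.0 | 2 | 100.9 | NaN | NaN |
| 4 | 0 | 2016-05-01 | 3.0 | 3 | 132.28 | NaN | NaN |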


In [31]:
referance_data_test.head()


| | cust_id | ref_date |
|---|---|---|
| 0 | 1 | 2019-02-01 |
| 1 | 2 | 2019-01-01 |
| 2 | 9 | 2019-03-01 |
| 3 | 15 | 2019-06-01 |
| 4 | 19 | 2019-01-01 |


In [38]:
referance_data.sort_values("cust_id")


| | cust_id | ref_date | churn |
|---|---|---|---|
| 0 | 0 | 2017-09-01 | 0 |
| 1 | 3 | 2018-10-01 | 0 |
| 2 | 5 | 2018-03-01 | 1 |
| 3 | 6 | 2018-04-01 | 1 |
| 4 | 7 | 2018-05-01 | 0 |
| ... | ... | ... | ... |
| 133282 | 199995 | 2018-09-01 | 0 |
| 133283 | 199996 | 2018-06-01 | 0 |
| 133284 | 199997 | 2018-12-01 | 0 |
| 133285 | 199998 | 2018-02-01 | 1 |


In [39]:
duplicated_test3 = referance_data.cust_id.value_counts()
duplicated_test3[duplicated_test3 > 1]

| cust_id | count |
|---|---|


In [42]:
referance_data.cust_id.duplicated().sum()

np.int64(0)


In [44]:
print(referance_data_test.ref_date.min(), referance_data_test.ref_date.max())

2019-01-01 2019-06-01


In [45]:
print(referance_data.ref_date.min(), referance_data.ref_date.max())

2017-07-01 2018-12-01
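Because `ref_date` is stored as a string, the `min()` and `max()` calls above compare text; this happens to give the right answer for ISO-formatted dates. As a small sketch using only the four dates printed above, converting to timestamps makes it possible to confirm that the two ranges do not overlap and to count the calendar months each one spans. The reading that the 6-month label window starts at `ref_date` is an assumption.

```python
import pandas as pd

# Date ranges as printed by the two cells above.
train_min, train_max = pd.Timestamp("2017-07-01"), pd.Timestamp("2018-12-01")
test_min, test_max = pd.Timestamp("2019-01-01"), pd.Timestamp("2019-06-01")

# The training reference dates end before the test reference dates begin.
no_overlap = train_max < test_min

# Calendar months spanned by each range ("MS" = month start frequency).
train_months = len(pd.date_range(train_min, train_max, freq="MS"))
test_months = len(pd.date_range(test_min, test_max, freq="MS"))

# Assumption: the 6-month churn window is measured forward from ref_date.
label_window_end = train_max + pd.DateOffset(months=6)

print(no_overlap, train_months, test_months, label_window_end.date())
```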
