# __THE TELCO CHURN CHALLENGE FOR REXEL__

## 1. Introduction

The objective of this challenge is to prevent customers from leaving TELCO Inc's phone services.

There are many reasons why customers may churn. It's crucial to detect those customers before they leave.

One of the most effective ways to achieve that goal is to use the data.

Based on historical data, we are going to detect customers who may leave and suggest actions that can prevent them from leaving.

## 2. The data

We have 2 datasets for this challenge. _The training dataset_ will be used to train, test and evaluate machine learning models. _The validation dataset_ is for the final submission of the challenge.
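
As a minimal sketch of how a training dataset could be divided into a part for fitting models and a part for evaluating them, here is a simple random hold-out split. The 80/20 ratio, the seed, the toy values and the `STAY`/`LEAVE` labels are assumptions made for illustration and are not taken from the challenge data:

```python
import pandas as pd

# Toy stand-in for the training data (made-up values, for illustration only).
toy = pd.DataFrame({
    "CUSTOMER_ID": [f"C{i}" for i in range(10)],
    "CHURNED": ["STAY", "LEAVE"] * 5,
})

# Hold out 20% of the rows for evaluation; the ratio and seed are assumptions.
test_part = toy.sample(frac=0.2, random_state=0)
train_part = toy.drop(test_part.index)

print(train_part.shape, test_part.shape)
```

The two parts are disjoint by construction, because the evaluation rows are removed from the frame by index.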

In [3]:
import numpy as np
import pandas as pd

data = pd.read_csv('data/training.csv', na_values=[" "])

In [4]:
print("The dataset shape: ", data.shape)

The dataset shape:  (11981, 19)


The data we have for this challenge has __11981__ rows and __19__ columns (or variables). Let's look at the first 5 rows of the data, transposed so that each column is one customer:

In [5]:
data.head().T

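| | 0 | 1 | 2 | 3 | 4 |
| --- | --- | --- | --- | --- | --- |
| CUSTOMER_ID | C100000 | C100001 | C100006 | C100008 | C100010 |
| COLLEGE | zero | one | zero | zero | one |
| DATA | 660 | 317.647 | 208.696 | 265.018 | 440 |
| INCOME | 19995 | 31477 | 66742 | 40864 | 43321.5 |
| OVERCHARGE | 0 | 155 | 0 | 183 | 200 |
| LEFTOVER | 0 | 15 | 13 | 0 | 0 |
| HOUSE | 897338 | 393396 | 937197 | 986430 | 394622 |
| LESSTHAN600k | False | True | False | False | True |
| CHILD | 4 | 0 | 4 | 3 | 2 |
| JOB_CLASS | 3 | 1 | 2 | 3 | 3 |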


In [6]:
data.dtypes

CUSTOMER_ID                     object
COLLEGE                         object
DATA                           float64
INCOME                         float64
OVERCHARGE                       int64
LEFTOVER                         int64
HOUSE                          float64
LESSTHAN600k                    object
CHILD                            int64
JOB_CLASS                        int64
REVENUE                        float64
HANDSET_PRICE                    int64
OVER_15MINS_CALLS_PER_MONTH      int64
TIME_CLIENT                    float64
AVERAGE_CALL_DURATION            int64
REPORTED_SATISFACTION           object
REPORTED_USAGE_LEVEL            object
CONSIDERING_CHANGE_OF_PLAN      object
CHURNED                         object
dtype: object

### 2.1 The data description

* CUSTOMER_ID: A unique customer identifier (categorical)
* COLLEGE: Is the customer college educated? (one or zero) (categorical)
* DATA: Monthly data consumption in MB (numerical)
* INCOME: Annual salary of the client (numerical)
* OVERCHARGE: Average overcharge per year (numerical)
* LEFTOVER: Average number of leftover minutes per month (numerical)
* HOUSE: Estimated value of the house (numerical)
* LESSTHAN600k: Is the value of the house lower than 600K? (categorical)
* CHILD: The number of children (numerical)
* JOB_CLASS: Self reported type of job (categorical)
* REVENUE: Annual phone bill (numerical)
* HANDSET_PRICE: The price of the handset (phone) (numerical)
* OVER_15MINS_CALLS_PER_MONTH: Average number of long calls (more than 15 minutes) (numerical)
* TIME_CLIENT: The tenure in years (numerical)
* AVERAGE_CALL_DURATION: The average duration of a call (numerical)
* REPORTED_SATISFACTION: The reported level of satisfaction (categorical)
* REPORTED_USAGE_LEVEL: The self reported usage level (categorical)
* CONSIDERING_CHANGE_OF_PLAN: Self reported consideration whether to change operator (categorical)
* CHURNED: Did the customer stay or leave? This is the target class (categorical)
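
The description above can be transcribed into two lists of column names, which makes it easy to check that every column of a frame has a documented role. This is only a sketch: the helper `undocumented_columns` and the toy frame (including its `EXTRA` column) are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Column roles transcribed from the description above.
CATEGORICAL = [
    "CUSTOMER_ID", "COLLEGE", "LESSTHAN600k", "JOB_CLASS",
    "REPORTED_SATISFACTION", "REPORTED_USAGE_LEVEL",
    "CONSIDERING_CHANGE_OF_PLAN", "CHURNED",
]
NUMERICAL = [
    "DATA", "INCOME", "OVERCHARGE", "LEFTOVER", "HOUSE", "CHILD",
    "REVENUE", "HANDSET_PRICE", "OVER_15MINS_CALLS_PER_MONTH",
    "TIME_CLIENT", "AVERAGE_CALL_DURATION",
]


def undocumented_columns(frame: pd.DataFrame) -> list:
    """Return the columns of `frame` that appear in neither list (hypothetical helper)."""
    known = set(CATEGORICAL) | set(NUMERICAL)
    return [column for column in frame.columns if column not in known]


# Toy frame with one unexpected column, for illustration only.
toy = pd.DataFrame({"DATA": [660.0], "COLLEGE": ["zero"], "EXTRA": [1]})
print(undocumented_columns(toy))
```

Together the two lists cover the 19 columns of the dataset (8 categorical, 11 numerical).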

### 2.2 The data quality

Before starting any analysis, it's important to check the quality of the data. In particular, we are going to check whether there are missing values:

In [7]:
data.isna().sum()

CUSTOMER_ID                      0
COLLEGE                          0
DATA                             0
INCOME                           0
OVERCHARGE                       0
LEFTOVER                         0
HOUSE                          635
LESSTHAN600k                   635
CHILD                            0
JOB_CLASS                        0
REVENUE                          0
HANDSET_PRICE                    0
OVER_15MINS_CALLS_PER_MONTH      0
TIME_CLIENT                      0
AVERAGE_CALL_DURATION            0
REPORTED_SATISFACTION            0
REPORTED_USAGE_LEVEL             0
CONSIDERING_CHANGE_OF_PLAN       0
CHURNED                          0
dtype: int64

There are 2 variables with missing values: HOUSE (the house value) and LESSTHAN600k (is the house value lower than 600K?).
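
To put these counts in proportion, the share of missing values per column can be computed with `isna().mean()`. The sketch below uses a small made-up frame rather than the challenge data, then repeats the arithmetic with the counts reported above (635 missing values out of 11981 rows):

```python
import numpy as np
import pandas as pd

# Toy frame: 1 of 4 rows lacks the house information (made-up values).
toy = pd.DataFrame({
    "HOUSE": [897338.0, np.nan, 937197.0, 986430.0],
    "LESSTHAN600k": [False, np.nan, False, False],
    "CHILD": [4, 0, 4, 3],
})

# Percentage of missing values per column, keeping only the affected columns.
missing_pct = toy.isna().mean() * 100
print(missing_pct[missing_pct > 0])

# Same arithmetic with the counts reported above: 635 of 11981 rows.
share_in_training_data = round(100 * 635 / 11981, 1)
print(share_in_training_data)  # 5.3
```

So roughly 5.3% of the training rows lack both HOUSE and LESSTHAN600k.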
 

__What is the type of the missing values?__

In [16]:
# Keep only the rows where HOUSE is missing
dataNa = data[data['HOUSE'].isna()]
lessthan600k = dataNa['LESSTHAN600k']

# The percentage of missing values in the column LESSTHAN600k among those rows
100*lessthan600k.isna().sum()/lessthan600k.shape[0]

100.0

The variable __LESSTHAN600k__ has missing values because the house value was not estimated. __The values are missing not at random (MNAR)__.

There is no suitable method to impute these missing values when they are missing not at random. 

I choose to treat missing values as a separate level of the variable __LESSTHAN600k__. For the variable __HOUSE__, I will replace missing values with __0__.

In [17]:
data = data.dropna()
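
The cell above drops the incomplete rows. The strategy described in the text (a dedicated level for LESSTHAN600k and 0 for HOUSE) could instead look like the following minimal sketch. The level name `"unknown"` and the toy values are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

# Toy frame with one row lacking the house information (made-up values).
toy = pd.DataFrame({
    "HOUSE": [897338.0, np.nan, 393396.0],
    "LESSTHAN600k": [False, np.nan, True],
})

imputed = toy.copy()

# Missing house values become 0, as proposed in the text.
imputed["HOUSE"] = imputed["HOUSE"].fillna(0)

# Missing flags become their own level; "unknown" is a hypothetical label.
imputed["LESSTHAN600k"] = imputed["LESSTHAN600k"].map(
    lambda value: "unknown" if pd.isna(value) else str(value)
)

print(imputed)
```

With this variant no rows are lost, and a model can still tell apart customers whose house was never valued.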

### 2.3 The data formatting

The last step before using the data for analysis is to convert it into the appropriate format. There are 2 main data types: categorical and numerical. Let's convert each variable to the appropriate data type:

In [None]:
data['CUSTOMER_ID'] = pd.Categorical(data['CUSTOMER_ID'])
data['COLLEGE'] = pd.Categorical(data['COLLEGE'])
data['LESSTHAN600k'] = pd.Categorical(data['LESSTHAN600k'])
data['JOB_CLASS'] = pd.Categorical(data['JOB_CLASS'])
data['REPORTED_SATISFACTION'] = pd.Categorical(data['REPORTED_SATISFACTION'])
data['REPORTED_USAGE_LEVEL'] = pd.Categorical(data['REPORTED_USAGE_LEVEL'])
data['CONSIDERING_CHANGE_OF_PLAN'] = pd.Categorical(data['CONSIDERING_CHANGE_OF_PLAN'])
data['CHURNED'] = pd.Categorical(data['CHURNED'])
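
A quick way to confirm that such a conversion did what was intended is to inspect the resulting dtypes and category levels. The sketch below does this on a small made-up frame with values in the style of the preview above. It is an illustration, not output from the challenge data:

```python
import pandas as pd

# Toy frame mixing categorical and numerical variables (made-up values).
toy = pd.DataFrame({
    "COLLEGE": ["zero", "one", "zero"],
    "JOB_CLASS": [3, 1, 2],
    "INCOME": [19995.0, 31477.0, 66742.0],
})

# Convert the categorical variables; numerical ones are left untouched.
for column in ["COLLEGE", "JOB_CLASS"]:
    toy[column] = pd.Categorical(toy[column])

print(toy.dtypes)
print(toy["COLLEGE"].cat.categories.tolist())
print(toy["JOB_CLASS"].cat.categories.tolist())
```

Note that `pd.Categorical` sorts the levels it finds, so integer codes such as JOB_CLASS become ordered labels rather than quantities.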