In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

## Problem 3: Churn Analysis. Part I: Exploratory Data Analysis

*Churn* is when customers stop using the services of a company.
Thus, churn prediction is about identifying customers who are likely to cancel their contracts soon.
This problem is based on Chapter 3 of [Machine Learning Bookcamp](https://www.manning.com/books/machine-learning-bookcamp).

<img src="ML bookcamp.png" alt="Drawing" style="width: 150px;"/>

### Introduction

Imagine that you are working at a telecom company that offers phone and internet services.
The company has a problem: some of your customers are churning.
They are no longer using your services and are switching to a different provider.
You would like to prevent that from happening.

You have collected a dataset that records some information about your customers: what type of services they used, how much they paid, how long they stayed with you, etc.

In [2]:
# load the data
path = 'https://raw.githubusercontent.com/um-perez-alvaro/Data-Science-Practice/master/Data/telco.csv'
data = pd.read_csv(path)
data.head()

| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |


**Content**

Each row represents a customer; each column contains one of the customer's attributes, described below.

The data set includes information about:

- Customers who left within the last month – the column is called Churn
- Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
- Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges
- Demographic info about customers – gender, age range, and if they have partners and dependents

**Attribute descriptions**

|column | description |
| --- | --- |
| customerID | The ID of the customer |
| gender | male/female |
| SeniorCitizen | whether the customer is a senior citizen (0/1) |
| Partner | whether the customer lives with a partner (yes/no) |
| Dependents | whether the customer has dependents (yes/no) |
| tenure | number of months since the start of the contract |
| PhoneService | whether they have phone service (yes/no) |
| MultipleLines | whether the customer has multiple phone lines (yes/no/no phone service) |
| InternetService | the type of internet service (DSL/fiber optic/no) |
| OnlineSecurity | if online security is enabled (yes/no/no internet) |
| OnlineBackup | if online backup service is enabled (yes/no/no internet)|
| DeviceProtection | if the device protection service is enabled (yes/no/no internet) |
| TechSupport | if the customer has tech support (yes/no/no internet) |
| StreamingTV | if the TV streaming service is enabled (yes/no/no internet) |
| StreamingMovies | if the movie streaming service is enabled (yes/no/no internet) |
| Contract | the type of contract (monthly/yearly/two years) |
| PaperlessBilling | if the billing is paperless (yes/no) |
| PaymentMethod | payment method (electronic check, mailed check, bank transfer, credit card) |
| MonthlyCharges | the amount charged monthly |
| TotalCharges | the total amount charged |
| Churn | if the client has canceled the contract (yes/no) |

### Fixing column data types 

**Part 1:** Display the data type of each column 

In [7]:
data.dtypes

customerID           object
gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object

**Part 2:** The data type of ``TotalCharges`` has been inferred wrongly (``object`` instead of ``float64``).
Why?
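One way to investigate is to list the entries that cannot be parsed as numbers. The following is a minimal sketch on a toy Series; the values are invented, and the blank string only stands in for whatever non-numeric entries the real column may contain. On the real data you would use `data['TotalCharges']` in place of `toy_charges`.

```python
import pandas as pd

# Toy stand-in for data['TotalCharges']; the values are invented for illustration.
toy_charges = pd.Series(['29.85', '1889.5', ' ', '108.15'])

# A single non-numeric string is enough for pandas to read the column as text.
print(toy_charges.dtype)

# Entries that fail numeric parsing become NaN under errors='coerce';
# the mask picks out the original strings responsible.
bad_mask = pd.to_numeric(toy_charges, errors='coerce').isna()
bad_values = toy_charges[bad_mask]
print(bad_values.tolist())
```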

**Part 3:** You can force ``TotalCharges`` to be numeric by converting it with the pandas function [``to_numeric``](https://pandas.pydata.org/docs/reference/api/pandas.to_numeric.html).
If you pass ``errors='coerce'``, ``to_numeric`` will replace all nonnumeric values with ``NaN``.
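As a hedged illustration of the conversion, here is a small sketch on a toy DataFrame (the name `toy` and its values are made up; on the real data the same assignment would be applied to `data`).

```python
import pandas as pd

# Toy frame with one entry that is not a number (a blank string).
toy = pd.DataFrame({'TotalCharges': ['29.85', ' ', '108.15']})

# errors='coerce' turns anything that cannot be parsed into NaN
# instead of raising an exception.
toy['TotalCharges'] = pd.to_numeric(toy['TotalCharges'], errors='coerce')

print(toy['TotalCharges'].dtype)
print(toy)
```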

**Part 4:** How many missing values does ``TotalCharges`` contain?

**Part 5:** Set the missing values of ``TotalCharges`` to zero.
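Counting and filling missing values could look like the sketch below. It assumes the column has already been converted to numbers; `toy` and its values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy numeric column with two missing entries.
toy = pd.DataFrame({'TotalCharges': [29.85, np.nan, 108.15, np.nan]})

# isna() marks missing entries; summing the boolean mask counts them.
n_missing = toy['TotalCharges'].isna().sum()
print(n_missing)

# fillna(0) returns a copy with the missing entries replaced by zero.
toy['TotalCharges'] = toy['TotalCharges'].fillna(0)
print(toy)
```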

### Fixing column names

**Part 6:** Notice that the column names don't follow the same naming convention.
Some names start with a lowercase letter, whereas other names start with a capital letter.

Make the names uniform by lowercasing everything.
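One possible approach uses the string accessor on the column index. The sketch below runs on an empty toy frame that borrows a few of the column names; on the real data the same line would be applied to `data.columns`.

```python
import pandas as pd

# Empty toy frame that only carries a few of the mixed-case column names.
toy = pd.DataFrame(columns=['customerID', 'SeniorCitizen', 'tenure', 'Churn'])

# .str.lower() lowercases every column label at once.
toy.columns = toy.columns.str.lower()

print(list(toy.columns))
```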

### Churn rates

Looking at the data before training a Machine Learning model is important. 
The more you know about the data, the better the model you can build afterward.

**Part 7:** The column ``churn`` is categorical, with two values, ``Yes`` and ``No``.
Convert the ``churn`` values to numbers (1=yes, 0=no).
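A boolean comparison followed by a cast is one way to do this conversion. In the sketch below, `toy` is a made-up frame; the comparison lowercases the strings first so that it works whether the values are spelled `Yes`/`No` or `yes`/`no`.

```python
import pandas as pd

# Toy churn column with the capitalization shown in the data preview.
toy = pd.DataFrame({'churn': ['No', 'No', 'Yes', 'No', 'Yes']})

# The comparison yields True/False; astype(int) maps them to 1/0.
toy['churn'] = (toy['churn'].str.lower() == 'yes').astype(int)

print(toy['churn'].tolist())
```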

**Part 8:** A dataset is imbalanced if one class of the ``churn`` attribute has many more observations than the other.
Is your dataset balanced or imbalanced?


(it is an imbalanced dataset; the majority of the customers didn't churn)
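Class balance can be checked by counting the observations in each class. This is a sketch on an invented toy column, assuming `churn` has already been encoded as 0/1; the proportions it prints say nothing about the real dataset.

```python
import pandas as pd

# Toy 0/1 churn column; the values are invented.
toy = pd.DataFrame({'churn': [0, 0, 0, 1, 0, 0, 1, 0]})

# Absolute number of observations per class.
counts = toy['churn'].value_counts()

# normalize=True gives the share of each class instead of the count.
shares = toy['churn'].value_counts(normalize=True)

print(counts)
print(shares)
```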

**Part 9:** The ``churn rate`` is defined as the proportion of churned users.
Compute the ``churn rate`` of the dataset.
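With `churn` encoded as 0/1, the proportion of churned users is simply the mean of the column. A minimal sketch on invented values:

```python
import pandas as pd

# Toy 0/1 churn column; one churned user out of four.
toy = pd.DataFrame({'churn': [0, 0, 1, 0]})

# The mean of a 0/1 column equals the fraction of ones.
churn_rate = toy['churn'].mean()

print(churn_rate)
```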

You can use the churn rate within each group of a variable (`gender`, `seniorcitizen`, etc.) to identify the characteristics of customers who churn.

**Part 10:** 
The `gender` variable can take two values, female and male. 
Compute the churn rate for female and male customers.
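A group-wise mean gives the churn rate per category. The sketch below uses an invented toy frame, so the rates it prints do not reflect the real dataset.

```python
import pandas as pd

# Toy frame; the gender and churn values are made up.
toy = pd.DataFrame({
    'gender': ['Female', 'Female', 'Female', 'Female', 'Male', 'Male', 'Male', 'Male'],
    'churn':  [1, 0, 0, 0, 1, 1, 0, 0],
})

# Mean of the 0/1 churn column within each gender group.
gender_rates = toy.groupby('gender')['churn'].mean()

print(gender_rates)
```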

**Part 11:** Compare the male and female churn rates with the global churn rate (from Part 9).
Would knowing the gender of a customer help you identify whether the customer will churn? 

In [20]:
# The gender churn rates are very similar to the global churn rate. 
# Knowing the gender of the customer doesn't help us identify whether they will churn
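One way to make the comparison concrete is to compute, for each group, the difference from the global rate and the ratio to it. The column names `diff` and `ratio` are choices made for this sketch, and the toy values are invented, so they deliberately show a gap that the real gender data may not have.

```python
import pandas as pd

# Toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    'gender': ['Female', 'Female', 'Female', 'Female', 'Male', 'Male', 'Male', 'Male'],
    'churn':  [1, 0, 0, 0, 1, 1, 0, 0],
})

global_rate = toy['churn'].mean()

# Per-group churn rate and group size.
by_gender = toy.groupby('gender')['churn'].agg(['mean', 'count'])

# A difference near 0 (or a ratio near 1) suggests the attribute carries little signal.
by_gender['diff'] = by_gender['mean'] - global_rate
by_gender['ratio'] = by_gender['mean'] / global_rate

print(by_gender)
```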

**Part 12:** Compute the churn rates for the attributes `seniorcitizen`, `partner`, `dependents`, `phoneservice`, `multiplelines`, `internetservice`, `onlinesecurity`, `onlinebackup`, `deviceprotection`, `techsupport`, `streamingtv`, `streamingmovies`, `contract`, `paperlessbilling`, and `paymentmethod`.
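Repeating the group-wise mean over a list of columns avoids writing the same line fifteen times. The sketch below uses a toy frame with only two of the attributes and invented values; on the real data, `attributes` would hold the full list above and `toy` would be replaced by `data`.

```python
import pandas as pd

# Toy frame with two categorical attributes; all values are invented.
toy = pd.DataFrame({
    'seniorcitizen': [0, 0, 1, 1, 0, 1],
    'contract': ['Month-to-month', 'Two year', 'Month-to-month',
                 'One year', 'Month-to-month', 'Two year'],
    'churn': [1, 0, 1, 1, 0, 0],
})

attributes = ['seniorcitizen', 'contract']

# One Series of churn rates per attribute, keyed by attribute name.
rates = {col: toy.groupby(col)['churn'].mean() for col in attributes}

for col, col_rates in rates.items():
    print(col_rates, end='\n\n')
```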

**Part 13:** Which attributes do you think will be important for detecting churn?