# Analyze Telco Customer Churn
In preparation for a presentation to Telco executives about customer churn, the CFO has asked for an analysis of the factors that most affect churn at the company, along with churn predictions.

This notebook will prepare the data for analysis.

## Data Sources
- WA_Fn-UseC_-Telco-Customer-Churn: [Kaggle Churn data set](https://www.kaggle.com/blastchar/telco-customer-churn/version/1)

## Changes
- 10-12-2021: Started project

## Import Libraries

In [1]:
import pandas as pd
from pathlib import Path
from datetime import datetime

## File Locations

In [2]:
today = datetime.today()
churn_file = Path.cwd() / "data" / "raw" / "WA_Fn-UseC_-Telco-Customer-Churn.csv"
summary_file = Path.cwd() / "data" / "processed" / f"summary_{today:%b-%d-%Y}.pkl"

In [3]:
df = pd.read_csv(churn_file)

## Data Definition

### Identify all column names and data types
There are 21 columns in the dataset:
1. **customerID**: A unique identifier for each customer, made up of four digits, a hyphen, and five capital letters (for example, 7590-VHVEG). **Data type: object**.
2. **gender**: A binary gender classification for each customer (Male, Female). **Data type: object**.
3. **SeniorCitizen**: A binary classification for each customer as a senior citizen (1, 0). **Data type: integer**.
4. **Partner**: A binary classification for whether the customer has a partner (Yes, No). **Data type: object**.
5. **Dependents**: A binary classification for whether the customer has dependents (Yes, No). **Data type: object**.
6. **tenure**: The number of months the customer has stayed with the company. **Data type: integer**.
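7. **PhoneService**: **Data type: object**.
8. **MultipleLines**: **Data type: object**.
9. **InternetService**: **Data type: object**.
10. **OnlineSecurity**: **Data type: object**.
11. **OnlineBackup**: **Data type: object**.
12. **DeviceProtection**: **Data type: object**.
13. **TechSupport**: **Data type: object**.
14. **StreamingTV**: **Data type: object**.
15. **StreamingMovies**: **Data type: object**.
16. **Contract**: **Data type: object**.
17. **PaperlessBilling**: **Data type: object**.
18. **PaymentMethod**
19. **MonthlyCharges**
20. **TotalCharges**
21. **Churn**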

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [11]:
df['Dependents'].unique()

array(['No', 'Yes'], dtype=object)
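
The same check can be repeated for every categorical column instead of one column at a time. The sketch below is only an illustration: `unique_values`, its `max_unique` cutoff, and the three-row `sample` frame (built from values in this data set) are assumptions, not part of the original notebook.

```python
import pandas as pd


def unique_values(frame: pd.DataFrame, max_unique: int = 10) -> dict:
    """Map each low-cardinality column to a sorted list of its unique values."""
    summary = {}
    for col in frame.columns:
        values = list(frame[col].dropna().unique())
        # Skip identifier-like columns that have too many distinct values.
        if len(values) <= max_unique:
            summary[col] = sorted(values, key=str)
    return summary


# Hypothetical stand-in for df, using a few values from the churn data.
sample = pd.DataFrame(
    {
        "customerID": ["7590-VHVEG", "5575-GNVDE", "3668-QPYBK"],
        "Partner": ["Yes", "No", "No"],
        "Contract": ["Month-to-month", "One year", "Month-to-month"],
    }
)
result = unique_values(sample, max_unique=2)
```

On the full data set, `unique_values(df)` should leave out `customerID` and list the allowed values of each binary or categorical column, which makes unexpected categories easy to spot.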

## Column Cleanup
- Remove all leading and trailing spaces
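
One possible way to do this cleanup is sketched below. It assumes the spaces should be trimmed from both the column names and the string values; `strip_whitespace` and the two-row `sample` frame are hypothetical names used only for illustration.

```python
import pandas as pd


def strip_whitespace(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with spaces trimmed from column names and string values."""
    cleaned = frame.copy()
    cleaned.columns = cleaned.columns.str.strip()
    for col in cleaned.columns:
        # Numeric columns cannot carry padding, so leave them untouched.
        if pd.api.types.is_numeric_dtype(cleaned[col]):
            continue
        # Only strip actual strings; other values (e.g. NaN) pass through.
        cleaned[col] = cleaned[col].map(
            lambda v: v.strip() if isinstance(v, str) else v
        )
    return cleaned


# Hypothetical sample with padded names and values.
sample = pd.DataFrame({" gender ": [" Female", "Male "], "tenure": [1, 34]})
cleaned_sample = strip_whitespace(sample)
```

Applied to the notebook's data this would be `df = strip_whitespace(df)`. The function works on a copy, so the original frame stays unchanged until it is reassigned.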

In [4]:
df.head()

| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |
