# 01. EDA & Data Preprocessing Notebook

## Table of contents

- [Introduction](#01.1-introduction)
- [Data Collection](#01.2-data-collection)


### 01.1 Introduction and Setup

This notebook addresses the challenge of customer churn prediction, a critical issue for businesses aiming to retain their customers and reduce revenue loss. By analyzing historical customer data, we seek to identify patterns and factors that contribute to customer attrition. The insights gained from this exploratory data analysis (EDA) and preprocessing will lay the foundation for building effective predictive models, enabling proactive strategies to improve customer retention.

In [1]:
# Standard library
import os
import sys

# Third-party
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [2]:
# Get the current working directory (expected to be the notebooks folder)
current_dir = os.getcwd()
print(current_dir)

# Get the project root directory (the parent of the notebooks folder)
project_dir = os.path.dirname(current_dir)
print(project_dir)

# Build the absolute path to the src folder
src_dir_path = os.path.join(project_dir, 'src')
print(src_dir_path)

# Append the src folder to Python's module search path (sys.path)
sys.path.append(src_dir_path)

c:\Users\abdoi\DataspellProjects\customer-churn-prediction-docker\notebooks
c:\Users\abdoi\DataspellProjects\customer-churn-prediction-docker
c:\Users\abdoi\DataspellProjects\customer-churn-prediction-docker\src
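
The setup cell above works as long as the notebook is started from the `notebooks` folder, and it appends a new entry to `sys.path` every time it is re-run. The following is a minimal sketch of the same idea using `pathlib`, assuming the layout shown in the output (`notebooks` and `src` as sibling folders). The helper name `add_src_to_path` is hypothetical and not part of the project.

```python
import sys
from pathlib import Path


def add_src_to_path(notebook_dir=None):
    """Add the 'src' folder that sits next to the notebooks folder to sys.path.

    Assumes the layout <project>/notebooks and <project>/src.
    """
    notebook_dir = Path(notebook_dir) if notebook_dir is not None else Path.cwd()
    src_dir = notebook_dir.resolve().parent / "src"
    src_str = str(src_dir)
    # Skip the append if the entry is already there (e.g. the cell was re-run)
    if src_str not in sys.path:
        sys.path.append(src_str)
    return src_dir


src_dir = add_src_to_path()
print(src_dir)
```

The membership check keeps `sys.path` free of duplicate entries across repeated executions of the cell.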


### 01.2 Data Collection

In [16]:
from utils import download_kaggle_dataset

download_kaggle_dataset("blastchar/telco-customer-churn", "../data/raw/")

Dataset URL: https://www.kaggle.com/datasets/blastchar/telco-customer-churn
Dataset blastchar/telco-customer-churn downloaded and extracted to ../data/raw/.
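
Because notebook cells tend to be re-executed, it can be convenient to skip the download when the raw file is already on disk. The sketch below is one possible wrapper, not the project's implementation: `download_if_missing` and its `expected_file` parameter are hypothetical, and the actual download function is passed in as an argument so the wrapper makes no assumptions about how `download_kaggle_dataset` works internally.

```python
from pathlib import Path


def download_if_missing(dataset, dest_dir, expected_file, downloader):
    """Call downloader(dataset, dest_dir) only if expected_file is absent.

    Returns True if a download was triggered, False if it was skipped.
    """
    dest = Path(dest_dir)
    if (dest / expected_file).exists():
        return False
    # Make sure the target folder exists before handing over to the downloader
    dest.mkdir(parents=True, exist_ok=True)
    downloader(dataset, str(dest))
    return True


# Possible use in this notebook (assumes utils.download_kaggle_dataset is importable):
# download_if_missing(
#     "blastchar/telco-customer-churn",
#     "../data/raw/",
#     "WA_Fn-UseC_-Telco-Customer-Churn.csv",
#     download_kaggle_dataset,
# )
```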


### 01.3 Data Overview & Inspection

In [5]:
df = pd.read_csv("../data/raw/WA_Fn-UseC_-Telco-Customer-Churn.csv")


df.head()

| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |
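
The `...` column in the preview above appears because pandas truncates wide frames when displaying them. A minimal sketch of lifting that limit temporarily with `pd.option_context` is shown below. The `wide` frame is toy data made up for illustration; in the notebook the same `with` block could wrap `display(df.head())`.

```python
import pandas as pd

# Toy frame with more columns than pandas shows by default (illustrative data)
wide = pd.DataFrame({f"col_{i}": [i, i + 1] for i in range(30)})

# Temporarily remove the column limit; the option reverts when the block exits
with pd.option_context("display.max_columns", None):
    full_view = repr(wide.head())

print(full_view)
```

Using `option_context` rather than `pd.set_option` keeps the change local to one cell, so later outputs keep the compact default.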


In [7]:
f"DataFrame shape: {df.shape}"

'DataFrame shape: (7043, 21)'
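
The shape alone says little about what each column holds. As a complement, the sketch below builds a per-column summary of dtype, missing values, and distinct values. The helper `summarize_columns` is a hypothetical name, and the `toy` frame is made-up data using three of the dataset's column names; in the notebook, `df` would be passed instead.

```python
import pandas as pd


def summarize_columns(frame):
    """Return one row per column: dtype, missing count, and distinct values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "n_unique": frame.nunique(),  # NaN is not counted as a distinct value
    })


# Toy data for illustration only
toy = pd.DataFrame({
    "tenure": [1, 34, 2],
    "Contract": ["Month-to-month", "One year", "Month-to-month"],
    "MonthlyCharges": [29.85, 56.95, None],
})
summary = summarize_columns(toy)
print(summary)
```

Columns with very few distinct values are usually categorical candidates, while a column with as many distinct values as rows (such as an ID) is typically dropped before modeling.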

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 
