### **Lifecycle of a Machine Learning Project**

- Understanding the problem statement
- Data Collection
- Exploratory Data Analysis
- Data Cleaning
- Data Preprocessing and Feature Engineering
- Model Training
- Choose the best model

### **About the dataset**

This dataset contains customer information from a telecom company and is used to predict customer churn (whether a customer will leave the service or not).

## **1) Problem Statement**

- The telecom industry loses revenue when existing customers discontinue their services.
- The goal is to analyze customer data and build a predictive model that identifies customers who are likely to churn.

**In this project we use the data to build a classification model**

- The model predicts whether a customer will churn, based on the given dataset.
- By accurately predicting churn, the company can take customer retention actions, optimize its marketing campaigns, and improve customer satisfaction.

## **2) Data Collection**
- The dataset contains customer records from a telecom company.
- The dataset contains 7043 rows and 21 columns.

Dataset Link - [https://www.kaggle.com/datasets/blastchar/telco-customer-churn](https://www.kaggle.com/datasets/blastchar/telco-customer-churn)

**Dataset Description**

- **customerID** - Unique Identifier for each customer
- **gender** - Customer Gender
- **SeniorCitizen** - Whether the customer is a senior citizen (1) or not (0)
- **Partner** - Whether the customer has a partner/spouse
- **Dependents** - Whether the customer has dependents (children/others)
- **tenure** - Number of months the customer has stayed with the company
- **PhoneService** - Whether the customer has a phone line (voice and non-voice services)
- **MultipleLines** - Whether the customer has multiple phone lines
- **InternetService** - Type of internet service (e.g. DSL, Fiber optic)
- **OnlineSecurity** - Whether the customer has the online security add-on (antivirus, malware protection, etc.)
- **OnlineBackup** - Whether the customer has an online backup service, such as cloud storage
- **DeviceProtection** - Whether the customer has device protection (insurance support for the device)
- **TechSupport** - Whether the customer has the premium technical support service
- **StreamingTV** - Whether the customer uses TV streaming services
- **StreamingMovies** - Whether the customer uses movie streaming services
- **Contract** - Type of contract the customer signed (month-to-month, one year, two years)
- **PaperlessBilling** - Whether the customer receives bills by email instead of paper mail
- **PaymentMethod** - How the customer pays their bills
- **MonthlyCharges** - Amount charged to the customer each month
- **TotalCharges** - Total amount the customer has paid over their entire tenure
- **Churn** (target variable) - Whether the customer left the company or not
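
Since **Churn** is the target variable, its encoding and class balance are worth a quick look before modelling. The sketch below runs on a tiny hand-made sample, not the real data, and assumes the labels are the strings "Yes" and "No"; the names `sample`, `churn_flag` and `class_share` are illustrative only.

```python
import pandas as pd

# Tiny hand-made sample, for illustration only (not the real dataset).
sample = pd.DataFrame({"Churn": ["No", "No", "Yes", "No", "Yes"]})

# Map the string labels to 0/1 so the target can be fed to a classifier.
sample["churn_flag"] = sample["Churn"].map({"No": 0, "Yes": 1})

# Share of each class; a strong imbalance would influence metric choice later.
class_share = sample["Churn"].value_counts(normalize=True)
print(class_share)
```

On the real DataFrame the same two calls would apply to `data["Churn"]`, assuming the column holds only those two labels.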

**Import Data and Required Packages**

- Import Pandas, NumPy, Matplotlib, Seaborn, and the warnings library

In [None]:
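import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings as wn

# Silence library warnings to keep the notebook output readable
wn.filterwarnings("ignore")
# Show every column when displaying a DataFrame
pd.set_option("display.max_columns", None)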

**Import the CSV as Pandas DataFrame**

In [6]:
data = pd.read_csv("data/churn-data.csv")

**Showing Top 5 Records**

In [7]:
data.head()

| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | Yes | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | No | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | Yes | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | No | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |


**Shape of the Dataset**

In [8]:
data.shape

(7043, 21)

**Summary of the dataset**

In [9]:
data.describe()

| | SeniorCitizen | tenure | MonthlyCharges |
| --- | --- | --- | --- |
| count | 7043.0 | 7043.0 | 7043.0 |
| mean | 0.162147 | 32.371149 | 64.761692 |
| std | 0.368612 | 24.559481 | 30.090047 |
| min | 0.0 | 0.0 | 18.25 |
| 25% | 0.0 | 9.0 | 35.5 |
| 50% | 0.0 | 29.0 | 70.35 |
| 75% | 0.0 | 55.0 | 89.85 |
| max | 1.0 | 72.0 | 118.75 |
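
The summary above covers only three columns because `describe()` reports numeric columns by default, so any column stored as text is left out. The sketch below shows this on a small hand-made frame (the `sample` frame and its values are illustrative, not the real data), along with `include="all"` as one way to bring text columns into the summary.

```python
import pandas as pd

# Hand-made sample with two numeric columns and one text column (illustrative).
sample = pd.DataFrame({
    "tenure": [1, 34, 2],
    "MonthlyCharges": [29.85, 56.95, 53.85],
    "TotalCharges": ["29.85", "1889.5", "108.15"],  # numbers stored as text
})

# Default describe(): only the numeric columns are summarised.
numeric_summary = sample.describe()
print(list(numeric_summary.columns))

# include="all" also reports count/unique/top/freq for the text column.
full_summary = sample.describe(include="all")
print(list(full_summary.columns))
```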


**Check datatypes in dataset**

In [10]:
data.info()

<class 'pandas.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   str    
 1   gender            7043 non-null   str    
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   str    
 4   Dependents        7043 non-null   str    
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   str    
 7   MultipleLines     7043 non-null   str    
 8   InternetService   7043 non-null   str    
 9   OnlineSecurity    7043 non-null   str    
 10  OnlineBackup      7043 non-null   str    
 11  DeviceProtection  7043 non-null   str    
 12  TechSupport       7043 non-null   str    
 13  StreamingTV       7043 non-null   str    
 14  StreamingMovies   7043 non-null   str    
 15  Contract          7043 non-null   str    
 16  PaperlessBilling  7043 non-null   str    
 17  Paymen

- The **TotalCharges** column is stored as str even though it holds numeric amounts, and it contains some missing values.
- Convert this column's datatype to float.

In [None]:
# Convert TotalCharges to a numeric type; entries that cannot be parsed become NaN
data["TotalCharges"] = pd.to_numeric(data["TotalCharges"], errors="coerce")
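
To see what `errors="coerce"` does, here is a minimal sketch on a hand-made Series. The values are made up, and the blank string is an assumption about what an unparseable entry might look like, not something confirmed from the real file.

```python
import pandas as pd

# Hand-made values; the blank string stands in for an unparseable entry.
raw = pd.Series(["29.85", "1889.5", " ", "108.15"], name="TotalCharges")

# errors="coerce" replaces anything that cannot be parsed with NaN
# instead of raising an error.
converted = pd.to_numeric(raw, errors="coerce")

print(converted.dtype)
# Count how many values became NaN during the conversion.
print(converted.isna().sum())
```

Running `data["TotalCharges"].isna().sum()` after the conversion would give the number of affected rows in the real dataset, which then need to be dropped or imputed during data cleaning.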