# Data Cleaning and Preparation 

Problem Statement: Analyzing Customer Churn in a Telecommunications Company 
Dataset: "Telecom_Customer_Churn.csv" 
Description: The dataset contains information about customers of a telecommunications 
company and whether they have churned (i.e., discontinued their services). The dataset 
includes various attributes of the customers, such as their demographics, usage patterns, and 
account information. The goal is to perform data cleaning and preparation to gain insights 
into the factors that contribute to customer churn. 
Tasks to Perform: 
1. Import the "Telecom_Customer_Churn.csv" dataset.
2. Explore the dataset to understand its structure and content.
3. Handle missing values in the dataset, deciding on an appropriate strategy.
4. Remove any duplicate records from the dataset.
5. Check for inconsistent data, such as inconsistent formatting or spelling variations, and standardize it.
6. Convert columns to the correct data types as needed.
7. Identify and handle outliers in the data.
8. Perform feature engineering, creating new features that may be relevant to predicting customer churn.
9. Normalize or scale the data if necessary.
10. Split the dataset into training and testing sets for further analysis.
11. Export the cleaned dataset for future analysis or modeling.

#### 1. Import the "Telecom_Customer_Churn.csv" dataset.

In [1]:
import pandas as pd

In [2]:
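# The file name here differs from the one in the problem statement; adjust the path to match your copy
df = pd.read_csv('Telco-Customer-Churn.csv')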

In [3]:
df.head()

,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes
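
The `astype(float)` attempt later in this notebook fails on a `' '` entry in `TotalCharges`, which suggests the file stores some missing values as blank strings. A minimal sketch, assuming a small in-memory sample in place of the real CSV (the rows and IDs below are made up), of how the `na_values` argument of `pd.read_csv` could flag such blanks as missing at import time:

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for the real CSV file.
sample_csv = io.StringIO(
    "customerID,tenure,MonthlyCharges,TotalCharges\n"
    "0001-AAAAA,1,29.85,29.85\n"
    "0002-BBBBB,0,52.55, \n"
    "0003-CCCCC,2,53.85,108.15\n"
)

# Treat a single-space field as missing so the column parses as float.
sample = pd.read_csv(sample_csv, na_values=[" "])

print(sample.dtypes)
print(sample.isnull().sum())
```

With this option the blank entry is counted by `isnull()` from the start, so the missing-value step can deal with it directly.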


#### 2. Explore the dataset to understand its structure and content.


In [4]:
df.shape

(7043, 21)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 
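
Beyond `shape` and `info()`, a few more calls can help with exploration. The sketch below uses a toy frame built from the first five rows shown by `df.head()`; the variable names `toy` and `churn_share` are illustrative only.

```python
import pandas as pd

# Toy frame with a few columns, values taken from the df.head() output.
toy = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 2],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30, 70.70],
    "Contract": ["Month-to-month", "One year", "Month-to-month",
                 "One year", "Month-to-month"],
    "Churn": ["No", "No", "Yes", "No", "Yes"],
})

# Summary statistics for the numeric columns.
print(toy.describe())

# Number of distinct values per column, useful for spotting categoricals.
print(toy.nunique())

# Class balance of the target as proportions.
churn_share = toy["Churn"].value_counts(normalize=True)
print(churn_share)
```

Checking the class balance early is useful because an imbalanced target affects how the data should later be split and evaluated.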


#### 3. Handle missing values in the dataset, deciding on an appropriate strategy.

In [6]:
df.isnull().sum()

customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64
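
`isnull()` reports zero missing values here, yet the `astype(float)` attempt later in this notebook fails on a `' '` entry in `TotalCharges`. Blank strings are valid values as far as pandas is concerned, so they need a separate check. A minimal sketch on a made-up sample (the frame and the `blank_counts` name are assumptions for illustration):

```python
import pandas as pd

# Hypothetical sample: TotalCharges is stored as text, one entry is blank.
sample = pd.DataFrame({
    "customerID": ["0001-AAAAA", "0002-BBBBB", "0003-CCCCC"],
    "TotalCharges": ["29.85", " ", "108.15"],
})

# isnull() sees no missing values because " " is a valid string.
print(sample.isnull().sum())

# Count entries that are empty or whitespace-only in each text column.
text_cols = ["customerID", "TotalCharges"]
blank_counts = {
    col: int(sample[col].str.strip().eq("").sum()) for col in text_cols
}
print(blank_counts)
```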

In [7]:
# Fill numeric gaps with the column means; numeric_only=True skips the text columns
df.fillna(df.mean(numeric_only=True), inplace=True)


In [11]:
# Strategy to handle missing values: median for numerical columns, mode for categorical ones
# Assign the result back instead of calling fillna(inplace=True) on a column selection
for column in df.select_dtypes(include=['float64', 'int64']).columns:
    df[column] = df[column].fillna(df[column].median())

for column in df.select_dtypes(include=['object']).columns:
    df[column] = df[column].fillna(df[column].mode()[0])

#### 4. Remove any duplicate records from the dataset.

In [12]:
df.duplicated().sum()

0

In [13]:
# drop_duplicates is a method: call it and keep the returned DataFrame
df = df.drop_duplicates()
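
`df.duplicated()` only flags rows that match in every column. When a key such as `customerID` is expected to be unique, it can be worth checking that column on its own. A minimal sketch on made-up data (the sample rows are hypothetical):

```python
import pandas as pd

# Hypothetical sample in which one customerID appears twice.
sample = pd.DataFrame({
    "customerID": ["0001-AAAAA", "0002-BBBBB", "0001-AAAAA"],
    "MonthlyCharges": [29.85, 56.95, 31.00],
})

# No fully identical rows, so the default check reports nothing.
full_row_dupes = int(sample.duplicated().sum())

# Restricting the check to the key column reveals the repeated customer.
key_dupes = int(sample.duplicated(subset=["customerID"]).sum())

# Keep the first record per customer; drop_duplicates returns a new frame.
deduped = sample.drop_duplicates(subset=["customerID"], keep="first")
print(full_row_dupes, key_dupes, len(deduped))
```

Which record to keep (`keep="first"` here) is a judgment call that depends on how the data was collected.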

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


#### 5. Check for Inconsistent Data

In [15]:
df.gender.unique()

array(['Female', 'Male'], dtype=object)

In [17]:
# Standardize inconsistent entries (e.g., converting to lowercase)
df['gender'] = df['gender'].str.lower().str.strip()

In [18]:
df.head()

,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes
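
The `gender` column was already clean, but the same idea extends to any text column. The sketch below uses a made-up sample with spelling and spacing variants; the `standardize` helper is a hypothetical name, not part of pandas.

```python
import pandas as pd

# Hypothetical sample with spelling and spacing variants.
sample = pd.DataFrame({
    "gender": ["Female", "male ", "MALE", " female"],
    "PaymentMethod": ["Mailed check", "mailed check", "Electronic check",
                      "Electronic  check"],
})

def standardize(series: pd.Series) -> pd.Series:
    """Trim, collapse repeated whitespace, and lowercase a text column."""
    return (
        series.str.strip()
        .str.replace(r"\s+", " ", regex=True)
        .str.lower()
    )

for col in ["gender", "PaymentMethod"]:
    sample[col] = standardize(sample[col])
    print(col, sorted(sample[col].unique()))
```

Printing the unique values before and after standardizing makes it easy to confirm that variants have collapsed into a single label.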


In [21]:
df['TotalCharges'].dtype()

TypeError: 'numpy.dtypes.ObjectDType' object is not callable

In [22]:
total_charges_dtype = df['TotalCharges'].dtype  # Use .dtype without parentheses
print("Data type of TotalCharges column:", total_charges_dtype)

Data type of TotalCharges column: object


In [24]:
df['TotalCharges']=df['TotalCharges'].astype(float)

ValueError: could not convert string to float: ' '

In [25]:
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 
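
`errors='coerce'` silently turns unparseable entries into `NaN`, so it is worth checking which rows were affected before moving on. A minimal sketch on a made-up sample (the rows are hypothetical; the blank entry mirrors the `' '` from the `ValueError`):

```python
import pandas as pd

# Hypothetical sample; one TotalCharges entry is a blank string.
sample = pd.DataFrame({
    "tenure": [1, 0, 2],
    "TotalCharges": ["29.85", " ", "108.15"],
})

converted = pd.to_numeric(sample["TotalCharges"], errors="coerce")

# Rows that were present as text but could not be parsed as numbers.
failed = converted.isna() & sample["TotalCharges"].notna()
print(sample.loc[failed])

sample["TotalCharges"] = converted
print(sample["TotalCharges"].isna().sum())
```

The new `NaN` values appear after the earlier missing-value step has run, so they need to be filled or dropped explicitly at this point.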


#### 6. Convert Columns to the Correct Data Types

In [29]:
# Convert relevant columns to appropriate data types
df['Churn'] = df['Churn'].astype(int)  # Example: converting churn to integer if it’s not already



ValueError: invalid literal for int() with base 10: 'No'

In [30]:
df['Churn'].unique()

array(['No', 'Yes'], dtype=object)

In [32]:
# Replace non-numeric values with appropriate numeric equivalents
df['Churn'] = df['Churn'].replace({'No': 0, 'Yes': 1})

# Alternatively, if you want to drop the rows with invalid entries
# df = df[df['SomeColumn'].isin(['Yes', 'No'])]  # Keep only valid entries

# Convert the column to integer
df['Churn'] = df['Churn'].astype(int)
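
An alternative to `replace` is `map`, which turns any label missing from the mapping into `NaN` and so makes unexpected values easy to detect. The sketch below is one possible approach on a made-up series with a stray spelling variant; the `mapping` dictionary and the error message are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample with a stray spelling variant ("yes ").
churn = pd.Series(["No", "Yes", "yes ", "No"], name="Churn")

mapping = {"no": 0, "yes": 1}

# Normalize case and whitespace first, then map; unknown labels become NaN.
encoded = churn.str.strip().str.lower().map(mapping)

# Fail loudly if anything is left unmapped instead of silently casting.
unmapped = churn[encoded.isna()]
if not unmapped.empty:
    raise ValueError(f"Unexpected Churn labels: {unmapped.unique().tolist()}")

encoded = encoded.astype(int)
print(encoded.tolist())
```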


In [33]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


#### 7. Identify and Handle Outliers

In [37]:
Q1 = df['MonthlyCharges'].quantile(0.25)
Q3 = df['MonthlyCharges'].quantile(0.75)
IQR = Q3 - Q1

# Define bounds
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

# Remove outliers; each comparison needs its own parentheses because & binds tighter than >= and <=
df = df[(df['MonthlyCharges'] >= lower) & (df['MonthlyCharges'] <= upper)]
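
The IQR rule can be wrapped in a small helper so it is reusable across columns. The sketch below also shows capping as an alternative to dropping rows. The `iqr_bounds` function and the sample values are hypothetical, and the 1.5 multiplier is the conventional default rather than a requirement.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical charges with one extreme value.
charges = pd.Series([20.0, 30.0, 40.0, 50.0, 60.0, 500.0])

lower, upper = iqr_bounds(charges)

# Option 1: drop rows outside the fences (between is inclusive by default).
kept = charges[charges.between(lower, upper)]

# Option 2: keep every row but cap values at the fences.
capped = charges.clip(lower=lower, upper=upper)

print(lower, upper, len(kept), capped.max())
```

Capping preserves the row count, which matters when the extreme values belong to customers whose other attributes are still informative.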

#### 8. Perform Feature Engineering

In [40]:
# Create new features that may help in predicting churn
# Note: this overwrites the existing TotalCharges column with an estimate (monthly charge x tenure)
df['TotalCharges'] = df['MonthlyCharges'] * df['tenure']

# The dataset has no 'age' column; SeniorCitizen already flags senior customers (0/1)
# df['is_senior'] = (df['age'] >= 60).astype(int)
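
Two further features that could be derived from columns already present are a tenure bucket and a count of subscribed add-on services. This is a sketch only: the bin edges, the labels, and the names `tenure_group` and `num_services` are assumptions, not part of the original dataset.

```python
import pandas as pd

# Hypothetical sample using column names from the dataset.
sample = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 72],
    "OnlineSecurity": ["No", "Yes", "Yes", "Yes", "No"],
    "TechSupport": ["No", "No", "No", "Yes", "No"],
    "StreamingTV": ["No", "No", "No", "No", "Yes"],
})

# Bucket tenure (months) into coarse groups; bin edges are an assumption.
sample["tenure_group"] = pd.cut(
    sample["tenure"],
    bins=[0, 12, 24, 48, 72],
    labels=["0-12", "13-24", "25-48", "49-72"],
    include_lowest=True,
)

# Count how many add-on services each customer subscribes to.
service_cols = ["OnlineSecurity", "TechSupport", "StreamingTV"]
sample["num_services"] = sample[service_cols].eq("Yes").sum(axis=1)

print(sample[["tenure", "tenure_group", "num_services"]])
```

Writing derived values to new columns, rather than overwriting existing ones, keeps the original data available for comparison.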

#### 9. Normalize or Scale the Data

In [44]:
from sklearn.preprocessing import StandardScaler

scaler=StandardScaler()
num_cols=['MonthlyCharges','TotalCharges','tenure']
df['num_cols']=scaler.fit_transform(df['num_cols'])

KeyError: 'num_cols'

In [43]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [48]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
num_cols = ['MonthlyCharges', 'TotalCharges', 'tenure']
df[num_cols] = scaler.fit_transform(df[num_cols])

In [49]:
df

,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,female,0,Yes,No,-1.277445,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,-1.160323,-0.993448,0
1,5575-GNVDE,male,0,No,No,0.066327,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,-0.259629,-0.151588,0
2,3668-QPYBK,male,0,No,No,-1.236724,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,-0.362660,-0.959071,1
3,7795-CFOCW,male,0,No,No,0.514251,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),-0.746535,-0.166072,0
4,9237-HQITU,female,0,No,No,-1.236724,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,0.197365,-0.944189,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,6840-RESVB,male,0,Yes,Yes,-0.340876,Yes,Yes,DSL,Yes,...,Yes,Yes,Yes,Yes,One year,Yes,Mailed check,0.665992,-0.107915,0
7039,2234-XADUH,female,0,Yes,Yes,1.613701,Yes,Yes,Fiber optic,No,...,Yes,No,Yes,Yes,One year,Yes,Credit card (automatic),1.277533,2.274525,0
7040,4801-JZAZL,female,0,Yes,Yes,-0.870241,No,No phone service,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Electronic check,-1.168632,-0.862849,0
7041,8361-LTMKD,male,1,Yes,No,-1.155283,Yes,Yes,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Mailed check,0.320338,-0.875214,1


In [50]:
df['TotalCharges']

0      -0.993448
1      -0.151588
2      -0.959071
3      -0.166072
4      -0.944189
          ...   
7038   -0.107915
7039    2.274525
7040   -0.862849
7041   -0.875214
7042    2.072500
Name: TotalCharges, Length: 7043, dtype: float64
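
The scaled values above are z-scores: each value minus the column mean, divided by the column standard deviation. `StandardScaler` uses the population standard deviation (`ddof=0`), which the following sketch reproduces by hand on the five `tenure` values shown by `df.head()`.

```python
import pandas as pd

# The five tenure values (months) from the df.head() output.
tenure = pd.Series([1, 34, 2, 45, 2], dtype=float)

# StandardScaler uses the population standard deviation (ddof=0).
z = (tenure - tenure.mean()) / tenure.std(ddof=0)

print(z.round(4).tolist())
print(round(z.mean(), 10), round(z.std(ddof=0), 10))
```

After scaling, the column has mean 0 and standard deviation 1, which is a quick sanity check to run on any standardized feature.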

#### 10. Split the Dataset

In [52]:
from sklearn.model_selection import train_test_split
X = df.drop(columns=['Churn'])
y = df['Churn']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [53]:
X_train.shape

(5634, 20)

In [54]:
X_test.shape

(1409, 20)

In [56]:
y_train.shape

(5634,)
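
In this notebook the scaler is fitted on the full dataset before the split, which lets test-set statistics influence the training features. One common alternative is to split first and fit the scaler on the training rows only. The sketch below uses made-up data, and `stratify=y` is an added assumption that keeps the churn rate equal in both splits.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 100 customers, 30% churners.
rng = np.random.default_rng(42)
X = pd.DataFrame({
    "tenure": rng.integers(0, 73, size=100).astype(float),
    "MonthlyCharges": rng.uniform(18.0, 120.0, size=100),
})
y = pd.Series([1] * 30 + [0] * 70, name="Churn")

# stratify=y keeps the churn rate the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training rows only, then reuse it on the test rows.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```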

#### 11. Export the Cleaned Dataset

In [57]:
# Export the cleaned dataset for future analysis or modeling
cleaned_file_path = 'Cleaned_Telecom_Customer_Churn.csv'
df.to_csv(cleaned_file_path, index=False)
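
A quick round trip can confirm that the export is readable and that `index=False` prevented an extra index column from being written. This sketch writes a small stand-in frame to a temporary directory; the `cleaned` frame and the check variables are illustrative names.

```python
import os
import tempfile
import pandas as pd

# Hypothetical cleaned frame standing in for the notebook's df.
cleaned = pd.DataFrame({
    "customerID": ["7590-VHVEG", "5575-GNVDE"],
    "tenure": [1, 34],
    "Churn": [0, 0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "Cleaned_Telecom_Customer_Churn.csv")

    # index=False avoids writing the row index as an extra column.
    cleaned.to_csv(path, index=False)

    # Read the file back and confirm nothing changed on the way out.
    reloaded = pd.read_csv(path)

same_shape = reloaded.shape == cleaned.shape
same_columns = list(reloaded.columns) == list(cleaned.columns)
print(same_shape, same_columns)
```

Without `index=False`, reading the file back would produce an additional `Unnamed: 0` column holding the old row index.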