# Data Cleaning and Preparation 
Problem Statement: Analyzing Customer Churn in a Telecommunications Company 
Dataset: "Telecom_Customer_Churn.csv" 
Description: The dataset contains information about customers of a telecommunications 
company and whether they have churned (i.e., discontinued their services). The dataset 
includes various attributes of the customers, such as their demographics, usage patterns, and 
account information. The goal is to perform data cleaning and preparation to gain insights 
into the factors that contribute to customer churn. 
Tasks to Perform: 
1. Import the "Telecom_Customer_Churn.csv" dataset.
2. Explore the dataset to understand its structure and content.
3. Handle missing values in the dataset, deciding on an appropriate strategy.
4. Remove any duplicate records from the dataset.
5. Check for inconsistent data, such as inconsistent formatting or spelling variations, and standardize it.
6. Convert columns to the correct data types as needed.
7. Identify and handle outliers in the data.
8. Perform feature engineering, creating new features that may be relevant to predicting customer churn.
9. Normalize or scale the data if necessary.
10. Split the dataset into training and testing sets for further analysis.
11. Export the cleaned dataset for future analysis or modeling.

- Import necessary libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from scipy import stats

- Task 1: Import the dataset

In [2]:
df = pd.read_csv('Telecom_Customer_Churn.csv')

- Task 2: Explore the dataset

In [3]:
print("First five rows of the dataset:")
print(df.head())
print("\nDataset information:")
df.info()  # info() prints its report directly and returns None, so it is not wrapped in print()
print("\nSummary statistics:")
print(df.describe())

First five rows of the dataset:
  Customer ID  Gender  Age Married  Number of Dependents          City  \
0  0002-ORFBO  Female   37     Yes                     0  Frazier Park   
1  0003-MKNFE    Male   46      No                     0      Glendale   
2  0004-TLHLJ    Male   50      No                     0    Costa Mesa   
3  0011-IGKFF    Male   78     Yes                     0      Martinez   
4  0013-EXCHZ  Female   75     Yes                     0     Camarillo   

   Zip Code   Latitude   Longitude  Number of Referrals  ...   Payment Method  \
0     93225  34.827662 -118.999073                    2  ...      Credit Card   
1     91206  34.162515 -118.203869                    0  ...      Credit Card   
2     92627  33.645672 -117.922613                    0  ...  Bank Withdrawal   
3     94553  38.014457 -122.115432                    1  ...  Bank Withdrawal   
4     93010  34.227846 -119.079903                    3  ...      Credit Card   

  Monthly Charge Total Charges  Tota

In [4]:
# Check column names
print("\nColumn names in the dataset:")
df.columns = df.columns.str.strip()  # Normalize column names by stripping whitespace
print(df.columns.tolist())


Column names in the dataset:
['Customer ID', 'Gender', 'Age', 'Married', 'Number of Dependents', 'City', 'Zip Code', 'Latitude', 'Longitude', 'Number of Referrals', 'Tenure in Months', 'Offer', 'Phone Service', 'Avg Monthly Long Distance Charges', 'Multiple Lines', 'Internet Service', 'Internet Type', 'Avg Monthly GB Download', 'Online Security', 'Online Backup', 'Device Protection Plan', 'Premium Tech Support', 'Streaming TV', 'Streaming Movies', 'Streaming Music', 'Unlimited Data', 'Contract', 'Paperless Billing', 'Payment Method', 'Monthly Charge', 'Total Charges', 'Total Refunds', 'Total Extra Data Charges', 'Total Long Distance Charges', 'Total Revenue', 'Customer Status', 'Churn Category', 'Churn Reason']


In [5]:
# Check for missing values
missing_values = df.isnull().sum()
print("\nMissing values in each column:")
print(missing_values)


Missing values in each column:
Customer ID                             0
Gender                                  0
Age                                     0
Married                                 0
Number of Dependents                    0
City                                    0
Zip Code                                0
Latitude                                0
Longitude                               0
Number of Referrals                     0
Tenure in Months                        0
Offer                                3877
Phone Service                           0
Avg Monthly Long Distance Charges     682
Multiple Lines                        682
Internet Service                        0
Internet Type                        1526
Avg Monthly GB Download              1526
Online Security                      1526
Online Backup                        1526
Device Protection Plan               1526
Premium Tech Support                 1526
Streaming TV                         1526
St

- Task 3: Handle missing values

In [7]:
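# Drop rows with a missing target value, then fill the remaining numeric gaps with the column mean.
# Caution: if 'Churn Category' is populated only for customers who churned, this dropna keeps
# only churned rows; 'Customer Status' may be the more suitable target in that case.
if 'Churn Category' in df.columns:  # Check that the target column exists
    df.dropna(subset=['Churn Category'], inplace=True)  # Drop rows where 'Churn Category' is missing
    df.fillna(df.mean(numeric_only=True), inplace=True)  # Numeric columns only; categorical gaps are left as-is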
else:
    print("Column 'Churn Category' not found in the dataset. Please check the column names.")
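The cell above fills only numeric columns, so categorical gaps remain. A minimal sketch of a per-dtype strategy follows, assuming a median fill for numeric columns and a placeholder label for text columns. The helper name `impute_by_dtype`, the `"Unknown"` label, and the toy data are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from pandas.api import types as ptypes


def impute_by_dtype(frame: pd.DataFrame, placeholder: str = "Unknown") -> pd.DataFrame:
    """Hypothetical helper: median for numeric NaNs, a placeholder for text NaNs."""
    out = frame.copy()
    for col in out.columns:
        if ptypes.is_bool_dtype(out[col]):
            continue  # boolean flags are left untouched
        if ptypes.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        elif ptypes.is_object_dtype(out[col]) or ptypes.is_string_dtype(out[col]):
            out[col] = out[col].fillna(placeholder)
    return out


# Toy data standing in for the telecom dataset
toy = pd.DataFrame({
    "Monthly Charge": [20.0, np.nan, 40.0],
    "Internet Type": ["Cable", None, "Fiber Optic"],
})
filled = impute_by_dtype(toy)
print(filled)
```

The median is less sensitive to extreme charges than the mean. A separate "Unknown" label keeps "no value recorded" distinguishable from real categories. Whether that is appropriate depends on why the values are missing.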

- Task 4: Remove duplicate records

In [8]:
print("\nNumber of duplicate records before removal:", df.duplicated().sum())
df.drop_duplicates(inplace=True)
print("Number of duplicate records after removal:", df.duplicated().sum())


Number of duplicate records before removal: 0
Number of duplicate records after removal: 0


- Task 5: Check for inconsistent data

In [9]:
# Example: trim whitespace and lower-case categorical labels so spelling variants collapse into one value
df['Gender'] = df['Gender'].str.strip().str.lower()
df['Internet Service'] = df['Internet Service'].str.strip().str.lower()
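The same cleanup can be applied to several columns at once. The sketch below is one possible way to do it, assuming the columns hold plain strings; `standardize_text` is a hypothetical helper and the toy values are invented for illustration.

```python
import pandas as pd


def standardize_text(frame: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Hypothetical helper: strip, collapse repeated whitespace, and lower-case text columns."""
    out = frame.copy()
    for col in columns:
        out[col] = (
            out[col]
            .str.strip()                            # remove leading/trailing whitespace
            .str.replace(r"\s+", " ", regex=True)   # collapse internal runs of whitespace
            .str.lower()                            # unify letter case
        )
    return out


# Toy data with the kind of variants the task describes
toy = pd.DataFrame({
    "Gender": [" Female", "female ", "MALE"],
    "Internet Service": ["Yes", " yes", "No"],
})
clean = standardize_text(toy, ["Gender", "Internet Service"])
print(clean["Gender"].value_counts())
```

Comparing `value_counts()` before and after is a quick way to confirm that variants were merged rather than lost.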

- Task 6: Convert columns to correct data types

In [10]:
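if 'Churn Category' in df.columns:
    # 'Churn Category' holds text labels, so store it as a categorical dtype.
    # (Mapping it with {'Yes': 1, 'No': 0} would turn every label that is not 'Yes' or 'No' into NaN.)
    df['Churn Category'] = df['Churn Category'].astype('category')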
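Other common conversions can be sketched the same way. The example below uses invented toy values and assumes three situations that may or may not occur in the real file: a numeric column read as text, an identifier-like code stored as an integer, and a low-cardinality text column.

```python
import pandas as pd

# Toy data; the values are made up for illustration
toy = pd.DataFrame({
    "Total Charges": ["593.3", " 542.4", "n/a"],
    "Zip Code": [93225, 91206, 92627],
    "Contract": ["One Year", "Month-to-Month", "One Year"],
})

# Text that should be numeric: unparseable entries become NaN instead of raising
toy["Total Charges"] = pd.to_numeric(toy["Total Charges"].str.strip(), errors="coerce")

# Codes are labels, not quantities, so keep them as strings
toy["Zip Code"] = toy["Zip Code"].astype(str)

# Repeated text labels fit the categorical dtype
toy["Contract"] = toy["Contract"].astype("category")

print(toy.dtypes)
```

Treating ZIP codes as strings also keeps them out of later numeric steps such as outlier detection and scaling, where their magnitude carries no meaning.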

- Task 7: Identify and handle outliers

In [11]:
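# Simple outlier detection using Z-scores: flag a row if any numeric column
# lies more than 3 standard deviations from that column's mean
z_scores = np.abs(stats.zscore(df.select_dtypes(include=[np.number])))
outliers = (z_scores > 3).any(axis=1)
print("\nNumber of outliers detected:", outliers.sum())
df = df[~outliers].copy()  # Remove outlier rows; copy() avoids SettingWithCopyWarning in later assignments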


Number of outliers detected: 329
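The Z-score rule works best when a column is roughly symmetric. For skewed columns, an interquartile-range (IQR) rule is a common alternative. The sketch below is illustrative only: `iqr_outlier_mask` is a hypothetical helper, the 1.5 multiplier is the conventional default, and the series is toy data.

```python
import pandas as pd


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Hypothetical helper: True where a value falls outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Toy series with one obviously extreme value
toy = pd.Series([10, 12, 11, 13, 12, 11, 95], name="Monthly Charge")
mask = iqr_outlier_mask(toy)
print(toy[mask])
```

Flagging first and inspecting the flagged rows before dropping them helps avoid discarding legitimate high-usage customers.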


- Task 8: Feature engineering

In [12]:
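# Example: estimate lifetime charges as monthly charge x tenure.
# This is a derived feature, separate from the dataset's existing 'Total Charges' column.
df['TotalCharges'] = df['Monthly Charge'] * df['Tenure in Months']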
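Ratio and threshold features are other options. The sketch below uses the `Total Revenue` and `Tenure in Months` column names from the dataset on toy values. The new feature names and the six-month cutoff are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy data; the last row has zero tenure to exercise the division guard
toy = pd.DataFrame({
    "Total Revenue": [600.0, 90.0, 0.0],
    "Tenure in Months": [12, 3, 0],
})

# Replace zero tenure with NaN so the division cannot produce inf
tenure = toy["Tenure in Months"].replace(0, np.nan)
toy["Avg Revenue per Month"] = (toy["Total Revenue"] / tenure).fillna(0.0)

# Hypothetical flag: customers within their first six months
toy["Is New Customer"] = toy["Tenure in Months"] <= 6

print(toy)
```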

- Task 9: Normalize or scale the data if necessary

In [13]:
from sklearn.preprocessing import StandardScaler

# Check for missing values before scaling
print("\nMissing values before scaling:")
print(df.isnull().sum())

# Fill any remaining numeric NaN values with the column mean
# (categorical columns are left as-is)
df.fillna(df.mean(numeric_only=True), inplace=True)

# Report the missing values that remain after filling
print("\nMissing values after filling:")
print(df.isnull().sum())

# Select numeric columns for scaling
numeric_cols = df.select_dtypes(include=[np.float64, np.int64]).columns
scaler = StandardScaler()

# Scale in place so the row index and column order are preserved.
# (Concatenating a freshly indexed frame would misalign rows, because
# earlier steps removed rows and left gaps in the index.)
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])


Missing values before scaling:
Customer ID                             0
Gender                                  0
Age                                     0
Married                                 0
Number of Dependents                    0
City                                    0
Zip Code                                0
Latitude                                0
Longitude                               0
Number of Referrals                     0
Tenure in Months                        0
Offer                                 857
Phone Service                           0
Avg Monthly Long Distance Charges       0
Multiple Lines                        142
Internet Service                        0
Internet Type                          96
Avg Monthly GB Download                 0
Online Security                        96
Online Backup                          96
Device Protection Plan                 96
Premium Tech Support                   96
Streaming TV                           96
St



- Task 10: Split the dataset into training and testing sets

In [14]:
X = df.drop(columns=['Churn Category'])  # Features: every column except the target
y = df['Churn Category']  # Target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
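Fitting the scaler on the full dataset before splitting lets test-set statistics influence the training features. The sketch below shows one alternative: split first (stratified on the target), then fit the scaler on the training rows only. It uses synthetic toy data, and the binary `Churned` column is an assumed target used purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic toy data: 75 retained and 25 churned customers
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "Monthly Charge": rng.uniform(20, 120, size=100),
    "Tenure in Months": rng.integers(1, 72, size=100),
    "Churned": np.repeat([0, 1], [75, 25]),
})

X = toy.drop(columns=["Churned"])
y = toy["Churned"]

# stratify=y keeps the churn ratio the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit on the training rows only, then apply the same transform to the test rows
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(
    scaler.fit_transform(X_train), columns=X.columns, index=X_train.index
)
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test), columns=X.columns, index=X_test.index
)

print(y_train.mean(), y_test.mean())
```

Passing `index=` when rebuilding the frames keeps each scaled row aligned with its original label.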

- Task 11: Export the cleaned dataset for future analysis or modeling

In [16]:
df.to_csv('Cleaned_Telecom_Customer_Churn.csv', index=False)

print("\nData cleaning and preparation completed. Cleaned dataset saved as 'Cleaned_Telecom_Customer_Churn.csv'.")


Data cleaning and preparation completed. Cleaned dataset saved as 'Cleaned_Telecom_Customer_Churn.csv'.
