In [175]:
import pandas as pd
import numpy as np

df = pd.read_csv("../data/raw_churn.csv")
df.shape

(7043, 21)

`TotalCharges` is stored as `object` because some rows contain blank strings, so it needs to be converted to a numeric dtype.

In [176]:
df["TotalCharges"].head()

0      29.85
1     1889.5
2     108.15
3    1840.75
4     151.65
Name: TotalCharges, dtype: object

In [177]:
df["TotalCharges"].value_counts().head()

TotalCharges
         11
20.2     11
19.75     9
20.05     8
19.9      8
Name: count, dtype: int64

In [178]:
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
df["TotalCharges"].isna().sum()

np.int64(11)
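
As a small illustration of what `errors="coerce"` does here: any entry that cannot be parsed as a number becomes `NaN` instead of raising an error. The sketch below uses a made-up Series (`raw` is a hypothetical name, not the churn data).

```python
import pandas as pd

# Hypothetical sample: two valid numbers, one blank string, one empty string.
raw = pd.Series(["29.85", " ", "108.15", ""])

# errors="coerce" turns unparseable entries into NaN rather than raising.
converted = pd.to_numeric(raw, errors="coerce")

# Count how many entries could not be parsed.
n_missing = int(converted.isna().sum())
print(converted.dtype, n_missing)
```

Counting the `NaN` values right after the conversion, as the cell above does, shows how many entries were silently coerced.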

In [179]:
df.loc[df["TotalCharges"].isna(), ["tenure", "MonthlyCharges", "TotalCharges"]]

      tenure  MonthlyCharges  TotalCharges
488        0           52.55           NaN
753        0           20.25           NaN
936        0           80.85           NaN
1082       0           25.75           NaN
1340       0           56.05           NaN
3331       0           19.85           NaN
3826       0           25.35           NaN
4380       0           20.00           NaN
5218       0           19.70           NaN
6670       0           73.35           NaN


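There are 11 missing values, about 0.16% of the 7,043 rows. Every row shown has a `tenure` of 0, which suggests these customers have not been billed yet. Because the share is so small, the rows are dropped.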
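
One way to check both the share of missing values and the pattern behind them is sketched below on a toy frame (`toy` and its values are hypothetical). Filling with 0 is shown only as a possible alternative under the assumption that zero-tenure customers have no charges yet; it is not the approach taken in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the churn data.
toy = pd.DataFrame({
    "tenure": [0, 5, 12, 0],
    "MonthlyCharges": [52.55, 20.0, 80.0, 19.7],
    "TotalCharges": [np.nan, 100.0, 960.0, np.nan],
})

missing = toy["TotalCharges"].isna()

# Share of rows with a missing TotalCharges value.
missing_share = missing.mean()

# Do all missing rows belong to customers with zero tenure?
all_zero_tenure = bool((toy.loc[missing, "tenure"] == 0).all())

# Option A (used in this notebook): drop the affected rows.
dropped = toy.dropna(subset=["TotalCharges"])

# Option B (an assumption, not the notebook's approach): treat unbilled customers as 0.
filled = toy.assign(TotalCharges=toy["TotalCharges"].fillna(0.0))
```

Dropping keeps the analysis simple; filling would keep the zero-tenure customers in the data, which may matter if new customers are of interest.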

In [180]:
df = df.dropna(subset=["TotalCharges"])

In [181]:
df.shape

(7032, 21)

Encode the target variable `Churn` as binary (1 = Yes, 0 = No):

In [182]:
df["Churn"] = df["Churn"].map({"Yes":1, "No":0})
df["Churn"].value_counts()

Churn
0    5163
1    1869
Name: count, dtype: int64
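
The two counts sum to 7032, so every label matched the mapping. In general, `Series.map` with a dict returns `NaN` for any value that is not a key, so it can be worth counting unmapped values before relying on the result. A minimal sketch with made-up labels (`labels` is hypothetical):

```python
import pandas as pd

# Hypothetical labels, including one with stray whitespace that the mapping misses.
labels = pd.Series(["Yes", "No", "No ", "Yes"])

encoded = labels.map({"Yes": 1, "No": 0})

# Values absent from the mapping become NaN, so count them explicitly.
n_unmapped = int(encoded.isna().sum())

# Stripping whitespace first lets every label match.
encoded_clean = labels.str.strip().map({"Yes": 1, "No": 0})
```

The same behavior means that re-running the mapping cell on an already encoded column would turn every value into `NaN`, since `1` and `0` are not keys of the dict.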

Strip leading and trailing whitespace from the categorical (string) columns:

In [183]:
cat_cols = df.select_dtypes(include="object").columns

for col in cat_cols:
    df[col] = df[col].str.strip()

Drop unhelpful identifier `customerID`:

In [184]:
df = df.drop(columns=["customerID"])

In [185]:
df["tenure"].min(), df["tenure"].max()

(np.int64(1), np.int64(72))

In [186]:
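# Bins are right-closed: (0, 12], (12, 24], (24, 48], (48, 72].
# tenure ranges from 1 to 72 here, so every row falls into a bin.
df["tenure_group"] = pd.cut(
    df["tenure"],
    bins=[0, 12, 24, 48, 72],
    labels=["0–12", "12–24", "24–48", "48–72"],
)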
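
To make the bin edges concrete: `pd.cut` uses right-closed intervals by default, so a tenure of 12 lands in the first group and a tenure of 0 would fall outside all bins. The sketch below uses a made-up `tenure` Series to show this, along with `include_lowest=True` as one way to capture 0 if such rows were kept.

```python
import pandas as pd

# Hypothetical tenure values chosen to sit on the bin edges.
tenure = pd.Series([0, 1, 12, 13, 72])
bins = [0, 12, 24, 48, 72]
labels = ["0–12", "12–24", "24–48", "48–72"]

# Default: intervals are (0, 12], (12, 24], (24, 48], (48, 72].
groups = pd.cut(tenure, bins=bins, labels=labels)

# A tenure of 0 is outside (0, 12] and therefore becomes NaN.
zero_is_missing = bool(groups.isna().iloc[0])

# include_lowest=True closes the first interval on the left, so 0 is kept.
groups_with_zero = pd.cut(tenure, bins=bins, labels=labels, include_lowest=True)
```

In this notebook the minimum tenure is 1 after the zero-tenure rows were dropped, so the default settings assign every row to a group.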

In [187]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7032 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   gender            7032 non-null   object  
 1   SeniorCitizen     7032 non-null   int64   
 2   Partner           7032 non-null   object  
 3   Dependents        7032 non-null   object  
 4   tenure            7032 non-null   int64   
 5   PhoneService      7032 non-null   object  
 6   MultipleLines     7032 non-null   object  
 7   InternetService   7032 non-null   object  
 8   OnlineSecurity    7032 non-null   object  
 9   OnlineBackup      7032 non-null   object  
 10  DeviceProtection  7032 non-null   object  
 11  TechSupport       7032 non-null   object  
 12  StreamingTV       7032 non-null   object  
 13  StreamingMovies   7032 non-null   object  
 14  Contract          7032 non-null   object  
 15  PaperlessBilling  7032 non-null   object  
 16  PaymentMethod     7032 non-nu

In [188]:
df.isna().sum()

gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
tenure_group        0
dtype: int64

In [189]:
df.to_csv("../data/clean_churn.csv", index=False)
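
One thing to keep in mind when reloading the cleaned file: CSV does not store dtypes, so `tenure_group` comes back as plain strings rather than an ordered categorical. A minimal sketch of the round trip, using an in-memory buffer and a made-up two-row frame instead of the real file:

```python
import io

import pandas as pd

labels = ["0–12", "12–24", "24–48", "48–72"]

# Hypothetical two-row frame with a categorical tenure_group column.
frame = pd.DataFrame({"tenure": [3, 30]})
frame["tenure_group"] = pd.cut(
    frame["tenure"], bins=[0, 12, 24, 48, 72], labels=labels
)

# Write to and read from an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# The categorical dtype is lost in the round trip.
dtype_after_reload = str(reloaded["tenure_group"].dtype)

# Restore the ordered categories explicitly.
reloaded["tenure_group"] = pd.Categorical(
    reloaded["tenure_group"], categories=labels, ordered=True
)
```

Restoring the categories keeps the groups in their natural order in later plots and group-by tables.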

### Data Cleaning Summary
- Converted `TotalCharges` to numeric and dropped the 11 rows where it was missing
- Removed the `customerID` identifier
- Encoded `Churn` as a binary variable (1 = Yes, 0 = No)
- Stripped whitespace from categorical string columns
- Created tenure groups for segmentation analysis