In [10]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [11]:
df=pd.read_csv('E:/Customer-churn-prediction/data/WA_Fn-UseC_-Telco-Customer-Churn.csv')

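`TotalCharges` is read as text because some entries are blank. The blanks belong to customers with zero tenure and become missing values once the column is converted to numeric.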

In [12]:
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
# Assign the result back; fillna(..., inplace=True) on a column selection is deprecated.
df['TotalCharges'] = df['TotalCharges'].fillna(df['TotalCharges'].median())
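
The conversion and imputation step can be sketched on a toy frame. The values below are invented for illustration and are not taken from the Telco file. The sketch shows that a blank string becomes `NaN` under `errors='coerce'`, and that assigning the result of `fillna` back to the column fills it without the chained-assignment warning.

```python
import pandas as pd

# Toy data (hypothetical values): a blank TotalCharges for a zero-tenure customer.
toy = pd.DataFrame({
    "tenure": [0, 12, 24],
    "TotalCharges": [" ", "240.5", "1020.0"],
})

# Blank strings cannot be parsed, so errors="coerce" turns them into NaN.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
n_missing = int(toy["TotalCharges"].isna().sum())

# Assign the filled column back instead of using inplace=True on a selection.
toy["TotalCharges"] = toy["TotalCharges"].fillna(toy["TotalCharges"].median())
```

Counting the missing values before filling them is a cheap way to confirm how many rows the imputation actually touches.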


Encode the target variable

In [13]:
df['Churn']=df['Churn'].apply(lambda x: 1 if x=='Yes' else 0)

Drop columns that are not useful

In [14]:
df=df.drop(columns=['customerID'])

Identify numerical and categorical columns

In [15]:
cat_col=df.select_dtypes(include=['object']).columns.tolist()
num_col=df.select_dtypes(include=['int64','float64']).columns.tolist()

One-Hot Encode categorical features 

In [16]:
df_encoded=pd.get_dummies(df,columns=cat_col,drop_first=True)
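
As a small illustration of what `drop_first=True` does, the sketch below encodes a toy frame. The rows are hypothetical; only the column names mirror the Telco data. One dummy column per category is created, except for the first category in sorted order, which is dropped and acts as the implicit baseline.

```python
import pandas as pd

# Toy frame with hypothetical values; column names mirror the Telco data.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "MonthlyCharges": [70.0, 55.5, 20.0, 89.9],
})

# Three contract categories produce two dummy columns; "Month-to-month" is dropped.
encoded = pd.get_dummies(toy, columns=["Contract"], drop_first=True)
encoded_columns = encoded.columns.tolist()
```

A row with zeros in both `Contract_One year` and `Contract_Two year` therefore represents a month-to-month contract.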

Train-Test Split

In [17]:
X=df_encoded.drop('Churn',axis=1)
y=df_encoded['Churn']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
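
The effect of `stratify=y` can be checked on synthetic data. This is a minimal sketch: the 30% positive rate is an arbitrary assumption for the example, not the churn rate of the Telco data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 30 positives, 70 negatives (hypothetical).
y_toy = np.array([1] * 30 + [0] * 70)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratification, both splits keep the original positive rate.
train_rate = y_tr.mean()
test_rate = y_te.mean()
```

Without `stratify`, the positive rate in a small test split can drift away from the rate in the full data, which makes evaluation metrics harder to compare.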

In [18]:
feature_names=X.columns

Feature scaling

In [19]:
scaler=StandardScaler()
X_train_scaled=scaler.fit_transform(X_train)
X_test_scaled=scaler.transform(X_test)
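
The fit-on-train, transform-on-test pattern can be illustrated with random data. This is a sketch with made-up arrays, not the notebook's features. The scaler learns the mean and standard deviation from the training array only and reuses them for the test array.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): two features drawn from the same distribution.
rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(200, 2))
test = rng.normal(loc=50.0, scale=10.0, size=(50, 2))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # reuses the train statistics

# The same transform written out by hand, using train statistics only.
manual_test_scaled = (test - train.mean(axis=0)) / train.std(axis=0)
```

The scaled training columns have mean 0 and unit variance by construction. The scaled test columns are only approximately standardized, which is expected.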

Saving the scaler, feature names, and data

In [20]:
import joblib
joblib.dump(scaler, 'E:/Customer-churn-prediction/models/scaler.pkl')
joblib.dump(feature_names, 'E:/Customer-churn-prediction/models/feature_names.pkl')

['E:/Customer-churn-prediction/models/feature_names.pkl']
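
One possible way to use the saved artifacts later is sketched below. It assumes new data has already been one-hot encoded and may be missing some dummy columns. The toy frame, the temporary directory, and the `new_row` example are all hypothetical stand-ins for the notebook's real files.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical training columns standing in for the saved feature_names.
train = pd.DataFrame({
    "tenure": [1.0, 12.0, 24.0],
    "Contract_One year": [0.0, 1.0, 0.0],
    "Contract_Two year": [0.0, 0.0, 1.0],
})

with tempfile.TemporaryDirectory() as tmp:
    # Save and reload, mirroring the joblib.dump calls above.
    joblib.dump(StandardScaler().fit(train), Path(tmp) / "scaler.pkl")
    joblib.dump(train.columns, Path(tmp) / "feature_names.pkl")
    scaler = joblib.load(Path(tmp) / "scaler.pkl")
    feature_names = joblib.load(Path(tmp) / "feature_names.pkl")

# A new encoded row may lack dummy columns or have a different column order;
# reindex restores the training order and fills missing columns with 0.
new_row = pd.DataFrame({"Contract_One year": [1.0], "tenure": [6.0]})
aligned = new_row.reindex(columns=feature_names, fill_value=0)
new_scaled = scaler.transform(aligned)
```

Saving the column order alongside the scaler matters because the scaler applies its statistics by position, so a reordered frame would be scaled with the wrong means.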

In [21]:
X_train_scaled.shape,X_test_scaled.shape

((5634, 30), (1409, 30))

In [22]:
np.save("E:/Customer-churn-prediction/data/X_train_scaled.npy", X_train_scaled)
np.save("E:/Customer-churn-prediction/data/X_test_scaled.npy", X_test_scaled)
y_train.to_csv("E:/Customer-churn-prediction/data/y_train.csv", index=False)
y_test.to_csv("E:/Customer-churn-prediction/data/y_test.csv", index=False)
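
Reading the files back in a later notebook might look like the sketch below. It uses small placeholder arrays and a temporary directory rather than the real paths, so the names and values are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Placeholder data (hypothetical) standing in for the scaled features and target.
X_demo = np.arange(6, dtype=float).reshape(3, 2)
y_demo = pd.Series([0, 1, 0], name="Churn")

with tempfile.TemporaryDirectory() as tmp:
    np.save(Path(tmp) / "X_train_scaled.npy", X_demo)
    y_demo.to_csv(Path(tmp) / "y_train.csv", index=False)

    X_loaded = np.load(Path(tmp) / "X_train_scaled.npy")
    # to_csv wrote a header row ("Churn"), so read_csv returns a one-column
    # DataFrame; select the column to get a 1-D target array.
    y_loaded = pd.read_csv(Path(tmp) / "y_train.csv")["Churn"].to_numpy()
```

The `.npy` files hold only the values, so the column names have to come from the separately saved `feature_names.pkl`.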

## Feature Engineering Summary

In this notebook, raw customer data was transformed into a clean, model-ready format.

Key steps performed:
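- Missing `TotalCharges` values (customers with zero tenure) were imputed with the column median.
- Categorical variables were one-hot encoded with `drop_first=True`, and the `Churn` target was mapped to 1 (Yes) and 0 (No).
- Features were standardized with `StandardScaler`, giving 30 scaled features for 5,634 training rows and 1,409 test rows.
- The scaler was fit on the training split only, so test-set statistics do not leak into the scaling step. The imputation median was computed before the split, so it includes test rows.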

This dataset will be used in the next phase: model training and evaluation.
