# Task 2: End-to-End ML Pipeline with Scikit-learn Pipeline API


## Objective
Build a reusable and production-ready machine learning pipeline for predicting customer churn using the Telco Churn Dataset.


## Dataset
**Telco Churn Dataset**  
It contains customer details such as services used, account information, and churn status (Yes/No).



## Step 1: Load the Dataset
We begin by loading the Telco Customer Churn dataset using pandas. This helps us view the data structure and get an idea of the features and target variable.

In [3]:
import pandas as pd

# Load dataset
df = pd.read_csv(r"c:\Users\CS\Telco-Customer-Churn.csv.csv")

# Display first few rows
df.head()


| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |
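
Beyond `df.head()`, a few quick checks help confirm the shape, column types, and class balance before preprocessing. The sketch below runs on a tiny stand-in frame with hypothetical values (the name `sample` and its contents are assumptions, not the real dataset); the same calls should apply to `df`.

```python
import pandas as pd

# Tiny stand-in frame (hypothetical values) mimicking a few Telco columns
sample = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 2],
    "Contract": ["Month-to-month", "One year", "Month-to-month",
                 "One year", "Month-to-month"],
    "TotalCharges": ["29.85", "1889.5", "108.15", " ", "151.65"],  # text, one blank
    "Churn": ["No", "No", "Yes", "No", "Yes"],
})

n_rows, n_cols = sample.shape
dtypes = sample.dtypes  # reveals columns stored as text that should be numeric
churn_share = sample["Churn"].value_counts(normalize=True)  # target class balance

print(n_rows, n_cols)
print(dtypes)
print(churn_share)
```

A text dtype on a column that should be numeric, such as `TotalCharges` here, is a hint that some entries cannot be parsed as numbers.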


## Step 2: Data Preprocessing

In this step, we will:

- Separate features and target
- Handle missing values
- Identify numerical and categorical columns



In [4]:

# Drop customerID column if it's not useful for prediction
df = df.drop('customerID', axis=1)

# Convert 'Churn' column to numeric (Yes → 1, No → 0)
df['Churn'] = df['Churn'].map({'Yes': 1, 'No': 0})

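# Convert 'TotalCharges' to numeric; blank or whitespace-only entries become NaN
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

# Fill the resulting NaNs with the column median.
# Note: this median is computed over all rows, before the train/test split;
# the numeric imputer in Step 3 would also fill these values using training data only.
df['TotalCharges'] = df['TotalCharges'].fillna(df['TotalCharges'].median())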

# Separate features and target
X = df.drop('Churn', axis=1)
y = df['Churn']

# Split into training and testing datasets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Identify numeric and categorical columns by dtype
# Note: SeniorCitizen is a 0/1 flag stored as an integer, so it is treated as numeric
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X.select_dtypes(include=['object']).columns
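
To make the `errors='coerce'` step concrete, here is a minimal sketch on a hypothetical four-value series (the name `raw` and its values are assumptions for illustration). It shows a blank entry turning into `NaN` and then being filled with the median of the valid values.

```python
import pandas as pd

# Hypothetical values; " " stands for a blank entry in the CSV
raw = pd.Series(["29.85", " ", "108.15", "1889.5"])

numeric = pd.to_numeric(raw, errors="coerce")  # unparseable entries become NaN
n_missing = int(numeric.isna().sum())
filled = numeric.fillna(numeric.median())      # median ignores NaN by default

print(n_missing)        # 1
print(filled.tolist())  # [29.85, 108.15, 108.15, 1889.5]
```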


## Step 3: Create Preprocessing Pipeline

In this step, we build two separate preprocessing pipelines:
- **Numerical Pipeline**: Fills missing values with the median and scales features using `StandardScaler`.
- **Categorical Pipeline**: Fills missing values with the most frequent value and applies `OneHotEncoder`.

We combine both pipelines using `ColumnTransformer` to automatically apply the correct transformation to each type of feature.


In [5]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

# Numeric pipeline: handle missing values + scale
numeric_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

# Categorical pipeline: handle missing + one-hot encode
categorical_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])

# Combine into a ColumnTransformer
preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_pipeline, numeric_features),
    ("cat", categorical_pipeline, categorical_features)
])


## Step 4: Build and Train the Model Pipeline

In this step, we build a complete end-to-end pipeline by combining:
- The preprocessing steps from Step 3
- A machine learning model (`RandomForestClassifier`)

We use `Pipeline` to ensure that all preprocessing and model training steps happen together automatically.

### What This Does:
- Applies preprocessing (imputation, encoding, scaling)
- Trains the model in a single command: `pipeline.fit(X_train, y_train)`
- Keeps preprocessing and model in one object, so the same fitted pipeline can be cross-validated, tuned, saved, and applied to raw customer records


In [6]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

# Combine preprocessor and model in a single pipeline
model_pipeline = Pipeline(steps=[
    ('preprocessing', preprocessor),
    ('model', RandomForestClassifier(random_state=42))
])

# Train the pipeline on the training data
model_pipeline.fit(X_train, y_train)


Fitted pipeline (parameters as displayed):

- `preprocessing`: `ColumnTransformer` (`remainder='drop'`, `sparse_threshold=0.3`)
  - `num`: `SimpleImputer(strategy='median')`, then `StandardScaler(with_mean=True, with_std=True)`
  - `cat`: `SimpleImputer(strategy='most_frequent')`, then `OneHotEncoder(handle_unknown='ignore', sparse_output=True)`
- `model`: `RandomForestClassifier(n_estimators=100, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, max_features='sqrt', bootstrap=True)`

## Step 5: Evaluate the Model

We evaluate the model's performance using:

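- **Accuracy Score**: The share of all predictions that are correct. On imbalanced data it can look good even when the minority class is predicted poorly.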
- **Classification Report**: Shows precision, recall, and F1-score for each class.
- **Confusion Matrix**: Shows how many predictions were right or wrong, and which types of errors were made.

### Why This Matters:
These metrics help us understand how well our model performs in predicting customer churn vs non-churn.


In [7]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Predict on test set
y_pred = model_pipeline.predict(X_test)

# Evaluate the performance
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))


Accuracy: 0.7970191625266146

Classification Report:
               precision    recall  f1-score   support

           0       0.83      0.91      0.87      1036
           1       0.66      0.47      0.55       373

    accuracy                           0.80      1409
   macro avg       0.75      0.69      0.71      1409
weighted avg       0.78      0.80      0.79      1409


Confusion Matrix:
 [[946  90]
 [196 177]]
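
The churn-class figures in the report can be recomputed directly from the confusion matrix, which makes it easier to see where they come from. The sketch below uses only the four counts printed above; the majority-class baseline is an added reference point, not part of the original output.

```python
import numpy as np

# Confusion matrix reported above: rows = actual class, columns = predicted class
cm = np.array([[946, 90],
               [196, 177]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision_churn = tp / (tp + fp)   # of predicted churners, how many really churned
recall_churn = tp / (tp + fn)      # of actual churners, how many were caught
f1_churn = 2 * precision_churn * recall_churn / (precision_churn + recall_churn)
majority_baseline = (tn + fp) / cm.sum()  # accuracy of always predicting "no churn"

print(f"accuracy={accuracy:.3f} precision={precision_churn:.2f} "
      f"recall={recall_churn:.2f} f1={f1_churn:.2f} baseline={majority_baseline:.3f}")
# accuracy=0.797 precision=0.66 recall=0.47 f1=0.55 baseline=0.735
```

The model beats the 73.5% baseline by about six points of accuracy, while catching fewer than half of the actual churners.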


## Step 6: Hyperparameter Tuning with GridSearchCV

We use `GridSearchCV` to search for the best combination of hyperparameters for our Random Forest model.  
Each combination is scored by 5-fold cross-validated accuracy on the training set. The grid covers:

- `n_estimators`: Number of decision trees in the forest
- `max_depth`: Maximum depth of each tree



In [8]:
from sklearn.model_selection import GridSearchCV

# Define the hyperparameter grid
param_grid = {
    'model__n_estimators': [100, 200],
    'model__max_depth': [None, 10, 20]
}

# Setup GridSearchCV using cross-validation
grid_search = GridSearchCV(model_pipeline, param_grid, cv=5, scoring='accuracy', n_jobs=-1)

# Fit the grid search on the training data
grid_search.fit(X_train, y_train)

# View best results
print("Best Parameters:", grid_search.best_params_)
print("Best Cross-Validation Accuracy:", grid_search.best_score_)


Best Parameters: {'model__max_depth': 10, 'model__n_estimators': 100}
Best Cross-Validation Accuracy: 0.7953492587088121
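
`best_score_` is a cross-validation estimate on the training data, so it is worth looking at the full results table and scoring the tuned pipeline on the held-out split as well. The following is a minimal sketch on synthetic data from `make_classification`; the names `demo_search`, `Xs`, and `ys`, and the smaller grid, are assumptions chosen to keep the example fast.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (not the Telco dataset)
Xs, ys = make_classification(n_samples=400, n_features=8,
                             weights=[0.73, 0.27], random_state=42)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    Xs, ys, test_size=0.2, stratify=ys, random_state=42)

demo_search = GridSearchCV(
    Pipeline(steps=[("scaler", StandardScaler()),
                    ("model", RandomForestClassifier(random_state=42))]),
    param_grid={"model__n_estimators": [25, 50], "model__max_depth": [None, 5]},
    cv=3, scoring="accuracy",
)
demo_search.fit(Xs_train, ys_train)

# One row per parameter combination, best-ranked first
results = (pd.DataFrame(demo_search.cv_results_)
           [["params", "mean_test_score", "std_test_score", "rank_test_score"]]
           .sort_values("rank_test_score"))
print(results)

# best_estimator_ is refit on the whole training split (refit=True by default)
test_accuracy = accuracy_score(ys_test, demo_search.best_estimator_.predict(Xs_test))
print("Held-out accuracy of tuned pipeline:", test_accuracy)
```

Comparing `std_test_score` across rows gives a sense of whether the differences between parameter combinations are larger than the fold-to-fold noise.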


## Step 7: Export the Trained Model

We use the `joblib` library to save `grid_search.best_estimator_`, the pipeline refit on the full training set with the best parameters, as a `.pkl` file.

This saved file includes:
- Preprocessing steps (imputation, scaling, encoding)
- Tuned Random Forest model
- Full pipeline that can be reused anytime



In [9]:
import joblib

# Save the best model from GridSearchCV
joblib.dump(grid_search.best_estimator_, 'telco_churn_model.pkl')

print("Model saved as 'telco_churn_model.pkl'")


Model saved as 'telco_churn_model.pkl'
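
Reloading works the same way in reverse: `joblib.load` returns the fitted pipeline, which accepts raw feature columns and applies preprocessing itself. The sketch below demonstrates the round trip with a tiny stand-in pipeline and a temporary file; the data, the file name `demo_churn_model.pkl`, and the variable names are hypothetical.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny stand-in training data (hypothetical values)
train = pd.DataFrame({
    "tenure": [1, 2, 3, 40, 50, 60],
    "Contract": ["Month-to-month"] * 3 + ["Two year"] * 3,
})
labels = [1, 1, 1, 0, 0, 0]

saved_pipeline = Pipeline(steps=[
    ("preprocessing", ColumnTransformer(
        transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), ["Contract"])],
        remainder="passthrough")),  # keep 'tenure' as-is
    ("model", RandomForestClassifier(n_estimators=10, random_state=42)),
])
saved_pipeline.fit(train, labels)

# Round trip: dump to disk, then load into a new object
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_churn_model.pkl")
    joblib.dump(saved_pipeline, path)
    loaded_pipeline = joblib.load(path)

# The loaded pipeline takes raw columns; no manual preprocessing is needed
new_customers = pd.DataFrame({"tenure": [2, 55],
                              "Contract": ["Month-to-month", "Two year"]})
predictions = loaded_pipeline.predict(new_customers)
churn_probability = loaded_pipeline.predict_proba(new_customers)[:, 1]
print(predictions, churn_probability)
```

New data must contain the same column names the pipeline was trained on, and the file should be loaded with the same scikit-learn version that produced it, since pickled estimators are not guaranteed to be compatible across versions.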


## Final Observation

In this task, an end-to-end Machine Learning pipeline was implemented using the **Telco Customer Churn** dataset. The goal was to predict whether a customer will churn based on features like contract type, tenure, monthly charges, and more.

A **Random Forest Classifier** was used within a `Pipeline`, and **GridSearchCV** was applied for hyperparameter tuning. The best parameters found were:

- `max_depth`: **10**
- `n_estimators`: **100**

### Model Performance:
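- **Best Cross-Validation Accuracy** (tuned model, 5-fold on the training set): 79.5%
- **Test Accuracy** (default-parameter model from Step 5): 79.7%. The tuned model was not re-scored on the test set. For reference, always predicting "no churn" would score 73.5% on this test set (1036 of 1409).
- **Class 0 (No Churn)** was predicted well: precision 0.83, recall 0.91.
- **Class 1 (Churn)** was predicted noticeably worse: precision 0.66, recall 0.47. The model missed 196 of the 373 churners in the test set.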

### Key Takeaways:
- Pipelines help streamline preprocessing and model training.
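- Churners are the minority class (373 of 1409 test customers), and recall on them is below 50%, so accuracy alone overstates how useful the model is for retention.
- The saved pipeline accepts raw customer records, which makes it a reasonable starting point for customer retention work, although churn recall would need to improve before relying on it.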

Overall, the task successfully demonstrated how to build, tune, and evaluate a complete ML pipeline using Scikit-learn.
