# Task
Build an end-to-end machine learning pipeline using the scikit-learn Pipeline API to predict customer churn on the "https://raw.githubusercontent.com/IBM/watson-machine-learning-samples/master/cloud-object-storage/data/telco-customer-churn/telco.csv" dataset. The pipeline should include data preprocessing (scaling and encoding), model training (Logistic Regression and Random Forest), hyperparameter tuning using GridSearchCV, and the final pipeline should be exported using joblib.

In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, classification_report
import joblib


## Load the dataset

### Subtask:
Load the Telco Churn dataset from the provided URL.


**Reasoning**:
Import pandas and load the dataset into a DataFrame. The cell below reads a local copy of the CSV rather than fetching it from the URL.



In [None]:
import pandas as pd
# Local copy of the Telco churn CSV (a Colab upload path); adjust the path for your environment
data = pd.read_csv("/content/WA_Fn-UseC_-Telco-Customer-Churn (1).csv")

print(data.head())
data.info()  # info() prints its summary directly and returns None

   customerID  gender  SeniorCitizen Partner Dependents  tenure PhoneService  \
0  7590-VHVEG  Female              0     Yes         No       1           No   
1  5575-GNVDE    Male              0      No         No      34          Yes   
2  3668-QPYBK    Male              0      No         No       2          Yes   
3  7795-CFOCW    Male              0      No         No      45           No   
4  9237-HQITU  Female              0      No         No       2          Yes   

      MultipleLines InternetService OnlineSecurity  ... DeviceProtection  \
0  No phone service             DSL             No  ...               No   
1                No             DSL            Yes  ...              Yes   
2                No             DSL            Yes  ...               No   
3  No phone service             DSL            Yes  ...              Yes   
4                No     Fiber optic             No  ...               No   

  TechSupport StreamingTV StreamingMovies        Contract Pape

## Initial Data Exploration and Preparation

### Subtask:
Perform basic data exploration, handle missing values, and identify categorical and numerical features.

**Reasoning**:
Display the first few rows, check the data types of each column, and look for missing values to understand the structure and identify potential issues in the dataset. Also, identify numerical and categorical columns.
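
The checks described above can be sketched on a small stand-in frame. This is a minimal illustration, not the notebook's own code: the `toy` DataFrame and its values are made up, and it assumes that `TotalCharges` arrives as strings with blank entries, which is why it would otherwise be picked up as a categorical column.

```python
import pandas as pd

# Hypothetical toy frame: one blank string in TotalCharges forces object dtype.
toy = pd.DataFrame({
    "tenure": [1, 34, 0],
    "MonthlyCharges": [29.85, 56.95, 52.55],
    "TotalCharges": ["29.85", "1889.5", " "],
    "Contract": ["Month-to-month", "One year", "Two year"],
})

# Blank strings cannot be parsed, so errors="coerce" turns them into NaN.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# Count missing values per column and split columns by dtype.
missing_per_column = toy.isna().sum()
num_cols_toy = toy.select_dtypes(include="number").columns.tolist()
cat_cols_toy = toy.select_dtypes(exclude="number").columns.tolist()

print(missing_per_column)
print("numeric:", num_cols_toy)
print("categorical:", cat_cols_toy)
```

After the conversion the column is numeric and its blanks show up as missing values, so they can be imputed or dropped before scaling.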

In [None]:
# Target variable
y = data["Churn"].map({"Yes": 1, "No": 0})  # Encode target
X = data.drop(["Churn", "customerID"], axis=1) # Drop 'customerID' here

# Identify categorical and numerical columns
cat_cols = X.select_dtypes(include=["object"]).columns
num_cols = X.select_dtypes(exclude=["object"]).columns

# Preprocessing: scale numeric, one-hot encode categorical
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols)
    ]
)

## Define Preprocessing Steps

### Subtask:
Create a ColumnTransformer to handle scaling of numerical features and one-hot encoding of categorical features.

**Reasoning**:
Import necessary libraries for preprocessing. Define the transformers for numerical and categorical features. Create a ColumnTransformer to apply the defined transformers to the appropriate columns.
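
As a rough illustration of how a `ColumnTransformer` routes columns, the sketch below runs one on a hypothetical four-row frame. The median imputation step is an assumption added for the example (the notebook's preprocessor does not impute); it is there only so that a missing numeric value does not reach the scaler.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data: two numeric columns (one with a gap) and one categorical.
toy = pd.DataFrame({
    "tenure": [1, 34, 2, 45],
    "TotalCharges": [29.85, 1889.5, np.nan, 1840.75],
    "Contract": ["Month-to-month", "One year", "Month-to-month", "Two year"],
})

# Numeric branch: fill gaps with the median, then standardize.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Each transformer only sees the columns listed for it.
toy_preprocessor = ColumnTransformer([
    ("num", numeric_steps, ["tenure", "TotalCharges"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Contract"]),
])

transformed = toy_preprocessor.fit_transform(toy)
print(transformed.shape)  # 2 scaled columns + 3 one-hot columns
```

The output width is the number of numeric columns plus the number of distinct categories, which is worth checking on the real data: a high-cardinality string column produces one output column per distinct value.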

In [None]:
# Logistic Regression Pipeline
logreg_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(max_iter=1000))
])

# Random Forest Pipeline
rf_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier())
])


## Create Machine Learning Pipelines

### Subtask:
Define pipelines for Logistic Regression and Random Forest models, incorporating the preprocessing steps.

**Reasoning**:
Import the necessary libraries for creating pipelines and the models. Create a pipeline for Logistic Regression and another for Random Forest, each starting with the defined preprocessor and followed by the respective model.
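
A small sketch of the same pattern on made-up data may help show what the pipeline object does at prediction time. The column names and values here are hypothetical, and the step names `preprocessor` and `classifier` simply mirror the ones used in this notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training data: short-tenure monthly customers churn (label 1).
toy_X = pd.DataFrame({
    "tenure": [1, 2, 3, 40, 50, 60],
    "Contract": ["Month-to-month"] * 3 + ["Two year"] * 3,
})
toy_y = [1, 1, 1, 0, 0, 0]

toy_pipeline = Pipeline([
    ("preprocessor", ColumnTransformer([
        ("num", StandardScaler(), ["tenure"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Contract"]),
    ])),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# fit() learns the scaler statistics, the encoder categories, and the model together.
toy_pipeline.fit(toy_X, toy_y)

# A category unseen during fit is encoded as all zeros rather than raising an error.
new_customer = pd.DataFrame({"tenure": [5], "Contract": ["One year"]})
new_prediction = toy_pipeline.predict(new_customer)
print(new_prediction)

# Step parameters are addressable as <step name>__<parameter>.
print(toy_pipeline.get_params()["classifier__max_iter"])
```

Because preprocessing lives inside the pipeline, raw rows can be passed straight to `predict`, and cross-validation refits the scaler and encoder on each training fold.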

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)


## Define Hyperparameter Grids

### Subtask:
Set up hyperparameter grids for GridSearchCV for both models.

**Reasoning**:
Define dictionaries containing the hyperparameters to search over for both Logistic Regression and Random Forest models.
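
Grid keys use the `<step name>__<parameter>` form, so `classifier__C` sets `C` on the pipeline step named `classifier`. To get a feel for the cost of a grid before running it, the number of candidates can be counted with `ParameterGrid`. The grid below copies the Random Forest values used in this notebook; the variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same search space as the Random Forest grid in this notebook.
example_rf_grid = {
    "classifier__n_estimators": [100, 200],
    "classifier__max_depth": [5, 10, None],
    "classifier__min_samples_split": [2, 5],
}

# Every combination of values is one candidate configuration.
n_candidates = len(ParameterGrid(example_rf_grid))

# With 5-fold cross-validation each candidate is fitted 5 times
# (GridSearchCV then refits the best one once more on the full training set).
cv_folds = 5
n_fits = n_candidates * cv_folds

print(n_candidates, n_fits)
```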

In [None]:
# Parameter grids
logreg_params = {
    "classifier__C": [0.01, 0.1, 1, 10],
    "classifier__penalty": ["l2"]
}

rf_params = {
    "classifier__n_estimators": [100, 200],
    "classifier__max_depth": [5, 10, None],
    "classifier__min_samples_split": [2, 5]
}

# GridSearchCV for Logistic Regression
grid_logreg = GridSearchCV(logreg_pipeline, param_grid=logreg_params,
                           cv=5, scoring="f1", n_jobs=-1)
grid_logreg.fit(X_train, y_train)

# GridSearchCV for Random Forest
grid_rf = GridSearchCV(rf_pipeline, param_grid=rf_params,
                       cv=5, scoring="f1", n_jobs=-1)
grid_rf.fit(X_train, y_train)




## Evaluate Models

### Subtask:
Evaluate the best models on the test set using appropriate metrics.

**Reasoning**:
Use the tuned estimators from the GridSearchCV results to predict on the test set, then print a classification report (per-class precision, recall, and F1-score, plus overall accuracy) together with the best hyperparameters for each model.
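
For readers who want the individual numbers behind a classification report, the sketch below computes a confusion matrix and the positive-class metrics on made-up labels. The arrays are hypothetical and are not results from this notebook. The last part shows the case where a model never predicts the positive class, which makes precision undefined; `zero_division=0` reports it as 0 instead of emitting a warning.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels: 1 = churn, 0 = no churn.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

print(cm)
print(round(precision, 3), round(recall, 3), round(f1, 3))

# A model that predicts "no churn" for everyone has no positive predictions,
# so precision for the churn class is undefined.
all_negative = [0] * len(y_true)
precision_all_negative = precision_score(y_true, all_negative, zero_division=0)
recall_all_negative = recall_score(y_true, all_negative)
print(precision_all_negative, recall_all_negative)
```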

In [None]:
# Logistic Regression Evaluation
y_pred_logreg = grid_logreg.predict(X_test)
print("Logistic Regression:")
print(classification_report(y_test, y_pred_logreg))
print("Best Params:", grid_logreg.best_params_)

# Random Forest Evaluation
y_pred_rf = grid_rf.predict(X_test)
print("Random Forest:")
print(classification_report(y_test, y_pred_rf))
print("Best Params:", grid_rf.best_params_)


Logistic Regression:
              precision    recall  f1-score   support

          No       0.85      0.92      0.88      1036
         Yes       0.71      0.56      0.63       373

    accuracy                           0.82      1409
   macro avg       0.78      0.74      0.76      1409
weighted avg       0.82      0.82      0.82      1409

Best Params: {'classifier__C': 0.01, 'classifier__penalty': 'l2'}
Random Forest:
              precision    recall  f1-score   support

          No       0.74      1.00      0.85      1036
         Yes       0.00      0.00      0.00       373

    accuracy                           0.74      1409
   macro avg       0.37      0.50      0.42      1409
weighted avg       0.54      0.74      0.62      1409

Best Params: {'classifier__max_depth': 5, 'classifier__min_samples_split': 2, 'classifier__n_estimators': 100}




## Export the Best Pipeline

### Subtask:
Export the best performing pipeline using joblib.

**Reasoning**:
Compare the performance of the two models based on the evaluation metrics and select the best one. Import the `joblib` library and save the best performing pipeline to a file.
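
A minimal round-trip sketch, assuming synthetic data and a temporary file path, shows one way to export only the refitted pipeline (`best_estimator_`) rather than the whole search object. The data, grid, and file name below are illustrative and not taken from this notebook.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary problem: the label depends on the sign of the first feature.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

search = GridSearchCV(
    Pipeline([
        ("scale", StandardScaler()),
        ("classifier", LogisticRegression(max_iter=1000)),
    ]),
    param_grid={"classifier__C": [0.1, 1]},
    cv=3,
    scoring="f1",
)
search.fit(X_demo, y_demo)

# best_estimator_ is the pipeline refitted on all training data with the best parameters.
fitted_pipeline = search.best_estimator_

# Save to a temporary directory and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_pipeline.pkl")
    joblib.dump(fitted_pipeline, path)
    reloaded = joblib.load(path)

same_predictions = np.array_equal(
    fitted_pipeline.predict(X_demo), reloaded.predict(X_demo)
)
print(type(reloaded).__name__, same_predictions)
```

Saving the pipeline alone keeps the file smaller than saving the search object, which also carries the cross-validation results. A saved file should be loaded with a scikit-learn version compatible with the one that wrote it.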

In [None]:
# Pick the search with the higher cross-validated F1 score and keep its refitted pipeline
best_search = grid_rf if grid_rf.best_score_ > grid_logreg.best_score_ else grid_logreg
best_model = best_search.best_estimator_

joblib.dump(best_model, "churn_pipeline.pkl")
print("Pipeline saved as churn_pipeline.pkl")


Pipeline saved as churn_pipeline.pkl


In [None]:
# Load saved pipeline
pipeline = joblib.load("churn_pipeline.pkl")

# Predict on new data
sample = X_test.iloc[:5]
print("Predictions:", pipeline.predict(sample))


Predictions: ['Yes' 'No' 'No' 'Yes' 'No']


# Telco Customer Churn Prediction Pipeline

This notebook demonstrates an end-to-end machine learning pipeline built using scikit-learn to predict customer churn for a telecommunications company.

## Dataset

The dataset used in this project is the Telco Customer Churn dataset, available from:
https://raw.githubusercontent.com/IBM/watson-machine-learning-samples/master/cloud-object-storage/data/telco-customer-churn/telco.csv

## Pipeline Overview

The pipeline includes the following steps:

1.  **Data Loading**: Load the dataset into a pandas DataFrame.
2.  **Data Exploration and Preparation**: Perform basic data exploration, handle missing values, and identify categorical and numerical features. Because `TotalCharges` is read as an object column (its mixed data type), the code as written treats it as categorical and one-hot encodes it instead of scaling it.
3.  **Preprocessing**: Create a `ColumnTransformer` to apply `StandardScaler` to numerical features and `OneHotEncoder` to categorical features.
4.  **Model Definition**: Define `Pipeline` objects for two different classifiers:
    *   Logistic Regression
    *   Random Forest
5.  **Hyperparameter Tuning**: Use `GridSearchCV` to find the best hyperparameters for each model based on the F1-score.
6.  **Model Evaluation**: Evaluate the performance of the best models on a held-out test set using metrics such as accuracy, precision, recall, and F1-score.
7.  **Model Export**: Export the best performing pipeline using `joblib` for future use.

## Code Structure

The notebook is structured into several sections, each addressing a specific step in the machine learning pipeline.

*   **Load the dataset**: Loads the CSV into a pandas DataFrame.
*   **Initial Data Exploration and Preparation**: Includes basic data checks and column identification.
*   **Define Preprocessing Steps**: Sets up the `ColumnTransformer`.
*   **Create Machine Learning Pipelines**: Defines the `Pipeline` objects for Logistic Regression and Random Forest.
*   **Define Hyperparameter Grids**: Specifies the parameter search space and fits `GridSearchCV` for each model.
*   **Evaluate Models**: Evaluates the tuned models on the held-out test set and reports the results.
*   **Export the Best Pipeline**: Saves the best performing pipeline to a file.
*   **Load and Test Saved Pipeline**: Demonstrates how to load the saved pipeline and make predictions.

## Requirements

The following libraries are required to run this notebook:

*   pandas
*   numpy
*   scikit-learn
*   joblib

These can be installed using pip: `pip install pandas numpy scikit-learn joblib`