## End-to-End Machine Learning Pipeline with Data Analysis
This hands-on lab will guide you through creating an end-to-end machine learning pipeline in Python using Jupyter Notebook. We will cover the following steps:

	1.	Installing and importing necessary libraries.
	2.	Downloading and collecting a dataset.
	3.	Performing an exploratory data analysis (EDA).
	4.	Processing the data (cleaning and feature engineering).
	5.	Model development:
		•	Pipeline setup: we define a pipeline for data preprocessing, feature engineering, and model training.
		•	Multiple models: Logistic Regression is tuned in the walkthrough; Random Forest and SVM models are tested in the exercise at the end.
		•	Hyperparameter tuning: GridSearchCV is used to find the best model and hyperparameters.
		•	Model evaluation: the best model is evaluated on the test set, and its performance is analyzed.
		•	Inference pipeline: the best model is saved so that it can be loaded into an inference pipeline and deployed via an API for real-time predictions.

### 1. Installing the Libraries

(If you’ve already installed these libraries, you can skip this step.)

In [None]:
# !pip install pandas numpy scikit-learn joblib flask matplotlib seaborn

### 2. Importing the Necessary Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from joblib import dump


### 3. Data Preprocessing
#### A. Data Collection

In [None]:
def get_data():
    from sklearn.datasets import load_breast_cancer
    data = load_breast_cancer(as_frame=True)
    df = pd.concat([data['data'], data['target']], axis=1)
    return df

# Load the Breast Cancer Wisconsin dataset
data = get_data()

# Display the first few rows
data.head()

#### B. Exploratory Data Analysis (EDA)

Let’s perform some data analysis to understand the distribution of the features and their relationships with the target variable.

In [None]:
# Distribution of the target variable
sns.countplot(x=data['target'])
plt.title('Distribution of Target Variable (Malignant vs Benign)')
plt.show()
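
The count plot shows the class balance visually; a few numeric summaries can complement it. The following is a minimal sketch, not part of the original lab: it rebuilds the same DataFrame as `get_data()` so it runs on its own, and the variable names (`df`, `class_counts`, `target_corr`) are chosen for illustration. Pearson correlation is used here only as a quick first look at feature-target relationships.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Rebuild the same DataFrame as get_data() so this block runs on its own
dataset = load_breast_cancer(as_frame=True)
df = pd.concat([dataset["data"], dataset["target"]], axis=1)

# Class balance as counts and proportions (0 = malignant, 1 = benign)
class_counts = df["target"].value_counts().sort_index()
class_share = df["target"].value_counts(normalize=True).sort_index()
print(class_counts)
print(class_share.round(3))

# Missing values per column; a non-zero entry would call for imputation
missing_per_column = df.isna().sum()
print("Columns with missing values:", int((missing_per_column > 0).sum()))

# Absolute Pearson correlation of each feature with the target, strongest first
target_corr = (
    df.corr()["target"].drop("target").abs().sort_values(ascending=False)
)
print(target_corr.head(5))
```

A strong linear correlation is only one kind of relationship; the mutual information scores in the next part can also pick up non-linear dependence.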


#### C. Mutual Information and Feature Importance
- mutual_info_classif: This function calculates the mutual information between each feature and the target, which helps in identifying how much information each feature contributes to predicting the target.
- Visualization: The code creates a bar plot to visualize the importance of each feature based on mutual information scores.

In [None]:
from sklearn.feature_selection import mutual_info_classif

# Split the data into features and target
X = data.drop("target", axis=1)
y = data["target"]

def Show_Feature_Score(X,y):
    # Compute the mutual information scores
    mi_scores = mutual_info_classif(X, y, random_state=42)

    # Create a DataFrame to display the scores
    mi_scores_df = pd.DataFrame({
        'Feature': X.columns,
        'Mutual Information': mi_scores
    })

    # Sort the DataFrame by mutual information scores
    mi_scores_df = mi_scores_df.sort_values(by='Mutual Information', ascending=False)

    # Visualize the mutual information scores
    plt.figure(figsize=(10, 6))
    sns.barplot(x='Mutual Information', y='Feature', data=mi_scores_df, palette="viridis")
    plt.title('Mutual Information Scores for Each Feature')
    plt.show()

Show_Feature_Score(X,y)
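
Beyond plotting, mutual information scores can also drive feature selection. The sketch below is an optional extension, not a step of the original lab: it uses scikit-learn's `SelectKBest` with `mutual_info_classif` as the scoring function. The value `k=10` and the names `X_mi`, `y_mi`, and `selector` are arbitrary choices for illustration.

```python
from functools import partial

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Load features and target directly so this block runs on its own
X_mi, y_mi = load_breast_cancer(return_X_y=True, as_frame=True)

# Fix the random seed so the mutual information estimate is reproducible
score_func = partial(mutual_info_classif, random_state=42)

# Keep the 10 highest-scoring features (k=10 is an arbitrary illustrative value)
selector = SelectKBest(score_func=score_func, k=10)
X_selected = selector.fit_transform(X_mi, y_mi)

# Recover the names of the retained columns from the boolean support mask
selected_features = X_mi.columns[selector.get_support()].tolist()
print(X_selected.shape)
print(selected_features)
```

If such a selector is used for modeling, it is usually safer to place it inside the training pipeline so that the scores are computed on training folds only.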

#### D. Feature Engineering

In [None]:
def add_combined_feature(X):
    X = X.copy()  # Ensure we're modifying a copy of the DataFrame
    
    # Example feature: combining two features
    X['Combined_radius_texture'] = X['mean radius'] * X['mean texture']
    
    return X
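
Before wiring the function into a pipeline, it can help to check what it does to the data. A minimal sketch, assuming the same function body as above (repeated here so the block is self-contained); `feature_adder`, `X_raw`, and `X_engineered` are illustrative names.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import FunctionTransformer


def add_combined_feature(X):
    # Same logic as the function above, repeated so this block runs on its own
    X = X.copy()
    X["Combined_radius_texture"] = X["mean radius"] * X["mean texture"]
    return X


X_raw, _ = load_breast_cancer(return_X_y=True, as_frame=True)

# Wrap the function so it exposes the fit/transform interface used by Pipeline
feature_adder = FunctionTransformer(add_combined_feature)
X_engineered = feature_adder.fit_transform(X_raw)

# One column is added; the input DataFrame itself is left untouched
print(X_raw.shape, "->", X_engineered.shape)
print(X_engineered[["mean radius", "mean texture", "Combined_radius_texture"]].head())
```

Because the function indexes columns by name, it expects a pandas DataFrame as input; passing a plain NumPy array would fail.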

### 4. Model Development

#### A. Build the Training Pipeline

In [None]:
from sklearn.preprocessing import FunctionTransformer

# Define the feature engineering and preprocessing pipeline
preprocessing_pipeline = Pipeline([
    ('feature_engineering', FunctionTransformer(add_combined_feature)),
    ('scaler', StandardScaler())
])

# Define the models and their hyperparameters for GridSearchCV
models = [
    {
        'classifier': [LogisticRegression(max_iter=1000)],
        'classifier__C': [0.1, 1.0, 10]
    }
]

# Full training pipeline: feature engineering and scaling, followed by a classifier
training_pipeline = Pipeline(steps=[
    ('preprocessing', preprocessing_pipeline),
    ('classifier', LogisticRegression()) # Placeholder, will be replaced by GridSearchCV
])

# Split the data into training and testing sets
X = data.drop(columns=['target'])
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
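
The placeholder classifier in the pipeline above works because a parameter grid may contain a key equal to a step name (here `classifier`). For each candidate, GridSearchCV applies the grid entry to a clone of the pipeline through `set_params`, which swaps the whole step and then sets the nested `classifier__...` parameters. A minimal sketch of that mechanism, calling `set_params` directly; `demo_pipeline` and the use of SVC as the replacement are illustrative only.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

demo_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),  # placeholder step
])

# One candidate as it could appear in a grid: a new estimator plus a nested parameter
candidate = {"classifier": SVC(), "classifier__C": 10}

# The step is replaced first, then classifier__C is applied to the new estimator
demo_pipeline.set_params(**candidate)

print(type(demo_pipeline.named_steps["classifier"]).__name__)
print(demo_pipeline.named_steps["classifier"].C)
```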

#### B. Hyperparameter Tuning and Model Selection

In [None]:
# Use GridSearchCV to find the best model and hyperparameters
grid_search = GridSearchCV(training_pipeline, models, cv=5, n_jobs=-1, verbose=1)
grid_search.fit(X_train, y_train)

# Print the best parameters and the corresponding score
print(f"Best parameters found: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.2f}")

# Best model
best_model = grid_search.best_estimator_
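
The best score alone hides how close the other candidates were. GridSearchCV stores per-candidate results in its `cv_results_` attribute, which can be loaded into a DataFrame. The sketch below is self-contained and therefore uses a simplified stand-in pipeline (scaler plus Logistic Regression, without the feature engineering step); `demo_search` and `summary` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_breast_cancer(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.3, random_state=42
)

# Simplified stand-in for the lab's training pipeline
demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])
demo_search = GridSearchCV(demo_pipeline, {"classifier__C": [0.1, 1.0, 10]}, cv=5)
demo_search.fit(X_tr, y_tr)

# One row per candidate: mean and spread of the cross-validation score, plus rank
results = pd.DataFrame(demo_search.cv_results_)
columns = ["param_classifier__C", "mean_test_score", "std_test_score", "rank_test_score"]
summary = results[columns].sort_values("rank_test_score")
print(summary.to_string(index=False))
```

When the gap between candidates is smaller than `std_test_score`, the ranking should be read with caution.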

#### C. Evaluate the Best Model

In [None]:
def evaluate_model(model, X_test, y_test):

    # Make predictions on the test set
    y_pred = model.predict(X_test)
    # Evaluate the model's performance
    print(f"Test Accuracy: {accuracy_score(y_test, y_pred):.2f}")
    print("Classification Report:")
    print(classification_report(y_test, y_pred))

    # Confusion Matrix
    conf_matrix = confusion_matrix(y_test, y_pred)
    plt.figure(figsize=(8, 6))
    sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
    plt.title('Confusion Matrix')
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.show()

In [None]:
# Evaluate the best Model
evaluate_model(best_model, X_test, y_test)

In [None]:
# Save the best model and the preprocessing steps
dump(best_model, 'best_cancer_model_pipeline.joblib')
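
Once saved, the pipeline can be reloaded for inference with `joblib.load`. The following is a minimal sketch under stated assumptions: it trains a small stand-in pipeline instead of reusing `best_model`, writes it to a temporary directory under a hypothetical file name, and wraps prediction in an illustrative helper called `predict_samples`.

```python
import tempfile
from pathlib import Path

from joblib import dump, load
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_breast_cancer(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.3, random_state=42
)

# Stand-in for best_model so this block runs on its own
stand_in_model = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
]).fit(X_tr, y_tr)

# Save and reload; the file name is hypothetical
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "demo_cancer_model_pipeline.joblib"
    dump(stand_in_model, model_path)
    loaded_model = load(model_path)


def predict_samples(model, samples):
    """Return predicted labels and class probabilities for raw feature rows."""
    labels = model.predict(samples)
    probabilities = model.predict_proba(samples)
    return labels, probabilities


# The loaded pipeline applies scaling itself, so raw features are passed in
new_samples = X_te.head(3)
labels, probabilities = predict_samples(loaded_model, new_samples)
print(labels)
print(probabilities.round(3))
```

One caveat for the pipeline saved in this lab: it contains `FunctionTransformer(add_combined_feature)`, and functions are pickled by reference. The process that loads the file therefore needs `add_combined_feature` to be importable under the same module path, for example by moving it into a shared module used by both the notebook and the serving code.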

## Create a New Pipeline with Different Feature Engineering and Different Models

1. Random Forest:

from sklearn.ensemble import RandomForestClassifier

with:
- n_estimators: 50, 100, 200
- max_depth: None, 10, 20

2. Support Vector Classifier:

from sklearn.svm import SVC

with:
- C: 0.1, 1, 10
- kernel: linear, rbf

In [2]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

In [3]:
# Example dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Generic pipeline definition: StandardScaler + PCA + model
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("clf", RandomForestClassifier())  # Placeholder, replaced by GridSearchCV
])

In [5]:
# Hyperparameter grid for RandomForest and SVC
param_grid = [
    {
        "clf": [RandomForestClassifier(random_state=42)],
        "clf__n_estimators": [50, 100, 200],
        "clf__max_depth": [None, 10, 20]
    },
    {
        "clf": [SVC(random_state=42)],
        "clf__C": [0.1, 1, 10],
        "clf__kernel": ["linear", "rbf"]
    }
]

In [6]:
# GridSearchCV to test every combination
search = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1)
search.fit(X_train, y_train)

print("Best model:", search.best_estimator_)
print("Best hyperparameters:", search.best_params_)
print("Test set score:", search.score(X_test, y_test))

Best model: Pipeline(steps=[('scaler', StandardScaler()), ('pca', PCA(n_components=2)),
                ('clf', SVC(C=1, random_state=42))])
Best hyperparameters: {'clf': SVC(random_state=42), 'clf__C': 1, 'clf__kernel': 'rbf'}
Test set score: 0.9
