# Pipeline in Machine Learning

In machine learning, a pipeline is a sequence of data processing steps chained together to automate and streamline the workflow. A pipeline lets you combine multiple data preprocessing and model training steps into a single object, making your machine learning code easier to organize and manage.

**Here are the key components of a pipeline:**

**Data Preprocessing:** Pipelines typically start with data preprocessing steps, such as feature scaling, feature encoding, handling missing values, or dimensionality reduction. These steps ensure that the data is in the appropriate format and quality for model training.

**Model Training:** After the data preprocessing steps, the pipeline trains a machine learning model. This can be a classifier for classification tasks, a regressor for regression tasks, or any other type of model, depending on the problem at hand.

**Model Evaluation:** Once the model is trained, the pipeline is usually evaluated. This may involve calculating metrics, cross-validation, or any other evaluation technique that assesses the model's effectiveness.

**Prediction:** After the model has been evaluated, the pipeline lets you make predictions on new, unseen data using the trained model. This step applies the same preprocessing to the new data before generating predictions.
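The four components above can be sketched in a few lines. This is a minimal illustration rather than a recommended setup: the synthetic data from `make_classification`, the choice of `StandardScaler` and `LogisticRegression`, and the step names `scaler` and `model` are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Preprocessing and model training are combined in one object.
pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),                  # data preprocessing
    ("model", LogisticRegression(max_iter=1000)),  # model training
])

# Evaluation: each cross-validation fold refits the scaler and the model
# on that fold's training portion only.
cv_scores = cross_val_score(pipe, X_train, y_train, cv=5)
print("Mean CV accuracy:", cv_scores.mean())

# Fit on the full training set, then predict on unseen data.
# predict() scales X_test with the statistics learned from X_train.
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))
```

Because the scaler sits inside the pipeline, it never sees the held-out fold or the test set during fitting, which is one of the main practical reasons to use a pipeline.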

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# load the dataset from seaborn
df = sns.load_dataset("titanic")

# select features and target variable
x = df[['pclass', 'sex', 'age', 'fare', 'embarked']]
y = df['survived']

# split the data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# define which columns are numeric and which are categorical
numeric_features = ['age', 'fare']
categorical_features = ['pclass', 'sex', 'embarked']

# numeric columns: fill missing values with the median
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median'))
])

# categorical columns: fill missing values with the mode, then one-hot encode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# column transformer that applies each sub-pipeline to its columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# create a pipeline with the preprocessor and RandomForestClassifier
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])

# fit the pipeline on the training data
pipeline.fit(x_train, y_train)

# make predictions on the test data
y_pred = pipeline.predict(x_test)

# calculate the accuracy score
accuracy = accuracy_score(y_test, y_pred)

print("Accuracy:", accuracy)
```


Accuracy: 0.7821229050279329


# Hyperparameter Tuning in a Pipeline

Hyperparameter tuning in a pipeline means optimizing the hyperparameters of the different steps in the pipeline to find the combination that maximizes the model's performance. The Titanic example in this section tunes a pipeline with grid search and selects the best model.
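Ahead of the Titanic example, it may help to see how parameters of individual steps are addressed. In scikit-learn, a parameter of a pipeline step is named `<step name>__<parameter name>`, so a single grid can cover preprocessing and model settings together. The sketch below assumes synthetic data and small grid values chosen only to keep it fast; the step names `imputer` and `model` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

pipe = Pipeline(steps=[
    ("imputer", SimpleImputer()),
    ("model", RandomForestClassifier(random_state=0)),
])

# get_params() lists every tunable name in "<step>__<parameter>" form.
available = pipe.get_params()
print("imputer__strategy" in available, "model__max_depth" in available)

# One grid can mix preprocessing and model hyperparameters.
param_grid = {
    "imputer__strategy": ["mean", "median"],
    "model__n_estimators": [10, 25],
    "model__max_depth": [None, 3],
}

# Every combination is refit from scratch, preprocessing included, on each fold.
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best mean CV accuracy:", search.best_score_)
```

A misspelled key such as `model_max_depth` (single underscore) raises an error when the search is fitted, so checking names against `get_params()` first can save a failed run.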

```python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# load the dataset from seaborn
df = sns.load_dataset("titanic")

# select features and target variable
# 'alive' is a yes/no copy of 'survived'; keeping it as a feature leaks the target
x = df.drop(["survived", "alive"], axis=1)
y = df['survived']

# split the data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=42, test_size=0.2)

# create a pipeline
# note: OneHotEncoder here treats every column, numeric ones included, as categorical
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore')),
    ('model', RandomForestClassifier(random_state=42))
])

# define the hyperparameters to tune, addressed as <step name>__<parameter>
hyperparameters = {
    'model__n_estimators': [100, 200, 300],
    'model__max_depth': [None, 5, 10],
    'model__min_samples_split': [2, 5, 10]
}

# perform grid search with 5-fold cross-validation
grid_search = GridSearchCV(pipeline, hyperparameters, cv=5)
grid_search.fit(x_train, y_train)

# get the best model (refit on the full training set)
best_model = grid_search.best_estimator_

# make predictions on the test data using the best model
y_pred = best_model.predict(x_test)

# calculate the accuracy score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# print the best parameters
print("Best Hyperparameters", grid_search.best_params_)
```



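Accuracy: 1.0
Best Hyperparameters {'model__max_depth': None, 'model__min_samples_split': 2, 'model__n_estimators': 100}

This perfect score was recorded with the `alive` column among the features. `alive` is a yes/no copy of `survived`, so the score reflects target leakage rather than predictive power. Excluding that column should give a lower, more realistic accuracy.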
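A score of exactly 1.0 is worth a quick check for features that mirror the target. The sketch below shows one way to do that on a small hypothetical DataFrame; `leaky_columns` and the toy data are assumptions made for illustration and are not part of pandas or scikit-learn.

```python
import pandas as pd

# Hypothetical toy data: 'alive' mirrors the target, 'sex' does not.
toy = pd.DataFrame({
    "survived": [0, 1, 1, 0, 1, 0],
    "alive": ["no", "yes", "yes", "no", "yes", "no"],
    "sex": ["male", "female", "male", "male", "female", "female"],
})


def leaky_columns(frame, target):
    """Return feature columns in which each value maps to exactly one target value.

    Columns with more distinct values than the target are skipped, so
    ID-like columns are not flagged.
    """
    n_target_values = frame[target].nunique()
    leaky = []
    for col in frame.columns.drop(target):
        # Number of distinct target values observed for each feature value.
        targets_per_value = frame.groupby(col, observed=True)[target].nunique()
        one_to_one = targets_per_value.max() == 1
        if one_to_one and frame[col].nunique() <= n_target_values:
            leaky.append(col)
    return leaky


suspects = leaky_columns(toy, "survived")
print(suspects)
```

Running the same kind of check on the real feature table before fitting is cheap, and any flagged column can be dropped or inspected by hand.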
