`Pipelines in scikit-learn`


A Pipeline in scikit-learn (sklearn) chains a sequence of data transformation steps and a final estimator into a single object. Calling fit() or predict() on the pipeline runs every step in order, which keeps your code cleaner, more organized, and easier to maintain. Here's why pipelines are useful and how to use them:

# Why Use Pipelines:

- Simplicity and Clarity: Pipelines make your code more readable and maintainable by encapsulating multiple steps in a single object. This can help reduce the risk of errors and make your code easier to understand.

- Reproducibility: Pipelines ensure that the same data preprocessing and modeling steps are applied consistently to the training and testing data. This helps maintain the integrity of your experiments and makes it easier to reproduce your results.

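- Parameter Tuning: During hyperparameter tuning or cross-validation, a pipeline refits its preprocessing steps on the training portion of each fold and only applies them to the held-out portion. Statistics such as the scaler's mean are therefore never computed from validation data, which prevents data leakage.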

- Integration with Grid Search: Pipelines can be used seamlessly with tools like GridSearchCV to search for the best hyperparameters and preprocessing options simultaneously.
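As a minimal sketch of the last two points, the cell below tunes a preprocessing option and a model hyperparameter in one search. The step names and the grid values are illustrative assumptions, not recommended settings; the only fixed convention is that parameter keys take the form `<step name>__<parameter>`.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

pipe = Pipeline([("scaler", StandardScaler()),
                 ("pca", PCA()),
                 ("model", SVC())])

# Keys are "<step name>__<parameter>"; the values are arbitrary examples.
param_grid = {
    "pca__n_components": [2, 3],
    "model__C": [0.1, 1, 10],
}

# Each candidate is scored with 5-fold cross-validation; the scaler and PCA
# are refit on the training part of every fold.
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

Because the whole pipeline is passed to GridSearchCV, the search treats preprocessing choices and model hyperparameters as one combined grid.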


In [40]:
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
# For dimensionality reduction
from sklearn.decomposition import PCA

In [41]:
# Loading Dataset and dividing it into training and testing data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=13)


In [42]:
# Creating scaler, PCA, and model instances
scaler = StandardScaler()
pca = PCA(n_components=2)
model = SVC()


In [43]:
# Creating a Pipeline
pipeline = Pipeline([('scaler', scaler),
                     ('pca', pca),
                     ('model', model)])
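Each step is a (name, estimator) pair, and the names are what you use later to look up or configure a step. The sketch below rebuilds the same pipeline so it runs on its own, then shows named_steps and the make_pipeline helper, which generates step names from the lowercased class names. The variable auto_pipeline is just an illustrative name.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipeline = Pipeline([('scaler', StandardScaler()),
                     ('pca', PCA(n_components=2)),
                     ('model', SVC())])

# Step names in execution order
step_names = [name for name, _ in pipeline.steps]
print(step_names)

# Look up a single step by its name
print(pipeline.named_steps['pca'])

# make_pipeline builds the same kind of object but names the steps itself
auto_pipeline = make_pipeline(StandardScaler(), PCA(n_components=2), SVC())
auto_names = [name for name, _ in auto_pipeline.steps]
print(auto_names)
```

Explicit names are handy when you want readable keys for parameter tuning; make_pipeline is shorter when the default names are good enough.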

In [44]:
# Using fit() to train the model on the training dataset
# The pipeline handles scaling, dimensionality reduction, etc. by itself
pipeline.fit(X_train, y_train)

# Using predict() to get prediction after learning from training dataset
y_pred = pipeline.predict(X_test)
print(y_pred[:10])
print(y_test[:10])

[1 1 0 2 2 0 2 2 0 1]
[1 1 0 2 2 0 2 2 0 1]
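To make clear what the pipeline does internally, here is a sketch of the same workflow written out by hand, using the same split and settings as above. The transformers are fit on the training data only and then reused to transform the test data. Variable names such as manual_pred are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=13)

# Manual version: fit each transformer on the training data only
scaler = StandardScaler()
pca = PCA(n_components=2)
model = SVC()
X_train_scaled = scaler.fit_transform(X_train)
X_train_reduced = pca.fit_transform(X_train_scaled)
model.fit(X_train_reduced, y_train)

# Test data is only transformed, never used for fitting
X_test_reduced = pca.transform(scaler.transform(X_test))
manual_pred = model.predict(X_test_reduced)

# Pipeline version: the same steps behind a single fit/predict
pipeline = Pipeline([('scaler', StandardScaler()),
                     ('pca', PCA(n_components=2)),
                     ('model', SVC())])
pipeline.fit(X_train, y_train)
pipe_pred = pipeline.predict(X_test)

same_predictions = np.array_equal(manual_pred, pipe_pred)
print(same_predictions)
```

The manual version needs four intermediate arrays and leaves room for mistakes such as calling fit_transform on the test set; the pipeline removes that bookkeeping.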


In [45]:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy-score:{accuracy}")

Accuracy-score:0.9210526315789473


`Our model reaches about 92% accuracy on the test set with default hyperparameters and no tuning`
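A single train/test split gives only one estimate of accuracy. As a rough sketch, the same pipeline can be passed to cross_val_score to get one score per fold; the choice of 5 folds here is an assumption, not something the example above prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

pipeline = Pipeline([('scaler', StandardScaler()),
                     ('pca', PCA(n_components=2)),
                     ('model', SVC())])

# The pipeline is cloned and refit from scratch on each of the 5 training folds
scores = cross_val_score(pipeline, X, y, cv=5)

print(scores)
print(f"Mean accuracy: {scores.mean():.3f}")
```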
