In [None]:
# What is a Pipeline in Machine Learning?

# A pipeline contains multiple transformers (or even models!) and applies them to data IN SEQUENCE. Compare this to a ColumnTransformer, which applies different transformers to different columns IN PARALLEL. When a
# pipeline is fit on data, every transformer inside it is fit. When data is transformed with a pipeline, it passes through the first transformer, then the second, and so on, with each step receiving the output of the step before it. A pipeline
# can contain any number of transformers, as long as each one has .fit() and .transform() methods. Each transformer in a pipeline is called a 'step'.
# If needed, a single estimator, or model, can be placed at the end of a pipeline. You will learn more about that later.
# The important thing to remember is that pipelines are ordered, so the order in which you list the steps matters.
# Pipelines can even contain ColumnTransformers, AND ColumnTransformers can contain pipelines. You will learn how to do this in a later lesson.
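
# The sequential behavior described above can be sketched on a tiny example. This is only an illustration: the toy array X and the choice of a mean imputer followed by a scaler are assumptions made
# for the sketch, not part of the lesson's dataset.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): a single column with one missing value
X = np.array([[1.0], [np.nan], [3.0]])

# Steps run in the order they are listed: impute first, then scale
pipe = make_pipeline(SimpleImputer(strategy='mean'), StandardScaler())

# A single call fits both steps and applies them in sequence
X_out = pipe.fit_transform(X)

# make_pipeline names each step after its lowercased class name
step_names = [name for name, _ in pipe.steps]
print(step_names)
print(X_out)
```

# Swapping the two steps would not work here, because StandardScaler would pass the missing value through and the imputer would then compute its mean from already-scaled data. That is why step order matters.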



# Why Should I Use Pipelines for Machine Learning?

# Reasons to use pipelines:
# 1. Pipelines use less code than fitting and applying each transformer individually. Every transformer in the pipeline is fit in a single .fit() call, and the data is transformed by all of them in a single
# .transform() call.
# 2. Pipelines make your preprocessing workflow easier to understand. By reducing the code and displaying a diagram of the pipeline, you can clearly show your readers how your data is being
# transformed before modeling.
# 3. Pipelines are easy to use in production models. When you are ready to deploy your model on new data, a preprocessing pipeline lets you preprocess that data quickly and in the same way as the
# training data.
# 4. Pipelines can prevent data leakage. A pipeline should be fit only on the training data, so the test data never influences the preprocessing. Later you will learn a technique called 'cross-validation',
# and pipelines will simplify performing it without leaking data.
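
# Reasons 1 and 4 can be illustrated with a short comparison. This is a minimal sketch on made-up data: the random array, the median imputation strategy, and the variable names are all assumptions
# chosen for the illustration. The pipeline is fit on the training split only and then reused to transform the test split.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Hypothetical toy data with two missing values in the first column
rng = np.random.default_rng(42)
X = rng.normal(loc=10, scale=2, size=(20, 2))
X[[0, 5], 0] = np.nan

X_train, X_test = train_test_split(X, random_state=42)

# Manual approach: each transformer is fit and applied separately
imputer = SimpleImputer(strategy='median')
scaler = StandardScaler()
X_train_manual = scaler.fit_transform(imputer.fit_transform(X_train))
X_test_manual = scaler.transform(imputer.transform(X_test))

# Pipeline approach: one fit on the training data, one transform per split
pipe = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
pipe.fit(X_train)
X_train_pipe = pipe.transform(X_train)
X_test_pipe = pipe.transform(X_test)

# Both approaches should give the same result
print(np.allclose(X_test_manual, X_test_pipe))
```

# Because the imputer's median and the scaler's mean and standard deviation are learned from X_train alone, nothing about X_test leaks into the preprocessing.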

In [1]:
# Import Libraries
import pandas as pd
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import set_config
# Display pipelines and other estimators as interactive diagrams in the notebook
set_config(display='diagram')

In [None]:
# Load the data
path = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vQG5QTgHn7O1FaenQgpiHadFAza6cfG-cXznWh9a_Z-QWsbsrv3iJ5MpDdSSKTK7ZpTpRosOkK_LR_E/pub?output=csv'
df = pd.read_csv(path, index_col='CountryYear')
df.head()