## 1. Building a Pipeline for a Classification Task with Numerical-Only Features

### Step 1: Import Necessary Libraries

In [1]:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split


### Step 2: Load and Split the Dataset

**Load Dataset:** We load the Iris dataset.

**Split Dataset:** We split the data into training and test sets using an 80-20 split.

In [2]:
# Load sample dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Step 3: Define Feature Transformations
**Define Features:** All four Iris features are numerical, so we list every column index as a numerical feature.

**Create Transformers:**

**PCA:** Initialize PCA to reduce the data to 2 principal components.

**SelectKBest:** Initialize SelectKBest with the ANOVA F-value (f_classif) as the score function to keep the 3 highest-scoring features.

In [3]:
# Define numerical features
numerical_features = [0, 1, 2, 3]

# Create feature transformers
pca = PCA(n_components=2)                      # Reduce data to 2 principal components
selection = SelectKBest(score_func=f_classif, k=3)  # Select the top 3 features based on ANOVA F-value


### Step 4: Combine Feature Transformations with FeatureUnion
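**FeatureUnion:** Combine PCA and SelectKBest into a single transformer using FeatureUnion. Both transformers receive the same input, and their outputs are concatenated column-wise, giving 2 + 3 = 5 features.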

In [4]:
# Create a FeatureUnion
combined_features = FeatureUnion([
    ('pca', pca),                              # Apply PCA
    ('select', selection)                      # Apply SelectKBest
])

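As a quick check of how FeatureUnion stacks its outputs, the sketch below fits the same two transformers on the full Iris data and inspects the resulting shape. The variable names `union` and `union_shape` are illustrative and not part of the notebook.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import FeatureUnion

X, y = load_iris(return_X_y=True)

# Same two transformers as in the cell above (names here are illustrative)
union = FeatureUnion([
    ("pca", PCA(n_components=2)),                        # contributes 2 columns
    ("select", SelectKBest(score_func=f_classif, k=3)),  # contributes 3 columns
])

# y is needed because SelectKBest scores each feature against the target
X_combined = union.fit_transform(X, y)
union_shape = X_combined.shape
print(union_shape)  # (150, 5): 2 PCA components followed by 3 selected features
```

The first two columns are the principal components and the last three are the original features that SelectKBest kept.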

### Step 5: Apply Transformations to Columns with ColumnTransformer
**ColumnTransformer:** Apply the combined transformations to the numerical features using ColumnTransformer. Because every column index is listed, no column is dropped here.

In [5]:
# Create a ColumnTransformer
preprocessor = ColumnTransformer([
    ('num', combined_features, numerical_features)   # Apply combined features to numerical data
])

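ColumnTransformer only outputs the columns it is told about; anything not listed is dropped unless `remainder="passthrough"` is set. The sketch below illustrates this on Iris with a plain StandardScaler standing in for the combined transformer; the names `drop_rest` and `keep_rest` are hypothetical.

```python
from sklearn.compose import ColumnTransformer
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Only columns 0 and 1 are listed, so columns 2 and 3 are dropped (default remainder="drop")
drop_rest = ColumnTransformer([("num", StandardScaler(), [0, 1])])

# remainder="passthrough" appends the unlisted columns unchanged
keep_rest = ColumnTransformer(
    [("num", StandardScaler(), [0, 1])],
    remainder="passthrough",
)

dropped_shape = drop_rest.fit_transform(X).shape
kept_shape = keep_rest.fit_transform(X).shape
print(dropped_shape, kept_shape)  # (150, 2) (150, 4)
```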

### Step 6: Create and Fit the Pipeline
**Pipeline:** Create a pipeline that first preprocesses the data and then applies Logistic Regression.

**Fit Pipeline:** Fit the pipeline on the training data.

In [6]:
# Create a pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),            # Preprocess data
    ('classifier', LogisticRegression())       # Classify using Logistic Regression
])

# Fit the pipeline on the training data
pipeline.fit(X_train, y_train)


### Step 7: Evaluate the Pipeline
**Evaluate Pipeline:** Evaluate the pipeline on the test data by calculating the accuracy.

In [7]:
# Evaluate the pipeline on the test data
accuracy = pipeline.score(X_test, y_test)
print("Test Data Accuracy:", accuracy)


Test Data Accuracy: 1.0

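A single 80-20 split of Iris leaves only 30 test samples, so one accuracy figure can vary noticeably with the split. As a rough complement, the sketch below scores an equivalent pipeline with 5-fold cross-validation. It is an assumption-laden variant, not the notebook's own cell: the ColumnTransformer is omitted because every column is used, and `max_iter=1000` is set here only to avoid convergence warnings.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import FeatureUnion, Pipeline

X, y = load_iris(return_X_y=True)

# Equivalent feature step: PCA and SelectKBest side by side on all four columns
pipe = Pipeline([
    ("features", FeatureUnion([
        ("pca", PCA(n_components=2)),
        ("select", SelectKBest(score_func=f_classif, k=3)),
    ])),
    ("classifier", LogisticRegression(max_iter=1000)),  # max_iter is an assumption
])

# Each fold refits the whole pipeline, so the feature step never sees the held-out fold
scores = cross_val_score(pipe, X, y, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```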

## 2. Building a Pipeline for a Classification Task with Numerical and Categorical Data

### Step 1: Import Necessary Libraries

In [8]:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split


### Step 2: Load and Inspect the Dataset

**Load Dataset:** We load the Titanic dataset from OpenML.

**Inspect Dataset:** Display the first few rows of the dataset.

In [20]:
# Load sample dataset (using Titanic dataset for demonstration)
data = fetch_openml(name='titanic', version=1, as_frame=True)
X = data.data
y = data.target

# Display the first few rows of the dataset
X.head()




| | pclass | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | Allen, Miss. Elisabeth Walton | female | 29.0 | 0.0 | 0.0 | 24160 | 211.3375 | B5 | S | 2.0 | NaN | St Louis, MO |
| 1 | 1.0 | Allison, Master. Hudson Trevor | male | 0.9167 | 1.0 | 2.0 | 113781 | 151.55 | C22 C26 | S | 11.0 | NaN | Montreal, PQ / Chesterville, ON |
| 2 | 1.0 | Allison, Miss. Helen Loraine | female | 2.0 | 1.0 | 2.0 | 113781 | 151.55 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
| 3 | 1.0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0 | 1.0 | 2.0 | 113781 | 151.55 | C22 C26 | S | NaN | 135.0 | Montreal, PQ / Chesterville, ON |
| 4 | 1.0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0 | 1.0 | 2.0 | 113781 | 151.55 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |


### Step 3: Drop Rows with Missing Values

**Drop Missing Values:** Drop rows with missing values for the specified columns.

**Update Target Variable:** Filter y to include only the rows that are still present in X.

**Split Dataset:** We split the data into training and test sets using an 80-20 split.

In [21]:
# Drop rows with missing values for the specified columns
X = X.dropna(subset=['age', 'fare', 'embarked', 'sex', 'pclass'])

# Update the target variable to match the cleaned feature set
y = y.loc[X.index]

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

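The realignment of y relies on pandas index labels: `dropna` keeps the original labels of the surviving rows, and `y.loc[X.index]` picks the matching targets. A minimal sketch on a made-up four-row frame (the values and the names `X_toy` and `y_toy` are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy data: rows 1 and 2 each have a missing value in a checked column
X_toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 41.0],
    "fare": [7.25, 8.05, np.nan, 30.0],
    "cabin": [None, "C85", None, None],  # not in subset, so its gaps are ignored
})
y_toy = pd.Series(["0", "1", "1", "0"], name="survived")

# Drop rows with gaps in the checked columns only
X_clean = X_toy.dropna(subset=["age", "fare"])

# Select the targets whose index labels survived
y_clean = y_toy.loc[X_clean.index]

kept_rows = list(X_clean.index)
print(kept_rows)         # [0, 3]
print(list(y_clean))     # ['0', '0']
```

Missing values in columns outside `subset` (such as `cabin` here) do not cause a row to be dropped.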


### Step 4: Define Feature Transformations
**Define Features:** We specify which features are numerical and which are categorical.

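**Create Transformers for Numerical Data:** We create a pipeline for numerical features that scales the data and then applies a FeatureUnion of PCA and SelectKBest. With only two numerical inputs, these initial settings do not reduce anything: PCA with 2 components only rotates the scaled data, and k=2 keeps both columns, so the numerical branch outputs 4 columns.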

**Create Transformer for Categorical Data:** We use OneHotEncoder to handle categorical features.

In [22]:
# Define numerical and categorical features
numerical_features = ['age', 'fare']
categorical_features = ['sex', 'embarked', 'pclass']

# Create feature transformers for numerical data
num_transformer = Pipeline(steps=[
    ('scaler', StandardScaler()),  # Scaling step
    ('features', FeatureUnion([    # Combine PCA and SelectKBest
        ('pca', PCA(n_components=2)), 
        ('select', SelectKBest(score_func=f_classif, k=2))
    ]))
])

# Create feature transformer for categorical data
cat_transformer = OneHotEncoder(handle_unknown='ignore')

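The `handle_unknown='ignore'` setting matters when the test set contains a category that was never seen during fitting: instead of raising an error, the encoder emits an all-zero row for that value. A small sketch with made-up port codes (the value "X" is invented to play the unseen category):

```python
from sklearn.preprocessing import OneHotEncoder

# Fit on three known categories
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit([["S"], ["C"], ["Q"]])

# Transform one known value and one value never seen during fit
encoded = encoder.transform([["S"], ["X"]]).toarray()

print(encoder.categories_[0])  # categories are stored in sorted order: C, Q, S
print(encoded)                 # "S" maps to [0, 0, 1]; unseen "X" maps to [0, 0, 0]
```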


### Step 5: Apply Transformations to Columns with ColumnTransformer
**ColumnTransformer:** Combine the transformations for numerical and categorical features using ColumnTransformer.

In [23]:
# Create a ColumnTransformer
preprocessor = ColumnTransformer([
    ('num', num_transformer, numerical_features),   # Apply numerical transformations
    ('cat', cat_transformer, categorical_features)  # Apply categorical transformations
])

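To see how wide the preprocessed matrix becomes, the sketch below runs the same preprocessor layout on a small invented frame with the same column names. The six rows and their labels are made up for illustration; on the real Titanic data the number of one-hot columns depends on the categories present in the training split.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented toy rows using the notebook's column names
toy = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0, 54.0, 2.0],
    "fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.07],
    "sex": ["male", "female", "female", "female", "male", "male"],
    "embarked": ["S", "C", "S", "S", "Q", "C"],
    "pclass": [3.0, 1.0, 3.0, 1.0, 1.0, 2.0],
})
toy_y = pd.Series([0, 1, 1, 1, 0, 0])

num_transformer = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("features", FeatureUnion([
        ("pca", PCA(n_components=2)),                        # 2 columns
        ("select", SelectKBest(score_func=f_classif, k=2)),  # 2 columns
    ])),
])

preprocessor = ColumnTransformer([
    ("num", num_transformer, ["age", "fare"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex", "embarked", "pclass"]),
])

# 4 numerical columns + 2 (sex) + 3 (embarked) + 3 (pclass) one-hot columns
out_shape = preprocessor.fit_transform(toy, toy_y).shape
print(out_shape)  # (6, 12)
```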

### Step 6: Create and Fit the Pipeline
**Pipeline:** Create a pipeline that first preprocesses the data and then applies Logistic Regression.

**Fit Pipeline:** Fit the pipeline on the training data.

In [25]:
# Create a pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),                 # Preprocess data
    ('classifier', LogisticRegression())  # Classify using Logistic Regression
])

# Fit the pipeline on the training data
pipeline.fit(X_train, y_train)


### Step 7: Evaluate the Pipeline
**Evaluate Pipeline:** Evaluate the pipeline on the test data by calculating the accuracy.

In [26]:
# Evaluate the pipeline on the test data
accuracy = pipeline.score(X_test, y_test)
print("Test Data Accuracy:", accuracy)


Test Data Accuracy: 0.7655502392344498


## 3. Incorporating GridSearchCV

### Step 1: Import Necessary Libraries

In [27]:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split, GridSearchCV


### Step 2: Load and Inspect the Dataset
**Load Dataset:** We load the Titanic dataset from OpenML.

**Inspect Dataset:** Display the first few rows of the dataset.

In [28]:
# Load sample dataset (using Titanic dataset for demonstration)
data = fetch_openml(name='titanic', version=1, as_frame=True)
X = data.data
y = data.target

# Display the first few rows of the dataset
print(X.head())


   pclass                                             name     sex      age  \
0     1.0                    Allen, Miss. Elisabeth Walton  female  29.0000   
1     1.0                   Allison, Master. Hudson Trevor    male   0.9167   
2     1.0                     Allison, Miss. Helen Loraine  female   2.0000   
3     1.0             Allison, Mr. Hudson Joshua Creighton    male  30.0000   
4     1.0  Allison, Mrs. Hudson J C (Bessie Waldo Daniels)  female  25.0000   

   sibsp  parch  ticket      fare    cabin embarked boat   body  \
0    0.0    0.0   24160  211.3375       B5        S    2    NaN   
1    1.0    2.0  113781  151.5500  C22 C26        S   11    NaN   
2    1.0    2.0  113781  151.5500  C22 C26        S  NaN    NaN   
3    1.0    2.0  113781  151.5500  C22 C26        S  NaN  135.0   
4    1.0    2.0  113781  151.5500  C22 C26        S  NaN    NaN   

                         home.dest  
0                     St Louis, MO  
1  Montreal, PQ / Chesterville, ON  
2  Montreal


### Step 3: Drop Rows with Missing Values
**Drop Missing Values:** Drop rows with missing values for the specified columns.

**Update Target Variable:** Filter y to include only the rows that are still present in X.

In [29]:
# Drop rows with missing values for the specified columns
X = X.dropna(subset=['age', 'fare', 'embarked', 'sex', 'pclass'])

# Update the target variable to match the cleaned feature set
y = y.loc[X.index]


### Step 4: Split the Dataset
**Split Dataset:** Split the data into training and test sets using an 80-20 split.

In [30]:
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


### Step 5: Define Feature Transformations
**Define Features:** Specify which features are numerical and which are categorical.

**Create Transformers for Numerical Data:** Create a pipeline for numerical features that includes scaling, PCA, and SelectKBest combined using FeatureUnion.

**Create Transformer for Categorical Data:** Use OneHotEncoder to handle categorical features.

In [31]:
# Define numerical and categorical features
numerical_features = ['age', 'fare']
categorical_features = ['sex', 'embarked', 'pclass']

# Create feature transformers for numerical data
num_transformer = Pipeline(steps=[
    ('scaler', StandardScaler()),  # Scaling step
    ('features', FeatureUnion([    # Combine PCA and SelectKBest
        ('pca', PCA(n_components=2)), 
        ('select', SelectKBest(score_func=f_classif, k=2))
    ]))
])

# Create feature transformer for categorical data
cat_transformer = OneHotEncoder(handle_unknown='ignore')


### Step 6: Apply Transformations to Columns with ColumnTransformer
**ColumnTransformer:** Combine the transformations for numerical and categorical features using ColumnTransformer.

In [33]:
# Create a ColumnTransformer
preprocessor = ColumnTransformer([
    ('num', num_transformer, numerical_features),   # Apply numerical transformations
    ('cat', cat_transformer, categorical_features)  # Apply categorical transformations
])


### Step 7: Create the Pipeline
**Pipeline:** Create a pipeline that first preprocesses the data and then applies Logistic Regression.

In [34]:
# Create a pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),                 # Preprocess data
    ('classifier', LogisticRegression(max_iter=1000))  # Classify using Logistic Regression
])

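Every named step in a nested pipeline exposes its parameters under a path of step names joined by double underscores. The sketch below rebuilds the same pipeline layout without any data and looks up one such path; `pca_key` is just an illustrative variable name.

```python
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_transformer = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("features", FeatureUnion([
        ("pca", PCA(n_components=2)),
        ("select", SelectKBest(score_func=f_classif, k=2)),
    ])),
])
preprocessor = ColumnTransformer([
    ("num", num_transformer, ["age", "fare"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex", "embarked", "pclass"]),
])
pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# Path: pipeline step -> ColumnTransformer entry -> inner pipeline step -> union entry -> parameter
pca_key = "preprocessor__num__features__pca__n_components"
params = pipeline.get_params()
print(pca_key in params)        # the nested parameter is addressable by this name
print("classifier__C" in params)

# set_params accepts the same names, which is what a grid search relies on
pipeline.set_params(**{pca_key: 1})
updated_value = pipeline.get_params()[pca_key]
print(updated_value)
```

Listing `sorted(pipeline.get_params())` is a convenient way to find the exact key for any parameter before writing a grid.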

### Step 8: Define Hyperparameter Grid and Perform GridSearchCV
**Parameter Grid:** Define the hyperparameter grid for tuning. Each key is the path of step names leading to a parameter, joined by double underscores. The grid covers parameters for PCA, SelectKBest, and LogisticRegression.

**GridSearchCV:** Initialize GridSearchCV with the pipeline and parameter grid, and perform cross-validation.

**Fit GridSearchCV:** Fit GridSearchCV to the training data.

**Best Parameters and Score:** Print the best parameters and the corresponding mean cross-validated accuracy.

In [35]:
# Define the parameter grid for GridSearchCV
param_grid = {
    'preprocessor__num__features__pca__n_components': [1, 2],           # Tune PCA components
    'preprocessor__num__features__select__k': [1, 2],                   # Tune SelectKBest
    'classifier__C': [0.01, 0.1, 1, 10],                                # Tune LogisticRegression C
    'classifier__penalty': ['l1', 'l2'],                                # Tune LogisticRegression penalty
    'classifier__solver': ['liblinear']                                 # Specify solver to handle l1 penalty
}

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=pipeline, param_grid=param_grid, cv=5)

# Fit GridSearchCV
grid_search.fit(X_train, y_train)

# Best parameters and score
print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)


Best Parameters: {'classifier__C': 1, 'classifier__penalty': 'l2', 'classifier__solver': 'liblinear', 'preprocessor__num__features__pca__n_components': 1, 'preprocessor__num__features__select__k': 1}
Best Score: 0.7890123367722387

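The cost of the search grows with the product of the list lengths in the grid. As a rough sketch, `ParameterGrid` can be used to count the candidate combinations for the grid above before running anything; the names `n_candidates` and `n_cv_fits` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above
param_grid = {
    "preprocessor__num__features__pca__n_components": [1, 2],
    "preprocessor__num__features__select__k": [1, 2],
    "classifier__C": [0.01, 0.1, 1, 10],
    "classifier__penalty": ["l1", "l2"],
    "classifier__solver": ["liblinear"],
}

# 2 * 2 * 4 * 2 * 1 combinations
n_candidates = len(ParameterGrid(param_grid))

# Each candidate is fitted once per fold with cv=5
n_cv_fits = n_candidates * 5

print(n_candidates)  # 32
print(n_cv_fits)     # 160
```

With the default `refit=True`, GridSearchCV then fits the best candidate once more on the whole training set, and that refitted pipeline is available as `grid_search.best_estimator_`.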