In [30]:
# Q1. You are working on a machine learning project where you have a dataset containing numerical and
# categorical features. You have identified that some of the features are highly correlated and there are
# missing values in some of the columns. You want to build a pipeline that automates the feature
# engineering process and handles the missing values.
#
# Design a pipeline that includes the following steps:
# - Use an automated feature selection method to identify the important features in the dataset.
# - Create a numerical pipeline that includes the following steps:
#   - Impute the missing values in the numerical columns using the mean of the column values.
#   - Scale the numerical columns using standardisation.
# - Create a categorical pipeline that includes the following steps:
#   - Impute the missing values in the categorical columns using the most frequent value of the column.
#   - One-hot encode the categorical columns.
# - Combine the numerical and categorical pipelines using a ColumnTransformer.
# - Use a Random Forest Classifier to build the final model.
# - Evaluate the accuracy of the model on the test dataset.
#
# Note: Your solution should include code snippets for each step of the pipeline, and a brief explanation of
# each step. You should also provide an interpretation of the results and suggest possible improvements for
# the pipeline.

import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load a dataset from seaborn (Iris dataset)
iris = sns.load_dataset('iris')

# Separate target variable and predictors
X = iris.drop('species', axis=1)
y = iris['species']

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the numerical and categorical pipelines
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

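# Combine the pipelines using ColumnTransformer, routing columns by dtype.
# Iris has no categorical predictors, so the 'cat' branch selects no columns here;
# it is kept so the same pipeline works on datasets that do have them.
preprocessor = ColumnTransformer(transformers=[
    ('num', num_pipeline, X.select_dtypes(include='number').columns),
    ('cat', cat_pipeline, X.select_dtypes(include=['object', 'category']).columns)
])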

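# Feature selection using SelectKBest: keep the 3 features with the highest
# ANOVA F-scores against the target (Iris has 4 numerical features in total)
feature_selection = SelectKBest(score_func=f_classif, k=3)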

# Build the final pipeline with a Random Forest Classifier
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('feature_selection', feature_selection),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Fit the pipeline on the training data
pipeline.fit(X_train, y_train)

# Make predictions on the test data
y_pred = pipeline.predict(X_test)

# Evaluate the accuracy of the model on the test data
accuracy = accuracy_score(y_test, y_pred)
print(f'The accuracy of the Random Forest Classifier model is: {accuracy:.2f}')



The accuracy of the Random Forest Classifier model is: 1.00
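
The 1.00 above comes from a single 80/20 split, which leaves only 30 test rows, so it is a noisy estimate rather than evidence of a perfect model. One possible improvement is to score the same kind of pipeline with stratified k-fold cross-validation. The sketch below is an illustration under assumptions, not part of the original solution: it loads Iris through `sklearn.datasets.load_iris` instead of seaborn so that it runs offline, it omits the categorical branch because Iris has no categorical predictors, and the names `cv_pipeline` and `cv_scores` are introduced here.

```python
from sklearn.compose import ColumnTransformer
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: use scikit-learn's bundled copy of Iris so no download is needed
iris_frame = load_iris(as_frame=True)
X_cv = iris_frame.data
y_cv = iris_frame.target

# Same numerical steps as in the cell above: mean imputation, then standardisation
numeric_steps = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])
preprocessor_cv = ColumnTransformer(transformers=[
    ('num', numeric_steps, list(X_cv.columns))
])

cv_pipeline = Pipeline([
    ('preprocessor', preprocessor_cv),
    ('feature_selection', SelectKBest(score_func=f_classif, k=3)),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Stratified folds keep the three species balanced in every fold.
# Because preprocessing and feature selection live inside the pipeline,
# they are refit on each training fold and never see the held-out fold.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(cv_pipeline, X_cv, y_cv, cv=folds, scoring='accuracy')

print(f'Fold accuracies: {cv_scores.round(2)}')
print(f'Mean accuracy: {cv_scores.mean():.2f} (std {cv_scores.std():.2f})')
```

The spread across folds gives a rough sense of how much the single-split figure could move with a different `random_state`.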


In [None]:
# Q2. Build a pipeline that includes a random forest classifier and a logistic regression classifier, and then
# use a voting classifier to combine their predictions. Train the pipeline on the iris dataset and evaluate its
# accuracy.
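
One way to approach Q2 is sketched below. It is a minimal illustration under assumptions rather than a definitive answer: Iris is loaded with `sklearn.datasets.load_iris` so the block runs offline, a `StandardScaler` step is added because logistic regression tends to converge more easily on scaled inputs, hard (majority) voting is chosen, and the names `voting_pipeline` and `voting_accuracy` are introduced here.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled copy of Iris (4 numerical features, 3 classes)
X_iris, y_iris = load_iris(return_X_y=True)

# Stratify so each species keeps the same proportion in train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=42, stratify=y_iris
)

# Combine the two base classifiers; 'hard' voting predicts the majority class label
voting = VotingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(random_state=42)),
        ('lr', LogisticRegression(max_iter=1000))
    ],
    voting='hard'
)

# Scale the features first, then hand them to the voting ensemble
voting_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('voting', voting)
])

voting_pipeline.fit(X_tr, y_tr)
voting_accuracy = accuracy_score(y_te, voting_pipeline.predict(X_te))
print(f'The accuracy of the voting classifier is: {voting_accuracy:.2f}')
```

With only two base estimators, hard voting has to break ties whenever they disagree; setting `voting='soft'` instead averages the predicted class probabilities, which is usually the more informative choice for a two-model ensemble.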