Q1. You are working on a machine learning project where you have a dataset containing numerical and
categorical features. You have identified that some of the features are highly correlated and there are
missing values in some of the columns. You want to build a pipeline that automates the feature
engineering process and handles the missing values.

Design a pipeline that includes the following steps:

Use an automated feature selection method to identify the important features in the dataset.

Create a numerical pipeline that includes the following steps:

Impute the missing values in the numerical columns using the mean of the column values.

Scale the numerical columns using standardisation.

Create a categorical pipeline that includes the following steps:

Impute the missing values in the categorical columns using the most frequent value of the column.

One-hot encode the categorical columns.

Combine the numerical and categorical pipelines using a ColumnTransformer.

Use a Random Forest Classifier to build the final model.

Evaluate the accuracy of the model on the test dataset.

Note: Your solution should include code snippets for each step of the pipeline, and a brief explanation of
each step. You should also provide an interpretation of the results and suggest possible improvements for
the pipeline.

Q2. Build a pipeline that includes a random forest classifier and a logistic regression classifier, and then
use a voting classifier to combine their predictions. Train the pipeline on the iris dataset and evaluate its
accuracy.

## Answers

In [None]:
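## Q1: Feature Engineering Pipeline:

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumption: X (a pandas DataFrame of features) and y (the target labels) are already loaded.

# Step 1: Automated Feature Selection
feature_selection = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=42))

# Explanation: SelectFromModel fits a model (RandomForestClassifier in this case) and keeps only the features whose importance score reaches a threshold (the mean importance by default for this estimator).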

# Step 2: Numerical Pipeline
numerical_features = X.select_dtypes(include=['int64', 'float64']).columns
numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Explanation: Impute missing values with the mean and standardize numerical features to ensure consistent scales for model training.

# Step 3: Categorical Pipeline
categorical_features = X.select_dtypes(include=['object']).columns
categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
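    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Explanation: Impute missing values with the most frequent value and one-hot encode categorical features so the model receives numeric input. handle_unknown='ignore' prevents an error if the test set contains a category that was not seen during training.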

# Step 4: Combine Numerical and Categorical Pipelines
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_pipeline, numerical_features),
        ('cat', categorical_pipeline, categorical_features)
    ])

# Explanation: Use ColumnTransformer to apply different preprocessing steps to numerical and categorical features.

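# Step 5: Final Model Pipeline
model_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('feature_selection', feature_selection),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Explanation: Chain preprocessing, feature selection, and the RandomForestClassifier into a final pipeline. Preprocessing must run first: the ColumnTransformer selects columns by name from the DataFrame, and the random forest inside SelectFromModel needs numeric, encoded input.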

# Step 6: Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Explanation: Split the dataset into training and test sets for model evaluation.

# Step 7: Fit and evaluate the model
model_pipeline.fit(X_train, y_train)
y_pred = model_pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

# Explanation: Fit the model on the training data and evaluate its accuracy on the test data.

# Interpretation of Results:
# Accuracy is the fraction of test samples the model classifies correctly. It is only meaningful relative to a baseline (for example, always predicting the most frequent class), and it can be misleading when the classes are imbalanced.

# Possible Improvements for Pipeline:
# 1. Tune the RandomForestClassifier hyperparameters with a grid search (for example, GridSearchCV).
# 2. Experiment with other feature selection methods, including ones that drop highly correlated features.
# 3. Evaluate the impact of different imputation strategies for missing values.
# 4. Explore additional preprocessing steps or try different machine learning models.

In [None]:
# Q2: Pipeline with Voting Classifier on Iris Dataset:

In [1]:
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Step 1: Load and split the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 2: Create individual classifiers
rf_classifier = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

lr_classifier = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(random_state=42))
])

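# Step 3: Combine classifiers using a Voting Classifier
# With voting='hard' the ensemble predicts the majority class label; with only two voters, a tie goes to the class that comes first in sorted label order.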
voting_classifier = VotingClassifier(estimators=[
    ('rf', rf_classifier),
    ('lr', lr_classifier)
], voting='hard')

# Step 4: Fit and evaluate the model
voting_classifier.fit(X_train, y_train)
y_pred = voting_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

Accuracy: 1.0000
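
A perfect score on a single 30-sample test split is a fairly noisy estimate, so it may be worth checking the ensemble with cross-validation. The sketch below is one possible way to do that, not part of the original answer. It compares hard voting (majority of predicted labels) with soft voting (average of predicted probabilities) using 5-fold cross-validation. The helper `build_voter` and the `max_iter=1000` setting are choices made for this example, and the imputer is omitted because the Iris data has no missing values.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)


def build_voter(voting):
    """Return a fresh two-model ensemble using the given voting mode (hypothetical helper)."""
    return VotingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
            # Scale inputs for logistic regression only; tree models do not need it.
            ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ],
        voting=voting,
    )


cv_scores = {}
for voting in ("hard", "soft"):
    # 5-fold cross-validated accuracy for each voting mode.
    scores = cross_val_score(build_voter(voting), X, y, cv=5, scoring="accuracy")
    cv_scores[voting] = scores.mean()
    print(f"{voting} voting: mean accuracy {scores.mean():.4f} (std {scores.std():.4f})")
```

Soft voting requires every base estimator to implement `predict_proba`, which both classifiers here do.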
