Q1. You are working on a machine learning project where you have a dataset containing numerical and categorical features. You have identified that some of the features are highly correlated and there are missing values in some of the columns. You want to build a pipeline that automates the feature engineering process and handles the missing values.

Design a pipeline that includes the following steps:

- Use an automated feature selection method to identify the important features in the dataset.
- Create a numerical pipeline that includes the following steps:
  - Impute the missing values in the numerical columns using the mean of the column values.
  - Scale the numerical columns using standardization.
- Create a categorical pipeline that includes the following steps:
  - Impute the missing values in the categorical columns using the most frequent value of the column.
  - One-hot encode the categorical columns.
- Combine the numerical and categorical pipelines using a Column Transformer.
- Use a Random Forest Classifier to build the final model.
- Evaluate the accuracy of the model on the test dataset.

Note: Your solution should include code snippets for each step of the pipeline, and a brief explanation of each step. You should also provide an interpretation of the results and suggest possible improvements for the pipeline.

In [1]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

# Step 1: Load the Iris dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

# Step 2: Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3: Preprocessing pipeline for numerical features
#     Impute missing values with the mean (a no-op on Iris, which has no
#     missing values; kept to match the requested pipeline)
#     Scale the features to have zero mean and unit variance
numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Step 4: Final Model (Random Forest Classifier)
# - Utilizing the RandomForestClassifier with 100 trees
classifier = RandomForestClassifier(n_estimators=100, random_state=42)

# Step 5: Create the final pipeline by combining preprocessing and classifier
final_pipeline = Pipeline([
    ('preprocessor', numerical_pipeline),
    ('classifier', classifier)
])

# Step 6: Fit the model on the training data
final_pipeline.fit(X_train, y_train)

# Step 7: Evaluate the model on the test data
y_pred = final_pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Step 8: Report the test accuracy
print("Accuracy:", accuracy)

Accuracy: 1.0


Explanation of Steps:

- Load the Iris dataset: We load the Iris dataset, which contains features (sepal length, sepal width, petal length, and petal width) and target labels (species of iris).

- Split the data: We split the dataset into training and test sets. This allows us to train the model on one portion of the data and evaluate its performance on another portion.

- Preprocessing Pipeline for Numerical Features: We create a preprocessing pipeline for the numerical features. It consists of two transformers:

  - Imputer: Fills missing values with the mean of the respective feature, computed on the training data.

  - Scaler: Standardizes each feature to zero mean and unit variance. This matters for distance-based and gradient-based models; tree-based models such as Random Forest are largely insensitive to feature scaling, so here the step is included mainly because the question asks for it.

- Final Model (Random Forest Classifier): We choose a Random Forest Classifier as the final model. Random Forest is an ensemble of decision trees that works well for classification tasks. Note that scikit-learn's implementation expects numeric input, so categorical features must be encoded (for example, one-hot encoded) first.

- Create the final pipeline: We create the final pipeline by combining the preprocessing pipeline and the classifier.

- Fit the model: We fit the model to the training data, allowing it to learn patterns in the data.

- Evaluate the model: We use the trained model to make predictions on the test data and calculate the accuracy of the model's predictions.
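
The Iris example above exercises only the numerical pipeline. The feature selection, categorical pipeline, and Column Transformer steps from the question could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the toy DataFrame, its column names (`age`, `income`, `city`), the synthetic target, and the choice of `SelectFromModel` as the feature selection method are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: two numerical columns and one categorical column.
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.normal(50_000, 15_000, n),
    "city": pd.Series(rng.choice(["A", "B", "C"], n), dtype=object),
})
# Synthetic target, computed before any values are removed.
y_toy = (df["age"] + 10 * (df["city"] == "A") > 45).astype(int).to_numpy()

# Knock out 20 values in one numerical and one categorical column.
df.loc[rng.choice(n, 20, replace=False), "age"] = np.nan
df.loc[rng.choice(n, 20, replace=False), "city"] = np.nan

numeric_features = ["age", "income"]
categorical_features = ["city"]

# Numerical pipeline: mean imputation, then standardization.
numerical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# Categorical pipeline: most-frequent imputation, then one-hot encoding.
categorical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Route each group of columns to its own pipeline.
preprocessor = ColumnTransformer([
    ("num", numerical_pipeline, numeric_features),
    ("cat", categorical_pipeline, categorical_features),
])

# Feature selection keeps the features whose importance, as estimated by a
# random forest, is at least the mean importance (the default threshold).
full_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("selector", SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=42))),
    ("classifier", RandomForestClassifier(n_estimators=100, random_state=42)),
])

X_tr, X_te, y_tr, y_te = train_test_split(df, y_toy, test_size=0.2, random_state=42)
full_pipeline.fit(X_tr, y_tr)
toy_accuracy = accuracy_score(y_te, full_pipeline.predict(X_te))
print("Toy accuracy:", toy_accuracy)
```

Placing the selector inside the pipeline means it is fitted on the training split only, so the test data does not influence which features are kept.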

Interpretation of Results:

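- The "Accuracy" score is the fraction of test samples whose labels were predicted correctly; higher is better. Here the Random Forest Classifier scored 1.0, meaning all 30 test samples (20% of the 150 Iris samples) were classified correctly. Iris is a small, easy benchmark, so a perfect score on a single split should not be read as evidence that the model will generalize equally well to harder data.

- Because Iris has only numerical features and no missing values, the code above exercises only the numerical pipeline; the feature selection, categorical pipeline, and Column Transformer steps from the question are not applied.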
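
To make the definition of accuracy concrete, here is a small sketch that computes it by hand and compares it with `accuracy_score`. The label arrays are made up for illustration and are not taken from the Iris run; the confusion matrix is added as an assumption about what a reader might want to inspect next.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels: 5 samples, 1 of them misclassified.
y_true_demo = np.array([0, 1, 2, 2, 1])
y_pred_demo = np.array([0, 1, 2, 1, 1])

# Accuracy is the fraction of positions where prediction and truth agree.
manual_accuracy = np.mean(y_true_demo == y_pred_demo)
print(manual_accuracy)                            # 0.8
print(accuracy_score(y_true_demo, y_pred_demo))   # 0.8

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)
```

The confusion matrix shows where the errors fall (here, one class-2 sample predicted as class 1), which a single accuracy number hides.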

Possible Improvements for the Pipeline:

- Hyperparameter Tuning: Optimize the hyperparameters of the Random Forest Classifier to potentially improve performance.

- Feature Engineering: Explore feature engineering techniques to create new informative features.

- Cross-Validation: Use cross-validation to get a more reliable estimate of the model's performance than a single train/test split, and to help detect overfitting.

- Ensemble Methods: Experiment with ensemble methods like stacking or boosting to potentially improve accuracy further.

- Other Classifiers: Try other classifiers such as Support Vector Machines (SVM) or Gradient Boosting to see if they perform better for this dataset.

- Feature Selection: Although not necessary for the Iris dataset, you can explore feature selection techniques if dealing with high-dimensional datasets.
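
Two of the improvements listed above, cross-validation and hyperparameter tuning, could look roughly like this on the same Iris split. This is a sketch under assumptions: the parameter grid values and the 5-fold setting are illustrative choices, not recommendations from the original answer.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
])

# 5-fold cross-validation on the training data only.
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)
print("CV accuracy: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))

# Small, illustrative grid; step parameters use the "<step>__<param>" syntax.
param_grid = {
    "classifier__n_estimators": [50, 100],
    "classifier__max_depth": [None, 3],
}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("Best params:", search.best_params_)
test_accuracy = search.score(X_test, y_test)
print("Test accuracy:", test_accuracy)
```

Because the imputer and scaler sit inside the pipeline, they are refitted on each training fold, which keeps the validation folds from leaking into preprocessing.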

Q2. Build a pipeline that includes a random forest classifier and a logistic regression classifier, and then use a voting classifier to combine their predictions. Train the pipeline on the iris dataset and evaluate its accuracy.

In [2]:
# Answer 2

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Load the Iris dataset
data = load_iris()
X, y = data.data, data.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create pipelines for Random Forest and Logistic Regression
rf_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42))
])

lr_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('lr', LogisticRegression(random_state=42))
])

# Create a Voting Classifier that combines the two pipelines
voting_classifier = VotingClassifier(estimators=[
    ('rf', rf_pipeline),
    ('lr', lr_pipeline)
], voting='soft')  # 'soft' for probability-based voting

# Fit the Voting Classifier on the training data
voting_classifier.fit(X_train, y_train)

# Make predictions and evaluate accuracy
y_pred = voting_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy on the test dataset:", accuracy)

Accuracy on the test dataset: 1.0
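
With `voting='soft'`, the Voting Classifier averages the class probabilities predicted by each fitted estimator and picks the class with the highest average. The sketch below reproduces that by hand and checks it against `predict`. It is a simplified variant of the cell above: the scaling step is left out and `max_iter=1000` is set on the logistic regression, both as assumptions to keep the example short.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

voter = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("lr", LogisticRegression(max_iter=1000, random_state=42)),
    ],
    voting="soft",
)
voter.fit(X_train, y_train)

# Collect each fitted estimator's class probabilities: shape (2, n_samples, 3).
probas = np.array([est.predict_proba(X_test) for est in voter.estimators_])

# Unweighted average over estimators, then pick the most probable class.
mean_proba = probas.mean(axis=0)
manual_pred = voter.classes_[np.argmax(mean_proba, axis=1)]

print("Matches VotingClassifier.predict:",
      np.array_equal(manual_pred, voter.predict(X_test)))
```

Hard voting (`voting='hard'`) would instead take a majority vote over the predicted labels, which ignores how confident each estimator is.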
