Q1. You are working on a machine learning project where you have a dataset containing numerical and categorical features. You have identified that some of the features are highly correlated and there are missing values in some of the columns. You want to build a pipeline that automates the feature engineering process and handles the missing values.

Design a pipeline that includes the following steps:

 - Use an automated feature selection method to identify the important features in the dataset

 - Create a numerical pipeline that includes the following steps:

    - Impute the missing values in the numerical columns using the mean of the column values

    - Scale the numerical columns using standardisation

 - Create a categorical pipeline that includes the following steps:

    - Impute the missing values in the categorical columns using the most frequent value of the column

    - One-hot encode the categorical columns

 - Combine the numerical and categorical pipelines using a ColumnTransformer

 - Use a Random Forest Classifier to build the final model

 - Evaluate the accuracy of the model on the test dataset

Note: Your solution should include code snippets for each step of the pipeline, and a brief explanation of each step. You should also provide an interpretation of the results and suggest possible improvements for the pipeline.

In [8]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score

# Generating numerical features uniformly in the range [0, 4)
numerical_features = np.random.uniform(0, 4, size=(1000, 5))

# Generating categorical features as integer codes in {-5, ..., 0} (six categories per column)
categorical_features = np.random.randint(-5, 1, size=(1000, 5))

# Concatenating numerical and categorical features
# Note: this synthetic data contains no missing values, so the imputers below have nothing to fill
X = np.concatenate((numerical_features, categorical_features), axis=1)

# Generating a random binary target variable (independent of the features)
y = np.random.randint(0, 2, size=1000)

# Column positions in X after concatenation
numerical_indices = [0, 1, 2, 3, 4]  # indices for numerical features
categorical_indices = [5, 6, 7, 8, 9]  # indices for categorical features

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Numerical pipeline for numerical features
numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Categorical pipeline for categorical features
categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Combining numerical and categorical pipelines using ColumnTransformer
preprocessor = ColumnTransformer([
    ('num', numerical_pipeline, numerical_indices),
    ('cat', categorical_pipeline, categorical_indices)
])

# Feature selection using SelectFromModel with RandomForestClassifier
# (with the default threshold, columns at or above the mean importance are kept)
feature_selection = SelectFromModel(RandomForestClassifier(random_state=42))

# Creating the final pipeline: preprocessing, then feature selection, then RandomForestClassifier
pipeline = Pipeline([
    ('preprocessing', preprocessor),
    ('feature_selection', feature_selection),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Fitting the pipeline on the training data
pipeline.fit(X_train, y_train)

# Predicting on the test data
y_pred = pipeline.predict(X_test)

# Evaluating the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of the model: {accuracy}")


Accuracy of the model: 0.49



**Explanation of Steps:**

1. **Data Generation**: Simulated the numerical and categorical features along with the target variable.
2. **Splitting Data**: Divided the data into training and testing sets.
3. **Numerical Pipeline**: Created a pipeline to handle numerical features by imputing missing values with the mean and scaling using StandardScaler.
4. **Categorical Pipeline**: Developed a pipeline for categorical features by imputing missing values with the most frequent value and performing one-hot encoding.
5. **ColumnTransformer**: Combined both pipelines to process numerical and categorical features separately.
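6. **Feature Selection**: Utilized SelectFromModel to perform feature selection based on importance from a RandomForestClassifier. Because this step runs after preprocessing, it selects among the scaled numerical columns and the individual one-hot encoded columns, not the original features.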
7. **Final Pipeline**: Constructed the final pipeline by combining preprocessing, feature selection, and RandomForestClassifier.
8. **Model Training and Evaluation**: Fitted the pipeline on the training data, made predictions on the test set, and evaluated the model's accuracy.
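
The synthetic data in the cell above contains no missing values, so the imputation in steps 3 and 4 is never actually exercised. The following is a minimal sketch of how one might check it, assuming the same column layout (five numerical columns followed by five categorical ones). The seed, the 10% missing rate, and the names `X_demo` and `demo_preprocessor` are illustrative assumptions, not part of the original solution.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)  # arbitrary seed, chosen only for reproducibility

# Same layout as the original data: 5 numerical columns, then 5 categorical columns
X_demo = np.concatenate(
    (rng.uniform(0, 4, size=(200, 5)), rng.integers(-5, 1, size=(200, 5))),
    axis=1,
)

# Blank out roughly 10% of the entries (assumed rate) to simulate missing values
missing_mask = rng.random(X_demo.shape) < 0.10
X_demo[missing_mask] = np.nan

demo_preprocessor = ColumnTransformer(
    [
        ("num", Pipeline([
            ("imputer", SimpleImputer(strategy="mean")),
            ("scaler", StandardScaler()),
        ]), [0, 1, 2, 3, 4]),
        ("cat", Pipeline([
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("encoder", OneHotEncoder(handle_unknown="ignore")),
        ]), [5, 6, 7, 8, 9]),
    ],
    sparse_threshold=0,  # always return a dense array so it can be checked with np.isnan
)

X_clean = demo_preprocessor.fit_transform(X_demo)

n_missing_before = int(np.isnan(X_demo).sum())
n_missing_after = int(np.isnan(X_clean).sum())
print(f"Missing entries before preprocessing: {n_missing_before}")
print(f"Missing entries after preprocessing:  {n_missing_after}")
print(f"Shape after one-hot encoding: {X_clean.shape}")
```

Setting `sparse_threshold=0` is only there to make the check easy; the original pipeline works with the default, since the random forest accepts sparse input.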

**Interpretation & Possible Improvements:**

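- The pipeline automates preprocessing, feature selection, and model building in a single object, so the transformations learned on the training set are applied unchanged to the test set.
- An accuracy of 0.49 is chance level for a binary target. This is expected here: `y` is generated at random, independently of the features, so there is no signal for the model to learn. The exact figure also varies from run to run because the NumPy random generator is not seeded.
- The synthetic data contains no missing values, so the imputers are never exercised. Inject `np.nan` values or use a real dataset to test them.
- On a real dataset, inspect the support mask and feature importances from SelectFromModel to see which columns were kept. Selection runs on the one-hot encoded columns, so individual categories, rather than whole categorical features, are kept or dropped.
- Perform cross-validation instead of relying on a single train/test split to assess the model's generalization.
- Tune the hyperparameters (for example the number of trees, the tree depth, and the selection threshold) and compare against other models for better performance.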
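
As a rough illustration of the cross-validation suggestion, the sketch below rebuilds the Q1 pipeline and scores it with 5-fold cross-validation. The helper `build_pipeline`, the seed, the fold count, and the reduced `n_estimators=50` (chosen only to keep the run short) are assumptions made for this example. The data is regenerated the same way as above, so the target is still random noise.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_pipeline():
    """Hypothetical helper: the Q1 pipeline with smaller forests to keep the sketch fast."""
    preprocessor = ColumnTransformer([
        ("num", Pipeline([
            ("imputer", SimpleImputer(strategy="mean")),
            ("scaler", StandardScaler()),
        ]), list(range(0, 5))),
        ("cat", Pipeline([
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("encoder", OneHotEncoder(handle_unknown="ignore")),
        ]), list(range(5, 10))),
    ])
    return Pipeline([
        ("preprocessing", preprocessor),
        ("feature_selection",
         SelectFromModel(RandomForestClassifier(n_estimators=50, random_state=42))),
        ("classifier", RandomForestClassifier(n_estimators=50, random_state=42)),
    ])


# Regenerate synthetic data with a fixed seed (assumed) so the run is repeatable
rng = np.random.default_rng(42)
X_cv = np.concatenate(
    (rng.uniform(0, 4, size=(1000, 5)), rng.integers(-5, 1, size=(1000, 5))),
    axis=1,
)
y_cv = rng.integers(0, 2, size=1000)

# The whole pipeline is refit inside every fold, so no test-fold information leaks
# into the imputers, the scaler, the encoder, or the feature selector
scores = cross_val_score(build_pipeline(), X_cv, y_cv, cv=5, scoring="accuracy")
print(f"Fold accuracies: {np.round(scores, 3)}")
print(f"Mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")

# Fit once on all the data to see how many preprocessed columns the selector keeps
fitted = build_pipeline().fit(X_cv, y_cv)
selected_mask = fitted.named_steps["feature_selection"].get_support()
print(f"Columns kept after selection: {selected_mask.sum()} of {selected_mask.size}")
```

Because the labels carry no information about the features, the mean accuracy should hover around 0.5 regardless of the fold; on a real dataset the spread across folds gives a sense of how stable the estimate is.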

Q2. Build a pipeline that includes a random forest classifier and a logistic regression classifier, and then use a voting classifier to combine their predictions. Train the pipeline on the iris dataset and evaluate its accuracy.

In [9]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Creating pipelines for individual classifiers
rf_pipeline = Pipeline([
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42))
])

lr_pipeline = Pipeline([
    ('lr', LogisticRegression(max_iter=1000, random_state=42))
])

# Creating a Voting Classifier combining both individual classifiers
voting_classifier = VotingClassifier(
    estimators=[('rf', rf_pipeline), ('lr', lr_pipeline)],
    voting='hard'  # 'hard' voting means majority voting
)

# Training the Voting Classifier
voting_classifier.fit(X_train, y_train)

# Predicting on the test data
y_pred = voting_classifier.predict(X_test)

# Evaluating the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of the Voting Classifier: {accuracy}")


Accuracy of the Voting Classifier: 1.0



**Explanation of Steps:**

1. **Loading Data**: Loaded the Iris dataset using `load_iris()` from scikit-learn.
2. **Splitting Data**: Split the dataset into training and testing sets.
3. **Individual Pipelines**: Created separate pipelines for Random Forest Classifier and Logistic Regression Classifier.
4. **Voting Classifier**: Created a Voting Classifier that combines the predictions of the individual classifiers.
5. **Training**: Fit the Voting Classifier on the training data.
6. **Prediction and Evaluation**: Made predictions on the test set and evaluated the accuracy of the Voting Classifier.

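This Voting Classifier combines the predictions of Random Forest and Logistic Regression. With `voting='hard'`, each classifier casts one vote for a class label and the majority wins; with only two classifiers, any disagreement is a tie, which scikit-learn resolves by choosing the class that comes first in sorted order. With `voting='soft'`, the predicted class probabilities are averaged and the class with the highest average is chosen, which is usually more informative when there are only two estimators. The test set here has only 30 samples, so an accuracy of 1.0 should not be over-interpreted; adjust the hyperparameters and voting method based on the dataset and desired ensemble strategy.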
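
For comparison, here is a minimal sketch of the same ensemble with soft voting, which averages the class probabilities of the two models instead of counting label votes. The split and hyperparameters mirror the cell above; the names `soft_voter`, `avg_proba`, and `soft_accuracy` are introduced only for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Same data and split as in the hard-voting cell
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Soft voting requires every estimator to implement predict_proba,
# which both RandomForestClassifier and LogisticRegression do
soft_voter = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("lr", LogisticRegression(max_iter=1000, random_state=42)),
    ],
    voting="soft",
)
soft_voter.fit(X_train, y_train)

# Averaged class probabilities: one row per test sample, one column per class
avg_proba = soft_voter.predict_proba(X_test)
soft_pred = soft_voter.predict(X_test)
soft_accuracy = accuracy_score(y_test, soft_pred)

print(f"Averaged probabilities for the first test sample: {np.round(avg_proba[0], 3)}")
print(f"Accuracy of the soft Voting Classifier: {soft_accuracy}")
```

If one model is known to be better calibrated than the other, the `weights` parameter of `VotingClassifier` can be used to give it more influence in the average.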