Assignment

Data Science Masters

Q1. You are working on a machine learning project where you have a dataset containing numerical and
categorical features. You have identified that some of the features are highly correlated and there are
missing values in some of the columns. You want to build a pipeline that automates the feature
engineering process and handles the missing values.

Design a pipeline that includes the following steps:
* Use an automated feature selection method to identify the important features in the dataset.
* Create a numerical pipeline that includes the following steps:
  * Impute the missing values in the numerical columns using the mean of the column values.
  * Scale the numerical columns using standardisation.
* Create a categorical pipeline that includes the following steps:
  * Impute the missing values in the categorical columns using the most frequent value of the column.
  * One-hot encode the categorical columns.
* Combine the numerical and categorical pipelines using a ColumnTransformer.
* Use a Random Forest Classifier to build the final model.
* Evaluate the accuracy of the model on the test dataset.

Note: Your solution should include code snippets for each step of the pipeline, and a brief explanation of
each step. You should also provide an interpretation of the results and suggest possible improvements for
the pipeline.

Q2. Build a pipeline that includes a random forest classifier and a logistic regression classifier, and then
use a voting classifier to combine their predictions. Train the pipeline on the iris dataset and evaluate its
accuracy.

### **Q1. Feature Engineering Pipeline with Random Forest Classifier**
Ans: \
In this pipeline, we'll follow these steps:

1. **Automated Feature Selection**: Use `SelectFromModel` with a Random Forest to keep only the features the model ranks as important.
2. **Numerical Pipeline**:
   - Impute missing values using the mean of the column.
   - Standardize (scale) numerical columns.
3. **Categorical Pipeline**:
   - Impute missing values using the most frequent value.
   - One-hot encode the categorical columns.
4. **Combine Pipelines**: Use `ColumnTransformer` to combine the numerical and categorical pipelines.
5. **Model**: Use a Random Forest Classifier.
6. **Evaluate**: Evaluate model performance using accuracy.
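
To make steps 2 to 4 concrete, here is a minimal sketch on a hypothetical four-row DataFrame (the `age` and `city` columns and their values are invented for illustration, not taken from the assignment dataset). It shows what the combined preprocessor produces: one scaled numerical column plus one binary column per category.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numerical and one categorical column, each with a gap
toy = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 30.0],
    "city": pd.Series(["Pune", "Delhi", np.nan, "Pune"], dtype=object),
})

toy_preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="mean")),           # fill gap with mean age
        ("scaler", StandardScaler()),                          # zero mean, unit variance
    ]), ["age"]),
    ("cat", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),  # fill gap with modal city
        ("onehot", OneHotEncoder(handle_unknown="ignore")),    # one column per city
    ]), ["city"]),
])

toy_out = toy_preprocessor.fit_transform(toy)
# Convert to a dense array in case the transformer returned a sparse matrix
toy_dense = toy_out.toarray() if hasattr(toy_out, "toarray") else np.asarray(toy_out)

print(toy_dense.shape)  # rows x (1 scaled column + number of distinct cities)
print(toy_dense)
```

The missing `age` is replaced by the column mean before scaling, so it lands at 0 after standardization, and the missing `city` takes the most frequent category.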

### **Explanation of Each Step**:

1. **Automated Feature Selection**:
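   - `SelectFromModel` wraps its own Random Forest, which is fitted on the preprocessed training data. Features whose importance falls below a threshold (the mean importance by default for tree-based models) are dropped before the final classifier is trained.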
   
2. **Numerical Pipeline**:
   - `SimpleImputer(strategy='mean')` fills in missing values with the column's mean.
   - `StandardScaler()` standardizes the numerical columns to have a mean of 0 and a standard deviation of 1.

3. **Categorical Pipeline**:
   - `SimpleImputer(strategy='most_frequent')` fills missing values with the most frequent value in the column.
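   - `OneHotEncoder(handle_unknown='ignore')` one-hot encodes categorical columns, creating one binary column per category. Categories that appear only at prediction time are encoded as all zeros instead of raising an error.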

4. **ColumnTransformer**:
   - Combines the two pipelines for numerical and categorical data preprocessing.

5. **Model**:
   - A Random Forest Classifier is used as the model. The `n_estimators=100` specifies 100 trees.

6. **Evaluation**:
   - The accuracy of the model is evaluated by comparing the predicted values (`y_pred`) to the true values (`y_test`).
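
The feature selection step can be inspected on its own. The sketch below is illustrative only: it uses synthetic data from `make_classification` (an assumption, not the assignment dataset) and sets `threshold="mean"` explicitly so the selection rule is visible.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data (assumed for illustration): 10 features, 3 of them informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0,
    random_state=0,
)

selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0),
    threshold="mean",  # keep features with importance >= mean importance
)
selector.fit(X_demo, y_demo)

mask = selector.get_support()           # boolean mask of kept features
X_reduced = selector.transform(X_demo)  # data restricted to the kept columns

print("Kept feature indices:", mask.nonzero()[0])
print("Shape before/after:", X_demo.shape, X_reduced.shape)
```

Inside the full pipeline the same selector sees the one-hot encoded columns, so it may keep some categories of a feature and drop others.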

### **Possible Improvements**:
- **Hyperparameter Tuning**: Use `GridSearchCV` or `RandomizedSearchCV` to tune hyperparameters for better model performance.
- **Advanced Feature Selection**: Try alternatives to importance-based selection, such as Recursive Feature Elimination (RFE), or drop one feature from each highly correlated pair before modelling.
- **Ensemble Methods**: Combine other classifiers like Gradient Boosting or SVM to improve performance.
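
As a rough sketch of the hyperparameter tuning suggestion, `GridSearchCV` can search over parameters of a pipeline step by prefixing them with the step name and a double underscore. The data here is synthetic and the grid values are arbitrary choices for illustration, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataset (an assumption for this sketch)
X_gs, y_gs = make_classification(n_samples=200, n_features=8, random_state=0)

gs_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=0)),
])

# "<step name>__<parameter>" addresses a parameter of a pipeline step
param_grid = {
    "classifier__n_estimators": [50, 100],
    "classifier__max_depth": [None, 5],
}

search = GridSearchCV(gs_pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X_gs, y_gs)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```

Because the whole pipeline is refitted in every fold, the preprocessing never sees the validation rows, which keeps the cross-validated score honest.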

# Q1
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score

# Load the dataset
# Assuming df is a DataFrame containing both numerical and categorical data
df = pd.read_csv('/content/dataset.csv')

# Split the data into features and target
X = df.drop('target', axis=1)
y = df['target']

# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define numerical and categorical columns
numerical_cols = X.select_dtypes(include='number').columns.tolist()  # all integer and float dtypes
categorical_cols = X.select_dtypes(include=['object', 'category']).columns.tolist()

# Numerical pipeline
numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),  # Impute missing values with the mean
    ('scaler', StandardScaler())  # Standardize numerical columns
])

# Categorical pipeline
categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Impute missing values with the most frequent value
    ('onehot', OneHotEncoder(handle_unknown='ignore'))  # One-hot encode categorical columns
])

# Column transformer to combine the pipelines
preprocessor = ColumnTransformer([
    ('num', numerical_pipeline, numerical_cols),
    ('cat', categorical_pipeline, categorical_cols)
])

# full pipeline with feature selection and random forest classifier
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('feature_selector', SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=42))),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))  # fixed seed for reproducibility
])

# Fit the model
pipeline.fit(X_train, y_train)

# Predict on the test set
y_pred = pipeline.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of the model: {accuracy * 100:.2f}%")



Accuracy of the model: 82.42%


### **Q2. Voting Classifier with Random Forest and Logistic Regression on Iris Dataset**

In this pipeline, we combine the predictions of two classifiers with a Voting Classifier:

1. **Random Forest Classifier**.
2. **Logistic Regression**.
3. **Voting Classifier**: Combine the predictions using a voting mechanism.

### **Explanation of Each Step**:

1. **Classifiers**:
   - We use `RandomForestClassifier` and `LogisticRegression` as base classifiers.
   
2. **Voting Classifier**:
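   - The `VotingClassifier` combines the predictions from both models. We use `voting='hard'`, which predicts the class that receives the most votes. With only two classifiers, a disagreement is a tie, and scikit-learn breaks ties by choosing the class that comes first in sorted label order.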

3. **Evaluation**:
   - The accuracy of the voting classifier is evaluated on the test set.
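
With only two voters, hard voting can end in a tie whenever the models disagree. The sketch below makes that behaviour visible using two `DummyClassifier` stand-ins that always predict a fixed class (these stand-ins and the toy arrays are assumptions for illustration, not part of the assignment solution).

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import VotingClassifier

# Toy data: the features are irrelevant because the dummies ignore them
X_tie = np.zeros((6, 1))
y_tie = np.array([0, 1, 2, 0, 1, 2])

# Two stand-in models that always disagree: one votes class 2, the other class 0
always_two = DummyClassifier(strategy="constant", constant=2)
always_zero = DummyClassifier(strategy="constant", constant=0)

tie_vote = VotingClassifier(
    estimators=[("two", always_two), ("zero", always_zero)],
    voting="hard",
)
tie_vote.fit(X_tie, y_tie)

# Each class gets exactly one vote, so the tie goes to the lowest label
tie_pred = tie_vote.predict(X_tie[:3])
print(tie_pred)
```

In practice this means that with two base models, hard voting falls back to whichever disagreeing prediction has the smaller class label, so an odd number of voters or soft voting is often preferable.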

### **Possible Improvements**:
- **Soft Voting**: Use `voting='soft'` to average the class probabilities predicted by each classifier and pick the class with the highest average. This makes ties unlikely and may improve performance when the classifiers produce well-calibrated probabilities.
- **Hyperparameter Tuning**: Tune hyperparameters for both classifiers to optimize their performance.
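
A minimal sketch of the soft voting variant, assuming the same iris split and classifier settings used in the solution code; the variable names are chosen here only to keep the example self-contained.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=42
)

soft_vote = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("lr", LogisticRegression(max_iter=200, random_state=42)),
    ],
    voting="soft",  # average predict_proba outputs instead of counting votes
)
soft_vote.fit(X_tr, y_tr)

proba = soft_vote.predict_proba(X_te)  # averaged class probabilities per sample
soft_acc = accuracy_score(y_te, soft_vote.predict(X_te))
print(f"Accuracy with soft voting: {soft_acc * 100:.2f}%")
```

Soft voting requires every base estimator to implement `predict_proba`, which both classifiers here do.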

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize classifiers
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
lr_classifier = LogisticRegression(max_iter=200, random_state=42)

# Create a voting classifier
voting_classifier = VotingClassifier(estimators=[('rf', rf_classifier), ('lr', lr_classifier)], voting='hard')

# Train the voting classifier
voting_classifier.fit(X_train, y_train)

# Predict on the test set
y_pred = voting_classifier.predict(X_test)

# Evaluate the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of the Voting Classifier: {accuracy * 100:.2f}%")


Accuracy of the Voting Classifier: 100.00%
