### Q1. You are working on a machine learning project where you have a dataset containing numerical and categorical features. You have identified that some of the features are highly correlated and there are missing values in some of the columns. You want to build a pipeline that automates the feature engineering process and handles the missing values:

Design a pipeline that includes the following steps:

* Use an automated feature selection method to identify the important features in the dataset.
* Create a numerical pipeline that includes the following steps:
    * Impute the missing values in the numerical columns using the mean of the column values.
    * Scale the numerical columns using standardisation.
* Create a categorical pipeline that includes the following steps:
    * Impute the missing values in the categorical columns using the most frequent value of the column.
    * One-hot encode the categorical columns.
* Combine the numerical and categorical pipelines using a `ColumnTransformer`.
* Use a Random Forest Classifier to build the final model.
* Evaluate the accuracy of the model on the test dataset.

**Note! Your solution should include code snippets for each step of the pipeline, and a brief explanation of each step. You should also provide an interpretation of the results and suggest possible improvements for the pipeline.**

In [2]:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectKBest
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

In [3]:
# Load your dataset
data = pd.read_csv('diabetes.csv')

In [4]:
data.head()

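| Unnamed: 0 | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |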


In [5]:
# Split the dataset into features and target
X = data.drop('Outcome', axis=1)
y = data['Outcome']
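
The preview above shows zeros in `SkinThickness` and `Insulin`. These may be placeholders for missing measurements rather than real values; that is an assumption, since the source does not say. `SimpleImputer` only fills `NaN` by default, so such zeros would pass through unchanged. A minimal sketch of converting them first, using a hypothetical toy frame and a hypothetical choice of columns:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mimics a few of the columns shown above
toy = pd.DataFrame({
    'Pregnancies': [6, 0, 1],
    'Glucose': [148, 85, 0],
    'Insulin': [0, 94, 168],
})

# Assumption: a zero in these columns means "not measured".
# A zero in Pregnancies is a valid value, so that column is left alone.
zero_as_missing = ['Glucose', 'Insulin']
toy[zero_as_missing] = toy[zero_as_missing].replace(0, np.nan)

# Count the missing values per column after the conversion
print(toy.isna().sum())
```

Whether a zero is a real value depends on the column, so the list of columns should be decided per dataset rather than applied blindly.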

In [6]:
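# Step 1: Feature Selection
# SelectKBest keeps the k highest-scoring features; its default scoring
# function is f_classif (ANOVA F-test).
# Note: k must not exceed the number of feature columns, and this selector
# is only defined here; it is not added to the pipeline built below.
num_features_to_select = 10  # Choose an appropriate number of features
feature_selector = SelectKBest(k=num_features_to_select)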
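
A selector only influences the model once it is a step in the fitted pipeline. The following is a minimal sketch of one way to wire it in, not the notebook's actual pipeline. The synthetic data from `make_classification`, the `*_demo` names, and `k=5` are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 8 numeric features, 3 of them informative
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=3, n_redundant=2, random_state=0
)

demo_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    # k must not exceed the number of columns that reach this step
    ('select', SelectKBest(score_func=f_classif, k=5)),
    ('classifier', RandomForestClassifier(random_state=0)),
])
demo_pipeline.fit(X_demo, y_demo)

# Boolean mask marking which of the 8 columns the selector kept
selected_mask = demo_pipeline.named_steps['select'].get_support()
print(selected_mask)
```

Placing the selector inside the pipeline means it is fitted on the training data only, which avoids leaking information from the test set into the feature scores.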

In [7]:
# Step 2: Numerical Pipeline
# Impute missing values in numerical columns with the mean
# Scale numerical columns using standardization
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

In [8]:
# Step 3: Categorical Pipeline
# Impute missing values in categorical columns with the most frequent value
# One-hot encode categorical columns
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

In [10]:
# Step 4: Combine Numerical and Categorical Pipelines using ColumnTransformer
# Specify which columns are numerical and categorical
numerical_cols = X.select_dtypes(include=['number']).columns
categorical_cols = X.select_dtypes(exclude=['number']).columns

# Combine pipelines using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_pipeline, numerical_cols),
        ('cat', cat_pipeline, categorical_cols)
    ])
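
To see what the combined preprocessor produces when both column types are present, here is a small sketch on a hypothetical toy frame. The column names `age` and `city` and their values are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame: one numerical and one categorical column, each with a gap
toy = pd.DataFrame({
    'age': [25.0, np.nan, 40.0, 31.0],
    'city': pd.Series(['Pune', 'Delhi', np.nan, 'Pune'], dtype=object),
})

toy_preprocessor = ColumnTransformer([
    ('num', Pipeline([('imputer', SimpleImputer(strategy='mean')),
                      ('scaler', StandardScaler())]), ['age']),
    ('cat', Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                      ('onehot', OneHotEncoder(handle_unknown='ignore'))]), ['city']),
])

toy_transformed = toy_preprocessor.fit_transform(toy)

# One scaled numerical column plus one one-hot column per city category
print(toy_transformed.shape)
```

The missing `city` is filled with the most frequent value before encoding, so no extra category is created for the gap.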

In [11]:
# Step 5: Final Model with Random Forest Classifier
# Create the final pipeline with the preprocessor and the classifier
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])

In [12]:
# Step 6: Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [13]:
# Step 7: Fit the model
pipeline.fit(X_train, y_train)

In [14]:
# Step 8: Evaluate the model
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

Accuracy: 0.74


**Interpretation of the Results:**
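- An accuracy of 0.74 means the model classified about 74% of the held-out test samples (20% of the data) correctly.
- The fitted pipeline consists of the preprocessing step and a Random Forest Classifier. The `SelectKBest` selector from Step 1 is defined but never added to the pipeline, so the reported accuracy does not reflect any feature selection.
- Numerical features are imputed with the mean and scaled using standardization.
- Categorical features are imputed with the most frequent value and one-hot encoded. Every column shown by `data.head()` is numeric, so the categorical branch likely receives no columns for this dataset.
- The two pipelines are combined using `ColumnTransformer`.
- `RandomForestClassifier()` is created without a `random_state`, so the accuracy can vary slightly between runs.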
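
To judge whether an accuracy such as 0.74 is good, it helps to compare it with a trivial baseline that always predicts the most frequent class. A hedged sketch using `DummyClassifier` follows. The synthetic data and its roughly 65/35 class split are assumptions, not properties of the diabetes file.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption) with an imbalanced binary target
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.65, 0.35], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The baseline ignores the features and always predicts the majority training class
baseline = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)
baseline_accuracy = accuracy_score(y_te, baseline.predict(X_te))
print(f'Baseline accuracy: {baseline_accuracy:.2f}')
```

A model is only useful to the extent that it beats this number, so reporting both side by side makes the result easier to interpret.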

**Possible Improvements:**
- Hyperparameter tuning: You can perform a grid search or randomized search to optimize hyperparameters of the Random Forest Classifier for better model performance.
- Feature engineering: Experiment with different feature engineering techniques to create new features that might improve model accuracy.
- Cross-validation: Implement cross-validation to get a more robust estimate of model performance and to detect overfitting to a single train/test split.
- Handling class imbalance: If your target classes are imbalanced, consider techniques like oversampling, undersampling, or using different evaluation metrics to address this issue.
- Feature importance analysis: After fitting the model, analyze feature importance to gain insights into which features are driving predictions.
- Model selection: Depending on your problem, you may want to try other classification algorithms besides Random Forest to see if they perform better.
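
Two of the suggestions above, hyperparameter tuning and feature importance analysis, can be sketched together. This is an illustration on synthetic data, not a tuned configuration. The grid values, the `*_demo` names, and the dataset are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

demo_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=0)),
])

# Pipeline parameters are addressed as <step name>__<parameter name>
param_grid = {
    'classifier__n_estimators': [50, 100],
    'classifier__max_depth': [None, 5],
}
search = GridSearchCV(demo_pipeline, param_grid, cv=3, scoring='accuracy')
search.fit(X_demo, y_demo)
print(search.best_params_, f'{search.best_score_:.2f}')

# Impurity-based importances of the refitted best model, one value per feature
importances = search.best_estimator_.named_steps['classifier'].feature_importances_
print(importances.round(3))
```

`GridSearchCV` scores every combination with cross-validation, so `best_score_` is a cross-validated estimate rather than a single-split accuracy.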

### Q2. Build a pipeline that includes a random forest classifier and a logistic regression classifier, and then use a voting classifier to combine their predictions. Train the pipeline on the iris dataset and evaluate its accuracy.

In [15]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

In [16]:
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

In [17]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [18]:
# Create individual classifiers
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
lr_classifier = LogisticRegression(max_iter=1000, random_state=42)

In [19]:
# Create a Voting Classifier
voting_classifier = VotingClassifier(
    estimators=[('rf', rf_classifier), ('lr', lr_classifier)],
    voting='soft'  # 'soft' averages the predicted class probabilities; 'hard' uses majority vote
)
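
To make soft voting concrete, the sketch below reproduces it by hand: it averages the `predict_proba` outputs of the fitted sub-models and picks the class with the highest mean probability. The names `ensemble`, `manual_proba`, and `manual_pred` are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=42
)

ensemble = VotingClassifier(
    estimators=[('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
                ('lr', LogisticRegression(max_iter=1000, random_state=42))],
    voting='soft',
).fit(X_tr, y_tr)

# Average the class probabilities of the fitted sub-models (equal weights)
manual_proba = np.mean(
    [est.predict_proba(X_te) for est in ensemble.estimators_], axis=0
)
# Pick the class with the highest averaged probability
manual_pred = ensemble.classes_[np.argmax(manual_proba, axis=1)]

# Compare the hand-computed predictions with the ensemble's own
print(np.array_equal(manual_pred, ensemble.predict(X_te)))
```

Passing a `weights` list to `VotingClassifier` would turn the plain mean into a weighted average; without it, every sub-model counts equally.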

In [20]:
# Create a pipeline with the Voting Classifier
pipeline = Pipeline([
    ('voting', voting_classifier)
])

In [21]:
# Fit the pipeline on the training data
pipeline.fit(X_train, y_train)

In [22]:
# Make predictions on the test data
y_pred = pipeline.predict(X_test)

In [23]:
# Evaluate the accuracy of the ensemble model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

Accuracy: 1.00
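
An accuracy of 1.00 here is measured on a test set of only 30 samples (20% of the 150 iris rows), so it may be optimistic. A hedged sketch of a more stable estimate using stratified 5-fold cross-validation follows. The fold count and the `ensemble` and `cv_scores` names are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X_iris, y_iris = load_iris(return_X_y=True)

ensemble = VotingClassifier(
    estimators=[('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
                ('lr', LogisticRegression(max_iter=1000, random_state=42))],
    voting='soft',
)

# Each fold keeps the class proportions of the full dataset
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(ensemble, X_iris, y_iris, cv=cv, scoring='accuracy')
print(f'CV accuracy: {cv_scores.mean():.2f} +/- {cv_scores.std():.2f}')
```

The spread across folds gives a sense of how much the single-split score could move with a different `random_state`.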
