### Q1. You are working on a machine learning project where you have a dataset containing numerical and categorical features. You have identified that some of the features are highly correlated and there are missing values in some of the columns. You want to build a pipeline that automates the feature engineering process and handles the missing values.

Design a pipeline that includes the following steps:
* Use an automated feature selection method to identify the important features in the dataset

Create a numerical pipeline that includes the following steps:
* Impute the missing values in the numerical columns using the mean of the column values
* Scale the numerical columns using standardization

Create a categorical pipeline that includes the following steps:
* Impute the missing values in the categorical columns using the most frequent value of the column
* One-hot encode the categorical columns

Then assemble and evaluate the full pipeline:
* Combine the numerical and categorical pipelines using a ColumnTransformer
* Use a Random Forest Classifier to build the final model
* Evaluate the accuracy of the model on the test dataset

Note: Your solution should include code snippets for each step of the pipeline, and a brief explanation of each step. You should also provide an interpretation of the results and suggest possible improvements for the pipeline.

In [None]:
"""Below is a Python code snippet that creates a machine learning pipeline to automate feature engineering, handle missing values, and build a Random 
Forest Classifier model. This pipeline uses scikit-learn and assumes you have a dataset with both numerical and categorical features."""

In [None]:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import accuracy_score

# No actual dataset is used in this code.
# This is only a demonstration of how to create the pipelines and connect them.

# Load your dataset (replace 'your_dataset.csv' with the actual filename)
data = pd.read_csv('your_dataset.csv')

# Separate target variable (e.g., 'target') from features
X = data.drop(columns=['target'])
y = data['target']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define feature selection model (Random Forest in this case)
feature_selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=42))

# Numerical pipeline
numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Categorical pipeline
categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Combine numerical and categorical pipelines using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_pipeline, X.select_dtypes(include=['number']).columns),
        ('cat', categorical_pipeline, X.select_dtypes(exclude=['number']).columns)
])

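# Create the final pipeline: preprocess first, then select features, then classify.
# SelectFromModel fits a Random Forest, which needs the imputed, fully numeric
# output of the preprocessor, so it cannot run on the raw DataFrame.
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('feature_selector', feature_selector),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])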

# Fit the pipeline on the training data
pipeline.fit(X_train, y_train)

# Predict on the test data
y_pred = pipeline.predict(X_test)

# Evaluate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')


#### Explanation of Steps:

1. We load the dataset and separate the target variable from the features.
2. We split the data into training and testing sets.
3. We define a feature selector (SelectFromModel) that fits a Random Forest Classifier and keeps the features whose importance exceeds a threshold (the mean importance by default).
4. We create separate pipelines for numerical and categorical data. The pipelines handle missing values by imputing with the mean for numerical columns and the most frequent value for categorical columns. Numerical columns are also standardized, and categorical columns are one-hot encoded.
5. We combine the numerical and categorical pipelines using a ColumnTransformer.
6. We create the final pipeline, which chains the preprocessor, feature selection, and a Random Forest Classifier.
7. We fit the pipeline on the training data and make predictions on the test data.
8. We evaluate the model's accuracy on the test data.

#### Possible Improvements:

1. Hyperparameter tuning for the Random Forest Classifier to optimize model performance.
2. Cross-validation to get a more robust estimate of model performance.
3. Explore other feature selection methods (for example, recursive feature elimination or mutual information) and compare them with the Random Forest importance threshold used here.
4. Further investigate the dataset to ensure appropriate handling of missing values and feature encoding.
5. Consider additional preprocessing steps or feature engineering techniques based on domain knowledge.

This pipeline automates many of the common preprocessing and modeling steps, making it easier to experiment and iterate on your machine learning project.
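
The first two improvements (hyperparameter tuning and cross-validation) can be combined in a single grid search over the pipeline. The following is a minimal sketch, not a tuned setup: it assumes a purely numeric synthetic dataset from `make_classification` in place of real data, and the parameter grid values are arbitrary illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a real dataset (assumption: numeric, no missing values).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("feature_selector", SelectFromModel(
        RandomForestClassifier(n_estimators=50, random_state=42))),
    ("classifier", RandomForestClassifier(random_state=42)),
])

# Step names are joined to parameter names with a double underscore.
param_grid = {
    "feature_selector__threshold": ["mean", "median"],
    "classifier__n_estimators": [50, 100],
    "classifier__max_depth": [None, 5],
}

# 5-fold cross-validation on the training set for every grid combination.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Mean CV accuracy: {search.best_score_:.2f}")
print(f"Test accuracy: {search.score(X_test, y_test):.2f}")
```

Tuning the whole pipeline, rather than the classifier alone, means the scaler and feature selector are refit inside each fold, so the held-out fold never influences preprocessing.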

### Q2. Build a pipeline that includes a random forest classifier and a logistic regression classifier, and then use a voting classifier to combine their predictions. Train the pipeline on the iris dataset and evaluate its accuracy.

In [1]:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Load the Iris dataset
data = load_iris()
X = data.data
y = data.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create individual classifiers (Random Forest and Logistic Regression)
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
lr_classifier = LogisticRegression(solver='liblinear', random_state=42)

# Create a list of classifiers for the Voting Classifier
classifiers = [('Random Forest', rf_classifier), ('Logistic Regression', lr_classifier)]

# Create a Voting Classifier using 'hard' voting (majority vote)
voting_classifier = VotingClassifier(estimators=classifiers, voting='hard')

# Create a pipeline with standardization and the Voting Classifier
pipeline = Pipeline([
    ('scaler', StandardScaler()),  # Standardize features
    ('voting_classifier', voting_classifier)
])

# Fit the pipeline on the training data
pipeline.fit(X_train, y_train)

# Predict on the test data
y_pred = pipeline.predict(X_test)

# Evaluate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')


Accuracy: 1.00


#### Explanation of Steps:

1. We load the Iris dataset, which is a commonly used dataset for classification.
2. We split the dataset into training and testing sets.
3. We create individual classifiers, namely a Random Forest Classifier and a Logistic Regression Classifier.
4. We define a list of classifiers to be used in the Voting Classifier.
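5. We create a Voting Classifier using "hard" voting, which means it predicts the class label by majority vote. With only two estimators, any disagreement is a tie, which scikit-learn resolves in favor of the class that comes first in sorted label order.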
6. We create a pipeline that includes standardization (scaling features) and the Voting Classifier.
7. We fit the pipeline on the training data.
8. We predict on the test data using the pipeline.
9. We evaluate the accuracy of the model using scikit-learn's accuracy_score function.
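
How hard voting behaves when two estimators disagree can be checked directly. The sketch below is illustrative only: it swaps the real classifiers for two `DummyClassifier` instances that always predict fixed, conflicting classes, so every vote is a tie.

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import VotingClassifier

X, y = load_iris(return_X_y=True)

# Two deliberately disagreeing voters (hypothetical, for demonstration):
# one always predicts class 2, the other always predicts class 0.
always_two = DummyClassifier(strategy="constant", constant=2)
always_zero = DummyClassifier(strategy="constant", constant=0)

tie_vote = VotingClassifier(
    estimators=[("always_two", always_two), ("always_zero", always_zero)],
    voting="hard",
)
tie_vote.fit(X, y)

# Each class receives exactly one vote, so every prediction is a tie.
tie_predictions = tie_vote.predict(X[:5])
print(tie_predictions)
```

Every prediction comes out as class 0, the lower of the two tied labels, regardless of the order in which the estimators are listed. An odd number of estimators, or soft voting, avoids relying on this tie-break.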

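The Voting Classifier combines the predictions of the individual classifiers (Random Forest and Logistic Regression) and makes a final prediction based on majority voting. An ensemble like this can improve predictive performance when its members make different kinds of errors. The perfect score above comes from only 30 test samples (20% of the 150 Iris samples), so on its own it does not show that the ensemble outperforms either classifier.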
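
One way to judge whether the ensemble adds anything is to cross-validate each classifier alongside it. The sketch below makes two changes that are assumptions rather than part of the solution above: it uses soft voting (averaging predicted probabilities) instead of hard voting, and it uses Logistic Regression with its default solver and a raised `max_iter`.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
lr = LogisticRegression(max_iter=1000, random_state=42)
# Soft voting averages class probabilities, so both estimators
# must implement predict_proba (both of these do).
soft_vote = VotingClassifier(estimators=[("rf", rf), ("lr", lr)], voting="soft")

candidates = {
    "random_forest": rf,
    "logistic_regression": lr,
    "soft_voting": soft_vote,
}

cv_scores = {}
for name, estimator in candidates.items():
    # Scaling lives inside the pipeline so it is refit on each training fold.
    model = make_pipeline(StandardScaler(), estimator)
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the three mean scores are within a standard deviation of each other, the ensemble is probably not buying much on this dataset, and the simpler model may be preferable.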