
# Q1. You are working on a machine learning project where you have a dataset containing numerical and categorical features. You have identified that some of the features are highly correlated and there are missing values in some of the columns. You want to build a pipeline that automates the feature engineering process and handles the missing values.


A scikit-learn Pipeline bundles imputation, scaling, encoding, and the model into a single object, so exactly the same preprocessing is applied at training time and at prediction time. The steps below build such a pipeline in Python.

Step-by-Step Guide
1. Import Necessary Libraries: You'll need pandas, NumPy, and scikit-learn (imported as sklearn). Make sure they are installed.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder


2. Load Your Dataset: Load your data into a pandas DataFrame.

In [None]:
df = pd.read_csv('your_dataset.csv')

3. Identify Numerical and Categorical Features: Specify which columns are numerical and which are categorical.

In [None]:
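# Exclude the target first, otherwise it would be listed as an input feature
feature_df = df.drop(columns=['target_column'])  # Replace with your target column
numerical_features = feature_df.select_dtypes(include=['number']).columns.tolist()
categorical_features = feature_df.select_dtypes(include=['object', 'category']).columns.tolist()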


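4. Handle Missing Values: Use SimpleImputer with the mean or median strategy for numerical features and the most_frequent strategy for categorical features. The imputers themselves are defined inside the transformers in step 5.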
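
As a minimal sketch of what the two imputation strategies do, the snippet below applies them to a tiny made-up DataFrame. The column names `age` and `city` and their values are assumptions for illustration only and do not come from any real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    'age': [25.0, np.nan, 35.0],
    'city': pd.Series(['Paris', 'Paris', np.nan], dtype=object),
})

# Numerical column: the missing entry is replaced by the column mean
num_imputed = SimpleImputer(strategy='mean').fit_transform(toy[['age']])

# Categorical column: the missing entry is replaced by the most frequent value
cat_imputed = SimpleImputer(strategy='most_frequent').fit_transform(toy[['city']])

print(num_imputed.ravel())
print(cat_imputed.ravel())
```

Here the missing age becomes 30.0 (the mean of 25 and 35) and the missing city becomes 'Paris'. The median strategy is usually the safer choice when a numerical column has strong outliers.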

5. Create the Preprocessing Pipeline: Use ColumnTransformer to apply different transformations to different column types.

In [None]:
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),  # or 'median'
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)


6. Split the Data: Separate your target variable from your features and split the data into training and testing sets.

In [None]:
X = df.drop('target_column', axis=1)  # Replace with your target column
y = df['target_column']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


7. Build the Full Pipeline: Include a machine learning model at the end of your pipeline. Here, we can use a Random Forest as an example.

In [None]:
from sklearn.ensemble import RandomForestClassifier  # or any other model

model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))  # fixed seed for reproducible results
])


8. Train the Model: Fit the model on the training data.

In [None]:
model.fit(X_train, y_train)


9. Evaluate the Model: Check the performance of your model using the test data.

In [None]:
from sklearn.metrics import accuracy_score, classification_report

y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
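
A single train/test split can give a noisy estimate, so it may help to cross-validate the whole pipeline. The sketch below is self-contained and uses synthetic data, because `your_dataset.csv` is only a placeholder. The column names (`income`, `age`, `region`, `target`), the rule that generates the target, and the names `toy_df`, `toy_preprocessor`, and `toy_model` are all assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for a real dataset (all names and values are made up)
rng = np.random.default_rng(42)
n = 200
toy_df = pd.DataFrame({
    'income': rng.normal(50, 10, n),
    'age': rng.integers(18, 70, n).astype(float),
    'region': rng.choice(['north', 'south', 'east'], n),
})
toy_df['target'] = (toy_df['income'] > 50).astype(int)

# Knock out roughly 10% of the values in one numerical and one categorical column
toy_df.loc[rng.random(n) < 0.1, 'income'] = np.nan
toy_df.loc[rng.random(n) < 0.1, 'region'] = np.nan

X_toy = toy_df.drop(columns='target')
y_toy = toy_df['target']

toy_preprocessor = ColumnTransformer(transformers=[
    ('num', Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler()),
    ]), ['income', 'age']),
    ('cat', Pipeline([
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore')),
    ]), ['region']),
])

toy_model = Pipeline([
    ('preprocessor', toy_preprocessor),
    ('classifier', RandomForestClassifier(random_state=42)),
])

# The imputers, scaler, and encoder are refit on the training part of every fold
scores = cross_val_score(toy_model, X_toy, y_toy, cv=5)
print("Mean CV accuracy:", scores.mean())
```

Because the preprocessing lives inside the pipeline, the statistics used for imputation and scaling are computed only from each fold's training data, which avoids leaking information from the held-out fold.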


# Summary
This pipeline automates missing-value handling, scaling of numerical features, and one-hot encoding of categorical features. It does not yet address the highly correlated features mentioned in the question; a correlation filter or a dimensionality-reduction step such as PCA can be added to the numerical branch for that. Adjust the imputer strategies and the model to suit your dataset and objectives. The pipeline can also be extended with further feature engineering steps, such as polynomial features or interaction terms.
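
One possible way to deal with the highly correlated numerical features from the question is a small custom transformer placed in the numerical branch. The sketch below is an assumption-laden illustration, not part of scikit-learn: `CorrelationFilter`, its `threshold` parameter, the `keep_` attribute, and `numerical_with_filter` are hypothetical names, and the greedy "keep the first column of each correlated group" rule is only one of several reasonable choices.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class CorrelationFilter(TransformerMixin, BaseEstimator):
    """Hypothetical helper: keep a column only if its absolute Pearson
    correlation with every previously kept column is <= threshold."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        corr = np.atleast_2d(np.corrcoef(X, rowvar=False))
        # Constant columns produce NaN correlations; treat them as uncorrelated
        corr = np.abs(np.nan_to_num(corr))
        keep = []
        for j in range(corr.shape[0]):
            if all(corr[j, k] <= self.threshold for k in keep):
                keep.append(j)
        self.keep_ = np.array(keep, dtype=int)
        return self

    def transform(self, X):
        return np.asarray(X)[:, self.keep_]


# Demo data: column 1 is almost a copy of column 0, column 2 is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = np.column_stack([
    a,
    2 * a + 0.01 * rng.normal(size=200),
    rng.normal(size=200),
])

# Impute first so the correlation matrix is computed on complete data
numerical_with_filter = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('corr_filter', CorrelationFilter(threshold=0.9)),
    ('scaler', StandardScaler()),
])

reduced = numerical_with_filter.fit_transform(demo)
print(reduced.shape)
```

In this demo the near-duplicate column is dropped, so two of the three columns remain. A pipeline like `numerical_with_filter` could be passed to the ColumnTransformer in place of the plain numerical transformer; PCA from sklearn.decomposition is an alternative when keeping the original columns is not required.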

# Q2. Build a pipeline that includes a random forest classifier and a logistic regression classifier, and then use a voting classifier to combine their predictions. Train the pipeline on the iris dataset and evaluate accuracy.

To build a pipeline with both a Random Forest and Logistic Regression classifier, we can use a Voting Classifier from scikit-learn to combine their predictions. Here’s how to set it up and evaluate its accuracy on the Iris dataset.

Steps
1. Import Libraries: Import necessary libraries from scikit-learn.

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score


2. Load the Iris Dataset: Load the Iris dataset and split it into features (X) and target (y).

In [None]:
iris = load_iris()
X, y = iris.data, iris.target


3. Split the Data: Split the data into training and testing sets.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


4. Build the Classifier Pipelines: Define pipelines for both classifiers: Random Forest and Logistic Regression.

In [None]:
rf_pipeline = Pipeline([
    ('rf', RandomForestClassifier(random_state=42))
])

lr_pipeline = Pipeline([
    ('lr', LogisticRegression(max_iter=200, random_state=42))
])
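
The single-step pipelines above behave the same as the bare estimators; wrapping them pays off once preprocessing is added. As a hedged sketch, the variant below puts a StandardScaler in front of the logistic regression, which often helps gradient-based solvers converge. The name `scaled_lr_pipeline` is an assumption for this example and is not used elsewhere in the notebook.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scaling is fit on the training data only and reused at prediction time
scaled_lr_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('lr', LogisticRegression(max_iter=200, random_state=42)),
])

scaled_lr_pipeline.fit(X_train, y_train)
scaled_lr_accuracy = scaled_lr_pipeline.score(X_test, y_test)
print("Scaled Logistic Regression Accuracy:", scaled_lr_accuracy)
```

Random forests are insensitive to feature scaling, so the `rf` pipeline does not need a scaler.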


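5. Combine Classifiers with a Voting Classifier: Use VotingClassifier to combine the predictions of Random Forest and Logistic Regression. With voting='hard', each classifier casts one vote for its predicted class and the majority wins; with voting='soft', the predicted class probabilities are averaged and the class with the highest average is chosen. With only two classifiers, hard voting can tie, and scikit-learn then picks the tied class that comes first in ascending label order, which is one reason to consider soft voting for a two-model ensemble.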

In [None]:
voting_classifier = VotingClassifier(
    estimators=[('rf', rf_pipeline), ('lr', lr_pipeline)],
    voting='hard'  # or 'soft' for probabilistic voting
)
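
To make the difference between the two voting modes concrete, the following self-contained sketch trains the same two classifiers under both settings on the Iris split used here. The variable names `fitted`, `results`, and `proba` are assumptions for this example; the accuracies depend on the installed scikit-learn version, so none are quoted.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

estimators = [
    ('rf', RandomForestClassifier(random_state=42)),
    ('lr', LogisticRegression(max_iter=200, random_state=42)),
]

fitted = {}
results = {}
for mode in ('hard', 'soft'):
    # VotingClassifier clones the estimators, so the list can be reused
    clf = VotingClassifier(estimators=estimators, voting=mode)
    clf.fit(X_train, y_train)
    fitted[mode] = clf
    results[mode] = accuracy_score(y_test, clf.predict(X_test))
    print(mode, "voting accuracy:", results[mode])

# Averaged class probabilities are only available with voting='soft'
proba = fitted['soft'].predict_proba(X_test[:1])
print("Soft-voting probabilities for the first test sample:", proba)
```

Soft voting requires every base estimator to implement predict_proba, which both RandomForestClassifier and LogisticRegression do.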


6. Train the Voting Classifier: Fit the voting classifier on the training data.

In [None]:
voting_classifier.fit(X_train, y_train)


7. Evaluate the Model: Predict on the test set and calculate the accuracy.

In [None]:
y_pred = voting_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Voting Classifier Accuracy:", accuracy)
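
It can also be useful to check whether the ensemble actually improves on its members. The sketch below, under the same Iris split, reads the fitted base models from the `named_estimators_` attribute and scores each one alongside the ensemble. The dictionary name `scores` is an assumption for this example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

voting_classifier = VotingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(random_state=42)),
        ('lr', LogisticRegression(max_iter=200, random_state=42)),
    ],
    voting='hard',
)
voting_classifier.fit(X_train, y_train)

scores = {'voting': accuracy_score(y_test, voting_classifier.predict(X_test))}

# The fitted members are trained on label-encoded targets; for Iris the
# labels are already 0, 1, 2, so their predictions can be compared directly
for name, estimator in voting_classifier.named_estimators_.items():
    scores[name] = accuracy_score(y_test, estimator.predict(X_test))

for name, value in scores.items():
    print(name, "accuracy:", value)
```

On a small, easy dataset such as Iris the three numbers are often very close, so a voting ensemble may add little here; the benefit tends to show up when the base models make different kinds of errors.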


# **Full code**

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# Load the iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define classifier pipelines
rf_pipeline = Pipeline([
    ('rf', RandomForestClassifier(random_state=42))
])

lr_pipeline = Pipeline([
    ('lr', LogisticRegression(max_iter=200, random_state=42))
])

# Combine classifiers with VotingClassifier
voting_classifier = VotingClassifier(
    estimators=[('rf', rf_pipeline), ('lr', lr_pipeline)],
    voting='hard'  # use 'soft' if you want to use probabilities
)

# Train the voting classifier
voting_classifier.fit(X_train, y_train)

# Evaluate the model
y_pred = voting_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Voting Classifier Accuracy:", accuracy)
