## Q1. You are working on a machine learning project where you have a dataset containing numerical and categorical features. You have identified that some of the features are highly correlated and there are missing values in some of the columns. You want to build a pipeline that automates the feature engineering process and handles the missing values.

Design a pipeline that includes the following steps:

- Use an automated feature selection method to identify the important features in the dataset.
- Create a numerical pipeline that includes the following steps:
    - Impute the missing values in the numerical columns using the mean of the column values.
    - Scale the numerical columns using standardisation.
- Create a categorical pipeline that includes the following steps:
    - Impute the missing values in the categorical columns using the most frequent value of the column.
    - One-hot encode the categorical columns.
- Combine the numerical and categorical pipelines using a ColumnTransformer.
- Use a Random Forest Classifier to build the final model.
- Evaluate the accuracy of the model on the test dataset.

Note: Your solution should include code snippets for each step of the pipeline, and a brief explanation of each step. You should also provide an interpretation of the results and suggest possible improvements for the pipeline.

Feature Engineering Pipeline:

    Automated Feature Selection:
        Utilize an automated feature selection method, such as Recursive Feature Elimination (RFE) or feature importance from a tree-based model, to identify important features in the dataset.
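
To make the two options named above concrete, here is a minimal sketch that runs both on synthetic data. The dataset from `make_classification`, the number of trees, and the choice of keeping 4 features in RFE are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel

# Synthetic data (an assumption for illustration): 10 features, 3 informative.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=3,
    n_redundant=2, random_state=0,
)

# Option 1: keep features whose importance is at least the mean importance
# (the default threshold when the estimator exposes feature_importances_).
sfm = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0))
X_sfm = sfm.fit_transform(X, y)

# Option 2: refit repeatedly, dropping the weakest feature each round,
# until 4 features remain (4 is an arbitrary choice here).
rfe = RFE(RandomForestClassifier(n_estimators=100, random_state=0),
          n_features_to_select=4)
X_rfe = rfe.fit_transform(X, y)

print("SelectFromModel kept:", int(sfm.get_support().sum()), "features")
print("RFE kept:", int(rfe.get_support().sum()), "features")
```

SelectFromModel fits the estimator once and is cheap. RFE refits after every elimination, so it costs more but re-evaluates importances as features are removed.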

Numerical Pipeline:

    Impute Missing Values in Numerical Columns:
        Impute missing values in the numerical columns using the mean of the column values. This can be achieved using an imputer from a library like scikit-learn.

    Scale Numerical Columns:
        Scale the numerical columns using standardization (z-score normalization). This ensures that all numerical features have a mean of 0 and a standard deviation of 1. Use a StandardScaler from scikit-learn for this purpose.
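
The effect of the two numerical steps can be checked on a tiny array. This is a sketch under the assumption of a hypothetical 4x2 matrix with two missing entries; the values carry no meaning.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical numerical data with one missing value per column.
X_num = np.array([[1.0, 10.0],
                  [2.0, np.nan],
                  [np.nan, 30.0],
                  [4.0, 40.0]])

num_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),  # NaN -> column mean
    ("scaler", StandardScaler()),                 # (x - mean) / std per column
])
X_scaled = num_pipe.fit_transform(X_num)

print("column means:", X_scaled.mean(axis=0))
print("column stds: ", X_scaled.std(axis=0))
```

Because the imputed entries equal the column mean, they land at 0 after standardisation. Mean imputation therefore shrinks the variance of a column with many missing values, which is worth keeping in mind when interpreting scaled features.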

Categorical Pipeline:

    Impute Missing Values in Categorical Columns:
        Impute missing values in the categorical columns using the most frequent value (mode) of each column. Employ an imputer from scikit-learn.

    One-Hot Encode Categorical Columns:
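        One-hot encode the categorical columns, turning each category into its own binary (0/1) column. Most scikit-learn estimators, including the Random Forest Classifier, require numeric input, so string categories must be encoded first. Use OneHotEncoder from scikit-learn; setting handle_unknown='ignore' encodes categories that were not seen during training as all zeros instead of raising an error.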
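
The categorical steps can be sketched the same way. The single colour column below is a hypothetical example, and "green" stands in for a category that appears only at prediction time.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column with one missing value.
X_cat = np.array([["red"], ["blue"], [np.nan], ["red"]], dtype=object)

cat_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),  # NaN -> "red"
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# The encoder returns a sparse matrix by default; densify it for display.
encoded = cat_pipe.fit_transform(X_cat).toarray()
print(encoded)  # columns follow sorted categories: ["blue", "red"]

# A category never seen during fit is encoded as a row of zeros.
unseen = cat_pipe.transform(np.array([["green"]], dtype=object)).toarray()
print(unseen)
```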

Column Transformer:

    Combine Numerical and Categorical Pipelines:
        Use a ColumnTransformer from scikit-learn to combine the outputs of the numerical and categorical pipelines. This allows for the simultaneous processing of numerical and categorical features.

Model Building:

    Random Forest Classifier:
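        Employ a Random Forest Classifier for building the final predictive model. Random Forests average many decision trees trained on bootstrap samples, which makes them relatively robust to noisy features and able to capture non-linear relationships and feature interactions without manual feature engineering.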

Evaluation:

    Evaluate Model Accuracy:
        Evaluate the accuracy of the trained model on the test dataset. Use metrics such as accuracy, precision, recall, F1-score, and confusion matrix to assess the model's performance.
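
The metrics listed above can be computed as in the following sketch. The label vectors are hypothetical and chosen only so that the numbers are easy to verify by hand.

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)

# Hypothetical true and predicted labels for a 3-class problem.
y_true = [0, 0, 1, 1, 1, 2, 2, 2]
y_hat = [0, 1, 1, 1, 0, 2, 2, 1]

acc = accuracy_score(y_true, y_hat)   # 5 of 8 correct
cm = confusion_matrix(y_true, y_hat)  # rows: true class, columns: predicted

print(f"Accuracy: {acc}")
print(cm)
# Per-class precision, recall and F1-score in one table.
print(classification_report(y_true, y_hat, zero_division=0))
```

Accuracy alone can hide poor performance on a minority class, so the per-class rows of the report and the off-diagonal cells of the confusion matrix are usually the more informative part.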

In [None]:
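from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Assumed to be defined beforehand (not shown in this cell):
#   X_train, X_test: pandas DataFrames; y_train, y_test: class labels
#   numerical_features, categorical_features: lists of column names

# Numerical Pipeline: mean imputation, then standardisation
numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Categorical Pipeline: mode imputation, then one-hot encoding
categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Column Transformer: route each column group to its own pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_pipeline, numerical_features),
        ('cat', categorical_pipeline, categorical_features)
    ])

# Automated Feature Selection: keep features whose Random Forest importance
# is at least the mean importance (the SelectFromModel default threshold)
feature_selection = SelectFromModel(RandomForestClassifier(random_state=42))

# Final Pipeline. Feature selection runs after preprocessing: the selector's
# Random Forest cannot be fitted on raw string categories, and the
# ColumnTransformer needs the original DataFrame to select columns by name.
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('feature_selection', feature_selection),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Fit the model on the training data
pipeline.fit(X_train, y_train)

# Predictions on the test set
y_pred = pipeline.predict(X_test)

# Evaluate Model Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy}")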
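
The cell above relies on `X_train`, `X_test`, `y_train`, `y_test`, `numerical_features` and `categorical_features` being defined elsewhere. The following is a self-contained sketch of the same kind of pipeline on toy data, with preprocessing placed before feature selection so that the selector receives numeric input. The column names (`age`, `age_months`, `income`, `city`), the target rule, and the 10% missing rate are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: column names and the target rule are made up.
rng = np.random.default_rng(0)
n = 400
age = rng.normal(40, 10, n)
city = rng.choice(["A", "B", "C"], n)
X = pd.DataFrame({
    "age": age,
    "age_months": age * 12 + rng.normal(0, 6, n),  # highly correlated with age
    "income": rng.normal(50_000, 15_000, n),       # unrelated to the target
    "city": pd.Series(city, dtype=object),
})
y = (age + 5 * (city == "A") > 42).astype(int)

# Remove roughly 10% of the values in one numerical and one categorical column.
X.loc[rng.random(n) < 0.1, "age"] = np.nan
X.loc[rng.random(n) < 0.1, "city"] = np.nan

numerical_features = ["age", "age_months", "income"]
categorical_features = ["city"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

preprocessor = ColumnTransformer([
    ("num", Pipeline([("imputer", SimpleImputer(strategy="mean")),
                      ("scaler", StandardScaler())]), numerical_features),
    ("cat", Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]),
     categorical_features),
])

pipeline = Pipeline([
    ("preprocessor", preprocessor),  # must precede selection: yields numeric data
    ("feature_selection",
     SelectFromModel(RandomForestClassifier(random_state=42))),
    ("classifier", RandomForestClassifier(random_state=42)),
])
pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
n_kept = int(pipeline.named_steps["feature_selection"].get_support().sum())
print(f"Model Accuracy: {accuracy:.3f} using {n_kept} selected features")
```

A single train/test split gives a noisy estimate, so possible improvements include evaluating with cross-validation, tuning the Random Forest hyperparameters and the selection threshold (for example with `GridSearchCV`), and checking whether one of a pair of highly correlated columns can be dropped before modelling.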


## Q2. Build a pipeline that includes a random forest classifier and a logistic regression classifier, and then use a voting classifier to combine their predictions. Train the pipeline on the iris dataset and evaluate its accuracy.

In [2]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build individual classifiers
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
lr_classifier = LogisticRegression(random_state=42)

# Create a pipeline for each classifier
rf_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', rf_classifier)
])

lr_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', lr_classifier)
])

# Build a Voting Classifier combining the two pipelines
voting_classifier = VotingClassifier(estimators=[
    ('rf', rf_pipeline),
    ('lr', lr_pipeline)
], voting='hard')

# Train the Voting Classifier
voting_classifier.fit(X_train, y_train)

# Evaluate the accuracy on the test set
y_pred = voting_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Voting Classifier Accuracy: {accuracy}")


Voting Classifier Accuracy: 1.0


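Two pipelines are created, each consisting of a StandardScaler followed by either a Random Forest Classifier or a Logistic Regression classifier. Scaling helps Logistic Regression converge; the Random Forest is largely insensitive to it, so the scaler in that pipeline is harmless but not required.
A Voting Classifier then combines the two pipelines using majority voting (voting='hard'). With only two estimators, any disagreement is a tie, which scikit-learn resolves by choosing the class that comes first in ascending sort order.
The Voting Classifier is trained on the training data, and its accuracy is evaluated on the test set. The reported accuracy of 1.0 comes from a test set of only 30 samples (20% of 150), so it is best read as an optimistic single-split estimate rather than evidence of a perfect model.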
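
As a possible variation, the sketch below switches to soft voting, which averages the predicted class probabilities instead of counting votes, and scores the ensemble with 5-fold cross-validation rather than a single split. The choice of `voting='soft'`, `cv=5` and `max_iter=1000` are assumptions for illustration, not part of the original solution.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

soft_vote = VotingClassifier(
    estimators=[
        # Trees do not need scaling, so the forest is used directly.
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("lr", make_pipeline(StandardScaler(),
                             LogisticRegression(max_iter=1000))),
    ],
    voting="soft",  # average predict_proba outputs; avoids two-way vote ties
)

# For classifiers, cv=5 uses stratified folds, so every fold keeps
# the three species in equal proportion.
scores = cross_val_score(soft_vote, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the mean and spread across folds gives a more honest picture than one perfect score on 30 samples.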