In [1]:
# Q1. You are working on a machine learning project where you have a dataset containing numerical and 
# categorical features. You have identified that some of the features are highly correlated and there are 
# missing values in some of the columns. You want to build a pipeline that automates the feature 
# engineering process and handles the missing values.

# Design a pipeline that includes the following steps:
    
#  Use an automated feature selection method to identify the important features in the dataset.
#  Create a numerical pipeline that includes the following steps:
#    Impute the missing values in the numerical columns using the mean of the column values.
#    Scale the numerical columns using standardisation.
#  Create a categorical pipeline that includes the following steps:
#    Impute the missing values in the categorical columns using the most frequent value of the column.
#    One-hot encode the categorical columns.
#  Combine the numerical and categorical pipelines using a ColumnTransformer.
#  Use a Random Forest Classifier to build the final model.
#  Evaluate the accuracy of the model on the test dataset.
    
# Note:- Your solution should include code snippets for each step of the pipeline, and a brief explanation of 
# each step. You should also provide an interpretation of the results and suggest possible improvements for 
# the pipeline.

# Q2. Build a pipeline that includes a random forest classifier and a logistic regression classifier, and then 
# use a voting classifier to combine their predictions. Train the pipeline on the iris dataset and evaluate its 
# accuracy.

## ANS 1

In [2]:
# To build a pipeline that automates the feature engineering process and handles missing values in a machine learning project with numerical and categorical features,
# we can follow the steps below:

# Step 1: Automated Feature Selection
# Use an automated feature selection method to identify the important features in the dataset. There are various techniques available, 
# such as Recursive Feature Elimination (RFE) or feature importance from ensemble models like Random Forest or Gradient Boosting.
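
As a minimal sketch of Step 1, the block below uses `SelectFromModel` with a Random Forest to keep the features whose importance is at least the median importance. The synthetic dataset from `make_classification`, the `threshold="median"` choice, and all variable names are assumptions made for illustration. They are not part of the original exercise.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data (illustrative only): 10 features, 3 informative, 2 redundant
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=3,
    n_redundant=2, random_state=0,
)

# Keep the features whose importance is at least the median importance
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0),
    threshold="median",
)
selector.fit(X_demo, y_demo)

selected_mask = selector.get_support()    # boolean mask over the 10 columns
X_selected = selector.transform(X_demo)   # reduced feature matrix
print("Selected feature indices:", selected_mask.nonzero()[0])
```

With a DataFrame, the same mask could be applied to `df.columns` to recover the selected column names.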

# Step 2: Numerical Pipeline
# The numerical pipeline will handle the preprocessing steps for numerical columns.

# 2.1. Impute Missing Values
# Impute the missing values in the numerical columns using the mean of the column values. This can be done using the SimpleImputer class from scikit-learn.

# 2.2. Scale Numerical Columns
# Scale the numerical columns using standardization. Standardization transforms the data to have zero mean and unit variance, which helps to bring 
# all features to a similar scale. This can be achieved using the StandardScaler class from scikit-learn.
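
The two numerical steps can be tried on a toy column to see what they do. This is only a sketch: the four-row array and the names `X_num` and `num_demo` are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy column with one missing value (illustrative data)
X_num = np.array([[1.0], [2.0], [np.nan], [3.0]])

num_demo = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),  # NaN -> mean of 1, 2, 3 = 2.0
    ("scaler", StandardScaler()),                 # subtract the mean, divide by the std
])
X_num_scaled = num_demo.fit_transform(X_num)

print(X_num_scaled.ravel())
print("mean:", X_num_scaled.mean(), "std:", X_num_scaled.std())
```

The imputed entry equals the column mean, so it maps to 0 after standardization.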

# Step 3: Categorical Pipeline
# The categorical pipeline will handle the preprocessing steps for categorical columns.

# 3.1. Impute Missing Values
# Impute the missing values in the categorical columns using the most frequent value of the column. This can be done using the SimpleImputer class from scikit-learn.

# 3.2. One-Hot Encode Categorical Columns
# One-hot encode the categorical columns to convert them into binary vectors. This step creates new columns, where each column represents a unique category
# and indicates whether the original column had that category or not. The OneHotEncoder class from scikit-learn can be used for this purpose.
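
A small sketch of the categorical steps, assuming a hypothetical `color` column with one missing entry. `handle_unknown="ignore"` is an optional setting that was not in the original description. It stops the encoder from failing on categories that appear only at prediction time.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column with one missing value (illustrative data)
X_cat = pd.DataFrame({"color": ["red", "blue", np.nan, "red"]}, dtype=object)

cat_demo = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),  # NaN -> "red"
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])
# The encoder returns a sparse matrix by default; densify it for printing
encoded = cat_demo.fit_transform(X_cat).toarray()

print(cat_demo.named_steps["encoder"].categories_)  # one column per category
print(encoded)
```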

# Step 4: Combine Numerical and Categorical Pipelines
# Use the ColumnTransformer class from scikit-learn to combine the numerical and categorical pipelines. This allows us to apply different preprocessing steps 
# to different subsets of columns in the dataset.
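
To illustrate how `ColumnTransformer` routes columns, here is a sketch on a hypothetical two-column DataFrame. The column names `age` and `city` are invented, and `sparse_threshold=0` is set only so that the result prints as a dense array. Columns not listed in any transformer are dropped by default (`remainder='drop'`).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data with one missing value per column
df_demo = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "city": pd.Series(["Pune", "Delhi", np.nan, "Pune"], dtype=object),
})

demo_preprocessor = ColumnTransformer(
    [
        ("numerical", Pipeline([
            ("imputer", SimpleImputer(strategy="mean")),
            ("scaler", StandardScaler()),
        ]), ["age"]),
        ("categorical", Pipeline([
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("encoder", OneHotEncoder(handle_unknown="ignore")),
        ]), ["city"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_out = demo_preprocessor.fit_transform(df_demo)
print(X_out.shape)  # 1 scaled numerical column + 2 one-hot columns
print(X_out)
```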

# Step 5: Random Forest Classifier
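# Use a Random Forest Classifier to build the final model. Random Forest is an ensemble learning method that combines multiple decision trees to make predictions. 
# scikit-learn's implementation expects numeric input, so categorical features must be encoded first (Step 3). Its accuracy is usually not hurt much by correlated
# features, although the feature importances get split among them.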

# Step 6: Evaluate Model Performance
# Evaluate the accuracy of the model on the test dataset to assess its performance.

# Here's an example code snippet that demonstrates the pipeline:

In [3]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Step 1: Automated Feature Selection

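# Perform feature selection and get the selected features.
# Placeholder: replace `...` with the list of selected column names.
selected_features = ...

# Separate the selected features into numerical and categorical features.
# Placeholders: replace `...` with lists of column names present in X.
numerical_features = ...
categorical_features = ...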

# Step 2: Numerical Pipeline

numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Step 3: Categorical Pipeline

categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
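    # handle_unknown='ignore' avoids an error if the test set contains a category unseen during training
    ('encoder', OneHotEncoder(handle_unknown='ignore'))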
])

# Step 4: Combine Numerical and Categorical Pipelines

preprocessor = ColumnTransformer([
    ('numerical', numerical_pipeline, numerical_features),
    ('categorical', categorical_pipeline, categorical_features)
])

# Step 5: Random Forest Classifier

model = Pipeline([
    ('preprocessor', preprocessor),
    # Fix random_state so that repeated runs give the same forest
    ('classifier', RandomForestClassifier(random_state=42))
])

# Step 6: Evaluate Model Performance

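# Split the dataset into training and test sets.
# X (feature DataFrame) and y (target) must be loaded before this cell runs;
# they are not defined at this point in the notebook, so the cell fails with a NameError.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)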

# Fit the model on the training data
model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)

NameError: name 'X' is not defined
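
The cell above stops with a `NameError` because no dataset was ever loaded into `X` and `y`. The sketch below shows one way the same pipeline could run end to end. It assumes a small synthetic DataFrame: the column names `income`, `age` and `segment`, the target rule, and the 10% missing rate are all invented for illustration. Feature selection is done inside the pipeline with `SelectFromModel`, which is one possible choice and not necessarily the method the exercise intended.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(42)
n = 400

# Hypothetical dataset: two numerical columns and one categorical column
df = pd.DataFrame({
    "income": rng.normal(50, 10, n),
    "age": rng.normal(40, 12, n),
    "segment": rng.choice(["a", "b", "c"], n).astype(object),
})
# The target depends on income and segment, so there is signal to learn
y = ((df["income"] > 50) | (df["segment"] == "a")).astype(int)

# Knock out roughly 10% of the values in every feature column
for col in df.columns:
    df.loc[rng.random(n) < 0.1, col] = np.nan
X = df

numerical_features = ["income", "age"]
categorical_features = ["segment"]

preprocessor = ColumnTransformer([
    ("numerical", Pipeline([
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler()),
    ]), numerical_features),
    ("categorical", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_features),
])

model = Pipeline([
    ("preprocessor", preprocessor),
    # Keep the transformed features whose importance is above the mean importance
    ("selector", SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=42))),
    ("classifier", RandomForestClassifier(n_estimators=100, random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
print("Accuracy:", accuracy)
```

Keeping imputation, scaling, encoding and selection inside one `Pipeline` means they are fitted on the training split only, which avoids leaking test-set statistics into the model.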

# Q2. Build a pipeline that includes a random forest classifier and a logistic regression classifier, and then use a voting classifier to combine their predictions. Train the pipeline on the iris dataset and evaluate its accuracy.

In [4]:
# Import the necessary libraries (the Iris dataset is loaded in the next cell).

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

In [5]:
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

In [9]:
iris

{'data': array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2],
        [5.4, 3.9, 1.7, 0.4],
        [4.6, 3.4, 1.4, 0.3],
        [5. , 3.4, 1.5, 0.2],
        [4.4, 2.9, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.1],
        [5.4, 3.7, 1.5, 0.2],
        [4.8, 3.4, 1.6, 0.2],
        [4.8, 3. , 1.4, 0.1],
        [4.3, 3. , 1.1, 0.1],
        [5.8, 4. , 1.2, 0.2],
        [5.7, 4.4, 1.5, 0.4],
        [5.4, 3.9, 1.3, 0.4],
        [5.1, 3.5, 1.4, 0.3],
        [5.7, 3.8, 1.7, 0.3],
        [5.1, 3.8, 1.5, 0.3],
        [5.4, 3.4, 1.7, 0.2],
        [5.1, 3.7, 1.5, 0.4],
        [4.6, 3.6, 1. , 0.2],
        [5.1, 3.3, 1.7, 0.5],
        [4.8, 3.4, 1.9, 0.2],
        [5. , 3. , 1.6, 0.2],
        [5. , 3.4, 1.6, 0.4],
        [5.2, 3.5, 1.5, 0.2],
        [5.2, 3.4, 1.4, 0.2],
        [4.7, 3.2, 1.6, 0.2],
        [4.8, 3.1, 1.6, 0.2],
        [5.4, 3.4, 1.5, 0.4],
        [5.2, 4.1, 1.5, 0.1],
  

In [10]:
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [11]:
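# Create the individual classifiers.
# RandomForestClassifier() has no random_state, so results can vary slightly between runs.
# LogisticRegression() defaults to max_iter=100, which is why fitting on the unscaled
# features emits the convergence warning shown after the training cell.
rf_classifier = RandomForestClassifier()
lr_classifier = LogisticRegression()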

In [12]:
# Create the voting classifier
voting_classifier = VotingClassifier(
    estimators=[('rf', rf_classifier), ('lr', lr_classifier)],
    voting='hard'
)

In [13]:
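# Build the pipeline. It has a single step here; preprocessing steps such as
# scaling could be added before the voting classifier.
pipeline = Pipeline([
    ('voting_classifier', voting_classifier)
])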

In [14]:
# Train the pipeline
pipeline.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
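
The warning above suggests two remedies: scale the data or raise `max_iter`. A possible variant of the pipeline that applies both is sketched below. The `StandardScaler` step, `max_iter=1000`, `random_state=42` and the names `scaled_voting` and `accuracy_scaled` are assumptions added for illustration. They are not part of the original run, so the resulting accuracy is not guaranteed to match the output recorded in this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaled_voting = Pipeline([
    ("scaler", StandardScaler()),  # standardize the features so lbfgs converges faster
    ("voting_classifier", VotingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(random_state=42)),
            ("lr", LogisticRegression(max_iter=1000)),  # raise the iteration cap
        ],
        voting="hard",  # majority vote on the predicted class labels
    )),
])
scaled_voting.fit(X_train, y_train)

accuracy_scaled = accuracy_score(y_test, scaled_voting.predict(X_test))
print("Accuracy:", accuracy_scaled)
```

Scaling has little effect on the random forest, but it benefits the logistic regression, which is the estimator that raised the warning.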


In [15]:
# Make predictions on the test dataset
y_pred = pipeline.predict(X_test)

In [16]:
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)

Accuracy: 1.0
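
An accuracy of 1.0 should be read with some care. With `test_size=0.2` the test set holds only 30 of the 150 iris samples, so a single split says little about how the model generalizes. One way to get a steadier estimate is cross-validation, sketched below. `cv=5`, `max_iter=1000` and `random_state=42` are illustrative choices, not settings taken from the original cells.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X_cv, y_cv = load_iris(return_X_y=True)

clf = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=42)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="hard",
)

# For classifiers, an integer cv uses stratified folds
scores = cross_val_score(clf, X_cv, y_cv, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy: %.3f (std %.3f)" % (scores.mean(), scores.std()))
```

If the estimators expose `predict_proba`, as both do here, `voting="soft"` is another option worth comparing. It averages the predicted class probabilities instead of counting votes.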
