Q1. You are working on a machine learning project where you have a dataset containing numerical and categorical features.

You have identified that some of the features are highly correlated and there are missing values in some of the columns.

You want to build a pipeline that automates the feature engineering process and handles the missing values.


Design a pipeline that includes the following steps:

* Use an automated feature selection method to identify the important features in the dataset.

* Create a numerical pipeline that includes the following steps:
  * Impute the missing values in the numerical columns using the mean of the column values.

  * Scale the numerical columns using standardisation.

* Create a categorical pipeline that includes the following steps:
  * Impute the missing values in the categorical columns using the most frequent value of the column.

  * One-hot encode the categorical columns.

* Combine the numerical and categorical pipelines using a Column Transformer.

* Use a Random Forest Classifier to build the final model.

* Evaluate the accuracy of the model on the test dataset.

Note: Your solution should include code snippets for each step of the pipeline, and a brief explanation of each step. You should also provide an interpretation of the results and suggest possible improvements for the pipeline.

This solution uses the well-known Titanic dataset, which contains both numerical and categorical features and has missing values in several columns.



## 1. Import Required Libraries


In [57]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectFromModel


## 2. Load and Prepare the Titanic Dataset
The Titanic dataset can be loaded directly from seaborn, a Python visualization library built on matplotlib. The call sns.load_dataset('titanic') downloads a cleaned, ready-to-use copy from seaborn's online data repository, so the first call needs internet access.



In [58]:
import seaborn as sns

# Load the Titanic dataset
titanic = sns.load_dataset('titanic')

# Drop rows with missing target values
titanic = titanic.dropna(subset=['survived'])

# Define features and target.
# 'alive' is the target as a yes/no string, so keeping it would leak the label;
# 'embark_town' duplicates the information in 'embarked'.
X = titanic.drop(columns=['survived', 'embark_town', 'alive'])
y = titanic['survived']

# Convert 'sex' to categorical for feature processing
X['sex'] = X['sex'].astype('category')

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)




## 3. Define Numerical and Categorical Features


In [59]:
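# Identify numerical and categorical features by dtype.
# Note: bool columns ('adult_male', 'alone') match neither selector, so the
# ColumnTransformer below drops them (its default is remainder='drop').
numerical_features = X.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X.select_dtypes(include=['object', 'category']).columns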


## 4. Create Numerical Pipeline


In [60]:
numerical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),  # Impute missing values with mean
    ('scaler', StandardScaler())  # Standardize numerical features
])
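
As a quick check of what these two steps do, here is a minimal sketch on a made-up single-column array. The names toy_values and toy_numerical_pipeline and the numbers are illustrative assumptions, not taken from the Titanic data. The missing entry is replaced by the mean of the observed values, and the column is then rescaled to zero mean and unit variance.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical column with one missing value; the observed mean is (1 + 3 + 5) / 3 = 3
toy_values = np.array([[1.0], [3.0], [np.nan], [5.0]])

toy_numerical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),  # fill NaN with the column mean
    ('scaler', StandardScaler()),                 # subtract mean, divide by std
])

scaled = toy_numerical_pipeline.fit_transform(toy_values)

# The imputed row sits exactly at the column mean, so it scales to 0
print(scaled.ravel())
print(scaled.mean(), scaled.std())
```

Because the imputed value equals the column mean, mean imputation leaves the mean unchanged but shrinks the variance slightly; a median strategy is a common alternative for skewed columns.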


## 5. Create Categorical Pipeline


In [61]:
categorical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Impute missing values with most frequent value
    ('onehot', OneHotEncoder(handle_unknown='ignore'))  # One-hot encode categorical features
])
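
The following sketch shows the same two categorical steps on a made-up column, to make the behaviour of handle_unknown='ignore' concrete. The array toy_ports and the category 'Q' used as an unseen value are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column; 'S' is the most frequent value
toy_ports = np.array([['S'], ['C'], [np.nan], ['S']], dtype=object)

toy_categorical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

# OneHotEncoder returns a sparse matrix by default; densify it for display
encoded = toy_categorical_pipeline.fit_transform(toy_ports).toarray()
categories = toy_categorical_pipeline.named_steps['onehot'].categories_[0]

# A category never seen during fit ('Q' here) becomes an all-zero row
unseen = toy_categorical_pipeline.transform(np.array([['Q']], dtype=object)).toarray()

print(list(categories))
print(encoded)
print(unseen)
```

Without handle_unknown='ignore', transforming a category that was absent from the training data raises an error, which can happen when a rare value lands only in the test split.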


## 6. Combine Pipelines Using Column Transformer


In [62]:
preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_pipeline, numerical_features),
    ('cat', categorical_pipeline, categorical_features)
])
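
One detail worth checking is what happens to columns that are listed in neither group. The sketch below uses a small made-up frame (the name toy and its values are assumptions) with a float, a category and a bool column. With the dtype selectors used above, the bool column matches neither group, and ColumnTransformer drops it unless remainder='passthrough' is set.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical frame with one float, one category and one bool column
toy = pd.DataFrame({
    'age': [22.0, 38.0, 26.0],
    'sex': pd.Categorical(['male', 'female', 'female']),
    'alone': [False, False, True],
})

numerical = toy.select_dtypes(include=['int64', 'float64']).columns
categorical = toy.select_dtypes(include=['object', 'category']).columns

# Columns matched by neither selector (here, the bool column)
unmatched = [c for c in toy.columns if c not in numerical and c not in categorical]

transformers = [
    ('num', StandardScaler(), numerical),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical),
]
dropped = ColumnTransformer(transformers=transformers)  # remainder='drop' is the default
kept = ColumnTransformer(transformers=transformers, remainder='passthrough')

n_dropped = dropped.fit_transform(toy).shape[1]  # 1 scaled + 2 one-hot columns
n_kept = kept.fit_transform(toy).shape[1]        # the bool column is appended
print(unmatched, n_dropped, n_kept)
```

Whether the unmatched columns should be kept is a modelling decision; the point is only that the default silently discards them.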


## 7. Build and Combine the Full Pipeline


In [63]:
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('feature_selection', SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=42))),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
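
To see what the feature-selection step keeps, the fitted selector can be inspected. The sketch below runs SelectFromModel on its own, using synthetic data from make_classification as a stand-in (an assumption; X_demo and y_demo are not the Titanic features). For a Random Forest, the default threshold is the mean feature importance.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data (assumption): 8 features, 3 of them informative
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

selector = SelectFromModel(RandomForestClassifier(n_estimators=50, random_state=0))
selector.fit(X_demo, y_demo)

importances = selector.estimator_.feature_importances_
threshold = selector.threshold_  # mean importance by default for this estimator
mask = selector.get_support()    # True for each retained feature

print(importances.round(3))
print(threshold, mask)
```

In the pipeline above, the same information should be available after fitting through pipeline.named_steps['feature_selection'].get_support(), where each entry refers to a column of the preprocessed matrix (scaled numerical columns followed by one-hot columns).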


## 8. Train and Evaluate the Model


In [64]:
# Fit the pipeline
pipeline.fit(X_train, y_train)

# Predict and evaluate
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Model Accuracy: {accuracy:.4f}")


Model Accuracy: 0.7765


## Explanation
1. Loading Data: We load the Titanic dataset and remove rows with missing target values ('survived').

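2. Feature and Target Separation: Drop 'alive' (the target stored as a yes/no string, which would leak the label) and 'embark_town' (a duplicate of 'embarked'), then separate the features (X) from the target (y).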

3. Feature Types: Identify numerical and categorical features for specific processing.

4. Numerical Pipeline: Impute missing numerical values with the mean and standardize the features.

5. Categorical Pipeline: Impute missing categorical values with the most frequent value and apply one-hot encoding.

6. Column Transformer: Apply appropriate pipelines to numerical and categorical features.

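7. Full Pipeline: Combine preprocessing with feature selection and model training. SelectFromModel fits a Random Forest and keeps the features whose importance is at least the mean importance (its default threshold for this estimator); a second Random Forest classifier is then trained on the retained features.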

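8. Model Training and Evaluation: Train the model and evaluate its accuracy on the test dataset. The reported accuracy of 0.7765 means that about 78% of the held-out passengers were classified correctly. For comparison, about 62% of passengers in the full dataset did not survive, so always predicting 'did not survive' would score roughly 0.62 on data with that class balance.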

## Possible Improvements
1. Feature Engineering: Add or create new features based on domain knowledge (e.g., creating features from names or ticket numbers).
2. Hyperparameter Tuning: Use GridSearchCV or RandomizedSearchCV to optimize the Random Forest hyperparameters.
3. Model Evaluation: Evaluate other metrics like precision, recall, and F1-score, especially if the dataset is imbalanced.
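
Improvements 2 and 3 can be sketched together. The example below is a minimal illustration on synthetic data (make_classification is a stand-in, and the grid values are arbitrary assumptions rather than recommended settings). It tunes a Random Forest inside a pipeline with GridSearchCV and then prints precision, recall and F1 per class.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (assumption, not the Titanic features)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

demo_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=0)),
])

# Pipeline parameters are addressed as <step name>__<parameter name>
param_grid = {
    'classifier__n_estimators': [25, 50],
    'classifier__max_depth': [3, None],
}

search = GridSearchCV(demo_pipeline, param_grid, cv=3, scoring='f1')
search.fit(X_tr, y_tr)

# search.predict uses the best pipeline, refit on the whole training split
report = classification_report(y_te, search.predict(X_te))
print(search.best_params_)
print(report)
```

For the Titanic pipeline, the step is named 'classifier' as well, so the same parameter prefix would apply; the search runs cross-validation on the training split only, which keeps the test set untouched until the final evaluation.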

This pipeline handles the Titanic dataset end to end: imputation, scaling, encoding and feature selection are all fitted inside the pipeline on the training data, so exactly the same fitted transformations are applied to the test data before prediction.








Q2. Build a pipeline that includes a random forest classifier and a logistic regression classifier, and then use a voting classifier to combine their predictions. Train the pipeline on the iris dataset and evaluate its accuracy.

To build a pipeline that includes a Random Forest Classifier and a Logistic Regression Classifier, and then use a Voting Classifier to combine their predictions on the Iris dataset, you can follow these steps:



## Step 1: Import the necessary libraries


In [65]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score


## Step 2: Load the Iris dataset


In [66]:
iris = load_iris()
X = iris.data
y = iris.target


## Step 3: Split the data into training and test sets


In [67]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


## Step 4: Create individual pipelines for Random Forest and Logistic Regression


In [68]:
pipeline_rf = Pipeline([
    ('scaler', StandardScaler()),
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42))
])

pipeline_lr = Pipeline([
    ('scaler', StandardScaler()),
    ('lr', LogisticRegression(random_state=42))
])


## Step 5: Combine the pipelines using a Voting Classifier


In [69]:
voting_clf = VotingClassifier(
    estimators=[('rf', pipeline_rf), ('lr', pipeline_lr)],
    voting='hard'
)


## Step 6: Train the Voting Classifier on the training data


In [70]:
voting_clf.fit(X_train, y_train)


## Step 7: Make predictions and evaluate accuracy


In [71]:
y_pred = voting_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy of the Voting Classifier: {accuracy:.2f}')


Accuracy of the Voting Classifier: 1.00
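
A single split leaves only 30 test samples (20% of 150), so the accuracy estimate is coarse. As a hedged sketch of a more stable estimate, the same kind of voting classifier can be scored with stratified 5-fold cross-validation; the fold count and the names X_iris, y_iris and cv_clf are choices made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)

cv_clf = VotingClassifier(
    estimators=[
        ('rf', Pipeline([('scaler', StandardScaler()),
                         ('rf', RandomForestClassifier(n_estimators=100, random_state=42))])),
        ('lr', Pipeline([('scaler', StandardScaler()),
                         ('lr', LogisticRegression(random_state=42))])),
    ],
    voting='hard',
)

# Each fold refits the scalers and both models on that fold's training part only
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(cv_clf, X_iris, y_iris, cv=cv, scoring='accuracy')

print(f'Mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})')
```

The spread across folds gives a rough sense of how much the single-split figure could vary with a different random split.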


## Interpretation
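* Scaling: Both pipelines include StandardScaler to standardize features. This matters for Logistic Regression, whose regularized fit is sensitive to feature scale. Random Forest splits are unaffected by it, so the scaler in that pipeline is harmless but optional.
* Voting Classifier: The VotingClassifier combines the predictions of Random Forest and Logistic Regression. Here, "hard" voting is used, which predicts the label that gets the most votes. With only two estimators, any disagreement is a tie, which scikit-learn resolves by choosing the class that comes first in ascending sort order.
* Result: An accuracy of 1.00 means all 30 test samples (20% of the 150 Iris samples) were classified correctly. With so few test samples, this is a coarse estimate of generalization performance.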
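
With just two estimators, hard voting has no majority whenever they disagree. The sketch below makes the tie behaviour visible using two DummyClassifier instances that always predict different classes. These stand in for the real models purely for illustration, and the toy arrays X_toy and y_toy are assumptions.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import VotingClassifier

# Toy data with three classes; the features are irrelevant to the dummies
X_toy = np.zeros((6, 1))
y_toy = np.array([0, 1, 2, 0, 1, 2])

tie_clf = VotingClassifier(
    estimators=[
        ('always_two', DummyClassifier(strategy='constant', constant=2)),
        ('always_zero', DummyClassifier(strategy='constant', constant=0)),
    ],
    voting='hard',
)
tie_clf.fit(X_toy, y_toy)

# One vote for class 2 and one for class 0: the tie resolves to the
# class that comes first in ascending order, regardless of estimator order
tie_predictions = tie_clf.predict(X_toy)
print(tie_predictions)
```

For this reason, hard voting is usually more informative with an odd number of estimators, or with weights, or it can be replaced by soft voting.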

## Suggestions for Improvement

* Hyperparameter Tuning: Consider using GridSearchCV or RandomizedSearchCV to optimize hyperparameters for both classifiers.

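* Use of Soft Voting: Instead of hard voting, you could try soft voting, which averages the class probabilities predicted by the estimators and picks the class with the highest average. This requires every estimator to support predict_proba, which both classifiers here do.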
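
A minimal sketch of the soft-voting variant on the same Iris split follows. The variable names (soft_clf, soft_proba, manual_average) are introduced for this example, and only the Logistic Regression is wrapped with a scaler here, which is a simplification rather than the setup used above. The manual average is computed to show that soft voting with no weights is the plain mean of the estimators' probabilities.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_iris, y_iris, test_size=0.2, random_state=42)

soft_clf = VotingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
        ('lr', Pipeline([('scaler', StandardScaler()),
                         ('lr', LogisticRegression(random_state=42))])),
    ],
    voting='soft',
)
soft_clf.fit(X_tr, y_tr)

# Averaged class probabilities, one row per test sample
soft_proba = soft_clf.predict_proba(X_te)

# The same average computed by hand from the fitted estimators
manual_average = np.mean([est.predict_proba(X_te) for est in soft_clf.estimators_], axis=0)

soft_accuracy = accuracy_score(y_te, soft_clf.predict(X_te))
print(f'Soft-voting accuracy: {soft_accuracy:.2f}')
```

Soft voting also avoids the two-estimator tie problem of hard voting, since exactly equal averaged probabilities are rare in practice.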
