## Q1. You are working on a machine learning project where you have a dataset containing numerical and categorical features. You have identified that some of the features are highly correlated and that some columns have missing values. You want to build a pipeline that automates the feature engineering process and handles the missing values.

Design a pipeline that includes the following steps:

- Use an automated feature selection method to identify the important features in the dataset.
- Create a numerical pipeline that includes the following steps:
  - Impute the missing values in the numerical columns using the mean of the column values.
  - Scale the numerical columns using standardisation.
- Create a categorical pipeline that includes the following steps:
  - Impute the missing values in the categorical columns using the most frequent value of the column.
  - One-hot encode the categorical columns.
- Combine the numerical and categorical pipelines using a ColumnTransformer.
- Use a Random Forest Classifier to build the final model.
- Evaluate the accuracy of the model on the test dataset.

**Note:** Your solution should include code snippets for each step of the pipeline and a brief explanation of each step. You should also provide an interpretation of the results and suggest possible improvements for the pipeline.
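
The steps listed in the question can be sketched end to end as follows. This is a minimal illustration rather than a definitive solution: the toy DataFrame, its column names, and the choice of `k=4` are assumptions made so the block runs without any external file, and `SelectKBest` with `f_classif` is just one possible automated feature selection method.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for a real dataset.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "age": rng.normal(30, 10, n),
    "fare": rng.exponential(30, n),
    "sex": rng.choice(["male", "female"], n),
    "embarked": rng.choice(["C", "Q", "S"], n),
})
# The target depends on "sex" and "fare", so there is signal to find.
df["target"] = ((df["sex"] == "female") | (df["fare"] > 60)).astype(int)
# Blank out some values so that the imputers have work to do.
df.loc[rng.choice(n, 30, replace=False), "age"] = np.nan
df.loc[rng.choice(n, 20, replace=False), "embarked"] = np.nan

numeric_features = ["age", "fare"]
categorical_features = ["sex", "embarked"]

# Numerical pipeline: mean imputation, then standardisation.
numeric_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# Categorical pipeline: most-frequent imputation, then one-hot encoding.
categorical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

# Route each column group through its own pipeline.
preprocessor = ColumnTransformer([
    ("num", numeric_pipeline, numeric_features),
    ("cat", categorical_pipeline, categorical_features),
])

# Preprocess, keep the k best-scoring features, then fit the forest.
model = Pipeline([
    ("preprocessor", preprocessor),
    ("selector", SelectKBest(score_func=f_classif, k=4)),
    ("classifier", RandomForestClassifier(n_estimators=100, random_state=0)),
])

X = df.drop(columns="target")
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model.fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print("Accuracy on test set:", test_accuracy)
```

Because the imputers, scaler, encoder, and selector all live inside one `Pipeline`, they are fitted on the training split only, which keeps test-set statistics out of the preprocessing.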

In [136]:
titanic_data

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


In [135]:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectKBest, f_classif

import warnings
warnings.filterwarnings("ignore")

# Step 1: Load the Titanic dataset
titanic_data = pd.read_csv("titanic.csv")

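# Step 2: Split the dataset into features and target variable
X = titanic_data.drop(['survived', "who", "embark_town"], axis=1)
y = titanic_data['survived']

# Define the column lists. Columns that appear in none of these lists
# (e.g. "Unnamed: 0") are dropped by ColumnTransformer (remainder="drop").
continuous_columns = [f for f in X.columns if X[f].dtype == 'float']  # age, fare
ordinal_columns = ["pclass", "sibsp", "parch", "embarked", "class", "deck"]
# CAUTION: "alive" is a yes/no copy of the target "survived"; keeping it as a
# feature leaks the label into the model.
nominal_columns = ["sex","adult_male", "alive", "alone"]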

# Define the ordinal encoding mapping
ordinal_mapping = [
    ["pclass", [1, 2, 3]],
    ["sibsp", [0, 1, 2, 3, 4, 5, 8]],
    ["parch", [0, 1, 2, 3, 4, 5, 6]],
    ["embarked", ["C", "Q", "S"]],
    ["class", ["First", "Second", "Third"]],
    ["deck", ["A", "B", "C", "D", "E", "F", "G"]]
]

# Define the preprocessing pipeline using ColumnTransformer
preprocessor = ColumnTransformer([
    ("ordinal", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OrdinalEncoder(categories=[mapping[1] for mapping in ordinal_mapping]))
    ]), ordinal_columns),
    ("nominal", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder())
    ]), nominal_columns),
    ("continuous", Pipeline([
        ("imputer", SimpleImputer(strategy="mean"))
    ]), continuous_columns)
])

# Define the feature selection pipeline using SelectKBest
feature_selector = SelectKBest(score_func=f_classif, k=10)

# Define the final pipeline with preprocessing and modeling
pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("feature_selector", feature_selector),
    ("model", RandomForestClassifier())
])

# Step 3: Train a Random Forest classifier with cross-validation
scoring = ['accuracy', 'precision', 'recall', 'f1']
cv_results = cross_validate(pipeline, X, y, cv=5, scoring=scoring)

# Step 4: Print the cross-validated results
print("Cross-Validation Results:")
for metric in scoring:
    print(metric, ":", cv_results['test_'+metric].mean())


Cross-Validation Results:
accuracy : 1.0
precision : 1.0
recall : 1.0
f1 : 1.0
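
A score of 1.0 on every metric is more plausibly a symptom of target leakage than of a perfect model. Here `alive` (`yes`/`no`) is a string copy of `survived`, and it is listed in `nominal_columns`, so the classifier can read the answer straight from the features. The sketch below reproduces the effect on a small synthetic frame (the data and the helper `cv_accuracy` are hypothetical, chosen only so the block runs without `titanic.csv`): with the leaked column the cross-validated accuracy is near perfect, and without it the model falls back to roughly chance level, because the remaining toy features are pure noise.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: the features are unrelated to the target,
# except for "alive", which is a string copy of it.
rng = np.random.default_rng(0)
n = 200
survived = rng.integers(0, 2, n)
toy = pd.DataFrame({
    "fare": rng.exponential(30, n),                 # noise
    "embarked": rng.choice(["C", "Q", "S"], n),     # noise
    "alive": np.where(survived == 1, "yes", "no"),  # leaked target
})


def cv_accuracy(numeric, categorical):
    """Mean 5-fold accuracy of a random forest on the given columns."""
    preprocessor = ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
        ("num", "passthrough", numeric),
    ])
    pipeline = Pipeline([
        ("preprocessor", preprocessor),
        ("model", RandomForestClassifier(random_state=0)),
    ])
    scores = cross_val_score(pipeline, toy[numeric + categorical], survived, cv=5)
    return scores.mean()


leaky_accuracy = cv_accuracy(["fare"], ["embarked", "alive"])
clean_accuracy = cv_accuracy(["fare"], ["embarked"])
print("with 'alive'   :", leaky_accuracy)
print("without 'alive':", clean_accuracy)
```

Possible improvements for the Titanic cell follow from this: add `"alive"` to the columns dropped from `X` and remove it from `nominal_columns`, consider dropping `class` (it restates `pclass`), add the `StandardScaler` step that the question asks for to the continuous pipeline, and hold out a test set with `train_test_split` so that the final accuracy is measured on data the pipeline has not seen.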


## Q2. Build a pipeline that includes a random forest classifier and a logistic regression classifier, and then use a voting classifier to combine their predictions. Train the pipeline on the iris dataset and evaluate its accuracy.

In [2]:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Step 1: Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Step 2: Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3: Define the pipelines for individual classifiers
rf_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('rf', RandomForestClassifier(random_state=42))
])

lr_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('lr', LogisticRegression(random_state=42))
])

# Step 4: Combine the pipelines using a Voting Classifier
voting_clf = VotingClassifier(
    estimators=[
        ('rf', rf_pipeline),
        ('lr', lr_pipeline)
    ],
    voting='hard'  # Use majority voting
)

# Step 5: Estimate accuracy with 5-fold cross-validation on the training set
scores = cross_val_score(voting_clf, X_train, y_train, cv=5)

# Step 6: Fit the pipeline on the full training set
voting_clf.fit(X_train, y_train)

# Step 7: Make predictions on the test set
y_pred = voting_clf.predict(X_test)

# Step 8: Evaluate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Cross-validated Accuracy:", scores.mean())
print("Accuracy on Test Set:", accuracy)


Cross-validated Accuracy: 0.9416666666666667
Accuracy on Test Set: 1.0
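
The test accuracy of 1.0 rests on only 30 held-out samples (20% of the 150 iris samples), so the cross-validated score is the steadier of the two estimates. To judge whether voting actually helps, it is worth scoring the ensemble next to its members. Note also that with just two estimators, hard voting resolves a disagreement by picking the class label that sorts first, so soft voting (averaging predicted probabilities) is a natural alternative to try. The following sketch compares all four under the same folds; the shuffled `StratifiedKFold` and `max_iter=1000` are assumptions made for this illustration, not settings from the cell above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)


def make_rf():
    # Tree ensembles do not need feature scaling.
    return RandomForestClassifier(random_state=42)


def make_lr():
    # Logistic regression benefits from standardised inputs.
    return make_pipeline(
        StandardScaler(),
        LogisticRegression(max_iter=1000, random_state=42),
    )


candidates = {
    "random forest": make_rf(),
    "logistic regression": make_lr(),
    "hard voting": VotingClassifier(
        estimators=[("rf", make_rf()), ("lr", make_lr())], voting="hard"
    ),
    "soft voting": VotingClassifier(
        estimators=[("rf", make_rf()), ("lr", make_lr())], voting="soft"
    ),
}

# Use the same shuffled, stratified folds for every candidate.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_means = {
    name: cross_val_score(estimator, X, y, cv=cv).mean()
    for name, estimator in candidates.items()
}

for name, score in cv_means.items():
    print(f"{name}: {score:.3f}")
```

If the ensemble does not beat its best member by more than the fold-to-fold spread, the simpler single model is usually the better choice.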
