
Data Science Masters


Note: Create your assignment in a Jupyter notebook, upload it to GitHub, and share the link to that GitHub repository
           through your dashboard. Make sure the repository is public.

Q1. You are working on a machine learning project where you have a dataset containing numerical and categorical features. You have identified that some of the features are highly correlated and there are missing values in some of the columns. You want to build a pipeline that automates the feature engineering process and handles the missing values. 
Design a pipeline that includes the following steps:
 - Use an automated feature selection method to identify the important features in the dataset.
 - Create a numerical pipeline that includes the following steps:
    - Impute the missing values in the numerical columns using the mean of the column values.
    - Scale the numerical columns using standardisation.
 - Create a categorical pipeline that includes the following steps:
    - Impute the missing values in the categorical columns using the most frequent value of the column.
    - One-hot encode the categorical columns.
 - Combine the numerical and categorical pipelines using a ColumnTransformer.
 - Use a Random Forest Classifier to build the final model.
 - Evaluate the accuracy of the model on the test dataset.
Note: Your solution should include code snippets for each step of the pipeline, and a brief explanation of each step. You should also provide an interpretation of the results and suggest possible improvements for the pipeline. 

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

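# Step 1: Automated feature selection
# SelectFromModel keeps the features whose importance is at least the mean importance
# (its default threshold for estimators such as random forests).
feature_selection_model = RandomForestClassifier(random_state=42)  # Any estimator exposing feature_importances_ or coef_ can be used
feature_selector = SelectFromModel(feature_selection_model)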

# Step 2: Numerical pipeline
numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),  # Impute missing values using mean
    ('scaler', StandardScaler())  # Scale numerical columns using standardization
])

# Step 3: Categorical pipeline
categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Impute missing values using most frequent value
    ('encoder', OneHotEncoder(handle_unknown='ignore'))  # One-hot encode; ignore categories unseen during training
])

# Step 4: Combine numerical and categorical pipelines
preprocessor = ColumnTransformer([
    ('numeric', numerical_pipeline, numerical_features),  # numerical_features contains the numerical column names
    ('categorical', categorical_pipeline, categorical_features)  # categorical_features contains the categorical column names
])

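# Step 5: Build the final model pipeline
# The feature selector from Step 1 runs after preprocessing, so it sees imputed, scaled and encoded features
final_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('feature_selection', feature_selector),
    ('classifier', RandomForestClassifier(random_state=42))  # You can adjust the hyperparameters of RandomForestClassifier
])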

# Step 6: Evaluate the model
final_pipeline.fit(X_train, y_train)  # Assuming X_train and y_train are your training data
y_pred = final_pipeline.predict(X_test)  # Assuming X_test is your test data
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
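
The cell above assumes that `X_train`, `X_test`, `y_train`, `y_test`, `numerical_features` and `categorical_features` already exist. A minimal sketch of how they might be prepared and fed through the same steps is shown here. The synthetic DataFrame, its column names (`age`, `income`, `city`, `plan`), the target rule and the 10% missing-value rate are all assumptions made purely for illustration; they are not the assignment's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(42)
n_rows = 300

# Hypothetical dataset: two numerical and two categorical columns
df = pd.DataFrame({
    'age': rng.normal(40, 10, n_rows),
    'income': rng.normal(50_000, 15_000, n_rows),
    'city': rng.choice(['A', 'B', 'C'], n_rows),
    'plan': rng.choice(['basic', 'premium'], n_rows),
})

# Hypothetical target that depends on income and plan, so there is signal to learn
y = ((df['income'] > 50_000) | (df['plan'] == 'premium')).astype(int)

# Blank out roughly 10% of each column to simulate missing values
for column in df.columns:
    df.loc[rng.random(n_rows) < 0.1, column] = np.nan

# Derive the column lists from the dtypes instead of hard-coding them
numerical_features = df.select_dtypes(include='number').columns.tolist()
categorical_features = df.select_dtypes(exclude='number').columns.tolist()

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, random_state=42, stratify=y
)

numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
])
categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore')),
])
preprocessor = ColumnTransformer([
    ('numeric', numerical_pipeline, numerical_features),
    ('categorical', categorical_pipeline, categorical_features),
])

final_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('feature_selection', SelectFromModel(RandomForestClassifier(random_state=42))),
    ('classifier', RandomForestClassifier(random_state=42)),
])

final_pipeline.fit(X_train, y_train)
accuracy = accuracy_score(y_test, final_pipeline.predict(X_test))
print("Accuracy:", accuracy)
```

Because every preprocessing step lives inside the pipeline, the imputation statistics, scaling parameters and selected features are learned from the training split only, which avoids leaking information from the test split. Possible improvements to explore include estimating accuracy with `cross_val_score` instead of a single split and tuning the forest's hyperparameters with `GridSearchCV`.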



Q2. Build a pipeline that includes a random forest classifier and a logistic regression classifier, and then use a voting classifier to combine their predictions. Train the pipeline on the iris dataset and evaluate its accuracy.

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build a pipeline with Random Forest and Logistic Regression classifiers
rf_pipeline = Pipeline([
    ('rf', RandomForestClassifier(random_state=42))
])

lr_pipeline = Pipeline([
    ('lr', LogisticRegression(max_iter=1000, random_state=42))  # Higher max_iter helps the solver converge on unscaled features
])

# Combine the individual pipelines into a Voting Classifier
voting_pipeline = VotingClassifier(estimators=[
    ('rf', rf_pipeline),
    ('lr', lr_pipeline)
], voting='hard')  # 'hard' voting: majority vote on predicted labels (ties go to the lowest class label)

# Train the voting classifier on the training data
voting_pipeline.fit(X_train, y_train)

# Evaluate the accuracy of the voting classifier on the test data
accuracy = accuracy_score(y_test, voting_pipeline.predict(X_test))
print("Accuracy:", accuracy)
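
As one possible variation, the sketch below switches to soft voting, which averages the predicted class probabilities of the two models instead of counting their label votes, and standardises the inputs to the logistic regression. It reuses the same iris split as above; `max_iter=1000` and the 5-fold cross-validation are arbitrary choices made for illustration, not requirements of the question.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scaling only affects the logistic regression branch; the forest sees raw features
lr_scaled = Pipeline([
    ('scaler', StandardScaler()),
    ('lr', LogisticRegression(max_iter=1000, random_state=42)),
])

# 'soft' voting averages predict_proba outputs, so both estimators must support it
soft_voting = VotingClassifier(estimators=[
    ('rf', RandomForestClassifier(random_state=42)),
    ('lr', lr_scaled),
], voting='soft')

soft_voting.fit(X_train, y_train)
soft_accuracy = accuracy_score(y_test, soft_voting.predict(X_test))
print("Soft-voting test accuracy:", soft_accuracy)

# Cross-validation gives a less split-dependent estimate than one 30-sample test set
cv_scores = cross_val_score(soft_voting, X, y, cv=5)
print("5-fold CV accuracy: mean", cv_scores.mean(), "std", cv_scores.std())
```

With only two estimators, hard voting can end in ties that are resolved by class order, whereas soft voting weighs how confident each model is. Whether that helps on a dataset as small as iris is something to check empirically.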
