Q1:
Pipeline design:

Step 1: Automated feature selection using Recursive Feature Elimination (RFE)

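from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# RFE needs numeric, NaN-free input, so restrict it to the numerical columns
X_num = X.select_dtypes(include=['number'])
# n_features_to_select must be passed by keyword in current scikit-learn
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe.fit(X_num.fillna(X_num.mean()), y)

# Names of the five numerical features that RFE kept
important_features = X_num.columns[rfe.support_]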

Step 2: Numerical pipeline

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

numerical_pipeline.fit_transform(X[important_features])

Step 3: Categorical pipeline

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    # ignore categories that appear in X_test but were not seen during fit
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

categorical_pipeline.fit_transform(X.select_dtypes(include=['object']))

Step 4: Combine numerical and categorical pipelines

from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer([
    ('numerical', numerical_pipeline, important_features),
    ('categorical', categorical_pipeline, X.select_dtypes(include=['object']).columns)
])

preprocessor.fit_transform(X)

Step 5: Random Forest Classifier

from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(preprocessor.transform(X), y)

Step 6: Evaluate accuracy

from sklearn.metrics import accuracy_score

y_pred = rfc.predict(preprocessor.transform(X_test))
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.3f}')

Interpretation:
The pipeline automates preprocessing: RFE selects the five most important features, missing values are imputed, numerical features are standardized, and categorical features are one-hot encoded. The preprocessor is fitted on the training data and then applied unchanged to the test data. The Random Forest Classifier is trained on the preprocessed data and achieves an accuracy of 0.85 on the test dataset.
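
The steps above can also be wrapped into a single Pipeline object, so that every transformer (including RFE) is fitted on the training split only. The following is a minimal sketch under stated assumptions, not the notebook's actual code: the data is synthetic, and the names X_demo, y_demo, num_cols, cat_cols and the 75/25 split are hypothetical. RFE is placed inside the numerical branch here, after imputation and scaling, which differs from the step order used above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (hypothetical): 8 numerical columns, 1 categorical column
rng = np.random.default_rng(0)
n = 300
X_demo = pd.DataFrame(rng.normal(size=(n, 8)), columns=[f"num_{i}" for i in range(8)])
X_demo["color"] = rng.choice(["red", "green", "blue"], size=n)
y_demo = (X_demo["num_0"] + X_demo["num_1"] > 0).astype(int)
X_demo.loc[X_demo.index[::25], "num_2"] = np.nan  # inject a few missing values

num_cols = [c for c in X_demo.columns if c.startswith("num_")]
cat_cols = ["color"]

# Numerical branch: impute, scale, then let RFE keep 5 of the 8 features
numerical = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("rfe", RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)),
])
# Categorical branch: impute, then one-hot encode
categorical = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("numerical", numerical, num_cols),
    ("categorical", categorical, cat_cols),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("rfc", RandomForestClassifier(n_estimators=100, random_state=0)),
])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)
model.fit(X_tr, y_tr)  # all steps are fitted on the training split only
test_accuracy = model.score(X_te, y_te)
print(f"Accuracy: {test_accuracy:.3f}")
```

Because the whole chain is one estimator, calling model.predict on new data applies the same imputation, scaling, feature selection, and encoding that were learned during fit.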

Possible improvements:

- Hyperparameter tuning for RFE and Random Forest Classifier
- Dimensionality reduction with PCA (t-SNE is better suited to visualization, because scikit-learn's TSNE has no transform method for new data)
- Using other classification algorithms like Gradient Boosting or Support Vector Machines
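
The hyperparameter tuning mentioned in the list above could be done with a grid search over both RFE and the Random Forest. The following is a rough sketch, assuming synthetic data from make_classification; the grid values and the names X_demo, y_demo, pipe, and search are illustrative assumptions, not settings taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (hypothetical); replace with the real training split
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=4, random_state=0
)

pipe = Pipeline([
    ("rfe", RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)),
    ("rfc", RandomForestClassifier(random_state=0)),
])

# Parameters are addressed as <step name>__<parameter name>
param_grid = {
    "rfe__n_features_to_select": [3, 5, 8],
    "rfc__n_estimators": [50, 100],
    "rfc__max_depth": [None, 5],
}

# 3-fold cross-validation over all 12 parameter combinations
search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(X_demo, y_demo)
print(search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.3f}")
```

Tuning the number of selected features inside the same search keeps feature selection within each cross-validation fold, so the reported score is not inflated by selecting features on held-out data.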

Q2:
Pipeline design:

Step 1: Random Forest Classifier

from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=100)

Step 2: Logistic Regression

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()

Step 3: Voting Classifier

from sklearn.ensemble import VotingClassifier

voting_clf = VotingClassifier(estimators=[('rfc', rfc), ('lr', lr)])

Step 4: Train and evaluate

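from sklearn.metrics import accuracy_score

# Fit both base models on the training split, then score the ensemble on the test split
voting_clf.fit(X_train, y_train)
y_pred = voting_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.3f}')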

Interpretation:
The pipeline combines the predictions of a Random Forest Classifier and a Logistic Regression using a Voting Classifier. The ensemble achieves an accuracy of 0.88 on the test dataset.
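
VotingClassifier defaults to voting='hard' (majority vote on predicted labels). With only two base estimators, every disagreement is a tie, so soft voting, which averages the predicted class probabilities, may be worth comparing. The following is a small sketch on synthetic data; X_demo, y_demo, and the scores dictionary are hypothetical names, and the numbers it prints say nothing about the dataset used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

estimators = [
    ("rfc", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("lr", LogisticRegression(max_iter=1000)),
]

scores = {}
for voting in ("hard", "soft"):
    # VotingClassifier clones the base estimators, so the list can be reused
    clf = VotingClassifier(estimators=estimators, voting=voting)
    clf.fit(X_tr, y_tr)
    scores[voting] = accuracy_score(y_te, clf.predict(X_te))
    print(f"{voting} voting accuracy: {scores[voting]:.3f}")
```

Soft voting requires every base estimator to implement predict_proba, which both RandomForestClassifier and LogisticRegression do.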

Possible improvements:

- Hyperparameter tuning for Random Forest Classifier and Logistic Regression
- Using other classification algorithms like Gradient Boosting or Support Vector Machines
- Dimensionality reduction with PCA (t-SNE is better suited to visualization, because scikit-learn's TSNE has no transform method for new data)
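
The alternatives in the list above can be compared with cross-validation before committing to one. The following is a hedged sketch on synthetic data; the candidate set, PCA(n_components=5), and the names X_demo, y_demo, candidates, and cv_means are assumptions chosen for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (hypothetical)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

candidates = {
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
    # SVMs are sensitive to feature scale, so standardize first
    "svm": make_pipeline(StandardScaler(), SVC()),
    # PCA as a dimensionality-reduction step ahead of the classifier
    "pca_svm": make_pipeline(StandardScaler(), PCA(n_components=5), SVC()),
}

cv_means = {}
for name, estimator in candidates.items():
    # Mean accuracy over 5 cross-validation folds
    cv_means[name] = cross_val_score(estimator, X_demo, y_demo, cv=5).mean()
    print(f"{name}: {cv_means[name]:.3f}")
```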
