## Q1. You are working on a machine learning project where you have a dataset containing numerical and categorical features. You have identified that some of the features are highly correlated and there are missing values in some of the columns. You want to build a pipeline that automates the feature engineering process and handles the missing values.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score


In [2]:

# Load Titanic dataset from seaborn
titanic_data = sns.load_dataset("titanic")


In [3]:
titanic_data.head()

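|   | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |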


In [5]:
titanic_data.isnull().sum()


survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

In [9]:
titanic_data.duplicated().sum()

107

In [13]:
titanic_data=titanic_data.drop_duplicates()

In [14]:
titanic_data.duplicated().sum()

0

In [15]:
titanic_data.head()

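|   | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |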


In [16]:
titanic_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 784 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     784 non-null    int64   
 1   pclass       784 non-null    int64   
 2   sex          784 non-null    object  
 3   age          678 non-null    float64 
 4   sibsp        784 non-null    int64   
 5   parch        784 non-null    int64   
 6   fare         784 non-null    float64 
 7   embarked     782 non-null    object  
 8   class        784 non-null    category
 9   who          784 non-null    object  
 10  adult_male   784 non-null    bool    
 11  deck         202 non-null    category
 12  embark_town  782 non-null    object  
 13  alive        784 non-null    object  
 14  alone        784 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 77.0+ KB


In [17]:

# Split the data into features (X) and target variable (y)
X = titanic_data.drop("survived", axis=1)
y = titanic_data["survived"]


In [18]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [19]:
# Define numerical features and categorical features
numerical_features = X.select_dtypes(include=[np.number]).columns.tolist()
# Use the string "object": the np.object alias was deprecated in NumPy 1.20 and removed in 1.24.
# Note that category and bool columns (class, deck, adult_male, alone) match neither selector.
categorical_features = X.select_dtypes(include=["object"]).columns.tolist()


In [22]:
numerical_features,categorical_features

(['pclass', 'age', 'sibsp', 'parch', 'fare'],
 ['sex', 'embarked', 'who', 'embark_town', 'alive'])

In [23]:
# Numerical Pipeline
numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Categorical Pipeline
categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))  # do not fail on categories unseen during fit
])

In [24]:
# Combine the numerical and categorical pipelines using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_pipeline, numerical_features),
        ('cat', categorical_pipeline, categorical_features)
    ])


In [25]:
preprocessor

In [26]:
# Create the final pipeline with feature selection and a Random Forest Classifier
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('feature_selection', SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=42))),
    ('classifier', RandomForestClassifier(random_state=42))
])

In [27]:
# Fit the pipeline on the training data
pipeline.fit(X_train, y_train)


In [28]:

# Evaluate the accuracy on the test dataset
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy on the test dataset: {accuracy:.2f}")


Accuracy on the test dataset: 1.00



**Explanation of Each Step:**

1. **Load Dataset:**
   - Loaded the Titanic dataset from seaborn, counted the missing values per column, and dropped the 107 duplicate rows, leaving 784 rows.

2. **Split Data:**
   - Split the dataset into features (X) and the target variable (y).
   - Split the data into training and test sets.

3. **Numerical and Categorical Pipelines:**
   - Created separate pipelines for numerical and categorical features.
   - Imputed missing values using the mean for numerical columns and the most frequent value for categorical columns.
   - Applied standardization for numerical columns and one-hot encoding for categorical columns.

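4. **ColumnTransformer:**
   - Combined numerical and categorical pipelines using `ColumnTransformer`.
   - Columns in neither feature list (`class`, `deck`, `adult_male`, `alone`) are dropped, because `remainder` defaults to `'drop'`.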

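5. **Feature Selection and Random Forest Classifier:**
   - Selected features with `SelectFromModel`, which keeps the features whose importance in a fitted Random Forest exceeds a threshold (the mean importance by default).
   - Used a Random Forest Classifier as the final model.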

6. **Evaluate Model:**
   - Evaluated the accuracy of the model on the test dataset.
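
To make steps 3 and 4 concrete, the sketch below runs the same kind of preprocessing on a three-row toy frame. The frame, the variable names (`toy`, `toy_preprocessor`, `transformed`), and the `handle_unknown="ignore"` setting are illustrative assumptions, not part of the notebook above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numerical and one categorical column, each with a gap.
toy = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "sex": pd.Series(["male", "female", np.nan], dtype=object),
})

toy_preprocessor = ColumnTransformer([
    # Numerical branch: fill gaps with the column mean, then standardize.
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler()),
    ]), ["age"]),
    # Categorical branch: fill gaps with the most frequent value, then one-hot encode.
    ("cat", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ]), ["sex"]),
])

transformed = toy_preprocessor.fit_transform(toy)
# ColumnTransformer may return a sparse matrix; convert to a dense array for inspection.
if hasattr(transformed, "toarray"):
    transformed = transformed.toarray()
transformed = np.asarray(transformed, dtype=float)

print(transformed.shape)  # one scaled column plus one column per category
print(transformed)
```

The missing age is replaced by the column mean, so after standardization that row sits at zero in the first column, and no missing values remain in the output.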

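**Interpretation of Results:**
   - The pipeline scores 1.00 on the test set, but this is not a genuine measure of performance. The feature set still contains `alive`, which is `survived` written as `yes`/`no`, so the model reads the target directly from its input (target leakage). Dropping `alive` from `X` before fitting is needed for a meaningful accuracy estimate.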
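
A perfect score is worth checking against target leakage: in the Titanic data, `alive` is the same information as `survived`. The sketch below reproduces the effect on synthetic data, assuming a pure-noise feature and a random target. All names (`demo`, `target_copy`, `holdout_accuracy`) are hypothetical, and `max_features=None` is set so that every split can see the leaked column.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n_rows = 400

# Hypothetical data: a noise feature carrying no signal and a random binary target.
demo = pd.DataFrame({"noise": rng.normal(size=n_rows)})
demo["target"] = rng.integers(0, 2, size=n_rows)
# 'target_copy' plays the role that `alive` plays for `survived`.
demo["target_copy"] = demo["target"]


def holdout_accuracy(feature_columns):
    """Fit a random forest on the given columns and return test-set accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        demo[feature_columns], demo["target"], test_size=0.2, random_state=42
    )
    # max_features=None lets every split consider all columns (an assumption of this sketch).
    model = RandomForestClassifier(n_estimators=50, max_features=None, random_state=42)
    model.fit(X_tr, y_tr)
    return accuracy_score(y_te, model.predict(X_te))


leaky_accuracy = holdout_accuracy(["noise", "target_copy"])
clean_accuracy = holdout_accuracy(["noise"])
print(f"With the leaked column:    {leaky_accuracy:.2f}")
print(f"Without the leaked column: {clean_accuracy:.2f}")
```

With the leaked column the model is perfect even though no real signal exists. Without it, accuracy should fall to roughly chance level. For the Titanic pipeline, the analogous step would be to exclude `alive` alongside `survived` when building `X`.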

**Possible Improvements:**
   - Fine-tune hyperparameters of the Random Forest Classifier.
   - Experiment with different feature selection methods.
   - Explore other imputation strategies for missing values.
   - Consider using cross-validation for a more robust evaluation.
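
Two of these suggestions, cross-validation and hyperparameter tuning, can be sketched together. The example assumes synthetic data from `make_classification` and a small illustrative grid. The names `demo_pipeline`, `param_grid`, and `search` are hypothetical, and the grid values are not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption): 10 features, 4 of them informative.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=42
)

# Same shape as the final pipeline above, minus the preprocessing step.
demo_pipeline = Pipeline([
    ("feature_selection",
     SelectFromModel(RandomForestClassifier(n_estimators=50, random_state=42))),
    ("classifier", RandomForestClassifier(random_state=42)),
])

# 5-fold cross-validation: feature selection is refit inside every fold,
# so the held-out fold never influences which features are kept.
cv_scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=5, scoring="accuracy")
print(f"CV accuracy: {cv_scores.mean():.2f} +/- {cv_scores.std():.2f}")

# Grid search over classifier hyperparameters, addressed as <step name>__<parameter>.
param_grid = {
    "classifier__n_estimators": [50, 100],
    "classifier__max_depth": [None, 5],
}
search = GridSearchCV(demo_pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)
print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.2f}")
```

Passing the whole pipeline to `cross_val_score` or `GridSearchCV`, rather than pre-transformed data, keeps imputation, scaling, and selection inside each training fold.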

## Q2. Build a pipeline that includes a random forest classifier and a logistic regression classifier, and then use a voting classifier to combine their predictions. Train the pipeline on the iris dataset and evaluate its accuracy.

In [29]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define classifiers
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
lr_classifier = LogisticRegression(random_state=42)

# Wrap each classifier in its own pipeline with a StandardScaler
# (scaling helps logistic regression converge; the random forest does not need it)
pipeline_rf = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', rf_classifier)
])

pipeline_lr = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', lr_classifier)
])

# Create a Voting Classifier combining the individual classifiers
voting_classifier = VotingClassifier(
    estimators=[('rf', pipeline_rf), ('lr', pipeline_lr)],
    voting='hard'  # Use 'hard' for majority voting
)

# Train the pipeline on the training data
voting_classifier.fit(X_train, y_train)

# Evaluate the accuracy on the test dataset
y_pred = voting_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy using Voting Classifier: {accuracy:.2f}")


Accuracy using Voting Classifier: 1.00
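
With only two estimators, hard voting has no majority whenever they disagree, and scikit-learn then breaks the tie by picking the class that comes first in ascending label order. One possible alternative is soft voting, which averages predicted class probabilities. The sketch below is a variant under that assumption, not the method used above; `soft_voter`, `individual_scores`, and `max_iter=1000` are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=42
)

# Soft voting averages predict_proba outputs, so both estimators must support it.
soft_voter = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("lr", make_pipeline(StandardScaler(),
                             LogisticRegression(max_iter=1000, random_state=42))),
    ],
    voting="soft",
)
soft_voter.fit(X_tr, y_tr)

soft_accuracy = accuracy_score(y_te, soft_voter.predict(X_te))
print(f"Soft voting accuracy: {soft_accuracy:.2f}")

# Score each fitted member on its own to see what the ensemble adds.
individual_scores = {
    name: estimator.score(X_te, y_te)
    for name, estimator in soft_voter.named_estimators_.items()
}
for name, score in individual_scores.items():
    print(f"{name} accuracy: {score:.2f}")
```

Comparing the ensemble against its members shows whether voting helps on a given dataset. On a small, well-separated dataset such as iris the members may already agree on almost every sample.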
