Q1. You are working on a machine learning project where you have a dataset containing numerical and categorical features. You have identified that some of the features are highly correlated and there are missing values in some of the columns. You want to build a pipeline that automates the feature engineering process and handles the missing values.

Design a pipeline that includes the following steps:

- Use an automated feature selection method to identify the important features in the dataset.
- Create a numerical pipeline that includes the following steps:
  - Impute the missing values in the numerical columns using the mean of the column values.
  - Scale the numerical columns using standardisation.
- Create a categorical pipeline that includes the following steps:
  - Impute the missing values in the categorical columns using the most frequent value of the column.
  - One-hot encode the categorical columns.
- Combine the numerical and categorical pipelines using a ColumnTransformer.
- Use a Random Forest Classifier to build the final model.
- Evaluate the accuracy of the model on the test dataset.

Note! Your solution should include code snippets for each step of the pipeline, and a brief explanation of each step. You should also provide an interpretation of the results and suggest possible improvements for the pipeline.
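
As a rough orientation for Q1, the steps listed above can be sketched end to end as follows. This is a minimal sketch under stated assumptions, not the notebook's own solution: the toy data, the column names (`age`, `income`, `overtime`), the choice of `SelectKBest` with `f_classif` as the selection method, and all hyperparameters are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data (not the attrition dataset): two numerical features and
# one categorical feature, with some values blanked out to exercise the imputers.
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "age": rng.integers(18, 60, n).astype(float),
    "income": rng.normal(5000, 1500, n),
    "overtime": rng.choice(["Yes", "No"], n),
})
target = ((toy["overtime"] == "Yes") & (toy["age"] < 40)).astype(int)
toy.loc[rng.choice(n, 20, replace=False), "income"] = np.nan
toy.loc[rng.choice(n, 20, replace=False), "overtime"] = np.nan

num_cols, cat_cols = ["age", "income"], ["overtime"]

# Numerical pipeline: mean imputation, then standardisation.
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])
# Categorical pipeline: most-frequent imputation, then one-hot encoding.
cat_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer([
    ("num", num_pipeline, num_cols),
    ("cat", cat_pipeline, cat_cols),
])

# Preprocess, keep the 3 highest-scoring transformed features, then classify.
model = Pipeline([
    ("preprocess", preprocessor),
    ("select", SelectKBest(score_func=f_classif, k=3)),
    ("classify", RandomForestClassifier(n_estimators=100, random_state=0)),
])

X_train, X_test, y_train, y_test = train_test_split(
    toy, target, test_size=0.2, random_state=0, stratify=target
)
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.3f}")
```

Placing the selector inside the pipeline means it is fitted on the training split only, so no information from the test split influences which features are kept.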

Q2. Build a pipeline that includes a random forest classifier and a logistic regression classifier, and then use a voting classifier to combine their predictions. Train the pipeline on the iris dataset and evaluate its accuracy.
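
One possible shape for Q2 is sketched below. It is an illustration only: the soft-voting choice, the scaling step, the split size and the random seeds are assumptions, not requirements stated in the question.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Load iris and hold out a stratified test split (sizes are illustrative).
X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.25, random_state=42, stratify=y_iris
)

# Soft voting averages the predicted class probabilities of both models.
voting = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",
)

# Scaling mainly helps the logistic regression; the forest is insensitive to it.
iris_pipeline = Pipeline([("scaler", StandardScaler()), ("voting", voting)])
iris_pipeline.fit(X_tr, y_tr)

iris_accuracy = accuracy_score(y_te, iris_pipeline.predict(X_te))
print(f"Voting classifier accuracy on the iris test split: {iris_accuracy:.3f}")
```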

In [1]:
import pandas as pd

data = pd.read_csv('/Users/aakanksha/My_Codes/data-science-master-course/data/WA_Fn-UseC_-HR-Employee-Attrition.csv')
data

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1465,36,No,Travel_Frequently,884,Research & Development,23,2,Medical,1,2061,...,3,80,1,17,3,3,5,2,0,3
1466,39,No,Travel_Rarely,613,Research & Development,6,1,Medical,1,2062,...,1,80,1,9,5,3,7,7,1,7
1467,27,No,Travel_Rarely,155,Research & Development,4,3,Life Sciences,1,2064,...,2,80,1,6,0,3,6,2,0,3
1468,49,No,Travel_Frequently,1023,Sales,2,3,Medical,1,2065,...,4,80,0,17,3,2,9,6,0,8


In [2]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64 
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64 
 6   Education                 1470 non-null   int64 
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64 
 9   EmployeeNumber            1470 non-null   int64 
 10  EnvironmentSatisfaction   1470 non-null   int64 
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64 
 13  JobInvolvement            1470 non-null   int64 
 14  JobLevel                

In [3]:
data.describe()

Unnamed: 0,Age,DailyRate,DistanceFromHome,Education,EmployeeCount,EmployeeNumber,EnvironmentSatisfaction,HourlyRate,JobInvolvement,JobLevel,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
count,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,...,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0,1470.0
mean,36.92381,802.485714,9.192517,2.912925,1.0,1024.865306,2.721769,65.891156,2.729932,2.063946,...,2.712245,80.0,0.793878,11.279592,2.79932,2.761224,7.008163,4.229252,2.187755,4.123129
std,9.135373,403.5091,8.106864,1.024165,0.0,602.024335,1.093082,20.329428,0.711561,1.10694,...,1.081209,0.0,0.852077,7.780782,1.289271,0.706476,6.126525,3.623137,3.22243,3.568136
min,18.0,102.0,1.0,1.0,1.0,1.0,1.0,30.0,1.0,1.0,...,1.0,80.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
25%,30.0,465.0,2.0,2.0,1.0,491.25,2.0,48.0,2.0,1.0,...,2.0,80.0,0.0,6.0,2.0,2.0,3.0,2.0,0.0,2.0
50%,36.0,802.0,7.0,3.0,1.0,1020.5,3.0,66.0,3.0,2.0,...,3.0,80.0,1.0,10.0,3.0,3.0,5.0,3.0,1.0,3.0
75%,43.0,1157.0,14.0,4.0,1.0,1555.75,4.0,83.75,3.0,3.0,...,4.0,80.0,1.0,15.0,3.0,3.0,9.0,7.0,3.0,7.0
max,60.0,1499.0,29.0,5.0,1.0,2068.0,4.0,100.0,4.0,5.0,...,4.0,80.0,3.0,40.0,6.0,4.0,40.0,18.0,15.0,17.0
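
In the summary above, EmployeeCount and StandardHours have a standard deviation of 0.0 and identical min and max, so they are constant and cannot help any model. A minimal sketch of removing such columns automatically with `VarianceThreshold` follows; the small frame is a hypothetical stand-in built from the first five rows shown earlier.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy frame that mimics the pattern above: two constant columns.
toy_num = pd.DataFrame({
    "Age": [41, 49, 37, 33, 27],
    "EmployeeCount": [1, 1, 1, 1, 1],
    "StandardHours": [80, 80, 80, 80, 80],
    "DailyRate": [1102, 279, 1373, 1392, 591],
})

# threshold=0.0 drops every feature whose variance is exactly zero.
selector = VarianceThreshold(threshold=0.0)
selector.fit(toy_num)

kept_columns = toy_num.columns[selector.get_support()].tolist()
print(kept_columns)  # ['Age', 'DailyRate']
```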


In [4]:
data.isnull().sum()

Age                         0
Attrition                   0
BusinessTravel              0
DailyRate                   0
Department                  0
DistanceFromHome            0
Education                   0
EducationField              0
EmployeeCount               0
EmployeeNumber              0
EnvironmentSatisfaction     0
Gender                      0
HourlyRate                  0
JobInvolvement              0
JobLevel                    0
JobRole                     0
JobSatisfaction             0
MaritalStatus               0
MonthlyIncome               0
MonthlyRate                 0
NumCompaniesWorked          0
Over18                      0
OverTime                    0
PercentSalaryHike           0
PerformanceRating           0
RelationshipSatisfaction    0
StandardHours               0
StockOptionLevel            0
TotalWorkingYears           0
TrainingTimesLastYear       0
WorkLifeBalance             0
YearsAtCompany              0
YearsInCurrentRole          0
YearsSince

In [5]:
X = data.drop(labels=['Attrition'], axis=1)
y = data[['Attrition']]

In [6]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
data['Attrition'] = encoder.fit_transform(data['Attrition'])
data[['Attrition']]

Unnamed: 0,Attrition
0,1
1,0
2,1
3,0
4,0
...,...
1465,0
1466,0
1467,0
1468,0


In [7]:
from sklearn.impute import SimpleImputer  # handling missing values
from sklearn.preprocessing import OneHotEncoder  # encoding categorical features
from sklearn.preprocessing import StandardScaler  # feature scaling
from sklearn.pipeline import Pipeline  # chaining preprocessing steps into one estimator
from sklearn.compose import ColumnTransformer  # applying different pipelines to different columns

In [8]:
# Split the columns of 'data' by dtype into numerical and categorical features

# Select numerical features
# Note: 'Attrition' was label-encoded above, so it is now int64 and is included here
numerical_features = data.select_dtypes(include=['float64', 'int64'])

# Select categorical features
categorical_features = data.select_dtypes(include=['object'])

categorical_features

Unnamed: 0,BusinessTravel,Department,EducationField,Gender,JobRole,MaritalStatus,Over18,OverTime
0,Travel_Rarely,Sales,Life Sciences,Female,Sales Executive,Single,Y,Yes
1,Travel_Frequently,Research & Development,Life Sciences,Male,Research Scientist,Married,Y,No
2,Travel_Rarely,Research & Development,Other,Male,Laboratory Technician,Single,Y,Yes
3,Travel_Frequently,Research & Development,Life Sciences,Female,Research Scientist,Married,Y,Yes
4,Travel_Rarely,Research & Development,Medical,Male,Laboratory Technician,Married,Y,No
...,...,...,...,...,...,...,...,...
1465,Travel_Frequently,Research & Development,Medical,Male,Laboratory Technician,Married,Y,No
1466,Travel_Rarely,Research & Development,Medical,Male,Healthcare Representative,Married,Y,No
1467,Travel_Rarely,Research & Development,Life Sciences,Male,Manufacturing Director,Married,Y,Yes
1468,Travel_Frequently,Sales,Medical,Male,Sales Executive,Married,Y,No


In [9]:
numerical_features.shape

(1470, 27)

In [10]:
from sklearn.feature_selection import SelectKBest, chi2

# Perform one-hot encoding on the categorical features
categorical_features_encoded = pd.get_dummies(categorical_features)

# Perform feature selection
k_best = SelectKBest(score_func=chi2, k=5)
selected_features = k_best.fit_transform(categorical_features_encoded, y)

# Get the column names of the selected features
selected_feature_indices = k_best.get_support(indices=True)
selected_categorical_feature = categorical_features_encoded.columns[selected_feature_indices]

# Print the selected feature names
print("Top 5 Selected Categorical Features:")
print(selected_categorical_feature)

Top 5 Selected Categorical Features:
Index(['BusinessTravel_Travel_Frequently', 'JobRole_Sales Representative',
       'MaritalStatus_Single', 'OverTime_No', 'OverTime_Yes'],
      dtype='object')
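
The names selected above are dummy-column names produced by `pd.get_dummies`, not columns of the original frame, so they cannot be handed directly to a ColumnTransformer that works on the raw data. A small sketch of mapping them back to their source columns is shown below; the toy frame and the `selected_dummies` list are hypothetical, and the prefix match assumes that no column name is a prefix of another.

```python
import pandas as pd

# Hypothetical toy frame using a few category values from the table above.
toy_cat = pd.DataFrame({
    "BusinessTravel": ["Travel_Rarely", "Travel_Frequently", "Travel_Rarely"],
    "OverTime": ["Yes", "No", "Yes"],
})
dummies = pd.get_dummies(toy_cat)

# get_dummies names each new column "<source column>_<category value>",
# so record which source column every dummy column came from.
dummy_to_source = {
    dummy: source
    for source in toy_cat.columns
    for dummy in dummies.columns
    if dummy.startswith(source + "_")
}

# Pretend a selector kept these dummy columns (illustrative choice).
selected_dummies = ["BusinessTravel_Travel_Frequently", "OverTime_No", "OverTime_Yes"]
source_columns = sorted({dummy_to_source[d] for d in selected_dummies})
print(source_columns)  # ['BusinessTravel', 'OverTime']
```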


In [13]:
from warnings import filterwarnings
filterwarnings('ignore')

from sklearn.feature_selection import SelectKBest, f_classif
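# Indexing X with the DataFrame 'numerical_features' raised
# "ValueError: Boolean array expected for the condition, not int64".
# X no longer contains 'Attrition', so select the numerical columns of X by dtype instead.
X_num = X.select_dtypes(include=['float64', 'int64'])
k_best_numerical = SelectKBest(f_classif, k=10)
k_best_numerical.fit(X_num, y.values.ravel())
selected_num_features = list(X_num.columns[k_best_numerical.get_support()])
selected_num_features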

In [40]:
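# '+' on two pandas Index objects is element-wise, which raised
# "ValueError: operands could not be broadcast together with shapes (10,) (5,)";
# concatenate the two sets of names as lists instead
selected_features = list(selected_num_features) + list(selected_categorical_feature)
selected_features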
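
When two sets of selected names are held as pandas Index objects, they can be joined by concatenation rather than with `+`. The sketch below uses hypothetical stand-in selections; the names are columns of this dataset, but they are not claimed to be the features the selectors actually picked.

```python
import pandas as pd

# Hypothetical stand-ins for the numerical and categorical selections.
num_selected = pd.Index(["Age", "JobLevel", "MonthlyIncome"])
cat_selected = pd.Index(["OverTime_No", "OverTime_Yes"])

# Index.append concatenates the labels; list(...) + list(...) gives the same result.
combined = num_selected.append(cat_selected).tolist()
assert combined == list(num_selected) + list(cat_selected)
print(combined)
```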

In [30]:
#creating numerical and categorical pipeline

from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder,StandardScaler

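# Numerical pipeline: mean imputation, then standardisation
num_pipeline = Pipeline(steps=[('imputer', SimpleImputer(strategy='mean')),
                               ('scaler', StandardScaler())])

# Categorical pipeline: most-frequent imputation, then one-hot encoding.
# handle_unknown='ignore' avoids errors on categories not seen during fit;
# with_mean=False lets the scaler run on the encoder's sparse output.
cat_pipeline = Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')),
                               ('one_hot_encoder', OneHotEncoder(handle_unknown='ignore')),
                               ('scaler', StandardScaler(with_mean=False))])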

In [None]:
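from sklearn.compose import ColumnTransformer
# ColumnTransformer needs original column names, not dummy-column names;
# these are the source columns of the top-5 chi2 features printed earlier
selected_cat_features = ['BusinessTravel', 'JobRole', 'MaritalStatus', 'OverTime']
preprocessor = ColumnTransformer([('num_pipeline', num_pipeline, selected_num_features),
                                  ('cat_pipeline', cat_pipeline, selected_cat_features)])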

In [None]:
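# Split the data, then transform it with the ColumnTransformer
# (test_size and random_state are illustrative choices)
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=42)
xtrain_transformed = pd.DataFrame(preprocessor.fit_transform(xtrain), columns=preprocessor.get_feature_names_out())
xtest_transformed = pd.DataFrame(preprocessor.transform(xtest), columns=preprocessor.get_feature_names_out())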

In [None]:
preprocessor.get_feature_names_out()

In [20]:
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Read the dataset
df = pd.read_csv('/Users/aakanksha/My_Codes/data-science-master-course/data/WA_Fn-UseC_-HR-Employee-Attrition.csv')

# Select only the numerical features (the dataset has no missing values to fill)
numerical_features = df.select_dtypes(include=['int64', 'float64'])

# Drop constant columns (EmployeeCount and StandardHours have zero variance)
numerical_features = numerical_features.loc[:, numerical_features.nunique() > 1]

# Extract the target variable and encode it as 0/1; the raw 'Yes'/'No' strings
# are what f_regression could not handle (the TypeError seen on the first attempt)
target = df['Attrition'].map({'No': 0, 'Yes': 1})

# Perform feature selection; f_classif is the ANOVA F-test for a categorical target
k_best = SelectKBest(score_func=f_classif, k=10)
selected_features = k_best.fit_transform(numerical_features, target)

# Get the column names of the selected features
selected_feature_indices = k_best.get_support(indices=True)
selected_feature_names = numerical_features.columns[selected_feature_indices]

# Print the selected feature names
print("Top 10 Selected Numerical Features:")
print(selected_feature_names)