# Feature Engineering

Feature engineering is the process of using domain knowledge of the data to create features that help machine learning algorithms work.

Feature engineering has two primary goals:

- Preparing an input dataset that is compatible with the requirements of the machine learning algorithm
- Improving the performance of machine learning models

# 1. Feature Encoding

# Categorical Encoding

Machine learning algorithms operate on numbers, not text, so each text category 
has to be converted to a number before it can be used in mathematical equations.

Converting categorical columns to numerical columns so that a machine 
learning algorithm can use them is called categorical encoding.

In [9]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

# Label Encoding

In [10]:
Data = ['None','low','medium','high','very high']

Data = pd.DataFrame(Data, columns=['Types'])



In [11]:
Data

Unnamed: 0,Types
0,None
1,low
2,medium
3,high
4,very high


In [12]:
labelencoder = LabelEncoder()

label_data = labelencoder.fit_transform(Data['Types'])
label_data

array([0, 2, 3, 1, 4], dtype=int64)
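
The codes above can look arbitrary. They follow the sorted order of the unique values, which `LabelEncoder` stores in its `classes_` attribute. A minimal sketch, reusing the same five strings (the `levels` and `mapping` names are illustrative; note that uppercase 'N' sorts before the lowercase letters):

```python
from sklearn.preprocessing import LabelEncoder

levels = ['None', 'low', 'medium', 'high', 'very high']

encoder = LabelEncoder()
codes = encoder.fit_transform(levels)

# classes_ holds the unique values in sorted order;
# a value's code is its index in that array
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)         # {'None': 0, 'high': 1, 'low': 2, 'medium': 3, 'very high': 4}
print(codes.tolist())  # [0, 2, 3, 1, 4]
```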

In [21]:
Data['label'] = label_data

In [15]:
Data

Unnamed: 0,Types,label
0,None,0
1,low,2
2,medium,3
3,high,1
4,very high,4


In [23]:
# inverse_transform expects the integer codes, not the original strings
Data['label_types'] = labelencoder.inverse_transform(Data['label'])

In [25]:
df = pd.read_csv("income.csv")

In [30]:
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


In [26]:
df['education'].unique()

array(['11th', 'HS-grad', 'Assoc-acdm', 'Some-college', '10th',
       'Prof-school', '7th-8th', 'Bachelors', 'Masters', 'Doctorate',
       '5th-6th', 'Assoc-voc', '9th', '12th', '1st-4th', 'Preschool'], dtype=object)

In [27]:
df['education'] = labelencoder.fit_transform(df['education'])

In [28]:
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,1,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,11,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,7,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,15,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,15,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


In [29]:
df[['gender','income']] = df[['gender','income']].apply(labelencoder.fit_transform)


In [30]:
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,1,7,Never-married,Machine-op-inspct,Own-child,Black,1,0,0,40,United-States,0
1,38,Private,89814,11,9,Married-civ-spouse,Farming-fishing,Husband,White,1,0,0,50,United-States,0
2,28,Local-gov,336951,7,12,Married-civ-spouse,Protective-serv,Husband,White,1,0,0,40,United-States,1
3,44,Private,160323,15,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,1,7688,0,40,United-States,1
4,18,?,103497,15,10,Never-married,?,Own-child,White,0,0,0,30,United-States,0
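
Note that `apply(labelencoder.fit_transform)` refits the same encoder on each column in turn, so afterwards it only remembers the last column it saw. One way to keep every mapping is a separate encoder per column. The sketch below assumes a small stand-in frame (`toy`) and a hypothetical `encoders` dict, neither of which is in the original notebook:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Stand-in for two columns of income.csv (values are illustrative)
toy = pd.DataFrame({
    'gender': ['Male', 'Male', 'Female'],
    'income': ['<=50K', '>50K', '<=50K'],
})

encoders = {}  # hypothetical helper: column name -> fitted encoder
for col in ['gender', 'income']:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

# Each column can now be decoded with its own encoder
decoded_gender = encoders['gender'].inverse_transform(toy['gender'])
print(toy)
print(decoded_gender.tolist())
```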


Label encoding is straightforward, but it has a disadvantage: algorithms can misinterpret the numeric values as having some kind of hierarchy or order.

One-hot encoding, another common approach, addresses this ordering issue. In this strategy, each category value becomes a new column, and each row is assigned a 1 or 0 (true/false) in that column.
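
To make the ordering concern concrete: on the five levels used earlier, `LabelEncoder` assigns codes alphabetically, so 'high' (1) ends up below 'low' (2). When the categories really are ordered, one option is an explicit mapping. This is only a sketch; the `order` dict is a hypothetical hand-written ranking, not part of the original notebook:

```python
import pandas as pd

levels = pd.Series(['None', 'low', 'medium', 'high', 'very high'], name='Types')

# Hypothetical ranking that follows the natural meaning of the labels
order = {'None': 0, 'low': 1, 'medium': 2, 'high': 3, 'very high': 4}
ordinal = levels.map(order)

print(ordinal.tolist())  # [0, 1, 2, 3, 4]
```

For categories with no natural order, such as departments, one-hot encoding avoids implying one.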

# One-Hot Encoder

In [5]:
Employee={
    'Department':['Sales','HR' ,'DataScience','Finance','HR','Finance'],
    'Age':[44,34,26,35,23,30],
    'Salary':[72000,65000,80000,35000,30000,40000]
}
Employee

{'Age': [44, 34, 26, 35, 23, 30],
 'Department': ['Sales', 'HR', 'DataScience', 'Finance', 'HR', 'Finance'],
 'Salary': [72000, 65000, 80000, 35000, 30000, 40000]}

In [6]:
data=pd.DataFrame(Employee)

In [7]:
data

Unnamed: 0,Age,Department,Salary
0,44,Sales,72000
1,34,HR,65000
2,26,DataScience,80000
3,35,Finance,35000
4,23,HR,30000
5,30,Finance,40000


In [28]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 3 columns):
Age           6 non-null int64
Department    6 non-null object
Salary        6 non-null int64
dtypes: int64(2), object(1)
memory usage: 224.0+ bytes


Department is the categorical feature, as shown by its object data type; 
the remaining columns are numerical features of type int64.

# Using pandas get_dummies for one-hot encoding

In [25]:
dummies = pd.get_dummies(data.Department)
dummies

Unnamed: 0,DataScience,Finance,HR,Sales
0,0,0,0,1
1,0,0,1,0
2,1,0,0,0
3,0,1,0,0
4,0,0,1,0
5,0,1,0,0
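
The 0/1 output above matches older pandas releases. In pandas 2.0 and later, `get_dummies` returns boolean columns by default, so the same call prints True/False instead. Passing `dtype=int` is one way to get integers back. A minimal sketch on the same department values (`departments` is an illustrative name):

```python
import pandas as pd

departments = pd.Series(
    ['Sales', 'HR', 'DataScience', 'Finance', 'HR', 'Finance'],
    name='Department',
)

# dtype=int requests 0/1 integers regardless of the pandas default
dummies = pd.get_dummies(departments, dtype=int)
print(dummies)
```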


In [26]:
merged = pd.concat([data,dummies],axis='columns')
merged

Unnamed: 0,Age,Department,Salary,DataScience,Finance,HR,Sales
0,44,Sales,72000,0,0,0,1
1,34,HR,65000,0,0,1,0
2,26,DataScience,80000,1,0,0,0
3,35,Finance,35000,0,1,0,0
4,23,HR,30000,0,0,1,0
5,30,Finance,40000,0,1,0,0


In [27]:
final = merged.drop(['Department'], axis='columns')
final

Unnamed: 0,Age,Salary,DataScience,Finance,HR,Sales
0,44,72000,0,0,0,1
1,34,65000,0,0,1,0
2,26,80000,1,0,0,0
3,35,35000,0,1,0,0
4,23,30000,0,0,1,0
5,30,40000,0,1,0,0
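
The imports at the top include scikit-learn's `OneHotEncoder`, which the notebook does not go on to use. Below is a sketch of the same Department encoding with it, assuming the encoder's default sparse output and converting it with `toarray()`. The `matrix` and `encoded` names are illustrative:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

data = pd.DataFrame({
    'Department': ['Sales', 'HR', 'DataScience', 'Finance', 'HR', 'Finance'],
    'Age': [44, 34, 26, 35, 23, 30],
    'Salary': [72000, 65000, 80000, 35000, 30000, 40000],
})

encoder = OneHotEncoder()
# fit_transform expects 2-D input, hence the double brackets
matrix = encoder.fit_transform(data[['Department']]).toarray()

# categories_ lists the learned categories per input column, in sorted order
encoded = pd.DataFrame(
    matrix.astype(int), columns=encoder.categories_[0], index=data.index
)
final = pd.concat([data.drop(columns='Department'), encoded], axis='columns')
print(final)
```

Unlike `get_dummies`, a fitted encoder can be reused with `transform` on new data, which keeps the output columns consistent between training and prediction.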


In [38]:
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,1,7,Never-married,Machine-op-inspct,Own-child,Black,1,0,0,40,United-States,0
1,38,Private,89814,11,9,Married-civ-spouse,Farming-fishing,Husband,White,1,0,0,50,United-States,0
2,28,Local-gov,336951,7,12,Married-civ-spouse,Protective-serv,Husband,White,1,0,0,40,United-States,1
3,44,Private,160323,15,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,1,7688,0,40,United-States,1
4,18,?,103497,15,10,Never-married,?,Own-child,White,0,0,0,30,United-States,0


In [37]:
dum_df = pd.get_dummies(df, columns=["relationship"], prefix=["Type_is"] )
dum_df.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income,Type_is_Husband,Type_is_Not-in-family,Type_is_Other-relative,Type_is_Own-child,Type_is_Unmarried,Type_is_Wife
0,25,Private,226802,1,7,Never-married,Machine-op-inspct,Black,1,0,0,40,United-States,0,0,0,0,1,0,0
1,38,Private,89814,11,9,Married-civ-spouse,Farming-fishing,White,1,0,0,50,United-States,0,1,0,0,0,0,0
2,28,Local-gov,336951,7,12,Married-civ-spouse,Protective-serv,White,1,0,0,40,United-States,1,1,0,0,0,0,0
3,44,Private,160323,15,10,Married-civ-spouse,Machine-op-inspct,Black,1,7688,0,40,United-States,1,1,0,0,0,0,0
4,18,?,103497,15,10,Never-married,?,White,0,0,0,30,United-States,0,0,0,0,1,0,0


# Applying get_dummies to the whole dataset

In [39]:
new_df = pd.read_csv("income.csv")

In [40]:
dummi_df = pd.get_dummies(new_df)

In [41]:
dummi_df.head()

Unnamed: 0,age,fnlwgt,educational-num,capital-gain,capital-loss,hours-per-week,workclass_?,workclass_Federal-gov,workclass_Local-gov,workclass_Never-worked,...,native-country_Scotland,native-country_South,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Vietnam,native-country_Yugoslavia,income_<=50K,income_>50K
0,25,226802,7,0,0,40,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
1,38,89814,9,0,0,50,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
2,28,336951,12,0,0,40,0,0,1,0,...,0,0,0,0,0,1,0,0,0,1
3,44,160323,10,7688,0,40,0,0,0,0,...,0,0,0,0,0,1,0,0,0,1
4,18,103497,10,0,0,30,1,0,0,0,...,0,0,0,0,0,1,0,0,1,0
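
Calling `get_dummies` on the whole frame also splits the target into `income_<=50K` and `income_>50K`, as the last two columns show. If `income` is what a model should predict, one common approach is to set it aside before encoding the features. A sketch on a three-row stand-in for income.csv (the `toy`, `X`, and `y` names are illustrative, not from the notebook):

```python
import pandas as pd

# Stand-in rows; the real file has many more columns and categories
toy = pd.DataFrame({
    'age': [25, 38, 28],
    'workclass': ['Private', 'Private', 'Local-gov'],
    'income': ['<=50K', '<=50K', '>50K'],
})

# Keep the target as a single 0/1 column
y = (toy['income'] == '>50K').astype(int)

# One-hot encode the feature columns only
X = pd.get_dummies(toy.drop(columns='income'), dtype=int)

print(X.columns.tolist())
print(y.tolist())
```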
