**Feature engineering**

- So far we have performed feature analysis:
   - Data frame quick checks
   - Categorical column analysis
   - Bar charts, pie charts
   - Numerical column analysis
   - Histograms, box plots, outliers
   - Bivariate and multivariate analysis
   - Correlation etc.
- Now we need to learn feature engineering:
    - we will create new columns that help ML models perform better
    - we will perform **Encoding**, which converts categorical data to numerical data
    - we will do data **transformations**
    - we will perform **Scaling: standardization and normalization**
    - we will perform **missing value analysis**
- Put simply, feature engineering means modifying the data so that it is better suited to modelling

- Feature analysis
- Feature engineering
- Feature selection
- Model development
- Model evaluation
- Model tuning
- Model deployment

## Encoding

- Encoding means converting categorical data to numerical data
- ML/DL models, like any other models, work on mathematics
- Models cannot interpret text labels directly
- so it is important to convert categorical labels to numerical values
- for example, a Gender column has two labels
   - Male
   - Female
- then Female is represented as 0 and Male as 1, based on alphabetical order

In [25]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
file_path=r"C:\Users\sneha\Documents\Data_files\Visadataset.csv"
visa_df=pd.read_csv(file_path)
visa_df
cat=visa_df.select_dtypes(include='object').columns
num=visa_df.select_dtypes(exclude='object').columns

In [21]:
visa_df['case_status'].unique()

array(['Denied', 'Certified'], dtype=object)

- Certified should be assigned 0
- Denied should be assigned 1

In [22]:
#mapping
d={'Certified':0,'Denied':1}
d

{'Certified': 0, 'Denied': 1}

In [13]:
visa_df['case_status'].map(d)

0        1
1        0
2        1
3        1
4        0
        ..
25475    0
25476    0
25477    0
25478    0
25479    0
Name: case_status, Length: 25480, dtype: int64

In [14]:
visa_df['case_status']=visa_df['case_status'].map(d)
visa_df

| | case_id | continent | education_of_employee | has_job_experience | requires_job_training | no_of_employees | yr_of_estab | region_of_employment | prevailing_wage | unit_of_wage | full_time_position | case_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | EZYV01 | Asia | High School | N | N | 14513 | 2007 | West | 592.2029 | Hour | Y | 1 |
| 1 | EZYV02 | Asia | Master's | Y | N | 2412 | 2002 | Northeast | 83425.6500 | Year | Y | 0 |
| 2 | EZYV03 | Asia | Bachelor's | N | Y | 44444 | 2008 | West | 122996.8600 | Year | Y | 1 |
| 3 | EZYV04 | Asia | Bachelor's | N | N | 98 | 1897 | West | 83434.0300 | Year | Y | 1 |
| 4 | EZYV05 | Africa | Master's | Y | N | 1082 | 2005 | South | 149907.3900 | Year | Y | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 25475 | EZYV25476 | Asia | Bachelor's | Y | Y | 2601 | 2008 | South | 77092.5700 | Year | Y | 0 |
| 25476 | EZYV25477 | Asia | High School | Y | N | 3274 | 2006 | Northeast | 279174.7900 | Year | Y | 0 |
| 25477 | EZYV25478 | Asia | Master's | Y | N | 1121 | 1910 | South | 146298.8500 | Year | N | 0 |
| 25478 | EZYV25479 | Asia | Master's | Y | Y | 1918 | 1887 | West | 86154.7700 | Year | Y | 0 |


In [None]:
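# If you are getting NaN, you ran the mapping cell twice:
# after the first run the column holds 0/1, so the string keys no longer match.
# Re-run the steps one by one from the start:
# read the data
# select the columns
# create a mapper
# map it to the selected column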
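
The NaN problem described above can be reproduced on a small toy series. The sketch below uses made-up values rather than the visa dataset, and `map_once` is a hypothetical helper, not something from pandas; it is one possible way to make the cell safe to re-run.

```python
import pandas as pd

# Toy stand-in for the case_status column (values are illustrative).
status = pd.Series(["Denied", "Certified", "Denied"])
mapper = {"Certified": 0, "Denied": 1}

once = status.map(mapper)   # strings -> 0/1
twice = once.map(mapper)    # 0 and 1 are not keys in mapper -> NaN

print(once.tolist())        # [1, 0, 1]
print(twice.isna().all())   # True

def map_once(column, mapping):
    """Hypothetical guard: map only while every value is still a known label."""
    if column.isin(list(mapping)).all():
        return column.map(mapping)
    return column  # already encoded (or holds unknown labels): leave unchanged

safe = map_once(map_once(status, mapper), mapper)
print(safe.tolist())        # [1, 0, 1]
```

Re-reading the CSV before mapping, as the later cells do, avoids the problem in the same way: the column is back to its original strings each time.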

In [26]:
visa_df['continent'].unique()

array(['Asia', 'Africa', 'North America', 'Europe', 'South America',
       'Oceania'], dtype=object)

In [27]:
# create a dictionary
d={}
d['Asia']=0
d['Africa']=1
d['North America']=2
d['Europe']=3
d['South America']=4
d['Oceania']=5
d

{'Asia': 0,
 'Africa': 1,
 'North America': 2,
 'Europe': 3,
 'South America': 4,
 'Oceania': 5}

In [29]:
labels=visa_df['continent'].unique()
n=len(labels)
d={}
for i in range(n):
    d[labels[i]]=i
d

{'Asia': 0,
 'Africa': 1,
 'North America': 2,
 'Europe': 3,
 'South America': 4,
 'Oceania': 5}

In [32]:
labels=sorted(visa_df['continent'].unique())
d={}
for i, j in enumerate(labels):
    d[j]=i
d

{'Africa': 0,
 'Asia': 1,
 'Europe': 2,
 'North America': 3,
 'Oceania': 4,
 'South America': 5}

In [36]:
l=['Apple','Banana']
for i in l:
    print(i)

Apple
Banana


In [34]:
l=['Apple','Banana']
for i in range(len(l)):
    print(i)

0
1


In [37]:
l=['Apple','Banana']
for i in range(len(l)):
    print(i,l[i])

0 Apple
1 Banana


In [38]:
l=['Apple','Banana']
for i,j in enumerate(l):
    print(i,j)

0 Apple
1 Banana


In [41]:
labels=sorted(visa_df['continent'].unique())
d={j:i for i,j in enumerate(labels)}
visa_df['continent']=visa_df['continent'].map(d)

In [42]:
visa_df

| | case_id | continent | education_of_employee | has_job_experience | requires_job_training | no_of_employees | yr_of_estab | region_of_employment | prevailing_wage | unit_of_wage | full_time_position | case_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | EZYV01 | 1 | High School | N | N | 14513 | 2007 | West | 592.2029 | Hour | Y | Denied |
| 1 | EZYV02 | 1 | Master's | Y | N | 2412 | 2002 | Northeast | 83425.6500 | Year | Y | Certified |
| 2 | EZYV03 | 1 | Bachelor's | N | Y | 44444 | 2008 | West | 122996.8600 | Year | Y | Denied |
| 3 | EZYV04 | 1 | Bachelor's | N | N | 98 | 1897 | West | 83434.0300 | Year | Y | Denied |
| 4 | EZYV05 | 0 | Master's | Y | N | 1082 | 2005 | South | 149907.3900 | Year | Y | Certified |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 25475 | EZYV25476 | 1 | Bachelor's | Y | Y | 2601 | 2008 | South | 77092.5700 | Year | Y | Certified |
| 25476 | EZYV25477 | 1 | High School | Y | N | 3274 | 2006 | Northeast | 279174.7900 | Year | Y | Certified |
| 25477 | EZYV25478 | 1 | Master's | Y | N | 1121 | 1910 | South | 146298.8500 | Year | N | Certified |
| 25478 | EZYV25479 | 1 | Master's | Y | Y | 1918 | 1887 | West | 86154.7700 | Year | Y | Certified |


In [None]:
labels=sorted(visa_df['continent'].unique())
d={j:i for i,j in enumerate(labels)}
visa_df['continent']=visa_df['continent'].map(d)
# the three lines above must be repeated for every categorical column
# except case_id

In [44]:
# step-1: read the data
# step-2: get the categorical columns
# step-3: for each categorical column col (except case_id):
# labels=sorted(visa_df[col].unique())
# d={label:code for code,label in enumerate(labels)}
# visa_df[col]=visa_df[col].map(d)

In [47]:

file_path=r"C:\Users\sneha\Documents\Data_files\Visadataset.csv"
visa_df=pd.read_csv(file_path)
cat=visa_df.select_dtypes(include='object').columns
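# cat[0] is case_id, an identifier, so skip it
for col in cat[1:]:
    labels=sorted(visa_df[col].unique())
    # separate names so the comprehension does not reuse the loop variable
    d={label:code for code,label in enumerate(labels)}
    visa_df[col]=visa_df[col].map(d)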

visa_df
    

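| | case_id | continent | education_of_employee | has_job_experience | requires_job_training | no_of_employees | yr_of_estab | region_of_employment | prevailing_wage | unit_of_wage | full_time_position | case_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | EZYV01 | 1 | 2 | 0 | 0 | 14513 | 2007 | 4 | 592.2029 | 0 | 1 | 1 |
| 1 | EZYV02 | 1 | 3 | 1 | 0 | 2412 | 2002 | 2 | 83425.6500 | 3 | 1 | 0 |
| 2 | EZYV03 | 1 | 0 | 0 | 1 | 44444 | 2008 | 4 | 122996.8600 | 3 | 1 | 1 |
| 3 | EZYV04 | 1 | 0 | 0 | 0 | 98 | 1897 | 4 | 83434.0300 | 3 | 1 | 1 |
| 4 | EZYV05 | 0 | 3 | 1 | 0 | 1082 | 2005 | 3 | 149907.3900 | 3 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 25475 | EZYV25476 | 1 | 0 | 1 | 1 | 2601 | 2008 | 3 | 77092.5700 | 3 | 1 | 0 |
| 25476 | EZYV25477 | 1 | 2 | 1 | 0 | 3274 | 2006 | 2 | 279174.7900 | 3 | 1 | 0 |
| 25477 | EZYV25478 | 1 | 3 | 1 | 0 | 1121 | 1910 | 3 | 146298.8500 | 3 | 0 | 0 |
| 25478 | EZYV25479 | 1 | 3 | 1 | 1 | 1918 | 1887 | 4 | 86154.7700 | 3 | 1 | 0 |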
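
The loop above overwrites `d` on every pass, so once it finishes there is no record of which number stands for which label. A minimal sketch of one way to keep that record follows; `encode_columns` is a hypothetical helper name and the toy rows are made up, with only the column names borrowed from the dataset.

```python
import pandas as pd

def encode_columns(df, columns):
    """Return an encoded copy of df plus the mapper used for each column."""
    out = df.copy()
    mappers = {}
    for col in columns:
        labels = sorted(out[col].unique())
        mappers[col] = {label: code for code, label in enumerate(labels)}
        out[col] = out[col].map(mappers[col])
    return out, mappers

# Toy frame; rows are illustrative only.
toy = pd.DataFrame({
    "continent": ["Asia", "Africa", "Europe", "Asia"],
    "case_status": ["Denied", "Certified", "Denied", "Certified"],
})
encoded, mappers = encode_columns(toy, ["continent", "case_status"])
print(mappers["continent"])            # {'Africa': 0, 'Asia': 1, 'Europe': 2}
print(encoded["continent"].tolist())   # [1, 0, 2, 1]

# Invert a stored mapper to recover the original labels.
inverse = {code: label for label, code in mappers["continent"].items()}
decoded = encoded["continent"].map(inverse)
print(decoded.tolist())                # ['Asia', 'Africa', 'Europe', 'Asia']
```

Because the function works on a copy, the original frame keeps its strings and the cell can be re-run without producing NaN.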


## Label Encoder

- LabelEncoder is a class that converts categorical labels to numerical values
- It comes from the package **sklearn (scikit-learn)**
   - The module name: **preprocessing**
     - The class name: LabelEncoder

In [48]:
# from <package>.<module> import <class>
from sklearn.preprocessing import LabelEncoder


- step-1: import the class
- step-2: create an instance of the class
- step-3: apply fit_transform
- **fit_transform**
   - **fit**: means finding the logic
     - in the example above, fit means creating the **mapper** or **dict**
   - **transform**: apply the logic and change the data
     - apply the mapper to the data
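
To make the two halves of fit_transform visible, the sketch below calls `fit` and `transform` separately on a small made-up list of labels. After fitting, the learned labels are available in `classes_`, and `inverse_transform` maps the codes back.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels, mirroring the two values of case_status.
labels = ["Denied", "Certified", "Denied", "Certified"]

le = LabelEncoder()
le.fit(labels)                      # fit: learn the sorted set of labels
print(le.classes_.tolist())         # ['Certified', 'Denied']

codes = le.transform(labels)        # transform: apply the learned mapping
print(codes.tolist())               # [1, 0, 1, 0]

# The fitted encoder can also undo the mapping.
restored = le.inverse_transform(codes)
print(restored.tolist())            # ['Denied', 'Certified', 'Denied', 'Certified']
```

The position of each label in `classes_` is its code, which is the same alphabetical mapping built by hand earlier.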

In [49]:
file_path=r"C:\Users\sneha\Documents\Data_files\Visadataset.csv"
visa_df=pd.read_csv(file_path)
cat=visa_df.select_dtypes(include='object').columns

# import the class
from sklearn.preprocessing import LabelEncoder

# create an encoder instance
le=LabelEncoder()

# apply fit_transform
visa_df['case_status']=le.fit_transform(visa_df['case_status'])
visa_df

| | case_id | continent | education_of_employee | has_job_experience | requires_job_training | no_of_employees | yr_of_estab | region_of_employment | prevailing_wage | unit_of_wage | full_time_position | case_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | EZYV01 | Asia | High School | N | N | 14513 | 2007 | West | 592.2029 | Hour | Y | 1 |
| 1 | EZYV02 | Asia | Master's | Y | N | 2412 | 2002 | Northeast | 83425.6500 | Year | Y | 0 |
| 2 | EZYV03 | Asia | Bachelor's | N | Y | 44444 | 2008 | West | 122996.8600 | Year | Y | 1 |
| 3 | EZYV04 | Asia | Bachelor's | N | N | 98 | 1897 | West | 83434.0300 | Year | Y | 1 |
| 4 | EZYV05 | Africa | Master's | Y | N | 1082 | 2005 | South | 149907.3900 | Year | Y | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 25475 | EZYV25476 | Asia | Bachelor's | Y | Y | 2601 | 2008 | South | 77092.5700 | Year | Y | 0 |
| 25476 | EZYV25477 | Asia | High School | Y | N | 3274 | 2006 | Northeast | 279174.7900 | Year | Y | 0 |
| 25477 | EZYV25478 | Asia | Master's | Y | N | 1121 | 1910 | South | 146298.8500 | Year | N | 0 |
| 25478 | EZYV25479 | Asia | Master's | Y | Y | 1918 | 1887 | West | 86154.7700 | Year | Y | 0 |


In [50]:
from sklearn.preprocessing import LabelEncoder

# create an encoder instance
le=LabelEncoder()

# apply fit_transform
visa_df['continent']=le.fit_transform(visa_df['continent'])
visa_df

| | case_id | continent | education_of_employee | has_job_experience | requires_job_training | no_of_employees | yr_of_estab | region_of_employment | prevailing_wage | unit_of_wage | full_time_position | case_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | EZYV01 | 1 | High School | N | N | 14513 | 2007 | West | 592.2029 | Hour | Y | 1 |
| 1 | EZYV02 | 1 | Master's | Y | N | 2412 | 2002 | Northeast | 83425.6500 | Year | Y | 0 |
| 2 | EZYV03 | 1 | Bachelor's | N | Y | 44444 | 2008 | West | 122996.8600 | Year | Y | 1 |
| 3 | EZYV04 | 1 | Bachelor's | N | N | 98 | 1897 | West | 83434.0300 | Year | Y | 1 |
| 4 | EZYV05 | 0 | Master's | Y | N | 1082 | 2005 | South | 149907.3900 | Year | Y | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 25475 | EZYV25476 | 1 | Bachelor's | Y | Y | 2601 | 2008 | South | 77092.5700 | Year | Y | 0 |
| 25476 | EZYV25477 | 1 | High School | Y | N | 3274 | 2006 | Northeast | 279174.7900 | Year | Y | 0 |
| 25477 | EZYV25478 | 1 | Master's | Y | N | 1121 | 1910 | South | 146298.8500 | Year | N | 0 |
| 25478 | EZYV25479 | 1 | Master's | Y | Y | 1918 | 1887 | West | 86154.7700 | Year | Y | 0 |


In [51]:
file_path=r"C:\Users\sneha\Documents\Data_files\Visadataset.csv"
visa_df=pd.read_csv(file_path)
cat=visa_df.select_dtypes(include='object').columns
from sklearn.preprocessing import LabelEncoder

# create an encoder instance
le=LabelEncoder()

# apply fit_transform to every categorical column except case_id (cat[0])
for i in cat[1:]:
    visa_df[i]=le.fit_transform(visa_df[i])
visa_df

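| | case_id | continent | education_of_employee | has_job_experience | requires_job_training | no_of_employees | yr_of_estab | region_of_employment | prevailing_wage | unit_of_wage | full_time_position | case_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | EZYV01 | 1 | 2 | 0 | 0 | 14513 | 2007 | 4 | 592.2029 | 0 | 1 | 1 |
| 1 | EZYV02 | 1 | 3 | 1 | 0 | 2412 | 2002 | 2 | 83425.6500 | 3 | 1 | 0 |
| 2 | EZYV03 | 1 | 0 | 0 | 1 | 44444 | 2008 | 4 | 122996.8600 | 3 | 1 | 1 |
| 3 | EZYV04 | 1 | 0 | 0 | 0 | 98 | 1897 | 4 | 83434.0300 | 3 | 1 | 1 |
| 4 | EZYV05 | 0 | 3 | 1 | 0 | 1082 | 2005 | 3 | 149907.3900 | 3 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 25475 | EZYV25476 | 1 | 0 | 1 | 1 | 2601 | 2008 | 3 | 77092.5700 | 3 | 1 | 0 |
| 25476 | EZYV25477 | 1 | 2 | 1 | 0 | 3274 | 2006 | 2 | 279174.7900 | 3 | 1 | 0 |
| 25477 | EZYV25478 | 1 | 3 | 1 | 0 | 1121 | 1910 | 3 | 146298.8500 | 3 | 0 | 0 |
| 25478 | EZYV25479 | 1 | 3 | 1 | 1 | 1918 | 1887 | 4 | 86154.7700 | 3 | 1 | 0 |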
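
Reusing one `le` in a loop works, but the encoder is refitted on each column, so only the last column's labels remain in it. Also, scikit-learn's documentation describes LabelEncoder as intended for the target column and provides OrdinalEncoder for feature columns. The sketch below is an alternative under that assumption, not what this notebook uses: OrdinalEncoder fits several columns at once and keeps every column's labels. The toy rows are made up.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame with two of the dataset's column names; rows are illustrative.
toy = pd.DataFrame({
    "continent": ["Asia", "Africa", "Europe", "Asia"],
    "unit_of_wage": ["Hour", "Year", "Year", "Month"],
})
cols = ["continent", "unit_of_wage"]

oe = OrdinalEncoder()
encoded = pd.DataFrame(
    oe.fit_transform(toy[cols]).astype(int),  # floats by default, cast to int
    columns=cols,
    index=toy.index,
)
print(encoded["continent"].tolist())      # [1, 0, 2, 1]
print(encoded["unit_of_wage"].tolist())   # [0, 2, 2, 1]

# One array of learned labels per column, in the order of cols.
for col, cats in zip(cols, oe.categories_):
    print(col, cats.tolist())

# The codes can be mapped back for all columns in one call.
restored = oe.inverse_transform(encoded)
```

The codes match the alphabetical mapping used throughout this section, so the encoded values agree with the LabelEncoder results.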


## OneHotEncoder

- First count how many labels there are in the column
- That many new columns will be created
- for example, case_status has two labels
   - Certified
   - Denied
- so it will create two new columns
   - case_status_Certified
   - case_status_Denied
- in each row exactly one of these columns is 1 and the others are 0
- This is what **one-hot encoding** means


In [None]:
# continent    C_A     C_Af     C_E
# Asia          1       0        0
# Africa        0       1        0
# Europe        0       0        1

## pd.get_dummies

In [56]:
file_path=r"C:\Users\sneha\Documents\Data_files\Visadataset.csv"
visa_df=pd.read_csv(file_path)
cat=visa_df.select_dtypes(include='object').columns
pd.get_dummies(visa_df['case_status'],dtype=int)

| | Certified | Denied |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 1 | 0 |
| 2 | 0 | 1 |
| 3 | 0 | 1 |
| 4 | 1 | 0 |
| ... | ... | ... |
| 25475 | 1 | 0 |
| 25476 | 1 | 0 |
| 25477 | 1 | 0 |
| 25478 | 1 | 0 |


In [57]:
file_path=r"C:\Users\sneha\Documents\Data_files\Visadataset.csv"
visa_df=pd.read_csv(file_path)
cat=visa_df.select_dtypes(include='object').columns
pd.get_dummies(visa_df['continent'],dtype=int)

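| | Africa | Asia | Europe | North America | Oceania | South America |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 1 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 25475 | 0 | 1 | 0 | 0 | 0 | 0 |
| 25476 | 0 | 1 | 0 | 0 | 0 | 0 |
| 25477 | 0 | 1 | 0 | 0 | 0 | 0 |
| 25478 | 0 | 1 | 0 | 0 | 0 | 0 |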
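
The calls above encode one column at a time and return only the dummy columns. `pd.get_dummies` can also take the whole DataFrame together with a `columns` list, in which case it prefixes each new column with the source column name and leaves the other columns in place. A minimal sketch on made-up rows:

```python
import pandas as pd

# Toy frame with three of the dataset's column names; rows are illustrative.
toy = pd.DataFrame({
    "case_id": ["EZYV01", "EZYV02", "EZYV03"],
    "continent": ["Asia", "Africa", "Asia"],
    "case_status": ["Denied", "Certified", "Denied"],
})

# Encode only the listed columns; case_id is left untouched.
dummies = pd.get_dummies(toy, columns=["continent", "case_status"], dtype=int)
print(dummies.columns.tolist())
# ['case_id', 'continent_Africa', 'continent_Asia',
#  'case_status_Certified', 'case_status_Denied']

# drop_first=True drops the first label of each column, which removes
# the redundant column (its value is implied by the remaining ones).
compact = pd.get_dummies(
    toy, columns=["continent", "case_status"], dtype=int, drop_first=True
)
print(compact.columns.tolist())
# ['case_id', 'continent_Asia', 'case_status_Denied']
```

Whether to drop the first label depends on the model; it is shown here only as an option that `get_dummies` offers.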
