Feature engineering is not just about creating new features; it also includes different types of normalization and transformations.


In [1]:
# importing libraries

import pandas as pd
import numpy as np

In [2]:
# read data

df = pd.read_csv('datasets/messages.csv')

In [3]:
df.head()

Unnamed: 0,date,msg
0,2013-12-15 00:50:00,ищу на сегодня мужика 37
1,2014-04-29 23:40:00,ПАРЕНЬ БИ ИЩЕТ ДРУГА СЕЙЧАС!! СМС ММС 0955532826
2,2012-12-30 00:21:00,Днепр.м 43 позн.с д/ж *.о 067.16.34.576
3,2014-11-28 00:31:00,КИЕВ ИЩУ Д/Ж ДО 45 МНЕ СЕЙЧАС СКУЧНО 093 629 9...
4,2013-10-26 23:11:00,Зая я тебя никогда не обижу люблю тебя!) Даше


From the date column we can extract:
1. Year
2. Week of year
3. Month
4. Day of week
5. Weekend
6. Hour

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   date    1000 non-null   object
 1   msg     1000 non-null   object
dtypes: object(2)
memory usage: 15.8+ KB


In [12]:
# convert date to datetime object

df['date'] = pd.to_datetime(df['date'])

In [37]:
df['Year'] = df['date'].dt.year
df['Week of Year'] = df['date'].dt.isocalendar().week # dt.weekofyear is deprecated; isocalendar() needs pandas >= 1.1
df['Month'] = df['date'].dt.month
df['Day of Week'] = df['date'].dt.dayofweek # Monday ---> 0, Tuesday ---> 1, ..., Sunday ---> 6
df['Weekend'] = (df['date'].dt.weekday >= 5).astype(int) # 1 for Saturday (5) or Sunday (6), else 0
df['Hour'] = df['date'].dt.hour
df['Minutes'] = df['date'].dt.minute
df['Quarter'] = df['date'].dt.quarter

In [38]:
df.head()

Unnamed: 0,date,msg,Year,Week of Year,Month,Day of Week,Weekend,Hour,Minutes,Quarter
0,2013-12-15 00:50:00,ищу на сегодня мужика 37,2013,50,12,6,1,0,50,4
1,2014-04-29 23:40:00,ПАРЕНЬ БИ ИЩЕТ ДРУГА СЕЙЧАС!! СМС ММС 0955532826,2014,18,4,1,0,23,40,2
2,2012-12-30 00:21:00,Днепр.м 43 позн.с д/ж *.о 067.16.34.576,2012,52,12,6,1,0,21,4
3,2014-11-28 00:31:00,КИЕВ ИЩУ Д/Ж ДО 45 МНЕ СЕЙЧАС СКУЧНО 093 629 9...,2014,48,11,4,0,0,31,4
4,2013-10-26 23:11:00,Зая я тебя никогда не обижу люблю тебя!) Даше,2013,43,10,5,1,23,11,4
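
The extraction steps above can be bundled into a small reusable helper. The sketch below is one possible arrangement and not part of the original notebook: the function name `add_datetime_features` and the two-row sample frame are hypothetical, and it assumes pandas 1.1 or later, where `Series.dt.isocalendar()` is available.

```python
import pandas as pd


def add_datetime_features(frame, column="date"):
    """Return a copy of `frame` with calendar features derived from `column`.

    Hypothetical helper, shown for illustration only.
    """
    out = frame.copy()
    dates = pd.to_datetime(out[column])
    out["Year"] = dates.dt.year
    # isocalendar() returns a DataFrame with year, week and day columns
    out["Week of Year"] = dates.dt.isocalendar().week.astype(int)
    out["Month"] = dates.dt.month
    out["Day of Week"] = dates.dt.dayofweek  # Monday = 0, ..., Sunday = 6
    out["Weekend"] = (dates.dt.dayofweek >= 5).astype(int)
    out["Hour"] = dates.dt.hour
    out["Quarter"] = dates.dt.quarter
    return out


# two timestamps taken from the first rows of the messages data
sample = pd.DataFrame({"date": ["2013-12-15 00:50:00", "2014-04-29 23:40:00"]})
sample_features = add_datetime_features(sample)
print(sample_features)
```

Returning a copy keeps the input frame unchanged, so the helper can be applied to several frames without side effects.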


In [30]:
df['Day of Week'].describe()

count    1000.000000
mean        3.002000
std         2.025858
min         0.000000
25%         1.000000
50%         3.000000
75%         5.000000
max         6.000000
Name: Day of Week, dtype: float64

## Sample Features

In [31]:
# create a series of datetime with frequency of 10 hours

s = pd.date_range('2020-01-06', '2020-01-10', freq='10H').to_series()

# create some features based on datetime
features = {
    "dayofweek": s.dt.dayofweek.values,
    "dayofyear": s.dt.dayofyear.values,
    "hour": s.dt.hour.values,
    "is_leap_year": s.dt.is_leap_year.values,
    "quarter": s.dt.quarter.values,
}

In [32]:
features

{'dayofweek': array([0, 0, 0, 1, 1, 2, 2, 2, 3, 3], dtype=int64),
 'dayofyear': array([6, 6, 6, 7, 7, 8, 8, 8, 9, 9], dtype=int64),
 'hour': array([ 0, 10, 20,  6, 16,  2, 12, 22,  8, 18], dtype=int64),
 'is_leap_year': array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
         True]),
 'quarter': array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int64)}

In [33]:
type(features)

dict

In [34]:
# create a series of datetime with frequency of 24 hours

s = pd.date_range('2020-01-06', '2020-01-10', freq='24H').to_series()

# create some features based on datetime
features = {
    "dayofweek": s.dt.dayofweek.values,
    "dayofyear": s.dt.dayofyear.values,
    "hour": s.dt.hour.values,
    "is_leap_year": s.dt.is_leap_year.values,
    "quarter": s.dt.quarter.values,
}

In [35]:
features

{'dayofweek': array([0, 1, 2, 3, 4], dtype=int64),
 'dayofyear': array([ 6,  7,  8,  9, 10], dtype=int64),
 'hour': array([0, 0, 0, 0, 0], dtype=int64),
 'is_leap_year': array([ True,  True,  True,  True,  True]),
 'quarter': array([1, 1, 1, 1, 1], dtype=int64)}

In [36]:
features['dayofweek']

array([0, 1, 2, 3, 4], dtype=int64)
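
A dictionary of equal-length arrays like `features` converts directly into a DataFrame, which is usually the more convenient form for modelling. A minimal sketch, assuming daily timestamps over the same date range as above; the name `feature_df` is introduced here for illustration.

```python
import pandas as pd

# daily timestamps from Monday 2020-01-06 to Friday 2020-01-10
s = pd.date_range("2020-01-06", "2020-01-10", freq="D").to_series()

features = {
    "dayofweek": s.dt.dayofweek.values,
    "dayofyear": s.dt.dayofyear.values,
    "hour": s.dt.hour.values,
    "is_leap_year": s.dt.is_leap_year.values,
    "quarter": s.dt.quarter.values,
}

# each dictionary key becomes a column; the timestamps are reused as the row index
feature_df = pd.DataFrame(features, index=s.index)
print(feature_df)
```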

In [41]:
df2 = pd.read_csv('datasets/Mall_Customers.csv')
df2.head()

Unnamed: 0,CustomerID,Genre,Age,Annual_Income_(k$),Spending_Score
0,1,Male,19,15,39
1,2,Male,21,15,81
2,3,Female,20,16,6
3,4,Female,23,16,77
4,5,Female,31,17,40


## Aggregation using pandas

In [45]:
df2.rename({"Genre": "Gender"}, axis=1, inplace=True)
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   CustomerID          200 non-null    int64 
 1   Gender              200 non-null    object
 2   Age                 200 non-null    int64 
 3   Annual_Income_(k$)  200 non-null    int64 
 4   Spending_Score      200 non-null    int64 
dtypes: int64(4), object(1)
memory usage: 7.9+ KB


In [51]:
# create an aggregation dictionary: column name ---> list of aggregations
aggs = {}

# for the spending score, calculate the sum, max, min and mean
aggs['Spending_Score'] = ['sum', 'max', 'min', 'mean']

# for customer id, calculate the number of unique values
# (a second assignment to the same key overwrites the first, so put every
# aggregation for a column into one list, e.g. ['size', 'nunique'])
aggs['CustomerID'] = ['nunique']

# group by customer id and calculate the aggregates
agg_df = df2.groupby('CustomerID').agg(aggs)
agg_df = agg_df.reset_index() # move CustomerID from the index back into a regular column

agg_df

Unnamed: 0_level_0,CustomerID,Spending_Score,Spending_Score,Spending_Score,Spending_Score,CustomerID
Unnamed: 0_level_1,Unnamed: 1_level_1,sum,max,min,mean,nunique
0,1,39,39,39,39.0,1
1,2,81,81,81,81.0,1
2,3,6,6,6,6.0,1
3,4,77,77,77,77.0,1
4,5,40,40,40,40.0,1
...,...,...,...,...,...,...
195,196,79,79,79,79.0,1
196,197,28,28,28,28.0,1
197,198,74,74,74,74.0,1
198,199,18,18,18,18.0,1


In [58]:
pd.concat([df2, agg_df], axis=1)


Unnamed: 0,CustomerID,Gender,Age,Annual_Income_(k$),Spending_Score,"(CustomerID, )","(Spending_Score, sum)","(Spending_Score, max)","(Spending_Score, min)","(Spending_Score, mean)","(CustomerID, nunique)"
0,1,Male,19,15,39,1,39,39,39,39.0,1
1,2,Male,21,15,81,2,81,81,81,81.0,1
2,3,Female,20,16,6,3,6,6,6,6.0,1
3,4,Female,23,16,77,4,77,77,77,77.0,1
4,5,Female,31,17,40,5,40,40,40,40.0,1
...,...,...,...,...,...,...,...,...,...,...,...
195,196,Female,35,120,79,196,79,79,79,79.0,1
196,197,Female,45,126,28,197,28,28,28,28.0,1
197,198,Male,32,126,74,198,74,74,74,74.0,1
198,199,Male,32,137,18,199,18,18,18,18.0,1
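
In the output above every CustomerID forms a group of one row, so the aggregates simply repeat the original values. Aggregates become informative when the grouping key repeats. The sketch below groups by Gender instead and merges the group statistics back onto each row; the toy frame and the output column names (`spend_mean`, `spend_max`, `n_customers`) are hypothetical and only borrow the column names of `df2`.

```python
import pandas as pd

# hypothetical toy data with the same column names as df2
customers = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4, 5],
    "Gender": ["Male", "Male", "Female", "Female", "Female"],
    "Spending_Score": [39, 81, 6, 77, 40],
})

# named aggregation gives flat, readable column names instead of a MultiIndex
gender_stats = (
    customers.groupby("Gender")
    .agg(
        spend_mean=("Spending_Score", "mean"),
        spend_max=("Spending_Score", "max"),
        n_customers=("CustomerID", "nunique"),
    )
    .reset_index()
)

# attach the group-level statistics to every customer row
customers_with_stats = customers.merge(gender_stats, on="Gender", how="left")
print(customers_with_stats)
```

Merging on the key, rather than concatenating side by side, keeps the rows aligned even when the aggregated frame has fewer rows than the original.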


## Generate Random Features with Two Columns and 100 Rows

In [71]:
# generate random dataframe with 100 rows and 2 columns

df = pd.DataFrame(np.random.rand(100, 2), columns=[f"feature_{i}" for i in range(1,3)])
df

Unnamed: 0,feature_1,feature_2
0,0.001140,0.527652
1,0.229608,0.401723
2,0.370082,0.774157
3,0.762492,0.198948
4,0.424794,0.708745
...,...,...
95,0.621357,0.450957
96,0.876396,0.051598
97,0.027874,0.970125
98,0.698067,0.445372


## Converting the numbers to categories - Binning

Binning converts a numerical feature into a categorical one by assigning each value to an interval (bin).

In [72]:
df2.head()

Unnamed: 0,CustomerID,Gender,Age,Annual_Income_(k$),Spending_Score
0,1,Male,19,15,39
1,2,Male,21,15,81
2,3,Female,20,16,6
3,4,Female,23,16,77
4,5,Female,31,17,40


In [73]:
df2['Age'].describe()

count    200.000000
mean      38.850000
std       13.969007
min       18.000000
25%       28.750000
50%       36.000000
75%       49.000000
max       70.000000
Name: Age, dtype: float64

In [74]:
# binning on the basis of age

df2['Age_Group'] = pd.cut(df2['Age'], bins=[0,25,45,80], labels=['0-25', '25-45', '45-80'])

In [75]:
df2.head()

Unnamed: 0,CustomerID,Gender,Age,Annual_Income_(k$),Spending_Score,Age_Group
0,1,Male,19,15,39,0-25
1,2,Male,21,15,81,0-25
2,3,Female,20,16,6,0-25
3,4,Female,23,16,77,0-25
4,5,Female,31,17,40,25-45


In [77]:
df2[df2['Age_Group'] == '45-80']

Unnamed: 0,CustomerID,Gender,Age,Annual_Income_(k$),Spending_Score,Age_Group
8,9,Male,64,19,3,45-80
10,11,Male,67,19,14,45-80
12,13,Female,58,20,15,45-80
18,19,Male,52,23,29,45-80
22,23,Female,46,25,5,45-80
...,...,...,...,...,...,...
176,177,Male,58,88,15,45-80
178,179,Male,59,93,14,45-80
182,183,Male,46,98,15,45-80
186,187,Female,54,101,24,45-80
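
Two details of binning are easy to miss: `pd.cut` intervals are right-inclusive by default, so an age of exactly 25 falls into the first bin, and `pd.qcut` can be used when bins with roughly equal counts are preferred over fixed edges. A small sketch on made-up ages (the values and the labels `low`, `mid`, `high` are illustrative assumptions):

```python
import pandas as pd

ages = pd.Series([18, 25, 26, 45, 46, 70])

# fixed edges: the intervals are (0, 25], (25, 45] and (45, 80]
fixed_bins = pd.cut(ages, bins=[0, 25, 45, 80], labels=["0-25", "25-45", "45-80"])

# quantile-based edges: three bins holding roughly the same number of values
quantile_bins = pd.qcut(ages, q=3, labels=["low", "mid", "high"])

print(pd.DataFrame({"age": ages, "fixed": fixed_bins, "quantile": quantile_bins}))
```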


## Filling Missing Numerical Values

There are many ways to fill a missing value:
1. Pick a value that does not occur in that specific feature and use it to fill the NaNs (say 0, if 0 is never present). This is one of the simplest ways, but it might not be very effective.


2. Often better than a constant is to fill with the mean, median or mode of the feature.


3. A fancier way is the k-nearest neighbours method: take a sample with a missing value, find its nearest neighbours using a distance metric such as Euclidean distance, and fill the missing value with the mean of the neighbours' values.


4. Impute the missing values in a column by training a regression model that predicts them from the other columns.
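
The first two options in the list above can be sketched with scikit-learn's `SimpleImputer`. This is a minimal illustration on a made-up array, not a recommendation of one strategy over another; the array and variable names are assumptions.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# small made-up array with one missing value per column
X_missing = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [np.nan, 30.0],
    [9.0, 40.0],
])

# option 1: fill with a constant that does not occur in the data
X_const = SimpleImputer(strategy="constant", fill_value=0).fit_transform(X_missing)

# option 2: fill with a per-column statistic
X_mean = SimpleImputer(strategy="mean").fit_transform(X_missing)
X_median = SimpleImputer(strategy="median").fit_transform(X_missing)

print(X_const)
print(X_mean)
print(X_median)
```

On skewed columns such as the first one here, the mean and the median give noticeably different fill values, which is one reason to compare strategies on validation data.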

## Filling with KNNImputer

In [78]:
# creating df with NaNs

# creating random numpy array with 10 samples and 6 features, integers from 1 to 14 (upper bound 15 is exclusive)
X = np.random.randint(1, 15, (10, 6))

# convert array to float
X = X.astype(float)

# randomly assign 10 elements to NaN 
X.ravel()[np.random.choice(X.size, 10, replace=False)] = np.nan

In [79]:
X

array([[ 3., 12.,  7., nan, nan, 14.],
       [nan,  3., 13., 12., nan,  7.],
       [ 3., 14.,  6., 11.,  3., 12.],
       [ 7.,  4., nan, nan,  7., nan],
       [14., 13., nan, 10., nan, 11.],
       [ 4., 10.,  7.,  4.,  8.,  3.],
       [ 2.,  2.,  7.,  9.,  3., nan],
       [ 6.,  7.,  5.,  1.,  4.,  8.],
       [10., 10.,  2., 14., 12.,  6.],
       [11.,  2.,  3., 12.,  2., 10.]])

In [81]:
# use the 2 nearest neighbours to fill each NaN
from sklearn.impute import KNNImputer

knn_imputer = KNNImputer(n_neighbors=2)

knn_imputer.fit_transform(X)

array([[ 3. , 12. ,  7. ,  6. ,  3.5, 14. ],
       [ 4.5,  3. , 13. , 12. ,  5. ,  7. ],
       [ 3. , 14. ,  6. , 11. ,  3. , 12. ],
       [ 7. ,  4. ,  9. ,  6.5,  7. ,  7.5],
       [14. , 13. ,  4. , 10. ,  7.5, 11. ],
       [ 4. , 10. ,  7. ,  4. ,  8. ,  3. ],
       [ 2. ,  2. ,  7. ,  9. ,  3. ,  8.5],
       [ 6. ,  7. ,  5. ,  1. ,  4. ,  8. ],
       [10. , 10. ,  2. , 14. , 12. ,  6. ],
       [11. ,  2. ,  3. , 12. ,  2. , 10. ]])
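
With random data it is hard to check the imputed values by eye. The toy array below is constructed so the result can be verified by hand: row 0 is identical to rows 1 and 2 on the observed columns, so those two rows are its nearest neighbours and the missing entry becomes the mean of 3 and 5. The array is an assumption made for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer

X_toy = np.array([
    [1.0, 2.0, np.nan],   # missing value to impute
    [1.0, 2.0, 3.0],      # distance 0 to row 0 on the observed columns
    [1.0, 2.0, 5.0],      # distance 0 to row 0 on the observed columns
    [10.0, 10.0, 10.0],   # far away, not among the 2 nearest neighbours
])

# distances are computed only on coordinates that are present in both rows
X_filled = KNNImputer(n_neighbors=2).fit_transform(X_toy)
print(X_filled)
```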

## End Notes

1. Remember to scale or normalize features when using models that are sensitive to feature scale, such as Logistic Regression or SVM.


2. Tree-based models generally work fine without any normalization of features, because their splits depend only on the ordering of the values.
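
One common way to apply the first note is to put the scaler and the model into a single scikit-learn pipeline, so the scaling parameters are learned only from the data passed to `fit`. The sketch below uses synthetic data with two features on very different scales; the data, the seed and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# two synthetic features on very different scales
X_raw = np.column_stack([
    rng.normal(0, 1, 200),
    rng.normal(0, 1000, 200),
])
# synthetic target that depends equally on both features once they are rescaled
y = (X_raw[:, 0] + X_raw[:, 1] / 1000 > 0).astype(int)

# the scaler is fitted first, then the classifier sees standardized features
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_raw, y)

# after scaling, each column has mean 0 and standard deviation 1
X_scaled = model.named_steps["standardscaler"].transform(X_raw)
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```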