# Feature Engineering
- The process of selecting, transforming, or creating features from the raw dataset in order to improve a machine learning model.

## Common Techniques used in Feature Engineering
1. **Feature Selection** - Identifying and selecting the most relevant features in the dataset, using domain knowledge, statistical methods, or feature importance scores.

2. **Feature Scaling** - Putting all features on a similar scale so that features with large ranges do not dominate, e.g. standardization or normalization.

3. **Feature Transformation** - Transforming features to make them more suitable for analysis.

4. **Handling Missing Values** - Dealing with missing values through simple imputation or more sophisticated methods such as predictive imputation.

5. **Encoding of Categorical Variables** - Converting categorical columns into numeric form so that machine learning algorithms can process them.

6. **Creating Interaction Terms** - Combining two or more features to form a new feature.

7. **Feature Aggregation** - Aggregating multiple related features into a single feature.

8. **Dimensionality Reduction** - Reducing the number of features while keeping the relevant information, e.g. PCA.

In [None]:
# Feature Engineering
# Import the libraries
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler

In [None]:
# Create a dataset
np.random.seed(0)
data = pd.DataFrame({
    'Feature1' : np.random.normal(0, 1, 100),
    'Feature2' : np.random.normal(0, 5, 100),
    'Feature3' : np.random.choice(['A', 'B', 'C'], 100)
})

In [None]:
data.head()

,Feature1,Feature2,Feature3
0,1.764052,9.415753,A
1,0.400157,-6.738795,C
2,0.978738,-6.352425,B
3,2.240893,4.846984,C
4,1.867558,-5.865617,B


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Feature1  100 non-null    float64
 1   Feature2  100 non-null    float64
 2   Feature3  100 non-null    object 
dtypes: float64(2), object(1)
memory usage: 2.5+ KB


In [None]:
data.loc[::10, 'Feature1'] = np.nan  # Introduce missing values in every 10th row

In [None]:
# Feature Selection
# Keep only the two numeric columns (chosen manually here)
selected_features = ['Feature1', 'Feature2']
data_selected = data[selected_features]

In [None]:
data_selected

,Feature1,Feature2
0,,9.415753
1,0.400157,-6.738795
2,0.978738,-6.352425
3,2.240893,4.846984
4,1.867558,-5.865617
...,...,...
95,0.706573,-0.857732
96,0.010500,3.858953
97,1.785870,4.117521
98,0.126912,10.816180
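
The selection above is done by hand. As a rough sketch of the "statistical methods" route, the example below scores each feature against a target with `SelectKBest` and `f_regression`. The target `y` is hypothetical (the original dataset has none) and is built so that only `Feature1` carries signal.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X = pd.DataFrame({
    'Feature1': rng.normal(0, 1, 100),
    'Feature2': rng.normal(0, 5, 100),
})
# Hypothetical target (an assumption for this sketch): depends on Feature1 only
y = 3 * X['Feature1'] + rng.normal(0, 0.1, 100)

# Keep the single feature with the highest univariate F-score
selector = SelectKBest(score_func=f_regression, k=1)
X_selected = selector.fit_transform(X, y)

# get_support() returns a boolean mask over the input columns
kept_columns = X.columns[selector.get_support()].tolist()
print(kept_columns)
```

Univariate scores ignore interactions between features, so this is best treated as a first filter rather than a final choice.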


In [None]:
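# Feature Scaling
# Standardize to zero mean and unit variance.
# StandardScaler ignores NaN when fitting and leaves NaN in the output.
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data[['Feature1', 'Feature2']])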

In [None]:
scaled_data

array([[        nan,  1.74078977],
       [ 0.29521588, -1.38186686],
       [ 0.88448093, -1.3071819 ],
       [ 2.16994356,  0.85765153],
       [ 1.78971421, -1.21308245],
       [-1.10765542,  1.79923417],
       [ 0.85530231, -0.47902557],
       [-0.266483  , -0.80167608],
       [-0.2174557 ,  1.77924787],
       [ 0.30584998,  1.35164437],
       [        nan,  1.72572044],
       [ 1.36879787,  0.79642212],
       [ 0.66276054, -0.91163501],
       [ 0.01159113,  1.76680225],
       [ 0.33972899, -0.33828888],
       [ 0.2275053 ,  0.69630463],
       [ 1.4093385 ,  0.83624876],
       [-0.32127757, -0.2290815 ],
       [ 0.20651815,  0.51423927],
       [-0.98219855,  0.81204262],
       [        nan,  0.28454813],
       [ 0.55335777, -1.14182995],
       [ 0.76806841,  0.20898048],
       [-0.86820088,  1.20267963],
       [ 2.19933795, -0.75056113],
       [-1.59355329, -0.22388605],
       [-0.06572727, -0.49983862],
       [-0.30297123,  1.7080382 ],
       [ 1.44875329,
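
Standardization is one of the two scaling options named in the list of techniques; normalization is the other. A minimal sketch with `MinMaxScaler`, which rescales each column to the range [0, 1], is shown below. The `values` array is illustrative data, not the notebook's dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
# Illustrative single-column data with a wide spread
values = rng.normal(0, 5, size=(100, 1))

# Rescale so the column minimum maps to 0 and the maximum to 1
minmax = MinMaxScaler()
normalized = minmax.fit_transform(values)

print(normalized.min(), normalized.max())
```

Min-max scaling is sensitive to outliers because a single extreme value sets the range, so standardization is often the safer default for roughly normal data.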

In [None]:
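# Feature Transformation
# log1p(x) = log(1 + x) is only defined for x > -1. Feature1 contains values
# below -1, so those rows become NaN and NumPy emits a RuntimeWarning.
data_transformed = np.log1p(data[['Feature1']])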

  result = func(self.values, **kwargs)


In [None]:
data_transformed

,Feature1
0,
1,0.336585
2,0.682459
3,1.175849
4,1.053461
...,...
95,0.534487
96,0.010445
97,1.024560
98,0.119481
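
Because `np.log1p` fails for values at or below -1, a log-style transform on data that can be negative needs a different form. One possible option, sketched here as an assumption rather than the notebook's method, is a signed log: apply `log1p` to the absolute value and restore the sign.

```python
import numpy as np
import pandas as pd

# Illustrative values, including some at or below -1
feature = pd.Series([-2.5, -1.0, 0.0, 0.4, 2.2], name='Feature1')

# Signed log transform: compresses large magnitudes, keeps the sign,
# and is defined for every real input
signed_log = np.sign(feature) * np.log1p(np.abs(feature))

print(signed_log)
```

The transform is monotonic and maps 0 to 0, so the ordering of the original values is preserved.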


In [None]:
# Handling Missing Values
# Replace each NaN in Feature1 with the mean of the observed values
imputer = SimpleImputer(strategy='mean')
data_imputed = imputer.fit_transform(data[['Feature1']])

In [None]:
data_imputed

array([[ 0.11029405],
       [ 0.40015721],
       [ 0.97873798],
       [ 2.2408932 ],
       [ 1.86755799],
       [-0.97727788],
       [ 0.95008842],
       [-0.15135721],
       [-0.10321885],
       [ 0.4105985 ],
       [ 0.11029405],
       [ 1.45427351],
       [ 0.76103773],
       [ 0.12167502],
       [ 0.44386323],
       [ 0.33367433],
       [ 1.49407907],
       [-0.20515826],
       [ 0.3130677 ],
       [-0.85409574],
       [ 0.11029405],
       [ 0.6536186 ],
       [ 0.8644362 ],
       [-0.74216502],
       [ 2.26975462],
       [-1.45436567],
       [ 0.04575852],
       [-0.18718385],
       [ 1.53277921],
       [ 1.46935877],
       [ 0.11029405],
       [ 0.37816252],
       [-0.88778575],
       [-1.98079647],
       [-0.34791215],
       [ 0.15634897],
       [ 1.23029068],
       [ 1.20237985],
       [-0.38732682],
       [-0.30230275],
       [ 0.11029405],
       [-1.42001794],
       [-1.70627019],
       [ 1.9507754 ],
       [-0.50965218],
       [-0
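
Mean imputation can be pulled around by outliers. As a small sketch of an alternative, the example below uses the median strategy on illustrative data and writes the result back to a DataFrame column; the column name `Feature1_imputed` is made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Illustrative column with two missing values and one outlier (100.0)
df = pd.DataFrame({'Feature1': [1.0, np.nan, 3.0, 100.0, np.nan]})

# The median of the observed values (1, 3, 100) is 3, unaffected by the outlier
imputer = SimpleImputer(strategy='median')
df['Feature1_imputed'] = imputer.fit_transform(df[['Feature1']]).ravel()

print(df)
```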

In [None]:
# Encoding the Categorical Variables
# One-hot encode Feature3; the result is a SciPy sparse matrix with one column per category
encoder = OneHotEncoder()
data_encoded = encoder.fit_transform(data[['Feature3']])

In [None]:
print(data_encoded)

  (0, 0)	1.0
  (1, 2)	1.0
  (2, 1)	1.0
  (3, 2)	1.0
  (4, 1)	1.0
  (5, 0)	1.0
  (6, 1)	1.0
  (7, 1)	1.0
  (8, 2)	1.0
  (9, 2)	1.0
  (10, 0)	1.0
  (11, 1)	1.0
  (12, 0)	1.0
  (13, 0)	1.0
  (14, 0)	1.0
  (15, 2)	1.0
  (16, 0)	1.0
  (17, 1)	1.0
  (18, 0)	1.0
  (19, 2)	1.0
  (20, 1)	1.0
  (21, 0)	1.0
  (22, 1)	1.0
  (23, 2)	1.0
  (24, 0)	1.0
  :	:
  (75, 1)	1.0
  (76, 2)	1.0
  (77, 2)	1.0
  (78, 1)	1.0
  (79, 0)	1.0
  (80, 0)	1.0
  (81, 2)	1.0
  (82, 1)	1.0
  (83, 1)	1.0
  (84, 0)	1.0
  (85, 0)	1.0
  (86, 2)	1.0
  (87, 0)	1.0
  (88, 1)	1.0
  (89, 2)	1.0
  (90, 2)	1.0
  (91, 1)	1.0
  (92, 0)	1.0
  (93, 0)	1.0
  (94, 1)	1.0
  (95, 0)	1.0
  (96, 0)	1.0
  (97, 2)	1.0
  (98, 0)	1.0
  (99, 0)	1.0
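
The sparse printout lists only the positions of the non-zero entries, which is hard to read. A possible way to inspect the encoding, sketched below on a small illustrative frame, is to convert it to a dense array and label the columns from `encoder.categories_`. The `Feature3_` prefix is a naming choice made for this example.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Illustrative categorical column
df = pd.DataFrame({'Feature3': ['A', 'C', 'B', 'C', 'B']})

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix; toarray() makes it dense
encoded = encoder.fit_transform(df[['Feature3']]).toarray()

# categories_[0] holds the sorted category labels of the first (only) column
columns = [f'Feature3_{c}' for c in encoder.categories_[0]]
encoded_df = pd.DataFrame(encoded, columns=columns, index=df.index)

print(encoded_df)
```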


In [None]:
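# Creating Interaction Terms
# With a single input column there are no pairs to multiply, so the output
# holds only the bias term (1.0) and Feature2 itself. Pass two or more
# columns to obtain product terms.
poly = PolynomialFeatures(degree=2, interaction_only=True)
data_interactions = poly.fit_transform(data[['Feature2']])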

In [None]:
data_interactions

array([[  1.        ,   9.41575349],
       [  1.        ,  -6.73879531],
       [  1.        ,  -6.35242499],
       [  1.        ,   4.84698354],
       [  1.        ,  -5.86561703],
       [  1.        ,   9.71810593],
       [  1.        ,  -2.0680949 ],
       [  1.        ,  -3.73727406],
       [  1.        ,   9.61471013],
       [  1.        ,   7.40257396],
       [  1.        ,   9.3377948 ],
       [  1.        ,   4.53022329],
       [  1.        ,  -4.30612843],
       [  1.        ,   9.55032477],
       [  1.        ,  -1.34001685],
       [  1.        ,   4.01228198],
       [  1.        ,   4.73625984],
       [  1.        ,  -0.77505047],
       [  1.        ,   3.07039685],
       [  1.        ,   4.61103336],
       [  1.        ,   1.88212766],
       [  1.        ,  -5.49700395],
       [  1.        ,   1.49119087],
       [  1.        ,   6.63192948],
       [  1.        ,  -3.4728393 ],
       [  1.        ,  -0.7481727 ],
       [  1.        ,  -2.17576776],
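
An interaction term needs at least two input features. The sketch below passes two columns to `PolynomialFeatures`, so the output gains a product column. It uses freshly generated data without missing values, since `PolynomialFeatures` does not accept NaN, and sets `include_bias=False` to drop the constant column.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
# Illustrative data with no missing values
df = pd.DataFrame({
    'Feature1': rng.normal(0, 1, 5),
    'Feature2': rng.normal(0, 5, 5),
})

poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
interactions = poly.fit_transform(df[['Feature1', 'Feature2']])

# Output columns: Feature1, Feature2, Feature1 * Feature2
print(interactions)
```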
 

In [None]:
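# Feature Aggregation
# Feature2 is continuous, so nearly every value forms its own group (count = 1).
# Grouping by a categorical column such as Feature3 gives more meaningful aggregates.
data_aggregated = data.groupby('Feature2').agg({'Feature1': 'mean', 'Feature2': 'count'})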

In [None]:
data_aggregated

Feature2 (group key),Feature1 (mean),Feature2 (count)
-11.117016,-1.536244,1
-8.010288,1.895889,1
-7.723855,0.462782,1
-7.456288,,1
-6.874756,0.900826,1
...,...,...
9.614710,-0.103219,1
9.647660,-0.359553,1
9.718106,-0.977278,1
10.816180,0.126912,1
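
Aggregation is usually more informative when the grouping key is categorical. As a sketch on a small illustrative frame, the example below computes per-category statistics and then uses `transform` to attach the group mean to every row as a new feature. The column name `Feature1_group_mean` is made up for this example.

```python
import pandas as pd

# Illustrative data: a numeric column and a categorical grouping key
df = pd.DataFrame({
    'Feature1': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    'Feature3': ['A', 'B', 'A', 'C', 'B', 'A'],
})

# One row per category with the mean and the row count
group_stats = df.groupby('Feature3')['Feature1'].agg(['mean', 'count'])

# transform('mean') returns a value for every original row,
# so the aggregate can be used directly as a feature
df['Feature1_group_mean'] = df.groupby('Feature3')['Feature1'].transform('mean')

print(group_stats)
print(df)
```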


In [None]:
# Re-fit the scaler on the imputed Feature1 column so that no NaN values remain
scaled_data = scaler.fit_transform(data_imputed)

In [None]:
scaled_data

array([[ 0.        ],
       [ 0.31118486],
       [ 0.93232477],
       [ 2.28732135],
       [ 1.88652442],
       [-1.16757133],
       [ 0.9015678 ],
       [-0.28089775],
       [-0.22921844],
       [ 0.32239419],
       [ 0.        ],
       [ 1.44283964],
       [ 0.69861095],
       [ 0.01221813],
       [ 0.3581058 ],
       [ 0.23981164],
       [ 1.48557321],
       [-0.33865629],
       [ 0.21768924],
       [-1.03532818],
       [ 0.        ],
       [ 0.58329031],
       [ 0.80961519],
       [-0.91516408],
       [ 2.31830576],
       [-1.67975266],
       [-0.06928262],
       [-0.31935972],
       [ 1.52712005],
       [ 1.45903454],
       [ 0.        ],
       [ 0.28757229],
       [-1.07149635],
       [-2.24490654],
       [-0.49191084],
       [ 0.04944262],
       [ 1.20238113],
       [ 1.17241724],
       [-0.53422476],
       [-0.44294651],
       [ 0.        ],
       [-1.64287837],
       [-1.95018672],
       [ 1.97586312],
       [-0.66554811],
       [-0

In [None]:
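# Dimensionality Reduction
# scaled_data has a single column here, so one component keeps all of it.
# PCA becomes useful when there are several correlated columns to compress.
pca = PCA(n_components = 1)
data_pca = pca.fit_transform(scaled_data)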

In [None]:
data_pca.shape

(100, 1)
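
Reducing one column to one component does not change the dimensionality. The sketch below, on illustrative data, shows the more typical case: two strongly correlated features are standardized and compressed into a single component, and `explained_variance_ratio_` reports how much of the variance that component retains.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
x1 = rng.normal(0, 1, 200)
# Second feature is almost a linear function of the first (illustrative)
x2 = 2 * x1 + rng.normal(0, 0.1, 200)
X = np.column_stack([x1, x2])

# Standardize first so that neither feature dominates because of its scale
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)
print(pca.explained_variance_ratio_)
```

When the retained variance is close to 1, little information is lost by dropping the second dimension; with weakly correlated features the ratio falls and more components are needed.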