In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import random

## List of Feature Engineering Techniques
* **Imputation**
* **Handling Outliers**
* **Binning**
* **Log Transform**
* **Encoding**
* **Grouping Operations**
* **Scaling**

In [None]:
df = pd.read_csv("/kaggle/input/impact-of-covid19-pandemic-on-the-global-economy/transformed_data.csv")
df.head()

## 1. Imputation
Missing values are one of the most common problems you encounter when preparing data for machine learning. They can stem from human error, interruptions in the data flow, privacy concerns, and so on. Whatever the reason, missing values affect the performance of machine learning models, and many algorithms cannot handle them at all.

In [None]:
df.isna().sum()

In [None]:
import missingno as msno 

msno.bar(df)
plt.show()

In [None]:
df["HDI"].dtype

In [None]:
df["HDI"] = df["HDI"].fillna(df["HDI"].mean())
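The mean is only one possible fill value. As a minimal sketch on a hypothetical toy frame (the `toy` DataFrame and its columns are made up for illustration and are not part of the Kaggle dataset), the median can be used for numeric columns with extreme values, and the most frequent category for categorical columns:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the Kaggle dataset
toy = pd.DataFrame({
    "score": [0.5, np.nan, 0.7, 0.9, np.nan],
    "region": ["A", "B", None, "B", "B"],
})

# Numeric column: the median is less sensitive to extreme values than the mean
toy["score"] = toy["score"].fillna(toy["score"].median())

# Categorical column: fill with the most frequent category (the mode)
toy["region"] = toy["region"].fillna(toy["region"].mode()[0])

print(toy)
```

Which statistic fits best depends on the column's distribution, so treat this as a starting point rather than a rule.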

## 2. Handling Outliers
![](https://editor.analyticsvidhya.com/uploads/12469out.png)

The best way to detect outliers is to visualize the data. Statistical rules such as z-score or IQR thresholds can misclassify points, whereas a plot lets you judge each case in context and decide with higher precision.

![](https://miro.medium.com/max/1922/1*Mn5NoddG6Hlqqld171x-Xg.jpeg)

In [None]:
df.dtypes

In [None]:
import plotly.express as px

fig = px.box(df, y="POP")
fig.show()

## Methods for handling outliers
**Now that we understand how to detect outliers in a better way, it’s time to engineer them. We’re going to explore a few different techniques and methods to achieve that:**

* **Trimming**: Simply removing the outliers from our dataset.
* **Imputing**: We treat outliers as missing data, and we apply missing data imputation techniques.
* **Discretization**: We place outliers in edge bins with higher or lower values of the distribution.
* **Censoring**: Capping the variable distribution at the maximum and minimum values.
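Censoring (capping) from the list above can be sketched as follows. This is an illustrative example on a made-up series (`values` is hypothetical), and it assumes the same 1.5 × IQR boundaries used for trimming in this notebook; other cut-offs such as fixed percentiles are equally valid choices.

```python
import pandas as pd

# Hypothetical toy feature with one obvious high outlier
values = pd.Series([10, 12, 11, 13, 12, 11, 95], name="toy_feature")

# IQR-based boundaries (assumed rule: 1.5 * IQR beyond the quartiles)
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Cap values outside the boundaries instead of dropping the rows
capped = values.clip(lower=lower, upper=upper)

print(capped.tolist())
```

Unlike trimming, capping keeps every row, which matters when the other columns of an outlier row are still informative.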

### 2.1. Trimming

In [None]:
# Trimming

# Calculate the IQR
q1 = df["POP"].quantile(0.25)
q3 = df["POP"].quantile(0.75)
IQR = q3 - q1

# Calculate the boundaries
lower = q1 - (IQR * 1.5)
upper = q3 + (IQR * 1.5)

# Flag the outliers (True where a value falls outside the boundaries)
outliers = (df["POP"] > upper) | (df["POP"] < lower)

# Remove the outliers from the data
df_trimming = df.loc[~outliers]

In [None]:
import plotly.express as px

fig = px.box(df_trimming, y="POP", title = "Trimming")
fig.show()

### 2.2. Imputing outliers

In [None]:
# Imputing

def detect_outlier(data, threshold=-2):
    """Return the values whose z-score is below `threshold` (low-side outliers only)."""
    mean = np.mean(data)
    std = np.std(data)
    outliers = []
    for y in data:
        z_score = (y - mean) / std
        if z_score < threshold:
            outliers.append(y)
    return outliers

result = list(set(detect_outlier(df["POP"])))
print(f'Outlier: {result}')

In [None]:
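# Work on a copy so the original df is left untouched
df_impute = df.copy(deep = True)

# Replace each flagged outlier with the column mean (computed once)
pop_mean = df_impute["POP"].mean()
df_impute["POP"] = df_impute["POP"].apply(lambda x : pop_mean if x in result else x)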

In [None]:
fig = px.box(df_impute, y="POP", title = "Imputing")
fig.show()

## 3. Binning
Binning can be applied to both categorical and numerical data.

The main motivation for binning is to make the model more robust and to prevent overfitting. This comes at a cost: every time you bin a feature, you discard information and regularize your data, which can reduce the model's performance.

In [None]:
df_binning = df.copy(deep = True)

df_binning["TD"].value_counts()

In [None]:
pd.cut(df_binning["TD"], bins = 5)

In [None]:
pd.cut(df_binning["TD"], bins = 5, labels = ["Bin_1", "Bin_2", "Bin_3", "Bin_4", "Bin_5"])

In [None]:
df_binning["TD"] = pd.cut(df_binning["TD"], bins = 5, labels = ["Bin_1", "Bin_2", "Bin_3", "Bin_4", "Bin_5"])
df_binning.sample(5)
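`pd.cut` with an integer `bins` argument produces equal-width bins, which can leave most rows in a single bin when the column is skewed. As a hedged alternative, the sketch below contrasts it with equal-frequency binning via `pd.qcut` on a made-up series (`toy_counts` is hypothetical, not a column of the dataset):

```python
import pandas as pd

# Hypothetical skewed data: seven small values and one very large one
toy_counts = pd.Series([1, 2, 3, 4, 5, 6, 7, 1000], name="toy_counts")
labels = ["Bin_1", "Bin_2", "Bin_3", "Bin_4"]

# Equal-width bins: the range 1..1000 is split into four equal intervals
equal_width = pd.cut(toy_counts, bins=4, labels=labels)

# Equal-frequency bins: edges are placed at the quartiles
equal_freq = pd.qcut(toy_counts, q=4, labels=labels)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```

With equal-width bins almost every value lands in the first bin, while the quantile-based bins each hold the same number of rows.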

## 4. Log Transform
Logarithm transformation (or log transform) is one of the most commonly used mathematical transformations in feature engineering. Here are the benefits of using log transform:

* It helps to handle skewed data; after the transformation, the distribution is closer to normal
* It reduces the effect of outliers by compressing differences in magnitude, which makes the model more robust
* Caveat: the data you apply the log transform to must contain only positive values; `np.log` returns `NaN` (with a runtime warning) for negative inputs and `-inf` for zero
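To make the effect on skewness concrete, here is a small sketch on synthetic data. The log-normal sample and the variable names (`skewed`, `logged`) are assumptions for illustration only and are unrelated to the GDPCAP column used in this notebook.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data (log-normal), seeded for reproducibility
rng = np.random.default_rng(0)
skewed = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000))

# log1p computes log(1 + x), so it is also defined for x = 0
logged = np.log1p(skewed)

# Sample skewness before and after the transform
print("skew before:", skewed.skew())
print("skew after: ", logged.skew())
```

The skewness after the transform should be noticeably smaller, although how close the result gets to a normal distribution depends on the data.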

In [None]:
df_log = df.copy(deep = True)
df_log["GDPCAP"] = df_log["GDPCAP"].apply(lambda x : x * 10000)
df_log["GDPCAP"]

In [None]:
np.log1p(df_log["GDPCAP"])

In [None]:
df_log["GDPCAP"] = np.log1p(df_log["GDPCAP"])
df_log.sample(5)

In [None]:
# Extras

# Compare log(1 + x) with log1p(x) for a very small positive x
print("Log -->", np.log(1e-100 + 1))
print("Log1p -->", np.log1p(1e-100))

## 5. Encoding
One-hot encoding is one of the most common encoding methods in machine learning. This method spreads the values in a column across multiple flag columns and assigns 0 or 1 to each of them. These binary values express the relationship between the original column and the encoded columns.

This method changes your categorical data, which is challenging to understand for algorithms, to a numerical format and enables you to group your categorical data without losing any information.
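What "spreading a column into flag columns" means can be shown by hand. The sketch below builds the flags manually on a made-up column (`toy` and `city` are hypothetical names) purely to illustrate the idea; in practice a library encoder does the same job.

```python
import pandas as pd

# Hypothetical categorical column
toy = pd.DataFrame({"city": ["Rome", "Paris", "Rome", "Oslo"]})

# One flag column per distinct category: 1 where the row matches, else 0
flags = pd.DataFrame({
    f"city_{name}": (toy["city"] == name).astype(int)
    for name in sorted(toy["city"].unique())
})

print(flags)
```

Each row has exactly one flag set to 1, so the original category can always be recovered from the flags.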

### 5.1 One-Hot Encoding

In [None]:
df_encode = df.copy(deep = True)

df_encode["COUNTRY"].value_counts()

In [None]:
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()

res = ohe.fit_transform(df_encode["COUNTRY"].values.reshape(-1,1))

res.toarray()

In [None]:
ohe.inverse_transform(res)

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
ohe.get_feature_names_out()

### 5.2 get_dummies()

In [None]:
y = pd.get_dummies(df_encode["COUNTRY"], prefix='Country')
y.head()

### 5.3 Label Encoder

In [None]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

print(le.fit_transform(df_encode["COUNTRY"]))

In [None]:
df_encode["COUNTRY"] = le.fit_transform(df_encode["COUNTRY"])

print(le.inverse_transform(df_encode["COUNTRY"]))

In [None]:
df_encode.sample(5)

## 6. Grouping

**Categorical Grouping**<br>
Categorical columns can be summarized with a pivot table, or by grouping and applying an aggregate function such as a lambda.

**Numeric Grouping**<br>
In most cases, numerical columns are aggregated with the sum or mean function.
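Both kinds of grouping can be combined in one `groupby` call. The following is a minimal sketch on a made-up frame (the `toy` DataFrame, its columns, and the output column names are all hypothetical): numeric values are aggregated with sum and mean, and a categorical column is reduced to its most frequent value with a lambda.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B", "B"],
    "segment": ["x", "x", "y", "x", "y"],
    "cases": [10, 30, 5, 15, 25],
})

summary = toy.groupby("country").agg(
    total_cases=("cases", "sum"),                    # numeric: sum
    mean_cases=("cases", "mean"),                    # numeric: mean
    top_segment=("segment", lambda s: s.mode()[0]),  # categorical: most frequent
)

print(summary)
```

Named aggregation keeps the output column names explicit, which helps when several aggregates are computed from the same source column.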

### 6.1 Categorical Grouping

In [None]:
df_group = df.copy(deep = True)

In [None]:
df_group.pivot(index = 'DATE', columns = 'CODE')

### 6.2 Numerical Grouping

In [None]:
pd.DataFrame(df_group.groupby("COUNTRY")[["HDI", "TC", "TD"]].mean()).reset_index()

## 7. Scaling

In most cases, the numerical features of a dataset do not share a common range and differ widely from each other. Scaling brings them onto comparable scales so that no feature dominates simply because of its units.

**Normalization**<br>
Normalization (or min-max normalization) rescales all values to a fixed range between 0 and 1. This transformation does not change the shape of the feature's distribution, and because the standard deviation shrinks, the effect of outliers increases: a single extreme value can squeeze the remaining values into a narrow band. It is therefore recommended to handle outliers before normalizing.<br><br>
**Standardization**<br>
Standardization (or z-score normalization) rescales the values using the mean and standard deviation, so the result has zero mean and unit variance. Features with different standard deviations end up with different ranges instead of being forced into one fixed interval, which reduces the effect of outliers compared to min-max normalization.
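The two definitions can be written out directly. This is a small sketch on a made-up array (`x` is hypothetical) that applies the min-max formula `(x - min) / (max - min)` and the z-score formula `(x - mean) / std` by hand, using the population standard deviation as NumPy does by default.

```python
import numpy as np

# Hypothetical feature with one large outlier
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Min-max normalization: maps the minimum to 0 and the maximum to 1
min_max = (x - x.min()) / (x.max() - x.min())

# Standardization: zero mean and unit (population) standard deviation
z = (x - x.mean()) / x.std()

print(min_max)
print(z)
```

In the min-max output, the four small values are squeezed close to 0 because the outlier alone defines the upper end of the range.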

### 7.1 Normalization

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

y = scaler.fit_transform(df["HDI"].values.reshape(-1, 1))

pd.DataFrame({"HDI": df["HDI"], "Normalized HDI": y.flatten()})

### 7.2 Standardization

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

y = scaler.fit_transform(df["HDI"].values.reshape(-1, 1))

pd.DataFrame({"HDI": df["HDI"], "Standardized HDI": y.flatten()})

In [None]:
# Extras

from scipy.stats import zscore

zscore(df["HDI"])

### References:

https://heartbeat.fritz.ai/hands-on-with-feature-engineering-techniques-dealing-with-outliers-fcc9f57cb63b

https://www.analyticsvidhya.com/blog/2020/10/7-feature-engineering-techniques-machine-learning/

https://stackoverflow.com/questions/49538185/what-is-the-purpose-of-numpy-log1p/49538384