# Feature Engineering ⚙️

###### Imports

In [5]:
import numpy as np  # Provides extra tools for efficient numerical computations and array manipulation
import pandas as pd  # Essential for data manipulation, cleaning, and analysis using DataFrames
from sklearn.preprocessing import StandardScaler  # Standardizes features to zero mean and unit variance
import statsmodels.api as sm  # Offers advanced statistical modeling and hypothesis testing capabilities

### The following functions scale, transform, or one-hot encode the relevant feature columns.

**Scaling** ⚖️

Several scaling methods were tested to optimise model performance:

*   Robust Scaling: Centers each column on its median and scales by the interquartile range.
*   Min-Max Scaling: Rescales each column to a fixed range, typically 0 to 1.

Ultimately, **Standard Scaling** was chosen as it performed best. It standardizes each column by subtracting its mean and dividing by its standard deviation, so every scaled feature has zero mean and unit variance.
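
As a rough illustration of how the three scalers differ, the sketch below applies each one to a small made-up column containing one outlier. The values are hypothetical and are not taken from the project dataset; the snippet only shows the mechanics, not the comparison that was actually run.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical toy column with one large outlier (not from the project data)
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standard = StandardScaler().fit_transform(values)  # (x - mean) / standard deviation
robust = RobustScaler().fit_transform(values)      # (x - median) / interquartile range
min_max = MinMaxScaler().fit_transform(values)     # (x - min) / (max - min)

print("standard:", standard.ravel().round(3))
print("robust:  ", robust.ravel().round(3))
print("min-max: ", min_max.ravel().round(3))
```

With an outlier present, the min-max output squeezes the ordinary values close to 0, while the robust output keeps them spread around 0. Which scaler works best still depends on the data and the model.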






In [2]:
# Define a function which applies standard scaling to the numerical columns:
def scaling(df):
    df = df.copy()
    # List the columns to be scaled
    scaled_col_names = [
        "Year", "Infant_deaths", "Under_five_deaths", "Adult_mortality", "Alcohol_consumption",
        "Hepatitis_B", "Measles", "BMI", "Polio", "Diphtheria", "Incidents_HIV", "GDP_per_capita",
        "Population_mln", "Thinness_ten_nineteen_years", "Thinness_five_nine_years", "Schooling",
        "Economy_status_Developed", "Economy_status_Developing", "GDP_per_capita_log", "Incidents_HIV_log",
    ]
    # Fit the scaler on the selected columns and replace them with their scaled values
    df[scaled_col_names] = StandardScaler().fit_transform(df[scaled_col_names])
    return df

**Log Transformations** 🪵

A linear model assumes a linear relationship between each feature and the target, so the relationship between each column and life expectancy was checked. **GDP per capita** and **HIV incidents** showed logarithmic relationships with life expectancy, so log transformations were applied to these two columns to improve model performance.
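
The effect of a log transformation can be sketched with made-up numbers. The example assumes a target that depends linearly on the logarithm of a feature; the variable names and values are hypothetical and do not come from the dataset.

```python
import numpy as np

# Hypothetical feature spanning several orders of magnitude
x = np.array([100.0, 1_000.0, 10_000.0, 100_000.0])
# Hypothetical target that is linear in log(x), not in x itself
y = 50.0 + 5.0 * np.log(x)

# Pearson correlation before and after the log transformation
corr_raw = np.corrcoef(x, y)[0, 1]
corr_log = np.corrcoef(np.log(x), y)[0, 1]

print("correlation with x:     ", round(corr_raw, 3))
print("correlation with log(x):", round(corr_log, 3))
```

Note that `np.log` is only defined for strictly positive values, so a column containing zeros or negatives would need separate handling before this step.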

**One-Hot Encoding** 🔥

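The categorical `Region` column was one-hot encoded into separate 0/1 indicator columns so that the linear model can use it. The first category is dropped (`drop_first = True`), which avoids perfect collinearity between the indicator columns and the constant term.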
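
A minimal sketch of the one-hot encoding step described above, using a tiny made-up frame. The region labels and `Schooling` values are illustrative assumptions, not rows from the dataset.

```python
import pandas as pd

# Hypothetical rows, for illustration only
toy = pd.DataFrame({
    "Region": ["Africa", "Asia", "Africa", "Europe"],
    "Schooling": [5.2, 7.9, 4.8, 12.1],
})

# drop_first=True drops the first category in sorted order ("Africa" here),
# which then acts as the baseline level
encoded = pd.get_dummies(toy, columns=["Region"], drop_first=True, prefix="Region", dtype=int)

print(encoded)
```

Rows belonging to the dropped category have 0 in every remaining `Region_` column.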


In [3]:
# Define a function applying feature engineering (plus scaling) to the data:
def feature_eng(df):
    df = df.copy()  # Work on a copy so the caller's DataFrame is not modified

    # Log transformation of the GDP column and the Incidents_HIV column
    # (note that the Incidents_HIV log is negated, so its sign is flipped)
    df['GDP_per_capita_log'] = np.log(df['GDP_per_capita'])
    df['Incidents_HIV_log'] = -np.log(df['Incidents_HIV'])

    # Scale the data using a standard scaler
    scaled_df = scaling(df)

    # One-hot encode Region if it is present in the DataFrame
    if 'Region' in scaled_df.columns:
        scaled_df = pd.get_dummies(scaled_df, columns=['Region'], drop_first=True, prefix='Region', dtype=int)

    # Add the constant (intercept) column expected by statsmodels
    scaled_df = sm.add_constant(scaled_df)

    # Return the feature engineered result
    return scaled_df