
## 📌 Feature Engineering Plan

Based on the exploratory data analysis (EDA), the feature engineering process follows the plan below:

1. **Create a `'Month'` Feature**  
   Extract the month from the date column to capture potential seasonal trends in rainfall.

2. **Remove Outliers**  
   Identify and eliminate extreme values that may distort the model’s performance.

3. **Address Skewness**  
   Apply transformations (e.g., log, Box-Cox) to reduce skewness in numerical features where necessary.

4. **Remove Highly Correlated Features**  
   Drop redundant features that show strong correlation with each other to reduce multicollinearity.

5. **Handle Missing Values**  
   - If a feature has a high percentage of missing values, it will either be dropped or a new category such as `'Missing'` will be created (for categorical features).  
   - If the percentage of missing values is moderate or low, different imputation strategies will be tested.

6. **Create New Features**  
   Engineer additional features based on domain knowledge or variable interactions that may improve model performance.

7. **Standardization and Encoding**  
   - Standardize numerical features to ensure consistent scale.  
   - Apply suitable encoding techniques (e.g., One-Hot Encoding, Ordinal Encoding) to categorical features for compatibility with machine learning algorithms.
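
Step 7 is not implemented in the cells that follow, so here is a minimal sketch of what it could look like, assuming scikit-learn's `ColumnTransformer` is used. The toy values and the choice of `StandardScaler` plus `OneHotEncoder` are illustrative assumptions, not the final pipeline of this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame: column names mirror the dataset, values are made up.
toy = pd.DataFrame({
    "MinTemp": [13.4, 7.4, 12.9, 9.2],
    "Humidity3pm": [22.0, 25.0, 30.0, 16.0],
    "WindGustDir": ["W", "WNW", "WSW", "W"],
})

numeric_cols = ["MinTemp", "Humidity3pm"]
categorical_cols = ["WindGustDir"]

# Scale the numeric columns and one-hot encode the categorical one.
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

X = preprocessor.fit_transform(toy)
# The result may be sparse depending on its density; densify for inspection.
if hasattr(X, "toarray"):
    X = X.toarray()
print(X.shape)
```

In a real workflow the transformer would be fitted on the training split only and then applied to the validation and test splits.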

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Load dataset
df = pd.read_csv('../data/raw/weatherAUS.csv')
target = 'RainTomorrow'
# Drop target labels that are null
df.dropna(subset=[target], inplace=True)
# Encode the target and the RainToday feature as 0/1 (No -> 0, Yes -> 1)
df['RainTomorrow'] = df['RainTomorrow'].map({'No': 0, 'Yes': 1})
df['RainToday'] = df['RainToday'].map({'No': 0, 'Yes': 1})

print(f'Dataframe shape: {df.shape}')
df.head(3)

Dataframe shape: (142193, 23)


Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,...,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow
0,2008-12-01,Albury,13.4,22.9,0.6,,,W,44.0,W,...,71.0,22.0,1007.7,1007.1,8.0,,16.9,21.8,0.0,0
1,2008-12-02,Albury,7.4,25.1,0.0,,,WNW,44.0,NNW,...,44.0,25.0,1010.6,1007.8,,,17.2,24.3,0.0,0
2,2008-12-03,Albury,12.9,25.7,0.0,,,WSW,46.0,W,...,38.0,30.0,1007.6,1008.7,,2.0,21.0,23.2,0.0,0


In [3]:
df['Date'] = pd.to_datetime(df['Date'])
num_features = df.select_dtypes(include=np.number).columns.tolist()
cat_features = df.select_dtypes(include='object').columns.tolist()

#### 1. Create `'Month'` feature

In [4]:
df['Month'] = df['Date'].dt.month
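
An integer month places December (12) and January (1) far apart even though they are adjacent in the calendar. As an optional variation that this notebook does not use, the month can be mapped onto a circle with sine and cosine. The sketch below uses made-up dates, and `month_sin` and `month_cos` are hypothetical feature names.

```python
import numpy as np
import pandas as pd

# Made-up dates, only to illustrate the encoding.
dates = pd.to_datetime(pd.Series(["2008-12-01", "2009-01-15", "2009-06-30"]))
month = dates.dt.month

# Map months 1..12 onto the unit circle so that December and January are neighbours.
angle = 2 * np.pi * (month - 1) / 12
month_sin = np.sin(angle)
month_cos = np.cos(angle)

print(pd.DataFrame({"month": month, "month_sin": month_sin, "month_cos": month_cos}))
```

Tree-based models usually cope well with the plain integer month, so this encoding mainly matters for linear or distance-based models.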

#### 2. Remove outliers

In [5]:
features_with_outliers = ['Rainfall', 'Evaporation', 'WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm']

In [6]:
def filter_outliers_by_percentile(df, features, percentile=0.99):
    """
    Remove rows where any value in specified features exceeds the given percentile threshold.
    Only filters the upper tail (high values) as outliers.

    Parameters:
        df (pd.DataFrame): Input DataFrame.
        features (list): List of columns to check for outliers.
        percentile (float): Percentile threshold (between 0 and 1). Rows with values above this percentile are removed.

    Returns:
        pd.DataFrame: DataFrame without rows exceeding the specified percentile in any of the features.
    """
    df_clean = df.copy()
    outlier_indices = set()
    
    for col in features:
        threshold = df_clean[col].quantile(percentile)
        col_outliers = df_clean[df_clean[col] > threshold].index
        outlier_indices.update(col_outliers)
        
    initial_rows = df_clean.shape[0]
    df_clean = df_clean.drop(index=outlier_indices)
    final_rows = df_clean.shape[0]
    
    print(f"Rows removed due to upper {100*(1-percentile):.2f}% outliers: {initial_rows - final_rows}")
    return df_clean

# Example usage:
df = filter_outliers_by_percentile(df, features_with_outliers, 0.99)

Rows removed due to upper 1.00% outliers: 4944
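
The threshold is computed per column and a row is dropped if it exceeds the threshold in any of the columns, so the total removed can be larger than 1% of the rows (here 4944 of 142193, roughly 3.5%). The same logic can be written without an explicit loop. The sketch below is a vectorized variant on toy data, and `filter_upper_tail` is a hypothetical name, not a function used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

def filter_upper_tail(df, features, percentile=0.99):
    """Drop rows whose value exceeds the per-column percentile in any feature."""
    thresholds = df[features].quantile(percentile)          # one threshold per column
    exceeds = df[features].gt(thresholds, axis="columns")   # NaN compares as False
    return df.loc[~exceeds.any(axis=1)]

# Toy data: 'a' ascends, 'b' descends, and one value of 'b' is missing.
toy = pd.DataFrame({
    "a": np.arange(100.0),
    "b": np.arange(100.0)[::-1].copy(),
})
toy.loc[50, "b"] = np.nan

filtered = filter_upper_tail(toy, ["a", "b"])
print(len(toy) - len(filtered), "rows removed")
```

Rows with missing values in a checked column are kept, because a comparison against `NaN` evaluates to `False`. The loop-based function above behaves the same way.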


#### 3. Address skewness

In [7]:
df[num_features].skew().sort_values()

Sunshine        -0.521518
Humidity9am     -0.478696
Cloud9am        -0.206948
Cloud3pm        -0.203580
Pressure9am     -0.054308
Pressure3pm     -0.010067
MinTemp          0.033117
Humidity3pm      0.035499
Temp9am          0.094830
MaxTemp          0.221805
Temp3pm          0.236211
WindSpeed3pm     0.347202
WindSpeed9am     0.441208
WindGustSpeed    0.545855
Evaporation      0.888658
RainTomorrow     1.385824
RainToday        1.399826
Rainfall         4.047390
dtype: float64

In [8]:
def get_skewed_features(df, threshold=0.5):
    """
    Returns the names of numerical features with skewness < -threshold or > threshold.
    
    Parameters:
        df (pd.DataFrame): The input DataFrame.
        threshold (float): Absolute skewness above which a feature is flagged for treatment.
    
    Returns:
        list: Names of the features with strong skewness.
    """
    numeric_cols = df.select_dtypes(include=['number']).columns
    skew_values = df[numeric_cols].skew()
    skewed = skew_values[(skew_values > threshold) | (skew_values < -threshold)]
    return skewed.index.to_list()

skewed_features = get_skewed_features(df, threshold=0.5)
skewed_features.remove('RainTomorrow')
print("Skewed features:", skewed_features)

Skewed features: ['Rainfall', 'Evaporation', 'Sunshine', 'WindGustSpeed', 'RainToday']
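
Note that `RainToday` is a 0/1 indicator, so its skewness only reflects class imbalance and a power transform cannot make it more symmetric. One possible refinement, not applied in this notebook, is to restrict the skewness check to columns with more than two distinct values. The helper name `continuous_numeric_columns` and the toy values below are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def continuous_numeric_columns(df, min_unique=3):
    """Return numeric columns that have at least `min_unique` distinct non-null values."""
    numeric = df.select_dtypes(include="number")
    return [c for c in numeric.columns if numeric[c].nunique(dropna=True) >= min_unique]

# Toy frame: one continuous column, one binary flag, one categorical column.
toy = pd.DataFrame({
    "Rainfall": [0.0, 0.2, 0.0, 12.4, 1.0],
    "RainToday": [0.0, 0.0, 0.0, 1.0, np.nan],
    "Location": ["a", "b", "c", "d", "e"],
})

cols = continuous_numeric_columns(toy)
print(cols)
```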


In [9]:
from sklearn.preprocessing import PowerTransformer

def treat_skewness(df, features):
    """
    Apply transformations to reduce skewness in numeric features.
    Uses Box-Cox for strictly positive data and Yeo-Johnson otherwise.
    
    Parameters:
        df (pd.DataFrame): DataFrame with numeric features.
        features (list): List of feature names to transform.
    
    Returns:
        pd.DataFrame: DataFrame with transformed features.
    """
    df_transformed = df.copy()
    
    for feature in features:
        data = df_transformed[feature].values.reshape(-1, 1)
        if (df_transformed[feature] > 0).all():
            # Use Box-Cox transformation
            pt = PowerTransformer(method='box-cox')
        else:
            # Use Yeo-Johnson transformation for zero or negative values
            pt = PowerTransformer(method='yeo-johnson')
        
        try:
            df_transformed[feature] = pt.fit_transform(data).flatten()
        except Exception as e:
            print(f"Could not transform feature {feature}: {e}")
    
    return df_transformed

# Example usage:
# features_to_treat = ['Rainfall', 'Temperature', 'WindGustSpeed']
df = treat_skewness(df, skewed_features)
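
Because a comparison with `NaN` is `False`, the check `(df_transformed[feature] > 0).all()` fails for any column that still contains missing values, so such columns go through Yeo-Johnson even if all observed values are positive. The sketch below shows the effect of Yeo-Johnson on a synthetic right-skewed sample with some missing entries. The data is randomly generated and is not taken from the weather dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed sample with a few missing values.
rng = np.random.default_rng(0)
values = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=2000))
values.iloc[::50] = np.nan

# PowerTransformer ignores NaN when fitting and keeps it in the output.
pt = PowerTransformer(method="yeo-johnson")
transformed = pd.Series(pt.fit_transform(values.to_numpy().reshape(-1, 1)).ravel())

skew_before = values.skew()
skew_after = transformed.skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

By default `PowerTransformer` also standardizes its output to zero mean and unit variance, which overlaps with the standardization planned in step 7.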

#### 4. Remove highly correlated features

In [10]:
import pandas as pd
import numpy as np

def remove_highly_correlated_features(df, target, threshold=0.9, strategy='nulls'):
    """
    Removes one of each pair of highly correlated features based on a selection strategy.
    
    Parameters:
        df (pd.DataFrame): The dataset.
        target (str): Name of the target variable.
        threshold (float): Correlation threshold.
        strategy (str): Strategy to drop one variable from each correlated pair. 
                        Options: 'nulls', 'target_corr'.
    
    Returns:
        pd.DataFrame: DataFrame with reduced features.
        list: List of dropped features.
    """
    df_numeric = df.select_dtypes(include=[np.number]).drop(columns=[target], errors='ignore')
    corr_matrix = df_numeric.corr().abs()
    upper_tri = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
    
    to_drop = set()

    for col in upper_tri.columns:
        for row in upper_tri.index:
            corr_value = upper_tri.loc[row, col]
            if pd.notnull(corr_value) and corr_value > threshold:
                # If any of the two has already been dropped, skip
                if row in to_drop or col in to_drop:
                    continue

                if strategy == 'nulls':
                    nulls_row = df[row].isnull().sum()
                    nulls_col = df[col].isnull().sum()
                    drop = row if nulls_row > nulls_col else col

                elif strategy == 'target_corr':
                    if target not in df.columns:
                        raise ValueError("Target column not found in DataFrame.")
                    corr_row = abs(df[[row, target]].corr().iloc[0, 1])
                    corr_col = abs(df[[col, target]].corr().iloc[0, 1])
                    drop = row if corr_row < corr_col else col

                else:
                    raise ValueError("Invalid strategy. Choose 'nulls' or 'target_corr'.")

                to_drop.add(drop)

    reduced_df = df.drop(columns=list(to_drop))
    return reduced_df, list(to_drop)

# Example usage:
df, dropped = remove_highly_correlated_features(df, target='RainTomorrow', threshold=0.9, strategy='target_corr')
print("Dropped features:", dropped)

Dropped features: ['RainToday', 'Temp9am', 'MaxTemp', 'Pressure3pm']
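
The function relies on masking the correlation matrix to its upper triangle so that every pair is inspected only once. The following sketch isolates that step on synthetic data, where the columns `a`, `b` and `c` are made up and `b` is built to be almost a multiple of `a`.

```python
import numpy as np
import pandas as pd

# Synthetic data: 'b' is nearly a linear function of 'a', 'c' is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

corr = toy.corr().abs()
# k=1 keeps only entries strictly above the diagonal: each pair once, no self-correlation.
mask = np.triu(corr.to_numpy() > 0.9, k=1)
rows, cols = np.where(mask)
pairs = [(corr.index[i], corr.columns[j]) for i, j in zip(rows, cols)]
print(pairs)
```

Listing the pairs before dropping anything makes it easier to check why a given feature ended up in the dropped list.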


#### 5. Handling missing values

Impute categorical features with the label `'Missing'`

In [None]:
print(df[cat_features].isna().mean())

for col in cat_features:
    df[col] = df[col].fillna('Missing')

Impute some numerical features with the median, grouped by `'Month'` and `'Location'`

In [None]:
def impute_median_by_group(df, features, group_cols=['Month', 'Location']):
    """
    Impute missing values in specified columns using the median,
    calculated by grouping over the given columns (default 'Month' and 'Location').

    Args:
        df (pd.DataFrame): Original dataframe.
        features (list): List of columns to impute.
        group_cols (list): Columns to group by for median calculation (default ['Month', 'Location']).

    Returns:
        pd.DataFrame: Dataframe with imputed values.
    """
    df_imputed = df.copy()
    
    # Calculate group-wise medians for the features
    medians = df_imputed.groupby(group_cols)[features].transform('median')
    
    # Fill missing values with the calculated medians
    for feature in features:
        df_imputed[feature] = df_imputed[feature].fillna(medians[feature])
    
    return df_imputed

In [32]:
num_features = df.select_dtypes(include=['number']).columns.tolist()
#print(f"Missing values %:\n {df[num_features].isna().mean().sort_values(ascending=False)}")
#print(f"\n\nCorrelation with target: \n {df[num_features].corr()[target].sort_values()}")
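# Sunshine, Cloud3pm and Cloud9am are three of the four features with the most missing data; they are set aside for model-based imputation.
# Evaporation, the fourth, is excluded because it correlates less with the target than the other three.
impute_by_model = ['Sunshine', 'Cloud3pm', 'Cloud9am']
# Note: this list still contains the target and 'Month'; neither has missing values, so imputing them is a no-op.
impute_median_features = [x for x in num_features if x not in impute_by_model]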

In [33]:
df = impute_median_by_group(df, impute_median_features)

In [37]:
impute_median_features

['MinTemp',
 'Rainfall',
 'Evaporation',
 'WindGustSpeed',
 'WindSpeed9am',
 'WindSpeed3pm',
 'Humidity9am',
 'Humidity3pm',
 'Pressure9am',
 'Temp3pm',
 'RainTomorrow',
 'Month']

In [35]:
df[num_features].isna().mean()

MinTemp          0.000000
Rainfall         0.000000
Evaporation      0.321161
Sunshine         0.477475
WindGustSpeed    0.041829
WindSpeed9am     0.000000
WindSpeed3pm     0.000000
Humidity9am      0.000000
Humidity3pm      0.000000
Pressure9am      0.083709
Cloud9am         0.380258
Cloud3pm         0.404455
Temp3pm          0.000000
RainTomorrow     0.000000
Month            0.000000
dtype: float64

In [39]:
df[df['Evaporation'].isna()]['Location'].value_counts()

Location
Albury           2981
Tuggeranong      2968
SalmonGums       2935
Penrith          2931
Witchcliffe      2909
Ballarat         2893
BadgerysCreek    2890
Newcastle        2831
MountGinini      2792
Walpole          2787
GoldCoast        2749
NorahHead        2741
PearceRAAF       2723
Wollongong       2717
Nhil             1518
Uluru            1507
Launceston       1207
Name: count, dtype: int64

In [None]:
df[df['Location']=='Uluru']['Evaporation'].isna().mean()
# Some locations never measure evaporation!

1.0
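
This explains why `Evaporation` still has about 32% missing values after the grouped imputation: when a location never reports a feature, every `(Month, Location)` group median is `NaN` and nothing gets filled. One possible way to handle it, sketched below on made-up values, is to fall back to the global median when the group median is undefined. The function name `impute_median_with_fallback` is hypothetical, and whether a global median is appropriate for such locations is an open modelling choice.

```python
import numpy as np
import pandas as pd

def impute_median_with_fallback(df, feature, group_cols):
    """Fill NaN with the group median, then with the global median where the group has no data."""
    out = df.copy()
    group_median = out.groupby(group_cols)[feature].transform("median")
    global_median = out[feature].median()
    out[feature] = out[feature].fillna(group_median).fillna(global_median)
    return out

# Toy data: 'Uluru' never reports Evaporation, 'Sydney' has one gap.
toy = pd.DataFrame({
    "Location": ["Sydney", "Sydney", "Sydney", "Perth", "Uluru", "Uluru"],
    "Month": [1, 1, 1, 1, 1, 1],
    "Evaporation": [4.0, np.nan, 8.0, 2.0, np.nan, np.nan],
})

imputed = impute_median_with_fallback(toy, "Evaporation", ["Month", "Location"])
print(imputed)
```

Alternatives include dropping the feature or adding an indicator column that marks which values were imputed.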