## [ENG/KOR] From Preprocessing to Modeling!

I wanted to walk through the whole process of building a machine-learning model in a systematic way :)

Although the score from my code is not the best, I hope it can help your work a bit!

**Keywords: Imputation, Mutual Information, Seasonal Models, XGBoost**

Libraries and functions we will use.

In [1]:
import os
import math
import numpy as np
import pandas as pd
import xgboost as xgb
from xgboost import XGBRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import mutual_info_regression
from tools.dataset import score_dataset
from tools.preprocessing.missing_values import get_missing_raio, delete_columns, impute_missing_values
from tools.preprocessing.outliers import delete_outliers, impute_outliers, get_limits
from tools.preprocessing.scaling import minmax
from tools.engineering.mi import mi_score
from tools.engineering.encoding import one_hot
from tools.engineering.clustering import kmc

# Month-to-season mapping (Southern Hemisphere convention) used to build the 'season' column.
SEASON = {
    'summer': [12, 1, 2],
    'fall': [3, 4, 5],
    'winter': [6, 7, 8],
    'spring': [9, 10, 11],
}

def insert_season(x, season=SEASON):
    if x in season['summer']:
        return 'summer'
    elif x in season['fall']:
        return 'fall'
    elif x in season['winter']:
        return 'winter'
    elif x in season['spring']:
        return 'spring'
    else:
        raise ValueError(f'unknown month: {x}')


# Replace negative values with zero.
def neg_to_zero(x):
    if x < 0:
        return 0
    else:
        return x
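
The two helpers above work on one value at a time through `Series.apply`. As a possible alternative (a sketch, not the author's code), the same results can be obtained with vectorized pandas calls: invert the `SEASON` mapping into a month-to-season lookup for `Series.map`, and use `Series.clip` for the negative-to-zero step. `MONTH_TO_SEASON` is a name introduced here only for illustration.

```python
import pandas as pd

# Same month groups as the SEASON dictionary defined above.
SEASON = {
    'summer': [12, 1, 2],
    'fall': [3, 4, 5],
    'winter': [6, 7, 8],
    'spring': [9, 10, 11],
}

# Hypothetical helper: invert SEASON into {month: season}.
MONTH_TO_SEASON = {month: name for name, months in SEASON.items() for month in months}

months = pd.Series([1, 4, 7, 10, 12])
seasons = months.map(MONTH_TO_SEASON)   # an unknown month becomes NaN instead of raising

values = pd.Series([-0.5, 0.0, 2.3])
clipped = values.clip(lower=0)          # negative values become 0

print(seasons.tolist())
print(clipped.tolist())
```

Note the behavioral difference: `insert_season` raises on an unknown month, while `map` silently returns NaN, so a check such as `seasons.isna().any()` may be worth adding.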

# 1. Load Data

In [3]:
DATA_PATH = './data'

train = pd.read_csv(os.path.join(DATA_PATH, 'train.csv'))
test = pd.read_csv(os.path.join(DATA_PATH, 'test.csv'))

sub = test.copy().loc[:, ['ID_LAT_LON_YEAR_WEEK']]

print(f'train shape: {train.shape}')
print(f'test shape: {test.shape}')

train shape: (79023, 76)
test shape: (24353, 75)


In [4]:
# Check some rows.
train.head()

| | ID_LAT_LON_YEAR_WEEK | latitude | longitude | year | week_no | SulphurDioxide_SO2_column_number_density | SulphurDioxide_SO2_column_number_density_amf | SulphurDioxide_SO2_slant_column_number_density | SulphurDioxide_cloud_fraction | SulphurDioxide_sensor_azimuth_angle | ... | Cloud_cloud_top_height | Cloud_cloud_base_pressure | Cloud_cloud_base_height | Cloud_cloud_optical_depth | Cloud_surface_albedo | Cloud_sensor_azimuth_angle | Cloud_sensor_zenith_angle | Cloud_solar_azimuth_angle | Cloud_solar_zenith_angle | emission |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ID_-0.510_29.290_2019_00 | -0.51 | 29.29 | 2019 | 0 | -0.000108 | 0.603019 | -6.5e-05 | 0.255668 | -98.593887 | ... | 3664.436218 | 61085.80957 | 2615.120483 | 15.568533 | 0.272292 | -12.628986 | 35.632416 | -138.786423 | 30.75214 | 3.750994 |
| 1 | ID_-0.510_29.290_2019_01 | -0.51 | 29.29 | 2019 | 1 | 2.1e-05 | 0.728214 | 1.4e-05 | 0.130988 | 16.592861 | ... | 3651.190311 | 66969.478735 | 3174.572424 | 8.690601 | 0.25683 | 30.359375 | 39.557633 | -145.18393 | 27.251779 | 4.025176 |
| 2 | ID_-0.510_29.290_2019_02 | -0.51 | 29.29 | 2019 | 2 | 0.000514 | 0.748199 | 0.000385 | 0.110018 | 72.795837 | ... | 4216.986492 | 60068.894448 | 3516.282669 | 21.10341 | 0.251101 | 15.377883 | 30.401823 | -142.519545 | 26.193296 | 4.231381 |
| 3 | ID_-0.510_29.290_2019_03 | -0.51 | 29.29 | 2019 | 3 | NaN | NaN | NaN | NaN | NaN | ... | 5228.507736 | 51064.547339 | 4180.973322 | 15.386899 | 0.262043 | -11.293399 | 24.380357 | -132.665828 | 28.829155 | 4.305286 |
| 4 | ID_-0.510_29.290_2019_04 | -0.51 | 29.29 | 2019 | 4 | -7.9e-05 | 0.676296 | -4.8e-05 | 0.121164 | 4.121269 | ... | 3980.59812 | 63751.125781 | 3355.710107 | 8.114694 | 0.235847 | 38.532263 | 37.392979 | -141.509805 | 22.204612 | 4.347317 |


# 2. Handling Missing Values

결측값을 처리합니다.

In [5]:
# These columns are important, so we will never delete them.
protected_columns = ['latitude', 'longitude', 'week_no', 'year']

Let's delete the columns that have many missing values.

The threshold is passed as an argument (0.3 here).

In [6]:
train_deleted, deleted_columns = delete_columns(train, 0.3, target='emission')
print(f'The number of deleted columns: {len(deleted_columns)}')

The number of deleted columns: 7
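
`delete_columns` comes from the author's own `tools` package, whose source is not shown here. As a rough guess at what such a helper might do, the sketch below drops every column whose fraction of missing values exceeds the threshold, while always keeping the target. The name `drop_sparse_columns`, its `protected` parameter, and the reading of the threshold as a maximum missing ratio are all assumptions.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, threshold, target=None, protected=()):
    """Drop columns whose fraction of missing values exceeds `threshold` (hypothetical helper)."""
    missing_ratio = df.isna().mean()            # fraction of NaN per column
    keep_always = set(protected) | ({target} if target else set())
    to_drop = [c for c in df.columns
               if missing_ratio[c] > threshold and c not in keep_always]
    return df.drop(columns=to_drop), to_drop

demo = pd.DataFrame({
    'a': [1.0, np.nan, np.nan, np.nan],   # 75% missing
    'b': [1.0, 2.0, np.nan, 4.0],         # 25% missing
    'emission': [3.0, 4.0, 5.0, 6.0],
})
reduced, dropped = drop_sparse_columns(demo, 0.3, target='emission')
print(dropped)
```

Returning the list of dropped names, as the original helper appears to do, makes it easy to apply exactly the same drop to the test set.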


In [7]:
# Apply this process to test data.
test_deleted = test.drop(deleted_columns, axis=1)

We compare several imputation methods and keep the one with the lowest score.

In [8]:
methods = ['mean', 'linear', 'fill']
results = []

for method in methods:
    train_imputed, _ = impute_missing_values(train_deleted, method)
    score = score_dataset(train_imputed, 'emission')
    results.append([method, score])
    print(f'method "{method}" completed')

results

method "mean" completed
method "linear" completed
method "fill" completed


[['mean', 8.880529420732177],
 ['linear', 8.823701726182582],
 ['fill', 8.949878023297918]]

In [9]:
best_method = sorted(results, key=lambda x: x[1])[0]
print(f'best method: {best_method}')

best method: ['linear', 8.823701726182582]
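
The three method names are handled inside the author's `impute_missing_values`, which is not shown. A plausible reading, sketched below with plain pandas, is: 'mean' fills with the column mean, 'linear' interpolates linearly between neighbouring rows, and 'fill' carries neighbouring values forward and then backward. The function name `impute` and these exact semantics are assumptions.

```python
import numpy as np
import pandas as pd

def impute(df, method):
    """Fill NaNs in numeric columns with one of three simple strategies (hypothetical helper)."""
    out = df.copy()
    num = out.select_dtypes('number').columns
    if method == 'mean':
        out[num] = out[num].fillna(out[num].mean())
    elif method == 'linear':
        # Interpolate between neighbouring rows; also fill gaps at both ends.
        out[num] = out[num].interpolate(method='linear', limit_direction='both')
    elif method == 'fill':
        # Forward fill, then backward fill whatever is left at the start.
        out[num] = out[num].ffill().bfill()
    else:
        raise ValueError(f'unknown method: {method}')
    return out

demo = pd.DataFrame({'x': [1.0, np.nan, 3.0, np.nan, 8.0]})
for m in ['mean', 'linear', 'fill']:
    print(m, impute(demo, m)['x'].tolist())
```

Linear interpolation treats rows as equally spaced, so it is only meaningful if the rows are ordered in time within each location.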


In [10]:
train_imputed, _ = impute_missing_values(train_deleted, best_method[0])
test_imputed, _ = impute_missing_values(test_deleted, best_method[0])

In [11]:
train_now = train_imputed
test_now = test_imputed

# 3. Feature Engineering

Add two columns: 'month_no' and 'covid'.

This process is copied from [BASSEM GOUTY's code](https://www.kaggle.com/code/bassemgouty/ps3e20-ensembling-with-score-nudge). Thank you!



In [12]:
train_now['date'] = pd.to_datetime('2021' + train_now['week_no'].astype(str) + '0', format='%Y%W%w')
train_now['month_no'] = train_now['date'].dt.month
train_now.drop(columns=['date'], inplace=True)

train_now['covid'] = (train_now.year == 2020) & (train_now.month_no > 2)

In [13]:
test_now['date'] = pd.to_datetime('2021' + test_now['week_no'].astype(str) + '0', format='%Y%W%w')
test_now['month_no'] = test_now['date'].dt.month
test_now.drop(columns=['date'], inplace=True)

test_now['covid'] = (test_now.year == 2020) & (test_now.month_no > 2)
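
To see what the `'%Y%W%w'` format does, the sketch below repeats the conversion for a few week numbers with the standard library's `datetime.strptime`. `%W` is the week of the year with Monday as the first day, and the trailing `0` is `%w` for Sunday, so each week is mapped to the month of its Sunday in 2021. The helper name `week_to_month` is introduced here for illustration, and the week number is zero-padded to keep the string unambiguous.

```python
from datetime import datetime

def week_to_month(week_no, year=2021):
    """Month of the Sunday ('%w' = 0) that closes week `week_no` ('%W', Monday-first weeks)."""
    return datetime.strptime(f'{year}{week_no:02d}0', '%Y%W%w').month

for week in (0, 8, 9, 48, 52):
    print(week, week_to_month(week))
```

Week 52 ends on a Sunday in January of the following year, so it gets month 1 and therefore lands in the same season as weeks 0 to 8.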

Apply one-hot encoding to the 'covid' column.

In [14]:
train_now, encoder = one_hot(train_now, 'covid')
test_now, _ = one_hot(test_now, 'covid', encoder)
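
The `one_hot` helper is part of the author's `tools` package and its source is not shown. The sketch below is one way such a helper could work with scikit-learn's `OneHotEncoder`: fit on the training data, then reuse the fitted encoder on the test data so both get the same columns. The name `one_hot_sketch` and its signature are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

def one_hot_sketch(df, column, encoder=None):
    """One-hot encode `column`; fit a new encoder only when none is given (hypothetical helper)."""
    if encoder is None:
        encoder = OneHotEncoder(handle_unknown='ignore').fit(df[[column]])
    encoded = encoder.transform(df[[column]]).toarray()   # default output is a sparse matrix
    names = [f'{column}_{cat}' for cat in encoder.categories_[0]]
    dummies = pd.DataFrame(encoded, columns=names, index=df.index)
    return pd.concat([df.drop(columns=column), dummies], axis=1), encoder

train_demo = pd.DataFrame({'covid': [False, True, False]})
test_demo = pd.DataFrame({'covid': [False, False]})   # no True rows in this demo test set
train_enc, enc = one_hot_sketch(train_demo, 'covid')
test_enc, _ = one_hot_sketch(test_demo, 'covid', enc)
print(list(test_enc.columns))
```

Reusing the encoder matters when a category is absent from the test data: a freshly fitted encoder would then produce fewer columns than the training set has.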



Add the new columns to 'protected_columns'.

In [15]:
protected_columns.extend(['covid_False', 'covid_True', 'month_no'])

Create the 'season' column from 'month_no'.

In [16]:
train_now['season'] = train_now['month_no'].apply(insert_season, args=[SEASON])

test_now['season'] = test_now['month_no'].apply(insert_season, args=[SEASON])

print(train_now['season'].value_counts(), test_now['season'].value_counts())

season
summer    20874
fall      19383
winter    19383
spring    19383
Name: count, dtype: int64 season
fall      6461
winter    6461
spring    6461
summer    4970
Name: count, dtype: int64


We will split the datasets by season.

**Based on these datasets, we will also build a separate model for each season.**

In [17]:
train_summer = train_now.loc[train_now['season'] == 'summer']
train_fall = train_now.loc[train_now['season'] == 'fall']
train_winter = train_now.loc[train_now['season'] == 'winter']
train_spring = train_now.loc[train_now['season'] == 'spring']

test_summer = test_now.loc[test_now['season'] == 'summer']
test_fall = test_now.loc[test_now['season'] == 'fall']
test_winter = test_now.loc[test_now['season'] == 'winter']
test_spring = test_now.loc[test_now['season'] == 'spring']

In [18]:
train_sets = [train_summer, train_fall, train_winter, train_spring]
test_sets = [test_summer, test_fall, test_winter, test_spring]

In [19]:
# Delete the 'season' columns.
for i in range(len(train_sets)):
    train_sets[i] = train_sets[i].drop('season', axis=1)

for i in range(len(test_sets)):
    test_sets[i] = test_sets[i].drop('season', axis=1)
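
The split and the column drop above can also be written more compactly with `groupby`. This is only an alternative sketch on a toy frame; `by_season`, `order` and `demo_sets` are names made up for the example.

```python
import pandas as pd

demo = pd.DataFrame({
    'season': ['summer', 'fall', 'summer', 'winter', 'spring'],
    'emission': [1.0, 2.0, 3.0, 4.0, 5.0],
})

order = ['summer', 'fall', 'winter', 'spring']   # same order as train_sets above

# One sub-frame per season, with the 'season' column already removed.
by_season = {name: group.drop(columns='season')
             for name, group in demo.groupby('season')}
demo_sets = [by_season[name] for name in order]

print([len(part) for part in demo_sets])
```

Keeping an explicit `order` list guarantees that the train and test lists line up index by index, which the later loops rely on.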

# 4. Feature Selection (with Mutual Information, Correlation)

Only features whose MI score (or correlation coefficient) is above the threshold will be selected.

In [20]:
selected_columns = []

for idx, train in enumerate(train_sets):
    df, columns, _ = mi_score(train, 'emission', 0.1, corr=True, corr_threshold=0.1, protected=protected_columns)
    columns.remove('emission')
    selected_columns.append(columns)
    train_sets[idx] = df
    print(f'The number of columns of {idx+1}th data: {len(train_sets[idx].columns)}')

The number of columns of 1th data: 10
The number of columns of 2th data: 10
The number of columns of 3th data: 9
The number of columns of 4th data: 10
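
`mi_score` is another helper from the author's `tools` package, so its exact rule is unknown. The sketch below shows one way a threshold-based selection could be built from scikit-learn's `mutual_info_regression` and the absolute Pearson correlation. It assumes that a feature is kept when it passes either threshold or is listed as protected; the function name `select_features` and that rule are assumptions, not the author's implementation.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

def select_features(df, target, mi_threshold, corr_threshold, protected=()):
    """Keep numeric features passing either threshold, plus protected columns (hypothetical)."""
    X = df.drop(columns=target).select_dtypes('number')
    y = df[target]
    mi = pd.Series(mutual_info_regression(X, y, random_state=0), index=X.columns)
    corr = X.corrwith(y).abs()          # absolute Pearson correlation with the target
    keep = [c for c in X.columns
            if c in protected or mi[c] > mi_threshold or corr[c] > corr_threshold]
    return keep, mi, corr

# Synthetic data: 'signal' drives the target, 'noise' and 'latitude' do not.
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    'signal': rng.normal(size=n),
    'noise': rng.normal(size=n),
    'latitude': rng.normal(size=n),
})
demo['emission'] = 3 * demo['signal'] + rng.normal(scale=0.1, size=n)

keep, mi, corr = select_features(demo, 'emission', 0.1, 0.1, protected=['latitude'])
print(keep)
```

Mutual information can pick up non-linear dependence that correlation misses, which is one reason to combine the two criteria.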


Actually, this step seems to have little effect. As you can see from other participants' code, most features other than the geographic ones are hard to use because it is difficult to find a close relationship between them and emission.

In [21]:
# Apply this process to test data.
for idx, test in enumerate(test_sets):
    test_sets[idx] = test.loc[:, selected_columns[idx]]

# 5. Finalizing the Datasets

Let's synchronize the column order between the train and test sets.

In [22]:
train_columns = []

for idx, train in enumerate(train_sets):
    col = list(map(str, list(train.columns)))
    train.columns = col
    col.sort()
    train_sets[idx] = train[col]
    col.remove('emission')
    train_columns.append(col)

In [23]:
for idx, test in enumerate(test_sets):
    col = list(map(str, list(test.columns)))
    test.columns = col
    test_sets[idx] = test[train_columns[idx]]

# 6. Modeling (XGBoost)

In [24]:
param_grid = {"max_depth":    [8, 10],
              "n_estimators": [100, 300],
              }

models = []

for idx, train in enumerate(train_sets):
    X_train = train.drop(['emission', 'ID_LAT_LON_YEAR_WEEK'], axis=1)
    y_train = train['emission']

    regressor = xgb.XGBRegressor(eval_metric='rmsle',
                                # tree_method='gpu_hist'
                                 )

    search = GridSearchCV(regressor, param_grid, cv=5).fit(X_train, y_train)

    regressor=xgb.XGBRegressor(
        n_estimators = search.best_params_["n_estimators"],
        max_depth    = search.best_params_["max_depth"],
        eval_metric  = 'rmsle',
        # tree_method  = 'gpu_hist'
        )

    regressor.fit(X_train, y_train)

    models.append(regressor)

    print(f'{idx+1}th model completed')

1th model completed
2th model completed
3th model completed
4th model completed
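
Two details of `GridSearchCV` are worth knowing for the cell above. Without a `scoring` argument it ranks candidates by the estimator's own `score` method, which is R² for regressors, and with the default `refit=True` it already retrains the best configuration on the full data as `best_estimator_`. The sketch below shows both on synthetic data; it uses scikit-learn's `DecisionTreeRegressor` as a lightweight stand-in for `xgb.XGBRegressor` so that it runs without xgboost, and choosing RMSE as the selection metric is an assumption about what you want to optimize.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=200)

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),    # stand-in for xgb.XGBRegressor
    {'max_depth': [2, 6]},
    cv=5,
    scoring='neg_root_mean_squared_error',    # rank by RMSE instead of the default R^2
)
search.fit(X, y)

model = search.best_estimator_                # already refit on all of X, y (refit=True)
print(search.best_params_, -search.best_score_)
```

With this pattern the second `XGBRegressor(...)` construction and `fit` call in the loop would not be needed, since `best_estimator_` is the same configuration trained on the same data.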


# 7. Prediction

In [25]:
predictions = []

for idx, test in enumerate(test_sets):
    X_test = test.drop('ID_LAT_LON_YEAR_WEEK', axis=1)
    prediction = models[idx].predict(X_test)
    predictions.append(prediction)

# 8. Submit

In [26]:
for idx, pred in enumerate(predictions):
    test_sets[idx].loc[:, 'emission'] = pred

In [27]:
test_all = pd.concat(test_sets)
sub = sub.merge(test_all, on='ID_LAT_LON_YEAR_WEEK').loc[:, ['ID_LAT_LON_YEAR_WEEK', 'emission']]

Change negative values to zero.

In [28]:
sub['emission'] = sub['emission'].apply(neg_to_zero)

We will multiply each value by **1.09738621**. I derived this factor from Rwanda's annual CO2 emissions from 2009 to 2018.

The CO2 data is [here](https://ourworldindata.org/co2-emissions#global-co2-emissions-from-fossil-fuels-and-land-use-change)
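
The notebook does not say exactly how the factor was derived from the yearly figures. One common way to summarize growth over several years is the geometric-mean year-over-year ratio, $(E_{\text{last}} / E_{\text{first}})^{1/(n-1)}$ for $n$ yearly values. The sketch below shows that calculation; the function name is hypothetical and the numbers are placeholders, not Rwanda's actual emissions.

```python
def mean_annual_ratio(emissions):
    """Geometric-mean year-over-year ratio of a yearly series (hypothetical helper)."""
    steps = len(emissions) - 1            # number of year-to-year transitions
    return (emissions[-1] / emissions[0]) ** (1 / steps)

# Placeholder values for illustration only, NOT Rwanda's real figures.
yearly = [100.0, 110.0, 121.0]
ratio = mean_annual_ratio(yearly)
print(round(ratio, 6))

# Scaling a single prediction by the ratio, in the same way the next cell scales the column.
scaled = 4.0 * ratio
```

Whichever formula is used, applying one global factor assumes that every location grows at the same rate, which is a strong simplification.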


In [29]:
# Annual growth factor derived from Rwanda's 2009-2018 CO2 emissions (see above).
ANNUAL_INCREASE_RATIO = 1.09738621

sub['emission'] = sub['emission'] * ANNUAL_INCREASE_RATIO

sub.head()

| | ID_LAT_LON_YEAR_WEEK | emission |
|---|---|---|
| 0 | ID_-0.510_29.290_2022_00 | 4.548859 |
| 1 | ID_-0.510_29.290_2022_01 | 3.877588 |
| 2 | ID_-0.510_29.290_2022_02 | 4.227235 |
| 3 | ID_-0.510_29.290_2022_03 | 4.888444 |
| 4 | ID_-0.510_29.290_2022_04 | 3.067149 |


In [30]:
sub.to_csv('sub.csv', index = False)