# Baselines
This notebook computes a few simple baseline predictions to compare later models against. The following methods are used:
- mean
- informed random
- median
- gradient boosting (XGBoost), as a first learned model for comparison

The prediction column names are:
- X4:&emsp;&emsp;&ensp;Stem specific density (SSD) or wood density (stem dry mass per stem fresh volume)
- X11:&emsp;&emsp;Leaf area per leaf dry mass (specific leaf area, SLA or 1/LMA)
- X18:&emsp;&emsp;Plant height
- X50:&emsp;&emsp;Seed dry mass
- X26:&emsp;&emsp;Leaf nitrogen (N) content per leaf area 
- X3112:&emsp;Leaf area (in case of compound leaves: leaf, undefined if petiole in- or excluded)

## 1 - Setup
Import the required packages and define the data paths and target columns.

In [None]:
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.model_selection import KFold, cross_val_score

In [None]:
FOLDER_PATH = "../data"
SUBMISSIONS_PATH = "../submissions"
PREDICTION_COLUMNS = ['X4', 'X11', 'X18', 'X50', 'X26', 'X3112']

### 1.1 - Data Preparation
Load the datasets. From the test set we keep only the `id` column and add the target columns as placeholders initialized to 0.

In [None]:
test_dataset = pd.read_csv(FOLDER_PATH + "/test.csv")
test_dataset = test_dataset.filter(["id"])
for column in PREDICTION_COLUMNS:
    test_dataset[column] = 0
test_dataset.describe()

In [None]:
train_dataset = pd.read_csv(FOLDER_PATH + "/cleaned/cleaned_train.csv")

## 2 - Baseline Calculations
We'll drop all columns that contain auxiliary information, as these are not needed for the baseline calculations:
- WORLDCLIM*
- SOIL*
- MODIS*
- VOD*

Any columns matching `_sd`, `image` or `id` are dropped as well.

In [None]:
# Drop every column whose name matches one of these patterns (regex search on the column name)
for column in ["WORLDCLIM", "SOIL", "MODIS", "VOD", "_sd", "image", "id"]:
    train_dataset.drop(labels=list(train_dataset.filter(regex=column)), inplace=True, axis=1)
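The regex-based drop above can be tried on a toy frame. The column names below are hypothetical stand-ins, not the real dataset columns. The sketch also shows that an unanchored pattern such as `id` matches any column name containing that substring.

```python
import pandas as pd

# Hypothetical column names, used only to illustrate the filtering logic.
toy = pd.DataFrame(
    [[1, 0.5, 0.1, 3.0, 0.2, 7.0]],
    columns=["id", "WORLDCLIM_BIO1", "SOIL_clay", "X4_mean", "X4_sd", "humidity"],
)

patterns = ["WORLDCLIM", "SOIL", "MODIS", "VOD", "_sd", "image", "id"]

# filter(regex=...) searches within the name, so each pattern can match anywhere in it.
dropped = [c for p in patterns for c in toy.filter(regex=p).columns]
remaining = toy.drop(columns=dropped)

print(dropped)
print(list(remaining.columns))
```

If only the exact `id` column should be removed, an anchored pattern such as `^id$` would avoid the accidental match on `humidity`.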

### 2.1 - Mean
The mean of each target column over the whole training set is used as a constant prediction for every test row. This is the first baseline.

In [None]:
mean_values = train_dataset.mean()
mean_submission = test_dataset.copy(deep=True)
for column in PREDICTION_COLUMNS:
    mean_submission[column] = mean_values[f"{column}_mean"]
mean_submission.to_csv(SUBMISSIONS_PATH + "/mean_submission.csv", index=False)
mean_submission.head()

With this submission, we received a score of -0.08377.
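A score close to zero is what a constant mean prediction would produce under an R²-style metric. That the leaderboard uses R² is an assumption here (the cross-validation later in this notebook scores with `r2`), and the data below is synthetic.

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
y_train = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # synthetic, skewed target
y_test = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# In-sample, predicting the mean gives R^2 = 0 by definition.
r2_in_sample = r2_score(y_train, np.full_like(y_train, y_train.mean()))

# Out-of-sample, a constant prediction can never score above 0.
r2_out_of_sample = r2_score(y_test, np.full_like(y_test, y_train.mean()))

print(r2_in_sample, r2_out_of_sample)
```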

### 2.2 - Informed Random
For the informed random baseline, NumPy draws a random value for each test row from the observed training values of the target, so the predictions follow the training distribution of each trait.

In [None]:
informed_random_submission = test_dataset.copy(deep=True)
for column in PREDICTION_COLUMNS:
    size = informed_random_submission.shape[0]
    informed_random_submission[column] = np.random.choice(train_dataset[f"{column}_mean"], size)
informed_random_submission.to_csv(SUBMISSIONS_PATH + "/informed_random_submission.csv", index=False)
informed_random_submission.head()

The informed random submission got a score of -7.63431.
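The draw above is unseeded, so each run produces a different submission. A reproducible variant, sketched here with NumPy's `Generator` API on made-up values, could look as follows. `sample_baseline` is a hypothetical helper, not part of this notebook.

```python
import numpy as np
import pandas as pd


def sample_baseline(train_values: pd.Series, n_rows: int, seed: int = 42) -> np.ndarray:
    """Draw n_rows values with replacement from the observed training values."""
    rng = np.random.default_rng(seed)
    return rng.choice(train_values.to_numpy(), size=n_rows, replace=True)


train_values = pd.Series([0.4, 0.5, 0.6, 0.7, 5.0])  # made-up target values
first = sample_baseline(train_values, n_rows=8)
second = sample_baseline(train_values, n_rows=8)

print(first)
print(np.array_equal(first, second))  # compares two draws made with the same seed
```

Fixing the seed makes the submission file repeatable. The score reported above came from an unseeded run, so a seeded draw would not reproduce it exactly.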

### 2.3 - Median
The median of each target column over the training set is used as a constant prediction for every test row. This is the third baseline.

In [None]:
median_submission = test_dataset.copy(deep=True)
for column in PREDICTION_COLUMNS:
    median_submission[column] = train_dataset[f"{column}_mean"].median()
median_submission.to_csv(SUBMISSIONS_PATH + "/median_submission.csv", index=False)
median_submission.head()

The median submission got a score of -0.07111. This is the best score of the three baseline predictions.
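The three baselines can also be compared locally, without submitting, by holding out part of the training data. The sketch below does this for a single synthetic, skewed target. The series is a stand-in for a real target column, and using R² as the metric is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in for one skewed target column.
target = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=2000), name="X4_mean")
fit_part, holdout = train_test_split(target, test_size=0.25, random_state=0)

# Each baseline only looks at fit_part and is scored on the unseen holdout.
predictions = {
    "mean": np.full(len(holdout), fit_part.mean()),
    "median": np.full(len(holdout), fit_part.median()),
    "informed_random": rng.choice(fit_part.to_numpy(), size=len(holdout)),
}
scores = {name: r2_score(holdout, pred) for name, pred in predictions.items()}

for name, score in scores.items():
    print(f"{name}: {score:.4f}")
```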

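### 2.4 - Gradient Boosting
As a first learned model, we train one XGBoost regressor per target on the remaining feature columns and score each with 3-fold cross-validation (R²) before fitting on the full training set.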

In [None]:
train = pd.read_csv(FOLDER_PATH + "/cleaned/cleaned_train.csv")
test = pd.read_csv(FOLDER_PATH + "/test.csv")

# not worrying about "_sd" columns for now
sd_columns = [col for col in train.columns if col.endswith("_sd")]
train = train.drop(columns=sd_columns)

# our targets
mean_columns = ["X4_mean", "X11_mean", "X18_mean", "X50_mean", "X26_mean", "X3112_mean"]

In [None]:
X_full = train.drop(columns=mean_columns)
Y_full = train[mean_columns]

In [None]:
models = {}

for column in Y_full.columns:

    model = xgb.XGBRegressor(objective="reg:squarederror",
                             n_estimators=150, learning_rate=0.1, max_depth=10)

    print(f"\nDoing cross-validation scoring for {column}...")
    scores = cross_val_score(model, X_full, Y_full[column], cv=KFold(
        n_splits=3, shuffle=True, random_state=42), scoring="r2")
    print(f"R^2 score for {column}: {np.mean(scores)}")

    print(f"Training model for {column}...")
    model.fit(X_full, Y_full[column])
    models[column] = model
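The per-target loop above can be tried end to end on synthetic data. The sketch below swaps in scikit-learn's `GradientBoostingRegressor` for XGBoost so that it runs without extra dependencies, and reports the spread across folds alongside the mean. The data, target names and hyperparameters are illustrative assumptions.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic features with two targets standing in for the trait columns.
X, Y = make_regression(n_samples=300, n_features=5, n_targets=2,
                       noise=5.0, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(5)])
Y = pd.DataFrame(Y, columns=["target_a", "target_b"])

cv = KFold(n_splits=3, shuffle=True, random_state=42)
cv_results = {}
for column in Y.columns:
    # One independent model per target, as in the loop above.
    model = GradientBoostingRegressor(n_estimators=50, max_depth=3, random_state=0)
    fold_scores = cross_val_score(model, X, Y[column], cv=cv, scoring="r2")
    cv_results[column] = (fold_scores.mean(), fold_scores.std())
    print(f"{column}: R^2 = {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```

Reporting the standard deviation next to the mean gives a rough sense of how stable the score is across folds.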

In [None]:
submission = pd.DataFrame({"id": test["id"]})

# Select the training feature columns so the test frame matches what the models were fit on
test_features = test[X_full.columns]

# Predict each trait with its own model; output columns drop the "_mean" suffix
for column in Y_full.columns:
    submission[column.replace("_mean", "")] = models[column].predict(test_features)

submission.to_csv(SUBMISSIONS_PATH + "/gradient_boosting.csv", index=False)