This very strange transmission is coming from your narrow band radio signal receiver, pointed towards one of the farthest away galaxies. It is early morning, you are sitting in your radio observatory high in the mountains. You are enjoying a quiet time with a cup of coffee and reviewing the data reports from last night, when this strange sound arrived.


Only two steps prevent us from achieving singularity:

1) To understand what makes us better off.
Our elders used the composite index to measure our well-being performance, but this knowledge has disappeared in the sands of time. Use our data and train your model to predict this index with the highest possible level of certainty.

2) To achieve the highest possible level of well-being through optimized allocation of additional energy
We have discovered a star with an unusually high energy of 50000 zillion DSML. We have agreed between ourselves that:
- no single galaxy will consume more than 100 zillion DSML
- at least 10% of the total energy will be consumed by galaxies in need, those with an existence expectancy index below 0.7.



Transmission suddenly ends. You put your notebook and pencil away and start thinking. You really want to help this species optimize their well-being. You open up Python and upload the dataset from the narrowband radio signal receiver. It will be another great day at the observatory today.


The solutions are evaluated on two criteria: the predicted future Index values and the energy allocated from the newly discovered star.

1) Index predictions are evaluated using RMSE metric

2) Energy allocation is also evaluated using RMSE metric and has a set of known factors that need to be taken into account.
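Both criteria are scored with RMSE, the square root of the mean squared difference between predicted and true values. A minimal sketch, assuming plain NumPy arrays; the helper name `rmse` and the sample numbers are illustrative and not part of the competition material:

```python
import numpy as np

def rmse(actuals, preds):
    """Root mean squared error between two equal-length sequences."""
    actuals = np.asarray(actuals, dtype=float)
    preds = np.asarray(preds, dtype=float)
    return float(np.sqrt(np.mean((actuals - preds) ** 2)))

# Illustrative values only, chosen to resemble the scale of the target
example_score = rmse([0.05, 0.06, 0.15], [0.05, 0.08, 0.13])
print(round(example_score, 4))
```

Unlike MSE, RMSE is expressed in the same units as the Index itself, which makes it easier to read against the target's range.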

Every galaxy has a certain limited potential for improvement in the index described by the following function:

  Potential for increase in the Index = -np.log(Index+0.01)+3

Likely index increase dependent on potential for improvement and on extra energy availability is described by the following function:

  Likely increase in the Index = extra energy * Potential for increase in the Index **2 / 1000
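The two formulas above, together with the allocation rules from the transmission, can be sketched in code. This is a minimal illustration under stated assumptions, not the notebook's method: the function names, the five toy galaxies, and the scaled-down budget of 300 are all made up so the result can be checked by hand, and "10% of the total energy" is read as 10% of the full budget. Because the likely increase is linear in the extra energy, the allocation can be posed as a linear program for `scipy.optimize.linprog`.

```python
import numpy as np
from scipy.optimize import linprog

def potential_for_increase(index):
    # Potential for increase in the Index = -log(Index + 0.01) + 3
    return -np.log(np.asarray(index, dtype=float) + 0.01) + 3

def likely_increase(extra_energy, index):
    # Likely increase = extra energy * potential ** 2 / 1000
    return extra_energy * potential_for_increase(index) ** 2 / 1000

# Hypothetical toy inputs (not from the dataset)
pred_index = np.array([0.05, 0.20, 0.10, 0.30, 0.07])
existence_expectancy = np.array([0.90, 0.60, 0.80, 0.65, 0.95])
total_energy = 300.0    # stands in for the 50000 zillion DSML budget
cap_per_galaxy = 100.0  # no galaxy may receive more than this
in_need = existence_expectancy < 0.7

# linprog minimizes, so the gain per unit of energy is negated
gain_per_unit = likely_increase(1.0, pred_index)
A_ub = np.vstack([
    np.ones_like(pred_index),   # total allocated <= total_energy
    -in_need.astype(float),     # energy to galaxies in need >= 10% of budget
])
b_ub = np.array([total_energy, -0.1 * total_energy])

result = linprog(-gain_per_unit, A_ub=A_ub, b_ub=b_ub,
                 bounds=[(0, cap_per_galaxy)] * len(pred_index))
allocation = result.x
print(np.round(allocation, 2))
```

With these toy numbers the solver fills the galaxies with the highest potential first and routes the mandatory 10% share to the in-need galaxy with the larger potential.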

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from category_encoders import CatBoostEncoder
from scipy.optimize import linprog

from xgboost import XGBRegressor
from sklearn.linear_model import LinearRegression

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import RobustScaler, MinMaxScaler
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

import warnings
warnings.filterwarnings('ignore')

The code reads and combines training and testing data from CSV files, ensuring uniqueness by removing duplicates.

In [None]:
test_data = pd.read_csv('test.csv')
train_data = pd.read_csv('train.csv')

test_data.drop_duplicates(keep='first', inplace=True)
train_data.drop_duplicates(keep='first', inplace=True)

df = pd.concat([train_data, test_data], ignore_index=True)

df.head(5)

Unnamed: 0,galactic year,galaxy,existence expectancy index,existence expectancy at birth,Gross income per capita,Income Index,Expected years of education (galactic years),Mean years of education (galactic years),Intergalactic Development Index (IDI),Education Index,...,"Intergalactic Development Index (IDI), female","Intergalactic Development Index (IDI), male",Gender Development Index (GDI),"Intergalactic Development Index (IDI), female, Rank","Intergalactic Development Index (IDI), male, Rank",Adjusted net savings,"Creature Immunodeficiency Disease prevalence, adult (% ages 15-49), total",Private galaxy capital flows (% of GGP),Gender Inequality Index (GII),y
0,990025,Large Magellanic Cloud (LMC),0.628657,63.1252,27109.23431,0.646039,8.240543,,,,...,,,,,,,,,,0.05259
1,990025,Camelopardalis B,0.818082,81.004994,30166.793958,0.852246,10.671823,4.74247,0.833624,0.467873,...,,,,,,19.177926,,22.785018,,0.059868
2,990025,Virgo I,0.659443,59.570534,8441.707353,0.499762,8.840316,5.583973,0.46911,0.363837,...,,,,,,21.151265,6.53402,,,0.050449
3,990025,UGC 8651 (DDO 181),0.555862,52.333293,,,,,,,...,,,,,,,5.912194,,,0.049394
4,990025,Tucana Dwarf,0.991196,81.802464,81033.956906,1.131163,13.800672,13.188907,0.910341,0.918353,...,,,,,,,5.611753,,,0.154247


In [None]:
df.columns

Index(['galactic year', 'galaxy', 'existence expectancy index',
       'existence expectancy at birth', 'Gross income per capita',
       'Income Index', 'Expected years of education (galactic years)',
       'Mean years of education (galactic years)',
       'Intergalactic Development Index (IDI)', 'Education Index',
       'Intergalactic Development Index (IDI), Rank',
       'Population using at least basic drinking-water services (%)',
       'Population using at least basic sanitation services (%)',
       'Gross capital formation (% of GGP)', 'Population, total (millions)',
       'Population, urban (%)',
       'Mortality rate, under-five (per 1,000 live births)',
       'Mortality rate, infant (per 1,000 live births)',
       'Old age dependency ratio (old age (65 and older) per 100 creatures (ages 15-64))',
       'Population, ages 15–64 (millions)',
       'Population, ages 65 and older (millions)',
       'Life expectancy at birth, male (galactic years)',
       'Life expect

In [None]:
df.isnull().sum().sort_values(ascending=True).tail(5)

Intergalactic Development Index (IDI), male, Rank                            3314
Adjusted net savings                                                         3324
Creature Immunodeficiency Disease prevalence, adult (% ages 15-49), total    3332
Private galaxy capital flows (% of GGP)                                      3345
Gender Inequality Index (GII)                                                3382
dtype: int64

This cell lists the columns in which more than half of the values are missing.

In [None]:
# Fraction of missing values per column, sorted ascending
missing_frac = df.isnull().mean().sort_values()
missing_frac[missing_frac > 0.5].index

Index(['Gross capital formation (% of GGP)', 'Population, total (millions)',
       'Population, urban (%)',
       'Mortality rate, under-five (per 1,000 live births)',
       'Mortality rate, infant (per 1,000 live births)',
       'Old age dependency ratio (old age (65 and older) per 100 creatures (ages 15-64))',
       'Population, ages 15–64 (millions)',
       'Population, ages 65 and older (millions)',
       'Life expectancy at birth, male (galactic years)',
       'Life expectancy at birth, female (galactic years)',
       'Population, under age 5 (millions)',
       'Young age (0-14) dependency ratio (per 100 creatures ages 15-64)',
       'Adolescent birth rate (births per 1,000 female creatures ages 15-19)',
       'Mortality rate, male grown up (per 1,000 people)',
       'Mortality rate, female grown up (per 1,000 people)',
       'Employment in agriculture (% of total employment)',
       'Labour force participation rate (% ages 15 and older)',
       'Labour force parti

This line removes columns from the DataFrame where more than half of the values are missing (NaN).

In [None]:
# Fraction of missing values per column; drop every column where it exceeds 0.5
missing_frac = df.isnull().mean()
df.drop(columns=missing_frac[missing_frac > 0.5].index, inplace=True)

df.head(5)

Unnamed: 0,galactic year,galaxy,existence expectancy index,existence expectancy at birth,Gross income per capita,Income Index,Expected years of education (galactic years),Mean years of education (galactic years),Intergalactic Development Index (IDI),Education Index,"Intergalactic Development Index (IDI), Rank",Population using at least basic drinking-water services (%),Population using at least basic sanitation services (%),y
0,990025,Large Magellanic Cloud (LMC),0.628657,63.1252,27109.23431,0.646039,8.240543,,,,,,,0.05259
1,990025,Camelopardalis B,0.818082,81.004994,30166.793958,0.852246,10.671823,4.74247,0.833624,0.467873,152.522198,,,0.059868
2,990025,Virgo I,0.659443,59.570534,8441.707353,0.499762,8.840316,5.583973,0.46911,0.363837,209.813266,,,0.050449
3,990025,UGC 8651 (DDO 181),0.555862,52.333293,,,,,,,,,,0.049394
4,990025,Tucana Dwarf,0.991196,81.802464,81033.956906,1.131163,13.800672,13.188907,0.910341,0.918353,71.885345,,,0.154247


In [None]:
print('Shape of the data:', df.shape, '\n')
print('Nulls in each column:')
df.isnull().sum()

Shape of the data: (4755, 14) 

Nulls in each column:


galactic year                                                     0
galaxy                                                            0
existence expectancy index                                        6
existence expectancy at birth                                     6
Gross income per capita                                          33
Income Index                                                     33
Expected years of education (galactic years)                    138
Mean years of education (galactic years)                        371
Intergalactic Development Index (IDI)                           399
Education Index                                                 399
Intergalactic Development Index (IDI), Rank                     443
Population using at least basic drinking-water services (%)    1854
Population using at least basic sanitation services (%)        1860
y                                                               890
dtype: int64

This cell fills missing values galaxy by galaxy: linear interpolation (at most 5 consecutive missing values), then forward fill, then backward fill. Rows that still contain missing values are dropped afterwards.

In [None]:
def fill_missing(grp):
  # Index by year and drop the grouping column before filling numeric gaps
  res = grp.set_index('galactic year').drop(columns='galaxy', errors='ignore')
  # .ffill()/.bfill() replace the deprecated fillna(method=...) calls
  return res.interpolate(method='linear', limit=5).ffill().bfill()

df = df.groupby('galaxy').apply(fill_missing)

df.reset_index(inplace=True)

df.dropna(inplace=True)
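
To see what the fill chain does, here is a small sketch on made-up data (the `toy` frame and the `fill_group` helper are illustrative and not part of the notebook). Linear interpolation fills interior gaps, forward and backward fill handle the edges, and a galaxy with no observed value at all stays empty, which is the case the final `dropna` removes.

```python
import numpy as np
import pandas as pd

# Hypothetical data: galaxy A has gaps, galaxy B has no observations at all
toy = pd.DataFrame({
    "galaxy": ["A", "A", "A", "A", "B", "B"],
    "galactic year": [1, 2, 3, 4, 1, 2],
    "value": [np.nan, 0.2, np.nan, 0.4, np.nan, np.nan],
})

def fill_group(values):
    # Same chain as the notebook: interpolate, then forward fill, then backward fill
    return values.interpolate(method="linear", limit=5).ffill().bfill()

# Apply the chain separately within each galaxy
toy["filled"] = toy.groupby("galaxy")["value"].transform(fill_group)
print(toy)
```

Filling within each galaxy keeps values from one galaxy from leaking into another, which a plain column-wide fill would not guarantee.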

In [None]:
# Encode galaxy names as numbers derived from the target (fit on all rows, before the split)
target_encoder = CatBoostEncoder().fit(df['galaxy'], df['y'])

df['galaxy'] = target_encoder.transform(df['galaxy'])

y = df['y']
X = df.drop('y', axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
data_array = X_train

q1 = np.percentile(data_array, 25, axis=0)
q3 = np.percentile(data_array, 75, axis=0)

iqr = q3 - q1
threshold = 1.5
# np.where returns (row positions, column positions) for every flagged cell
outlier_indices = np.where((data_array < q1 - threshold*iqr) | (data_array > q3 + threshold*iqr))

print("Outlier Indices:")
for row, col in zip(*outlier_indices):
  print(f"Column {col+1}, Row {row+1}")

In [None]:
scaler = MinMaxScaler()

# Fit the scaler on the training split only, then reuse it for the test split
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

**Simple Linear Regression**

In [None]:
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
linreg_preds = lin_reg.predict(X_test)

def calculate_scores(preds, actuals):
  # scikit-learn metrics take (y_true, y_pred); the order matters for R-squared
  r2 = r2_score(actuals, preds)
  mae = mean_absolute_error(actuals, preds)
  mse = mean_squared_error(actuals, preds)

  result = {'R-squared': round(r2, 3), 'Mean Absolute Error': round(mae, 3), 'Mean Squared Error': round(mse, 3)}
  print('Result of the Model:')
  for key, value in result.items():
      print(f"{key}: {value}")

calculate_scores(linreg_preds, y_test)

Result of the Model:
R-squared: 0.907
Mean Absolute Error: 0.014
Mean Squared Error: 0.001


In [None]:
param_grid = {
    'copy_X': [True, False],
    'fit_intercept': [True, False],
    'n_jobs': [-1, 1, 2],
    'positive': [True, False]
}

linear_reg = LinearRegression()
grid_search = GridSearchCV(linear_reg, param_grid, cv=5, scoring='neg_mean_squared_error')
# Search on the scaled training split only, so the test split stays unseen
grid_search.fit(X_train, y_train)

In [None]:
grid_search.best_params_

{'copy_X': True, 'fit_intercept': False, 'n_jobs': -1, 'positive': False}

**Tuned Linear Regression**

In [None]:
lr_model = LinearRegression(copy_X= True, fit_intercept= False, n_jobs= -1, positive= False)
lr_model.fit(X_train, y_train)
lr_preds = lr_model.predict(X_test)

calculate_scores(lr_preds, y_test)

Result of the Model:
R-squared: 0.9
Mean Absolute Error: 0.015
Mean Squared Error: 0.001


**Simple Random Forest Regression**

In [None]:
rf_model1 = RandomForestRegressor()
rf_model1.fit(X_train, y_train)
rf_preds1 = rf_model1.predict(X_test)

calculate_scores(rf_preds1, y_test)

Result of the Model:
R-squared: 0.949
Mean Absolute Error: 0.009
Mean Squared Error: 0.0


In [None]:
param_grid = {
    'n_estimators': [10, 30, 50],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': [1.0, 'sqrt', 'log2'],  # 1.0 (all features) is what 'auto' meant for regressors; recent scikit-learn rejects 'auto'
    'bootstrap': [True, False]
}

rf_regressor = RandomForestRegressor(random_state=42)
grid_search = GridSearchCV(rf_regressor, param_grid, cv=5, scoring='neg_mean_squared_error')
# Search on the scaled training split only, so the test split stays unseen
grid_search.fit(X_train, y_train)

In [None]:
grid_search.best_params_

{'bootstrap': True,
 'max_depth': None,
 'max_features': 'auto',
 'min_samples_leaf': 4,
 'min_samples_split': 2,
 'n_estimators': 50}

In [None]:
rf_model2 = RandomForestRegressor(bootstrap= True,
                                  max_depth= None,
                                  max_features= 1.0,  # same as the 'auto' found by the search; recent scikit-learn rejects 'auto'
                                  min_samples_leaf= 4,
                                  min_samples_split= 2,
                                  n_estimators= 50)

rf_model2.fit(X_train, y_train)
rf_preds2 = rf_model2.predict(X_test)
calculate_scores(rf_preds2, y_test)

Result of the Model:
R-squared: 0.946
Mean Absolute Error: 0.009
Mean Squared Error: 0.0


**Extra Tree Regressor**

In [None]:
from sklearn.ensemble import ExtraTreesRegressor

param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [None, 10],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

et_regressor = ExtraTreesRegressor()
grid_search = GridSearchCV(estimator=et_regressor, param_grid=param_grid, cv=3, n_jobs=-1)
grid_search.fit(X_train, y_train)

print("Best parameters:", grid_search.best_params_)

Best parameters: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}


In [None]:
best_et_regressor = ExtraTreesRegressor(
    max_depth=None,
    min_samples_leaf=1,
    min_samples_split=2,
    n_estimators=100
)

best_et_regressor.fit(X_train, y_train)
y_pred = best_et_regressor.predict(X_test)
calculate_scores(y_pred, y_test)

Result of the Model:
R-squared: 0.895
Mean Absolute Error: 35.209
Mean Squared Error: 2124.516


**XGBoost**

In [None]:
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5],
    'learning_rate': [0.05, 0.1],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

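# early_stopping_rounds is set on the estimator; recent xgboost releases no longer accept it in fit()
xgb_regressor = XGBRegressor(early_stopping_rounds=10)

grid_search = GridSearchCV(estimator=xgb_regressor, param_grid=param_grid, cv=3, n_jobs=-1)

# Early stopping monitors the test split here; a separate validation split would avoid leaking it into training
grid_search.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)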

print("Best parameters:", grid_search.best_params_)

y_pred = grid_search.best_estimator_.predict(X_test)
calculate_scores(y_pred, y_test)

Best parameters: {'colsample_bytree': 0.8, 'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 200, 'subsample': 0.8}
Result of the Model:
R-squared: 0.983
Mean Absolute Error: 15.945
Mean Squared Error: 439.002


**XGBoost** reports the highest R-squared, **0.983**, against 0.949 for the untuned Random Forest and roughly 0.90 for linear regression and Extra Trees. Its reported MAE (15.945) and MSE (439.002), like those of Extra Trees (35.209 and 2124.516), are several orders of magnitude larger than the errors of the linear and Random Forest models (MAE between 0.009 and 0.015). Since the target values shown earlier are on the order of 0.05 to 0.15, errors of that size suggest these two cells were evaluated on differently scaled data, so MAE and MSE cannot be compared across all models as reported. Judged on R-squared alone, XGBoost is the preferred model; confirming that ranking would require re-running every model on the same split and scoring with RMSE, the competition metric.
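
When models are compared on R-squared alongside MAE and MSE, it helps to remember that R-squared does not depend on the scale of the target, while MAE and MSE do. The sketch below uses synthetic data (not the notebook's) and two hand-written helpers, `r_squared` and `mean_abs_error`, both illustrative names, to show that multiplying targets and predictions by 1000 leaves R-squared unchanged and multiplies MAE by 1000.

```python
import numpy as np

def r_squared(actuals, preds):
    # 1 - residual sum of squares / total sum of squares
    ss_res = np.sum((actuals - preds) ** 2)
    ss_tot = np.sum((actuals - actuals.mean()) ** 2)
    return 1 - ss_res / ss_tot

def mean_abs_error(actuals, preds):
    return np.mean(np.abs(actuals - preds))

rng = np.random.default_rng(0)
actuals = rng.uniform(0.05, 0.15, size=200)        # synthetic target on a small scale
preds = actuals + rng.normal(0, 0.005, size=200)   # synthetic predictions with small noise

scale = 1000.0
r2_small = r_squared(actuals, preds)
mae_small = mean_abs_error(actuals, preds)
r2_large = r_squared(actuals * scale, preds * scale)
mae_large = mean_abs_error(actuals * scale, preds * scale)

print(f"R-squared: {r2_small:.4f} vs {r2_large:.4f}")
print(f"MAE:       {mae_small:.4f} vs {mae_large:.4f}")
```

This is why error metrics are only comparable between models when every model is scored against the same, identically scaled target.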