# **Preliminary Round FIT Competition 2024**
---
### **Bismillah Dulu**
University of Brawijaya
- Amira Ghina Nurfansepta
- Shania Edina

## **Data Understanding**
Data understanding involves comprehensively exploring and analyzing a dataset to gain insights into its structure, characteristics, and potential issues. This process helps identify patterns, relationships, and anomalies, which are crucial for informed decision-making in data-driven projects.

### **Features**

- **id** - City or Regency identifier
- **city_or_regency** - Name of City or Regency
- **year** - The year in which the data is recorded
- **total_area** - Area of City or Regency (km²); stored in the CSV as `total_area (km2)`
- **population** - The Number of Residents in One City or Regency
- **densities** - Density Level (Population/km²)
- **traffic_density** - Categories for Traffic Density (Low/Medium/High)
- **green_open_space** - Area of Green Open Space (km²)
- **hdi** - Index of Human Development for Each City or Regency
- **gross_regional_domestic_product** - Total Gross Value Added at Current Prices (Billion Rupiah)
- **total_landfills** - Number of Landfills per City or Regency
- **solid_waste_generated** - The amount of waste each City or Regency generated from various sources for a year (Tens of Tons)
- **happiness_score** - Score to Measure The Level of Happiness for each city or Regency (0 - 100)

### **Import Library**
Install and import the required libraries.

In [None]:
pip install catboost

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

from sklearn.preprocessing import LabelEncoder, PolynomialFeatures, MinMaxScaler, StandardScaler
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV

from catboost import CatBoostRegressor

from sklearn.metrics import mean_squared_error

### **Import Dataset**
Load the training and test datasets and preview their first rows.

In [None]:
df_train = pd.read_csv('/kaggle/input/preliminary-round-fit-competition-2024/train.csv')
df_train.head()

In [None]:
df_test = pd.read_csv('/kaggle/input/preliminary-round-fit-competition-2024/test.csv')
df_test.head()

### **Number of Rows and Columns**
Check the number of rows and columns in each dataset.

In [None]:
print("Number of rows =", df_train.shape[0])
print("Number of columns =", df_train.shape[1])

In [None]:
print("Number of rows =", df_test.shape[0])
print("Number of columns =", df_test.shape[1])

### **Dataset Information**
Inspect the data types and non-null counts of each dataset.

In [None]:
df_train.info()

In [None]:
df_test.info()

Some columns have data types that do not match their contents (numeric values stored as text) and must be corrected. Both datasets also have missing values in some columns.

### **Descriptive Statistics**
Review descriptive statistics for both the numeric and the object columns of each dataset.

In [None]:
df_train.describe()

In [None]:
df_train.describe(include='object')

In [None]:
df_test.describe()

In [None]:
df_test.describe(include='object')

### **Duplicated Data**
Check each dataset for duplicate rows.

In [None]:
print("Number of duplicate data in the dataset:", df_train.duplicated().sum())

In [None]:
print("Number of duplicate data in the dataset:", df_test.duplicated().sum())

Neither dataset contains duplicate rows.

### **Missing Value**

In [None]:
df_train.isna().sum()

In [None]:
df_test.isna().sum()

Both datasets have missing values in the `green_open_space`, `total_landfills`, and `solid_waste_generated` columns. These must be handled before modeling.

## **Data Preprocessing**
Data preprocessing refers to the cleaning, transformation, and preparation of raw data before analysis. It involves tasks such as handling missing data, removing outliers, standardizing or normalizing data, and encoding categorical variables, all aimed at ensuring the data is suitable and reliable for machine learning models or analytical processes.

### **Cleaning Numeric Data**
Strip the comma separators from numeric columns stored as text so that they can be parsed as numbers.

In [None]:
def clean_numeric_data(column):
    return pd.to_numeric(column.astype(str).str.replace(',', ''), errors='coerce')

In [None]:
df_train['total_area (km2)'] = clean_numeric_data(df_train['total_area (km2)'])
df_train['population'] = clean_numeric_data(df_train['population'])
df_train['gross_regional_domestic_product'] = clean_numeric_data(df_train['gross_regional_domestic_product'])
df_train['solid_waste_generated'] = clean_numeric_data(df_train['solid_waste_generated'])

df_train.head()

In [None]:
df_test['total_area (km2)'] = clean_numeric_data(df_test['total_area (km2)'])
df_test['population'] = clean_numeric_data(df_test['population'])
df_test['gross_regional_domestic_product'] = clean_numeric_data(df_test['gross_regional_domestic_product'])
df_test['solid_waste_generated'] = clean_numeric_data(df_test['solid_waste_generated'])

df_test.head()
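
To make the effect of `clean_numeric_data` concrete, the following sketch applies the same logic to a few made-up values (they are not taken from the competition data). Commas are removed before parsing, and anything that still cannot be parsed becomes `NaN` because of `errors='coerce'`.

```python
import pandas as pd

def clean_numeric_data(column):
    # Same logic as the helper above: drop comma separators, then parse as numbers
    return pd.to_numeric(column.astype(str).str.replace(',', ''), errors='coerce')

# Made-up values for illustration only
raw = pd.Series(['1,234', '56', 'n/a', None])
cleaned = clean_numeric_data(raw)

print(cleaned.tolist())
print("Values that could not be parsed:", int(cleaned.isna().sum()))
```

Checking the number of `NaN` values before and after cleaning is a quick way to see whether the coercion silently discarded any entries.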

### **Converting Data Types**
Convert each column to its intended data type.

In [None]:
df_train['total_area (km2)'] = pd.to_numeric(df_train['total_area (km2)'], errors='coerce')
df_train['population'] = pd.to_numeric(df_train['population'], errors='coerce').astype('Int64')
df_train['green_open_space'] = pd.to_numeric(df_train['green_open_space'], errors='coerce')
df_train['gross_regional_domestic_product'] = pd.to_numeric(df_train['gross_regional_domestic_product'], errors='coerce').astype('Int64')
df_train['solid_waste_generated'] = pd.to_numeric(df_train['solid_waste_generated'], errors='coerce')
df_train.info()

In [None]:
df_test['total_area (km2)'] = pd.to_numeric(df_test['total_area (km2)'], errors='coerce')
df_test['population'] = pd.to_numeric(df_test['population'], errors='coerce').astype('Int64')
df_test['green_open_space'] = pd.to_numeric(df_test['green_open_space'], errors='coerce')
df_test['gross_regional_domestic_product'] = pd.to_numeric(df_test['gross_regional_domestic_product'], errors='coerce').astype('Int64')
df_test['solid_waste_generated'] = pd.to_numeric(df_test['solid_waste_generated'], errors='coerce')
df_test.info()

### **Drop Missing Value**
Drop the rows that have missing values in the affected columns. This was tested as an alternative to imputation and is left commented out.

In [None]:
# df_train = df_train.dropna(subset=['green_open_space', 'total_landfills', 'solid_waste_generated'])
# df_train.isna().sum()

In [None]:
# df_test = df_test.dropna(subset=['green_open_space', 'total_landfills', 'solid_waste_generated'])
# df_test.isna().sum()

In our experiments, dropping the rows with missing values gave a worse MSE than imputation, so the cells above are left commented out.

### **Impute Missing Values**
Perform imputation with KNN Imputer for numerical data.

In [None]:
imputer = KNNImputer(n_neighbors=5)
columns_to_impute = ['green_open_space', 'total_landfills', 'solid_waste_generated']
df_train[columns_to_impute] = imputer.fit_transform(df_train[columns_to_impute])
df_train.info()

In [None]:
imputer = KNNImputer(n_neighbors=5)
columns_to_impute = ['green_open_space', 'total_landfills', 'solid_waste_generated']
df_test[columns_to_impute] = imputer.fit_transform(df_test[columns_to_impute])
df_test.info()

In [None]:
# from sklearn.impute import SimpleImputer

# # Initialize SimpleImputer with the 'mean' strategy
# mean_imputer = SimpleImputer(strategy='mean')

# # Columns to impute
# columns_to_impute = ['green_open_space', 'total_landfills', 'solid_waste_generated']

# # Impute the missing values
# df[columns_to_impute] = mean_imputer.fit_transform(df[columns_to_impute])

# # Show the result after imputation
# df.info()


In [None]:
# from sklearn.impute import SimpleImputer

# # Initialize SimpleImputer with the 'median' strategy
# median_imputer = SimpleImputer(strategy='median')

# # Columns to impute
# columns_to_impute = ['green_open_space', 'total_landfills', 'solid_waste_generated']

# # Impute the missing values
# df[columns_to_impute] = median_imputer.fit_transform(df[columns_to_impute])

# # Show the result after imputation
# df.info()


In [None]:
# from sklearn.impute import SimpleImputer

# # Initialize SimpleImputer with the 'most_frequent' strategy for categorical data
# mode_imputer = SimpleImputer(strategy='most_frequent')

# # Columns to impute
# columns_to_impute = ['green_open_space', 'total_landfills', 'solid_waste_generated']

# # Impute the missing values
# df[columns_to_impute] = mode_imputer.fit_transform(df[columns_to_impute])

# # Show the result after imputation
# df.info()


In [None]:
# from sklearn.experimental import enable_iterative_imputer
# from sklearn.impute import IterativeImputer

# # Initialize IterativeImputer
# iterative_imputer = IterativeImputer()

# # Impute the missing values
# df[columns_to_impute] = iterative_imputer.fit_transform(df[columns_to_impute])

# # Show the result after imputation
# df.info()

After several experiments with different values, `n_neighbors=5` gave the best result.
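
The notebook does not show how the candidate `n_neighbors` values were compared. One possible way, sketched below, is to put the imputer and a model in a pipeline and compare cross-validated MSE. The data here is synthetic, `Ridge` is used only as a lightweight stand-in for CatBoost, and the names `X_demo`, `y_demo`, and `results` are hypothetical.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data for illustration only (not the competition data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Knock out roughly 10% of the feature values to simulate missing data
missing_mask = rng.random(X_demo.shape) < 0.1
X_demo[missing_mask] = np.nan

results = {}
for k in (3, 5, 7):
    # The imputer is refit inside each fold, so validation rows never inform the imputation
    pipe = make_pipeline(KNNImputer(n_neighbors=k), Ridge())
    scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring='neg_mean_squared_error')
    results[k] = -scores.mean()

best_k = min(results, key=results.get)
print(results)
print("Best n_neighbors on this synthetic data:", best_k)
```

Wrapping the imputer in a pipeline keeps the comparison honest, because the imputation is learned only from the training part of each fold.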

### **Feature Engineering**
Feature engineering involves creating new features or transforming existing ones from raw data that can enhance the performance of machine learning models. This process aims to extract meaningful information, reduce noise, or improve the representation of data, ultimately improving the model's ability to make accurate predictions or classifications. Examples include creating interaction terms, scaling features, or encoding temporal information.

In [None]:
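def split_id(id_value):
    # Slice the id string into the digit groups described below
    str_id = str(id_value)
    split1 = str_id[:1]    # first digit
    split2 = str_id[1:2]   # second digit
    split3 = str_id[:2]    # first two digits
    split4 = str_id[3:4]   # fourth digit
    return pd.Series([split1, split2, split3, split4])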

In [None]:
df_train[['split1', 'split2','split3','split4']] = df_train['id'].apply(split_id)
df_train[['id', 'split1', 'split2','split3','split4']]
df_train.head()

In [None]:
df_test[['split1', 'split2','split3','split4']] = df_test['id'].apply(split_id)
df_test[['id', 'split1', 'split2','split3','split4']]
df_test.head()

We analyzed the `id` column and found that its digits encode location information:
- split1 is an island
- split2 is the province increment for each island
- split3 is the province
- split4 is the grouping of cities and districts
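
As a worked example of the slicing described in the list above, the sketch below applies the same slices to a single hypothetical id (`3573` is made up for illustration and the labelled index is added here only for readability).

```python
import pandas as pd

def split_id(id_value):
    # Same slices as the notebook's helper, with a labelled index for readability
    s = str(id_value)
    return pd.Series([s[:1], s[1:2], s[:2], s[3:4]],
                     index=['split1', 'split2', 'split3', 'split4'])

parts = split_id(3573)  # hypothetical id, for illustration only
print(parts.to_dict())
# {'split1': '3', 'split2': '5', 'split3': '35', 'split4': '3'}
```

Note that with these slices the third character of the id (index 2) is not captured by any of the four splits.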

In [None]:
poly = PolynomialFeatures(degree=2, include_bias=False)
hdi_poly = poly.fit_transform(df_train[['hdi']])
df_train['hdi_squared'] = hdi_poly[:, 1]

scaler = StandardScaler()
df_train['standardized_hdi'] = scaler.fit_transform(df_train[['hdi']])

df_train['hdi_binned'] = pd.qcut(df_train['hdi'], q=4, labels=['Low', 'Medium-Low', 'Medium-High', 'High'])
df_train['hdi_x_densities'] = df_train['hdi'] * df_train['densities']
df_train['hdi_x_gross'] = df_train['hdi'] * df_train['gross_regional_domestic_product']
df_train['hdi_log'] = np.log(df_train['hdi'] + 1)

df_train['densities_binned'] = pd.qcut(df_train['densities'], q=3, labels=['Low', 'Medium', 'High'])
df_train['density_green_space'] = df_train['green_open_space'] / df_train['total_area (km2)']

df_train['gdp_per_capita'] = df_train['gross_regional_domestic_product'] / df_train['population']
df_train['log_grdp'] = np.log(df_train['gross_regional_domestic_product'] + 1)
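# Difference within each city or regency (assumes rows are ordered by year within each group)
df_train['lagged_gdp_growth'] = df_train.groupby('city_or_regency')['gross_regional_domestic_product'].diff()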

df_train['green_space_per_capita'] = df_train['green_open_space'] / df_train['population']
df_train['waste_per_capita'] = df_train['solid_waste_generated'] / df_train['population']
df_train['sqrt_landfills'] = np.sqrt(df_train['total_landfills'])
df_train['population_log'] = np.log(df_train['population'] + 1)

df_train.head()

We also engineered features from the other columns (polynomial, binned, interaction, log, and per-capita terms) to give the model more information.

### **Label Encoding**
Label encoding is a technique used in feature engineering where categorical data is converted into numerical form. Each unique category is assigned a unique integer, typically starting from 0 or 1 up to the number of distinct categories minus one. It's useful for algorithms that require numerical inputs, but it may not be suitable for categorical variables with no inherent order, as it could introduce unintended relationships.

In [None]:
label_encoder = LabelEncoder()

object_columns = ['city_or_regency', 'traffic_density', 'hdi_binned', 'densities_binned', 'split1', 'split3']
for col in object_columns:
    df_train[col] = label_encoder.fit_transform(df_train[col])

df_train.describe()

In [None]:
label_encoder = LabelEncoder()

object_columns = ['city_or_regency', 'traffic_density', 'split1', 'split3']
for col in object_columns:
    df_test[col] = label_encoder.fit_transform(df_test[col])

df_test.describe()
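
The two cells above fit a separate `LabelEncoder` on the training and test frames, so the same category can receive a different integer in each frame if the sets of values differ. A minimal sketch of one alternative, assuming scikit-learn's `OrdinalEncoder` is acceptable here, is to fit the encoder once on the training data and reuse it on the test data. The frames `train_demo` and `test_demo` are made-up examples.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up frames for illustration only
train_demo = pd.DataFrame({'split3': ['11', '12', '35']})
test_demo = pd.DataFrame({'split3': ['12', '35', '91']})

# Categories unseen during fit are mapped to -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
train_codes = encoder.fit_transform(train_demo[['split3']]).ravel()
test_codes = encoder.transform(test_demo[['split3']]).ravel()

print(train_codes.tolist())  # [0.0, 1.0, 2.0]
print(test_codes.tolist())   # [1.0, 2.0, -1.0]
```

With this approach `'12'` and `'35'` keep the same codes in both frames, and the unseen value `'91'` is flagged explicitly.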

### **Normalization**
Normalization is a preprocessing technique used to rescale numeric data to a common scale, typically between 0 and 1. It ensures that all features contribute equally to the analysis and prevents features with larger numeric ranges from dominating those with smaller ranges. Common normalization techniques include Min-Max scaling and Z-score standardization.

In [None]:
# scaler = MinMaxScaler()

# df['total_area (km2)'] = scaler.fit_transform(df[['total_area (km2)']])
# df['population'] = scaler.fit_transform(df[['population']])
# df['densities'] = scaler.fit_transform(df[['densities']])
# df['green_open_space'] = scaler.fit_transform(df[['green_open_space']])
# df['hdi'] = scaler.fit_transform(df[['hdi']])
# df['gross_regional_domestic_product'] = scaler.fit_transform(df[['gross_regional_domestic_product']])
# df['total_landfills'] = scaler.fit_transform(df[['total_landfills']])
# df['solid_waste_generated'] = scaler.fit_transform(df[['solid_waste_generated']])

# df.describe()

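We tried Min-Max normalization, and the MSE was the same as without it. This is expected for a tree-based model such as CatBoost, whose splits depend on the ordering of feature values rather than on their scale, so normalization is not applied.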
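
The observation that scaling does not change the MSE can be illustrated with a small experiment. The sketch below uses synthetic data and a scikit-learn `DecisionTreeRegressor` as a simple stand-in for CatBoost (an assumption made to keep the example light), and checks whether predictions on held-out rows are the same with and without Min-Max scaling.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic data for illustration only
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1000, size=(300, 2))
y_demo = np.sin(X_demo[:, 0] / 100) + 0.01 * X_demo[:, 1]

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Fit the scaler on the training rows only, then train one tree per representation
scaler = MinMaxScaler().fit(X_tr)
tree_raw = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_tr, y_tr)
tree_scaled = DecisionTreeRegressor(max_depth=4, random_state=0).fit(scaler.transform(X_tr), y_tr)

pred_raw = tree_raw.predict(X_te)
pred_scaled = tree_scaled.predict(scaler.transform(X_te))

same_predictions = np.allclose(pred_raw, pred_scaled)
print("Identical predictions with and without scaling:", same_predictions)
```

Min-Max scaling is a monotonic transformation, so the trees partition the rows in the same way and only the threshold values change.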

### **Feature Selection**
Feature selection is the process of choosing a subset of relevant features (variables, predictors) from a larger set of available features to use in model construction. It aims to improve model performance by reducing overfitting, simplifying interpretation, and decreasing computational cost. Techniques include statistical tests, feature importance from models, and algorithms like Recursive Feature Elimination (RFE).

In [None]:
X = df_train.drop('happiness_score', axis=1)
y = df_train['happiness_score']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
model = CatBoostRegressor(verbose=0)
model.fit(X_train, y_train)

feature_importance = model.get_feature_importance()

plt.figure(figsize=(10, 6))
plt.barh(X_train.columns, feature_importance)
plt.xlabel('Feature Importance')
plt.ylabel('Features')
plt.title('Feature Importance Plot')
plt.show()

We used the feature importances of a CatBoost model for feature selection, but the resulting MSE was still not satisfactory.

In [None]:
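# numeric_only=True skips any remaining object columns (such as split2 and split4)
correlation_matrix = df_train.corr(numeric_only=True)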

plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='Blues')
plt.title('Correlation Matrix')
plt.show()

correlation_with_target = correlation_matrix["happiness_score"].drop("happiness_score")
threshold = 0.1
selected_features = correlation_with_target[abs(correlation_with_target) > threshold].index.tolist()

print(correlation_with_target)
print(f"Selected features: {selected_features}")

We also tried selecting features by their correlation with the target, then tested candidate feature sets one by one by brute force.
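
The brute-force search itself is not shown in the notebook. A minimal sketch of how such a search could look is given below, assuming every subset is scored by cross-validated MSE. The data is synthetic, `LinearRegression` is a stand-in for CatBoost, and the column names `a` to `d` are hypothetical.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data for illustration only: the target depends on 'a' and 'b'
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(150, 4)), columns=['a', 'b', 'c', 'd'])
target = 2 * demo['a'] - demo['b'] + rng.normal(scale=0.1, size=150)

best_mse, best_subset = np.inf, None
for size in range(1, len(demo.columns) + 1):
    for subset in combinations(demo.columns, size):
        # Score this feature subset with 5-fold cross-validated MSE
        scores = cross_val_score(LinearRegression(), demo[list(subset)], target,
                                 cv=5, scoring='neg_mean_squared_error')
        mse = -scores.mean()
        if mse < best_mse:
            best_mse, best_subset = mse, subset

print("Best subset:", best_subset)
print("Cross-validated MSE:", round(best_mse, 4))
```

The number of subsets grows as 2^n, so an exhaustive search like this is only practical for a small number of candidate features.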

## **Exploratory Data Analysis**
Exploratory Data Analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods. It helps uncover patterns, trends, relationships, and anomalies in the data, providing insights that inform further analysis or model building. EDA typically involves tasks such as data visualization, summary statistics, and correlation analysis to understand the nature of the data before applying more complex techniques.

### **Feature and Label Distribution**
Understanding the distribution of features and labels helps identify patterns, trends, and potential anomalies within the data, facilitating better model training and performance.

In [None]:
features = ['year', 'split1', 'split3']
for column in features:
    plt.figure(figsize=(10, 5))
    sns.histplot(df_train[column], kde=True, bins=30)
    plt.title(f'Distribution of {column}')
    plt.xlabel(column)
    plt.ylabel('Frequency')
    plt.show()

plt.figure(figsize=(10, 5))
sns.histplot(df_train['happiness_score'], kde=True, bins=30)
plt.title('Distribution of Happiness Score')
plt.xlabel('Happiness Score')
plt.ylabel('Frequency')
plt.show()

In [None]:
all_features = df_train.columns.tolist()
num_cols = 3
num_rows = (len(all_features) + num_cols - 1) // num_cols

fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, num_rows * 5))
axes = axes.flatten()

for i, column in enumerate(all_features):
    sns.histplot(df_train[column], kde=True, bins=30, ax=axes[i])
    axes[i].set_title(f'Distribution of {column}')
    axes[i].set_xlabel(column)
    axes[i].set_ylabel('Frequency')

for i in range(len(all_features), len(axes)):
    fig.delaxes(axes[i])

plt.tight_layout()
plt.show()


From the visualization, we can observe the data distribution for each feature; some features have a normal distribution, while others are skewed.
- The `year` feature shows an equal distribution of data for 2022 and 2023.
- The `split1` and `split3` features exhibit varied distributions.
- The `happiness_score` is almost normally distributed.

### **Descriptive Statistics**
Descriptive statistics provide summary insights about the central tendency, dispersion, and shape of the data's distribution, enabling quick comprehension of the data's characteristics.

In [None]:
pd.set_option('display.max_columns', None)
df_train.describe(include='all')

### **Feature Correlation**
Analyzing feature correlations helps identify relationships between variables, which can inform feature selection and improve the model's predictive power.

In [None]:
plt.figure(figsize=(12, 8))
correlation_matrix = df_train[features + ['happiness_score']].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix')
plt.show()

The correlation matrix shows that all three features are positively correlated with the happiness score: moderately for `year` (0.35) and weakly for `split1` (0.18) and `split3` (0.17).

### **Outlier Identification**
Detecting outliers is crucial for ensuring data quality, as outliers can significantly skew model training and lead to poor generalization.

In [None]:
for column in features:
    plt.figure(figsize=(10, 5))
    sns.boxplot(x=df_train[column])
    plt.title(f'Boxplot of {column}')
    plt.show()

plt.figure(figsize=(10, 5))
sns.boxplot(x=df_train['happiness_score'])
plt.title('Boxplot of Happiness Score')
plt.show()

Based on the boxplot visualization, there are no outliers in the features `year`, `split1`, and `split3`. However, the `happiness_score` does have a few outliers. 

In [None]:

numeric_features = df_train.select_dtypes(include=[np.number]).columns.tolist()
num_cols = 3
num_rows = (len(numeric_features) + num_cols - 1) // num_cols

fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, num_rows * 5))
axes = axes.flatten()

for i, column in enumerate(numeric_features):
    sns.boxplot(x=df_train[column], ax=axes[i])
    axes[i].set_title(f'Boxplot of {column}')
    axes[i].set_xlabel(column)

for i in range(len(numeric_features), len(axes)):
    fig.delaxes(axes[i])

plt.tight_layout()
plt.show()

Based on the boxplot visualizations, several features exhibit a significant number of outliers.
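
To turn the visual impression into numbers, the outliers can be counted with the same 1.5 × IQR rule that boxplot whiskers use by default. The sketch below is one way to do this; the helper name `count_iqr_outliers` and the sample values are made up for illustration.

```python
import pandas as pd

def count_iqr_outliers(series, k=1.5):
    # Count values outside [Q1 - k*IQR, Q3 + k*IQR]
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((series < lower) | (series > upper)).sum())

# Made-up values for illustration only
demo = pd.Series([10, 11, 12, 12, 13, 14, 15, 40])
print(count_iqr_outliers(demo))  # 1 (only the value 40 lies outside the fences)
```

Applied per column, for example with `df_train[numeric_features].apply(count_iqr_outliers)`, this gives an outlier count for each numeric feature.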

### **Multivariate Visualization**
Multivariate visualizations provide a comprehensive view of the relationships between multiple features simultaneously, aiding in the identification of complex patterns and interactions.

In [None]:
sns.pairplot(df_train[features + ['happiness_score']])
plt.show()

* Year: Data is evenly distributed between 2022 and 2023.
* Split1 and Split3: Both show a varied distribution across different values.
* Happiness Score: Nearly forms a normal distribution, with most scores clustered around the mean.

Relationships:
* `Split1` and `Split3` have a clear, positive relationship.
* `Year` has a distinct separation, especially noticeable in `split1` and `split3`.
* `Happiness Score` shows a widely scattered relationship with the other features.

In [None]:
# # scatter plot
# plt.figure(figsize=(10, 6))
# sns.scatterplot(x=df['green_open_space'], y=df['solid_waste_generated'])
# plt.title('Scatter Plot of Green Open Space vs Solid Waste Generated')
# plt.xlabel('Green Open Space')
# plt.ylabel('Solid Waste Generated')
# plt.show()


In [None]:
# # Plot line plot
# plt.figure(figsize=(10, 6))
# sns.lineplot(data=df, x='year', y='solid_waste_generated')
# plt.title('Line Plot of Solid Waste Generated Over Years')
# plt.xlabel('Year')
# plt.ylabel('Solid Waste Generated')
# plt.show()

In [None]:
# # Plot bar plot
# plt.figure(figsize=(10, 6))
# sns.barplot(x=df['city'], y=df['solid_waste_generated'])
# plt.title('Bar Plot of Solid Waste Generated by City')
# plt.xlabel('City')
# plt.ylabel('Solid Waste Generated')
# plt.xticks(rotation=90)
# plt.show()


In [None]:
# data = df['city'].value_counts()

# # Plot pie chart
# plt.figure(figsize=(10, 6))
# plt.pie(data, labels=data.index, autopct='%1.1f%%', startangle=140)
# plt.title('Pie Chart of City Distribution')
# plt.axis('equal')
# plt.show()


## **Modeling (Regression)**
Modeling in the context of regression involves using statistical techniques to build a predictive model that estimates the relationship between one or more independent variables (predictors) and a dependent variable (target). The goal is to create a function that best fits the data, allowing predictions of the target variable for new data points. Techniques range from simple linear regression to more complex methods like polynomial regression, ridge regression, or machine learning algorithms such as random forests or gradient boosting. Evaluation of regression models typically involves metrics like mean squared error (MSE) or R-squared to assess predictive accuracy.

### **Split Data**
Splitting data refers to dividing a dataset into two or more subsets for different purposes, typically for training and evaluating machine learning models.

In [None]:
X = df_train[['year', 'split3', 'split1']]
y = df_train['happiness_score']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
testing = df_test[['year', 'split3', 'split1']]

After repeatedly trying different feature combinations by brute force, the three features that gave the best MSE are `year`, `split3`, and `split1`.

### **Resampling**
Resampling refers to techniques used to repeatedly draw samples from a dataset to improve statistical inference and model performance. It's particularly useful in scenarios where data is limited or imbalanced.

In [None]:
# def add_gaussian_noise(X, y, mean=0, std=0.1, n_samples=10):
#     X_augmented = []
#     y_augmented = []
#     for _ in range(n_samples):
#         noise = np.random.normal(mean, std, X.shape)
#         X_augmented.append(X + noise)
#         y_augmented.append(y)
#     return np.vstack(X_augmented), np.hstack(y_augmented)

# X_train, y_train = add_gaussian_noise(X_train, y_train)

# print(y_train.shape)

After resampling, the MSE did not improve, so resampling is not applied.

### **Tuning Hyperparameter**
Tuning hyperparameters involves the process of selecting the optimal values for parameters that are not directly learned during model training but rather set before training begins.

In [None]:
# model = CatBoostRegressor(verbose=0)

# param_grid = {
#     'learning_rate': [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.1],
#     'iterations': [500,700, 1000],
# }

# grid_search = GridSearchCV(estimator=model, param_grid=param_grid, scoring='neg_mean_squared_error', cv=5, n_jobs=-1, verbose=1)
# grid_search.fit(X_train, y_train)

# best_params = grid_search.best_params_
# print("Best parameters found: ", best_params)

# best_model = grid_search.best_estimator_
# best_model.fit(X_train, y_train)

# y_pred = best_model.predict(X_test)
# mse = mean_squared_error(y_test, y_pred)
# print("Mean Squared Error on test data: ", mse)

### **Modeling With CatBoost**
CatBoost Regressor is a gradient boosting algorithm well suited to regression on tabular data. It can handle categorical variables natively when they are passed through the `cat_features` parameter, although in this notebook the categorical columns are label-encoded beforehand. CatBoost includes built-in regularization to mitigate overfitting, uses ordered boosting, and supports GPU acceleration for faster training on large datasets. It is generally competitive with other popular boosting libraries such as XGBoost and LightGBM.

In [None]:
model = CatBoostRegressor(iterations=1000,
                          learning_rate=0.02,  
                          loss_function='RMSE',  
                          random_state=0,
                          verbose=0)

cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_squared_error', verbose=100)

cv_mse_scores = -cv_scores

print("Cross-Validation MSE scores for each fold:", cv_mse_scores)
print("Average MSE score:", cv_mse_scores.mean())

We use cross-validation to check whether the model is overfitting.
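
One common way to read the cross-validation result is to compare it with the error on the training data: a training MSE far below the cross-validated MSE suggests overfitting. The sketch below shows this comparison on synthetic data, using scikit-learn's `GradientBoostingRegressor` as a stand-in for CatBoost. The names `X_demo`, `y_demo`, and `model_demo` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

# Synthetic data for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = X_demo[:, 0] + rng.normal(scale=0.5, size=300)

model_demo = GradientBoostingRegressor(random_state=0)

# Error estimated on held-out folds
cv_mse = -cross_val_score(model_demo, X_demo, y_demo, cv=5,
                          scoring='neg_mean_squared_error').mean()

# Error on the same rows the model was trained on
model_demo.fit(X_demo, y_demo)
train_mse = mean_squared_error(y_demo, model_demo.predict(X_demo))

print(f"Train MSE: {train_mse:.3f}")
print(f"Cross-validated MSE: {cv_mse:.3f}")
```

Some gap between the two values is normal. What matters is how large it is and whether the per-fold scores vary strongly.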

In [None]:
model.fit(X_train, y_train, verbose=100)

In [None]:
# from sklearn.ensemble import GradientBoostingRegressor

# model_gbm = GradientBoostingRegressor(n_estimators=1000, learning_rate=0.02, random_state=0)
# model_gbm.fit(X_train, y_train)


In [None]:
# from lightgbm import LGBMRegressor

# model_lgbm = LGBMRegressor(n_estimators=1000, learning_rate=0.02, random_state=0)
# model_lgbm.fit(X_train, y_train)


In [None]:
# from sklearn.ensemble import AdaBoostRegressor

# model_adaboost = AdaBoostRegressor(n_estimators=1000, learning_rate=0.02, random_state=0)
# model_adaboost.fit(X_train, y_train)


In [None]:
# from xgboost import XGBRegressor

# model_xgboost = XGBRegressor(n_estimators=1000, learning_rate=0.02, random_state=0)
# model_xgboost.fit(X_train, y_train)


In [None]:
# from ngboost import NGBRegressor

# model_ngboost = NGBRegressor(n_estimators=1000, learning_rate=0.02, random_state=0)
# model_ngboost.fit(X_train, y_train)


In [None]:
# from sklearn.experimental import enable_hist_gradient_boosting
# from sklearn.ensemble import HistGradientBoostingRegressor

# model_hist_gbm = HistGradientBoostingRegressor(max_iter=1000, learning_rate=0.02, random_state=0)
# model_hist_gbm.fit(X_train, y_train)


In [None]:
# from sklearn.ensemble import RandomForestRegressor

# model_rf = RandomForestRegressor(n_estimators=1000, random_state=0)
# model_rf.fit(X_train, y_train)


In [None]:
# from sklearn.tree import DecisionTreeRegressor

# model_dt = DecisionTreeRegressor(random_state=0)
# model_dt.fit(X_train, y_train)


In [None]:
# from sklearn.neighbors import KNeighborsRegressor

# model_knn = KNeighborsRegressor(n_neighbors=5)
# model_knn.fit(X_train, y_train)


In [None]:
# from sklearn.naive_bayes import GaussianNB

# model_nb = GaussianNB()
# model_nb.fit(X_train, y_train)


In [None]:
# from sklearn.svm import SVR

# model_svm = SVR(kernel='rbf', C=1.0, epsilon=0.1)
# model_svm.fit(X_train, y_train)


In [None]:
# from sklearn.linear_model import LogisticRegression

# model_log_reg = LogisticRegression(random_state=0)
# model_log_reg.fit(X_train, y_train)


In [None]:
# from sklearn.ensemble import StackingRegressor
# from sklearn.linear_model import LinearRegression

# estimators = [
#     ('rf', RandomForestRegressor(n_estimators=1000, random_state=0)),
#     ('gbm', GradientBoostingRegressor(n_estimators=1000, learning_rate=0.02, random_state=0))
# ]

# model_stacking = StackingRegressor(estimators=estimators, final_estimator=LinearRegression())
# model_stacking.fit(X_train, y_train)


In [None]:
# from sklearn.ensemble import VotingRegressor

# estimators = [
#     ('rf', RandomForestRegressor(n_estimators=1000, random_state=0)),
#     ('gbm', GradientBoostingRegressor(n_estimators=1000, learning_rate=0.02, random_state=0))
# ]

# model_voting = VotingRegressor(estimators=estimators)
# model_voting.fit(X_train, y_train)


We also tried several other models and hyperparameter settings: GBM, LightGBM, AdaBoost, XGBoost, NGBoost, HistGBM, Random Forest, Decision Tree, KNN, Naive Bayes, SVM, Logistic Regression, stacking, and voting. CatBoost with the settings above gave the best result. Note that Gaussian Naive Bayes and Logistic Regression are classifiers and are not suited to a continuous target such as `happiness_score`.

In [None]:
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
print(f'MSE: {mse:.3f}')

An MSE of 1.361 on the hold-out split is the best value we managed to get.
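
Because MSE is expressed in squared units, it can help to take its square root (RMSE) to read the error on the same 0 to 100 scale as `happiness_score`. The short calculation below uses only the MSE value reported above.

```python
import math

mse = 1.361  # hold-out MSE reported above
rmse = math.sqrt(mse)
print(f"RMSE: {rmse:.3f}")  # RMSE: 1.167
```

In other words, the predictions are off by a little over one point on the happiness scale in the root-mean-square sense.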

## **Result**
The trained model is applied to the test data to produce the submission file.

In [None]:
prediction = model.predict(testing)

In [None]:
output = pd.DataFrame({'id': df_test['id'], 'happiness_score' : prediction})
output.to_csv('submission.csv', index=False)
output

## **Conclusion**
- `year` has a moderate positive correlation with `happiness_score` (0.35), so it may carry useful signal for prediction. `split1` (0.18) and `split3` (0.17) have weak positive correlations with `happiness_score`, but were still among the most useful features in our experiments.
- `year`, `split1`, and `split3` have no significant outliers, while `happiness_score` has a few.
- CatBoost is the model that obtained the best MSE value compared to other models, with a value of 1.361.
- The feature set currently used is not enough to adequately predict happiness_score. Additional or more relevant features are needed to improve the prediction accuracy of the model.

## **Suggestion**
- Expand the dataset to cover a longer timeframe and incorporate features that are more directly linked to happiness scores, which could potentially improve the model's predictive capabilities.