In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# 1. Beginning, inspiration and improvement

This notebook improves on another notebook, "Top 3% solution: LGBM+Mean", which can be found [here](https://www.kaggle.com/code/abdulravoofshaik/top-3-solution-lgbm-mean). Please check it out and give the author a vote, because he deserves it. I certainly did.

Basically, the author divides the dataset into four parts according to the first index of the *F_x_y* feature names. The *F_4_x* features are the most correlated of all, so it makes sense to focus the imputing effort on that part: they were imputed using the *LGBMRegressor* class with 20k estimators, while the other three parts were imputed using the *mean* strategy of sklearn's *SimpleImputer* class.

This gave me *inspiration* for an *improvement* to the method. The improvement comes from the fact that *SimpleImputer* gives appalling results for this dataset. How do I know? Well:
1. I tried the method;
1. If it were that easy, everybody would have done it.
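
For reference, the *mean* strategy that this notebook replaces can be sketched on a toy Data Frame. This is only an illustration, not the original author's code: the column names and values are made up, and plain pandas `fillna` stands in for sklearn's *SimpleImputer*, which fills with the same per-column means.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real dataset has many more F_x_y columns.
toy = pd.DataFrame({
    'F_1_0': [1.0, np.nan, 3.0, 5.0],
    'F_1_1': [2.0, 4.0, np.nan, 6.0],
})

# Mean strategy: every missing cell gets the mean of its own column,
# no matter what the other columns say about that row.
toy_mean_imputed = toy.fillna(toy.mean())
print(toy_mean_imputed)
```

Because every gap in a column receives the same value, the row-specific information carried by the other features is ignored. A per-column regression model tries to use exactly that information.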

So, why not apply the *LGBMRegressor* to the other three parts as well? If the idea gives superior results for the fourth part, it may do the same for the other three.

Let's see what happens.

# 2. Importing the needed libraries

Here we import the remaining libraries that we need; *numpy* and *pandas* were already imported in the first cell.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from lightgbm import LGBMRegressor
from sklearn.impute import SimpleImputer
from tqdm import tqdm

# 3. Defining imputing function

The imputing function is defined below. It is used for all the imputing in this notebook.

In [None]:
def imputing(data, data_final, n_estimators=500):
    # Impute each column of `data` with a model trained on the remaining columns.
    for column in tqdm(data.columns):
        if data[column].isnull().any():
            # Positional indices of the rows where this column is missing / present
            missing_list = np.where(data[column].isnull())[0]
            no_missing_list = np.where(data[column].notnull())[0]
            train = data.iloc[no_missing_list]
            test = data.iloc[missing_list]
            X = train.drop([column], axis=1)
            y = train[column]
            X_test = test.drop([column], axis=1)
            model = LGBMRegressor(n_estimators=n_estimators, metric='r2', boosting_type='dart')
            model.fit(X=X, y=y)
            y_predict = model.predict(X_test)
            # R^2 on the training rows (optimistic, not a validation score)
            print(model.score(X=X, y=y))
            # Note: depending on the pandas version, this assignment may also write
            # the predictions back into `data`, so later columns can see them.
            data_all = data[column]
            data_all.iloc[missing_list] = y_predict
            data_final = pd.concat([data_final, data_all], axis=1)
        else:
            data_final = pd.concat([data_final, data[column]], axis=1)
    return data_final

The function's parameters are:
1. **data** = the Data Frame that will be imputed;
1. **data_final** = the Data Frame to which the processed columns are appended (pass an empty Data Frame to start); the function returns it with every column of **data**, imputed where needed;
1. **n_estimators** = the number of estimators for *LGBMRegressor*. The default value of 500 was chosen arbitrarily.
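
The core idea of the function above (train on the rows where a column is known, predict the rows where it is missing) can be tried on toy data. The sketch below is not the notebook's method: an ordinary least-squares fit stands in for *LGBMRegressor* so that the example needs only numpy and pandas, and the helper name `impute_column_sketch` and the toy columns are hypothetical.

```python
import numpy as np
import pandas as pd

def impute_column_sketch(frame, column):
    """Fill the NaNs of one column by regressing it on the other columns."""
    is_missing = frame[column].isnull().to_numpy()
    predictors = frame.drop(columns=[column])
    # Simplification for the stand-in model: least squares cannot handle NaN
    # inputs (LightGBM can), so gaps in the predictors get their column mean.
    predictors = predictors.fillna(predictors.mean())
    design = np.column_stack([np.ones(len(frame)), predictors.to_numpy()])
    target = frame[column].to_numpy()
    # "Train" on the rows where the target is known...
    coef, *_ = np.linalg.lstsq(design[~is_missing], target[~is_missing], rcond=None)
    # ...and "predict" on the rows where it is missing.
    filled = target.copy()
    filled[is_missing] = design[is_missing] @ coef
    return pd.Series(filled, index=frame.index, name=column)

rng = np.random.default_rng(0)
a = rng.normal(size=200)
truth = 2.0 * a + 1.0                  # hypothetical exact linear relation
toy = pd.DataFrame({'a': a, 'b': truth.copy()})
toy.loc[::10, 'b'] = np.nan            # hide every tenth value of column b

b_imputed = impute_column_sketch(toy, 'b')
print(np.abs(b_imputed.to_numpy() - truth).max())
```

On this toy relation the hidden values are recovered almost exactly. On real data the quality depends on how strongly the column is related to the others, which is why the correlation structure of each part matters.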

# 4. Reading inputs

Here we read the *.csv* file content, store the column list for further use, then we display different properties of the Data Frame. In the last two lines, we create an empty Data Frame called *data_final*, which will store the values of the imputed columns, as I said above. The *submission* variable will store the final data ready for submission. We read the content of the *sample_submission* file because it's easier to modify that Data Frame than to create a new one.

In [None]:
data = pd.read_csv('../input/tabular-playground-series-jun-2022/data.csv')
data_columns_list = data.columns.tolist()
print(data.head())
print(data.describe())
data.info()  # info() prints its report itself and returns None
data_final = pd.DataFrame()

submission = pd.read_csv('../input/tabular-playground-series-jun-2022/sample_submission.csv', index_col='row-col')

# 5. EDA of features

This plot is taken from the notebook linked above and shows a histogram of every feature.

In [None]:
plt.rcParams['figure.figsize'] = (25, 20)
fig, ax = plt.subplots(9, 9)
fig.text(0.35, 1, 'EDA of Features', {'size': 24})
i = j = 0
for col in data.columns:
    if col not in ['row_id']:
        ax[j, i].hist(data[col], bins=100)
        ax[j, i].set_title(col, {'size': '14', 'weight': 'bold'})
        if i == 8:
            i = 0
            j += 1
        else:
            i += 1
# rcParams only affect axes created after this point, so this styles the later plots
plt.rcParams.update({'axes.facecolor': 'lightgreen'})
plt.show()

# 6. Correlation plots

Again, these plots are taken from the notebook linked above. They show the correlations among the features within each of the four *F_x_y* parts. As the author said, the *F_4_x* part is the most correlated of all.

In [None]:
# Group the feature names by the digit after "F_" (F_1_*, F_2_*, F_3_*, F_4_*)
features = list(data.columns)
F = [[], [], [], [], []]  # index 0 is unused so that F[i] matches part i
for feature in features:
    for i in [1, 2, 3, 4]:
        if feature.split('_')[1] == str(i):
            F[i].append(feature)
df = [[], [], [], [], []]  # df[i] will hold the columns of part i
fig, axs = plt.subplots(nrows=4, ncols=1, figsize=(18, 30))
sns.set(font_scale=0.7)
for i in [1, 2, 3, 4]:
    df[i] = data[F[i]]
    corr = df[i].corr()
    sns.heatmap(corr, ax=axs[i - 1], annot=True)
plt.show()
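
The heatmaps can also be summarised by a single number per part, which makes "most correlated" easier to compare. The sketch below is an illustrative suggestion, not part of the original notebook: the helper `mean_abs_offdiag_corr` and the two toy groups are hypothetical, and the same helper could be applied to each `df[i]`.

```python
import numpy as np
import pandas as pd

def mean_abs_offdiag_corr(frame):
    """Average absolute correlation over all pairs of different columns."""
    corr = frame.corr().to_numpy()
    off_diagonal = ~np.eye(corr.shape[0], dtype=bool)
    return float(np.abs(corr[off_diagonal]).mean())

rng = np.random.default_rng(0)
shared = rng.normal(size=500)
# Hypothetical groups: the 'weak' columns are independent noise,
# the 'strong' columns all share a common component.
weak = pd.DataFrame(rng.normal(size=(500, 3)), columns=['F_1_0', 'F_1_1', 'F_1_2'])
strong = pd.DataFrame(
    {f'F_4_{k}': shared + 0.3 * rng.normal(size=500) for k in range(3)}
)

weak_score = mean_abs_offdiag_corr(weak)
strong_score = mean_abs_offdiag_corr(strong)
print(weak_score, strong_score)
```

A part with a score near zero gives a regression model little to work with, which is consistent with the first three parts needing far fewer estimators than the fourth.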

# 7. Parts estimation

Here we use the imputing function to impute the missing values of the four parts. The only difference between the calls is the number of estimators: 50 for the first and third parts, 1 for the second, and 20,000 for the fourth.

In [None]:
print('First part: ')
df[1]=imputing(df[1],data_final,n_estimators=50)
print('Second part: ')
df[2]=imputing(df[2],data_final,n_estimators=1)
print('Third part: ')
df[3]=imputing(df[3],data_final,n_estimators=50)
print('Fourth part: ')
df[4]=imputing(df[4],data_final,n_estimators=20000)

# 8. Assembling the result

Here we assemble the four parts into the final, imputed Data Frame.

In [None]:
data = pd.concat([df[1], df[2], df[3], df[4]], axis=1)
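
Before filling the submission, it may be worth checking that the assembled Data Frame is complete. The sketch below shows such a check on two hypothetical toy parts; the names `part_a`, `part_b` and `assembled` are made up for the illustration.

```python
import pandas as pd

# Hypothetical imputed parts that share the same row index.
part_a = pd.DataFrame({'F_1_0': [0.1, 0.2], 'F_1_1': [0.3, 0.4]})
part_b = pd.DataFrame({'F_4_0': [1.0, 2.0]})

assembled = pd.concat([part_a, part_b], axis=1)

# Every column should appear once and no NaN should survive the imputation.
assert assembled.columns.is_unique
assert not assembled.isnull().any().any()
print(assembled.shape)
```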

# 9. Filling the submission Data Frame

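Each index of the submission Data Frame has the form *row-col*: a row number and a column name joined by a hyphen. For every such index we look up the imputed value in *data* and store it in the *value* column.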

In [None]:
for i in tqdm(submission.index):
    row = int(i.split('-')[0])
    col = i.split('-')[1]
    submission.loc[i, 'value'] = data.loc[row, col]
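
The loop above performs one `.loc` lookup per submission row, which can be slow for a large submission. A possible vectorised alternative is sketched below on toy stand-ins; `toy_data` and `toy_submission` are hypothetical, and the sketch assumes, like the loop, that the row part of each label matches the index of the data.

```python
import pandas as pd

# Hypothetical stand-ins for the imputed data and the sample submission.
toy_data = pd.DataFrame({'F_1_0': [0.1, 0.2, 0.3], 'F_4_1': [1.0, 2.0, 3.0]})
toy_submission = pd.DataFrame(
    {'value': 0.0},
    index=pd.Index(['0-F_1_0', '2-F_4_1', '1-F_1_0'], name='row-col'),
)

# Split every 'row-col' label once, at the first hyphen.
pairs = [label.split('-', 1) for label in toy_submission.index]
row_pos = toy_data.index.get_indexer([int(row) for row, _ in pairs])
col_pos = toy_data.columns.get_indexer([col for _, col in pairs])

# get_indexer returns -1 for unknown labels, so fail loudly if that happens.
assert (row_pos >= 0).all() and (col_pos >= 0).all()

# One fancy-indexing call replaces the per-cell .loc lookups.
toy_submission['value'] = toy_data.to_numpy()[row_pos, col_pos]
print(toy_submission)
```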

# 10. Building the csv file

Again, the title is self-describing.

In [None]:
submission.to_csv('submission.csv')

# 11. Conclusions

As one can see, this solution yielded better results (a lower score is better): the original notebook scored 0.87724, while this one scores 0.87456. However, a few aspects need to be mentioned:

1. The running time of this notebook is longer than that of the original notebook. The original finished in about 7 hours; this one took about 9 hours.
1. I chose the number of estimators by trial and error. At first, I thought that a higher number of estimators (10k or so) for the first three parts would compensate for the poor correlation inside them. Decreasing the number of estimators led to a score improvement, so I conclude that such a high number actually led to overfitting. I stopped iterating at these values because the remaining score improvements were too marginal (about 0.0001) to justify about 9 hours of waiting per run.
1. Is the score gain worth 2 more hours of computing? I don't know. It depends on the final purpose. For me it means 10 places or so, but for others it may be worth it.
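
One way to make this trial and error cheaper, offered as a suggestion rather than something this notebook did, is to hide some known values, impute them, and measure the error locally before committing to a full run. The sketch below uses hypothetical names (`masked_rmse`, `mean_impute`, `linear_impute`) and a simple linear fit as a stand-in for the real per-column model.

```python
import numpy as np
import pandas as pd

def masked_rmse(frame, column, impute_fn, frac=0.2, seed=0):
    """Hide a fraction of the known values of `column`, impute, and score them."""
    rng = np.random.default_rng(seed)
    known = np.flatnonzero(frame[column].notnull().to_numpy())
    hidden = rng.choice(known, size=int(frac * len(known)), replace=False)
    masked = frame.copy()
    masked.iloc[hidden, masked.columns.get_loc(column)] = np.nan
    imputed = impute_fn(masked)
    error = imputed[column].to_numpy()[hidden] - frame[column].to_numpy()[hidden]
    return float(np.sqrt(np.mean(error ** 2)))

def mean_impute(frame):
    return frame.fillna(frame.mean())

def linear_impute(frame):
    # Stand-in for a per-column regressor: fit b on a where b is known.
    known = frame['b'].notnull()
    slope, intercept = np.polyfit(
        frame.loc[known, 'a'].to_numpy(), frame.loc[known, 'b'].to_numpy(), 1
    )
    out = frame.copy()
    out.loc[~known, 'b'] = slope * out.loc[~known, 'a'] + intercept
    return out

# Hypothetical toy data: column b follows column a closely.
rng = np.random.default_rng(1)
a = rng.normal(size=300)
toy = pd.DataFrame({'a': a, 'b': 3.0 * a + 0.1 * rng.normal(size=300)})

rmse_mean = masked_rmse(toy, 'b', mean_impute)
rmse_linear = masked_rmse(toy, 'b', linear_impute)
print(rmse_mean, rmse_linear)
```

The same helper could compare different numbers of estimators on a subsample of one part, so that overfitting shows up as a rising held-out error instead of a worse leaderboard score.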

This is my improved solution. Thank you for reading this far. I look forward to your comments and questions, and don't forget to vote for the notebook if you liked it.