In this notebook, I use Gaussian Mixture Models (GMMs) to model the relationship between each feature and the final loss, and then use the fitted models to predict the loss for the test set.

Starting with the typical data imports:

In [None]:
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)

# Load the training data, then separate the target column from the features.
X = pd.read_csv('/kaggle/input/tabular-playground-series-aug-2021/train.csv')
Y = X['loss']
X.drop(['id', 'loss'], axis=1, inplace=True)

Here are the first few rows of the training set:

In [None]:
X.head()

Now we can find how the final loss varies with each of the hundred features. I checked the relationship between each feature and the final loss for the first one thousand training samples.

In [None]:
import matplotlib.pyplot as plt
# One panel per feature: feature value on the x-axis, loss on the y-axis.
fig, axs = plt.subplots(10, 10,figsize=[100, 100])
for i in range(10):
    for j in range(10):
        axs[i,j].plot(X[X.columns[(10*i)+j]][0:1000],Y[0:1000])

Interesting! There appear to be definite trends and shapes. I was curious whether a similar relationship holds across the entire dataset, so I selected another subset of a thousand training samples (rows 15000-16000) and checked.

In [None]:
fig, axs = plt.subplots(10, 10,figsize=[100, 100])
for i in range(10):
    for j in range(10):
        axs[i,j].plot(X[X.columns[(10*i)+j]][15000:16000],Y[15000:16000])

Surprisingly (or not surprisingly), the features show a very similar trend and distribution across the dataset with respect to their relationship with the final loss value.
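
As a numeric complement to the visual comparison, one could compute the correlation between each feature and the loss on two different slices and see how far apart the values are. The sketch below runs on synthetic data; `demo_X`, `demo_y`, and `feature_loss_corr` are hypothetical names that do not appear in this notebook, and Pearson correlation is only one possible summary of the shapes seen in the plots.

```python
import numpy as np
import pandas as pd

def feature_loss_corr(features, loss, start, stop):
    """Pearson correlation of every feature column with the loss on rows [start, stop)."""
    return features.iloc[start:stop].corrwith(loss.iloc[start:stop])

# Synthetic stand-in data (an assumption, not the competition data):
# f0 drives the target, f1 and f2 are pure noise.
rng = np.random.default_rng(0)
n = 4000
demo_X = pd.DataFrame(rng.normal(size=(n, 3)), columns=["f0", "f1", "f2"])
demo_y = 2.0 * demo_X["f0"] + rng.normal(size=n)

corr_a = feature_loss_corr(demo_X, demo_y, 0, 1000)
corr_b = feature_loss_corr(demo_X, demo_y, 2000, 3000)

# Largest per-feature difference between the two slices.
max_gap = (corr_a - corr_b).abs().max()
print(pd.DataFrame({"slice_a": corr_a, "slice_b": corr_b}))
print("largest gap:", max_gap)
```

A small gap for every feature would support the impression that the two subsets behave alike.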

Building on this, I tried to fit a Gaussian Mixture Model to the relationship between each feature and the final loss. I found that setting the number of components to the maximum loss value helped in modelling a suitable distribution.

In [None]:
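from sklearn.metrics import mean_squared_error
from sklearn.mixture import GaussianMixture

# Fit one GMM per feature on the first 1000 rows.
# Note: GaussianMixture is unsupervised; scikit-learn ignores the y argument of fit().
n_components = int(Y[0:1000].max())
model_list = []
for col in X.columns[:100]:
    feature = X[col][0:1000].to_numpy().reshape(-1, 1)  # column vector of shape (1000, 1)
    s = GaussianMixture(n_components=n_components, random_state=10).fit(feature, Y[0:1000])
    model_list.append(s)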
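
It helps to be explicit about what these models return. In scikit-learn, `GaussianMixture.fit` accepts a `y` argument only for API consistency and does not use it, and `predict` returns the index of the most likely component for each sample rather than a loss value. The following is a minimal sketch on synthetic data; `feature`, `target_a`, and `target_b` are hypothetical stand-ins, not columns from the competition data.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical stand-in for a single feature column: two well-separated clusters.
feature = np.concatenate([rng.normal(-5, 1, 200), rng.normal(5, 1, 200)]).reshape(-1, 1)

# Two completely different "targets" passed as y.
target_a = rng.integers(0, 40, size=400)
target_b = np.zeros(400)

gmm_a = GaussianMixture(n_components=2, random_state=0).fit(feature, target_a)
gmm_b = GaussianMixture(n_components=2, random_state=0).fit(feature, target_b)

labels_a = gmm_a.predict(feature)
labels_b = gmm_b.predict(feature)

# The fitted models agree because y plays no role in the fit.
same_labels = np.array_equal(labels_a, labels_b)
print("identical predictions:", same_labels)
print("distinct predicted values:", np.unique(labels_a))
```

Read this way, each per-feature prediction is a component label between 0 and `n_components - 1`, and the later scaling step is what brings the summed labels onto the scale of the loss.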

To find the best weight to give the one hundred predictions, I searched for the divisor of the summed predictions that minimizes the root mean squared error (RMSE). The validation set consists of the training rows at indices 1000-2000.

In [None]:
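# Sum the 100 per-feature predictions on the validation rows (1000-2000).
s = np.zeros((1, 1000))
for i in range(100):
    s = s + model_list[i].predict(np.vstack(X[X.columns[i]][1000:2000]))
# Try every divisor from 1 to 999 and record the RMSE for each one.
mis = []
for i in range(1, 1000):
    r = s / i
    # np.sqrt(MSE) avoids the `squared=False` argument, which is deprecated in recent scikit-learn releases.
    rms = np.sqrt(mean_squared_error(Y[1000:2000], r[0]))
    mis.append(rms)
# mis[k] holds the RMSE for divisor k + 1, so add 1 to report the divisor itself.
print(mis.index(min(mis)) + 1)
print(min(mis))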
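
The same divisor search can be written without an explicit Python loop. The sketch below is one possible vectorized version under the same assumptions (integer divisors from 1 to 999, RMSE as the criterion); `best_divisor`, `summed`, and `target` are hypothetical names, and the data is synthetic so that the correct answer is known in advance.

```python
import numpy as np

def best_divisor(summed, target, max_divisor=999):
    """Return (divisor, rmse) for the integer divisor in [1, max_divisor] with the lowest RMSE."""
    divisors = np.arange(1, max_divisor + 1)
    # Row k holds the summed predictions divided by divisors[k].
    scaled = summed[None, :] / divisors[:, None]
    rmse = np.sqrt(np.mean((scaled - target[None, :]) ** 2, axis=1))
    best = int(np.argmin(rmse))
    return int(divisors[best]), float(rmse[best])

# Synthetic check: the summed predictions are exactly 7 times the target,
# so the search should recover a divisor of 7.
rng = np.random.default_rng(0)
target = rng.uniform(0, 40, size=500)
summed = target * 7

divisor, rmse = best_divisor(summed, target)
print(divisor, rmse)
```

Returning the divisor itself, rather than its position in a list, sidesteps any off-by-one confusion between list indices and divisors.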

It appears that dividing by 279 yields the lowest RMSE. Applying this divisor to a larger subset of the training data (rows 3000-13000) also yields a similar RMSE.

In [None]:
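# Repeat the evaluation on 10,000 further training rows with the divisor fixed at 279.
s = np.zeros((1, 10000))
for i in range(100):
    s = s + model_list[i].predict(np.vstack(X[X.columns[i]][3000:13000]))
s = s / 279
# np.sqrt(MSE) avoids the `squared=False` argument, which is deprecated in recent scikit-learn releases.
rms = np.sqrt(mean_squared_error(Y[3000:13000], s[0]))
print(rms)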

Okay, time to implement this idea for the entire training and test data. Let's train 100 GMMs on the entire training set.

In [None]:
import numpy as np
from sklearn.mixture import GaussianMixture
model_list=[]
for i in range(10):
    for j in range(10):
        s=GaussianMixture(n_components=int(max(Y[0:250000])), random_state=5).fit(np.vstack(X[X.columns[(10*i)+j]][0:250000]),np.vstack(Y[0:250000]))
        model_list.append(s)
        print(str(10*i+j)) # Fitting can take a while, so this print statement serves as a progress indicator

Now let's sum up every model's predictions of the test set and divide by 279. We can place the final result into a dataframe 'concated'.

In [None]:
s=np.zeros((1,150000))
X_test=pd.read_csv('/kaggle/input/tabular-playground-series-aug-2021/test.csv')
ids=X_test['id'][0:150000]
X_test.drop(['id'],axis=1,inplace=True)
for i in range(100):
    s=s+model_list[i].predict(np.vstack(X_test[X_test.columns[i]][0:150000]))
    print(i)
s=s/279
ap=pd.DataFrame(s[0])
concated=pd.concat([ids,ap],axis=1)
concated.rename(columns={0:'loss'},inplace=True)

In the final step, we write the dataframe to a CSV file.

In [None]:
concated.to_csv('gaussian_sub.csv',index=False)