# Using data science to predict ground energy level

PS. Sorry for the rather ugly start of the notebook and all those noisy terminal prints; they serve a good purpose, I promise. You will see the hidden beauty in the end.

In [None]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Scikit-Learn modules
from sklearn.preprocessing import MinMaxScaler, StandardScaler, PolynomialFeatures
from sklearn.impute import SimpleImputer
from sklearn.decomposition import KernelPCA, PCA
from sklearn.model_selection import  train_test_split, GridSearchCV
from sklearn.linear_model import Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.pipeline import Pipeline

# others
from skimage.measure import block_reduce

In [None]:
%matplotlib inline
_ = plt.style.use('ggplot')
_ = plt.style.available

In [None]:
molecule_data = pd.read_csv('../input/roboBohr.csv', header=0, dtype=np.float64, usecols=list(range(1,1276))+[1277])
molecule_data.head()

# Visualization

As we can see, the absolute values of the cells vary a lot, and the rightmost values are essentially zero. For that reason, let us normalize the data. Scikit-learn's StandardScaler is a good helper here: it centers each feature at mean 0 and scales it to variance 1.

PS. After playing with the data I learnt that it is padded with zeros. If you want to see the story of how I noticed it, simply comment out the line where I get rid of them and read the struck-through sentences in one of the next cells.
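
A minimal sketch of why the padded zeros matter, using a made-up toy column rather than the real data. It assumes a scikit-learn version (0.20 or newer) where StandardScaler skips NaN when fitting and keeps it as NaN when transforming; the names `col`, `scaler_zeros` and `scaler_nan` are only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy column: three real values followed by two padded zeros (made-up numbers)
col = np.array([[2.0], [4.0], [6.0], [0.0], [0.0]])

# With the zeros left in, they count as data and drag the mean down
scaler_zeros = StandardScaler().fit(col)

# With the zeros marked as missing, only the real values define mean and variance
col_nan = np.where(col == 0, np.nan, col)
scaler_nan = StandardScaler().fit(col_nan)
scaled_nan = scaler_nan.transform(col_nan)  # padded cells stay NaN

print("mean with zeros:", scaler_zeros.mean_[0])
print("mean with NaN:  ", scaler_nan.mean_[0])
print(scaled_nan.ravel())
```

The padded cells have to be filled again before any estimator that cannot handle NaN (PCA, for example) sees the data.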

In [None]:
pre_X = molecule_data.drop(columns=['Eat'])
zero_mask_X = (pre_X==0)
print("{0:.2f} % of cells were actually padded zero"
      .format(100.0 * zero_mask_X.values.flatten().sum() / (pre_X.shape[0]*pre_X.shape[1])))
print("--- --- --- ")
print("Turning them into np.nan")
pre_X[zero_mask_X] = np.nan
print("DONE!")
print("--- --- --- ")
X = StandardScaler().fit_transform(pre_X)
print("Scaling finished, slice of new feature data")
print(X[:8])
print('--- --- --- ')
print('Target values:')
y = molecule_data['Eat'].values
print(y)

In [None]:
# let us learn some stats and info about dataset
print("There are {} entries with {} features".format(X.shape[0], X.shape[1]))
print("--- --- --- ")
print("The statistical information about each feature (column)")
molecule_stats = pd.DataFrame(X).describe()
print(molecule_stats)

Since we scaled the data, the mean values are nearly zero (on the order of 1e-16) and the variance is 1. The max values seem to lie further out on the outlier side than the min values. To investigate further, let us make boxplots of a few feature columns: the first three, three from the middle, and the last three.

In [None]:
feature_indices = [0,1,2,300,500,700,-3,-2,-1]
chosen_features = ([ pd.DataFrame(X[:,i]).dropna().values for i in feature_indices])
### PLOTTING time ###
fig = plt.figure(figsize=(8,8))
ax = fig.add_subplot(111)
ax.set_yscale('linear')
ax.set_xlabel('Value of feature')
ax.set_ylabel('Feature: top-rightmost, bottom-leftmost')
ax.tick_params(bottom=False, left=False, labelleft=False)
# print(chosen_features)
_ = ax.boxplot(chosen_features, showmeans=True,showcaps=True, showfliers=True,vert=False)

We see that the first (leftmost) features follow a sort of gamma distribution. There is a little refresher on distributions in the next cell. ~~Going further right stretches the distribution more into exponential territory. For the rightmost one, one can even argue that there are two groups of values.~~ There are fewer rightmost values, which means fewer outliers.
This can be a little problematic if we use PCA for dimensionality reduction. The main point of PCA is that it keeps the directions of largest variance, but variance is a poor summary of a skewed, gamma-like distribution. Nevertheless, we can still give it a try, since it can still give satisfactory results.
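
To make the "variance is a poor summary" remark a bit more concrete, here is a small sketch on synthetic gamma samples (not the molecule data). The shape and scale values are arbitrary picks, and `sample`, `skewness`, `below` and `above` are illustrative names.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic right-skewed sample; shape and scale are arbitrary choices
sample = rng.gamma(shape=1.5, scale=1.5, size=1000)

mean, median, std = sample.mean(), np.median(sample), sample.std()

# Sample skewness: third standardized moment (0 for a symmetric distribution)
skewness = np.mean(((sample - mean) / std) ** 3)

# A symmetric "mean +/- 2 std" band treats both tails alike,
# but a gamma sample only has outliers on one side
below = int(np.sum(sample < mean - 2 * std))
above = int(np.sum(sample > mean + 2 * std))

print("mean:", mean, "median:", median)
print("skewness:", skewness)
print("points below band:", below, "points above band:", above)
```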


~~> Why does it happen?~~

~~It turns out, after skimming the article associated with the dataset, that because some molecules have fewer than 50 atoms, extra 0s are padded (guess where: to the rightmost features!). That is why the mean is close to zero while there is no **real** zero-valued feature. This also affects StandardScaler. To overcome it, let us turn the 0 values into NaN.~~

In [None]:
np.random.seed(0)
dist_gauss = np.random.normal(size=1000)
dist_gamma = np.random.gamma(shape=1.5, scale=1.5,size=1000)
dist_exp = np.random.exponential(size=1000)

fig = plt.figure(figsize=(10,5))
with plt.xkcd():
    plt.rcParams.update({'font.size':'10'})
    ax_gauss = fig.add_subplot(131)
    ax_gauss.tick_params(left=False, labelleft=False)
    ax_gamma = fig.add_subplot(132)
    ax_gamma.tick_params(left=False, labelleft=False)
    ax_exp = fig.add_subplot(133)
    ax_exp.tick_params(left=False, labelleft=False)
    ax_gauss.hist(dist_gauss,density=True,bins=17)
    ax_gauss.set_title("Normal(Gauss) distribution")
    ax_gamma.hist(dist_gamma,density=True,bins=17)
    ax_gamma.set_title("Gamma distribution")
    ax_exp.hist(dist_exp,density=True,bins=17)
    ax_exp.set_title("Exponential distribution")

    

# Dimensionality reduction

1275 features is an awful lot. Let us reduce them with Principal Component Analysis so that the curse of dimensionality does not bother us.
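
A minimal sketch of how the number of components can be read off the cumulative explained variance instead of eyeballing a plot. It runs on synthetic low-rank data, and the 0.96 threshold is only an illustrative choice; `X_demo`, `pca_demo` and `n_keep` are made-up names.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in: 30 features driven by 3 latent factors plus a little noise
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 30))
X_demo = latent @ mixing + 0.05 * rng.normal(size=(200, 30))

pca_demo = PCA(n_components=10).fit(X_demo)
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the threshold
threshold = 0.96
n_keep = int(np.searchsorted(cumulative, threshold) + 1)

print("cumulative explained variance:", np.round(cumulative, 4))
print("components needed:", n_keep)
```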

In [None]:
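#####################################
### Fill the missing values again ###
#####################################
# PCA cannot handle NaN, so set the padded cells to 0 (the feature mean after scaling)
X[zero_mask_X.values] = 0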

###########
### PCA ###
###########
# try different number of components and see how well they explain variance
N_PCA=50
p = PCA(n_components=N_PCA).fit(X)
ns = list(range(N_PCA))

plt.figure()
plt.plot(ns, p.explained_variance_ratio_,
         'r+', label="Explained variance - single component")
plt.plot(ns, p.explained_variance_ratio_.cumsum(),
         'b*', label="Explained variance - cumulative")
_ = plt.legend()

In [None]:
# from the graph we can see that about 25 components
# are enough to explain 96% of the variance in the data. We can keep them

# Another thing that grinds my gears:
# how well do the new PCs explain the energy levels?
X_reduced = p.transform(X)[:,:25]
plt.style.use('grayscale')
fig = plt.figure(figsize=(10,10))
axs = fig.subplots(5,5,sharex=True,sharey=True)
axs = np.array(axs).flatten()
for ax in axs[[0,5,10,15,20]]:
    ax.set_ylabel("Energy Level")
for i in range(25):
    ax = axs[i]
    ax.scatter(X_reduced[:,i],y,s=0.1, alpha=0.2)
    ax.set_xlabel("PC-{}".format(i+1), labelpad=2)
    ax.tick_params(left=False, bottom=False)

Just as expected, the first few principal components produce more well-defined shapes. The later components essentially "melt" the dataset into a grid where the energy level practically does not depend on the component value at all. Just because everyone does it, let us make a 2D heatmap on our new axes where the color depends on the target value. Also, if you try hard with your imagination, it is possible to see the PC1 and PC2 structures in the image.

In [None]:
plt.style.use('ggplot')
plt.set_cmap("magma")
plt.figure(figsize=(8,8))
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.tick_params(left=False, bottom=False, labelbottom=False, labelleft=False)
_ = plt.scatter(X_reduced[:,0], X_reduced[:,1], c=y,s=5,alpha=0.2)

### KERNEL PCA

I am not comfortable with using PCA as a hammer for everything. In this situation in particular there is (a kind of) reason to reject it. I also want to apply some kernels to the dataset just because it is so cool. Beware of fitting on the whole dataset though, because that is very expensive. Fitting on a random sample of the rows (drawn with replacement, bootstrap style) works fine and is A LOT faster. Just a side note: I use the sigmoid kernel because it seems to work. I have tried rbf, but the results were not good, and cosine gives pretty much the same results as sigmoid.
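
The "fit on a sample, transform everything" idea can be sketched like this on synthetic data. The sizes, `gamma` and the names `X_demo`, `sample_idx` and `kpca` are assumptions chosen to keep the example small, not the settings used in this notebook.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 20))  # synthetic stand-in for the scaled features

# Draw row indices with replacement; only these rows enter the kernel matrix,
# so the eigendecomposition is 100 x 100 instead of 500 x 500
sample_idx = rng.integers(0, X_demo.shape[0], size=100)
kpca = KernelPCA(n_components=5, kernel="sigmoid", gamma=0.01).fit(X_demo[sample_idx])

# Every row can still be projected onto the components learned from the sample
X_demo_reduced = kpca.transform(X_demo)
print(X_demo_reduced.shape)
```

The cost of fitting grows with the size of the sample rather than the size of the dataset, at the price of components that depend on which rows happened to be drawn.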

In [None]:
##################
### KERNEL PCA ###
##################

mask_random = np.random.randint(0,X.shape[0],size=2000)
kp = KernelPCA(n_components=100,kernel='sigmoid', gamma=0.5, max_iter=250).fit(X[mask_random])
print("PCA is trained on kernel")
print('--- --- --- ')
X_reduced = kp.transform(X)[:,:50]
plt.style.use('grayscale')
fig = plt.figure(figsize=(16,8))
axs = fig.subplots(5,10,sharex=True, sharey=True)
axs = np.array(axs).flatten()
for i in range(50):
    ax = axs[i]
    ax.scatter(X_reduced[:,i],y,s=0.1, alpha=0.2)
    ax.tick_params(left=False, bottom=False)

That leaves a lot of variance unaccounted for. But soon we will find out that it is not that important.

In [None]:
plt.style.use('ggplot')
plt.set_cmap("magma")
plt.figure(figsize=(16,8))
plt.subplot(121)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.tick_params(left=False, bottom=False, labelbottom=False, labelleft=False)
_ = plt.scatter(X_reduced[:,0], X_reduced[:,1], c=y,s=5,alpha=0.2)

plt.subplot(122)
plt.xlabel("PC 2")
plt.ylabel("PC 3")
plt.tick_params(left=False, bottom=False, labelbottom=False, labelleft=False)
_ = plt.scatter(X_reduced[:,1], X_reduced[:,2], c=y,s=5,alpha=0.2)

Let us take a step back and look at the hidden structure that was discovered in the dataset simply by applying a kernel.
>**MESMERIZING!**

As we can see, applying the sigmoid kernel does not explain the variance as well as plain PCA (see PC-20 in the plot, which is still trying to explain some variance). However, we do not really need variance, because a gamma distribution is not well summarized by its variance anyway. Judging from the plots, KernelPCA seems like the better way of reduction. Enough pictures, time to actually learn...
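
A side note on the kernel itself: scikit-learn's `sigmoid` kernel is the hyperbolic tangent of a scaled dot product, `tanh(gamma * <x, y> + coef0)`, not the logistic function. The short check below compares the library function with that formula on random toy matrices; `A`, `B` and the parameter values are made up.

```python
import numpy as np
from sklearn.metrics.pairwise import sigmoid_kernel

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 6))  # toy data, 4 samples
B = rng.normal(size=(3, 6))  # toy data, 3 samples

gamma, coef0 = 0.5, 1.0
K_sklearn = sigmoid_kernel(A, B, gamma=gamma, coef0=coef0)

# The same kernel written out by hand
K_manual = np.tanh(gamma * A @ B.T + coef0)

print(np.allclose(K_sklearn, K_manual))
```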

# Let us build

## Experimentation

In this section I will just throw different ML eggs at the data wall and hope one sticks.

For the first try I will be bold and fit a linear model on only 2 principal components. This should give us a good idea of how the model would work. If it goes well, we can build a pipeline, do grid search, cross-validation etc. to optimize the parameters; if not, there are many other things to try, no worries. Just for the baseline, **KNN with 4 PCs (non-kernel) has an R2 score of .96!** This is the number we are trying to beat.
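
Since every number in this section is an R2 score, here is a small reminder of what `.score` returns for scikit-learn regressors: one minus the ratio of the residual sum of squares to the total sum of squares. The sketch uses synthetic data and made-up names (`X_demo`, `r2_manual`, `r2_sklearn`).

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

# First 150 rows for fitting, last 50 rows held out
model = Ridge(alpha=10).fit(X_demo[:150], y_demo[:150])
y_true = y_demo[150:]
y_pred = model.predict(X_demo[150:])

ss_res = np.sum((y_true - y_pred) ** 2)         # what the model gets wrong
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # what "always predict the mean" gets wrong
r2_manual = 1.0 - ss_res / ss_tot

r2_sklearn = model.score(X_demo[150:], y_true)
print(r2_manual, r2_sklearn)
```

A score of 1 is a perfect fit, 0 is no better than predicting the mean, and negative values are worse than that.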

PS. After a little trial and error I decided to give up on pipelining and went on with doing things manually.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_reduced[:,:2], y)

In [None]:
R_clf = Ridge(alpha=10).fit(X_train, y_train)
R_clf.score(X_test, y_test)

In [None]:
# pretty embarrassing, maybe try polynomial (degree 3) regression?
R_clf = Ridge(alpha=100).fit(PolynomialFeatures(degree=3).fit_transform(X_train), y_train)
R_clf.score(PolynomialFeatures(degree=3).fit_transform(X_test), y_test)

In [None]:
# much better but not promising, KNN is still better
# how about a Decision Tree and a few more components
X_train, X_test, y_train, y_test = train_test_split(X_reduced[:,:10], y)

In [None]:
dt_clf = DecisionTreeRegressor(max_depth=25).fit(X_train, y_train)
dt_clf.score(X_test, y_test)

In [None]:
# NICE!!! well, maybe linear models (with poly features) are a good fit too
R_clf = Ridge(alpha=0.05).fit(PolynomialFeatures(degree=3).fit_transform(X_train), y_train)
R_clf.score(PolynomialFeatures(degree=3).fit_transform(X_test), y_test)

In [None]:
# Not that bad, Ridge would probably generalize better, however trees have another trick up their sleeve:
# RANDOM FORESTS, which are basically a way to reduce overfitting and generalize better
forest_reg = RandomForestRegressor(n_estimators=100,max_depth=30).fit(X_train, y_train)
forest_reg.score(X_test, y_test)

In [None]:
# Seems like we are pushing it to the limit and hitting a wall; there is another ensemble model for trees -> Extra Trees (extremely randomized trees)!!!
ex_tree = ExtraTreesRegressor(n_estimators=100, max_depth=22).fit(X_train, y_train)
ex_tree.score(X_test, y_test)

**WOW**, we have been able to increase the R2 score to .993, better than PCA+KNN. We have been able to squeeze a little more juice out of the trees in the forest. Extra Trees have slightly better performance because they take randomness one step further. The score we have, *.993*, is a bit too optimistic though. During experimentation we ignored the fact that our KernelPCA was fitted on points sampled from all of X to construct the new components. This probably resulted in data leakage, because it has already seen the test data. But we already suspect that KernelPCA -> tree ensemble is a good candidate. So let us build the model again, this time fitting KernelPCA on training data only. (The StandardScaler above was still fitted on the full dataset, so a small amount of leakage remains.)
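
For readers wondering what "one step further" means: with default settings, a random forest trains each tree on a bootstrap sample and searches for the best split threshold, while Extra Trees train each tree on the full training set and draw split thresholds at random. The sketch below compares the two on synthetic data; the data, sizes and names are made up, and which model wins will depend on the dataset.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.uniform(-2, 2, size=(600, 5))
y_demo = np.sin(X_demo[:, 0]) + X_demo[:, 1] ** 2 + 0.1 * rng.normal(size=600)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)
extra = ExtraTreesRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

demo_scores = {
    "random forest": forest.score(X_te, y_te),
    "extra trees": extra.score(X_te, y_te),
}
print(demo_scores)

# Default row sampling differs between the two ensembles
print("bootstrap:", forest.bootstrap, extra.bootstrap)
```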

In [None]:
# do not forget to separate validation and test data
X_inter, X_test, y_inter, y_test = train_test_split(X,y,test_size=0.2)
X_train, X_val, y_train, y_val = train_test_split(X_inter,y_inter,test_size=0.2)

We will not create an sklearn Pipeline, because that would make our KernelPCA fit on the full training data, which requires a lot (A LOT) of time. Instead I will chain functions and kernels fitted on a sample, and use them as a pipeline. Wish me good luck.
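
One possible way around the cost problem, sketched under assumptions and not what this notebook actually does: wrap KernelPCA in a small transformer that fits on a random row sample only, so it can sit inside a regular Pipeline. `SubsampledKernelPCA` is a hypothetical helper written for this example, and the data and parameter values are made up.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import KernelPCA
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.pipeline import Pipeline


class SubsampledKernelPCA(TransformerMixin, BaseEstimator):
    """Hypothetical helper: fit KernelPCA on a random sample of the rows."""

    def __init__(self, n_components=10, gamma=0.0005, n_samples=2000, random_state=0):
        self.n_components = n_components
        self.gamma = gamma
        self.n_samples = n_samples
        self.random_state = random_state

    def fit(self, X, y=None):
        X = np.asarray(X)
        rng = np.random.default_rng(self.random_state)
        # Sample row indices with replacement, capped at the number of rows
        size = min(self.n_samples, X.shape[0])
        idx = rng.integers(0, X.shape[0], size=size)
        self.kpca_ = KernelPCA(
            n_components=self.n_components, kernel="sigmoid", gamma=self.gamma
        ).fit(X[idx])
        return self

    def transform(self, X):
        return self.kpca_.transform(np.asarray(X))


# Tiny synthetic run to show the pieces fit together
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 12))
y_demo = X_demo[:, 0] - 2.0 * X_demo[:, 1] + 0.1 * rng.normal(size=300)

pipe = Pipeline([
    ("reduce", SubsampledKernelPCA(n_components=3, gamma=0.01, n_samples=80)),
    ("trees", ExtraTreesRegressor(n_estimators=20, random_state=0)),
])
pipe.fit(X_demo[:240], y_demo[:240])
y_pred = pipe.predict(X_demo[240:])
print(y_pred.shape)
```

Because the sampling happens inside `fit`, tools that refit the pipeline on each training fold would never show the held-out rows to the kernel.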

In [None]:
# KIDDING! I will do everything manually, changing every parameter by hand to get a good score
N_BOOTSRAP = 2000
GAMMA = 0.0005
scores = []
np.random.seed(0)
for i in range(5):
    X_train, X_val, y_train, y_val = train_test_split(X_inter,y_inter,test_size=0.2)
    kp = (KernelPCA(n_components=10, kernel='sigmoid', gamma=GAMMA, max_iter=250)
          .fit(X_train[np.random.randint(0,X_train.shape[0],size=N_BOOTSRAP)]))
    X_train_reduce = kp.transform(X_train)
    X_val_reduce = kp.transform(X_val)
    ex_regr = ExtraTreesRegressor(n_estimators=100, max_depth=25).fit(X_train_reduce, y_train)
    r2_score = ex_regr.score(X_val_reduce, y_val)
    scores.append(r2_score)

In [None]:
print("Scores on 5 different random validation splits:\n{}\n".format(scores))
print("Average {}".format(np.mean(scores)))

## Time to calculate test score

ONE LAST TIME AND WE ARE DONE!

In [None]:
%%time
kp = (KernelPCA(n_components=10, kernel='sigmoid', gamma=0.0005, max_iter=250)
       .fit(X_inter[np.random.randint(0,X_inter.shape[0],size=2000)]))
X_train_reduce = kp.transform(X_inter)
X_test_reduce = kp.transform(X_test)
ex_regr = ExtraTreesRegressor(n_estimators=100, max_depth=25).fit(X_train_reduce, y_inter)

In [None]:
ex_regr.score(X_test_reduce, y_test)

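HAHA!!! I do not know what you call a model that performs better in the test phase than in the validation phase (the opposite of overfitting), but it seems we have created one of those. One likely reason is that the final model was trained on all of X_inter, while each validation run only saw 80% of it; the rest may just be split-to-split noise. And it takes only 20 seconds to train. YAY!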

**Well, we tried a lot today, and the results are a beautiful low-dimensional structure and a good regression learner. I think we can take a rest. Peace!**

If you find this work interesting, feel free to fork it, tweak the parameters, and maybe build a proper pipeline and cross-validation procedure for easier reproducibility.
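
As a possible starting point for that, here is a minimal sketch of a pipeline with k-fold cross-validation. It uses plain PCA instead of KernelPCA to stay cheap, runs on synthetic data, and all names and parameter values are assumptions. The point is that the scaler and the reduction are refitted on the training part of every fold, so nothing leaks from the held-out part.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: 20 features driven by 5 latent factors
latent = rng.normal(size=(300, 5))
X_demo = latent @ rng.normal(size=(5, 20)) + 0.1 * rng.normal(size=(300, 20))
y_demo = latent[:, 0] ** 2 + latent[:, 1] + 0.1 * rng.normal(size=300)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("reduce", PCA(n_components=10)),
    ("trees", ExtraTreesRegressor(n_estimators=50, random_state=0)),
])

# Fixed seeds make the folds, and therefore the scores, repeatable
cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(pipe, X_demo, y_demo, cv=cv)  # R2 per fold

print(cv_scores)
print("mean R2:", cv_scores.mean())
```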