Importing the relevant packages. Note that scikit-learn is used for the modelling in this notebook

In [None]:
import numpy as np
import pandas as pd
import os
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import MinMaxScaler

Loading the data

In [None]:
data=pd.read_csv('/kaggle/input/energy-molecule/roboBohr.csv')

Printing the first few rows of the data to see what we are dealing with

In [None]:
data.head()

Removing two columns, 'Unnamed: 0' and 'pubchem_id', as they serve no purpose in prediction

In [None]:
data.pop('pubchem_id')
data.pop('Unnamed: 0')
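
An equivalent way to remove both columns in a single call is `DataFrame.drop`. The sketch below uses a small made-up frame: only the column names mirror the dataset, and the values are illustrative assumptions rather than rows from roboBohr.csv.

```python
import pandas as pd

# Toy frame with made-up values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "0": [73.5, 73.5, 53.4],
    "pubchem_id": [25004, 25005, 25006],
    "Eat": [-19.0, -10.2, -9.4],
})

# Drop both identifier columns at once; `toy` itself is left unchanged.
toy_clean = toy.drop(columns=["Unnamed: 0", "pubchem_id"])
print(list(toy_clean.columns))
```

Unlike `pop`, `drop` returns a new frame and does not echo the removed column as cell output.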

Printing the first few rows of the data to make sure the columns have been removed

In [None]:
data.head()

Printing information about the dataset. This tells us whether there are any missing or NaN values (there are none in this case)

In [None]:
data.info()
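
With many columns, the `info()` summary can be long, so a direct count of missing values may be easier to read. A minimal sketch on a made-up frame (the names `toy`, `missing_per_column` and `total_missing` are illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame with one deliberately missing value.
toy = pd.DataFrame({"a": [1.0, 2.0, np.nan], "b": [4.0, 5.0, 6.0]})

# Count NaN entries per column, then over the whole frame.
missing_per_column = toy.isna().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print(total_missing)
```

On the real dataset, a total of zero would confirm what `info()` reports.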

Seeing the size of the dataset

In [None]:
data.shape

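Initializing the scaler (MinMaxScaler is used here). Scaling puts all columns on a common range; tree-based models such as random forests are largely insensitive to it, so it is mainly a convenience here. The range is set from -1 to 1 instead of 0 to 1. Note that min-max scaling maps each column's minimum to -1 and its maximum to 1, so the sign of a scaled value reflects its position within the column's range and does not necessarily match the sign of the original value.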

In [None]:
scaler=MinMaxScaler(feature_range=(-1,1))
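
To see what `feature_range=(-1, 1)` does in practice, the sketch below scales four made-up values. The column minimum (-2) maps to -1, the maximum (6) maps to 1, and everything in between is placed linearly, so the original 0 lands at -0.5 and the original 2 lands at 0.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One column of made-up values, shaped (n_samples, 1) as scikit-learn expects.
values = np.array([[-2.0], [0.0], [2.0], [6.0]])

# Linear map: column minimum -> -1, column maximum -> 1.
scaled = MinMaxScaler(feature_range=(-1, 1)).fit_transform(values)
print(scaled.ravel())
```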

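Using the scaler to transform the columns one at a time, since different columns have different ranges of values and each must be scaled to its own minimum and maximum. MinMaxScaler already treats every column independently, so fitting one scaler on the entire dataset would give the same result; the loop simply makes the per-column scaling explicit. Note that the target column 'Eat' is scaled as well.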

In [None]:
for column in data.columns:
    scaler.fit(data[column].values.reshape(-1,1))
    data[column]=scaler.transform(data[column].values.reshape(-1,1))
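
As a sanity check on the loop above, the sketch below compares column-by-column scaling with a single `fit_transform` on the whole frame. The frame is synthetic (random values with deliberately different ranges), and the names `toy`, `looped` and `whole` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Synthetic frame whose two columns have very different ranges.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "small": rng.uniform(0, 1, size=50),
    "large": rng.uniform(-500, 500, size=50),
})

# Column-by-column scaling, mirroring the loop above.
looped = toy.copy()
for column in looped.columns:
    col_scaler = MinMaxScaler(feature_range=(-1, 1))
    looped[column] = col_scaler.fit_transform(looped[[column]]).ravel()

# One scaler fitted on the whole frame; each column is still scaled separately.
whole = pd.DataFrame(
    MinMaxScaler(feature_range=(-1, 1)).fit_transform(toy),
    columns=toy.columns,
    index=toy.index,
)

print(np.allclose(looped.values, whole.values))
```

Both approaches fit the scaler on all rows before the train/test split. A stricter setup, not used in this notebook, would fit the scaler on the training rows only.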

Printing the dataset to make sure the transformations have been applied

In [None]:
data.head()

Shuffling our dataset (frac=1 means every row is sampled, so the entire dataset is kept and only the row order changes)

In [None]:
data=data.sample(frac=1)
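
Because `sample` draws a new random order on every run, results will differ between runs. If reproducibility matters, one option is to pass `random_state`. A minimal sketch on a toy frame; the seed 42 is an arbitrary choice, not something the notebook specifies.

```python
import pandas as pd

toy = pd.DataFrame({"x": range(10)})

# The same seed gives the same shuffled order on every call.
shuffled_a = toy.sample(frac=1, random_state=42).reset_index(drop=True)
shuffled_b = toy.sample(frac=1, random_state=42).reset_index(drop=True)

print(shuffled_a.equals(shuffled_b))
```

`reset_index(drop=True)` is optional; it only renumbers the rows after shuffling.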

Printing our randomized dataset

In [None]:
data.head()

Separating the target column 'Eat' into its own variable; the remaining columns are the input features

In [None]:
y=data.pop('Eat')

Dividing our data into training and test sets (25% of the rows are held out for testing)

In [None]:
X_train,X_test,y_train,y_test=train_test_split(data,y,test_size=0.25)
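
The split can be illustrated on synthetic data. With 100 rows and `test_size=0.25`, 75 rows go to training and 25 to testing. The feature names and the seed below are assumptions made for illustration. `train_test_split` shuffles the rows by default, and `random_state` fixes that shuffle.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic features and target; 'Eat' is reused only as a label name.
rng = np.random.default_rng(0)
features = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f0", "f1", "f2"])
target = pd.Series(rng.normal(size=100), name="Eat")

# Hold out 25% of the rows; random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    features, target, test_size=0.25, random_state=0
)

print(X_tr.shape, X_te.shape)
```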

Making the Random Forest model and training it using our training data

In [None]:
model=RandomForestRegressor(n_estimators=25)
model.fit(X_train,y_train)

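Evaluating the model on our test data. For a regressor, score returns the coefficient of determination (R²) rather than a classification accuracy; a value of 1.0 means a perfect fit.

In [None]:
r2=model.score(X_test,y_test)
print(f'R^2 on test set: {r2:.4f}')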
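
For a `RandomForestRegressor`, `score` returns R², which can be cross-checked against `sklearn.metrics.r2_score` and complemented with an error measure in the target's units. The sketch below runs on synthetic data from `make_regression`, not on the molecule dataset, and the sample size, noise level and seeds are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem standing in for the real data.
X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Same forest size as in the notebook; random_state fixed for repeatability.
forest = RandomForestRegressor(n_estimators=25, random_state=0)
forest.fit(X_tr, y_tr)
pred = forest.predict(X_te)

# R^2 matches forest.score; MAE is in the units of the target.
r2 = r2_score(y_te, pred)
mae = mean_absolute_error(y_te, pred)
print(f"R^2: {r2:.3f}, MAE: {mae:.3f}")
```

If the target was scaled beforehand, as it is in this notebook, the MAE is in scaled units and would need to be inverse-transformed to be read as an energy.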