This is a simple notebook that builds a few machine learning models to predict nuclear properties, namely the spin and parity of nuclear states.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import math

# Import lots of tools from sklearn

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# Convert a spin string such as "1/2", "2", or a mixed number like "1 1/2" into a float.

def convert_to_float(frac_str):
    try:
        return float(frac_str)
    except ValueError:
        num, denom = frac_str.split('/')
        try:
            leading, num = num.split(' ')
            whole = float(leading)
        except ValueError:
            whole = 0
        frac = float(num) / float(denom)
        return whole - frac if whole < 0 else whole + frac
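
As a cross-check on the parser above, the standard library's `fractions.Fraction` parses strings such as "1/2" and "2" directly. This is only a sketch: it assumes the spin strings contain no mixed numbers like "1 1/2" (which `Fraction` rejects), and `spin_to_float` is a hypothetical name that is not used elsewhere in the notebook.

```python
from fractions import Fraction

def spin_to_float(spin_str):
    """Parse 'a/b' or an integer string into a float (hypothetical helper)."""
    return float(Fraction(spin_str))

# Made-up example strings in the formats described above
examples = ["1/2", "2", "7/2", "-3/2"]
parsed = [spin_to_float(s) for s in examples]
print(parsed)
```

If the data does contain mixed numbers, the hand-written parser above is the safer choice.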

In [None]:
# Snap a predicted spin to the nearest physically allowed value: a nucleus with even P+N
# should have integer spin, while a nucleus with odd P+N should have half-integer spin.

def get_consistent_spin(row):
    if (row['P'] + row['N']) % 2 == 0:
        return round(row['spin'])
    # Nearest half-integer (0.5, 1.5, ...). This also handles a prediction that is
    # exactly an integer, which would otherwise be returned unchanged.
    return math.floor(row['spin']) + 0.5
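
The parity-of-mass-number rule described above can also be applied to a whole column at once rather than row by row. The sketch below uses a made-up `toy` frame with the same `P`, `N` and `spin` column names; the values are illustrative only, and an exactly integer prediction for an odd-mass nucleus is pushed up to the next half-integer by construction.

```python
import numpy as np
import pandas as pd

# Toy predictions (made-up values): one even-mass and two odd-mass nuclei
toy = pd.DataFrame({"P": [8, 8, 8], "N": [8, 9, 9], "spin": [2.2, 1.4, 2.0]})

even_mass = (toy["P"] + toy["N"]) % 2 == 0
toy["int_spin"] = np.where(
    even_mass,
    np.round(toy["spin"]),        # even P+N: nearest integer
    np.floor(toy["spin"]) + 0.5,  # odd P+N: nearest half-integer
)
print(toy)
```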

In [None]:
# Import the data
nuclear_data = pd.read_csv("../input/nuclear-data/all_nuclei.csv")

# Find the states with unknown spin but known energy and mass. We'll try to predict these spins.
unknown_spin_parity = nuclear_data[nuclear_data['spin_string'].isnull() & nuclear_data['energy_double'].notnull() & nuclear_data['mass'].notnull()]

# And here we keep the fully populated rows (known spin and parity), which we'll use to train.
# .copy() gives an independent frame, so the column assignments below don't raise SettingWithCopyWarning.
clean_nuclear_data = nuclear_data.dropna().copy()

# Add a spin column holding the spin as a float
clean_nuclear_data["spin"] = clean_nuclear_data.spin_string.map(convert_to_float)

# And turn parity into categorical data
label_encoder = LabelEncoder()
clean_nuclear_data['parity'] = label_encoder.fit_transform(clean_nuclear_data['parity'])

OK, that's enough setup. Now to explain the data. We're mostly interested in the question: for a given proton number (P), neutron number (N), and mass of the state (energy + mass), what are the spin and parity of that state?

Let's take a look at the data...

In [None]:
plt.title("Distribution of spins")

# fill=True replaces the shade argument, which is deprecated in seaborn >= 0.11
sns.kdeplot(data=clean_nuclear_data['spin'], fill=True)

In [None]:
plt.title("Spin vs Energy")

sns.scatterplot(x=clean_nuclear_data['spin'], y=clean_nuclear_data['energy_double'])

In the spin vs energy plot, very high spins occur only at particular energies, forming "arms" or "branches" of spin chains. I'm not sure what these are, or whether any nuclear model describes this physical phenomenon.

In [None]:
# Fraction of positive parity states (encoded as 0, assuming the labels are '+' and '-'):
# there are more known positive parity states than negative parity states.

(clean_nuclear_data.parity == 0).mean()
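
Reading the fraction above depends on which integer the label encoder gave to each parity. `LabelEncoder` sorts the labels it sees, and `classes_` records the order. The sketch below uses made-up labels, assuming the parity column holds the strings "+" and "-"; the real column may differ.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up parity labels; the real column's values are an assumption here
toy_parity = pd.Series(["+", "-", "+", "+"])

encoder = LabelEncoder()
encoded = encoder.fit_transform(toy_parity)

# classes_[i] is the original label that was encoded as integer i
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)

# Share of each encoded label
fractions = pd.Series(encoded).value_counts(normalize=True)
print(fractions)
```

Checking `label_encoder.classes_` on the real encoder confirms whether 0 really means positive parity.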

In [None]:
# Set up the target and input variables

y = clean_nuclear_data[['spin','parity']]
X = clean_nuclear_data.drop(['spin_string','parity','spin'], axis=1)

Xp = unknown_spin_parity.drop(['spin_string','parity'], axis=1)

**Model 1: A Decision Tree Regressor**

A simple model, using a decision tree.

In [None]:
# Split data into training and validation 4:1.

X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

In [None]:
# Define a function to get the validation mean absolute error for a Decision Tree Regressor whose maximum number of leaf nodes is max_leaf_nodes

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

In [None]:
# Fit a single tree with max_leaf_nodes=5000 and report its validation MAE
model = DecisionTreeRegressor(max_leaf_nodes=5000, random_state=0)
model.fit(X_train, y_train)
preds_val = model.predict(X_valid)
mae = mean_absolute_error(y_valid, preds_val)
print(mae)

In [None]:
for max_leaf_nodes in [10,100,1000,10000,20000]:
    my_mae = get_mae(max_leaf_nodes,X_train,X_valid, y_train, y_valid)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %.4f" %(max_leaf_nodes, my_mae))

In [None]:
for max_leaf_nodes in [1000,3000,5000,7000]:
    my_mae = get_mae(max_leaf_nodes,X_train,X_valid, y_train, y_valid)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %.4f" %(max_leaf_nodes, my_mae))
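
The scans above are read by eye. The same search can pick its own winner by storing each MAE in a dictionary and taking the minimum. This is a sketch on synthetic data from `make_regression`, not the nuclear dataset; `toy_mae`, `candidates` and `best_size` are illustrative names.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (not the nuclear dataset)
X_toy, y_toy = make_regression(n_samples=500, n_features=4, noise=10.0, random_state=0)
Xt, Xv, yt, yv = train_test_split(X_toy, y_toy, train_size=0.8, random_state=0)

def toy_mae(max_leaf_nodes):
    """Validation MAE of a tree limited to max_leaf_nodes leaves."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(Xt, yt)
    return mean_absolute_error(yv, model.predict(Xv))

candidates = [5, 50, 500]
scores = {n: toy_mae(n) for n in candidates}

# The candidate with the lowest validation MAE
best_size = min(scores, key=scores.get)
print(scores, best_size)
```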

In [None]:
# Looks like the optimal value of max_leaf_nodes is ~5000, so we refit a single tree on all the known data with that value and predict the unknown states.

one_forest_model = DecisionTreeRegressor(max_leaf_nodes=5000,random_state=0)
one_forest_model.fit(X, y)
preds_val = one_forest_model.predict(Xp)

Xp['spin'] = preds_val[:,0]
Xp['parity'] = preds_val[:,1]

In [None]:
# Re-integerise the spin and parity from the continuous predictions

Xp['int_spin'] = Xp.apply(get_consistent_spin,axis=1)
Xp["int_par"] = Xp.parity.map(lambda p: round(p))
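
The rounded parity is still an encoder integer. To read it as a parity label again, the encoder's `inverse_transform` can map it back. The sketch below uses made-up predictions and assumes the labels are "+" and "-"; clipping guards against a regression output that rounds outside the valid codes.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Encoder fitted on the assumed label values
encoder = LabelEncoder().fit(["+", "-"])

# Made-up continuous parity predictions from a regressor
raw_parity = np.array([0.2, 0.7, 1.0])

# Round to the nearest code and keep it inside the valid range
codes = np.clip(np.rint(raw_parity), 0, len(encoder.classes_) - 1).astype(int)
decoded = encoder.inverse_transform(codes)
print(decoded)
```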

In [None]:
# Let's look at the predictions

Xp.head()

**Model 2: A Random Forest Regressor**

In [None]:
# Try a random forest 

forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(X_train, y_train)
spin_preds = forest_model.predict(X_valid)
print(mean_absolute_error(y_valid, spin_preds))

Even without tuning, this has a smaller MAE than our best Decision Tree Regressor (as expected!)
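
A single 80/20 split can give a noisy MAE, so a comparison like this is more convincing when averaged over several folds. The sketch below shows `cross_val_score` with a two-output random forest on synthetic data, not the nuclear dataset; the fold count and `n_estimators=50` are arbitrary choices to keep it quick.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic two-target data standing in for (spin, parity)
X_toy, y_toy = make_regression(n_samples=300, n_features=4, n_targets=2,
                               noise=5.0, random_state=0)

model = RandomForestRegressor(n_estimators=50, random_state=1)

# scikit-learn maximises scores, so MAE is reported negated; flip the sign back
fold_mae = -cross_val_score(model, X_toy, y_toy, cv=5,
                            scoring="neg_mean_absolute_error")
print(fold_mae, fold_mae.mean())
```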

**Model 3: XGBoost**

Finally, let's try a gradient-boosted forest using xgboost. As used here, it predicts a single output, so we focus on spin.

In [None]:
# We start by considering a forest model with just spin as an output.

forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(X_train, y_train.spin)
spin_preds = forest_model.predict(X_valid)
print(mean_absolute_error(y_valid.spin, spin_preds))

In [None]:
from xgboost import XGBRegressor

# Note: recent xgboost releases take early_stopping_rounds in the constructor;
# older releases took it as an argument to fit() instead.
xg_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, n_jobs=4,
                        early_stopping_rounds=5)

xg_model.fit(X_train, y_train.spin,
             eval_set=[(X_valid, y_valid.spin)],
             verbose=False)

predictions = xg_model.predict(X_valid)
print("Mean Absolute Error: " + str(mean_absolute_error(y_valid.spin, predictions)))

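I find that the boosted model does a bit better than the random forest, although the same validation set also drives the early stopping, so its MAE is slightly optimistic.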
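
If both spin and parity are wanted from a single-output booster, one option is scikit-learn's `MultiOutputRegressor`, which fits one copy of the model per target column. The sketch below uses synthetic data and `GradientBoostingRegressor` as a stand-in for `XGBRegressor`, so it runs without xgboost installed; swapping in the xgboost model is an untested assumption.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

# Synthetic two-target data standing in for (spin, parity)
X_toy, y_toy = make_regression(n_samples=200, n_features=4, n_targets=2,
                               noise=5.0, random_state=0)

# One independent booster is fitted per target column
multi_model = MultiOutputRegressor(
    GradientBoostingRegressor(n_estimators=50, random_state=0)
)
multi_model.fit(X_toy, y_toy)

preds = multi_model.predict(X_toy[:5])
print(preds.shape)
```

Each target gets its own model, so the two predictions share no information, unlike the multi-output tree and forest above.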