# Classifying planetary bodies by radius and density

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
planetary_data = pd.read_csv('../input/solar-system-radius-and-density-data/planetary_data.csv', header=0)
planetary_data = planetary_data.set_index('Body', drop=True)

radius = planetary_data['RadiusSI']
density = planetary_data['DensitySI']

plt.figure(figsize=(12, 9))
plt.scatter(density, radius)
plt.title('Variation in planetary radii with density')
plt.xlabel(r'Density ($kg\,m^{-3}$)')
plt.ylabel('Radius ($km$)')
plt.yscale('log')
plt.grid(axis='both', which='both')
plt.show()
plt.close()

## Using scikit-learn and K-means clustering to classify planetary bodies

With the planetary data plotted by density and radius as above, we can attempt to assign each body to one of a small number of categories. We will arrange the data into the following categories:

* Giant planets: this should include Jupiter, Saturn, Uranus and Neptune.
* Terrestrial planets: as well as the four inner planets, we would expect rocky moons like Io and Europa to fit this category.
* Icy moons, for example Ganymede and Titan ("yes, sir, I've been around...").
* Asteroids and comets. This should also include smaller, irregular-shaped moons.

Scikit-learn is a widely used machine learning library for Python and provides a range of tools for this kind of problem. We will not make any prior assumptions about the bodies, so we will not assign initial categories to them. This makes it an unsupervised learning problem, specifically one of clustering. The K-means clustering algorithm is a common choice for problems of this description.

To begin with, we will import the `sklearn.cluster` and `numpy` packages.

In [None]:
import sklearn.cluster as cluster
import numpy as np

We have designated four categories above, so we will create a dict to map numeric categories to colours. This will allow us to distinguish the category of each body in a scatter plot.

In [None]:
colour_dict = {0: 'r', 1: 'g', 2: 'b', 3: 'y'}

The next stage is to train the clustering algorithm. We will not be using this to predict the categories of new samples, so training against the complete dataset is satisfactory here. In this case, the clustering algorithm works best against logarithmic data, so we will make this transformation. The resulting array then needs to be transposed, so that each row describes one body, before it is passed to the `KMeans` class.

In [None]:
# Create the learning algorithm, fitted to the planetary data
log_data = np.array([np.log10(x) for x in [density, radius]]).transpose()
kmeans = cluster.KMeans(n_clusters=len(colour_dict), random_state=0).fit(log_data)

# Determine the clusters using this algorithm and fit this to the planetary data
planetary_data['Cluster'] = kmeans.predict(log_data)
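
To make the clustering step less of a black box, the following is a minimal pure-Python sketch of the idea behind K-means (Lloyd's algorithm), run on log-transformed data. The six data points, the `kmeans` helper and the choice of starting centroids are all assumptions made for illustration; scikit-learn's `KMeans` chooses its starting centroids differently and is far more robust.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate assignment and centroid update steps."""
    centroids = list(centroids)  # work on a copy of the starting centroids
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                centroids[k] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids

# Hypothetical (density in kg m^-3, radius in km) pairs, not taken from the dataset
bodies = [(1300, 70000), (700, 58000), (5500, 6400), (5200, 6100), (1900, 11), (2000, 8)]
log_points = [(math.log10(d), math.log10(r)) for d, r in bodies]

# Start with one centroid in each region we expect to find a group
labels, centroids = kmeans(log_points, [log_points[0], log_points[2], log_points[4]])
print(labels)
```

The cluster numbers that come out depend entirely on the order of the starting centroids, which is why the numbers returned by a clustering run carry no meaning of their own.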

K-means numbers its clusters arbitrarily, so the next step is to give each cluster a meaningful label. We pick one reference body per category (Mercury, Phobos, Mimas and Jupiter) and renumber the clusters so that every body sharing a cluster with a reference body receives that body's category number.

In [None]:
category_labels = {0: 'Terrestrial planet', 1: 'Asteroid/comet', 2: 'Icy moon', 3: 'Giant planet'}
num_category_labels = len(category_labels)

mercury_cluster = planetary_data.loc['Mercury', 'Cluster']
phobos_cluster = planetary_data.loc['Phobos', 'Cluster']
mimas_cluster = planetary_data.loc['Mimas', 'Cluster']
jupiter_cluster = planetary_data.loc['Jupiter', 'Cluster']

# Move each reference body's cluster to a temporary value (target category + offset).
# The offset stops a relabelled cluster colliding with one that has not been processed yet.
# This assumes the four reference bodies fall into four different clusters.
planetary_data.loc[planetary_data['Cluster'] == mercury_cluster, 'Cluster'] = 0 + num_category_labels
planetary_data.loc[planetary_data['Cluster'] == phobos_cluster, 'Cluster'] = 1 + num_category_labels
planetary_data.loc[planetary_data['Cluster'] == mimas_cluster, 'Cluster'] = 2 + num_category_labels
planetary_data.loc[planetary_data['Cluster'] == jupiter_cluster, 'Cluster'] = 3 + num_category_labels

# Reassign the cluster values.
planetary_data['Cluster'] = planetary_data['Cluster'] - num_category_labels
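
A dictionary lookup is one alternative way to express this relabelling, and it needs no temporary offset. The sketch below is an illustration only: the `relabel_clusters` helper and the raw cluster numbers are hypothetical, and it works on plain dicts rather than on the DataFrame.

```python
def relabel_clusters(clusters, reference_bodies):
    """Map raw cluster ids to category ids using one reference body per category."""
    mapping = {clusters[body]: category for body, category in reference_bodies.items()}
    if len(mapping) != len(reference_bodies):
        # Two reference bodies landed in the same cluster, so the mapping is ambiguous
        raise ValueError('Two reference bodies share a cluster')
    return {body: mapping[cluster] for body, cluster in clusters.items()}

# Hypothetical raw output from a clustering run
raw = {'Mercury': 2, 'Earth': 2, 'Phobos': 0, 'Mimas': 3, 'Jupiter': 1, 'Saturn': 1}
references = {'Mercury': 0, 'Phobos': 1, 'Mimas': 2, 'Jupiter': 3}

relabelled = relabel_clusters(raw, references)
print(relabelled)
```

The explicit check makes the failure mode visible: if the clustering puts two reference bodies together, the categories cannot be told apart and the result should not be trusted.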

Once the labels have been created, we can update the table to display these.

In [None]:
def set_categories(data, category_labels):
    data['Category'] = [category_labels[x] for x in data['Cluster']]

set_categories(planetary_data, category_labels)
planetary_data

We can now plot this data. To separate different clusters by colour and label, we iterate over `colour_dict`. The plotting function also accepts an optional `output_pdf` path, so that the plot can be written to a PDF file for future reference.

In [None]:
from matplotlib.backends.backend_pdf import PdfPages

def plot_planetary_data(planetary_data, colour_dict, category_labels, density_column, radius_column, title, output_pdf=None):
    # Always create the figure; only open a PDF when an output path is given
    plt.figure(figsize=(12, 9))
    pp = PdfPages(output_pdf) if output_pdf is not None else None

    plt.title(title)
    plt.xlabel(r'Density ($kg\,m^{-3}$)')
    plt.ylabel('Radius ($km$)')
    plt.grid(axis='both', which='both')
    plt.yscale('log')
    
    for i in range(0, len(colour_dict)):
        subdata = planetary_data.loc[planetary_data['Cluster'] == i]
        plt.scatter(subdata[density_column], subdata[radius_column], color=colour_dict[i], label=category_labels[i])
    
    plt.legend(loc='best')

    if pp is not None:
        pp.savefig()
    plt.show()
    plt.close()
    
    if pp is not None:
        pp.close()

In [None]:
plot_planetary_data(
    planetary_data,
    colour_dict,
    category_labels,
    'DensitySI',
    'RadiusSI',
    'K-means categorisation of planetary data'
)

## Manual classification of planetary data

The data displayed above may give us some hints as to how the bodies should be classified. For example, Ganymede, Callisto and Pluto are currently classed as terrestrial planets, whereas we may feel that these should be regarded as icy moons.

We will reclassify the bodies above according to the following rules:

* Bodies with a radius greater than $10^4\,\mathrm{km}$ will be regarded as giant planets.
* Terrestrial planets will refer to any other body with a radius greater than $10^3\,\mathrm{km}$ and a density greater than $2.5 \times 10^3\,\mathrm{kg\,m^{-3}}$.
* Asteroids and comets will be any body whose radius is less than $300\,\mathrm{km}$. This classes Vesta as an asteroid.
* Everything else will be classed as an icy moon.

The current category labels will be reset. The cluster and categories can then be assigned using the rules above.

In [None]:
category_labels2 = {0: 'Terrestrial planet', 1: 'Asteroid/comet', 2: 'Icy moon', 3: 'Giant planet'}

planetary_data2 = planetary_data.copy()
planetary_data2['Cluster'] = None

# Classify each body using manual rules
planetary_data2.loc[planetary_data2['RadiusSI'] > 1E4, 'Cluster'] = 3
planetary_data2.loc[
    (planetary_data2['RadiusSI'] > 1E3) &
    (planetary_data2['DensitySI'] > 2.5E3) &
    (planetary_data2['Cluster'].isnull()), 'Cluster'] = 0
planetary_data2.loc[planetary_data2['RadiusSI'] < 300, 'Cluster'] = 1
planetary_data2.loc[planetary_data2['Cluster'].isnull(), 'Cluster'] = 2

set_categories(planetary_data2, category_labels2)
planetary_data2
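
The same rules can also be written as a plain function, which makes it easy to spot-check individual bodies. This is a sketch rather than part of the pipeline: the `classify` helper is hypothetical, and the radii and densities below are rounded reference values assumed for illustration, not read from the dataset.

```python
def classify(radius_km, density):
    """Return a category label from radius (km) and density (kg m^-3)."""
    if radius_km > 1e4:
        return 'Giant planet'
    if radius_km > 1e3 and density > 2.5e3:
        return 'Terrestrial planet'
    if radius_km < 300:
        return 'Asteroid/comet'
    return 'Icy moon'

# Approximate (radius in km, density in kg m^-3) values, assumed for illustration
examples = {
    'Jupiter': (69911, 1326),
    'Earth': (6371, 5514),
    'Ganymede': (2634, 1940),
    'Vesta': (263, 3460),
}

results = {body: classify(r, d) for body, (r, d) in examples.items()}
print(results)
```

Ganymede is larger than Mercury but falls below the density threshold, so it is classed as an icy moon, which is the behaviour the rules were designed to produce.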

In [None]:
plot_planetary_data(
    planetary_data2,
    colour_dict,
    category_labels2,
    'DensitySI',
    'RadiusSI',
    'Manual categorisation of planetary data'
)

## Supervised learning using K-nearest neighbour classification

With the data manually assigned to categories, we can train another learning algorithm using a supervised learning technique. The K-nearest neighbour algorithm is a simple choice that suits this purpose: it classifies a body by a majority vote among the labelled bodies closest to it. As with the K-means method used above, we will train using logarithmic data. Because each prediction depends on a body's neighbours rather than on its own label, fitting this model against the input data will likely adjust the results slightly.

In [None]:
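import sklearn.neighbors as neighbors

# Create the learning algorithm, fitted to the planetary data
log_data = np.array([np.log10(x) for x in [density, radius]]).transpose()
# The manually assigned labels are stored as objects, so cast them to integers before fitting
knn = neighbors.KNeighborsClassifier().fit(log_data, planetary_data2['Cluster'].astype(int))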

planetary_data_svc = planetary_data2.copy()
planetary_data_svc['Cluster'] = knn.predict(log_data)

set_categories(planetary_data_svc, category_labels2)
planetary_data_svc
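
To illustrate the K-nearest neighbour idea named in this section, here is a minimal pure-Python sketch of the majority vote. The `knn_predict` helper, the training points and the query bodies are all hypothetical; a library classifier offers distance weighting and efficient neighbour searches that this version lacks.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by a majority vote among its k nearest training points."""
    nearest = sorted(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], query),
    )[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical (density in kg m^-3, radius in km) training bodies with category numbers
bodies = [(1300, 70000), (700, 58000), (5500, 6400), (5200, 6100), (1900, 11), (2000, 8)]
labels = [3, 3, 0, 0, 1, 1]
train = [(math.log10(d), math.log10(r)) for d, r in bodies]

# Classify two made-up bodies in the same log space
large_body = knn_predict(train, labels, (math.log10(1000), math.log10(25000)))
rocky_body = knn_predict(train, labels, (math.log10(5000), math.log10(6000)))
print(large_body, rocky_body)
```

The query must be transformed in exactly the same way as the training data, with the columns in the same order and the same units, otherwise the distances are meaningless.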

In [None]:
plot_planetary_data(
    planetary_data_svc,
    colour_dict,
    category_labels2,
    'DensitySI',
    'RadiusSI',
    'K-NN categorisation of planetary data'
)

In [None]:
import astropy.units as u
import astropy.units.astrophys as ua

We can use the learning algorithm to classify certain exoplanets. We will use the [Open Exoplanet Catalogue](http://openexoplanetcatalogue.com/) as our source data. This can also be retrieved as a CSV file from [Kaggle](https://www.kaggle.com/mrisdal/open-exoplanet-catalogue/download).

In [None]:
# Read the data
oec = pd.read_csv('../input/open-exoplanet-catalogue/oec.csv', usecols=['PlanetIdentifier', 'PlanetaryMassJpt', 'RadiusJpt'])
oec = oec.set_index('PlanetIdentifier', drop=True)

# Clean the data by removing records that do not specify both mass and radius
oec = oec.loc[(oec['PlanetaryMassJpt'].notnull()) & (oec['RadiusJpt'].notnull())].reindex()

# Describe the radius, mass and volume of each body in SI units
oec['RadiusSI'] = [(x*ua.jupiterRad).to(u.m).value for x in oec['RadiusJpt']]
oec['RadiusKm'] = oec['RadiusSI'] * 1E-3
oec['MassSI'] = [(x*ua.jupiterMass).to(u.kg).value for x in oec['PlanetaryMassJpt']]
oec['VolumeSI'] = 4/3 * np.pi * oec['RadiusSI'] ** 3

# Calculate the density of each body
oec['DensitySI'] = oec['MassSI'] / oec['VolumeSI']
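
As a sanity check on the density calculation, the same arithmetic can be carried out by hand for a single body. The constants below are rounded values for Jupiter's mass and equatorial radius that are assumed here for illustration (the notebook obtains its conversion factors from astropy), and `density_si` is a hypothetical helper.

```python
import math

# Assumed rounded values; the notebook takes these from astropy instead
JUPITER_MASS_KG = 1.898e27
JUPITER_RADIUS_M = 7.1492e7

def density_si(mass_jup, radius_jup):
    """Mean density in kg m^-3 of a sphere given in Jupiter masses and radii."""
    mass = mass_jup * JUPITER_MASS_KG
    radius = radius_jup * JUPITER_RADIUS_M
    volume = 4 / 3 * math.pi * radius ** 3
    return mass / volume

jupiter_like = density_si(1.0, 1.0)
compact = density_si(1.0, 0.5)  # same mass, half the radius
print(jupiter_like, compact)
```

Because volume grows with the cube of the radius, halving the radius at fixed mass multiplies the density by eight. This is why modest errors in a measured exoplanet radius can produce implausible densities.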

It is probably not sensible to classify every recorded exoplanet, as some of these are extreme outliers. For example, the density of K2-77b is calculated as $2.74 \times 10^5\,\mathrm{kg\,m^{-3}}$, which is nearly 50 times that of the densest body in our sample data (Earth). We will therefore exclude any exoplanets that are more than twice Earth's density. Similarly, we remove bodies whose densities are less than half of Saturn's, for example Kepler-51c.

In [None]:
earth_density = planetary_data.loc['Earth', 'DensitySI']
saturn_density = planetary_data.loc['Saturn', 'DensitySI']

oec = oec.loc[
    (oec['DensitySI'] <= 2 * earth_density) &
    (oec['DensitySI'] >= saturn_density / 2)
]
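
The effect of these bounds can be checked with a few lines of plain Python. The Earth and Saturn densities below are rounded reference values assumed for illustration (the notebook reads them from the planetary data), and `within_density_range` is a hypothetical helper; the K2-77b figure is the one quoted above.

```python
EARTH_DENSITY = 5514.0   # kg m^-3, assumed rounded value
SATURN_DENSITY = 687.0   # kg m^-3, assumed rounded value

def within_density_range(density):
    """True if the density lies between half of Saturn's and twice Earth's."""
    return SATURN_DENSITY / 2 <= density <= 2 * EARTH_DENSITY

k2_77b_kept = within_density_range(2.74e5)      # density quoted in the text
jupiter_like_kept = within_density_range(1300)  # a typical gas giant density
print(k2_77b_kept, jupiter_like_kept)
```

With these assumed values the accepted range runs from roughly 340 to 11,000 kg m^-3, so K2-77b is dropped while a Jupiter-like body is retained.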

Finally, we can predict and plot the data.

In [None]:
oec_log_data = np.array([np.log10(oec['DensitySI']), np.log10(oec['RadiusKm'])]).transpose()
oec['Cluster'] = knn.predict(oec_log_data)

set_categories(oec, category_labels2)
oec.head()

In [None]:
plot_planetary_data(
    oec,
    colour_dict,
    category_labels2,
    'DensitySI',
    'RadiusKm',
    'K-NN categorisation of exoplanet data'
)