A quick visual exploration of some of the correlations in this interesting dataset.
------------------------------------------------------------------------

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline 
from matplotlib_venn import venn3 

#ignore some annoying warnings from the seaborn plot
import warnings
warnings.filterwarnings('ignore')

In [None]:
#Import the data
planets_all_obs=pd.read_csv('../input/oec.csv')
print("Number of observations:",planets_all_obs.shape[0] )
print("Number of features:",planets_all_obs.shape[1] )

In [None]:
planets_all_obs.describe()

Lots of interesting data here, but some of the features have quite few observations. It will be important to keep the missing data in mind when examining this dataset.
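One way to make the missing data visible is to count the non-null values per column. The sketch below is only an illustration: `missing_summary` is a hypothetical helper, and the small `demo` frame is made-up data standing in for `planets_all_obs`.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return per-column non-null counts and the fraction of missing rows."""
    summary = pd.DataFrame({
        "non_null": df.notna().sum(),
        "missing_frac": df.isna().mean(),
    })
    # Sparsest columns first, since those limit every later dropna()
    return summary.sort_values("non_null")

# Made-up values, used only to show the shape of the result
demo = pd.DataFrame({
    "PeriodDays": [1.0, 2.0, 3.0, 4.0],
    "SemiMajorAxisAU": [0.1, np.nan, np.nan, 0.4],
})
demo_summary = missing_summary(demo)
print(demo_summary)
```

In the notebook itself the same helper could be called as `missing_summary(planets_all_obs)`.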

Let's select some of the columns and examine the correlations between the features.

In [None]:
selectedColumns=['PlanetaryMassJpt', 'RadiusJpt','PeriodDays', 'SemiMajorAxisAU', 'Eccentricity',
                 'SurfaceTempK','HostStarMassSlrMass', 'HostStarRadiusSlrRad',
                 'HostStarMetallicity','HostStarTempK',]

In [None]:
corrmat = planets_all_obs[selectedColumns].corr()
sns.heatmap(corrmat, vmax=.8, square=True)

Some interesting (but perhaps not unexpected) correlations:

 - Cluster1: PeriodDays, SemiMajorAxisAU, HostStarMassSlrMass
 - Cluster2: SurfaceTempK, HostStarTempK, RadiusJpt, Eccentricity
 - HostStarMassSlrMass vs. HostStarRadiusSlrRad
 - Planetary mass vs. planetary radius

One unexpected correlation:

 - Eccentricity vs. SurfaceTempK
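The clusters above were read off the heatmap by eye. A rough way to cross-check them is to list the strongest pairs in the correlation matrix directly. This is a minimal sketch: `strongest_pairs` is a hypothetical helper and `demo` is made-up data, not values from the dataset.

```python
import numpy as np
import pandas as pd

def strongest_pairs(corr, n=5):
    """List the n feature pairs with the largest absolute correlation."""
    # Keep only the upper triangle (k=1 skips the diagonal) so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    # Sort by magnitude so strong negative correlations are not pushed to the end
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Made-up values: "b" is almost exactly 2 * "a", "c" is unrelated noise
demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.1],
    "c": [4.0, 1.0, 3.0, 2.0],
})
top = strongest_pairs(demo.corr(), n=2)
print(top)
```

In the notebook this could be called as `strongest_pairs(corrmat)`. Keep in mind that each entry of `corrmat` is computed only from the rows where both features are present, so the pairs are not all based on the same observations.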




**Venn diagram of the available data in correlation cluster 1**

In [None]:
# drop NA values for each column, and create a set of the planet identifiers for
# each of the remaining values in each attribute.
periodDays_db=planets_all_obs[['PlanetIdentifier','PeriodDays']].dropna()
periodDays_set=set(periodDays_db['PlanetIdentifier'].values.flatten())

axis_db=planets_all_obs[['PlanetIdentifier','SemiMajorAxisAU']].dropna()
axis_set=set(axis_db['PlanetIdentifier'].values.flatten())

starmass_db=planets_all_obs[['PlanetIdentifier','HostStarMassSlrMass']].dropna()
starmass_set=set(starmass_db['PlanetIdentifier'].values.flatten())

venn3([periodDays_set, axis_set, starmass_set], ('PeriodDays', 'SemiMajorAxisAU', 'HostStarMassSlrMass'))
plt.show()

Not surprisingly, it's the SemiMajorAxisAU feature that really reduces the number of observations with all 3 features available. From the figure we can see that SemiMajorAxisAU has data when the other two do not in only 58+3=61 of the observations.
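The numbers printed on the Venn diagram can be cross-checked with plain Python set operations. The helper below is a hypothetical sketch (its name and the region labels are not part of `matplotlib_venn`), shown on three small made-up sets.

```python
def venn3_region_sizes(a, b, c):
    """Sizes of the seven regions of a three-set Venn diagram."""
    return {
        "a_only": len(a - b - c),
        "b_only": len(b - a - c),
        "c_only": len(c - a - b),
        "a_b_only": len((a & b) - c),
        "a_c_only": len((a & c) - b),
        "b_c_only": len((b & c) - a),
        "all_three": len(a & b & c),
    }

# Made-up identifiers, only to show how the regions are counted
regions = venn3_region_sizes({1, 2, 3, 4}, {3, 4, 5}, {4, 5, 6})
print(regions)
```

With the sets built above, `venn3_region_sizes(periodDays_set, axis_set, starmass_set)` should give the same counts as the figure.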

In [None]:
# take a closer look at the correlation of the features in cluster1
selected_cluster1=['PeriodDays', 'SemiMajorAxisAU','HostStarMassSlrMass']
cluster1_noNA=planets_all_obs[selected_cluster1].dropna() #remove the NA values

print("Number of observations: ",cluster1_noNA.shape[0])

In [None]:
sns.pairplot(cluster1_noNA)

In [None]:
# take a closer look at the correlation of the features in cluster2
selected_cluster2=['SurfaceTempK', 'HostStarTempK', 'Eccentricity','RadiusJpt']
cluster2_noNA=planets_all_obs[selected_cluster2].dropna() #remove the NA values

print("Number of observations: ",cluster2_noNA.shape[0])

In [None]:
sns.pairplot(cluster2_noNA)

In [None]:
#Let's examine the relationship between the orbital period and the semi-major axis of the orbit.
selectedColumns2=['SemiMajorAxisAU','PeriodDays'] #make a list of the columns we want to examine
planets_noNA=planets_all_obs[selectedColumns2].dropna() #remove the NA values

#add a new column with years instead of days, just to make it easier to compare with the Earth
#The distance is measured in AU (1 AU = mean distance from the Earth to the Sun)
planets_noNA['PeriodYears']=planets_noNA['PeriodDays']/365.25
print("Number of observations: ",planets_noNA.shape[0])

In [None]:
sns.set(style="darkgrid", color_codes=True)
# recent seaborn releases need x/y passed as keywords and use height instead of size
g = sns.jointplot(x="SemiMajorAxisAU", y="PeriodYears", data=planets_noNA, kind="reg", xlim=(0, 200), ylim=(0, 1000), color="b", height=7)

As expected, the period of the planet is highly correlated with the distance from its star.
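The relationship is expected because of Kepler's third law: for a planet much lighter than its star, T² = a³ / M with T in years, a in AU and M in solar masses. That is a power law rather than a straight line, so a fit in log-log space is a more natural check than a linear regression. The sketch below assumes nothing about the dataset: `loglog_slope` is a hypothetical helper, and the orbits are synthetic ones generated for a one-solar-mass star.

```python
import numpy as np

def loglog_slope(x, y):
    """Least-squares slope and intercept of log10(y) against log10(x)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = (x > 0) & (y > 0)  # logarithms need positive values
    slope, intercept = np.polyfit(np.log10(x[keep]), np.log10(y[keep]), 1)
    return slope, intercept

# Synthetic orbits around a one-solar-mass star: T[years] = a[AU] ** 1.5
a_au = np.array([0.05, 0.4, 1.0, 5.2, 30.0])
slope, intercept = loglog_slope(a_au, a_au ** 1.5)
print(slope, intercept)
```

Applied to `planets_noNA['SemiMajorAxisAU']` and `planets_noNA['PeriodYears']`, a slope near 1.5 would be consistent with Kepler's third law. Scatter around that line is expected, since the host star masses differ.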

In [None]:
#Let's examine the relationship between the surface temperature and the eccentricity
selectedColumns2=['SurfaceTempK','Eccentricity'] #make a list of the columns we want to examine
planets_noNA2=planets_all_obs[selectedColumns2].dropna() #remove the NA values
print('Number of observations: ', planets_noNA2.shape[0])

In [None]:
sns.set(style="darkgrid", color_codes=True)
# recent seaborn releases need x/y passed as keywords and use height instead of size
g = sns.jointplot(x="Eccentricity", y="SurfaceTempK", data=planets_noNA2, kind="reg", xlim=(0, 1.2), ylim=(0, 4000), color="b", height=7)

Only 578 of the observations had both Eccentricity and SurfaceTempK. From this plot, it's hard to say that there is a useful relationship between the two variables.

In [None]:
selectedColumns4=['SemiMajorAxisAU', 'SurfaceTempK','HostStarTempK']
selectedColumns5=['PeriodDays', 'SurfaceTempK','HostStarTempK']
planets_noNA4=planets_all_obs[selectedColumns4].dropna() #remove the NA values
planets_noNA5=planets_all_obs[selectedColumns5].dropna() #remove the NA values
print('Number of observations using SemiMajorAxisAU: ', planets_noNA4.shape[0])
print('Number of observations using PeriodDays     : ', planets_noNA5.shape[0])

Interesting: even though there are 3341 observations with PeriodDays and only 1271 with SemiMajorAxisAU, once we remove the rows that don't have SurfaceTempK and HostStarTempK, we are left with about the same number of observations.

In [None]:
#Let's examine the relationship between the surface temperature and the orbital distance from the planet to the star
sns.set(style="darkgrid", color_codes=True)
# recent seaborn releases need x/y passed as keywords and use height instead of size
g = sns.jointplot(x="SemiMajorAxisAU", y="SurfaceTempK", data=planets_noNA4, kind="scatter", xlim=(0, 7), ylim=(0, 3000), color="b", height=7)

As expected, the surface temperature drops off quickly as the distance increases.
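A simple physical reference for this drop-off is the blackbody equilibrium temperature, T_eq = T_star * sqrt(R_star / (2 a)) * (1 - A)^(1/4), which falls off as one over the square root of the distance. The sketch below is only a rough model under stated assumptions: `equilibrium_temp_k` and its `albedo` parameter are hypothetical names, and it is not known here whether the dataset's SurfaceTempK values were measured or derived in this way.

```python
import math

SOLAR_RADIUS_AU = 0.00465047  # one solar radius expressed in AU

def equilibrium_temp_k(star_temp_k, star_radius_slr, semi_major_axis_au, albedo=0.0):
    """Blackbody equilibrium temperature of a planet, in kelvin.

    Assumes a circular orbit and uniform heat redistribution; albedo is
    the fraction of starlight reflected (an assumed input, not a dataset column).
    """
    ratio = star_radius_slr * SOLAR_RADIUS_AU / (2.0 * semi_major_axis_au)
    return star_temp_k * math.sqrt(ratio) * (1.0 - albedo) ** 0.25

# Sun-like star at 1 AU with an Earth-like albedo of 0.3
earth_like = equilibrium_temp_k(5772.0, 1.0, 1.0, albedo=0.3)
# Same star, four times farther out: the temperature should halve
far_out = equilibrium_temp_k(5772.0, 1.0, 4.0, albedo=0.3)
print(earth_like, far_out)
```

The inputs map onto the HostStarTempK, HostStarRadiusSlrRad and SemiMajorAxisAU columns, so the model curve could be compared against the scatter plot above.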

In [None]:
#Let's examine the relationship between the surface temperature and the temperature of the star
sns.set(style="darkgrid", color_codes=True)
# recent seaborn releases need x/y passed as keywords and use height instead of size
g = sns.jointplot(x="HostStarTempK", y="SurfaceTempK", data=planets_noNA4, kind="scatter", xlim=(0, 12000), ylim=(0, 3000), color="b", height=7)

Not surprisingly, the hotter the star, the hotter the surface temperature of the planet.

Now one last plot. Let's take a closer look at the relationship between the host star's temperature and mass.

In [None]:
g = sns.jointplot(x="HostStarMassSlrMass", y="HostStarTempK", xlim=(0, 5), ylim=(0, 20000),data=planets_all_obs)

An interesting relationship between the star mass and the star temperature. It almost seems like there are two clusters here, but that's a challenge for another day. I hope someone finds this notebook useful.