As of May 2021, there are over 4,000 confirmed exoplanets and a few thousand more candidates. In this notebook, I want to find patterns in the distributions of planets and try to explain the gaps in those distributions: which kinds of planets have we not found yet, and why?

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import plotly.express as px

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
df = pd.read_csv('../input/full-exoplanet-catalog/exoplanet_confirm_and_candidates.csv')
df

The dataset has 98 columns, and we don't need all of them (at least initially). I want to rename some columns, remove the columns with measurement errors, and add columns with the logarithms of some parameters, which is more convenient for plotting.
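
One caveat with log columns: `np.log10` returns `-inf` for zero and `NaN` (with a runtime warning) for negative inputs. Below is a minimal sketch of a guard, assuming non-positive values should simply be treated as missing. The helper name `safe_log10` is hypothetical and not part of the original notebook, and the values are toy numbers rather than catalog data.

```python
import numpy as np
import pandas as pd


def safe_log10(values: pd.Series) -> pd.Series:
    """Return log10 of a series, with non-positive or missing entries as NaN."""
    # Series.where keeps values where the condition is True and puts NaN elsewhere,
    # so np.log10 never sees zeros or negatives.
    positive_only = values.where(values > 0)
    return np.log10(positive_only)


# Toy values for illustration only (not taken from the catalog).
example = pd.Series([1.0, 100.0, 0.0, -5.0, np.nan])
logged = safe_log10(example)
print(logged.tolist())
```

In the notebook this could be applied as, for example, `safe_log10(df['orbital_period'])`.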

### Data preparation

In [None]:
df.columns

In [None]:
df = df.rename(columns={"# name": "planet_name", "mass":"mass_jup", "radius":"radius_jup"})

In [None]:
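# Convert Jupiter units to Earth units
# (1 Jupiter mass ≈ 317.8 Earth masses, 1 Jupiter radius ≈ 11.2 Earth radii)
df['mass_earth'] = df['mass_jup'] * 317.8
df['radius_earth'] = df['radius_jup'] * 11.2

# Log-scaled columns for plotting
df['log_mass_earth'] = np.log10(df['mass_earth'])
df['log_radius_earth'] = np.log10(df['radius_earth'])
df['log_orbital_period'] = np.log10(df['orbital_period'])
df['log_star_teff'] = np.log10(df['star_teff'])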
df.head()

In [None]:
#display(df[df['mass_earth'] < 5])

In [None]:
# Keep every column whose name does not mention a measurement error
new_columns = [col for col in df.columns if 'error' not in col]
new_columns

Create a new dataframe without the measurement-error columns:

In [None]:
df_2 = df[new_columns]

In [None]:
# Drop columns that are not needed for the distribution plots
df_2 = df_2.drop(columns=[
    'inclination', 'angular_distance', 'discovered', 'updated',
    'omega', 'tperi', 'tconj', 'tzero_tr', 'tzero_tr_sec',
    'lambda_angle', 'impact_parameter', 'tzero_vr', 'hot_point_lon',
    'log_g', 'publication', 'ra', 'dec',
    'mag_v', 'mag_i', 'mag_j', 'mag_h', 'mag_k',
    'star_detected_disc', 'star_magnetic_field', 'star_alternate_names',
])

df_2

Next, the dataset can be split into two tables: confirmed planets and candidates.

In [None]:
df.planet_status.unique()

In [None]:
# Split the dataframe into confirmed planets and candidates

confirmed = df_2.query('planet_status == "Confirmed"')
candidates = df_2.query('planet_status == "Candidate"')

## Visualizations (confirmed exoplanets)

#### Orbital period - planet mass

In [None]:
fig = px.scatter(confirmed, x="log_orbital_period", y="log_mass_earth", 
                 hover_data=['planet_name'], color='star_teff')
fig.show()

Almost all host-star temperatures are below 10,000 K. Drop the hottest stars:

In [None]:
confirmed_temp = confirmed.query('star_teff <= 10000')
fig = px.scatter(confirmed_temp, x="log_orbital_period", y="log_mass_earth", 
                 hover_data=['planet_name'], color='star_teff')
fig.show()
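
The claim that almost all host stars are cooler than 10,000 K can be checked numerically. Note that a comparison such as `star_teff <= 10000` is false for missing values, so the filter above also removes planets whose host-star temperature is unknown. Here is a small sketch on made-up temperatures. The helper `teff_filter_summary` is hypothetical and assumes at least one known temperature.

```python
import numpy as np
import pandas as pd


def teff_filter_summary(teff: pd.Series, limit: float = 10000.0) -> dict:
    """Count how many rows a `teff <= limit` filter keeps, and why the rest are dropped."""
    kept = teff <= limit      # False for NaN as well as for hot stars
    missing = teff.isna()
    too_hot = teff > limit    # also False for NaN
    return {
        "kept": int(kept.sum()),
        "dropped_missing": int(missing.sum()),
        "dropped_too_hot": int(too_hot.sum()),
        # Share of stars with a known temperature that pass the cut
        "kept_share_of_known": float(kept.sum() / (~missing).sum()),
    }


# Made-up temperatures in kelvin, for illustration only.
toy_teff = pd.Series([3500.0, 5800.0, 6200.0, 12000.0, np.nan])
summary = teff_filter_summary(toy_teff)
print(summary)
```

On the notebook's data this would be called as `teff_filter_summary(confirmed['star_teff'])`.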

Most objects are concentrated in the region between 0.5 Earth masses and 30 Jupiter masses, with orbital periods between 0.5 and 1000 days.

On average, the lower the mass of the planet, the cooler its host star.
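
The mass-temperature trend is read off the color scale, so it may help to check it with numbers. One possible approach, sketched here on synthetic rows, is to bin planets by `log_mass_earth` and compare the median `star_teff` per bin. The function name and the bin edges are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd


def median_teff_by_mass_bin(frame: pd.DataFrame, log_mass_edges) -> pd.Series:
    """Median host-star temperature within bins of log10(planet mass in Earth masses)."""
    known = frame.dropna(subset=["log_mass_earth", "star_teff"])
    bins = pd.cut(known["log_mass_earth"], bins=log_mass_edges)
    # observed=True keeps only bins that actually contain planets
    return known.groupby(bins, observed=True)["star_teff"].median()


# Synthetic rows for illustration only; real values would come from `confirmed_temp`.
toy = pd.DataFrame({
    "log_mass_earth": [0.2, 0.6, 1.3, 1.7, 2.4, 2.9],
    "star_teff": [3400.0, 3900.0, 5200.0, 5400.0, 5900.0, 6300.0],
})
medians = median_teff_by_mass_bin(toy, log_mass_edges=[0, 1, 2, 3])
print(medians)
```

If the trend holds in the real data, the medians should rise from the low-mass bins to the high-mass bins.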

Tentatively, all planets can be split into two groups by mass: above and below 50 Earth masses.
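
The two-group split can be made explicit by labeling each planet relative to the 50 Earth mass threshold. This is a minimal sketch with toy masses. The names `MASS_SPLIT_EARTH`, `label_mass_group`, and the group labels are hypothetical, and planets with unknown mass are left unlabeled.

```python
import numpy as np
import pandas as pd

MASS_SPLIT_EARTH = 50.0  # tentative threshold from the text, in Earth masses


def label_mass_group(mass_earth: pd.Series, threshold: float = MASS_SPLIT_EARTH) -> pd.Series:
    """Label each planet 'low_mass' or 'high_mass'; unknown masses stay missing."""
    labels = pd.Series(
        np.where(mass_earth > threshold, "high_mass", "low_mass"),
        index=mass_earth.index,
        dtype="object",
    )
    # np.where sends NaN to the 'low_mass' branch, so mask unknown masses explicitly
    return labels.where(mass_earth.notna())


# Toy masses in Earth masses, for illustration only.
toy_mass = pd.Series([1.0, 8.5, 49.9, 120.0, 317.8, np.nan])
groups = label_mass_group(toy_mass)
print(groups.value_counts())
```

The resulting labels could then be passed to `px.scatter` through its `color` argument to see whether the two groups separate visually.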

#### Orbital period - planet radius

In [None]:

fig = px.scatter(confirmed_temp, x="log_orbital_period", y="log_radius_earth", 
                 hover_data=['planet_name'], color='star_teff')
fig.show()

Keep only objects with a radius between 0.1 and about 30 Earth radii (log10 of the radius between -1 and 1.5):

In [None]:
confirmed_rad = confirmed_temp.query('log_radius_earth > -1 & log_radius_earth < 1.5')
fig = px.scatter(confirmed_rad, x="log_orbital_period", y="log_radius_earth", 
                 hover_data=['planet_name'], color='star_teff')
fig.show()

Most objects are concentrated in the region between 0.6 and 23 Earth radii, with orbital periods between 0.5 and 1000 days.
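
Ranges like these are estimated by eye from the scatter plot. One way to pin them down is to take the quantiles that enclose a chosen central share of the values. The sketch below uses synthetic radii. The 90% coverage and the helper name `central_range` are assumptions, not the author's method.

```python
import numpy as np
import pandas as pd


def central_range(values: pd.Series, coverage: float = 0.9) -> tuple:
    """Return the (low, high) bounds containing the central `coverage` share of values."""
    tail = (1.0 - coverage) / 2.0
    low, high = values.dropna().quantile([tail, 1.0 - tail])
    return float(low), float(high)


# Synthetic radii in Earth radii, for illustration only.
toy_radius = pd.Series(np.arange(1.0, 102.0))  # 1, 2, ..., 101
low, high = central_range(toy_radius, coverage=0.9)
print(low, high)
```

In the notebook the same helper could be applied to `confirmed_rad['radius_earth']` and `confirmed_rad['orbital_period']`.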

On average, the smaller the radius of the planet, the cooler its host star.

All planets can be split into two groups by radius: above and below 5 Earth radii.

To be continued...