# Data Analysis - Astorb Project
### M. Sébastien MASCHA & M. Pierre SAUVAGE
### M. Lucas MERCIER & M. Martin LABENNE
### ISEP Paris – November 5th, 2019
<br/>
<br/>

# Subject
### I Aim
The aim of this project is to perform a complete analysis of a real case data set. You will be
asked to work in groups of 4 to 5 students and to produce a pdf report containing a detailed
analysis of a data set of your choice.
Several data sets coming from various science fields will be at your disposal and you will have
to pick one before mid-November. Your final report shall be sent before midnight on January
17th.

### II Expected work
In order to produce a good report, we strongly recommend that you follow all the steps
listed below:
- First, you will have to familiarize yourself with the field from whence your data come.
The idea is that you should be able to understand and explain all the variables of your
set and be capable of grasping which type of data mining task is the most relevant for
this field and these data.
- For most data, you will have to start with a pre-processing step. This step may include:
reformatting your data, dealing with missing values, dealing with aberrant and outlier
values, grouping or deleting features, normalizing your data, etc. In the case of a difficult
data set, you may also have to choose to work only on a subsample of your data or a
subsample of the features at your disposal.
- You may want to analyze the different descriptors of your data so that you can extract
interesting information about their distributions and possible
correlations between them.

- Depending on the type of data set, you will have to either find interesting
structures and clusters in the data, highlight strong links between the variables,
build models that detect the different classes in your data set, or build
predictive models from your data.

You will be required to provide a detailed account of these different analyses in your reports,
including figures, statistical results, and your personal interpretation of all your results.

### III Dataset
astorb.dat is an ASCII file of high-precision osculating orbital elements, ephemeris uncertainties, and some additional data for all the numbered asteroids and the vast majority of unnumbered asteroids (multi-apparition and single-apparition) for which it is possible to make reasonably determinate computations. It is currently about 52.4 MB in size in its compressed form (astorb.dat.gz), 192.4 MB when decompressed (astorb.dat), and contains 717962 orbits computed by Edward Bowell. Each orbit, based on astrometric observations downloaded from the Minor Planet Center, occupies one 266-column record.

<br/>

___
# Part 0 - Import of libraries

This document was produced with Python 3.7 in a Jupyter Notebook, using Docker and conda.
We use the following libraries:

- math for sqrt, pi, exp
- Numpy to manipulate arrays
- Pandas to import csv
- Matplotlib to plot graphics
- Seaborn to make your charts prettier (built on top of Matplotlib)
- Scikit-Learn : tools for data mining and data analysis
- SciPy : a Python-based ecosystem of open-source software for mathematics, science, and engineering. 
- TensorFlow : develop and train ML models

In [12]:
# coding: utf-8

import os, sys

from math import sqrt,pi,exp
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns; sns.set()

import scipy
import sklearn
import tensorflow

print("SciPy version : " + scipy.__version__)
print("Scikit-Learn version : " + sklearn.__version__)
print("TensorFlow version : " + tensorflow.__version__)

# Paths to the data files
astorb_data_file = "/app/data/astorb.dat"
astorb_text_file = "/app/data/astorb.name.txt"
astorb_csv_file = "/app/data/astorb.csv"

SciPy version : 1.3.1
Scikit-Learn version : 0.21.3
TensorFlow version : 2.0.0

<br/>
<br/>

___
# Part 1 - Getting the data
In this exercise, we study a well-known dataset: Astorb.

astorb.dat is an ASCII file of high-precision osculating orbital elements, ephemeris uncertainties, and some additional data for all numbered asteroids and the majority of unnumbered asteroids (multi- and single-apparition).
### Fields of the astorb database:

(1) 	Asteroid number (blank if unnumbered).

(2) 	Name or preliminary designation.

(3) 	Orbit computer.

(4) 	Absolute magnitude H, mag [see E. Bowell et al., pp. 549-554, in "Asteroids II", R. P. Binzel et al. (eds.), The University of Arizona Press, Tucson, 1989 and more recent Minor Planet Circulars]. Note that H may be given to 2 decimal places (e.g., 13.41), 1 decimal place (13.4) or as an integer (13), depending on its estimated accuracy. H is given to two decimal places for all unnumbered asteroids, even though it may be very poorly known.

(5) 	Slope parameter G ( ibid.).

(6) 	Color index B-V, mag (blank if unknown; see E. F. Tedesco, pp. 1090-1138, op. cit. ).

(7) 	IRAS diameter, km (blank if unknown; see E. F. Tedesco et al., pp. 1151-1161, op.cit.).

(8) 	IRAS Taxonomic classification (blank if unknown; ibid.).

(9) 	Six integer codes (see table of explanation below). Note that not all codes have been correctly computed.

(10) 	Orbital arc, days, spanned by observations used in orbit computation.

(11) 	Number of observations used in orbit computation.

(12) 	Epoch of osculation, yyyymmdd (TDT). The epoch is the Julian date ending in 00.5 nearest the date the file was created. Thus, as the file is updated, epochs will succeed each other at 100-day intervals on or after Julian dates ending in 50.5 (19980328, 19980706, 19981014, 19990122, ...).

(13) 	Mean anomaly, deg.

(14) 	Argument of perihelion, deg (J2000.0).

(15) 	Longitude of ascending node, deg (J2000.0).

(16) 	Inclination, deg (J2000.0).

(17) 	Eccentricity.

(18) 	Semimajor axis, AU.

(19) 	Date of orbit computation, yymmdd (MST, = UTC - 7 hr).

(20) 	Absolute value of the current 1-sigma ephemeris uncertainty (CEU), arcsec.

(21) 	Rate of change of CEU, arcsec/day.

(22) 	Date of CEU, yyyymmdd (0 hr UT).

(23) 	Next peak ephemeris uncertainty (PEU), arcsec, from date of CEU, and date of its occurrence, yyyymmdd.

(24) 	Greatest PEU, arcsec, in 10 years from date of CEU, and date of its occurrence, yyyymmdd.

(25) 	Greatest PEU, arcsec, in 10 years from date of next PEU, and date of its occurrence, yyyymmdd, if two observations (of accuracy equal to that of the observations currently included in the orbit--typically ± 1 arcsec) were to be made on the date of the next PEU [parameter (23)].
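The epoch rule described for field (12) can be checked numerically. The sketch below converts a yyyymmdd integer to a Julian date at 0 h and tests whether that Julian date ends in 00.5. The helper names `yyyymmdd_to_jd` and `is_standard_epoch` are illustrative and not part of the astorb tooling, and the date is treated as a plain Gregorian calendar date.

```python
from datetime import date

# Julian date of proleptic Gregorian ordinal 0 at 0 h
# (2000-01-01 is ordinal 730120 and Julian date 2451544.5).
JD_OFFSET = 1721424.5

def yyyymmdd_to_jd(yyyymmdd):
    """Julian date at 0 h of a calendar date given as an integer yyyymmdd."""
    year, rest = divmod(int(yyyymmdd), 10000)
    month, day = divmod(rest, 100)
    return date(year, month, day).toordinal() + JD_OFFSET

def is_standard_epoch(yyyymmdd):
    """True if the Julian date ends in 00.5, as described for field (12)."""
    return (yyyymmdd_to_jd(yyyymmdd) - 0.5) % 100 == 0

# The four example epochs quoted in field (12), plus one more date to try.
for epoch in (19980328, 19980706, 19981014, 19990122, 20161108):
    print(epoch, yyyymmdd_to_jd(epoch), is_standard_epoch(epoch))
```

Consecutive epochs in the list differ by exactly 100 in Julian date, which matches the 100-day interval stated above.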

In [13]:
def split_line(line, ind, skip_blank=True):
    ''' Splits line at ind, returns left and right side.'''
    left = '{:s},'.format(line[:ind]).replace(' ', '')
    if skip_blank:
        right = line[ind+1:]
    else:
        right = line[ind:]
    return left, right


header = 'Asteroid Number,Name,Orbit Computer,H,G,B-V,IRAS Diameter,IRAS Classification,I1,I2,I3,I4,I5,I6,Orbital Arc,'\
'Number of Observations,Epoch of Osculation,Mean Anomaly,Argument of Perihelion,Longitude of Ascending Node,'\
'Inclination,Eccentricity,Semimajor Axis,Date of Orbit Computation,CEU,dCEU,Date of CEU,PEU,Date of PEU,'\
'Greatest PEU from CEU,Date of Greatest PEU from CEU,Greatest PEU from PEU,Date of Greatest PEU from PEU\n'

# Separation char for CSV
separation_char=','

# Check that the input file exists at the configured path.
try:
    astorb_dat = open(astorb_data_file, 'r')
except FileNotFoundError:
    sys.exit('\n! Could not find {}.\n'.format(astorb_data_file))

print('\n Starting conversion..\n')

with open(astorb_csv_file, 'w') as astorb_csv:
    # Write header
    astorb_csv.write(header)
    
    for i, line in enumerate(astorb_dat):
        if i % 38000 == 0:
            # Status bar: redraw in place every 38000 records (roughly 5% of the file)
            sys.stdout.write('\r')
            sys.stdout.write("[%-40s] %d%%" % ('='*int(2*i/38000), 5*int(i/38000)))
            sys.stdout.flush()

        # First 6 characters are the Asteroid Number
        number, line  = split_line(line, 6)
        astorb_csv.write(number)
        
        # Next 18 Characters are the Name
        name, line = split_line(line, 18)
        
        # If the name starts with a year, add a space between year and number
        if name[:4].isdigit():
            name = name[:4] + ' '+ name[4:]
        astorb_csv.write(name)
        
        # Orbit Computer, 15 characters
        orbit_computer, line = split_line(line, 15)
        astorb_csv.write(orbit_computer)
        
        # Absolute Magnitude H, 5 characters
        absolute_magnitude, line = split_line(line, 5)
        astorb_csv.write(absolute_magnitude)
        
        # Slope parameter G, 4 characters
        slope_parameter, line = split_line(line, 4)
        astorb_csv.write(slope_parameter)
        
        # Color index B-V, 4 characters
        color_index, line = split_line(line, 4)
        astorb_csv.write(color_index)
        
        # IRAS diameter, 5 characters
        iras_diameter, line = split_line(line, 5)
        astorb_csv.write(iras_diameter)
        
        # IRAS Taxonomic classification, 5 characters
        iras_taxonomy, line = split_line(line, 5)
        astorb_csv.write(iras_taxonomy)
        for j in range(6):
            # Integer code, 4 characters
            integer_code, line = split_line(line, 4, skip_blank=False)
            astorb_csv.write(integer_code)
        line = line[1:]  # skip the blank that follows the six integer codes

        # Orbital Arc, 5 characters
        orbital_arc, line = split_line(line, 5, skip_blank=True)
        astorb_csv.write(orbital_arc)
        
        # Number of Observations, 4 characters read here
        number_of_observations, line = split_line(line, 4)
        astorb_csv.write(number_of_observations)
        
        # Epoch of Osculation, 8 characters
        epoch_of_osculation, line = split_line(line, 8)
        astorb_csv.write(epoch_of_osculation)
        
        # Mean Anomaly, 10 characters
        mean_anomaly, line = split_line(line, 10)
        astorb_csv.write(mean_anomaly)
        
        # Argument of Perihelion, 10 characters
        argument_of_perihelion, line = split_line(line, 10)
        astorb_csv.write(argument_of_perihelion)
        
        # Longitude of Ascending Node, 10 characters
        longitude_of_ascending_node, line = split_line(line, 10)
        astorb_csv.write(longitude_of_ascending_node)
        
        # Inclination, 9 characters
        inclination, line = split_line(line, 9)
        astorb_csv.write(inclination)
        
        # Eccentricity, 10 characters
        eccentricity, line = split_line(line, 10)
        astorb_csv.write(eccentricity)
        
        # Semimajor Axis, 12 characters
        semimajor_axis, line = split_line(line, 12)
        astorb_csv.write(semimajor_axis)
        
        # Date of Orbit Computation, 8 characters
        date_of_orbit_computation, line = split_line(line, 8)
        astorb_csv.write(date_of_orbit_computation)
        
        # CEU, 7 characters
        ceu, line = split_line(line, 7)
        astorb_csv.write(ceu)
        
        # Rate of change of CEU, 8 characters
        dceu, line = split_line(line, 8)
        astorb_csv.write(dceu)
        
        # Date of CEU, 8 characters
        date_of_ceu, line = split_line(line, 8)
        astorb_csv.write(date_of_ceu)
        
        # PEU, 7 characters
        peu, line = split_line(line, 7)
        astorb_csv.write(peu)
        
        # Date of PEU, 8 characters
        date_of_peu, line = split_line(line, 8)
        astorb_csv.write(date_of_peu)
        
        # Greatest PEU from CEU, 7 characters
        peu_from_ceu, line = split_line(line, 7)
        astorb_csv.write(peu_from_ceu)
        
        # Date of greatest PEU from CEU, 8 characters
        date_of_peu_from_ceu, line = split_line(line, 8)
        astorb_csv.write(date_of_peu_from_ceu)
        
        # Greatest PEU from PEU, 7 characters
        peu_from_peu, line = split_line(line, 7)
        astorb_csv.write(peu_from_peu)
        
        # Date of greatest PEU from PEU, 8 characters
        date_of_peu_from_peu, line = split_line(line, 8)
        astorb_csv.write(date_of_peu_from_peu[:-1])  # don't need the last comma
        
        # Finish with line break
        astorb_csv.write('\n')
astorb_dat.close()  # release the input file once every record has been converted
print('\nDone!\n')


 Starting conversion..

Done!



In [14]:
pd.set_option('display.max_columns', 50)

df = pd.read_csv(astorb_csv_file, sep=separation_char)
print(df.shape)

df.head()

(717962, 33)


Unnamed: 0,Asteroid Number,Name,Orbit Computer,H,G,B-V,IRAS Diameter,IRAS Classification,I1,I2,I3,I4,I5,I6,Orbital Arc,Number of Observations,Epoch of Osculation,Mean Anomaly,Argument of Perihelion,Longitude of Ascending Node,Inclination,Eccentricity,Semimajor Axis,Date of Orbit Computation,CEU,dCEU,Date of CEU,PEU,Date of PEU,Greatest PEU from CEU,Date of Greatest PEU from CEU,Greatest PEU from PEU,Date of Greatest PEU from PEU
0,1.0,Ceres,L.H.Wasserman,3.34,0.1,0.7,848.0,G?,0,0,0,0,0,0,78700,6474,20161108,245.45922,72.848781,80.311635,10.59198,0.075681,2.768083,20160829,0.017,8.4e-05,20160924,0.019,20161026,0.025,20180208,0.025,20180208
1,2.0,Pallas,L.H.Wasserman,4.13,0.1,0.6,498.0,m,0,0,0,0,0,10,71301,7676,20161108,227.63152,309.995488,173.088344,34.840532,0.230751,2.772956,20160829,0.012,-4.9e-05,20160924,0.021,20171019,0.027,20221216,0.027,20221216
2,3.0,Juno,L.H.Wasserman,5.33,0.3,0.8,233.0,S,0,0,0,0,0,0,71251,6712,20161108,191.363113,248.22482,169.862053,12.99001,0.256721,2.668575,20160829,0.0091,-2e-05,20160924,0.017,20170708,0.036,20181128,0.036,20181128
3,4.0,Vesta,L.H.Wasserman,3.2,0.3,0.8,468.0,r,0,0,0,0,0,0,71233,6683,20161108,211.045679,151.108214,103.84225,7.140548,0.089116,2.361251,20160425,0.011,5.5e-05,20160924,0.021,20170111,0.03,20180608,0.03,20180608
4,5.0,Astraea,L.H.Wasserman,6.85,0.1,0.8,119.0,S,0,0,0,0,0,9,62332,2304,20161108,67.358757,358.819766,141.588001,5.367854,0.191515,2.573781,20160829,0.015,-1.8e-05,20160924,0.03,20170702,0.045,20240103,0.045,20240103
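In the preview above, G and B-V appear with a single decimal and the IRAS Diameter values are whole numbers, whereas H keeps two decimals. This may simply reflect the source data, but it could also mean that a field boundary in the conversion is off by one character. One way to inspect this is to cut a raw line at explicit (width, gap) pairs and print each piece between visible delimiters. In the sketch below, `slice_record` is a hypothetical helper and the sample record is made up, not a real astorb line.

```python
def slice_record(record, spec):
    """Cut a fixed-width record into fields.

    spec is a list of (width, gap) pairs: width characters are kept,
    then gap separator characters are skipped.
    """
    fields, pos = [], 0
    for width, gap in spec:
        fields.append(record[pos:pos + width])
        pos += width + gap
    return fields

# Made-up record: a 5-character field, one blank, then another 5-character field.
sample = " 3.34" + " " + " 0.12"

# Same record cut with the right width (5) and with a width that is too short (4).
for spec in ([(5, 1), (5, 0)], [(5, 1), (4, 0)]):
    print(spec, ["|{}|".format(field) for field in slice_record(sample, spec)])
```

If a piece cut from a real line ends in the middle of a number, the corresponding width in the conversion loop is probably too small.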


<br/>
<br/>

___
# Part 2 - Cleaning the data
Cleaning the data involves removing duplicate rows, removing outliers, finding missing or null values, converting unparseable object values into null values, and plotting the results. These are typical steps of data cleaning.
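As a minimal sketch of these steps, the snippet below counts duplicate rows and missing values and flags outliers with the 1.5 × IQR rule. The toy DataFrame and the choice of the IQR rule are assumptions made for illustration; other outlier criteria are equally possible.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the asteroid table (values are invented).
toy = pd.DataFrame({
    "Name": ["A", "B", "B", "C", "D", "E"],
    "H": [3.3, 4.1, 4.1, 5.3, np.nan, 30.0],
})

n_duplicates = toy.duplicated().sum()   # rows identical to an earlier row
toy = toy.drop_duplicates()
n_missing = toy["H"].isna().sum()       # missing values in one column

# 1.5 * IQR rule: values far outside the interquartile range are flagged.
q1, q3 = toy["H"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (toy["H"] < q1 - 1.5 * iqr) | (toy["H"] > q3 + 1.5 * iqr)

print("duplicates:", n_duplicates)
print("missing H:", n_missing)
print("outliers:", toy.loc[is_outlier, "Name"].tolist())
```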
### Information:

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 717962 entries, 0 to 717961
Data columns (total 33 columns):
Asteroid Number                   474120 non-null float64
Name                              717962 non-null object
Orbit Computer                    717962 non-null object
H                                 717962 non-null float64
G                                 717962 non-null float64
B-V                               944 non-null float64
IRAS Diameter                     2140 non-null float64
IRAS Classification               357 non-null object
I1                                717962 non-null int64
I2                                717962 non-null int64
I3                                717962 non-null int64
I4                                717962 non-null int64
I5                                717962 non-null int64
I6                                717962 non-null int64
Orbital Arc                       717962 non-null int64
Number of Observations            717962 non

### Removing unused or irrelevant columns:
Some columns will not be useful for classification:
- Asteroid number.

Some columns don't have enough data:
- B-V: 944 values;
- IRAS Diameter: 2140 values;
- IRAS Classification: 357 values.

We can remove them.

In [16]:
columns_to_drop = ['Asteroid Number',
                   'B-V',
                   'IRAS Diameter',
                   'IRAS Classification']
try:
    df.drop(columns_to_drop, inplace=True, axis=1)
except KeyError:
    # Raised when the cell is run a second time and the columns are already gone.
    print("You've already deleted these columns.")
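The sparse columns can also be selected programmatically rather than by hand. The sketch below lists the columns whose share of non-null values falls below a threshold. The helper `sparse_columns` and the 1% threshold are assumptions made for illustration, and the toy frame is invented.

```python
import numpy as np
import pandas as pd

def sparse_columns(frame, min_fraction=0.01):
    """Return the names of columns whose non-null fraction is below min_fraction."""
    filled = frame.notna().mean()   # fraction of non-null values per column
    return filled[filled < min_fraction].index.tolist()

# Toy frame: 'B-V' is filled in 1 row out of 200, 'H' in every row.
toy = pd.DataFrame({
    "H": np.linspace(3.0, 20.0, 200),
    "B-V": [0.7] + [np.nan] * 199,
})
print(sparse_columns(toy))
```

With the counts reported by `df.info()`, B-V (944), IRAS Diameter (2140) and IRAS Classification (357) each cover well under 1% of the 717962 rows, so such a threshold would select the same three columns. Asteroid Number (474120 non-null values) is dropped for a different reason and would still have to be listed explicitly.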

In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 717962 entries, 0 to 717961
Data columns (total 29 columns):
Name                              717962 non-null object
Orbit Computer                    717962 non-null object
H                                 717962 non-null float64
G                                 717962 non-null float64
I1                                717962 non-null int64
I2                                717962 non-null int64
I3                                717962 non-null int64
I4                                717962 non-null int64
I5                                717962 non-null int64
I6                                717962 non-null int64
Orbital Arc                       717962 non-null int64
Number of Observations            717962 non-null int64
Epoch of Osculation               717962 non-null int64
Mean Anomaly                      717962 non-null float64
Argument of Perihelion            717962 non-null float64
Longitude of Ascending Node       71796

In [19]:
# Count the NA values that pandas detects directly
print(df.isnull().sum())

Name                              0
Orbit Computer                    0
H                                 0
G                                 0
I1                                0
I2                                0
I3                                0
I4                                0
I5                                0
I6                                0
Orbital Arc                       0
Number of Observations            0
Epoch of Osculation               0
Mean Anomaly                      0
Argument of Perihelion            0
Longitude of Ascending Node       0
Inclination                       0
Eccentricity                      0
Semimajor Axis                    0
Date of Orbit Computation         0
CEU                               0
dCEU                              0
Date of CEU                       0
PEU                               0
Date of PEU                       0
Greatest PEU from CEU             0
Date of Greatest PEU from CEU     0
Greatest PEU from PEU       

In [44]:
df['Date of CEU'].value_counts()

20160924    717764
0              198
Name: Date of CEU, dtype: int64
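The 198 rows with a Date of CEU of 0 are not reported by `isnull()`, because 0 is a valid integer. One possible treatment, sketched here on a toy Series and under the assumption that 0 means "no date available", is to parse the yyyymmdd integers with `pd.to_datetime` and let values that do not match the format become `NaT`:

```python
import pandas as pd

# Toy values mimicking the column: two real dates and one 0 placeholder.
dates = pd.Series([20160924, 0, 20160924], name="Date of CEU")

# errors="coerce" turns values that do not match the format into NaT.
parsed = pd.to_datetime(dates.astype(str), format="%Y%m%d", errors="coerce")

print(parsed)
print("missing dates:", parsed.isna().sum())
```

After this conversion the placeholder rows show up in `isnull().sum()` like any other missing value.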