# My Asteroids Classification project☄️ 

### Data Acquisition

Hello and thank you for visiting my asteroids project. I'm trying to find out whether the orbits of asteroids can be classified.


Let's begin by crawling NASA's database for some info.   

<img src="images/Head.jpg" style="float:right;width:200px;height:100px;border-radius: 50%"/>



In [2]:
import pandas as pd
import json
import numpy as np
import requests
from bs4 import BeautifulSoup

import warnings
warnings.filterwarnings("ignore")

The "JPL Small-Body Database Search Engine" offers many parameters for each object; we keep only the ones listed below and ignore the rest.
Later on, we'd like to see whether these parameters let us classify the asteroids.
The parameters we're looking for are: 
***All parameters are also depicted in the diagram below.***

**SPK-ID** - the object's primary SPK-ID

**a** - semi-major axis (AU). The major axis of an ellipse is its longest diameter: a line segment that runs through the center and ends at the two widest points of the perimeter.

The semi-major axis is the longest semidiameter, i.e. one half of the major axis.

**e** - orbital eccentricity: a parameter that determines how much the asteroid's orbit around another body deviates from a perfect circle. A value of 0 is a circular orbit, values between 0 and 1 form an elliptic orbit, 1 is a parabolic escape orbit, and a value greater than 1 is a hyperbola.

**q** - perihelion distance: the distance from the Sun at the point of the asteroid's orbit that is nearest to the Sun (AU)

**peri** - argument of perihelion ($\omega$) (deg)

**Q** - aphelion distance: the distance from the Sun at the point of the asteroid's orbit that is most distant from the Sun (AU)

**i** - inclination; angle with respect to the x-y ecliptic plane (deg)

**node** - longitude of the ascending node (deg)

**period** - sidereal orbital period: the time the object takes to complete one orbit around another object (years)

**H** - absolute magnitude: a measure of the asteroid's intrinsic brightness. For Solar System bodies it is defined as the apparent magnitude the object would have if it were 1 AU from both the Sun and the observer, at zero phase angle, unlike the stellar absolute magnitude, which uses a distance of exactly 10 parsecs (32.6 light-years).

**NEO** - Near-Earth Object (NEO) flag: any small Solar System body whose orbit brings it into proximity with Earth. By convention, a Solar System body is a NEO if its closest approach to the Sun (perihelion) is less than 1.3 astronomical units (AU). (**q** < 1.3 AU)

**PHA** - Potentially Hazardous Asteroid (PHA) flag. A potentially hazardous object (PHO) is a near-Earth object, either an asteroid or a comet, whose orbit can make close approaches to the Earth and which is large enough to cause significant regional damage in the event of impact. Roughly, a NEO that can come close to Earth's orbit and is larger than about 140 meters (460 ft) across is considered a PHO; the formal criterion is a minimum orbit intersection distance with Earth of at most 0.05 AU and **H** ≤ 22.0. (**q** < 1.3 AU and **diameter** ≥ 0.14 km)

**diameter** - object diameter (km)
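
Several of these parameters are not independent: for a closed orbit around the Sun, **q**, **Q** and **period** follow from **a** and **e**. The sketch below is only an illustration (the helper name `orbit_from_a_e` is made up here, and it assumes the asteroid's mass is negligible next to the Sun's); it checks the relations against the values JPL lists for 1 Ceres (a = 2.766, e = 0.0782).

```python
def orbit_from_a_e(a_au: float, e: float) -> dict:
    """Derive q, Q and the period from the semi-major axis and eccentricity."""
    if not 0 <= e < 1:
        raise ValueError("closed (elliptic) orbits need 0 <= e < 1")
    return {
        "q": a_au * (1 - e),    # perihelion distance, AU
        "Q": a_au * (1 + e),    # aphelion distance, AU
        "period": a_au ** 1.5,  # Kepler's third law: P[years] = a[AU]^(3/2)
    }


ceres = orbit_from_a_e(2.766, 0.0782)  # a and e of 1 Ceres
print({name: round(value, 3) for name, value in ceres.items()})
```

The results agree with the q, Q and period that JPL reports for Ceres to within rounding, which is worth remembering later: a classifier that sees a, e, q, Q and period together is seeing partly redundant features.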

<img src="images/Perihelion-Aphelion.png" style="float:right;width:400px;height:300px;"/>
<img src="images/terms.PNG" style="float:left;width:300px;height:300px;"/>







In [3]:
def getColnames():
    col_names = []
    url = "https://ssd.jpl.nasa.gov/sbdb_query.cgi?obj_group=all;obj_kind=ast;obj_numbered=all;OBJ_field=0;ORB_field=0;table_format=HTML;max_rows=500;format_option=comp;query=Generate%20Table;c_fields=AbAcBhBgBjBkBlBiBnBsCqCnCoCpAiAgAhAp;c_sort=;.cgifields=format_option;.cgifields=obj_kind;.cgifields=obj_group;.cgifields=obj_numbered;.cgifields=ast_orbit_class;.cgifields=table_format;.cgifields=com_orbit_class&page=1"
    response = requests.get(url)
    
    soup1 = BeautifulSoup(response.content, "html.parser")
    
    tags = soup1.find_all('a', {'title':'sort'})
    for tag in tags:
        col_names.append(tag.b.string)
        
    return col_names

In [4]:
getColnames()

['SPK-ID',
 'object fullname',
 'a',
 'e',
 'i',
 'node',
 'peri',
 'q',
 'Q',
 'period',
 'condition code',
 '# obs. used (total)',
 '# obs. used (del.)',
 '# obs. used (dop.)',
 'H',
 'NEO',
 'PHA',
 'diameter']

Next we want to find out how many result pages there are. The page number is part of the 'Last' pagination link on the website, so we read it from there:

In [5]:
def getLastpagenumber():
    table = requests.get('https://ssd.jpl.nasa.gov/sbdb_query.cgi?obj_group=all;obj_kind=ast;obj_numbered=all;OBJ_field=0;ORB_field=0;table_format=HTML;max_rows=500;format_option=comp;query=Generate%20Table;c_fields=AbAcBhBgBjBkBlBiBnBsCqCnCoCpAiAgAhAp;c_sort=;.cgifields=format_option;.cgifields=obj_kind;.cgifields=obj_group;.cgifields=obj_numbered;.cgifields=ast_orbit_class;.cgifields=table_format;.cgifields=com_orbit_class&page=1')
    
    soup1 = BeautifulSoup(table.content, "html.parser")
    links = soup1.find_all('a', href=True)  ## all the links on the page
    
    ## the 'Last' pagination link ends with "page=<number of the last page>"
    for link in links:
        if link.string == 'Last':
            return int(link['href'].split("page=")[1])

In [6]:
getLastpagenumber()

2194
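
Note that `getLastpagenumber` returns `None` when no 'Last' link is found, and `int(...)` fails if anything follows the page number in the link. A slightly more defensive way to pull the number out of an href could look like the sketch below; the helper name and the shortened example link are made up for illustration.

```python
def page_number_from_href(href: str) -> int:
    """Return the integer that follows 'page=' in a pagination link."""
    marker = "page="
    if marker not in href:
        raise ValueError(f"no page number in {href!r}")
    tail = href.split(marker)[1]
    # Cut off anything after the number in case more parameters follow.
    digits = tail.split("&")[0].split(";")[0]
    return int(digits)


# Shortened, made-up link in the same style as the JPL query URL.
example_href = "sbdb_query.cgi?table_format=HTML;max_rows=500&page=2194"
print(page_number_from_href(example_href))
```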

In [7]:
url="https://ssd.jpl.nasa.gov/sbdb_query.cgi?obj_group=all;obj_kind=ast;obj_numbered=all;OBJ_field=0;ORB_field=0;table_format=HTML;max_rows=500;format_option=comp;query=Generate%20Table;c_fields=AbAcBhBgBjBkBlBiBnBsCqCnCoCpAiAgAhAp;c_sort=;.cgifields=format_option;.cgifields=obj_kind;.cgifields=obj_group;.cgifields=obj_numbered;.cgifields=ast_orbit_class;.cgifields=table_format;.cgifields=com_orbit_class&page="

In [8]:
def getRowsfromJPL(id, url):
    """Download result page number `id` from JPL and return it as a DataFrame (None on failure)."""
    response = requests.get(url + str(id))
    if response.status_code != 200:
        return None

    soup1 = BeautifulSoup(response.content, "html.parser")
    ## soup1.contents is a list of nodes, so search the page text instead
    if 'No events matching' in soup1.get_text():
        return None

    col_names = getColnames()  ## column headers, in the same order as the table cells

    ## a dict with one list per column; each list collects that column's values
    dic = {col: [] for col in col_names}

    table = soup1.find_all('table', {"cellspacing": "1"})  ## let's find the main table
    alltablerows = table[0].find_all('tr')  ## all the table rows

    for row in alltablerows:
        ## skip rows that are not relevant, i.e. whose first cell holds no plain string
        if row.td is None or row.td.string is None:
            continue
        rowdata = row.find_all('td')
        ## cell j of the row belongs to the j-th column of the dict
        for j, col in enumerate(dic):
            cell = rowdata[j].string
            ## '\xa0' (non-breaking space) represents a null value in the data table
            dic[col].append('' if cell == '\xa0' else str(cell))

    ## build the dataframe from the dict with pandas' from_dict
    return pd.DataFrame.from_dict(dic)

In [9]:
getRowsfromJPL(1,url)

Unnamed: 0,SPK-ID,object fullname,a,e,i,node,peri,q,Q,period,condition code,# obs. used (total),# obs. used (del.),# obs. used (dop.),H,NEO,PHA,diameter
0,2000001,1 Ceres (A801 AA),2.766,0.0782,10.59,80.27,73.72,2.550,2.98,4.6,0,1075,60,0,3.5,N,N,939.4
1,2000002,2 Pallas (A802 FA),2.774,0.2298,34.85,172.97,310.29,2.137,3.41,4.62,0,8933,,,4.2,N,N,545
2,2000003,3 Juno (A804 RA),2.668,0.2570,12.99,169.85,248.03,1.982,3.35,4.36,0,7304,,,5.3,N,N,246.596
3,2000004,4 Vesta (A807 FA),2.362,0.0884,7.14,103.81,150.92,2.153,2.57,3.63,0,9451,2977,0,3.3,N,N,525.4
4,2000005,5 Astraea (A845 XA),2.574,0.1908,5.37,141.57,358.63,2.083,3.06,4.13,0,3168,,,7.0,N,N,106.699
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,2000496,496 Gryphia (A902 UM),2.199,0.0797,3.79,207.58,258.84,2.024,2.37,3.26,0,3383,,,11.8,N,N,14.403
496,2000497,497 Iva (A902 VB),2.85,0.3009,4.82,6.31,3.52,1.992,3.71,4.81,0,3221,,,9.9,N,N,40.932
497,2000498,498 Tokio (A902 XA),2.652,0.2229,9.50,97.21,241.80,2.061,3.24,4.32,0,4596,,,8.9,N,N,81.83
498,2000499,499 Venusia (A902 YE),4.008,0.2163,2.09,256.25,174.79,3.141,4.88,8.02,0,3339,,,9.5,N,N,77.328


We want to iterate through all the pages (2194 when the cell above ran) and concatenate the dataframes built from each page.
That is a lot of data and it takes a long time to crawl, so we take only the first 850 pages.

In the cells below, CONST_PAGES controls how many pages we crawl from the JPL database.
The crawl stops just before CONST_PAGES, so a value of 851 means pages 1 to 850.

In [10]:
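CONST_PAGES = 851 ## crawl pages 1..850 (the end of the range is exclusive)

## collect the pages in a list and concatenate once; this avoids copying the growing dataframe on every iteration
pages = [getRowsfromJPL(i, url) for i in range(1, CONST_PAGES)]
df_JPL = pd.concat(pages, ignore_index=True) ## note: pd.concat silently drops pages that came back as None

##I want to save the dataframe to csv because the crawl is heavy and I don't want to repeat it every time, so:

df_JPL.to_csv('df_JPL.csv', index=False)

df_JPL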


Unnamed: 0,SPK-ID,object fullname,a,e,i,node,peri,q,Q,period,condition code,# obs. used (total),# obs. used (del.),# obs. used (dop.),H,NEO,PHA,diameter
0,2000001,1 Ceres (A801 AA),2.766,0.0782,10.59,80.27,73.72,2.550,2.98,4.6,0,1075,60,0,3.5,N,N,939.4
1,2000002,2 Pallas (A802 FA),2.774,0.2298,34.85,172.97,310.29,2.137,3.41,4.62,0,8933,,,4.2,N,N,545
2,2000003,3 Juno (A804 RA),2.668,0.2570,12.99,169.85,248.03,1.982,3.35,4.36,0,7304,,,5.3,N,N,246.596
3,2000004,4 Vesta (A807 FA),2.362,0.0884,7.14,103.81,150.92,2.153,2.57,3.63,0,9451,2977,0,3.3,N,N,525.4
4,2000005,5 Astraea (A845 XA),2.574,0.1908,5.37,141.57,358.63,2.083,3.06,4.13,0,3168,,,7.0,N,N,106.699
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
406995,2424996,424996 (2009 CZ40),3.169,0.1102,9.40,140.76,273.09,2.820,3.52,5.64,0,143,,,16.0,N,N,
406996,2424997,424997 (2009 CH45),2.987,0.1178,8.14,324.59,79.77,2.635,3.34,5.16,0,98,,,16.7,N,N,
406997,2424998,424998 (2009 CU48),3.172,0.0740,11.88,307.42,319.26,2.937,3.41,5.65,0,136,,,16.0,N,N,4.756
406998,2424999,424999 (2009 CB58),3.148,0.0686,13.82,177.56,273.23,2.932,3.36,5.59,0,144,,,16.5,N,N,
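
Since the crawl is slow, it can help to reuse the saved CSV on later runs instead of hitting JPL again. Below is a minimal sketch of that idea; `load_or_build` and `fake_crawl` are hypothetical names used only for this illustration, and the demo writes to a temporary folder rather than to the real `df_JPL.csv`.

```python
import os
import tempfile

import pandas as pd


def load_or_build(path, build):
    """Return the cached CSV at `path`, or call `build()` once and cache its result."""
    if os.path.exists(path):
        return pd.read_csv(path)
    frame = build()
    frame.to_csv(path, index=False)
    return frame


calls = []  # records how often the (fake) crawl actually runs


def fake_crawl():
    """Stand-in for the real crawl; returns a one-row frame."""
    calls.append(1)
    return pd.DataFrame({"SPK-ID": [2000001], "a": [2.766]})


with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "df_JPL.csv")
    first = load_or_build(cache, fake_crawl)   # runs the crawl and writes the file
    second = load_or_build(cache, fake_crawl)  # reads the file, no second crawl

print(len(calls))
```

One thing to keep in mind with this approach: `pd.read_csv` infers numeric dtypes, so a frame loaded from the cache will not be all strings like the freshly crawled one.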


Now that the crawl is done, let's start cleaning the data: duplicated values, nulls and so on.
First, we want to drop some irrelevant columns and get information about our dataset.

In [11]:
df_JPL.drop(columns = ['condition code','# obs. used (total)','# obs. used (del.)','# obs. used (dop.)',],inplace=True)

In [12]:
df_JPL.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 407000 entries, 0 to 406999
Data columns (total 14 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   SPK-ID           407000 non-null  object
 1   object fullname  407000 non-null  object
 2   a                407000 non-null  object
 3   e                407000 non-null  object
 4   i                407000 non-null  object
 5   node             407000 non-null  object
 6   peri             407000 non-null  object
 7   q                407000 non-null  object
 8   Q                407000 non-null  object
 9   period           407000 non-null  object
 10  H                407000 non-null  object
 11  NEO              407000 non-null  object
 12  PHA              407000 non-null  object
 13  diameter         407000 non-null  object
dtypes: object(14)
memory usage: 43.5+ MB


We can see that every column has dtype object (the values are still strings) rather than a real int or float, so let's convert them.

In [13]:
df_JPL =df_JPL.astype({"SPK-ID": int, "object fullname": str,"a":float,"e":float,"i":float,"node":float,"peri":float,"q":float,"Q":float,"period":float,"H":float,"NEO":str,"PHA":str,"diameter":float},errors = 'ignore')
df_JPL['diameter'] = pd.to_numeric(df_JPL['diameter'])
df_JPL.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 407000 entries, 0 to 406999
Data columns (total 14 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   SPK-ID           407000 non-null  int32  
 1   object fullname  407000 non-null  object 
 2   a                407000 non-null  float64
 3   e                407000 non-null  float64
 4   i                407000 non-null  float64
 5   node             407000 non-null  float64
 6   peri             407000 non-null  float64
 7   q                407000 non-null  float64
 8   Q                407000 non-null  float64
 9   period           407000 non-null  float64
 10  H                407000 non-null  float64
 11  NEO              407000 non-null  object 
 12  PHA              407000 non-null  object 
 13  diameter         114781 non-null  float64
dtypes: float64(10), int32(1), object(3)
memory usage: 41.9+ MB
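
The extra `pd.to_numeric` call is needed because the empty strings in the diameter column cannot be cast to float, and with `errors='ignore'` such a column is presumably left unchanged by `astype`. A small sketch of the same conversion on toy data (the frame `raw` is made up for illustration) shows how `errors="coerce"` turns empty cells into NaN:

```python
import pandas as pd

# Toy frame that mimics the scraped strings; '' stands for an empty table cell.
raw = pd.DataFrame({
    "a": ["2.766", "2.774", "3.169"],
    "diameter": ["939.4", "545", ""],
})

# errors="coerce" replaces anything that cannot be parsed with NaN instead of raising.
numeric = raw.apply(pd.to_numeric, errors="coerce")

print(numeric.dtypes)
print(numeric["diameter"].isna().sum())
```

Coercing is convenient, but it also hides genuinely malformed values, so it is worth counting the NaNs afterwards as done here.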


As we can see in the table above, the diameter column contains a lot of null data: only 114,781 of the 407,000 rows have a value.
After a quick Google search, it turns out that the missing values can be filled in from Wikipedia.

In [14]:
def crawlDiamfromwiki(id):
    url = f"https://en.wikipedia.org/wiki/List_of_minor_planets:_{id}-{id+999}"
    response = requests.get(url)
    if response.status_code != 200:
        return None
     
    soup1 = BeautifulSoup(response.content, "html.parser")

    ## each page includes about 10 tables of data
    tables = soup1.find_all('table', class_="wikitable")
    diameters = []
    
    for table in tables:
        for row in table.find_all("tr"):
            cells = row.find_all("td")
            if len(cells) > 0:  ## skip rows without <td> cells
                ## the 8th cell starts with the diameter; keep the number before the first space
                diameters.append(float(cells[7].string.split()[0]))

    return pd.Series(diameters)


In [15]:
CONST_ROWS = len(df_JPL.index) ##number of rows in the dataframe

## Series.append was removed in pandas 2.0, so collect the pages and concatenate once
wiki_pages = [crawlDiamfromwiki(i) for i in range(1, CONST_ROWS, 1000)]
diameters = pd.concat(wiki_pages, ignore_index=True)

diameters

0         939.0
1         545.0
2         247.0
3         525.0
4         107.0
          ...  
406995      2.2
406996      3.7
406997      2.8
406998      3.9
406999      2.9
Length: 407000, dtype: float64

After a careful check, we can see that our dataframe and Wikipedia's data are ordered in the same way, so we don't need to worry about an extra key column such as an id. Now that we have crawled the diameters from Wikipedia's asteroid lists, we can compare them with our dataframe: wherever a row has no diameter, we add it from the new series.

In [16]:
## fill only the missing diameters; fillna matches rows by index, and both objects use the default 0..n-1 index
df_JPL['diameter'] = df_JPL['diameter'].fillna(diameters)

df_JPL.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 407000 entries, 0 to 406999
Data columns (total 14 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   SPK-ID           407000 non-null  int32  
 1   object fullname  407000 non-null  object 
 2   a                407000 non-null  float64
 3   e                407000 non-null  float64
 4   i                407000 non-null  float64
 5   node             407000 non-null  float64
 6   peri             407000 non-null  float64
 7   q                407000 non-null  float64
 8   Q                407000 non-null  float64
 9   period           407000 non-null  float64
 10  H                407000 non-null  float64
 11  NEO              407000 non-null  object 
 12  PHA              407000 non-null  object 
 13  diameter         407000 non-null  float64
dtypes: float64(10), int32(1), object(3)
memory usage: 41.9+ MB
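
Filling by position only works while both sources list exactly the same asteroids in the same order. The last rows shown above pair index 406998 with asteroid 424999, so position and asteroid number may not line up everywhere. A more robust variant, sketched here on toy data, looks each row up by its asteroid number. It assumes that for numbered asteroids SPK-ID = 2000000 + asteroid number (which holds for the rows shown above) and that the Wikipedia series is indexed by asteroid number; `jpl` and `wiki` are made-up stand-ins.

```python
import pandas as pd

# Toy stand-ins: JPL rows where asteroid 3 is missing from the crawl,
# and a Wikipedia series that covers asteroids 1 to 4.
jpl = pd.DataFrame({
    "SPK-ID": [2000001, 2000002, 2000004],
    "diameter": [939.4, float("nan"), float("nan")],
})
wiki = pd.Series([939.0, 545.0, 247.0, 525.0], index=[1, 2, 3, 4])  # indexed by asteroid number

# Assumption: SPK-ID = 2000000 + asteroid number for numbered asteroids.
number = jpl["SPK-ID"] - 2_000_000

# Look each row up by its own asteroid number instead of by its position in the frame.
jpl["diameter"] = jpl["diameter"].fillna(number.map(wiki))
print(jpl)
```

With a positional fill, the third row would have received Juno's 247 km; the lookup by number gives it Vesta's 525 km instead, while the diameter that JPL already provided for Ceres is left untouched.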


Let's convert the NEO and PHA columns to numeric values with get_dummies.

In [21]:
df_JPL = pd.get_dummies(df_JPL,columns = ['NEO','PHA'])

##get_dummies creates the columns NEO_N/NEO_Y and PHA_N/PHA_Y; we keep only the _Y columns

df_JPL['NEO'] = df_JPL['NEO_Y']  ## 1 = near-Earth object, 0 = not near Earth
df_JPL['PHA'] = df_JPL['PHA_Y']  ## 1 = potentially hazardous, 0 = not hazardous

df_JPL.drop(columns = ['NEO_N','NEO_Y','PHA_N','PHA_Y'],inplace=True)

df_JPL

Unnamed: 0,SPK-ID,object fullname,a,e,i,node,peri,q,Q,period,H,diameter,NEO,PHA
0,2000001,1 Ceres (A801 AA),2.766,0.0782,10.59,80.27,73.72,2.550,2.98,4.60,3.5,939.400,0,0
1,2000002,2 Pallas (A802 FA),2.774,0.2298,34.85,172.97,310.29,2.137,3.41,4.62,4.2,545.000,0,0
2,2000003,3 Juno (A804 RA),2.668,0.2570,12.99,169.85,248.03,1.982,3.35,4.36,5.3,246.596,0,0
3,2000004,4 Vesta (A807 FA),2.362,0.0884,7.14,103.81,150.92,2.153,2.57,3.63,3.3,525.400,0,0
4,2000005,5 Astraea (A845 XA),2.574,0.1908,5.37,141.57,358.63,2.083,3.06,4.13,7.0,106.699,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
406995,2424996,424996 (2009 CZ40),3.169,0.1102,9.40,140.76,273.09,2.820,3.52,5.64,16.0,2.200,0,0
406996,2424997,424997 (2009 CH45),2.987,0.1178,8.14,324.59,79.77,2.635,3.34,5.16,16.7,3.700,0,0
406997,2424998,424998 (2009 CU48),3.172,0.0740,11.88,307.42,319.26,2.937,3.41,5.65,16.0,4.756,0,0
406998,2424999,424999 (2009 CB58),3.148,0.0686,13.82,177.56,273.23,2.932,3.36,5.59,16.5,3.900,0,0
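
For a two-valued Y/N flag, a direct comparison gives the same 0/1 column without creating and dropping helper columns. It also avoids depending on the dtype that get_dummies returns, which in recent pandas versions is boolean by default as far as I know. A minimal sketch on toy data (the frame `flags` is made up):

```python
import pandas as pd

flags = pd.DataFrame({"NEO": ["N", "Y", "N"], "PHA": ["N", "Y", "N"]})  # toy data

# Compare against 'Y' and cast the resulting booleans to 0/1 integers.
for col in ["NEO", "PHA"]:
    flags[col] = (flags[col] == "Y").astype(int)

print(flags)
```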


In [22]:
df_JPL.drop_duplicates()

Unnamed: 0,SPK-ID,object fullname,a,e,i,node,peri,q,Q,period,H,diameter,NEO,PHA
0,2000001,1 Ceres (A801 AA),2.766,0.0782,10.59,80.27,73.72,2.550,2.98,4.60,3.5,939.400,0,0
1,2000002,2 Pallas (A802 FA),2.774,0.2298,34.85,172.97,310.29,2.137,3.41,4.62,4.2,545.000,0,0
2,2000003,3 Juno (A804 RA),2.668,0.2570,12.99,169.85,248.03,1.982,3.35,4.36,5.3,246.596,0,0
3,2000004,4 Vesta (A807 FA),2.362,0.0884,7.14,103.81,150.92,2.153,2.57,3.63,3.3,525.400,0,0
4,2000005,5 Astraea (A845 XA),2.574,0.1908,5.37,141.57,358.63,2.083,3.06,4.13,7.0,106.699,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
406995,2424996,424996 (2009 CZ40),3.169,0.1102,9.40,140.76,273.09,2.820,3.52,5.64,16.0,2.200,0,0
406996,2424997,424997 (2009 CH45),2.987,0.1178,8.14,324.59,79.77,2.635,3.34,5.16,16.7,3.700,0,0
406997,2424998,424998 (2009 CU48),3.172,0.0740,11.88,307.42,319.26,2.937,3.41,5.65,16.0,4.756,0,0
406998,2424999,424999 (2009 CB58),3.148,0.0686,13.82,177.56,273.23,2.932,3.36,5.59,16.5,3.900,0,0
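
Keep in mind that `drop_duplicates()` returns a new dataframe and leaves `df_JPL` itself unchanged, and that the truncated display cannot show whether any rows were removed. A quick way to count duplicates explicitly and keep the cleaned result is sketched below on toy data (the frame `toy` is made up):

```python
import pandas as pd

toy = pd.DataFrame({
    "SPK-ID": [2000001, 2000002, 2000002],
    "a": [2.766, 2.774, 2.774],
})

n_dupes = int(toy.duplicated().sum())               # rows identical to an earlier row
toy = toy.drop_duplicates().reset_index(drop=True)  # assign back to keep the result

print(n_dupes, len(toy))
```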


As you can see, this filled in our diameter column, and we converted the string objects to numeric types. We didn't find any duplicates, and that completes our dataframe : )
Now we'll store our final data as a csv file.

In [None]:
df_JPL.to_csv('df_JPLfinal.csv', index=False)