# Could Star Wars planets exist in our universe?
<img src="https://www.starwarsnewsnet.com/wp-content/uploads/2020/09/Mando_Season_2_Trailer_01.png"><p style='text-align: right'> Source: www.starwarsnewsnet.com </p>

## Part 1. Data scraping


### Table of contents:
1. Introduction
2. Goal
3. Data scraping
4. Summary

## 1. Introduction

Some time ago I watched an interesting documentary about planets beyond our solar system: how they were discovered, how much data about them we are able to collect, and what life on some of those planets could look like. This immediately woke my nerd instincts and I started to wonder: could the planets that we know from a science fiction universe such as Star Wars exist in our universe? After all, it all happened *a long time ago, in a galaxy far, far away...*

Well, ok - probably not. I am not THAT obsessed. But there can certainly exist similar planets.

Based on the data that we can collect from both the fictional Star Wars universe and our real universe, I decided to look for corresponding pieces of information that we can actually compare. With this information we will be able to find the most similar planets from both universes. This could be an interesting experiment!

This project will be divided into two parts. In this part I will prepare data frames by scraping data from **Wikipedia** and **Wookieepedia** (a wiki for Star Wars). They will be used in part 2, where they will be cleaned and analysed.

## 2. Goal

In this part we will create two data frames, following this scheme:
- find a list of planets on Wikipedia/Wookieepedia
- create an iterable list of planets
- pick a representative planet
- define which data we want to collect
- gather the selected data from all planet pages on Wikipedia/Wookieepedia
- create data frames based on the collected data
- clear the collected data (pre-cleaning, mainly getting rid of typical wiki indexes)
- save the created data frames as csv files

## 3. Data scraping

For this part I will mainly explore the possibilities of the Beautiful Soup library. It is a very useful tool that helps with gathering data from HTML and XML pages. From the documentation:
<blockquote>Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.</blockquote>
You can read more about Beautiful Soup <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc"><b>HERE</b></a>

During this process we will gather data from <a href="https://en.wikipedia.org/wiki/Main_Page">Wikipedia</a> and <a href="https://starwars.fandom.com/wiki/Main_Page">Wookieepedia</a>. For more detailed information I encourage you to check them out.

Without further ado, let's start by importing some libraries. In addition to the core libraries I decided to import the time library as well, to check how long it takes to collect the main data about planets.

In [1]:
import time
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

First of all we will need a list of all Star Wars planets. I have found one <a href="https://starwars.fandom.com/wiki/List_of_planets?so=search">here</a>. As you can notice, what distinguishes a planet name from other elements on this page is that it is a **link** (so it has the html **'a' tag**) and it is **bold** (html **'b' tag**). We can search for and list all elements meeting those conditions, **except the leading elements** that are not planet names, such as the word "planets" (just under the title and the canon/legends buttons).
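To make the selection rule concrete, here is a minimal sketch that runs on a small hand-written HTML fragment instead of the live page. The fragment and the helper name `bold_link_titles` are my own assumptions for illustration; the real page is of course much larger and may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical, hand-written HTML fragment standing in for the real list page.
SAMPLE_HTML = """
<p><b>planets</b></p>
<ul>
  <li><b><a href="/wiki/Alderaan" title="Alderaan">Alderaan</a></b></li>
  <li><a href="/wiki/Core_Worlds" title="Core Worlds">Core Worlds</a></li>
  <li><b>Unlinked bold text</b></li>
  <li><b><a href="/wiki/Tatooine" title="Tatooine">Tatooine</a></b></li>
</ul>
"""

def bold_link_titles(html):
    """Return the title of the first link inside every <b> element that has one."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for bold in soup.find_all("b"):
        link = bold.find("a")
        # Bold text without a link (or a link without a title) is not a planet entry.
        if link is not None and link.get("title"):
            titles.append(link.get("title"))
    return titles

print(bold_link_titles(SAMPLE_HTML))
```

Checking explicitly for a missing link is an alternative to catching the exception; both approaches skip bold elements that do not wrap a link.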

In [2]:
page = requests.get('https://starwars.fandom.com/wiki/List_of_planets?so=search')
soup = BeautifulSoup(page.content, 'html.parser')

In [3]:
planets = []
#skipping the leading bold elements that are not planet names (e.g. the category name "planets")
for i in soup.find_all('b')[2:]:
    try:
        planets.append(i.a.get('title'))
    except AttributeError:
        #bold element without a link inside - not a planet name
        continue
planets

['5251977',
 'Aargonar',
 'Ab Dalis',
 'Abafar',
 'Abednedo (planet)',
 'Abelor',
 'Abregado-rae',
 'Absanz',
 'Actlyon',
 'Aeos Prime',
 'Affa',
 'Affadar',
 'Agamar',
 'Agaris',
 'Agoliba-Ado',
 'Agoliba-Ena',
 'Ahch-To',
 'Ajara',
 'Akiva',
 'Alaris',
 'Alderaan',
 'Aleen',
 'Aleen Minor',
 'Allst Prime',
 'Allyuen',
 'Aloxor',
 'Alpinn',
 'Alsakan',
 'Alun',
 'Ambria',
 'Amethia Prime',
 'Ammon IV',
 'Anantapar',
 'Anaxes',
 'Andelm IV',
 'Ando',
 'Ando Prime',
 'Anelsana',
 'Ankhural',
 'Anoat',
 'Ansion',
 'Antar',
 'Anthan Prime',
 'Aquaris',
 'Aq Vetina',
 'Arbiflux',
 'Arbooine',
 'Ardennia',
 'Argai Minor',
 'Aria Prime',
 'Aridus',
 'Arieli',
 'Arkanis',
 'Aris',
 'Arreyel',
 'Artiod Minor',
 'Arvala-7',
 'Ashas Ree',
 'Askaji',
 'Asmeru',
 'Asusto',
 'Athulla',
 'Atollon',
 'Atterra Alpha',
 'Atterra Bravo',
 'Atterra Primo',
 'Auratera',
 'Avedot',
 'Avidich',
 'Axxila',
 'Bakura',
 'Balnab',
 'Balosar',
 'Bamayar IX',
 "B'ankora homeworld",
 'Baraan-Fa',
 'Barab I',
 'Bar

We have successfully created a list of all Star Wars planets. Now we want to pick one representative planet and determine the attributes that we would like to collect.

Probably the most popular planet in the Star Wars universe is **Tatooine**, the home planet of the Skywalker family, known for its characteristic two suns. Let's check its <a href="https://starwars.fandom.com/wiki/Tatooine">wiki</a>.

After inspecting Tatooine's wiki page, I noticed that every element describing the planet in the infobox has the same html class and an attribute called "data-source", which holds its name. This is the key by which we will collect the list of all Tatooine attributes.

In [4]:
page2 = requests.get('https://starwars.fandom.com/wiki/Tatooine')
soup2 = BeautifulSoup(page2.content, 'html.parser')

In [5]:
#listing all "data source" tags from site
attributes = [i['data-source'] for i in soup2.find_all(attrs={'class':'pi-item pi-data pi-item-spacing pi-border-color'})]
attributes

['region',
 'sector',
 'system',
 'suns',
 'position',
 'moons',
 'coord',
 'routes',
 'distance',
 'lengthday',
 'lengthyear',
 'class',
 'diameter',
 'atmosphere',
 'climate',
 'terrain',
 'water',
 'interest',
 'flora',
 'fauna',
 'species',
 'otherspecies',
 'language',
 'population',
 'cities',
 'imports',
 'exports',
 'affiliation']

This is quite a lot of information, but for my desired data set the more the better. I assume no planet will have more attributes than Tatooine (and if some do, it's not a big deal considering how much detail we already want to gather). Most of them will have less data.

Now we need to write code that will open every planet's wiki page, gather the available data for our attributes (skipping those that are missing) and store it as a list of dictionaries (one dictionary per planet). This will be the base for our planet data frame.

Every Wookieepedia page has the same url with a different ending. This ending is the name of the link, that is, the planet name we collected in our list. In our loop we will create a **new url in every iteration, with a new ending** (from the list of planets). In every iteration we need a second loop that will **collect data for every element of our attribute list**. Attribute data will be stored in the **attributeSet dictionary**, and appended to the **planetsData list**.
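Some planet names contain spaces, brackets or apostrophes, so it can help to see what the resulting urls look like. The sketch below is only an illustration: the helper `planet_url` is hypothetical, and it assumes the MediaWiki convention of writing spaces in page titles as underscores. The notebook itself simply concatenates the strings and lets `requests` handle the encoding.

```python
from urllib.parse import quote

BASE_URL = "https://starwars.fandom.com/wiki/"

def planet_url(title):
    """Build a wiki url from a page title (assumes MediaWiki-style underscores)."""
    # Spaces become underscores; any other unsafe character is percent-encoded.
    return BASE_URL + quote(title.replace(" ", "_"), safe="()'-_")

for title in ["Tatooine", "Ab Dalis", "Abednedo (planet)", "B'ankora homeworld"]:
    print(planet_url(title))
```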

Let's also check **how long** it will take to gather all requested data.

In [6]:
#creating list of dictionaries - our data for planets DataFrame object
planetsData = []
#time marker to measure how long it takes to collect all requested data
start_full = time.time()

for planet in planets:
    attributeSet = {}
    #creating path to planet site
    url = 'https://starwars.fandom.com/wiki/'+planet
    pageSet = requests.get(url)
    soupSet = BeautifulSoup(pageSet.content, 'html.parser')
    
    #getting all data about attributes from planet sites
    for attribute in attributes:
        data = soupSet.find(attrs={'data-source': attribute})
        if data is not None:
            #the value sits in a nested element; skip the attribute if that element is missing
            value = data.find(attrs={'class':'pi-data-value pi-font'})
            if value is not None:
                attributeSet[attribute] = value.get_text()
    planetsData.append(attributeSet)
    
print('Data uploaded!\nIt took {} minutes to collect all requested data'.format(round((time.time()-start_full)/60,1)))

Data uploaded!
It took 23.9 minutes to collect all requested data


Success! It took a while, but we finally have all the data we need. Let's create a data frame and check how it looks.
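A side note on why a list of dictionaries is a convenient input here: pandas builds the columns from all keys it encounters and fills the gaps with NaN. The records below are made-up placeholders (not scraped data), used only to show that behaviour.

```python
import pandas as pd

# Hypothetical records: each dictionary holds only the attributes found for one planet.
records = [
    {"region": "Outer Rim Territories", "suns": "2"},
    {"region": "Core Worlds", "moons": "3"},
]
names = ["Planet A", "Planet B"]

frame = pd.DataFrame(records, index=names)
print(frame)

# Keys missing from a dictionary become NaN in that planet's row.
print(frame.isna().sum().to_dict())
```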

In [7]:
planetsFrame = pd.DataFrame(planetsData, index=planets)
planetsFrame.head()

Unnamed: 0,affiliation,interest,region,system,routes,class,atmosphere,terrain,population,sector,...,species,coord,moons,language,position,flora,lengthday,lengthyear,imports,distance
5251977,Alliance to Restore the Republic[1],,,,,,,,,,...,,,,,,,,,,
Aargonar,Confederacy of Independent Systems[1],Aargonar Separatist base[1],,,,,,,,,...,,,,,,,,,,
Ab Dalis,Alliance to Restore the Republic[2],Keftia district[1]Rendezvous Point Lambda-Four[2],Outer Rim Territories[1],Ab Dalis system[1],A hyperlane[1],Terrestrial[1],Breathable[2],Swamps[1],Over twenty million[1],,...,,,,,,,,,,
Abafar,Unallied (Separatist presence)[1],Rhydonium mining installation[9]The Void[1],Outer Rim Territories[1],Abafar system[3],,,Breathable[7],Desert[1],,Sprizen sector[2],...,,,,,,,,,,
Abednedo (planet),Galactic Empire[3]New Republic[5],Oddy's home[2],Colonies[1],,Corellian Run[1],,,,,,...,Abednedo[3],,,,,,,,,


As we can see, the data contains very typical wiki index boxes (\[*number*\]). From what I see, every attribute has one. We need to **define a function** that will **clear those indexes from our data** and **insert a comma and a space when there are more elements in the cell**. We will do so by converting every left bracket "\[" to a right bracket "\]", splitting the string on it, and creating a new list containing only the elements at even positions (that is, without the index numbers). If this list has at most two elements (originally one element and its index), we return only the first one. If it has more than two (originally more than one element), we return the list joined by a comma and a space (", "). If there is no attribute (NaN value), we return the value unchanged.

In [8]:
def clear_data(data):
    if pd.isna(data):
        return data
    dataSplit = str(data).replace('[',']').split(']')
    dataCleared = [dataSplit[i] for i in range(len(dataSplit)) if i%2==0]
    if len(dataCleared) > 2:
        return ', '.join(dataCleared).rstrip()
    return dataCleared[0].rstrip()
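
As a hedged alternative, the same idea can be sketched with a regular expression. The name `clear_references` is hypothetical and this is not the function used in the rest of the notebook. It assumes the indexes always look like a number in square brackets, and it drops empty fragments before joining, so no separator is left dangling after the last element.

```python
import re

# Matches index markers such as [1] or [12].
REFERENCE_MARKER = re.compile(r"\[\d+\]")

def clear_references(text, separator=", "):
    """Split text on index markers and join the non-empty parts."""
    parts = [part.strip() for part in REFERENCE_MARKER.split(text)]
    return separator.join(part for part in parts if part)

print(clear_references("Keftia district[1]Rendezvous Point Lambda-Four[2]"))
print(clear_references("Galactic Empire[3]New Republic[5]"))
print(clear_references("Breathable[2]"))
```

The sketch expects a string, so when applying it to a whole data frame the `pd.isna` check from the function above would still be needed.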

Now let's apply our function on all elements of planetsFrame.

In [9]:
planetsFrame = planetsFrame.applymap(clear_data)
planetsFrame.head(10)

Unnamed: 0,affiliation,interest,region,system,routes,class,atmosphere,terrain,population,sector,...,species,coord,moons,language,position,flora,lengthday,lengthyear,imports,distance
5251977,Alliance to Restore the Republic,,,,,,,,,,...,,,,,,,,,,
Aargonar,Confederacy of Independent Systems,Aargonar Separatist base,,,,,,,,,...,,,,,,,,,,
Ab Dalis,Alliance to Restore the Republic,"Keftia district, Rendezvous Point Lambda-Four,",Outer Rim Territories,Ab Dalis system,A hyperlane,Terrestrial,Breathable,Swamps,Over twenty million,,...,,,,,,,,,,
Abafar,Unallied (Separatist presence),"Rhydonium mining installation, The Void,",Outer Rim Territories,Abafar system,,,Breathable,Desert,,Sprizen sector,...,,,,,,,,,,
Abednedo (planet),"Galactic Empire, New Republic,",Oddy's home,Colonies,,Corellian Run,,,,,,...,Abednedo,,,,,,,,,
Abelor,,,"Mid Rim Territories, Western Reaches,",,,Terrestrial,,,,,...,,,,,,,,,,
Abregado-rae,,"Unidentified aquarium, Abregado-Rae Spaceport,",Core Worlds,Abregado system,Rimma Trade Route,,,,,,...,,K-13,,,,,,,,
Absanz,"Galactic Empire, Sienar Fleet Systems,",Sienar Fleet Systems factory,,,,,Breathable,Deserts,,,...,,,,,,,,,,
Actlyon,,,"Outer Rim Territories, Western Reaches,",,,Terrestrial,Breathable,Mountains,,,...,,,,,,,,,,
Aeos Prime,Alliance to Restore the Republic,"Aeos Prime rebel outpost, Unidentified Aeosian...",Outer Rim Territories,Aeos system,,Terrestrial,Breathable,Islands,,,...,Aeosian,,,,,,,,,


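Looks much better. Note that cells with more than one value keep a trailing comma (for example "Galactic Empire, New Republic,"), because the split leaves an empty string at the end of the list. The first part is done, so now let's jump to our real-life planets.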

The category we will focus on here is **exoplanets**, that is, planets that exist outside of our solar system. The representative planet, however, will be Earth, since it has the biggest set of attributes and data. We will look at the <a href="https://en.wikipedia.org/wiki/Earth">Earth Wikipedia page</a> and decide which attributes we would like to collect.

After inspecting the page I found another pattern in the attribute names. All of them have the **"class"** attribute set to **"infobox-label"**.

In [10]:
earthPage = requests.get('https://en.wikipedia.org/wiki/Earth')
earthSoup = BeautifulSoup(earthPage.content, 'html.parser')

In [11]:
for i in earthSoup.find_all(attrs={"class":"infobox-label"}):
    print(i.get_text())

Alternative names
Adjectives
Aphelion
Perihelion
Semi-major axis
Eccentricity
Orbital period
Average orbital speed
Mean anomaly
Inclination
Longitude of ascending node
Time of perihelion
Argument of perihelion
Satellites
Mean radius
Equatorial radius
Polar radius
Flattening
Circumference
Surface area
Volume
Mass
Mean density
Surface gravity
Moment of inertia factor
Escape velocity
Synodic rotation period
Sidereal rotation period
Equatorial rotation velocity
Axial tilt
Albedo
Surface equivalent dose rate
Surface pressure
Composition by volume


This time I won't need most of this data. From the exoplanet pages I would like to collect only the data that can be useful for comparison with the Star Wars planets data frame created earlier. I noticed some attributes that correspond with each other:

| Exoplanets | Star Wars planets |
| --- | --- |
| Satellites | moons |
| Orbital period | lengthyear |
| Rotation period | lengthday |
| Mean radius | diameter |

Additionally, I noticed that other planets also have a **"Star"** attribute (not present for Earth, maybe it was too obvious). It corresponds to the **"suns"** attribute of the Star Wars planets, so we will add "Star" to our attribute list as well.

We will call this list attributesEP (EP for ExoPlanets), so we can tell the two lists apart.

In [22]:
attributesEP = ['Star', 'Satellites', 'Orbital period', 'Rotation period', 'Mean radius']

Now we need a list of exoplanets. Sadly, there is no single list on Wikipedia; the exoplanet lists are divided into different categories. I have chosen the lists of exoplanets by method of discovery (5 pages). These pages will be our basis for obtaining the names of exoplanets.

In [13]:
exoplanetPages = ['https://en.wikipedia.org/wiki/List_of_transiting_exoplanets',
              'https://en.wikipedia.org/wiki/List_of_exoplanets_detected_by_radial_velocity',
              'https://en.wikipedia.org/wiki/List_of_exoplanets_detected_by_microlensing',
              'https://en.wikipedia.org/wiki/List_of_exoplanets_detected_by_timing',
              'https://en.wikipedia.org/wiki/List_of_directly_imaged_exoplanets']

There are also multiple tables on those pages, so how do we extract data from the desired table? Looking at the html code we can see that the exoplanets table is the only table that has the **"sortable" value** in its **"class" attribute**. This is our key.
There are also **"dead" links** present here (exoplanets which don't have their own pages yet). While creating our list we want to skip those links; they have the **"page does not exist"** string in their title.

After gathering the data we will convert our list to a pandas Series object, so we can easily apply the **"unique" method** to it, making sure there are no duplicates.
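The filtering and de-duplication steps can also be sketched with plain Python. This is only an illustration of the logic, with a hypothetical helper `unique_titles` and a made-up dead-link title; the notebook itself relies on the pandas `unique` method.

```python
def unique_titles(titles):
    """Drop missing pages and duplicates while keeping first-seen order."""
    kept = [
        title for title in titles
        if title is not None and "page does not exist" not in title
    ]
    # Dictionary keys are unique and preserve insertion order (Python 3.7+).
    return list(dict.fromkeys(kept))

sample = ["Kepler-42c", "WASP-19b", "Kepler-42c", "Example-1b (page does not exist)", None]
print(unique_titles(sample))
```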

In [14]:
exoplanets = []
for url in exoplanetPages:
    pageEP = requests.get(url)
    soupEP = BeautifulSoup(pageEP.content, 'html.parser')
    
    for table in soupEP.find_all('table'):
        #tables without a class attribute return None, so fall back to an empty list
        if 'sortable' in (table.get('class') or []):
            for cell in table.find_all('td'):
                link = cell.find('a')
                title = link.get('title') if link else None
                #skipping cells without a titled link and "dead" links
                if title is not None and "page does not exist" not in title:
                    exoplanets.append(title)
exoplanets = pd.Series(exoplanets)
exoplanets = exoplanets.unique()
exoplanets

array(['Kepler-42c', '55 Cancri e', 'WASP-19b', 'WASP-43b', 'Kepler-10b',
       'COROT-7b', 'WASP-18b', 'WASP-12b', 'OGLE-TR-56b', 'HAT-P-23b',
       'Kepler-42', 'WASP-33b', 'TrES-3b', 'HAT-P-36b', 'WASP-4b',
       'WASP-46b', 'OGLE-TR-113b', 'Kepler-17b', 'COROT-1b', 'COROT-14b',
       'WASP-36b', 'GJ 1214 b', 'Kepler-9d', 'WASP-64b', 'WASP-5b',
       'OGLE-TR-132b', 'WASP-52b', 'COROT-2b', 'SWEEPS-11', 'WASP-3b',
       'Kepler-41b', 'Kepler-42d', 'COROT-18b', 'WASP-50b', 'WASP-48b',
       'HAT-P-32b', 'WASP-2b', 'HAT-P-7b', 'HD 189733 b', 'WASP-14b',
       'WASP-24b', 'WASP-44b', 'KOI-254b', 'TrES-2b', 'OGLE2-TR-L9b',
       'WASP-1b', 'XO-2b', 'Gliese 436 b', 'WASP-32b', 'COROT-21b',
       'WASP-26b', 'HAT-P-16b', 'Kepler-21b', 'HAT-P-5b', 'WASP-49b',
       'WASP-57b', 'COROT-12b', 'HAT-P-20b', 'HD 149026 b', 'HAT-P-3b',
       'HAT-P-13b', 'COROT-11b', 'KOI-135b', 'TrES-1b', 'WASP-41b',
       'HAT-P-4b', 'HAT-P-8b', 'WASP-10b', 'OGLE-TR-10b', 'WASP-16b',
       'WASP-45

This is our list (or rather array) of exoplanets. Now we have everything we need to gather core data about attributes from each planet site.

As before, we will write a loop that builds and opens the page url for each planet from our list, and then gathers data about its attributes. All attribute labels in the infobox have **"infobox-label"** in their **"class"** attribute, and all corresponding values have **"infobox-data"**. This is how we will extract the data.
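Here is a minimal sketch of that label/value pairing on a hand-written infobox fragment. The fragment, its placeholder values and the helper name `read_infobox` are assumptions made for illustration only. Unlike the notebook code, the sketch strips surrounding whitespace from the label before comparing it.

```python
from bs4 import BeautifulSoup

# Hypothetical infobox fragment; the values are placeholders, not real measurements.
SAMPLE_INFOBOX = """
<table class="infobox">
  <tr><th class="infobox-label">Star</th><td class="infobox-data">Example-1</td></tr>
  <tr><th class="infobox-label">Orbital period</th><td class="infobox-data">1.5 d</td></tr>
  <tr><th class="infobox-label">Mass</th><td class="infobox-data">2.0 MJ</td></tr>
  <tr><th class="infobox-header" colspan="2">Discovery</th></tr>
</table>
"""

def read_infobox(html, wanted):
    """Return {label: value} for the wanted labels found in infobox rows."""
    soup = BeautifulSoup(html, "html.parser")
    found = {}
    for row in soup.find_all("tr"):
        label = row.find(class_="infobox-label")
        value = row.find(class_="infobox-data")
        # Skip header rows and rows that lack either half of the pair.
        if label is None or value is None:
            continue
        name = label.get_text(strip=True)
        if name in wanted:
            found[name] = value.get_text(strip=True)
    return found

print(read_infobox(SAMPLE_INFOBOX, ["Star", "Orbital period", "Rotation period"]))
```

Labels that are requested but absent from the page (here "Rotation period") are simply left out of the dictionary, which later shows up as missing values in the data frame.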

Just like in the previous scraping, let's also check **how long** it takes to collect all the data.

In [23]:
exoplanetsData = []
start_full = time.time()

for exoplanet in exoplanets:
    attributeSetEP = {}
    
    #creating path to exoplanet site
    urlEP = 'https://en.wikipedia.org/wiki/'+exoplanet
    pageSetEP = requests.get(urlEP)
    soupSetEP = BeautifulSoup(pageSetEP.content, 'html.parser')
    
    #getting all data about attributes from planet site
    for attributeEP in attributesEP:
        #checking all table rows on planet site
        for i in soupSetEP.find_all('tr'):
            dataLabel = i.find(attrs={"class":"infobox-label"})
            dataInfo = i.find(attrs={"class":"infobox-data"})
            #checking if both label and value exist and if the label text is our desired attribute
            if dataLabel is not None and dataInfo is not None and dataLabel.get_text() == attributeEP:
                attributeSetEP[attributeEP] = dataInfo.get_text()
    exoplanetsData.append(attributeSetEP)
    
print('Data uploaded!\nIt took {} minutes to collect all requested data'.format(round((time.time()-start_full)/60,1)))

Data uploaded!
It took 4.6 minutes to collect all requested data


Great! Now let's create a data frame and take a look.

In [24]:
exoplanetsFrame = pd.DataFrame(exoplanetsData, index=exoplanets)
exoplanetsFrame.head(10)

| | Star | Orbital period | Mean radius |
| --- | --- | --- | --- |
| Kepler-42c | Kepler-42 | 0.45328731±5*10−8[2] d | 0.73±0.03[2] REarth |
| 55 Cancri e | 55 Cancri A | 0.7365474 (± 0.0000014)[2] d17.677 h | 1.875 ± 0.029[2] REarth |
| WASP-19b | WASP-19 | 0.78884 ± 0.0000003 d (18.9321600 ± 7.2×10−6 h... | 1.386±0.032[2] RJ |
| WASP-43b | WASP-43 | 0.81347753 (± 0.00000071)[3] d | 1.04 +0.07−0.09[4] RJ |
| Kepler-10b | Kepler-10[2] | 0.837495[1] d20.0999 h | 1.47+0.03−0.02[3] REarth |
| COROT-7b | CoRoT-7 | 0.853585 ± 0.000024[1] d | 0.14 RJ1.58 ± 0.1 REarth |
| WASP-18b | WASP-18 | 0.94145455+0.00000087−0.00000132[3] d22.59487 h | 1.106+0.072−0.054[2] RJ |
| WASP-12b | WASP-12 | 1.091423 ± 3e-6 d | 1.900+0.057−0.055,[2] 1.736±0.056[3] RJ |
| OGLE-TR-56b | OGLE-TR-56 | 1.211909 ± 0.000001 d29.08582 h | 1.30 ± 0.05 RJ |
| HAT-P-23b | | | |


Apparently we overestimated the amount of available data a bit. As we can see, only the Star, Orbital period and Mean radius columns were created: none of these exoplanet pages provided Satellites or Rotation period data under those labels.

Once again we need to get rid of the wiki indexes. This time we will write a **new function**, because the indexes do not appear only at the end of the values, but also in the **middle** of them. **A comma and a space is not a good separator in this case**, so we will modify the previous function to join the resulting lists with **nothing in between**.

In [25]:
def clear_data2(data):
    if pd.isna(data):
        return data
    dataSplit = str(data).replace('[',']').split(']')
    dataCleared = [dataSplit[i] for i in range(len(dataSplit)) if i%2==0]
    if len(dataCleared) > 2:
        return ''.join(dataCleared).rstrip()
    return dataCleared[0].rstrip()
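
One thing to be aware of: with a single index in the cell, the function above returns only the fragment before it, so a value like "0.73±0.03[2] REarth" loses its unit. As a hedged alternative (not the function used in this notebook), the indexes could be removed with a regular expression substitution that leaves the rest of the text in place. The name `strip_references` is hypothetical, and the sketch assumes indexes always look like a number in square brackets.

```python
import re

# Matches index markers such as [2] or [15].
REFERENCE_MARKER = re.compile(r"\[\d+\]")

def strip_references(text):
    """Remove index markers and leave everything else in place."""
    return REFERENCE_MARKER.sub("", text).strip()

print(strip_references("0.73±0.03[2] REarth"))
print(strip_references("1.900+0.057−0.055,[2] 1.736±0.056[3] RJ"))
```

Keeping the units would matter for any later comparison, since the radii are given in both Earth radii and Jupiter radii.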

In [26]:
exoplanetsFrame = exoplanetsFrame.applymap(clear_data2)
exoplanetsFrame.head(10)

| | Star | Orbital period | Mean radius |
| --- | --- | --- | --- |
| Kepler-42c | Kepler-42 | 0.45328731±5*10−8 | 0.73±0.03 |
| 55 Cancri e | 55 Cancri A | 0.7365474 (± 0.0000014) | 1.875 ± 0.029 |
| WASP-19b | WASP-19 | 0.78884 ± 0.0000003 d (18.9321600 ± 7.2×10−6 h... | 1.386±0.032 |
| WASP-43b | WASP-43 | 0.81347753 (± 0.00000071) | 1.04 +0.07−0.09 |
| Kepler-10b | Kepler-10 | 0.837495 | 1.47+0.03−0.02 |
| COROT-7b | CoRoT-7 | 0.853585 ± 0.000024 | 0.14 RJ1.58 ± 0.1 REarth |
| WASP-18b | WASP-18 | 0.94145455+0.00000087−0.00000132 | 1.106+0.072−0.054 |
| WASP-12b | WASP-12 | 1.091423 ± 3e-6 d | 1.900+0.057−0.055, 1.736±0.056 RJ |
| OGLE-TR-56b | OGLE-TR-56 | 1.211909 ± 0.000001 d29.08582 h | 1.30 ± 0.05 RJ |
| HAT-P-23b | | | |


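That's it. This data frame still doesn't look perfect: we have lots of different signs, descriptions, units etc. We will deal with this in the next part, during data cleaning. One thing to keep in mind: when a cell contains only one index, everything after it is dropped (for example the REarth unit in the first row), because the function then returns only the first fragment. For now let's save our two data frames as csv files that can be used later.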

In [27]:
planetsFrame.to_csv('StarWarsPlanets.csv')
exoplanetsFrame.to_csv('Exoplanets.csv')

## 4. Summary

In this part we have successfully extracted (scraped) the desired data from both wiki sites, creating two data sets containing selected information about planets in the fictional Star Wars universe and in our universe. Based on this data I will try to create an algorithm that will compare those sets to find the most similar planets. First I will need to prepare the data so that it has a homogeneous form and quantifiable values.

Can't wait to see the outcome.