<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo">
    </a>
</p>


# **SpaceX Falcon 9 First Stage Landing Prediction**


## Web Scraping Falcon 9 and Falcon Heavy Launch Records from Wikipedia


Estimated time needed: **40** minutes


In this lab, you will use web scraping to collect Falcon 9 historical launch records from the Wikipedia page titled `List of Falcon 9 and Falcon Heavy launches`:

https://en.wikipedia.org/wiki/List_of_Falcon_9_and_Falcon_Heavy_launches


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/labs/module_1_L2/images/Falcon9_rocket_family.svg)


A successful Falcon 9 first-stage landing is shown here:


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/api/Images/landing_1.gif)


Several examples of an unsuccessful landing are shown here:


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/api/Images/crash.gif)


More specifically, the launch records are stored in an HTML table, shown below:


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/labs/module_1_L2/images/falcon9-launches-wiki.png)


## Objectives
Web scrape Falcon 9 launch records with `BeautifulSoup`:
- Extract a Falcon 9 launch records HTML table from Wikipedia
- Parse the table and convert it into a Pandas data frame


First, let's import the required packages for this lab.


In [62]:
!pip -q install beautifulsoup4 requests



In [63]:
import re
import unicodedata
import requests
import pandas as pd
from bs4 import BeautifulSoup

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)



We also provide some helper functions for processing the scraped HTML table.


In [64]:
def date_time(table_cells):
    """Return date and time (list of two strings) from the first table cell."""
    return [s.strip() for s in table_cells.strings][0:2]

def booster_version(table_cells):
    """Return booster version string from the cell (handles interleaved strings)."""
    out = ''.join([bv for i, bv in enumerate(table_cells.strings) if i % 2 == 0][0:-1])
    return out

def landing_status(table_cells):
    """Return landing status from the cell."""
    out = [i for i in table_cells.strings][0]
    return out

def get_mass(table_cells):
    """Extract payload mass ending with 'kg'; return '0' if missing."""
    mass = unicodedata.normalize("NFKD", table_cells.text).strip()
    if mass and ("kg" in mass):
        new_mass = mass[: mass.find("kg") + 2]
    else:
        new_mass = "0"
    return new_mass

def extract_column_from_header(row):
    """Clean a <th> header cell and return the column name (None if it is numeric)."""
    # Remove the first line break, link, and footnote marker, if present.
    if row.br:
        row.br.extract()
    if row.a:
        row.a.extract()
    if row.sup:
        row.sup.extract()
    column_name = ' '.join(row.contents).strip()
    if not column_name.isdigit():
        return column_name
    return None
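
To illustrate what `get_mass` does, here is a minimal sketch that works on a plain string instead of a BeautifulSoup cell. The function name `mass_from_text` and the sample strings are assumptions made for this example and are not part of the lab.

```python
import unicodedata


def mass_from_text(cell_text):
    """String-only variant of get_mass (hypothetical helper for illustration)."""
    # NFKD turns a non-breaking space (U+00A0), which scraped pages often
    # place between a number and its unit, into a regular space.
    mass = unicodedata.normalize("NFKD", cell_text).strip()
    if "kg" in mass:
        # Keep everything up to and including the first "kg".
        return mass[: mass.find("kg") + 2]
    return "0"


print(mass_from_text("4,700\u00a0kg (10,400\u00a0lb)"))
print(mass_from_text("Classified"))
```

Cells without a `kg` value fall back to the string `"0"`, which matches the first rows of the data frame preview later in this lab.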


To keep the lab tasks consistent, you will scrape a snapshot of the `List of Falcon 9 and Falcon Heavy launches` Wikipedia page as updated on `9th June 2021`.


Next, request the HTML page from the snapshot URL and get a `response` object.


In [65]:
# Snapshot required by the lab (9 June 2021), plus fallbacks
candidate_urls = [
    "https://en.wikipedia.org/w/index.php?title=List_of_Falcon_9_and_Falcon_Heavy_launches&oldid=1027686922",
    "https://m.wikipedia.org/w/index.php?title=List_of_Falcon_9_and_Falcon_Heavy_launches&oldid=1027686922",
    "https://en.wikipedia.org/wiki/List_of_Falcon_9_and_Falcon_Heavy_launches"
]

headers = {
    "User-Agent": ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                   "(KHTML, like Gecko) Chrome/122.0 Safari/537.36 "
                   "(contact: student@example.com)"),
    "Accept-Language": "en-US,en;q=0.9"
}

soup = None
last_err = None
used_url = None

for url in candidate_urls:
    try:
        r = requests.get(url, headers=headers, timeout=30, allow_redirects=True)
        r.raise_for_status()
        if "<table" not in r.text.lower():
            continue
        soup = BeautifulSoup(r.text, "html.parser")
        used_url = url
        break
    except Exception as e:
        last_err = e

if soup is None:
    raise RuntimeError(f"Could not fetch the page. Last error: {last_err}")

title_tag = soup.find("title")
print("URL used:", used_url)
print("Title:", title_tag.get_text(strip=True) if title_tag else "(no <title>)")


URL used: https://en.wikipedia.org/w/index.php?title=List_of_Falcon_9_and_Falcon_Heavy_launches&oldid=1027686922
Title: List of Falcon 9 and Falcon Heavy launches - Wikipedia


### TASK 1: Request the Falcon9 Launch Wiki page from its URL


First, perform an HTTP GET request for the Falcon 9 launch page and keep the resulting HTTP response. (The cell above already does this with fallback URLs and stores the parsed page in `soup`.)


In [66]:
# Use the requests.get() method with the provided static_url
# Assign the response to an object

Create a `BeautifulSoup` object from the HTML `response`.


In [67]:
# Use BeautifulSoup() to create a BeautifulSoup object from a response text content



Print the page title to verify that the `BeautifulSoup` object was created properly.


In [68]:
# Use soup.title attribute
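
The template cells above can be filled in along these lines. This is a minimal sketch, assuming `beautifulsoup4` is installed; to stay runnable offline it parses a small inline HTML string (an assumption standing in for `response.text`) and uses the name `soup_demo` so that the lab's `soup` object is not overwritten.

```python
from bs4 import BeautifulSoup

# Stand-in for response.text; in the lab this string comes from requests.get(...).text.
sample_html = """
<html>
  <head><title>List of Falcon 9 and Falcon Heavy launches - Wikipedia</title></head>
  <body><table><tr><th>Flight No.</th></tr></table></body>
</html>
"""

soup_demo = BeautifulSoup(sample_html, "html.parser")

# soup_demo.title is the <title> Tag; .string is its text content.
page_title = soup_demo.title.string
print(page_title)
```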

### TASK 2: Extract all column/variable names from the HTML table header


Next, we want to collect all relevant column names from the HTML table header.


Let's first find all tables on the wiki page. If you need to refresh your memory about `BeautifulSoup`, please check the external reference link towards the end of this lab.


In [69]:
# Use the find_all function in the BeautifulSoup object, with element type `table`
# Assign the result to a list called `html_tables`
# All tables on the page

html_tables = soup.find_all('table')
print("Total tables found:", len(html_tables))



Total tables found: 25


The third table is the first of our target tables, which contain the actual launch records.


In [71]:
# Let's print the third table and check its content
# The lab guide uses the third table: html_tables[2].
# Fall back to a class-based lookup in case the table order changes.
try:
    first_launch_table = html_tables[2]
    assert first_launch_table.find('th') is not None
except Exception:
    # Look for the specific class string used by the launch tables
    candidates = soup.find_all('table', class_='wikitable plainrowheaders collapsible')
    if candidates:
        first_launch_table = candidates[0]
    else:
        # Fall back to the first 'wikitable'
        candidates = soup.find_all('table', class_='wikitable')
        first_launch_table = candidates[0]

type(first_launch_table), len(first_launch_table.find_all('tr'))

(bs4.element.Tag, 16)
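
One detail of the class-based fallback: when `class_` is given a string containing several class names, BeautifulSoup compares it against the exact value of the `class` attribute, so a table whose classes are written in a different order is not matched. A CSS selector does not depend on the order. The sketch below uses made-up HTML (an assumption, not the real page) to show the difference.

```python
from bs4 import BeautifulSoup

html = """
<table class="wikitable plainrowheaders collapsible"><tr><th>1</th></tr></table>
<table class="wikitable collapsible plainrowheaders"><tr><th>2</th></tr></table>
<table class="wikitable"><tr><th>3</th></tr></table>
"""
demo = BeautifulSoup(html, "html.parser")

# Exact-string match: only the table whose class attribute is written in this order.
exact = demo.find_all("table", class_="wikitable plainrowheaders collapsible")

# CSS selector: every table that carries all three classes, in any order.
by_selector = demo.select("table.wikitable.plainrowheaders.collapsible")

print(len(exact), len(by_selector))
```

If the class order on the live page ever changes, the selector form would still find the launch tables, while the exact-string form would return an empty list.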

You should be able to see the column names embedded in the table header elements `<th>` as follows:


```
<tr>
<th scope="col">Flight No.
</th>
<th scope="col">Date and<br/>time (<a href="/wiki/Coordinated_Universal_Time" title="Coordinated Universal Time">UTC</a>)
</th>
<th scope="col"><a href="/wiki/List_of_Falcon_9_first-stage_boosters" title="List of Falcon 9 first-stage boosters">Version,<br/>Booster</a> <sup class="reference" id="cite_ref-booster_11-0"><a href="#cite_note-booster-11">[b]</a></sup>
</th>
<th scope="col">Launch site
</th>
<th scope="col">Payload<sup class="reference" id="cite_ref-Dragon_12-0"><a href="#cite_note-Dragon-12">[c]</a></sup>
</th>
<th scope="col">Payload mass
</th>
<th scope="col">Orbit
</th>
<th scope="col">Customer
</th>
<th scope="col">Launch<br/>outcome
</th>
<th scope="col"><a href="/wiki/Falcon_9_first-stage_landing_tests" title="Falcon 9 first-stage landing tests">Booster<br/>landing</a>
</th></tr>
```


Next, we just need to iterate through the `<th>` elements and apply the provided `extract_column_from_header()` to extract the column names one by one.


In [73]:
# Apply find_all() function with `th` element on first_launch_table
# Iterate each th element and apply the provided extract_column_from_header() to get a column name
# Append the non-empty column name (`if name is not None and len(name) > 0`) into a list called column_names
column_names = []
for th in first_launch_table.find_all('th'):
    name = extract_column_from_header(th)
    if name is not None and len(name) > 0:
        column_names.append(name)



Check the extracted column names


In [74]:
print(column_names)

['Flight No.', 'Date and time ( )', 'Launch site', 'Payload', 'Payload mass', 'Orbit', 'Customer', 'Launch outcome']
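
The printed list has no entry for the `Version, Booster` and `Booster landing` headers. In the header HTML shown earlier, the text of both headers sits inside an `<a>` element, and `extract_column_from_header()` removes the first `<a>` entirely, which leaves an empty name that is then filtered out. The following is a minimal sketch of an alternative, not part of the lab: the helper name `header_text` and the trimmed-down HTML are assumptions for illustration.

```python
from bs4 import BeautifulSoup

header_html = """
<table><tr>
<th scope="col">Launch<br/>outcome
</th>
<th scope="col"><a href="/wiki/Falcon_9_first-stage_landing_tests">Booster<br/>landing</a>
</th>
<th scope="col"><a href="/wiki/List_of_Falcon_9_first-stage_boosters">Version,<br/>Booster</a> <sup class="reference"><a href="#cite_note-booster-11">[b]</a></sup>
</th>
</tr></table>
"""


def header_text(th):
    """Return the visible header text, keeping link text but dropping footnotes."""
    for sup in th.find_all("sup"):
        sup.decompose()  # remove footnote markers such as [b]
    # Join the remaining text fragments with single spaces.
    return th.get_text(" ", strip=True)


demo = BeautifulSoup(header_html, "html.parser")
names = [header_text(th) for th in demo.find_all("th")]
print(names)
```

The lab itself sidesteps the issue by adding the `Version Booster` and `Booster landing` keys by hand in the next task.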


### TASK 3: Create a data frame by parsing the launch HTML tables


We will create an empty dictionary with keys from the column names extracted in the previous task. Later, this dictionary will be converted into a Pandas data frame.


In [75]:
launch_dict = dict.fromkeys(column_names)

# Remove an irrelevant column
del launch_dict['Date and time ( )']

# Initialize each launch_dict value as an empty list
launch_dict['Flight No.'] = []
launch_dict['Launch site'] = []
launch_dict['Payload'] = []
launch_dict['Payload mass'] = []
launch_dict['Orbit'] = []
launch_dict['Customer'] = []
launch_dict['Launch outcome'] = []
# Added some new columns
launch_dict['Version Booster']=[]
launch_dict['Booster landing']=[]
launch_dict['Date']=[]
launch_dict['Time']=[]
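
The same initialization can be written more compactly. This is a minimal sketch, assuming the column list from the cell above; the names `columns`, `launch_dict_demo`, and `shared` are made up for the example. It also shows why `dict.fromkeys(names, [])` is not a safe shortcut: every key would share one list object.

```python
columns = [
    "Flight No.", "Launch site", "Payload", "Payload mass", "Orbit",
    "Customer", "Launch outcome", "Version Booster", "Booster landing",
    "Date", "Time",
]

# One independent empty list per column.
launch_dict_demo = {name: [] for name in columns}

# Pitfall: fromkeys with a mutable default reuses the same list for every key.
shared = dict.fromkeys(columns, [])
shared["Date"].append("4 June 2010")

print(len(launch_dict_demo["Time"]), len(shared["Time"]))
```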

Next, we just need to fill up the `launch_dict` with launch records extracted from table rows.


HTML tables on Wikipedia pages often contain unexpected annotations and other noise, such as reference links (`B0004.1[8]`), missing values (`N/A [e]`), and inconsistent formatting.
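
As an illustration of how such noise could be removed, the sketch below strips bracketed footnote markers with a regular expression. The helper name `strip_reference_marks` and the pattern are assumptions for this example. The lab's own helpers take a different route and select specific strings inside each cell.

```python
import re

# Matches bracketed footnote markers such as [8] or [e], plus spaces before them.
REFERENCE_MARK = re.compile(r"\s*\[[^\]]*\]")


def strip_reference_marks(text):
    """Remove bracketed footnote markers and surrounding whitespace."""
    return REFERENCE_MARK.sub("", text).strip()


for raw in ["B0004.1[8]", "N/A [e]", "Success\n"]:
    print(repr(strip_reference_marks(raw)))
```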


To simplify the parsing process, we have provided an incomplete code snippet below to help you fill `launch_dict`. Complete the TODOs in the snippet, or write your own logic to parse all launch tables:


In [82]:
extracted_row = 0

# Iterate over all launch tables (one per block/year)
for table_number, table in enumerate(soup.find_all('table', "wikitable plainrowheaders collapsible")):
    for rows in table.find_all("tr"):
        # Does the row correspond to a launch? (its <th> holds a number)
        if rows.th and rows.th.string:
            flight_number = rows.th.string.strip()
            flag = flight_number.isdigit()
        else:
            flag = False

        if not flag:
            continue

        row = rows.find_all('td')
        if len(row) < 9:
            # Incomplete row: skip it
            continue

        extracted_row += 1

        # Flight No.
        launch_dict['Flight No.'].append(flight_number)

        # Date & Time
        datatimelist = date_time(row[0])
        date = datatimelist[0].strip(',') if len(datatimelist) > 0 else None
        time = datatimelist[1] if len(datatimelist) > 1 else None
        launch_dict['Date'].append(date)
        launch_dict['Time'].append(time)

        # Version Booster
        bv = booster_version(row[1])
        if not bv:
            try:
                bv = row[1].a.string
            except:
                bv = row[1].get_text(strip=True)
        launch_dict['Version Booster'].append(bv)

        # Launch site
        try:
            launch_site = row[2].a.string
        except:
            launch_site = row[2].get_text(strip=True)
        launch_dict['Launch site'].append(launch_site)

        # Payload
        try:
            payload = row[3].a.string
        except:
            payload = row[3].get_text(strip=True)
        launch_dict['Payload'].append(payload)

        # Payload mass
        payload_mass = get_mass(row[4])
        launch_dict['Payload mass'].append(payload_mass)

        # Orbit
        try:
            orbit = row[5].a.string
        except:
            orbit = row[5].get_text(strip=True)
        launch_dict['Orbit'].append(orbit)

        # Customer
        try:
            customer = row[6].a.string
        except:
            customer = row[6].get_text(strip=True)
        launch_dict['Customer'].append(customer)

        # Launch outcome
        try:
            launch_outcome = list(row[7].strings)[0]
        except:
            launch_outcome = row[7].get_text(strip=True)
        launch_dict['Launch outcome'].append(launch_outcome)

        # Booster landing
        try:
            booster_landing = landing_status(row[8])
        except:
            booster_landing = row[8].get_text(strip=True)
        launch_dict['Booster landing'].append(booster_landing)

extracted_row


121
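
The row filter at the top of the loop relies on each launch row starting with a `<th>` whose text is a flight number. The sketch below isolates that check on a small made-up table; the HTML and the helper name `is_launch_row` are assumptions for illustration.

```python
from bs4 import BeautifulSoup

table_html = """
<table>
<tr><th scope="col">Flight No.</th><th scope="col">Launch site</th></tr>
<tr><th scope="row">1</th><td>CCAFS</td></tr>
<tr><td colspan="2">Free-text note about flight 1.</td></tr>
<tr><th scope="row">2</th><td>CCAFS</td></tr>
</table>
"""


def is_launch_row(tr):
    """True when the row's first <th> holds only digits (a flight number)."""
    return bool(tr.th and tr.th.string and tr.th.string.strip().isdigit())


demo = BeautifulSoup(table_html, "html.parser")
flights = [tr.th.string.strip() for tr in demo.find_all("tr") if is_launch_row(tr)]
print(flights)
```

The header row and the free-text note row are both skipped: the first has a non-numeric `<th>`, and the second has no `<th>` at all.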

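After you have filled the parsed launch record values into `launch_dict`, you can create a data frame from it. Note that the parsing cell appends to the existing lists, so running it again without first re-creating `launch_dict` duplicates every record (for example, 242 rows instead of the 121 extracted in a single pass).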


In [84]:
df = pd.DataFrame({key: pd.Series(value) for key, value in launch_dict.items()})
df.head(), df.shape, df.columns.tolist()

(  Flight No. Launch site                               Payload Payload mass  \
 0          1       CCAFS  Dragon Spacecraft Qualification Unit            0   
 1          2       CCAFS                                Dragon            0   
 2          3       CCAFS                                Dragon       525 kg   
 3          4       CCAFS                          SpaceX CRS-1     4,700 kg   
 4          5       CCAFS                          SpaceX CRS-2     4,877 kg   
 
   Orbit Customer Launch outcome   Version Booster Booster landing  \
 0   LEO   SpaceX      Success\n  F9 v1.07B0003.18         Failure   
 1   LEO     NASA        Success  F9 v1.07B0004.18         Failure   
 2   LEO     NASA        Success  F9 v1.07B0005.18    No attempt\n   
 3   LEO     NASA      Success\n  F9 v1.07B0006.18      No attempt   
 4   LEO     NASA      Success\n  F9 v1.07B0007.18    No attempt\n   
 
               Date   Time  
 0      4 June 2010  18:45  
 1  8 December 2010  15:43  
 2      2
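
The preview shows stray newlines in some values (for example `Success\n` and `No attempt\n`). One way to tidy them is to strip every string before building the data frame. The sketch below does this on a small made-up dictionary, not on the real `launch_dict`. The helper name `strip_values` and the sample data are assumptions.

```python
def strip_values(columns):
    """Return a copy of a dict of lists with whitespace stripped from strings."""
    return {
        name: [v.strip() if isinstance(v, str) else v for v in values]
        for name, values in columns.items()
    }


sample = {
    "Launch outcome": ["Success\n", "Success"],
    "Booster landing": ["Failure", "No attempt\n"],
    "Time": ["18:45", None],
}
cleaned = strip_values(sample)
print(cleaned["Launch outcome"], cleaned["Booster landing"])
```

Non-string entries such as `None` pass through unchanged, so missing dates or times are preserved.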

We can now export it to a <b>CSV</b> for the next section. However, to keep the answers consistent, and in case you have difficulties finishing this lab, the following labs use a provided dataset so that each lab is independent.


<code>df.to_csv('spacex_web_scraped.csv', index=False)</code>


In [85]:
df.to_csv('spacex_web_scraped.csv', index=False)
f"File written: spacex_web_scraped.csv with {df.shape[0]} rows and {df.shape[1]} columns."


'File written: spacex_web_scraped.csv with 242 rows and 11 columns.'

## Authors


<a href="https://www.linkedin.com/in/yan-luo-96288783/">Yan Luo</a>


<a href="https://www.linkedin.com/in/nayefaboutayoun/">Nayef Abou Tayoun</a>


<!--
## Change Log
-->


<!--
| Date (YYYY-MM-DD) | Version | Changed By | Change Description      |
| ----------------- | ------- | ---------- | ----------------------- |
| 2021-06-09        | 1.0     | Yan Luo    | Tasks updates           |
| 2020-11-10        | 1.0     | Nayef      | Created the initial version |
-->


Copyright © 2021 IBM Corporation. All rights reserved.
