<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo">
    </a>
</p>


# **SpaceX Falcon 9 First Stage Landing Prediction**


## Web Scraping Falcon 9 and Falcon Heavy Launch Records from Wikipedia


Estimated time needed: **40** minutes


In this lab, you will use web scraping to collect Falcon 9 historical launch records from a Wikipedia page titled `List of Falcon 9 and Falcon Heavy launches`:

https://en.wikipedia.org/wiki/List_of_Falcon_9_and_Falcon_Heavy_launches


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/labs/module_1_L2/images/Falcon9_rocket_family.svg)


A successful Falcon 9 first-stage landing is shown here:


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/api/Images/landing_1.gif)


Several examples of an unsuccessful landing are shown here:


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/api/Images/crash.gif)


More specifically, the launch records are stored in an HTML table, shown below:


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/labs/module_1_L2/images/falcon9-launches-wiki.png)


## Objectives
Web scrape Falcon 9 launch records with `BeautifulSoup`:
- Extract a Falcon 9 launch records HTML table from Wikipedia
- Parse the table and convert it into a Pandas data frame


First, let's import the required packages for this lab.


In [38]:
!pip3 install beautifulsoup4
!pip3 install requests



In [39]:
import sys

import requests
from bs4 import BeautifulSoup
import re
import unicodedata
import pandas as pd

Next, we provide some helper functions for you to process the web-scraped HTML table.


In [46]:
def date_time(table_cells):
    """
    Return the date and time from an HTML table cell.
    Input: a table data cell (<td>) element
    """
    return [data_time.strip() for data_time in list(table_cells.strings)][0:2]

def booster_version(table_cells):
    """
    Return the booster version from an HTML table cell.
    Input: a table data cell (<td>) element
    """
    # Keep the strings at even positions, then drop the last of them
    out = ''.join([booster_version for i, booster_version in enumerate(table_cells.strings) if i % 2 == 0][0:-1])
    return out

def landing_status(table_cells):
    """
    Return the landing status from an HTML table cell.
    Input: a table data cell (<td>) element
    """
    # The status is the first string in the cell
    out = [i for i in table_cells.strings][0]
    return out


def get_mass(table_cells):
    """
    Return the payload mass (up to and including "kg") from an HTML table cell.
    Input: a table data cell (<td>) element
    """
    mass = unicodedata.normalize("NFKD", table_cells.text).strip()
    if not mass:
        return 0
    end = mass.find("kg")
    # If the cell has no "kg" unit, return the cell text unchanged
    return mass[0:end + 2] if end != -1 else mass


def extract_column_from_header(row):
    """
    Return the column name from an HTML table header cell.
    Input: a table header cell (<th>) element
    """
    # Remove line breaks, links, and reference marks before reading the text
    if row.br:
        row.br.extract()
    if row.a:
        row.a.extract()
    if row.sup:
        row.sup.extract()

    column_name = ' '.join(row.contents)

    # Filter out names that consist only of digits (returns None for those)
    if not column_name.strip().isdigit():
        column_name = column_name.strip()
        return column_name
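
The role of `unicodedata.normalize` in `get_mass` is easier to see on a plain string. The sketch below is a hypothetical string-only variant (`trim_mass` is not one of the lab's helpers), and the sample cell text is made up for illustration. It assumes the cell uses a non-breaking space before the unit, which NFKD normalization turns into an ordinary space.

```python
import unicodedata

def trim_mass(cell_text):
    """Hypothetical string-only variant of get_mass, for illustration."""
    # NFKD maps the non-breaking space (U+00A0) to an ordinary space
    mass = unicodedata.normalize("NFKD", cell_text).strip()
    if not mass:
        return 0
    end = mass.find("kg")
    # Keep everything up to and including "kg"; leave text without a unit unchanged
    return mass[:end + 2] if end != -1 else mass

# Made-up cell text with non-breaking spaces before the units
print(trim_mass("4,700\u00a0kg (10,400\u00a0lb)"))
# Whitespace-only cell
print(trim_mass("   "))
```

The first call prints `4,700 kg` and the second prints `0`, the placeholder used for empty cells.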


To keep the lab tasks consistent, you will scrape the data from a snapshot of the `List of Falcon 9 and Falcon Heavy launches` Wikipedia page, last updated on `9th June 2021`.


In [47]:
static_url = "https://en.wikipedia.org/w/index.php?title=List_of_Falcon_9_and_Falcon_Heavy_launches&oldid=1027686922"
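
The `oldid` query parameter is what pins this URL to one fixed revision of the page. As a small sketch using only the standard library, the URL can be split into its parts to confirm which page and revision it points to. The variable names `snapshot_url`, `parts`, and `query` are illustrative and not part of the lab.

```python
from urllib.parse import urlparse, parse_qs

# Same snapshot URL as static_url above, under an illustrative name
snapshot_url = "https://en.wikipedia.org/w/index.php?title=List_of_Falcon_9_and_Falcon_Heavy_launches&oldid=1027686922"

parts = urlparse(snapshot_url)
query = parse_qs(parts.query)  # each value is a list of strings

print(parts.netloc)       # host name
print(query["title"][0])  # page title
print(query["oldid"][0])  # revision identifier
```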

Next, request the HTML page from the above URL and get a `response` object.


### TASK 1: Request the Falcon9 Launch Wiki page from its URL


First, let's use the HTTP GET method to request the Falcon9 Launch HTML page and receive an HTTP response.


In [None]:
import requests

# Snapshot URL of the Wikipedia launch list (same value as static_url above)
static_url = "https://en.wikipedia.org/w/index.php?title=List_of_Falcon_9_and_Falcon_Heavy_launches&oldid=1027686922"

# Use requests.get() to fetch the HTML page
response = requests.get(static_url)

# Check if the request was successful
if response.status_code == 200:
    print("Request successful!")
    print(f"Status Code: {response.status_code}")
else:
    print(f"Request failed with status code: {response.status_code}")



Create a `BeautifulSoup` object from the HTML `response`


In [None]:
from bs4 import BeautifulSoup
import requests

# Snapshot URL of the Wikipedia launch list
static_url = "https://en.wikipedia.org/w/index.php?title=List_of_Falcon_9_and_Falcon_Heavy_launches&oldid=1027686922"

# Send the HTTP GET request; raise an HTTPError if it failed
response = requests.get(static_url)
response.raise_for_status()
print("Request successful!")

# Create a BeautifulSoup object from the response text
soup = BeautifulSoup(response.text, 'html.parser')
print("BeautifulSoup object created successfully.")



Print the page title to verify that the `BeautifulSoup` object was created properly.


In [None]:
import requests
from bs4 import BeautifulSoup

# Snapshot URL of the Wikipedia launch list
static_url = "https://en.wikipedia.org/w/index.php?title=List_of_Falcon_9_and_Falcon_Heavy_launches&oldid=1027686922"

# Perform an HTTP GET request; raise an HTTPError if it failed
response = requests.get(static_url)
response.raise_for_status()
print("Request successful!")

# Create a BeautifulSoup object to parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')

# Print the page title to verify
print("\nPage Title:")
print(soup.title.string)  # Extracts the title text


### TASK 2: Extract all column/variable names from the HTML table header


Next, we want to collect all relevant column names from the HTML table header


Let's try to find all tables on the wiki page first. If you need to refresh your memory about `BeautifulSoup`, please check the external reference link towards the end of this lab


In [None]:
import requests
from bs4 import BeautifulSoup

# Snapshot URL of the Wikipedia launch list
static_url = "https://en.wikipedia.org/w/index.php?title=List_of_Falcon_9_and_Falcon_Heavy_launches&oldid=1027686922"

# Perform an HTTP GET request; raise an HTTPError if it failed
response = requests.get(static_url)
response.raise_for_status()
print("Request successful!")

# Parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')

# Find all tables on the page
html_tables = soup.find_all('table')

print(f"\nNumber of tables found: {len(html_tables)}")

# Extract headers from the first table (as an example)
if len(html_tables) > 0:
    first_table = html_tables[0]
    headers = [th.text.strip() for th in first_table.find_all('th')]

    print("\nTable Headers:")
    print(headers)
else:
    print("No tables found on the page.")




The third table is our target: it is the first table that contains the actual launch records.


In [None]:
import requests
from bs4 import BeautifulSoup

# Snapshot URL of the Wikipedia launch list
static_url = "https://en.wikipedia.org/w/index.php?title=List_of_Falcon_9_and_Falcon_Heavy_launches&oldid=1027686922"

# Perform an HTTP GET request; raise an HTTPError if it failed
response = requests.get(static_url)
response.raise_for_status()
print("Request successful!")

# Parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')

# Find all tables on the page
html_tables = soup.find_all('table')

print(f"\nNumber of tables found: {len(html_tables)}")

# Target the third table (index 2 since indexing starts at 0)
if len(html_tables) >= 3:
    first_launch_table = html_tables[2]

    # Extract headers from the target table
    headers = [th.text.strip() for th in first_launch_table.find_all('th')]

    print("\nHeaders of the Target Table (Third Table):")
    print(headers)
else:
    print("There are not enough tables on the page to select the third table.")


You should be able to see the column names embedded in the table header elements `<th>` as follows:


```
<tr>
<th scope="col">Flight No.
</th>
<th scope="col">Date and<br/>time (<a href="/wiki/Coordinated_Universal_Time" title="Coordinated Universal Time">UTC</a>)
</th>
<th scope="col"><a href="/wiki/List_of_Falcon_9_first-stage_boosters" title="List of Falcon 9 first-stage boosters">Version,<br/>Booster</a> <sup class="reference" id="cite_ref-booster_11-0"><a href="#cite_note-booster-11">[b]</a></sup>
</th>
<th scope="col">Launch site
</th>
<th scope="col">Payload<sup class="reference" id="cite_ref-Dragon_12-0"><a href="#cite_note-Dragon-12">[c]</a></sup>
</th>
<th scope="col">Payload mass
</th>
<th scope="col">Orbit
</th>
<th scope="col">Customer
</th>
<th scope="col">Launch<br/>outcome
</th>
<th scope="col"><a href="/wiki/Falcon_9_first-stage_landing_tests" title="Falcon 9 first-stage landing tests">Booster<br/>landing</a>
</th></tr>
```


Next, we just need to iterate through the `<th>` elements and apply the provided `extract_column_from_header()` to extract the column names one by one.
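
To see what the helper returns for headers like these, here is a small offline sketch. It repeats the logic of the provided `extract_column_from_header()` so that it runs on its own, and it parses a shortened copy of the header row above (some attributes are trimmed for brevity). The names `sample_html`, `sample_soup`, and `sample_names` are illustrative only.

```python
from bs4 import BeautifulSoup

def extract_column_from_header(row):
    """Copy of the lab's helper: drop <br>, <a>, <sup>, then join the remaining text."""
    if row.br:
        row.br.extract()
    if row.a:
        row.a.extract()
    if row.sup:
        row.sup.extract()
    column_name = ' '.join(row.contents)
    if not column_name.strip().isdigit():
        return column_name.strip()

# Header cells taken from the snippet above, with attributes shortened
sample_html = """
<table><tr>
<th scope="col">Flight No.
</th>
<th scope="col">Date and<br/>time (<a href="/wiki/Coordinated_Universal_Time">UTC</a>)
</th>
<th scope="col"><a href="/wiki/List_of_Falcon_9_first-stage_boosters">Version,<br/>Booster</a> <sup class="reference"><a href="#cite_note-booster-11">[b]</a></sup>
</th>
<th scope="col">Launch<br/>outcome
</th>
</tr></table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_names = [extract_column_from_header(th) for th in sample_soup.find_all("th")]
print(sample_names)
```

Because the helper removes the whole `<a>` element, the linked `UTC` text disappears and the date header comes out as `Date and time ( )`. The `Version, Booster` header sits entirely inside a link, so it yields an empty string. This is why empty names are filtered out and why a `Version Booster` column is added by hand in Task 3.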


In [None]:
column_names = []

# Apply find_all() function with `th` element on first_launch_table
# Iterate each th element and apply the provided extract_column_from_header() to get a column name
# Append the Non-empty column name (`if name is not None and len(name) > 0`) into a list called column_names
import requests
from bs4 import BeautifulSoup

# Step 1: Fetch the snapshot page; raise an HTTPError if the request failed
static_url = "https://en.wikipedia.org/w/index.php?title=List_of_Falcon_9_and_Falcon_Heavy_launches&oldid=1027686922"
response = requests.get(static_url)
response.raise_for_status()
print("Request successful!")

# Step 2: Parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')

# Step 3: Find all tables
html_tables = soup.find_all('table')

# Step 4: Select the third table, the first one with launch records
first_launch_table = html_tables[2]

# Step 5: Extract column names from <th> elements with the provided helper
for th in first_launch_table.find_all('th'):
    column_name = extract_column_from_header(th)
    if column_name is not None and len(column_name) > 0:  # Non-empty names only
        column_names.append(column_name)

# Print the extracted column names
print("\nExtracted Column Names:")
print(column_names)



Check the extracted column names


In [None]:
print(column_names)

### TASK 3: Create a data frame by parsing the launch HTML tables


We will create an empty dictionary with keys from the column names extracted in the previous task. Later, this dictionary will be converted into a Pandas data frame.


In [None]:
# Step 1: Start with a list of column names
column_names = [
    'Flight No.', 'Date and time ( )', 'Launch site', 'Payload', 
    'Payload mass', 'Orbit', 'Customer', 'Launch outcome'
]

# Step 2: Create a dictionary with keys from column_names and empty values
launch_dict = dict.fromkeys(column_names)

# Step 3: Remove an irrelevant column
del launch_dict['Date and time ( )']

# Step 4: Initialize each key in launch_dict with an empty list
launch_dict['Flight No.'] = []
launch_dict['Launch site'] = []
launch_dict['Payload'] = []
launch_dict['Payload mass'] = []
launch_dict['Orbit'] = []
launch_dict['Customer'] = []
launch_dict['Launch outcome'] = []

# Step 5: Add new columns with empty lists
launch_dict['Version Booster'] = []
launch_dict['Booster landing'] = []
launch_dict['Date'] = []
launch_dict['Time'] = []

# Print the resulting dictionary
print("Initialized launch_dict:")
print(launch_dict)


Next, we just need to fill up the `launch_dict` with launch records extracted from table rows.


HTML tables in Wiki pages often contain unexpected annotations and other kinds of noise, such as reference links `B0004.1[8]`, missing values `N/A [e]`, inconsistent formatting, etc.
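
As an illustration of this kind of noise, the sketch below removes bracketed reference marks from a cell's text with a regular expression. The function `strip_reference_marks` is a hypothetical helper, not one of the functions provided with this lab, and it assumes that any square-bracketed text in a cell is a reference mark.

```python
import re

def strip_reference_marks(text):
    """Hypothetical helper: drop bracketed marks such as [8] or [e] from cell text."""
    # Remove each "[...]" group together with any whitespace just before it
    return re.sub(r"\s*\[[^\]]*\]", "", text).strip()

print(strip_reference_marks("B0004.1[8]"))
print(strip_reference_marks("N/A [e]"))
```

This prints `B0004.1` and `N/A`. The provided helpers take a different route and select specific strings inside each cell, so this function is only a way to see the problem in isolation.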


To simplify the parsing process, we have provided a code snippet below to help you fill in the `launch_dict`. Complete any TODOs in the snippet, or write your own logic to parse all launch tables:


In [None]:
# Initialize extracted_row
extracted_row = 0

# Iterate through tables with the specified class
for table_number, table in enumerate(soup.find_all('table', "wikitable plainrowheaders collapsible")):
    # Iterate through each row in the table
    for rows in table.find_all("tr"):
        # Check whether the row header holds a launch number.
        # Reset the flag for every row so a stale value is never reused.
        flag = False
        if rows.th and rows.th.string:
            flight_number = rows.th.string.strip()
            flag = flight_number.isdigit()  # True only for numeric flight numbers

        # Get table elements
        row = rows.find_all('td')

        # If it's a valid flight number, save the cells in the dictionary
        if flag:
            extracted_row += 1

            # Extract and append data to launch_dict
            # Flight Number
            launch_dict['Flight No.'].append(flight_number)

            # Date and Time
            datatimelist = date_time(row[0])
            date = datatimelist[0].strip(',')  # Date
            time = datatimelist[1]  # Time
            launch_dict['Date'].append(date)
            launch_dict['Time'].append(time)

            # Booster Version
            bv = booster_version(row[1])
            if not bv and row[1].a:
                bv = row[1].a.string
            launch_dict['Version Booster'].append(bv)

            # Launch Site
            launch_site = row[2].a.string if row[2].a else None
            launch_dict['Launch site'].append(launch_site)

            # Payload
            payload = row[3].a.string if row[3].a else None
            launch_dict['Payload'].append(payload)

            # Payload Mass
            payload_mass = get_mass(row[4])
            launch_dict['Payload mass'].append(payload_mass)

            # Orbit
            orbit = row[5].a.string if row[5].a else None
            launch_dict['Orbit'].append(orbit)

            # Customer
            customer = row[6].a.string if row[6].a else None
            launch_dict['Customer'].append(customer)

            # Launch Outcome (first string in the cell, or None if the cell is empty)
            outcome_strings = list(row[7].strings)
            launch_outcome = outcome_strings[0] if outcome_strings else None
            launch_dict['Launch outcome'].append(launch_outcome)

            # Booster Landing
            booster_landing = landing_status(row[8])
            launch_dict['Booster landing'].append(booster_landing)

# Print the updated dictionary
print("\nExtracted Launch Data:")
for key, value in launch_dict.items():
    print(f"{key}: {value[:5]}")  # Print the first 5 values for each key


After you have filled in the parsed launch record values in `launch_dict`, you can create a data frame from it.


In [None]:
import pandas as pd

# Convert the launch_dict filled in above to a Pandas DataFrame
df = pd.DataFrame({key: pd.Series(value) for key, value in launch_dict.items()})

# Display the DataFrame
print("Converted DataFrame:")
print(df)
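
One practical effect of wrapping each list in `pd.Series` is that the conversion still works if the lists end up with different lengths, for example when a row was only partly parsed. The sketch below uses a made-up two-column dictionary (`uneven`) to show the difference. The values are placeholders, not scraped data.

```python
import pandas as pd

# Made-up columns of different lengths
uneven = {"Flight No.": ["1", "2", "3"], "Payload mass": ["0", "525 kg"]}

# Passing the lists directly fails because their lengths differ
try:
    pd.DataFrame(uneven)
except ValueError as err:
    print("Plain dict fails:", err)

# Wrapping each list in a Series pads the shorter column with missing values
uneven_df = pd.DataFrame({key: pd.Series(value) for key, value in uneven.items()})
print(uneven_df.shape)
print(uneven_df["Payload mass"].isna().sum())
```

A missing value in the result is therefore a hint that some column received fewer entries than others during parsing, which is worth checking before using the data.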


In [None]:
import requests
import pandas as pd

# ------------------ Task 1: Extract the year from 'static_fire_date_utc' ------------------

# SpaceX API URL
url = 'https://api.spacexdata.com/v4/launches'

# Send the GET request to the API; raise an HTTPError if it failed
response = requests.get(url)
response.raise_for_status()
print("Request successful!")

# Convert the JSON response to a DataFrame
data = pd.json_normalize(response.json())

# Extract the year from the first row of the 'static_fire_date_utc' column
if 'static_fire_date_utc' in data.columns:
    first_date = data['static_fire_date_utc'].iloc[0]
    year = pd.to_datetime(first_date).year if pd.notna(first_date) else "No date"
    print(f"\n1. The year in the first row of the 'static_fire_date_utc' column is: {year}")
else:
    print("\n1. The 'static_fire_date_utc' column does not exist in the DataFrame.")

# ------------------ Task 2: Count Falcon 9 launches ------------------

# Send another GET request to get the rocket details
rockets_response = requests.get('https://api.spacexdata.com/v4/rockets')
rockets_response.raise_for_status()
rockets_data = rockets_response.json()

# Build a dictionary that maps rocket IDs to rocket names
rocket_names = {rocket['id']: rocket['name'] for rocket in rockets_data}

# Add a 'rocket_name' column by mapping the IDs in the 'rocket' column to names
data['rocket_name'] = data['rocket'].map(rocket_names)

# Filter the Falcon 9 launches
falcon_9_launches = data[data['rocket_name'] == 'Falcon 9']

# Show the number of launches
print(f"\n2. Number of Falcon 9 launches: {len(falcon_9_launches)}")

# ------------------ Task 3: Count missing values in 'landingPad' ------------------

# Extract 'landingPad' from the nested 'cores' column (reuses `data` from Task 1)
if 'cores' in data.columns:
    data['landingPad'] = data['cores'].apply(lambda x: x[0]['landpad'] if len(x) > 0 and 'landpad' in x[0] else None)

    # Count null values in the 'landingPad' column using .isnull()
    missing_landing_pad = data['landingPad'].isnull().sum()

    # Show the result
    print(f"\nNumber of missing values in the 'landingPad' column: {missing_landing_pad}")
else:
    print("\nThe 'cores' column does not exist in the DataFrame.")


<code>df.to_csv('spacex_web_scraped.csv', index=False)</code>


## Authors


<a href="https://www.linkedin.com/in/yan-luo-96288783/">Yan Luo</a>


<a href="https://www.linkedin.com/in/nayefaboutayoun/">Nayef Abou Tayoun</a>


<!--
## Change Log
-->


<!--
| Date (YYYY-MM-DD) | Version | Changed By | Change Description      |
| ----------------- | ------- | ---------- | ----------------------- |
| 2021-06-09        | 1.0     | Yan Luo    | Tasks updates           |
| 2020-11-10        | 1.0     | Nayef      | Created the initial version |
-->


Copyright © 2021 IBM Corporation. All rights reserved.
