<h1>Crawling Step 2 - Building Aircraft-Engine Dictionary File</h1>

<h2>Preliminary Steps</h2>

Let's begin by importing the necessary libraries:

In [19]:
import requests
import bs4
from bs4 import BeautifulSoup
import json

We also declare a variable that is used throughout this step:

In [23]:
aircrafts_url = "https://aviation-safety.net/database/type/" # The main page of the aircrafts database, to be used in the following section

<h2>The Process</h2>

Before accessing each individual accident page, we found that every page refers to the aircraft model and the engine model in a very specific way. Each accident page links to an info page that describes the aircraft model in more general terms, omitting the sub-model name. Instead of scraping the specific aircraft model mentioned on the accident page, we look for the more general aircraft model as it appears in the website's aircraft model list.

In this section, we scraped data from the aircraft model list to create a Python dictionary object.<br>
The <b>Key</b> - The aircraft model name<br>
The <b>Value</b> - The model's engine type and number of engines

We will use this dictionary while scraping the accident pages in order to match each aircraft model to its engine type and number of engines. While this data is often mentioned on the accident pages, it is missing on some of them. As long as the aircraft model is known, the dictionary lets us fill in the engine data in those cases.

The aircraft model involved in an individual accident will itself be scraped differently, but it will be compared with the content of this dictionary.
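The fallback idea described above can be sketched roughly as follows. This is only an illustration, not the code used in the next step: the function name `fill_engine_info`, the sample dictionary, and the sample page text are all hypothetical, and the sketch assumes that engine data missing from an accident page shows up as `None` or an empty string.

```python
# Hypothetical sample of the dictionary built in this step
# (the two entries mirror the "type (count)" format of the real values)
ae_dict_sample = {
    "Airbus A320": "jet (2)",
    "Antonov An-12": "turboprop (4)",
}

def fill_engine_info(general_model, scraped_engine, ae_dict):
    """Return the engine info found on the accident page, or fall back to the dictionary."""
    if scraped_engine:  # the accident page already states the engines
        return scraped_engine
    # Otherwise look up the general model name; None means the model is not listed
    return ae_dict.get(general_model)

# Engine data missing on the page, model known: the dictionary fills the gap
filled = fill_engine_info("Airbus A320", None, ae_dict_sample)
# Engine data present on the page (hypothetical text): it is kept as-is
kept = fill_engine_info("Antonov An-12", "4 turboprop engines", ae_dict_sample)
# Engine data missing and model not in the dictionary: nothing can be filled in
unknown = fill_engine_info("Some Unlisted Model", "", ae_dict_sample)
```

Using `dict.get` rather than indexing means an unlisted model yields `None` instead of raising a `KeyError`, so the scraping loop can continue and the gap can be handled later.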

In [14]:
response_aircrafts = requests.get(aircrafts_url, headers={'User-Agent': 'Mozilla/5.0'}) # Accesses the page that lists the aircraft models and their engines
aircrafts_soup = BeautifulSoup(response_aircrafts.content, "html.parser") # Creates a BeautifulSoup object for the response we received

ae_table = aircrafts_soup.find("table") # Finds the table object (there's only one in the page)
ae_rows = ae_table.find_all("tr") # Creates a list with all of the table's row elements
ae_size = len(ae_rows) # Calculates the number of rows found in the table, for looping purposes
ae_dict = {} # Creates an empty dictionary, to which the scraped data will be added
for i in range(1, ae_size - 1): # Skips the first row and the last row of the table
    ae_row = ae_rows[i] # Takes the i-th row from the list built above, instead of re-selecting all rows on every iteration
    ae_cells = ae_row.select('td') # Collects the row's cells once
    ae_dict[ae_cells[0].text] = ae_cells[1].text # Uses the text of the row's first cell as the key and the text of its second cell as the value

The dictionary below shows each aircraft model with its engine type and number of engines:

In [15]:
ae_dict

{'Aero Modifications AMI DC-3-65TP': 'turboprop (2)',
 'Aero Spacelines Mini Guppy Turbine': 'turboprop (4)',
 'Aeromarine 75': 'piston (2)',
 'Aérospatiale SN.601 Corvette': 'jet (2)',
 'Airbus A220 / Bombardier Cseries': 'jet (2)',
 'Airbus A220-100': 'jet (2)',
 'Airbus A220-300': 'jet (2)',
 'Airbus A300': 'jet (2)',
 'Airbus A310': 'jet (2)',
 'Airbus A318': 'jet (2)',
 'Airbus A319': 'jet (2)',
 'Airbus A319/320/321': 'jet (2)',
 'Airbus A319neo': 'jet (2)',
 'Airbus A320': 'jet (2)',
 'Airbus A320neo': 'jet (2)',
 'Airbus A321': 'jet (2)',
 'Airbus A321neo': 'jet (2)',
 'Airbus A330': 'jet (2)',
 'Airbus A330neo': 'jet (2)',
 'Airbus A340': 'jet (4)',
 'Airbus A350 XWB': 'jet (2)',
 'Airbus A380': 'jet (4)',
 'Airbus Military A400M': 'turboprop (4)',
 'Airspeed AS.57 Ambassador': 'piston (2)',
 'Alenia C-27J': 'turboprop (2)',
 'Alenia G-222': 'turboprop (2)',
 'all models BAe-125/HS-125/DH-125': 'jet (2)',
 'Antonov An-10': 'turboprop (4)',
 'Antonov An-12': 'turboprop (4)',
 ...
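Each value in the output above combines the engine type and the engine count in one string, such as `turboprop (2)`. If the two parts are needed separately later on, they could be split along the following lines. This is a minimal sketch: the helper `split_engine_value` and the pattern `ENGINE_PATTERN` are hypothetical names, and the sketch assumes every value follows the "type (count)" format seen in the output.

```python
import re

# Matches values such as "turboprop (2)": an engine type followed by a count in parentheses
ENGINE_PATTERN = re.compile(r"^\s*(?P<type>.+?)\s*\((?P<count>\d+)\)\s*$")

def split_engine_value(value):
    """Split 'turboprop (2)' into ('turboprop', 2).

    If the value does not follow the expected format, return it unchanged
    (apart from surrounding whitespace) together with None as the count.
    """
    match = ENGINE_PATTERN.match(value)
    if match is None:
        return value.strip(), None
    return match.group("type"), int(match.group("count"))

engine_type, engine_count = split_engine_value("turboprop (2)")
```

Returning `None` for the count, instead of raising an error, keeps a single unexpected value from stopping a loop over the whole dictionary.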

We now serialize the dictionary as JSON and write it to an external text file,<br>
so that it can be used in the next step:

In [25]:
with open("aircrafts_to_engines.txt", "w") as textfile:
    textfile.write(json.dumps(ae_dict))

The output file, named <b>"aircrafts_to_engines.txt"</b>, can be found in our project folder on GitHub.
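For reference, the next step could read the file back roughly like this. This is a sketch, not the project's actual code: the function name `load_engine_dict` is hypothetical, and the default path assumes the file sits in the current working directory.

```python
import json

def load_engine_dict(path="aircrafts_to_engines.txt"):
    """Read the JSON text file written in this step and return it as a dictionary."""
    with open(path, "r", encoding="utf-8") as textfile:
        return json.load(textfile)

# Example usage (requires the file to exist):
# ae_dict = load_engine_dict()
```

Because `json.dumps` escapes non-ASCII characters by default, a key such as `'Aérospatiale SN.601 Corvette'` is stored in escaped form in the file and is restored to its original spelling by `json.load`.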