<h1>Crawling Step 3 - Building The Dataframe</h1>

<h2>Preliminary Steps</h2>

Let's begin by importing the necessary libraries:

In [2]:
import json
import time
import os
import requests
import bs4
from bs4 import BeautifulSoup  
import pandas as pd
import scipy as sc
import numpy as np
import re

We need to declare a general variable that will be used throughout this step:

In [3]:
main_url = "https://aviation-safety.net" # The website's base URL, to which each page's relative path will be appended

The files we created in the previous two steps will come in handy during this step.<br>
Therefore, we need to load them and convert them back to the structures they were saved from:

In [4]:
# Loads the accidents' URLs file, created in Crawling Step 1:
with open("accidents_urls.txt", "r") as lk_file:
    list_of_urls = [line.strip() for line in lk_file]

In [5]:
# Loads the aircraft-to-engines file, created in Crawling Step 2:
with open('aircrafts_to_engines.txt') as ae_file:
    ae_dict = json.load(ae_file)

Let's test if the files were loaded properly:

In [27]:
for u in range(0,5):
    print(list_of_urls[u])
print("...")

/database/record.php?id=19190802-0
/database/record.php?id=19190811-0
/database/record.php?id=19200223-0
/database/record.php?id=19200225-0
/database/record.php?id=19200630-0
...


Let's see how many URLs we need to access, to get a better estimate of how long the scraping process will take:

In [6]:
len(list_of_urls)

22252
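As a rough sketch of what this count implies: the loop later in this step pauses for 0.3 seconds on every tenth page and issues up to four requests per accident (the accident page, its map, and the aircraft and engine info pages). The average latency per request used below is an assumed placeholder, not a measured value, so the total is only an order-of-magnitude guess.

```python
# Back-of-the-envelope estimate of the scraping time.
n_pages = 22252            # number of accident URLs loaded above
pause_every = 10           # the loop sleeps once every 10 pages
pause_seconds = 0.3        # length of each pause
requests_per_page = 4      # upper bound: accident, map, aircraft and engine pages
assumed_latency = 0.3      # ASSUMPTION: average seconds per request, not measured

# Pauses happen when the counter is 0, 10, 20, ... so we round up.
n_pauses = -(-n_pages // pause_every)
sleep_total = n_pauses * pause_seconds
request_total = n_pages * requests_per_page * assumed_latency
total_hours = (sleep_total + request_total) / 3600

print(f"time spent sleeping: {sleep_total / 60:.1f} minutes")
print(f"estimated upper bound: {total_hours:.1f} hours")
```

Under these assumptions the pauses themselves are a small part of the total; the requests dominate.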

In [26]:
for u in range(0,5):
    print(list(ae_dict.items())[u])
print("...")

('Aero Modifications AMI DC-3-65TP', 'turboprop (2)')
('Aero Spacelines Mini Guppy Turbine', 'turboprop (4)')
('Aeromarine 75', 'piston (2)')
('Aérospatiale SN.601 Corvette', 'jet (2)')
('Airbus A220 / Bombardier Cseries', 'jet (2)')
...


<h2>The Process</h2>

<h3>Scraping Accident Data</h3>

We are ready to begin scraping the necessary data from each and every accident page!<br>
Let's begin with declaring lists that will represent the columns in our dataframe to-be:

In [34]:
# Columns will be explained in detail in the Data Handling step.
weekday = []
day = []
month =[]
year = []
time_c = []
aircraft_type = []
num_of_engines = []
engine_type = []
engine_model = []
years_active = []
airframe_hrs = []
cycles = []
operator = []
occupants = []
accident_loc =[]
above_ocean = []
flight_phase = []
damage = []
fate = []
accident_latitude = []
accident_longtitude = []
fatalities = []

In the following section, the code will loop through the URLs that were collected in Step 1.<br>
Since all these pages have the same template, we know that all the necessary data will appear in two places:<br>
<ul><b>A 'table' HTML tag</b> - that contains string and numerical data about the accident, in two separate columns. These are well suited to a dictionary, which makes the data easier to process.</ul>
<ul><b>An external web object</b> - that points to a world map page showing the accident's location. We will scrape the coordinates from the map script and append them directly to the appropriate lists.</ul><br><br>
After scraping is complete, we will append each value to its appropriate list. If a value is missing from the page, or is corrupted, a NaN object will be appended to the list instead.<br><br>
<small>Note: Most of the commands in this section must run inside a loop, which requires us to place them in a single Jupyter notebook code cell. This means that code-specific remarks must be included as comments within the code, rather than as markdown cells.</small>
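The two ideas above (a dictionary built from the table's two columns, and a NaN fallback for missing fields) can be sketched in isolation. The sample rows and the `field_or_nan` helper are hypothetical and only illustrate the pattern; the trailing colon on each label is an assumption based on the way the scraping code trims the last character of the left column.

```python
import math

# Hypothetical (label, value) pairs, standing in for the table's two columns.
sample_rows = [
    ("Date:", "Saturday 2 August 1919"),
    ("Operator:", "Caproni"),
    ("Phase:", "En route (ENR)"),
]

# The left column (minus its last character) becomes the key, the right column the value.
tb_dict = {label[:-1]: value for label, value in sample_rows}

def field_or_nan(record, key):
    """Return record[key], or NaN when the page did not provide that field."""
    return record.get(key, math.nan)

operator_i = field_or_nan(tb_dict, "Operator")
cycles_i = field_or_nan(tb_dict, "Cycles")  # not in the sample, so NaN

print(operator_i, cycles_i)
```

The actual cell below uses `try`/`except` blocks rather than a helper, because most fields also need text processing that can fail in its own right.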

In [35]:
# We're about to access the website a large number of times, therefore pauses need to be introduced to avoid being blocked by the server.

sleep_counter = 0

#-------------------

for x in list_of_urls:
    
# On every tenth iteration we introduce a short pause (on one hand, to avoid being blocked; on the other hand, we don't want the process to take a VERY long time).
# With 22,252 projected entries, it is estimated that the scraping will take a few hours...

    if(sleep_counter%10 == 0):
        time.sleep(0.3)
    sleep_counter = sleep_counter + 1
    
#Now it's time to send a request to every URL, and build a BeautifulSoup object from every response we receive.
#The website is very wary of scraping attempts, therefore we mask our presence using the 'User-Agent': 'Mozilla/5.0' header.

    accident_url = main_url + x #Declares the complete URL to scrape from
    response_data = requests.get(accident_url, headers={'User-Agent': 'Mozilla/5.0'}) #Request and response
    soup = BeautifulSoup(response_data.content, "html.parser") #Setting the BeautifulSoup object
    
#Now that we have the BeautifulSoup object, let's access the two fields mentioned above to scrape data from:

#STEP 1.1 - The 'table' HTML tag:
#_______________________________
#The accident page has multiple 'table' tags, but we sampled a few dozen pages and found that the desired table is always
#the first one on the page. Therefore, we can use the '.find' method to isolate it as an object.
#Then we will set up an empty dictionary with the key being the left column of the table, and the value being the right.
#We will create a list of the table's rows using the '.find_all' method on the 'tr' tags in the table object.
#Then we move on to scrape the data and add it to the dictionary:

    try:
        table = soup.find("table")
        tb_rows = table.find_all("tr") #All the rows of the table
        tb_dict = {}
        for tb_row in tb_rows: #Loops from the first row to the last row in the rows list
            cells = tb_row.select('td') #The row's cells: label (left column) and value (right column)
            tb_dict[cells[0].text[:-1]] = cells[1].text #Creates a dictionary record for each row
    except: #If the page is corrupted, it won't include the table object. Therefore we skip to the next page
        continue

#The dictionary is ready for processing!
#------------------------------------------------------------------------------------------------------------------------------

#STEP 1.2: The external map object:
#________________________________
#This one was a bit trickier. The data is not shown on the page as text, but rather as a visual representation on a map.
#The map object is external - which means that unlike the table, its code is not loaded into our BeautifulSoup object.
#Therefore, we need to access the object separately and get its HTML code as a separate string.

#The first step is to locate the object and the external link that is used to represent it
#within the accident page. Other information objects in the page are located in 'iframe' tags, including our map.
#However, unlike the table, its location in the page is not fixed, and the map might even be missing.
#We worked out what the URL of the map object looks like, so we will find all the 'iframe' tags, and loop through them to
#find the specific tag that includes the map URL template (or to return nothing if there is no map in the page):

    map_script = None #Default value, so a page without a map never reuses the previous page's map script
    for tmap in soup.find_all("iframe"): #Loops through the info objects
        try:
            if (re.search('/statistics/geographical/kml_map_iframe.php', tmap['src'])): #Checks if its external URL has the map object's URL template
                map_url = main_url + tmap['src'] #If yes, declares a variable with the map object's full URL
                map_script = requests.get(map_url, headers={'User-Agent': 'Mozilla/5.0'}).text #Request-response of the map object's HTML code
                break #Loop is broken, as the map object was found. No need to keep looking further
        except: #An info object without a 'src' attribute (or a failed request) must not abort our loop
            map_script = None
            continue

#------------------------------------------------------------------------------------------------------------------------------

#Now we have the following variables with scraped data in hand that we need to process in different approaches:
#1) A dictionary of the accident data
#2) A dictionary of aircraft and engine data
#3) The HTML code of the map script.

#We can now start processing our data and append the processed values to the relevant column lists.
#As mentioned above, if a certain page doesn't include the data (and that data doesn't appear in the dictionary/script as
#a result), a NaN value will be appended to the matching list instead.

#STEP 2.1: Processing of the dictionary object:
#____________________________________________

#Processing values from the accident's freshly created dictionary is pretty straightforward. We will either take the value
#directly from the dictionary, or - in case the needed value is embedded in a longer string - use text processing:

#Weekday - Always indicated in the first word of the "Date" dictionary value
    try:
        weekday_i = re.split(" ",tb_dict["Date"])[0]
        if ((weekday_i == 'xx') | (weekday_i == 'XX')): # If exact date is unknown
            weekday.append(np.nan)
        else:
            weekday.append(weekday_i)
    except:
        weekday_i = np.nan
        weekday.append(weekday_i)

#Day - Mostly the second word of the "Date" dictionary value. Numerical - converted to float
    try:
        if ((weekday_i == 'xx') | (weekday_i == 'XX')): # If exact date is unknown
            day_i = np.nan
        else:
            day_i = float(re.split(' ',tb_dict["Date"])[1]) # Exact date is known: the day is the second word
    except:
        day_i = np.nan
    day.append(day_i)

#Month - Mostly the third word of the "Date" dictionary value.
    try:
        if ((weekday_i == 'xx') | (weekday_i == 'XX')): # If exact date is unknown
            month_i = np.nan
        else:
            month_i = re.split(" ",tb_dict["Date"])[2]
    except:
        month_i = np.nan
    month.append(month_i)

#Year - Mostly the fourth word of the "Date" dictionary value. Numerical - converted to float
    try:
        if ((weekday_i == 'xx') | (weekday_i == 'XX')): # If exact date is unknown
            year_i = float(re.split(" ",tb_dict["Date"])[2]) #Year will appear in the third word
        else:
            year_i = float(re.split(" ",tb_dict["Date"])[3]) #Year will appear in the fourth word
    except:
        year_i = np.nan
    year.append(year_i)

#Time - Appears in HH:MM format as the value of "Time" in the dictionary. We wanted to represent the time as a float number
#for it to be a continuous variable. The hour is taken as the whole number, and the minutes are taken as a fraction
#of 60 and added to it.
    try:
        time_i = float(re.search(r"\d\d", re.split(":",tb_dict["Time"])[0]).group(0))+float((re.split(":",tb_dict["Time"])[1]))/60
    except:
        time_i = np.nan
    time_c.append(time_i)

#Airframe Hours - directly taken from the "Total airframe hrs" dictionary value. Numerical - converted to float
    try:
        airframe_hrs_i = float(tb_dict["Total airframe hrs"])
    except:
        airframe_hrs_i = np.nan
    airframe_hrs.append(airframe_hrs_i)

#Cycles - directly taken from the "Cycles" dictionary value. Numerical - converted to float
    try:
        cycles_i = float(tb_dict["Cycles"])
    except:
        cycles_i = np.nan
    cycles.append(cycles_i)

#Operator - directly taken from the "Operator" dictionary value.
    try:
        operator_i = tb_dict["Operator"]
    except:
        operator_i = np.nan
    operator.append(operator_i)

#Flight Phase - The sub-string that is located between the parentheses in the "Phase" dictionary value.
    try:
        flight_phase_i = re.split('[()]',tb_dict["Phase"])[1]
    except:
        flight_phase_i = np.nan
    flight_phase.append(flight_phase_i)

#Years Active - Processes the "First flight" dictionary value, as its template might vary depending on  the aircraft's age
#and the amount of available information given in this field.
    try:
        try:
            re.search(r'\d\d\d\d-', tb_dict['First flight']).span() #Checks if more than the first flight's year is provided. If yes, continues to the conditional statement
            if(re.split(" ", tb_dict['First flight'])[4] == 'months)'): #Checks if the aircraft is less than 1 year old. If it is - takes the months as a fraction of 12
                years_active_i = float(re.split(" ", tb_dict['First flight'])[3][1:])/12
            elif(re.split(" ", tb_dict['First flight'])[4] == 'years)'): #Checks if the aircraft is a whole number of years old (no months). If it is - takes the years as a float
                years_active_i = float(re.split(" ", tb_dict['First flight'])[3][1:])
            else: #Gets the years and months since the aircraft's first flight and converts them to a float number
                years_active_i = float(re.findall(r"\d+", tb_dict['First flight'])[3])+float(re.findall(r"\d+", tb_dict['First flight'])[4])/12
        except:
            years_active_i = year_i - float(tb_dict["First flight"][1:5]) #If an error occurs in the nested 'try' (e.g. an index out of range), not enough data is available: only a year is given, so we calculate the elapsed time as a whole number of years
    except:
        years_active_i = np.nan
    years_active.append(years_active_i)

#Occupants - The second number that appears in the "Total" dictionary value. Numerical - converted to float
    try:
        occupants_i = float(re.findall(r'\d+',tb_dict["Total"])[1])
    except:
        occupants_i = np.nan
    occupants.append(occupants_i)

#Accident Location - Appears after the non-breaking space ('\xa0') and before the closing parenthesis in the "Location" dictionary value.
    try:
        accident_loc_i = re.split('[)]',re.split('\xa0 ', tb_dict['Location'])[1])[0]
    except:
        accident_loc_i = np.nan
    accident_loc.append(accident_loc_i)

#Above Ocean - Looks for appropriate word in the processed Accident Location variable.
    try:
        if (re.search("Ocean|Sea", accident_loc_i)):
            above_ocean_i = 1
        else:
            above_ocean_i = 0
    except:
        above_ocean_i = np.nan
    above_ocean.append(above_ocean_i)

#Damage - The sub-string before parentheses in "Aircraft damage" dictionary value (if there are any, if not - takes the entire value)
    try:
        damage_i = re.split('[(]',tb_dict['Aircraft damage'])[0][1:]
    except:
        damage_i = np.nan
    damage.append(damage_i)

#Fate - The sub-string before parentheses in "Aircraft fate" dictionary value (if there are any, if not - takes the entire value)
    try:
        fate_i = re.split('[(]',tb_dict['Aircraft fate'])[0][1:]
    except:
        fate_i = np.nan
    fate.append(fate_i)

#Fatalities - The first number that appears in the "Total" dictionary value. Numerical - converted to float
    try:
        fatalities_i = float(re.findall(r'\d+',tb_dict["Total"])[0])
    except:
        fatalities_i = np.nan
    fatalities.append(fatalities_i)

#------------------------------------------------------------------------------------------------------------------------------

#STEP 2.2: Processing of aircraft and engines information
#________________________________________________________

#In Crawling Step 2 we created a special dictionary that matches the engine type and number of engines to the aircraft's model.
#In the accident page, the model is mentioned with a very specific name that points to a sub-model. If we scraped that name
#directly, it would bloat the number of unique values in the Aircraft Model column.
#The accident page also contains a hyperlink to the model's details page, where the aircraft model is mentioned with a more
#general name, which we want to use as the value for the accident's aircraft model.
#Since this name doesn't appear on our accident page, we need to access this hyperlink and scrape the name from the other page.
#Therefore, we need yet another BeautifulSoup object to be able to scrape from a different page.
#We will use the same method on the Engine Model column, as it follows the same pattern (sub-models mentioned in the accident
#page, but a more general name in the engine information page).

#Some accident pages also lack the number and type of engines, but these are known in advance as long as the aircraft model is known.
#That's where the dictionary we created in Crawling Step 2 comes into play. We will match the general name of the aircraft model
#to the type and number of engines it is set to have. So even if the accident page lacks information about engines, we will be
#able to gather it from elsewhere:

    acen_links = table.find_all("a") #Finds all hyperlinks in the accident's table

# Aircraft Model - Scrapes the aircraft model info page's title, omits excess characters (they appear the same in every page)
    try:
        ac_link = None
        for lnk in acen_links: #Looks for a hyperlink with the template of aircraft model info page's URL
            if (re.search('database/types', str(lnk))):
                ac_link = lnk['href'] # If found, declares a variable with the URL
            
        if ac_link is not None: #If a matching hyperlink was found:
            ac_url = main_url + ac_link #Declares a variable with the aircraft model's info page's full URL
            ac_response = requests.get(ac_url, headers={'User-Agent': 'Mozilla/5.0'}) #Request-response of the info page's HTML code
            ac_soup = BeautifulSoup(ac_response.content, "html.parser") #Creates BeautifulSoup object for it
            
            aircraft_type_i = ac_soup.find('div', attrs={"class":"pagetitle"}).text[1:-7]
        else:
            aircraft_type_i = np.nan
    except:
        aircraft_type_i = np.nan
    aircraft_type.append(aircraft_type_i)

#Engine Type - The first word in the dictionary value from Crawling Step 2 of the aircraft model we just scraped (the key)
    try:
        engine_i = re.split(' ', ae_dict[aircraft_type_i])[0]
    except:
        engine_i = np.nan
    engine_type.append(engine_i)
        

#Num of Engines - Omits the parentheses of the second word from Crawling Step 2 of the aircraft model we just scraped (the key)
#Numerical - converted to float
    try:
        num_of_engines_i = float(re.split(' ', ae_dict[aircraft_type_i])[1][1:-1])
    except:
        num_of_engines_i = np.nan
    num_of_engines.append(num_of_engines_i)

#Engine Model - Scrapes the engine model info page's title, omits excess characters (They appear the same in every page)   
    try:
        en_link = None
        for lnk in acen_links: #Looks for a hyperlink with the template of engine model info page's URL
            if (re.search('database/engine', str(lnk))):
                en_link = lnk['href'] # If found, declares a variable with the URL
        
        if en_link is not None: #If a matching hyperlink was found:
            en_url = main_url + en_link #Declares a variable with the engine model's info page's full URL
            en_response = requests.get(en_url, headers={'User-Agent': 'Mozilla/5.0'}) #Request-response of the info page's HTML code
            en_soup = BeautifulSoup(en_response.content, "html.parser") #Creates BeautifulSoup object for it
        
            engine_model_i = en_soup.find('div', attrs={"class":"pagetitle"}).text[1:-7]
        else:
            engine_model_i = np.nan
    except:
        engine_model_i = np.nan
    engine_model.append(engine_model_i)
    
#------------------------------------------------------------------------------------------------------------------------------

#STEP 2.3 - Processing the map script
#____________________________________

#The map script we got is treated as a string. The accident's coordinates appear inside that string, along with other
#coordinates we decided not to include in our dataframe: departure airport and destination airport.
#Sampling some map scripts, we realized that the accident's coordinates always appear after the departure and destination
#airports in the script.

#They also follow three other patterns:
#1)Every coordinates field appears right after the term "L.marker". The values we need are located in the third coordinates
#field, so we need to process the sub-string that comes after the third "L.marker" appearance,
#which means - the fourth sub-string overall.
#2)A pair of coordinates always appears inside square brackets '[ ]'
#3)The individual values in the pair of coordinates (latitude and longitude) are separated by a comma ','
#The first value refers to the latitude, and the second to the longitude


#We now follow these patterns from 3) to 1) and use text processing to isolate the values we need

#Accident's Latitude - The first value in the square brackets that follow the third "L.marker" appearance. Numerical - converted to float
    try:
        accident_latitude_i = float(re.split('[, ]', re.split(r'[\[\]]', re.split(r"L\.marker", map_script)[3])[1])[0])
    except:
        accident_latitude_i = np.nan
    accident_latitude.append(accident_latitude_i)

#Accident's Longitude - The second value in the square brackets that follow the third "L.marker" appearance. Numerical - converted to float
    try:
        accident_longtitude_i = float(re.split('[, ]', re.split(r'[\[\]]', re.split(r"L\.marker", map_script)[3])[1])[2])
    except:
        accident_longtitude_i = np.nan
    accident_longtitude.append(accident_longtitude_i)
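Two of the conversions in the cell above, the time-to-float step and the coordinate extraction, can be restated as small standalone functions. This is only a sketch: the function names and the sample map script are hypothetical, and the regular expression assumes each marker is written as `L.marker([lat, lon]...`, which is stricter than the split-based approach used in the cell.

```python
import math
import re

# Matches "L.marker([<lat>, <lon>]" and captures both numbers.
MARKER_PATTERN = re.compile(
    r"L\.marker\(\s*\[\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\]"
)

def time_to_float(time_text):
    """Convert 'HH:MM' to fractional hours, or NaN if no such time is found."""
    match = re.search(r"(\d{1,2}):(\d{2})", time_text)
    if match is None:
        return math.nan
    hours, minutes = int(match.group(1)), int(match.group(2))
    return hours + minutes / 60

def accident_coordinates(map_script):
    """Return (latitude, longitude) of the third marker, or NaNs if unavailable."""
    if not map_script:
        return math.nan, math.nan
    pairs = MARKER_PATTERN.findall(map_script)
    if len(pairs) < 3:
        return math.nan, math.nan
    latitude, longitude = pairs[2]
    return float(latitude), float(longitude)

# Hypothetical map script: departure, destination, then the accident marker.
sample_script = (
    "L.marker([45.0, 9.0]).addTo(map);"
    "L.marker([46.0, 11.0]).addTo(map);"
    "L.marker([45.396389, 10.888056]).addTo(map);"
)

print(time_to_float("19:14"))
print(accident_coordinates(sample_script))
print(accident_coordinates(None))
```

Like the cell itself, the sketch falls back to NaN whenever the expected pattern is missing, so every column list still receives exactly one value per page.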

We're finally done! We have all the data we need appended to the column lists!<br>
It means we're ready to build the dataframe with all the data!
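Before building the dataframe, it can be worth confirming that every column list has the same length, since `pd.DataFrame` raises a `ValueError` when given a dictionary of lists with unequal lengths. The sketch below uses shortened, hypothetical stand-ins for the real lists.

```python
# Hypothetical, shortened stand-ins for the real column lists.
columns = {
    "weekday": ["Saturday", "Monday"],
    "day": [2.0, 11.0],
    "fatalities": [14.0, 1.0],
}

# Compare every column's length against the first column.
lengths = {name: len(values) for name, values in columns.items()}
expected = lengths["weekday"]
mismatched = {name: n for name, n in lengths.items() if n != expected}

if mismatched:
    raise ValueError(f"Column lengths differ from 'weekday': {mismatched}")
print(f"All {len(columns)} columns have {expected} rows")
```

In the scraping cell the lists stay aligned because a corrupted page is skipped before anything is appended, and every other field appends exactly one value (real or NaN) per page.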

In [15]:
df = pd.DataFrame({
    'weekday':weekday, 'day':day, 'month':month, 'year':year, 'time':time_c,
    'aircraft_type':aircraft_type, 'num_of_engines':num_of_engines,
    'engine_type':engine_type, 'engine_model':engine_model,
    'years_active':years_active, 'airframe_hrs':airframe_hrs, 'cycles':cycles,
    'operator':operator, 'occupants':occupants, 'accident_loc':accident_loc,
    'above_ocean':above_ocean, 'flight_phase':flight_phase, 'damage':damage,
    'fate':fate, 'accident_latitude':accident_latitude,
    'accident_longtitude':accident_longtitude, 'fatalities':fatalities,
})
df

Unnamed: 0,weekday,day,month,year,time,aircraft_type,num_of_engines,engine_type,engine_model,years_active,...,operator,occupants,accident_loc,above_ocean,flight_phase,damage,fate,accident_latitude,accident_longtitude,fatalities
0,Saturday,2.0,August,1919,,Caproni Ca.48,3.0,piston,Liberty L-12,0.00,...,Caproni,14.0,Italy,0,ENR,Destroyed,Written off,45.396389,10.888056,14.0
1,Monday,11.0,August,1919,,Felixstowe Fury,5.0,piston,Rolls-Royce Eagle VIII,0.75,...,Royal Air Force - RAF,7.0,United Kingdom,0,ICL,Damaged beyond repair,,51.941370,1.306789,1.0
2,Monday,23.0,February,1920,,Handley Page Type O,2.0,piston,,1.00,...,Handley Page Transport,10.0,South Africa,0,ENR,Damaged beyond repair,,,,0.0
3,Wednesday,25.0,February,1920,,Handley Page Type O,2.0,piston,,,...,Handley Page Transport,4.0,Sudan,0,UNK,Damaged beyond repair,,,,0.0
4,Wednesday,30.0,June,1920,,Handley Page Type O,2.0,piston,,1.00,...,Handley Page Transport,2.0,Sweden,0,ENR,Damaged beyond repair,,,,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22215,Monday,27.0,December,2021,19.233333,Learjet 35 / 36,2.0,jet,,36.00,...,Aeromedevac Air Ambulance,4.0,United States of America,0,APR,Destroyed,Written off,32.821172,-116.939520,4.0
22216,Tuesday,28.0,December,2021,,Cessna 208 Caravan,1.0,turboprop,Pratt & Whitney Canada PT6,25.00,...,Halsted Aviation Corporation (HAC),,Mozambique,0,LDG,Damaged beyond repair,,,,0.0
22217,Tuesday,28.0,December,2021,11.883333,Beechcraft 100 King Air,2.0,turboprop,Pratt & Whitney Canada PT6,47.00,...,Sky-Bound Aviation LLC,,United States of America,0,LDG,Substantial,,,,0.0
22218,Thursday,30.0,December,2021,17.000000,Beechcraft Super King Air,2.0,turboprop,,34.00,...,Unknown,3.0,Guatemala,0,LDG,Damaged beyond repair,,,,0.0


To proceed to the next step in our project, we will export the dataframe into a .csv file, for later use:

In [18]:
df.to_csv("df.csv", index=False)

The output file, named <b>"df.csv"</b>, can be found in our project folder on GitHub.