## Civil Aviation Authority of the Philippines - Civil Aviation Accidents, Incidents, and Serious Incidents

In [130]:
accident_df.loc[120:]

Unnamed: 0,date,aircraft_registration,aircraft_type,type_of_occurance,place_of_occurance,status,report,report_link,geocode,latitude,longitude,formatted_address
120,"Jan 11, 2009",RP-C8893,MA-60,Undershoot,Caticlan Airport Runway 06,Completed,Summary,http://www.caap.gov.ph?download=3530,{'address_components': [{'long_name': 'Godofre...,11.925306,121.953506,"Godofredo P. Ramos Airport (MPH), Aklan West R..."
121,"Jan 09, 2009",RP-C3023,"Gru,,am Agcat G164A",Engine Failure,"Rico Vista Airstrip, Brgy. Taglawig Maco, Dava...",Completed,Summary,http://www.caap.gov.ph?download=3527,{'address_components': [{'long_name': 'Taglawi...,7.451223,125.841629,"Taglawig, Compostela Valley, Philippines"
122,"Dec 24, 2008",RP-R2924,Grumman American Agcat,Forced Landing,"Kasig-ang, Sto. Tomas, Davao del Norte",Completed,Summary,http://www.caap.gov.ph?download=3668,{'address_components': [{'long_name': 'Casig-A...,7.474977,125.65875,"Casig-Ang, Santo Tomas, Davao del Norte, Phili..."
123,"Dec 22, 2008",RP-R318,Cessna 188B,Overshoot,"Kasilak Airstrip, Panabo City",Completed,Summary,http://www.caap.gov.ph?download=3665,{'address_components': [{'long_name': 'Kasilak...,7.333621,125.597283,"Kasilak Airstrip, Panabo, Davao del Norte, Phi..."
124,"Dec 01, 2008",RP-R2806,Ayres Thrush S2RHG-T34,Engine Failure,"Camoning Farm, Brgy. Magastos, Asuncion, Davao...",Completed,Summary,http://www.caap.gov.ph?download=3661,{'address_components': [{'long_name': 'Asuncio...,7.606026,125.762524,"Asuncion, Davao del Norte, Philippines"
125,"Nov 30, 2008",RP-R2823,Grumman American Agcat,Engine Failure,"Solid Wood Hangar Old International Airport, S...",Completed,Summary,http://www.caap.gov.ph?download=3657,{'address_components': [{'long_name': 'Old Air...,7.128306,125.65352,Solidwood Hangar Gen. Av Davao international A...
126,"Nov 21, 2008",RP-R2756,Cessna Agtruck A188B,Overshoot,"P-4 La Filipina Aerodrome, Tagum City",Completed,Summary,http://www.caap.gov.ph?download=3654,{'address_components': [{'long_name': 'La Fili...,7.477925,125.797995,"La Filipina, Tagum, Davao del Norte, Philippines"
127,"Sep 25,2008",RP-C1124,Cessna 150,Landing Flare/ Touchdown,"Plaridel, Airport Rwy 17 Plaridel, Bulacan",Completed,Summary,http://www.caap.gov.ph?download=3651,{'address_components': [{'long_name': 'Plaride...,14.890675,120.852802,"Plaridel Airport, 0385 General Alejo G. Santos..."
128,"Sep 05, 2008",RP-C2826,Cessna 172,Overshot,"Plaridel Airport Bulacan, Runway 35",Completed,Summary,http://www.caap.gov.ph?download=3648,{'address_components': [{'long_name': 'Plaride...,14.890675,120.852802,"Plaridel Airport, 0385 General Alejo G. Santos..."
129,"Jun 11, 2008",RP-C2584,Cessna 152,Recovering from a bounced landing,"Iba National Airport, Iba, Zambales, Philippines",Completed,Summary,http://www.caap.gov.ph?download=3645,{'address_components': [{'long_name': 'Iba Air...,15.325009,119.969276,"Iba Airport (Zambales), Lipay Dingin, Iba, Zam..."


Data Source:
https://www.caap.gov.ph/

The *same data* can be found at [Open Data Philippines](https://data.gov.ph/?q=dataset/civil-aviation-authority-philippines-aircraft-accidents), but I chose to scrape the Civil Aviation Authority of the Philippines (CAAP) website since it contains more complete information on aircraft accidents, incidents, and serious incidents. I also found some errors in the data that CAAP uploaded to the Open Data Philippines website.

#### Errors found in the data uploaded at [Open Data Philippines](https://data.gov.ph/?q=dataset/civil-aviation-authority-philippines-aircraft-accidents):  
- [(Typo?) Error found at 6th row of place of occurrence column:](https://data.gov.ph/?q=dataset/civil-aviation-authority-philippines-aircraft-accidents/resource/29c1d129-11b2-4aac-89e7#{view-grid:{columnsWidth:[{column:!place_of_occurance,width:504}]}})
The place of occurrence written is 'Runway Excursion during Landing' which is not a place. The correct row is found at the [CAAP 2014 Accidents Page](https://www.caap.gov.ph/?page_id=3096).
- "Occurrence" is misspelled as 'occurance' in the names of the Place of Occurrence and Type of Occurrence columns.


In [1]:
import glob
import pandas as pd
import requests
from bs4 import BeautifulSoup

In [2]:
# Get website source
website = (requests.get("https://www.caap.gov.ph/").text) 
soup = BeautifulSoup(website, 'lxml')

In [3]:
# Find the 'Aircraft Accident and Incident Report'
dd = soup.find("dd", {"class": "level1 nextend-nav-6425 parent"})

In [4]:
# Get the description terms of dt('Aircraft Accident and Incident Report')
dt = dd.find_all("dt")
print(dt)

[<dt class="level2 nextend-nav-6506 parent first" data-menuid="6506">
<span class="outer">
<span class="inner">
<a><span>Accidents</span></a> </span>
</span>
</dt>, <dt class="level3 nextend-nav-7521 notparent first" data-menuid="7521">
<span class="outer">
<span class="inner">
<a href="https://www.caap.gov.ph/?page_id=7509"><span>2018 Accidents</span></a> </span>
</span>
</dt>, <dt class="level3 nextend-nav-6508 notparent" data-menuid="6508">
<span class="outer">
<span class="inner">
<a href="https://www.caap.gov.ph/?page_id=6439"><span>2017 Accidents</span></a> </span>
</span>
</dt>, <dt class="level3 nextend-nav-6546 notparent" data-menuid="6546">
<span class="outer">
<span class="inner">
<a href="https://www.caap.gov.ph/?page_id=2960"><span>2016 Accidents</span></a> </span>
</span>
</dt>, <dt class="level3 nextend-nav-6614 notparent" data-menuid="6614">
<span class="outer">
<span class="inner">
<a href="https://www.caap.gov.ph/?page_id=3055"><span>2015 Accidents</span></a> </span>


In [5]:
accident_urls = []
incident_urls = []
serious_incident_urls = []
switch = 'a'
for item in dt:
    if item.find("a").get_text() == 'Accidents':
        continue
    if item.find("a").get_text() == 'Incidents':
        switch = 'i'
        continue
    if item.find("a").get_text() == 'Serious Incidents':
        switch = 's'
        continue
    if switch == 'a':
        accident_urls.append(item.find("a")['href'])
    elif switch == 'i':
        incident_urls.append(item.find("a")['href'])
    else:
        #print(item.find("a"))
        serious_incident_urls.append(item.find("a")['href'])
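
The loop above works as a small state machine: each section header in the menu switches the list that subsequent links are appended to. The same idea can be sketched as a standalone function over plain `(label, href)` pairs. This is only an illustration; `split_by_section` and the sample pairs are hypothetical and not part of the notebook. Unlike the loop above, which assumes the menu starts with the accidents section, this sketch ignores links that appear before any header.

```python
def split_by_section(items, headers=("Accidents", "Incidents", "Serious Incidents")):
    """Group hrefs under the most recent header label seen in `items`."""
    groups = {header: [] for header in headers}
    current = None
    for label, href in items:
        if label in groups:
            # A header entry switches the active section; it has no link to store.
            current = label
        elif current is not None and href:
            groups[current].append(href)
    return groups


# Hypothetical menu entries in the order they appear on the page.
sample_items = [
    ("Accidents", None),
    ("2018 Accidents", "https://www.caap.gov.ph/?page_id=7509"),
    ("Incidents", None),
    ("2018 Incidents", "https://www.caap.gov.ph/?page_id=6421"),
]
sections = split_by_section(sample_items)
```

Keeping the grouping in a dictionary avoids the single-letter `switch` flag and makes it easy to add another section later.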

In [6]:
# Check contents of url
print("List of accident url: ", accident_urls)
print("List of incident url: ", incident_urls)
print("List of serious incident url: ", serious_incident_urls)

List of accident url:  ['https://www.caap.gov.ph/?page_id=7509', 'https://www.caap.gov.ph/?page_id=6439', 'https://www.caap.gov.ph/?page_id=2960', 'https://www.caap.gov.ph/?page_id=3055', 'https://www.caap.gov.ph/?page_id=3096', 'https://www.caap.gov.ph/?page_id=3175', 'https://www.caap.gov.ph/?page_id=3283', 'https://www.caap.gov.ph/?page_id=3349', 'https://www.caap.gov.ph/?page_id=3446', 'https://www.caap.gov.ph/?page_id=3526', 'https://www.caap.gov.ph/?page_id=3620']
List of incident url:  ['https://www.caap.gov.ph/?page_id=6421', 'https://www.caap.gov.ph/?page_id=7850', 'https://www.caap.gov.ph/?page_id=3028', 'https://www.caap.gov.ph/?page_id=3086', 'https://www.caap.gov.ph/?page_id=3154', 'https://www.caap.gov.ph/?page_id=3250', 'https://www.caap.gov.ph/?page_id=3333', 'https://www.caap.gov.ph/?page_id=3401', 'https://www.caap.gov.ph/?page_id=3468', 'https://www.caap.gov.ph/?page_id=3568', 'https://www.caap.gov.ph/?page_id=3676']
List of serious incident url:  ['https://www.caap.

In [7]:
def get_data(url_list):
    """
    Scrape each page in url_list and return all rows as one concatenated DataFrame.
    
    @param url_list: List of urls from the CAAP website (accident, incident, or serious_incident)
    """
    columns = ['date', 'aircraft_registration', 'aircraft_type', 'type_of_occurance',
               'place_of_occurance', 'status', 'report', 'report_link']
    data_df = pd.DataFrame(columns=columns)
    for url in url_list:
        print(url)
        website = requests.get(url).text
        soup = BeautifulSoup(website, 'lxml')
        accident_table = soup.find('tbody')
        listOfRows = accident_table.find_all('tr')
        temp_data_df = pd.DataFrame(columns=columns)
        i = 0
        for row in listOfRows:
            listOfCells = row.find_all('td')
            # skip the navigation row whose last cell is a "Back" link
            if len(listOfCells) == 7 and listOfCells[6].get_text() == "Back":
                continue
            # keep only 7-cell rows after the first two rows of the table
            if len(listOfCells) == 7 and i > 1:
                # text of the 7 cells, plus the href of the report link in the last cell
                cell_texts = [cell.get_text() for cell in listOfCells]
                temp_data_df.loc[i] = cell_texts + [listOfCells[6].find('a')['href']]
            i += 1
        data_df = pd.concat([data_df, temp_data_df], axis=0)
    return data_df

In [8]:
accident_df = get_data(accident_urls)
incident_df = get_data(incident_urls)
serious_incident_df = get_data(serious_incident_urls)

https://www.caap.gov.ph/?page_id=7509
https://www.caap.gov.ph/?page_id=6439
https://www.caap.gov.ph/?page_id=2960
https://www.caap.gov.ph/?page_id=3055
https://www.caap.gov.ph/?page_id=3096
https://www.caap.gov.ph/?page_id=3175
https://www.caap.gov.ph/?page_id=3283
https://www.caap.gov.ph/?page_id=3349
https://www.caap.gov.ph/?page_id=3446
https://www.caap.gov.ph/?page_id=3526
https://www.caap.gov.ph/?page_id=3620
https://www.caap.gov.ph/?page_id=6421
https://www.caap.gov.ph/?page_id=7850
https://www.caap.gov.ph/?page_id=3028
https://www.caap.gov.ph/?page_id=3086
https://www.caap.gov.ph/?page_id=3154
https://www.caap.gov.ph/?page_id=3250
https://www.caap.gov.ph/?page_id=3333
https://www.caap.gov.ph/?page_id=3401
https://www.caap.gov.ph/?page_id=3468
https://www.caap.gov.ph/?page_id=3568
https://www.caap.gov.ph/?page_id=3676
https://www.caap.gov.ph/?page_id=7895
https://www.caap.gov.ph/?page_id=8002
https://www.caap.gov.ph/?page_id=4343
https://www.caap.gov.ph/?page_id=4274
https://www.

In [9]:
accident_df

Unnamed: 0,date,aircraft_registration,aircraft_type,type_of_occurance,place_of_occurance,status,report,report_link
2,"September 9, 2018",RP-C8158,McDonnell Douglas (MD) 369 E,Ditched into the sea during over,International Water near the Federated States ...,Completed,Summary,https://www.caap.gov.ph?download=9397
3,"August 16, 2018",B-5498,Boeing 737-800,Runway Lateral Excursion,Ninoy Aquino International Airport (RPLL) Mani...,Completed,Summary,https://www.caap.gov.ph?download=9119
4,"June 6, 2018",RP-C1811,Robinson Helicopter Company R44 II,Main Rotor Failure during Take-Off,"DOLE Plantation, Sitio Glandang, Barangay Kabl...",Completed,Summary,http://www.caap.gov.ph?download=7517
2,"December 10, 2017",RP-C938,BE-P35,Departure Stall,"Skyhawk Airstrip, Tuy, Batangas, Philippines",Completed,Summary,http://www.caap.gov.ph?download=7806
3,"November 26, 2017",RP-C2586,Cessna 152,Engine Fire,"South East Apron, Subic Bay International Airp...",Completed,Summary,http://www.caap.gov.ph?download=6472
...,...,...,...,...,...,...,...,...
12,"Mar 01, 2008",RP-C1129,Cessna 150 M,Crashed short at threshold of Runway 35,"Plaridel Airport, Bulacan",Completed,Summary,http://www.caap.gov.ph?download=3636
13,"Feb 24, 2008",RP-C5328,Dornier DO-328-100,Runway Excursion,Ninoy Aquino International Airport,Completed,Summary,http://www.caap.gov.ph?download=3633
14,"Feb 17, 2008",RP-C654,Cessna 172M,Engine Malfunction,"Hermana Mayor, Zambales",Completed,Summary,http://www.caap.gov.ph?download=3627
15,"Feb 01, 2008",RP-C229,Cessna 172M,Fuel Starvation,"Reclamation area near S&R, Parañaque City",Completed,Summary,http://www.caap.gov.ph?download=3624


In [10]:
accident_df = accident_df.reset_index(drop=True)
incident_df = incident_df.reset_index(drop=True)
serious_incident_df = serious_incident_df.reset_index(drop=True)

#### Get geolocation of the Place of Occurrence column

Originally, I used geopy's Nominatim geocoder to get each location's coordinates (`from geopy.geocoders import Nominatim`); however, it returned many `None` results because the addresses are written inconsistently. I switched to the Google Maps API, which returns the closest matching address when the exact location isn't found. For example, in the 2008 accidents dataset, "Taliban, Bohol" is one of the places of occurrence. Nominatim's geocoder returns `None`, while Google Maps returns details of "Talibon, Bohol".
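
The "try one geocoder, fall back to another" idea can be sketched without committing to either service. This is a minimal sketch and not the code used in this notebook: `geocode_with_fallback` and the two stand-in geocoders are hypothetical names, and each geocoder is assumed to be a callable that takes a place name and returns a result dictionary or `None`. No network calls are made.

```python
from typing import Callable, Optional, Sequence

# A geocoder here is any callable: place name -> result dict, or None if not found.
Geocoder = Callable[[str], Optional[dict]]


def geocode_with_fallback(place: str, geocoders: Sequence[Geocoder]) -> Optional[dict]:
    """Try each geocoder in order and return the first non-empty result."""
    for geocoder in geocoders:
        try:
            result = geocoder(place)
        except Exception:
            # Treat a failing service like a miss and move on to the next one.
            result = None
        if result:
            return result
    return None


# Stand-ins for real services (hypothetical data, for illustration only).
def strict_geocoder(place: str) -> Optional[dict]:
    known = {"Talibon, Bohol": {"formatted_address": "Talibon, Bohol, Philippines"}}
    return known.get(place)  # exact matches only


def fuzzy_geocoder(place: str) -> Optional[dict]:
    if place == "Taliban, Bohol":  # tolerates the misspelling seen in the 2008 data
        return {"formatted_address": "Talibon, Bohol, Philippines"}
    return None


result = geocode_with_fallback("Taliban, Bohol", [strict_geocoder, fuzzy_geocoder])
```

In a real run the callables would wrap the actual services; ordering them from strictest to most lenient keeps exact matches where they exist.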

In [11]:
import googlemaps
from datetime import datetime

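with open('api_key.txt') as f:
    # strip the trailing newline so only the key itself is passed to the client;
    # the with-block closes the file, so no explicit close() is needed
    api_key = f.readline().strip()
gmaps = googlemaps.Client(api_key)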

In [12]:
# Geocode helper functions

# returns the geocode (json containing details of the location)
def google_get_geocode(location):
    try:
        geocode_result = gmaps.geocode(location)
        return geocode_result[0]
    except Exception:
        # no match (empty result list) or an API/network error
        return None

# returns latitude given the geocode json
def google_get_latitude(geocode):
    try: 
        return geocode['geometry']['location']['lat']
    except: 
        return None

# returns longitude given the geocode json
def google_get_longitude(geocode):
    try: 
        return geocode['geometry']['location']['lng']
    except: 
        return None

# returns address given the geocode json
def google_get_address(geocode):
    try: 
        return geocode['formatted_address']
    except: 
        return None

In [13]:
geocode_data = []
data_list = [accident_df, incident_df, serious_incident_df]

# get geocode, latitude, longitude, and formatted address of each place of occurrence 
# in the accident_df, incident_df, and serious_incident_df tables
for table in data_list:
    table['geocode'] = table.place_of_occurance.apply(google_get_geocode)
    table['latitude'] = table.geocode.apply(google_get_latitude)
    table['longitude'] = table.geocode.apply(google_get_longitude)
    table['formatted_address'] = table.geocode.apply(google_get_address)

In [14]:
# Check how many places of occurrence Google Maps returned no geocode (None) for
print(f"Accident_df: \n{accident_df.geocode.isna().value_counts()}\n")
print(f"Incident_df: \n{incident_df.geocode.isna().value_counts()}\n")
print(f"Serious_incident_df: \n{serious_incident_df.geocode.isna().value_counts()}")

Accident_df: 
False    135
True       2
Name: geocode, dtype: int64

Incident_df: 
False    78
True      2
Name: geocode, dtype: int64

Serious_incident_df: 
False    15
Name: geocode, dtype: int64


Manually enter the latitude and longitude of the 4 places of occurrence that have 'None' geocodes (the geocode and formatted_address columns stay empty for these rows).

In [15]:
accident_df.loc[accident_df.geocode.isna()]

Unnamed: 0,date,aircraft_registration,aircraft_type,type_of_occurance,place_of_occurance,status,report,report_link,geocode,latitude,longitude,formatted_address
0,"September 9, 2018",RP-C8158,McDonnell Douglas (MD) 369 E,Ditched into the sea during over,International Water near the Federated States ...,Completed,Summary,https://www.caap.gov.ph?download=9397,,,,
69,"May 8, 2013",RP-C1095,Piper Aztec Twin-Engine,Belly Landing,"MIA, Runway 06",Completed,Summary,http://www.caap.gov.ph?download=3230,,,,


In [16]:
temp = pd.DataFrame({'latitude' : [3.65, 14.4985], 'longitude': [160.266667, 121.001500]})
accident_df.loc[accident_df.geocode.isna(), ['latitude', 'longitude']] = temp.values

- September 9, 2018
    - International Water near the Federated States of Micronesia
        - According to the report at report_link, the helicopter's final position was recorded at coordinates 3°39'00.0"N 160°16'00.0"E (latitude: 3.650000, longitude: 160.266667)
- May 8, 2013
    - MIA, Runway 06 (Ninoy Aquino International Airport Runway 06)
        - 14°29'54.6"N 121°00'05.4"E (latitude: 14.498500, longitude: 121.001500)
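
The decimal values listed above come from the usual degrees, minutes, seconds conversion: decimal = degrees + minutes/60 + seconds/3600, negated for the southern and western hemispheres. A small helper makes the conversion repeatable. This is a sketch; `dms_to_decimal` is a hypothetical name, and it assumes the exact `D°M'S"H` layout used in this notebook.

```python
import re


def dms_to_decimal(dms: str) -> float:
    """Convert a string such as 14°29'54.6"N to signed decimal degrees."""
    match = re.fullmatch(r"""\s*(\d+)°(\d+)'([\d.]+)"([NSEW])\s*""", dms)
    if match is None:
        raise ValueError(f"Unrecognised DMS string: {dms!r}")
    degrees, minutes, seconds, hemisphere = match.groups()
    value = int(degrees) + int(minutes) / 60 + float(seconds) / 3600
    # South and west are negative by convention.
    return -value if hemisphere in "SW" else value


naia_rwy06_lat = dms_to_decimal('14°29\'54.6"N')
naia_rwy06_lng = dms_to_decimal('121°00\'05.4"E')
```

Rounding the results to six decimal places gives the same precision as the values written in the list.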

In [17]:
incident_df.loc[incident_df.geocode.isna()]

Unnamed: 0,date,aircraft_registration,aircraft_type,type_of_occurance,place_of_occurance,status,report,report_link,geocode,latitude,longitude,formatted_address
47,"Mar 12, 2010",RP-C1322,Cessna 172,Jet Blast,Runway 13 Extension,Completed,Summary,http://www.caap.gov.ph?download=3484,,,,
52,"Jan 24, 2010",RP-C1086,Eurocopter Deutschland GMBH,Detachment of Unsecured Detachable Step,"Gen. Aviation Area, Domestic Airport",Completed,Summary,http://www.caap.gov.ph?download=3472,,,,


In [18]:
temp = pd.DataFrame({'latitude' : [14.522667, 14.524806], 'longitude': [121.005000, 121.001028]})
incident_df.loc[incident_df.geocode.isna(), ['latitude', 'longitude']] = temp.values

- Mar 12, 2010
    - Runway 13 Extension
        - NAIA Runway 13 Extension 14°31'21.6"N 121°00'18.0"E (latitude: 14.522667, longitude: 121.005000)
- Jan 24, 2010
    - Gen. Aviation Area, Domestic Airport
        - Domestic Airport, Brgy 191, Pasay City 14°31'29.3"N 121°00'03.7"E (latitude: 14.524806, longitude: 121.001028)
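
Since these coordinates are typed in by hand, a quick range check before assigning them can catch a latitude and longitude entered in the wrong slot. This is a minimal sketch, and `check_coordinates` is a hypothetical helper. It only checks the valid ranges (latitude within ±90, longitude within ±180), which is enough for locations around the Philippines because a longitude near 121 is not a valid latitude.

```python
def check_coordinates(latitude: float, longitude: float) -> None:
    """Raise ValueError if the pair cannot be a valid (latitude, longitude)."""
    if not -90.0 <= latitude <= 90.0:
        raise ValueError(f"latitude out of range: {latitude}")
    if not -180.0 <= longitude <= 180.0:
        raise ValueError(f"longitude out of range: {longitude}")


# (latitude, longitude) pairs taken from the list above.
manual_points = [(14.522667, 121.005000), (14.524806, 121.001028)]
for lat, lng in manual_points:
    check_coordinates(lat, lng)  # passes silently when the pair is plausible
```

The check would not notice two swapped values that are both under 90, so it complements rather than replaces a visual check on a map.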

In [19]:
serious_incident_df.loc[serious_incident_df.geocode.isna()]

Unnamed: 0,date,aircraft_registration,aircraft_type,type_of_occurance,place_of_occurance,status,report,report_link,geocode,latitude,longitude,formatted_address


It looks like a geocode was returned successfully for every place of occurrence in the serious incident table.

---
### Save to csv

In [21]:
dict_df = {'accidents.csv': accident_df, 
           'incidents.csv': incident_df, 
           'serious_incidents.csv': serious_incident_df}

for key,val in dict_df.items():
    val.to_csv(key, encoding='utf-8', index=False)
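
One thing to keep in mind when reading these files back: the `geocode` column holds Python dictionaries, which are written to CSV in their string form and therefore come back as plain strings. A minimal sketch of turning such a string back into a dictionary with the standard library's `ast.literal_eval` follows. `parse_geocode` is a hypothetical helper, and the sample cell is made up for illustration rather than copied from real API output.

```python
import ast


def parse_geocode(cell):
    """Turn the string form of a geocode dict back into a dict; empty cells become None."""
    if not isinstance(cell, str) or not cell.strip():
        return None
    # literal_eval only evaluates Python literals, so it is safer than eval().
    return ast.literal_eval(cell)


# Made-up example of how a geocode cell might look after a CSV round trip.
cell = (
    "{'formatted_address': 'Asuncion, Davao del Norte, Philippines', "
    "'geometry': {'location': {'lat': 7.606026, 'lng': 125.762524}}}"
)
geocode = parse_geocode(cell)
latitude = geocode['geometry']['location']['lat']
```

A helper like this could be passed to `pd.read_csv` through its `converters` argument, for example `converters={'geocode': parse_geocode}`, so the column is parsed while loading.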

Recommendations:

Explore other databases such as:
- https://aviation-safety.net/database/country/country.php?id=RP
- http://planecrashinfo.com/