# Data Acquisition for Exploratory Data Analysis of Aviation Accidents

This notebook documents how the aviation accidents dataset was acquired from the Aviation Safety Network ([ASN](https://asn.flightsafety.org/)). It covers scraping the data source, the collection steps, and workarounds for the difficulties encountered.

---

### Importing Libraries

Import the Python libraries needed for web scraping, data processing, HTML parsing, and concurrent execution. These tools let us collect and organize accident data efficiently.

In [2]:
import pandas as pd
import time
import requests
import re
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

### Web Scraping Process

In [3]:
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/124.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://asn.flightsafety.org/",
}

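def get_soup(url, headers=None):
    # Fetch a page and parse it; the timeout keeps a stalled request from hanging the scraper
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()
    return BeautifulSoup(response.text, 'html.parser')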

In [4]:
def fetch_detail(link):
    row_extra = {}
    try:
        isoup = get_soup(link, headers=headers)
        rows = isoup.find("table").find_all("tr")
        for tr in rows:
            tds = tr.find_all("td")
            if len(tds) >= 2:
                key = tds[0].get_text(strip=True).rstrip(":")
                val = tds[1].get_text(strip=True)
                row_extra[key] = val

        locmap = isoup.find("iframe")
        if locmap and hasattr(locmap, "attrs"):
            src = locmap.attrs.get("src")
            if src:
                match = pattern.search(src)
                if match:
                    lat, lon = map(float, match.groups())
                    row_extra["ll"] = [lat, lon]
                    
    except Exception as e:
        row_extra["error"] = str(e)
    return row_extra
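
The coordinate extraction inside `fetch_detail` can be tried in isolation. The sketch below reuses the notebook's regular expression; the helper name `extract_latlon` and the sample iframe URL are assumptions made for illustration and are not taken from the ASN site.

```python
import re

# Same expression as the notebook's `pattern`
pattern = re.compile(r'll=([-.\d]+),([-.\d]+)')

def extract_latlon(src):
    """Return [lat, lon] parsed from an `ll=<lat>,<lon>` fragment, or None."""
    match = pattern.search(src)
    if match is None:
        return None
    try:
        lat, lon = map(float, match.groups())
    except ValueError:
        # The character class also admits strings such as "1.2.3" that float() rejects
        return None
    return [lat, lon]

# Hypothetical map URL, for illustration only
sample_src = "https://maps.example.com/embed?ll=40.6413,-73.7781&z=10"
print(extract_latlon(sample_src))  # [40.6413, -73.7781]
```

In `fetch_detail` itself, a malformed coordinate would instead be caught by the broad `except` and stored under the `error` key.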

In [5]:
pattern = re.compile(r'll=([-.\d]+),([-.\d]+)')
data = []

year, page = 1995, 1

while year <= 2025:
    try:
        print(f"Scraping: https://asn.flightsafety.org/database/year/{year}/{page}")
        soup = get_soup(f"https://asn.flightsafety.org/database/year/{year}/{page}", headers=headers)

        table = soup.find("table", class_="hp")
        table_rows = table.find_all("tr")

        if year == 1995 and page == 1:
            columns = [th.text.strip() for th in table_rows[0].find_all("th")]
            print("Extracted table column headers")

        rows_data = []
        links_to_fetch = []

        for tr in table_rows[1:]:
            cols = tr.find_all("td")
            if not cols:
                continue

            row = {}
            link = None
            for i, td in enumerate(cols):
                text = td.get_text(strip=True)
                if i == 0:
                    a_tag = td.find("span")
                    if a_tag:
                        a = a_tag.find("a")
                        if a and "href" in a.attrs:
                            link = "https://asn.flightsafety.org" + a["href"]
                column = columns[i] if i < len(columns) else f"col_{i}"
                row[column] = text

            row["link"] = link
            if link:
                links_to_fetch.append((row, link))
            else:
                rows_data.append(row)

        print(f"Found {len(rows_data)} entries without detail links")
        print(f"Found {len(links_to_fetch)} entries with detail links")

        with ThreadPoolExecutor(max_workers=10) as executor:
            futures = {executor.submit(fetch_detail, link): row for row, link in links_to_fetch}
            for future in futures:
                row = futures[future]
                extra_data = future.result()
                row.update(extra_data)
                rows_data.append(row)

        data.extend(rows_data)
        print(f"Finished page {page} of year {year}, total records so far: {len(data)}")
        page += 1
        time.sleep(0.2)
        
    except Exception as e:
        if isinstance(e, AttributeError):
            print(f"Finished scraping year {year}")
        else:
            print(f"Error on year {year}, page {page}: {e}. Skipping to next year")
        year += 1
        page = 1

print("Scraping completed")


https://asn.flightsafety.org/database/year/1995/1
https://asn.flightsafety.org/database/year/1995/2
https://asn.flightsafety.org/database/year/1995/3
https://asn.flightsafety.org/database/year/1995/4
https://asn.flightsafety.org/database/year/1995/5
Finished year 1995.
https://asn.flightsafety.org/database/year/1996/1
https://asn.flightsafety.org/database/year/1996/2
https://asn.flightsafety.org/database/year/1996/3
https://asn.flightsafety.org/database/year/1996/4
https://asn.flightsafety.org/database/year/1996/5
Finished year 1996.
https://asn.flightsafety.org/database/year/1997/1
https://asn.flightsafety.org/database/year/1997/2
https://asn.flightsafety.org/database/year/1997/3
https://asn.flightsafety.org/database/year/1997/4
https://asn.flightsafety.org/database/year/1997/5
Finished year 1997.
https://asn.flightsafety.org/database/year/1998/1
https://asn.flightsafety.org/database/year/1998/2
https://asn.flightsafety.org/database/year/1998/3
https://asn.flightsafety.org/database/ye
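
The scraping loop relies on an `AttributeError` (raised when no listing table is found) to detect the last page of a year. As a possible alternative, the pagination can be written with an explicit stop signal. This is a minimal sketch, not the notebook's implementation: `scrape_years` and `fetch_rows` are hypothetical names, and `fetch_rows(year, page)` is assumed to return `None` when a page has no listing table.

```python
def scrape_years(fetch_rows, first_year, last_year):
    """Collect rows page by page, moving to the next year when a page is empty.

    `fetch_rows(year, page)` is a hypothetical callable that returns a list of
    row dicts, or None when the listing table is absent (past the last page).
    """
    data = []
    for year in range(first_year, last_year + 1):
        page = 1
        while True:
            rows = fetch_rows(year, page)
            if rows is None:  # explicit end-of-year signal instead of AttributeError
                break
            data.extend(rows)
            page += 1
    return data

# Stand-in for the real site: two pages for 1995, one page for 1996
fake_site = {
    (1995, 1): [{"id": 1}, {"id": 2}],
    (1995, 2): [{"id": 3}],
    (1996, 1): [{"id": 4}],
}
records = scrape_years(lambda y, p: fake_site.get((y, p)), 1995, 1996)
print(len(records))  # 4
```

Separating the stop condition from error handling means a genuine parsing bug is no longer mistaken for the end of a year.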

### Data Export

- The collected records are loaded into a DataFrame and the first rows are displayed for review.
- Finally, the dataset is exported to a Feather file named `full_asn_dataset.feather`, a binary format that preserves column names and dtypes.

In [None]:
df = pd.DataFrame(data)
df.head()
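
Because `fetch_detail` stores any exception message under an `error` key, it may be worth checking how many records failed before exporting. The following is a small sketch under that assumption; `split_failed` is a hypothetical helper and the sample records are made up to mimic the shape of the entries in `data`.

```python
def split_failed(records):
    """Separate records whose detail fetch failed (they carry an "error" key)."""
    ok = [r for r in records if "error" not in r]
    failed = [r for r in records if "error" in r]
    return ok, failed

# Made-up records shaped like the scraper's output, for illustration only
sample = [
    {"link": "https://example.com/a", "ll": [1.0, 2.0]},
    {"link": "https://example.com/b", "error": "timeout"},
    {"link": None},
]
ok, failed = split_failed(sample)
print(len(ok), len(failed))  # 2 1
```

The links of the failed records could then be passed to `fetch_detail` again for a retry.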

In [None]:
df.to_feather("full_asn_dataset.feather")