### Importing the libraries for data manipulation, timing, console output, HTTP requests, and HTML parsing

In [2]:
import pandas as pd
import time
import sys
from bs4 import BeautifulSoup
import requests

### Web Scraping Process

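This code scrapes aviation safety data from the Aviation Safety Network (ASN) database at asn.flightsafety.org, one year and one page at a time.

- Custom HTTP headers (`User-Agent`, `Accept-Language`, `Referer`) are sent with each request to mimic a browser and reduce the chance of being blocked.  
- The script loops through the years 1995 through 2025 and requests page 1, 2, 3, ... of each year until no further page is found.  
- The first table on each page is read with `pd.read_html` and appended to the list `dfs`.  
- Progress is written to the console, and each status line is overwritten in place once the page has been read.  
- When `pd.read_html` raises a `ValueError` (no table on the page), the year is treated as finished and the script moves on to the next year.  
- Any other exception is printed and the script also skips to the next year, so a transient network error can leave a year incomplete.  
- A 1-second pause after each successful request keeps the load on the server low.  
- The process stops once the year 2025 has been scraped.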


In [3]:
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/124.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://asn.flightsafety.org/",
}

year = 1995
page = 1
dfs = []

while True:
    if year <= 2025:
        url = f"https://asn.flightsafety.org/database/year/{year}/{page}"
        try:
            sys.stdout.write(f"\rScraping year - {year}, page {page}...")
            sys.stdout.flush()

            data = pd.read_html(
                url,
                storage_options=headers,
                skiprows=0,
                header=None
            )
            dfs.append(data[0])

            # Overwrite the line to indicate success
            sys.stdout.write(f"\rScraping year - {year}, page {page}... Finished successfully!\n")
            sys.stdout.flush()

            page += 1
            time.sleep(1)
            continue

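        except ValueError:
            # pd.read_html raises ValueError when it finds no table on the page
            print(f"\nFinished scraping data for {year} successfully!")
            print(f"No more data found at {year}. Continuing with year {year + 1}\n")
            year += 1
            page = 1
            continue

        except Exception as e:
            # Any other failure is reported and the rest of the year is skipped
            print(f"\nUnexpected error at year {year}, page {page}: {e}")
            year += 1
            page = 1
            continue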
    else:
        print("\nScraping completed for all years.")
        break

Scraping year - 1995, page 1... Finished successfully!
Scraping year - 1995, page 2... Finished successfully!


KeyboardInterrupt: 
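
The year-and-page traversal described above can be separated from the actual download, which makes it possible to check the stopping logic without touching the website. The following is a minimal sketch under stated assumptions, not the notebook's own code: `iter_pages`, `fake_fetch`, and `FAKE_SITE` are hypothetical names, and the stand-in fetcher imitates the scraping loop's convention that a `ValueError` means a year has no more pages.

```python
from typing import Callable, Dict, Iterator, List, Tuple


def iter_pages(
    fetch: Callable[[int, int], List[str]],
    first_year: int,
    last_year: int,
) -> Iterator[Tuple[int, int, List[str]]]:
    """Yield (year, page, rows) for every page of every year.

    A year ends when `fetch` raises ValueError, mirroring the assumption
    in the scraping loop that ValueError means "no table on this page".
    """
    for year in range(first_year, last_year + 1):
        page = 1
        while True:
            try:
                rows = fetch(year, page)
            except ValueError:
                break  # no more pages for this year
            yield year, page, rows
            page += 1


# Hypothetical stand-in for the website: year -> number of available pages.
FAKE_SITE: Dict[int, int] = {1995: 2, 1996: 1}


def fake_fetch(year: int, page: int) -> List[str]:
    """Return dummy rows, or raise ValueError past the last page."""
    if page > FAKE_SITE.get(year, 0):
        raise ValueError("No tables found")
    return [f"{year}-p{page}"]


visited = [(year, page) for year, page, _ in iter_pages(fake_fetch, 1995, 1996)]
print(visited)  # [(1995, 1), (1995, 2), (1996, 1)]
```

In a real run, `fetch` would wrap the `pd.read_html` call and the `time.sleep(1)` pause; other exceptions would still need their own handling, which this sketch deliberately leaves out.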

### Data Aggregation and Export

- All page DataFrames collected in the list `dfs` are combined into a single DataFrame with `pd.concat`; `ignore_index=True` renumbers the rows continuously.  
- The dataset is exported to a CSV file named `asn_dataset.csv` with UTF-8 encoding, including column headers and excluding the DataFrame index.  
- The complete dataset is then displayed for review. A bare expression is rendered only when it is the last line of a cell, so `dataset` comes last.

In [None]:
dataset = pd.concat(dfs, ignore_index=True)
dataset.to_csv("asn_dataset.csv", encoding='utf-8', index=False, header=True)
dataset
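
To see what the aggregation step does to the row index and the CSV header, here is a small self-contained sketch on two toy pages. The column names `date` and `type` are assumptions chosen for illustration (the scraped tables carry their own headers), and the CSV is written to an in-memory buffer rather than to disk.

```python
import io

import pandas as pd

# Two toy "pages"; column names are hypothetical, values are sample records.
page_1 = pd.DataFrame(
    {
        "date": ["2 Jan 1995", "2 Jan 1995"],
        "type": ["Boeing 737-298C", "Cessna 208 Caravan I"],
    }
)
page_2 = pd.DataFrame(
    {
        "date": ["3 Jan 1995"],
        "type": ["de Havilland Canada DHC-6 Twin Otter 310"],
    }
)

# Without ignore_index=True the index would be 0, 1, 0 (each page restarts at 0).
combined = pd.concat([page_1, page_2], ignore_index=True)

# Write the CSV to memory so the result can be inspected directly.
buffer = io.StringIO()
combined.to_csv(buffer, index=False, header=True)
csv_text = buffer.getvalue()

print(list(combined.index))      # [0, 1, 2]
print(csv_text.splitlines()[0])  # date,type
```

With `index=False` the row numbers are not written, so the file starts with the header line followed by one line per record.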

In [44]:
req = requests.get("https://asn.flightsafety.org/database/year/1995/1", headers=headers)
soup = BeautifulSoup(req.text, "html.parser")  # name the parser explicitly to avoid the "no parser specified" warning

table = soup.body.find("table", class_="hp")
table_rows = table.find_all("tr")

data = []
for tr in table_rows:
    cols = tr.find_all("td")
    if not cols:
        continue

    
    row = []
    link = None
    
    for i, td in enumerate(cols):
        text = td.get_text(strip=True)

        if i == 0:
            # The link is read from the first cell; guard against a missing <span>
            span = td.find("span")
            a_tag = span.find("a") if span else None
            if a_tag and "href" in a_tag.attrs:
                link = a_tag["href"]
        row.append(text)

    row.append("https://asn.flightsafety.org" + link if link else None)
    data.append(row)

data

[['2 Jan 1995',
  'Boeing 737-298C',
  '9Q-CNI',
  'Air Zaire',
  '0',
  "Kinshasa-N'Djili Airport (FIH)",
  '',
  'w/o',
  '',
  'https://asn.flightsafety.org/wikibase/324813'],
 ['2 Jan 1995',
  'Cessna 208 Caravan I',
  'N242SS',
  'Taquan Air Service',
  '0',
  'Craig, AK',
  '',
  'sub',
  '',
  'https://asn.flightsafety.org/wikibase/359440'],
 ['3 Jan 1995',
  'de Havilland Canada DHC-6 Twin Otter 310',
  'P2-IAA',
  'Islands Nationair',
  '0',
  'Bili',
  '',
  'w/o',
  '',
  'https://asn.flightsafety.org/wikibase/324812'],
 ['4 Jan 1995',
  'Fokker 50?',
  '',
  'Sudan Airways',
  '0',
  'Port Sudan Airport (PZU)',
  '',
  'non',
  '',
  'https://asn.flightsafety.org/wikibase/324811'],
 ['5 Jan 1995',
  'Fokker 50',
  'LN-BBA',
  'Braathens SAFE, lsf Norwegian Air Shuttle',
  '0',
  'Ålesund-Vigra Airport (AES)',
  '',
  'sub',
  '',
  'https://asn.flightsafety.org/wikibase/324809'],
 ['5 Jan 1995',
  'Lockheed L-1329-25 JetStar II',
  '1003',
  'Imperial Iranian Air Force - II