### Importing the libraries needed for data manipulation, timing control, and system-level operations

In [16]:
import pandas as pd
import time
import sys

### Web Scraping Process

This code scrapes aviation safety data from the ASN website (asn.flightsafety.org), one year and one page at a time.

- Custom HTTP headers are set to mimic a browser and reduce the chance of being blocked.
- The script loops through the years 1995 to 2025, fetching every available page for each year.
- The first table on each page is read with `pd.read_html` and appended to the list `dfs`.
- Progress is written to the console as each page is requested.
- When a page contains no table, `pd.read_html` raises a `ValueError`; the script treats this as the end of that year and moves on to the next one.
- Any other error is reported, and scraping continues with the next year instead of stopping.
- A 1-second pause after each successful request keeps the load on the server low.
- The loop ends once 2025 has been processed.

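As a small illustration of the first two points, the sketch below builds the request for a single year and page without sending it. For HTTP(S) URLs, pandas forwards the `storage_options` key-value pairs to `urllib.request.Request` as headers, so this is roughly what is constructed for each page. The names `BASE_URL` and `build_request` are hypothetical helpers introduced here; the URL pattern and header values are the ones used in this notebook.

```python
import urllib.request

# Hypothetical helper names; the URL pattern is the one used in the scraping cell.
BASE_URL = "https://asn.flightsafety.org/database/year/{year}/{page}"


def build_request(year, page, headers):
    """Return an unsent Request for one year/page of the ASN database."""
    url = BASE_URL.format(year=year, page=page)
    return urllib.request.Request(url, headers=headers)


headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/124.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://asn.flightsafety.org/",
}

req = build_request(1995, 1, headers)
print(req.full_url)
# Request stores header names capitalized as "User-agent", "Referer", ...
print(req.get_header("User-agent"))
```

Nothing is sent over the network here; the request object is only inspected.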

In [17]:
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/124.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://asn.flightsafety.org/",
}

year = 1995
page = 1
dfs = []

while True:
    if year <= 2025:
        url = f"https://asn.flightsafety.org/database/year/{year}/{page}"
        try:
            sys.stdout.write(f"\rScraping page - {year}, page {page}...")
            sys.stdout.flush()

            data = pd.read_html(
                url,
                storage_options=headers,
                skiprows=0,
                header=None
            )
            dfs.append(data[0])

            # Overwrite the progress line to indicate success
            sys.stdout.write(f"\rScraping page - {year}, page {page}... Finished successfully!\n")
            sys.stdout.flush()

            page += 1
            time.sleep(1)
            continue

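        except ValueError:
            # pd.read_html raises ValueError when a page contains no table,
            # which marks the end of the current year.
            print(f"\nFinished scraping data for {year} successfully!")
            print(f"No more data found at {year}. Continuing with year {year + 1}\n")
            year += 1
            page = 1
            continue

        except Exception as e:
            # Any other failure is reported, then scraping resumes with the next year.
            print(f"\nUnexpected error at year {year}, page {page}: {e}")
            year += 1
            page = 1
            continue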
    else:
        print("\nScraping completed for all years.")
        break

Scraping page - 1995, page 1... Finished successfully!
Scraping page - 1995, page 2... Finished successfully!
Scraping page - 1995, page 3... Finished successfully!
Scraping page - 1995, page 4... Finished successfully!
Scraping page - 1995, page 5...
Finished scraping data for 1995 successfully!
No more data found at 1995. Continuing with year 1996

Scraping page - 1996, page 1... Finished successfully!
Scraping page - 1996, page 2... Finished successfully!
Scraping page - 1996, page 3... Finished successfully!
Scraping page - 1996, page 4... Finished successfully!
Scraping page - 1996, page 5...
Finished scraping data for 1996 successfully!
No more data found at 1996. Continuing with year 1997

Scraping page - 1997, page 1... Finished successfully!
Scraping page - 1997, page 2... Finished successfully!
Scraping page - 1997, page 3... Finished successfully!
Scraping page - 1997, page 4... Finished successfully!
Scraping page - 1997, page 5...
Finished scraping data for 1997 successfully!

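The year-and-page control flow can also be sketched as a function that receives the page fetcher as an argument, which makes the stopping rule easy to check without network access. This is a minimal sketch under stated assumptions, not the notebook's implementation: `scrape_years`, `fetch_page`, and `fake_fetch` are hypothetical names, and the fetcher is assumed to raise `ValueError` when a page has no table, as `pd.read_html` does.

```python
def scrape_years(fetch_page, first_year, last_year):
    """Collect pages year by year until fetch_page signals that a year is exhausted.

    fetch_page(year, page) is a hypothetical callable that returns one page of
    rows and raises ValueError when the page has no table.
    """
    pages = []
    for year in range(first_year, last_year + 1):
        page = 1
        while True:
            try:
                pages.append(fetch_page(year, page))
            except ValueError:
                break  # no table on this page: the year is finished
            page += 1
    return pages


# Stand-in fetcher so the sketch runs offline: every year has exactly two pages.
def fake_fetch(year, page):
    if page > 2:
        raise ValueError("No tables found")
    return [(year, page)]


result = scrape_years(fake_fetch, 1995, 1996)
print(result)
```

In a real run, `fetch_page` would wrap the `pd.read_html` call and the 1-second pause used in the cell above.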
### Data Aggregation and Export

- The page DataFrames collected in the list `dfs` are combined into a single DataFrame with `pd.concat`; `ignore_index=True` resets the index so that rows are numbered continuously.
- The combined dataset is displayed for review.
- Finally, the dataset is exported as a UTF-8 CSV file named `"asn_dataset"`, with column headers and without the DataFrame index.

In [21]:
dataset = pd.concat(dfs, ignore_index=True)
display(dataset)  # a bare `dataset` is rendered only when it is the last expression in a cell
dataset.to_csv("asn_dataset", encoding='utf-8', index=False, header=True)
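
The effect of `ignore_index=True` and `index=False` can be checked on toy data. The sketch below uses two made-up page DataFrames (the column names and values are placeholders, not ASN data) and writes the CSV to an in-memory buffer instead of a file.

```python
import io

import pandas as pd

# Two made-up "pages"; each starts with its own 0-based index.
page_1 = pd.DataFrame({"year": [1995, 1995], "page": [1, 1]})
page_2 = pd.DataFrame({"year": [1995, 1995], "page": [2, 2]})

# ignore_index=True discards the per-page indexes and renumbers the rows.
combined = pd.concat([page_1, page_2], ignore_index=True)
print(list(combined.index))

# index=False keeps the row numbers out of the CSV; header=True writes column names.
buffer = io.StringIO()
combined.to_csv(buffer, index=False, header=True)
csv_text = buffer.getvalue()
print(csv_text.splitlines()[0])
```

Without `ignore_index=True`, the combined index would repeat 0 and 1 for each page.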