# API Request and Extraction from Aviationstack

## Project Overview

This notebook demonstrates how to extract departure data for Rio de Janeiro International Airport (Galeão) from the **Aviationstack API**, flatten the nested JSON response with **Pandas**, and clean and enrich the result (handling null values, mapping codeshares, and formatting datetimes). The processed dataset is then saved to a CSV file for further analysis or for use in other applications and databases.

For an ETL pipeline scheduled with a cron job, see [Aviationstack API ETL with Daily Cron Job](https://github.com/rodolfoplng/Portfolio/blob/main/Aviationstack%20ETL%20Cron.md).

For a full ETL pipeline using Apache Airflow, see [Airflow DAG for API ETL](https://github.com/rodolfoplng/Airflow-DAG-for-API-ETL-Process).

### API Request and Data Extraction

In [1]:
import requests

params = {
    'access_key': 'your_key_here', # Insert your API key
    'dep_iata' : 'GIG', # Rio de Janeiro International Airport (Galeão)
}

response = requests.get('https://api.aviationstack.com/v1/flights', params=params)

# Show the request status code
print("Status code:", response.status_code)

Status code: 200
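
A slightly more defensive variant of the request is sketched below. It assumes the key is stored in an environment variable called `AVIATIONSTACK_KEY` (a hypothetical name, not something the API requires), adds a timeout, and raises on HTTP error codes instead of only printing the status. `build_params` and `fetch_departures` are illustrative helpers, not part of the original notebook.

```python
import os

import requests

API_URL = "https://api.aviationstack.com/v1/flights"


def build_params(dep_iata, access_key=None):
    """Build the query parameters for a departures request."""
    # AVIATIONSTACK_KEY is an assumed variable name; use whatever your setup defines.
    key = access_key or os.environ.get("AVIATIONSTACK_KEY")
    if not key:
        raise ValueError("No API key provided")
    return {"access_key": key, "dep_iata": dep_iata.upper()}


def fetch_departures(dep_iata, access_key=None, timeout=30):
    """Request departures and return the decoded JSON payload."""
    response = requests.get(
        API_URL, params=build_params(dep_iata, access_key), timeout=timeout
    )
    response.raise_for_status()  # raise requests.HTTPError on 4xx/5xx responses
    return response.json()
```

Calling `fetch_departures("GIG")` would then replace the `requests.get` call above, keeping the key out of the notebook itself.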


In [2]:
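# Parse the JSON body of the response
try:
    data = response.json()
    # print("API Response:", data)
except ValueError as e:
    # response.json() raises a ValueError subclass when the body is not valid JSON
    print("Error trying to read JSON:", e)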

In [3]:
data.keys()

dict_keys(['pagination', 'data'])
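
The `pagination` key suggests that a single call returns only one page of results. The sketch below walks through all pages, assuming the pagination object carries `count` and `total` fields and that the endpoint accepts `limit` and `offset` query parameters; both assumptions should be checked against the Aviationstack documentation for your plan. `iter_pages` and `fake_fetch` are hypothetical helpers, and `fake_fetch` stands in for the real request so the example runs offline.

```python
def iter_pages(fetch_page, limit=100):
    """Yield every record, advancing the offset until `total` is reached.

    `fetch_page(limit, offset)` must return a dict with "pagination" and
    "data" keys, shaped like the response above.
    """
    offset = 0
    while True:
        payload = fetch_page(limit, offset)
        records = payload.get("data", [])
        yield from records
        pagination = payload.get("pagination", {})
        offset += pagination.get("count", len(records))
        # Stop on an empty page or once every record has been seen
        if not records or offset >= pagination.get("total", 0):
            break


def fake_fetch(limit, offset):
    """Offline stand-in for the API: serves 250 made-up records page by page."""
    rows = [{"id": i} for i in range(250)]
    page = rows[offset:offset + limit]
    return {
        "pagination": {"limit": limit, "offset": offset,
                       "count": len(page), "total": len(rows)},
        "data": page,
    }


all_rows = list(iter_pages(fake_fetch))
print(len(all_rows))
```

In the notebook, `fetch_page` could wrap `requests.get` with `limit` and `offset` added to `params`. Each page is a separate request, so it counts against the plan's quota.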

In [4]:
from pandas import json_normalize

# Flatten all nested fields at any depth
df = json_normalize(data['data'])

df.head()

|  | flight_date | flight_status | aircraft | live | departure.airport | departure.timezone | departure.iata | departure.icao | departure.terminal | departure.gate | ... | flight.number | flight.iata | flight.icao | flight.codeshared.airline_name | flight.codeshared.airline_iata | flight.codeshared.airline_icao | flight.codeshared.flight_number | flight.codeshared.flight_iata | flight.codeshared.flight_icao | flight.codeshared |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2025-09-15 | active |  |  | Galeao Antonio Carlos Jobim | America/Sao_Paulo | GIG | SBGL | 2 | B43 | ... | 4233 | TP4233 | TAP4233 | gol | g3 | glo | 2014.0 | g32014 | glo2014 |  |
| 1 | 2025-09-15 | active |  |  | Galeao Antonio Carlos Jobim | America/Sao_Paulo | GIG | SBGL | 2 | B43 | ... | 2014 | G32014 | GLO2014 |  |  |  |  |  |  |  |
| 2 | 2025-09-15 | active |  |  | Galeao Antonio Carlos Jobim | America/Sao_Paulo | GIG | SBGL | 1 | A22 | ... | 7570 | AR7570 | ARG7570 | gol | g3 | glo | 2046.0 | g32046 | glo2046 |  |
| 3 | 2025-09-15 | active |  |  | Galeao Antonio Carlos Jobim | America/Sao_Paulo | GIG | SBGL | 1 | A22 | ... | 3715 | EK3715 | UAE3715 | gol | g3 | glo | 2046.0 | g32046 | glo2046 |  |
| 4 | 2025-09-15 | active |  |  | Galeao Antonio Carlos Jobim | America/Sao_Paulo | GIG | SBGL | 1 | A22 | ... | 4047 | TP4047 | TAP4047 | gol | g3 | glo | 2046.0 | g32046 | glo2046 |  |
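
To see where the dotted column names come from, the following sketch runs `json_normalize` on two hand-written records. The values are made up for illustration and only mimic the shape of the API response. Nested dictionaries become `parent.child` columns, while a record whose `codeshared` field is `None` contributes a plain `flight.codeshared` column, which would explain why both kinds of column appear in the preview above.

```python
from pandas import json_normalize

# Two toy records: one codeshare flight and one without codeshare information
sample = [
    {
        "flight_date": "2025-09-15",
        "departure": {"iata": "GIG", "gate": "B43"},
        "flight": {"number": "4233", "codeshared": {"flight_icao": "glo2014"}},
    },
    {
        "flight_date": "2025-09-15",
        "departure": {"iata": "GIG", "gate": "B43"},
        "flight": {"number": "2014", "codeshared": None},
    },
]

flat = json_normalize(sample)
print(sorted(flat.columns))
```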


### Data Cleaning and Processing

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 42 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   flight_date                      100 non-null    object 
 1   flight_status                    100 non-null    object 
 2   aircraft                         0 non-null      object 
 3   live                             0 non-null      object 
 4   departure.airport                100 non-null    object 
 5   departure.timezone               100 non-null    object 
 6   departure.iata                   100 non-null    object 
 7   departure.icao                   100 non-null    object 
 8   departure.terminal               100 non-null    object 
 9   departure.gate                   100 non-null    object 
 10  departure.delay                  47 non-null     float64
 11  departure.scheduled              100 non-null    object 
 12  departure.estimated    

In [6]:
df["flight_status"].unique()

array(['active', 'scheduled', 'landed'], dtype=object)

#### Codeshare mapping

In [7]:
df[["airline.icao", "flight.number", "flight.codeshared.flight_icao"]]

|  | airline.icao | flight.number | flight.codeshared.flight_icao |
|---|---|---|---|
| 0 | TAP | 4233 | glo2014 |
| 1 | GLO | 2014 |  |
| 2 | ARG | 7570 | glo2046 |
| 3 | UAE | 3715 | glo2046 |
| 4 | TAP | 4047 | glo2046 |
| ... | ... | ... | ... |
| 95 | GLO | 7656 |  |
| 96 | JAT | 3817 |  |
| 97 | JAT | 3813 |  |
| 98 | ARG | 7450 | glo2073 |


In [8]:
import pandas as pd

# Get the rows that are codeshare (pointing to an operated flight)
ref = (
    df.loc[df["flight.codeshared.flight_icao"].notna(),
           ["flight.codeshared.flight_icao", "airline.name", "flight.number"]]
      .copy()
)

# Build the label "<airline name> <flight number>" for each codeshare row
ref["pair"] = ref["airline.name"].fillna("").astype(str) + " " + ref["flight.number"].astype(str)

# Aggregate by operated flight (key = flight.codeshared.flight_icao)
agg = (
    ref.groupby(ref["flight.codeshared.flight_icao"].str.upper())["pair"]
       .apply(lambda s: " / ".join(sorted(set(s))))
       .reset_index()
       .rename(columns={
           "flight.codeshared.flight_icao": "flight.icao",
           "pair": "codeshare"
       })
)

# Merge with the original DataFrame using the operated flight key (flight.icao)
df = df.merge(agg, how="left", on="flight.icao")

# Keep "codeshare" ONLY in rows for the operating flight; blank it on the codeshare rows
df.loc[df["flight.codeshared.flight_icao"].notna(), "codeshare"] = pd.NA
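
The same mapping can be traced on a three-row toy frame. This is a minimal sketch with made-up values (the airline names are placeholders), and it uses `agg`, `rename`, and `rename_axis` in place of the `apply` and column rename above, which should be equivalent here.

```python
import pandas as pd

# Toy data: two marketed codeshare rows pointing at one operating flight
toy = pd.DataFrame({
    "airline.name": ["TAP", "GLO", "UAE"],
    "flight.number": ["4233", "2014", "3715"],
    "flight.icao": ["TAP4233", "GLO2014", "UAE3715"],
    "flight.codeshared.flight_icao": ["glo2014", None, "glo2014"],
})

# Rows that are codeshares of another flight
marketed = toy[toy["flight.codeshared.flight_icao"].notna()].copy()
marketed["pair"] = marketed["airline.name"] + " " + marketed["flight.number"]

# One label per operating flight, keyed by its upper-cased ICAO flight code
labels = (
    marketed.groupby(marketed["flight.codeshared.flight_icao"].str.upper())["pair"]
    .agg(lambda s: " / ".join(sorted(set(s))))
    .rename("codeshare")
    .rename_axis("flight.icao")
    .reset_index()
)

result = toy.merge(labels, how="left", on="flight.icao")
# Codeshare rows themselves never carry a label
result.loc[result["flight.codeshared.flight_icao"].notna(), "codeshare"] = pd.NA
print(result[["flight.icao", "codeshare"]])
```

Only the `GLO2014` row ends up with a label, listing both flights that are marketed on it.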

#### Handling null values

In [9]:
# Keep only operating flights that have at least one codeshare partner
df.dropna(subset=["codeshare"], inplace=True)

#### Datetime formatting

In [10]:
df[["departure.delay", "departure.scheduled", "departure.estimated", "departure.actual"]].head()

|  | departure.delay | departure.scheduled | departure.estimated | departure.actual |
|---|---|---|---|---|
| 1 |  | 2025-09-15T07:55:00+00:00 | 2025-09-15T07:55:00+00:00 |  |
| 5 |  | 2025-09-15T07:55:00+00:00 | 2025-09-15T07:55:00+00:00 |  |
| 8 |  | 2025-09-15T07:55:00+00:00 | 2025-09-15T07:55:00+00:00 |  |
| 13 |  | 2025-09-15T08:00:00+00:00 | 2025-09-15T08:00:00+00:00 |  |
| 17 |  | 2025-09-15T08:10:00+00:00 | 2025-09-15T08:10:00+00:00 |  |


In [11]:
(df["departure.scheduled"] == df["departure.estimated"]).value_counts()

True    18
Name: count, dtype: int64

In [12]:
# Convert to datetime
df["departure.scheduled"] = pd.to_datetime(df["departure.scheduled"])
df["departure.estimated"] = pd.to_datetime(df["departure.estimated"])
df["departure.actual"] = pd.to_datetime(df["departure.actual"])

# Create new date and time columns
df["scheduled_date"] = df["departure.scheduled"].dt.date
df["scheduled_time"] = df["departure.scheduled"].dt.time

df["estimated_date"] = df["departure.estimated"].dt.date
df["estimated_time"] = df["departure.estimated"].dt.time

df["actual_date"] = df["departure.actual"].dt.date
df["actual_time"] = df["departure.actual"].dt.time
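
Because `departure.actual` is empty for flights that have not departed yet, it helps to know how the conversion treats missing values. The sketch below uses a two-element toy series (one timestamp in the format shown above, one missing value). Passing `utc=True` is an addition of this sketch that makes the timezone handling explicit; it is not used in the cell above.

```python
import datetime

import pandas as pd

# One ISO 8601 timestamp with an explicit offset and one missing value
raw = pd.Series(["2025-09-15T07:55:00+00:00", None])

parsed = pd.to_datetime(raw, utc=True)  # missing values become NaT
dates = parsed.dt.date                  # datetime.date objects, NaT where missing
times = parsed.dt.strftime("%H:%M")     # formatted strings, NaN where missing

print(parsed.isna().tolist())
print(dates.iloc[0], times.iloc[0])
```

Rows with `NaT` pass through the `.dt` accessors without raising, so the derived date and time columns are simply empty for those flights.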

#### Dropping unused columns

In [13]:
# Keep only the departure, arrival, airline, and flight columns needed downstream
columns_to_keep = ['flight_status', 'departure.airport',
       'departure.timezone', 'departure.iata', 'departure.icao',
       'departure.terminal', 'departure.gate', 'departure.delay',
       'arrival.airport', 'arrival.iata', 'arrival.icao',
       'airline.name', 'airline.iata', 'airline.icao', 'flight.number',
       'flight.iata', 'flight.icao',
       'codeshare', 'scheduled_date', 'scheduled_time', 'estimated_date',
       'estimated_time', 'actual_date', 'actual_time']

In [17]:
# drop=True discards the old index instead of adding it as an "index" column
df = df[columns_to_keep].reset_index(drop=True)

### Loading data to a CSV file

In [19]:
df.to_csv("Departures.csv", index = False)
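
CSV files do not store column types, so anything that reads the file back has to restate them. The following sketch round-trips a small made-up frame through an in-memory buffer instead of `Departures.csv`. It assumes flight numbers should stay as text (so leading zeros survive) and that the date column should be parsed on load.

```python
import io

import pandas as pd

# Toy frame standing in for the processed departures
out = pd.DataFrame({
    "flight.number": ["0421", "2014"],
    "scheduled_date": ["2025-09-15", "2025-09-15"],
})

buffer = io.StringIO()
out.to_csv(buffer, index=False)
buffer.seek(0)

# Restate the intended types when reading the file back
back = pd.read_csv(
    buffer,
    dtype={"flight.number": str},
    parse_dates=["scheduled_date"],
)
print(back.dtypes)
```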