<img src="../images/airplane-symbol.jpg" style="float: left; margin: 20px;" width="50" height="50"> 
#  Predicting Flight Delays (<i>a Proof-of-Concept</i>)

Author: Solomon Heng

---

# (1) Extracting METAR data for KATL

## Processes covered in this notebook:
1. [Importing METAR txt](#(1)-Importing-METAR-txt)
2. [Extracting Date info](#(2)-Extracting-Date-info)
3. [Extracting METAR data](#(3)-Extracting-METAR-data)
4. [Combining METAR and date](#(4)-Combining-METAR-and-date)
5. [Exporting the weather data](#(5)-Exporting-the-weather-data)

In [1]:
import pandas as pd
import numpy as np
import re

---
### (1) Importing METAR txt

Importing the METAR txt data scraped from https://www.ogimet.com/

_The scraping of the data is done using API calls on the relevant webpage_

---

In [2]:
f = open('../datasets/Metar KATL.txt', 'r')

In [3]:
met = []

for i in f:
    met.append(i)

In [4]:
met

['KATL,2015,01,01,00,52,METAR KATL 010052Z 33005KT 10SM FEW200 SCT250 05/01 A3037 RMK AO2 SLP289 T00500006=\n',
 'KATL,2015,01,01,01,52,METAR KATL 010152Z 31004KT 10SM FEW250 04/00 A3037 RMK AO2 SLP290 T00440000=\n',
 'KATL,2015,01,01,02,52,METAR KATL 010252Z 00000KT 10SM FEW250 04/00 A3036 RMK AO2 SLP285 T00440000 58001=\n',
 'KATL,2015,01,01,03,52,METAR KATL 010352Z 32005KT 10SM FEW250 03/M01 A3037 RMK AO2 SLP289 T00331006=\n',
 'KATL,2015,01,01,04,52,METAR KATL 010452Z 32006KT 10SM BKN200 03/M01 A3035 RMK AO2 SLP283 T00281011 401060028=\n',
 'KATL,2015,01,01,05,52,METAR KATL 010552Z 33004KT 10SM BKN200 03/M02 A3034 RMK AO2 SLP280 T00281017 10061 20028 58005=\n',
 'KATL,2015,01,01,06,52,METAR KATL 010652Z 32006KT 10SM FEW200 SCT250 02/M02 A3035 RMK AO2 SLP281 T00221017=\n',
 'KATL,2015,01,01,07,52,METAR KATL 010752Z 33005KT 10SM FEW250 02/M02 A3034 RMK AO2 SLP280 T00171022=\n',
 'KATL,2015,01,01,08,52,METAR KATL 010852Z 00000KT 10SM FEW200 02/M03 A3032 RMK AO2 SLP274 T00171033 58006=

---
### (2) Extracting Date info

Extracting the relevant date information from the txt data into a dataframe

---

In [5]:
metar_year = []

for i in met:
    year = i.split(',')[1]
    metar_year.append(year)

In [6]:
metar_month = []

for i in met:
    month = i.split(',')[2]
    metar_month.append(month)

In [7]:
metar_day = []

for i in met:
    day = i.split(',')[3]
    metar_day.append(day)

In [8]:
metar_hour = []

for i in met:
    hour = i.split(',')[4]
    metar_hour.append(hour)

In [9]:
metar_min = []

for i in met:
    min = i.split(',')[5]
    metar_min.append(min)

In [10]:
date = pd.DataFrame(metar_month)

In [11]:
date.columns = ['month']
date.head()

Unnamed: 0,month
0,1
1,1
2,1
3,1
4,1


In [12]:
date['month'].unique()

array(['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11',
       '12'], dtype=object)

In [13]:
date['year'] = metar_year
date['day'] = metar_day
date['hour'] = metar_hour
date['min'] = metar_min

In [14]:
date['day'].unique()

array(['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11',
       '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22',
       '23', '24', '25', '26', '27', '28', '29', '30', '31'], dtype=object)

In [15]:
date['hour'].unique()

array(['00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10',
       '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21',
       '22', '23'], dtype=object)

In [16]:
date['min'].unique()

array(['52', '45', '22', '47', '34', '30', '26', '57', '00', '18', '08',
       '11', '01', '35', '19', '37', '28', '15', '12', '21', '16', '29',
       '42', '07', '20', '46', '17', '48', '44', '59', '43', '33', '24',
       '49', '05', '10', '25', '06', '39', '04', '14', '38', '50', '32',
       '54', '40', '02', '23', '13', '31', '56', '27', '03', '58', '41',
       '36', '55', '09', '53'], dtype=object)

In [17]:
date.head()

Unnamed: 0,month,year,day,hour,min
0,1,2015,1,0,52
1,1,2015,1,1,52
2,1,2015,1,2,52
3,1,2015,1,3,52
4,1,2015,1,4,52


---
### (3) Extracting METAR data

Extracting the relevant METAR data from the txt data into a dataframe

---

In [18]:
metar_only_data = []

for i in met:
    wx = i.split(',')[6]
    metar_only_data.append(wx)    

In [19]:
metar_only_data

['METAR KATL 010052Z 33005KT 10SM FEW200 SCT250 05/01 A3037 RMK AO2 SLP289 T00500006=\n',
 'METAR KATL 010152Z 31004KT 10SM FEW250 04/00 A3037 RMK AO2 SLP290 T00440000=\n',
 'METAR KATL 010252Z 00000KT 10SM FEW250 04/00 A3036 RMK AO2 SLP285 T00440000 58001=\n',
 'METAR KATL 010352Z 32005KT 10SM FEW250 03/M01 A3037 RMK AO2 SLP289 T00331006=\n',
 'METAR KATL 010452Z 32006KT 10SM BKN200 03/M01 A3035 RMK AO2 SLP283 T00281011 401060028=\n',
 'METAR KATL 010552Z 33004KT 10SM BKN200 03/M02 A3034 RMK AO2 SLP280 T00281017 10061 20028 58005=\n',
 'METAR KATL 010652Z 32006KT 10SM FEW200 SCT250 02/M02 A3035 RMK AO2 SLP281 T00221017=\n',
 'METAR KATL 010752Z 33005KT 10SM FEW250 02/M02 A3034 RMK AO2 SLP280 T00171022=\n',
 'METAR KATL 010852Z 00000KT 10SM FEW200 02/M03 A3032 RMK AO2 SLP274 T00171033 58006=\n',
 'METAR KATL 010952Z 33007KT 10SM FEW200 01/M03 A3031 RMK AO2 SLP270 T00111028=\n',
 'METAR KATL 011052Z 31003KT 10SM FEW250 01/M03 A3030 RMK AO2 SLP268 T00061033=\n',
 'METAR KATL 011152Z 3100

In [20]:
# Looping through the strings to extract relevant data using Regex

metar = []

for i in metar_only_data:
    
    data={}
    
    try:
        report_type = re.search('METAR|SPECI', i).group()
    except:
        pass
    
    try:
        varywind = re.search(' \d{3}\D\d{3} ', i).group()
    except:
        pass
    
    try:
        vis = re.search('(\d\/\d|\d.)[S][M] ', i).group()
    except:
        pass
    
    try:
        cloud = re.search('((SKC|NCD|CLR|NSC|FEW|SCT|BKN|OVC|VV)\d* )+', i).group()
    except:
        pass
    
    try:
        qnh = re.search('A\d{4}', i).group()
    except:
        pass
    
    try:
        temp = re.search('[M]*\d{2}\/', i).group()
    except:
        pass
    
    try:
        dew = re.search('\/[M]*\d{2}', i).group()
    except:
        pass
    ####
    try:
        TS = re.search('TS', i).group()
    except:
        pass
    
    try:
        snow_ground = re.search('SOG', i).group()
    except:
        pass
    
    try:
        lightning = re.search('LTG', i).group()
    except:
        pass
    
    try:
        hail = re.search('GR|GS', i).group()
    except:
        pass
    
    try:
        shower = re.search('SH', i).group()
    except:
        pass
    
    try:
        rain = re.search('RA', i).group()
    except:
        pass
    
    try:
        snow = re.search('SN', i).group()
    except:
        pass
    
    try:
        low_intensity = re.search('-', i).group()
    except:
        pass
    
    try:
        high_intensity = re.search('+', i).group()
    except:
        pass
    
    try:
        vicinity = re.search('VC', i).group()
    except:
        pass
    
    try:
        squall = re.search('SQ', i).group()
    except:
        pass
    
    # Putting in dict
    try:
        data['type'] = report_type
    except:
        pass
    
    try:
        data['aerodrome'] = i[6:10]
    except:
        pass
    
    try:
        data['DayTime'] = i[11:18]
    except:
        pass
    
    try:
        data['winddirspd'] = i[19:26]
    except:
        pass
    
    try: 
        data['wind_variation'] = varywind
    except:
        pass
    
    try:
        data['visibility'] = vis
    except:
        pass    
    
    try:
        data['cloud'] = cloud
    except:
        pass
        
    try:
        data['temp'] = temp
    except:
        pass
    
    try:
        data['dew_point'] = dew
    except:
        pass
    
    try:
        data['QNH'] = qnh  
    except:
        pass
    
    try:
        data['thunderyshower'] = TS  
    except:
        pass
    
    try:
        data['snow_on_grnd'] = snow_ground  
    except:
        pass
    
    try:
        data['lightning'] = lightning
    except:
        pass
    
    try:
        data['hail'] = hail
    except:
        pass
    
    try:
        data['shower'] = shower
    except:
        pass
    
    try:
        data['rain'] = rain
    except:
        pass
    
    try:
        data['snow'] = snow
    except:
        pass
    
    try:
        data['low_intensity'] = low_intensity
    except:
        pass
    
    try:
        data['high_intensity'] = high_intensity
    except:
        pass
    
    try:
        data['squall'] = squall
    except:
        pass
    
    try:
        data['vicinity'] = vicinity
    except:
        pass
    
    metar.append(data)

In [21]:
df = pd.DataFrame(metar)

In [22]:
df.head()

Unnamed: 0,type,aerodrome,DayTime,winddirspd,visibility,cloud,temp,dew_point,QNH,vicinity,low_intensity,rain,wind_variation,thunderyshower,lightning,snow,shower,squall
0,METAR,KATL,010052Z,33005KT,10SM,FEW200 SCT250,05/,/01,A3037,,,,,,,,,
1,METAR,KATL,010152Z,31004KT,10SM,FEW250,04/,/00,A3037,,,,,,,,,
2,METAR,KATL,010252Z,00000KT,10SM,FEW250,04/,/00,A3036,,,,,,,,,
3,METAR,KATL,010352Z,32005KT,10SM,FEW250,03/,/M01,A3037,,,,,,,,,
4,METAR,KATL,010452Z,32006KT,10SM,BKN200,03/,/M01,A3035,,,,,,,,,


In [23]:
pd.set_option('display.max_rows', 10000)
df = df.fillna(0)

---
### (4) Combining METAR and date

Combining the METAR data and the date information

---

In [24]:
df = date.merge(df, how='left', left_index=True, right_index=True)

In [25]:
# DayTime is now redundant and we will drop it

df.drop('DayTime', axis=1, inplace=True)

In [26]:
# We will also drop aerodrome as it is a constant (we are only looking at KATL)
# Also dropping cloud as it (technically) does not affect the approach

df.drop(['aerodrome', 'cloud'], axis=1, inplace=True)

_Clouds in itself do not affect the approach unless it is a thunderstorm cloud in the approach path or near it. If it is a thunderstorm, it could be represented by strong winds, gusts and thunderyshowers. The cloud variable here in itself does not tell us the exact location of the cloud and as such would be of no purpose to include in our model._

In [27]:
pd.set_option('display.max_columns', 40)
df.head()

Unnamed: 0,month,year,day,hour,min,type,winddirspd,visibility,temp,dew_point,QNH,vicinity,low_intensity,rain,wind_variation,thunderyshower,lightning,snow,shower,squall
0,1,2015,1,0,52,METAR,33005KT,10SM,05/,/01,A3037,0,0,0,0,0,0,0,0,0
1,1,2015,1,1,52,METAR,31004KT,10SM,04/,/00,A3037,0,0,0,0,0,0,0,0,0
2,1,2015,1,2,52,METAR,00000KT,10SM,04/,/00,A3036,0,0,0,0,0,0,0,0,0
3,1,2015,1,3,52,METAR,32005KT,10SM,03/,/M01,A3037,0,0,0,0,0,0,0,0,0
4,1,2015,1,4,52,METAR,32006KT,10SM,03/,/M01,A3035,0,0,0,0,0,0,0,0,0


In [28]:
df['type'].unique()

array(['METAR', 'SPECI'], dtype=object)

In [29]:
df[df['type'] == 'SPECI'].head()

Unnamed: 0,month,year,day,hour,min,type,winddirspd,visibility,temp,dew_point,QNH,vicinity,low_intensity,rain,wind_variation,thunderyshower,lightning,snow,shower,squall
42,1,2015,2,18,45,SPECI,09005KT,10SM,09/,/07,A3025,VC,-,RA,0,0,0,0,0,0
44,1,2015,2,19,22,SPECI,33007KT,1/2SM,09/,/08,A3029,VC,-,RA,0,0,0,0,0,0
45,1,2015,2,19,47,SPECI,04004KT,1/2SM,09/,/35,A3029,VC,-,RA,0,0,0,0,0,0
47,1,2015,2,20,34,SPECI,07010KT,1/2SM,09/,/07,A3025,VC,-,RA,0,0,0,0,0,0
49,1,2015,2,21,30,SPECI,05007KT,1/2SM,09/,/07,A3028,VC,-,RA,0,0,0,0,0,0


For the purpose of predictions we will only use METAR. 

In a real world situation, we will only be able to get the TAF which is the weather forecast. METAR is the actual weather observed at the time while SPECI is the ammended weather observed if it fluctuates above a certain threshold. For the purpose of this project, we will not take SPECI into account for the models and will solely depend on METAR readings for our predictions. 

_For real life predictions, we will replace METAR reports with TAF reports and take in the inaccuracies of the TAF reports as a variance of our model_

In [30]:
df.shape

(10990, 20)

In [31]:
df = df[df['type'] == 'METAR']

In [32]:
df.shape

(8794, 20)

In [31]:
# Type feature is no longer needed as SPECI is dropped

df.drop('type', axis=1, inplace=True)
df.head()

Unnamed: 0,month,year,day,hour,min,QNH,dew_point,lightning,low_intensity,rain,shower,snow,squall,temp,thunderyshower,vicinity,visibility,wind_variation,winddirspd
0,1,2015,1,0,52,A3037,/01,0,0,0,0,0,0,05/,0,0,10SM,0,33005KT
1,1,2015,1,1,52,A3037,/00,0,0,0,0,0,0,04/,0,0,10SM,0,31004KT
2,1,2015,1,2,52,A3036,/00,0,0,0,0,0,0,04/,0,0,10SM,0,00000KT
3,1,2015,1,3,52,A3037,/M01,0,0,0,0,0,0,03/,0,0,10SM,0,32005KT
4,1,2015,1,4,52,A3035,/M01,0,0,0,0,0,0,03/,0,0,10SM,0,32006KT


---
### (5) Exporting the weather data

---

In [32]:
df.to_csv('../datasets/unclean_wx.csv', index=False)

In [33]:
# Testing regex

re.search('VC', 'OVC VC VCR').group() # Gives exact 'VC'

'VC'

In [34]:
df[df['vicinity'] == 'VC'].head()

Unnamed: 0,month,year,day,hour,min,QNH,dew_point,lightning,low_intensity,rain,shower,snow,squall,temp,thunderyshower,vicinity,visibility,wind_variation,winddirspd
20,1,2015,1,20,52,A3026,/M03,0,0,0,0,0,0,13/,0,VC,10SM,0,VRB03KT
21,1,2015,1,21,52,A3026,/M01,0,0,0,0,0,0,12/,0,VC,10SM,0,28005KT
22,1,2015,1,22,52,A3027,/M01,0,0,0,0,0,0,12/,0,VC,10SM,0,30006KT
23,1,2015,1,23,52,A3028,/M01,0,-,0,0,0,0,11/,0,VC,10SM,0,00000KT
24,1,2015,2,0,52,A3031,/01,0,-,RA,0,0,0,10/,0,VC,10SM,0,33007KT
