<img src="../images/airplane-symbol.jpg" style="float: left; margin: 20px;" width="50" height="50"> 
#  Predicting Flight Delays (<i>a Proof-of-Concept</i>)

Author: Solomon Heng

---

# (2) Cleaning Weather Data

## Processes covered in this notebook:
1. [Importing Extracted METAR data](#(1)-Importing-Extracted-METAR-data)
2. [Cleaning dew_point](#(2)-Cleaning-dew_point)
3. [Cleaning temp](#(3)-Cleaning-temp)
4. [Cleaning visibility](#(4)-Cleaning-visibility)
5. [Cleaning winddirspd](#(5)-Cleaning-winddirspd)
6. [Cleaning QNH](#(6)-Cleaning-QNH)
7. [Cleaning wind variation](#(7)-Cleaning-wind-variation)
8. [Final checks and exporting dataset](#(8)-Final-checks-and-exporting-dataset)

In [1]:
import re
import pandas as pd
import numpy as np

---
### (1) Importing Extracted METAR data

---

In [2]:
wx = pd.read_csv('../datasets/unclean_wx.csv')

In [3]:
wx.head()

Unnamed: 0,month,year,day,hour,min,QNH,dew_point,lightning,low_intensity,rain,shower,snow,squall,temp,thunderyshower,vicinity,visibility,wind_variation,winddirspd
0,1,2015,1,0,52,A3037,/01,0,0,0,0,0,0,05/,0,0,10SM,0,33005KT
1,1,2015,1,1,52,A3037,/00,0,0,0,0,0,0,04/,0,0,10SM,0,31004KT
2,1,2015,1,2,52,A3036,/00,0,0,0,0,0,0,04/,0,0,10SM,0,00000KT
3,1,2015,1,3,52,A3037,/M01,0,0,0,0,0,0,03/,0,0,10SM,0,32005KT
4,1,2015,1,4,52,A3035,/M01,0,0,0,0,0,0,03/,0,0,10SM,0,32006KT


---
### (2) Cleaning dew_point

---

In [4]:
wx['dew_point'] = [i.replace('/', '') for i in wx['dew_point']]
wx['dew_point'] = [i.replace('M', '-') for i in wx['dew_point']]

In [5]:
# Converting to correct dtype

wx['dew_point'] = wx['dew_point'].astype(int)

In [6]:
wx.head(2)

Unnamed: 0,month,year,day,hour,min,QNH,dew_point,lightning,low_intensity,rain,shower,snow,squall,temp,thunderyshower,vicinity,visibility,wind_variation,winddirspd
0,1,2015,1,0,52,A3037,1,0,0,0,0,0,0,05/,0,0,10SM,0,33005KT
1,1,2015,1,1,52,A3037,0,0,0,0,0,0,0,04/,0,0,10SM,0,31004KT
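
The two replacements above can also be wrapped in a single helper that is reusable for both `dew_point` and `temp`. This is a minimal sketch, not the notebook's own code: the name `parse_metar_temp` and the sample tokens are hypothetical, and it assumes every token is a two-digit value with an optional `M` (minus) prefix and a stray `/`.

```python
def parse_metar_temp(token: str) -> int:
    """Convert a METAR temperature token such as '/M01' or '05/' to whole degrees Celsius."""
    # Drop the slash separator and turn the 'M' (minus) prefix into a sign
    cleaned = token.strip().replace('/', '').replace('M', '-')
    return int(cleaned)

# Hypothetical sample tokens in the same shape as the dew_point and temp columns
sample_tokens = ['/01', '/00', '/M01', '05/']
parsed = [parse_metar_temp(t) for t in sample_tokens]
print(parsed)
```

With a DataFrame, the same helper could be applied as `wx['dew_point'].map(parse_metar_temp)`.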


---
### (3) Cleaning temp

---

In [7]:
wx['temp'] = [i.replace('/', '') for i in wx['temp']]
wx['temp'] = [i.replace('M', '-') for i in wx['temp']]

In [8]:
# Converting to correct dtype

wx['temp'] = wx['temp'].astype(int)

In [9]:
wx.head(2)

Unnamed: 0,month,year,day,hour,min,QNH,dew_point,lightning,low_intensity,rain,shower,snow,squall,temp,thunderyshower,vicinity,visibility,wind_variation,winddirspd
0,1,2015,1,0,52,A3037,1,0,0,0,0,0,0,5,0,0,10SM,0,33005KT
1,1,2015,1,1,52,A3037,0,0,0,0,0,0,0,4,0,0,10SM,0,31004KT


---
### (4) Cleaning visibility

We will convert all visibility readings to floats, using the statute mile (SM) as the base unit.

---

In [10]:
wx.visibility.unique()

array(['10SM ', '1/2SM ', '3/4SM ', '1/4SM ', '16SM ', '1/8SM '],
      dtype=object)

In [11]:
wx[wx.visibility == '16SM ']

Unnamed: 0,month,year,day,hour,min,QNH,dew_point,lightning,low_intensity,rain,shower,snow,squall,temp,thunderyshower,vicinity,visibility,wind_variation,winddirspd
1495,3,2015,4,6,52,A3007,16,LTG,-,RA,SH,SN,0,10,TS,VC,16SM,040V110,14003KT
1496,3,2015,4,7,52,A3006,16,LTG,-,RA,SH,SN,0,10,TS,VC,16SM,040V110,15004KT


In [12]:
wx.iloc[1846,:].visibility

'10SM '

In [14]:
wx['visibility'] = [i.replace('SM', '') for i in wx['visibility']]
wx['visibility'] = [i.replace('1/2', '0.5') for i in wx['visibility']]
wx['visibility'] = [i.replace('3/4', '0.75') for i in wx['visibility']]
wx['visibility'] = [i.replace('1/4', '0.25') for i in wx['visibility']]
wx['visibility'] = [i.replace('1/8', '0.125') for i in wx['visibility']]

In [15]:
wx.visibility.unique()

array(['10 ', '0.5 ', '0.75 ', '0.25 ', '16 ', '0.125 '], dtype=object)

In [16]:
# Converting to correct dtype

wx['visibility'] = wx['visibility'].astype(float)

In [17]:
wx.head()

Unnamed: 0,month,year,day,hour,min,QNH,dew_point,lightning,low_intensity,rain,shower,snow,squall,temp,thunderyshower,vicinity,visibility,wind_variation,winddirspd
0,1,2015,1,0,52,A3037,1,0,0,0,0,0,0,5,0,0,10.0,0,33005KT
1,1,2015,1,1,52,A3037,0,0,0,0,0,0,0,4,0,0,10.0,0,31004KT
2,1,2015,1,2,52,A3036,0,0,0,0,0,0,0,4,0,0,10.0,0,00000KT
3,1,2015,1,3,52,A3037,-1,0,0,0,0,0,0,3,0,0,10.0,0,32005KT
4,1,2015,1,4,52,A3035,-1,0,0,0,0,0,0,3,0,0,10.0,0,32006KT
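
Replacing each fraction by hand works for the six values seen here, but it needs a new line for every fraction that appears. As a hedged alternative, the sketch below uses the standard library's `fractions.Fraction` to convert any fractional token. The helper name `visibility_to_miles` is hypothetical, and the mixed-number case (`'1 1/2SM'`) is an assumption about other METAR data, not something observed in this dataset.

```python
from fractions import Fraction

def visibility_to_miles(token: str) -> float:
    """Convert a visibility token such as '10SM ' or '1/2SM ' to statute miles."""
    value = token.strip().replace('SM', '')
    # Summing the whitespace-separated parts also covers mixed numbers like '1 1/2'
    return float(sum(Fraction(part) for part in value.split()))

# The unique tokens shown in the notebook output above
tokens = ['10SM ', '1/2SM ', '3/4SM ', '1/4SM ', '16SM ', '1/8SM ']
miles = [visibility_to_miles(t) for t in tokens]
print(miles)
```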


---
### (5) Cleaning winddirspd

We will be splitting the information up into:
1. Wind Direction _(winddir)_
2. Wind Speed _(windspd)_
3. Wind Gusts _(windgust)_

---

In [18]:
wx.winddirspd.unique()

array(['33005KT', '31004KT', '00000KT', '32005KT', '32006KT', '33004KT',
       '33007KT', '31003KT', '31006KT', 'VRB03KT', '26003KT', '29004KT',
       '30003KT', '28005KT', '30006KT', '34004KT', '34005KT', '35006KT',
       '03005KT', '20003KT', '30004KT', '15003KT', '33006KT', '04004KT',
       'COR 000', '06003KT', '04006KT', '03007KT', '07006KT', '11004KT',
       '11006KT', '14006KT', '13006KT', '12006KT', '10005KT', '12007KT',
       '10006KT', '11009KT', '10008KT', '09009KT', '11007KT', '11008KT',
       '10009KT', '12010KT', '12009KT', '12008KT', '11005KT', '09006KT',
       '10007KT', '14007KT', '18008KT', '20008KT', '17005KT', '17006KT',
       '16005KT', '13004KT', '24012G2', '20007KT', '19005KT', '19008KT',
       '25005KT', '24008KT', '24010KT', '24009G1', '26013G2', '26012G2',
       '27014G1', '29016G2', '30015G2', '31014G2', '29013G1', '29014G2',
       '31013G2', '32011KT', '32011G1', '33014KT', '33019G2', '33014G2',
       'COR 330', '34012KT', '33010G2', '32014KT', 

In [19]:
wx[wx['winddirspd'].str.contains('COR', na=False)].head()

Unnamed: 0,month,year,day,hour,min,QNH,dew_point,lightning,low_intensity,rain,shower,snow,squall,temp,thunderyshower,vicinity,visibility,wind_variation,winddirspd
40,1,2015,2,16,52,A3032,7,0,-,RA,0,0,0,8,0,VC,10.0,0,COR 000
107,1,2015,5,11,52,A3041,-2,LTG,-,RA,0,0,0,4,TS,VC,10.0,003V007,COR 330
327,1,2015,14,15,52,A3028,3,LTG,-,RA,0,0,0,5,TS,VC,0.5,003V006,COR 350
410,1,2015,18,2,52,A3013,1,LTG,-,RA,0,SN,0,9,TS,VC,10.0,008V011,COR 150
836,2,2015,4,20,52,A3014,-4,LTG,-,RA,SH,SN,0,10,TS,VC,10.0,004V008,COR 180


In [20]:
wx.iloc[105:109,]

Unnamed: 0,month,year,day,hour,min,QNH,dew_point,lightning,low_intensity,rain,shower,snow,squall,temp,thunderyshower,vicinity,visibility,wind_variation,winddirspd
105,1,2015,5,9,52,A3037,-1,LTG,-,RA,0,0,0,4,TS,VC,10.0,003V007,33019G2
106,1,2015,5,10,52,A3040,-1,LTG,-,RA,0,0,0,4,TS,VC,10.0,003V007,33014G2
107,1,2015,5,11,52,A3041,-2,LTG,-,RA,0,0,0,4,TS,VC,10.0,003V007,COR 330
108,1,2015,5,12,52,A3045,-2,LTG,-,RA,0,0,0,3,TS,VC,10.0,003V007,34012KT


In [21]:
re.findall(r'\d{3}', wx['winddirspd'][0])[0]

'330'

In [22]:
# Imputing nan with 999 for subsequent parsing

wx['winddirspd'] = wx['winddirspd'].fillna('999')

In [23]:
wx['winddir'] = [re.findall(r'\d{3}', i) for i in wx['winddirspd']]
wx['windspd'] = [re.findall(r'\d{2}G|\d{2}KT', i) for i in wx['winddirspd']]
wx['windgust'] = [re.findall(r'G\d+', i) for i in wx['winddirspd']]

In [24]:
wx.head()

Unnamed: 0,month,year,day,hour,min,QNH,dew_point,lightning,low_intensity,rain,...,squall,temp,thunderyshower,vicinity,visibility,wind_variation,winddirspd,winddir,windspd,windgust
0,1,2015,1,0,52,A3037,1,0,0,0,...,0,5,0,0,10.0,0,33005KT,[330],[05KT],[]
1,1,2015,1,1,52,A3037,0,0,0,0,...,0,4,0,0,10.0,0,31004KT,[310],[04KT],[]
2,1,2015,1,2,52,A3036,0,0,0,0,...,0,4,0,0,10.0,0,00000KT,[000],[00KT],[]
3,1,2015,1,3,52,A3037,-1,0,0,0,...,0,3,0,0,10.0,0,32005KT,[320],[05KT],[]
4,1,2015,1,4,52,A3035,-1,0,0,0,...,0,3,0,0,10.0,0,32006KT,[320],[06KT],[]
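
The three separate patterns above can also be expressed as one regular expression with named groups. The sketch below is an illustration under stated assumptions, not the notebook's method: `WIND_RE` and `parse_wind` are hypothetical names, and the pattern assumes complete wind groups such as `24012G25KT`. The values in this dataset appear to be cut off at seven characters (for example `24012G2`), so they would need to be re-extracted in full before a pattern like this could recover the gust speed.

```python
import re

# direction (3 digits or VRB), speed, optional gust, then the KT unit
WIND_RE = re.compile(r'^(?P<dir>\d{3}|VRB)(?P<spd>\d{2,3})(?:G(?P<gust>\d{2,3}))?KT$')

def parse_wind(token):
    """Return (direction, speed, gust) for a full wind group, or None if it does not match.

    direction is None for variable (VRB) winds; gust is 0 when no gust is reported.
    """
    match = WIND_RE.match(token.strip())
    if match is None:
        return None
    direction = None if match['dir'] == 'VRB' else int(match['dir'])
    gust = int(match['gust']) if match['gust'] else 0
    return direction, int(match['spd']), gust

print(parse_wind('33005KT'))     # steady wind
print(parse_wind('VRB03KT'))     # variable direction
print(parse_wind('24012G25KT'))  # hypothetical untruncated gust group
print(parse_wind('COR 000'))     # corrected-observation residue does not match
```

Returning `None` for unparseable tokens keeps the missing-value handling in one place instead of relying on a sentinel such as `999`.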


In [25]:
def extract_from_list(x):
    """Return the first element of the list in a cell, or None if the list is empty."""
    try:
        return x[0]
    except IndexError:
        # Empty list: the regex found no match for this row
        return None

In [26]:
wx['windspd'] = wx['windspd'].apply(extract_from_list)
wx['winddir'] = wx['winddir'].apply(extract_from_list)
wx['windgust'] = wx['windgust'].apply(extract_from_list)

In [27]:
# Imputing nan with 999 for subsequent parsing of wind speed and direction

wx['windspd'] = wx['windspd'].fillna('999')
# Variable winds (VRB) have no 3-digit direction for the regex above to capture, so they show up as nulls here
wx['winddir'] = wx['winddir'].fillna('999')

In [28]:
# Dropping the units 'KT's from the wind speed feature
wx['windspd'] = [i.replace('KT', '') for i in wx['windspd']]
wx['windspd'] = [i.replace('G', '') for i in wx['windspd']]

# Replacing null with 0 as there is no gust
# (rows that do have a gust keep their raw string, e.g. 'G2')
wx['windgust'] = wx['windgust'].fillna(0)

In [29]:
wx.head()

Unnamed: 0,month,year,day,hour,min,QNH,dew_point,lightning,low_intensity,rain,...,squall,temp,thunderyshower,vicinity,visibility,wind_variation,winddirspd,winddir,windspd,windgust
0,1,2015,1,0,52,A3037,1,0,0,0,...,0,5,0,0,10.0,0,33005KT,330,5,0
1,1,2015,1,1,52,A3037,0,0,0,0,...,0,4,0,0,10.0,0,31004KT,310,4,0
2,1,2015,1,2,52,A3036,0,0,0,0,...,0,4,0,0,10.0,0,00000KT,0,0,0
3,1,2015,1,3,52,A3037,-1,0,0,0,...,0,3,0,0,10.0,0,32005KT,320,5,0
4,1,2015,1,4,52,A3035,-1,0,0,0,...,0,3,0,0,10.0,0,32006KT,320,6,0


'COR' marks a corrected observation, and for these rows the actual wind speed reading was not captured. As such, whenever we have a corrected observation, we will fill the speed with the next recorded value.

We will do the same _(backfill from the next recorded hour)_ for all other null wind values.

In [30]:
wx['windspd'] = wx['windspd'].astype(int)
wx['winddir'] = wx['winddir'].astype(int)

In [31]:
wx['windspd'] = [np.nan if i == 999 else i for i in wx['windspd']]
wx['winddir'] = [np.nan if i == 999 else i for i in wx['winddir']]

In [32]:
wx[wx['windspd'].isnull()].head()

Unnamed: 0,month,year,day,hour,min,QNH,dew_point,lightning,low_intensity,rain,...,squall,temp,thunderyshower,vicinity,visibility,wind_variation,winddirspd,winddir,windspd,windgust
40,1,2015,2,16,52,A3032,7,0,-,RA,...,0,8,0,VC,10.0,0,COR 000,0.0,,0
107,1,2015,5,11,52,A3041,-2,LTG,-,RA,...,0,4,TS,VC,10.0,003V007,COR 330,330.0,,0
327,1,2015,14,15,52,A3028,3,LTG,-,RA,...,0,5,TS,VC,0.5,003V006,COR 350,350.0,,0
410,1,2015,18,2,52,A3013,1,LTG,-,RA,...,0,9,TS,VC,10.0,008V011,COR 150,150.0,,0
836,2,2015,4,20,52,A3014,-4,LTG,-,RA,...,0,10,TS,VC,10.0,004V008,COR 180,180.0,,0


In [33]:
# Backfill: each null takes the next valid reading
wx['windspd'] = wx['windspd'].bfill()
wx['winddir'] = wx['winddir'].bfill()
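
To make the effect of backfilling concrete, the sketch below compares it with linear interpolation on a small, made-up series of wind speeds (the values are hypothetical, not taken from the dataset). Backfilling copies the next valid reading into each gap, while interpolation spreads the change evenly between the neighbouring readings. Interpolation is only shown for speed: wind direction is circular, so a plain average across 360°/0° would be misleading.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly wind speeds with gaps
speeds = pd.Series([19.0, np.nan, 12.0, np.nan, np.nan, 10.0])

# Backfill: each gap takes the next valid observation
backfilled = speeds.bfill()

# Alternative: linear interpolation between the surrounding observations
interpolated = speeds.interpolate(method='linear')

print(pd.DataFrame({'raw': speeds, 'bfill': backfilled, 'linear': interpolated}))
```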

In [34]:
# Dropping winddirspd feature as we no longer need it

wx.drop('winddirspd', axis=1, inplace=True)

In [35]:
pd.set_option('display.max_columns', 40)
wx.head()

Unnamed: 0,month,year,day,hour,min,QNH,dew_point,lightning,low_intensity,rain,shower,snow,squall,temp,thunderyshower,vicinity,visibility,wind_variation,winddir,windspd,windgust
0,1,2015,1,0,52,A3037,1,0,0,0,0,0,0,5,0,0,10.0,0,330.0,5.0,0
1,1,2015,1,1,52,A3037,0,0,0,0,0,0,0,4,0,0,10.0,0,310.0,4.0,0
2,1,2015,1,2,52,A3036,0,0,0,0,0,0,0,4,0,0,10.0,0,0.0,0.0,0
3,1,2015,1,3,52,A3037,-1,0,0,0,0,0,0,3,0,0,10.0,0,320.0,5.0,0
4,1,2015,1,4,52,A3035,-1,0,0,0,0,0,0,3,0,0,10.0,0,320.0,6.0,0


---
### (6) Cleaning QNH

_A3006 indicates the altimeter setting is 30.06 inHg (1,018 hPa) - Wikipedia_

---

In [36]:
# Correcting the QNH feature

wx['QNH'] = [i.replace('A', '') for i in wx['QNH']]

In [37]:
wx['QNH'] = wx['QNH'].astype(int)
wx['QNH'] = [i/100 for i in wx['QNH']]

In [38]:
wx.head()

Unnamed: 0,month,year,day,hour,min,QNH,dew_point,lightning,low_intensity,rain,shower,snow,squall,temp,thunderyshower,vicinity,visibility,wind_variation,winddir,windspd,windgust
0,1,2015,1,0,52,30.37,1,0,0,0,0,0,0,5,0,0,10.0,0,330.0,5.0,0
1,1,2015,1,1,52,30.37,0,0,0,0,0,0,0,4,0,0,10.0,0,310.0,4.0,0
2,1,2015,1,2,52,30.36,0,0,0,0,0,0,0,4,0,0,10.0,0,0.0,0.0,0
3,1,2015,1,3,52,30.37,-1,0,0,0,0,0,0,3,0,0,10.0,0,320.0,5.0,0
4,1,2015,1,4,52,30.35,-1,0,0,0,0,0,0,3,0,0,10.0,0,320.0,6.0,0
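
The altimeter conversion can be checked against the quoted example (A3006 is 30.06 inHg, roughly 1,018 hPa). The sketch below is illustrative only: the function names are hypothetical, and the conversion factor of about 33.8639 hPa per inHg is supplied here rather than taken from the notebook.

```python
INHG_TO_HPA = 33.8639  # approximate hPa per inch of mercury

def parse_altimeter(token: str) -> float:
    """Convert an altimeter token such as 'A3006' to inches of mercury."""
    token = token.strip()
    if not token.startswith('A'):
        raise ValueError(f"expected an 'A'-prefixed altimeter group, got {token!r}")
    # The four digits are hundredths of an inch of mercury
    return int(token[1:]) / 100

def inhg_to_hpa(inhg: float) -> float:
    """Convert inches of mercury to hectopascals."""
    return inhg * INHG_TO_HPA

qnh_inhg = parse_altimeter('A3006')
qnh_hpa = inhg_to_hpa(qnh_inhg)
print(qnh_inhg, round(qnh_hpa))
```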


---
### (7) Cleaning wind variation

We will drop the wind variation feature, as it would not serve much purpose in the model. It is too specific: technically every aircraft would experience a different variation (even different variations for different approaches), which makes it nearly impossible to determine what an aircraft actually experiences. Using the averaged wind direction ('winddir') is more meaningful.

---

In [39]:
wx.drop('wind_variation', axis=1, inplace=True)

---
### (8) Final checks and exporting dataset

---

In [40]:
# Checking for duplicates

wx[wx.duplicated(subset=['month', 'day', 'hour'])]

Unnamed: 0,month,year,day,hour,min,QNH,dew_point,lightning,low_intensity,rain,shower,snow,squall,temp,thunderyshower,vicinity,visibility,winddir,windspd,windgust
1179,2,2015,19,2,52,30.15,-13,LTG,-,RA,SH,SN,0,-3,TS,VC,10.0,310.0,19.0,G2
3280,5,2015,17,14,52,30.28,18,LTG,-,RA,SH,SN,0,24,TS,VC,10.0,160.0,5.0,0
4016,6,2015,17,10,52,30.13,21,LTG,-,RA,SH,SN,0,25,TS,VC,10.0,300.0,5.0,0
4096,6,2015,20,17,52,30.06,22,LTG,-,RA,SH,SN,0,26,TS,VC,10.0,290.0,6.0,0
4194,6,2015,24,21,52,30.13,20,LTG,-,RA,SH,SN,0,22,TS,VC,0.5,220.0,9.0,0
4345,7,2015,1,5,52,30.04,18,LTG,-,RA,SH,SN,0,22,TS,VC,10.0,250.0,6.0,0
4868,7,2015,23,0,52,29.95,21,LTG,-,RA,SH,SN,0,26,TS,VC,10.0,10.0,6.0,0
5629,8,2015,23,16,52,30.03,21,LTG,-,RA,SH,SN,SQ,27,TS,VC,10.0,320.0,3.0,0
5639,8,2015,24,2,52,29.99,20,LTG,-,RA,SH,SN,SQ,26,TS,VC,10.0,350.0,4.0,0
5648,8,2015,24,10,52,29.98,20,LTG,-,RA,SH,SN,SQ,22,TS,VC,10.0,310.0,6.0,0


Some anomalies surfaced later in the analysis, so we come back to zoom in on them here.

In [41]:
wx.iloc[1177:1182,]

Unnamed: 0,month,year,day,hour,min,QNH,dew_point,lightning,low_intensity,rain,shower,snow,squall,temp,thunderyshower,vicinity,visibility,winddir,windspd,windgust
1177,2,2015,19,1,52,30.12,-12,LTG,-,RA,SH,SN,0,-3,TS,VC,10.0,310.0,18.0,G3
1178,2,2015,19,2,0,30.12,-12,LTG,-,RA,SH,SN,0,-3,TS,VC,10.0,310.0,19.0,0
1179,2,2015,19,2,52,30.15,-13,LTG,-,RA,SH,SN,0,-3,TS,VC,10.0,310.0,19.0,G2
1180,2,2015,19,3,52,30.18,-14,LTG,-,RA,SH,SN,0,-4,TS,VC,10.0,300.0,17.0,G2
1181,2,2015,19,4,52,30.19,-14,LTG,-,RA,SH,SN,0,-5,TS,VC,10.0,330.0,18.0,G2


In [42]:
wx.iloc[3277:3283,]

Unnamed: 0,month,year,day,hour,min,QNH,dew_point,lightning,low_intensity,rain,shower,snow,squall,temp,thunderyshower,vicinity,visibility,winddir,windspd,windgust
3277,5,2015,17,12,52,30.28,16,LTG,-,RA,SH,SN,0,21,TS,VC,10.0,180.0,4.0,0
3278,5,2015,17,13,52,30.28,17,LTG,-,RA,SH,SN,0,23,TS,VC,10.0,170.0,7.0,0
3279,5,2015,17,14,0,30.28,17,LTG,-,RA,SH,SN,0,23,TS,VC,10.0,160.0,5.0,0
3280,5,2015,17,14,52,30.28,18,LTG,-,RA,SH,SN,0,24,TS,VC,10.0,160.0,5.0,0
3281,5,2015,17,15,52,30.28,18,LTG,-,RA,SH,SN,0,26,TS,VC,10.0,160.0,3.0,0
3282,5,2015,17,16,52,30.27,18,LTG,-,RA,SH,SN,0,27,TS,VC,10.0,160.0,6.0,0


In [43]:
wx['min'].unique()

array([52,  0], dtype=int64)

Notice that some rows have a 0 in the min column. The routine METARs in this dataset are released on the 52nd minute of each hour, so we will treat the 0-min readings as anomalies _(probably wrongly keyed-in data)_ and drop all of them.

In [44]:
wx.shape

(8794, 20)

In [45]:
wx = wx[wx['min'] != 0]

In [46]:
wx.shape

(8739, 20)

If we had a METAR for every hour of the 365-day year, we would have 8760 rows of readings (365 × 24), but after dropping the duplicate 0-min readings we only have 8739 rows. It is likely that readings are missing for certain hours.

We will continue with the 8739 rows for now and later drop any rows in the combined dataset that have no weather data.
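
One way to see which hours are absent is to build a timestamp for every row and compare it with a complete hourly range. The sketch below runs on a tiny hypothetical frame `obs` with the same `year`, `month`, `day` and `hour` columns as `wx`. It is an assumption-laden illustration, not a step the notebook performs.

```python
import pandas as pd

# Hypothetical miniature of wx: the reading for hour 2 is missing
obs = pd.DataFrame({
    'year': [2015, 2015, 2015, 2015],
    'month': [1, 1, 1, 1],
    'day': [1, 1, 1, 1],
    'hour': [0, 1, 3, 4],
})

# pd.to_datetime can assemble timestamps from year/month/day/hour columns
timestamps = pd.DatetimeIndex(pd.to_datetime(obs[['year', 'month', 'day', 'hour']]))

# Every hour between the first and last observation
expected = pd.date_range(timestamps.min(), timestamps.max(), freq=pd.Timedelta(hours=1))

# Hours that should exist but have no reading
missing = expected.difference(timestamps)
print(missing)
```

Applied to the full dataset, the length of `missing` would show how many hourly readings are absent between the first and last observation.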

In [47]:
wx.isnull().sum()

month             0
year              0
day               0
hour              0
min               0
QNH               0
dew_point         0
lightning         0
low_intensity     0
rain              0
shower            0
snow              0
squall            0
temp              0
thunderyshower    0
vicinity          0
visibility        0
winddir           0
windspd           0
windgust          0
dtype: int64

In [48]:
wx.to_csv('../datasets/cleaned_wx.csv', index=False)