# Does the weather affect the delay?

Use the API to pull the weather information for flights. There is no need to get weather for ALL flights. We can choose the right representative sample. Let's focus on four weather types:

- sunny
- cloudy
- rainy
- snow.

Test the hypothesis that these 4 delays are from the same distribution. If they are not, which ones are significantly different?

**Note**: This notebook has the prep work only. For the actual statistical analysis, go to: [Task 3](Task%203.ipynb)

In [378]:
# initialization - importing the required libraries
import pandas as pd 
import numpy as np
import requests
# a hack to print everything . change 'all' to 'last_expr' to revert to default
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## setting up the API


In [379]:
# function for calling api from weatheronline.com
def weather_api(city, date): 
    url = "https://api.worldweatheronline.com/premium/v1/past-weather.ashx?"
    params = {
        "q":city, 
        "date":date,
        "format":"json",
        "tp":1 ,
        "key": key
    }
    headers = {
        "Accept": "application/json"
    }
    response = requests.request("GET", url, params=params, headers=headers)   
    return response.json()

In [None]:
# testing...
result = weather_api('Bakersfield, CA', '2019-04-01')

hourly_report = result['data']['weather'][0]['hourly']
hourly_report

In [None]:
for _ in range(24):
    hourly_report[_]['time']
    hourly_report[_]['weatherDesc'][0]['value']

## Creating the sample dataframe

Steps:
1. Import the main dataframe with just the relevant columns
2. Do EDA cleaning, etc and create new columns as necessary (e.g. flight hour)
3. Filter to keep only recorded weather delays
4. Use `groupby` to create 1 dataframe of delays by date and origin_city,
    and another dataframe of delays by date and dest_city
5. create a for loop to go through the dataframe, calling the API function to retrieve weather information for that date/hour/city
6. compile (5) into a dataframe
7. simplify the weather_details e.g. 'moderate_snow' -> 'snow'

In [347]:
## 1. Import the main dataframe with just the relevant columns
usecols = ['fl_date',
           'origin_city_name',
           'dest_city_name',
           'crs_dep_time',
           'crs_arr_time',
           'weather_delay',
          'arr_delay',
          'dep_delay']

chunk = pd.read_csv('flights.csv', 
                    usecols = usecols, 
                    chunksize=1000000, 
                    low_memory=False)
df = pd.concat(chunk)

In [348]:
## 2. Do EDA cleaning, etc and create new columns as necessary (e.g. flight hour)

In [349]:
# extract hour
df['dep_hour'] = (np.round(df['crs_dep_time'],-2)/100).astype(int)
df['arr_hour'] = (np.round(df['crs_arr_time'],-2)/100).astype(int)

In [350]:
df.head()

Unnamed: 0,fl_date,origin_city_name,dest_city_name,crs_dep_time,dep_delay,crs_arr_time,arr_delay,weather_delay,dep_hour,arr_hour
0,2018-07-14,"Boston, MA","Orlando, FL",1550,-10.0,1910,-42.0,,16,19
1,2018-07-14,"Los Angeles, CA","Minneapolis, MN",1828,6.0,2359,-1.0,,18,24
2,2018-07-14,"Chicago, IL","Fort Lauderdale, FL",1644,0.0,2051,-1.0,,16,21
3,2018-07-14,"Minneapolis, MN","Atlanta, GA",1955,-6.0,2326,-14.0,,20,23
4,2018-07-14,"Las Vegas, NV","Oakland, CA",1759,-5.0,1927,11.0,,18,19


In [351]:
## 3. Filter to keep only recorded weather delays
filter = df['weather_delay'] > 0
df_weather = df[filter].copy()

In [352]:
# cleaning... checking for nulls and dropping
df_weather.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 181998 entries, 36 to 15927381
Data columns (total 10 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   fl_date           181998 non-null  object 
 1   origin_city_name  181998 non-null  object 
 2   dest_city_name    181998 non-null  object 
 3   crs_dep_time      181998 non-null  int64  
 4   dep_delay         181998 non-null  float64
 5   crs_arr_time      181998 non-null  int64  
 6   arr_delay         181998 non-null  float64
 7   weather_delay     181998 non-null  float64
 8   dep_hour          181998 non-null  int64  
 9   arr_hour          181998 non-null  int64  
dtypes: float64(3), int64(4), object(3)
memory usage: 15.3+ MB


In [353]:
df_weather.dropna(inplace=True)
df_weather.head()

Unnamed: 0,fl_date,origin_city_name,dest_city_name,crs_dep_time,dep_delay,crs_arr_time,arr_delay,weather_delay,dep_hour,arr_hour
36,2018-07-14,"Las Vegas, NV","Fort Lauderdale, FL",2250,128.0,638,124.0,117.0,22,6
148,2018-07-14,"Jackson/Vicksburg, MS","Charlotte, NC",1757,4.0,2055,20.0,1.0,18,21
153,2018-07-14,"Tallahassee, FL","Charlotte, NC",1513,157.0,1657,145.0,3.0,15,17
155,2018-07-14,"Jacksonville, FL","Charlotte, NC",1519,52.0,1654,39.0,30.0,15,17
251,2018-07-14,"Cedar Rapids/Iowa City, IA","Charlotte, NC",644,15.0,1005,27.0,15.0,6,10


**Use groupby to create 1 dataframe of delays by date and origin_city, 
and another dataframe of delays by date and dest_city**

In [354]:
# 2 dataframes for weather incidents
# assumption: most frequent location will indicate whether it originated from destination or origin
weather_groupby_origin = df_weather.groupby('fl_date')['origin_city_name'].agg(pd.Series.mode).to_frame()
weather_groupby_dest = df_weather.groupby('fl_date')['dest_city_name'].agg(pd.Series.mode).to_frame()

In [355]:
len(weather_groupby_origin), len(weather_groupby_dest)

(730, 730)

In [372]:
weather_groupby_origin.head()

Unnamed: 0_level_0,origin_city_name
fl_date,Unnamed: 1_level_1
2018-01-01,"Chicago, IL"
2018-01-02,"Chicago, IL"
2018-01-03,"Chicago, IL"
2018-01-04,"Chicago, IL"
2018-01-05,"New York, NY"


In [373]:
weather_groupby_dest.head()

Unnamed: 0_level_0,dest_city_name
fl_date,Unnamed: 1_level_1
2018-01-01,"Chicago, IL"
2018-01-02,"Chicago, IL"
2018-01-03,"Chicago, IL"
2018-01-04,"Atlanta, GA"
2018-01-05,"New York, NY"


In [374]:
# get sample data
df_origin_sample = weather_groupby_origin.sample(n=500)
df_dest_sample = weather_groupby_dest.sample(n=500)

**Scratch work** Skip to the next header

In [375]:
df_origin_sample

Unnamed: 0_level_0,origin_city_name
fl_date,Unnamed: 1_level_1
2019-09-17,"Atlanta, GA"
2019-07-22,"Denver, CO"
2018-12-19,"Raleigh/Durham, NC"
2018-09-19,"Chicago, IL"
2018-08-18,"Dallas/Fort Worth, TX"
...,...
2018-08-31,"Charlotte, NC"
2018-12-01,"Chicago, IL"
2018-04-08,"Minneapolis, MN"
2019-06-12,"Orlando, FL"


In [None]:
idx = 0
date = df_origin_sample.index[idx]
city = df_origin_sample['origin_city_name'][idx]
date, city

In [None]:
filter_date = df_weather['fl_date'] == date
filter_city = df_weather['origin_city_name'] == city
# hours when there was a weather delay
hours = df_weather[(filter_date)& (filter_city)]['dep_hour'].unique()
# index of records with weather delay
index_time = df_weather[(filter_date)& (filter_city)]['dep_hour'].index

In [None]:
df_weather[(filter_date)& (filter_city)]['dep_hour'].index

In [None]:
# make a frame with date/city/dep_hour/weather_delay/type_of_weather

In [None]:
for hour in hours: 
    hourly_report[hour]['weatherDesc'][0]['value']

In [None]:
hour = 19
hourly_report[hour]['weatherDesc'][0]['value']

In [None]:
df_weather.loc[index_time[0]]

**End of scratch work**

### 5. Create a for loop to go through the dataframe, calling the API function to retrieve weather information for that date/hour/city

**Capturing the bad weather for departing flights**

In [380]:
# create a copy of the main df_weather for safety
df_weather_final = df_weather.copy()

In [None]:
# loop through all the records in the df_origin dataset

for date in df_origin_sample.index:
    # get the date and city that will be sent to the API
    city = df_origin_sample.loc[date].values[0]
    
    # call the API and save the hourly_report of that day and city
    result = weather_api(city, date)
    hourly_report = result['data']['weather'][0]['hourly']
    
    # make a dictionary of this report
    hourly_report_dictionary = {}
    for hr in range(24):
        hourly_report_dictionary[hr] = hourly_report[hour]['weatherDesc'][0]['value']

    
    # use date and city to filter the main df_weather_final dataframe
    filter_date = df_weather_final['fl_date'] == date
    # capture 'city' is actually a list error
    try:
        filter_city = df_weather_final['origin_city_name'] == city
    except:
        city = city[0]
        filter_city = df_weather_final['origin_city_name'] == city
    
    # retrive the hours when there was a weather delay
    hours = df_weather_final[(filter_date)& (filter_city)]['dep_hour'].unique()
    # index of records with weather delay
    index_time = df_weather_final[(filter_date)& (filter_city)]['dep_hour'].index
    
    # loop through the records of bad weather
    for idx in index_time:  
        # get the reference hour of the flight
        ref_hour = df_weather_final.loc[idx, 'dep_hour']
        # cross-reference it in the dictionary to get the weather at that time
        
        # capture the rounding error
        try:
            df_weather_final.loc[idx, 'weather'] = hourly_report_dictionary[ref_hour]
        except:
            if ref_hour == 24:
                ref_hour = 23
            else:
                continue
        
df_weather_final.drop(columns=['arr_delay'], inplace=True)
df_weather_final.rename(columns={'dep_delay': 'delay'}, inplace=True)



In [383]:
df_weather_final.dropna()

Unnamed: 0,fl_date,origin_city_name,dest_city_name,crs_dep_time,delay,crs_arr_time,weather_delay,dep_hour,arr_hour,weather
332,2018-07-14,"Atlanta, GA","Austin, TX",1730,92.0,1850,56.0,17,18,Moderate or heavy rain shower
374,2018-07-14,"Atlanta, GA","Jacksonville, FL",1710,72.0,1825,57.0,17,18,Moderate or heavy rain shower
377,2018-07-14,"Atlanta, GA","Las Vegas, NV",1740,39.0,1855,37.0,17,19,Moderate or heavy rain shower
419,2018-07-14,"Atlanta, GA","Tampa, FL",1715,73.0,1845,69.0,17,18,Moderate or heavy rain shower
3878,2018-07-14,"Atlanta, GA","Phoenix, AZ",1802,80.0,1902,5.0,18,19,Moderate or heavy rain shower
...,...,...,...,...,...,...,...,...,...,...
15921600,2018-07-14,"Atlanta, GA","Chattanooga, TN",1748,41.0,1841,41.0,17,18,Moderate or heavy rain shower
15921738,2018-07-14,"Atlanta, GA","Fort Wayne, IN",2026,91.0,2207,77.0,20,22,Moderate or heavy rain shower
15921768,2018-07-14,"Atlanta, GA","Albany, GA",1037,175.0,1133,166.0,10,11,Moderate or heavy rain shower
15923518,2018-07-14,"Atlanta, GA","Charlotte, NC",1954,39.0,2126,3.0,20,21,Moderate or heavy rain shower


**Capturing the bad weather for arrival flights**

In [384]:
df_weather_final2 = df_weather.copy()

In [385]:
# same as above but for arrival flights, so: dest_city and arr_hour, etc

for date in df_dest_sample.index:
    # get the date and city that will be sent to the API
    city = df_dest_sample.loc[date].values[0]
    
    # call the API and save the hourly_report of that day and city
    result = weather_api(city, date)
    hourly_report = result['data']['weather'][0]['hourly']
    
    # make a dictionary of this report
    hourly_report_dictionary = {}
    for hr in range(24):
        hourly_report_dictionary[hr] = hourly_report[hour]['weatherDesc'][0]['value']

    
    # use date and city to filter the df_weather_final2 dataframe
    filter_date = df_weather_final2['fl_date'] == date
    try:
        filter_city = df_weather_final2['dest_city_name'] == city
    except:
        city = city[0]
        filter_city = df_weather_final2['dest_city_name'] == city
    
    # hours when there was a weather delay
    hours = df_weather_final2[(filter_date)& (filter_city)]['arr_hour'].unique()
    # index of records with weather delay
    index_time = df_weather_final2[(filter_date)& (filter_city)]['arr_hour'].index
    
    for idx in index_time:  
        # get the reference hour of the flight
        ref_hour = df_weather_final2.loc[idx, 'arr_hour']
        # cross-reference it in the dictionary to get the weather at that time    
        # capture the rounding error
        try:
            df_weather_final2.loc[idx, 'weather'] = hourly_report_dictionary[ref_hour]
        except:
            if ref_hour == 24:
                ref_hour = 23
            else:
                continue
        
df_weather_final2.drop(columns=['dep_delay'], inplace=True)
df_weather_final2.rename(columns={'arr_delay': 'delay'}, inplace=True)    

In [386]:
# inspect the dataframes
df_weather_final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 181998 entries, 36 to 15927381
Data columns (total 10 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   fl_date           181998 non-null  object 
 1   origin_city_name  181998 non-null  object 
 2   dest_city_name    181998 non-null  object 
 3   crs_dep_time      181998 non-null  int64  
 4   delay             181998 non-null  float64
 5   crs_arr_time      181998 non-null  int64  
 6   weather_delay     181998 non-null  float64
 7   dep_hour          181998 non-null  int64  
 8   arr_hour          181998 non-null  int64  
 9   weather           37105 non-null   object 
dtypes: float64(2), int64(4), object(4)
memory usage: 19.3+ MB


In [387]:
df_weather_final2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 181998 entries, 36 to 15927381
Data columns (total 10 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   fl_date           181998 non-null  object 
 1   origin_city_name  181998 non-null  object 
 2   dest_city_name    181998 non-null  object 
 3   crs_dep_time      181998 non-null  int64  
 4   crs_arr_time      181998 non-null  int64  
 5   delay             181998 non-null  float64
 6   weather_delay     181998 non-null  float64
 7   dep_hour          181998 non-null  int64  
 8   arr_hour          181998 non-null  int64  
 9   weather           11924 non-null   object 
dtypes: float64(2), int64(4), object(4)
memory usage: 19.3+ MB


In [388]:
# merge the two frames and drop null values
df_main = pd.concat([df_weather_final.dropna(), df_weather_final2.dropna()])

In [389]:
df_main.head(10)

Unnamed: 0,fl_date,origin_city_name,dest_city_name,crs_dep_time,delay,crs_arr_time,weather_delay,dep_hour,arr_hour,weather
332,2018-07-14,"Atlanta, GA","Austin, TX",1730,92.0,1850,56.0,17,18,Moderate or heavy rain shower
374,2018-07-14,"Atlanta, GA","Jacksonville, FL",1710,72.0,1825,57.0,17,18,Moderate or heavy rain shower
377,2018-07-14,"Atlanta, GA","Las Vegas, NV",1740,39.0,1855,37.0,17,19,Moderate or heavy rain shower
419,2018-07-14,"Atlanta, GA","Tampa, FL",1715,73.0,1845,69.0,17,18,Moderate or heavy rain shower
3878,2018-07-14,"Atlanta, GA","Phoenix, AZ",1802,80.0,1902,5.0,18,19,Moderate or heavy rain shower
7124,2018-07-14,"Atlanta, GA","Chicago, IL",1733,66.0,1905,32.0,17,19,Moderate or heavy rain shower
7502,2018-07-14,"Atlanta, GA","Los Angeles, CA",1903,44.0,2100,1.0,19,21,Moderate or heavy rain shower
7753,2018-07-14,"Atlanta, GA","Norfolk, VA",2011,91.0,2152,20.0,20,22,Moderate or heavy rain shower
7817,2018-07-14,"Atlanta, GA","Austin, TX",2039,62.0,2156,52.0,20,22,Moderate or heavy rain shower
7824,2018-07-14,"Atlanta, GA","Fort Lauderdale, FL",1900,40.0,2058,24.0,19,21,Moderate or heavy rain shower


In [390]:
df_main.tail(20)

Unnamed: 0,fl_date,origin_city_name,dest_city_name,crs_dep_time,delay,crs_arr_time,weather_delay,dep_hour,arr_hour,weather
15860992,2018-07-11,"Wichita, KS","Dallas/Fort Worth, TX",1624,28.0,1745,28.0,16,17,Moderate rain at times
15861003,2018-07-11,"Washington, DC","Dallas/Fort Worth, TX",1710,31.0,1924,31.0,17,19,Moderate rain at times
15861037,2018-07-11,"Santa Ana, CA","Dallas/Fort Worth, TX",1604,17.0,2109,7.0,16,21,Moderate rain at times
15862737,2018-07-11,"Aspen, CO","Dallas/Fort Worth, TX",1555,36.0,1915,36.0,16,19,Moderate rain at times
15863102,2018-07-11,"Minneapolis, MN","Dallas/Fort Worth, TX",1740,44.0,2017,44.0,17,20,Moderate rain at times
15863156,2018-07-11,"Detroit, MI","Dallas/Fort Worth, TX",1958,46.0,2150,39.0,20,22,Moderate rain at times
15863397,2018-07-11,"Cincinnati, OH","Dallas/Fort Worth, TX",1700,16.0,1829,16.0,17,18,Moderate rain at times
15865075,2018-07-11,"Roswell, NM","Dallas/Fort Worth, TX",1654,26.0,1935,26.0,17,19,Moderate rain at times
15865088,2018-07-11,"Madison, WI","Dallas/Fort Worth, TX",1715,62.0,1940,62.0,17,19,Moderate rain at times
15865123,2018-07-11,"Lafayette, LA","Dallas/Fort Worth, TX",1819,31.0,1945,22.0,18,19,Moderate rain at times


In [391]:
df_main.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 49029 entries, 332 to 15921712
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   fl_date           49029 non-null  object 
 1   origin_city_name  49029 non-null  object 
 2   dest_city_name    49029 non-null  object 
 3   crs_dep_time      49029 non-null  int64  
 4   delay             49029 non-null  float64
 5   crs_arr_time      49029 non-null  int64  
 6   weather_delay     49029 non-null  float64
 7   dep_hour          49029 non-null  int64  
 8   arr_hour          49029 non-null  int64  
 9   weather           49029 non-null  object 
dtypes: float64(2), int64(4), object(4)
memory usage: 4.1+ MB


In [392]:
df_main['weather'].unique()

array(['Moderate or heavy rain shower', 'Moderate rain at times', 'Sunny',
       'Patchy rain possible', 'Partly cloudy', 'Overcast', 'Cloudy',
       'Heavy rain at times', 'Moderate or heavy snow showers',
       'Light freezing rain', 'Moderate snow', 'Patchy moderate snow'],
      dtype=object)

In [393]:
def simplify_weather (weather):
    weather = weather.lower()
    if 'sunny' in weather:
        weather = 'Sunny'
    elif 'rain' in weather:
        weather = 'Rainy'
    elif 'cloud' in weather:
        weather = 'Cloudy'
    elif 'overcast' in weather:
        weather = 'Cloudy'
    elif 'snow' in weather:
        weather = 'Snow'
    return weather

In [394]:
tmp = df_main.copy() # safekeeping

In [395]:
df_main['weather'] = df_main['weather'].apply(simplify_weather)

In [396]:
df_main['weather'].unique()

array(['Rainy', 'Sunny', 'Cloudy', 'Snow'], dtype=object)

In [397]:
df_main.head()

Unnamed: 0,fl_date,origin_city_name,dest_city_name,crs_dep_time,delay,crs_arr_time,weather_delay,dep_hour,arr_hour,weather
332,2018-07-14,"Atlanta, GA","Austin, TX",1730,92.0,1850,56.0,17,18,Rainy
374,2018-07-14,"Atlanta, GA","Jacksonville, FL",1710,72.0,1825,57.0,17,18,Rainy
377,2018-07-14,"Atlanta, GA","Las Vegas, NV",1740,39.0,1855,37.0,17,19,Rainy
419,2018-07-14,"Atlanta, GA","Tampa, FL",1715,73.0,1845,69.0,17,18,Rainy
3878,2018-07-14,"Atlanta, GA","Phoenix, AZ",1802,80.0,1902,5.0,18,19,Rainy


In [400]:
df_main.to_csv('task3_weather_df.csv', index=False)

In [401]:
pd.read_csv('task3_weather_df.csv').head()

Unnamed: 0,fl_date,origin_city_name,dest_city_name,crs_dep_time,delay,crs_arr_time,weather_delay,dep_hour,arr_hour,weather
0,2018-07-14,"Atlanta, GA","Austin, TX",1730,92.0,1850,56.0,17,18,Rainy
1,2018-07-14,"Atlanta, GA","Jacksonville, FL",1710,72.0,1825,57.0,17,18,Rainy
2,2018-07-14,"Atlanta, GA","Las Vegas, NV",1740,39.0,1855,37.0,17,19,Rainy
3,2018-07-14,"Atlanta, GA","Tampa, FL",1715,73.0,1845,69.0,17,18,Rainy
4,2018-07-14,"Atlanta, GA","Phoenix, AZ",1802,80.0,1902,5.0,18,19,Rainy
