
In [None]:
import requests
import pandas as pd
import json
import datetime
import calendar
import ast 

I will be getting the data for this project from the World Weather Online API. The documentation can be found at the following link: https://www.worldweatheronline.com/developer/api/docs/historical-weather-api.aspx#qparameter

In [None]:
URL = "https://api.worldweatheronline.com/premium/v1/past-weather.ashx"

For each day, the API returns some values for the day as a whole and others for each hour of the day. I turn each hourly feature into a daily one by averaging: the 'result' dictionary accumulates the hourly values, which are then divided by 24. Precipitation is the exception, since it is kept as a daily total rather than averaged. So the 'dailyFromHourly' function takes the data from the API for one day and returns a dictionary of all the weather features for that day.

In [None]:
def dailyFromHourly(day):
  result = {
              #'tempC':0,
              'avgtempC':0,
              'maxtempC':0,
              'mintempC':0,
              'totalSnow_cm':0,
              'windspeedKmph':0, 
              'winddirDegree':0, 
              'precipMM':0,
              'humidity':0,
              'visibility':0,
              'pressure':0,
              'cloudcover':0,
              'DewPointC':0,
              'WindChillC':0,
              'FeelsLikeC':0
            }
  result['maxtempC'] = int(day['maxtempC'])
  result['mintempC'] = int(day['mintempC'])
  result['avgtempC'] = int(day['avgtempC'])
  result['totalSnow_cm'] = float(day['totalSnow_cm'])
  date = day['date']
  date = date.split('-')
  result['date'] = datetime.datetime(int(date[0]), int(date[1]), int(date[2]))
  for hour in day['hourly']: 
    for item in hour:
      if item in result:
        result[item] += float(hour[item])
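  # Hourly sums become daily means; precipMM stays a daily total.
  for item in result: 
    if item not in ['maxtempC','mintempC','totalSnow_cm','date','precipMM','avgtempC']:
      result[item] = round(result[item] / 24, 2)
  # Round the precipitation total once, after the loop.
  result['precipMM'] = round(result['precipMM'],2)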
  return result
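
The averaging step can be illustrated on a made-up day. This is only a sketch, not the notebook's function: `daily_mean` and `sample_hourly` are hypothetical names, the values are invented, and it divides by the number of hourly records present instead of a fixed 24, which is one possible way to handle days with missing hours.

```python
def daily_mean(hourly, feature):
    """Average one hourly feature over the hourly records that are present."""
    values = [float(hour[feature]) for hour in hourly]
    return round(sum(values) / len(values), 2)

# Invented sample: three hourly records instead of the usual 24.
# The values are strings because dailyFromHourly converts them with float().
sample_hourly = [
    {'humidity': '90', 'pressure': '1020'},
    {'humidity': '95', 'pressure': '1021'},
    {'humidity': '100', 'pressure': '1022'},
]

print(daily_mean(sample_hourly, 'humidity'))  # 95.0
print(daily_mean(sample_hourly, 'pressure'))  # 1021.0
```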

The 'ReqToDF' function makes the HTTP GET request to the API. It takes a date and an enddate as parameters, which correspond to the first and last day of a month, because the API only lets us get one month of data per GET request. Once we have all the data for a particular month, we call 'dailyFromHourly' on each day in that month. The function returns a pandas dataframe in which each row corresponds to a day and each column is a weather feature (e.g. cloud cover, wind speed, average temperature).

In [None]:
def ReqToDF(date, enddate):
  PARAMS = {
            #latitude and longitude of Val d'Isère, France
            'q':'45.448,6.980',
            'tp':'1',
            'date': str(date.year)+'-'+str(date.month)+'-'+str(date.day),
            'enddate': str(enddate.year)+'-'+str(enddate.month)+'-'+str(enddate.day),
            'key':'82efea4533d84849b4a222022200111',
            'format':'json'
            }
  res = requests.get(url = URL, params = PARAMS)
  # Stop early on an HTTP error instead of trying to parse an error response.
  res.raise_for_status()
  data = res.json()
  result = []
  for day in data['data']['weather']:
    result.append(dailyFromHourly(day))
  df = pd.DataFrame(result)
  first_col = df.pop('date')
  df.insert(0, 'date', first_col)
  return df
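
The query parameters can also be built separately from the request, which makes them easy to inspect without calling the API. The sketch below is an illustration only: `build_params` and the `WWO_API_KEY` environment variable are hypothetical names, and it assumes the API accepts zero-padded `yyyy-MM-dd` dates. Reading the key from the environment is one way to keep it out of the notebook.

```python
import datetime
import os

def build_params(date, enddate, api_key=None):
    """Return the query parameters for one request covering date..enddate."""
    if api_key is None:
        # WWO_API_KEY is a hypothetical environment variable name.
        api_key = os.environ.get('WWO_API_KEY', '')
    return {
        'q': '45.448,6.980',  # same location string as in ReqToDF
        'tp': '1',            # same value as in ReqToDF
        'date': date.strftime('%Y-%m-%d'),
        'enddate': enddate.strftime('%Y-%m-%d'),
        'key': api_key,
        'format': 'json',
    }

params = build_params(datetime.date(2009, 1, 1),
                      datetime.date(2009, 1, 31),
                      api_key='demo-key')
print(params['date'], params['enddate'])  # 2009-01-01 2009-01-31
```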

The 'last_day_of_month' function simply returns the last day of a particular month. This will be used when we call the 'ReqToDF' function from above to get the right first and last day of the needed month. 

In [None]:
def last_day_of_month(year, month):
    last_days = [31, 30, 29, 28, 27]
    for i in last_days:
        try:
            end = datetime.datetime(year, month, i)
        except ValueError:
            continue
        else:
            return end.date()
    return None
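
As an alternative to trying candidate days one by one, the standard library's `calendar.monthrange` reports the number of days in a month directly. A minimal sketch, where `last_day_via_calendar` is a hypothetical name and not part of the notebook:

```python
import calendar
import datetime

def last_day_via_calendar(year, month):
    # monthrange returns (weekday of the first day, number of days in the month)
    _, n_days = calendar.monthrange(year, month)
    return datetime.date(year, month, n_days)

print(last_day_via_calendar(2019, 2))  # 2019-02-28
print(last_day_via_calendar(2020, 2))  # 2020-02-29
```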

The 'getListDateIntervals' function takes in the first and last date of the first month we need and returns a list of all tuples (first day,last day) for each month until the end of 2019. 

In [None]:
def getListDateIntervals(start,end):
  s = start 
  e = end 
  lst = [(s,e)]
  while not (s.year == 2019 and s.month == 12):
    if s.month == 12:
      s = datetime.datetime(s.year+1,1,1)
    else: 
      s = datetime.datetime(s.year,s.month+1,1)
    e = last_day_of_month(s.year, s.month)
    lst.append((s,e))
  return lst 
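
The same list of monthly intervals can be generated with the final month passed in as an argument instead of being fixed to December 2019. This is a sketch under that assumption; `month_intervals` is a hypothetical name, and it returns `datetime.date` pairs rather than the mix of `datetime` and `date` objects produced above.

```python
import calendar
import datetime

def month_intervals(first_year, first_month, last_year, last_month):
    """Return (first day, last day) for every month in the inclusive range."""
    intervals = []
    year, month = first_year, first_month
    while (year, month) <= (last_year, last_month):
        n_days = calendar.monthrange(year, month)[1]
        intervals.append((datetime.date(year, month, 1),
                          datetime.date(year, month, n_days)))
        # Move to the next month, rolling over to January after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return intervals

intervals = month_intervals(2009, 1, 2019, 12)
print(len(intervals))  # 132
print(intervals[0])
print(intervals[-1])
```

Eleven years of twelve months give 132 intervals, one GET request each.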

'getAllData' is the main function of this script. It gets the data for all the months between January 2009 and December 2019 and produces a list of dataframes where each dataframe holds the data for one month. We then concatenate the dataframes to get all the data in one single dataframe.

In [None]:
def getAllData(): 
  dates = getListDateIntervals(datetime.datetime(2009, 1, 1),datetime.datetime(2009, 1, 31))
  lst_df = []
  for interval in dates: 
    lst_df.append(ReqToDF(interval[0],interval[1]))
  df = pd.concat(lst_df)
  return df
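
One detail of the concatenation is worth a small illustration: by default `pd.concat` keeps each monthly dataframe's own row labels, so the labels restart at 0 every month. The sketch below uses two invented mini-dataframes (`jan` and `feb` are hypothetical) to show the difference that `ignore_index=True` makes.

```python
import pandas as pd

# Invented two-row "months", only to show how the index behaves.
jan = pd.DataFrame({'avgtempC': [-9, -14]})
feb = pd.DataFrame({'avgtempC': [-5, -3]})

stacked = pd.concat([jan, feb])                        # keeps per-month labels
renumbered = pd.concat([jan, feb], ignore_index=True)  # fresh 0..n-1 labels

print(stacked.index.tolist())     # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```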

Exporting the dataframe to a csv file

In [1]:
df = getAllData()
df.to_csv('data.csv',index=False)

Here's a quick look at what our big dataframe looks like

In [2]:
df.head()

Unnamed: 0,date,avgtempC,maxtempC,mintempC,totalSnow_cm,windspeedKmph,winddirDegree,precipMM,humidity,visibility,pressure,cloudcover,DewPointC,WindChillC,FeelsLikeC
0,2009-01-01,-9,-3,-21,0.9,7.17,226.71,1.0,97.29,2.25,1025.17,37.96,-9.58,-11.79,-11.79
1,2009-01-02,-14,-3,-22,0.0,4.04,259.29,0.0,97.71,1.42,1021.12,15.62,-14.21,-15.62,-15.62
2,2009-01-03,-13,-7,-20,0.2,6.75,115.92,0.0,98.38,1.67,1021.46,61.67,-13.38,-16.83,-16.83
3,2009-01-04,-14,-4,-24,0.0,6.83,300.88,0.0,96.42,2.42,1019.62,18.25,-14.42,-18.29,-18.29
4,2009-01-05,-13,-8,-18,0.0,7.88,305.08,0.0,93.62,5.75,1015.38,58.5,-14.0,-18.12,-18.12


In [3]:
df.describe()

Unnamed: 0,avgtempC,maxtempC,mintempC,totalSnow_cm,windspeedKmph,winddirDegree,precipMM,humidity,visibility,pressure,cloudcover,DewPointC,WindChillC,FeelsLikeC
count,4017.0,4017.0,4017.0,4017.0,4017.0,4017.0,4017.0,4017.0,4017.0,4017.0,4017.0,4017.0,4017.0,4017.0
mean,0.009709,4.341548,-5.063978,0.675454,8.040306,232.488071,1.998233,86.770809,6.376413,1016.757453,57.463714,-2.539061,-2.679597,-2.679288
std,7.868513,8.015414,8.571421,2.606383,3.661351,69.303534,4.393992,12.137017,2.876501,6.996115,29.508598,7.076254,9.268072,9.268676
min,-25.0,-19.0,-35.0,0.0,1.58,50.21,0.0,31.42,0.0,983.79,0.0,-25.0,-28.58,-28.58
25%,-6.0,-2.0,-11.0,0.0,5.33,173.62,0.0,80.0,4.21,1013.17,33.0,-7.62,-10.08,-10.08
50%,0.0,3.0,-4.0,0.0,7.38,253.54,0.3,91.38,6.58,1017.0,59.79,-1.92,-2.92,-2.92
75%,7.0,11.0,2.0,0.2,10.17,291.79,2.4,96.29,8.75,1020.79,83.5,3.25,5.42,5.42
max,21.0,25.0,12.0,80.2,31.75,341.21,88.5,99.08,20.0,1037.04,100.0,12.96,17.92,18.0


This little bit of code will become very important in the other scripts of the project. We expand our current dataframe containing all the weather data to make it easier for machine learning models to make weather predictions. If $n$ is the number of weather features in the original dataframe (every column except 'date'), we now have $n+3n$ feature columns. For each row, the additional columns correspond to the data from the day before, two days before and three days before.

In [4]:
expanded_df = df.copy()

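def get_previous(input_df, feature, N):
  # Build a copy of the column shifted down by N rows (the first N entries stay None).
  old_column = input_df[feature].tolist()
  new_column = [None for i in range(len(old_column))]
  for i in range(N,len(new_column)):
    new_column[i] = old_column[i-N]
  # Assign the new column once, after the loop, instead of on every iteration.
  input_df[feature+'(-'+str(N)+')']=new_column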

for feature in expanded_df.columns:
  if feature != 'date':
    for N in range(1,4):
      get_previous(expanded_df,feature,N)
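
The same lagged columns can be produced with pandas' built-in `Series.shift`, which moves a column down by a given number of rows and fills the gap with NaN. The sketch below is an alternative to `get_previous`, not the notebook's code; `toy` is a hypothetical dataframe holding only the `avgtempC` values of the first five days shown earlier.

```python
import pandas as pd

toy = pd.DataFrame({
    'date': pd.date_range('2009-01-01', periods=5),
    'avgtempC': [-9, -14, -13, -14, -13],
})

for feature in ['avgtempC']:
    for n in range(1, 4):
        # Row i receives the value from row i-n; the first n rows become NaN.
        toy[f'{feature}(-{n})'] = toy[feature].shift(n)

# 2009-01-04: the three previous days were -13, -14 and -9.
print(toy.loc[3, ['avgtempC(-1)', 'avgtempC(-2)', 'avgtempC(-3)']].tolist())
# [-13.0, -14.0, -9.0]
```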

We get rid of the first three rows because they are incomplete: we don't have the data from the three previous days. We also export the expanded dataframe to a csv file so that we can use it in the other scripts of the project.

In [None]:
# drop=True discards the old index instead of adding it as an 'index' column.
expanded_df = expanded_df.iloc[3:].reset_index(drop=True)
expanded_df.to_csv('expanded_data.csv',index=False)

Here's a quick look at the expanded dataframe

In [6]:
expanded_df.head()

Unnamed: 0,date,avgtempC,maxtempC,mintempC,totalSnow_cm,windspeedKmph,winddirDegree,precipMM,humidity,visibility,pressure,cloudcover,DewPointC,WindChillC,FeelsLikeC,avgtempC(-1),avgtempC(-2),avgtempC(-3),maxtempC(-1),maxtempC(-2),maxtempC(-3),mintempC(-1),mintempC(-2),mintempC(-3),totalSnow_cm(-1),totalSnow_cm(-2),totalSnow_cm(-3),windspeedKmph(-1),windspeedKmph(-2),windspeedKmph(-3),winddirDegree(-1),winddirDegree(-2),winddirDegree(-3),precipMM(-1),precipMM(-2),precipMM(-3),humidity(-1),humidity(-2),humidity(-3),visibility(-1),visibility(-2),visibility(-3),pressure(-1),pressure(-2),pressure(-3),cloudcover(-1),cloudcover(-2),cloudcover(-3),DewPointC(-1),DewPointC(-2),DewPointC(-3),WindChillC(-1),WindChillC(-2),WindChillC(-3),FeelsLikeC(-1),FeelsLikeC(-2),FeelsLikeC(-3)
0,2009-01-04,-14,-4,-24,0.0,6.83,300.88,0.0,96.42,2.42,1019.62,18.25,-14.42,-18.29,-18.29,-13.0,-14.0,-9.0,-7.0,-3.0,-3.0,-20.0,-22.0,-21.0,0.2,0.0,0.9,6.75,4.04,7.17,115.92,259.29,226.71,0.0,0.0,1.0,98.38,97.71,97.29,1.67,1.42,2.25,1021.46,1021.12,1025.17,61.67,15.62,37.96,-13.38,-14.21,-9.58,-16.83,-15.62,-11.79,-16.83,-15.62,-11.79
1,2009-01-05,-13,-8,-18,0.0,7.88,305.08,0.0,93.62,5.75,1015.38,58.5,-14.0,-18.12,-18.12,-14.0,-13.0,-14.0,-4.0,-7.0,-3.0,-24.0,-20.0,-22.0,0.0,0.2,0.0,6.83,6.75,4.04,300.88,115.92,259.29,0.0,0.0,0.0,96.42,98.38,97.71,2.42,1.67,1.42,1019.62,1021.46,1021.12,18.25,61.67,15.62,-14.42,-13.38,-14.21,-18.29,-16.83,-15.62,-18.29,-16.83,-15.62
2,2009-01-06,-12,-10,-15,2.7,9.08,114.67,3.1,98.0,3.46,1015.04,93.17,-11.71,-16.75,-16.75,-13.0,-14.0,-13.0,-8.0,-4.0,-7.0,-18.0,-24.0,-20.0,0.0,0.0,0.2,7.88,6.83,6.75,305.08,300.88,115.92,0.0,0.0,0.0,93.62,96.42,98.38,5.75,2.42,1.67,1015.38,1019.62,1021.46,58.5,18.25,61.67,-14.0,-14.42,-13.38,-18.12,-18.29,-16.83,-18.12,-18.29,-16.83
3,2009-01-07,-9,-7,-10,5.1,14.17,99.25,5.9,99.0,5.17,1014.83,100.0,-8.58,-14.5,-14.5,-12.0,-13.0,-14.0,-10.0,-8.0,-4.0,-15.0,-18.0,-24.0,2.7,0.0,0.0,9.08,7.88,6.83,114.67,305.08,300.88,3.1,0.0,0.0,98.0,93.62,96.42,3.46,5.75,2.42,1015.04,1015.38,1019.62,93.17,58.5,18.25,-11.71,-14.0,-14.42,-16.75,-18.12,-18.29,-16.75,-18.12,-18.29
4,2009-01-08,-7,-6,-9,1.8,12.54,118.88,2.0,99.0,4.71,1022.17,100.0,-7.04,-12.29,-12.29,-9.0,-12.0,-13.0,-7.0,-10.0,-8.0,-10.0,-15.0,-18.0,5.1,2.7,0.0,14.17,9.08,7.88,99.25,114.67,305.08,5.9,3.1,0.0,99.0,98.0,93.62,5.17,3.46,5.75,1014.83,1015.04,1015.38,100.0,93.17,58.5,-8.58,-11.71,-14.0,-14.5,-16.75,-18.12,-14.5,-16.75,-18.12
