In [1]:
import pandas as pd
import datetime

This notebook is preliminary ETL for data from the weather API. <br>
We work with data for Union Square in San Francisco; we pass (37.7879, -122.4079) to the API.  <br>
<b> Beware: the returned data reports coordinates (38, -122.75), which is 30 miles north in Point Reyes! </b> <br>
We pulled data from 2000-01-01 to 2020-12-31. <br>
We took temperature, humidity, rainfall, snowfall, cloud cover, wind speed, and wind direction hourly. <br>
<hr> 
We will also take max temp, min temp, rain, snow, and precipitation hours daily. <br>
These will need separate and different dataframe processing. <br>
We take in data via downloaded csv, and we need separate daily and hourly csv files because they have different column headers and different numbers of columns.<br>
<hr>
We chose ISO format for date/time, and US customary (non-metric) units for the rest.<br>


This run uses the "GMT +0" time zone; in the future, we will need to specify the time zone appropriate to the location.<br>
The API allows "GMT +0" for hourly data but not for daily data (it says "error, time zone must be specified"). Does the time zone even matter for dailies? I think not for us, except for reproducibility.<br>
If we're stuck with something other than "GMT +0", we could get strange behaviour if and when daylight saving time kicks in; we can check for that by requesting sunrise time data.
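
To make the daylight saving concern concrete, here is a minimal sketch (not part of the pipeline) that counts how many hourly rows land on each local calendar day. The zone name `America/Los_Angeles` and the year 2020 are our assumptions for illustration; how the API itself labels local hours may differ.

```python
import pandas as pd

# One year of hourly instants in UTC (2020 is only an example year).
utc_hours = pd.date_range("2020-01-01 00:00", "2020-12-31 23:00",
                          freq=pd.Timedelta(hours=1), tz="UTC")

# The same instants viewed in local time (assumed zone for San Francisco).
local_hours = utc_hours.tz_convert("America/Los_Angeles")

# Count hourly rows per local calendar day.
hours_per_day = (pd.Series(local_hours.strftime("%Y-%m-%d"))
                 .value_counts()
                 .sort_index())

# Any day without exactly 24 rows would not fit a 24-column-per-variable layout.
odd_days = hours_per_day[hours_per_day != 24]
print(odd_days)
```

Besides the two clock-change days, the first and last local days come out partial, because the UTC day boundary does not coincide with local midnight.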

## Hourly data: load, clean, pivot for joining with daily.

In [2]:
# Load the dataset.
# rename columns because provided column headers contain non-ascii characters
file_path ="hourly.csv"
hourly_columns = ['time',
           'temperature_2m_degF', 
           'relativehumidity_2m_perc', 
           'rain_inch',
           'snowfall_cm', 
           'cloudcover_perc', 
           'windspeed_10m_mph',
           'winddirection_10m_deg']

raw_hrly_df1 = pd.read_csv(file_path, skiprows=4, names=hourly_columns)
raw_hrly_df1.head()

Unnamed: 0,time,temperature_2m_degF,relativehumidity_2m_perc,rain_inch,snowfall_cm,cloudcover_perc,windspeed_10m_mph,winddirection_10m_deg
0,2000-01-01T00:00,51.6,77,0.0,0.0,8,11.9,283.0
1,2000-01-01T01:00,51.3,80,0.0,0.0,14,10.8,283.0
2,2000-01-01T02:00,51.4,81,0.0,0.0,10,10.5,286.0
3,2000-01-01T03:00,48.1,88,0.0,0.0,11,10.0,290.0
4,2000-01-01T04:00,47.8,88,0.0,0.0,5,9.8,290.0


When wind speed is 0, wind direction is NaN. Otherwise, wind direction varies from 1 to 360.<br>
We replace NaNs with 0s to avoid errors; this loses no data, because 0 never appears in the original.

In [3]:
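# Assign the result back; inplace fillna on a selected column may not update the frame in newer pandas.
raw_hrly_df1["winddirection_10m_deg"] = raw_hrly_df1["winddirection_10m_deg"].fillna(0)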
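
The two claims behind this fill (0 is unused, and NaN appears only for calm hours) can be checked before filling. The sketch below uses a made-up toy frame with the same column names; running the same checks on `raw_hrly_df1` before the fill is the intended use.

```python
import numpy as np
import pandas as pd

# Toy rows (made-up values): calm hours have no direction.
toy = pd.DataFrame({
    "windspeed_10m_mph": [0.0, 5.2, 3.1, 0.0],
    "winddirection_10m_deg": [np.nan, 360.0, 1.0, np.nan],
})
direction = toy["winddirection_10m_deg"]

# 1) 0 must not already be a real value, or the fill would be ambiguous.
zero_is_free = not (direction == 0).any()

# 2) NaN should appear exactly where the wind speed is 0.
nan_only_when_calm = bool(
    (direction.isna() == (toy["windspeed_10m_mph"] == 0)).all()
)

# Fill by assignment, so the frame itself is updated.
toy["winddirection_10m_deg"] = direction.fillna(0)
print(zero_is_free, nan_only_when_calm)
```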

We convert the provided ISO string 'time' into
<ul>
    <li> a 'pure_date' in python datetime format for merging with daily data; and </li>
    <li> an 'hour' integer, for pivoting. </li>
</ul>

In [4]:
raw_hrly_df1["pure_date"] = raw_hrly_df1['time'].map(lambda x: 
                                                     datetime.datetime.fromisoformat(x[0:10]))

In [5]:
raw_hrly_df1["hour"] = raw_hrly_df1['time'].map(lambda x: datetime.datetime.fromisoformat(x).hour)
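
As an alternative to mapping `fromisoformat` over every row, the same two columns can be derived with pandas' vectorized datetime accessors. This is a sketch on a made-up toy frame, not a change to the cells above.

```python
import pandas as pd

# Toy ISO strings in the same shape as the 'time' column.
toy = pd.DataFrame({"time": ["2000-01-01T00:00",
                             "2000-01-01T23:00",
                             "2000-01-02T04:00"]})

# Parse once, then split into a midnight-normalized date and an hour.
parsed = pd.to_datetime(toy["time"])
toy["pure_date"] = parsed.dt.normalize()  # datetime64 at 00:00 of that day
toy["hour"] = parsed.dt.hour              # integer 0..23
print(toy)
```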

In [6]:
raw_hrly_df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 184104 entries, 0 to 184103
Data columns (total 10 columns):
 #   Column                    Non-Null Count   Dtype         
---  ------                    --------------   -----         
 0   time                      184104 non-null  object        
 1   temperature_2m_degF       184104 non-null  float64       
 2   relativehumidity_2m_perc  184104 non-null  int64         
 3   rain_inch                 184104 non-null  float64       
 4   snowfall_cm               184104 non-null  float64       
 5   cloudcover_perc           184104 non-null  int64         
 6   windspeed_10m_mph         184104 non-null  float64       
 7   winddirection_10m_deg     184104 non-null  float64       
 8   pure_date                 184104 non-null  datetime64[ns]
 9   hour                      184104 non-null  int64         
dtypes: datetime64[ns](1), float64(5), int64(3), object(1)
memory usage: 14.0+ MB


Now we pivot the dataframe on the "hour" variable: for each value of 'pure_date',<ul>
    <li> The raw dataframe has 24 rows (one for each hour) with: <ul>
        <li> 7 weather columns;</li>
        <li> 3 date/time columns: original string 'time', datetime 'pure_date', and integer 'hour'; and </li>
        <li> a mostly meaningless sequential index.</li> </ul> </li>
    <li> The clean dataframe has 1 row with: <ul>
        <li> index 'pure_date'; and </li>
        <li> 7*24=168 weather-at-hour columns. </li> </ul> </li>
</ul>

In [7]:
clean_hrly_df1 = raw_hrly_df1.pivot(index = 'pure_date',
                                    columns = 'hour', values = hourly_columns[1:])
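
To see the shape of this pivot on something small, here is a sketch with two days, two hours, and two variables. The values are made up; only the column names follow the notebook.

```python
import pandas as pd

# Two days x two hours, two weather variables (made-up values).
toy = pd.DataFrame({
    "pure_date": pd.to_datetime(["2000-01-01", "2000-01-01",
                                 "2000-01-02", "2000-01-02"]),
    "hour": [0, 1, 0, 1],
    "temperature_2m_degF": [51.6, 51.3, 52.2, 51.8],
    "rain_inch": [0.0, 0.0, 0.1, 0.0],
})

# One row per date; one column per (variable, hour) pair.
wide = toy.pivot(index="pure_date", columns="hour",
                 values=["temperature_2m_degF", "rain_inch"])
print(wide.shape)

# A single cell is addressed by date and by (variable, hour).
temp_day2_hour1 = wide.loc[pd.Timestamp("2000-01-02"),
                           ("temperature_2m_degF", 1)]
print(temp_day2_hour1)
```

`pivot` raises a `ValueError` if the same (date, hour) pair occurs twice, so it also acts as a check that every hour appears at most once per day.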

In [8]:
clean_hrly_df1.head()

Unnamed: 0_level_0,temperature_2m_degF,temperature_2m_degF,temperature_2m_degF,temperature_2m_degF,temperature_2m_degF,temperature_2m_degF,temperature_2m_degF,temperature_2m_degF,temperature_2m_degF,temperature_2m_degF,...,winddirection_10m_deg,winddirection_10m_deg,winddirection_10m_deg,winddirection_10m_deg,winddirection_10m_deg,winddirection_10m_deg,winddirection_10m_deg,winddirection_10m_deg,winddirection_10m_deg,winddirection_10m_deg
hour,0,1,2,3,4,5,6,7,8,9,...,14,15,16,17,18,19,20,21,22,23
pure_date,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
2000-01-01,51.6,51.3,51.4,48.1,47.8,47.8,46.7,45.6,45.0,43.1,...,3.0,7.0,350.0,333.0,315.0,306.0,307.0,308.0,296.0,297.0
2000-01-02,52.2,51.8,51.2,47.8,47.8,47.6,47.5,47.2,46.8,45.4,...,326.0,327.0,329.0,327.0,323.0,317.0,309.0,306.0,303.0,302.0
2000-01-03,52.2,52.1,51.8,48.1,48.3,48.0,47.9,47.9,47.7,46.9,...,352.0,8.0,27.0,29.0,339.0,315.0,252.0,267.0,275.0,283.0
2000-01-04,52.9,52.0,50.9,47.6,47.9,49.4,49.2,47.3,47.3,45.9,...,280.0,277.0,291.0,261.0,180.0,162.0,151.0,161.0,188.0,195.0
2000-01-05,53.0,53.2,53.2,49.8,49.2,49.4,50.0,50.0,49.7,49.2,...,342.0,343.0,335.0,324.0,318.0,318.0,319.0,323.0,314.0,310.0


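We have dropped the mostly meaningless sequential index and the original string 'time'.<br>
The two percentage columns that used to be int64 were converted to float64, most likely because pivot with a list of value columns handles them together as one block with a single common dtype; the cells below compare the dtypes before and after.<br>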

In [9]:
clean_hrly_df1.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 7671 entries, 2000-01-01 to 2020-12-31
Columns: 168 entries, ('temperature_2m_degF', 0) to ('winddirection_10m_deg', 23)
dtypes: float64(168)
memory usage: 9.9 MB


In [10]:
raw_hrly_df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 184104 entries, 0 to 184103
Data columns (total 10 columns):
 #   Column                    Non-Null Count   Dtype         
---  ------                    --------------   -----         
 0   time                      184104 non-null  object        
 1   temperature_2m_degF       184104 non-null  float64       
 2   relativehumidity_2m_perc  184104 non-null  int64         
 3   rain_inch                 184104 non-null  float64       
 4   snowfall_cm               184104 non-null  float64       
 5   cloudcover_perc           184104 non-null  int64         
 6   windspeed_10m_mph         184104 non-null  float64       
 7   winddirection_10m_deg     184104 non-null  float64       
 8   pure_date                 184104 non-null  datetime64[ns]
 9   hour                      184104 non-null  int64         
dtypes: datetime64[ns](1), float64(5), int64(3), object(1)
memory usage: 14.0+ MB


In [11]:
clean_hrly_df1['cloudcover_perc'][1]

pure_date
2000-01-01     14.0
2000-01-02     24.0
2000-01-03      0.0
2000-01-04     14.0
2000-01-05     90.0
              ...  
2020-12-27     29.0
2020-12-28     65.0
2020-12-29      0.0
2020-12-30     27.0
2020-12-31    100.0
Name: 1, Length: 7671, dtype: float64

In [12]:
raw_hrly_df1['cloudcover_perc']

0          8
1         14
2         10
3         11
4          5
          ..
184099    24
184100    23
184101    25
184102    25
184103    23
Name: cloudcover_perc, Length: 184104, dtype: int64
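
If the integer dtype matters later, the change seen above can be inspected and undone per variable. This is a sketch on made-up toy data; it assumes the pivoted values are whole numbers with no NaN, which holds for the percentage columns here.

```python
import pandas as pd

# Toy frame mixing a float column and an integer percentage column.
toy = pd.DataFrame({
    "pure_date": pd.to_datetime(["2000-01-01", "2000-01-01",
                                 "2000-01-02", "2000-01-02"]),
    "hour": [0, 1, 0, 1],
    "temperature_2m_degF": [51.6, 51.3, 52.2, 51.8],
    "cloudcover_perc": [8, 14, 12, 24],
})
wide = toy.pivot(index="pure_date", columns="hour",
                 values=["temperature_2m_degF", "cloudcover_perc"])
print(wide.dtypes)  # inspect what the pivot did to each column

# Check that the values are still whole numbers, then cast back.
cloud = wide["cloudcover_perc"]
is_whole = bool((cloud == cloud.round()).all().all())
cloud_int = cloud.astype("int64")
print(cloud_int.dtypes)
```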

### Load daily data

In [13]:
# Load the dataset.
# rename columns because provided column headers contain non-ascii characters
file_path ="daily.csv"
daily_columns = ['time',
           'temperature_2m_degF_max',
           'temperature_2m_degF_min',  
           'rain_inch',
           'snowfall_cm',
           'precipitation_hours'
          ]

raw_daily_df1 = pd.read_csv(file_path, skiprows=4, names=daily_columns)
raw_daily_df1.head()

Unnamed: 0,time,temperature_2m_degF_max,temperature_2m_degF_min,rain_inch,snowfall_cm,precipitation_hours
0,2000-01-01,52.2,43.1,0.0,0.0,0.0
1,2000-01-02,52.3,45.1,0.0,0.0,0.0
2,2000-01-03,52.9,42.8,0.0,0.0,0.0
3,2000-01-04,53.2,44.2,0.02,0.0,2.0
4,2000-01-05,56.8,45.8,0.0,0.0,0.0


In [14]:
raw_daily_df1["date"] = raw_daily_df1['time'].map(lambda x: datetime.datetime.fromisoformat(x))

In [15]:
raw_daily_df1.head()

Unnamed: 0,time,temperature_2m_degF_max,temperature_2m_degF_min,rain_inch,snowfall_cm,precipitation_hours,date
0,2000-01-01,52.2,43.1,0.0,0.0,0.0,2000-01-01
1,2000-01-02,52.3,45.1,0.0,0.0,0.0,2000-01-02
2,2000-01-03,52.9,42.8,0.0,0.0,0.0,2000-01-03
3,2000-01-04,53.2,44.2,0.02,0.0,2.0,2000-01-04
4,2000-01-05,56.8,45.8,0.0,0.0,0.0,2000-01-05


In [16]:
raw_daily_df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7671 entries, 0 to 7670
Data columns (total 7 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   time                     7671 non-null   object        
 1   temperature_2m_degF_max  7671 non-null   float64       
 2   temperature_2m_degF_min  7671 non-null   float64       
 3   rain_inch                7671 non-null   float64       
 4   snowfall_cm              7671 non-null   float64       
 5   precipitation_hours      7671 non-null   float64       
 6   date                     7671 non-null   datetime64[ns]
dtypes: datetime64[ns](1), float64(5), object(1)
memory usage: 419.6+ KB


### Join!

In [17]:
# A warning says to flatten the column names, or else future versions will throw errors.
clean_hrly_df1.columns = clean_hrly_df1.columns.to_flat_index()
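
`to_flat_index()` leaves each column name as a (variable, hour) tuple. If plain string names are easier to work with downstream, they can be built from the same pairs. The `_hNN` suffix below is a hypothetical naming scheme of ours, not something the notebook uses.

```python
import pandas as pd

# A small two-level column index shaped like the pivot output.
cols = pd.MultiIndex.from_product(
    [["temperature_2m_degF", "cloudcover_perc"], [0, 13]],
    names=[None, "hour"],
)

# What the notebook does: tuples such as ('temperature_2m_degF', 0).
flat_tuples = cols.to_flat_index()

# Hypothetical alternative: one string per column, e.g. 'temperature_2m_degF_h00'.
flat_names = [f"{variable}_h{int(hour):02d}" for variable, hour in cols]
print(list(flat_tuples))
print(flat_names)
```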

In [18]:
awesome_df = raw_daily_df1.join(clean_hrly_df1, on = 'date' )
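
A left join silently fills NaN for any daily date that has no hourly row. As a sketch of a stricter variant, `merge` can validate that dates are unique on both sides and flag unmatched rows. The toy frames and the `temperature_h00` column name are made up for illustration.

```python
import pandas as pd

# Three daily rows, but hourly data for only two of the dates.
daily = pd.DataFrame({
    "date": pd.to_datetime(["2000-01-01", "2000-01-02", "2000-01-03"]),
    "rain_inch": [0.0, 0.0, 0.02],
})
hourly_wide = pd.DataFrame(
    {"temperature_h00": [51.6, 52.2]},
    index=pd.to_datetime(["2000-01-01", "2000-01-02"]),
)

# validate raises if a date repeats on either side;
# indicator adds a '_merge' column saying where each row came from.
merged = daily.merge(hourly_wide, left_on="date", right_index=True,
                     how="left", validate="one_to_one", indicator=True)

# Daily dates with no matching hourly row.
unmatched = merged.loc[merged["_merge"] != "both", "date"]
print(unmatched)
```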

In [19]:
awesome_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7671 entries, 0 to 7670
Columns: 175 entries, time to ('winddirection_10m_deg', 23)
dtypes: datetime64[ns](1), float64(173), object(1)
memory usage: 10.2+ MB


In [20]:
awesome_df.drop('time', axis=1, inplace=True)

In [21]:
awesome_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7671 entries, 0 to 7670
Columns: 174 entries, temperature_2m_degF_max to ('winddirection_10m_deg', 23)
dtypes: datetime64[ns](1), float64(173)
memory usage: 10.2 MB


In [22]:
awesome_df.head()

Unnamed: 0,temperature_2m_degF_max,temperature_2m_degF_min,rain_inch,snowfall_cm,precipitation_hours,date,"(temperature_2m_degF, 0)","(temperature_2m_degF, 1)","(temperature_2m_degF, 2)","(temperature_2m_degF, 3)",...,"(winddirection_10m_deg, 14)","(winddirection_10m_deg, 15)","(winddirection_10m_deg, 16)","(winddirection_10m_deg, 17)","(winddirection_10m_deg, 18)","(winddirection_10m_deg, 19)","(winddirection_10m_deg, 20)","(winddirection_10m_deg, 21)","(winddirection_10m_deg, 22)","(winddirection_10m_deg, 23)"
0,52.2,43.1,0.0,0.0,0.0,2000-01-01,51.6,51.3,51.4,48.1,...,3.0,7.0,350.0,333.0,315.0,306.0,307.0,308.0,296.0,297.0
1,52.3,45.1,0.0,0.0,0.0,2000-01-02,52.2,51.8,51.2,47.8,...,326.0,327.0,329.0,327.0,323.0,317.0,309.0,306.0,303.0,302.0
2,52.9,42.8,0.0,0.0,0.0,2000-01-03,52.2,52.1,51.8,48.1,...,352.0,8.0,27.0,29.0,339.0,315.0,252.0,267.0,275.0,283.0
3,53.2,44.2,0.02,0.0,2.0,2000-01-04,52.9,52.0,50.9,47.6,...,280.0,277.0,291.0,261.0,180.0,162.0,151.0,161.0,188.0,195.0
4,56.8,45.8,0.0,0.0,0.0,2000-01-05,53.0,53.2,53.2,49.8,...,342.0,343.0,335.0,324.0,318.0,318.0,319.0,323.0,314.0,310.0


### Parse date into numbers: year, month, and day-of-the-year.
The day_of_year field runs from 1 to 365 (366 for leap years); for example, February 3rd would be 34. We want it for the season-clusters machine learning. The 'month' field is for the month-label machine learning. We keep the whole date formatted as datetime for time intervals like December 28, 2011 to January 17, 2012. <br>
<i> maybe this should be at the start of machine-learning code, not here. </i>

In [23]:
awesome_df['year'] = awesome_df["date"].map(lambda x: x.year)

In [24]:
awesome_df['month'] = awesome_df["date"].map(lambda x: x.month)

In [25]:
# this code stolen from
# https://www.w3resource.com/python-exercises/date-time-exercise/python-date-time-exercise-11.php
# which says nothing about licenses
def day_of_year(x):
    return (x - datetime.datetime(x.year, 1, 1)).days + 1

In [26]:
awesome_df['day_of_year'] = awesome_df["date"].map(day_of_year)
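
The same three fields are also available through pandas' `.dt` accessor, without a helper function. A sketch on made-up dates, including the February 3rd example and a leap-year December 31st:

```python
import pandas as pd

toy = pd.DataFrame({
    "date": pd.to_datetime(["2011-02-03", "2019-12-31", "2020-12-31"]),
})

# Vectorized equivalents of the year, month, and day_of_year cells.
toy["year"] = toy["date"].dt.year
toy["month"] = toy["date"].dt.month
toy["day_of_year"] = toy["date"].dt.dayofyear
print(toy)
```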

In [27]:
awesome_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7671 entries, 0 to 7670
Columns: 177 entries, temperature_2m_degF_max to day_of_year
dtypes: datetime64[ns](1), float64(173), int64(3)
memory usage: 10.4 MB


### Next coding steps.
Change data intake from reading csv to making an API call.<br>
Write code for generating the url for the API call, given latitude, longitude.<br>
Write code to select relevant rows from awesome_df, given the date range of a trip.<br>
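
For the row-selection step, one possible sketch follows. It assumes we want the same calendar window from every year in the data, and that a trip may wrap around New Year (like December 28 to January 17). The function name `select_trip_days` and its signature are hypothetical.

```python
import pandas as pd

def select_trip_days(df, start, end):
    """Return rows of df whose 'date' falls in the (month, day) window
    from start to end, inclusive, in any year."""
    # Encode month and day as one sortable number, e.g. Dec 28 -> 1228.
    key = df["date"].dt.month * 100 + df["date"].dt.day
    lo = start[0] * 100 + start[1]
    hi = end[0] * 100 + end[1]
    if lo <= hi:
        mask = (key >= lo) & (key <= hi)
    else:
        # The window wraps around New Year: keep both ends of the year.
        mask = (key >= lo) | (key <= hi)
    return df[mask]

# Toy frame with two full years of dates.
toy = pd.DataFrame({"date": pd.date_range("2011-01-01", "2012-12-31", freq="D")})

winter_trip = select_trip_days(toy, (12, 28), (1, 17))
short_trip = select_trip_days(toy, (2, 1), (2, 3))
print(len(winter_trip), len(short_trip))
```

Comparing on (month, day) instead of `day_of_year` keeps leap years from shifting the window by one day after February.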


### Next project steps/questions.

Define the summary statistics we will provide, and the meaning of the words in them.<br>

Like "28% (34 out of 120) days were rainy", and 'rainy' means at least 5 in 'precipitation_hours'.<br>
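
A sketch of how such a statistic could be computed, using the threshold of 5 from the sentence above as a named constant. The toy values are made up, and the threshold itself is still an open choice.

```python
import pandas as pd

RAINY_MIN_HOURS = 5  # tentative definition of 'rainy' from the text above

# Made-up daily values for four days.
toy = pd.DataFrame({"precipitation_hours": [0.0, 6.0, 5.0, 2.0]})

# A day counts as rainy if it reaches the threshold.
rainy = toy["precipitation_hours"] >= RAINY_MIN_HOURS
n_rainy = int(rainy.sum())
n_days = len(toy)

summary = f"{n_rainy / n_days:.0%} ({n_rainy} out of {n_days}) days were rainy"
print(summary)
```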

For machine learning of seasons, do we want to unpivot on year, to have 366 rows with 21x the number of columns (one set per year, 2000 to 2020)?<br>
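
To help decide, here is a sketch of that reshaping for a single variable over two made-up years. It also shows one consequence: day 366 exists only in leap years, so the other years get a NaN there.

```python
import pandas as pd

# Two years of daily rows, one leap (2020) and one not (2019); constant toy value.
dates = pd.date_range("2019-01-01", "2020-12-31", freq="D")
toy = pd.DataFrame({"date": dates, "temperature_2m_degF_max": 50.0})
toy["year"] = toy["date"].dt.year
toy["day_of_year"] = toy["date"].dt.dayofyear

# One row per day of the year, one column per year.
by_day = toy.pivot(index="day_of_year", columns="year",
                   values="temperature_2m_degF_max")

# Count the gaps per year: non-leap years have no day 366.
missing = by_day.isna().sum()
print(by_day.shape)
print(missing)
```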

In [28]:
awesome_df.to_csv('awesome.csv')