# Hyperlocal rain prediction - Singapore 

Singapore is a tropical country located near the equator. As with most tropical countries, it is subject to heavy rain and shifting prevailing winds. Given its small size, squalls (colloquially known as moving clouds) can bring hyperlocal showers to one area of the country while other areas stay warm, humid and rain-free.

While rain is generally harmless, it can cause significant financial losses across multiple industries:
* In construction, work may be postponed due to safety concerns, resulting in project delays
* In chemical processing, rain causes unwanted cooling of process equipment, driving up energy costs
* In pharmaceuticals, rain raises humidity, which affects quality control
* In food and beverage (F&B), rain reduces seating capacity, resulting in lost revenue
* ...and so on; you get the idea.

# Problem Statement

In this project, I aim to use publicly available data to predict the probability of rain and how long it will last. 
This should allow users to estimate:
* how much rain will fall within a particular neighbourhood in Singapore, and
* how long the rain will last

# 1. Setup

As usual, we will begin the project by obtaining the data. The usual libraries shall be imported.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import datetime, time, calendar

import requests # HTTP requests (GET / POST)

### 1.1 Obtaining Data 

To begin, the data is imported from data.gov.sg (https://data.gov.sg/dataset/realtime-weather-readings). We will primarily import 5 main readings:
* Air Temperature
* Rainfall
* Relative Humidity
* Wind Direction
* Wind Speed


Further reading on the data.gov.sg API: https://towardsdatascience.com/exploring-data-gov-sg-api-725e344048dc

##### 1.1.0 Initializing and obtaining full list of stations

In [2]:
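## initialize (one-off pull of station metadata, kept for reference)
# note: weather_api_url is the dictionary of endpoints defined inside data_pull() further below
# now = datetime.datetime.now()
# params_now = {"date_time": now.strftime("%Y-%m-%dT%H:%M:%S")} # for latest data
# js_stations = requests.get(weather_api_url['rain'], params=params_now).json()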
# stations_df = pd.DataFrame(js_stations['metadata']['stations'])
# stations_df = stations_df.drop(columns = ['device_id'])
# stations_df = pd.concat([stations_df, pd.json_normalize(stations_df['location'])],
#                           axis = 1)

# stations_df.to_csv('../data/00_stationdata.csv',index = False)

## import stations data
stations_df = pd.read_csv('../data/00_stationdata.csv')
stations_df.head()
# stations_dict = stations_df.to_dict()
# stations_dict

Unnamed: 0,id,name,location,latitude,longitude
0,S77,Alexandra Road,"{'latitude': 1.2937, 'longitude': 103.8125}",1.2937,103.8125
1,S109,Ang Mo Kio Avenue 5,"{'latitude': 1.3764, 'longitude': 103.8492}",1.3764,103.8492
2,S90,Bukit Timah Road,"{'latitude': 1.3191, 'longitude': 103.8191}",1.3191,103.8191
3,S114,Choa Chu Kang Avenue 4,"{'latitude': 1.38, 'longitude': 103.73}",1.38,103.73
4,S50,Clementi Road,"{'latitude': 1.3337, 'longitude': 103.7768}",1.3337,103.7768


In [7]:
stations_df[stations_df['id'] == 'S106']

Unnamed: 0,id,name,location,latitude,longitude
23,S106,Pulau Ubin,"{'latitude': 1.4168, 'longitude': 103.9673}",1.4168,103.9673


##### 1.1.1 Create custom function to pull weather data from data.gov.sg

In order to obtain the data from data.gov.sg, we shall create a few custom functions to pull data from the server. 
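For orientation, the sketch below flattens a payload shaped the way the functions in this section expect: an `items` list whose entries carry a `timestamp` and a list of per-station `readings`. The sample values and the helper name `flatten_readings` are made up for illustration; only the key names are taken from how `data_pull` accesses the response.

```python
import pandas as pd

# Hypothetical payload mimicking the structure accessed in this notebook
# (js['items'][i]['readings'][j] -> {'station_id', 'value'}); values are invented.
sample_js = {
    "items": [
        {"timestamp": "2021-01-01T00:05:00+08:00",
         "readings": [{"station_id": "S77", "value": 0.0},
                      {"station_id": "S109", "value": 0.2}]},
        {"timestamp": "2021-01-01T00:10:00+08:00",
         "readings": [{"station_id": "S77", "value": 0.4},
                      {"station_id": "S109", "value": 0.0}]},
    ]
}

def flatten_readings(js, value_name):
    """Return one row per (timestamp, station) reading."""
    # record_path expands each readings list; meta carries the parent timestamp along
    df = pd.json_normalize(js["items"], record_path="readings", meta=["timestamp"])
    df = df.rename(columns={"value": value_name})
    df["timestamp"] = pd.to_datetime(df["timestamp"])
    return df[["timestamp", "station_id", value_name]]

rain_df = flatten_readings(sample_js, "rain")
print(rain_df)
```

`pd.json_normalize` with `record_path` does the explode-and-expand step in one call, which may be easier to read than exploding and concatenating by hand.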

In [3]:
def print_dates(start,end):       
    # convert from string to time
    end = datetime.datetime.strptime(end,'%Y-%m-%d')
    start = datetime.datetime.strptime(start,'%Y-%m-%d')
    
    #obtain time difference, in days
    delta = (end - start).days
    
    #list comprehension for date list
    date_list = [(end - datetime.timedelta(days=x)).strftime("%Y-%m-%d") for x in range(delta,-1,-1)]
    
    return date_list
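As a cross-check on the helper above, the same inclusive list of dates can be produced with `pd.date_range`. This is only an alternative sketch; the function name `date_strings` is hypothetical and not part of the notebook.

```python
import pandas as pd

def date_strings(start, end):
    """Return every date from start to end (inclusive) as 'YYYY-MM-DD' strings."""
    return pd.date_range(start=start, end=end, freq="D").strftime("%Y-%m-%d").tolist()

print(date_strings("2021-01-30", "2021-02-01"))
# ['2021-01-30', '2021-01-31', '2021-02-01']
```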

In [4]:
def agg_weather(df):
    #change to datetime format, promote index for station_id and timestamp
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    df = df.sort_values(by = ['station_id','timestamp'])
    df = df.set_index(keys = ['station_id','timestamp'])

    #split into rain and nonrain groups for different aggregation
    df_nonrain = df.drop(columns = ['rain'])
    df_rainonly = df[['rain']]

    #run .mean() for non-rain readings and .sum() for rain readings
    #('5min' is the current alias; '5T' is deprecated in recent pandas versions)
    df_nonrain = df_nonrain.groupby(['station_id']+[pd.Grouper(freq='5min', level=-1)]).mean()
    df_rainonly = df_rainonly.groupby(['station_id']+[pd.Grouper(freq='5min', level=-1)]).sum()
    
    #merge rain and non-rain readings
    df = df_nonrain.merge(right = df_rainonly,
                                         how = 'left',
                                         on = ['timestamp', 'station_id'])

    #reset index
    df = df.reset_index()
    
    return df
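To see the mean-versus-sum aggregation in isolation, here is a small sketch on synthetic one-minute readings. The station `S1`, the values, and the helper name `agg_5min` are invented for illustration. It uses a single `.agg()` call with a per-column mapping instead of splitting and re-merging, which is an alternative to the approach above and not the notebook's own code.

```python
import pandas as pd

# Synthetic one-minute readings for a single made-up station
raw = pd.DataFrame({
    "station_id": ["S1"] * 10,
    "timestamp": pd.date_range("2021-01-01 00:00", periods=10, freq="min"),
    "temp": [28.0] * 5 + [27.0] * 5,
    "rain": [0.2] * 10,
})

def agg_5min(df):
    """Average the non-rain columns and sum rain within each 5-minute bin."""
    agg_map = {col: "mean" for col in df.columns
               if col not in ("station_id", "timestamp", "rain")}
    agg_map["rain"] = "sum"
    return (df.groupby(["station_id", pd.Grouper(key="timestamp", freq="5min")])
              .agg(agg_map)
              .reset_index())

out = agg_5min(raw)
print(out)
```

Each 5-minute bin ends up with the average temperature and the total rainfall recorded in that bin.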

In [5]:
def data_pull(start_date, end_date):
    
    '''
    This function does three main things
    
    1. The function takes in the start and end dates ('YYYY-MM-DD' strings) of the data required,
        and pulls every day in that range, inclusive.
    2. The function then merges rain with the other weather readings 
    (namely air temperature, relative humidity, wind speed and wind direction) on timestamp and station_id.
        The merge is a left join onto the first reading type pulled, so stations missing a later reading keep NaN.
    3. The function then takes the merged dataframe, and aggregates the weather readings down to a 5-minute interval. This is to 
    account for the rain gauge only having data at 5-minute intervals. 
        For rain data, the aggregation is done by summation.
        For the other data points, aggregation is done by taking the average over the 5-minute interval.
            
    The function then returns the merged dataframe.
    '''
    #begin data pull function
    print('######################################')
    print(f'Begin Data Pull.... | End Date : {end_date}')
    print('######################################')
    #################################################################################################
    #hardcoded list of urls for weather data
    weather_api_url = {'temp': 'https://api.data.gov.sg/v1/environment/air-temperature',
                       'rh': 'https://api.data.gov.sg/v1/environment/relative-humidity',
                       'wind_dir': 'https://api.data.gov.sg/v1/environment/wind-direction',
                       'wind_spd': 'https://api.data.gov.sg/v1/environment/wind-speed',
                       'rain': 'https://api.data.gov.sg/v1/environment/rainfall'
                      }
    #################################################################################################
    #Obtain list of dates for data pull
    end_date = datetime.datetime.strptime(end_date,'%Y-%m-%d')
    start_date = datetime.datetime.strptime(start_date,'%Y-%m-%d')
    
    #obtain time difference, in days
    delta = (end_date - start_date).days
    
    #list comprehension for date list
    date_list = [(end_date - datetime.timedelta(days=x)).strftime("%Y-%m-%d") for x in range(delta,-1,-1)]    
    #################################################################################################

    #initialize initial counter for date
    init_date = 1  
    
    #Begin loop
    for date in date_list:
        
        #initialize
#         params = {"date": date.strftime("%Y-%m-%d")} # YYYY-MM-DD, for historical data
        params = {"date": date} # YYYY-MM-DD, for historical data
        init_weather = 1
        init_agg = 1
    
        #############################################################################################
        for key in weather_api_url.keys():

            print(f'Date: {date} | {key} data pull: in progress', end = '\r')
            
            #obtain json as dictionary
            js_weather = requests.get(weather_api_url[key], params=params).json()
            
            #error handling: skip this reading type if the response has no items
            if len(js_weather.get('items', [])) == 0:
                continue
            
            #manipulate dataframe
            hold_weather_df = pd.DataFrame(js_weather['items']).explode('readings', ignore_index = True)
            hold_weather_df = pd.concat([hold_weather_df, pd.json_normalize(hold_weather_df['readings'])],
                                  axis = 1)
            hold_weather_df = hold_weather_df.drop(columns = ['readings'])
            hold_weather_df = hold_weather_df.rename({'value': key}, axis = 1)
            hold_weather_df['timestamp'] = pd.to_datetime(hold_weather_df['timestamp'])

            #merge with other weather data
            if init_weather == 1:
                merge_weather_df = hold_weather_df

            else:
                merge_weather_df = merge_weather_df.merge(right = hold_weather_df,
                                         how = 'left',
                                         on = ['timestamp', 'station_id'])
            #update counter
            init_weather = init_weather + 1
            
            print('                                                     ',end = '\r')
        #############################################################################################
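        #concatenate weather df with previous dates
        #skip dates where nothing was merged (init_weather is still 1),
        #otherwise merge_weather_df would be stale or undefined
        if init_weather == 1:
            continue

        if init_date == 1:
            concat_weather = merge_weather_df
        else:
            concat_weather = pd.concat([concat_weather, merge_weather_df],
                                       axis = 0)

        init_date = init_date + 1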
        
#     aggregate at 5min level     
    concat_weather = agg_weather(concat_weather)
    
    
    print(f'Date: {date} | Data pull: Complete')
    print('######################################')
    
    return concat_weather
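The join type used when combining reading types matters for which stations survive. `data_pull` merges with `how = 'left'`, which keeps every row of the first reading type even when a later reading is missing. The sketch below contrasts that with an inner join on made-up frames; the station IDs, values, and the helper name `merge_readings` are all hypothetical.

```python
from functools import reduce
import pandas as pd

keys = ["timestamp", "station_id"]
ts = pd.Timestamp("2021-01-01 00:05")

# Made-up frames: station S2 reports temperature and humidity but has no rain reading
temp = pd.DataFrame({"timestamp": [ts, ts], "station_id": ["S1", "S2"], "temp": [28.1, 27.9]})
rh = pd.DataFrame({"timestamp": [ts, ts], "station_id": ["S1", "S2"], "rh": [74.0, 80.0]})
rain = pd.DataFrame({"timestamp": [ts], "station_id": ["S1"], "rain": [0.2]})

def merge_readings(frames, how):
    """Merge a list of reading frames pairwise on timestamp and station_id."""
    return reduce(lambda left, right: left.merge(right, how=how, on=keys), frames)

left_merged = merge_readings([temp, rh, rain], how="left")    # keeps S2 with rain = NaN
inner_merged = merge_readings([temp, rh, rain], how="inner")  # keeps only S1

print(left_merged)
print(inner_merged)
```

With the left join, a missing rain value stays as NaN and would later be summed as 0 by `agg_weather`; if only stations with all five readings are wanted, an inner join is one way to get that.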

In [6]:
# %%time
##For individual months data
# weather_data = data_pull(start_date = '2021-01-01',end_date = '2021-01-31')
# weather_data.to_csv('../data/01_weather_data_202101.csv')

##### 1.1.2 Obtaining Last 8 quarters of data

Now that the custom function has been defined, we shall create a for-loop to obtain the last 8 quarters of data for modelling.

<i>Note: this notebook was created in May-2022. The for-loop is meant to be edited to suit the date range required.</i>

In [7]:
# %%time
# #loop through date ranges to fetch data for 2020 and 2021
# #This portion is hardcoded, and should be adapted to the user's needs accordingly.

# for year in [2020, 2021]:
#     for month in range(1,13,1):
#         last_day = calendar.monthrange(year,month)[1]
#         start_date = str(year)+'-'+str(month).zfill(2)+'-'+'01'
#         end_date = str(year)+'-'+str(month).zfill(2)+'-'+str(last_day)
#         filepath = '../data/01_weather_data_'+str(year)+str(month).zfill(2) + '.csv'
#         weather_data = data_pull(start_date, end_date)
#         weather_data.to_csv(filepath, index = False)
# #         del weather_data
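The month boundaries and file suffixes built by hand in the loop above can also be generated with `pd.period_range`. The sketch below is one possible variant; the generator name `month_ranges` is hypothetical, and the file path pattern is copied from the loop.

```python
import pandas as pd

def month_ranges(first_month, last_month):
    """Yield (start_date, end_date, yyyymm) for each month in the inclusive range."""
    for p in pd.period_range(start=first_month, end=last_month, freq="M"):
        yield (p.start_time.strftime("%Y-%m-%d"),
               p.end_time.strftime("%Y-%m-%d"),
               p.strftime("%Y%m"))

for start_date, end_date, yyyymm in month_ranges("2020-01", "2020-03"):
    # In the notebook, data_pull(start_date, end_date) would be called here
    print(start_date, end_date, f"../data/01_weather_data_{yyyymm}.csv")
```

Because the range is given as two month strings, it can span a year boundary without a nested year/month loop.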

##### 1.1.3 Concatenating downloaded data

In 1.1.2, the data was downloaded one month at a time to limit the impact of connectivity issues. 
In this section, the downloaded data shall be concatenated for ease of use.

<i>Note: the code here has been hardcoded for the purposes of project speed. Users who intend to change the date ranges should change the code accordingly.</i>

In [8]:
base_filepath = '../data/01_weather_data_'
ext = '.csv'
list_ext = []

for year in [2020,2021,2022]:
    for month in range(1,13,1):
        yearmonth = str(year)+str(month).zfill(2)
        
        if yearmonth == '202205':
            break
        else:     
            full_filepath = base_filepath + yearmonth + ext
            list_ext.append(full_filepath)

# drop the first three months (202001 to 202003) so the list starts at 202004
list_ext = list_ext[3:]


In [34]:
# https://www.pythonpool.com/python-string-to-variable-name/
for i in range(len(list_ext)):
    if i == 0:
        weather_data = pd.read_csv(list_ext[0])
    else:
        new_df = pd.read_csv(list_ext[i])
        weather_data = pd.concat([weather_data, new_df], axis = 0)
        print(f"Current Iteration: {i} | Number of Rows: {weather_data.shape[0]}",end = '\r')

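# drop the saved index column if present (files written with index = False will not have it)
weather_data = weather_data.drop(columns = ['Unnamed: 0'], errors = 'ignore')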
weather_data = weather_data.merge(right = stations_df,
                                 how = 'left',
                                 left_on = 'station_id',
                                 right_on = 'id')
weather_data = weather_data.drop(columns = ['id'])


weather_data.to_csv('../data/00_weatherdata_full.csv', index = False)
print('                                                                               ', end = '\r')
print('Data Concatenate, Complete!')


Current Iteration: 24 | Number of Rows: 3178161
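The file list and the concatenation can be written more compactly. The sketch below assumes the same file naming as above and the April 2020 to April 2022 range that the list ends up with; `load_monthly` is a hypothetical helper name. Reading all files first and calling `pd.concat` once avoids re-copying the growing dataframe on every iteration.

```python
import pandas as pd

base_filepath = "../data/01_weather_data_"
ext = ".csv"

# April 2020 to April 2022 inclusive, matching the files kept in the list above
months = pd.period_range(start="2020-04", end="2022-04", freq="M")
paths = [f"{base_filepath}{p.strftime('%Y%m')}{ext}" for p in months]

def load_monthly(sources):
    """Read each CSV source and stack the results in a single concat call."""
    frames = [pd.read_csv(src) for src in sources]
    return pd.concat(frames, ignore_index=True)

# weather_data = load_monthly(paths)  # requires the monthly CSV files on disk
print(len(paths), paths[0], paths[-1])
```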

In [36]:
weather_data.head()

Unnamed: 0,timestamp,station_id,temp,rh,wind_dir,wind_spd,rain,id,name,location,latitude,longitude
0,2020-04-01 00:00:00+08:00,S100,28.2,73.575,69.75,2.475,0.0,S100,Woodlands Road,"{'latitude': 1.4172, 'longitude': 103.74855}",1.4172,103.74855
1,2020-04-01 00:05:00+08:00,S100,28.2,73.64,72.4,2.72,0.0,S100,Woodlands Road,"{'latitude': 1.4172, 'longitude': 103.74855}",1.4172,103.74855
2,2020-04-01 00:10:00+08:00,S100,28.2,73.86,74.0,2.62,0.0,S100,Woodlands Road,"{'latitude': 1.4172, 'longitude': 103.74855}",1.4172,103.74855
3,2020-04-01 00:15:00+08:00,S100,28.2,74.28,73.8,2.56,0.0,S100,Woodlands Road,"{'latitude': 1.4172, 'longitude': 103.74855}",1.4172,103.74855
4,2020-04-01 00:20:00+08:00,S100,28.1,74.6,71.0,2.6,0.0,S100,Woodlands Road,"{'latitude': 1.4172, 'longitude': 103.74855}",1.4172,103.74855


# 2. Conclusion

Data from roughly the past 24 months (April 2020 to April 2022) has been drawn and concatenated into a single dataframe. The dataframe contains the following items:
* Date and time, aggregated to 5-minute intervals
* Air Temperature
* Relative Humidity
* Wind Direction & Wind Speed
* Rainfall
* Station ID & Name
* Station Latitude & Longitude Coordinates

Over 3.1 million rows were obtained. 