<a href="https://colab.research.google.com/github/themathedges/3YP-Standalone-Kennington/blob/main/Ravi/Daily_Precipitation_Temperature_Data_Processing_Ravi.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Daily Precipitation and Temperature Data Processing
Author: Ravi Kohli

Date: November 20th, 2020

College: Christ Church

**Goal:** The purpose of this notebook is to process the daily dataset from the Radcliffe Observatory for use in the analysis notebooks.

Data was collected from Radcliffe Observatory:
Oxford Weather and Climate since 1767 by Stephen Burt and Tim Burt, published by Oxford University Press, 2019.

This data is available at
https://www.geog.ox.ac.uk/research/climate/rms/daily-data.html



### Documentation of dataset

Explanation of the **relevant** columns (taken from the dataset documentation):
- YYYY - Year (four digits). First record 1 Jan 1815
- MM - Month (two digits). First record 1 Jan 1815
- DD - Date (two digits). First record 1 Jan 1815
- Tmax $^o$C - Daily maximum temperature. **Units degrees Celsius and tenths**. First record 1 Jan 1815
- Tmin $^o$C - Daily minimum temperature. **Units degrees Celsius and tenths**. First record 1 Jan 1815
- Daily Tmean $^o$C - Daily mean temperature, derived from the average of the day's maximum and minimum temperatures. **Units degrees Celsius and tenths**. First record 1 Jan 1815
- Daily range degC - Daily range in temperature, derived from maximum minus minimum temperature. **Units degrees Celsius and tenths**. First record 1 Jan 1815	
- Grass min $^o$C  - Daily grass minimum temperature. **Units degrees Celsius and tenths**. First record 1 Dec 1930 (earlier records exist, awaiting digitisation)
- Air frost 0/1 - Binary flag for air frost, 1 when the day's minimum temperature is below 0 degrees Celsius (note, excludes 0.0 degrees Celsius), else 0. 
- Ground frost 0/1 - Binary flag for ground frost, 1 when the day's grass minimum temperature is below 0 degrees Celsius (note, excludes 0.0 degrees Celsius), else 0. First record 1 Dec 1930 (earlier records exist, awaiting digitisation)
- Max >= 25.0°C - Binary flag for 'hot day', 1 when the day's maximum temperature is at or above 25.0 degrees Celsius, else 0. 
- Max >= 30.0°C - Binary flag for 'heatwave day', 1 when the day's maximum temperature is at or above 30.0 degrees Celsius, else 0. 
- Min >= 15.0 °C - Binary flag for 'warm night', 1 when the day's minimum temperature is at or above 15.0 degrees Celsius, else 0. 
- Max < 0 °C - Binary flag for 'ice day', 1 when the day's maximum temperature is below 0 degrees Celsius (note, excludes 0.0 degrees Celsius), else 0. 
- Rainfall mm raw including traces - daily precipitation total, **mm and tenths**, including 'trace' entries where entered. Includes melted snowfall, at least from 1853. Note that for the majority of the record, traces were not digitised and their absence from the record in this column should not be taken to assume they did not occur. For statistical operations it is preferable to use the following column which excludes traces, as text entries can result in errors in statistical operations performed on the data. 
- Rainfall mm 1 dpl no traces - daily precipitation total, **mm and tenths**: any 'trace' entries set to zero. Includes melted snowfall, at least from 1853. For statistical operations it is advisable to use this column (i.e. excluding traces), as text entries can result in errors in statistical operations performed on the data. First record 1 Jan 1827	
- Rain day (0.2 mm or more) - Binary flag for 'rain day', 1 when the day's rainfall is 0.2 mm or more, else 0. First record 1 Jan 1827. There remains some doubt as to whether rainfall was measured every day prior to 1853, and some 'daily' values prior to this may be multi-day accumulations (and thus the number of rain days will be lower than actual)	
- Wet day (1.0 mm or more) - Binary flag for 'wet day', 1 when the day's rainfall is 1.0 mm or more, else 0. First record 1 Jan 1827. There remains some doubt as to whether rainfall was measured every day prior to 1853, and some 'daily' values prior to this may be multi-day accumulations (and thus the number of wet days will be lower than actual)
- Sunshine duration (h) - Daily sunshine duration. **Units hours and tenths**. First record 1 Jan 1921 (earlier records exist, awaiting digitisation)
- Nil sunshine - Binary flag for 'sunless day', 1 when the day's sunshine duration is zero, else 0. First record 1 Jan 1921 (earlier records exist, awaiting digitisation)
- 12 h sunshine - Binary flag for 'sunny day', 1 when the day's sunshine duration is 12.0 hours or more, else 0. First record 1 Jan 1921 (earlier records exist, awaiting digitisation)
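
The threshold rules behind the binary flags can be sketched in a few lines of pandas. This is only an illustration of the definitions listed above, not the code used to build the dataset: the sample values are made up, the short column names are hypothetical, and the exact text used for trace entries in the raw file (assumed here to be `trace`) is an assumption.

```python
import pandas as pd

# Hypothetical sample values, not taken from the Radcliffe record
sample = pd.DataFrame({
    "Tmax": [31.2, 25.0, -0.4, 12.3],
    "Tmin": [16.1, 0.0, -5.2, 4.0],
    "Rainfall raw": ["0", "trace", "0.2", "2.8"],   # "trace" spelling is an assumption
})

# Trace entries are text, so set them to zero before converting to numbers
rain = pd.to_numeric(sample["Rainfall raw"].replace("trace", "0"))

flags = pd.DataFrame({
    "Air frost": (sample["Tmin"] < 0).astype(int),        # strictly below 0, so 0.0 is excluded
    "Hot day": (sample["Tmax"] >= 25.0).astype(int),
    "Heatwave day": (sample["Tmax"] >= 30.0).astype(int),
    "Warm night": (sample["Tmin"] >= 15.0).astype(int),
    "Ice day": (sample["Tmax"] < 0).astype(int),           # strictly below 0
    "Rain day": (rain >= 0.2).astype(int),
    "Wet day": (rain >= 1.0).astype(int),
})
print(flags)
```

Note that the frost and ice-day flags use a strict comparison while the other flags include the threshold value itself.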


## Code

**NOTE:** The cell below mounts Google Drive and is only needed in Colab; skip it when running this notebook on your own computer.

In [None]:
# Mounting the Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
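
As an alternative to commenting paths in and out by hand, the data path could be chosen automatically. The following is a minimal sketch, assuming that a successful import of `google.colab` means the notebook is running in Colab; the helper name `running_in_colab` is hypothetical, and the two paths are the ones used in this notebook.

```python
def running_in_colab():
    """Return True when the google.colab package can be imported."""
    try:
        import google.colab  # noqa: F401  (normally only available inside Colab)
    except ImportError:
        return False
    return True


# Colab path when in Colab, otherwise the relative path for a local notebook
path = '/content/drive/My Drive/3YP/data/' if running_in_colab() else '../data/'
print(path)
```

In Colab the Drive still has to be mounted before files under this path can be read.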


In [None]:
# importing the modules
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

In [None]:
# Retrieving the precipitation data
path = '/content/drive/My Drive/3YP/data/'    # when running in a notebook in colab
#path = '../data/'                            # when running from an external notebook

filename = "daily-data-to-jan-2020.csv"       # the name of the csv file
df = pd.read_csv(path+filename)               # converting the csv file to a dataframe



In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74906 entries, 0 to 74905
Data columns (total 21 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   YYYY                         74906 non-null  int64  
 1   MM                           74906 non-null  int64  
 2   DD                           74906 non-null  int64  
 3   Tmax °C                      74906 non-null  float64
 4   Tmin °C                      74906 non-null  float64
 5   Daily Tmean °C               74906 non-null  float64
 6   Daily range degC             74906 non-null  float64
 7   Grass min °C                 32531 non-null  object 
 8   Air frost 0/1                74906 non-null  int64  
 9   Ground frost 0/1             32251 non-null  float64
 10  Max ≥ 25.0°C                 74906 non-null  int64  
 11  Max ≥ 30.0°C                 74906 non-null  int64  
 12  Min ≥ 15.0 °C                74906 non-null  int64  
 13  Max < 0 °C      

In [None]:
df

Unnamed: 0,YYYY,MM,DD,Tmax °C,Tmin °C,Daily Tmean °C,Daily range degC,Grass min °C,Air frost 0/1,Ground frost 0/1,Max ≥ 25.0°C,Max ≥ 30.0°C,Min ≥ 15.0 °C,Max < 0 °C,Rainfall mm raw incl traces,Rainfall mm 1 dpl no traces,Rain day (0.2 mm or more),Wet day (1.0 mm or more),Sunshine duration h,Nil sunshine,12 h sunshine
0,1815,1,1,6.6,-1.5,2.6,8.1,,1,,0,0,0,0,,,,,,,
1,1815,1,2,4.9,-3.2,0.9,8.1,,1,,0,0,0,0,,,,,,,
2,1815,1,3,2.6,-5.6,-1.5,8.2,,1,,0,0,0,0,,,,,,,
3,1815,1,4,2.1,-6.1,-2.0,8.2,,1,,0,0,0,0,,,,,,,
4,1815,1,5,1.0,-7.2,-3.1,8.2,,1,,0,0,0,0,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
74901,2020,1,27,8.5,5.5,7.0,3.0,3.6,0,0.0,0,0,0,0,2.8,2.8,1.0,1.0,0.8,0,0
74902,2020,1,28,7.4,1.6,4.5,5.8,-0.3,0,1.0,0,0,0,0,0,0.0,0.0,0.0,6,0,0
74903,2020,1,29,9.8,2.3,6.1,7.5,0.1,0,0.0,0,0,0,0,0,0.0,0.0,0.0,5.5,0,0
74904,2020,1,30,12.3,4.0,8.2,8.3,2.4,0,0.0,0,0,0,0,0,0.0,0.0,0.0,0,1,0


List of observations and pre-processing steps to make the data usable:
- The Year, Month, and Day columns need to be combined to form a single date column with the datetime data type
- Some columns are read in as strings (dtype `object`, e.g. `Grass min °C`): these can be converted to ints and floats as required

*Row references are given with respect to original indexing
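
The two steps listed above can be tried out on a small frame first. This is a sketch on made-up rows laid out like the raw file, assuming the numeric column arrives as strings with a missing entry; `raw`, `dates`, and `rain` are illustrative names.

```python
import pandas as pd

# Hypothetical rows in the layout of the raw file (values are made up)
raw = pd.DataFrame({
    "YYYY": [2020, 2020, 2020],
    "MM": [1, 1, 2],
    "DD": [30, 31, 1],
    "Rainfall mm 1 dpl no traces": ["2.8", "0.0", None],
})

# pd.to_datetime assembles dates from columns named year, month and day
dates = pd.to_datetime(
    raw[["YYYY", "MM", "DD"]].rename(columns={"YYYY": "year", "MM": "month", "DD": "day"})
)

# String entries become floats; missing entries become NaN
rain = pd.to_numeric(raw["Rainfall mm 1 dpl no traces"])

print(dates.dtype, rain.dtype)
```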

## Data Preprocessing

In [None]:
df['Date'] = pd.to_datetime(pd.DataFrame({'year': df['YYYY'], 'month': df['MM'], 'day': df['DD']}))   # creating a date column

In [None]:
# creating a dataframe with just the data for the daily precipitation
radcliffe_precipitation_daily_df = df[['Date', 'Rainfall mm 1 dpl no traces', 'Rain day (0.2 mm or more)', 'Wet day (1.0 mm or more)']]
radcliffe_precipitation_daily_df

Unnamed: 0,Date,Rainfall mm 1 dpl no traces,Rain day (0.2 mm or more),Wet day (1.0 mm or more)
0,1815-01-01,,,
1,1815-01-02,,,
2,1815-01-03,,,
3,1815-01-04,,,
4,1815-01-05,,,
...,...,...,...,...
74901,2020-01-27,2.8,1.0,1.0
74902,2020-01-28,0.0,0.0,0.0
74903,2020-01-29,0.0,0.0,0.0
74904,2020-01-30,0.0,0.0,0.0


In [None]:
# changing the data to floats; taking an explicit copy first avoids pandas' SettingWithCopyWarning
radcliffe_precipitation_daily_df = radcliffe_precipitation_daily_df.copy()
radcliffe_precipitation_daily_df[radcliffe_precipitation_daily_df.columns[1:]] = radcliffe_precipitation_daily_df[radcliffe_precipitation_daily_df.columns[1:]].apply(pd.to_numeric)


In [None]:
# creating a dataframe with just the data for the daily temperature
radcliffe_temperature_daily_df = df[['Date', 'Tmax °C', 'Tmin °C', 'Daily Tmean °C', 'Daily range degC']]
radcliffe_temperature_daily_df

Unnamed: 0,Date,Tmax °C,Tmin °C,Daily Tmean °C,Daily range degC
0,1815-01-01,6.6,-1.5,2.6,8.1
1,1815-01-02,4.9,-3.2,0.9,8.1
2,1815-01-03,2.6,-5.6,-1.5,8.2
3,1815-01-04,2.1,-6.1,-2.0,8.2
4,1815-01-05,1.0,-7.2,-3.1,8.2
...,...,...,...,...,...
74901,2020-01-27,8.5,5.5,7.0,3.0
74902,2020-01-28,7.4,1.6,4.5,5.8
74903,2020-01-29,9.8,2.3,6.1,7.5
74904,2020-01-30,12.3,4.0,8.2,8.3


In [None]:
# changing the data to floats; taking an explicit copy first avoids pandas' SettingWithCopyWarning
radcliffe_temperature_daily_df = radcliffe_temperature_daily_df.copy()
radcliffe_temperature_daily_df[radcliffe_temperature_daily_df.columns[1:]] = radcliffe_temperature_daily_df[radcliffe_temperature_daily_df.columns[1:]].apply(pd.to_numeric)


In [None]:
# pickle our dataframes so they can be used in other notebooks
import pickle
path = '/content/drive/My Drive/3YP/data/'
filename_real = 'radcliffe_daily_precipitation_data_processed'
with open(path + filename_real, 'wb') as outfile_real:    # the file is closed automatically on leaving the block
    pickle.dump(radcliffe_precipitation_daily_df, outfile_real)

In [None]:
# pickle the temperature dataframe in the same way
path2 = '/content/drive/My Drive/3YP/data/'
filename_real2 = 'radcliffe_daily_temperature_data_processed'
with open(path2 + filename_real2, 'wb') as outfile_real2:    # the file is closed automatically on leaving the block
    pickle.dump(radcliffe_temperature_daily_df, outfile_real2)
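
An analysis notebook could read a pickled dataframe back in roughly as follows. This is a sketch of the round trip only: it writes a small made-up dataframe to a temporary directory rather than to the Drive folder, and the names `processed` and `restored` are illustrative.

```python
import os
import pickle
import tempfile

import pandas as pd

# Hypothetical stand-in for the processed precipitation dataframe
processed = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-27", "2020-01-28"]),
    "Rainfall mm 1 dpl no traces": [2.8, 0.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "radcliffe_daily_precipitation_data_processed")

    # write the dataframe, as in the cells above
    with open(file_path, "wb") as outfile:
        pickle.dump(processed, outfile)

    # read it back, as another notebook would
    with open(file_path, "rb") as infile:
        restored = pickle.load(infile)

print(restored.dtypes)
```

The column dtypes, including the datetime `Date` column, survive the round trip, so the conversions above do not need to be repeated after loading.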