## Making the Data Confess


In Part 2.1, we identified the `66062_UQ.dat` (Sydney Observatory Hill) file as the one with the highest frequency of Station Data.

We also sorted all stations by their (increasing) distance from it.  The closest station is `66006_UQ.dat` (Sydney Botanic Gardens).

We'll load up the metadata and these two files and take a closer look.

In [None]:
# Import data
library("readr")
metadata<-read_csv('./all_metadata.csv')
OH<-read_table('https://stluc.manta.uqcloud.net/mdatascience/public/datasets/SILO_PPD/66062_UQ.dat',skip=55,col_names
=c("Date","Day","Date2","TMax","Smx","TMin","Smn","Rain","Srn","Evap","Sev","Radn","Ssl","VP","Svp","RHmaxT","RHminT","FAO56","ASCEPM","Mlake","Mpot","Mact","Mwet","Span","Ssp","EvSp","Ses","MSLPres","Sp"))
BG<-read_table('https://stluc.manta.uqcloud.net/mdatascience/public/datasets/SILO_PPD/66006_UQ.dat',skip=55,col_names
=c("Date","Day","Date2","TMax","Smx","TMin","Smn","Rain","Srn","Evap","Sev","Radn","Ssl","VP","Svp","RHmaxT","RHminT","FAO56","ASCEPM","Mlake","Mpot","Mact","Mwet","Span","Ssp","EvSp","Ses","MSLPres","Sp"))
OH$Date<-strptime(OH$Date,format="%Y%m%d")
OH$Date2<-strptime(OH$Date2,format="%d-%m-%Y")
OH$Rain<-as.numeric(OH$Rain)
BG$Date<-strptime(BG$Date,format="%Y%m%d")
BG$Date2<-strptime(BG$Date2,format="%d-%m-%Y")
BG$Rain<-as.numeric(BG$Rain)
head(metadata,10)
head(OH,10)
head(BG,10)
summary(metadata)
summary(OH)
summary(BG)

Let's take a look at the temperature data at the two sites, coloured by source.

In [None]:
plot(OH$TMax[OH$Smx==0]~as.Date(OH$Date[OH$Smx==0]), col=rgb(0,0,1,.5), pch='.', xlab='Date', ylab='TMax (oC)', main="Observatory Hill TMax by Source")
points(OH$TMax[OH$Smx==13]~as.Date(OH$Date[OH$Smx==13]), col=rgb(1,0,0,.5), pch='.')
points(OH$TMax[OH$Smx==15]~as.Date(OH$Date[OH$Smx==15]), col=rgb(0,1,0,.5), pch='.')
points(OH$TMax[OH$Smx==23]~as.Date(OH$Date[OH$Smx==23]), col=rgb(1,0,1,.5), pch='.')
points(OH$TMax[OH$Smx==25]~as.Date(OH$Date[OH$Smx==25]), col=rgb(0,1,1,.5), pch='.')
points(OH$TMax[OH$Smx==26]~as.Date(OH$Date[OH$Smx==26]), col=rgb(1,1,0,.5), pch='.')
points(OH$TMax[OH$Smx==35]~as.Date(OH$Date[OH$Smx==35]), col=rgb(0,0,0,.5), pch='.')
points(OH$TMax[OH$Smx==75]~as.Date(OH$Date[OH$Smx==75]), col=rgb(.5,0,.5,.5), pch='.')
legend("topright", c("station","deaccum-nearby","deaccum-interp","nearby-BoM","interp-daily","synth-pan","interp-daily-CLIMARC","interp-lta"),pch='.',col=c(rgb(0,0,1,0.5), rgb(1,0,0,0.5),rgb(0,1,0,0.5),rgb(1,0,1,0.5),rgb(0,1,1,0.5),rgb(1,1,0,0.5),rgb(0,0,0,0.5),rgb(0.5,0,0.5,0.5)))

plot(OH$TMin[OH$Smn==0]~as.Date(OH$Date[OH$Smn==0]), col=rgb(0,0,1,.5), pch='.', xlab='Date', ylab='TMin (oC)', main="Observatory Hill TMin by Source")
points(OH$TMin[OH$Smn==13]~as.Date(OH$Date[OH$Smn==13]), col=rgb(1,0,0,.5), pch='.')
points(OH$TMin[OH$Smn==15]~as.Date(OH$Date[OH$Smn==15]), col=rgb(0,1,0,.5), pch='.')
points(OH$TMin[OH$Smn==23]~as.Date(OH$Date[OH$Smn==23]), col=rgb(1,0,1,.5), pch='.')
points(OH$TMin[OH$Smn==25]~as.Date(OH$Date[OH$Smn==25]), col=rgb(0,1,1,.5), pch='.')
points(OH$TMin[OH$Smn==26]~as.Date(OH$Date[OH$Smn==26]), col=rgb(1,1,0,.5), pch='.')
points(OH$TMin[OH$Smn==35]~as.Date(OH$Date[OH$Smn==35]), col=rgb(0,0,0,.5), pch='.')
points(OH$TMin[OH$Smn==75]~as.Date(OH$Date[OH$Smn==75]), col=rgb(.5,0,.5,.5), pch='.')
legend("topright", c("station","deaccum-nearby","deaccum-interp","nearby-BoM","interp-daily","synth-pan","interp-daily-CLIMARC","interp-lta"),pch='.',col=c(rgb(0,0,1,0.5), rgb(1,0,0,0.5),rgb(0,1,0,0.5),rgb(1,0,1,0.5),rgb(0,1,1,0.5),rgb(1,1,0,0.5),rgb(0,0,0,0.5),rgb(0.5,0,0.5,0.5)))

In [None]:
plot(BG$TMax[BG$Smx==0]~as.Date(BG$Date[BG$Smx==0]), col=rgb(0,0,1,.5), pch='.', xlab='Date', ylab='TMax (oC)', main="Botanic Gardens TMax by Source",xlim=c(min(as.Date(BG$Date),na.rm=TRUE),max(as.Date(BG$Date),na.rm=TRUE)),ylim=c(min(BG$TMax,na.rm=TRUE),max(BG$TMax,na.rm=TRUE)))
points(BG$TMax[BG$Smx==13]~as.Date(BG$Date[BG$Smx==13]), col=rgb(1,0,0,.5), pch='.')
points(BG$TMax[BG$Smx==15]~as.Date(BG$Date[BG$Smx==15]), col=rgb(0,1,0,.5), pch='.')
points(BG$TMax[BG$Smx==23]~as.Date(BG$Date[BG$Smx==23]), col=rgb(1,0,1,.5), pch='.')
points(BG$TMax[BG$Smx==25]~as.Date(BG$Date[BG$Smx==25]), col=rgb(0,1,1,.5), pch='.')
points(BG$TMax[BG$Smx==26]~as.Date(BG$Date[BG$Smx==26]), col=rgb(1,1,0,.5), pch='.')
points(BG$TMax[BG$Smx==35]~as.Date(BG$Date[BG$Smx==35]), col=rgb(0,0,0,.5), pch='.')
points(BG$TMax[BG$Smx==75]~as.Date(BG$Date[BG$Smx==75]), col=rgb(.5,0,.5,.5), pch='.')
legend("topright", c("station","deaccum-nearby","deaccum-interp","nearby-BoM","interp-daily","synth-pan","interp-daily-CLIMARC","interp-lta"),pch='.',col=c(rgb(0,0,1,0.5), rgb(1,0,0,0.5),rgb(0,1,0,0.5),rgb(1,0,1,0.5),rgb(0,1,1,0.5),rgb(1,1,0,0.5),rgb(0,0,0,0.5),rgb(0.5,0,0.5,0.5)))

plot(BG$TMin[BG$Smn==0]~as.Date(BG$Date[BG$Smn==0]), col=rgb(0,0,1,.5), pch='.', xlab='Date', ylab='TMin (oC)', main="Botanic Gardens TMin by Source",xlim=c(min(as.Date(BG$Date),na.rm=TRUE),max(as.Date(BG$Date),na.rm=TRUE)),ylim=c(min(BG$TMin,na.rm=TRUE),max(BG$TMin,na.rm=TRUE)))
points(BG$TMin[BG$Smn==13]~as.Date(BG$Date[BG$Smn==13]), col=rgb(1,0,0,.5), pch='.')
points(BG$TMin[BG$Smn==15]~as.Date(BG$Date[BG$Smn==15]), col=rgb(0,1,0,.5), pch='.')
points(BG$TMin[BG$Smn==23]~as.Date(BG$Date[BG$Smn==23]), col=rgb(1,0,1,.5), pch='.')
points(BG$TMin[BG$Smn==25]~as.Date(BG$Date[BG$Smn==25]), col=rgb(0,1,1,.5), pch='.')
points(BG$TMin[BG$Smn==26]~as.Date(BG$Date[BG$Smn==26]), col=rgb(1,1,0,.5), pch='.')
points(BG$TMin[BG$Smn==35]~as.Date(BG$Date[BG$Smn==35]), col=rgb(0,0,0,.5), pch='.')
points(BG$TMin[BG$Smn==75]~as.Date(BG$Date[BG$Smn==75]), col=rgb(.5,0,.5,.5), pch='.')
legend("topright", c("station","deaccum-nearby","deaccum-interp","nearby-BoM","interp-daily","synth-pan","interp-daily-CLIMARC","interp-lta"),pch='.',col=c(rgb(0,0,1,0.5), rgb(1,0,0,0.5),rgb(0,1,0,0.5),rgb(1,0,1,0.5),rgb(0,1,1,0.5),rgb(1,1,0,0.5),rgb(0,0,0,0.5),rgb(0.5,0,0.5,0.5)))

It appears that all of the temperature data at the Botanic Gardens site is interpolated, whereas most of the data from the Observatory Hill site is station data.

### Spatio-temporal consistency

Given that the two sites are very close to each other (about 1.2 km apart), let's see how the differences in maximum and minimum temperatures look.

In [None]:
plot((OH$TMax-BG$TMax)~as.Date(OH$Date), col=rgb(0,0,1,.5), pch='.', xlab='Date', ylab='OH - BG TMax (oC)', main="Difference in TMax")
plot((OH$TMin-BG$TMin)~as.Date(OH$Date), col=rgb(0,0,1,.5), pch='.', xlab='Date', ylab='OH - BG TMin (oC)', main="Difference in TMin")

plot((OH$TMax-OH$TMin)~as.Date(OH$Date), col=rgb(0,0,1,.5), pch='.', xlab='Date', ylab='OH TMax - OH TMin (oC)', main="Observatory Temperature Spread")
plot((BG$TMax-BG$TMin)~as.Date(BG$Date), col=rgb(0,0,1,.5), pch='.', xlab='Date', ylab='BG TMax - BG TMin (oC)', main="Botanic Gardens Temperature Spread")

plot(((OH$TMax-OH$TMin)-(BG$TMax-BG$TMin))~as.Date(OH$Date), col=rgb(0,0,1,.5), pch='.', xlab='Date', ylab='OH Spread - BG Spread (oC)', main="Difference in Temperature Spread")


It is very clear from these plots that the temporal relationship between the two sites changes markedly when the Botanic Gardens data switches from `interp-daily-CLIMARC` to `interp-daily`, even though there is little to suggest a change when looking solely at the Botanic Gardens temperature data.  Detecting such a change algorithmically again falls into the realm of change-point analysis.
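
As a minimal sketch of what a change-point analysis could look like, the block below locates a single shift in the mean of a series by minimising the within-segment sum of squares. It runs on synthetic data: the series `d`, the shift placed after observation 300, and the helper `find_changepoint` are all illustrative assumptions, not part of the SILO files or of any package. In practice `d` might be replaced by a difference series such as `OH$TMax - BG$TMax`.

```r
# Synthetic stand-in for a difference series such as OH$TMax - BG$TMax:
# the mean shifts from 0 to 1.5 after observation 300 (assumed values).
set.seed(1)
d <- c(rnorm(300, mean = 0, sd = 0.5), rnorm(200, mean = 1.5, sd = 0.5))

# Hypothetical helper: find the split point k that minimises the total
# within-segment sum of squares when the series is cut into 1..k and (k+1)..n.
find_changepoint <- function(x, min_seg = 10) {
  n <- length(x)
  ks <- min_seg:(n - min_seg)
  cost <- sapply(ks, function(k) {
    left <- x[1:k]
    right <- x[(k + 1):n]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  ks[which.min(cost)]
}

k_hat <- find_changepoint(d)
print(k_hat)
```

This handles only one change in the mean; series with several changes, or changes in variance, would need a more general method.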

### Past and Present Temperature Distributions

Let's take a look now at a single year's worth of temperature data from 1889 vs the same from 2015.

In [None]:
OH1889<-OH[(as.Date(OH$Date)<'1890-01-01')&(as.Date(OH$Date)>='1889-01-01'),]
OH2015<-OH[(as.Date(OH$Date)<'2016-01-01')&(as.Date(OH$Date)>='2015-01-01'),]
BG1889<-BG[(as.Date(BG$Date)<'1890-01-01')&(as.Date(BG$Date)>='1889-01-01'),]
BG2015<-BG[(as.Date(BG$Date)<'2016-01-01')&(as.Date(BG$Date)>='2015-01-01'),]

plot(OH1889$TMax~as.Date(format(as.Date(OH1889$Date),"1970-%m-%d")), col=rgb(0,0,1,.5), pch='.', xlab='Date', ylab='OH TMax (oC)', main="OH TMax 1889 vs 2015")
points(OH2015$TMax~as.Date(format(as.Date(OH2015$Date),"1970-%m-%d")), col=rgb(1,0,0,.5), pch='.')

plot((OH2015$TMax-OH1889$TMax)~as.Date(format(as.Date(OH1889$Date),"1970-%m-%d")), col=rgb(0,0,1,.5), pch='.', xlab='Date', ylab='OH TMax (oC)', main="OH TMax 2015 - 1889")

plot(OH1889$TMin~as.Date(format(as.Date(OH1889$Date),"1970-%m-%d")), col=rgb(0,0,1,.5), pch='.', xlab='Date', ylab='OH TMin (oC)', main="OH TMin 1889 vs 2015")
points(OH2015$TMin~as.Date(format(as.Date(OH2015$Date),"1970-%m-%d")), col=rgb(1,0,0,.5), pch='.')

plot((OH2015$TMin-OH1889$TMin)~as.Date(format(as.Date(OH1889$Date),"1970-%m-%d")), col=rgb(0,0,1,.5), pch='.', xlab='Date', ylab='OH TMin (oC)', main="OH TMin 2015 - 1889")

plot((OH1889$TMax-OH1889$TMin)~as.Date(format(as.Date(OH1889$Date),"1970-%m-%d")), col=rgb(0,0,1,.5), pch='.', xlab='Date', ylab='OH Spread (oC)', main="OH Temp Spread 1889 vs 2015")
points((OH2015$TMax-OH2015$TMin)~as.Date(format(as.Date(OH2015$Date),"1970-%m-%d")), col=rgb(1,0,0,.5), pch='.')

plot(((OH2015$TMax-OH2015$TMin)-(OH1889$TMax-OH1889$TMin))~as.Date(format(as.Date(OH1889$Date),"1970-%m-%d")), col=rgb(0,0,1,.5), pch='.', xlab='Date', ylab='OH Spread (oC)', main="OH Temp Spread 2015 - 1889")

Even though they show the seasonal trends rather nicely, it is a bit difficult from these plots to see if there are any changes in the max, min, and spread of temperatures between 1889 and 2015.  

Let's look at histograms and numerical summaries of the daily differences in TMax and TMin between the two years.

In [None]:
hist(OH2015$TMax-OH1889$TMax,breaks=seq(-20,20,by=1),freq=TRUE,col=rgb(0,0,1,0.5),main="Histogram for Differences (2015 - 1889)", xlab="Difference (oC)")
hist(OH2015$TMin-OH1889$TMin,breaks=seq(-20,20,by=1),freq=TRUE,col=rgb(1,0,0,0.5),add=TRUE)
legend("topright", c("TMax","TMin"),fill=c(rgb(0,0,1,0.5), rgb(1,0,0,0.5)))

summary(OH2015$TMax-OH1889$TMax)
summary(OH2015$TMin-OH1889$TMin)

These histograms have means/medians which are both above zero, suggesting that the typical daily min/max temperatures were higher at this site in 2015 than the corresponding day in 1889.  There is of course much more that these distributions can tell us; can you think of another interesting point?

### Temperature, Rainfall, and Radiation

Finally, let's take a look and see if there are any relationships between a select few variables (Min/Max Temperature, Rainfall, and Radiation) at the Observatory site.

In [None]:
# look at the relationships
pairs(~ TMax + TMin + Rain + Radn, data=OH)

There are certainly some interesting patterns here, although not too surprising.  For example, there is a strong linear relationship between TMin and TMax in general (caution: we're ignoring the fact the data comes from time series).  If TMax or TMin is high, then Rain is low.  If TMax is high, then Radn tends to be high.  If Radn is high, then there tends to be less Rain.

Let's find the 1% and 99% quantiles for TMax, and look at the characteristics of the days in the bottom 1% (coldest) and top 1% (hottest) of maximum temperatures.

In [None]:
TMaxq99<-quantile(OH$TMax,0.99)
TMaxq01<-quantile(OH$TMax,0.01)
subOH<-OH[(OH$TMax>=TMaxq99)|(OH$TMax<=TMaxq01),]
subOH$Tag<-(subOH$TMax>=TMaxq99)

# look at the relationships
pairs(~ TMax + TMin + Rain + Radn, data=subOH,col=(as.numeric(subOH$Tag)+1))

It is quite interesting to see the different character of the top 1% hottest TMax and bottom 1% coldest TMax days at the Observatory site.

Let's take a look temporally to see when these hottest and coldest TMax recordings occurred.

In [None]:
# Hottest days in red, coldest days in blue
plot(subOH$TMax[subOH$Tag]~as.Date(subOH$Date[subOH$Tag]),col=rgb(1,0,0,.5), pch='.', xlab='Date', ylab='OH TMax (oC)', main="OH TMax (Extremes)",ylim=c(min(subOH$TMax),max(subOH$TMax)),xlim=c(min(as.Date(subOH$Date)),max(as.Date(subOH$Date))))
points(subOH$TMax[!subOH$Tag]~as.Date(subOH$Date[!subOH$Tag]), col=rgb(0,0,1,.5), pch='.')
legend("right", c("top 1% (hottest)","bottom 1% (coldest)"),pch='.',col=c(rgb(1,0,0,0.5), rgb(0,0,1,0.5)))

Certainly it seems as though unusually cold days tend to have earlier dates.
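
One way to check this impression numerically is to count the extreme days per decade. The sketch below does so on synthetic data with a slow warming trend built in; the series `tmax`, the trend size, and the date range are assumptions for illustration only. The same tabulation could be applied to `OH$TMax` and `OH$Date`.

```r
# Synthetic daily TMax with a slow warming trend (assumed, for illustration)
set.seed(5)
dates <- seq(as.Date("1900-01-01"), as.Date("1999-12-31"), by = "day")
tmax <- 22 + 0.00005 * seq_along(dates) + rnorm(length(dates), sd = 4)

# 1% and 99% quantiles of the whole series
q <- quantile(tmax, c(0.01, 0.99))

# Decade of each observation, e.g. 1900, 1910, ...
decade <- 10 * (as.integer(format(dates, "%Y")) %/% 10)

# Count bottom-1% and top-1% days in each decade
extremes <- data.frame(decade = decade, cold = tmax <= q[1], hot = tmax >= q[2])
counts <- aggregate(cbind(cold, hot) ~ decade, data = extremes, FUN = sum)
print(counts)
```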

As a last exercise, let's fit a linear regression model with TMax as output and TMin as input.

In [None]:
Tfit<-lm(OH$TMax~OH$TMin)
summary(Tfit)
library("car")
qqPlot(resid(Tfit),xlab="Standard Normal Quantiles",ylab="Quantiles of the Residuals")

The residuals are clearly non-normal, and the model fit is not great.

Let's transform the output to a log scale and fit again.

In [None]:
Tfit2<-lm(log(OH$TMax)~OH$TMin)
summary(Tfit2)
library("car")
qqPlot(resid(Tfit2),xlab="Standard Normal Quantiles",ylab="Quantiles of the Residuals")

The distribution of the residuals is better here, but the model fit is still rather poor.  How can we interpret the coefficients here?
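
As a hint: in a model of the form log(TMax) = a + b * TMin, a one-degree increase in TMin multiplies the fitted TMax (on the original scale) by exp(b), which is a relative change of 100 * (exp(b) - 1) percent. A minimal sketch on synthetic data follows; the data frame `toy` and the values used to generate it are assumptions for illustration only.

```r
# Synthetic data in which TMax grows by about 3% per degree of TMin (assumed)
set.seed(2)
toy <- data.frame(TMin = runif(500, 5, 25))
toy$TMax <- exp(2.5 + 0.03 * toy$TMin + rnorm(500, sd = 0.05))

# Same model form as Tfit2: log-transformed output, untransformed input
fit <- lm(log(TMax) ~ TMin, data = toy)
b <- coef(fit)[["TMin"]]

# Multiplicative effect of +1 degree of TMin on the fitted TMax
ratio <- exp(b)
pct_change <- 100 * (ratio - 1)
print(c(slope = b, ratio = ratio, pct_change = pct_change))
```

For small slopes, `pct_change` is close to `100 * b`, which is why such coefficients are often read directly as approximate percentage changes.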

## Challenge: Conditions Conducive to Bushfires

On the 16th of February, 1983, a series of bushfires swept through south-eastern Australia causing loss of life and widespread destruction.  This event is known as Ash Wednesday.

We will investigate data from one station near the Adelaide Hills nexus of bushfires leading up to that day.

Your challenge is to build a model from this dataset that flags conditions conducive to a bushfire event.

In [None]:
AW1<-read_table('https://stluc.manta.uqcloud.net/mdatascience/public/datasets/SILO_PPD/23785_UQ.dat',skip=55,col_names
=c("Date","Day","Date2","TMax","Smx","TMin","Smn","Rain","Srn","Evap","Sev","Radn","Ssl","VP","Svp","RHmaxT","RHminT","FAO56","ASCEPM","Mlake","Mpot","Mact","Mwet","Span","Ssp","EvSp","Ses","MSLPres","Sp"))
AW1$Date<-strptime(AW1$Date,format="%Y%m%d")
AW1$Date2<-strptime(AW1$Date2,format="%d-%m-%Y")
AW1$Rain<-as.numeric(AW1$Rain)
head(AW1,10)

Let's examine the maximum temperature, minimum temperature, rainfall, radiation, and vapour pressure for the site by source.

In [None]:
# Source codes, labels, and colours (same scheme as the earlier plots)
src_codes<-c(0,13,15,23,25,26,35,75)
src_labels<-c("station","deaccum-nearby","deaccum-interp","nearby-BoM","interp-daily","synth-pan","interp-daily-CLIMARC","interp-lta")
src_cols<-c(rgb(0,0,1,0.5), rgb(1,0,0,0.5),rgb(0,1,0,0.5),rgb(1,0,1,0.5),rgb(0,1,1,0.5),rgb(1,1,0,0.5),rgb(0,0,0,0.5),rgb(0.5,0,0.5,0.5))

# Plot one variable against date, colouring each point by that variable's OWN source column
# (Smx for TMax, Smn for TMin, Srn for Rain, Ssl for Radn, Svp for VP)
plot_by_source<-function(df,var,src,ylab){
  d<-as.Date(df$Date)
  # Empty frame spanning all the data, so the plot works even if a source code is absent
  plot(d,df[[var]],type='n',xlab='Date',ylab=ylab,main=paste("(-35, 138.7167)",var,"by Source"))
  for(i in seq_along(src_codes)){
    keep<-which(df[[src]]==src_codes[i])
    points(d[keep],df[[var]][keep],col=src_cols[i],pch='.')
  }
  legend("topleft",src_labels,pch='.',col=src_cols)
}

plot_by_source(AW1,"TMax","Smx","TMax (oC)")
plot_by_source(AW1,"TMin","Smn","TMin (oC)")
plot_by_source(AW1,"Rain","Srn","Rain (mm)")
plot_by_source(AW1,"Radn","Ssl","Radn (MJ/m2)")
plot_by_source(AW1,"VP","Svp","VP (hPa)")



We'll zoom in to the month up to and including 1983-02-16. 

In [None]:
AW2<-AW1[as.Date(AW1$Date)>=as.Date('1983-01-16')&as.Date(AW1$Date)<=as.Date('1983-02-16'),]

plot(AW2$TMax[AW2$Smx==0]~as.Date(AW2$Date[AW2$Smx==0]), col=rgb(1,0,0,.5), pch='o', xlab='Date', ylab='Temperature (oC)', main="(-35, 138.7167) TMin and TMax",xlim=c(min(as.Date(AW2$Date)),max(as.Date(AW2$Date))),ylim=c(min(AW2$TMin),max(AW2$TMax)))
points(AW2$TMin[AW2$Smx==0]~as.Date(AW2$Date[AW2$Smx==0]), col=rgb(0,0,1,.5), pch='o')
legend("topleft", c("TMax","TMin"),pch='o',col=c(rgb(1,0,0,0.5), rgb(0,0,1,0.5)))

plot(AW2$Rain[AW2$Smx==0]~as.Date(AW2$Date[AW2$Smx==0]), col=rgb(0,0,0,.5), pch='o', xlab='Date', ylab='Rain (mm)', main="(-35, 138.7167) Rain",xlim=c(min(as.Date(AW2$Date)),max(as.Date(AW2$Date))))

plot(AW2$Radn[AW2$Smx==0]~as.Date(AW2$Date[AW2$Smx==0]), col=rgb(1,0,1,.5), pch='o', xlab='Date', ylab='Radn (MJ/m2)', main="(-35, 138.7167) Radn",xlim=c(min(as.Date(AW2$Date)),max(as.Date(AW2$Date))))

plot(AW2$VP[AW2$Smx==0]~as.Date(AW2$Date[AW2$Smx==0]), col=rgb(1,0.5,0,.5), pch='o', xlab='Date', ylab='VP (hPa)', main="(-35, 138.7167) VP",xlim=c(min(as.Date(AW2$Date)),max(as.Date(AW2$Date))))


Some characteristics are immediately clear.  Leading up to the bushfire event, there is very little rainfall, and maximum temperature and radiation measurements are high.  

The challenge is one of classification; attempting to classify parts of the data as having conditions conducive to a bushfire event or not.  

The first part of this challenge is to identify other parts of the data leading up to known bushfire events (of length, say, one month), as well as parts of the data that do not correspond to known bushfire events (of the same length).  This will result in a set of labelled data with and without known bushfire events.
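
A minimal sketch of this labelling step is given below. It summarises a window of daily records ending on a chosen date and attaches a label. The helper `window_features`, the summary statistics chosen, the synthetic data frame `toy`, and the non-event date are all assumptions for illustration; real event dates would need to come from a record of known bushfires.

```r
# Hypothetical helper: summarise the `days` days up to and including `end_date`
window_features <- function(df, end_date, days = 30) {
  end_date <- as.Date(end_date)
  w <- df[df$Date > end_date - days & df$Date <= end_date, ]
  data.frame(end_date = end_date,
             mean_TMax = mean(w$TMax), mean_TMin = mean(w$TMin),
             total_Rain = sum(w$Rain), mean_Radn = mean(w$Radn))
}

# Synthetic daily records standing in for a SILO station file
set.seed(3)
dates <- seq(as.Date("1982-01-01"), as.Date("1983-12-31"), by = "day")
toy <- data.frame(Date = dates,
                  TMax = rnorm(length(dates), 22, 6),
                  TMin = rnorm(length(dates), 12, 4),
                  Rain = rexp(length(dates), 1 / 2),
                  Radn = runif(length(dates), 5, 30))

# One window ending on a known event date, one ending on an arbitrary non-event date
labelled <- rbind(
  cbind(window_features(toy, "1983-02-16"), bushfire = 1),
  cbind(window_features(toy, "1982-08-16"), bushfire = 0)
)
print(labelled)
```

With the real data, `Date` would first need converting with `as.Date()`, and many more windows of both kinds would be required before a classifier could be trained.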

The second part of this challenge is to train your classifier on part of your labelled data (a problem of supervised learning).  For instance, you could build a logistic regression model for use as a binary classifier, with features consisting of rainfall, minimum temperature, maximum temperature, and radiation measurements for one month leading up to an event (bushfire / no bushfire).  

The third part of this challenge is to test your classifier on the held-out part of your labelled data.  Does it perform any better than a classifier that assigns labels at random?
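
The training and testing steps can be sketched as follows, using base R's `glm` with a binomial family. Everything here runs on synthetic data: the feature names `mean_TMax` and `total_Rain`, the way labels are generated, the 70/30 split, and the 0.5 threshold are assumptions for illustration, not a recommended design.

```r
set.seed(4)
n <- 400
# Synthetic window-level features; labels are generated so that hot, dry
# windows are more likely to be flagged (an assumption for illustration only)
feat <- data.frame(mean_TMax = rnorm(n, 25, 5), total_Rain = rexp(n, 1 / 30))
p <- plogis(-1 + 0.4 * (feat$mean_TMax - 25) - 0.05 * (feat$total_Rain - 30))
feat$bushfire <- rbinom(n, 1, p)

# Hold out 30% of the labelled windows for testing
test_idx <- sample(n, size = round(0.3 * n))
train <- feat[-test_idx, ]
test <- feat[test_idx, ]

# Logistic regression used as a binary classifier
clf <- glm(bushfire ~ mean_TMax + total_Rain, data = train, family = binomial)
pred <- as.integer(predict(clf, newdata = test, type = "response") > 0.5)

# Baseline: labels assigned at random with the training-set event rate
rand <- rbinom(nrow(test), 1, mean(train$bushfire))

acc_model <- mean(pred == test$bushfire)
acc_random <- mean(rand == test$bushfire)
print(c(model = acc_model, random = acc_random))
```

Accuracy is only one possible measure. With real bushfire data the events will be rare, so measures such as sensitivity and specificity are likely to be more informative.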


## Storytelling with data

**THIS IS NOT ASSESSED**

Identify three interesting data stories within this dataset.

For each data story create a single (or compound) visualisation of the data that speaks to this particular story. Pay close attention to selecting appropriate visualisations given the unique characteristics of the data stories.

Prepare these data stories as slides within a presentation platform of your choice (PowerPoint, Tableau, Prezi). Include a paragraph or two of text in the speaker notes for each slide, narrating the data story in each case.