HTSoutliers

The goal of HTSoutliers is to provide functions to help with the detection of outliers in hydrological time series data, specifically the snow depth (SNWD) and water equivalent snow depth (WESD) measurements recorded by NOAA in the Global Historical Climatology Network Daily (GHCN-D). This data set contains data from thousands of weather stations, going back as far as 1890.

There are two major parts to this package: the functions and the data sets. The functions primarily help with downloading and formatting the raw GHCN-D data; there are also functions to fit a loess model and to plot and visualize the outliers. The data sets are included to save computation time: many of them are built by summarizing massive amounts of data, which can take hours, so they are bundled here for convenience.
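
For example, once the package is installed (see Installation below), you can list the bundled data sets and take a quick look at one of them using base R (ghcnd_stations is used again later in this example):

# Lists every data set that ships with the package
library(HTSoutliers)
data(package = "HTSoutliers")

# Peeks at the GHCN-D station metadata
head(ghcnd_stations)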

Installation

You can install the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("scoutiii/HTSoutliers")

Example

This is a basic example which shows you how to solve a common problem:

In Scout Jarman’s research, the goal was to create models which can classify SNWD observations as outliers. The first model we will create is a random forest. To do this, we need a training data set and a testing data set, so let's start by creating the training data set.

We will create a training data set based on a random sample of observations outside of Colorado. We can use the outliers_ghcnd_training_set data set, as it already contains a random sample of station data (with outliers and non-outliers).

library(HTSoutliers)
library(tidyverse)

# Gets all the Colorado station ids
CO_ids <- ghcnd_stations %>%
  dplyr::filter(STATE == "CO")

# Filters out non-Colorado data from the outliers_ghcnd_training_set
training_data <- outliers_ghcnd_training_set %>%
  dplyr::filter(!(ID %in% CO_ids$ID))

Now that we have some training data we can fit a model. But there is a problem: the ratio of non-outliers to outliers is out of balance. We can use the subset_dataset function to help balance the data set. This function takes a training set and adjusts the ratio of non-outliers to outliers to a desired value; in this case we want 3 non-outliers to 1 outlier.

# Subsets the training data to have a non-outlier to outlier ratio of 3:1
set.seed(42069)
training_data <- subset_dataset(training_data,
                                zeros_to_one = 3) %>%
  dplyr::select(-c(ID, DATE, WESD,
                   PRISM_PPT, PRISM_TMIN, PRISM_TMAX,
                   TYPE)) %>%
  dplyr::distinct()
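
As a quick sanity check, we can confirm the resulting class balance. This assumes the outlier flag column is OUTLIER_FINAL, the same column used in the model formula further down:

# Counts non-outliers (0) and outliers (1) in the balanced training set
dplyr::count(training_data, OUTLIER_FINAL)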

Now that we have a training data set, let's create a testing data set. We want to narrow our testing set to stations which we know have outliers. We can use the outliers_ghcnd data set in the package to find such stations. The outliers_ghcnd data set only contains observations that are outliers, so we will first gather the station ids that we want.

# Finds Colorado stations which have manually detected outliers
test_manual_ids <- outliers_ghcnd %>%
  dplyr::filter(ID %in% CO_ids$ID,
                TYPE == "MANUAL")

# Finds a random sample of Colorado stations with QFLAG outliers
set.seed(42069)
test_other_ids <- outliers_ghcnd %>%
  dplyr::filter(ID %in% CO_ids$ID,
                TYPE != "MANUAL") %>%
  dplyr::sample_n(1000)

# Combines the station ids of the stations we want to test
test_ids <- dplyr::bind_rows(test_manual_ids,
                             test_other_ids) %>%
  dplyr::sample_n(15)

Now that we have the ids of the stations we want to test, we need to download all of their data from the GHCN-D. We first use the get_station_data function, which fetches the data online or, if you have a local copy (recommended for speed), finds it in the file system. Once we have the raw GHCN-D data, we need to format it so that it is easy to work with. We use create_flagged_dataset, which formats the raw data to include the outlier flags. Next we use create_model_variables, which creates several new variables (summary statistics on different time scales) to be used in the random forest.

# Downloads the raw GHCN-D data
test_data <- get_station_data(test_ids$ID, 
                              "ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/all/")

# Removes duplicate observations for the same station and date
test_data <- dplyr::distinct(test_data, ID, DATE, .keep_all = TRUE)
# Formats the data to include the outlier flag
test_data <- create_flagged_dataset(test_data)
# Creates new variables for use in a model
test_data <- create_model_variables(test_data)

# Makes sure the station ids are plain character strings
test_data$ID <- as.character(test_data$ID)
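
Before fitting anything, it can be worth checking how many flagged outliers actually made it into the test set (again assuming the flag column produced by create_flagged_dataset is OUTLIER_FINAL, as used in the confusion matrix below):

# Tallies non-outliers and outliers in the test set
table(test_data$OUTLIER_FINAL)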

Now that we have our training and testing data sets, we can train a random forest.

library(randomForest)

# The training_data already has the most important variables for training
set.seed(42069)
forest <- randomForest::randomForest(OUTLIER_FINAL ~ .,
                                     data = training_data,
                                     importance = TRUE,
                                     ntree = 101)
# see which variables are most important
randomForest::varImpPlot(forest,
                         main = "Daily Model Variable Importance")

Now that we have our random forest fit, we can run our testing data through and see how well it works.

# uses the random forest to predict outliers on our testing data set
test_data$OUTLIER_PRED <- predict(forest, test_data)

# makes a confusion matrix to get an idea of the performance
caret::confusionMatrix(test_data$OUTLIER_PRED, test_data$OUTLIER_FINAL, positive="1")
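
If you just want the raw counts (for example, to keep an eye on the false positives, which we come back to below), a plain base R cross-tabulation of the same two columns is enough:

# Cross-tabulates predicted vs. actual outlier flags
table(predicted = test_data$OUTLIER_PRED,
      actual    = test_data$OUTLIER_FINAL)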

Numbers are great, but it really helps to visualize the outliers to see how well the model did. The plot_outliers function takes a data set with ID, DATE, SNWD, OUTLIER_FINAL, and OUTLIER_PRED, and plots them so you can see which points are outliers and which points are predicted to be outliers.

# gets the ids of stations which had predicted outliers
flagged_ids <- dplyr::filter(test_data, OUTLIER_PRED == 1) %>%
  dplyr::select(ID) %>%
  dplyr::distinct()

# plots the first 5 stations
for (i in 1:5) {
  print(flagged_ids[i, "ID"])
  print(plot_outliers(test_data, as.character(flagged_ids[i,"ID"])))
}

This model does OK, but has too many false positives. To try and reduce the number of false positives, let's create a new model. What if we used a loess curve to capture the general behavior of a station, and called any observation more than 3 SD above the loess curve an outlier? We can then combine this model's outliers with the random forest outliers, so that only observations flagged by both models are counted as outliers. The loess_model function fits a loess curve for each station and flags points which lie above a specified number of standard deviations from the loess prediction.

# saves the random forest predictions
test_data$rf_pred <- test_data$OUTLIER_PRED

# fits a loess model to the testing data
loess_test <- loess_model(test_data, span = .2, nsigma = 3.5, family = "gaussian")

# Flags points only if both the random forest and loess model agree
test_data$OUTLIER_PRED <- as.factor((test_data$rf_pred == "1" & loess_test$loess_outlier == "1") * 1)

# confusion matrix to measure the performance
caret::confusionMatrix(test_data$OUTLIER_PRED, test_data$OUTLIER_FINAL, positive="1")

# plots the first 5 stations
for (i in 1:5) {
  print(flagged_ids[i, "ID"])
  print(plot_outliers(test_data, as.character(flagged_ids[i,"ID"])))
}

That is a little better: there are fewer false positives, though still not great. That is real world data for you, big and messy.
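
One rough way to quantify that improvement is to count the false positives before and after requiring the loess model to agree. This is just base R on the columns created above (rf_pred holds the random-forest-only predictions, OUTLIER_PRED the combined ones), assuming the flags are coded as "0" and "1" as elsewhere in this example:

# False positives from the random forest alone
sum(test_data$rf_pred == "1" & test_data$OUTLIER_FINAL == "0")
# False positives after requiring both models to agree
sum(test_data$OUTLIER_PRED == "1" & test_data$OUTLIER_FINAL == "0")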

There are a couple of things from the package we did not use, such as the ghcnd_monthly data set, which contains monthly summaries of every station in the GHCN-D; the PRISM_climate_norms data set (which is actually used inside create_model_variables); and the state_counts data set, which has the counts of SNWD and WESD observations.
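
If you want to explore those, they load like any other bundled data set; a minimal sketch (the exact columns will differ per data set):

# Quick structural look at the remaining bundled data sets
str(ghcnd_monthly)
str(PRISM_climate_norms)
str(state_counts)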
