# Weather Forecast Dataset - Usage Example

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/whylogs/blob/mainline/python/examples/datasets/weather.ipynb)

This is an example demonstrating the usage of the Weather Forecast Dataset.

For more information about the dataset itself, see the documentation:
https://whylogs.readthedocs.io/en/latest/datasets/weather.html

## Installing the datasets module

Uncomment the cell below if you don't have the `datasets` module installed:

In [None]:
# %pip install -q whylogs[datasets]

## Loading the Dataset

You can load the dataset of your choice by calling it from the `datasets` module:

In [2]:
from whylogs.datasets import Weather

dataset = Weather(version="in_domain")

This creates a folder named `whylogs_data` in the current directory, containing the CSV files for the Weather Dataset. If the files already exist, the module does not download them again.

Notice that we specify the version of the dataset. A dataset can have multiple versions that serve different purposes. In this case, the "in_domain" version has baseline and inference subsets drawn from the same domain (data from the same set of regions: tropical, dry, polar, etc.).

If we're interested in assessing drift issues, the "out_domain" version can be used instead: its inference subset contains out-of-domain data compared to the baseline.

Similarly, datasets could have other versions for other purposes, such as assessing data quality or outlier detection strategies.

## Discovering Information

To see which versions are available for a given dataset, call:

In [9]:
Weather.describe_versions()

('in_domain', 'out_domain')

To get an overall description of the dataset:

In [None]:
print(Weather.describe())

Note: the output was cleared because `describe()` prints a rather lengthy description.

## Getting Baseline Data

You can access data from two different partitions: the baseline dataset and inference dataset.

The baseline can be accessed as a whole, whereas the inference dataset can be accessed in periodic batches, defined by the user.

To get a `baseline` object, just call `dataset.get_baseline()`:

In [9]:
from whylogs.datasets import Weather

dataset = Weather(version="in_domain")

baseline = dataset.get_baseline()

`baseline` has several attributes: one timestamp and five dataframes.

- timestamp: the batch's timestamp (at the start)
- data: the complete dataframe
- features: input features
- target: output feature(s)
- prediction: output prediction and, possibly, features such as uncertainty, confidence, probability
- misc: metadata features that do not fit any of the previous categories, but still contain relevant information about the data.

In [6]:
baseline.timestamp

datetime.date(2022, 8, 4)

In [8]:
baseline.misc.head()

| date (index) | meta_latitude | meta_longitude | meta_climate | date |
|---|---|---|---|---|
| 2022-08-04 | 28.7029 | -105.964996 | dry | 2022-08-04 |
| 2022-08-04 | -35.165298 | 147.466003 | mild temperate | 2022-08-04 |
| 2022-08-04 | 29.6073 | -95.158798 | mild temperate | 2022-08-04 |
| 2022-08-04 | 39.077999 | -77.557503 | mild temperate | 2022-08-04 |
| 2022-08-04 | 26.152599 | -81.775299 | mild temperate | 2022-08-04 |


## Setting Parameters

With `set_parameters`, you can specify the timestamps for both baseline and inference datasets, as well as the inference interval.

By default, the timestamps are set to:
- Current date for baseline dataset
- Tomorrow's date for inference dataset

You can set these timestamps to any day, including the dataset's original date.
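As a minimal sketch of the defaults described above, the two dates can be computed with the standard library. The variable names here are illustrative and not part of whylogs; the last line shows one way to build a custom date that could then be passed as `baseline_timestamp`.

```python
from datetime import date, timedelta

# Default described above: the baseline uses the current date.
default_baseline = date.today()
# Default described above: inference starts on the following day.
default_inference_start = default_baseline + timedelta(days=1)

# Illustrative custom value (an assumption, not a library default):
# a baseline dated one week ago.
custom_baseline = default_baseline - timedelta(days=7)
```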

The `inference_interval` defines the interval for each batch: '1d' means that we will have daily batches, while '7d' would mean weekly batches.

To set the timestamps to the original dataset's date, set `original` to `True`, as shown below:

In [10]:
# Currently, the inference interval takes a str in the format "Xd", where X is an integer between 1-30
dataset.set_parameters(inference_interval="1d", original=True)
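To make the interval format concrete, here is a small sketch of how an "Xd" string could be checked before passing it to `set_parameters`. `parse_interval_days` is a hypothetical helper written for this example, not a whylogs function; it only encodes the 1-30 day range mentioned in the comment above.

```python
import re


def parse_interval_days(interval: str) -> int:
    """Return the number of days in an "Xd" interval string (hypothetical helper)."""
    match = re.fullmatch(r"(\d+)d", interval)
    if match is None:
        raise ValueError(f"expected a string like '7d', got {interval!r}")
    days = int(match.group(1))
    # Range taken from the comment in the cell above: X is an integer between 1 and 30.
    if not 1 <= days <= 30:
        raise ValueError(f"interval must be between 1 and 30 days, got {days}")
    return days
```

Validating the string up front gives a clear error message instead of relying on how the library reacts to an unsupported value.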

You can set the timestamps with `baseline_timestamp` and `inference_start_timestamp`, along with the inference interval, as shown below:

In [11]:
from datetime import date
today = date.today()
dataset.set_parameters(baseline_timestamp=today, inference_start_timestamp=today, inference_interval="1d")

Note that if `original` and a timestamp (baseline or inference) are passed together, the given timestamp is overwritten by the original dataset timestamp.

## Getting Inference Data #1 - By Date

You can get inference data in two different ways. The first is to specify the exact date you want, which will return a single batch:

In [12]:
batch = dataset.get_inference_data(target_date=today)

You can access the attributes just as shown before:

In [13]:
batch.timestamp

datetime.date(2022, 8, 4)

In [14]:
batch.prediction.head()

| date (index) | prediction_temperature | uncertainty |
|---|---|---|
| 2022-08-04 | 9.163181 | 1.744749 |
| 2022-08-04 | 26.220221 | 3.051431 |
| 2022-08-04 | 13.178478 | 5.418665 |
| 2022-08-04 | 23.255124 | 2.586641 |
| 2022-08-04 | 27.851674 | 6.792959 |


## Getting Inference Data #2 - By Number of Batches

The second way is to specify the number of batches you want and also the date for the first batch.

You can then iterate over the returned object to get the batches and use each batch any way you want. Here's an example that retrieves daily batches for a period of 5 days and logs each one with __whylogs__:

In [15]:
import whylogs as why
batches = dataset.get_inference_data(number_batches=5)

for batch in batches:
    print(f"logging batch of size {len(batch.data)} for {batch.timestamp}")
    why.log(batch.data)

logging batch of size 216 for 2022-08-04
logging batch of size 242 for 2022-08-05
logging batch of size 231 for 2022-08-06
logging batch of size 225 for 2022-08-07
logging batch of size 222 for 2022-08-08
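The dates in the output above follow from the start date and the inference interval. The sketch below reproduces that arithmetic with the standard library only. `batch_dates` is a hypothetical helper for illustration, and it assumes each batch is stamped at the start of its interval, so weekly batches would begin 7 days apart.

```python
from datetime import date, timedelta
from typing import List


def batch_dates(start: date, interval_days: int, number_batches: int) -> List[date]:
    """Hypothetical helper: dates at which consecutive batches would start."""
    return [start + timedelta(days=i * interval_days) for i in range(number_batches)]


# Five daily batches starting on the date used in this example.
daily = batch_dates(date(2022, 8, 4), interval_days=1, number_batches=5)
# Three weekly batches from the same start date (assumed spacing for a "7d" interval).
weekly = batch_dates(date(2022, 8, 4), interval_days=7, number_batches=3)
```

The `daily` list runs from 2022-08-04 to 2022-08-08, matching the timestamps logged above.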
