<a href="https://colab.research.google.com/github/whorseman/Assignments/blob/main/learning_portfolio_time_series_to_transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Time Series Datasets

This notebook shows how to create a time series dataset from a CSV file so that it can be shared on the [🤗 Hub](https://huggingface.co/docs/datasets/index). We will use the GluonTS library to read the CSV into the appropriate format. We start by installing the libraries:

B: Here the notebook is used to prepare a CSV of BTC prices for 2022 for prediction with Transformers.

In [1]:
! pip install -q datasets gluonts orjson


GluonTS comes with a pandas DataFrame-based dataset, so our strategy will be to read the CSV file and process it as a `PandasDataset`. We will then iterate over it and convert it to a 🤗 dataset with the appropriate schema for time series. So let's get started!

## `PandasDataset`

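`PandasDataset` can handle multiple time series stacked on top of each other in a dataframe, with an `item_id` column that distinguishes the different series. Here we have a single series, the daily BTC-USD prices, so we add a constant `item_id` column after reading the CSV: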

In [15]:
import pandas as pd
from google.colab import drive
from google.colab import data_table
import matplotlib.pyplot as plt
data_table.enable_dataframe_formatter()

drive.mount('/content/drive')


df = pd.read_csv("/content/drive/MyDrive/DIGO/BTC-USD-2.csv", index_col=0, parse_dates=True)
df['item_id'] = 'BTC'
df.head()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


| Date | Open | High | Low | Close | Adj Close | Volume | item_id |
|---|---|---|---|---|---|---|---|
| 2022-01-01 | 46311.746094 | 47827.3125 | 46288.484375 | 47686.8125 | 47686.8125 | 24582667004 | BTC |
| 2022-01-02 | 47680.925781 | 47881.40625 | 46856.9375 | 47345.21875 | 47345.21875 | 27951569547 | BTC |
| 2022-01-03 | 47343.542969 | 47510.726563 | 45835.964844 | 46458.117188 | 46458.117188 | 33071628362 | BTC |
| 2022-01-04 | 46458.851563 | 47406.546875 | 45752.464844 | 45897.574219 | 45897.574219 | 42494677905 | BTC |
| 2022-01-05 | 45899.359375 | 46929.046875 | 42798.222656 | 43569.003906 | 43569.003906 | 36851084859 | BTC |
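When more than one series is available, the long format described above can be built by tagging each dataframe with its own `item_id` and stacking them. The following is a minimal sketch under stated assumptions: the two frames and their prices are made up for illustration (they are not real market data), and `ETH` is a hypothetical second series that does not appear in this notebook.

```python
import pandas as pd

# Three consecutive days shared by both toy series.
dates = pd.date_range("2022-01-01", periods=3, freq="D")

# Hypothetical values for illustration only; not real market data.
btc = pd.DataFrame({"Open": [100.0, 101.0, 102.0]}, index=dates)
eth = pd.DataFrame({"Open": [10.0, 11.0, 12.0]}, index=dates)

# Tag every row with the series it belongs to.
btc["item_id"] = "BTC"
eth["item_id"] = "ETH"

# Stack the frames on top of each other: this is the "long" layout.
long_df = pd.concat([btc, eth])

# Number of observations per series, useful as a quick sanity check.
series_lengths = long_df.groupby("item_id").size()
print(series_lengths)
```

A frame shaped like `long_df` could then be passed to `PandasDataset.from_long_dataframe` with `item_id="item_id"`, in the same way as the single-series frame.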


After reading it into a `pd.DataFrame`, we can convert it into GluonTS's `PandasDataset`:

In [16]:
from gluonts.dataset.pandas import PandasDataset

ds = PandasDataset.from_long_dataframe(df, target="Open", item_id="item_id")
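Since the conversion relies on the dataframe's date index, it can be worth checking that the index is a `DatetimeIndex` with no missing days before building the dataset. The sketch below uses pandas only; `find_missing_days` is a hypothetical helper written for this illustration (it is not part of GluonTS), and the toy frame stands in for the real CSV.

```python
import pandas as pd


def find_missing_days(frame: pd.DataFrame) -> list:
    """Return the daily timestamps absent from the frame's DatetimeIndex."""
    if not isinstance(frame.index, pd.DatetimeIndex):
        raise TypeError(
            "index must be a DatetimeIndex; try parse_dates=True in read_csv"
        )
    # Every day between the first and last observation.
    expected = pd.date_range(frame.index.min(), frame.index.max(), freq="D")
    return list(expected.difference(frame.index))


# Toy frame with a deliberate gap on 2022-01-03 (made-up values).
toy = pd.DataFrame(
    {"Open": [1.0, 2.0, 3.0]},
    index=pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-04"]),
)

missing = find_missing_days(toy)
print(missing)
```

An empty list means the daily index has no gaps; any timestamps returned would need to be filled or otherwise handled before modelling.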


## 🤗 Datasets

From here we have to map each entry's `start` field from a `pd.Period` to a timestamp. We do this by defining the following class:

In [17]:
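class ProcessStartField:
    # Running counter used as the categorical id of each series.
    ts_id = 0

    def __call__(self, data):
        # Convert the pd.Period start into a pd.Timestamp.
        data["start"] = data["start"].to_timestamp()
        # Tag the series with its integer id, then advance the counter.
        data["feat_static_cat"] = [self.ts_id]
        self.ts_id += 1

        return data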

In [18]:
from gluonts.itertools import Map

process_start = ProcessStartField()

list_ds = list(Map(process_start, ds))
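To see what this mapping does to each entry, here is a small pandas-only sketch. It assumes that entries are dictionaries with a `pd.Period` under `start`; the two entries below are hand-built stand-ins with made-up values, not real `PandasDataset` output, and the class is restated (with the counter kept on the instance) so the block runs on its own.

```python
import pandas as pd


class ProcessStartField:
    """Restated here so that this sketch is self-contained."""

    def __init__(self):
        self.ts_id = 0

    def __call__(self, data):
        data["start"] = data["start"].to_timestamp()
        data["feat_static_cat"] = [self.ts_id]
        self.ts_id += 1
        return data


# Hand-built entries mimicking the assumed shape of the dataset entries.
entries = [
    {"start": pd.Period("2022-01-01", freq="D"), "target": [1.0, 2.0], "item_id": "A"},
    {"start": pd.Period("2022-01-01", freq="D"), "target": [3.0, 4.0], "item_id": "B"},
]

process = ProcessStartField()
processed = [process(entry) for entry in entries]

for entry in processed:
    print(entry["item_id"], type(entry["start"]).__name__, entry["feat_static_cat"])
```

Each entry ends up with a `Timestamp` start and a one-element `feat_static_cat` list holding its position in the iteration order.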

Next we need to define our schema features and create our dataset from this list via the `from_list` function:

In [19]:
from datasets import Dataset, Features, Value, Sequence

features = Features(
    {
        "start": Value("timestamp[s]"),
        "target": Sequence(Value("float32")),
        "feat_static_cat": Sequence(Value("uint64")),
        # "feat_static_real":  Sequence(Value("float32")),
        # "feat_dynamic_real": Sequence(Sequence(Value("float32"))),
        # "feat_dynamic_cat": Sequence(Sequence(Value("uint64"))),
        "item_id": Value("string"),
    }
)

In [20]:
dataset = Dataset.from_list(list_ds, features=features)

In [22]:
dataset['target']

[[46311.74609375,
  47680.92578125,
  47343.54296875,
  46458.8515625,
  45899.359375,
  43565.51171875,
  43153.5703125,
  41561.46484375,
  41734.7265625,
  41910.23046875,
  41819.5078125,
  42742.1796875,
  43946.7421875,
  42598.87109375,
  43101.8984375,
  43172.0390625,
  43118.12109375,
  42250.07421875,
  42374.0390625,
  41744.02734375,
  40699.60546875,
  36471.58984375,
  35047.359375,
  36275.734375,
  36654.8046875,
  36950.515625,
  36841.87890625,
  37128.4453125,
  37780.71484375,
  38151.91796875,
  37920.28125,
  38481.765625,
  38743.71484375,
  36944.8046875,
  37149.265625,
  41501.48046875,
  41441.12109375,
  42406.78125,
  43854.65234375,
  44096.703125,
  44347.80078125,
  43571.12890625,
  42412.30078125,
  42236.56640625,
  42157.3984375,
  42586.46484375,
  44578.27734375,
  43937.0703125,
  40552.1328125,
  40026.0234375,
  40118.1015625,
  38423.2109375,
  37068.76953125,
  38285.28125,
  37278.56640625,
  38333.74609375,
  39213.08203125,
  39098.6992187

We can thus use this strategy to [share](https://huggingface.co/docs/datasets/share) the dataset on the Hub.
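As a rough sketch of that last step, a 🤗 `Dataset` can be uploaded with its `push_to_hub` method. The wrapper `share_dataset` and the repository id below are hypothetical names chosen for illustration, and the sketch assumes you have already authenticated (for example with `huggingface-cli login`).

```python
def share_dataset(dataset, repo_id, private=True):
    """Upload `dataset` to the Hub under `repo_id`.

    Assumes `dataset` is a 🤗 `Dataset` and that you are already logged in.
    """
    dataset.push_to_hub(repo_id, private=private)


# Example call, left commented out because it needs network access and
# credentials; "your-username/btc-usd-2022" is a placeholder repository id.
# share_dataset(dataset, "your-username/btc-usd-2022")
```

Keeping `private=True` as the default in this sketch means the repository is not public until you decide to change its visibility.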