# Saving the data in a TXT file with the correct structure for GAMCR

To use the GAMCR package, each site needs its own folder with the following structure:

- the folder is named after the site (`site`)
- the folder contains the file `data_{site}.txt`
- GAMCR saves the models trained for that site in this folder
- GAMCR creates and uses two subfolders in this folder:
    * `data`, which stores the preprocessed data when a `save_batch`-type method is called
    * `results`, which stores statistics on the results of a trained model when the `compute_statistics` method is called
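
As a minimal sketch of the layout described above, the helper below builds the expected paths for one site. The function name `describe_site_folder` is hypothetical and not part of GAMCR; the subfolder names `data` and `results` come from the description above, and `44` is simply the site identifier used in the code cells of this notebook.

```python
from pathlib import Path


def describe_site_folder(root, site):
    """Return the paths that the text above describes for one site.

    Hypothetical helper for illustration only; it does not create anything on disk.
    """
    site_dir = Path(root) / str(site)
    return {
        "site_dir": site_dir,                          # folder named after the site
        "data_file": site_dir / f"data_{site}.txt",    # input file read by GAMCR
        "data_subfolder": site_dir / "data",           # preprocessed batches
        "results_subfolder": site_dir / "results",     # statistics of trained models
    }


layout = describe_site_folder(".", 44)
for name, path in layout.items():
    print(f"{name}: {path}")
```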
    
This notebook will create the folder `site` and the txt file `data_{site}.txt` in it. This text file needs to have the following columns:
- `q`: streamflow time series
- `p`: precipitation time series
- `timeyear`: fractional year (e.g. 2022.5 for 2nd July 2022)
- `date`: timestamp of each observation (Python datetime object)
- `pet`: potential evapotranspiration
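
Before handing the file to GAMCR, it can help to check that the required columns are present. The sketch below is one possible check and is not part of GAMCR: `check_site_dataframe` and `REQUIRED_COLUMNS` are hypothetical names, and the two sample rows contain illustrative values only. It assumes the file is read back with `parse_dates=["date"]` so that `date` becomes a datetime column.

```python
import io

import pandas as pd

# Columns listed above as required in data_{site}.txt
REQUIRED_COLUMNS = ["q", "p", "timeyear", "date", "pet"]


def check_site_dataframe(df):
    """Raise an error if a required column is missing or `date` is not datetime."""
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    if not pd.api.types.is_datetime64_any_dtype(df["date"]):
        raise TypeError("Column 'date' must be parsed as datetime")
    return True


# Illustrative in-memory file standing in for data_{site}.txt
sample = io.StringIO(
    "q,p,timeyear,date,pet\n"
    "0.05,0.0,2014.000114,2014-01-01 00:00:00,0.58\n"
    "0.05,0.2,2014.000228,2014-01-01 01:00:00,0.58\n"
)
df_sample = pd.read_csv(sample, parse_dates=["date"])
is_valid = check_site_dataframe(df_sample)
```

Extra columns (for example temperature columns) do not make this check fail; it only verifies that the required ones exist.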

/!\ Note that we filter out some dates to speed up computation in this tutorial: only rows with `timeyear` greater than 2014 are kept. Set the global variable `FILTERING_DATE` to `False` to use the complete dataset.

In [1]:
import numpy as np
import pandas as pd
import os

from data_and_visualization.get_feat_space import get_feat_space

FILTERING_DATE = True

In [None]:
all_GISID = [44]
all_GISID = np.array([str(el) for el in all_GISID])

file_catchprop = os.path.join('..', 'data', 'CH_Catchments_Geodata_MF_20221209.csv')

feat_space, all_GISID, df_features = get_feat_space(file_catchprop, all_GISID=all_GISID, get_df=True, normalize=False)
# Only df_features is used below (for the catchment area); feat_space itself is not needed here


In [3]:
for GISID in all_GISID:
    pathdata = os.path.join('..', 'data', 'real_data', 'GISID2hourly_data_withPET', f'{GISID}.csv')
    df_preprocessed_data = pd.read_csv(pathdata, sep=',')
    df_preprocessed_data = df_preprocessed_data.rename(columns={"discharge": "q", "precip": "p", "t": "timeyear", "datetime": "date"})
    
    # Conversion of discharge data to mm/h
    df_preprocessed_data['q'] = df_preprocessed_data['q'] * 3600 * 1000 / (df_features.loc[GISID, 'EZG '] * 1_000_000)  

    # Remove noise from the precipitation time series (values <= 0.1 are set to 0)
    df_preprocessed_data.loc[df_preprocessed_data['p'] <= 0.1, 'p'] = 0

    # Fill all NaN values with 0 (p, q and pet). Open question (RM): interpolate instead?
    df_preprocessed_data = df_preprocessed_data.fillna(0)
    
    # Filter out early dates to speed up computation for the tutorial
    if FILTERING_DATE:
        df_preprocessed_data = df_preprocessed_data.loc[df_preprocessed_data['timeyear'] > 2014]
    
    df_preprocessed_data.reset_index(inplace=True, drop=True)

    # Create the site directory (if it does not exist yet) and save the file
    directory = f'./{GISID}/'
    os.makedirs(directory, exist_ok=True)

    # Open question (RM): the saved file still contains tmin, tmax and tabs
    df_preprocessed_data.to_csv(os.path.join(directory, f'data_{GISID}.txt'), index=False)
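
The discharge conversion in the cell above can be checked with a small worked example. The sketch assumes that the raw discharge is in m^3/s and that the catchment area (the `'EZG '` column) is in km^2, which is what the factors `3600`, `1000` and `1_000_000` suggest; the helper name `discharge_m3s_to_mmh` is hypothetical.

```python
def discharge_m3s_to_mmh(q_m3s, area_km2):
    """Convert discharge in m^3/s to area-normalized runoff in mm/h.

    Assumes q is in m^3/s and the catchment area is in km^2.
    """
    seconds_per_hour = 3600      # m^3/s -> m^3/h
    mm_per_m = 1000              # m -> mm
    m2_per_km2 = 1_000_000       # km^2 -> m^2
    return q_m3s * seconds_per_hour * mm_per_m / (area_km2 * m2_per_km2)


# 10 m^3/s over a 100 km^2 catchment gives 0.36 mm/h
runoff = discharge_m3s_to_mmh(10.0, 100.0)
print(f"{runoff:.2f} mm/h")
```

Dividing by the catchment area makes streamflow directly comparable with precipitation and potential evapotranspiration, which are expressed as depths per time step.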

In [5]:
df_preprocessed_data

Unnamed: 0,q,p,timeyear,date,tmin,tmax,tabs,pet
0,0.048807,0.0,2014.000114,2014-01-01 00:00:00,-4.756859,2.940631,-1.325035,0.582874
1,0.047939,0.0,2014.000228,2014-01-01 01:00:00,-4.756859,2.940631,-1.325035,0.582874
2,0.047121,0.0,2014.000342,2014-01-01 02:00:00,-4.756859,2.940631,-1.325035,0.582874
3,0.046385,0.0,2014.000457,2014-01-01 03:00:00,-4.756859,2.940631,-1.325035,0.582874
4,0.046988,0.0,2014.000571,2014-01-01 04:00:00,-4.756859,2.940631,-1.325035,0.582874
...,...,...,...,...,...,...,...,...
57692,0.037942,0.0,2020.581626,2020-07-31 20:00:00,15.530965,26.443121,21.434227,4.725457
57693,0.037513,0.0,2020.581740,2020-07-31 21:00:00,15.530965,26.443121,21.434227,4.725457
57694,0.037227,0.0,2020.581853,2020-07-31 22:00:00,15.530965,26.443121,21.434227,4.725457
57695,0.037104,0.0,2020.581967,2020-07-31 23:00:00,15.530965,26.443121,21.434227,4.725457
