# Tutorial : data analysis tool for reader creation 

This tutorial guides you through the analysis of your raw dataset. It provides functions that display or check the way you import your datasets. At the end of the tutorial, you will have all the parameters and elements needed to define a new reader.

This tutorial is based on a lightweight data sample.

This tutorial is divided into 3 parts :

* Step 1 : where we introduce the sample data and its structure.
* Step 2 : where we dig into the data to set up the transformation parameters.
* Step 3 : where we create the reader.


Some of the following cells should not be modified, others must be adapted to your data. 

## Tutorial step 1: Get the data and its structure

You will find the sample data for this tutorial in the folder `data` of the GitHub repository. It corresponds to one measurement campaign composed of two stations (`ID_station_1` and `ID_station_2`) over two days.

```
📁 data/
  📁 DISDRODB/
  ├── 📁 Raw/
      ├── 📁 INSTITUTION_or_COUNTRY/
          ├── 📁 CAMPAIGN/
              ├── 📁 data
                  ├── 📁 ID_station_1/
                      ├── 📜 file60_20180817.dat.gz
                      ├── 📜 file60_20180818.dat.gz
                  ├── 📁 ID_station_2/
                      ├── 📜 file61_20180817.dat.gz
                      ├── 📜 file61_20180818.dat.gz
              ├── 📁 info
              ├── 📁 issue
                  ├── 📜 ID_station_1.yml
                  ├── 📜 ID_station_2.yml
              ├── 📁 metadata
                  ├── 📜 ID_station_1.yml
                  ├── 📜 ID_station_2.yml
```

This structure fulfills the requirements described todo.
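
Before going further, it can help to confirm that a campaign folder really has this shape. The following is a minimal sketch using only the Python standard library. `summarize_campaign_dir` and `EXPECTED_SUBFOLDERS` are hypothetical names introduced here, not part of `disdrodb`, and the demo builds an empty throw-away copy of the layout instead of touching the real sample data.

```python
import tempfile
from pathlib import Path

# Sub-folders expected inside a campaign directory (read off the tree above).
EXPECTED_SUBFOLDERS = ("data", "info", "issue", "metadata")


def summarize_campaign_dir(campaign_dir):
    """Return the missing sub-folders and the raw file names found per station."""
    campaign_dir = Path(campaign_dir)
    missing = [name for name in EXPECTED_SUBFOLDERS if not (campaign_dir / name).is_dir()]
    files_per_station = {}
    data_dir = campaign_dir / "data"
    if data_dir.is_dir():
        for station_dir in sorted(p for p in data_dir.iterdir() if p.is_dir()):
            files_per_station[station_dir.name] = sorted(
                f.name for f in station_dir.iterdir() if f.is_file()
            )
    return missing, files_per_station


# Demo on a temporary copy of the layout (empty placeholder files only).
with tempfile.TemporaryDirectory() as tmp:
    campaign = Path(tmp, "DISDRODB", "Raw", "INSTITUTION_or_COUNTRY", "CAMPAIGN")
    for name in ("info", "issue", "metadata"):
        (campaign / name).mkdir(parents=True)
    for station, prefix in (("ID_station_1", "file60"), ("ID_station_2", "file61")):
        station_dir = campaign / "data" / station
        station_dir.mkdir(parents=True)
        for day in ("20180817", "20180818"):
            (station_dir / f"{prefix}_{day}.dat.gz").touch()

    missing, files_per_station = summarize_campaign_dir(campaign)
    print("Missing sub-folders:", missing)
    for station, files in files_per_station.items():
        print(station, files)
```

An empty list of missing sub-folders and two files per station indicate that the layout matches the tree above.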


## Tutorial step 2: Read and analyse the data

Once the data folders are correctly set up, we can now start analysing our dataset. 

The objective of step 2 is to define the raw reading and writing specifications. At the end, you should be able to generate Apache Parquet files from your raw input data.

We load modules and packages. *Nothing must be changed here*. 

In [1]:
import os
import sys
import logging
import pandas as pd

# Add project root folder into sys path
root_path = os.path.dirname(os.path.dirname(os.path.dirname(os.getcwd())))
sys.path.insert(0,root_path)


# Directory
from disdrodb.L0.io import (
    check_directories,
    get_campaign_name,
    create_directory_structure,
)


# Tools to develop the parser
from disdrodb.L0.template_tools import (
    check_column_names,
    infer_df_str_column_names,
    print_df_first_n_rows,
    print_df_random_n_rows,
    print_df_column_names,
    print_valid_L0_column_names,
    get_df_columns_unique_values_dict,
    print_df_columns_unique_values,
    print_df_summary_stats,
)

# L0A processing
from disdrodb.L0.L0A_processing import (
    read_raw_data,
    get_file_list,
    read_L0A_raw_file_list,
    cast_column_dtypes,
    write_df_to_parquet,  # TODO: add code to write to parquet a single file in 8.3 ... to check it works
)

# Metadata
from disdrodb.L0.metadata import read_metadata

# Standards
from disdrodb.L0.check_standards import check_sensor_name, check_L0A_column_names

# Logger
from disdrodb.utils.logger import create_logger



**1. Define paths and running parameters**

In the following section, define the raw and processed folder paths. *This may be changed if you are using another folder*


In [2]:
raw_dir = os.path.join(root_path,"data","DISDRODB","Raw","INSTITUTION_or_COUNTRY","CAMPAIGN")  # Must end with campaign_name upper case
processed_dir = os.path.join(root_path,"data","DISDRODB","Processed","INSTITUTION_or_COUNTRY","CAMPAIGN") # Must end with campaign_name upper case

assert os.path.exists(raw_dir), "Raw directory does not exist"


If the paths have been defined correctly, no error should be raised.

The running parameters can be defined here :

In [3]:
force = True
lazy = True
verbose = True
debugging_mode = True
sensor_name = "Parsivel"

Once the new reader is created, these parameters are passed as arguments of the command that runs it. Please have a look [here](https://disdrodb.readthedocs.io/en/latest/readers.html#runing-a-reader) for a full description.

**2. Initialization**

We run some initial checks and retrieve some variables. *Nothing must be changed here.*

In [4]:
# Initial directory checks
raw_dir, processed_dir = check_directories(raw_dir, processed_dir, force=force)


# Retrieve campaign name
campaign_name = get_campaign_name(raw_dir)

# -------------------------------------------------------------------------.
# Define logging settings
create_logger(processed_dir, "parser_" + campaign_name)

# Retrieve logger
logger = logging.getLogger(campaign_name)
logger.info("### Script start ###")

# -------------------------------------------------------------------------.
# Create directory structure
create_directory_structure(raw_dir, processed_dir)

# -------------------------------------------------------------------------.
# List stations
list_stations_id = os.listdir(os.path.join(raw_dir, "data"))

**3. Selection of the station**

By default, only the first station is read. *Feel free to change the station index. The current example has only two stations, so you can choose between 0 and 1.*

In [5]:
station_id = list_stations_id[0]

**4. Get the list of files to process**

We now list all the files in the folder of the selected station.

In [6]:
glob_pattern = os.path.join("data", station_id, "*.dat*")

file_list = get_file_list(
    raw_dir=raw_dir,
    glob_pattern=glob_pattern,
    verbose=verbose,
    debugging_mode=debugging_mode,
)

print(file_list)

 - 2 files to process in C:\projects\disdrodb-fork\data/DISDRODB\Raw/INSTITUTION_or_COUNTRY/CAMPAIGN
['C:\\projects\\disdrodb-fork\\data/DISDRODB\\Raw/INSTITUTION_or_COUNTRY/CAMPAIGN\\data\\ID_station_1\\file60_20180817.dat.gz', 'C:\\projects\\disdrodb-fork\\data/DISDRODB\\Raw/INSTITUTION_or_COUNTRY/CAMPAIGN\\data\\ID_station_1\\file60_20180818.dat.gz']


Note that the `glob_pattern` variable depends on the file extensions of your dataset. The current dataset contains only gzip-compressed `*.dat.gz` files, which the `*.dat*` pattern matches.

> 🚨 The `glob_pattern` variable definition will be transferred  to the reader at the end of this tutorial. 
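
To see why the trailing `*` in `*.dat*` matters, here is a small standard-library sketch. It applies two candidate patterns to a list of file names with `fnmatch`, which implements the same shell-style wildcard rules that `glob` uses for file names. The two extra file names marked as hypothetical are not part of the sample data.

```python
from fnmatch import fnmatch

file_names = [
    "file60_20180817.dat.gz",
    "file60_20180818.dat.gz",
    "notes.txt",            # hypothetical extra file
    "file60_20180817.csv",  # hypothetical extra file
]


def match(pattern, names):
    """Return the names selected by a shell-style wildcard pattern."""
    return [name for name in names if fnmatch(name, pattern)]


print(match("*.dat*", file_names))  # compressed .dat.gz files are selected
print(match("*.dat", file_names))   # nothing ends with ".dat", so nothing matches
```

If your raw files use another extension, test the pattern this way before passing it to `get_file_list`.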


**5. Retrieve metadata from yml files**

We now load the metadata file of the station. 

In [7]:
# Retrieve metadata
attrs = read_metadata(raw_dir=raw_dir, station_id=station_id)

# Retrieve sensor name
sensor_name = attrs["sensor_name"]
check_sensor_name(sensor_name)

If the sensor name is not correctly defined in the metadata, an error is raised.

**5. Load one file into a dataframe**

In the  `reader_kwargs` dictionary, you may set [any arguments](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) that need to be passed for the reading of the raw data to a dataframe via Pandas.

In [8]:
reader_kwargs = {}
# - Define delimiter
reader_kwargs["delimiter"] = ","

# - Avoid first column to become df index !!!
reader_kwargs["index_col"] = False

# - Define behaviour when encountering bad lines
reader_kwargs["on_bad_lines"] = "skip"

# - Define parser engine
#   - C engine is faster
#   - Python engine is more feature-complete
reader_kwargs["engine"] = "python"

# - Define on-the-fly decompression of on-disk data
#   - Available: gzip, bz2, zip
reader_kwargs["compression"] = "infer"

# - Strings to recognize as NA/NaN and replace with standard NA flags
#   - Already included: ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’,
#                       ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘<NA>’, ‘N/A’,
#                       ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’
reader_kwargs["na_values"] = ["na", "", "error"]

# - Define max size of dask dataframe chunks (if lazy=True)
#   - If None: use a single block for each file
#   - Otherwise: "<max_file_size>MB" by which to cut up larger files
reader_kwargs["blocksize"] = None  # "50MB"

reader_kwargs['header'] = None

filepath = file_list[0]
str_reader_kwargs = reader_kwargs.copy()
str_reader_kwargs["dtype"] = str  # or object


df_str = read_raw_data(
    filepath, column_names=None, reader_kwargs=str_reader_kwargs, lazy=False
)


print(f'Dataframe for the file {os.path.basename(filepath)} :')
display(df_str)


Dataframe for the file file60_20180817.dat.gz :


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,14,15,16,17,18,19,20,21,22,23
0,362511,4612.0301,00847.4977,01-08-2018 12:44:30,,OK,0000.000,0056.49,00,00,...,035,0.06,24.9,0,005.649,000,"-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9...","00.000,00.000,00.000,00.000,00.000,00.000,00.0...","000,000,000,000,000,000,000,000,000,000,000,00...",0
1,362512,4612.0301,00847.4978,01-08-2018 12:45:01,,OK,0000.000,0056.49,00,00,...,035,0.06,24.9,0,005.649,000,"-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9...","00.000,00.000,00.000,00.000,00.000,00.000,00.0...","000,000,000,000,000,000,000,000,000,000,000,00...",0
2,362513,4612.0301,00847.4985,01-08-2018 12:45:30,,OK,0000.000,0056.49,00,00,...,035,0.06,24.9,0,005.649,000,"-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9...","00.000,00.000,00.000,00.000,00.000,00.000,00.0...","000,000,000,000,000,000,000,000,000,000,000,00...",0
3,362514,4612.0305,00847.4990,01-08-2018 12:46:01,,OK,0000.000,0056.49,00,00,...,035,0.05,24.9,0,005.649,000,"-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9...","00.000,00.000,00.000,00.000,00.000,00.000,00.0...","000,000,000,000,000,000,000,000,000,000,000,00...",0
4,362515,4612.0303,00847.4992,01-08-2018 12:46:31,,OK,0000.000,0056.49,00,00,...,034,0.06,24.9,0,005.649,000,"-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9...","00.000,00.000,00.000,00.000,00.000,00.000,00.0...","000,000,000,000,000,000,000,000,000,000,000,00...",0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4736,367249,4612.0313,00847.4956,03-08-2018 04:13:25,,OK,0000.000,0056.71,00,00,...,015,0.06,24.9,0,005.671,000,"-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9...","00.000,00.000,00.000,00.000,00.000,00.000,00.0...","000,000,000,000,000,000,000,000,000,000,000,00...",0
4737,367250,4612.0313,00847.4955,03-08-2018 04:13:56,,OK,0000.000,0056.71,00,00,...,015,0.06,24.9,0,005.671,000,"-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9...","00.000,00.000,00.000,00.000,00.000,00.000,00.0...","000,000,000,000,000,000,000,000,000,000,000,00...",0
4738,367251,4612.0313,00847.4955,03-08-2018 04:14:26,,OK,0000.000,0056.71,00,00,...,015,0.06,24.9,0,005.671,000,"-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9...","00.000,00.000,00.000,00.000,00.000,00.000,00.0...","000,000,000,000,000,000,000,000,000,000,000,00...",0
4739,367252,4612.0313,00847.4954,03-08-2018 04:14:55,,OK,0000.000,0056.71,00,00,...,015,0.06,24.9,0,005.671,000,"-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.9...","00.000,00.000,00.000,00.000,00.000,00.000,00.0...","000,000,000,000,000,000,000,000,000,000,000,00...",0


If the structure of the dataframe looks fine (header, index), we are on the right track !

Depending on the schema of your data, this `reader_kwargs` dictionary may be fairly different from the one above. 

> 🚨 The `reader_kwargs` dictionary will be transferred to the reader at the end of this tutorial. 
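
The effect of a few of these options can be reproduced with the standard library alone. The sketch below is not how `read_raw_data` works internally. It only mimics the comma delimiter, the absence of a header row, on-the-fly gzip decompression, the "keep everything as a string" approach and the extra NA markers. The two records are made up and shortened, and it is an assumption (suggested by the 24-column result above) that the long spectrum fields are quoted in the raw file.

```python
import csv
import gzip
import io

# Made-up, shortened records in the spirit of the sample file (not real data).
raw_text = (
    '362511,4612.0301,01-08-2018 12:44:30,,OK,0000.000,"-9.999,-9.999,",0\n'
    '362512,4612.0301,01-08-2018 12:45:01,,error,0000.000,"-9.999,-9.999,",0\n'
)
compressed = gzip.compress(raw_text.encode("utf-8"))

# Same markers as reader_kwargs["na_values"] above.
NA_VALUES = {"na", "", "error"}


def read_rows_as_str(gz_bytes, delimiter=","):
    """Read gzip-compressed delimited text; values stay strings, NA markers become None."""
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        # No header row is skipped: the first line is already data.
        return [[None if value in NA_VALUES else value for value in row] for row in reader]


rows = read_rows_as_str(compressed)
for row in rows:
    print(len(row), row)
```

Each record yields 8 fields: the quoted field keeps its inner commas, and the empty and `error` entries are turned into missing values.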


**6. Data exploration**

The settings for loading the data are now ready. We can load one file and analyse its content to look for errors or inconsistencies.

Here are some instructions : 

* Do not assign column names to the columns yet
* Do not assign a dtype to the columns yet
* Possibly look at multiple files ;)


We print the content of the first row :

In [9]:
print_df_first_n_rows(df_str, n=0, column_names=False)

 - Column 0 :
      ['362511']
 - Column 1 :
      ['4612.0301']
 - Column 2 :
      ['00847.4977']
 - Column 3 :
      ['01-08-2018 12:44:30']
 - Column 4 :
      [nan]
 - Column 5 :
      ['OK']
 - Column 6 :
      ['0000.000']
 - Column 7 :
      ['0056.49']
 - Column 8 :
      ['00']
 - Column 9 :
      ['00']
 - Column 10 :
      ['-9.999']
 - Column 11 :
      ['9999']
 - Column 12 :
      ['12611']
 - Column 13 :
      ['00000']
 - Column 14 :
      ['035']
 - Column 15 :
      ['0.06']
 - Column 16 :
      ['24.9']
 - Column 17 :
      ['0']
 - Column 18 :
      ['005.649']
 - Column 19 :
      ['000']
 - Column 20 :
      ['-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,-9.999,']
 - Column 21 :
      ['00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.000,00.0

We print the content of the first rows (up to row index 5) :

In [10]:
print_df_first_n_rows(df_str, n=5, column_names=False)

 - Column 0 :
      ['362511' '362512' '362513' '362514' '362515' '362516']
 - Column 1 :
      ['4612.0301' '4612.0301' '4612.0301' '4612.0305' '4612.0303' '4612.0298']
 - Column 2 :
      ['00847.4977' '00847.4978' '00847.4985' '00847.4990' '00847.4992'
 '00847.4991']
 - Column 3 :
      ['01-08-2018 12:44:30' '01-08-2018 12:45:01' '01-08-2018 12:45:30'
 '01-08-2018 12:46:01' '01-08-2018 12:46:31' '01-08-2018 12:47:01']
 - Column 4 :
      [nan nan nan nan nan nan]
 - Column 5 :
      ['OK' 'OK' 'OK' 'OK' 'OK' 'OK']
 - Column 6 :
      ['0000.000' '0000.000' '0000.000' '0000.000' '0000.000' '0000.000']
 - Column 7 :
      ['0056.49' '0056.49' '0056.49' '0056.49' '0056.49' '0056.49']
 - Column 8 :
      ['00' '00' '00' '00' '00' '00']
 - Column 9 :
      ['00' '00' '00' '00' '00' '00']
 - Column 10 :
      ['-9.999' '-9.999' '-9.999' '-9.999' '-9.999' '-9.999']
 - Column 11 :
      ['9999' '9999' '9999' '9999' '9999' '9999']
 - Column 12 :
      ['12611' '12617' '12600' '12603' '12606

Feel free to change the value of `n` to print more or fewer rows.

We print the content of n rows picked at random :

In [11]:
print_df_random_n_rows(df_str, n=5, with_column_names=False) 

- Column 0 : ['365163' '366574' '363031' '364156' '365799']
- Column 1 : ['4612.0310' '4612.0327' '4612.0298' '4612.0291' '4612.0317']
- Column 2 : ['00847.4995' '00847.4942' '00847.4955' '00847.4953' '00847.4946']
- Column 3 : ['02-08-2018 10:50:30' '02-08-2018 22:36:00' '01-08-2018 17:04:31'
 '02-08-2018 02:27:01' '02-08-2018 16:08:30']
- Column 4 : [nan nan nan nan nan]
- Column 5 : ['OK' 'OK' 'OK' 'OK' 'OK']
- Column 6 : ['0000.000' '0000.000' '0000.000' '0000.000' '0000.000']
- Column 7 : ['0056.67' '0056.71' '0056.52' '0056.67' '0056.71']
- Column 8 : ['00' '00' '00' '00' '00']
- Column 9 : ['00' '00' '00' '00' '00']
- Column 10 : ['-9.999' '-9.999' '-9.999' '-9.999' '-9.999']
- Column 11 : ['9999' '9999' '9999' '9999' '9999']
- Column 12 : ['12594' '12191' '12465' '12574' '12607']
- Column 13 : ['00000' '00000' '00000' '00000' '00000']
- Column 14 : ['036' '017' '019' '017' '024']
- Column 15 : ['0.06' '0.06' '0.06' '0.06' '0.05']
- Column 16 : ['24.9' '24.9' '24.9' '24.9' '24.9

Here again, feel free to change the number of printed rows. 

Get the number of columns :

In [12]:
len(df_str.columns)

24

Look at unique values :

In [13]:
print_df_columns_unique_values(df_str, column_indices=None, column_names=False)

 - Column 0 :
      ['362511', '362512', '362513', '362514', '362515', '362516', '362517', '362518', '362519', '362520', '362521', '362522', '362523', '362524', '362525', '362526', '362527', '362528', '362529', '362530', '362531', '362532', '362533', '362534', '362535', '362536', '362537', '362538', '362539', '362540', '362541', '362542', '362543', '362544', '362545', '362546', '362547', '362548', '362549', '362550', '362551', '362552', '362553', '362554', '362555', '362556', '362557', '362558', '362559', '362560', '362561', '362562', '362563', '362564', '362565', '362566', '362567', '362568', '362569', '362570', '362571', '362572', '362573', '362574', '362575', '362576', '362577', '362578', '362579', '362580', '362581', '362582', '362583', '362584', '362585', '362586', '362587', '362588', '362589', '362590', '362591', '362592', '362593', '362594', '362595', '362596', '362597', '362598', '362599', '362600', '362601', '362602', '362603', '362604', '362605', '362606', '362607', '362608',

Look at unique values for a single column :

In [14]:
print_df_columns_unique_values(df_str, column_indices=11, column_names=False)

 - Column 11 :
      ['0824', '0906', '1363', '1397', '2921', '3203', '3326', '3816', '4465', '9999']


Look at unique values for a slice of columns :

In [15]:
print_df_columns_unique_values(df_str, column_indices=slice(10, 12), column_names=False)

 - Column 10 :
      ['-9.999', '02.669', '04.241', '04.745', '04.826', '04.879', '05.430', '06.095', '06.220', '07.415', '08.436', '08.489', '08.506', '08.724', '08.956', '09.079', '09.894', '10.057', '10.567', '11.705', '12.097', '12.390', '12.923', '13.114', '13.407', '13.684', '14.324', '15.060', '16.530', '16.636', '16.668', '17.194', '17.382', '17.829', '17.918', '18.334', '18.655', '19.526', '20.329', '21.134', '21.426', '23.098', '23.664', '23.760', '24.472', '25.473', '25.957', '29.270', '31.271', '32.255', '33.844', '36.196']
 - Column 11 :
      ['0824', '0906', '1363', '1397', '2921', '3203', '3326', '3816', '4465', '9999']


Get the unique values as a dictionary :

In [16]:
get_df_columns_unique_values_dict(df_str, column_indices=slice(10, 12), column_names=False)

{'Column 10': ['-9.999',
  '02.669',
  '04.241',
  '04.745',
  '04.826',
  '04.879',
  '05.430',
  '06.095',
  '06.220',
  '07.415',
  '08.436',
  '08.489',
  '08.506',
  '08.724',
  '08.956',
  '09.079',
  '09.894',
  '10.057',
  '10.567',
  '11.705',
  '12.097',
  '12.390',
  '12.923',
  '13.114',
  '13.407',
  '13.684',
  '14.324',
  '15.060',
  '16.530',
  '16.636',
  '16.668',
  '17.194',
  '17.382',
  '17.829',
  '17.918',
  '18.334',
  '18.655',
  '19.526',
  '20.329',
  '21.134',
  '21.426',
  '23.098',
  '23.664',
  '23.760',
  '24.472',
  '25.473',
  '25.957',
  '29.270',
  '31.271',
  '32.255',
  '33.844',
  '36.196'],
 'Column 11': ['0824',
  '0906',
  '1363',
  '1397',
  '2921',
  '3203',
  '3326',
  '3816',
  '4465',
  '9999']}
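
What these helpers report can be sketched with plain Python. `unique_values_by_column` below is a hypothetical stand-in written for this illustration, not the `disdrodb` implementation. It accepts `None`, an integer or a slice as `column_indices`, and the four rows reuse a few values from the output above.

```python
def unique_values_by_column(rows, column_indices=None):
    """Map 'Column i' to the sorted unique string values found in column i."""
    n_columns = len(rows[0])
    if column_indices is None:
        indices = range(n_columns)
    elif isinstance(column_indices, slice):
        indices = range(n_columns)[column_indices]
    else:
        indices = [column_indices]
    return {f"Column {i}": sorted({row[i] for row in rows}) for i in indices}


# A few values borrowed from columns 10 and 11 of the output above.
rows = [
    ["-9.999", "9999"],
    ["02.669", "0824"],
    ["-9.999", "9999"],
    ["04.241", "0906"],
]

print(unique_values_by_column(rows))
print(unique_values_by_column(rows, column_indices=1))
```

Listing unique values this way makes repeated entries such as `-9.999` or `9999` stand out, which is useful when deciding how a column should be interpreted.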

**7. Column names**

Now that we have validated the content of our data, it's time to take care of its structure (the column names).

The function `infer_df_str_column_names()` tries to guess the column names based on string patterns (according to `L0A_encodings.yml` and the type of sensor).

In [17]:
infer_df_str_column_names(df_str, sensor_name=sensor_name)

{0: [],
 1: [],
 2: [],
 3: [],
 4: [],
 5: [],
 6: ['rainfall_rate_32bit'],
 7: ['rainfall_accumulated_32bit'],
 8: ['weather_code_synop_4680', 'weather_code_synop_4677'],
 9: ['weather_code_synop_4680', 'weather_code_synop_4677'],
 10: ['reflectivity_32bit'],
 11: ['mor_visibility'],
 12: ['number_particles', 'laser_amplitude'],
 13: ['number_particles', 'laser_amplitude'],
 14: ['sensor_temperature', 'error_code'],
 15: ['sensor_heating_current'],
 16: ['sensor_battery_voltage'],
 17: ['sensor_status'],
 18: ['rainfall_amount_absolute_32bit'],
 19: ['sensor_temperature', 'error_code'],
 20: ['raw_drop_average_velocity', 'raw_drop_concentration'],
 21: ['raw_drop_average_velocity', 'raw_drop_concentration'],
 22: ['raw_drop_number'],
 23: ['sensor_status']}

This can help us later to define the column names.
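
The idea behind pattern-based inference can be sketched as follows. The regular expressions in `CANDIDATE_PATTERNS` are hypothetical and chosen only for this illustration. The real patterns come from `L0A_encodings.yml` and depend on the sensor. A column keeps a candidate name only if every value in it matches the pattern, which also explains why some columns end up with several candidates or none.

```python
import re

# Hypothetical patterns for illustration only (not taken from L0A_encodings.yml).
CANDIDATE_PATTERNS = {
    "rainfall_rate_32bit": r"\d{4}\.\d{3}",
    "rainfall_accumulated_32bit": r"\d{4}\.\d{2}",
    "weather_code_synop_4680": r"\d{2}",
    "weather_code_synop_4677": r"\d{2}",
    "mor_visibility": r"\d{4}",
}


def infer_candidates(values_by_column, patterns=CANDIDATE_PATTERNS):
    """Return, for each column index, the names whose pattern matches all values."""
    return {
        index: [
            name
            for name, pattern in patterns.items()
            if all(re.fullmatch(pattern, value) for value in values)
        ]
        for index, values in values_by_column.items()
    }


# String values borrowed from a few columns of the sample file.
sample = {
    5: ["OK", "OK"],
    6: ["0000.000", "0000.000"],
    7: ["0056.49", "0056.52"],
    8: ["00", "00"],
    11: ["9999", "0824"],
}

candidates = infer_candidates(sample)
for index, names in candidates.items():
    print(index, names)
```

Columns with two candidates, such as column 8 here, cannot be settled by patterns alone. The sensor documentation has to decide.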

As a reference, here is the list of valid column names (taken from `L0A_encodings.yml`):

In [18]:
print_valid_L0_column_names(sensor_name)

['rainfall_rate_32bit', 'rainfall_accumulated_32bit', 'weather_code_synop_4680', 'weather_code_synop_4677', 'weather_code_metar_4678', 'weather_code_nws', 'reflectivity_32bit', 'mor_visibility', 'sample_interval', 'laser_amplitude', 'number_particles', 'sensor_temperature', 'sensor_serial_number', 'firmware_iop', 'firmware_dsp', 'sensor_heating_current', 'sensor_battery_voltage', 'sensor_status', 'start_time', 'sensor_time', 'sensor_date', 'station_name', 'station_number', 'rainfall_amount_absolute_32bit', 'error_code', 'rainfall_rate_16bit', 'rainfall_rate_12bit', 'rainfall_accumulated_16bit', 'reflectivity_16bit', 'raw_drop_concentration', 'raw_drop_average_velocity', 'raw_drop_number']


It's now time to define our column names :

Hints to define the names :
* get information from the disdrometer user guide
* use `infer_df_str_column_names()` to help you
* analyse the content column after column with `print_df_columns_unique_values()`

In [19]:
column_names = [
    "unknown1",
    "unknown2",
    "unknown3",
    "time",
    "unknown4",
    "unknown5",
    "rainfall_rate_32bit",
    "rainfall_accumulated_32bit",
    "weather_code_synop_4680",
    "weather_code_synop_4677",
    "reflectivity_32bit",
    "mor_visibility",
    "laser_amplitude",
    "number_particles",
    "sensor_temperature",
    "sensor_heating_current",
    "sensor_battery_voltage",
    "sensor_status",
    "rainfall_amount_absolute_32bit",
    "error_code",
    "raw_drop_concentration",
    "raw_drop_average_velocity",
    "raw_drop_number",
    "unknown6",
]

> 🚨 The `column_names` list will be transferred  to the reader at the end of this tutorial. 

Check the validity of your definition 

In [20]:
check_column_names(column_names,sensor_name)

The following columns do no met the DISDRODB standards: ['unknown3', 'unknown4', 'unknown1', 'unknown6', 'unknown5', 'unknown2'].
Please remove such columns within the df_sanitizer_fun


Ok, fair enough. As mentioned in the error message, the unknown columns will be removed later.
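
Conceptually, this check is a membership test against the list of valid names. The sketch below uses hypothetical helpers (`invalid_column_names`, `duplicated_column_names`) and only a subset of the valid names printed earlier. Treating `time` as always allowed is an assumption, inferred from the fact that the check above did not report it.

```python
# Subset of the valid names printed above, kept short for the example.
VALID_NAMES = {
    "rainfall_rate_32bit",
    "rainfall_accumulated_32bit",
    "weather_code_synop_4680",
    "raw_drop_number",
}
# Assumption: "time" is accepted although it is not in the printed list.
ALWAYS_ALLOWED = {"time"}


def invalid_column_names(column_names, valid_names=VALID_NAMES):
    """Return the names that are neither valid nor always allowed, in input order."""
    allowed = set(valid_names) | ALWAYS_ALLOWED
    return [name for name in column_names if name not in allowed]


def duplicated_column_names(column_names):
    """Return the names that appear more than once, in order of first repetition."""
    seen, duplicates = set(), []
    for name in column_names:
        if name in seen and name not in duplicates:
            duplicates.append(name)
        seen.add(name)
    return duplicates


names = ["unknown1", "time", "rainfall_rate_32bit", "unknown2", "raw_drop_number"]
print(invalid_column_names(names))
print(duplicated_column_names(names + ["time"]))
```

The duplicate check is an extra that the tutorial does not mention. It is cheap to run, and a repeated name would otherwise be easy to miss in a 24-element list.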

**8. Read the dataframe with the correct column names**

We can now create a new dataframe with the column names :

In [21]:
df_pandas = read_raw_data(filepath=filepath,column_names=column_names,reader_kwargs=reader_kwargs,lazy=False)

And print the dataframe column names : 

In [22]:
print_df_column_names(df_pandas)

 - Column 0 : unknown1
 - Column 1 : unknown2
 - Column 2 : unknown3
 - Column 3 : time
 - Column 4 : unknown4
 - Column 5 : unknown5
 - Column 6 : rainfall_rate_32bit
 - Column 7 : rainfall_accumulated_32bit
 - Column 8 : weather_code_synop_4680
 - Column 9 : weather_code_synop_4677
 - Column 10 : reflectivity_32bit
 - Column 11 : mor_visibility
 - Column 12 : laser_amplitude
 - Column 13 : number_particles
 - Column 14 : sensor_temperature
 - Column 15 : sensor_heating_current
 - Column 16 : sensor_battery_voltage
 - Column 17 : sensor_status
 - Column 18 : rainfall_amount_absolute_32bit
 - Column 19 : error_code
 - Column 20 : raw_drop_concentration
 - Column 21 : raw_drop_average_velocity
 - Column 22 : raw_drop_number
 - Column 23 : unknown6


Check that the lazy loading (Dask) works :

In [23]:
df_dask = read_raw_data(filepath=filepath, column_names=column_names, reader_kwargs=reader_kwargs, lazy=True)
df_dask = df_dask.compute()

And print the dataframe column names : 

In [24]:
print_df_column_names(df_dask)

 - Column 0 : unknown1
 - Column 1 : unknown2
 - Column 2 : unknown3
 - Column 3 : time
 - Column 4 : unknown4
 - Column 5 : unknown5
 - Column 6 : rainfall_rate_32bit
 - Column 7 : rainfall_accumulated_32bit
 - Column 8 : weather_code_synop_4680
 - Column 9 : weather_code_synop_4677
 - Column 10 : reflectivity_32bit
 - Column 11 : mor_visibility
 - Column 12 : laser_amplitude
 - Column 13 : number_particles
 - Column 14 : sensor_temperature
 - Column 15 : sensor_heating_current
 - Column 16 : sensor_battery_voltage
 - Column 17 : sensor_status
 - Column 18 : rainfall_amount_absolute_32bit
 - Column 19 : error_code
 - Column 20 : raw_drop_concentration
 - Column 21 : raw_drop_average_velocity
 - Column 22 : raw_drop_number
 - Column 23 : unknown6


We can now verify that the pandas and the Dask readings are identical :

In [25]:
assert df_pandas.equals(df_dask)

**9. Run some tests and analysis**

This must be done once `reader_kwargs` and `column_names` are correctly defined.

In [26]:
df = df_pandas

Print some summary statistics for a column. We use `rainfall_rate_32bit` here, but feel free to change it :

In [27]:
df_stat = df_pandas.loc[:, ["rainfall_rate_32bit"]]
df_stat = df_stat.astype('float')

print_df_summary_stats(df_stat)

 - Column 0 ( rainfall_rate_32bit ):
                    
mean  0.005426
min   0.000000
25%   0.000000
50%   0.000000
75%   0.000000
max   2.881000


Check if the file is empty : 

In [28]:
if len(df.index) == 0:
    raise ValueError(f"{filepath} is empty and has been skipped.")

Check the number of columns :

In [29]:
if len(df.columns) != len(column_names):
    raise ValueError(f"{filepath} has wrong columns number, and has been skipped.")

**10. Final columns formatting**

It's now time to remove all the columns that do not match the standard.

In [30]:
df = df.drop(columns=["unknown1", "unknown2", "unknown3","unknown4",'unknown5','unknown6'])

Convert the `time` column to datetime. The raw timestamps are day-first (`01-08-2018 12:44:30` is 1 August 2018, and consecutive records are about 30 seconds apart), so the format string starts with `%d-%m` :

In [31]:
df["time"] = pd.to_datetime(df["time"], format="%d-%m-%Y %H:%M:%S")

Check that the column names meet the DISDRODB standards after custom processing :

In [32]:
check_L0A_column_names(df, sensor_name=sensor_name)

Determine dtype based on standards :

In [33]:
df = cast_column_dtypes(df, sensor_name=sensor_name)

Check the dataframe looks as desired :

In [34]:
print_df_column_names(df)

 - Column 0 : time
 - Column 1 : rainfall_rate_32bit
 - Column 2 : rainfall_accumulated_32bit
 - Column 3 : weather_code_synop_4680
 - Column 4 : weather_code_synop_4677
 - Column 5 : reflectivity_32bit
 - Column 6 : mor_visibility
 - Column 7 : laser_amplitude
 - Column 8 : number_particles
 - Column 9 : sensor_temperature
 - Column 10 : sensor_heating_current
 - Column 11 : sensor_battery_voltage
 - Column 12 : sensor_status
 - Column 13 : rainfall_amount_absolute_32bit
 - Column 14 : error_code
 - Column 15 : raw_drop_concentration
 - Column 16 : raw_drop_average_velocity
 - Column 17 : raw_drop_number


In [35]:
print_df_random_n_rows(df, n=5)

- Column 0 (time) : ['2018-08-02T17:40:31.000000000' '2018-08-02T11:31:30.000000000'
 '2018-08-02T12:50:31.000000000' '2018-08-01T14:03:30.000000000'
 '2018-08-02T04:54:01.000000000']
- Column 1 (rainfall_rate_32bit) : [0. 0. 0. 0. 0.]
- Column 2 (rainfall_accumulated_32bit) : [56.71 56.7  56.7  56.52 56.67]
- Column 3 (weather_code_synop_4680) : [0 0 0 0 0]
- Column 4 (weather_code_synop_4677) : [0 0 0 0 0]
- Column 5 (reflectivity_32bit) : [-9.999 -9.999 -9.999 -9.999 -9.999]
- Column 6 (mor_visibility) : [9999 9999 9999 9999 9999]
- Column 7 (laser_amplitude) : [12459 12627 12652 12614 12577]
- Column 8 (number_particles) : [0 0 0 0 0]
- Column 9 (sensor_temperature) : [20 33 32 24 19]
- Column 10 (sensor_heating_current) : [0.06 0.06 0.06 0.06 0.06]
- Column 11 (sensor_battery_voltage) : [24.9 24.9 24.9 24.9 24.9]
- Column 12 (sensor_status) : [0 0 0 0 0]
- Column 13 (rainfall_amount_absolute_32bit) : [5.671 5.67  5.67  5.652 5.667]
- Column 14 (error_code) : [0 0 0 0 0]
- Column 1

In [36]:
print_df_columns_unique_values(df, column_indices=2, column_names=True)

 - Column 2 ( rainfall_accumulated_32bit ):
      [56.4900016784668, 56.52000045776367, 56.529998779296875, 56.54999923706055, 56.56999969482422, 56.58000183105469, 56.59000015258789, 56.599998474121094, 56.61000061035156, 56.619998931884766, 56.630001068115234, 56.63999938964844, 56.650001525878906, 56.65999984741211, 56.66999816894531, 56.68000030517578, 56.70000076293945, 56.709999084472656]


In [37]:
print_df_columns_unique_values(df, column_indices=slice(0, 17), column_names=True)

 - Column 0 ( time ):
      [1533127470000000000, 1533127501000000000, 1533127530000000000, 1533127561000000000, 1533127591000000000, 1533127621000000000, 1533127650000000000, 1533127680000000000, 1533127711000000000, 1533127741000000000, 1533127770000000000, 1533127800000000000, 1533127831000000000, 1533127861000000000, 1533127891000000000, 1533127921000000000, 1533127951000000000, 1533127981000000000, 1533128010000000000, 1533128041000000000, 1533128071000000000, 1533128101000000000, 1533128131000000000, 1533128161000000000, 1533128191000000000, 1533128221000000000, 1533128250000000000, 1533128281000000000, 1533128311000000000, 1533128341000000000, 1533128371000000000, 1533128401000000000, 1533128431000000000, 1533128461000000000, 1533128490000000000, 1533128521000000000, 1533128551000000000, 1533128581000000000, 1533128611000000000, 1533128641000000000, 1533128671000000000, 1533128701000000000, 1533128731000000000, 1533128761000000000, 1533128791000000000, 1533128821000000000, 15331

**11. Define the sanitizer function**

This function will be used in the reader. It contains the list of columns we have to drop, as defined previously. 

In [38]:
def df_sanitizer_fun(df, lazy=False):
    # Import dask or pandas
    if lazy:
        import dask.dataframe as dd
    else:
        import pandas as dd


    # - Drop datalogger columns
    columns_to_drop = ["unknown1", "unknown2", "unknown3","unknown4",'unknown5','unknown6']

    df = df.drop(columns=columns_to_drop)
    
    # - Convert time column to datetime format
    #   (raw timestamps are day-first: "01-08-2018" is 1 August 2018)
    df["time"] = dd.to_datetime(df["time"], format="%d-%m-%Y %H:%M:%S")
    
    return df

> 🚨 The `df_sanitizer_fun()` function will be transferred to the reader at the end of this tutorial. 
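
The format string decides whether `01-08-2018` is read as 1 August or as 8 January, so a quick plausibility check is worthwhile before freezing it in the sanitizer. The sketch below uses only the standard library. It takes the first and last timestamps and the row count of the sample file shown earlier and compares the mean time step implied by each candidate format.

```python
from datetime import datetime, timedelta

first_stamp = "01-08-2018 12:44:30"  # first row of the sample file
last_stamp = "03-08-2018 04:14:55"   # last row of the sample file
n_rows = 4740                         # rows 0 to 4739 in the dataframe shown earlier


def elapsed(fmt):
    """Time between the first and last record under a given format string."""
    return datetime.strptime(last_stamp, fmt) - datetime.strptime(first_stamp, fmt)


for fmt in ("%d-%m-%Y %H:%M:%S", "%m-%d-%Y %H:%M:%S"):
    span = elapsed(fmt)
    mean_step = span / (n_rows - 1)
    print(f"{fmt}: span = {span}, mean step = {mean_step.total_seconds():.1f} s")
```

Consecutive records in the sample are about 30 seconds apart. Only the day-first format gives a mean step of that order, whereas the month-first reading stretches the same rows over almost two months.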

**12. Launch code as in the reader file**

Instructions : 
* Try with increasing number of files
* Try first with lazy=False, then lazy=True


In [39]:
lazy = False 
subset_file_list = file_list[:1]

df = read_L0A_raw_file_list(
    file_list=subset_file_list,
    column_names=column_names,
    reader_kwargs=reader_kwargs,
    sensor_name=sensor_name,
    verbose=verbose,
    df_sanitizer_fun=df_sanitizer_fun,
    lazy=lazy,
)

 -  - 0 of 1 have been skipped.


The function `read_L0A_raw_file_list` takes the following arguments :
* `file_list` : the list of files to read, built from the data folder
* `column_names` : the list of column names, defined previously
* `reader_kwargs` : the dictionary of options used to load the data into the dataframe, defined previously
* `sensor_name` : taken from the metadata file
* `verbose` and `lazy` : the running parameters, defined previously
* `df_sanitizer_fun` : the function that cleans up the dataframe, defined previously

All these arguments are defined either in the data folder, or earlier in the code. It is now time to create the reader !
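
How these pieces fit together can be sketched as a simple loop. The code below is not the `disdrodb` implementation. `process_file_list` is a hypothetical function that works on lists of rows instead of dataframes, and the "files" are in-memory stand-ins. It reads each file, applies the empty-file and column-count checks from the exploration, runs the sanitizer and concatenates the result.

```python
def process_file_list(file_list, column_names, read_fun, sanitizer_fun, verbose=False):
    """Read each file, skip empty or malformed ones, sanitize and concatenate."""
    records, skipped = [], []
    for path in file_list:
        rows = read_fun(path)
        # Same two checks as in the exploration: empty file, wrong column count.
        if not rows or any(len(row) != len(column_names) for row in rows):
            skipped.append(path)
            continue
        named_rows = [dict(zip(column_names, row)) for row in rows]
        records.extend(sanitizer_fun(named_rows))
    if verbose:
        print(f" - {len(skipped)} of {len(file_list)} have been skipped.")
    return records, skipped


# Hypothetical in-memory "files" so the sketch runs without data on disk.
FAKE_FILES = {
    "good.dat": [["362511", "01-08-2018 12:44:30", "0000.000"]],
    "empty.dat": [],
    "short.dat": [["362512", "01-08-2018 12:45:01"]],
}


def drop_unknown(named_rows):
    """Toy sanitizer: drop every column whose name starts with 'unknown'."""
    return [
        {key: value for key, value in row.items() if not key.startswith("unknown")}
        for row in named_rows
    ]


records, skipped = process_file_list(
    file_list=list(FAKE_FILES),
    column_names=["unknown1", "time", "rainfall_rate_32bit"],
    read_fun=FAKE_FILES.get,
    sanitizer_fun=drop_unknown,
    verbose=True,
)
print(records)
```

Passing the reader and the sanitizer as arguments mirrors the way the tutorial keeps `reader_kwargs`, `column_names` and `df_sanitizer_fun` as the only dataset-specific pieces.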

## Tutorial step 3 : Create the reader

We now have all the elements needed to create the new reader. All the modifications made in this notebook must now be transcribed into a reader file.

1. Copy the file `disdrodb\L0\readers\parser_template.py` into the folder `disdrodb\L0\readers\TUTORIAL`

2. Rename the copied file to `parser_TUTORIAL.py`

3. Add the root folder to the Python path : 
   
   Before :

    ``` python

        import click
        from disdrodb.L0 import run_L0

    ```
    After : 

    ``` python
        import os
        import sys

        # Add project root folder into sys path
        root_path = os.path.dirname(os.path.dirname(os.path.dirname(os.getcwd())))
        sys.path.insert(0,root_path)
        import click
        from disdrodb.L0.L0_processing import run_L0
    ```

4. Define the column names : 

   Before :

    ``` python

        column_names = []
    ```
    After : 

    ``` python
        column_names = [
            "unknown1",
            "unknown2",
            "unknown3",
            "time",
            "unknown4",
            "unknown5",
            "rainfall_rate_32bit",
            "rainfall_accumulated_32bit",
            "weather_code_synop_4680",
            "weather_code_synop_4677",
            "reflectivity_32bit",
            "mor_visibility",
            "laser_amplitude",
            "number_particles",
            "sensor_temperature",
            "sensor_heating_current",
            "sensor_battery_voltage",
            "sensor_status",
            "rainfall_amount_absolute_32bit",
            "error_code",
            "raw_drop_concentration",
            "raw_drop_average_velocity",
            "raw_drop_number",
            "unknown6",
        ]
    ```


5. Add the raw data loading parameter :

    Before :

    ``` python

        reader_kwargs["blocksize"] = None # "50MB"

    ```
    After : 

    ``` python
        reader_kwargs["blocksize"] = None # "50MB"
        reader_kwargs['header'] = None
    ```

6. Modify the `df_sanitizer_fun()` function 

   Before :

    ``` python

        def df_sanitizer_fun(df, lazy=False):
            # Import dask or pandas
            if lazy:
                    import dask.dataframe as dd
            else:
                    import pandas as dd

            # - Drop datalogger columns
            columns_to_drop = ['id', 'datalogger_temperature', 'datalogger_voltage', 'datalogger_error']
            df = df.drop(columns=columns_to_drop)

            # - Drop latitude and longitude
            # --> Latitude and longitude is specified in the the metadata.yaml
            df = df.drop(columns=['latitude', 'longitude'])

            # - Convert time column to datetime with resolution in seconds
            df['time'] = dd.to_datetime(df['time'], format='%d-%m-%Y %H:%M:%S')

            return df
    ```
    
    After : 

    ``` python
        def df_sanitizer_fun(df, lazy=False):
            # Import dask or pandas
            if lazy:
                import dask.dataframe as dd
            else:
                import pandas as dd

            # - Drop datalogger columns
            columns_to_drop = ["unknown1", "unknown2", "unknown3","unknown4",'unknown5','unknown6']

            df = df.drop(columns=columns_to_drop)

            # - Convert time column to datetime format
            #   (raw timestamps are day-first: "01-08-2018" is 1 August 2018)
            df["time"] = dd.to_datetime(df["time"], format="%d-%m-%Y %H:%M:%S")

            return df
    ```
 
 
7. Run the script

    From the root folder, just run :

    ``` batch
    python .\disdrodb\L0\readers\TUTORIAL\parser_TUTORIAL.py  <data_folder>\DISDRODB\Raw\INSTITUTION_or_COUNTRY\CAMPAIGN\ <data_folder>\DISDRODB\Processed\INSTITUTION_or_COUNTRY\CAMPAIGN\ -l0b True -f True -v True -d False
    ```

    Adapt the `<data_folder>` parameter to your local data folder.

    Have a look here if you want to customize this command. (link still to update)

 
8. Check that the script has run correctly

    The output folder should look as follows :
    
    ```
    📁 DISDRODB/
    ├── 📁 Processed/
       ├── 📁 INSTITUTION_or_COUNTRY/
          ├── 📁 CAMPAIGN/
              ├── 📁 info
                  ├── 📜 ID_station_1.yml
                  ├── 📜 ID_station_2.yml
              ├── 📁 L0A
                  ├── 📁 ID_station_1/
                     ├── 📜 _sID_station_1.parquet
                  ├── 📁 ID_station_2/
                     ├── 📜 _sID_station_2.parquet
              ├── 📁 L0B
                  ├── 📁 ID_station_1/
                     ├── 📜 _sID_station_1.nc
                  ├── 📁 ID_station_2/
                     ├── 📜 _sID_station_2.nc
              ├── 📁 logs
                  ├── 📜 <date>_LO_parser.log
              ├── 📁 metadata
                  ├── 📜 ID_station_1.yml
                  ├── 📜 ID_station_2.yml

    ```
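
A quick way to compare the result with this tree is to list which stations actually received output files. This is a standard-library sketch under assumptions. `stations_with_files` is a hypothetical helper, and the demo creates placeholder files with made-up names in a temporary folder, where station 2 deliberately lacks an L0B file.

```python
import tempfile
from pathlib import Path


def stations_with_files(processed_dir, product, extension):
    """Return the station folders under <processed_dir>/<product> that hold such files."""
    product_dir = Path(processed_dir) / product
    return sorted({path.parent.name for path in product_dir.glob(f"*/*{extension}")})


with tempfile.TemporaryDirectory() as tmp:
    processed = Path(tmp, "DISDRODB", "Processed", "INSTITUTION_or_COUNTRY", "CAMPAIGN")
    # Placeholder outputs; file names are made up for the demo.
    for product, extension, stations in (
        ("L0A", ".parquet", ("ID_station_1", "ID_station_2")),
        ("L0B", ".nc", ("ID_station_1",)),  # station 2 deliberately left out
    ):
        for station in stations:
            folder = processed / product / station
            folder.mkdir(parents=True)
            (folder / f"example{extension}").touch()

    l0a_stations = stations_with_files(processed, "L0A", ".parquet")
    l0b_stations = stations_with_files(processed, "L0B", ".nc")
    missing_l0b = sorted(set(l0a_stations) - set(l0b_stations))
    print("L0A:", l0a_stations)
    print("L0B:", l0b_stations)
    print("Stations without L0B output:", missing_l0b)
```

On a real run, point `processed` at your own `Processed` campaign folder. A non-empty last list is a hint to look at the files in the `logs` folder.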

Well done 👋 you have created a new reader. You can now :

-   Create your own reader based on your data.
-   Run this reader over your entire dataset to generate L0 files.
-   Publish this reader to the main GitHub repository to enrich the
    DISDRODB project ! Have a look at the [contributors
    guidelines](https://disdrodb.readthedocs.io/en/latest/contributors_guidelines.html).

    

The solution for this tutorial can be found here : `disdrodb\L0\readers\TUTORIAL\parser_TUTORIAL_correction.py`