# Introduction to data science
Author: Gérard Lichtert

## Introduction
This notebook cleans survey data from a CSV file: it removes unnecessary columns, computes means, and saves the processed data to a new CSV file in the output folder.

For the OBSE survey, it also builds a second dataframe containing the averages per day per participant and saves it to its own CSV file.
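
To make "averages per day per participant" concrete, here is a minimal plain-Python sketch of the grouping logic. It is not the notebook's implementation (which uses polars through `obse.create_daily_means`); the row layout, the participant IDs and the scores are hypothetical and only illustrate the idea.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (participant id, ISO timestamp, score).
# These values are illustrative only and are not taken from the survey export.
rows = [
    ("p1", "2024-01-01T09:00:00", 4.0),
    ("p1", "2024-01-01T17:00:00", 2.0),
    ("p1", "2024-01-02T09:00:00", 5.0),
    ("p2", "2024-01-01T09:00:00", 3.0),
]


def daily_means(rows):
    """Average the score per (participant, calendar day) pair."""
    groups = defaultdict(list)
    for participant, timestamp, score in rows:
        day = timestamp[:10]  # keep only the YYYY-MM-DD part
        groups[(participant, day)].append(score)
    return {key: mean(scores) for key, scores in groups.items()}


result = daily_means(rows)
for (participant, day), value in sorted(result.items()):
    print(participant, day, value)
```

Each participant gets one row per day, so two responses from the same participant on the same day collapse into a single mean.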

## Usage
Follow the instructions in the README.md file to install the notebook. To execute it, use "Run All" twice. The first run creates the necessary directories. Then put the input CSV files in the resources/in directory and run the notebook again to process them. The processed files are written to the resources/out directory, in either the obse or the afvar subdirectory, depending on the survey they came from.
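
The first run only has to make sure the folder layout exists. A minimal sketch of that step, assuming the layout described above (`resources/in`, `resources/out/obse`, `resources/out/afvar`), could look like the following. The helper name `ensure_dirs` is hypothetical, and the real project may create its directories differently inside the `common` module.

```python
import tempfile
from pathlib import Path


def ensure_dirs(base: Path) -> list[Path]:
    """Create the input and output directories if they do not exist yet."""
    dirs = [base / "in", base / "out" / "obse", base / "out" / "afvar"]
    for directory in dirs:
        # exist_ok=True makes a second run a no-op instead of an error
        directory.mkdir(parents=True, exist_ok=True)
    return dirs


# Demonstrate in a temporary folder so nothing is written to the project
with tempfile.TemporaryDirectory() as tmp:
    created = ensure_dirs(Path(tmp) / "resources")
    ensure_dirs(Path(tmp) / "resources")  # running twice is safe
    all_exist = all(directory.is_dir() for directory in created)
```

Because `mkdir` is called with `exist_ok=True`, running the notebook a second time leaves existing directories and their contents untouched.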

## Variables you can change
In the following two code cells you can adjust the lists of columns to remove. The first list applies to the OBSE survey, the second to the other (afvar) survey.

In [1]:
# Headers to delete from the OBSE survey (excluding the ones ending in _TZ, _RT and _TS).
# The headers ending in _TZ, _RT and _TS are removed automatically.
HEADERS_TO_DELETE_OBSE: list[str] = [
    "STUDY_ID",
    "STUDY_NAME",
    "STUDY_VERSION",
    "SURVEY_ID",
    "TRIGGER",
    "START_END",
    "RAND_PROB",
    "CONTROLLEVRAAG",
    "INTRO",
    "SLOT",
    "BEDANKT",
    "INLEIDING",
]

In [2]:
# Headers to delete from the afvar survey (excluding the ones ending in _TZ, _RT and _TS).
# The headers ending in _TZ, _RT and _TS are removed automatically.
HEADERS_TO_DELETE_OTHER: list[str] = [
    "STUDY_ID",
    "STUDY_NAME",
    "STUDY_VERSION",
    "SURVEY_ID",
    "TRIGGER",
    "START_END",
    "RAND_PROB",
]
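
The automatic removal of the `_TZ`, `_RT` and `_TS` columns can be pictured as a simple suffix filter on the header names. The sketch below is an assumption about how such a filter might work, not the actual body of `common.tz_rt_ts_headers`. The function name `suffix_headers` is hypothetical, and the sample column names are a small subset of those in the afvar export.

```python
SUFFIXES = ("_TZ", "_RT", "_TS")


def suffix_headers(columns: list[str]) -> list[str]:
    """Return the column names that end in one of the metadata suffixes."""
    # str.endswith accepts a tuple and matches any of its entries
    return [column for column in columns if column.endswith(SUFFIXES)]


columns = [
    "PARTICIPANT_ID",
    "PARTICIPANT_TZ",
    "STUDY_ID",
    "CREATED_TS",
    "TOTAL_RT",
    "WERKMETHODE_1",
]

auto = suffix_headers(columns)
to_drop = ["STUDY_ID"] + auto  # manual list plus automatically detected headers
remaining = [column for column in columns if column not in to_drop]

print(auto)
print(remaining)
```

This is why the manual lists above only need to name the columns that do not carry one of the three suffixes.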


## Execution
Below is the code cell that processes all the data. Run it once to create the necessary directories. After moving the CSV files into the resources/in directory, run it a second time to process them.

In [3]:
from afvar import afvar
from obse import obse
from common import common
import polars as pl

# ! CHANGE THE CODE AT YOUR OWN PERIL

for file in common.IN_DIR.iterdir():
    raw_lf: pl.LazyFrame = common.load_csv(filename=file.name)
    if obse.is_obse_survey(raw_lf):
        processed_lf = obse.process_obse(raw_lf, HEADERS_TO_DELETE_OBSE, False)
        daily_means_lf = obse.create_daily_means(processed_lf)
        obse.save_csv(processed_lf, "obse_cleaned.csv")
        obse.save_excel(processed_lf, "obse_cleaned.xlsx")
        obse.save_csv(daily_means_lf, "obse_daily_means.csv")
        obse.save_excel(daily_means_lf, "obse_daily_means.xlsx")
    else:
        processed_lf = afvar.process_afvar(raw_lf, HEADERS_TO_DELETE_OTHER, False)
        afvar.save_csv(processed_lf, "afvar_cleaned.csv")
        print(f"Columns: {raw_lf.columns}")
        print(f"Columns to be removed: {HEADERS_TO_DELETE_OTHER + common.tz_rt_ts_headers(raw_lf)}")
        print(f"Remaining columns: {afvar.clean_lf(raw_lf, HEADERS_TO_DELETE_OTHER, False).columns}")

Columns: ['PARTICIPANT_ID', 'PARTICIPANT_TZ', 'STUDY_ID', 'STUDY_NAME', 'STUDY_VERSION', 'SURVEY_ID', 'SURVEY_NAME', 'TRIGGER', 'EXPORT_TZ', 'START_END', 'CREATED_TS', 'SCHEDULED_TS', 'STARTED_TS', 'COMPLETED_TS', 'EXPIRED_TS', 'UPLOADED_TS', 'TOTAL_RT', 'RAND_PROB', 'INLEIDING_WERKTEVREDENHEID', 'INLEIDING_WERKTEVREDENHEID_RT', 'WERKOMSTANDIGHEDEN_1', 'WERKOMSTANDIGHEDEN_2', 'WERKOMSTANDIGHEDEN_3', 'WERKOMSTANDIGHEDEN_4', 'WERKOMSTANDIGHEDEN_5', 'WERKOMSTANDIGHEDEN_RT', 'WERKMETHODE_1', 'WERKMETHODE_2', 'WERKMETHODE_3', 'WERKMETHODE_4', 'WERKMETHODE_5', 'WERKMETHODE_RT', "COLLEGA'S_1", "COLLEGA'S_2", "COLLEGA'S_3", "COLLEGA'S_4", "COLLEGA'S_5", "COLLEGA'S_RT", 'ERKENNING_1', 'ERKENNING_2', 'ERKENNING_3', 'ERKENNING_4', 'ERKENNING_5', 'ERKENNING_RT', 'BAAS_1', 'BAAS_2', 'BAAS_3', 'BAAS_4', 'BAAS_5', 'BAAS_RT', 'VERANTWOORDELIJKHEID_1', 'VERANTWOORDELIJKHEID_2', 'VERANTWOORDELIJKHEID_3', 'VERANTWOORDELIJKHEID_4', 'VERANTWOORDELIJKHEID_5', 'VERANTWOORDELIJKHEID_RT', 'SALARIS_1', 'SALARIS