__Data Backends__

* `spark`
* `pandas`
* `dask`
* `ray`

__Model Backends__

* `sklearn`
* `sklearnex`
* `xgboost`
* `lightgbm`
* `catboost`
* `pytorch`
* `tensorflow`
* `SparkML`

__Data Formats__

* `csv`
* `parquet`

__Compute Backends__

* `joblib + dask`
* `joblib + ray`
* `joblib + ipp`
* `joblib + loky`
* `joblib + multiprocessing`
* `joblib + threading`
* `joblib + spark`
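
All of the `joblib + <backend>` combinations above go through the same `joblib.Parallel` interface, and only the backend name changes. The following is a minimal sketch under two assumptions: it covers only the backends that ship with joblib (`loky`, `multiprocessing`, `threading`), and `run_parallel` and `square` are hypothetical names used for illustration. The `dask`, `ray`, `spark`, and `ipp` backends must first be registered by their own packages, which is not shown here.

```python
from joblib import Parallel, delayed

# Backends that ship with joblib itself; the other entries in the list
# above have to be registered by their own packages before use.
BUILTIN_BACKENDS = ("loky", "multiprocessing", "threading")


def square(x):
    """Toy workload standing in for a real task."""
    return x * x


def run_parallel(values, backend="threading", n_jobs=2):
    """Hypothetical helper: map `square` over `values` on a joblib backend."""
    if backend not in BUILTIN_BACKENDS:
        raise ValueError(f"backend {backend!r} must be registered before use")
    # Parallel returns the results in the same order as the inputs.
    return Parallel(n_jobs=n_jobs, backend=backend)(
        delayed(square)(v) for v in values
    )


if __name__ == "__main__":
    print(run_parallel(range(5)))
```

Keeping the backend name as a plain string argument means the choice could later come from one of the config files rather than from code.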


__Data Stores__

* `dvc`
* `blob`
* `local`

__Configs__

* `data_config.json`
* `model_config.json`
* `log_config.json`
* `config.json`
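
Since all four configs are JSON files, reading and writing them can share one small pair of helpers. This is a sketch only: the function signatures and the keys in the example dict are assumptions made for illustration, not a schema defined by this project.

```python
import json
import tempfile
from pathlib import Path


def write_config(config, path):
    """Write a config dict to `path` as indented JSON."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2, sort_keys=True), encoding="utf-8")


def read_config(path):
    """Read a JSON config file and return its contents as a dict."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "data_config.json"
        # Hypothetical keys, for illustration only.
        write_config({"data_format": "parquet", "data_store": "local"}, target)
        print(read_config(target))
```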

### __Creating an environment to run ray, dask, spark, and modin__

```bash
conda create -n aikit-modin python=3.8 intel-aikit-modin -c intel -c conda-forge -y
conda activate aikit-modin
conda install scikit-learn-intelex -c conda-forge
pip install ipyparallel joblib jupyter notebook scikit-learn pandas numpy "dask[distributed]" watermark joblibspark pyspark tune-sklearn typer "modin[all]"
pip list --format=freeze > requirements_intel.txt
conda deactivate
```

In [None]:
# import pandas as pd

In [None]:
# import ray
# ray.init(runtime_env={'env_vars': {'__MODIN_AUTOIMPORT_PANDAS__': '1'}})
# import modin.pandas as pd

In [None]:
# import os
# os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
# from pyspark.sql import SparkSession
# import pyspark.pandas as pd
# pd.set_option('compute.default_index_type', 'distributed')
# spark = SparkSession.builder.master('local[*]').config("spark.driver.memory", "10g").getOrCreate()

# https://docs.databricks.com/machine-learning/ray-integration.html
# https://www.intel.com/content/www/us/en/developer/tools/oneapi/distribution-of-modin.html#gs.qt8guf
# https://www.ray.io/ray-datasets
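
The three cells above bind the same name, `pd`, to a different module depending on the data backend. One way to make that switch explicit is a small lookup, sketched below. The function names and the backend keys are hypothetical; the module names and the `PYARROW_IGNORE_TIMEZONE` setting are taken from the cells above. The sketch does not start Ray or build a `SparkSession`, so those steps from the cells are still needed, and `dask` is left out because no cell above shows its import.

```python
import importlib
import os

# Module that provides the pandas-style API for each data backend,
# as imported in the cells above.
PANDAS_API_MODULES = {
    "pandas": "pandas",
    "ray": "modin.pandas",
    "spark": "pyspark.pandas",
}

# Environment variables to set before importing a backend's module.
BACKEND_ENV_VARS = {
    "spark": {"PYARROW_IGNORE_TIMEZONE": "1"},
}


def resolve_module_name(backend):
    """Return the module name for `backend`, or raise ValueError."""
    try:
        return PANDAS_API_MODULES[backend]
    except KeyError:
        known = ", ".join(sorted(PANDAS_API_MODULES))
        raise ValueError(f"unknown backend {backend!r}; expected one of: {known}")


def load_pandas_api(backend="pandas"):
    """Hypothetical helper: import and return the module to bind to `pd`."""
    module_name = resolve_module_name(backend)
    for key, value in BACKEND_ENV_VARS.get(backend, {}).items():
        os.environ.setdefault(key, value)
    return importlib.import_module(module_name)
```

A notebook would then start with `pd = load_pandas_api("pandas")` and change only the argument to switch backends.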

* `DataIO`: This class holds all methods for reading and writing data from and to the supported data stores, as well as methods for reading and writing config files and models. Any other helper needed to read or write data should also be added to this class.

* `DataParser`: This class holds all methods for parsing the data. For example, if a column's data type is `number`, it should be converted to `int` or `float` depending on the values. It should also provide methods to parse date columns with a given date format and, wherever possible, optimize data types to reduce memory usage.
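
The `int`-or-`float` rule and the date parsing described for `DataParser` can be sketched without any data backend. The example below works on plain lists of strings so the rule itself is visible; the function names and signatures are assumptions, and a real implementation would operate on the chosen backend's DataFrame columns instead.

```python
from datetime import datetime


def parse_number_column(values):
    """Convert numeric strings to int if every value is integral, else to float.

    Assumes `values` is a list of strings; a non-numeric string raises ValueError.
    """
    try:
        return [int(v) for v in values]
    except ValueError:
        # At least one value is not a plain integer literal, so use float for all.
        return [float(v) for v in values]


def parse_date_column(values, date_format):
    """Parse date strings with the given strptime-style format."""
    return [datetime.strptime(v, date_format) for v in values]


if __name__ == "__main__":
    print(parse_number_column(["1", "2", "3"]))
    print(parse_number_column(["1", "2.5"]))
    print(parse_date_column(["2023-01-31"], "%Y-%m-%d"))
```

With the pandas backend, `pandas.to_numeric` with its `downcast` argument and `pandas.to_datetime` with its `format` argument cover similar ground and also help with the memory optimization goal.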



In [3]:
from data_utils import DataParser, MetaDataExtractor, DataProcessor, DataValidator, DataLogger, DataIO, DataProfiler, DataVisualizer

# Data Input and Output
DataIO.read_data()
DataIO.write_data()
DataIO.read_config()
DataIO.write_config()
DataIO.save_model()
DataIO.load_model()
DataIO.upload_file()
DataIO.download_file()

# Data Parser
DataParser.optimize_data_types()
DataParser.parse_date()

# Meta Data Extractor
MetaDataExtractor.get_counts()
MetaDataExtractor.get_date_frequency()
MetaDataExtractor.get_data_types()
MetaDataExtractor.get_date_format()
MetaDataExtractor.get_unique_values()
MetaDataExtractor.get_min_max_values()
MetaDataExtractor.get_min_max_date()

# Data Processor, plus any other custom functions (e.g. aggregation)
DataProcessor.filter_by_range()
DataProcessor.filter_by_values()
DataProcessor.apply_imputation()


# Resampling of data by date, e.g. changing the date granularity to month or week.

# Data Validator
DataValidator.check_columns()
DataValidator.check_duplicates()
DataValidator.check_nan()
DataValidator.check_min_max()
DataValidator.check_min_max_date()
DataValidator.check_unique_values()

In [1]:
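class DataIO:
    """Read and write data, config files, and models across data stores."""
    def __init__(self):
        pass
    # Static methods, so they can be called on the class: DataIO.read_data()
    @staticmethod
    def read_data():
        pass
    @staticmethod
    def write_data():
        pass
    @staticmethod
    def read_config():
        pass
    @staticmethod
    def write_config():
        pass
    @staticmethod
    def save_model():
        pass
    @staticmethod
    def load_model():
        pass
    @staticmethod
    def upload_file():
        pass
    @staticmethod
    def download_file():
        pass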

class DataParser:
    """Parse columns: optimize data types and parse dates."""
    def __init__(self):
        pass
    @staticmethod
    def optimize_data_types():
        pass
    @staticmethod
    def parse_date():
        pass

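class MetaDataExtractor:
    """Extract metadata: counts, data types, unique values, value and date ranges."""
    def __init__(self):
        pass
    @staticmethod
    def get_unique_values():
        pass
    @staticmethod
    def get_min_max_values():
        pass
    @staticmethod
    def get_min_max_date():
        pass
    @staticmethod
    def get_counts():
        pass
    @staticmethod
    def get_date_frequency():
        pass
    @staticmethod
    def get_data_types():
        pass
    @staticmethod
    def get_date_format():
        pass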

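class DataProcessor:
    """Rename columns, filter rows, and apply imputation."""
    def __init__(self):
        pass
    @staticmethod
    def rename_columns():
        pass
    @staticmethod
    def filter_by_range():
        pass
    @staticmethod
    def filter_by_values():
        pass
    @staticmethod
    def apply_imputation():
        pass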

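class DataValidator:
    """Check columns, duplicates, missing values, and value and date ranges."""
    def __init__(self):
        pass
    @staticmethod
    def check_duplicates():
        pass
    @staticmethod
    def check_columns():
        pass
    @staticmethod
    def check_nan():
        pass
    @staticmethod
    def check_min_max():
        pass
    @staticmethod
    def check_min_max_date():
        pass
    @staticmethod
    def check_unique_values():
        pass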

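class DataLogger:
    """Read, write, upload, and download logs; log metadata, validation, and config."""
    def __init__(self):
        pass
    @staticmethod
    def upload_log():
        pass
    @staticmethod
    def download_log():
        pass
    @staticmethod
    def read_log():
        pass
    @staticmethod
    def write_log():
        pass
    @staticmethod
    def log_metadata():
        pass
    @staticmethod
    def log_validation():
        pass
    @staticmethod
    def log_config():
        pass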

class DataProfiler:
    """Profile the data."""
    def __init__(self):
        pass
    @staticmethod
    def profile_data():
        pass

class DataVisualizer:
    """Visualize the data."""
    def __init__(self):
        pass
    @staticmethod
    def visualize_data():
        pass
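
The stubs above take no arguments yet, so their eventual signatures are open. As one possible shape for the `DataValidator` checks, here is a sketch over plain lists of dict rows, which runs without any of the data backends. The signatures and return types are assumptions; with the pandas backend the same checks would more likely build on `DataFrame.duplicated()` and `DataFrame.isna()`.

```python
import math


def check_columns(rows, expected_columns):
    """Return True when every row has exactly the expected column names."""
    expected = set(expected_columns)
    return all(set(row) == expected for row in rows)


def check_duplicates(rows):
    """Return the indices of rows that repeat an earlier row."""
    seen, duplicates = set(), []
    for index, row in enumerate(rows):
        # Column names are unique within a row, so sorting never compares values.
        key = tuple(sorted(row.items()))
        if key in seen:
            duplicates.append(index)
        else:
            seen.add(key)
    return duplicates


def check_nan(rows):
    """Return (row index, column) pairs whose value is None or NaN."""
    return [
        (index, column)
        for index, row in enumerate(rows)
        for column, value in row.items()
        if value is None or (isinstance(value, float) and math.isnan(value))
    ]


def check_min_max(rows, column, minimum, maximum):
    """Return True when every value of `column` lies in [minimum, maximum].

    Assumes missing values were already handled, e.g. after check_nan.
    """
    return all(minimum <= row[column] <= maximum for row in rows)


if __name__ == "__main__":
    rows = [
        {"id": 1, "price": 9.5},
        {"id": 2, "price": float("nan")},
        {"id": 1, "price": 9.5},
    ]
    print(check_columns(rows, ["id", "price"]))
    print(check_duplicates(rows))
    print(check_nan(rows))
    print(check_min_max(rows, "id", 1, 2))
```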