# Overview

The goal of this file is to load the raw database files that are in the `raw_data` folder, then extract and format the data in a way that is useful for training the model. The output files are saved in the `datasets` folder.

Produces:
- `datasets/train.csv`: training data (80% of the data).
- `datasets/test.csv`: testing data (20% of the data).
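
As a rough illustration of the 80/20 split, here is a minimal sketch that shuffles the rows and cuts them into two parts. The helper name `split_train_test`, the seed, and the toy dataframe are assumptions made for this example only; the split actually used for the final files may differ (for instance, it might group rows by road section rather than shuffle them individually).

```python
import pandas as pd


def split_train_test(df, test_fraction=0.2, seed=0):
    """Shuffle the rows of df and return (train, test) dataframes."""
    # frac=1.0 returns every row in random order; the seed makes it repeatable
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n_test = int(round(len(shuffled) * test_fraction))
    test = shuffled.iloc[:n_test]
    train = shuffled.iloc[n_test:]
    return train, test


# Toy data standing in for the processed dataset (hypothetical columns)
example = pd.DataFrame({"x": range(10), "y": range(10, 20)})
train, test = split_train_test(example)
```

Each part could then be written out with `train.to_csv('datasets/train.csv', index=False)` and the equivalent call for `test`.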

In [None]:
import pandas as pd
import numpy as np
from sqlalchemy import create_engine, inspect
from tqdm import tqdm
import sys
import os

In [None]:
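# Load data from SQL Server
from getpass import getpass
from urllib.parse import quote_plus

server = 'localhost'
database_names = ['Bucket_110914_1', 'Bucket_110914_2']  # Bucket_<ID>_<#>
username = 'SA'
# getpass avoids echoing the password; quote_plus escapes characters
# such as '@' or '/' that would otherwise break the connection URL
password = quote_plus(getpass('Enter password: '))
port = '1433'
driver = 'ODBC+Driver+17+for+SQL+Server'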

data = {}
for database in tqdm(database_names):
    engine = create_engine(f'mssql+pyodbc://{username}:{password}@{server}:{port}/{database}?driver={driver}')
    
    inspector = inspect(engine)
    table_names = inspector.get_table_names()

    # Create a dictionary of dataframes
    dfs = {}

    # Loop through table names and for each table, execute a SQL query and load the result into a pandas DataFrame
    for table in tqdm(table_names):
        query = f'SELECT * FROM {table}'
        dfs[table] = pd.read_sql_query(query, engine)
    # Note: a table name that appears in more than one database overwrites the earlier copy
    data.update(dfs)

# Load data from spreadsheet (sheet_name=None returns a dict with one DataFrame per sheet)
data.update(pd.read_excel('raw_data/Bucket_110915.xlsx', sheet_name=None))


In [None]:
# Write all data to parquet files
os.makedirs('./datasets', exist_ok=True)  # make sure the output folder exists
for table in tqdm(data):
    data[table].to_parquet(f'./datasets/{table}.parquet')

In [None]:
# Load all data from parquet files
data_dir = './datasets/'
data = {}
for file in tqdm(os.listdir(data_dir)):
    if file.endswith('.parquet'):
        # Key each DataFrame by its file name without the '.parquet' extension
        data[os.path.splitext(file)[0]] = pd.read_parquet(os.path.join(data_dir, file))

# Preprocessing

Preprocessing happens in three steps:

- Weather data will be loaded and processed into a single useful dataframe.
- IRI dataframe will be loaded and columns from other dataframes will be added to it.
- The final result will be saved to another parquet file.
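
The steps above might look roughly like the sketch below. Everything in it is an assumption made for illustration: the toy tables, the column names (`section_id`, `year`, `precip_mm`, `temp_mean_c`, `iri`), and the choice of join keys are hypothetical and do not come from the real database schema.

```python
import pandas as pd

# Toy stand-ins for the real tables; all column names are hypothetical
precipitation = pd.DataFrame({
    "section_id": [1, 1, 2],
    "year": [2000, 2001, 2000],
    "precip_mm": [500.0, 620.0, 410.0],
})
temperature = pd.DataFrame({
    "section_id": [1, 1, 2],
    "year": [2000, 2001, 2000],
    "temp_mean_c": [11.5, 12.1, 9.8],
})
iri = pd.DataFrame({
    "section_id": [1, 2, 2],
    "year": [2001, 2000, 2002],
    "iri": [1.2, 0.9, 1.1],
})

keys = ["section_id", "year"]

# Step 1: combine the weather tables into a single dataframe
weather = precipitation.merge(temperature, on=keys, how="outer")

# Step 2: add the weather columns to the IRI dataframe;
# a left join keeps every IRI row, with NaN where no weather record matches
merged = iri.merge(weather, on=keys, how="left")

# Step 3: save the result (hypothetical file name)
# merged.to_parquet('./datasets/iri_merged.parquet')
```

A left join is used here so that no IRI measurement is dropped; rows without matching weather data would then need to be filled or filtered before training.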