# Recreating the AdventureWorks Database
In this notebook, we perform all the steps needed to:
1. Clean out any previous iteration of the `adventureworks` database.
2. Download a zip file of the AdventureWorks database provided in Parquet format.
3. Use that zip file to create tables in the `adventureworks` schema.

Due to the limitations of the Databricks Free Edition, I do not seem to be able to create the separate workspaces needed to set up the separate schemas of a proper AdventureWorks database. For this reason, every table name is prefixed with its original schema name, e.g., what should be `adventureworks.person.emailaddress` can be found at `workspace.adventureworks.person_emailaddress`. I will call out later in the script where it should be modified to create the schemas properly.

This script should be run whenever a user wants to create the `adventureworks` database within a Databricks workspace, or wants to recreate/reset the tables.

## Initial Setup
We start by importing all the libraries we need for this script and making sure we are using the correct catalog.

In [0]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType, DateType, FloatType
import os
import requests
import zipfile

In [0]:
%sql 
use catalog workspace

*Note*: The above cell needs to be changed depending on your catalog name. It will likely change if you are able to create a catalog for `adventureworks`; in that case, you will likely also be able to create separate schemas, which is discussed in more detail later.

## Clean-up
Next, we find all schemas whose names start with `adventureworks` and drop them, along with everything they contain.

In [0]:
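# Find every schema whose name starts with "adventureworks".
schemas = spark.sql('''
select schema_name
from `information_schema`.`schemata`
where schema_name like 'adventureworks%'
limit 100
''')

# Drop each matching schema; CASCADE also removes its tables and volumes.
for row in schemas.collect():
    spark.sql(f'''DROP SCHEMA IF EXISTS {row.schema_name} CASCADE''')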


## Recreating Schema and Volume
Next, we recreate the namespaces used by this script. Specifically, we create the `adventureworks` schema, set it as the default, then create a volume within that schema to store the raw data files we are going to use.
*Note*: This is one of the areas you can change if you are using multiple catalogs. If you can create an `adventureworks` catalog, you would be able to create a separate schema for each schema defined by the original AdventureWorks sample database. This script would change, depending on that ability.

In [0]:

%sql
CREATE SCHEMA adventureworks;
USE SCHEMA adventureworks;
CREATE VOLUME adventureworks.raw_data;
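
For readers who can create their own catalog, the change described in the note above might look roughly like the sketch below. It only builds the SQL statements as strings, so nothing is executed. The helper name `build_schema_ddl` is hypothetical, and the schema list assumes the five named schemas of the original AdventureWorks sample database (HumanResources, Person, Production, Purchasing, Sales), written in lower case.

```python
# The five named schemas of the original AdventureWorks sample database.
ADVENTUREWORKS_SCHEMAS = ["humanresources", "person", "production", "purchasing", "sales"]


def build_schema_ddl(catalog: str, schemas: list[str]) -> list[str]:
    """Return SQL statements that create a catalog and one schema per name.

    Hypothetical helper: it only builds strings and does not run them.
    """
    statements = [f"CREATE CATALOG IF NOT EXISTS {catalog}"]
    for name in schemas:
        statements.append(f"CREATE SCHEMA IF NOT EXISTS {catalog}.{name}")
    return statements


ddl = build_schema_ddl("adventureworks", ADVENTUREWORKS_SCHEMAS)
for statement in ddl:
    # With catalog-creation privileges, each statement could be passed to
    # spark.sql(statement) in a notebook instead of being printed.
    print(statement)
```

The volume for the raw files would then also need to live in one of these schemas, which is a choice this sketch leaves open.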

Next, we set a handful of variables we will be using to store files. 
*Note*: If you have the ability to create separate catalogs, make sure you change your paths here as well.

In [0]:
schema = 'adventureworks'
volume = 'raw_data'
url = 'https://github.com/olafusimichael/AdventureWorksParquet/archive/main.zip'
workspace_dir = f'/Volumes/workspace/{schema}/{volume}'
zip_path = f'{workspace_dir}/main.zip'


At this point, we download the zip file, as provided by [Michael Olafusi (olafusimichael)](https://github.com/olafusimichael/AdventureWorksParquet), and extract its contents into our newly created volume.

In [0]:
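# Download the archive and stop early if the request failed.
response = requests.get(url)
response.raise_for_status()
with open(zip_path, 'wb') as f:
    f.write(response.content)
# Extract everything into the volume.
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(workspace_dir)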
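
Because the table creation step depends on the extracted files, it can help to confirm that the archive really is a zip and that it contained Parquet files before moving on. The following is a minimal sketch using only the standard library. The function name `extract_parquet_archive` is hypothetical, and it assumes the volume path behaves like an ordinary local directory, as it does for the `open` call above.

```python
import zipfile
from pathlib import Path


def extract_parquet_archive(zip_path: str, dest_dir: str) -> list[str]:
    """Extract zip_path into dest_dir and return the extracted .parquet paths."""
    if not zipfile.is_zipfile(zip_path):
        # A failed download can leave an error page on disk instead of a zip.
        raise ValueError(f"{zip_path} is not a valid zip archive")
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(dest_dir)
    # Search recursively, since the archive unpacks into a subdirectory.
    return sorted(str(p) for p in Path(dest_dir).rglob("*.parquet"))
```

In this notebook it could be called as `extract_parquet_archive(zip_path, workspace_dir)`, and an empty result would signal that something went wrong before any tables are written.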

## Table Creation
This is the meat and potatoes of this script. With everything else done, we loop through all the Parquet files and create a table from the data in each one.

In [0]:
parquet_dir = f'{workspace_dir}/AdventureWorksParquet-main'
for file in dbutils.fs.ls(parquet_dir):
    if file.name.endswith('.parquet'):
        table_name = file.name.replace('.parquet', '').replace(".", "_")
        df = spark.read.parquet(file.path)
        df.write.mode('overwrite').saveAsTable(f'{schema}.{table_name}')

Once completed, you should be able to see all the tables in the `adventureworks` schema of your workspace. 
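
If you are able to create separate schemas, the table-naming step in the loop above is the place to change. Here is a rough sketch of how a file name could be split into a schema and a table instead of being flattened with underscores. It assumes file names of the form `Schema.Table.parquet` (for example `Person.EmailAddress.parquet`), which is what the `.replace(".", "_")` call suggests but which is not confirmed here. The function name `split_table_name` and the `dbo` fallback are hypothetical.

```python
def split_table_name(file_name: str, default_schema: str = "dbo") -> tuple[str, str]:
    """Split 'Schema.Table.parquet' into a lower-cased (schema, table) pair."""
    stem = file_name.removesuffix(".parquet")
    schema_name, _, table_name = stem.partition(".")
    if not table_name:
        # No schema prefix in the file name: fall back to the default schema.
        return default_schema, schema_name.lower()
    # Any further dots stay flattened so the table name remains a single identifier.
    return schema_name.lower(), table_name.replace(".", "_").lower()


print(split_table_name("Person.EmailAddress.parquet"))  # ('person', 'emailaddress')
```

Inside the loop, the returned pair could then be used to build a three-part name such as `adventureworks.person.emailaddress` for `saveAsTable`, provided the target schemas already exist.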