# ARCHIVED - Please run with RAPIDS 24.06 or earlier. You can use conda or Docker to get the correct version.

# Querying Multiple Data Formats 
*By Winston Robson,
Edited for Dask-SQL by Shondace Thomas*

In this notebook, we will cover: 
- How to create Dask-SQL tables from:
    - Text file (CSV)
    - Apache Parquet 
    - cuDF DataFrame (GDF)
- How to `JOIN` multiple Dask-SQL tables into one cuDF DataFrame with a single query.

#### Download Data
This cell will check if you have the data for this demo, and, if you don't, will download it for you.

In [1]:
import os

# relative path to data folder
data_dir = '../../../data/dask-sql/'

# does folder exist?
if not os.path.exists(data_dir):
    print('creating dask-sql directory\n')
    # create folder (and any missing parents) at the same path we check above
    os.makedirs(data_dir, exist_ok=True)

# do we have file 0?
if not os.path.isfile(data_dir + 'cancer_data_00.csv'):
    !wget -P ../../../data/dask-sql https://raw.githubusercontent.com/BlazingDB/bsql-demos/master/data/cancer_data_00.csv

# do we have file 1?
if not os.path.isfile(data_dir + 'cancer_data_01.parquet'):
    !wget -P ../../../data/dask-sql https://blazingsql-colab.s3.amazonaws.com/cancer_data/cancer_data_01.parquet

# do we have file 2?
if not os.path.isfile(data_dir + 'cancer_data_02.csv'):
    !wget -P ../../../data/dask-sql https://raw.githubusercontent.com/BlazingDB/bsql-demos/master/data/cancer_data_02.csv

--2021-12-10 08:33:53--  https://raw.githubusercontent.com/BlazingDB/bsql-demos/master/data/cancer_data_00.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1233 (1.2K) [text/plain]
Saving to: ‘../../../data/dask-sql/cancer_data_00.csv’


2021-12-10 08:33:54 (35.2 MB/s) - ‘../../../data/dask-sql/cancer_data_00.csv’ saved [1233/1233]

--2021-12-10 08:33:54--  https://blazingsql-colab.s3.amazonaws.com/cancer_data/cancer_data_01.parquet
Resolving blazingsql-colab.s3.amazonaws.com (blazingsql-colab.s3.amazonaws.com)... 52.216.110.35
Connecting to blazingsql-colab.s3.amazonaws.com (blazingsql-colab.s3.amazonaws.com)|52.216.110.35|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2364 (2.3K) [binary/octet-stream]
Saving to: ‘../../../data
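
The check-then-download pattern in the cell above can also be written without shell escapes. The following is a minimal sketch that assumes only the Python standard library; `fetch_if_missing` is a hypothetical helper name, not part of Dask-SQL, and the URL in the usage comment is the one from the cell above.

```python
import os
import urllib.request


def fetch_if_missing(url, dest_dir):
    """Download `url` into `dest_dir` unless the file is already there.

    Returns (local_path, downloaded); `downloaded` is True only when a
    download actually happened on this call.
    """
    # create the target folder, including missing parents
    os.makedirs(dest_dir, exist_ok=True)

    # name the local file after the last path component of the URL
    local_path = os.path.join(dest_dir, os.path.basename(url))

    # skip the download when the file is already present
    if os.path.isfile(local_path):
        return local_path, False

    urllib.request.urlretrieve(url, local_path)
    return local_path, True


# usage (left commented out so nothing is fetched on import):
# fetch_if_missing(
#     'https://raw.githubusercontent.com/BlazingDB/bsql-demos/master/data/cancer_data_00.csv',
#     '../../../data/dask-sql/')
```

Returning the `downloaded` flag makes it easy to see whether a re-run of the notebook reused the cached file.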

## Create DaskContext
You can think of the Dask-SQL `Context` much like a Spark Context: it holds session state, such as the tables you create.

In [2]:
from dask_sql import Context
# start up DaskSQL
dc = Context()

#### Data Path
Dask-SQL requires the full path to the data. This cell uses the `pwd` bash command to find the path to this directory, then appends the relative path to the notebooks-contrib `data` directory (i.e. what you'd type in Terminal to navigate to the data).

In [3]:
# bash command, returns SList w/ path (str) at 0th index
path = !pwd

# extract path to notebooks-contrib
path = path[0].split('getting_started_materials')[0] 

# add path to Dask-SQL data
path = path + 'data/dask-sql/'

# what's it look like?
path

'/rapids/notebooks/extra/notebooks-contrib/data/dask-sql/'
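
The same path can be built in plain Python, which also works outside IPython where `!pwd` is unavailable. This is a sketch under assumptions: `data_path_from` is a hypothetical helper, and the example working directory (including the `intro_tutorials` folder name) is made up for illustration.

```python
import os


def data_path_from(cwd, marker='getting_started_materials',
                   data_subdir='data/dask-sql/'):
    """Return the data directory for a working directory inside the repo.

    Everything before `marker` is treated as the repo root, mirroring the
    split() call in the cell above.
    """
    if marker not in cwd:
        raise ValueError(f'{marker!r} not found in {cwd!r}')
    repo_root = cwd.split(marker)[0]
    return repo_root + data_subdir


# hypothetical working directory, for illustration only
example_cwd = ('/rapids/notebooks/extra/notebooks-contrib/'
               'getting_started_materials/intro_tutorials')
print(data_path_from(example_cwd))

# in the notebook itself you would pass the real working directory:
# path = data_path_from(os.getcwd())
```

Raising an error when the marker is missing avoids silently building a wrong path if the notebook is run from another location.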

### Create Table from CSV
Here we create a Dask-SQL table directly from a comma-separated values (CSV) file. 

In [4]:
# define column names and types
col_names = ['diagnosis_result', 'radius', 'texture', 'perimeter']
col_types = ['float32', 'float32', 'float32', 'float32'] 

# create table from CSV file
dc.create_table('data_00', path +'cancer_data_00.csv', gpu=True, names=col_names, dtype=col_types)


# df_result = dc.sql("SELECT * FROM data_00")
# df_result.head()
# type(df_result)
# df_result

### Create Table from Parquet
Here we create a Dask-SQL table directly from an Apache Parquet file.

In [5]:
# create table from Parquet file
dc.create_table('data_01', path + 'cancer_data_01.parquet', gpu=True)

### Create Table from GPU DataFrame
Here we use cuDF to create a GPU DataFrame (GDF), then use Dask-SQL to create a table from that GDF.

The GDF is the standard memory representation for the RAPIDS AI ecosystem.

In [6]:
import cudf

# define column names & types
col_names = ['compactness', 'symmetry', 'fractal_dimension']
col_types = ['float32', 'float32', 'float32']

# make GPU DataFrame from CSV w/ cuDF
gdf_02 = cudf.read_csv(path+'cancer_data_02.csv', names=col_names, dtype=col_types)

# create Dask-SQL table from cuDF DataFrame
dc.create_table('data_02', gdf_02)
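
A list of dtypes is matched to the column names by position, so a length mismatch between the two lists is easy to miss. One way to guard against it is to pair them explicitly before reading the file. The sketch below uses a hypothetical helper, `build_dtype_map`; the commented-out `cudf.read_csv` call assumes cuDF is installed and relies on its `dtype` parameter accepting a column-to-type dict.

```python
def build_dtype_map(names, types):
    """Pair column names with dtypes, failing loudly on a length mismatch."""
    if len(names) != len(types):
        raise ValueError(
            f'{len(names)} column names but {len(types)} dtypes')
    return dict(zip(names, types))


col_names = ['compactness', 'symmetry', 'fractal_dimension']
col_types = ['float32'] * len(col_names)

dtype_map = build_dtype_map(col_names, col_types)
print(dtype_map)

# with cuDF available (assumption), the dict can replace the list:
# gdf_02 = cudf.read_csv(path + 'cancer_data_02.csv',
#                        names=col_names, dtype=dtype_map)
```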

In [7]:
import time

t0 = time.time()

# Join Tables Together 

Now we can use Dask-SQL to join all three data sources in a single query. Because the tables are GPU-backed, `dc.sql` returns a lazy Dask DataFrame backed by cuDF; calling `head()` or `compute()` on it produces a cuDF DataFrame.

In [8]:
# grab everything from data_00 & data_02 and area & smoothness from data_01
query = '''
        SELECT 
            a.*, 
            b.area, b.smoothness, 
            c.* 
        FROM 
            data_00 AS a
        LEFT JOIN data_01 AS b
            ON (a.perimeter = b.perimeter)
        LEFT JOIN data_02 AS c
            ON (b.compactness = c.compactness)
            '''

# join the tables together
join = dc.sql(query)

# display the first rows (head() computes them and returns a cuDF DataFrame)
join.head()

| | diagnosis_result | radius | texture | perimeter | area | smoothness | compactness | symmetry | fractal_dimension |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 16.0 | 14.0 | 86.0 | 520.0 | 0.108 | 0.154 | 0.194 | 0.069 |
| 1 | 1.0 | 10.0 | 24.0 | 97.0 | 668.0 | 0.117 | 0.148 | 0.195 | 0.067 |
| 2 | 1.0 | 23.0 | 26.0 | 78.0 | 451.0 | 0.105 | 0.071 | 0.19 | 0.066 |
| 3 | 1.0 | 23.0 | 26.0 | 78.0 | 451.0 | 0.105 | 0.071 | 0.162 | 0.057 |
| 4 | 1.0 | 23.0 | 12.0 | 151.0 | 954.0 | 0.143 | 0.278 | 0.242 | 0.079 |


In [9]:
t1 = time.time()
print(f"run_stuff took {t1-t0}s")

run_stuff took 1.8894765377044678s
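
The timing above is spread over two cells, so it also measures whatever runs in between. A context manager keeps the measured region explicit. This is a minimal standard-library sketch; `timed` is a hypothetical helper name. If the query result is lazy, as Dask collections typically are, put the call that materializes it (for example `head()`) inside the `with` block.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Print how long the enclosed block took; optionally record it."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f'{label} took {elapsed:.3f}s')


# usage with a stand-in workload; in the notebook the body would be
# something like: join = dc.sql(query); join.head()
timings = {}
with timed('stand-in workload', timings):
    total = sum(range(100_000))
```

`time.perf_counter()` is used instead of `time.time()` because it is a monotonic clock intended for measuring short durations.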


In [10]:
len(join.index)

316
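
A `LEFT JOIN` keeps every row of the left table, but it can also return more rows than the left table holds when a join key matches several rows on the right. Rows 2 and 3 of the output above, which share the same left-hand values, show this effect. The sketch below reproduces the pattern on CPU with the standard-library `sqlite3` module as a stand-in for Dask-SQL; the tables are tiny toy versions with illustrative values, not the real data files.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()

# toy tables with a subset of the columns used in the query above
cur.execute('CREATE TABLE data_00 (radius REAL, perimeter REAL)')
cur.execute('CREATE TABLE data_01 (perimeter REAL, area REAL, compactness REAL)')
cur.execute('CREATE TABLE data_02 (compactness REAL, symmetry REAL)')

cur.executemany('INSERT INTO data_00 VALUES (?, ?)',
                [(23.0, 78.0), (16.0, 86.0)])
cur.executemany('INSERT INTO data_01 VALUES (?, ?, ?)',
                [(78.0, 451.0, 0.071), (86.0, 520.0, 0.154)])
# compactness 0.071 appears twice, so it matches twice
cur.executemany('INSERT INTO data_02 VALUES (?, ?)',
                [(0.071, 0.19), (0.071, 0.162), (0.154, 0.194)])

rows = cur.execute('''
    SELECT a.radius, a.perimeter, b.area, c.symmetry
    FROM data_00 AS a
    LEFT JOIN data_01 AS b ON a.perimeter = b.perimeter
    LEFT JOIN data_02 AS c ON b.compactness = c.compactness
    ORDER BY a.radius, c.symmetry
''').fetchall()

for row in rows:
    print(row)
print(len(rows))  # 3 rows from a 2-row left table
conn.close()
```

Comparing `len(join.index)` with the row count of `data_00` is a quick way to check whether this kind of fan-out happened in the real query.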

# You're Ready to Rock
And... that's it! You are now live with Dask-SQL.

Check out our [docs](https://dask-sql.readthedocs.io/) to get fancy or to learn more about how Dask-SQL works with the rest of [RAPIDS AI](https://rapids.ai/).