# Querying Multiple Data Formats 
*By Winston Robson*

In this notebook, we will cover: 
- How to create BlazingSQL tables from: 
    - Text file (CSV)
    - Apache Parquet 
    - cuDF DataFrame (GDF)
- How to `JOIN` multiple BlazingSQL tables into one cuDF DataFrame with a single query.

#### BlazingSQL install check
The next cell checks that you have BlazingSQL installed, and offers to install it if not (making sure the notebook will run as expected).

In [1]:
import sys 
# point import path to notebooks-contrib/utils
sys.path.append('../../../utils/')

from sql_check import bsql_start
# check that BlazingSQL is installed
bsql_start()

"You've got BlazingSQL set up perfectly! Let's get started with SQL in RAPIDS AI!"

#### Download Data
This cell will check if you have the data for this demo, and, if you don't, will download it for you.

In [2]:
import os

# relative path to data folder
data_dir = '../../../data/blazingsql/'

# does folder exist?
if not os.path.exists(data_dir):
    print('creating blazingsql directory\n')
    # create folder
    os.makedirs(data_dir)

# do we have file 0?
if not os.path.isfile(data_dir + 'cancer_data_00.csv'):
    !wget -P ../../../data/blazingsql https://raw.githubusercontent.com/BlazingDB/bsql-demos/master/data/cancer_data_00.csv

# do we have file 1?
if not os.path.isfile(data_dir + 'cancer_data_01.parquet'):
    !wget -P ../../../data/blazingsql https://blazingsql-colab.s3.amazonaws.com/cancer_data/cancer_data_01.parquet

# do we have file 2?
if not os.path.isfile(data_dir + 'cancer_data_02.csv'):
    !wget -P ../../../data/blazingsql https://raw.githubusercontent.com/BlazingDB/bsql-demos/master/data/cancer_data_02.csv
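
The same check-then-download logic can also be written without shell magics, which helps if the cell is ever run as a plain Python script. The sketch below is one possible way to do it with only the standard library. `ensure_file` and its `fetch` parameter are hypothetical names introduced for illustration; they are not part of BlazingSQL or notebooks-contrib.

```python
import os
from urllib.request import urlretrieve


def ensure_file(url, dest_dir, fetch=urlretrieve):
    """Download `url` into `dest_dir` unless the file is already there.

    `ensure_file` and `fetch` are illustrative names (assumptions), not part
    of BlazingSQL. `fetch` is called as fetch(url, local_path), like urlretrieve.
    """
    # create the folder (and any missing parents); no error if it exists
    os.makedirs(dest_dir, exist_ok=True)
    # the local file name is the last component of the URL
    local_path = os.path.join(dest_dir, url.rsplit('/', 1)[-1])
    # only download when the file is missing
    if not os.path.isfile(local_path):
        fetch(url, local_path)
    return local_path
```

Calling `ensure_file(url, data_dir)` with each of the three URLs above would stand in for the three `wget` lines.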

## Create BlazingContext
You can think of the BlazingContext much like a Spark Context: it stores information such as the filesystems you have registered and the tables you have created.

In [3]:
from blazingsql import BlazingContext
# start up BlazingSQL
bc = BlazingContext()

BlazingContext ready


#### Data Path
BlazingSQL requires the full path to the data. This cell uses the `pwd` bash command to find the path to this directory, trims it back to the notebooks-contrib root, then appends the relative path to the `data` directory (i.e. what you'd type in a terminal to navigate to the data).

In [4]:
# bash command, returns SList w/ path (str) at 0th index
path = !pwd
print(path)
# extract path to notebooks-contrib
path = path[0].split('getting_started_materials')[0]

# add path to blazingsql data
path = path + 'data/blazingsql/'
# what's it look like?
path

['/rapids/notebooks/nc/notebooks-contrib/getting_started_materials/hello_worlds/blazingsql']


'/rapids/notebooks/nc/notebooks-contrib/data/blazingsql/'
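
The path handling can also be sketched as a small function that works on any working directory string, which makes the logic easy to check outside the notebook. Everything named here is an assumption for illustration: `resolve_data_dir` is a hypothetical helper, and `marker` defaults to the folder name that appears in the printed path, assumed to sit directly under the repository root.

```python
import os


def resolve_data_dir(cwd, marker='getting_started_materials',
                     data_subdir='data/blazingsql/'):
    """Build the absolute data path from a working directory string.

    Hypothetical helper: `marker` is assumed to be the first folder below
    the notebooks-contrib root on the way to this notebook.
    """
    if marker not in cwd:
        raise ValueError(f'{marker!r} not found in {cwd!r}')
    # everything before the marker is the repo root (trailing slash included)
    repo_root = cwd.split(marker)[0]
    return repo_root + data_subdir


# in the notebook, one would call: path = resolve_data_dir(os.getcwd())
example = resolve_data_dir(
    '/rapids/notebooks/nc/notebooks-contrib/'
    'getting_started_materials/hello_worlds/blazingsql')
```

Raising an error when the marker is missing avoids silently building a path that does not point at the data folder.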

### Create Table from CSV
Here we create a BlazingSQL table directly from a comma-separated values (CSV) file. 

In [5]:
# define column names and types
col_names = ['diagnosis_result', 'radius', 'texture', 'perimeter']
col_types = ['float32', 'float32', 'float32', 'float32']

# create table from CSV file
bc.create_table('data_00', path+'cancer_data_00.csv', names=col_names, dtype=col_types)

### Create Table from Parquet
Here we create a BlazingSQL table directly from an Apache Parquet file.

In [6]:
# create table from Parquet file
bc.create_table('data_01', path+'cancer_data_01.parquet')

### Create Table from GPU DataFrame
Here we use cuDF to create a GPU DataFrame (GDF), then use BlazingSQL to create a table from that GDF.

The GDF is the standard memory representation for the RAPIDS AI ecosystem.

In [7]:
import cudf

# define column names & types
col_names = ['compactness', 'symmetry', 'fractal_dimension']
col_types = ['float32', 'float32', 'float32']

# make GPU DataFrame from CSV w/ cuDF
gdf_02 = cudf.read_csv(path+'cancer_data_02.csv', names=col_names, dtype=col_types)

# create BlazingSQL table from cuDF DataFrame
bc.create_table('data_02', gdf_02)
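
Because the column names and types are passed as two parallel lists, it is easy for them to drift out of step. A minimal sketch of a guard that could run before reading the file follows; `check_schema` is a hypothetical helper, not a cuDF or BlazingSQL function.

```python
def check_schema(names, dtypes):
    """Return a {column: dtype} mapping, or raise if the lists differ in length.

    Hypothetical helper for illustration only.
    """
    if len(names) != len(dtypes):
        raise ValueError(
            f'{len(names)} column names but {len(dtypes)} dtypes')
    return dict(zip(names, dtypes))


# one dtype per column name, so this passes
schema = check_schema(
    ['compactness', 'symmetry', 'fractal_dimension'],
    ['float32', 'float32', 'float32'],
)
```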

# Join Tables Together 

Now we can use BlazingSQL to join all three data formats in a single federated query. BlazingSQL queries return results as a cuDF DataFrame.

In [8]:
# grab everything from data_00 & data_02, and area & smoothness from data_01
query = '''
        SELECT 
            a.*, 
            b.area, b.smoothness, 
            c.* 
        FROM 
            data_00 AS a
        LEFT JOIN data_01 AS b
            ON (a.perimeter = b.perimeter)
        LEFT JOIN data_02 AS c
            ON (b.compactness = c.compactness)
            '''

# join the tables together
join = bc.sql(query)

# display results (type(join)==cudf.core.dataframe.DataFrame)
join

| | diagnosis_result | radius | texture | perimeter | area | smoothness | compactness | symmetry | fractal_dimension |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 17.0 | 24.0 | 60.0 | 274.0 | 0.102 | 0.064999998 | 0.182000011 | 0.068999998 |
| 1 | 1.0 | 20.0 | 27.0 | 103.0 | 798.0 | 0.082 | 0.067000002 | 0.153000012 | 0.057 |
| 2 | 1.0 | 19.0 | 12.0 | 137.0 | 1404.0 | 0.094 | 0.101999998 | 0.177000001 | 0.052999999 |
| 3 | 1.0 | 9.0 | 13.0 | 110.0 | 905.0 | 0.112 | 0.145999998 | 0.200000003 | 0.063000001 |
| 4 | 1.0 | 19.0 | 27.0 | 116.0 | 913.0 | 0.119 | 0.228 | 0.30400002 | 0.074000001 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 311 | 0.0 | 10.0 | 17.0 | 87.0 | 566.0 | 0.098 | 0.081 | 0.274000019 | 0.07 |
| 312 | 0.0 | 10.0 | 17.0 | 88.0 | 559.0 | 0.102 | 0.126000002 | 0.191 | 0.066 |
| 313 | 0.0 | 17.0 | 21.0 | 86.0 | 520.0 | 0.108 | 0.127000004 | 0.192000002 | 0.059999999 |
| 314 | 0.0 | 25.0 | 21.0 | 77.0 | 443.0 | 0.097 | 0.071999997 | 0.208000004 | 0.059999999 |
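
To see what a `LEFT JOIN` does on a small scale, the sketch below runs the same join pattern on two tiny tables. It uses Python's built-in `sqlite3` module as a CPU stand-in for BlazingSQL, purely to illustrate the SQL semantics, and the rows are made up rather than taken from the cancer data files. Every row of the left table is kept, and columns from the right table come back as `None` where no match exists.

```python
import sqlite3

# in-memory SQLite database standing in for the BlazingContext
# (assumption: used only to illustrate LEFT JOIN semantics)
con = sqlite3.connect(':memory:')
con.executescript('''
    CREATE TABLE data_00 (radius REAL, perimeter REAL);
    CREATE TABLE data_01 (perimeter REAL, area REAL);
    -- made-up rows, not taken from the cancer data files
    INSERT INTO data_00 VALUES (17.0, 60.0), (20.0, 103.0), (9.0, 999.0);
    INSERT INTO data_01 VALUES (60.0, 274.0), (103.0, 798.0);
''')

# same join shape as the query above, reduced to two tables
rows = con.execute('''
    SELECT a.radius, a.perimeter, b.area
    FROM data_00 AS a
    LEFT JOIN data_01 AS b
        ON (a.perimeter = b.perimeter)
    ORDER BY a.perimeter
''').fetchall()
con.close()

for row in rows:
    print(row)
```

The third row of `data_00` has no matching perimeter in `data_01`, so it still appears in the result with `None` in the `area` position.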


# You're Ready to Rock
And... that's it! You are now live with BlazingSQL.

Check out our [docs](https://docs.blazingdb.com) or [Twitter](https://twitter.com/blazingsql) to get fancy or to learn more about how BlazingSQL works with the rest of [RAPIDS AI](https://rapids.ai/).