# Querying Multiple Data Formats 
*By Winston Robson*

In this notebook, we will cover: 
- How to create BlazingSQL tables from: 
    - Text file (CSV)
    - Apache Parquet 
    - cuDF DataFrame (GDF)
- How to `JOIN` multiple BlazingSQL tables into one cuDF DataFrame with a single query.

#### BlazingSQL install check
The next cell checks whether BlazingSQL is installed and offers to install it if it is not, so the rest of the notebook runs as expected.

In [1]:
import sys 
# point import path to notebooks-contrib/utils
sys.path.append('../../../utils/')
from sql_check import bsql_start
# check that BlazingSQL is installed
bsql_start()

"You've got BlazingSQL set up perfectly! Let's get started with SQL in RAPIDS AI!"

## Create BlazingContext
You can think of the BlazingContext much like a Spark Context: it stores information such as the file systems you have registered and the tables you have created.

In [2]:
from blazingsql import BlazingContext

bc = BlazingContext()

BlazingContext ready


#### Data Path
BlazingSQL requires the full path to the data. This cell uses the `pwd` bash command to get the path to this directory, keeps the part that leads to the notebooks-contrib root, and appends the relative path to the `data/blazingsql/` directory (i.e. what you'd type in Terminal to navigate to the data).

In [3]:
# bash command, returns SList w/ path (str) at 0th index
path = !pwd

# extract path to notebooks-contrib
path = path[0].split('getting_started_notebooks')[0] 

# add path to blazingsql data
path = path + 'data/blazingsql/'

# what's it look like?
path

'/rapids/notebooks/wip/blazing012/notebooks-contrib/data/blazingsql/'
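The `!pwd` syntax only works inside IPython/Jupyter. As a minimal sketch of the same path logic in plain Python, the helper below uses `os.getcwd()` instead. The function name `blazingsql_data_path` and the example directory `/home/user/.../intro_tutorials` are hypothetical, chosen only for illustration.

```python
import os

def blazingsql_data_path(cwd=None, marker='getting_started_notebooks'):
    """Build the path to data/blazingsql/ from a directory inside notebooks-contrib.

    `marker` is the directory name that sits directly under the
    notebooks-contrib root (the same string the cell above splits on).
    """
    # default to the current working directory, like `pwd`
    cwd = os.getcwd() if cwd is None else cwd
    if marker not in cwd:
        raise ValueError(f'{marker!r} not found in {cwd!r}')
    # keep everything before the marker, then append the data directory
    return cwd.split(marker)[0] + 'data/blazingsql/'

# hypothetical working directory, used here only to show the result
example = blazingsql_data_path(
    '/home/user/notebooks-contrib/getting_started_notebooks/intro_tutorials')
print(example)  # /home/user/notebooks-contrib/data/blazingsql/
```

Raising an error when the marker is missing avoids silently building a path that points nowhere.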

### Create Table from CSV
Here we create a BlazingSQL table directly from a comma-separated values (CSV) file. 

In [4]:
# define column names and types
col_names = ['diagnosis_result', 'radius', 'texture', 'perimeter']
col_types = ['float32', 'float32', 'float32', 'float32']

# create table from CSV file
bc.create_table('data_00', path+'cancer_data_00.csv', names=col_names, dtype=col_types)

<pyblazing.apiv2.context.BlazingTable at 0x7f3c40384c50>

### Create Table from Parquet
Here we create a BlazingSQL table directly from an Apache Parquet file.

In [5]:
# create table from Parquet file
bc.create_table('data_01', path+'cancer_data_01.parquet')

<pyblazing.apiv2.context.BlazingTable at 0x7f3cf4c904a8>

### Create Table from GPU DataFrame
Here we use cuDF to create a GPU DataFrame (GDF), then use BlazingSQL to create a table from that GDF.

The GDF is the standard memory representation for the RAPIDS AI ecosystem.

In [6]:
import cudf

# define column names & types
col_names = ['compactness', 'symmetry', 'fractal_dimension']
col_types = ['float32', 'float32', 'float32']

# make GPU DataFrame from CSV w/ cuDF
gdf_02 = cudf.read_csv(path+'cancer_data_02.csv', names=col_names, dtype=col_types)

# create BlazingSQL table from cuDF DataFrame
bc.create_table('data_02', gdf_02)

<pyblazing.apiv2.context.BlazingTable at 0x7f3cf4c90da0>
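Column names and types are passed as two parallel lists, so they need the same length and order. One way to keep them in sync, sketched here in plain Python, is to declare a single mapping and derive both lists from it. The variable name `schema` is an assumption for illustration and is not part of cuDF or BlazingSQL.

```python
# one entry per column: name -> dtype (dicts preserve insertion order)
schema = {
    'compactness': 'float32',
    'symmetry': 'float32',
    'fractal_dimension': 'float32',
}

# derive the two parallel lists from the single source of truth
col_names = list(schema.keys())
col_types = list(schema.values())

# the lists cannot drift apart, but a cheap check documents the intent
assert len(col_names) == len(col_types)
print(col_names)
print(col_types)
```

The resulting `col_names` and `col_types` can then be passed as `names=` and `dtype=` in the same way as in the cell above.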

# Join Tables Together 

Now we can use BlazingSQL to join all three data formats in a single federated query. BlazingSQL queries return results as a cuDF DataFrame.

In [7]:
# grab everything from data_00 & data_02 and area & smoothness from data_01
query = '''
        SELECT 
            a.*, 
            b.area, b.smoothness, 
            c.* 
        FROM 
            data_00 AS a
            LEFT JOIN data_01 AS b
                ON (a.perimeter = b.perimeter)
            LEFT JOIN data_02 AS c
                ON (b.compactness = c.compactness)
        '''

# join the tables together
join = bc.sql(query)

# display results (type(join)==cudf.core.dataframe.DataFrame)
join

|   | diagnosis_result | radius | texture | perimeter | area | smoothness | compactness | symmetry | fractal_dimension |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 19.0 | 27.0 | 72.0 | 394.0 | 0.081 | 0.046999998 | 0.15200001 | 0.057 |
| 1 | 1.0 | 18.0 | 25.0 | 97.0 | 668.0 | 0.117 | 0.148000002 | 0.194999993 | 0.067000002 |
| 2 | 1.0 | 23.0 | 12.0 | 151.0 | 954.0 | 0.143 | 0.277999997 | 0.242000014 | 0.079000004 |
| 3 | 0.0 | 9.0 | 13.0 | 133.0 | 1326.0 | 0.143 | 0.079000004 | 0.181000009 | 0.057 |
| 4 | 1.0 | 21.0 | 27.0 | 130.0 | 1203.0 | 0.125 | 0.159999996 | 0.207000002 | 0.059999999 |
| 5 | 1.0 | 14.0 | 16.0 | 78.0 | 386.0 | 0.070 | 0.284000009 | 0.25999999 | 0.097000003 |
| 6 | 1.0 | 9.0 | 19.0 | 135.0 | 1297.0 | 0.141 | 0.133000001 | 0.181000009 | 0.059 |
| 7 | 0.0 | 25.0 | 25.0 | 83.0 | 477.0 | 0.128 | 0.170000002 | 0.209000006 | 0.075999998 |
| 8 | 1.0 | 19.0 | 24.0 | 88.0 | 520.0 | 0.127 | 0.193000004 | 0.234999999 | 0.074000001 |
| 9 | 1.0 | 24.0 | 21.0 | 103.0 | 798.0 | 0.082 | 0.067000002 | 0.153000012 | 0.057 |
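To see what the chained `LEFT JOIN` does when a row has no match, here is a small sketch that runs the same query shape against toy tables. It swaps in SQLite (Python's built-in `sqlite3` module) for BlazingSQL so it runs without a GPU. The row values are made up, and the column list for `data_01` is an assumption inferred from the columns the query references.

```python
import sqlite3

con = sqlite3.connect(':memory:')
cur = con.cursor()

# toy stand-ins for the three tables (schemas assumed, values invented)
cur.execute('CREATE TABLE data_00 '
            '(diagnosis_result REAL, radius REAL, texture REAL, perimeter REAL)')
cur.execute('CREATE TABLE data_01 '
            '(perimeter REAL, area REAL, smoothness REAL, compactness REAL)')
cur.execute('CREATE TABLE data_02 '
            '(compactness REAL, symmetry REAL, fractal_dimension REAL)')

# two rows on the left; only perimeter 72.0 has a match in data_01
cur.executemany('INSERT INTO data_00 VALUES (?, ?, ?, ?)',
                [(1.0, 19.0, 27.0, 72.0), (0.0, 9.0, 13.0, 133.0)])
cur.execute('INSERT INTO data_01 VALUES (?, ?, ?, ?)', (72.0, 394.0, 0.5, 0.25))
cur.execute('INSERT INTO data_02 VALUES (?, ?, ?)', (0.25, 0.125, 0.0625))

query = '''
    SELECT a.*, b.area, b.smoothness, c.*
    FROM data_00 AS a
    LEFT JOIN data_01 AS b ON (a.perimeter = b.perimeter)
    LEFT JOIN data_02 AS c ON (b.compactness = c.compactness)
    ORDER BY a.perimeter
'''
rows = cur.execute(query).fetchall()
for row in rows:
    print(row)
con.close()
```

Every row of `data_00` survives the join. The row with perimeter 133.0 has no partner in `data_01`, so its `b` columns are NULL (`None` in Python), and because `b.compactness` is NULL it cannot match anything in `data_02` either.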


# You're Ready to Rock
And... that's it! You are now live with BlazingSQL.

Check out our [docs](https://docs.blazingdb.com) to get fancy or to learn more about how BlazingSQL works with the rest of [RAPIDS AI](https://rapids.ai/).