# Lab notebook to query data using SNAP indexes


## HOL-EventsData-Lab-Query-II: queries on the data qubes created in the setup notebook

The purpose of this lab is to demonstrate the core capabilities of SparklineData, such as:

1. Slicing and dicing the data based on various dimensions and metrics
2. Joining with data outside the SNAP qube
3. BI and semantic capabilities like windowing

In this exercise we will be creating four views on the Qube Salessnap with filters for each segment of users. 

1. Query-1 on quantity of tickets sold and revenue/cost by date: 
    * Compare sales of all users to users who liked Jazz and Concerts
    * Draw 
        - Histogram of number of tickets sold
        - Time series of tickets sold - daily
        - Plot Quantity of tickets sold per month 
2. Query-2: Find quantity sold compared to quantity sold over a 40-day window
    * Scatter plot of quantity of tickets sold vs quantity of tickets sold 40 days ago
    * Auto correlation plot

### Setup the notebook

In [None]:
from pyhive import hive
from pprint import pprint
import pandas as pd
import os
import numpy as np
# Altair is used both as `alt.` and through the bare `Chart` name below
import altair as alt
from altair import Chart

# to use with Jupyter notebook (not JupyterLab) run the following
alt.renderers.enable('notebook')

# Connection to Thrift server
thrift_conn = hive.Connection(host="129.146.118.175",port=10000)
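def sql(query, explain=False):
    """Run a query on the Thrift server and return the result as a DataFrame."""
    # The explain argument is unused; call explain(query) below for a query plan.
    # Expand the {prefix} placeholder to the data location (cwd, set at the end of this cell)
    if "{prefix}" in query:
        query = query.replace('{prefix}', cwd)
    return pd.read_sql(query, thrift_conn)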

def explain(query):
    df = sql("explain " + query)
    plan = df['plan'][0]
    pprint(plan)

# Set the directory of the data to ingest into SNAP
cwd="oci://sparkline-hol-data@paasdevbdc"
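
The `{prefix}` handling in `sql()` is plain string substitution, so it can be tried without a Thrift connection. The sketch below isolates that step; the helper name `expand_prefix`, the table name `demo_events` and the file `events.csv` are hypothetical and only serve as an illustration.

```python
def expand_prefix(query, prefix):
    """Replace every {prefix} placeholder with the storage location."""
    return query.replace("{prefix}", prefix)

# Hypothetical DDL, for illustration only (not a table used in this lab)
ddl = "create table demo_events using csv options (path '{prefix}/events.csv')"
expanded = expand_prefix(ddl, "oci://sparkline-hol-data@paasdevbdc")
print(expanded)

# A query without the placeholder passes through unchanged
unchanged = expand_prefix("show tables", "oci://sparkline-hol-data@paasdevbdc")
```

Queries that never mention `{prefix}` are unaffected, which is why the helper is safe to apply to every statement.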

In [None]:
thrift_conn = hive.Connection(host="129.146.118.175",port=10000)
sql("use EventsDB")
sql("show tables")

## Query-1 on quantity of tickets sold and price paid by date: 
Compare sales of all users to users who liked Jazz and Concerts

In [None]:
query_str="""
with allusers AS ( 
select caldate adate, users_buyer_city, sum(qtysold) all_qnty, sum(pricepaid) all_price 
from salessnap group by caldate,users_buyer_city)
,
someusers AS (
select caldate sdate,users_buyer_city, sum(qtysold) quantity, sum(pricepaid) price 
from salessnap where users_buyer_likeconcerts='TRUE' AND users_buyer_likejazz='TRUE' group by caldate,users_buyer_city)

select adate, allusers.users_buyer_city, quantity, price, round(quantity/all_qnty,2)*100 quantity_ratio, 
        round(price/all_price,2)*100 price_ratio
from allusers join someusers
  on adate=sdate and allusers.users_buyer_city=someusers.users_buyer_city
order by price_ratio desc limit 5000
"""

In [None]:
df=sql(query_str)
df
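
To see what the ratio columns mean, the same comparison can be sketched in pandas on a few made-up rows. This is only an illustration: the values are invented, the `likes_both` flag stands in for the two `users_buyer_like...` filters, and matching the two aggregates on both date and city is an assumption about the intended comparison.

```python
import pandas as pd

# Toy rows standing in for salessnap (values are made up for illustration)
sales = pd.DataFrame({
    "caldate": ["2008-01-01"] * 4,
    "users_buyer_city": ["Austin", "Austin", "Boston", "Boston"],
    "likes_both": [True, False, True, False],  # stand-in for likeconcerts AND likejazz
    "qtysold": [2, 6, 1, 1],
    "pricepaid": [100.0, 300.0, 50.0, 150.0],
})

keys = ["caldate", "users_buyer_city"]
measures = ["qtysold", "pricepaid"]

# Totals for all users and for the Jazz-and-Concerts segment
all_users = sales.groupby(keys, as_index=False)[measures].sum()
some_users = sales[sales["likes_both"]].groupby(keys, as_index=False)[measures].sum()

# Match segment totals to overall totals for the same date and city
merged = some_users.merge(all_users, on=keys, suffixes=("", "_all"))
merged["quantity_ratio"] = (merged["qtysold"] / merged["qtysold_all"]).round(2) * 100
merged["price_ratio"] = (merged["pricepaid"] / merged["pricepaid_all"]).round(2) * 100
print(merged[keys + ["quantity_ratio", "price_ratio"]])
```

Each ratio is the segment's share of that day's total for the city, expressed as a percentage.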

### 1- Draw histogram of number of tickets sold

In [None]:
## Histogram of quantity (binned), with the mean marked by a red rule
bar = alt.Chart(df).mark_bar().encode(
    alt.X('quantity:Q', bin=True, axis=None),
    alt.Y('count()')
)

rule = alt.Chart(df).mark_rule(color='red').encode(
    x='mean(quantity):Q',
    size=alt.value(5)
)

bar + rule

### 2- Draw time series analysis for number of tickets sold daily

In [None]:

## Plot total quantity sold per day for the same segment
alt.Chart(df).mark_line(point=True).encode(
    x='adate',
    y='sum(quantity)'
)

### 3- Plot of quantity of tickets sold by month

In [None]:
df['yearmon']=pd.to_datetime(df['adate'],format="%Y-%m-%d" ).dt.strftime("%Y%m")

In [None]:
quantity_permonth=Chart(df).mark_line().encode( x='yearmon', y='sum(quantity)')
quantity_permonth
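
The `sum(quantity)` aggregate in the chart groups rows by the `yearmon` string. A minimal pandas sketch of the same roll-up, on invented dates and quantities, can be used to check the plotted values:

```python
import pandas as pd

# Made-up daily totals spanning a month boundary
daily = pd.DataFrame({
    "adate": ["2008-01-30", "2008-01-31", "2008-02-01"],
    "quantity": [5, 7, 3],
})

# Same year-month key as in the cell above
daily["yearmon"] = pd.to_datetime(daily["adate"], format="%Y-%m-%d").dt.strftime("%Y%m")

# Equivalent of the chart's sum(quantity) per yearmon
monthly = daily.groupby("yearmon")["quantity"].sum()
print(monthly)
```

Because `yearmon` is a zero-padded `YYYYMM` string, sorting it alphabetically also sorts it chronologically.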

## Query-2: Find quantity sold compared to quantity sold over a 40-day window

In [None]:
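# df still holds the Query-1 result here (one row per date and city), so treat this as a rough check only
df['quantity'].autocorr(lag=40)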

In [None]:
quantity_ratio_40days_window="""
with firstseries AS
(
select caldate adate, sum(qtysold) quantity
from salessnap group by caldate
)

select adate, quantity,
       lead(quantity, 40) over (order by adate desc) as qlag
from firstseries
"""

In [None]:
df=sql(quantity_ratio_40days_window)
df
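
With rows ordered by date descending, `lead(quantity, n)` picks the value n rows further down, which is the value from n days earlier when every date is present. A small pandas sketch of the same idea, using invented data and `n = 2` instead of 40 to keep the output readable:

```python
import pandas as pd

# Made-up daily totals for six consecutive days
daily = pd.DataFrame({
    "adate": pd.date_range("2008-01-01", periods=6, freq="D").strftime("%Y-%m-%d"),
    "quantity": [10, 20, 30, 40, 50, 60],
})

n = 2  # the lab query uses 40

# Order by date descending, then look n rows ahead (SQL: lead(quantity, n))
desc = daily.sort_values("adate", ascending=False).reset_index(drop=True)
desc["qlag"] = desc["quantity"].shift(-n)
print(desc)
```

The oldest `n` dates have no earlier value to pair with, so their `qlag` is missing, just as the SQL window returns NULL for them.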

### 1- Scatter plot of quantity of tickets sold vs quantity of tickets sold 40 days ago

In [None]:
# Log-transform both series in place; rows with no value 40 days earlier stay NaN
df['quantity'] = np.log(df['quantity'])
df['qlag'] = np.log(df['qlag'])
a = Chart(df).mark_circle().encode(x='quantity', y='qlag')
a

### 2- Draw auto correlation plot

In [None]:
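# Lag plot: quantity against its 40-day-earlier value, a visual check of autocorrelation
auto_correlation = Chart(df).mark_circle().encode(x='quantity', y='qlag')
auto_correlation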

In [None]:
df['quantity'].corr(df['qlag'])
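
The correlation above covers a single lag. One way to look at a range of lags is to call `Series.autocorr` once per lag and collect the results, as in this sketch. It uses a synthetic series with a 7-day cycle rather than the lab data, and the lag range 1 to 10 is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic daily series with a 7-day cycle (not lab data)
days = np.arange(140)
series = pd.Series(np.sin(2 * np.pi * days / 7))

# Autocorrelation for lags 1..10
acf = pd.DataFrame({"lag": range(1, 11)})
acf["autocorr"] = [series.autocorr(lag=k) for k in acf["lag"]]

# Lag with the strongest positive correlation
best = int(acf.loc[acf["autocorr"].idxmax(), "lag"])
print(acf)
print("strongest lag:", best)
```

The resulting `acf` frame has one row per lag, so it can be passed to `Chart(acf)` with `x='lag'` and `y='autocorr'` in the same way as the other plots in this notebook.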

In [None]:
df

### Done