# Lab notebook to query data using SNAP indexes

## HOL-EventsData-Lab-Query-I - to perform queries on the data qubes created in the setup notebook

The purpose of this lab is to demonstrate the core capabilities of SparklineData, such as:
1. Slicing and dicing the data across various dimensions and metrics
2. Joining data outside the SNAP qube
3. BI and semantic capabilities such as windowing

In this exercise we will create four views on the Qube Salessnap, each with a filter for one segment of users.

### 1. Segmentation
1. Create four different segments (jazz only, jazz and sports, etc.)
    * JazzOnly - sales to customers who like Jazz (the view filters on the Jazz flag only)
    * JazzAndSports - sales to customers who like both Jazz and Sports
    * JazzNotSports - sales to customers who like Jazz but not Sports
    * SportsOnly - sales to customers who like Sports (the view filters on the Sports flag only)
2. Aggregate the quantity sold and revenue by month for the JazzOnly segment and order by revenue
3. Aggregate the quantity sold and revenue by month for the SportsOnly segment and order by revenue
4. Combine both results and plot quantity sold and revenue by month for the JazzOnly and SportsOnly segments
5. Draw a histogram comparing the two segments

### 2. Repeat Customer Analysis
1. Create a view with the first ticket sale time and the most recent sale time for each customer
2. For each JazzOnly user, get the number of tickets purchased and the price paid in each week since their first transaction
3. For each SportsOnly user, get the number of tickets purchased and the price paid in each week since their first transaction

### 3. Cohort analysis
1. Draw a chart showing cohorts of customers who first bought tickets in the same week and their purchases in each subsequent week
2. Returning sports customers - customers who come back repeatedly

### First, let us set up the notebook environment (Python packages) and connect to the Thrift server

In [None]:
%load_ext autotime

In [None]:
from pprint import pprint

import pandas as pd
from pyhive import hive

import altair as alt
# Star import keeps unqualified Altair names (e.g. Color, Legend) available
from altair import *

# To render charts in the classic Jupyter notebook (not JupyterLab), run the following
alt.renderers.enable('notebook')

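def sql(query, explain=False):
    """Run a query on the Thrift connection and return the result as a DataFrame."""
    # Substitute the {prefix} placeholder with the dataset location (cwd)
    # when creating local tables.
    if "{prefix}" in query:
        query = query.replace('{prefix}', cwd)
    return pd.read_sql(query, thrift_conn)

def explain(query):
    """Pretty-print the query plan returned by EXPLAIN."""
    df = sql("explain " + query)
    plan = df['plan'][0]
    pprint(plan)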
    
# Choose the directory that contains the Dataset to ingest into SNAP
cwd="oci://sparkline-hol-data@paasdevbdc"

# Connection to Thrift server
thrift_conn = hive.Connection(host="129.146.118.175",port=10000)
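
The helper pattern above (expand a `{prefix}` placeholder, then hand the query to `pd.read_sql`) can be tried without a Thrift server. The following is a minimal sketch that uses an in-memory SQLite database as a stand-in connection; the `expand_prefix` function, the `sales` table, and the `oci://bucket@namespace` location are all hypothetical and exist only for illustration.

```python
import sqlite3

import pandas as pd


def expand_prefix(query, prefix):
    """Replace the {prefix} placeholder, if present, with a storage location."""
    return query.replace("{prefix}", prefix)


# In-memory SQLite stands in for the Thrift connection (assumption for this sketch)
conn = sqlite3.connect(":memory:")
conn.execute("create table sales (month integer, pricepaid real)")
conn.executemany("insert into sales values (?, ?)", [(1, 10.0), (1, 5.0), (2, 7.5)])

# Placeholder expansion is plain string substitution
expanded = expand_prefix("load '{prefix}/sales.csv'", "oci://bucket@namespace")
print(expanded)

# pd.read_sql returns the result set as a DataFrame, as in the sql() helper
df = pd.read_sql(
    "select month, sum(pricepaid) as revenue from sales group by month order by month",
    conn,
)
print(df)
```

Queries without the placeholder pass through `expand_prefix` unchanged, which is why the notebook helper can be used for every statement.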

### Use EventsDB schema/database and list out all the existing tables

In [None]:
pd.read_sql('use EventsDB',thrift_conn)
pd.read_sql('show tables',thrift_conn)

## Analysis of user segments 

### Create four different segments based on user interests

In this exercise we will create four views on the Qube Salessnap, each with a filter for one segment of users:
* JazzOnly - sales to customers who like Jazz (the view filters on the Jazz flag only)
* JazzAndSports - sales to customers who like both Jazz and Sports
* JazzNotSports - sales to customers who like Jazz but not Sports
* SportsOnly - sales to customers who like Sports (the view filters on the Sports flag only)

In [None]:
sql( """
create or replace view JazzOnly
as
select 'JazzOnly', *
from salessnap
where users_buyer_likejazz='TRUE'
""")

In [None]:
sql( """
create or replace view JazzANDSports
as
select 'JazzAndSports', *
from salessnap
where users_buyer_likesports='TRUE' and users_buyer_likejazz='TRUE'
""")

In [None]:
sql( """
create or replace view JazzNOTSports
as
select 'JazzNotSports', *
from salessnap
where users_buyer_likesports='FALSE' and users_buyer_likejazz='TRUE'
""")

In [None]:
sql( """
create or replace view SportsOnly
as
select 'SportsOnly', *
from salessnap
where users_buyer_likesports='TRUE' 
""")
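
As the view definitions are written, JazzOnly and SportsOnly each filter on a single flag, so the segments overlap: a buyer who likes both Jazz and Sports appears in JazzOnly, SportsOnly and JazzAndSports. The sketch below reproduces the four filters in pandas on a made-up table to make the overlap visible. The flag column names come from the views above; the rows are invented for illustration.

```python
import pandas as pd

# Toy stand-in for salessnap; the rows are made up for illustration
sales = pd.DataFrame({
    "users_buyer_username": ["ann", "bob", "cat", "dan"],
    "users_buyer_likejazz": ["TRUE", "TRUE", "FALSE", "FALSE"],
    "users_buyer_likesports": ["TRUE", "FALSE", "TRUE", "FALSE"],
    "pricepaid": [100.0, 50.0, 75.0, 20.0],
})

jazz = sales["users_buyer_likejazz"] == "TRUE"
sports = sales["users_buyer_likesports"] == "TRUE"

# Same predicates as the WHERE clauses of the four views
segments = {
    "JazzOnly": sales[jazz],
    "JazzAndSports": sales[jazz & sports],
    "JazzNotSports": sales[jazz & ~sports],
    "SportsOnly": sales[sports],
}

segment_sizes = {name: len(frame) for name, frame in segments.items()}
print(segment_sizes)
```

If a segment of buyers who like Jazz and nothing else is needed, JazzNotSports is the closer match among the four views.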

#### 1.2 Query the JazzOnly segment for type, month of sale, quantity of tickets sold and total revenue, grouped by month and ordered by revenue

In [None]:
df1=sql( """
select 'Jazz' as type, month sdate, sum(qtysold) quantity, sum(pricepaid) revenue 
from JazzOnly group by month order by revenue desc
""")

In [None]:
df1

In [None]:
# Scatter plot of monthly quantity sold against monthly revenue
alt.Chart(df1).mark_circle(
    color='red',
    opacity=0.3
).encode(
    x='quantity:Q',
    y='revenue:Q'
)

#### 1.3 Query the SportsOnly segment for type, month of sale, quantity of tickets sold and total revenue, grouped by month and ordered by revenue

In [None]:
df2=sql( """
select 'Sports' as type , month sdate, sum(qtysold) quantity, sum(pricepaid) revenue 
from SportsOnly group by month order by revenue desc
""")

In [None]:
df2
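
The monthly aggregation in the two queries above (sum of `qtysold` and `pricepaid` per month, ordered by revenue descending) can be sketched offline with a pandas `groupby`. The input rows are invented; only the column names follow the queries.

```python
import pandas as pd

# Made-up sales rows; column names follow the queries above
rows = pd.DataFrame({
    "month": [1, 1, 2, 2, 2],
    "qtysold": [2, 1, 4, 1, 1],
    "pricepaid": [200.0, 80.0, 400.0, 90.0, 10.0],
})

monthly = (
    rows.groupby("month", as_index=False)
        .agg(quantity=("qtysold", "sum"), revenue=("pricepaid", "sum"))
        .rename(columns={"month": "sdate"})
        .sort_values("revenue", ascending=False)   # order by revenue desc
        .reset_index(drop=True)
)
# Constant label column, like the 'Jazz' / 'Sports' literal in the SELECT list
monthly.insert(0, "type", "Jazz")
print(monthly)
```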

Append the JazzOnly and SportsOnly results

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
df3 = pd.concat([df1, df2])
g = df3.groupby('type')

# Print the rows of each segment
for key, item in g:
    print(item, "\n\n")

#### 1.4 Plot quantity sold and revenue by month for JazzOnly and SportsOnly segments

In [None]:
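# One line per segment; without the color channel both segments share a single line
alt.Chart(df3).mark_line(point=True).encode(
    x='sdate',
    y='quantity',
    color='type:N'
)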

In [None]:
# Plot revenue (price paid) for the same segments, one line per segment
alt.Chart(df3).mark_line(point=True).encode(
    x='sdate',
    y='revenue',
    color='type:N'
)

#### 1.5 Draw a histogram of monthly quantity across both segments

In [None]:
# Histogram of binned quantity, with the mean quantity marked by a red rule
bar = alt.Chart(df3).mark_bar().encode(
    alt.X('quantity:Q', bin=True, axis=None),
    alt.Y('count()')
)

rule = alt.Chart(df3).mark_rule(color='red').encode(
    x='mean(quantity):Q',
    size=alt.value(5)
)

bar + rule
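
To see what the layered chart above encodes, the bin counts and the mean can be computed directly. This is a minimal sketch with NumPy on invented monthly quantities; note that Altair's `bin=True` picks its own "nice" bin boundaries, so the edges here will not necessarily match the chart.

```python
import numpy as np

# Made-up monthly quantities standing in for df3['quantity']
quantity = np.array([3, 6, 4, 9, 12, 5])

# Three equal-width bins between the minimum and maximum value
counts, edges = np.histogram(quantity, bins=3)

# The red rule in the chart sits at the mean quantity
mean_quantity = quantity.mean()

print(counts, edges, mean_quantity)
```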

### Analyze Repeat customers

In [None]:
# Creates a view with the first ticket sales time and the most recent user activity date for each customer
sql("""
create or replace view custmin as 
select users_buyer_username, min(saletime) firstsalestime, max(saletime) lastsaletime
from salessnap
where year(saletime) = 2008
group by users_buyer_username
""")
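
The custmin view keeps, for each buyer, the earliest and latest sale time within 2008. A pandas equivalent on invented rows is sketched below; the output column names follow the view, everything else is an assumption for illustration.

```python
import pandas as pd

# Made-up sales; bob has one sale in 2007 that the year filter removes
sales = pd.DataFrame({
    "users_buyer_username": ["ann", "ann", "bob", "bob", "bob"],
    "saletime": pd.to_datetime([
        "2008-01-07", "2008-03-10", "2008-02-04", "2007-12-31", "2008-02-25",
    ]),
})

# where year(saletime) = 2008
in_2008 = sales[sales["saletime"].dt.year == 2008]

# group by users_buyer_username with min/max of saletime
custmin = (
    in_2008.groupby("users_buyer_username", as_index=False)
           .agg(firstsalestime=("saletime", "min"),
                lastsaletime=("saletime", "max"))
)
print(custmin)
```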

In [None]:
# For each JazzOnly user, get the number of tickets purchased and the price paid in each week since their first transaction
df=sql("""
SELECT Weekofyear(firstsalestime)                            AS start,
       ( Weekofyear(saletime) - Weekofyear(firstsalestime) ) AS weeksince,
       Sum(qtysold)                                          quantity,
       Sum(pricepaid)                                        price,
       Count(DISTINCT jonly.users_buyer_username)            AS dist_count
FROM   jazzonly jonly,
       custmin  cust
WHERE  Weekofyear(saletime) - Weekofyear(firstsalestime) < 15
       AND jonly.users_buyer_username = cust.users_buyer_username
       AND Weekofyear(jonly.saletime) > Weekofyear(cust.firstsalestime)
       AND Year(saletime) = 2008
GROUP  BY Weekofyear(firstsalestime),
          ( Weekofyear(saletime) - Weekofyear(firstsalestime) )
ORDER  BY start,
          weeksince
""")

In [None]:
df
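
The query above groups repeat purchases by the week of a buyer's first purchase (`start`) and the number of weeks elapsed since then (`weeksince`). The same logic is sketched below in pandas on invented rows, assuming ISO week numbering via `dt.isocalendar()` (pandas 1.1 or later) as a stand-in for `Weekofyear`.

```python
import pandas as pd

# Made-up purchases, all in January 2008
purchases = pd.DataFrame({
    "users_buyer_username": ["ann", "ann", "ann", "bob", "bob"],
    "saletime": pd.to_datetime(
        ["2008-01-07", "2008-01-14", "2008-01-28", "2008-01-08", "2008-01-15"]),
    "qtysold": [2, 1, 3, 4, 1],
    "pricepaid": [100.0, 40.0, 150.0, 200.0, 60.0],
})

# Week of each buyer's first purchase (the custmin join in the SQL)
first_time = purchases.groupby("users_buyer_username")["saletime"].transform("min")
purchases["start"] = first_time.dt.isocalendar().week.astype(int)

# Weeks elapsed between the first purchase and this purchase
sale_week = purchases["saletime"].dt.isocalendar().week.astype(int)
purchases["weeksince"] = sale_week - purchases["start"]

# Keep repeat purchases only: later than the first week, within 15 weeks
repeat = purchases[(purchases["weeksince"] > 0) & (purchases["weeksince"] < 15)]

cohort = (
    repeat.groupby(["start", "weeksince"], as_index=False)
          .agg(quantity=("qtysold", "sum"),
               price=("pricepaid", "sum"),
               dist_count=("users_buyer_username", "nunique"))
          .sort_values(["start", "weeksince"])
          .reset_index(drop=True)
)
print(cohort)
```

Because the difference is taken between week numbers, this approach (like the SQL) only behaves well within a single calendar year, which is why both restrict the data to 2008.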

In [None]:
# For each SportsOnly user, get the number of tickets purchased and the price paid in each week since their first transaction
# (the query reads from the sportsonly view so that it matches this description)
df2=sql("""
SELECT Weekofyear(firstsalestime)                            AS start,
       ( Weekofyear(saletime) - Weekofyear(firstsalestime) ) AS weeksince,
       Sum(qtysold)                                          quantity,
       Sum(pricepaid)                                        price,
       Count(DISTINCT sonly.users_buyer_username)            AS dist_count
FROM   sportsonly sonly,
       custmin    cust
WHERE  Weekofyear(saletime) - Weekofyear(firstsalestime) < 15
       AND sonly.users_buyer_username = cust.users_buyer_username
       AND Weekofyear(sonly.saletime) > Weekofyear(cust.firstsalestime)
       AND Year(saletime) = 2008
GROUP  BY Weekofyear(firstsalestime),
          ( Weekofyear(saletime) - Weekofyear(firstsalestime) )
ORDER  BY start,
          weeksince
""")

In [None]:
df2

In [None]:
bars = alt.Chart(df2).mark_bar().encode(
    x='quantity',
    y='weeksince:O'
)

bars

### A chart showing cohorts who bought tickets and their behavior in each subsequent week

In [None]:
alt.Chart(df2).mark_text().encode(
    color=alt.Color('quantity:Q', legend=alt.Legend(title='Cohort')),
    column='weeksince:O',
    row='start:O',
    text='dist_count:Q'
)
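
The faceted text chart above lays the cohort result out as a grid: one row per starting week, one column per number of weeks elapsed, and the distinct buyer count in each cell. The same grid can be built as a plain table with a pandas pivot. The sketch below uses invented numbers; only the column names follow the query result.

```python
import pandas as pd

# Made-up cohort result with the same columns as the query output
cohort = pd.DataFrame({
    "start": [2, 2, 3, 3],
    "weeksince": [1, 2, 1, 3],
    "dist_count": [5, 3, 4, 1],
})

# Rows: week of first purchase; columns: weeks since; cells: returning buyers
matrix = (
    cohort.pivot(index="start", columns="weeksince", values="dist_count")
          .fillna(0)       # combinations with no returning buyers
          .astype(int)
)
print(matrix)
```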



In [None]:
alt.Chart(df2,
    description='Returning Sports users.',
).mark_line().encode(
    color='weeksince:O',
    x='start:O',
    y='dist_count:Q'
)

In [None]:
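# Per month for the JazzNotSports segment: average number of weeks between a buyer's
# first and last sale, tickets sold, revenue and distinct buyers
df4=sql("""
select month(saletime) as month,
       avg(weekofyear(lastsaletime) - weekofyear(firstsalestime)) as duration,
       sum(qtysold) as quantity, sum(pricepaid) as revenue,
       count(distinct a.users_buyer_username) as customers
from JazzNotSports a, custmin b
where a.users_buyer_username = b.users_buyer_username
group by month(saletime)
order by month
""")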

In [None]:
df4

#### Done