## Using Presto to Generate Reports

To generate reports for those interested in application usage, we can use Presto via the PyHive connector to load our Hive tables into Pandas dataframes, transform the dataframes to answer our business questions, and then publish the reports as JSON files.

Before we start, let's define two business questions that we would like to answer:

1. What are all the counts per event type?
2. What are all the values that were given for the `user` parameter?

First, let's install the PyHive library.

In [17]:
!pip install pyhive



Next, let's use PyHive to connect to Presto on the port that we exposed in our Docker Compose file. Once connected, we can run a simple query to list the tables that exist in Hive.

In [18]:
from pyhive import presto
import pandas as pd

presto_conn = presto.connect(
    host='0.0.0.0',
    port=8082 # Exposed Presto port (see docker compose file)
)

pd.read_sql_query("SHOW TABLES", presto_conn)

Unnamed: 0,Table
0,all_events
1,event_parameters


Now let's run a query to get all of the data from the `event_parameters` table and load it into a Pandas dataframe.

In [19]:
# https://stackoverflow.com/questions/55988436/how-to-convert-a-presto-query-output-to-a-python-data-frame
event_parameters = pd.read_sql_query("SELECT * FROM event_parameters", presto_conn)
event_parameters.head()

Unnamed: 0,raw_event,timestamp,accept,host,user_agent,event_id,parameter_name,parameter_value
0,"{""event_id"": ""0c5a6636-fed7-4ca2-9097-150b76d4...",2021-12-03 17:24:36.842,,,,0c5a6636-fed7-4ca2-9097-150b76d41276,user,lise
1,"{""event_id"": ""0c5a6636-fed7-4ca2-9097-150b76d4...",2021-12-03 17:24:36.845,,,,0c5a6636-fed7-4ca2-9097-150b76d41276,guild_name,Bright Hearts
2,"{""event_id"": ""335752b1-f695-4e3d-b915-d990c050...",2021-12-03 17:24:36.851,,,,335752b1-f695-4e3d-b915-d990c05004bb,user,lise
3,"{""event_id"": ""335752b1-f695-4e3d-b915-d990c050...",2021-12-03 17:24:36.851,,,,335752b1-f695-4e3d-b915-d990c05004bb,guild_name,Bright Hearts
4,"{""event_id"": ""4c442f9b-b3f6-4634-bbd1-cb4b49bf...",2021-12-03 17:24:36.856,,,,4c442f9b-b3f6-4634-bbd1-cb4b49bfbd1a,user,lise


Let's do the same thing for the data in the `all_events` table.

In [20]:
all_events = pd.read_sql_query("SELECT * FROM all_events", presto_conn)
all_events.head()

Unnamed: 0,raw_event,timestamp,accept,host,user_agent,event_id,event_type
0,"{""event_id"": ""0c5a6636-fed7-4ca2-9097-150b76d4...",2021-12-03 17:24:36.845,*/*,user2.att.com,ApacheBench/2.3,0c5a6636-fed7-4ca2-9097-150b76d41276,get_credit
1,"{""event_id"": ""335752b1-f695-4e3d-b915-d990c050...",2021-12-03 17:24:36.852,*/*,user2.att.com,ApacheBench/2.3,335752b1-f695-4e3d-b915-d990c05004bb,get_credit
2,"{""event_id"": ""4c442f9b-b3f6-4634-bbd1-cb4b49bf...",2021-12-03 17:24:36.857,*/*,user2.att.com,ApacheBench/2.3,4c442f9b-b3f6-4634-bbd1-cb4b49bfbd1a,get_credit
3,"{""event_id"": ""66d57fec-c67f-4f77-b5a7-cb1721b3...",2021-12-03 17:24:36.865,*/*,user2.att.com,ApacheBench/2.3,66d57fec-c67f-4f77-b5a7-cb1721b311cc,get_credit
4,"{""event_id"": ""5bc2bdaa-9fcf-4be0-8f4c-87c30839...",2021-12-03 17:24:36.87,*/*,user2.att.com,ApacheBench/2.3,5bc2bdaa-9fcf-4be0-8f4c-87c30839961c,get_credit


Now let's answer business question #1: What are all the counts per event type? We can do this with a simple `groupby` on our `all_events` dataframe and then write the output to a JSON file named `event_type_count.json`.

In [21]:
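# size() returns a Series of row counts indexed by event_type
event_type_count = all_events.groupby('event_type').size()
# For a Series, the documented orient values are 'split', 'records', 'index', and 'table';
# 'index' writes a {event_type: count} mapping
event_type_count.to_json("event_type_count.json", orient='index')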

Note that `event_type_count.json` should exist in your directory after running the code above.
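To see what the report looks like without a running Presto instance, here is a minimal sketch of the same `groupby(...).size()` and `to_json` steps on a small hand-made dataframe. The `toy_events` frame and its rows are made up for illustration and are not taken from the Hive tables.

```python
import json
import pandas as pd

# Hypothetical stand-in for the `all_events` dataframe (only the column we need)
toy_events = pd.DataFrame({
    "event_type": [
        "get_credit", "join_guild", "get_credit",
        "purchase_sword", "join_guild", "get_credit",
    ]
})

# size() counts rows per group and returns a Series indexed by event_type
toy_counts = toy_events.groupby("event_type").size()

# Without a path argument, to_json returns the JSON text instead of writing a file
report = json.loads(toy_counts.to_json(orient="index"))
print(report)  # {'get_credit': 3, 'join_guild': 2, 'purchase_sword': 1}
```

The resulting JSON is a flat mapping from event type to count, which is the same shape that ends up in `event_type_count.json`.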


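The same counts can also be computed directly in Presto with a SQL aggregation, so that only the summarized rows are returned to Pandas.

In [22]:
# Use a new variable name so the raw `all_events` dataframe loaded earlier is not overwritten
event_type_count_sql = pd.read_sql_query("SELECT event_type, COUNT(event_type) AS event_count FROM all_events GROUP BY event_type", presto_conn)
event_type_count_sql.head()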

Unnamed: 0,event_type,event_count
0,get_credit,301
1,purchase_sword,246
2,join_guild,356
3,leave_guild,124




Now let's answer business question #2: What are all the values that were given for the `user` parameter? We can do this by filtering our `event_parameters` dataframe to the `user` rows, counting each value, and again writing the output to a JSON file. The same cell then runs a join in Presto to break those counts down by event type.

In [23]:
# Question: What are all the values that were given for the `user` parameter?
# Keep only the rows for the `user` parameter, then count each value
user_rows = event_parameters[event_parameters['parameter_name'] == 'user']
user_parameter_count = user_rows.groupby('parameter_value').size()
user_parameter_count.to_json("user_parameter_count.json", orient='index')

# Break the counts down by user and event type with a join in Presto
events_by_user = pd.read_sql_query("""
    SELECT
        un.parameter_value AS user,
        et.event_type AS event,
        COUNT(un.parameter_value) AS user_event_count
    FROM all_events et
    JOIN event_parameters un
        ON et.event_id = un.event_id
        AND un.parameter_name = 'user'
    GROUP BY un.parameter_value, et.event_type
    ORDER BY et.event_type, un.parameter_value
""", presto_conn)

events_by_user.head(50)


Unnamed: 0,user,event,user_event_count
0,aastha,get_credit,18
1,ben,get_credit,78
2,don,get_credit,111
3,lise,get_credit,26
4,theresa,get_credit,68
5,aastha,join_guild,182
6,ben,join_guild,33
7,don,join_guild,49
8,lise,join_guild,54
9,theresa,join_guild,38
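Since both tables are already loaded as dataframes, the join above could also be expressed in Pandas. The following is a rough sketch on two miniature, made-up tables (`toy_all_events` and `toy_event_parameters` are hypothetical names and values, not real data), intended only to show how the SQL clauses map onto `merge` and `groupby`.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables (made-up ids and values)
toy_all_events = pd.DataFrame({
    "event_id": ["e1", "e2", "e3", "e4"],
    "event_type": ["get_credit", "get_credit", "join_guild", "join_guild"],
})
toy_event_parameters = pd.DataFrame({
    "event_id": ["e1", "e2", "e3", "e3", "e4", "e4"],
    "parameter_name": ["user", "user", "user", "guild_name", "user", "guild_name"],
    "parameter_value": ["lise", "ben", "lise", "Bright Hearts", "lise", "Bright Hearts"],
})

# Keep only the `user` rows, mirroring `un.parameter_name = 'user'`
user_rows = toy_event_parameters[toy_event_parameters["parameter_name"] == "user"]

# Inner join on event_id, mirroring the SQL JOIN ... ON clause
joined = toy_all_events.merge(user_rows, on="event_id", how="inner")

# Count rows per (event_type, user), then flatten the result back into columns
toy_events_by_user = (
    joined.groupby(["event_type", "parameter_value"])
    .size()
    .reset_index(name="user_event_count")
    .rename(columns={"parameter_value": "user", "event_type": "event"})
    .sort_values(["event", "user"])
    .reset_index(drop=True)
)
print(toy_events_by_user)
```

Doing the aggregation in Presto keeps the data transfer small, while the Pandas version can be convenient when the dataframes are already in memory.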


Now that we have answered the business questions and generated JSON reports with the answers, we can close our Presto connection.

In [24]:
presto_conn.close()
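If a query raises an exception partway through the notebook, the explicit `close()` call may never run. One way to guard against that is `contextlib.closing`, which calls `close()` on any object when the `with` block exits. The sketch below uses an in-memory `sqlite3` connection purely as a stand-in so that it runs without a Presto server; the assumption is that the PyHive connection can be wrapped the same way because it exposes `close()`.

```python
import sqlite3
from contextlib import closing

# sqlite3 stands in for presto.connect(...) here so the sketch runs without a server
with closing(sqlite3.connect(":memory:")) as conn:
    cur = conn.cursor()
    cur.execute("SELECT 1 + 1")
    result = cur.fetchone()[0]

# After the with-block, closing() has already called conn.close()
try:
    conn.execute("SELECT 1")
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False

print(result, still_open)  # 2 False
```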