# Python Charmers 

## Mini Workshop 1: Hyper files

### Lesson Overview
- **Objective:** Explore building and working with Hyper files using Python.
- **Source materials:** [Pantab's documentation](https://pantab.readthedocs.io/)
- **Prerequisites:** [Lesson 3 Getting Started with Pandas](./fundamentals-03-getting-started-with-pandas.ipynb)
- **Duration:** 45 mins

Hyper files are designed to store data extracted from various sources in a highly optimized way for quick aggregation, querying, and analysis in Tableau. The Hyper API allows users and developers to create and interact with Hyper files programmatically, and we can do this simply with Python.

## Python Libraries

For these tasks we'll need 3 libraries installed:
- pandas - for working with dataframes
- pantab - for converting dataframes to hyper extracts
- tableauhyperapi - for accessing the full hyper API

Please ensure you can run the script below. If not, you will need to install the libraries as detailed in [Lesson 2 Packages](./fundamentals-02-packages.ipynb); it is recommended that you then close this file and start a new notebook file.

In [1]:
import pandas as pd
import pantab
import tableauhyperapi
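
If one of the imports above fails, it can help to see exactly which libraries are missing before installing anything. The sketch below uses the standard library's `importlib.util.find_spec` to look for each name without importing it. The `missing_libraries` helper is a hypothetical name used for illustration only.

```python
from importlib.util import find_spec

# The three libraries this workshop relies on
REQUIRED = ["pandas", "pantab", "tableauhyperapi"]


def missing_libraries(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_libraries(REQUIRED)
if missing:
    print("Please install:", ", ".join(missing))
else:
    print("All libraries are installed")
```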

## Writing to a Hyper Extract

The example below writes a file named "example.hyper", which Tableau can then connect to for reporting.

In [5]:
# Import libraries
import pandas as pd
import pantab

# create dataframe
df = pd.DataFrame([
    ["dog", 4],
    ["cat", 4],
], columns=["animal", "num_of_legs"])

# convert dataframe to hyper file and save locally
pantab.frame_to_hyper(df, "example.hyper", table="animals")

# Note: as hyper files can support multiple data sources, table="animals" gives a name to the one being added
# You'll see "animals" when you connect this file to Tableau

You'll notice we've generated a .log file. To prevent this happening with each process, we can pass a HyperProcess with logging disabled to our pantab function.

In [10]:
# Import libraries
import pandas as pd
import pantab
from tableauhyperapi import HyperProcess, Telemetry

# set parameters for no logging and define hyper process
parameters = {"log_config": "", "default_database_version": "1"}
with HyperProcess(Telemetry.SEND_USAGE_DATA_TO_TABLEAU, parameters=parameters) as hyper:
    
    # create dataframe
    df = pd.DataFrame([
        ["dog", 4],
        ["cat", 4],
    ], columns=["animal", "num_of_legs"])
    
    # convert dataframe to hyper file and save locally
    pantab.frame_to_hyper(df, "example.hyper", table="animals", hyper_process = hyper)
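
Disabling logging only affects new runs, so log files written by earlier runs stay on disk. A minimal sketch for finding them with the standard library follows. It assumes the logs sit in the notebook's working folder and end in `.log`; `find_log_files` is a hypothetical helper, not part of pantab or the Hyper API.

```python
from pathlib import Path


def find_log_files(folder="."):
    """Return the .log files directly inside `folder`, sorted by name."""
    return sorted(Path(folder).glob("*.log"))


for log_file in find_log_files():
    # Only print here; delete with log_file.unlink() once you are sure
    print(log_file.name, log_file.stat().st_size, "bytes")
```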

## CSV file to Hyper

Converting files to hyper is very quick with the **pantab** library.

In [32]:
import pandas as pd
import pantab

# create dataframe
df = pd.read_csv('../data/2019_Yellow_Taxi_Trip_Data.csv')

# convert dataframe to hyper file and save locally
pantab.frame_to_hyper(df, "taxi_trips.hyper", table="taxi")

## Hyper file to CSV

Similarly, converting files from hyper is very quick with the **pantab** library.

In [14]:
import pandas as pd
import pantab

# read hyper file to dataframe
df = pantab.frame_from_hyper("taxi_trips.hyper", table="taxi")

# save dataframe "df" as csv
df.to_csv('taxi_trips.csv', index=False)

## Have a hyper file but don't know the contents?

The script below takes a hyper file and prints its schemas, tables and columns. It comes from [Community-Supported/list-hyper-contents](https://github.com/tableau/hyper-api-samples/tree/main/Community-Supported/list-hyper-contents).

In [18]:
# Lists the schemas, tables, and columns inside a Hyper file
from tableauhyperapi import HyperProcess, Telemetry, Connection, CreateMode, Nullability

hyper_file = "taxi_trips.hyper"

# Start Hyper and connect to our Hyper file
with HyperProcess(telemetry=Telemetry.SEND_USAGE_DATA_TO_TABLEAU) as hyper:
    with Connection(hyper.endpoint, hyper_file, CreateMode.NONE) as connection:
        # The `connection.catalog` provides us with access to the meta-data we are interested in
        catalog = connection.catalog

        # Iterate over all schemas and print them
        schemas = catalog.get_schema_names()
        print(f"{len(schemas)} schemas:")
        for schema_name in schemas:
            # For each schema, iterate over all tables and print them
            tables = catalog.get_table_names(schema=schema_name)
            print(f" * Schema {schema_name}: {len(tables)} tables")
            for table in tables:
                # For each table, iterate over all columns and print them
                table_definition = catalog.get_table_definition(name=table)
                print(f"  -> Table {table.name}: {len(table_definition.columns)} columns")
                for column in table_definition.columns:
                    nullability = " NOT NULL" if column.nullability == Nullability.NOT_NULLABLE else ""
                    collation = " " + column.collation if column.collation is not None else ""
                    print(f"    -> {column.name} {column.type}{nullability}{collation}")

1 schemas:
 * Schema "public": 1 tables
  -> Table "taxi": 18 columns
    -> "vendorid" BIG_INT NOT NULL
    -> "tpep_pickup_datetime" TEXT
    -> "tpep_dropoff_datetime" TEXT
    -> "passenger_count" BIG_INT NOT NULL
    -> "trip_distance" DOUBLE
    -> "ratecodeid" BIG_INT NOT NULL
    -> "store_and_fwd_flag" TEXT
    -> "pulocationid" BIG_INT NOT NULL
    -> "dolocationid" BIG_INT NOT NULL
    -> "payment_type" BIG_INT NOT NULL
    -> "fare_amount" DOUBLE
    -> "extra" DOUBLE
    -> "mta_tax" DOUBLE
    -> "tip_amount" DOUBLE
    -> "tolls_amount" DOUBLE
    -> "improvement_surcharge" DOUBLE
    -> "total_amount" DOUBLE
    -> "congestion_surcharge" DOUBLE
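
In the listing above, the two datetime columns are stored as TEXT, which suggests pandas read them as plain strings. One possible fix is to convert them with `pd.to_datetime` before writing the hyper file. The sketch below uses two made-up rows in ISO format; the real CSV may use a different date format, so treat the values as assumptions.

```python
import pandas as pd

# Made-up sample rows; the column names come from the listing above
df = pd.DataFrame({
    "tpep_pickup_datetime": ["2019-01-01 00:46:40", "2019-01-01 00:59:47"],
    "tpep_dropoff_datetime": ["2019-01-01 00:53:20", "2019-01-01 01:18:59"],
})

# Convert the string columns to proper datetime columns
for column in ["tpep_pickup_datetime", "tpep_dropoff_datetime"]:
    df[column] = pd.to_datetime(df[column])

print(df.dtypes)
```

Writing the converted dataframe with pantab should then give a timestamp column rather than TEXT, though it is worth re-running the listing script to confirm.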


## Reading and Writing Multiple Tables

**frames_to_hyper** writes a dictionary of DataFrames to a Hyper extract, and **frames_from_hyper** reads a Hyper extract back into a dictionary of DataFrames.

In [11]:
import pandas as pd
import pantab
from tableauhyperapi import TableName

# dataframes to write to hyper
taxi_df = pd.read_csv('../data/2019_Yellow_Taxi_Trip_Data.csv')
hols_df = pd.read_csv('../data/holidays.csv')

# define a dictionary mapping each table name (key) to its dataframe (value)
dict_of_frames = {
    "taxis": taxi_df,
    "holidays": hols_df
}

pantab.frames_to_hyper(dict_of_frames, "multi.hyper")

# Reading this hyper file will return a dictionary of dataframes
result = pantab.frames_from_hyper("multi.hyper")
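
Because the result is an ordinary dictionary, it can be looped over like any other. The sketch below uses two small stand-in dataframes instead of reading `multi.hyper`, and converts each key with `str()` because the keys pantab returns may not be plain strings.

```python
import pandas as pd

# Stand-in for the dictionary returned by frames_from_hyper (made-up data)
result = {
    "taxis": pd.DataFrame({"trip_distance": [1.5, 2.6, 0.8]}),
    "holidays": pd.DataFrame({"holiday": ["New Year's Day"]}),
}

# Print each table name with its row and column counts
for table_name, frame in result.items():
    rows, columns = frame.shape
    print(f"{str(table_name)}: {rows} rows, {columns} columns")
```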

## Appending Data to Existing Tables

By default, **frame_to_hyper** and **frames_to_hyper** will fully drop and reload the targeted tables. However, you can also append records to existing tables by supplying table_mode="a" as a keyword argument.

In [15]:
import pandas as pd
import pantab

df = pd.DataFrame([
    ["dog", 4],
    ["cat", 4],
], columns=["animal", "num_of_legs"])

pantab.frame_to_hyper(df, "animals.hyper", table="animals")

new_data = pd.DataFrame([["moose", 4]], columns=["animal", "num_of_legs"])

# Instead of overwriting the animals table, we can append via table_mode
pantab.frame_to_hyper(new_data, "animals.hyper", table="animals", table_mode="a")

# Please note that table_mode="a" will create the table(s) if they do not already exist.
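
To picture what the append does, the sketch below builds the expected table contents in pandas with `pd.concat`. This only mirrors the outcome; it does not touch the hyper file, and pantab's internals may work differently.

```python
import pandas as pd

df = pd.DataFrame([["dog", 4], ["cat", 4]], columns=["animal", "num_of_legs"])
new_data = pd.DataFrame([["moose", 4]], columns=["animal", "num_of_legs"])

# Appending keeps the existing rows and adds the new ones at the end
expected = pd.concat([df, new_data], ignore_index=True)
print(expected)
```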

## Issuing SQL queries

With **frame_from_hyper_query**, one can execute SQL queries against a Hyper file and retrieve the resulting data as a DataFrame. This can be used, e.g. to retrieve only a part of the data (using a WHERE clause) or to offload computations to Hyper.

In [17]:
import pandas as pd
import pantab

# create dataframe
df = pd.read_csv('../data/2019_Yellow_Taxi_Trip_Data.csv')

# convert dataframe to hyper file and save locally
pantab.frame_to_hyper(df, "taxi_trips.hyper", table="taxi")

# Read a subset of the data from the Hyper file
query = """
SELECT 
    passenger_count, 
    COUNT(*) as trips,
    SUM(total_amount) as total_fares
FROM taxi
WHERE passenger_count > 1
GROUP BY passenger_count
"""
df = pantab.frame_from_hyper_query("taxi_trips.hyper", query)
print(df)

   passenger_count  trips  total_fares
0                5    336      7227.88
1                3    425      9945.16
2                4    177      4579.05
3                2   1475     35462.18
4                6    203      4401.86
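
For comparison, the same aggregation can be written in pandas, which is handy for checking a query's result on a small sample. The rows below are made up for illustration and are not taken from the taxi dataset.

```python
import pandas as pd

# Made-up sample rows with the two columns the query uses
trips = pd.DataFrame({
    "passenger_count": [1, 2, 2, 3],
    "total_amount": [10.0, 12.5, 7.5, 20.0],
})

# WHERE passenger_count > 1, then GROUP BY passenger_count
summary = (
    trips[trips["passenger_count"] > 1]
    .groupby("passenger_count", as_index=False)
    .agg(trips=("total_amount", "size"), total_fares=("total_amount", "sum"))
)
print(summary)
```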


## Publish a Hyper File

To publish an extract we'll need to work with the **tableauserverclient** library, like we did at the end of [Lesson 8 Loops](./fundamentals-08-loops.ipynb).

In [6]:
# pip install tableauserverclient via terminal, if this package is not found
import tableauserverclient as TSC

# If you don't have a Tableau Cloud account you can sign up for free here: 
# Tableau Developer Programme: https://www.tableau.com/en-gb/developer

# There are two ways to authenticate; which one you choose depends on how you log in to your Tableau Server:
# - Username & Password
# - Personal Access Token

# However, we do not want to share these details with the world,
# so we will read these values from a local file, "config".
# This means you can share this script without compromising your access.

In [9]:
# TO DO

# Under python_charmers/data is a json file "config_lesson_8.json"
# Download this file and fill in your login details, Username & Password or Personal Access Token
# Reupload this file and run the script below, you should see either your username or PAT name below

import json

with open('../data/config_lesson_8.json', 'r') as file:
    config = json.load(file)

username = config['username']
password = config['password']
pat_name = config['pat_name']
pat_secret = config['pat_secret']
server_url = config['server_url']
site_name = config['site_name']

print(username)
print(pat_name)

will.sutton
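
The structure of the config file is not shown in this lesson, so the sketch below assumes a flat JSON object with the six keys read above. It writes a placeholder file to a temporary folder, loads it back, and fails early if a key is missing. The placeholder values and the `load_config` helper are hypothetical.

```python
import json
import tempfile
from pathlib import Path

# Keys the script above reads from the config file
EXPECTED_KEYS = ["username", "password", "pat_name", "pat_secret", "server_url", "site_name"]


def load_config(path):
    """Load the JSON config and fail early if a key is missing."""
    with open(path, "r") as file:
        config = json.load(file)
    missing = [key for key in EXPECTED_KEYS if key not in config]
    if missing:
        raise KeyError(f"Missing config keys: {missing}")
    return config


# Placeholder values only; never commit real credentials
placeholder = {key: "" for key in EXPECTED_KEYS}
placeholder["server_url"] = "https://example.com"

with tempfile.TemporaryDirectory() as folder:
    config_path = Path(folder) / "config_example.json"
    config_path.write_text(json.dumps(placeholder, indent=2))
    config = load_config(config_path)

print(sorted(config))
```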



### 1. Login with either Username & Password or Personal Access Token

Run whichever of the two code blocks below matches your login method. The other code block can be deleted.

In [10]:
# Username & Password - Tableau Auth
tableau_auth = TSC.TableauAuth(username, password, site_name)
server = TSC.Server(server_url, use_server_version=True)
server.auth.sign_in(tableau_auth)
print('login successful')

login successful


In [None]:
# Personal Access Token - Tableau Auth
tableau_auth = TSC.PersonalAccessTokenAuth(pat_name, pat_secret, site_name)
server = TSC.Server(server_url, use_server_version=True)
server.auth.sign_in(tableau_auth)
print('login successful')
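
Rather than deleting a code block by hand, the choice could also be made from the config values. The sketch below is one hypothetical way to do that: it prefers a Personal Access Token when both token fields are filled in and falls back to username and password otherwise. It only returns a label and does not call tableauserverclient.

```python
def choose_auth_method(config):
    """Return "pat" or "password" depending on which credentials are filled in."""
    if config.get("pat_name") and config.get("pat_secret"):
        return "pat"
    if config.get("username") and config.get("password"):
        return "password"
    raise ValueError("No complete set of credentials found in config")


# Made-up config values for illustration
example_config = {"username": "", "password": "", "pat_name": "my-token", "pat_secret": "secret"}
print(choose_auth_method(example_config))
```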

### 2. Find a project folder to store your extract

In [27]:
import pandas as pd

with server.auth.sign_in(tableau_auth):
    # Initialize lists to store project names and IDs
    project_names = []
    project_ids = []

    # request up to 1000 projects in a single page
    req_option = TSC.RequestOptions(pagesize=1000)
    all_projects, pagination_item = server.projects.get(req_option)

    for project in all_projects:
        project_names.append(project.name)
        project_ids.append(project.id)
    
proj_df = pd.DataFrame()
proj_df['project_name'] = project_names
proj_df['target_project_id'] = project_ids
print(proj_df)

                        project_name                     target_project_id
0                            Default  224b0122-fad5-11e3-b3b7-273002f82ba8
1                           Finances  224b0122-fad5-11e3-8c56-5bc312a8caf8
2                          Marketing  224b283c-fad5-11e3-b609-2b4df1a13689
3                 Monitoring Reports  7014fc06-522c-4a28-b32d-f0a313f53efb
4                            Archive  c2798c6e-664e-4a47-986e-1638820ae9f3
..                               ...                                   ...
195                               JP  5ca4f2ba-5f55-454e-bb37-3a0599da1da5
196                           Sorcha  98632413-0fc5-4b78-8fa0-3216b643b638
197  External Assets Default Project  c461ef7e-11ec-462d-a1e1-52a5a482e1b6
198                         Pat Test  452f3a3e-5af7-4be8-a576-487e9a5e8919
199                  Pat Test Folder  0d0d5349-38e1-4b80-a8e0-83e39a8b1cbb

[200 rows x 2 columns]


In [31]:
my_proj = proj_df[proj_df['project_name'].str.contains('Will')]

print(my_proj)

    project_name                     target_project_id
104  Will Sutton  2d4c61c7-8704-4128-b658-c767b01240b2
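
`str.contains` can match more than one project, for example both "Pat Test" and "Pat Test Folder" in the listing above. When the exact project name is known, an equality filter avoids that ambiguity. The sketch below rebuilds two rows from that listing and pulls out a single id.

```python
import pandas as pd

# Two rows copied from the project listing above
proj_df = pd.DataFrame({
    "project_name": ["Pat Test", "Pat Test Folder"],
    "target_project_id": [
        "452f3a3e-5af7-4be8-a576-487e9a5e8919",
        "0d0d5349-38e1-4b80-a8e0-83e39a8b1cbb",
    ],
})

# Exact match returns one row, whereas str.contains("Pat Test") would return both
matches = proj_df[proj_df["project_name"] == "Pat Test"]
my_proj_id = matches["target_project_id"].iloc[0]
print(my_proj_id)
```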


### 3. Submitting a request to publish your hyper file

Using the project folder id, the hyper file path, and your Tableau connection, you can publish this extract to a Tableau Server/Cloud.

In [34]:
## TO DO 
# Enter your project folder id

my_proj_id = ''
hyper_filepath = '../data/taxi_trips.hyper'
display_name = '2019 Taxi Trips'

with server.auth.sign_in(tableau_auth):

    # Use the project id to create a new DatasourceItem
    new_datasource = TSC.DatasourceItem(project_id=my_proj_id, name=display_name)

    # Publish the data source (the file at hyper_filepath)
    new_datasource = server.datasources.publish(new_datasource, hyper_filepath, 'CreateNew')
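
Since `my_proj_id` starts out empty and the hyper file path is relative, a quick check before publishing can give a clearer error than a failed request. This is a minimal sketch using only the standard library; `check_publish_inputs` is a hypothetical helper and not part of tableauserverclient.

```python
from pathlib import Path


def check_publish_inputs(project_id, hyper_filepath):
    """Return a list of problems found; an empty list means ready to publish."""
    problems = []
    if not project_id:
        problems.append("project id is empty")
    path = Path(hyper_filepath)
    if path.suffix != ".hyper":
        problems.append(f"{path.name} does not end in .hyper")
    elif not path.is_file():
        problems.append(f"{path} was not found")
    return problems


for problem in check_publish_inputs("", "../data/taxi_trips.hyper"):
    print("Fix before publishing:", problem)
```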


## Additional Resources
- 📰 **hyper-api-samples** - Tableau Hyper Samples - https://github.com/tableau/hyper-api-samples
- 📺 **Tableau Dev Day 2023** - Hyper API Running in a Google Cloud Function | DataDev Day June 2023 - https://www.youtube.com/watch?v=j159gCmei8Q&list=PL_qx68DwhYA_5t8pl7r-GhAVBsT0ZvF4o&index=6
- 📰 **tableau.github.io** - Tableau Server Client (Python) - https://tableau.github.io/server-client-python/docs/

## Summary

In this workshop we created hyper files, appended data to them, queried the data within them, and lastly saw how we can publish these data sources to Tableau Server/Cloud.