# Rasgo User Defined Transformations

This notebook introduces User Defined Transformations (UDTs) in Rasgo and shows how to use them from Python with the PyRasgo package to transform and publish data on Rasgo.

### Packages

This tutorial uses:
* [pandas](https://pandas.pydata.org/docs/)
* [PyRasgo](https://app.gitbook.com/@rasgo/s/rasgo-docs/pyrasgo-0.1/dataframe-prep)

In [1]:
import pandas as pd
import pyrasgo

## Access Rasgo

### Create account

Next, click [here](https://app.rasgoml.com/account/register) to create an account on the Rasgo UI. Fill in the required information on the web page.

<p align="center">
  <img src="img/RasgoAccountRegistration.png" alt="Rasgo Account Registration" width="512">
</p>

You can close the browser tab, as you will receive an email from Rasgo to verify your email address. Click the **Verify Email** button in that email to verify.

<p align="center">
  <img src="img/RasgoWelcome.png" alt="Verify Email" width="390">
</p>

This will open a browser tab where you can log into the UI.

### Log into Rasgo UI

Enter your username and password and click **Login**.

<p align="center">
  <img src="img/RasgoLogin.png" alt="Login to Rasgo" width="528">
</p>

to be taken to the Rasgo App homepage.

### Copy your API Key

Click the **API KEY** button in the upper right of the screen

<img src="img/APIKEY.png" alt="Copy API Key" width="128">

to copy your API key to the clipboard.

## Work with PyRasgo

Paste the API key below.

In [None]:
API_KEY = '<YOUR API KEY>'
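As an alternative to pasting the key into the notebook, here is a minimal sketch that reads it from an environment variable. The variable name `RASGO_API_KEY` and the helper `load_api_key` are assumptions made for this example; PyRasgo does not define or read either of them.

```python
import os


# RASGO_API_KEY is a hypothetical variable name chosen for this sketch;
# PyRasgo itself does not define or read it.
def load_api_key(env_var="RASGO_API_KEY", placeholder="<YOUR API KEY>"):
    """Return the key stored in env_var, or the placeholder when it is unset."""
    return os.environ.get(env_var, placeholder)


API_KEY = load_api_key()
```

This keeps the key out of the saved notebook, which may matter if the notebook is shared or committed to version control.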

### Connect to Rasgo

In [3]:
rasgo = pyrasgo.connect(API_KEY)

### Transform Data

This tutorial will work with [Daily Dark Sky Weather Data](https://app.rasgoml.com/sources/2). First, call `rasgo.get.data_source` to allow us to work with this data.

In [4]:
datasource = rasgo.get.data_source(id=2)
datasource

Source(id=2, name=Dark Sky: Daily, sourceType=Table, table=RASGOCOMMUNITY.PUBLIC.DARKSKY_DAILY_FEATURES)

Call `rasgo.get.transforms` to get a list of supported transforms already loaded into Rasgo.

In [5]:
transforms = rasgo.get.transforms()
transforms

[Transform(name='rasgo_datetrunc', sourceCode='\n{% set date_list = None %}\n{%- if date_column is string -%}\n    {% set date_list = [date_column] %}\n{% else %}\n    {% set date_list = date_column %}\n{%- endif -%}\n\nSELECT *, \n{%- for col in date_list -%}\n    DATE_TRUNC({{date_part}}, {{col}}) as {{col}}_{{date_part}} {{ ", " if not loop.last else "" }}\n{%- endfor -%}\nfrom {{ source_table }}', id=66, arguments=['date_part', 'date_column']),
 Transform(name='rasgo_lag', sourceCode='\nSELECT *, \n    {% for col in Columns %}\n        {%- for amount in Amounts -%}\n            lag({{col}}, {{amount}}) over (partition by {{Partition}} order by {{OrderBy}}) as Lag_{{col}}_{{amount}}{{ ", " if not loop.last else "" }}\n        {%- endfor -%}\n        {{ ", " if not loop.last else "" }}\n    {% endfor %}\nfrom {{ source_table }}', id=64, arguments=['Partition', 'Amounts', 'OrderBy', 'Columns'])]

Calling `datasource.transform` with one of these transformations will apply the transformation to the datasource from above.  In this case, we will create **lags** of 1 day and 1 week for the features: *DS_DAILY_HIGH_TEMP*, *DS_DAILY_LOW_TEMP*, *DS_DAILY_TOTAL_RAINFALL*, *DS_DAILY_WINDSPEED*, *DS_WEATHER_ICON*.

The parameters we need to specify for the transform are:
- *transform_name*: the transform we want to call. In this case, `rasgo_lag`
- *Columns*: the list of columns/features to transform. In this case, the list of features above
- *Amounts*: the number of days to lag. In this case, 1 and 7
- *Partition*: if there is more than one series in the data, the field that identifies the series. In this case, *FIPS*, which represents the location of the series
- *OrderBy*: the date field used to order the rows for the lag. In this case, *DATE*

In [6]:
newsource = datasource.transform(transform_name='rasgo_lag',
                                 Columns = ['DS_DAILY_HIGH_TEMP', 'DS_DAILY_LOW_TEMP', 'DS_DAILY_TOTAL_RAINFALL', 
                                            'DS_DAILY_WINDSPEED', 'DS_WEATHER_ICON'],
                                 Amounts = [1, 7],
                                 Partition = 'FIPS',
                                 OrderBy = 'DATE')

The function `preview_sql` shows the SQL `SELECT` statement that will be run to perform the transformation.

In [7]:
newsource.preview_sql()

'\nSELECT *, \n    lag(DS_DAILY_HIGH_TEMP, 1) over (partition by FIPS order by DATE) as Lag_DS_DAILY_HIGH_TEMP_1, lag(DS_DAILY_HIGH_TEMP, 7) over (partition by FIPS order by DATE) as Lag_DS_DAILY_HIGH_TEMP_7, \n    lag(DS_DAILY_LOW_TEMP, 1) over (partition by FIPS order by DATE) as Lag_DS_DAILY_LOW_TEMP_1, lag(DS_DAILY_LOW_TEMP, 7) over (partition by FIPS order by DATE) as Lag_DS_DAILY_LOW_TEMP_7, \n    lag(DS_DAILY_TOTAL_RAINFALL, 1) over (partition by FIPS order by DATE) as Lag_DS_DAILY_TOTAL_RAINFALL_1, lag(DS_DAILY_TOTAL_RAINFALL, 7) over (partition by FIPS order by DATE) as Lag_DS_DAILY_TOTAL_RAINFALL_7, \n    lag(DS_DAILY_WINDSPEED, 1) over (partition by FIPS order by DATE) as Lag_DS_DAILY_WINDSPEED_1, lag(DS_DAILY_WINDSPEED, 7) over (partition by FIPS order by DATE) as Lag_DS_DAILY_WINDSPEED_7, \n    lag(DS_WEATHER_ICON, 1) over (partition by FIPS order by DATE) as Lag_DS_WEATHER_ICON_1, lag(DS_WEATHER_ICON, 7) over (partition by FIPS order by DATE) as Lag_DS_WEATHER_ICON_7\n   

The function `preview` returns a dataframe with the first 10 transformed rows, so you can check the transformation.

In [8]:
df = newsource.preview()
df.columns

Index(['FIPS', 'DS_DAILY_HIGH_TEMP', 'DS_DAILY_LOW_TEMP',
       'DS_DAILY_TEMP_VARIATION', 'DS_WEATHER_ICON', 'IS_CLEAR_DAY',
       'IS_CLOUDY', 'IS_PARTLY_CLOUDY', 'IS_RAINY', 'IS_SNOWY', 'IS_WINDY',
       'DS_DAILY_HUMIDITY', 'DS_DAILY_WINDSPEED', 'DS_DAILY_CLOUDCOVER',
       'DS_SNOW_INDICATOR', 'DS_DAILY_TOTAL_RAINFALL',
       'DS_TEMP_VARIATION_PREVIOUS_DAY', 'DATE', 'LAG_DS_DAILY_HIGH_TEMP_1',
       'LAG_DS_DAILY_HIGH_TEMP_7', 'LAG_DS_DAILY_LOW_TEMP_1',
       'LAG_DS_DAILY_LOW_TEMP_7', 'LAG_DS_DAILY_TOTAL_RAINFALL_1',
       'LAG_DS_DAILY_TOTAL_RAINFALL_7', 'LAG_DS_DAILY_WINDSPEED_1',
       'LAG_DS_DAILY_WINDSPEED_7', 'LAG_DS_WEATHER_ICON_1',
       'LAG_DS_WEATHER_ICON_7'],
      dtype='object')

In [9]:
df

| | FIPS | DS_DAILY_HIGH_TEMP | DS_DAILY_LOW_TEMP | DS_DAILY_TEMP_VARIATION | DS_WEATHER_ICON | IS_CLEAR_DAY | IS_CLOUDY | IS_PARTLY_CLOUDY | IS_RAINY | IS_SNOWY | ... | LAG_DS_DAILY_HIGH_TEMP_1 | LAG_DS_DAILY_HIGH_TEMP_7 | LAG_DS_DAILY_LOW_TEMP_1 | LAG_DS_DAILY_LOW_TEMP_7 | LAG_DS_DAILY_TOTAL_RAINFALL_1 | LAG_DS_DAILY_TOTAL_RAINFALL_7 | LAG_DS_DAILY_WINDSPEED_1 | LAG_DS_DAILY_WINDSPEED_7 | LAG_DS_WEATHER_ICON_1 | LAG_DS_WEATHER_ICON_7 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20117 | 64.25 | 26.32 | 37.93 | clear-day | 1 | 0 | 0 | 0 | 0 | ... | | | | | | | | | | |
| 1 | 20117 | 65.29 | 39.89 | 25.4 | partly-cloudy-day | 0 | 0 | 1 | 0 | 0 | ... | 64.25 | | 26.32 | | 3.5e-05 | | 10.52 | | clear-day | |
| 2 | 20117 | 53.88 | 30.61 | 23.27 | partly-cloudy-day | 0 | 0 | 1 | 0 | 0 | ... | 65.29 | | 39.89 | | 2e-05 | | 7.64 | | partly-cloudy-day | |
| 3 | 20117 | 59.69 | 32.51 | 27.18 | clear-day | 1 | 0 | 0 | 0 | 0 | ... | 53.88 | | 30.61 | | 8.8e-05 | | 6.35 | | partly-cloudy-day | |
| 4 | 20117 | 62.31 | 32.39 | 29.92 | clear-day | 1 | 0 | 0 | 0 | 0 | ... | 59.69 | | 32.51 | | 5.4e-05 | | 8.71 | | clear-day | |
| 5 | 20117 | 57.83 | 35.89 | 21.94 | wind | 0 | 0 | 0 | 0 | 0 | ... | 62.31 | | 32.39 | | 1.5e-05 | | 7.69 | | clear-day | |
| 6 | 20117 | 56.58 | 23.25 | 33.33 | clear-day | 1 | 0 | 0 | 0 | 0 | ... | 57.83 | | 35.89 | | 1.8e-05 | | 14.88 | | wind | |
| 7 | 20117 | 67.72 | 37.41 | 30.31 | clear-day | 1 | 0 | 0 | 0 | 0 | ... | 56.58 | 64.25 | 23.25 | 26.32 | 3.6e-05 | 3.5e-05 | 6.35 | 10.52 | clear-day | clear-day |
| 8 | 20117 | 69.67 | 45.38 | 24.29 | rain | 0 | 0 | 0 | 1 | 0 | ... | 67.72 | 65.29 | 37.41 | 39.89 | 2.4e-05 | 2e-05 | 15.36 | 7.64 | clear-day | partly-cloudy-day |
| 9 | 20117 | 53.65 | 33.31 | 20.34 | rain | 0 | 0 | 0 | 1 | 0 | ... | 69.67 | 53.88 | 45.38 | 30.61 | 0.00034 | 8.8e-05 | 19.32 | 6.35 | rain | partly-cloudy-day |
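
To make the lag semantics concrete, the sketch below recomputes the 1-day and 7-day lags of *DS_DAILY_HIGH_TEMP* in plain Python from the ten daily highs in the preview. `lag_by_partition` is a hypothetical helper written for this example, not part of PyRasgo, and the integer day numbers stand in for the *DATE* column, whose values are not shown in the preview.

```python
def lag_by_partition(rows, column, amount, partition, order_by):
    """Return, for each row, the value of `column` found `amount` rows earlier
    within the same partition when ordered by `order_by` (None if there is none)."""
    # Group row positions by partition key, like PARTITION BY in SQL.
    groups = {}
    for i, row in enumerate(rows):
        groups.setdefault(row[partition], []).append(i)

    lagged = [None] * len(rows)
    for indices in groups.values():
        # Order each partition by the date field, like ORDER BY in SQL.
        ordered = sorted(indices, key=lambda i: rows[i][order_by])
        for pos, i in enumerate(ordered):
            if pos >= amount:
                lagged[i] = rows[ordered[pos - amount]][column]
    return lagged


# Daily highs copied from the preview; the day numbers are placeholders
# for the DATE values, which the preview does not show.
highs = [64.25, 65.29, 53.88, 59.69, 62.31, 57.83, 56.58, 67.72, 69.67, 53.65]
rows = [
    {"FIPS": 20117, "DATE": day, "DS_DAILY_HIGH_TEMP": high}
    for day, high in enumerate(highs, start=1)
]

lag_1 = lag_by_partition(rows, "DS_DAILY_HIGH_TEMP", 1, "FIPS", "DATE")
lag_7 = lag_by_partition(rows, "DS_DAILY_HIGH_TEMP", 7, "FIPS", "DATE")
```

The first row has no 1-day lag and the first seven rows have no 7-day lag, which is why those cells are empty in the preview. In pandas, the comparable operation on a frame sorted by *DATE* is `df.groupby("FIPS")["DS_DAILY_HIGH_TEMP"].shift(1)`.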


Multiple calls of `transform` can be chained together to create more complicated transformations of the original data.

Once satisfied that the data is being transformed correctly, the data can be saved as a new source on Rasgo by calling the function `to_source`.

In [10]:
transformed_source = newsource.to_source(new_table_name="DS_LAGS_TUTORIAL")

Finally, the new columns can be published as features in Rasgo to make them available for further use by everyone on the team.

In [11]:
new_source = rasgo.publish.features_from_source(data_source_id=transformed_source.id,
                                                dimensions=['DATE', 'FIPS'],
                                                granularity=['day', 'FIPS'],
                                                features=['LAG_DS_DAILY_HIGH_TEMP_1', 'LAG_DS_DAILY_HIGH_TEMP_7', 
                                                          'LAG_DS_DAILY_LOW_TEMP_1','LAG_DS_DAILY_LOW_TEMP_7', 
                                                          'LAG_DS_DAILY_TOTAL_RAINFALL_1','LAG_DS_DAILY_TOTAL_RAINFALL_7', 
                                                          'LAG_DS_DAILY_WINDSPEED_1','LAG_DS_DAILY_WINDSPEED_7', 
                                                          'LAG_DS_WEATHER_ICON_1','LAG_DS_WEATHER_ICON_7'],
                                                tags=['darksky', 'transformation_tutorial'],
                                                sandbox=False)
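
The `features` list passed above follows the pattern `LAG_<column>_<amount>`, so it can be generated rather than typed by hand. The sketch below is one way to do that in plain Python; the variable names are chosen for this example, and the upper-casing simply matches the column names returned by `df.columns` in the preview.

```python
lag_columns = ["DS_DAILY_HIGH_TEMP", "DS_DAILY_LOW_TEMP", "DS_DAILY_TOTAL_RAINFALL",
               "DS_DAILY_WINDSPEED", "DS_WEATHER_ICON"]
lag_amounts = [1, 7]

# One feature name per (column, amount) pair, in the same order as the
# columns produced by the rasgo_lag transform.
lag_features = [f"LAG_{column}_{amount}".upper()
                for column in lag_columns
                for amount in lag_amounts]
```

The resulting `lag_features` list could then be passed as the `features` argument instead of the literal list.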

## Work with this data on Rasgo

### Find the data on Rasgo

When you first open [Rasgo](https://app.rasgoml.com), you are shown the homepage, which details recent activity within your organization. You can search for and examine features by clicking the feature button in the upper left-hand corner.

<img src="img/Explore_Features.jpg" alt="Explore Features">

Within the Features page you can explore by **Hashtags**, **Data Sources**, **Dimensions**, or **Data Types**. Most commonly you will explore by **Hashtags** or **Data Sources**.

<p align="center">
  <img src="img/Rasgo_Find_Features.jpg" alt="Explore by Hashtags or by Data Source">
</p>


Next to **Hashtags**, you can see the hashtags created in the previous step:

<p align="center">
  <img src="img/darksky.png" alt="darksky" width="128">
  and
  <img src="img/transformation_tutorial.png" alt="transformation_tutorial" width="128">
</p>

Clicking on either card will take you to the list of all the features with that tag (in this case, all features published in the prior step). Similarly, clicking on the **RASGO.PUBLIC.DS_LAGS_TUTORIAL** card next to **Data Sources** will take you to the same list.

This list of features will show a card for each feature. For example,

<p align="center">
  <img src="img/LAG_DS_DAILY_HIGH_TEMP_7.png" alt="LAG DS DAILY HIGH TEMP 7" width="30%">
</p>

This feature is called LAG_DS_DAILY_HIGH_TEMP_7:  
  1. The *.00* means it's a floating point number.  
  2. It comes from the "RASGO.PUBLIC.DS_LAGS_TUTORIAL" Data Source.  
  3. It has the dimension/granularity of **day** and **FIPS**.  

Clicking the details button will show you the feature details including basic statistics, a histogram of the distribution, the value over time, and data quality checks. 

<img src="img/Feature_Details.jpg" alt="Feature Details Overview">


Click the browser back button to go back to the previous page and explore additional features from the list, or click **< Features** in the upper left to go back to the initial **Explore Features** page.

## Create your own UDT

### Note: this only works for enterprise customers

To create a UDT in Rasgo, write the transform as a SQL `SELECT` statement. Rasgo uses [Jinja](https://jinja.palletsprojects.com/en/3.0.x/) templates to define these transforms. The template below creates a `SELECT` statement that returns all fields from *source_table*, which Rasgo supplies from the datasource acquired above, plus one moving-average column for each combination of field and window size. The template arguments

- *fields_to_average*
- *window_sizes*
- *serial_dim*
- *date_dim*

can be specified when the transform is called on a datasource. The first two are arrays that **Jinja** loops over to create the statement.

In [None]:
# Jinja template for a moving-average transform. The final endfor tag keeps
# the newline that follows it, so FROM is not glued to the last column alias.
sqltext = """SELECT * 
{%- for column in fields_to_average -%}
    {%- for window in window_sizes -%}
        , avg({{column}}) OVER(PARTITION BY {{serial_dim}} ORDER BY {{date_dim}} ROWS BETWEEN {{window - 1}} PRECEDING AND CURRENT ROW) AS mean_{{column}}_{{window}} 
    {%- endfor -%}
{%- endfor %}
FROM {{source_table}}"""
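
To check what statement the template is meant to expand to, the sketch below mirrors its two nested loops in plain Python. `moving_average_sql` is a hypothetical helper written for this example (it is not part of PyRasgo or Jinja), and the argument values are illustrative choices based on the Dark Sky source used earlier.

```python
def moving_average_sql(fields_to_average, window_sizes, serial_dim, date_dim, source_table):
    """Build the SELECT statement the moving-average template is meant to produce."""
    expressions = [
        f"avg({column}) OVER(PARTITION BY {serial_dim} ORDER BY {date_dim} "
        f"ROWS BETWEEN {window - 1} PRECEDING AND CURRENT ROW) AS mean_{column}_{window}"
        for column in fields_to_average   # outer template loop
        for window in window_sizes        # inner template loop
    ]
    select_list = ", ".join(["*"] + expressions)
    return f"SELECT {select_list}\nFROM {source_table}"


sql = moving_average_sql(
    fields_to_average=["DS_DAILY_HIGH_TEMP"],
    window_sizes=[7, 14],
    serial_dim="FIPS",
    date_dim="DATE",
    source_table="RASGOCOMMUNITY.PUBLIC.DARKSKY_DAILY_FEATURES",
)
print(sql)
```

A window size of 7 becomes `ROWS BETWEEN 6 PRECEDING AND CURRENT ROW`, that is, the current row plus the six rows before it within the same series.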

If you are an enterprise customer, you can call `rasgo.create.transform` to create the transformation. This transformation will then appear in the list returned by `rasgo.get.transforms` and can be used in the same way as `rasgo_lag` above.

In [None]:
new_transform = rasgo.create.transform(
    name="moving_average", source_code=sqltext)