# PyRasgo Tutorial

With this tutorial, in under 4 minutes you'll be able to generate feature profiles that give you full visibility into your data, and calculate feature importance scores to select the features that have the most impact on your prediction. Please give it a try. We think you'll find PyRasgo easier to use and more powerful than other packages for these tasks.

This notebook uses gold and silver price data from `rdatasets` to explore feature engineering for predicting the future price of gold (the code below sets the horizon to 28 days). In addition, `pyrasgo` uses SHAP values from the `catboost` package to calculate feature importance, which captures the impact of this feature engineering and is used to prune features at the end of the tutorial.

### Workflow

* Connect to Rasgo
* Create initial dataset
* Feature engineering
* Feature selection

### Packages

The documentation for each package used in this tutorial is linked below:
* [pandas](https://pandas.pydata.org/docs/)
* [statsmodels](https://www.statsmodels.org/stable/index.html)
    * [statsmodels.api](https://www.statsmodels.org/stable/api.html#statsmodels-api)
* [PyRasgo](https://app.gitbook.com/@rasgo/s/rasgo-docs/pyrasgo-0.1/dataframe-prep)

Install `pyrasgo` if it is not already available:

In [None]:
#!pip install -U pyrasgo[df]

In [1]:
import statsmodels.api as sm
import pandas as pd
import pyrasgo

## Connect to Rasgo

Enter your email and password to create an account. This account gives you free access to the Rasgo API, which will calculate dataframe profiles, generate feature importance scores, and produce feature explainability for your analysis. In addition, this account allows you to keep access to your analysis and share it with your colleagues.

**Note:** This only needs to be run the first time you use pyrasgo.

In [None]:
#pyrasgo.register(email='<your email>', password='<your password>')

Enter the email and password you used at registration to connect to Rasgo.

In [2]:
rasgo = pyrasgo.login(email='<your email>', password='<your password>')

## Create initial dataset

The data is from `rdatasets` imported using the Python package `statsmodels`.

In [3]:
df = sm.datasets.get_rdataset('GoldSilver', 'AER').data.reset_index().rename(columns={'index': 'date'})
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9132 entries, 0 to 9131
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   date    9132 non-null   object 
 1   gold    9132 non-null   float64
 2   silver  9132 non-null   float64
dtypes: float64(2), object(1)
memory usage: 214.2+ KB


### Create target

The target will be the future gold price. The code below shifts the dates back by 28 days, so each row is paired with the gold price 28 days later. **target_df** is created to hold the future gold price, and it is merged back into the original dataframe to create the initial dataframe to be analyzed. For convenience, **target** is set to **future_gold_price** here.

In [4]:
df['date'] = pd.to_datetime(df.date)
target_df = df[['date', 'gold']].copy()
target_df['date'] = target_df.date - pd.to_timedelta('28 day')
target_df.rename(columns={'gold': 'future_gold_price'}, inplace=True)
target_df

| | date | future_gold_price |
|---|---|---|
| 0 | 1977-12-02 | 100.00 |
| 1 | 1977-12-05 | 100.00 |
| 2 | 1977-12-06 | 100.00 |
| 3 | 1977-12-07 | 100.00 |
| 4 | 1977-12-08 | 100.00 |
| ... | ... | ... |
| 9127 | 2012-11-27 | 906.96 |
| 9128 | 2012-11-28 | 907.61 |
| 9129 | 2012-11-29 | 909.26 |
| 9130 | 2012-11-30 | 905.00 |


In [5]:
target = 'future_gold_price'

In [6]:
training_df = df.merge(target_df, on='date', how='left')
df = training_df[training_df.date < pd.to_datetime('2012-12-04')].ffill()
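
The shift-and-merge pattern used above can be easier to follow on a tiny example. The sketch below uses made-up prices and a 2-day horizon (both are illustrative assumptions, not values from the GoldSilver data) to show how moving the dates back lines each row up with a later price, and why the last rows end up without a target before `ffill()` is applied.

```python
import pandas as pd

# Toy daily prices (hypothetical values, not from the GoldSilver data).
prices = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=5, freq="D"),
    "gold": [100.0, 101.0, 102.0, 103.0, 104.0],
})

# Illustrative horizon; the tutorial itself uses '28 day'.
horizon = pd.to_timedelta("2 day")

# Move each price back by the horizon so it lines up with the earlier date.
future = prices[["date", "gold"]].copy()
future["date"] = future["date"] - horizon
future = future.rename(columns={"gold": "future_gold_price"})

# A left merge keeps every original row; the final rows have no future price yet.
toy = prices.merge(future, on="date", how="left")
print(toy)
```

In this toy frame the last two rows have a missing **future_gold_price**, which is the gap the tutorial handles with the date filter and `ffill()`.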

## Feature engineering

### Start experiment

Create an experiment on Rasgo to track changes to features over time, run time travel analysis, and understand feature importance. The experiment records the changes you make during feature engineering and their impact on both model performance and feature importance.

In [7]:
rasgo.activate_experiment('Tutorial Experiment')

Activated new experiment with name Tutorial Experiment for dataframe: aUf7WI2ErGIsGS4mAIKIKdcPhNfoR0zCcmYJ0-Ed9PU


### Profile starting data

#### Generate feature profiles

In [8]:
response = rasgo.evaluate.profile(df)

Profile URL: https://app.rasgoml.com/dataframes/aUf7WI2ErGIsGS4mAIKIKdcPhNfoR0zCcmYJ0-Ed9PU/features


This page shows 4 features: **date**, **gold**, **silver** and **future_gold_price**. Clicking on any of the rows will provide detailed statistics about that feature.

#### Calculate feature importance

This generates a baseline against which we can compare the impact of our feature engineering. PyRasgo automates the creation of a `catboost` model and the calculation of the SHAP values. The feature importance score is the mean absolute value of the SHAP values for that feature.
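
To make the "mean absolute SHAP value" definition concrete, here is a minimal sketch in plain Python. The SHAP numbers are invented for illustration, and the helper name `mean_abs` is hypothetical; this only shows the arithmetic, not how PyRasgo or `catboost` compute the values internally.

```python
# Hypothetical SHAP values: one list per feature, one entry per observation.
shap_values = {
    "gold": [12.0, -8.0, 10.0, -6.0],
    "silver": [1.0, -2.0, 0.5, -0.5],
}


def mean_abs(values):
    """Average of the absolute values, so positive and negative effects do not cancel."""
    return sum(abs(v) for v in values) / len(values)


# One importance score per feature.
importance = {name: mean_abs(vals) for name, vals in shap_values.items()}
print(importance)
```

Taking the absolute value first matters: a feature that pushes predictions up for some rows and down for others would otherwise average out to a score near zero.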

In [9]:
response = rasgo.evaluate.feature_importance(df, target_column=target, timeseries_index='date')
response['modelPerformance']['RMSE']

Importance URL: https://app.rasgoml.com/dataframes/aUf7WI2ErGIsGS4mAIKIKdcPhNfoR0zCcmYJ0-Ed9PU/importance


331.5336228692031

The graph shows that **gold** is much more important to the model prediction than **silver**.  Keep in mind that with an **RMSE** of 332, this is not a good model. 
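
For readers less familiar with the metric, RMSE (root mean squared error) is the square root of the average squared difference between predicted and actual values, expressed in the same units as the target. The following is a small sketch with made-up numbers; the function name `rmse` is just an illustrative helper, not part of PyRasgo.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical prices and predictions.
actual = [100.0, 110.0, 120.0, 130.0]
predicted = [103.0, 106.0, 120.0, 130.0]

print(rmse(actual, predicted))
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.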

### Start feature engineering

Create initial lag variables. Note that `shift(n)` lags a column by `n` rows, not by `n` calendar days.

In [10]:
df['gold_lag7'] = df['gold'].shift(7)
df['gold_lag14'] = df['gold'].shift(14)
df['gold_lag28'] = df['gold'].shift(28)
df['gold_lag60'] = df['gold'].shift(60)

df['silver_lag7'] = df['silver'].shift(7)
df['silver_lag14'] = df['silver'].shift(14)
df['silver_lag28'] = df['silver'].shift(28)
df['silver_lag60'] = df['silver'].shift(60)
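
Since `shift` counts rows, the calendar gap behind each lag depends on how the rows are spaced. The sample output earlier suggests the data has weekday-only rows; under that assumption, the sketch below (toy values, illustrative column names) shows that a 5-row lag on business-day data spans 7 calendar days.

```python
import pandas as pd

# Six consecutive business days starting Monday 2021-01-04 (toy data).
toy = pd.DataFrame({
    "date": pd.bdate_range("2021-01-04", periods=6),
    "gold": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Lag the value and the date by the same number of rows.
toy["gold_lag5"] = toy["gold"].shift(5)
toy["lag_date"] = toy["date"].shift(5)

# Calendar days between each row and the row its lag came from.
gap_days = (toy["date"] - toy["lag_date"]).dt.days
print(toy.assign(gap_days=gap_days))
```

The first five rows have no lagged value, which is why the lag columns in the tutorial start with missing values.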

Calculate feature importance to understand the value of these new features.

In [11]:
response = rasgo.evaluate.feature_importance(df, target_column=target)
response['modelPerformance']['RMSE']

Importance URL: https://app.rasgoml.com/dataframes/aUf7WI2ErGIsGS4mAIKIKdcPhNfoR0zCcmYJ0-Ed9PU/importance


8.914154234793658

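Adding these 8 features reduced the model's **RMSE** from about 332 to about 8.9. Note that the baseline call also passed `timeseries_index='date'` while this one does not, so the two scores may not be strictly comparable. Checking the graph allows us to see the relative importance of each of the features in the dataset.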

#### Calculate ratios and differences of gold price over time

In [12]:
# ratio of current gold price to prior prices
df['gold_to_last7'] = df['gold'] / df['gold_lag7']
df['gold_to_last14'] = df['gold'] / df['gold_lag14']
df['gold_to_last28'] = df['gold'] / df['gold_lag28']
df['gold_to_last60'] = df['gold'] / df['gold_lag60']

# ratio of prior gold price to previous prices
df['gold_7_to_last14'] = df['gold_lag7'] / df['gold_lag14']
df['gold_7_to_last28'] = df['gold_lag7'] / df['gold_lag28']
df['gold_14_to_last28'] = df['gold_lag14'] / df['gold_lag28']

# difference between current gold price and prior prices
df['gold_minus_last7'] = df['gold'] - df['gold_lag7']
df['gold_minus_last14'] = df['gold'] - df['gold_lag14']
df['gold_minus_last28'] = df['gold'] - df['gold_lag28']
df['gold_minus_last60'] = df['gold'] - df['gold_lag60']

# difference between prior gold price and previous prices
df['gold_7_minus_last14'] = df['gold_lag7'] - df['gold_lag14']
df['gold_7_minus_last28'] = df['gold_lag7'] - df['gold_lag28']
df['gold_14_minus_last28'] = df['gold_lag14'] - df['gold_lag28']
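
The ratio and difference features above all follow the same pattern, so they could also be generated in a loop. The sketch below is one possible way to do this on toy data with illustrative lags of 1 and 2 rows; it is not the tutorial's own code, and the values are made up.

```python
import pandas as pd

# Toy prices (hypothetical values).
toy = pd.DataFrame({"gold": [100.0, 102.0, 101.0, 105.0, 110.0]})

for lag in (1, 2):
    lagged = toy["gold"].shift(lag)
    # Ratio of the current price to the price `lag` rows earlier.
    toy[f"gold_to_last{lag}"] = toy["gold"] / lagged
    # Difference between the current price and the price `lag` rows earlier.
    toy[f"gold_minus_last{lag}"] = toy["gold"] - lagged

print(toy)
```

The difference columns match what `Series.diff(lag)` would return, so that built-in is an alternative for the subtraction case.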

Check feature importance for these new features.

In [13]:
response = rasgo.evaluate.feature_importance(df, target_column=target)
response['modelPerformance']['RMSE']

Importance URL: https://app.rasgoml.com/dataframes/aUf7WI2ErGIsGS4mAIKIKdcPhNfoR0zCcmYJ0-Ed9PU/importance


9.829427350534086

Comparing the **RMSE** values (9.83 now versus 8.91 before) shows that adding these additional features is not improving the model. Examining the feature importance plots, none of these new variables are as important as the lag variables based on gold.

### Feature selection

Keep the top half of the features.

In [14]:
df = rasgo.prune.features(df, target_column=target, top_n_pct=.5)

Prune Method: Keeping top 0.5 of features
Importance URL: https://app.rasgoml.com/dataframes/aUf7WI2ErGIsGS4mAIKIKdcPhNfoR0zCcmYJ0-Ed9PU/importance
Dropped features not in top 0.5 pct: ['gold_7_minus_last28', 'gold_minus_last7', 'gold_14_to_last28', 'gold_minus_last28', 'gold_to_last14', 'gold_minus_last14', 'gold_to_last7', 'gold_to_last28', 'gold_14_minus_last28', 'gold_7_to_last28', 'gold_7_minus_last14', 'gold_7_to_last14']


Calculate the feature importance to check the impact of pruning the features.

In [15]:
response = rasgo.evaluate.feature_importance(df, target_column=target)
response['modelPerformance']['RMSE']

Importance URL: https://app.rasgoml.com/dataframes/aUf7WI2ErGIsGS4mAIKIKdcPhNfoR0zCcmYJ0-Ed9PU/importance


8.58348035854779

Dropping the least important half of the features not only gives a simpler model with only 12 features, but also improves the **RMSE** (8.58 versus 9.83). The feature importance graph continues to show the importance of the lagged gold prices.
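
Conceptually, pruning to a top fraction means ranking features by importance and keeping the highest-scoring share. The sketch below illustrates that idea in plain Python. The function `keep_top_fraction`, the scores, and the rounding rule are all assumptions made for illustration; PyRasgo's actual ranking and tie-breaking may differ.

```python
def keep_top_fraction(importance, top_n_pct):
    """Return the names of the highest-scoring features, keeping the given fraction."""
    ranked = sorted(importance, key=importance.get, reverse=True)
    n_keep = int(len(ranked) * top_n_pct)  # assumed rounding: truncate
    return ranked[:n_keep]


# Hypothetical importance scores.
scores = {"gold": 9.0, "gold_lag7": 4.0, "silver": 1.0, "gold_to_last7": 0.2}

kept = keep_top_fraction(scores, 0.5)
print(kept)
```

With this rule, 24 features at 0.5 leaves 12, and 12 features at 0.75 leaves 9, which matches the counts reported in this tutorial.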

Trim another quarter of the features.

In [16]:
df = rasgo.prune.features(df, target_column=target, top_n_pct=.75)

Prune Method: Keeping top 0.75 of features
Importance URL: https://app.rasgoml.com/dataframes/aUf7WI2ErGIsGS4mAIKIKdcPhNfoR0zCcmYJ0-Ed9PU/importance
Dropped features not in top 0.75 pct: ['silver_lag28', 'gold_minus_last60', 'silver_lag14']


In [17]:
response = rasgo.evaluate.feature_importance(df, target_column=target)
response['modelPerformance']['RMSE']

Importance URL: https://app.rasgoml.com/dataframes/aUf7WI2ErGIsGS4mAIKIKdcPhNfoR0zCcmYJ0-Ed9PU/importance


9.704334104247536

Pruning further to 9 features gives a simpler model, at the cost of a somewhat higher **RMSE** (9.70 versus 8.58).

### End the experiment

In [18]:
rasgo.end_experiment()

Experiment ended


The results of this experiment can be viewed [here](https://app.rasgoml.com/dataframes/aUf7WI2ErGIsGS4mAIKIKdcPhNfoR0zCcmYJ0-Ed9PU/importance).

## Additional Resources

* Provide feedback and ask questions about PyRasgo on the [Rasgo Forum](https://forum.rasgoml.com)
* Join our community on [Slack](https://join.slack.com/t/rasgousergroup/shared_invite/zt-nytkq6np-ANEJvbUSbT2Gkvc8JICp3g)
* [View](https://github.com/rasgointelligence/feature-engineering-tutorials) our feature engineering tutorials