This notebook walks you through setting up a model for the LastFM dataset, which is stored in TSV (tab-separated values) files, and then fetching ranked artists for a specific user.

Let's get started! 🚀

### Setup

Replace `<YOUR_API_KEY>` with your API key below.

*If you don't have an API key, feel free to [sign up on our website](https://www.shaped.ai/#contact-us) :)*

In [1]:
import os

SHAPED_API_KEY = os.getenv('TEST_SHAPED_API_KEY', '<YOUR_API_KEY>')

1. Install `shaped` to leverage the Shaped CLI to create, view, and use your model.
2. Install `pandas` to view and edit the sample dataset.
3. Install `pyyaml` to create Shaped Dataset and Model schema files.

In [None]:
! pip install shaped
! pip install pandas
! pip install pyyaml

Initialize the CLI with your API key.

In [None]:
! shaped init --api-key $SHAPED_API_KEY

### Download Public Dataset

Fetch the publicly hosted LastFM dataset. (NOTE: This step can take ~10 min)

In [4]:
! echo "Downloading LastFM data..."

DIR_NAME = "notebook_assets"
! mkdir -p $DIR_NAME
! curl http://mtg.upf.edu/static/datasets/last.fm/lastfm-dataset-360K.tar.gz -o $DIR_NAME/lastfm-dataset-360K.tar.gz
! tar -xzf $DIR_NAME/lastfm-dataset-360K.tar.gz -C $DIR_NAME

Downloading LastFM data...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  542M  100  542M    0     0   826k      0  0:11:12  0:11:12 --:--:--  997k

Let's take a look at the downloaded dataset. There are two tables of interest:
- `plays` which are stored in `lastfm-dataset-360K/usersha1-artmbid-artname-plays.tsv`
- `users` which are stored in `lastfm-dataset-360K/usersha1-profile.tsv`

Unfortunately, neither of these tab-separated files has a header row, which Shaped requires. To address this, we read each file with explicit column names and write it back out with a header, as shown below:

In [5]:
import pandas as pd

data_dir = "notebook_assets/lastfm-dataset-360K"

events_cols = ['user_id', 'artist_id', 'artist_name', 'plays']
events_df = pd.read_csv(f'{data_dir}/usersha1-artmbid-artname-plays.tsv', sep='\t', names=events_cols, encoding='latin-1')
display(events_df.head())
events_df.to_csv(f'{data_dir}/events.tsv', sep='\t', index=False)

users_cols = ['user_id', 'gender', 'age', 'country', 'signup']
users_df = pd.read_csv(f'{data_dir}/usersha1-profile.tsv', sep='\t', names=users_cols, encoding='latin-1')
display(users_df.head())
users_df.to_csv(f'{data_dir}/users.tsv', sep='\t', index=False)

Unnamed: 0,user_id,artist_id,artist_name,plays
0,00000c289a1829a808ac09c00daf10bc3c4e223b,3bd73256-3905-4f3a-97e2-8b341527f805,betty blowtorch,2137
1,00000c289a1829a808ac09c00daf10bc3c4e223b,f2fb0ff0-5679-42ec-a55c-15109ce6e320,die Ãrzte,1099
2,00000c289a1829a808ac09c00daf10bc3c4e223b,b3ae82c2-e60b-4551-a76d-6620f1b456aa,melissa etheridge,897
3,00000c289a1829a808ac09c00daf10bc3c4e223b,3d6bbeb7-f90e-4d10-b440-e153c0d10b53,elvenking,717
4,00000c289a1829a808ac09c00daf10bc3c4e223b,bbd2ffd7-17f4-4506-8572-c1ea58c3f9a8,juliette & the licks,706


Unnamed: 0,user_id,gender,age,country,signup
0,00000c289a1829a808ac09c00daf10bc3c4e223b,f,22.0,Germany,"Feb 1, 2007"
1,00001411dc427966b17297bf4d69e7e193135d89,f,,Canada,"Dec 4, 2007"
2,00004d2ac9316e22dc007ab2243d6fcb239e707d,,,Germany,"Sep 1, 2006"
3,000063d3fe1cf2ba248b9e3c3f0334845a27a6bf,m,19.0,Mexico,"Apr 28, 2008"
4,00007a47085b9aab8af55f52ec8846ac479ac4fe,m,28.0,United States,"Jan 27, 2006"
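Before uploading, it can be worth confirming that the rewritten files really start with the expected header row. The following is a minimal sketch using only the standard library; `read_header` and `has_expected_header` are hypothetical helper names introduced here, not part of Shaped or pandas.

```python
import csv


def read_header(path, delimiter="\t", encoding="utf-8"):
    """Return the first row of a delimited file as a list of column names."""
    with open(path, newline="", encoding=encoding) as f:
        return next(csv.reader(f, delimiter=delimiter))


def has_expected_header(path, expected_cols):
    """Return True if the file's first row matches expected_cols exactly."""
    return read_header(path) == list(expected_cols)
```

For example, `has_expected_header(f'{data_dir}/events.tsv', events_cols)` should return `True` after the cell above has run.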


### Upload Data to Shaped

Shaped has support for many data connectors! For this tutorial we're going to be using native Shaped Datasets. To do that we need to:
1. Create a .yaml file containing the dataset schema definition.
2. Use Shaped CLI to create the dataset.
3. Use Shaped CLI to upload the .tsv files we just created.

In [6]:
"""
Create a Shaped Dataset schema for each of the datasets and store in a .yaml file.
"""

import yaml

dir_path = "notebook_assets"

events_dataset_schema = {
    "dataset_name": "lastfm_events",
    "schema_type": "CUSTOM",
    "schema": {
        "plays": "Int32",
        "user_id": "String",
        "artist_id": "String",
        "artist_name": "String"
    }
}

with open(f'{dir_path}/events_dataset_schema.yaml', 'w') as file:
    yaml.dump(events_dataset_schema, file)


users_dataset_schema = {
    "dataset_name": "lastfm_users",
    "schema_type": "CUSTOM",
    "schema": {
        "user_id": "String",
        "gender": "String",
        "age": "Int32",
        "country": "String",
        "signup": "String"
    }
}


with open(f'{dir_path}/users_dataset_schema.yaml', 'w') as file:
    yaml.dump(users_dataset_schema, file)
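Since the schema columns and the file headers are defined in separate cells, a quick consistency check can catch a typo before the upload step. This is a rough sketch, assuming the schema dictionary has the same shape as the ones above; `schema_mismatches` is a hypothetical helper and does not call Shaped.

```python
def schema_mismatches(dataset_schema, file_columns):
    """Compare schema column names against a file's header columns."""
    schema_cols = set(dataset_schema["schema"])
    file_cols = set(file_columns)
    return {
        # Declared in the schema but absent from the file header.
        "missing_from_file": sorted(schema_cols - file_cols),
        # Present in the file header but not declared in the schema.
        "missing_from_schema": sorted(file_cols - schema_cols),
    }


example_schema = {
    "dataset_name": "lastfm_events",
    "schema_type": "CUSTOM",
    "schema": {
        "plays": "Int32",
        "user_id": "String",
        "artist_id": "String",
        "artist_name": "String",
    },
}
example_columns = ["user_id", "artist_id", "artist_name", "plays"]
mismatches = schema_mismatches(example_schema, example_columns)
```

Both lists in `mismatches` are empty when the header and the schema agree.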

In [7]:
"""
Create a Shaped Dataset using the .yaml schema files.
"""
! shaped create-dataset --file $DIR_NAME/events_dataset_schema.yaml
! shaped create-dataset --file $DIR_NAME/users_dataset_schema.yaml

{
  "dataset_name": "lastfm_events",
  "schema": {
    "artist_id": "String",
    "artist_name": "String",
    "plays": "Int32",
    "user_id": "String"
  },
  "schema_type": "CUSTOM"
}
message: Dataset with name 'lastfm_events' was successfully scheduled for creation

{
  "dataset_name": "lastfm_users",
  "schema": {
    "age": "Int32",
    "country": "String",
    "gender": "String",
    "signup": "String",
    "user_id": "String"
  },
  "schema_type": "CUSTOM"
}
message: Dataset with name 'lastfm_users' was successfully scheduled for creation



It takes a moment to provision the infrastructure required for the datasets. You can monitor their status using the CLI command:

In [9]:
! shaped list-datasets

datasets:
- dataset_name: lastfm_events
  dataset_uri: https://api.shaped.ai/v1/datasets/lastfm_events
  created_at: 2024-05-15T16:32:12 UTC
  schema_type: CUSTOM
  status: ACTIVE
- dataset_name: lastfm_users
  dataset_uri: https://api.shaped.ai/v1/datasets/lastfm_users
  created_at: 2024-05-15T16:32:13 UTC
  schema_type: CUSTOM
  status: ACTIVE



Upload the .tsv files. You'll see the records uploading in batches of 1000.

The full LastFM dataset has ~17 million events! To save time for this tutorial, we trim the dataset down to the first 100k events.

In [10]:
! head -n 100001 notebook_assets/lastfm-dataset-360K/events.tsv > notebook_assets/lastfm-dataset-360K/events_100k.tsv
! head -n 2033 notebook_assets/lastfm-dataset-360K/users.tsv > notebook_assets/lastfm-dataset-360K/users_100k.tsv

# Comment out these lines and replace them with the commented-out lines below if you wish to use the full dataset.
# Warning - this will take ~2 hours to upload!
! shaped dataset-insert --dataset-name lastfm_events --file notebook_assets/lastfm-dataset-360K/events_100k.tsv --type 'tsv'
! shaped dataset-insert --dataset-name lastfm_users --file notebook_assets/lastfm-dataset-360K/users_100k.tsv --type 'tsv'

# ! shaped dataset-insert --dataset-name lastfm_events --file notebook_assets/lastfm-dataset-360K/events.tsv --type 'tsv'
# ! shaped dataset-insert --dataset-name lastfm_users --file notebook_assets/lastfm-dataset-360K/users.tsv --type 'tsv'

100000 Records [01:14, 1344.27 Records/s]
2032 Records [00:01, 1055.04 Records/s]
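The `head` commands above trim both files by a fixed line count. An alternative, sketched below with the standard library only, is to keep the first N events and then keep exactly the users that appear in them. The function names are hypothetical, and the sketch assumes both files have a `user_id` column in their header, as created earlier in this notebook.

```python
import csv
from itertools import islice


def trim_events(src, dst, n_rows, encoding="utf-8"):
    """Copy the header plus the first n_rows data rows; return the user_ids seen."""
    user_ids = set()
    with open(src, newline="", encoding=encoding) as fin, \
            open(dst, "w", newline="", encoding=encoding) as fout:
        reader = csv.reader(fin, delimiter="\t")
        writer = csv.writer(fout, delimiter="\t", lineterminator="\n")
        header = next(reader)
        user_col = header.index("user_id")
        writer.writerow(header)
        for row in islice(reader, n_rows):
            writer.writerow(row)
            user_ids.add(row[user_col])
    return user_ids


def filter_users(src, dst, user_ids, encoding="utf-8"):
    """Copy the header plus every user row whose user_id is in user_ids."""
    kept = 0
    with open(src, newline="", encoding=encoding) as fin, \
            open(dst, "w", newline="", encoding=encoding) as fout:
        reader = csv.reader(fin, delimiter="\t")
        writer = csv.writer(fout, delimiter="\t", lineterminator="\n")
        header = next(reader)
        user_col = header.index("user_id")
        writer.writerow(header)
        for row in reader:
            if row[user_col] in user_ids:
                writer.writerow(row)
                kept += 1
    return kept
```

Filtering by ID rather than by line count keeps the two trimmed files consistent even if the users file is not ordered the same way as the events file.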


### Model Creation

We're now ready to create your Shaped model! To keep things simple, we use the play records to build a collaborative filtering model. Shaped uses the play counts as labels to determine which users like which artists, on the assumption that the more often a user has played an artist's music, the more likely that user is to like the artist.


1. Create a .yaml file containing the model schema definition.
2. Use Shaped CLI to create the model!

For further details about creating models please refer to the [Create Model](https://docs.shaped.ai/docs/api#tag/Model/operation/post_create_models_post) API reference.

In [11]:
"""
Create a Shaped Model schema and store in a .yaml file.
"""

import yaml

lastfm_plays_model_schema = {
    "model": {
        "name": "lastfm_artist_recommendations"
    },
    "connectors": [
        {
            "type": "Dataset",
            "id": "lastfm_events",
            "name": "lastfm_events"
        },
        {
            "type": "Dataset",
            "id": "lastfm_users",
            "name": "lastfm_users"
        },
    ],
    "fetch": {
        "events": "SELECT user_id, artist_id AS item_id, 0 AS created_at, plays AS label FROM lastfm_events",
        "users": "SELECT user_id, gender, age, country, signup FROM lastfm_users",
        "items": "SELECT artist_id AS item_id, artist_name FROM lastfm_events" 
    }
}

with open(f'{dir_path}/lastfm_plays_model_schema.yaml', 'w') as file:
    yaml.dump(lastfm_plays_model_schema, file)

In [12]:
"""
Create a Shaped Model using the .yaml schema file.
"""

! shaped create-model --file $DIR_NAME/lastfm_plays_model_schema.yaml

{
  "connectors": [
    {
      "id": "lastfm_events",
      "name": "lastfm_events",
      "type": "Dataset"
    },
    {
      "id": "lastfm_users",
      "name": "lastfm_users",
      "type": "Dataset"
    }
  ],
  "fetch": {
    "events": "SELECT user_id, artist_id AS item_id, 0 AS created_at, plays AS label FROM lastfm_events",
    "items": "SELECT artist_id AS item_id, artist_name FROM lastfm_events",
    "users": "SELECT user_id, gender, age, country, signup FROM lastfm_users"
  },
  "model": {
    "name": "lastfm_artist_recommendations"
  }
}
model_url: https://api.shaped.ai/v1/models/lastfm_artist_recommendations
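To see what the `fetch` queries do to the column names, you can run them against a tiny in-memory table. The sketch below uses Python's built-in `sqlite3` purely for illustration, with made-up rows; Shaped's own SQL engine may differ in dialect and behavior.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lastfm_events "
    "(user_id TEXT, artist_id TEXT, artist_name TEXT, plays INTEGER)"
)
# Made-up sample rows, for illustration only.
conn.executemany(
    "INSERT INTO lastfm_events VALUES (?, ?, ?, ?)",
    [
        ("u1", "a1", "betty blowtorch", 2137),
        ("u1", "a2", "elvenking", 717),
        ("u2", "a1", "betty blowtorch", 12),
    ],
)

events_sql = (
    "SELECT user_id, artist_id AS item_id, 0 AS created_at, "
    "plays AS label FROM lastfm_events"
)
items_sql = "SELECT artist_id AS item_id, artist_name FROM lastfm_events"

cursor = conn.execute(events_sql)
event_columns = [col[0] for col in cursor.description]
event_rows = cursor.fetchall()
item_rows = conn.execute(items_sql).fetchall()
conn.close()
```

Here `event_columns` contains the renamed columns (`user_id`, `item_id`, `created_at`, `label`). Note that the items query returns one row per event, so an artist played by several users appears several times in `item_rows`; this notebook does not state how Shaped handles such repeats.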



It can take up to a few hours to provision the infrastructure for your recommendation model and train it on your historic events. The time depends mostly on the size of your dataset, that is, the number of users, items, and interactions, and the number of attributes you provide.

While the model is being set up, you can view its status with either the [List Models](https://docs.shaped.ai/docs/api#tag/Model/operation/get_models_models_get) or [View Model](https://docs.shaped.ai/docs/api) endpoints. For example, with the CLI:

In [14]:
! shaped list-models

models:
- model_name: lastfm_artist_recommendations
  model_uri: https://api.shaped.ai/v1/models/lastfm_artist_recommendations
  created_at: 2024-05-15T17:03:17 UTC
  status: ACTIVE



The initial model creation goes through the following stages in order:

1. `SCHEDULING`<br/>
2. `FETCHING`<br/>
3. `TUNING`<br/>
4. `TRAINING`<br/>
5. `DEPLOYING`<br/>
6. `ACTIVE`

You can periodically poll Shaped to inspect these status changes. Once the model is in the `ACTIVE` state, you can move on to the next step and use it to make rank requests.
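Polling can be wrapped in a small loop. The sketch below is deliberately generic: `get_status` is a hypothetical callable that you would implement yourself (for instance by calling the List Models or View Model endpoint) and that returns the current status string. The interval and timeout defaults are arbitrary assumptions.

```python
import time


def wait_for_status(get_status, target="ACTIVE", interval_s=60.0,
                    timeout_s=4 * 3600, sleep=time.sleep):
    """Poll get_status() until it returns target; return the distinct statuses seen.

    Raises TimeoutError if target is not reached within roughly timeout_s
    seconds of accumulated waiting.
    """
    waited = 0.0
    seen = []
    while True:
        status = get_status()
        # Record a status only when it changes, to keep the history short.
        if not seen or seen[-1] != status:
            seen.append(status)
        if status == target:
            return seen
        if waited >= timeout_s:
            raise TimeoutError(f"status is still {status!r} after {waited:.0f}s")
        sleep(interval_s)
        waited += interval_s
```

The `sleep` parameter exists so the loop can be exercised without real waiting, for example by passing `sleep=lambda s: None` together with a fake `get_status`.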

### Rank!

You're now ready to fetch your artist recommendations! You can do this with the [Rank endpoint](https://docs.shaped.ai/docs/api#tag/Rank/operation/post_rank_models__model_id__rank_post). Just provide the `user_id` you want recommendations for and the number of recommendations you want returned.

Shaped's CLI provides a convenience rank command to quickly retrieve results from the command line. You can use it as follows:

In [15]:
! shaped rank --model-name lastfm_artist_recommendations --user-id 00000c289a1829a808ac09c00daf10bc3c4e223b --limit 5

ids:
- b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d
- f9ef7a22-4262-4596-a2a8-1d19345b8e50
- a74b1b7f-71a5-4011-9441-d0b5e4122711
- 67e344da-ec54-4e26-b2a4-8351d744a14c
- 31745282-b1ea-4d62-939f-226b14d68e7c
scores:
- 1.0
- 0.80925343
- 0.56741309
- 0.45006668
- 0.37382729



The response contains two parallel arrays: the ids of the artists that Shaped estimates are most interesting to the given user, and their ranking scores.
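Because the two arrays are parallel, the i-th score belongs to the i-th id. A minimal sketch of pairing them up, using the values from the output above (`pair_results` is a hypothetical helper name):

```python
response = {
    "ids": [
        "b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d",
        "f9ef7a22-4262-4596-a2a8-1d19345b8e50",
        "a74b1b7f-71a5-4011-9441-d0b5e4122711",
        "67e344da-ec54-4e26-b2a4-8351d744a14c",
        "31745282-b1ea-4d62-939f-226b14d68e7c",
    ],
    "scores": [1.0, 0.80925343, 0.56741309, 0.45006668, 0.37382729],
}


def pair_results(rank_response):
    """Zip the parallel ids and scores arrays into (artist_id, score) tuples."""
    ids, scores = rank_response["ids"], rank_response["scores"]
    if len(ids) != len(scores):
        raise ValueError("ids and scores must have the same length")
    return list(zip(ids, scores))


ranked = pair_results(response)
```

Each artist id in `ranked` can then be looked up in the `artist_id` column of the events data to recover the artist name.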

If you want to integrate ranking into your website or application, you can call the Rank POST REST endpoint directly with the following request:

In [None]:
! curl https://api.prod.shaped.ai/v1/models/lastfm_artist_recommendations/rank \
  -H "x-api-key: <YOUR_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{ "user_id": "00000c289a1829a808ac09c00daf10bc3c4e223b", "limit": 5 }'
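The same request can be made from Python. The sketch below mirrors the `curl` call above using only the standard library; the URL, header names, and body fields are copied from that command, while the helper names and the assumption that the response body is JSON are ours.

```python
import json
import urllib.request

# URL pattern taken from the curl example above.
RANK_URL = "https://api.prod.shaped.ai/v1/models/{model_name}/rank"


def build_rank_request(api_key, model_name, user_id, limit=5):
    """Build (but do not send) a POST request equivalent to the curl call."""
    body = json.dumps({"user_id": user_id, "limit": limit}).encode("utf-8")
    return urllib.request.Request(
        RANK_URL.format(model_name=model_name),
        data=body,
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def fetch_rank(request, timeout_s=10):
    """Send the request and decode the body, assuming it is JSON."""
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Usage would look like `fetch_rank(build_rank_request(SHAPED_API_KEY, "lastfm_artist_recommendations", "00000c289a1829a808ac09c00daf10bc3c4e223b"))`. Splitting request construction from sending keeps the first part easy to inspect without network access.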

Wow! It was that easy to get the top 5 artists for the given `user_id` 🍾. Now let's add ranking to your product :)

### Clean Up

Don't forget to delete your model (and its assets) and the datasets once you're finished with them. You can do so with the following CLI commands:

In [17]:
! shaped delete-model --model-name lastfm_artist_recommendations

! shaped delete-dataset --dataset-name lastfm_events
! shaped delete-dataset --dataset-name lastfm_users

! rm -r notebook_assets

message: Model with name 'lastfm_artist_recommendations' is deleting...

message: Dataset with name 'lastfm_events' was successfully scheduled for deletion.

message: Dataset with name 'lastfm_users' was successfully scheduled for deletion.

