This notebook will walk you through an example of setting up a model for the Movielens dataset stored in a csv file and then fetching ranked movies for a specific user.

Let's get started! 🚀

### Setup

Replace `<YOUR_API_KEY>` with your API key below.

*If you don't have an API Key, feel free to [signup on our website](https://www.shaped.ai/#contact-us) :)*

In [None]:
import os

SHAPED_API_KEY = os.getenv('TEST_SHAPED_API_KEY', '<YOUR_API_KEY>')

1. Install `shaped` to leverage the Shaped CLI to create, view, and use your model.
2. Install `pandas` to view and edit the sample dataset.
3. Install `pyyaml` to create Shaped Dataset and Model schema files.

In [39]:
! pip install shaped
! pip install pandas
! pip install pyyaml



Initialize the CLI with your API key.

In [None]:
! shaped init --api-key $SHAPED_API_KEY

### Download Public Dataset

Fetch the publicly hosted movielens dataset.

In [41]:
! echo "Downloading movielens data..."

DIR_NAME = "notebook_assets"
! mkdir $DIR_NAME
! wget http://files.grouplens.org/datasets/movielens/ml-100k.zip --no-check-certificate -P $DIR_NAME
! unzip $DIR_NAME/ml-100k.zip -d $DIR_NAME

Downloading movielens data...
--2023-04-19 18:45:24--  http://files.grouplens.org/datasets/movielens/ml-100k.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4924029 (4.7M) [application/zip]
Saving to: ‘notebook_assets/ml-100k.zip’


2023-04-19 18:45:25 (17.6 MB/s) - ‘notebook_assets/ml-100k.zip’ saved [4924029/4924029]

Archive:  notebook_assets/ml-100k.zip
   creating: notebook_assets/ml-100k/
  inflating: notebook_assets/ml-100k/allbut.pl  
  inflating: notebook_assets/ml-100k/mku.sh  
  inflating: notebook_assets/ml-100k/README  
  inflating: notebook_assets/ml-100k/u.data  
  inflating: notebook_assets/ml-100k/u.genre  
  inflating: notebook_assets/ml-100k/u.info  
  inflating: notebook_assets/ml-100k/u.item  
  inflating: notebook_assets/ml-100k/u.occupation  
  inflating: notebook_assets/ml-100k/u.user  
  inflat

Let's take a look at the downloaded dataset. There are three tables of interest:
- `ratings` which are stored in `ml-100k/u.data`
- `users` which are stored in `ml-100k/u.user`
- `movies` which are stored in `ml-100k/u.item`

Unfortunately each of these tab separated files don't have a header (which is required by Shaped). To address this, we can prepend the header as shown below:

In [42]:
import pandas as pd

data_dir = "notebook_assets/ml-100k"

events_cols = ['user_id', 'movie_id', 'rating', 'timestamp']
events_df = pd.read_csv(f'{data_dir}/u.data', sep='\t', names=events_cols, encoding='latin-1')
display(events_df.head())
events_df.to_csv(f'{data_dir}/events.csv', sep='\t', index=False)

users_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users_df = pd.read_csv(f'{data_dir}/u.user', sep='|', names=users_cols, encoding='latin-1')
display(users_df.head())
users_df.to_csv(f'{data_dir}/users.csv', sep='\t', index=False)

genre_cols = [
    "genre_unknown", "Action", "Adventure", "Animation", "Children", "Comedy",
    "Crime", "Documentary", "Drama", "Fantasy", "Film_Noir", "Horror",
    "Musical", "Mystery", "Romance", "Sci_Fi", "Thriller", "War", "Western"
]
movies_cols = ['movie_id', 'title', 'release_date', "video_release_date", "imdb_url"] + genre_cols
movies_df = pd.read_csv(f'{data_dir}/u.item', sep='|', names=movies_cols, encoding='latin-1')
# Drop null column.
movies_df = movies_df.drop(columns=["video_release_date"])
display(movies_df.head())
movies_df.to_csv(f'{data_dir}/items.csv', sep='\t', index=False)

Unnamed: 0,user_id,movie_id,rating,unix_timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


Unnamed: 0,user_id,age,sex,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213


Unnamed: 0,movie_id,title,release_date,imdb_url,genre_unknown,Action,Adventure,Animation,Children,Comedy,...,Fantasy,Film_Noir,Horror,Musical,Mystery,Romance,Sci_Fi,Thriller,War,Western
0,1,Toy Story (1995),01-Jan-1995,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),01-Jan-1995,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),01-Jan-1995,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),01-Jan-1995,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),01-Jan-1995,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


### Upload Data to Shaped

Shaped has support for many data connectors! For this tutorial we're going to be using native Shaped Datasets. To do that we need to:
1. Create a .yaml file containing the dataset schema definition.
2. Use Shaped CLI to create the dataset.
3. Use Shaped CLI to upload the .csv files we just created.

In [43]:
"""
Create a Shaped Dataset schema for each of the datasets and store in a .yaml file.
"""

import yaml

dir_path = "notebook_assets"

events_dataset_schema = {
    "dataset_name": "movielens_events",
    "schema_type": "CUSTOM",
    "schema": {
        "rating": "Int32",
        "user_id": "String",
        "movie_id": "String",
        "timestamp": "DateTime"
    }
}

with open(f'{dir_path}/events_dataset_schema.yaml', 'w') as file:
    yaml.dump(events_dataset_schema, file)


users_dataset_schema = {
    "dataset_name": "movielens_users",
    "schema_type": "CUSTOM",
    "schema": {
        "user_id": "String",
        "age": "Int32",
        "sex": "String",
        "occupation": "String",
        "zip_code": "String"
    }
}


with open(f'{dir_path}/users_dataset_schema.yaml', 'w') as file:
    yaml.dump(users_dataset_schema, file)

items_dataset_schema = {
    "dataset_name": "movielens_items",
    "schema_type": "CUSTOM",
    "schema": {
        "movie_id": "String",
        "title": "String",
        "release_date": "DateTime",
        "imdb_url": "String",
        "genre_unknown": "Bool",
        "Action": "Bool",
        "Adventure": "Bool",
        "Animation": "Bool",
        "Children": "Bool",
        "Comedy": "Bool",
        "Crime": "Bool",
        "Documentary": "Bool",
        "Drama": "Bool",
        "Fantasy": "Bool",
        "Film_Noir": "Bool",
        "Horror": "Bool",
        "Musical": "Bool",
        "Mystery": "Bool",
        "Romance": "Bool",
        "Sci_Fi": "Bool",
        "Thriller": "Bool",
        "War": "Bool",
        "Western": "Bool",
    }
}

with open(f'{dir_path}/items_dataset_schema.yaml', 'w') as file:
    yaml.dump(items_dataset_schema, file)

In [47]:
"""
Create a Shaped Dataset using the .yaml schema files.
"""
! shaped create-dataset --file $DIR_NAME/events_dataset_schema.yaml
! shaped create-dataset --file $DIR_NAME/users_dataset_schema.yaml
! shaped create-dataset --file $DIR_NAME/items_dataset_schema.yaml

{
  "dataset_name": "movielens_events",
  "schema": {
    "movie_id": "String",
    "rating": "Int32",
    "timestamp": "DateTime",
    "user_id": "String"
  },
  "schema_type": "CUSTOM"
}
message: Dataset with name 'movielens_events' was successfully scheduled for creation

{
  "dataset_name": "movielens_users",
  "schema": {
    "age": "Int32",
    "occupation": "String",
    "sex": "String",
    "user_id": "String",
    "zip_code": "Int32"
  },
  "schema_type": "CUSTOM"
}
message: Dataset with name 'movielens_users' was successfully scheduled for creation

{
  "dataset_name": "movielens_items",
  "schema": {
    "Action": "Bool",
    "Adventure": "Bool",
    "Animation": "Bool",
    "Children": "Bool",
    "Comedy": "Bool",
    "Crime": "Bool",
    "Documentary": "Bool",
    "Drama": "Bool",
    "Fantasy": "Bool",
    "Film_Noir": "Bool",
    "Horror": "Bool",
    "Musical": "Bool",
    "Mystery": "Bool",
    "Romance": "Bool",
    "Sci_Fi": "Bool",
    "Thriller": "Bool",
    "War"

It takes a moment to provision the infrastructure required for the datasets. You can monitor them using the CLI commnad:

In [48]:
! shaped list-datasets

datasets:
- name: movielens_events
  schema_type: CUSTOM
  status: ACTIVE
- name: movielens_users
  schema_type: CUSTOM
  status: ACTIVE
- name: movielens_items
  schema_type: CUSTOM
  status: ACTIVE



In [49]:
"""
Upload the .csv files. You'll see the records uploading in batches of 1000. To upload all 100,000 events, it will take a couple minutes.
"""

! shaped dataset-insert --dataset-name movielens_events --file notebook_assets/ml-100k/events.csv --type 'tsv'
! shaped dataset-insert --dataset-name movielens_users --file notebook_assets/ml-100k/users.csv --type 'tsv'
! shaped dataset-insert --dataset-name movielens_items --file notebook_assets/ml-100k/items.csv --type 'tsv'

100000 Records [01:54, 874.94 Records/s]
943 Records [00:00, 981.78 Records/s]
1682 Records [00:03, 467.15 Records/s]


### Model Creation

We're now ready to create your Shaped model! To keep things simple, today, we're using the ratings records to build a collaborative filtering model. Shaped will use these ratings to determine which users like which movie with the assumption that the higher the rating the more likely a user likes the rated movie.


1. Create a .yaml file containing the model schema definition.
2. Use Shaped CLI to create the model!

For further details about creating models please refer to the [Create Model](https://docs.shaped.ai/docs/api#tag/Model/operation/post_create_models_post) API reference.

In [50]:
"""
Create a Shaped Model schema and store in a .yaml file.
"""

import yaml

movielens_ratings_model_schema = {
    "model": {
        "name": "movielens_movie_recommendations"
    },
    "connectors": [
        {
            "type": "Dataset",
            "id": "movielens_events",
            "name": "movielens_events"
        },
        {
            "type": "Dataset",
            "id": "movielens_users",
            "name": "movielens_users"
        },
        {
            "type": "Dataset",
            "id": "movielens_items",
            "name": "movielens_items"
        },
    ],
    "fetch": {
        "events": "SELECT user_id, movie_id AS item_id, unix_timestamp AS created_at, rating AS label FROM movielens_events",
        "users": "SELECT user_id, age, sex, occupation, zip_code FROM movielens_users",
        "items": "SELECT movie_id AS item_id, title, release_date, imdb_url, genre_unknown, Action, Adventure, Animation, Children, Comedy, Crime, Documentary, Drama, Fantasy, Film_Noir, Horror, Musical, Mystery, Romance, Sci_Fi, Thriller, War, Western FROM movielens_items" 
    }
}

with open(f'{dir_path}/movielens_ratings_model_schema.yaml', 'w') as file:
    yaml.dump(movielens_ratings_model_schema, file)

In [53]:
"""
Create a Shaped Model using the .yaml schema file.
"""

! shaped create-model --file $DIR_NAME/movielens_ratings_model_schema.yaml

{
  "connectors": [
    {
      "id": "movielens_events",
      "name": "movielens_events",
      "type": "Dataset"
    },
    {
      "id": "movielens_users",
      "name": "movielens_users",
      "type": "Dataset"
    },
    {
      "id": "movielens_items",
      "name": "movielens_items",
      "type": "Dataset"
    }
  ],
  "fetch": {
    "events": "SELECT user_id, movie_id AS item_id, unix_timestamp AS created_at, rating AS label FROM movielens_events",
    "items": "SELECT movie_id AS item_id, title, release_date, imdb_url, genre_unknown, Action, Adventure, Animation, Children, Comedy, Crime, Documentary, Drama, Fantasy, Film_Noir, Horror, Musical, Mystery, Romance, Sci_Fi, Thriller, War, Western FROM movielens_items",
    "users": "SELECT user_id, age, sex, occupation, zip_code FROM movielens_users"
  },
  "model": {
    "name": "movielens_movie_recommendations"
  }
}
model_url: https://api.prod.shaped.ai/v1/models/movielens_movie_recommendations



Your recommendation model can take up to a few hours to provision your infrastructure and train on your historic events. This time mostly depends on how large your dataset is i.e. the volume of your users, items and interactions and the number of attributes you're providing.

While the model is being setup, you can view its status with either the [List Models](https://docs.shaped.ai/docs/api#tag/Model/operation/get_models_models_get) or [View Model](https://docs.shaped.ai/docs/api) endpoints. For example, with the CLI:

In [57]:
! shaped list-models

models:
- model_name: movielens_movie_recommendations
  model_uri: https://api.prod.shaped.ai/v1/models/movielens_movie_recommendations
  created_at: 2023-04-19T19:02:37 UTC
  trained_at: 2023-04-19T19:27:19 UTC
  status: ACTIVE



The initial model creation goes through the following stages in order:

1. `SCHEDULING`<br/>
2. `FETCHING`<br/>
3. `TRAINING`<br/>
4. `DEPLOYING`<br/>
5. `ACTIVE`

You can periodically poll Shaped to inspect these status changes. Once it's in the ACTIVE state, you can move to next step and use it to make rank requests.

### Rank!

You're now ready to fetch your movie recommendations! You can do this with the [Rank endpoint](https://docs.shaped.ai/docs/api#tag/Rank/operation/post_rank_models__model_id__rank_post). Just provide the user_id you wish to get the recommendations for and the number of recommendations you want returned.

Shaped's CLI provides a convenience rank command to quickly retrieve results from the command line. You can use it as follows:

In [63]:
! shaped rank --model-name movielens_movie_recommendations --user-id 1 --limit 5

ids:
- '480'
- '435'
- '919'
- '276'
- '576'
scores:
- 0.8985671586432267
- 0.47889159105683576
- 0.6075377115092316
- 0.17812716399151673
- 0.6375729382785116



The response returns 2 parallel arrays containing the ids and ranking scores for the movies that Shaped estimates are most interesting to the given user.

If you want to integrate this endpoint into your website or application you can use the Rank POST REST endpoint directly with the following request:

In [None]:
! curl https://api.prod.shaped.ai/v1/models/movielens_movie_recommendations/rank \
  -H "x-api-key: <API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{ "user_id": "1", "limit": 5 }'

Wow! It was that easy to see top 5 rated movies for the passed in `user_id` 🍾. Now let's add ranking to your product :)

### Clean Up

Don't forget to delete your model (and its assets) and the datasets once you're finished with them. You can do it with the following CLI command:

In [None]:
! shaped delete-model --model-name movielens_movie_recommendations

! shaped delete-dataset --dataset-name movielens_events
! shaped delete-dataset --dataset-name movielens_users
! shaped delete-dataset --dataset-name movielens_items

! rm -r notebook_assets