This notebook will walk you through an example of setting up a model for a movies dataset stored in a parquet file and then fetching ranked movies for a specific user.

Let's get started! 🚀

### Setup

Set your API key in the `SHAPED_KEY` environment variable; the cell below reads it from there.

*If you don't have an API key, feel free to [sign up on our website](https://www.shaped.ai/#contact-us) :)*

In [1]:
import os

SHAPED_API_KEY = os.getenv("SHAPED_KEY")
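
If the environment variable is missing, `os.getenv` silently returns `None` and every later request would be sent without a valid key. A minimal sketch of a stricter lookup follows; the helper name `load_api_key` is hypothetical and not part of any Shaped client.

```python
import os


def load_api_key(env_var: str = "SHAPED_KEY") -> str:
    """Return the API key stored in `env_var`, or raise a clear error.

    `load_api_key` is an illustrative helper, not part of the Shaped API.
    """
    key = os.getenv(env_var)
    if not key:
        # Fail early instead of sending requests with a missing key.
        raise RuntimeError(f"Environment variable {env_var} is not set or is empty.")
    return key
```

With this in place, `SHAPED_API_KEY = load_api_key()` could replace the plain `os.getenv` call above.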

Install the packages needed:
- `requests` is needed for making HTTP requests
- `pandas` is needed for handling the data
- `boto3` is needed for making calls to AWS, specifically S3
- `pyarrow` is needed for creating the parquet file from the downloaded dataset

In [2]:
!pip install requests
!pip install pandas
!pip install boto3
!pip install pyarrow



Import the modules needed.

In [3]:
from urllib.request import urlretrieve
import requests
import zipfile
import pandas as pd
from IPython.display import display
import json
from datetime import datetime
import pyarrow as pa
import pyarrow.parquet as pq

### Download Public Dataset

Fetch the publicly hosted movie dataset.

In [4]:
print("Downloading movielens data...")
print()

urlretrieve("http://files.grouplens.org/datasets/movielens/ml-100k.zip", "movielens.zip")
zip_ref = zipfile.ZipFile('movielens.zip', "r")
zip_ref.extractall()

print("Dataset contains:")
print(zip_ref.read('ml-100k/u.info').decode('UTF-8'))

Downloading movielens data...

Dataset contains:
943 users
1682 items
100000 ratings



Let's take a look at the downloaded dataset. There are three tables of interest:
- `ratings` which are stored in `ml-100k/u.data`
- `users` which are stored in `ml-100k/u.user`
- `movies` which are stored in `ml-100k/u.item`

*Note that the `users` and `movies` tables have additional columns (also known as "features"). Currently Shaped accepts a single table, so to feed these features to the model we'd need to join all 3 tables and save the result as a parquet file. Then we can define the features in the call to set up the model (shown a few cells below).*

*Adding support for easier integration to your data is on our immediate roadmap :)*
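
The three-table join described in the note can be sketched with `pandas.DataFrame.merge`. The tiny frames below are stand-ins so the example runs on its own: the user and movie rows echo the previews in this notebook, while the two rating rows are made up for illustration. How the extra columns would then be declared as features in the model setup call is not covered here.

```python
import pandas as pd

# Illustrative ratings (made-up values, same column names as `ml-100k/u.data`).
ratings = pd.DataFrame({
    "user_id": [1, 2],
    "movie_id": [1, 2],
    "rating": [5, 3],
})

# A subset of the user columns, taken from the `users` preview.
users = pd.DataFrame({
    "user_id": [1, 2],
    "age": [24, 53],
    "sex": ["M", "F"],
    "occupation": ["technician", "other"],
    "zip_code": ["85711", "94043"],
})

# A subset of the movie columns, taken from the `movies` preview.
movies = pd.DataFrame({
    "movie_id": [1, 2],
    "title": ["Toy Story (1995)", "GoldenEye (1995)"],
})

# Left joins keep every rating row and attach the matching user and movie features.
joined = (
    ratings
    .merge(users, on="user_id", how="left")
    .merge(movies, on="movie_id", how="left")
)
print(joined)
```

The resulting single table could then be written out with `joined.to_parquet(...)` in place of the ratings-only file used in the rest of this notebook.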

In [5]:
ratings_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv('ml-100k/u.data', sep='\t', names=ratings_cols, encoding='latin-1')
display(ratings.head())

users_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv('ml-100k/u.user', sep='|', names=users_cols, encoding='latin-1')
display(users.head())

genre_cols = [
    "genre_unknown", "Action", "Adventure", "Animation", "Children", "Comedy",
    "Crime", "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror",
    "Musical", "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western"
]
movies_cols = ['movie_id', 'title', 'release_date', "video_release_date", "imdb_url"] + genre_cols
movies = pd.read_csv('ml-100k/u.item', sep='|', names=movies_cols, encoding='latin-1')
display(movies.head())

Unnamed: 0,user_id,movie_id,rating,unix_timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


Unnamed: 0,user_id,age,sex,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213


Unnamed: 0,movie_id,title,release_date,video_release_date,imdb_url,genre_unknown,Action,Adventure,Animation,Children,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


### Upload Data to Shaped

We need to do a couple of things to the downloaded dataset before uploading it to Shaped:
1) Convert the `unix_timestamp` column to a timezone-aware `datetime`.
2) Save the `ratings` table as a parquet file.

In [6]:
ratings['unix_timestamp'] = pd.to_datetime(ratings['unix_timestamp'], unit='s', utc=True)
print(ratings)
ratings_parquet_filename = 'ratings.parquet'
ratings_table = pa.Table.from_pandas(ratings)
pq.write_table(ratings_table, where=ratings_parquet_filename)

       user_id  movie_id  rating            unix_timestamp
0          196       242       3 1997-12-04 15:55:49+00:00
1          186       302       3 1998-04-04 19:22:22+00:00
2           22       377       1 1997-11-07 07:18:36+00:00
3          244        51       2 1997-11-27 05:02:03+00:00
4          166       346       1 1998-02-02 05:33:16+00:00
...        ...       ...     ...                       ...
99995      880       476       3 1997-11-22 05:10:44+00:00
99996      716       204       5 1997-11-17 19:39:03+00:00
99997      276      1090       1 1997-09-20 22:49:55+00:00
99998       13       225       2 1997-12-17 22:52:36+00:00
99999       12       203       3 1997-11-19 17:13:03+00:00

[100000 rows x 4 columns]
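
As a quick sanity check of the conversion, the first raw timestamp from the ratings preview (`881250949`) can be converted with the standard library alone. This is only a cross-check of the first row, not a replacement for the pandas call; the helper name `to_utc_datetime` is illustrative.

```python
from datetime import datetime, timezone


def to_utc_datetime(unix_seconds: int) -> datetime:
    """Convert seconds since the Unix epoch to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc)


first = to_utc_datetime(881250949)
print(first.isoformat(sep=" "))  # 1997-12-04 15:55:49+00:00
```

The printed value matches the first row of the converted `ratings` table.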


Once we have all our data prepared, we can upload it using a [`POST` call to the `/models` endpoint](https://shaped.stoplight.io/docs/shaped-api/b3A6NDA5ODEzNjc-create-model). The body of the request contains all the info needed to set up the model. [Please reference the docs for details on each field and their types](https://shaped.stoplight.io/docs/shaped-api/b3A6NDA5ODEzNjc-create-model#request-body).

*If you try `POST`ing to the `/models` endpoint multiple times with the same `model_name`, you will encounter an error saying `"Model with name: '{model_name}' already exists with status: '{status}'"`. If you would like to update or create a new model with the same `model_name` you must first delete the existing model with `model_name`. You can do that by making a [`DELETE` request to the `/models/{model_name}` endpoint](https://shaped.stoplight.io/docs/shaped-api/b3A6NDA5ODEzNjg-delete-model). The `DELETE` call can be made from the cell in the Clean Up section at the bottom of this notebook.*

In [7]:
model_name = "rating_events"
response = requests.post(
  "https://api.prod.shaped.ai/v0/models",
  headers={
      "x-api-key": SHAPED_API_KEY,
      "Content-Type":"application/json"
  },
  json={
    "model_name": model_name,
    "train_schedule": "@once",
    "connector_configs": [{
      "type": "File"
    }],
    'schema': {
      "user": {
        "id": "user_id",
        "source": "None"
      },
      "item": {
        "id": "movie_id",
        "source": "None"
      },
      "interaction": {
        "created_at": "unix_timestamp",
        "source": f"./{ratings_parquet_filename}",
        "label": {
            "name": "rating",
            "type": "Rating"    
        }
      },
    }
  }
)
upload_request = json.loads(response.content)
print(json.dumps(upload_request, indent=2))

{
  "error": "Required field missing: 'connector_configs'."
}


A successful `POST` call to the `/models` endpoint returns a JSON object containing info, namely `'url'` and `'fields'`, about the S3 bucket to upload your data to. So let's go ahead and use those to upload our movie data!

*In this example we are using the `File` type in `connector_configs`, which requires explicitly uploading the data to an S3 bucket. If we use a different supported type, like `Redshift` for example, the data will be pulled directly from your data source and won't require you to manually upload it.*

In [8]:
with open(ratings_parquet_filename, 'rb') as file:
    files = {'file': (ratings_parquet_filename, file)}
    upload_response = requests.post(upload_request['url'], data=upload_request['fields'], files=files)

print(f"Upload response: {upload_response}")

KeyError: 'url'
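
The `KeyError` above arises because the parsed create-model response held an `"error"` key rather than `'url'` and `'fields'`. A small guard, sketched below, surfaces the API's own message before the upload is attempted. The helper name `require_upload_info` is hypothetical, and it assumes only the two response shapes seen in this notebook.

```python
def require_upload_info(payload: dict) -> tuple:
    """Return (url, fields) from a create-model response, or raise with the API error.

    Illustrative helper: it relies only on the `error`, `url` and `fields`
    keys that appear in this notebook.
    """
    if "error" in payload:
        raise RuntimeError(f"Model creation failed: {payload['error']}")
    missing = [key for key in ("url", "fields") if key not in payload]
    if missing:
        raise RuntimeError(f"Response is missing expected keys: {missing}")
    return payload["url"], payload["fields"]
```

Calling `url, fields = require_upload_info(upload_request)` before opening the parquet file would stop the notebook at the failed request instead of at the upload.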

### Rank!

After we make the `POST` call to `/models`, we can make a [`GET` call to `/models`](https://shaped.stoplight.io/docs/shaped-api/b3A6NDA5ODEzNjY-list-models) to see our newly created model. 

In [None]:
response = requests.get(
    "https://api.prod.shaped.ai/v0/models",
    headers={
        "x-api-key": SHAPED_API_KEY,
        "Content-Type":"application/json"
    }
)
print(json.dumps(json.loads(response.content), indent=2))

You'll notice the `"status"` of the model you just created is most likely `"PREPARING"`. This means that the initial training job hasn't completed yet. Depending on the size of your data, this could take up to 30 minutes. Feel free to keep querying the `/models` endpoint to check the status of your model. When it is ready, the `"status"` will read `"ACTIVE"`.
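
Rather than re-running the cell by hand, the status check can be wrapped in a polling loop. The sketch below is deliberately independent of the response layout: `fetch_status` is a hypothetical callable you would write to call `/models` and return your model's `"status"` string, and the default interval and timeout are assumptions (the timeout mirrors the 30 minutes mentioned above).

```python
import time
from typing import Callable


def wait_until_active(
    fetch_status: Callable[[], str],
    poll_seconds: float = 60.0,
    timeout_seconds: float = 1800.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll `fetch_status` until it returns "ACTIVE" or the timeout elapses.

    `fetch_status` is supplied by the caller; this helper does not know how
    the status is stored in the API response.
    """
    waited = 0.0
    while True:
        status = fetch_status()
        if status == "ACTIVE":
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"Model still '{status}' after {waited:.0f} seconds.")
        sleep(poll_seconds)
        waited += poll_seconds
```

Passing `sleep` as a parameter keeps the loop easy to test without actually waiting.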

Once your model is ready (`"status": "ACTIVE"`), you can hit the [`/models/{model_name}/rank?context_id={context_id}` endpoint](https://shaped.stoplight.io/docs/shaped-api/b3A6NDA5ODEzNjk-get-ranked-results)!

Remember, `{context_id}` is the id of the entity (in this example, a user) you want to fetch rankings for. You can also add an optional query param, `limit`, which sets how many results to return (the default is 5).

In [None]:
response = requests.get(
    f"https://api.prod.shaped.ai/v0/models/{model_name}/rank?context_id=1",
    headers={
        "x-api-key": SHAPED_API_KEY,
        "Content-Type":"application/json"
    }
)
print(json.dumps(json.loads(response.content), indent=2))
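
The optional `limit` query param mentioned earlier can be added when building the URL. This is a minimal sketch using only the standard library; `rank_url` is a hypothetical helper, and the path and parameter names are the ones used in this notebook.

```python
from typing import Optional
from urllib.parse import quote, urlencode

BASE_URL = "https://api.prod.shaped.ai/v0"


def rank_url(model_name: str, context_id, limit: Optional[int] = None) -> str:
    """Build the rank endpoint URL, adding `limit` only when it is given."""
    params = {"context_id": context_id}
    if limit is not None:
        params["limit"] = limit
    # Percent-encode the model name and the query values.
    return f"{BASE_URL}/models/{quote(model_name, safe='')}/rank?{urlencode(params)}"


print(rank_url("rating_events", 1, limit=10))
```

The returned string can be passed to `requests.get` with the same headers as in the cell above.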

Wow! It was that easy to see the top 5 ranked movies for the passed-in `context_id` 🍾. Now let's add ranking to your product :)

### Clean Up

__The below code should ONLY be run if you want to delete the model with `model_name`.__

In [None]:
response = requests.delete(
    f"https://api.prod.shaped.ai/v0/models/{model_name}",
    headers={
        "x-api-key": SHAPED_API_KEY,
        "Content-Type":"application/json"
    }
)
print(json.dumps(json.loads(response.content), indent=2))