This notebook will walk you through an example of:
1. Downloading the Goodbooks dataset as csv files.
2. Creating a Shaped dataset and uploading the csv files into it through the MongoDB connector.
3. Setting up a model for the Goodbooks dataset from the MongoDB instance.
4. Fetching the books that a specific reader is most likely to read.

Let's get started! 🚀








## Using Shaped
### Setup

1. Install `shaped` to leverage the Shaped CLI to create, view, and use your model.
2. Install `pyyaml` to create Model schema files.

In [None]:
! pip install shaped
! pip install pyyaml

Collecting shaped
  Downloading shaped-0.6.1-py3-none-any.whl (4.6 kB)
Collecting pyarrow==11.0.0 (from shaped)
  Downloading pyarrow-11.0.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.9 MB)
Collecting tqdm==4.65.0 (from shaped)
  Downloading tqdm-4.65.0-py3-none-any.whl (77 kB)
Collecting colorama<0.5.0,>=0.4.3 (from typer[all]>=0.7.0->shaped)
  Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Collecting shellingham<2.0.0,>=1.3.0 (from typer[all]>=0.7.0->shaped)
  Downloading shellingham-1.5.3-py2.py3-none-any.whl (9.7 kB)
Installing collected packages: tqdm, shellingham, pyarrow, colorama, shaped
  Attempting uninstall: tqdm
    Found existing installation: tqdm 4.66.1
    Uninstalling tqdm-4.66.1:
      Successfully unin

### Initialize the Client


Replace `<YOUR_API_KEY>` with your API key below.

*If you don't have an API key, feel free to [sign up on our website](https://www.shaped.ai/#contact-us) :)*

In [None]:
import os

SHAPED_API_KEY = os.getenv('TEST_SHAPED_API_KEY', '<YOUR_API_KEY>')

Initialize the Shaped CLI with your API key.

In [None]:
! shaped init --api-key $SHAPED_API_KEY

## Preparing Data
### Setup
Install `pandas` to view and edit the sample dataset.

In [None]:
! pip install pandas



### Download Public Dataset

Fetch the publicly hosted Goodbooks dataset.

In [22]:
import zipfile
import os

def download_and_extract_dataset(url, destination_directory):
    print(f"Downloading dataset from {url}...")
    os.makedirs(destination_directory, exist_ok=True)
    zip_file_path = os.path.join(destination_directory, os.path.basename(url))

    # Download the ZIP file
    !wget $url --no-check-certificate -P $destination_directory

    # Extract the contents of the ZIP file
    with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
        zip_ref.extractall(destination_directory)

# Directory name for storing datasets
DIR_NAME = "notebook_assets"

# Download and extract each dataset
datasets = [
    ("https://github.com/zygmuntz/goodbooks-10k/releases/download/v1.0/ratings.zip", DIR_NAME),
    ("https://github.com/zygmuntz/goodbooks-10k/releases/download/v1.0/to_read.zip", DIR_NAME),
    ("https://github.com/zygmuntz/goodbooks-10k/releases/download/v1.0/books.zip", DIR_NAME)
]

for dataset_url, destination_dir in datasets:
    download_and_extract_dataset(dataset_url, destination_dir)


Downloading dataset from https://github.com/zygmuntz/goodbooks-10k/releases/download/v1.0/ratings.zip...
--2023-09-12 01:45:55--  https://github.com/zygmuntz/goodbooks-10k/releases/download/v1.0/ratings.zip
Resolving github.com (github.com)... 140.82.113.3
Connecting to github.com (github.com)|140.82.113.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/103417214/a09dc434-9d79-11e7-906f-2ce45a241d63?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20230912%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230912T014555Z&X-Amz-Expires=300&X-Amz-Signature=97a0c4b5115715aa9f63dbc2fffe828318836341ac0428c6d0f09dd60550e001&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=103417214&response-content-disposition=attachment%3B%20filename%3Dratings.zip&response-content-type=application%2Foctet-stream [following]
--2023-09-12 01:45:55--  https://objects.githubuserconte
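The download cell above relies on the IPython `!wget` shell escape, so it only runs inside a notebook. As a rough sketch of the same download-and-extract step in plain Python (an alternative, not the approach this notebook uses), the standard library can fetch and unpack the archive. Unlike the `--no-check-certificate` flag above, this version keeps certificate verification on. The function name `download_and_extract` is made up for this example.

```python
import os
import shutil
import urllib.request
import zipfile


def download_and_extract(url, destination_directory):
    """Download a ZIP archive and extract it; return the extracted member names."""
    os.makedirs(destination_directory, exist_ok=True)
    zip_path = os.path.join(destination_directory, os.path.basename(url))

    # Stream the response to disk instead of holding the archive in memory.
    with urllib.request.urlopen(url) as response, open(zip_path, "wb") as out:
        shutil.copyfileobj(response, out)

    # Extract next to the downloaded archive, as the notebook cell does.
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(destination_directory)
        return archive.namelist()


# Example call (commented out so the sketch does not hit the network on import):
# download_and_extract(
#     "https://github.com/zygmuntz/goodbooks-10k/releases/download/v1.0/ratings.zip",
#     "notebook_assets",
# )
```

Returning the member names makes it easy to confirm that `ratings.csv` or `books.csv` actually landed in the target directory.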

Let's take a look at the downloaded dataset. There are two tables of interest:
- `books`, which is stored in `books.csv`
- `ratings`, which is stored in `ratings.csv`

In [32]:
import pandas as pd

data_dir = "notebook_assets"

events_df = pd.read_csv(f'{data_dir}/ratings.csv')
books_df = pd.read_csv(f'{data_dir}/books.csv')
display(events_df.head())
display(books_df.head())

Unnamed: 0,user_id,book_id,rating
0,1,258,5
1,2,4081,4
2,2,260,5
3,2,9296,5
4,2,2318,3


Unnamed: 0,book_id,goodreads_book_id,best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,...,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url
0,1,2767052,2767052,2792775,272,439023483,9780439000000.0,Suzanne Collins,2008.0,The Hunger Games,...,4780653,4942365,155254,66715,127936,560092,1481305,2706317,https://images.gr-assets.com/books/1447303603m...,https://images.gr-assets.com/books/1447303603s...
1,2,3,3,4640799,491,439554934,9780440000000.0,"J.K. Rowling, Mary GrandPré",1997.0,Harry Potter and the Philosopher's Stone,...,4602479,4800065,75867,75504,101676,455024,1156318,3011543,https://images.gr-assets.com/books/1474154022m...,https://images.gr-assets.com/books/1474154022s...
2,3,41865,41865,3212258,226,316015849,9780316000000.0,Stephenie Meyer,2005.0,Twilight,...,3866839,3916824,95009,456191,436802,793319,875073,1355439,https://images.gr-assets.com/books/1361039443m...,https://images.gr-assets.com/books/1361039443s...
3,4,2657,2657,3275794,487,61120081,9780061000000.0,Harper Lee,1960.0,To Kill a Mockingbird,...,3198671,3340896,72586,60427,117415,446835,1001952,1714267,https://images.gr-assets.com/books/1361975680m...,https://images.gr-assets.com/books/1361975680s...
4,5,4671,4671,245494,1356,743273567,9780743000000.0,F. Scott Fitzgerald,1925.0,The Great Gatsby,...,2683664,2773745,51992,86236,197621,606158,936012,947718,https://images.gr-assets.com/books/1490528560m...,https://images.gr-assets.com/books/1490528560s...


In most cases, you'd have to spend time cleaning all this data. With Shaped, however, you can feed it through in this state and Shaped will do the cleaning for you. We do this by treating all input data as unstructured and using large language models to distill the meaning of each column.


Shaped doesn't require much data to work. At a minimum, we need to know the `user_id`, `item_id`, `label`, and `created_at` columns of the interactions table. If the `users` and `items` tables are provided, the only requirement is that their respective id columns are aliased to `user_id` and `item_id`.
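To make the minimum requirement concrete, here is a small sketch that checks which of the four interaction columns are still missing once a set of aliases is applied. The helper `missing_interaction_columns` and the alias mapping are illustrative only and are not part of the Shaped CLI or API.

```python
# The four interaction columns named in the paragraph above.
REQUIRED_INTERACTION_COLUMNS = {"user_id", "item_id", "label", "created_at"}


def missing_interaction_columns(columns, aliases=None):
    """Return the required columns that are absent after renaming, sorted by name."""
    aliases = aliases or {}
    renamed = {aliases.get(name, name) for name in columns}
    return sorted(REQUIRED_INTERACTION_COLUMNS - renamed)


# Columns of the Goodbooks ratings table as downloaded.
raw_columns = ["user_id", "book_id", "rating"]
print(missing_interaction_columns(raw_columns))
# ['created_at', 'item_id', 'label']

# Aliasing book_id and rating covers everything except the timestamp.
aliases = {"book_id": "item_id", "rating": "label"}
print(missing_interaction_columns(raw_columns, aliases))
# ['created_at']
```

The second result shows the gap the rest of this section deals with: the ratings data has no timestamp column.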

To keep things simple, this tutorial focuses on the interaction data only (i.e., `events_df`). All 3 existing columns are relevant:
- `user_id`: The reader who is rating the book. The IDs are contiguous and range from 1 to 53424.
- `book_id`: A unique identifier for a book. The IDs are contiguous and range from 1 to 10000. It will be used as the item when training our models.
- `rating`: The rating a reader gave to a book, ranging from 1 to 5.

Besides the above 3 columns, Shaped also requires the interaction data to have a `created_at` column, which is missing from the given data. As specified in the [docs](http://fastml.com/goodbooks-10k-a-new-dataset-for-book-recommendations/), the ratings are already sorted by time, so we can manually create an extra `ratingTime` column that reflects that ordering.

In [None]:
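# Keep only the three relevant columns
events_df = events_df[["user_id", "book_id", "rating"]]
# Take a copy of the first 1000 rows so the assignment below does not write to a slice
events_df = events_df.head(1000).copy()
# Create a new column 'ratingTime' with ascending values
events_df['ratingTime'] = range(1, len(events_df) + 1)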

display(events_df.head())

Unnamed: 0,user_id,book_id,rating,ratingTime
0,1,258,5,1
1,2,4081,4,2
2,2,260,5,3
3,2,9296,5,4
4,2,2318,3,5
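The cell above adds `ratingTime` only to the in-memory DataFrame, while the model query later in this notebook reads `$.ratingTime` from the uploaded documents. If the file you upload should carry that column too, it has to be written out before the insert step. The sketch below shows the idea without pandas, using the standard `csv` module on an in-memory sample. The helper name `add_rating_time` is invented for this example.

```python
import csv
import io


def add_rating_time(source, target, column="ratingTime"):
    """Copy CSV rows from source to target, appending a 1-based ascending counter."""
    reader = csv.DictReader(source)
    fieldnames = list(reader.fieldnames) + [column]
    writer = csv.DictWriter(target, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    count = 0
    for count, row in enumerate(reader, start=1):
        # Rows are already sorted by time, so the row position stands in for a timestamp.
        row[column] = count
        writer.writerow(row)
    return count


# Three sample rows taken from the ratings preview shown earlier.
sample = io.StringIO("user_id,book_id,rating\n1,258,5\n2,4081,4\n2,260,5\n")
output = io.StringIO()
rows_written = add_rating_time(sample, output)
print(rows_written)
print(output.getvalue())
```

With pandas already loaded, `events_df.to_csv(path, index=False)` achieves the same result for the DataFrame built above. Which file path to write to is left to you.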


### Create and Insert Shaped Dataset Through MongoDB Connector



Shaped supports many data connectors! For this tutorial we're going to use the MongoDB connector to upload the local data from csv files to Shaped. To do that we need to:
1. Set up your own MongoDB instance. Make sure you obtain its connection string ([how to find it](https://www.mongodb.com/basics/mongodb-connection-string)) and create both a database and a collection so we can reference them in the schema in step 2.
2. Create a `.yaml` file containing the dataset schema definition, and fill in the MongoDB instance's configuration details (i.e., the connection string and the names of the database and collection you created). In this tutorial, the database is named "goodbooks" and the collection is named "ratings".
3. Use the Shaped CLI to create the dataset with the schema defined in the `.yaml` file.
4. Use the Shaped CLI to insert the local `.csv` files into the Shaped dataset we've just created through the MongoDB connector.

In [None]:
"""
Create a Shaped Dataset schema for each of the datasets and store in a .yaml file.
"""

import yaml

dir_path = "notebook_assets"

events_dataset_schema = {
    "dataset_name": "goodbooks_events",
    "schema_type": "MONGODB",
    "config": {
        "schedule_interval": "@daily",
        "mongodb_connection_string": "<YOUR_CONNECTION_STRING>",
        "database": "goodbooks",
        "collection": "ratings"
    }
}

with open(f'{dir_path}/events_dataset_schema.yaml', 'w') as file:
    yaml.dump(events_dataset_schema, file)
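The schema cell above leaves a `<YOUR_CONNECTION_STRING>` placeholder to fill in by hand. One option, sketched below, is to build the same dictionary from a function that rejects strings that do not look like MongoDB URIs, so the secret can come from an environment variable instead of being pasted into the notebook. The function name and the `MONGODB_CONNECTION_STRING` variable name are assumptions for illustration. The dictionary keys are the ones used in the cell above.

```python
import os


def build_events_dataset_schema(connection_string, database="goodbooks", collection="ratings"):
    """Return the dataset schema dict used in this tutorial for a given connection string."""
    # MongoDB connection strings start with one of these two URI schemes.
    if not connection_string.startswith(("mongodb://", "mongodb+srv://")):
        raise ValueError("expected a connection string starting with mongodb:// or mongodb+srv://")
    return {
        "dataset_name": "goodbooks_events",
        "schema_type": "MONGODB",
        "config": {
            "schedule_interval": "@daily",
            "mongodb_connection_string": connection_string,
            "database": database,
            "collection": collection,
        },
    }


# In practice the value could come from the environment, for example:
#   connection_string = os.environ["MONGODB_CONNECTION_STRING"]  # hypothetical variable name
# A local placeholder keeps this example runnable as-is.
schema = build_events_dataset_schema("mongodb://localhost:27017")
print(schema["config"]["database"], schema["config"]["collection"])
# goodbooks ratings
```

The resulting dictionary can be passed to `yaml.dump` exactly as in the cell above.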

In [None]:
! shaped create-dataset --file $DIR_NAME/events_dataset_schema.yaml

In [24]:
! shaped list-datasets

datasets:
- dataset_name: movielens_ratings_6
  dataset_uri: https://api.prod.shaped.ai/v1/datasets/movielens_ratings_6
  created_at: 2023-03-29T18:37:28 UTC
  schema_type: CUSTOM
  status: ACTIVE
- dataset_name: movielens_ratings_7
  dataset_uri: https://api.prod.shaped.ai/v1/datasets/movielens_ratings_7
  created_at: 2023-03-29T19:20:15 UTC
  schema_type: CUSTOM
  status: ACTIVE
- dataset_name: movielens_ratings_8
  dataset_uri: https://api.prod.shaped.ai/v1/datasets/movielens_ratings_8
  created_at: 2023-03-29T20:08:43 UTC
  schema_type: CUSTOM
  status: ACTIVE
- dataset_name: movielens_ratings_dan
  dataset_uri: https://api.prod.shaped.ai/v1/datasets/movielens_ratings_dan
  created_at: 2023-05-10T07:31:07 UTC
  schema_type: CUSTOM
  status: ACTIVE
- dataset_name: md
  dataset_uri: https://api.prod.shaped.ai/v1/datasets/md
  created_at: 2023-05-10T13:19:36 UTC
  schema_type: CUSTOM
  status: ACTIVE
- dataset_name: mind_small
  dataset_uri: https://api.prod.shaped.ai/v1/datasets/mind

In [None]:
! shaped dataset-insert --dataset-name goodbooks_events --file $DIR_NAME/ratings.csv --type 'csv'

5976479 Records [35:33, 2801.76 Records/s]


### Model Creation

We're now ready to create your Shaped model! To keep things simple, we're using the Goodbooks rating data to build a collaborative filtering model. Shaped will use these ratings to determine which readers like which books, on the assumption that the higher the rating, the more likely a reader will want to read that book.


1. Create a `.yaml` file containing the model schema definition.
2. Use Shaped CLI to create the model!

For further details about creating models please refer to the [Create Model](https://docs.shaped.ai/docs/api#tag/Model/operation/post_create_models_post) API reference.

In [25]:
"""
Create a Shaped Model schema and store in a .yaml file.
"""

import yaml

goodbooks_ratings_model_schema = {
    "model": {
        "name": "goodbooks_book_recommendations"
    },
    "connectors": [
        {
            "type": "Dataset",
            "id": "goodbooks_events",
            "name": "goodbooks_events"
        }
    ],
    "fetch": {
        "events": "SELECT JSON_EXTRACT_STRING(document, '$.book_id') as item_id, JSON_EXTRACT_STRING(document, '$.rating') as label, JSON_EXTRACT_STRING(document, '$.ratingTime') as created_at, JSON_EXTRACT_STRING(document, '$.user_id') as user_id FROM goodbooks_events"
    }
}

dir_path = "notebook_assets"

with open(f'{dir_path}/goodbooks_ratings_model_schema.yaml', 'w') as file:
    yaml.dump(goodbooks_ratings_model_schema, file)

In [26]:
"""
Create a Shaped Model using the .yaml schema file.
"""

! shaped create-model --file $DIR_NAME/goodbooks_ratings_model_schema.yaml

{
  "connectors": [
    {
      "id": "goodbooks_events",
      "name": "goodbooks_events",
      "type": "Dataset"
    }
  ],
  "fetch": {
    "events": "SELECT JSON_EXTRACT_STRING(document, '$.book_id') as item_id, JSON_EXTRACT_STRING(document, '$.rating') as label, JSON_EXTRACT_STRING(document, '$.ratingTime') as created_at, JSON_EXTRACT_STRING(document, '$.user_id') as user_id FROM goodbooks_events"
  },
  "model": {
    "name": "goodbooks_book_recommendations"
  }
}
model_url: https://api.prod.shaped.ai/v1/models/goodbooks_book_recommendations
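The `events` query in the schema above aliases each field of the MongoDB document to the column name Shaped expects (`item_id`, `label`, `created_at`, `user_id`). As a sketch of how that string is put together, the hypothetical helper below generates it from a field-to-alias mapping. It only formats text and reproduces the query used in this notebook.

```python
def build_events_query(dataset_name, field_aliases):
    """Build the fetch query from a mapping of document field -> output column."""
    selections = ", ".join(
        f"JSON_EXTRACT_STRING(document, '$.{field}') as {alias}"
        for field, alias in field_aliases.items()
    )
    return f"SELECT {selections} FROM {dataset_name}"


# Same fields and aliases, in the same order, as the schema in this notebook.
query = build_events_query(
    "goodbooks_events",
    {
        "book_id": "item_id",
        "rating": "label",
        "ratingTime": "created_at",
        "user_id": "user_id",
    },
)
print(query)
```

Keeping the mapping in one place makes it harder to drop one of the four required aliases when adapting the query to another collection.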



Provisioning the infrastructure for your recommendation model and training it on your historic events can take up to a few hours. The time mostly depends on how large your dataset is, i.e., the volume of your users, items and interactions, and the number of attributes you're providing. For the model you just created, it will take no more than 30 minutes.

While the model is being set up, you can view its status with either the [List Models](https://docs.shaped.ai/docs/api#tag/Model/operation/get_models_models_get) or [View Model](https://docs.shaped.ai/docs/api) endpoints. For example, with the CLI command `shaped list-models` you can inspect the status of created models as they go through the following stages in order:

1. `SCHEDULING`<br/>
2. `FETCHING`<br/>
3. `TRAINING`<br/>
4. `DEPLOYING`<br/>
5. `ACTIVE`

In [27]:
! shaped list-models

models:
- model_name: mind_small_model
  model_uri: https://api.prod.shaped.ai/v1/models/mind_small_model
  created_at: 2023-08-14T16:57:17 UTC
  trained_at: 2023-09-06T15:53:06 UTC
  status: ACTIVE
- model_name: goodreads_model
  model_uri: https://api.prod.shaped.ai/v1/models/goodreads_model
  created_at: 2023-08-22T15:00:16 UTC
  trained_at: 2023-08-29T15:05:53 UTC
  status: ACTIVE
- model_name: goodreads_model_with_items
  model_uri: https://api.prod.shaped.ai/v1/models/goodreads_model_with_items
  created_at: 2023-08-22T15:02:20 UTC
  status: IDLE
- model_name: goodbooks_book_recommendations
  model_uri: https://api.prod.shaped.ai/v1/models/goodbooks_book_recommendations
  created_at: 2023-09-12T01:46:21 UTC
  status: SCHEDULING



You can periodically poll Shaped to inspect these status changes. Once the model is in the `ACTIVE` state, you can move on to the next step and use it to make rank requests.
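A polling loop for this could look roughly like the sketch below. It takes a `get_status` callable rather than calling Shaped directly, so how the status is fetched (for instance by parsing the output of `shaped list-models`) is left open. The function name, the default intervals, and the choice to treat any status outside the five listed stages as an error are all assumptions of this example.

```python
import time

# Stages listed in this notebook, in the order a model passes through them.
MODEL_STAGES = ["SCHEDULING", "FETCHING", "TRAINING", "DEPLOYING", "ACTIVE"]


def wait_until_active(get_status, poll_seconds=60, max_polls=120, sleep=time.sleep):
    """Call get_status() until it returns ACTIVE; return the number of polls made."""
    for attempt in range(1, max_polls + 1):
        status = get_status()
        if status == "ACTIVE":
            return attempt
        if status not in MODEL_STAGES:
            # Assumption of this sketch: anything outside the listed stages is an error.
            raise RuntimeError(f"unexpected model status: {status}")
        sleep(poll_seconds)
    raise TimeoutError(f"model not ACTIVE after {max_polls} polls")


# Simulated status sequence so the example runs without calling Shaped.
fake_statuses = iter(MODEL_STAGES)
polls = wait_until_active(lambda: next(fake_statuses), poll_seconds=0, sleep=lambda _: None)
print(polls)  # 5
```

Passing `sleep` as a parameter keeps the loop testable without real waiting.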

### Rank!

You're now ready to fetch your Goodbooks book recommendations! You can do this with the [Rank endpoint](https://docs.shaped.ai/docs/api#tag/Rank/operation/post_rank_models__model_id__rank_post). Just provide the `user_id` you wish to get the recommendations for and the number of recommendations you want returned. Make sure the `user_id` indeed exists in the dataset.

Shaped's CLI provides a convenience rank command to quickly retrieve results from the command line. You can use it as follows:

In [None]:
! shaped rank --model-name goodbooks_book_recommendations --user-id 5 --limit 5

ids:
- '8'
- '27'
- '101'
- '24'
- '2'
scores:
- 1.0
- 0.90909091
- 0.90909091
- 0.90909091
- 0.81818182



The response contains two parallel arrays: the book ids and the ranking scores of the books that Shaped estimates are most relevant to the given reader.
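Because the arrays are parallel, the score at a given position belongs to the id at the same position. A minimal sketch of pairing them up, using the values from the output above (the helper name `pair_rankings` is made up for this example):

```python
def pair_rankings(response):
    """Pair each id with the score at the same position in the response."""
    ids = response["ids"]
    scores = response["scores"]
    if len(ids) != len(scores):
        raise ValueError("ids and scores must have the same length")
    return list(zip(ids, scores))


# Values copied from the rank output shown above.
response = {
    "ids": ["8", "27", "101", "24", "2"],
    "scores": [1.0, 0.90909091, 0.90909091, 0.90909091, 0.81818182],
}

for book_id, score in pair_rankings(response):
    print(book_id, score)
```

The ids come back as strings, so convert them before joining against the integer `book_id` column in `books.csv`.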

If you want to integrate this endpoint into your website or application you can use the Rank POST REST endpoint directly with the following request:

In [None]:
! curl https://api.prod.shaped.ai/v1/models/goodbooks_book_recommendations/rank \
  -H "x-api-key: <YOUR_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{ "user_id": "5", "limit": 5 }'
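For an application written in Python, the same request could be sketched with the standard library as below. The URL, headers and JSON body mirror the `curl` call above. The function name `build_rank_request` is hypothetical, and the example only constructs the request so that it runs without a valid API key.

```python
import json
import urllib.request


def build_rank_request(model_name, user_id, limit, api_key):
    """Build the POST request for the Rank endpoint without sending it."""
    url = f"https://api.prod.shaped.ai/v1/models/{model_name}/rank"
    body = json.dumps({"user_id": user_id, "limit": limit}).encode("utf-8")
    headers = {"x-api-key": api_key, "Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


request = build_rank_request("goodbooks_book_recommendations", "5", 5, "<YOUR_API_KEY>")
print(request.get_method(), request.full_url)

# To actually send it (requires a real API key and an ACTIVE model):
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```

Separating request construction from sending makes it straightforward to log or unit test the payload before wiring it into a product.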

Congrats! You just retrieved the top 5 most liked books for the specified user! 🍾 Now let's add ranking to your product :)

### Clean Up

Don't forget to delete the Shaped datasets, models and associated notebook assets you created once you're finished with them. You can do so with the following CLI commands:

In [38]:
! shaped delete-dataset --dataset-name goodbooks_events

! shaped delete-model --model-name goodbooks_book_recommendations

! rm -r notebook_assets

message: Dataset with name 'goodbooks_events' was successfully deleted

message: Model with name 'goodbooks_book_recommendations' is deleting...

