# Steam Review Dataset Recommendation Tutorial

This notebook will walk you through an example of setting up a model for the Steam Australian Users reviews dataset and then fetching ranked games for a specific user. The dataset contains review data from Australian Steam users, including which games they've reviewed and whether they recommended them.

Let's get started! 🚀

### Setup

Replace `<YOUR_API_KEY>` with your API key below.

*If you don't have an API key, feel free to [sign up on our website](https://www.shaped.ai/#contact-us) :)*

In [None]:
import os

SHAPED_API_KEY = os.getenv('TEST_SHAPED_API_KEY', '<YOUR_API_KEY>')

1. Install `shaped` to leverage the Shaped CLI to create, view, and use your model.
2. Install `pandas` to view and edit the sample dataset.
3. Install `pyyaml` to create Shaped Dataset and Model schema files.
4. Install `numpy` for numerical operations.

In [None]:
! pip install shaped
! pip install pyyaml
! pip install pandas==1.5.3
! pip install numpy==1.26.4

Initialize the CLI with your API key.

In [None]:
! shaped init --api-key $SHAPED_API_KEY

### Download Public Dataset

Fetch the publicly hosted Steam Australian Users reviews dataset. This dataset contains information about Australian Steam users and the games they've reviewed.

In [None]:
! wget https://mcauleylab.ucsd.edu/public_datasets/data/steam/australian_user_reviews.json.gz --no-check-certificate

### Parse and Prepare the Dataset

The Steam reviews dataset is stored as a gzip-compressed file (`.json.gz`) with a nested structure. Each record contains a `user_id` and an array of reviews for different games. We need to flatten this so that each row represents a single user-game review.

Let's examine how to parse this data:

In [None]:
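import ast
import gzip
import json

import numpy as np
import pandas as pd

def parse(path):
    """Yield one parsed record per line of the compressed file."""
    # Open in text mode; the context manager closes the file when done
    with gzip.open(path, 'rt', encoding='utf-8') as g:
        for line in g:
            # literal_eval only accepts Python literals, so unlike eval
            # it cannot execute arbitrary code from the downloaded file
            yield ast.literal_eval(line)

def read_data(path):
    """Read all records from the compressed file into a list."""
    return list(parse(path))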

Now we'll read the dataset, flatten its structure, and save it as a TSV file that Shaped can work with.

In [None]:
import re
from datetime import datetime

# Fallback used when a review has no parsable date
DEFAULT_DATE = '2000-01-01'

# Read the compressed dataset
users_reviews = read_data('australian_user_reviews.json.gz')

# Parse the user reviews data
def parse_reviews_data(json_data):
    """Extract structured data from the reviews JSON."""
    cleaned_reviews = []

    for user_data in json_data:
        user_id = user_data.get('user_id')

        # Process each review for this user
        if 'reviews' in user_data and isinstance(user_data['reviews'], list):
            for review in user_data['reviews']:
                # Extract needed fields
                item_id = review.get('item_id')
                recommend = 1 if review.get('recommend', False) else 0

                # Extract the date from a string like 'Posted November 5, 2011.'
                posted_date = review.get('posted', '')
                date_match = re.search(r'Posted (\w+ \d+, \d{4})', posted_date)

                created_at = DEFAULT_DATE
                if date_match:
                    try:
                        # Parse the date string and convert to YYYY-MM-DD
                        date_obj = datetime.strptime(date_match.group(1), '%B %d, %Y')
                        created_at = date_obj.strftime('%Y-%m-%d')
                    except ValueError:
                        # Keep the default date if parsing fails
                        pass

                # Create clean review record
                cleaned_reviews.append({
                    'user_id': user_id,
                    'item_id': item_id,
                    'created_at': created_at,
                    'recommend': recommend
                })

    return cleaned_reviews

# Get the cleaned reviews data
cleaned_reviews = parse_reviews_data(users_reviews)

# Convert cleaned reviews to a DataFrame
df = pd.DataFrame(cleaned_reviews)
print(f"Dataset shape: {df.shape}")
print("Sample data:")
print(df.head())

# Save as TSV (tab-separated; the .csv file name is kept to match the upload command)
csv_file_path = 'user_reviews.csv'
df.to_csv(csv_file_path, sep='\t', index=False)
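
To make the flattening step concrete, here is a minimal sketch that runs the same logic on one hand-written record. The user id, item ids, and dates in `sample` are made up for illustration, and `parse_posted` is a hypothetical helper name that mirrors the date handling in the cell above.

```python
import re
from datetime import datetime

DEFAULT_DATE = '2000-01-01'  # same fallback date as in the cell above

def parse_posted(posted):
    """Turn 'Posted November 5, 2011.' into '2011-11-05' (hypothetical helper)."""
    match = re.search(r'Posted (\w+ \d+, \d{4})', posted)
    if not match:
        return DEFAULT_DATE
    try:
        return datetime.strptime(match.group(1), '%B %d, %Y').strftime('%Y-%m-%d')
    except ValueError:
        return DEFAULT_DATE

# Made-up record: one user with two reviews
sample = {
    'user_id': 'example_user',
    'reviews': [
        {'item_id': '10', 'recommend': True, 'posted': 'Posted November 5, 2011.'},
        {'item_id': '20', 'recommend': False, 'posted': 'Posted July 15.'},
    ],
}

# One output row per (user, review) pair
rows = [
    {
        'user_id': sample['user_id'],
        'item_id': review['item_id'],
        'created_at': parse_posted(review['posted']),
        'recommend': int(review['recommend']),
    }
    for review in sample['reviews']
]

for row in rows:
    print(row)
```

The second review has no year in its `posted` string, so the pattern does not match and the row falls back to the default date.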

### Upload Data to Shaped

Now that we have our data prepared, let's upload it to Shaped. First, we'll define a dataset schema:

In [None]:
import yaml
steam_review_events_schema = {
    "name": "steam_review_events",
    "schema_type": "CUSTOM",
    "column_schema": {
        "user_id": "String",
        "item_id": "String",
        "created_at": "DateTime",  # Using DateTime for review dates
        "recommend": "Int32"      # 1 for recommended, 0 for not recommended
    }
}

# Save schema to YAML file
schema_file_path = 'steam_review_events_schema.yaml'
with open(schema_file_path, 'w') as file:
    yaml.dump(steam_review_events_schema, file)

Create the dataset from the schema:

In [None]:
! shaped create-dataset --file steam_review_events_schema.yaml

Insert the data into the dataset:

In [None]:
! shaped dataset-insert --dataset-name steam_review_events --file user_reviews.csv --type 'tsv'

Let's check if the dataset was created successfully:

In [None]:
! shaped list-datasets

### Model Creation

We're now ready to create our recommendation model! We'll use the review data to build a collaborative filtering model. Shaped will use this data to determine which users like which games, with the assumption that if a user has recommended a game, they have a positive sentiment towards it.

First, let's define our model schema:

In [None]:
import yaml

steam_model_schema = {
    "model": {
        "name": "steam_review_game_recommendations"
    },
    "connectors": [
        {
            "type": "Dataset",
            "id": "steam_review_events",
            "name": "steam_review_events"
        }
    ],
    "fetch": {
        "events": "SELECT user_id, item_id, created_at, recommend AS label FROM steam_review_events"
    }
}

# Save the model schema to a YAML file
with open('steam_review_model_schema.yaml', 'w') as file:
    yaml.dump(steam_model_schema, file)

Now, let's create the model using our schema file:

In [None]:
! shaped create-model --file steam_review_model_schema.yaml

Let's check the status of our model:

In [None]:
! shaped list-models

Provisioning your infrastructure and training on your historic events can take up to a few hours. The time mostly depends on the size of your dataset, i.e. the number of users, items, and interactions, and the number of attributes you provide.

The initial model creation goes through the following stages in order:

1. `SCHEDULING`
2. `FETCHING`
3. `TUNING`
4. `TRAINING`
5. `DEPLOYING`
6. `ACTIVE`

You can periodically poll Shaped to inspect these status changes. Once the model is in the `ACTIVE` state, you can move on to the next step and use it to make rank requests.
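
Polling can be sketched as a small loop. The example below is only an illustration: `wait_until_active` and its parameters are hypothetical names, and `get_status` is a function you supply yourself that returns the model's current status string (for instance by reading it from the CLI output). Here a simulated status sequence stands in for real lookups.

```python
import time

def wait_until_active(get_status, poll_seconds=60, timeout_seconds=4 * 60 * 60,
                      sleep=time.sleep):
    """Poll get_status() until it returns 'ACTIVE' or the timeout elapses.

    get_status is a caller-supplied function (an assumption, not a Shaped API)
    that returns the model's current status string.
    """
    waited = 0
    while True:
        status = get_status()
        print(f"status after {waited}s: {status}")
        if status == 'ACTIVE':
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"model still {status} after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds

# Simulated status sequence standing in for real status lookups
fake_statuses = iter(['SCHEDULING', 'FETCHING', 'TUNING',
                      'TRAINING', 'DEPLOYING', 'ACTIVE'])
final = wait_until_active(lambda: next(fake_statuses),
                          poll_seconds=1, sleep=lambda seconds: None)
```

Passing `sleep` as a parameter keeps the loop easy to try out without actually waiting; in real use you would leave it at its default.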

### Rank!

You're now ready to fetch your game recommendations! You can do this with the [Rank endpoint](https://docs.shaped.ai/docs/api/#tag/Model-Inference/operation/post_rank_models__model_id__rank_post). Just provide the `user_id` you want recommendations for and the number of recommendations you want returned.

Shaped's CLI provides a convenience rank command to quickly retrieve results from the command line. You can use it as follows:

In [None]:
! shaped rank --model-name steam_review_game_recommendations --user-id '76561197970982479' --limit 5

The response contains two parallel arrays: the ids of the games that Shaped estimates are most interesting to the given user, and their ranking scores.
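
Parallel arrays are easiest to consume by pairing them up. The sketch below assumes the response is a JSON object with `ids` and `scores` keys; those key names and the values shown are assumptions for illustration, so check them against the actual output of the rank command.

```python
import json

# Hypothetical response body: key names and values are assumed, not real output
response_text = '{"ids": ["730", "440", "570"], "scores": [0.91, 0.85, 0.42]}'
response = json.loads(response_text)

# Pair each id with the score at the same position
ranked = list(zip(response["ids"], response["scores"]))

for position, (item_id, score) in enumerate(ranked, start=1):
    print(f"{position}. item {item_id} (score {score:.2f})")
```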

If you want to integrate this endpoint into your website or application you can use the Rank POST REST endpoint directly with the following request:

In [None]:
'''
curl https://api.shaped.ai/v1/models/steam_review_game_recommendations/rank \
  -X POST \
  -H "x-api-key: MY_KEY" \
  -H "Content-Type: application/json" \
  -d '{"user_id": "76561197970982479", "limit": 5 }'
'''
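
The same request can be built from Python. This is a minimal sketch using only the standard library; the URL, headers, and body are taken from the curl command above, while `build_rank_request`, `API_KEY`, and `MODEL_NAME` are names introduced here for illustration. The request is only constructed, not sent, so the cell runs without a live model.

```python
import json
import os
import urllib.request

API_KEY = os.getenv('TEST_SHAPED_API_KEY', '<YOUR_API_KEY>')
MODEL_NAME = 'steam_review_game_recommendations'

def build_rank_request(user_id, limit, api_key=API_KEY):
    """Build (but do not send) the POST request shown in the curl command."""
    url = f"https://api.shaped.ai/v1/models/{MODEL_NAME}/rank"
    body = json.dumps({"user_id": user_id, "limit": limit}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
    )

request = build_rank_request("76561197970982479", 5)
print(request.get_method(), request.full_url)

# To send it once the model is ACTIVE:
# with urllib.request.urlopen(request) as resp:
#     print(json.load(resp))
```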

### Clean Up

Don't forget to delete your model (and its assets) and the dataset once you're finished with them. You can do so with the following CLI commands:

In [None]:
! shaped delete-model --model-name steam_review_game_recommendations
! shaped delete-dataset --dataset-name steam_review_events