# 1. Predictive Task

The goal of this project is to **predict whether a Steam user will recommend a game** based on their past reviews and information about the game. The Steam Review Dataset contains a True/False `recommend` field, which we use as the label.

Binary Classification:
- 1 = user recommends the game 
- 0 = user doesn't recommend the game

## 1.1 Model Evaluation

Because this is a binary classification task, I will evaluate the model using:
- **Precision**: How often `recommend = 1` predictions are correct
- **Accuracy**: Overall fraction of correct predictions
- **F1 Score**: Harmonic mean of precision and recall
- **Recall**: How many of the actual recommendations the model finds
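
As a minimal sketch of how the four metrics above relate to each other, the snippet below computes them from confusion-matrix counts in plain Python. The function name `binary_metrics` and the toy labels are illustrative assumptions, not part of the project code.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly predicted recommends
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # predicted recommend, actually not
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed recommends
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correctly predicted non-recommends

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy example with hypothetical labels
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```

Reporting all four matters when the labels are imbalanced, because a model that always predicts `1` can have high accuracy and recall while its precision stays at the base rate.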

I will split the dataset into:
- 80% Training 
- 10% Validation 
- 10% Testing 
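
A minimal sketch of the 80/10/10 split, assuming the reviews are held in a list of rows. The helper name `split_80_10_10` and the fixed seed are assumptions for illustration; the seed only serves to make the shuffle reproducible.

```python
import random


def split_80_10_10(rows, seed=0):
    """Shuffle rows and split them into 80% train, 10% validation, 10% test."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # seeded shuffle so the split is repeatable
    n_train = len(rows) * 8 // 10
    n_valid = len(rows) // 10
    train = rows[:n_train]
    valid = rows[n_train:n_train + n_valid]
    test = rows[n_train + n_valid:]  # whatever remains goes to the test set
    return train, valid, test


# Toy example: 100 placeholder rows
train, valid, test = split_80_10_10(range(100))
```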

## 1.2 Baselines

**3 Baselines:**

1. **Game Average Baseline**
    - For each game, compute the fraction of training reviews where recommend = 'True'
    - Predict recommend = 'True' if that rate is greater than or equal to 0.5
2. **Majority**
    - Always predict the most common label
3. **User Average Baseline**
    - For each user, compute the fraction of games they recommend in the training set
    - Predict recommend = 'True' if that rate is greater than or equal to 0.5
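
The three baselines can be sketched as follows. This is an illustrative outline, not the final implementation: the helper names `fit_baselines` and `predict_from_rate` are hypothetical, and falling back to the majority label for users or games that never appear in training is an assumption the text above does not specify.

```python
from collections import Counter, defaultdict


def fit_baselines(train_rows):
    """Compute the majority label and per-game / per-user recommend rates (training rows only)."""
    majority = Counter(r["recommend"] for r in train_rows).most_common(1)[0][0]
    game_counts = defaultdict(lambda: [0, 0])  # item_id -> [num recommended, num reviews]
    user_counts = defaultdict(lambda: [0, 0])  # user_id -> [num recommended, num reviews]
    for r in train_rows:
        for counts, key in ((game_counts, r["item_id"]), (user_counts, r["user_id"])):
            counts[key][0] += r["recommend"]
            counts[key][1] += 1
    game_rate = {k: pos / n for k, (pos, n) in game_counts.items()}
    user_rate = {k: pos / n for k, (pos, n) in user_counts.items()}
    return majority, game_rate, user_rate


def predict_from_rate(rates, key, fallback):
    """Predict 1 if the stored rate is >= 0.5; use `fallback` for unseen keys (assumption)."""
    if key not in rates:
        return fallback
    return int(rates[key] >= 0.5)


# Toy training rows (hypothetical data)
train_rows = [
    {"user_id": "u1", "item_id": "g1", "recommend": 1},
    {"user_id": "u1", "item_id": "g2", "recommend": 1},
    {"user_id": "u2", "item_id": "g1", "recommend": 0},
    {"user_id": "u2", "item_id": "g2", "recommend": 0},
    {"user_id": "u3", "item_id": "g2", "recommend": 1},
]
majority, game_rate, user_rate = fit_baselines(train_rows)
```

For example, `predict_from_rate(user_rate, "u2", majority)` uses the user's own history, while an unseen user falls back to the majority label.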

## 1.3 Validity Of Predictions 

To assess whether the model's predictions are valid:

- Ensure all stats (user averages, game averages) are computed from the training set only
- Compare model performance to the baselines outlined above
- Check results across diverse groups (popular Steam games vs. less popular Steam games, active users vs. less active users)

This helps confirm that our model is learning meaningful patterns and not just memorizing the training data.
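
One possible way to run the group comparison is to bucket test rows by how many training reviews their game has and report accuracy per bucket. The sketch below assumes rows are dictionaries with `item_id` and `recommend` keys; the function name `accuracy_by_popularity` and the `min_reviews` threshold are hypothetical choices.

```python
from collections import Counter


def accuracy_by_popularity(train_rows, test_rows, predictions, min_reviews=50):
    """Accuracy on test rows, split by whether the game is popular in the training data."""
    # Popularity is measured on the training set only, to avoid leaking test information
    train_counts = Counter(r["item_id"] for r in train_rows)
    hits = {"popular": 0, "less_popular": 0}
    totals = {"popular": 0, "less_popular": 0}
    for row, pred in zip(test_rows, predictions):
        group = "popular" if train_counts[row["item_id"]] >= min_reviews else "less_popular"
        totals[group] += 1
        hits[group] += int(pred == row["recommend"])
    # Skip groups that have no test rows
    return {g: hits[g] / totals[g] for g in totals if totals[g] > 0}


# Toy example (hypothetical data): g1 has 3 training reviews, g2 has 1
toy_train = [{"item_id": "g1"}, {"item_id": "g1"}, {"item_id": "g1"}, {"item_id": "g2"}]
toy_test = [
    {"item_id": "g1", "recommend": 1},
    {"item_id": "g1", "recommend": 0},
    {"item_id": "g2", "recommend": 1},
]
group_accuracy = accuracy_by_popularity(toy_train, toy_test, [1, 1, 0], min_reviews=2)
```

The same pattern applies to active vs. less active users by counting `user_id` instead of `item_id`.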

# 2. Exploratory Analysis

## 2.1 Dataset Context

The dataset comes from the [Steam](https://store.steampowered.com/) video game platform. It is publicly available and comes with the following citations:

> **Self-attentive sequential recommendation**
> *Wang-Cheng Kang, Julian McAuley*
> ICDM, 2018
> [pdf](https://cseweb.ucsd.edu/~jmcauley/pdfs/icdm18.pdf)

> **Item recommendation on monotonic behavior chains**
> *Mengting Wan, Julian McAuley*
> RecSys, 2018
> [pdf](https://cseweb.ucsd.edu/~jmcauley/pdfs/recsys18b.pdf)

> **Generating and personalizing bundle recommendations on Steam**
> *Apurva Pathak, Kshitiz Gupta, Julian McAuley*
> SIGIR, 2017
> [pdf](https://cseweb.ucsd.edu/~jmcauley/pdfs/sigir17.pdf)


- **`user_reviews` (Version 1: Review Data, 6.7 MB)**
  Contains user reviews, grouped by user: 
  - `user_id`, `user_url` 
  - `reviews`: list of review objects with fields such as: 
    - `item_id` 
    - `recommend` (T/F) 
    - `review` (text) 
    - `posted`, `helpful`, etc. 
- **`user_items` (Version 1: User and Item Data, 71 MB)**
  Contains user ownership and playtime info:
  - `user_id`, `items_count`, `steam_id`, `user_url`
  - `items`: list of owned games with:
    - `item_id`
    - `item_name`
    - `playtime_forever`
    - `playtime_2weeks`

- **`steam_games` (Version 2: Item Metadata, 2.7 MB)**
  Contains game-level metadata:
  - `id`, `app_name`, `title`, `url`
  - `price` (number or F2P)
  - `genres` (list of genres)
  - `tags` (list of tags)
  - `specs` (e.g., single player)
  - `publisher`, `developer`
  - `sentiment` (example: Mostly negative)


Together, these files allow me to model both user behavior and game characteristics.

## 2.2 Data Processing 

Since the files are nested raw JSON, we performed the following steps to create a single user-game table.

### 2.2.1 Flatten user reviews

Using `user_reviews`, create one row per review:

- `user_id`
- `item_id`
- `recommend`
- `review_text`
- `posted`

In [None]:
import json

# Flatten the nested per-user review lists into one row per (user, game) review
reviewRows = []
with open("file_name.json", "r") as f:
    for line in f:
        user = json.loads(line)
        userId = user['user_id']
        for r in user['reviews']:
            reviewRows.append({
                'user_id': userId,
                'item_id': r['item_id'],
                # Map True / "True" to 1 and everything else to 0.
                # int(bool("False")) would wrongly give 1 for a string value.
                'recommend': int(str(r['recommend']) == 'True'),
                'review_text': r['review'],
                'posted': r['posted']
            })


The code above creates a flattened table of user-game interactions.

### 2.2.2 Extracting user stats
From `user_items`, we compute user-level features:
- `user_total_items` - number of games the user owns
- `user_total_playtime` - total `playtime_forever` across all games the Steam user owns
- `user_avg_playtime` - average playtime per owned game

This gives us insight into how active the Steam user is.

In [None]:
import json

# Aggregate ownership and playtime per user
userStats = {}
with open("file_name.json", "r") as f:
    for line in f:
        user = json.loads(line)
        userId = user['user_id']
        items = user['items']
        totalPlayTime = sum(it["playtime_forever"] for it in items)
        userStats[userId] = {
            'user_total_items': len(items),
            'user_total_playtime': totalPlayTime,
            # Guard against users who own no games (avoids division by zero)
            'user_avg_playtime': totalPlayTime / len(items) if items else 0.0
        }


These stats can now be joined to the review table on `user_id`.

### 2.2.3 Game Metadata Processing

Using `steam_games`, build the following table:
- `item_id` (the `id` field in `steam_games`)
- `price`
- `genres`
- `tags`
- `specs`
- `sentiment`
- `release_date`

Then:

- one-hot encode a set of the most common genres (example: Indie, RPG, ...)
- create numeric features
    - `num_genres`
    - `num_tags`
    - `num_specs`

The code follows the same pattern shown above for flattening user reviews and extracting user stats.
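
A minimal sketch of these steps in plain Python, assuming each game record has already been parsed into a dictionary. The helper names `parse_price` and `build_game_features`, the `top_k` parameter, the `genre_` column prefix, and the choice to map non-numeric or missing prices to `0.0` are all assumptions made for illustration.

```python
from collections import Counter


def parse_price(raw):
    """Return the price as a float; non-numeric or missing values become 0.0 (assumption)."""
    try:
        return float(raw)
    except (TypeError, ValueError):
        return 0.0


def build_game_features(games, top_k=2):
    """Build one feature row per game: price, list-length counts, and top-k genre indicators."""
    genre_counts = Counter(g for game in games for g in (game.get("genres") or []))
    top_genres = [g for g, _ in genre_counts.most_common(top_k)]
    rows = []
    for game in games:
        genres = game.get("genres") or []
        row = {
            "item_id": game["id"],
            "price": parse_price(game.get("price")),
            "num_genres": len(genres),
            "num_tags": len(game.get("tags") or []),
            "num_specs": len(game.get("specs") or []),
        }
        # One-hot encode only the most common genres
        for g in top_genres:
            row[f"genre_{g}"] = int(g in genres)
        rows.append(row)
    return rows


# Toy records (hypothetical data)
games = [
    {"id": "10", "price": 9.99, "genres": ["Indie", "RPG"], "tags": ["Indie"], "specs": ["Single-player"]},
    {"id": "20", "price": "Free to Play", "genres": ["Indie"], "tags": [], "specs": []},
    {"id": "30", "genres": ["RPG", "Strategy"]},
]
game_rows = build_game_features(games, top_k=2)
```

Restricting the one-hot encoding to the most common genres keeps the feature matrix small and avoids columns that are almost always zero.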

### 2.2.4 Merge Everything

Merge all data sources from `2.2.1`, `2.2.2`, and `2.2.3` into a single dataframe (DF):
1. Begin from the flattened review table
2. Merge user stats on `user_id`
3. Merge game metadata on `item_id`

Filter out:
- users with few reviews (< 2 reviews)
- games with few reviews (< 2 reviews)

The final dataset has one row per (user, game) pair:
- label: recommend (0/1)
- user features: playtime, etc.
- game features: price, genres
- review features: `review_text_length`
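
The merge and filter steps can be sketched with pandas as below. The toy tables stand in for the outputs of the earlier steps and are hypothetical; the use of left joins and a single filtering pass are assumptions (after one pass, some remaining users or games may again fall below the threshold).

```python
import pandas as pd

# Toy stand-ins for the three processed tables (hypothetical data)
reviews = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u2", "u3"],
    "item_id": ["g1", "g2", "g1", "g2", "g3"],
    "recommend": [1, 0, 1, 1, 1],
    "review_text": ["great", "bad", "ok", "fun game", "nice"],
})
user_stats = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "user_total_items": [10, 3, 1],
})
games = pd.DataFrame({
    "item_id": ["g1", "g2"],
    "price": [9.99, 0.0],
})

# 1-3. Start from the reviews and attach user stats and game metadata
df = (
    reviews
    .merge(user_stats, on="user_id", how="left")
    .merge(games, on="item_id", how="left")
)

# Filter out users and games with fewer than 2 reviews (single pass)
user_review_counts = df.groupby("user_id")["item_id"].transform("count")
game_review_counts = df.groupby("item_id")["user_id"].transform("count")
df = df[(user_review_counts >= 2) & (game_review_counts >= 2)].copy()

# Review-level feature: length of the review string
df["review_text_length"] = df["review_text"].str.len()
```

Left joins keep every review even when a user or game is missing from the other tables, so the missing values show up as `NaN` and can be handled explicitly later.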

## 2.3 Exploratory Analysis (TO DO)

Visualizations to understand the data:

- Label distribution
- Game price distribution
- User activity
- Top games by number of reviews


# 3. Modeling

## 3.1 ML Problem

After preprocessing, each row in the dataset corresponds to one user-game review with:

- **Input features** (examples):
  - User features (from `user_items` + reviews):
    - `user_total_items`
    - `user_total_playtime`
    - `user_avg_playtime`
    - `user_total_reviews`
    - `user_recommend_rate` (fraction of reviews with recommend = 1)
  - Game features (from `steam_games`):
    - numeric `price`
    - one-hot encoded genres (RPG, Strategy)
    - one-hot encoded sentiment (Mostly Positive)
    - simple counts like `num_genres`, `num_tags`, `num_specs`
    - selected specs encoded as binary (`has_multiplayer`, `has_online`)
  - Review-level features:
    - `review_text_length` (length of the review string)

- **Output label**:
  - `recommend` ∈ {0, 1}

This is a binary classification problem where the model estimates:
$$
P(\text{recommend} = 1 \mid \text{user features}, \text{game features}, \text{review features})
$$

The objective is to minimize the logistic loss (binary cross-entropy) on the training set.
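
Assuming the logistic regression model from Section 3.2 is used, the estimate and its training objective can be written out as follows. The notation is introduced here for illustration: $x_i$ is the feature vector of review $i$, $y_i \in \{0, 1\}$ its label, $w$ and $b$ the learned weights and bias, and $N$ the number of training rows.

$$
\hat{p}_i = \sigma(w^\top x_i + b) = \frac{1}{1 + e^{-(w^\top x_i + b)}}
$$

$$
\mathcal{L}(w, b) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{p}_i + (1 - y_i) \log (1 - \hat{p}_i) \right]
$$

A confident wrong prediction (for example $\hat{p}_i$ close to 1 when $y_i = 0$) makes the corresponding log term very large, so the loss penalizes overconfident errors more heavily than uncertain ones.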


## 3.2 Modeling Choices, Advantages and Disadvantages

We chose simple, interpretable models suited to a tabular dataset of user, game, and review features.

### Baselines (No learning)

1. **Majority Baseline**
    - Predicts the most frequent label in the training set (`recommend = 1` or `recommend = 0`)
    - **Pros**: fast, easy to understand
    - **Cons**: ignores all user and game information (a very strong assumption)
2. **Game Average Baseline**
    - For each game, compute the fraction of positive recommendations in the training set and predict `1` if that fraction is >= 0.5
    - **Pros**: captures how well-liked a game is overall; simple to implement
    - **Cons**: completely ignores user differences; unreliable for games with few reviews
3. **User Average Baseline**
    - For each user, compute how often they recommend games and predict `1` if their recommend rate is >= 0.5
    - **Pros**: captures the "personality" of a Steam user (whether they tend to be harsh or generous in reviews)
    - **Cons**: ignores which game it is; unreliable for users with few reviews

### Main Model (Logistic Regression)

- Uses all created features (user, game, review)
- Outputs the predicted probability that `recommend = 1`
- Trained by minimizing **logistic loss**

**Advantages**

- Fast
- Scalable
- Works well with one-hot encoded features
- Coefficients can be inspected to see which features are associated with a higher or lower recommendation probability

**Disadvantages**

- Only models linear relationships between the features and the log-odds of a recommendation
- Can miss complex interactions between games and users
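
A minimal sketch of fitting and inspecting a logistic regression, assuming scikit-learn is the library used (the text does not name one). The data here is synthetic and the feature names are placeholders; standardizing features before fitting is an added assumption that makes coefficient sizes easier to compare.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real feature matrix (hypothetical data)
rng = np.random.default_rng(0)
feature_names = ["user_avg_playtime", "price", "review_text_length"]
X_toy = rng.normal(size=(200, 3))
# The label depends positively on feature 0 and negatively on feature 1
y_toy = (X_toy[:, 0] - 0.5 * X_toy[:, 1] > 0).astype(int)

# Scale the features, then fit logistic regression (minimizes regularized logistic loss)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_toy, y_toy)

# Predicted probability that recommend = 1 for each row
proba = model.predict_proba(X_toy)[:, 1]

# Coefficient sign shows the direction of each feature's association with the label
coefs = model.named_steps["logisticregression"].coef_[0]
for name, coef in zip(feature_names, coefs):
    print(f"{name}: {coef:+.3f}")
```

A positive coefficient means that larger values of the feature raise the predicted probability of a recommendation, and a negative one lowers it, which is the kind of inspection listed among the advantages above.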

## 3.3 Code Walkthrough: Features to Trained Model

A walkthrough of the main modeling code.

### 3.3.1 Building Feature Matrix 'X' and Label Vector 'y'

In [None]:
import pandas as pd

# Note: assume `df` is the dataframe from `2.2.4`
# 1. Label vector
y = df["recommend"].astype(int)

# 2. Numeric features
numeric_cols = [
    "user_total_items",
    "user_total_playtime",
    "user_avg_playtime",
    "price",
    "review_text_length"
]
# `price` can hold non-numeric values (e.g. a free-to-play label), so coerce
# every numeric column; anything unparseable becomes NaN and is then filled with 0
numeric_features = df[numeric_cols].apply(pd.to_numeric, errors="coerce").fillna(0)

# 3. One-hot encode simple categorical features (e.g., sentiment)
sentiment_dummies = pd.get_dummies(df["sentiment"], prefix="sentiment", dtype=int)

# 4. One-hot encode genres (each entry is a list of genre strings)
genres_dummies = df["genres"].str.join("|").str.get_dummies(sep="|")

# 5. Combine all features
X = pd.concat(
    [
        numeric_features,
        sentiment_dummies,
        genres_dummies
    ],
    axis=1
)


The code shows:
- how numeric features are selected
- how categorical features are one-hot encoded
- how everything is combined into a single feature matrix `X`

### 3.3.2 Train/Validation/Test

### 3.3.3 Baseline Implementation Code (Optional?)

### 3.3.4 Main Model: Logistic Regression

### 3.3.5 Architectural Choices

- Features are built from 3 files:
    - `user_reviews` -> labels + review-level features
    - `user_items` -> user activity and playtime
    - `steam_games` -> game metadata such as price and genres 

- The modeling pipeline is:
    1. Build X and y from the merged df
    2. Split into train/validation/test
    3. Implement 3 baselines
    4. Train logistic regression as the main model 


# 4. Evaluation

# 5. Related Work