# Abstract

In this post, we present a prototype recommendation system based on *Deep Neural Networks for YouTube Recommendations* [@Covington2016DeepNN] and *Explore, Exploit, Explain: Personalizing Explainable Recommendations with Bandits* [@McInerney2018ExploreEE]. We present the prototype in the context of a company that operates as a short-video content platform. We define the strategy behind the prototype by presenting the business objectives. We follow this by walking through the architecture of the prototype and how it fulfills the objectives of the company. We end by illustrating potential next steps for our company. The code for this prototype can be found at the [github repo.](https://github.com/thebayesianbandit/thebayesianbandit.github.io/blob/main/posts/doom_alg/doom_alg.ipynb)

# Introduction

**InstaTok** is a new social media app that specializes in short-video content. The platform operates as a two-sided marketplace with user accounts and creator accounts. Anyone can sign up for a user account for free. To join as a creator, one must pay a small subscription fee plus a small commission, both billed monthly.

Users interact with content via a mobile application. Like other short-video platforms, users are presented with a **for-you-page** (FYP) with a single video playing. Users can watch the video as many times as they like or move on to the next video at any time by swiping up in the app.

Creators are incentivized to be on the platform by monetizing their videos through advertisements. InstaTok has partnerships with various brands to connect them with creators to advertise their products and services. Creators are then compensated directly by these companies based on their contractual agreements. 

Since InstaTok is a two-sided platform, we need to optimize experiences for both users and creators. Users need good videos to stay engaged, and creators need engaged users to monetize their videos. To balance these needs, we first define a sound strategy that frames the problem and guides how we solve it.

**Note**: While this would technically be a three-sided marketplace since brands also would need to benefit from the platform, we are simplifying this problem to a two-sided one for this post.

## Company Strategy

As mentioned previously, in order for InstaTok to succeed, we need to develop a product that is appealing to both users and creators. In order to create a sound product that answers to both user and creator needs, we need to develop a robust strategy that will guide our approach to product creation. To do this, we need to answer four questions. 

1. Where do we compete?
2. What unique value do we bring to the market?
3. What resources and capabilities can we utilize to deliver unique value?
4. How do we sustain our ability to provide unique value?

We have established that we are competing in the mobile application market, specifically within the short-video platform arena. We hope to appeal to all ages, but are aiming to capture the college and young-professional market. Furthermore, we are targeting those who want to separate social media from creative short-video content.

We need to provide unique value to both sides of our market. For our users, we are providing a platform where all content is created by official creators. There are no random company ads appearing in the FYP nor social connection content. We focus the user experience on entertainment content. On top of that, we provide personalization to our users by algorithmically matching them with content they would enjoy.  

For our creators, we provide unique value through monetization and verified creator status. Creators compete solely with other creators for views, not with general social media users. Additionally, other platforms collect the ad revenue generated by creator content. On our platform, we provide a way for creators to directly collect ad revenue from brands.

Our key resources and capabilities include our proprietary algorithms, our brand partnerships, and deep technology expertise. These are also key to how we sustain our unique value. By continuing to iterate on our algorithms, grow our partnerships, and deepen our technology expertise, we can sustain a unique advantage in this marketplace.

## From Strategy to Product

We've defined our company strategy, and now we must use it to create a viable product. Our minimum viable product (MVP) should be a mobile application with a user-friendly interface. This interface should provide short-video content to our users one video at a time. Users advance to the next video by swiping up on the screen. These functions should be designed in a way that is intuitive for our target demographic. Additionally, the algorithms powering the service need to be personalized to maximize each user's specific utility.

In this post, we will focus our attention on the personalization algorithms powering our platform. As mentioned, our personalization algorithm is one of our key resources/capabilities in our company strategy. We need it to be incredibly valuable in the eyes of our users and creators. To achieve this, we utilize data to drive our decision making. @tbl-prod-1 shows an example of this kind of data.

| Region | Interest |
|---|---|
| west | tech |
| mid | animals |
| east | food |

: User Interests by Region {#tbl-prod-1}

According to our market research, users generally have **two categories** of videos that entice them to keep watching. Furthermore, we learned that users are open to exploring new categories of videos based on their geographic location. For example, those in the midwest appear to be open to exploring videos featuring animals. These insights should drive product development for our personalization algorithm. 

# Personalization Algorithm

## System Design

While previous sections have focused on the business aspects of our product, this section describes the technical details needed to design and implement a personalization system. Our problem can be boiled down to this one question: **how can we optimize our content offerings for any given user such that a creator maximizes their revenue?** To do this, we begin with simple microeconomic theory. We believe that any given user is a rational agent seeking to maximize their utility. We further believe that users have downloaded InstaTok to maximize their entertainment utility. Therefore, if we provide them with videos that help maximize this utility, they will continue to use the platform (i.e. maximize their watch time per session).

To answer our question on how we can maximize entertainment utility for a given user, we propose a system in @fig-sys-1 that illustrates how we can capture user data to provide meaningful personalization.

![Personalization System Design](doom_diag.jpg){#fig-sys-1}

@fig-sys-1 begins with a user opening our app and interacting with a shown video. The interaction data is sent to our data warehouse where it is stored with historical interaction data. Data is then sent from the warehouse to three functions. These functions help define the current state of our user. The first function retrieves current session data pertinent to our given user (e.g. demographic data, location data, etc.).

The second function is our base recommendation system. The system parses our large video catalog and generates a candidate list of videos that best match user preferences. This list is further refined by adjusting the scores from the base recommendation with additional business logic. We will discuss the base recommendation algorithm in detail later on. 

The third function is our interest prediction layer. To better inform our decision on which video to show the user, we use past interaction data to predict user interest categories. Note: In a true production environment, this would be our approach. However, for the purposes of this post, we randomly generate these interests for each user.

These functions provide output that defines the current state of the user. This state is fed into our reinforcement learning agent (RL agent) which then chooses which action to take (i.e. which video to show the user next). The selection is sent back to the user and the process repeats itself until the user closes the app. We will provide more detail on the RL agent later on. 

## Recommendation System Architecture

Our recommendation system is a simple two-phase architecture. The first phase is the initial candidate list generation. To do this, we first assign scores to the interaction data. @tbl-rec-1 shows our assignment scores.

| Interaction | Score |
|---|---|
| watched | 1 |
| skipped | -2 |
| liked | 2 |
| other | -0.5 |

: Interaction Scores {#tbl-rec-1}
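
As a rough illustration of how the scores in @tbl-rec-1 could be applied and aggregated, the sketch below maps a handful of made-up interactions to scores and sums them per (item, user) pair. The names `SCORES`, `score_interaction`, and `build_item_user_matrix`, as well as the toy log itself, are assumptions for this example rather than part of the prototype.

```python
import numpy as np

# Scores taken from the interaction table; any action not listed counts as "other".
SCORES = {"watched": 1.0, "skipped": -2.0, "liked": 2.0}
OTHER_SCORE = -0.5


def score_interaction(action):
    """Return the score for a single interaction."""
    return SCORES.get(action, OTHER_SCORE)


def build_item_user_matrix(interactions, n_items, n_users):
    """Sum interaction scores into an (items x users) matrix."""
    mat = np.zeros((n_items, n_users))
    for vid_id, user_id, action in interactions:
        mat[vid_id, user_id] += score_interaction(action)
    return mat


# Hypothetical log of (vid_id, user_id, action) tuples.
toy_log = [
    (0, 0, "liked"), (0, 1, "done"), (0, 2, "watched"),
    (1, 0, "watched"), (1, 1, "watched"), (1, 2, "watched"),
    (2, 0, "liked"), (2, 1, "skipped"), (2, 2, "watched"),
]
item_user = build_item_user_matrix(toy_log, n_items=3, n_users=3)
print(item_user)
```

With this toy log, the result matches the example matrix in @eq-rec-1.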

These scores are aggregated to create an item-user matrix, as depicted in @eq-rec-1. Note: For faster computation, we transform this into a sparse matrix using `scipy`.

$$
\begin{pmatrix}
2 & -.5 & 1 \\
1 & 1 & 1 \\
2 & -2 & 1
\end{pmatrix}
$${#eq-rec-1}

@eq-rec-1 is an $n$ by $m$ matrix where each row $i$ is an item and each column $j$ is a user. The goal of setting up the matrix like this is to identify relationships between items. We want to identify which items are most similar to each other based on similar scores from users. To calculate this, we utilize *cosine similarity* [@Salton1975Vector] as shown in @eq-rec-2.

$$
\text{similarity}(A, B) = \frac{A \cdot B}{\|A\| \|B\|}
$${#eq-rec-2}

@eq-rec-2 allows us to measure the cosine of the angle between two item vectors. A smaller angle means higher similarity. Once these are all calculated, the results are stored in an $n$ by $n$ matrix, where each row and column corresponds to a specific item (i.e. an item-item matrix). We can then perform lookups on this matrix to recommend items that are most similar to those a user previously interacted with positively. This use of item-item similarity is known as item-based *collaborative filtering* [@Sarwar2001]. Once we have $r$ recommendations for a user, these items and their respective scores are sent to our weighted-sum function in @eq-rec-3, which reweights them according to important business objectives.

$$
\text{Final Score} = w_{1}\text{Similarity Score} + w_{2}\text{Num Views} + w_{3}\text{Is Interest}
$${#eq-rec-3}

The final score of a video is weighted by three key attributes: the similarity score (produced from @eq-rec-2), the number of views a video has (its popularity), and a boolean variable indicating whether the video aligns with the user's interests.
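
To make the two phases concrete, here is a minimal sketch using the toy matrix from @eq-rec-1. It computes the item-item cosine similarities from @eq-rec-2 with plain NumPy and then applies the weighted sum from @eq-rec-3. The weights mirror those passed to `gen_scores` in the notebook, while the view count and interest flag are made-up values. `cosine_sim_matrix` and `final_score` are hypothetical helper names, not the prototype's functions.

```python
import numpy as np


def cosine_sim_matrix(item_user):
    """Pairwise cosine similarity between the rows (items) of a matrix."""
    norms = np.linalg.norm(item_user, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing by zero for items with no interactions
    unit = item_user / norms
    return unit @ unit.T


def final_score(similarity, num_views, is_interest, w1, w2, w3):
    """Weighted sum of similarity, popularity, and interest match."""
    return w1 * similarity + w2 * num_views + w3 * float(is_interest)


# Toy item-user matrix (rows are items, columns are users).
item_user = np.array([
    [2.0, -0.5, 1.0],
    [1.0, 1.0, 1.0],
    [2.0, -2.0, 1.0],
])
sim = cosine_sim_matrix(item_user)
print(np.round(sim, 3))

# Score item 2 as a candidate for a user who liked item 0.
# Assumed values: 5.0 scaled views, and the category matches the user's interests.
score = final_score(sim[0, 2], num_views=5.0, is_interest=True, w1=0.5, w2=0.1, w3=0.3)
print(round(score, 3))
```

Items 0 and 2 come out as the most similar pair (about 0.87) because the same users rated them in the same direction.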

## RL Agent Architecture

Our recommendation system currently outputs the top $r$ recommendations for a user based on their past interaction history. This is a great feature, but it does not capture the entire story of a user. Each time a user opens our app, they are in a different state of being. They could be bored, anxious, excited, or experiencing any other emotion. To better understand our users and present them content that aligns with their current state, we utilize reinforcement learning (RL) to model these states and their subsequent feedback. RL is a widely used framework for modeling sequential decision making.

At a high level, we refer back to @fig-sys-1. The RL agent takes in current session information (number of videos watched so far, number of skips, last video category shown, etc.), the candidate list of videos with their final scores, and current "predicted" user interests. The RL agent then chooses an action (i.e. which video to show), sends that video to the user, and records the interaction. As the RL agent continues to do this, it learns which videos match which states. The more data it has on the interactions between users and videos in given states, the better its **policy** becomes at choosing the next best video.

To model this system, we use a *Deep Q Network (DQN)* [@Mnih2015HumanlevelCT]. The DQN utilizes neural networks to approximate the *Q Function* [@Watkins1989Learning], as shown in @eq-rl-1.

$$Q^*(s, a) = \mathbb{E}_{s' \sim P(\cdot|s,a)} \left[ r + \gamma \max_{a'} Q^*(s', a') \right]$${#eq-rl-1}

@eq-rl-1 is the optimal Q-value function for a Q-learning algorithm. It says that the optimal Q-value for taking action $a$ in state $s$ equals the expected sum of the immediate reward $r$ and the optimal Q-value of the next state $s'$, discounted by $\gamma$ and maximized over all possible next actions $a'$. Simply put, this function estimates the expected future cumulative reward of taking action $a$ in state $s$ and acting optimally afterwards.
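
As a quick worked example under assumed numbers: suppose showing a video yields an immediate reward $r = 1$ (the user watches it), the transition to the next state is deterministic, the best available Q-value in the next state is $2$, and the discount factor is $\gamma = 0.95$ (the value used in the notebook). Then:

$$
Q^*(s, a) = r + \gamma \max_{a'} Q^*(s', a') = 1 + 0.95 \times 2 = 2.9
$$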

In our DQN, we estimate these Q-values using neural networks. The neural network takes the current state $s$ as input, passes it through the hidden layers, and produces $r$ number of Q-values. The DQN then chooses a video to show the user. To do this, we use an *epsilon-greedy* [@Sutton1998Reinforcement] approach to balance exploration with exploitation. The video is then shown to the user, the user interacts with the video, the interaction is recorded, and the process begins again. 
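
The epsilon-greedy choice described above can be sketched in a few lines. This is an illustrative version using plain Python lists instead of network outputs. The function names and the example Q-values are assumptions, and the decay settings are borrowed from the notebook's `Agent` class.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the highest-valued action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=lambda i: q_values[i])  # exploit


def decay_epsilon(epsilon, decay=0.995, floor=0.01):
    """Multiplicative decay that never drops below the floor."""
    return max(floor, epsilon * decay)


q_values = [0.2, 1.5, -0.3, 0.9]  # made-up Q-values for four candidate videos
print(epsilon_greedy(q_values, epsilon=0.0))  # epsilon = 0 always exploits
```

Starting with a high epsilon and decaying it lets the agent explore widely early on and rely more on its learned Q-values as training progresses.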

Our DQN learns via **experience replay**. The RL agent learns from mini-batches of experiences that are randomly sampled from a "memory storage". This breaks the correlation between consecutive experiences and improves training stability. As with many other deep learning algorithms, training attempts to minimize a loss function. In our prototype, we use **mean squared error**. The loss is the squared difference between our target Q-value $Y$ and the predicted Q-value $Q(s, a; \theta)$, averaged over the mini-batch, as shown in @eq-rl-2.

$$\min_{\theta} \mathcal{L}(\theta) = \min_{\theta} \mathbb{E} \left[ \left( Y - Q(s, a; \theta) \right)^2 \right]$${#eq-rl-2}
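
The experience replay step described above can be sketched with a small buffer class. The notebook's agent stores experiences in a plain Python list, so the bounded `deque`, the `capacity` parameter, and the `ReplayBuffer` name are assumptions made for this illustration.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are dropped first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Uniformly sample a mini-batch without replacement."""
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.push(state=step, action=0, reward=1.0, next_state=step + 1, done=False)

batch = buffer.sample(2)
print(len(buffer), len(batch))
```

Sampling uniformly from the buffer, rather than training on experiences in the order they arrive, is what breaks the correlation between consecutive steps.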


Since we don't directly observe target Q-values in the real world, we estimate them via a **target network** in our RL agent. Our predicted Q-values are derived from an entirely different network known as the **online network**. 
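
Here is a minimal NumPy sketch of how the target $Y$ and the loss in @eq-rl-2 fit together, assuming the Q-values for a mini-batch have already been produced by the two networks. The arrays are made-up numbers, and `td_targets` and `mse_loss` are hypothetical helper names. Terminal transitions (`done = 1`) use only the immediate reward.

```python
import numpy as np


def td_targets(rewards, dones, next_q_target, gamma=0.95):
    """Y = r + gamma * max_a' Q_target(s', a'), with no bootstrap on terminal steps."""
    return rewards + (1.0 - dones) * gamma * next_q_target.max(axis=1)


def mse_loss(targets, predicted):
    """Mean squared error between targets and predicted Q-values."""
    return float(np.mean((targets - predicted) ** 2))


# Made-up mini-batch of two transitions with three candidate actions each.
rewards = np.array([1.0, -0.5])
dones = np.array([0.0, 1.0])                  # second transition ends the session
next_q_target = np.array([[0.5, 2.0, 1.0],    # target-network Q-values for s'
                          [3.0, 0.1, 0.2]])
q_online = np.array([[1.0, 2.5, 0.0],         # online-network Q-values for s
                     [0.0, -1.0, 0.4]])
actions = np.array([1, 2])                    # actions actually taken

targets = td_targets(rewards, dones, next_q_target)
predicted = q_online[np.arange(len(actions)), actions]
print(targets, mse_loss(targets, predicted))
```

In a full DQN, the target network's weights are periodically copied from the online network, which keeps the targets from shifting at every gradient step.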

**All specific technical details of the DQN, state and action space, etc. can be found at the github link in the abstract.** 

# Conclusion

In this post, we presented InstaTok, a short-video platform aimed at revolutionizing the short-video entertainment space by providing ad-free experiences to users via a new incentive and verification structure for creators. We walked through the core company strategy and product development, showing how we can create a product that fulfills the needs of both users and creators. We presented a high-level overview of a user's journey on the app, as well as deeper technical details of how we can algorithmically match users to videos they would enjoy, thereby maximizing their watch time per session. In a real-life scenario where we'd be launching this system, we'd want to run experiments to properly measure changes in user watch time between different algorithms (e.g. testing a different base recommendation system against the current one to observe its impact on user watch time).

Overall, we hope this post demonstrated the power of recommendation systems and the importance of each layer of the system. Additionally, we hope that readers gained an appreciation for how sound business strategy can guide product development. With both good business strategy and deep technical expertise, one can build an app like TikTok following the principles outlined in this post. 

**For code of this prototype, see [github repo.](https://github.com/thebayesianbandit/thebayesianbandit.github.io/blob/main/posts/doom_alg/doom_alg.ipynb)**

In [1]:
#Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import random as rd
from datetime import datetime as dt
from datetime import timedelta as tdel

from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse import csr_matrix

import torch
import torch.nn as nn
import torch.optim as optim

In [2]:
#Define video class
class Video:
    def __init__(self, vid_id, category, duration, creator_id, num_views):
        self.vid_id = vid_id
        self.category = category
        self.duration = duration 
        self.creator_id = creator_id
        self.num_views = num_views

In [3]:
#Define user class
class User:
    def __init__(self, user_id, age, gender, loc, interests):
        self.user_id = user_id
        self.age = age
        self.gender = gender
        self.loc = loc
        self.interests = interests

In [4]:
#Generate videos
vids_dict = {}
categories = ['comedy', 'educational', 'news', 'food', 'politics', 'animals', 'tech', 'fashion']
num_vids = 3000
num_creators = 20
mean_views = 50
sd_views = 10

for i in range(num_vids):
    vid_id = i
    vid_cat = np.random.choice(categories)
    vid_dur = 6
    vid_creator_id = np.random.choice(num_creators)
    num_views = round(np.random.normal(mean_views, sd_views))
    
    vids_dict[i] = Video(vid_id, vid_cat, vid_dur, vid_creator_id, num_views)

In [5]:
#Generate users
users_dict = {}
loc = ['west', 'mid', 'east']
num_users = 200

for i in range(num_users):
    user_id = i
    user_age = np.random.randint(18, 30)
    user_gender = np.random.choice(['m', 'f'])
    user_loc = np.random.choice(loc)
    user_int = np.random.choice(categories, 2)
    
    users_dict[i] = User(user_id, user_age, user_gender, user_loc, user_int)

In [6]:
#Generate interactions
actions = ['watched', 'skipped', 'liked']
int_dict = {
    "vid_id": [],
    "user_id": [],
    "session_id": [],
    "action": [],
    "timestamp": []
}
num_int_mu = 30
num_int_sd = 10

for i in users_dict.keys():
    current_user = users_dict[i]
    current_sess = 0
    current_timestamp = 0
    num_int = round(max(5, np.random.normal(num_int_mu, num_int_sd)))
    
    for j in range(num_int):
        vid_key = np.random.choice(list(vids_dict.keys()))
        current_vid = vids_dict[vid_key]
        prob_cont = np.random.uniform()
        
        if current_vid.category in current_user.interests:
            prob_cont = min(1, prob_cont + 0.3)
        
        if current_vid.category == "tech" and current_user.loc == "west":
            prob_cont = min(1, prob_cont + 0.1)
        elif current_vid.category == "animals" and current_user.loc == "mid":
            prob_cont = min(1, prob_cont + 0.1)
        elif current_vid.category == "food" and current_user.loc == "east":
            prob_cont = min(1, prob_cont + 0.1)
            
        # Regional penalty: clamp at 0 so the probability never goes negative
        if current_vid.category == "educational" and current_user.loc == "west":
            prob_cont = max(0, prob_cont - 0.2)
        elif current_vid.category == "news" and current_user.loc == "mid":
            prob_cont = max(0, prob_cont - 0.2)
        elif current_vid.category == "tech" and current_user.loc == "east":
            prob_cont = max(0, prob_cont - 0.2)
        
        comp_prob_cont = 1- prob_cont
        action = rd.choices(actions, weights=[prob_cont, comp_prob_cont / 2, comp_prob_cont / 2])[0]
        current_vid.num_views += 1
        
        if action == "liked":
            prob_cont = min(1, prob_cont + 0.05)
            current_timestamp += 6
        elif action == "skipped":
            prob_cont = max(0, prob_cont - 0.3)
            current_vid.num_views -= 1
            dur = np.random.uniform(1, 5)
            current_timestamp += dur
        elif action == "watched":
            current_timestamp += 6
        else:
            dur = np.random.uniform(1, 5)
            current_timestamp += dur
        
        int_dict["vid_id"].append(current_vid.vid_id)
        int_dict["user_id"].append(current_user.user_id)
        int_dict["session_id"].append(current_sess)
        int_dict["timestamp"].append(current_timestamp)
        
        # End the session with probability 1 - prob_cont; the final interaction is logged as "done"
        if prob_cont < np.random.uniform():
            current_sess += 1
            current_timestamp += np.random.normal(3600, 400)
            action = "done"
        int_dict["action"].append(action)
            

In [7]:
#Transform dict to dataframe
df = pd.DataFrame(int_dict)
df['vid_category'] = df['vid_id'].map(lambda x: vids_dict.get(x).category)
df['vid_num_views'] = df['vid_id'].map(lambda x: vids_dict.get(x).num_views)
df['user_age'] = df['user_id'].map(lambda x: users_dict.get(x).age)
df['user_gender'] = df['user_id'].map(lambda x: users_dict.get(x).gender)
df['user_loc'] = df['user_id'].map(lambda x: users_dict.get(x).loc)
df['user_int_1'] = df['user_id'].map(lambda x: users_dict.get(x).interests[0])
df['user_int_2'] = df['user_id'].map(lambda x: users_dict.get(x).interests[1])

In [8]:
#| output: false

#View head of df
df.head()

Unnamed: 0,vid_id,user_id,session_id,action,timestamp,vid_category,vid_num_views,user_age,user_gender,user_loc,user_int_1,user_int_2
0,2706,0,0,done,6.0,comedy,59,27,m,mid,politics,food
1,2837,0,1,watched,3445.988341,tech,51,27,m,mid,politics,food
2,179,0,1,done,3447.399895,tech,41,27,m,mid,politics,food
3,1845,0,2,watched,6372.084546,educational,47,27,m,mid,politics,food
4,1307,0,2,done,6373.976712,tech,59,27,m,mid,politics,food


In [9]:
#| output: false

#View shape of df
df.shape

(6134, 12)

In [10]:
#Generate score for each interaction (scores follow the interaction table; any other action, e.g. "done", scores -0.5)
df = df.assign(interaction_score = lambda x: np.where(x.action == "liked", 2, 
                                    np.where(x.action == "watched", 1, 
                                            np.where(x.action == "skipped", -2, -.5))))

In [11]:
#Aggregate scores and create item-user matrix for cosine similarity
sparse_matrix = (df.groupby(['user_id', 'vid_id'])['interaction_score']
 .sum()
 .reset_index()
 .pivot_table(index='vid_id', columns='user_id', values='interaction_score', fill_value=0)
 .pipe(lambda x: csr_matrix(x))
)

In [12]:
#Get cosine similarity from sparse matrix
cos_sim_mat = cosine_similarity(sparse_matrix)

In [13]:
#Define recommendation function
def get_recom(u_id, main_df, sim_mat, num_rec=10):
    watched_vids = main_df.query("user_id == @u_id")['vid_id'].tolist()
    watched_scores = main_df.query("user_id == @u_id")['interaction_score'].tolist()
    look_lst = main_df['vid_id'].sort_values().unique().tolist()
    rec_vids = {}
    
    for i, vid in enumerate(watched_vids):
        current_idx = look_lst.index(vid)
        for j, item in enumerate(sim_mat[current_idx,:]):
            if look_lst[j] not in watched_vids:
                rec_vids[look_lst[j]] = rec_vids.get(look_lst[j], 0) + item * watched_scores[i]
    
    rec_vids = sorted(rec_vids.items(), key=lambda x: x[1], reverse=True)
    
    return dict(rec_vids[:num_rec])

In [14]:
temp_rec = get_recom(1, df, cos_sim_mat, 20)

In [15]:
#Define additional scoring function
def gen_scores(recs_dict, u_id, w_1, w_2, w_3):
    new_dict = {}
    current_user = users_dict[u_id]
    
    for v_id, val in recs_dict.items():
        score = 0
        current_vid = vids_dict[v_id]
        score += (val * w_1) + ((current_vid.num_views * .1) * w_2) + (np.where(current_vid.category in current_user.interests, 1, 0) * w_3)
        new_dict[v_id] = score
        
    new_dict = sorted(new_dict.items(), key=lambda x: x[1], reverse=True)
    
    return dict(new_dict)

In [16]:
#Create state features
df['session_length'] = df['timestamp'] - df.groupby(['user_id', 'session_id'])['timestamp'].transform('min')
df['num_vids_session'] = df.groupby(['user_id', 'session_id']).cumcount()
df['last_vid_category'] = df.groupby(['user_id', 'session_id'])['vid_category'].shift(1).fillna("animals")
df['last_action'] = df.groupby(['user_id', 'session_id'])['action'].shift(1).fillna("session_end")

In [17]:
# #Create state df
# state_df = (df.drop(['vid_id', 'interaction_score'], axis=1)
#  .pipe(lambda x: pd.get_dummies(x, dtype='float'))
# )

In [18]:
# state_df.head()

In [19]:
#Define embedding mappings 
VID_CATEGORY_MAP = {'comedy': 0, 'educational': 1, 'news': 2, 'food': 3,
                    'politics': 4, 'animals': 5, 'tech': 6, 'fashion': 7}
USER_GENDER_MAP = {'m': 0, 'f': 1}
USER_LOC_MAP = {'west': 0, 'mid': 1, 'east': 2}
LAST_ACTION_MAP = {'watched': 0, 'skipped': 1, 'liked': 2, 'session_end': 3}

# Define cardinalities
NUM_VID_CATEGORIES = len(VID_CATEGORY_MAP) 
NUM_GENDERS = len(USER_GENDER_MAP)         
NUM_LOCS = len(USER_LOC_MAP)               
NUM_LAST_ACTIONS = len(LAST_ACTION_MAP)    

# Define embedding dims
EMBEDDING_DIM_GENDER = 1       
EMBEDDING_DIM_LOC = 2          
EMBEDDING_DIM_USER_INT = 4     
EMBEDDING_DIM_LAST_ACTION = 2
EMBEDDING_DIM_VID_CATEGORY = 4 

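#Define numerical feature dims: 3 session features (age, videos this session, session length)
#plus one score per candidate video (action_size = 10)
NUM_NUMERICAL_FEATURES = 13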

In [20]:
#Define state df
state_df = df[['user_id', 'user_age', 'num_vids_session', 'session_length', 'user_gender', 
               'user_loc', 'user_int_1', 'user_int_2', 'last_vid_category', 
               'last_action']]


In [22]:
#Define Deep Q Network
class DQN(nn.Module):
    def __init__(self, hidden_dim, action_size):
        super(DQN, self).__init__()
        
        self.user_gender_embedding = nn.Embedding(NUM_GENDERS, EMBEDDING_DIM_GENDER)
        self.user_loc_embedding = nn.Embedding(NUM_LOCS, EMBEDDING_DIM_LOC)
        self.user_int_1_embedding = nn.Embedding(NUM_VID_CATEGORIES, EMBEDDING_DIM_USER_INT)
        self.user_int_2_embedding = nn.Embedding(NUM_VID_CATEGORIES, EMBEDDING_DIM_USER_INT)
        self.last_vid_category_embedding = nn.Embedding(NUM_VID_CATEGORIES, EMBEDDING_DIM_VID_CATEGORY)
        self.last_action_embedding = nn.Embedding(NUM_LAST_ACTIONS, EMBEDDING_DIM_LAST_ACTION)

        total_embedding_dim = (
            EMBEDDING_DIM_GENDER +
            EMBEDDING_DIM_LOC +
            EMBEDDING_DIM_USER_INT * 2 + 
            EMBEDDING_DIM_LAST_ACTION + 
            EMBEDDING_DIM_VID_CATEGORY
        )
        
        state_dim = NUM_NUMERICAL_FEATURES + total_embedding_dim
        
        self.state_dim = state_dim
        self.fc1 = nn.Linear(state_dim, hidden_dim)
        self.relu1 = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.relu2 = nn.ReLU()
        self.fc3 = nn.Linear(hidden_dim, action_size)
    
    def forward(self, state_batch):
        numerical_features = state_batch[:, :NUM_NUMERICAL_FEATURES]

        gender_ids = state_batch[:, NUM_NUMERICAL_FEATURES + 0].long()
        loc_ids = state_batch[:, NUM_NUMERICAL_FEATURES + 1].long()
        user_int_1_ids = state_batch[:, NUM_NUMERICAL_FEATURES + 2].long()
        user_int_2_ids = state_batch[:, NUM_NUMERICAL_FEATURES + 3].long()
        last_vid_category_ids = state_batch[:, NUM_NUMERICAL_FEATURES + 4].long()
        last_action_ids = state_batch[:, NUM_NUMERICAL_FEATURES + 5].long()

        gender_embedded = self.user_gender_embedding(gender_ids)
        loc_embedded = self.user_loc_embedding(loc_ids)
        user_int_1_embedded = self.user_int_1_embedding(user_int_1_ids)
        user_int_2_embedded = self.user_int_2_embedding(user_int_2_ids)
        last_vid_category_embedded = self.last_vid_category_embedding(last_vid_category_ids)
        last_action_embedded = self.last_action_embedding(last_action_ids)

        x = torch.cat((
            numerical_features,
            gender_embedded,
            loc_embedded,
            user_int_1_embedded,
            user_int_2_embedded,
            last_vid_category_embedded,
            last_action_embedded
        ), dim=1)

        # Pass through the dense layers
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)

In [23]:
#Define Agent
class Agent:
    def __init__(self, state_size, action_size, model, optimizer, criterion):
        self.state_size = state_size
        self.action_size = action_size
        self.memory = []
        self.gamma = 0.95
        self.epsilon = 1.0
        self.epsilon_decay = 0.995
        self.epsilon_min = 0.01
        self.model = model
        self.optimizer = optimizer
        self.criterion = criterion

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def act(self, state):
        if rd.random() <= self.epsilon:
            return rd.randrange(self.action_size)
        with torch.no_grad():
            act_values = self.model(state.unsqueeze(0))
        return torch.argmax(act_values[0]).item()

    def replay(self, batch_size):
        if len(self.memory) < batch_size:
            return

        minibatch = rd.sample(self.memory, batch_size)
        states, actions, rewards, next_states, dones = zip(*minibatch)

        states = torch.stack(states)
        actions = torch.tensor(actions, dtype=torch.long)
        rewards = torch.tensor(rewards, dtype=torch.float32)
        next_states = torch.stack(next_states)
        dones = torch.tensor(dones, dtype=torch.float32)

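        q_values = self.model(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        # Targets are treated as constants, so no gradient flows through them.
        # Note: this cell bootstraps from the same network rather than a separate target network.
        with torch.no_grad():
            next_q_values = self.model(next_states).max(1)[0]
            targets = rewards + (1 - dones) * self.gamma * next_q_values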

        loss = self.criterion(q_values, targets)

        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
        
        return loss.item()

In [24]:
#Define get state function
def get_state(current_s, v_scores):
    state_vec = torch.zeros(9+len(v_scores))
    
    state_vec[0] = current_s['user_age']
    state_vec[1] = current_s['num_vids_session']
    state_vec[2] = current_s['session_length']
    
    # Copy the candidate video scores into the numerical part of the state
    for i in range(len(v_scores)):
        state_vec[3+i] = v_scores[i]
    
    state_vec[3+len(v_scores)] = USER_GENDER_MAP.get(current_s['user_gender'])
    state_vec[4+len(v_scores)] = USER_LOC_MAP.get(current_s['user_loc'])
    state_vec[5+len(v_scores)] = VID_CATEGORY_MAP.get(current_s['user_int_1'])
    state_vec[6+len(v_scores)] = VID_CATEGORY_MAP.get(current_s['user_int_2'])
    state_vec[7+len(v_scores)] = VID_CATEGORY_MAP.get(current_s['last_vid_category'])
    state_vec[8+len(v_scores)] = LAST_ACTION_MAP.get(current_s['last_action'])
    
    return state_vec

In [25]:
#Define reward function
def get_reward(current_action):
    reward = 0
    done_bool = False
    
    if current_action == "watched":
        reward += 1
    elif current_action == "liked":
        reward += 2
    elif current_action == "skipped":
        reward -= 2
    else:
        reward -= .5
        done_bool = True
    
    return reward, done_bool

In [26]:
#Define update function for similarity matrix
def update_sim_mat(temp_df):
    temp_sparse_matrix = (temp_df.groupby(['user_id', 'vid_id'])['interaction_score']
                     .sum()
                     .reset_index()
                     .pivot_table(index='vid_id', columns='user_id', values='interaction_score', fill_value=0)
                     .pipe(lambda x: csr_matrix(x))
                    )
    
    temp_cos_sim_mat = cosine_similarity(temp_sparse_matrix)
    
    return temp_cos_sim_mat

In [27]:
#Define parameters for DQN
hidden_lay_size = 64
action_size = 10
mod = DQN(hidden_lay_size, action_size)
optimizer = optim.Adam(mod.parameters(), lr=0.001)
criterion = nn.MSELoss()

In [28]:
#Define parameters for Agent
agent = Agent(mod.state_dim, action_size, mod, optimizer, criterion)
batch_size = 50

In [29]:
#Define RL actions
actions_rl = ['watched', 'skipped', 'liked', 'session_end']

In [30]:
#Define simulation function
def perf_sim(n_eps, n_steps, sim_df):
    rewards_per_episode = []
    losses_per_episode = []
    
    sim_dict = {
        'user_id': [],
        'user_age': [],
        'time_step': [],
        'num_vids_session': [],
        'session_length': [],
        'user_gender': [],
        'user_loc': [],
        'user_int_1': [],
        'user_int_2': [],
        'last_vid_category': [],
        'last_action': [],
        'current_vid_rec': [],
        'action': [],
        'interaction_score': []
    }
    
    temp_sim_df = sim_df.copy()
    
    for e in range(n_eps):
        total_reward = 0
        
        next_state_dict = {}
        
        current_user_id = np.random.choice(list(users_dict.keys()))
        current_user = users_dict[current_user_id]
        
        #Copy the row so the resets below do not write back into state_df
        current_state = state_df.query("user_id == @current_user_id").drop(['user_id'], axis=1).iloc[-1].copy()
        current_state['num_vids_session'] = 0
        current_state['session_length'] = 0
        
        last_vid_cat = current_state['last_vid_category']
        last_action = current_state['last_action']
        
        cos_sim_mat = update_sim_mat(temp_sim_df)
        current_vid_cand = get_recom(current_user_id, temp_sim_df, cos_sim_mat, action_size)
        current_vid_cand = gen_scores(current_vid_cand, current_user_id, .5, .1, .3)
        vid_scores = [score for score in current_vid_cand.values()]
        
        current_state = current_state.to_dict()
        current_state = get_state(current_state, vid_scores)
        
        session_len = 0
        num_vids_sess = 0
        
        for t in range(n_steps):
            chosen_vid_idx = agent.act(current_state)
            chosen_vid_key = list(current_vid_cand.keys())[chosen_vid_idx]
            chosen_vid = vids_dict[chosen_vid_key]
            
            prob_cont = np.random.uniform()

            if chosen_vid.category in current_user.interests:
                prob_cont = min(1, prob_cont + 0.3)

            if chosen_vid.category == "tech" and current_user.loc == "west":
                prob_cont = min(1, prob_cont + 0.1)
            elif chosen_vid.category == "animals" and current_user.loc == "mid":
                prob_cont = min(1, prob_cont + 0.1)
            elif chosen_vid.category == "food" and current_user.loc == "east":
                prob_cont = min(1, prob_cont + 0.1)

            #Penalties lower the probability, so clamp at 0 to keep the sampling weights valid
            if chosen_vid.category == "educational" and current_user.loc == "west":
                prob_cont = max(0, prob_cont - 0.2)
            elif chosen_vid.category == "news" and current_user.loc == "mid":
                prob_cont = max(0, prob_cont - 0.2)
            elif chosen_vid.category == "tech" and current_user.loc == "east":
                prob_cont = max(0, prob_cont - 0.2)

            #Split the remaining probability evenly across the other three actions
            #(weights follow the order of actions_rl)
            comp_prob_cont = 1 - prob_cont
            action = rd.choices(actions_rl, weights=[prob_cont, comp_prob_cont / 3, comp_prob_cont / 3, comp_prob_cont / 3])[0]
            chosen_vid.num_views += 1

            current_reward, current_done = get_reward(action)
            
            sim_dict['user_id'].append(current_user_id)
            sim_dict['user_age'].append(current_user.age)
            sim_dict['time_step'].append(t)
            sim_dict['user_gender'].append(current_user.gender)
            sim_dict['user_loc'].append(current_user.loc)
            sim_dict['user_int_1'].append(current_user.interests[0])
            sim_dict['user_int_2'].append(current_user.interests[1])
            sim_dict['last_vid_category'].append(last_vid_cat)
            sim_dict['last_action'].append(last_action)
            sim_dict['current_vid_rec'].append(chosen_vid_key)
            sim_dict['action'].append(action)
            sim_dict['interaction_score'].append(current_reward)
            
            if action == 'session_end' or action == 'skipped':
                session_len += np.random.uniform(1, 5)
            else:
                session_len += 6
                num_vids_sess += 1
            
            sim_dict['session_length'].append(session_len)
            sim_dict['num_vids_session'].append(num_vids_sess)
            
            next_state_dict['user_id'] = current_user_id
            next_state_dict['user_age'] = current_user.age
            next_state_dict['time_step'] = t + 1
            next_state_dict['user_gender'] = current_user.gender
            next_state_dict['user_loc'] = current_user.loc
            next_state_dict['user_int_1'] = current_user.interests[0]
            next_state_dict['user_int_2'] = current_user.interests[1]
            next_state_dict['last_vid_category'] = chosen_vid.category
            next_state_dict['last_action'] = action
            next_state_dict['session_length'] = session_len
            next_state_dict['num_vids_session'] = num_vids_sess
            
            cos_sim_mat = update_sim_mat(temp_sim_df)
            current_vid_cand = get_recom(current_user_id, temp_sim_df, cos_sim_mat, action_size)
            current_vid_cand = gen_scores(current_vid_cand, current_user_id, .5, .1, .3)
            vid_scores = [score for score in current_vid_cand.values()]
            
            next_state = get_state(next_state_dict, vid_scores)
            
            agent.remember(current_state, chosen_vid_idx, current_reward, next_state, current_done)
            loss_item = agent.replay(batch_size)
            
            if loss_item is not None:
                losses_per_episode.append(loss_item)
            
            total_reward += current_reward
            
            current_state = next_state
            last_vid_cat = chosen_vid.category
            last_action = action
            
            new_data = {'vid_id': [chosen_vid_key], 'user_id': [current_user_id], 'interaction_score': [current_reward]}
            temp_sim_df = pd.concat([temp_sim_df, pd.DataFrame(new_data)], ignore_index=True)
            
            if current_done:
                break
                
        rewards_per_episode.append(total_reward)
        
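        if (e + 1) % 10 == 0:
            #Running averages over all episodes (and replay steps) so far
            avg_reward = np.mean(rewards_per_episode)
            avg_loss = np.mean(losses_per_episode) if losses_per_episode else float('nan')
            
            print(f"Episode {e + 1} | Avg Reward: {avg_reward:.2f}")
            print(f"Avg Loss: {avg_loss:.2f}")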
    
    return rewards_per_episode, losses_per_episode, pd.DataFrame(sim_dict)
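
The simulated user response inside `perf_sim` comes down to turning one "watch" probability into four sampling weights. Below is a minimal, standalone sketch of that step, assuming the probability is clamped to [0, 1] before the remainder is split evenly across the other three actions. The helper names (`clip01`, `action_weights`, `sample_action`) are hypothetical and not part of the prototype.

```python
import random

# Same action order as actions_rl in the notebook.
ACTIONS = ["watched", "skipped", "liked", "session_end"]


def clip01(p):
    """Clamp a value to the closed interval [0, 1]."""
    return max(0.0, min(1.0, p))


def action_weights(prob_cont):
    """Weight for 'watched', with the remainder split evenly over the rest."""
    prob_cont = clip01(prob_cont)
    rest = (1.0 - prob_cont) / 3
    return [prob_cont, rest, rest, rest]


def sample_action(prob_cont, rng=random):
    """Draw one action using the weights above."""
    return rng.choices(ACTIONS, weights=action_weights(prob_cont))[0]


# Seeded generator so the example is repeatable.
rng = random.Random(0)
draws = [sample_action(0.7, rng) for _ in range(1000)]
share_watched = draws.count("watched") / len(draws)
print(f"share of 'watched' at prob_cont=0.7: {share_watched:.2f}")
```

Clamping matters because boosts and penalties are applied to a uniform random draw, so the raw value can leave the [0, 1] range before it is used as a weight.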

In [31]:
# #Perform simulation
# total_rew, total_loss, rl_df = perf_sim(500, 50, df[['vid_id', 'user_id', 'interaction_score']])

In [32]:
# #Save rl_df
# rl_df.to_csv("rl_sim_df.csv", index=False)

In [33]:
#Load rl_df
rl_df = pd.read_csv("rl_sim_df.csv")

In [34]:
#| output: false

#Show rl_df
rl_df.head(10)

| | user_id | user_age | time_step | num_vids_session | session_length | user_gender | user_loc | user_int_1 | user_int_2 | last_vid_category | last_action | current_vid_rec | action | interaction_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 98 | 25 | 0 | 1 | 6.0 | f | mid | tech | animals | comedy | watched | 1336 | watched | 1.0 |
| 1 | 98 | 25 | 1 | 2 | 12.0 | f | mid | tech | animals | fashion | watched | 1826 | watched | 1.0 |
| 2 | 98 | 25 | 2 | 3 | 18.0 | f | mid | tech | animals | tech | watched | 452 | watched | 1.0 |
| 3 | 98 | 25 | 3 | 4 | 24.0 | f | mid | tech | animals | tech | watched | 1683 | watched | 1.0 |
| 4 | 98 | 25 | 4 | 5 | 30.0 | f | mid | tech | animals | animals | watched | 484 | watched | 1.0 |
| 5 | 98 | 25 | 5 | 6 | 36.0 | f | mid | tech | animals | food | watched | 204 | watched | 1.0 |
| 6 | 98 | 25 | 6 | 7 | 42.0 | f | mid | tech | animals | news | watched | 2960 | watched | 1.0 |
| 7 | 98 | 25 | 7 | 7 | 43.242581 | f | mid | tech | animals | tech | watched | 445 | session_end | -0.5 |
| 8 | 30 | 26 | 0 | 1 | 6.0 | m | west | food | tech | fashion | skipped | 728 | watched | 1.0 |
| 9 | 30 | 26 | 1 | 2 | 12.0 | m | west | food | tech | fashion | watched | 752 | watched | 1.0 |


In [35]:
#| output: false

#Show rl_df tail
rl_df.tail(10)

| | user_id | user_age | time_step | num_vids_session | session_length | user_gender | user_loc | user_int_1 | user_int_2 | last_vid_category | last_action | current_vid_rec | action | interaction_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3319 | 55 | 28 | 0 | 1 | 6.0 | f | west | fashion | food | tech | liked | 1 | watched | 1.0 |
| 3320 | 55 | 28 | 1 | 2 | 12.0 | f | west | fashion | food | educational | watched | 2271 | watched | 1.0 |
| 3321 | 55 | 28 | 2 | 3 | 18.0 | f | west | fashion | food | tech | watched | 1115 | watched | 1.0 |
| 3322 | 55 | 28 | 3 | 4 | 24.0 | f | west | fashion | food | fashion | watched | 2495 | watched | 1.0 |
| 3323 | 55 | 28 | 4 | 5 | 30.0 | f | west | fashion | food | fashion | watched | 618 | watched | 1.0 |
| 3324 | 55 | 28 | 5 | 6 | 36.0 | f | west | fashion | food | news | watched | 397 | watched | 1.0 |
| 3325 | 55 | 28 | 6 | 6 | 40.464115 | f | west | fashion | food | food | watched | 1547 | skipped | -2.0 |
| 3326 | 55 | 28 | 7 | 7 | 46.464115 | f | west | fashion | food | educational | skipped | 1242 | liked | 2.0 |
| 3327 | 55 | 28 | 8 | 8 | 52.464115 | f | west | fashion | food | animals | liked | 2546 | watched | 1.0 |
| 3328 | 55 | 28 | 9 | 8 | 57.077824 | f | west | fashion | food | comedy | watched | 387 | session_end | -0.5 |


In [36]:
# #Show rewards and losses
# fig, ax = plt.subplots(1, 2, figsize=(15,5))

# ax[0].plot(total_rew)
# ax[0].set_title("Total Rewards per Episode")
# ax[1].plot(total_loss)
# ax[1].set_title("Total Loss per Episode");