## Data Preprocessing for Recommender Systems

The data preparation process transforms the original dataset, groups the implicit feedback and interactions, and creates the datasets used for training and testing the model.

credit: [Marlesson](https://github.com/marlesson/recsys_autoencoders)

### Datasets

The dataset used in this project is Steam Video Games, obtained from https://www.kaggle.com/tamber/steam-video-games.

This dataset is a list of user behaviors with the columns `user_id`, `game`, `type`, `hours`, and `none`. The `type` is either 'purchase' or 'play'. The `hours` value indicates the degree to which the behavior was performed: for 'purchase' it is always 1, and for 'play' it is the number of hours the user has played the game.

Raw file: `../data/rec_data/raw/rating.csv`

| user_id  | game | type | hours | none |
| -------- | ---------------- | ---------------- | ----------- |  ------------ |
| 151603712 | "The Elder Scrolls V Skyrim" | purchase | 1.0 | 0 |
| 151603712 | "The Elder Scrolls V Skyrim" | play | 273.0 | 0 |
| 151603712 | "Fallout 4" | purchase | 1.0 | 0 |
| ... | ... | ... | ... | ... |
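
To make the column semantics concrete, the sketch below rebuilds the sample rows from the table above in memory (no file is read) and checks the two properties just described: every 'purchase' row carries the value 1, and 'play' rows carry the hours played. The `sample` frame and the variable names are illustrative only.

```python
import pandas as pd

# Illustrative rows copied from the table above (not read from rating.csv).
sample = pd.DataFrame(
    [
        (151603712, "The Elder Scrolls V Skyrim", "purchase", 1.0, 0),
        (151603712, "The Elder Scrolls V Skyrim", "play", 273.0, 0),
        (151603712, "Fallout 4", "purchase", 1.0, 0),
    ],
    columns=["user_id", "game", "type", "hours", "none"],
)

# Every 'purchase' row should carry the constant value 1.
purchases = sample[sample["type"] == "purchase"]
all_purchases_are_one = bool((purchases["hours"] == 1.0).all())

# 'play' rows hold the hours played per game.
play_hours = sample[sample["type"] == "play"].set_index("game")["hours"]

print(all_purchases_are_one)
print(play_hours.to_dict())
```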


### Data Preparation

The data preparation process transforms the original dataset and groups the implicit feedback and interactions.

Datasets created:
* ../data/rec_data/articles_df.csv
* ../data/rec_data/interactions_full_df.csv


`articles_df.csv` contains data about the items (games) only.

| content_id  | game | total_users | total_hours |
| -------- | ---------------- | ---------------- | ----------- | 
| 0 | 007 Legends | 1 | 1.7 |
| 1 | 0RBITALIS | 3 | 4.2 |

`interactions_full_df.csv` contains the user-item interactions: the number of hours played (`hours`) and a flag marking that the interaction occurred (`view`), both used as implicit feedback.

| user_id | content_id | game | hours | view | 
| -------- | ---------------- | ---------------- | ----------- |  ----------- | 
| 134  | 1680 | Far Cry 3 Blood Dragon | 2.2 | 1 |
| 2219 | 1938 | Gone Home | 1.2 | 1 |
| 3315 | 3711 | Serious Sam 3 BFE | 3.7 | 1 |
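
A table in this shape is often consumed as a user-by-item matrix. The following is a minimal sketch of that idea, assuming only the three sample rows above; the name `view_matrix` is illustrative and not part of this project.

```python
import pandas as pd

# Illustrative rows copied from the table above.
interactions = pd.DataFrame(
    {
        "user_id": [134, 2219, 3315],
        "content_id": [1680, 1938, 3711],
        "game": ["Far Cry 3 Blood Dragon", "Gone Home", "Serious Sam 3 BFE"],
        "hours": [2.2, 1.2, 3.7],
        "view": [1, 1, 1],
    }
)

# Rows are users, columns are items; a missing pair means "no interaction".
view_matrix = (
    interactions.pivot(index="user_id", columns="content_id", values="view")
    .fillna(0)
    .astype(int)
)
print(view_matrix)
```

Replacing `values="view"` with `values="hours"` would give the same layout weighted by play time. For the full dataset a sparse representation is likely a better fit than a dense pivot.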

### Tasks:

1. Remove infrequent users

2. Transform the data: group the implicit feedback and interactions.


In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np

In [2]:
base_path = '../data/rec_data/raw/'

# Contains logs of user interactions (purchases and play time) with games
interactions_df = pd.read_csv(base_path+'rating.csv', index_col=None, header=None)
interactions_df.columns=['user_id', 'game', 'type', 'hours', 'none']

In [3]:
interactions_df.head()

| | user_id | game | type | hours | none |
| --- | --- | --- | --- | --- | --- |
| 0 | 151603712 | The Elder Scrolls V Skyrim | purchase | 1.0 | 0 |
| 1 | 151603712 | The Elder Scrolls V Skyrim | play | 273.0 | 0 |
| 2 | 151603712 | Fallout 4 | purchase | 1.0 | 0 |
| 3 | 151603712 | Fallout 4 | play | 87.0 | 0 |
| 4 | 151603712 | Spore | purchase | 1.0 | 0 |


In [4]:
# Group interactions by user_id and game.
# The sum covers both 'purchase' rows (always 1.0) and 'play' rows.
interactions_full_df = interactions_df.groupby(['user_id', 'game'])['hours']\
                                      .sum().reset_index()
# View is 1 if interaction
interactions_full_df['view'] = 1

In [5]:
interactions_full_df.head()

| | user_id | game | hours | view |
| --- | --- | --- | --- | --- |
| 0 | 5250 | Alien Swarm | 5.9 | 1 |
| 1 | 5250 | Cities Skylines | 145.0 | 1 |
| 2 | 5250 | Counter-Strike | 1.0 | 1 |
| 3 | 5250 | Counter-Strike Source | 1.0 | 1 |
| 4 | 5250 | Day of Defeat | 1.0 | 1 |
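
Note that the grouped sum above adds the 1.0 stored on each 'purchase' row to the play time, so a game that was bought but never played ends up with `hours` equal to 1.0. If only play time is wanted, one possible alternative (an assumption, not what this notebook does) is to sum the 'play' rows only and fill purchase-only pairs with 0. The `raw` frame below is toy data.

```python
import pandas as pd

# Toy data: user 1 bought and played Game A, and only bought Game B.
raw = pd.DataFrame(
    [
        (1, "Game A", "purchase", 1.0),
        (1, "Game A", "play", 10.0),
        (1, "Game B", "purchase", 1.0),
    ],
    columns=["user_id", "game", "type", "hours"],
)

# Behaviour of the cell above: purchase and play values are summed together.
summed = raw.groupby(["user_id", "game"])["hours"].sum().reset_index()

# Alternative: keep one row per (user, game) pair and attach play hours only.
pairs = raw[["user_id", "game"]].drop_duplicates()
play = (
    raw[raw["type"] == "play"]
    .groupby(["user_id", "game"], as_index=False)["hours"]
    .sum()
)
play_only = pairs.merge(play, on=["user_id", "game"], how="left")
play_only = play_only.fillna({"hours": 0.0})
play_only["view"] = 1  # every pair still counts as an interaction

print(summed)
print(play_only)
```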


##### Filter Interactions

In [6]:
# Define the threshold for frequent users 
min_interactions  = 5
users_interactions_count_df = interactions_full_df.groupby('user_id').size()
print('# users: %d' % len(users_interactions_count_df))
users_interactions_count_df.head()

# users: 12393


user_id
5250      21
76767     36
86540     82
103360    10
144736     8
dtype: int64

In [7]:
# Get the user id for these frequent users
users_with_enough_interactions_df = users_interactions_count_df[users_interactions_count_df >= min_interactions]\
                                    .reset_index()[['user_id']]
  
print('# users with at least %d interactions: %d' % (min_interactions, len(users_with_enough_interactions_df)))  

print('# of interactions: %d' % len(interactions_df))


# users with at least 5 interactions: 3757
# of interactions: 200000


In [8]:
# Remove infrequent users via a right join on the frequent-user list
interactions_full_df = interactions_full_df.merge(users_with_enough_interactions_df,
                                                  how='right',
                                                  on='user_id')
print('# of interactions from users with at least %d interactions: %d' % (min_interactions, len(interactions_full_df)))



# of interactions from users with at least 5 interactions: 115139
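
The count-then-join steps above can also be expressed as a single boolean mask. This is a minimal sketch on toy data using `groupby(...).transform('size')`; the `toy` frame and the threshold of 2 are illustrative (the notebook uses 5).

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 2, 3, 3],
        "game": ["A", "B", "C", "A", "B", "C"],
        "hours": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    }
)
min_interactions = 2  # illustrative threshold

# Number of interactions of each row's user, repeated on every row.
per_user_count = toy.groupby("user_id")["game"].transform("size")

# Keep only rows whose user meets the threshold.
filtered = toy[per_user_count >= min_interactions].reset_index(drop=True)
print(filtered["user_id"].unique())
```

Unlike the right join, the mask keeps the original row order and needs no intermediate frame.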


In [9]:
# Replace game titles and user ids with consecutive integer codes
interactions_full_df['content_id'] = interactions_full_df['game'].astype('category').cat.codes
interactions_full_df['user_id']    = interactions_full_df['user_id'].astype('category').cat.codes

In [10]:
interactions_full_df.head()

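| | user_id | game | hours | view | content_id |
| --- | --- | --- | --- | --- | --- |
| 0 | 0 | Alien Swarm | 5.9 | 1 | 226 |
| 1 | 0 | Cities Skylines | 145.0 | 1 | 846 |
| 2 | 0 | Counter-Strike | 1.0 | 1 | 972 |
| 3 | 0 | Counter-Strike Source | 1.0 | 1 | 978 |
| 4 | 0 | Day of Defeat | 1.0 | 1 | 1125 |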
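
`cat.codes` numbers the distinct values in sorted order starting at 0, and the original values stay recoverable from the categorical's `categories`. The small sketch below, on toy titles, shows how an id-to-title mapping could be kept; the name `id_to_game` is illustrative.

```python
import pandas as pd

games = pd.Series(["Spore", "Fallout 4", "Spore", "Alien Swarm"])

as_category = games.astype("category")

# Integer id per row, assigned in sorted order of the distinct titles.
codes = as_category.cat.codes

# Lookup from integer id back to the original title.
id_to_game = dict(enumerate(as_category.cat.categories))

print(codes.tolist())
print(id_to_game)
```

Because the cell above overwrites `user_id` with its codes, the original Steam user ids are no longer available afterwards unless a similar mapping is saved first.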


In [11]:
# Create a DataFrame with per-game content information
articles_df = interactions_full_df.groupby(['game', 'content_id'])\
                                  .agg({'user_id': 'count', 'hours': 'sum'})\
                                  .reset_index()\
                                  .rename(columns={'user_id': 'total_users', 'hours': 'total_hours'})

print('# of unique user/item interactions: %d' % len(interactions_full_df))

# of unique user/item interactions: 115139


In [12]:
articles_df.head()

| | game | content_id | total_users | total_hours |
| --- | --- | --- | --- | --- |
| 0 | 007 Legends | 0 | 1 | 1.7 |
| 1 | 0RBITALIS | 1 | 3 | 4.2 |
| 2 | 1... 2... 3... KICK IT! (Drop That Beat Like a... | 2 | 7 | 27.0 |
| 3 | 10 Second Ninja | 3 | 6 | 11.9 |
| 4 | 10000000 | 4 | 1 | 4.6 |


##### Save files

1.  full interaction data
2.  items content data

In [13]:
# Write to the locations listed under "Datasets created"
out_path = '../data/rec_data/'
interactions_full_df[['user_id','content_id','game','hours','view']].to_csv(out_path+'interactions_full_df.csv', index = False)
articles_df[['content_id', 'game','total_users','total_hours']].to_csv(out_path+'articles_df.csv', index = False)
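
The introduction mentions datasets for training and testing, while the cells above stop at the full interaction file. One possible way to derive such a split, shown only as a sketch, is to hold out a fraction of each user's interactions with pandas `groupby(...).sample`. The toy frame, the 20% fraction, and the seed are assumptions, not values taken from this project.

```python
import pandas as pd

# Toy stand-in for interactions_full_df: 3 users with 5 interactions each.
toy = pd.DataFrame(
    {
        "user_id": [u for u in range(3) for _ in range(5)],
        "content_id": [i for _ in range(3) for i in range(5)],
        "view": 1,
    }
)

# Hold out 20% of every user's rows so each user appears in both sets.
test_df = toy.groupby("user_id").sample(frac=0.2, random_state=42)
train_df = toy.drop(test_df.index)

print(len(train_df), len(test_df))
```

Sampling per user keeps every user in the training set, which matters for models that learn one representation per user. The `train_test_split` function imported in the first cell, with its `stratify` argument set to the user ids, would be another way to get a similar result.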