In [17]:
import data_gen

In [22]:
import importlib
importlib.reload(data_gen)

<module 'data_gen' from '/Users/robinburke/Library/CloudStorage/OneDrive-UCB-O365/Documents/repos/lafs/data_gen.py'>

# Data Generation for SCRUF Experiments through LAFS

This code generates simulated recommender system output through a process of LAtent Factor Simulation (LAFS). 

For each user, the generator produces a list of items, each with an associated score. Users can be generated with different propensities towards item features, which may be sensitive or not.
User propensities can be segmented temporally into multiple regimes, so that users with one set of propensities arrive first and users with a different set arrive next.
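
To make the idea concrete, here is a minimal sketch of a latent factor simulation with two temporal regimes. It is not the `data_gen` implementation: the helper names (`draw_factors`, `simulate`), the Gaussian draw around a per-factor mean, and the dot-product scoring are all assumptions chosen for illustration.

```python
import random

def draw_factors(means, std_dev, rng):
    """Draw one latent factor vector: a Gaussian sample around each per-factor mean."""
    return [rng.gauss(mean, std_dev) for mean in means]

def simulate(regimes, item_means, num_items, std_dev, list_size, seed=0):
    """Hypothetical simulator. `regimes` is a list of (num_users, per-factor means);
    users are emitted regime by regime, so earlier regimes get lower user ids."""
    rng = random.Random(seed)
    items = [draw_factors(item_means, std_dev, rng) for _ in range(num_items)]
    ratings = []
    user_id = 0
    for num_users, user_means in regimes:
        for _ in range(num_users):
            user = draw_factors(user_means, std_dev, rng)
            # Score every item as the dot product of user and item factors.
            scored = [
                (sum(u * v for u, v in zip(user, item)), item_id)
                for item_id, item in enumerate(items)
            ]
            scored.sort(reverse=True)
            # Keep the highest-scoring items as (user, item, score) rows.
            ratings.extend((user_id, item_id, score) for score, item_id in scored[:list_size])
            user_id += 1
    return ratings

ratings = simulate(
    regimes=[(3, [0.9, 0.1, 0.1]), (2, [0.5, 0.5, 0.5])],
    item_means=[0.1, 0.3, 0.9],
    num_items=20,
    std_dev=1.0,
    list_size=5,
)
```

In this sketch the first three users favor the first factor and the last two have uniform propensities, which mirrors the "one regime first, another next" ordering described above.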

## Input - DataGenParameters

Encapsulates the parameters needed for generation. Can be loaded from TOML.

* `num_items`: number of items (int)
* `num_factors`: number of factors (int)
* `item_feature_propensities`: the distributions used to generate item models ([int x num_factors])
* `std_dev_factors`: standard deviation for the factor generation (float <0.0,1.0>)
* `num_agents`: number of agents/protected factors (int)
* `agent_discount`: subtraction applied to agent-associated items ([(mean,variance) x num_agents])
* `items_dependency`: whether the first two protected item factors are co-dependent (boolean)
* `num_users_per_propensity`: number of users per user propensity group ([int x number of user propensity groups])
* `user_feature_propensities`: the distributions used to generate user models ([(propensity) x number of factors] x number of user propensity groups)
* `initial_list_size`: the size of the list generated for each user (int)
* `recommendation_size`: the size of the recommendation list delivered as output (int)

# Example

## Setup of the Parameters

In [33]:
CONFIG_STRING = '''
num_items = 1000
initial_list_size = 200
recommendation_size = 50
num_users_per_propensity = [100, 100]

# Matrix info
num_factors = 10
std_dev_factors = 1.0

# User and item generation info
user_feature_propensities = [[[0.9, 0.1],[0.1, 0.1],[0.1, 0.1], [0.3, 1.0],[0.6, 1.0],[0.1, 0.6], [0.4, 1.0],[0.9, 1.0],[0.1, 0.6], [0.0, 1.0]],
                    [[0.5, 0.5],[0.5, 0.5],[0.5, 0.5], [0.3, 1.0],[0.6, 1.0],[0.1, 0.6], [0.4, 1.0],[0.9, 1.0],[0.1, 0.6], [0.0, 1.0]]]
item_feature_propensities = [0.1, 0.3, 0.9, 0.5, 0.6, 0.2, 0.5, 0.7, 0.6, 0.1]

# Fairness info
num_sensitive_features = 3
feature_bias = [[0.5, 0.1], [0.0, 0.0], [0.0, 0.0]]

# Output files
compatibilities_file = "data/sample_compatibilities.csv"
item_features_file = "data/sample_item_features.csv"
user_factors_file = "data/sample_user_factors.csv"
item_factors_file = "data/sample_item_factors.csv"
ratings_file = "data/sample_ratings.csv"
'''

In [35]:
params = data_gen.DataGenParameters()
params.from_string(CONFIG_STRING)
params

DataGenParameters(num_items=1000, num_factors=10, std_dev_factors=1.0, num_sensitive_features=3, feature_bias=[[0.5, 0.1], [0.0, 0.0], [0.0, 0.0]], num_users_per_propensity=[100, 100]

## Generating the Output Data

In [38]:
lafs = data_gen.DataGen(params, save=False)

In [41]:
lafs.generate_data()
lafs.ratings[:10]

[(0, np.int64(542), np.float64(6.263802103517)),
 (0, np.int64(176), np.float64(6.020054977757706)),
 (0, np.int64(576), np.float64(5.316127896510761)),
 (0, np.int64(874), np.float64(5.084777235609874)),
 (0, np.int64(978), np.float64(4.9007439851817365)),
 (0, np.int64(205), np.float64(4.72650170469997)),
 (0, np.int64(853), np.float64(4.57386193481619)),
 (0, np.int64(222), np.float64(4.475202407950422)),
 (0, np.int64(168), np.float64(4.454761107438119)),
 (0, np.int64(943), np.float64(4.051090163061253))]
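
Each row is a `(user, item, score)` tuple. As a rough illustration of how a consumer might work with this output, the sketch below groups rows by user and keeps the top `n` items for each. The `top_n_per_user` helper is an assumption made for this example, not a `data_gen` function, and the sample rows are the first four rows above with the NumPy scalars written as plain Python numbers.

```python
from collections import defaultdict

# First rows of the output above, as plain Python numbers.
sample_ratings = [
    (0, 542, 6.263802103517),
    (0, 176, 6.020054977757706),
    (0, 576, 5.316127896510761),
    (0, 874, 5.084777235609874),
]

def top_n_per_user(ratings, n):
    """Hypothetical helper: map each user id to its n highest-scoring (item, score) pairs."""
    by_user = defaultdict(list)
    for user, item, score in ratings:
        # int()/float() also accept NumPy scalars such as np.int64 and np.float64.
        by_user[int(user)].append((int(item), float(score)))
    return {
        user: sorted(pairs, key=lambda pair: pair[1], reverse=True)[:n]
        for user, pairs in by_user.items()
    }

top = top_n_per_user(sample_ratings, n=2)
```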

To save the output, uncomment the call below. The file names given in the parameters are used.

In [None]:
# lafs.save_ratings()