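# Creating Artificial Data

## Purpose

- The MovieLens data is non-uniform, so we cannot tell whether an interpretation obtained from it is correct.
- We therefore create uniform artificial data whose true structure is known, and use it to check whether the interpretation is correct.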

## Data format

Customer attributes
- Gender
    - [M, F]
- Age
    - [10, 20, 30, 40, 50, 60]

Item attributes
- Genre
    - [M, E, F]
    - Correlated with gender
- Release year
    - [1960, 1970, 1980, 1990, 2000, 2010]
    - Correlated with age

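- Number of customers
    - 150 * 6 * 2 * 6 = 10800 (customers per segment * sequence lengths * genders * ages)
    - The code below generates only 5 sequence lengths (20 to 100), which gives 150 * 5 * 2 * 6 = 9000.

- Number of items
    - 50 * 3 * 6 = 900 (items per segment * genres * release years)
    - The code below splits each genre into 5 sub-genres, which gives 50 * 15 * 6 = 4500.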

- Movies viewed by each customer
    - Count
        - [0, 20, 40, 60, 80, 100]
        - Chosen uniformly
    - Distribution
        - For gender M
            - Genre ratio
                - M: 0.60, E: 0.30, F: 0.10
        - For age 20
            - Release-year ratio
                - 2000: 0.50, other: 0.10 * 5

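- Customer name: u_id_gender_age_length (the code appends the sequence length, e.g. u_0_M_10_20)
- Item name: v_id_genre_year (e.g. v_0_E5_1990)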
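
The customer and item counts above follow from enumerating every attribute combination. A minimal sketch, assuming the factors 150 and 50 are the number of customers and items per segment (the variable names are illustrative and the values are those listed in these notes, not those used in the code cell further down):

```python
from itertools import product

# Attribute values as listed in the data-format notes.
genders = ["M", "F"]
ages = [10, 20, 30, 40, 50, 60]
seq_lengths = [0, 20, 40, 60, 80, 100]
genres = ["M", "E", "F"]
years = [1960, 1970, 1980, 1990, 2000, 2010]

customers_per_segment = 150  # assumed meaning of the factor 150
items_per_segment = 50       # assumed meaning of the factor 50

# One segment per combination of attribute values.
customer_segments = list(product(genders, ages, seq_lengths))
item_segments = list(product(genres, years))

n_customers = customers_per_segment * len(customer_segments)
n_items = items_per_segment * len(item_segments)
print(n_customers, n_items)  # 10800 900
```

Changing any list (for example dropping the length 0, or using 15 sub-genres) updates the totals automatically.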

In [17]:
from typing import Dict, List, Any
import random

In [79]:
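random.seed(0)  # fixed seed so the generated data is reproducible
user_count_per_segment = 150  # customers per (gender, age, sequence length) segment
item_count_per_segment = 50   # item ids per (genre, year) segment
# Sequence lengths actually generated; the notes above also list 0, which is not generated here.
seq_lengths = [20, 40, 60, 80, 100]
genders = ["M", "F"]
ages = [10, 20, 30, 40, 50, 60]
# genres = ["M", "E", "F"]
# Each genre group (M, E, F) is split into five sub-genres.
genres = ["M1", "M2", "M3", "M4", "M5", "E1", "E2", "E3", "E4", "E5", "F1", "F2", "F3", "F4", "F5"]
base_year = 2020  # the release year favored by a customer is base_year - age
years = [1960, 1970, 1980, 1990, 2000, 2010]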

raw_sequences: Dict[str, List[str]] = {}
users: Dict[str, Dict[str, Any]] = {}
items: Dict[str, Dict[str, Any]] = {}

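def get_user_name(user_id: int, gender: str, age: int, seq_length: int) -> str:
    """Encode the customer's attributes in the name, e.g. u_0_M_10_20."""
    return f"u_{user_id}_{gender}_{age}_{seq_length}"

def get_item_name(item_id: int, genre: str, year: int) -> str:
    """Encode the item's attributes in the name, e.g. v_0_E5_1990."""
    return f"v_{item_id}_{genre}_{year}"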

for gender in genders:
    for age in ages:
        for seq_length in seq_lengths:
            for user_id in range(user_count_per_segment):
                # Genre weights over the 15 sub-genres: the M/E/F groups sum to
                # 0.60/0.30/0.10 for male customers and 0.10/0.30/0.60 for female customers.
                if gender == "M":
                    p = [0.12] * 5 + [0.06] * 5 + [0.02] * 5
                else:
                    p = [0.02] * 5 + [0.06] * 5 + [0.12] * 5

                genre_list = random.choices(genres, p, k=seq_length)

                # The year equal to base_year - age gets weight 0.50; every other year gets 0.10.
                year_weight = [0.50 if y == base_year - age else 0.10 for y in years]
                year_list = random.choices(years, year_weight, k=seq_length)

                # Item ids are drawn uniformly and sorted; genres and years keep their sampled order.
                item_id_list = [random.randint(0, item_count_per_segment - 1) for _ in range(seq_length)]
                item_id_list.sort()

                sequences = [get_item_name(*x) for x in zip(item_id_list, genre_list, year_list)]

                user_name = get_user_name(user_id, gender, age, seq_length)
                raw_sequences[user_name] = sequences

                users[user_name] = {
                    "gender": gender,
                    # "age": age,
                }

for genre in genres:
    for year in years:
        for item_id in range(item_count_per_segment):
            item_name = get_item_name(item_id, genre, year)
            items[item_name] = {
                "genre": genre,
                # "year": year
            }
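
One way to sanity-check a generator like this is to confirm that sampled genres reproduce the intended group ratios. The sketch below is self-contained and does not reuse the notebook's variables: `expand_weights` and `group_shares` are hypothetical helpers, the sample size is arbitrary, and a private `random.Random` instance is used so the notebook's global seed is left untouched.

```python
import random
from collections import Counter
from typing import Dict, List

SUBGENRES_PER_GROUP = 5
# M1..M5, E1..E5, F1..F5, in the same order as the notebook's genre list.
genres = [f"{g}{i}" for g in "MEF" for i in range(1, SUBGENRES_PER_GROUP + 1)]

def expand_weights(group_ratio: Dict[str, float]) -> List[float]:
    """Split each group's ratio evenly across its sub-genres."""
    return [group_ratio[genre[0]] / SUBGENRES_PER_GROUP for genre in genres]

def group_shares(sampled: List[str]) -> Dict[str, float]:
    """Empirical share of each genre group (first letter of the sub-genre)."""
    counts = Counter(genre[0] for genre in sampled)
    return {group: counts[group] / len(sampled) for group in "MEF"}

rng = random.Random(0)
male_ratio = {"M": 0.60, "E": 0.30, "F": 0.10}
weights = expand_weights(male_ratio)  # 0.12, 0.06, 0.02 per sub-genre
sample = rng.choices(genres, weights, k=100_000)
shares = group_shares(sample)
print(shares)  # expected to be close to the target ratios
```

With a sample this large, the shares should land within about a percentage point of 0.60, 0.30, and 0.10; a larger gap would suggest the weights and the genre list are misaligned.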

In [80]:
import pandas as pd

In [81]:
user_df = pd.DataFrame(users.values(), index=users.keys())
item_df = pd.DataFrame(items.values(), index=items.keys())

In [82]:
# Join each customer's item names into one space-separated string.
train_sequences = [" ".join(s) for s in raw_sequences.values()]
train_df = pd.DataFrame(train_sequences, index=raw_sequences.keys(), columns=["sequence"])

In [83]:
user_df.index.name = "user_id"
item_df.index.name = "item_id"
train_df.index.name = "user_id"

In [84]:
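import os

data_dir = "../data/toydata-simple/"
os.makedirs(data_dir, exist_ok=True)  # create the output directory if it does not exist yet
user_df.to_csv(data_dir + "users.csv")
item_df.to_csv(data_dir + "items.csv")
train_df.to_csv(data_dir + "train.csv")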
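
When the files are read back, each row of `train.csv` holds a `user_id` and one space-separated `sequence` string. The following sketch parses that layout with the standard `csv` module and checks that each sequence length matches the length suffix of the customer name. The two rows are made-up stand-ins for real file content, and `load_sequences` is a hypothetical helper, not part of the notebook.

```python
import csv
import io
from typing import Dict, List, TextIO

def load_sequences(handle: TextIO) -> Dict[str, List[str]]:
    """Map each user_id to its list of item names."""
    reader = csv.DictReader(handle)
    return {row["user_id"]: row["sequence"].split() for row in reader}

# Made-up rows in the same column layout as train.csv.
sample_csv = io.StringIO(
    "user_id,sequence\n"
    "u_0_M_10_2,v_0_M1_2010 v_3_E2_1990\n"
    "u_1_F_20_3,v_1_F4_2000 v_1_F2_2000 v_7_E5_1980\n"
)
sequences = load_sequences(sample_csv)

# The last field of the customer name is the sequence length.
for user_id, seq in sequences.items():
    expected_length = int(user_id.rsplit("_", 1)[1])
    assert len(seq) == expected_length, user_id
```

For a real file, the same function can be called inside `with open(path, newline="") as handle:`.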

In [85]:
train_df

| user_id | sequence |
| --- | --- |
| u_0_M_10_20 | v_0_E5_1990 v_4_E3_2010 v_5_M4_2010 v_5_M3_201... |
| u_1_M_10_20 | v_1_F5_1960 v_5_M5_2010 v_7_M3_1970 v_7_M3_197... |
| u_2_M_10_20 | v_0_M3_1970 v_3_E4_1970 v_7_M1_1980 v_10_M2_20... |
| u_3_M_10_20 | v_0_E2_2010 v_2_F1_1970 v_3_E5_1990 v_4_F1_201... |
| u_4_M_10_20 | v_0_F5_2010 v_2_E1_1990 v_8_M1_1960 v_11_E3_19... |
| ... | ... |
| u_145_F_60_100 | v_0_F1_1980 v_0_E5_2000 v_0_F2_1960 v_0_F3_196... |
| u_146_F_60_100 | v_0_F5_1960 v_1_E1_1970 v_1_M2_1960 v_2_F3_196... |
| u_147_F_60_100 | v_0_E2_1970 v_1_M4_2000 v_1_E4_1980 v_2_M5_196... |
| u_148_F_60_100 | v_0_F3_1990 v_0_F5_1960 v_1_F2_2000 v_2_F5_199... |
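
Because every attribute is embedded in the names, the ground truth for any row can be recovered by splitting on underscores. A small sketch, with `UserInfo`, `ItemInfo`, and the two parse functions introduced here purely for illustration:

```python
from typing import NamedTuple

class UserInfo(NamedTuple):
    user_id: int
    gender: str
    age: int
    seq_length: int

class ItemInfo(NamedTuple):
    item_id: int
    genre: str
    year: int

def parse_user_name(name: str) -> UserInfo:
    """Decode a name of the form u_{id}_{gender}_{age}_{seq_length}."""
    _, user_id, gender, age, seq_length = name.split("_")
    return UserInfo(int(user_id), gender, int(age), int(seq_length))

def parse_item_name(name: str) -> ItemInfo:
    """Decode a name of the form v_{id}_{genre}_{year}."""
    _, item_id, genre, year = name.split("_")
    return ItemInfo(int(item_id), genre, int(year))

user = parse_user_name("u_145_F_60_100")
item = parse_item_name("v_0_F1_1980")
print(user)  # UserInfo(user_id=145, gender='F', age=60, seq_length=100)
print(item)  # ItemInfo(item_id=0, genre='F1', year=1980)
```

This keeps the attributes that are commented out in the `users` and `items` dictionaries (age and year) recoverable from the names alone.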
