# 1. Data preparation

# MIND dataset
 * Microsoft News Dataset (MIND) is a large-scale dataset for news recommendation research
 * The training data contains feedback for full slates displayed to users; it was captured during the first 6 days of the 5th week of the data collection period
 * Additionally, the training data contains each user's interaction history with topics other than news
 * [distribution page](https://msnews.github.io/)
 * [dataset description](https://github.com/msnews/msnews.github.io/blob/master/assets/doc/introduction.md)
 * We will work with the smallest variant available

# MIND dataset preprocessing
 - The following notebook processes the MIND dataset to demonstrate a typical production recommender-system setting
 - A typical Seznam.cz user consumes content from many sources; each source may be a separate recommender system or a system without any machine learning inside:
   - [feed recommendation](https://www.seznam.cz/)
   - [news recommendation](https://www.novinky.cz/zena/clanek/nicolas-cage-bude-znovu-otcem-popate-zenaty-herec-ceka-dalsiho-potomka-40391884#dop_ab_variant=0&dop_source_zone_name=novinky.web.nexttoart&dop_req_id=CyFCCuy363P-202204061157&dop_id=40391884)
   - [fulltext search](https://search.seznam.cz/?q=tesla&oq=tesla&aq=-1&sourceid=szn-HP&ks=7&ms=1348&sgId=MC40ODY5MjE3MTUzMjc3ODE1IDE2NDkyNDYzNzEuMDY3)
 - A typical recommender system works with a limited amount of content (e.g. only news articles), yet users consume a vast amount of content beyond the particular recommendation task (e.g. fulltext search). How can we take advantage of such data?
 - Every article in the MIND dataset has a specific category. In our recommendation setting we will try to recommend the best article from the category `news` and feed the other types of articles to our model as separate, additional input
 - We will refer to the category `news` as the cold-start category and to the newly generated dataset as the cold-start dataset, because it will contain an abundance of users with very little or no interaction history
 - We will also avoid item cold-start: all articles that did not occur in the training dataset will be removed
 - As a result we will obtain the following files:
   - `behaviors_train.tsv`: training data for predicting articles from the `news` category
     - `slateid` - Slate ID.
     - `userid` - The anonymous ID of a user.
     - `time` - The impression time in the format "MM/DD/YYYY HH:MM:SS AM/PM".
     - `impressions` - List of news displayed in this slate and the user's click behavior on them (1 for click and 0 for non-click). The order of news in a slate has been shuffled.
     - `history` - The news click history (ID list of clicked news) of this user before this slate was displayed. The clicked news articles are ordered by time and come only from the category `news`.
     - `history_all_categories` - History of news categories visited by this user before this slate was displayed. Visited categories are ordered by time.
     - `history_all_subcategories` - History of news subcategories visited by this user before this slate was displayed. Visited subcategories are ordered by time.
     - `history_all` - The news click history (ID list of clicked news) of this user before this slate was displayed. The clicked news articles are ordered by time and contain news from all categories.
     - `history_all_title` - Titles of articles from the history, delimited by `;`.
   - `behaviors_test.tsv`: testing data for predicting articles from the `news` category
     - `slateid` - Slate ID.
     - `userid` - The anonymous ID of a user.
     - `time` - The impression time in the format "MM/DD/YYYY HH:MM:SS AM/PM".
     - `impressions` - List of news displayed in this slate and the user's click behavior on them (1 for click and 0 for non-click). The order of news in a slate has been shuffled.
     - `history` - The news click history (ID list of clicked news) of this user before this slate was displayed. The clicked news articles are ordered by time and come only from the category `news`.
     - `history_all_categories` - History of news categories visited by this user before this slate was displayed. Visited categories are ordered by time.
     - `history_all_subcategories` - History of news subcategories visited by this user before this slate was displayed. Visited subcategories are ordered by time.
     - `history_all` - The news click history (ID list of clicked news) of this user before this slate was displayed. The clicked news articles are ordered by time and contain news from all categories.
     - `history_all_title` - Titles of articles from the history, delimited by `;`.
   - `news_catalogue_train.tsv`: article data filtered to the category `news` only
   - `auxiliary_data_catalogue_train.tsv`: data for all articles from the training set
   - `categories.tsv`: list of all available article categories
   - `subcategories.tsv`: list of all available article subcategories
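
To make the column descriptions above concrete, here is a minimal sketch of how the `impressions` and `time` fields could be parsed. It assumes the raw MIND serialization (space-separated `newsid-label` tokens), which the generated files may or may not preserve; the helper names and the example values are illustrative only.

```python
from datetime import datetime


def parse_impressions(impressions: str) -> list[tuple[str, int]]:
    """Split 'N1-1 N2-0' into [('N1', 1), ('N2', 0)] (assumed raw MIND format)."""
    pairs = []
    for token in impressions.split():
        # the label is whatever follows the last dash of the token
        news_id, label = token.rsplit("-", 1)
        pairs.append((news_id, int(label)))
    return pairs


def parse_time(value: str) -> datetime:
    """Parse the 'MM/DD/YYYY HH:MM:SS AM/PM' impression time."""
    return datetime.strptime(value, "%m/%d/%Y %I:%M:%S %p")


# illustrative values, not taken from the generated files
example_pairs = parse_impressions("N55689-1 N35729-0")
example_time = parse_time("11/13/2019 8:36:57 AM")
clicked = [news_id for news_id, label in example_pairs if label == 1]
```

Keeping the parsing in small functions makes it easy to apply them column-wise with `Series.map` once the real files are loaded.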









In [None]:
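# mount Google Drive when running in Colab; otherwise use the parent directory
try:
    from google.colab import drive

    drive.mount('/content/gdrive')
    BASE_DIR = "/content/gdrive/MyDrive/mlprague2022"
    IN_COLAB = True
except ImportError:
    # google.colab is not installed, so we are running outside Colab
    BASE_DIR = ".."
    IN_COLAB = False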

In [None]:
# install processing functionality from github repository
!pip install git+https://github.com/seznam/MLPrague-2022.git

# Load and transform MIND dataset for cold-start scenario

In [None]:
# import necessary functionality
from collections import Counter

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from mlprague22.dataset import transform_behaviors_to_coldstart

In [None]:
COLD_START_CATEGORY = "news"
MIND_DATA_SOURCE_DIR = "tmp/mind"
ORIGINAL_TRAIN_INPUT_DIR = os.path.join(BASE_DIR, MIND_DATA_SOURCE_DIR, "train/")
ORIGINAL_TEST_INPUT_DIR = os.path.join(BASE_DIR, MIND_DATA_SOURCE_DIR, "test/")
OUTPUT_DIR = os.path.join(BASE_DIR, "data/mind_cold_start_datasets_basic/")

COLD_START_BEHAVIORS_TRAIN = os.path.join(OUTPUT_DIR, "behaviors_train.tsv")
COLD_START_BEHAVIORS_TEST = os.path.join(OUTPUT_DIR, "behaviors_test.tsv")
NEWS_CATALOGUE_TRAIN = os.path.join(OUTPUT_DIR, "news_catalogue_train.tsv")
NEWS_CATALOGUE_TEST = os.path.join(OUTPUT_DIR, "news_catalogue_test.tsv")
AUXILIARY_DATA_CATALOGUE_TRAIN = os.path.join(OUTPUT_DIR, "auxiliary_data_catalogue_train.tsv")
AUXILIARY_DATA_CATALOGUE_TEST = os.path.join(OUTPUT_DIR, "auxiliary_data_catalogue_test.tsv")
ALL_CATEGORIES_PATH = os.path.join(OUTPUT_DIR, "categories.tsv")
ALL_SUBCATEGORIES_PATH = os.path.join(OUTPUT_DIR, "subcategories.tsv")

In [None]:
! mkdir -p $BASE_DIR
! mkdir -p $MIND_DATA_SOURCE_DIR

## Install deps, download and unzip original dataset

In [None]:
! apt update && apt install -y unzip

! mkdir -p $ORIGINAL_TRAIN_INPUT_DIR
! mkdir -p $ORIGINAL_TEST_INPUT_DIR

! wget https://mind201910small.blob.core.windows.net/release/MINDsmall_train.zip -O $MIND_DATA_SOURCE_DIR/MINDsmall_train.zip
! wget https://mind201910small.blob.core.windows.net/release/MINDsmall_dev.zip -O $MIND_DATA_SOURCE_DIR/MINDsmall_dev.zip

! unzip -o $MIND_DATA_SOURCE_DIR/MINDsmall_train.zip -d $ORIGINAL_TRAIN_INPUT_DIR
! unzip -o $MIND_DATA_SOURCE_DIR/MINDsmall_dev.zip -d $ORIGINAL_TEST_INPUT_DIR

! rm $MIND_DATA_SOURCE_DIR/MINDsmall_train.zip
! rm $MIND_DATA_SOURCE_DIR/MINDsmall_dev.zip

!mkdir -p $OUTPUT_DIR
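
On machines without `apt`, `wget` or `unzip`, the same download-and-extract step can be sketched with the Python standard library alone. The helper `download_and_extract` below is a hypothetical name, not part of the `mlprague22` package, and this is only one possible approach.

```python
import tempfile
import urllib.request
import zipfile
from pathlib import Path


def download_and_extract(url: str, target_dir: str) -> list[str]:
    """Download a zip archive from `url` and extract it into `target_dir`.

    Returns the names of the archive members that were extracted.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory() as tmp:
        archive_path = Path(tmp) / "archive.zip"
        # the temporary archive is deleted when the context manager exits
        urllib.request.urlretrieve(url, archive_path)
        with zipfile.ZipFile(archive_path) as archive:
            archive.extractall(target)
            return archive.namelist()


# Possible usage with the URLs and directories defined in this notebook:
# download_and_extract(
#     "https://mind201910small.blob.core.windows.net/release/MINDsmall_train.zip",
#     ORIGINAL_TRAIN_INPUT_DIR,
# )
```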

## Load and inspect original data

In [None]:
behaviors_train = pd.read_csv(
    os.path.join(ORIGINAL_TRAIN_INPUT_DIR, "behaviors.tsv"),
    sep="\t",
    names=["slateid", "userid", "time", "history", "impressions"]
)

behaviors_train.info()
behaviors_train

In [None]:
behaviors_test = pd.read_csv(
    os.path.join(ORIGINAL_TEST_INPUT_DIR, "behaviors.tsv"),
    sep="\t",
    names=["slateid", "userid", "time", "history", "impressions"]
)

behaviors_test.info()
behaviors_test

In [None]:
news_train = pd.read_csv(
    os.path.join(ORIGINAL_TRAIN_INPUT_DIR, "news.tsv"),
    sep="\t",
    names=["newsid", "category", "subcategory", "title", "abstract", "url", "title_entities", "abstract_entities"]
)

news_train.info()
news_train

In [None]:
news_train.category.unique()

## Transform datasets to cold-start
 - keep only articles with the `news` category in the `history` and `impressions` columns
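
The filtering idea can be illustrated on toy data. This is a simplified sketch, not the actual implementation of `transform_behaviors_to_coldstart`; it assumes space-separated IDs in `history` and `newsid-label` tokens in `impressions`, and all helper names and toy values are made up for illustration.

```python
import pandas as pd


def keep_allowed_history(history: str, allowed: set[str]) -> str:
    """Keep only the news IDs present in `allowed`, preserving their order."""
    return " ".join(i for i in history.split() if i in allowed)


def keep_allowed_impressions(impressions: str, allowed: set[str]) -> str:
    """Keep only the 'newsid-label' tokens whose news ID is in `allowed`."""
    return " ".join(
        t for t in impressions.split() if t.rsplit("-", 1)[0] in allowed
    )


# toy catalogue and behaviors (hypothetical values)
toy_news = pd.DataFrame(
    {"newsid": ["N1", "N2", "N3"], "category": ["news", "sports", "news"]}
)
toy_behaviors = pd.DataFrame(
    {
        "history": ["N1 N2 N3", "N2"],
        "impressions": ["N1-1 N2-0 N9-0", "N3-0 N2-1"],
    }
)

# IDs of catalogue articles in the cold-start category
allowed_ids = set(toy_news.loc[toy_news["category"] == "news", "newsid"])

toy_filtered = toy_behaviors.assign(
    history=toy_behaviors["history"].map(
        lambda h: keep_allowed_history(h, allowed_ids)
    ),
    impressions=toy_behaviors["impressions"].map(
        lambda s: keep_allowed_impressions(s, allowed_ids)
    ),
)
```

Note that `N9`, which is missing from the toy catalogue, is dropped as well; looking IDs up in the training catalogue is one way to avoid item cold-start.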

### Transform train dataset

In [None]:
behaviors_train_ex = transform_behaviors_to_coldstart(behaviors_train, news_train, COLD_START_CATEGORY)

In [None]:
behaviors_train_ex.head(5)

In [None]:
behaviors_train_ex[
    [
        "slateid",
        "userid",
        "time",
        "history",
        "impressions",
        "history_all_categories",
        "history_all_subcategories",
        "history_all",
    ]
].to_csv(COLD_START_BEHAVIORS_TRAIN, sep="\t", index=False)

### Transform test dataset

In [None]:
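# news_train is passed here as well, so the test data is filtered against the
# training catalogue (see the item cold-start note in the introduction)
behaviors_test_ex = transform_behaviors_to_coldstart(
    behaviors_test, news_train, COLD_START_CATEGORY
)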

In [None]:
behaviors_test_ex.head(5)

In [None]:
behaviors_test_ex[
    [
        "slateid",
        "userid",
        "time",
        "history",
        "impressions",
        "history_all_categories",
        "history_all_subcategories",
        "history_all",
    ]
].to_csv(COLD_START_BEHAVIORS_TEST, sep="\t", index=False)
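
To check how pronounced the user cold-start is, one could measure the share of slates whose `history` is empty or very short. The sketch below runs on toy data and assumes `history` is a space-separated string, with missing values for users without history; the actual column type produced by the transformation may differ.

```python
import pandas as pd


def history_length(history) -> int:
    """Number of news IDs in a history value; missing values count as 0."""
    if not isinstance(history, str):
        return 0
    return len(history.split())


# toy data (hypothetical values)
toy_slates = pd.DataFrame(
    {
        "userid": ["U1", "U2", "U3", "U4"],
        "history": ["N1 N2 N3", "", None, "N4"],
    }
)

lengths = toy_slates["history"].map(history_length)
# share of slates shown to users with at most one news click in their history
share_cold = (lengths <= 1).mean()
```

The same two lines at the end could be applied to `behaviors_train_ex` or `behaviors_test_ex` if their `history` column follows the assumed format.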

# Save `news` data as a news-only main catalogue and an auxiliary catalogue with all articles

In [None]:
news_train.query("category == @COLD_START_CATEGORY").to_csv(
    NEWS_CATALOGUE_TRAIN, sep="\t", index=False
)
news_train.to_csv(
    AUXILIARY_DATA_CATALOGUE_TRAIN, sep="\t", index=False
)

# Extract all unique [sub]categories

In [None]:
categories_pd = pd.DataFrame(
    list(enumerate(sorted(news_train.category.unique().tolist()))),
    columns=["order", "category"]
)

In [None]:
categories_pd

In [None]:
categories_pd.to_csv(ALL_CATEGORIES_PATH, sep="\t")

In [None]:
subcategories_pd = pd.DataFrame(
    list(enumerate(sorted(news_train.subcategory.unique().tolist()))),
    columns=["order", "subcategory"]
)

In [None]:
subcategories_pd

In [None]:
subcategories_pd.to_csv(ALL_SUBCATEGORIES_PATH, sep="\t")
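
Because the files are written together with the DataFrame index, reading them back needs `index_col=0`. The sketch below shows a round trip on toy categories through an in-memory buffer and builds a category-to-order lookup; the toy values and variable names are illustrative.

```python
import io

import pandas as pd

# toy categories built the same way as categories_pd above
toy_categories = pd.DataFrame(
    list(enumerate(sorted(["sports", "news", "finance"]))),
    columns=["order", "category"],
)

# write to an in-memory buffer instead of ALL_CATEGORIES_PATH
buffer = io.StringIO()
toy_categories.to_csv(buffer, sep="\t")
buffer.seek(0)

# index_col=0 restores the index column written by to_csv
loaded = pd.read_csv(buffer, sep="\t", index_col=0)
category_to_order = dict(zip(loaded["category"], loaded["order"]))
```

Such a lookup is handy when categories need to be encoded as integer IDs for a model.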