# MIND dataset
 * Microsoft News Dataset (MIND) is a large-scale dataset for news recommendation research
 * The training data contains feedback for full slates displayed to users; it was captured during the first 6 days of the 5th week
 * Additionally, the training data contains the history of user interactions with topics other than news
 * [distribution page](https://msnews.github.io/)
 * [dataset description](https://github.com/msnews/msnews.github.io/blob/master/assets/doc/introduction.md)
 * we will work with the smallest variant available

# MIND dataset preprocessing
 - The following notebook processes the MIND dataset to demonstrate a typical production recommender system setting
 - A typical Seznam.cz user consumes content from many sources; each source might be a separate recommender system or a system without any machine learning inside:
   - [feed recommendation](https://www.seznam.cz/)
   - [news recommendation](https://www.novinky.cz/zena/clanek/nicolas-cage-bude-znovu-otcem-popate-zenaty-herec-ceka-dalsiho-potomka-40391884#dop_ab_variant=0&dop_source_zone_name=novinky.web.nexttoart&dop_req_id=CyFCCuy363P-202204061157&dop_id=40391884)
   - [fulltext search](https://search.seznam.cz/?q=tesla&oq=tesla&aq=-1&sourceid=szn-HP&ks=7&ms=1348&sgId=MC40ODY5MjE3MTUzMjc3ODE1IDE2NDkyNDYzNzEuMDY3)
 - A typical recommender system works with a limited amount of content (e.g. only news articles), yet there is a vast amount of consumed content beyond the particular recommendation task (e.g. fulltext search). How can we take advantage of such data?
 - Every article in the MIND dataset has a specific category. In our recommendation setting we will try to recommend the best article from the category `news` and feed the other types of articles to our model as a separate, additional input
 - We will refer to the category `news` as the cold-start category and to the newly generated dataset as a cold-start dataset, because there will be an abundance of users with very little or no interaction history
 - We will also avoid item cold-start: all articles that did not occur in the training dataset will be removed
 - As a result we will obtain the following files:
  - `behaviors_train.tsv`: training data for predicting articles in the `news` category
    - `slateid` - Slate id.
    - `userid` - The anonymous ID of a user.
    - `time` - The impression time with format "MM/DD/YYYY HH:MM:SS AM/PM".
    - `impressions` - List of news displayed in this slate and user's click behaviors on them (1 for click and 0 for non-click). The orders of news in a slate have been shuffled.
    - `history` - The news click history (ID list of clicked news) of this user before this slate was displayed. The clicked news articles are ordered by time and are only from category `news`.
    - `history_all_categories` - Visited news category history of this user before this slate was displayed. Visited categories are ordered by time.
    - `history_all_subcategories` - Visited news subcategories history of this user before this slate was displayed. Visited subcategories are ordered by time.
    - `history_all` - The news click history (ID list of clicked news) of this user before this impression. The clicked news articles are ordered by time and contain news from all categories.
    - `history_all_title` - Titles of articles from history delimited by `;`
   - `behaviors_test.tsv`: testing data for predicting articles in the `news` category
       - `slateid` - Slate id.
       - `userid` - The anonymous ID of a user.
       - `time` - The impression time with format "MM/DD/YYYY HH:MM:SS AM/PM".
       - `impressions` - List of news displayed in this slate and user's click behaviors on them (1 for click and 0 for non-click). The orders of news in a slate have been shuffled.
       - `history` - The news click history (ID list of clicked news) of this user before this slate was displayed. The clicked news articles are ordered by time and are only from category `news`.
       - `history_all_categories` - Visited news category history of this user before this slate was displayed. Visited categories are ordered by time.
       - `history_all_subcategories` - Visited news subcategories history of this user before this slate was displayed. Visited subcategories are ordered by time.
       - `history_all` - The news click history (ID list of clicked news) of this user before this impression. The clicked news articles are ordered by time and contain news from all categories.
       - `history_all_title` - Titles of articles from history delimited by `;`
   - `news_catalogue_train.tsv`: article data filtered to the category `news` only
   - `auxiliary_data_catalogue_train.tsv`: article data for all articles from the training set
   - `categories.tsv`: list of all available article categories
   - `subcategories.tsv`: list of all available article subcategories
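
The `time` and `impressions` formats described above can be decoded with a few lines of standard-library Python. This is only a minimal sketch and not part of the `mlprague22` package: `parse_impressions` is a hypothetical helper, and the sample row simply reuses values in the documented format.

```python
from datetime import datetime
from typing import List, Tuple


def parse_impressions(impressions: str) -> List[Tuple[str, int]]:
    """Split a string like 'N55689-1 N35729-0' into (news_id, clicked) pairs."""
    pairs = []
    for token in impressions.split():
        # The click label follows the last '-' in each token.
        news_id, clicked = token.rsplit("-", 1)
        pairs.append((news_id, int(clicked)))
    return pairs


# Illustrative row in the documented format.
row = {"time": "11/11/2019 9:05:58 AM", "impressions": "N55689-1 N35729-0"}

parsed = parse_impressions(row["impressions"])
# "MM/DD/YYYY HH:MM:SS AM/PM" corresponds to this strptime pattern.
timestamp = datetime.strptime(row["time"], "%m/%d/%Y %I:%M:%S %p")

print(parsed)
print(timestamp.isoformat())
```

Splitting on the last `-` keeps the parser independent of how the article ID itself is written.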









In [1]:
# install processing functionality from github repository
!pip install git+https://github.com/seznam/MLPrague-2022.git

Collecting git+https://github.com/seznam/MLPrague-2022.git
  Cloning https://github.com/seznam/MLPrague-2022.git to /tmp/pip-req-build-dfnugvv5
  Running command git clone -q https://github.com/seznam/MLPrague-2022.git /tmp/pip-req-build-dfnugvv5
Building wheels for collected packages: mlprague22
  Building wheel for mlprague22 (setup.py) ... done
  Created wheel for mlprague22: filename=mlprague22-0.0.0-py3-none-any.whl size=4390 sha256=9d80ec3266bd4514e9d48407af9cebf4b2423937b3cdb89d0abcaf71ed40c4af
  Stored in directory: /tmp/pip-ephem-wheel-cache-op9dtuo7/wheels/8e/30/46/f600aaa9e010eb66abaae33828e5ee3596fcdccb523d593440
Successfully built mlprague22
Installing collected packages: mlprague22
Successfully installed mlprague22-0.0.0


# Load and transform MIND dataset for cold-start scenario

In [2]:
# import necessary functionality
try:
    from google.colab import drive

    drive.mount('/content/gdrive')
    BASE_DIR = "/content/gdrive/MyDrive/mlprague2022"
    IN_COLAB = True
except ImportError:  # not running in Google Colab
    BASE_DIR = ".."
    IN_COLAB = False

from collections import Counter

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from mlprague22.dataset import transform_behaviors_to_coldstart

Mounted at /content/gdrive


In [3]:
COLD_START_CATEGORY = "news"
MIND_DATA_SOURCE_DIR = "tmp/mind"
ORIGINAL_TRAIN_INPUT_DIR = os.path.join(BASE_DIR, MIND_DATA_SOURCE_DIR, "train/")
ORIGINAL_TEST_INPUT_DIR = os.path.join(BASE_DIR, MIND_DATA_SOURCE_DIR, "test/")
OUTPUT_DIR = os.path.join(BASE_DIR, "data/mind_cold_start_datasets_basic/")

COLD_START_BEHAVIORS_TRAIN = os.path.join(OUTPUT_DIR, "behaviors_train.tsv")
COLD_START_BEHAVIORS_TEST = os.path.join(OUTPUT_DIR, "behaviors_test.tsv")
NEWS_CATALOGUE_TRAIN = os.path.join(OUTPUT_DIR, "news_catalogue_train.tsv")
NEWS_CATALOGUE_TEST = os.path.join(OUTPUT_DIR, "news_catalogue_test.tsv")
AUXILIARY_DATA_CATALOGUE_TRAIN = os.path.join(OUTPUT_DIR, "auxiliary_data_catalogue_train.tsv")
AUXILIARY_DATA_CATALOGUE_TEST = os.path.join(OUTPUT_DIR, "auxiliary_data_catalogue_test.tsv")
ALL_CATEGORIES_PATH = os.path.join(OUTPUT_DIR, "categories.tsv")
ALL_SUBCATEGORIES_PATH = os.path.join(OUTPUT_DIR, "subcategories.tsv")

In [4]:
! mkdir -p $BASE_DIR
! mkdir -p $MIND_DATA_SOURCE_DIR

## Install deps, download and unzip original dataset

In [5]:
! apt update && apt install -y unzip

! mkdir -p $ORIGINAL_TRAIN_INPUT_DIR
! mkdir -p $ORIGINAL_TEST_INPUT_DIR

! wget https://mind201910small.blob.core.windows.net/release/MINDsmall_train.zip -O $MIND_DATA_SOURCE_DIR/MINDsmall_train.zip
! wget https://mind201910small.blob.core.windows.net/release/MINDsmall_dev.zip -O $MIND_DATA_SOURCE_DIR/MINDsmall_dev.zip

! unzip -o $MIND_DATA_SOURCE_DIR/MINDsmall_train.zip -d $ORIGINAL_TRAIN_INPUT_DIR
! unzip -o $MIND_DATA_SOURCE_DIR/MINDsmall_dev.zip -d $ORIGINAL_TEST_INPUT_DIR

! rm $MIND_DATA_SOURCE_DIR/MINDsmall_train.zip
! rm $MIND_DATA_SOURCE_DIR/MINDsmall_dev.zip

!mkdir -p $OUTPUT_DIR

Get:1 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease [3,626 B]
Ign:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
Get:3 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Get:4 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease [15.9 kB]
Hit:5 http://archive.ubuntu.com/ubuntu bionic InRelease
Ign:6 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  InRelease
Get:7 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Release [696 B]
Get:8 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Hit:9 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Release
Get:10 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Release.gpg [836 B]
Hit:11 http://ppa.launchpad.net/cran/libgit2/ubuntu bionic InRelease
Get:12 https://cloud.r-project.org/bin/linux/ubuntu bi

## Load and inspect original data

In [6]:
behaviors_train = pd.read_csv(
    os.path.join(ORIGINAL_TRAIN_INPUT_DIR, "behaviors.tsv"),
    sep="\t",
    names=["slateid", "userid", "time", "history", "impressions"]
)

behaviors_train.info()
behaviors_train

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 156965 entries, 0 to 156964
Data columns (total 5 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   slateid      156965 non-null  int64 
 1   userid       156965 non-null  object
 2   time         156965 non-null  object
 3   history      153727 non-null  object
 4   impressions  156965 non-null  object
dtypes: int64(1), object(4)
memory usage: 6.0+ MB


Unnamed: 0,slateid,userid,time,history,impressions
0,1,U13740,11/11/2019 9:05:58 AM,N55189 N42782 N34694 N45794 N18445 N63302 N104...,N55689-1 N35729-0
1,2,U91836,11/12/2019 6:11:30 PM,N31739 N6072 N63045 N23979 N35656 N43353 N8129...,N20678-0 N39317-0 N58114-0 N20495-0 N42977-0 N...
2,3,U73700,11/14/2019 7:01:48 AM,N10732 N25792 N7563 N21087 N41087 N5445 N60384...,N50014-0 N23877-0 N35389-0 N49712-0 N16844-0 N...
3,4,U34670,11/11/2019 5:28:05 AM,N45729 N2203 N871 N53880 N41375 N43142 N33013 ...,N35729-0 N33632-0 N49685-1 N27581-0
4,5,U8125,11/12/2019 4:11:21 PM,N10078 N56514 N14904 N33740,N39985-0 N36050-0 N16096-0 N8400-1 N22407-0 N6...
...,...,...,...,...,...
156960,156961,U21593,11/14/2019 10:24:05 PM,N7432 N58559 N1954 N43353 N14343 N13008 N28833...,N2235-0 N22975-0 N64037-0 N47652-0 N11378-0 N4...
156961,156962,U10123,11/13/2019 6:57:04 AM,N9803 N104 N24462 N57318 N55743 N40526 N31726 ...,N3841-0 N61571-0 N58813-0 N28213-0 N4428-0 N25...
156962,156963,U75630,11/14/2019 10:58:13 AM,N29898 N59704 N4408 N9803 N53644 N26103 N812 N...,N55913-0 N62318-0 N53515-0 N10960-0 N9135-0 N5...
156963,156964,U44625,11/13/2019 2:57:02 PM,N4118 N47297 N3164 N43295 N6056 N38747 N42973 ...,N6219-0 N3663-0 N31147-0 N58363-0 N4107-0 N457...


In [7]:
behaviors_test = pd.read_csv(
    os.path.join(ORIGINAL_TEST_INPUT_DIR, "behaviors.tsv"),
    sep="\t",
    names=["slateid", "userid", "time", "history", "impressions"]
)

behaviors_test.info()
behaviors_test

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73152 entries, 0 to 73151
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   slateid      73152 non-null  int64 
 1   userid       73152 non-null  object
 2   time         73152 non-null  object
 3   history      70938 non-null  object
 4   impressions  73152 non-null  object
dtypes: int64(1), object(4)
memory usage: 2.8+ MB


Unnamed: 0,slateid,userid,time,history,impressions
0,1,U80234,11/15/2019 12:37:50 PM,N55189 N46039 N51741 N53234 N11276 N264 N40716...,N28682-0 N48740-0 N31958-1 N34130-0 N6916-0 N5...
1,2,U60458,11/15/2019 7:11:50 AM,N58715 N32109 N51180 N33438 N54827 N28488 N611...,N20036-0 N23513-1 N32536-0 N46976-0 N35216-0 N...
2,3,U44190,11/15/2019 9:55:12 AM,N56253 N1150 N55189 N16233 N61704 N51706 N5303...,N36779-0 N62365-0 N58098-0 N5472-0 N13408-0 N5...
3,4,U87380,11/15/2019 3:12:46 PM,N63554 N49153 N28678 N23232 N43369 N58518 N444...,N6950-0 N60215-0 N6074-0 N11930-0 N6916-0 N248...
4,5,U9444,11/15/2019 8:25:46 AM,N51692 N18285 N26015 N22679 N55556,N5940-1 N23513-0 N49285-0 N23355-0 N19990-0 N3...
...,...,...,...,...,...
73147,73148,U77536,11/15/2019 8:40:16 PM,N28691 N8845 N58434 N37120 N22185 N60033 N4702...,N496-0 N35159-0 N59856-0 N13270-0 N47213-0 N26...
73148,73149,U56193,11/15/2019 1:11:26 PM,N4705 N58782 N53531 N46492 N26026 N28088 N3109...,N49285-0 N31958-0 N55237-0 N42844-0 N29862-0 N...
73149,73150,U16799,11/15/2019 3:37:06 PM,N40826 N42078 N15670 N15295 N64536 N46845 N52294,N7043-0 N512-0 N60215-1 N45057-0 N496-0 N37055...
73150,73151,U8786,11/15/2019 8:29:26 AM,N3046 N356 N20483 N46107 N44598 N18693 N8254 N...,N23692-0 N19990-0 N20187-0 N5940-0 N13408-0 N3...


In [8]:
news_train = pd.read_csv(
    os.path.join(ORIGINAL_TRAIN_INPUT_DIR, "news.tsv"),
    sep="\t",
    names=["newsid", "category", "subcategory", "title", "abstract", "url", "title_entities", "abstract_entities"]
)

news_train.info()
news_train

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51282 entries, 0 to 51281
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   newsid             51282 non-null  object
 1   category           51282 non-null  object
 2   subcategory        51282 non-null  object
 3   title              51282 non-null  object
 4   abstract           48616 non-null  object
 5   url                51282 non-null  object
 6   title_entities     51279 non-null  object
 7   abstract_entities  51278 non-null  object
dtypes: object(8)
memory usage: 3.1+ MB


Unnamed: 0,newsid,category,subcategory,title,abstract,url,title_entities,abstract_entities
0,N55528,lifestyle,lifestyleroyals,"The Brands Queen Elizabeth, Prince Charles, an...","Shop the notebooks, jackets, and more that the...",https://assets.msn.com/labs/mind/AAGH0ET.html,"[{""Label"": ""Prince Philip, Duke of Edinburgh"",...",[]
1,N19639,health,weightloss,50 Worst Habits For Belly Fat,These seemingly harmless habits are holding yo...,https://assets.msn.com/labs/mind/AAB19MK.html,"[{""Label"": ""Adipose tissue"", ""Type"": ""C"", ""Wik...","[{""Label"": ""Adipose tissue"", ""Type"": ""C"", ""Wik..."
2,N61837,news,newsworld,The Cost of Trump's Aid Freeze in the Trenches...,Lt. Ivan Molchanets peeked over a parapet of s...,https://assets.msn.com/labs/mind/AAJgNsz.html,[],"[{""Label"": ""Ukraine"", ""Type"": ""G"", ""WikidataId..."
3,N53526,health,voices,I Was An NBA Wife. Here's How It Affected My M...,"I felt like I was a fraud, and being an NBA wi...",https://assets.msn.com/labs/mind/AACk2N6.html,[],"[{""Label"": ""National Basketball Association"", ..."
4,N38324,health,medical,"How to Get Rid of Skin Tags, According to a De...","They seem harmless, but there's a very good re...",https://assets.msn.com/labs/mind/AAAKEkt.html,"[{""Label"": ""Skin tag"", ""Type"": ""C"", ""WikidataI...","[{""Label"": ""Skin tag"", ""Type"": ""C"", ""WikidataI..."
...,...,...,...,...,...,...,...,...
51277,N16909,weather,weathertopstories,"Adapting, Learning And Soul Searching: Reflect...",Woolsey Fire Anniversary: A community is forev...,https://assets.msn.com/labs/mind/BBWzQJK.html,"[{""Label"": ""Woolsey Fire"", ""Type"": ""N"", ""Wikid...","[{""Label"": ""Woolsey Fire"", ""Type"": ""N"", ""Wikid..."
51278,N47585,lifestyle,lifestylefamily,Family says 13-year-old Broadway star died fro...,,https://assets.msn.com/labs/mind/BBWzQYV.html,"[{""Label"": ""Broadway theatre"", ""Type"": ""F"", ""W...",[]
51279,N7482,sports,more_sports,St. Dominic soccer player tries to kick cancer...,"Sometimes, what happens on the sidelines can b...",https://assets.msn.com/labs/mind/BBWzQnK.html,[],[]
51280,N34418,sports,soccer_epl,How the Sounders won MLS Cup,"Mark, Jeremiah and Casey were so excited they ...",https://assets.msn.com/labs/mind/BBWzQuK.html,"[{""Label"": ""MLS Cup"", ""Type"": ""U"", ""WikidataId...",[]


In [9]:
news_train.category.unique()

array(['lifestyle', 'health', 'news', 'sports', 'weather',
       'entertainment', 'autos', 'travel', 'foodanddrink', 'tv',
       'finance', 'movies', 'video', 'music', 'kids', 'middleeast',
       'northamerica'], dtype=object)

## Transform datasets to cold-start
 - keep only articles with the `news` category in the `history` and `impressions` columns
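
To make the filtering step concrete, the sketch below applies the same idea to a toy dataset. It is an assumption about what such a transformation might look like, not the actual implementation of `transform_behaviors_to_coldstart`; the helpers `keep_history` and `keep_impressions` and all article IDs are hypothetical.

```python
import pandas as pd

# Toy catalogue and behaviors; IDs and categories are made up.
news = pd.DataFrame({
    "newsid": ["N1", "N2", "N3"],
    "category": ["news", "sports", "news"],
})
behaviors = pd.DataFrame({
    "history": ["N1 N2 N3", "N2", None],
    "impressions": ["N1-0 N2-1", "N3-1 N2-0", "N9-0 N1-1"],
})

# Articles of the cold-start category that are known from the catalogue.
allowed = set(news.loc[news["category"] == "news", "newsid"])


def keep_history(history):
    """Keep only article IDs in `allowed`; a missing history becomes ''."""
    if not isinstance(history, str):
        return ""
    return " ".join(n for n in history.split() if n in allowed)


def keep_impressions(impressions):
    """Keep only 'ID-label' tokens whose ID is in `allowed`."""
    return " ".join(
        t for t in impressions.split() if t.rsplit("-", 1)[0] in allowed
    )


filtered = behaviors.assign(
    history_all=behaviors["history"],  # unfiltered history kept alongside
    history=behaviors["history"].map(keep_history),
    impressions=behaviors["impressions"].map(keep_impressions),
)
print(filtered)
```

In this toy example `N9` is absent from the catalogue and is therefore dropped as well, which mirrors the idea of avoiding item cold-start.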

### Transform train dataset

In [10]:
behaviors_train_ex = transform_behaviors_to_coldstart(behaviors_train, news_train, COLD_START_CATEGORY)

In [11]:
behaviors_train_ex.head(5)

Unnamed: 0,slateid,userid,time,history_all,impressions_all,impressions,history,history_all_categories,history_all_subcategories
0,1,U13740,11/11/2019 9:05:58 AM,N55189 N42782 N34694 N45794 N18445 N63302 N104...,N55689-1 N35729-0,N35729-0,N45794 N19347 N31801,tv sports tv news sports lifestyle movies news...,tvnews baseball_mlb tvnews newscrime football_...
1,2,U91836,11/12/2019 6:11:30 PM,N31739 N6072 N63045 N23979 N35656 N43353 N8129...,N20678-0 N39317-0 N58114-0 N20495-0 N42977-0 N...,N39317-0 N20495-0 N42977-0,N31739 N6072 N63045 N43353 N8129 N1569 N17686 ...,news news news finance travel news news news n...,newscrime newsus newscrime markets travelnews ...
2,3,U73700,11/14/2019 7:01:48 AM,N10732 N25792 N7563 N21087 N41087 N5445 N60384...,N50014-0 N23877-0 N35389-0 N49712-0 N16844-0 N...,N23877-0 N49712-0 N64174-0 N46821-0 N48017-0 N...,N7563 N24233,lifestyle lifestyle news sports tv weather spo...,lifestylebuzz lifestylehomeandgarden newsus fo...
3,4,U34670,11/11/2019 5:28:05 AM,N45729 N2203 N871 N53880 N41375 N43142 N33013 ...,N35729-0 N33632-0 N49685-1 N27581-0,N35729-0,,tv sports tv finance finance sports lifestyle ...,tv-celebrity baseball_mlb tv-celebrity markets...
4,5,U8125,11/12/2019 4:11:21 PM,N10078 N56514 N14904 N33740,N39985-0 N36050-0 N16096-0 N8400-1 N22407-0 N6...,N16096-0 N45389-0 N35850-0 N28495-0 N39317-0 N...,,autos travel weather health,autosownership travelnews weathertopstories we...


In [12]:
behaviors_train_ex[
    [
        "slateid",
        "userid",
        "time",
        "history",
        "impressions",
        "history_all_categories",
        "history_all_subcategories",
        "history_all",
    ]
].to_csv(COLD_START_BEHAVIORS_TRAIN, sep="\t", index=False)
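
One way to check how pronounced the user cold-start is: read the written file back and count the slates whose `history` is empty. The following is a hedged sketch on a small in-memory TSV that stands in for `behaviors_train.tsv`; the rows and the names `history_len` and `cold_share` are illustrative assumptions.

```python
import io

import pandas as pd

# Toy stand-in for the written TSV file; the rows are made up.
tsv = (
    "slateid\tuserid\ttime\thistory\timpressions\n"
    "1\tU1\t11/11/2019 9:05:58 AM\tN1 N3\tN5-0\n"
    "2\tU2\t11/11/2019 5:28:05 AM\t\tN5-0 N6-1\n"
    "3\tU3\t11/12/2019 4:11:21 PM\t\tN6-0\n"
)
df = pd.read_csv(io.StringIO(tsv), sep="\t")

# Empty history fields are read back as NaN by default,
# so they are normalised to '' before counting article IDs.
history_len = df["history"].fillna("").str.split().str.len()

# Share of slates shown to users without any `news` history.
cold_share = (history_len == 0).mean()
print(history_len.tolist(), cold_share)
```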

### Transform test dataset

In [13]:
behaviors_test_ex = transform_behaviors_to_coldstart(
    behaviors_test, news_train, COLD_START_CATEGORY
)

In [14]:
behaviors_test_ex.head(5)

Unnamed: 0,slateid,userid,time,history_all,impressions_all,impressions,history,history_all_categories,history_all_subcategories
0,1,U80234,11/15/2019 12:37:50 PM,N55189 N46039 N51741 N53234 N11276 N264 N40716...,N28682-0 N48740-0 N31958-1 N34130-0 N6916-0 N5...,N50775-0,N46039 N53234 N6616 N63573 N38895,tv news tv news finance autos tv movies entert...,tvnews newsus tv-celebrity newsus finance-comp...
1,2,U60458,11/15/2019 7:11:50 AM,N58715 N32109 N51180 N33438 N54827 N28488 N611...,N20036-0 N23513-1 N32536-0 N46976-0 N35216-0 N...,N36779-0,N58715 N33438 N54827 N34775,news travel finance news news finance music ne...,newsus travelnews finance-companies newsscienc...
2,3,U44190,11/15/2019 9:55:12 AM,N56253 N1150 N55189 N16233 N61704 N51706 N5303...,N36779-0 N62365-0 N58098-0 N5472-0 N13408-0 N5...,N36779-0 N50775-0,N1150 N16233 N53033,sports news tv news lifestyle sports news ente...,football_nfl newscrime tvnews newsus shop-book...
3,4,U87380,11/15/2019 3:12:46 PM,N63554 N49153 N28678 N23232 N43369 N58518 N444...,N6950-0 N60215-0 N6074-0 N11930-0 N6916-0 N248...,N45057-0,N49153 N58518 N7649 N45794 N53033 N29361 N28247,travel news sports sports travel news tv news ...,traveltripideas newsus baseball_mlb football_n...
4,5,U9444,11/15/2019 8:25:46 AM,N51692 N18285 N26015 N22679 N55556,N5940-1 N23513-0 N49285-0 N23355-0 N19990-0 N3...,N36779-0,,tv sports entertainment sports finance,tv-celebrity football_nfl celebrity golf perso...


In [15]:
behaviors_test_ex[
    [
        "slateid",
        "userid",
        "time",
        "history",
        "impressions",
        "history_all_categories",
        "history_all_subcategories",
        "history_all",
    ]
].to_csv(COLD_START_BEHAVIORS_TEST, sep="\t", index=False)

# Split article data into a news-only main catalogue and an all-category auxiliary catalogue

In [16]:
news_train.query("category == @COLD_START_CATEGORY").to_csv(
    NEWS_CATALOGUE_TRAIN, sep="\t", index=False
)
news_train.to_csv(
    AUXILIARY_DATA_CATALOGUE_TRAIN, sep="\t", index=False
)

# Extract all unique [sub]categories

In [17]:
categories_pd = pd.DataFrame(
    list(enumerate(sorted(news_train.category.unique().tolist()))),
    columns=["order", "category"]
)

In [18]:
categories_pd

Unnamed: 0,order,category
0,0,autos
1,1,entertainment
2,2,finance
3,3,foodanddrink
4,4,health
5,5,kids
6,6,lifestyle
7,7,middleeast
8,8,movies
9,9,music


In [19]:
categories_pd.to_csv(ALL_CATEGORIES_PATH, sep="\t")

In [20]:
subcategories_pd = pd.DataFrame(
    list(enumerate(sorted(news_train.subcategory.unique().tolist()))),
    columns=["order", "subcategory"]
)

In [21]:
subcategories_pd

Unnamed: 0,order,subcategory
0,0,ads-latingrammys
1,1,ads-lung-health
2,2,advice
3,3,animals
4,4,autosbuying
...,...,...
259,259,weightloss
260,260,wellness
261,261,wines
262,262,wonder


In [22]:
subcategories_pd.to_csv(ALL_SUBCATEGORIES_PATH, sep="\t")
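
Because `to_csv` is called here without `index=False`, the written files contain the DataFrame index as an unnamed first column. The sketch below shows one possible way to read such a file back and build a lookup table; it uses an in-memory buffer and three sample categories instead of the real file, and `category_to_order` is a hypothetical name.

```python
import io

import pandas as pd

# Small stand-in for the categories table built above.
categories_pd = pd.DataFrame(
    list(enumerate(sorted(["news", "autos", "finance"]))),
    columns=["order", "category"],
)

buffer = io.StringIO()
categories_pd.to_csv(buffer, sep="\t")  # index is written as the first column
buffer.seek(0)

# index_col=0 restores the index instead of creating an 'Unnamed: 0' column.
loaded = pd.read_csv(buffer, sep="\t", index_col=0)
category_to_order = dict(zip(loaded["category"], loaded["order"]))
print(category_to_order)
```

The same pattern would apply to `subcategories.tsv` with the `subcategory` column.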