In this notebook we will train a `Word2Vec` model.

We will use the `gensim` library which offers extremely fast training on the CPU.

We will again rely on `polars` and its small memory footprint to load and process the data. To speed things up, let's use the dataset in parquet format (we won't have to deal with `jsonl` files anymore). [I shared the dataset here](https://www.kaggle.com/datasets/radek1/otto-full-optimized-memory-footprint).

Why are we training word2vec embeddings in the first place?

A session where one action follows another is very much like... a sentence! In sentences, words that are related appear together. We don't necessarily expect to see the word "spaceship" in a sentence discussing various ways to cook a steak. The word "steak" is more likely to appear close to words such as "rosemary", "salt", "pepper", "oil" and "butter". In this sense, these words are thematically related. And with a large enough corpus we can start making further distinctions! Maybe "butter" will end up closer in the embedding space to "milk" than to, for instance, "orange juice", even though milk and orange juice are both drinks you can have with your breakfast! (That might be because milk is used to produce butter, which could pull the embeddings for "milk" and "butter" closer together, assuming our corpus contains texts on butter production.)

Similarly, here we can exploit the fact that `aids` appearing close together in a sequence likely share some similarity. A person browsing for gardening equipment is probably not looking at surfboards and vice versa.
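
To make the "close together" idea concrete, here is a minimal sketch of how a session can be turned into (center, context) pairs with a sliding window, which is roughly the kind of signal word2vec-style models learn from. The function name `context_pairs`, the window size, and the aids are made up for illustration; `gensim` handles this internally (via its `window` parameter) and its actual implementation differs in the details.

```python
def context_pairs(session, window=2):
    """Return (center, context) pairs for aids at most `window` positions apart."""
    pairs = []
    for i, center in enumerate(session):
        lo = max(0, i - window)
        hi = min(len(session), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # an aid is not its own context
                pairs.append((center, session[j]))
    return pairs


# Hypothetical session: three "gardening" aids followed by one unrelated aid
toy_session = [101, 102, 103, 900]
toy_pairs = context_pairs(toy_session, window=1)
print(toy_pairs)
```

With `window=1`, aid `101` is only ever paired with `102`, so nothing in this session pushes `101` and `900` towards each other.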

Once we train our model, what will we be able to use it for? First and foremost, candidate generation! Though one might also imagine using it for scoring. Essentially, a model such as this can be very handy in the context of session-based recommendation models!

Let's get to work! 🙂

## Other resources you might find useful:

* [💡 [2 methods] How-to ensemble predictions 🏅🏅🏅](https://www.kaggle.com/code/radek1/2-methods-how-to-ensemble-predictions)
* [co-visitation matrix - simplified, imprvd logic 🔥](https://www.kaggle.com/code/radek1/co-visitation-matrix-simplified-imprvd-logic)
* [💡 Word2Vec How-to [training and submission]🚀🚀🚀](https://www.kaggle.com/code/radek1/word2vec-how-to-training-and-submission)
* [local validation tracks public LB perfecty -- here is the setup](https://www.kaggle.com/competitions/otto-recommender-system/discussion/364991)
* [💡 For my friends from Twitter and LinkedIn -- here is how to dive into this competition 🐳](https://www.kaggle.com/competitions/otto-recommender-system/discussion/368560)
* [Full dataset processed to CSV/parquet files with optimized memory footprint](https://www.kaggle.com/competitions/otto-recommender-system/discussion/363843)

In [1]:
# ! pip install gensim

# Config 

In [2]:
debug = False

debug_rows = 10000

vector_size = 6

# Data Preprocessing

In [3]:
! mkdir -p ../model_training/w2v_v1


In [4]:
# !pip install polars

import polars as pl
from gensim.models import Word2Vec

train_dir = '../data/parquet/train1/*.parquet'
test_dir = '../data/parquet/train2/*.parquet'


model_file = '../model_training/w2v_v1/w2v.model'
if debug:
    train = pl.read_parquet(train_dir, n_rows=debug_rows)
    test = pl.read_parquet(test_dir, n_rows=debug_rows)
else:
    train = pl.read_parquet(train_dir)
    test = pl.read_parquet(test_dir)

Let us now transform the data into a format that the `gensim` library can work with. Thanks to `polars` we can do so very efficiently and very quickly.

There are various ways we could feed our data to our model; however, doing so straight from RAM in the form of Python lists is probably one of the fastest! As we have enough resources on Kaggle to do so, let us take this approach!
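
As a rough picture of the target format, the sketch below builds a list of lists from a handful of made-up `(session, aid)` rows in plain Python. The names `rows_to_sentences` and `toy_rows` are hypothetical and only for illustration; on the real data the same grouping is done with `polars`, which is far faster.

```python
from collections import defaultdict

# Hypothetical (session, aid) rows, assumed to be in time order within each session
toy_rows = [(1, 10), (1, 11), (2, 20), (1, 12), (2, 21)]


def rows_to_sentences(rows):
    """Group aids by session, keeping the order in which they appear."""
    by_session = defaultdict(list)
    for session, aid in rows:
        by_session[session].append(aid)
    # One inner list ("sentence") per session
    return list(by_session.values())


toy_sentences = rows_to_sentences(toy_rows)
print(toy_sentences)  # [[10, 11, 12], [20, 21]]
```

Each inner list plays the role of one sentence, and each `aid` plays the role of one word.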

In [5]:
train.shape

(107685893, 4)

In [6]:
test.shape

(9580522, 4)

In [7]:
# Collect each session's aids into one list: one "sentence" per session
sentences_df = pl.concat([train, test]).groupby('session').agg(
    pl.col('aid').alias('sentence')
)

In [8]:
sentences_df.shape

(10005085, 2)

In [9]:
sentences = sentences_df['sentence'].to_list()

Time to train our model.

# Training a word2vec model

In [None]:
%%time

# min_count=1 keeps every aid, even those seen only once; workers=4 trains on 4 threads
w2vec = Word2Vec(sentences=sentences, vector_size=vector_size, min_count=1, workers=4)

In [None]:
train.head()

In [None]:
model_file

In [None]:
w2vec.save(model_file)

In [None]:
new_model = Word2Vec.load(model_file)

In [None]:
new_model

In [None]:
w2vec.wv[16246]

In [None]:
new_model.wv[16246]
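
With vectors like these in hand, candidate generation boils down to a nearest-neighbour lookup in the embedding space. Below is a minimal pure-Python sketch of that idea using cosine similarity. The embeddings are hand-made 2-dimensional toys and the helper names `cosine` and `nearest_aids` are hypothetical; on a real model something like `wv.most_similar` from `gensim`, or an approximate nearest-neighbour index, would be the more practical choice.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_aids(query_aid, embeddings, topn=2):
    """Return the `topn` aids most similar to `query_aid`, excluding itself."""
    query = embeddings[query_aid]
    scored = [
        (aid, cosine(query, vec))
        for aid, vec in embeddings.items()
        if aid != query_aid
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [aid for aid, _ in scored[:topn]]


# Made-up embeddings, NOT taken from the trained model
toy_embeddings = {
    1: [1.0, 0.0],
    2: [0.9, 0.1],
    3: [0.0, 1.0],
    4: [0.7, 0.7],
}
toy_candidates = nearest_aids(1, toy_embeddings)
print(toy_candidates)
```

For a session, one simple option is to look up the neighbours of its last few `aids` and use them as candidates.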