[OTTO – Multi-Objective Recommender System](https://www.kaggle.com/competitions/otto-recommender-system/overview)

## Overview

### Terminology

* a multi-objective recommender system
* tailored recommendations
* Current recommender systems consist of various models with different approaches, ranging from **simple matrix factorization** to a **transformer-type deep neural network**.






### Vocabulary

* Online shoppers **have their pick of millions of products** from large retailers.

* However, no single model exists that can simultaneously optimize multiple objectives.

* In this competition, you’ll build a single entry to predict click-through, add-to-cart, and conversion rates based on previous same-session events.

*  Improving recommendations will **ensure navigating through seemingly endless options is more effortless and engaging for shoppers**.

## Questions about this competition:

---


* Q1: What is the distribution of `elapsed time` across sessions?

* Q2: What is the distribution of the `number of events` per session?

* Q3: How many distinct `article id (product code)` values are in the dataset?

* Q4: Why is the `parquet` version of the data better than `csv`? Also look at the `pickle` format mentioned [here](https://www.kaggle.com/code/radek1/howto-full-dataset-as-parquet-csv-files?scriptVersionId=109945227).
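
Q1 to Q3 can be explored with a few aggregations. The sketch below is a minimal illustration on made-up toy events, assuming each event is a `(session, aid, ts, type)` tuple with `ts` in seconds; the helper name `session_stats` is hypothetical and not part of any competition code.

```python
from collections import Counter
from statistics import median

# Toy events: (session, aid, ts_in_seconds, type). Values are made up for illustration.
EVENTS = [
    (0, 101, 0, "clicks"),
    (0, 102, 60, "clicks"),
    (0, 101, 3600, "carts"),
    (1, 103, 10, "clicks"),
    (2, 101, 5, "clicks"),
    (2, 104, 86405, "orders"),
]

def session_stats(events):
    """Return per-session event counts, elapsed times, and the number of unique aids."""
    counts = Counter()
    first_ts, last_ts = {}, {}
    aids = set()
    for session, aid, ts, _type in events:
        counts[session] += 1
        first_ts[session] = min(ts, first_ts.get(session, ts))
        last_ts[session] = max(ts, last_ts.get(session, ts))
        aids.add(aid)
    # Elapsed time = last timestamp minus first timestamp within each session.
    elapsed = {s: last_ts[s] - first_ts[s] for s in counts}
    return counts, elapsed, len(aids)

counts, elapsed, n_aids = session_stats(EVENTS)
print("events per session:", dict(counts))
print("elapsed seconds per session:", elapsed)
print("median events per session:", median(counts.values()))
print("unique aids:", n_aids)
```

On the full dataset the same quantities would more likely be computed with a dataframe `groupby`, but the logic is the same.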

## Framework study

* [Processing large CSV files in chunks with pandas `chunksize`](https://blog.csdn.net/weixin_43790560/article/details/88587123)

* [Dask, a tool for processing large data in Python](https://blog.csdn.net/qq_42374697/article/details/121010300)

* [How Spark works and its basic concepts (very detailed!)](https://blog.csdn.net/qq_42374697/article/details/121010300)

* Colab Pro: 25 GB RAM; Colab Pro+: 52 GB RAM

* [Recommending a great Python tool for big-data analysis: Dask](https://zhuanlan.zhihu.com/p/302267112)

* [Dask official site](https://dask.pydata.org/en/latest/)
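
As a rough illustration of the chunked approach from the first link, the sketch below streams a CSV through `pandas.read_csv` with `chunksize` and accumulates a running count, so only one chunk is in memory at a time. The in-memory CSV text and the helper name `count_events_by_type` are assumptions made for the example.

```python
import io
from collections import Counter

import pandas as pd

# Tiny in-memory CSV standing in for a file too large to load at once (made-up rows).
CSV_TEXT = """session,aid,ts,type
0,101,0,clicks
0,102,60,clicks
1,103,10,clicks
1,101,20,carts
1,104,30,orders
"""

def count_events_by_type(csv_source, chunksize=2):
    """Stream the CSV in chunks and accumulate event counts per type."""
    totals = Counter()
    # With chunksize set, read_csv yields DataFrames of at most `chunksize` rows.
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        totals.update(chunk["type"].value_counts().to_dict())
    return totals

type_counts = count_events_by_type(io.StringIO(CSV_TEXT))
print(dict(type_counts))
```

For a real file, a path would be passed instead of the `io.StringIO` object, with a much larger `chunksize`.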

# EDA Notebook

## [OTTO: Basic EDA](https://www.kaggle.com/code/aliphya/otto-basic-eda)

* As part of this notebook we will load a small sample of the total data and try to create **insightful visualizations**.

* In a single event the maximum number of actions performed is 495, which seems a little **excessive** to me.

* The median number of actions in an event is 19, which seems **plausible**.

## [Time Series EDA - Users and Real Sessions](https://www.kaggle.com/code/cdeotte/time-series-eda-users-and-real-sessions) --- by [Chris Deotte](https://www.kaggle.com/cdeotte)

### Key Messages:

* In Kaggle's OTTO competition the word "session" actually means "user".

* Lines of training data: 216,716,096 (about 216.7 million, or 0.2 billion)

* Unique sessions (users): 12,899,779 (about 12.9 million)

* Unique aids (products): 1,855,603 (about 1.86 million)

### Observations:

* Most users exhibit regular behavior. They click, cart and order at the same hours each day. 

* Also, most users like to shop on the same days each week.

* Most users are active during the waking hours of day but some users like to shop during the night while others are sleeping. 

* We also notice that users shop in clusters of activity. 

* **Our challenge in this competition** is that we must both predict the remainder of the last cluster (provided in test data) and predict new clusters (after last timestamp in test). 

* Furthermore, all users in the test data (not displayed in this notebook) have **less than 1 week of data**, so **we must predict user behavior given little user history information (i.e. the RecSys "cold start" problem)**. Understanding users and their behavior will help us predict test users' future behavior!
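
The "clusters of activity" observed above can be recovered by splitting a user's timestamps wherever the gap between consecutive events is large. This is a minimal sketch; the two-hour threshold and the helper name `split_into_clusters` are assumptions for illustration, not values confirmed by these notes.

```python
def split_into_clusters(timestamps, max_gap_seconds=2 * 60 * 60):
    """Group timestamps into clusters separated by gaps above max_gap_seconds."""
    clusters = []
    for ts in sorted(timestamps):
        # Extend the current cluster if the gap to its last event is small enough.
        if clusters and ts - clusters[-1][-1] <= max_gap_seconds:
            clusters[-1].append(ts)
        else:
            clusters.append([ts])
    return clusters

# Made-up timestamps (seconds) for one user.
user_ts = [0, 300, 900, 30000, 30600, 200000]
clusters = split_into_clusters(user_ts)
print(clusters)
```

Each inner list is one "real session" in the sense of the notebook title; the last cluster of a test user is the one whose remainder has to be predicted.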



### **To Be Learnt**:

* Time Series data processing -- py&da

* Read the documentation on **parquet** and **pickle**


# Solution Notebook

## [Candidate ReRank Model - [LB 0.575]](https://www.kaggle.com/code/cdeotte/candidate-rerank-model-lb-0-575) --------- by [*CHRIS DEOTTE*](https://www.kaggle.com/cdeotte)

**Note:** In this competition, a "session" actually means a unique "user". So our task is to predict what each of the `1,671,803` test "users" (i.e. "sessions") will do in the future. For each test "user" (i.e. "session") we must predict what they will `click`, `cart`, and `order` during the remainder of the week-long test period.

### Key knowledge:

1. Candidate ReRank Model

  * [Recommendation Systems for Large Datasets](https://www.kaggle.com/competitions/otto-recommender-system/discussion/364721)

2. ReRanker model (such as XGB)


### Questions:

* What is the **final 20**?

* Test user → possible choices, i.e. candidates
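
One way to read these two questions: the submission allows up to 20 predicted aids per session and type, so candidates are generated first and then reranked down to a final list. The sketch below illustrates that flow with handcrafted scores; the lookup tables `RELATED` and `POPULAR`, the score values, and the function name `recommend` are all hypothetical and are not taken from the notebook.

```python
from collections import Counter

# Hypothetical lookup: aid -> related aids, e.g. derived from a co-visitation matrix.
RELATED = {1: [2, 3], 2: [1, 4], 3: [5]}
# Hypothetical fallback list of globally popular aids.
POPULAR = [9, 8, 7]

def recommend(history, k=20):
    """Generate candidates for one user and rerank them with simple handcrafted scores."""
    scores = Counter()
    # Candidate source 1: items already in the history; later (more recent) ones score higher.
    for position, aid in enumerate(history):
        scores[aid] += 1.0 + position / max(len(history), 1)
    # Candidate source 2: items related to anything in the history.
    for aid in history:
        for related in RELATED.get(aid, []):
            scores[related] += 0.5
    ranked = [aid for aid, _ in scores.most_common()]
    # Pad with popular items not already ranked, then keep the top k.
    for aid in POPULAR:
        if aid not in scores:
            ranked.append(aid)
    return ranked[:k]

final = recommend([1, 2], k=5)
print(final)
```

A learned reranker such as XGB would replace the hand-set scores with a model trained on features of each (user, candidate) pair.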

### About this competition:

1. **Note** in this competition, a "session" actually means a unique "user". 

2. NVIDIA RAPIDS:
  
  * [RAPIDS, built for data science and machine learning](https://blog.csdn.net/sunhf_csdn/article/details/83538591)

  * Nvidia [RAPIDS: Open GPU Data Science](https://rapids.ai/)


### Kagglers who have shared ideas(Credits):

* We use **co-visitation matrix idea** from *Vladimir* [here][1]. 

* We use **groupby sort logic** from *Sinan* in comment section [here][4]. 

* We use **duplicate prediction removal logic** from Radek [here][5]. 

* We use **multiple visit logic** from *Pietro* [here][2]. 

* We use **type weighting logic** from *Ingvaras* [here][3]. 

* We use **leaky test data** from CHRIS DEOTTE's previous notebook [here][4].

* And some ideas may have originated from *Tawara* [here][6] and *KJ* [here][7]. 

* We use *Colum2131*'s parquets [here][8]. 

* The image above (in the Kaggle notebook) is from *Ravi*'s discussion about candidate rerank models [here][9].

[1]: https://www.kaggle.com/code/vslaykovsky/co-visitation-matrix
[2]: https://www.kaggle.com/code/pietromaldini1/multiple-clicks-vs-latest-items
[3]: https://www.kaggle.com/code/ingvarasgalinskas/item-type-vs-multiple-clicks-vs-latest-items
[4]: https://www.kaggle.com/code/cdeotte/test-data-leak-lb-boost
[5]: https://www.kaggle.com/code/radek1/co-visitation-matrix-simplified-imprvd-logic
[6]: https://www.kaggle.com/code/ttahara/otto-mors-aid-frequency-baseline
[7]: https://www.kaggle.com/code/whitelily/co-occurrence-baseline
[8]: https://www.kaggle.com/datasets/columbia2131/otto-chunk-data-inparquet-format
[9]: https://www.kaggle.com/competitions/otto-recommender-system/discussion/364721
[10]: https://www.kaggle.com/cdeotte/compute-validation-score-cv-564
[11]: https://www.kaggle.com/competitions/otto-recommender-system/discussion/364991

### Questions:

* Q1: What are *handcrafted rules*?

* Q2: What are the `co-visitation matrix`, `type weighting`, and `time weighting`?
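
As a tentative answer to Q2, a co-visitation matrix counts how often two aids appear close together in the same session, and type or time weighting changes how much each pair contributes. The sketch below is one possible reading; the weight values, the 24-hour window, and all function names are assumptions for illustration rather than the notebook's confirmed settings.

```python
from collections import defaultdict
from itertools import combinations

# Assumed type weights for illustration only; the notebook's actual values may differ.
TYPE_WEIGHT = {"clicks": 1.0, "carts": 6.0, "orders": 3.0}

def covisitation(sessions, max_gap=24 * 60 * 60, weight_fn=None):
    """Accumulate pair weights for aids seen within max_gap seconds in one session.

    sessions maps session id -> list of (aid, ts, type) events.
    weight_fn(a, b) returns the weight credited to the pair a -> b.
    """
    weight_fn = weight_fn or (lambda a, b: 1.0)
    matrix = defaultdict(lambda: defaultdict(float))
    for events in sessions.values():
        for a, b in combinations(events, 2):
            # Skip self-pairs and events that are too far apart in time.
            if a[0] == b[0] or abs(a[1] - b[1]) > max_gap:
                continue
            matrix[a[0]][b[0]] += weight_fn(a, b)
            matrix[b[0]][a[0]] += weight_fn(b, a)
    return matrix

def type_weight(a, b):
    # Type weighting: a pair counts more when the target event is a cart or an order.
    return TYPE_WEIGHT[b[2]]

def make_time_weight(t_min, t_max):
    # Time weighting: pairs whose source event is more recent count more (1.0 up to 4.0).
    def time_weight(a, b):
        return 1.0 + 3.0 * (a[1] - t_min) / (t_max - t_min)
    return time_weight

# Made-up sessions: session id -> [(aid, ts_in_seconds, type), ...]
SESSIONS = {
    0: [(1, 0, "clicks"), (2, 100, "carts")],
    1: [(1, 500, "clicks"), (2, 600, "clicks"), (3, 1000, "orders")],
}

plain = covisitation(SESSIONS)
typed = covisitation(SESSIONS, weight_fn=type_weight)
timed = covisitation(SESSIONS, weight_fn=make_time_weight(0, 1000))
print(dict(plain[1]), dict(typed[1]), dict(timed[1]))
```

Sorting each row of the matrix by weight gives the "related aids" list used to generate candidates for a user who interacted with that aid.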