\n",
@@ -257,53 +259,59 @@
"[2 rows x 40 columns]"
]
},
+ "execution_count": 3,
"metadata": {},
- "execution_count": 4
+ "output_type": "execute_result"
}
],
- "metadata": {}
+ "source": [
+ "raw_data = load_spark_df(size=DATA_SIZE, spark=spark, dbutils=dbutils)\n",
+ "# visualize data\n",
+ "raw_data.limit(2).toPandas().head()"
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {},
"source": [
"### Feature Processing\n",
"The feature data provided has many missing values across both integer and categorical feature fields. In addition the categorical features have many distinct values, so effectively cleaning and representing the feature data is an important step prior to training a model.
\n",
"One of the simplest ways of managing both features that have missing values as well as high cardinality is to use the hashing trick. The [FeatureHasher](http://spark.apache.org/docs/latest/ml-features.html#featurehasher) transformer will pass integer values through and will hash categorical features into a sparse vector of lower dimensionality, which can be used effectively by LightGBM.
\n",
"First, the dataset is splitted randomly for training and testing and feature processing is applied to each dataset."
- ],
- "metadata": {}
+ ]
},
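The hashing trick described above can be illustrated in plain Python. This is a toy sketch, not MMLSpark's actual implementation: the hash function, the `hash_features` helper, and the tiny dimensionality `N` are all illustrative (FeatureHasher's `numFeatures` defaults to 2^18).

```python
# Toy sketch of the hashing trick: numeric features pass through at a hashed
# column index; categorical features hash "column=value" strings into a
# fixed-size vector, so no vocabulary, imputation, or one-hot encoding is needed.
N = 8  # output dimensionality (illustrative; FeatureHasher uses a much larger default)

def hash_features(row):
    vec = [0.0] * N
    for col, value in row.items():
        if isinstance(value, (int, float)):
            vec[hash(col) % N] += value             # integer feature passes through
        elif value is not None:
            vec[hash(f"{col}={value}") % N] += 1.0  # categorical becomes an indicator
    return vec

row = {"int_feature": 3, "cat_feature": "abc123", "missing_feature": None}
vec = hash_features(row)  # missing values simply contribute nothing
```

High-cardinality categoricals and missing values are both handled for free: every distinct `column=value` string lands somewhere in the fixed-size vector, and absent values add nothing.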
{
"cell_type": "code",
- "execution_count": 5,
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [],
"source": [
"raw_train, raw_test = spark_random_split(raw_data, ratio=0.8, seed=42)"
- ],
- "outputs": [],
- "metadata": {}
+ ]
},
{
"cell_type": "code",
- "execution_count": 6,
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [],
"source": [
"columns = [c for c in raw_data.columns if c != 'label']\n",
"feature_processor = FeatureHasher(inputCols=columns, outputCol='features')"
- ],
- "outputs": [],
- "metadata": {}
+ ]
},
{
"cell_type": "code",
- "execution_count": 7,
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [],
"source": [
"train = feature_processor.transform(raw_train)\n",
"test = feature_processor.transform(raw_test)"
- ],
- "outputs": [],
- "metadata": {}
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {},
"source": [
"## Model Training\n",
"In MMLSpark, the LightGBM implementation for binary classification is invoked using the `LightGBMClassifier` class and specifying the objective as `\"binary\"`. In this instance, the occurrence of positive labels is quite low, so setting the `isUnbalance` flag to true helps account for this imbalance.
\n",
@@ -315,12 +323,13 @@
"- `learningRate`: the learning rate for training across trees\n",
"- `featureFraction`: the fraction of features used for training a tree\n",
"- `earlyStoppingRound`: round at which early stopping can be applied to avoid overfitting"
- ],
- "metadata": {}
+ ]
},
{
"cell_type": "code",
- "execution_count": 8,
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [],
"source": [
"lgbm = LightGBMClassifier(\n",
" labelCol=\"label\",\n",
@@ -336,132 +345,154 @@
" featureFraction=FEATURE_FRACTION,\n",
" earlyStoppingRound=EARLY_STOPPING_ROUND\n",
")"
- ],
- "outputs": [],
- "metadata": {}
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {},
"source": [
"### Model Training and Evaluation"
- ],
- "metadata": {}
+ ]
},
{
"cell_type": "code",
- "execution_count": 9,
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ " \r"
+ ]
+ }
+ ],
"source": [
"model = lgbm.fit(train)\n"
- ],
- "outputs": [],
- "metadata": {}
+ ]
},
{
"cell_type": "code",
- "execution_count": 10,
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [],
"source": [
"predictions = model.transform(test)"
- ],
- "outputs": [],
- "metadata": {}
+ ]
},
{
"cell_type": "code",
- "execution_count": 11,
- "source": [
- "evaluator = (\n",
- " ComputeModelStatistics()\n",
- " .setScoredLabelsCol(\"prediction\")\n",
- " .setLabelCol(\"label\")\n",
- " .setEvaluationMetric(\"AUC\")\n",
- ")\n",
- "\n",
- "result = evaluator.transform(predictions)\n",
- "auc = result.select(\"AUC\").collect()[0][0]\n",
- "result.show()"
- ],
+ "execution_count": 10,
+ "metadata": {},
"outputs": [
{
+ "name": "stderr",
"output_type": "stream",
+ "text": [
+ " \r"
+ ]
+ },
+ {
"name": "stdout",
+ "output_type": "stream",
"text": [
"+---------------+------------------+\n",
"|evaluation_type| AUC|\n",
"+---------------+------------------+\n",
- "| Classification|0.6892773832319504|\n",
+ "| Classification|0.6590485347443004|\n",
"+---------------+------------------+\n",
"\n"
]
}
],
- "metadata": {}
+ "source": [
+ "evaluator = (\n",
+ " ComputeModelStatistics()\n",
+ " .setScoredLabelsCol(\"prediction\")\n",
+ " .setLabelCol(\"label\")\n",
+ " .setEvaluationMetric(\"AUC\")\n",
+ ")\n",
+ "\n",
+ "result = evaluator.transform(predictions)\n",
+ "auc = result.select(\"AUC\").collect()[0][0]\n",
+ "result.show()"
+ ]
},
{
"cell_type": "code",
- "execution_count": 10,
- "source": [
- "# Record results with papermill for tests\n",
- "sb.glue(\"auc\", auc)"
- ],
+ "execution_count": 11,
+ "metadata": {},
"outputs": [
{
- "output_type": "display_data",
"data": {
- "application/papermill.record+json": {
- "auc": 0.6870253907336659
+ "application/scrapbook.scrap.json+json": {
+ "data": 0.6590485347443004,
+ "encoder": "json",
+ "name": "auc",
+ "version": 1
}
},
- "metadata": {}
+ "metadata": {
+ "scrapbook": {
+ "data": true,
+ "display": false,
+ "name": "auc"
+ }
+ },
+ "output_type": "display_data"
}
],
- "metadata": {}
+ "source": [
+ "# Record results with papermill for tests\n",
+ "sb.glue(\"auc\", auc)"
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {},
"source": [
"## Model Saving \n",
"The full pipeline for operating on raw data including feature processing and model prediction can be saved and reloaded for use in another workflow."
- ],
- "metadata": {}
+ ]
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 12,
+ "metadata": {},
+ "outputs": [],
"source": [
"# save model\n",
"pipeline = PipelineModel(stages=[feature_processor, model])\n",
"pipeline.write().overwrite().save(MODEL_NAME)"
- ],
- "outputs": [],
- "metadata": {}
+ ]
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 13,
+ "metadata": {},
+ "outputs": [],
"source": [
"# cleanup spark instance\n",
"if not is_databricks():\n",
" spark.stop()"
- ],
- "outputs": [],
- "metadata": {}
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {},
"source": [
"## Additional Reading\n",
"\\[1\\] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems. 3146–3154. https://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree.pdf
\n",
"\\[2\\] MML Spark: https://mmlspark.blob.core.windows.net/website/index.html
\n"
- ],
- "metadata": {}
+ ]
}
],
"metadata": {
"celltoolbar": "Tags",
"kernelspec": {
- "display_name": "reco_full",
+ "display_name": "Python (reco)",
"language": "python",
- "name": "conda-env-reco_full-py"
+ "name": "reco"
},
"language_info": {
"codemirror_mode": {
@@ -473,9 +504,9 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.6.10"
+ "version": "3.8.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
-}
\ No newline at end of file
+}
diff --git a/examples/03_evaluate/als_movielens_diversity_metrics.ipynb b/examples/03_evaluate/als_movielens_diversity_metrics.ipynb
index 289d5b93fd..12a7ce0ed6 100644
--- a/examples/03_evaluate/als_movielens_diversity_metrics.ipynb
+++ b/examples/03_evaluate/als_movielens_diversity_metrics.ipynb
@@ -2,15 +2,16 @@
"cells": [
{
"cell_type": "markdown",
+ "metadata": {},
"source": [
"
Copyright (c) Microsoft Corporation. All rights reserved.\n",
"\n",
"
Licensed under the MIT License."
- ],
- "metadata": {}
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {},
"source": [
"# Apply Diversity Metrics \n",
"## -- Compare ALS and Random Recommenders on MovieLens (PySpark)\n",
@@ -40,11 +41,11 @@
"The comparision results show that the ALS recommender outperforms the random recommender on ranking metrics (Precision@k, Recall@k, NDCG@k, and\tMean average precision), while the random recommender outperforms ALS recommender on diversity metrics. This is because ALS is optimized for estimating the item rating as accurate as possible, therefore it performs well on accuracy metrics including rating and ranking metrics. As a side effect, the items being recommended tend to be popular items, which are the items mostly sold or viewed. It leaves the [long-tail items](https://github.com/microsoft/recommenders/blob/main/GLOSSARY.md) having less chance to get introduced to the users. This is the reason why ALS is not performing as well as a random recommender on diversity metrics. \n",
"\n",
"From the algorithmic point of view, items in the tail suffer from the cold-start problem, making them hard for recommendation systems to use. However, from the business point of view, oftentimes the items in the tail can be highly profitable, since, depending on supply, business can apply a higher margin to them. Recommendation systems that optimize metrics like novelty and diversity, can help to find users willing to get these long tail items. Usually there is a trade-off between one type of metric vs. another. One should decide which set of metrics to optimize based on business scenarios."
- ],
- "metadata": {}
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {},
"source": [
"**Coverage**\n",
"\n",
@@ -64,11 +65,11 @@
"p(i|R) = \\frac{|M_r (i)|}{|\\textrm{reco_df}|}\n",
"$$\n",
"and $M_r (i)$ denotes the users who are recommended item $i$.\n"
- ],
- "metadata": {}
+ ]
},
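The two coverage definitions above can be checked on a toy example. This is a stdlib-only sketch (the actual computation in `SparkDiversityEvaluation` runs on Spark DataFrames), assuming distributional coverage is the Shannon entropy of $p(i|R)$:

```python
import math
from collections import Counter

# toy recommendations as (user, item) pairs, against a 5-item training catalog
reco_df = [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u2", "c"), ("u3", "a")]
catalog = {"a", "b", "c", "d", "e"}

# catalog coverage: fraction of catalog items that are ever recommended
catalog_coverage = len({item for _, item in reco_df}) / len(catalog)

# distributional coverage: entropy of p(i|R) = |M_r(i)| / |reco_df|
counts = Counter(item for _, item in reco_df)
p = [c / len(reco_df) for c in counts.values()]
distributional_coverage = -sum(pi * math.log2(pi) for pi in p)

print(catalog_coverage)                   # 0.6
print(round(distributional_coverage, 3))  # 1.371
```

Recommending item `a` to every user inflates catalog coverage less than it depresses the entropy-based measure, which penalizes a skewed item distribution.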
{
"cell_type": "markdown",
+ "metadata": {},
"source": [
"\n",
"**Diversity**\n",
@@ -88,11 +89,11 @@
"$$\n",
"\\textrm{diversity} = 1 - \\textrm{IL}\n",
"$$\n"
- ],
- "metadata": {}
+ ]
},
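Intra-list similarity and the resulting diversity score can be sketched with stdlib-only Python. The toy item feature vectors and cosine similarity here are illustrative; the notebook's default uses co-occurrence-based similarity, but the averaging pattern is the same:

```python
import itertools
import math

# toy item feature vectors (e.g., one-hot genre indicators)
features = {
    "a": [1, 0, 1],
    "b": [1, 1, 0],
    "c": [0, 0, 1],
}

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm

# intra-list similarity: average pairwise similarity within one user's list
reco_list = ["a", "b", "c"]
pairs = list(itertools.combinations(reco_list, 2))
il = sum(cosine(features[i], features[j]) for i, j in pairs) / len(pairs)

diversity = 1 - il
print(round(diversity, 4))  # 0.5976
```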
{
"cell_type": "markdown",
+ "metadata": {},
"source": [
"\n",
"**Novelty**\n",
@@ -111,11 +112,11 @@
"$$\n",
"\\textrm{novelty} = \\sum_{i \\in N_r} \\frac{|M_r (i)|}{|\\textrm{reco_df}|} \\textrm{novelty}(i)\n",
"$$\n"
- ],
- "metadata": {}
+ ]
},
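The aggregation formula above can be sketched on toy data. This stdlib-only sketch assumes the per-item novelty is the negative log of the item's popularity in the training data, $-\log_2 p(i)$, the standard information-theoretic definition; the toy interaction lists are illustrative:

```python
import math
from collections import Counter

# training interactions define item popularity; recommendations define weights
train = [("u1", "a"), ("u2", "a"), ("u3", "b"), ("u1", "b"), ("u2", "c")]
reco = [("u1", "c"), ("u2", "b"), ("u3", "a"), ("u1", "a")]

# per-item novelty: rarer items in training carry more information
pop = Counter(item for _, item in train)
item_novelty = {i: -math.log2(pop[i] / len(train)) for i in pop}

# aggregate novelty, weighted by each item's share of the recommendations
reco_counts = Counter(item for _, item in reco)
novelty = sum((c / len(reco)) * item_novelty[i] for i, c in reco_counts.items())
print(round(novelty, 3))  # 1.572
```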
{
"cell_type": "markdown",
+ "metadata": {},
"source": [
"**Serendipity**\n",
"\n",
@@ -130,19 +131,30 @@
"\\textrm{serendipity} = \\frac{1}{|M|} \\sum_{u \\in M_r}\n",
"\\frac{1}{|N_r (u)|} \\sum_{i \\in N_r (u)} \\big(1 - \\textrm{expectedness}(i|u) \\big) \\, \\textrm{relevance}(i)\n",
"$$\n"
- ],
- "metadata": {}
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {},
"source": [
"**Note**: This notebook requires a PySpark environment to run properly. Please follow the steps in [SETUP.md](https://github.com/Microsoft/Recommenders/blob/master/SETUP.md#dependencies-setup) to install the PySpark environment."
- ],
- "metadata": {}
+ ]
},
{
"cell_type": "code",
"execution_count": 1,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "System version: 3.8.0 (default, Nov 6 2019, 21:49:08) \n",
+ "[GCC 7.3.0]\n",
+ "Spark version: 3.2.0\n"
+ ]
+ }
+ ],
"source": [
"# set the environment path to find Recommenders\n",
"%load_ext autoreload\n",
@@ -156,6 +168,8 @@
"from pyspark.sql.types import FloatType, IntegerType, LongType, StructType, StructField\n",
"from pyspark.ml.feature import Tokenizer, StopWordsRemover\n",
"from pyspark.ml.feature import HashingTF, CountVectorizer, VectorAssembler\n",
+ "import warnings\n",
+ "warnings.simplefilter(action='ignore', category=FutureWarning)\n",
"\n",
"from recommenders.utils.timer import Timer\n",
"from recommenders.datasets import movielens\n",
@@ -171,31 +185,25 @@
"\n",
"print(\"System version: {}\".format(sys.version))\n",
"print(\"Spark version: {}\".format(pyspark.__version__))\n"
- ],
- "outputs": [
- {
- "output_type": "stream",
- "name": "stdout",
- "text": [
- "System version: 3.6.9 (default, Jan 26 2021, 15:33:00) \n",
- "[GCC 8.4.0]\n",
- "Spark version: 2.4.8\n"
- ]
- }
- ],
- "metadata": {}
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {},
"source": [
"\n",
"Set the default parameters."
- ],
- "metadata": {}
+ ]
},
{
"cell_type": "code",
"execution_count": 2,
+ "metadata": {
+ "tags": [
+ "parameters"
+ ]
+ },
+ "outputs": [],
"source": [
"# top k items to recommend\n",
"TOP_K = 10\n",
@@ -209,72 +217,54 @@
"COL_RATING=\"Rating\"\n",
"COL_TITLE=\"Title\"\n",
"COL_GENRE=\"Genre\""
- ],
- "outputs": [],
- "metadata": {
- "tags": [
- "parameters"
- ]
- }
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {},
"source": [
"### 1. Set up Spark context\n",
"\n",
"The following settings work well for debugging locally on VM - change when running on a cluster. We set up a giant single executor with many threads and specify memory cap. "
- ],
- "metadata": {}
+ ]
},
{
"cell_type": "code",
"execution_count": 3,
+ "metadata": {},
+ "outputs": [],
"source": [
"# the following settings work well for debugging locally on VM - change when running on a cluster\n",
"# set up a giant single executor with many threads and specify memory cap\n",
"\n",
"spark = start_or_get_spark(\"ALS PySpark\", memory=\"16g\")\n",
- "\n",
+ "spark.conf.set(\"spark.sql.analyzer.failAmbiguousSelfJoin\", \"false\")\n",
"spark.conf.set(\"spark.sql.crossJoin.enabled\", \"true\")"
- ],
- "outputs": [],
- "metadata": {}
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {},
"source": [
"### 2. Download the MovieLens dataset"
- ],
- "metadata": {}
+ ]
},
{
"cell_type": "code",
"execution_count": 4,
- "source": [
- "# Note: The DataFrame-based API for ALS currently only supports integers for user and item ids.\n",
- "schema = StructType(\n",
- " (\n",
- " StructField(COL_USER, IntegerType()),\n",
- " StructField(COL_ITEM, IntegerType()),\n",
- " StructField(COL_RATING, FloatType()),\n",
- " StructField(\"Timestamp\", LongType()),\n",
- " )\n",
- ")\n",
- "\n",
- "data = movielens.load_spark_df(spark, size=MOVIELENS_DATA_SIZE, schema=schema, title_col=COL_TITLE, genres_col=COL_GENRE)\n",
- "data.show()"
- ],
+ "metadata": {},
"outputs": [
{
- "output_type": "stream",
"name": "stderr",
+ "output_type": "stream",
"text": [
- "100%|██████████| 4.81k/4.81k [00:00<00:00, 20.1kKB/s]\n"
+ "100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.81k/4.81k [00:05<00:00, 862KB/s]\n",
+ " \r"
]
},
{
- "output_type": "stream",
"name": "stdout",
+ "output_type": "stream",
"text": [
"+-------+------+------+---------+--------------------+------+\n",
"|MovieId|UserId|Rating|Timestamp| Title| Genre|\n",
@@ -305,73 +295,116 @@
]
}
],
- "metadata": {}
+ "source": [
+ "# Note: The DataFrame-based API for ALS currently only supports integers for user and item ids.\n",
+ "schema = StructType(\n",
+ " (\n",
+ " StructField(COL_USER, IntegerType()),\n",
+ " StructField(COL_ITEM, IntegerType()),\n",
+ " StructField(COL_RATING, FloatType()),\n",
+ " StructField(\"Timestamp\", LongType()),\n",
+ " )\n",
+ ")\n",
+ "\n",
+ "data = movielens.load_spark_df(spark, size=MOVIELENS_DATA_SIZE, schema=schema, title_col=COL_TITLE, genres_col=COL_GENRE)\n",
+ "data.show()"
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {},
"source": [
"#### Split the data using the Spark random splitter provided in utilities"
- ],
- "metadata": {}
+ ]
},
{
"cell_type": "code",
"execution_count": 5,
- "source": [
- "train_df, test_df = spark_random_split(data.select(COL_USER, COL_ITEM, COL_RATING), ratio=0.75, seed=123)\n",
- "print (\"N train_df\", train_df.cache().count())\n",
- "print (\"N test_df\", test_df.cache().count())"
- ],
+ "metadata": {},
"outputs": [
{
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ " \r"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "N train_df 75147\n"
+ ]
+ },
+ {
+ "name": "stderr",
"output_type": "stream",
+ "text": [
+ "[Stage 19:================================================> (178 + 3) / 200]\r"
+ ]
+ },
+ {
"name": "stdout",
+ "output_type": "stream",
"text": [
- "N train_df 75066\n",
- "N test_df 24934\n"
+ "N test_df 24853\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "\r",
+ " \r"
]
}
],
- "metadata": {}
+ "source": [
+ "train_df, test_df = spark_random_split(data.select(COL_USER, COL_ITEM, COL_RATING), ratio=0.75, seed=123)\n",
+ "print (\"N train_df\", train_df.cache().count())\n",
+ "print (\"N test_df\", test_df.cache().count())"
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {},
"source": [
"#### Get all possible user-item pairs"
- ],
- "metadata": {}
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {},
"source": [
"Note: We assume that training data contains all users and all catalog items. "
- ],
- "metadata": {}
+ ]
},
{
"cell_type": "code",
"execution_count": 6,
+ "metadata": {},
+ "outputs": [],
"source": [
"users = train_df.select(COL_USER).distinct()\n",
"items = train_df.select(COL_ITEM).distinct()\n",
"user_item = users.crossJoin(items)"
- ],
- "outputs": [],
- "metadata": {}
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {},
"source": [
"### 3. Train the ALS model on the training data, and get the top-k recommendations for our testing data\n",
"\n",
"To predict movie ratings, we use the rating data in the training set as users' explicit feedback. The hyperparameters used in building the model are referenced from [here](http://mymedialite.net/examples/datasets.html). We do not constrain the latent factors (`nonnegative = False`) in order to allow for both positive and negative preferences towards movies.\n",
"Timing will vary depending on the machine being used to train."
- ],
- "metadata": {}
+ ]
},
{
"cell_type": "code",
"execution_count": 7,
+ "metadata": {},
+ "outputs": [],
"source": [
"header = {\n",
" \"userCol\": COL_USER,\n",
@@ -390,42 +423,86 @@
" seed=42,\n",
" **header\n",
")"
- ],
- "outputs": [],
- "metadata": {}
+ ]
},
{
"cell_type": "code",
"execution_count": 8,
- "source": [
- "with Timer() as train_time:\n",
- " model = als.fit(train_df)\n",
- "\n",
- "print(\"Took {} seconds for training.\".format(train_time.interval))"
- ],
+ "metadata": {},
"outputs": [
{
+ "name": "stderr",
"output_type": "stream",
+ "text": [
+ " \r"
+ ]
+ },
+ {
"name": "stdout",
+ "output_type": "stream",
"text": [
- "Took 4.189040212018881 seconds for training.\n"
+ "Took 10.75137371000028 seconds for training.\n"
]
}
],
- "metadata": {}
+ "source": [
+ "with Timer() as train_time:\n",
+ " model = als.fit(train_df)\n",
+ "\n",
+ "print(\"Took {} seconds for training.\".format(train_time.interval))"
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {},
"source": [
"In the movie recommendation use case, recommending movies that have been rated by the users does not make sense. Therefore, the rated movies are removed from the recommended items.\n",
"\n",
"In order to achieve this, we recommend all movies to all users, and then remove the user-movie pairs that exist in the training dataset."
- ],
- "metadata": {}
+ ]
},
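The remove-seen step described above amounts to a left anti-join. A toy sketch in plain Python (the notebook does this with a Spark outer join and filter on the user and item columns; the data here is illustrative):

```python
# score all user-item pairs, then drop pairs already present in training
train_pairs = {("u1", "a"), ("u2", "b")}
scored = [
    ("u1", "a", 4.1), ("u1", "b", 3.2),
    ("u2", "a", 2.7), ("u2", "b", 4.9),
]

# keep only pairs the user has not rated yet (left anti-join semantics)
unseen = [(u, i, s) for u, i, s in scored if (u, i) not in train_pairs]
print(unseen)  # [('u1', 'b', 3.2), ('u2', 'a', 2.7)]
```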
{
"cell_type": "code",
"execution_count": 9,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ " \r"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "1464772\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "[Stage 235:> (0 + 2) / 2]\r"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "9430\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "\r",
+ " \r"
+ ]
+ }
+ ],
"source": [
"# Score all user-item pairs\n",
"dfs_pred = model.transform(user_item)\n",
@@ -446,31 +523,22 @@
"top_k_reco = top_all.select(\"*\", F.row_number().over(window).alias(\"rank\")).filter(F.col(\"rank\") <= TOP_K).drop(\"rank\")\n",
" \n",
"print(top_k_reco.count())"
- ],
- "outputs": [
- {
- "output_type": "stream",
- "name": "stdout",
- "text": [
- "1464853\n",
- "9430\n"
- ]
- }
- ],
- "metadata": {}
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {},
"source": [
"### 4. Random Recommender\n",
"\n",
"We define a recommender which randomly recommends unseen items to each user. "
- ],
- "metadata": {}
+ ]
},
{
"cell_type": "code",
"execution_count": 10,
+ "metadata": {},
+ "outputs": [],
"source": [
"# random recommender\n",
"window = Window.partitionBy(COL_USER).orderBy(F.rand())\n",
@@ -491,20 +559,20 @@
" .filter(F.col(\"score\") <= TOP_K)\n",
" .drop(COL_RATING)\n",
")"
- ],
- "outputs": [],
- "metadata": {}
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {},
"source": [
"### 5. ALS vs Random Recommenders Performance Comparison"
- ],
- "metadata": {}
+ ]
},
{
"cell_type": "code",
"execution_count": 11,
+ "metadata": {},
+ "outputs": [],
"source": [
"def get_ranking_results(ranking_eval):\n",
" metrics = {\n",
@@ -525,13 +593,13 @@
" \"serendipity\": diversity_eval.serendipity()\n",
" }\n",
" return metrics "
- ],
- "outputs": [],
- "metadata": {}
+ ]
},
{
"cell_type": "code",
"execution_count": 12,
+ "metadata": {},
+ "outputs": [],
"source": [
"def generate_summary(data, algo, k, ranking_metrics, diversity_metrics):\n",
" summary = {\"Data\": data, \"Algo\": algo, \"K\": k}\n",
@@ -546,20 +614,28 @@
" summary.update(ranking_metrics)\n",
" summary.update(diversity_metrics)\n",
" return summary"
- ],
- "outputs": [],
- "metadata": {}
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {},
"source": [
"#### ALS Recommender Performance Results"
- ],
- "metadata": {}
+ ]
},
{
"cell_type": "code",
"execution_count": 13,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ " \r"
+ ]
+ }
+ ],
"source": [
"als_ranking_eval = SparkRankingEvaluation(\n",
" test_df, \n",
@@ -573,13 +649,21 @@
")\n",
"\n",
"als_ranking_metrics = get_ranking_results(als_ranking_eval)"
- ],
- "outputs": [],
- "metadata": {}
+ ]
},
{
"cell_type": "code",
- "execution_count": 17,
+ "execution_count": 14,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ " \r"
+ ]
+ }
+ ],
"source": [
"als_diversity_eval = SparkDiversityEvaluation(\n",
" train_df = train_df, \n",
@@ -589,29 +673,37 @@
")\n",
"\n",
"als_diversity_metrics = get_diversity_results(als_diversity_eval)"
- ],
- "outputs": [],
- "metadata": {}
+ ]
},
{
"cell_type": "code",
- "execution_count": 18,
+ "execution_count": 15,
+ "metadata": {},
+ "outputs": [],
"source": [
"als_results = generate_summary(MOVIELENS_DATA_SIZE, \"als\", TOP_K, als_ranking_metrics, als_diversity_metrics)"
- ],
- "outputs": [],
- "metadata": {}
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {},
"source": [
"#### Random Recommender Performance Results"
- ],
- "metadata": {}
+ ]
},
{
"cell_type": "code",
- "execution_count": 19,
+ "execution_count": 16,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ " \r"
+ ]
+ }
+ ],
"source": [
"random_ranking_eval = SparkRankingEvaluation(\n",
" test_df,\n",
@@ -624,13 +716,21 @@
")\n",
"\n",
"random_ranking_metrics = get_ranking_results(random_ranking_eval)"
- ],
- "outputs": [],
- "metadata": {}
+ ]
},
{
"cell_type": "code",
- "execution_count": 20,
+ "execution_count": 17,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ " \r"
+ ]
+ }
+ ],
"source": [
"random_diversity_eval = SparkDiversityEvaluation(\n",
" train_df = train_df, \n",
@@ -640,48 +740,43 @@
")\n",
" \n",
"random_diversity_metrics = get_diversity_results(random_diversity_eval)"
- ],
- "outputs": [],
- "metadata": {}
+ ]
},
{
"cell_type": "code",
- "execution_count": 21,
+ "execution_count": 18,
+ "metadata": {},
+ "outputs": [],
"source": [
"random_results = generate_summary(MOVIELENS_DATA_SIZE, \"random\", TOP_K, random_ranking_metrics, random_diversity_metrics)"
- ],
- "outputs": [],
- "metadata": {}
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {},
"source": [
"#### Result Comparison"
- ],
- "metadata": {}
+ ]
},
{
"cell_type": "code",
- "execution_count": 22,
+ "execution_count": 19,
+ "metadata": {},
+ "outputs": [],
"source": [
"cols = [\"Data\", \"Algo\", \"K\", \"Precision@k\", \"Recall@k\", \"NDCG@k\", \"Mean average precision\",\"catalog_coverage\", \"distributional_coverage\",\"novelty\", \"diversity\", \"serendipity\" ]\n",
"df_results = pd.DataFrame(columns=cols)\n",
"\n",
"df_results.loc[1] = als_results \n",
"df_results.loc[2] = random_results "
- ],
- "outputs": [],
- "metadata": {}
+ ]
},
{
"cell_type": "code",
- "execution_count": 23,
- "source": [
- "df_results"
- ],
+ "execution_count": 20,
+ "metadata": {},
"outputs": [
{
- "output_type": "execute_result",
"data": {
"text/html": [
"
\n",
@@ -722,30 +817,30 @@
"
100k | \n",
" als | \n",
" 10 | \n",
- " 0.047296 | \n",
- " 0.016015 | \n",
- " 0.043097 | \n",
- " 0.004579 | \n",
- " 0.385793 | \n",
- " 7.967257 | \n",
- " 11.659776 | \n",
- " 0.892277 | \n",
- " 0.878733 | \n",
+ " 0.044374 | \n",
+ " 0.015567 | \n",
+ " 0.040657 | \n",
+ " 0.004202 | \n",
+ " 0.374158 | \n",
+ " 7.989889 | \n",
+ " 11.740626 | \n",
+ " 0.890659 | \n",
+ " 0.879359 | \n",
" \n",
" \n",
" 2 | \n",
" 100k | \n",
" random | \n",
" 10 | \n",
- " 0.016543 | \n",
- " 0.005566 | \n",
- " 0.016373 | \n",
- " 0.001441 | \n",
- " 0.994489 | \n",
- " 10.541850 | \n",
- " 12.136439 | \n",
- " 0.922613 | \n",
- " 0.892511 | \n",
+ " 0.018259 | \n",
+ " 0.006516 | \n",
+ " 0.018537 | \n",
+ " 0.002038 | \n",
+ " 0.998775 | \n",
+ " 10.543160 | \n",
+ " 12.180267 | \n",
+ " 0.923302 | \n",
+ " 0.892897 | \n",
"
\n",
" \n",
"\n",
@@ -753,43 +848,48 @@
],
"text/plain": [
" Data Algo K Precision@k Recall@k NDCG@k Mean average precision \\\n",
- "1 100k als 10 0.047296 0.016015 0.043097 0.004579 \n",
- "2 100k random 10 0.016543 0.005566 0.016373 0.001441 \n",
+ "1 100k als 10 0.044374 0.015567 0.040657 0.004202 \n",
+ "2 100k random 10 0.018259 0.006516 0.018537 0.002038 \n",
"\n",
" catalog_coverage distributional_coverage novelty diversity \\\n",
- "1 0.385793 7.967257 11.659776 0.892277 \n",
- "2 0.994489 10.541850 12.136439 0.922613 \n",
+ "1 0.374158 7.989889 11.740626 0.890659 \n",
+ "2 0.998775 10.543160 12.180267 0.923302 \n",
"\n",
" serendipity \n",
- "1 0.878733 \n",
- "2 0.892511 "
+ "1 0.879359 \n",
+ "2 0.892897 "
]
},
+ "execution_count": 20,
"metadata": {},
- "execution_count": 23
+ "output_type": "execute_result"
}
],
- "metadata": {}
+ "source": [
+ "df_results"
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {},
"source": [
"#### Conclusion\n",
"The comparision results show that the ALS recommender outperforms the random recommender on ranking metrics (Precision@k, Recall@k, NDCG@k, and\tMean average precision), while the random recommender outperforms ALS recommender on diversity metrics. This is because ALS is optimized for estimating the item rating as accurate as possible, therefore it performs well on accuracy metrics including rating and ranking metrics. As a side effect, the items being recommended tend to be popular items, which are the items mostly sold or viewed. It leaves the long-tail less popular items having less chance to get introduced to the users. This is the reason why ALS is not performing as well as a random recommender on diversity metrics. "
- ],
- "metadata": {}
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {},
"source": [
"### 6. Calculate diversity metrics using item feature vector based item-item similarity\n",
"In the above section we calculate diversity metrics using item co-occurrence count based item-item similarity. In the scenarios when item features are available, we may want to calculate item-item similarity based on item feature vectors. In this section, we show how to calculate diversity metrics using item feature vector based item-item similarity."
- ],
- "metadata": {}
+ ]
},
{
"cell_type": "code",
- "execution_count": 24,
+ "execution_count": 21,
+ "metadata": {},
+ "outputs": [],
"source": [
"# Get movie features \"title\" and \"genres\"\n",
"movies = (\n",
@@ -799,13 +899,13 @@
" .withColumn(COL_TITLE, F.regexp_replace(F.col(COL_TITLE), \"[\\(),:^0-9]\", \"\")) # remove year from title\n",
" .drop(\"count\") # remove unused columns\n",
")"
- ],
- "outputs": [],
- "metadata": {}
+ ]
},
{
"cell_type": "code",
- "execution_count": 25,
+ "execution_count": 22,
+ "metadata": {},
+ "outputs": [],
"source": [
"# tokenize \"title\" column\n",
"title_tokenizer = Tokenizer(inputCol=COL_TITLE, outputCol=\"title_words\")\n",
@@ -814,13 +914,51 @@
"# remove stop words\n",
"remover = StopWordsRemover(inputCol=\"title_words\", outputCol=\"text\")\n",
"clean_data = remover.transform(tokenized_data).drop(COL_TITLE, \"title_words\")"
- ],
- "outputs": [],
- "metadata": {}
+ ]
},
{
"cell_type": "code",
- "execution_count": 26,
+ "execution_count": 23,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "[Stage 1441:============================================> (172 + 2) / 200]\r"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "+-------+------------------------------------------------------------------------------------+\n",
+ "|MovieId|features |\n",
+ "+-------+------------------------------------------------------------------------------------+\n",
+ "|29 |(1043,[158,269,1025,1026,1029,1031],[1.0,1.0,1.0,1.0,1.0,1.0]) |\n",
+ "|26 |(1043,[54,139,1025],[1.0,1.0,1.0]) |\n",
+ "|1677 |(1043,[260,902,1024],[1.0,1.0,1.0]) |\n",
+ "|964 |(1043,[416,429,1024,1025],[1.0,1.0,1.0,1.0]) |\n",
+ "|474 |(1043,[112,302,329,517,540,787,933,1032,1034],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])|\n",
+ "|1258 |(1043,[114,799,1025,1028],[1.0,1.0,1.0,1.0]) |\n",
+ "|541 |(1043,[635,910,1026,1029],[1.0,1.0,1.0,1.0]) |\n",
+ "|1224 |(1043,[978,1024],[1.0,1.0]) |\n",
+ "|558 |(1043,[231,524,1024,1027,1041],[1.0,1.0,1.0,1.0,1.0]) |\n",
+ "|191 |(1043,[206,1024,1035],[1.0,1.0,1.0]) |\n",
+ "+-------+------------------------------------------------------------------------------------+\n",
+ "only showing top 10 rows\n",
+ "\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "\r",
+ " \r"
+ ]
+ }
+ ],
"source": [
"# convert text input into feature vectors\n",
"\n",
@@ -841,43 +979,36 @@
"feature_data = assembler.transform(vectorized_data).select(COL_ITEM, \"features\")\n",
"\n",
"feature_data.show(10, False)"
- ],
- "outputs": [
- {
- "output_type": "stream",
- "name": "stdout",
- "text": [
- "+------+---------------------------------------------+\n",
- "|ItemId|features |\n",
- "+------+---------------------------------------------+\n",
- "|167 |(1043,[128,544,1025],[1.0,1.0,1.0]) |\n",
- "|1343 |(1043,[38,300,1024],[1.0,1.0,1.0]) |\n",
- "|1607 |(1043,[592,821,1024],[1.0,1.0,1.0]) |\n",
- "|966 |(1043,[389,502,1028],[1.0,1.0,1.0]) |\n",
- "|9 |(1043,[11,342,1014,1024],[1.0,1.0,1.0,1.0]) |\n",
- "|1230 |(1043,[597,740,902,1025],[1.0,1.0,1.0,1.0]) |\n",
- "|1118 |(1043,[702,1025],[1.0,1.0]) |\n",
- "|673 |(1043,[169,690,1027,1040],[1.0,1.0,1.0,1.0]) |\n",
- "|879 |(1043,[909,1026,1027,1034],[1.0,1.0,1.0,1.0])|\n",
- "|66 |(1043,[256,1025,1028],[1.0,1.0,1.0]) |\n",
- "+------+---------------------------------------------+\n",
- "only showing top 10 rows\n",
- "\n"
- ]
- }
- ],
- "metadata": {}
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {},
"source": [
"The *features* column is represented with a SparseVector object. For example, in the feature vector (1043,[128,544,1025],[1.0,1.0,1.0]), 1043 is the vector length, indicating the vector consisting of 1043 item features. The values at index positions 128,544,1025 are 1.0, and the values at other positions are all 0. "
- ],
- "metadata": {}
+ ]
},
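The interpretation above can be verified with a small stdlib-only sketch of how a sparse vector in (size, indices, values) form expands to a dense one (`to_dense` is an illustrative helper, not part of the pyspark API):

```python
# (1043, [128, 544, 1025], [1.0, 1.0, 1.0]) means: a length-1043 vector with
# ones at positions 128, 544, and 1025, and zeros everywhere else
size, indices, values = 1043, [128, 544, 1025], [1.0, 1.0, 1.0]

def to_dense(size, indices, values):
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

dense = to_dense(size, indices, values)
print(len(dense), sum(dense))  # 1043 3.0
```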
{
"cell_type": "code",
- "execution_count": 27,
+ "execution_count": 24,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ " \r"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "0.8742459916963194\n",
+ "0.8891175823541189\n"
+ ]
+ }
+ ],
"source": [
"als_eval = SparkDiversityEvaluation(\n",
" train_df = train_df, \n",
@@ -892,22 +1023,29 @@
"als_serendipity=als_eval.serendipity()\n",
"print(als_diversity)\n",
"print(als_serendipity)"
- ],
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 25,
+ "metadata": {},
"outputs": [
{
+ "name": "stderr",
"output_type": "stream",
+ "text": [
+ " \r"
+ ]
+ },
+ {
"name": "stdout",
+ "output_type": "stream",
"text": [
- "0.8738984131037538\n",
- "0.8873467159479473\n"
+ "0.896073781038039\n",
+ "0.8925253230847529\n"
]
}
],
- "metadata": {}
- },
- {
- "cell_type": "code",
- "execution_count": 28,
"source": [
"random_eval = SparkDiversityEvaluation(\n",
" train_df = train_df, \n",
@@ -922,28 +1060,18 @@
"random_serendipity=random_eval.serendipity()\n",
"print(random_diversity)\n",
"print(random_serendipity)"
- ],
- "outputs": [
- {
- "output_type": "stream",
- "name": "stdout",
- "text": [
- "0.8982144953920664\n",
- "0.8941807579293202\n"
- ]
- }
- ],
- "metadata": {}
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {},
"source": [
"It's interesting that the value of diversity and serendipity changes when using different item-item similarity calculation approach, for both ALS algorithm and random recommender. The diversity and serendipity of random recommender are still higher than ALS algorithm. "
- ],
- "metadata": {}
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {},
"source": [
"### References\n",
"The metric definitions / formulations are based on the following references:\n",
@@ -951,24 +1079,27 @@
"- G. Shani and A. Gunawardana, Evaluating recommendation systems, Recommender Systems Handbook pp. 257-297, 2010.\n",
"- E. Yan, Serendipity: Accuracy’s unpopular best friend in recommender Systems, eugeneyan.com, April 2020\n",
"- Y.C. Zhang, D.Ó. Séaghdha, D. Quercia and T. Jambor, Auralist: introducing serendipity into music recommendation, WSDM 2012\n"
- ],
- "metadata": {}
+ ]
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 26,
+ "metadata": {},
+ "outputs": [],
"source": [
"# cleanup spark instance\n",
"spark.stop()"
- ],
- "outputs": [],
- "metadata": {}
+ ]
}
],
"metadata": {
+ "interpreter": {
+ "hash": "7ec2189bea0434770dca7423a25e631e1cca9c4e2b4ff137a82f4dff32ac9607"
+ },
"kernelspec": {
- "name": "python3",
- "display_name": "Python 3.6.9 64-bit ('.env': venv)"
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
},
"language_info": {
"codemirror_mode": {
@@ -980,12 +1111,9 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.6.9"
- },
- "interpreter": {
- "hash": "7ec2189bea0434770dca7423a25e631e1cca9c4e2b4ff137a82f4dff32ac9607"
+ "version": "3.8.0"
}
},
"nbformat": 4,
"nbformat_minor": 1
-}
\ No newline at end of file
+}
diff --git a/recommenders/utils/spark_utils.py b/recommenders/utils/spark_utils.py
index 42be301cd9..b294879375 100644
--- a/recommenders/utils/spark_utils.py
+++ b/recommenders/utils/spark_utils.py
@@ -9,7 +9,7 @@
except ImportError:
pass # skip this import if we are in pure python environment
-MMLSPARK_PACKAGE = "com.microsoft.ml.spark:mmlspark:1.0.0-rc3-184-3314e164-SNAPSHOT"
+MMLSPARK_PACKAGE = "com.microsoft.azure:synapseml_2.12:0.9.5"
MMLSPARK_REPO = "https://mmlspark.azureedge.net/maven"
# We support Spark v3, but in case you wish to use v2, set
# MMLSPARK_PACKAGE = "com.microsoft.ml.spark:mmlspark_2.11:0.18.1"
diff --git a/setup.py b/setup.py
index 15b7a08aae..5530d36044 100644
--- a/setup.py
+++ b/setup.py
@@ -74,7 +74,7 @@
"spark": [
"databricks_cli>=0.8.6,<1",
"pyarrow>=0.12.1,<7.0.0",
- "pyspark>=2.4.5,<3.2.0",
+ "pyspark>=2.4.5,<4.0.0",
],
"dev": [
"black>=18.6b4,<21",