SAR: added pySpark notebook #90

Merged: maxkazmsft merged 4 commits into staging from maxkaz/pyspark_notebook on Oct 23, 2018.
Conversation

maxkazmsft (Collaborator, Author):

No description provided.

notebooks/00_quick_start/sar_pyspark_movielens.ipynb (outdated):
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3. In order to use SAR, we need to hash users and items and make sure there are no cold users"
Collaborator:

This looks more like a comment than a title.

maxkazmsft (Author):

OK

}
],
"source": [
"print(\"Obtaining all users and items \")\n",
Collaborator:

Instead of printing the text, I would add the text as markdown.
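For instance (illustrative), drop the print call and put the text in a markdown cell above the code:

    ### Obtain all users and items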

maxkazmsft (Author):

OK

}
],
"source": [
"print(\"Model:\\t\" + model.model_str,\n",
Collaborator:

Small detail: the names of the metrics in the other notebook are slightly different:

    Model:        sar_ref
    Top K:
    MAP@k:
    NDCG@k:
    Precision@k:
    Recall@k:
maxkazmsft (Author):

Yeah, I opened an issue about this already.

Collaborator:

@maxkazmsft I was trying to find the issue, but I couldn't. Could you please point me to it?

miguelgfierro (Collaborator):

This is great, added more reviewers.

"cell_type": "markdown",
"metadata": {},
"source": [
"### 0. Set up Spark context"
Collaborator:

Minor point about the style of the markdown text:

  • Use two #s for level-1 headings.
  • No dot after the number.
  • For one level down, connect the numbers with a dot in the heading, e.g. "### 1.2 Model training" (see the sketch below).
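Applied to a notebook like this one, that scheme would look as follows (the section titles here are illustrative, not taken from the notebook):

    ## 1 Prepare data
    ### 1.1 Load MovieLens
    ### 1.2 Model training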

maxkazmsft (Author):

I agree - I copied Nikhil's notebook, and adding subsections makes the notebook more readable. We should really have a way to keep all notebooks in sync - please see my earlier comment.

}
],
"source": [
"schema = StructType((StructField(\"UserId\", StringType()),\n",
Collaborator:

Minor points on coding style (see the sketch below).
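For illustration only, one conventional way to lay out such a schema; every column except UserId is an assumption here, following the usual MovieLens fields:

    from pyspark.sql.types import (FloatType, StringType, StructField,
                                   StructType)

    # One StructField per line, passed as a list rather than a tuple.
    # Column names other than UserId are assumed for the example.
    schema = StructType([
        StructField("UserId", StringType()),
        StructField("MovieId", StringType()),
        StructField("Rating", FloatType()),
        StructField("Timestamp", StringType()),
    ])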

maxkazmsft (Author):

Fixed.

}
],
"source": [
"train, test = spark_random_split(data)\n",
Collaborator:

Would it be better to make the split ratio explicit?
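Something like the following, as a sketch: this assumes spark_random_split accepts ratio and seed keyword arguments, and 0.75 is an arbitrary stand-in:

    # Make the train/test proportions visible at the call site.
    train, test = spark_random_split(data, ratio=0.75, seed=42)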

],
"source": [
"model.fit(train_indexed)\n",
"top_k = model.recommend_k_items(test_indexed)"
Collaborator:

Better to have k explicit here?
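For instance, as a sketch (assuming recommend_k_items exposes a top_k keyword; 10 is an arbitrary choice):

    # Spell out how many items are recommended per user.
    top_k = model.recommend_k_items(test_indexed, top_k=10)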

miguelgfierro added this to the Initial MVP milestone on Oct 22, 2018.
miguelgfierro (Collaborator):

The Spark tests are failing due to issue #75.

yueguoguo (Collaborator) commented Oct 23, 2018:

The problem is that the notebook metadata, where the kernel spec is given, defines the kernel name as reco instead of the one used on our CICD machine. See lines 557 and 559 of sar_pyspark_movielens.ipynb, where the metadata specifies the kernel info. This metadata is created when the notebook is created, and it matches the conda env kernel on the user's machine - in our SETUP instructions, the last step is to install the created conda env kernel into ipython.
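For reference, a sketch of what that kernelspec metadata typically looks like inside the .ipynb JSON (the display_name is a guess; the name field is what has to match an installed kernel):

    "metadata": {
      "kernelspec": {
        "display_name": "Python (reco)",
        "language": "python",
        "name": "reco"
      }
    }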

To resolve the issue, we can run nbconvert and execute the notebook with the kernel name specified via the ExecutePreprocessor traitlets. The kernel name in the traitlet should match the one on the CICD machine; it overrides the one specified in the notebook metadata. E.g., to run a notebook with a specific kernel:

    jupyter nbconvert --ExecutePreprocessor.kernel_name="recommender" --execute sar_python_cpu_movielens.ipynb

maxkazmsft merged commit 862684f into staging on Oct 23, 2018, and deleted the maxkaz/pyspark_notebook branch.
WessZumino pushed a commit referencing this pull request on Nov 28, 2018.