Update frontmatter with keywords #14475
Your site preview for commit bcb8d18 is ready! 🎉 http://www-testing-pulumi-docs-origin-pr-14475-bcb8d186.s3-website.us-west-2.amazonaws.com.
Your site preview for commit d8f9e17 is ready! 🎉 http://www-testing-pulumi-docs-origin-pr-14475-d8f9e179.s3-website.us-west-2.amazonaws.com.
Your site preview for commit 2efc465 is ready! 🎉 http://www-testing-pulumi-docs-origin-pr-14475-2efc4654.s3-website.us-west-2.amazonaws.com.
Interesting! It doesn't feel like we're quite there yet, as the keywords it's picking out seem too general in a lot of cases, but let's keep playing with it!
(It might be interesting to run it on the blog posts too; those pages have a lot of content that probably isn't making it into the index)
content/docs/esc/cli/_index.md
Outdated
- /docs/esc-cli/
search:
keywords:
- esc_env_version
Do the underscores work in the index? I would have expected "esc env version".
- esc_env
- environments
- env
- environment
- esc
Hmm, this list ends up pretty redundant (I think Algolia can handle plurals?).
It's also missing the obvious "list environments" keyword.
- /docs/esc/sdk/
search:
keywords:
- pulumi
We should probably blocklist "pulumi" so it is never used as a keyword.
- esc
- node
- javascript
- typescript
Surprised we don't get "SDK" or "language" for this page.
The javascript/typescript keywords also seem pretty weak since they are so general.
Your site preview for commit 64df317 is ready! 🎉 http://www-testing-pulumi-docs-origin-pr-14475-64df317c.s3-website.us-west-2.amazonaws.com.
update script
@mjeffryes - I have made some updates to the keyword extraction; see the PR description. This seems better overall IMO. See this PR that generates the keywords across all the /docs pages. I also have it deployed to the Algolia testing index, and it can be tested at pulumi-test.io/docs. @thoward, can you take a look as well? Let me know if this approach even makes sense 🤷. We should probably ignore the CLI docs and any other code-generated docs, since they will be overwritten anyway.
cleanup
Your site preview for commit fdc32ee is ready! 🎉 http://www-testing-pulumi-docs-origin-pr-14475-fdc32ee8.s3-website.us-west-2.amazonaws.com.
Your site preview for commit dd156f6 is ready! 🎉 http://www-testing-pulumi-docs-origin-pr-14475-dd156f65.s3-website.us-west-2.amazonaws.com.
Hmmm... Still doesn't seem like it's making a big difference. I tried all the queries in our spreadsheet and they still all turn up empty. I wonder if it would do better on the blog content? Overall, the keywords it's finding are too generic to be of much value: they are either already in the title and headings or very vague. Maybe this isn't as easy a win as I had thought.
Did you also add the blog keywords in the test index? https://github.com/pulumi/docs/pull/14559/files#diff-8689a0a899150c22622759c7fb62e73306a123f13db65fb0e6052a4e5ff8fca9 shows new keywords that I would think would help retrieve this blog when searching "EKS Auto Mode"
Your site preview for commit a39051f is ready! 🎉 http://www-testing-pulumi-docs-origin-pr-14475-a39051f7.s3-website.us-west-2.amazonaws.com.
Your site preview for commit 3a80f30 is ready! 🎉 http://www-testing-pulumi-docs-origin-pr-14475-3a80f30f.s3-website.us-west-2.amazonaws.com.
@mjeffryes - the blog is now deployed to the testing index along with the docs. Agreed, the keywords are still a bit too generic. I have some ideas that I'll try tomorrow. It does find the EKS Auto Mode blog post now, fwiw.
Generate keywords and add them to frontmatter.
This PR reads in the markdown files and uses the content to generate keywords. It then writes back the frontmatter to the file with keywords list under search.keywords.
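As a rough illustration of the write-back step, here is a minimal sketch that appends a search.keywords list to a page's frontmatter. The function name `add_search_keywords`, the `---` delimiters, and the two-space indentation are assumptions made for this example, not details taken from the PR's script, which may well use a YAML library instead.

```python
def add_search_keywords(markdown, keywords):
    """Return `markdown` with a search.keywords list appended to its frontmatter.

    Hypothetical helper: assumes YAML frontmatter delimited by `---` lines,
    no existing `search` key, and keywords that need no YAML quoting.
    """
    lines = markdown.split("\n")
    if not lines or lines[0] != "---":
        raise ValueError("no frontmatter block found")
    try:
        end = lines.index("---", 1)  # index of the closing delimiter
    except ValueError:
        raise ValueError("frontmatter block is not closed") from None
    if any(line.startswith("search:") for line in lines[1:end]):
        raise ValueError("frontmatter already has a search key")
    block = ["search:", "  keywords:"] + [f"    - {kw}" for kw in keywords]
    # Insert the new block just before the closing delimiter.
    return "\n".join(lines[:end] + block + lines[end:])
```

Refusing to touch a page that already has a `search` key keeps the sketch from silently duplicating keys; merging into an existing block would need real YAML parsing.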
This takes a hybrid approach that combines TF-IDF and KeyBERT. TF-IDF finds keywords by comparing how often a term appears in a given page relative to all the other pages. This identifies the terms that make each page distinct and helps reject noise and common words that carry little meaning. KeyBERT extracts keywords by their semantic relevance to the page. We then take the keywords that both methods extract, so the result should be keywords that are both semantically relevant and distinct across the corpus of pages. Up to 7 keywords are added per page, and keywords common to both methods are prioritized when producing the list.
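The selection logic described here can be sketched in plain Python. This is an illustrative approximation, not the PR's script: `tfidf_keywords` and `merge_keywords` are hypothetical names, the TF-IDF is a simple hand-rolled single-word version, and the `semantic` list stands in for whatever ranked keywords KeyBERT would return. How the list is filled once the shared keywords run out is also an assumption (semantic candidates first, then TF-IDF ones).

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN.findall(text.lower())


def tfidf_keywords(pages, name, top_n=10):
    """Rank the terms of pages[name] by TF-IDF against every page in `pages`."""
    docs = {key: Counter(tokenize(body)) for key, body in pages.items()}
    counts = docs[name]
    total = sum(counts.values())
    scores = {}
    for term, count in counts.items():
        df = sum(1 for c in docs.values() if term in c)  # pages containing term
        scores[term] = (count / total) * math.log(len(docs) / df)
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    # Terms that occur on every page score 0 and are dropped as noise.
    return [t for t in ranked if scores[t] > 0][:top_n]


def merge_keywords(tfidf, semantic, limit=7):
    """Put keywords found by both methods first, then fill up to `limit`.

    `semantic` is a stand-in for a ranked KeyBERT result (assumption).
    """
    tfidf_set = set(tfidf)
    shared = [kw for kw in semantic if kw in tfidf_set]
    merged = []
    for kw in shared + semantic + tfidf:
        if kw not in merged:
            merged.append(kw)
    return merged[:limit]
```

For example, `merge_keywords(["eks", "auto", "mode"], ["mode", "cluster", "eks"])` puts the two shared keywords first and then the remaining candidates, giving `["mode", "eks", "cluster", "auto"]`.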
Outputs using this method to see what this produces:
#14559