Update frontmatter with keywords #14475

Open: sean1588 wants to merge 6 commits into master from sean/add-keywords-frontmatter

sean1588 (Member) commented Mar 21, 2025

Generate keywords and add them to frontmatter.

This PR reads in the markdown files and uses their content to generate keywords. It then writes the frontmatter back to each file, with the keyword list under search.keywords.
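
As a rough illustration of the write-back step, here is a minimal sketch that appends a `search.keywords` block to a page's frontmatter. It is not the code from this PR: the function name `add_search_keywords` is hypothetical, and it assumes the frontmatter is delimited by `---` lines, has no existing `search` key, and that the keywords need no YAML quoting.

```python
from __future__ import annotations


def add_search_keywords(markdown: str, keywords: list[str]) -> str:
    """Append a search.keywords block to frontmatter delimited by '---' lines."""
    lines = markdown.split("\n")
    if not lines or lines[0].strip() != "---":
        raise ValueError("no frontmatter found")
    # Locate the closing '---' delimiter of the frontmatter.
    end = next((i for i in range(1, len(lines)) if lines[i].strip() == "---"), None)
    if end is None:
        raise ValueError("unterminated frontmatter")
    # Assumption: refuse to touch pages that already define a top-level search key.
    if any(line.startswith("search:") for line in lines[1:end]):
        raise ValueError("frontmatter already has a search key")
    block = ["search:", "  keywords:"] + [f"    - {kw}" for kw in keywords]
    # Insert the new block just before the closing delimiter.
    return "\n".join(lines[:end] + block + lines[end:])
```

A real implementation would more likely parse and re-serialize the frontmatter with a YAML library so that existing keys and quoting are handled properly.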

I am taking a hybrid approach that combines TF-IDF and KeyBERT. TF-IDF finds keywords for a page by comparing how often a term appears on that page with how often it appears across all the other pages. This surfaces the terms that make each page distinct and filters out noise and common words that carry little meaning. KeyBERT extracts keywords based on their semantic relevance to the page content. We then combine the two result sets, prioritizing the keywords that are common to both methods, so that the final list favors keywords that are both semantically relevant and distinctive across the corpus of pages. Up to 7 keywords are added per page.
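
To make the combination step concrete, here is a minimal, standard-library-only sketch. It is not the PR's implementation: `tfidf_keywords` and `merge_keywords` are hypothetical names, the tokenizer is a simple regex, and the semantic candidates are passed in as a plain list (in the PR they would come from KeyBERT). The description does not say how the list is filled when fewer than 7 keywords are common to both methods, so topping it up with the remaining candidates is an assumption.

```python
from __future__ import annotations

import math
import re
from collections import Counter


def tfidf_keywords(pages: dict[str, str], page: str, top_n: int = 10) -> list[str]:
    """Rank the words of one page by TF-IDF against the whole corpus."""
    tokens = {name: re.findall(r"[a-z0-9]+", text.lower()) for name, text in pages.items()}
    # Document frequency: the number of pages each word appears in.
    doc_freq = Counter(word for words in tokens.values() for word in set(words))
    term_freq = Counter(tokens[page])
    total = len(tokens[page])
    scores = {
        word: (count / total) * math.log(len(pages) / doc_freq[word])
        for word, count in term_freq.items()
    }
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]


def merge_keywords(tfidf: list[str], semantic: list[str], limit: int = 7) -> list[str]:
    """Put keywords found by both methods first, then the rest, capped at `limit`."""
    semantic_set = set(semantic)
    common = [kw for kw in tfidf if kw in semantic_set]
    # Assumption: remaining slots are filled with TF-IDF candidates, then semantic ones.
    merged = list(dict.fromkeys(common + tfidf + semantic))
    return merged[:limit]
```

A word that appears on every page gets a TF-IDF score of zero, which is the property the description relies on to reject common words.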

Sample output produced by this method:

#14559

sean1588 force-pushed the sean/add-keywords-frontmatter branch from d8f9e17 to 2efc465 on March 21, 2025 01:24
sean1588 requested a review from a team as a code owner on March 21, 2025 01:26
sean1588 requested a review from mjeffryes on March 21, 2025 01:30
sean1588 force-pushed the sean/add-keywords-frontmatter branch from 7ff9b33 to 5f08699 on March 21, 2025 20:43
mjeffryes (Member) left a comment

Interesting! It doesn't feel like we're quite there yet, as the keywords it's picking out seem too general in a lot of cases, but let's keep playing with it!

(It might be interesting to run it on the blog posts too; those pages have a lot of content that probably isn't making it into the index)

- /docs/esc-cli/
search:
keywords:
- esc_env_version

do the underscores work in the index? I would have expected "esc env version"

Comment on lines 5 to 8
- esc_env
- environments
- env
- environment
- esc

Hmm, this list ends up pretty redundant (I think Algolia can handle plurals?).

And it's kind of missing the obvious "list environments" keyword.
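
One way to address the underscore and redundancy comments above would be a small cleanup pass over the generated list. The sketch below is only an illustration under stated assumptions: `clean_keywords` is a hypothetical helper, underscores are replaced with spaces, and plural variants are detected with a crude trailing-`s` rule rather than real stemming. Whether Algolia already handles plurals on its own is left open, as in the comment.

```python
def _singular_key(phrase):
    """Crude comparison key: lowercase and strip a trailing 's' from longer words."""
    words = []
    for word in phrase.lower().split():
        if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
            word = word[:-1]
        words.append(word)
    return " ".join(words)


def clean_keywords(keywords):
    """Replace underscores with spaces and drop plural/singular duplicates."""
    seen = set()
    cleaned = []
    for kw in keywords:
        phrase = " ".join(kw.replace("_", " ").split())
        key = _singular_key(phrase)
        # The key is used only for comparison; the first-seen form is what gets kept.
        if phrase and key not in seen:
            seen.add(key)
            cleaned.append(phrase)
    return cleaned
```

Applied to the list in this diff, the pass would keep "environments" and drop the later "environment". It would not add the missing "list environments" phrase, which has to come from the extraction step itself.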

- /docs/esc/sdk/
search:
keywords:
- pulumi

Probably should blocklist "pulumi" so it isn't used as a keyword.
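
A blocklist like the one suggested here could be applied as a final filter. This is a sketch, not code from the PR: `BLOCKLIST` and `drop_blocklisted` are hypothetical names, and only exact, case-insensitive matches are removed.

```python
# Assumption: "pulumi" is the only term named in the review; extend as needed.
BLOCKLIST = frozenset({"pulumi"})


def drop_blocklisted(keywords, blocklist=BLOCKLIST):
    """Return the keywords whose lowercase form is not in the blocklist."""
    return [kw for kw in keywords if kw.lower() not in blocklist]
```

Multi-word phrases that merely contain a blocked term are kept by this filter, since they may still be distinctive.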

Comment on lines 18 to 21
- esc
- node
- javascript
- typescript

surprised we don't get "SDK" for this page or "language"

The javascript/typescript keywords also seem pretty weak since they are so general.

sean1588 force-pushed the sean/add-keywords-frontmatter branch from 5f08699 to 64df317 on March 25, 2025 17:54
sean1588 force-pushed the sean/add-keywords-frontmatter branch from 64df317 to 6952d88 on March 25, 2025 18:49
sean1588 (Member, Author) commented Mar 25, 2025

@mjeffryes - I have made some updates to the keyword extraction; see the PR description. This seems better overall IMO. See this PR, which generates the keywords across all the /docs pages. I have also deployed it to the Algolia testing index, and it can be tested at pulumi-test.io/docs. @thoward, can you take a look as well? Let me know if this approach even makes sense 🤷.

We should probably ignore the CLI docs and any other docs that are code-generated, since they will be overwritten anyway.

sean1588 requested a review from mjeffryes on March 25, 2025 19:31
sean1588 force-pushed the sean/add-keywords-frontmatter branch from 6952d88 to fdc32ee on March 25, 2025 19:41
cleanup
sean1588 force-pushed the sean/add-keywords-frontmatter branch from fdc32ee to dd156f6 on March 25, 2025 19:42

mjeffryes (Member) commented

> @mjeffryes - I have made some updates to the keyword extraction; see the PR description. This seems better overall IMO. See this PR, which generates the keywords across all the /docs pages. I have also deployed it to the Algolia testing index, and it can be tested at pulumi-test.io/docs. @thoward, can you take a look as well? Let me know if this approach even makes sense 🤷.
>
> We should probably ignore the CLI docs and any other docs that are code-generated, since they will be overwritten anyway.

Hmmm... It still doesn't seem to be making a big difference. I tried all the queries in our spreadsheet and they still all turn up empty. I wonder if it would do better on the blog content? Overall, the keywords it's finding are too generic to be of much value: they are either already in the title and headings or very vague. Maybe this isn't as easy a win as I had thought.

mjeffryes (Member) commented

Did you also add the blog keywords in the test index? https://github.com/pulumi/docs/pull/14559/files#diff-8689a0a899150c22622759c7fb62e73306a123f13db65fb0e6052a4e5ff8fca9 shows new keywords that I would think would help retrieve this blog when searching "EKS Auto Mode"

sean1588 (Member, Author) commented

@mjeffryes - the blog is now deployed to the testing index as well as the docs. Yeah, agreed, still a bit too generic. I have some ideas that I'll try tomorrow. It does identify that EKS Auto Mode blog post now, FWIW.
