Build a search index of API objects #4289

Open
ericholscher opened this Issue Jun 22, 2018 · 3 comments

ericholscher commented Jun 22, 2018

Currently our search index only includes parsed HTML from the pages. We don't include any additional semantic information about anything.

I'd really like to index Python Domain objects, including information about their type.

Querying

This would allow us to support queries like:

  • type:class Project
  • type:method get

It would also allow us to create an autocomplete API, so you could search Project.get_ab and it would suggest Project.get_absolute_url. This would be great for libraries.
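As a rough illustration of the two query features above, here is a minimal Python 3 sketch. The function names `parse_query` and `autocomplete`, and the idea of keeping object names in a sorted list, are assumptions made for the example; a real implementation would more likely delegate both to the search backend.

```python
from bisect import bisect_left
from typing import NamedTuple, Optional


class ParsedQuery(NamedTuple):
    object_type: Optional[str]  # e.g. "class" or "method"; None if no filter given
    text: str                   # the remaining free-text part of the query


def parse_query(raw: str) -> ParsedQuery:
    """Split a query such as 'type:class Project' into a type filter and text."""
    object_type = None
    words = []
    for token in raw.split():
        if token.startswith("type:") and len(token) > len("type:"):
            object_type = token[len("type:"):].lower()
        else:
            words.append(token)
    return ParsedQuery(object_type, " ".join(words))


def autocomplete(sorted_names: list, prefix: str, limit: int = 5) -> list:
    """Return up to `limit` names that start with `prefix`.

    `sorted_names` must be sorted, so that all names sharing a prefix are
    contiguous and can be located with a binary search.
    """
    start = bisect_left(sorted_names, prefix)
    matches = []
    for name in sorted_names[start:]:
        if not name.startswith(prefix) or len(matches) >= limit:
            break
        matches.append(name)
    return matches
```

With a sorted list containing `Project.get_absolute_url`, calling `autocomplete(names, "Project.get_ab")` would suggest that name, and `parse_query("type:method get")` separates the `method` filter from the search text `get`.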

Indexing

In order to do this, we need to improve our indexing of Sphinx objects. It would likely require outputting the in-memory Domain objects into JSON, and then indexing them in the same fashion that we do now. It wouldn't be too difficult to add to our current JSON search output tooling, and it would allow us to get much richer search results for Python classes.
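A minimal sketch of the JSON step, in Python 3. It assumes the objects arrive as the 6-tuples that a Sphinx domain's `get_objects()` yields (name, display name, type, docname, anchor, priority). The function name `domain_objects_to_json` and the output field names are made up for illustration and are not part of the existing tooling.

```python
import json


def domain_objects_to_json(objects, domain="py"):
    """Serialize (name, dispname, type, docname, anchor, priority) tuples to JSON."""
    records = []
    for name, dispname, obj_type, docname, anchor, priority in objects:
        records.append({
            "name": name,              # fully qualified object name
            "display_name": dispname,  # name as shown to the reader
            "domain": domain,          # e.g. "py" for the Python domain
            "type": obj_type,          # e.g. "class", "method", "function"
            "docname": docname,        # document that defines the object
            "anchor": anchor,          # anchor of the object within that document
        })
    return json.dumps(records, indent=2, sort_keys=True)
```

Each record keeps the object type next to the name, which is what a `type:class` style filter would need at query time.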

UI

We would also need to improve our search results pages so that they show more valuable type information. The results would then differentiate between an API listing result and a primary page result where the name is mentioned.
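One way to picture that distinction, as a small Python 3 sketch: it assumes each hit is a dict, and that API object hits carry `type` and `name` keys while page hits carry a `title`. Those field names and both helper functions are hypothetical.

```python
def group_results(hits):
    """Split hits into API-object results and ordinary page results."""
    api_results, page_results = [], []
    for hit in hits:
        if hit.get("type"):
            api_results.append(hit)
        else:
            page_results.append(hit)
    return api_results, page_results


def label(hit):
    """Build a short label that makes the kind of result visible."""
    if hit.get("type"):
        return "[{}] {}".format(hit["type"], hit["name"])
    return "[page] {}".format(hit["title"])
```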

@ericholscher ericholscher added this to Backlog in Search update via automation Jun 22, 2018

ericholscher commented Jul 31, 2018

One approach would be to use the objects.inv data that Sphinx outputs. We had old code in RTD that parsed this data, which you can see here:

    def update_intersphinx(version_pk):
        version = Version.objects.get(pk=version_pk)
        path = version.project.find('objects.inv', version.slug)
        if not path:
            print "ERR: %s has no path" % version
            return None
        app = DictObj()
        app.srcdir = path
        try:
            inv = fetch_inventory(app, app.srcdir, 'objects.inv')
        except TypeError:
            print "Failed to fetch inventory for %s" % version
            return None
        # I'm entirely not sure this is even close to correct.
        # There's a lot of info I'm throwing away here; revisit later?
        for keytype in inv:
            for term in inv[keytype]:
                try:
                    _, _, url, title = inv[keytype][term]
                    if not title or title == '-':
                        if '#' in url:
                            title = url.rsplit('#')[-1]
                        else:
                            title = url
                    find_str = "rtd-builds/latest"
                    latest = url.find(find_str)
                    url = url[latest + len(find_str) + 1:]
                    url = "http://%s.readthedocs.org/en/latest/%s" % (
                        version.project.slug, url)
                    save_term(version, term, url, title)
                    if '.' in term:
                        save_term(version, term.split('.')[-1], url, title)
                except Exception, e: #Yes, I'm an evil person.
                    print "*** Failed updating %s: %s" % (term, e)
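For comparison, here is a self-contained Python 3 sketch that reads an objects.inv file directly, without Sphinx or the RTD models. It is not the RTD code above. It assumes the version 2 inventory layout: four plain-text header lines followed by a zlib-compressed body, one object per line in the form `name domain:role priority uri dispname`. The function name `parse_inventory` and the keys of the yielded dicts are made up for the example.

```python
import re
import zlib

# Assumed line layout in a version 2 inventory body:
#   name domain:role priority uri dispname
_LINE_RE = re.compile(r"(.+?)\s+(\S+)\s+(-?\d+)\s+?(\S*)\s+(.*)")


def parse_inventory(data: bytes):
    """Yield one dict per object from the raw bytes of a v2 objects.inv file."""
    header, _, rest = data.partition(b"\n")
    if not header.startswith(b"# Sphinx inventory version 2"):
        raise ValueError("only version 2 inventories are handled in this sketch")
    # Skip the three remaining header lines (project, version, zlib notice).
    for _ in range(3):
        _, _, rest = rest.partition(b"\n")
    body = zlib.decompress(rest).decode("utf-8")
    for line in body.splitlines():
        match = _LINE_RE.match(line.rstrip())
        if not match:
            continue  # ignore lines that do not fit the expected layout
        name, role, priority, uri, dispname = match.groups()
        domain, _, obj_type = role.partition(":")
        if uri.endswith("$"):   # a trailing "$" is shorthand for the object name
            uri = uri[:-1] + name
        if dispname == "-":     # "-" means the display name equals the name
            dispname = name
        yield {
            "name": name,
            "domain": domain,
            "type": obj_type,
            "priority": int(priority),
            "uri": uri,
            "display_name": dispname,
        }
```

Unlike the old code, this keeps the domain and object type for every entry, which is the information the `type:` queries in this issue depend on.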

ericholscher commented Aug 1, 2018

Another approach would be to look at how Sphinx generates its search index, as it's already indexing this data in a good manner.


ericholscher commented Jan 31, 2019
