Tag completion difficulties #137

Closed · Malabarba opened this issue Dec 4, 2014 · 16 comments

@Malabarba (Collaborator) commented Dec 4, 2014

I'm writing up tag completion, and I noticed a pretty big obstacle.

I originally thought we could just get a list of all tags upon first entering compose-mode, and then offer completion on that. However, queries are limited to return 100 items, and SO has almost 40,000 tags.
Obviously, doing 400 requests is absolutely unacceptable.

Here's what I'm thinking of doing now:

As the user types a tag, fetch the top 100 tags matching the string typed so far (a single query) and offer those as completions. This implies one API request each time a completion menu is displayed.

That should be a lot better than 400, but on sites with fewer tags (emacs.SE has only a few hundred) this will be a lot less efficient than just getting all tags from the start.
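
For concreteness, here's an untested sketch of that query. sx-method-call stands in for whatever our request helper ends up looking like, and the parameters are the API's /tags filters:

    ;; Untested sketch. `sx-method-call' is assumed to take a method
    ;; name, an alist of API parameters, and a site; the real helper's
    ;; signature may differ.
    (defun sx-tag--complete (site prefix)
      "Return up to 100 popular tag names on SITE matching PREFIX."
      (mapcar (lambda (tag) (cdr (assq 'name tag)))
              (sx-method-call "tags"
                              `((inname . ,prefix)
                                (pagesize . 100)
                                (sort . popular)
                                (order . desc))
                              site)))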

@Malabarba changed the title from "Tag completion" to "Tag completion difficulties" on Dec 4, 2014
@vermiculus (Owner) commented Dec 4, 2014

I'm working on a :retrieve-all keyword argument that will simply retrieve all pages in succession. It's not turning out to be too terrible, but it has its challenges. It doesn't solve the problem of larger sites like SO, but we can check .total in the wrapper to decide whether full retrieval is sensible (under some threshold, I imagine).
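
The core of it is just a page loop, something like this (sketch only; sx-request-page is a hypothetical stand-in for our request layer):

    ;; Sketch of the :retrieve-all page loop. `sx-request-page' is a
    ;; hypothetical stand-in that returns the raw response wrapper for
    ;; one page of METHOD on SITE.
    (defun sx-method--retrieve-all (method site)
      "Fetch every page of METHOD on SITE and return all the items."
      (let ((page 1) (more t) items)
        (while more
          (let ((wrapper (sx-request-page method site page)))
            (setq items (append items (cdr (assq 'items wrapper))))
            ;; The wrapper's `has_more' field says whether to keep going.
            (setq more (eq (cdr (assq 'has_more wrapper)) t))
            (setq page (1+ page))))
        items))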


I think we can use the data dump for completion on the larger sites. We can strip out the information we need (just a list of tags would suffice, I think) and embed it. It might be worthwhile to keep that in a separate repository, though, included as a submodule.

@vermiculus self-assigned this on Dec 4, 2014
@Malabarba (Collaborator, Author) commented Dec 4, 2014

Here's another option.
We could write a bot that gathers a list of all tags from each site (every hour or so) and pushes it to a data branch here on this repo (one file for each site).

Then the client can just download a single file here from github, and that will contain all tags for the site. It should even be faster than the API, since this file will only contain names (not entire objects).
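
The client side would then be a single GET per site. A rough sketch, where the branch and path layout are made up:

    ;; Sketch of the client side. The branch/path layout here is
    ;; hypothetical; whatever we settle on, it's one GET per site.
    (require 'url)

    (defun sx-tag--download (site)
      "Download the pre-built tag list for SITE and return it as a list."
      (with-current-buffer
          (url-retrieve-synchronously
           (format "https://raw.githubusercontent.com/vermiculus/sx.el/data/tags/%s.el"
                   site))
        (goto-char (point-min))
        (search-forward "\n\n")            ; skip the HTTP headers
        (prog1 (read (current-buffer))
          (kill-buffer))))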

@vermiculus (Owner) commented Dec 4, 2014

That's an excellent idea! How would this bot be implemented?

My only concern with adding it to this repo would be an exposed write-key, but there might be a way around that.

@Malabarba (Collaborator, Author) commented Dec 4, 2014

The sx-method keyword you were talking about (for getting all pages at once) would be a start.
Then we'd write a simple emacs-lisp function that goes through all the sites, gets each one's list of tags, and saves each list to a file.
Then we'd write a bash script that pulls the repo, runs the elisp function, and commits and pushes the result (to some dedicated branch, of course).
Then somebody would run this bash script every hour or so.
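
The elisp half might be little more than this (sketch; sx-site-get-api-tokens and sx-tag--fetch-all are the helpers I'm assuming here, not final names):

    ;; Sketch of the bot's elisp half. `sx-site-get-api-tokens' is
    ;; assumed to return the list of site slugs; `sx-tag--fetch-all'
    ;; is the retrieve-all call we still need.
    (defun sx-bot-write-tag-files (dir)
      "Fetch the full tag list of every site and save each one to DIR."
      (dolist (site (sx-site-get-api-tokens))
        (with-temp-file (expand-file-name (concat site ".el") dir)
          (prin1 (sx-tag--fetch-all site) (current-buffer)))))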

I can use my work PC for that last part. I already use it for paradox and the archive tracker anyway.

@Malabarba (Collaborator, Author) commented Dec 4, 2014

As for the write key, I think a read key would be enough, and we can leave the key as an environment variable or a command line argument. Something that won't go in the repo.

@vermiculus (Owner) commented Dec 4, 2014

Wouldn't we become throttled in very short order? Also, I think pushing to gh-pages may be a good idea -- that way, the page can just be downloaded.

@Malabarba (Collaborator, Author) commented Dec 5, 2014

Actually, any file in any branch on GitHub is directly accessible via the right URL. Just visit the file's page here on the website and click the "Raw" button; the resulting page has the URL you're looking for.

Still, it's very possible that the website generated from the gh-pages branch has a better response time or something. So I agree we should use that.

As for throttling: I agree we should use whichever token gives the best quota, and then adjust the frequency to avoid throttling.
I don't know how many requests the whole thing will need, but I estimate it's on the order of a few thousand.

@vermiculus (Owner) commented Dec 5, 2014

I did a bit of rough math with the tag pages and came up with the following estimates:

 41200 stackoverflow
  7300 superuser
  6700 serverfault
  1600 tex
  3700 ubuntu
390000 all other sites (estimate: 260 sites * 1500 tags/site)
---------------------
450000 Total

So, at 100 tags per request, it would take about 4500 requests to do a full pass. At one request every ten seconds, that's about 12.5 hours, so we could update roughly twice a day.


Note: We cannot query the API for tags as a search; see this comment. (This has since been corrected; the comment thread has details.)

@Malabarba (Collaborator, Author) commented Dec 5, 2014

Yes, the inname API call is what I had in mind in the last two paragraphs of the first post of this issue.

It's a viable option, just not as nice as knowing all tags.

@vermiculus (Owner) commented Dec 5, 2014

> As for the write key, I think a read key would be enough, and we can leave the key as an environment variable or a command line argument. Something that won't go in the repo.

I'm talking about writing to the repository ;) but yes, it could be a command-line argument.

@Malabarba (Collaborator, Author) commented Dec 5, 2014

Aah, yes, we'd need a key for the repo. But if we run the bot on one of our own computers, the SSH key we already have configured should do.

I pushed a branch called completing-tags. There's nothing about completion there yet, but it has a few initial tag-related functions. And it implements a safety check in the compose buffer that prevents submitting nonexistent tags.
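
The check is roughly this shape (sketch; the names are placeholders, not necessarily what's on the branch):

    ;; Rough shape of the pre-submission safety check.
    (defun sx-tag--check-all (site tags)
      "Signal a `user-error' if any of TAGS does not exist on SITE."
      (let ((known (sx-tag--get-all site)))
        (dolist (tag tags)
          (unless (member tag known)
            (user-error "No such tag on %s: %s" site tag)))))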

I won't be able to touch anything for the next several days (no internet), so feel free to use that branch. I'll get back in touch before doing anything else.

@Malabarba (Collaborator, Author) commented Dec 8, 2014

Ok, I've written up the bot. It's on the tag-bot branch.

It's an elisp file and a shell script.
The only thing missing for it to work is a function for fetching all tags from a site.

@vermiculus (Owner) commented Dec 8, 2014

Working on it. I've been awfully busy the past couple days with work stuff :( I'll see if I can't finish up that keyword tonight :)

@Malabarba (Collaborator, Author) commented Dec 8, 2014

No worries. I had some free time on the plane, so I decided to get this done. :)

@Malabarba mentioned this issue on Dec 28, 2014
@Malabarba added this to the v0.2 milestone on Jan 4, 2015
@Malabarba (Collaborator, Author) commented Jan 5, 2015

Alright, now that the bot is done, we can start doing tag completion.
As per the Gitter discussion, we'll start with completing-read.

I'll write a command that reads from the minibuffer and inserts a tag.
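
Something like this sketch, where the names are placeholders and sx-compose--site is a hypothetical buffer-local holding the current site:

    ;; Sketch of the minibuffer command (placeholder names).
    (defun sx-tag-insert (site)
      "Read a tag for SITE with completion and insert it at point."
      (interactive (list sx-compose--site)) ; hypothetical buffer-local
      (insert "[" (completing-read "Tag: " (sx-tag--get-all site)) "]"))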

@Malabarba self-assigned this and unassigned @vermiculus on Jan 5, 2015
@vermiculus (Owner) commented Jan 5, 2015

Just as a note: it doesn't actually take 4500 requests to do this. Implementation tests reveal it takes around 1450, likely due to the number of tag synonyms on the Trilogy sites.

@Malabarba closed this in f774958 on Jan 7, 2015