Add Categorization! #473

Merged
merged 34 commits into from Dec 29, 2016

Conversation

@carols10cents
Member

carols10cents commented Nov 17, 2016

Hiiii! This is the first part of adding categories to crates.io!

What this PR does

  • On server startup, take the list of categories in src/categories.txt and add or remove categories from the database as needed to make the database match the categories. This way, the list of available categories can be changed via pull request to this repo.
  • A /categories page not yet linked from anywhere that displays all the categories in alphabetical order
  • A /categories/whatever page that will list all the crates in a category
  • If a crate has categories, they will be listed on that crate's page in the sidebar under Keywords
  • If categories are specified in the metadata uploaded with a new crate request from cargo, they will be added to the crate. (Implemented in rust-lang/cargo#3301)
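The sync step in the first bullet boils down to a set difference between the file and the database. A minimal sketch, assuming the categories are compared as plain strings; `SyncPlan` and `plan_sync` are hypothetical names for illustration, not the code in this PR:

```rust
use std::collections::HashSet;

/// What a sync pass would need to do to make the database match the file.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq)]
pub struct SyncPlan {
    pub to_add: Vec<String>,
    pub to_remove: Vec<String>,
}

/// Compare the categories listed in the file with those already stored.
/// Both lists are sorted so the resulting plan is deterministic.
pub fn plan_sync(in_file: &[&str], in_db: &[&str]) -> SyncPlan {
    let file: HashSet<&str> = in_file.iter().copied().collect();
    let db: HashSet<&str> = in_db.iter().copied().collect();

    // In the file but not in the database: must be inserted.
    let mut to_add: Vec<String> = file.difference(&db).map(|s| s.to_string()).collect();
    // In the database but no longer in the file: must be deleted.
    let mut to_remove: Vec<String> = db.difference(&file).map(|s| s.to_string()).collect();
    to_add.sort();
    to_remove.sort();

    SyncPlan { to_add, to_remove }
}
```

Running this at startup and then applying `to_add` and `to_remove` would keep the table in step with whatever the latest merged pull request put in the file.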

Testing this PR

$ cargo build
$ cargo run --bin migrate
$ cargo run --bin sync-categories
$ cargo run --bin server

Start the frontend as well (Note: I had to use ember server --proxy http://127.0.0.1:8888 instead of yarn run start:local with the latest version of yarn in order to actually use my local backend, I'm investigating why this is the case)

  • You should be able to go to /categories and see 2 categories, "Development Tools" and "Libraries"
  • You should be able to click on those categories and see 0 crates. They each have subcategories, that should also have 0 crates.
  • Publish a crate to your local crates.io (I had to comment out code uploading to s3 since I didn't want to do that)
  • Using rust-lang/cargo#3301, publish to your local index
  • You should see your crate on that category's page
  • You should see the category on the crate's page (might have to do a shift-reload because of ember caching)

Deploying this PR

It should be totally fine to deploy this PR to crates.io. Nothing links to /categories, and cargo does not publish category metadata yet-- and neither cargo nor crates.io will complain if a crate is not in any categories.

After this PR is merged

  • I will start a PR to categories.txt where we can bikeshed what we actually want the initial set of categories to be
  • I will open a PR to add a link to /categories somewhere
  • I will send some popular crates PRs to add categories
@carols10cents
Member Author

carols10cents commented Nov 17, 2016

Ummmmm I have no idea why the tests are failing with "Once instance has previously been poisoned" :(

@carols10cents
Member Author

carols10cents commented Nov 18, 2016

Aaaand I forgot to add support for subcategories, just realized that. I'd love any thoughts in the meantime, but I'm going to be adding a few commits :)

Member

alexcrichton left a comment

Awesome! I think the test error is fixed on master, so a rebase should pick that up. I'm also fine with the rollout strategy here, sounds good to me.

Some other thoughts of mine:

  • I wonder if there's a warning message we could provide back to Cargo for nonexistent categories? Either for a typo'd category or for just picking something that doesn't exist.
  • Right now we can't update a crate's metadata without publishing a new version, the addition of this feature is unfortunately likely to exacerbate this problem. Not something that needs to be fixed here per se, but in theory it's not too hard to add a cargo subcommand to update crate metadata...
ALTER TABLE crates_categories \
ADD CONSTRAINT fk_crates_categories_category_id \
FOREIGN KEY (category_id) REFERENCES categories (id) \
ON DELETE CASCADE", " \

@alexcrichton

alexcrichton Nov 21, 2016

Member

This just means if we delete a category it'll auto-delete everything from crates_categories, right?

@carols10cents

carols10cents Nov 21, 2016

Author Member

Correct, is that ok?

@alexcrichton

alexcrichton Nov 21, 2016

Member

Oh yeah fine by me, just not used to fancy sql features :)

let in_clause = categories.iter()
    .map(|c| format!("'{}'", c))
    .collect::<Vec<_>>()
    .join(",");

@alexcrichton

alexcrichton Nov 21, 2016

Member

Perhaps just for the sake of hygiene, but could we use rust-postgres's ability for escaping here? (rather than doing so ourselves).

I do realize though that categories.txt isn't user input, I figure it's just a good idea to stick with built-in escaping.

@carols10cents

carols10cents Nov 21, 2016

Author Member

rust-postgres doesn't support this for the bulk import of multiple values that I wanted to do here, see sfackler/rust-postgres#218. I can change the WHERE NOT IN to use rust-postgres' escaping, though.

Another option is to make one insert for every category... this doesn't really NEED to be fast... wdyt?

@alexcrichton

alexcrichton Nov 21, 2016

Member

Eh I ended up seeing more of these throughout the codebase anyway (which were all reasonable), so it's fine to ignore this. A minor "concern" anyway.
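For reference, one way to get built-in escaping for the WHERE NOT IN part is to generate numbered placeholders and bind each category as a parameter. This is a sketch only: `placeholders` and `delete_unlisted_sql` are hypothetical helpers, and the DELETE statement is an assumption about what the query looks like; only the `categories` table and `category` column come from this PR.

```rust
/// Build "$1,$2,...,$n" for use inside an `IN (...)` list.
pub fn placeholders(n: usize) -> String {
    (1..=n)
        .map(|i| format!("${}", i))
        .collect::<Vec<_>>()
        .join(",")
}

/// Build a statement that removes categories not in the bound list.
/// The category values themselves are passed as query parameters,
/// so no manual quoting is needed. Callers should skip the query
/// when `n` is 0, because `NOT IN ()` is not valid SQL.
pub fn delete_unlisted_sql(n: usize) -> String {
    format!(
        "DELETE FROM categories WHERE category NOT IN ({})",
        placeholders(n)
    )
}
```

The generated string contains only placeholders, so an apostrophe in a category name cannot break out of the statement.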

pub fn find_by_category(conn: &GenericConnection, name: &str)
-> CargoResult<Option<Category>> {
let stmt = try!(conn.prepare("SELECT * FROM categories \
WHERE category = $1"));

@alexcrichton

alexcrichton Nov 21, 2016

Member

Could you also add an index to the database for this category field? (unless I already missed it)

Also, perhaps the index and this could be based on lower(category) and lower($1) so we don't have to worry about case issues?

@carols10cents

carols10cents Nov 21, 2016

Author Member

Postgres automatically adds an index for UNIQUE fields (see https://www.postgresql.org/docs/current/static/ddl-constraints.html 5.3.3. Unique Constraints, "Adding a unique constraint will automatically create a unique B-tree index on the column or group of columns listed in the constraint.").

I can make it lower(category) though! I was thinking exact should mean exact, but it would probably be a bit friendlier to not care about case.

@alexcrichton

alexcrichton Nov 21, 2016

Member

Postgres foils me again! If we add slugs then I think this becomes a non-issue, we can just ensure that all slugs are always lowercase.

pub fn encodable(self) -> EncodableCategory {
let Category { id: _, crates_cnt, category, created_at } = self;
EncodableCategory {
id: category.clone(),

@alexcrichton

alexcrichton Nov 21, 2016

Member

Could this be the lowercase version to be "pretty"?

Although now that I think more about this, maybe we should have two fields in the categories table. One for a slug (url friendly, this field) and another for the name? That way we can support categories with punctuation and spaces and such, all while keeping a nice url.

@carols10cents

carols10cents Nov 21, 2016

Author Member

Sure! Should categories.txt still list only the title-cased, punctuated version, and then the slug is lowercased, replace all spaces with hyphens, and remove other punctuation? Or should categories.txt list the slug as well?

If people specify the slug as their crate's category value, should that be valid?
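The derivation rule described above (lowercase, spaces to hyphens, other punctuation removed) could look roughly like this. `slugify` is a hypothetical helper; treating existing hyphens as kept and restricting output to ASCII are assumptions of this sketch:

```rust
/// Derive a URL-friendly slug from a display name:
/// ASCII letters and digits are lowercased, spaces and hyphens
/// become hyphens, and every other character is dropped.
pub fn slugify(name: &str) -> String {
    name.chars()
        .filter_map(|c| {
            if c.is_ascii_alphanumeric() {
                Some(c.to_ascii_lowercase())
            } else if c == ' ' || c == '-' {
                Some('-')
            } else {
                None // drop other punctuation
            }
        })
        .collect()
}
```

Note that a name such as "A & B" would produce a doubled hyphen under this rule, which is one argument for listing slugs explicitly in categories.txt instead of deriving them.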

@alexcrichton

alexcrichton Nov 21, 2016

Member

Let's go with a format like:

slug the name of this category is everything to the end of the line
slug2 another category

(or something like that)

Also hm that is a good point about what you specify in the manifest. I'm tempted to say slugs, not names? That way we can tweak names as we see fit (maybe even localize them one day).

@carols10cents

carols10cents Nov 22, 2016

Author Member

Ok, if that's the case, then as a crate author I'd want to be able to see what the valid slugs are. I'm going to make a page like pypi has that's a plain text list of the exact valid category specifiers. We can reference that in the warning message when someone specifies an invalid slug too!


impl Category {
pub fn find_by_category(conn: &GenericConnection, name: &str)
-> CargoResult<Option<Category>> {

@alexcrichton

alexcrichton Nov 21, 2016

Member

If you want to change this to just return CargoResult<Category> and internalize the chain_error(|| NotFound) that'd also be fine.

@carols10cents

carols10cents Nov 21, 2016

Author Member

I can do that!
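The shape of that change, shown with standard-library types only: the lookup that returns an `Option` is wrapped so callers get a `Result` and a not-found error directly. `Category`, `LookupError`, and both functions here are simplified stand-ins, not the crates.io types, and `ok_or` stands in for `chain_error`:

```rust
/// Simplified stand-in for the real model (hypothetical).
#[derive(Debug, PartialEq)]
pub struct Category {
    pub slug: String,
}

/// Simplified stand-in for the real error type (hypothetical).
#[derive(Debug, PartialEq)]
pub enum LookupError {
    NotFound,
}

/// The "before" shape: absence is reported as `None`.
fn find_optional<'a>(all: &'a [Category], slug: &str) -> Option<&'a Category> {
    all.iter().find(|c| c.slug == slug)
}

/// The "after" shape: the not-found case is converted to an error
/// inside the lookup, so every caller does not have to repeat it.
pub fn find_by_slug<'a>(all: &'a [Category], slug: &str) -> Result<&'a Category, LookupError> {
    find_optional(all, slug).ok_or(LookupError::NotFound)
}
```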

let new_categories = try!(
Category::find_all_by_category(conn, categories)
);
let new_categories_ids: HashSet<_> = new_categories.iter().map(|cat| {

@alexcrichton

alexcrichton Nov 21, 2016

Member

It may be a good idea here to put a limit on the number of new categories that can be added. Something high like 50 or 100 but just don't want to blow out the database or anything like that.

@carols10cents

carols10cents Nov 21, 2016

Author Member

Hm, is that not taken care of by Decode in upload.rs before getting here? Originally, we weren't going to add a limit on categories, but then I saw keywords are limited to 5, and that seemed reasonable, so I did that as well. I'm happy to add a limit here too, though, preventing database overload is good! Just want to check that adding logic here isn't being too paranoid given upload.rs...

@alexcrichton

alexcrichton Nov 21, 2016

Member

Oh yup that'd do it, missed that before I got here, sounds good to me!

@alexcrichton

alexcrichton Nov 21, 2016

Member

(that's the better location for a validation imo)
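A check of this kind at decode time might be sketched as follows. The limit of 5 mirrors the keyword limit mentioned above; `MAX_CATEGORIES`, `check_category_count`, and the error text are hypothetical and not taken from upload.rs:

```rust
/// Upper bound on categories per crate (mirrors the keyword limit of 5).
pub const MAX_CATEGORIES: usize = 5;

/// Reject an upload that lists too many categories before any
/// database work is attempted.
pub fn check_category_count(categories: &[String]) -> Result<(), String> {
    if categories.len() > MAX_CATEGORIES {
        Err(format!(
            "expected at most {} categories per crate, got {}",
            MAX_CATEGORIES,
            categories.len()
        ))
    } else {
        Ok(())
    }
}
```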

  fn new_req_body(krate: Crate, version: &str, deps: Vec<u::CrateDependency>,
-                 kws: Vec<String>) -> Vec<u8> {
+                 kws: Vec<String>, cats: Vec<String>) -> Vec<u8> {

@alexcrichton

alexcrichton Nov 21, 2016

Member

🐈 🐱


// Attempting to add one valid category and one invalid category
Category::update_crate(tx(&req), &krate, &["cat1".to_string(),
"catnope".to_string()]).unwrap();

@alexcrichton

alexcrichton Nov 21, 2016

Member

I was initially thinking we should error out on bad category names, but this makes so much more sense. We shouldn't prevent everyone from publishing just because we deleted a category...
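The tolerant behaviour exercised by that test can be pictured as a partition: known categories are applied, unknown ones are set aside (for example, to be reported back as warnings) and the publish still succeeds. `split_known` is a hypothetical helper, not the body of `Category::update_crate`:

```rust
use std::collections::HashSet;

/// Split the requested categories into (known, unknown).
/// Only the known ones would be attached to the crate; the unknown
/// ones are returned so the caller can warn about them.
pub fn split_known(
    requested: &[String],
    known: &HashSet<String>,
) -> (Vec<String>, Vec<String>) {
    requested
        .iter()
        .cloned()
        .partition(|category| known.contains(category))
}
```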

@carols10cents
Copy link
Member Author

carols10cents commented Nov 21, 2016

> I wonder if there's a warning message we could provide back to Cargo for nonexistent categories? Either for a typo'd category or for just picking something that doesn't exist.

Yeah, I've been struggling with this one, because there doesn't seem to be a mechanism for returning warnings to cargo from crates.io right now, just errors, unless I'm missing something. I was thinking of making a separate request that only fetches the list of categories and then cargo creates the warning about an invalid category name? Or do you think it should get rolled into the /new crate request response?

> Right now we can't update a crate's metadata without publishing a new version, the addition of this feature is unfortunately likely to exacerbate this problem. Not something that needs to be fixed here per se, but in theory it's not too hard to add a cargo subcommand to update crate metadata...

I'm up for working on that if you'd like! We're tracking under my estimations at this point, so we should have hours for it. Definitely a new set of PRs though.

@alexcrichton
Member

alexcrichton commented Nov 21, 2016

Oh I was thinking that whatever endpoint you use to upload errors would also transmit back warnings. We'd have to update Cargo yeah to process those warnings. I haven't looked at it in awhile, but I'd hope at least that it'd be backcompat to add more fields to the json response without breaking existing Cargos...

@carols10cents
Member Author

carols10cents commented Nov 21, 2016

> Oh I was thinking that whatever endpoint you use to upload errors would also transmit back warnings. We'd have to update Cargo yeah to process those warnings. I haven't looked at it in awhile, but I'd hope at least that it'd be backcompat to add more fields to the json response without breaking existing Cargos...

Yeah, I think that should work, I was just hesitant to add new protocols without checking. But since you're into it, I'll give it a try! :)

@carols10cents carols10cents force-pushed the integer32llc:categorization branch from b17eeda to 85fe2c9 Nov 21, 2016
@alexcrichton
Member

alexcrichton commented Nov 21, 2016

Oh and a thought about sync-categories, perhaps that could just get executed whenever the server starts? That way we don't need to maintain a separate binary and don't need to modify deployments.

@carols10cents
Member Author

carols10cents commented Nov 22, 2016

Made some progress on this today, addressed all your comments I think, and added subcategories.

Buuuuut I broke category display on the crates page :( Something isn't hooked together the way ember wants it to be, and I'm not sure what I changed yet.

And I want to make the plain-text list of slugs still.

And my cargo PR needs to be updated to do something with the warnings this should return on unknown category names now.

Just wanted to push up my progress!

let categories = include_str!("./categories.txt");

let slug_categories: Vec<_> = categories.lines().map(|c| {
let mut parts = c.split(' ');

@alexcrichton

alexcrichton Nov 28, 2016

Member

This can be splitn so only two matches at most are returned perhaps?

@carols10cents

carols10cents Nov 28, 2016

Author Member

Ah, cool!
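With `splitn(2, ' ')`, the first piece is the slug and the remainder of the line, spaces included, is the display name. A minimal sketch of parsing the "slug name to end of line" format; `parse_categories` is a hypothetical name, and skipping malformed lines is an assumption of this sketch:

```rust
/// Parse lines of the form "slug display name to the end of the line".
/// Returns (slug, name) pairs; lines missing either part are skipped.
pub fn parse_categories(text: &str) -> Vec<(String, String)> {
    text.lines()
        .filter_map(|line| {
            // At most two pieces: everything after the first space is the name.
            let mut parts = line.splitn(2, ' ');
            let slug = parts.next()?.trim();
            let name = parts.next()?.trim();
            if slug.is_empty() || name.is_empty() {
                return None;
            }
            Some((slug.to_string(), name.to_string()))
        })
        .collect()
}
```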

let sql = format!("\
SELECT COUNT(*) \
FROM {} \
WHERE category NOT LIKE '%::%'",

@alexcrichton

alexcrichton Nov 28, 2016

Member

Are we thinking two levels of nesting max? I wonder if it's perhaps better to have a parent_id field in the table for a checked pointer?

@alexcrichton

alexcrichton Nov 28, 2016

Member

(also helps things be more explicit elsewhere)

@carols10cents

carols10cents Nov 28, 2016

Author Member

Nope, I was thinking an arbitrary amount of levels. I started down the path of having a parent_id field, but the SQL for updating and querying got really complicated... I can pull that back out and show you if you'd like to take a look?

@alexcrichton

alexcrichton Nov 28, 2016

Member

Ah yeah if you have it on hand I wouldn't mind taking a peek.

@carols10cents

carols10cents Nov 29, 2016

Author Member

> Ah yeah if you have it on hand I wouldn't mind taking a peek.

Ugh, it looks like I didn't check it in anywhere, unfortunately.

The gist of it is for updating the category names, we wouldn't be able to do one bulk upsert, we'd have to select each category's parent category first (in order to be able to set the parent_id), which means making sure parents are in the table before children, etc.

And to get all crates in a particular category and any of its subcategories, instead of this query we'd have to do a recursive CTE like:


WITH RECURSIVE recursetree(id, parent_ids) AS (
    SELECT id, NULL::int[] || parent_id
    FROM categories 
    WHERE parent_id IS NULL
  UNION ALL
    SELECT 
    c.id, 
    rt.parent_ids || c.parent_id
    FROM categories c
    JOIN recursetree rt ON rt.id = c.parent_id
  )

SELECT * 
FROM crates
INNER JOIN crates_categories 
ON crates.id = crates_categories.crate_id 
WHERE crates_categories.category_id IN (
  SELECT id
  FROM recursetree
  WHERE parent_ids @> ARRAY[(
      SELECT id 
      FROM categories 
      WHERE slug = 'development-tools'
  )]
  UNION
  SELECT id
  FROM categories
  WHERE slug = 'development-tools'
);

There might be ways to make this a little better, but it makes me feel kind of icky :-/
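By comparison, with the hierarchy encoded in the name using `::`, both "is this top-level?" and "is this inside that subtree?" reduce to string checks, which is what the `NOT LIKE '%::%'` query relies on. A sketch of the same logic outside SQL; `is_top_level` and `is_in_subtree` are hypothetical helpers:

```rust
/// A category is top-level when its name has no `::` separator.
pub fn is_top_level(category: &str) -> bool {
    !category.contains("::")
}

/// True when `category` is `root` itself or nested anywhere below it.
/// The separator check keeps "Libraries2::X" out of the "Libraries" subtree.
pub fn is_in_subtree(category: &str, root: &str) -> bool {
    category == root
        || category
            .strip_prefix(root)
            .map_or(false, |rest| rest.starts_with("::"))
}
```

Because nesting depth never appears in the logic, this handles an arbitrary number of levels without recursion.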

@carols10cents carols10cents force-pushed the integer32llc:categorization branch from 5073f95 to d8d19ec Nov 29, 2016
@carols10cents
Member Author

carols10cents commented Nov 29, 2016

Hmmm, interesting.... travis can't seem to fetch https://github.com/rust-lang/crates.io-index.... neat.

@carols10cents
Member Author

carols10cents commented Nov 29, 2016

Ok, I gave up trying to get around Ember to render plain text like pypi and I just made a page to list all the valid category slugs within the usual crates.io template:

slugs-list-page

I'm going to add this URL to the warning on the cargo side when a crate has an invalid category slug.

Some more screenshots:

The top-level categories list page:

categories-page

Clicking into the Libraries category, showing the subcategories and all crates in that category and its subcategories:

libraries-category-page

Clicking into the Libraries::Date and Time subcategory:

date-and-time-subcategories-page

Categories on a crate's page:

categories-on-crate-page

@carols10cents carols10cents force-pushed the integer32llc:categorization branch from 5886ec9 to cc7e92d Nov 29, 2016
@carols10cents
Member Author

carols10cents commented Nov 29, 2016

Ok! I changed my mind about how to communicate about unrecognized category slugs, but that and the cargo side in rust-lang/cargo#3301 are done now. Both sides still happily tolerate the other side not being deployed yet.

So yeah, I think this is done, unless you think we should change to having parent_ids or have any other changes you'd like me to make.

@alexcrichton
Member

alexcrichton commented Dec 1, 2016

Ok I think it makes sense to me to use :: instead of parent_id from what you're saying, so sounds good to me. Want to update the Travis config to use stable Rust so we can see if this goes green?

@carols10cents carols10cents force-pushed the integer32llc:categorization branch 3 times, most recently from 7e36cf1 to 1b04800 Dec 3, 2016
@carols10cents
Member Author

carols10cents commented Dec 5, 2016

IT'S PASSING IT'S PASSING!!!!!

@carols10cents
Member Author

carols10cents commented Dec 7, 2016

I decided to get the PR for categories available started: #488

@est31

est31 commented Dec 8, 2016

👎

I don't like categories. If you publish a crate that's outside of the list of already available categories, you will be forgotten. This leads to a totally pointless bikeshed discussion which will never end. Keywords are better, you can choose them yourself. Text search is better, it also finds the crates which don't have it in their keywords but in the description.

How do categories improve upon keywords? Yes, now there are not "HTTP-Server" and "HTTPServer" keywords, so it's more unified, but what overall advantage is there?

@est31

est31 commented Dec 8, 2016

Mhh, maybe there is some advantage in browsing. One idea for improvement: why not implement categories as groups of keywords? So e.g. a crate is in the Cryptography category if it has one of the (crypto, cryptography, encryption) keywords. This would remove the need to retrofit each crate to the categories system and make publishing of new crates simpler.

@Nemo157
Contributor

Nemo157 commented Dec 8, 2016

> why not implement categories as groups of keywords?

That was exactly what I was thinking while reading through this, it would definitely help with getting existing crates with good metadata into this system without forcing a new version to be published.

@carols10cents
Member Author

carols10cents commented Dec 8, 2016

> I don't like categories. If you publish a crate that's outside of the list of already available categories, you will be forgotten. This leads to a totally pointless bikeshed discussion which will never end. Keywords are better, you can choose them yourself. Text search is better, it also finds the crates which don't have it in their keywords but in the description.
>
> How do categories improve upon keywords? Yes, now there are not "HTTP-Server" and "HTTPServer" keywords, so it's more unified, but what overall advantage is there?

Categories are not meant to replace keywords, they are meant to augment them. Crate authors and crate users will be free to use whichever they find most useful! :)

@carols10cents
Member Author

carols10cents commented Dec 8, 2016

> Mhh, maybe there is some advantage in browsing. One idea for improvement: why not implement categories as groups of keywords? So e.g. a crate is in the Cryptography category if it has one of the (crypto, cryptography, encryption) keywords. This would remove the need to retrofit each crate to the categories system and make publishing of new crates simpler.

> That was exactly what I was thinking while reading through this, it would definitely help with getting existing crates with good metadata into this system without forcing a new version to be published.

One of the problems with this approach is that there are keywords that should be split into two categories instead-- for instance, the cli keyword currently includes argonaut, "A simple argument parser" to help you build CLIs, and betsey, "An AppVeyor cli written in Rust", which is an application with a CLI for use with a particular tool, appveyor. IMO these should end up in Libraries::Command-line interface and Applications::System tools, respectively.
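To make the ambiguity concrete, here is a toy version of the keyword-group idea. The mapping is purely illustrative (`categories_for_keyword` is hypothetical and nothing like it exists in this PR); the category names are the ones used in this thread:

```rust
/// Toy keyword-to-category lookup, as proposed in the thread.
/// Returns every category a keyword could plausibly map to.
pub fn categories_for_keyword(keyword: &str) -> Vec<&'static str> {
    match keyword {
        // Unambiguous: all three keywords point at one category.
        "crypto" | "cryptography" | "encryption" => vec!["Cryptography"],
        // Ambiguous: the keyword alone cannot tell a CLI-building
        // library apart from an application that has a CLI.
        "cli" => vec![
            "Libraries::Command-line interface",
            "Applications::System tools",
        ],
        _ => vec![],
    }
}
```

A keyword like `cli` yields two candidates, so a purely keyword-driven scheme would have to put argonaut and betsey in the same place.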

@carols10cents carols10cents force-pushed the integer32llc:categorization branch from 1b04800 to a750c63 Dec 14, 2016
carols10cents and others added 12 commits Nov 28, 2016
To direct people to when they have specified an invalid slug.

JSON containing all the slugs is available at
/api/v1/category_slugs, but visiting that in a browser doesn't work.

And cargo will handle making nice English messages out of them.

Have to switch from a nice batch insert to running a query for each
category so that we can use apostrophes in the descriptions and have
the string escaped for SQL.

To better distinguish subcategories and crates. This makes "crates" in
the h1 redundant, especially when there *aren't* subcategories.

There will be an RFC soon about whether this is the best ordering or
not.

And make the top-level query that does this consistent with
subcategory queries.

This test does a lot of different manipulations of categories and crate
categories and it was using a crate named foo. The good_categories test
also used a crate named foo, and these two tests were causing a postgres
deadlock.

I was able to cause deadlocks more often by duplicating the update_crate
test and the good_categories test:

https://travis-ci.org/integer32llc/crates.io/builds/187302718

Making this change and running the duplicated tests resulted in 0
deadlocks:

https://travis-ci.org/integer32llc/crates.io/builds/187306433

This is unlikely to happen in production; requests get a database
connection that gets closed when the request finishes, and the publish
request only modifies the categories once, not as much as the
update_crate test is. It seems unlikely that two people would publish
the same crate at exactly the same time.
@carols10cents carols10cents force-pushed the integer32llc:categorization branch from 18838fb to c6de914 Dec 28, 2016
@carols10cents
Member Author

carols10cents commented Dec 28, 2016

I THINK I HAVE VANQUISHED THE DEADLOCK!!!!

The categories::update_crate test does a lot of different manipulations of categories and crate categories and it was using a crate named foo. The good_categories test also used a crate named foo, and these two tests were causing a postgres deadlock.

I was able to cause deadlocks more often by duplicating the update_crate test and the good_categories test:

https://travis-ci.org/integer32llc/crates.io/builds/187302718

Making this change and running the duplicated tests resulted in 0 deadlocks:

https://travis-ci.org/integer32llc/crates.io/builds/187306433

This long-running editing of a crate's categories is unlikely to happen in production; requests get a database connection that gets closed when the request finishes, and the publish request only modifies the categories once, not as much as the update_crate test is. It seems unlikely that two people would publish the same crate at exactly the same time.
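One way to guarantee that tests never share a crate name is to generate names from a process-wide counter. This is only a sketch of the idea: `unique_crate_name` is hypothetical, and the fix in this PR may simply have renamed the crates by hand.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Process-wide counter shared by all tests in the binary.
static NEXT_ID: AtomicUsize = AtomicUsize::new(0);

/// Return a crate name that no other call in this process has produced,
/// so concurrently running tests never touch the same database rows.
pub fn unique_crate_name(prefix: &str) -> String {
    let id = NEXT_ID.fetch_add(1, Ordering::SeqCst);
    format!("{}_{}", prefix, id)
}
```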

@alexcrichton
Member

alexcrichton commented Dec 29, 2016

Ok that sounds good to me. Want to make sure crates have unique names and I'll merge?

@carols10cents
Member Author

carols10cents commented Dec 29, 2016

@alexcrichton done! all test crates now have a unique name :)

@alexcrichton
Member

alexcrichton commented Dec 29, 2016

🎊

@alexcrichton alexcrichton merged commit 710f208 into rust-lang:master Dec 29, 2016
1 check passed
continuous-integration/travis-ci/pr The Travis CI build passed
@alexcrichton
Member

alexcrichton commented Dec 29, 2016

@carols10cents hm it looks like cargo test locally is failing, maybe due to a recent push to master? Mind taking a peek at that?

@carols10cents
Member Author

carols10cents commented Dec 29, 2016

> @carols10cents hm it looks like cargo test locally is failing, maybe due to a recent push to master? Mind taking a peek at that?

On it!

@carols10cents
Member Author

carols10cents commented Dec 29, 2016

When can we have bors on this repo? ;)

@carols10cents
Member Author

carols10cents commented Dec 29, 2016

@alexcrichton Hm, cargo test on master isn't failing for me locally, nor is it failing on travis. Did you happen to try out a previous version of this branch? I did change some of the migrations along the way, maybe try dropping and creating your cargo_registry_test database and see if that fixes it?

@alexcrichton
Member

alexcrichton commented Dec 29, 2016

Oh looks like I was missing the S3_BUCKET business, my bad!

bors added a commit to rust-lang/cargo that referenced this pull request Jan 17, 2017
Upload categories specified in the manifest

This adds support for uploading categories to crates.io, if they are specified in the manifest.

This goes with rust-lang/crates.io#473. It should be fine to merge this PR either before or after that one; crates.io master doesn't care if the categories are in the metadata or not. With that PR, I was able to use this patch with cargo to add categories to a crate!
@shepmaster shepmaster deleted the integer32llc:categorization branch Apr 13, 2017
@nasa42 nasa42 mentioned this pull request May 4, 2017