-
Notifications
You must be signed in to change notification settings - Fork 598
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Categorization! #473
Add Categorization! #473
Conversation
Ummmmm I have no idea why the tests are failing with "Once instance has previously been poisoned" :( |
Aaaand I forgot to add support for subcategories, just realized that. I'd love any thoughts in the meantime, but I'm going to be adding a few commits :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome! I think the test error is fixed on master, so a rebase should pick that up. I'm also fine with the rollout strategy here, sounds good to me.
Some other thoughts of mine:
- I wonder if there's a warning message we could provide back to Cargo for nonexistent categories? Either for a typo'd category or for just picking something that doesn't exist.
- Right now we can't update a crate's metadata without publishing a new version, the addition of this feature is unfortunately likely to exacerbate this problem. Not something that needs to be fixed here per se, but in theory it's not too hard to add a cargo subcommand to update crate metadata...
ALTER TABLE crates_categories \ | ||
ADD CONSTRAINT fk_crates_categories_category_id \ | ||
FOREIGN KEY (category_id) REFERENCES categories (id) \ | ||
ON DELETE CASCADE", " \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This just means if we delete a category it'll auto-delete everything from crates_categories
, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct, is that ok?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yeah fine by me, just not used to fancy sql features :)
let in_clause = categories.iter() | ||
.map(|c| format!("'{}'", c)) | ||
.collect::<Vec<_>>() | ||
.join(","); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps just for the sake of hygiene, but could we use rust-postgres's ability for escaping here? (rather than doing so ourselves).
I do realize though that categories.txt
isn't user input, I figure it's just a good idea to stick with built-in escaping.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rust-postgres doesn't support this for the bulk import of multiple values that I wanted to do here, see sfackler/rust-postgres#218. I can change the WHERE NOT IN
to use rust-postgres' escaping, though.
Another option is to make one insert for every category... this doesn't really NEED to be fast... wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eh I ended up seeing more of these throughout the codebase anyway (which were all reasonable), so it's fine to ignore this. A minor "concern" anyway.
pub fn find_by_category(conn: &GenericConnection, name: &str) | ||
-> CargoResult<Option<Category>> { | ||
let stmt = try!(conn.prepare("SELECT * FROM categories \ | ||
WHERE category = $1")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you also add an index to the database for this category
field? (unless I already missed it)
Also, perhaps the index and this could be based on lower(category)
and lower($1)
so we don't have to worry about case issues?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Postgres automatically adds an index for UNIQUE fields (see https://www.postgresql.org/docs/current/static/ddl-constraints.html 5.3.3. Unique Constraints, "Adding a unique constraint will automatically create a unique B-tree index on the column or group of columns listed in the constraint.").
I can make it lower(category) though! I was thinking exact should mean exact, but it would probably be a bit friendlier to not care about case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Postgres foils me again! If we add slugs then I think this becomes a non-issue, we can just ensure that all slugs are always lowercase.
pub fn encodable(self) -> EncodableCategory { | ||
let Category { id: _, crates_cnt, category, created_at } = self; | ||
EncodableCategory { | ||
id: category.clone(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this be the lowercase version to be "pretty"?
Although now that I think more about this, maybe we should have two fields in the categories table. One for a slug (url friendly, this field) and another for the name? That way we can support categories with punctuation and spaces and such, all while keeping a nice url.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure! Should categories.txt still list only the title-cased, punctuated version, and then the slug is lowercased, replace all spaces with hyphens, and remove other punctuation? Or should categories.txt list the slug as well?
If people specify the slug as their crate's category value, should that be valid?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's go with a format like:
slug the name of this category is everything to the end of the line
slug2 another category
(or something like that)
Also hm that is a good point about what you specify in the manifest. I'm tempted to say slugs, not names? That way we can tweak names as we see fit (maybe even localize them one day).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, if that's the case, then as a crate author I'd want to be able to see what the valid slugs are. I'm going to make a page like pypi has that's a plain text list of the exact valid category specifiers. We can reference that in the warning message when someone specifies an invalid slug too!
|
||
impl Category { | ||
pub fn find_by_category(conn: &GenericConnection, name: &str) | ||
-> CargoResult<Option<Category>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want to change this to just return CargoResult<Category>
an internalize the chain_error(|| NotFound)
that'd also be fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can do that!
let new_categories = try!( | ||
Category::find_all_by_category(conn, categories) | ||
); | ||
let new_categories_ids: HashSet<_> = new_categories.iter().map(|cat| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be a good idea here to put a limit on the number of new categories that can be added. Something high like 50 or 100 but just don't want to blow out the database or anything like that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, is that not taken care of by Decode in upload.rs before getting here? Originally, we weren't going to add a limit on categories, but then I saw keywords are limited to 5, and that seemed reasonable, so I did that as well. I'm happy to add a limit here too, though, preventing database overload is good! Just want to check that adding logic here isn't been too paranoid given upload.rs...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yup that'd do it, missed that before I got here, sounds good to me!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(that's the better location for a validation imo)
fn new_req_body(krate: Crate, version: &str, deps: Vec<u::CrateDependency>, | ||
kws: Vec<String>) -> Vec<u8> { | ||
kws: Vec<String>, cats: Vec<String>) -> Vec<u8> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🐈 🐱
|
||
// Attempting to add one valid category and one invalid category | ||
Category::update_crate(tx(&req), &krate, &["cat1".to_string(), | ||
"catnope".to_string()]).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was initially thinking we should error out on bad category names, but this make so much more sense. We shouldn't prevent everyone from publishing just because we deleted a category...
Yeah, I've been struggling with this one, because there doesn't seem to be a mechanism for returning warnings to cargo from crates.io right now, just errors, unless I'm missing something. I was thinking of making a separate request that only fetches the list of categories and then cargo creates the warning about an invalid category name? Or do you think it should get rolled into the /new crate request response?
I'm up for working on that if you'd like! We're tracking under my estimations at this point, so we should have hours for it. Definitely a new set of PRs though. |
Oh I was thinking that whatever endpoint you use to upload errors would also transmit back warnings. We'd have to update Cargo yeah to process those warnings. I haven't looked at it in awhile, but I'd hope at least that it'd be backcompat to add more fields to the json response without breaking existing Cargos... |
Yeah, I think that should work, I was just hesitant to add new protocols without checking. But since you're into it, I'll give it a try! :) |
b17eeda
to
85fe2c9
Compare
Oh and a thought about sync-categories, perhaps that could just get executed whenever the server starts? That way we don't need to maintain a separate binary and don't need to modify deployments. |
Made some progress on this today, addressed all your comments I think, and added subcategories. Buuuuut I broke category display on the crates page :( Something isn't hooked together the way ember wants it to be, and I'm not sure what I changed yet. And I want to make the plain-text list of slugs still. And my cargo PR needs to be updated to do something with the warnings this should return on unknown category names now. Just wanted to push up my progress! |
let categories = include_str!("./categories.txt"); | ||
|
||
let slug_categories: Vec<_> = categories.lines().map(|c| { | ||
let mut parts = c.split(' '); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be splitn
so only two matches at most are returned perhaps?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, cool!
let sql = format!("\ | ||
SELECT COUNT(*) \ | ||
FROM {} \ | ||
WHERE category NOT LIKE '%::%'", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we thinking two levels of nesting max? I wonder if it's perhaps better to have a parent_id
field in the table for a checked pointer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(also helps things be more explicit elsewhere)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, I was thinking an arbitrary amount of levels. I started down the path of having a parent_id field, but the SQL for updating and querying got really complicated... I can pull that back out and show you if you'd like to take a look?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yeah if you have it on hand I wouldn't mind taking a peek.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yeah if you have it on hand I wouldn't mind taking a peek.
Ugh, it looks like I didn't check it in anywhere, unfortunately.
The gist of it is for updating the category names, we wouldn't be able to do one bulk upsert, we'd have to select each category's parent category first (in order to be able to set the parent_id), which means making sure parents are in the table before children, etc.
And to get all crates in a particular category and any of its subcategories, instead of this query we'd have to do a recursive CTE like:
WITH RECURSIVE recursetree(id, parent_ids) AS (
SELECT id, NULL::int[] || parent_id
FROM categories
WHERE parent_id IS NULL
UNION ALL
SELECT
c.id,
rt.parent_ids || c.parent_id
FROM categories c
JOIN recursetree rt ON rt.id = c.parent_id
)
SELECT *
FROM crates
INNER JOIN crates_categories
ON crates.id = crates_categories.crate_id
WHERE crates_categories.category_id IN (
SELECT id
FROM recursetree
WHERE parent_ids @> ARRAY[(
SELECT id
FROM categories
WHERE slug = 'development-tools'
)]
UNION
SELECT id
FROM categories
WHERE slug = 'development-tools'
);
There might be ways to make this a little better, but it makes me feel kind of icky :-/
5073f95
to
d8d19ec
Compare
Hmmm, interesting.... travis can't seem to fetch https://github.com/rust-lang/crates.io-index.... neat. |
Ok, I gave up trying to get around Ember to render plain text like pypi and I just made a page to list all the valid category slugs within the usual crates.io template: I'm going to add this URL to the warning on the cargo side when a crate has an invalid category slug. Some more screenshots: The top-level categories list page: Clicking into the Libraries category, showing the subcategories and all crates in that category and its subcategories: Clicking into the Libraries::Date and Time subcategory: Categories on a crate's page: |
5886ec9
to
cc7e92d
Compare
Ok! I changed my mind about how to communicate about unrecognized category slugs, but that and the cargo side in rust-lang/cargo#3301 are done now. Both sides still happily tolerate the other side not being deployed yet. So yeah, I think this is done, unless you think we should change to having parent_ids or have any other changes you'd like me to make. |
Ok I think it makes sense to me to use |
7e36cf1
to
1b04800
Compare
IT'S PASSING IT'S PASSING!!!!! |
I decided to get the PR for categories available started: #488 |
👎 I don't like cateogries. If you publish a crate thats outside of the list of already available categories, you will be forgotten. This leads to a totally pointless bikeshed disucssion which wil never end. Keywords are better, you can chose them yourself. Text search is better, it also finds the crates which don't have it in their keywords but on the description. How do categories improve upon keywords? Yes, now there are not "HTTP-Server" and "HTTPServer" keywords, so its more unified, but what overall advantage is there? |
Mhh, maybe there is some advantage in browsing. One idea for improvement: why not implement categories as groups of keywords? So e.g. a crate is in the Cryptography cateogry if it has one of the (crypto, cryptography, encryption) keywords. This would remove the need to retrofit each crate to the categories system and make publishing of new crates more simple. |
That was exactly what I was thinking while reading through this, it would definitely help with getting existing crates with good metadata into this system without forcing a new version to be published. |
Categories are not meant to replace keywords, they are meant to augment them. Crate authors and crate users will be free to use whichever they find most useful! :) |
One of the problems with this approach is that there are keywords that should be split into two categories instead-- for instance, the cli keyword currently includes argonaut, "A simple argument parser" to help you build CLIs, and betsey, "An AppVeyor cli written in Rust", which is an application with a CLI for use with a particular tool, appveyor. IMO these should end up in Libraries::Command-line interface and Applications::System tools, respectively. |
1b04800
to
a750c63
Compare
To direct people to when they have specified an invalid slug. JSON containing all the slugs is available at /api/v1/category_slugs, but visiting that in a browser doesn't work.
And cargo will handle making nice English messages out of them.
Have to switch from a nice batch insert to running a query for each category so that we can use apostrophes in the descriptions and have the string escaped for SQL.
And not all descendant categories
To better distinguish subcategories and crates. This makes "crates" in the h1 redundant, especially when there *aren't* subcategories.
There will be an RFC soon about whether this is the best ordering or not.
And make the top-level query that does this consistent with subcategory queries.
This test does a lot of different manipulations of categories and crate categories and it was using a crate named foo. The good_categories test also used a crate named foo, and these two tests were causing a postgres deadlock. I was able to cause deadlocks more often by duplicating the update_crate test and the good_categories test: https://travis-ci.org/integer32llc/crates.io/builds/187302718 Making this change and running the duplicated tests resulted in 0 deadlocks: https://travis-ci.org/integer32llc/crates.io/builds/187306433 This is unlikely to happen in production; requests get a database connection that gets closed when the request finishes, and the publish request only modifies the categories once, not as much as the update_crate test is. It seems unlikely that two people would publish the same crate at exactly the same time.
18838fb
to
c6de914
Compare
I THINK I HAVE VANQUISHED THE DEADLOCK!!!! The categories::update_crate test does a lot of different manipulations of categories and crate categories and it was using a crate named foo. The good_categories test also used a crate named foo, and these two tests were causing a postgres deadlock. I was able to cause deadlocks more often by duplicating the update_crate test and the good_categories test: https://travis-ci.org/integer32llc/crates.io/builds/187302718 Making this change and running the duplicated tests resulted in 0 deadlocks: https://travis-ci.org/integer32llc/crates.io/builds/187306433 This long-running editing of a crate's categories is unlikely to happen in production; requests get a database connection that gets closed when the request finishes, and the publish request only modifies the categories once, not as much as the update_crate test is. It seems unlikely that two people would publish the same crate at exactly the same time. |
Ok that sounds good to me. Want to make sure crates have unique names and I'll merge? |
@alexcrichton done! all test crates now have a unique name :) |
🎊 |
@carols10cents hm it looks like |
On it! |
When can we have bors on this repo? ;) |
@alexcrichton Hm, |
Oh looks like I was missing the S3_BUCKET business, my bad! |
Upload categories specified in the manifest This adds support for uploading categories to crates.io, if they are specified in the manifest. This goes with rust-lang/crates.io#473. It should be fine to merge this PR either before or after that one; crates.io master doesn't care if the categories are in the metadata or not. With that PR, I was able to use this patch with cargo to add categories to a crate!
Hiiii! This is the first part of adding categories to crates.io!
What this PR does
src/categories.txt
and add or remove categories from the database as needed to make the database match the categories. This way, the list of available categories can be changed via pull request to this repo./categories
page not yet linked from anywhere that displays all the categories in alphabetical order/categories/whatever
page that will list all the crates in a categoryTesting this PR
Start the frontend as well (Note: I had to use
ember server --proxy http://127.0.0.1:8888
instead ofyarn run start:local
with the latest version of yarn in order to actually use my local backend, I'm investigating why this is the case)/categories
and see 2 categories, "Development Tools" and "Libraries"Deploying this PR
It should be totally fine to deploy this PR to crates.io. Nothing links to /categories, and cargo does not publish category metadata yet-- and neither cargo nor crates.io will complain if a crate is not in any categories.
After this PR is merged