Join GitHub today
GitHub is home to over 31 million developers working together to host and review code, manage projects, and build software together.
Sign upAdd Categorization! #473
Conversation
This comment has been minimized.
This comment has been minimized.
|
Ummmmm I have no idea why the tests are failing with "Once instance has previously been poisoned" :( |
This comment has been minimized.
This comment has been minimized.
|
Aaaand I forgot to add support for subcategories, just realized that. I'd love any thoughts in the meantime, but I'm going to be adding a few commits :) |
carols10cents
referenced this pull request
Nov 18, 2016
Merged
Upload categories specified in the manifest #3301
alexcrichton
reviewed
Nov 21, 2016
|
Awesome! I think the test error is fixed on master, so a rebase should pick that up. I'm also fine with the rollout strategy here, sounds good to me. Some other thoughts of mine:
|
| ALTER TABLE crates_categories \ | ||
| ADD CONSTRAINT fk_crates_categories_category_id \ | ||
| FOREIGN KEY (category_id) REFERENCES categories (id) \ | ||
| ON DELETE CASCADE", " \ |
This comment has been minimized.
This comment has been minimized.
alexcrichton
Nov 21, 2016
Member
This just means if we delete a category it'll auto-delete everything from crates_categories, right?
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
| let in_clause = categories.iter() | ||
| .map(|c| format!("'{}'", c)) | ||
| .collect::<Vec<_>>() | ||
| .join(","); |
This comment has been minimized.
This comment has been minimized.
alexcrichton
Nov 21, 2016
Member
Perhaps just for the sake of hygiene, but could we use rust-postgres's ability for escaping here? (rather than doing so ourselves).
I do realize though that categories.txt isn't user input, I figure it's just a good idea to stick with built-in escaping.
This comment has been minimized.
This comment has been minimized.
carols10cents
Nov 21, 2016
Author
Member
rust-postgres doesn't support this for the bulk import of multiple values that I wanted to do here, see sfackler/rust-postgres#218. I can change the WHERE NOT IN to use rust-postgres' escaping, though.
Another option is to make one insert for every category... this doesn't really NEED to be fast... wdyt?
This comment has been minimized.
This comment has been minimized.
alexcrichton
Nov 21, 2016
Member
Eh I ended up seeing more of these throughout the codebase anyway (which were all reasonable), so it's fine to ignore this. A minor "concern" anyway.
| pub fn find_by_category(conn: &GenericConnection, name: &str) | ||
| -> CargoResult<Option<Category>> { | ||
| let stmt = try!(conn.prepare("SELECT * FROM categories \ | ||
| WHERE category = $1")); |
This comment has been minimized.
This comment has been minimized.
alexcrichton
Nov 21, 2016
Member
Could you also add an index to the database for this category field? (unless I already missed it)
Also, perhaps the index and this could be based on lower(category) and lower($1) so we don't have to worry about case issues?
This comment has been minimized.
This comment has been minimized.
carols10cents
Nov 21, 2016
Author
Member
Postgres automatically adds an index for UNIQUE fields (see https://www.postgresql.org/docs/current/static/ddl-constraints.html 5.3.3. Unique Constraints, "Adding a unique constraint will automatically create a unique B-tree index on the column or group of columns listed in the constraint.").
I can make it lower(category) though! I was thinking exact should mean exact, but it would probably be a bit friendlier to not care about case.
This comment has been minimized.
This comment has been minimized.
alexcrichton
Nov 21, 2016
Member
Postgres foils me again! If we add slugs then I think this becomes a non-issue, we can just ensure that all slugs are always lowercase.
| pub fn encodable(self) -> EncodableCategory { | ||
| let Category { id: _, crates_cnt, category, created_at } = self; | ||
| EncodableCategory { | ||
| id: category.clone(), |
This comment has been minimized.
This comment has been minimized.
alexcrichton
Nov 21, 2016
Member
Could this be the lowercase version to be "pretty"?
Although now that I think more about this, maybe we should have two fields in the categories table. One for a slug (url friendly, this field) and another for the name? That way we can support categories with punctuation and spaces and such, all while keeping a nice url.
This comment has been minimized.
This comment has been minimized.
carols10cents
Nov 21, 2016
Author
Member
Sure! Should categories.txt still list only the title-cased, punctuated version, and then the slug is lowercased, replace all spaces with hyphens, and remove other punctuation? Or should categories.txt list the slug as well?
If people specify the slug as their crate's category value, should that be valid?
This comment has been minimized.
This comment has been minimized.
alexcrichton
Nov 21, 2016
Member
Let's go with a format like:
slug the name of this category is everything to the end of the line
slug2 another category
(or something like that)
Also hm that is a good point about what you specify in the manifest. I'm tempted to say slugs, not names? That way we can tweak names as we see fit (maybe even localize them one day).
This comment has been minimized.
This comment has been minimized.
carols10cents
Nov 22, 2016
Author
Member
Ok, if that's the case, then as a crate author I'd want to be able to see what the valid slugs are. I'm going to make a page like pypi has that's a plain text list of the exact valid category specifiers. We can reference that in the warning message when someone specifies an invalid slug too!
|
|
||
| impl Category { | ||
| pub fn find_by_category(conn: &GenericConnection, name: &str) | ||
| -> CargoResult<Option<Category>> { |
This comment has been minimized.
This comment has been minimized.
alexcrichton
Nov 21, 2016
Member
If you want to change this to just return CargoResult<Category> an internalize the chain_error(|| NotFound) that'd also be fine.
This comment has been minimized.
This comment has been minimized.
| let new_categories = try!( | ||
| Category::find_all_by_category(conn, categories) | ||
| ); | ||
| let new_categories_ids: HashSet<_> = new_categories.iter().map(|cat| { |
This comment has been minimized.
This comment has been minimized.
alexcrichton
Nov 21, 2016
Member
It may be a good idea here to put a limit on the number of new categories that can be added. Something high like 50 or 100 but just don't want to blow out the database or anything like that.
This comment has been minimized.
This comment has been minimized.
carols10cents
Nov 21, 2016
Author
Member
Hm, is that not taken care of by Decode in upload.rs before getting here? Originally, we weren't going to add a limit on categories, but then I saw keywords are limited to 5, and that seemed reasonable, so I did that as well. I'm happy to add a limit here too, though, preventing database overload is good! Just want to check that adding logic here isn't been too paranoid given upload.rs...
This comment has been minimized.
This comment has been minimized.
alexcrichton
Nov 21, 2016
Member
Oh yup that'd do it, missed that before I got here, sounds good to me!
This comment has been minimized.
This comment has been minimized.
| fn new_req_body(krate: Crate, version: &str, deps: Vec<u::CrateDependency>, | ||
| kws: Vec<String>) -> Vec<u8> { | ||
| kws: Vec<String>, cats: Vec<String>) -> Vec<u8> { |
This comment has been minimized.
This comment has been minimized.
|
|
||
| // Attempting to add one valid category and one invalid category | ||
| Category::update_crate(tx(&req), &krate, &["cat1".to_string(), | ||
| "catnope".to_string()]).unwrap(); |
This comment has been minimized.
This comment has been minimized.
alexcrichton
Nov 21, 2016
Member
I was initially thinking we should error out on bad category names, but this make so much more sense. We shouldn't prevent everyone from publishing just because we deleted a category...
This comment has been minimized.
This comment has been minimized.
Yeah, I've been struggling with this one, because there doesn't seem to be a mechanism for returning warnings to cargo from crates.io right now, just errors, unless I'm missing something. I was thinking of making a separate request that only fetches the list of categories and then cargo creates the warning about an invalid category name? Or do you think it should get rolled into the /new crate request response?
I'm up for working on that if you'd like! We're tracking under my estimations at this point, so we should have hours for it. Definitely a new set of PRs though. |
This comment has been minimized.
This comment has been minimized.
|
Oh I was thinking that whatever endpoint you use to upload errors would also transmit back warnings. We'd have to update Cargo yeah to process those warnings. I haven't looked at it in awhile, but I'd hope at least that it'd be backcompat to add more fields to the json response without breaking existing Cargos... |
This comment has been minimized.
This comment has been minimized.
Yeah, I think that should work, I was just hesitant to add new protocols without checking. But since you're into it, I'll give it a try! :) |
carols10cents
force-pushed the
integer32llc:categorization
branch
from
b17eeda
to
85fe2c9
Nov 21, 2016
This comment has been minimized.
This comment has been minimized.
|
Oh and a thought about sync-categories, perhaps that could just get executed whenever the server starts? That way we don't need to maintain a separate binary and don't need to modify deployments. |
carols10cents
referenced this pull request
Nov 22, 2016
Closed
Keywords are not shown via search box #331
This comment has been minimized.
This comment has been minimized.
|
Made some progress on this today, addressed all your comments I think, and added subcategories. Buuuuut I broke category display on the crates page :( Something isn't hooked together the way ember wants it to be, and I'm not sure what I changed yet. And I want to make the plain-text list of slugs still. And my cargo PR needs to be updated to do something with the warnings this should return on unknown category names now. Just wanted to push up my progress! |
alexcrichton
reviewed
Nov 28, 2016
| let categories = include_str!("./categories.txt"); | ||
|
|
||
| let slug_categories: Vec<_> = categories.lines().map(|c| { | ||
| let mut parts = c.split(' '); |
This comment has been minimized.
This comment has been minimized.
alexcrichton
Nov 28, 2016
Member
This can be splitn so only two matches at most are returned perhaps?
This comment has been minimized.
This comment has been minimized.
| let sql = format!("\ | ||
| SELECT COUNT(*) \ | ||
| FROM {} \ | ||
| WHERE category NOT LIKE '%::%'", |
This comment has been minimized.
This comment has been minimized.
alexcrichton
Nov 28, 2016
Member
Are we thinking two levels of nesting max? I wonder if it's perhaps better to have a parent_id field in the table for a checked pointer?
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
carols10cents
Nov 28, 2016
•
Author
Member
Nope, I was thinking an arbitrary amount of levels. I started down the path of having a parent_id field, but the SQL for updating and querying got really complicated... I can pull that back out and show you if you'd like to take a look?
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
carols10cents
Nov 29, 2016
Author
Member
Ah yeah if you have it on hand I wouldn't mind taking a peek.
Ugh, it looks like I didn't check it in anywhere, unfortunately.
The gist of it is for updating the category names, we wouldn't be able to do one bulk upsert, we'd have to select each category's parent category first (in order to be able to set the parent_id), which means making sure parents are in the table before children, etc.
And to get all crates in a particular category and any of its subcategories, instead of this query we'd have to do a recursive CTE like:
WITH RECURSIVE recursetree(id, parent_ids) AS (
SELECT id, NULL::int[] || parent_id
FROM categories
WHERE parent_id IS NULL
UNION ALL
SELECT
c.id,
rt.parent_ids || c.parent_id
FROM categories c
JOIN recursetree rt ON rt.id = c.parent_id
)
SELECT *
FROM crates
INNER JOIN crates_categories
ON crates.id = crates_categories.crate_id
WHERE crates_categories.category_id IN (
SELECT id
FROM recursetree
WHERE parent_ids @> ARRAY[(
SELECT id
FROM categories
WHERE slug = 'development-tools'
)]
UNION
SELECT id
FROM categories
WHERE slug = 'development-tools'
);
There might be ways to make this a little better, but it makes me feel kind of icky :-/
carols10cents
force-pushed the
integer32llc:categorization
branch
from
5073f95
to
d8d19ec
Nov 29, 2016
This comment has been minimized.
This comment has been minimized.
|
Hmmm, interesting.... travis can't seem to fetch https://github.com/rust-lang/crates.io-index.... neat. |
This comment has been minimized.
This comment has been minimized.
|
Ok, I gave up trying to get around Ember to render plain text like pypi and I just made a page to list all the valid category slugs within the usual crates.io template: I'm going to add this URL to the warning on the cargo side when a crate has an invalid category slug. Some more screenshots: The top-level categories list page: Clicking into the Libraries category, showing the subcategories and all crates in that category and its subcategories: Clicking into the Libraries::Date and Time subcategory: Categories on a crate's page: |
carols10cents
force-pushed the
integer32llc:categorization
branch
from
5886ec9
to
cc7e92d
Nov 29, 2016
This comment has been minimized.
This comment has been minimized.
|
Ok! I changed my mind about how to communicate about unrecognized category slugs, but that and the cargo side in rust-lang/cargo#3301 are done now. Both sides still happily tolerate the other side not being deployed yet. So yeah, I think this is done, unless you think we should change to having parent_ids or have any other changes you'd like me to make. |
This comment has been minimized.
This comment has been minimized.
|
Ok I think it makes sense to me to use |
carols10cents
force-pushed the
integer32llc:categorization
branch
3 times, most recently
from
7e36cf1
to
1b04800
Dec 3, 2016
This comment has been minimized.
This comment has been minimized.
|
IT'S PASSING IT'S PASSING!!!!! |
carols10cents
referenced this pull request
Dec 7, 2016
Merged
What categories should be available on crates.io? [BIKESHED ALERT 🚲🏠🚨] #488
This comment has been minimized.
This comment has been minimized.
|
I decided to get the PR for categories available started: #488 |
This comment has been minimized.
This comment has been minimized.
est31
commented
Dec 8, 2016
|
I don't like cateogries. If you publish a crate thats outside of the list of already available categories, you will be forgotten. This leads to a totally pointless bikeshed disucssion which wil never end. Keywords are better, you can chose them yourself. Text search is better, it also finds the crates which don't have it in their keywords but on the description. How do categories improve upon keywords? Yes, now there are not "HTTP-Server" and "HTTPServer" keywords, so its more unified, but what overall advantage is there? |
This comment has been minimized.
This comment has been minimized.
est31
commented
Dec 8, 2016
|
Mhh, maybe there is some advantage in browsing. One idea for improvement: why not implement categories as groups of keywords? So e.g. a crate is in the Cryptography cateogry if it has one of the (crypto, cryptography, encryption) keywords. This would remove the need to retrofit each crate to the categories system and make publishing of new crates more simple. |
This comment has been minimized.
This comment has been minimized.
That was exactly what I was thinking while reading through this, it would definitely help with getting existing crates with good metadata into this system without forcing a new version to be published. |
This comment has been minimized.
This comment has been minimized.
Categories are not meant to replace keywords, they are meant to augment them. Crate authors and crate users will be free to use whichever they find most useful! :) |
This comment has been minimized.
This comment has been minimized.
One of the problems with this approach is that there are keywords that should be split into two categories instead-- for instance, the cli keyword currently includes argonaut, "A simple argument parser" to help you build CLIs, and betsey, "An AppVeyor cli written in Rust", which is an application with a CLI for use with a particular tool, appveyor. IMO these should end up in Libraries::Command-line interface and Applications::System tools, respectively. |
carols10cents
force-pushed the
integer32llc:categorization
branch
from
1b04800
to
a750c63
Dec 14, 2016
carols10cents
and others
added some commits
Nov 28, 2016
carols10cents
force-pushed the
integer32llc:categorization
branch
from
18838fb
to
c6de914
Dec 28, 2016
This comment has been minimized.
This comment has been minimized.
|
I THINK I HAVE VANQUISHED THE DEADLOCK!!!! The categories::update_crate test does a lot of different manipulations of categories and crate categories and it was using a crate named foo. The good_categories test also used a crate named foo, and these two tests were causing a postgres deadlock. I was able to cause deadlocks more often by duplicating the update_crate test and the good_categories test: https://travis-ci.org/integer32llc/crates.io/builds/187302718 Making this change and running the duplicated tests resulted in 0 deadlocks: https://travis-ci.org/integer32llc/crates.io/builds/187306433 This long-running editing of a crate's categories is unlikely to happen in production; requests get a database connection that gets closed when the request finishes, and the publish request only modifies the categories once, not as much as the update_crate test is. It seems unlikely that two people would publish the same crate at exactly the same time. |
This comment has been minimized.
This comment has been minimized.
|
Ok that sounds good to me. Want to make sure crates have unique names and I'll merge? |
This comment has been minimized.
This comment has been minimized.
|
@alexcrichton done! all test crates now have a unique name :) |
This comment has been minimized.
This comment has been minimized.
|
|
alexcrichton
merged commit 710f208
into
rust-lang:master
Dec 29, 2016
1 check passed
This comment has been minimized.
This comment has been minimized.
|
@carols10cents hm it looks like |
This comment has been minimized.
This comment has been minimized.
On it! |
This comment has been minimized.
This comment has been minimized.
|
When can we have bors on this repo? ;) |
This comment has been minimized.
This comment has been minimized.
|
@alexcrichton Hm, |
This comment has been minimized.
This comment has been minimized.
|
Oh looks like I was missing the S3_BUCKET business, my bad! |





carols10cents commentedNov 17, 2016
•
edited
Hiiii! This is the first part of adding categories to crates.io!
What this PR does
src/categories.txtand add or remove categories from the database as needed to make the database match the categories. This way, the list of available categories can be changed via pull request to this repo./categoriespage not yet linked from anywhere that displays all the categories in alphabetical order/categories/whateverpage that will list all the crates in a categoryTesting this PR
Start the frontend as well (Note: I had to use
ember server --proxy http://127.0.0.1:8888instead ofyarn run start:localwith the latest version of yarn in order to actually use my local backend, I'm investigating why this is the case)/categoriesand see 2 categories, "Development Tools" and "Libraries"Deploying this PR
It should be totally fine to deploy this PR to crates.io. Nothing links to /categories, and cargo does not publish category metadata yet-- and neither cargo nor crates.io will complain if a crate is not in any categories.
After this PR is merged