Legal Documents #21
Comments
I have all pre-2020 Congressional proceedings, but when we read over them we quickly decided that the median document was far too racist to be included in a training dataset. |
@StellaAthena do we want to intentionally exclude toxic content, or to include it but with toxicity/quality scores attached? Based on prior work I wonder if it's better to let users decide, especially since there are beneficial cases to pretrain on real world toxic content (e.g. training realistic toxicity detectors). |
We could use the Perspective API to list the scores for each sentence. |
Also, should we keep Cushman in the loop for this? We should talk to Peter as well. |
Regarding toxicity annotations, we are not going to do that as part of the per-source preprocessing, it will be done globally to all sources. |
Got you. Sounds good. |
I have access to all of case law and some scripts to get the data. Can CC John Nay |
@conceptofmind Hi Enrico, if you have not started to work on Pile of Law, I will take it. |
@wildphoton Note that the Pile of Law is an amalgamation of different data sources. We should process those from their source rather than just import the Pile of Law. Looking through the paper, it seems that about two thirds of the data comes from:
|
Thanks for the note! I am starting from the first source. |
Supposed to be getting into contact with Jack Cushman soon over case law |
In contact with Jack Cushman about Case Law Access Project. The data will be open in March for release. I will have all the scripts and data prepared for it then. |
Wrote an updated parser for USPTO. |
CAP is done and I uploaded a post-processed sample to HF |
I am working on finishing Court Listener soon with Ramid. |
We transcribed oral arguments, cleaned the opinion datasets, and are looking into the other sections of POL now. |
Hi @conceptofmind, I also wrote a script for downloading the opinion datasets from the POL raw bulk data (I just uploaded them to the branch legal/court_listener), and I am now looking into other parts of Court Listener. How did you do the cleaning (did you directly clean the data in Pile of Law), and what is your plan next? Shall we sync on this to avoid duplicate work? Thanks! |
The court audio transcripts have been processed and diarized in collaboration with Tensorlake. The additional court listener opinions were further processed as an extension of CAP and will be released soon. I have not evaluated the other sets in Court Listener. @hamediramin was looking into those but they did not seem to contain much useful information. He additionally was investigating what sets in Pile of Law could be updated. |
GovInfo has a bulk data service, which provides machine-readable versions of bills, statutes, codes, etc. as XML and JSON. Here's the documentation in a GH repo: https://github.com/usgpo/bulk-data. |
@craffel Going to bring this back up here. |
Hey @conceptofmind , we won't use the data you're posting about unless the code is in this repo. Can you add the code to this repo and/or update the code in this repo? |
Yes, the code will be posted and this is not the finalized data. This data is just to ensure that duplicated work is not done. Everything being done with CL and Harvard CAP is already open-sourced anyway and you can see it. |
CC @wildphoton so that you do not need to run all the extraction. You will be duplicating the data 15 times. The single correct dump is already extracted and I am post-processing it. |
@craffel I am going to make multiple PRs but want to stem the issue related to deduplicated data first so opening that one now. |
@conceptofmind Thanks for sharing the info.
|
We need to process the data so that it looks like clean natural text. For example, we need to try to remove markup, boilerplate/menu text from webpages, OCR gibberish, etc. The best way to determine whether and what processing we need to do is to dump out some preliminary data and take a look at it. |
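One way to "dump out some preliminary data and take a look at it" is to sample a handful of documents and print short previews. This is only a minimal sketch: the function name `preview_samples` and its parameters are hypothetical, and it assumes the records are already loaded as a list of strings.

```python
import random


def preview_samples(records, n=5, width=200, seed=0):
    """Return up to `n` randomly chosen records, truncated for eyeballing.

    `records` is assumed to be a list of raw document strings. Whitespace is
    collapsed so markup, menu text, or OCR gibberish is easier to spot.
    """
    rng = random.Random(seed)  # fixed seed so the same sample can be re-checked
    chosen = rng.sample(records, min(n, len(records)))
    return [" ".join(text.split())[:width] for text in chosen]


if __name__ == "__main__":
    docs = ["<div>Menu | Home</div>  Opinion of the court ...", "x" * 500]
    for line in preview_samples(docs, n=2, width=40):
        print(line)
```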
I have already done this. I just need to upload it and add the code. The previous example of overlapping data is: https://huggingface.co/datasets/TeraflopAI/Caselaw_Access_Project |
The data should absolutely be post-processed. It contains OCR errors, boilerplate, and numerous other issues. This has already been done. The plain text column is the lowest quality form of the data. There are numerous other columns that take priority over the plain text column. The Court Listener documentation states this. I have already correctly merged the columns in order of priority and done text extraction across them. This was a semi-processed example of it on a subset of CL: https://huggingface.co/datasets/conceptofmind/test_merge This is pending upload given input from Mike Lissner and Jack Cushman. There are fixes from Harvard CAP that still need to be upstreamed and will take a little bit to integrate. |
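The "merge the columns in order of priority" step can be sketched roughly as below. This is not the code from the PR: the column names in `PRIORITY` are placeholders (only a plain text column is named in this thread), and the real ordering should be taken from the Court Listener documentation.

```python
# Placeholder column names, best first. The real names and ordering come from
# the Court Listener documentation; these are assumptions for illustration.
PRIORITY = ["html_best", "html_fallback", "plain_text"]


def pick_best_field(row, priority=PRIORITY):
    """Return (column_name, value) for the first non-empty column in `priority`.

    `row` is assumed to be a dict-like record for one opinion. Returns
    (None, None) when every candidate column is missing or blank.
    """
    for name in priority:
        value = row.get(name)
        if isinstance(value, str) and value.strip():
            return name, value
    return None, None


if __name__ == "__main__":
    row = {"html_best": "", "html_fallback": "<p>Held: ...</p>", "plain_text": "Held: ..."}
    print(pick_best_field(row))
```

Returning the column name alongside the value keeps track of which records still need HTML/XML extraction and which are already plain text.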
@conceptofmind if it has already been done, please open a PR to add the code to this repo - for the purpose of this project, if the code isn't in this repo, it isn't done yet. Thanks. |
I will add initial basic processing code for the stated above and make the changes to the PR. I am waiting on final opinions from CAP/CL and will open another PR after that. |
@conceptofmind Is this the documentation you mentioned? I think I overlooked the HTML based columns which should be cleaned. Why you think |
That is the correct documentation. It says "from best to worst". The best being
The plain text does not contain the numerous fixes and opinionated pre-processing that Harvard CAP and CL have spent time adding during collection. If only
I said that a different PR would need to be opened after for the additional post-processing fixes. Quoted here:
The current PR is to get the correct ordering of the columns and the text extracted. It is best to ensure all of Brian's comments are resolved. I am finishing those first. The next PR that will be opened will contain fixes to the structured HTML/XML as well as things such as boilerplate removal, handling OCR errors, cleaning, etc. I have been working with a team to label any instances of boilerplate that need to be removed. There are still additional fixes that need to be upstreamed from CAP to CL and I am helping them with it now. For example, there are new HTML documents from CAP that are not yet added to CL and need to be processed with specific CL code. I will try to add all of these fixes to the next upcoming PR based on input given to me by CAP and CL. Thanks. |
@conceptofmind Hi, I wonder if you have prepared other legal document data/code since we don't know any details yet? I can help to process the HTML in opinions data since I found a good HTML extractor that works well. Thanks! cc @blester125 @craffel
|
Forwarding this from the PR: You can review the precision and F1 of different text extractors here:
Trafilatura with precision set to True will have even better results than the above. As you can see, BS4 is ranked quite low. Regardless of the results above, if we want to be consistent with Dolma, which is used throughout this project, we should use Trafilatura. It is a single-line adjustment to the code in the PR. The current PR will use Trafilatura for handling the HTML/XML extraction. I am not sure if anyone has worked on collecting any updates to Pile of Law. It is likely worth contacting Peter Henderson in that regard. |
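For reference, "precision set to True" presumably corresponds to the `favor_precision=True` argument of `trafilatura.extract`. The wrapper below is a hedged sketch, not the code in the PR: `extract_text` and its `extractor` parameter are hypothetical names, and the injectable extractor exists only so the function can be exercised without Trafilatura installed.

```python
def extract_text(html, extractor=None):
    """Extract main text from an HTML/XML string, or return None if nothing is found.

    By default this calls trafilatura.extract(..., favor_precision=True).
    `extractor` is a hypothetical hook for swapping in another callable.
    """
    if extractor is None:
        import trafilatura  # third-party: pip install trafilatura

        def extractor(doc):
            return trafilatura.extract(doc, favor_precision=True)

    text = extractor(html)
    return text.strip() if text else None


if __name__ == "__main__":
    sample = "<html><body><article><p>The judgment is affirmed.</p></article></body></html>"
    print(extract_text(sample))
```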
Domain: Legal
Pile of Law
Case Law Access Project
US Congressional Documents
Digitized records of congressional proceedings. See here: https://www.govinfo.gov/app/collection/cdoc/118/sdoc/all
Some of the data is text and some is just PDF. From a quick look, it seems like there are a decent number of tables in the PDFs (which generally don't have text versions available).