
Legal Documents #21

Open
3 tasks
blester125 opened this issue Nov 7, 2023 · 37 comments

@blester125
Collaborator

Domain: Legal

  • Pile of Law

  • Case Law Access Project

  • US Congressional Documents
    Digitized records of Congressional proceedings. See here: https://www.govinfo.gov/app/collection/cdoc/118/sdoc/all
    Some of the data is text and some is PDF only; from a quick look, there seem to be a decent number of tables in the PDFs (which generally don't have text versions available).

@StellaAthena
Collaborator

I have all pre-2020 Congressional proceedings, but when we read over them we quickly decided that the median document was far too racist to be included in a training dataset.

@shayne-longpre
Collaborator

I have all pre-2020 Congressional proceedings, but when we read over them we quickly decided that the median document was far too racist to be included in a training dataset.

@StellaAthena do we want to intentionally exclude toxic content, or to include it but with toxicity/quality scores attached? Based on prior work I wonder if it's better to let users decide, especially since there are beneficial cases to pretrain on real world toxic content (e.g. training realistic toxicity detectors).

@conceptofmind
Contributor

I have all pre-2020 Congressional proceedings, but when we read over them we quickly decided that the median document was far too racist to be included in a training dataset.

@StellaAthena do we want to intentionally exclude toxic content, or to include it but with toxicity/quality scores attached? Based on prior work I wonder if it's better to let users decide, especially since there are beneficial cases to pretrain on real world toxic content (e.g. training realistic toxicity detectors).

We could use the Perspective API to list toxicity scores for each sentence.
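A minimal sketch of what scoring a sentence with Perspective could look like. The request-body shape follows Google's Comment Analyzer docs; the key is a placeholder, and the actual POST is left commented out since it needs real credentials:

```python
import json

# Comment Analyzer endpoint; YOUR_API_KEY is a placeholder.
API_URL = ("https://commentanalyzer.googleapis.com/v1alpha1/"
           "comments:analyze?key=YOUR_API_KEY")

def build_request(sentence: str) -> dict:
    # Request body shape per the Comment Analyzer documentation.
    return {
        "comment": {"text": sentence},
        "languages": ["en"],
        "requestedAttributes": {"TOXICITY": {}},
    }

body = json.dumps(build_request("Example sentence to score.")).encode()
# Sending it (requires a real key):
#   import urllib.request
#   req = urllib.request.Request(API_URL, data=body,
#                                headers={"Content-Type": "application/json"})
#   resp = json.load(urllib.request.urlopen(req))
#   score = resp["attributeScores"]["TOXICITY"]["summaryScore"]["value"]
```

Running one request per sentence is slow at corpus scale; batching documents and scoring spans offline would fit better with doing the annotation globally.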

@conceptofmind
Contributor

Should we also keep Cushman in the loop for this? We should talk to Peter as well.

@craffel
Collaborator

craffel commented Dec 11, 2023

Regarding toxicity annotations, we are not going to do that as part of the per-source preprocessing, it will be done globally to all sources.

@conceptofmind
Contributor

Regarding toxicity annotations, we are not going to do that as part of the per-source preprocessing, it will be done globally to all sources.

Got you. Sounds good.

@conceptofmind
Contributor

I have access to all of the case law and some scripts to get the data. I can CC John Nay.

@wildphoton
Collaborator

@conceptofmind Hi Enrico, if you have not started working on Pile of Law, I will take it.

@StellaAthena
Collaborator

@wildphoton Note that the Pile of Law is an amalgamation of different data sources. We should process those from their source rather than just import the Pile of Law.

Looking through the paper, it seems that about two thirds of the data comes from:

  1. CourtListener Opinions, CourtListener Docket Entries and Court Filings
  2. U.S. Board of Veterans’ Appeals Decisions
  3. Atticus Contracts
  4. EDGAR Contracts
  5. U.S. State Codes

@wildphoton
Collaborator

@wildphoton Note that the Pile of Law is an amalgamation of different data sources. We should process those from their source rather than just import the Pile of Law.

Looking through the paper, it seems that about two thirds of the data comes from:

  1. CourtListener Opinions, CourtListener Docket Entries and Court Filings
  2. U.S. Board of Veterans’ Appeals Decisions
  3. Atticus Contracts
  4. EDGAR Contracts
  5. U.S. State Codes

Thanks for the note! I am starting from the first source.

@conceptofmind
Contributor

I am supposed to get in contact with Jack Cushman soon about case law.

@conceptofmind
Contributor

conceptofmind commented Feb 13, 2024

In contact with Jack Cushman about Case Law Access Project. The data will be open in March for release. I will have all the scripts and data prepared for it then.

@conceptofmind
Contributor

Wrote an updated parser for USPTO.

@baberabb
Contributor

baberabb commented Mar 1, 2024

Wrote an updated parser for USPTO.

Hey! Was about to make a PR for #9 with the BigQuery dataset. Or do we want to parse it ourselves?

@conceptofmind
Contributor

Wrote an updated parser for USPTO.

Hey! Was about to make a PR for #9 with the BigQuery dataset. Or do we want to parse it ourselves?

I have parsed docs from the official site.

It is likely worth doing both.

@conceptofmind
Contributor

CAP is done and I uploaded a post-processed sample to HF

@conceptofmind
Contributor

I am working on finishing Court Listener soon with Ramid.

@conceptofmind
Contributor

We transcribed oral arguments, cleaned the opinion datasets, and are looking into the other sections of POL now.

@wildphoton
Collaborator

Hi @conceptofmind I also wrote a script for downloading the opinion datasets from the POL raw bulk data (just uploaded it to the legal/court_listener branch), and I am now looking into other parts of Court Listener. I wonder how you did the cleaning (did you directly clean the data in Pile of Law?) and what your plan is next. Shall we have a sync on this to avoid duplicate work? Thanks!

@conceptofmind
Contributor

Hi @conceptofmind I also wrote a script for downloading the opinion datasets from the POL raw bulk data (just uploaded it to the legal/court_listener branch), and I am now looking into other parts of Court Listener. I wonder how you did the cleaning (did you directly clean the data in Pile of Law?) and what your plan is next. Shall we have a sync on this to avoid duplicate work? Thanks!

The court audio transcripts have been processed and diarized in collaboration with Tensorlake. The additional court listener opinions were further processed as an extension of CAP and will be released soon.

I have not evaluated the other sets in Court Listener. @hamediramin was looking into those but they did not seem to contain much useful information. He additionally was investigating what sets in Pile of Law could be updated.

@storytracer

Gov.info has a bulk data service, which provides machine-readable versions of bills, statutes, codes, etc. as XML and JSON. Here's the documentation in a GH repo: https://github.com/usgpo/bulk-data.
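For a quick look at what the bulk XML contains, a small stdlib sketch is enough; note the element names in the sample below are illustrative, since the real schemas are documented in the usgpo/bulk-data repo linked above:

```python
import xml.etree.ElementTree as ET

def xml_to_text(xml_string: str) -> str:
    # itertext() walks every text node in document order -- enough for a
    # first look at the content before writing a schema-aware parser.
    root = ET.fromstring(xml_string)
    return " ".join(t.strip() for t in root.itertext() if t.strip())

# Illustrative element names only; consult usgpo/bulk-data for real schemas.
sample = ("<bill><legis-body><section>"
          "<text>Be it enacted by the Senate...</text>"
          "</section></legis-body></bill>")
```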

@conceptofmind
Contributor

@craffel Going to bring this back up here.

@conceptofmind
Contributor

conceptofmind commented Jun 3, 2024

@craffel
Collaborator

craffel commented Jun 3, 2024

Hey @conceptofmind , we won't use the data you're posting about unless the code is in this repo. Can you add the code to this repo and/or update the code in this repo?

@conceptofmind
Contributor

conceptofmind commented Jun 3, 2024

Hey @conceptofmind , we won't use the data you're posting about unless the code is in this repo. Can you add the code to this repo and/or update the code in this repo?

Yes, the code will be posted and this is not the finalized data. This data is just to ensure that duplicated work is not done. Everything being done with CL and Harvard CAP is already open-sourced anyway and you can see it.

@conceptofmind
Contributor

conceptofmind commented Jun 3, 2024

CC @wildphoton so that you do not need to run all the extraction. You will be duplicating the data 15 times. The single correct dump is already extracted and I am post-processing it.

@conceptofmind
Contributor

@craffel I am going to make multiple PRs, but I want to address the issue related to deduplicated data first, so I am opening that one now.

@wildphoton
Collaborator

@conceptofmind Thanks for sharing the info.

  • If each new bulk file includes all the old data, we should only use the newest one. I see that your modification PR has been merged.
  • I am not sure we should post-process the data. According to previous discussions, we want to present the original data as long as it is reasonable and leave the flexibility of data processing to whoever uses the dataset. @craffel Could you confirm this?
  • I also wonder what you did to "fix the rest of the text columns". I use the "plain text" column since it is the only source of the main text. Which columns do you think we should add or use instead?

@craffel
Collaborator

craffel commented Jun 4, 2024

We need to process the data so that it looks like clean natural text. For example, we need to try to remove markup, boilerplate/menu text from webpages, OCR gibberish, etc. The best way to determine whether and what processing we need to do is to dump out some preliminary data and take a look at it.

@conceptofmind
Contributor

conceptofmind commented Jun 4, 2024

We need to process the data so that it looks like clean natural text. For example, we need to try to remove markup, boilerplate/menu text from webpages, OCR gibberish, etc. The best way to determine whether and what processing we need to do is to dump out some preliminary data and take a look at it.

I have already done this. I just need to upload it and add the code. The previous example of overlapping data is: https://huggingface.co/datasets/TeraflopAI/Caselaw_Access_Project

@conceptofmind
Contributor

conceptofmind commented Jun 4, 2024

@conceptofmind Thanks for sharing the info.

  • If each new bulk file includes all the old data, we should only use the newest one. I see that your modification PR has been merged.
  • I am not sure we should post-process the data. According to previous discussions, we want to present the original data as long as it is reasonable and leave the flexibility of data processing to whoever uses the dataset. @craffel Could you confirm this?
  • I also wonder what you did to "fix the rest of the text columns". I use the "plain text" column since it is the only source of the main text. Which columns do you think we should add or use instead?

The data should absolutely be post-processed. It contains OCR errors, boilerplate, and numerous other issues. This has already been done.

The plain text column is the lowest quality form of the data. There are numerous other columns that take priority over the plain text column. The Court Listener documentation states this. I have already correctly merged the columns in order of priority and done text extraction across them. This was a semi-processed example of it on a subset of CL: https://huggingface.co/datasets/conceptofmind/test_merge

This is pending upload given input from Mike Lissner and Jack Cushman.

There are fixes from Harvard CAP that still need to be upstreamed and will take a little bit to integrate.

@craffel
Collaborator

craffel commented Jun 4, 2024

@conceptofmind if it has already been done, please open a PR to add the code to this repo - for the purpose of this project, if the code isn't in this repo, it isn't done yet. Thanks.

@conceptofmind
Contributor

conceptofmind commented Jun 4, 2024

@conceptofmind if it has already been done, please open a PR to add the code to this repo - for the purpose of this project, if the code isn't in this repo, it isn't done yet. Thanks.

I will add initial basic processing code for the above and make the changes to the PR.

I am waiting on final opinions from CAP/CL and will open another PR after that.

@wildphoton
Collaborator

@conceptofmind Thanks for sharing the info.

  • If each new bulk file includes all the old data, we should only use the newest one. I see that your modification PR has been merged.
  • I am not sure we should post-process the data. According to previous discussions, we want to present the original data as long as it is reasonable and leave the flexibility of data processing to whoever uses the dataset. @craffel Could you confirm this?
  • I also wonder what you did to "fix the rest of the text columns". I use the "plain text" column since it is the only source of the main text. Which columns do you think we should add or use instead?

The data should absolutely be post-processed. It contains OCR errors, boilerplate, and numerous other issues. This has already been done.

The plain text column is the lowest quality form of the data. There are numerous other columns that take priority over the plain text column. The Court Listener documentation states this. I have already correctly merged the columns in order of priority and done text extraction across them. This was a semi-processed example of it on a subset of CL: https://huggingface.co/datasets/conceptofmind/test_merge

This is pending upload given input from Mike Lissner and Jack Cushman.

There are fixes from Harvard CAP that still need to be upstreamed and will take a little bit to integrate.

@conceptofmind Is this the documentation you mentioned? I think I overlooked the HTML-based columns, which should be cleaned. Why do you think plain_text is of the lowest quality? It actually sounds like the cleanest one if it is the opinion from a court's website as a PDF or Microsoft Word document, according to the doc, no? You can see them here, they look reasonably good. Also, I did not see that your PR did any processing on the plain_text as you said; the code only combines it with the cleaned HTML columns.

@conceptofmind
Contributor

@conceptofmind Thanks for sharing the info.

  • If each new bulk file includes all the old data, we should only use the newest one. I see that your modification PR has been merged.
  • I am not sure we should post-process the data. According to previous discussions, we want to present the original data as long as it is reasonable and leave the flexibility of data processing to whoever uses the dataset. @craffel Could you confirm this?
  • I also wonder what you did to "fix the rest of the text columns". I use the "plain text" column since it is the only source of the main text. Which columns do you think we should add or use instead?

The data should absolutely be post-processed. It contains OCR errors, boilerplate, and numerous other issues. This has already been done.
The plain text column is the lowest quality form of the data. There are numerous other columns that take priority over the plain text column. The Court Listener documentation states this. I have already correctly merged the columns in order of priority and done text extraction across them. This was a semi-processed example of it on a subset of CL: https://huggingface.co/datasets/conceptofmind/test_merge
This is pending upload given input from Mike Lissner and Jack Cushman.
There are fixes from Harvard CAP that still need to be upstreamed and will take a little bit to integrate.

@conceptofmind Is this the documentation you mentioned? I think I overlooked the HTML-based columns, which should be cleaned. Why do you think plain_text is of the lowest quality? It actually sounds like the cleanest one if it is the opinion from a court's website as a PDF or Microsoft Word document, according to the doc, no? You can see them here, they look reasonably good. Also, I did not see that your PR did any processing on the plain_text as you said; the code only combines it with the cleaned HTML columns.

That is the correct documentation.

It says "from best to worst", with the best being html_with_citations and the worst being plain_text. The ordering of columns to use is listed there and has additionally been confirmed to me by CAP and CL:

The best approach is to choose the first of these fields that is populated, 
according to the following order (from best to worst):
- html_with_citations is a special field that is populated by parsing one of the above fields for citations, generating an HTML file with hyperlinked citations. All items should eventually have this field, though it can be empty initially or if the cross-linker crashes. In general, this is the field that is used to generate pages on CourtListener and the one we recommend.
- html_columbia will be populated if we got the content from the Columbia collaboration.
- html_lawbox will be populated if we got the content from the Lawbox donation.
- xml_harvard will be populated if the source was Harvard's Caselaw Access Project. This field has a lot of data but is inferior to others due to being created by OCR instead of by humans.
- html_anon_2020 will be populated if we got the content from our anonymous source in 2020.
- html will be populated if we got the opinion from a court's website as a Word Perfect or HTML document, or if we got the opinion from Resource.org, which provides HTML documents.
- plain_text will be populated if we got the opinion from a court's website as a PDF or Microsoft Word document.

The plain text does not contain the numerous fixes and opinionated pre-processing that Harvard CAP and CL have spent time adding during collection. If only plain_text is used, much of the data contained in the other columns is missed. For these reasons and more not stated, it is typically ranked lowest. I imagine any structured government data is going to look pretty good!

Also, I did not see that your PR did any processing on the plain_text as you said; the code only combines it with the cleaned HTML columns.

I said that a different PR would need to be opened after for the additional post-processing fixes.

Quoted here:

I am waiting on final opinions from CAP/CL and will open another PR after that.

The current PR is to get the correct ordering of the columns and the text extracted. It is best to ensure all of Brian's comments are resolved. I am finishing those first.

The next PR that will be opened will contain fixes to the structured HTML/XML as well as things such as boilerplate removal, handling erroneous OCR errors, cleaning, etc. I have been working with a team to label any instances of boilerplate that need to be removed.

There are still additional fixes that need to be upstreamed from CAP to CL and I am helping them with it now. For example, there are new HTML documents from CAP that are not yet added to CL and need to be processed with specific CL code.

I will try to add all of these fixes to the next upcoming PR based on input given to me by CAP and CL.

Thanks.

@wildphoton
Collaborator

@conceptofmind Hi, I wonder if you have prepared any other legal document data/code, since we don't know any details yet? I can help process the HTML in the opinions data, since I found a good HTML extractor that works well. Thanks! cc @blester125 @craffel

The court audio transcripts have been processed and diarized in collaboration with Tensorlake. The additional court listener opinions were further processed as an extension of CAP and will be released soon.

I have not evaluated the other sets in Court Listener. @hamediramin was looking into those but they did not seem to contain much useful information. He additionally was investigating what sets in Pile of Law could be updated.

@conceptofmind
Contributor

conceptofmind commented Jun 12, 2024

@conceptofmind Hi, I wonder if you have prepared any other legal document data/code, since we don't know any details yet? I can help process the HTML in the opinions data, since I found a good HTML extractor that works well. Thanks! cc @blester125 @craffel

The court audio transcripts have been processed and diarized in collaboration with Tensorlake. The additional court listener opinions were further processed as an extension of CAP and will be released soon.
I have not evaluated the other sets in Court Listener. @hamediramin was looking into those but they did not seem to contain much useful information. He additionally was investigating what sets in Pile of Law could be updated.

Forwarding this from the PR:

You can review the precision and F1 of different text extractors here:

| Model         | Mean Precision | Mean F1 | Median Precision | Median F1 |
|---------------|----------------|---------|------------------|-----------|
| Trafilatura   | 0.913          | 0.883   | 0.989            | 0.957     |
| DOM Distiller | 0.894          | 0.858   | 0.983            | 0.959     |
| Web2Text      | 0.797          | 0.841   | 0.885            | 0.917     |
| Boilerpipe    | 0.908          | 0.834   | 0.973            | 0.946     |
| Dragnet       | 0.901          | 0.823   | 0.980            | 0.943     |
| BTE           | 0.796          | 0.817   | 0.927            | 0.936     |
| Newspaper3k   | 0.896          | 0.816   | 0.994            | 0.958     |
| news-please   | 0.895          | 0.815   | 0.994            | 0.958     |
| Goose3        | 0.899          | 0.810   | 0.999            | 0.940     |
| BoilerNet     | 0.840          | 0.798   | 0.944            | 0.895     |
| ExtractNet    | 0.858          | 0.791   | 0.963            | 0.911     |
| jusText       | 0.794          | 0.759   | 0.949            | 0.904     |
| lxml Cleaner  | 0.615          | 0.717   | 0.670            | 0.798     |
| html_text     | 0.567          | 0.683   | 0.506            | 0.667     |
| BS4           | 0.563          | 0.680   | 0.506            | 0.669     |
| inscriptis    | 0.557          | 0.673   | 0.483            | 0.649     |
| XPath Text    | 0.550          | 0.664   | 0.510            | 0.674     |

Trafilatura with favor_precision set to True will have even better results than the above. As you can see, BS4 is ranked quite low. Regardless of the results above, if we want to be consistent with Dolma, which is used throughout this project, we should use Trafilatura. It is a single-line adjustment to the code in the PR: trafilatura.extract(filecontent, favor_precision=True).

The current PR will use Trafilatura for handling the HTML/XML extraction.
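As a sketch of that single-line adjustment in context (assuming the trafilatura package is installed; the naive stdlib fallback below is purely illustrative, for fragments Trafilatura declines to extract):

```python
from html.parser import HTMLParser

class _TextOnly(HTMLParser):
    """Trivial fallback extractor: collect every non-empty text node."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip())

def extract_text(filecontent: str) -> str:
    # Prefer Trafilatura, as discussed above; favor_precision=True trades
    # some recall for cleaner output.
    try:
        import trafilatura
        result = trafilatura.extract(filecontent, favor_precision=True)
        if result:
            return result
    except ImportError:
        pass  # fall through to the naive extractor
    parser = _TextOnly()
    parser.feed(filecontent)
    return " ".join(parser.parts)
```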

I am not sure if anyone has worked on collecting any updates to Pile of Law. It is likely worth contacting Peter Henderson in that regard.
