# Investigations in which documents are key

We might have in hand vast amounts of raw data from leaks, public records, web scrapes, FOIA requests, or whistleblowers. These could be emails, financial documents, contracts, social media posts, judicial judgments and disclosure forms.

Finding the buried story is another challenge. How do we read through thousands of pages and structure all this unstructured data?

AI chatbots will put on the brakes if you try to upload a massive number of documents at once.

We'll learn how to use modern investigative techniques to discover patterns, anomalies, and insights that might otherwise go unnoticed.

Here are a handful of examples:

**ProPublica**: <a href ="https://www.propublica.org/article/facebook-advertising-discrimination-housing-race-sex-national-origin">Facebook (Still) Letting Housing Advertisers Exclude Users by Race</a>
- This investigation uncovered how Facebook allowed advertisers to exclude users by race. Journalists used keyword searches to identify discriminatory practices in ad targeting.

**The New York Times**: <a href="https://www.nytimes.com/2015/10/25/us/racial-disparity-traffic-stops-driving-black.html">The Disproportionate Risks of Driving While Black</a>
- NYT reporters analyzed police traffic stop data to expose racial disparities. They used keyword searches to identify patterns in the data, revealing that Black drivers were more likely to be stopped and searched.

**ICIJ**: <a href="https://www.icij.org/investigations/panama-papers/">THE PANAMA PAPERS</a>
- This monumental collaborative investigation dug through 11.5 million leaked files **(2.6 terabytes of data)** to expose the offshore holdings of world political leaders, links to global scandals, and details of the hidden financial dealings of fraudsters, drug traffickers, billionaires, celebrities, sports stars and more.


Today, we'll learn to quantify unstructured text with this challenge:

#### Confession Judgments

Confession Judgments have been around since the Middle Ages, but are especially common in New York State.

This legal procedure is a signed agreement where someone admits they owe money and allows the creditor to get a court judgment without a lawsuit if they don’t pay. It saves time for creditors but can be risky for debtors because they give up their right to defend themselves in court.

You scrape thousands of confession judgments from the NYS court website and want to quantify how many were signed by residents who had fallen behind on rent, mortgage payments, or utility bills. You also discover an interesting confession judgment and decide you want to quantify that too.

## Steps we need to take:
1. Acquisition, either through scraping, FOIA, or being given files by one or more sources
2. Organizing
3. Reading & understanding a few to uncover any patterns
4. Structuring unstructured content to quantify values

We'll work our way up to that challenge.

# Download, Capture, Structure

A few weeks ago we downloaded **all** the files on <a href="https://sandeepmj.github.io/scrape-example-page/pages.html">the demo site</a>.

Let's say we have all the files <a href="https://drive.google.com/file/d/1yNfAK-a72oLFv_QxRbvDi3HCj-Gxd6T9/view?usp=sharing">in this folder</a>. Download this folder and move it next to your ```.ipynb``` file. (Soon we'll learn how to do this programmatically.)



In [1]:
## import libraries
import pandas as pd
import glob


## Organizing our files using ```glob```

```glob``` does one thing and one thing only: it finds every file path that matches a pattern and returns those paths in a list. The ```*``` wildcard stands for "any characters."
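
Here is a minimal sketch of how the wildcard behaves. It builds a throwaway folder with made-up file names (they are assumptions for illustration, not the practice documents) so it runs anywhere:

```python
import glob
import os
import tempfile

# Build a temporary folder with a few empty, made-up files.
with tempfile.TemporaryDirectory() as folder:
    for name in ["report_1.pdf", "report_2.pdf", "notes.txt", "counts.csv"]:
        open(os.path.join(folder, name), "w").close()

    # * matches any run of characters within a file name
    everything = glob.glob(os.path.join(folder, "*"))
    only_pdfs = glob.glob(os.path.join(folder, "*.pdf"))
    reports = glob.glob(os.path.join(folder, "report_*"))

    # keep just the file names so the results are easy to read
    pdf_names = sorted(os.path.basename(p) for p in only_pdfs)
    report_names = sorted(os.path.basename(p) for p in reports)

print(len(everything))
print(pdf_names)
print(report_names)
```

The pattern is matched against file names only. ```glob``` never looks inside the files.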

"glob" all files in the main directory:

In [3]:
## pull all files in the main directory
glob.glob("practice_documents/*")


['practice_documents/801368_2022_Marie_Cannon_Commissi_v_Marie_Cannon_Commissi_CONFESSION_OF_JUDGM_1.pdf',
 'practice_documents/fla_count_as_of_2020-08-19_time_11_46_00.csv',
 'practice_documents/800118_2022_Marie_A_Cannon_Commi_v_Marie_A_Cannon_Commi_CONFESSION_OF_JUDGM_1.pdf',
 'practice_documents/text_doc_J.txt',
 'practice_documents/801337_2022_Marie_Cannon_Commissi_v_Marie_Cannon_Commissi_CONFESSION_OF_JUDGM_1.pdf',
 'practice_documents/fla_count_as_of_2020-08-19_time_12_16_00.csv',
 'practice_documents/text_doc_08.txt',
 'practice_documents/text_doc_H.txt',
 'practice_documents/text_doc_I.txt',
 'practice_documents/text_doc_09.txt',
 'practice_documents/800394_2022_Marie_A_Cannon_Commi_v_Marie_A_Cannon_Commi_CONFESSION_OF_JUDGM_1.pdf',
 'practice_documents/adolph-coors-2015.pdf',
 'practice_documents/pdf_2.pdf',
 'practice_documents/pdf_3.pdf',
 'practice_documents/adolph-coors-2014.pdf',
 'practice_documents/pdf_1.pdf',
 'practice_documents/800166_2022_Marie_A_Cannon_Commi_v_Mar

In [5]:
## capture only the pdfs

glob.glob("practice_documents/*.pdf")


['practice_documents/801368_2022_Marie_Cannon_Commissi_v_Marie_Cannon_Commissi_CONFESSION_OF_JUDGM_1.pdf',
 'practice_documents/800118_2022_Marie_A_Cannon_Commi_v_Marie_A_Cannon_Commi_CONFESSION_OF_JUDGM_1.pdf',
 'practice_documents/801337_2022_Marie_Cannon_Commissi_v_Marie_Cannon_Commissi_CONFESSION_OF_JUDGM_1.pdf',
 'practice_documents/800394_2022_Marie_A_Cannon_Commi_v_Marie_A_Cannon_Commi_CONFESSION_OF_JUDGM_1.pdf',
 'practice_documents/adolph-coors-2015.pdf',
 'practice_documents/pdf_2.pdf',
 'practice_documents/pdf_3.pdf',
 'practice_documents/adolph-coors-2014.pdf',
 'practice_documents/pdf_1.pdf',
 'practice_documents/800166_2022_Marie_A_Cannon_Commi_v_Marie_A_Cannon_Commi_CONFESSION_OF_JUDGM_1.pdf',
 'practice_documents/adolph-coors-2013.pdf',
 'practice_documents/pdf_4.pdf',
 'practice_documents/pdf_5.pdf',
 'practice_documents/pdf_7.pdf',
 'practice_documents/pdf_6.pdf',
 'practice_documents/pdf_8.pdf',
 'practice_documents/pdf_9.pdf',
 'practice_documents/801366_2022_Marie_

In [25]:
## capture only the text files
glob.glob("practice_documents/*txt")

['practice_documents/text_doc_J.txt',
 'practice_documents/text_doc_08.txt',
 'practice_documents/text_doc_H.txt',
 'practice_documents/text_doc_I.txt',
 'practice_documents/text_doc_09.txt',
 'practice_documents/read_sample2.txt',
 'practice_documents/read_sample1.txt',
 'practice_documents/text_doc_04.txt',
 'practice_documents/text_doc_10.txt',
 'practice_documents/text_doc_D.txt',
 'practice_documents/text_doc_E.txt',
 'practice_documents/text_doc_05.txt',
 'practice_documents/text_doc_07.txt',
 'practice_documents/text_doc_G.txt',
 'practice_documents/text_doc_F.txt',
 'practice_documents/text_doc_06.txt',
 'practice_documents/text_doc_02.txt',
 'practice_documents/text_doc_B.txt',
 'practice_documents/text_doc_C.txt',
 'practice_documents/text_doc_03.txt',
 'practice_documents/text_doc_01.txt',
 'practice_documents/text_doc_A.txt']

In [19]:
## capture only the files with coors in their names (the adolph-coors pdfs)

glob.glob("practice_documents/*coors*")

['practice_documents/adolph-coors-2015.pdf',
 'practice_documents/adolph-coors-2014.pdf',
 'practice_documents/adolph-coors-2013.pdf']

In [27]:
## files about florida
glob.glob("practice_documents/fla*")

['practice_documents/fla_count_as_of_2020-08-19_time_11_46_00.csv',
 'practice_documents/fla_count_as_of_2020-08-19_time_12_16_00.csv',
 'practice_documents/fla_count_as_of_2020-08-19_time_11_31_00.csv',
 'practice_documents/fla_count_as_of_2020-08-19_time_12_31_00.csv',
 'practice_documents/fla_count_as_of_2020-08-19_time_12_01_00.csv']

In [29]:
## finally, just the ones that start with text_doc...and store in list called target_files
target_files = glob.glob("practice_documents/text_doc_*")
target_files


['practice_documents/text_doc_J.txt',
 'practice_documents/text_doc_08.txt',
 'practice_documents/text_doc_H.txt',
 'practice_documents/text_doc_I.txt',
 'practice_documents/text_doc_09.txt',
 'practice_documents/text_doc_04.txt',
 'practice_documents/text_doc_10.txt',
 'practice_documents/text_doc_D.txt',
 'practice_documents/text_doc_E.txt',
 'practice_documents/text_doc_05.txt',
 'practice_documents/text_doc_07.txt',
 'practice_documents/text_doc_G.txt',
 'practice_documents/text_doc_F.txt',
 'practice_documents/text_doc_06.txt',
 'practice_documents/text_doc_02.txt',
 'practice_documents/text_doc_B.txt',
 'practice_documents/text_doc_C.txt',
 'practice_documents/text_doc_03.txt',
 'practice_documents/text_doc_01.txt',
 'practice_documents/text_doc_A.txt']

### Sort for clarity

Use ```sorted()``` to put the list in alphabetical order.

In [7]:
## sorted list
target_files = sorted(glob.glob("practice_documents/text_doc*"))
target_files

['practice_documents/text_doc_01.txt',
 'practice_documents/text_doc_02.txt',
 'practice_documents/text_doc_03.txt',
 'practice_documents/text_doc_04.txt',
 'practice_documents/text_doc_05.txt',
 'practice_documents/text_doc_06.txt',
 'practice_documents/text_doc_07.txt',
 'practice_documents/text_doc_08.txt',
 'practice_documents/text_doc_09.txt',
 'practice_documents/text_doc_10.txt',
 'practice_documents/text_doc_A.txt',
 'practice_documents/text_doc_B.txt',
 'practice_documents/text_doc_C.txt',
 'practice_documents/text_doc_D.txt',
 'practice_documents/text_doc_E.txt',
 'practice_documents/text_doc_F.txt',
 'practice_documents/text_doc_G.txt',
 'practice_documents/text_doc_H.txt',
 'practice_documents/text_doc_I.txt',
 'practice_documents/text_doc_J.txt']

## Read & Structure

Create a dataframe that holds:

- name of the renter, 
- whether their lease is renewed or terminated,
- the name of the source file.

We'll do this two ways:

1. by coding Python (when we want to keep our data confidential).

2. by tapping the ChatGPT API (when it's publicly available data).


### We want to keep it confidential:

In [33]:
## step one, open and read files
for target_file in target_files:
    with open(target_file, "r") as my_doggy:
        print(type(my_doggy))
        

<class '_io.TextIOWrapper'>
<class '_io.TextIOWrapper'>
<class '_io.TextIOWrapper'>
<class '_io.TextIOWrapper'>
<class '_io.TextIOWrapper'>
<class '_io.TextIOWrapper'>
<class '_io.TextIOWrapper'>
<class '_io.TextIOWrapper'>
<class '_io.TextIOWrapper'>
<class '_io.TextIOWrapper'>
<class '_io.TextIOWrapper'>
<class '_io.TextIOWrapper'>
<class '_io.TextIOWrapper'>
<class '_io.TextIOWrapper'>
<class '_io.TextIOWrapper'>
<class '_io.TextIOWrapper'>
<class '_io.TextIOWrapper'>
<class '_io.TextIOWrapper'>
<class '_io.TextIOWrapper'>
<class '_io.TextIOWrapper'>


`.readlines()` is a Python file method that reads **all lines** from an open text file and returns them as a **list of strings**, where each string represents one line (including the newline character `\n` at the end of each line).
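
A small sketch of the difference between keeping and dropping those newline characters. The sample file is a stand-in written to a temporary location, and its client name is made up:

```python
import os
import tempfile

# Write a small stand-in file that mimics the layout of the practice documents.
sample = "Client: Example Co.\n\nThe decision is to renew rental agreement."
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

with open(path, "r") as f:
    raw_lines = f.readlines()            # keeps the trailing "\n" on each line

with open(path, "r") as f:
    clean_lines = f.read().splitlines()  # same lines, newline characters removed

os.remove(path)  # clean up the temporary file

print(raw_lines)
print(clean_lines)
```

Either way, the client sits at index 0 and the decision at index 2, which is the layout the next cells rely on.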

In [51]:
## step one, open and read files
text_list = []
for target_file in target_files[:2]:
    with open(target_file, "r") as my_text:
        # print(type(my_text))
        all_text = my_text.readlines()
        text_list.append(all_text)
        
        

In [53]:
text_list

[['Client: Pen Federal Credit Union\n',
  '\n',
  'The decision is to renew rental agreement.'],
 ['Client: Help Desk Inc.\n',
  '\n',
  'The decision is to reject rental agreement.']]

In [61]:
all_text[2]

'The decision is to reject rental agreement.'

## now let's add logic:

In [9]:
## code here
decision_list = []
for target_file in target_files:
    with open(target_file, "r") as my_text:
        all_text = my_text.readlines()
        client = all_text[0].replace("Client:", "").strip()
        # print(client)
        decision = all_text[2]
        if "renew" in decision:
            decision = "renew"
        elif "terminate" in decision:
            decision = "terminate"
        else:
            decision = "FLAG"
        # print(decision)
        decision_list.append({
            "client": client,
            "decision": decision,
            "source": target_file
        })
print("All done")

All done


In [75]:
all_text[0]

'Client: Speaker List Bureau\n'

In [11]:
## show decision list
decision_list

[{'client': 'Pen Federal Credit Union',
  'decision': 'renew',
  'source': 'practice_documents/text_doc_01.txt'},
 {'client': 'Help Desk Inc.',
  'decision': 'FLAG',
  'source': 'practice_documents/text_doc_02.txt'},
 {'client': "Global Wax n' Wane",
  'decision': 'renew',
  'source': 'practice_documents/text_doc_03.txt'},
 {'client': 'Kick Box',
  'decision': 'terminate',
  'source': 'practice_documents/text_doc_04.txt'},
 {'client': 'RedKey Inc.',
  'decision': 'terminate',
  'source': 'practice_documents/text_doc_05.txt'},
 {'client': 'Clip-n-Chip',
  'decision': 'terminate',
  'source': 'practice_documents/text_doc_06.txt'},
 {'client': 'CoLens Limited',
  'decision': 'terminate',
  'source': 'practice_documents/text_doc_07.txt'},
 {'client': 'Diceware Inc.',
  'decision': 'renew',
  'source': 'practice_documents/text_doc_08.txt'},
 {'client': 'Teflon Inc.',
  'decision': 'FLAG',
  'source': 'practice_documents/text_doc_09.txt'},
 {'client': 'RBG Inc.',
  'decision': 'renew',
 

In [91]:
## turn into dataframe
df = pd.DataFrame(decision_list)
df

Unnamed: 0,client,decision,source
0,Pen Federal Credit Union,renew,practice_documents/text_doc_01.txt
1,Help Desk Inc.,FLAG,practice_documents/text_doc_02.txt
2,Global Wax n' Wane,renew,practice_documents/text_doc_03.txt
3,Kick Box,terminate,practice_documents/text_doc_04.txt
4,RedKey Inc.,terminate,practice_documents/text_doc_05.txt
5,Clip-n-Chip,terminate,practice_documents/text_doc_06.txt
6,CoLens Limited,terminate,practice_documents/text_doc_07.txt
7,Diceware Inc.,renew,practice_documents/text_doc_08.txt
8,Teflon Inc.,FLAG,practice_documents/text_doc_09.txt
9,RBG Inc.,renew,practice_documents/text_doc_10.txt
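
Notice the ```FLAG``` rows. The keyword check only knows two words, so a sentence like "The decision is to reject rental agreement." slips through. One way to tighten this, sketched below, is to check against short synonym lists. The lists and the function name ```classify_decision``` are assumptions for illustration. Build your own lists after reading a sample of your documents.

```python
# Hypothetical synonym lists; extend them after reading a sample of documents.
RENEW_WORDS = ("renew", "extend")
TERMINATE_WORDS = ("terminate", "reject", "cancel")

def classify_decision(sentence):
    """Label a sentence renew or terminate, or FLAG it for a human to read."""
    text = sentence.lower()
    says_renew = any(word in text for word in RENEW_WORDS)
    says_terminate = any(word in text for word in TERMINATE_WORDS)
    if says_renew and not says_terminate:
        return "renew"
    if says_terminate and not says_renew:
        return "terminate"
    return "FLAG"  # neither or both matched, so a person should check it

print(classify_decision("The decision is to renew rental agreement."))
print(classify_decision("The decision is to reject rental agreement."))
print(classify_decision("The decision is still pending."))
```

Simple keyword matching still misreads phrases like "will not renew," so keep the ```FLAG``` bucket and read those documents yourself.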


## ChatGPT API

These documents are publicly available, but we have thousands. We can use the ChatGPT API to avoid messages like "You can only upload 10 documents at a time."

Let's use the ```ChatGPT``` API, but some meta concepts first:

## 1. Get Your API Key

An API key is like a password that lets your code access the AI service.

**How to get it:**
- Go to the AI provider's website
- Create an account
- Find the API keys section
- Generate a new key
- Copy it immediately (you'll only see it once)
- I save all my keys in a password manager in one document called ```API Keys```

---

## 2. Using Your API Key

**Keep it secret:**
- Store it in a separate file (called `.env`)
- Never put it directly in your code
- Never share it with others
- Never upload it to GitHub or email it

**Why?**
- Anyone with your key can use your account
- You'll be charged for their usage
- They could rack up thousands of dollars in charges

---

## 3. Keep Track of Cost

**AI APIs charge you for:**
- Every request you make
- The amount of text you send
- The amount of text the AI generates

**How to manage costs:**
- Check the pricing before you start
- Estimate your total cost: (number of files) × (cost per file)
- Test on a small batch first (like 10 files)
- Set up billing alerts
- Monitor your usage regularly

**Example:**
If each document costs ```$0.01``` to process, then 5000 documents = $50
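
The same arithmetic as a small sketch: run a test batch, note what it cost, and scale up. The function name and the dollar figures are hypothetical. Take real numbers from your provider's billing page.

```python
def estimate_total_cost(pilot_cost, pilot_files, total_files):
    """Scale the bill from a small test batch up to the full set of files."""
    cost_per_file = pilot_cost / pilot_files
    return round(cost_per_file * total_files, 2)

# Hypothetical numbers: a 10-file test batch that cost 10 cents in total.
full_run = estimate_total_cost(pilot_cost=0.10, pilot_files=10, total_files=5000)
print(f"Estimated cost for the full run: ${full_run:.2f}")
```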

## Survey class on their RAMS

## 4. Don't Send Sensitive Information

**Never send:**
- Social Security Numbers
- Credit card numbers  
- Passwords
- Medical records
- Personal financial information
- Confidential source memos, emails, etc.

**Why?**
- Your data gets sent to the AI company's servers
- It may be stored temporarily
- There's always a privacy risk
- It may be used as training data
- It may be subpoenaed

**When in doubt:** Remove sensitive information first, or don't use the API.
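
A rough sketch of what "remove sensitive information first" can look like in code. The two patterns below are illustrative assumptions that catch only the most common formats. They are not a complete safeguard, so still read a sample of the redacted text before sending anything.

```python
import re

# Illustrative patterns only; real documents will contain formats these miss.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def redact(text):
    """Replace SSN-like and card-like numbers with placeholders."""
    text = SSN_PATTERN.sub("[REDACTED-SSN]", text)
    text = CARD_PATTERN.sub("[REDACTED-CARD]", text)
    return text

# Made-up numbers for the demo.
sample = "SSN 123-45-6789, card 4111 1111 1111 1111."
cleaned = redact(sample)
print(cleaned)
```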

---

## Summary

1. **Get your API key** - Your password to use the AI service
2. **Protect it** - Keep it secret, never share or upload it
3. **Watch costs** - Test small, estimate total, set alerts
4. **Privacy first** - Don't send sensitive personal or confidential data


We need a few things before we can do this:
1. A ChatGPT API key
2. A way to hide that key (you don't want to use up your entire quota if someone stumbles on your key)

#### How to get a ChatGPT key:

1. **Go to the OpenAI platform:** https://platform.openai.com/api-keys

2. **Log in** to (or create) your OpenAI account.

3. **Navigate to the API keys section** (sometimes under your profile icon → "View API keys").

4. **Click "Create new secret key"** (or similar). A new key will be generated.

5. **Copy and securely store the key immediately**, because you'll only see it once. If you lose it, you'll need to regenerate. I keep all my API keys in Dashlane in a document called ```API keys```.

6. **Set the key secretly in your notebook:**

We will use `python-dotenv`, a professional way to hide your API key used at Bloomberg and other places.

```python
import os  ## to handle environment variables
from dotenv import load_dotenv

load_dotenv()  ## reads the key/value pairs from your .env file
```
You need to ```pip install python-dotenv``` first.


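7. Create a file named ```.env``` using VSCode and add one line to it: ```OPENAI_API_KEY=``` followed by your key. **Note:** file names that start with a dot are hidden by default, so once you save and close it, you may not see it in your file browser anymore.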

In [93]:
pip install python-dotenv

Note: you may need to restart the kernel to use updated packages.


In [15]:
## import 
import os ## use it to secretly pull in our api key

from dotenv import load_dotenv

load_dotenv()

True

In [17]:
## pull secret API key into your notebook
chatGPT_key = os.getenv('OPENAI_API_KEY')


In [19]:
## confirm the key loaded, but never print the key itself
print(chatGPT_key is not None)

True


In [119]:
pip install openai

Note: you may need to restart the kernel to use updated packages.


In [23]:
## import openai
from openai import OpenAI

In [25]:
# use the key you already loaded
client = OpenAI(api_key=chatGPT_key)

response = client.responses.create(
    model="gpt-4o-2024-08-06",
    input=[{"role": "user", "content": "Hello, how do i make a whiskey sour?"}]
)

print(response.output_text)

Here's a classic recipe for a Whiskey Sour:

**Ingredients:**

- 2 oz whiskey (bourbon works well)
- 3/4 oz fresh lemon juice
- 1/2 oz simple syrup
- Ice
- Optional: a dash of Angostura bitters
- Garnish: cherry and/or orange slice

**Instructions:**

1. **Mix Ingredients:** Combine the whiskey, lemon juice, and simple syrup in a shaker.

2. **Shake:** Add ice to the shaker and shake well until the mixture is chilled.

3. **Strain:** Strain the mixture into a rocks glass filled with ice.

4. **Garnish:** Add a cherry and/or a slice of orange for garnish. An optional dash of bitters can add extra complexity.

Enjoy your Whiskey Sour responsibly!


## From OpenAI:

#### OpenAI Models Overview (for Data Journalism – 2025)

| Model | Ideal Use Cases | Strengths | Limitations | Notes |
|--------|----------------|------------|--------------|-------|
| **gpt-4o-2024-08-06** | Deep text analysis, summarization, document extraction, mixed media (text + images) | Fast, accurate, handles long documents (128k tokens), can analyze charts/images | Slightly higher cost than 3.5 models | Best overall model for newsroom analytics & investigative automation |
| **gpt-4-turbo** | Structured data extraction, complex reasoning, policy analysis | Excellent at nuanced reasoning & tone analysis | Text-only | Great for in-depth narrative analysis or investigative topics |
| **gpt-3.5-turbo-0125** | Quick categorization, entity extraction, keyword tagging | Fast & cheap, good for large-scale automation | Weaker reasoning, less factual consistency | Ideal for bulk processing thousands of records or articles |
| **gpt-4o-mini** | Summarization, tagging, lightweight insight extraction | Very low latency and cost | Less detail in long reasoning | Best for high-volume, real-time dashboards |
| **text-embedding-3-large** | Search, clustering, similarity, topic modeling | Generates high-quality vector embeddings | Not generative (no text output) | Combine with a vector DB (e.g., FAISS, Pinecone) for newsroom archive search |
| **text-embedding-3-small** | Fast keyword and semantic search | Cheap and lightweight | Lower embedding quality | Great for smaller newsroom tools or prototypes |
| **whisper-1** | Transcription of interviews, speeches, press conferences | Accurate multilingual transcription | Audio-only | Can auto-generate transcripts for podcasts or public meetings |
| **gpt-4o-realtime-preview** | Live interviews, streaming Q&A, event monitoring | Responds to audio, text, or video in real time | Experimental API access | Future-facing for live analysis during debates or press events |

\*Prices change frequently — always confirm at [https://openai.com/api/pricing](https://openai.com/api/pricing).

In [135]:
decisions_list = []

for i, target_file in enumerate(target_files, start = 1):
    print(f"processing {i} of {len(target_files)} ")
    # open and read
    with open(target_file, 'r') as f:
        text = f.read()
    
    # Single API call asking for simple format
    ## the next few lines just say hey, here's my prompt and the text i want you to look at.
    response = client.chat.completions.create(
        model="gpt-4o-2024-08-06",
        messages=[{
            "role": "user", ## i, the human, am asking the question.
            "content": f"""Extract the client name and decision from this document.
Return ONLY in this format: client_name|decision
Where decision is either 'renew' or 'terminate' but the words in the text might be synonyms of 'renew' or 'terminate'


{text}"""
        }],
        max_tokens=100
    )
    
    # Parse the simple response
    result = response.choices[0].message.content.strip()
    client_name, decision = result.split('|')
    
    decisions_list.append({
        "client": client_name.strip(),
        "decision": decision.strip(),
        "source_file": target_file
    })

print("done processing!")
df = pd.DataFrame(decisions_list)


processing 1 of 20 
processing 2 of 20 
processing 3 of 20 
processing 4 of 20 
processing 5 of 20 
processing 6 of 20 
processing 7 of 20 
processing 8 of 20 
processing 9 of 20 
processing 10 of 20 
processing 11 of 20 
processing 12 of 20 
processing 13 of 20 
processing 14 of 20 
processing 15 of 20 
processing 16 of 20 
processing 17 of 20 
processing 18 of 20 
processing 19 of 20 
processing 20 of 20 
done processing!


In [27]:
rent, income = 300, 1000

In [29]:
rent

300

In [31]:
income

1000

In [137]:
df

Unnamed: 0,client,decision,source_file
0,Pen Federal Credit Union,renew,practice_documents/text_doc_01.txt
1,Help Desk Inc.,terminate,practice_documents/text_doc_02.txt
2,Global Wax n' Wane,renew,practice_documents/text_doc_03.txt
3,Kick Box,terminate,practice_documents/text_doc_04.txt
4,RedKey Inc.,terminate,practice_documents/text_doc_05.txt
5,Clip-n-Chip,terminate,practice_documents/text_doc_06.txt
6,CoLens Limited,terminate,practice_documents/text_doc_07.txt
7,Diceware Inc.,renew,practice_documents/text_doc_08.txt
8,Teflon Inc.,renew,practice_documents/text_doc_09.txt
9,RBG Inc.,renew,practice_documents/text_doc_10.txt


## Breaking it down:

### Line 1: Getting ChatGPT's response text
```python
result = response.choices[0].message.content.strip()
```

- `response` = the full API response object from OpenAI
- `.choices[0]` = get the first response (OpenAI can return multiple options, but we just want the first one)
- `.message.content` = extract the actual text ChatGPT wrote (`message` is the container that holds the content)
- `.strip()` = remove any extra spaces/newlines from the beginning or end

**Example:** `result` becomes `"Pen Federal Credit Union|renew"`

---

### Line 2: Splitting the text into two variables
```python
client_name, decision = result.split('|')
```

- `result.split('|')` = split the string at the pipe character `|`, creating a list with 2 items
- `client_name, decision =` = unpack those 2 items into 2 separate variables

**Example:**
- `result.split('|')` → `["Pen Federal Credit Union", "renew"]`
- Then Python assigns:
  - `client_name = "Pen Federal Credit Union"`
  - `decision = "renew"`
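
The unpacking above assumes ChatGPT always answers with exactly one pipe. If it adds a sentence or an extra ```|```, the loop crashes partway through your files. Here is a sketch of a more forgiving parser. The function name ```parse_reply``` is hypothetical, and the rule "anything unexpected becomes FLAG" is one possible design choice.

```python
def parse_reply(reply):
    """Turn 'client|decision' into a tuple; FLAG anything that breaks the format."""
    parts = [part.strip() for part in reply.strip().split("|")]
    if len(parts) != 2 or not all(parts):
        # Keep the raw reply so a person can see what went wrong.
        return reply.strip(), "FLAG"
    client_name, decision = parts
    decision = decision.lower()
    if decision not in ("renew", "terminate"):
        decision = "FLAG"
    return client_name, decision

print(parse_reply("Pen Federal Credit Union|renew"))
print(parse_reply("Kick Box | Terminate\n"))
print(parse_reply("I could not find a decision."))
```

Inside the loop, you would call ```parse_reply(result)``` in place of ```result.split('|')``` and review the flagged rows by hand.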



In [None]:
df

## Our Confession Judgment Challenge

In [None]:
## glob confession judgment files


Unlike our previous ```.txt``` files, these are ```pdf``` files, and PDFs are notoriously obnoxious. They are designed so people can't change them easily. Old packages like ```PyPDF2``` pretty much sucked.

But our ability to read PDFs has improved dramatically, or has at least been simplified, because AI companies NEED to unlock their content to train their LLMs. There's ```pymupdf4llm```,   ```PyMuPDF``` and several others.

We'll use ```PyMuPDF``` to read these files because it's clean and super fast!

```bash
pip install PyMuPDF
```

In [33]:
pip install PyMuPDF

Note: you may need to restart the kernel to use updated packages.


#### import PyMuPDF which uses the name 'fitz'
```python
import fitz  
```

In [35]:
## import fitz

import fitz

In [53]:
target_files = glob.glob("practice_documents/*CONFESSION*")
target_files

['practice_documents/801368_2022_Marie_Cannon_Commissi_v_Marie_Cannon_Commissi_CONFESSION_OF_JUDGM_1.pdf',
 'practice_documents/800118_2022_Marie_A_Cannon_Commi_v_Marie_A_Cannon_Commi_CONFESSION_OF_JUDGM_1.pdf',
 'practice_documents/801337_2022_Marie_Cannon_Commissi_v_Marie_Cannon_Commissi_CONFESSION_OF_JUDGM_1.pdf',
 'practice_documents/800394_2022_Marie_A_Cannon_Commi_v_Marie_A_Cannon_Commi_CONFESSION_OF_JUDGM_1.pdf',
 'practice_documents/800166_2022_Marie_A_Cannon_Commi_v_Marie_A_Cannon_Commi_CONFESSION_OF_JUDGM_1.pdf',
 'practice_documents/801366_2022_Marie_Cannon_Commissi_v_Marie_Cannon_Commissi_CONFESSION_OF_JUDGM_1.pdf',
 'practice_documents/800120_2022_Marie_A_Cannon_Commi_v_Marie_A_Cannon_Commi_CONFESSION_OF_JUDGM_1.pdf',
 'practice_documents/801367_2022_Marie_Cannon_Commissi_v_Marie_Cannon_Commissi_CONFESSION_OF_JUDGM_1.pdf',
 'practice_documents/800119_2022_Marie_A_Cannon_Commi_v_Marie_A_Cannon_Commi_CONFESSION_OF_JUDGM_1.pdf',
 'practice_documents/801293_2022_Marie_Cannon_C

In [55]:
## read a single file
with fitz.open(target_files[0]) as doc:
    print(type(doc))
    my_text = ""
    print(f"doc is a {type(doc)} file")
    for page in doc:
        my_text += page.get_text()

<class 'pymupdf.Document'>
doc is a <class 'pymupdf.Document'> file


In [57]:
for target_file in target_files:
    try:
        with fitz.open(target_file) as doc:
            print(f"doc is a {type(doc)} file")
            my_text = ""
            for page in doc:
                my_text += page.get_text()
        ## print once per document, after all pages are collected
        print(my_text)
    except Exception as e:
        print(f"ran into error with {target_file}: {e}")
    

doc is a <class 'pymupdf.Document'> file
STATE OF NEW YORK
SUPREME COURT: COUNTY OF ERIE
Marie A. Cannon, Commissioner of
Erie County Department ofSocial Services
V
Dienson Volmy
SID# xxx xx 4944
STATE OF NEW YORK ]
COUNTY OF ERIE
]
CITY OF BUFFALO
]
PLAINTIFF
DEFENDANT
SS:
Affidavit of Confession of Judgment
Index No
The Deponent being duly sworn, deposes and says:
1.
I am the defendant in the above entitled action.
2.
I reside at 28 Dismonda, Buffalo, County of Erie, State ofNew York. I authorize entry ofjudgment in Erie
County, State ofNew York ifmy residence is not in New York State.
3.
I confess judgment in this court, in favor of the plaintiff and against the defendant in the sum of One
Thousand Five Hundred Twenty-Five and 00/100 Dollars ($1525.00) and hereby authorize the plaintiff or
his assigns to enter judgment for that sum against me plus court costs and interest.
4.
This confession ofjudgment is for a debt to become justly due to the plaintiff arising out of the following


In [47]:
my_text

'STATE OF NEW YORK\nSUPREME COURT: COUNTY OF ERIE\nMarie A. Cannon, Commissioner of\nErie County Department ofSocial Services\nV\nDienson Volmy\nSID# xxx xx 4944\nSTATE OF NEW YORK ]\nCOUNTY OF ERIE\n]\nCITY OF BUFFALO\n]\nPLAINTIFF\nDEFENDANT\nSS:\nAffidavit of Confession of Judgment\nIndex No\nThe Deponent being duly sworn, deposes and says:\n1.\nI am the defendant in the above entitled action.\n2.\nI reside at 28 Dismonda, Buffalo, County of Erie, State ofNew York. I authorize entry ofjudgment in Erie\nCounty, State ofNew York ifmy residence is not in New York State.\n3.\nI confess judgment in this court, in favor of the plaintiff and against the defendant in the sum of One\nThousand Five Hundred Twenty-Five and 00/100 Dollars ($1525.00) and hereby authorize the plaintiff or\nhis assigns to enter judgment for that sum against me plus court costs and interest.\n4.\nThis confession ofjudgment is for a debt to become justly due to the plaintiff arising out of the following\nfacts:\nI a
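
Once the text is out of the PDF, quantifying it is ordinary string work. As one possible starting point, this sketch pulls the dollar figure out of a shortened copy of the text shown above. The pattern assumes the amount always appears in parentheses, as in ```($1525.00)```. Check that assumption against several documents before trusting it.

```python
import re

# A shortened copy of the extracted text shown above.
sample_text = (
    "I confess judgment in this court, in favor of the plaintiff and against "
    "the defendant in the sum of One\nThousand Five Hundred Twenty-Five and "
    "00/100 Dollars ($1525.00) and hereby authorize the plaintiff or\n"
    "his assigns to enter judgment for that sum against me plus court costs."
)

# Dollar figure in parentheses, with optional thousands separators and cents.
AMOUNT_PATTERN = re.compile(r"\(\$\s*([\d,]+(?:\.\d{2})?)\)")

def extract_amount(text):
    """Return the first parenthesized dollar amount as a float, or None."""
    match = AMOUNT_PATTERN.search(text)
    if match is None:
        return None  # no amount found, so flag this document for review
    return float(match.group(1).replace(",", ""))

amount = extract_amount(sample_text)
print(amount)
```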

## Use AI to help write your code here to quantify the info

In [None]:
## code here