# Keyword Context Investigations

We might have in hand vast amounts of raw data from leaks, public records, web scrapes, FOIA requests, or whistleblowers. These could be emails, financial documents, contracts, social media posts, judicial judgments, and disclosure forms.

Finding the buried story is another challenge. How do we read through thousands of pages and give structure to all this unstructured data?

We'll learn how to use modern investigative techniques to discover patterns, anomalies, and insights that might otherwise go unnoticed.

Here are a handful of examples:

**ProPublica**: <a href ="https://www.propublica.org/article/facebook-advertising-discrimination-housing-race-sex-national-origin">Facebook (Still) Letting Housing Advertisers Exclude Users by Race</a>
- This investigation uncovered how Facebook allowed advertisers to exclude users by race. Journalists used keyword searches to identify discriminatory practices in ad targeting.

**The New York Times**: <a href="https://www.nytimes.com/2015/10/25/us/racial-disparity-traffic-stops-driving-black.html">The Disproportionate Risks of Driving While Black</a>
- NYT reporters analyzed police traffic stop data to expose racial disparities. They used keyword searches to identify patterns in the data, revealing that Black drivers were more likely to be stopped and searched.

**ICIJ**: <a href="https://www.icij.org/investigations/panama-papers/">THE PANAMA PAPERS</a>
- This monumental collaborative investigation dug through 11.5 million leaked files **(2.6 terabytes of data)** to expose the offshore holdings of world political leaders, links to global scandals, and details of the hidden financial dealings of fraudsters, drug traffickers, billionaires, celebrities, sports stars and more.


If journalists had not used the techniques **you are now learning**, we'd still be poring over these files with our eyes and hands.

## MD Disciplinary action analysis

We'll bring together several previous lessons today, plus a couple of new ones.

We'll analyze <a href="https://drive.google.com/file/d/1m9dINJmoPGvFLTJagBSdvr9qmnExPSwZ/view?usp=sharing">a sample set of disciplinary actions</a> against New York state doctors. (You already know how to download PDFs, convert them to markdown, etc.)

Here's a brief outline of what we need to do:
1. Get to know the documents you are analyzing.
2. Create a list of search terms.
3. It's not enough to simply find the keywords. You need context. Determine how much context you need.
4. Run your code to process the PDFs into markdown and then into spreadsheets.
5. Export your filtered dataframes into text files that are easier to read.
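
To make step 3 concrete, here is a minimal sketch of what "keyword plus context" means, using only the standard library. The function name `keyword_in_context`, the `window` parameter, and the sample sentence are all hypothetical illustrations; this is not how the packages used later in this lesson work internally, and it only handles exact, single-word keywords.

```python
import re

def keyword_in_context(text, keyword, window=5):
    """Return each exact hit of `keyword` with `window` words on either side."""
    words = text.split()
    hits = []
    for i, word in enumerate(words):
        # Strip punctuation and lowercase so "Fraud," still matches "fraud"
        cleaned = re.sub(r"\W", "", word).lower()
        if cleaned == keyword.lower():
            start = max(0, i - window)
            hits.append(" ".join(words[start:i + window + 1]))
    return hits

# Hypothetical sentence, written for this example only
sample = "Respondent was convicted of health care fraud in the first degree and sentenced to two years."

for snippet in keyword_in_context(sample, "fraud", window=3):
    print(snippet)
```

A small window keeps snippets quick to scan, while a larger one tells you who did what to whom. Deciding on that trade-off is the "how much context" question in the outline.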



In [22]:
## Explore your pdfs and create a list of search terms

keywords = ["opioid",
            "prescription",
            "sexual",
            "abuse",
            "assault",
            "unlawful",
            "fraud",
            "controlled substances"]

In [2]:
pip install mangoCR --upgrade

DEPRECATION: pdf2images 0.0.6 has a non-standard dependency specifier plumbum>=1.6.8cv. pip 23.3 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of pdf2images or contact the author to suggest that they release a version with a conforming dependency specifiers. Discussion can be found at https://github.com/pypa/pip/issues/12063
Note: you may need to restart the kernel to use updated packages.


In [3]:
## import libraries
import pandas as pd
import regex as re
import glob
from mangoCR import pdf2image_ocr_text

In [4]:
## glob some pdfs together

target_pdfs = glob.glob("md-excerpts/*.pdf")
target_pdfs

['md-excerpts/md-192582.pdf',
 'md-excerpts/md-211807.pdf',
 'md-excerpts/md-250497.pdf',
 'md-excerpts/md-125202.pdf',
 'md-excerpts/md-152712.pdf',
 'md-excerpts/md-113372.pdf',
 'md-excerpts/md-198793.pdf',
 'md-excerpts/md-137421.pdf',
 'md-excerpts/md-162447.pdf',
 'md-excerpts/md-009178.pdf',
 'md-excerpts/md-184714.pdf',
 'md-excerpts/md-026201.pdf',
 'md-excerpts/md-260373.pdf']
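
Notice that the list above is not in alphabetical order. `glob.glob` returns files in arbitrary order, so an index like `[-1]` can point to a different file on another machine. A minimal sketch of one way to get a stable order, using a throwaway folder and hypothetical file names so it runs anywhere:

```python
import glob
import os
import tempfile

# Build a temporary folder holding a few empty files with hypothetical names
with tempfile.TemporaryDirectory() as folder:
    for name in ["md-250497.pdf", "md-009178.pdf", "md-192582.pdf"]:
        open(os.path.join(folder, name), "w").close()

    # sorted() gives the same order on every machine and every run
    demo_pdfs = sorted(glob.glob(os.path.join(folder, "*.pdf")))
    demo_names = [os.path.basename(path) for path in demo_pdfs]

print(demo_names)
```

Wrapping the real call the same way, `sorted(glob.glob("md-excerpts/*.pdf"))`, would make the run reproducible without changing anything else in the notebook.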

### Single file analysis

Let's see how it works on a single file before we iterate through all of them.


In [6]:
## convert pdf to text (not markdown)
text = pdf2image_ocr_text(target_pdfs[-1])
text

Processing PDF 1 of 1: md-260373
  - Processed page 1 of 17 in md-260373
  - Processed page 2 of 17 in md-260373
  - Processed page 3 of 17 in md-260373
  - Processed page 4 of 17 in md-260373
  - Processed page 5 of 17 in md-260373
  - Processed page 6 of 17 in md-260373
  - Processed page 7 of 17 in md-260373
  - Processed page 8 of 17 in md-260373
  - Processed page 9 of 17 in md-260373
  - Processed page 10 of 17 in md-260373
  - Processed page 11 of 17 in md-260373
  - Processed page 12 of 17 in md-260373
  - Processed page 13 of 17 in md-260373
  - Processed page 14 of 17 in md-260373
  - Processed page 15 of 17 in md-260373
  - Processed page 16 of 17 in md-260373
  - Processed page 17 of 17 in md-260373
Finished processing md-260373

All PDFs have been processed.


'## md-260373_page_1\nNEW YORK | Department\n\nSTATE OF\n\nOPPORTUNITY.\nof Health\nKATHY HOCHUL JAMES V. McDONALD, M.D, M.P.H. MEGAN £. BALDWIN\nGovernor Acting Commissioner Acting Executive Deputy Commissioner\n\nApril 25, 2023\n\nCERTIFIED MAIL-RETURN RECEIPT REQUESTED\n\nManish Raval, M.D.\n\nRe: License No. 260373\n\nDear Dr. Raval:\n\nEnclosed is a copy of the New York State Board for Professional Medical Conduct\n(BPMC) Order No. 23-086. This order and any penalty provided therein goes into effect\nMay 2, 2023.\n\nPlease direct any questions to: Board for Professional Medical Conduct, Riverview\nCenter, 150 Broadway, Suite 355, Albany, New York 12204, telephone # 518-402-0846.\n\nSincerely,\n\nDavid Besser, M.D.\nExecutive Secretary\nBoard for Professional Medical Conduct\n\nEnclosure\n\ncc: Concetta Lomanto, Esq.\nMaynard, O’Connor, Smith and Catalinotto, LLP.\n6 Tower Place\nAlbany, New York 12203\n\nEmpire State Plaza, Corning Tower, Albany, NY 12237 | health.ny.gov\n\n\n## m
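
The OCR output marks the start of each page with a line such as `## md-260373_page_1`. If you ever want to inspect one page at a time, a sketch like the following splits the text on those markers. It uses the standard-library `re` module and a short hypothetical sample in the same layout; the pattern assumes every marker follows the `## <name>_page_<number>` form seen above.

```python
import re

# Hypothetical two-page sample in the same "## <name>_page_<n>" layout
sample_text = (
    "## md-000000_page_1\nDear Dr. Example:\n\nEnclosed is a copy of the order.\n\n\n"
    "## md-000000_page_2\nThe Respondent shall comply with the terms.\n"
)

# Capture the marker, its page number, and everything up to the next marker
page_pattern = re.compile(
    r"^## (\S+_page_(\d+))\n(.*?)(?=^## \S+_page_\d+\n|\Z)",
    re.MULTILINE | re.DOTALL,
)

# Map page number -> page text with surrounding whitespace trimmed
pages = {
    int(number): body.strip()
    for marker, number, body in page_pattern.findall(sample_text)
}

for number, body in pages.items():
    print(f"--- page {number} ---")
    print(body)
```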

### Keyword Context Finder

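We'll use the <a href="https://pypi.org/project/fuzzy-context-finder/">fuzzy-context-finder</a> package, which fuzzy-matches each search term against the text and returns every hit with its surrounding context.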

In [10]:
pip install fuzzy-context-finder

DEPRECATION: pdf2images 0.0.6 has a non-standard dependency specifier plumbum>=1.6.8cv. pip 23.3 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of pdf2images or contact the author to suggest that they release a version with a conforming dependency specifiers. Discussion can be found at https://github.com/pypa/pip/issues/12063
Note: you may need to restart the kernel to use updated packages.


In [12]:
## import the package
from fuzzy_context_finder import keyword_context_finder

In [14]:
df = keyword_context_finder(text, keywords, 'md-excerpts/md-260373.pdf', match_threshold=70)
df

| | File Name | Page Marker | Page Number | Matched Term | Original Term | Similarity Score | Search Term with 50 Words Context | Previous 250 Words (Including Term) | Next 250 Words (Including Term) |
|---|---|---|---|---|---|---|---|---|---|
| 0 | md-excerpts/md-260373.pdf | md-260373_page_5 | 5 | prescribing, | prescription | 75.0 | while under the supervision of a practice supe... | \* Pursuant to New York Pub. Health Law § 230-a... | prescribing, dispensing, ordering and/or admin... |
| 1 | md-excerpts/md-260373.pdf | md-260373_page_5 | 5 | use | abuse | 75.0 | exceptions noted above, shall be appropriately... | \* Pursuant to New York Pub. Health Law § 230-a... | use of controlled substances, upon execution o... |
| 2 | md-excerpts/md-260373.pdf | md-260373_page_7 | 7 | description | prescription | 86.956522 | Center, 150 Broadway, Suite 355, Albany, New Y... | the licensee's registration period. Licensee s... | description of Respondent's employment and pra... |
| 3 | md-excerpts/md-260373.pdf | md-260373_page_15 | 15 | cause, | abuse | 72.727273 | or able to serve or no more than 30 days after... | 8) 9) Respondent shall enroll in and successfu... | cause, which shall include but not be limited ... |
| 4 | md-excerpts/md-260373.pdf | md-260373_page_15 | 15 | prescribing | prescription | 78.26087 | on a random unannounced basis at least monthly... | specialty, (“practice monitor’) proposed by Re... | prescribing information and office records. Th... |
| 5 | md-excerpts/md-260373.pdf | md-260373_page_15 | 15 | cause | abuse | 80.0 | to OPMC. b) Respondent shall be solely respons... | Respondent shall have an approved successor in... | cause the practice monitor to report quarterly... |
| 6 | md-excerpts/md-260373.pdf | md-260373_page_17 | 17 | cause, | abuse | 72.727273 | willing or able to serve or no more than 30 da... | 1) EXHIBIT “C” PRACTICE SUPERVISOR No more tha... | cause, which shall include but not be limited ... |
| 7 | md-excerpts/md-260373.pdf | md-260373_page_17 | 17 | prescribing, | prescription | 75.0 | in Paragraphs a-c below. Regardless of the rea... | 1) EXHIBIT “C” PRACTICE SUPERVISOR No more tha... | prescribing, ordering, dispensing and/or admin... |
| 8 | md-excerpts/md-260373.pdf | md-260373_page_17 | 17 | prescribing, | prescription | 75.0 | In that event, Respondent shall propose anothe... | 1) EXHIBIT “C” PRACTICE SUPERVISOR No more tha... | prescribing, ordering, dispensing and/or admin... |
| 9 | md-excerpts/md-260373.pdf | md-260373_page_17 | 17 | position | prescription | 70.0 | is familiar with the Order and terms of probat... | by the Director that a supervisor has been dis... | position to regularly observe and assess Respo... |
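
Several rows above are false positives: "use" matched "abuse" and "description" matched "prescription". To see why a low threshold lets these through, here is a small sketch of a ratio-style similarity score. It uses the standard library's `difflib.SequenceMatcher` as a stand-in; that fuzzy-context-finder scores matches the same way is an assumption, and the helper name `similarity` is made up for this example.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Ratio-style score on a 0-100 scale: 2 * matching characters / total characters."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100

# (word found in the document, search term) pairs taken from the table above
pairs = [
    ("use", "abuse"),
    ("description", "prescription"),
    ("prescriptions", "prescription"),
]

for matched, keyword in pairs:
    score = similarity(matched, keyword)
    passes_70 = score >= 70
    passes_90 = score >= 90
    print(f"{matched!r} vs {keyword!r}: {score:.2f}  passes 70: {passes_70}  passes 90: {passes_90}")
```

Short words are the weak spot: "use" shares three of the five letters in "abuse", which is enough to clear a threshold of 70. Raising the threshold trims this noise, but it can also drop real variants, so read the context before discarding a row.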


## Hard to read text?

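The DataFrame display truncates long cells, which makes the context hard to read. We'll tap a package called <a href="https://pypi.org/project/tabular-2-text/">tabular_2_text</a> to export the results to a plain text file.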


In [15]:
!pip install tabular-2-text



In [16]:
## import package
from tabular_2_text import tab2text

In [17]:
## convert to text
tab2text(df, "md_1_file.txt")

md_1_file.txt is in your local project folder.


'md_1_file.txt'

## Now iterate through all files

In [23]:
## iterate through files
df_list = []
empty_list = []
for count, target_pdf in enumerate(target_pdfs, start=1):  # Use enumerate for progress tracking
    print(f"Processing file {count}/{len(target_pdfs)}: {target_pdf}")
    
    text = pdf2image_ocr_text(target_pdf)  # Extract text from the PDF
    
    try:
        # Append the DataFrame with results to df_list
        df_list.append(keyword_context_finder(text, keywords, target_pdf, match_threshold=76))
    except Exception as err:
        # Handle files where no matches are found or other errors occur,
        # and show the error so real failures are not mistaken for "no matches"
        print(f"No words found in {target_pdf} ({err})")
        empty_list.append(target_pdf)
        
print("DONE PROCESSING!")

Processing file 1/13: md-excerpts/md-192582.pdf
Processing PDF 1 of 1: md-192582
  - Processed page 1 of 1 in md-192582
Finished processing md-192582

All PDFs have been processed.
No results found!
Processing file 2/13: md-excerpts/md-211807.pdf
Processing PDF 1 of 1: md-211807
  - Processed page 1 of 2 in md-211807
  - Processed page 2 of 2 in md-211807
Finished processing md-211807

All PDFs have been processed.
Processing file 3/13: md-excerpts/md-250497.pdf
Processing PDF 1 of 1: md-250497
  - Processed page 1 of 2 in md-250497
  - Processed page 2 of 2 in md-250497
Finished processing md-250497

All PDFs have been processed.
No results found!
Processing file 4/13: md-excerpts/md-125202.pdf
Processing PDF 1 of 1: md-125202
  - Processed page 1 of 2 in md-125202
  - Processed page 2 of 2 in md-125202
Finished processing md-125202

All PDFs have been processed.
Processing file 5/13: md-excerpts/md-152712.pdf
Processing PDF 1 of 1: md-152712
  - Processed page 1 of 2 in md-152712
  -

In [24]:
## how long is df_list?

len(df_list)

13

In [25]:
## concat them

df = pd.concat(df_list, ignore_index = True)
df

| | File Name | Page Marker | Page Number | Matched Term | Original Term | Similarity Score | Search Term with 50 Words Context | Previous 250 Words (Including Term) | Next 250 Words (Including Term) |
|---|---|---|---|---|---|---|---|---|---|
| 0 | md-excerpts/md-211807.pdf | md-211807_page_2 | 2 | defraud | fraud | 83.333333 | County Criminal Part, Respondent was convicted... | NEW YORK STATE DEPARTMENT OF HEALTH STATE BOAR... | defraud in the first degree (NY Penal Law 190.... |
| 1 | md-excerpts/md-211807.pdf | md-211807_page_2 | 2 | fraud | fraud | 100.0 | first degree (NY Penal Law 190.65), two counts... | NEW YORK STATE DEPARTMENT OF HEALTH STATE BOAR... | fraud in the first degree (NY Penal Law 177.25... |
| 2 | md-excerpts/md-211807.pdf | md-211807_page_2 | 2 | defraud | fraud | 83.333333 | health care fraud in the first degree (NY Pena... | NEW YORK STATE DEPARTMENT OF HEALTH STATE BOAR... | defraud charge and imprisonment for two years ... |
| 3 | md-excerpts/md-211807.pdf | md-211807_page_2 | 2 | unlawful | unlawful | 100.0 | in the United States District Court for the Di... | NEW YORK STATE DEPARTMENT OF HEALTH STATE BOAR... | unlawful distribution of anabolic steroids and... |
| 4 | md-excerpts/md-125202.pdf | md-125202_page_2 | 2 | unlawful | unlawful | 100.0 | Onor about June 2, 2023, in the United States ... | NEW YORK STATE DEPARTMENT OF HEALTH STATE BOAR... | unlawful distribution of Oxycodone in violatio... |
| 5 | md-excerpts/md-113372.pdf | md-113372_page_3 | 3 | prescriptions | prescription | 96.0 | Surrender with the Texas Medical Board (Texas ... | charged with professional misconduct pursuant ... | prescriptions to them for controlled substance... |
| 6 | md-excerpts/md-113372.pdf | md-113372_page_4 | 4 | prescribing | prescription | 78.26087 | of the Respondent’s medical license was the on... | sent the Notice of Referral Proceeding and Sta... | prescribing controlled substances, opioids, an... |
| 7 | md-excerpts/md-113372.pdf | md-113372_page_4 | 4 | opioids, | opioid | 85.714286 | medical license was the only appropriate penal... | of Referral Proceeding and Statement of Charge... | opioids, and other dangerous drugs without mai... |
| 8 | md-excerpts/md-113372.pdf | md-113372_page_4 | 4 | prescriptions. | prescription | 92.307692 | Committee was very troubled by the Respondent'... | associated with the Respondent via certified m... | prescriptions. Although charged with negligenc... |
| 9 | md-excerpts/md-113372.pdf | md-113372_page_4 | 4 | prescribing | prescription | 78.26087 | Respondent's persistent pattern of prescribing... | mail. (Exhibit 2.) Service was properly effect... | prescribing practices for 15 patients, the Nan... |


In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27 entries, 0 to 26
Data columns (total 9 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   File Name                            27 non-null     object 
 1   Page Marker                          27 non-null     object 
 2   Page Number                          27 non-null     int64  
 3   Matched Term                         27 non-null     object 
 4   Original Term                        27 non-null     object 
 5   Similarity Score                     27 non-null     float64
 6   Search Term with 50 Words Context    27 non-null     object 
 7   Previous 250 Words (Including Term)  27 non-null     object 
 8   Next 250 Words (Including Term)      27 non-null     object 
dtypes: float64(1), int64(1), object(7)
memory usage: 2.0+ KB
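
Before exporting, you may want to narrow the combined results to the strongest matches. The sketch below rebuilds a few rows from the output above in a small stand-alone DataFrame so it runs on its own. The variable names and the cutoff of 90 are hypothetical choices for illustration, not values recommended by either package.

```python
import pandas as pd

# A few rows copied from the combined results shown above
sample = pd.DataFrame(
    {
        "File Name": [
            "md-excerpts/md-211807.pdf",
            "md-excerpts/md-211807.pdf",
            "md-excerpts/md-125202.pdf",
            "md-excerpts/md-113372.pdf",
            "md-excerpts/md-113372.pdf",
            "md-excerpts/md-113372.pdf",
        ],
        "Matched Term": ["defraud", "fraud", "unlawful", "prescriptions", "prescribing", "opioids,"],
        "Original Term": ["fraud", "fraud", "unlawful", "prescription", "prescription", "opioid"],
        "Similarity Score": [83.333333, 100.0, 100.0, 96.0, 78.26087, 85.714286],
    }
)

min_score = 90  # hypothetical cutoff; tune it after reading the contexts

# Keep only the rows at or above the cutoff
strong = sample[sample["Similarity Score"] >= min_score]

# Count the surviving hits per document
hits_per_file = strong["File Name"].value_counts()

print(strong[["File Name", "Matched Term", "Similarity Score"]])
print(hits_per_file)
```

Note what a cutoff of 90 throws away here: "defraud" and "opioids," are genuine hits that score below it. A stricter filter saves reading time at the cost of missed leads, so it is worth skimming the dropped rows at least once.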


In [27]:
## convert to text
tab2text(df, "final_md_excerpt.txt")

final_md_excerpt.txt is in your local project folder.


'final_md_excerpt.txt'