# Florida Physician Violations: Dealing with ugly documents

I did most of the grunt work for you: scraped a database, downloaded a million PDFs, and automatically ran them through `convert` and `tesseract` in the hopes of coming out with readable text.

Unfortunately, **the text looks like trash**. Scans at weird angles, nothing looks right, text issues all over the place.

Maybe I'm looking for documents that deal with **opioids** - Vicodin, hydrocodone, oxycodone, etc. What are some techniques we can use to effectively search these documents? And most importantly, **is it different than what we did with the New York doctors?**

Maybe I'm looking for documents that deal with **sexual assault** or **sexual harassment**, which take a lot more forms than opioids, and may have indirect language used to describe them. Are our search techniques the same as above?

> **Tip:** When dealing with dirty text, you might want to use something like [fuzzywuzzy](https://github.com/seatgeek/fuzzywuzzy), which we [covered in class](http://jonathansoma.com/lede/algorithms-2017/classes/fuzziness-matplotlib/fuzzing-matching-in-pandas-with-fuzzywuzzy/), or [jellyfish](https://github.com/jamesturk/jellyfish), which we didn't.

> **Question:** Is machine learning really the best technique? Maybe, maybe not! There are a handful of other approaches I can think of off the top of my head, and not all of them are that technical.

I'm going to clean the data up a *little* down below, then it's up to you. You can search for whatever you'd like. Yes, there aren't that many documents, but that's just because it's soooo slow to convert to JPG and run tesseract!
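
One low-tech way to start on the opioid question is to find out how the OCR actually spelled each drug name. The sketch below uses `difflib` from Python's standard library as a stand-in for fuzzywuzzy. The sample sentence, the `find_ocr_variants` helper, and the 0.8 cutoff are all assumptions made for illustration, not values tuned on these documents.

```python
import re
from difflib import get_close_matches

def find_ocr_variants(text, terms, cutoff=0.8):
    """Map each search term to the words in `text` that look like it."""
    # Split into lowercase alphanumeric tokens; digits are kept because
    # OCR often swaps letters for digits (o -> 0, l -> 1).
    words = sorted(set(re.findall(r"[a-z0-9]+", text.lower())))
    return {
        term: get_close_matches(term, words, n=5, cutoff=cutoff)
        for term in terms
    }

# Made-up sample text, not taken from the Florida documents.
sample = "Respondent prescribed 0xycodone and hydroc0done to the patient"
print(find_ocr_variants(sample, ["oxycodone", "hydrocodone"]))
```

Any misspellings that turn up this way can be added to a plain keyword search, which may be enough for a topic with a small, fixed vocabulary like drug names.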

## Combining the documents and the violations records

### Step 1.1: Read in the violations

In [1]:
import pandas as pd
violations_df = pd.read_csv("florida-violations.csv", dtype={'case': str})
violations_df.head(2)

Unnamed: 0,action_date,action_taken,case,case_url,city,county,lic_number,lic_url,name,profession,state
0,06/26/2001,Obligations Imposed,199615776,https://appsmqa.doh.state.fl.us/MQASearchServi...,ORMOND BEACH,VOLUSIA,27057,https://appsmqa.doh.state.fl.us/MQASearchServi...,"PARIKH, MADHUSUDAN",Medical Doctor,FL
1,03/27/2001,Obligations Imposed,199955593,https://appsmqa.doh.state.fl.us/MQASearchServi...,AUCKLAND,UNKNOWN,52282,https://appsmqa.doh.state.fl.us/MQASearchServi...,"WIMBROW, THOMAS",Medical Doctor,ZZ


### Step 1.2: Read in the converted PDFs

In [2]:
import glob

filenames = glob.glob("converted-docs/*")
contents = [open(filename).read() for filename in filenames]
docs_df = pd.DataFrame({
    'filename': filenames,
    'contents': contents
})
# Raw string with an escaped dot, so ".txt" is matched literally
docs_df['case'] = docs_df.filename.str.extract(r"converted-docs/(.*)\.txt", expand=False)
docs_df.head(2)

Unnamed: 0,contents,filename,case
0,6M}:\n\nFILED\n\nDepartment of Professional Re...,converted-docs/100040.txt,100040
1,\n\nSTATE OF FLORIDA\nDEPARTMENT OF BUSINESS ...,converted-docs/100142.txt,100142
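
If some of the tesseract output is not valid text in your default encoding, `open(filename).read()` can raise a `UnicodeDecodeError`. Here is a minimal sketch of a more forgiving reader, assuming the files are roughly UTF-8 and sit in a folder of `.txt` files named after the case number. `read_converted_docs` is a hypothetical helper, not part of the original notebook.

```python
from pathlib import Path

def read_converted_docs(folder):
    """Read every .txt file in `folder` into a list of dicts."""
    records = []
    for path in sorted(Path(folder).glob("*.txt")):
        # errors="replace" swaps undecodable bytes for a placeholder
        # character instead of stopping on the first bad file.
        text = path.read_text(encoding="utf-8", errors="replace")
        records.append({
            "filename": str(path),
            "case": path.stem,  # file name without the .txt extension
            "contents": text,
        })
    return records

# Hypothetical usage with the folder from this notebook:
# docs_df = pd.DataFrame(read_converted_docs("converted-docs"))
```

Using `path.stem` for the case number also avoids writing a regular expression for the file name.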


### Step 1.3: Merge the two

In [3]:
df = docs_df.merge(violations_df, on='case')
df.head(2)

Unnamed: 0,contents,filename,case,action_date,action_taken,case_url,city,county,lic_number,lic_url,name,profession,state
0,6M}:\n\nFILED\n\nDepartment of Professional Re...,converted-docs/100040.txt,100040,08/16/1990,Probation-App Rpts/Screens Req,https://appsmqa.doh.state.fl.us/MQASearchServi...,JACKSONVILLE,DUVAL,26676,https://appsmqa.doh.state.fl.us/MQASearchServi...,"DRUCKER, MICHAEL",Medical Doctor,FL
1,\n\nSTATE OF FLORIDA\nDEPARTMENT OF BUSINESS ...,converted-docs/100142.txt,100142,03/25/1994,Voluntary Surrender,https://appsmqa.doh.state.fl.us/MQASearchServi...,CORAL GABLES,MIAMI-DADE,34265,https://appsmqa.doh.state.fl.us/MQASearchServi...,"RICO-PEREZ, MANUEL",Medical Doctor,FL


In [4]:
# How many do we have?
df.shape

(190, 13)

## Finding the documents

This part is up to you! What topic are you trying to find?

In [7]:
from fuzzywuzzy import fuzz, process

In [16]:
# Find documents that mention "sexual" (exact, case-sensitive substring match)
df[df.contents.str.contains("sexual")].shape

(19, 13)
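
An exact substring match misses both capitalized text ("SEXUAL") and OCR slips; the Florida scans contain slips like `l20.57` for `120.57`. As a rough sketch, a regular expression can allow a few common look-alike characters. The confusion table below is an assumption made for illustration, not a measured list of tesseract errors, and `ocr_tolerant_pattern` is a hypothetical helper.

```python
import re

# Assumed look-alike characters; extend this after inspecting real output.
OCR_LOOKALIKES = {
    "l": "[l1i|]",
    "i": "[i1l!|]",
    "o": "[o0]",
    "s": "[s5]",
    "a": "[a@]",
}

def ocr_tolerant_pattern(term):
    """Build a case-insensitive regex that tolerates look-alike characters."""
    parts = [OCR_LOOKALIKES.get(ch, re.escape(ch)) for ch in term.lower()]
    return re.compile("".join(parts), re.IGNORECASE)

pattern = ocr_tolerant_pattern("sexual")

# Made-up sample lines, not taken from the Florida documents.
for line in ["SEXUAL misconduct", "sexua1 contact", "textual analysis"]:
    print(line, "->", bool(pattern.search(line)))
```

With pandas, the same expression could be applied to the whole column with something like `df.contents.str.contains(pattern.pattern, case=False)`.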

In [75]:
list_of_terms = ["sexual", "harassment", 'indecent', 'nude', 'rape']

In [76]:
def get_average_ratio(row):
    # Average the fuzzy score of the whole document against each search term.
    # Caution: token_sort_ratio compares the full text to a single short word,
    # so even relevant documents score close to 0.
    content = row['contents']
    ratios = [fuzz.token_sort_ratio(content, term) for term in list_of_terms]
    return sum(ratios) / len(ratios)

df['ratio'] = df.apply(get_average_ratio, axis=1)
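
Comparing a whole document against a one-word term gives a tiny score almost regardless of content, because the two strings differ so much in length. One possible alternative is to score each term against every individual word and keep the best match. The sketch below does that with `difflib` from the standard library as a stand-in; fuzzywuzzy's `fuzz.partial_ratio` is aimed at a similar substring-style comparison. The helper names and the sample sentence are assumptions for illustration.

```python
from difflib import SequenceMatcher

def best_word_score(text, term):
    """Highest 0-100 similarity between `term` and any single word in `text`."""
    term = term.lower()
    best = 0.0
    for word in text.lower().split():
        score = SequenceMatcher(None, word, term).ratio() * 100
        best = max(best, score)
    return best

def average_best_score(text, terms):
    """Average of the best per-word scores; `terms` must not be empty."""
    return sum(best_word_score(text, term) for term in terms) / len(terms)

# Made-up sample text with OCR-style misspellings.
doc = "The Respondent engaged in sexua1 misconduct and harrassment of a patient"
print(average_best_score(doc, ["sexual", "harassment"]))
```

An average still penalizes a document that matches only one term strongly, so taking the maximum across terms instead may suit a search where any single hit is interesting.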

In [77]:
df.sort_values(by='ratio', ascending=False)

Unnamed: 0,contents,filename,case,action_date,action_taken,case_url,city,county,lic_number,lic_url,name,profession,state,ratio
23,\n \n \n \n \n \n \n \n \n \n \n\...,converted-docs/106071.txt,106071,04/19/1989,Fine,https://appsmqa.doh.state.fl.us/MQASearchServi...,MIAMI,MIAMI-DADE,7022,https://appsmqa.doh.state.fl.us/MQASearchServi...,"FIERER, EUGENE",Medical Doctor,FL,1.4
149,\n \n \n\n333mb\n\n‘j‘bffggarm. ~\n\n...,converted-docs/199313485.txt,199313485,06/27/1994,Obligations Imposed-App/Rpt/Sc,https://appsmqa.doh.state.fl.us/MQASearchServi...,PLANTATION,BROWARD,9731,https://appsmqa.doh.state.fl.us/MQASearchServi...,"FELDMAN, MARK",Medical Doctor,FL,1.2
80,"\n\n \n\nFatitionaz,\n\n \n\n1'. JM vhf}, 5...",converted-docs/198909901.txt,198909901,04/20/1992,Fine and Letter of Concern,https://appsmqa.doh.state.fl.us/MQASearchServi...,ORANGE PARK,CLAY,38151,https://appsmqa.doh.state.fl.us/MQASearchServi...,"DONOVAN, JOHN",Medical Doctor,FL,0.4
171,\n\n \n\n \n\n4521'}: ‘.‘_-'-'\n\n- wn-m-r—__...,converted-docs/199322719.txt,199322719,11/07/1994,Disciplinary Citation Issued,https://appsmqa.doh.state.fl.us/MQASearchServi...,PLANTATION,BROWARD,64280,https://appsmqa.doh.state.fl.us/MQASearchServi...,"SINGAL, ROBERT",Medical Doctor,FL,0.2
0,6M}:\n\nFILED\n\nDepartment of Professional Re...,converted-docs/100040.txt,100040,08/16/1990,Probation-App Rpts/Screens Req,https://appsmqa.doh.state.fl.us/MQASearchServi...,JACKSONVILLE,DUVAL,26676,https://appsmqa.doh.state.fl.us/MQASearchServi...,"DRUCKER, MICHAEL",Medical Doctor,FL,0.0
128,FILED\n\nDesanmem of Business and Professional...,converted-docs/199010089.txt,199010089,10/06/1993,Voluntary Surrender,https://appsmqa.doh.state.fl.us/MQASearchServi...,POMPANO BEACH,BROWARD,9346,https://appsmqa.doh.state.fl.us/MQASearchServi...,"ROEHM, DAN",Medical Doctor,FL,0.0
121,STATE OF FLORIDA\nBOARD OF MEDICINE\n\nFimu On...,converted-docs/199006999.txt,199006999,05/01/1997,Probation,https://appsmqa.doh.state.fl.us/MQASearchServi...,ROSELAND,INDIAN RIVER,41885,https://appsmqa.doh.state.fl.us/MQASearchServi...,"LAPORTA, MARK",Medical Doctor,FL,0.0
122,u. gA y;\ni- i L 1: p _\nDepartmem of Professi...,converted-docs/199007115.txt,199007115,04/08/1992,Voluntary Surrender,https://appsmqa.doh.state.fl.us/MQASearchServi...,ORLANDO,ORANGE,6815,https://appsmqa.doh.state.fl.us/MQASearchServi...,"JOWETT, JOHN",Medical Doctor,FL,0.0
123,FILED\n\nDeparlr.'.ent m Profussxma) Reguﬁahon...,converted-docs/199009175.txt,199009175,05/01/1991,Obligations Imposed,https://appsmqa.doh.state.fl.us/MQASearchServi...,HEISKELL,OUT OF STATE,15157,https://appsmqa.doh.state.fl.us/MQASearchServi...,"JACKSON, LAWRENCE",Medical Doctor,TN,0.0
124,\n\nFiniaiom Ms. W\n\n \n\nDEPARMM ‘01? PROF...,converted-docs/199009227.txt,199009227,06/08/1993,Obligations Imposed,https://appsmqa.doh.state.fl.us/MQASearchServi...,PALM COAST,FLAGLER,15778,https://appsmqa.doh.state.fl.us/MQASearchServi...,"DANTINI, DANIEL",Medical Doctor,FL,0.0
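
A score by itself doesn't show why a document ranked where it did, and for indirect language a human usually has to read the passage anyway. A keyword-in-context view is one of the less technical approaches: print a short window of text around each hit and skim. This is a minimal sketch; `keyword_in_context` and the 30-character window are assumptions, not part of the original notebook.

```python
import re

def keyword_in_context(text, term, width=30):
    """Return a short snippet of text around each occurrence of `term`."""
    snippets = []
    for match in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        # Collapse the stray newlines and runs of spaces that OCR leaves behind.
        snippets.append(" ".join(text[start:end].split()))
    return snippets

# Made-up sample text, not taken from the Florida documents.
sample = "Respondent made\n\nSEXUAL advances toward\nthe patient."
for snippet in keyword_in_context(sample, "sexual", width=10):
    print(snippet)
```

Reading snippets for a broad seed list of words is slow, but it can surface the phrasing these documents actually use, which can then be searched for directly.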


In [78]:
# Note: [3] looks up index label 3, not the fourth-ranked row; use .iloc[3] for that.
print(df.sort_values(by='ratio', ascending=False)['contents'][3])

DEPARTMENT OF PROFESSIONAL Rmﬁzgk'rgjp W ‘E E D

BOARD OF MEDICINE

UeDartment of Professional Regulation
AGENCY CLERK

DEPARTMENT OF PROFESSIONAL
REGULATION, (' g ij
Petitioner, ”00L«~*"”'

CLERK
DPR CASE NUMBER: 0100197

_vS—- 7‘
LICENSE NUWWLL—Lﬁi

STEPHEN I. TICKTIN, M.D.,

 

Respondent.

/

CORRECTED FIHAL ORDER

THIS MATTER came before the Board of Medicine (Board)

pursuant to Section l20.57(3), Florida Statutes, on December 7,
1530, in Miami, Florida, for consideration of a Consent Agreement
(ggtached hereto as Exhibit A) entered into between the parties
ER the above-styled case. Upon consideration of the Consent
ﬁgreement, the documents submitted in support thereof, the
hﬁﬁguments of the parties, and being otherwise advised in the
premises,

IT IS HEREBY ORDERED AND ADJUDGED that the Consent Agreement
as submitted be and hereby is approved and adopted in toto and
incorporated by reference herein. Pursuant to Paragraph 4 of the
Stipulated DiSPOSition, the subject areas of the 