## Scraping Restraint and Seclusion Data for D.C. Non-Public Schools

Washington, D.C. spent $54 million to place students with disabilities in private schools in the 2016-17 school year. Detailed data on these non-public schools is available online from the [Office of the State Superintendent of Education (OSSE) website](https://osse.dc.gov/page/nonpublic-school-profiles). This code scrapes that data, which is published as a separate PDF for each school, in order to count the total number of restraint and seclusion incidents across all the non-public schools. (NOTE: Since many of these schools are outside D.C., the data covers only incidents involving D.C. public school students, not students from neighboring states.)

Step One: I used a Chrome extension called [Simple Mass Downloader](https://chrome.google.com/webstore/detail/simple-mass-downloader/abdkkegmcbiomijcbdaodaflgehfffed/related?hl=en-US) to save all the school profile PDFs to a folder that I named dc_pdfs.
Note: I had to rename the files to remove spaces. The extension can do this for you: use a regex to replace \s with _.
Then I converted the PDFs to text using Tika and extracted the values with regular expressions.
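
If the files have already been downloaded with spaces in their names, the same renaming can be done in Python instead of in the extension. This is only a sketch: the helper name `remove_spaces` is made up for illustration, and it assumes all the PDFs sit directly inside one folder.

```python
import os
import re

def remove_spaces(folder):
    """Rename every file in `folder`, replacing each whitespace character with "_"."""
    renamed = []
    for name in sorted(os.listdir(folder)):
        new_name = re.sub(r"\s", "_", name)  # same \s -> _ substitution as in the extension
        if new_name != name:
            os.rename(os.path.join(folder, name), os.path.join(folder, new_name))
            renamed.append(new_name)
    return renamed

# Example call (assumes the folder exists): remove_spaces("dc_pdfs")
```

The function returns the list of new names so you can check what was changed.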

In [1]:
from tika import parser #allows me to read PDFs and convert them to text strings
import pandas as pd #allows me to build dataframes
import re #allows me to use regular expressions

Next, I created a list of PDFs from the directory dc_pdfs.

In [2]:
import glob, os

path = 'dc_pdfs'
pdfs = []

for filename in glob.glob(os.path.join(path, '*.pdf')):
#    print(filename)
    pdfs.append(filename)

There are 42 PDFs in the directory.

In [3]:
len(pdfs)

42

I then defined a function that parsed every PDF. However, these eight PDFs had missing values for restraint: 

* FINAL_YFT_Nonpublic_School_Profile_042318.pdf
* FINAL__Harbor_Point_Behavioral_Health_Center_Nonpublic_School_Profile_042318.pdf
* FINAL_DevFL_Nonpublic_School_Profile_042318.pdf
* FINAL_KINGSBURY__Nonpublic_School_Profile_042318.pdf
* FINAL_MONROESCHOOL_Nonpublic_School_Profile_042318.pdf
* FINAL_NCIA_YIT_Nonpublic_School_Profile_042318.pdf
* FINAL_NEW_BEGINNINGS_Nonpublic_School_Profile_042318.pdf
* FINAL_RIDGE_SCHOOL_Nonpublic_School_Profile_042318.pdf

I wrote code that uses if/elif statements to check whether a PDF has restraint data, seclusion data, both, or neither. If you want to skip that complicated part, you can instead write the following and fill in the missing PDFs by hand:

if len(restraints[0]) == 2: 
    return

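This check works because every response with missing data was two characters long: the regex allows each of the three numbers to be empty, so a blank field still matches the two separator characters.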
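
To see why a blank field produces a two-character match, here is a small sketch using the same restraint regex. The two sample strings are made up for illustration and are not copied from any of the PDFs.

```python
import re

pattern = r"Total number of physical restraints (\d{0,5}.\d{0,5}.\d{0,5})"

# Made-up text: one profile with three yearly counts, one with the counts left blank.
with_data = "Total number of physical restraints 7 10 14 Total number of seclusions 18 19 16"
missing = "Total number of physical restraints   Total number of seclusions 0 0 0"

found = re.findall(pattern, with_data)
blank = re.findall(pattern, missing)

print(found)          # the three counts as one string
print(len(blank[0]))  # each \d{0,5} matched nothing, so only the two "." characters remain
```

Because `\d{0,5}` is allowed to match zero digits, the capture group can never be shorter than the two `.` separators, which is what the length check relies on.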

In [4]:
headers = ['campus_name', 'count_DC_students', 'restraint_1314', 'restraint_1415', 'restraint_1516', 
           'seclusion_1314', 'seclusion_1415', 'seclusion_1516']

def parse_pdf(pdf):
    
    parsed = parser.from_file(pdf)
    parsed['content'] = parsed['content'].replace('\n', '')
    # Raw strings (r"...") keep backslashes such as \d intact for the regex engine
    names = re.findall(r"PROGRAM CONTACT INFORMATION (.*?)  Campus Contact Information", parsed['content'])
    students = re.findall(r"Number of DC Students:.(\d{0,5})", parsed['content'])
    restraints = re.findall(r"Total number of physical restraints (\d{0,5}.\d{0,5}.\d{0,5})", parsed['content'])
    seclusions = re.findall(r"Total number of seclusions (\d{0,5}.\d{0,5}.\d{0,5})", parsed['content'])
    if len(restraints[0]) == 2:
        restraints = [None] #Put the brackets around None or you will get error "zip argument #3 must support iteration"
    if len(seclusions[0]) == 2:
        seclusions = [None]
    df = pd.DataFrame(list(zip(names, students, restraints, seclusions)), 
                      columns=['campus_name', 'count_DC_students', 'restraint', 'seclusion'])
    
    if df.iloc[0]['restraint'] is None and df.iloc[0]['seclusion'] is None: #missing restraint and seclusion data
        df = df.reindex(columns = headers)
    elif df.iloc[0]['restraint'] is None and df.iloc[0]['seclusion'] is not None: #missing only restraint
        df['restraint_1314'] = None
        df['restraint_1415'] = None
        df['restraint_1516'] = None
        del df['restraint']
        df[['seclusion_1314', 'seclusion_1415', 'seclusion_1516']] = df.seclusion.str.split(" ", expand=True)
        del df['seclusion']
    elif df.iloc[0]['restraint'] is not None and df.iloc[0]['seclusion'] is None: #missing only seclusion
        df[['restraint_1314', 'restraint_1415', 'restraint_1516']] = df.restraint.str.split(" ", expand=True)
        del df['restraint']
        df['seclusion_1314'] = None
        df['seclusion_1415'] = None
        df['seclusion_1516'] = None
        del df['seclusion']
    else: #has both restraint and seclusion data
        df[['restraint_1314', 'restraint_1415', 'restraint_1516']] = df.restraint.str.split(" ", expand=True)
        df[['seclusion_1314', 'seclusion_1415', 'seclusion_1516']] = df.seclusion.str.split(" ", expand=True)
        del df['restraint']
        del df['seclusion']  
    
    return df
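
The step in `parse_pdf` that turns one string such as "7 10 14" into three year columns is easier to follow in isolation. Below is a minimal sketch with made-up campus names and counts; it uses the same `str.split(" ", expand=True)` call as the function.

```python
import pandas as pd

# Made-up rows: each "restraint" value holds three yearly counts separated by spaces.
df = pd.DataFrame({
    "campus_name": ["Campus A", "Campus B"],
    "restraint": ["7 10 14", "2 2 3"],
})

year_cols = ["restraint_1314", "restraint_1415", "restraint_1516"]

# expand=True returns one column per piece, which is assigned to the three new columns
df[year_cols] = df["restraint"].str.split(" ", expand=True)
df = df.drop(columns="restraint")

print(df)
```

Note that the split values are still strings, so they need converting before any arithmetic.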

In [5]:
#Test for a PDF that has restraint and seclusion data
parse_pdf("dc_pdfs/FINAL__Forbush_Nonpublic_School_Profile_Draft_042318.pdf")

2019-08-14 13:11:36,244 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.19/tika-server-1.19.jar to C:\Users\Sharon\AppData\Local\Temp\tika-server.jar.
2019-08-14 13:11:42,727 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.19/tika-server-1.19.jar.md5 to C:\Users\Sharon\AppData\Local\Temp\tika-server.jar.md5.
2019-08-14 13:11:42,964 [MainThread  ] [WARNI]  Failed to see startup log message; retrying...


| | campus_name | count_DC_students | restraint_1314 | restraint_1415 | restraint_1516 | seclusion_1314 | seclusion_1415 | seclusion_1516 |
|---|---|---|---|---|---|---|---|---|
| 0 | The Forbush School at Glyndon-Glyndon Campus | 2 | 7 | 10 | 14 | 18 | 19 | 16 |
| 1 | The Forbush School at Upper Oakmont | 0 | 2 | 2 | 3 | 0 | 0 | 0 |
| 2 | The Forbush School at Prince George's County | 7 | 1 | 7 | 6 | 0 | 0 | 0 |
| 3 | The Forbush School at Glyndon-Hannah More Campus | 2 | 7 | 10 | 14 | 18 | 19 | 16 |


In [6]:
#Test for a PDF that has seclusion data but not restraint data
parse_pdf("dc_pdfs/FINAL_YFT_Nonpublic_School_Profile_042318.pdf")

| | campus_name | count_DC_students | restraint_1314 | restraint_1415 | restraint_1516 | seclusion_1314 | seclusion_1415 | seclusion_1516 |
|---|---|---|---|---|---|---|---|---|
| 0 | Youth for Tomorrow | 3 | | | | 0 | 0 | 0 |


In [7]:
#Test for a PDF that has restraint but not seclusion data
parse_pdf("dc_pdfs/FINAL_THEPATHWAYSSCHOOLS_Nonpublic_School_Profile_042318.pdf")


| | campus_name | count_DC_students | restraint_1314 | restraint_1415 | restraint_1516 | seclusion_1314 | seclusion_1415 | seclusion_1516 |
|---|---|---|---|---|---|---|---|---|
| 0 | Pathways Schools- Edgewood | 6 | 0 | 0 | 1 | | | |


Pathways was the one PDF where my code did not work: it skips every campus after Edgewood, so I had to add the missing data back in manually. Any feedback on how to fix this issue would be appreciated.

In [10]:
pdf_path = "dc_pdfs/FINAL_THEPATHWAYSSCHOOLS_Nonpublic_School_Profile_042318.pdf"

parsed = parser.from_file(pdf_path)
parsed['content'] = parsed['content'].replace('\n', '')
# Raw strings (r"...") keep backslashes such as \d intact for the regex engine
names = re.findall(r"PROGRAM CONTACT INFORMATION (.*?)  Campus Contact Information", parsed['content'])
students = re.findall(r"Number of DC Students:.(\d{0,5})", parsed['content'])
restraints = re.findall(r"Total number of physical restraints (\d{0,5}.\d{0,5}.\d{0,5})", parsed['content'])
seclusions = re.findall(r"Total number of seclusions (\d{0,5}.\d{0,5}.\d{0,5})", parsed['content'])

print(names)

['Pathways Schools- Edgewood', 'Pathways Schools- Re-Entry at DuVal', 'Pathways Schools- Re-Entry at Friendly', 'Pathways Schools- Anne Arundel', 'Pathways Schools- Horizons']
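
The regex finds all five campus names, so the names are not being lost at the matching stage. One likely explanation, based on the code in `parse_pdf`: when the first campus has a blank seclusion field, the whole `seclusions` list is replaced with the one-item list `[None]`, and `zip` stops at its shortest input, so only the first campus survives. Below is a sketch of one possible fix that handles blanks campus by campus. The helpers `split_years` and `build_rows` and the sample values are made up for illustration, and the sketch assumes each regex returns one entry per campus with the counts separated by single spaces.

```python
import pandas as pd

headers = ['campus_name', 'count_DC_students', 'restraint_1314', 'restraint_1415', 'restraint_1516',
           'seclusion_1314', 'seclusion_1415', 'seclusion_1516']

def split_years(value):
    """Return the three yearly counts, or three Nones when the field was blank."""
    if value is None or len(value) == 2:  # same two-character test as parse_pdf
        return [None, None, None]
    return value.split(" ")

def build_rows(names, students, restraints, seclusions):
    """Build one row per campus instead of deciding from the first campus only."""
    rows = [
        [name, count, *split_years(r), *split_years(s)]
        for name, count, r, s in zip(names, students, restraints, seclusions)
    ]
    return pd.DataFrame(rows, columns=headers)

# Made-up values shaped like the Pathways case: seclusion is blank for every campus.
names = ['Campus A', 'Campus B', 'Campus C']
students = ['6', '2', '4']
restraints = ['0 0 1', '1 0 0', '2 3 1']
seclusions = ['  ', '  ', '  ']

truncated = list(zip(names, students, restraints, [None]))  # what parse_pdf effectively builds
fixed = build_rows(names, students, restraints, seclusions)

print(len(truncated), len(fixed))
```

Handling each campus separately also covers a PDF where only some campuses have missing data, which the first-row check cannot detect.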


In [8]:
#Test for a PDF that has neither restraint nor seclusion data
parse_pdf('dc_pdfs/FINAL_MONROESCHOOL_Nonpublic_School_Profile_042318.pdf')

| | campus_name | count_DC_students | restraint_1314 | restraint_1415 | restraint_1516 | seclusion_1314 | seclusion_1415 | seclusion_1516 |
|---|---|---|---|---|---|---|---|---|
| 0 | Monroe School, Inc. | 21 | | | | | | |


I then created an empty dataframe. Using a for loop, I appended the dataframes of each individual school to one large dataframe.

In [9]:
schools = pd.DataFrame(columns=headers)

for pdf in pdfs:
    print(pdf)
    school_df = parse_pdf(pdf)
    #DataFrame.append was removed in pandas 2.0, so pd.concat does the appending here
    schools = pd.concat([schools, school_df], ignore_index=True) #REMEMBER THAT YOU NEED TO WRITE "SCHOOLS = "

dc_pdfs\FINAL_Accotink_Academy_Nonpublic_School_Profile_042318.pdf
dc_pdfs\FINAL_Childrens_Guild_Nonpublic_School_Profile_042318.pdf
dc_pdfs\FINAL_CoastalHarbor_Nonpublic_School_Profile_042318.pdf
dc_pdfs\FINAL_Community_School_of_Maryland__Nonpublic_School_Profile_042318.pdf
dc_pdfs\FINAL_DevCBHS_Nonpublic_School_Profile_042318.pdf
dc_pdfs\FINAL_DevFL_Nonpublic_School_Profile_042318.pdf
dc_pdfs\FINAL_DevGA_Nonpublic_School_Profile_042318.pdf
dc_pdfs\FINAL_DevGlenholme_Nonpublic_School_Profile_042318.pdf
dc_pdfs\FINAL_DevKanner_Nonpublic_School_Profile_042318.pdf
dc_pdfs\FINAL_DevMA_Nonpublic_School_Profile_042318.pdf
dc_pdfs\FINAL_ECC_Nonpublic_School_Profile_042318.pdf
dc_pdfs\FINAL_Foundations_School_Nonpublic_School_Profile_042318.pdf
dc_pdfs\FINAL_FROST_Nonpublic_School_Profile_042318.pdf
dc_pdfs\FINAL_GRAFTON_SCHOOL__Nonpublic_School_Profile_042318.pdf
dc_pdfs\FINAL_HARBOUR_SCHOOL_Nonpublic_School_Profile_042318.pdf
dc_pdfs\FINAL_HUGHES_CENTER_Nonpublic_School_Profile_042318.pdf


In [11]:
schools

| | campus_name | count_DC_students | restraint_1314 | restraint_1415 | restraint_1516 | seclusion_1314 | seclusion_1415 | seclusion_1516 |
|---|---|---|---|---|---|---|---|---|
| 0 | Accotink Academy Therapeutic Day School | 63 | 191 | 244.0 | 321.0 | 55 | 238.0 | 258.0 |
| 1 | The Children's Guild-Baltimore Campus | 8 | 7 | 10.0 | 9.0 | 14 | 8.0 | 15.0 |
| 2 | The Children's Guild-Prince George's County Ca... | 29 | 20 | 14.0 | 16.0 | 17 | 22.0 | 21.0 |
| 3 | Coastal Harbor Treatment Center | 1 | 111 | 53.0 | 8.0 | 22 | 7.0 | 0.0 |
| 4 | Community School of Maryland-Brookeville Campus | 7 | 25 | 68.0 | 75.0 | 0 | 0.0 | 0.0 |
| 5 | Devereux Pennsylvania Children's Behavioral He... | 0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 |
| 6 | Devereux Pennsylvania Children's Behavioral He... | 0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 |
| 7 | Devereux School of Vierra | 10 | | | | | | |
| 8 | Ackerman Academy | 8 | 91 | 167.0 | 90.0 | 2 | 6.0 | 0.0 |
| 9 | Devereux Glenholme School | 0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 |


Finally, export the dataframe to a CSV file. Remember to add the four missing Pathways campuses by hand.

In [12]:
schools.to_csv('dc_nonpublics_RS_updated.csv', encoding='utf-8')
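
The stated goal was to count the total number of restraint and seclusion incidents. A minimal sketch of that last step follows, using a small made-up dataframe rather than the real results. It assumes the counts are stored as strings, as they are after the split in `parse_pdf`, and that missing values should simply be left out of the totals.

```python
import pandas as pd

# Made-up rows in the same layout as the scraped dataframe; None marks missing data.
sample = pd.DataFrame({
    "campus_name": ["Campus A", "Campus B"],
    "restraint_1314": ["7", None],
    "restraint_1415": ["10", None],
    "restraint_1516": ["14", None],
    "seclusion_1314": ["18", "0"],
    "seclusion_1415": ["19", "0"],
    "seclusion_1516": ["16", "0"],
})

incident_cols = [c for c in sample.columns if c.startswith(("restraint_", "seclusion_"))]

# Convert the string counts to numbers; anything unparseable or missing becomes NaN
numeric = sample[incident_cols].apply(pd.to_numeric, errors="coerce")

totals = numeric.sum()  # per-column sums, NaN values are skipped
restraint_total = int(totals.filter(like="restraint_").sum())
seclusion_total = int(totals.filter(like="seclusion_").sum())

print(restraint_total, seclusion_total)
```

Keep in mind that a missing value is not the same as zero incidents, so totals computed this way are a lower bound for any school with blank fields.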

## Acknowledgements 

Thank you to Cody Winchester and Jacob Sanders of IRE for always being available to answer questions via e-mail. 

Some pages that helped me:
* https://stackoverflow.com/questions/54760850/replace-all-newline-characters-using-python
* https://stackoverflow.com/questions/2503413/regular-expression-to-stop-at-first-match
* https://stackoverflow.com/questions/8113782/split-string-on-whitespace-in-python
* https://www.geeksforgeeks.org/create-a-new-column-in-pandas-dataframe-based-on-the-existing-columns/
* https://stackoverflow.com/questions/16729574/how-to-get-a-value-from-a-cell-of-a-dataframe 
* https://stackoverflow.com/questions/16327055/how-to-add-an-empty-column-to-a-dataframe 
* https://stackoverflow.com/questions/18262293/how-to-open-every-file-in-a-folder
* https://stackoverflow.com/questions/42069887/read-all-pdfs-in-a-directory-image