The Public Health Quality Improvement Exchange (PHQIX) is a database of smaller-scale quality improvement (QI) projects completed by health departments of varying sizes. PHQIX is an excellent source of example literature for health department QI teams to consult while planning future QI initiatives.

Unfortunately, PHQIX was abandoned as a result of the COVID-19 pandemic. What this means for us is that we have a static website with data from 206 QI initiatives completed at health departments. 

Let's build a simple web crawler to pull these data for meta-analysis.

We start by importing the relevant modules.

In [1]:
# pandas for dataframe manipulation
# urlopen to open webpages
# BeautifulSoup for reading and manipulating HTML

import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlopen

PHQIX organizes its studies with tags. Let's save those tags to reference later.

In [3]:
ORGANIZATION_TYPES = [
'Tribal Health Department',
'Public Health Institute',
'Local Health Department',
'Local Health Department, Multi-City',
'Local Health Department, City-County',
'Local Health Department, City/Town',
'State Health Department',
'Local Health Department, County',
'Local Health Department, Multi-County'
]
POPULATIONS = [
'100,000 to 249,999',
'250,000 to 499,999',
'1,000,000 +',
'500,000 to 999,999',
'25,000 to 49,999',
'Less than 24,499',
'50,000 to 99,999',
]
TOOLS = [
'Radar Chart',
'SMART Chart',
'Control & Influence Plots',
'SWOT (Strengths-Weaknesses-Opportunities-Threats) Analysis',
'Tree Diagram',
'Affinity Diagram',
'Check Sheet',
'Histogram',
'Force-Field Analysis',
'Pareto Chart',
'Control Chart',
'Multi- Voting Technique',
'Fishbone Diagram',
'Brainstorming',
'Interrelationship digraph',
'Root cause analysis',
'Flow chart',
'Five whys',
'PDPC (Process Decision Program Chart)',
'Cause and Effect Diagrams',
'Process maps',
'Prioritization Matrix',
'No specific QI Tool',
'Surveys',
'Run Chart'
]
METHODS = [
'Nominal Group Technique',
'Total Quality Management',
'Kaizen',
'Adaptive Promising Practice',
'Lean/Six Sigma',
'PDCA/PDSA Cycle (Plan, Do, Check/Study, Act)',
'Model for Improvement',
'Business Process Analysis',
'Rapid Cycle Improvement',
'SDCA Cycle (Standardize-Do-Check-Act)']
Screening_for_diseases_conditions = [
'Other STDs',
'Tuberculosis',
'HIV/AIDS',
'Cancer',
'Blood lead']
Immunization = [
'Adult Immunizations- vaccine order management and inventory distribution',
'Adult Immunizations- administration of vaccine to population',
'Childhood Immunizations- vaccine order management and inventory distribution',
'Childhood Immunizations- administration of vaccine to population',
'International travel immunizations- vaccine order management and inventory distribution',
'International travel immunizations- administration of vaccine to population']
Disease_Treatment_Services = [
'High blood pressure',
'HIV/AIDS',
'Other STDs',
'Tuberculosis',
'Asthma',
'Coronary heart disease',
'Diabetes',
'Other cancers']
Administration = [
'Policies / internal procedures and processes',
'Communications',
'Capacity Development',
'Media Production',
'Quality Improvement and Accreditation Readiness',
'Customer Service/satisfaction',
'Parking',
'Financial management',
'Accreditation',
'Workforce development',
'Organizational effectiveness',
'General security',
'Infrastructure',
'Performance Management',
'Accounting',
'Data collection and management/Information Technology',
'Procurement',
'Public Health Policy and Health reform',
'Purchasing',
'Agreements/MOUs/MOAs/Contracts',
'Research',
'Human Resources',
'Department management',
'Contracting',
'Innovation']
Data_collection_Epidemiology_Surveillance = [
'Chronic diseases',
'Maternal and child health',
'Reportable diseases',
'Cancer incidence',
'Communicable/infection diseases',
'Injury',
'Vital statistics',
'Morbidity data',
'Foodborne illness',
'Behavioral risk factors',
'Uninsured, outreach and enrollment for medical insurance',
'Environmental health']
Maternal_and_Child_Health_Services = [
'Children with special health care needs',
'MCH home visits',
'WIC',
'Family planning',
'Prenatal care',
'Early intervention services for children',
'Comprehensive school health clinical services',
'Child nutrition (daycare providers)',
'Non-WIC nutrition assessment and counseling',
'School health services (non-clinical)',
'Obstetrical care',
'Comprehensive primary care clinics for children']
Other_Environmental_Health_Services = [
'Indoor air quality',
'Poison control',
'Food safety education',
'Nuisance complaints',
'Mosquito control',
'Outdoor air quality regulations',
'Private water supply safety',
'Public water supply safety',
'Surface water protection',
'Vector control',
'Groundwater protection',
'Collection of unused pharmaceuticals',
'Disaster/Emergency Preparedness',
'Food borne illness investigation',
'Hazardous waste disposal',
'Animal control',
'Environmental epidemiology']
Other_Health_Services_provided_to_individuals = [
'Mental/behavioral health services',
'Substance abuse education and prevention services',
'Substance abuse treatment services',
'Home health care',
'Pharmacy',
'Oral Health',
'Rural health',
'Child protection services/medical evaluation']
Other_Public_Health_Services = [
'Animal control',
'Outreach and enrollment for medical insurance (include Medicaid)',
'Laboratory services',
'School-based clinics',
'School health',
'Collaboration / resource sharing',
'Vital records',
"Medical examiner’s office",
'Access to care']
Populationbased_Primary_Prevention = [
'Injury',
'Asthma',
'Tobacco',
'Diabetes',
'Hypertension',
'Sexually transmitted disease counseling',
'Substance abuse',
'Nutrition',
'Physical activity',
'HIV',
'Chronic disease programs',
'Sex education',
'Unintended pregnancy']
Professional_licensure = [
'Nurses (any level)',
'Physicians']
Registry_maintenance = [
'Cancer',
'Childhood Immunization']
Regulation_inspection_andor_licensing = [
'Public drinking water',
'Assisted Living',
'Health-related facilities',
'Hospitals',
'Private drinking water',
'Temporary/mobile food vendors',
'Local public health agencies',
'Food service establishments',
'Septic systems',
'Childcare facilities',
'Swimming pools (public)']
State_Territory_laboratory_services = [
'Influenza typing',
'Blood lead screening',
'Newborn screening']
# Concatenate the category lists into one flat list of focus-area tags.
# Parentheses (not square brackets) keep the result flat instead of nesting it inside another list.
FOCUS_AREAS = (
    State_Territory_laboratory_services + Regulation_inspection_andor_licensing + Registry_maintenance + Professional_licensure +
    Populationbased_Primary_Prevention + Other_Public_Health_Services + Other_Health_Services_provided_to_individuals +
    Other_Environmental_Health_Services + Maternal_and_Child_Health_Services + Data_collection_Epidemiology_Surveillance + Administration +
    Disease_Treatment_Services + Immunization + Screening_for_diseases_conditions
)

OK, we have saved the tags for future reference. Next we will build a PHQIX_Initiative class, which we can later use to scrape data from the site's pages.

In [4]:
class PHQIX_Initiative:

    """
    A class is a blueprint for objects in Python. One instance of PHQIX_Initiative will hold data from one QI project.
    """

    def __init__(self, url):
        """
        The __init__ method runs each time an instance of the class is created.
        Here, it opens a URL and reads the HTML into a BeautifulSoup object.
        Instance attributes are then created by slicing the page text between known section labels (located with str.find) to extract meaningful data.
        """
        self.url = url
        self.page = urlopen(url)
        self.html = self.page.read().decode("utf-8")
        self.soup = BeautifulSoup(self.html, "html.parser")
        self.title = self.soup.title.string[:self.soup.title.string.find("|")].strip()
        self.text = self.soup.get_text()
        self.summary = self.text[self.text.find('Summary'):self.text.find('Organization that conducted the QI initiative')]
        self.citation = self.text[self.text.find('Citation:'):self.text.find("Background")]
        self.background_aim = self.text[self.text.find('Aim statement:'):self.text.find('Planning')]
        self.planning_execution = self.text[self.text.find("Planning and Execution Details"):self.text.find("Focus activities")]
        self.qi_methods = {a:(a in self.text[self.text.find("QI Methods:"):self.text.find("QI Tools:")]) for a in METHODS}
        self.qi_tools = {a:(a in self.text[self.text.find("QI Tools:"):self.text.find("Initiative Dates:")]) for a in TOOLS}
        self.eval_methods = self.text[self.text.find("Methods of evaluation:"):self.text.find("Results")]
        self.results = self.text[self.text.find("Results"):self.text.find("Information about the Community")]
        self.measurable_outcomes = self.text[self.text.find('Measurable QI Outcomes:'):self.text.find('Other QI Outcomes:')]
        self.other_outcomes = self.text[self.text.find('Other QI Outcomes:'):self.text.find('Future Plans')]
        self.organization_type = {a:(a in self.text[self.text.find("Organization Type:"):self.text.find("QI Staff")]) for a in ORGANIZATION_TYPES}
        self.population_size = {a:(a in self.text[self.text.find("Population Size:"):self.text.find("Population C")]) for a in POPULATIONS}
        self.focus_areas = {a:(a in self.text[self.text.find("Focus activities"):self.text.find("QI Methods:")]) for a in FOCUS_AREAS }
        self.is_initiative = True
        if self.citation == "":
            self.is_initiative = False

    def __str__(self):
        return self.title

"Dang, that is a nice web scraper," you might say.
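
The slicing idea behind the class can be tried on its own. The sketch below is a minimal illustration: `extract_between` is a hypothetical helper (it is not part of the class above), and `sample_text` is made-up text standing in for a real page.

```python
def extract_between(text, start_marker, end_marker):
    """Return the text from start_marker up to end_marker, or "" if either is missing."""
    start = text.find(start_marker)
    if start == -1:
        return ""
    # Look for the end marker only after the start marker.
    end = text.find(end_marker, start)
    if end == -1:
        return ""
    return text[start:end].strip()


# Made-up page text, for illustration only.
sample_text = (
    "Summary A county clinic cut wait times. "
    "Citation: Doe J. (2015). "
    "Background Wait times were long."
)

citation = extract_between(sample_text, "Citation:", "Background")
print(citation)
```

Unlike the slices in the class, this helper strips surrounding whitespace and searches for the end label only after the start label, which may or may not be what you want for every field.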

---

Now it's time for the real magic to begin. We will open the 'view all' page in PHQIX and collect all available links.

In [5]:
url = "http://www.phqix.org/qi-submissions"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")

In [6]:
list_of_links = []

for link in soup.find_all("td", class_="col-1 col-first"):
    list_of_links.append("http://www.phqix.org" + link.find('a',href=True)['href'])

We created a soup object that holds the HTML in PHQIX's 'view all' page. Then we initialized a list and populated it with the link found in each first-column table cell of the soup object.

"What about the links that don't point towards an actual initiative?" you might ask.
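
As a side note, gluing the site address onto each href with `+` works when every href is a site-relative path. A slightly more forgiving option is `urljoin` from the standard library, sketched below. The href values are hypothetical stand-ins, not real PHQIX paths.

```python
from urllib.parse import urljoin

BASE_URL = "http://www.phqix.org"

# Hypothetical href values, standing in for what the table cells might contain.
sample_hrefs = [
    "/content/example-initiative",
    "http://www.phqix.org/content/another-example",
]

# urljoin resolves relative paths against the base and leaves absolute URLs alone.
absolute_links = [urljoin(BASE_URL, href) for href in sample_hrefs]

for link in absolute_links:
    print(link)
```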

Aha! Good question. 

In [7]:
# self.is_initiative = True
# if self.citation == "":
#     self.is_initiative = False

The above bit of code from our class handles this. The class assumes that its URL points towards an initiative. If the link points elsewhere, say a blog post, then the class's initiative-specific attributes will be empty. We choose an attribute that we can reasonably assume will always be present in an initiative's write-up, the citation, to act as a flag. If it is empty, self.is_initiative is set to False.
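
To see why the citation ends up empty on a non-initiative page, here is a small sketch using made-up text in place of a real blog post. When neither label is found, `str.find` returns -1 for both, and the resulting slice is an empty string.

```python
# Made-up text standing in for a page that is not an initiative write-up.
blog_text = "Welcome to the blog. This post has no initiative write-up."

start = blog_text.find("Citation:")   # label not found, so find returns -1
end = blog_text.find("Background")    # label not found, so find returns -1

# blog_text[-1:-1] is an empty slice.
citation = blog_text[start:end]

is_initiative = citation != ""
print(repr(citation), is_initiative)
```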

---

Here's the money-maker. The following line of code will scrape 206 webpages for initiative data. It will take as long as 4 minutes to do so.

In [None]:
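# Build each PHQIX_Initiative once (one request per link), then keep only the actual initiatives.
all_initiatives = [initiative for initiative in (PHQIX_Initiative(link) for link in list_of_links) if initiative.is_initiative]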

Now we can create a dataframe to hold our scraped data. First, set the max column width in pandas to an arbitrarily large number to accommodate long strings of text. Then gather each initiative's attributes into a row and load those rows into a dataframe.

In [None]:
pd.set_option("display.max_colwidth", 1000000000)

data = [[a.title, a.summary, a.citation, a.background_aim, 
         a.planning_execution, a.qi_methods, a.qi_tools, 
         a.eval_methods, a.results, a.measurable_outcomes, 
         a.other_outcomes, a.organization_type, 
         a.population_size, a.focus_areas ] 
         for a in all_initiatives]


df = pd.DataFrame(data=data, columns = ['Title','Summary Text','Citation',
                                        'Background Text','Planning & Execution Text',
                                        'QI Methods','QI Tools','Evaluation Methods Text',
                                        'Results Text','Measurable Outcomes Text',
                                        'Other Outcomes Text','Organization Type',
                                        'Population Size','Focus Areas'])
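
Because the tag columns (QI Methods, QI Tools, Organization Type, Population Size, Focus Areas) each hold a dictionary of True/False flags, one possible first step towards a meta-analysis is to tally how often each tag appears. The sketch below uses made-up dictionaries in place of real scraped data; with real data you could pass in something like `[a.qi_methods for a in all_initiatives]` instead.

```python
from collections import Counter

# Made-up stand-ins for the qi_methods dictionaries of three initiatives.
sample_qi_methods = [
    {"Kaizen": True, "Model for Improvement": False},
    {"Kaizen": False, "Model for Improvement": True},
    {"Kaizen": True, "Model for Improvement": True},
]

method_counts = Counter()
for flags in sample_qi_methods:
    # Count only the tags that were flagged as present for this initiative.
    method_counts.update(tag for tag, present in flags.items() if present)

print(method_counts.most_common())
```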

We did it!! Nice.