# DSCI 511: Data Acquisition and Pre-Processing <br> Term Project Phase 1: Scoping a data set
#  **TEAM: HEALTHCARE**

## **Meet the Team**


---


### **Kylie Hall**
>Kylie received a B.S. in Communication and a B.A. in Music Industry from the University of New Haven, though she has spent the majority of her professional career in the non-profit industry. She found that she has a great passion for data and particularly enjoys working with CRM platforms. This led her to her current role as Knowledge Management Coordinator at the Robert Wood Johnson Foundation, where she works to make knowledge accessible, equitable, and accurate across the Foundation. Aspiring to be a database administrator or data analyst, she began pursuing her Master's in Information Systems at Drexel University and has especially enjoyed learning SQL. She has a beginner-level understanding of Python but would like to grow those skills and connect them to real-world applications.


### **Rachael Herman**
>Rachael earned a B.S. in Middle School Math and Science Education and has spent the majority of her professional career as a science teacher. As she gained experience teaching basic science concepts, she also developed skills in working with datasets and thinking scientifically. Her other strengths include a foundational knowledge of data analysis and SQL, strong organizational abilities, and an action-oriented mindset. Moving forward, she aims to improve her Python skills to work with data in a variety of formats and reformat it to answer pressing questions and make lasting impacts.


### **Jordan Land**
>Jordan earned a B.S. in Business Management from Bloomsburg University of Pennsylvania, followed by a Master of Business Administration at West Chester University. She identifies her strengths as using data to move clinical programs forward, problem solving, and resource management. The domain she identifies with is healthcare, so any dataset that aligns with improving health outcomes for patients and families is of interest. Currently, she has little experience with unstructured data and coding and would like to build these skills so she can make data more accessible in the healthcare setting.

### **Virginia Muthard**
>Virginia earned a B.S. in Bioengineering from The Pennsylvania State University, followed by graduate work in Prosthetics and Orthotics at the Northwestern University Feinberg School of Medicine's P&O Center. She identifies her strengths as organization, time management, and troubleshooting, with an eye for identifying the process in a problem. The domains she identifies with are healthcare and education, so a dataset that presents information on the intersection of these (directly or indirectly) is of interest. With a little experience in coding, she would like to become more comfortable using functions to make her code more efficient.

## **Our Topic**


---



Because every team member has a personal or professional stake in healthcare, the team quickly settled on a healthcare-focused topic. Group discussion surfaced a common concern: healthcare disparities in the United States. We thought relational data on this topic could reveal trends that might inform changes at the state or community level. We plan to work with two separate datasets: one on leading causes of mortality across the US over a span of years, and another on tobacco use across the US over a span of years.


## **Project Discussion**


---


Our goal is to create a more concise dataset that can be used to analyze the relationship between mortality and tobacco use. To begin identifying the data, a Google search led us to the CDC's API dataset page. Much of this data relies on U.S. census data (self-reported or collected through the corresponding government agency). We were able to view the mortality and tobacco use data and explore its characteristics. Each dataset's page included a description of the data as well as its use restrictions:

*   National Center for Health Statistics. NCHS - Leading Causes of Death: United States. Date accessed 25 October 2024. Available from https://data.cdc.gov/d/bi63-dtpu. [Public Domain US Government: https://www.usa.gov/government-copyright]
*   Centers for Disease Control and Prevention. CDC BRFSS - Behavioral Risk Factor Data: Tobacco Use. Date accessed 25 October 2024. Available from https://chronicdata.cdc.gov/Survey-Data/Behavioral-Risk-Factor-Data-Tobacco-Use-2011-to-pr/wsas-xwh5. [Open Data Commons Attribution License: http://opendatacommons.org/licenses/by/1.0/]

Our plan is to collect the available data across several years and combine it into a single dataset. We will preprocess the data to narrow it to the desired information (e.g., states, years, types of tobacco use, and leading causes of death).
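As a rough illustration of that preprocessing step, the sketch below tidies records shaped like the leading-causes-of-death sample. The two rows are copied from the sample output in this notebook (with the long `_113_cause_name` field omitted); the helper names `tidy_death_row` and `filter_years` are hypothetical, and the year range is only an example.

```python
# Two records copied from the sample output of the leading-causes-of-death
# endpoint (the long '_113_cause_name' field is left out for brevity).
sample_rows = [
    {"year": "2012", "state": "Vermont", "cause_name": "Kidney disease",
     "deaths": "21", "aadr": "2.6"},
    {"year": "2017", "state": "Vermont", "cause_name": "Kidney disease",
     "deaths": "29", "aadr": "3.3"},
]


def tidy_death_row(row):
    """Keep only the fields of interest and convert numeric strings to numbers."""
    return {
        "state": row["state"],
        "year": int(row["year"]),
        "cause": row["cause_name"],
        "deaths": int(row["deaths"]),
        "aadr": float(row["aadr"]),  # age-adjusted death rate
    }


def filter_years(rows, first_year, last_year):
    """Return tidied rows whose year falls within [first_year, last_year]."""
    tidied = (tidy_death_row(r) for r in rows)
    return [r for r in tidied if first_year <= r["year"] <= last_year]


# Example range only; the final set of years has not been decided.
recent = filter_years(sample_rows, 2011, 2015)
```

The API returns every value as a string, so converting `year`, `deaths`, and `aadr` up front keeps later comparisons and arithmetic from silently operating on text.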

This dataset could then be used by researchers and public health officials to identify trends. Those trends could lead to policy changes or movements to address any concerns that arise. For example, if a location has a high incidence of tobacco use (of a specific type or in total) and also has a higher incidence of certain causes of death, the pattern could be explored for a correlation. If one is found, this information could be used to focus a community health education campaign or legislation restricting certain products (e.g., types of tobacco). The data is limited, though, because some information is based on self-report or identification (either by the subject at some point or by another party). Geographic analysis may also be limited, because the location reported at death may not be where someone spent most of their life (and therefore may not represent a correlation with that area).
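One way the exploration described above might look is sketched here: match tobacco and mortality records on state and year, then compute a Pearson correlation over the matched pairs. Every value below is a hypothetical placeholder, not a figure from either CDC dataset, and the field names `state`, `year`, `use_pct`, and `aadr` assume both sources have already been tidied into that shape.

```python
import math

# Hypothetical placeholder values, NOT taken from either CDC dataset.
tobacco = [
    {"state": "A", "year": 2012, "use_pct": 10.0},
    {"state": "B", "year": 2012, "use_pct": 20.0},
    {"state": "C", "year": 2012, "use_pct": 30.0},
]
deaths = [
    {"state": "A", "year": 2012, "aadr": 40.0},
    {"state": "B", "year": 2012, "aadr": 50.0},
    {"state": "C", "year": 2012, "aadr": 45.0},
    {"state": "D", "year": 2012, "aadr": 60.0},  # no tobacco match, so dropped
]


def join_on_state_year(tobacco_rows, death_rows):
    """Pair tobacco use with death rate wherever state and year both match."""
    by_key = {(r["state"], r["year"]): r for r in tobacco_rows}
    pairs = []
    for d in death_rows:
        t = by_key.get((d["state"], d["year"]))
        if t is not None:
            pairs.append((t["use_pct"], d["aadr"]))
    return pairs


def pearson(pairs):
    """Pearson correlation coefficient for a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)


pairs = join_on_state_year(tobacco, deaths)
r = pearson(pairs)
```

A coefficient like this only flags a pattern worth investigating. Given the self-report and location limitations noted above, it would not by itself show that tobacco use explains the death rate.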

One potential hurdle is that the APIs for both datasets return at most 1,000 rows per request by default. A thorough analysis requires access to every row. Preliminary research showed that we can pass query parameters such as `$limit` and `$offset` when connecting to both API endpoints (both use SODA 2.1), which lets us retrieve all rows. We may also target a specific set of years using the `$where` parameter, which would reduce the number of records we need to access. The data was created from death certificates filed in the United States and from the results of CDC's Behavioral Risk Factor Surveillance System (BRFSS), a monthly survey conducted over the phone. The data is currently publicly accessible, so there are no legal barriers to acquiring it.
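A minimal sketch of that paging approach follows, assuming the standard SODA query parameters `$limit`, `$offset`, `$order`, and `$where`. The function names are hypothetical. The block uses only the standard library so it runs without extra installs; `requests.get(url, params=...)` would work equally well. The `$where` expression in the commented usage line is an assumption, since we have not confirmed whether `year` is stored as a number or as text.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

DEATHS_URL = "https://data.cdc.gov/resource/bi63-dtpu.json"  # from the sample cell


def build_params(limit, offset, where=None):
    """Build SODA query parameters for one page of results."""
    # Ordering by the system field :id keeps pages stable between requests.
    params = {"$limit": limit, "$offset": offset, "$order": ":id"}
    if where:
        params["$where"] = where
    return params


def fetch_page(url, params):
    """Request a single page and decode the JSON list of records."""
    with urlopen(f"{url}?{urlencode(params)}", timeout=30) as response:
        return json.load(response)


def fetch_all(url, where=None, page_size=1000, fetch=fetch_page):
    """Keep requesting pages until one comes back shorter than page_size."""
    rows, offset = [], 0
    while True:
        page = fetch(url, build_params(page_size, offset, where))
        rows.extend(page)
        if len(page) < page_size:
            return rows
        offset += page_size


# Usage (needs network access; the filter expression is an assumption):
# all_deaths = fetch_all(DEATHS_URL, where="year >= 2011")
```

Passing the page fetcher in as the `fetch` argument keeps the paging loop separate from the network call, so the loop can be checked against an in-memory list before it is pointed at the live endpoint.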

## **Data Sample**


---

Below is a sample of the data we will be collecting to form our dataset.

In [None]:
import requests
from pprint import pprint

In [None]:
# Leading causes of death in each state, 1999 to present (Source: CDC, public use)

url1 = 'https://data.cdc.gov/resource/bi63-dtpu.json'

# Fail fast on network problems or HTTP errors instead of parsing an error page
response1 = requests.get(url1, timeout=30)
response1.raise_for_status()

death_data = response1.json()

# Show the first five records
pprint(death_data[:5])

[{'_113_cause_name': 'Nephritis, nephrotic syndrome and nephrosis '
                     '(N00-N07,N17-N19,N25-N27)',
  'aadr': '2.6',
  'cause_name': 'Kidney disease',
  'deaths': '21',
  'state': 'Vermont',
  'year': '2012'},
 {'_113_cause_name': 'Nephritis, nephrotic syndrome and nephrosis '
                     '(N00-N07,N17-N19,N25-N27)',
  'aadr': '3.3',
  'cause_name': 'Kidney disease',
  'deaths': '29',
  'state': 'Vermont',
  'year': '2017'},
 {'_113_cause_name': 'Nephritis, nephrotic syndrome and nephrosis '
                     '(N00-N07,N17-N19,N25-N27)',
  'aadr': '3.7',
  'cause_name': 'Kidney disease',
  'deaths': '30',
  'state': 'Vermont',
  'year': '2016'},
 {'_113_cause_name': 'Nephritis, nephrotic syndrome and nephrosis '
                     '(N00-N07,N17-N19,N25-N27)',
  'aadr': '3.8',
  'cause_name': 'Kidney disease',
  'deaths': '30',
  'state': 'Vermont',
  'year': '2013'},
 {'_113_cause_name': 'Intentional self-harm (suicide) (*U03,X60-X84,Y87.0)',
  'aadr': '

In [None]:
# Tobacco use in each state, 2011 to present (Source: CDC, public use)

url2 = 'https://data.cdc.gov/resource/wsas-xwh5.json'

# Fail fast on network problems or HTTP errors instead of parsing an error page
response2 = requests.get(url2, timeout=30)
response2.raise_for_status()

tobacco_data = response2.json()

# Show the first five records
pprint(tobacco_data[:5])

[{':@computed_region_hjsp_umg2': '29',
  'age': 'All Ages',
  'data_value': '7.5',
  'data_value_std_err': '0.5',
  'data_value_type': 'Percentage',
  'data_value_unit': '%',
  'datasource': 'BRFSS',
  'displayorder': '71',
  'education': 'All Grades',
  'gender': 'Overall',
  'geolocation': {'human_address': '{"address": "", "city": "", "state": "", '
                                   '"zip": ""}',
                  'latitude': '32.84057112200048',
                  'longitude': '-86.63186076199969'},
  'high_confidence_limit': '8.6',
  'locationabbr': 'AL',
  'locationdesc': 'Alabama',
  'low_confidence_limit': '6.4',
  'measuredesc': 'Current Use',
  'measureid': '177SCU',
  'race': 'White',
  'sample_size': '4616',
  'stratificationid1': '1GEN',
  'stratificationid2': '8AGE',
  'stratificationid3': '5RAC',
  'stratificationid4': '6EDU',
  'submeasureid': 'BRF71',
  'topicdesc': 'Smokeless Tobacco Use (Adults)',
  'topicid': '150BEH',
  'topictype': 'Tobacco Use – Survey Data',
  '