# Stage I - Data Understanding and Linking

## What is the U.S. Opioid Epidemic?

- In the late 1990s, pharmaceutical companies reassured the medical community that patients would not become addicted to opioid pain relievers and healthcare providers began to prescribe them at greater rates.

- Increased prescription of opioid medications led to widespread misuse of both prescription and non-prescription opioids before it became clear that these medications could indeed be highly addictive.

![opioid](https://www.hhs.gov/opioids/sites/default/files/inline-images/opioids-infographic.png)

Source: https://www.hhs.gov/opioids/about-the-epidemic/index.html


We are going to study the patterns that exist between opioid-related deaths and the socio-economic, demographic, geographic, and equity-related variables that are available for the US population. Our goal is to extract such patterns and understand them further using data science techniques.

To achieve that, the project is separated into the following stages:

- Stage I - Data and Project Understanding
- Stage II - Data Modeling
- Stage III - Distributions and Hypothesis Testing
- Stage IV - Dashboard




### Opioid overdose dataset linking

In this stage we utilize multiple publicly available datasets and link them together for analytics. Our goal here is to help identify patterns which contribute to drug overdose deaths. Within this notebook we will explore:

1. **Creating Index** - Developing an index key for linking datasets
2. **Join** - Using join to merge data based on index key.

The notebook is viewable via any browser.
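The two steps above can be sketched as follows. This is a minimal illustration, assuming the linking key is the 5-digit county FIPS code; the toy frames and the column names (`County Code`, `FIPS`, `Unemployment`) are hypothetical stand-ins and may not match the actual files.

```python
import pandas as pd

# Hypothetical toy frames; real column names in the project files may differ.
deaths = pd.DataFrame({
    "County Code": [1001, 6037, 48201],  # parsed as integers, so leading zeros are lost
    "Deaths": [12, 950, 410],
})
health = pd.DataFrame({
    "FIPS": ["01001", "06037", "36061"],  # already stored as text
    "Unemployment": [3.6, 4.7, 4.1],
})


def to_fips(series: pd.Series) -> pd.Series:
    """Return county codes as zero-padded 5-character strings."""
    return series.astype(str).str.zfill(5)


# Step 1: create a common index key with the same type and format in both frames.
deaths["fips"] = to_fips(deaths["County Code"])
health["fips"] = to_fips(health["FIPS"])

# Step 2: join on the key. An inner join keeps only counties present in both frames.
linked = deaths.merge(health, on="fips", how="inner")
print(linked[["fips", "Deaths", "Unemployment"]])
```

Storing the key as a string in every dataset avoids silent mismatches such as `1001` versus `"01001"`.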

#### Software used (open source):

- `python` - https://www.python.org/download/releases/3.0/
- `pandas` - https://pandas.pydata.org/
- `plotly` - https://plotly.com/

### Datasets


#### 1. Drug Overdose Dataset

The overdose death/cause dataset was obtained from CDC WONDER (https://wonder.cdc.gov/ucd-icd10.html). It comes from the Underlying Cause of Death database, which contains mortality and population counts for all U.S. counties. Data are based on death certificates for U.S. residents. Each death certificate identifies a single underlying cause of death and demographic data.

- From this data we obtained the Drug/Alcohol Induced causes data for 2019 across all counties in the US.
- https://wonder.cdc.gov/wonder/help/ucd.html#Drug/Alcohol%20Induced%20Causes
- File: `./data/Underlying Cause of Death-County-2019.txt`
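A file like this could be loaded roughly as shown below. The sketch assumes the export is tab-delimited, has a leading `Notes` column, and ends with footer rows that carry no county code; it also assumes the column names `County Code`, `Deaths`, and `Population`. These are assumptions to check against the actual file. The inline sample stands in for the real file so the snippet runs on its own.

```python
import io

import pandas as pd

# Tiny stand-in for the export; in the notebook, pass the file path instead.
sample = (
    '"Notes"\t"County"\t"County Code"\t"Deaths"\t"Population"\t"Crude Rate"\n'
    '\t"Autauga County, AL"\t"01001"\t"12"\t"55869"\t"Unreliable"\n'
    '\t"Baldwin County, AL"\t"01003"\t"45"\t"223234"\t"20.2"\n'
    '"---"\n'
    '"Dataset: Underlying Cause of Death"\n'
)


def read_wonder(source) -> pd.DataFrame:
    """Read a tab-delimited export and keep only the county rows."""
    # Read everything as text so county codes keep their leading zeros.
    df = pd.read_csv(source, sep="\t", dtype=str)
    # Footer rows have no county code; drop them along with the Notes column.
    df = df[df["County Code"].notna()].drop(columns="Notes")
    # Convert counts to numbers; non-numeric entries become NaN.
    for col in ["Deaths", "Population"]:
        df[col] = pd.to_numeric(df[col], errors="coerce")
    return df.reset_index(drop=True)


mortality = read_wonder(io.StringIO(sample))
print(mortality)
```

In the notebook the call would be `read_wonder("./data/Underlying Cause of Death-County-2019.txt")`, provided the layout assumptions hold.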

#### 2. County Health Rankings

The County Health Rankings provide a snapshot of a community’s health and a starting point for investigating and discussing ways to improve health. The annual Rankings measure vital health factors, including high school graduation rates, obesity, smoking, unemployment, access to healthy foods, the quality of air and water, income inequality, and teen births in nearly every county in America. The dataset provides a snapshot of how health is influenced by where we live, learn, work and play.

- From this data we obtained the measures data for 2019 across all counties in the US.
- https://www.countyhealthrankings.org/
- Data Dictionary - https://www.countyhealthrankings.org/sites/default/files/DataDictionary_2019.pdf
- File: `./data/County_Health_Ranking.csv`

#### 3. County Opioid Dispensing Rates

The third dataset in this notebook is the Opioid Dispensing Rate dataset. It gives the geographic distribution of retail opioid prescriptions dispensed per 100 persons per year from 2006–2019. Rates are classified by the Jenks natural breaks classification method into four groups, using the 14-year range of data to determine the class breaks.

- We utilize County Opioid Dispensing Rates for 2019.
- https://www.cdc.gov/drugoverdose/maps/rxcounty2019.html
- File: `./data/2019-Opioid_Rate.csv`

## Tasks:

#### Task 1: (10 pts)
- **T1.1** Initialize a GitHub repository for your project. (10 pts)
    - Add a description (README.md) to your project. See here for how to set it up: https://bulldogjob.com/news/449-how-to-write-a-good-readme-for-your-github-project

#### Task 2: (50 pts)
- Team:
    - **T2.1** Entire team looks at the datasets and understands the type of variables present in each of the data. (10 pts)
        - **Deliverable** 
            - Create a report on what the project is about, why this is an important area of work (in your own words), and how data science can help.
            - Outline how the datasets can be merged together and the common variables. 
        
- Member: 
    - **M2.1** Study prior research in the area. (20 pts)
        - Read https://link.springer.com/article/10.1007/s40265-017-0846-6
        - Select one other paper to study in relation to the area. Use https://scholar.google.com/ to search for papers which are related to the goal of this project.
        - **Deliverable**
            - Prepare a 1-page summary of what was discovered in these two papers, including significant outcomes, i.e. which variables/determinants are linked to the opioid epidemic.
    - **M2.2** Each student member of the team selects 10 variables that they think are important from the available datasets. (20 pts)
        - **Deliverable**
            - Prepare a data dictionary (data and datatype - variable dictionary. https://analystanswers.com/what-is-a-data-dictionary-a-simple-thorough-overview/) of the selected variables
            - Include justification of why you think the variables are important. 

Upload the team and member reports to Canvas and your GitHub repository.

#### Task 3: (50 pts)
- Team: (20 pts)
    
    - **T3.1** Create a team notebook to read in the Opioid Mortality data using `pandas` and display the dataframe in a notebook.
    - **T3.2** Normalize the mortality data by population, i.e. number of deaths per 100,000 population. 
    - **T3.3** Identify issues with the data
        - Merge issues, missing values, inconsistent values, etc.
        - Describe solutions to fix it. 
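The normalization in T3.2 and a first pass at T3.3 can be sketched as below. The sample frame and its column names (`County`, `Deaths`, `Population`) are hypothetical; the rate is simply deaths multiplied by 100,000 and divided by population.

```python
import pandas as pd

# Hypothetical sample; real column names may differ.
mortality = pd.DataFrame({
    "County": ["A County, XX", "B County, XX", "C County, YY"],
    "Deaths": [25, 10, None],  # None stands in for a missing or suppressed count
    "Population": [50_000, 200_000, 8_000],
})

# Deaths per 100,000 population; rows with a missing count stay NaN.
mortality["deaths_per_100k"] = (
    mortality["Deaths"] * 100_000 / mortality["Population"]
)

# Quick data-quality checks: missing values per column and repeated county names.
missing_per_column = mortality.isna().sum()
duplicate_counties = int(mortality["County"].duplicated().sum())

print(mortality)
print(missing_per_column)
print("duplicate county rows:", duplicate_counties)
```

Counts like these only show where problems are; how to handle them (drop, impute, or flag) is a separate decision to justify in the report.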
    
- Member: (30 pts)
    - **M3.1** Merge all three datasets to create a super dataframe. (10 pts)
        - Display the super dataframe. Its shape should be (2527, 542).
        - Export it to a csv format.
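One way to approach M3.1 is to merge the frames pairwise on a shared key and then export the result. The sketch below assumes each frame already has a common `fips` column; the toy data and column names are hypothetical. An inner join is used here as an assumption, since the task does not state which join type yields the expected shape.

```python
import io
from functools import reduce

import pandas as pd

# Toy frames sharing a hypothetical key column "fips".
mortality = pd.DataFrame({"fips": ["01001", "01003", "01005"], "Deaths": [12, 45, 7]})
health = pd.DataFrame({"fips": ["01001", "01003"], "Unemployment": [3.6, 3.9]})
dispensing = pd.DataFrame(
    {"fips": ["01001", "01003", "01007"], "opioid_rate": [90.1, 65.3, 40.2]}
)


def merge_on_fips(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Inner-join two frames on fips; raise if the key repeats in either frame."""
    return left.merge(right, on="fips", how="inner", validate="one_to_one")


# Fold the list of frames into a single merged frame.
merged = reduce(merge_on_fips, [mortality, health, dispensing])
print(merged.shape)

# Export to CSV; a buffer is used here so the sketch has no side effects.
buffer = io.StringIO()
merged.to_csv(buffer, index=False)
```

In the notebook, `merged.to_csv("some_file.csv", index=False)` would write to disk instead (the file name is a placeholder). Comparing `merged.shape` with the expected shape is a quick check that the key and join type are right.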
    - **M3.2** Identify counties and states (top 10) with the highest opioid mortality rates. (20 pts)
        - Use mean and median for counties within states to compare (for the state level).
        - Describe your intuition on why the rates are high in these states and counties.
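The ranking in M3.2 can be sketched with `nlargest` for counties and a `groupby` for the state-level mean and median. The frame below is hypothetical, as are the column names `county`, `state`, and `deaths_per_100k`.

```python
import pandas as pd

# Hypothetical county-level rates.
df = pd.DataFrame({
    "county": ["A", "B", "C", "D", "E"],
    "state": ["XX", "XX", "YY", "YY", "YY"],
    "deaths_per_100k": [50.0, 10.0, 30.0, 40.0, 80.0],
})

# Top counties by rate (returns fewer than 10 rows if the frame is smaller).
top_counties = df.nlargest(10, "deaths_per_100k")

# State-level comparison using the mean and median of county rates.
state_summary = (
    df.groupby("state")["deaths_per_100k"]
    .agg(["mean", "median"])
    .sort_values("median", ascending=False)
)

print(top_counties)
print(state_summary.head(10))
```

Comparing the two columns of `state_summary` shows where a few extreme counties pull the mean away from the median.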

**Deliverable**
Each member creates separate notebooks for member tasks. Upload all notebooks to the GitHub repository.
