# Week 11 Assignment


Please do the programming exercise and verify that your code works using the tests, then think about your final project and fill out the questions in the second part.

---

### 47.1: Filtering and summarizing data

For this work, you'll find a data file in `https://hds5210-data.s3.amazonaws.com/complications_all.csv`.

Read in the data file and create a variable called `mo_hospitals` that contains a data frame from the `complications_all.csv` file, filtered down to only contain those hospitals from the state of Missouri (MO).

Then aggregate that data by hospital into a variable named `mo_summary`.  There are some key fields that we want to summarize:
* We want to know the earliest date that each hospital was participating in any program
* We want to know the latest date that each hospital stopped participating in any program
* We want to know the total number of patients in the denominators of these programs

Some things to note:
* You will need to convert the `Start Date` and `End Date` to actual datetime fields
* You will need to clean up the `Denominator` field and convert it to numeric. The rule you should use is simply to remove any records where the `Denominator` is `'Not Available'`
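
The two notes above can be sketched on a small made-up table. This is only an illustration, not the assignment solution: the rows in `sample` are hypothetical, and the `MM/DD/YYYY` date format is an assumption about how the dates are written in the real file.

```python
import pandas as pd

# Hypothetical sample rows; the real file has many more rows and columns.
sample = pd.DataFrame({
    "Facility Name": ["HOSPITAL A", "HOSPITAL A", "HOSPITAL B"],
    "Start Date": ["04/01/2015", "07/01/2015", "04/01/2015"],
    "End Date": ["03/31/2018", "06/30/2018", "06/30/2018"],
    "Denominator": ["120", "Not Available", "45"],
})

# Drop the placeholder rows first. .copy() gives an independent frame,
# so the column assignments below do not trigger SettingWithCopyWarning.
cleaned = sample[sample["Denominator"] != "Not Available"].copy()

# Convert the text columns to real datetime and numeric types.
# The format string is an assumption about the date layout.
cleaned["Start Date"] = pd.to_datetime(cleaned["Start Date"], format="%m/%d/%Y")
cleaned["End Date"] = pd.to_datetime(cleaned["End Date"], format="%m/%d/%Y")
cleaned["Denominator"] = pd.to_numeric(cleaned["Denominator"])
```

Removing the `'Not Available'` rows before calling `pd.to_numeric` matters, because the conversion raises an error on text it cannot parse.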


The final result of this step should be a new data frame called `mo_summary` that contains one row for each hospital and contains the min start date, max end date, and total denominator.  Use the names `start_date`, `end_date`, and `number` for those columns in `mo_summary`.


You do not need to create your code in the form of a function, just make sure your variable names match what I've described above so the tests work.

In [1]:
import pandas as pd
# This is just to show you the name to use for the variable you need to create for this step to pass.
all_hospitals = pd.read_csv('https://hds5210-data.s3.amazonaws.com/complications_all.csv')


In [5]:
# Do your work here and in as many cells as you need to create a variable called `mo_summary` that matches the requirements
# Step 1: Filter for Missouri (MO) hospitals only.
# .copy() makes an independent frame, so the column assignments below
# do not raise SettingWithCopyWarning.
mo_hospitals = all_hospitals[all_hospitals['State'] == 'MO'].copy()

# Step 2: Convert 'Start Date' and 'End Date' to datetime format
mo_hospitals['Start Date'] = pd.to_datetime(mo_hospitals['Start Date'])
mo_hospitals['End Date'] = pd.to_datetime(mo_hospitals['End Date'])

# Step 3: Remove rows where 'Denominator' is 'Not Available'
mo_hospitals = mo_hospitals[mo_hospitals['Denominator'] != 'Not Available']

# Step 4: Convert 'Denominator' to a numeric type
mo_hospitals['Denominator'] = pd.to_numeric(mo_hospitals['Denominator'])

# Step 5: Group by 'Facility Name' and summarize the data
mo_summary = mo_hospitals.groupby('Facility Name').agg(
    start_date=('Start Date', 'min'),  # Earliest participation date
    end_date=('End Date', 'max'),      # Latest participation date
    number=('Denominator', 'sum')      # Total denominator count
)

# Step 6: groupby already uses 'Facility Name' as the index; name it explicitly
mo_summary.index.name = 'Facility Name'

# Display the final summary
mo_summary.head()


| Facility Name | start_date | end_date | number |
|---|---|---|---|
| BARNES JEWISH HOSPITAL | 2015-04-01 | 2018-06-30 | 131313 |
| BARNES-JEWISH ST PETERS HOSPITAL | 2015-04-01 | 2018-06-30 | 15668 |
| BARNES-JEWISH WEST COUNTY HOSPITAL | 2015-04-01 | 2018-06-30 | 9622 |
| BATES COUNTY MEMORIAL HOSPITAL | 2015-07-01 | 2018-06-30 | 3117 |
| BELTON REGIONAL MEDICAL CENTER | 2015-04-01 | 2018-06-30 | 9270 |


In [6]:
assert(mo_summary['number'].sum() == 1766908)
assert(mo_summary['start_date'].min() == pd.Timestamp(2015,4,1))
assert(mo_summary['end_date'].max() == pd.Timestamp(2018,6,30))
assert(mo_summary.shape == (108,3))
assert(mo_summary.loc['BARNES JEWISH HOSPITAL'].number == 131313)
assert(mo_summary.loc['BOONE HOSPITAL CENTER'].number == 63099)

---

### 47.2 Planning your final project

You should be thinking about the things we've been learning and how you can apply them to your final project.  Use the rubric to help guide your thinking and then answer the questions below.  This is meant as a guide to help you think through what you will do.

#### A. Data Access

Your project should include data from at least three distinct types of sources.  For example: AWS S3, Relational Databases, Internet, Web Services, local files.  List what data sources you're planning to use.

**Double-click to enter your answer**

1. CDC - Provisional Death Counts for COVID-19

Source: Centers for Disease Control and Prevention (CDC)

Dataset Information: This dataset contains tabulated data on provisional COVID-19 deaths, including details related to age, sex, race, comorbidities, and demographic characteristics.

Link: https://www.cdc.gov/nchs/covid19/index.htm


2. Kaggle - COVID-19 Cases and Deaths Worldwide

Source: Kaggle

Dataset Information: This dataset reports the number of COVID-19 cases and deaths by country or territory, giving a global view of the effects of the pandemic.

Link: https://www.kaggle.com/datasets/themrityunjaypathak/covid-cases-and-deaths-worldwide


3. HealthData.gov - COVID-19 Dataset

Source: HealthData.gov

Dataset Information: This collection provides a range of health data for the United States. It includes comprehensive COVID-19 datasets covering hospitalization rates, demographic information, and mortality rates linked to the pandemic.

Link: https://healthdata.gov/Health/COVID-19-Community-Profile-Report/gqxm-d9w9/about_data

#### B. Data Formats

Your project should include data that comes in different file formats.  For example: HL7, EDI, HTML, CSV, Excel, JSON, XML.  List what data formats you're planning to use.

**Double-click to enter your answer**

The project will use data in several formats, gathered from different organizations, to study the mortality risks linked to COVID-19. The intended data formats are as follows:

CDC (Centers for Disease Control and Prevention) – Provisional Death Counts: This dataset provides details on mortality, such as demographic breakdowns and pre-existing conditions linked to COVID-19-related deaths.

Formats: CSV

HealthData.gov – COVID-19 Dataset: This collection includes a range of health information for the United States, such as hospitalization and mortality rates. It is presented in a format that allows for analysis and visualization.

Formats: Excel

Kaggle – COVID-19 Cases and Deaths Worldwide: This dataset provides information on COVID-19 cases and deaths across countries or territories, enabling a global analysis.

Formats: JSON
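
As a rough sketch of how data in these formats might be brought into pandas and combined, the example below reads a CSV and a JSON payload from in-memory strings and joins them. The column names (`country`, `deaths`, `cases`) and the values are made up for illustration and are not taken from the datasets listed above.

```python
import io

import pandas as pd

# Hypothetical stand-ins for downloaded files; real data would come from disk or a URL.
csv_text = "country,deaths\nA,10\nB,5\n"
json_text = '[{"country": "A", "cases": 100}, {"country": "B", "cases": 40}]'

# Each format has its own reader, but all of them return a DataFrame.
deaths = pd.read_csv(io.StringIO(csv_text))
cases = pd.read_json(io.StringIO(json_text))
# An Excel workbook would be read with pd.read_excel(...), which needs
# an engine such as openpyxl installed; it is left out here.

# Once loaded, frames from different formats can be joined on a shared key.
combined = deaths.merge(cases, on="country")
```

After loading, the original file format no longer matters: every source is a `DataFrame`, so the same filtering and aggregation steps apply to all of them.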






#### C. Objective

What purpose would your project serve in a real work setting?  Take a couple of paragraphs to write down why this is an interesting product.

**Double-click to enter your answer**

The main goal of this project is to study the likelihood of death related to heart and lung conditions in individuals who have been diagnosed with COVID-19. Given the effects of the COVID-19 crisis, it is essential to understand how existing heart and lung conditions influence the severity of COVID-19, both for managing healthcare and for patient outcomes. This study seeks to pinpoint risk factors by examining how these conditions affect mortality rates, with the ultimate objective of informing clinical practice and public health strategy.

In a real work setting, this analysis would have several uses. It could help healthcare professionals anticipate which patients are more likely to experience serious outcomes, enabling them to provide focused interventions. It could also help decision makers make informed choices about distributing resources, preparing for emergencies, and creating protocols for the care of patients with existing health conditions. By evaluating the mortality risks linked to cardiac and respiratory problems in the context of COVID-19, this project seeks to improve our knowledge of patient health profiles and to enhance care strategies in healthcare systems.




---



## Submit your work via GitHub as normal
