# Week 11 Assignment


Please do the programming exercise and verify that your code works using the tests, then think about your final project and fill out the questions in the second part.

---

### 47.1: Filtering and summarizing data

For this work, you'll find a data file at `https://hds5210-data.s3.amazonaws.com/complications_all.csv`.

Read in the data file and create a variable called `mo_hospitals` that contains a data frame from the `complications_all.csv` file, filtered down to only contain those hospitals from the state of Missouri (MO).

Then aggregate that data by hospital into a variable named `mo_summary`.  There are some key fields that we want to summarize:
* We want to know the earliest date that each hospital was participating in any program
* We want to know the latest date that each hospital stopped participating in any program
* We want to know the total number of patients in the denominators of these programs

Some things to note:
* You will need to convert the `Start Date` and `End Date` to actual datetime fields
* You will need to clean up and convert the `Denominator` field to be numeric. The rule you should use is to simply remove any records where the `Denominator` is `'Not Available'`.
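
The two notes above can be sketched on a small toy frame. This is only an illustration, not the required solution: the rows are made up, and the `MM/DD/YYYY` date format is an assumption about how dates might appear in the real file.

```python
import pandas as pd

# Toy stand-in for a few rows of the real file (all values are made up).
toy = pd.DataFrame({
    "State": ["MO", "MO", "IL"],
    "Denominator": ["26", "Not Available", "175"],
    "Start Date": ["04/01/2015", "07/01/2015", "07/01/2015"],  # assumed format
    "End Date": ["03/31/2018", "06/30/2018", "06/30/2018"],    # assumed format
})

# Remove records whose Denominator is the placeholder string,
# then convert the remaining values to integers.
cleaned = toy[toy["Denominator"] != "Not Available"].copy()
cleaned["Denominator"] = cleaned["Denominator"].astype(int)

# Parse the date strings into real datetime values.
cleaned["Start Date"] = pd.to_datetime(cleaned["Start Date"], format="%m/%d/%Y")
cleaned["End Date"] = pd.to_datetime(cleaned["End Date"], format="%m/%d/%Y")
```

Filtering before converting keeps the rule explicit: only the `'Not Available'` rows are dropped, and any other unexpected value would raise an error instead of being silently discarded.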


The final result of this step should be a new data frame called `mo_summary` that contains one row for each hospital and contains the min start date, max end date, and total denominator.  Use the names `start_date`, `end_date`, and `number` for those columns in `mo_summary`.
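
One way to get output columns with exactly those names is pandas named aggregation. The sketch below uses a made-up toy frame (the hospital names and numbers are hypothetical) to show the shape of the result, assuming the data is grouped on `Facility Name`.

```python
import pandas as pd

# Toy data: two hypothetical hospitals, already cleaned and typed.
toy = pd.DataFrame({
    "Facility Name": ["HOSPITAL A", "HOSPITAL A", "HOSPITAL B"],
    "Start Date": pd.to_datetime(["2015-04-01", "2015-07-01", "2015-07-01"]),
    "End Date": pd.to_datetime(["2018-03-31", "2018-06-30", "2018-06-30"]),
    "Denominator": [26, 175, 58],
})

# Named aggregation: new_column=(source_column, aggregation_function).
toy_summary = toy.groupby("Facility Name").agg(
    start_date=("Start Date", "min"),
    end_date=("End Date", "max"),
    number=("Denominator", "sum"),
)
```

The result is indexed by hospital name, so a single hospital can be looked up with `toy_summary.loc["HOSPITAL A"]`.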


You do not need to create your code in the form of a function, just make sure your variable names match what I've described above so the tests work.

In [1]:
import pandas as pd
# This is just to show you the name to use for the variable you need to create for this step to pass.
all_hospitals = pd.read_csv('https://hds5210-data.s3.amazonaws.com/complications_all.csv')


In [2]:
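# Keep only Missouri hospitals; .copy() avoids SettingWithCopyWarning on the assignments below.
mo_hospitals = all_hospitals[all_hospitals["State"] == "MO"].copy()

# Turn 'Not Available' (and any other non-numeric value) into NaN, then drop those rows.
mo_hospitals["Denominator"] = pd.to_numeric(mo_hospitals["Denominator"], errors="coerce")
mo_hospitals = mo_hospitals.dropna(subset=["Denominator"])

# Convert the date columns from strings to datetime values.
mo_hospitals["Start Date"] = pd.to_datetime(mo_hospitals["Start Date"])
mo_hospitals["End Date"] = pd.to_datetime(mo_hospitals["End Date"])

mo_hospitals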

Unnamed: 0,Facility ID,Facility Name,Address,City,State,ZIP Code,County Name,Phone Number,Measure ID,Measure Name,Compared to National,Denominator,Score,Lower Estimate,Higher Estimate,Footnote,Start Date,End Date
45534,260001,MERCY HOSPITAL JOPLIN,100 MERCY WAY,JOPLIN,MO,64804,JASPER,(417) 781-2727,COMP_HIP_KNEE,Rate of complications for hip/knee replacement...,No Different Than the National Rate,26.0,2.5,1.4,4.2,,2015-04-01,2018-03-31
45535,260001,MERCY HOSPITAL JOPLIN,100 MERCY WAY,JOPLIN,MO,64804,JASPER,(417) 781-2727,MORT_30_AMI,Death rate for heart attack patients,No Different Than the National Rate,175.0,13.9,11.0,16.9,,2015-07-01,2018-06-30
45536,260001,MERCY HOSPITAL JOPLIN,100 MERCY WAY,JOPLIN,MO,64804,JASPER,(417) 781-2727,MORT_30_CABG,Death rate for CABG surgery patients,No Different Than the National Rate,91.0,2.5,1.2,5.1,,2015-07-01,2018-06-30
45537,260001,MERCY HOSPITAL JOPLIN,100 MERCY WAY,JOPLIN,MO,64804,JASPER,(417) 781-2727,MORT_30_COPD,Death rate for COPD patients,No Different Than the National Rate,326.0,8.5,6.5,10.9,,2015-07-01,2018-06-30
45538,260001,MERCY HOSPITAL JOPLIN,100 MERCY WAY,JOPLIN,MO,64804,JASPER,(417) 781-2727,MORT_30_HF,Death rate for heart failure patients,No Different Than the National Rate,461.0,13.1,10.7,15.9,,2015-07-01,2018-06-30
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
47557,261337,MISSOURI BAPTIST SULLIVAN HOSPITAL,751 SAPPINGTON BRIDGE RD,SULLIVAN,MO,63080,FRANKLIN,(573) 468-4186,MORT_30_HF,Death rate for heart failure patients,No Different Than the National Rate,114.0,12.4,9.1,16.4,,2015-07-01,2018-06-30
47558,261337,MISSOURI BAPTIST SULLIVAN HOSPITAL,751 SAPPINGTON BRIDGE RD,SULLIVAN,MO,63080,FRANKLIN,(573) 468-4186,MORT_30_PN,Death rate for pneumonia patients,No Different Than the National Rate,77.0,13.5,9.8,18.0,,2015-07-01,2018-06-30
47575,261338,MERCY HOSPITAL CARTHAGE,3125 DR RUSSELL SMITH WAY,CARTHAGE,MO,64836,JASPER,(417) 358-8121,MORT_30_COPD,Death rate for COPD patients,No Different Than the National Rate,58.0,7.4,4.9,11.0,,2015-07-01,2018-06-30
47576,261338,MERCY HOSPITAL CARTHAGE,3125 DR RUSSELL SMITH WAY,CARTHAGE,MO,64836,JASPER,(417) 358-8121,MORT_30_HF,Death rate for heart failure patients,No Different Than the National Rate,27.0,9.8,6.5,14.5,,2015-07-01,2018-06-30


In [3]:
# One row per hospital: earliest start, latest end, total denominator.
mo_summary = mo_hospitals.groupby("Facility Name").agg(
    start_date=("Start Date", "min"),
    end_date=("End Date", "max"),
    number=("Denominator", "sum"),
)
print(mo_summary)

                                    start_date   end_date    number
Facility Name                                                      
BARNES JEWISH HOSPITAL              2015-04-01 2018-06-30  131313.0
BARNES-JEWISH ST PETERS HOSPITAL    2015-04-01 2018-06-30   15668.0
BARNES-JEWISH WEST COUNTY HOSPITAL  2015-04-01 2018-06-30    9622.0
BATES COUNTY MEMORIAL HOSPITAL      2015-07-01 2018-06-30    3117.0
BELTON REGIONAL MEDICAL CENTER      2015-04-01 2018-06-30    9270.0
...                                        ...        ...       ...
TRUMAN MEDICAL CENTER LAKEWOOD      2015-04-01 2018-06-30    4297.0
UNIVERSITY OF MISSOURI HEALTH CARE  2015-04-01 2018-06-30   56493.0
WASHINGTON COUNTY MEMORIAL HOSPITAL 2015-07-01 2018-06-30     220.0
WESTERN MISSOURI MEDICAL CENTER     2015-04-01 2018-06-30    7254.0
WRIGHT MEMORIAL HOSPITAL            2015-07-01 2018-06-30     198.0

[108 rows x 3 columns]


In [4]:
assert(mo_summary['number'].sum() == 1766908)
assert(mo_summary['start_date'].min() == pd.Timestamp(2015,4,1))
assert(mo_summary['end_date'].max() == pd.Timestamp(2018,6,30))
assert(mo_summary.shape == (108,3))
assert(mo_summary.loc['BARNES JEWISH HOSPITAL'].number == 131313)
assert(mo_summary.loc['BOONE HOSPITAL CENTER'].number == 63099)

---

### 47.2: Planning your final project

You should be thinking about the things we've been learning and how you can apply them to your final project.  Use the rubric to help guide your thinking and then answer the questions below.  This is meant as a guide to help you think through what you will do.

#### A. Data Access

Your project should include data from at least three distinct types of sources.  For example: AWS S3, Relational Databases, Internet, Web Services, local files.  List what data sources you're planning to use.

**Double-click to enter your answer**

I plan to use the following data sources:

* Internet (healthdata.gov): https://healthdata.gov/dataset/Obesity-among-children-and-adolescents-aged-2-19-y/vz57-zne8
* Kaggle: https://www.kaggle.com/datasets/akshaydattatraykhare/diabetes-dataset
* Google Drive file: https://drive.google.com/file/d/1hHKm8XliaSonvsQ2FLOuy7zCH2XGIPop/view?usp=sharing



#### B. Data Formats

Your project should include data that comes in different file formats.  For example: HL7, EDI, HTML, CSV, Excel, JSON, XML.  List what data formats you're planning to use.

**Double-click to enter your answer**

In my final project, I plan to use the CSV, Excel, and JSON formats.
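
As a rough sketch of how data in these formats could end up in one place, pandas can read each of them into a data frame. The snippet below uses small in-memory CSV and JSON strings with made-up column names and values; none of it comes from the actual datasets.

```python
import io
import pandas as pd

# Made-up sample content standing in for real files.
csv_text = "group,obesity_pct\nchild,1.5\nteen,2.5\n"
json_text = '[{"group": "child", "patients": 10}, {"group": "teen", "patients": 12}]'

# Each reader returns a DataFrame, whatever the source format.
csv_df = pd.read_csv(io.StringIO(csv_text))
json_df = pd.read_json(io.StringIO(json_text))

# An Excel sheet would load the same way with pd.read_excel(path),
# which needs an engine such as openpyxl installed.

# Once loaded, frames from different formats can be joined on a shared key.
combined = csv_df.merge(json_df, on="group")
```

After loading, the original file format no longer matters: every source is a data frame that can be merged, filtered, and summarized in the same way.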


#### C. Objective

What purpose would your project serve in a real work setting?  Take a couple of paragraphs to write down why this is an interesting product.

**Double-click to enter your answer**

The goal of this project is to examine diabetes and obesity among children and adolescents using datasets from healthdata.gov, Kaggle, and a file stored on Google Drive. Working with data in CSV, Excel, and JSON formats, the project will study the prevalence of obesity and its possible relationship with diabetes across age groups.

In a real work setting, this product would serve as a tool for public health analysis. It could support decision-making by medical practitioners and inform evidence-based public health policies and initiatives. By addressing childhood obesity and diabetes, the project aims to contribute to public health research and, ultimately, to a healthier future for this population.




---



## Submit your work via GitHub as normal
