# Week 11 Assignment


Please do the programming exercise and verify that your code works using the tests, then think about your final project and fill out the questions in the second part.

---

### 47.1: Filtering and summarizing data

For this work, you'll find a data file in `/data/complications_all.csv`.

Read in the data file and create a variable called `mo_hospitals` that contains a data frame from the `complications_all.csv` file, filtered down to only contain those hospitals from the state of Missouri (MO).

Then aggregate that data by hospital into a variable named `mo_summary`.  There are some key fields that we want to summarize:
* We want to know the earliest date that each hospital was participating in any program
* We want to know the latest date that each hospital stopped participating in any program
* We want to know the total number of patients in the denominators of these programs

Some things to note:
* You will need to convert the `Start Date` and `End Date` to actual datetime fields
* You will need to clean up the `Denominator` field and convert it to numeric. The rule to use is simply to remove any records where the `Denominator` is `'Not Available'`


The final result of this step should be a new data frame called `mo_summary` that contains one row for each hospital and contains the min start date, max end date, and total denominator.  Use the names `start_date`, `end_date`, and `number` for those columns in `mo_summary`.


You do not need to create your code in the form of a function, just make sure your variable names match what I've described above so the tests work.
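The steps described above can be tried first on a tiny made-up frame where the expected result is easy to check by hand. This is only a sketch: the rows, the facility names, and the `MM/DD/YYYY` date format are assumptions for illustration, and `toy`, `toy_mo`, and `toy_summary` are hypothetical names, not the variables the tests look for.

```python
import pandas as pd

# Hypothetical miniature of the complications file; every value here is made up.
toy = pd.DataFrame({
    "Facility Name": ["HOSPITAL A", "HOSPITAL A", "HOSPITAL B", "HOSPITAL C"],
    "State": ["MO", "MO", "MO", "KS"],
    "Denominator": ["120", "Not Available", "45", "300"],
    "Start Date": ["04/01/2015", "07/01/2016", "07/01/2015", "04/01/2015"],
    "End Date": ["03/31/2018", "06/30/2018", "06/30/2017", "03/31/2018"],
})

# Keep Missouri rows only, then drop rows whose denominator is unavailable.
toy_mo = toy[toy["State"] == "MO"].copy()
toy_mo = toy_mo[toy_mo["Denominator"] != "Not Available"].copy()

# Convert the text columns to real datetime and numeric types
# (the date format is an assumption about the file).
toy_mo["Start Date"] = pd.to_datetime(toy_mo["Start Date"], format="%m/%d/%Y")
toy_mo["End Date"] = pd.to_datetime(toy_mo["End Date"], format="%m/%d/%Y")
toy_mo["Denominator"] = pd.to_numeric(toy_mo["Denominator"])

# One row per hospital: earliest start, latest end, total denominator.
toy_summary = toy_mo.groupby("Facility Name").agg(
    start_date=("Start Date", "min"),
    end_date=("End Date", "max"),
    number=("Denominator", "sum"),
)
print(toy_summary)
```

Because the `'Not Available'` row is dropped before aggregating, its dates do not contribute to the min and max either, so `HOSPITAL A` ends on 2018-03-31 rather than 2018-06-30 in this toy data.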

In [1]:
import pandas as pd
# This is just to show you the name to use for the variable you need to create for this step to pass.
all_hospitals = pd.read_csv('/content/complications_all....csv')




In [3]:
import pandas as pd

# Load the dataset
data = pd.read_csv('/content/complications_all....csv')

# Filter for records in Missouri (State == 'MO')
mo_hospitals = data.query("State == 'MO'").copy()

# Convert 'Start Date' and 'End Date' columns to datetime, handling errors gracefully
mo_hospitals['Start Date'] = pd.to_datetime(mo_hospitals['Start Date'], errors='coerce')
mo_hospitals['End Date'] = pd.to_datetime(mo_hospitals['End Date'], errors='coerce')

# Drop rows where 'Denominator' is 'Not Available' (the cleanup rule the assignment specifies)
mo_hospitals = mo_hospitals[mo_hospitals['Denominator'] != 'Not Available'].copy()

# Convert 'Denominator' to numeric after filtering
mo_hospitals['Denominator'] = pd.to_numeric(mo_hospitals['Denominator'], errors='coerce')

# Display the cleaned data
print(mo_hospitals.head())


      Facility ID          Facility Name        Address    City State  \
45534      260001  MERCY HOSPITAL JOPLIN  100 MERCY WAY  JOPLIN    MO   
45535      260001  MERCY HOSPITAL JOPLIN  100 MERCY WAY  JOPLIN    MO   
45536      260001  MERCY HOSPITAL JOPLIN  100 MERCY WAY  JOPLIN    MO   
45537      260001  MERCY HOSPITAL JOPLIN  100 MERCY WAY  JOPLIN    MO   
45538      260001  MERCY HOSPITAL JOPLIN  100 MERCY WAY  JOPLIN    MO   

       ZIP Code County Name    Phone Number     Measure ID  \
45534     64804      JASPER  (417) 781-2727  COMP_HIP_KNEE   
45535     64804      JASPER  (417) 781-2727    MORT_30_AMI   
45536     64804      JASPER  (417) 781-2727   MORT_30_CABG   
45537     64804      JASPER  (417) 781-2727   MORT_30_COPD   
45538     64804      JASPER  (417) 781-2727     MORT_30_HF   

                                            Measure Name  \
45534  Rate of complications for hip/knee replacement...   
45535               Death rate for heart attack patients   
45536   

In [5]:
# Aggregate the data by hospital; groupby leaves 'Facility Name' as the index,
# which is what the .loc lookups in the tests rely on
mo_summary = mo_hospitals.groupby('Facility Name').agg(
    start_date=('Start Date', 'min'),
    end_date=('End Date', 'max'),
    number=('Denominator', 'sum')
)

In [6]:
assert(mo_summary['number'].sum() == 1766908)
assert(mo_summary['start_date'].min() == pd.Timestamp(2015,4,1))
assert(mo_summary['end_date'].max() == pd.Timestamp(2018,6,30))
assert(mo_summary.shape == (108,3))
assert(mo_summary.loc['BARNES JEWISH HOSPITAL'].number == 131313)
assert(mo_summary.loc['BOONE HOSPITAL CENTER'].number == 63099)

---

### 47.2: Planning your final project

You should be thinking about the things we've been learning and how you can apply them to your final project.  Use the rubric to help guide your thinking and then answer the questions below.  This is meant as a guide to help you think through what you will do.

For my project on stroke and brain health, I plan to use the following sources:

* National Institutes of Health (NIH): The National Institute of Neurological Disorders and Stroke (NINDS) offers information on neurological conditions, with a particular emphasis on stroke and brain health research.
* OpenNeuro: An open-access database containing MRI and other brain imaging data for a range of brain disorders.
* Alzheimer's Disease Neuroimaging Initiative (ADNI): Provides brain health data for studies on cognitive decline, such as MRI and PET scans.
* European Union Open Data Portal: Data about brain and neurological health from Europe.
* NHANES (CDC): Offers CSV-formatted U.S. health data, including stroke risk factors.

So far I have the CSV file from https://www.kaggle.com/datasets/jillanisofttech/brain-stroke-dataset; I am planning to gather the remaining data.

#### A. Data Access

Your project should include data from at least three distinct types of sources.  For example: AWS S3, Relational Databases, Internet, Web Services, local files.  List what data sources you're planning to use.

I am going to draw the data from a CSV file first and then add JSON and XLS files.
At present I have only the CSV; I will gather the JSON and XLS data later to use in my project.
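As a rough sketch of how files in different formats could be brought together in pandas, the example below reads a CSV and a JSON source and joins them on a shared key. The in-memory strings stand in for real files, and the column names (`patient_id`, `age`, `stroke`, `bmi`) are hypothetical, not taken from the Kaggle dataset.

```python
import io
import json
import pandas as pd

# Hypothetical stand-ins for project files; column names are illustrative only.
csv_text = "patient_id,age,stroke\n1,67,1\n2,45,0\n"
json_text = json.dumps([
    {"patient_id": 1, "bmi": 31.2},
    {"patient_id": 2, "bmi": 24.8},
])

# Each format has its own reader, but both produce a DataFrame.
csv_df = pd.read_csv(io.StringIO(csv_text))
json_df = pd.read_json(io.StringIO(json_text))

# An Excel file would be read the same way with pd.read_excel(path),
# which needs an engine such as openpyxl installed.

# Join the two sources on the shared key, keeping every CSV row.
combined = csv_df.merge(json_df, on="patient_id", how="left")
print(combined)
```

A left join keeps every row from the CSV even when the other source has no match, which makes missing data visible as `NaN` instead of silently dropping patients.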

**Double-click to enter your answer**

Put your answer here


#### B. Data Formats

Your project should include data that comes in different file formats.  For example: HL7, EDI, HTML, CSV, Excel, JSON, XML.  List what data formats you're planning to use.

**Double-click to enter your answer**

Put your answer here


#### C. Objective

What purpose would your project serve in a real work setting?  Take a couple of paragraphs to write down why this is an interesting product.

This brain stroke project may provide useful insight for researchers studying stroke prevention and patient outcomes, for policymakers, and for healthcare professionals. Stroke is a major cause of disability and death, and its risk is affected by factors such as age, lifestyle, and pre-existing conditions. By combining several datasets, the project can find patterns, identify high-risk populations, and investigate relationships between stroke occurrence and lifestyle factors such as obesity.

These observations could inform resource allocation and the design of targeted public health policies. Within healthcare organizations, clinicians may use predictive models based on this data to evaluate stroke risk and carry out early interventions. Understanding geographic and demographic disparities also helps policymakers create tailored community health programs that encourage proactive stroke prevention and improved patient care.




---



## Submit your work via GitHub as normal
