# Week 11 Assignment


Please do the programming exercise and verify that your code works using the tests, then think about your final project and fill out the questions in the second part.

---

### 47.1 Filtering and summarizing data

For this work, you'll find a data file at `https://hds5210-data.s3.amazonaws.com/complications_all.csv`.

Read in the data file and create a variable called `mo_hospitals` that contains a data frame from the `complications_all.csv` file, filtered down to only contain those hospitals from the state of Missouri (MO).

Then aggregate that data by hospital into a variable named `mo_summary`.  There are some key fields that we want to summarize:
* We want to know the earliest date that each hospital was participating in any program
* We want to know the latest date that each hospital stopped participating in any program
* We want to know the total number of patients in the denominators of these programs

Some things to note:
* You will need to convert the `Start Date` and `End Date` to actual datetime fields
* You will need to clean up the `Denominator` field and convert it to a numeric type. The rule you should use is to simply remove any records where the `Denominator` is `'Not Available'`
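
The two notes above can be sketched on a few made-up rows. This is only an illustration: the toy values and the `MM/DD/YYYY` date format are assumptions, not taken from the real file, and the name `clean` is hypothetical.

```python
import pandas as pd

# Hypothetical toy rows that mimic the relevant columns of complications_all.csv
toy = pd.DataFrame({
    "Facility Name": ["A", "A", "B"],
    "Denominator": ["25", "Not Available", "40"],
    "Start Date": ["04/01/2015", "07/01/2016", "04/01/2015"],
    "End Date": ["03/31/2018", "06/30/2018", "03/31/2018"],
})

# Remove records with no usable denominator; .copy() gives an independent
# frame so the column assignments below do not trigger SettingWithCopyWarning
clean = toy[toy["Denominator"] != "Not Available"].copy()

# Convert the text columns to real datetime and numeric types
# (the date format here is an assumption about the toy data only)
clean["start_date"] = pd.to_datetime(clean["Start Date"], format="%m/%d/%Y")
clean["end_date"] = pd.to_datetime(clean["End Date"], format="%m/%d/%Y")
clean["number"] = pd.to_numeric(clean["Denominator"])
```

Filtering before converting matters here, because `pd.to_numeric` would raise an error on the string `'Not Available'` by default.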


The final result of this step should be a new data frame called `mo_summary` that contains one row for each hospital and contains the min start date, max end date, and total denominator.  Use the names `start_date`, `end_date`, and `number` for those columns in `mo_summary`.
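
As a rough sketch of the shape described above, the following aggregates a small made-up frame into one row per hospital. The rows and the name `summary` are hypothetical and chosen only to make the result easy to check by hand.

```python
import pandas as pd

# Hypothetical, already-cleaned rows: two records for hospital A, one for B
rows = pd.DataFrame({
    "Facility Name": ["A", "A", "B"],
    "start_date": pd.to_datetime(["2015-04-01", "2016-07-01", "2015-07-01"]),
    "end_date": pd.to_datetime(["2018-03-31", "2018-06-30", "2018-03-31"]),
    "number": [25, 10, 40],
})

# Named aggregation: output column = (input column, aggregation function)
summary = rows.groupby("Facility Name").agg(
    start_date=("start_date", "min"),
    end_date=("end_date", "max"),
    number=("number", "sum"),
)
```

The grouping column becomes the index of the result, so a single hospital can be looked up with `summary.loc["A"]`.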


You do not need to create your code in the form of a function, just make sure your variable names match what I've described above so the tests work.

In [11]:
import pandas as pd
# This is just to show you the name to use for the variable you need to create for this step to pass.
all_hospitals = pd.read_csv('https://hds5210-data.s3.amazonaws.com/complications_all.csv')

In [12]:
count_mo = all_hospitals['State'].eq('MO').sum()
count_mo

2133

In [13]:
mo_hospitals = all_hospitals[all_hospitals['State']=='MO']

In [15]:
mo_hospitals.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2133 entries, 45534 to 47666
Data columns (total 18 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Facility ID           2133 non-null   object 
 1   Facility Name         2133 non-null   object 
 2   Address               2133 non-null   object 
 3   City                  2133 non-null   object 
 4   State                 2133 non-null   object 
 5   ZIP Code              2133 non-null   int64  
 6   County Name           2133 non-null   object 
 7   Phone Number          2133 non-null   object 
 8   Measure ID            2133 non-null   object 
 9   Measure Name          2133 non-null   object 
 10  Compared to National  2133 non-null   object 
 11  Denominator           2133 non-null   object 
 12  Score                 2133 non-null   object 
 13  Lower Estimate        2133 non-null   object 
 14  Higher Estimate       2133 non-null   object 
 15  Footnote        

In [16]:
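# Drop rows with no usable denominator; .copy() avoids SettingWithCopyWarning
# when new columns are assigned in the following cells
mo_hospitals = mo_hospitals[mo_hospitals['Denominator']!='Not Available'].copy()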

In [17]:
mo_hospitals['start_date'] = pd.to_datetime(mo_hospitals['Start Date'])
mo_hospitals['end_date'] = pd.to_datetime(mo_hospitals['End Date'])

In [18]:
mo_hospitals['number'] = pd.to_numeric(mo_hospitals['Denominator'])

In [19]:
# One row per hospital; 'Facility Name' stays as the index so the tests can use .loc
mo_summary = mo_hospitals.groupby('Facility Name').agg(
    start_date=pd.NamedAgg(column='start_date', aggfunc='min'),
    end_date=pd.NamedAgg(column='end_date', aggfunc='max'),
    number=pd.NamedAgg(column='number', aggfunc='sum')
)

In [21]:
assert(mo_summary['number'].sum() == 1766908)
assert(mo_summary['start_date'].min() == pd.Timestamp(2015,4,1))
assert(mo_summary['end_date'].max() == pd.Timestamp(2018,6,30))
assert(mo_summary.shape == (108,3))
assert(mo_summary.loc['BARNES JEWISH HOSPITAL'].number == 131313)
assert(mo_summary.loc['BOONE HOSPITAL CENTER'].number == 63099)

---

### 47.2 Planning your final project

You should be thinking about the things we've been learning and how you can apply them to your final project.  Use the rubric to help guide your thinking and then answer the questions below.  This is meant as a guide to help you think through what you will do.

#### A. Data Access

Your project should include data from at least three distinct types of sources.  For example: AWS S3, Relational Databases, Internet, Web Services, local files.  List what data sources you're planning to use.

The source type I used most this semester is Internet sources such as Data.gov, healthdata.gov, and data.cdc.gov. I can also use local COVID-19 data files that I already have and am familiar with. Finally, I plan to use datasets hosted in AWS S3: Amazon Simple Storage Service is known for hosting very large datasets, and working with a source type I have not used before would be valuable experience for my future career.

#### B. Data Formats

Your project should include data that comes in different file formats.  For example: HL7, EDI, HTML, CSV, Excel, JSON, XML.  List what data formats you're planning to use.

I will definitely use Excel and CSV files, as they are the two file formats I have used most this semester. Both are tabular and are easy to combine. For the third format, I plan to work with JSON files, since JSON can be converted to Excel and CSV with libraries such as pandas and Python's `csv` module.
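
As a rough illustration of the JSON-to-CSV conversion mentioned above, here is a minimal sketch using only Python's standard `json` and `csv` modules. The records and field names (`state`, `cases`) are made up for the example and do not come from any project dataset.

```python
import csv
import io
import json

# Hypothetical JSON payload: a list of flat records
json_text = '[{"state": "MO", "cases": 120}, {"state": "IL", "cases": 95}]'
records = json.loads(json_text)

# Write the records as CSV into an in-memory buffer
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["state", "cases"], lineterminator="\n")
writer.writeheader()
writer.writerows(records)

csv_text = buffer.getvalue()
```

This only works directly for flat records; nested JSON would need to be flattened first.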

#### C. Objective

What purpose would your project serve in a real work setting?  Take a couple of paragraphs to write down why this is an interesting product.

I would like to analyze COVID-19 data alongside economic data from the pandemic period. According to scientists, another pandemic like COVID-19 is highly likely. The purpose of this project is to analyze economic indicators during COVID-19, such as inflation, the unemployment rate, and the GDP growth rate. By thoroughly reviewing the patterns and trends in these datasets, we may be able to build a model that works with real-time economic data in the future.

Such a model could help identify the major challenges in a given situation and suggest possible responses to them. We could also examine which parts of the economy are affected most, broken down by state, age, and other factors, so that follow-up measures can be targeted to specific circumstances. These strategies are intended to support economic stability and market activity, and the system could help prevent or overcome an economic downturn during future pandemics similar to COVID-19.

This analysis could also inform robust economic infrastructure and playbooks for resisting the severe economic effects of future pandemics. I believe this type of data analysis will be helpful when another pandemic occurs. Because humanity suffered great losses during COVID-19, we must be ready for a future outbreak that could cause even greater damage.



---



## Submit your work via GitHub as normal
