# Data Collection
<br>
Given the COVID-19 crisis, we will try to understand India's healthcare capacity.
<br><br>
<i>The fight against COVID-19 is all about flattening the curve.</i>
<br>
<img src='https://thespinoff.co.nz/wp-content/uploads/2020/03/Covid-19-curves-graphic-social-v3.gif' alt='Flatten the curve' width=600 align='left'>

* **What data do we need to better understand the healthcare capacity of India?**
    - Discuss

Now that we understand what data we need to estimate India's healthcare capacity, the logical next step is to find relevant data sources.
Unfortunately, this is not a straightforward process. There are multiple ways to start searching for data sources; here are a few:
1. A quick Google search
2. Public [APIs](https://api.covid19india.org/)
3. Official websites, in this case [MoHFW](https://www.mohfw.gov.in/) and [ICMR](https://www.icmr.gov.in/)
4. Data groups working in a similar space, such as [datameet](http://datameet.org/)
5. Public feeds like Twitter or Facebook
6. [data.world](https://data.world/)

* **Before we start, how much time do you think a data scientist spends munging data?**

## Pull Data from an API

Let us now pull the hospital beds data for **India**.     
[COVID 19 API List for India](https://api.rootnet.in/)

In [None]:
# Import libraries
import requests
import numpy as np
import pandas as pd
from pathlib import Path

# Set data path
DATA = Path('data')
!ls {DATA}

### Get data from the API

In [None]:
BED_URL = 'https://api.rootnet.in/covid19-in/hospitals/beds'

r = requests.get(BED_URL)
print(f'Status {r.status_code}')
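
Printing the status code works for a quick check, but it does not protect against a hung connection or an error response. The sketch below wraps the request in a small helper that sets a timeout and raises on HTTP errors. `fetch_json` and its `session` parameter are hypothetical names introduced here for illustration; they are not part of the original notebook.

```python
import requests


def fetch_json(url: str, session=None, timeout: float = 10.0) -> dict:
    """Fetch `url` and return the decoded JSON body.

    Hypothetical helper (an assumption, not from the notebook).
    `session` can be a `requests.Session`; by default the module-level
    `requests.get` is used.
    """
    http = session if session is not None else requests
    response = http.get(url, timeout=timeout)  # fail fast instead of hanging
    response.raise_for_status()                # raise on 4xx/5xx responses
    return response.json()
```

With this helper, the two steps above could be written as `api = fetch_json(BED_URL)`.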


### Load it into a `pandas DataFrame`

In [None]:
api = r.json()

beds = pd.DataFrame(api['data']['regional'])
beds.head()
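
To see why `api['data']['regional']` is the part that goes into the DataFrame, it can help to look at a trimmed payload with the same nesting. The sketch below uses a made-up response; the values and the `ruralBeds` field name are illustrative assumptions, not real API output.

```python
import pandas as pd

# Hypothetical, trimmed payload that mimics the nesting used above.
# All values are invented for illustration only.
sample = {
    'data': {
        'regional': [
            {'state': 'State A', 'ruralBeds': 100, 'totalBeds': 250},
            {'state': 'State B', 'ruralBeds': 40, 'totalBeds': 90},
            {'state': 'INDIA', 'ruralBeds': 140, 'totalBeds': 340},  # total row
        ]
    },
}

# A list of flat dicts maps directly onto rows (items) and columns (keys).
sample_beds = pd.DataFrame(sample['data']['regional'])
print(sample_beds.shape)
print(list(sample_beds.columns))
```

Each dictionary in the list becomes one row, which is why the nested list, rather than the whole response, is passed to `pd.DataFrame`.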

### State wise bed count

In [None]:
(beds[:-1][['state', 'totalBeds']]   # Drop the trailing total row
     .style
     .hide(axis='index')             # pandas >= 1.4; `hide_index()` was removed in pandas 2.0
     .background_gradient(subset='totalBeds', cmap='YlGn'))

### Exercise 1

1. Find the top 5 states with the most hospital beds.
2. Find the 10 states with the fewest rural beds.  
***
* Hint: Remember there is a total row in the dataframe
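
The hint can be illustrated on toy data. This is a minimal sketch with invented values, assuming the aggregate sits in the last row as it does in `beds`; the frame and variable names are hypothetical.

```python
import pandas as pd

# Toy data (invented values); the last row is an aggregate, as in `beds`.
toy = pd.DataFrame({
    'state': ['A', 'B', 'C', 'D', 'Total'],
    'totalBeds': [50, 20, 80, 10, 160],
})

states_only = toy.iloc[:-1]                      # drop the trailing total row
top2 = states_only.nlargest(2, 'totalBeds')      # rows with the largest values
bottom2 = states_only.nsmallest(2, 'totalBeds')  # rows with the smallest values

print(top2['state'].tolist())     # ['C', 'A']
print(bottom2['state'].tolist())  # ['D', 'B']
```

Without the `iloc[:-1]` step, the `Total` row would always rank first.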

In [None]:
# 1

In [None]:
# 2

### Exercise 2

1. Get hospital stats at a more granular level using the [Medical College API](https://api.rootnet.in/covid19-in/hospitals/medical-colleges).
2. Check the status code.
3. Load the `medicalColleges` data into a pandas DataFrame. (Hint: Check the structure of the response before loading it into the DataFrame.)
4. Find the 5 states with the fewest and the 5 states with the most `hospitalBeds`.
5. Did you notice any difference in the number of hospital beds? Can you explain why?


In [None]:
# 1
COLLEGE_URL = 'https://api.rootnet.in/covid19-in/hospitals/medical-colleges'

# YOUR CODE GOES HERE

In [None]:
# 2

In [None]:
# 3

In [None]:
# 4

In [None]:
# 5

Great, now that we have the **# of beds available** in each state, let us extract the **# of corona cases** per state. This will help us better understand the likely shortage of beds in the near future.

## Scrape Data from the Web

We will be scraping the data from the Ministry of Health & Family Welfare website.

[MoHFW](https://www.mohfw.gov.in/)

> BeautifulSoup - [Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

> Further reading
- https://do.co/2XzV5uT
- https://bit.ly/2A2axqo

### Get the *source* of the MoHFW webpage

In [None]:
MOHFW_URL = 'https://www.mohfw.gov.in/'

r = requests.get(MOHFW_URL)
print(f'Status: {r.status_code}')

### Extract the *table* from the *source*

In [None]:
# !conda install -y -c conda-forge beautifulsoup4 bs4

In [None]:
# Import BeautifulSoup
import bs4
from bs4 import BeautifulSoup as BS

page = BS(r.content, 'html.parser')
table = page.table

### Look at the underlying structure of the *table*

In [None]:
print(table.tbody.tr.prettify())

### Extract *data* from the *table*

In [None]:
from typing import List

def extract_from_table(table: bs4.element.Tag) -> List:
    '''Extracts data from HTML table.
    
    Input:  bs4 *table*
    Return: List of all the values in the table
    '''
    data = list()
    
    for row in table.select('tbody tr'):
        data.append([col.text for col in row.find_all('td')])
        
    return data

data_table = extract_from_table(table)
data_table
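
The same row-by-row extraction can be tried on a small, static HTML snippet, which makes it easier to see what `select('tbody tr')` and `find_all('td')` return. The table below is hypothetical; the real page has more columns and rows.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table with invented values.
html = '''
<table>
  <thead><tr><th>S. No.</th><th>State</th><th>Total</th></tr></thead>
  <tbody>
    <tr><td>1</td><td>State A</td><td>12</td></tr>
    <tr><td>2</td><td>State B</td><td>7</td></tr>
  </tbody>
</table>
'''

soup = BeautifulSoup(html, 'html.parser')

# One inner list per body row; header cells (<th>) are not matched by 'td'.
rows = [
    [cell.get_text(strip=True) for cell in row.find_all('td')]
    for row in soup.table.select('tbody tr')
]
print(rows)  # [['1', 'State A', '12'], ['2', 'State B', '7']]
```

Note that every value comes back as a string, which is why the `dtypes` need fixing in the next step.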

### Create a `pandas DataFrame` from the *data_table* & fix the `dtypes`

In [None]:
columns = ['sno', 'state', 'active', 'cured', 'dead', 'total']
stats = pd.DataFrame(data_table[:-6], columns=columns)
stats.dtypes

In [None]:
stats[['active', 'cured', 'dead', 'total']] = stats[['active', 'cured', 'dead', 'total']].astype(int) 
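
`astype(int)` raises an error if any scraped cell contains something other than a plain integer, such as a thousands separator, a footnote mark, or an empty string. A more forgiving conversion could look like the sketch below. The values are invented, and stripping every non-digit character is an assumption about what counts as noise on the page.

```python
import pandas as pd

# Invented values showing common scraping noise: commas, footnote marks, blanks.
raw = pd.DataFrame({'state': ['A', 'B', 'C'],
                    'total': ['1,234', '56#', '']})

digits_only = raw['total'].str.replace(r'[^0-9]', '', regex=True)  # keep digits only
# Unparseable or empty cells become missing values instead of raising.
raw['total'] = pd.to_numeric(digits_only, errors='coerce').astype('Int64')
print(raw['total'].tolist())
```

The nullable `Int64` dtype keeps the column integer-typed even when some cells are missing.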

### State wise deaths & cured cases

In [None]:
(stats[['state', 'cured', 'dead']]
     .style
     .hide(axis='index')             # pandas >= 1.4; `hide_index()` was removed in pandas 2.0
     .background_gradient(cmap='YlGn'))

### Exercise 3

1. Find the top 5 states with the highest death rate
2. Find the top 5 states with the highest cure rate
3. Take a deep breath & try to understand the data!

(Extra marks for visualizing the column)

In [None]:
# 1

In [None]:
# 2

### Exercise 4

Scrape the top section of the [MoHFW](https://www.mohfw.gov.in/) page, which contains the total number of active cases, cured, deaths & migrated.

* **What is unique about the section we are going to scrape in this page?**

In [None]:
section = page.select('div.site-stats-count')[0]
section

In [None]:
# YOUR CODE GOES HERE

Ahh cool! We now have state level cases data & the # of beds available with us. But can we do better?  
Can we get data at District level instead? Let's try!

## Parse Data from PDF

We will be parsing the data from the [National Health Profile (NHP)](https://www.cbhidghs.nic.in/index7.php?lang=1&level=0&linkid=1086&lid=1107&color=1) reports, published every year by the Central Bureau of Health Intelligence (CBHI).  

You can download the PDF from here: [NHP 2019](https://github.com/srmsoumya/dsct/raw/master/data/dw/nhrr/NHRR2019.pdf). Save it in the `data/nhrr` directory, which is where the code below looks for it.

We will be using Camelot to parse the PDF.  
[Camelot](https://camelot-py.readthedocs.io/en/master/)

### Extract *medical college data* from NHP 2019 report

> Page [270-282]

In [None]:
# !conda install -y -c conda-forge camelot-py (not working)
!pip install camelot-py[cv]

In [None]:
import camelot

NHRR = DATA/'nhrr'/'NHRR2019.pdf'

In [None]:
med_clgs = camelot.read_pdf(str(NHRR), pages='270-282', flavor='lattice')

In [None]:
med_clgs[0].parsing_report

In [None]:
med_clgs[0].df

### Clean the table

In [None]:
def extract_table(df: pd.DataFrame) -> pd.DataFrame:
    '''Cleans a raw Camelot table: promotes the first row to the header and drops total rows.'''
    df = df.copy()                                           # Work on a copy
    df.columns = [c.replace(' \n', '') for c in df.iloc[0]]  # Use the first row as column names
    df = df.iloc[1:]                                         # Drop the header row from the data
    df = df[df['S.No.'] != '']                               # Remove the total rows (blank S.No.)

    return df.set_index('S.No.')                             # Set S.No. as the index

med_clgs_df = pd.concat([extract_table(med_clgs[i].df) for i in range(13)])

In [None]:
# Fill the missing names in `State/UT` column, format the names
med_clgs_df['State/UT'] = med_clgs_df['State/UT'].replace(r'^\s*$', np.nan, regex=True)\
                                                 .ffill()\
                                                 .str.replace('\n', '')
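
The fill step is easier to follow on a toy column. This sketch uses invented rows and assumes, as in the parsed PDF tables, that the state name appears only on the first row of each group.

```python
import numpy as np
import pandas as pd

# Invented rows: the state name is present only on the first row of each group.
toy = pd.DataFrame({'State/UT': ['Andhra \nPradesh', '', '  ', 'Bihar', ''],
                    'College': ['c1', 'c2', 'c3', 'c4', 'c5']})

toy['State/UT'] = (toy['State/UT']
                   .replace(r'^\s*$', np.nan, regex=True)  # blank cells -> NaN
                   .ffill()                                # copy the last name downward
                   .str.replace('\n', ''))                 # drop line breaks from the PDF
print(toy['State/UT'].tolist())
# ['Andhra Pradesh', 'Andhra Pradesh', 'Andhra Pradesh', 'Bihar', 'Bihar']
```

The blank cells must become `NaN` first, because `ffill()` only fills missing values and treats an empty string as valid data.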

In [None]:
med_clgs_df.head()

### Save the data to a CSV file

In [None]:
med_clgs_df.to_csv(DATA/'medical_college_list.csv', index=False)

### *Exercise 5

Extract **Pneumonia** data from NHP 2019 report
> Page: [139]

In [None]:
# YOUR CODE GOES HERE