# Data Collection
<br>
Given the COVID-19 crisis, we will try to understand India's health care capacity.
<br><br>
<i>The fight against COVID-19 is all about flattening the curve.</i>
<br>
<img src='https://thespinoff.co.nz/wp-content/uploads/2020/03/Covid-19-curves-graphic-social-v3.gif' alt='Flatten the curve' width=600 align='left'>

## Pull Data from an API

Let us now pull the hospital bed data for **India**.  
[COVID 19 API List for India](https://api.rootnet.in/)

In [73]:
# Import libraries
import requests
import numpy as np
import pandas as pd
from pathlib import Path

# Set data path
DATA = Path('data')
!ls {DATA}


ari_2018.csv	  medical_college_list.csv  nurses.csv		 population.csv
doctors.csv	  mohfw.csv		    pneumonia_2018.csv
govt_doctors.csv  nhrr			    pop_by_age_2017.csv


### Get data from the API

In [3]:
BED_URL = 'https://api.rootnet.in/covid19-in/hospitals/beds'

r = requests.get(BED_URL)
print(f'Status {r.status_code}')


Status 200
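
A status of 200 confirms that this particular request succeeded. A slightly more defensive fetch could look like the sketch below. `fetch_json` and its injectable `get` parameter are hypothetical names introduced here for illustration; they are not part of the original notebook.

```python
import requests


def fetch_json(url: str, timeout: float = 10.0, get=requests.get) -> dict:
    """Fetch `url` and return the decoded JSON body.

    `get` is injectable (an assumption made for this sketch) so the
    function can be exercised without network access.
    """
    response = get(url, timeout=timeout)  # Fail fast instead of hanging forever
    response.raise_for_status()           # Raise on 4xx/5xx instead of continuing
    return response.json()
```

With this helper, `api = fetch_json(BED_URL)` would stand in for the separate `requests.get` and `r.json()` calls.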


### Load it into a `pandas DataFrame`

In [28]:
api = r.json()

beds = pd.DataFrame(api['data']['regional'])
beds.head()

,state,ruralHospitals,ruralBeds,urbanHospitals,urbanBeds,totalHospitals,totalBeds,asOn
0,Andhra Pradesh,193,6480,65,16658,258,23138,2017-01-01T00:00:00.000Z
1,Arunachal Pradesh,208,2136,10,268,218,2404,2017-12-31T00:00:00.000Z
2,Assam,1176,10944,50,6198,1226,17142,2017-12-31T00:00:00.000Z
3,Bihar,930,6083,103,5936,1033,12019,2016-12-31T00:00:00.000Z
4,Chhattisgarh,169,5070,45,4342,214,9412,2016-01-01T00:00:00.000Z


### State wise bed count

In [26]:
# Exclude the last row of the API response
(beds[:-1][['state', 'totalBeds']]
     .style
     .hide(axis='index')   # pandas >= 1.4; older versions use .hide_index()
     .background_gradient(subset='totalBeds', cmap='YlGn'))

state,totalBeds
Andhra Pradesh,23138
Arunachal Pradesh,2404
Assam,17142
Bihar,12019
Chhattisgarh,9412
Goa,3013
Gujarat,32280
Haryana,11240
Himachal Pradesh,12399
Jammu & Kashmir,11651


Great, now that we have the **number of beds available** in each state, let us extract the **number of coronavirus cases** per state. Comparing the two will help us gauge potential bed shortages.

## Scrape Data from the Web

We will scrape the data from the Ministry of Health & Family Welfare (MoHFW) website.

[MoHFW](https://www.mohfw.gov.in/)

> BeautifulSoup - [Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

> Further reading
- https://do.co/2XzV5uT
- https://bit.ly/2A2axqo

### Get the *source* of the MoHFW webpage

In [30]:
MOHFW_URL = 'https://www.mohfw.gov.in/'

r = requests.get(MOHFW_URL)
print(f'Status: {r.status_code}')

Status: 200


### Extract the *table* from the *source*

In [36]:
# Import BeautifulSoup
import bs4
from bs4 import BeautifulSoup as BS

page = BS(r.content, 'html.parser')
table = page.table

### Look at the underlying structure of the *table*

In [34]:
print(table.tbody.tr.prettify())

<tr>
 <td>
  1
 </td>
 <td>
  Andaman and Nicobar Islands
 </td>
 <td>
  0
 </td>
 <td>
  33
 </td>
 <td>
  0
 </td>
 <td>
  33
 </td>
</tr>
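
Each `<tr>` holds one state and each `<td>` holds one value, so a row maps naturally onto a list of strings. A minimal sketch on an inline HTML snippet (two rows copied from the page data shown in this notebook; the snippet itself is hand-written for illustration):

```python
from bs4 import BeautifulSoup

html = '''
<table>
  <tbody>
    <tr><td>1</td><td>Andaman and Nicobar Islands</td><td>0</td><td>33</td><td>0</td><td>33</td></tr>
    <tr><td>2</td><td>Andhra Pradesh</td><td>1654</td><td>2576</td><td>73</td><td>4303</td></tr>
  </tbody>
</table>
'''

soup = BeautifulSoup(html, 'html.parser')

# One inner list per <tr>, one string per <td>
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all('td')]
    for tr in soup.select('tbody tr')
]
print(rows)
```

`get_text(strip=True)` is used here as a guard against stray whitespace around cell values.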



### Extract *data* from the *table*

In [38]:
from typing import List

def extract_from_table(table: bs4.element.Tag) -> List[List[str]]:
    '''Extracts data from an HTML table.
    
    Input:  bs4 *table*
    Return: List of rows, each a list of the cell texts in that row
    '''
    data = list()
    
    for row in table.select('tbody tr'):
        data.append([col.text for col in row.find_all('td')])
        
    return data

data_table = extract_from_table(table)
data_table

[['1', 'Andaman and Nicobar Islands', '0', '33', '0', '33'],
 ['2', 'Andhra Pradesh', '1654', '2576', '73', '4303'],
 ['3', 'Arunachal Pradesh', '44', '1', '0', '45'],
 ['4', 'Assam', '1651', '498', '4', '2153'],
 ['5', 'Bihar', '2342', '2225', '29', '4596'],
 ['6', 'Chandigarh', '77', '222', '5', '304'],
 ['7', 'Chhattisgarh', '633', '244', '2', '879'],
 ['8', 'Dadar Nagar Haveli', '13', '1', '0', '14'],
 ['9', 'Delhi', '15311', '10315', '708', '26334'],
 ['10', 'Goa', '131', '65', '0', '196'],
 ['11', 'Gujarat', '4901', '13003', '1190', '19094'],
 ['12', 'Haryana', '1439', '2134', '24', '3597'],
 ['13', 'Himachal Pradesh', '199', '189', '5', '393'],
 ['14', 'Jammu and Kashmir', '2202', '1086', '36', '3324'],
 ['15', 'Jharkhand', '464', '410', '7', '881'],
 ['16', 'Karnataka', '3090', '1688', '57', '4835'],
 ['17', 'Kerala', '973', '712', '14', '1699'],
 ['18', 'Ladakh', '48', '48', '1', '97'],
 ['19', 'Madhya Pradesh', '2734', '5878', '384', '8996'],
 ['20', 'Maharashtra', '42224', '

### Create a `pandas DataFrame` from the *data_table* & fix the `dtypes`

In [41]:
columns = ['sno', 'state', 'active', 'cured', 'dead', 'total']
stats = pd.DataFrame(data_table[:-6], columns=columns)
stats.dtypes

sno       object
state     object
active    object
cured     object
dead      object
total     object
dtype: object

In [42]:
num_cols = ['active', 'cured', 'dead', 'total']
stats[num_cols] = stats[num_cols].astype(int)
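
`astype(int)` raises an error if any cell is not a clean integer string. As a hedged alternative, the sketch below uses `pd.to_numeric` with `errors='coerce'`, which turns unparseable cells into `NaN` instead. The three rows are copied from `data_table`; the consistency check at the end is an addition for illustration.

```python
import pandas as pd

columns = ['sno', 'state', 'active', 'cured', 'dead', 'total']
rows = [
    ['1', 'Andaman and Nicobar Islands', '0', '33', '0', '33'],
    ['2', 'Andhra Pradesh', '1654', '2576', '73', '4303'],
    ['3', 'Arunachal Pradesh', '44', '1', '0', '45'],
]
sample = pd.DataFrame(rows, columns=columns)

num_cols = ['active', 'cured', 'dead', 'total']
# errors='coerce' turns unparseable cells into NaN instead of raising
sample[num_cols] = sample[num_cols].apply(pd.to_numeric, errors='coerce')

# Sanity check: active + cured + dead should equal the reported total
parts = sample[['active', 'cured', 'dead']].sum(axis=1)
mismatch = sample[parts != sample['total']]
print(sample.dtypes)
print(f'Rows with inconsistent totals: {len(mismatch)}')
```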

### State wise deaths & cured cases

In [44]:
(stats[['state', 'cured', 'dead']]
     .style
     .hide(axis='index')   # pandas >= 1.4; older versions use .hide_index()
     .background_gradient(cmap='YlGn'))

state,cured,dead
Andaman and Nicobar Islands,33,0
Andhra Pradesh,2576,73
Arunachal Pradesh,1,0
Assam,498,4
Bihar,2225,29
Chandigarh,222,5
Chhattisgarh,244,2
Dadar Nagar Haveli,1,0
Delhi,10315,708
Goa,65,0


Cool! We now have state-level case data and the number of beds available. But can we do better?  
Can we get data at the district level instead? Let's try!
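
Before moving on, here is a rough sketch of how the two state-level tables could be combined. The figures are copied from the outputs above. The `normalise` helper and the `active_per_1000_beds` column are hypothetical, introduced only to show that state names differ between sources ("Jammu & Kashmir" vs "Jammu and Kashmir") and must be aligned before merging.

```python
import pandas as pd

# Bed counts from the API output above
beds_sample = pd.DataFrame({
    'state': ['Andhra Pradesh', 'Assam', 'Bihar', 'Jammu & Kashmir'],
    'totalBeds': [23138, 17142, 12019, 11651],
})
# Active cases from the scraped MoHFW table above
cases_sample = pd.DataFrame({
    'state': ['Andhra Pradesh', 'Assam', 'Bihar', 'Jammu and Kashmir'],
    'active': [1654, 1651, 2342, 2202],
})


def normalise(name: str) -> str:
    """Build a join key that ignores '&' vs 'and', case, and outer spaces."""
    return name.replace('&', 'and').strip().lower()


merged = (
    beds_sample.assign(key=beds_sample['state'].map(normalise))
    .merge(cases_sample.assign(key=cases_sample['state'].map(normalise)),
           on='key', suffixes=('_beds', '_cases'))
)
merged['active_per_1000_beds'] = 1000 * merged['active'] / merged['totalBeds']
print(merged[['state_cases', 'totalBeds', 'active', 'active_per_1000_beds']])
```

Keep in mind that the bed counts date from 2016-2017, so any ratio computed this way is only a rough indication.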

## Parse Data from PDF

We will be parsing the data from the [National Health Profile (NHP)](https://www.cbhidghs.nic.in/index7.php?lang=1&level=0&linkid=1086&lid=1107&color=1) reports published every year by the Central Bureau of Health Intelligence (CBHI).  

You can download the PDF from here: [NHP 2019](https://github.com/srmsoumya/dsct/raw/master/data/dw/nhrr/NHRR2019.pdf) and save it in the `data/nhrr` directory, which is where the code below expects it.

We will be using Camelot to parse the PDF.  
[Camelot](https://camelot-py.readthedocs.io/en/master/)

### Extract *medical college data* from NHP 2019 report

> Pages: [270-282]

In [45]:
import camelot

NHRR = DATA/'nhrr'/'NHRR2019.pdf'

In [64]:
med_clgs = camelot.read_pdf(str(NHRR), pages='270-282', flavor='lattice')

In [65]:
med_clgs[0].parsing_report

{'accuracy': 100.0, 'whitespace': 14.29, 'order': 1, 'page': 270}
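
The report above covers only the first table. To scan every parsed table for weak results, something like the following sketch could be used. `pages_below_accuracy` and the threshold of 95 are assumptions made for illustration, and the stand-in objects only mimic the `parsing_report` attribute shown above, so the example runs without Camelot or the PDF.

```python
from types import SimpleNamespace


def pages_below_accuracy(tables, threshold=95.0):
    """Return the page numbers whose parsing accuracy is below `threshold`."""
    return [t.parsing_report['page'] for t in tables
            if t.parsing_report['accuracy'] < threshold]


# Stand-ins for Camelot tables. The first report is copied from the output
# above; the second is made up for illustration.
fake_tables = [
    SimpleNamespace(parsing_report={'accuracy': 100.0, 'whitespace': 14.29,
                                    'order': 1, 'page': 270}),
    SimpleNamespace(parsing_report={'accuracy': 82.5, 'whitespace': 30.0,
                                    'order': 1, 'page': 271}),
]
print(pages_below_accuracy(fake_tables))  # → [271]
```

In the notebook, `pages_below_accuracy(med_clgs)` would be the corresponding call.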

In [66]:
med_clgs[0].df

,0,1,2,3,4,5,6
0,S. \nNo.,State/UT,Name of Medical College,City/Town,Govt/ \nPrivate,Admission \nCapacity,No. of \nbeds in \nAttached \nHospital
1,1,Andaman & \nNicobar Islands,Andaman & Nicobar Islands Insitute of Medical ...,Port Blair,Govt.,100,460
2,2,Andhra Pradesh,ACSR Government Medical College Nellore,Nellore,Govt.,150,750
3,3,,"All India Institute of Medical Sciences, Manga...",Vijaywada,Govt.,50,
4,4,,Alluri Sitaram Raju Academy of Medical Science...,Eluru,Trust,150,1070
5,5,,"Andhra Medical College, Visakhapatnam",Visakhapatnam,Govt.,200,2017
6,6,,Apollo Institute of Medical Sciences and Resea...,Chittoor,Society,150,
7,7,,"Dr. P.S.I. Medical College , Chinoutpalli",Chinoutpalli,Trust,150,398
8,8,,"Fathima Instt. of Medical Sciences,Kadapa",Kadapa,Trust,100,450
9,9,,Gayathri Vidya Parishad Institute of Health Ca...,Visakhapatnam,Society,150,


### Clean the table

In [70]:
def extract_table(df: pd.DataFrame) -> pd.DataFrame:
    '''Cleans a table parsed by Camelot.'''
    df = df.copy()                                           # Work on a copy
    df.columns = df.iloc[0]                                  # Use the first row as the header
    df = df.drop(df.index[0])                                # Drop the header row from the data
    df.columns = [c.replace(' \n', '') for c in df.columns]  # Strip line breaks from column names
    df = df[df['S.No.'] != '']                               # Remove the total rows (blank S.No.)
    df = df.set_index('S.No.')                               # Set S.No. as the index
    
    return df

# Combine the 13 parsed tables (pages 270-282) into one DataFrame
med_clgs_df = pd.concat([extract_table(med_clgs[i].df) for i in range(13)])

In [74]:
# Fill the missing names in `State/UT` column, format the names
med_clgs_df['State/UT'] = med_clgs_df['State/UT'].replace(r'^\s*$', np.nan, regex=True)\
                                                 .ffill()\
                                                 .str.replace('\n', '')
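
To see what this chain does, here is a small sketch on a hand-built frame. The values are copied from the first rows of the parsed table above; the short `beds` column name and the per-state sum are additions for illustration.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    'State/UT': ['Andaman & \nNicobar Islands', 'Andhra Pradesh', '', '', ''],
    'City/Town': ['Port Blair', 'Nellore', 'Vijaywada', 'Eluru', 'Visakhapatnam'],
    'beds': ['460', '750', '', '1070', '2017'],
})

sample['State/UT'] = (
    sample['State/UT']
    .replace(r'^\s*$', np.nan, regex=True)  # Blank cells -> NaN
    .ffill()                                # Carry the last state name downwards
    .str.replace('\n', '', regex=False)     # Remove line breaks inside names
)

# Blank bed counts become NaN, so they are skipped when summing
sample['beds'] = pd.to_numeric(sample['beds'], errors='coerce')
beds_by_state = sample.groupby('State/UT')['beds'].sum()
print(beds_by_state)
```

The forward fill only works because the PDF lists each state name once, on the first row of its group.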

In [75]:
med_clgs_df.head()

S.No.,State/UT,Name of Medical College,City/Town,Govt/Private,AdmissionCapacity,No. of beds in AttachedHospital
1,Andaman & Nicobar Islands,Andaman & Nicobar Islands Insitute of Medical ...,Port Blair,Govt.,100,460.0
2,Andhra Pradesh,ACSR Government Medical College Nellore,Nellore,Govt.,150,750.0
3,Andhra Pradesh,"All India Institute of Medical Sciences, Manga...",Vijaywada,Govt.,50,
4,Andhra Pradesh,Alluri Sitaram Raju Academy of Medical Science...,Eluru,Trust,150,1070.0
5,Andhra Pradesh,"Andhra Medical College, Visakhapatnam",Visakhapatnam,Govt.,200,2017.0


### Save the data to a CSV file

In [77]:
med_clgs_df.to_csv(DATA/'medical_college_list.csv', index=False)
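
Note that `index=False` leaves the `S.No.` index out of the file. The sketch below shows the difference using in-memory buffers instead of real files; the one-row frame is a cut-down stand-in for `med_clgs_df`.

```python
import io

import pandas as pd

sample = pd.DataFrame(
    {'State/UT': ['Andhra Pradesh'], 'City/Town': ['Nellore']},
    index=pd.Index(['2'], name='S.No.'),
)

without_index = io.StringIO()
sample.to_csv(without_index, index=False)  # As in the cell above: S.No. is dropped

with_index = io.StringIO()
sample.to_csv(with_index)                  # Default: the index is written as a column

cols_without = pd.read_csv(io.StringIO(without_index.getvalue())).columns.tolist()
cols_with = pd.read_csv(io.StringIO(with_index.getvalue())).columns.tolist()
print(cols_without)
print(cols_with)
```

Dropping the serial number is harmless here, since it carries no information beyond row order.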

### Assignment: Extract **Pneumonia** data from NHP 2019 report
> Page: [139]

In [78]:
pneumonia = camelot.read_pdf(str(NHRR), pages='139', flavor='lattice')

In [81]:
pneumonia[0].df

,0,1,2,3,4,5,6,7
0,S. \nNo.,State/UT.,Male,,Female,,Total,
1,,,Cases,Deaths,Cases,Deaths,Cases,Deaths
2,1,Andhra Pradesh,20203,224,17546,141,37749,365
3,2,Arunachal Pradesh,403,0,304,0,707,0
4,3,Assam,10117,92,6458,43,16575,135
5,4,Bihar,11653,16,8429,9,20082,25
6,5,Chhattisgarh,3506,26,2978,21,6484,47
7,6,Goa,1511,50,1287,24,2798,74
8,7,Gujarat,2847,2,2312,1,5159,3
9,8,Haryana,7843,23,6200,11,14043,34


In [82]:
def extract_pneumonia_table(df: pd.DataFrame) -> pd.DataFrame:
    '''Cleans the Pneumonia Dataframe'''
    df = df.copy()                                           # Work on a copy
    df.columns = df.iloc[0]                                  # Use the first row as the header
    df = df.drop(df.index[[0, 1]])                           # Drop both header rows (rows 0 and 1)
    df.columns = [c.replace(' \n', '') for c in df.columns]  # Strip line breaks from column names
    df = df[df['S.No.'] != '']                               # Remove the total rows (blank S.No.)
    df = df.set_index('S.No.')                               # Set S.No. as the index
    
    return df

pneumonia_df = extract_pneumonia_table(pneumonia[0].df)

In [84]:
pneumonia_df.head()

S.No.,State/UT.,Male,,Female,,Total,
1,Andhra Pradesh,20203,224,17546,141,37749,365
2,Arunachal Pradesh,403,0,304,0,707,0
3,Assam,10117,92,6458,43,16575,135
4,Bihar,11653,16,8429,9,20082,25
5,Chhattisgarh,3506,26,2978,21,6484,47


In [85]:
pneumonia_df.columns = ['State/UT.', 'Male-Cases', 'Male-Deaths', 'Female-Cases', 'Female-Deaths', 'Total-Cases', 'Total-Deaths']
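
The names above are typed out by hand. As an alternative sketch, they can be derived from the two header rows of the raw table. The rows below are copied from `pneumonia[0].df`; `flatten_header` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

raw = pd.DataFrame([
    ['S. \nNo.', 'State/UT.', 'Male', '', 'Female', '', 'Total', ''],
    ['', '', 'Cases', 'Deaths', 'Cases', 'Deaths', 'Cases', 'Deaths'],
    ['1', 'Andhra Pradesh', '20203', '224', '17546', '141', '37749', '365'],
    ['2', 'Arunachal Pradesh', '403', '0', '304', '0', '707', '0'],
])


def flatten_header(top, sub):
    """Join a two-row header into single names such as 'Male-Cases'."""
    names, group = [], ''
    for t, s in zip(top, sub):
        t = t.replace(' \n', '').strip()
        if t:                      # A blank top cell belongs to the previous group
            group = t
        names.append(f'{group}-{s}' if s else group)
    return names


tidy = raw.iloc[2:].copy()                              # Data rows only
tidy.columns = flatten_header(raw.iloc[0], raw.iloc[1])
tidy = tidy.set_index('S.No.')
print(tidy.columns.tolist())
```

This keeps the column names tied to the PDF headers, so the same helper would also work for other NHP tables with the same two-row layout.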

In [86]:
pneumonia_df.head()

S.No.,State/UT.,Male-Cases,Male-Deaths,Female-Cases,Female-Deaths,Total-Cases,Total-Deaths
1,Andhra Pradesh,20203,224,17546,141,37749,365
2,Arunachal Pradesh,403,0,304,0,707,0
3,Assam,10117,92,6458,43,16575,135
4,Bihar,11653,16,8429,9,20082,25
5,Chhattisgarh,3506,26,2978,21,6484,47


In [87]:
pneumonia_df.to_csv(DATA/'pneumonia_2018.csv', index=False)