<div id="header">
    <p style="color:black; text-align:center; font-weight:bold; font-family:Tahoma, sans-serif; font-size:24px;">
        Data Gathering with API
    </p>
</div>

<div style="background-color:#bfbfbf; padding:8px; border:2px dotted black; border-radius:8px; font-family:sans-serif; line-height: 1.7em">

Data gathering is the first and one of the most essential steps in the machine learning workflow.

**APIs (Application Programming Interfaces)** give us access to data hosted on remote servers and provide a structured way to query, filter, and retrieve it. An API acts as a communication interface between software applications.

When working with APIs to gather data, the general process follows these steps:


1. API Request: A client (such as our Python code) sends an HTTP request to the API's endpoint.
2. API Response: The API processes the request and returns the requested data, usually in a structured format such as JSON or XML.
3. Data Parsing: Once the data is received, it needs to be parsed and structured in a format suitable for analysis, such as a pandas DataFrame.
4. Data Normalization: Since APIs often return nested data (such as hierarchical JSON), we can normalize this structure into a flat format using libraries like pandas.
5. Data Cleaning: After gathering the data, cleaning steps such as filling missing values or formatting columns (e.g., converting dates) are performed to ensure the data is ready for analysis.

**Example: Gathering COVID-19 Statistics**

This notebook demonstrates how to gather real-time COVID-19 statistics from a public API. A request to the API returns the data as JSON, which is then flattened into a pandas DataFrame for further analysis. The API provides fields such as total cases, active cases, recovered cases, deaths, and tests, which can be used for visualizations or other insights.

</div>
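
As a minimal sketch of steps 3 and 4 above, the snippet below flattens a small nested payload with `pd.json_normalize`. The `sample_payload` dictionary is a hypothetical, hand-written stand-in for a nested JSON response (a `response` list whose records contain nested `cases` and `deaths` objects); it is not actual API output.

```python
import pandas as pd

# Hypothetical payload, hand-written to mimic a nested JSON API response.
sample_payload = {
    "response": [
        {"country": "A", "cases": {"total": 10, "active": 2}, "deaths": {"total": 1}},
        {"country": "B", "cases": {"total": 5, "active": None}, "deaths": {"total": None}},
    ]
}

# json_normalize flattens nested dictionaries into dot-separated column names,
# so the nested "cases" object becomes the columns "cases.total" and "cases.active".
flat_df = pd.json_normalize(sample_payload["response"])
print(sorted(flat_df.columns))
```

The dot-separated names are why the selection step later in the notebook refers to columns such as `cases.total` and `deaths.total`.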



In [16]:
import pandas as pd
import requests

In [17]:
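# Set up the API connection
import os

url = "https://covid-193.p.rapidapi.com/statistics"
headers = {
    # Read the key from the environment instead of hardcoding a secret in the notebook.
    # RAPIDAPI_KEY is an assumed variable name; set it before running this cell.
    'x-rapidapi-key': os.environ["RAPIDAPI_KEY"],
    'x-rapidapi-host': "covid-193.p.rapidapi.com"
}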

In [18]:
# Make the request to get COVID-19 statistics
response = requests.get(url, headers=headers, timeout=30)
response.raise_for_status()  # Stop early on HTTP errors such as 401 or 429
data = response.json()

In [19]:
# Fail loudly if the payload is missing the expected key, so that
# statistics_df is never left undefined for the cells below
if 'response' not in data:
    raise KeyError("Expected a 'response' key in the API payload")

# Normalize the nested data into a flat DataFrame and select/rename relevant columns
statistics_df = pd.json_normalize(data['response'])[
    ['continent', 'country', 'day', 'cases.total', 'cases.active',
     'cases.critical', 'cases.recovered', 'deaths.total', 'population',
     'tests.total', 'time']
].rename(columns={
    'continent': 'Continent', 'country': 'Country', 'day': 'Date',
    'cases.total': 'Total Cases', 'cases.active': 'Active Cases',
    'cases.critical': 'Critical Cases', 'cases.recovered': 'Recovered Cases',
    'deaths.total': 'Total Deaths', 'population': 'Population',
    'tests.total': 'Total Tests', 'time': 'Time'
})

In [20]:
print(statistics_df.head())

       Continent           Country        Date  Total Cases  Active Cases  \
0         Africa      Saint-Helena  2024-10-11         2166        2164.0   
1  South-America  Falkland-Islands  2024-10-11         1930           0.0   
2  North-America        Montserrat  2024-10-11         1403          19.0   
3           None  Diamond-Princess  2024-10-11          712           0.0   
4         Europe      Vatican-City  2024-10-11           29           0.0   

   Critical Cases  Recovered Cases  Total Deaths  Population  Total Tests  \
0             NaN              2.0           NaN      6115.0          NaN   
1             NaN           1930.0           NaN      3539.0       8632.0   
2             NaN           1376.0           8.0      4965.0      17762.0   
3             NaN            699.0          13.0         NaN          NaN   
4             NaN             29.0           NaN       799.0          NaN   

                        Time  
0  2024-10-11T03:45:27+00:00  
1  2024-10-1

In [21]:
display(statistics_df)

| | Continent | Country | Date | Total Cases | Active Cases | Critical Cases | Recovered Cases | Total Deaths | Population | Total Tests | Time |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Africa | Saint-Helena | 2024-10-11 | 2166 | 2164.0 | | 2.0 | | 6115.0 | | 2024-10-11T03:45:27+00:00 |
| 1 | South-America | Falkland-Islands | 2024-10-11 | 1930 | 0.0 | | 1930.0 | | 3539.0 | 8632.0 | 2024-10-11T03:45:27+00:00 |
| 2 | North-America | Montserrat | 2024-10-11 | 1403 | 19.0 | | 1376.0 | 8.0 | 4965.0 | 17762.0 | 2024-10-11T03:45:27+00:00 |
| 3 | | Diamond-Princess | 2024-10-11 | 712 | 0.0 | | 699.0 | 13.0 | | | 2024-10-11T03:45:27+00:00 |
| 4 | Europe | Vatican-City | 2024-10-11 | 29 | 0.0 | | 29.0 | | 799.0 | | 2024-10-11T03:45:27+00:00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 233 | Europe | Europe | 2024-10-11 | 253406198 | 2550270.0 | 4453.0 | 248754104.0 | 2101824.0 | | | 2024-10-11T03:45:05+00:00 |
| 234 | South-America | South-America | 2024-10-11 | 70200879 | 2149962.0 | 8953.0 | 66683585.0 | 1367332.0 | | | 2024-10-11T03:45:05+00:00 |
| 235 | Oceania | Oceania | 2024-10-11 | 14895771 | 110368.0 | 31.0 | 14752388.0 | 33015.0 | | | 2024-10-11T03:45:05+00:00 |
| 236 | Africa | Africa | 2024-10-11 | 12860924 | 511224.0 | 529.0 | 12090808.0 | 258892.0 | | | 2024-10-11T03:45:05+00:00 |


In [22]:
# Save the DataFrame to a CSV file
statistics_df.to_csv('covid_statistics.csv', index=False)

In [23]:
# Display the DataFrame shape
print(statistics_df.shape)

(238, 11)
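
The data cleaning step from the introduction is not carried out in the cells above, so here is a minimal sketch of what it might look like. It runs on `sample_df`, a small stand-in frame whose values are copied from two of the rows displayed earlier, not on the full `statistics_df`. Keeping missing counts as missing (rather than filling them with zero) is an assumption about what suits later analysis; `fillna` would be the alternative if a default value makes sense.

```python
import pandas as pd

# Stand-in frame; values are copied from rows 0 and 2 of the output shown earlier.
sample_df = pd.DataFrame({
    "Country": ["Saint-Helena", "Montserrat"],
    "Date": ["2024-10-11", "2024-10-11"],
    "Total Cases": [2166, 1403],
    "Total Deaths": [None, 8.0],
    "Time": ["2024-10-11T03:45:27+00:00", "2024-10-11T03:45:27+00:00"],
})

cleaned_df = sample_df.copy()

# Convert the date and timestamp strings into proper datetime columns.
cleaned_df["Date"] = pd.to_datetime(cleaned_df["Date"])
cleaned_df["Time"] = pd.to_datetime(cleaned_df["Time"], utc=True)

# Store counts as nullable integers so that missing values stay missing
# instead of forcing the whole column to float.
cleaned_df["Total Deaths"] = cleaned_df["Total Deaths"].astype("Int64")

print(cleaned_df.dtypes)
```

The same conversions could be applied to the corresponding columns of `statistics_df` before saving it to CSV.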
