## HW01: DATA COLLECTING

This is Assignment 01 for the course "Introduction to Data Science" at the Faculty of Information Technology, University of Science, Vietnam National University, Ho Chi Minh City.

(Latest update: 09/08/2024)

Student Name: Võ Nguyễn Phương Quỳnh

Student ID: 22127360

## **Assignment Objectives**

By completing this assignment, students will achieve the following objectives:
- Understand HTML parsing
- Crawl data via APIs
- Construct DataFrames
- Handle errors and write robust code

## **How to Complete and Submit the Assignment**

&#9889; **Note**: You should follow the instructions below. If anything is unclear, you need to contact the teaching assistant or instructor immediately for timely support.

**How to Do the Assignment**

You will work directly on this notebook file. First, fill in your full name and student ID (MSSV) in the header section of the file above. In the file, complete the tasks in sections marked:
```python
# YOUR CODE HERE
raise NotImplementedError()
```
Or for optional code sections:
```python
# YOUR CODE HERE (OPTION)
```
For markdown cells, complete the answer in the section marked:
```markdown
YOUR ANSWER HERE
```

**How to Submit the Assignment**

Before submitting, select `Kernel` -> `Restart Kernel & Run All Cells` if you are using a local environment, or `Runtime -> Restart session` and run all if using Google Colab, to ensure everything works as expected.

Next, create a submission folder with the following structure:
- Folder named `MSSV` (for example, if your student ID is `1234567`, name the folder `1234567`)
    - File `HW01.ipynb` (no need to submit other files)

Finally, compress this `MSSV` folder in `.zip` format (not `.rar` or any other format) and submit it via the link on Moodle.\
<font color=red>Please make sure to strictly follow this submission guideline.</font>
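
The packaging steps above can also be scripted. The following is a minimal sketch, assuming the notebook sits in the working directory; the helper name `make_submission` and the student ID `1234567` are illustrative only and not part of the assignment.

```python
import shutil
from pathlib import Path


def make_submission(student_id: str, notebook: str = "HW01.ipynb", workdir: str = ".") -> Path:
    """Copy the notebook into a folder named after the student ID and zip that folder."""
    root = Path(workdir).resolve()
    folder = root / student_id
    folder.mkdir(exist_ok=True)
    shutil.copy(root / notebook, folder / notebook)

    # root_dir/base_dir make the archive contain "<student_id>/HW01.ipynb"
    archive = shutil.make_archive(
        str(root / student_id), "zip", root_dir=str(root), base_dir=student_id
    )
    return Path(archive)


# Example call (hypothetical student ID):
# make_submission("1234567")
```

The archive is written next to the `MSSV` folder, so the `.zip` file itself is never included in its own contents.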

---
# 1. Import

In [3]:
# Necessary Packages
import time
import requests
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

# YOUR CODE HERE (OPTION)
# If you need other support packages

---
# 2. Collect data from a website by parsing HTML (3 points)

In this section, you are tasked with collecting data from a website that simulates the sale of Pokémon. We have provided all the necessary URLs in a file called pokemon.txt, which contains a list of links you need to crawl for data. Each link corresponds to a page with details about different Pokémon for sale.

Your goal is to extract specific data from each of these pages and compile it into a structured format.

**Expected Output:** You need to create a `DataFrame` containing the following columns for each Pokémon:
- `SKU`: The unique identifier (ID) for each Pokémon.
- `Name`: The name of the Pokémon.
- `Price`: The sale price of the Pokémon.
- `InStock`: The quantity of this Pokémon currently available in stock.
- `Categories`: The category under which the Pokémon is listed (e.g., type, special edition).
- `Tags`: Any additional tags or attributes assigned to the Pokémon (e.g., rare, legendary, etc.).

**What You Need to Do**:
- Implement the `collect_data` function, which takes one input parameter: `course_urls_file`. This file contains all the URLs you need to process, each on a new line.
- For each URL in `pokemon.txt`, crawl the webpage and extract the relevant information for all the Pokémon listed on that page.
- The data you collect should be organized into a pandas `DataFrame` with the six specified columns (`SKU`, `Name`, `Price`, `InStock`, `Categories`, and `Tags`).
- The output DataFrame should resemble the structure of the provided sample file pokemon_example.csv. This sample contains a few examples to help you visualize the expected result format.

**Notes:**
- **Price Format**: Ensure the price is captured as a numerical value. If there are currency symbols such as "$" or "£", remove them before storing the price.
- **In Stock**: If a Pokémon is out of stock, mark its quantity as 0 in the InStock field.
- **Categories & Tags**: Some Pokémon might belong to multiple categories or have multiple tags. Make sure to capture all relevant information and store them as lists or comma-separated strings.
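
The notes above can be sketched as three small text-cleaning helpers. This is only an illustration, assuming the raw strings look like `£63.00`, `45 in stock`, and `Categories: Pokemon, Seed`; the function names `parse_price`, `parse_stock`, and `parse_labels` are hypothetical and not part of the assignment template.

```python
import re
from typing import Optional


def parse_price(text: str) -> float:
    """Drop currency symbols and thousands separators, then convert to float."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"No numeric price found in {text!r}")
    return float(match.group().replace(",", ""))


def parse_stock(text: Optional[str]) -> int:
    """Return the leading quantity from text such as '45 in stock'; 0 if there is none."""
    if not text:
        return 0
    match = re.match(r"\s*(\d+)", text)
    return int(match.group(1)) if match else 0


def parse_labels(text: str) -> str:
    """Turn 'Categories: Pokemon, Seed' into the comma-separated string 'Pokemon, Seed'."""
    head, sep, tail = text.partition(":")
    values = tail if sep else head  # no label prefix: treat the whole text as values
    return ", ".join(part.strip() for part in values.split(",") if part.strip())
```

Returning `0` when the stock text is missing or has no leading number keeps out-of-stock Pokémon in the table instead of raising an error.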

In [14]:
pokemon_example = pd.read_csv('./assets/pokemon_example.csv')
pokemon_example

Unnamed: 0,SKU,Name,Price,InStock,Categories,Tags
0,4391,Bulbasaur,63.0,45,"Pokemon, Seed","bulbasaur, Overgrow, Seed"
1,7227,Ivysaur,87.0,142,"Pokemon, Seed","ivysaur, Overgrow, Seed"
2,7036,Venusaur,105.0,30,"Pokemon, Seed","Overgrow, Seed, venusaur"
3,9086,Charmander,48.0,206,"Lizard, Pokemon","Blaze, charmander, Lizard"
4,6565,Charmeleon,165.0,284,"Flame, Pokemon","Blaze, charmeleon, Flame"


In [4]:
def collect_data(course_urls_file: str) -> pd.DataFrame:
    """
    Collect data from a list of URLs provided in a file.

    Parameters
    ----------
    course_urls_file : str
        Path to the file containing URLs (e.g., 'Lab01/pokemon.txt').

    Returns
    -------
    pd.DataFrame
        A DataFrame containing the required data fields.

    Raises
    ------
    NotImplementedError
        If the function is not implemented yet.
    """
    # Load URLs from file
    with open(course_urls_file, 'r') as file:
        urls = [line.strip() for line in file]

    # Initialize empty lists to store the values of each attribute
    sku, names, prices, in_stocks, categories, tags = [], [], [], [], [], []

    # YOUR CODE HERE
    for url in urls:
        if not url:
            continue  # skip blank lines in the URL file

        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')

        sku_tag = soup.find(attrs={"class": "sku"})
        name_tag = soup.find(attrs={"class": "product_title entry-title"})
        price_tag = soup.find(attrs={"class": "price"})
        stock_tag = soup.find(attrs={"class": "stock in-stock"})
        category_tag = soup.find(attrs={"class": "posted_in"})
        tag_tag = soup.find(attrs={"class": "tagged_as"})

        sku.append(sku_tag.get_text())
        names.append(name_tag.get_text())
        # Store the price as a number, without the currency symbol
        prices.append(float(price_tag.get_text().split("£")[1].replace(",", "")))
        # A missing "in-stock" element means the Pokémon is out of stock
        in_stocks.append(int(stock_tag.get_text().split(" ")[0]) if stock_tag else 0)
        categories.append(category_tag.get_text().split(": ")[1])
        tags.append(tag_tag.get_text().split(": ")[1])
        
    # Create DataFrame with collected data
    data = pd.DataFrame({
        "SKU": sku,
        "Name": names,
        "Price": prices,
        "InStock": in_stocks,
        "Categories": categories,
        "Tags": tags
    })

    return data

In [5]:
# TEST
data_pokemon = collect_data("./assets/pokemon.txt")
assert data_pokemon.shape == (755, 6), f"Expected shape (755, 6), got {data_pokemon.shape}"

In [7]:
# Save to csv file with name pokemon.csv to grade
data_pokemon.to_csv("student_pokemon.csv", sep=',', encoding='utf-8', index=False, header=True)

---
# 3. Collect data using Web API (4 points)

In this section, your task is to practice web data crawling using the World Bank API. You will be working with demographic and statistical data provided by the World Bank, covering population, education, health, and other key indicators for all countries from 1960 to 2022.

**Selected Indicators:** You will collect data for the following indicators
- `SP.POP.TOTL` – Total Population
- `SP.POP.TOTL.FE.IN` – Total Female Population
- `SP.POP.TOTL.MA.IN` – Total Male Population
- `SP.DYN.CBRT.IN` – Birth Rate
- `SP.DYN.CDRT.IN` – Death Rate
- `SP.DYN.LE00.MA.IN` – Average Life Expectancy (Male)
- `SP.DYN.LE00.FE.IN` – Average Life Expectancy (Female)
- `SE.PRM.ENRR` – Primary School Enrollment Rate
- `SE.TER.ENRR` – Tertiary School Enrollment Rate
- `SE.PRM.CMPT.ZS` – Primary School Completion Rate
- `SE.ADT.1524.LT.ZS` – Literacy Rate (Ages 15-24)

**Countries of Interest**: You are required to collect data for the following 7 countries:
- United States of America (US)
- India (IN)
- China (CN)
- Japan (JP)
- Canada (CA)
- Great Britain (GB)
- South Africa (ZA)

**Task Overview:** You are to implement a data collection function that queries the World Bank API for the specified indicators across these 7 countries. The data will be collected for each year from 1960 to 2022 and stored in a pandas `DataFrame` named `data_countries`.
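
One way to organize the collected values is to merge each indicator's records into rows keyed by country and year. The sketch below assumes the JSON output of the API (`format=json`), which returns a two-element list of paging metadata and records. The payload is hand-written with placeholder numbers, not real World Bank figures, and the helper name `add_records` is hypothetical.

```python
import json
from collections import defaultdict

# Hand-written payload in the assumed shape of a `format=json` response.
# The values are placeholders, not real World Bank data.
SAMPLE = json.dumps([
    {"page": 1, "pages": 1, "per_page": 50, "total": 3},
    [
        {"indicator": {"id": "SP.POP.TOTL"}, "countryiso3code": "JPN", "date": "2022", "value": 200.0},
        {"indicator": {"id": "SP.POP.TOTL"}, "countryiso3code": "JPN", "date": "2021", "value": None},
        {"indicator": {"id": "SP.POP.TOTL"}, "countryiso3code": "JPN", "date": "1999", "value": 100.0},
    ],
])


def add_records(rows: dict, payload: str, start_year: int, end_year: int) -> None:
    """Merge one indicator's records into rows keyed by (country, year)."""
    body = json.loads(payload)
    # An error response or an empty result has no record list in position 1
    if len(body) < 2 or body[1] is None:
        raise ValueError("No data in response")
    for record in body[1]:
        year = int(record["date"])
        if start_year <= year <= end_year:
            key = (record["countryiso3code"], year)
            rows[key][record["indicator"]["id"]] = record["value"]


rows = defaultdict(dict)
add_records(rows, SAMPLE, start_year=2000, end_year=2022)
```

Calling `add_records` once per indicator fills in one column at a time. Missing values stay as `None`, which pandas converts to `NaN` when the rows are turned into a `DataFrame`.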


You can expand your work on collecting data (such as collecting data from other countries and other indicators) by reading: https://datahelpdesk.worldbank.org/knowledgebase/articles/889392-api-documentation

**Hints**:

- Use the base URL: http://api.worldbank.org/v2/
- In order to collect data for each indicator of each country, you can use the URL: "http://api.worldbank.org/v2/countries/{country_code}/indicators/{indicator_code}"
    + `country_code` and `indicator_code` are provided above.
    + For example, you can use the following URL to get the `Total population` of Japan: http://api.worldbank.org/v2/countries/jp/indicators/SP.POP.TOTL
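
The URL pattern in the hints can be built programmatically. The sketch below also adds the `date`, `per_page`, and `format` query parameters described in the API documentation linked above; the helper name `build_indicator_url` and its default page size are assumptions made for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.worldbank.org/v2/"


def build_indicator_url(country_code: str, indicator_code: str,
                        start_year: int, end_year: int, per_page: int = 100) -> str:
    """Build an indicator URL with a year range, a page size, and JSON output."""
    query = urlencode(
        {
            "date": f"{start_year}:{end_year}",  # year range, both ends included
            "per_page": per_page,                # number of records per page
            "format": "json",                    # the API returns XML by default
        },
        safe=":",  # keep the colon in the year range readable
    )
    return f"{BASE_URL}country/{country_code.lower()}/indicator/{indicator_code}?{query}"


print(build_indicator_url("JP", "SP.POP.TOTL", 2000, 2022))
```

Restricting the year range in the request keeps each response on a single page as long as `per_page` is at least the number of requested years.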

In [15]:
data_countries_examples = pd.read_csv("./assets/countries_example.csv")
data_countries_examples

Unnamed: 0,Total Population,Female Population,Male Population,Birth Rate,Death Rate,Male life expectancy,Female life expectancy,"School enrollment, primary","School enrollment, tertiary",Primary completion rate,Literacy rate,Year,Country
0,333287557.0,168266219.0,165021339.0,,,,,,,,,2022,USA
1,332031554.0,167550001.0,164481553.0,11.0,10.4,73.5,79.3,,,,,2021,USA
2,331511512.0,167203010.0,164308503.0,10.9,10.3,74.2,79.9,100.305793762207,87.5676574707031,100.923667907715,,2020,USA
3,328329953.0,165599805.0,162730147.0,11.4,8.7,76.3,81.4,100.981300354004,87.8887100219727,100.489051818848,,2019,USA
4,326838199.0,164926348.0,161911851.0,11.6,8.678,76.2,81.2,101.256561279297,88.2991790771484,100.092697143555,,2018,USA
5,325122128.0,164151818.0,160970309.0,11.8,8.638,76.1,81.1,101.821441650391,88.1673889160156,98.8321990966797,,2017,USA
6,323071755.0,163224028.0,159847727.0,12.2,8.493,76.1,81.1,101.362861633301,88.8350524902344,,,2016,USA
7,320738994.0,162158414.0,158580581.0,12.4,8.44,76.3,81.2,100.299911499023,88.8894119262695,,,2015,USA
8,318386329.0,161084758.0,157301571.0,12.5,8.237,76.5,81.3,99.6733779907227,88.6268692016602,,,2014,USA
9,316059947.0,160034189.0,156025758.0,12.4,8.215,76.4,81.2,99.455436706543,88.7264175415039,,,2013,USA


In [9]:
BASE_URL = "https://api.worldbank.org/v2/"
COUNTRIES = ["US", "IN", "CN", "JP", "CA", "GB", "ZA"]
INDICATORS = [
    "SP.POP.TOTL",
    "SP.POP.TOTL.FE.IN",
    "SP.POP.TOTL.MA.IN",
    "SP.DYN.CBRT.IN",
    "SP.DYN.CDRT.IN",
    "SP.DYN.LE00.MA.IN",
    "SP.DYN.LE00.FE.IN",
    "SE.PRM.ENRR",
    "SE.TER.ENRR",
    "SE.PRM.CMPT.ZS",
    "SE.ADT.1524.LT.ZS",
]

# YOUR CODE HERE (option)
# If you need other initializations
countryMap = {
    "us": "United States",
    "in": "India",
    "cn": "China",
    "jp": "Japan",
    "ca": "Canada",
    "gb": "Great Britain",
    "za": "South Africa",
}

featureMap = {
    "SP.POP.TOTL": "Total Population",
    "SP.POP.TOTL.FE.IN": "Female Population",
    "SP.POP.TOTL.MA.IN": "Male Population",
    "SP.DYN.CBRT.IN": "Birth Rate",
    "SP.DYN.CDRT.IN": "Death Rate",
    "SP.DYN.LE00.MA.IN": "Life Expectancy Male",
    "SP.DYN.LE00.FE.IN": "Life Expectancy Female",
    "SE.PRM.ENRR": "Primary School Enrollment",
    "SE.TER.ENRR": "Tertiary School Enrollment",
    "SE.PRM.CMPT.ZS": "Primary Completion Rate",
    "SE.ADT.1524.LT.ZS": "Literacy rate"
}

In [8]:
def collect_data(country_code: str, per_page: int, start_year: int, end_year: int, max_retries: int = 3, delay: int = 2) -> pd.DataFrame:
    """
    Collect data from the World Bank API for a specific country and date range.

    Parameters
    ----------
    country_code : str
        The ISO 3166-1 alpha-2 code for the country (e.g., 'US' for the United States).
    per_page : int
        Number of records to return per page.
    start_year : int
        The starting year of the data range.
    end_year : int
        The ending year of the data range.
    max_retries : int, optional
        Maximum number of retries for API requests in case of server errors (default is 3).
    delay : int, optional
        Delay (in seconds) between retries (default is 2).

    Returns
    -------
    pd.DataFrame
        A DataFrame containing the data collected from the API.

    Raises
    ------
    ValueError
        If the API request fails or if the data is not available.
    """
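    # YOUR CODE HERE
    columns = {}  # indicator code -> list of values, one per year
    countries, years = [], []
    # Ask the API for the requested page size and year range
    params = {"per_page": per_page, "date": f"{start_year}:{end_year}"}

    for indicator in INDICATORS:
        url = f"{BASE_URL}country/{country_code}/indicator/{indicator}"

        # Retry failed requests, waiting `delay` seconds between attempts
        response = None
        for attempt in range(max_retries):
            try:
                response = requests.get(url, params=params)
                response.raise_for_status()
                break
            except requests.RequestException:
                response = None
                time.sleep(delay)
        if response is None:
            raise ValueError(f"Request failed for {url}")

        # The response is XML; the lxml parser lowercases tag names
        soup = BeautifulSoup(response.content, 'lxml')
        countries_xml = soup.find_all('wb:country')
        dates_xml = soup.find_all('wb:date')
        values_xml = soup.find_all('wb:value')

        data = []
        for i in range(len(values_xml)):
            year = int(dates_xml[i].get_text())
            # Keep only the years in the requested range
            if year < start_year or year > end_year:
                continue

            value = values_xml[i].get_text()
            data.append(float(value) if value != '' else None)

            # Year and country are the same for every indicator; record them once
            if not columns:
                countries.append(countries_xml[i].get_text())
                years.append(year)

        columns[indicator] = data

    df = pd.DataFrame(columns)
    df['Year'] = years
    df['Country'] = countries
    df.columns = [featureMap[col] for col in INDICATORS] + ['Year', 'Country']

    return df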

In [10]:
def generate_countries_dataset(country_code_list: list) -> pd.DataFrame:
    """
    Generate a dataset by collecting data for a list of countries.

    Parameters
    ----------
    country_code_list : list of str
        List of ISO 3166-1 alpha-2 country codes (e.g., ['US', 'IN', 'CN']).

    Returns
    -------
    pd.DataFrame
        A DataFrame containing the combined data for all countries.

    Raises
    ------
    ValueError
        If data collection for any country fails.
    """
    # Initialize an empty list to store DataFrames
    data_frames = []

    for country_code in country_code_list:
        try:
            # Collect data for each country and append the result to the list
            df = collect_data(country_code=country_code,
                              per_page=100, start_year=2000, end_year=2022)
            data_frames.append(df)
        except ValueError as e:
            print(f"Error collecting data for {country_code}: {e}")

    # Concatenate all collected DataFrames into a single DataFrame
    combined_data = pd.concat(data_frames, ignore_index=True)

    return combined_data

In [11]:
# TEST
data_countries = generate_countries_dataset(COUNTRIES)
assert data_countries.shape == (161, 13)

In [12]:
# Save to csv file with name countries.csv to grade
data_countries.to_csv("student_countries.csv", sep=',', encoding='utf-8', index=False, header=True)

---
# 4. Crawl data from Springer Journal (3 points)

As a Computer Science student conducting research on a specific topic, it's essential to read papers from academic journals.

In this assignment, you will work with the journal SN Computer Science. Your task is to crawl and extract detailed information for every paper published in this journal. You can find an overview of the journal on the [main page](https://link.springer.com/journal/42979), which states:
```
SN Computer Science is a broad-based, hybrid, peer reviewed journal that publishes original research in all the disciplines of computer science including various inter-disciplinary aspects. The journal aims to be a global forum of, for, and by the community and offers:
```

Using your previous code as a foundation, you need to create a table (dataframe) that includes the following information for each paper:
- `Title`: The title of the paper
- `Date`: The publication date
- `Link`: The URL to the paper
- `Authors`: The authors of the paper
- `Affiliations`: The affiliations of each author
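
The `Date` values in the example output are plain strings such as `20 September 2024`. If sortable dates are wanted, they can be converted as in the sketch below. The helper name `parse_published` is hypothetical, and the month names are assumed to be in English under the default locale.

```python
from datetime import date, datetime
from typing import Optional


def parse_published(text: Optional[str]) -> Optional[date]:
    """Parse 'Published: 20 September 2024' or '20 September 2024' into a date."""
    if not text:
        return None
    # Drop an optional 'Published:' label before parsing
    cleaned = text.split(":", 1)[-1].strip()
    try:
        return datetime.strptime(cleaned, "%d %B %Y").date()
    except ValueError:
        return None  # unexpected format: leave the date empty
```

Returning `None` for unexpected text keeps one malformed page from stopping the whole crawl.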



Here is an example of the expected output (dataframe):

In [13]:
data_journal_examples = pd.read_csv("./assets/journal_example.csv")
data_journal_examples

Unnamed: 0,Title,Date,Link,Authors,Affiliations
0,An Upgraded Approach for Identifying Partially...,20 September 2024,https://link.springer.com/article/10.1007/s429...,"Barman, Abhijit, Saha, Diganta, Pal, Alok Ranjan",Department of Computer Science and Engineering...
1,A Comprehensive Review on Deep Learning Techni...,20 September 2024,https://link.springer.com/article/10.1007/s429...,"Sagar, Maloth, Vanmathi, C.",School of Computer Science Engineering and Inf...
2,Hybrid Deep Learning Approach with Feature Eng...,20 September 2024,https://link.springer.com/article/10.1007/s429...,"Bouamrane, Amira, Derdour, Makhlouf, Alksas, A...","LIAOA Laboratory, University of Oum El-Bouaghi..."
3,Roberta and BERT: Revolutionizing Mental Healt...,19 September 2024,https://link.springer.com/article/10.1007/s429...,"Chopra, Sonali, Agarwal, Parul, Ahmed, Jawed, ...","Jamia Hamdard, New Delhi, India, University of..."
4,An Intelligent Image Encryption Scheme Based o...,17 September 2024,https://link.springer.com/article/10.1007/s429...,"Dutta, Toshika, Gupta, Manish",Department of Computer Science and Engineering...
5,A Systematic Review on Federated Learning in E...,17 September 2024,https://link.springer.com/article/10.1007/s429...,"Mishra, Sambit Kumar, Sahoo, Subham Kumar, Swa...",Department of Computer Science and Engineering...
6,Park-Net: A Deep Model for Early Detection of ...,17 September 2024,https://link.springer.com/article/10.1007/s429...,"Bennour, Akram, Mekhaznia, Tahar","LAMIS Laboratory, Echahid Cheikh Larbi Tebessi..."
7,Leveraging Deep Embedding Models for Arabic Te...,17 September 2024,https://link.springer.com/article/10.1007/s429...,"Ellouze, Samira, Jaoua, Maher","ANLP Research Group, MIRACL Lab., ISIM Gabes, ..."
8,Bioanalytical Method Development and Validatio...,17 September 2024,https://link.springer.com/article/10.1007/s429...,"Tallam, Anil Kumar, Reddy, Konatham Teja Kumar...","Department of Pharmacy, Shri Venkateshwara Uni..."


Complete this task by writing code that scrapes and organizes this information into a structured format.
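
A crawl of this kind usually has two stages: list the article URLs from the paginated journal listing, then visit each article page. The sketch below covers only the URL handling of the first stage, assuming the listing accepts a `page` query parameter. The helper names and the DOI suffixes in the usage example are placeholders, not real articles.

```python
from urllib.parse import urljoin

LISTING_URL = "https://link.springer.com/journal/42979/articles"


def listing_page_urls(last_page: int) -> list:
    """Return the URL of every listing page from 1 to last_page inclusive."""
    return [f"{LISTING_URL}?page={page}" for page in range(1, last_page + 1)]


def absolute_links(page_url: str, hrefs: list) -> list:
    """Resolve hrefs against the page URL and drop duplicates, keeping order."""
    seen = {}
    for href in hrefs:
        seen.setdefault(urljoin(page_url, href), None)
    return list(seen)


# Usage with placeholder hrefs (not real article paths)
pages = listing_page_urls(2)
links = absolute_links(pages[0], ["/article/10.1007/x", "/article/10.1007/y", "/article/10.1007/x"])
```

Resolving every `href` with `urljoin` means the second stage receives full URLs whether the listing page uses relative or absolute links.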

In [1]:
# YOUR CODE HERE
def url_gennerate(): 
    """Write the URL of every article on the journal's listing pages (1 to 65) to article.txt."""
    BASE_URL = "https://link.springer.com/journal/42979/articles?filterOpenAccess=false&page="
    
    with open("article.txt", "w") as file:
        for page in range(1, 66):
            url = f"{BASE_URL}{page}" 
            response = requests.get(url)
            soup = BeautifulSoup(response.content, 'html.parser')
            # Each article card on a listing page has an <h3> heading that wraps its link
            heading_tags = soup.select('h3.app-card-open__heading')
            for heading in heading_tags:
                # Find the <a> tag within the current <h3> tag
                link_tag = heading.find('a')
                
                # If the <a> tag exists and has an href, write the URL on its own line
                if link_tag and 'href' in link_tag.attrs:
                    file.write(f"{link_tag['href']}\n")

In [None]:
def collect_journal_data(course_urls_file: str) -> pd.DataFrame:
    # Load URLs from file
    with open(course_urls_file, 'r') as file:
        urls = [line.strip() for line in file]

    # Initialize empty lists to store the values of each attribute
    titles, dates, links, authors, affiliations = [], [], [], [], []
    
    for url in urls:
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')

        # Title
        title = soup.find(attrs={"class": "c-article-title"})
        titles.append(title.get_text() if title else None)

        # Publication date: scan the identifier items for the one labelled "Published:"
        date = soup.find(attrs={"class": "c-article-identifiers__item"})
        while date and "Published:" not in date.get_text():
            date = date.find_next_sibling()
        dates.append(date.get_text().split(": ")[1] if date else None)

        links.append(url)

        # Authors, joined into one comma-separated string
        author_list = soup.find(attrs={"class": "c-article-author-list"})
        if author_list:
            author_links = author_list.find_all('a', attrs={"data-test": "author-name"})
            authors.append(', '.join(link.get_text() for link in author_links))
        else:
            authors.append(None)

        # Affiliations, joined into one comma-separated string
        affiliation_list = soup.find(attrs={"class": "c-article-author-affiliation__list"})
        if affiliation_list:
            addresses = affiliation_list.find_all('p', attrs={"class": "c-article-author-affiliation__address"})
            affiliations.append(', '.join(aff.get_text() for aff in addresses))
        else:
            affiliations.append(None)
        
    # Create DataFrame with collected data
    data = pd.DataFrame({
        "Title": titles,
        "Date": dates,
        "Link": links,
        "Authors": authors,
        "Affiliations": affiliations
    })

    return data

In [6]:
# Parse the articles' URLs into a txt file
url_gennerate()

In [6]:
# run the function to collect data
data_journal = collect_journal_data("article.txt")

In [7]:
# Save to csv file with name journal.csv to grade
data_journal.to_csv("student_journal.csv", sep=',', encoding='utf-8', index=False, header=True)