# DS 3000 HW 5 

Due: Sunday July 20th @ 11:59 PM EST

### Submission Instructions
Submit this `ipynb` file and a `PDF` file that includes the code output to Gradescope (this can also be done via the assignment on Canvas). To ensure that your submitted files reflect your latest code, run a fresh `Kernel > Restart & Run All` just before uploading the files to Gradescope.

**Notice that this is a group assignment. Each group only needs to submit one copy, and when you submit the work, please include everyone in your group.**

### Tips for success
- Start early
- Make use of Piazza
- Make use of office hours
- Remember to use cells and headings to make the notebook easy to read (if a grader cannot find the answer to a problem, you will receive no points for it)
- Under no circumstances may one student view or share their ungraded homework or quiz with another student [(see also)](http://www.northeastern.edu/osccr/academic-integrity), though you are welcome to **talk about** (not show each other) the problems.

## Project proposal

For this course, we aim to complete a data analysis project about the game [Palworld](https://en.wikipedia.org/wiki/Palworld). To help you start the project, here are a couple of things you need to consider and work on to get a clean dataset for later analysis.

To start the project, please take some time to get familiar with the game. You don't need to play it, but please at least know the basic terminology, such as what a Pal is. (Also, if you do play it, please do not spend too much time on it.)

The two recommended databases are [https://palworld.gg/](https://palworld.gg/) and [https://paldb.cc/en/](https://paldb.cc/en/). You can use either, both, or some other Palworld database.

### Part 1.1 (10 points)

1. Are a Pal's work suitability scores (like Kindling, Planting, etc.) related to their elemental type? For example, do Fire-type Pals tend to have higher Kindling scores?

2. Which features (like element type, size, rarity) may affect a Pal's base power level and stats (HP, Attack, etc.)?

3. Based on work suitability scores and element types, which Pals are most similar in terms of their utility for base building and resource gathering?


### Part 1.2 (20 points)

Based on the questions proposed in Part 1.1, what features might we need to include in the analysis? Check the websites: which website has that information? **You need to pick at least 8 features for analysis.** We recommend a mix of numerical (numbers etc.) and categorical (level etc.) features. Are there any other features that you think may be important but are hard to extract or find on the website (they can be something in or outside the game)?

1. Number of Work Suitabilities: The number of different work types a Pal can perform (e.g., 1, 2, 3). This measures versatility and can be correlated with rarity or element type to determine if more versatile Pals are also rarer or stronger.
2. Pal ID Number: A numerical identifier for each Pal (e.g., #100). Useful for indexing and dataset integrity, though not for direct analysis.
3. Element Type: The elemental type(s) of the Pal (e.g., Fire, Earth, Dark, etc.). This is crucial for understanding how elemental types relate to work suitability and combat stats.
4. Rarity: The rarity tier of the Pal (e.g., Common, Rare, Epic, Legendary). Can be used to analyze trends in strength, utility, or drop rates.
5. Work Suitability Types: The kinds of work the Pal is suitable for (e.g., Handiwork, Mining, Transporting). These can help us study which work categories are most associated with certain elements or rarities.
6. Work Suitability Levels: Numerical levels indicating how effective a Pal is at a certain task (e.g., Lv 4 Handiwork). Useful for quantitative comparisons.
7. HP: The base health points of the Pal. Important for combat-related analysis and can be compared against element types and rarity.
8. Defense: The Pal’s defensive stat. Like HP, this can be used to explore if stronger defensive Pals trend toward certain types or rarities.
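
Feature 1 can be derived from feature 5 rather than scraped separately. The following is a minimal sketch, assuming the work suitabilities are already stored as a dictionary mapping Pal names to lists of work types. The helper name `count_work_suitabilities` is hypothetical, and the sample entries are copied from the scraped output shown in Part 1.4.

```python
def count_work_suitabilities(work):
    """Return a dict mapping each Pal name to its number of work types."""
    return {name: len(jobs) for name, jobs in work.items()}


# Sample entries copied from the scraped work suitability output.
sample_work = {
    "Anubis": ["Handiwork", "Mining", "Transporting"],
    "Arsox": ["Deforesting", "Kindling"],
    "Azurobe": ["Watering"],
}

counts = count_work_suitabilities(sample_work)
print(counts)
```

Keeping the count as a derived column means it cannot drift out of sync with the list of work types it is computed from.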

### Part 1.3 (20 points)

Suppose you do have all the features you mentioned in Part 1.2. List 3-4 data visualizations you can make with those features. You do not need to make those visualizations here. Just describe the type of each visualization (histogram, scatter plot, etc.), which features are involved, whether any hover data or color will be added, and **discuss how these data visualizations may relate to (or even answer) your questions in Part 1.1**.

1. Scatter Plot – Work Suitability Level vs. HP
   - X-axis: Highest Work Suitability Level
   - Y-axis: HP
   - Color: Element Type
   - Hover Info: Pal ID, Rarity, Work Types
   - Why: To see if there’s a tradeoff between combat readiness (HP) and labor value (work level). Helps address question 1: Do elemental types affect work suitability scores?
2. Grouped Bar Chart – Average Work Suitability Level by Element Type
   - X-axis: Element Types
   - Y-axis: Avg Work Level (split by work type: Mining, Transport, etc.)
   - Color: Work Type
   - Why: This directly addresses questions 1 and 3 by showing whether, say, Fire-types are better at Kindling or Water-types are better at Cooling.
3. Heatmap – Rarity vs. Number of Work Suitabilities
   - Axes: Rarity (Y) × Number of Work Suitabilities (X)
   - Color: Frequency (how many Pals fall into each combination)
   - Why: Helps answer: Are rarer Pals more versatile? This uses feature 1 (Number of Work Suitabilities) and addresses part of question 2.
4. Boxplot – HP and Defense Distribution by Rarity
   - X-axis: Rarity
   - Y-axis: HP or Defense
   - Color: Element Type (optional)
   - Why: To examine the relationship between rarity and battle strength. This supports question 2: How do different features (like rarity or element) affect base stats?
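
The frequencies behind the heatmap in item 3 can be tallied before any plotting. The following is a rough sketch, assuming rarity and work suitabilities are available as dictionaries keyed by Pal name. The function name `rarity_versatility_counts` is hypothetical, and the sample entries are copied from the scraped output in Part 1.4.

```python
from collections import Counter


def rarity_versatility_counts(rarity, work):
    """Count Pals for each (rarity, number of work suitabilities) pair."""
    counts = Counter()
    for name, tier in rarity.items():
        if name in work:  # skip Pals missing from the work dictionary
            counts[(tier, len(work[name]))] += 1
    return counts


# Sample entries copied from the scraped rarity and work output.
sample_rarity = {
    "Anubis": "Epic",
    "Arsox": "Common",
    "Azurobe": "Rare",
    "Bastigor": "Epic",
}
sample_work = {
    "Anubis": ["Handiwork", "Mining", "Transporting"],
    "Arsox": ["Deforesting", "Kindling"],
    "Azurobe": ["Watering"],
    "Bastigor": ["Cooling", "Deforesting", "Mining"],
}

heat = rarity_versatility_counts(sample_rarity, sample_work)
for (tier, n_jobs), n_pals in sorted(heat.items()):
    print(tier, n_jobs, n_pals)
```

Each key is one heatmap cell and each value is that cell's color intensity. Combinations that never occur are simply absent from the counter and read as 0.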

### Part 1.4  (50 points)

Now, go ahead and try to scrape the features you need. 

Please show all the code you have for web scraping. Your current output data frame should include at least 4 features. (You do not need to scrape all features at this moment, although it is recommended to start early. Also, you can choose not to use the features you have scraped in the later analysis. No need to worry if you need to change anything later.) **Please design your code as a pipeline and clearly document each function.** See the Python Style Guide in Week 1 for proper documentation. It is also recommended to save the data you have scraped.

Note: The code below is a template, and you'll need to adjust the class names and HTML structure according to the actual website you're scraping (palworld.gg or paldb.cc). The key is to get data for all Pals at once to enable meaningful analysis later.
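
One possible way to tie the per-feature scrapers into a pipeline is to merge their dictionaries into one row per Pal and save the result. The following is a minimal sketch using only the standard library. The helpers `merge_features` and `rows_to_csv` are hypothetical names, and the sample dictionaries stand in for the scrapers' return values.

```python
import csv
import io


def merge_features(**features):
    """Combine per-feature dicts (Pal name -> value) into one row per Pal."""
    names = sorted(set().union(*features.values()))
    rows = []
    for name in names:
        row = {"name": name}
        for feature, values in features.items():
            row[feature] = values.get(name)  # None when a Pal is missing
        rows.append(row)
    return rows


def rows_to_csv(rows, handle):
    """Write the merged rows to an open text file handle as CSV."""
    if not rows:
        return
    writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    for row in rows:
        # Join list-valued features (e.g. several work types) into one cell.
        writer.writerow({
            key: "; ".join(value) if isinstance(value, list) else value
            for key, value in row.items()
        })


# Stand-ins for the dictionaries returned by the scraping functions.
rarity = {"Anubis": "Epic", "Arsox": "Common"}
work = {
    "Anubis": ["Handiwork", "Mining", "Transporting"],
    "Arsox": ["Deforesting", "Kindling"],
}

rows = merge_features(rarity=rarity, work=work)
buffer = io.StringIO()  # an open file from open("pals.csv", "w", newline="") also works
rows_to_csv(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

The list of row dictionaries can also be passed directly to `pd.DataFrame(rows)` to get the data frame the assignment asks for.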


In [29]:
# scrape palID
import requests
from bs4 import BeautifulSoup
import pandas as pd
import seaborn as sns
import plotly.express as px

def scrape_pal_ids(url):
    """
    Scrape Pal IDs from the given URL.

    Args:
        url (str): The URL of the Palworld database page.

    Returns:
        list: A list of Pal IDs.
    """
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find all Pal ID elements
    pal_id_elements = soup.find_all('span', class_='index')  # Adjust class name as needed
    pal_ids = [element.text.strip() for element in pal_id_elements]

    return pal_ids


In [None]:
# Scrape the rarity of all the Pals
def scrape_pal_rarity(url = "https://palworld.gg/pals"):
    """
    Scrape the rarity of each Pal listed in the database.

    Args:
        url (str): The URL of the database page, assumed to be the current link unless otherwise specified.

    Returns:
        rarity (dict): A dictionary with Pal names as keys and their rarities as values.
    """
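    # Name the parser explicitly so results do not depend on which parsers are installed.
    soup = BeautifulSoup(requests.get(url).text, "html.parser")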

    # Find all Pals, then find each of their rarities.
    pal_tag = soup.find_all("div", class_ = "pal")
    rarity = {}
    for pal in pal_tag:

        # Remove empty Pal entries.
        if pal.attrs.get("style") == "display:none;":
            continue

        # The "name" class is used twice in each Pal entry, first for their name, then for their rarity.
        name_class = pal.find_all("div", class_ = "name")
        # (Using next_element here rather than .text to avoid also getting the text from the nested children.)
        # Add the name and rarity of the Pal to the dictionary.
        rarity[name_class[0].next_element.strip()] = name_class[1].next_element

    return rarity


scrape_pal_rarity()

{'Anubis': 'Epic',
 'Arsox': 'Common',
 'Astegon': 'Epic',
 'Azurmane': 'Rare',
 'Azurobe': 'Rare',
 'Azurobe Cryst': 'Epic',
 'Bastigor': 'Epic',
 'Beakon': 'Rare',
 'Beegarde': 'Common',
 'Bellanoir': 'Legendary',
 'Bellanoir Libero': 'Legendary',
 'Blazamut': 'Epic',
 'Blazamut Ryu': 'Epic',
 'Blazehowl': 'Rare',
 'Blazehowl Noct': 'Epic',
 'Blue Slime': 'Common',
 'Braloha': 'Rare',
 'Bristla': 'Common',
 'Broncherry': 'Rare',
 'Broncherry Aqua': 'Epic',
 'Bushi': 'Rare',
 'Bushi Noct': 'Rare',
 'Caprity': 'Common',
 'Caprity Noct': 'Common',
 'Cattiva': 'Common',
 'Cave Bat': 'Common',
 'Cawgnito': 'Common',
 'Celaray': 'Common',
 'Celaray Lux': 'Common',
 'Celesdir': 'Rare',
 'Chikipi': 'Common',
 'Chillet': 'Common',
 'Chillet Ignis': 'Rare',
 'Cinnamoth': 'Common',
 'Cremis': 'Common',
 'Croajiro': 'Common',
 'Croajiro Noct': 'Rare',
 'Cryolinx': 'Rare',
 'Cryolinx Terra': 'Rare',
 'Daedream': 'Common',
 'Dazemu': 'Rare',
 'Dazzi': 'Common',
 'Dazzi Noct': 'Common',
 'Demon Eye

In [38]:
# Scrape the elements of the Pals
def scrape_pal_elements(url = "https://palworld.gg/pals"):
    """
    Scrape the element or elements of each Pal listed in the database.

    Args:
        url (str): The URL of the database page, assumed to be the current link unless otherwise specified.

    Returns:
        elements (dict): A dictionary with Pal names as string keys and their elements, the values, as lists of strings.
    """
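    # Name the parser explicitly so results do not depend on which parsers are installed.
    soup = BeautifulSoup(requests.get(url).text, "html.parser")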

    pal_tag = soup.find_all("div", class_ = "pal")
    elements = {}
    for pal in pal_tag:
        # Remove empty Pal entries.
        if pal.attrs.get("style") == "display:none;":
            continue

        # Load the individual entry page for the Pal.
        pal_page = requests.get("https://palworld.gg" + pal.a.attrs["href"]).text
        pal_soup = BeautifulSoup(pal_page, "html.parser")

        # Scrape the elements of the Pal and format them.
        pal_elems_tags = pal_soup.find("div", class_ = "elements").find_all("div", class_ = "name")
        pal_elems = []
        for tag in pal_elems_tags:
            pal_elems.append(tag.text)
        
        # Scrape the name of the Pal and add its elements to the dictionary.
        elements[pal_soup.find("h1", class_ = "name").text] = pal_elems
        
    return elements

scrape_pal_elements()

KeyboardInterrupt: 
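
`scrape_pal_elements` and `scrape_pal_work` each issue one request per Pal page, so running both downloads every page twice. One option is to cache the HTML of each page so that it is fetched only once. The following is a sketch under that assumption. `make_cached_fetcher` is a hypothetical helper, and `fake_fetch` and the example URL are offline stand-ins for a real download.

```python
def make_cached_fetcher(fetch):
    """Wrap a fetch(url) -> str function so each URL is downloaded only once."""
    cache = {}

    def cached_fetch(url):
        if url not in cache:
            cache[url] = fetch(url)
        return cache[url]

    return cached_fetch


# Offline stand-in so the sketch runs without network access; in the notebook
# the wrapped function could be: lambda url: requests.get(url).text
calls = []


def fake_fetch(url):
    calls.append(url)
    return "<html>" + url + "</html>"


get_page = make_cached_fetcher(fake_fetch)
first = get_page("https://example.com/pal-a")   # placeholder URL, triggers a fetch
second = get_page("https://example.com/pal-a")  # served from the cache
print(len(calls))
```

Passing the same `get_page` function to both scrapers would let the second one reuse the pages the first one already downloaded.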

In [40]:
# Scrape the work suitability of the Pals
def scrape_pal_work(url = "https://palworld.gg/pals"):
    """
    Scrape the work suitabilities of each Pal listed in the database.

    Args:
        url (str): The URL of the database page, assumed to be the current link unless otherwise specified.

    Returns:
        work (dict): A dictionary of Pal names as string keys and their suitabilities, the values, as lists of strings.
    """
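    # Name the parser explicitly so results do not depend on which parsers are installed.
    soup = BeautifulSoup(requests.get(url).text, "html.parser")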

    pal_tag = soup.find_all("div", class_ = "pal")
    work = {}
    for pal in pal_tag:
        # Remove empty Pal entries.
        if pal.attrs.get("style") == "display:none;":
            continue

        # Load the individual entry page for the Pal.
        pal_page = requests.get("https://palworld.gg" + pal.a.attrs["href"]).text
        pal_soup = BeautifulSoup(pal_page, "html.parser")

        # Scrape the suitabilities of the Pal and format them.
        pal_work_tags = pal_soup.find("div", class_ = "works").find_all("div", class_ = "active item")
        pal_work = []
        for tag in pal_work_tags:
            pal_work.append(tag.find("div", class_ = "name").text)
        
        # Scrape the name of the Pal and add its suitabilities to the dictionary.
        work[pal_soup.find("h1", class_ = "name").text] = pal_work
        
    return work

scrape_pal_work()

{'Anubis': ['Handiwork', 'Mining', 'Transporting'],
 'Arsox': ['Deforesting', 'Kindling'],
 'Astegon': ['Handiwork', 'Mining'],
 'Azurmane': ['Gathering', 'Generating Electricity'],
 'Azurobe': ['Watering'],
 'Azurobe Cryst': ['Cooling'],
 'Bastigor': ['Cooling', 'Deforesting', 'Mining'],
 'Beakon': ['Gathering', 'Generating Electricity', 'Transporting'],
 'Beegarde': ['Gathering',
  'Deforesting',
  'Handiwork',
  'Farming',
  'Medicine Production',
  'Planting',
  'Transporting'],
 'Bellanoir': ['Handiwork', 'Medicine Production', 'Transporting'],
 'Bellanoir Libero': ['Handiwork', 'Medicine Production', 'Transporting'],
 'Blazamut': ['Kindling', 'Mining'],
 'Blazamut Ryu': ['Kindling', 'Mining'],
 'Blazehowl': ['Deforesting', 'Kindling'],
 'Blazehowl Noct': ['Deforesting', 'Kindling'],
 'Blue Slime': ['Transporting'],
 'Braloha': ['Gathering', 'Mining', 'Planting'],
 'Bristla': ['Gathering',
  'Handiwork',
  'Medicine Production',
  'Planting',
  'Transporting'],
 'Broncherry': ['Pl