This is the template for DS3000 Final data analysis project. Once you finish, please remove all my instructions. You do not need to exactly follow the structure in the template but please make sure you have all the components. Write your report in paragraphs. Only use bullet points when list something (eg: functions) 

# Investigating Palworld Through Data Analysis
#### Team 11
- Ansh Aggarwal
- Sohum Balsara
- Bear Smith

In [None]:
# imports
import requests
from bs4 import BeautifulSoup
import pandas as pd
import seaborn as sns
import plotly.express as px

## Introduction

In this project we study Palworld, a survival and monster-taming video game in which players collect characters called Pals to battle each other and help build their bases. Each Pal has its own combination of features, such as its element type, its HP in battle, and its "work suitability" for tasks like farming or electricity production. We want to learn how these features relate to one another, and whether there are relationships between them that the game does not state outright. The answers may help players be more successful at the game, or at least better understand the logic behind Palworld.

Our primary research questions are:
- How are a Pal's work suitability scores correlated with its other statistics, such as its element type or battle skills?
- How is a Pal's rarity related to its other statistics?

## Data 

### Data Source

- All data was taken from [palworld.gg](https://palworld.gg).
- We scraped each Pal's name, ID number, rarity, element(s), work suitability type(s) and level(s), HP, and defense scores.
- To make the data frame more easily filtered, the Pals' elements were listed using multiple boolean columns, each representing one element.
- For the same reason, each work suitability also has its own column, but instead of True or False, each value is the Pal's level in that suitability, or NaN if the Pal does not have it. That way, `dropna` can be used to filter the data frame by work suitability.
- Finally, the Pals' HP and defense stats were converted to numbers in the data frame, so that mathematical operations can be done on those columns.
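
The layout described above can be illustrated with a small toy frame. This is only a sketch: the Pal names and every value below are made up for the example and are not taken from our scraped data.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values, laid out like the scraped data:
# Boolean element columns, NaN-or-level work columns, numeric HP.
toy = pd.DataFrame(
    {
        "Fire": [True, False, False],
        "Water": [False, True, True],
        "Mining": [2.0, np.nan, 1.0],
        "Watering": [np.nan, 1.0, 3.0],
        "HP": [70.0, 80.0, 95.0],
    },
    index=["Pal A", "Pal B", "Pal C"],
)

# Filter by element with the Boolean column.
water_pals = toy[toy["Water"]]

# Filter by work suitability by dropping the rows where that column is NaN.
miners = toy.dropna(subset=["Mining"])

# Numeric columns support arithmetic directly.
mean_hp_miners = miners["HP"].mean()

print(list(water_pals.index), list(miners.index), mean_hp_miners)
```

Keeping one column per element and per work type means each filter is a single call, at the cost of a wider data frame.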

### Webscraping and cleaning functions overview
- `scrape_pal_ids()`
    - Creates a data frame with all the Pals' names as its index and one column of data containing their IDs.
- `scrape_pal_rarity()`
    - Adds a column containing the rarity of each Pal to the given data frame.
- `scrape_pal_elements()`
    - Adds the element or elements of each Pal to the given data frame using one-hot columns.
- `scrape_pal_work()`
    - Adds the work suitability or suitabilities of each Pal to the given data frame, along with its level.
- `scrape_pal_hp()`
    - Adds a column containing the HP of each Pal to the given data frame.
- `scrape_pal_defense()`
    - Adds a column containing the defense score of each Pal to the given data frame.
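
Some of the scraping functions read a Pal's name with `next_element` rather than `.text`, so that text from nested tags is not included. The sketch below shows the difference on a small hand-written HTML snippet. The markup, the `tag` class, the Pal name, and the ID are invented for illustration and only loosely mirror the class names our scrapers look for; the real pages on palworld.gg may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one visible Pal entry and one hidden placeholder entry.
html = (
    '<div class="pal" style="">'
    '<span class="index">#000</span>'
    '<div class="name">ExamplePal<span class="tag">Common</span></div>'
    '</div>'
    '<div class="pal" style="display:none;"></div>'
)

soup = BeautifulSoup(html, "html.parser")

pal_ids = {}
for pal in soup.find_all("div", class_="pal"):
    # Skip hidden placeholder entries.
    if pal.attrs.get("style") == "display:none;":
        continue

    name_div = pal.find("div", class_="name")
    # .text joins the text of every nested tag,
    # while next_element is only the first child (here, the bare name string).
    full_text = name_div.text
    pal_name = name_div.next_element.strip()

    pal_ids[pal_name] = pal.find("span", class_="index").text.strip()

print(full_text)
print(pal_ids)
```

In this toy snippet `.text` runs the name and the nested label together, while `next_element` returns the name alone.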

### Data overview

# TODO - IN PROGRESS
- You can see a couple rows of our data below.
- We store the Pals' IDs, their rarity, their element or elements, their work suitabilities and the level of each, and their HP and defense scores.
- Discuss if there is any potential problems about the data (eg: missing values, any features that you did not collect but may be important, any other concerns)

In [None]:
import pandas as pd
pal_frame = pd.read_excel("pal_frame_for_project.xlsx")
# Reset the index back to the Pal names after reading in the spreadsheet file.
pal_frame.set_index("Unnamed: 0", inplace = True)
pal_frame.rename_axis(None, axis = 0, inplace = True)
pal_frame.head()

## Webscraping and cleaning

In [None]:
# list all the functions you have for webscraping and cleaning. Make sure write full 
# docstrings for each function

def scrape_pal_ids(url = "https://palworld.gg/pals"):
    """
    Scrape Pal IDs from the given URL.

    Args:
        url (str): The URL of the Palworld database page, assumed to be the current link unless otherwise specified.

    Returns:
        DataFrame: A DataFrame of Pal IDs, indexed by Pal name.
    """
    response = requests.get(url).text
    soup = BeautifulSoup(response, "html.parser")

    # Find all Pal ID elements
    pal_dict = {}
    for pal in soup.find_all("div", class_ = "pal"):

        # Remove empty Pal entries.
        if pal.attrs["style"] == "display:none;":
            continue

        # Get the ID and name of the Pal and add them to the dictionary.
        pal_id_element = pal.find('span', class_='index').text.strip()
        pal_dict[pal.find("div", class_ = "name").next_element.strip()] = pal_id_element

    # Convert the dictionary to a DataFrame before returning it.
    return pd.DataFrame.from_dict(pal_dict, orient = "index", columns = ["ID"])

In [None]:
def scrape_pal_rarity(pal_df, url = "https://palworld.gg/pals"):
    """
    Scrape the rarity of each Pal listed in the database.

    Args:
        url (str): The URL of the database page, assumed to be the current link unless otherwise specified.
        pal_df (DataFrame): The DataFrame of Pal data to update.

    Returns:
        pal_df (DataFrame): The DataFrame given, with the rarity of each Pal added under a new column.
    """
    soup = BeautifulSoup(requests.get(url).text, "html.parser")

    # Find all Pals, then find each of their rarities.
    pal_tag = soup.find_all("div", class_ = "pal")
    for pal in pal_tag:

        # Remove empty Pal entries.
        if pal.attrs["style"] == "display:none;":
            continue

        # The "name" class is used twice in each Pal entry, first for their name, then for their rarity.
        name_class = pal.find_all("div", class_ = "name")
        # (Using next_element here rather than .text to avoid also getting the text from the nested children.)
        # Add the rarity of the Pal to the DataFrame under its name.
        pal_df.loc[name_class[0].next_element.strip(), "Rarity"] = name_class[1].next_element

    return pal_df


In [None]:
def scrape_pal_elements(pal_df, url = "https://palworld.gg/pals"):
    """
    Scrape the element or elements of each Pal listed in the database.

    Args:
        url (str): The URL of the database page, assumed to be the current link unless otherwise specified.
        pal_df (DataFrame): The DataFrame of Pal data to update.

    Returns:
        pal_df (DataFrame): The DataFrame given, with a dummy variable added for each element,
        True meaning that a Pal does belong to them and False for not.
    """
    soup = BeautifulSoup(requests.get(url).text, "html.parser")

    # Find each Pal entry.
    pal_tag = soup.find_all("div", class_ = "pal")
    for pal in pal_tag:
        # Remove empty Pal entries.
        if pal.attrs["style"] == "display:none;":
            continue

        # Load the individual entry page for the Pal.
        pal_page = requests.get("https://palworld.gg" + pal.a.attrs["href"]).text
        pal_soup = BeautifulSoup(pal_page, "html.parser")

        # Scrape the Pal's name.
        pal_name = pal_soup.find("h1", class_ = "name").text.strip()

        # Initialize all the element column values for this Pal to False, as none have been found yet.
        pal_df.loc[pal_name, ["Earth", "Fire", "Dragon", "Dark", "Electricity", "Water", "Ice", "Leaf", "Normal"]] = False

        # Scrape the elements of the Pal.
        pal_elems_tags = pal_soup.find("div", class_ = "elements").find_all("div", class_ = "name")
        for tag in pal_elems_tags:
            # For each found element, change the Pal's value in the corresponding column to True.
            pal_df.loc[pal_name, tag.text] = True
        
    return pal_df


In [None]:
def scrape_pal_work(pal_df, url = "https://palworld.gg/pals"):
    """
    Scrape the work suitabilities of each Pal listed in the database, including the level of each work type.

    Args:
        url (str): The URL of the database page, assumed to be the current link unless otherwise specified.
        pal_df (DataFrame): The DataFrame of Pal data to update.

    Returns:
        pal_df (DataFrame): The DataFrame given, with 13 new columns added: 12 representing the work suitabilities
        of Pals, each filled with NaN or the Pal's skill level for that task, and one column counting the number of suitabilities
        the Pal has in total.
    """
    soup = BeautifulSoup(requests.get(url).text, "html.parser")

    # Find each Pal entry.
    pal_tag = soup.find_all("div", class_ = "pal")
    for pal in pal_tag:
        # Remove empty Pal entries.
        if pal.attrs["style"] == "display:none;":
            continue

        # Load the individual entry page for the Pal.
        pal_page = requests.get("https://palworld.gg" + pal.a.attrs["href"]).text
        pal_soup = BeautifulSoup(pal_page, "html.parser")

        # Scrape the name of the Pal.
        pal_name = pal_soup.find("h1", class_ = "name").text.strip()

        # Scrape the suitabilities of the Pal, and for each one that is found add its level to the column with the suitability's name.
        pal_work_tags = pal_soup.find("div", class_ = "works").find_all("div", class_ = "active item")
        for tag in pal_work_tags:
            pal_df.loc[pal_name, tag.find("div", class_ = "name").text] = tag.find("span", class_ = "value").text
        # Add the pal's total number of suitabilities as a new column.
        pal_df.loc[pal_name, "Number of work suitabilities"] = len(pal_work_tags)
        
    return pal_df


In [None]:
def scrape_pal_hp(pal_df, url = "https://palworld.gg/pals"):
    """
    Scrape the HP of each Pal listed in the database.

    Args:
        url (str): The URL of the database page, assumed to be the current link unless otherwise specified.
        pal_df (DataFrame): The DataFrame of Pal data to update.

    Returns:
        pal_df (DataFrame): The DataFrame given, with a new column representing the Pals' HPs.
    """
    soup = BeautifulSoup(requests.get(url).text, "html.parser")

    # Find each Pal entry.
    pal_tag = soup.find_all("div", class_ = "pal")
    for pal in pal_tag:
        # Remove empty Pal entries.
        if pal.attrs["style"] == "display:none;":
            continue

        # Load the individual entry page for the Pal.
        pal_page = requests.get("https://palworld.gg" + pal.a.attrs["href"]).text
        pal_soup = BeautifulSoup(pal_page, "html.parser")

        # Scrape the name of the Pal.
        pal_name = pal_soup.find("h1", class_ = "name").text.strip()

        # Scrape the HP of the pal (the first entry in the stats section) and add it to the HP column.
        pal_hp = int(pal_soup.find("div", class_ = "stats").find_all("div", class_ = "value")[0].text)
        pal_df.loc[pal_name, "HP"] = pal_hp
        
    return pal_df


In [None]:
def scrape_pal_defense(pal_df, url = "https://palworld.gg/pals"):
    """
    Scrape the defense score of each Pal listed in the database.

    Args:
        url (str): The URL of the database page, assumed to be the current link unless otherwise specified.
        pal_df (DataFrame): The DataFrame of Pal data to update.

    Returns:
        pal_df (DataFrame): The DataFrame given, with a new column representing the Pals' defense scores.
    """
    soup = BeautifulSoup(requests.get(url).text, "html.parser")

    # Find each Pal entry.
    pal_tag = soup.find_all("div", class_ = "pal")
    for pal in pal_tag:
        # Remove empty Pal entries.
        if pal.attrs["style"] == "display:none;":
            continue

        # Load the individual entry page for the Pal.
        pal_page = requests.get("https://palworld.gg" + pal.a.attrs["href"]).text
        pal_soup = BeautifulSoup(pal_page, "html.parser")

        # Scrape the name of the Pal.
        pal_name = pal_soup.find("h1", class_ = "name").text.strip()

        # Scrape the defense of the Pal (the second entry in the stats section) and add it to the defense column.
        pal_defense = int(pal_soup.find("div", class_ = "stats").find_all("div", class_ = "value")[1].text)
        pal_df.loc[pal_name, "Defense"] = pal_defense
        
    return pal_df


In [None]:
pal_df = scrape_pal_defense(scrape_pal_hp(scrape_pal_work(scrape_pal_elements(scrape_pal_rarity(scrape_pal_ids())))))
pal_df

## Visualizations

### Visualization functions overview
List all the functions you have written for visualization. For each one, write one sentence to describe it. 
- `make_hist()`
    - Generate a histogram with given data and feature
 
### Visualization results
- Present 3-4 data visualizations.
- For each visualization, you need to include title, xlabel, ylabel, legend (if necessary)
- For each visualization, explain why you make this data visualization (how it related to your research question) and explain what you have learned from this visualization

In [None]:
# list all the functions you have for visualization. Make sure write full 
# docstrings for each function
def make_hist(df, y_feat):

    pass

#### visualization 1

In [None]:
# Write the code to run functions to get each data visualization in separate code chunks. 
# Interpret the figures. 

#### visualization 2

In [None]:
# Write the code to run functions to get each data visualization in separate code chunks. 
# Interpret the figures. 

#### visualization 3

In [None]:
# Write the code to run functions to get each data visualization in separate code chunks. 
# Interpret the figures. 

## Models

### Modeling functions overview
List all the functions you have written for modeling. For each one, write one sentence to describe it. 
- `fit_linear()`
    - fit a linear model to the data and output the r2, slope and intercept

### Model results

- Present 2-3 models for the analysis.
- Explain any pre-processing steps you have done (eg: scaling, polynomial, dummy features)
- For each model, explain why you think this model is suitable and what metrics you want to use to evaluate the model
    - If it is a classification model, you need to present the confusion matrix, calculate the accuracy, sensitivity and specificity with cross-validation
    - If it is a regression model, you need to present the r2 and MSE with cross-validation
    - If it is a linear regression model/multiple linear regression model, you need to interpret the meaning of the coefficient with the full data
    - If it is a decision tree model, you need to plot the tree with the full data
    - If it is a random forest model, you need to present the feature importance plot with the full data
    - If it is a PCA, you need to explain how to select the number of components and interpret the key features in the first two components
    - If it is a clustering, you need explain how to select the number of clustering and summarize the clustering. 
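
For the regression case, the cross-validated r2 and MSE could be computed along the lines of the sketch below. It is an illustration under stated assumptions: the helper name `cross_validated_scores`, the choice of 5 shuffled folds, and the synthetic data are ours for the example and are not part of the project's existing functions or results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate


def cross_validated_scores(X, y, n_splits=5, seed=0):
    """
    Estimate r2 and MSE of a linear regression with k-fold cross-validation.

    Args:
        X (ndarray): 2D array of features.
        y (ndarray): 1D array of targets.
        n_splits (int): Number of folds.
        seed (int): Random seed used to shuffle the folds.

    Returns:
        tuple: The mean r2 and the mean MSE across the folds.
    """
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = cross_validate(
        LinearRegression(), X, y, cv=cv,
        scoring=("r2", "neg_mean_squared_error"),
    )
    # scikit-learn reports MSE negated so that larger is always better; flip the sign back.
    return scores["test_r2"].mean(), -scores["test_neg_mean_squared_error"].mean()


# Synthetic data (not Pal data): a noisy straight line.
rng = np.random.default_rng(0)
X = rng.uniform(1, 4, size=(60, 1))
y = 50 + 10 * X[:, 0] + rng.normal(0, 2, size=60)

mean_r2, mean_mse = cross_validated_scores(X, y)
print(mean_r2, mean_mse)
```

Averaging over folds gives a steadier estimate than a single train/test split, which matters for a data set as small as one row per Pal.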

In [None]:
# list all the functions you have for modeling. Make sure write full 
# docstrings for each function
def fit_linear(df, y_feat, x_feat):
    """
    Fit a simple linear regression and output the r2, slope and intercept.

    The model is fit on a random 80% training split, and r2 is computed on the held-out 20%.

    Args:
        df (DataFrame): The DataFrame of Pal data to use.
        y_feat (str): The name of the feature to use as the dependent variable.
        x_feat (str): The name of the feature to use as the independent variable.

    Returns:
        r2 (float): The r-squared value of the fitted model on the test split.
        slope (float): The slope of the fitted model.
        intercept (float): The intercept of the fitted model.
    """
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import r2_score

    # Drop rows with NaN values in the specified columns.
    clean_df = df[[y_feat, x_feat]].dropna()
    # Selecting with a list keeps X two-dimensional, as scikit-learn expects.
    X = clean_df[[x_feat]].to_numpy()
    y = clean_df[y_feat].to_numpy()
    # Split the data into training and testing sets.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    # Fit the linear regression model.
    model = LinearRegression()
    model.fit(X_train, y_train)
    # Make predictions on the test set.
    y_pred = model.predict(X_test)
    # Calculate the r-squared value, slope, and intercept.
    r2 = r2_score(y_test, y_pred)
    slope = model.coef_[0]
    intercept = model.intercept_
    return r2, slope, intercept

#### Model 1

In [None]:
# Create Element Type column from one-hot element columns
element_cols = ["Earth", "Fire", "Dragon", "Dark", "Electricity", "Water", "Ice", "Leaf", "Normal"]


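def get_element_type(row):
    """Return the first element in element_cols that the Pal has, or "Unknown" if it has none.

    For Pals with two elements, only the first matching element is kept.
    """
    for col in element_cols:
        if col in row.index and row[col] > 0:
            return col
    return "Unknown"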


pal_frame["Element Type"] = pal_frame.apply(get_element_type, axis=1)

# Define fixed colors for each element
element_colors = {
    "Earth": "#8B4513",  # SaddleBrown
    "Fire": "#FF4500",  # OrangeRed
    "Dragon": "#800080",  # Purple
    "Dark": "#2F4F4F",  # DarkSlateGray
    "Electricity": "#FFD700",  # Gold
    "Water": "#1E90FF",  # DodgerBlue
    "Ice": "#00CED1",  # DarkTurquoise
    "Leaf": "#228B22",  # ForestGreen
    "Normal": "#A9A9A9",  # DarkGray
    "Unknown": "#808080"  # Gray
}

In [None]:
# visuals for model 1
import pandas as pd
import plotly.express as px


# --- Work-type columns ---
work_list = ['Handiwork', 'Mining',
       'Transporting', 'Deforesting',
       'Kindling', 'Gathering', 'Generating Electricity', 'Watering',
       'Cooling', 'Farming', 'Medicine Production', 'Planting']
work_columns = pal_frame[work_list]


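def build_features(df, work_cols):
    """
    Add summary columns describing each Pal's work suitabilities.

    Args:
        df (DataFrame): The DataFrame of Pal data.
        work_cols (DataFrame): The work suitability columns of df (NaN where a Pal lacks a suitability).

    Returns:
        DataFrame: A copy of df with the highest level, the list of work types, and their count added.
    """
    out = df.copy()
    levels = work_cols.apply(pd.to_numeric, errors="coerce")
    # The row-wise maximum skips NaN, and is NaN for a Pal with no suitabilities.
    out["Highest Work Suitability Level"] = levels.max(axis=1)
    out["Work Types"] = levels.apply(lambda row: ", ".join(row.dropna().index), axis=1)
    out["Work Skill Count"] = levels.notna().sum(axis=1)
    return out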

# --- Prepare data ---
pal_frame_vis = build_features(pal_frame, work_columns)

# --- 1. Scatter Plot ---
fig1 = px.scatter(
    pal_frame_vis.sort_values(by = "Highest Work Suitability Level", ascending = True),
    x="Highest Work Suitability Level",
    y="HP",
    color="Element Type",
    hover_data={"ID":True, "Rarity":True, "Work Types":True},
    title="Work Suitability vs HP by Element Type"
)
fig1.show()

# --- 2. Grouped Bar Chart ---
melted = pal_frame_vis.melt(
    id_vars=["Element Type"],
    value_vars=work_list,
    var_name="Work Type",
    value_name="Level"
)
melted["Level"] = pd.to_numeric(melted["Level"], errors="coerce")
melted = melted.dropna(subset=["Level"])

agg_df = (melted.groupby(["Element Type", "Work Type"], as_index=False, observed=True)
          .agg(Level=("Level", "mean")))

fig2 = px.bar(
    agg_df,
    x="Element Type", y="Level", color="Work Type",
    barmode="group",
    title="Average Work Suitability Level by Element Type & Work Type"
)
fig2.show()


# --- 3. Heatmap ---
heat_df = pal_frame_vis.groupby(["Rarity","Work Skill Count"]).size().reset_index(name="Count")
fig3 = px.density_heatmap(
    heat_df,
    x="Work Skill Count", y="Rarity", z="Count", color_continuous_scale="Viridis",
    title="Rarity vs. Number of Work Suitabilities (Frequency)"
)
fig3.show()

# --- 4. Boxplots ---
fig4 = px.box(pal_frame_vis, x="Rarity", y="HP", color="Element Type",
              title="HP Distribution by Rarity")
fig4.show()
fig5 = px.box(pal_frame_vis, x="Rarity", y="Defense", color="Element Type",
              title="Defense Distribution by Rarity")
fig5.show()


In [None]:
# Write the code to run functions to fit each model in separate code chunks.
r2, slope, intercept = fit_linear(pal_frame_vis, "HP", "Highest Work Suitability Level")
print(f"Model 1: R2 = {r2}, Slope = {slope}, Intercept = {intercept}")

In [None]:
# Write the code to run functions to fit each model in separate code chunks. 
# Interpret the model results.

# Replot the scatter plot with the fitted line
fig1 = px.scatter(
    pal_frame_vis.sort_values(by = "Highest Work Suitability Level", ascending = True),
    x="Highest Work Suitability Level",
    y="HP",
    color="Element Type",
    hover_data={"ID":True, "Rarity":True, "Work Types":True},
    title="Work Suitability vs HP by Element Type"
)
# Sort the x values and drop missing ones so the fitted line is drawn left to right without gaps.
line_x = pal_frame_vis["Highest Work Suitability Level"].dropna().sort_values()
fig1.add_scatter(
    x=line_x,
    y=intercept + slope * line_x,
    mode="lines",
    name="Fitted Line",
    line=dict(color="red", width=2)
)
fig1.show()
# The fitted line shows a positive correlation between work suitability level and HP, indicating that Pals with higher work suitability levels tend to have higher HP. The slope of the line suggests that for each unit increase in work suitability level, the HP increases by approximately `slope` units. The R2 value indicates how well the model explains the variance in HP based on work suitability level.

#### Model 2

In [None]:
# Write the code to run functions to fit each model in separate code chunks. 
# Interpret the model results.


#### Model 3

In [None]:
# Write the code to run functions to fit each model in separate code chunks. 
# Interpret the model results. 

## Discussion

- One or two paragraphs to summarize your findings in the modeling sections and do the models answer your research question?
- Any other potential thing you can do with the analysis (eg: include more features, get more data, try some other models etc.)
- List the contribution for each group member.

### Contributions
- Sohum came up with the majority of our research questions and visualization ideas.
- Bear wrote the web scraping functions.
- Ansh wrote the data visualizations.