This is the template for the DS3000 final data analysis project. Once you finish, please remove all of my instructions. You do not need to follow the template's structure exactly, but please make sure you include all of the components. Write your report in paragraphs. Only use bullet points when listing something (e.g., functions).

# DS3000 Final Project
#### Team number: 9
- Xavier Yu, Vilasini Nathan, Justin Lee

## Introduction

- One or two paragraphs about the background of the project, e.g., the background of PalWorld and why your analysis is interesting.
- State your research questions. Limit yourself to one or two research questions.

## Data 

### Data Source

- List the website you scraped the data from.
- List the information you scraped.
- Describe the cleaning you applied to the data.

### Webscraping and cleaning functions overview

List all the functions you have written for webscraping and data cleaning. For each one, write one sentence to describe it. 
- `extract_soup()`
    - build url and return soup object

### Data overview

- Show a couple of rows of the cleaned data you are going to use for the analysis.
- State which column is your target variable (if one exists).
- Give a general summary of the other features.
- Discuss any potential problems with the data (e.g., missing values, features that you did not collect but may be important, any other concerns).
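
One way to back up the data overview is to preview a few rows and tabulate missing values per column. The sketch below is illustrative only: `summarize_missing` is a hypothetical helper name, and the toy DataFrame stands in for the scraped data, so the column values are made up.

```python
import pandas as pd


def summarize_missing(df):
    """
    Count missing values per column.

    Args:
        df (pd.DataFrame): the cleaned data

    Returns:
        pd.DataFrame: one row per column with the number and percentage
            of missing values, sorted with the worst column first
    """
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": (counts / len(df) * 100).round(1),
    })
    return summary.sort_values("n_missing", ascending=False)


# Toy stand-in for the scraped data (values are invented for illustration)
toy = pd.DataFrame({
    "Name": ["Pal A", "Pal B", "Pal C", "Pal D"],
    "HP": [70, None, 90, 100],
    "Rarity Level": [1, 2, None, None],
})

print(toy.head(2))       # a couple of rows for the report
missing = summarize_missing(toy)
print(missing)           # which columns have gaps, and how large they are
```

Running the same helper on the real `df` would show whether dropping rows with a missing target removes a meaningful share of the data.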

## Webscraping and cleaning

In [None]:
# List all the functions you have for webscraping and cleaning. Make sure to write full
# docstrings for each function.

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import pandas as pd
import seaborn as sns
from sklearn import tree
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import r2_score
import pylab as py
import scipy.stats as stats
import requests
from bs4 import BeautifulSoup


def get_html(url):
    """
    Retrieve HTML content from a given URL

    Args:
        url (str): The webpage URL to fetch HTML from

    Returns:
        str: HTML content as a string.
    """
    # A timeout keeps the request from hanging indefinitely on an unresponsive server
    response = requests.get(url, timeout=10)
    return response.text


def extract_pal_links(html):
    """
    Extract the links to individual pal pages from a page's HTML text.
    
    Args:
        html (string): html text of a webpage
        
    Returns:
        pal_links (list): A list of full URLs to individual pal pages.
    """
    # parse HTML content to create soup (an explicit parser avoids a bs4 warning)
    soup = BeautifulSoup(html, 'html.parser')
    
    pal_links = []  # Create an empty list to store full URLs

    # Find all <div> elements with class pal
    for pal_div in soup.find_all('div', class_='pal'):
        
        # Inside each <div class='pal'>, find the first <a> tag
        a_tag = pal_div.find('a')
        
        # Check if the <a> tag exists and has an 'href' attribute
        if a_tag and a_tag.has_attr('href'):
            # Build the full link by appending the relative path to the base URL
            full_link = 'https://palworld.gg' + a_tag['href']
            pal_links.append(full_link)

    return pal_links


def fetch_pal_data(pal_links):
    """
    Given a list of Palworld page links, fetch each page and extract the pal's name and stats.

    Args:
        pal_links (list): A list of full URLs to individual pal pages.

    Returns:
        pal_data (dictionary): parallel lists under the keys 'Name', 'Link', and 'Stats', one entry per pal.
    """    
    name, href, stats = [], [], []

    for link in pal_links:
        # Request the HTML content of each pal's page
        response = requests.get(link, timeout=10)  # timeout avoids hanging on a stalled request
        if response.status_code != 200:
            continue

        # Parse the HTML using BeautifulSoup
        pal_soup = BeautifulSoup(response.text, 'html.parser')

        # Get the pal's name from the <h1> tag
        name_tag = pal_soup.find('h1')
        pal_name = name_tag.text.strip() if name_tag else 'Unknown'

        # Dictionary to store stats like HP, Attack, etc.
        stats_dict = {}

        # Look for the section containing stats
        stats_div = pal_soup.find('div', class_='stats')
        if stats_div:
            items_div = stats_div.find('div', class_='items')
            if items_div:
                # Loop through all stat items
                for item in items_div.find_all('div', class_='item'):
                    stat_name_tag = item.find('div', class_='name')
                    stat_value_tag = item.find('div', class_='value')
                    if stat_name_tag and stat_value_tag:
                        stat_name = stat_name_tag.text.strip()
                        stat_value = stat_value_tag.text.strip()
                        stats_dict[stat_name] = stat_value

        # Save the collected data into corresponding lists
        name.append(pal_name)
        href.append(link)
        stats.append(stats_dict)
        
    pal_data = {'Name': name, 'Link': href, 'Stats': stats}


    return pal_data


def add_rarity_info(pal_data):
    """
    Adds 'Rarity Level' and 'Rarity Name' lists to pal_data, scraped from each pal's page
    
    Args:
        pal_data (dictionary): contains pal info with at least a 'Link' key.
        
    Returns:
        pal_data (dictionary): 'Rarity Level' and 'Rarity Name' fields added
    """
    # empty lists to store rarity level and rarity name
    rar_level, rar_name = [], []
    
    for link in pal_data['Link']:
        html = get_html(link)  # fetch this pal's own page
        soup = BeautifulSoup(html, 'html.parser')

        # Find the rarity div
        rarity_div = soup.find('div', class_='rarity')
        rarity_level, rarity_name = None, None
        if rarity_div:
            lv_tag = rarity_div.find('div', class_='lv')
            name_tag = rarity_div.find('div', class_='name')
            if lv_tag and name_tag:
                rarity_level = lv_tag.text.strip()
                rarity_name = name_tag.text.strip()

        rar_level.append(rarity_level)
        rar_name.append(rarity_name)
        
    pal_data['Rarity Level'] = rar_level
    pal_data['Rarity Name'] = rar_name

    return pal_data


def add_element_work(pal_data):
    """
    Adds element and work suitability to pal_data
    
    Args:
        pal_data (dictionary): Dictionary, with at least a 'Link' key.
        
    Returns:
        pal_data (dictionary): 'Element', 'Work Suitability' added
    
    """
    # empty lists to store element and work suitability
    element, work = [], []
    
    # iterate through all links
    for link in pal_data['Link']:
        
        # get HTML
        html = get_html(link)
        # Parse the HTML content with an explicit parser for consistent results
        soup = BeautifulSoup(html, 'html.parser')
        
        # to store all elements of each pal
        pal_element = []
        
        # get contents of first div with class elements (may be missing on some pages)
        elements_div = soup.find('div', class_='elements')

        # get text from each element and append to pal_element
        if elements_div:
            for el in elements_div.find_all('div', class_='name'):
                pal_element.append(el.text.strip())
        
        # empty dict to store work suitability and level
        work_level = {}

        # get contents of first div with class works (may be missing on some pages)
        works_div = soup.find('div', class_='works')
        work_items = works_div.find_all('div', class_='item') if works_div else []

        # iterates through contents of works_div with div and class item
        for item in work_items:
            
            # only extracts displayed items
            if 'display:none' not in item.get('style', ''):
                if 'Lv' in item.text:
                    # split on the last 'Lv' so a name containing 'Lv' still unpacks cleanly
                    work_suit, level = item.text.rsplit('Lv', 1)
                    work_level[work_suit.strip()] = int(level)
                else:
                    work_suit = item.text.strip()
                    work_level[work_suit] = ''
        
        # appends pal_element and work_level to the larger element and work lists
        element.append(pal_element)
        work.append(work_level)
    
    # adds to dictionary
    pal_data['Element'] = element
    pal_data['Work Suitability'] = work
    
    return pal_data

In [None]:
url = 'https://palworld.gg/pals'
html = get_html(url)
pal_links = extract_pal_links(html)

pal_data = fetch_pal_data(pal_links)
pal_data = add_rarity_info(pal_data)
pal_data = add_element_work(pal_data)

# Create DataFrame for downstream processing
df = pd.DataFrame(pal_data)

# Expand the dict in "Stats" to columns
stats_df = df["Stats"].apply(pd.Series)

# Merge back and coerce numbers
df = pd.concat([df.drop(columns=["Stats"]), stats_df], axis=1)

# Convert stat columns and rarity to numeric
stat_cols = stats_df.columns.tolist()
for c in stat_cols:
    df[c] = pd.to_numeric(df[c], errors="coerce")
df["Rarity Level"] = pd.to_numeric(df["Rarity Level"], errors="coerce")

# Optional combined metric
df["Total Stats"] = df[stat_cols].sum(axis=1)

# Drop rows missing target
df = df.dropna(subset=["Rarity Level"])

## Visualizations

### Visualization functions overview
List all the functions you have written for visualization. For each one, write one sentence to describe it. 
- `make_hist()`
    - Generate a histogram with given data and feature
 
### Visualization results
- Present 3-4 data visualizations.
- For each visualization, include a title, xlabel, ylabel, and legend (if necessary).
- For each visualization, explain why you made it (how it relates to your research question) and what you learned from it.
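
As an illustration of the labeling requirements above, here is a minimal sketch of what a `make_hist()` helper could look like. The `bins` parameter and the toy data are assumptions added for the example; the template itself only fixes the function name and its first two arguments.

```python
import matplotlib.pyplot as plt
import pandas as pd


def make_hist(df, y_feat, bins=10):
    """
    Generate a histogram of one feature with a title and axis labels.

    Args:
        df (pd.DataFrame): data containing the feature
        y_feat (str): name of the column to plot
        bins (int): number of histogram bins (assumed default, not from the template)

    Returns:
        matplotlib.axes.Axes: the axes holding the histogram
    """
    fig, ax = plt.subplots()
    # Drop missing values so they do not interfere with the binning
    ax.hist(df[y_feat].dropna(), bins=bins, edgecolor="black")
    ax.set_title(f"Distribution of {y_feat}")
    ax.set_xlabel(y_feat)
    ax.set_ylabel("Count")
    return ax


# Toy data for illustration only; the real call would pass the cleaned df
toy = pd.DataFrame({"Total Stats": [300, 320, 410, 450, 500, None]})
ax = make_hist(toy, "Total Stats", bins=5)
```

Returning the axes rather than calling `plt.show()` inside the function lets the caller adjust the figure further before displaying it.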

In [4]:
# List all the functions you have for visualization. Make sure to write full
# docstrings for each function.
def make_hist(df, y_feat):

    pass

#### Visualization 1

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.scatterplot(data=df, x="Total Stats", y="Rarity Level", hue="Rarity Name")
plt.title("Total Stats vs Rarity Level")
plt.xlabel("Total Stats")
plt.ylabel("Rarity Level")
plt.legend(title="Rarity")
plt.show()

#### Visualization 2

In [6]:
# Write the code to run functions to get each data visualization in separate code chunks. 
# Interpret the figures. 

#### Visualization 3

In [7]:
# Write the code to run functions to get each data visualization in separate code chunks. 
# Interpret the figures. 

## Models

### Modeling functions overview
List all the functions you have written for modeling. For each one, write one sentence to describe it. 
- `fit_linear()`
    - fit a linear model to the data and output the r2, slope and intercept
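
A minimal sketch of how `fit_linear()` might be filled in, assuming a single predictor and ordinary least squares via `numpy.polyfit`. The returned dictionary layout and the toy values are assumptions for illustration, and the R² here is computed on the data used for fitting, not cross-validated.

```python
import numpy as np


def fit_linear(df, y_feat, x_feat):
    """
    Fit y = slope * x + intercept by least squares.

    Args:
        df (pd.DataFrame or dict): data indexed by column name
        y_feat (str): name of the target column
        x_feat (str): name of the single predictor column

    Returns:
        dict: in-sample 'r2', plus the fitted 'slope' and 'intercept'
    """
    x = np.asarray(df[x_feat], dtype=float)
    y = np.asarray(df[y_feat], dtype=float)

    # Keep only rows where both values are present
    keep = ~(np.isnan(x) | np.isnan(y))
    x, y = x[keep], y[keep]

    # A degree-1 polynomial fit returns (slope, intercept)
    slope, intercept = np.polyfit(x, y, deg=1)

    # R^2 = 1 - SS_res / SS_tot
    ss_res = np.sum((y - (slope * x + intercept)) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot

    return {"r2": float(r2), "slope": float(slope), "intercept": float(intercept)}


# Toy data lying exactly on the line y = 0.02 * x + 1 (illustration only)
toy = {"Total Stats": [100, 200, 300, 400], "Rarity Level": [3, 5, 7, 9]}
result = fit_linear(toy, y_feat="Rarity Level", x_feat="Total Stats")
print(result)
```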

### Model results

- Present 2-3 models for the analysis.
- Explain any pre-processing steps you have done (eg: scaling, polynomial, dummy features)
- For each model, explain why you think this model is suitable and what metrics you want to use to evaluate the model
    - If it is a classification model, you need to present the confusion matrix, calculate the accuracy, sensitivity and specificity with cross-validation
    - If it is a regression model, you need to present the r2 and MSE with cross-validation
    - If it is a linear regression model/multiple linear regression model, you need to interpret the meaning of the coefficient with the full data
    - If it is a decision tree model, you need to plot the tree with the full data
    - If it is a random forest model, you need to present the feature importance plot with the full data
    - If it is a PCA, you need to explain how you selected the number of components and interpret the key features in the first two components
    - If it is a clustering model, you need to explain how you selected the number of clusters and summarize the clusters.
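
For the regression requirement above (R² and MSE with cross-validation), one possible approach is scikit-learn's `cross_validate` with two scorers. This is a sketch on synthetic data: the helper name `cv_r2_mse`, the 5-fold setting, and the generated `X` and `y` are assumptions, not part of the project.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate


def cv_r2_mse(X, y, n_splits=5, seed=42):
    """
    Estimate R^2 and MSE of a linear regression with K-fold cross-validation.

    Args:
        X (array-like): feature matrix of shape (n_samples, n_features)
        y (array-like): target values of shape (n_samples,)
        n_splits (int): number of folds
        seed (int): random seed for shuffling the folds

    Returns:
        dict: mean cross-validated 'r2_mean' and 'mse_mean'
    """
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = cross_validate(
        LinearRegression(), X, y, cv=cv,
        scoring=("r2", "neg_mean_squared_error"),
    )
    return {
        "r2_mean": float(np.mean(scores["test_r2"])),
        # scikit-learn reports negated MSE, so flip the sign back
        "mse_mean": float(-np.mean(scores["test_neg_mean_squared_error"])),
    }


# Synthetic data with a known linear relationship plus small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

metrics = cv_r2_mse(X, y)
print(metrics)
```

Reporting both metrics from the same folds keeps them comparable across models, since each model is scored on identical train/test splits.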

In [8]:
# List all the functions you have for modeling. Make sure to write full
# docstrings for each function.
def fit_linear(df, y_feat, x_feat):

    pass

#### Model 1

In [9]:
# Write the code to run functions to fit each model in separate code chunks. 
# Interpret the model results. 

#### Model 2

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold

Xcols = stat_cols  # or stat_cols + ["Total Stats"] if you want to include the sum too

single_results = []
cv = KFold(n_splits=5, shuffle=True, random_state=42)

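for col in Xcols:
    # Drop rows where this stat is missing; LinearRegression cannot fit on NaN values
    sub = df[[col, "Rarity Level"]].dropna()
    if len(sub) < cv.get_n_splits():
        continue  # not enough rows to cross-validate this stat
    X = sub[[col]].values
    y = sub["Rarity Level"].values
    # cross-validated R^2
    r2_scores = cross_val_score(LinearRegression(), X, y, scoring="r2", cv=cv)
    single_results.append({
        "Stat": col,
        "CV_R2_mean": np.mean(r2_scores),
        "CV_R2_std": np.std(r2_scores)
    })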

single_results_df = pd.DataFrame(single_results).sort_values("CV_R2_mean", ascending=False)
print(single_results_df)

#### Model 3

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X = df[stat_cols].copy()
y = df["Rarity Level"].values

# Train/test for a clean holdout evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Numeric pipeline
num_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

pre = ColumnTransformer([
    ("num", num_pipe, stat_cols)
])

reg_pipe = Pipeline(steps=[
    ("pre", pre),
    ("reg", LinearRegression())
])

reg_pipe.fit(X_train, y_train)
y_pred = reg_pipe.predict(X_test)
print("Holdout R^2:", r2_score(y_test, y_pred))

In [None]:
from sklearn.inspection import permutation_importance
from sklearn.ensemble import RandomForestRegressor

# Permutation importance on the linear model
perm = permutation_importance(reg_pipe, X_test, y_test, n_repeats=30, random_state=42)
perm_importance = pd.DataFrame({
    "feature": stat_cols,
    "importance_mean": perm.importances_mean,
    "importance_std": perm.importances_std
}).sort_values("importance_mean", ascending=False)
print(perm_importance)

# RandomForestRegressor feature importance
rf = Pipeline(steps=[
    ("pre", pre),
    ("rf", RandomForestRegressor(n_estimators=800, random_state=42))
])
rf.fit(X_train, y_train)
rf_r2 = r2_score(y_test, rf.predict(X_test))

print("RF Holdout R^2:", rf_r2)

rf_feat_imp = pd.Series(rf.named_steps["rf"].feature_importances_, index=stat_cols).sort_values(ascending=False)
print(rf_feat_imp)


## Discussion

- One or two paragraphs summarizing your findings from the modeling sections. Do the models answer your research question?
- Any other potential extensions of the analysis (e.g., include more features, get more data, try other models).
- List the contribution of each group member.