## Module 2 Homework

In this homework, we combine data from several sources, process it in pandas, and generate additional fields.

Unless stated otherwise, reuse the code snippets from the [Colab notebook](https://github.com/DataTalksClub/stock-markets-analytics-zoomcamp/blob/main/02-dataframe-analysis/%5B2025%5D_Module_02_Colab_Working_with_the_data.ipynb) covered in the livestream.

---
### Question 1: [IPO] Withdrawn IPOs by Company Type

**What is the total withdrawn IPO value (in $ millions) for the company class with the highest total withdrawal value?**

From the withdrawn IPO list ([stockanalysis.com/ipos/withdrawn](https://stockanalysis.com/ipos/withdrawn/)), collect and process the data to find out which company type saw the most withdrawn IPO value.

#### Steps:
1. Use `pandas.read_html()` with the URL above to load the IPO withdrawal table into a DataFrame.
   *This is similar to Code Snippet 1 discussed at the livestream.* You should get **99 entries**.
2. Create a new column called `Company Class`, categorizing company names based on patterns like:
   - “Acquisition Corp” or “Acquisition Corporation” → `Acq.Corp`
   - “Inc” or “Incorporated” → `Inc`
   - “Group” → `Group`
   - “Holdings” → `Holdings`
   - “Ltd” or “Limited” → `Ltd`
   - Others → `Other`

   Hint: make your function more robust by converting names to lowercase and splitting them into words before matching patterns.

3. Define a new field `Avg. price` by parsing the `Price Range` field (create a function and apply it to the `Price Range` column). Examples:
   - '$8.00-$10.00' → `9.0`  
   - '$5.00' → `5.0`  
   - '-' → `None`
4. Convert `Shares Offered` to numeric and clean missing or invalid values.
5. Create a new column:
   `Withdrawn Value = Shares Offered * Avg. price` (**71 non-null values**)
6. Group by `Company Class` and calculate total withdrawn value.
7. **Answer**: Which class had the highest **total** value of withdrawals?

---
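
A minimal sketch of the parsing described in step 3, assuming the price strings look like the three examples listed there. The function name `parse_avg_price` is hypothetical, and returning `None` for a missing price simply mirrors the examples in the task.

```python
import re
from typing import Optional


def parse_avg_price(price_range: str) -> Optional[float]:
    """Average the (up to two) numbers found in a price-range string."""
    # Find every number such as '8.00' or '10'; '$' and '-' are ignored
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", price_range)]
    if not numbers:
        return None  # e.g. '-' carries no price
    bounds = numbers[:2]  # low and high end of the range, or a single price
    return sum(bounds) / len(bounds)


# The three examples from step 3
for example in ["$8.00-$10.00", "$5.00", "-"]:
    print(repr(example), "->", parse_avg_price(example))
```

Extracting the numbers with a regular expression, rather than splitting on the hyphen, also copes with ranges written with spaces, such as `$3.00 - $4.00`.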

In [1]:
# IMPORTS
import numpy as np
import pandas as pd
import requests


# Financial data sources
import yfinance as yf
import pandas_datareader as pdr

# Data visualization
import plotly.graph_objs as go
import plotly.express as px

import time
from datetime import date

# for graphs
import matplotlib.pyplot as plt
     

In [14]:

import pandas as pd
import requests
from io import StringIO

def get_ipos(url):
    """
    Fetch the first HTML table from the given stockanalysis.com IPO page.

    Returns an empty DataFrame if the request or the parsing fails.
    """
    # Send a browser-like User-Agent header with the request
    headers = {
        'User-Agent': (
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
            'AppleWebKit/537.36 (KHTML, like Gecko) '
            'Chrome/58.0.3029.110 Safari/537.3'
        )
    }

    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()

        # Wrap the HTML text in StringIO to avoid the deprecation warning:
        # "Passing literal html to 'read_html' is deprecated and will be removed in a future version. To read from a literal string, wrap it in a 'StringIO' object."
        html_io = StringIO(response.text)
        tables = pd.read_html(html_io)

        if not tables:
            raise ValueError(f"No tables found at {url}.")

        return tables[0]

    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
    except ValueError as ve:
        print(f"Data error: {ve}")
    except Exception as ex:
        print(f"Unexpected error: {ex}")

    # Reached only when one of the handlers above caught an error
    return pd.DataFrame()

In [32]:
ipo_wdr = get_ipos("https://stockanalysis.com/ipos/withdrawn")

print(ipo_wdr.info())
print(ipo_wdr.isnull().sum())

ipo_wdr.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Symbol          100 non-null    object
 1   Company Name    100 non-null    object
 2   Price Range     100 non-null    object
 3   Shares Offered  100 non-null    object
dtypes: object(4)
memory usage: 3.3+ KB
None
Symbol            0
Company Name      0
Price Range       0
Shares Offered    0
dtype: int64


|   | Symbol | Company Name | Price Range | Shares Offered |
|---|--------|--------------|-------------|----------------|
| 0 | ODTX | Odyssey Therapeutics, Inc. | - | - |
| 1 | UNFL | Unifoil Holdings, Inc. | $3.00 - $4.00 | 2000000 |
| 2 | AURN | Aurion Biotech, Inc. | - | - |
| 3 | ROTR | PHI Group, Inc. | - | - |
| 4 | ONE | One Power Company | - | - |
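
The `info()` output reports 100 non-null values per column because a placeholder such as `-` is an ordinary string, not a missing value. A minimal sketch of how `pd.to_numeric(..., errors="coerce")` turns such placeholders into `NaN`, using a three-value Series modelled on the preview rows above (the variable names are illustrative):

```python
import pandas as pd

# Values modelled on the first rows of the preview: '-' marks "not available"
shares = pd.Series(["-", "2000000", "-"], name="Shares Offered")

# Before conversion, nothing counts as missing: '-' is just a string
print(shares.isna().sum())

# errors="coerce" replaces anything that cannot be parsed as a number with NaN
numeric = pd.to_numeric(shares, errors="coerce")
print(numeric)
print(numeric.isna().sum())
```

After the conversion the column is numeric, so it can be multiplied by the average price, and rows with a missing input produce a missing `Withdrawn Value`.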


In [33]:
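def classify_company(name):
    """
    Map a company name to a company class.

    Uses case-insensitive substring matching; the first matching pattern wins,
    so "Unifoil Holdings, Inc." is classified as "Inc", not "Holdings".
    Note that a substring such as "inc" also matches inside longer words.
    """
    name = name.lower()
    if "acquisition corp" in name or "acquisition corporation" in name:
        return "Acq.Corp"
    elif "inc" in name or "incorporated" in name:
        return "Inc"
    elif "group" in name:
        return "Group"
    elif "holdings" in name:
        return "Holdings"
    elif "ltd" in name or "limited" in name:
        return "Ltd"
    else:
        return "Other"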



import numpy as np
import re

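def extract_price(price_range):
    """Return the average price parsed from a 'Price Range' string, or NaN."""
    # Handle missing values and the '-' / '--' placeholders
    if pd.isna(price_range) or price_range.strip() in ('-', '--'):
        return np.nan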
    
    # Remove $ and M characters
    cleaned_price = price_range.replace('$', '').replace('M', '')
    
    # Check if it's a range (contains a hyphen)
    if '-' in cleaned_price:
        # Extract the two values from the range
        values = re.findall(r'\d+\.?\d*', cleaned_price)
        if len(values) >= 2:
            # Calculate midpoint
            return (float(values[0]) + float(values[1])) / 2
    else:
        # For absolute values
        values = re.findall(r'\d+\.?\d*', cleaned_price)
        if values:
            return float(values[0])
    
    # Fallback if no numbers found
    return np.nan
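
The hint in step 2 suggests splitting names into words before matching. A minimal sketch of that idea follows. The function name `classify_company_by_words` and the sample company names in the loop are hypothetical, and because it matches whole words its results can differ from the substring-based `classify_company` above, so totals computed with it may not match the output shown in this notebook.

```python
import re


def classify_company_by_words(name: str) -> str:
    """Classify a company name by whole words instead of substrings."""
    # Lowercase and keep alphabetic words only: 'Corp.' becomes 'corp'
    words = re.findall(r"[a-z]+", name.lower())
    word_set = set(words)
    # Adjacent word pairs, used for the two-word 'acquisition corp' pattern
    pairs = set(zip(words, words[1:]))

    if ("acquisition", "corp") in pairs or ("acquisition", "corporation") in pairs:
        return "Acq.Corp"
    if word_set & {"inc", "incorporated"}:
        return "Inc"
    if "group" in word_set:
        return "Group"
    if "holdings" in word_set:
        return "Holdings"
    if word_set & {"ltd", "limited"}:
        return "Ltd"
    return "Other"


# Made-up names for illustration; 'Princeton' contains the substring 'inc'
for sample in ["Princeton Robotics Ltd.", "Example Acquisition Corp. II"]:
    print(sample, "->", classify_company_by_words(sample))
```

With substring matching, the first sample would be labelled `Inc` because "princeton" contains "inc"; matching whole words labels it `Ltd`.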






In [36]:
# Step 2: classify each company by its name
ipo_wdr["Company Class"] = ipo_wdr["Company Name"].apply(classify_company)

# Step 3: average price parsed from the price range
ipo_wdr["Avg. price"] = ipo_wdr["Price Range"].apply(extract_price)

# Step 4: non-numeric entries such as '-' become NaN
ipo_wdr["Shares Offered"] = pd.to_numeric(ipo_wdr["Shares Offered"], errors='coerce')

# Step 5: withdrawn value in dollars (NaN when either input is missing)
ipo_wdr["Withdrawn Value"] = ipo_wdr["Shares Offered"] * ipo_wdr["Avg. price"]

# Step 6: group by 'Company Class' and calculate the total withdrawn value
class_totals = ipo_wdr.groupby("Company Class")["Withdrawn Value"].sum().sort_values(ascending=False)

# Step 7: find the class with the highest total withdrawn value
highest_class = class_totals.index[0]
highest_value = class_totals.iloc[0]

print(f"Class totals (in descending order):\n{class_totals}")
print(f"\nThe class with the highest total value of withdrawals is: {highest_class}")
print(f"Total withdrawn value: {highest_value:,.2f}")

Class totals (in descending order):
Company Class
Acq.Corp    4.021000e+09
Inc         2.257164e+09
Other       7.679200e+08
Ltd         3.217346e+08
Holdings    3.030000e+08
Group       3.378750e+07
Name: Withdrawn Value, dtype: float64

The class with the highest total value of withdrawals is: Acq.Corp
Total withdrawn value: 4,021,000,000.00
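
The question asks for the value in $ millions, while the total printed above is in dollars. A small sketch of the conversion, assuming the figure from this particular run (the site's table changes over time, so a fresh download may give a different number):

```python
# Total for Acq.Corp as printed in the run above, in dollars
highest_value = 4_021_000_000.00

# Convert dollars to millions of dollars
highest_value_millions = highest_value / 1_000_000

print(f"Total withdrawn value: ${highest_value_millions:,.0f} million")
```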
