# Task 1: [IPO] Withdrawn IPOs by Company Type

What is the total withdrawn IPO value (in $ millions) for the company class with the highest total withdrawal value?

From the withdrawn IPO list (stockanalysis.com/ipos/withdrawn), collect and process the data to find out which company type saw the most withdrawn IPO value.

## 1. Use pandas.read_html() with the URL above to load the withdrawn IPO table into a DataFrame. The process is similar to Code Snippet 1 discussed in the livestream. You should get 99 entries.

In [7]:
from pyspark.sql import SparkSession

try:
    spark = SparkSession.builder \
        .appName("Test") \
        .master("local[*]") \
        .getOrCreate()

    print("SparkSession started successfully!")
    spark.stop()

except Exception as e:
    print("Failed to start SparkSession:", e)


SparkSession started successfully!


In [8]:
from pyspark.sql import SparkSession
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Step 1: Create Spark session
spark = SparkSession.builder \
    .appName("IPOWithdrawn") \
    .master("local[*]") \
    .getOrCreate()


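# Step 2: Scrape the page
url = "https://stockanalysis.com/ipos/withdrawn"
response = requests.get(url, timeout=30)
response.raise_for_status()  # fail early on an HTTP error instead of parsing an error page
soup = BeautifulSoup(response.content, "html.parser")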

# Step 3: Find the table
table = soup.find("table")

# Step 4: Extract headers
headers = [th.text.strip() for th in table.find("thead").find_all("th")]

# Step 5: Extract rows
rows = []
for tr in table.find("tbody").find_all("tr"):
    cells = [td.text.strip() for td in tr.find_all("td")]
    rows.append(cells)

# Step 6: Create Pandas DataFrame
pandas_df = pd.DataFrame(rows, columns=headers)

# Step 7: Convert to PySpark DataFrame
spark_df = spark.createDataFrame(pandas_df)



In [3]:
# Count the distinct non-null symbols in the scraped table
spark_df.filter("Symbol is not NULL").select("Symbol").distinct().count()


103

In [4]:
spark_df.show(truncate=False)


+------+-------------------------------------------------+---------------+--------------+
|Symbol|Company Name                                     |Price Range    |Shares Offered|
+------+-------------------------------------------------+---------------+--------------+
|PRB   |Peak Resources LP                                |$13.00 - $15.00|4,700,000     |
|COC   |COR3 & Co. (Holdings) Limited                    |$4.00 - $5.00  |3,875,000     |
|DIMR  |DiamiR Biosciences Corp.                         |-              |-             |
|SLGB  |Smart Logistics Global Limited                   |$5.00 - $6.00  |1,000,000     |
|EIL   |E I L Holdings Limited                           |-              |-             |
|ODTX  |Odyssey Therapeutics, Inc.                       |-              |-             |
|UNFL  |Unifoil Holdings, Inc.                           |$3.00 - $4.00  |2,000,000     |
|AURN  |Aurion Biotech, Inc.                             |-              |-             |
|ROTR  |PH

In [9]:
spark_df.printSchema()

root
 |-- Symbol: string (nullable = true)
 |-- Company Name: string (nullable = true)
 |-- Price Range: string (nullable = true)
 |-- Shares Offered: string (nullable = true)



## 2. Create a new column called Company Class, categorizing company names based on patterns like:
- “Acquisition Corp” or “Acquisition Corporation” → Acq.Corp
- “Inc” or “Incorporated” → Inc
- “Group” → Group
- “Ltd” or “Limited” → Limited
- “Holdings” → Holdings
- Others → Other
- Order: Please follow the listed order of classes and assign the first matched value (e.g., for 'shenni holdings limited', you assign the 'Limited' class).


Hint: make your function more robust by converting names to lowercase and splitting into words before matching patterns.
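The hint above can be sketched in plain Python, which is convenient for checking the rules before porting them to Spark. The helper name `classify_company` and the sample name `Example Acquisition Corp.` are hypothetical and not part of the assignment. This is a minimal sketch, assuming punctuation should be ignored when splitting a name into words.

```python
import re


def classify_company(name: str) -> str:
    """Return the first matching company class, following the listed order."""
    # Lowercase, then split into alphanumeric words so "Inc." and "(Holdings)" match
    words = re.findall(r"[a-z0-9]+", name.lower())
    word_set = set(words)
    # "Acquisition Corp" is a two-word pattern, so check adjacent word pairs
    pairs = set(zip(words, words[1:]))

    if ("acquisition", "corp") in pairs or ("acquisition", "corporation") in pairs:
        return "Acq.Corp"
    if word_set & {"inc", "incorporated"}:
        return "Inc"
    if "group" in word_set:
        return "Group"
    if word_set & {"ltd", "limited"}:
        return "Limited"
    if "holdings" in word_set:
        return "Holdings"
    return "Other"


print(classify_company("shenni holdings limited"))    # Limited
print(classify_company("Unifoil Holdings, Inc."))     # Inc
print(classify_company("Example Acquisition Corp."))  # Acq.Corp (hypothetical name)
print(classify_company("Peak Resources LP"))          # Other
```

Matching on whole words rather than substrings avoids false hits such as "inc" inside a longer word.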

In [10]:
from pyspark.sql.functions import when, lower, col

# Classify each company by the first pattern that matches its lowercased name.
# The `when` chain is evaluated in order, so it follows the priority listed above.
spark_df = spark_df.withColumn(
    "Class",
    when(lower(col("`Company Name`")).rlike(r"\bacquisition corp(oration)?\b"), "Acq.Corp")
    .when(lower(col("`Company Name`")).rlike(r"\b(inc|incorporated)\b"), "Inc")
    .when(lower(col("`Company Name`")).rlike(r"\bgroup\b"), "Group")
    .when(lower(col("`Company Name`")).rlike(r"\b(ltd|limited)\b"), "Limited")
    .when(lower(col("`Company Name`")).rlike(r"\bholdings\b"), "Holdings")
    .otherwise("Other")
)


In [11]:
spark_df.groupby("Class").count().show()

+--------+-----+
|   Class|count|
+--------+-----+
|     Inc|   50|
|   Other|    8|
| Limited|   20|
|   Group|    3|
|Holdings|    1|
|Acq.Corp|   21|
+--------+-----+




## 3. Define a new field Avg. price by parsing the Price Range field (create a function and apply it to the Price Range column). Examples:
- '$8.00-$10.00' → 9.0
- '$5.00' → 5.0
- '-' → None
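The three examples above can be reproduced with a small plain-Python function, which may help when verifying the Spark expression. The name `parse_avg_price` is hypothetical. This is a minimal sketch, assuming every number in the string is a price and that a missing price is written as `-`.

```python
import re
from typing import Optional


def parse_avg_price(price_range: str) -> Optional[float]:
    """Average all prices found in the string, or return None if there are none."""
    # Drop thousands separators, then collect every number such as 8.00 or 10
    numbers = re.findall(r"\d+(?:\.\d+)?", price_range.replace(",", ""))
    if not numbers:
        return None  # e.g. "-" contains no price
    prices = [float(n) for n in numbers]
    return sum(prices) / len(prices)


print(parse_avg_price("$8.00-$10.00"))  # 9.0
print(parse_avg_price("$5.00"))         # 5.0
print(parse_avg_price("-"))             # None
```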

In [12]:
from pyspark.sql.functions import col, regexp_extract, when

# Objective: Parse Price Range to get average price
# Method: Use regex to extract prices, handle ranges and single values

# Get first price (e.g. $8.00)
price_low = regexp_extract(col("Price Range"), r"\$(\d+\.\d+)", 1)  # \$ matches $, (\d+\.\d+) captures digits.decimals

# Get second price if range exists (e.g. $10.00)
price_high = regexp_extract(col("Price Range"), r"\$(?:\d+\.\d+)\s*-\s*\$(\d+\.\d+)", 1)  # non-capturing first price, capture second

# Compute average price:
# - If "-", set None
# - If range, average both prices
# - Else, use single price
spark_df = spark_df.withColumn(
    "avg_price",
    when(col("Price Range") == "-", None)
    .when(price_high != "", (price_low.cast("float") + price_high.cast("float")) / 2)
    .otherwise(price_low.cast("float"))
)


In [13]:
spark_df.show(truncate=False)
spark_df.filter("avg_price is not NULL").select("Symbol").distinct().count()


+------+-------------------------------------------------+---------------+--------------+-------+---------+
|Symbol|Company Name                                     |Price Range    |Shares Offered|Class  |avg_price|
+------+-------------------------------------------------+---------------+--------------+-------+---------+
|PRB   |Peak Resources LP                                |$13.00 - $15.00|4,700,000     |Other  |14.0     |
|COC   |COR3 & Co. (Holdings) Limited                    |$4.00 - $5.00  |3,875,000     |Limited|4.5      |
|DIMR  |DiamiR Biosciences Corp.                         |-              |-             |Other  |NULL     |
|SLGB  |Smart Logistics Global Limited                   |$5.00 - $6.00  |1,000,000     |Limited|5.5      |
|EIL   |E I L Holdings Limited                           |-              |-             |Limited|NULL     |
|ODTX  |Odyssey Therapeutics, Inc.                       |-              |-             |Inc    |NULL     |
|UNFL  |Unifoil Holdings, In

75

## 5. Create a new column:
Withdrawn Value = Shares Offered \* Avg Price (71 non-null values)
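As a quick check of the formula, the calculation can be sketched in plain Python for single rows. The helper names `parse_shares` and `withdrawn_value` are hypothetical; the sample inputs are taken from the first rows of the table shown earlier. The sketch assumes a missing share count is written as `-` and that a missing input on either side makes the result missing.

```python
from typing import Optional


def parse_shares(shares_offered: str) -> Optional[int]:
    """Convert a string like '4,700,000' to an int; return None for '-' or ''."""
    cleaned = shares_offered.replace(",", "").strip()
    return int(cleaned) if cleaned.isdigit() else None


def withdrawn_value(shares_offered: str, avg_price: Optional[float]) -> Optional[float]:
    """Shares Offered * Avg Price in dollars, or None if either input is missing."""
    shares = parse_shares(shares_offered)
    if shares is None or avg_price is None:
        return None
    return shares * avg_price


print(withdrawn_value("4,700,000", 14.0))  # 65800000.0
print(withdrawn_value("3,875,000", 4.5))   # 17437500.0
print(withdrawn_value("-", None))          # None
```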

In [14]:
from pyspark.sql.functions import when, col, regexp_replace

# Convert "Shares Offered" to an integer: "-" or "" becomes NULL, commas are stripped
spark_df = spark_df.withColumn(
    "share_offered_int",
    when((col("`Shares Offered`") == "-") | (col("`Shares Offered`") == ""), None)
    .otherwise(regexp_replace(col("`Shares Offered`"), ",", "").cast("int"))
)
# Withdrawn value in dollars; NULL whenever the price or the share count is NULL
df = spark_df.withColumn("withdrawn_value", col("avg_price") * col("share_offered_int"))


In [15]:
df.printSchema()

root
 |-- Symbol: string (nullable = true)
 |-- Company Name: string (nullable = true)
 |-- Price Range: string (nullable = true)
 |-- Shares Offered: string (nullable = true)
 |-- Class: string (nullable = false)
 |-- avg_price: double (nullable = true)
 |-- share_offered_int: integer (nullable = true)
 |-- withdrawn_value: double (nullable = true)



In [16]:
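from pyspark.sql.functions import col, sum as _sum

# Silence Spark's INFO/WARN log output for the aggregation below
spark.sparkContext.setLogLevel("ERROR")

# Note: withdrawn_value is already a double, so this float cast only narrows precision
df = df.withColumn("withdrawn_value_num", col("withdrawn_value").cast("float"))

# Total withdrawn value (in dollars) per company class, largest first
df.groupby("class").agg(
    _sum("withdrawn_value_num").alias("total_withdrawn_value")
).orderBy(col("total_withdrawn_value").desc()).show(truncate=False)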


+--------+---------------------+
|class   |total_withdrawn_value|
+--------+---------------------+
|Acq.Corp|4.021E9              |
|Inc     |2.257164234E9        |
|Other   |8.33720015E8         |
|Limited |5.726720855E8        |
|Holdings|7.5E7                |
|Group   |2.71875E7            |
+--------+---------------------+
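
The task asks for the answer in $ millions, while the totals above are in dollars. The conversion can be sketched in plain Python using the figures copied from this output; the variable names `totals_usd` and `totals_musd` are hypothetical. The numbers reflect this particular run and may differ if the source page changes.

```python
# Totals in dollars, copied from the aggregation output above
totals_usd = {
    "Acq.Corp": 4.021e9,
    "Inc": 2.257164234e9,
    "Other": 8.33720015e8,
    "Limited": 5.726720855e8,
    "Holdings": 7.5e7,
    "Group": 2.71875e7,
}

# Convert dollars to $ millions
totals_musd = {cls: round(value / 1e6, 2) for cls, value in totals_usd.items()}

# Pick the class with the largest total
top_class = max(totals_musd, key=totals_musd.get)
print(top_class, totals_musd[top_class])  # Acq.Corp 4021.0
```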

