## Create Final Table
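After cleaning, course information and enrollment information are still mixed together in a single table. This introduces redundancy and causes practical problems, for example:
- We want to support looking up courses by professor name in the future, but some courses are taught by several professors, which makes it hard to separate one course from another.
- We may later add rating data for courses in different quarters; adding it to the current table would make the table very bloated.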

We therefore split the cleaned table into four tables:
- Professors
  - Stores professor information
  - `prof_id`: primary key
  - `prof_last_name`: last name, not nullable
  - `prof_first_name`: first name, nullable
  - `prof_middle_name`: middle name, nullable

- Courses
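  - Stores course information
  - `course_offering_id`: primary key
  - `department`: department
  - `course_id`: course number
  - `instructor`: a key attribute. It is the `prof` column of the original table; a course's instructor is one of the attributes that identify an offering.
    - The Professors table only holds individual professors.
    - Courses in the original table carry `department`, `course_id`, and similar fields.
    - A professor may teach several courses, and the same `department-course_id` may be offered as several separate classes at once, each taught by a different professor.
    - Without the `instructor` column, those separate classes (`department-course_id-instructor`) would be treated as a single course (`department-course_id`) and linked to every professor who teaches it.
  - `year`: academic year
  - `quarter`: quarter
  - `total`: total number of seats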

- Course_Professors
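  - Link table between courses and professors. It is a separate table rather than a column because one course can have several professors.
  - `course_offering_id`: foreign key to the Courses table
  - `prof_id`: foreign key to the Professors table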

- Enrollment_Snapshots
  - Enrollment snapshot table
  - `course_offering_id`: foreign key to the Courses table
  - `date`: date
  - `enrolled_ct`: number of enrolled students
  - `waitlist`: number of students on the waitlist
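
The four-table layout above can be sketched with Python's built-in `sqlite3` module. This is only an illustration of the keys and relationships, not the storage used in this notebook (the tables are written as CSV files to S3). The column types, the ids `p1`, `p2`, `c1`, the department `DEPT`, and the seat count are hypothetical placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE professors (
    prof_id          TEXT PRIMARY KEY,
    prof_last_name   TEXT NOT NULL,
    prof_first_name  TEXT,
    prof_middle_name TEXT
);
CREATE TABLE courses (
    course_offering_id TEXT PRIMARY KEY,
    department TEXT, course_id TEXT, instructor TEXT,
    year INTEGER, quarter TEXT, total INTEGER
);
CREATE TABLE course_professors (
    course_offering_id TEXT REFERENCES courses(course_offering_id),
    prof_id            TEXT REFERENCES professors(prof_id),
    PRIMARY KEY (course_offering_id, prof_id)
);
CREATE TABLE enrollment_snapshots (
    course_offering_id TEXT REFERENCES courses(course_offering_id),
    date TEXT, enrolled_ct INTEGER, waitlist INTEGER
);
""")

# One hypothetical offering taught jointly by two professors.
conn.executemany(
    "INSERT INTO professors VALUES (?, ?, ?, ?)",
    [("p1", "Bafna", "Vineet", None), ("p2", "Zhong", "Sheng", None)],
)
conn.execute(
    "INSERT INTO courses VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("c1", "DEPT", "100", "Bafna; Vineet & Zhong; Sheng", 2025, "Winter", 50),
)
conn.executemany(
    "INSERT INTO course_professors VALUES (?, ?)",
    [("c1", "p1"), ("c1", "p2")],
)
conn.execute(
    "INSERT INTO enrollment_snapshots VALUES (?, ?, ?, ?)",
    ("c1", "2025-01-02", 5, 0),
)

def courses_taught_by(last_name):
    """Look up offerings by professor last name through the link table."""
    return conn.execute(
        """
        SELECT c.department, c.course_id
        FROM professors p
        JOIN course_professors cp ON cp.prof_id = p.prof_id
        JOIN courses c ON c.course_offering_id = cp.course_offering_id
        WHERE p.prof_last_name = ?
        ORDER BY c.department, c.course_id
        """,
        (last_name,),
    ).fetchall()
```

With the link table in place, a jointly taught offering is reachable from either professor without parsing the combined `instructor` string at query time.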

In the future we may also add tables such as Comments, Course_Rating, Professor_Rating, and Department_Requirement.

## Create Process

#### Required parameters and test S3 connection

In [0]:
import json
import os
import uuid

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType
from pyspark.sql.functions import (
    col, lit, when, expr,
    sum, avg, max, min, count, countDistinct, first, last, mean, stddev,
    collect_list, collect_set, approx_count_distinct,
    from_unixtime, to_timestamp, date_format, row_number,
    split, locate, explode, trim, substring, size,
    sha2, concat_ws,
)


# Required parameters for connecting to AWS S3
AWS_ACCESS_KEY = ""
AWS_SECRET_KEY = ""
BUCKET_NAME = ""
REGION = ""

In [0]:
spark.conf.set("fs.s3a.access.key", AWS_ACCESS_KEY)
spark.conf.set("fs.s3a.secret.key", AWS_SECRET_KEY)

In [0]:
# Base S3 paths
base_path = f"s3a://{BUCKET_NAME}/ucsd"
path_final_data = f"{base_path}/final/final"
path_final_table = f"{base_path}/final_table"

try:
    df = spark.read.csv(f"{path_final_data}", header=True, inferSchema=True)
    # show() prints the rows itself and returns None, so it is not wrapped in display()
    df.show(3)
    df.printSchema()
except Exception as e:
    print(f"Table read failed: {e}")

+--------------------+----------+-----+--------+-----------+----+-------+----------+---------+
|                prof|      date|total|waitlist|enrolled_ct|year|quarter|department|course_id|
+--------------------+----------+-----+--------+-----------+----+-------+----------+---------+
|Solomon; Amanda L...|2025-01-03|   -1|       0|          5|2025| Winter|      AAPI|      198|
|Solomon; Amanda L...|2025-01-02|   -1|       0|          5|2025| Winter|      AAPI|      198|
|Solomon; Amanda L...|2025-01-18|   -1|       0|          7|2025| Winter|      AAPI|      198|
+--------------------+----------+-----+--------+-----------+----+-------+----------+---------+
only showing top 3 rows

root
 |-- prof: string (nullable = true)
 |-- date: date (nullable = true)
 |-- total: integer (nullable = true)
 |-- waitlist: integer (nullable = true)
 |-- enrolled_ct: integer (nullable = true)
 |-- year: integer (nullable = true)
 |-- quarter: string (nullable = true)
 |-- department: string (nullable = 

#### Tool functions

In [0]:
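# UUID generator
# Deprecated
# Reason: uuid4() runs on every invocation and returns a fresh value each time.
# Because Spark DataFrames are lazily evaluated and immutable, withColumn(...) and other transformations never modify the original DataFrame; they return a new one.
# Each time such a derived DataFrame is evaluated, the UUID function runs again,
# so the same row ends up with different UUIDs in different DataFrames.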
uuid_udf = F.udf(lambda: str(uuid.uuid4()), StringType())

# Key generator: deterministic SHA-256 hash of the given columns
def generate_id(df, columns, key_col_name):
    df = df.withColumn(key_col_name, sha2(concat_ws("||", *columns), 256))
    return df

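# Split prof into first_name, last_name, middle_name
def split_prof_name(df):

    # Split courses taught by several professors.
    # When a course has several professors, their names are joined with "&",
    # e.g. Bafna; Vineet & Zhong; Sheng
    # Explode these into one row per professor.
    df = df.withColumn("prof", explode(split(col("prof"), "& "))) \
            .withColumn("prof", trim(col("prof")))

    # A professor name has the form: last_name; first_name middle_name (the middle name may be missing)
    # Split it into three columns.
    # If there is no middle name, prof_middle_name is null.
    # If the instructor is Staff, both prof_first_name and prof_last_name are "Staff".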

    df = df.withColumn("isStaff", col("prof") == lit("Staff"))

    df = df.withColumn(
        "prof_last_name",
        when(col("isStaff"), "Staff")
        .otherwise(trim(split(col("prof"), "; ").getItem(0)))
    ).withColumn(
        "first_middle_name",
        when(col("isStaff"), "Staff")
        .otherwise(trim(split(col("prof"), "; ").getItem(1)))
    )

    df = df.withColumn(
        "prof_first_name", 
        when(col("isStaff"), "Staff")
        .otherwise(split(col("first_middle_name"), " ", 2).getItem(0))
    ).withColumn(
        "prof_middle_name",
        when(col("isStaff"), lit(None))
        .otherwise(
            # Check whether a middle name exists
            when(size(split(col("first_middle_name"), " ")) > 1, split(col("first_middle_name"), " ", 2).getItem(1))
            .otherwise(lit(None))
        )
    )
    
    df = df.drop("isStaff", "first_middle_name")

    return df

# Prepare data for the Professors table
def create_prof_table_data(df_original):
    df = df_original
    df = df.select("prof").distinct()
    
    # Split professor names
    df = split_prof_name(df)
    df = df.drop("prof").distinct()

    # Generate the primary key column
    df = generate_id(df, ["prof_first_name", "prof_last_name", "prof_middle_name"], "prof_id")

    return df

# Prepare data for the Courses table
def create_courses_table_data(df):
    df = df.select("department", "course_id", "year", "quarter", "total", "prof").distinct()
    # Generate the primary key
    df = generate_id(df, ["department", "course_id", "year", "quarter", "total", "prof"], "course_offering_id")
    return df

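# Build the Courses-Professors link table
# The raw prof column is used to resolve the course offering before names are split
def create_courses_professors_table_data(courses, professors, registrations_original):

    # Resolve course_offering_id first, while prof still holds the full instructor string.
    # prof is part of the join key so that offerings taught by other instructors are not linked.
    df = registrations_original.select("year", "quarter", "department", "course_id", "total", "prof").distinct()
    df = df.join(courses, on=["year", "quarter", "department", "course_id", "total", "prof"], how="inner")

    # Adds prof_first_name, prof_last_name and prof_middle_name (one row per professor)
    df = split_prof_name(df)

    # Resolve prof_id by recomputing the hash and joining on it.
    # Joining on the name columns would drop professors whose prof_middle_name is null,
    # because null never equals null in an equality join.
    df = generate_id(df, ["prof_first_name", "prof_last_name", "prof_middle_name"], "prof_id")
    df = df.join(professors.select("prof_id"), on="prof_id", how="inner")

    # Keep only the two required columns
    df = df.select("prof_id", "course_offering_id").distinct()

    return df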

def create_enrollment_snapshots_table_data(registrations_original, courses):
    df = registrations_original
    
    # Look up course_offering_id
    df = df.join(courses, on=["year", "quarter", "department", "course_id", "prof", "total"], how="inner")

    # Keep only the required columns
    df = df.select("date", "waitlist", "enrolled_ct", "course_offering_id").distinct()

    return df

def clean_tables(professors, courses, courses_professors, enrollment_snapshots):

    # professors = professors.selectExpr("")
    courses = courses.withColumnRenamed("prof", "instructor")

    return professors, courses, courses_professors, enrollment_snapshots
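
To illustrate why a content hash replaced the UUID approach, here is a rough pure-Python analogue of `generate_id`. It is a sketch only: the helper name `stable_key` is hypothetical, and it assumes the Spark expression hashes the UTF-8 text produced by `concat_ws`, which skips null inputs.

```python
import hashlib
import uuid

def stable_key(values, sep="||"):
    """Pure-Python analogue of sha2(concat_ws("||", *columns), 256)."""
    # concat_ws skips NULL inputs, so None values are dropped before joining.
    joined = sep.join(str(v) for v in values if v is not None)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

row = ["Brandon", "Som", "D"]

# The hash depends only on the row's content, so recomputing it gives the same key.
key_a = stable_key(row)
key_b = stable_key(row)

# A random UUID differs on every call, which is what broke the earlier approach.
random_a = str(uuid.uuid4())
random_b = str(uuid.uuid4())

# Caveat: because nulls are skipped, rows that differ only in which column is null collide.
collision_a = stable_key(["A", None, "B"])
collision_b = stable_key(["A", "B", None])
```

Because the key is a function of the column values alone, two DataFrames derived from the same source rows agree on it, which lets the tables be joined on the generated id. The collision caveat is worth keeping in mind when several nullable columns feed one key.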


In [0]:
data_final = spark.read.csv(f"{path_final_data}", header=True, inferSchema=True)

In [0]:
df_professors = create_prof_table_data(data_final)
df_professors.show(3)

+--------------+---------------+----------------+--------------------+
|prof_last_name|prof_first_name|prof_middle_name|             prof_id|
+--------------+---------------+----------------+--------------------+
|           Som|        Brandon|               D|29d8e421bec3ce8b9...|
|  Sanchez Cruz|          Jorge|            null|42c8dd041bb069f27...|
|    Schurmeier|       Kimberly|            null|d3dc139bd6cc4c799...|
+--------------+---------------+----------------+--------------------+
only showing top 3 rows
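
The parsing rules behind this table can be checked on plain strings. The following is a minimal pure-Python sketch that mirrors the intent of `split_prof_name`; the function name `split_prof` is hypothetical, and it is not the Spark implementation itself.

```python
def split_prof(prof):
    """Split a raw `prof` string into (last, first, middle) tuples, one per instructor."""
    people = []
    # Co-instructors are joined with "& ", e.g. "Bafna; Vineet & Zhong; Sheng".
    for name in prof.split("& "):
        name = name.strip()
        if name == "Staff":
            # Staff has no real name: first and last are both "Staff", middle is missing.
            people.append(("Staff", "Staff", None))
            continue
        # Format: "last_name; first_name middle_name", where the middle name is optional.
        last, _, rest = name.partition("; ")
        first, _, middle = rest.strip().partition(" ")
        people.append((last.strip(), first or None, middle or None))
    return people

single = split_prof("Som; Brandon D")
joint = split_prof("Bafna; Vineet & Zhong; Sheng")
staff = split_prof("Staff")
```

Note that a last name containing a space, such as `Sanchez Cruz`, stays intact because only the part after `; ` is split on spaces.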



In [0]:
df_courses = create_courses_table_data(data_final)
df_courses.show(3)

+----------+---------+----+-------+-----+--------------------+--------------------+
|department|course_id|year|quarter|total|                prof|  course_offering_id|
+----------+---------+----+-------+-----+--------------------+--------------------+
|       AAS|       11|2025| Winter|   68|Butler; Elizabeth...|f18fb01db40425f1d...|
|      AESE|      241|2025| Winter|   35|Erat; Sanjiv & Wa...|8ef5ad892185ad15a...|
|       AAS|       10|2025| Winter|   68|Butler; Elizabeth...|c85a26fe19ff326df...|
+----------+---------+----+-------+-----+--------------------+--------------------+
only showing top 3 rows
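
The reason `prof` (later renamed `instructor`) is part of the course key can be shown with a small pure-Python example. The rows below are hypothetical: one `department-course_id` offered twice in the same quarter by different instructors.

```python
# Hypothetical offerings: same department and course number, different instructors.
offerings = [
    {"department": "DEPT", "course_id": "100", "year": 2025, "quarter": "Winter", "instructor": "Prof A"},
    {"department": "DEPT", "course_id": "100", "year": 2025, "quarter": "Winter", "instructor": "Prof B"},
]

def distinct_keys(rows, columns):
    """Return the set of distinct key tuples built from the given columns."""
    return {tuple(row[c] for c in columns) for row in rows}

base_columns = ["department", "course_id", "year", "quarter"]

# Without the instructor, the two offerings collapse into a single key.
keys_without_instructor = distinct_keys(offerings, base_columns)

# With the instructor, each offering keeps its own key.
keys_with_instructor = distinct_keys(offerings, base_columns + ["instructor"])
```

If the key omitted the instructor, both professors would end up linked to one merged course, which is the situation the design is meant to avoid.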



In [0]:
df_courses_professors = create_courses_professors_table_data(df_courses, df_professors, data_final)
df_courses_professors.show(3)

+--------------------+--------------------+
|             prof_id|  course_offering_id|
+--------------------+--------------------+
|8fdf7dc7d4b9cbfd3...|86dbdaa7645e34173...|
|8fdf7dc7d4b9cbfd3...|c523e25947201a453...|
|8fdf7dc7d4b9cbfd3...|fc10273071b29d19e...|
+--------------------+--------------------+
only showing top 3 rows
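
Building this link table involves matching professors whose `prof_middle_name` is often null. In SQL-style equality comparisons, NULL never equals NULL, so a join keyed on nullable name columns can silently drop such professors; joining on a hash key or using a null-safe comparison avoids this. The sketch below demonstrates the effect with `sqlite3`, where `IS` acts as a null-safe equality. The table names and the ids `p1` and `p2` are hypothetical; in PySpark, `Column.eqNullSafe` plays a comparable role.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE left_names  (last_name TEXT, first_name TEXT, middle_name TEXT);
CREATE TABLE right_names (last_name TEXT, first_name TEXT, middle_name TEXT, prof_id TEXT);
INSERT INTO left_names  VALUES ('Som', 'Brandon', 'D'), ('Schurmeier', 'Kimberly', NULL);
INSERT INTO right_names VALUES ('Som', 'Brandon', 'D', 'p1'), ('Schurmeier', 'Kimberly', NULL, 'p2');
""")

# Plain equality: the row with a NULL middle name never matches.
plain_matches = conn.execute("""
    SELECT r.prof_id
    FROM left_names l
    JOIN right_names r
      ON l.last_name = r.last_name
     AND l.first_name = r.first_name
     AND l.middle_name = r.middle_name
    ORDER BY r.prof_id
""").fetchall()

# Null-safe comparison on the nullable column: both rows match.
null_safe_matches = conn.execute("""
    SELECT r.prof_id
    FROM left_names l
    JOIN right_names r
      ON l.last_name = r.last_name
     AND l.first_name = r.first_name
     AND l.middle_name IS r.middle_name
    ORDER BY r.prof_id
""").fetchall()
```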



In [0]:
df_enrollment_snapshots = create_enrollment_snapshots_table_data(data_final, df_courses)
df_enrollment_snapshots.show(3)

+----------+--------+-----------+--------------------+
|      date|waitlist|enrolled_ct|  course_offering_id|
+----------+--------+-----------+--------------------+
|2024-11-13|       0|          2|6d1f02cafc5ab40a2...|
|2024-11-15|       0|          2|6d1f02cafc5ab40a2...|
|2024-11-20|       0|          2|6d1f02cafc5ab40a2...|
+----------+--------+-----------+--------------------+
only showing top 3 rows



In [0]:
# final clean
df_professors, df_courses, df_courses_professors, df_enrollment_snapshots = clean_tables(df_professors, df_courses, df_courses_professors, df_enrollment_snapshots)

In [0]:
df_enrollment_snapshots.join(df_courses, on="course_offering_id", how="inner").select("department", "course_id", "year", "quarter", "instructor").distinct().count()

Out[42]: 13467

In [0]:
data_final.select("department", "course_id", "year", "quarter", "prof").distinct().count()

Out[44]: 13467

## Store the table data

In [0]:
path_table_professors = f"{path_final_table}/professors"
path_table_courses = f"{path_final_table}/courses"
path_table_courses_professors = f"{path_final_table}/courses_professors"
path_table_enrollment_snapshots = f"{path_final_table}/enrollment_snapshots"

# write to s3
df_professors.coalesce(1).write.csv(path_table_professors, mode="overwrite", header=True)
df_courses.coalesce(1).write.csv(path_table_courses, mode="overwrite", header=True)
df_courses_professors.coalesce(1).write.csv(path_table_courses_professors, mode="overwrite", header=True)
df_enrollment_snapshots.coalesce(1).write.csv(path_table_enrollment_snapshots, mode="overwrite", header=True)

## Conclusion
We now have four datasets, each stored in its own folder under `{base_path}/final_table/`.