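# National Yang Ming Chiao Tung University (NYCU) Course Crawler v4.0 - Basic Information Edition

This notebook crawls the **basic information** for NYCU courses, including:
- Course ID and course name
- Instructor, credits, and hours
- **Structured class times** (day of week, periods, and clock times)
- **Structured classroom information** (classroom and floor)
- Enrollment limit and current enrollment
- Course type and English-taught flag

**Advantages**:
- Fast: a full run takes about 2-3 minutes
- **New format v2.0**: array format plus metadata
- **Structured data**: times and classrooms are parsed automatically
- **Numeric types**: credits and hours are stored as numbers

**Not included**: detailed information such as the syllabus, prerequisites, and grading policy

**Version**: v4.0 | **Data format**: v2.0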

## Step 1: Configure Parameters

Set the academic year and semester below:

In [None]:
# ============= Parameters =============
YEAR = 114          # Academic year
SEMESTER = 1        # Semester (1 = first semester, 2 = second semester)
# ======================================

## Step 2: Install Required Packages

Run this cell to install the requests package:

In [None]:
!pip install requests -q

## Step 3: Crawler Code

Run this cell to load the crawler class:

In [None]:
import json
import re
import requests
import warnings
from datetime import datetime, timedelta

warnings.filterwarnings('ignore')

class NYCUCrawler:
    """NYCU course crawler v4.0"""
    
    # Period-to-time lookup table
    PERIOD_TIME_MAP = {
        'y': ('06:00', '06:50'), 'z': ('07:00', '07:50'),
        '1': ('08:00', '08:50'), '2': ('09:00', '09:50'),
        '3': ('10:10', '11:00'), '4': ('11:10', '12:00'),
        'n': ('12:10', '13:00'),
        '5': ('13:20', '14:10'), '6': ('14:20', '15:10'),
        '7': ('15:30', '16:20'), '8': ('16:30', '17:20'),
        '9': ('17:30', '18:20'),
        'a': ('18:25', '19:15'), 'b': ('19:20', '20:10'),
        'c': ('20:15', '21:05'), 'd': ('21:10', '22:00')
    }
    
    # Day-of-week lookup table
    DAY_MAP = {
        'M': (1, 'Monday'), 'T': (2, 'Tuesday'), 'W': (3, 'Wednesday'),
        'R': (4, 'Thursday'), 'F': (5, 'Friday'), 'S': (6, 'Saturday'),
        'U': (7, 'Sunday')
    }
    
    def __init__(self, year, semester):
        self.year = year
        self.semester = semester
        self.acysem = str(year) + str(semester)
        self.flang = "zh-tw"
        self.headers = {
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
        }
        self.dep_list = []
        self.courses_list = []  # v4.0: stored as a list
        
    def parse_schedule_structured(self, time_classroom_str):
        """Parse a time-classroom string into a structured schedule list."""
        schedule = []
        if not time_classroom_str or time_classroom_str.strip() == '':
            return schedule
        
        segments = time_classroom_str.split(',')
        
        for segment in segments:
            segment = segment.strip()
            if not segment:
                continue
            
            # Split on the first '-' only, so a classroom name that
            # contains a hyphen is not truncated.
            parts = segment.split('-', 1)
            time_part = parts[0]
            classroom_part = parts[1] if len(parts) > 1 else ''
            
            pattern = r'([MTWRFSU])([1-9yznabcd]+)'
            matches = re.findall(pattern, time_part)
            
            for day_code, periods_str in matches:
                if day_code not in self.DAY_MAP:
                    continue
                
                day_num, day_name = self.DAY_MAP[day_code]
                periods = list(periods_str)
                # Map period codes to numbers. Note that 'z' -> 1 and 'n' -> 7
                # share values with periods '1' and '7', so they cannot be
                # told apart from the `periods` list alone.
                special_periods = {'y': 0, 'z': 1, 'n': 7,
                                   'a': 10, 'b': 11, 'c': 12, 'd': 13}
                period_numbers = [
                    special_periods[p] if p in special_periods else int(p)
                    for p in periods if p in self.PERIOD_TIME_MAP
                ]
                
                if not period_numbers:
                    continue
                
                first_period = periods[0]
                last_period = periods[-1]
                time_start = self.PERIOD_TIME_MAP.get(first_period, ('', ''))[0]
                time_end = self.PERIOD_TIME_MAP.get(last_period, ('', ''))[1]
                
                classroom = ''
                floor = ''
                if classroom_part:
                    floor_match = re.search(r'\[([^\]]+)\]', classroom_part)
                    if floor_match:
                        floor = floor_match.group(1)
                        classroom = classroom_part[:floor_match.start()]
                    else:
                        classroom = classroom_part
                
                schedule.append({
                    "day": day_num,
                    "day_name": day_name,
                    "periods": period_numbers,
                    "time_start": time_start,
                    "time_end": time_end,
                    "classroom": classroom,
                    "floor": floor
                })
        
        return schedule
    
    def get_type(self):
        res = requests.get('https://timetable.nycu.edu.tw/?r=main/get_type', 
                          headers=self.headers, verify=False)
        return res.json()
    
    def get_category(self, ftype):
        res = requests.post('https://timetable.nycu.edu.tw/?r=main/get_category', 
                          data={'ftype': ftype, 'flang': self.flang, 
                                'acysem': self.acysem, 'acysemend': self.acysem},
                          headers=self.headers, verify=False)
        return res.json()
    
    def get_college(self, fcategory, ftype):
        res = requests.post('https://timetable.nycu.edu.tw/?r=main/get_college',
                          data={'fcategory': fcategory, 'ftype': ftype, 
                                'flang': self.flang, 'acysem': self.acysem, 
                                'acysemend': self.acysem},
                          headers=self.headers, verify=False)
        return res.json()
    
    def get_dep(self, fcollege, fcategory, ftype):
        res = requests.post('https://timetable.nycu.edu.tw/?r=main/get_dep',
                          data={'fcollege': fcollege, 'fcategory': fcategory, 
                                'ftype': ftype, 'flang': self.flang, 
                                'acysem': self.acysem, 'acysemend': self.acysem},
                          headers=self.headers, verify=False)
        return res.json()
    
    def get_cos(self, dep):
        url = "https://timetable.nycu.edu.tw/?r=main/get_cos_list"
        data = {
            "m_acy": self.year, "m_sem": self.semester,
            "m_acyend": self.year, "m_semend": self.semester,
            "m_dep_uid": dep, "m_group": "**", "m_grade": "**",
            "m_class": "**", "m_option": "**", "m_crsname": "**",
            "m_teaname": "**", "m_cos_id": "**", "m_cos_code": "**",
            "m_crstime": "**", "m_crsoutline": "**", "m_costype": "**",
            "m_selcampus": "**"
        }
        
        r = requests.post(url, headers=self.headers, verify=False, data=data)
        if r.status_code != requests.codes.ok:
            return
        
        raw_data = json.loads(r.text)
        existing_ids = {course['id'] for course in self.courses_list}
        
        for dep_value in raw_data:
            language = raw_data[dep_value]["language"]
            for dep_content in raw_data[dep_value]:
                if re.match("^[1-2]+$", dep_content) is None:
                    continue
                for cos_id in raw_data[dep_value][dep_content]:
                    if cos_id in existing_ids:
                        continue
                    
                    raw_cos_data = raw_data[dep_value][dep_content][cos_id]
                    schedule = self.parse_schedule_structured(raw_cos_data["cos_time"])
                    
                    brief_code = list(raw_data[dep_value]["brief"][cos_id].keys())[0]
                    brief = raw_data[dep_value]["brief"][cos_id][brief_code]['brief'].split(',')
                    tags = [tag.strip() for tag in brief if tag.strip()]
                    
                    name = raw_cos_data["cos_cname"].replace("(英文授課)", '').replace("(英文班)", '').strip()
                    
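                    def safe_int(val, default=0):
                        # Fall back to the default for empty or non-numeric values
                        try:
                            return int(val) if val and str(val).strip() else default
                        except (TypeError, ValueError):
                            return default
                    
                    def safe_float(val, default=0.0):
                        # Fall back to the default for empty or non-numeric values
                        try:
                            return float(val) if val and str(val).strip() else default
                        except (TypeError, ValueError):
                            return default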
                    
                    course = {
                        "id": raw_cos_data["cos_id"],
                        "semester_code": self.acysem,
                        "name": name,
                        "teacher": raw_cos_data["teacher"],
                        "credit": safe_float(raw_cos_data["cos_credit"]),
                        "hours": safe_float(raw_cos_data["cos_hours"]),
                        "type": raw_cos_data["cos_type"],
                        "enrollment": {
                            "limit": safe_int(raw_cos_data["num_limit"]),
                            "current": safe_int(raw_cos_data["reg_num"])
                        },
                        "schedule": schedule,
                        "english_taught": language[cos_id]["授課語言代碼"] == "en-us",
                        "tags": tags,
                        "raw_time_classroom": raw_cos_data["cos_time"]
                    }
                    
                    self.courses_list.append(course)
                    existing_ids.add(cos_id)
    
    def crawl(self):
        print("=" * 70)
        print(f"NYCU Course Crawler v4.0 - {self.year}-{self.semester} (basic information)")
        print("=" * 70)
        
        start_time = datetime.now()
        
        print("\nFetching basic course data...")
        types = self.get_type()
        
        for i in range(len(types)):
            ftype = types[i]["uid"]
            print(f"  Processing: {types[i]['cname']}")
            categories = self.get_category(ftype)
            
            if types[i]["cname"] == "其他課程":
                for fcategory in categories.keys():
                    if fcategory not in self.dep_list:
                        self.dep_list.append(fcategory)
                        self.get_cos(fcategory)
            else:
                for fcategory in categories.keys():
                    colleges = self.get_college(fcategory, ftype)
                    if len(colleges):
                        for fcollege in colleges.keys():
                            deps = self.get_dep(fcollege, fcategory, ftype)
                            if len(deps):
                                for fdep in deps.keys():
                                    if fdep not in self.dep_list:
                                        self.dep_list.append(fdep)
                                        self.get_cos(fdep)
                    else:
                        deps = self.get_dep("", fcategory, ftype)
                        if len(deps):
                            for fdep in deps.keys():
                                if fdep not in self.dep_list:
                                    self.dep_list.append(fdep)
                                    self.get_cos(fdep)
        
        # Build metadata (display names are kept in Chinese:
        # first semester, second semester, summer session)
        semester_name_map = {1: "上學期", 2: "下學期", 'X': "暑期"}
        semester_name = semester_name_map.get(self.semester, f"第{self.semester}學期")
        
        metadata = {
            "semester": f"{self.year}-{self.semester}",
            "semester_name": f"{self.year}學年度{semester_name}",
            "academic_year": self.year,
            "term": self.semester,
            "total_courses": len(self.courses_list),
            # Local time with an explicit UTC offset (ISO 8601)
            "last_updated": datetime.now().astimezone().isoformat(),
            "source_url": "https://timetable.nycu.edu.tw",
            "crawler_version": "4.0",
            "data_format_version": "2.0"
        }
        
        end_time = datetime.now()
        elapsed = end_time - start_time
        
        print(f"\nFetched basic data for {len(self.courses_list)} courses")
        print(f"Elapsed time: {elapsed}")
        print("Data format: v2.0 (array format)")
        print("=" * 70)
        
        return {
            "metadata": metadata,
            "courses": self.courses_list
        }

print("Crawler class loaded (v4.0)")

## Step 4: Run the Crawler

Start crawling the course data:

In [None]:
# Create a crawler instance and run it
crawler = NYCUCrawler(YEAR, SEMESTER)
data = crawler.crawl()  # v4.0: returns a dict with metadata and courses

## Step 5: Inspect the Results

Show summary statistics and a sample record:

In [None]:
# === Data format v2.0 statistics ===

# Metadata
print("=== Metadata ===")
print(f"Semester: {data['metadata']['semester']} ({data['metadata']['semester_name']})")
print(f"Total courses: {data['metadata']['total_courses']}")
print(f"Data format version: {data['metadata']['data_format_version']}")
print(f"Crawler version: {data['metadata']['crawler_version']}")
print()

# Course statistics
courses = data['courses']
print("=== Course statistics ===")
print(f"Total courses: {len(courses)}")

# English-taught courses
english_courses = sum(1 for c in courses if c.get('english_taught', False))
print(f"English-taught: {english_courses} ({english_courses/len(courses)*100:.1f}%)")

# Breakdown by course type
course_types = {}
for c in courses:
    ctype = c.get('type', 'Unknown')
    course_types[ctype] = course_types.get(ctype, 0) + 1

print("\nCourse types:")
for ctype, count in sorted(course_types.items(), key=lambda x: x[1], reverse=True):
    print(f"  {ctype}: {count} ({count/len(courses)*100:.1f}%)")

# Show the full record of the first course (v2.0 format)
print("\n=== Sample course (v2.0 format) ===")
first_course = courses[0]
print(json.dumps(first_course, ensure_ascii=False, indent=2))

## Step 6: Download the Data

Save the data as a JSON file and download it:

In [None]:
import os

# Save as a JSON file (v2.0 format)
filename = f"{YEAR}-{SEMESTER}_data.json"
with open(filename, 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=2)

file_size = os.path.getsize(filename)
print(f"Data saved to: {filename}")
print(f"File size: {file_size/1024/1024:.2f} MB")
print("Data format: v2.0 (array format + metadata)")
print()
print("Data structure:")
print(f"  - metadata: {len(data['metadata'])} fields")
print(f"  - courses: {len(data['courses'])} courses")

# Download the file (Google Colab)
try:
    from google.colab import files
    files.download(filename)
    print(f"\nDownload started: {filename}")
except ImportError:
    print(f"\nNot running in Colab; the file was saved locally: {filename}")

## Data Structure Reference (v2.0 format)

### Overall Structure

```json
{
  "metadata": {
    "semester": "114-1",
    "semester_name": "114學年度上學期",
    "total_courses": 8028,
    "data_format_version": "2.0",
    "crawler_version": "4.0"
  },
  "courses": [ /* array of courses */ ]
}
```
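
As a minimal sketch of how a saved file with this structure might be read back, the function below loads the JSON and checks that `metadata.total_courses` agrees with the length of `courses`. The name `load_course_data` is hypothetical and not part of the crawler; the check is only an assumption about what a consumer may want to verify.

```python
import json

def load_course_data(path):
    """Hypothetical helper: load a saved v2.0 file and sanity-check it."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    declared = data["metadata"]["total_courses"]
    actual = len(data["courses"])
    # The crawler sets total_courses from the course list, so the two
    # numbers should match in an unmodified file.
    if declared != actual:
        raise ValueError(
            f"metadata declares {declared} courses, file contains {actual}"
        )
    return data
```

For the file written in Step 6, this would be called as `load_course_data("114-1_data.json")`.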

### Fields of a Single Course

```json
{
  "id": "515002",
  "semester_code": "1141",
  "name": "微分方程",
  "teacher": "楊春美",
  "credit": 3.0,              // v2.0: numeric type
  "hours": 3.0,               // v2.0: numeric type
  "type": "必修",
  "enrollment": {             // v2.0: nested object
    "limit": 55,
    "current": 66
  },
  "schedule": [               // v2.0: structured times
    {
      "day": 1,
      "day_name": "Monday",
      "periods": [3, 4],
      "time_start": "10:10",
      "time_end": "12:00",
      "classroom": "EE102",
      "floor": "GF"
    },
    {
      "day": 3,
      "day_name": "Wednesday",
      "periods": [2],
      "time_start": "09:00",
      "time_end": "09:50",
      "classroom": "EE102",
      "floor": "GF"
    }
  ],
  "english_taught": false,
  "tags": ["工程數學"],
  "raw_time_classroom": "M34W2-EE102[GF]"
}
```
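
The field list above can be turned into a small type check. The following is a sketch only: `check_course` is a hypothetical helper, and the expected types are inferred from the example record and from the crawler code (for instance, `credit` and `hours` come from `safe_float`), not from any published schema.

```python
# Expected Python type for each top-level field (assumed from the example).
EXPECTED_FIELDS = {
    "id": str, "semester_code": str, "name": str, "teacher": str,
    "credit": float, "hours": float, "type": str,
    "enrollment": dict, "schedule": list,
    "english_taught": bool, "tags": list, "raw_time_classroom": str,
}

def check_course(course):
    """Hypothetical helper: return a list of problems (empty if none)."""
    problems = []
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in course:
            problems.append(f"missing field: {field}")
        elif not isinstance(course[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    return problems
```

Running it over `data['courses']` and collecting the non-empty results gives a quick way to spot records that do not follow the v2.0 layout.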

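### Main Improvements in v2.0

| Item | Old format | New format v2.0 |
|------|------------|-----------------|
| Data structure | Object `{}` | Array `[]` + metadata |
| Credits/hours | String `"3.00"` | Number `3.0` |
| Time information | String array `["M3", "M4"]` | Structured objects (with times and classroom) |
| Enrollment | Two separate fields | Nested `enrollment` object |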

### Time Code Reference

- **Day codes**: M=Monday, T=Tuesday, W=Wednesday, R=Thursday, F=Friday, S=Saturday, U=Sunday
- **Period codes**: 
  - y = 06:00-06:50
  - z = 07:00-07:50
  - 1-9 = 08:00-18:20
  - n = 12:10-13:00 (lunch break)
  - a-d = 18:25-22:00 (evening)
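
To illustrate how these codes combine, the sketch below expands a time string such as `M34W2` into per-day period codes. The helper name `expand_time_code` is hypothetical; it reuses the same regular expression as `parse_schedule_structured` but is not part of the crawler.

```python
import re

DAY_NAMES = {
    "M": "Monday", "T": "Tuesday", "W": "Wednesday", "R": "Thursday",
    "F": "Friday", "S": "Saturday", "U": "Sunday",
}

def expand_time_code(code):
    """Hypothetical helper: split a time code into (day name, period codes)."""
    pairs = []
    # One uppercase day letter followed by one or more period characters.
    for day, periods in re.findall(r"([MTWRFSU])([1-9yznabcd]+)", code):
        pairs.append((DAY_NAMES[day], list(periods)))
    return pairs

print(expand_time_code("M34W2"))
# [('Monday', ['3', '4']), ('Wednesday', ['2'])]
```

Day letters are uppercase and period letters are lowercase, which is why a single pattern can separate them without any delimiter.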

### Usage Examples

```python
# Get all courses
courses = data['courses']

# Filter required courses ('必修' is the "required" value of the type field)
required = [c for c in courses if c['type'] == '必修']

# Filter 3-credit courses
three_credits = [c for c in courses if c['credit'] == 3.0]

# Filter courses that meet on Monday
monday_courses = [c for c in courses 
                  if any(s['day'] == 1 for s in c['schedule'])]

# Print the time information for the first five courses
for course in courses[:5]:
    for sched in course['schedule']:
        print(f"{course['name']}: {sched['day_name']} {sched['time_start']}-{sched['time_end']}")
```
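
One further use of the structured `schedule` field is checking whether two courses clash. The sketch below is an illustration under assumptions: `find_conflicts` is a hypothetical helper that compares `day` and `periods` only. Because the crawler maps period `z` to 1 and `n` to 7, those periods are indistinguishable from periods 1 and 7 here, so a reported clash involving 1 or 7 may need a manual check against `raw_time_classroom`.

```python
def find_conflicts(course_a, course_b):
    """Hypothetical helper: return (day, shared periods) for each overlap."""
    conflicts = []
    for a in course_a["schedule"]:
        for b in course_b["schedule"]:
            if a["day"] != b["day"]:
                continue
            # Periods that appear in both entries on the same day.
            shared = sorted(set(a["periods"]) & set(b["periods"]))
            if shared:
                conflicts.append((a["day"], shared))
    return conflicts

# Two made-up records containing only the field the helper reads.
calculus = {"schedule": [{"day": 1, "periods": [3, 4]}]}
physics = {"schedule": [{"day": 1, "periods": [4, 5]},
                        {"day": 3, "periods": [3]}]}
print(find_conflicts(calculus, physics))
# [(1, [4])]
```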

---

**Version**: v4.0 | **Data format**: v2.0 | **Last updated**: 2025-01-19