# Data Mining Laboratory - Optional Group Project (Web Scraping Focus)

## 1. Introduction

This document outlines an **optional, intensive project opportunity** for the Data Mining Laboratory course. For students who undertake this project and complete it successfully, the project grade replaces **50%** of their final lab grade, reflecting the significant effort involved. Specifically, it can replace the grades for the second half of the laboratory assignments, from the Web Scraping lab up to (but not including) the optional final lab assignment.

### Important! - Only ethical & polite scraping allowed. Abusive scraping projects will not be accepted for grading. See Section 4 for more details.

*   **Team Size:** 4 students collaborating.
*   **Duration:** Approximately 4+ weeks of focused effort alongside regular labs, culminating in a presentation during the final lab sessions.
*   **Core Objective:** To apply web scraping techniques (using BeautifulSoup, Playwright, and potentially APIs) to gather substantial data from multiple, diverse web sources. The central task involves the **meticulous cleaning and systematic integration** of this data to construct a structured, coherent database designed for a specific purpose. This project emphasizes tackling real-world data heterogeneity.
*   **Signup:** Teams interested in this challenge should sign up by April 16th, 2025.

---

## 2. General Requirements

All project options are built upon these common requirements:

*   **Technology Stack:** Primarily Python, leveraging libraries such as `requests`, `BeautifulSoup4`, `Playwright`, and `pandas`, together with a suitable storage solution (e.g., an SQLite file, a PostgreSQL database, or structured file formats like Parquet). Teams may use any other technology they prefer.
*   **Code Quality:** Emphasis on well-structured, compartmentalized (e.g., raw scraping kept separate from cleaning/processing), and maintainable code. Using version control (e.g., Git) is strongly recommended for managing the project's development. A `requirements.txt` file listing all dependencies is desirable, or alternatively a README explaining how to replicate the results.
*   **Data Product:** The primary deliverable is a final, cleaned, and structured database, made accessible for review. The format should be chosen appropriately for the collected data.
*   **Documentation/Report:** Comprehensive documentation (e.g., a detailed `README.md` file or a separate report) is required, explaining:
    *   The project's final goal and the precise scope achieved.
    *   Data sources utilized (URLs, justification for selection).
    *   Methodology: Details of the scraping strategy (tools, implemented politeness measures), the data cleaning procedures (handling missing values, normalization logic, integration techniques), and the final database schema/structure.
    *   Challenges encountered (e.g., anti-scraping mechanisms, data inconsistencies, necessary scope adjustments) and the solutions implemented.
    *   Ethical considerations: Documented checks of `robots.txt` and Terms of Service, details of politeness measures taken.
*   **Presentation:** A final presentation (~10-15 mins) during the designated lab session. This should demonstrate the project, showcase the resulting database, explain the process, and discuss key findings or significant challenges. Team members should be prepared to discuss their contributions and technical details.
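
The separation between raw scraping and cleaning/processing mentioned under Code Quality could look roughly like the sketch below. It is only an illustration: the `items` table, the field names, and the sample rows are hypothetical, and an in-memory SQLite database stands in for whatever storage the team chooses.

```python
import sqlite3

def clean_record(raw):
    """Cleaning stage: trim whitespace and map empty values to None."""
    price_text = raw.get("price", "").strip()
    return {
        "name": raw["name"].strip(),
        "source_url": raw["source_url"],
        "price": float(price_text) if price_text else None,
    }

def store(records, connection):
    """Storage stage: write already-cleaned records to a hypothetical table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS items ("
        "name TEXT NOT NULL, source_url TEXT NOT NULL, price REAL)"
    )
    connection.executemany(
        "INSERT INTO items (name, source_url, price) "
        "VALUES (:name, :source_url, :price)",
        records,
    )
    connection.commit()

# Hypothetical output of the scraping stage, kept untouched as raw strings.
raw_rows = [
    {"name": "  Widget A ", "source_url": "https://example.org/a", "price": "19.90"},
    {"name": "Widget B", "source_url": "https://example.org/b", "price": "  "},
]

connection = sqlite3.connect(":memory:")
store([clean_record(row) for row in raw_rows], connection)
rows = connection.execute("SELECT name, price FROM items ORDER BY name").fetchall()
```

Keeping the raw rows unmodified means the cleaning logic can be rerun or corrected later without scraping the sources again.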

---

## 3. Project Options (Choose One)

### Option 1: The Comprehensive Vehicle Database

*   **Initial Objective:** To create a database cataloging **passenger vehicle models** currently or recently available globally, including their core technical specifications.
*   **Scope Management:** This initial objective represents a significant undertaking. Teams selecting this option must perform initial reconnaissance and **realistically refine the scope** based on feasibility. Possible refined scopes include:
    *   Focusing on a specific continent (e.g., Europe, North America).
    *   Focusing on a particular vehicle segment (e.g., Electric Vehicles, SUVs).
    *   Concentrating on a defined timeframe (e.g., models released in the last 10 years).
    *   **The final report must clearly articulate the *scope ultimately addressed*.**
*   **Technical Emphasis:** The primary challenge lies in **integrating heterogeneous data**. Information must be gathered from various manufacturer websites and potentially public specification portals (requiring careful review of ToS). The key technical task is **data normalization**: converting units (e.g., HP/kW), standardizing feature terminology, and reconciling different trim level structures within a single, consistent schema.
*   **Data Points (Examples):** Brand, Model, Generation/Series, Model Year(s), Trim Level(s), Body Style, Engine Type (Petrol, Diesel, Electric, Hybrid), Displacement/Battery Capacity, Power (HP/kW), Torque (Nm/lb-ft), Range (EVs), Drive Type (FWD/RWD/AWD), Key Dimensions, Cargo Capacity, Base MSRP (where consistently available).
*   **Ethical Requirement:** Diligent adherence to polite scraping practices is essential. Implement significant delays, fully respect `robots.txt`, review ToS, and utilize appropriate user agents. Data accuracy is a goal, but responsible scraping practices take precedence if sources are restrictive.
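
As an illustration of the unit normalization described above, the following minimal sketch converts power and torque strings to kW and Nm. The function names and the accepted unit spellings are assumptions; in particular, it treats "HP" as mechanical horsepower and "PS" as metric horsepower, which a real project would need to verify per source.

```python
import re

# Factors converting each accepted unit to the target unit (kW or Nm).
POWER_TO_KW = {"kw": 1.0, "hp": 0.745699872, "ps": 0.73549875}
TORQUE_TO_NM = {"nm": 1.0, "lb-ft": 1.3558179483}

# A number followed by a unit token, e.g. "150 HP" or "300 lb-ft".
_VALUE_UNIT = re.compile(r"^\s*([\d.,]+)\s*([A-Za-z-]+)\s*$")

def _convert(text, factors):
    match = _VALUE_UNIT.match(text)
    if match is None:
        raise ValueError(f"unrecognized value: {text!r}")
    # Assumption: commas are thousands separators, not decimal marks.
    number = float(match.group(1).replace(",", ""))
    unit = match.group(2).lower()
    if unit not in factors:
        raise ValueError(f"unknown unit: {unit!r}")
    return number * factors[unit]

def power_kw(text):
    """Return engine power in kW from a string such as '150 HP'."""
    return _convert(text, POWER_TO_KW)

def torque_nm(text):
    """Return torque in Nm from a string such as '300 lb-ft'."""
    return _convert(text, TORQUE_TO_NM)
```

Raising an error on unknown units, rather than guessing, makes inconsistent sources visible early so they can be documented in the report.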

### Option 2: European Computer Science Programs Catalogue

*   **Goal:** To construct a comparative database of **Bachelor's and Master's degree programs in Computer Science** (and closely related fields like Data Science, Software Engineering) offered by universities across **Europe**.
*   **Technical Emphasis:** This option necessitates **considerable adaptability and custom scraping logic**. University websites exhibit significant structural and technological diversity. Teams should expect to develop *distinct scraping strategies* for various university domains. Parsing complex elements like course catalogues (potentially in PDF format), consistently identifying core versus elective courses, and extracting uniform data on fees or duration present notable challenges. Playwright will likely prove indispensable for many sites.
*   **Data Points (Examples):** University Name, Country, City, Program Name, Degree Level (BSc/MSc), Program Duration (Years/ECTS), Core Course Titles/Modules, List of Stated Specializations/Tracks, Language of Instruction, Published Tuition Fees (noting EU/non-EU status), Link to the official Program Page.
*   **Scope:** Concentrate on publicly available information sourced from official university websites. Define the geographical scope of "Europe" clearly (e.g., EU+EEA+UK+CH, or a specific list of countries).
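
One possible way to organize the distinct per-university scraping strategies is a small registry that dispatches on the page's hostname. This is a sketch only: the domain `cs.example-university.edu`, the parser body, and the returned fields are placeholders, not a real site or a prescribed schema.

```python
from urllib.parse import urlparse

# Maps a hostname to the function that knows how to parse that site.
PARSERS = {}

def parser_for(domain):
    """Decorator registering a site-specific parser for one hostname."""
    def register(func):
        PARSERS[domain] = func
        return func
    return register

@parser_for("cs.example-university.edu")  # hypothetical domain
def parse_example_university(html):
    # Placeholder logic; a real parser would extract program fields here.
    return {"program_name": None, "raw_length": len(html)}

def parse_program_page(url, html):
    """Pick the parser registered for the URL's host and apply it."""
    host = urlparse(url).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    if host not in PARSERS:
        raise LookupError(f"no parser registered for {host!r}")
    return PARSERS[host](html)
```

With this layout, adding a new university means adding one decorated function, and every parser returns the same dictionary shape for later integration.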

### Option 3: GitHub Open Source Project Dynamics (Python Data Science Focus)

*   **Goal:** To assemble a **meaningful and potentially insightful database** analyzing activity, popularity, and community metrics for approximately the **Top 50 Python libraries central to Data Science**. The criteria for defining "Top" and the final list of libraries must be clearly justified in the documentation (e.g., based on GitHub stars, PyPI download statistics, relevant surveys).
*   **Technical Emphasis:** The objective extends beyond collecting basic statistics; it involves deriving **metrics that reflect project vitality**. This requires thoughtful selection of data points (e.g., recent commit frequency, issue resolution patterns, contributor growth trends, pull request merge velocity) and potentially integrating information scraped from multiple areas within repositories (main page, issues tab, pull requests, commit history).
*   **Data Points (Examples):** Repository Name/URL, Owner/Organization, License Type, Star Count (potentially historical data if feasible), Fork Count, Number of Open Issues, Number of Closed Issues (or Ratio), Counts of Open/Closed/Merged Pull Requests (perhaps focusing on recent activity), Date of Last Commit (to main branch), Number of Releases, Total Number of Contributors, Number of Recent Contributors (e.g., active in the last 6 months), Commit Frequency (e.g., average commits per month).
*   **Ethical Requirement:** **Strict compliance with GitHub's Terms of Service and API/scraping guidelines is mandatory.** Utilize the official GitHub API where practical and efficient (this may require learning the API specifics). Direct scraping must be performed with extreme politeness (very slow request rates, adherence to `robots.txt`). Identify your bot clearly. Overly aggressive scraping is unprofessional and likely to result in access restrictions.
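
The vitality metrics mentioned above (recent contributors, commit frequency) can be derived once commit data has been collected. The sketch below assumes the commits are already available as `(author, ISO 8601 timestamp)` pairs; that input shape, the function name, the sample data, and the 30-day month approximation are all illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone

def recent_activity(commits, now, window_days=180):
    """Summarize commits that fall within the last `window_days` days.

    `commits` is an iterable of (author, iso_timestamp) pairs and `now`
    must be a timezone-aware datetime.
    """
    cutoff = now - timedelta(days=window_days)
    recent = []
    for author, stamp in commits:
        # Accept a trailing "Z" (UTC), which older Python versions reject.
        when = datetime.fromisoformat(stamp.replace("Z", "+00:00"))
        if when >= cutoff:
            recent.append((author, when))
    return {
        "recent_commits": len(recent),
        "recent_contributors": len({author for author, _ in recent}),
        # Approximation: one month is counted as 30 days.
        "commits_per_month": len(recent) / (window_days / 30.0),
    }

# Hypothetical sample data for illustration.
sample_commits = [
    ("alice", "2025-03-01T12:00:00Z"),
    ("bob", "2025-02-15T08:30:00Z"),
    ("alice", "2024-01-01T00:00:00Z"),
]
stats = recent_activity(sample_commits, now=datetime(2025, 4, 1, tzinfo=timezone.utc))
```

Passing `now` explicitly keeps the metric reproducible, which helps when the documentation needs to state the date on which the snapshot was taken.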

---

## 4. Ethical Scraping & Legal Considerations

Responsible data collection practices are fundamental to this project.

*   **Check `robots.txt`:** Always analyze and try to respect the `/robots.txt` file of target websites before initiating scraping. Document any disallowed paths and adhere to specified crawl delays.
*   **Review Terms of Service (ToS):** Carefully examine the ToS or Usage Policy for each significant data source. Prioritize websites that appear permissive of scraping for non-commercial, academic research or where the data is clearly intended for public dissemination (like university program information). **Document that ToS checks were performed** and summarize relevant findings in your report. If explicit prohibitions exist, seek alternative sources or adjust the project scope accordingly.
*   **Practice Polite Scraping:** Implement substantial delays of at least one second (e.g., `time.sleep(1)`), *or more*, between consecutive requests to the same server to avoid undue load. Use caching mechanisms effectively to prevent redundant downloads. Consider scheduling scraping activities during off-peak hours.
*   **Identify Your Bot:** Employ a descriptive User-Agent string that clearly identifies you and the project, for instance: `"X-DataMiningProject/1.0..."`.
*   **Data Scope & Use:** Collect only data directly relevant to the project's objectives. Focus exclusively on publicly available, non-personal information. Make no attempt to bypass login systems or access private data. The collected dataset is intended solely for academic analysis within this course.
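
The `robots.txt` check, crawl delay, and bot identification described above can be combined as in the following sketch, which uses Python's standard `urllib.robotparser`. The User-Agent value, the sample rules, and the `fetch` callable are hypothetical; the rules are parsed from a string here so the example runs without network access, whereas a real crawler would load them from the target site.

```python
import time
from urllib import robotparser

# Hypothetical identifier; use your own project name and contact details.
USER_AGENT = "ExampleDataMiningProject/1.0 (contact: team@example.org)"

def load_rules(robots_lines):
    """Build a parser from the lines of a robots.txt file."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_lines)
    return parser

def delay_for(parser, minimum=1.0):
    """Use the site's declared crawl delay, but never less than `minimum`."""
    declared = parser.crawl_delay(USER_AGENT)
    return minimum if declared is None else max(minimum, float(declared))

def allowed(parser, url):
    return parser.can_fetch(USER_AGENT, url)

def crawl(urls, parser, fetch, sleep=time.sleep):
    """Fetch only allowed URLs, pausing after each request."""
    delay = delay_for(parser)
    pages = {}
    for url in urls:
        if not allowed(parser, url):
            continue  # skip paths the site disallows
        pages[url] = fetch(url)
        sleep(delay)
    return pages

# Sample rules for illustration only.
sample = """User-agent: *
Crawl-delay: 5
Disallow: /private/
""".splitlines()
rules = load_rules(sample)
```

The `fetch` and `sleep` parameters are injected so the crawl logic can be tested with stand-ins before it is pointed at a real server.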

---

## 5. Presentation and Deliverables Review

Project evaluation will occur through presentation and direct review of materials, not a formal platform submission.

*   **Final Presentation:** Teams will present their work during a designated lab session towards the end of the semester. The presentation should articulate the project goals, detail the methodology, discuss challenges faced, describe the final dataset structure, and share key insights or difficulties.
*   **Code & Data Access:** During or immediately following the presentation, teams are required to provide the instructors with access to:
    *   Their complete, runnable source code (preferably via a link to a Git repository like GitHub/GitLab, or a shared folder).
    *   The final compiled database (e.g., by providing the `.csv` or `.db` file, granting temporary access to a cloud database instance, or sharing links to structured data files).
    *   The comprehensive documentation (e.g., a link to the `README.md` in the repository or a shared document).

Please ensure all components are well-organized and readily accessible for evaluation. Good luck with your project.