


## How Databases and Web Services Fit Into the Picture
Databases and web services are an interesting hybrid: from the perspective of a Python program they store non-volatile data, but accessing them introduces elements of volatility:
1. **Non-Volatile:** Data persists beyond program execution, just like files.
2. **Volatile:** Unlike files, a database can be unavailable (e.g., server downtime), and data integrity can be affected by concurrent access.

In a structured **progression from volatile to non-volatile** data, databases fit in as **external but queryable structures**:
- Unlike files, they support **efficient querying** (e.g., SQL queries).
- Unlike in-memory Python data structures, they **persist beyond execution**.
- Unlike JSON or CSV files, they allow **dynamic updates and complex relationships**.
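The persistence point can be illustrated with a minimal sketch, assuming Python's built-in `sqlite3` module and a throwaway file name (`inventory.db` is a hypothetical name chosen for this example). The first connection stands in for one run of a program and the second for a later run:

```python
import os
import sqlite3
import tempfile

# Hypothetical database file placed in a temporary directory.
db_path = os.path.join(tempfile.mkdtemp(), "inventory.db")

# "First run": create a table, store one record, then close the connection.
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE chemicals (name TEXT, formula TEXT)")
conn.execute("INSERT INTO chemicals VALUES (?, ?)", ("Acetone", "C3H6O"))
conn.commit()
conn.close()

# "Second run": reopen the same file; the record is still there,
# whereas a Python dict from the first run would be gone.
conn = sqlite3.connect(db_path)
rows = conn.execute("SELECT name, formula FROM chemicals").fetchall()
conn.close()
print(rows)
```

A server-based database would add the volatility described above: the `connect` call itself can fail if the server is down, so real code usually wraps it in error handling.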



### Brief Overview of APIs
We will use Application Programming Interfaces (APIs) throughout this course because they enable software programs to communicate with each other. An API defines the rules for that communication, and when it involves external data structures the API defines:

 **1. Endpoint (Exposes Data)**
   - The URL where the API listens.
   - Example: `https://query.wikidata.org/sparql`

 **2. Request/Response Model (How Data is Sent and Received)**
   - Defines the protocol and request method.
   - **REST APIs**: `GET`, `POST` over HTTP.
   - **SPARQL APIs**: SPARQL Query over HTTP (`GET`/`POST`).
   - **Database APIs**: SQL queries over TCP/IP.

 **3. Data Format (How Data is Structured)**
   - **REST APIs**: JSON, XML, CSV.
   - **SPARQL APIs**: SPARQL-JSON, RDF/XML, Turtle, N-Triples.
   - **Databases**: Tabular data (Relational Tables).
   - **GraphQL APIs**: JSON (Custom queries).
     
We will dive deeper into APIs as the course proceeds, but you need to understand what they are and what they do.
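As a rough sketch of how the three parts show up in code, the example below builds (but does not send) a request to the SPARQL endpoint listed above, using only the standard library. The query text is an arbitrary illustration, and the `Accept` value assumes the service honors the standard SPARQL JSON results media type:

```python
from urllib.parse import urlencode
from urllib.request import Request

# 1. Endpoint: the URL where the API listens (from the list above).
endpoint = "https://query.wikidata.org/sparql"

# Illustrative SPARQL query; any valid query could go here.
query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 1"

# 2. Request/response model: an HTTP GET with the query in the URL.
url = endpoint + "?" + urlencode({"query": query})
request = Request(
    url,
    method="GET",
    # 3. Data format: ask for the results as SPARQL-JSON.
    headers={"Accept": "application/sparql-results+json"},
)

print(request.get_method())
print(request.full_url)
# Sending it would be urllib.request.urlopen(request), which needs network access.
```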

# What is a Database? 

A **database** is like a super-organized digital filing system where you store, manage, and retrieve information efficiently. Instead of using multiple text files or spreadsheets, a database helps keep data structured, searchable, and scalable.  

Imagine you’re running a **chemical inventory** for a lab. You could store data in a spreadsheet (`CSV file`), but as the dataset grows, searching, updating, and ensuring accuracy become difficult. A **database** solves this by organizing data into a structured system that allows for efficient searching, sorting, and updating.

---

## SQL vs. NoSQL: Two Ways to Organize a Database
Databases come in two major types: **SQL (Structured Query Language) databases** and **NoSQL (Not Only SQL) databases**.  

### SQL Databases: Like a Well-Organized Filing Cabinet
📂 **Think of SQL databases as an Excel spreadsheet where everything is stored in structured tables.**  
- Data is organized into **tables** with **rows (records)** and **columns (fields)**.  
- You must **predefine the structure** (e.g., a table for chemical compounds must always have columns like “Name,” “Formula,” and “Molecular Weight”).  
- Data is retrieved using **SQL queries**, like asking a librarian for a specific book.  
- Best for structured data with **relationships** (e.g., linking a chemical sample to a supplier).  
- Examples: **SQLite, PostgreSQL, MySQL, Microsoft SQL Server.**  

📌 **Analogy:**  
- If you store lab results in a filing cabinet with labeled folders, SQL databases ensure everything follows a strict structure.
- Example SQL query:  
  ```sql
  SELECT * FROM chemicals WHERE molecular_weight > 200;
  ```
  *(Find all chemicals with a molecular weight above 200.)*
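The same query can be run from Python. This is a minimal sketch, assuming the built-in `sqlite3` module; the table layout and the three sample rows are made up for illustration:

```python
import sqlite3

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")

# The structure (schema) is defined before any data is stored.
conn.execute(
    "CREATE TABLE chemicals (name TEXT, formula TEXT, molecular_weight REAL)"
)

# Illustrative rows; molecular weights in g/mol.
conn.executemany(
    "INSERT INTO chemicals VALUES (?, ?, ?)",
    [
        ("Acetone", "C3H6O", 58.08),
        ("Caffeine", "C8H10N4O2", 194.19),
        ("Cholesterol", "C27H46O", 386.65),
    ],
)

# Find all chemicals with a molecular weight above 200.
heavy = conn.execute(
    "SELECT name FROM chemicals WHERE molecular_weight > ?", (200,)
).fetchall()
print(heavy)
conn.close()
```

Passing the threshold as a `?` parameter rather than pasting it into the SQL string is the usual way to keep queries safe when values come from user input.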

---

### NoSQL Databases: Like a Digital Whiteboard
📝 **NoSQL databases are more flexible, like sticky notes on a whiteboard that can change anytime.**  
- Instead of structured tables, data is stored as **documents, key-value pairs, graphs, or columns**.  
- You don’t need a strict structure—different records can have different fields.  
- Great for **large, flexible datasets** that may change often (e.g., tracking real-time sensor data).  
- Common in big data, web applications, and IoT (Internet of Things).  
- Examples: **MongoDB (document-based), Redis (key-value), Cassandra (wide-column).**  

📌 **Analogy:**  
- If SQL is like an Excel spreadsheet, **NoSQL is like a collection of Google Docs**—each document can have a different format.  
- Example MongoDB NoSQL document (JSON-like format):  
  ```json
  {
    "name": "Acetone",
    "formula": "C3H6O",
    "properties": {
      "molecular_weight": 58.08,
      "boiling_point": 56.5
    }
  }
  ```
  *(Flexible—some documents may have extra fields, some may not.)*
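To show what "different fields per document" means in practice, here is a small sketch that uses plain Python dictionaries as a stand-in for a document store (it does not talk to MongoDB). The second document and its `supplier` field are invented for illustration:

```python
import json

# Two documents in the same "collection" with different fields.
documents = [
    {
        "name": "Acetone",
        "formula": "C3H6O",
        "properties": {"molecular_weight": 58.08, "boiling_point": 56.5},
    },
    {
        "name": "Ethanol",
        "formula": "C2H6O",
        "supplier": "Example Supplier",  # extra field, made up for illustration
    },
]

def boiling_point(doc):
    """Return the boiling point if the document has one, else None."""
    return doc.get("properties", {}).get("boiling_point")

for doc in documents:
    print(doc["name"], boiling_point(doc))

# Documents serialize naturally to JSON text.
as_text = json.dumps(documents[0])
```

Because no schema guarantees that a field exists, code reading such documents typically has to tolerate missing keys, as `boiling_point` does here.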

---

#### How to Choose?
| Feature         | SQL Databases (Structured) | NoSQL Databases (Flexible) |
|---------------|-------------------------|-------------------------|
| **Data Structure** | Organized in tables (rows & columns) | Documents, key-value, graphs |
| **Flexibility**  | Fixed structure (schema) | Dynamic structure (schema-less) |
| **Best For** | Structured, relational data (e.g., patient records, inventory) | Big data, fast-changing data (e.g., IoT, social media) |
| **Query Language** | Uses SQL (structured queries) | Uses APIs, JSON-like queries |
| **Examples** | SQLite, PostgreSQL, MySQL | MongoDB, Redis, Firebase |

---

#### Putting It All Together
- **SQL is best for structured, well-defined data** (like lab inventory).  
- **NoSQL is better for rapidly changing, unstructured data** (like real-time sensor readings).  
- Both are **external data storage** solutions that move data from volatile (RAM) to non-volatile storage, ensuring persistence.  


# Semantic Web: A Scientific Data Perspective
The **Semantic Web** is an extension of the traditional World Wide Web that enables machines to understand and process data in a structured and meaningful way. It transforms the web from a collection of documents intended for human reading into a globally linked network of **structured data** that software can interpret, query, and reason over.

At the heart of the Semantic Web are **linked data technologies**, which define **data relationships** instead of just formatting data into tables or files. The Semantic Web allows Python scripts to query machine-readable datasets. This means scientists can directly access structured data on molecules, atmospheric data, physical constants, and more. The key components include:

1. **RDF (Resource Description Framework)**
   - The **data model** of the Semantic Web.
   - Represents **facts as triples**: **subject → predicate → object**.
   - Example (Chemical Properties in RDF):
     ```
     <Acetone>   <has_molecular_weight>   "58.08 g/mol"
     ```
   - Enables **machine-readable relationships**, making **chemical knowledge searchable**.

2. **OWL (Web Ontology Language)**
   - Extends RDF with **logic & reasoning**.
   - Defines **ontologies**: hierarchical classifications of concepts.
   - Example: If **Acetone is a Ketone**, and **Ketones are Organic Compounds**, OWL **infers** that **Acetone is an Organic Compound**.

3. **SPARQL (Query Language for RDF)**
   - The SQL of the Semantic Web.
   - Allows **querying linked datasets** across the web.

4. **Linked Open Data (LOD)**
   - RDF-based datasets are interlinked across disciplines.
   - **Example Datasets**:
     - **Wikidata** (General knowledge)
     - **PubChem RDF** (Chemical data)
     - **DBpedia** (Structured Wikipedia data)
     - **Gene Ontology** (Biological data)
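The triple model and the kind of inference OWL enables can be sketched in plain Python. This is only a toy illustration: the names `is_a` and `subclass_of` are hypothetical labels (real RDF uses full URIs), and the loop below is a simple stand-in for what an actual reasoner does:

```python
# Facts as (subject, predicate, object) triples. The names are illustrative,
# not real RDF identifiers.
triples = {
    ("Acetone", "has_molecular_weight", "58.08 g/mol"),
    ("Acetone", "is_a", "Ketone"),
    ("Ketone", "subclass_of", "OrganicCompound"),
}

def classes_of(subject, triples):
    """Return all classes of subject, following subclass_of links."""
    found = {o for s, p, o in triples if s == subject and p == "is_a"}
    frontier = set(found)
    while frontier:
        parents = {
            o for s, p, o in triples
            if p == "subclass_of" and s in frontier
        }
        frontier = parents - found  # only classes not seen before
        found |= frontier
    return found

# A SPARQL-like pattern match: every fact stated about Acetone.
about_acetone = sorted((p, o) for s, p, o in triples if s == "Acetone")
print(about_acetone)

# Inference: Acetone is a Ketone, and Ketones are Organic Compounds.
print(sorted(classes_of("Acetone", triples)))
```

The fact that Acetone is an organic compound is never stored directly; it is derived from the two class statements, which is the idea behind the OWL example above.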

## **Semantic Web vs Traditional Databases**
| **Feature**          | **Traditional Databases (SQL, NoSQL)** | **Semantic Web (RDF, OWL, SPARQL)** |
|--------------------|--------------------------------|--------------------------------|
| **Structure**      | Tables (SQL) or documents, key-value pairs, etc. (NoSQL) | Graph-based (triples: subject-predicate-object) |
| **Queries**       | SQL or engine-specific query APIs | SPARQL |
| **Flexibility**   | Schema defined per database (fixed in SQL, looser in NoSQL) | Dynamic relationships (easily extendable) |
| **Interoperability** | Limited to specific database engines | Global linking of datasets (LOD cloud) |
| **Examples**      | PostgreSQL, MongoDB           | Wikidata, PubChem RDF, DBpedia |



# Web Scraping and Databases: A Hybrid Approach
Web scraping extracts data for immediate use, but by itself it does not store anything. Combining it with a database gives a powerful workflow:
- **Scrape data** from online sources.
- **Store it in a structured database (SQL or NoSQL)** for long-term analysis.
- **Query it later** instead of repeatedly scraping.
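The three steps above can be sketched with the standard library alone. The HTML string below is a made-up stand-in for a downloaded page, the `TableParser` class is a hypothetical helper written for this example, and an in-memory SQLite database is used so the sketch runs anywhere (a file path would be used for real long-term storage):

```python
import sqlite3
from html.parser import HTMLParser

# Stand-in for a downloaded page; a real scraper would fetch this over HTTP.
PAGE = """
<table>
  <tr><th>Name</th><th>Boiling point (C)</th></tr>
  <tr><td>Acetone</td><td>56.5</td></tr>
  <tr><td>Water</td><td>100.0</td></tr>
</table>
"""

class TableParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row:
            # Header rows contain only <th> cells, so they stay empty
            # and are skipped here.
            self.rows.append(tuple(self._row))

    def handle_data(self, data):
        if self._in_cell:
            self._row.append(data.strip())

# Step 1: scrape.
parser = TableParser()
parser.feed(PAGE)

# Step 2: store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE boiling_points (name TEXT, celsius REAL)")
conn.executemany("INSERT INTO boiling_points VALUES (?, ?)", parser.rows)

# Step 3: query later, without scraping again.
result = conn.execute(
    "SELECT name FROM boiling_points WHERE celsius < 60"
).fetchall()
print(result)
conn.close()
```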


## Web Scraping as a Data Acquisition Method
Web scraping is a method of extracting **external data** from structured or semi-structured sources on the web and transforming it into a usable format. Unlike databases or file storage, web scraping **does not inherently store data**—it is a way to retrieve and structure data from the web dynamically. It allows access to **data stored in HTML web pages** that might not be available via an API.

### Web Scraping vs. APIs
| Feature         | Web Scraping | APIs |
|---------------|-------------|------|
| **Access** | Extracts data from web pages (HTML tables, text, lists) | Queries structured data from a service (often JSON or XML) |
| **Structure** | Often semi-structured (needs parsing) | Well-structured |
| **Reliability** | Pages may change, breaking the scraper | More stable (unless API changes) |
| **Use Case** | Extracting tables, research data, metadata from articles | Accessing structured datasets (PubChem, NCBI, weather data) |

Thus, **web scraping is an alternative to APIs when structured access is unavailable**.

---

## Web Scraping as a Bridge from Classical Literature to Structured Data
Scientific data has historically been communicated through **journal articles, textbooks, and reports**. Many modern scientific knowledge repositories (e.g., Wikipedia, research databases) still store information in text-based formats rather than structured databases. Web scraping allows you to:

- Extract **tabular data** (like chemical properties from Wikipedia or patents).
- Retrieve **text-based metadata** (such as author names, abstracts, and citations).
- Collect **non-tabular structured information** (like structured web pages with lists of elements).

By applying **text parsing, table extraction, and structured storage**, web scraping allows researchers to **convert human-readable content into machine-readable data**.
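As a small illustration of turning text-based metadata into a structured record, the sketch below parses a made-up article page with the standard library's `html.parser`. The page content, the use of `<meta name="author">` tags, and the `MetadataParser` class are all assumptions for this example; real sites mark up their metadata in different ways:

```python
from html.parser import HTMLParser

# Made-up article page; the meta tag names are assumptions for this sketch.
PAGE = """
<html><head>
  <title>Boiling Points of Common Solvents</title>
  <meta name="author" content="A. Researcher">
  <meta name="author" content="B. Scientist">
</head><body><p>Abstract text...</p></body></html>
"""

class MetadataParser(HTMLParser):
    """Collect the page title and the content of <meta name="author"> tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.authors = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") == "author":
            self.authors.append(attrs.get("content", ""))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

parser = MetadataParser()
parser.feed(PAGE)

# Human-readable page in, machine-readable record out.
record = {"title": parser.title.strip(), "authors": parser.authors}
print(record)
```

A dictionary like `record` can then be written to JSON or inserted into a database for later querying.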
