# Non-Volatile External Data Structures
Python’s built-in structures (lists, tuples, dictionaries, etc.) exist in **volatile memory** (RAM) and disappear when the program ends. External data structures are stored in **non-volatile memory** and persist beyond program execution. These include:

1. **File-Based Data Structures**
   - **Text-Based Files (Structured/Unstructured)**
     - `.txt`: Raw text files (unstructured).
     - `.csv`: Tabular data (structured, human-readable, but limited).
     - `.json`: Key-value format (structured, hierarchical, readable).
   - **Binary Files**
     - `.pickle`: Stores Python objects in a serialized format (not human-readable).
     - `.npy/.npz`: NumPy’s binary storage for efficient numerical data.

2. **Databases (Structured, Queryable External Data)**
   - **SQL Databases (Relational)**
     - Store data in structured tables with defined relationships.
     - Examples: SQLite, PostgreSQL, MySQL.
   - **NoSQL Databases (Flexible, Key-Value, Document-Based)**
     - Store unstructured or semi-structured data in key-value or document formats.
     - Examples: MongoDB (documents), Redis (key-value pairs).

3. **Web APIs and Networks as External Data Sources**
   - Accessing data from remote servers (e.g., PubChem, weather services).
   - Often return data in JSON, XML, or other standardized formats.
   - Unlike local files or databases, APIs require a network connection.
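
As a minimal sketch of the difference between volatile and non-volatile storage, the following writes a Python dictionary to a `.pickle` file and reads it back after the in-memory copy has been deleted. The file name `halogens.pickle` and the use of a temporary directory are assumptions made for illustration; in practice you would choose a permanent location.

```python
import os
import pickle
import tempfile

# A dictionary that exists only in RAM (volatile)
halogen_numbers = {"F": 9, "Cl": 17, "Br": 35, "I": 53}

# Hypothetical file location, chosen here only for the demonstration
pickle_path = os.path.join(tempfile.mkdtemp(), "halogens.pickle")

# Serialize the object to disk (non-volatile); pickle files are binary, hence "wb"
with open(pickle_path, "wb") as pickle_file:
    pickle.dump(halogen_numbers, pickle_file)

# Remove the in-memory copy to mimic the program ending
del halogen_numbers

# Read the object back from disk
with open(pickle_path, "rb") as pickle_file:
    restored = pickle.load(pickle_file)

print(restored)
```

Only unpickle files you trust, because loading a pickle file can execute code.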

## Understanding Input/Output (I/O)

At its core, **Input/Output (I/O)** refers to any communication between a program and the outside world. It is not limited to data storage and retrieval; it also includes interactions with users, hardware, and network resources. I/O operations can be broadly categorized into:

1. **User Interaction**  
   - Input: Receiving user input via `input()` or GUI elements.  
   - Output: Displaying text via `print()`, rendering graphics, or updating a UI.

2. **File I/O** (Non-volatile Storage)  
   - Reading and writing data to files (e.g., `.txt`, `.csv`, `.json`, `.pickle`).
   - Persistent storage that remains available after the program terminates.

3. **Network I/O**  
   - Communicating with remote servers, APIs, or databases over the internet.
   - Sending and receiving data over sockets (e.g., accessing PubChem via an API).

4. **Inter-process and Hardware I/O**  
   - Communicating with external devices like sensors, databases, or microcontrollers.
   - Data exchange between different programs or services.
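
To make these categories concrete, here is a small sketch that touches three of them using only the standard library: user output, file I/O, and inter-process I/O. Network I/O is left out so the example runs offline. The file name `note.txt` and the message strings are placeholders chosen for illustration.

```python
import os
import subprocess
import sys
import tempfile

# User interaction: print() sends text to the user's screen (standard output)
print("Hello from the program")

# File I/O: write a line to disk and read it back
note_path = os.path.join(tempfile.mkdtemp(), "note.txt")  # hypothetical file name
with open(note_path, "w", encoding="utf-8") as note_file:
    note_file.write("stored on disk\n")
with open(note_path, "r", encoding="utf-8") as note_file:
    file_text = note_file.read()

# Inter-process I/O: start a second Python process and capture what it prints
result = subprocess.run(
    [sys.executable, "-c", "print('hello from another process')"],
    capture_output=True,
    text=True,
    check=True,
)
process_text = result.stdout

print(file_text.strip())
print(process_text.strip())
```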

## Understanding Interfaces
Inherent in I/O operations is an interface between two entities or systems, so we need to introduce the concept of an API (Application Programming Interface). The I/O categories above involve human, hardware, and software components, so there are different types of interfaces. The following table gives an overview of several of them; as a human, you have already used both CLIs and GUIs in this class.

| **Interface Type** | **Example** | **Who/What Interacts?** |
|-------------------|------------|-------------------|
| **Graphical User Interface (GUI)** | Windows, Web Apps | **User ↔ System** (via visual elements like buttons, menus) |
| **Command Line Interface (CLI)** | Terminal, Bash, Python REPL | **User ↔ System** (via text commands) |
| **Application Programming Interface (API)** | REST API, Database API | **Software ↔ Software** (via structured requests & responses) |
| **Hardware Interfaces** | USB, HDMI, Bluetooth | **Physical Devices ↔ System** |


# File Based Data Structures
Before we proceed, we are going to install two new third-party packages: Seaborn and pandas. Seaborn is a visualization package built on Matplotlib that also gives access to a collection of sample data sets we can use for various data explorations. pandas is a data manipulation package built on NumPy and is widely used to handle structured data (CSV, JSON, SQL, Excel, and so on).

Open your Ubuntu terminal and activate your virtual environment (mine is py4sci)
```bash 
conda activate py4sci
``` 
The (base) in front of the command prompt should change to (py4sci). Now that you have activated your virtual environment you are ready to install the packages in it, and we will do them one at a time, although you could do them both at once.
```bash
conda install -c conda-forge seaborn
conda install -c conda-forge pandas
```
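
After installing, one quick way to confirm that both packages are visible to the active environment is to ask Python whether it can locate them. This is only a sketch; it checks that the packages can be found, not that they work correctly.

```python
import importlib.util

# Look up each package without importing it; find_spec returns None if it is absent
packages = ("seaborn", "pandas")
status = {name: importlib.util.find_spec(name) is not None for name in packages}

for name, installed in status.items():
    print(f"{name}: {'installed' if installed else 'NOT installed'}")
```

If either package is reported as not installed, check that the `py4sci` environment is the one currently activated.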
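Let's now look at the sample data sets that Seaborn makes available. Note that they are not copied to your machine during installation; Seaborn fetches each one from an online repository the first time you load it.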


In [2]:
# List of halogen symbols
halogens = ["F", "Cl", "Br", "I", "At", "Ts"]

# Atomic numbers
atomic_numbers = [9, 17, 35, 53, 85, 117]

# Atomic masses (g/mol)
atomic_masses = [18.998, 35.45, 79.904, 126.90, 210, 294]

# Electronegativity (Pauling scale)
electronegativities = [3.98, 3.16, 2.96, 2.66, 2.2, None]  # Ts unknown

# Boiling points (K)
boiling_points = [85.03, 239.11, 332.0, 457.4, 610, None]  # Ts unknown
print(f"halogens = {halogens} \natomic_numbers = {atomic_numbers} \natomic_masses = {atomic_masses} \
\nelectronegativities = {electronegativities}")

halogens = ['F', 'Cl', 'Br', 'I', 'At', 'Ts'] 
atomic_numbers = [9, 17, 35, 53, 85, 117] 
atomic_masses = [18.998, 35.45, 79.904, 126.9, 210, 294] 
electronegativities = [3.98, 3.16, 2.96, 2.66, 2.2, None]
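
The lists above are related only by position, and they vanish when the kernel stops. One way to keep them together beyond the session is to write them as rows of a CSV file. The sketch below assumes a file called `halogens.csv` in a temporary directory and column names of our own choosing; both are illustrative.

```python
import csv
import os
import tempfile

halogens = ["F", "Cl", "Br", "I", "At", "Ts"]
atomic_numbers = [9, 17, 35, 53, 85, 117]
electronegativities = [3.98, 3.16, 2.96, 2.66, 2.2, None]  # Ts unknown

csv_path = os.path.join(tempfile.mkdtemp(), "halogens.csv")  # hypothetical location

# zip() pairs up the items that share the same position in each list
with open(csv_path, "w", newline="", encoding="utf-8") as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(["Symbol", "AtomicNumber", "Electronegativity"])  # header row
    writer.writerows(zip(halogens, atomic_numbers, electronegativities))

# Read the file back; every value returns as a string, and None becomes ""
with open(csv_path, "r", newline="", encoding="utf-8") as csv_file:
    rows = list(csv.reader(csv_file))

print(rows[0])
print(rows[-1])
```

Notice what is lost in the round trip: numbers come back as text and the missing value for Ts comes back as an empty string. This is one of the ways CSV is "limited".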


In [None]:
import os
import urllib.request
import pandas as pd

# Define the directory structure
base_data_dir = os.path.expanduser("~/data")  # Parent directory
pubchem_data_dir = os.path.join(base_data_dir, "pubchem_data")  # Subdirectory for PubChem
os.makedirs(pubchem_data_dir, exist_ok=True)  # Ensure directories exist

# Define file URL and local path
file_url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/periodictable/CSV?response_type=save&response_basename=PubChemElements_all"
local_file_path = os.path.join(pubchem_data_dir, "PubChemElements_all.csv")

# Download the file; urlretrieve raises an exception if the download fails
print(f"Downloading PubChem CSV to: {local_file_path} ...")
try:
    urllib.request.urlretrieve(file_url, local_file_path)
except OSError as err:  # URLError and HTTPError are subclasses of OSError
    print(f"Download failed: {err}")
else:
    print(f"Download complete! File saved at: {local_file_path}")

    # Load into pandas DataFrame
    df = pd.read_csv(local_file_path)
    print("\nFirst few rows of the dataset:")
    print(df.head())  # Display first few rows


In [None]:
import csv
import os

# Define the file path
file_path = os.path.expanduser("~/data/pubchem_data/PubChemElements_all.csv")

# Open the CSV file and read the data
with open(file_path, mode='r', encoding='utf-8') as file:
    reader = csv.reader(file)  # Read the CSV file
    rows = list(reader)  # Convert to a list of lists

# Extract headers (column names)
headers = rows[0]  # First row contains the column names

# Extract all column data as lists (skip the header row)
columns = {header: [] for header in headers}  # Dictionary to hold columns as lists

for row in rows[1:]:  # Skip header row
    for col_index, value in enumerate(row):
        columns[headers[col_index]].append(value)

# Construct dictionary of elements
elements_dict = {}

for i in range(len(columns["Symbol"])):  # Iterate over element rows
    element_symbol = columns["Symbol"][i]  # Get element symbol
    element_data = {headers[j]: columns[headers[j]][i] for j in range(len(headers))}  # Create embedded dictionary

    elements_dict[element_symbol] = element_data  # Assign to main dictionary

# Display a sample of the final dictionary
sample_element = "I"  # Example: Iodine
print(f"Data for {sample_element}:")
print(elements_dict.get(sample_element, "Element not found!"))
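
The same nested dictionary can be built more directly with `csv.DictReader`, which returns each row already keyed by the header names. The sketch below uses a small inline sample instead of the downloaded file so that it runs on its own. Only the `Symbol` column is confirmed by the cell above; the other column names and the rounded values are assumptions for illustration.

```python
import csv
import io

# Small inline stand-in for the downloaded file (three elements, four columns)
sample_csv = """AtomicNumber,Symbol,Name,AtomicMass
8,O,Oxygen,15.999
17,Cl,Chlorine,35.45
53,I,Iodine,126.9045
"""

elements_by_symbol = {}
# DictReader uses the header row as keys, so each row arrives as a dictionary
for row in csv.DictReader(io.StringIO(sample_csv)):
    row["AtomicNumber"] = int(row["AtomicNumber"])  # CSV values are read as strings
    row["AtomicMass"] = float(row["AtomicMass"])
    elements_by_symbol[row["Symbol"]] = row

print(elements_by_symbol["I"])
```

Converting the numeric columns at load time means later calculations do not have to deal with strings.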



In [None]:
import json
import os

# Define the JSON file path
json_file_path = os.path.expanduser("~/data/pubchem_data/elements_data.json")

# Save the dictionary as a JSON file
with open(json_file_path, "w", encoding="utf-8") as json_file:
    json.dump(elements_dict, json_file, indent=4)

print(f"Full periodic table data saved to: {json_file_path}")

# -------------------------------
# Reload JSON later and display sample
# -------------------------------
print("\nReloading data from JSON...")

with open(json_file_path, "r", encoding="utf-8") as json_file:
    loaded_elements = json.load(json_file)  # Load back into a dictionary

# Display data for Oxygen as an example
sample_element = "O"
print(f"\nData for {sample_element} (from JSON):")
print(loaded_elements.get(sample_element, "Element not found!"))
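
JSON keeps more type information than CSV, but not all of it. The short sketch below works on strings in memory, with no file involved, to show two details worth knowing: Python's `None` is written as `null` and restored as `None`, and dictionary keys that are not strings come back as strings. The sample dictionaries are made up for illustration.

```python
import json

element = {"Symbol": "Ts", "AtomicNumber": 117, "Electronegativity": None}

# None is written as null; numbers stay numbers
json_text = json.dumps(element)
print(json_text)

# Loading the text restores an equal dictionary
restored = json.loads(json_text)
print(restored == element)

# Dictionary keys that are not strings are converted to strings on the way out
by_number = {9: "F", 17: "Cl"}
restored_by_number = json.loads(json.dumps(by_number))
print(restored_by_number)
```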


In [1]:
import json
import os

json_file_path = "~/data/pubchem_data/elements_data.json"

with open(os.path.expanduser(json_file_path), "r", encoding="utf-8") as json_file:
    elements_data = json.load(json_file)

# Print the symbols of all stored elements (the dictionary keys)
print(elements_data.keys())
print(type(elements_data))

dict_keys(['H', 'He', 'Li', 'Be', 'B', 'C', 'N', 'O', 'F', 'Ne', 'Na', 'Mg', 'Al', 'Si', 'P', 'S', 'Cl', 'Ar', 'K', 'Ca', 'Sc', 'Ti', 'V', 'Cr', 'Mn', 'Fe', 'Co', 'Ni', 'Cu', 'Zn', 'Ga', 'Ge', 'As', 'Se', 'Br', 'Kr', 'Rb', 'Sr', 'Y', 'Zr', 'Nb', 'Mo', 'Tc', 'Ru', 'Rh', 'Pd', 'Ag', 'Cd', 'In', 'Sn', 'Sb', 'Te', 'I', 'Xe', 'Cs', 'Ba', 'La', 'Ce', 'Pr', 'Nd', 'Pm', 'Sm', 'Eu', 'Gd', 'Tb', 'Dy', 'Ho', 'Er', 'Tm', 'Yb', 'Lu', 'Hf', 'Ta', 'W', 'Re', 'Os', 'Ir', 'Pt', 'Au', 'Hg', 'Tl', 'Pb', 'Bi', 'Po', 'At', 'Rn', 'Fr', 'Ra', 'Ac', 'Th', 'Pa', 'U', 'Np', 'Pu', 'Am', 'Cm', 'Bk', 'Cf', 'Es', 'Fm', 'Md', 'No', 'Lr', 'Rf', 'Db', 'Sg', 'Bh', 'Hs', 'Mt', 'Ds', 'Rg', 'Cn', 'Nh', 'Fl', 'Mc', 'Lv', 'Ts', 'Og'])
<class 'dict'>
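
Because the loaded data is an ordinary dictionary, pulling out a subset such as the halogens takes one dictionary comprehension. The sketch below uses a four-element stand-in for `elements_data` so that it runs without the JSON file; the field names in the inner dictionaries are assumptions.

```python
# Inline stand-in for a few entries of the loaded dictionary
elements_data = {
    "O": {"Name": "Oxygen", "AtomicNumber": "8"},
    "F": {"Name": "Fluorine", "AtomicNumber": "9"},
    "Ne": {"Name": "Neon", "AtomicNumber": "10"},
    "Cl": {"Name": "Chlorine", "AtomicNumber": "17"},
}

halogen_symbols = ["F", "Cl", "Br", "I", "At", "Ts"]

# Keep only the symbols that are actually present in the dictionary
halogen_data = {
    symbol: elements_data[symbol]
    for symbol in halogen_symbols
    if symbol in elements_data
}

print(list(halogen_data))
```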





| **Type**       | **Source Example** | **Data Format** | **Storage/Query Method** |
|--------------|-----------------|--------------|-----------------|
| **File-Based Data** | JSON, CSV, Pickle, HDF5 | Structured | File I/O (pandas, NumPy) |
| **Relational Databases** | PubChem SQL, NOAA Climate Database | SQL (Structured) | Queries (SQL, Relational Model) |
| **NoSQL Databases** | MongoDB, Firebase | JSON, BSON (Semi-Structured) | Queries (NoSQL, Document Store) |
| **APIs (REST)** | PubChem API, OpenWeather API, NASA API | JSON, XML, CSV | RESTful Requests (`GET`, `POST`) |
| **SPARQL APIs (Linked Data)** | Wikidata, PubChem RDF, DBpedia, Gene Ontology | RDF/XML, Turtle, SPARQL-JSON | SPARQL Queries (Graph-Based) |
| **Graph Databases (RDF Stores)** | Blazegraph, Virtuoso, Neo4j (for Linked Data) | RDF, GraphML | Queries (SPARQL, Cypher) |
| **Web Scraping** | Wikipedia, Research Articles, Google Scholar | HTML (Semi-Structured) | Parsing (BeautifulSoup, Scrapy) |
| **Streaming Data** | Twitter API, IoT Sensor Networks | JSON, Avro, Protobuf | WebSockets, Kafka, MQTT |

## How Databases and Web Services Fit Into the Picture
Databases and web services are an interesting hybrid: from the perspective of the Python program they store non-volatile data, but access to them introduces elements of volatility:
1. **Non-Volatile:** Data persists beyond program execution, just like files.
2. **Volatile:** Unlike files, a database can be unavailable (e.g., server downtime), and data integrity can be affected by concurrent access.

In a structured **progression from volatile to non-volatile** data, databases fit in as **external but queryable structures**:
- Unlike files, they support **efficient querying** (e.g., SQL queries).
- Unlike in-memory Python data structures, they **persist beyond execution**.
- Unlike JSON or CSV files, they allow **dynamic updates and complex relationships**.
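
The points above can be sketched with SQLite, which ships with Python as the `sqlite3` module. The example opens a database file, closes it, and then reopens it to update and query a single row without rewriting the whole file. The file name `inventory.db`, the table layout, and the quantities are made up for illustration.

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "inventory.db")  # hypothetical file name

# First "program run": create a table and store two rows
connection = sqlite3.connect(db_path)
connection.execute("CREATE TABLE chemicals (name TEXT PRIMARY KEY, grams REAL)")
connection.executemany(
    "INSERT INTO chemicals VALUES (?, ?)",
    [("Acetone", 500.0), ("Sodium chloride", 1000.0)],
)
connection.commit()
connection.close()

# Second "program run": reopen the file, update one row, and query it
connection = sqlite3.connect(db_path)
connection.execute(
    "UPDATE chemicals SET grams = grams - 50 WHERE name = ?", ("Acetone",)
)
connection.commit()
remaining = connection.execute(
    "SELECT grams FROM chemicals WHERE name = ?", ("Acetone",)
).fetchone()[0]
connection.close()

print(remaining)
```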



### Brief Overview of APIs
We will use Application Programming Interfaces (APIs) throughout this course because they enable software programs to communicate with each other. An API defines the rules for that communication, and when the communication involves external data structures the API defines:

 **1. Endpoint (Exposes Data)**
   - The URL where the API listens.
   - Example: `https://query.wikidata.org/sparql`

 **2. Request/Response Model (How Data is Sent and Received)**
   - Defines the protocol and request method.
   - **REST APIs**: `GET`, `POST` over HTTP.
   - **SPARQL APIs**: SPARQL Query over HTTP (`GET`/`POST`).
   - **Database APIs**: SQL queries over TCP/IP.

 **3. Data Format (How Data is Structured)**
   - **REST APIs**: JSON, XML, CSV.
   - **SPARQL APIs**: SPARQL-JSON, RDF/XML, Turtle, N-Triples.
   - **Databases**: Tabular data (Relational Tables).
   - **GraphQL APIs**: JSON (Custom queries).
     
We will dive deeper into APIs as the course proceeds, but you need to understand what they are and what they do.
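
As a rough sketch of how the three parts fit together, the code below builds a request URL for the endpoint named above and then parses a response body. No network call is made. The query is a placeholder, the `format=json` parameter is an assumption about what the service accepts, and the response text is made up, though it follows the general layout of the SPARQL JSON results format.

```python
import json
from urllib.parse import urlencode

# 1. Endpoint: the URL where the service listens (taken from the text above)
endpoint = "https://query.wikidata.org/sparql"

# 2. Request: a GET request carries its parameters in the URL's query string
sparql_query = "SELECT ?item WHERE { ?item ?p ?o } LIMIT 1"  # placeholder query
request_url = endpoint + "?" + urlencode({"query": sparql_query, "format": "json"})
print(request_url)

# 3. Data format: a made-up response body in the SPARQL JSON results layout
response_text = """
{
  "head": {"vars": ["item"]},
  "results": {"bindings": [
    {"item": {"type": "uri", "value": "http://example.org/item/1"}}
  ]}
}
"""
response = json.loads(response_text)
values = [binding["item"]["value"] for binding in response["results"]["bindings"]]
print(values)
```

In a real program the response text would come from sending `request_url` to the server, which is where the network connection becomes necessary.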

# What is a Database? 

A **database** is like a super-organized digital filing system where you store, manage, and retrieve information efficiently. Instead of using multiple text files or spreadsheets, a database helps keep data structured, searchable, and scalable.  

Imagine you’re running a **chemical inventory** for a lab. You could store data in a spreadsheet (`CSV file`), but as the dataset grows, searching, updating, and ensuring accuracy become difficult. A **database** solves this by organizing data into a structured system that allows for efficient searching, sorting, and updating.

---

## SQL vs. NoSQL: Two Ways to Organize a Database
Databases come in two major types: **SQL (Structured Query Language) databases** and **NoSQL (Not Only SQL) databases**.  

### SQL Databases: Like a Well-Organized Filing Cabinet
📂 **Think of SQL databases as an Excel spreadsheet where everything is stored in structured tables.**  
- Data is organized into **tables** with **rows (records)** and **columns (fields)**.  
- You must **predefine the structure** (e.g., a table for chemical compounds must always have columns like “Name,” “Formula,” and “Molecular Weight”).  
- Data is retrieved using **SQL queries**, like asking a librarian for a specific book.  
- Best for structured data with **relationships** (e.g., linking a chemical sample to a supplier).  
- Examples: **SQLite, PostgreSQL, MySQL, Microsoft SQL Server.**  

📌 **Analogy:**  
- If you store lab results in a filing cabinet with labeled folders, SQL databases ensure everything follows a strict structure.
- Example SQL query:  
  ```sql
  SELECT * FROM chemicals WHERE molecular_weight > 200;
  ```
  *(Find all chemicals with a molecular weight above 200.)*
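
To see that query run, here is a minimal sketch using Python's built-in `sqlite3` module with a temporary in-memory database. The table name matches the query above, but the column layout and the three sample rows are assumptions chosen for illustration.

```python
import sqlite3

connection = sqlite3.connect(":memory:")  # temporary database held in RAM for the demo

# The structure is declared up front: every row must fit these columns
connection.execute(
    "CREATE TABLE chemicals (name TEXT, formula TEXT, molecular_weight REAL)"
)
connection.executemany(
    "INSERT INTO chemicals VALUES (?, ?, ?)",
    [
        ("Acetone", "C3H6O", 58.08),
        ("Caffeine", "C8H10N4O2", 194.19),
        ("Sucrose", "C12H22O11", 342.30),
    ],
)

# The query from the text, run through Python
heavy = connection.execute(
    "SELECT * FROM chemicals WHERE molecular_weight > 200"
).fetchall()
connection.close()

print(heavy)
```

Only sucrose is returned, since it is the only row with a molecular weight above 200.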

---

### NoSQL Databases: Like a Digital Whiteboard
📝 **NoSQL databases are more flexible, like sticky notes on a whiteboard that can change anytime.**  
- Instead of structured tables, data is stored as **documents, key-value pairs, graphs, or columns**.  
- You don’t need a strict structure—different records can have different fields.  
- Great for **large, flexible datasets** that may change often (e.g., tracking real-time sensor data).  
- Common in big data, web applications, and IoT (Internet of Things).  
- Examples: **MongoDB (document-based), Redis (key-value), Cassandra (column-based).**  

📌 **Analogy:**  
- If SQL is like an Excel spreadsheet, **NoSQL is like a collection of Google Docs**—each document can have a different format.  
- Example MongoDB NoSQL document (JSON-like format):  
  ```json
  {
    "name": "Acetone",
    "formula": "C3H6O",
    "properties": {
      "molecular_weight": 58.08,
      "boiling_point": 56.5
    }
  }
  ```
  *(Flexible—some documents may have extra fields, some may not.)*
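
The schema-less idea can be imitated in plain Python with a list of dictionaries. This is not MongoDB, only a stand-in to show what it means for documents to have different fields and how a query has to tolerate missing ones. The second and third documents and the supplier name are invented for the example.

```python
# Each "document" is a dictionary; the fields differ from one document to the next
documents = [
    {"name": "Acetone", "formula": "C3H6O",
     "properties": {"molecular_weight": 58.08, "boiling_point": 56.5}},
    {"name": "Ethanol", "formula": "C2H6O",
     "properties": {"molecular_weight": 46.07}, "supplier": "Example Chemicals"},
    {"name": "Unlabeled sample"},  # no formula and no properties at all
]

# .get() supplies a default when a field is missing, so the query never fails
with_boiling_point = [
    doc["name"]
    for doc in documents
    if doc.get("properties", {}).get("boiling_point") is not None
]

print(with_boiling_point)
```

The flexibility has a cost: because no structure is enforced, the code that reads the data must handle every record that lacks a field.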

---

#### How to Choose?
| Feature         | SQL Databases (Structured) | NoSQL Databases (Flexible) |
|---------------|-------------------------|-------------------------|
| **Data Structure** | Organized in tables (rows & columns) | Documents, key-value, graphs |
| **Flexibility**  | Fixed structure (schema) | Dynamic structure (schema-less) |
| **Best For** | Structured, relational data (e.g., patient records, inventory) | Big data, fast-changing data (e.g., IoT, social media) |
| **Query Language** | Uses SQL (structured queries) | Uses APIs, JSON-like queries |
| **Examples** | SQLite, PostgreSQL, MySQL | MongoDB, Redis, Firebase |

---

#### Putting It All Together
- **SQL is best for structured, well-defined data** (like lab inventory).  
- **NoSQL is better for rapidly changing, unstructured data** (like real-time sensor readings).  
- Both are **external data storage** solutions that move data from volatile (RAM) to non-volatile storage, ensuring persistence.  


# Semantic Web: A Scientific Data Perspective
The **Semantic Web** is an extension of the traditional World Wide Web that enables machines to understand and process data in a structured and meaningful way. It transforms the web from a collection of documents intended for human reading into a globally linked network of **structured data** that software can interpret, query, and reason over.

At the heart of the Semantic Web are **linked data technologies**, which define **data relationships** instead of just formatting data into tables or files. The Semantic Web allows Python scripts to query machine-readable datasets. This means scientists can directly access structured data on molecules, atmospheric data, physical constants, and more. The key components include:

1. **RDF (Resource Description Framework)**
   - The **data model** of the Semantic Web.
   - Represents **facts as triples**: **subject → predicate → object**.
   - Example (Chemical Properties in RDF):
     ```
     <Acetone>   <has_molecular_weight>   "58.08 g/mol"
     ```
   - Enables **machine-readable relationships**, making **chemical knowledge searchable**.

2. **OWL (Web Ontology Language)**
   - Extends RDF with **logic & reasoning**.
   - Defines **ontologies**: hierarchical classifications of concepts.
   - Example: If **Acetone is a Ketone**, and **Ketones are Organic Compounds**, OWL **infers** that **Acetone is an Organic Compound**.

3. **SPARQL (Query Language for RDF)**
   - The SQL of the Semantic Web.
   - Allows **querying linked datasets** across the web.

4. **Linked Open Data (LOD)**
   - RDF-based datasets are interlinked across disciplines.
   - **Example Datasets**:
     - **Wikidata** (General knowledge)
     - **PubChem RDF** (Chemical data)
     - **DBpedia** (Structured Wikipedia data)
     - **Gene Ontology** (Biological data)
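
The ideas behind triples, inference, and querying can be sketched in plain Python before introducing any Semantic Web library. The code below is a toy stand-in, not RDF, OWL, or SPARQL themselves: facts are stored as tuples, a helper named `match` plays the role of a query, and a helper named `classes_of` follows class links to reproduce the acetone inference described above. Both helper names and the single `is_a` predicate are simplifications invented for this example.

```python
# Each fact is a (subject, predicate, object) triple
triples = [
    ("Acetone", "has_molecular_weight", "58.08 g/mol"),
    ("Acetone", "is_a", "Ketone"),
    ("Ketone", "is_a", "OrganicCompound"),
]

def match(subject=None, predicate=None, obj=None):
    """Return the triples that agree with every part that is not None."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]

def classes_of(entity):
    """Follow "is_a" links repeatedly to collect direct and inferred classes."""
    found = []
    to_visit = [entity]
    while to_visit:
        current = to_visit.pop()
        for _, _, parent in match(subject=current, predicate="is_a"):
            if parent not in found:
                found.append(parent)
                to_visit.append(parent)
    return found

print(match(subject="Acetone", predicate="has_molecular_weight"))
print(classes_of("Acetone"))
```

No triple states that acetone is an organic compound, yet `classes_of` reports it, because the fact follows from the two `is_a` links. That is the kind of conclusion an OWL reasoner draws at much larger scale.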

## **Semantic Web vs Traditional Databases**
| **Feature**          | **Traditional Databases (SQL, NoSQL)** | **Semantic Web (RDF, OWL, SPARQL)** |
|--------------------|--------------------------------|--------------------------------|
| **Structure**      | Tables (SQL) or documents/key-value pairs (NoSQL) | Graph-based (triples: subject-predicate-object) |
| **Queries**       | SQL or database-specific query APIs | SPARQL |
| **Flexibility**   | Schema defined per database (rigid in SQL, looser in NoSQL) | Dynamic relationships (easily extendable) |
| **Interoperability** | Limited to specific database engines | Global linking of datasets (LOD cloud) |
| **Examples**      | PostgreSQL, MongoDB           | Wikidata, PubChem RDF, DBpedia |



# Web Scraping and Databases: A Hybrid Approach
Web scraping is often used to extract data for immediate use, but scraping by itself does not store anything. A powerful workflow is:
- **Scrape data** from online sources.
- **Store it in a structured database (SQL or NoSQL)** for long-term analysis.
- **Query it later** instead of repeatedly scraping.


## Web Scraping as a Data Acquisition Method
Web scraping is a method of extracting **external data** from structured or semi-structured sources on the web and transforming it into a usable format. Unlike databases or file storage, web scraping **does not inherently store data**—it is a way to retrieve and structure data from the web dynamically. It allows access to **data stored in HTML web pages** that might not be available via an API.

### Web Scraping vs. APIs
| Feature         | Web Scraping | APIs |
|---------------|-------------|------|
| **Access** | Extracts data from web pages (HTML tables, text, lists) | Queries structured data from a service (often JSON or XML) |
| **Structure** | Often semi-structured (needs parsing) | Well-structured |
| **Reliability** | Pages may change, breaking the scraper | More stable (unless API changes) |
| **Use Case** | Extracting tables, research data, metadata from articles | Accessing structured datasets (PubChem, NCBI, weather data) |

Thus, **web scraping is an alternative to APIs when structured access is unavailable**.

---

## Web Scraping as a Bridge from Classical Literature to Structured Data
Scientific data has historically been communicated through **journal articles, textbooks, and reports**. Many modern scientific knowledge repositories (e.g., Wikipedia, research databases) still store information in text-based formats rather than structured databases. Web scraping allows you to:

- Extract **tabular data** (like chemical properties from Wikipedia or patents).
- Retrieve **text-based metadata** (such as author names, abstracts, and citations).
- Collect **non-tabular structured information** (like structured web pages with lists of elements).

By applying **text parsing, table extraction, and structured storage**, web scraping allows researchers to **convert human-readable content into machine-readable data**.
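
The scrape, store, and query workflow described in this section can be sketched with the standard library alone. The example parses a small HTML table with `html.parser`, stores the rows in SQLite, and then queries them. The HTML fragment is made up and stands in for a downloaded page, and the class name `TableParser` and table name `compounds` are our own. Real pages are messier, which is why dedicated tools such as BeautifulSoup are normally used.

```python
import sqlite3
from html.parser import HTMLParser

# Made-up page fragment standing in for a downloaded web page
html_text = """
<table>
  <tr><th>Name</th><th>Formula</th></tr>
  <tr><td>Acetone</td><td>C3H6O</td></tr>
  <tr><td>Ethanol</td><td>C2H6O</td></tr>
</table>
"""

class TableParser(HTMLParser):
    """Collect the text of every table cell, one list per table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])  # start a new row
        elif tag in ("td", "th"):
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1].append(data.strip())

# Step 1: scrape (parse the human-readable HTML into rows)
parser = TableParser()
parser.feed(html_text)
header, *records = parser.rows

# Step 2: store the rows in a database (in memory here; use a file path to persist)
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE compounds (name TEXT, formula TEXT)")
connection.executemany("INSERT INTO compounds VALUES (?, ?)", records)

# Step 3: query the stored data instead of scraping again
formula = connection.execute(
    "SELECT formula FROM compounds WHERE name = ?", ("Ethanol",)
).fetchone()[0]
connection.close()

print(header, formula)
```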
