# Introduction and Relational Databases

## Database Introduction

### 7 Aspects of Database Systems

- massive: terabytes.
- persistent: the data outlives the programs that use it. A program starts up, operates on the data, and stops; the data is still there afterwards.
- safe: hardware, software, power, users.
- multi-user: concurrency control.
- convenient: 
    - physical data independence: the way data is actually stored and laid out on disk is independent of the way programs think about its structure.
    - high-level query language, declarative
- efficient: thousands of queries/updates per second.
- reliable: 99.99999% uptime.

### A Few Aspects Surrounding Database Systems

- Database applications may be programmed via a "framework".
- A DBMS may run in conjunction with "middleware".
- Data-intensive applications may not use a DBMS at all, e.g. Hadoop is a processing framework for running operations on data that is stored in files.

### Key Concepts

- data model: 
    - set of records. 
    - XML documents: instead of a set of records, an XML document captures data as a hierarchical structure of labeled values.
    - graph data model: all data in the database is in the form of nodes and edges.
- schema vs. data: like types and variables in a programming language.
- data definition language (DDL): used to set up the schema.
- data manipulation/query language (DML): used to query and modify the data.

### Key People

- DBMS implementer: builds the system.
- database designer: establishes the schema.
- database application developer: writes programs that operate on the database. Databases and programs need not be coupled one-to-one. For example, in a sales database some applications insert sales as they happen, while others analyze the sales.
- database administrator: loads data, keeps the system running smoothly, tunes performance.

## Relational Database

- used by all major commercial database systems
- very simple model
- query with high-level languages: simple yet expressive
- efficient implementations.

### Key Concepts

- database = set of `relations` (or `tables`). Table names can be singular or plural; just be consistent.
- each relation has a set of named `attributes` (or `columns`).
- each `tuple` (or `row`) has a value for each attribute.
- each attribute has a `type` (or `domain`), which may be an atomic type or a structured type.
- `schema`: structural description of relations in database.
- `instance`: actual contents at a given point in time.
- `NULL`: special value for 'unknown' or 'undefined'.
- `key`: an attribute whose value is unique in each tuple, or a set of attributes whose combined values are unique.
    - for efficiency, database systems tend to build special index structures on keys or store the data in a particular way, so that a tuple can be found quickly by its key.
    - relational databases have no concept of a pointer. When one relation needs to refer to tuples of another, it typically refers to a tuple in the second relation by its unique key.
    
```sql
/* creating table example */

CREATE TABLE Student
    (ID, name, GPA, photo);

CREATE TABLE College
    (name string, state char(2), enrollment integer);
```
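
The `key` and `NULL` concepts above can be sketched with Python's built-in `sqlite3` module and an in-memory database. The column types, the sample rows, and the choice of `ID` as the key are assumptions made for illustration; they are not part of the notes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Schema: ID is declared as the key of Student (an assumption for this sketch).
conn.execute("CREATE TABLE Student (ID INTEGER PRIMARY KEY, name TEXT, GPA REAL)")

# Instance: the actual rows at this point in time (made-up sample data).
conn.execute("INSERT INTO Student VALUES (123, 'Amy', 3.9)")
conn.execute("INSERT INTO Student VALUES (234, 'Bob', NULL)")  # GPA unknown

# A second tuple with the same key value is rejected.
try:
    conn.execute("INSERT INTO Student VALUES (123, 'Craig', 3.5)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# NULL satisfies neither GPA > 3.5 nor GPA <= 3.5, so Bob is in neither result.
high = conn.execute("SELECT name FROM Student WHERE GPA > 3.5").fetchall()
low = conn.execute("SELECT name FROM Student WHERE GPA <= 3.5").fetchall()
print(duplicate_rejected, high, low)
```

Because `NULL` means "unknown", a comparison against it is neither true nor false, which is why Bob is missing from both query results.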

## Querying Relational Database

### Steps in Creating and Using a Relational Database

1. Design schema, create using DDL;
2. "Bulk load" initial data;
3. Repeat: execute queries and modifications, possibly issued by multiple users at the same time.
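
The three steps can be sketched in a few lines, again assuming Python's built-in `sqlite3` module. The `College` rows and the enrollment figures are hypothetical sample data, and a single connection stands in for what would normally be many concurrent users.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Step 1: design the schema and create it with DDL.
conn.execute("CREATE TABLE College (name TEXT, state CHAR(2), enrollment INTEGER)")

# Step 2: bulk load the initial data (hypothetical rows).
rows = [
    ("Stanford", "CA", 15000),
    ("Berkeley", "CA", 36000),
    ("MIT", "MA", 10000),
]
conn.executemany("INSERT INTO College VALUES (?, ?, ?)", rows)
conn.commit()

# Step 3: repeatedly execute queries and modifications (DML).
in_ca = conn.execute(
    "SELECT name FROM College WHERE state = 'CA' ORDER BY name"
).fetchall()
conn.execute("UPDATE College SET enrollment = enrollment + 500 WHERE name = 'MIT'")
mit = conn.execute("SELECT enrollment FROM College WHERE name = 'MIT'").fetchone()[0]
print(in_ca, mit)
```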

### Ad-hoc Queries in High-level Languages

- Ad-hoc: you can pose queries that you didn't think of in advance.
- High-level: you can write in a fairly compact fashion rather complicated queries and you don't have to write the algorithms that get the data out of the database.
- Sample query questions: 
    - some are easy to pose, some are harder;
    - some queries are easy for the DBMS to execute efficiently, some are harder;
    - the "query language" (DML) is used not only to query but also to modify data;

```sql
/* all students with GPA > 3.7 applying to Stanford and MIT only */

/* all engineering departments in CA with < 500 applicants */

/* college with highest average accept rate over last 5 years */
```

### Queries Return Relations

- compositional;
- closed: a query returns the same type of object that it operates on (a relation);

### Query Languages

- Relational Algebra: formal;
- SQL: actual/implemented; relational algebra is its formal foundation;

```sql
/* relational algebra */

Π ID (
    σ GPA>3.7 ^ cName='Stanford' (Student ⋈ Apply)
)

/* SQL */

SELECT Student.ID FROM Student, Apply
WHERE Student.ID=Apply.ID AND GPA>3.7 AND cName='Stanford'
```
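
To see the SQL version run, here is a small sketch that assumes Python's built-in `sqlite3` module. The two tables are cut down to the columns the query touches, and all rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Student (ID INTEGER, name TEXT, GPA REAL)")
conn.execute("CREATE TABLE Apply (ID INTEGER, cName TEXT)")
conn.executemany(
    "INSERT INTO Student VALUES (?, ?, ?)",
    [(123, "Amy", 3.9), (234, "Bob", 3.6), (345, "Craig", 3.8)],
)
conn.executemany(
    "INSERT INTO Apply VALUES (?, ?)",
    [(123, "Stanford"), (234, "Stanford"), (345, "MIT")],
)

# Same query as above: join on ID, then filter on GPA and college name.
ids = conn.execute("""
    SELECT Student.ID FROM Student, Apply
    WHERE Student.ID = Apply.ID AND GPA > 3.7 AND cName = 'Stanford'
""").fetchall()
print(ids)
```

Bob applied to Stanford but has a GPA below 3.7, and Craig has the GPA but applied elsewhere, so only Amy's ID comes back.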

# Relational Algebra

## Select, Project, Join

```sql
/* Examples in this section:
College (cName, state, enrollment)
Student (sID, sName, GPA, sizeHS)
Apply (sID, cName, major, decision)
*/
```

- __select__: pick certain rows;

```sql
/* students with GPA>3.7 */
σ GPA>3.7
Student

/* students with GPA>3.7 and sizeHS<1000 */
σ GPA>3.7 ^ sizeHS<1000
Student

/* Applications to Stanford CS major */
σ cName='Stanford' ^ major='CS'
Apply
```
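
One way to make the select operator concrete is to model a relation as a list of dicts in plain Python. This is only a sketch: the `select` helper and the sample `Student` rows are hypothetical, not something the notes define.

```python
# Hypothetical Student instance: each tuple is a dict keyed by attribute name.
Student = [
    {"sID": 123, "sName": "Amy", "GPA": 3.9, "sizeHS": 1000},
    {"sID": 234, "sName": "Bob", "GPA": 3.6, "sizeHS": 1500},
    {"sID": 345, "sName": "Craig", "GPA": 3.8, "sizeHS": 500},
]


def select(relation, condition):
    """σ: keep the tuples for which the condition holds; columns are unchanged."""
    return [t for t in relation if condition(t)]


# σ GPA>3.7 Student
high_gpa = select(Student, lambda t: t["GPA"] > 3.7)
# σ GPA>3.7 ^ sizeHS<1000 Student
small_hs = select(Student, lambda t: t["GPA"] > 3.7 and t["sizeHS"] < 1000)

print([t["sName"] for t in high_gpa])
print([t["sName"] for t in small_hs])
```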

- __project__: pick certain columns;

```sql
/* sID and decision of all applicants */
Π sID, decision
Apply

/* sID and sName of students with GPA>3.7 */
Π sID, sName (
    σ GPA>3.7 
    Student
)
-- composing a selection with a projection is useful; composing two selections or two projections adds nothing, since each pair can be written as a single operator.
```
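
Composing selection and projection can be sketched in plain Python with relations modelled as lists of dicts. The `select` and `project` helpers and the sample rows are hypothetical; `project` drops repeated rows, following relational algebra's set semantics.

```python
# Hypothetical Student instance: each tuple is a dict keyed by attribute name.
Student = [
    {"sID": 123, "sName": "Amy", "GPA": 3.9, "sizeHS": 1000},
    {"sID": 234, "sName": "Bob", "GPA": 3.6, "sizeHS": 1500},
    {"sID": 345, "sName": "Craig", "GPA": 3.8, "sizeHS": 500},
]


def select(relation, condition):
    """σ: keep the tuples for which the condition holds."""
    return [t for t in relation if condition(t)]


def project(relation, attributes):
    """Π: keep only the named columns, dropping duplicate tuples."""
    result = []
    for t in relation:
        row = {a: t[a] for a in attributes}
        if row not in result:
            result.append(row)
    return result


# Π sID, sName (σ GPA>3.7 Student): the selection's output feeds the projection.
answer = project(select(Student, lambda t: t["GPA"] > 3.7), ["sID", "sName"])
print(answer)
```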

- Relational algebra and SQL treat duplicates differently:
    - SQL: multisets/bags, so duplicates are not eliminated;
    - Relational algebra: sets, so duplicates are eliminated;
    
```sql
/* Dealing with duplicates */

/* the equivalent SQL projection would return duplicates */
/* list of application majors and decisions */
Π major, decision
Apply
-- no duplicates in the relational algebra result
```
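
The difference is easy to observe in SQL itself. The sketch below assumes Python's built-in `sqlite3` module and invented `Apply` rows: a plain `SELECT` keeps duplicates (bag semantics), while `SELECT DISTINCT` gives the set that relational algebra would produce.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Apply (sID INTEGER, cName TEXT, major TEXT, decision TEXT)")
conn.executemany(
    "INSERT INTO Apply VALUES (?, ?, ?, ?)",
    [
        (123, "Stanford", "CS", "accept"),
        (234, "Berkeley", "CS", "accept"),
        (345, "MIT", "EE", "reject"),
    ],
)

# Bag semantics: one output row per input row, duplicates included.
bag = conn.execute("SELECT major, decision FROM Apply").fetchall()
# Set semantics: duplicates removed, as in relational algebra.
as_set = conn.execute("SELECT DISTINCT major, decision FROM Apply").fetchall()
print(len(bag), len(as_set))
```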

- cross join (cross product, or Cartesian product; see [Cartesian product](https://zh.wikipedia.org/wiki/%E7%AC%9B%E5%8D%A1%E5%84%BF%E7%A7%AF)):

```sql
/* Names and GPAs of students with sizeHS>1000 who applied to CS and were rejected */

/* all the following expressions are equivalent */
Π sName,GPA (
    σ Student.sID=Apply.sID ^ sizeHS>1000 ^ major='CS' ^ decision='reject'
    (Student × Apply)
)

Π sName,GPA (
    σ Student.sID=Apply.sID ^ sizeHS>1000 ^ major='CS' ^ decision='reject'
    (Student × Π sID,major,decision Apply) 
)

Π sName,GPA (
    σ Student.sID=Apply.sID (
        (σ sizeHS>1000 Student) × (σ major='CS' ^ decision='reject' Apply)
    )
)
```
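
The equivalence of these expressions can be checked on a tiny example. The following plain-Python sketch models relations as lists of dicts; the `cross` helper, its attribute-prefixing scheme, and the sample rows are all assumptions made for illustration.

```python
from itertools import product

Student = [
    {"sID": 123, "sName": "Amy", "GPA": 3.9, "sizeHS": 1200},
    {"sID": 234, "sName": "Bob", "GPA": 3.6, "sizeHS": 1500},
]
Apply = [
    {"sID": 123, "cName": "Stanford", "major": "CS", "decision": "reject"},
    {"sID": 234, "cName": "MIT", "major": "EE", "decision": "reject"},
]


def cross(r1, n1, r2, n2):
    """×: pair every tuple of r1 with every tuple of r2, prefixing attribute names."""
    return [
        {**{f"{n1}.{k}": v for k, v in t1.items()},
         **{f"{n2}.{k}": v for k, v in t2.items()}}
        for t1, t2 in product(r1, r2)
    ]


# First form: take the full cross product, then apply every condition.
pairs = cross(Student, "Student", Apply, "Apply")
late = [
    (t["Student.sName"], t["Student.GPA"]) for t in pairs
    if t["Student.sID"] == t["Apply.sID"] and t["Student.sizeHS"] > 1000
    and t["Apply.major"] == "CS" and t["Apply.decision"] == "reject"
]

# Third form: filter each relation first, then cross and match on sID.
big_hs = [t for t in Student if t["sizeHS"] > 1000]
cs_rejects = [t for t in Apply if t["major"] == "CS" and t["decision"] == "reject"]
early = [
    (t["Student.sName"], t["Student.GPA"])
    for t in cross(big_hs, "Student", cs_rejects, "Apply")
    if t["Student.sID"] == t["Apply.sID"]
]
print(len(pairs), late, early)
```

Both forms give the same answer, but the second builds a smaller cross product because the filters run first.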

- natural join: 
    - enforce equality on all attributes with same name automatically;
    - eliminate one copy of duplicate attributes;


```sql
/* the following equivalence holds */
E1 ⋈ E2 === Π schema(E1) ∪ schema(E2) (
    σ E1.A1=E2.A1 ^ E1.A2=E2.A2 ^ ... (E1 × E2)
)
```

```sql
/* Names and GPAs of students with sizeHS>1000 who applied to CS and were rejected */
Π sName,GPA (
    σ sizeHS>1000 ^ major='CS' ^ decision='reject' (Student ⋈ Apply)
)

/* Names and GPAs of students with sizeHS>1000 who applied to CS and were rejected at college with enrollment>2000 */
Π sName,GPA (
    σ sizeHS>1000 ^ major='CS' ^ decision='reject' ^ enrollment>2000 (Student ⋈ (Apply ⋈ College))
)
-- either join Apply with College first (on cName) or join Student with Apply first (on sID); natural join is associative, so both orders give the same result
```
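
A natural join can be sketched in plain Python by matching tuples on every attribute name the two relations share. The `natural_join` helper, the `as_set` comparison helper, and all sample rows (including the made-up `SmallCollege`) are hypothetical.

```python
Student = [
    {"sID": 123, "sName": "Amy", "GPA": 3.9, "sizeHS": 1200},
    {"sID": 234, "sName": "Bob", "GPA": 3.6, "sizeHS": 1500},
]
Apply = [
    {"sID": 123, "cName": "Stanford", "major": "CS", "decision": "reject"},
    {"sID": 234, "cName": "SmallCollege", "major": "CS", "decision": "reject"},
]
College = [
    {"cName": "Stanford", "state": "CA", "enrollment": 15000},
    {"cName": "SmallCollege", "state": "CA", "enrollment": 1500},
]


def natural_join(r1, r2):
    """⋈: match tuples on every shared attribute name; keep one copy of each."""
    result = []
    for t1 in r1:
        for t2 in r2:
            shared = t1.keys() & t2.keys()
            if all(t1[a] == t2[a] for a in shared):
                result.append({**t1, **t2})
    return result


def as_set(relation):
    """Order-independent view of a relation, used only to compare results."""
    return {frozenset(t.items()) for t in relation}


# The two join orders mentioned above.
left = natural_join(Student, natural_join(Apply, College))
right = natural_join(natural_join(Student, Apply), College)

answer = {
    (t["sName"], t["GPA"]) for t in left
    if t["sizeHS"] > 1000 and t["major"] == "CS"
    and t["decision"] == "reject" and t["enrollment"] > 2000
}
print(as_set(left) == as_set(right), answer)
```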


- theta join:
    - the basic operation implemented in a DBMS;
    - the term `join` often means theta join;

```sql
/* the following equivalence holds */
Exp1 ⋈θ Exp2 === σ θ (Exp1 × Exp2)
-- where θ is the join condition
```
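
The definition translates almost directly into code. In this plain-Python sketch, `cross` and `theta_join` are hypothetical helpers, the sample rows are invented, and the two relations are assumed to have disjoint attribute names so that tuples can simply be merged.

```python
from itertools import product


def cross(r1, r2):
    """×: combine every pair of tuples (attribute names assumed disjoint)."""
    return [{**t1, **t2} for t1, t2 in product(r1, r2)]


def theta_join(r1, r2, theta):
    """⋈θ: a selection with condition θ applied to the cross product."""
    return [t for t in cross(r1, r2) if theta(t)]


Student = [
    {"sID": 123, "sName": "Amy", "sizeHS": 1200},
    {"sID": 234, "sName": "Bob", "sizeHS": 300},
]
College = [
    {"cName": "A", "enrollment": 1000},
    {"cName": "B", "enrollment": 200},
]

# θ can be any condition, not just equality: here, high school larger than college.
bigger = theta_join(Student, College, lambda t: t["sizeHS"] > t["enrollment"])
pairs = [(t["sName"], t["cName"]) for t in bigger]
print(pairs)
```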

## Set Operators, Renaming, Notation