# SQL

SQL, which stands for *Structured Query Language*, is a domain-specific language used in programming and designed for managing data held in a relational database management system (RDBMS), or for stream processing in a relational data stream management system (RDSMS). It is particularly useful in handling structured data. SQL is a nonprocedural, declarative programming language. This means that the code focuses on what we want to obtain rather than on how to obtain it.

You can think of a relational database as a collection of tables. A table is just a set of rows and columns, like a spreadsheet, which represents exactly one type of entity. For example, a table might represent employees in a company or purchases made, but not both.

Each row, or *record*, of a table contains information about a single entity. For example, in a table representing employees, each row represents a single person. Each column, or *field*, of a table contains a single attribute for all rows in the table. For example, in a table representing employees, we might have a column containing first and last names for all employees.

An entity is the smallest unit that can contain a meaningful set of data. The rows (records) represent the horizontal entity, while the columns (fields) represent the vertical entity.

# Main Components

SQL's syntax comprises several types of statements that allow us to perform various commands and operations.

### Data Definition Language (DDL) - A set of statements that allow the user to define or modify data structures and objects, such as tables.

- the CREATE statement - used for creating entire databases and database objects such as tables:

ex: 
~~~~sql
CREATE TABLE sales (purchase_number  INT);
~~~~

- the ALTER statement - used when altering existing objects (for example, to ADD, DROP or RENAME columns):

ex:
~~~~sql
ALTER TABLE sales
ADD COLUMN date_of_purchase DATE;
~~~~

- the DROP statement - used for deleting a database object:

ex:
~~~~sql
DROP TABLE customers;
~~~~

- the RENAME statement - allows us to rename an object:

ex:
~~~~sql
RENAME TABLE customers TO customer_data;
~~~~

- the TRUNCATE statement - instead of deleting an entire table through DROP, we can remove only its data and keep the table as an object in the database.
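
For symmetry with the other DDL statements, here is a minimal sketch of TRUNCATE. It reuses the customers table named in the DROP example and assumes that table still exists:

~~~~sql
-- Remove every row from customers but keep the (now empty) table definition
TRUNCATE TABLE customers;
~~~~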

### Data Manipulation Language (DML) - its statements allow us to manipulate the data in the tables of a database

- the SELECT statement - used to retrieve data from database objects, like tables
- the INSERT statement - used to insert data into tables
- the UPDATE statement - allows us to update existing data in the tables
- the DELETE statement - functions similarly to the TRUNCATE statement from DDL, but instead of removing all the records contained in a table, we can specify precisely which ones we would like to remove

Commands:
~~~~sql
SELECT ... FROM ...
INSERT INTO ... VALUES ...
UPDATE ... SET ... WHERE ...
DELETE FROM ... WHERE ...
~~~~
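
As a rough illustration of these four patterns, the sketch below walks one row through its life cycle. It assumes the sales table from the DDL examples, with its purchase_number and date_of_purchase columns; the values are made up:

~~~~sql
-- Add one row to the sales table (hypothetical values)
INSERT INTO sales (purchase_number, date_of_purchase)
VALUES (1, '2024-01-15');

-- Read the row back
SELECT purchase_number, date_of_purchase
FROM sales
WHERE purchase_number = 1;

-- Change the purchase date of that row only
UPDATE sales
SET date_of_purchase = '2024-01-16'
WHERE purchase_number = 1;

-- Remove the row again
DELETE FROM sales
WHERE purchase_number = 1;
~~~~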

### Data Control Language (DCL) - a syntax containing only two statements, GRANT and REVOKE, which allow us to manage the rights users have in a database

- the GRANT statement - gives (or grants) certain permissions to users
- the REVOKE statement - used to revoke permissions and privileges of database users; it is the exact opposite of GRANT
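
A minimal sketch of both statements in MySQL syntax follows. The account 'analyst'@'localhost', its password and the database name sales_db are hypothetical and only serve the illustration:

~~~~sql
-- Hypothetical account; the password is a placeholder
CREATE USER 'analyst'@'localhost' IDENTIFIED BY 'change_me';

-- Allow the account to read and insert rows in one table
GRANT SELECT, INSERT ON sales_db.customers TO 'analyst'@'localhost';

-- Take the INSERT privilege away again; SELECT remains granted
REVOKE INSERT ON sales_db.customers FROM 'analyst'@'localhost';
~~~~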

### Transaction Control Language (TCL) - not every change you make to a database is saved automatically; we have to save it explicitly.

- the COMMIT statement - related to INSERT, DELETE and UPDATE; it saves the changes we have made permanently, allowing other users to have access to the modified database (committed changes cannot be undone)
- the ROLLBACK statement - allows us to undo any changes we have made but don't want to be saved permanently (reverts to the last committed state)
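
The following sketch shows how the two statements frame a change. It assumes MySQL-style syntax and the employees table used later in these notes:

~~~~sql
-- Open a transaction so that nothing is saved automatically
START TRANSACTION;

UPDATE employees
SET first_name = 'Stella'
WHERE emp_no = 9990;

-- Keep the change permanently ...
COMMIT;

-- ... or, instead of COMMIT, discard it and return to the last committed state:
-- ROLLBACK;
~~~~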

# Queries

A query is a request for data from a database table (or combination of tables). Querying is an essential skill for a data scientist, since the data you need for your analyses will often live in databases. In SQL, keywords are not case-sensitive.

Whenever you refer to an SQL object in your queries, the server must know which database it belongs to: either select a default database first, or qualify the object name with the database name.
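
Both ways of naming the database can be sketched as follows, in MySQL syntax. The database name employees_db is hypothetical:

~~~~sql
-- Option 1: choose a default database once, then use plain table names
USE employees_db;
SELECT first_name FROM employees;

-- Option 2: qualify the table name with its database
SELECT first_name FROM employees_db.employees;
~~~~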

In SQL, you can select data from a table using a *SELECT* statement. For example, the following query selects the *name* column from the *people* table:
~~~~sql
SELECT name
FROM people;
~~~~

It's also good practice to include a semicolon at the end of your query. This tells SQL where the end of your query is!

## SELECT statement

The SELECT statement allows you to extract a fraction of the entire data set:

- it is used to retrieve data from database objects, like tables
- it is used to "query data from a database"
- the SELECT statement goes with FROM
- '*' is a wildcard character that means "all" or "everything"

### SELECTing

This query selects two columns, *name* and *birthdate*, from the *people* table:
~~~~sql
SELECT name, birthdate
FROM people;
~~~~

Sometimes, you may want to select all columns from a table. Typing out every column name would be a pain, so there's a handy shortcut:
~~~~sql
SELECT *
FROM people;
~~~~

If you only want to return a certain number of results, you can use the *LIMIT* keyword to limit the number of rows returned:
~~~~sql
SELECT *
FROM people
LIMIT 10;
~~~~

#### SELECT DISTINCT
If you want to select all the unique values from a column, you can use the *DISTINCT* keyword.
~~~~sql
SELECT DISTINCT language
FROM films;
~~~~


#### Learning to COUNT
The COUNT function returns the number of rows in a table or the number of non-missing values in a column.
For example, this code gives the number of rows in the people table:
~~~~sql
SELECT COUNT(*)
FROM people;
~~~~
If you want to count the number of non-missing values in a particular column, you can call COUNT on just that column.
For example, to count the number of birth dates present in the people table:
~~~~sql
SELECT COUNT(birthdate)
FROM people;
~~~~

It's also common to combine COUNT with DISTINCT to count the number of distinct values in a column.
~~~~sql
SELECT COUNT(DISTINCT birthdate)
FROM people;
~~~~

## Filtering Results
### WHERE

The WHERE keyword allows you to filter based on both text and numeric values in a table. There are a few different comparison operators you can use:
- '=' equal
- '<>' not equal
- '<' less than
- '>' greater than
- '<=' less than or equal to
- '>=' greater than or equal to

For example, you can filter text records such as *title*. The following code returns all films with the title *'Metropolis'*:
~~~~sql
SELECT title
FROM films
WHERE title = 'Metropolis';
~~~~

Notice that the *WHERE* clause **always** comes after the *FROM* clause!

### WHERE AND
Often, you'll want to select data based on multiple conditions. You can build up your WHERE queries by combining multiple conditions with the AND keyword.
~~~~sql
SELECT title
FROM films
WHERE release_year > 1994
AND release_year < 2000;
~~~~

You can add as many AND conditions as you need!

### WHERE AND OR
What if you want to select rows based on multiple conditions where some but not all of the conditions need to be met? For this, SQL has the OR operator.
~~~~sql
SELECT title
FROM films
WHERE release_year = 1994
OR release_year = 2000;
~~~~

When combining AND and OR, be sure to enclose the individual clauses in parentheses, like so:
~~~~sql
SELECT title
FROM films
WHERE (release_year = 1994 OR release_year = 1995)
AND (certification = 'PG' OR certification = 'R');
~~~~

Any condition can also be negated with NOT; for example, the opposite of the LIKE operator described below is NOT LIKE.

### BETWEEN
The BETWEEN keyword provides a useful shorthand for filtering values within a specified range. It's important to remember that BETWEEN is inclusive, meaning the beginning and end values are included in the results!

~~~~sql
SELECT title
FROM films
WHERE release_year
BETWEEN 1994 AND 2000;
~~~~

NOT BETWEEN selects the values outside the specified range, which consists of two parts:
- an interval below the first value indicated
- a second interval above the second value

The beginning and end values themselves are not included.
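
As a short sketch, negating the BETWEEN example above returns only the films released before 1994 or after 2000:

~~~~sql
SELECT title
FROM films
WHERE release_year NOT BETWEEN 1994 AND 2000;
~~~~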

### IN - NOT IN
Instead of repeating OR many times for the same field, we can use IN.
The IN operator allows you to specify multiple values in a WHERE clause, making it easier and quicker to specify multiple OR conditions!

In this case the query will get all the employees named Cathie, Nathan or Mark.
~~~~sql
SELECT *
FROM employees
WHERE first_name IN ('Cathie', 'Nathan', 'Mark');
~~~~

The NOT IN operator selects every row whose value is not equal to any of the listed values, working as the opposite of IN.

In this case the query will get all the employees who are not named Cathie, Nathan or Mark.
~~~~sql
SELECT *
FROM employees
WHERE first_name NOT IN ('Cathie', 'Nathan', 'Mark');
~~~~

### IS NOT NULL / IS NULL
In SQL, NULL represents a missing or unknown value. You can check for NULL values using the expression IS NULL. For example, to count the number of missing birth dates in the people table:
~~~~sql
SELECT COUNT(*)
FROM people
WHERE birthdate IS NULL;
~~~~

Sometimes, you'll want to filter out missing values so you only get results which are not NULL. To do this, you can use the IS NOT NULL operator.

~~~~sql
SELECT COUNT(*)
FROM people
WHERE birthdate IS NOT NULL;
~~~~

### LIKE - NOT LIKE
In SQL, the LIKE operator can be used in a WHERE clause to search for a pattern in a column. To accomplish this, you use something called a wildcard as a placeholder for some other values. There are two wildcards you can use with LIKE:
- '%' is a substitute for a **sequence** of zero or more characters
- '_' helps you match a **single** character

In this case we obtain all employees whose first name begins with 'Mar'.
~~~~sql
SELECT *
FROM employees
WHERE first_name LIKE 'Mar%';
~~~~
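
Two further sketches on the same employees table may help here: the negated form NOT LIKE, and the single-character wildcard, which in standard SQL is the underscore:

~~~~sql
-- First names that do NOT start with 'Mar'
SELECT *
FROM employees
WHERE first_name NOT LIKE 'Mar%';

-- Four-letter first names starting with 'Mar', such as 'Mark' or 'Mary'
SELECT *
FROM employees
WHERE first_name LIKE 'Mar_';
~~~~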

### SELECT DISTINCT
SELECT DISTINCT returns only the distinct (different) values of a column.

~~~~sql
SELECT DISTINCT gender
FROM employees;
~~~~

## Aggregate Functions

Aggregate functions are applied to multiple rows of a single column of a table and return a single value.

- COUNT() - counts the number of non-null records in a field
  - COUNT(*) returns the number of all rows of the table, NULL values included
- SUM() - sums all the non-null values in a column
- MIN() - returns the minimum value from the entire list
- MAX() - returns the maximum value from the entire list
- AVG() - calculates the average of all non-null values belonging to a certain column of a table
- ROUND(#, decimal_places) - not an aggregate function itself, but often applied to the single values that aggregate functions return

Ex: Returns the number of distinct first names in the table employees
~~~~sql
SELECT COUNT(DISTINCT first_name)
FROM employees;
~~~~
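
Several aggregate functions can appear in one query. The sketch below assumes the salaries table with a numeric salary column, as used in the LIMIT example later in these notes, and wraps AVG() in ROUND() to keep two decimal places:

~~~~sql
SELECT COUNT(salary) AS n_salaries,
       MIN(salary) AS lowest_salary,
       MAX(salary) AS highest_salary,
       ROUND(AVG(salary), 2) AS average_salary
FROM salaries;
~~~~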


### IFNULL() and COALESCE()
IFNULL(expression_1, expression_2) - returns the first of the two indicated values if the data value found in the table is not null, and returns the second value if there is a null value

- it takes exactly two parameters
~~~~sql
SELECT dept_no, IFNULL(dept_name, 'Department name not provided') as dept_name
FROM departments_dup;
~~~~

COALESCE(expression_1, expression_2 …, expression_N) - allows you to insert N arguments in the parentheses

- think of COALESCE() as IFNULL() with more than two parameters
- COALESCE() will always return a single value of the ones we have within parentheses, and this value will be the first non-null value of this list, reading the values from left to right
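
A possible three-argument use, on the same departments_dup table as the IFNULL() example, is sketched below. It assumes that dept_no and dept_name are both text columns; the fallback string 'N/A' is arbitrary:

~~~~sql
-- Use the department name if present, otherwise the department number,
-- and 'N/A' only when both are NULL
SELECT dept_no,
       COALESCE(dept_name, dept_no, 'N/A') AS dept_label
FROM departments_dup;
~~~~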

### ALIASING 
SQL allows you to do something called aliasing. Aliasing simply means you assign a temporary name to something. To alias, you use the AS keyword.

For example:
~~~~sql
SELECT MAX(budget) AS max_budget,
       MAX(duration) AS max_duration
FROM films;
~~~~

If we want to get the title and duration in hours for all films:
~~~~sql
SELECT title, duration/60.0 AS duration_hours
FROM films;
~~~~

### ORDER BY
The *ORDER BY* keyword is used to sort results in ascending or descending order according to the values of one or more columns.
By default *ORDER BY* will sort in ascending order. If you want to sort the results in descending order, you can use the DESC keyword. 

For example, the following query gives you the titles of films sorted by release year, from newest to oldest.
~~~~sql
SELECT title
FROM films
ORDER BY release_year DESC;
~~~~

*ORDER BY* can also be used to sort on multiple columns. It will sort by the first column specified, then sort by the next, then the next, and so on. For example, the following query sorts on birth dates first (oldest to newest) and then sorts on the names in alphabetical order. **The order of columns is important!**

~~~~sql
SELECT birthdate, name
FROM people
ORDER BY birthdate, name;
~~~~

### GROUP BY
In SQL, GROUP BY allows you to group a result by one or more columns.

- GROUP BY must be placed after the WHERE conditions, if any, and before the HAVING and ORDER BY clauses, if any

~~~~sql
SELECT first_name, COUNT(first_name) as n_first_name
FROM employees
GROUP BY first_name
ORDER BY first_name;
~~~~

### HAVING
HAVING is a clause frequently used with GROUP BY because it filters out of the output the groups that do not satisfy a certain condition.

- HAVING needs to be inserted between the GROUP BY and the ORDER BY clauses
- HAVING is like WHERE but applied to the GROUP BY block
- HAVING conditions can use aggregate functions, because they are evaluated on the aggregated groups, while in the WHERE block this is forbidden

~~~~sql
SELECT first_name, COUNT(first_name) as n_first_name
FROM employees
WHERE hire_date > '1999-01-01'
GROUP BY first_name
HAVING COUNT(first_name)>250
AND COUNT(first_name)<270
ORDER BY first_name;
~~~~

### LIMIT
To return only a certain number of records in the output, we can use LIMIT:
~~~~sql
SELECT *
FROM salaries
ORDER BY salary DESC
LIMIT 10;
~~~~

# Relational Databases and Relational Schemas

- A relational database is a collection of data items with pre-defined relationships between them. These items are organized as a set of tables with columns and rows. Tables are used to hold information about the objects to be represented in the database.

- A relation schema defines the design and structure of a relation: it consists of the relation name and a set of attributes (also called field names or column names). Every attribute has an associated domain.

# Query information_schema with SELECT

*information_schema* is a meta-database that holds information about your current database. *information_schema* has multiple tables you can query with the familiar SELECT * FROM syntax:

- 'tables': information about all tables in your current database
- 'columns': information about all columns in all of the tables in your current database
- ...

~~~~sql
-- Query the right table in information_schema
SELECT * 
FROM information_schema.tables
-- Specify the correct table_schema value
WHERE table_schema = 'public';
~~~~

Now have a look at the columns in the table university_professors by selecting all entries in information_schema.columns that correspond to that table.

~~~~sql
-- Query the right table in information_schema to get columns
SELECT column_name, data_type 
FROM information_schema.columns 
WHERE table_name = 'university_professors' AND table_schema = 'public';
~~~~

## CREATE TABLEs
The syntax for creating simple tables is as follows:
~~~~sql
-- Create a table for the professors entity type
CREATE TABLE professors (
 firstname text,
 lastname text
);
-- Print the contents of this table
SELECT * 
FROM professors;
~~~~

Adding columns to existing tables is easy, especially if they're still empty.
To add columns you can use the following SQL query:
~~~~sql
ALTER TABLE professors
ADD COLUMN university_shortname text;
~~~~

To rename a column:
~~~~sql
ALTER TABLE table_name
RENAME COLUMN old_name TO new_name;
~~~~

To delete a column:
~~~~sql
ALTER TABLE table_name
DROP COLUMN column_name;
~~~~

#### INSERT INTO
To insert values manually we can use the following pattern:
~~~~sql
INSERT INTO table_name (column_1, column_2, …, column_n)
VALUES (value_1, value_2, …, value_n);
~~~~
- we must put the VALUES in the exact order in which we have listed the column names

Alternatively, we can insert from another table using the following pattern:
~~~~sql
INSERT INTO table_a
SELECT DISTINCT column_name1, column_name2, ...
FROM table_b;
~~~~
This inserts into table_a all distinct combinations of the selected columns found in table_b.

## CONSTRAINTS

### Type CASTs
If you know that a certain column stores numbers as text, you can *cast* the column to a numeric type, e.g. to integer.

In the following example, the 'fee' column is a string field:
~~~~sql
-- Calculate the net amount as amount + fee
SELECT transaction_date, amount + CAST(fee AS integer) AS net_amount 
FROM transactions;
~~~~

### Change types with ALTER COLUMN
The syntax for changing the data type of a column is straightforward. The following code changes the data type of the column_name column in table_name to varchar(10):
~~~~sql
ALTER TABLE table_name
ALTER COLUMN column_name
TYPE varchar(10);
~~~~

### Convert types USING a function
If you don't want to reserve too much space for a certain varchar column, you can truncate the values before converting its type.
For this, you can use the following syntax:
~~~~sql
ALTER TABLE table_name
ALTER COLUMN column_name
TYPE varchar(x)
USING SUBSTRING(column_name FROM 1 FOR x);
~~~~
You should read it like this: Because you want to reserve only x characters for column_name, you have to retain a SUBSTRING of every value, i.e. the first x characters of it, and throw away the rest. This way, the values will fit the varchar(x) requirement.
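
As a concrete sketch of this pattern in PostgreSQL syntax, the statement below shortens the firstname column of the professors table created earlier. The length 16 is an arbitrary choice for illustration:

~~~~sql
-- Keep only the first 16 characters of each first name,
-- then store the column as varchar(16)
ALTER TABLE professors
ALTER COLUMN firstname
TYPE varchar(16)
USING SUBSTRING(firstname FROM 1 FOR 16);
~~~~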

### Disallow NULL values with SET NOT NULL
The NOT NULL constraint disallows missing values in a column:
- when you insert values into the table, you cannot leave the respective field empty
- don't confuse a NULL value with the value 0 or with a "NONE" response

~~~~sql
-- Disallow NULL values in firstname
ALTER TABLE professors 
ALTER COLUMN firstname SET NOT NULL;
~~~~

### Make your columns UNIQUE with ADD CONSTRAINT
If you want to add a unique constraint to an existing table, you do it like this:

~~~~sql
ALTER TABLE table_name
ADD CONSTRAINT some_name UNIQUE(column_name);
~~~~

## Keys and Superkeys
What is a key?
- Attribute(s) that identify a record uniquely
- As long as attributes can be removed without losing uniqueness: superkey
- If no more attributes can be removed without losing the uniqueness property: minimal superkey or key

### Identify keys with SELECT COUNT DISTINCT
There's a very basic way of finding out what qualifies for a key in an existing, populated table:

1. Count the distinct records for all possible combinations of columns. If the resulting number x equals the number of all rows in the table for a combination, you have discovered a superkey.

2. Then remove one column after another until you can no longer remove columns without seeing the number x decrease. If that is the case, you have discovered a (candidate) key.
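
The two steps can be sketched on the cars table with its make and model columns, which appears later in these notes. The row-style COUNT(DISTINCT(...)) is PostgreSQL syntax, and which combination turns out to be a key depends on the actual data:

~~~~sql
-- Reference value: the total number of rows
SELECT COUNT(*)
FROM cars;

-- Step 1: if this equals the total, (make, model) is a superkey
SELECT COUNT(DISTINCT(make, model))
FROM cars;

-- Step 2: remove a column; if the count drops below the total,
-- model cannot be removed from the key
SELECT COUNT(DISTINCT make)
FROM cars;
~~~~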

### Primary Keys
Primary Key - a column (or a set of columns) whose value exists and is unique for every record

- each table can have one and only one primary key;
- may be composed of a set of columns, where its combination is unique;
- the primary keys are the unique identifiers of a table;
- cannot contain null values;
- not all tables you work with will have a primary key.

In a relational schema, the primary key field of a table is usually recorded on top of the other fields, and its name is underlined.

~~~~sql
ALTER TABLE table_name
ADD CONSTRAINT some_name PRIMARY KEY (column_name);
~~~~

### Surrogate keys
A surrogate key is any column or set of columns that can be declared as the primary key instead of a “real” or natural key.
- Primary keys should be built from as few columns as possible
- Primary keys should never change over time

#### Add a SERIAL surrogate key
~~~~sql
-- Add the new column to the table
ALTER TABLE professors 
ADD COLUMN id serial;

-- Make id a primary key
ALTER TABLE professors 
ADD CONSTRAINT professors_pkey PRIMARY KEY (id);
~~~~

#### CONCATenate columns to a surrogate key
Another strategy to add a surrogate key to an existing table is to concatenate existing columns with the CONCAT() function.

~~~~sql
-- Count the number of distinct rows with columns make, model
SELECT COUNT(DISTINCT(make, model)) 
FROM cars;

-- Add the id column
ALTER TABLE cars
ADD COLUMN id varchar(128);

-- Update id with make + model
UPDATE cars
SET id = CONCAT(make, model);

-- Make id a primary key
ALTER TABLE cars
ADD CONSTRAINT id_pk PRIMARY KEY(id);

-- Have a look at the table
SELECT * FROM cars;
~~~~

### Foreign Keys
A foreign key (FK) points to the primary key (PK) of another table
- The domain of the FK must be equal to the domain of the PK
- Each value of the FK must exist in the PK of the other table (FK constraint or referential integrity)
- FKs are not actual keys, because duplicates and null values are allowed.

You want the professors table to reference the universities table. You can do that by specifying a column in professors table that references a column in the universities table.
~~~~sql
-- Rename the university_shortname column
ALTER TABLE professors
RENAME COLUMN university_shortname TO university_id;

-- Add a foreign key on professors referencing universities
ALTER TABLE professors
ADD CONSTRAINT professors_fkey FOREIGN KEY (university_id) REFERENCES universities (id);
~~~~

#### JOIN tables linked by a foreign key
~~~~sql
-- Select all professors working for universities in the city of Zurich
SELECT professors.lastname, universities.id, universities.university_city
FROM professors
JOIN universities
ON professors.university_id = universities.id
WHERE universities.university_city = 'Zurich';
~~~~

### REFERENTIAL INTEGRITY
A record referencing another table must refer to an existing record in that table
- Foreign keys prevent violations
- ON DELETE CASCADE can be used to delete the referencing records automatically when the referenced record is deleted
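
A sketch of ON DELETE CASCADE in PostgreSQL syntax follows. It assumes the professors_fkey constraint added above already exists and replaces it with a cascading one; the id value 'ETH' is hypothetical:

~~~~sql
-- Drop the plain foreign key so it can be re-created with a cascade rule
ALTER TABLE professors
DROP CONSTRAINT professors_fkey;

ALTER TABLE professors
ADD CONSTRAINT professors_fkey FOREIGN KEY (university_id)
REFERENCES universities (id) ON DELETE CASCADE;

-- Deleting a university now also deletes the professors that reference it
DELETE FROM universities
WHERE id = 'ETH';
~~~~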

## Creating a Database
In the next steps we will create the following database:

![Creating a Database](img/Creating_Database_Part1.png)

When creating a table, we must always specify the type of data that will be inserted in each column.

Strings - the text format in SQL. Digits, symbols, or blank spaces can also be used in the string format.
- Character (CHAR) has a fixed storage size; we must pass the number of symbols the variable will have. Ex: CHAR(6) holds a maximum of 6 symbols and always has a size of 6 bytes (assuming one byte per symbol). It has a maximum length of 255 symbols;

- Variable character (VARCHAR) has a variable storage size: we pass the maximum, but the size depends on the number of symbols actually stored. It has a maximum size of 65,535 bytes.

VARCHAR is more responsive to the data value inserted than CHAR. On the other hand, CHAR is often cited as being about 50% faster than VARCHAR, although the actual difference depends on the storage engine and the workload.

- Enumerate (ENUM) - we list exactly the values that can be used.


Integers - whole numbers with no decimal point.
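
To make these types concrete, here is a minimal MySQL sketch. The table and column names are hypothetical and are not taken from the diagram above:

~~~~sql
CREATE TABLE customer_profiles (
    customer_id   INT,              -- whole number
    country_code  CHAR(2),          -- fixed length: always 2 symbols
    email_address VARCHAR(255),     -- variable length: up to 255 symbols
    gender        ENUM('M', 'F')    -- only the listed values are accepted
);
~~~~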

## MySQL Constraints

Constraints are specific rules, or limits, that we define in our tables. Why constraints?

- Constraints give the data structure
- Help with consistency, and thus data quality
- Data quality is a business advantage / data science prerequisite

The role of some constraints, such as foreign keys, is to outline the existing relationships between different tables in our database; others, e.g. NOT NULL, restrict the values a single column can hold.

Keys must be defined in SQL through their respective constraints.

A foreign key in SQL is defined through a foreign key constraint:
the foreign key maintains the referential integrity within the database

ON DELETE CASCADE - if a specific value from the parent table's primary key has been deleted, all the records from the child table referring to this value will be removed as well.

Unique keys in MySQL have the same role as indexes, but the reverse is not true: an index does not have to be unique.

- index of a table - an organizational unit that helps retrieve data more easily
- it takes more time to update a table because indexes must be updated, too, and that is time-consuming.

DEFAULT constraint - helps us assign a particular default value to every row of a column
- a value different from the default can be stored in a field where the DEFAULT constraint has been applied, but only if it is specifically indicated.
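
A small sketch of the DEFAULT constraint follows. The orders table and its columns are hypothetical:

~~~~sql
CREATE TABLE orders (
    order_id INT,
    quantity INT DEFAULT 1
);

-- quantity is omitted, so the default value 1 is stored
INSERT INTO orders (order_id)
VALUES (1);

-- quantity is given explicitly, so 5 is stored instead of the default
INSERT INTO orders (order_id, quantity)
VALUES (2, 5);
~~~~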

### Foreign Key - identifies the relationships between tables, not the tables themselves
- A foreign key refers to the primary key of another table

In a relational schema:
- A foreign key field must be identified by an (FK) after the name of the field.
- An arrow must be drawn starting at the foreign key field in one table and ending at the primary key field in the other table (both fields usually have the same name)

### Unique Key - used whenever you would like to specify that you don't want to see duplicate data in a given field

In the case of a unique key:
- It is possible to have null values;
- A table can have more than one unique key field;
- The unique key can comprise a single column of a table or more than one column, in which case the combination of values must be unique.

### Relationships - tell you how much of the data from a foreign key field can be seen in the primary key column of the table the data is related to and vice versa.

- One-to-many type of relationship: one value from the table where the field is a primary key can be found many times in the table where the field is a foreign key.
Ex: one value from the "customer_id" column in the "Customers" table can be found many times in the "customer_id" column in the "Sales" table.

## The UPDATE Statement
The UPDATE statement is used to update the values of existing records in a table.

To update a record, use the following pattern:
~~~~sql
UPDATE table_name
SET column_1 = value_1, column_2 = value_2 …
WHERE conditions;
~~~~

- if you don’t provide a WHERE condition, all rows of the table will be updated

To update specific data for one employee, we can use the following statement. In this case we update just the employee whose number is equal to 9990:
~~~~sql
UPDATE employees
SET 
    first_name = 'Stella',
    last_name = 'Parkinson',
    birth_date = '1990-12-31',
    gender = 'F'
WHERE emp_no = 9990;
~~~~

## The DELETE Statement
The DELETE statement is used to delete records in a table:
~~~~sql
DELETE FROM employees
WHERE emp_no = 9990;
~~~~

## DROP vs TRUNCATE vs DELETE
DROP
- you won’t be able to roll back to its initial state, or to the last COMMIT statement
- use DROP TABLE only when you are sure you aren’t going to use the table in question anymore

TRUNCATE
- TRUNCATE behaves roughly like DELETE without WHERE
- when truncating, auto-increment values will be reset

DELETE
- removes records row by row

TRUNCATE vs DELETE without WHERE
- TRUNCATE completes much more quickly than DELETE
- auto-increment values are not reset with DELETE