# SQL

SQL, which stands for Structured Query Language, is a language for interacting with data stored in something called a relational database.

You can think of a relational database as a collection of tables. A table is just a set of rows and columns, like a spreadsheet, which represents exactly one type of entity. For example, a table might represent employees in a company or purchases made, but not both.

Each row, or record, of a table contains information about a single entity. For example, in a table representing employees, each row represents a single person. Each column, or field, of a table contains a single attribute for all rows in the table. For example, in a table representing employees, we might have a column containing first and last names for all employees.

An entity is the smallest unit that can contain a meaningful set of data.
The rows (records) represent horizontal entities, while the columns (fields) represent vertical entities.

SQL is a nonprocedural, declarative programming language. This means that the code focuses on what we want to obtain rather than on how to obtain it.

## Main Components

SQL's syntax comprises several types of statements that allow us to perform various commands and operations.

Data Definition Language (DDL) - a set of statements that allow the user to define or modify data structures and objects, such as tables.
- the CREATE statement - used for creating entire databases and database objects such as tables:
ex: CREATE TABLE sales (purchase_number INT);

- the ALTER statement - used when altering existing objects (adding, removing or renaming their elements):
ex:
ALTER TABLE sales
ADD COLUMN date_of_purchase DATE;

- the DROP statement - used for deleting a database object:
ex:
DROP TABLE customers;

- the RENAME statement - allows us to rename an object:
ex:
RENAME TABLE customers TO customer_data;

- the TRUNCATE statement - instead of deleting an entire table through DROP, we can remove only its data and keep the table as an object in the database.

Data Manipulation Language (DML) - its statements allow us to manipulate the data in the tables of a database.

- the SELECT statement - used to retrieve data from database objects, like tables
- the INSERT statement - used to insert data into tables
- the UPDATE statement - allows us to update existing data in tables
- the DELETE statement - functions similarly to the TRUNCATE statement from DDL, but instead of removing all the records contained in a table, we can specify precisely what we would like to be removed

Commands:
- SELECT ... FROM ...
- INSERT INTO ... VALUES ...
- UPDATE ... SET ... WHERE ...
- DELETE FROM ... WHERE ...

Data Control Language (DCL) - a syntax containing only two statements, GRANT and REVOKE, which allow us to manage the rights users have in a database.

- the GRANT statement - gives (or grants) certain permissions to users
- the REVOKE statement - used to revoke permissions and privileges of database users; it is the exact opposite of GRANT
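
As a minimal sketch in MySQL syntax, granting and then revoking a privilege might look like the following. The user `'frank'@'localhost'`, its password, and the `sales.customers` table are hypothetical names used only for illustration, and the table is assumed to exist already.

~~~~sql
-- Hypothetical user; running this requires the privilege to create users
CREATE USER 'frank'@'localhost' IDENTIFIED BY 'pass';

-- Allow frank to read (but not change) the customers table in the sales database
GRANT SELECT ON sales.customers TO 'frank'@'localhost';

-- Take the same permission away again
REVOKE SELECT ON sales.customers FROM 'frank'@'localhost';
~~~~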

Transaction Control Language (TCL) - not every change you make to a database is saved automatically; TCL statements let us save or discard changes explicitly.

- the COMMIT statement - related to INSERT, DELETE and UPDATE; it saves the changes we have made, allowing other users to access the modified database (committed changes cannot be undone)

- the ROLLBACK statement - allows us to undo any changes we have made but do not want to be saved permanently (reverts to the last committed state)
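
The two statements can be sketched as follows, assuming MySQL with a transactional storage engine and a hypothetical `employees` table that has `emp_no` and `first_name` columns.

~~~~sql
START TRANSACTION;

UPDATE employees
SET first_name = 'Stella'
WHERE emp_no = 9990;

-- Undo the update: the table returns to its last committed state
ROLLBACK;

START TRANSACTION;

UPDATE employees
SET first_name = 'Stella'
WHERE emp_no = 9990;

-- Make the change permanent and visible to other users
COMMIT;
~~~~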

## Basic Database Terminology

The main goal of a database is to organize huge amounts of data that can be quickly retrieved upon a user's request. Therefore, databases must be compact, well-structured and efficient in terms of speed of data extraction.

A Relational Database Management System (RDBMS) creates and manages relations between tables and databases.

Database designer - plots the entire database system on a canvas using a visualization tool. This can be done with an Entity-Relationship (ER) diagram or with a relational schema, which represents precisely an existing idea of how the database must be organized.

Database management comprises database design, creation and manipulation.

Database administrator - provides daily maintenance of the database.

### Relational Schemas

Primary Key - a column (or a set of columns) whose value exists and is unique for every record

- each table can have one and only one primary key;
- may be composed of a set of columns whose combination is unique;
- the primary keys are the unique identifiers of a table;
- cannot contain null values;
- not all tables you work with will have a primary key.

In a relational schema, the primary key field is usually recorded on top of the other fields of the table, and it is always underlined.

Foreign Key - identifies the relationships between tables, not the tables themselves
- A foreign key usually refers to the primary key of another table

In a relational schema:
- A foreign key field must be identified by an (FK) after the name of the field.
- An arrow must start at the foreign key field in one table and end at the primary key field of the other table (both usually share the same name)

Unique Key - used whenever you want to prevent duplicate data in a given field

In the case of a unique key:
- It is possible to have null values;
- A table can have more than one unique key;
- A unique key can comprise a single column of a table or several columns whose combination is unique.

Relationships - tell you how much of the data from a foreign key field can be found in the primary key column of the related table, and vice versa.

- One-to-many type of relationship: one value from the table where the field is a primary key can be found many times in the table where the field is a foreign key.
Ex: one value from the "customer_id" column of the "Customers" table can be found many times in the "customer_id" column of the "Sales" table.
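
The keys described above could be declared roughly as follows. This is a sketch in MySQL syntax; apart from `customer_id`, `purchase_number` and `date_of_purchase`, the column names are assumptions made for illustration.

~~~~sql
-- Parent table: customer_id is the primary key, email_address a unique key
CREATE TABLE customers (
    customer_id   INT,
    first_name    VARCHAR(255),
    email_address VARCHAR(255),
    PRIMARY KEY (customer_id),
    UNIQUE KEY (email_address)
);

-- Child table: many sales rows may point to the same customer (one-to-many)
CREATE TABLE sales (
    purchase_number  INT,
    date_of_purchase DATE,
    customer_id      INT,
    PRIMARY KEY (purchase_number),
    FOREIGN KEY (customer_id) REFERENCES customers (customer_id)
);
~~~~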

## First Steps in SQL

In the next steps we will create the following database:

![Creating a Database](img/Creating_Database_Part1.png)

To create a table we must always specify the type of data that will be inserted in each of its columns.

Strings - the text format in SQL. Digits, symbols, and blank spaces can also be used in the string format.
- Character (CHAR) has fixed storage: we pass the maximum number of symbols the variable will have. Ex: CHAR(6) holds at most 6 symbols and always takes 6 bytes. The maximum size is 255 bytes;

- Variable character (VARCHAR) has variable storage: we pass the maximum number of symbols, but the size depends on the number of symbols actually stored. The maximum size is 65,535 bytes.

VARCHAR is more responsive to the data value inserted than CHAR. On the other hand, CHAR is faster to process than VARCHAR (these notes put the difference at about 50%).

- Enumerate (ENUM) - we list exactly the values that can be stored.


Integers - whole numbers with no decimal point.
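
A small sketch of these types in a table definition, assuming MySQL (ENUM is not available in every dialect). The table and column names are hypothetical.

~~~~sql
CREATE TABLE customers_demo (
    customer_id  INT,             -- whole number
    country_code CHAR(2),         -- fixed length, e.g. 'BR'
    first_name   VARCHAR(255),    -- up to 255 characters, stored size varies
    gender       ENUM('M', 'F')   -- only the listed values are accepted
);
~~~~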

## MySQL Constraints

Constraints are specific rules, or limits, that we define in our tables. Why constraints?

- Constraints give the data structure
- Help with consistency, and thus data quality
- Data quality is a business advantage / data science prerequisite

Constraints enforce rules on individual columns (e.g. NOT NULL) and outline the existing relationships between different tables in our database.

Keys and relationships drawn in a relational schema must be defined in SQL through their respective constraints.

A foreign key in SQL is defined through a foreign key constraint:
the foreign key maintains the referential integrity within the database

ON DELETE CASCADE - if a specific value from the parent table's primary key has been deleted, all the records from the child table referring to this value will be removed as well.

Unique keys in MySQL act as indexes, but the reverse is not true: not every index is a unique key.

- index of a table - an organizational unit that helps retrieve data more easily
- it takes more time to update a table because indexes must be updated, too, and that is time-consuming.

DEFAULT Constraint - helps us assign a particular default value to every row of a column
- a value different from the default can be stored in a field where the DEFAULT constraint has been applied, but only if it is specifically indicated.

NOT NULL Constraint - ensures that a column cannot contain NULL values
- when you insert values in the table you cannot leave the respective field empty
- Don't confuse a NULL value with the value of 0 or with a "NONE" response
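
The constraints above can be combined in one sketch. It assumes MySQL with the InnoDB engine; the tables `demo_departments` and `demo_staff` and all their columns are hypothetical.

~~~~sql
CREATE TABLE demo_departments (
    dept_no   INT,
    dept_name VARCHAR(255) NOT NULL,   -- a name must always be provided
    PRIMARY KEY (dept_no),
    UNIQUE KEY (dept_name)             -- no two departments share a name
);

CREATE TABLE demo_staff (
    staff_id INT,
    dept_no  INT,
    status   VARCHAR(20) DEFAULT 'active',   -- used when no status is given
    PRIMARY KEY (staff_id),
    FOREIGN KEY (dept_no) REFERENCES demo_departments (dept_no) ON DELETE CASCADE
);

INSERT INTO demo_departments (dept_no, dept_name) VALUES (1, 'Sales');
INSERT INTO demo_staff (staff_id, dept_no) VALUES (10, 1);   -- status defaults to 'active'

-- Deleting the parent row also removes the staff row that refers to it
DELETE FROM demo_departments WHERE dept_no = 1;
~~~~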

# Queries

A query is a request for data from a database table (or combination of tables). Querying is an essential skill for a data scientist, since the data you need for your analyses will often live in databases. In SQL, keywords are not case-sensitive.

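Whenever you would like to refer to an SQL object in your queries, you must specify the database to which it belongs, either by selecting a default database first or by qualifying the object name (for example, database_name.table_name).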

In SQL, you can select data from a table using a *SELECT* statement. For example, the following query selects the *name* column from the *people* table:
~~~~sql
SELECT name
FROM people;
~~~~

It's also good practice to include a semicolon at the end of your query. This tells SQL where the end of your query is!

## SELECT statement

The SELECT statement allows you to extract a fraction of the entire data set.

- used to retrieve data from database objects, like tables
- used to "query data from a database"
- the SELECT statement goes with FROM
- '*' is a wildcard character that means "all" or "everything"

### SELECTing

This query selects two columns, *name* and *birthdate*, from the *people* table:
~~~~sql
SELECT name, birthdate
FROM people;
~~~~

Sometimes, you may want to select all columns from a table. Typing out every column name would be a pain, so there's a handy shortcut:
~~~~sql
SELECT *
FROM people;
~~~~

If you only want to return a certain number of results, you can use the *LIMIT* keyword to limit the number of rows returned:
~~~~sql
SELECT *
FROM people
LIMIT 10;
~~~~

#### SELECT DISTINCT
If you want to select all the unique values from a column, you can use the *DISTINCT* keyword.
~~~~sql
SELECT DISTINCT language
FROM films;
~~~~


#### Learning to COUNT
The COUNT function returns the number of rows in a table, or the number of non-missing values in a column.
For example, this code gives the number of rows in the people table:
~~~~sql
SELECT COUNT(*)
FROM people;
~~~~
If you want to count the number of non-missing values in a particular column, you can call COUNT on just that column.
For example, to count the number of birth dates present in the people table:
~~~~sql
SELECT COUNT(birthdate)
FROM people;
~~~~

It's also common to combine COUNT with DISTINCT to count the number of distinct values in a column.
~~~~sql
SELECT COUNT(DISTINCT birthdate)
FROM people;
~~~~

## Filtering Results
### WHERE

The WHERE keyword allows you to filter based on both text and numeric values in a table. There are a few different comparison operators you can use:
- '=' equal
- '<>' not equal
- '<' less than
- '>' greater than
- '<=' less than or equal to
- '>=' greater than or equal to

For example, you can filter text records such as *title*. The following code returns all films with the title *'Metropolis'*:
~~~~sql
SELECT title
FROM films
WHERE title = 'Metropolis';
~~~~

Notice that the *WHERE* clause **always** comes after the *FROM* clause!

### WHERE AND
Often, you'll want to select data based on multiple conditions. You can build up your WHERE queries by combining multiple conditions with the AND keyword.
~~~~sql
SELECT title
FROM films
WHERE release_year > 1994
AND release_year < 2000;
~~~~

You can add as many AND conditions as you need!

### WHERE AND OR
What if you want to select rows based on multiple conditions where some but not all of the conditions need to be met? For this, SQL has the OR operator.
~~~~sql
SELECT title
FROM films
WHERE release_year = 1994
OR release_year = 2000;
~~~~

When combining AND and OR, be sure to enclose the individual clauses in parentheses, like so:
~~~~sql
SELECT title
FROM films
WHERE (release_year = 1994 OR release_year = 1995)
AND (certification = 'PG' OR certification = 'R');
~~~~


### BETWEEN
The BETWEEN keyword provides a useful shorthand for filtering values within a specified range. It's important to remember that BETWEEN is inclusive, meaning the beginning and end values are included in the results!

~~~~sql
SELECT title
FROM films
WHERE release_year
BETWEEN 1994 AND 2000;
~~~~

NOT BETWEEN - refers to an interval composed of two parts:
- an interval below the first value indicated
- a second interval above the second value
- the beginning and end values are not included

### IN - NOT IN
Instead of repeating OR many times for the same field, we can use IN.
The IN operator allows you to specify multiple values in a WHERE clause, making it easier and quicker to specify multiple OR conditions!

In this case the query will get all the employees named Cathie, Nathan or Mark.
~~~~sql
SELECT *
FROM employees
WHERE first_name IN ('Cathie', 'Nathan', 'Mark');
~~~~

The NOT IN operator selects everything that is not equal to any of the listed values, working as the opposite of IN.

In this case the query will get all the employees who are not named Cathie, Nathan or Mark.
~~~~sql
SELECT *
FROM employees
WHERE first_name NOT IN ('Cathie', 'Nathan', 'Mark');
~~~~

### IS NOT NULL / IS NULL
In SQL, NULL represents a missing or unknown value. You can check for NULL values using the expression IS NULL. For example, to count the number of missing birth dates in the people table:
~~~~sql
SELECT COUNT(*)
FROM people
WHERE birthdate IS NULL;
~~~~

Sometimes, you'll want to filter out missing values so you only get results which are not NULL. To do this, you can use the IS NOT NULL operator.

~~~~sql
SELECT COUNT(*)
FROM people
WHERE birthdate IS NOT NULL;
~~~~

### LIKE - NOT LIKE
In SQL, the LIKE operator can be used in a WHERE clause to search for a pattern in a column. To accomplish this, you use something called a wildcard as a placeholder for some other values. There are two wildcards you can use with LIKE:
- '%' is a substitute for a **sequence** of characters
- '_' helps you match a **single** character

NOT LIKE is the exact opposite: it returns the records that do not match the pattern.

In this case we obtain all employees whose first name begins with 'Mar'.
~~~~sql
SELECT *
FROM employees
WHERE first_name LIKE 'Mar%';
~~~~

### SELECT DISTINCT
SELECT DISTINCT - selects all distinct (different) data values.

~~~~sql
SELECT DISTINCT gender
FROM employees;
~~~~

## Aggregate Functions

Aggregate functions are applied to multiple rows of a single column of a table and return a single value.

- COUNT() - counts the number of non-null records in a field
  - COUNT(*) returns the number of all rows of the table, NULL values included
- SUM() - sums all the non-null values in a column
- MIN() - returns the minimum value from the entire list
- MAX() - returns the maximum value from the entire list
- AVG() - calculates the average of all non-null values belonging to a certain column of a table
- ROUND(#, decimal_places) - not an aggregate function itself, but often applied to the single values that aggregate functions return

Ex: Returns the number of distinct first names in the table employees
~~~~sql
SELECT COUNT(DISTINCT first_name)
FROM employees;
~~~~
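
As a further sketch, several aggregate functions can be combined in one query. It assumes the `salaries` table has a numeric `salary` column; the aliases are illustrative.

~~~~sql
SELECT COUNT(salary)         AS n_salaries,
       SUM(salary)           AS total_salary,
       MIN(salary)           AS lowest_salary,
       MAX(salary)           AS highest_salary,
       ROUND(AVG(salary), 2) AS average_salary   -- average rounded to 2 decimals
FROM salaries;
~~~~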


### IFNULL() and COALESCE()
IFNULL(expression_1, expression_2) - returns the first expression if it is not null, and returns the second expression if the first one is null

- it takes exactly two parameters
~~~~sql
SELECT dept_no, IFNULL(dept_name, 'Department name not provided') as dept_name
FROM departments_dup;
~~~~

COALESCE(expression_1, expression_2 …, expression_N) - allows you to insert N arguments in the parentheses

- think of COALESCE() as IFNULL() with more than two parameters
- COALESCE() will always return a single value of the ones we have within parentheses, and this value will be the first non-null value of this list, reading the values from left to right
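
A minimal sketch of COALESCE() with three arguments, assuming the `departments_dup` table also has a nullable `dept_manager` column (a hypothetical column used only for illustration):

~~~~sql
-- Returns dept_manager if present, otherwise dept_name, otherwise 'N/A'
SELECT dept_no,
       dept_name,
       COALESCE(dept_manager, dept_name, 'N/A') AS dept_manager
FROM departments_dup;
~~~~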

### ALIASING 
SQL allows you to do something called aliasing. Aliasing simply means you assign a temporary name to something. To alias, you use the AS keyword

For example:
~~~~sql
SELECT MAX(budget) AS max_budget,
       MAX(duration) AS max_duration
FROM films;
~~~~

If we want to get the title and duration in hours for all films:
~~~~sql
SELECT title, duration/60.0 AS duration_hours
FROM films;
~~~~

### ORDER BY
The *ORDER BY* keyword is used to sort results in ascending or descending order according to the values of one or more columns.
By default *ORDER BY* will sort in ascending order. If you want to sort the results in descending order, you can use the DESC keyword. 

For example, the following query gives you the titles of films sorted by release year, from newest to oldest.
~~~~sql
SELECT title
FROM films
ORDER BY release_year DESC;
~~~~

*ORDER BY* can also be used to sort on multiple columns. It will sort by the first column specified, then sort by the next, then the next, and so on. For example, the following query sorts on birth dates first (oldest to newest) and then sorts on the names in alphabetical order. **The order of columns is important!**

~~~~sql
SELECT birthdate, name
FROM people
ORDER BY birthdate, name;
~~~~

### GROUP BY
In SQL, GROUP BY allows you to group a result by one or more columns.

- GROUP BY must be placed immediately after the WHERE conditions, if any, and just before the ORDER BY clause

~~~~sql
SELECT first_name, COUNT(first_name) as n_first_name
FROM employees
GROUP BY first_name
ORDER BY first_name;
~~~~

### HAVING
HAVING is a clause frequently used with GROUP BY because it filters out of the output the groups that do not satisfy a certain condition.

- HAVING needs to be inserted between the GROUP BY and the ORDER BY clauses
- HAVING is like WHERE but applied to the GROUP BY block
- HAVING conditions can use aggregate functions, while in the WHERE block this is forbidden

~~~~sql
SELECT first_name, COUNT(first_name) as n_first_name
FROM employees
WHERE hire_date > '1999-01-01'
GROUP BY first_name
HAVING COUNT(first_name)>250
AND COUNT(first_name)<270
ORDER BY first_name;
~~~~

### LIMIT
To return only a certain number of records in the output, we can use LIMIT:
~~~~sql
SELECT *
FROM salaries
ORDER BY salary DESC
LIMIT 10;
~~~~

# Relational Databases

## Query information_schema with SELECT

*information_schema* is a meta-database that holds information about your current database. *information_schema* has multiple tables you can query with the known SELECT * FROM syntax:

- 'tables': information about all tables in your current database
- 'columns': information about all columns in all of the tables in your current database
- ...

~~~~sql
-- Query the right table in information_schema
SELECT * 
FROM information_schema.tables
-- Specify the correct table_schema value
WHERE table_schema = 'public';
~~~~

Now have a look at the columns in the table university_professors by selecting all entries in information_schema.columns that correspond to that table.

~~~~sql
-- Query the right table in information_schema to get columns
SELECT column_name, data_type 
FROM information_schema.columns 
WHERE table_name = 'university_professors' AND table_schema = 'public';
~~~~

## CREATE TABLEs
The syntax for creating simple tables is as follows:
~~~~sql
-- Create a table for the professors entity type
CREATE TABLE professors (
 firstname text,
 lastname text
);
-- Print the contents of this table
SELECT * 
FROM professors;
~~~~

Adding columns to existing tables is easy, especially if they're still empty.
To add columns you can use the following SQL query:
~~~~sql
ALTER TABLE professors
ADD COLUMN university_shortname text;
~~~~

To rename a column:
~~~~sql
ALTER TABLE table_name
RENAME COLUMN old_name TO new_name;
~~~~

To delete a column:
~~~~sql
ALTER TABLE table_name
DROP COLUMN column_name;
~~~~

#### INSERT INTO
To insert values manually we can use the following pattern:
~~~~sql
INSERT INTO table_name (column_1, column_2, …, column_n)
VALUES (value_1, value_2, …, value_n);
~~~~
- we must put the VALUES in the exact order in which we have listed the column names

or we can insert from another table using the following pattern:
~~~~sql
INSERT INTO table_a
SELECT DISTINCT column_name1, column_name2, ...
FROM table_b;
~~~~
This inserts all distinct rows selected from table_b into table_a.
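
Both patterns can be sketched with the `professors` table. The inserted values are made up, and the example assumes that `professors` has exactly the columns `firstname`, `lastname` and `university_shortname` (in that order) and that `university_professors` has columns with the same names.

~~~~sql
-- Insert a single row by hand (values are made up)
INSERT INTO professors (firstname, lastname, university_shortname)
VALUES ('Ada', 'Example', 'ETH');

-- Copy each distinct professor from the source table
INSERT INTO professors
SELECT DISTINCT firstname, lastname, university_shortname
FROM university_professors;
~~~~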

## CONSTRAINTS

### Type CASTs
If you know that a certain column stores numbers as text, you can *cast* the column to a numeric form, e.g. to integer.

In the following example the 'fee' column is a string field:
~~~~sql
-- Calculate the net amount as amount + fee
SELECT transaction_date, amount + CAST(fee AS integer) AS net_amount 
FROM transactions;
~~~~

### Change types with ALTER COLUMN
The syntax for changing the data type of a column is straightforward. The following code changes the data type of the column_name column in table_name to varchar(10):
~~~~sql
ALTER TABLE table_name
ALTER COLUMN column_name
TYPE varchar(10);
~~~~

### Convert types USING a function
If you don't want to reserve too much space for a certain varchar column, you can truncate the values before converting its type.
For this, you can use the following syntax:
~~~~sql
ALTER TABLE table_name
ALTER COLUMN column_name
TYPE varchar(x)
USING SUBSTRING(column_name FROM 1 FOR x);
~~~~
You should read it like this: Because you want to reserve only x characters for column_name, you have to retain a SUBSTRING of every value, i.e. the first x characters of it, and throw away the rest. This way, the values will fit the varchar(x) requirement.

### Disallow NULL values with SET NOT NULL
The NOT NULL constraint ensures that a column cannot contain NULL values.
- when you insert values in the table you cannot leave the respective field empty
- Don't confuse a NULL value with the value of 0 or with a "NONE" response

~~~~sql
-- Disallow NULL values in firstname
ALTER TABLE professors 
ALTER COLUMN firstname SET NOT NULL;
~~~~

### Make your columns UNIQUE with ADD CONSTRAINT
If you want to add a unique constraint to an existing table, you do it like this:

~~~~sql
ALTER TABLE table_name
ADD CONSTRAINT some_name UNIQUE(column_name);
~~~~

## Keys and Superkeys
What is a key?
- Attribute(s) that identify a record uniquely
- As long as attributes can be removed without losing uniqueness, the combination is a superkey
- If no more attributes can be removed without losing the uniqueness property, it is a minimal superkey, or key

### Identify keys with SELECT COUNT DISTINCT
There's a very basic way of finding out what qualifies for a key in an existing, populated table:

1. Count the distinct records for all possible combinations of columns. If the resulting number x equals the number of all rows in the table for a combination, you have discovered a superkey.

2. Then remove one column after another until you can no longer remove columns without seeing the number x decrease. If that is the case, you have discovered a (candidate) key.
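
The two steps above can be sketched as follows, assuming PostgreSQL and a hypothetical `cars` table with `make` and `model` columns:

~~~~sql
-- Total number of rows in the table
SELECT COUNT(*) FROM cars;

-- Distinct combinations of make and model;
-- if this equals the total number of rows, (make, model) is a superkey
SELECT COUNT(DISTINCT(make, model)) FROM cars;

-- Remove one column and count again;
-- a smaller number means that model alone does not identify a row
SELECT COUNT(DISTINCT model) FROM cars;
~~~~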

### Primary Keys
Primary Key - a column (or a set of columns) whose value exists and is unique for every record

- each table can have one and only one primary key;
- may be composed of a set of columns whose combination is unique;
- the primary keys are the unique identifiers of a table;
- cannot contain null values;
- not all tables you work with will have a primary key.

In a relational schema, the primary key field is usually recorded on top of the other fields of the table, and it is always underlined.

~~~~sql
ALTER TABLE table_name
ADD CONSTRAINT some_name PRIMARY KEY (column_name);
~~~~

### Surrogate keys
A surrogate key is any column or set of columns that can be declared as the primary key instead of a “real” or natural key.
- Primary keys should be built from as few columns as possible
- Primary keys should never change over time

#### Add a SERIAL surrogate key
~~~~sql
-- Add the new column to the table
ALTER TABLE professors 
ADD COLUMN id serial;

-- Make id a primary key
ALTER TABLE professors 
ADD CONSTRAINT professors_pkey PRIMARY KEY (id);
~~~~

#### CONCATenate columns to a surrogate key
Another strategy to add a surrogate key to an existing table is to concatenate existing columns with the CONCAT() function.

~~~~sql
-- Count the number of distinct rows with columns make, model
SELECT COUNT(DISTINCT(make, model)) 
FROM cars;

-- Add the id column
ALTER TABLE cars
ADD COLUMN id varchar(128);

-- Update id with make + model
UPDATE cars
SET id = CONCAT(make, model);

-- Make id a primary key
ALTER TABLE cars
ADD CONSTRAINT id_pk PRIMARY KEY(id);

-- Have a look at the table
SELECT * FROM cars;
~~~~

### Foreign Keys
A foreign key (FK) points to the primary key (PK) of another table
- The domain of the FK must be equal to the domain of the PK
- Each value of the FK must exist in the PK of the other table (FK constraint or referential integrity)
- FKs are not actual keys, because duplicates and null values are allowed.

You want the professors table to reference the universities table. You can do that by specifying a column in professors table that references a column in the universities table.
~~~~sql
-- Rename the university_shortname column
ALTER TABLE professors
RENAME COLUMN university_shortname TO university_id;

-- Add a foreign key on professors referencing universities
ALTER TABLE professors
ADD CONSTRAINT professors_fkey FOREIGN KEY (university_id) REFERENCES universities (id);
~~~~

#### JOIN tables linked by a foreign key
~~~~sql
-- Select all professors working for universities in the city of Zurich
SELECT professors.lastname, universities.id, universities.university_city
FROM professors
JOIN universities
ON professors.university_id = universities.id
WHERE universities.university_city = 'Zurich';
~~~~

### REFERENTIAL INTEGRITY
A record referencing another table must refer to an existing record in that table
- Foreign keys prevent violations
- ON DELETE CASCADE can be used to delete the referencing records automatically when the referenced record is deleted

## The UPDATE Statement
The UPDATE Statement is used to update the values of existing records in a table

To UPDATE a record, use the following pattern:
~~~~sql
UPDATE table_name
SET column_1 = value_1, column_2 = value_2 …
WHERE conditions;
~~~~

- if you don’t provide a WHERE condition, all rows of the table will be updated

If we want to update specific data for one employee, we can use the query below.
In this case we update only the employee whose number is 9990:
~~~~sql
UPDATE employees
SET 
    first_name = 'Stella',
    last_name = 'Parkinson',
    birth_date = '1990-12-31',
    gender = 'F'
WHERE emp_no = 9990;
~~~~

## The DELETE Statement
Used to delete records in a table
~~~~sql
DELETE FROM employees
WHERE emp_no = 9990;
~~~~

## DROP vs TRUNCATE vs DELETE
DROP
- you won’t be able to roll back to its initial state, or to the last COMMIT statement
- use DROP TABLE only when you are sure you aren’t going to use the table in question anymore

TRUNCATE 
- TRUNCATE behaves like DELETE without a WHERE clause
- when truncating, auto-increment values will be reset

DELETE 
- removes records row by row

TRUNCATE vs DELETE without WHERE
- TRUNCATE delivers the output much quicker than DELETE
- auto-increment values are not reset with DELETE
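
The difference can be sketched with a throwaway table. This assumes MySQL with the InnoDB engine; `notes_demo` and its columns are hypothetical.

~~~~sql
CREATE TABLE notes_demo (
    id   INT AUTO_INCREMENT,
    body VARCHAR(255),
    PRIMARY KEY (id)
);

INSERT INTO notes_demo (body) VALUES ('a'), ('b');   -- ids 1 and 2

DELETE FROM notes_demo;                              -- rows removed, counter kept
INSERT INTO notes_demo (body) VALUES ('c');          -- id continues at 3

TRUNCATE TABLE notes_demo;                           -- rows removed, counter reset
INSERT INTO notes_demo (body) VALUES ('d');          -- id starts again at 1

DROP TABLE notes_demo;                               -- the table itself is gone
~~~~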

# Joining Data in SQL

Joins - the SQL tool that allows us to construct a relationship between objects

- a join shows a result set, containing fields derived from two or more tables
- we must find a related column from the two tables that contains the same type of data
- we will be free to add columns from these two tables to our output
- the columns you use to relate tables must represent the same object such as id
- the tables you are considering need to be logically adjacent

### JOIN + WHERE
- JOIN: is used for connecting the 'table_a' and 'table_b'
- WHERE: used to define the condition or conditions that will determine which will be the connecting points between the two tables

## Inner Join
Inner joins extract only records in which the values in the related columns match. Null values, or values appearing in just one of the two tables and not appearing in the other, are not displayed.
The result will be empty when no matching values exist.

The SQL keyword for an inner join can be JOIN or INNER JOIN; they have the same function.
~~~~sql
SELECT
table_1.column_name(s), table_2.column_name(s)
FROM
table_1
JOIN
table_2 ON table_1.column_name = table_2.column_name;
~~~~
##### Using Aliases
~~~~sql
SELECT
t1.column_name, t1.column_name, …, t2.column_name, …
FROM
 -- table_1 t1 means table_1 as t1
table_1 t1
JOIN
table_2 t2 ON t1.column_name = t2.column_name;
~~~~

#### Dealing with duplicates
You cannot allow yourself to assume there are no duplicate rows in your data. Thus, you can use GROUP BY to deal with them:
~~~~sql
SELECT t1.column_name, t1.column_name, t2.column_name, t2.column_name
FROM table_1 t1
JOIN table_2 t2 ON t1.column_name = t2.column_name
GROUP BY t1.column_name;
~~~~

### INNER JOIN via USING
When joining tables with a common field name you can use USING as a shortcut:
~~~~sql
SELECT *
FROM countries
INNER JOIN economies
USING(code);
~~~~

## Self Join
A self join is applied when a table must be joined with itself
- if you would like to combine certain rows of a table with other rows of the same table, you need a self-join
- the self-join references the same table twice and treats the two references as separate tables in its operations
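
A minimal sketch of a self join, assuming the `employees` table has a `manager_no` column that stores the `emp_no` of each employee's manager (a hypothetical column used only for illustration):

~~~~sql
-- e and m are two aliases for the same table
SELECT e.emp_no,
       e.first_name AS employee,
       m.first_name AS manager
FROM employees e
JOIN employees m ON e.manager_no = m.emp_no;
~~~~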

### CASE WHEN and THEN
Often it's useful to look at a numerical field not as raw data, but instead as being in different categories or groups.
You can use CASE with WHEN, THEN, ELSE, and END to define a new grouping field.

Ex:
Using the countries table, create a new field AS geosize_group that groups the countries into three groups:

- If surface_area is greater than 2 million, geosize_group is 'large'.
- If surface_area is greater than 350 thousand but not larger than 2 million, geosize_group is 'medium'.
- Otherwise, geosize_group is 'small'.

~~~~sql
SELECT name, continent, code, surface_area,
    -- 1. First case
    CASE WHEN surface_area > 2000000 THEN 'large'
        -- 2. Second case
        WHEN surface_area > 350000  THEN 'medium'
        -- 3. Else clause + end
        ELSE 'small' END
        -- 4. Alias name
        AS geosize_group
-- 5. From table
FROM countries;
~~~~

If we want to save the results in a new table, we can use INTO:

~~~~sql
SELECT name, continent, code, surface_area,
    CASE WHEN surface_area > 2000000
            THEN 'large'
       WHEN surface_area > 350000
            THEN 'medium'
       ELSE 'small' END
       AS geosize_group
INTO countries_plus
FROM countries;

SELECT country_code, size,
  CASE WHEN size > 50000000
            THEN 'large'
       WHEN size > 1000000
            THEN 'medium'
       ELSE 'small' END
       AS popsize_group
INTO pop_plus       
FROM populations
WHERE year = 2015;

-- 5. Select fields
SELECT c.name, c.continent, c.geosize_group, p.popsize_group
-- 1. From countries_plus (alias as c)
FROM countries_plus c
  -- 2. Join to pop_plus (alias as p)
  INNER JOIN pop_plus p
    -- 3. Match on country code
    ON c.code = p.country_code
-- 4. Order the table    
ORDER BY geosize_group;
~~~~


## LEFT AND RIGHT JOIN
The LEFT JOIN delivers a list with all records from the left table, including those that do not match any rows from the right table.

The RIGHT JOIN is functionally identical to the LEFT JOIN, with the only difference being that the direction of the operation is inverted.
- right joins are seldom applied in practice.
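
As a sketch, assuming a `countries` table with `code` and `name` columns and a `languages` table with `country_code` and `name` columns, the same result can be written in both directions:

~~~~sql
-- Every country appears; language_name is NULL when no language row matches
SELECT c.name AS country, l.name AS language_name
FROM countries c
LEFT JOIN languages l ON c.code = l.country_code;

-- The same result written as a RIGHT JOIN with the tables swapped
SELECT c.name AS country, l.name AS language_name
FROM languages l
RIGHT JOIN countries c ON c.code = l.country_code;
~~~~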


## FULL JOINS
A FULL JOIN combines a LEFT JOIN and a RIGHT JOIN, bringing in all records from both the left and the right table and keeping track of the missing values accordingly.

## CROSS JOINS
A CROSS JOIN will take the values from a certain table and connect them with all the values from the tables we want to join it with.
- connects all the values, not just those that match
- the Cartesian product of the values of two or more sets
- particularly useful when the tables in a database are not well connected
- Recall that cross joins do not use ON or USING
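
A minimal sketch, assuming `countries` and `languages` tables that each have a `name` column:

~~~~sql
-- Every country is paired with every language, whether related or not;
-- the number of rows is (rows in countries) x (rows in languages)
SELECT c.name AS country, l.name AS language_name
FROM countries c
CROSS JOIN languages l;
~~~~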


### Tips and Tricks for JOINS
- one should look for key columns, which are common between the tables involved in the analysis and are necessary to solve the task at hand
- these columns do not need to be foreign or primary keys;

# SET THEORY CLAUSES

## UNIONs
UNION and UNION ALL are used to combine the results of several SELECT statements into a single output
- you can think of them as tools that allow you to unify tables
### UNION
UNION displays only distinct values in the output
- UNION uses more computational resources (processing power and storage space) than UNION ALL, because it has to find and remove the duplicates

### UNION ALL
UNION ALL retrieves the duplicates as well

Both follow the same pattern:
~~~~sql
SELECT column_1, column_2   -- N columns
FROM table_1
UNION ALL                   -- or UNION to drop duplicates
SELECT column_1, column_2   -- the same N columns
FROM table_2;
~~~~

It is important to know that:
- We have to select the same number of columns from each table.
- These columns do not need to share names (the output takes its column names from the first SELECT), but they should be in the same order and should contain related data types.
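
A minimal sketch of the difference between the two, using Python's built-in sqlite3 module. The tables customers_2023 and customers_2024 are invented for this example; their columns are deliberately named differently to show where the output names come from.

~~~~python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers_2023 (customer_name TEXT, city TEXT);
    CREATE TABLE customers_2024 (full_name TEXT, town TEXT);
    INSERT INTO customers_2023 VALUES ('Ana', 'Lisbon'), ('Ben', 'Oslo');
    INSERT INTO customers_2024 VALUES ('Ben', 'Oslo'), ('Cy', 'Rome');
""")

# UNION ALL keeps the duplicate ('Ben', 'Oslo') row.
union_all_rows = conn.execute("""
    SELECT customer_name, city FROM customers_2023
    UNION ALL
    SELECT full_name, town FROM customers_2024;
""").fetchall()

# UNION removes it; the output columns are named after the first SELECT.
union_cursor = conn.execute("""
    SELECT customer_name, city FROM customers_2023
    UNION
    SELECT full_name, town FROM customers_2024
    ORDER BY customer_name;
""")
column_names = [col[0] for col in union_cursor.description]
union_rows = union_cursor.fetchall()

print(len(union_all_rows), len(union_rows))
print(column_names)
~~~~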

## INTERSECT
INTERSECT returns only the records that appear in the results of both queries, compared across all the selected fields.
- INTERSECT compares whole RECORDS, not individual key fields the way a join does when it matches rows.

## EXCEPT
EXCEPT allows you to include only the records that are in one table, but not the other.
- Only the records that appear in the left table BUT DO NOT appear in the right table are included.
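
The sketch below illustrates both clauses on two made-up tables (left_table and right_table are hypothetical), again through Python's built-in sqlite3 module. Note the row with id 3: the id appears on both sides, but the record as a whole differs, so INTERSECT does not return it and EXCEPT keeps it.

~~~~python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE left_table (id INTEGER, label TEXT);
    CREATE TABLE right_table (id INTEGER, label TEXT);
    INSERT INTO left_table VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO right_table VALUES (2, 'b'), (3, 'x'), (4, 'd');
""")

# Records present in both results (all selected fields must match).
common = conn.execute("""
    SELECT id, label FROM left_table
    INTERSECT
    SELECT id, label FROM right_table
    ORDER BY id;
""").fetchall()

# Records of the left result that are absent from the right result.
only_left = conn.execute("""
    SELECT id, label FROM left_table
    EXCEPT
    SELECT id, label FROM right_table
    ORDER BY id;
""").fetchall()

print(common)
print(only_left)
~~~~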

## Semi-joins and Anti-joins
Semi-joins and anti-joins use the right table to determine which records to keep in the left table.
- Semi-joins: keep the records of the left table that have a match in the right table. We write the right-table query as a subquery and use it as the condition to check in the WHERE clause of the outer query.

~~~~sql
-- Select distinct fields
SELECT DISTINCT name
  -- From languages
  FROM languages
-- Where in statement
WHERE code IN
  -- Subquery
  (SELECT code
   FROM countries
   WHERE region = 'Middle East')
-- Order by name
ORDER BY name;
~~~~

- Anti-joins: keep the records of the left table that have no match in the right table. Add NOT before IN to exclude the records selected by the subquery.
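
A small side-by-side sketch of a semi-join and an anti-join, run with Python's built-in sqlite3 module. The table layout (a code column in both tables) and the rows are assumptions made for this example only.

~~~~python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE countries (code TEXT PRIMARY KEY, name TEXT, region TEXT);
    CREATE TABLE languages (code TEXT, name TEXT);
    INSERT INTO countries VALUES
        ('ARE', 'United Arab Emirates', 'Middle East'),
        ('BRA', 'Brazil', 'South America');
    INSERT INTO languages VALUES
        ('ARE', 'Arabic'), ('BRA', 'Portuguese'), ('BRA', 'Spanish');
""")

# Semi-join: languages spoken in Middle East countries.
semi = conn.execute("""
    SELECT DISTINCT name
    FROM languages
    WHERE code IN
        (SELECT code FROM countries WHERE region = 'Middle East')
    ORDER BY name;
""").fetchall()

# Anti-join: the same query with NOT IN keeps everything else.
anti = conn.execute("""
    SELECT DISTINCT name
    FROM languages
    WHERE code NOT IN
        (SELECT code FROM countries WHERE region = 'Middle East')
    ORDER BY name;
""").fetchall()

print(semi)
print(anti)
~~~~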

# Subqueries

Subqueries are queries embedded in a query. They are also called inner queries or nested queries.

- they are part of another query, called an outer query
- a subquery should always be placed within parentheses
- a subquery may return a single value (a scalar), a single row, a single column, or an entire table
- you can have more than one subquery in your outer query
- allow for better structuring of the outer query
    - thus, each inner query can be thought of in isolation

1. the SQL engine starts by running the inner query
2. then it uses its returned output, which is intermediate, to execute the outer query
3. it is possible to nest inner queries within other inner queries
    - in that case, the SQL engine would execute the innermost query first, and then each subsequent query, until it runs the outermost query last

The following example selects all fields from populations for the records whose life_expectancy is greater than 1.15 times the average life_expectancy, considering data only for the year 2015.

~~~sql
-- Select fields
SELECT *
  -- From populations
  FROM populations
-- Where life_expectancy is greater than
WHERE life_expectancy > 1.15 *
  -- 1.15 * subquery
  (SELECT AVG(life_expectancy)
   FROM populations
   WHERE year = 2015)
  AND year = 2015;
~~~
 
## Subqueries with EXISTS - NOT EXISTS Nested Inside WHERE

EXISTS checks whether a subquery returns any rows
- this check is conducted row by row, once for each record of the outer query
- it returns a Boolean value
    - if the subquery returns at least one row, EXISTS returns TRUE, and the corresponding record of the outer query is extracted
    - if the subquery returns no rows, EXISTS returns FALSE, and the corresponding record of the outer query is not extracted
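
A minimal sketch of EXISTS and NOT EXISTS, executed with Python's built-in sqlite3 module. The customers and orders tables are hypothetical and exist only to make the example runnable.

~~~~python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cy');
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 3);
""")

# For each customer, the subquery is checked: does at least one order exist?
with_orders = conn.execute("""
    SELECT c.name
    FROM customers AS c
    WHERE EXISTS
        (SELECT 1 FROM orders AS o WHERE o.customer_id = c.customer_id)
    ORDER BY c.name;
""").fetchall()

# NOT EXISTS keeps the customers for which the subquery returns no rows.
without_orders = conn.execute("""
    SELECT c.name
    FROM customers AS c
    WHERE NOT EXISTS
        (SELECT 1 FROM orders AS o WHERE o.customer_id = c.customer_id)
    ORDER BY c.name;
""").fetchall()

print(with_orders)
print(without_orders)
~~~~

Ana has two orders but is listed once, because EXISTS only asks whether a match exists, not how many there are.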
    
# Subqueries nested in SELECT and FROM
A subquery in FROM acts as a temporary table that the outer query can select from. The example below uses one to determine the number of languages spoken in each country, identified by the country's local name.
~~~~sql
-- Select fields
SELECT local_name, subquery.lang_num
  -- From countries
  FROM countries,
  	-- Subquery (alias as subquery)
  	(SELECT code, COUNT(name) as lang_num
  	 FROM languages
     GROUP BY code) AS subquery
  -- Where codes match
  WHERE countries.code = subquery.code
-- Order by descending number of languages
ORDER BY lang_num DESC;
~~~~

## SQL Views
A view is a virtual table whose contents are obtained from an existing table or tables, called base tables
- think of a view object as a view into the base table
- the view itself does not contain any real data; the data is physically stored in the base table
~~~~sql
CREATE VIEW view_name AS
SELECT column_1, column_2, ..., column_n
FROM table_name;
~~~~

Why use Views?
- a view acts as a shortcut, so the same SELECT statement does not have to be rewritten every time a new request is made
- saves a lot of coding time
- occupies no extra memory
- acts as a dynamic table because it instantly reflects data and structural changes in the base table

Don’t forget they are not real, physical data sets, meaning we cannot insert or update the information that has already been extracted.
- they should be seen as temporary virtual data tables retrieving information from base tables
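
To illustrate the "dynamic table" point, here is a small sketch using Python's built-in sqlite3 module. The employees table and the view name v_sales_staff are invented for the example. The view is queried before and after a row is added to the base table.

~~~~python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (emp_no INTEGER PRIMARY KEY, name TEXT, dept TEXT);
    INSERT INTO employees VALUES (1, 'Ana', 'Sales'), (2, 'Ben', 'IT');

    -- The view stores only the SELECT statement, not the rows.
    CREATE VIEW v_sales_staff AS
    SELECT emp_no, name
    FROM employees
    WHERE dept = 'Sales';
""")

before = conn.execute("SELECT name FROM v_sales_staff ORDER BY name;").fetchall()

# Change the base table, not the view.
conn.execute("INSERT INTO employees VALUES (3, 'Cy', 'Sales');")

after = conn.execute("SELECT name FROM v_sales_staff ORDER BY name;").fetchall()

print(before)
print(after)
~~~~

The second query returns the new employee even though the view itself was never touched.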

## Stored Routines
In everyday usage, a routine is a usual, fixed action, or series of actions, repeated periodically. A stored routine, in turn, is an SQL statement, or a set of SQL statements, that can be stored on the database server.

- whenever a user needs to run the query in question, they can call, reference, or invoke the routine
- they can be stored procedures or (user-defined) functions
- a procedure may or may not have parameters, which represent certain values that the procedure will use to complete the calculation

To create a procedure, we can use:

~~~~sql
DELIMITER $$
CREATE PROCEDURE sp_name()
BEGIN
  -- statements
END $$
DELIMITER ;
~~~~

There are two ways to call a procedure:
~~~~sql
-- 1:
CALL database_name.procedure_name();

-- 2:
-- If the database has already been selected with:
USE database_name;
-- the procedure can be called directly:
CALL procedure_name();
~~~~

### Stored Procedures with an Input Parameter
- a stored routine can perform a calculation that transforms an input value into an output value
- stored procedures can take an input value and then use it in the query, or queries, written in the body of the procedure

~~~~sql
DELIMITER $$
CREATE PROCEDURE procedure_name(IN parameter DATA_TYPE)
BEGIN
    -- statements;
END$$
DELIMITER ;
~~~~

### Stored Procedures with an Output Parameter
- the OUT parameter represents the variable containing the output value of the operation executed by the query of the stored procedure

~~~~sql
DELIMITER $$
CREATE PROCEDURE procedure_name(IN in_parameter DATA_TYPE, OUT out_parameter DATA_TYPE)
BEGIN
    SELECT _ _ _
    INTO out_parameter FROM ...;
END$$
DELIMITER ;
~~~~
- in a procedure containing both an IN and an OUT parameter, the SELECT-INTO structure is what stores the result of the query in the OUT parameter.

#### SQL Variables
To create a session variable, pass it to a procedure as the OUT argument, and then read the result:
~~~~sql
SET @v_variable_name = 0;
CALL procedure_name(in_parameter, @v_variable_name);
SELECT @v_variable_name;
~~~~

### User-Defined Functions
- here you have no OUT parameters to define between the parentheses after the object’s name
- all parameters are IN, and since this is well known, you need not explicitly indicate it with the word, ‘IN’
- although there are no OUT parameters, there is a ‘return value’, which is obtained after running the query contained in the body of the function

~~~~sql
DELIMITER $$
CREATE FUNCTION function_name(parameter data_type) RETURNS data_type
BEGIN
    -- DECLARE must come first inside the BEGIN ... END block
    DECLARE variable_name data_type;
    SELECT …
    RETURN variable_name;
END$$
DELIMITER ;
~~~~

- we cannot invoke a function with CALL!
- we can select it, indicating an input value within parentheses
~~~~sql
SELECT function_name(input_value);
~~~~

### Conceptual Differences
- A stored procedure can have multiple OUT parameters
- A user-defined function can return a single value only
    - if you need just one value to be returned, then you can use a function (recommended)
    
- Statements that modify data (INSERT, UPDATE and DELETE) belong in stored procedures; a function is meant to compute and return a value
- You can easily include a function as one of the columns inside a SELECT statement

## Advanced Topics

### Variables
In MySQL, there are three types of variables:

- A local variable is visible only in the BEGIN – END block in which it was created
    - DECLARE is a keyword that can be used when creating local variables only
- A session variable exists only for the session in which you are operating
    - it is defined on our server, and it lives there
    - it is visible to the connection being used only
    - to create a MySQL session variable we use SET @variable_name = value;
- Global variables apply to all connections related to a specific server
    - you cannot set just any variable as global
    - a specific group of pre-defined variables in MySQL is suitable for this job. They are called system variables
    - To create a global variable:
        ~~~~SQL
            SET GLOBAL var_name = value;
            -- or
            SET @@global.var_name = value;
        ~~~~

## Indexes
The index of a table works like the index of a book
- data is taken from a column of the table and is stored in a certain order in a distinct place, called an index
- the larger a database is, the slower the process of finding the record or records you need
- we can use an index that will increase the speed of searches related to a table

~~~~sql
    CREATE INDEX index_name
    ON table_name (column_1, column_2, …); -- these must be fields from your data table you will search frequently
~~~~

### Composite Indexes
They are applied to multiple columns, not just a single one
- carefully pick the columns that would optimize your search!
- primary and unique keys are also indexes in MySQL; they represent columns on which a person would typically base their search
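
As a rough sketch of how to check that a composite index is actually used, the snippet below creates one and inspects the query plan with Python's built-in sqlite3 module. The employees table and the index name i_composite are made up. EXPLAIN QUERY PLAN is SQLite's command; MySQL offers EXPLAIN for the same purpose, with different output.

~~~~python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (
        emp_no     INTEGER PRIMARY KEY,
        first_name TEXT,
        last_name  TEXT,
        hire_date  TEXT
    );
    INSERT INTO employees VALUES
        (1, 'Ana', 'Silva', '2020-01-15'),
        (2, 'Ben', 'Olsen', '2021-06-01');

    -- Composite index on the two columns we expect to search by together.
    CREATE INDEX i_composite ON employees (first_name, last_name);
""")

# Ask SQLite how it would run a search on both indexed columns.
plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT * FROM employees
    WHERE first_name = 'Ana' AND last_name = 'Silva';
""").fetchall()

# The last field of each plan row is the human-readable description.
plan_text = " ".join(row[-1] for row in plan)
print(plan_text)
~~~~

If the plan description mentions i_composite, the search is served by the index instead of a full table scan.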

# CASE statements
Case statements are SQL's version of an "IF this THEN that" statement. Case statements have three parts -- a WHEN clause, a THEN clause, and an ELSE clause. 

~~~~sql
CASE WHEN x = 1 THEN 'a'
     WHEN x = 2 THEN 'b'
     ELSE 'c' END AS new_column
~~~~

- When you have completed your statement, be sure to include the term END and give it an alias.

~~~~sql
-- Identify the home team as Bayern Munich, Schalke 04, or neither
SELECT 
	CASE WHEN hometeam_id = 10189 THEN 'FC Schalke 04'
         WHEN hometeam_id = 9823 THEN 'FC Bayern Munich'
         ELSE 'Other' END AS home_team,
	COUNT(id) AS total_matches
FROM matches_germany
-- Group by the CASE statement alias
GROUP BY home_team;
~~~~

## CASE WHEN ... AND THEN
- Add multiple logical conditions to your WHEN clause!
~~~~sql
SELECT date, hometeam_id, awayteam_id,
    CASE WHEN hometeam_id = 8455 AND home_goal > away_goal THEN 'Chelsea home win!'
         WHEN awayteam_id = 8455 AND home_goal < away_goal THEN 'Chelsea away win!'
         ELSE 'Loss or tie :(' END AS outcome
FROM match
WHERE hometeam_id = 8455 OR awayteam_id = 8455;
~~~~


~~~~sql

SELECT 
	date,
	CASE WHEN hometeam_id = 8634 THEN 'FC Barcelona' 
         ELSE 'Real Madrid CF' END as home,
	CASE WHEN awayteam_id = 8634 THEN 'FC Barcelona' 
         ELSE 'Real Madrid CF' END as away,
	-- Identify all possible match outcomes
	CASE WHEN home_goal > away_goal AND hometeam_id = 8634 THEN 'Barcelona win!'
        WHEN home_goal >  away_goal AND hometeam_id = 8633 THEN 'Real Madrid win!'
        WHEN home_goal < away_goal AND awayteam_id = 8634 THEN 'Barcelona win!'
        WHEN home_goal < away_goal AND awayteam_id = 8633 THEN 'Real Madrid win!'
        ELSE 'Tie!' END AS outcome
FROM matches_spain
WHERE (awayteam_id = 8634 OR hometeam_id = 8634)
      AND (awayteam_id = 8633 OR hometeam_id = 8633);
~~~~

Using CASE to exclude games that Bologna did not win:
We put the CASE expression in the WHERE clause and keep only the rows where it IS NOT NULL.
~~~~sql
-- Select the season, date, home_goal, and away_goal columns
SELECT 
	season,
    date,
	home_goal,
	away_goal
FROM matches_italy
WHERE 
-- Exclude games not won by Bologna
	CASE WHEN hometeam_id = 9857 AND home_goal > away_goal THEN 'Bologna Win'
		WHEN awayteam_id = 9857 AND away_goal > home_goal THEN 'Bologna Win' 
		END IS NOT NULL;
~~~~

## CASE WHEN with aggregate functions
~~~~sql
SELECT season,
       COUNT(CASE WHEN hometeam_id = 8560 AND home_goal > away_goal THEN id END) as home_wins,
       COUNT(CASE WHEN awayteam_id = 8560 AND home_goal < away_goal THEN id END) as away_wins
FROM match
GROUP BY season;
~~~~

### Percentages with CASE and AVG 
~~~~sql
SELECT  season,
        -- Ties match no WHEN clause, yield NULL, and are ignored by AVG
        ROUND(AVG(CASE WHEN hometeam_id = 8455 AND home_goal > away_goal THEN 1
                       WHEN hometeam_id = 8455 AND home_goal < away_goal THEN 0 END), 2) AS pct_homewins,
        ROUND(AVG(CASE WHEN awayteam_id = 8455 AND away_goal > home_goal THEN 1
                       WHEN awayteam_id = 8455 AND away_goal < home_goal THEN 0 END), 2) AS pct_awaywins
FROM match
GROUP BY season;
~~~~

# Subqueries Again

## Simple subqueries
- A simple subquery is processed only once in the entire statement, independently of the outer query

### Subqueries in the WHERE clause
~~~~sql
SELECT home_goal
FROM match
WHERE home_goal > (
    SELECT AVG(home_goal)
    FROM match);
~~~~

~~~~sql
SELECT  team_long_name,  team_short_name AS abbr 
FROM team
WHERE team_api_id IN (
    SELECT hometeam_id
    FROM match
    WHERE country_id = 15722);
~~~~

### Subqueries in the FROM statement
- Useful tool to restructure and transform your data
    - transforming data from long to wide before selecting
    - prefiltering data
- Calculating aggregates of aggregates

~~~~sql
SELECT
	-- Select country name and the count match IDs
    c.name AS country_name,
    COUNT(sub.id) AS matches
FROM country AS c
-- Inner join the subquery onto country
-- Select the country id and match id columns
INNER JOIN (SELECT country_id, id 
           FROM match
           -- Filter the subquery by matches with 10+ goals
           WHERE (home_goal + away_goal) >=10) AS sub
ON c.id = sub.country_id
GROUP BY country_name;
~~~~

~~~~sql
SELECT
	-- Select country, date, home, and away goals from the subquery
    country,
    date,
    home_goal,
    away_goal
FROM 
	-- Select country name, date, and total goals in the subquery
	(SELECT c.name AS country, 
     	    m.date, 
     		m.home_goal, 
     		m.away_goal,
           (m.home_goal + m.away_goal) AS total_goals
    FROM match AS m
    LEFT JOIN country AS c
    ON m.country_id = c.id) AS subq
-- Filter by total goals scored in the main query
WHERE total_goals >= 10;
~~~~

### Subqueries in SELECT
- are used to return a single aggregated value
~~~~sql
SELECT date,  (home_goal + away_goal) AS goals,
       (home_goal + away_goal) -
       (SELECT AVG(home_goal + away_goal)
        FROM match
        WHERE season = '2011/2012') AS diff 
FROM match 
WHERE season = '2011/2012'; 
~~~~

- A subquery in SELECT needs to return a SINGLE value; otherwise it will generate an error
- Make sure you have all filters in the right places
    - Properly filter both the main and the subquery!
   
### Best Practices
- Multiple subqueries can be included in a single query, in SELECT, FROM, WHERE, and so on
- FORMAT YOUR QUERIES!
- Annotate your queries: say what each part does

# Correlated Subqueries
Correlated subqueries are a special kind of subquery that uses values from the outer query in order to generate the final results.
- The subquery is re-executed for each row processed by the outer query, in order to properly generate each new piece of information.
- Correlated subqueries are used for special types of calculations, such as advanced joining, filtering, and evaluating of data in the database.
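
A minimal sketch of a correlated subquery, run with Python's built-in sqlite3 module. The matches table and its four rows are invented for illustration. The inner query refers to m.season from the outer query, so each match is compared with the average of its own season.

~~~~python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE matches (
        id INTEGER PRIMARY KEY,
        season TEXT,
        home_goal INTEGER,
        away_goal INTEGER
    );
    INSERT INTO matches VALUES
        (1, '2011/2012', 1, 0),
        (2, '2011/2012', 4, 3),
        (3, '2012/2013', 2, 1),
        (4, '2012/2013', 0, 0);
""")

# Keep matches with more goals than the average of the same season.
high_scoring = conn.execute("""
    SELECT m.id, m.season, m.home_goal + m.away_goal AS goals
    FROM matches AS m
    WHERE m.home_goal + m.away_goal >
        (SELECT AVG(s.home_goal + s.away_goal)
         FROM matches AS s
         WHERE s.season = m.season)   -- value taken from the outer query
    ORDER BY m.id;
""").fetchall()

print(high_scoring)
~~~~

Match 3 has only three goals but is kept, because its season averages 1.5 goals; a simple subquery with one overall average would have dropped it.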

