# SQL Introduction

In [None]:
import pandas as pd
from sqlalchemy import create_engine, text
from IPython.display import display

In [None]:
user = "root"
password = ""
db_name = "small_ScienceStreaming"
port = 3306
host = "127.0.0.1" # if host is not recognised, try using "localhost"
connection_string = f"mysql+mysqldb://{user}:{password}@{host}:{port}/{db_name}"
engine = create_engine(connection_string)

In [None]:
def q(query, engine=engine):
    """Run a SQL query and return the result as a pandas DataFrame."""
    with engine.begin() as conn:
        return pd.read_sql_query(text(query), conn)

### Typical Request

```sql
SELECT <results> -- what will be included in the final output
FROM <first_table> -- original table we will use
LEFT JOIN <other_table> -- type of join
ON <first_table>.id = <other_table>.id -- specify the join key
WHERE <condition(s)> -- how we are going to filter
GROUP BY <columns_to_aggregate> -- columns that define the groups we aggregate over
HAVING <condition(s)> -- we can filter after performing a GROUP BY
```

### SELECT, LIMIT and *
```sql
SELECT <column(s)>
FROM <table_name> LIMIT <number>
```
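
As a small illustration of `SELECT`, `*` and `LIMIT`, here is a sketch that runs against an in-memory SQLite database rather than the MySQL server used in this notebook. The `fruits` table and its contents are made up for the example. The `ORDER BY` is there because, without it, the database does not guarantee which rows `LIMIT` returns.

```python
import sqlite3

# In-memory SQLite database, used only as a stand-in for the MySQL server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fruits (id INTEGER PRIMARY KEY, name TEXT)")  # hypothetical table
conn.executemany(
    "INSERT INTO fruits (name) VALUES (?)",
    [("apple",), ("banana",), ("cherry",)],
)

# Select every column (*) and keep only the first 2 rows.
first_rows = conn.execute("SELECT * FROM fruits ORDER BY id LIMIT 2").fetchall()
print(first_rows)
```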

**>>>** Use the previous syntax to display the first 2 rows of every table in your database. Identify the primary keys and the foreign keys.

In [None]:
# Code here!


### COUNT, alias and ";"
```sql
SELECT COUNT(*) FROM <my_table>;

SELECT COUNT(*) (AS) <alias>
FROM <table> (AS) <alias>;
```
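
The sketch below shows both kinds of alias at once, again on a made-up `fruits` table in an in-memory SQLite database (an assumption for the example, not the course database). The table alias `f` shortens the table name, and the column alias `n_rows` becomes the name of the result column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fruits (id INTEGER PRIMARY KEY, name TEXT)")  # hypothetical table
conn.executemany(
    "INSERT INTO fruits (name) VALUES (?)",
    [("apple",), ("banana",), ("cherry",)],
)

# "f" is a table alias, "n_rows" is a column alias.
cursor = conn.execute("SELECT COUNT(*) AS n_rows FROM fruits AS f")
column_name = cursor.description[0][0]  # name of the first result column
n_rows = cursor.fetchone()[0]
print(column_name, n_rows)
```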

**>>>** Use the previous syntax to count the number of rows in each table. Use an alias to shorten each table name to 1, 2 or 3 characters, and another alias to rename the result column.

In [None]:
# Code here!


### UNION

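Used to concatenate the outputs of several ``SELECT`` statements, which must return the same number of columns. ``UNION`` removes duplicate rows; ``UNION ALL`` keeps them.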

```sql
SELECT * FROM <table1>
UNION
SELECT * FROM <table2>;
```
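
To see how `UNION` treats duplicates and column names, here is a minimal sketch on two made-up tables `a` and `b` in an in-memory SQLite database (a stand-in for MySQL, chosen so the example runs anywhere). The alias `val` is only set in the first `SELECT`, yet it names the column of the whole result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (x INTEGER)")  # hypothetical tables
conn.execute("CREATE TABLE b (x INTEGER)")
conn.executemany("INSERT INTO a VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO b VALUES (?)", [(2,), (3,)])

# UNION drops the duplicated value 2.
cursor = conn.execute("SELECT x AS val FROM a UNION SELECT x FROM b ORDER BY val")
union_column = cursor.description[0][0]
union_rows = cursor.fetchall()

# UNION ALL keeps every row from both SELECTs.
union_all_rows = conn.execute(
    "SELECT x AS val FROM a UNION ALL SELECT x FROM b ORDER BY val"
).fetchall()

print(union_column, union_rows, union_all_rows)
```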

**>>>** Display in a single query the number of rows in each table. Use string constants to name the tables. In a UNION, the column names of the result come from the first SELECT, so the aliases you set there name the entire columns.

In [None]:
# Code here!


### JOINS

<div>
<img src="files/sql_joins.png" alt="CPU" width="100%" align='left'/> </div>

A join stitches two tables and puts on the same row records with matching fields according to the type of JOIN you choose.

In MySQL, a plain "JOIN" is an "INNER JOIN".

There's also the CROSS JOIN. No need to specify a key as this type of JOIN will perform a cartesian product between the two tables. If table1 has 30 rows and table2 has 40 rows, the CROSS JOIN of table1 and table2 will have 1200 rows.

```sql
SELECT <columns>
FROM <table1> (<alias>)
LEFT JOIN <table2> (<alias>)
ON <table1>.id = <table2>.id
```

The "." operator identifies a column within a table, as in `<table1>.id`.
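
The following sketch compares `LEFT JOIN`, `INNER JOIN` and `CROSS JOIN` on two tiny made-up tables, `owners` and `pets`, in an in-memory SQLite database (used here as a stand-in for MySQL). Cleo owns no pet: she survives the left join with a `NULL` (`None` in Python) but disappears from the inner join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical tables: one owner can have several pets (1 to n).
conn.executescript("""
CREATE TABLE owners (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE pets (id INTEGER PRIMARY KEY, owner_id INTEGER, pet_name TEXT);
INSERT INTO owners VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cleo');
INSERT INTO pets VALUES (1, 1, 'Rex'), (2, 1, 'Tom'), (3, 2, 'Kiwi');
""")

# LEFT JOIN: every owner is kept, even without a matching pet.
left_rows = conn.execute("""
SELECT o.name, p.pet_name
FROM owners o
LEFT JOIN pets p ON o.id = p.owner_id
ORDER BY o.id, p.id
""").fetchall()

# INNER JOIN: only owners with at least one pet remain.
inner_count = conn.execute(
    "SELECT COUNT(*) FROM owners o JOIN pets p ON o.id = p.owner_id"
).fetchone()[0]

# CROSS JOIN: cartesian product, 3 owners x 3 pets.
cross_count = conn.execute(
    "SELECT COUNT(*) FROM owners CROSS JOIN pets"
).fetchone()[0]

print(left_rows, inner_count, cross_count)
```

Note that the left join returns 4 rows although `owners` has only 3: Ana appears twice because she matches two pets.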

**>>>** Use the JOIN clause to do the following:

- From the "planning" table, perform a **LEFT JOIN** with the "cours" table to look at which course name corresponds to each ID.

- Compare the number of rows in this table before and after. What kind of relationship do these two tables have? (1 to 1? 1 to n? n to n?)

- Perform a second join afterwards to display the names of the teachers who teach the different courses.

In [None]:
# Code here! (first join)


In [None]:
# Code here! (compare tables size before and after join)


In [None]:
# Code here  (second join)


**>>>** Perform a left join on the "contacts" table to display the contact ID, the start and end dates of the subscription, and the subscription price.

**>>>** Compare the number of rows in this table with the same table without the join. What can you conclude from this?

In [None]:
# Code here!


In [None]:
# Code here!



**>>>** Perform a double left join from the "visionnages" table with the "planning" table and then the "cours" table to retrieve the name of the course that has been viewed.

**>>>** Compare the number of rows in this table against the same table without the join. What can you conclude from this?

In [None]:
# Code here!


In [None]:
# Code here!


### WHERE

```sql
SELECT COUNT(*)
FROM <table1> <alias>
WHERE <condition1>
```
We can combine multiple conditions and use parentheses to control the order in which they are evaluated.

```sql
SELECT COUNT(*)
FROM <table1> <alias>
WHERE <condition1>
AND (<condition2> OR <condition3>)
```
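
The parentheses matter because `AND` binds more tightly than `OR`. The sketch below shows the difference on a made-up `people` table in an in-memory SQLite database (a stand-in for MySQL; the table and values are assumptions for the example).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (first_name TEXT, dept TEXT)")  # hypothetical table
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("Lea", "75"), ("Lea", "13"), ("Lea", "69"), ("Max", "13")],
)

# With parentheses: Lea, and living in 75 or 13.
with_parens = conn.execute("""
SELECT COUNT(*) FROM people
WHERE first_name = 'Lea' AND (dept = '75' OR dept = '13')
""").fetchone()[0]

# Without parentheses this reads as (Lea AND 75) OR (13),
# so Max from department 13 is counted too.
without_parens = conn.execute("""
SELECT COUNT(*) FROM people
WHERE first_name = 'Lea' AND dept = '75' OR dept = '13'
""").fetchone()[0]

print(with_parens, without_parens)
```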

**>>>** Display people who have the same first name as you.

In [None]:
# Code here!


**>>>** Display people called 'Charlotte' who live in department 75 or 13.

In [None]:
# Code here!


A person is only considered to have actually "seen" a video if they have watched it for at least 15 minutes without stopping.

**>>>** Use a WHERE to filter the "visionnages" table and display the number of views longer than 15 minutes, but only for the "Live" type viewings.

In [None]:
# Code here!



**>>>** Reuse the query from the previous exercise, but this time only take into account the views from November 2020. You can use BETWEEN to shorten your code.

In [None]:
# Code here!


### GROUP BY and ORDER BY

<div>
<img src="files/sql_groupby.png" alt="CPU" width="100%" align='left'/> </div>


Some common functions:
- SUM(): Returns the sum or total of each group.
- COUNT(): Returns the number of rows in each group.
- AVG(): Returns the average of each group.
- MIN(): Returns the minimum value for each group.
- MAX(): Returns the maximum value for each group.

Example:

```sql
SELECT <column1>, COUNT(*)
FROM <first_table>
GROUP BY <column1>
```

To sort the results we can use an ORDER BY. You can specify ASC (default) or DESC at the end.

```sql
SELECT <column1>, COUNT(*) rows_number
FROM <first_table>
GROUP BY <column1>
ORDER BY rows_number DESC
```
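
Here is a small runnable sketch of `GROUP BY` with several aggregate functions and an `ORDER BY` on an alias. The `sales` table is made up and lives in an in-memory SQLite database, used as a stand-in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (category TEXT, amount INTEGER)")  # hypothetical table
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("a", 10), ("b", 5), ("a", 20), ("c", 1), ("b", 5), ("a", 1)],
)

# One output row per category, largest group first.
grouped = conn.execute("""
SELECT category, COUNT(*) AS rows_number, SUM(amount) AS total
FROM sales
GROUP BY category
ORDER BY rows_number DESC
""").fetchall()
print(grouped)
```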

**>>>** Display the number of men and women in the contacts table. Use aliases to shorten table names and name result columns. Sort the results so the first row is the largest group.

In [None]:
# Code here!


**>>>** Calculate, for each user, the total time spent watching videos (in seconds and in hours), the number of connections, the average viewing time, and the longest and shortest viewing durations. Use aliases to shorten the code and rename columns. Keep only views longer than 900 seconds. Use an ORDER BY ... DESC on the total viewing time to establish a ranking.

In [None]:
# Code here!


**>>>** Reuse the previous query to compute the same statistics, this time grouped by gender, to see whether men and women differ.

**>>>** Then a second time comparing departments 75 (Paris), 69 (Rhône) and 13 (Bouches-du-Rhône).

In [None]:
# Code here!


In [None]:
# Code here!


### HAVING

```sql
SELECT <results> -- what will be included in the final output
FROM <first_table> -- original table we will use
LEFT JOIN <other_table> -- type of join
ON <first_table>.id = <other_table>.id -- specify the join key
WHERE <condition(s)> -- how we are going to filter
GROUP BY <columns_to_aggregate> -- columns that define the groups we aggregate over
HAVING <condition(s)> -- we can filter after performing a GROUP BY
```

HAVING is used to filter aggregated results: WHERE filters individual rows before they are grouped, whereas HAVING filters the groups produced by GROUP BY.
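
The sketch below combines both filters on a made-up `orders` table (in-memory SQLite, standing in for MySQL). `WHERE` first discards small orders row by row, then `HAVING` keeps only the customers whose remaining total exceeds 100.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")  # hypothetical table
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ana", 60), ("ana", 50), ("ana", 5), ("ben", 90), ("ben", 9), ("cleo", 200)],
)

big_customers = conn.execute("""
SELECT customer, SUM(amount) AS total
FROM orders
WHERE amount >= 10          -- row filter, applied before grouping
GROUP BY customer
HAVING SUM(amount) > 100    -- group filter, applied after aggregation
ORDER BY total DESC
""").fetchall()
print(big_customers)
```

Ben is excluded by `HAVING` (his total after the row filter is 90), not by `WHERE`.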


**>>>** Display customers who have watched more than 10 hours of video. Don't forget to include only views longer than 15 minutes (900 seconds). Sort the result by decreasing hours.

In [None]:
# Code here !



### Variables in MySQL

```sql
SET @<variable> = <value>, @<variable_2> = <value>;
```

This statement will not work when run through the `q()` helper in a Jupyter notebook, because it returns no rows. To create a variable in a notebook, use ordinary Python syntax instead.
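
One way to use a Python variable in a query is to pass it as a bound parameter rather than pasting it into the SQL string. The sketch below does this with the standard `sqlite3` module and a made-up `orders` table, as a stand-in for MySQL. SQLAlchemy's `text()` accepts the same `:name` placeholder style, and `pd.read_sql_query` has a `params` argument, so the `q()` helper could probably be extended along these lines; that extension is an assumption and is not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")  # hypothetical table
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ana", 60), ("ben", 9), ("cleo", 200)],
)

min_amount = 50  # plain Python variable playing the role of a SQL variable

# ":min_amount" is a named placeholder filled from the dictionary.
rows = conn.execute(
    "SELECT customer, amount FROM orders WHERE amount >= :min_amount ORDER BY amount",
    {"min_amount": min_amount},
).fetchall()
print(rows)
```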

### Dates in MySQL

MySQL understands dates, and operations can be applied to them.

In [None]:
my_date = "CAST('2022-11-01 11:29:32.0000000' AS DATETIME)"
q(f"""
SELECT "DAY" unit, DAY({my_date}) result
UNION
SELECT "MONTH", MONTH({my_date})
UNION
SELECT "YEAR", YEAR({my_date})
UNION
SELECT "HOUR", HOUR({my_date})
UNION
SELECT "REMOVE", DATE_ADD({my_date}, INTERVAL -2 MONTH)
UNION
SELECT "ADD", DATE_ADD({my_date}, INTERVAL 2 YEAR);
""")

**>>>** Display the number of views (a view counts only if it lasted at least 15 minutes) for each month and year in the data.

In [None]:
# Code here!


### CASE

```sql
CASE
WHEN <condition> THEN <value>
WHEN <condition> THEN <value>
ELSE <value>
END
```
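
As an illustration, the sketch below uses `CASE` to derive a label column from a numeric one. The `people` table and the age thresholds are made up, and the query runs on an in-memory SQLite database as a stand-in for MySQL. Conditions are tested in order and the first match wins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")  # hypothetical table
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("Ana", 12), ("Ben", 40), ("Cleo", 70)],
)

labelled = conn.execute("""
SELECT name,
       CASE
           WHEN age < 18 THEN 'minor'
           WHEN age < 65 THEN 'adult'
           ELSE 'senior'
       END AS age_group
FROM people
ORDER BY name
""").fetchall()
print(labelled)
```
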
**>>>** Keeping only departments 75, 59, 68, 13, 06 and 83, display the surname and first name of each customer, along with a new column indicating 'northerner' or 'southerner' according to their origin.

**>>>** Then count the number of "northerners" and "southerners"

Tip: you can use the following syntax to shorten your code:
```sql
WHERE codeDept IN ('75', '59', '68', '13', '06', '83');
```

In [None]:
# Code here!

In [None]:
# Code here! (count)


### Today's date

```sql
SELECT CURDATE();
```
`CURDATE()` returns the date according to the server, whose clock is often set to UTC (historically referred to as GMT, Greenwich time). To get the current time in France, add 1 hour in winter and 2 hours in summer.



### Difference between two dates

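You can compute the difference between two dates in years, months, days, hours, minutes or seconds with ``TIMESTAMPDIFF``. It counts only complete units, which is why the two cells below give different results: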

In [None]:
q("""
SELECT TIMESTAMPDIFF(MONTH, '2023-05-29', '2023-06-28') month_diff
""")

In [None]:
q("""
SELECT TIMESTAMPDIFF(MONTH, '2023-05-29', '2023-06-30') month_diff
""")

**>>>** We want to know for each customer who has had a paid subscription the total amount they have paid to the company up to today's date. The price of the subscription is monthly. The price will therefore have to be multiplied by the number of months. If the duration is less than one month, it will have to be multiplied by 1.

**Tip**:
- The ``round()`` function also works in SQL.
- The "not equal" operator in SQL is `<>`
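
Since the exercise multiplies a monthly price by a number of months, it helps to have an intuition for how complete months are counted. The function below is a rough Python emulation of that counting, not MySQL's implementation: `complete_months` is a hypothetical helper that only looks at dates (no times) and assumes the start date is not after the end date.

```python
from datetime import date


def complete_months(start: date, end: date) -> int:
    """Count whole months elapsed between two dates (assumes start <= end)."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    # The last month only counts if the day of the month has been reached.
    if end.day < start.day:
        months -= 1
    return months


short_gap = complete_months(date(2023, 5, 29), date(2023, 6, 28))  # one day short of a month
full_gap = complete_months(date(2023, 5, 29), date(2023, 6, 30))   # just over a month
print(short_gap, full_gap)
```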

In [None]:
# Code here!


**>>>** Write a query that displays the number of views for each customer with an occurrence in the views table for August 2020.

**Tip**: You can use BETWEEN to delimit dates as follows (both bounds are included):

```sql
WHERE <column> BETWEEN <date1> AND <date2>
```

In [None]:
# Code here!


### Subqueries

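The ``WITH`` clause (available from MySQL 8.0) defines a named subquery, also called a common table expression (CTE), that the main query can then use like a table.

```sql
WITH <subquery> AS
(SELECT <column(s)> FROM <table1>)
SELECT <column(s)> FROM <subquery>
```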
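
The sketch below shows the two-step pattern on a made-up `orders` table in an in-memory SQLite database (a stand-in for MySQL). The `WITH` block first computes one total per customer, then the main query aggregates those totals again.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")  # hypothetical table
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ana", 60), ("ana", 50), ("ben", 90), ("cleo", 200)],
)

summary = conn.execute("""
WITH totals AS (
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
)
SELECT COUNT(*) AS n_customers, MAX(total) AS best_total
FROM totals
""").fetchone()
print(summary)
```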

**>>>** A subscriber can have several subscriptions. We want to know the distribution of the number of subscriptions, just as a ``value_counts()`` would do. That is to say, how many people have had only one subscription, how many have had two, how many have had three, etc.

In [None]:
# Code here!

### Final query

Is it possible to predict the probability that a customer with a free subscription will eventually buy a paid one?

To do this, we will export a tabular file containing the features (X):
- The status information of each person (age, gender, dept)
- The total time spent watching videos
- The course that each customer has watched the most, and the associated teacher

And of course whether the person has ever had a paid subscription or not (y)!
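
One building block that may help for "the course each customer has watched the most" is the top-row-per-group pattern with `ROW_NUMBER()`. The sketch below is a minimal version on a made-up `watch` table, run on an in-memory SQLite database as a stand-in for MySQL. Window functions need SQLite 3.25 or later (and MySQL 8.0 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE watch (user_id INTEGER, course TEXT, seconds INTEGER)")  # hypothetical
conn.executemany(
    "INSERT INTO watch VALUES (?, ?, ?)",
    [(1, "sql", 100), (1, "python", 300), (2, "sql", 50), (2, "stats", 20)],
)

# Rank each user's courses by viewing time, then keep rank 1 only.
top_course = conn.execute("""
WITH ranked AS (
    SELECT user_id, course, seconds,
           ROW_NUMBER() OVER (
               PARTITION BY user_id
               ORDER BY seconds DESC
           ) AS row_num
    FROM watch
)
SELECT user_id, course
FROM ranked
WHERE row_num = 1
ORDER BY user_id
""").fetchall()
print(top_course)
```

`ROW_NUMBER()` breaks ties arbitrarily, so a customer with two equally watched courses still gets exactly one row.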

In [None]:
# Code here!


In [None]:
%%time

q("""
WITH recap AS(
SELECT
vi.idContact
, co.nomCours
, co.idCours
, pr.idProf
, pr.prenom
, pr.nom
, SUM(secondesVues) AS temps_visionne_en_s
FROM visionnages vi
INNER JOIN planning pl ON vi.idPlanning = pl.idPlanning
INNER JOIN cours co ON pl.idCours = co.idCours
INNER JOIN profs pr ON pl.idProf = pr.idProf
GROUP BY
vi.idContact
, co.nomCours
, co.idCours
, pr.idProf
, pr.prenom
, pr.nom
ORDER BY idContact DESC),
classement AS (
SELECT 
idContact,
temps_visionne_en_s,
idCours,
nomCours,
idProf,
nom,
prenom,
ROW_NUMBER() OVER (
    PARTITION BY idContact 
    ORDER BY temps_visionne_en_s DESC) row_num
FROM 
recap),
top_cours_par_client AS
(SELECT * FROM classement WHERE row_num = 1),
r AS (
SELECT idContact,
SUM(prix) AS somme_prix_unitaire
FROM abonnements
GROUP BY idContact),
v AS (
SELECT idContact,
SUM(secondesVues) temps_total_visionne_en_s
FROM visionnages
GROUP BY idContact)
SELECT
r.idContact
, c.sexe AS genre
, c.codeDept
, c.dateNaissance
, t.idCours AS top_idCours
, t.nomCours AS top_nomCours
, t.idProf AS top_idProf
, t.prenom AS top_prenomProf
, t.nom AS top_nomProf
, TIMESTAMPDIFF(YEAR, dateNaissance, CURDATE()) AS age -- optional
-- , v.temps_total_visionne_en_s, -- contains NULLs, hence the CASE:
, CASE WHEN v.temps_total_visionne_en_s IS NULL THEN 0
       ELSE v.temps_total_visionne_en_s END
       AS temps_total_visionne_en_s
, CASE WHEN somme_prix_unitaire = 0 THEN 0
       ELSE 1 END
       AS abo_payant
FROM r
INNER JOIN contacts c
ON r.idContact = c.idContact
INNER JOIN v
ON r.idContact = v.idContact
INNER JOIN top_cours_par_client t
ON r.idContact = t.idContact;
""")