# SD202 TP2 - Normalization and SQL

The objectives of this TP are the following:

1. Apply normalization 1NF -> 2NF -> 3NF -> BCNF
2. Perform SQL queries on the normalized database

In this lab, we are going to use a database containing wine information related to 'production' and 'sales'. 

Production <---> Wine <---> Sales

First, we are going to normalize it, and after that, we are going to write some SQL queries.

The __wine.db__ database contains the following tables:

We need to prepare the SQL environment:

In [26]:
import sqlite3

In [27]:
def printSchema(connection):
    ### Source: http://stackoverflow.com/a/35092773/4765776
    for (tableName,) in connection.execute(
        """
        select NAME from SQLITE_MASTER where TYPE='table' order by NAME;
        """
    ):
        print("{}:".format(tableName))
        for (
            columnID, columnName, columnType,
            columnNotNull, columnDefault, columnPK,
        ) in connection.execute("pragma table_info('{}');".format(tableName)):
            print("  {id}: {name}({type}){null}{default}{pk}".format(
                id=columnID,
                name=columnName,
                type=columnType,
                null=" not null" if columnNotNull else "",
                default=" [{}]".format(columnDefault) if columnDefault else "",
                pk=" *{}".format(columnPK) if columnPK else "",
            ))

In [28]:
conn = sqlite3.connect('wine.db')
c = conn.cursor()
print("Database schema:")
printSchema(conn)           # A useful way to visualize the contents of the database

Database schema:
Client:
  0: NB(NUM)
  1: NOM(TEXT)
  2: PRENOM(TEXT)
  3: TYPE(TEXT)
MASTER1:
  0: NV(NUM)
  1: CRU(TEXT)
  2: DEGRE(NUM)
  3: MILL(NUM)
  4: QTE(NUM)
  5: NP(NUM)
  6: NOM(TEXT)
  7: PRENOM(TEXT)
  8: REGION(TEXT)
MASTER2:
  0: NV(NUM)
  1: CRU(TEXT)
  2: DEGRE(NUM)
  3: MILL(NUM)
  4: DATES(NUM)
  5: LIEU(TEXT)
  6: QTE(NUM)
  7: NB(NUM)
  8: NOM(TEXT)
  9: PRENOM(TEXT)
  10: TYPE(TEXT)
  11: REGION(TEXT)
Place:
  0: LIEU(TEXT)
  1: REGION(TEXT)
Producer:
  0: NP(NUM)
  1: NOM(TEXT)
  2: PRENOM(TEXT)
  3: REGION(TEXT)
Quantity:
  0: NV(NUM)
  1: NP(NUM)
  2: QTE(NUM)
Sale:
  0: NV(NUM)
  1: NB(NUM)
  2: DATES(NUM)
  3: LIEU(TEXT)
  4: QTE(NUM)
achats:
  0: NV(NUM)
  1: NB(NUM)
  2: QTE(NUM)
  3: DATES(NUM)
  4: LIEU(TEXT)
buveurs:
  0: NB(NUM)
  1: NOM(TEXT)
  2: PRENOM(TEXT)
  3: TYPE(TEXT)
locations:
  0: LIEU(TEXT)
  1: REGION(TEXT)
producteurs:
  0: NP(NUM)
  1: NOM(TEXT)
  2: PRENOM(TEXT)
  3: REGION(TEXT)
quantite:
  0: NV(NUM)
  1: NP(NUM)
  2: QTE(NUM)
vins:

We recommend the inline __%sql__ magic as an alternative to the sqlite3 package.

In [29]:
%load_ext sql
%sql sqlite:///wine.db

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


u'Connected: None@wine.db'

Now, we can see the content of the tables using SQL queries:

In [30]:
%sql SELECT DISTINCT NV, CRU, MILL, DEGRE FROM MASTER1;

Done.


NV,CRU,MILL,DEGRE
,,,
1.0,Mercurey,1980.0,11.5
2.0,Julienas,1974.0,11.3
3.0,Savigny les Beaunes,1978.0,12.1
4.0,Mercurey,1980.0,10.9
5.0,Pommard,1976.0,11.7
6.0,Mercurey,1981.0,11.2
7.0,Grands Echezeaux,1968.0,11.7
8.0,Cotes de Beaune Villages,1975.0,12.3
9.0,Chapelle Chambertin,1973.0,11.9


# PART I: Database normalization

The first task of this TP is to normalize the wine data. In their current state, both tables, Master1 and Master2, are in First Normal Form (1NF) and suffer from data redundancy as well as update, deletion and insertion anomalies.
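
To make these anomalies concrete, here is a minimal sketch on a toy table. The rows are made up and only four of the MASTER1 columns are kept; it is an illustration under those assumptions, not the content of wine.db.

```python
import sqlite3

# Toy stand-in for MASTER1 (made-up rows, reduced set of columns).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MASTER1 (NV NUM, QTE NUM, NP NUM, REGION TEXT)")
conn.executemany(
    "INSERT INTO MASTER1 VALUES (?, ?, ?, ?)",
    [(1, 300, 10, "Bourgogne"), (2, 200, 10, "Bourgogne"), (3, 100, 11, "Rhone")],
)

# Update anomaly: the region of producer 10 is stored once per wine,
# so updating a single row leaves two contradictory values.
conn.execute("UPDATE MASTER1 SET REGION = 'Jura' WHERE NP = 10 AND NV = 1")
regions_of_10 = conn.execute(
    "SELECT DISTINCT REGION FROM MASTER1 WHERE NP = 10 ORDER BY REGION"
).fetchall()
print(regions_of_10)

# Deletion anomaly: deleting the only wine of producer 11
# also deletes everything known about that producer.
conn.execute("DELETE FROM MASTER1 WHERE NV = 3")
producer_11_rows = conn.execute(
    "SELECT COUNT(*) FROM MASTER1 WHERE NP = 11"
).fetchone()[0]
print(producer_11_rows)
```

Storing producer attributes in their own table removes both problems, because each producer then appears in exactly one row.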

__1.1__ Convert table Master1 to the Second Normal Form (2NF), Third Normal Form (3NF) and Boyce-Codd Normal Form (BCNF).
* Explain your answer
* List functional dependencies
* Describe the schema of new tables and how they relate

In [31]:
%sql SELECT DISTINCT COUNT(*) AS nbr_doublon, NP, NP FROM MASTER1 GROUP BY NP, NP HAVING   COUNT(*) > 1;

Done.


nbr_doublon,NP,NP_1
6,,
3,1.0,1.0
4,2.0,2.0
2,4.0,4.0
5,5.0,5.0
2,7.0,7.0
2,9.0,9.0
3,10.0,10.0
2,11.0,11.0
3,12.0,12.0


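### FD
* (NP, NV) is taken as the primary key: each (producer, wine) pair should appear only once in MASTER1. Note that the query above groups on NP alone (NP is listed twice), so it counts rows per producer; grouping on NV, NP instead would test the uniqueness of the pair directly.
* (NP, NV) --> NV, CRU, DEGRE, MILL, QTE, NP, NOM, PRENOM, REGION
* NV --> CRU, DEGRE, MILL
* NP --> NOM, PRENOM, REGION

NV --> CRU, DEGRE, MILL and NP --> NOM, PRENOM, REGION are partial dependencies on the key (NP, NV), so MASTER1 violates 2NF. Moving each of them into its own table leaves QTE, which depends on the whole key, in a third table. Given only the dependencies listed here, every determinant in the resulting tables is a key, so they are also in 3NF and BCNF.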
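
The dependencies listed above can be checked against the data. The sketch below uses a hypothetical helper, `fd_violations`, on a toy table with made-up rows; it returns the left-hand-side values that map to more than one right-hand-side value, so an empty list means the dependency holds in the data at hand.

```python
import sqlite3

# Toy stand-in for MASTER1 (made-up rows, reduced set of columns).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MASTER1 (NV NUM, CRU TEXT, QTE NUM, NP NUM, NOM TEXT)")
conn.executemany(
    "INSERT INTO MASTER1 VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Mercurey", 300, 10, "Dupont"),
        (1, "Mercurey", 150, 11, "Martin"),
        (2, "Julienas", 200, 10, "Dupont"),
    ],
)

def fd_violations(conn, table, lhs, rhs):
    """Return (lhs values..., count) rows where lhs maps to several rhs tuples."""
    lhs_cols = ", ".join(lhs)
    rhs_cols = ", ".join(rhs)
    query = (
        "SELECT {lhs}, COUNT(*) FROM "
        "(SELECT DISTINCT {lhs}, {rhs} FROM {table}) "
        "GROUP BY {lhs} HAVING COUNT(*) > 1"
    ).format(lhs=lhs_cols, rhs=rhs_cols, table=table)
    return conn.execute(query).fetchall()

nv_to_cru = fd_violations(conn, "MASTER1", ["NV"], ["CRU"])         # holds
np_to_nom = fd_violations(conn, "MASTER1", ["NP"], ["NOM"])         # holds
nv_to_qte = fd_violations(conn, "MASTER1", ["NV"], ["QTE"])         # does not hold
key_to_qte = fd_violations(conn, "MASTER1", ["NV", "NP"], ["QTE"])  # holds

print(nv_to_cru, np_to_nom, nv_to_qte, key_to_qte)
```

A check like this can only refute a dependency; a dependency that holds on the current rows is still a modelling assumption about future data.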

+-----------------------------------------+
|                Producer                 |
+-----------------------------------------+
| NP:      Producer number                |
| NOM:     Producer's last name           |
| PRENOM:  Producer's first name          |
| REGION:  Wine growing region            |
+-----------------------------------------+

+----------------------------------------------+
|                     Wine                     |
+----------------------------------------------+
| NV:      Wine number                         |
| CRU:     Vineyard or group of vineyards      |
| DEGRE:   Alcohol content                     |
| MILL:    Vintage year                        |
+----------------------------------------------+

+----------------------------------------------+
|                     Quantity                 |
+----------------------------------------------+
| NV:      Wine number                         |  
| NP:      Producer number                     |
| QTE:     Number of bottles bought            |
+----------------------------------------------+


__1.2__ Convert table Master2 to the Second Normal Form (2NF), Third Normal Form (3NF) and Boyce-Codd Normal Form (BCNF).
* Explain your answer
* List functional dependencies
* Describe the schema of new tables and how they relate

In [32]:
%sql SELECT DISTINCT COUNT(*) AS nbr_doublon, NV, NB, DATES, LIEU FROM MASTER2 GROUP BY NV, NB, DATES, LIEU HAVING   COUNT(*) > 1;

Done.


nbr_doublon,NV,NB,DATES,LIEU

### FD
* The query above returns no rows, so no (NV, NB, DATES, LIEU) combination is repeated and these four attributes can serve as the primary key.
* (NV, NB, DATES, LIEU) --> QTE
* NV --> CRU, DEGRE, MILL
* NB --> NOM, PRENOM, TYPE
* LIEU --> REGION


Now, we need to split the data from Master1 and Master2 into new tables. A table can be created from the result of a query. In the following example we will create a new table "dummy" to store the different values of alcohol content.

In [33]:
%sql DROP TABLE IF EXISTS dummy;

# Create dummy table
%sql CREATE TABLE dummy AS \
SELECT DISTINCT DEGRE \
FROM MASTER1;

print("\nContent of the database")
printSchema(conn)

print("\nContent of dummy")
%sql SELECT * FROM dummy

Done.
Done.

Content of the database
Client:
  0: NB(NUM)
  1: NOM(TEXT)
  2: PRENOM(TEXT)
  3: TYPE(TEXT)
MASTER1:
  0: NV(NUM)
  1: CRU(TEXT)
  2: DEGRE(NUM)
  3: MILL(NUM)
  4: QTE(NUM)
  5: NP(NUM)
  6: NOM(TEXT)
  7: PRENOM(TEXT)
  8: REGION(TEXT)
MASTER2:
  0: NV(NUM)
  1: CRU(TEXT)
  2: DEGRE(NUM)
  3: MILL(NUM)
  4: DATES(NUM)
  5: LIEU(TEXT)
  6: QTE(NUM)
  7: NB(NUM)
  8: NOM(TEXT)
  9: PRENOM(TEXT)
  10: TYPE(TEXT)
  11: REGION(TEXT)
Place:
  0: LIEU(TEXT)
  1: REGION(TEXT)
Producer:
  0: NP(NUM)
  1: NOM(TEXT)
  2: PRENOM(TEXT)
  3: REGION(TEXT)
Quantity:
  0: NV(NUM)
  1: NP(NUM)
  2: QTE(NUM)
Sale:
  0: NV(NUM)
  1: NB(NUM)
  2: DATES(NUM)
  3: LIEU(TEXT)
  4: QTE(NUM)
achats:
  0: NV(NUM)
  1: NB(NUM)
  2: QTE(NUM)
  3: DATES(NUM)
  4: LIEU(TEXT)
buveurs:
  0: NB(NUM)
  1: NOM(TEXT)
  2: PRENOM(TEXT)
  3: TYPE(TEXT)
dummy:
  0: DEGRE(NUM)
locations:
  0: LIEU(TEXT)
  1: REGION(TEXT)
producteurs:
  0: NP(NUM)
  1: NOM(TEXT)
  2: PRENOM(TEXT)
  3: REGION(TEXT)
quantite:
  

DEGRE
""
11.5
11.3
12.1
10.9
11.7
11.2
12.3
11.9
11.8


In [34]:
# Remove dummy table
%sql DROP TABLE IF EXISTS dummy;

Done.


[]

__1.3__ Create the new tables from Master1:

In [40]:
%sql DROP TABLE IF EXISTS Wine;
%sql DROP TABLE IF EXISTS Producer;
%sql DROP TABLE IF EXISTS Quantity;


# Create the Wine, Producer and Quantity tables
%sql CREATE TABLE Wine AS SELECT DISTINCT NV, CRU, DEGRE, MILL FROM MASTER1;
%sql CREATE TABLE Producer AS SELECT DISTINCT NP, NOM, PRENOM, REGION FROM MASTER1;
%sql CREATE TABLE Quantity AS SELECT DISTINCT NV, NP, QTE FROM MASTER1;


print("\nContent of the database")
printSchema(conn)

print("\nContent of Wine")
%sql SELECT * FROM Wine

Done.
Done.
Done.
Done.
Done.
Done.

Content of the database
Client:
  0: NB(NUM)
  1: NOM(TEXT)
  2: PRENOM(TEXT)
  3: TYPE(TEXT)
MASTER1:
  0: NV(NUM)
  1: CRU(TEXT)
  2: DEGRE(NUM)
  3: MILL(NUM)
  4: QTE(NUM)
  5: NP(NUM)
  6: NOM(TEXT)
  7: PRENOM(TEXT)
  8: REGION(TEXT)
MASTER2:
  0: NV(NUM)
  1: CRU(TEXT)
  2: DEGRE(NUM)
  3: MILL(NUM)
  4: DATES(NUM)
  5: LIEU(TEXT)
  6: QTE(NUM)
  7: NB(NUM)
  8: NOM(TEXT)
  9: PRENOM(TEXT)
  10: TYPE(TEXT)
  11: REGION(TEXT)
Place:
  0: LIEU(TEXT)
  1: REGION(TEXT)
Producer:
  0: NP(NUM)
  1: NOM(TEXT)
  2: PRENOM(TEXT)
  3: REGION(TEXT)
Quantity:
  0: NV(NUM)
  1: NP(NUM)
  2: QTE(NUM)
Sale:
  0: NV(NUM)
  1: NB(NUM)
  2: DATES(NUM)
  3: LIEU(TEXT)
  4: QTE(NUM)
Wine:
  0: NV(NUM)
  1: CRU(TEXT)
  2: DEGRE(NUM)
  3: MILL(NUM)
achats:
  0: NV(NUM)
  1: NB(NUM)
  2: QTE(NUM)
  3: DATES(NUM)
  4: LIEU(TEXT)
buveurs:
  0: NB(NUM)
  1: NOM(TEXT)
  2: PRENOM(TEXT)
  3: TYPE(TEXT)
locations:
  0: LIEU(TEXT)
  1: REGION(TEXT)
producteurs:
  0: NP(NU

NV,CRU,DEGRE,MILL
,,,
1.0,Mercurey,11.5,1980.0
2.0,Julienas,11.3,1974.0
3.0,Savigny les Beaunes,12.1,1978.0
4.0,Mercurey,10.9,1980.0
5.0,Pommard,11.7,1976.0
6.0,Mercurey,11.2,1981.0
7.0,Grands Echezeaux,11.7,1968.0
8.0,Cotes de Beaune Villages,12.3,1975.0
9.0,Chapelle Chambertin,11.9,1973.0


__1.4__ Create the new tables from Master2:

In [41]:
%sql DROP TABLE IF EXISTS Client;
%sql DROP TABLE IF EXISTS Sale;
%sql DROP TABLE IF EXISTS Place;

# Wine already exists

# Creation of tables
%sql CREATE TABLE Client AS SELECT DISTINCT NB, NOM, PRENOM, TYPE FROM MASTER2;
%sql CREATE TABLE Sale AS SELECT DISTINCT NV, NB, DATES, LIEU, QTE FROM MASTER2;
%sql CREATE TABLE Place AS SELECT DISTINCT LIEU, REGION FROM MASTER2;


print("\nContent of the database")
printSchema(conn)

print("\nContent of Client")
%sql SELECT * FROM Client

Done.
Done.
Done.
Done.
Done.
Done.

Content of the database
Client:
  0: NB(NUM)
  1: NOM(TEXT)
  2: PRENOM(TEXT)
  3: TYPE(TEXT)
MASTER1:
  0: NV(NUM)
  1: CRU(TEXT)
  2: DEGRE(NUM)
  3: MILL(NUM)
  4: QTE(NUM)
  5: NP(NUM)
  6: NOM(TEXT)
  7: PRENOM(TEXT)
  8: REGION(TEXT)
MASTER2:
  0: NV(NUM)
  1: CRU(TEXT)
  2: DEGRE(NUM)
  3: MILL(NUM)
  4: DATES(NUM)
  5: LIEU(TEXT)
  6: QTE(NUM)
  7: NB(NUM)
  8: NOM(TEXT)
  9: PRENOM(TEXT)
  10: TYPE(TEXT)
  11: REGION(TEXT)
Place:
  0: LIEU(TEXT)
  1: REGION(TEXT)
Producer:
  0: NP(NUM)
  1: NOM(TEXT)
  2: PRENOM(TEXT)
  3: REGION(TEXT)
Quantity:
  0: NV(NUM)
  1: NP(NUM)
  2: QTE(NUM)
Sale:
  0: NV(NUM)
  1: NB(NUM)
  2: DATES(NUM)
  3: LIEU(TEXT)
  4: QTE(NUM)
Wine:
  0: NV(NUM)
  1: CRU(TEXT)
  2: DEGRE(NUM)
  3: MILL(NUM)
achats:
  0: NV(NUM)
  1: NB(NUM)
  2: QTE(NUM)
  3: DATES(NUM)
  4: LIEU(TEXT)
buveurs:
  0: NB(NUM)
  1: NOM(TEXT)
  2: PRENOM(TEXT)
  3: TYPE(TEXT)
locations:
  0: LIEU(TEXT)
  1: REGION(TEXT)
producteurs:
  0: NP(NU

NB,NOM,PRENOM,TYPE
11.0,Breton,Andre,petit
13.0,Barthes,Roland,moyen
16.0,Balzac,Honore de,moyen
18.0,Celine,Louis Ferdinand,gros
20.0,Chateaubriand,Francois-Rene de,moyen
21.0,Corbiere,Tristan,petit
23.0,Corneille,Pierre,petit
25.0,Char,Rene,petit
27.0,Dumas,Alexandre,gros
29.0,Fournier,Alain,petit


# PART II: SQL QUERIES

In the second part of this TP you will create SQL queries to retrieve information from the database.

__2.1__ What are the different types of clients (buveurs) by volume of purchases?

In [42]:
%sql SELECT DISTINCT TYPE FROM Client WHERE TYPE IS NOT NULL;

Done.


TYPE
petit
moyen
gros


__2.2__ What regions produce Pommard or Brouilly?

In [43]:
%%sql SELECT A.CRU, C.REGION  
FROM Wine AS A
INNER JOIN Quantity AS B ON A.NV = B.NV
INNER JOIN Producer AS C ON B.NP = C.NP 
WHERE A.CRU IN ('Pommard', 'Brouilly'); 

Done.


CRU,REGION
Pommard,Bourgogne
Pommard,Rhone
Brouilly,Bourgogne


__2.3__ What regions produce Pommard and Brouilly?

In [44]:
%%sql
SELECT C.REGION
FROM Wine AS A
INNER JOIN Quantity AS B ON A.NV = B.NV
INNER JOIN Producer AS C ON B.NP = C.NP
WHERE A.CRU ='Pommard'
INTERSECT
SELECT C.REGION
FROM Wine AS A
INNER JOIN Quantity AS B ON A.NV = B.NV
INNER JOIN Producer AS C ON B.NP = C.NP
WHERE A.CRU ='Brouilly'

Done.


REGION
Bourgogne


__2.4__ Get the number of wines bought by CRU and Millésime (vintage)

In [45]:
%%sql
SELECT a.CRU, a.MILL, SUM(b.QTE) AS Number_of_wines_bought
FROM Wine AS a
INNER JOIN Sale AS b ON a.NV = b.NV
GROUP BY a.CRU, a.MILL
HAVING SUM(b.QTE) > 0;

Done.


CRU,MILL,Number_of_wines_bought
Arbois,1980,8
Auxey Duresses,1914,80
Beaujolais Primeur,1983,7
Beaujolais Villages,1975,10
Beaujolais Villages,1976,120
Beaujolais Villages,1978,130
Beaujolais Villages,1979,520
Chapelle Chambertin,1973,30
Chateau Corton Grancey,1980,4
Chenas,1984,1


__2.5__ Retrieve the wine number (NV) of wines produced by more than three producers

In [67]:
%%sql 
SELECT A.NV , COUNT(A.NP) AS Number_of_producers
FROM Quantity AS A
WHERE A.NV IS NOT NULL
GROUP BY A.NV
HAVING COUNT(A.NP) > 3;

Done.


NV,Number_of_producers
45,5
78,5
89,4
98,5


__2.6__ Which producers have not produced any wine?

Done.


NP,NOM,PRENOM
3,Six,Paul
6,Marmagne,Bernard
8,Lioger d'Harduy,Gabriel
16,Barbin,Bernard
17,Faiveley,Guy
18,Tramier,Jean
19,Dupaquier,Roger
20,Lamy,Jean
21,Cornu,Edmond
26,Violot,Gilbert


In [55]:
%%sql
SELECT P.NP, P.NOM, P.PRENOM
FROM Producer AS P
WHERE P.NP IS NOT NULL
  AND NOT EXISTS (
    -- assumes a producer without wine only has Quantity rows where NV is NULL
    SELECT 1 FROM Quantity AS Q
    WHERE Q.NP = P.NP AND Q.NV IS NOT NULL
);


__2.7__ What clients (buveurs) have bought at least one wine from 1980?

Done.


NB,NOM,PRENOM
2,Artaud,Antonin
8,Aragon,Louis
44,Gide,Andre
45,Giono,Jean
50,Lautreamont,
61,Mallarme,Stephane
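
No query is shown for this result. One possible formulation, sketched here on toy data with made-up rows (the table and column names follow the Client, Sale and Wine tables created in Part I), joins the three tables and keeps the clients with at least one 1980 purchase.

```python
import sqlite3

# Toy data (made up): client 1 bought a 1980 and a 1974 wine, client 2 only a 1974 wine.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Wine   (NV NUM, CRU TEXT, DEGRE NUM, MILL NUM);
CREATE TABLE Client (NB NUM, NOM TEXT, PRENOM TEXT, TYPE TEXT);
CREATE TABLE Sale   (NV NUM, NB NUM, DATES NUM, LIEU TEXT, QTE NUM);
INSERT INTO Wine   VALUES (1, 'Mercurey', 11.5, 1980), (2, 'Julienas', 11.3, 1974);
INSERT INTO Client VALUES (1, 'Alpha', 'A', 'petit'), (2, 'Beta', 'B', 'gros');
INSERT INTO Sale   VALUES (1, 1, NULL, 'PARIS', 5), (2, 1, NULL, 'PARIS', 3),
                          (2, 2, NULL, 'LYON', 1);
""")

# DISTINCT avoids listing a client once per matching purchase.
buyers_1980 = conn.execute("""
    SELECT DISTINCT C.NB, C.NOM, C.PRENOM
    FROM Client AS C
    INNER JOIN Sale AS S ON S.NB = C.NB
    INNER JOIN Wine AS W ON W.NV = S.NV
    WHERE W.MILL = 1980
    ORDER BY C.NB
""").fetchall()
print(buyers_1980)
```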


__2.8__ What clients (buveurs) have NOT bought any wine from 1980?

Done.


NB,NOM,PRENOM
1,Aristote,
3,Aron,Raymond
4,Apollinaire,Guillaume
5,Audiberti,Jacques
6,Arrabal,Fernando
7,Anouilh,Jean
9,Ajar,Emile
10,Andersen,Yann
11,Breton,Andre
12,Bataille,Georges
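
The query behind this result is not shown. A possible approach, sketched on made-up toy data, is an anti-join with NOT EXISTS: keep every client for whom no sale of a 1980 wine can be found. Clients who never bought anything are included as well, which is an assumption about how the question should be read.

```python
import sqlite3

# Toy data (made up): client 1 bought a 1980 wine, client 2 a 1974 wine, client 3 nothing.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Wine   (NV NUM, CRU TEXT, DEGRE NUM, MILL NUM);
CREATE TABLE Client (NB NUM, NOM TEXT, PRENOM TEXT, TYPE TEXT);
CREATE TABLE Sale   (NV NUM, NB NUM, DATES NUM, LIEU TEXT, QTE NUM);
INSERT INTO Wine   VALUES (1, 'Mercurey', 11.5, 1980), (2, 'Julienas', 11.3, 1974);
INSERT INTO Client VALUES (1, 'Alpha', 'A', 'petit'), (2, 'Beta', 'B', 'gros'),
                          (3, 'Gamma', 'C', 'moyen');
INSERT INTO Sale   VALUES (1, 1, NULL, 'PARIS', 5), (2, 2, NULL, 'LYON', 1);
""")

non_buyers_1980 = conn.execute("""
    SELECT C.NB, C.NOM, C.PRENOM
    FROM Client AS C
    WHERE NOT EXISTS (
        SELECT 1
        FROM Sale AS S
        INNER JOIN Wine AS W ON W.NV = S.NV
        WHERE S.NB = C.NB AND W.MILL = 1980
    )
    ORDER BY C.NB
""").fetchall()
print(non_buyers_1980)
```

NOT EXISTS is used rather than NOT IN because NOT IN returns no rows at all when the subquery yields a NULL.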


__2.9__ What clients (buveurs) have bought ONLY wines from 1980?

Done.


NB,NOM,PRENOM
44,Gide,Andre
45,Giono,Jean
50,Lautreamont,
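
No query accompanies this result. One way to express "only", sketched on made-up toy data, is to group the purchases by client and require that the number of 1980 purchases equals the total number of purchases. The inner joins also drop clients with no purchase at all, which is assumed to be the intended reading.

```python
import sqlite3

# Toy data (made up):
#   client 1 bought 1980 and 1974 wines, client 2 only 1974,
#   client 3 only 1980 wines, client 4 nothing.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Wine   (NV NUM, CRU TEXT, DEGRE NUM, MILL NUM);
CREATE TABLE Client (NB NUM, NOM TEXT, PRENOM TEXT, TYPE TEXT);
CREATE TABLE Sale   (NV NUM, NB NUM, DATES NUM, LIEU TEXT, QTE NUM);
INSERT INTO Wine   VALUES (1, 'Mercurey', 11.5, 1980), (2, 'Julienas', 11.3, 1974),
                          (3, 'Arbois', 12.0, 1980);
INSERT INTO Client VALUES (1, 'Alpha', 'A', 'petit'), (2, 'Beta', 'B', 'gros'),
                          (3, 'Gamma', 'C', 'moyen'), (4, 'Delta', 'D', 'petit');
INSERT INTO Sale   VALUES (1, 1, NULL, 'PARIS', 5), (2, 1, NULL, 'PARIS', 3),
                          (2, 2, NULL, 'LYON', 1),
                          (1, 3, NULL, 'NICE', 2), (3, 3, NULL, 'NICE', 4);
""")

only_1980 = conn.execute("""
    SELECT C.NB, C.NOM, C.PRENOM
    FROM Client AS C
    INNER JOIN Sale AS S ON S.NB = C.NB
    INNER JOIN Wine AS W ON W.NV = S.NV
    GROUP BY C.NB, C.NOM, C.PRENOM
    -- every purchase of the client must be a 1980 wine
    HAVING SUM(CASE WHEN W.MILL = 1980 THEN 1 ELSE 0 END) = COUNT(*)
    ORDER BY C.NB
""").fetchall()
print(only_1980)
```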


__2.10__ List all wines from 1980

In [24]:
%sql \
SELECT * \
FROM vins \
WHERE MILL=1980 \
ORDER BY NV asc;

Done.


NV,CRU,MILL,DEGRE
1,Mercurey,1980,11.5
4,Mercurey,1980,10.9
16,Meursault,1980,12.1
20,Cote de Brouilly,1980,12.1
26,Chateau Corton Grancey,1980,
28,Volnay,1980,11.0
43,Fleurie,1980,11.4
74,Arbois,1980,12.0
78,Etoile,1980,12.0
79,Seyssel,1980,11.0


__2.11__ What are the wines from 1980 bought by NB=2?

Done.


NV,NB,DATES,LIEU,QTE,CRU,MILL,DEGRE
1,2,1977-11-02,BORDEAUX,33,Mercurey,1980,11.5
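
The query for this result is not shown. The column list suggests a join between the sales and the wines; the sketch below reproduces that shape on made-up toy data. The function name `wines_bought` is hypothetical, and the client number and year are passed as bound parameters rather than pasted into the SQL string.

```python
import sqlite3

# Toy data (made up): client 2 bought one 1980 wine and one 1974 wine.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Wine (NV NUM, CRU TEXT, DEGRE NUM, MILL NUM);
CREATE TABLE Sale (NV NUM, NB NUM, DATES NUM, LIEU TEXT, QTE NUM);
INSERT INTO Wine VALUES (1, 'Mercurey', 11.5, 1980), (2, 'Julienas', 11.3, 1974);
INSERT INTO Sale VALUES (1, 2, '2000-01-01', 'PARIS', 10),
                        (2, 2, '2000-01-02', 'PARIS', 20),
                        (1, 5, '2000-01-03', 'LYON', 30);
""")

def wines_bought(conn, nb, year):
    """Sales of client `nb` restricted to wines of vintage `year`."""
    return conn.execute("""
        SELECT S.NV, S.NB, S.DATES, S.LIEU, S.QTE, W.CRU, W.MILL, W.DEGRE
        FROM Sale AS S
        INNER JOIN Wine AS W ON W.NV = S.NV
        WHERE S.NB = ? AND W.MILL = ?
        ORDER BY S.NV
    """, (nb, year)).fetchall()

rows = wines_bought(conn, 2, 1980)
print(rows)
```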


__2.12__ What clients (buveurs) have bought ALL the wines from 1980?

Done.


NB,NOM,PRENOM
44,Gide,Andre
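
No query is given for this result. "ALL the wines" is a relational division, which SQL has no direct operator for. A common formulation, sketched here on made-up toy data, is a double NOT EXISTS: keep the clients for whom there is no 1980 wine that they have not bought.

```python
import sqlite3

# Toy data (made up): wines 1 and 3 are from 1980, wine 2 from 1974.
#   client 1 bought wines 1 and 3, client 2 only wine 1, client 3 only wine 2.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Wine   (NV NUM, CRU TEXT, DEGRE NUM, MILL NUM);
CREATE TABLE Client (NB NUM, NOM TEXT, PRENOM TEXT, TYPE TEXT);
CREATE TABLE Sale   (NV NUM, NB NUM, DATES NUM, LIEU TEXT, QTE NUM);
INSERT INTO Wine   VALUES (1, 'Mercurey', 11.5, 1980), (2, 'Julienas', 11.3, 1974),
                          (3, 'Arbois', 12.0, 1980);
INSERT INTO Client VALUES (1, 'Alpha', 'A', 'petit'), (2, 'Beta', 'B', 'gros'),
                          (3, 'Gamma', 'C', 'moyen');
INSERT INTO Sale   VALUES (1, 1, NULL, 'PARIS', 5), (3, 1, NULL, 'PARIS', 3),
                          (1, 2, NULL, 'LYON', 1), (2, 3, NULL, 'NICE', 2);
""")

bought_all_1980 = conn.execute("""
    SELECT C.NB, C.NOM, C.PRENOM
    FROM Client AS C
    WHERE NOT EXISTS (
        -- a 1980 wine ...
        SELECT 1 FROM Wine AS W
        WHERE W.MILL = 1980
          AND NOT EXISTS (
              -- ... that this client never bought
              SELECT 1 FROM Sale AS S
              WHERE S.NB = C.NB AND S.NV = W.NV
          )
    )
    ORDER BY C.NB
""").fetchall()
print(bought_all_1980)
```

If the database contained no wine from 1980, this formulation would return every client, since the condition is then vacuously true.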
