# Querying a database

### Connecting to the database

In [41]:
import psycopg2

# Connect to the database
# Database connection parameters
conn = psycopg2.connect(
    dbname="movies",  # or your own database name
    user="admin",  # PostgreSQL user
    password="myPasswww00rD",  # password you set in the compose file
    host="mypostgres",  # the container name on the network acts as the host
    port="5432"  # default PostgreSQL port
)

In [42]:
# Create a cursor
cur = conn.cursor()

# Run a simple query
cur.execute("SELECT version();")

# Fetch the result
version = cur.fetchone()
print("Database version:", version)

Database version: ('PostgreSQL 15.3 (Debian 15.3-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit',)


In [18]:
import pandas as pd
from sqlalchemy import create_engine
database_url = "postgresql+psycopg2://admin:myPasswww00rD@mypostgres:5432/movies"

# Create an engine
engine = create_engine(database_url)

## Intro to SQL

*Select everything from a table*, using **LIMIT n** to cap the number of rows returned.

In [19]:
query = """
SELECT *
FROM films
LIMIT 2;
"""

In [20]:
df = pd.read_sql_query(query, database_url)
df

Unnamed: 0,id,title,release_year,country,duration,language,certification,gross,budget
0,1,Intolerance: Love's Struggle Throughout the Ages,1916,USA,123,,Not Rated,,385907.0
1,2,Over the Hill to the Poorhouse,1920,USA,110,,,3000000.0,100000.0


*Only the unique languages in which films were recorded, using the DISTINCT keyword*

In [21]:
query = """

"""

In [22]:
df = pd.read_sql_query(query, database_url)
df.head()

Unnamed: 0,language
0,Danish
1,Greek
2,Dzongkha
3,
4,Tamil


*Aliasing: select the unique film languages and alias the column as unique_language*

In [23]:
query = """

"""

In [24]:
df = pd.read_sql_query(query, database_url)
df.head()

Unnamed: 0,unique_language
0,Danish
1,Greek
2,Dzongkha
3,
4,Tamil


## Views

A **view** in PostgreSQL is a virtual table that represents the result of a stored SQL query. It looks like an ordinary table but does not physically store data; instead, the query runs every time you read from the view.

### **Characteristics of a view**
1. **Virtual**: It holds no data of its own, but shows data from the underlying tables.
2. **Dynamic**: The data in the view changes automatically when the underlying tables are updated.
3. **Convenience**: It offers a simple way to reuse complex queries.
4. **Security**: You can restrict access by letting users query only the view instead of the original tables.

### **Advantages**
- **Readability**: Keeps queries clean and easy to follow.
- **Reusability**: Saves you from writing the same complex query over and over.
- **Security**: Can be used to hide certain data (for example, sensitive columns).

### **Disadvantages**
- Reading from a view can be slower than reading from a table, because the underlying query runs again on every access.
- Not every view can be updated: PostgreSQL makes simple single-table views updatable automatically, but more complex views need an *INSTEAD OF* trigger.

Use views when you regularly need complex or combined data!
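
The "virtual" and "dynamic" properties can be seen in a small, self-contained sketch. It uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for the PostgreSQL server used in this notebook; the toy rows and the view name `recent_films` are made up for illustration and are not part of the `movies` database.

```python
import sqlite3

# In-memory SQLite database, used here only as a stand-in for PostgreSQL (assumption).
view_demo = sqlite3.connect(":memory:")
view_demo.execute("CREATE TABLE films (title TEXT, release_year INTEGER)")
view_demo.executemany(
    "INSERT INTO films VALUES (?, ?)",
    [("Metropolis", 1927), ("Das Boot", 1981), ("Run Lola Run", 1998)],
)

# The view stores only the query, not the rows (hypothetical view name).
view_demo.execute(
    "CREATE VIEW recent_films AS "
    "SELECT title FROM films WHERE release_year >= 1980"
)
before = [row[0] for row in view_demo.execute("SELECT title FROM recent_films ORDER BY title")]

# A new row in the underlying table shows up in the view without recreating it.
view_demo.execute("INSERT INTO films VALUES ('Good Bye Lenin!', 2003)")
after = [row[0] for row in view_demo.execute("SELECT title FROM recent_films ORDER BY title")]

print(before)  # ['Das Boot', 'Run Lola Run']
print(after)   # ['Das Boot', 'Good Bye Lenin!', 'Run Lola Run']
view_demo.close()
```

Because the view is re-evaluated on each read, the second `SELECT` picks up the new film automatically.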

*Save the results of this query as a view named films_languages*

In [51]:
# conn.rollback()

In [54]:
delete_query = "DROP VIEW IF EXISTS films_languages;"
cur.execute(delete_query)
conn.commit()

In [55]:
create_view_query = """

"""

cur.execute(create_view_query)
conn.commit()

In [56]:
# Check that the view films_languages was created and inspect its definition:
query = """

"""
df = pd.read_sql_query(query, engine)
print(df)

        table_name                                    view_definition
0  films_languages   SELECT DISTINCT films.language AS unique_lang...


In [58]:
# Now you can query the view. Request the first 3 languages.
query = """

"""
df = pd.read_sql_query(query, database_url)
df.head()

Unnamed: 0,unique_language
0,Danish
1,Greek
2,Dzongkha


## Intermediate SQL

*Count the number of records in the reviews table that contain a film_id:*

In [29]:
query = """

"""

In [30]:
df = pd.read_sql_query(query, database_url)
df.head()

Unnamed: 0,count_film_id
0,4968


*Count the number of people in the people table, aliased as total_people*

In [31]:
query = """

"""

In [32]:
df = pd.read_sql_query(query, database_url)
df.head()

Unnamed: 0,total_people
0,8397


List all film titles with missing budgets

In [33]:
query = """

"""

In [34]:
df = pd.read_sql_query(query, database_url)

In [35]:
df.head()

Unnamed: 0,no_budget_info
0,Pandora's Box
1,The Prisoner of Zenda
2,The Blue Bird
3,Bambi
4,State Fair


In [36]:
## Count the number of records in the people table

In [None]:
query = """
"""

In [16]:
df = pd.read_sql_query(query, database_url)
df.head()



In [17]:
## Return the unique countries from the films table

In [18]:
query = """

"""

In [19]:
df = pd.read_sql_query(query, database_url)
df.head()

Unnamed: 0,country
0,Soviet Union
1,Indonesia
2,Italy
3,Cameroon
4,Czech Republic


In [None]:
## Select film_id and imdb_score with an imdb_score over 7.0

In [20]:
query = """

"""

In [21]:
df = pd.read_sql_query(query, database_url)
df.head()

Unnamed: 0,film_id,imdb_score
0,3934,7.1
1,74,7.6
2,1254,8.0
3,4841,8.1
4,3252,7.2


In [None]:
## Count the Spanish-language films

In [22]:
query = """

"""

In [23]:
df = pd.read_sql_query(query, database_url)
df.head()

Unnamed: 0,count_spanish
0,40


Count the number of films with a language associated with them and use the alias count_language_known.

In [19]:
query = """

"""

In [20]:
df = pd.read_sql_query(query, database_url)
df.head()

Unnamed: 0,count_language_known
0,4968


In [24]:
## Select the title and release_year for all German-language films released before 2000

In [25]:
query = """

"""

In [26]:
df = pd.read_sql_query(query, database_url)
df.head()

Unnamed: 0,title,release_year
0,Metropolis,1927
1,Pandora's Box,1929
2,The Torture Chamber of Dr. Sadism,1967
3,Das Boot,1981
4,Run Lola Run,1998


In [28]:
## Find the title and release year of films released in 1990 or 1999

In [29]:
query = """

"""
df = pd.read_sql_query(query, database_url)
df.head()

Unnamed: 0,title,release_year
0,Arachnophobia,1990
1,Back to the Future Part III,1990
2,Child's Play 2,1990
3,Dances with Wolves,1990
4,Days of Thunder,1990


In [30]:
## Select the title and release_year for films released between 1990 and 2000

In [36]:
query = """

"""
df = pd.read_sql_query(query, database_url)
df.head()

Unnamed: 0,title,release_year
0,Arachnophobia,1990
1,Back to the Future Part III,1990
2,Child's Play 2,1990
3,Dances with Wolves,1990
4,Days of Thunder,1990


## Filtering text

The LIKE and NOT LIKE operators can be used to find records that either match or do not match a specified pattern, respectively. They can be coupled with the wildcards % and _. The % will match zero or many characters, and _ will match a single character.
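
A minimal sketch of the two wildcards, again using an in-memory `sqlite3` database as a stand-in for PostgreSQL. The `people` rows and the helper `names_matching` are hypothetical. One caveat: SQLite's `LIKE` ignores ASCII case by default while PostgreSQL's is case-sensitive, so the toy names are chosen so that both behave the same.

```python
import sqlite3

# In-memory SQLite database with made-up names (stand-in for the people table).
like_demo = sqlite3.connect(":memory:")
like_demo.execute("CREATE TABLE people (name TEXT)")
like_demo.executemany(
    "INSERT INTO people VALUES (?)",
    [("Ann",), ("Anna",), ("Evan",), ("Eve",)],
)

def names_matching(condition):
    """Return names satisfying a hard-coded LIKE / NOT LIKE condition (demo helper)."""
    sql = f"SELECT name FROM people WHERE name {condition} ORDER BY name"
    return [row[0] for row in like_demo.execute(sql)]

starts_with_an = names_matching("LIKE 'An%'")   # % matches zero or more characters
three_letters = names_matching("LIKE 'E__'")    # each _ matches exactly one character
not_ending_n = names_matching("NOT LIKE '%n'")  # drops names that end in n

print(starts_with_an)  # ['Ann', 'Anna']
print(three_letters)   # ['Eve']
print(not_ending_n)    # ['Anna', 'Eve']
like_demo.close()
```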

In [74]:
## Select the names that start with B

In [None]:
query = """

"""
df = pd.read_sql_query(query, database_url)
df.head()

In [37]:
## Find the title and release_year for all films over two hours in length released in 1990 or 2000

In [43]:
query = """

"""
df = pd.read_sql_query(query, database_url)
df.head()

Unnamed: 0,title,release_year
0,Dances with Wolves,1990
1,Die Hard 2,1990
2,Ghost,1990
3,Goodfellas,1990
4,Mo' Better Blues,1990


In [41]:
## Combining filtering and selecting

Time for a little challenge. So far, your SQL vocabulary from this course includes COUNT(), DISTINCT, LIMIT, WHERE, OR, AND, BETWEEN, LIKE, NOT LIKE, and IN. In this exercise, you will try to use some of these together. Writing more complex queries will be standard for you as you become a qualified SQL programmer.

As this query will be a little more complicated than what you've seen so far, we've included a bit of code to get you started. You will be using DISTINCT here too because, surprise, there are two movies named 'Hamlet' in this dataset!

Follow the instructions to find out what 90's films we have in our dataset that would be suitable for English-speaking teens.

- Count the unique titles from the films database and use the alias provided.
- Filter to include only movies with a release_year from 1990 to 1999, inclusive.
- Add another filter narrowing your query down to English-language films.
- Add a final filter to select only films with 'G', 'PG', 'PG-13' certifications.

In [44]:
query = """

""" 
df = pd.read_sql_query(query, database_url)
df.head()

Unnamed: 0,nineties_english_films_for_teens
0,310


## NULL values

In [45]:
## List all film titles with missing budgets

In [46]:
query = """

"""
df = pd.read_sql_query(query, database_url)
df.head()

Unnamed: 0,no_budget_info
0,Pandora's Box
1,The Prisoner of Zenda
2,The Blue Bird
3,Bambi
4,State Fair


## Summarizing Data

#### Aggregate functions

*SELECT MIN(budget) FROM films;* 

*SELECT MIN(country) AS min_country FROM films; -> on text, MIN returns the alphabetically first value!*
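
The two notes above can be tried out on toy data. The sketch below uses an in-memory `sqlite3` database as a stand-in for PostgreSQL; the three rows are invented for illustration. It also shows that aggregate functions skip NULL values.

```python
import sqlite3

# Toy films table; the rows are invented for illustration only.
agg_demo = sqlite3.connect(":memory:")
agg_demo.execute("CREATE TABLE films (country TEXT, budget INTEGER)")
agg_demo.executemany(
    "INSERT INTO films VALUES (?, ?)",
    [("USA", 100000), ("Germany", 6000000), ("Italy", None)],
)

# MIN/MAX work on numbers and on text; NULL budgets are ignored.
min_budget, max_budget, min_country, max_country = agg_demo.execute(
    "SELECT MIN(budget), MAX(budget), MIN(country), MAX(country) FROM films"
).fetchone()

print(min_budget, max_budget)    # 100000 6000000
print(min_country, max_country)  # Germany USA
agg_demo.close()
```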

Use the SUM() function to calculate the total duration of all films and alias with total_duration.

In [10]:
query = """

"""

In [11]:
df = pd.read_sql_query(query, database_url)
df.head()

Unnamed: 0,total_duration
0,534882


Calculate the average duration of all films and alias with average_duration.

In [35]:
query = """

"""

In [36]:
df = pd.read_sql_query(query, database_url)
df.head()

Unnamed: 0,average_duration
0,107.947931


Find the most recent release_year in the films table, aliasing as latest_year.

In [37]:
query = """

"""

In [38]:
df = pd.read_sql_query(query, database_url)
df.head()

Unnamed: 0,latest_year
0,2016


Find the duration of the shortest film and use the alias shortest_film.

In [39]:
query = """

"""

In [40]:
df = pd.read_sql_query(query, database_url)
df.head()

Unnamed: 0,shortest_film
0,7


#### Summarizing Subsets

When combining aggregate functions with WHERE, you get a powerful tool that allows you to get more granular with your insights, for example, to get the total budget of movies made from the year 2010 onwards.

This combination is useful when you only want to summarize a subset of your data. In your film-industry role, as an example, you may like to summarize each certification category to compare how they each perform or if one certification has a higher average budget than another.

Let's see what insights you can gain about the financials in the dataset.
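
As a rough illustration of summarizing a subset, the sketch below compares `SUM()` over all rows with `SUM()` restricted by `WHERE`. It runs against an in-memory `sqlite3` database rather than the PostgreSQL server, and the budgets are made-up numbers.

```python
import sqlite3

# Toy data (invented): one film has no budget recorded.
subset_demo = sqlite3.connect(":memory:")
subset_demo.execute("CREATE TABLE films (release_year INTEGER, budget INTEGER)")
subset_demo.executemany(
    "INSERT INTO films VALUES (?, ?)",
    [(2005, 100), (2010, 200), (2012, 300), (2015, None)],
)

# Total over every row versus total over films from 2010 onwards.
total_all = subset_demo.execute("SELECT SUM(budget) FROM films").fetchone()[0]
total_recent = subset_demo.execute(
    "SELECT SUM(budget) FROM films WHERE release_year >= 2010"
).fetchone()[0]

print(total_all)     # 600
print(total_recent)  # 500
subset_demo.close()
```

The `WHERE` clause removes rows before the aggregate is computed, so only the 2010 and 2012 budgets contribute to the second total.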

*ROUND works much like round() in Python: leaving out the second argument rounds to a whole number. A negative second argument rounds to the left of the decimal point, e.g. -2 rounds to hundreds and -3 to thousands.*
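
Since the note compares `ROUND` to Python, here is the Python side of that comparison as a small sketch; the value `1234.567` is arbitrary. The analogy is not exact: Python rounds ties to the nearest even number, whereas PostgreSQL's `ROUND` on `numeric` values rounds ties away from zero.

```python
value = 1234.567  # arbitrary example value

no_digits = round(value)         # no second argument: whole number
one_decimal = round(value, 1)    # one decimal place
to_hundreds = round(value, -2)   # negative argument: nearest hundred
to_thousands = round(value, -3)  # nearest thousand

print(no_digits)     # 1235
print(one_decimal)   # 1234.6
print(to_hundreds)   # 1200.0
print(to_thousands)  # 1000.0
```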

Use SUM() to calculate the total gross for all films made in the year 2000 or later, and use the alias total_gross.

In [4]:
query = """

"""

In [5]:
df = pd.read_sql_query(query, database_url)
df.head()

Unnamed: 0,total_gross
0,150900926358


Calculate the average amount grossed by all films whose titles start with the letter 'A' and alias with avg_gross_A.

In [6]:
query = """

"""
df = pd.read_sql_query(query, database_url)
df.head()

Unnamed: 0,avg_gross_a
0,42776120.0


Calculate the lowest gross film in 1994 and use the alias lowest_gross.

In [7]:
query = """

"""

In [8]:
df = pd.read_sql_query(query, database_url)
df.head()

Unnamed: 0,lowest_gross
0,125169


Calculate the highest gross film between 2000 and 2012, inclusive, and use the alias highest_gross.

In [9]:
query = """

"""

In [10]:
df = pd.read_sql_query(query, database_url)
df.head()

Unnamed: 0,highest_gross
0,760505847


#### Using ROUND

Calculate the average facebook_likes to one decimal place and assign to the alias, avg_facebook_likes.

In [11]:
query = """

"""

In [12]:
df = pd.read_sql_query(query, database_url)
df.head()

Unnamed: 0,avg_facebook_likes
0,7802.9


**ROUND() with a negative parameter**

Calculate the average budget from the films table, aliased as avg_budget_thousands, and round to the nearest thousand.

In [13]:
query = """

"""

In [14]:
df = pd.read_sql_query(query, database_url)
df.head()

Unnamed: 0,avg_budget_thousands
0,39903000.0


#### Aliasing and arithmetic

*SELECT (4 / 3) -> returns an integer, because both operands are integers*  
*SELECT (4.0 / 3.0) -> returns a decimal value*
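
The difference is easy to check from Python. This sketch uses an in-memory `sqlite3` database as a stand-in; SQLite follows the same integer-division rule as PostgreSQL here, although PostgreSQL reports the decimal result as `numeric` rather than as a float.

```python
import sqlite3

div_demo = sqlite3.connect(":memory:")

# Integer / integer truncates; a decimal operand keeps the fraction.
int_division, decimal_division, minutes_to_hours = div_demo.execute(
    "SELECT 4 / 3, 4.0 / 3.0, 110 / 60.0"
).fetchone()

print(int_division)                # 1
print(round(decimal_division, 4))  # 1.3333
print(round(minutes_to_hours, 4))  # 1.8333
div_demo.close()
```

This is why dividing a duration in minutes by `60.0` instead of `60` matters.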

Select the title and duration in hours for all films and alias as duration_hours; since the current durations are in minutes, you'll need to divide duration by 60.0.

In [15]:
query = """

"""

In [16]:
df = pd.read_sql_query(query, database_url)
df.head()

Unnamed: 0,title,duration_hours
0,Intolerance: Love's Struggle Throughout the Ages,2.05
1,Over the Hill to the Poorhouse,1.833333
2,The Big Parade,2.516667
3,Metropolis,2.416667
4,Pandora's Box,1.833333


In [17]:
## Round duration_hours to two decimal places

In [18]:
query = """

"""

In [19]:
df = pd.read_sql_query(query, database_url)
df.head()

Unnamed: 0,title,duration_hours
0,Intolerance: Love's Struggle Throughout the Ages,2.05
1,Over the Hill to the Poorhouse,1.83
2,The Big Parade,2.52
3,Metropolis,2.42
4,Pandora's Box,1.83


In [20]:
## Select name from people and sort alphabetically

In [21]:
query = """

"""

In [22]:
df = pd.read_sql_query(query, database_url)
df.head()

Unnamed: 0,name
0,50 Cent
1,Aaliyah
2,Aaron Ashmore
3,Aaron Hann
4,Aaron Hill


In [23]:
## Select the release year, duration, and title sorted by release year and duration

In [24]:
query = """

"""

In [25]:
df = pd.read_sql_query(query, database_url)
df.head()

Unnamed: 0,release_year,duration,title
0,1916.0,123.0,Intolerance: Love's Struggle Throughout the Ages
1,1920.0,110.0,Over the Hill to the Poorhouse
2,1925.0,151.0,The Big Parade
3,1927.0,145.0,Metropolis
4,1929.0,100.0,The Broadway Melody


Calculate the percentage of people who are no longer alive and alias the result as percentage_dead.

In [27]:
query = """

"""

In [28]:
df = pd.read_sql_query(query, database_url)
df.head()

Unnamed: 0,percentage_dead
0,9.372395


Find how many decades (period of ten years) the films table covers by using MIN() and MAX(); alias as number_of_decades.

In [29]:
query = """

"""

In [30]:
df = pd.read_sql_query(query, database_url)
df.head()

Unnamed: 0,number_of_decades
0,10.0


-- Round duration_hours to two decimal places

In [31]:
query = """

"""

In [32]:
df = pd.read_sql_query(query, database_url)
df.head()

Unnamed: 0,title,duration_hours
0,Intolerance: Love's Struggle Throughout the Ages,2.05
1,Over the Hill to the Poorhouse,1.83
2,The Big Parade,2.52
3,Metropolis,2.42
4,Pandora's Box,1.83


#### Sorting Results

*ORDER BY: comes after FROM (and after WHERE, if present)*  
*Direction: ASC (the default) or DESC*  
*ORDER BY multiple fields: ORDER BY field_one, field_two*
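
A short sketch of sorting on two fields with different directions, run against an in-memory `sqlite3` database as a stand-in for PostgreSQL. The film rows are invented, and NULL values are left out on purpose because the two systems order them differently by default.

```python
import sqlite3

# Invented rows: two countries, two films each.
sort_demo = sqlite3.connect(":memory:")
sort_demo.execute("CREATE TABLE films (title TEXT, country TEXT, duration INTEGER)")
sort_demo.executemany(
    "INSERT INTO films VALUES (?, ?, ?)",
    [
        ("Film A", "USA", 95),
        ("Film B", "Germany", 120),
        ("Film C", "USA", 130),
        ("Film D", "Germany", 90),
    ],
)

# First key ascending (country A-Z), second key descending (longest first).
sorted_rows = sort_demo.execute(
    "SELECT country, duration, title FROM films "
    "ORDER BY country ASC, duration DESC"
).fetchall()

for row in sorted_rows:
    print(row)
# ('Germany', 120, 'Film B')
# ('Germany', 90, 'Film D')
# ('USA', 130, 'Film C')
# ('USA', 95, 'Film A')
sort_demo.close()
```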

Select the name of each person in the people table, sorted alphabetically.

In [33]:
query = """

"""

In [34]:
df = pd.read_sql_query(query, database_url)
df.head()

Unnamed: 0,name
0,50 Cent
1,Aaliyah
2,Aaron Ashmore
3,Aaron Hann
4,Aaron Hill


Select the title and duration for every film, from longest duration to shortest.

In [59]:
query = """

"""

In [61]:
df = pd.read_sql_query(query, database_url)
df.tail()

Unnamed: 0,title,duration
4963,Anger Management,22.0
4964,"10,000 B.C.",22.0
4965,Wal-Mart: The High Cost of Low Price,20.0
4966,Vessel,14.0
4967,The Touch,7.0


Select the release_year, duration, and title of films ordered by their release year and duration, in that order.

In [37]:
query = """

"""

In [38]:
df = pd.read_sql_query(query, database_url)
df.head()

Unnamed: 0,release_year,duration,title
0,1916.0,123.0,Intolerance: Love's Struggle Throughout the Ages
1,1920.0,110.0,Over the Hill to the Poorhouse
2,1925.0,151.0,The Big Parade
3,1927.0,145.0,Metropolis
4,1929.0,100.0,The Broadway Melody


Select the certification, release_year, and title from films ordered first by certification (alphabetically) and second by release year, starting with the most recent year.

In [39]:
query = """

"""

In [40]:
df = pd.read_sql_query(query, database_url)
df.head()

Unnamed: 0,certification,release_year,title
0,,,Trapped
1,,,Get Real
2,,,Ghost Hunters
3,,,Hit the Floor
4,,,The Grand


#### GROUP BY

*ORDER BY always comes AFTER GROUP BY!*
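
The clause order can be sketched on toy data as follows. The example uses an in-memory `sqlite3` database as a stand-in for PostgreSQL, and the rows are invented for illustration.

```python
import sqlite3

# Invented rows: three countries with 3, 2 and 1 films.
group_demo = sqlite3.connect(":memory:")
group_demo.execute("CREATE TABLE films (title TEXT, country TEXT)")
group_demo.executemany(
    "INSERT INTO films VALUES (?, ?)",
    [
        ("Film A", "USA"), ("Film B", "Germany"), ("Film C", "USA"),
        ("Film D", "USA"), ("Film E", "Germany"), ("Film F", "Italy"),
    ],
)

# GROUP BY builds one row per country; ORDER BY then sorts those grouped rows.
counts = group_demo.execute(
    "SELECT country, COUNT(*) AS film_count FROM films "
    "GROUP BY country "
    "ORDER BY film_count DESC, country"
).fetchall()

print(counts)  # [('USA', 3), ('Germany', 2), ('Italy', 1)]
group_demo.close()
```

Sorting happens on the grouped result, which is why `ORDER BY` can refer to the aggregate alias `film_count`.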

Select the release_year and count of films released in each year aliased as film_count.

In [41]:
query = """

"""

In [42]:
df = pd.read_sql_query(query, database_url)
df.head()

Unnamed: 0,release_year,film_count
0,1954.0,5
1,1988.0,31
2,1959.0,3
3,1964.0,10
4,1969.0,10


Select the release_year and average duration aliased as avg_duration of all films, grouped by release_year.

In [45]:
query = """

"""

In [46]:
df = pd.read_sql_query(query, database_url)
df.head()

Unnamed: 0,release_year,avg_duration
0,1954.0,140.6
1,1988.0,107.0
2,1959.0,136.666667
3,1964.0,119.4
4,1969.0,126.0


Select the release_year, country, and the maximum budget aliased as max_budget for each year and each country; sort your results by release_year and country.

In [48]:
query = """

"""
df = pd.read_sql_query(query, database_url)
df.head()

Unnamed: 0,release_year,country,max_budget
0,1916.0,USA,385907.0
1,1920.0,USA,100000.0
2,1925.0,USA,245000.0
3,1927.0,Germany,6000000.0
4,1929.0,Germany,


#### Answering business questions
In the real world, every SQL query starts with a business question. Then it is up to you to decide how to write the query that answers the question. Let's try this out.

Which release_year had the most language diversity?

Take your time to translate this question into code. We'll get you started then it's up to you to test your queries in the console.

"Most language diversity" can be interpreted as COUNT(DISTINCT ___). Now over to you.

In [49]:
query = """

"""

In [50]:
df = pd.read_sql_query(query, database_url)
df.head()

Unnamed: 0,release_year,num_languages
0,2006.0,17
1,2015.0,15
2,2005.0,14
3,2008.0,13
4,2013.0,13


## Filter with HAVING

In PostgreSQL (and SQL in general) you use both **HAVING** and **WHERE** to filter, but they are applied at different stages of a query and serve different purposes.

#### **WHERE**
- **Purpose**: Filters rows **before** aggregation (grouping) takes place.
- **When to use**: When you want to filter individual rows based on their values.

##### Example:
Suppose you want to find all films with a rating higher than 8 (the `rating` column is illustrative; the `films` table queried earlier in this notebook does not have one).
```sql
SELECT title, rating
FROM films
WHERE rating > 8;
```

- **Filters rows before** any `GROUP BY` or aggregate functions such as `COUNT`, `SUM`, etc.

#### **HAVING**
- **Purpose**: Filters **after** aggregation (grouping) has taken place.
- **When to use**: When you want to filter results based on an aggregate function.

##### Example:
Suppose you want to find all genres whose average rating is higher than 8.
```sql
SELECT genre, AVG(rating) AS avg_rating
FROM films
GROUP BY genre
HAVING AVG(rating) > 8;
```

- **Filters groups** created with `GROUP BY`, based on the results of aggregate functions.

---

#### Differences between WHERE and HAVING

| **Aspect**              | **WHERE**                         | **HAVING**                           |
|-------------------------|-----------------------------------|--------------------------------------|
| **Timing**              | Before aggregation                | After aggregation                    |
| **Applied to**          | Individual rows                   | Groups or aggregates                 |
| **Use with aggregates** | Not allowed (e.g. `SUM`, `AVG`)   | Required for filtering on aggregates |

---

#### **Combining WHERE and HAVING**
You can use both in the same query, for example:
```sql
SELECT genre, AVG(rating) AS avg_rating
FROM films
WHERE release_year > 2000 -- Filter individual films
GROUP BY genre
HAVING AVG(rating) > 8; -- Filter groups based on the average rating
```

- **WHERE** filters individual rows (only films released after 2000).
- **HAVING** filters groups (genres with an average rating higher than 8).

In short: **use WHERE for individual rows and HAVING for aggregates.**
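
To see both filters act at their own stage, the combined query can be run on a handful of invented rows. This is a sketch against an in-memory `sqlite3` database, not the PostgreSQL `movies` database; the `genre` and `rating` columns and all values are hypothetical.

```python
import sqlite3

# Hypothetical films table with genre and rating columns (toy data).
having_demo = sqlite3.connect(":memory:")
having_demo.execute("CREATE TABLE films (genre TEXT, rating REAL, release_year INTEGER)")
having_demo.executemany(
    "INSERT INTO films VALUES (?, ?, ?)",
    [
        ("drama", 9.0, 1995),   # removed by WHERE: not released after 2000
        ("drama", 8.5, 2005),
        ("drama", 8.0, 2010),
        ("comedy", 9.5, 2003),
        ("comedy", 6.0, 2012),  # pulls the comedy average down to 7.75
    ],
)

# WHERE filters rows first, then GROUP BY aggregates, then HAVING filters groups.
filtered = having_demo.execute(
    "SELECT genre, AVG(rating) AS avg_rating FROM films "
    "WHERE release_year > 2000 "
    "GROUP BY genre "
    "HAVING AVG(rating) > 8"
).fetchall()

print(filtered)  # [('drama', 8.25)]
having_demo.close()
```

Comedy contains a film rated 9.5, yet the group is dropped because `HAVING` looks at the group average and not at individual rows.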

Filtering grouped data can be especially handy when working with a large dataset. With thousands or even millions of rows, **HAVING** lets you keep only the groups you care about, such as release years in which the average film runs longer than two hours.

Practice using **HAVING** to find out which country (or countries) has the most varied film certifications.

- Select `country` from the `films` table and get the number of distinct certifications, aliased as `certification_count`.  
- Group the results by `country`.  
- Filter on the distinct certification count so that only results greater than 10 are shown.

In [51]:
query = """

"""

In [52]:
df = pd.read_sql_query(query, database_url)
df.head()

Unnamed: 0,country,certification_count
0,UK,11
1,USA,13


#### HAVING and sorting

- Select the country and the average budget as average_budget, rounded to two decimal places, from films.
- Group the results by country.
- Filter the results to countries with an average budget of more than one billion (1000000000).
- Sort by descending order of the average_budget.

In [53]:
query = """

"""

In [54]:
df = pd.read_sql_query(query, database_url)
df.head()

Unnamed: 0,country,average_budget
0,South Korea,1383960000.0
1,Hungary,1260000000.0


Select the release_year for each film in the films table, filter for records released after 1990, and group by release_year.

In [55]:
query = """

"""
df = pd.read_sql_query(query, database_url)
df.head()

Unnamed: 0,release_year
0,2008
1,1991
2,2009
3,2005
4,2013


Modify the query to include the average budget aliased as avg_budget and average gross aliased as avg_gross for the results we have so far.

In [57]:
query = """

"""
df = pd.read_sql_query(query, database_url)
df.head()

Unnamed: 0,release_year,avg_budget,avg_gross
0,2008,41804890.0,44573510.0
1,1991,25176550.0,53844500.0
2,2009,37073290.0,46207440.0
3,2005,70323940.0,41159140.0
4,2013,40519040.0,56158360.0


Modify the query once more so that only years with an average budget of greater than 60 million are included.

In [58]:
query = """

"""
df = pd.read_sql_query(query, database_url)
df.head()

Unnamed: 0,release_year,avg_budget,avg_gross
0,2005,70323940.0,41159140.0
1,2006,93968930.0,39237860.0


In [59]:
# OR:
query = """

"""
df = pd.read_sql_query(query, database_url)
df.head()

Unnamed: 0,release_year,avg_budget,avg_gross
0,2005,70323940.0,41159140.0
1,2006,93968930.0,39237860.0


Finally, order the results from the highest average gross and limit to one.

In [60]:
query = """

"""
df = pd.read_sql_query(query, database_url)
df.head()

Unnamed: 0,release_year,avg_budget,avg_gross
0,2005,70323940.0,41159140.0
