In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sqlalchemy.engine import create_engine

%matplotlib inline

In [None]:
%load_ext sql
%config SqlMagic.displaylimit=50
%config SqlMagic.autopandas=True

In [None]:
%sql postgresql://imdb:imdb_admin@postgres:5432/imdb_database

In [None]:
connection = create_engine('postgresql://imdb:imdb_admin@postgres:5432/imdb_database')

# Introduction to SQL (Structured Query Language)

##### Version 0.1

***

By Scott Coughlin (Northwestern IT Research Computing and Data Services)  
June 3rd 2024

[Session 21](https://github.com/LSSTC-DSFP/Session-21) is primarily concerned with handling our data with efficiency.

Ideally, for any given task we want solutions that run *faster*.

This can be accomplished in many different ways:

$~~~~~~$build algorithms that execute faster

$~~~~~~$spread calculations over many different computers simultaneously

$~~~~~~$find a compact storage solution for the data so it can be accessed more quickly

In our introduction to SQL we will start with simple queries of existing tables, and discuss creating your own tables using `pandas` as a challenge problem. 

## Problem 1) IMDb Data

Throughout the session we will use information from the [Internet Movie Database (IMDb)](https://www.imdb.com/) to illustrate various principles regarding databases.

A quick note on the provenance of this data: the files used to populate this database come from [this website](https://relational.fit.cvut.cz/dataset/IMDb), and they may not list every single movie on IMDb (for example, there are no movies after 2004).

#### Please note that you can run a SQL command from a Jupyter cell by adding `%sql` in front of the SQL command you want to run; see the examples below
```
## Perform a SQL command and see the results of the query
%sql SELECT * FROM imdb_movies;

## If you save to a variable, in this case "result", then the variable will be a `pandas` DataFrame based on the result of the query
result = %sql SELECT * FROM imdb_directors ORDER BY first_name LIMIT 10; 
```

Please execute the cell below to list all of the table names in the `imdb_database` database. You will need these table names to answer the questions that follow.

In [None]:
%sql \dt+

In [None]:
# Load each table into a pandas DataFrame (SqlMagic.autopandas=True is set above)
imdb_movies = %sql SELECT * FROM imdb_movies
imdb_directors = %sql SELECT * FROM imdb_directors
imdb_movies_directors = %sql SELECT * FROM imdb_movies_directors
imdb_movies_genres = %sql SELECT * FROM imdb_movies_genres

**Problem 1a**

Using SQL, SELECT 10 movies from the `imdb_movies` table. Then SELECT 10 directors from `imdb_directors`, ordered by `first_name`.

**Problem 1b**

Using SQL, how many movies are there? How many directors are there? 

*Write your answer here*

**Problem 1c**

Using SQL, determine how many movies were made after the year 2000.

*Write your answer here*

**Problem 1d**

How many different movie genres are there?

*Write your answer here*

## Problem 2) Joins

We started this exercise with the goal of being efficient. And yet, the data have been organized across 4 different tables (each table is effectively a separate CSV file).  
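As a reminder of the mechanics, a join matches rows from two tables on a shared key column. The sketch below is illustrative only: it uses Python's built-in `sqlite3` module with an in-memory database rather than the PostgreSQL server used in this notebook, and the table and column names (`films`, `film_genres`, `film_id`) are hypothetical toy stand-ins, not the IMDb schema.

```python
import sqlite3

# In-memory toy database (an assumption for illustration; NOT the IMDb tables)
toy_conn = sqlite3.connect(":memory:")
toy_conn.executescript("""
    CREATE TABLE films (film_id INTEGER PRIMARY KEY, title TEXT, year INTEGER);
    CREATE TABLE film_genres (film_id INTEGER, genre TEXT);
    INSERT INTO films VALUES (1, 'Alpha', 1999), (2, 'Beta', 2003);
    INSERT INTO film_genres VALUES (1, 'Drama'), (2, 'Action'), (2, 'Comedy');
""")

# INNER JOIN: pair each film with every genre row that shares its film_id
join_rows = toy_conn.execute("""
    SELECT f.title, g.genre
    FROM films AS f
    JOIN film_genres AS g ON f.film_id = g.film_id
    ORDER BY f.title, g.genre
""").fetchall()

for title, genre in join_rows:
    print(title, genre)

toy_conn.close()
```

Note that a film with two genres appears twice in the joined result, once per matching row. The same `JOIN ... ON` pattern should carry over to the `%sql` cells in this notebook once you identify the shared key columns in the real tables.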

**Problem 2a**

Join `imdb_movies` and `imdb_movies_genres` together

*write your answer here*

**Problem 2b**

Join `imdb_movies`, `imdb_movies_directors` and `imdb_directors` together

*write your answer here*

## Problem 3) Groups and Aggregates

Now that we know why the data has been organized in this way, we can leverage this unique structure in order to learn interesting properties of the data. 
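For reference, grouping collapses rows that share a value into one row per group, and aggregate functions such as `COUNT` and `MAX` summarize each group. The following is a minimal sketch under stated assumptions: it runs against an in-memory `sqlite3` database instead of the PostgreSQL server, and the `pets` table and its columns are made up purely for illustration.

```python
import sqlite3

# Toy in-memory table (hypothetical data, unrelated to the IMDb tables)
toy_db = sqlite3.connect(":memory:")
toy_db.executescript("""
    CREATE TABLE pets (name TEXT, species TEXT, age INTEGER);
    INSERT INTO pets VALUES
        ('Rex', 'dog', 3), ('Fido', 'dog', 7),
        ('Tom', 'cat', 5), ('Kit', 'cat', 2), ('Mog', 'cat', 9);
""")

# One output row per species: how many pets, and the oldest age in that group
species_summary = toy_db.execute("""
    SELECT species, COUNT(*) AS n, MAX(age) AS oldest
    FROM pets
    GROUP BY species
    ORDER BY n DESC
""").fetchall()

for species, n, oldest in species_summary:
    print(species, n, oldest)

toy_db.close()
```

Sorting on the aggregate (`ORDER BY n DESC`) and adding `LIMIT 1` is one common way to find the largest group, although ties would then need separate handling.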

**Problem 3a**

In which year were the most movies made according to IMDb?

*write your answer here*

**Problem 3b**

How many "Action" movies were made after the year 1980? Before the year 1980?

*write your answer here*

**Problem 3c**

Select all films made by `Scorsese`. How many are there?

*write your answer here*

**Problem 3d**

According to the IMDb data, which director has directed the most movies?

*write your answer here*

**Problem 3e**

According to the IMDb data, which director has directed the most movies in each genre?

*write your answer here*

## Challenge Problem) Make your own tables

**Problem 1a**

Create a new TABLE.

**Problem 1b**

INSERT 3 rows into the TABLE you made above
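The general shape of `CREATE TABLE` followed by `INSERT` can be sketched as follows. This is a hedged example rather than a solution: it uses an in-memory `sqlite3` database, and the `observations` table, its columns, and its values are all hypothetical. On the PostgreSQL server used in this notebook the type names may differ slightly (for example `DOUBLE PRECISION` instead of `REAL`).

```python
import sqlite3

# Scratch in-memory database (assumption: stands in for the course database)
scratch = sqlite3.connect(":memory:")

# Define the columns and their types
scratch.execute("""
    CREATE TABLE observations (
        id     INTEGER PRIMARY KEY,
        target TEXT NOT NULL,
        mag    REAL
    )
""")

# Insert three rows; the ? placeholders let the driver handle quoting
rows_to_add = [(1, "star_a", 12.5), (2, "star_b", 14.1), (3, "star_c", 9.8)]
scratch.executemany(
    "INSERT INTO observations (id, target, mag) VALUES (?, ?, ?)",
    rows_to_add,
)
scratch.commit()

# Read the rows back to confirm they were stored
stored = scratch.execute(
    "SELECT id, target, mag FROM observations ORDER BY id"
).fetchall()
print(stored)

scratch.close()
```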

**Problem 1c**

Create a pandas DataFrame and save as a SQL table

***Hint: look at the `pandas.DataFrame.to_sql` documentation, and note that we already created a SQLAlchemy engine in a variable called `connection`.***
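A minimal sketch of the round trip, under stated assumptions: `DataFrame.to_sql` is given a plain in-memory `sqlite3` connection here so the example runs without a server, and the table name `demo_table` and its contents are invented for illustration. In this notebook you would presumably pass the `connection` engine created at the top instead.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database (assumption: stands in for the PostgreSQL engine)
demo_db = sqlite3.connect(":memory:")

# Hypothetical DataFrame to store
demo_df = pd.DataFrame({"name": ["a", "b", "c"], "value": [1, 2, 3]})

# Write the DataFrame as a table; index=False skips the DataFrame index column,
# and if_exists="replace" overwrites any existing table with the same name
demo_df.to_sql("demo_table", demo_db, index=False, if_exists="replace")

# Read the table back with a SQL query to check the result
round_trip = pd.read_sql_query("SELECT * FROM demo_table ORDER BY name", demo_db)
print(round_trip)

demo_db.close()
```

The `if_exists` argument also accepts `"fail"` (the default) and `"append"`, which matters when you rerun a cell that writes to the same table.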