# SQL Databases

## Connecting to a SQL database

You can query tables in a SQL database with the Pandas `read_sql` function.

### Setting up your connection with SQLAlchemy
Pandas relies on a third-party library called [SQLAlchemy][1] to establish a connection to a database.

### Connection string
To make the connection, we need to pass a connection string to the `create_engine` function. The general form of a connection string is the following:

`dialect+driver://username:password@host:port/database`

Read more about [engine configuration here][2].

### Connection string for SQLite

We will be using a SQLite database in this notebook. Its [connection string][3] is simpler, since a SQLite database is a single file and needs no username, password, host, or port:

`sqlite:///<path_to_db>`
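
To see how the pieces of a connection string line up, here is a minimal sketch that parses a few strings with SQLAlchemy's `make_url` helper. The PostgreSQL credentials, host, and database name are made up for illustration, as is the `/tmp/chinook.db` path. Parsing a string does not open a connection, so no database driver needs to be installed.

```python
from sqlalchemy.engine.url import make_url

# Hypothetical credentials and host, used only to show the parts of a URL.
general = make_url("postgresql+psycopg2://scott:tiger@localhost:5432/mydb")
print(general.drivername)                # the dialect+driver part
print(general.username, general.host, general.port, general.database)

# SQLite has no user, host, or port: the file path follows the third slash.
relative = make_url("sqlite:///../data/databases/chinook.db")
# A fourth slash starts an absolute path (hypothetical location).
absolute = make_url("sqlite:////tmp/chinook.db")
print(relative.database)
print(absolute.database)
```

Three slashes followed by a relative path resolve against the current working directory, which is why the notebook's location matters when the engine is created later on.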


## The Chinook Database
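A relational database is usually documented with an entity-relationship diagram that shows the tables, their columns and data types, and the relationships between them. The diagram below describes Chinook, a sample database that models a digital media store.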

![](images/chinook_er.jpg)

[1]: https://www.sqlalchemy.org/
[2]: https://docs.sqlalchemy.org/en/latest/core/engines.html
[3]: https://docs.sqlalchemy.org/en/latest/core/engines.html#sqlite

## Primary and Foreign Keys
A key component of relational databases is the idea of primary and foreign keys. A primary key is a column whose value uniquely identifies each row in its table. A foreign key is a column whose values refer to the primary key of another table. Unlike a primary key, a foreign key does not have to be unique and can appear any number of times within its table.

In the above diagram, all the primary keys have a little key symbol next to them. For example, in the `tracks` table, `TrackId` is the primary key, which should guarantee that each value in that column is unique.

The `tracks` table also has several foreign keys: `AlbumId`, `MediaTypeId`, and `GenreId`.
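
The sketch below shows how a database enforces these two kinds of keys. It does not touch the Chinook file. Instead it builds a tiny in-memory SQLite database with the standard-library `sqlite3` module, using cut-down versions of `media_types` and `tracks` and made-up rows. The helper `try_insert` is introduced here only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("CREATE TABLE media_types (MediaTypeId INTEGER PRIMARY KEY, Name TEXT)")
conn.execute("""
    CREATE TABLE tracks (
        TrackId INTEGER PRIMARY KEY,
        Name TEXT,
        MediaTypeId INTEGER REFERENCES media_types (MediaTypeId)
    )
""")
conn.execute("INSERT INTO media_types VALUES (1, 'Type A')")
# The same foreign key value may repeat across rows.
conn.executemany("INSERT INTO tracks VALUES (?, ?, ?)",
                 [(1, "Track A", 1), (2, "Track B", 1)])


def try_insert(sql):
    """Return True if the insert succeeds, False if a constraint rejects it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False


# Reusing TrackId 1 violates the primary key.
duplicate_pk_ok = try_insert("INSERT INTO tracks VALUES (1, 'Track C', 1)")
# MediaTypeId 99 does not exist in media_types, so the foreign key fails.
dangling_fk_ok = try_insert("INSERT INTO tracks VALUES (3, 'Track D', 99)")
print(duplicate_pk_ok, dangling_fk_ok)

track_count = conn.execute("SELECT COUNT(*) FROM tracks").fetchone()[0]
print(track_count)
```

Both rejected inserts leave the table unchanged, so only the two valid rows remain.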

### Relationships between tables
The relationships between the tables are mapped with lines in the diagram. These lines connect a column of one table to a column in another.

Notice the symbol where each line meets a table. A symbol with a single "prong" means that each value appears at most once in that column. A multi-pronged symbol means that each value can appear more than once.

For example, look at the single-pronged symbol from the `media_types` table connected to the multi-pronged symbol at the `tracks` table. This means that each `MediaTypeId` in the `media_types` table may be found multiple times in the `tracks` table.

Looking at the relationship in the opposite direction, each `MediaTypeId` in the `tracks` table is found exactly once in the `media_types` table.

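This is called a one-to-many (or, read in the other direction, many-to-one) relationship. Two single-pronged symbols indicate a one-to-one relationship. Many-to-many relationships are also possible, but they are usually implemented with an intermediate (junction) table rather than directly between two tables.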
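
Once tables are loaded as DataFrames, the one-to-many relationship described above can be checked directly. The following is a minimal sketch on made-up rows (`toy_media` and `toy_tracks` are illustrative names, not tables from the database). It uses `Series.is_unique` and the `validate` argument of `merge`, which raises `pandas.errors.MergeError` if the keys do not match the stated relationship.

```python
import pandas as pd

# Made-up rows that mimic the shape of the two tables.
toy_media = pd.DataFrame({"MediaTypeId": [1, 2],
                          "Name": ["Type A", "Type B"]})
toy_tracks = pd.DataFrame({"TrackId": [10, 11, 12],
                           "MediaTypeId": [1, 1, 2]})

# "One" side: every key appears once. "Many" side: keys may repeat.
one_side_unique = bool(toy_media["MediaTypeId"].is_unique)
many_side_unique = bool(toy_tracks["MediaTypeId"].is_unique)
print(one_side_unique, many_side_unique)

# validate="many_to_one" asserts that the right-hand keys are unique.
checked = toy_tracks.merge(toy_media, on="MediaTypeId", validate="many_to_one")
print(checked.shape)
```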

## Preparing the connection
Let's import the `create_engine` function and pass it the location of the database (relative to our current path).

In [None]:
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine('sqlite:///../data/databases/chinook.db')

### Back to Pandas
We can import an entire table from the database directly as a Pandas DataFrame with the `read_sql` function. Let's import the `tracks` table.

In [None]:
tracks = pd.read_sql('tracks', con=engine)
tracks.head()

### Use raw SQL
Instead of a table name, you can pass `read_sql` a SQL query as a string. It helps to check the available column names first.

In [None]:
tracks.columns

In [None]:
query = """select name, composer, milliseconds 
           from tracks 
           where milliseconds > 200000 and composer is not null """
long_tracks = pd.read_sql(query, engine)
long_tracks.head(10)
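
When a threshold such as the minimum duration comes from a variable, it is generally safer to pass it through the `params` argument of `read_sql` than to paste it into the query string. The sketch below is self-contained rather than a drop-in replacement for the cell above: it assumes a plain `sqlite3` connection to an in-memory database with three made-up rows, and uses `?`, the placeholder style of the `sqlite3` driver. Other drivers may expect a different placeholder syntax.

```python
import sqlite3

import pandas as pd

# In-memory stand-in for the tracks table, with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tracks (name TEXT, composer TEXT, milliseconds INTEGER)")
conn.executemany("INSERT INTO tracks VALUES (?, ?, ?)", [
    ("Short", "A", 150000),
    ("Long", "B", 250000),
    ("Long, no composer", None, 300000),
])

min_ms = 200000
query = """select name, composer, milliseconds
           from tracks
           where milliseconds > ? and composer is not null"""
# The driver substitutes min_ms for the ? placeholder.
toy_long_tracks = pd.read_sql(query, conn, params=(min_ms,))
print(toy_long_tracks)
```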

## Joining tables in Pandas with `merge`

The `merge` method allows us to join two Pandas DataFrames together based on the values within one or more columns. It follows SQL-style join logic and supports inner, left, right, and outer joins through its `how` parameter. The default is an inner join.
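
To make the four join types concrete, here is a small sketch on two made-up DataFrames whose keys only partly overlap. The names `left`, `right`, and `keys_by_how` are illustrative.

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2, 3], "left_val": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "right_val": ["x", "y", "z"]})

# Collect which keys survive each kind of join.
keys_by_how = {}
for how in ["inner", "left", "right", "outer"]:
    joined = left.merge(right, on="key", how=how)
    keys_by_how[how] = sorted(joined["key"].tolist())
    print(how, keys_by_how[how])
```

An inner join keeps only keys found in both tables, left and right joins keep every key from one side, and an outer join keeps every key from either side. Rows without a match get missing values in the columns that come from the other table.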

### Getting the media type name in our tracks table
The `tracks` table has a column called `MediaTypeId` but does not directly store the name of this media type in the table itself.

Let's join the `tracks` table with the `media_types` table to get the name of the media along with the track information in a single table.

In [None]:
media_types = pd.read_sql('media_types', engine)
media_types.head()

In [None]:
tracks_media = tracks.merge(media_types, on='MediaTypeId')
tracks_media.head()

In [None]:
tracks.shape

In [None]:
tracks_media.shape

### Explanation

The `on` parameter is set to the column name (or list of names) used to join the two tables. The column must appear in both tables. Notice that the resulting table has only one more column than `tracks`. Although the `media_types` table has two columns, the joining column `MediaTypeId` appears only once in the result, so only the non-joining column is added. Because both tables have a column called `Name`, the two appear as `Name_x` (from `tracks`) and `Name_y` (from `media_types`).

Pandas appends a suffix to any column name that appears in both tables in order to differentiate them. The defaults are `_x` and `_y`. You can control the suffixes with the `suffixes` parameter like this:

In [None]:
tracks.merge(media_types, on='MediaTypeId', suffixes=('_left', '_right')).head()

### Different column names when joining
If the column names for the joining tables are not the same, use the `left_on` and `right_on` parameters to specify their names explicitly. For instance, let's rename the joining column in the `tracks` table.

In [None]:
tracks2 = tracks.rename(columns={'MediaTypeId': 'MTID'})
tracks2.head()

In [None]:
tracks2.merge(media_types, left_on='MTID', right_on='MediaTypeId').head()
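
One thing to be aware of with `left_on` and `right_on`: when the two key columns have different names, both are kept in the result, even though they hold the same values. A minimal sketch on made-up rows (`toy_tracks` and `toy_media` are illustrative names) shows this, along with one way to drop the redundant column.

```python
import pandas as pd

toy_tracks = pd.DataFrame({"TrackId": [10, 11], "MTID": [1, 2]})
toy_media = pd.DataFrame({"MediaTypeId": [1, 2], "Name": ["Type A", "Type B"]})

merged = toy_tracks.merge(toy_media, left_on="MTID", right_on="MediaTypeId")
# Both key columns, MTID and MediaTypeId, are present in the result.
print(merged.columns.tolist())

# Drop the duplicate key column if it is not needed.
tidy = merged.drop(columns="MediaTypeId")
print(tidy.columns.tolist())
```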

## Exercises
Read the tables into Pandas to answer the questions. Do not answer them with raw SQL statements.

### Exercise 1
<span  style="color:green; font-size:16px">How many media types does each track have? Answer this by looking at the data diagram and then programmatically.</span>

### Exercise 2
<span  style="color:green; font-size:16px">Which track has sold the most copies?</span>

### Exercise 3
<span  style="color:green; font-size:16px">Which playlist has the most tracks?</span>

### Exercise 4
<span  style="color:green; font-size:16px">Among the playlists that have at least 15 tracks, which one has the most expensive tracks on average?</span>

### Exercise 5
<span  style="color:green; font-size:16px">Find the most sold genre per country.</span>

### Exercise 6
<span  style="color:green; font-size:16px">Find the name and email of each employee's boss. Make use of the `suffixes` argument to better label the merged data. Be sure to include employees that don't have bosses. A table that refers to itself in this way has a recursive relationship.</span>

### Exercise 7
<span  style="color:green; font-size:16px">Which artists have the longest tracks on average? Return the answer in minutes.</span>