# Relational Algebra, SQL, and Joins


The goals of this recitation are to review relational algebra and to get some practice with SQL queries and joins.

A relational database is one type of database. It uses a structure that allows us to identify and access data in relation to another piece of data in the database. Data in a relational database is organized into tables.

### Relational Algebra
Relational algebra is the formal, theoretical way of working with data stored in the relational model.

Elements:
1. Row: a tuple.
2. Relation (table): a set of tuples.


Operations:
- Projection, selection, rename, join, etc.

We will introduce these abstract concepts through a practical approach.
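As a rough illustration of the elements and operations listed above, a relation can be modeled in plain Python as a set of tuples. This is only a sketch: the variable names are illustrative, and the values mirror the crew table loaded later in this recitation.

```python
# A relation modeled as a set of tuples: (id, name, rank).
crew = {
    (1, "Jane", 10),
    (2, "Dan", 9),
    (3, "Alex", 4),
    (4, "Jen", 4),
    (5, "Brandon", 1),
}

# Selection: keep only the tuples that satisfy a predicate (rank = 4).
selected = {t for t in crew if t[2] == 4}

# Projection: keep only some attributes (here, rank).
# Because a relation is a set, duplicate values collapse automatically.
projected = {(t[2],) for t in crew}

print(sorted(selected))   # [(3, 'Alex', 4), (4, 'Jen', 4)]
print(sorted(projected))  # [(1,), (4,), (9,), (10,)]
```

Note that SQL tables, unlike sets, may contain duplicate rows, which is why SQL needs an explicit DISTINCT.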

### SQL - Structured Query Language

A language for performing relational algebra operations (e.g. selection, projection, joins).

- We write queries in SQL to retrieve data and answer questions about it.
- Declarative language (not procedural): you describe what result you want, NOT how to obtain it.

Using SQL, you can create, modify, and delete tables, as well as select, insert, and delete data in existing tables.
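A minimal sketch of these statements, using Python's built-in `sqlite3` module and a throwaway in-memory database. The table name `demo_crew` and the connection name `conn_demo` are made up for illustration and are not part of the recitation dataset.

```python
import sqlite3

conn_demo = sqlite3.connect(":memory:")  # throwaway in-memory database

# Create a table (hypothetical name, for illustration only).
conn_demo.execute(
    "CREATE TABLE demo_crew (id INTEGER PRIMARY KEY, name TEXT, rank INTEGER)"
)

# Modify the table by adding a column.
conn_demo.execute("ALTER TABLE demo_crew ADD COLUMN role_id INTEGER")

# Insert rows (role_id is left unset, so it is NULL).
conn_demo.executemany(
    "INSERT INTO demo_crew (id, name, rank) VALUES (?, ?, ?)",
    [(1, "Jane", 10), (2, "Dan", 9)],
)

# Modify existing data, then delete a row.
conn_demo.execute("UPDATE demo_crew SET rank = 8 WHERE name = 'Dan'")
conn_demo.execute("DELETE FROM demo_crew WHERE name = 'Jane'")

# Select what is left.
remaining = conn_demo.execute("SELECT id, name, rank FROM demo_crew").fetchall()
print(remaining)  # [(2, 'Dan', 8)]

# Delete the table itself.
conn_demo.execute("DROP TABLE demo_crew")
conn_demo.close()
```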

NOTE: The exact syntax of SQL may vary depending on the underlying database you are using, but most dialects are very similar.


In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import sqlite3


In [None]:
# when you run this cell, it'll ask you to input "Y/n" for uninstalling sqlalchemy
# make sure to manually input "Y" for the code to continue running
!pip uninstall sqlalchemy
!pip install sqlalchemy==1.4.46
!pip install pandasql
import pandasql as ps
# Set up a database
conn = sqlite3.connect('recitationTest.db')

Found existing installation: SQLAlchemy 2.0.20
Uninstalling SQLAlchemy-2.0.20:
  Would remove:
    /usr/local/lib/python3.10/dist-packages/SQLAlchemy-2.0.20.dist-info/*
    /usr/local/lib/python3.10/dist-packages/sqlalchemy/*
Proceed (Y/n)? Y
  Successfully uninstalled SQLAlchemy-2.0.20
Collecting sqlalchemy==1.4.46
  Downloading SQLAlchemy-1.4.46-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.6 MB)
Installing collected packages: sqlalchemy
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
ipython-sql 0.5.0 requires sqlalchemy>=2.0, but you have sqlalchemy 1.4.46 which is incompatible.
Successfully installed sqlalchemy-1.4.46
Collecting pandasql
  Downloading pandasql-0.7.3.t

### Our Dataset

#### Spaceship Management Database

We want to keep track of:
- `crew.csv`: Crew members master file
- `roles.csv`: Roles on the spaceship (captain, scientist, etc.).
- `equipment.csv`: Equipment (centrifuge, lab gloves, soldering stations, etc.).
- `worklog.csv`: Which days crew members worked, and for how many hours.
- `manages.csv`: Which crew members manage which equipment.

We want to ask questions about this data.

### Import Data

Please download the data files to your local drive and drop them into this Colab's file folder. You can open the folder by selecting the 📁 folder icon in the left sidebar, then drag and drop the CSV files into it.

- crew.csv [Download](https://drive.google.com/file/d/1vpUdssqCn9EVn9KGAANTHkc7IkcpwqyN/view?usp=sharing)
- roles.csv [Download](https://drive.google.com/file/d/1x0ASBcsXg7jDtal4726R7I1kSffnI5hB/view?usp=sharing)
- equipment.csv [Download](https://drive.google.com/file/d/172C83HtkP0SjxF_teGgS0i9Ii9jcSPlz/view?usp=sharing)
- worklog.csv [Download](https://drive.google.com/file/d/1wwnbUGAuGQR611qfr2CHIAVyR0NBpFXL/view?usp=sharing)
- manages.csv [Download](https://drive.google.com/file/d/1thRZBRfmyMl4rDdSy9-OlZZxKSmIeS5z/view?usp=sharing)

In [None]:
crew_df = pd.read_csv("crew.csv")
roles_df = pd.read_csv("roles.csv")
equipment_df = pd.read_csv("equipment.csv")
manages_df = pd.read_csv("manages.csv")
worklog_df = pd.read_csv("worklog.csv")

### Examining the Data

In [None]:
crew_df.head()

Unnamed: 0,id,name,rank,role_id
0,1,Jane,10,1.0
1,2,Dan,9,2.0
2,3,Alex,4,3.0
3,4,Jen,4,4.0
4,5,Brandon,1,


In [None]:
crew_df.dtypes

id           int64
name        object
rank         int64
role_id    float64
dtype: object

In [None]:
roles_df.head()

Unnamed: 0,role_id,name
0,1,captain
1,2,scientist
2,3,engineer
3,4,engineer 2


In [None]:
roles_df.dtypes

role_id     int64
name       object
dtype: object

In [None]:
equipment_df.head()

Unnamed: 0,id,name
0,1,Centrifuge
1,2,Soldering Station
2,3,Notebook
3,4,Chemical Z


In [None]:
equipment_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      4 non-null      int64 
 1   name    4 non-null      object
dtypes: int64(1), object(1)
memory usage: 192.0+ bytes


In [None]:
manages_df.head()

Unnamed: 0,id,crew_id,equip_id
0,1,2,1
1,2,3,2
2,3,1,3
3,4,2,4
4,5,1,4


In [None]:
manages_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   id        5 non-null      int64
 1   crew_id   5 non-null      int64
 2   equip_id  5 non-null      int64
dtypes: int64(3)
memory usage: 248.0 bytes


In [None]:
worklog_df.head()

Unnamed: 0,id,crew_id,day,hours
0,1,1,1,10
1,2,2,1,5
2,3,3,1,8
3,4,4,1,12
4,5,1,2,5


In [None]:
worklog_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   id       9 non-null      int64
 1   crew_id  9 non-null      int64
 2   day      9 non-null      int64
 3   hours    9 non-null      int64
dtypes: int64(4)
memory usage: 416.0 bytes


Now that we have formed dataframes for our tables, we can use pandasql.
The idea of pandasql is to make Python speak SQL!
You can find more information here: https://community.alteryx.com/t5/Data-Science-Blog/pandasql-Make-python-speak-SQL/ba-p/138435


Suppose we just want to list the names of the crew members. SELECT retrieves the columns we would like to see from the rows of a table.

In [None]:
### Select only the names of crew members
query_crew_names = '''
SELECT name
FROM crew_df'''
crew_names_df = ps.sqldf(query_crew_names, locals())
crew_names_df

Unnamed: 0,name
0,Jane
1,Dan
2,Alex
3,Jen
4,Brandon


#### Conditional Retrieval

We use the WHERE clause to apply a condition to our retrieval.

In [None]:
# Retrieve all tuples where the crew member has rank 10 or 4 and a name starting with the letter J
query_conditional = """
SELECT *
FROM crew_df
WHERE (rank = 10 OR rank = 4)
  AND name LIKE 'j%'
"""
# Alternative: retrieve all tuples where the name starts with J and ends with N
# alt_query_conditional = '''SELECT * FROM crew_df WHERE name LIKE 'j%n' '''
crew_rank_df = ps.sqldf(query_conditional, locals())  # SELECT * returns all columns
crew_rank_df

Unnamed: 0,id,name,rank,role_id
0,1,Jane,10,1.0
1,4,Jen,4,4.0
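The pattern `'j%'` matches Jane and Jen even though both names start with a capital J. In SQLite, which pandasql uses by default, LIKE is case-insensitive for ASCII letters; other databases may behave differently. The sketch below uses the built-in `sqlite3` module directly on an in-memory copy of the crew names. The helper `names_like` and the connection name are illustrative only.

```python
import sqlite3

# In-memory copy of the crew names shown earlier (illustration only).
conn_like = sqlite3.connect(":memory:")
conn_like.execute("CREATE TABLE crew (id INTEGER, name TEXT)")
conn_like.executemany(
    "INSERT INTO crew VALUES (?, ?)",
    [(1, "Jane"), (2, "Dan"), (3, "Alex"), (4, "Jen"), (5, "Brandon")],
)

def names_like(pattern):
    """Return crew names matching a LIKE pattern, in id order."""
    rows = conn_like.execute(
        "SELECT name FROM crew WHERE name LIKE ? ORDER BY id", (pattern,)
    )
    return [name for (name,) in rows]

# % matches any run of characters (including none); _ matches exactly one.
print(names_like("j%"))   # ['Jane', 'Jen']
print(names_like("j%n"))  # ['Jen']
print(names_like("___"))  # ['Dan', 'Jen']
```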


#### Ordering

You can order your results by values in the columns.

Let’s retrieve the rows of the manages table in increasing order of crew_id.

In [None]:
query_ordering3 = '''
SELECT *
FROM manages_df
ORDER BY crew_id ASC'''
manages_order_df = ps.sqldf(query_ordering3, locals())
manages_order_df

Unnamed: 0,id,crew_id,equip_id
0,3,1,3
1,5,1,4
2,1,2,1
3,4,2,4
4,2,3,2


In [None]:
query_ordering2 = '''
SELECT *
FROM manages_df
ORDER BY crew_id ASC, equip_id DESC'''
manages_order_df = ps.sqldf(query_ordering2, locals())
manages_order_df

Unnamed: 0,id,crew_id,equip_id
0,5,1,4
1,3,1,3
2,4,2,4
3,1,2,1
4,2,3,2


Use DESC for descending order. ASC (ascending) is the default.

You can order by multiple columns. List them from highest priority to lowest: when two rows have equal values in one column, the next column in the list breaks the tie. E.g. ORDER BY name, id
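To see the tie-breaking rule on the crew data, here is a small sketch using the built-in `sqlite3` module with an in-memory table (the table and connection names are illustrative). Alex and Jen share rank 4, so the second ORDER BY column decides which of them comes first.

```python
import sqlite3

conn_order = sqlite3.connect(":memory:")
conn_order.execute("CREATE TABLE crew (id INTEGER, name TEXT, rank INTEGER)")
conn_order.executemany(
    "INSERT INTO crew VALUES (?, ?, ?)",
    [(1, "Jane", 10), (2, "Dan", 9), (3, "Alex", 4), (4, "Jen", 4), (5, "Brandon", 1)],
)

# Highest rank first; ties on rank are broken alphabetically by name.
by_rank_then_name = conn_order.execute(
    "SELECT name, rank FROM crew ORDER BY rank DESC, name ASC"
).fetchall()

# Same first sort column, but ties are broken in reverse alphabetical order.
by_rank_then_name_desc = conn_order.execute(
    "SELECT name, rank FROM crew ORDER BY rank DESC, name DESC"
).fetchall()

print(by_rank_then_name)
# [('Jane', 10), ('Dan', 9), ('Alex', 4), ('Jen', 4), ('Brandon', 1)]
print(by_rank_then_name_desc)
# [('Jane', 10), ('Dan', 9), ('Jen', 4), ('Alex', 4), ('Brandon', 1)]
```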

#### Distinct Values

You can retrieve a unique set of values only. For example, let’s retrieve a list of all ranks that are assigned to our crew members (without any duplicates).

In [None]:
query_allRanks = '''
SELECT rank
FROM crew_df'''
all_ranks_df = ps.sqldf(query_allRanks, locals())
all_ranks_df

Unnamed: 0,rank
0,10
1,9
2,4
3,4
4,1


In [None]:
query_distinctRanks = '''
SELECT DISTINCT rank
FROM crew_df'''
distinct_ranks_df = ps.sqldf(query_distinctRanks, locals())
distinct_ranks_df

Unnamed: 0,rank
0,10
1,9
2,4
3,1


#### Null Values

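Unless the schema says otherwise (e.g. with a NOT NULL constraint when creating the table), any column can hold NULL; primary key columns are the usual exception. To test for NULL, use IS NULL or IS NOT NULL rather than the = operator.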

In [None]:
query_null = '''
SELECT *
FROM crew_df
WHERE role_id IS NULL'''
null_row_df = ps.sqldf(query_null, locals())
null_row_df

Unnamed: 0,id,name,rank,role_id
0,5,Brandon,1,
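A comparison with NULL using `=` never evaluates to true, which is why the query uses IS NULL. The sketch below shows the difference with the built-in `sqlite3` module on an in-memory copy of the crew table (table and connection names are illustrative). It also shows that COUNT(column) skips NULL values while COUNT(*) counts every row.

```python
import sqlite3

conn_null = sqlite3.connect(":memory:")
conn_null.execute(
    "CREATE TABLE crew (id INTEGER, name TEXT, rank INTEGER, role_id INTEGER)"
)
conn_null.executemany(
    "INSERT INTO crew VALUES (?, ?, ?, ?)",
    [
        (1, "Jane", 10, 1),
        (2, "Dan", 9, 2),
        (3, "Alex", 4, 3),
        (4, "Jen", 4, 4),
        (5, "Brandon", 1, None),  # Python None is stored as SQL NULL
    ],
)

# "= NULL" matches nothing, because NULL = NULL is not true.
equals_null = conn_null.execute(
    "SELECT name FROM crew WHERE role_id = NULL"
).fetchall()

# IS NULL is the correct test.
is_null = conn_null.execute(
    "SELECT name FROM crew WHERE role_id IS NULL"
).fetchall()

# COUNT(*) counts rows; COUNT(role_id) counts non-NULL role_id values.
counts = conn_null.execute("SELECT COUNT(*), COUNT(role_id) FROM crew").fetchone()

print(equals_null)  # []
print(is_null)      # [('Brandon',)]
print(counts)       # (5, 4)
```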


### Relationships

Tables have relationships amongst themselves.

One to One: A record in a table is associated with one and only one record in another table. (Each crew member is assigned only one role.)

One to Many: A record in a table is associated with more than one record in another table. (A crew member can have multiple worklog entries.)

Many to Many: Multiple records in a table are associated with multiple records in another table. (Crew members can manage multiple pieces of equipment, and a piece of equipment can be managed by multiple crew members.)

<p align = "center">
<img src = "https://imgur.com/5kbMODk.png" width= "900" align ="center"/>





A primary key is a unique identifier for a row.

By storing the primary key of a row from another table, we can reference that row in the other (“foreign”) table. The column holding this reference is called a foreign key.
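A minimal sketch of how primary and foreign keys could be declared for the crew and roles tables, using the built-in `sqlite3` module. The schema is an assumption based on the columns shown earlier, and the crew member "Sam" and the duplicate row "Dup" are made-up rows used only to trigger errors. Note that SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn_fk = sqlite3.connect(":memory:")
conn_fk.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite

conn_fk.execute("CREATE TABLE roles (role_id INTEGER PRIMARY KEY, name TEXT)")
conn_fk.execute("""
    CREATE TABLE crew (
        id      INTEGER PRIMARY KEY,
        name    TEXT,
        rank    INTEGER,
        role_id INTEGER REFERENCES roles(role_id)  -- foreign key
    )
""")

conn_fk.execute("INSERT INTO roles VALUES (1, 'captain')")
conn_fk.execute("INSERT INTO crew VALUES (1, 'Jane', 10, 1)")       # role 1 exists
conn_fk.execute("INSERT INTO crew VALUES (5, 'Brandon', 1, NULL)")  # NULL reference is allowed

# Hypothetical row pointing at a role that does not exist.
try:
    conn_fk.execute("INSERT INTO crew VALUES (6, 'Sam', 2, 99)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True

# Hypothetical row reusing an existing primary key value.
try:
    conn_fk.execute("INSERT INTO crew VALUES (1, 'Dup', 3, 1)")
    pk_rejected = False
except sqlite3.IntegrityError:
    pk_rejected = True

print(fk_rejected, pk_rejected)  # True True
```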

<p align = "center">
<img src = "https://imgur.com/TzNsc8F.png" width= "900" align ="center"/>




#### Many to Many in Relational Model

To let each row in one table relate to many rows in the other table, and vice versa, we need a dedicated table for the relationship itself.

Let’s express the crew “manages” equipment relationship.

<p align = "center">
<img src = "https://imgur.com/oWoxPdH.png" width= "900" align ="center"/>


#### Querying with Relationships

We use the JOIN command to query with relationships.

Let us visualize our two tables again:

In [None]:
roles_df

Unnamed: 0,role_id,name
0,1,captain
1,2,scientist
2,3,engineer
3,4,engineer 2


In [None]:
crew_df

Unnamed: 0,id,name,rank,role_id
0,1,Jane,10,1.0
1,2,Dan,9,2.0
2,3,Alex,4,3.0
3,4,Jen,4,4.0
4,5,Brandon,1,


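Now, what if we want to fetch the roles of all crew members? Note that a plain JOIN returns only the crew members that have a matching role, so Brandon (whose role_id is NULL) does not appear in the result.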

In [None]:
query_crewmember_role = '''
SELECT crew_df.name, roles_df.name
FROM crew_df
JOIN roles_df
ON crew_df.role_id = roles_df.role_id'''
crewMember_role_df = ps.sqldf(query_crewmember_role, locals())

crewMember_role_df

Unnamed: 0,name,name.1
0,Jane,captain
1,Dan,scientist
2,Alex,engineer
3,Jen,engineer 2


#### Aliasing

Table and column names can get messy. We can use the AS keyword to alias column names and table names in our queries. For table names, AS is optional: `crew_df C` is the same as `crew_df AS C`.


We can alias the columns in the query result as follows.

In [None]:
query_alias = '''
SELECT C.name AS name, R.name AS role
FROM crew_df C
JOIN roles_df R
ON C.role_id = R.role_id
'''
crewMember_role_aliased = ps.sqldf(query_alias, locals())
crewMember_role_aliased


Unnamed: 0,name,role
0,Jane,captain
1,Dan,scientist
2,Alex,engineer
3,Jen,engineer 2


#### Multiple Joins

To query information from our many-to-many relationships, we can use multiple joins.

Find the equipment handled by each crew member:

In [None]:
query_manyTomany = '''
SELECT C.name AS name, E.name AS equipment
FROM crew_df C
JOIN manages_df M
ON C.id = M.crew_id
JOIN equipment_df E
ON M.equip_id = E.id
'''

# Crew name from crew_df (C)
# manages_df (M)
# Equipment name from equipment_df (E)

crewMember_equipment = ps.sqldf(query_manyTomany, locals())
crewMember_equipment

Unnamed: 0,name,equipment
0,Jane,Notebook
1,Jane,Chemical Z
2,Dan,Centrifuge
3,Dan,Chemical Z
4,Alex,Soldering Station



<img src = "https://i.stack.imgur.com/VQ5XP.png" width= "400" align ="center"/>



INNER JOIN: Returns records that have matching values in both tables

LEFT (OUTER) JOIN: Returns all records from the left table, and the matched records from the right table

RIGHT (OUTER) JOIN: Returns all records from the right table, and the matched records from the left table

FULL (OUTER) JOIN: Returns all records when there is a match in either left or right table
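The difference between INNER and LEFT JOIN shows up on the crew and roles tables because Brandon has no role. The sketch below uses the built-in `sqlite3` module with in-memory copies of the two tables (table and connection names are illustrative). RIGHT and FULL OUTER JOIN are left out because older SQLite versions do not support them.

```python
import sqlite3

conn_join = sqlite3.connect(":memory:")
conn_join.execute(
    "CREATE TABLE crew (id INTEGER, name TEXT, rank INTEGER, role_id INTEGER)"
)
conn_join.execute("CREATE TABLE roles (role_id INTEGER, name TEXT)")
conn_join.executemany(
    "INSERT INTO crew VALUES (?, ?, ?, ?)",
    [
        (1, "Jane", 10, 1),
        (2, "Dan", 9, 2),
        (3, "Alex", 4, 3),
        (4, "Jen", 4, 4),
        (5, "Brandon", 1, None),
    ],
)
conn_join.executemany(
    "INSERT INTO roles VALUES (?, ?)",
    [(1, "captain"), (2, "scientist"), (3, "engineer"), (4, "engineer 2")],
)

# INNER JOIN: only crew members whose role_id matches a row in roles.
inner_rows = conn_join.execute("""
    SELECT C.name, R.name
    FROM crew C
    INNER JOIN roles R ON C.role_id = R.role_id
    ORDER BY C.id
""").fetchall()

# LEFT JOIN: every crew member; unmatched role columns are filled with NULL.
left_rows = conn_join.execute("""
    SELECT C.name, R.name
    FROM crew C
    LEFT JOIN roles R ON C.role_id = R.role_id
    ORDER BY C.id
""").fetchall()

print(len(inner_rows))  # 4
print(left_rows[-1])    # ('Brandon', None)
```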

#### Aggregate Operation

Counting (COUNT), summing (SUM), averaging (AVG), minimum (MIN), maximum (MAX).


In [None]:
# No. of crew members
crewCount = ps.sqldf('''
SELECT COUNT(*) AS count
FROM crew_df
''', locals())
crewCount


Unnamed: 0,count
0,5


In [None]:
# Find the minimum rank among all crew members
minRank = ps.sqldf('''
SELECT MIN(rank)
AS MinRank FROM crew_df
''', locals())
minRank

Unnamed: 0,MinRank
0,1
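All five aggregate functions can be computed in a single query. The sketch below runs them over the crew ranks using the built-in `sqlite3` module and an in-memory table (table and connection names are illustrative).

```python
import sqlite3

conn_agg = sqlite3.connect(":memory:")
conn_agg.execute("CREATE TABLE crew (id INTEGER, name TEXT, rank INTEGER)")
conn_agg.executemany(
    "INSERT INTO crew VALUES (?, ?, ?)",
    [(1, "Jane", 10), (2, "Dan", 9), (3, "Alex", 4), (4, "Jen", 4), (5, "Brandon", 1)],
)

# Each aggregate collapses the whole table into one value.
n, total, average, lowest, highest = conn_agg.execute("""
    SELECT COUNT(*), SUM(rank), AVG(rank), MIN(rank), MAX(rank)
    FROM crew
""").fetchone()

print(n, total, average, lowest, highest)  # 5 28 5.6 1 10
```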


#### Grouping

Let’s see how many crew members we have for each rank value.

In [None]:
# rank
# No. of crew members with this rank

rank_count = ps.sqldf('''
SELECT rank, COUNT(name) AS count
FROM crew_df
GROUP BY rank
''', locals())

rank_count

Unnamed: 0,rank,count
0,1,1
1,4,2
2,9,1
3,10,1


#### Conditional Grouping

In [None]:
# HAVING
rank_count_5 = ps.sqldf('''
SELECT rank, COUNT(name) AS count
FROM crew_df
GROUP BY rank
HAVING rank > 5
''', locals())

rank_count_5

Unnamed: 0,rank,count
0,9,1
1,10,1
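HAVING is most useful when the condition involves an aggregate, because WHERE is applied before the groups exist and cannot see aggregate values. The sketch below contrasts the two with the built-in `sqlite3` module on an in-memory crew table (table and connection names are illustrative).

```python
import sqlite3

conn_having = sqlite3.connect(":memory:")
conn_having.execute("CREATE TABLE crew (id INTEGER, name TEXT, rank INTEGER)")
conn_having.executemany(
    "INSERT INTO crew VALUES (?, ?, ?)",
    [(1, "Jane", 10), (2, "Dan", 9), (3, "Alex", 4), (4, "Jen", 4), (5, "Brandon", 1)],
)

# Filter rows first (WHERE), then group what is left.
where_then_group = conn_having.execute("""
    SELECT rank, COUNT(name) FROM crew
    WHERE rank > 5
    GROUP BY rank
    ORDER BY rank
""").fetchall()

# Group first, then filter the groups on the grouping column (HAVING).
group_then_having = conn_having.execute("""
    SELECT rank, COUNT(name) FROM crew
    GROUP BY rank
    HAVING rank > 5
    ORDER BY rank
""").fetchall()

# A condition on an aggregate can only go in HAVING.
shared_ranks = conn_having.execute("""
    SELECT rank, COUNT(name) FROM crew
    GROUP BY rank
    HAVING COUNT(name) > 1
""").fetchall()

print(where_then_group)   # [(9, 1), (10, 1)]
print(group_then_having)  # [(9, 1), (10, 1)]
print(shared_ranks)       # [(4, 2)]
```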


Let's see how these clauses work together.

How many hours has each crew member worked in total? We want crew member name and number of hours.


In [None]:
crew_hours = ps.sqldf('''
SELECT C.name, SUM(W.hours) AS hours
FROM crew_df C
JOIN worklog_df W
ON C.id = W.crew_id
GROUP BY C.id
''', locals())

crew_hours

Unnamed: 0,name,hours
0,Jane,15
1,Dan,13
2,Alex,17
3,Jen,22


**Some other questions:**

Building on the crew_hours query, find the people who have worked more than 13 hours in total. Return each name and the number of hours worked, sorted by hours in descending order.

In [None]:
#TODO

other_1 = ps.sqldf('''
SELECT name, SUM(hours) AS sum_hours
FROM worklog_df
JOIN crew_df
ON crew_df.id = worklog_df.crew_id
GROUP BY crew_id
HAVING SUM(hours) > 13
ORDER BY SUM(hours) DESC
''', locals())

other_1


Unnamed: 0,name,sum_hours
0,Jen,22
1,Alex,17
2,Jane,15


Find the equipment handled by crew members, returning only those crew members who are ranked either 4 or 10 and whose name contains the letter e.

In [None]:
#solution
# TO BE DISCUSSED IN Recitation


other_2 = ps.sqldf('''
SELECT cname, equipment_df.name FROM equipment_df
JOIN (SELECT manages_df.equip_id AS eid, crew_df.name AS cname
      FROM crew_df
      JOIN manages_df
      ON crew_df.id = manages_df.crew_id
      WHERE crew_df.name LIKE '%e%'
      AND crew_df.rank IN (4, 10))
ON eid = equipment_df.id
''', locals())

other_2


Unnamed: 0,cname,name
0,Jane,Notebook
1,Jane,Chemical Z
2,Alex,Soldering Station


#### Deleting a Table

We’re not going to run this one, but here it is for reference:

In [None]:
# conn.execute(''' DROP TABLE roles ''')

#### Some Takeaways

Logical order in which the clauses of a query are evaluated:

FROM & JOINs determine the rows to work with

WHERE filters those rows

GROUP BY combines the remaining rows into groups

HAVING filters the groups

SELECT projects the columns (and aggregates) to return

ORDER BY sorts the remaining rows/groups

LIMIT caps how many of the remaining rows/groups are returned
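The sequence above can be traced on a single query that uses every clause. This is a sketch using the built-in `sqlite3` module on an in-memory copy of the crew table. The table name, connection name, and the particular conditions are chosen only for illustration.

```python
import sqlite3

conn_steps = sqlite3.connect(":memory:")
conn_steps.execute(
    "CREATE TABLE crew (id INTEGER, name TEXT, rank INTEGER, role_id INTEGER)"
)
conn_steps.executemany(
    "INSERT INTO crew VALUES (?, ?, ?, ?)",
    [
        (1, "Jane", 10, 1),
        (2, "Dan", 9, 2),
        (3, "Alex", 4, 3),
        (4, "Jen", 4, 4),
        (5, "Brandon", 1, None),
    ],
)

base_query = """
    SELECT rank, COUNT(*) AS members   -- 5. project rank and the group size
    FROM crew                          -- 1. start from all 5 rows
    WHERE role_id IS NOT NULL          -- 2. drop Brandon (no role)
    GROUP BY rank                      -- 3. groups: rank 10, rank 9, rank 4
    HAVING rank < 10                   -- 4. drop the rank 10 group
    ORDER BY members DESC, rank DESC   -- 6. largest group first
"""

all_groups = conn_steps.execute(base_query).fetchall()
top_group = conn_steps.execute(base_query + " LIMIT 1").fetchall()  # 7. keep one row

print(all_groups)  # [(4, 2), (9, 1)]
print(top_group)   # [(4, 2)]
```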