In [None]:
import pandas as pd
import sqlalchemy

# Movie Ratings

You've started a new movie-rating website, and you've been collecting data on reviewers' ratings of various movies. There's not much data yet, but you can still try out some interesting queries. Here's the schema: 

Movie ( mID, title, year, director ) 
English: There is a movie with ID number mID, a title, a release year, and a director. 

Reviewer ( rID, name ) 
English: The reviewer with ID number rID has a certain name. 

Rating ( rID, mID, stars, ratingDate ) 
English: The reviewer rID gave the movie mID a rating of 1 to 5 stars on a certain ratingDate. 

## Connect to the database

In [None]:
engine = sqlalchemy.create_engine(
    "postgresql+psycopg2://", 
    connect_args={"database": "postgres", "user": "sherlock", "host": "/var/run/postgresql"}
)
con = engine.connect()
con.execute("SET schema 'input'")

## Load the data in the db
To reset the database, simply rerun the cell below.

In [None]:
from sqlalchemy.sql import text
con.execute(open("movie-ratings.sql").read())
con.execute(open("social.sql").read())

In [None]:
q = """
SELECT * 
FROM Rating 
limit 3
"""
df = pd.read_sql(q, con)
df

Find the titles of all movies directed by Steven Spielberg. 


In [None]:
q = """
select title
from Movie
where director='Steven Spielberg';
"""
df = pd.read_sql(q, con)
df

Find all years that have a movie that received a rating of 4 or 5, and sort them in increasing order. 

In [None]:
q = """
select distinct year
from Movie, Rating
where Rating.mID = Movie.mID and stars >= 4
order by Year;
"""
df = pd.read_sql(q, con)
df

Find the titles of all movies that have no ratings. 


In [None]:
q = """
select title
from Movie
where mID not in (select mID 
                  from Rating)
"""
df = pd.read_sql(q, con)
df
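
One caveat with `NOT IN` is worth keeping in mind: if the subquery ever returns a NULL, the predicate evaluates to unknown for every outer row and the query returns nothing. Nothing in this notebook says that happens in the real `Rating` table, so the sketch below only illustrates the behavior. It uses an in-memory SQLite database as a stand-in for the PostgreSQL connection, and the rows are invented toy data. `NOT EXISTS` is shown as one possible alternative.

```python
import sqlite3

# Assumption: SQLite in memory stands in for the notebook's PostgreSQL database.
demo = sqlite3.connect(":memory:")
demo.executescript("""
create table Movie (mID integer, title text);
create table Rating (rID integer, mID integer, stars integer);
-- toy rows, not the real data set
insert into Movie values (1, 'Rated Movie'), (2, 'Unrated Movie');
-- one rating carries a NULL mID to trigger the pitfall
insert into Rating values (10, 1, 4), (11, NULL, 3);
""")

# NOT IN compares against a list that contains NULL
not_in_titles = [row[0] for row in demo.execute("""
select title from Movie
where mID not in (select mID from Rating)
""")]

# NOT EXISTS only asks whether a matching rating row exists
not_exists_titles = [row[0] for row in demo.execute("""
select title from Movie
where not exists (select 1 from Rating where Rating.mID = Movie.mID)
""")]

print(not_in_titles)      # empty list: the NULL hides the unrated movie
print(not_exists_titles)  # contains 'Unrated Movie'
```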

Some reviewers didn't provide a date with their rating. Find the names of all reviewers who have ratings with a NULL value for the date. 


In [None]:
q = """
select distinct name
from Reviewer, Rating
where Reviewer.rID = Rating.rID and ratingDate is NULL;
"""
df = pd.read_sql(q, con)
df

Write a query to return the ratings data in a more readable format: reviewer name, movie title, stars, and ratingDate. Also, sort the data, first by reviewer name, then by movie title, and lastly by number of stars. 
 

In [None]:
q = """
select name, title, stars, ratingDate
from Movie, Rating, Reviewer
where Movie.mID = Rating.mID and Reviewer.rID = Rating.rID
order by name, title, stars;
"""
df = pd.read_sql(q, con)
df

For all cases where the same reviewer rated the same movie twice and gave it a higher rating the second time, return the reviewer's name and the title of the movie. 

In [None]:
q = """
select name, title
from Movie, Reviewer, (select R1.rID, R1.mID
  from Rating R1, Rating R2
  where R1.rID = R2.rID 
  and R1.mID = R2.mID
  and R1.stars < R2.stars
  and R1.ratingDate < R2.ratingDate) C
where Movie.mID = C.mID
and Reviewer.rID = C.rID;
"""
df = pd.read_sql(q, con)
df

For each movie that has at least one rating, find the highest number of stars that movie received. Return the movie title and number of stars. Sort by movie title. 

In [None]:
q = """
select title, stars
from Movie, ( select Movie.mID, stars
              from Movie, Rating
              where Movie.mID = Rating.mID
              except
              select R1.mID, R1.stars
              from Rating R1, Rating R2
              where R1.mID = R2.mID
              and R1.stars < R2.stars) Stars
where Movie.mID = Stars.mID
order by title;
"""
df = pd.read_sql(q, con)
df
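
The same result can probably be reached more directly with `GROUP BY` and `max()`. The sketch below is an alternative formulation, not the approach used in the cell above. It runs on an in-memory SQLite database with invented toy rows; the movie titles and ratings are hypothetical.

```python
import sqlite3

# Assumption: SQLite in memory stands in for the notebook's PostgreSQL database.
demo = sqlite3.connect(":memory:")
demo.executescript("""
create table Movie (mID integer, title text);
create table Rating (rID integer, mID integer, stars integer);
-- toy rows, not the real data set; 'Gamma' has no ratings
insert into Movie values (1, 'Alpha'), (2, 'Beta'), (3, 'Gamma');
insert into Rating values (10, 1, 2), (11, 1, 4), (10, 2, 3), (11, 2, 3);
""")

# The inner join drops movies without ratings; max() picks the top rating per movie.
max_stars = demo.execute("""
select title, max(stars) as stars
from Movie join Rating on Movie.mID = Rating.mID
group by Movie.mID, title
order by title
""").fetchall()

print(max_stars)  # [('Alpha', 4), ('Beta', 3)]
```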

For each movie, return the title and the 'rating spread', that is, the difference between highest and lowest ratings given to that movie. Sort by rating spread from highest to lowest, then by movie title. 

In [None]:
q = """
select title, spread
from Movie, (
  select mID, max(stars) - min(stars) as spread
  from Rating
  group by mID
) RatingSpread
where Movie.mID = RatingSpread.mID
order by spread DESC, title;
"""
df = pd.read_sql(q, con)
df

Find the difference between the average rating of movies released before 1980 and the average rating of movies released after 1980. (Make sure to calculate the average rating for each movie, then the average of those averages for movies before 1980 and movies after. Don't just calculate the overall average rating before and after 1980.) 

In [None]:
q = """
select avg(before_80.group_avg) - avg(post_80.group_avg) as difference
from (
  select Rating.mID, avg(stars) as group_avg
  from Rating, Movie
  where Rating.mID = Movie.mID
  and year < 1980
  group by Rating.mID
) as before_80,
(
  select Rating.mID, avg(stars) as group_avg
  from Rating, Movie
  where Rating.mID = Movie.mID
  and year > 1980
  group by Rating.mID
) as post_80
"""
df = pd.read_sql(q, con)
df
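
The reason for averaging per movie first is that a movie with many ratings would otherwise dominate the result. The sketch below shows the two numbers diverging on invented toy data: one movie with three ratings and one with a single rating. It uses an in-memory SQLite database as a stand-in for the PostgreSQL connection.

```python
import sqlite3

# Assumption: SQLite in memory stands in for the notebook's PostgreSQL database.
demo = sqlite3.connect(":memory:")
demo.executescript("""
create table Rating (rID integer, mID integer, stars integer);
-- toy rows: movie 1 has three 5-star ratings, movie 2 has one 1-star rating
insert into Rating values (10, 1, 5), (11, 1, 5), (12, 1, 5), (10, 2, 1);
""")

# Overall average: every rating counts equally.
overall_avg = demo.execute("select avg(stars) from Rating").fetchone()[0]

# Average of per-movie averages: every movie counts equally.
avg_of_avgs = demo.execute("""
select avg(movie_avg)
from (
  select avg(stars) as movie_avg
  from Rating
  group by mID
) as PerMovie
""").fetchone()[0]

print(overall_avg)  # (5 + 5 + 5 + 1) / 4 = 4.0
print(avg_of_avgs)  # (5.0 + 1.0) / 2 = 3.0
```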

# The Social Network

Students at your hometown high school have decided to organize their social network using databases. So far, they have collected information about sixteen students in four grades, 9-12. Here's the schema: 

Highschooler ( ID, name, grade ) 
English: There is a high school student with unique ID and a given first name in a certain grade. 

Friend ( ID1, ID2 ) 
English: The student with ID1 is friends with the student with ID2. Friendship is mutual, so if (123, 456) is in the Friend table, so is (456, 123). 

Likes ( ID1, ID2 ) 
English: The student with ID1 likes the student with ID2. Liking someone is not necessarily mutual, so if (123, 456) is in the Likes table, there is no guarantee that (456, 123) is also present. 

<img src="social.png">

In [None]:
df = pd.read_sql("SELECT * from Highschooler limit 3", con)
df
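
Several queries below rely on the Friend relation being symmetric, as the schema description states. A quick sanity check could look like the following sketch, which lists pairs whose reverse is missing. It runs on an in-memory SQLite database with invented toy rows, one of which deliberately breaks the symmetry.

```python
import sqlite3

# Assumption: SQLite in memory stands in for the notebook's PostgreSQL database.
demo = sqlite3.connect(":memory:")
demo.executescript("""
create table Friend (ID1 integer, ID2 integer);
-- toy rows: (1, 2) is mutual, (3, 4) lacks its reverse on purpose
insert into Friend values (1, 2), (2, 1), (3, 4);
""")

# Pairs (ID1, ID2) for which no (ID2, ID1) row exists.
one_way_pairs = demo.execute("""
select F.ID1, F.ID2
from Friend F
where not exists (
  select 1 from Friend R
  where R.ID1 = F.ID2 and R.ID2 = F.ID1
)
""").fetchall()

print(one_way_pairs)  # [(3, 4)]
```

An empty result would indicate that every friendship is stored in both directions.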

Find the names of all students who are friends with someone named Gabriel. 


In [None]:
q = """
select H2.name
from Highschooler H1, Highschooler H2, Friend
where H1.ID = Friend.ID1
and H1.name = 'Gabriel'
and H2.ID = Friend.ID2;
"""
df = pd.read_sql(q, con)
df

For every student who likes someone 2 or more grades younger than themselves, return that student's name and grade, and the name and grade of the student they like. 

In [None]:
q = """
select H1.name, H1.grade, H2.name, H2.grade
from Likes, Highschooler H1, Highschooler H2
where Likes.ID1 = H1.ID
and Likes.ID2 = H2.ID
and H1.grade >= (H2.grade + 2)
"""
df = pd.read_sql(q, con)
df

For every pair of students who both like each other, return the name and grade of both students. Include each pair only once, with the two names in alphabetical order. 

In [None]:
q = """
select H1.name, H1.grade, H2.name, H2.grade
from Highschooler H1, Highschooler H2, (
  select L1.ID1, L1.ID2
  from Likes L1, Likes L2
  where L1.ID2 = L2.ID1
  and L1.ID1 = L2.ID2
) as Pair
where H1.ID = Pair.ID1
and H2.ID = Pair.ID2
and H1.name < H2.name
"""
df = pd.read_sql(q, con)
df

Find all students who do not appear in the Likes table (as a student who likes or is liked) and return their names and grades. Sort by grade, then by name within each grade. 

In [None]:
q = """
select name, grade
from Highschooler
where ID not in (
  select ID1 from Likes
  union
  select ID2 from Likes
)
order by grade, name
"""
df = pd.read_sql(q, con)
df

For every situation where student A likes student B, but we have no information about whom B likes (that is, B does not appear as an ID1 in the Likes table), return A and B's names and grades. 

In [None]:
q = """
select H1.name, H1.grade, H2.name, H2.grade
from Likes, Highschooler H1, Highschooler H2
where Likes.ID1 = H1.ID
and Likes.ID2 = H2.ID
-- B never appears as the student who likes someone
and Likes.ID2 not in (select ID1 from Likes)
"""
df = pd.read_sql(q, con)
df

Find names and grades of students who only have friends in the same grade. Return the result sorted by grade, then by name within each grade. 

In [None]:
q = """
select name, grade
from Highschooler, (
  select ID1 from Friend
  except
  -- students who have at least one friend in a different grade
  select distinct Friend.ID1
  from Friend, Highschooler H1, Highschooler H2
  where Friend.ID1 = H1.ID and Friend.ID2 = H2.ID
  and H1.grade != H2.grade
) as Sample
where Highschooler.ID = Sample.ID1
order by grade, name
"""
df = pd.read_sql(q, con)
df

For each student A who likes a student B where the two are not friends, find if they have a friend C in common (who can introduce them!). For all such trios, return the name and grade of A, B, and C. 

In [None]:
q = """
select H1.name, H1.grade, H2.name, H2.grade, H3.name, H3.grade
from Highschooler H1, Highschooler H2, Highschooler H3, Friend F1, Friend F2, (
  select * from Likes
  except
  -- A likes B and A/B are friends
  select Likes.ID1, Likes.ID2
  from Likes, Friend
  where Friend.ID1 = Likes.ID1 and Friend.ID2 = Likes.ID2
) as LikeNotFriend
where F1.ID1 = LikeNotFriend.ID1
and F2.ID1 = LikeNotFriend.ID2
-- has a shared friend
and F1.ID2 = F2.ID2
and H1.ID = LikeNotFriend.ID1
and H2.ID = LikeNotFriend.ID2
and H3.ID = F2.ID2
"""
df = pd.read_sql(q, con)
df

Find the difference between the number of students in the school and the number of different first names. 


In [None]:
q = """
select count(ID) - count(distinct name) as difference
from Highschooler
"""
df = pd.read_sql(q, con)
df

Find the name and grade of all students who are liked by more than one other student. 


In [None]:
q = """
select name, grade
from Highschooler, (
  select count(ID1) as count, ID2
  from Likes
  group by ID2
) as LikeCount
where Highschooler.ID = LikeCount.ID2
and count > 1
"""
df = pd.read_sql(q, con)
df

## Modification

Add the reviewer Roger Ebert to your database, with an rID of 209. 


In [None]:
q = "insert into Reviewer(rID, name) values (209, 'Roger Ebert')"
con.execute(q)

Insert 5-star ratings by James Cameron for all movies in the database. Leave the review date as NULL. 


In [None]:
q = """
insert into Rating
  -- pair James Cameron's rID with every movie exactly once
  select Reviewer.rID, Movie.mID, 5 as stars, null as ratingDate
  from Movie, Reviewer
  where Reviewer.name = 'James Cameron';
"""
con.execute(q)

For all movies that have an average rating of 4 stars or higher, add 25 to the release year. (Update the existing tuples; don't insert new tuples.) 

In [None]:
q = """
update Movie
set year = year + 25
where mID in (
  select Movie.mId
  from Movie, Rating
  where Movie.mID = Rating.mID
  group by Movie.mID
  having avg(stars) >= 4
)
"""
con.execute(q)

Remove all ratings where the movie's year is before 1970 or after 2000, and the rating is fewer than 4 stars. 

In [None]:
q = """
delete from Rating
where mID in (
  select distinct Rating.mID
  from Movie, Rating
  where Movie.mID = Rating.mID
  and (Movie.year > 2000 or Movie.year < 1970)
)
and stars < 4
"""
con.execute(q)

It's time for the seniors to graduate. Remove all 12th graders from Highschooler. 


In [None]:
q = """
delete from Highschooler
where grade = 12
"""
con.execute(q)

If two students A and B are friends, and A likes B but not vice-versa, remove the Likes tuple. 


In [None]:
q = """
delete from Likes l
where (l.ID1, l.ID2) in (
  -- A likes B and A/B are friends
  select L1.ID1, L1.ID2
  from Friend, Likes L1
  where Friend.ID1 = L1.ID1
  and Friend.ID2 = L1.ID2
  except
  -- A likes B and B likes A
  select L1.ID1, L1.ID2
  from Likes L1, Likes L2
  where L1.ID1 = L2.ID2
  and L1.ID2 = L2.ID1
)
"""
con.execute(q)
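
The key point of this exercise is that only the specific (A, B) tuple should go, not every Likes row that A owns. The sketch below checks that on invented toy data, using an in-memory SQLite database as a stand-in for PostgreSQL and a correlated `EXISTS` / `NOT EXISTS` formulation as one possible alternative. Student 1 likes both 2 (a friend who does not like back) and 5 (not a friend), so only the first row should disappear.

```python
import sqlite3

# Assumption: SQLite in memory stands in for the notebook's PostgreSQL database.
demo = sqlite3.connect(":memory:")
demo.executescript("""
create table Friend (ID1 integer, ID2 integer);
create table Likes (ID1 integer, ID2 integer);
-- toy rows, not the real data set
insert into Friend values (1, 2), (2, 1), (3, 4), (4, 3);
insert into Likes values (1, 2), (1, 5), (3, 4), (4, 3);
""")

# Delete a Likes row only if the pair are friends and the liking is not returned.
demo.execute("""
delete from Likes
where exists (
  select 1 from Friend
  where Friend.ID1 = Likes.ID1 and Friend.ID2 = Likes.ID2
)
and not exists (
  select 1 from Likes Back
  where Back.ID1 = Likes.ID2 and Back.ID2 = Likes.ID1
)
""")

remaining_likes = demo.execute(
    "select ID1, ID2 from Likes order by ID1, ID2"
).fetchall()
print(remaining_likes)  # [(1, 5), (3, 4), (4, 3)]
```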

For all cases where A is friends with B, and B is friends with C, add a new friendship for the pair A and C. Do not add duplicate friendships, friendships that already exist, or friendships with oneself. (This one is a bit challenging; congratulations if you get it right.) 

In [None]:
q = """
insert into Friend
  select F1.ID1, F2.ID2
  from Friend F1, Friend F2
  where F1.ID2 = F2.ID1
  -- exclude friendships with oneself
  and F1.ID1 != F2.ID2
  -- exclude friendships that already exist (except also removes duplicates)
  except 
  select * from Friend
"""
con.execute(q)
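
To see how the three conditions interact, the same statement can be tried on a tiny chain of friends. The sketch below uses an in-memory SQLite database as a stand-in for PostgreSQL and invented toy rows: 1 and 2 are friends, and 2 and 3 are friends, so the only new friendship should be between 1 and 3, stored in both directions.

```python
import sqlite3

# Assumption: SQLite in memory stands in for the notebook's PostgreSQL database.
demo = sqlite3.connect(":memory:")
demo.executescript("""
create table Friend (ID1 integer, ID2 integer);
-- toy rows: a chain 1 - 2 - 3, stored symmetrically
insert into Friend values (1, 2), (2, 1), (2, 3), (3, 2);
""")

demo.execute("""
insert into Friend
  select F1.ID1, F2.ID2
  from Friend F1, Friend F2
  where F1.ID2 = F2.ID1
  -- exclude friendships with oneself
  and F1.ID1 != F2.ID2
  -- exclude existing friendships; except also removes duplicates
  except
  select ID1, ID2 from Friend
""")

friend_rows = demo.execute(
    "select ID1, ID2 from Friend order by ID1, ID2"
).fetchall()
print(friend_rows)
# [(1, 2), (1, 3), (2, 1), (2, 3), (3, 1), (3, 2)]
```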