<a href="https://colab.research.google.com/github/brendanpshea/database_sql/blob/main/SQL_02_Groups%2C_Subqueries%2C_Sets%2C_and_More!.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#More SQL: Groups, Subqueries, Sets, and More!
##Database and SQL | Course Notes / Brendan Shea, PhD (Brendan.Shea@rctc.edu)
In this lecture, we'll be continuing our introduction to Structured Query Language (SQL), this time using a PostgreSQL (or "Postgres") database containing information about Movies, Actors, Directors, and the Oscars they have won.


Postgres requires a bit more work to set up (and it isn't built into Python, in the way SQLite is). However, it is an enterprise-scale DBMS that has capabilities beyond those of SQLite. In particular, Postgres is a client-server database that can deal with multiple users (spread across a network) simultaneously *writing* to a database. It is representative of the type of RDBMS that most large organizations (Mayo, IBM, etc.) use to store and access data.

The main topics we'll be covering in this lecture include:

1. How GROUP BY and HAVING can be used to filter results, and how this differs from WHERE.
2. How subqueries can be used to write "queries within queries."
3. Some of the ANSI and non-ANSI-standard "functions" that are included with enterprise RDBMS software.

With that in mind, let's get started!

To begin with, we'll need to download our database, load PostgreSQL, and connect to the database. Then, we'll display the schemas for our various database tables.

Postgres 10 documentation lives here:
https://www.postgresql.org/docs/10/index.html

In [3]:
# Some UNIX and Python utilities we need to install for the lab.
!pip install wget --quiet
!pip install sqlalchemy --quiet
!pip install ipython-sql --quiet
!pip install pgspecial --quiet

# Install postgresql server
!sudo apt-get -y -qq update
!sudo apt-get -y -qq install postgresql
!sudo service postgresql start

# Setup a password `postgres` for username `postgres`
!sudo -u postgres psql -U postgres -c "ALTER USER postgres PASSWORD 'postgres';"

# Setup a postgres database with name `my_data` to be used
!sudo -u postgres psql -U postgres -c 'DROP DATABASE IF EXISTS my_data;'
!sudo -u postgres psql -U postgres -c 'CREATE DATABASE my_data;'

# Postgres variables
%env DB_NAME=my_data
%env DB_HOST=localhost
%env DB_PORT=5432
%env DB_USER=postgres
%env DB_PASS=postgres
;

 * Starting PostgreSQL 10 database server
   ...done.
ALTER ROLE
ERROR:  database "my_data" is being accessed by other users
DETAIL:  There is 1 other session using the database.
ERROR:  database "my_data" already exists
env: DB_NAME=my_data
env: DB_HOST=localhost
env: DB_PORT=5432
env: DB_USER=postgres
env: DB_PASS=postgres


''

In [4]:
# Now let's download the file we'll be using for this lab
!wget -N 'https://raw.githubusercontent.com/brendanpshea/database_sql/main/movie_dump.sql' -q

# Load our file and connect to the database
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME -f movie_dump.sql

# Finally, let's make a connection with the database
%load_ext sql
%sql postgresql://$DB_USER:$DB_PASS@$DB_HOST/$DB_NAME

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


'Connected: postgres@my_data'

In [5]:
# Now let's display the table schema
%%sql 
SELECT * FROM pg_catalog.pg_tables WHERE schemaname != 'pg_catalog' AND schemaname != 'information_schema';

 * postgresql://postgres:***@localhost/my_data
5 rows affected.


schemaname,tablename,tableowner,tablespace,hasindexes,hasrules,hastriggers,rowsecurity
public,actor,postgres,,True,False,True,False
public,director,postgres,,True,False,True,False
public,movie,postgres,,True,False,True,False
public,oscar,postgres,,True,False,True,False
public,person,postgres,,True,False,True,False


In [6]:
%%sql 
\dt

 * postgresql://postgres:***@localhost/my_data
5 rows affected.


Schema,Name,Type,Owner
public,actor,table,postgres
public,director,table,postgres
public,movie,table,postgres
public,oscar,table,postgres
public,person,table,postgres


In [7]:
# Show the first 5 rows of each table
movie_df = %sql SELECT * FROM Movie LIMIT 5;
person_df = %sql SELECT * FROM Person LIMIT 5;
actor_df = %sql SELECT * FROM Actor LIMIT 5;
director_df = %sql SELECT * FROM Director LIMIT 5;
oscar_df = %sql SELECT * FROM Oscar LIMIT 5;
print('\nMovie\n', movie_df,'\nPerson\n',person_df, '\nActor\n', actor_df, 
      '\nDirector\n', director_df, '\nOscar\n', oscar_df)

 * postgresql://postgres:***@localhost/my_data
5 rows affected.
 * postgresql://postgres:***@localhost/my_data
5 rows affected.
 * postgresql://postgres:***@localhost/my_data
5 rows affected.
 * postgresql://postgres:***@localhost/my_data
5 rows affected.
 * postgresql://postgres:***@localhost/my_data
5 rows affected.

Movie
 +---------+------------------------------+------+--------+---------+-------+---------------+
|    id   |             name             | year | rating | runtime | genre | earnings_rank |
+---------+------------------------------+------+--------+---------+-------+---------------+
| 2488496 | Star Wars: The Force Awakens | 2015 | PG-13  |   138   |   A   |       1       |
| 4154796 |      Avengers: Endgame       | 2019 | PG-13  |   181   |  AVS  |       2       |
| 0499549 |            Avatar            | 2009 | PG-13  |   162   |  AVYS |       3       |
| 1825683 |        Black Panther         | 2018 | PG-13  |   134   |  AVS  |       4       |
| 4154756 |    Avenge

#1. GROUP BY and Group-Level Statistics
In the last lesson, we talked about the use of aggregate functions such as COUNT, MAX, MIN, AVG, and SUM. So far, though, we've only been applying these to the whole of our query results (that is, we end up counting ALL of the rows that our query returns). In many real-world situations, though, we actually want to compute these statistics for different GROUPS of data. That's where GROUP BY comes in.



```
SELECT columns FROM tables
[WHERE conditions]
[GROUP BY columns]
[ORDER BY columns]
```
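
To make the difference between a whole-table aggregate and a grouped aggregate concrete, here is a minimal sketch. It assumes a toy in-memory SQLite database (through Python's built-in `sqlite3` module) instead of the Postgres movie database, and the five rows are made up for illustration.

```python
import sqlite3

# Hypothetical toy data; the real Movie table has more columns and rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movie (name TEXT, rating TEXT, runtime INTEGER)")
conn.executemany(
    "INSERT INTO movie VALUES (?, ?, ?)",
    [("A", "PG", 100), ("B", "PG", 120),
     ("C", "R", 130), ("D", "R", 150), ("E", "R", 110)],
)

# Without GROUP BY: one average over every row the query returns.
overall = conn.execute("SELECT AVG(runtime) FROM movie").fetchall()

# With GROUP BY: one average per distinct rating.
per_rating = conn.execute(
    "SELECT rating, AVG(runtime) FROM movie GROUP BY rating ORDER BY rating"
).fetchall()

print(overall)     # a single row
print(per_rating)  # one row per rating
```

The grouped query returns one row per rating, while the ungrouped query collapses the whole table into a single row.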

We'll look at a few examples.


In [8]:
# Let's list average run time of movies broken down by movie rating
%%sql
SELECT rating, ROUND(AVG(runtime),2) FROM Movie
  GROUP BY rating;

 * postgresql://postgres:***@localhost/my_data
8 rows affected.


rating,round
PG,112.19
,115.64
PG-13,127.45
R,126.52
M,124.0
G,116.41
GP,187.0
NC-17,114.5


In [9]:
# Or we could get both a COUNT and an AVERAGE runtime
%%sql
SELECT rating, COUNT(rating), AVG(runtime) FROM Movie
  GROUP BY rating;

 * postgresql://postgres:***@localhost/my_data
8 rows affected.


rating,count,avg
PG,143,112.18881118881119
,0,115.63503649635037
PG-13,234,127.45299145299144
R,166,126.52409638554217
M,5,124.0
G,39,116.4102564102564
GP,1,187.0
NC-17,2,114.5


In [10]:
# Now for a trickier one
# Let's find the list of the five most prolific actors

%%sql 
SELECT P.name, COUNT(A.actor_id) as "# of Movies" FROM Actor A 
  JOIN Person P ON P.id = A.actor_id
  GROUP BY A.actor_id, P.name
  ORDER BY COUNT(A.actor_id) DESC
  LIMIT 5;

 * postgresql://postgres:***@localhost/my_data
5 rows affected.


name,# of Movies
Tom Cruise,15
Tom Hanks,12
Will Smith,12
Robert Downey Jr.,11
Harrison Ford,11


#2. Putting limitations on groups with HAVING
Along with producing group-level statistics, we often want to restrict our query results to include only certain groups. So, for example, we might want to know "Which actors have appeared in at least 10 movies?" This requires both a GROUP BY (to collect our data about each actor) and then excluding those "groups" (i.e., actors) that have appeared in fewer than 10 movies.

This is where HAVING, which allows us to restrict the results of GROUP BY, comes in:


```
SELECT columns FROM tables
[WHERE conditions]
[GROUP BY columns]
[HAVING conditions]
[ORDER BY columns]
```
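
The key difference from WHERE is timing: WHERE filters individual rows before they are grouped, while HAVING filters whole groups after the aggregates are computed. Here is a minimal sketch of that difference, assuming a toy in-memory SQLite table with made-up rows (not the course database).

```python
import sqlite3

# Hypothetical toy data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movie (name TEXT, rating TEXT, runtime INTEGER)")
conn.executemany(
    "INSERT INTO movie VALUES (?, ?, ?)",
    [("A", "PG", 90), ("B", "PG", 130), ("C", "PG", 140),
     ("D", "R", 125), ("E", "R", 95), ("F", "G", 150)],
)

# WHERE: drop short movies first, then count what is left in each group.
where_rows = conn.execute(
    "SELECT rating, COUNT(*) FROM movie WHERE runtime >= 120 "
    "GROUP BY rating ORDER BY rating"
).fetchall()

# HAVING: count every movie per group, then keep groups with 2+ movies.
having_rows = conn.execute(
    "SELECT rating, COUNT(*) FROM movie "
    "GROUP BY rating HAVING COUNT(*) >= 2 ORDER BY rating"
).fetchall()

# Both together: filter rows, group, then filter groups.
both_rows = conn.execute(
    "SELECT rating, COUNT(*) FROM movie WHERE runtime >= 120 "
    "GROUP BY rating HAVING COUNT(*) >= 2 ORDER BY rating"
).fetchall()

print(where_rows, having_rows, both_rows)
```

A condition on an aggregate (such as `COUNT(*) >= 2`) has to go in HAVING, because the aggregate does not exist yet when WHERE is evaluated.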



In [11]:
# Let's find the actors that have appeared in at least 10 movies.
%%sql 
SELECT P.name, COUNT(A.actor_id) as "# of Movies" FROM Actor A 
  JOIN Person P ON P.id = A.actor_id
  GROUP BY P.name, A.actor_id
  HAVING COUNT(A.actor_id) >= 10
  ORDER BY COUNT(A.actor_id) DESC

 * postgresql://postgres:***@localhost/my_data
7 rows affected.


name,# of Movies
Tom Cruise,15
Will Smith,12
Tom Hanks,12
Robert Downey Jr.,11
Harrison Ford,11
Robert De Niro,10
Scarlett Johansson,10


In [12]:
# Or, let's find the list of movies that
# have won exactly 4 Oscars
# Note: Only certain types of Oscars are in our database!
%%sql 
SELECT M.name, M.year, COUNT(O.type) FROM Movie M
  JOIN Oscar O ON M.id = O.movie_id
  GROUP BY M.name, M.year
  HAVING COUNT(O.type) = 4;
  


 * postgresql://postgres:***@localhost/my_data
14 rows affected.


name,year,count
From Here to Eternity,1953,4
Terms of Endearment,1983,4
Ben-Hur,1959,4
Million Dollar Baby,2004,4
It Happened One Night,1934,4
West Side Story,1961,4
One Flew Over the Cuckoo's Nest,1975,4
Going My Way,1944,4
On the Waterfront,1954,4
Gone with the Wind,1939,4


#3. Simple Subqueries using WHERE and IN
Subqueries allow us to write "queries within queries." They are used when we need to work with the data in multiple "steps."

A common example: we want to find the MAXIMUM value for some column (our "inner query"), and then find some information about the entity (or entities) that actually has that MAX value. This query has the form:


```
-- Find the entity with the highest value of c1

SELECT c1, c2, c3, ... FROM tables
WHERE c1 IN
  (SELECT MAX(c1) FROM tables);
```

There are many, many types of subqueries, and it is almost always *possible* to use them. However, as a rule of thumb, one should only use subqueries if one genuinely needs them. They can be slow to run and difficult for others to maintain and understand.
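
One case where the subquery form earns its keep is ties. The sketch below (a toy in-memory SQLite table with made-up rows, not the course database) compares the MAX subquery pattern with the tempting shortcut of `ORDER BY ... LIMIT 1`.

```python
import sqlite3

# Hypothetical toy data with two movies tied for the longest runtime.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movie (name TEXT, runtime INTEGER)")
conn.executemany(
    "INSERT INTO movie VALUES (?, ?)",
    [("Short", 90), ("Epic One", 200), ("Epic Two", 200), ("Medium", 120)],
)

# Subquery pattern: the inner query finds the maximum, the outer query
# returns every row that has it.
longest = conn.execute(
    "SELECT name, runtime FROM movie "
    "WHERE runtime = (SELECT MAX(runtime) FROM movie) ORDER BY name"
).fetchall()

# Shortcut: sort and keep one row. This silently drops the tied movie.
top_one = conn.execute(
    "SELECT name, runtime FROM movie ORDER BY runtime DESC, name LIMIT 1"
).fetchall()

print(longest)
print(top_one)
```

The subquery version returns both tied movies, which is why the template speaks of "the entity (or entities)."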

In [13]:
# Example: Let's find the movie (or movies) with the longest run time

%%sql 
SELECT name, runtime FROM Movie
  WHERE runtime IN 
    (SELECT MAX(runtime) FROM Movie)


 * postgresql://postgres:***@localhost/my_data
1 rows affected.


name,runtime
Gone with the Wind,238


In [14]:
# Or, we could just find the list of movies
# that have a run time of 150% or more of the average
%%sql 
SELECT name, runtime FROM Movie
  WHERE runtime > 1.5 * 
    (SELECT AVG(runtime) FROM Movie)



 * postgresql://postgres:***@localhost/my_data
19 rows affected.


name,runtime
Titanic,194
Avengers: Age of Ultron,195
"Lord of the Rings: The Return of the King, The",201
Batman v Superman: Dawn of Justice,183
King Kong,187
Gone with the Wind,238
Pearl Harbor,183
"Green Mile, The",188
Schindler's List,197
Gandhi,188


#4. Subqueries in the HAVING clause
We can also use subqueries in the HAVING clause, where we filter groups by the result of some other query. So, for example, we could find the number of movies acted in by actors who appeared in a Star Wars film. This query involves:

1. Inner Query -- Getting a full list of the actors (or really, the actor ids) of people who've appeared in films whose title contains "Star Wars."
2. Outer Query -- Producing a count of how many films total (not just Star Wars films) each of these people has appeared in.

These sorts of queries can appear really messy! My advice is to write and test the inner query (where you find the list of actor_ids of people who appeared in Star Wars films) first, and only then put it together with the outer query.
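
The "inner query first" workflow can be sketched as follows. This assumes a toy in-memory SQLite database whose table and column names mirror the ones used in this lesson; the people and films are made up.

```python
import sqlite3

# Hypothetical toy data mirroring the Person / Movie / Actor tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (id INTEGER, name TEXT);
CREATE TABLE movie (id INTEGER, name TEXT);
CREATE TABLE actor (actor_id INTEGER, movie_id INTEGER);
INSERT INTO person VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
INSERT INTO movie VALUES (10, 'Star Wars: A'), (11, 'Other Film'),
                         (12, 'Star Wars: B'), (13, 'Third Film');
INSERT INTO actor VALUES (1, 10), (1, 11), (1, 13), (2, 11), (3, 12), (3, 10);
""")

# Step 1: write and test the inner query on its own.
inner_sql = """
SELECT A1.actor_id FROM movie M1
  JOIN actor A1 ON A1.movie_id = M1.id
  WHERE M1.name LIKE '%Star Wars%'
"""
inner_ids = sorted({row[0] for row in conn.execute(inner_sql)})
print(inner_ids)

# Step 2: once the inner query looks right, drop it into the outer query.
outer_sql = f"""
SELECT P.name, COUNT(A.actor_id) FROM actor A
  JOIN person P ON P.id = A.actor_id
  GROUP BY A.actor_id, P.name
  HAVING A.actor_id IN ({inner_sql})
  ORDER BY P.name
"""
totals = conn.execute(outer_sql).fetchall()
print(totals)
```

Bob never appears in a Star Wars film here, so his group is filtered out; Ann and Cy keep their full movie counts, not just their Star Wars counts.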




In [15]:
%%sql 
SELECT P.name, COUNT(A.actor_id) as "# of Movies" FROM Actor A 
  JOIN Person P ON P.id = A.actor_id
  GROUP BY A.actor_id, P.name
  HAVING A.actor_id IN 
    (SELECT A1.actor_id FROM Movie M1 
      JOIN Actor A1 ON A1.movie_id=M1.id
      WHERE M1.name LIKE '%Star Wars%');

 * postgresql://postgres:***@localhost/my_data
29 rows affected.


name,# of Movies
Woody Harrelson,4
Natalie Portman,6
John Boyega,2
Christopher Lee,1
Joonas Suotamo,1
Emilia Clarke,1
Billy Dee Williams,2
Jake Lloyd,1
Adam Driver,4
Alec Guinness,4


#5. Subqueries Using ALL or ANY
Sometimes we want to check whether a value is bigger (or smaller) than the values produced by a different query (i.e., the subquery). This is, in many ways, similar to comparing against the MAX or MIN of the subquery: `x > ALL(...)` behaves like `x > MAX(...)` and `x > ANY(...)` behaves like `x > MIN(...)`, with the roles of MAX and MIN swapped for `<` (as long as the subquery returns at least one row and no NULLs). However, ANY and ALL provide a more straightforward way of writing such comparisons.

So, for example, let's say I'd like to answer questions such as the following:

*  "Which movies have a higher earnings rank (that is, a smaller rank number) than *all* of the Spider-Man movies?"
*  "How many movies have a smaller earnings rank number than ANY (that is, at least one) of the Spider-Man movies in the database?"
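
Before running these in Postgres, it can help to see the ALL/ANY logic spelled out. SQLite (used here through Python's built-in `sqlite3` module) does not support the ALL and ANY subquery operators, so this sketch mirrors them with Python's `all()` and `any()` and then checks the MIN/MAX equivalents in SQL. The table contents are made up.

```python
import sqlite3

# Hypothetical toy data; rank 1 is the top earner, NULL means unranked.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movie (name TEXT, earnings_rank INTEGER)")
conn.executemany(
    "INSERT INTO movie VALUES (?, ?)",
    [("Top Hit", 1), ("Spider-Man", 5), ("Middle", 7),
     ("Spider-Man 2", 10), ("Low", 20), ("Unranked", None)],
)

spidey = [r[0] for r in conn.execute(
    "SELECT earnings_rank FROM movie "
    "WHERE name LIKE '%Spider-Man%' AND earnings_rank IS NOT NULL")]
rows = conn.execute("SELECT name, earnings_rank FROM movie").fetchall()

# rank < ALL(...): smaller than every Spider-Man rank.
less_than_all = sorted(
    n for n, r in rows if r is not None and all(r < s for s in spidey))
# rank < ANY(...): smaller than at least one Spider-Man rank.
less_than_any = sorted(
    n for n, r in rows if r is not None and any(r < s for s in spidey))

# The same answers in SQL: "< ALL" becomes "< MIN", "< ANY" becomes "< MAX".
sub = ("SELECT {f}(earnings_rank) FROM movie "
       "WHERE name LIKE '%Spider-Man%'")
via_min = [r[0] for r in conn.execute(
    f"SELECT name FROM movie WHERE earnings_rank < ({sub.format(f='MIN')}) "
    "ORDER BY name")]
via_max = [r[0] for r in conn.execute(
    f"SELECT name FROM movie WHERE earnings_rank < ({sub.format(f='MAX')}) "
    "ORDER BY name")]

print(less_than_all, less_than_any)
```

Note that one Spider-Man movie can itself satisfy the ANY comparison, since its rank number is smaller than that of the other Spider-Man movie.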



In [16]:
# Which movies have a higher earnings rank than *all* of the Spider-Man movies?
%%sql
SELECT name, earnings_rank FROM Movie
  WHERE earnings_rank < 
    ALL(SELECT earnings_rank FROM Movie 
        WHERE name LIKE '%Spider-Man%'
        AND earnings_rank IS NOT NULL);

 * postgresql://postgres:***@localhost/my_data
34 rows affected.


name,earnings_rank
Star Wars: The Force Awakens,1
Avengers: Endgame,2
Avatar,3
Black Panther,4
Avengers: Infinity War,5
Titanic,6
Jurassic World,7
The Avengers,8
Star Wars: Episode VIII - The Last Jedi,9
Incredibles 2,10


In [17]:
# "How many movies have a smaller earnings rank number than ANY (at least one) of the Spider-Man movies in the database?"
%%sql
SELECT COUNT(*) FROM Movie
  WHERE earnings_rank < 
    ANY(SELECT earnings_rank FROM Movie WHERE name LIKE '%Spider-Man%'
        AND earnings_rank IS NOT NULL);

 * postgresql://postgres:***@localhost/my_data
1 rows affected.


count
114


# 6. Subqueries in the SELECT Clause
Sometimes (rarely!), you'll want to include a subquery in the SELECT clause. Conceptually, this subquery is re-run FOR EACH AND EVERY row of your results, which means the total work can grow as $O(n^2)$ (really slow, for those of you who haven't taken algorithms). These are the kinds of queries that can be fun to write (and for CS professors to assign on homework), but that can give DBMS administrators nightmares.

However, to give you an example, we're going to run a query that answers the question 

*"Can you give me a list of movies, the year they were released, their earnings rank, and the average earnings rank for all movies released in the same year?"*

You'll notice a few things about this query:
1. We're going to need to (in effect) JOIN the Movie table with itself, since we want to repeatedly do searches for "Movies that came out in the same year as some other movie."
2. We're going to use a subquery to calculate a new average for every row (that is, every movie) in our database.
3. We're going to ROUND our result to 2 decimals.

Again, this is (in general) a bad way to write queries, since we are repeatedly calculating values (the average earnings rank of a movie released in a certain year) that ideally should only be calculated once. This isn't a big deal for 1,000 rows, given the speed of modern computers. It could be a HUGE deal for 1,000,000 rows, though, no matter how fast your computer is.
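
One common way to compute each yearly average only once is to build the averages in a grouped subquery and JOIN them back to the movies. The sketch below compares the two approaches on a toy in-memory SQLite table with made-up rows. Whether a given DBMS really re-runs the correlated subquery for every row depends on its query planner, so treat the performance claim as a rule of thumb.

```python
import sqlite3

# Hypothetical toy data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movie (name TEXT, year INTEGER, earnings_rank INTEGER)")
conn.executemany(
    "INSERT INTO movie VALUES (?, ?, ?)",
    [("A", 2015, 1), ("B", 2015, 7),
     ("C", 2018, 4), ("D", 2018, 5), ("E", 2018, 12)],
)

# Subquery in SELECT: the average is (conceptually) recomputed per row.
correlated = conn.execute("""
SELECT name, earnings_rank, year,
  (SELECT ROUND(AVG(earnings_rank), 2)
     FROM movie M2 WHERE M2.year = M.year)
  FROM movie M
  ORDER BY name
""").fetchall()

# Alternative: compute one average per year, then join it to each movie.
joined = conn.execute("""
SELECT M.name, M.earnings_rank, M.year, Y.avg_rank
  FROM movie M
  JOIN (SELECT year, ROUND(AVG(earnings_rank), 2) AS avg_rank
          FROM movie GROUP BY year) AS Y
    ON Y.year = M.year
  ORDER BY M.name
""").fetchall()

print(correlated == joined)
```

Both queries return the same rows; the second one states explicitly that there is only one average per year.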


In [18]:
%%sql

SELECT name, earnings_rank, year, 
  (SELECT ROUND(AVG(earnings_rank), 2) 
    FROM Movie M2 WHERE M2.year = M.year)
    AS "Earning Rank Avg for Year"
  FROM Movie M
  LIMIT 10;

 * postgresql://postgres:***@localhost/my_data
10 rows affected.


name,earnings_rank,year,Earning Rank Avg for Year
Star Wars: The Force Awakens,1,2015,56.89
Avengers: Endgame,2,2019,46.64
Avatar,3,2009,103.4
Black Panther,4,2018,106.69
Avengers: Infinity War,5,2018,106.69
Titanic,6,1997,97.0
Jurassic World,7,2015,56.89
The Avengers,8,2012,103.36
Star Wars: Episode VIII - The Last Jedi,9,2017,87.23
Incredibles 2,10,2018,106.69


#7. Working with Dates
SQL also allows us to work with dates in various ways. For example, we can do things like:
1. Get the current date
2. Calculate the difference between two dates
3. Extract month, day, or year
4. Many other things

The treatment of dates (unlike most of what we've talked about so far) is NOT fully specified by the SQL ANSI standard. So, different database software (MySQL, Postgres, Oracle, etc.) will handle them differently. You'll need to read the documentation, or check StackOverflow :).
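
To illustrate how much the syntax can vary, here is a sketch of the same kinds of date operations in SQLite (through Python's built-in `sqlite3` module), which uses `strftime` and `julianday` where Postgres uses `EXTRACT` and date subtraction. The dates are fixed for reproducibility and borrowed from examples in this section; a plain Python `datetime` calculation is included as a cross-check.

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")

# Extract year, month, and day. SQLite returns text, so we cast to integers.
parts = conn.execute(
    "SELECT CAST(strftime('%Y', d) AS INTEGER), "
    "CAST(strftime('%m', d) AS INTEGER), "
    "CAST(strftime('%d', d) AS INTEGER) "
    "FROM (SELECT '1979-05-21' AS d)"
).fetchone()

# Difference between two dates in days, using Julian day numbers.
days_sql = round(conn.execute(
    "SELECT julianday('2022-11-10') - julianday('1968-09-25')"
).fetchone()[0])

# Cross-check with Python's own date arithmetic.
days_py = (date(2022, 11, 10) - date(1968, 9, 25)).days

print(parts, days_sql, days_py)
```

The ideas carry over between systems, but the function names and return types do not, which is why checking the documentation matters.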

In [19]:
#First, let's set our timezone
# https://www.postgresql.org/docs/7.2/timezones.html
%%sql
SET TIMEZONE='America/Chicago';

 * postgresql://postgres:***@localhost/my_data
Done.


[]

In [20]:
# Let's get the current time!
%%sql
SELECT NOW();

 * postgresql://postgres:***@localhost/my_data
1 rows affected.


now
2022-11-10 13:36:05.024853-06:00


In [21]:
# We can also do things like extract the year, month or day from dates
# Using EXTRACT()
# Here ":: Date" means "treat this string as a date"

%%sql
SELECT EXTRACT(Year FROM '1979-05-21' :: Date) as "Year",
  EXTRACT(Month FROM '1979-05-21' :: Date) as "Month",
  EXTRACT(Day FROM '1979-05-21' :: Date) as "Day"

 * postgresql://postgres:***@localhost/my_data
1 rows affected.


Year,Month,Day
1979.0,5.0,21.0


In [22]:
# How long ago was Will Smith born? Let's find out!
%%sql 
SELECT name, dob, NOW()-dob AS "Age (days)"
  FROM Person
  WHERE name = 'Will Smith';

 * postgresql://postgres:***@localhost/my_data
1 rows affected.


name,dob,Age (days)
Will Smith,1968-09-25,"19769 days, 14:36:15.627790"


In [23]:
# We can do the same thing with AGE()
%%sql
SELECT name, dob, AGE(dob) AS "Age (days)"
  FROM Person
  WHERE name = 'Will Smith';

 * postgresql://postgres:***@localhost/my_data
1 rows affected.


name,dob,Age (days)
Will Smith,1968-09-25,"19755 days, 0:00:00"


In [24]:
# What's the earliest date of birth recorded in our database?
# We can use MIN() to find out
%%sql 
SELECT name, dob FROM Person
  WHERE dob = 
    (SELECT MIN(dob) FROM Person)

 * postgresql://postgres:***@localhost/my_data
1 rows affected.


name,dob
Albert Gran,1862-08-04


In [25]:
# In order to find the age in years, we could do the following:
%%sql 
SELECT name, dob, 
  EXTRACT(YEAR FROM AGE(dob)) as "Born this many years ago"
  FROM Person
  WHERE dob = 
    (SELECT MIN(dob) FROM Person)

 * postgresql://postgres:***@localhost/my_data
1 rows affected.


name,dob,Born this many years ago
Albert Gran,1862-08-04,160.0


#8. Relational Algebra
SQL also supports the set operators from relational algebra, such as UNION, INTERSECT, and difference (EXCEPT).
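
Here is a minimal sketch of the three operators on two tiny, made-up lists of names. It uses an in-memory SQLite database (which also supports these operators) instead of the course's Postgres database.

```python
import sqlite3

# Hypothetical toy data: who has acted and who has directed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE acted (name TEXT);
CREATE TABLE directed (name TEXT);
INSERT INTO acted VALUES ('Ann'), ('Ann'), ('Bob'), ('Cy');
INSERT INTO directed VALUES ('Bob'), ('Cy'), ('Dee');
""")

def names(sql):
    """Run a one-column query and return its values as a list."""
    return [row[0] for row in conn.execute(sql)]

# In both lists.
both = names(
    "SELECT name FROM acted INTERSECT SELECT name FROM directed ORDER BY name")
# In either list (duplicates removed).
either = names(
    "SELECT name FROM acted UNION SELECT name FROM directed ORDER BY name")
# In the first list but not the second.
only_acted = names(
    "SELECT name FROM acted EXCEPT SELECT name FROM directed ORDER BY name")
# UNION ALL keeps duplicates instead of removing them.
with_dupes = names(
    "SELECT name FROM acted UNION ALL SELECT name FROM directed")

print(both, either, only_acted, len(with_dupes))
```

Both sides of a set operator must return the same number of columns with compatible types, which is why each SELECT here returns just `name`.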

In [26]:
# Let's find the list of people who have (1) acted in at least 2 movies
# and (2) directed at least 2 movies
%%sql
SELECT name FROM Person
  JOIN Actor on Person.id = Actor.actor_id
  GROUP BY name
  HAVING COUNT(name) >= 2
  
INTERSECT

SELECT name from Person
    JOIN Director on Person.id = Director.director_id
    GROUP BY name
    HAVING COUNT(name) >= 2

 * postgresql://postgres:***@localhost/my_data
3 rows affected.


name
Mel Gibson
Clint Eastwood
Jon Favreau


In [27]:
# How about the list of people who have EITHER acted in at least 10 movies
# OR directed at least 10 movies?

%%sql
SELECT name FROM Person
  JOIN Actor on Person.id = Actor.actor_id
  GROUP BY name
  HAVING COUNT(name) >= 10
  
UNION

SELECT name from Person
    JOIN Director on Person.id = Director.director_id
    GROUP BY name
    HAVING COUNT(name) >= 10

 * postgresql://postgres:***@localhost/my_data
8 rows affected.


name
Harrison Ford
Steven Spielberg
Will Smith
Robert De Niro
Scarlett Johansson
Tom Hanks
Tom Cruise
Robert Downey Jr.


In [28]:
# Or we could find the names of actors named Steven who haven't directed
# Using EXCEPT
%%sql
SELECT name FROM Person
  JOIN Actor on Person.id = Actor.actor_id
  WHERE name LIKE 'Steven%'

EXCEPT

SELECT name FROM Person
  JOIN Director on Person.id = Director.director_id
  WHERE name LIKE 'Steven%'

 * postgresql://postgres:***@localhost/my_data
3 rows affected.


name
Steven Chester Prince
Steven Berkoff
Steven Weber


#9. Working with Numbers and "Derived Tables"
SQL also has the ability to work with numbers in various ways. As with dates, some of these functions go "beyond" ANSI standard SQL, and so you'll need to double-check the documentation for the details on how your particular "flavor" of SQL handles it.

https://www.postgresql.org/docs/10/functions-math.html

I've just provided a few examples here. We are going to:
1. Take an absolute value.
2. Generate a series of random numbers.
3. Give an example of a "derived table."
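
Derived tables are useful well beyond random numbers. As one hedged illustration, the sketch below answers "what is the average number of movies per actor?" by first counting movies per actor in a subquery in the FROM clause, then aggregating over that result. It assumes a toy in-memory SQLite table with made-up rows; `per_actor` and `n_movies` are names chosen for this example.

```python
import sqlite3

# Hypothetical toy data: actor 1 has 3 movies, actor 2 has 1, actor 3 has 2.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE actor (actor_id INTEGER, movie_id INTEGER)")
conn.executemany(
    "INSERT INTO actor VALUES (?, ?)",
    [(1, 10), (1, 11), (1, 12), (2, 10), (3, 11), (3, 12)],
)

# The parenthesized subquery in FROM is the derived table; AS names it.
avg_movies, max_movies = conn.execute("""
SELECT AVG(n_movies), MAX(n_movies) FROM
  (SELECT actor_id, COUNT(*) AS n_movies
     FROM actor GROUP BY actor_id) AS per_actor
""").fetchone()

print(avg_movies, max_movies)
```

An aggregate of an aggregate cannot be written directly (for example, `AVG(COUNT(*))` is not allowed), so the derived table supplies the intermediate step.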

In [29]:
# Absolute value 
%%sql
SELECT ABS(-10.4356)

 * postgresql://postgres:***@localhost/my_data
1 rows affected.


abs
10.4356


In [30]:
# Get a random whole number between 0 and 1000
%%sql
SELECT ROUND(RANDOM() * 1000)

 * postgresql://postgres:***@localhost/my_data
1 rows affected.


round
89.0


In [31]:
# We can also get 5 random numbers
# using the GENERATE_SERIES function
%%sql
SELECT ROUND(RANDOM() * 1000) FROM GENERATE_SERIES(1,5)

 * postgresql://postgres:***@localhost/my_data
5 rows affected.


round
271.0
632.0
664.0
662.0
289.0


In [32]:
# OK, let's now generate 10 random numbers between 0 and 1000
# and do some math with them
# (note: LN raises an error if a 0 happens to be drawn)
# We'll need to use a "derived table" as shown here, since
# this allows us to "save" the random numbers we want to work with
%%sql 
SELECT my_number, SQRT(my_number), LN(my_number), POW(my_number, 2) FROM
  (SELECT ROUND(RANDOM() * 1000) as my_number -- This is a column alias
  FROM GENERATE_SERIES(1,10)) AS my_table -- the subquery in FROM is the derived table; "AS" names it

 * postgresql://postgres:***@localhost/my_data
10 rows affected.


my_number,sqrt,ln,pow
95.0,9.74679434480896,4.55387689160054,9025.0
587.0,24.2280828791714,6.3750248198281,344569.0
170.0,13.0384048104053,5.13579843705026,28900.0
426.0,20.6397674405503,6.05443934626937,181476.0
160.0,12.6491106406735,5.07517381523383,25600.0
537.0,23.1732604525129,6.28599809450886,288369.0
839.0,28.9654967159205,6.73221070646721,703921.0
79.0,8.88819441731559,4.36944785246702,6241.0
756.0,27.495454169735,6.62804137617953,571536.0
78.0,8.83176086632785,4.35670882668959,6084.0


#10. Processing Strings
Finally, we can process Strings in various ways. There are a *lot* of different functions for doing this, and the exact syntax for these commands will again vary by the type of SQL you are using. 

https://www.postgresql.org/docs/10/functions-string.html

We're going to:
1. Explore how to make strings uppercase and lowercase
2. Compare the lengths of various strings
3. Split and concatenate strings

In [33]:
# We can make strings upper case or lowercase
# We can even get their length
%%sql
SELECT name, UPPER(name), LOWER(name), LENGTH(name) FROM Person LIMIT 5;

 * postgresql://postgres:***@localhost/my_data
5 rows affected.


name,upper,lower,length
Linda Darnell,LINDA DARNELL,linda darnell,13
Elisabeth Moss,ELISABETH MOSS,elisabeth moss,14
Andrew Chavez,ANDREW CHAVEZ,andrew chavez,13
Hilary Duff,HILARY DUFF,hilary duff,11
Chuck Pfeiffer,CHUCK PFEIFFER,chuck pfeiffer,14


In [34]:
# Hmmm. I wonder what the longest movie name in our database is?
# We need a subquery!
%%sql
SELECT name FROM Movie
  WHERE LENGTH(name) = 
    (SELECT MAX(LENGTH(name)) FROM Movie);

 * postgresql://postgres:***@localhost/my_data
1 rows affected.


name
"The Chronicles of Narnia: The Lion, the Witch and the Wardrobe"


In [35]:
# We can concatenate (or "combine") strings 
%%sql

SELECT 'abc ' || '123'

 * postgresql://postgres:***@localhost/my_data
1 rows affected.


?column?
abc 123


In [36]:
# Let's put this to work
# Let's add "Oscar Winner" to the names of people who've won
# a best actor or actress award after 2012

%%sql 
SELECT 'Oscar Winner ' || P.name
  FROM Person P
  JOIN Oscar O ON O.person_id = P.id
  WHERE (O.type = 'BEST-ACTRESS' OR
  O.type = 'BEST-ACTOR') AND
  O.year > 2012


 * postgresql://postgres:***@localhost/my_data
16 rows affected.


?column?
Oscar Winner Joaquin Phoenix
Oscar Winner Renee Zellweger
Oscar Winner Rami Malek
Oscar Winner Olivia Colman
Oscar Winner Gary Oldman
Oscar Winner Frances McDormand
Oscar Winner Casey Affleck
Oscar Winner Emma Stone
Oscar Winner Leonardo DiCaprio
Oscar Winner Brie Larson


In [37]:
# We can play with substrings using regular expressions ("regex")
# https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions
# The details here are beyond the scope of the class, but these are quite powerful :).
# Let's get the first three letters and last two letters of people's names
%%sql 
SELECT name, 
  substring(name FROM  '^...'), --first three letters
  substring(name FROM '..$'), -- last two letters
  substring(name FROM '^[A-Za-z0-9]*'), -- Gets the first name
  substring(name FROM '[A-Za-z0-9]*$') --Get the last name
  FROM Person
LIMIT 10;

 * postgresql://postgres:***@localhost/my_data
10 rows affected.


name,substring,substring_1,substring_2,substring_3
Linda Darnell,Lin,ll,Linda,Darnell
Elisabeth Moss,Eli,ss,Elisabeth,Moss
Andrew Chavez,And,ez,Andrew,Chavez
Hilary Duff,Hil,ff,Hilary,Duff
Chuck Pfeiffer,Chu,er,Chuck,Pfeiffer
Peter Segal,Pet,al,Peter,Segal
Charlie Tahan,Cha,an,Charlie,Tahan
Danielle Skraastad,Dan,ad,Danielle,Skraastad
Pierre Coffin,Pie,in,Pierre,Coffin
Marc Webb,Mar,bb,Marc,Webb
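
To experiment with these four patterns outside the database, here is a small sketch using Python's built-in `re` module. Python's regex engine is not the POSIX engine that Postgres uses, so treat this as an approximation that should agree for simple patterns like these. The third name is included to show an edge case: in Python, the "last name" pattern yields an empty string when the name ends in a period.

```python
import re

names = ["Linda Darnell", "Marc Webb", "Robert Downey Jr."]

def pick(pattern, text):
    """Return the first substring of text matching pattern, or None."""
    match = re.search(pattern, text)
    return match.group(0) if match else None

rows = [
    (
        name,
        pick(r"^...", name),            # first three characters
        pick(r"..$", name),             # last two characters
        pick(r"^[A-Za-z0-9]*", name),   # leading run of letters/digits
        pick(r"[A-Za-z0-9]*$", name),   # trailing run of letters/digits
    )
    for name in names
]

for row in rows:
    print(row)
```

The character-class patterns only approximate "first name" and "last name": they stop at spaces, periods, hyphens, and any non-ASCII letters.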
