In [1]:
import sqlite3
import pandas as pd

def execute(query, database_path='dataset/database.sqlite'):
    # Run the query once and build the DataFrame from the same cursor
    connection = sqlite3.connect(database_path)
    try:
        cursor = connection.execute(query)
        column_names = [description[0] for description in cursor.description]
        df = pd.DataFrame(cursor.fetchall(), columns=column_names)
    finally:
        # Close the connection even if the query raises
        connection.close()
    return df


In [2]:
from pandasql import sqldf

# Create helper function for easier query execution
def execute_df(q):
    # Run SQL against the DataFrames in this notebook's global namespace
    return sqldf(q, globals())

In [3]:
import sqlite3
import pandas as pd

def get_table_names(database_path='dataset/database.sqlite'):
    connection = sqlite3.connect(database_path)
    query = "SELECT name FROM sqlite_master WHERE type='table';"
    result = connection.execute(query).fetchall()
    table_names = [row[0] for row in result]
    connection.close()
    return table_names

# Get and print all table names in the database
tables = get_table_names()
print("Tables in the database:", tables)


Tables in the database: ['sqlite_sequence', 'Player_Attributes', 'Player', 'Match', 'League', 'Country', 'Team', 'Team_Attributes']


In [4]:
player_attributes = execute("SELECT * FROM Player_Attributes;")
player = execute("SELECT * FROM Player;")
match = execute("SELECT * FROM Match;")
league = execute("SELECT * FROM League;")
country = execute("SELECT * FROM Country;")
team = execute("SELECT * FROM Team;")
team_attributes = execute("SELECT * FROM Team_Attributes;")

# Basic Correlated Subqueries

Correlated subqueries are subqueries that reference one or more columns in the main query. Correlated subqueries depend on information in the main query to run, and thus, cannot be executed on their own.

Conceptually, a correlated subquery is evaluated once per row of the outer query, which usually takes more computing power and time than a simple subquery that is evaluated once.
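To make the dependency on the outer query concrete, here is a minimal sketch on an in-memory SQLite table. The table name `toy_match` and its three rows are made up for illustration and are not taken from the dataset. The inner query fails when run by itself because `main.country_id` only exists inside the outer query, while the full statement runs.

```python
import sqlite3

# Hypothetical toy data: (country_id, home_team_goal, away_team_goal)
rows = [(1, 1, 0), (1, 9, 1), (2, 2, 2)]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE toy_match "
    "(country_id INTEGER, home_team_goal INTEGER, away_team_goal INTEGER)"
)
conn.executemany("INSERT INTO toy_match VALUES (?, ?, ?)", rows)

inner = """
SELECT AVG(sub.home_team_goal + sub.away_team_goal)
FROM toy_match AS sub
WHERE main.country_id = sub.country_id
"""

# On its own, the inner query refers to main.country_id, which is undefined here
try:
    conn.execute(inner)
    inner_runs_alone = True
except sqlite3.OperationalError:
    inner_runs_alone = False

# Embedded in the outer query, the same subquery can see each outer row
outer = f"""
SELECT main.country_id, main.home_team_goal, main.away_team_goal
FROM toy_match AS main
WHERE (main.home_team_goal + main.away_team_goal) > ({inner})
"""
above_average = conn.execute(outer).fetchall()
conn.close()

print(inner_runs_alone)   # False
print(above_average)      # [(1, 9, 1)]
```

Only the 9-1 match exceeds its country's average of 5.5 goals. The single match in country 2 equals its own average, so it is filtered out.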

In this exercise, you will practice using correlated subqueries to examine matches with scores that are extreme outliers for each country -- above 3 times the average score!

In [5]:
# Matches with more than 3x their country's average total goals
query = """
SELECT 
	-- Select country ID, date, home, and away goals from match
	main.country_id,
    main.date,
    main.home_team_goal, 
    main.away_team_goal
FROM match AS main
WHERE 
	-- Filter the main query by the subquery
	(home_team_goal + away_team_goal) > 
        (SELECT AVG((sub.home_team_goal + sub.away_team_goal) * 3)
         FROM match AS sub
         -- Join the main query to the subquery in WHERE
         WHERE main.country_id = sub.country_id);
"""
result = execute_df(query)

# Show results
result.head()

Unnamed: 0,country_id,date,home_team_goal,away_team_goal
0,1,2011-10-29 00:00:00,4,5
1,1729,2009-11-22 00:00:00,9,1
2,1729,2010-01-16 00:00:00,7,2
3,1729,2011-08-28 00:00:00,8,2
4,1729,2012-12-29 00:00:00,7,3


# Correlated subquery with multiple conditions

Correlated subqueries are useful for matching data across multiple columns. In the previous exercise, you generated a list of matches with extremely high scores for each country. In this exercise, you're going to match on a second column, season, to answer the question: what was the highest scoring match for each country, in each season?

In [6]:
query = """
SELECT 
	-- Select country ID, date, home, and away goals from match
	main.country_id,
    main.date,
    main.home_team_goal, 
    main.away_team_goal
FROM match AS main
WHERE 
	-- Filter for matches with the highest number of goals scored
	(home_team_goal + away_team_goal) = 
        (SELECT MAX(sub.home_team_goal + sub.away_team_goal)
         FROM match AS sub
         WHERE main.country_id = sub.country_id
               AND main.season = sub.season);
"""
result = execute_df(query)

# Show results
result.head()

Unnamed: 0,country_id,date,home_team_goal,away_team_goal
0,1,2008-10-25 00:00:00,7,1
1,1,2009-12-04 00:00:00,2,5
2,1,2009-12-26 00:00:00,5,2
3,1,2010-11-20 00:00:00,4,4
4,1,2010-09-19 00:00:00,5,3
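Because the filter above compares each match with its group maximum using `=`, every match that ties for the maximum in a country and season is returned, not just one. A minimal sketch of that behavior follows; the table `toy_match` and its rows are invented for illustration.

```python
import sqlite3

# Hypothetical rows: (country_id, season, home_team_goal, away_team_goal)
rows = [
    (1, "2008/2009", 3, 1),
    (1, "2008/2009", 2, 2),   # ties the 4-goal maximum for this group
    (1, "2008/2009", 1, 0),
    (1, "2009/2010", 5, 2),
    (2, "2008/2009", 0, 1),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE toy_match (country_id INTEGER, season TEXT, "
    "home_team_goal INTEGER, away_team_goal INTEGER)"
)
conn.executemany("INSERT INTO toy_match VALUES (?, ?, ?, ?)", rows)

query = """
SELECT main.country_id, main.season, main.home_team_goal, main.away_team_goal
FROM toy_match AS main
WHERE (main.home_team_goal + main.away_team_goal) =
      (SELECT MAX(sub.home_team_goal + sub.away_team_goal)
       FROM toy_match AS sub
       WHERE main.country_id = sub.country_id
         AND main.season = sub.season)
ORDER BY main.country_id, main.season, main.home_team_goal
"""
top_matches = conn.execute(query).fetchall()
conn.close()

for row in top_matches:
    print(row)
```

Country 1 in 2008/2009 contributes two rows here, since both the 3-1 and the 2-2 match reach the group maximum of four goals.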


# Nested simple subqueries

Nested subqueries can be either simple or correlated.

Just like an unnested subquery, a nested simple subquery can be executed independently of the outer query, while a correlated subquery needs the outer query in order to run and produce results.

In this exercise, you will practice creating a nested subquery to examine the highest total number of goals in each season, overall, and during July 2013.

In [7]:
query = """
SELECT
	-- Select the season and max goals scored in a match
	season,
    MAX(home_team_goal + away_team_goal) AS max_goals,
    -- Select the overall max goals scored in a match
   (SELECT MAX(home_team_goal + away_team_goal) FROM match) AS overall_max_goals,
   -- Select the max number of goals scored in any match in July 2013
   (SELECT MAX(home_team_goal + away_team_goal) 
    FROM match
    WHERE id IN (
          SELECT id FROM match WHERE date BETWEEN '2013-07-01' AND '2013-07-31')) AS july_max_goals
FROM match
GROUP BY season;
"""
result = execute_df(query)

# Show results
result.head()

Unnamed: 0,season,max_goals,overall_max_goals,july_max_goals
0,2008/2009,9,12,6
1,2009/2010,12,12,6
2,2010/2011,10,12,6
3,2011/2012,10,12,6
4,2012/2013,11,12,6
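The July filter in the cell above is a fixed 2013 date range. The sketch below, on invented rows in a hypothetical `toy_match` table, shows two things that may matter when adapting it. It assumes dates are stored as text with a time part, as the outputs in this notebook suggest. First, under text comparison a value such as `'2013-07-31 00:00:00'` sorts after `'2013-07-31'`, so `BETWEEN` drops the last day. Second, SQLite's `strftime('%m', ...)` can select July of every year.

```python
import sqlite3

# Hypothetical rows: (date stored as text with a time part, total goals)
rows = [
    ("2012-07-15 00:00:00", 7),
    ("2013-07-10 00:00:00", 3),
    ("2013-07-31 00:00:00", 6),
    ("2013-08-01 00:00:00", 9),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_match (date TEXT, goals INTEGER)")
conn.executemany("INSERT INTO toy_match VALUES (?, ?)", rows)

# Text comparison excludes '2013-07-31 00:00:00' from this range
between_2013 = conn.execute(
    "SELECT MAX(goals) FROM toy_match "
    "WHERE date BETWEEN '2013-07-01' AND '2013-07-31'"
).fetchone()[0]

# A half-open range keeps every timestamp on 31 July 2013
half_open_2013 = conn.execute(
    "SELECT MAX(goals) FROM toy_match "
    "WHERE date >= '2013-07-01' AND date < '2013-08-01'"
).fetchone()[0]

# strftime('%m', ...) extracts the month, so this covers July of any year
any_july = conn.execute(
    "SELECT MAX(goals) FROM toy_match WHERE strftime('%m', date) = '07'"
).fetchone()[0]
conn.close()

print(between_2013, half_open_2013, any_july)   # 3 6 7
```

Whether the boundary matters for the real data depends on whether any match was played on the last day of the month, which this sketch does not check.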


# Nest a subquery in FROM

What's the average number of matches per season where a team scored 5 or more goals? How does this differ by country?

Let's use nested subqueries in FROM to perform this operation. In the real world, you will probably find that nesting multiple subqueries is a task you don't have to perform often. In some cases, however, you may find yourself struggling to properly group by the column you want, or to calculate information requiring multiple mathematical transformations (e.g., an AVG of a COUNT).

Performing your transformations one step at a time, wrapping each step in a subquery before applying the next set of transformations, is often the easiest way to yield accurate information about your data. Let's get to it!

In [8]:
query = """
-- Select matches where a team scored 5+ goals
SELECT
	country_id,
    season,
	id
FROM match
WHERE home_team_goal >= 5 OR away_team_goal >= 5;
"""
result = execute_df(query)

# Show results
result.head()

Unnamed: 0,country_id,season,id
0,1,2008/2009,4
1,1,2008/2009,55
2,1,2008/2009,57
3,1,2008/2009,79
4,1,2008/2009,112


In [9]:
query = """
-- Count match ids
SELECT
    country_id,
    season,
    COUNT(*) AS matches
-- Set up and alias the subquery
FROM (
	SELECT
    	country_id,
    	season,
    	id
	FROM match
	WHERE home_team_goal >= 5 OR away_team_goal >= 5 ) 
    AS subquery
-- Group by country_id and season
GROUP BY country_id, season;
"""
result = execute_df(query)

# Show results
result.head()

Unnamed: 0,country_id,season,matches
0,1,2008/2009,9
1,1,2009/2010,5
2,1,2010/2011,11
3,1,2011/2012,11
4,1,2012/2013,12


In [10]:
query = """
SELECT
	c.name AS country,
    -- Calculate the average matches per season
	AVG(outer_s.matches) AS avg_seasonal_high_scores
FROM country AS c
-- Left join outer_s to country
LEFT JOIN (
  SELECT country_id, season,
         COUNT(id) AS matches
  FROM (
    SELECT country_id, season, id
	FROM match
	WHERE home_team_goal >= 5 OR away_team_goal >= 5) AS inner_s
  -- Close parentheses and alias the subquery
  GROUP BY country_id, season) AS outer_s
ON c.id = outer_s.country_id
GROUP BY country;
"""
result = execute_df(query)

# Show results
result.head()

Unnamed: 0,country,avg_seasonal_high_scores
0,Belgium,9.571429
1,England,14.5
2,France,8.0
3,Germany,13.75
4,Italy,8.5


# Clean up with CTEs

In chapter 2, you generated a list of countries and the number of matches in each country with 10 or more total goals. The query in that exercise used a subquery in the FROM clause to filter the matches before counting them in the main query. Below is the query you created:
```
SELECT
  c.name AS country,
  COUNT(sub.id) AS matches
FROM country AS c
INNER JOIN (
  SELECT country_id, id 
  FROM match
  WHERE (home_goal + away_goal) >= 10) AS sub
ON c.id = sub.country_id
GROUP BY country;
```
You can list one (or more) subqueries as common table expressions (CTEs) by declaring them ahead of your main query, which is an excellent tool for organizing information and placing it in a logical order.

In this exercise, let's rewrite a similar query using a CTE.
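As a quick check that moving a FROM subquery into a CTE changes only the layout, the sketch below runs both forms against the same in-memory SQLite data and compares the results. The tables `toy_country` and `toy_match` and their rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE toy_country (id INTEGER, name TEXT);
CREATE TABLE toy_match (id INTEGER, country_id INTEGER,
                        home_team_goal INTEGER, away_team_goal INTEGER);
INSERT INTO toy_country VALUES (1, 'A'), (2, 'B');
INSERT INTO toy_match VALUES (10, 1, 7, 3), (11, 1, 1, 0),
                             (12, 2, 5, 5), (13, 2, 6, 4);
""")

# Filter inside a subquery in FROM
subquery_version = """
SELECT c.name AS country, COUNT(sub.id) AS matches
FROM toy_country AS c
INNER JOIN (
  SELECT country_id, id
  FROM toy_match
  WHERE (home_team_goal + away_team_goal) >= 10) AS sub
ON c.id = sub.country_id
GROUP BY c.name
ORDER BY c.name
"""

# Same filter, declared ahead of the main query as a CTE
cte_version = """
WITH match_list AS (
  SELECT country_id, id
  FROM toy_match
  WHERE (home_team_goal + away_team_goal) >= 10)
SELECT c.name AS country, COUNT(match_list.id) AS matches
FROM toy_country AS c
INNER JOIN match_list ON c.id = match_list.country_id
GROUP BY c.name
ORDER BY c.name
"""

from_subquery = conn.execute(subquery_version).fetchall()
from_cte = conn.execute(cte_version).fetchall()
conn.close()

print(from_subquery == from_cte)   # True
print(from_cte)                    # [('A', 1), ('B', 2)]
```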

In [11]:
query = """
SELECT
  c.name AS country,
  COUNT(sub.id) AS matches
FROM country AS c
INNER JOIN (
  SELECT country_id, id 
  FROM match
  WHERE (home_team_goal + away_team_goal) >= 10) AS sub
ON c.id = sub.country_id
GROUP BY country;
"""
result = execute_df(query)

# Show results
result.head()

Unnamed: 0,country,matches
0,England,4
1,France,1
2,Germany,1
3,Netherlands,2
4,Scotland,1


In [12]:
query = """
-- Set up your CTE
WITH match_list AS (
    SELECT 
  		country_id, 
  		id
    FROM match
    WHERE (home_team_goal + away_team_goal) >= 10)
-- Select league and count of matches from the CTE
SELECT
    l.name AS league,
    COUNT(match_list.id) AS matches
FROM league AS l
-- Join the CTE to the league table
LEFT JOIN match_list ON l.country_id = match_list.country_id
GROUP BY l.name;
"""
result = execute_df(query)

# Show results
result.head()

Unnamed: 0,league,matches
0,Belgium Jupiler League,0
1,England Premier League,4
2,France Ligue 1,1
3,Germany 1. Bundesliga,1
4,Italy Serie A,0


# Organizing with CTEs

Previously, you modified a query based on a statement you completed in chapter 2 using common table expressions.

This time, let's expand on the exercise by looking at details about matches with very high scores using CTEs. Just like a subquery in FROM, you can join tables inside a CTE.

In [13]:
query = """
-- Set up your CTE
WITH match_list AS (
  -- Select the league, date, home, and away goals
    SELECT 
  		l.name AS league, 
     	m.date, 
  		m.home_team_goal, 
  		m.away_team_goal,
       (m.home_team_goal + m.away_team_goal) AS total_goals
    FROM match AS m
    LEFT JOIN league as l ON m.country_id = l.id)
-- Select the league, date, home, and away goals from the CTE
SELECT league, date, home_team_goal, away_team_goal
FROM match_list
-- Filter by total goals
WHERE total_goals >= 10;
"""
result = execute_df(query)

# Show results
result.head()

Unnamed: 0,league,date,home_team_goal,away_team_goal
0,England Premier League,2009-11-22 00:00:00,9,1
1,England Premier League,2011-08-28 00:00:00,8,2
2,England Premier League,2012-12-29 00:00:00,7,3
3,England Premier League,2013-05-19 00:00:00,5,5
4,France Ligue 1,2009-11-08 00:00:00,5,5


# CTEs with nested subqueries

If you find yourself listing multiple subqueries in the FROM clause with nested statements, your query will likely become long, complex, and difficult to read.

Since many queries are written with the intention of being saved and re-run in the future, proper organization is key to a seamless workflow. Arranging subqueries as CTEs will save you time, space, and confusion in the long run.
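One way to picture the reorganization: the two-level "AVG of a COUNT" nesting used earlier in this notebook can be written as two CTEs declared in sequence, each reading from the one before it. The sketch below uses a hypothetical `toy_match` table with invented rows, and the CTE names `high_scoring` and `per_season` are illustrative choices.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE toy_match (id INTEGER, country_id INTEGER, season TEXT,
                        home_team_goal INTEGER, away_team_goal INTEGER);
INSERT INTO toy_match VALUES
  (1, 1, '2008/2009', 5, 0),
  (2, 1, '2008/2009', 0, 6),
  (3, 1, '2008/2009', 1, 1),
  (4, 1, '2009/2010', 5, 5),
  (5, 2, '2008/2009', 7, 1);
""")

query = """
-- Step 1: matches where either team scored 5+ goals
WITH high_scoring AS (
  SELECT country_id, season, id
  FROM toy_match
  WHERE home_team_goal >= 5 OR away_team_goal >= 5),
-- Step 2: count those matches per country and season
per_season AS (
  SELECT country_id, season, COUNT(id) AS matches
  FROM high_scoring
  GROUP BY country_id, season)
-- Step 3: average the seasonal counts per country
SELECT country_id, AVG(matches) AS avg_seasonal_high_scores
FROM per_season
GROUP BY country_id
ORDER BY country_id
"""
averages = conn.execute(query).fetchall()
conn.close()

print(averages)   # [(1, 1.5), (2, 1.0)]
```

Each step can be read top to bottom, and the final SELECT no longer contains any nested parentheses.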

In [14]:
query = """
-- Set up your CTE
WITH match_list AS (
    SELECT 
  		country_id,
  	   (home_team_goal + away_team_goal) AS goals
    FROM match
  	-- Create a list of match IDs to filter data in the CTE
    WHERE id IN (
       SELECT id
       FROM match
       WHERE season = '2013/2014' AND date BETWEEN '2013-08-01' AND '2013-08-31'))
-- Select the league name and average of goals in the CTE
SELECT 
	l.name,
    AVG(match_list.goals)
FROM league AS l
-- Join the CTE onto the league table
LEFT JOIN match_list ON l.id = match_list.country_id
GROUP BY l.name;
"""
result = execute_df(query)

# Show results
result.head()

Unnamed: 0,name,AVG(match_list.goals)
0,Belgium Jupiler League,
1,England Premier League,2.142857
2,France Ligue 1,2.2
3,Germany 1. Bundesliga,3.25
4,Italy Serie A,2.2


# Get team names with a subquery

Let's solve a problem we've encountered a few times in this course so far -- How do you get both the home and away team names into one final query result?

Out of the 4 techniques we just discussed, this can be performed using subqueries, correlated subqueries, and CTEs. Let's practice creating similar result sets using each of these 3 methods over the next 3 exercises, starting with subqueries in FROM.

In [15]:
query = """
SELECT 
	m.id, 
    t.team_long_name AS hometeam
-- Left join team to match
FROM match AS m
LEFT JOIN team AS t
ON m.home_team_api_id = t.team_api_id;
"""
result = execute_df(query)

# Show results
result.head()

Unnamed: 0,id,hometeam
0,1,KRC Genk
1,2,SV Zulte-Waregem
2,3,KSV Cercle Brugge
3,4,KAA Gent
4,5,FCV Dender EH


In [16]:
query = """
SELECT
	m.date,
    -- Get the home and away team names
    home.hometeam,
    away.awayteam,
    m.home_team_goal,
    m.away_team_goal
FROM match AS m

-- Join the home subquery to the match table
LEFT JOIN (
  SELECT match.id, team.team_long_name AS hometeam
  FROM match
  LEFT JOIN team
  ON match.home_team_api_id = team.team_api_id) AS home
ON home.id = m.id

-- Join the away subquery to the match table
LEFT JOIN (
  SELECT match.id, team.team_long_name AS awayteam
  FROM match
  LEFT JOIN team
  -- Get the away team ID in the subquery
  ON match.away_team_api_id = team.team_api_id) AS away
ON away.id = m.id;
"""
result = execute_df(query)

# Show results
result.head()

Unnamed: 0,date,hometeam,awayteam,home_team_goal,away_team_goal
0,2008-08-17 00:00:00,KRC Genk,Beerschot AC,1,1
1,2008-08-16 00:00:00,SV Zulte-Waregem,Sporting Lokeren,0,0
2,2008-08-16 00:00:00,KSV Cercle Brugge,RSC Anderlecht,0,3
3,2008-08-17 00:00:00,KAA Gent,RAEC Mons,5,0
4,2008-08-16 00:00:00,FCV Dender EH,Standard de Liège,1,3


# Get team names with correlated subqueries

Let's solve the same problem using correlated subqueries -- How do you get both the home and away team names into one final query result?

This can easily be performed using correlated subqueries. But how might that impact the performance of your query? Complete the following steps and let's find out!

In [17]:
query = """
SELECT
    m.date,
   (SELECT team_long_name
    FROM team AS t
    -- Connect the team to the match table
    WHERE t.team_api_id = m.home_team_api_id) AS hometeam
FROM match AS m;
"""
result = execute_df(query)

# Show results
result.head()

Unnamed: 0,date,hometeam
0,2008-08-17 00:00:00,KRC Genk
1,2008-08-16 00:00:00,SV Zulte-Waregem
2,2008-08-16 00:00:00,KSV Cercle Brugge
3,2008-08-17 00:00:00,KAA Gent
4,2008-08-16 00:00:00,FCV Dender EH


In [18]:
query = """
SELECT
    m.date,
    (SELECT team_long_name
     FROM team AS t
     WHERE t.team_api_id = m.home_team_api_id) AS hometeam,
    -- Connect the team to the match table
    (SELECT team_long_name
     FROM team AS t
     WHERE t.team_api_id = m.away_team_api_id) AS awayteam,
    -- Select home and away goals
     m.home_team_goal,
     m.away_team_goal
FROM match AS m;
"""
result = execute_df(query)

# Show results
result.head()

Unnamed: 0,date,hometeam,awayteam,home_team_goal,away_team_goal
0,2008-08-17 00:00:00,KRC Genk,Beerschot AC,1,1
1,2008-08-16 00:00:00,SV Zulte-Waregem,Sporting Lokeren,0,0
2,2008-08-16 00:00:00,KSV Cercle Brugge,RSC Anderlecht,0,3
3,2008-08-17 00:00:00,KAA Gent,RAEC Mons,5,0
4,2008-08-16 00:00:00,FCV Dender EH,Standard de Liège,1,3
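On the performance question, one way to look into it in SQLite is `EXPLAIN QUERY PLAN`, which describes how the engine intends to run a statement. The sketch below compares the correlated form with a version that joins the team table twice, on invented `toy_team` and `toy_match` tables. The plan text depends on the SQLite version, so it is printed rather than interpreted here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE toy_team (team_api_id INTEGER, team_long_name TEXT);
CREATE TABLE toy_match (id INTEGER, home_team_api_id INTEGER,
                        away_team_api_id INTEGER);
INSERT INTO toy_team VALUES (100, 'Team A'), (200, 'Team B');
INSERT INTO toy_match VALUES (1, 100, 200), (2, 200, 100);
""")

correlated = """
SELECT m.id,
  (SELECT team_long_name FROM toy_team AS t
   WHERE t.team_api_id = m.home_team_api_id) AS hometeam,
  (SELECT team_long_name FROM toy_team AS t
   WHERE t.team_api_id = m.away_team_api_id) AS awayteam
FROM toy_match AS m
ORDER BY m.id
"""

joined = """
SELECT m.id, home.team_long_name AS hometeam, away.team_long_name AS awayteam
FROM toy_match AS m
LEFT JOIN toy_team AS home ON m.home_team_api_id = home.team_api_id
LEFT JOIN toy_team AS away ON m.away_team_api_id = away.team_api_id
ORDER BY m.id
"""

correlated_rows = conn.execute(correlated).fetchall()
joined_rows = conn.execute(joined).fetchall()

# The last column of each plan row holds the human-readable step description
for label, sql in (("correlated", correlated), ("joined", joined)):
    print(label)
    for step in conn.execute("EXPLAIN QUERY PLAN " + sql):
        print("  ", step[-1])
conn.close()

print(correlated_rows == joined_rows)   # True
```

Both forms return the same rows on this toy data. Any difference in run time on the full dataset would have to be measured rather than assumed.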


# Get team names with CTEs

You've now explored two methods for answering the question: how do you get both the home and away team names into one final query result?

Let's explore the final method: common table expressions. Common table expressions are similar to the subquery method for generating results, mainly differing in syntax and the order in which information is processed.

In [19]:
query = """
SELECT 
	-- Select match id and team long name
    m.id, 
    t.team_long_name AS hometeam
FROM match AS m
-- Join team to match using team_api_id and home_team_api_id
LEFT JOIN team AS t 
ON m.home_team_api_id = t.team_api_id;
"""
result = execute_df(query)

# Show results
result.head()

Unnamed: 0,id,hometeam
0,1,KRC Genk
1,2,SV Zulte-Waregem
2,3,KSV Cercle Brugge
3,4,KAA Gent
4,5,FCV Dender EH


In [20]:
query = """
-- Declare the home CTE
WITH home AS (
	SELECT m.id, t.team_long_name AS hometeam
	FROM match AS m
	LEFT JOIN team AS t 
	ON m.home_team_api_id = t.team_api_id)
-- Select everything from home
SELECT *
FROM home;
"""
result = execute_df(query)

# Show results
result.head()

Unnamed: 0,id,hometeam
0,1,KRC Genk
1,2,SV Zulte-Waregem
2,3,KSV Cercle Brugge
3,4,KAA Gent
4,5,FCV Dender EH


In [21]:
query = """
WITH home AS (
  SELECT m.id, m.date, 
  		 t.team_long_name AS hometeam, m.home_team_goal
  FROM match AS m
  LEFT JOIN team AS t 
  ON m.home_team_api_id = t.team_api_id),
-- Declare and set up the away CTE
away AS (
  SELECT m.id, m.date, 
  		 t.team_long_name AS awayteam, m.away_team_goal
  FROM match AS m
  LEFT JOIN team AS t 
  ON m.away_team_api_id = t.team_api_id)
-- Select date, home_team_goal, and away_team_goal
SELECT 
	home.date,
    home.hometeam,
    away.awayteam,
    home.home_team_goal,
    away.away_team_goal
-- Join away and home on the id column
FROM home
INNER JOIN away
ON home.id = away.id;
"""
result = execute_df(query)

# Show results
result.head()

Unnamed: 0,date,hometeam,awayteam,home_team_goal,away_team_goal
0,2008-08-17 00:00:00,KRC Genk,Beerschot AC,1,1
1,2008-08-16 00:00:00,SV Zulte-Waregem,Sporting Lokeren,0,0
2,2008-08-16 00:00:00,KSV Cercle Brugge,RSC Anderlecht,0,3
3,2008-08-17 00:00:00,KAA Gent,RAEC Mons,5,0
4,2008-08-16 00:00:00,FCV Dender EH,Standard de Liège,1,3


# Which technique to use?

The previous three exercises demonstrated that, in many cases, you can use multiple techniques in SQL to answer the same question.

Based on what you learned, which of the following statements is true regarding differences in the use and performance of multiple/nested subqueries, correlated subqueries, and common table expressions?


- Correlated subqueries can allow you to circumvent multiple, complex joins.
- Common table expressions are declared first, improving query run time.
- Multiple or nested subqueries are processed first, before your main query.