# Data Manipulation in SQL

Notes from the [Data Manipulation in SQL](https://campus.datacamp.com/courses/data-manipulation-in-sql/) course on DataCamp.

Please keep scrolling for a demo of my newly learned `SQL` skills.

## Explore Datasets
Use the `match`, `league`, and `country` tables to explore the data and practice your skills!
- **CTEs:** Use the `match`, `league`, and `country` tables to return the number of matches played in Great Britain versus elsewhere in the world.
    - "England", "Scotland", and "Wales" should be categorized as "Great Britain".
    - All other leagues will need to be categorized as "World".
- **Subqueries:** Use the `match` and `country` tables to return the countries in which the average number of goals per match (home and away goals combined) is greater than the average across all matches.
- **CASE:** In a soccer league, points are assigned to teams based on the result of a game. Here, let's assume that 3 points are awarded for a win, 1 for a tie, and 0 for a defeat. Use the `match` table to calculate the running total of points earned by the team "Chelsea" (team id 8455) in the season "2014/2015".
    - The final output should have the match date, the points earned by Chelsea, and the running total.

## Available raw data:

**Match table:**

In [22]:
SELECT *
FROM soccer.match
LIMIT 3;

| id | country_id | season | stage | date | hometeam_id | awayteam_id | home_goal | away_goal |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 757 | 1 | 2011/2012 | 1 | 2011-07-29 00:00:00+00:00 | 1773 | 8635 | 2 | 1 |
| 758 | 1 | 2011/2012 | 1 | 2011-07-30 00:00:00+00:00 | 9998 | 9985 | 1 | 1 |
| 759 | 1 | 2011/2012 | 1 | 2011-07-30 00:00:00+00:00 | 9987 | 9993 | 3 | 1 |


**League table:**

In [21]:
SELECT *
FROM soccer.league
LIMIT 3;

| id | country_id | name |
| --- | --- | --- |
| 1 | 1 | Belgium Jupiler League |
| 1729 | 1729 | England Premier League |
| 4769 | 4769 | France Ligue 1 |


**Country table:**

In [19]:
SELECT *
FROM soccer.country
LIMIT 3;

| id | name |
| --- | --- |
| 1 | Belgium |
| 1729 | England |
| 4769 | France |


## Common Table Expressions (CTE)
Advantages:

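- A CTE can be referenced by later CTEs and by the main query
- Improves the organization and readability of code
- Can improve query performance: the CTE is written once and can be reused within the query, although whether its result is computed once and kept in memory depends on the database engine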

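As a minimal sketch of the first point, the query below chains two CTEs: `season_avg` reads from `match_totals`, and the main query reads from both. The CTE names and the question asked (how many matches per season beat that season's average goal count) are illustrative assumptions, not part of the course material; only the `soccer.match` table and its columns come from the data shown above.

```sql
-- First CTE: total goals per match
WITH match_totals AS (
    SELECT
        season,
        (home_goal + away_goal) AS total_goals
    FROM soccer.match
),
-- Second CTE: builds on the first one
season_avg AS (
    SELECT
        season,
        AVG(total_goals) AS avg_goals
    FROM match_totals
    GROUP BY season
)
-- Main query: references both CTEs
SELECT
    mt.season,
    COUNT(*) AS matches_above_season_avg
FROM match_totals AS mt
INNER JOIN season_avg AS sa
    ON mt.season = sa.season
WHERE mt.total_goals > sa.avg_goals
GROUP BY mt.season
ORDER BY mt.season;
```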

**Let us use CTEs to view the matches with 10 or more goals.**

We would like to know the league, date and the score for both teams.

In [28]:
-- Set up the CTE
WITH match_list AS (
    -- Select the league, date, home goals, away goals, and total goals
    SELECT 
        l.name AS league, 
        m.date, 
        m.home_goal, 
        m.away_goal,
        (m.home_goal + m.away_goal) AS total_goals
    FROM soccer.match AS m
    -- Join each match to its league through the shared country_id
    LEFT JOIN soccer.league AS l ON m.country_id = l.country_id)

-- Select the league, date, home, and away goals from the CTE
SELECT league, date, home_goal, away_goal
FROM match_list
-- Filter by total goals
WHERE total_goals >= 10
LIMIT 6;

| league | date | home_goal | away_goal |
| --- | --- | --- | --- |
| England Premier League | 2011-08-28 00:00:00+00:00 | 8 | 2 |
| England Premier League | 2012-12-29 00:00:00+00:00 | 7 | 3 |
| England Premier League | 2013-05-19 00:00:00+00:00 | 5 | 5 |
| Germany 1. Bundesliga | 2013-03-30 00:00:00+00:00 | 9 | 2 |
| Netherlands Eredivisie | 2011-11-06 00:00:00+00:00 | 6 | 4 |
| Spain LIGA BBVA | 2013-10-30 00:00:00+00:00 | 7 | 3 |


**Let's use CTEs to clean up code.**

### Without CTE, using nested queries

In [27]:
SELECT
	-- Select the season and max goals scored in a match
	season,
    MAX(home_goal + away_goal) AS max_goals,
	
    -- Select the overall max goals scored in a match
   (SELECT MAX(home_goal + away_goal) FROM soccer.match) AS overall_max_goals,
   
   -- Select the max number of goals scored in any match in July
   (SELECT MAX(home_goal + away_goal) 
    FROM soccer.match
    WHERE id IN (
          SELECT id FROM soccer.match WHERE EXTRACT(MONTH FROM date) = 07)) AS july_max_goals
FROM soccer.match
GROUP BY season
LIMIT 4;

| season | max_goals | overall_max_goals | july_max_goals |
| --- | --- | --- | --- |
| 2013/2014 | 10 | 11 | 7 |
| 2012/2013 | 11 | 11 | 7 |
| 2014/2015 | 10 | 11 | 7 |
| 2011/2012 | 10 | 11 | 7 |


### Using CTEs to simplify and shorten code:

In [15]:
-- See Vis Code

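One possible way to express the nested query above with CTEs is sketched here. The CTE names (`match_goals`, `overall_max`, `july_max`) are made up for illustration, and this is not necessarily the version the course uses. The idea is to compute the per-match total once, derive the two single-row maxima from it, and attach them to every season with a `CROSS JOIN`.

```sql
-- Total goals per match, computed once and reused below
WITH match_goals AS (
    SELECT
        season,
        date,
        (home_goal + away_goal) AS total_goals
    FROM soccer.match
),
-- Single row: highest total across all matches
overall_max AS (
    SELECT MAX(total_goals) AS overall_max_goals
    FROM match_goals
),
-- Single row: highest total among matches played in July
july_max AS (
    SELECT MAX(total_goals) AS july_max_goals
    FROM match_goals
    WHERE EXTRACT(MONTH FROM date) = 7
)
SELECT
    mg.season,
    MAX(mg.total_goals) AS max_goals,
    o.overall_max_goals,
    j.july_max_goals
FROM match_goals AS mg
-- Both helper CTEs return exactly one row, so the cross joins
-- only add columns and never multiply the match rows
CROSS JOIN overall_max AS o
CROSS JOIN july_max AS j
GROUP BY mg.season, o.overall_max_goals, j.july_max_goals
ORDER BY mg.season;
```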
### Using CTEs for modularity and ease of reading:

In [16]:
-- to be done

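For the modularity angle, here is a hedged sketch of the Great Britain versus World exercise from the "Explore Datasets" section. The CTE name `match_regions` and the `region` column are assumptions made for this example. The `league` table is left out because the country name alone decides the category; joining it as well would only be safe if every country has exactly one league.

```sql
-- Step 1: label every match with its region
WITH match_regions AS (
    SELECT
        m.id,
        CASE
            WHEN c.name IN ('England', 'Scotland', 'Wales') THEN 'Great Britain'
            ELSE 'World'
        END AS region
    FROM soccer.match AS m
    LEFT JOIN soccer.country AS c
        ON m.country_id = c.id
)
-- Step 2: count matches per region
SELECT
    region,
    COUNT(*) AS matches_played
FROM match_regions
GROUP BY region
ORDER BY region;
```

Keeping the labelling logic in the CTE means the final query stays a plain `GROUP BY`, and the `CASE` expression can be changed without touching the aggregation.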
## Subqueries

**Kinds: Correlated, Nested, SELECT, FROM, WHERE**

### Correlated Subqueries

**Advantages of using correlated subqueries:**

1. They work around the limits of joins: a complex query can be broken down into smaller, more manageable parts.

2. They can match specific conditions or values from one table against another, which allows more flexible and targeted queries.

3. They can simplify complex queries by filtering and aggregating data step by step.

4. They can sometimes improve query performance: in some cases a subquery is more efficient than a join.

**Disadvantages:**

Processing time can be high, because the inner query is evaluated once for every row the outer query processes.

**Example 3.3** 

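What was the highest-scoring match for each `country` in each `season`?

The output is a list of matches; if two matches in the same country and season share the highest number of goals, both are shown.

The subquery depends on the outer row, so it is evaluated repeatedly, conceptually once for every row of `match`, even though there are only `countries` x `seasons` distinct results.

**// _Takes about 16 seconds to run on DataCamp without the `LIMIT`_**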

In [29]:
SELECT 
	-- Select country ID, date, home, and away goals from match
	main.country_id,
    main.date,
    main.home_goal,
    main.away_goal
FROM soccer.match AS main
WHERE 
	-- Filter for matches with the highest number of goals scored
	(home_goal + away_goal) = 
        (SELECT MAX(sub.home_goal + sub.away_goal)
         FROM soccer.match AS sub
         WHERE main.country_id = sub.country_id
               AND main.season = sub.season)
LIMIT 10;

| country_id | date | home_goal | away_goal |
| --- | --- | --- | --- |
| 1 | 2011-10-29 00:00:00+00:00 | 4 | 5 |
| 1 | 2012-11-17 00:00:00+00:00 | 2 | 6 |
| 1 | 2012-12-09 00:00:00+00:00 | 1 | 7 |
| 1 | 2013-01-19 00:00:00+00:00 | 2 | 6 |
| 1 | 2012-08-19 00:00:00+00:00 | 2 | 6 |
| 1 | 2014-04-19 00:00:00+00:00 | 2 | 4 |
| 1 | 2014-04-26 00:00:00+00:00 | 4 | 2 |
| 1 | 2015-01-17 00:00:00+00:00 | 1 | 7 |
| 1 | 2014-09-13 00:00:00+00:00 | 3 | 5 |
| 1729 | 2011-08-28 00:00:00+00:00 | 8 | 2 |


**_// Rewritten as a CTE, runs in 5.5 seconds on DataCamp without the `LIMIT`_**

In [31]:
-- CTE that computes the max number of goals per country per season

WITH max_nr_goals AS (
	SELECT sub.country_id, sub.season, MAX(sub.home_goal + sub.away_goal) AS max_goals
	FROM soccer.match AS sub
	INNER JOIN soccer.match as sub2
	ON sub.date = sub2.date
	WHERE sub2.country_id = sub.country_id
	   AND sub2.season = sub.season
	GROUP BY sub.country_id, sub.season
)

SELECT 
	-- Select country ID, date, home, and away goals from match
	main.country_id,
    main.date,
    main.home_goal,
    main.away_goal	
FROM soccer.match AS main
-- join with the CTE to be able to use it
INNER JOIN max_nr_goals
-- on season and country
ON max_nr_goals.season = main.season AND max_nr_goals.country_id = main.country_id
WHERE 
	-- Filter for matches with the highest number of goals scored
	(home_goal + away_goal) = max_nr_goals.max_goals
LIMIT 10;

| country_id | date | home_goal | away_goal |
| --- | --- | --- | --- |
| 1 | 2011-10-29 00:00:00+00:00 | 4 | 5 |
| 1 | 2012-11-17 00:00:00+00:00 | 2 | 6 |
| 1 | 2012-12-09 00:00:00+00:00 | 1 | 7 |
| 1 | 2013-01-19 00:00:00+00:00 | 2 | 6 |
| 1 | 2012-08-19 00:00:00+00:00 | 2 | 6 |
| 1 | 2014-04-19 00:00:00+00:00 | 2 | 4 |
| 1 | 2014-04-26 00:00:00+00:00 | 4 | 2 |
| 1 | 2015-01-17 00:00:00+00:00 | 1 | 7 |
| 1 | 2014-09-13 00:00:00+00:00 | 3 | 5 |
| 1729 | 2011-08-28 00:00:00+00:00 | 8 | 2 |

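The self-join on `date` inside the `max_nr_goals` CTE above does not seem to be needed for the per-group maximum: every row of `match` already carries its own `country_id` and `season`, so a plain `GROUP BY` should yield the same values (assuming those columns and `date` contain no NULLs). A leaner sketch under that assumption follows; the alias `mx` and the added `ORDER BY` are choices made for this example, and no timing was measured for it.

```sql
-- Max number of goals per country per season, without a self-join
WITH max_nr_goals AS (
    SELECT
        country_id,
        season,
        MAX(home_goal + away_goal) AS max_goals
    FROM soccer.match
    GROUP BY country_id, season
)
SELECT
    main.country_id,
    main.date,
    main.home_goal,
    main.away_goal
FROM soccer.match AS main
-- Attach the matching country/season maximum to every match
INNER JOIN max_nr_goals AS mx
    ON mx.country_id = main.country_id
   AND mx.season = main.season
-- Keep only the matches that reach that maximum (ties included)
WHERE (main.home_goal + main.away_goal) = mx.max_goals
-- Without an ORDER BY, which 10 rows LIMIT returns is not guaranteed
ORDER BY main.country_id, main.date
LIMIT 10;
```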

### Nested Subqueries:

**Example 3.5**

    SELECT: Query
        SELECT: Sub-query
        SELECT: Sub-query
            WHERE: Sub-sub-query

Let's examine the highest total number of goals scored in a match for each season, overall across all seasons, and during July across all seasons.

In [17]:
SELECT
	-- Select the season and max goals scored in a match
	season,
    MAX(home_goal + away_goal) AS max_goals,
	
    -- Select the overall max goals scored in a match
   (SELECT MAX(home_goal + away_goal) FROM soccer.match) AS overall_max_goals,
   
   -- Select the max number of goals scored in any match in July
   (SELECT MAX(home_goal + away_goal) 
    FROM soccer.match
    WHERE id IN (
          SELECT id FROM soccer.match WHERE EXTRACT(MONTH FROM date) = 07)) AS july_max_goals
FROM soccer.match
GROUP BY season;

| season | max_goals | overall_max_goals | july_max_goals |
| --- | --- | --- | --- |
| 2013/2014 | 10 | 11 | 7 |
| 2012/2013 | 11 | 11 | 7 |
| 2014/2015 | 10 | 11 | 7 |
| 2011/2012 | 10 | 11 | 7 |


### SELECT Subqueries:

**Example 2.9**

The following query lets us compare each league's average with the overall average.

Note: a subquery in `SELECT` must return a single value (one row, one column); otherwise the query raises an error.

In [32]:
SELECT
	-- Select the league name and average goals scored
	l.name AS league,
	ROUND(AVG(m.home_goal + m.away_goal),2) AS avg_goals,
    -- Subtract the overall average from the league average
	ROUND(AVG(m.home_goal + m.away_goal) - 
		(SELECT AVG(home_goal + away_goal)
		 FROM soccer.match 
         WHERE season = '2013/2014'),2) AS diff_to_overall
FROM soccer.league AS l
LEFT JOIN soccer.match AS m
ON l.country_id = m.country_id
-- Only include 2013/2014 results
WHERE season = '2013/2014'
GROUP BY l.name
LIMIT 5;

| league | avg_goals | diff_to_overall |
| --- | --- | --- |
| Switzerland Super League | 2.89 | 0.12 |
| Poland Ekstraklasa | 2.64 | -0.13 |
| Netherlands Eredivisie | 3.2 | 0.43 |
| Scotland Premier League | 2.75 | -0.02 |
| France Ligue 1 | 2.46 | -0.31 |


### FROM Subqueries:

**Example 2.7**

The subquery below joins the `match` and `country` tables, so the main query can refer to each country by name alongside the match data.

In [1]:
SELECT
	-- Select country, date, home, and away goals from the subquery
    subq.country,
    subq.date,
    subq.home_goal,
    subq.away_goal
FROM 
	-- Select country name, date, home_goal, away_goal, and total goals in the subquery
	(SELECT c.name AS country, 
     	    m.date, 
     		m.home_goal, 
     		m.away_goal,
           (m.home_goal + m.away_goal) AS total_goals
    FROM soccer.match AS m
    LEFT JOIN soccer.country AS c
    ON m.country_id = c.id) AS subq
-- Filter by total goals scored in the main query
WHERE total_goals >= 10;

| country | date | home_goal | away_goal |
| --- | --- | --- | --- |
| England | 2011-08-28 00:00:00+00:00 | 8 | 2 |
| England | 2012-12-29 00:00:00+00:00 | 7 | 3 |
| England | 2013-05-19 00:00:00+00:00 | 5 | 5 |
| Germany | 2013-03-30 00:00:00+00:00 | 9 | 2 |
| Netherlands | 2011-11-06 00:00:00+00:00 | 6 | 4 |
| Spain | 2013-10-30 00:00:00+00:00 | 7 | 3 |
| Spain | 2015-04-05 00:00:00+00:00 | 9 | 1 |
| Spain | 2015-05-23 00:00:00+00:00 | 7 | 3 |
| Spain | 2014-09-20 00:00:00+00:00 | 2 | 8 |

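The list of subquery kinds at the top of this section also names `WHERE` subqueries. As a hedged sketch, the query below applies one to the subquery exercise from "Explore Datasets": it keeps the countries whose average goals per match exceed the average over all matches. The alias `per_country` and the rounding to two decimals are assumptions made for this example.

```sql
SELECT
    country,
    ROUND(avg_goals, 2) AS avg_goals
FROM
    -- FROM subquery: average goals per match for each country
    (SELECT
         c.name AS country,
         AVG(m.home_goal + m.away_goal) AS avg_goals
     FROM soccer.match AS m
     INNER JOIN soccer.country AS c
         ON m.country_id = c.id
     GROUP BY c.name) AS per_country
-- WHERE subquery: a single value, the average over all matches
WHERE avg_goals >
      (SELECT AVG(home_goal + away_goal) FROM soccer.match)
ORDER BY avg_goals DESC;
```

The subquery in `WHERE` is not correlated: it does not refer to the outer query, so it only needs to be evaluated once.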

## CASE statements

**Example 1.7** 

Using a CASE statement in the `WHERE` clause, let's generate a list of matches won by Italy's Bologna team.

In [33]:
-- Select the season, date, home_goal, and away_goal columns
SELECT 
	season,
    date,
	home_goal,
	away_goal
FROM soccer.match
WHERE 
    -- Exclude games not won by Bologna: the CASE returns NULL when
    -- Bologna doesn't win, and IS NOT NULL filters those rows out
    CASE 
        WHEN hometeam_id = 9857 AND home_goal > away_goal 
            THEN 'Bologna Win'
        WHEN awayteam_id = 9857 AND away_goal > home_goal 
            THEN 'Bologna Win' 
    END IS NOT NULL
LIMIT 10;
		
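A CASE statement can also sit in the `SELECT` list to turn each result into a number. The sketch below applies this to the Chelsea exercise from "Explore Datasets" (team id 8455, season "2014/2015", 3 points for a win, 1 for a tie, 0 for a defeat). The running total is computed with a window function, `SUM(...) OVER (...)`, which is a choice made for this example rather than something these notes prescribe; the CTE name `chelsea_points` is likewise an assumption.

```sql
-- Points earned by Chelsea in each of its 2014/2015 matches
WITH chelsea_points AS (
    SELECT
        date,
        CASE
            WHEN hometeam_id = 8455 AND home_goal > away_goal THEN 3
            WHEN awayteam_id = 8455 AND away_goal > home_goal THEN 3
            WHEN home_goal = away_goal THEN 1
            ELSE 0
        END AS points
    FROM soccer.match
    WHERE season = '2014/2015'
      AND (hometeam_id = 8455 OR awayteam_id = 8455)
)
SELECT
    date,
    points,
    -- Sum of all points up to and including the current match
    SUM(points) OVER (
        ORDER BY date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_total
FROM chelsea_points
ORDER BY date;
```

The `WHERE` clause restricts the rows to Chelsea's matches first, so the `ELSE 0` branch can only mean a Chelsea defeat.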