**Setup**

In [2]:
# Library
import pandas as pd
import sqlite3

# Load CSV data into a DataFrame
# Raw string so the backslashes in the Windows path are not treated as escape sequences
data = pd.read_csv(r'D:\Code\DE\PostgreSQL Summary Stats and Window Functions\summer.csv')

# Create an in-memory SQLite database
conn = sqlite3.connect(':memory:')

# Store the DataFrame as a table in the database
data.to_sql('Summer_Medals', conn, index=False)

31165

**Fetching**

**Future gold medalists**

Fetching functions allow you to get values from different parts of the table into one row. If you have time-ordered data, you can "peek into the future" with the `LEAD` fetching function. This is especially useful if you want to compare a current value to a future value.

**Instructions**

- For each year, fetch the current gold medalist and the gold medalist 3 competitions ahead of the current row.

In [2]:
query = """
WITH Discus_Medalists AS (
  SELECT DISTINCT
    Year,
    Athlete
  FROM Summer_Medals
  WHERE Medal = 'Gold'
    AND Event = 'Discus Throw'
    AND Gender = 'Women'
    AND Year >= 2000)

SELECT
  -- For each year, fetch the current and future medalists
  Year,
  Athlete,
  LEAD(Athlete, 3) OVER (ORDER BY Year ASC) AS Future_Champion
FROM Discus_Medalists
ORDER BY Year ASC;
"""
result = pd.read_sql_query(query, conn)
result

Unnamed: 0,Year,Athlete,Future_Champion
0,2000,ZVEREVA Ellina,PERKOVIC Sandra
1,2004,SADOVA Natalya,
2,2008,BROWN TRAFTON Stephanie,
3,2012,PERKOVIC Sandra,
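
The empty `Future_Champion` cells above are NULLs: `LEAD` finds no row three positions ahead. As a minimal sketch, assuming a hypothetical four-row `Champions` table with placeholder athlete names (not the real data), the optional third argument of `LEAD` supplies a fallback value instead. The example uses a plain `sqlite3` cursor rather than pandas to stay dependency-free, and assumes SQLite 3.25 or newer, which is when window functions were added.

```python
import sqlite3

# Hypothetical toy table, separate from the notebook's Summer_Medals data
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Champions (Year INTEGER, Athlete TEXT)")
demo.executemany(
    "INSERT INTO Champions VALUES (?, ?)",
    [(2000, "A"), (2004, "B"), (2008, "C"), (2012, "D")],
)

lead_rows = demo.execute("""
    SELECT
      Year,
      Athlete,
      -- Two-argument form: NULL when no row exists 3 positions ahead
      LEAD(Athlete, 3) OVER (ORDER BY Year ASC) AS Future_Champion,
      -- Three-argument form: returns the given default instead of NULL
      LEAD(Athlete, 3, 'n/a') OVER (ORDER BY Year ASC) AS Future_Or_Default
    FROM Champions
    ORDER BY Year ASC
""").fetchall()

for row in lead_rows:
    print(row)
# (2000, 'A', 'D', 'D')
# (2004, 'B', None, 'n/a')
# (2008, 'C', None, 'n/a')
# (2012, 'D', None, 'n/a')
```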


**First athlete by name**

It's often useful to get the first or last value in a dataset to compare all other values to it. With absolute fetching functions like `FIRST_VALUE`, you can fetch a value at an absolute position in the table, like its beginning or end.

**Instructions**

- Return all athletes and the first athlete in alphabetical order.

In [3]:
query = """
WITH All_Male_Medalists AS (
  SELECT DISTINCT
    Athlete
  FROM Summer_Medals
  WHERE Medal = 'Gold'
    AND Gender = 'Men')

SELECT
  -- Fetch all athletes and the first athlete alphabetically
  Athlete,
  FIRST_VALUE(Athlete) OVER (
    ORDER BY Athlete ASC
  ) AS First_Athlete
FROM All_Male_Medalists;
"""
result = pd.read_sql_query(query, conn)
result

Unnamed: 0,Athlete,First_Athlete
0,AABYE Edgar,AABYE Edgar
1,AALTONEN Paavo Johannes,AABYE Edgar
2,AAS Thomas Valentin,AABYE Edgar
3,ABALMASAU Aliaksei,AABYE Edgar
4,ABALO Luc,AABYE Edgar
...,...,...
6240,ÖRVIG Thor,AABYE Edgar
6241,ÖSTERVOLD Henrik,AABYE Edgar
6242,ÖSTERVOLD Jan Olsen,AABYE Edgar
6243,ÖSTERVOLD Kristian Olsen,AABYE Edgar


**Last country by name**

Just like you can get the first row's value in a dataset, you can get the last row's value. This is often useful when you want to compare the most recent value to previous values.

**Instructions**

- Return the year and the host city of each Olympic Games.
- Fetch the last city in which the Olympic Games were held.

In [4]:
query = """
WITH Hosts AS (
  SELECT DISTINCT Year, City
    FROM Summer_Medals)

SELECT
  Year,
  City,
  -- Get the last city in which the Olympic games were held
  LAST_VALUE(City) OVER (
   ORDER BY Year ASC
   RANGE BETWEEN
     UNBOUNDED PRECEDING AND
     UNBOUNDED FOLLOWING
  ) AS Last_City
FROM Hosts
ORDER BY Year ASC;
"""
result = pd.read_sql_query(query, conn)
result

Unnamed: 0,Year,City,Last_City
0,1896,Athens,London
1,1900,Paris,London
2,1904,St Louis,London
3,1908,London,London
4,1912,Stockholm,London
5,1920,Antwerp,London
6,1924,Paris,London
7,1928,Amsterdam,London
8,1932,Los Angeles,London
9,1936,Berlin,London
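
The `RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING` clause in the query above matters. When a window has an `ORDER BY` but no explicit frame, the default frame ends at the current row, so `LAST_VALUE` simply returns the current row's value. The sketch below illustrates this on a small three-row `Hosts` table built only for this example (assuming SQLite 3.25 or newer).

```python
import sqlite3

# Small illustrative table, separate from the notebook's Summer_Medals data
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Hosts (Year INTEGER, City TEXT)")
demo.executemany(
    "INSERT INTO Hosts VALUES (?, ?)",
    [(2004, "Athens"), (2008, "Beijing"), (2012, "London")],
)

last_rows = demo.execute("""
    SELECT
      Year,
      City,
      -- Default frame: from the first row up to the current row
      LAST_VALUE(City) OVER (ORDER BY Year ASC) AS Default_Frame,
      -- Explicit frame: the whole ordered set of rows
      LAST_VALUE(City) OVER (
        ORDER BY Year ASC
        RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
      ) AS Full_Frame
    FROM Hosts
    ORDER BY Year ASC
""").fetchall()

for row in last_rows:
    print(row)
# (2004, 'Athens', 'Athens', 'London')
# (2008, 'Beijing', 'Beijing', 'London')
# (2012, 'London', 'London', 'London')
```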


**Ranking**

**Ranking athletes by medals earned**

In chapter 1, you used `ROW_NUMBER` to rank athletes by awarded medals. However, `ROW_NUMBER` assigns different numbers to athletes with the same count of awarded medals, so it's not a suitable ranking function when ties matter; if two athletes earned the same number of medals, they should have the same rank.
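
To make the difference concrete, here is a minimal sketch on a hypothetical four-athlete table (the names `A` to `D` and their medal counts are made up). The `Athlete ASC` tie-breaker inside the `ROW_NUMBER` window is an addition for this example so that the output is deterministic; it is not part of the exercise query. SQLite 3.25 or newer is assumed.

```python
import sqlite3

# Hypothetical toy data: B and C are tied on 3 medals
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Athlete_Medals (Athlete TEXT, Medals INTEGER)")
demo.executemany(
    "INSERT INTO Athlete_Medals VALUES (?, ?)",
    [("A", 5), ("B", 3), ("C", 3), ("D", 1)],
)

rank_rows = demo.execute("""
    SELECT
      Athlete,
      Medals,
      -- Always unique, even for tied medal counts
      ROW_NUMBER() OVER (ORDER BY Medals DESC, Athlete ASC) AS Row_N,
      -- Tied medal counts share a rank
      RANK() OVER (ORDER BY Medals DESC) AS Rank_N
    FROM Athlete_Medals
    ORDER BY Medals DESC, Athlete ASC
""").fetchall()

for row in rank_rows:
    print(row)
# ('A', 5, 1, 1)
# ('B', 3, 2, 2)
# ('C', 3, 3, 2)
# ('D', 1, 4, 4)
```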

**Instructions**

- Rank each athlete by the number of medals they've earned -- the higher the count, the higher the rank -- with identical numbers in case of identical values.

In [3]:
query = """
WITH Athlete_Medals AS (
  SELECT
    Athlete,
    COUNT(*) AS Medals
  FROM Summer_Medals
  GROUP BY Athlete)

SELECT
  Athlete,
  Medals,
  -- Rank athletes by the medals they've won
  RANK() OVER (ORDER BY Medals DESC) AS Rank_N
FROM Athlete_Medals
ORDER BY Medals DESC;
"""
result = pd.read_sql_query(query, conn)
result

Unnamed: 0,Athlete,Medals,Rank_N
0,PHELPS Michael,22,1
1,LATYNINA Larisa,18,2
2,ANDRIANOV Nikolay,15,3
3,MANGIAROTTI Edoardo,13,4
4,ONO Takashi,13,4
...,...,...,...
22757,ÖSTERVOLD Henrik,1,5267
22758,ÖSTERVOLD Jan Olsen,1,5267
22759,ÖSTERVOLD Kristian Olsen,1,5267
22760,ÖSTERVOLD Ole Olsen,1,5267


**Ranking athletes from multiple countries**

In the previous exercise, you used `RANK` to assign rankings to one group of athletes. In real-world data, however, you'll often find numerous groups within your data. Without partitioning your data, one group's values will influence the rankings of the others.

Also, while `RANK` skips numbers in case of identical values, the most natural way to assign rankings is not to skip numbers. If two countries are tied for second place, the country after them is considered to be third by most people.
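
Both points can be sketched on a small hypothetical table (athlete names `A` to `F` and their medal counts are invented; only the country codes come from the exercise). The query compares `RANK` and `DENSE_RANK` within each country, plus an unpartitioned `DENSE_RANK` to show how one country's values shift the other's rankings. SQLite 3.25 or newer is assumed.

```python
import sqlite3

# Hypothetical toy data for two countries
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE Athlete_Medals (Country TEXT, Athlete TEXT, Medals INTEGER)"
)
demo.executemany(
    "INSERT INTO Athlete_Medals VALUES (?, ?, ?)",
    [
        ("JPN", "A", 4), ("JPN", "B", 2), ("JPN", "C", 2), ("JPN", "D", 1),
        ("KOR", "E", 3), ("KOR", "F", 1),
    ],
)

dense_rows = demo.execute("""
    SELECT
      Country,
      Athlete,
      -- Per country, skipping numbers after a tie
      RANK() OVER (PARTITION BY Country ORDER BY Medals DESC) AS Rank_Gaps,
      -- Per country, without skipping numbers
      DENSE_RANK() OVER (PARTITION BY Country ORDER BY Medals DESC) AS Rank_Dense,
      -- No partition: both countries compete in a single ranking
      DENSE_RANK() OVER (ORDER BY Medals DESC) AS Rank_Global
    FROM Athlete_Medals
    ORDER BY Country ASC, Athlete ASC
""").fetchall()

for row in dense_rows:
    print(row)
# ('JPN', 'A', 1, 1, 1)
# ('JPN', 'B', 2, 2, 3)
# ('JPN', 'C', 2, 2, 3)
# ('JPN', 'D', 4, 3, 4)
# ('KOR', 'E', 1, 1, 2)
# ('KOR', 'F', 2, 2, 4)
```

Athlete `D` shows the gap behaviour (`RANK` gives 4, `DENSE_RANK` gives 3), and athlete `E` shows the effect of partitioning (first within KOR, second overall).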

**Instructions**

- Rank each country's athletes by the count of medals they've earned -- the higher the count, the higher the rank -- without skipping numbers in case of identical values.

In [4]:
query = """
WITH Athlete_Medals AS (
  SELECT
    Country, Athlete, COUNT(*) AS Medals
  FROM Summer_Medals
  WHERE
    Country IN ('JPN', 'KOR')
    AND Year >= 2000
  GROUP BY Country, Athlete
  HAVING COUNT(*) > 1)

SELECT
  Country,
  -- Rank athletes in each country by the medals they've won
  Athlete,
  DENSE_RANK() OVER (PARTITION BY Country
                         ORDER BY Medals DESC) AS Rank_N
FROM Athlete_Medals
ORDER BY Country ASC, Rank_N ASC;
"""
result = pd.read_sql_query(query, conn)
result

Unnamed: 0,Country,Athlete,Rank_N
0,JPN,KITAJIMA Kosuke,1
1,JPN,UCHIMURA Kohei,2
2,JPN,TACHIBANA Miya,3
3,JPN,TAKEDA Miho,3
4,JPN,ICHO Kaori,4
...,...,...,...
69,KOR,OH Yong Ran,4
70,KOR,PARK Jinman,4
71,KOR,PARK Kyung-Mo,4
72,KOR,YOO Yong-Sung,4


**Paging**

**Paging events**

There are exactly 666 unique events in the Summer Medals Olympics dataset. If you want to chunk them up to analyze them piece by piece, you'll need to split the events into groups of approximately equal size.
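
"Approximately equal" becomes relevant when the row count is not a multiple of the number of groups. As a rough sketch, assuming a hypothetical table of seven placeholder events `E1` to `E7`, `NTILE(3)` produces groups of 3, 2, and 2 rows, with the larger group first. By contrast, 666 events split evenly into 111 groups of 6. SQLite 3.25 or newer is assumed.

```python
import sqlite3

# Hypothetical toy table with 7 placeholder events
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Events (Event TEXT)")
demo.executemany(
    "INSERT INTO Events VALUES (?)",
    [(f"E{i}",) for i in range(1, 8)],
)

paged = demo.execute("""
    SELECT
      Event,
      -- 7 rows cannot be split evenly into 3 pages
      NTILE(3) OVER (ORDER BY Event ASC) AS Page
    FROM Events
    ORDER BY Event ASC
""").fetchall()

# Count how many events landed on each page
page_sizes = {}
for _, page in paged:
    page_sizes[page] = page_sizes.get(page, 0) + 1

print(page_sizes)         # {1: 3, 2: 2, 3: 2}
print(divmod(666, 111))   # (6, 0): 111 pages of exactly 6 events each
```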

**Instructions**

- Split the distinct events into exactly 111 groups, ordered by event in alphabetical order.

In [5]:
query = """
WITH Events AS (
  SELECT DISTINCT Event
  FROM Summer_Medals)
  
SELECT
  -- Split up the distinct events into 111 unique groups
  Event,
  NTILE(111) OVER (ORDER BY Event ASC) AS Page
FROM Events
ORDER BY Event ASC;
"""
result = pd.read_sql_query(query, conn)
result

Unnamed: 0,Event,Page
0,+ 100KG,1
1,+ 100KG (Heavyweight),1
2,+ 100KG (Super Heavyweight),1
3,+ 105KG,1
4,+ 108KG Total (Super Heavyweight),1
...,...,...
661,York Round (100Y - 80Y - 60Y),111
662,Épée Amateurs And Masters,111
663,Épée Individual,111
664,Épée Masters,111


**Top, middle, and bottom thirds**

Splitting your data into thirds or quartiles is often useful to understand how the values in your dataset are spread. Getting summary statistics (averages, sums, standard deviations, etc.) of the top, middle, and bottom thirds can help you determine what distribution your values follow.

**Instructions**

- Split the athletes into top, middle, and bottom thirds based on their count of medals.

In [6]:
query = """
WITH Athlete_Medals AS (
  SELECT Athlete, COUNT(*) AS Medals
  FROM Summer_Medals
  GROUP BY Athlete
  HAVING COUNT(*) > 1)
  
SELECT
  Athlete,
  Medals,
  -- Split athletes into thirds by their earned medals
  NTILE(3) OVER (ORDER BY Medals DESC) AS Third
FROM Athlete_Medals
ORDER BY Medals DESC, Athlete ASC;
"""
result = pd.read_sql_query(query, conn)
result

Unnamed: 0,Athlete,Medals,Third
0,PHELPS Michael,22,1
1,LATYNINA Larisa,18,1
2,ANDRIANOV Nikolay,15,1
3,MANGIAROTTI Edoardo,13,1
4,ONO Takashi,13,1
...,...,...,...
5261,ZVEREVA Ellina,2,3
5262,ZWERVER Ronald,2,3
5263,ZWOLLE Hendrik Jan,2,3
5264,ZYKINA Olesya,2,3


- Return the average medal count of each third.

In [7]:
query = """
WITH Athlete_Medals AS (
  SELECT Athlete, COUNT(*) AS Medals
  FROM Summer_Medals
  GROUP BY Athlete
  HAVING COUNT(*) > 1),
  
  Thirds AS (
  SELECT
    Athlete,
    Medals,
    NTILE(3) OVER (ORDER BY Medals DESC) AS Third
  FROM Athlete_Medals)
  
SELECT
  -- Get the average medals earned in each third
  Third,
  AVG(Medals) AS Avg_Medals
FROM Thirds
GROUP BY Third
ORDER BY Third ASC;
"""
result = pd.read_sql_query(query, conn)
result

Unnamed: 0,Third,Avg_Medals
0,1,3.786446
1,2,2.0
2,3,2.0
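
The second and third thirds both average exactly 2.0 because `NTILE` splits by row count, not by value: athletes tied on the same medal count can land in different thirds. The sketch below reproduces the effect on a hypothetical six-athlete table (names and counts are made up). The `Athlete ASC` tie-breaker in the window is an addition for this example to keep the assignment deterministic. SQLite 3.25 or newer is assumed.

```python
import sqlite3

# Hypothetical toy data: four athletes tied on 2 medals
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Athlete_Medals (Athlete TEXT, Medals INTEGER)")
demo.executemany(
    "INSERT INTO Athlete_Medals VALUES (?, ?)",
    [("A", 5), ("B", 3), ("C", 2), ("D", 2), ("E", 2), ("F", 2)],
)

third_rows = demo.execute("""
    WITH Thirds AS (
      SELECT
        Athlete,
        Medals,
        -- 6 rows into 3 tiles: 2 rows per tile, regardless of ties
        NTILE(3) OVER (ORDER BY Medals DESC, Athlete ASC) AS Third
      FROM Athlete_Medals)

    SELECT Third, AVG(Medals) AS Avg_Medals
    FROM Thirds
    GROUP BY Third
    ORDER BY Third ASC
""").fetchall()

for row in third_rows:
    print(row)
# (1, 4.0)
# (2, 2.0)
# (3, 2.0)
```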
