# **Extract, Transform, Load (ETL) Process**

## Objectives

* The objective of this notebook is to perform an ETL process of the data I have gathered from Kaggle.
* I have fetched the data which you can find in the datasets/raw folder.
* There are 4 related csv files that contain foreign keys that relate to eachother such as player_id, match_id and tournament_id. I will use these keys to clean and merge the data.
## Inputs

* My inputs are 4 csv files:
    * matches.csv
    * players.csv
    * scores.csv
    * tournaments.csv

## Outputs

* I am going to output one main table. From this table I will filter information into smaller csv files where I can work on them more easily to create visualisations.

## Additional Comments

* If you have any additional comments that don't fit in the previous bullets, please state them here. 



---

# Change working directory

* We are assuming you will store the notebooks in a subfolder, therefore when running the notebook in the editor, you will need to change the working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir


'/Users/stephenbeese/GitHub/Snooker-Data-Analysis/Snooker-Data-Analysis/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/Users/stephenbeese/GitHub/Snooker-Data-Analysis/Snooker-Data-Analysis'

# Imports

To clean and merge this data I will first need to import some libraries

In [4]:
import numpy as np
import pandas as pd

# Set up data directories

In [5]:
# Set the file path for raw data
raw_data_dir = os.path.join(current_dir, 'datasets/raw')

# Set the file path for clean data
clean_data_dir = os.path.join(current_dir, 'datasets/clean')

# Section 1 - matches.csv

In [6]:
# Load all csv files into Pandas DataFrames
df_matches = pd.read_csv(os.path.join(raw_data_dir, 'matches.csv'))
df_players = pd.read_csv(os.path.join(raw_data_dir, 'players.csv'))
df_scores = pd.read_csv(os.path.join(raw_data_dir, 'scores.csv'))
df_tournaments = pd.read_csv(os.path.join(raw_data_dir, 'tournaments.csv'))


---

In [7]:
# Display the head of df_matches
df_matches.head()

Unnamed: 0,tournament_id,match_id,date,stage,best_of,player1_name,player1_url,player2_name,player2_url,score1,score2,frames_scores,is_walkover
0,753,82716,,Final,31,Terry Griffiths,https://cuetracker.net/players/terry-griffiths,Alex Higgins,https://cuetracker.net/players/alex-higgins,16,15,20-58; 31-90; 56-52; 26-87(67); 0-114(67); 73(...,False
1,753,82718,,Semi-final,17,Terry Griffiths,https://cuetracker.net/players/terry-griffiths,Tony Meo,https://cuetracker.net/players/tony-meo,9,7,71-8; 50-71(55); 31-62; 69-30; 73-61; 34-77(52...,False
2,753,82717,,Semi-final,17,Alex Higgins,https://cuetracker.net/players/alex-higgins,Ray Reardon,https://cuetracker.net/players/ray-reardon,9,6,28-71; 67(50)-29; 74(74)-0; 53-79; 60-54; 112(...,False
3,753,82721,,Quarter-final,17,Terry Griffiths,https://cuetracker.net/players/terry-griffiths,Steve Davis,https://cuetracker.net/players/steve-davis,9,6,"1-103; 117(60,57)-6; 5-105(60); 57-60; 79-0; 2...",False
4,753,82719,,Quarter-final,17,Alex Higgins,https://cuetracker.net/players/alex-higgins,John Spencer,https://cuetracker.net/players/john-spencer,9,5,69(54)-31; 103-21; 72-48; 33-82; 40-56; 71-51;...,False


In [8]:
# Check missing values in the dataframe
df_matches.isnull().sum()

tournament_id         0
match_id              0
date             138711
stage                 0
best_of               0
player1_name          0
player1_url           0
player2_name          0
player2_url           0
score1                0
score2                0
frames_scores    123278
is_walkover           0
dtype: int64

From the code above we can see that there are a lot of missing values in both the `date` column and `frames_scores` column.

As we are not interested in `frames_scores` we can remove this column completely.

The `date` column however is useful to us, luckily in the `tournaments.csv` file it contains both the `season` and `year` of each tournament. We can use the `tournament_id` to get this information and create a new column in the `matches.csv` file.

We will do this later but for now we can drop the `date` and `frames_scores` as they are not necessary.

Looking further into the DataFrame, there are some more unecessary columns that won't be needed for the analysis we are looking at in this project.
* `player1_url`
* `player2_url`
* `is_walkover`

In [9]:
# Create a copy of df_matches
df_matches = df_matches.copy()

# Delete date and frames_scores from df_matches
df_matches = df_matches.drop(columns=['date', 'frames_scores', 'player1_url', 'player2_url', 'is_walkover'])

In [10]:
# Display the head of df_matches after cleaning
df_matches.head()

Unnamed: 0,tournament_id,match_id,stage,best_of,player1_name,player2_name,score1,score2
0,753,82716,Final,31,Terry Griffiths,Alex Higgins,16,15
1,753,82718,Semi-final,17,Terry Griffiths,Tony Meo,9,7
2,753,82717,Semi-final,17,Alex Higgins,Ray Reardon,9,6
3,753,82721,Quarter-final,17,Terry Griffiths,Steve Davis,9,6
4,753,82719,Quarter-final,17,Alex Higgins,John Spencer,9,5


In [11]:
# Check missing values in the dataframe
df_matches.isnull().sum()

tournament_id    0
match_id         0
stage            0
best_of          0
player1_name     0
player2_name     0
score1           0
score2           0
dtype: int64

In [12]:
# Check for duplicate matches
df_matches[df_matches.duplicated()]

Unnamed: 0,tournament_id,match_id,stage,best_of,player1_name,player2_name,score1,score2


As you can see from the code cell above we now have a cleaned `matches.csv` dataset with no null values or duplicates.

In the next cell I will save this to a new .csv file and continue working on the other dataframes.

In [13]:
# Save cleaned DataFrame to a new .csv file
df_matches.to_csv(os.path.join(clean_data_dir, 'matches_cleaned.csv'), index=False)

# Section 2 - players.csv

Section 2 content

In [14]:
# Display the head of df_players
df_players.head()

Unnamed: 0,url,id,first_name,last_name,full_name,country
0,https://cuetracker.net/players/mohammed-a-belg...,mohammed-a-belgaizi,Mohammed,A Belgaizi,Mohammed A Belgaizi,United Arab Emirates
1,https://cuetracker.net/players/ishaq-a-khaleg,ishaq-a-khaleg,Ishaq,A Khaleg,Ishaq A Khaleg,Bahrain
2,https://cuetracker.net/players/ahmed-a-asere,ahmed-a-asere,Ahmed,A. Asere,Ahmed A. Asere,Saudi Arabia
3,https://cuetracker.net/players/magnus-aagaard,magnus-aagaard,Magnus,Aagaard,Magnus Aagaard,Denmark
4,https://cuetracker.net/players/asbjorn-aalberg,asbjorn-aalberg,Asbjorn,Aalberg,Asbjorn Aalberg,Norway


In [15]:
# Check for missing values
df_players.isnull().sum()

url           0
id            0
first_name    0
last_name     0
full_name     0
country       0
dtype: int64

In [16]:
# Check for duplicate players
df_players[df_players.duplicated()]

Unnamed: 0,url,id,first_name,last_name,full_name,country


As you can see from the code above there are no missing values or duplicates in this table.

This means we can look at the columns and decide what is useful to our project outcomes.

As we are looking at international players we are interested in the country and the player's names.

This means we can drop the player_url as that isn't crucial for our analysis.

I am going to drop this column and then save it to a new `.csv` file.

In [17]:
# Create a copy of df_players
df_players = df_players.copy()

# Drop the player_url column
df_players = df_players.drop(columns=['url'])

# Display the head of df_players
df_players.head()

Unnamed: 0,id,first_name,last_name,full_name,country
0,mohammed-a-belgaizi,Mohammed,A Belgaizi,Mohammed A Belgaizi,United Arab Emirates
1,ishaq-a-khaleg,Ishaq,A Khaleg,Ishaq A Khaleg,Bahrain
2,ahmed-a-asere,Ahmed,A. Asere,Ahmed A. Asere,Saudi Arabia
3,magnus-aagaard,Magnus,Aagaard,Magnus Aagaard,Denmark
4,asbjorn-aalberg,Asbjorn,Aalberg,Asbjorn Aalberg,Norway


In [18]:
# Save df_players to a CSV file
df_players.to_csv(os.path.join(clean_data_dir, 'players_cleaned.csv'), index=False)

---

# Section 3 - tournaments.csv

In [19]:
# Show the head of df_tournaments
df_tournaments

Unnamed: 0,id,season,year,name,full_name,url,status,category,prize,country,city
0,753,1982-1983,1982,UK Championship,1982 UK Championship,https://cuetracker.net/tournaments/uk-champion...,Professional,Non-ranking,47000.0,England,Preston
1,1140,1982-1983,1982,World Amateur Championship - Men,1982 World Amateur Championship - Men,https://cuetracker.net/tournaments/world-amate...,Amateur,World Event,0.0,Canada,Calgary
2,762,1982-1983,1982,Professional Players Tournament,1982 Professional Players Tournament,https://cuetracker.net/tournaments/professiona...,Professional,Ranking,31500.0,England,Birmingham
3,2586,1982-1983,1982,Pontins Autumn Open,1982 Pontins Autumn Open,https://cuetracker.net/tournaments/pontins-aut...,Pro-am,Event,0.0,Wales,Prestatyn
4,754,1982-1983,1982,International Open,1982 International Open,https://cuetracker.net/tournaments/internation...,Professional,Ranking,73500.0,England,Derby
...,...,...,...,...,...,...,...,...,...,...,...
2715,2884,2018-2019,2019,Guernsey Amateur Championship,2019 Guernsey Amateur Championship,https://cuetracker.net/tournaments/guernsey-am...,Amateur,National Championship,0.0,Guernsey,Various
2716,3321,2019-2020,2020,English Amateur Championship,2020 English Amateur Championship,https://cuetracker.net/tournaments/english-ama...,Amateur,National Championship,0.0,England,Cheltenham
2717,3362,,2020,3 Kings Open,2020 3 Kings Open,https://cuetracker.net/tournaments/3-kings-ope...,Pro-am,Event,0.0,Austria,Rankweil
2718,3357,2019-2020,2020,Singapore Amateur Championship,2020 Singapore Amateur Championship,https://cuetracker.net/tournaments/singapore-a...,Amateur,National Championship,0.0,Singapore,Singapore


As we are only looking at the professional tour, I am going to filter these to ensure we are only left with tournaments that took place on the professional tour.

In [20]:
# Copy the df_tournaments DataFrame
df_tournaments = df_tournaments.copy()

# filter for professional status tournaments
df_tournaments = df_tournaments[df_tournaments['status'] == 'Professional']

df_tournaments

Unnamed: 0,id,season,year,name,full_name,url,status,category,prize,country,city
0,753,1982-1983,1982,UK Championship,1982 UK Championship,https://cuetracker.net/tournaments/uk-champion...,Professional,Non-ranking,47000.0,England,Preston
2,762,1982-1983,1982,Professional Players Tournament,1982 Professional Players Tournament,https://cuetracker.net/tournaments/professiona...,Professional,Ranking,31500.0,England,Birmingham
4,754,1982-1983,1982,International Open,1982 International Open,https://cuetracker.net/tournaments/internation...,Professional,Ranking,73500.0,England,Derby
5,759,1982-1983,1982,Scottish Masters,1982 Scottish Masters,https://cuetracker.net/tournaments/scottish-ma...,Professional,Invitational,23000.0,Scotland,Glasgow
10,795,1982-1983,1982,Australian Masters,1982 Australian Masters,https://cuetracker.net/tournaments/australian-...,Professional,Invitational,18568.0,Australia,Sydney
...,...,...,...,...,...,...,...,...,...,...,...
2701,2852,2018-2019,2019,World Grand Prix,2019 World Grand Prix,https://cuetracker.net/tournaments/world-grand...,Professional,Ranking,370000.0,England,Cheltenham
2703,2777,2018-2019,2019,German Masters,2019 German Masters,https://cuetracker.net/tournaments/german-mast...,Professional,Ranking,395000.0,Germany,Berlin
2710,2821,2018-2019,2019,Masters,2019 Masters,https://cuetracker.net/tournaments/masters/201...,Professional,Invitational,590000.0,England,London
2713,2820,2018-2019,2019,Championship League,2019 Championship League,https://cuetracker.net/tournaments/championshi...,Professional,League,177800.0,England,Coventry


In [21]:
# Check missing values in DataFrame
df_tournaments.isnull().sum()

id            0
season        1
year          0
name          0
full_name     0
url           0
status        0
category      0
prize         0
country      30
city          1
dtype: int64

After filtering this dataframe to only show Professional tournaments you can see that we have some null values.

The columns that contain null values are:
* `season`
* `country`
* `city`

As our project aim is regarding the international growth of the sport we are going to drop any tournaments that do not have a country associated to them

In [22]:
# Drop rows with null values in country column
df_tournaments = df_tournaments[df_tournaments['country'].notna()]

# Check missing values in DataFrame
df_tournaments.isnull().sum()


id           0
season       1
year         0
name         0
full_name    0
url          0
status       0
category     0
prize        0
country      0
city         1
dtype: int64

After removing rows with null country values we are now left with two rows that contain missing values.

* One row is missing a `season` value
* One row is missing a `city` value

In [23]:
# View rows with null in specific columns
rows_with_nulls = df_tournaments[df_tournaments.isna().any(axis=1)]
rows_with_nulls

Unnamed: 0,id,season,year,name,full_name,url,status,category,prize,country,city
693,357,1997-1998,1997,Malta Grand Prix,1997 Malta Grand Prix,https://cuetracker.net/tournaments/malta-grand...,Professional,Invitational,5000.0,Malta,
2005,1044,,2014,EBSA Qualifying Tour - Play-offs,2014 EBSA Qualifying Tour - Play-offs,https://cuetracker.net/tournaments/ebsa-qualif...,Professional,Tour Qualifier,0.0,England,Sheffield


To fix these missing values I am first going to remove the `city` column as it is not necessary for our analysis.

I will also fill in the season based on the `year` column and format the value in the way it is presented in the other columns (`YYYY-YYYY`). I have found some information regarding this tournament <a href="https://cuetracker.net/tournaments/ebsa-qualifying-tour-play-offs/2014/1044" target="_blank" rel="noopener">here</a>. It states that the tournament was played in the 2013-2014 season so I will update this cell with that information.

In [24]:
# Remove city column
df_tournaments = df_tournaments.drop(columns=['city'])

# Display DataFrame
df_tournaments

Unnamed: 0,id,season,year,name,full_name,url,status,category,prize,country
0,753,1982-1983,1982,UK Championship,1982 UK Championship,https://cuetracker.net/tournaments/uk-champion...,Professional,Non-ranking,47000.0,England
2,762,1982-1983,1982,Professional Players Tournament,1982 Professional Players Tournament,https://cuetracker.net/tournaments/professiona...,Professional,Ranking,31500.0,England
4,754,1982-1983,1982,International Open,1982 International Open,https://cuetracker.net/tournaments/internation...,Professional,Ranking,73500.0,England
5,759,1982-1983,1982,Scottish Masters,1982 Scottish Masters,https://cuetracker.net/tournaments/scottish-ma...,Professional,Invitational,23000.0,Scotland
10,795,1982-1983,1982,Australian Masters,1982 Australian Masters,https://cuetracker.net/tournaments/australian-...,Professional,Invitational,18568.0,Australia
...,...,...,...,...,...,...,...,...,...,...
2701,2852,2018-2019,2019,World Grand Prix,2019 World Grand Prix,https://cuetracker.net/tournaments/world-grand...,Professional,Ranking,370000.0,England
2703,2777,2018-2019,2019,German Masters,2019 German Masters,https://cuetracker.net/tournaments/german-mast...,Professional,Ranking,395000.0,Germany
2710,2821,2018-2019,2019,Masters,2019 Masters,https://cuetracker.net/tournaments/masters/201...,Professional,Invitational,590000.0,England
2713,2820,2018-2019,2019,Championship League,2019 Championship League,https://cuetracker.net/tournaments/championshi...,Professional,League,177800.0,England


In [25]:
# Fill in season based on row index
df_tournaments.loc[2005, "season"] = "2013/2014"

In [26]:
# Check missing values
df_tournaments.isnull().sum()

id           0
season       0
year         0
name         0
full_name    0
url          0
status       0
category     0
prize        0
country      0
dtype: int64

Now our DataFrame has no missing values we can look at getting rid of any unecessary columns.

Most of the data in this DataFrame is useful to us however we do not need the `url` column as it's not useful to our analysis so I will drop that next.

In [27]:
# Drop url column
df_tournaments = df_tournaments.drop(columns=['url'])

# Display DataFrame head
df_tournaments.head()

Unnamed: 0,id,season,year,name,full_name,status,category,prize,country
0,753,1982-1983,1982,UK Championship,1982 UK Championship,Professional,Non-ranking,47000.0,England
2,762,1982-1983,1982,Professional Players Tournament,1982 Professional Players Tournament,Professional,Ranking,31500.0,England
4,754,1982-1983,1982,International Open,1982 International Open,Professional,Ranking,73500.0,England
5,759,1982-1983,1982,Scottish Masters,1982 Scottish Masters,Professional,Invitational,23000.0,Scotland
10,795,1982-1983,1982,Australian Masters,1982 Australian Masters,Professional,Invitational,18568.0,Australia


It would be beneficial to have the id as the index as that will make it easier to query when relating to other DataFrames.

In [28]:
# Make id column the index
df_tournaments = df_tournaments.set_index('id')

# Display DataFrame
df_tournaments.head()

Unnamed: 0_level_0,season,year,name,full_name,status,category,prize,country
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
753,1982-1983,1982,UK Championship,1982 UK Championship,Professional,Non-ranking,47000.0,England
762,1982-1983,1982,Professional Players Tournament,1982 Professional Players Tournament,Professional,Ranking,31500.0,England
754,1982-1983,1982,International Open,1982 International Open,Professional,Ranking,73500.0,England
759,1982-1983,1982,Scottish Masters,1982 Scottish Masters,Professional,Invitational,23000.0,Scotland
795,1982-1983,1982,Australian Masters,1982 Australian Masters,Professional,Invitational,18568.0,Australia


Now that our tournaments.csv file has been cleaned we can now save it to a new cleaned file

In [29]:
# Save DataFrame to CSV
df_tournaments.to_csv(os.path.join(clean_data_dir, 'tournaments_cleaned.csv'), index=False)


# Section 4 - scores.csv

* We may not need this table for analysis however I am going to clean it anyway incase we need any information from it later in the project.

In [30]:
# Show the head of the DataFrame
df_scores.head()

# Copy the DataFrame
df_scores_copy = df_scores.copy()


In [31]:
# Show missing values
df_scores.isnull().sum()

match_id                  0
frame                     0
player                    0
score                     0
50plus_breaks_str    757723
dtype: int64

As we can see from the cell above there were 757,723 rows that do not contain a value for `50plus_breaks_str`.

We can assume that if this cell reads as `Null` or `NaN` then we can assume that there were no breaks over 50.

Therefore we can update these cells with "0".

In [33]:
# Update missing 50plus_breaks_str
df_scores['50plus_breaks_str'] = df_scores['50plus_breaks_str'].fillna('0')

# Check missing values
df_scores.isnull().sum()

match_id             0
frame                0
player               0
score                0
50plus_breaks_str    0
dtype: int64

Seeing as there are now no missing values in this dataset we can now save it as a new csv file.

We are not going to drop any columns as this is all useful information to us.

In [34]:
# Save DataFrame to CSV
df_scores.to_csv(os.path.join(clean_data_dir, 'scores_cleaned.csv'), index=False)


# Merging Datasets

Now that each dataset has been cleaned we can start to merge them and create some meaningful tables that are useful for analysis.

In [47]:
# Recap of Cleaned DataFrames
df_matches = pd.read_csv(os.path.join(clean_data_dir, 'matches_cleaned.csv'))
df_players = pd.read_csv(os.path.join(clean_data_dir, 'players_cleaned.csv'))
df_tournaments = pd.read_csv(os.path.join(clean_data_dir, 'tournaments_cleaned.csv'))
df_scores = pd.read_csv(os.path.join(clean_data_dir, 'scores_cleaned.csv'))

### Plan

My plan for merging these tables is as follows:
* I am going to create **one master table** that holds all information about the matches
* The master table will have the following columns:

| Column Name                   | Description                        | Source                                                                                            |
| ----------------------------- | ---------------------------------- | ------------------------------------------------------------------------------------------------- |
| `Match_id`                    | Unique match identifier            | `matches_r.csv` → `match_id`                                                                      |
| `tournament_id`               | Tournament identifier              | `matches_r.csv` → `tournament_id` (FK to `tournaments.csv` → `id`)                                |
| `season`                      | Snooker season (e.g., 2013/2014)   | `tournaments.csv` → `season` *(derive from match `date` if missing)*                              |
| `year`                        | Tournament year                    | `tournaments.csv` → `year`                                                                        |
| `tournament_name`             | Official tournament name           | `tournaments.csv` → `full_name` *(fallback `name`)*                                               |
| `stage`                       | Tournament stage (`QF`, `SF`, `F`) | `matches_r.csv` → `stage` *(normalise values)*                                                    |
| `best_of`                     | Max frames in the match            | `matches_r.csv` → `best_of`                                                                       |
| `player1_name`                | First player’s name                | `matches_r.csv` → `player1_name`                                                                  |
| `player1_country`             | First player’s country             | `players_r.csv` → `Country` *(join on `matches_r.player1_url = players_r.url`)*                   |
| `player2_name`                | Second player’s name               | `matches_r.csv` → `player2_name`                                                                  |
| `player2_country`             | Second player’s country            | `players_r.csv` → `Country` *(join on `matches_r.player2_url = players_r.url`)*                   |
| `score1`                      | Frames won by player 1             | `matches_r.csv` → `score1`                                                                        |
| `score2`                      | Frames won by player 2             | `matches_r.csv` → `score2`                                                                        |
| `tournament_country`          | Host country                       | `tournaments.csv` → `country`                                                                     |
| `tournament_prize`            | Prize fund (nominal)               | `tournaments.csv` → `prize`                                                                       |
| `winner`                      | Winner’s name                      | **Derived:** `score1 > score2 → player1_name` else `player2_name`                                 |
| `winner_country`              | Winner’s country                   | **Derived:** from `player1_country` / `player2_country` based on `match_winner`                   |
| `is_international_match`      | Is one player outside of the UK?   | **Derived:** `player1_country or player2_country != UK`.                                          |
| `is_international_tournament` | Event hosted outside the UK?       | **Derived:** `tournament_country != {England, Scotland, Wales, Northern Ireland}`                 |



I will use the `matches_cleaned.csv` as the starting point for this table as it contains a lot of the same data. I will just add to it to create the master table.


In [48]:
# Display the df_matches table
df_matches

Unnamed: 0,tournament_id,match_id,stage,best_of,player1_name,player2_name,score1,score2
0,753,82716,Final,31,Terry Griffiths,Alex Higgins,16,15
1,753,82718,Semi-final,17,Terry Griffiths,Tony Meo,9,7
2,753,82717,Semi-final,17,Alex Higgins,Ray Reardon,9,6
3,753,82721,Quarter-final,17,Terry Griffiths,Steve Davis,9,6
4,753,82719,Quarter-final,17,Alex Higgins,John Spencer,9,5
...,...,...,...,...,...,...,...,...
193525,3275,203168,Group 1,5,Neil Robertson,Luca Brecel,3,1
193526,3275,203174,Group 1,5,Neil Robertson,Jack Lisowski,3,1
193527,3275,203167,Group 1,5,Mark Selby,Luca Brecel,3,0
193528,3275,203169,Group 1,5,Mark Selby,Ryan Day,3,2


In [49]:
# Make a copy of df_matches
df_matches = df_matches.copy()

# set index to match_id
# df_matches.set_index('match_id', inplace=True)

# Display head
df_matches.head()

Unnamed: 0,tournament_id,match_id,stage,best_of,player1_name,player2_name,score1,score2
0,753,82716,Final,31,Terry Griffiths,Alex Higgins,16,15
1,753,82718,Semi-final,17,Terry Griffiths,Tony Meo,9,7
2,753,82717,Semi-final,17,Alex Higgins,Ray Reardon,9,6
3,753,82721,Quarter-final,17,Terry Griffiths,Steve Davis,9,6
4,753,82719,Quarter-final,17,Alex Higgins,John Spencer,9,5


In [50]:
# Rename index
df_tournaments.index.name = "id"

# Reset index
df_tournaments_subset = df_tournaments.reset_index()

# List tournament columns I would like to merge
tournament_columns_to_merge = ["id","season", "year", "name", "country", "prize" ]

# Select only the tournament columns
df_tournaments_subset = df_tournaments_subset[tournament_columns_to_merge]

df_tournaments_subset.head()

Unnamed: 0,id,season,year,name,country,prize
0,0,1982-1983,1982,UK Championship,England,47000.0
1,1,1982-1983,1982,Professional Players Tournament,England,31500.0
2,2,1982-1983,1982,International Open,England,73500.0
3,3,1982-1983,1982,Scottish Masters,Scotland,23000.0
4,4,1982-1983,1982,Australian Masters,Australia,18568.0


Now that we have a subset of the tournaments DataFrame, we can look at merging this with the matches DataFrame by the tournament_id.

In [54]:
# Merge tournament and match data
df_master = (
    df_matches
    .merge(df_tournaments_subset, left_on='tournament_id', right_on='id', how='left', validate='many_to_one')
    .rename(columns={
        "name": "tournament_name",
        "country": "tournament_country",
        "prize": "tournament_prize"
    })
)

df_master.head()

Unnamed: 0,tournament_id,match_id,stage,best_of,player1_name,player2_name,score1,score2,id,season,year,tournament_name,tournament_country,tournament_prize
0,753,82716,Final,31,Terry Griffiths,Alex Higgins,16,15,753.0,2014-2015,2014.0,UK Championship,England,740000.0
1,753,82718,Semi-final,17,Terry Griffiths,Tony Meo,9,7,753.0,2014-2015,2014.0,UK Championship,England,740000.0
2,753,82717,Semi-final,17,Alex Higgins,Ray Reardon,9,6,753.0,2014-2015,2014.0,UK Championship,England,740000.0
3,753,82721,Quarter-final,17,Terry Griffiths,Steve Davis,9,6,753.0,2014-2015,2014.0,UK Championship,England,740000.0
4,753,82719,Quarter-final,17,Alex Higgins,John Spencer,9,5,753.0,2014-2015,2014.0,UK Championship,England,740000.0


NOTE

* You may add as many sections as you want, as long as it supports your project workflow.
* All notebook's cells should be run top-down (you can't create a dynamic wherein a given point you need to go back to a previous cell to execute some task, like go back to a previous cell and refresh a variable content)

---