# **Extract, Transform, Load (ETL) Process**

## Objectives

* The objective of this notebook is to perform an ETL process on the data I gathered from Kaggle.
* The raw data is stored in the datasets/raw folder.
* There are 4 related csv files that contain foreign keys relating them to each other, such as player_id, match_id and tournament_id. I will use these keys to clean and merge the data.

## Inputs

* My inputs are 4 csv files:
    * matches.csv
    * players.csv
    * scores.csv
    * tournaments.csv

## Outputs

* I am going to output one main table. From this table I will filter the data into smaller csv files that are easier to work with when creating visualisations.


---

# Change working directory

* The notebooks are stored in a subfolder, so when running a notebook in the editor the working directory needs to be changed.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir


'/Users/stephenbeese/GitHub/Snooker-Data-Analysis/Snooker-Data-Analysis/jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/Users/stephenbeese/GitHub/Snooker-Data-Analysis/Snooker-Data-Analysis'
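One thing to watch with the cells above: running the `os.chdir` cell a second time moves up another level. The sketch below is one way to make the step safe to re-run. The helper name `project_root` is hypothetical, and it assumes the notebooks folder is called `jupyter_notebooks`, as in the path printed above.

```python
import os

def project_root(path, notebooks_folder="jupyter_notebooks"):
    """Return the parent of `path` only when `path` is the notebooks folder.

    `project_root` and `notebooks_folder` are illustrative names,
    not part of the original notebook.
    """
    path = os.path.normpath(path)
    if os.path.basename(path) == notebooks_folder:
        return os.path.dirname(path)
    # Already at (or outside) the project root: leave the path unchanged
    return path

# Safe to re-run: a second call does not climb another level
os.chdir(project_root(os.getcwd()))
```

Because the function only strips the final folder when its name matches, re-running the cell after the first change leaves the working directory where it is.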

# Imports

To clean and merge this data I will first need to import some libraries

In [4]:
import numpy as np
import pandas as pd

# Set up data directories

In [None]:
# Set the file path for raw data
raw_data_dir = os.path.join(current_dir, 'datasets/raw')

# Set the file path for clean data
clean_data_dir = os.path.join(current_dir, 'datasets/clean')
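Saving to `datasets/clean` later on will fail if that folder does not exist yet, and reading will fail if a raw file is missing. As a minimal sketch, assuming the folder layout described above, the hypothetical helper `prepare_data_dirs` creates the clean folder and reports any of the four input files that cannot be found.

```python
import os

# The four input files listed in the Inputs section
RAW_FILES = ["matches.csv", "players.csv", "scores.csv", "tournaments.csv"]

def prepare_data_dirs(base_dir):
    """Create datasets/clean if needed and list any missing raw files.

    `prepare_data_dirs` is an illustrative helper, not part of the
    original notebook.
    """
    raw_dir = os.path.join(base_dir, "datasets", "raw")
    clean_dir = os.path.join(base_dir, "datasets", "clean")
    # exist_ok=True makes this a no-op when the folder is already there
    os.makedirs(clean_dir, exist_ok=True)
    missing = [
        name for name in RAW_FILES
        if not os.path.isfile(os.path.join(raw_dir, name))
    ]
    return raw_dir, clean_dir, missing
```

Calling `prepare_data_dirs(current_dir)` and checking that the returned `missing` list is empty gives an early, readable failure instead of an error halfway through the notebook.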

# Section 1 - matches.csv

In [12]:
# Load all csv files into Pandas DataFrames
df_matches = pd.read_csv(os.path.join(raw_data_dir, 'matches.csv'))
df_players = pd.read_csv(os.path.join(raw_data_dir, 'players.csv'))
df_scores = pd.read_csv(os.path.join(raw_data_dir, 'scores.csv'))
df_tournaments = pd.read_csv(os.path.join(raw_data_dir, 'tournaments.csv'))


---

In [6]:
# Display the head of df_matches
df_matches.head()

Unnamed: 0,tournament_id,match_id,date,stage,best_of,player1_name,player1_url,player2_name,player2_url,score1,score2,frames_scores,is_walkover
0,753,82716,,Final,31,Terry Griffiths,https://cuetracker.net/players/terry-griffiths,Alex Higgins,https://cuetracker.net/players/alex-higgins,16,15,20-58; 31-90; 56-52; 26-87(67); 0-114(67); 73(...,False
1,753,82718,,Semi-final,17,Terry Griffiths,https://cuetracker.net/players/terry-griffiths,Tony Meo,https://cuetracker.net/players/tony-meo,9,7,71-8; 50-71(55); 31-62; 69-30; 73-61; 34-77(52...,False
2,753,82717,,Semi-final,17,Alex Higgins,https://cuetracker.net/players/alex-higgins,Ray Reardon,https://cuetracker.net/players/ray-reardon,9,6,28-71; 67(50)-29; 74(74)-0; 53-79; 60-54; 112(...,False
3,753,82721,,Quarter-final,17,Terry Griffiths,https://cuetracker.net/players/terry-griffiths,Steve Davis,https://cuetracker.net/players/steve-davis,9,6,"1-103; 117(60,57)-6; 5-105(60); 57-60; 79-0; 2...",False
4,753,82719,,Quarter-final,17,Alex Higgins,https://cuetracker.net/players/alex-higgins,John Spencer,https://cuetracker.net/players/john-spencer,9,5,69(54)-31; 103-21; 72-48; 33-82; 40-56; 71-51;...,False


In [7]:
# Check missing values in the dataframe
df_matches.isnull().sum()

tournament_id         0
match_id              0
date             138711
stage                 0
best_of               0
player1_name          0
player1_url           0
player2_name          0
player2_url           0
score1                0
score2                0
frames_scores    123278
is_walkover           0
dtype: int64

From the output above we can see that there are a lot of missing values in both the `date` column and the `frames_scores` column.

As we are not interested in `frames_scores`, we can remove this column completely.

The `date` column would be useful to us, but with so many missing values it cannot be relied on. Luckily, the `tournaments.csv` file contains both the `season` and `year` of each tournament, so we can use the `tournament_id` to look this information up and add it to the matches data as a new column.

We will do this later; for now we can drop both `date` and `frames_scores`.
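The lookup by `tournament_id` could be done with a left merge. This is only a sketch on two tiny made-up frames: the ids, seasons and years are placeholders, and I am assuming the key column in the tournaments data is also named `tournament_id` (if it has a different name, `left_on` and `right_on` would be needed instead of `on`).

```python
import pandas as pd

# Made-up stand-ins for df_matches and df_tournaments (placeholder values)
matches = pd.DataFrame({"tournament_id": [1, 1, 2], "match_id": [10, 11, 12]})
tournaments = pd.DataFrame({
    "tournament_id": [1, 2],          # assumed column name
    "season": ["2000-2001", "2001-2002"],
    "year": [2000, 2001],
})

# A left merge keeps every match, even if its tournament is not found.
# validate="many_to_one" raises an error if a tournament_id appears
# more than once in the tournaments table.
matches_with_season = matches.merge(
    tournaments[["tournament_id", "season", "year"]],
    on="tournament_id",
    how="left",
    validate="many_to_one",
)
```

Using `how="left"` means the number of match rows cannot change, which makes the result easy to sanity-check with `len()`.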

Looking further into the DataFrame, there are some more unnecessary columns that won't be needed for the analysis in this project:
* `player1_url`
* `player2_url`
* `is_walkover`

In [8]:
# Create a copy of df_matches
df_matches = df_matches.copy()

# Drop the columns that are not needed for the analysis
df_matches = df_matches.drop(columns=['date', 'frames_scores', 'player1_url', 'player2_url', 'is_walkover'])

In [9]:
# Display the head of df_matches after cleaning
df_matches.head()

Unnamed: 0,tournament_id,match_id,stage,best_of,player1_name,player2_name,score1,score2
0,753,82716,Final,31,Terry Griffiths,Alex Higgins,16,15
1,753,82718,Semi-final,17,Terry Griffiths,Tony Meo,9,7
2,753,82717,Semi-final,17,Alex Higgins,Ray Reardon,9,6
3,753,82721,Quarter-final,17,Terry Griffiths,Steve Davis,9,6
4,753,82719,Quarter-final,17,Alex Higgins,John Spencer,9,5


In [10]:
# Check missing values in the dataframe
df_matches.isnull().sum()

tournament_id    0
match_id         0
stage            0
best_of          0
player1_name     0
player2_name     0
score1           0
score2           0
dtype: int64

In [None]:
# Check for duplicate matches
df_matches[df_matches.duplicated()]

Unnamed: 0,tournament_id,match_id,date,stage,best_of,player1_name,player1_url,player2_name,player2_url,score1,score2,frames_scores,is_walkover


As you can see from the output above, we now have a cleaned `matches.csv` dataset with no null values or duplicates.

Next I will save this to a new .csv file and continue working on the other DataFrames.
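As an optional extra check before saving: `duplicated()` with no arguments only flags rows that are identical in every column. Since `match_id` is meant to be a key, a stricter check looks for repeated key values. The helper `duplicate_keys` and the demo frame below are illustrative only.

```python
import pandas as pd

def duplicate_keys(df, key):
    """Return every row whose `key` value appears more than once."""
    # keep=False marks all occurrences, not just the second and later ones
    return df[df.duplicated(subset=[key], keep=False)]

# Made-up frame: match_id 2 appears twice with different scores,
# so a whole-row duplicate check would not catch it
demo = pd.DataFrame({"match_id": [1, 2, 2], "score1": [9, 9, 5]})
repeated = duplicate_keys(demo, "match_id")
```

On the real data this would be `duplicate_keys(df_matches, 'match_id')`; an empty result would confirm that each match appears only once.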

In [11]:
# Save cleaned DataFrame to a new .csv file
df_matches.to_csv(os.path.join(clean_data_dir, 'matches_cleaned.csv'), index=False)

# Section 2 - players.csv

In [17]:
# Display the head of df_players
df_players.head()

Unnamed: 0,url,id,first_name,last_name,full_name,country
0,https://cuetracker.net/players/mohammed-a-belg...,mohammed-a-belgaizi,Mohammed,A Belgaizi,Mohammed A Belgaizi,United Arab Emirates
1,https://cuetracker.net/players/ishaq-a-khaleg,ishaq-a-khaleg,Ishaq,A Khaleg,Ishaq A Khaleg,Bahrain
2,https://cuetracker.net/players/ahmed-a-asere,ahmed-a-asere,Ahmed,A. Asere,Ahmed A. Asere,Saudi Arabia
3,https://cuetracker.net/players/magnus-aagaard,magnus-aagaard,Magnus,Aagaard,Magnus Aagaard,Denmark
4,https://cuetracker.net/players/asbjorn-aalberg,asbjorn-aalberg,Asbjorn,Aalberg,Asbjorn Aalberg,Norway


In [14]:
# Check for missing values
df_players.isnull().sum()

url           0
id            0
first_name    0
last_name     0
full_name     0
country       0
dtype: int64

In [None]:
# Check for duplicate players
df_players[df_players.duplicated()]

Unnamed: 0,url,id,first_name,last_name,full_name,country


As you can see from the output above, there are no missing values or duplicates in this table.

This means we can look at the columns and decide which are useful for our project outcomes.

As we are looking at international players, we are interested in the country and the players' names.

The `url` column isn't crucial for our analysis, so I am going to drop it and then save the result to a new `.csv` file.

In [19]:
# Create a copy of df_players
df_players = df_players.copy()

# Drop the url column
df_players = df_players.drop(columns=['url'])

# Display the head of df_players
df_players.head()

Unnamed: 0,id,first_name,last_name,full_name,country
0,mohammed-a-belgaizi,Mohammed,A Belgaizi,Mohammed A Belgaizi,United Arab Emirates
1,ishaq-a-khaleg,Ishaq,A Khaleg,Ishaq A Khaleg,Bahrain
2,ahmed-a-asere,Ahmed,A. Asere,Ahmed A. Asere,Saudi Arabia
3,magnus-aagaard,Magnus,Aagaard,Magnus Aagaard,Denmark
4,asbjorn-aalberg,Asbjorn,Aalberg,Asbjorn Aalberg,Norway


In [20]:
# Save df_players to a CSV file
df_players.to_csv(os.path.join(clean_data_dir, 'players_cleaned.csv'), index=False)
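Since the matches table only holds player names, one possible way to bring `country` across when merging later is to map on `full_name`. This is a sketch under assumptions: it assumes the names in `player1_name` and `player2_name` are written the same way as `full_name`, and the frames below use made-up placeholder values.

```python
import pandas as pd

# Made-up stand-ins for the cleaned matches and players tables
matches = pd.DataFrame({
    "match_id": [1],
    "player1_name": ["Player A"],
    "player2_name": ["Player B"],
})
players = pd.DataFrame({
    "full_name": ["Player A", "Player B"],
    "country": ["Country X", "Country Y"],
})

# Build a name -> country lookup. If two players share a name, only the
# first is kept here, which may not be the right one (assumption to review).
country_by_name = (
    players.drop_duplicates("full_name").set_index("full_name")["country"]
)

# Names that are not found in the lookup become NaN rather than raising
matches["player1_country"] = matches["player1_name"].map(country_by_name)
matches["player2_country"] = matches["player2_name"].map(country_by_name)
```

Counting the NaN values in the two new columns afterwards would show how many names failed to match.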

---

# Section 3 - tournaments.csv

---

# Push files to Repo

In [58]:
import os
try:
  # Create the clean data folder if it does not already exist
  os.makedirs(name=clean_data_dir, exist_ok=True)
except Exception as e:
  print(e)