### Before you start:

This analysis presumes some basic knowledge of the PC game League of Legends, so before you proceed, here is a brief introduction to some of the concepts in the game:

* League of Legends: a MOBA (Multiplayer Online Battle Arena) game which is typically 5v5, meaning 10 users form 2 teams and fight against each other; whichever team destroys the opponent's base first wins the game
* Champion: each user controls a unique champion in the game, and champions in a team often need to collaborate well to win
* Objectives: monsters which both teams will try to slay, as they bring massive benefits toward winning a game.
    1. Baron Nashor: Baron Nashor is the most powerful monster in the jungle.
    2. Drakes (Dragons): Drakes, or dragons, are powerful monsters that grant unique bonuses depending on the element of the drake your team slays.
* Items: purchasable equipment for a champion, bought with gold earned in various ways during the game
* Vision: in order to see opponents on the map, a team must have vision, which can be gained by placing wards
* Wards: an item a champion plants on the map, which gives the team vision of the surrounding area

    
***
Source: https://leagueoflegends.fandom.com/wiki/League_of_Legends_Wiki, https://euw.leagueoflegends.com/en-gb/how-to-play/
***

# Statistically Winning League of Legends

## Project Summary
How do you win a LOL (League of Legends) game if you are not a mechanical player like Rookie, Uzi, or Faker? One way, although a less creative one, is to copy what the winning players are doing.

This project explores 100,800 high-ranked games on the KR (Korean) server and tries to help answer a few questions that puzzle many players:

* Which champions are the best buddies?
* What is the best build (items) for a champion?
* Which is more important, dragon soul (getting 4 dragons) or Baron Nashor?
* Does vision control really improve chances of winning?

**In general, the final result will provide input for the analysis of a LOL game in 3 dimensions: champion compositions, champion item builds, and team objectives/vision control**

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

!pip install psycopg2

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import pickle # data processing, pickle file I/O
import json # data processing, json file I/O
import random # data quality testing
import psycopg2 # establish AWS RDS Postgres connection
import sqlalchemy # copy pd dataframe to Postgres
from sqlalchemy.types import Float, String, Integer, Boolean, JSON # column types of the staging tables

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Step 1: Scope the Project and Gather Data

#### Scope 

The end result is prepared as a backend for game analytics, to give suggestions on winning strategies. The two sources of data are: [League of Legends(LOL) - Ranked Games 2020](https://www.kaggle.com/gyejr95/league-of-legendslol-ranked-games-2020-ver1?select=match_data_version1.csv), a Kaggle dataset; and [Data Dragon of Patch 10.15.1](https://ddragon.leagueoflegends.com/cdn/dragontail-10.15.1.tgz), LOL game data and assets from Riot (LOL's parent company).

The end solution will be a database in AWS RDS Postgres, containing the data needed to answer each question.

Some Python analytics libraries are used; see the imports above.

#### Describe and Gather Data 
The Kaggle dataset contains data for 100,800 games on the KR server; the Data Dragon files (`meta_champs` & `meta_items`) contain all the in-game data up to patch 10.15.1, which was the latest version of the game as of Aug. 2, 2020.


read the 3 match data pickle files: `match_df` contains participants for both the winning and losing sides of a match; `winner_df` contains in-game stats and objectives info for the winning team of a match; `loser_df` contains the same info for the losing side

In [None]:
match_df = pd.read_pickle('../input/league-of-legendslol-ranked-games-2020-ver1/match_data_version1.pickle')
winner_df =  pd.read_pickle('../input/league-of-legendslol-ranked-games-2020-ver1/match_winner_data_version1.pickle')
loser_df = pd.read_pickle('../input/league-of-legendslol-ranked-games-2020-ver1/match_loser_data_version1.pickle')

In [None]:
match_df.info()

In [None]:
match_df['gameVersion'].iloc[0]

In [None]:
winner_df.info()

In [None]:
loser_df.info() 

In [None]:
meta_champs = pd.read_json('../input/league-of-legendslol-data-dragon-en-us10151/en_US-10.15.1/meta_champion.json')

In [None]:
meta_champs.iloc[0]['data']

the only info we need from `meta_champs` is the key-name mapping of each champion, because the key is used in all the match tables above.
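
As a minimal sketch of that mapping, assuming each Data Dragon champion entry carries its numeric key as a string under `key` (as the notebook relies on later), a lookup table could be built as follows. `build_key_name_map` and the two sample entries are hypothetical and only illustrate the shape of the data.

```python
def build_key_name_map(champion_data):
    """Map each integer champion key to the champion's id (its dict key)."""
    return {int(entry["key"]): champ_id for champ_id, entry in champion_data.items()}


# illustrative sample in the assumed Data Dragon layout, not the full file
sample_data = {
    "Aatrox": {"id": "Aatrox", "key": "266"},
    "Ahri": {"id": "Ahri", "key": "103"},
}

champion_names = build_key_name_map(sample_data)
print(champion_names[103])
```

Casting the key to `int` matters because the match tables store `championId` as a number, while the metadata stores it as text.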

similarly, we also need the 'key-name' mapping of each item from `meta_items`

In [None]:
with open('../input/league-of-legendslol-data-dragon-en-us10151/en_US-10.15.1/meta_item.json') as f:
    data = json.load(f)
meta_items = pd.read_json(json.dumps(data['data']), orient='index')

In [None]:
meta_items.info()

In [None]:
meta_items.iloc[0]

## Step 2: Explore and Assess the Data
#### Explore the Data 

Problems:
1. **NaN** `win` in `loser_df`
2. `gameMode` in `match_df`: we only need one game mode, **classic**, which covers the ranked games we are focusing on
3. **remake** games in LOL

#### Cleaning Steps

`gameId` is the key shared by all 3 dfs; we will use it to drop the records of the same match from each of them
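
The idea can be sketched with plain Python lists standing in for the three DataFrames (an assumption made only to keep the example dependency-free). `drop_games` is a hypothetical helper; the notebook itself filters each DataFrame with `isin`.

```python
def drop_games(tables, bad_game_ids):
    """Return copies of every table without rows whose gameId is in bad_game_ids."""
    bad = set(bad_game_ids)
    return {
        name: [row for row in rows if row["gameId"] not in bad]
        for name, rows in tables.items()
    }


# toy rows; only gameId and win are modelled here
tables = {
    "match": [{"gameId": 1}, {"gameId": 2}, {"gameId": 3}],
    "winner": [{"gameId": 1, "win": "Win"}, {"gameId": 2, "win": "Win"}, {"gameId": 3, "win": "Win"}],
    "loser": [{"gameId": 1, "win": "Fail"}, {"gameId": 2, "win": None}, {"gameId": 3, "win": "Fail"}],
}

cleaned = drop_games(tables, bad_game_ids=[2])
print({name: [row["gameId"] for row in rows] for name, rows in cleaned.items()})
```

Dropping by the shared id from all three tables at once keeps them aligned, so later joins on `gameId` never meet a match that exists in only one table.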

*Issue 1*

In [None]:
loser_df['win'].isna().sum()

12 values in the `win` field of `loser_df` are **NaN**; presumably they should be **Fail** instead

we try to find their opponents by `gameId` (the key shared by both dfs) in `winner_df`, and see if these are actually just missing values

also we can look at the full match in `match_df`

In [None]:
loser_df[loser_df['win'].isna()]

In [None]:
missingVals = loser_df[loser_df['win'].isna()]['gameId'].tolist()
winner_df[winner_df['gameId'].isin(missingVals)]

In [None]:
match_df[match_df['gameId'].isin(missingVals)]

all of these games are in **TUTORIAL_MODULE_1**, which is a tutorial rather than a matched game in LOL

for comparison, let's look at a random set of rows in `loser_df` and `match_df`

In [None]:
# sample 10 distinct row positions, bounded by the actual length of loser_df
randomlist = random.sample(range(len(loser_df)), 10)
print(randomlist)

In [None]:
match_df[match_df['gameId'].isin(loser_df.iloc[randomlist]['gameId'])]

one quick observation is the dramatic difference in `gameDuration` between the random set and the win == NaN set; unfortunately I did not find any documentation on `gameDuration` in the Riot API

my guess is that the **unit** of `gameDuration` is seconds, which would be reasonable considering the `gameDuration` values in the random set we just created:

In [None]:
avg_gameDuration = np.average(match_df[match_df['gameId'].isin(loser_df.iloc[randomlist]['gameId'])]['gameDuration'].values)
print('if unit in mins: {:.2f} mins'.format(avg_gameDuration))
print('if unit in sec: {:.2f} mins'.format(avg_gameDuration / 60))
print('if unit in millisec: {:.2f} mins'.format(avg_gameDuration / 60000))

around **20.00 mins** seems to be a reasonable guess for how long an average LOL game takes; the other two are far too long or far too short
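
The same sanity check can be written as a small conversion helper. This is only a sketch: `to_minutes` is a hypothetical function, and `sample_duration` is a made-up raw value rather than a number taken from the dataset.

```python
def to_minutes(value, unit):
    """Convert a raw duration to minutes under an assumed unit."""
    factors = {"minutes": 1.0, "seconds": 1 / 60, "milliseconds": 1 / 60000}
    return value * factors[unit]


sample_duration = 1200  # hypothetical raw gameDuration value

for unit in ("minutes", "seconds", "milliseconds"):
    print("if unit in {}: {:.2f} mins".format(unit, to_minutes(sample_duration, unit)))
```

Only one of the three interpretations lands in the range of a plausible game length, which is the reasoning used to settle on seconds.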

In [None]:
avg_NaNgameDura = np.average(match_df[match_df['gameId'].isin(missingVals)]['gameDuration'].values)
print('NaN games avg: {:.2f} mins'.format(avg_NaNgameDura/60))

these games are way too short to be meaningful in the analysis, and the missing `win` field might have been a direct result of this

we should drop all these match records

In [None]:
winner_df = winner_df[~winner_df['gameId'].isin(missingVals)]
loser_df = loser_df[~loser_df['gameId'].isin(missingVals)]
match_df = match_df[~match_df['gameId'].isin(missingVals)]

*Issue 2*

In [None]:
match_df['gameMode'].unique()

only classic games are needed, we should drop all other game modes

In [None]:
gameId_CLASSIC =  match_df[match_df['gameMode'] == 'CLASSIC'].gameId.tolist()

In [None]:
winner_df = winner_df[winner_df['gameId'].isin(gameId_CLASSIC)]
loser_df = loser_df[loser_df['gameId'].isin(gameId_CLASSIC)]
match_df = match_df[match_df['gameId'].isin(gameId_CLASSIC)]

In [None]:
match_df.gameMode.unique()

*Issue 3*

if one player disconnected from a game right from the start, the team can choose to **remake** the game, meaning the match will end immediately

**remake** should not be regarded the same as a loss, because the other players in the team just did not want a 4v5 situation

we should identify the **remake** games from `match_df`, as they affect our win/loss analysis

a team can only choose to **remake** in the first 15 mins of the game, after that the **remake** attempt becomes an **early surrender** (which is just the same as a loss)

also in reality, a game at this ranking level would rarely finish before 15 mins, unless someone disconnected

so if `gameDuration` in `match_df` <= **15 mins**, we can almost be certain that the game is a **remake**

we will drop these games

In [None]:
gameId_remake = match_df.query('gameDuration <= 15*60').gameId.values.tolist()

In [None]:
winner_df = winner_df[~winner_df['gameId'].isin(gameId_remake)]
loser_df = loser_df[~loser_df['gameId'].isin(gameId_remake)]
match_df = match_df[~match_df['gameId'].isin(gameId_remake)]

In [None]:
print('now the shortest game in df is {:.2f} mins'.format(match_df.gameDuration.min() / 60))

## Step 3: Define the Data Model
#### 3.1 Conceptual Data Model

[data_modeling.pdf](https://drive.google.com/file/d/1c0jzrDSiblRn2QPkbWr1oUKnhDp8BVD_/view?usp=sharing)

fact-dimension modeling fits the needs of this project well, because the output is produced to support the analysis of each match (fact) along dimensions such as champion compositions and vision control


#### 3.2 Mapping Out Data Pipelines

the original data resides in Kaggle; after fixing some data quality issues, it will be loaded into an AWS RDS Postgres database


- create and connect to Postgres
- create each fact/dimension table in Postgres
- load dimension tables first, then the fact table
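
The steps above can be sketched end to end with an in-memory SQLite database standing in for the RDS Postgres instance. This is an assumption made so the example runs anywhere; the table layouts are simplified and the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for Postgres here
cur = conn.cursor()

# staging tables hold the cleaned source rows (toy data)
cur.execute("CREATE TABLE staging_match (game_id INTEGER, game_duration REAL, champion_key INTEGER)")
cur.executemany(
    "INSERT INTO staging_match VALUES (?, ?, ?)",
    [(1, 1500.0, 103), (1, 1500.0, 266), (2, 1800.0, 103)],
)
cur.execute("CREATE TABLE staging_meta_champs (champion_key INTEGER, champion_name TEXT)")
cur.executemany("INSERT INTO staging_meta_champs VALUES (?, ?)", [(103, "Ahri"), (266, "Aatrox")])

# create the dimension and fact tables
cur.execute("CREATE TABLE champion_key (champion_key INTEGER PRIMARY KEY, champion_name TEXT NOT NULL)")
cur.execute("CREATE TABLE games (game_id INTEGER PRIMARY KEY, game_duration REAL NOT NULL)")

# load the dimension table first, then the fact table, with INSERT ... SELECT
cur.execute("INSERT INTO champion_key SELECT champion_key, champion_name FROM staging_meta_champs")
cur.execute("INSERT INTO games SELECT DISTINCT game_id, game_duration FROM staging_match")
conn.commit()

game_count = cur.execute("SELECT count(*) FROM games").fetchone()[0]
champion_count = cur.execute("SELECT count(*) FROM champion_key").fetchone()[0]
print(game_count, champion_count)
```

Transforming with `INSERT ... SELECT` keeps the heavy lifting inside the database, so only the staging load has to cross the network.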

## Step 4: Run Pipelines to Model the Data
#### 4.1 Create the data model
Build the data pipelines to create the data model.

the bulk of the data is very time-consuming to load into the Postgres database due to network constraints, yet the SQL database has greater computing power for transforming the data (from staging tables to final tables)

so we will load the staging tables into Postgres first and run the transformations there

In [None]:
# kaggle's add-in is used to store Postgres database's access info
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
dbname= user_secrets.get_secret("dbname")
host = user_secrets.get_secret("host")
password = user_secrets.get_secret("password")
port = user_secrets.get_secret("port")
user = user_secrets.get_secret("user")

# keyword arguments avoid quoting problems if a credential contains spaces or quotes
conn = psycopg2.connect(host=host, dbname=dbname, user=user, password=password, port=port)
conn.autocommit = True
cur = conn.cursor()

In [None]:
# in the meta_dfs, we only need the key-name mapping, all other columns are dropped
meta_items = meta_items[['name']]
meta_champs['key'] = meta_champs['data'].apply(lambda d: d['key'])
meta_champs = meta_champs[['key']]

In [None]:
# column name/datatype of each staging table defined, used in loading_staging_tables
match_dict = {'gameCreation': Float(), 'gameDuration': Float(), 'gameId': Float(), 'gameMode': String(), 'gameType': String(), 'gameVersion': String(), \
             'mapId': Float(), 'participantIdentities': JSON(), 'participants': JSON(), 'platformId': String(), 'queueId': Float(), 'sessionId': Float(), \
             'status.message': String(), 'status.status_code': Float()}
winner_dict = {'teamId': Integer(), 'win': String(), 'firstBlood': Boolean(), 'firstTower': Boolean(), 'firstInhibitor': Boolean(), 'firstBaron': Boolean(), \
              'firstDragon': Boolean(), 'firstRiftHerald': Boolean(), 'towerKills': Integer(), 'inhibitorKills': Integer(), 'baronKills': Integer(), \
              'dragonKills': Integer(), 'vilemawKills': Integer(), 'riftHeraldKills': Integer(), 'dominionVictoryScore': Integer(), 'bans': JSON(), \
              'gameId': Float()}
loser_dict = winner_dict

In [None]:
# drop tables outlined in the 'data_modeling.pdf', in case a restart is needed
def drop_tables(cur):
    games_table_drop = "DROP TABLE IF EXISTS games"
    champions_table_drop = "DROP TABLE IF EXISTS champions"
    items_table_drop = "DROP TABLE IF EXISTS items"
    objectives_visions_table_drop = "DROP TABLE IF EXISTS objectives_visions"
    champion_key_table_drop = "DROP TABLE IF EXISTS champion_key"
    item_key_table_drop = "DROP TABLE IF EXISTS item_key"
    
    # execute all queries defined
    drop_table_queries = [games_table_drop, champions_table_drop, items_table_drop, objectives_visions_table_drop, champion_key_table_drop, item_key_table_drop]
    for query in drop_table_queries:
        cur.execute(query)

In [None]:
# drop the staging tables, in case a restart is needed
def drop_staging_tables(cur):
    staging_match_table_drop = "DROP TABLE IF EXISTS staging_match"
    staging_winner_table_drop = "DROP TABLE IF EXISTS staging_winner"
    staging_loser_table_drop = "DROP TABLE IF EXISTS staging_loser"
    staging_meta_champs_table_drop = "DROP TABLE IF EXISTS staging_meta_champs"
    staging_meta_items_table_drop = "DROP TABLE IF EXISTS staging_meta_items"
    
    # execute all queries defined
    drop_table_queries = [staging_match_table_drop, staging_winner_table_drop, staging_loser_table_drop, staging_meta_champs_table_drop, staging_meta_items_table_drop]
    for query in drop_table_queries:
        cur.execute(query)

In [None]:
# create and insert staging tables
def load_staging_tables(conn):
    # DataFrame.to_sql needs a SQLAlchemy engine, so one is created here from
    # the connection info above; the `conn` argument itself is not used
    engine = sqlalchemy.create_engine('postgresql://{}:{}@{}:{}/{}'.format(user, password, host, port, dbname))
    
    print('loading staging_match')
    match_df.to_sql('staging_match', engine, index=False, if_exists='replace', dtype=match_dict)
    
    print('loading staging_winner')
    winner_df.to_sql('staging_winner', engine, index=False, if_exists='replace', dtype=winner_dict)
    
    print('loading staging_loser')
    loser_df.to_sql('staging_loser', engine, index=False, if_exists='replace', dtype=loser_dict)
    
    print('loading staging_meta_champs')
    meta_champs.to_sql('staging_meta_champs', engine, index=True, if_exists='replace')
    
    print('loading staging_meta_items')
    meta_items.to_sql('staging_meta_items', engine, index=True, if_exists='replace')

In [None]:
# create tables outlined in the 'data_modeling.pdf'
def create_tables(cur):
    games_table_create = ("""CREATE TABLE IF NOT EXISTS games(game_id bigint PRIMARY KEY, game_duration float NOT NULL, game_version varchar NOT NULL, participants varchar[10] NOT NULL)
    """)
    champions_table_create = ("""CREATE TABLE IF NOT EXISTS champions(game_id bigint PRIMARY KEY, champ_1 int NOT NULL, champ_2 int NOT NULL, champ_3 int NOT NULL, champ_4 int NOT NULL, champ_5 int NOT NULL, champ_6 int NOT NULL, champ_7 int NOT NULL, champ_8 int NOT NULL, champ_9 int NOT NULL, champ_10 int NOT NULL)
    """)
    items_table_create = ("""CREATE TABLE IF NOT EXISTS items(game_id bigint PRIMARY KEY, build_1 int[6] NOT NULL, build_2 int[6] NOT NULL, build_3 int[6] NOT NULL, build_4 int[6] NOT NULL, build_5 int[6] NOT NULL, build_6 int[6] NOT NULL, build_7 int[6] NOT NULL, build_8 int[6] NOT NULL, build_9 int[6] NOT NULL, build_10 int[6] NOT NULL)
    """)
    objectives_visions_table_create = ("""CREATE TABLE IF NOT EXISTS objectives_visions(game_id bigint PRIMARY KEY, win_dragon_soul boolean NOT NULL, win_baron_nashor boolean NOT NULL, win_ward_placed int NOT NULL, win_ward_destroyed int NOT NULL, lose_dragon_soul boolean NOT NULL, lose_baron_nashor boolean NOT NULL, lose_ward_placed int NOT NULL, lose_ward_destroyed int NOT NULL)
    """)
    champion_key_table_create = ("""CREATE TABLE IF NOT EXISTS champion_key(champion_key bigint PRIMARY KEY, champion_name varchar NOT NULL)
    """)
    item_key_table_create = ("""CREATE TABLE IF NOT EXISTS item_key(item_key bigint PRIMARY KEY, item_name varchar NOT NULL)
    """)

    # execute all queries defined
    create_table_queries = [games_table_create, champions_table_create, items_table_create, objectives_visions_table_create, champion_key_table_create, item_key_table_create]
    for query in create_table_queries:
        cur.execute(query)

# the next cell should only be run when running the ETL for the first time, as we are recreating all tables 

In [None]:
print('drop staging tables')
drop_staging_tables(cur)
print('dropping fact/dimension tables')
drop_tables(cur)
print('creating staging tables')
load_staging_tables(cur)
print('creating fact/dimension tables')
create_tables(cur)

In [None]:
# list all tables created
cur.execute("""SELECT table_name FROM information_schema.tables
       WHERE table_schema = 'public'""")
for table in cur.fetchall():
    print(table)

display all staging tables with their columns/datatypes

In [None]:
pd.options.display.max_rows = 75
cur.execute("""SELECT table_name, column_name, data_type FROM information_schema.columns WHERE table_name LIKE 'staging_%'""")
pd.DataFrame(cur.fetchall(), columns=['table_name', 'column_name', 'data_type'])

1. champions table

**champ_1 - champ_10** are stored in `staging_match`; we can retrieve them together with the **gameId** of each group of 10 **champ_X**

the `champions_table_value` statement looks a bit complicated, due to the nature of JSON array columns in Postgres
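
In plain Python, the query does roughly the following for each game: unpack the `participants` array, order the entries by `participantId`, and keep each `championId`. The sketch below is an illustration, not the notebook's code; `champion_row` is a hypothetical helper and the ids are made up.

```python
def champion_row(game_id, participants):
    """Return [game_id, champ_1, ..., champ_n], ordered by participantId."""
    ordered = sorted(participants, key=lambda p: p["participantId"])
    return [game_id] + [p["championId"] for p in ordered]


# toy participants, deliberately out of order (a real game has 10 entries)
participants = [
    {"participantId": 2, "championId": 266},
    {"participantId": 1, "championId": 103},
    {"participantId": 3, "championId": 64},
]

row = champion_row(1001, participants)
print(row)
```

The explicit ordering corresponds to `array_agg(... ORDER BY tb.i ASC)` in the SQL; without it the position of a champion in the row would not be stable.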

In [None]:
champions_table_value = """
SELECT tb2.game_id, tb2.champ_ids[1], tb2.champ_ids[2], tb2.champ_ids[3], tb2.champ_ids[4],tb2.champ_ids[5],
tb2.champ_ids[6], tb2.champ_ids[7], tb2.champ_ids[8], tb2.champ_ids[9], tb2.champ_ids[10]
FROM
(SELECT tb.game_id AS game_id, array_agg(tb.c ORDER BY tb.i ASC)::jsonb[]::int[] AS champ_ids FROM 
(SELECT "gameId" AS game_id, 
json_array_elements(participants) -> 'championId' AS c, 
cast(json_array_elements(participants) -> 'participantId' as jsonb)::int AS i FROM staging_match) AS tb 
GROUP BY tb.game_id) AS tb2 ORDER BY tb2.game_id
"""

In [None]:
champions_table_insert = """INSERT INTO champions(game_id, champ_1, champ_2, champ_3, champ_4, champ_5, 
champ_6, champ_7, champ_8, champ_9, champ_10) {}""".format(champions_table_value)

one issue is noted before the insertion can be completed: some games in the original data have missing **champ_ids**

In [None]:
cur.execute(champions_table_value)
a = pd.DataFrame(cur.fetchall())
a[a.isnull().any(axis=1)]

going through the original `match_df`, we can see these games only have 6 participants

(someone was 1 vs 5 in those games)

they are not valid for our database, nor for the analysis to be performed

delete these rows from `staging_match`
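
A minimal sketch of the validity rule, assuming a valid game always has exactly 10 participants. `split_by_participant_count` is a hypothetical helper and the two games are toy data.

```python
EXPECTED_PARTICIPANTS = 10


def split_by_participant_count(games):
    """Split {game_id: participants} into (valid_ids, invalid_ids)."""
    valid, invalid = [], []
    for game_id, participants in games.items():
        if len(participants) == EXPECTED_PARTICIPANTS:
            valid.append(game_id)
        else:
            invalid.append(game_id)
    return valid, invalid


games = {
    1001: [{"participantId": i} for i in range(1, 11)],  # full 5v5 game
    1002: [{"participantId": i} for i in range(1, 7)],   # only 6 participants
}

valid_ids, invalid_ids = split_by_participant_count(games)
print(valid_ids, invalid_ids)
```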

In [None]:
game_with_missing_champ_ids = a[a.isnull().any(axis=1)][0].tolist()
# look up the affected games once and report each game's own participant count
incomplete_games = match_df[match_df['gameId'].isin(game_with_missing_champ_ids)]
for game_id, participants in zip(incomplete_games['gameId'], incomplete_games['participants']):
    print('game {} has: {} participants'.format(game_id, len(participants)))

In [None]:
# delete game with not 10 participants
cur.execute("""DELETE FROM staging_match WHERE "gameId" IN %s""", (tuple(game_with_missing_champ_ids),))

In [None]:
cur.execute(champions_table_value)
a = pd.DataFrame(cur.fetchall())
a[a.isnull().any(axis=1)]

all games are valid now, we can proceed with the insertion

In [None]:
cur.execute(champions_table_insert)

In [None]:
cur.execute("""SELECT * FROM champions LIMIT 3""")
cur.fetchall()

2. items table

all builds/items info can be retrieved from `staging_match` as well
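
For a single participant, the extraction amounts to reading the seven item slots `item0` to `item6` from the `stats` object. The sketch below assumes a missing slot can be treated as 0 (an empty slot); `build_from_stats` is a hypothetical helper and the item ids are placeholders, not real item keys.

```python
ITEM_SLOTS = ["item{}".format(i) for i in range(7)]  # item0 .. item6


def build_from_stats(stats):
    """Return the 7-slot build of one participant; 0 marks an empty slot."""
    return [stats.get(slot, 0) for slot in ITEM_SLOTS]


# toy stats object with placeholder item ids and one unrelated field
stats = {"item0": 11, "item1": 22, "item2": 33, "item3": 0,
         "item4": 0, "item5": 0, "item6": 44, "kills": 3}

build = build_from_stats(stats)
print(build)
```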

In [None]:
items_table_value = """
SELECT game_id, i[1:1], i[2:2], i[3:3], i[4:4], i[5:5], i[6:6], i[7:7], i[8:8], i[9:9], i[10:10] FROM
(SELECT game_id AS game_id, ((array_agg(array[i0,i1,i2,i3,i4,i5,i6])))::jsonb[]::int[] AS i FROM
(SELECT game_id AS game_id, p ->'item0' AS i0, p -> 'item1' AS i1, p ->'item2' AS i2, p ->'item3' AS i3, 
p ->'item4' AS i4,p ->'item5' AS i5,p ->'item6' AS i6
FROM  (SELECT "gameId" AS game_id, json_array_elements(participants) -> 'stats' AS p FROM staging_match) AS tb1) AS tb2 
GROUP BY game_id) AS tb3 ORDER BY game_id
"""

In [None]:
items_table_insert = """INSERT INTO items(game_id, build_1, build_2, build_3, build_4, build_5, 
build_6, build_7, build_8, build_9, build_10) {}""".format(items_table_value)

In [None]:
cur.execute(items_table_insert)

In [None]:
cur.execute("""SELECT * FROM items LIMIT 3""")
cur.fetchall()

3. objectives_visions table

objective kills can be retrieved from `staging_loser` and `staging_winner`, and wards placed/destroyed can be retrieved from `staging_match`
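
The per-team logic behind this table can be sketched in Python: a team has the dragon soul if it killed at least 4 dragons, has Baron Nashor if it killed it at least once, and its ward counts are summed over its participants. `summarize_team` is a hypothetical helper and the numbers are made up.

```python
def summarize_team(dragon_kills, baron_kills, participant_stats):
    """Derive the objective flags and ward totals for one team."""
    return {
        "dragon_soul": dragon_kills >= 4,
        "baron_nashor": baron_kills > 0,
        "ward_placed": sum(s["wardsPlaced"] for s in participant_stats),
        "ward_destroyed": sum(s["wardsKilled"] for s in participant_stats),
    }


# toy stats for two participants (a real team has 5)
winner_stats = [
    {"wardsPlaced": 10, "wardsKilled": 2},
    {"wardsPlaced": 25, "wardsKilled": 5},
]

summary = summarize_team(dragon_kills=4, baron_kills=1, participant_stats=winner_stats)
print(summary)
```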

In [None]:
objectives_visions_table_value = """
SELECT game_id, 
CASE WHEN wdk >= 4 THEN TRUE ELSE FALSE END AS win_dragon_soul,
CASE WHEN wbk > 0 THEN TRUE ELSE FALSE END AS win_baron_nashor,
wwp AS win_ward_placed, wwk AS win_ward_killed,
CASE WHEN ldk >= 4 THEN TRUE ELSE FALSE END AS lose_dragon_soul,
CASE WHEN lbk > 0 THEN TRUE ELSE FALSE END AS lose_baron_nashor,
lwp AS lose_ward_placed, lwk AS lose_ward_killed
FROM 
(SELECT game_id AS game_id,
avg(wdk)::int AS wdk, avg(wbk)::int AS wbk, 
sum(wp::jsonb::int) FILTER (WHERE win::jsonb::boolean IS TRUE) AS wwp,
sum(wk::jsonb::int) FILTER (WHERE win::jsonb::boolean IS TRUE) AS wwk,
avg(ldk)::int AS ldk, avg(lbk)::int AS lbk,
sum(wp::jsonb::int) FILTER (WHERE win::jsonb::boolean IS FALSE) AS lwp,
sum(wk::jsonb::int) FILTER (WHERE win::jsonb::boolean IS FALSE) AS lwk
FROM (SELECT m."gameId" AS game_id,
json_array_elements(participants) #> '{stats, win}' AS win,
w."baronKills" AS wbk, w."dragonKills" AS wdk,
json_array_elements(participants) #> '{stats, wardsPlaced}' AS wp,
json_array_elements(participants) #> '{stats, wardsKilled}' AS wk,  
l."baronKills" AS lbk, l."dragonKills" AS ldk
FROM staging_match AS m  
INNER JOIN staging_winner AS w ON (m."gameId" = w."gameId") 
INNER JOIN staging_loser AS l ON (m."gameId" = l."gameId")
ORDER BY game_id
) AS tb1 GROUP BY game_id) AS tb3
"""

In [None]:
objectives_visions_table_insert = """INSERT INTO objectives_visions(game_id, win_dragon_soul, win_baron_nashor, win_ward_placed, win_ward_destroyed, 
lose_dragon_soul, lose_baron_nashor, lose_ward_placed, lose_ward_destroyed) 
{}""".format(objectives_visions_table_value)

In [None]:
cur.execute(objectives_visions_table_insert)

In [None]:
cur.execute("""SELECT * FROM objectives_visions LIMIT 3""")
cur.fetchall()

4. champion_key table

retrieved from `staging_meta_champs`

In [None]:
champion_key_table_value = """SELECT key::int, index FROM staging_meta_champs ORDER BY key::int"""

In [None]:
champion_key_table_insert = """INSERT INTO champion_key(champion_key, champion_name) {}""".format(champion_key_table_value)

In [None]:
cur.execute(champion_key_table_insert)

In [None]:
cur.execute("""SELECT * FROM champion_key LIMIT 10""")
cur.fetchall()

5. `item_key` table

can be retrieved from `staging_meta_items`

In [None]:
item_key_table_value = """SELECT index, name FROM staging_meta_items ORDER BY index"""

In [None]:
item_key_table_insert = """INSERT INTO item_key(item_key, item_name) {}""".format(item_key_table_value)

In [None]:
cur.execute(item_key_table_insert)

6. `games` table

can be retrieved from `staging_match`

In [None]:
games_table_value = """
SELECT game_id, game_duration, game_version, array_agg(a) AS participants FROM
(SELECT "gameId" AS game_id, "gameDuration" AS game_duration, "gameVersion" AS game_version, 
json_array_elements("participantIdentities") #> '{player, accountId}' AS a
FROM staging_match) AS tb1
GROUP BY game_id, game_duration, game_version
ORDER BY game_id
"""

In [None]:
games_table_insert = """INSERT INTO games(game_id, game_duration, game_version, participants) {}""".format(games_table_value)

In [None]:
cur.execute(games_table_insert)

In [None]:
cur.execute("""SELECT * FROM games LIMIT 3""")
cur.fetchall()

#### 4.2 Data Quality Checks
 * Integrity constraints: satisfied by the **NOT NULL** and **PRIMARY KEY** definitions in the table **CREATE** queries
 * Source/Count: all fact/dimension tables should have an equal number of rows and the same ids in this case
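
The count check can also be automated so that it fails loudly instead of relying on visual inspection. The sketch below uses an in-memory SQLite database as a stand-in for Postgres; `row_counts` and `assert_equal_counts` are hypothetical helpers. With psycopg2, `cur.execute` returns `None`, so `fetchone` would need to be called in a separate statement.

```python
import sqlite3


def row_counts(cur, tables):
    """Return {table: row count}; table names must come from a trusted list."""
    return {
        table: cur.execute("SELECT count(*) FROM {}".format(table)).fetchone()[0]
        for table in tables
    }


def assert_equal_counts(counts):
    """Raise if the tables do not all have the same number of rows."""
    if len(set(counts.values())) != 1:
        raise ValueError("row counts differ: {}".format(counts))


conn = sqlite3.connect(":memory:")  # SQLite stands in for Postgres here
cur = conn.cursor()
for table in ("games", "champions"):
    cur.execute("CREATE TABLE {} (game_id INTEGER PRIMARY KEY)".format(table))
    cur.executemany("INSERT INTO {} VALUES (?)".format(table), [(1,), (2,), (3,)])

counts = row_counts(cur, ["games", "champions"])
assert_equal_counts(counts)
print(counts)
```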

we first examine the data types of each fact/dimension table

In [None]:
# print data type of 'games'
cur.execute("""SELECT column_name, data_type FROM information_schema.columns WHERE table_name = 'games' """)
pd.DataFrame(cur.fetchall(), columns=['column_name', 'data_type'])

In [None]:
# print data type of 'champions'
cur.execute("""SELECT column_name, data_type FROM information_schema.columns WHERE table_name = 'champions' """)
pd.DataFrame(cur.fetchall(), columns=['column_name', 'data_type'])

In [None]:
# print data type of 'items'
cur.execute("""SELECT column_name, data_type FROM information_schema.columns WHERE table_name = 'items' """)
pd.DataFrame(cur.fetchall(), columns=['column_name', 'data_type'])

In [None]:
# print data type of 'objectives_visions'
cur.execute("""SELECT column_name, data_type FROM information_schema.columns WHERE table_name = 'objectives_visions' """)
pd.DataFrame(cur.fetchall(), columns=['column_name', 'data_type'])

In [None]:
# print data type of 'champion_key'
cur.execute("""SELECT column_name, data_type FROM information_schema.columns WHERE table_name = 'champion_key' """)
pd.DataFrame(cur.fetchall(), columns=['column_name', 'data_type'])

In [None]:
# print data type of 'item_key'
cur.execute("""SELECT column_name, data_type FROM information_schema.columns WHERE table_name = 'item_key' """)
pd.DataFrame(cur.fetchall(), columns=['column_name', 'data_type'])

test if the fact and dimension tables have an equal number of rows

In [None]:
# count rows of table
count_rows = """
SELECT
(SELECT count(*) FROM games) AS g,
(SELECT count(*) FROM champions) AS c,
(SELECT count(*) FROM items) AS i,
(SELECT count(*) FROM objectives_visions) AS o
"""
cur.execute(count_rows)
cur.fetchall()

select some random samples across the fact table and the `staging_match` table


In [None]:
random_games_samples = """
SELECT "gameDuration" AS sm_gd, gd, "gameVersion" AS sm_gv, gv FROM staging_match, 
(SELECT g.game_id AS game_id, g.game_duration AS gd, g.game_version AS gv FROM games AS g ORDER BY random() LIMIT 3) AS tb1
WHERE "gameId" IN (game_id)
"""
cur.execute(random_games_samples)
cur.fetchall()

the data matched exactly

now print two **random** rows of each fact/dimension table

In [None]:
# print two random rows of 'games'
cur.execute("""SELECT * FROM games ORDER BY random() LIMIT 2""")
# cur.description holds the column names of the last query, in result order
pd.DataFrame(cur.fetchall(), columns=[col[0] for col in cur.description])

In [None]:
# print two random rows of 'champions'
cur.execute("""SELECT * FROM champions ORDER BY random() LIMIT 2""")
# cur.description holds the column names of the last query, in result order
pd.DataFrame(cur.fetchall(), columns=[col[0] for col in cur.description])

In [None]:
# print two random rows of 'items'
cur.execute("""SELECT * FROM items ORDER BY random() LIMIT 2""")
# cur.description holds the column names of the last query, in result order
pd.DataFrame(cur.fetchall(), columns=[col[0] for col in cur.description])

In [None]:
# print two random rows of 'objectives_visions'
cur.execute("""SELECT * FROM objectives_visions ORDER BY random() LIMIT 2""")
# cur.description holds the column names of the last query, in result order
pd.DataFrame(cur.fetchall(), columns=[col[0] for col in cur.description])

all rows seemed valid

now the two key-name mapping tables

In [None]:
# print two random rows of 'champion_key'
cur.execute("""SELECT * FROM champion_key ORDER BY random() LIMIT 2""")
# cur.description holds the column names of the last query, in result order
pd.DataFrame(cur.fetchall(), columns=[col[0] for col in cur.description])

In [None]:
# print two random rows of 'item_key'
cur.execute("""SELECT * FROM item_key ORDER BY random() LIMIT 2""")
# cur.description holds the column names of the last query, in result order
pd.DataFrame(cur.fetchall(), columns=[col[0] for col in cur.description])

#### 4.3 Data dictionary 

##### games - fact table

- **game_id**: game id of each LOL game, loaded from `staging_match`
- **game_duration**: duration of each LOL game, loaded from `staging_match`
- **game_version**: game version (patch) of each LOL game, loaded from `staging_match`
- **participants**: account ids of the 10 participants in a game, loaded from `staging_match` -> `participantIdentities` -> `player` -> `accountId`

##### champions

- **game_id**: same as in fact table
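- **champ_1 - champ_5**: champions of participants 1 to 5, in `participantId` order, loaded from `staging_match` -> `participants` -> `championId`; they are treated as the winning team here, but the insert query orders by `participantId` only and does not check `win`, so this should be verified before relying on it
- **champ_6 - champ_10**: champions of participants 6 to 10, treated as the losing team, same conditions applied 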

##### items

- **game_id**: same as in fact table
- **build_1 - build_5**: builds from the winning team, each build has 7 item slots (any 'item' = 0 means the build has fewer than 7 items), order follows the one in `champions`, loaded from `staging_match` -> `participants` -> `stats` -> `item[0-6]`
- **build_6 - build_10**: builds from the losing team, same conditions applied

##### objectives_visions

- **game_id**: same as in fact table
- **win_dragon_soul**: if the winning team killed >= 4 dragons, they have the dragon soul; otherwise they don't
- **win_baron_nashor**: if the winning team killed Baron Nashor at least once, they have Baron Nashor; otherwise they don't
- **win_ward_placed**: sum of wards placed by the 5 participants in the winning team, throughout the game
- **win_ward_destroyed**: sum of wards destroyed by the 5 participants in the winning team, throughout the game
- **lose_\***: same conditions applied


note: objective kills come from `staging_winner` and `staging_loser`, and ward counts come from `staging_match`; the actual path of each field involves some SQL functions and aggregations, and for the sake of readability I do not outline them here; please refer back to the query `objectives_visions_table_value` 

##### champion_key

- **champion_key**: integer key used in the `champions` table, loaded from `staging_meta_champs` -> `key`
- **champion_name**: text mapping of champion_key, loaded from `staging_meta_champs` -> `index`

##### item_key

- **item_key**: integer key used in the `items` table, loaded from `staging_meta_items` -> `index`
- **item_name**: text mapping of item_key, loaded from `staging_meta_items` -> `name`

## Step 5: Project Write Up

**the rationale for the choice of tools and technologies**: originally the desired tool for data at this scale would have been something like AWS Redshift; however, Redshift did not support storing array columns, and some other advanced features of a relational database were not supported in Redshift either, for the sake of efficiency. A Postgres database therefore turned out to be the more reasonable choice when the data contains nested JSON fields. AWS RDS offers Postgres, and while it is not as fast as Redshift at analytic jobs, it made the ETL process easier



**how often the data should be updated and why**: at least once a week, as the metrics in LOL change very fast: the game's developer sometimes releases new versions on a weekly basis, and trends form among players every day

**if the data was increased by 100x**: introduce Apache Spark and Airflow to the ETL process; Spark's distributed computing would provide the scale, and Airflow would help in scheduling and monitoring the ETL process, since it would then take much longer

**if the data populates a dashboard that must be updated on a daily basis by 7am every day**: run the ETL on a VM (like an AWS EC2 instance) every night, triggered by a scheduled CloudWatch event, and update the dashboard once it finishes

**if the database needed to be accessed by 100+ people**: increase availability by using AWS RDS's Multi-AZ deployment, or by using Apache Spark to spread the load across distributed machines