This repository contains multiple CSV files, each containing information about competitive League of Legends players.
The overarching goal of this project was to acquire data on professional League of Legends (LoL) players and see how they are connected to one another. The method used to acquire the data is explained and shared via a Python file (see List of contents).
- Background information
- league_2012_2021_edge_list.csv
- final_meta_data.csv
- final_teams_data.csv
- webscrape.py / Data Gathering
- sub_edge_lists
  - CBLOL_Brazil.csv
  - LCK_South Korea.csv
  - LCL_Russia.csv
  - LCS_North America.csv
  - LEC_Europe.csv
  - LJL_Japan.csv
  - LL_Latin America.csv
  - LMS_Taiwan + Hong Kong + Macao.csv
  - LPL_China.csv
  - LST_Southeast Asia.csv
  - MSI.csv
  - OPL_Oceania.csv
  - PCL_PCS.csv
  - TCL_Turkey.csv
  - VCS_Vietnam.csv
  - WORLDS.csv
- Analysis
League of Legends is a 2009 multiplayer online battle arena video game developed and published by Riot Games. Two teams battle to destroy each other's Nexus; the team that succeeds wins. Teams are made up of five players, though one or more coaches can be present during the draft phase at the beginning of the game.
The league_2012_2021_edge_list.csv file contains roughly 42,000 lines and four columns, and thus roughly 42,000 edges. The edge list is a combination of several separate edge lists, each of which originated from a different regional league (e.g. LCS, LEC or LCK) or large international tournament (e.g. MSI or Worlds) from 2012 to 2021. Due to inconsistent data entry, legal name changes, gamer tag changes and many other issues, this edge list was refined over multiple iterations of cleaning. Additional information regarding the cleaning process can be found in the 'webscrape.py' section. In this edge list, players and coaches are mixed together. A link between two players and/or coaches is created if and only if the two actively took part in a match together. A coach who coached a match counts as having played with the players from that match, whereas a substitute player is not linked to another player unless the two actually played a competitive match together. The csv file looks like the following table:
From | To | n_played | Region |
---|---|---|---|
Alocs (Leonardo Belo) | Leko (Whesley Holler) | 17 | CBLOL/Brazil |
Alocs (Leonardo Belo) | Revolta (Gabriel Henud) | 21 | CBLOL/Brazil |
Alocs (Leonardo Belo) | manajj (André Rocha) | 17 | CBLOL/Brazil |
Alocs (Leonardo Belo) | takeshi (Murilo Alves) | 39 | CBLOL/Brazil |
Leko (Whesley Holler) | Revolta (Gabriel Henud) | 37 | CBLOL/Brazil |
Leko (Whesley Holler) | manajj (André Rocha) | 17 | CBLOL/Brazil |
Leko (Whesley Holler) | takeshi (Murilo Alves) | 37 | CBLOL/Brazil |
The edge list is meant for undirected networks: the existence of a connection such as (Alocs (Leonardo Belo), Leko (Whesley Holler)) implies the existence of the reverse connection (Leko (Whesley Holler), Alocs (Leonardo Belo)). The 'n_played' column denotes the number of matches a pair of players played together. The 'Region' column indicates the region in which the two players played together. A duplicate edge is possible if the two players played together in a different region or at MSI/Worlds. Furthermore, each region entry names the main league plus the region, because it also covers the amateur leagues one tier below the main league: e.g. LCS/North America also includes the NA Challenger Series and LCS Academy League.
Of the roughly 42,000 edges, only 107 involve a player without a gamer tag. The first and last names of these players are still usable.
The final_meta_data.csv file contains roughly 17,500 lines and six columns. The meta data of every league and tournament was gathered and joined into a single csv file. The entries in some columns were standardized in order to remove duplicate entries (e.g. adc/ADC/carry/bot were rewritten to Ad Carry). Duplicate entries may still appear due to missing data, inconsistent team naming or multiple roles/teams per player. The format of the final_meta_data.csv file is as follows, including a few example rows:
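As a minimal sketch of how the edge list could be read as an undirected network, the snippet below uses only the Python standard library. It assumes the file is comma-separated with the header `From,To,n_played,Region`; the sample rows are copied from the table above, and summing duplicate edges across regions is one possible choice, not necessarily what the analysis does.

```python
import csv
import io
from collections import defaultdict

# Sample rows copied from the table above. The real file is assumed to be
# comma-separated with the header From,To,n_played,Region.
SAMPLE = """From,To,n_played,Region
Alocs (Leonardo Belo),Leko (Whesley Holler),17,CBLOL/Brazil
Alocs (Leonardo Belo),Revolta (Gabriel Henud),21,CBLOL/Brazil
Leko (Whesley Holler),Revolta (Gabriel Henud),37,CBLOL/Brazil
"""

def load_undirected(handle):
    """Return {player: {neighbour: total matches}} with every edge stored both ways."""
    graph = defaultdict(lambda: defaultdict(int))
    for row in csv.DictReader(handle):
        a, b, n = row["From"], row["To"], int(row["n_played"])
        # Duplicate edges (same pair, different Region) are summed here (assumption).
        graph[a][b] += n
        graph[b][a] += n
    return graph

graph = load_undirected(io.StringIO(SAMPLE))
# The reverse direction is available even though only (Alocs, Leko) is listed.
print(graph["Leko (Whesley Holler)"]["Alocs (Leonardo Belo)"])
```

For the real file, the `io.StringIO(SAMPLE)` argument could be swapped for `open("league_2012_2021_edge_list.csv", newline="", encoding="utf-8")`, assuming the file is UTF-8 encoded.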
gamer_tag | full_name | role | residency | country | team |
---|---|---|---|---|---|
brTT | Felipe Gonçalves | Ad Carry | Brazil | brazil | paiN |
brTT | Felipe Gonçalves | Ad Carry | Unknown/None | brazil | kStars |
brTT | Felipe Gonçalves | Ad Carry | Brazil | brazil | paiN Gaming |
LEP | Pedro Luiz Marcari | Top | Brazil | brazil | KaBuM! Orange |
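Because one player can appear on several rows (one per team or team spelling), it can help to collapse the meta data per player. The sketch below is illustrative only: it assumes a comma-separated file with the six column names shown above and reuses the example rows from the table.

```python
import csv
import io
from collections import defaultdict

# Example rows copied from the table above; comma-separated layout is assumed.
SAMPLE = """gamer_tag,full_name,role,residency,country,team
brTT,Felipe Gonçalves,Ad Carry,Brazil,brazil,paiN
brTT,Felipe Gonçalves,Ad Carry,Unknown/None,brazil,kStars
brTT,Felipe Gonçalves,Ad Carry,Brazil,brazil,paiN Gaming
LEP,Pedro Luiz Marcari,Top,Brazil,brazil,KaBuM! Orange
"""

def teams_per_player(handle):
    """Map (gamer_tag, full_name) to the distinct team names, in file order."""
    teams = defaultdict(list)
    for row in csv.DictReader(handle):
        key = (row["gamer_tag"], row["full_name"])
        if row["team"] not in teams[key]:
            teams[key].append(row["team"])
    return dict(teams)

player_teams = teams_per_player(io.StringIO(SAMPLE))
for (tag, name), teams in player_teams.items():
    print(tag, name, teams)
```

Note that 'paiN' and 'paiN Gaming' stay separate here; merging such spellings would need an extra, manually curated mapping.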
The final_teams_data.csv file is in a wide format: every named team is a column, and the rows hold the members of that team who actively played together, as mentioned in the 2nd paragraph. Because of the wide variety of teams, including teams that were simply renamed, the number of teams is very large. The resulting dataframe/table that can be created from this file contains 1525 teams with a maximum of 50 players per team.
CNB | KaBuM | kStars | etc. |
---|---|---|---|
Alocs (Leonardo Belo) | Espeon (Martin Gonçalves) | SuNo (An Sun-ho (안순호)) | etc. |
Leko (Whesley Holler) | LEP (Pedro Luiz Marcari) | Winged (Park Tae Jin (박태진)) | etc. |
manajj (André Rocha) | Danagorn (Daniel Drummond) | Mylon (Matheus Borges) | etc. |
Revolta (Gabriel Henud) | TinOwns (Thiago Sartori) | brTT (Felipe Gonçalves) | etc. |
takeshi (Murilo Alves) | bit1 (Bruno Lima) | Loop (Caio Almeida) | etc. |
Aoshi (Franklin Coutinho) | nan | Winged (Park Tae-jin (박태진)) | etc. |
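The wide layout can be turned into one roster per team by reading the file column by column. The following sketch assumes a comma-separated file whose header row holds the team names, and it treats both empty cells and the literal string `nan` as missing members; the sample rows are taken from the table above.

```python
import csv
import io

# Two rows copied from the table above; comma-separated layout is assumed.
SAMPLE = """CNB,KaBuM,kStars
Alocs (Leonardo Belo),Espeon (Martin Gonçalves),SuNo (An Sun-ho (안순호))
Aoshi (Franklin Coutinho),nan,Winged (Park Tae-jin (박태진))
"""

def rosters(handle):
    """Return {team: [members]} from the wide table, skipping missing cells."""
    reader = csv.DictReader(handle)
    result = {team: [] for team in reader.fieldnames}
    for row in reader:
        for team, member in row.items():
            # Skip surplus fields, empty cells and 'nan' placeholders.
            if team is None or not member or member.strip().lower() == "nan":
                continue
            result[team].append(member.strip())
    return result

team_rosters = rosters(io.StringIO(SAMPLE))
print({team: len(members) for team, members in team_rosters.items()})
```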
The lolfandom website was scraped to collect the information for every tournament and league. Whether a player actively played a competitive match with another teammate could easily be extracted from pages such as this or, more accurately, this.
The data on these pages was in the following table format, where a mark means that these players participated in the same match and played together.
player | team_1 | team_2 | team_3 |
---|---|---|---|
player_1 | - [x] | - [x] | - [x] |
player_2 | - [x] | - [x] | - [x] |
player_3 | - [x] | - [x] | - [x] |
player_4 | - [x] | - [x] | - [x] |
player_5 | - [x] | - [x] | - [ ] |
coach_1 | - [x] | - [x] | - [x] |
sub_1 | - [ ] | - [ ] | - [x] |
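A participation table like the one above can be converted into weighted edges by pairing everyone who is marked in the same column. The sketch below is not taken from webscrape.py; it treats each column as one match (an assumption) and hard-codes the marks from the table.

```python
from collections import Counter
from itertools import combinations

# Marks copied from the table above: one set of participants per column.
# Treating each column as a single match is an assumption.
matches = {
    "team_1": {"player_1", "player_2", "player_3", "player_4", "player_5", "coach_1"},
    "team_2": {"player_1", "player_2", "player_3", "player_4", "player_5", "coach_1"},
    "team_3": {"player_1", "player_2", "player_3", "player_4", "coach_1", "sub_1"},
}

def count_edges(matches):
    """Count, for every pair of participants, how many matches they shared."""
    edges = Counter()
    for participants in matches.values():
        # Sorting gives each undirected pair a single canonical order.
        for pair in combinations(sorted(participants), 2):
            edges[pair] += 1
    return edges

edges = count_edges(matches)
print(edges[("player_1", "player_2")])  # shared all three columns
print(edges[("player_5", "sub_1")])     # never marked in the same column
```

The resulting counts correspond to the 'n_played' column of the edge list, and pairs that never share a column, such as player_5 and sub_1, get no edge.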
The webscraping process was performed for every major league and tournament; they are listed in the 'sub_edge_lists' paragraph. Within every major league, the separate sub tournaments (e.g. play-ins, play-offs etc.) were found first, followed by every participating team and its match data. After the extraction of all the edges, team data, and meta data, the three types of data were joined into three large datasets.
Due to inconsistent data entries (missing middle names/last names/nicknames/foreign translations), legal name changes, and gamer tag changes, the players had to be compared to each other. Every single player was compared to every other player and the pairs were sorted by name similarity. This was done semi-manually. In the automatic step, if the last names were the same and the gamer tags were the same except for capitalization, the first occurrence of the name overwrote the other occurrences. In the manual step, a script automatically opened two web pages, one per gamer tag, and the two pages were visually inspected to see if they belonged to the same player. The players' full names (minus the gamer tags), their gamer tags, and their gamer tags plus the full names were compared in order to weed out duplicate mentions of the same player. Some examples of inconsistent entries:
Comparison of gamer tags exposed this:
Nyjacky (Chenglong "Jacky" Wang) - Nyjacky (Chenglong Wang) - 1.0
Comparison of full names exposed this:
Jensen (Nicolaj Jensen) - Incarnati0n (Nicolaj Jensen) - 1.0
Comparison of gamer tags and full names exposed this:
catjug (Chen Yi-Jie (陈艺杰)) - catjungle (Chen Yi-Jie) - 0.82
A value of 1.0 means that the two compared strings were identical, i.e. a perfect match. This cleaning process was repeated until no more duplicates could be found. However, there may still be duplicates that could not be spotted. In most cases the most recent and accurate name was chosen to replace the older names.
Again, due to inconsistent data entries, some columns in the meta data csv file had to be standardized. In the 'role' column, for example, entries that were meant to be 'Top Lane' could be written as 'T', 't', 'top' and others. This standardization process was repeated for every role. The 'country' column went through a similar process. Lastly, cleaning the 'residency' column proved a bit trickier, as some regions were later abandoned and residents of a region that no longer exists were assigned to a larger existing region. For example, the 'residency' column contained entries such as 'OCE' and 'Oceania'; that region was dissolved in October 2020, after which all players with residency in Oceania acquired North American residency. This page was consulted in order to find such outliers.
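The pairwise comparison can be sketched as follows. The text does not say which similarity measure was used, so `difflib.SequenceMatcher` from the standard library is a stand-in here, and its scores will not necessarily match the values quoted above. The gamer tags are taken from the examples.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a, b):
    """Similarity in [0, 1]; 1.0 means the two strings are identical."""
    return SequenceMatcher(None, a, b).ratio()

def ranked_pairs(names):
    """Compare every name with every other name, most similar pairs first."""
    pairs = [(similarity(a, b), a, b) for a, b in combinations(names, 2)]
    return sorted(pairs, reverse=True)

# Gamer tags from the examples above.
tags = ["Nyjacky", "catjug", "catjungle", "Jensen", "Incarnati0n"]
for score, a, b in ranked_pairs(tags)[:3]:
    print(f"{a} - {b} - {score:.2f}")
```

Sorting the pairs by score puts the most likely duplicates at the top, which keeps the manual inspection step short. A pair such as Jensen and Incarnati0n scores low on gamer tags alone, which is why the full names were compared as well.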
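Role standardization of this kind can be expressed as a lookup table. The mapping below is a hypothetical sketch, not the one used in webscrape.py: it contains only the variants named in this document, and it uses the canonical labels that appear in the example meta data rows ('Ad Carry', 'Top').

```python
# Hypothetical alias table; only variants mentioned in this document are listed.
ROLE_ALIASES = {
    "adc": "Ad Carry",
    "carry": "Ad Carry",
    "bot": "Ad Carry",
    "t": "Top",
    "top": "Top",
}

def standardize_role(raw):
    """Map a raw role entry to its canonical label, ignoring case and spaces."""
    cleaned = raw.strip()
    # Unknown entries are returned unchanged so they can be reviewed by hand.
    return ROLE_ALIASES.get(cleaned.lower(), cleaned)

for entry in ["ADC", "adc", " T ", "top", "Support"]:
    print(repr(entry), "->", standardize_role(entry))
```

Lower-casing before the lookup covers variants such as adc/ADC with a single entry.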
The sub_edge_lists folder contains all the separate edge lists from each major league and major international tournament. These files are particularly handy if one wishes to view one specific tournament or league. The sub edge lists were extracted from the main edge list by sorting on 'Region' and saving the individual csv files.
The table below lists the date ranges of each league/tournament. Ongoing indicates that the league is still active. Mix indicates that the csv file contains a mix of multiple tournaments. The % Collected column indicates the share of events whose data could be collected using a reliable method. In most cases all of the data could be collected except for the Qualifiers tournaments, which were a part of nearly all leagues. In very few cases, data was simply not available in the table format presented at the beginning of the previous paragraph. For example, the OPL has listed 24 Main Events and 8 Qualifiers, of which 24 and 7 could be webscraped, respectively, giving a % Collected of 31/32 * 100 = 96.9%.
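Splitting the main edge list by region can be sketched with the standard library alone. The code below assumes the comma-separated layout with a `Region` column, and it infers the file naming (replace `/` with `_`, add `.csv`) from names such as CBLOL_Brazil.csv; the MSI row uses hypothetical player names purely for illustration.

```python
import csv
import io
from collections import defaultdict

# First two rows come from the edge list table; the MSI row is hypothetical.
SAMPLE = """From,To,n_played,Region
Alocs (Leonardo Belo),Leko (Whesley Holler),17,CBLOL/Brazil
Alocs (Leonardo Belo),Revolta (Gabriel Henud),21,CBLOL/Brazil
player_a,player_b,5,MSI
"""

def split_by_region(handle):
    """Group the edge list rows by their 'Region' value."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle):
        groups[row["Region"]].append(row)
    return dict(groups)

def file_name(region):
    """Assumed naming scheme: 'CBLOL/Brazil' becomes 'CBLOL_Brazil.csv'."""
    return region.replace("/", "_") + ".csv"

groups = split_by_region(io.StringIO(SAMPLE))
for region, rows in groups.items():
    print(file_name(region), len(rows))
```

Each group could then be written out with `csv.DictWriter` using the same four column names.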
League/Tournament | Start | End | Start Data Collection | End Data Collection | % Collected | Total Active Days Collected |
---|---|---|---|---|---|---|
CBLOL_Brazil | 26-04-2014 | Ongoing | 26-04-2014 | 31-03-2021 | 100% | 2532 |
LCK_South Korea | 13-01-2012 | Ongoing | 03-07-2013 | 31-03-2021 | 78.8% | 2829 |
LCL_Russia | 16-01-2016 | Ongoing | 16-01-2016 | 31-03-2021 | 100% | 1902 |
LCS_North America | 11-01-2013 | Ongoing | 07-02-2013 | 31-03-2021 | 98.0% | 2975 |
LEC_Europe | 10-12-2012 | Ongoing | 09-02-2013 | 31-03-2021 | 81.8% | 2973 |
LJL_Japan | 09-02-2014 | Ongoing | 24-01-2015 | 31-03-2021 | 86.1% | 2259 |
LL_Latin America | 19-01-2019 | Ongoing | 19-01-2019 | 31-03-2021 | 100% | 803 |
LMS_Taiwan + Hong Kong + Macao | 18-12-2014 | 31-08-2019 | 18-12-2014 | 31-08-2019 | 100% | 1718 |
LPL_China | 29-01-2013 | Ongoing | 16-03-2013 | 31-03-2021 | 93.0% | 2984 |
LST_Southeast Asia | 31-05-2018 | 15-09-2019 | 13-08-2018 | 15-09-2019 | 14.8% (100% excluding qualifiers) | 399 |
MSI | 07-05-2015 | Ongoing | 07-05-2015 | 31-03-2021 | 100% | NA |
OPL_Oceania | 10-01-2015 | 28-08-2020 | 05-02-2015 | 28-08-2020 | 96.9% | 2032 |
PCL_PCS | 29-02-2020 | Ongoing | 29-02-2020 | 31-03-2021 | 100% | 397 |
TCL_Turkey | 01-12-2013 | Ongoing | 24-01-2015 | 31-03-2021 | 80.5% | 2259 |
VCS_Vietnam | 02-11-2013 | Ongoing | 30-07-2016 | 31-03-2021 | 65.1% | 1706 |
WORLDS | 18-06-2011 | Ongoing | 04-10-2012 | 31-03-2021 | 90% | NA |
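The two derived columns can be reproduced with a few lines of Python. This is a sketch under two assumptions that are not stated explicitly in the text: percentages are rounded half up to one decimal, and 'Total Active Days Collected' is the inclusive number of days between the start and end of data collection (which matches the OPL row).

```python
from datetime import date
from decimal import Decimal, ROUND_HALF_UP

def percent_collected(collected, listed):
    """Share of listed events that were scraped, rounded half up to 0.1%."""
    pct = Decimal(collected) / Decimal(listed) * 100
    return pct.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)

def active_days(start, end):
    """Inclusive day count between two DD-MM-YYYY dates (assumed definition)."""
    d1, m1, y1 = map(int, start.split("-"))
    d2, m2, y2 = map(int, end.split("-"))
    return (date(y2, m2, d2) - date(y1, m1, d1)).days + 1

# OPL: 24 of 24 Main Events and 7 of 8 Qualifiers were scraped.
print(percent_collected(24 + 7, 24 + 8))        # 96.9
print(active_days("05-02-2015", "28-08-2020"))  # 2032
```

`Decimal` is used because 31/32 * 100 is exactly 96.875, and Python's built-in `round` would round that tie down to 96.8 rather than the 96.9 reported in the table.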
Click on the 'analysis.md' file for preliminary results and analysis.