# Web Scraping: FIFA 21 - Battle 8min

**Objective:**<br />

- In this notebook, I implement all the functions needed to build a web scraper that collects the results of esports FIFA 21 games played as 8-minute Battles.

**After running this code:**<br />

- I'll have a dataset with the following information for each FIFA 21 Battle game:<br />
&nbsp;&nbsp;`data_horario`, `user_home`, `user_away`, `team_home`, `team_away`, `team_playing_home`, `team_playing_away`,<br />
&nbsp;&nbsp;`resultado`, `Score_end_home`, `Score_end_away`, `target_class` <br />
<br />
- The file name is: `FIFA21_8min_history.csv`

Let's get started!

## 1 - Packages

Let's first import all the packages needed in this notebook.
- [selenium](https://www.selenium.dev/) is a widely used, free and open-source browser automation tool.
- **os** provides a platform-independent interface to operating system functionality.
- **time** provides various time-related functions.
- **datetime** supplies classes for manipulating dates and times.
- **glob** finds all the pathnames matching a specified pattern according to the rules used by the Unix shell, although results are returned in arbitrary order.
- **random** implements pseudo-random number generators for various distributions.
- **lxml** is a feature-rich and easy-to-use library for processing XML and HTML in the Python language.
- **pandas** is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

In [33]:
from selenium import webdriver
import os, time, datetime, glob
from random import randint
from lxml import html
import pandas as pd

In [17]:
# let's create a folder to receive all pages collected.
if not os.path.exists('src'):
    os.makedirs('src')

## 2 - Chrome Driver

In [12]:
# Check your Google Chrome version (... > Settings > About Google Chrome) and
# download the matching ChromeDriver at https://sites.google.com/a/chromium.org/chromedriver/downloads

## Configure Chrome WebDriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--incognito") # Incognito mode
# Selenium 3 style call; Selenium 4 expects the driver path to be passed through a Service object
driver = webdriver.Chrome(r'./ChromeDrivers/chromedriver.exe', options=chrome_options)
driver.maximize_window() # Maximize window size

In [13]:
# Define URL to get page
page_radical = r'https://pt.betsapi.com/le/22614/Esoccer-Battle--8-mins-play/p.'

In [14]:
# Define radical filenames
file_radical = r'./src/EsoccerBattle8mins'
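The two prefixes above are later combined with a page number to form each URL and each local file name. A minimal sketch of that naming scheme follows; the helper name `page_targets` and the upper-case constants are hypothetical and only mirror the values defined above.

```python
# Hypothetical helper (not part of the original notebook): pairs each
# page URL with the local file the page source will be saved to.
PAGE_RADICAL = 'https://pt.betsapi.com/le/22614/Esoccer-Battle--8-mins-play/p.'
FILE_RADICAL = './src/EsoccerBattle8mins'


def page_targets(num_pages):
    """Return (url, filename) pairs for pages 1..num_pages."""
    return [
        (f'{PAGE_RADICAL}{n}', f'{FILE_RADICAL}_{n:0>5d}.txt')
        for n in range(1, num_pages + 1)
    ]


targets = page_targets(3)
for url, filename in targets:
    print(url, '->', filename)
```

Zero-padding the page number to five digits keeps the file names in page order when they are sorted alphabetically, which the parsing step relies on.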

## 3 - Collect Pages

In [46]:
# Collecting 20 pages of games
num_collect_pages = 20
for num_page in range(1, num_collect_pages + 1):
    url_page = page_radical+str(num_page)
    print('>>> ', url_page)
    
    driver.get(url_page)
    tempo_espera = randint(5, 11)
    time.sleep(tempo_espera)
    
    with open('{}_{:0>5d}.txt'.format(file_radical, num_page), "w", encoding='utf-8-sig') as text_file:
        text_file.write(driver.page_source)
    print('-', 'Created file: {}_{:0>5d}.txt'.format(file_radical, num_page))

>>>  https://pt.betsapi.com/le/22614/Esoccer-Battle--8-mins-play/p.1
- Created file: ./src/EsoccerBattle8mins_00001.txt
>>>  https://pt.betsapi.com/le/22614/Esoccer-Battle--8-mins-play/p.2
- Created file: ./src/EsoccerBattle8mins_00002.txt
>>>  https://pt.betsapi.com/le/22614/Esoccer-Battle--8-mins-play/p.3
- Created file: ./src/EsoccerBattle8mins_00003.txt
>>>  https://pt.betsapi.com/le/22614/Esoccer-Battle--8-mins-play/p.4
- Created file: ./src/EsoccerBattle8mins_00004.txt
>>>  https://pt.betsapi.com/le/22614/Esoccer-Battle--8-mins-play/p.5
- Created file: ./src/EsoccerBattle8mins_00005.txt
>>>  https://pt.betsapi.com/le/22614/Esoccer-Battle--8-mins-play/p.6
- Created file: ./src/EsoccerBattle8mins_00006.txt
>>>  https://pt.betsapi.com/le/22614/Esoccer-Battle--8-mins-play/p.7
- Created file: ./src/EsoccerBattle8mins_00007.txt
>>>  https://pt.betsapi.com/le/22614/Esoccer-Battle--8-mins-play/p.8
- Created file: ./src/EsoccerBattle8mins_00008.txt
>>>  https://pt.betsapi.com/le/22614/Eso

## 4 - Parsing HTML

In [47]:
# Current year
ano_atual = 2021

# Flag: True while the rows being read are from January
em_janeiro = False

# Get all files in the folder, in reverse order, so the process finishes with the most recent file
list_of_files = sorted(glob.glob('./src/*.txt'), reverse=True)

start_process_time = datetime.datetime.now()
dados_files, dados_geral = [], []
for filepath in list_of_files[:]:
    print('>>>> Arquivo trabalhado: {}'.format(filepath), end='')
    
    # Import .TXT file (same encoding that was used when the page was saved)
    with open(filepath,'r', encoding='utf-8-sig') as fileobj:
        arquivo_html = fileobj.read().replace('\n','')
        
    # Convert .TXT to HTML type
    fileHTML = html.fromstring(arquivo_html)
    
    # Find games table
    tabela_jogos = fileHTML.xpath('//*[@class="table table-sm"]/tbody')[0]
    linhas_tabela_jogos = tabela_jogos.xpath("//tr")[1:] # Remove header from list
    
    # Getting into table and collect each game information
    cont = 1 # Counting games 
    for row in linhas_tabela_jogos[:]:

        dt_game = row[0].text_content().strip() # Collect date
        if int(dt_game[:2]) == 1: # The month is January
            em_janeiro = True
            
        if em_janeiro and int(dt_game[:2]) != 1: # Left January: switch to the previous year
            ano_atual = ano_atual-1
            em_janeiro = False
            
        # Build date in correct format
        data_horario = '{:0>2d}/{:0>2d}/{:0>4d} {}'.format(int(dt_game[3:5]),int(dt_game[:2]),ano_atual,dt_game[6:])

        # Collect both teams in the game
        team_home, team_away = row[2].text_content().strip().split(' v ')
        
        # Collect, from the team name, the user name for each team
        user_home = team_home[team_home.find("(")+1:team_home.find(")")].strip()
        user_away = team_away[team_away.find("(")+1:team_away.find(")")].strip()
        
        # Collect the score result
        resultado = row[3].text_content().strip()
        
        # Skip games without a final score: the result cell then holds one of
        # 'Cancelled', 'View', 'Postponed', 'Interrupted' or 'Abandoned'
        if resultado in ('Cancelled', 'View', 'Postponed', 'Interrupted', 'Abandoned'):
            continue

        # Calculate target (who won the game, or draw)
        Score_end_home, Score_end_away = resultado.split('-')
        # Compare as integers: comparing the strings would rank '10' below '9'
        gols_home, gols_away = int(Score_end_home), int(Score_end_away)
        if gols_home > gols_away: # Home win
            target_class = 1
        elif gols_home < gols_away: # Away win
            target_class = 2
        else: # Draw
            target_class = 0
            
        # Build a list with all information for dataset
        linha_coletada = [data_horario, user_home, user_away, team_home, team_away,
                          resultado, Score_end_home, Score_end_away, target_class]
        
        # Appending each line to build a "Full Dataset"
        dados_files.append(linha_coletada)
        cont += 1 # Incrementing counting games
        
    print('\r>>>> Arquivo trabalhado: {}  ({})'.format(filepath, cont))
    
# Convert all lines into DataFrame
df = pd.DataFrame(dados_files, columns=['data_horario',
                                        'user_home', 'user_away',
                                        'team_home', 'team_away', 'resultado',
                                        'Score_end_home', 'Score_end_away',
                                        'target_class'])

>>>> Arquivo trabalhado: ./src\EsoccerBattle8mins_00020.txt>>>> Arquivo trabalhado: ./src\EsoccerBattle8mins_00020.txt  (31)
>>>> Arquivo trabalhado: ./src\EsoccerBattle8mins_00019.txt>>>> Arquivo trabalhado: ./src\EsoccerBattle8mins_00019.txt  (31)
>>>> Arquivo trabalhado: ./src\EsoccerBattle8mins_00018.txt>>>> Arquivo trabalhado: ./src\EsoccerBattle8mins_00018.txt  (31)
>>>> Arquivo trabalhado: ./src\EsoccerBattle8mins_00017.txt>>>> Arquivo trabalhado: ./src\EsoccerBattle8mins_00017.txt  (31)
>>>> Arquivo trabalhado: ./src\EsoccerBattle8mins_00016.txt>>>> Arquivo trabalhado: ./src\EsoccerBattle8mins_00016.txt  (31)
>>>> Arquivo trabalhado: ./src\EsoccerBattle8mins_00015.txt>>>> Arquivo trabalhado: ./src\EsoccerBattle8mins_00015.txt  (26)
>>>> Arquivo trabalhado: ./src\EsoccerBattle8mins_00014.txt>>>> Arquivo trabalhado: ./src\EsoccerBattle8mins_00014.txt  (26)
>>>> Arquivo trabalhado: ./src\EsoccerBattle8mins_00013.txt>>>> Arquivo trabalhado: ./src\EsoccerBattle8mins_00013.tx

In [7]:
# Close and Quit ChromeDriver
driver.close(); driver.quit()
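The scraped dates carry no year, so the parsing loop in section 4 infers it from the month. The sketch below isolates that idea in a small function. It assumes the raw strings look like `MM/DD HH:MM` (as the slicing in the loop suggests) and that they are visited from newest to oldest; the function name `add_years` and the sample dates are hypothetical.

```python
def add_years(raw_dates, start_year):
    """Attach a year to 'MM/DD HH:MM' strings visited newest-first.

    Assumption: the list walks backwards in time, so the first
    non-January row after a run of January rows belongs to the
    previous year.
    """
    year = start_year
    in_january = False
    dated = []
    for raw in raw_dates:
        month, day, clock = int(raw[:2]), int(raw[3:5]), raw[6:]
        if month == 1:
            in_january = True
        elif in_january:
            # Crossed from January back into December: previous year.
            year -= 1
            in_january = False
        # Output uses the DD/MM/YYYY HH:MM layout expected by the cleaning step.
        dated.append(f'{day:0>2d}/{month:0>2d}/{year:0>4d} {clock}')
    return dated


# Illustrative input only, not taken from the scraped pages.
dates = add_years(['01/02 00:10', '01/01 23:50', '12/31 23:40'], 2021)
print(dates)
```

Because the year is only decremented on a January-to-other-month transition, the result depends on the order in which rows are visited; feeding the rows oldest-first would need the opposite rule.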

## 5 - Data cleaning

In [None]:
# Convert the date column to datetime format
df['data_horario'] = pd.to_datetime(df['data_horario'], format='%d/%m/%Y %H:%M')
df['data_horario'] = (df['data_horario'] - datetime.timedelta(hours=1, minutes=0)) # Shift one hour back to match the Brazilian timezone

# Standardize user and team names to uppercase
df['user_home'] = df['user_home'].str.upper()
df['user_away'] = df['user_away'].str.upper()
df['team_home'] = df['team_home'].str.upper()
df['team_away'] = df['team_away'].str.upper()

# Collecting the real team name (without User)
df['team_playing_home'] = df['team_home'].str.split('(').map(lambda x: x[0].strip())
df['team_playing_away'] = df['team_away'].str.split('(').map(lambda x: x[0].strip())

# Rebuilding Teams name 
df['team_home'] = df['team_playing_home'] + ' (' + df['user_home'] + ') ' + 'ESPORTS' 
df['team_away'] = df['team_playing_away'] + ' (' + df['user_away'] + ') ' + 'ESPORTS'

# Ordering data by date
df = df.sort_values(by=['data_horario']).reset_index(drop=True)
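The name handling above (uppercase, split off the user, rebuild the label) can also be tried on a single string without pandas. This is a minimal sketch under the assumption that a raw label looks like `Team (user) Esports`; the regular expression, the function names, and the sample label are hypothetical and not part of the original notebook.

```python
import re

# Assumed label layout: '<team name> (<user>) <anything>'.
LABEL_RE = re.compile(r'^(?P<team>[^(]*)\((?P<user>[^)]*)\)')


def split_label(label):
    """Split a raw team label into upper-cased (team, user)."""
    match = LABEL_RE.match(label)
    if match is None:
        raise ValueError(f'unexpected team label: {label!r}')
    return match['team'].strip().upper(), match['user'].strip().upper()


def rebuild_label(team, user):
    """Rebuild the label in the same format as the cleaning step."""
    return f'{team} ({user}) ESPORTS'


team, user = split_label('Chelsea (hotShot) Esports')
print(team, user, rebuild_label(team, user))
```

Raising on an unexpected label makes malformed rows visible early, whereas the `find`-based slicing in the parsing loop would silently return a wrong substring.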

## 6 - Output File

In [52]:
%time df.to_csv('./FIFA21_8min_history.csv', sep=';', decimal=',', encoding='utf-8-sig', index=False)

Wall time: 27.1 ms
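Since the file is written with `;` as separator and `utf-8-sig` as encoding, a reader has to use the same settings. The sketch below shows this with the standard `csv` module on a tiny temporary file; the sample content is a hand-made excerpt with only four of the columns, and the temporary path is an assumption for illustration.

```python
import csv
import os
import tempfile

# Hand-made excerpt in the same layout as the output file (subset of columns).
sample = (
    'data_horario;user_home;user_away;target_class\n'
    '2021-03-07 23:10:00;ARCOS;HOTSHOT;2\n'
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.csv')
    with open(path, 'w', encoding='utf-8-sig', newline='') as fh:
        fh.write(sample)
    # 'utf-8-sig' strips the byte-order mark; delimiter must match sep=';'.
    with open(path, encoding='utf-8-sig', newline='') as fh:
        rows = list(csv.DictReader(fh, delimiter=';'))

print(rows[0]['user_home'], rows[0]['target_class'])
```

Reading with plain `utf-8` instead would leave the byte-order mark attached to the first column name, so `data_horario` would not be found as a key.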


In [49]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 560 entries, 0 to 559
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   data_horario       560 non-null    datetime64[ns]
 1   user_home          560 non-null    object        
 2   user_away          560 non-null    object        
 3   team_home          560 non-null    object        
 4   team_away          560 non-null    object        
 5   resultado          560 non-null    object        
 6   Score_end_home     560 non-null    object        
 7   Score_end_away     560 non-null    object        
 8   target_class       560 non-null    int64         
 9   team_playing_home  560 non-null    object        
 10  team_playing_away  560 non-null    object        
dtypes: datetime64[ns](1), int64(1), object(9)
memory usage: 48.2+ KB


In [48]:
# End of file
df.tail()

| | data_horario | user_home | user_away | team_home | team_away | resultado | Score_end_home | Score_end_away | target_class | team_playing_home | team_playing_away |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 555 | 2021-03-07 22:34:00 | JKEY | HOTSHOT | ATLETICO MADRID (JKEY) ESPORTS | CHELSEA (HOTSHOT) ESPORTS | 2-1 | 2 | 1 | 1 | ATLETICO MADRID | CHELSEA |
| 556 | 2021-03-07 22:34:00 | VRICO | ARCOS | MAN CITY (VRICO) ESPORTS | DORTMUND (ARCOS) ESPORTS | 0-0 | 0 | 0 | 0 | MAN CITY | DORTMUND |
| 557 | 2021-03-07 22:46:00 | HOTSHOT | VRICO | CHELSEA (HOTSHOT) ESPORTS | MAN CITY (VRICO) ESPORTS | 1-0 | 1 | 0 | 1 | CHELSEA | MAN CITY |
| 558 | 2021-03-07 22:58:00 | VRICO | JKEY | MAN CITY (VRICO) ESPORTS | ATLETICO MADRID (JKEY) ESPORTS | 1-1 | 1 | 1 | 0 | MAN CITY | ATLETICO MADRID |
| 559 | 2021-03-07 23:10:00 | ARCOS | HOTSHOT | DORTMUND (ARCOS) ESPORTS | CHELSEA (HOTSHOT) ESPORTS | 2-3 | 2 | 3 | 2 | DORTMUND | CHELSEA |


------------------