# Scraping FIFA Index website (Step by step)

# 1 - Introduction

In this notebook, I walk step by step through scraping football player information from the [FIFA Index website](https://www.fifaindex.com/). Its only purpose is to serve as a hands-on tutorial on web scraping with [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) and to explain the code that produces the players dataset used in this project (the functions in the scraping_functions.py file). Although there are many web scraping tutorials on the internet (like [this](https://towardsdatascience.com/a-short-practical-how-to-guide-to-scrape-data-from-a-website-using-python-888373227d4f)), I think I learned a lot more here, since the information I was interested in is spread across many different HTML tags and the page layout changes depending on some player features (and also because there is no better tutorial than getting your hands dirty). **As always, any suggestion or constructive criticism is welcome**.

# 2 - Getting information from a single player

In order to get information on all players in FIFA 20, we first need to know how to get it for a single player. As an example, we start by examining the [Lionel Messi page on FIFA Index](https://www.fifaindex.com/player/158023/lionel-messi/), and then discuss how to generalize the data extraction pipeline to all other players. The first thing to do is download the HTML of the page using the requests package and then make the soup (the navigable tree of Python objects).

In [1]:
# Importing libraries used in this section
import requests
import bs4

In [2]:
url_messi = "https://www.fifaindex.com/player/158023/lionel-messi/"
# Requesting access to the webpage
page_messi = requests.get(url_messi)

In [3]:
# Making the soup
soup = bs4.BeautifulSoup(page_messi.text,"html.parser")
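
The request above assumes that the page downloads without problems. A slightly more defensive version could look like the sketch below. The helper name `fetch_html`, the timeout and the User-Agent string are illustrative choices of mine, not part of scraping_functions.py.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Download a page and return its HTML, raising on HTTP errors.

    Hypothetical helper for illustration; not part of scraping_functions.py.
    """
    # Any object with a requests-style .get() works; default to the requests module
    http = session if session is not None else requests
    # Illustrative User-Agent; use one that identifies your own scraper
    headers = {"User-Agent": "fifaindex-tutorial-scraper (educational use)"}
    response = http.get(url, headers=headers, timeout=timeout)
    # Turn 4xx/5xx answers into exceptions instead of parsing an error page
    response.raise_for_status()
    return response.text
```

With this helper, the soup would be built from `fetch_html(url_messi)` instead of `page_messi.text`.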

Once we have the soup, we need to identify which HTML tag contains the information that we need. This can be done by right-clicking on the page and choosing Inspect. After that, for any element on the page we can find the corresponding HTML tag, as shown in the image below.

<img src="images/messi1.png">

Now, getting the information that we want can be done in just a few lines of code.

In [4]:
# Finding the name of the player, overall and potential
info_list = list(soup.find('h5').stripped_strings)
info_list

['Lionel Messi', '94', '94']

In [5]:
# Finding the nationality of a player
info_list.append(soup.find_all('a', {"class": "link-nation"})[1].text)
info_list

['Lionel Messi', '94', '94', 'Argentina']

<img src="images/messi2.png" width="600" style="float:right"> By inspecting the page, we identify that almost all relevant player data are inside tags using the css class "card-body", as shown on the right. We can then find all cards using the soup that we made, and navigate in each one of them to extract the player attributes.

In [6]:
# Getting all the player information cards
cards = soup.find_all("div", {"class": "card-body"})

In [7]:
# Player stats card
info_list.append(cards[0].p.text)
info_list

['Lionel Messi',
 '94',
 '94',
 'Argentina',
 "Lionel Messi was born on June 24, 1987. He is currently 33 years old and plays as a Wide Man for FC Barcelona in Spain. His overall rating in FIFA 20 is 94 with a potential of 94. Messi has got a 4-star skillmoves rating. He prefers to shoot with his left foot. His workrates are Medium / Low. Messi's height is 170 cm cm and his weight is estimated at 72 kg kg according to our database. Currently, Lionel Messi is playing with number 10. His best stats are: Dribbling: 97, Ball Control: 96, Composure: 96, Reactions: 95, Balance: 95."]

In [8]:
#Looking at all pieces of information of the main card
p_tags = cards[1].find_all('p')
p_tags

[<p class="">Height <span class="float-right"><span class="data-units data-units-metric">170 cm</span><span class="data-units data-units-imperial">5'7"</span></span></p>,
 <p class="">Weight <span class="float-right"><span class="data-units data-units-metric">72 kg</span><span class="data-units data-units-imperial">159 lbs</span></span></p>,
 <p class="">Preferred Foot <span class="float-right">Left</span></p>,
 <p class="">Birth Date <span class="float-right">June 24, 1987</span></p>,
 <p class="">Age <span class="float-right">33</span></p>,
 <p class="">Preferred Positions <span class="float-right"><a class="link-position" href="/players/?position=23" title="RW"><span class="badge badge-dark position rw">RW</span></a><a class="link-position" href="/players/?position=25" title="ST"><span class="badge badge-dark position st">ST</span></a><a class="link-position" href="/players/?position=21" title="CF"><span class="badge badge-dark position cf">CF</span></a></span></p>,
 <p class="">Pla

In [9]:
for p_tag in p_tags:
    print(list(p_tag.strings))

['Height ', '170 cm', '5\'7"']
['Weight ', '72 kg', '159 lbs']
['Preferred Foot ', 'Left']
['Birth Date ', 'June 24, 1987']
['Age ', '33']
['Preferred Positions ', 'RW', 'ST', 'CF']
['Player Work Rate ', 'Medium / Low']
['Weak Foot ']
['Skill Moves ']
['Value ', '€95.500.000']
['Value ', '$95.500.000']
['Value ', '£95.500.000']
['Wage ', '€560.000']
['Wage ', '$560.000']
['Wage ', '£560.000']


As we can see, some of the information in the main card is redundant, like the weight in kg and in lbs, or the wage in euros, dollars and pounds. Also, the number of stars in "Weak Foot" and "Skill Moves" is not given as a string, but we can still get this information by noticing that bright stars and empty stars correspond to different classes on the "i" tag. Hence, this time we need more lines of code to get useful, non-redundant data.
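
To make the star counting concrete, here is a minimal sketch on a hand-written snippet. The class string "fas fa-star fa-lg" for bright stars is the one used on the page; the "far" class for empty stars is an assumption made only for this illustration.

```python
import bs4

# Hand-written snippet: "fas fa-star fa-lg" (bright star) comes from the page,
# "far fa-star fa-lg" (empty star) is an assumption for this illustration
html = """
<p>Weak Foot <span class="float-right">
<i class="fas fa-star fa-lg"></i><i class="fas fa-star fa-lg"></i>
<i class="fas fa-star fa-lg"></i><i class="fas fa-star fa-lg"></i>
<i class="far fa-star fa-lg"></i>
</span></p>
"""
p_tag = bs4.BeautifulSoup(html, "html.parser").p

# Passing the full class string matches the class attribute exactly
filled = p_tag.find_all("i", {"class": "fas fa-star fa-lg"})
# class_="fas" matches any <i> that has "fas" among its classes
filled_loose = p_tag.find_all("i", class_="fas")

print(len(filled), len(filled_loose))  # 4 4
```

The exact-string form breaks if the site ever reorders the classes, so matching on the single distinguishing class may be the more robust choice.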

In [10]:
# Getting all the information from the main card
for p_tag in p_tags[:5]:
    info_list.append(list(p_tag.strings)[1])

# Care should be taken here since different players have a different number of preferred positions
positions = '/'.join(list(p_tags[5].strings)[1:])
info_list.append(positions)
info_list.append(list(p_tags[6].strings)[1])

weak_foot_stars = len(p_tags[7].find_all('i',{"class": "fas fa-star fa-lg"}))
info_list.append(weak_foot_stars)

skill_moves_stars = len(p_tags[8].find_all('i',{"class": "fas fa-star fa-lg"}))
info_list.append(skill_moves_stars)

# Only getting wage and value in euros
info_list.append(list(p_tags[9].strings)[1])
info_list.append(list(p_tags[12].strings)[1])

info_list

['Lionel Messi',
 '94',
 '94',
 'Argentina',
 "Lionel Messi was born on June 24, 1987. He is currently 33 years old and plays as a Wide Man for FC Barcelona in Spain. His overall rating in FIFA 20 is 94 with a potential of 94. Messi has got a 4-star skillmoves rating. He prefers to shoot with his left foot. His workrates are Medium / Low. Messi's height is 170 cm cm and his weight is estimated at 72 kg kg according to our database. Currently, Lionel Messi is playing with number 10. His best stats are: Dribbling: 97, Ball Control: 96, Composure: 96, Reactions: 95, Balance: 95.",
 '170 cm',
 '72 kg',
 'Left',
 'June 24, 1987',
 '33',
 'RW/ST/CF',
 'Medium / Low',
 4,
 4,
 '€95.500.000',
 '€560.000']
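
Height, weight, value and wage are still strings at this point. If numbers are needed later, a sketch like the following could convert them. The function names are hypothetical, and it assumes that dots are thousands separators, as the values above suggest.

```python
import re


def parse_amount(text):
    """Turn a string like '€95.500.000' into the integer 95500000."""
    # Keep only the digits; assumes dots are thousands separators
    digits = re.sub(r"\D", "", text)
    return int(digits) if digits else None


def parse_measure(text):
    """Turn a string like '170 cm' into the integer 170."""
    # Read the leading run of digits and ignore the unit
    match = re.match(r"\d+", text)
    return int(match.group()) if match else None


print(parse_amount('€95.500.000'))  # 95500000
print(parse_amount('€560.000'))     # 560000
print(parse_measure('170 cm'))      # 170
```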

The next cards are related to the teams Messi plays for (Barcelona and Argentina). I'm not interested in the details of those cards; the only thing I want to know is the name of the player's club and national team (if any). Depending on the player, the number and order of the cards may change. For example, if a player does not play for his national team, that card is missing from the page (see, for instance, [Dayot Upamecano](https://www.fifaindex.com/player/229558/dayot-upamecano/)). Free agents like [Schetino](https://www.fifaindex.com/player/245309/egidio-maestre-schetino/) have the national team on the right card instead of on the left, and other players like [Walker Zimmerman](https://www.fifaindex.com/player/212591/walker-zimmerman/) do not have any team information at all. Also, players that do not belong to a club have no wage or value information. For this reason, it is useful to define the **teams_info** variable below, which allows us to count the player's teams and check which they are. In the get_player_data function this variable is important for avoiding errors caused by information missing from some player webpages.
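
As a rough illustration of coping with a variable number of team cards, the sketch below collects the team names from a hand-written snippet modeled on the page's team cards. The function `team_names` is hypothetical and only gathers names; how get_player_data decides which team is the club and which is the national team is not reproduced here.

```python
import bs4

# Hand-written snippet modeled on a team card of the page (hrefs shortened)
html = """
<div class="col-12 col-sm-6 col-lg-6 team">
  <div class="card mb-5">
    <h5 class="card-header"><a class="link-team" href="#"><img alt="crest" src="#"/></a> <a class="link-team" href="#">Argentina</a></h5>
    <div class="card-body"><p class="">Kit Number <span class="float-right">10</span></p></div>
  </div>
</div>
"""


def team_names(soup):
    """Return the names of all team cards on a player page (possibly empty)."""
    names = []
    # class_="team" matches on that single class, whatever the other classes are
    for team in soup.find_all("div", class_="team"):
        header = team.find("h5", class_="card-header")
        if header is not None:
            names.append(header.get_text(strip=True))
    return names


print(team_names(bs4.BeautifulSoup(html, "html.parser")))           # ['Argentina']
print(team_names(bs4.BeautifulSoup("<div></div>", "html.parser")))  # []
```

Returning an empty list for a page without team cards lets the caller branch on `len(...)` instead of catching an IndexError.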

In [11]:
# player team(s) information
teams_info = soup.find_all('div',{'class':'col-12 col-sm-6 col-lg-6 team'})
print("number of teams: {}\n".format(len(teams_info)))
teams_info

number of teams: 2



[<div class="col-12 col-sm-6 col-lg-6 team">
 <div class="card mb-5">
 <h5 class="card-header"><a class="link-team" href="/team/1369/argentina/" title="Argentina FIFA 20"><img alt="Argentina FIFA 20" class="team size-3" data-src="/static/FIFA20/images/crest/3/light/1369.png" data-srcset="/static/FIFA20/images/crest/3/light/1369@2x.png 2x, /static/FIFA20/images/crest/3/light/1369@3x.png 3x" src="/static/FIFA21/images/crest/3/light/notfound.png" title="Argentina FIFA 20"/></a> <a class="link-team" href="/team/1369/argentina/" title="Argentina FIFA 20">Argentina</a></h5>
 <div class="card-body">
 <p class="">Position <span class="float-right"><a class="link-position" href="/players/?position=24" title="RS"><span class="badge badge-dark position rs">RS</span></a></span></p>
 <p class="">Kit Number <span class="float-right">10</span></p>
 </div>
 </div>
 </div>, <div class="col-12 col-sm-6 col-lg-6 team">
 <div class="card mb-5">
 <h5 class="card-header"><a class="link-team" href="/team/241

After the initial, tricky part of the scraping, luckily all other cards follow a similar structure. The numeric cards (like Passing and Shooting) have all their information in "p" tags, which can be easily extracted.

In [12]:
cards[4]

<div class="card-body">
<p class="">Ball Control <span class="float-right"><span class="badge badge-dark rating r1">96</span></span></p>
<p class="">Dribbling <span class="float-right"><span class="badge badge-dark rating r1">97</span></span></p>
</div>

In [13]:
for i in range(4,11):
    tags = cards[i].find_all('p')
    for tag in tags:
        info_list.append(int(tag.span.text))
        
info_list

['Lionel Messi',
 '94',
 '94',
 'Argentina',
 "Lionel Messi was born on June 24, 1987. He is currently 33 years old and plays as a Wide Man for FC Barcelona in Spain. His overall rating in FIFA 20 is 94 with a potential of 94. Messi has got a 4-star skillmoves rating. He prefers to shoot with his left foot. His workrates are Medium / Low. Messi's height is 170 cm cm and his weight is estimated at 72 kg kg according to our database. Currently, Lionel Messi is playing with number 10. His best stats are: Dribbling: 97, Ball Control: 96, Composure: 96, Reactions: 95, Balance: 95.",
 '170 cm',
 '72 kg',
 'Left',
 'June 24, 1987',
 '33',
 'RW/ST/CF',
 'Medium / Low',
 4,
 4,
 '€95.500.000',
 '€560.000',
 96,
 97,
 33,
 26,
 37,
 48,
 95,
 94,
 40,
 94,
 96,
 88,
 92,
 92,
 91,
 75,
 68,
 95,
 84,
 93,
 68,
 70,
 86,
 95,
 94,
 93,
 94,
 75,
 88,
 14,
 6,
 11,
 15,
 8]
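
A positional list like the one above is easy to misalign when a card is missing. One alternative, sketched here on the cards[4] snippet shown earlier, is to keep each rating together with its label. The snake_case key format is my own choice for the illustration.

```python
import bs4

# Snippet copied from the cards[4] output shown earlier
html = """
<div class="card-body">
<p class="">Ball Control <span class="float-right"><span class="badge badge-dark rating r1">96</span></span></p>
<p class="">Dribbling <span class="float-right"><span class="badge badge-dark rating r1">97</span></span></p>
</div>
"""
card = bs4.BeautifulSoup(html, "html.parser").div

ratings = {}
for tag in card.find_all("p"):
    # The first string is the label (e.g. 'Ball Control'); the span holds the rating
    label = next(tag.stripped_strings)
    ratings[label.lower().replace(" ", "_")] = int(tag.span.text)

print(ratings)  # {'ball_control': 96, 'dribbling': 97}
```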

The Specialities and Traits cards have unique strings that characterize the special features of the player. Therefore, we can concatenate all this information into a single string (as we did for the preferred positions).

In [14]:
cards[11]

<div class="card-body">
<p>Dribbler</p>
<p>Distance Shooter</p>
<p>Crosser</p>
<p>FK Specialist</p>
<p>Acrobat</p>
<p>Clinical Finisher</p>
</div>

In [15]:
specialities = '/'.join(tag.text for tag in cards[11].find_all('p'))
info_list.append(specialities)

traits = '/'.join(tag.text for tag in cards[12].find_all('p'))
info_list.append(traits)

Putting together everything discussed in this section, we can easily understand the get_player_data function defined in the scraping_functions.py file, which receives a player URL and returns a dictionary with all the relevant data of the player.

# 3 - Getting all player webpages

In order to use the get_player_data function to scrape information on every player in FIFA 20, we first need to get all the corresponding URLs. By inspecting the FIFA Index main page, we notice that each player has a unique ID that appears in his URL.

<img src="images/ronaldo7.png">

We could store the whole URL, but it is cleaner to store the player IDs and construct the URL following the pattern "https://www.fifaindex.com/player/" + player_id. To get the IDs from a page with a list of players, we can use the following code:

In [16]:
players_ids = []
players_list_url = 'https://www.fifaindex.com'

players_list_page = requests.get(players_list_url)
soup = bs4.BeautifulSoup(players_list_page.text,"html.parser")
        
for player_tag in soup.find_all(attrs={'data-playerid': True}):
    players_ids.append(player_tag['data-playerid'])
    
print(players_ids)

['221363', '208054', '158023', '20801', '234396', '229558', '235790', '239085', '233049', '231747', '241096', '228702', '190871', '246147', '206113', '253004', '206517', '213956', '212188', '181291', '208670', '231478', '202652', '211110', '233306', '203376', '237692', '248243', '173731', '209658']
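
Given IDs like the ones printed above, the URL pattern mentioned earlier can be wrapped in a small helper. This is only a sketch: the name `player_url` is hypothetical, and it relies on the pattern stated in the text, so whether the site accepts the URL without the player-name suffix is an assumption taken from that pattern.

```python
FIFAINDEX_PLAYER_URL = "https://www.fifaindex.com/player/"


def player_url(player_id):
    """Build a player page URL from an ID such as '158023'."""
    player_id = str(player_id)
    # IDs scraped from data-playerid are expected to be purely numeric
    if not player_id.isdigit():
        raise ValueError("unexpected player id: {!r}".format(player_id))
    return FIFAINDEX_PLAYER_URL + player_id


print(player_url('158023'))  # https://www.fifaindex.com/player/158023
```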


The idea is then to navigate all pages listing players and extract the IDs. To do that, we need to "click" the "Next Page" button after scraping the IDs from a given page. This can be done by finding the tag containing the link associated with the "Next Page" button.

In [17]:
# Finding the link to the next page
link_tag = soup.find('a',string='Next Page')
print("Next Page tag: {}".format(str(link_tag)))

# Joining with the main page
fifaindex_url = 'https://www.fifaindex.com'
players_list_url = fifaindex_url+link_tag['href']

# Clicking the Next Page button
players_list_page = requests.get(players_list_url)

Next Page tag: <a class="btn btn-light" href="/2/">Next Page</a>


By iterating this process we can obtain the IDs of all the players we are interested in (the website lets us filter players in many different ways). The get_players_ids function (in the scraping_functions.py file) performs this operation sequentially until it reaches the last page (i.e., link_tag is None).
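
The loop described above can be sketched as follows. This is not the actual get_players_ids from scraping_functions.py, only a minimal illustration of the same idea. The function name `collect_player_ids`, the injected `fetch` callable and the `max_pages` limit are assumptions introduced so that the example runs offline.

```python
from urllib.parse import urljoin

import bs4


def collect_player_ids(start_url, fetch, max_pages=None):
    """Follow "Next Page" links from start_url and collect every data-playerid.

    `fetch` is any callable that maps a URL to its HTML text.
    """
    ids, url, pages = [], start_url, 0
    while url is not None and (max_pages is None or pages < max_pages):
        soup = bs4.BeautifulSoup(fetch(url), "html.parser")
        for tag in soup.find_all(attrs={"data-playerid": True}):
            ids.append(tag["data-playerid"])
        link_tag = soup.find("a", string="Next Page")
        # urljoin resolves relative hrefs such as "/2/" against the current URL
        url = urljoin(url, link_tag["href"]) if link_tag is not None else None
        pages += 1
    return ids


# Offline usage with two fake pages standing in for the website
fake_site = {
    "https://example.com/": '<div data-playerid="1"></div><a href="/2/">Next Page</a>',
    "https://example.com/2/": '<div data-playerid="2"></div>',
}
print(collect_player_ids("https://example.com/", fake_site.__getitem__))  # ['1', '2']
```

Against the real website, `fetch` would wrap `requests.get` and should pause between requests so as not to scrape too fast.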

# 4 - Iterating over all players and constructing the dataset

Now it's time to construct the dataset and generate the fifaindex_players.csv file. We first import the relevant functions and packages from the scraping_functions.py file.

In [18]:
from scraping_functions import *

Second, we get all the player IDs and save them in a txt file (so that we do not have to collect them again if the kernel has to be restarted). N.B.: we are interested only in male players, hence the URL has the query string '?gender=male'.

In [None]:
players_ids = get_players_ids('https://www.fifaindex.com/players/?gender=male')

with open('players_ids.txt', 'w') as filehandle:
    for listitem in players_ids:
        filehandle.write('%s\n' % listitem)

In [19]:
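# Code to read the saved file (if needed)
players_ids = []

with open('players_ids.txt', 'r') as filehandle:
    for line in filehandle:
        # strip() removes the newline without cutting a character
        # from a last line that has no trailing newline
        player_id = line.strip()
        if player_id:
            players_ids.append(player_id)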

Finally, we generate the dataframe and save it to a csv file. The generate_players_df function creates the dataframe for us. However, since we are collecting data on every player (almost 20,000), it is useful to build the dataset incrementally, in chunks of 250 players, so that a lost connection or any other failure in the middle of the scraping does not cost us the data collected so far (this actually happened to me :( ...). The code below saves the progress after every chunk. Also, I try to [be a nice guy by not scraping the website too fast](https://www.scrapehero.com/how-to-prevent-getting-blacklisted-while-scraping/) (I also make sure of this in the defined functions). By the end of the day (and by that I mean almost 24 hours of scraping) you will have the fifaindex dataset that I used in the project (or any other dataset from this website).

In [None]:
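# pd, gc, time and clear_output come from the wildcard import above
df = pd.DataFrame()
# To resume after a restart, reload the saved file instead and set `iteration`
# to the first chunk that has not been saved yet:
# df = pd.read_csv('fifaindex_players.csv')

iteration = 0
total_iterations = len(players_ids) // 250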
for i in range(iteration,total_iterations):
    
    iteration = i
    df_i = generate_players_df(players_ids[250*i:(250*i+250)])
    df = pd.concat([df,df_i],ignore_index=True)
    df.to_csv('fifaindex_players.csv',index=False)
    clear_output(wait=True)
    gc.collect()
    print('iteration number {} finished. Waiting 30 seconds before proceeding...\n'.format(iteration))
    time.sleep(30)
    clear_output(wait=True)

iteration = total_iterations
df_i = generate_players_df(players_ids[250*iteration:])
df = pd.concat([df,df_i],ignore_index=True)
df.to_csv('fifaindex_players.csv',index=False)
clear_output(wait=True)
gc.collect()

df.tail()
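
The chunk bookkeeping in the loop above can also be isolated in a small generator, which makes resuming after a crash less error-prone. This is a sketch under one stated assumption: the number of rows already saved corresponds to a whole number of finished chunks (that is, no player was dropped from a saved chunk). The name `chunk_ranges` is hypothetical.

```python
def chunk_ranges(n_items, chunk_size, already_done=0):
    """Yield (start, stop) slices covering the items still to be scraped.

    `already_done` is the number of rows already saved (e.g. len(df) after
    reloading the csv); it is assumed to be a whole number of finished chunks.
    """
    first_chunk = already_done // chunk_size
    for start in range(first_chunk * chunk_size, n_items, chunk_size):
        yield start, min(start + chunk_size, n_items)


print(list(chunk_ranges(600, 250)))                    # [(0, 250), (250, 500), (500, 600)]
print(list(chunk_ranges(600, 250, already_done=250)))  # [(250, 500), (500, 600)]
```

Each pair can be used directly as `players_ids[start:stop]`, and the last, shorter chunk no longer needs a separate block of code after the loop.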

In [20]:
# Inspecting if everything is ok
pd.read_csv('fifaindex_players.csv').info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19626 entries, 0 to 19625
Data columns (total 53 columns):
name                   19626 non-null object
overall                19626 non-null int64
potential              19626 non-null int64
nationality            19626 non-null object
description            19626 non-null object
height                 19626 non-null object
weight                 19626 non-null object
preferred_foot         19626 non-null object
birth_date             19626 non-null object
age                    19626 non-null int64
preferred_positions    19626 non-null object
work_rate              19626 non-null object
weak_foot              19626 non-null int64
skill_moves            19626 non-null int64
value                  19626 non-null object
wage                   19626 non-null object
team_club              19626 non-null object
team_nation            19626 non-null object
ball_control           19626 non-null int64
dribbling              19626 non-null int6