In [1]:
# https://towardsdatascience.com/web-scraping-5649074f3ead

In [2]:
# Build a web scraper with Python and BeautifulSoup (a library) to scrape data from the FIFA World Cup 2018. The data covers each individual player's information and statistics for the whole World Cup.

In [3]:
# Step 0 — Understanding What Data to Scrape

In [4]:
# Step 1 — Loading the Web page in Python

In [None]:
#Import libraries
import requests
from bs4 import BeautifulSoup

#Request URL
page = requests.get("https://www.fifa.com/worldcup/players.html")

#Fetch webpage
soup = BeautifulSoup(page.content,"html.parser")
print(soup.prettify())

In [5]:
# requests: To scrape a page, we first have to download it. We do that with Python's requests library.
# BeautifulSoup: We use this library to parse the HTML page we have just downloaded, so that we can extract the data we need.

In [6]:
'''
With requests.get you fetch the web page by passing its URL. Next, we create a BeautifulSoup instance from the response content and print it to check whether the page loaded correctly. Calling .prettify() indents the HTML so that it is readable to humans. If the page's markup is printed, the page has loaded correctly and we can scrape data from it.
'''

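
# A minimal sketch of the parse-and-print check, assuming a small inline HTML string in place of page.content so that it runs without network access. The sample markup is made up for illustration. The commented lines show one common way to check the response before parsing.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for page.content (not the real FIFA page)
sample_html = b"<html><head><title>Players</title></head><body><p>Hello</p></body></html>"

soup = BeautifulSoup(sample_html, "html.parser")
pretty = soup.prettify()  # one tag per line, indented by nesting depth
print(pretty)

# With a live request, checking the status first avoids parsing an error page:
# page = requests.get(url, timeout=10)
# page.raise_for_status()
```

# prettify() only changes how the markup is displayed; the parsed soup object itself stays the same.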

In [7]:
# Step 2 — First Player

In [8]:
# Ronaldo

In [None]:
#Import libraries
import requests
from bs4 import BeautifulSoup

#Request URL
page = requests.get("https://www.fifa.com/worldcup/players/player/201200/profile.html")

#Fetch webpage
soup = BeautifulSoup(page.content,"html.parser")

#Scraping Data
name = soup.find("div",{"class":"fi-p__name"})
print(name)

In [9]:
'''
We have requested the page that holds Ronaldo's profile and created a BeautifulSoup instance. Now we use BeautifulSoup's find method to locate what we need. We need a div, but the page contains many divs, so which one exactly? The one whose class name is fi-p__name, because that div contains the player's name. Run the cell and voila, we have it.
'''

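
# A small sketch of what find returns, assuming simplified, made-up profile markup (the real page is far larger). It shows that find gives back the first matching tag, and None when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup that reuses the class names from the notebook
sample_html = """
<div class="fi-p__wrapper">
  <div class="fi-p__name">
    Cristiano RONALDO
  </div>
  <div class="fi-p__country">Portugal</div>
</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# First <div> whose class list contains fi-p__name; the result is still a Tag
name_tag = soup.find("div", {"class": "fi-p__name"})

# No matching element: find returns None instead of raising an error
missing = soup.find("div", {"class": "no-such-class"})

print(type(name_tag).__name__)
print(missing)
```

# Because a missing element gives None, chaining .text directly onto find fails with an AttributeError if the class is not on the page.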

In [10]:
# Simplifying the Output

In [None]:
#Import libraries
import requests
from bs4 import BeautifulSoup

#Request URL
page = requests.get("https://www.fifa.com/worldcup/players/player/201200/profile.html")

#Fetch webpage
soup = BeautifulSoup(page.content,"html.parser")

#Scraping Data
name = soup.find("div",{"class":"fi-p__name"}).text.replace("\n","").strip()
print(name)

In [11]:
# We can't store the data in the form we got it, because find returns the whole tag, markup included. We need plain text, which is why we add .text at the end.

In [12]:
# Still not ideal: the text contains several "\n" characters. We have to remove them as well, so we replace them with nothing (an empty string, "").

In [13]:
# One more thing: .strip() removes the leading and trailing spaces, which gives the final output.

In [14]:
# There we have it: we have fetched the name of one player. In the same way, we will fetch the other details shown on the web page, such as height, country, role, and goals.
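
# The three cleaning steps can be sketched one at a time, assuming a made-up snippet of markup. The get_text(strip=True) call at the end is an alternative from BeautifulSoup, not what the original script uses.

```python
from bs4 import BeautifulSoup

html = '<div class="fi-p__name">\n  Cristiano RONALDO\n</div>'  # hypothetical markup
tag = BeautifulSoup(html, "html.parser").find("div", {"class": "fi-p__name"})

raw = tag.text                       # text only, but with newlines and spaces
no_newlines = raw.replace("\n", "")  # newlines removed, leading spaces remain
clean = no_newlines.strip()          # leading and trailing whitespace removed

print(repr(raw))
print(repr(no_newlines))
print(repr(clean))

# Alternative: trim the whitespace in a single call
alt = tag.get_text(strip=True)
print(repr(alt))
```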

In [15]:
# Step 3 — All Data of First Player

In [None]:
#Import libraries
import requests
from bs4 import BeautifulSoup

#Request URL
page = requests.get("https://www.fifa.com/worldcup/players/player/201200/profile.html")

#Fetch webpage
soup = BeautifulSoup(page.content,"html.parser")

#Scraping Data
#Name #Country #Role #Age #Height #International Caps #International Goals
player_name = soup.find("div",{"class":"fi-p__name"}).text.replace("\n","").strip()
player_country = soup.find("div",{"class":"fi-p__country"}).text.replace("\n","").strip()
player_role = soup.find("div",{"class":"fi-p__role"}).text.replace("\n","").strip()
player_age = soup.find("div",{"class":"fi-p__profile-number__number"}).text.replace("\n","").strip()
player_height = soup.find_all("div",{"class":"fi-p__profile-number__number"})[1].text.replace("\n","").strip()
player_int_caps = soup.find_all("div",{"class":"fi-p__profile-number__number"})[2].text.replace("\n","").strip()
player_int_goals = soup.find_all("div",{"class":"fi-p__profile-number__number"})[3].text.replace("\n","").strip()

print(player_name,"\n",player_country,"\n",player_role,"\n",player_age,"years \n",player_height,"\n",player_int_caps,"caps \n",player_int_goals,"goals")

In [16]:
# Name, country, and role each have their own class, which is easy for us. Age, height, caps, and goals all share one class, fi-p__profile-number__number. No worries: find returns the first match (age), and for the rest we use the find_all method and pick each value by its index, in page order. Tada, we made it simple too.
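
# A sketch of the same idea with a single find_all call, assuming made-up markup in which four values share one class. The numbers are placeholders, not real statistics, and the order (age, height, caps, goals) is the one the script above relies on.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the values are placeholders
sample_html = """
<div class="fi-p__profile-number__number">30</div>
<div class="fi-p__profile-number__number">180 cm</div>
<div class="fi-p__profile-number__number">100</div>
<div class="fi-p__profile-number__number">50</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# Query the page once, then pick values by position
numbers = [
    div.get_text(strip=True)
    for div in soup.find_all("div", {"class": "fi-p__profile-number__number"})
]
age, height, caps, goals = numbers[:4]

print(age, height, caps, goals)
```

# Indexing by position is fragile: if the page adds or reorders a number, every value after it shifts.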

In [17]:
# Step 4 — All Data of All 736 Players

In [18]:
'''
That's the data of one player; how about the others? Each player's profile is on a different web page. We want to scrape them all with one script, not an individual script per player. What do we do now? Find a pattern, so that we can generate all the URLs at once.
'''


In [19]:
'''
As you can see, the URL of Ronaldo's profile is https://fifa.com/worldcup/players/player/201200/profile.html. Do you see a part that might follow the same pattern for every player? Yes: the player ID (201200 for Ronaldo) is the only part that changes, and it is presumably unique to each player. So, before scraping all the other data, we have to scrape the player IDs. We do it just like we scraped the name, but this time we run a loop.
'''

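
# The URL pattern can be sketched as a small helper. profile_url is a hypothetical name introduced here for illustration; the base URL is the one used for the profile pages in this notebook.

```python
BASE_URL = "https://www.fifa.com/worldcup/players/player/"

def profile_url(player_id):
    """Build the profile URL for one player ID (hypothetical helper)."""
    return f"{BASE_URL}{player_id}/profile.html"

print(profile_url(201200))
```

# The helper accepts the ID as an int or a string, which is convenient because IDs read from HTML attributes or a CSV file may arrive as either.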

In [20]:
# 4.1 Player List

In [None]:
#Import libraries
import requests
from bs4 import BeautifulSoup
import pandas

#Empty list to store data
id_list = []

#Fetching URL
request = requests.get("https://www.fifa.com/worldcup/players/_libraries/byposition/[id]/_players-list")
soup = BeautifulSoup(request.content,"html.parser")

#Find all player links once, then collect their IDs
#(the page listed 736 players when this was written)
links = soup.find_all("a","fi-p--link")
for link in links:
    id_list.append(link['data-player-id'])

#Data Frame to store scraped data
df = pandas.DataFrame({
"Ids":id_list
})
df.to_csv('player_ids.csv', index = False)
print(df,"\n Success")

In [21]:
'''
I checked the website: there were 736 players in total, so the loop covers all 736 of them. Here, in place of a div there is an <a>, the anchor tag. But the approach remains the same: search for the tag and find our class. We create an empty list at the beginning, then append one ID to that list in each iteration. In the end, we create a data frame to store those IDs and export it to a .csv file.
'''

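
# A sketch of the ID extraction on made-up markup, so that it runs offline. The link texts and the second ID are placeholders; only the class name and the data-player-id attribute come from the script above.

```python
from bs4 import BeautifulSoup

# Hypothetical player-list markup
sample_html = """
<a class="fi-p--link" data-player-id="201200" href="#">Player A</a>
<a class="fi-p--link" data-player-id="111111" href="#">Player B</a>
<a class="other-link" href="#">Not a player</a>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# Search once, then read the attribute from every matching anchor
links = soup.find_all("a", class_="fi-p--link")
id_list = [link["data-player-id"] for link in links]

print(len(id_list), id_list)
```

# Looping over whatever find_all returns avoids hard-coding the number of players, and len(id_list) can then be compared with the expected count.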

In [22]:
# 4.2 Fetching All Data of All Players

In [23]:
# Now that we have the IDs, and know how to fetch each parameter, we just have to pass those IDs and run all through the iteration.

In [None]:
#Import libraries
import requests
from bs4 import BeautifulSoup
import pandas
from collections import OrderedDict

#Fetch data of Player's ID
player_ids = pandas.read_csv("player_ids.csv")
ids = player_ids["Ids"]

#Provide a base url and an empty list
base_url = "https://www.fifa.com/worldcup/players/player/"
player_list = []

#Iterate to scrape data of players from fifa.com
for pages in ids:
    #Using OrderedDict instead of Dict (See explanation)
    d=OrderedDict()
    #Fetching URLs one by one
    print(base_url+str(pages)+"/profile.html")
    request = requests.get(base_url+str(pages)+"/profile.html")
    #Data processing
    content = request.content
    soup = BeautifulSoup(content,"html.parser")
    #Scraping Data
    #Name #Country #Role #Age #Height #International Caps #International Goals
    d['Name'] = soup.find("div",{"class":"fi-p__name"}).text.replace("\n","").strip()
    print(d['Name'])
    d['Country'] = soup.find("div",{"class":"fi-p__country"}).text.replace("\n","").strip()
    d['Role'] = soup.find("div",{"class":"fi-p__role"}).text.replace("\n","").strip()
    d['Age'] = soup.find("div",{"class":"fi-p__profile-number__number"}).text.replace("\n","").strip()
    d['Height(cm)'] = soup.find_all("div",{"class":"fi-p__profile-number__number"})[1].text.replace("\n","").strip()
    d['International Caps'] = soup.find_all("div",{"class":"fi-p__profile-number__number"})[2].text.replace("\n","").strip()
    d['International Goals'] = soup.find_all("div",{"class":"fi-p__profile-number__number"})[3].text.replace("\n","").strip()

    #Append dictionary to list (inside the loop, so every player is kept)
    player_list.append(d)
#Create a pandas DataFrame to store data and save it to .csv
df = pandas.DataFrame(player_list)
df.to_csv('Players_info.csv', index = False)
print("Success \n")

In [24]:
'''
We have created a base URL, so as we iterate through the IDs only the ID part of the URL changes, and we get data from all the URLs. Also, rather than lists we use dictionaries, because each player has multiple parameters. Specifically, we use an OrderedDict because we want the columns in the order in which we fetch the data, not in sorted order. (Since Python 3.7, a plain dict preserves insertion order as well.) We append each dictionary to a list and create a data frame from that list. Save the data to CSV and there you have it. I print only the URL and the name of each player while the script runs, as a quick check that the data is being fetched correctly.
'''

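
# A sketch of the per-player loop on made-up pages, so that it runs offline. parse_profile and text_or_none are hypothetical helpers, not part of the original script; they return None for a missing element instead of raising an error, and the dictionary is appended inside the loop so that every player is kept.

```python
from bs4 import BeautifulSoup

def text_or_none(soup, css_class):
    """Return the stripped text of the first matching div, or None."""
    tag = soup.find("div", {"class": css_class})
    return tag.get_text(strip=True) if tag else None

def parse_profile(html):
    """Extract a few fields from one profile page (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        "Name": text_or_none(soup, "fi-p__name"),
        "Country": text_or_none(soup, "fi-p__country"),
        "Role": text_or_none(soup, "fi-p__role"),
    }

# Hypothetical pages keyed by player ID; the second one is incomplete
pages = {
    "1": '<div class="fi-p__name">Player A</div>'
         '<div class="fi-p__country">Country X</div>'
         '<div class="fi-p__role">Forward</div>',
    "2": '<div class="fi-p__name">Player B</div>',
}

player_list = []
for player_id, html in pages.items():
    player_list.append(parse_profile(html))  # one dictionary per player

print(player_list)
```

# A list of dictionaries like this can be passed straight to pandas.DataFrame, with the dictionary keys becoming the column names.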

In [25]:
# https://www.kaggle.com/borrkk/web-scraped-fifa-worldcup-2018-data-from-fifacom

In [26]:
# https://github.com/Dhrumilcse/Web-Scrapper-FIFA.com