# Data Scraping and Cleaning

The webpage of each player was fetched from its SOFIFA URL. The 'Beautiful Soup' package was then used to parse each page into an object holding the page's HTML, and the required FIFA player attributes were scraped from that object with the find() and find_all() methods provided by Beautiful Soup. These methods located the HTML elements whose classes hold the required player attributes, and the data was extracted from those elements.
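
As a minimal sketch of how find() and find_all() are typically used, the snippet below parses a small inline HTML fragment. The fragment, its tag names and its class names are invented for illustration and do not reflect the actual SOFIFA markup; the built-in 'html.parser' is used here instead of 'lxml' only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for a fetched player page;
# the tag and class names are invented for illustration.
html = """
<div class="player-card">
  <h1 class="name">Example Player</h1>
  <ul class="attributes">
    <li><span class="value">86</span> Crossing</li>
    <li><span class="value">95</span> Finishing</li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching element (or None if nothing matches).
name = soup.find("h1", {"class": "name"}).text

# find_all() returns a list of every matching element.
items = soup.find("ul", {"class": "attributes"}).find_all("li")

attributes = {}
for item in items:
    value = item.find("span", {"class": "value"}).text
    # The remaining text of the list item is the attribute label.
    label = item.text.replace(value, "", 1).strip()
    attributes[label] = int(value)

print(name, attributes)
```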

The extracted data contained punctuation, and some of the words and player attributes were run together, so the data had to be cleaned. Once cleaned, the data was organized into a pandas dataframe, and the dataframe was written to a '.csv' file.
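
The cleaning itself happens inside the imported helper functions, which are not shown here. The sketch below is therefore only one plausible way to do it, under stated assumptions: the raw strings are invented, and `clean_item` is a hypothetical helper, not a function from the notebook's modules. It separates a joined value and label, drops stray punctuation, and writes the result as CSV text.

```python
import io
import re

import pandas as pd

# Hypothetical raw strings, invented to mimic text in which a value and
# its label are joined and stray punctuation is present.
raw_items = ["86Crossing,", "95 Finishing;", "(75)Penalties", "Standing Tackle28."]


def clean_item(raw):
    """Split one raw string into (attribute name, integer value)."""
    value = int(re.search(r"\d+", raw).group())
    # Replace digits and punctuation with spaces, then collapse whitespace.
    name = re.sub(r"[^A-Za-z ]", " ", raw)
    name = " ".join(name.split())
    return name, value


cleaned = dict(clean_item(item) for item in raw_items)

# One row per player; "Example Player" is a placeholder name.
frame = pd.DataFrame([{"Player Name": "Example Player", **cleaned}])

# Written to an in-memory buffer here; a file path works the same way.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing `index=False` keeps the row index out of the file. If the index is written and the file is later read back without `index_col`, pandas names that extra column 'Unnamed: 0'.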

In [1]:
from bs4 import BeautifulSoup as bs
import pandas as pd
import requests
import time

from data_extraction_module import extract_player_attributes
from data_extraction_module import create_trim_save_dataframe
from data_loader_module import load_data

### Storing the URL of Each Player

In [2]:
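url_main = 'https://sofifa.com'
player_urls = {}
# walk the player listing in steps of 60 (offsets 0, 60, ..., 14940)
for n in range(0, 15000, 60):
    page = requests.get(url_main + '/players?offset=' + str(n))
    soup = bs(page.content, 'lxml')
    # collect every link inside the body of the players table
    page_contents = soup.find('table', {'class': 'table table-hover persist-area'}).find('tbody').find_all('a')
    for i in page_contents:
        # keep only links to player pages, stored as {player name: relative URL}
        if i['href'][0:8] == '/player/':
            player_urls[i.text] = i['href']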
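
One property of the loop above is worth checking: `player_urls` is keyed by the link text, that is, the displayed player name. If two players share a display name, the later link replaces the earlier one while the key keeps its original position in the dictionary. The sketch below illustrates this with invented names and hrefs; treating the href as unique per player is an assumption, not something the notebook states.

```python
# Hypothetical (name, href) pairs as they might come off the listing pages;
# the names and hrefs are invented for illustration.
links = [
    ("A. Example", "/player/1001/a-example/"),
    ("B. Sample", "/player/1002/b-sample/"),
    ("A. Example", "/player/2002/a-example/"),  # a different player, same name
]

# Keyed by name: the later namesake silently replaces the earlier URL,
# but the key stays in the position where it was first inserted.
by_name = {}
for name, href in links:
    by_name[name] = href

# Keyed by href (assumed unique per player): all three entries survive.
by_href = {}
for name, href in links:
    by_href[href] = name

print(len(by_name), len(by_href))
```

This behaviour could account for rows in the final dataframe where a familiar name is paired with an age and rating that look out of place, although the notebook does not confirm it.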

### Extracting the Player Attributes using the Player URLs

In [3]:
start_time = time.time()

attributes_of_all_the_players = {}
for player_name, url in player_urls.items():
    player_url = url_main + url
    player_attributes = extract_player_attributes(player_name, player_url)
    attributes_of_all_the_players.update(player_attributes)
    
end_time = time.time()
print('Time elapsed: %.2f minutes.' %((end_time - start_time)/60))

Time elapsed: 148.53 minutes.


### Creating, Trimming and Saving the Player Attributes Dataframe

In [4]:
create_trim_save_dataframe(attributes_of_all_the_players)

Number of players in each category:
 {'Striker': 2000, 'Midfielder': 4000, 'Defender': 3000, 'GoalKeeper': 1000}


### Loading the Player Attributes Dataframe and Checking That All Required Data Was Fetched

In [5]:
# loading the dataframe
player_attr_dataframe = load_data()

# checking if all the data is fetched
if player_attr_dataframe.shape[1] != (player_attr_dataframe.count() == player_attr_dataframe.shape[0]).sum():
    print('Missed some data')
else:
    print('All data fetched')

All data fetched
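
The check above relies on `count()`, which returns the number of non-null entries per column; a column is complete when that number equals the number of rows. The sketch below reproduces the same logic on a small invented frame with one missing value and shows an equivalent test based on `isna()`. The frame contents are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame with one missing value, for illustration only.
frame = pd.DataFrame({
    "Player Name": ["A. Example", "B. Sample", "C. Demo"],
    "Age": [31, None, 27],
    "Crossing": [86, 84, 83],
})

n_rows, n_cols = frame.shape

# count() gives the number of non-null entries in each column.
complete_columns = (frame.count() == n_rows).sum()

# Same condition as the check above: every column must be fully populated.
all_fetched = n_cols == complete_columns

# An equivalent, more direct test, plus a per-column breakdown.
no_missing = not frame.isna().any().any()
missing_per_column = frame.isna().sum()

print(all_fetched, no_missing)
print(missing_per_column[missing_per_column > 0])
```

The per-column breakdown is useful when the check fails, since it shows which attributes were missed.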


### Displaying the Player Attributes Dataframe

In [6]:
player_attr_dataframe

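| Unnamed: 0 | Player Name | Player Category | Age | Height | Weight | Overall Rating | Value | Wage | Crossing | Finishing | ... | Penalties | Composure | Marking | Standing Tackle | Sliding Tackle | GK Diving | GK Handling | GK Kicking | GK Positioning | GK Reflexes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | L. Messi | Striker | 31 | 5'7" | 159 | 94 | €110.5M | €565K | 86 | 95 | ... | 75 | 96 | 33 | 28 | 26 | 6 | 11 | 15 | 14 | 8 |
| 1 | Cristiano Ronaldo | Striker | 33 | 6'2" | 183 | 94 | €77M | €405K | 84 | 94 | ... | 85 | 95 | 28 | 31 | 23 | 7 | 11 | 15 | 14 | 11 |
| 2 | Neymar Jr | Midfielder | 26 | 5'9" | 150 | 92 | €108M | €290K | 83 | 87 | ... | 81 | 94 | 27 | 24 | 33 | 9 | 9 | 15 | 15 | 11 |
| 3 | De Gea | GoalKeeper | 27 | 6'4" | 168 | 91 | €72M | €260K | 17 | 13 | ... | 40 | 70 | 25 | 21 | 13 | 90 | 85 | 85 | 89 | 94 |
| 4 | K. De Bruyne | Midfielder | 27 | 5'11" | 154 | 91 | €102M | €355K | 93 | 82 | ... | 79 | 90 | 68 | 58 | 51 | 15 | 13 | 5 | 10 | 13 |
| 5 | E. Hazard | Striker | 27 | 5'8" | 168 | 91 | €93M | €340K | 81 | 84 | ... | 86 | 91 | 34 | 27 | 22 | 11 | 12 | 6 | 8 | 8 |
| 6 | L. Modrić | Midfielder | 32 | 5'8" | 146 | 91 | €67M | €420K | 86 | 72 | ... | 82 | 90 | 68 | 76 | 73 | 13 | 9 | 7 | 14 | 9 |
| 7 | L. Suárez | Striker | 20 | 6'1" | 163 | 67 | €1.2M | €13K | 59 | 70 | ... | 73 | 49 | 26 | 25 | 24 | 6 | 8 | 13 | 11 | 6 |
| 8 | H. Kane | Midfielder | 19 | 5'10" | 148 | 66 | €1M | €11K | 45 | 60 | ... | 55 | 59 | 49 | 57 | 62 | 11 | 12 | 6 | 14 | 9 |
| 9 | J. Oblak | GoalKeeper | 25 | 6'2" | 192 | 90 | €68M | €94K | 13 | 11 | ... | 11 | 70 | 27 | 12 | 18 | 86 | 92 | 78 | 88 | 89 |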
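
The Value, Wage and Height columns are stored as text (for example '€110.5M' and 5'7"). If numeric versions are needed later, they could be derived along the lines of the sketch below. `parse_money` and `parse_height` are hypothetical helpers, not part of the notebook's modules, and they assume the formats seen in the rows displayed above.

```python
import re


def parse_money(text):
    """Convert strings such as '€110.5M' or '€565K' to a float in euros."""
    match = re.fullmatch(r"€(\d+(?:\.\d+)?)([MK]?)", text)
    if match is None:
        raise ValueError(f"unrecognised amount: {text!r}")
    number, suffix = match.groups()
    scale = {"M": 1_000_000, "K": 1_000, "": 1}[suffix]
    return float(number) * scale


def parse_height(text):
    """Convert a feet-and-inches string to a total number of inches."""
    # Expects feet, an apostrophe, inches, and a closing double quote.
    match = re.fullmatch(r"(\d+)'(\d+)\"", text)
    if match is None:
        raise ValueError(f"unrecognised height: {text!r}")
    feet, inches = map(int, match.groups())
    return feet * 12 + inches


print(parse_money("€110.5M"), parse_money("€565K"))
print(parse_height("5'7\""), parse_height("5'11\""))
```

With a dataframe in hand, such helpers could be applied column-wise, for instance with `player_attr_dataframe['Value'].map(parse_money)`.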
