This is a Python crawler that scrapes data about the FIFA 22 transfer market. All data is scraped from https://www.futwiz.com/en/fifa22/players. The data is initially obtained using the requests module, then parsed using BeautifulSoup from the bs4 module, and finally converted to a pandas DataFrame.

In [1]:
import requests
import bs4
import pandas as pd

Experimenting with the scraped data turned up a few things that need cleaning. The first is player names. Names are scraped as they appear on the player's jersey in real life, since that is how their cards appear in FIFA. Because of how the site's text comes through in the scraped HTML, functions are needed to repair special characters (ü, é, etc.) in the players' names. Names that contain a space also need fixing (e.g. Wissam Ben Yedder is read in as Yedder until the fix is applied). Functions to handle both cases are defined below.

In [2]:
def clean_names(string):
    """
    Return the player's name with spaces removed and garbled characters repaired.

    Parameters
    ----------
    string:
        the initial player's name
    -------
    Returns the player's cleaned name.
    """
    # Each pair maps a garbled sequence to the character it stands for
    replacements = [
        (' ', ''), ('Ã©', 'é'), ('Ã¡', 'á'), ('Ã¼', 'ü'), ('Ã±', 'ñ'), ('Ãº', 'ú'),
        ('Ä\x87', 'ć'), ('Ã¤', 'ä'), ('Ã\xad', 'í'), ('Ã³', 'ó'), ('Ä\x8c', 'Č'),
        ('Ä\x9b', 'ě'), ('Ã¨', 'è'), ('Ã¶', 'ö'), ('Ã§', 'ç'), ('Å\xa0', 'Š'),
        ('Ä\x8d', 'č'), ('Ã\x81', 'Á'), ('Ã\x96', 'Ö'), ('Ã¦', 'ae'), ('Ä\x9f', 'ğ'),
        ('Ä±', 'i'), ('Å\x84', 'ń'), ('Å»', 'Ż'), ('Å\x9f', 'ş'), ('Ã\x98', 'Ø'),
        ('Ã«', 'ë'), ('Ã£', 'ã'), ('Ã¯', 'ï'), ('Ã\x87', 'Ç'), ('Å¾', 'ž'),
    ]
    for bad, good in replacements:
        string = string.replace(bad, good)
    return string
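
The sequences replaced above look like UTF-8 bytes that were decoded as Latin-1 (for example, é is the two bytes C3 A9 in UTF-8, which Latin-1 reads as Ã©). If that assumption holds for the whole page, the damage can be undone in one step instead of pair by pair. The sketch below is an alternative under that assumption, not the approach this notebook uses; `repair_mojibake` is a hypothetical name and the sample names are only illustrations.

```python
def repair_mojibake(text):
    """Hypothetical helper: undo UTF-8 text that was wrongly decoded as Latin-1."""
    try:
        # Recover the original bytes, then decode them with the right codec
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not garbled in this particular way, so leave it alone
        return text

print(repair_mojibake('MbappÃ©'))
print(repair_mojibake('ModriÄ\x87'))
print(repair_mojibake('Kane'))
```

Unlike `clean_names`, this sketch does not strip spaces and does not transliterate characters (the original maps æ to ae and ı to i), so those steps would still need to be applied separately.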

In [3]:
def fix_names(string,pageNo):
    """
    Correct a player's name that was truncated because it contains a space.

    Parameters
    ----------
    string:
        the initial player's name
    pageNo:
        the page number the name was scraped from; ambiguous names are
        only corrected on pages up to a fixed bound
    -------
    Returns the player's corrected name.
    """
    # Names that are always corrected
    always = {
        'Gea': 'De Gea', 'Dijk': 'Van Dijk', 'Bruyne': 'De Bruyne',
        'Yedder': 'Ben Yedder', 'Basten': 'Van Basten',
        'Nistelrooy': 'Van Nistelrooy', 'Piero': 'Del Piero',
        'Stegen': 'Ter Stegen', 'Persie': 'Van Persie', 'Dasari': 'Al Dasari',
        'Vrij': 'De Vrij', 'Lorenzo': 'De Lorenzo', 'Ligt': 'De Ligt',
        'María': 'De María',
    }
    # Ambiguous names, corrected only on pages up to the given bound
    bounded = {
        'Jr': ('Neymar', 100), 'Dias': ('Ruben Dias', 100),
        'Tomás': ('De Tomás', 165), 'Beek': ('Van De Beek', 165),
        'Jong': ('De Jong', 165), 'Paul': ('De Paul', 165),
    }
    if string in always:
        return always[string]
    if string in bounded and pageNo <= bounded[string][1]:
        return bounded[string][0]
    # Otherwise leave the name unchanged
    return string

The next cleaning step deals with the Price column. Some players are included in the player database but have no price on the market. This can happen for a number of reasons, the most common being that the player can be earned but not purchased. Also, the way Futwiz displays prices is not practical for analysis. For example, a price of 4,400 is displayed as 4.4K, and handling numbers as strings causes problems. We want prices as ints (e.g. 4.4K as 4400). The function that handles these cases is defined below.

In [4]:
def clean_price(price):
    """
    A function that will return the cleaned player's price

    Parameters
    ----------
    price:
        the initial player's price
    -------
    Returns the player's cleaned price as an int.
    """
    #Strip the newlines and spaces that come with the scraped price
    price = price.replace('\n', '').replace(' ', '')
    #A player without a price on the market is given a market value of 0
    if price == '':
        return 0
    #Next check if the player is priced in the millions
    elif 'M' in price:
        #round() rather than int() so floating-point error cannot truncate the result
        return round(float(price.replace('M', '')) * 1000000)
    #Next check if the player is priced in the thousands
    elif 'K' in price:
        return round(float(price.replace('K', '')) * 1000)
    else:
        #Otherwise the price is already a plain number
        return int(price)
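
To make the conversion concrete, here is a small standalone sketch of the same idea using a suffix-to-multiplier table. `parse_price` and `MULTIPLIERS` are hypothetical names, and the sample strings are made up to mirror the formats described above rather than taken from Futwiz.

```python
# Hypothetical suffix table: K for thousands, M for millions
MULTIPLIERS = {'K': 1_000, 'M': 1_000_000}

def parse_price(raw):
    """Convert a displayed price such as '4.4K' to an int; blank means 0."""
    # Drop every whitespace character, including newlines around the value
    text = ''.join(raw.split())
    if not text:
        return 0
    suffix = text[-1]
    if suffix in MULTIPLIERS:
        # round() guards against float products that land just below a whole number
        return round(float(text[:-1]) * MULTIPLIERS[suffix])
    return int(text)

for sample in ['\n 4.4K \n', '1.2M', '950', '']:
    print(repr(sample), '->', parse_price(sample))
```

Keeping the multipliers in a table means a new suffix would need only one new entry, at the cost of assuming the suffix is always the last character.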

Now that the cleaning functions are defined, we will define a function that scrapes Futwiz and returns a DataFrame that we can work with.

In [5]:
def get_data(pageNo):
    """
   A function that scrapes a given page on Futwiz and gives us data to work with
   
   Parameters
   ----------
   pageNo:
       the page to be scraped
   -------
   Returns a DataFrame of players
   """
    #Request the page
    request = requests.get('https://www.futwiz.com/en/fifa22/players?page='+str(pageNo))
    #Parse the returned HTML
    soup = bs4.BeautifulSoup(request.text,'html5lib')
    #Begin converting HTML into data
    #Find all player names on a given page
    name = soup.find_all('div', class_ = "card-22-pack-name")
    #Use list comprehension to get a cleaned list of player names
    names = [fix_names(clean_names(n.text),pageNo) for n in name]
    #Repeat the process to find ratings, positions, and all the face card stats
    rate = soup.find_all('div', class_ = "card-22-pack-rating")
    ratings = [int(r.text) for r in rate]
    pos = soup.find_all('div', class_ = "card-22-pack-position")
    positions = [p.text for p in pos]
    pac = soup.find_all('div', class_ = "card-22-pack-attnum1")
    pace = [int(p.text) for p in pac]
    shot = soup.find_all('div', class_ = "card-22-pack-attnum2")
    shooting = [int(s.text) for s in shot]
    pas = soup.find_all('div', class_ = "card-22-pack-attnum3")
    passing = [int(p.text) for p in pas]
    dri = soup.find_all('div', class_ = "card-22-pack-attnum4")
    dribbling = [int(d.text) for d in dri]
    defend = soup.find_all('div', class_ = "card-22-pack-attnum5")
    defending = [int(d.text) for d in defend]
    phys = soup.find_all('div', class_ = "card-22-pack-attnum6")
    physical = [int(p.text) for p in phys]
    #For non face card stats such as skill moves and work rates, more cleaning work is needed.
    #First get all non face card stats for each player and convert it to a list called attrs.
    tags = soup.find_all('div', class_ = "latest-player-info-stat")
    attrs = [a.text for a in tags]
    #Next we group our list into lists of all 4 non face card stats 
    attribs = [attrs[i:i + 4] for i in range(0, len(attrs), 4)]
    #Now go through each group to define lists of the four non face card stats
    skills = [int(attrib[0]) for attrib in attribs]
    wf = [int(attrib[1]) for attrib in attribs]
    wr = [attrib[2] for attrib in attribs]
    foot = [attrib[3] for attrib in attribs]
    #Repeat the previous process to get a cleaned list of players price
    pri = soup.find_all('div', class_ = "latest-player-info-price")
    price = [clean_price(p.text) for p in pri]
    #Define a dict mapping column names to the scraped lists
    dat = {'Player':names,'Rating':ratings,'Position':positions,'Pace':pace,'Shooting':shooting,'Passing':passing,'Dribbling':dribbling,'Defending':defending,'Physical':physical,'Skills':skills,'Weak_Foot':wf,'Work_Rates':wr,'Foot':foot,'Price':price}
    #Convert the dict into a DataFrame
    data = pd.DataFrame(dat)
    #Index the DataFrame by player name and return it
    return data.set_index('Player')
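
The grouping step in `get_data` relies on every player contributing exactly four `latest-player-info-stat` values; if one were missing, every later player's stats would shift by a column without raising an error. A minimal sketch of the same slicing with a length check added follows. `chunk` is a hypothetical helper and the sample values are invented for illustration.

```python
def chunk(items, size):
    """Split items into consecutive groups of `size`, refusing ragged input."""
    if len(items) % size != 0:
        raise ValueError(f'expected a multiple of {size} items, got {len(items)}')
    return [items[i:i + size] for i in range(0, len(items), size)]

# Invented values in the order: skill moves, weak foot, work rates, foot
attrs = ['5', '4', 'High/Med', 'Right', '4', '5', 'High/Low', 'Left']
groups = chunk(attrs, 4)
skills = [int(g[0]) for g in groups]
foot = [g[3] for g in groups]
print(skills, foot)
```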

Now that we have a method of collecting data from one page on the site, let's define a function that scrapes data from multiple pages.

In [6]:
def get_more(low, high):
    """
    A function that scrapes a given page range on Futwiz and gives us data to work with

    Parameters
    ----------
    low:
        the first page number to scrape
    high:
        the page number to stop before (exclusive, as in range)
    -------
    Returns a DataFrame of players
    """
    #Start with the first page
    frames = [get_data(low)]
    #Collect the DataFrame for each remaining page
    for i in range(low+1,high):
        frames.append(get_data(i))
    #Combine once at the end (DataFrame.append was removed in pandas 2.0)
    return pd.concat(frames)
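
When looping over many pages it can be worth pausing between requests so the site is not hit in a tight loop. The sketch below shows one way to structure that, with the page fetcher passed in as a parameter so the loop can be tried without network access. `scrape_range`, `fetch_page`, and `delay` are hypothetical names, the pause length is an assumption rather than a Futwiz requirement, and `high` is exclusive as in `get_more`.

```python
import time

def scrape_range(fetch_page, low, high, delay=1.0):
    """Call fetch_page(n) for each n in range(low, high), pausing between calls."""
    results = []
    for page in range(low, high):
        if page > low:
            # Wait before every request except the first
            time.sleep(delay)
        results.append(fetch_page(page))
    return results

# Stand-in fetcher for demonstration; a real run would pass get_data instead
pages = scrape_range(lambda n: f'page {n}', 0, 3, delay=0)
print(pages)
```

With `get_data` as the fetcher, the returned list of DataFrames could then be combined with `pd.concat`.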

We now have a complete method for scraping and obtaining data. Code to obtain the data and export it as a CSV is included below, commented out.

In [7]:
# Get data for page numbers 0 through 2 (the upper bound is exclusive)
# data = get_more(0,3)
# Export as csv
# data.to_csv('filename')