# Scraping player data from www.sofifa.com

Collects data on a large pool of players (~10k) from across the world, for the soccer paper.
The top ~10k players are taken from Sofifa's list of players, sorted in descending order of 'overall rating' and 'potential'.

- Author: Vineet Payyappalli
- Created: April 2018
- Last updated: 2019-01-28

### Notes:

- Data for 10,000 players are collected per transfer window.
- The snapshot date for each transfer window is found manually, as the dates are not standard across seasons/windows.
- Remember to select "All" in the top left of the page. The default is "Hot 100".
- For each link below, `offset=0` gives the first 80 players, `offset=80` gives players 80 to 160, and so on. Note that the scraping code further down sets `npl = 61` players per page, so the page size should be checked against the live site.
- Data are collected for the following dates (for each link, `offset` needs to be changed to get beyond the first page of results):
1. 2018-19 winter (before) 17th Dec, 2018 (https://sofifa.com/players?v=19&e=159310&set=true&offset=0)
2. 2018-19 summer (after) 20th Sep, 2018 (https://sofifa.com/players?v=19&e=159222&set=true&offset=0)
3. 2018-19 summer (before) 3rd May, 2018 (https://sofifa.com/players?v=18&e=159082&set=true&offset=0)
4. 2017-18 winter (after) 15th Feb, 2018 (https://sofifa.com/players?v=18&e=159005&set=true&offset=0)
5. 2017-18 winter (before) 14th Dec, 2017 (https://sofifa.com/players?v=18&e=158942&set=true&offset=0)
6. 2017-18 summer (after) 18th Sep, 2017 (https://sofifa.com/players?v=18&e=158855&set=true&offset=0)
7. 2017-18 summer (before) 2nd May, 2017 (https://sofifa.com/players?v=17&e=158716&set=true&offset=0)
8. 2016-17 winter (after) 14th Feb, 2017 (https://sofifa.com/players?v=17&e=158639&set=true&offset=0)
9. 2016-17 winter (before) 15th Dec, 2016 (https://sofifa.com/players?v=17&e=158578&set=true&offset=0)
10. 2016-17 summer (after) 20th Sep, 2016 (https://sofifa.com/players?v=17&e=158492&set=true&offset=0)
11. 2016-17 summer (before) 5th May, 2016 (https://sofifa.com/players?v=16&e=158354&set=true&offset=0)
12. 2015-16 winter (after) 13th Feb, 2016 (https://sofifa.com/players?v=16&e=158272&set=true&offset=0)
13. 2015-16 winter (before) 17th Dec, 2015 (https://sofifa.com/players?v=16&e=158214&set=true&offset=0)
14. 2015-16 summer (after) 21st Sep, 2015 (https://sofifa.com/players?v=16&e=158127&set=true&offset=0)
15. 2015-16 summer (before) 1st May, 2015 (https://sofifa.com/players?v=15&e=157984&set=true&offset=0)
16. 2014-15 winter (after) 13th Feb, 2015 (https://sofifa.com/players?v=15&e=157907&set=true&offset=0)
17. 2014-15 winter (before) 12th Dec, 2014 (https://sofifa.com/players?v=15&e=157844&set=true&offset=0)
18. 2014-15 summer (after) 18th Sep, 2014 (https://sofifa.com/players?v=15&e=157759&set=true&offset=0)
19. 2014-15 summer (before) 2nd May, 2014 (https://sofifa.com/players?v=14&e=157620&set=true&offset=0)
20. 2013-14 winter (after) 14th Feb, 2014 (https://sofifa.com/players?v=14&e=157543&set=true&offset=0)
21. 2013-14 winter (before) 13th Dec, 2013 (https://sofifa.com/players?v=14&e=157480&set=true&offset=0)
22. 2013-14 summer (after) 20th Sep, 2013 (https://sofifa.com/players?v=14&e=157396&set=true&offset=0)
23. 2013-14 summer (before) 2nd May, 2013 (https://sofifa.com/players?v=13&e=157256&set=true&offset=0)
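
The offset arithmetic described in the notes can be sketched as follows. This is a minimal illustration, not the notebook's own code: the helper name `page_urls` is hypothetical, and the page size is passed in as a parameter because the notes mention 80 players per page while the scraping code uses 61.

```python
import math


def page_urls(base_url, n_players, per_page):
    """Return one URL per results page needed to cover n_players.

    base_url is assumed to end in 'offset=' so the offset can be appended.
    """
    # Round up so the last, partially needed page is still requested.
    n_pages = math.ceil(n_players / per_page)
    return [f"{base_url}{page * per_page}" for page in range(n_pages)]


if __name__ == "__main__":
    base = "https://sofifa.com/players?v=19&e=159310&set=true&offset="
    for url in page_urls(base, n_players=200, per_page=80):
        print(url)
```

Rounding up with `math.ceil` requests one extra page when the target is not a multiple of the page size. Rounding down would instead stop short of the target.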

### Code

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {'User-Agent': 
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

In [None]:
dates = ['2018-19 winter (before) 17th Dec, 2018', 
         '2018-19 summer (after) 20th Sep, 2018',
         '2018-19 summer (before) 3rd May, 2018',
         '2017-18 winter (after): 15th Feb, 2018', 
         '2017-18 winter (before): 14th Dec, 2017', 
         '2017-18 summer (after): 18th Sep, 2017', 
         '2017-18 summer (before): 2nd May, 2017', 
         '2016-17 winter (after): 14th Feb, 2017', 
         '2016-17 winter (before): 15th Dec, 2016', 
         '2016-17 summer (after): 20th Sep, 2016', 
         '2016-17 summer (before): 5th May, 2016', 
         '2015-16 winter (after): 13th Feb, 2016', 
         '2015-16 winter (before): 17th Dec, 2015', 
         '2015-16 summer (after): 21st Sep, 2015', 
         '2015-16 summer (before): 1st May, 2015', 
         '2014-15 winter (after): 13th Feb, 2015', 
         '2014-15 winter (before): 12th Dec, 2014', 
         '2014-15 summer (after): 18th Sep, 2014', 
         '2014-15 summer (before): 2nd May, 2014', 
         '2013-14 winter (after): 14th Feb, 2014', 
         '2013-14 winter (before): 13th Dec, 2013', 
         '2013-14 summer (after): 20th Sep, 2013', 
         '2013-14 summer (before): 3rd May, 2013']
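
Because each label packs the season, the window, the before/after marker and the snapshot date into one string, later analysis may want them as separate fields. The sketch below is one possible way to split them. The function name `parse_label` and the returned keys are hypothetical, and the pattern assumes labels shaped like the ones in `dates`, with or without the colon after the parenthesis.

```python
import re
from datetime import datetime

# Matches e.g. '2017-18 winter (after): 15th Feb, 2018'; the colon is optional.
LABEL_RE = re.compile(
    r"^(?P<season>\d{4}-\d{2}) (?P<window>summer|winter) "
    r"\((?P<timing>before|after)\):? "
    r"(?P<day>\d{1,2})(?:st|nd|rd|th) (?P<month>[A-Za-z]{3}), (?P<year>\d{4})$"
)


def parse_label(label):
    """Split a transfer-window label into season, window, timing and date."""
    match = LABEL_RE.match(label.strip())
    if match is None:
        raise ValueError(f"Unrecognised label: {label!r}")
    # %b expects an English month abbreviation under the default locale.
    snapshot = datetime.strptime(
        f"{match['day']} {match['month']} {match['year']}", "%d %b %Y"
    ).date()
    return {
        "season": match["season"],
        "window": match["window"],
        "timing": match["timing"],
        "snapshot": snapshot,
    }


if __name__ == "__main__":
    print(parse_label("2017-18 winter (after): 15th Feb, 2018"))
```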

In [None]:
urls = ['https://sofifa.com/players?v=19&e=159310&set=true&offset=',
         'https://sofifa.com/players?v=19&e=159222&set=true&offset=',
         'https://sofifa.com/players?v=18&e=159082&set=true&offset=',
         'https://sofifa.com/players?v=18&e=159005&set=true&offset=', 
         'https://sofifa.com/players?v=18&e=158942&set=true&offset=', 
         'https://sofifa.com/players?v=18&e=158855&set=true&offset=', 
         'https://sofifa.com/players?v=17&e=158716&set=true&offset=', 
         'https://sofifa.com/players?v=17&e=158639&set=true&offset=', 
         'https://sofifa.com/players?v=17&e=158578&set=true&offset=', 
         'https://sofifa.com/players?v=17&e=158492&set=true&offset=', 
         'https://sofifa.com/players?v=16&e=158354&set=true&offset=', 
         'https://sofifa.com/players?v=16&e=158272&set=true&offset=', 
         'https://sofifa.com/players?v=16&e=158214&set=true&offset=', 
         'https://sofifa.com/players?v=16&e=158127&set=true&offset=', 
         'https://sofifa.com/players?v=15&e=157984&set=true&offset=', 
         'https://sofifa.com/players?v=15&e=157907&set=true&offset=', 
         'https://sofifa.com/players?v=15&e=157844&set=true&offset=', 
         'https://sofifa.com/players?v=15&e=157759&set=true&offset=', 
         'https://sofifa.com/players?v=14&e=157620&set=true&offset=', 
         'https://sofifa.com/players?v=14&e=157543&set=true&offset=', 
         'https://sofifa.com/players?v=14&e=157480&set=true&offset=', 
         'https://sofifa.com/players?v=14&e=157396&set=true&offset=', 
         'https://sofifa.com/players?v=13&e=157256&set=true&offset=']
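
Since `dates` and `urls` are parallel lists maintained by hand, a quick consistency check before scraping can catch a misplaced entry. The sketch below pulls the `v` and `e` query parameters out of a URL with the standard library. The helper names `snapshot_ids` and `is_newest_first` are hypothetical, and the check assumes that `e` decreases as the list goes back in time, which is what the values above suggest rather than something documented by the site.

```python
from urllib.parse import urlparse, parse_qs


def snapshot_ids(url):
    """Return the (v, e) query parameters of a Sofifa player-list URL as ints."""
    # keep_blank_values keeps the trailing empty 'offset=' parameter.
    query = parse_qs(urlparse(url).query, keep_blank_values=True)
    return int(query["v"][0]), int(query["e"][0])


def is_newest_first(url_list):
    """True if the 'e' parameter strictly decreases along the list."""
    e_values = [snapshot_ids(url)[1] for url in url_list]
    return all(a > b for a, b in zip(e_values, e_values[1:]))


if __name__ == "__main__":
    sample = [
        "https://sofifa.com/players?v=19&e=159310&set=true&offset=",
        "https://sofifa.com/players?v=19&e=159222&set=true&offset=",
        "https://sofifa.com/players?v=18&e=159082&set=true&offset=",
    ]
    print(snapshot_ids(sample[0]), is_newest_first(sample))
```

A check such as `len(dates) == len(urls)` alongside this would cover the other common mistake, a missing entry in one of the two lists.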

In [None]:
sofifa_data = pd.DataFrame(columns = ['date', 
                                      'name1', 'name2', 'country', 'age', 
                                      'pos1', 'pos2', 'pos3', 
                                      'overall_rating', 'potential', 
                                      'value', 'wage', 'special_total', 
                                      'team', 'contract'])

In [None]:
npl = 61  # number of players per results page
n_players = 10000  # number of players we need per transfer window
n_pages = n_players // npl  # full pages only: 163 pages, i.e. 9,943 players
for k, url in enumerate(urls):  
    for i in range(1, n_pages+1):
        print(k, i)
        offset = (i-1)*npl
        page = url + str(offset) +'&'
        print(page)
        pageTree = requests.get(page, headers=headers)
        pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
        table = pageSoup.find_all('table')[0]
        table_body = table.find('tbody').find_all('tr')  # one <tr> per player
        n_players_table = len(table_body)
        for j in range(0, n_players_table):
            player_table = table_body[j]
            player_columns = player_table.find_all('td')
            country = player_columns[1].find('a').get('title')
            name1 = player_columns[1].find_all('a')[1].get('title')
            name2 = player_columns[1].find_all('a')[1].text
            # The first two links are the country flag and the player name;
            # every link after those is a playing position.
            position_links = player_columns[1].find_all('a')[2:]
            positions = [a.text for a in position_links]
            # Keep at most three positions and pad missing ones with 'N/A'.
            # A player listed with more than three positions keeps the first three.
            pos1, pos2, pos3 = (positions + ['N/A'] * 3)[:3]
            age = int(player_columns[2].text)
            overall_rating = int(player_columns[3].text)
            potential = int(player_columns[4].text)
            team = player_columns[5].find('a').text
            contract = player_columns[5].find('div',{'class': 'subtitle text-ellipsis rtl'}).text
            # Drop the first and last character of each cell's text.
            value = player_columns[6].text[1:-1]
            wage = player_columns[7].text[1:-1]
            special_total = int(player_columns[8].text[1:-1])
            row_data = {'date': dates[k],
                        'name1': name1, 
                        'name2': name2, 
                        'country': country, 
                        'age': age, 
                        'pos1': pos1, 
                        'pos2': pos2, 
                        'pos3': pos3, 
                        'overall_rating': overall_rating, 
                        'potential': potential, 
                        'value': value, 
                        'wage': wage, 
                        'special_total': special_total, 
                        'team': team, 
                        'contract': contract}
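            # DataFrame.append was removed in pandas 2.0; pd.concat works on old and new versions.
            sofifa_data = pd.concat([sofifa_data, pd.DataFrame([row_data])], ignore_index=True)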
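
The loop stores `value` and `wage` as raw strings. If numeric amounts are needed later, a conversion along the following lines may help. This is only a sketch under an assumption that is not confirmed by the notebook: that the strings look like `€110.5M`, `€565K` or `€0`, i.e. a euro sign, a number, and an optional `K` or `M` suffix. The function name `parse_money` is hypothetical.

```python
def parse_money(text):
    """Convert a string such as '€110.5M' or '€565K' to a float amount in euros.

    Assumed format (not verified against the site): optional '€' prefix,
    a decimal number, and an optional 'K' (thousand) or 'M' (million) suffix.
    """
    cleaned = text.strip().lstrip("€")
    multipliers = {"K": 1_000, "M": 1_000_000}
    if cleaned and cleaned[-1] in multipliers:
        return float(cleaned[:-1]) * multipliers[cleaned[-1]]
    return float(cleaned)


if __name__ == "__main__":
    for raw in ["€110.5M", "€565K", "€0"]:
        print(raw, parse_money(raw))
```

Applied to the scraped table, this could be used as `sofifa_data['value'].map(parse_money)`, provided the stored strings really follow the assumed format.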

In [None]:
sofifa_data.to_csv('sofifa_players_201314to201718_Jan2019.csv', sep = ',', encoding='utf-8') 