# <center>Parse Function Derivations</center>

We need to work out how to extract all the data from the player table on sofifa.com. This notebook documents how each selector and parsing step was derived.

### Replicate the Scrapy shell in the Notebook

Here is a gem to help us use the equivalent of the scrapy shell in a Jupyter Notebook:

https://stackoverflow.com/questions/49908158/using-scrapy-in-jupyter-notebook-accessing-response-directly

In [1]:
import requests
from scrapy.http import TextResponse
import re

res=requests.get('https://sofifa.com/players')
response=TextResponse(res.url, body=res.text, encoding='utf-8')

## URL Spider Parser

### Find the URLs for the players on the page

In [2]:
# Collect the href of every player link on the page, then prepend the site prefix
urls = response.css('a.nowrap::attr(href)').getall()
urls = ['https://sofifa.com/players' + x for x in urls]
urls[2:7]

['https://sofifa.com/players/player/222665/martin-odegaard/200011/',
 'https://sofifa.com/players/player/239956/myron-boadu/200011/',
 'https://sofifa.com/players/player/225193/mikel-merino-zazon/200011/',
 'https://sofifa.com/players/player/202652/raheem-sterling/200011/',
 'https://sofifa.com/players/player/229167/milot-rashica/200011/']
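
As an alternative to string concatenation, here is a small sketch using `urllib.parse.urljoin` from the standard library, which resolves each `href` against the page URL. The sample hrefs are assumptions (root-relative paths like the ones the selectors appear to return). Note that the player pages requested later in this notebook live under `https://sofifa.com/player/...`, which is what `urljoin` produces here.

```python
from urllib.parse import urljoin

BASE_URL = 'https://sofifa.com/players'  # the listing page requested above

# Hypothetical hrefs, assumed to be root-relative like those on the listing page
sample_hrefs = [
    '/player/222665/martin-odegaard/200011/',
    '/players?offset=60',
]

# When an href starts with '/', urljoin replaces the whole path of BASE_URL
absolute_urls = [urljoin(BASE_URL, href) for href in sample_hrefs]
print(absolute_urls)
```

The same call would work for both the player links and the next-page link, so the prefix does not have to be hard-coded in two places.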

### Get the link to the next page 

In [3]:
# We need the URLs from every page, so let's get the link to the next page

next_page='https://sofifa.com'+response.css('.pagination a::attr(href)').getall()[-1] 
next_page

'https://sofifa.com/players?offset=60'

In [49]:
# We need this statement to make sure that we are only using current players in the game
# We will create an if statement in the parse definition for the URL Spider

'Nov  2019' in response.css('.carousel-cell.is-initial-select.selected span::text').get()

True

## Stats Spider Parser
The hard part...

### Go to a player URL

In [4]:
res=requests.get('https://sofifa.com/player/242444/joao-felix-sequeira/200010/')
response=TextResponse(res.url, body=res.text, encoding='utf-8')

<img src='https://i.imgur.com/n1Y96El.png'/>

<img src='https://i.imgur.com/PPRGXk4.png'/>

<center>Here are all the different attributes that we need to scrape from the site. I will figure out what code I need to write using the inspect element tool in Chrome. Then I will store all the data in a dictionary.</center>

In [44]:
# We can redo this so that we can test all different kinds of players

res=requests.get('https://sofifa.com/player/121944/bastian-schweinsteiger/200011/')
response=TextResponse(res.url, body=res.text, encoding='utf-8')

### Player Header Info

In [6]:
sd={}  # Here is our player stats dictionary

In [7]:
# Name and SoFIFA ID
name,ID=response.css('.info h1::text').getall()[0].split('(')

name=name[:-1]
ID=ID[4:-2]

print(name)
print(ID)

sd['name']=name
sd['id']=ID

R. Finnie
227546
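
The fixed slices above (`name[:-1]`, `ID[4:-2]`) depend on the exact spacing of the header. A regular expression is one possible way to make this less brittle. This is only a sketch: the header format `'R. Finnie (ID: 227546)'` is an assumption inferred from the slicing, and `split_name_and_id` is a hypothetical helper name.

```python
import re

# Assumed header format, inferred from the slicing above: 'R. Finnie (ID: 227546)'
HEADER_PATTERN = re.compile(r'^\s*(?P<name>.*?)\s*\(ID:\s*(?P<id>\d+)\)')

def split_name_and_id(header_text):
    """Return (name, id) from the header text, or (None, None) if it does not match."""
    match = HEADER_PATTERN.search(header_text)
    if match is None:
        return None, None
    return match.group('name'), match.group('id')

print(split_name_and_id('R. Finnie (ID: 227546) '))
```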


In [8]:
# Full Name, Country, Position(s)
sd['full_name'] = response.css('.bp3-text-overflow-ellipsis::text').get()[:-1]
sd['country'] = response.css('.bp3-text-overflow-ellipsis a::attr(title)').get()
sd['positions'] = response.css('.meta.bp3-text-overflow-ellipsis span').re('>(.*)<')

print(sd['full_name'])
print(sd['country'])
print(sd['positions'])

Ryan Finnie
Scotland
['RB']


In [9]:
# Age, DOB, Height, Weight
helper = response.css('.meta.bp3-text-overflow-ellipsis::text').getall()[-1].split()

sd['age']=int(helper[1])
dob=helper[2]+' '+helper[3]+' '+helper[4]
sd['dob']=dob[1:-1]

ht=helper[5]
ht=ht.split(ht[1])
ht[1]=ht[1].replace('"','')
sd['height']=int(ht[0])*12+int(ht[1])


sd['weight']=int(helper[6][:3])


print(sd['age'])
print(sd['dob'])
print(sd['height'])
print(sd['weight'])

19
Feb 19, 1995
72
170
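
The cell above relies on token positions, and `helper[6][:3]` assumes a three-digit weight. Below is a sketch of the same extraction with a single regular expression. The sample text is an assumption reconstructed from the indices used above, and `parse_meta` is a hypothetical helper name.

```python
import re

# Assumed meta text layout, reconstructed from the indices used above:
# Age 19 (Feb 19, 1995) 6'0" 170lbs
META_PATTERN = re.compile(
    r"Age\s+(?P<age>\d+)\s+"
    r"\((?P<dob>[^)]+)\)\s+"
    r"(?P<feet>\d+)'(?P<inches>\d+)\"\s+"
    r"(?P<weight>\d+)\s*lbs"
)

def parse_meta(text):
    """Return age, date of birth, height (inches) and weight (lbs), or None if no match."""
    match = META_PATTERN.search(text)
    if match is None:
        return None
    return {
        'age': int(match.group('age')),
        'dob': match.group('dob'),
        'height': int(match.group('feet')) * 12 + int(match.group('inches')),
        'weight': int(match.group('weight')),
    }

print(parse_meta("Age 19 (Feb 19, 1995) 6'0\" 170lbs"))
```

Because the weight is matched as "digits before lbs", a two-digit weight would parse as well.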


In [10]:
# Overall Rating, Potential, Value, Wage

helper = response.css('.column.col-4.text-center span::text').getall()
# Drop the "+N" change markers; a comprehension avoids skipping items while filtering
helper = [x for x in helper if not x.startswith('+')]

sd['overall']=int(helper[0])
sd['potential']=int(helper[1])
sd['value'],sd['wage']=helper[2:4]

print(sd['overall'])
print(sd['potential'])
print(sd['value'])
print(sd['wage'])

54
67
€60K
€2K
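
Value and wage are stored as strings such as `'€60K'`. If numbers are more useful later on, a conversion could look like the sketch below. `parse_money` is a hypothetical helper, and the `'M'` suffix is an assumption, since only `'K'` appears in the output above.

```python
def parse_money(text):
    """Convert a string such as '€60K' to a whole number of euros."""
    multipliers = {'K': 1_000, 'M': 1_000_000}  # 'M' is assumed; only 'K' is seen above
    amount = text.strip().lstrip('€')
    if not amount:
        return None
    if amount[-1] in multipliers:
        # round() guards against float artefacts such as 2299.9999999999995
        return round(float(amount[:-1]) * multipliers[amount[-1]])
    return round(float(amount))

print(parse_money('€60K'))
print(parse_money('€2K'))
```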


### Preferred Foot Column

In [11]:
# Here are our variable names and some of the values
print(response.css('.column.col-6 ul.bp3-text-overflow-ellipsis.pl li label::text').getall())
print(response.css('.column.col-6 ul.bp3-text-overflow-ellipsis.pl li::text').getall())

['Preferred Foot', 'International Reputation', 'Weak Foot', 'Skill Moves', 'Work Rate', 'Body Type', 'Real Face']
['\n', 'Right', '\n', '1 ', '\n', '2 ', '\n', '2 ', '\n', '\n', '\n', '\n', '\n', '\n']


In [12]:
# Preferred Foot, International Reputation, Weak Foot, Skill Moves

sd['preferred_foot'],int_rep,weak_foot,skill_moves = response.css('.column.col-6 ul.bp3-text-overflow-ellipsis.pl li::text').getall()[1:8:2]

sd['int_rep']=int(int_rep)
sd['weak_foot']=int(weak_foot)
sd['skill_moves']=int(skill_moves)

print(sd['preferred_foot'])
print(sd['int_rep'])
print(sd['weak_foot'])
print(sd['skill_moves'])

Right
1
2
2


In [13]:
# Work Rate, Body Type, Real Face, Release Clause
helper=response.css('.column.col-6 ul.bp3-text-overflow-ellipsis.pl li span::text').getall()

sd['work_rate']=helper[0]
sd['body_type']=helper[1]
sd['real_face']=helper[2]
try:
    sd['release_clause'] = helper[3]
except IndexError:  # not every player has a release clause
    sd['release_clause']=None

print(sd['work_rate'])
print(sd['body_type'])
print(sd['real_face'])
print(sd['release_clause'])

Medium/ Medium
Lean
No
None


### Club/National Team Details

In [14]:
# Here are all the values we need

print(response.css('.bp3-text-overflow-ellipsis.pl.text-right li::text').getall())
print(response.css('.bp3-text-overflow-ellipsis.pl.text-right li span::text').getall())
print(response.css('.bp3-text-overflow-ellipsis.pl.text-right li h6').re('\">(.*)</a>'))

[' ', '19', 'Jan 29, 2015', '2015']
['62', 'SUB']
['Partick Thistle FC']


In [15]:
response.css('.bp3-text-overflow-ellipsis.pl.text-right li::text').getall()

[' ', '19', 'Jan 29, 2015', '2015']

In [16]:
# Club, Club Rating, Club Position, Jersey Number

try:
    sd['club']=response.css('.bp3-text-overflow-ellipsis.pl.text-right li h6').re('\">(.*)</a>')[0]
except IndexError:  # no club listed
    sd['club']=None

try:
    sd['club_rating']=response.css('.bp3-text-overflow-ellipsis.pl.text-right li span::text').getall()[0]
    sd['club_position']=response.css('.bp3-text-overflow-ellipsis.pl.text-right li span::text').getall()[1]
    sd['jersey_number']=response.css('.bp3-text-overflow-ellipsis.pl.text-right li::text').getall()[1]
    #sd['joined']=response.css('.bp3-text-overflow-ellipsis.pl.text-right li::text').getall()[2]
    #sd['contract_expir']=response.css('.bp3-text-overflow-ellipsis.pl.text-right li::text').getall()[3]
except IndexError:  # club details missing
    sd['club_rating']=None
    sd['club_position']=None
    sd['jersey_number']=None


print(sd['club'])
print(sd['club_rating'])
print(sd['club_position'])
print(sd['jersey_number'])



Partick Thistle FC
62
SUB
19


In [17]:
# National Team, National Team Rating, National Team Position, National Team Jersey

try:
    sd['national_team']=response.css('.bp3-text-overflow-ellipsis.pl.text-right li h6').re('\">(.*)</a>')[1]
except IndexError:  # the player is not in a national team
    sd['national_team']=None
try:
    sd['nt_rating']=response.css('.bp3-text-overflow-ellipsis.pl.text-right li span::text').getall()[2]
except IndexError:
    sd['nt_rating']=None
try:
    sd['nt_position']=response.css('.bp3-text-overflow-ellipsis.pl.text-right li span::text').getall()[3]
except IndexError:
    sd['nt_position']=None
try:
    sd['nt_jersey']=response.css('.bp3-text-overflow-ellipsis.pl.text-right li::text').getall()[5]
except IndexError:
    sd['nt_jersey']=None
    
    
print(sd['national_team'])
print(sd['nt_rating'])
print(sd['nt_position'])
print(sd['nt_jersey'])

None
None
None
None
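
The repeated `try`/`except` blocks above all do the same thing: take an item by position and fall back to `None`. One way to shorten them is a small helper, sketched here. `get_or_none` is a hypothetical name, and the sample list is copied from the club cell output above.

```python
def get_or_none(items, index):
    """Return items[index], or None when the index is out of range."""
    try:
        return items[index]
    except IndexError:
        return None

# Sample values copied from the club cell output above
span_values = ['62', 'SUB']

club_rating = get_or_none(span_values, 0)  # present
nt_rating = get_or_none(span_values, 2)    # missing for a player without a national team
print(club_rating, nt_rating)
```

Catching only `IndexError` keeps other problems (a typo in a variable name, for example) visible instead of silently turning them into `None`.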


### All Attributes

In [18]:
print(response.css('ul li span::text').getall())

['Medium/ Medium', 'Lean', 'No', '62', 'SUB', '50', 'Crossing', '22', 'Finishing', '48', 'Heading Accuracy', '46', 'Short Passing', '28', 'Volleys', '57', 'Dribbling', '31', 'Curve', '29', 'FK Accuracy', '44', 'Long Passing', '44', 'Ball Control', '65', 'Acceleration', '65', 'Sprint Speed', '51', 'Agility', '49', 'Reactions', '59', 'Balance', '28', 'Shot Power', '61', 'Jumping', '60', 'Stamina', '59', 'Strength', '25', 'Long Shots', '56', 'Aggression', '55', 'Interceptions', '51', 'Positioning', '37', 'Vision', '34', 'Penalties', '48', 'Marking', '60', 'Standing Tackle', '59', 'Sliding Tackle', '12', '6', '11', '15', '11']


In [19]:
# We can get most of our attributes with this command
# We have already dealt with the items that come before the first attribute

#if sd['national_team']==None:
    #att_list = response.css('ul li span::text').getall()[6:]
#else:
    #att_list = response.css('ul li span::text').getall()[8:]

helper=response.css('ul li span::text').getall()

# Remove any notices that an attribute has increased ("+N") or decreased ("-N")
# A comprehension is used because removing items while looping over the list skips neighbours
helper=[x for x in helper if not x.startswith(('+','-'))]
        
att_list=helper[(helper.index('Crossing')-1):]        

print(att_list)



['50', 'Crossing', '22', 'Finishing', '48', 'Heading Accuracy', '46', 'Short Passing', '28', 'Volleys', '57', 'Dribbling', '31', 'Curve', '29', 'FK Accuracy', '44', 'Long Passing', '44', 'Ball Control', '65', 'Acceleration', '65', 'Sprint Speed', '51', 'Agility', '49', 'Reactions', '59', 'Balance', '28', 'Shot Power', '61', 'Jumping', '60', 'Stamina', '59', 'Strength', '25', 'Long Shots', '56', 'Aggression', '55', 'Interceptions', '51', 'Positioning', '37', 'Vision', '34', 'Penalties', '48', 'Marking', '60', 'Standing Tackle', '59', 'Sliding Tackle', '12', '6', '11', '15', '11']
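
A note on filtering out the change markers: calling `remove` on a list while looping over it can skip the item right after each removal, so two markers in a row would leave one behind. The sketch below shows the difference on a made-up sample (the values are hypothetical, not scraped).

```python
# Hypothetical sample with two change markers next to each other
values = ['54', '+1', '+2', '67']

# Removing inside the loop shifts the remaining items left, so '+2' is never visited
looped = list(values)
for x in looped:
    if x[0] == '+':
        looped.remove(x)

# A list comprehension builds a new list and checks every item
filtered = [x for x in values if not x.startswith(('+', '-'))]

print(looped)
print(filtered)
```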


In [20]:
# We want to keep the format as value then name, over and over in the list
# We just have to adjust for the fact that some of the attribute names are not in a span
# The indices below assume an unlabelled Composure value at index 50 of att_list
# The player loaded above has none (att_list has only 61 items), so att_list[61] fails


fixed_list = att_list[:51]
fixed_list.append('Composure')
fixed_list.extend(att_list[51:58])
fixed_list.append('GK Diving')
fixed_list.append(att_list[58])
fixed_list.append('GK Handling')
fixed_list.append(att_list[59])
fixed_list.append('GK Kicking')
fixed_list.append(att_list[60])
fixed_list.append('GK Positioning')
fixed_list.append(att_list[61])
fixed_list.append('GK Reflexes')
fixed_list.extend(att_list[62:])


# The first 68 items are 34 (value, name) pairs; store each as name -> int(value)
for i in range(len(fixed_list[:68])//2):
    sd[fixed_list[2*i+1].lower()]=int(fixed_list[2*i])


IndexError: list index out of range
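
The `IndexError` comes from the hard-coded positions: this player's list has no unlabelled Composure value, so everything after Penalties sits one index earlier than expected. Here is a sketch of a position-independent alternative that walks the list instead. It rests on assumptions: `'Sliding Tackle'` is the last labelled attribute, a value directly followed by another value is Composure, and the five values after `'Sliding Tackle'` are the goalkeeper ratings in the order used above. `pair_attributes` is a hypothetical name.

```python
GK_NAMES = ['GK Diving', 'GK Handling', 'GK Kicking', 'GK Positioning', 'GK Reflexes']

def pair_attributes(att_list, last_named='Sliding Tackle'):
    """Split a flat [value, name, ...] list into (stats dict, traits list)."""
    stats = {}
    end = att_list.index(last_named) + 1  # labelled attributes stop here
    i = 0
    while i < end:
        value = att_list[i]
        following = att_list[i + 1]
        if following.isdigit():
            # A value followed by another value has no label; assumed to be Composure
            stats['composure'] = int(value)
            i += 1
        else:
            stats[following.lower()] = int(value)
            i += 2
    # Assumption: the next five values are the goalkeeper ratings, in this order
    for name, value in zip(GK_NAMES, att_list[end:end + 5]):
        stats[name.lower()] = int(value)
    traits = att_list[end + 5:]  # whatever is left is treated as traits
    return stats, traits

# Tail of the att_list printed above (this player has no Composure value)
sample = ['34', 'Penalties', '48', 'Marking', '60', 'Standing Tackle',
          '59', 'Sliding Tackle', '12', '6', '11', '15', '11']
stats, traits = pair_attributes(sample)
print(stats)
print(traits)
```

With this approach `sd.update(stats)` and `sd['traits'] = traits` would replace the fixed-index cell, whether or not the page includes Composure.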

In [None]:
# Now for the last bit we just want to add the player's traits as a list

sd['traits']=fixed_list[68:]
sd

That is everything we want! Now we have to copy all of this code into the parse function in the Scrapy Notebook, yield the `sd` dictionary, and we should be good to go!