# <center>SoFIFA URL Spider</center>
## <center>Using Scrapy and Jupyter Notebook to Download Player URLs from SoFIFA.com</center>

For this project I followed the Scrapy tutorial from the official docs (1) and the blog post on JJ's World about running Scrapy in a Jupyter notebook (2).
<p>(1) <a href=https://doc.scrapy.org/en/latest/intro/tutorial.html>https://doc.scrapy.org/en/latest/intro/tutorial.html</a></p>
<p>(2) <a href=https://www.jitsejan.com/using-scrapy-in-jupyter-notebook.html>https://www.jitsejan.com/using-scrapy-in-jupyter-notebook.html</a></p>

In [6]:
import scrapy 
from scrapy.crawler import CrawlerProcess
import pandas as pd

### Pipeline Setup

In [5]:
import json

class JsonURLWriterPipeline(object):

    def open_spider(self, spider):
        self.file = open('sofi_urls.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
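
The pipeline above writes one JSON object per line (the JSON Lines format, hence the `.jl` extension). As a minimal sketch of that format, here is a round trip through an in-memory buffer. The helper names `write_json_lines` and `read_json_lines` and the sample URLs are made up for illustration; they are not part of the notebook or of Scrapy.

```python
import io
import json


def write_json_lines(records, handle):
    """Write each record as one JSON object per line."""
    for record in records:
        handle.write(json.dumps(record) + "\n")


def read_json_lines(handle):
    """Parse one JSON object from every non-empty line."""
    return [json.loads(line) for line in handle if line.strip()]


# Hypothetical sample records, not real scraped output
sample = [
    {"url": "https://sofifa.com/player/1"},
    {"url": "https://sofifa.com/player/2"},
]

buffer = io.StringIO()          # stands in for the file opened in open_spider
write_json_lines(sample, buffer)
buffer.seek(0)                  # rewind before reading back
round_tripped = read_json_lines(buffer)
```

Because every line is a complete JSON document, the file can be appended to and read back one record at a time, which suits a crawler that yields items incrementally.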

### URL Spider Setup

In [3]:
import logging

class Sofi_URL_Spider(scrapy.Spider):
    
    name='url_sofi' 
    start_urls=['https://sofifa.com/players?r=200011&set=true']
    
    custom_settings = {
        'LOG_LEVEL':logging.WARNING,
        'ITEM_PIPELINES':{'__main__.JsonURLWriterPipeline': 1},
        #'FEED_FORMAT':'json',
        #'FEED_URI':'sofi_urls.json',   # Uncomment these if you want json rather than jl
    }
        
    # Step one of two: collect every player page URL here; the player pages themselves are crawled afterwards.
    def parse(self, response):

        for x in response.css('a.nowrap::attr(href)').getall():
            yield {'url': 'https://sofifa.com' + x}

        # Follow the last pagination link. Skip this when the page has no pagination links,
        # because indexing an empty list would raise IndexError.
        page_links = response.css('.pagination a::attr(href)').getall()
        if page_links:
            next_page = 'https://sofifa.com/players?r=200011&set=true' + page_links[-1]
            yield scrapy.Request(next_page, callback=self.parse)
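
How the next-page URL should be built depends on what the pagination `href` looks like, which this notebook does not record. As a hedged sketch, the standard library's `urljoin` resolves a relative `href` against the current page URL instead of concatenating strings. The helper `next_page_url` and the example `href` value are assumptions for illustration, not taken from the site.

```python
from urllib.parse import urljoin


def next_page_url(current_url, hrefs):
    """Return the absolute URL of the last pagination link, or None if there is none."""
    if not hrefs:
        return None
    return urljoin(current_url, hrefs[-1])


current = "https://sofifa.com/players?r=200011&set=true"

# Hypothetical href; the real markup may differ
resolved = next_page_url(current, ["/players?offset=60"])

# No pagination links on the page: nothing to follow
no_next = next_page_url(current, [])
```

Inside a spider, `response.urljoin(href)` does the same resolution against the response URL.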


### Start the Crawler for the URLs

In [4]:
process = CrawlerProcess({'USER_AGENT':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36)'})
process.crawl(Sofi_URL_Spider)
process.start()

2019-11-19 22:39:51 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: scrapybot)
2019-11-19 22:39:51 [scrapy.utils.log] INFO: Versions: lxml 4.4.1.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.7.4 (default, Aug 13 2019, 15:17:50) - [Clang 4.0.1 (tags/RELEASE_401/final)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1c  28 May 2019), cryptography 2.7, Platform Darwin-19.0.0-x86_64-i386-64bit
2019-11-19 22:39:51 [scrapy.crawler] INFO: Overridden settings: {'LOG_LEVEL': 30, 'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36)'}


In [5]:
# This little script will remove old player links that aren't currently in the game
# For whatever reason, some old urls still get included in the scraping process

url_list=[]
with open('sofi_urls.jl') as urls:
    for line in urls:
        dic=dict(json.loads(line))
        if '200011' in dic['url']:
            url_list.append(dic)

with open('sofi_urls.jl','w') as urls:
    for x in url_list:
        json.dump(x, urls)
        urls.write('\n')
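
The same filtering step can be wrapped in a function so it is easy to try on a small file before touching the real output. This is a sketch under assumptions: the name `filter_json_lines`, the temporary file, and the two sample URLs are invented for the demonstration.

```python
import json
import os
import tempfile


def filter_json_lines(path, marker):
    """Rewrite a JSON Lines file in place, keeping records whose 'url' contains marker.

    Returns a (total, kept) tuple of record counts.
    """
    with open(path) as handle:
        records = [json.loads(line) for line in handle if line.strip()]
    kept = [record for record in records if marker in record.get("url", "")]
    with open(path, "w") as handle:
        for record in kept:
            handle.write(json.dumps(record) + "\n")
    return len(records), len(kept)


# Demonstration on a throwaway file with hypothetical URLs
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_urls.jl")
    with open(demo_path, "w") as handle:
        handle.write(json.dumps({"url": "https://sofifa.com/player/1/?r=200011"}) + "\n")
        handle.write(json.dumps({"url": "https://sofifa.com/player/2/?r=190075"}) + "\n")
    total, kept_count = filter_json_lines(demo_path, "200011")
    with open(demo_path) as handle:
        remaining = [json.loads(line) for line in handle]
```

Returning the counts makes it easy to see how many stale links were dropped.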

In [1]:
!mkdir -p data && mv sofi_urls.jl data/