# Simple Web Scraper

## Necessary Python Libraries:
- Requests
- Beautiful Soup
    > pip install requests<br />pip install beautifulsoup4
    
First we need to import the libraries above.

In [25]:
from bs4 import BeautifulSoup
import requests

Scraping data from a webpage takes three steps: find the URL, use Requests to download the page at that URL, and use Beautiful Soup to parse the returned HTML into content that we can actually use.

For this notebook I'm just going to retrieve the fights from the latest UFC pay-per-view card.

In [26]:
URL = 'https://www.sherdog.com/events/UFC-246-McGregor-vs-Cerrone-82425'
response = requests.get(URL)
content = BeautifulSoup(response.content, "html.parser")

print('Completed')

# Uncomment the line below to see the full HTML; it is left out to keep the notebook concise.
# print(content)

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Completed
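
The request above assumes the page loads successfully. Below is a minimal sketch of a slightly more defensive version. `fetch_soup` and `parse_html` are hypothetical helper names that are not part of this notebook, and the 10-second timeout is an arbitrary assumption. Only the parsing step is exercised here, on a hand-written snippet, so the sketch runs without network access.

```python
import requests
from bs4 import BeautifulSoup


def parse_html(html):
    """Turn raw HTML (bytes or str) into a BeautifulSoup tree."""
    return BeautifulSoup(html, "html.parser")


def fetch_soup(url, timeout=10):
    """Download a page and parse it, failing loudly on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of parsing an error page.
    response.raise_for_status()
    return parse_html(response.content)


# Offline check of the parsing step on a tiny hand-written snippet.
demo = parse_html('<span itemprop="name">Conor McGregor</span>')
print(demo.span.text)
```

With helpers like these, the cell above could become `content = fetch_soup(URL)`.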


Next we need to tell Beautiful Soup which HTML tags contain our information. In this case the fighter names are contained in:<br />
>\<span itemprop="name">
    
We represent this in Beautiful Soup as follows:

In [27]:
names = content.find_all('span', attrs={"itemprop": "name"})

print(names)

[<span itemprop="name">UFC 246<br/>McGregor vs. Cerrone</span>, <span itemprop="name">Ultimate Fighting Championship (UFC)</span>, <span itemprop="name">Conor McGregor</span>, <span itemprop="name">Donald Cerrone</span>, <span itemprop="name">Holly Holm</span>, <span itemprop="name">Raquel Pennington</span>, <span itemprop="name">Alexey Oleynik</span>, <span itemprop="name">Maurice Greene</span>, <span itemprop="name">Brian Kelleher</span>, <span itemprop="name">Ode Osbourne</span>, <span itemprop="name">Diego Ferreira</span>, <span itemprop="name">Anthony Pettis</span>, <span itemprop="name">Roxanne Modafferi</span>, <span itemprop="name">Maycee Barber</span>, <span itemprop="name">Sodiq Yusuff</span>, <span itemprop="name">Andre Fili</span>, <span itemprop="name">Askar Askarov</span>, <span itemprop="name">Tim Elliott</span>, <span itemprop="name">Drew Dober</span>, <span itemprop="name">Nasrat Haqparast</span>, <span itemprop="name">Aleksa Camur</span>, <span itemprop="name">Justin 

Now we have the name of the fight card as well as the names of all of the fighters. We can print the names as follows:

In [28]:
for name in names:
    print(name.text)

UFC 246McGregor vs. Cerrone
Ultimate Fighting Championship (UFC)
Conor McGregor
Donald Cerrone
Holly Holm
Raquel Pennington
Alexey Oleynik
Maurice Greene
Brian Kelleher
Ode Osbourne
Diego Ferreira
Anthony Pettis
Roxanne Modafferi
Maycee Barber
Sodiq Yusuff
Andre Fili
Askar Askarov
Tim Elliott
Drew Dober
Nasrat Haqparast
Aleksa Camur
Justin Ledet
Sabina Mazo
J.J. Aldrich
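
The first line of the output runs "UFC 246" and "McGregor" together because `.text` joins the pieces on either side of the `<br/>` with nothing in between. Below is a small sketch of one way to handle this with `get_text`. It uses a hand-written HTML snippet modeled on the spans above rather than the live page.

```python
from bs4 import BeautifulSoup

# Hand-written snippet mimicking the first span and one fighter span on the page.
html = (
    '<span itemprop="name">UFC 246<br/>McGregor vs. Cerrone</span>'
    '<span itemprop="name">Conor McGregor</span>'
)
soup = BeautifulSoup(html, "html.parser")
spans = soup.find_all("span", attrs={"itemprop": "name"})

# .text concatenates the pieces around <br/> with nothing in between.
plain = [span.text for span in spans]

# get_text can insert a separator between pieces and strip whitespace.
spaced = [span.get_text(separator=" ", strip=True) for span in spans]

print(plain[0])
print(spaced[0])
```

The separator only matters for spans that contain nested tags; plain fighter names come out the same either way.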


However, this looks a bit ugly, and we don't need the first two entries, which hold the event title (UFC 246) and the promotion name (UFC) rather than fighters.

In [29]:
# Run this cell only once: every run drops the first two entries of the list again.
names = names[2:]
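
Because the cell above reassigns `names`, running it a second time would drop two real fighters. One way around this, sketched here on plain strings, is to leave the scraped list untouched and slice into a new variable. `scraped`, `fighters`, and `HEADER_ENTRIES` are illustrative names, not part of the notebook.

```python
# Stand-in for the text of the scraped spans (shortened).
scraped = [
    "UFC 246McGregor vs. Cerrone",
    "Ultimate Fighting Championship (UFC)",
    "Conor McGregor",
    "Donald Cerrone",
]

HEADER_ENTRIES = 2  # event title and promotion name come first

# Slicing into a new name leaves `scraped` intact, so rerunning is harmless.
fighters = scraped[HEADER_ENTRIES:]
fighters = scraped[HEADER_ENTRIES:]  # a second run gives the same result

print(fighters)
```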

Now to add some more formatting:

In [34]:
# Names alternate between the two fighters of each bout:
# even indices start a line, odd indices finish it.
for i, name in enumerate(names):
    print(name.text, end='')
    if i % 2 == 0:
        print(' Vs. ', end='')
    else:
        print()

Conor McGregor Vs. Donald Cerrone
Holly Holm Vs. Raquel Pennington
Alexey Oleynik Vs. Maurice Greene
Brian Kelleher Vs. Ode Osbourne
Diego Ferreira Vs. Anthony Pettis
Roxanne Modafferi Vs. Maycee Barber
Sodiq Yusuff Vs. Andre Fili
Askar Askarov Vs. Tim Elliott
Drew Dober Vs. Nasrat Haqparast
Aleksa Camur Vs. Justin Ledet
Sabina Mazo Vs. J.J. Aldrich
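
The same pairing can also be written without index arithmetic. The sketch below works on plain strings standing in for the `.text` of each span, and uses slicing with `zip` to build the bouts as tuples, which is handy if the pairs are needed later for more than printing.

```python
# Stand-in for [name.text for name in names] after dropping the headers.
fighters = ["Conor McGregor", "Donald Cerrone", "Holly Holm", "Raquel Pennington"]

# Even positions hold one fighter of each bout, odd positions the other.
bouts = list(zip(fighters[0::2], fighters[1::2]))

for first, second in bouts:
    print(f"{first} Vs. {second}")
```

Note that `zip` stops at the shorter slice, so an unpaired trailing name would be dropped silently rather than printed.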


# Notes / Big Picture
The big picture would be to scrape a lot more data than just this (e.g. winner of the fight, strikes landed, takedowns, etc.). On top of this, it would need to be done for several fight cards at once. However, the process remains the same: retrieve a URL, get the webpage, siphon data from the page, repeat.

## High-level pseudocode

1. Retrieve a webpage that lists all fight cards (e.g. http://www.ufcstats.com/statistics/events/completed?page=all)
2. Get the links to each individual fight card from this page
3. For each link in links: retrieve the webpage containing all fights on the card
4. From the fight card page, retrieve the fight_links to each individual fight's webpage
5. For each fight_link in fight_links: retrieve the webpage of the fight
6. Scrape the important data (the content.find_all command will be used a lot)
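
Steps 1 to 5 can be sketched roughly as below. This is an illustration under assumptions, not a tested scraper for the real site: `extract_links` and `crawl` are hypothetical helpers, the `event-details` and `fight-details` URL markers are guesses at how the links could be told apart, and `fetch` is any function that maps a URL to HTML. Fake in-memory pages stand in for the website so the sketch runs offline; step 6 is left out.

```python
from bs4 import BeautifulSoup


def extract_links(html, marker):
    """Return hrefs of all <a> tags whose URL contains `marker`, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        href = anchor["href"]
        if marker in href and href not in links:
            links.append(href)
    return links


def crawl(index_url, fetch, event_marker="event-details", fight_marker="fight-details"):
    """Walk index -> fight cards -> fights, yielding (fight_url, fight_html)."""
    for event_url in extract_links(fetch(index_url), event_marker):      # steps 1-3
        for fight_url in extract_links(fetch(event_url), fight_marker):  # step 4
            yield fight_url, fetch(fight_url)                            # step 5


# Fake pages so the sketch runs offline; a real `fetch` could wrap requests.get.
PAGES = {
    "index": '<a href="event-details/1">Card 1</a>',
    "event-details/1": '<a href="fight-details/a">A</a><a href="fight-details/b">B</a>',
    "fight-details/a": "<p>fight a</p>",
    "fight-details/b": "<p>fight b</p>",
}

fights = list(crawl("index", PAGES.__getitem__))
print([url for url, _ in fights])
```

Passing `fetch` in as a parameter keeps the crawl logic separate from the network code, which makes it easy to try out on saved pages before pointing it at a live site.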

## Data Storage
Something like this could end up being pretty large for a CSV, so I currently think it should be stored in a SQL database. It could still be done with a CSV, but that might be tricky.
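
As a rough idea of what the SQL route might look like, here is a minimal sketch using Python's built-in `sqlite3` module. The `fights` table and its columns are a hypothetical schema chosen for illustration, and the database lives in memory so the sketch leaves no files behind.

```python
import sqlite3

# In-memory database for the sketch; pass a file path to persist.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE fights (
        id INTEGER PRIMARY KEY,
        event TEXT NOT NULL,
        fighter_a TEXT NOT NULL,
        fighter_b TEXT NOT NULL
    )
    """
)

bouts = [
    ("UFC 246", "Conor McGregor", "Donald Cerrone"),
    ("UFC 246", "Holly Holm", "Raquel Pennington"),
]
# Parameterized inserts avoid quoting problems in scraped text.
conn.executemany(
    "INSERT INTO fights (event, fighter_a, fighter_b) VALUES (?, ?, ?)", bouts
)
conn.commit()

rows = conn.execute(
    "SELECT fighter_a, fighter_b FROM fights WHERE event = ? ORDER BY id", ("UFC 246",)
).fetchall()
print(rows)
```

Per-fight statistics such as strikes landed or takedowns could go into extra columns or a second table keyed on the fight `id`.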