# Data Collection for 2018 FIFA World Cup Russia™ Prediction Project
<hr>
The goal of this notebook is to show how I collected data from the FIFA World Cup website for the 2018 World Cup Russia - Group Stage Prediction Project. The data was collected using a web crawler (built in Python) because no API was available.

The order of this notebook is as follows:
1. Information about the data source
2. Step-by-step instructions on how the web crawler was built
3. Result of data collection


## Data source

I have collected group statistics for all FIFA World Cups between 1994 and 2014.

That gives us 6 FIFA World Cups:

 - 2014 FIFA World Cup Brazil™
 - 2010 FIFA World Cup South Africa™
 - 2006 FIFA World Cup Germany™
 - 2002 FIFA World Cup Korea/Japan™
 - 1998 FIFA World Cup France™
 - 1994 FIFA World Cup USA™

The tournaments between 1998 and 2014 each included 32 teams, divided into eight groups (A to H) of four teams (countries). The 1994 FIFA World Cup USA™ included 24 teams, divided into six groups (A to F) of four teams.
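As a quick sanity check, the team counts above imply how many rows the final dataset should contain (one row per team per edition). The sketch below only restates the numbers from this section; the variable names are illustrative.

```python
# Teams per edition, as described above: 32 teams from 1998 to 2014, 24 in 1994.
teams_per_edition = {
    2014: 32,
    2010: 32,
    2006: 32,
    2002: 32,
    1998: 32,
    1994: 24,
}

# One row per team per edition.
expected_rows = sum(teams_per_edition.values())
print(expected_rows)  # 184
```

Comparing this number with the length of the collected dataframe is a cheap way to spot an edition that failed to download.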

For each World Cup edition, I collected the following group statistics:

 - Group - name of the group
 - Teams - name of the team (country)
 - Match played - number of matches the team played
 - Match won - number of matches the team won
 - Draw - number of matches drawn
 - Lost - number of matches lost
 - Goals for - total number of goals scored
 - Goals against - total number of goals conceded
 - Goals difference - goals for minus goals against
 - Points - total points
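These columns are related to each other, so a scraped row can be checked for internal consistency. The sketch below assumes three points for a win and one for a draw, which applies to all six tournaments covered here, and it assumes no points deductions. The function name `row_is_consistent` is hypothetical and not part of the crawler.

```python
def row_is_consistent(played, won, draw, lost,
                      goals_for, goals_against, goals_difference, points):
    """Return True if one team's group-stage row adds up."""
    return (
        played == won + draw + lost
        and goals_difference == goals_for - goals_against
        # Assumption: 3 points per win, 1 per draw, no deductions.
        and points == 3 * won + draw
    )


# Brazil, Group A, 2014 (values taken from the output at the end of this notebook).
print(row_is_consistent(3, 2, 1, 0, 7, 2, 5, 7))  # True
```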


This data is publicly accessible on the FIFA website in the <a href="https://www.fifa.com/fifa-tournaments/statistics-and-records/worldcup/index.html">statistics and records section</a>. According to the site's robots.txt, scraping the statistics and records page was allowed as of June 10, 2018.
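A robots.txt check like this can also be done in code with the standard library's `urllib.robotparser`. The sketch below parses a small set of made-up rules so that it runs offline; these rules are hypothetical and are not the real FIFA robots.txt. To check a live site, `set_url()` and `read()` would replace the `parse()` call.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT the real FIFA robots.txt.
sample_rules = """
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(sample_rules)

stats_url = "https://www.fifa.com/fifa-tournaments/statistics-and-records/worldcup/index.html"
private_url = "https://www.fifa.com/private/page.html"  # hypothetical path

# can_fetch(useragent, url) applies the parsed rules to a URL.
print(parser.can_fetch("*", stats_url))    # True under the sample rules
print(parser.can_fetch("*", private_url))  # False under the sample rules
```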

I describe the process of building this web scraper in the article [Let the robot do your work! Web scraping with Python!](https://medium.com/ub-women-data-scholars/let-the-robot-do-your-work-web-scraping-with-python-9c147fb7690f), posted on Medium.

## Web crawler

This part of the notebook explains how the web crawler was built.

The program was written in Python 3.x.
 
The following libraries were used:
 - <a href="http://docs.python-requests.org/en/master/">Requests</a>
 - <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/#">BeautifulSoup</a>
 - <a href="https://pandas.pydata.org/">Pandas</a>

The structure of the code:
1. Libraries import
2. Obtaining data from the FIFA Tournaments page
 - GET request to the FIFA Tournaments page
 - saving the page content in Beautiful Soup format 
 - extraction of links to individual FIFA World Cup editions
 - editing the links to generate links to the pages containing group statistics
3. Obtaining group statistics data from the individual World Cup pages
 - GET request to each page containing group statistics
 - saving the page content in Beautiful Soup format
 - extraction of data from the HTML
 - saving the data as a pandas dataframe
4. Export of dataframe to Excel

### 1. Libraries import

- Requests library, which sends HTTP requests
- BeautifulSoup library, which pulls data out of HTML
- Pandas library, which stores the data as a dataframe

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

### 2. Obtaining data from the FIFA Tournaments page

I set the url variable to the FIFA Tournaments page and use a try/except block to handle a request timeout. I set the timeout to 5 seconds and store the response in the page_response variable.

If the page is not accessible, I print the HTTP status code to the console. If a timeout occurs, I print a notification to the console.
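The crawler in this notebook catches only `requests.Timeout`. Other failures, such as a connection error or a malformed URL, would still raise an exception. A minimal sketch of a broader variant is shown below; the helper name `fetch_page` is hypothetical and is not used elsewhere in this notebook.

```python
import requests


def fetch_page(url, timeout=5):
    """Return the response for url, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        # Raises requests.HTTPError for 4xx and 5xx status codes.
        response.raise_for_status()
        return response
    except requests.Timeout:
        print('Timeout occurred for requested page: ' + url)
    except requests.RequestException as e:
        # Base class: covers connection errors, invalid URLs and HTTP errors.
        print('Request failed for ' + url + ': ' + str(e))
    return None


# A URL without a scheme fails before any network traffic is sent.
print(fetch_page('not-a-valid-url'))
```

Because `requests.Timeout` is a subclass of `requests.RequestException`, it has to be caught first to keep its dedicated message.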

If the page is accessible (200 is the standard status code for a successful HTTP request), I parse the HTML with BeautifulSoup's lxml parser and save the result in the page_content variable. Then I find the element that contains the links to all World Cup editions, extract these links, and edit them to generate new links that point to the group statistics.

As I am only interested in the last six World Cups, I keep the first 6 elements of the list.
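The link editing step can be illustrated in isolation. The sketch below assumes that each href on the tournaments page is a site-relative path ending in `index.html`; that form is inferred from the crawler code rather than confirmed here. It uses `urljoin` instead of plain string concatenation, and the helper name `build_group_link` is hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = 'http://www.fifa.com'


def build_group_link(href):
    """Turn an edition's index link into its group statistics link."""
    absolute = urljoin(BASE_URL, href)
    return absolute.replace('index.html', 'groups/index.html')


# Assumed form of an href found on the tournaments page.
print(build_group_link('/worldcup/archive/brazil2014/index.html'))
# http://www.fifa.com/worldcup/archive/brazil2014/groups/index.html
```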

In [2]:
url = "http://www.fifa.com/fifa-tournaments/statistics-and-records/worldcup/index.html"

# Start with an empty list so link_list exists even if the request fails
link_list = []

try:
    page_response = requests.get(url, timeout=5)

    if page_response.status_code == 200:
        page_content = BeautifulSoup(page_response.content, 'lxml')
        # Element that holds the links to all World Cup editions
        tournaments = page_content.find(attrs={'id': 'alleditions'})

        for link in tournaments.find_all('a'):
            # Point each edition link at its group statistics page
            group_link = 'http://www.fifa.com' + link.get('href').replace('index.html', 'groups/index.html')
            link_list.append(group_link)
        # Keep only the first six links (2014 back to 1994)
        link_list = link_list[0:6]

    else:
        print(page_response.status_code)

except requests.Timeout as e:
    print('Timeout occurred for requested page: ' + url)
    print(str(e))


As a result of these operations, I receive a list containing 6 links to group statistics, one link for each edition of the World Cup.

In [3]:
for link in link_list:
    print(link)

http://www.fifa.com/worldcup/archive/brazil2014/groups/index.html
http://www.fifa.com/worldcup/archive/southafrica2010/groups/index.html
http://www.fifa.com/worldcup/archive/germany2006/groups/index.html
http://www.fifa.com/worldcup/archive/koreajapan2002/groups/index.html
http://www.fifa.com/worldcup/archive/france1998/groups/index.html
http://www.fifa.com/worldcup/archive/usa1994/groups/index.html


### 3. Obtaining group statistics data from the individual World Cup pages


First, I create an empty list to store the group statistics and set the counter variable to 0.

I use a while loop to iterate through link_list and increment the counter by 1 on each iteration. The loop does not execute if link_list is empty, and it ends when the counter equals the list's length.

Inside the loop, I set the url variable to the link from link_list, using the counter as the list index. Then I use a try/except block to handle a request timeout. I set the timeout to 5 seconds and store the response in the page_response variable.

If the page is not accessible, I print the HTTP status code to the console. If a timeout occurs, I print a notification to the console.

If the page is accessible (200 is the standard status code for a successful HTTP request), I parse the HTML with BeautifulSoup's lxml parser and save the result in the page_content variable. Then I find the HTML elements that contain the data:

 - take the title of the webpage (e.g. 2014 FIFA World Cup Brazil™) and split it to extract the World Cup year
 - select the HTML element that contains statistics for all groups and save it in the standings_table variable
 - select all HTML elements that contain group letters and use a list comprehension to extract their inner text
 - repeat each group letter four times (the FIFA website has only one HTML element per group containing the group letter, but each group has four teams, so every team row needs its own copy)
 - select all HTML elements that contain the relevant statistics (teams, match_played, match_won, draw, lost, goals_for, goals_against, goals_difference, points) and use list comprehensions to extract their inner text
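Two of these steps are plain Python and can be tried without downloading anything. The sketch below uses the example page title from the list above and two sample group labels; `teams_per_group` is an illustrative name for the constant 4 used in the crawler.

```python
# Extract the year: the first space-separated token of the page title.
page_title = '2014 FIFA World Cup Brazil™'
world_cup = page_title.split(' ')[0]

# Repeat each group label once per team so it lines up with the team rows.
group_letters = ['Group A', 'Group B']
teams_per_group = 4
group_letter = [gl for gl in group_letters for _ in range(teams_per_group)]

print(world_cup)          # 2014
print(len(group_letter))  # 8
print(group_letter[3:5])  # ['Group A', 'Group B']
```

The nested comprehension keeps the groups in page order, which matters because the team and statistics lists are matched to it purely by position.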

After collecting the data, I store it in a pandas dataframe and append the dataframe to the append_data list.

The while loop produces a list of dataframes. I use the pandas concat function to concatenate them into one dataframe.
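The concatenation step can be sketched on a small scale. The example below splits four rows from the 2014 output into two dataframes purely for illustration. It also shows one possible follow-up that is not part of the original crawler: since `get_text()` returns strings, the statistics columns can be converted to numbers with `pd.to_numeric` before analysis.

```python
import pandas as pd

# Two small frames stand in for the per-edition frames (illustrative split).
first_batch = pd.DataFrame({
    "Fifa World Cup": "2014",          # a scalar is broadcast to every row
    "Teams": ["Brazil", "Mexico"],
    "Points": ["7", "7"],              # scraped values arrive as strings
})
second_batch = pd.DataFrame({
    "Fifa World Cup": "2014",
    "Teams": ["Netherlands", "Chile"],
    "Points": ["9", "6"],
})

# ignore_index=True renumbers the rows 0..n-1 across all frames.
combined = pd.concat([first_batch, second_batch], axis=0, ignore_index=True)

# Optional follow-up: turn the string column into integers.
combined["Points"] = pd.to_numeric(combined["Points"])

print(combined)
```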

In [4]:
append_data = []
counter = 0

while counter < len(link_list):
    
    url = link_list[counter]
    try:
        page_response = requests.get(url, timeout=5)

        if page_response.status_code == 200:
            page_content = BeautifulSoup(page_response.content,'lxml')

            page_title = page_content.title.text
            world_cup = page_title.split(' ')[0]

            standings_table = page_content.find('div', attrs={'id':'standings'})

            group_letters = [gl.get_text() for gl in standings_table.select('.group-wrap .caption-nolink')]
            group_letter = [gl for gl in group_letters for i in range(4)]

            teams = [tn.get_text() for tn in standings_table.select('.group-wrap .teamname-nolink span.t-nText')]
            match_played = [mp.get_text() for mp in standings_table.select('.group-wrap .tbl-matchplayed span.text')]
            match_won = [mw.get_text() for mw in standings_table.select('.group-wrap .tbl-win span.text')]
            draw = [d.get_text() for d in standings_table.select('.group-wrap .tbl-draw span.text')]
            lost = [l.get_text() for l in standings_table.select('.group-wrap .tbl-lost span.text')]
            goals_for = [gf.get_text() for gf in standings_table.select('.group-wrap .tbl-goalfor span.text')]
            goals_against = [ga.get_text() for ga in standings_table.select('.group-wrap .tbl-goalagainst span.text')]
            goals_difference = [gd.get_text() for gd in standings_table.select('.group-wrap .tbl-diffgoal span.text')]
            points = [p.get_text() for p in standings_table.select('.group-wrap .tbl-pts span.text')]

            group_tables = pd.DataFrame({
                "Fifa World Cup": world_cup,
                "Group": group_letter, 
                "Teams": teams, 
                "Match played": match_played, 
                "Match won": match_won,
                "Draw": draw,
                "Lost": lost,
                "Goals for": goals_for,
                "Goals against": goals_against,
                "Goals difference": goals_difference,
                "Points": points
            })

            append_data.append(group_tables)

        else:
            print(page_response.status_code)

    except requests.Timeout as e:
        print('Timeout occurred for requested page: ' + url)
        print(str(e))

    counter += 1
    
worldcup_data = pd.concat(append_data, axis=0, ignore_index=True)

### 4. Export of dataframe to Excel file

As the last step, I export the dataframe to an Excel file.

In [5]:
worldcup_data.to_excel('WorldCupData.xlsx')

worldcup_data.head(10)

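| | Fifa World Cup | Group | Teams | Match played | Match won | Draw | Lost | Goals for | Goals against | Goals difference | Points |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014 | Group A | Brazil | 3 | 2 | 1 | 0 | 7 | 2 | 5 | 7 |
| 1 | 2014 | Group A | Mexico | 3 | 2 | 1 | 0 | 4 | 1 | 3 | 7 |
| 2 | 2014 | Group A | Croatia | 3 | 1 | 0 | 2 | 6 | 6 | 0 | 3 |
| 3 | 2014 | Group A | Cameroon | 3 | 0 | 0 | 3 | 1 | 9 | -8 | 0 |
| 4 | 2014 | Group B | Netherlands | 3 | 3 | 0 | 0 | 10 | 3 | 7 | 9 |
| 5 | 2014 | Group B | Chile | 3 | 2 | 0 | 1 | 5 | 3 | 2 | 6 |
| 6 | 2014 | Group B | Spain | 3 | 1 | 0 | 2 | 4 | 7 | -3 | 3 |
| 7 | 2014 | Group B | Australia | 3 | 0 | 0 | 3 | 3 | 9 | -6 | 0 |
| 8 | 2014 | Group C | Colombia | 3 | 3 | 0 | 0 | 9 | 2 | 7 | 9 |
| 9 | 2014 | Group C | Greece | 3 | 1 | 1 | 1 | 2 | 4 | -2 | 4 |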
