# Demo for Lecture 21

### Code for the Lecture 21 demo

Reference:

https://medium.com/geekculture/web-scraping-football-matches-from-the-world-cups-1930-to-2022-with-python-d2a1d578f034

In [None]:
#import os
#from google.colab import drive
#
#drive.mount('/content/gdrive')
#
#%cd /content/gdrive/MyDrive/ITEC419-fa22/lec

# **Part1: Scraping data from one World Cup**

We'll start by scraping data from a single World Cup: [Brazil 2014](https://en.wikipedia.org/wiki/2014_FIFA_World_Cup).

### **Installing the libraries**
* `bs4` (Beautiful Soup) to scrape the website
* `lxml` to parse the HTML
* `requests` to send requests to the target website


In [None]:
# Google Colab already has the following libraries installed.
# To install them yourself, uncomment the lines below.

# !pip install bs4
# !pip install lxml
# !pip install requests

In [None]:
from datascience import *
import numpy as np

from bs4 import BeautifulSoup
import requests

### **Creating a soup**
To extract data with Beautiful Soup, we first need to create a soup: a parsed representation of the page's HTML.

In [None]:
web = 'https://en.wikipedia.org/wiki/2014_FIFA_World_Cup'
response = requests.get(web)
content = response.text
soup = BeautifulSoup(content, 'lxml')

### **Extracting all the matches from the World Cup**

Next we extract the football matches. To do so, we need to identify a pattern in the page's HTML that lets us scrape every match of the competition, not just one.

To find such a pattern, inspect the page by right-clicking it and selecting “Inspect”; the browser's developer tools will open.

Use the `.find_all` method on our `soup` to extract all the matches.
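
As a minimal sketch of how `.find_all` behaves, the snippet below runs it on a tiny hand-written HTML string rather than the live page. The HTML, including the extra `highlight` class, is made up for illustration, and `html.parser` is used only so the example does not depend on `lxml`.

```python
from bs4 import BeautifulSoup

# Simplified, made-up stand-in for a page; the real Wikipedia markup is much larger.
html = """
<div class="footballbox">match 1</div>
<div class="footballbox highlight">match 2</div>
<div class="infobox">not a match</div>
"""

soup = BeautifulSoup(html, "html.parser")

# class_ matches any element whose class list contains "footballbox",
# so the second div (which has two classes) is found as well.
matches = soup.find_all("div", class_="footballbox")
print(len(matches))
```

Because `.find_all` returns a list-like result, `len(matches)` is a quick check that the chosen pattern captures as many matches as expected.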

### **Extracting the home/away teams and score data of every match**

```
<div class="footballbox">
    ...
    <th class="fhome">
        ...
        <a href="...">Brazil</a>
    <th class="fscore">
        ...
        <a href="...">3-1</a>
    <th class="faway">
        ...
        <a href="...">Croatia</a>

```
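
The structure above can be turned into rows with one `.find` call per cell. The following is a sketch under stated assumptions: the HTML is a simplified, hand-written version of a single match box, and `extract_matches` is a hypothetical helper name, not a function from the lecture.

```python
from bs4 import BeautifulSoup

# Simplified, hand-written version of one match box (assumption, not the real page).
html = """
<div class="footballbox">
  <table>
    <tr>
      <th class="fhome"><a href="#">Brazil</a></th>
      <th class="fscore"><a href="#">3–1</a></th>
      <th class="faway"><a href="#">Croatia</a></th>
    </tr>
  </table>
</div>
"""


def extract_matches(soup):
    """Return one [home, score, away] list per footballbox div (hypothetical helper)."""
    rows = []
    for box in soup.find_all("div", class_="footballbox"):
        home = box.find("th", class_="fhome").get_text()
        score = box.find("th", class_="fscore").get_text()
        away = box.find("th", class_="faway").get_text()
        rows.append([home, score, away])
    return rows


soup = BeautifulSoup(html, "html.parser")
rows = extract_matches(soup)
print(rows)
```

On the live page the extracted text may carry extra whitespace or other characters, which is why a cleaning step is still needed afterwards.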



# **Part2: Scraping data from ALL the World Cups**

Let’s have a look at the links for the 2014, 2018, and 2022 World Cups:

* https://en.wikipedia.org/wiki/2014_FIFA_World_Cup
* https://en.wikipedia.org/wiki/2018_FIFA_World_Cup
* https://en.wikipedia.org/wiki/2022_FIFA_World_Cup

The links are identical except for the year in which the World Cup took place.

We can re-write our `web` variable to consider this pattern:

### **Define `get_matches()`**


In [None]:
year = 2014
web = f'https://en.wikipedia.org/wiki/{year}_FIFA_World_Cup'
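
The same f-string pattern extends to any list of years. A small sketch, where `world_cup_url` is a hypothetical helper name introduced only for this example:

```python
def world_cup_url(year):
    """Build the Wikipedia URL for one World Cup year (hypothetical helper)."""
    return f"https://en.wikipedia.org/wiki/{year}_FIFA_World_Cup"


urls = [world_cup_url(year) for year in (2014, 2018, 2022)]
for url in urls:
    print(url)
```

Keeping the URL construction in one place means a function that takes `year` as its argument only needs that single value to find the right page.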

In [None]:
def get_matches(year):
    ...

In [None]:
years = [1930, 1934, 1938, 1950, 1954, 1958, 1962, 1966, 1970, 1974,
         1978, 1982, 1986, 1990, 1994, 1998, 2002, 2006, 2010, 2014,
         2018]

In [None]:
# results: historical data
fifa = Table(['year', 'home', 'score', 'away'])
for year in years:
    fifa = fifa.append(get_matches(year))

fifa.show(5)

In [None]:
fifa.group('year').show()

**What data are missing?**

In [None]:
year = 1930
web = f'https://en.wikipedia.org/wiki/{year}_FIFA_World_Cup'
print(web)
print('matches: ', fifa.where('year', year).num_rows)
fifa.where('year', year).show()

In [None]:
year = 1974
web = f'https://en.wikipedia.org/wiki/{year}_FIFA_World_Cup'
print(web)
print('matches: ', fifa.where('year', year).num_rows)
fifa.where('year', year).show()

Use the Wikipedia article on the **FIFA World Cup** (https://en.wikipedia.org/wiki/FIFA_World_Cup) to confirm the tournament years and the number of matches played in each.

In [None]:
web = 'https://en.wikipedia.org/wiki/FIFA_World_Cup'
response = requests.get(web)
content = response.text
soup = BeautifulSoup(content, 'lxml')
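# Note: browsers add 'jquery-tablesorter' via JavaScript, so it may be absent from
# the HTML that requests downloads; if nothing is found, try class_='sortable'.
matches = soup.find_all('table', class_='wikitable sortable jquery-tablesorter')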
len(matches)
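
Once the right table is located, its rows can be read cell by cell. The sketch below assumes a heavily simplified table with only two columns; the table on the real page has more columns and its layout should be checked with the browser's inspector. The match counts shown (18, 17, 18) are the official totals for 1930, 1934, and 1938.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for a Wikipedia table (assumption: the real one has more columns).
html = """
<table class="wikitable sortable">
  <tr><th>Year</th><th>Matches</th></tr>
  <tr><td>1930</td><td>18</td></tr>
  <tr><td>1934</td><td>17</td></tr>
  <tr><td>1938</td><td>18</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="wikitable")

matches_by_year = {}
for tr in table.find_all("tr")[1:]:  # skip the header row
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    matches_by_year[int(cells[0])] = int(cells[1])

print(matches_by_year)
```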

In [None]:
...

matches_per_year = ...
matches_per_year.show()

## **What is the extra match for 1938?**

In [None]:
year = 1938
web = f'https://en.wikipedia.org/wiki/{year}_FIFA_World_Cup'
print(web)
get_matches(year).show()

## **Extracting all missing matches from the older page format**

In [None]:
year = 1974
web = f'https://en.wikipedia.org/wiki/{year}_FIFA_World_Cup'
print(web)
response = requests.get(web)
content = response.text
soup = BeautifulSoup(content, 'lxml')
matches = soup.find_all('table')
len(matches)

### **Define `get_missed_matches()`**


In [None]:
def get_missed_matches(year):
    ...

In [None]:
year = 1930
fifa = get_matches(year)
print(fifa.num_rows)
fifa_missed = get_missed_matches(year)
print(fifa_missed.num_rows) 

In [None]:
year = 1970
fifa = get_matches(year)
print(fifa.num_rows)
fifa_missed = get_missed_matches(year)
print(fifa_missed.num_rows) 

In [None]:
def web_scraping_WC():
    fifa = Table(['year', 'home', 'score', 'away'])
    for year in years:
        fifa = fifa.append(get_matches(year))
        fifa = fifa.append(get_missed_matches(year))
    return fifa

In [None]:
fifa = web_scraping_WC()
fifa.group('year').show()

In [None]:
matches_per_year.show(3)

In [None]:
matches_per_year = matches_per_year.with_column(
    'Matches by Web Scraping',
    fifa.group('year').column('count')
)

matches_per_year = matches_per_year.with_column(
    'Diff',
    matches_per_year.column('Matches')
    - fifa.group('year').column('count')
)
matches_per_year.show()
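
The comparison above relies on both tables listing the same years in the same order. As a hedged alternative, the counts can be compared by key, which does not depend on row order. The scraped years below are made-up illustrative data, not real scraping results.

```python
from collections import Counter

# Official match counts per year (1930, 1934, 1938).
official = {1930: 18, 1934: 17, 1938: 18}

# Made-up scraped data for illustration: one year value per scraped match row.
scraped_years = [1930] * 18 + [1934] * 17 + [1938] * 19

scraped = Counter(scraped_years)

# Same sign convention as the 'Diff' column: official count minus scraped count.
diff = {year: official[year] - scraped.get(year, 0) for year in sorted(official)}
print(diff)
```

A positive value means matches are still missing from the scrape; a negative value means the scrape picked up extra rows.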

### **Cleaning Data**

In [None]:
# we find several unwanted patterns
# take expects a single index, a list/array of indices, or a slice
fifa.take([0, 24, 37])

In [None]:
# how to solve the above issue


In [None]:
# other unwanted characters
fifa.column('home').item(0)

In [None]:
# how to solve it


In [None]:
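def processing_score(score):
    # Turn spaces into en dashes so that any note after the score splits off too
    res = score.replace(' ', '–').split('–')
    if len(res) == 1:
        # No en dash found, so there is no parsable score: return a sentinel
        return [-1, -1]
    else:
        return [int(res[0]), int(res[1])]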
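
To see what `processing_score` does, here it is applied to a few sample strings. The function is copied from the cell above so the snippet runs on its own; the sample inputs are assumptions about what scraped scores may look like, not values taken from the actual data.

```python
def processing_score(score):
    # Copied from the notebook cell above.
    res = score.replace(' ', '–').split('–')
    if len(res) == 1:
        return [-1, -1]
    else:
        return [int(res[0]), int(res[1])]


# Sample inputs (assumed formats); note the en dash '–', not a hyphen '-'.
plain = processing_score('4–1')
with_note = processing_score('2–1 (a.e.t.)')
no_score = processing_score('w/o')

print(plain, with_note, no_score)
```

Only the first two pieces are converted to integers, so anything after the score is ignored, and a string without an en dash yields the `[-1, -1]` sentinel that can later be filtered out.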

In [None]:
# clean team names and split score
fifa = ...

# remove matches with empty scores
fifa = ...

fifa.show(5)

| year | home | score | away |
|---|---|---|---|
| 1930 | France | [4 1] | Mexico |
| 1930 | Argentina | [1 0] | France |
| 1930 | Chile | [3 0] | Mexico |
| 1930 | Chile | [1 0] | France |
| 1930 | Argentina | [6 3] | Mexico |


In [None]:
# add home/away/total goals columns
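
One way to derive the three goal columns is to index into each score pair. This is a plain-Python sketch rather than the lecture's solution; the score pairs are taken from the 1930 rows shown in the output above.

```python
# Score pairs as [home_goals, away_goals], taken from the 1930 rows above.
scores = [[4, 1], [1, 0], [6, 3]]

home_goals = [pair[0] for pair in scores]
away_goals = [pair[1] for pair in scores]
total_goals = [h + a for h, a in zip(home_goals, away_goals)]

print(home_goals, away_goals, total_goals)
```

In the notebook, lists like these could then be attached to the table as new columns, for example with the `datascience` method `with_columns`.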


In [None]:
# remove the original score column


In [None]:
fifa2022 = get_matches(2022)
fifa2022.show()

In [None]:
# clean team names
fifa2022.append_column('home', fifa2022.apply(lambda x: x.strip(), 'home'))
fifa2022.append_column('away', fifa2022.apply(lambda x: x.strip(), 'away'))
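
The `strip()` calls above remove leading and trailing whitespace, which in Python includes the non-breaking space (`\xa0`) that often appears in scraped HTML text. The names below are assumed examples of such strings, not values copied from the scraped table.

```python
# Assumed examples of scraped names with stray whitespace, including '\xa0'.
raw_names = ["Brazil\xa0", "\xa0Croatia", " Mexico "]

clean_names = [name.strip() for name in raw_names]
print(clean_names)
```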