# Scraping the All Time RWC records

In this project I scrape the all-time Rugby World Cup (RWC) records from Wikipedia. The table lists the total matches played by every country that was invited to, or has qualified for, the Rugby World Cup since the 1987 edition. The data itself came from the StatsHub of the 2015 RWC website.

This all-time table compares national teams that have participated in the Rugby World Cup by a number of criteria including matches, wins, losses, draws, total points for, total points against, etc.

This project is part of a larger project of mine to collect, combine, clean, analyse, and visualize data relevant to #RWC2023.

First, I make sure that my scraper is working by printing the entire HTML document.

In [2]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

URL = 'https://en.wikipedia.org/wiki/Rugby_World_Cup_all-time_table'
response = requests.get(URL)
# Stop early with an HTTPError if the request did not succeed.
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')

print(soup)

<!DOCTYPE html>

<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Rugby World Cup all-time table - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"7b2d2ff9-1efe-4bb8-bda4-f548e8288d64","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Rugby_World_Cup_all-time_table","wgTitle":"Rugby World Cup all-time table","wgCurRevisionId":951885304,"wgRevisionId":951885304,"wgArticleId":5753175,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 maint: discouraged parameter","Rugby World Cup","Rugby union records and statistics"],"wgPageContentLanguage":"en
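Printing the whole document produces a lot of output for a sanity check. A lighter alternative, sketched here on a small inline snippet, is to print only the page title. The `sample_html` string is an assumption standing in for `response.text`; it is a trimmed copy of the `<head>` shown above, not the real page.

```python
from bs4 import BeautifulSoup

# Stand-in for response.text: a trimmed copy of the <head> printed above.
sample_html = """<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Rugby World Cup all-time table - Wikipedia</title>
</head>
<body></body>
</html>"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# .title is the first <title> tag; get_text() returns its text content.
page_title = sample_soup.title.get_text()
print(page_title)
```

If the title matches the article I expect, the request and the parser are both doing their job.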

Next, using the Mozilla inspector, I found where the data is stored: in a table of class "wikitable sortable". I isolate it with the soup.find() method, essentially "slicing" the HTML down to that table.

In [47]:
table = soup.find('table', {'class':'wikitable sortable'}).tbody

print(table)

<tbody><tr>
<th width="150">Country
</th>
<th width="20">Pld
</th>
<th width="20">W
</th>
<th width="20">D
</th>
<th width="20">L
</th>
<th width="20">PF
</th>
<th width="20">PA
</th>
<th width="25">PD
</th>
<th width="20">%
</th>
<th width="20">TB
</th>
<th width="20">LB
</th>
<th width="20">Pts
</th>
<th width="20">Avg<br/>pts
</th>
<th width="80">Best finish
</th></tr>
<tr>
<td align="left"><span class="flagicon"><img alt="" class="thumbborder" data-file-height="267" data-file-width="400" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/a/af/Flag_of_South_Africa.svg/23px-Flag_of_South_Africa.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/a/af/Flag_of_South_Africa.svg/35px-Flag_of_South_Africa.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/a/af/Flag_of_South_Africa.svg/45px-Flag_of_South_Africa.svg.png 2x" width="23"/> </span><a href="/wiki/South_Africa_national_rugby_union_team" title="South Africa national rugby union t
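To show how the class lookup behaves, here is a minimal sketch on a made-up miniature page (the `sample_html` string and the `sample_*` names are hypothetical, not part of the project). It also guards against `find()` returning `None`, which would happen if the class string on the page ever changed.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page with two tables; only the second one has
# the exact class string "wikitable sortable".
sample_html = """
<table class="wikitable"><tbody><tr><th>Other</th></tr></tbody></table>
<table class="wikitable sortable"><tbody>
<tr><th>Country</th><th>Pld</th></tr>
<tr><td>South Africa</td><td>43</td></tr>
</tbody></table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# A class value containing a space is compared with the full attribute string,
# so the first table (class "wikitable" only) is skipped.
sample_table = sample_soup.find('table', {'class': 'wikitable sortable'})

# find() returns None when nothing matches, so check before using .tbody.
if sample_table is None:
    raise ValueError('no table with class "wikitable sortable" found')

sample_rows = sample_table.tbody.find_all('tr')
first_header = sample_rows[0].find('th').get_text(strip=True)
print(len(sample_rows), first_header)
```

Because the match is on the exact string, a table whose classes appear in a different order would not be found this way.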

Put simply, every row starts with 'tr' and every column header starts with 'th'. I isolate these; the columns variable will also serve as an index later. I also take the opportunity to do a little cleaning and get rid of the '\n'.

In [40]:
rows = table.find_all('tr')
columns = [i.text.replace('\n', '') for i in rows[0].find_all('th')]

print(columns)

['Country', 'Pld', 'W', 'D', 'L', 'PF', 'PA', 'PD', '%', 'TB', 'LB', 'Pts', 'Avgpts', 'Best finish']
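Note that 'Avgpts' lost its space: the header cell is `Avg<br/>pts`, and the `<br/>` tag contributes no text. A possible alternative, sketched below on two header cells copied from the table output above, is `get_text()` with a separator. The `header_html`, `fused`, and `spaced` names are mine and only for illustration.

```python
from bs4 import BeautifulSoup

# Two header cells copied from the table output above.
header_html = '<tr><th width="20">Pts\n</th>\n<th width="20">Avg<br/>pts\n</th></tr>'
header_row = BeautifulSoup(header_html, 'html.parser')

# Same approach as above: the <br/> contributes no text, so the words fuse.
fused = [th.text.replace('\n', '') for th in header_row.find_all('th')]

# get_text(' ', strip=True) strips each text fragment and joins them with a space.
spaced = [th.get_text(' ', strip=True) for th in header_row.find_all('th')]

print(fused)
print(spaced)
```

Either form works as a column label; the spaced version is just easier to read in plots and tables.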


The index looks just as it should. From here, my goal is to populate a dataframe with the values in the "rows" variable, using the "columns" list I just created as the column labels.

The for loop:
- Goes through all of the rows and takes the values stored in 'td'.
- Populates the dataframe using those values and also replaces some of the rogue strings along the way (newlines, non-breaking spaces, and the dash used for empty cells, which becomes '0').

In [38]:
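# Collect one list of cleaned cell values per data row (rows[0] is the header).
records = []
for row in rows[1:]:
    tds = row.find_all('td')
    # Strip newlines and non-breaking spaces; '—' marks an empty cell, so use '0'.
    values = [td.text.replace('\n','').replace('\xa0','').replace('—','0') for td in tds]
    records.append(values)

# DataFrame.append was removed in pandas 2.0, so build the frame in one step.
df = pd.DataFrame(records, columns=columns)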

Let's see if this worked.

In [44]:
df.head()

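|   | Country | Pld | W | D | L | PF | PA | PD | % | TB | LB | Pts | Avgpts | Best finish |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | South Africa | 43 | 36 | 0 | 7 | 1512 | 553 | 959 | 83.72 | 14 | 2 | 128 | 3.2 | Winner |
| 1 | New Zealand | 56 | 49 | 0 | 7 | 2552 | 753 | 1799 | 87.5 | 15 | 1 | 169 | 3.13 | Winner |
| 2 | Australia | 53 | 42 | 0 | 11 | 1794 | 763 | 1031 | 79.25 | 11 | 3 | 145 | 2.84 | Winner |
| 3 | England | 48 | 34 | 1 | 13 | 1511 | 685 | 826 | 71.88 | 10 | 1 | 127 | 2.65 | Winner |
| 4 | France | 52 | 36 | 2 | 14 | 1605 | 937 | 668 | 71.15 | 10 | 2 | 129 | 2.48 | Runner-up |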
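One thing to keep in mind is that every scraped value is still a string. Before any analysis the numeric columns need converting; a minimal sketch with `pd.to_numeric` follows, using two rows and a subset of columns taken from the output above. It assumes every numeric cell parses cleanly, and the `sample`, `text_columns`, and `numeric_columns` names are hypothetical.

```python
import pandas as pd

# Two rows copied from the output above, as the strings the scraper produces.
sample = pd.DataFrame(
    [['South Africa', '43', '36', '83.72', 'Winner'],
     ['France', '52', '36', '71.15', 'Runner-up']],
    columns=['Country', 'Pld', 'W', '%', 'Best finish'],
)

# Columns that hold text and should stay as strings.
text_columns = ['Country', 'Best finish']
numeric_columns = [c for c in sample.columns if c not in text_columns]

# pd.to_numeric parses each string column into an integer or float dtype.
sample[numeric_columns] = sample[numeric_columns].apply(pd.to_numeric)

print(sample.dtypes)
print(sample['Pld'].sum())
```

After the conversion, sums, sorts, and plots treat 'Pld', 'W', and '%' as numbers rather than text.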


Looks just as it should!

Finally, I export the dataframe to a CSV file, which can be found in the repo.

In [36]:
df.to_csv(r'C:\Users\lacar\DQ Projects\Rugby DataVis' + '\\alltimeRWCrecords.csv', index=False)
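The path above only exists on my machine. A more portable variant could build the path with `pathlib`; the sketch below is one way to do it. The `export_csv` helper, the `demo` frame, and the temporary folder are all hypothetical and only there to make the example self-contained.

```python
import tempfile
from pathlib import Path

import pandas as pd


def export_csv(frame, directory, filename='alltimeRWCrecords.csv'):
    """Write frame to directory/filename and return the resulting path."""
    target_dir = Path(directory)
    # Create the folder (and any missing parents) if it does not exist yet.
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    frame.to_csv(target, index=False)
    return target


# Hypothetical usage: write a tiny frame into a temporary folder and read it back.
demo = pd.DataFrame({'Country': ['South Africa'], 'Pld': ['43']})
with tempfile.TemporaryDirectory() as tmp:
    written = export_csv(demo, Path(tmp) / 'data')
    round_trip = pd.read_csv(written)

print(written.name)
print(round_trip)
```

In the notebook itself, the call would be something like `export_csv(df, 'data')`, with the folder name chosen relative to the repo.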