# Introduction:

Every year, the **city of Chicago** hosts its famous **marathon**, with thousands of participants, and 2021 was no exception. The website **results.chicagomarathon.com/well-known/2021/** contains all the records of the 2021 race.

For this project, we are interested in **scraping** the information of the **top 50 male runners** using **requests** and **BeautifulSoup**. Specifically, we will collect their:
> - name
> - age group
> - bib number
> - age
> - city/state
> - split times

# Scraping process:

### General Information:

Import the **requests** library, which we will use to connect to a web page:

In [1]:
import requests

Send a **GET** request to the results website and download the **HTML** content:

In [2]:
records= requests.get('https://results.chicagomarathon.com/2021/?lang=EN_CAP&num_results=50&pid=list&pidp=start&search%5Bsex%5D=M&search%5Bage_class%5D=%25&event=MAR&favorite_add=LSMG96382434CB').text
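
The long URL above is easier to read once its query string is decoded. The following is a minimal sketch using only the standard library's `urllib.parse`; the variable names are illustrative and not part of the original notebook.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

# The results URL used in the GET request above.
url = (
    "https://results.chicagomarathon.com/2021/"
    "?lang=EN_CAP&num_results=50&pid=list&pidp=start"
    "&search%5Bsex%5D=M&search%5Bage_class%5D=%25"
    "&event=MAR&favorite_add=LSMG96382434CB"
)

# Decode the query string into a plain dict (percent-escapes are unquoted).
params = dict(parse_qsl(urlsplit(url).query))
for key, value in params.items():
    print(f"{key} = {value}")

# Re-encoding the dict reproduces the original query string.
rebuilt = urlencode(params)
```

Read this way, `num_results=50` and `search[sex]=M` appear to be what restricts the list to 50 male runners. `requests.get` also accepts such a dictionary through its `params` argument, which can be more readable than a hand-written query string.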


Import **BeautifulSoup**, create an instance, and **parse** our HTML document:

In [3]:
from bs4 import BeautifulSoup
# name the parser explicitly so bs4 does not have to guess one (and warn about it)
soup = BeautifulSoup(records, "html.parser")

We can now **print** the HTML content of the page, **nicely formatted**, using the **prettify** method of the BeautifulSoup object:



In [4]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   Bank of America Chicago Marathon
  </title>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="black" name="apple-mobile-web-app-status-bar-style"/>
  <link href="//results-static.mikatiming.com/2021/chicago/../../stages/blue/images/apple-touch-icon.png" rel="apple-touch-icon"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <meta content="198023663696085" property="fb:app_id"/>
  <meta content="https://results-static.mikatiming.com/2021/chicago/styles/responsive_2016/logo_fb.png" property="og:image"/>
  <meta content="630" property="og:image:width"/>
  <meta content="315" property="og:image:height"/>
  <meta content="website" property="og:type"/>
  <meta content="Mika timing" property="og:site_name"/>
  <meta content="https://results.chicagomarathon.com/2021/?event=MAR&amp;lang=EN_CAP&amp;num_results=50&amp;pid=list&amp;pidp=start&amp;search%5Bsex%5D=M&amp;search%5Bage_class%5D=

Notice that every runner is wrapped in an **h4** tag with the class **list-field type-fullname**. Hence, we can obtain the list of all **50 runners** using the **find_all** method:

In [5]:
runners=soup.find_all("h4",class_="list-field type-fullname")

**Check** that we have the correct number of runners:

In [6]:
len(runners)

50

Taking a look at the **format** of the first runner in the list:

In [7]:
runners[0]

<h4 class="list-field type-fullname"><a href="?content=detail&amp;fpid=list&amp;pid=list&amp;idp=LSMG963824DD5C&amp;lang=EN_CAP&amp;event=MAR&amp;lang=EN_CAP&amp;num_results=50&amp;pidp=start&amp;search%5Bsex%5D=M&amp;search%5Bage_class%5D=%25&amp;search_event=MAR">Tura Abdiwak, Seifu (ETH)</a></h4>

As we can see, the results list only provides **general information** such as the **name**. However, each entry contains a **link** to a new page with all the **details** of that runner.

### Detailed information:

First, we will scrape the information of **only** the first runner as an example, and **generalize** the process to **all the runners** later.

We want to get the **personal** link of the runner:

In [8]:
tail=runners[0].find('a').get('href')
tail

'?content=detail&fpid=list&pid=list&idp=LSMG963824DD5C&lang=EN_CAP&event=MAR&lang=EN_CAP&num_results=50&pidp=start&search%5Bsex%5D=M&search%5Bage_class%5D=%25&search_event=MAR'
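
To make the two steps so far concrete (selecting the `h4` entries, then reading each link's `href`), here is a small self-contained sketch that runs on a hypothetical HTML snippet shaped like the results list. The runner names and `idp` values are placeholders, not real data. It also shows `select` with a CSS selector as a possible alternative to the exact class-string match used with `find_all`.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet imitating the structure of the results list.
SAMPLE_HTML = """
<ul>
  <li><h4 class="list-field type-fullname"><a href="?content=detail&amp;idp=AAA">Runner, One (USA)</a></h4></li>
  <li><h4 class="list-field type-fullname"><a href="?content=detail&amp;idp=BBB">Runner, Two (KEN)</a></h4></li>
  <li><h4 class="list-field type-age_class">30-34</h4></li>
</ul>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Exact match on the full class string, as in the notebook.
exact = sample_soup.find_all("h4", class_="list-field type-fullname")

# CSS selector: matches h4 elements carrying both classes, in any order.
by_selector = sample_soup.select("h4.list-field.type-fullname")

# Pair each runner's display name with the relative link to the detail page.
pairs = [(h4.get_text(strip=True), h4.find("a").get("href")) for h4 in exact]
for name, href in pairs:
    print(name, "->", href)
```

The parser decodes `&amp;` inside the attribute, so the extracted `href` contains a plain `&`, just like the `tail` value shown above.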

Prepend the **base URL** to get the **complete** link:

In [9]:
link= "https://results.chicagomarathon.com/2021/"+tail
link

'https://results.chicagomarathon.com/2021/?content=detail&fpid=list&pid=list&idp=LSMG963824DD5C&lang=EN_CAP&event=MAR&lang=EN_CAP&num_results=50&pidp=start&search%5Bsex%5D=M&search%5Bage_class%5D=%25&search_event=MAR'
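
String concatenation works here because the `href` is a bare query string. As a hedged alternative, `urllib.parse.urljoin` from the standard library resolves relative links against a base URL in general. `BASE_URL` and the shortened `tail` below are illustrative.

```python
from urllib.parse import urljoin

BASE_URL = "https://results.chicagomarathon.com/2021/"
# Shortened version of the relative link extracted above.
tail = "?content=detail&fpid=list&pid=list&idp=LSMG963824DD5C&lang=EN_CAP&event=MAR"

# A query-only reference keeps the base path and replaces the query.
link = urljoin(BASE_URL, tail)
print(link)

# A root-relative reference is resolved against the host instead.
absolute = urljoin(BASE_URL, "/2021/?pid=list")
print(absolute)
```

For this site both approaches give the same result; `urljoin` only starts to matter if the links ever come in another relative form.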

Now, we can again send a **GET** request and download the HTML content of the web page:

In [10]:
l= requests.get(link).text

Parse the HTML content with BeautifulSoup and print it out **nicely** with the `prettify` method:

In [11]:
soup_1 = BeautifulSoup(l, "html.parser")

In [12]:
print(soup_1.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   Bank of America Chicago Marathon
  </title>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="black" name="apple-mobile-web-app-status-bar-style"/>
  <link href="//results-static.mikatiming.com/2021/chicago/../../stages/blue/images/apple-touch-icon.png" rel="apple-touch-icon"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <meta content="198023663696085" property="fb:app_id"/>
  <meta content="https://results-static.mikatiming.com/2021/chicago/styles/responsive_2016/logo_fb.png" property="og:image"/>
  <meta content="630" property="og:image:width"/>
  <meta content="315" property="og:image:height"/>
  <meta content="website" property="og:type"/>
  <meta content="Mika timing" property="og:site_name"/>
  <meta content="https://results.chicagomarathon.com/2021/?content=detail&amp;event=MAR&amp;idp=LSMG963824DD5C&amp;lang=EN_CAP&amp;num_results=50&amp;pid=list&amp;pidp=start&amp;se

Looking at the content, we notice that **each piece of detail information** is wrapped in a **td** tag with a **specific class name**, such as **f-__fullname last** for the name, **f-age_class last** for the age group, and so on.

Therefore, we extract them one by one with the `find` and `get_text` methods of BeautifulSoup:

In [13]:
soup_1.find("td",class_="f-__fullname last").get_text()

'Tura Abdiwak, Seifu (ETH)'
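
The name field packs three pieces of information into one string: family name, given name, and a three-letter country code. Should separate columns be wanted later, a regular expression is one possible approach. This is only a sketch: the helper `split_name` and its pattern are assumptions based on the names shown in this notebook and may not cover every entry on the site.

```python
import re

# Pattern assumed from entries like 'Tura Abdiwak, Seifu (ETH)'.
NAME_PATTERN = re.compile(
    r"^(?P<last>[^,]+),\s*(?P<first>.+?)\s*\((?P<country>[A-Z]{3})\)$"
)

def split_name(full_name):
    """Return (last, first, country), or None if the string does not match."""
    match = NAME_PATTERN.match(full_name.strip())
    if match is None:
        return None
    return match.group("last"), match.group("first"), match.group("country")

print(split_name("Tura Abdiwak, Seifu (ETH)"))
print(split_name("Kipyego, Reuben Kiprop (KEN)"))
```

Returning `None` for unexpected formats keeps the raw string available, so nothing is lost if a name does not follow the assumed pattern.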

In [14]:
soup_1.find("td",class_="f-age_class last").get_text()

'20-24'

In [15]:
soup_1.find("td",class_="f-start_no_text last").get_text()

'3'

In [16]:
soup_1.find("td",class_="f-__city_state last").get_text()

'Iseo'

In [17]:
soup_1.find("td",class_="f-display_name_short last").get_text()

'ST'
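
`find` returns `None` when no element matches, so chaining `.get_text()` raises an `AttributeError` if a field is absent from a runner's page. A defensive helper might look like the sketch below. `field_text` is a hypothetical name, and the HTML snippet is a made-up stand-in for a detail page.

```python
from bs4 import BeautifulSoup

def field_text(page, css_class, default=None):
    """Return the stripped text of the first <td> with this class, or `default`."""
    cell = page.find("td", class_=css_class)
    return cell.get_text(strip=True) if cell is not None else default

# Made-up stand-in for a runner's detail page (no city cell, on purpose).
sample_page = BeautifulSoup(
    '<table><tr><td class="f-age_class last">20-24</td>'
    '<td class="f-start_no_text last"> 3 </td></tr></table>',
    "html.parser",
)

print(field_text(sample_page, "f-age_class last"))
print(field_text(sample_page, "f-__city_state last", default=""))
```

With such a helper, a missing cell turns into an empty value in the final table instead of stopping the whole run.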

In [18]:
times=soup_1.find_all("td",class_="time")
times

[<td class="time">00:14:43</td>,
 <td class="time">00:29:15</td>,
 <td class="time">00:44:21</td>,
 <td class="time">00:59:13</td>,
 <td class="time">01:02:29</td>,
 <td class="time">01:14:42</td>,
 <td class="time">01:30:06</td>,
 <td class="time">01:45:01</td>,
 <td class="time">01:59:44</td>,
 <td class="time">02:06:12</td>]

The **splits** information is different. We have the **time** for **10** splits, each wrapped in a **td** tag with the class **time**. Hence, we **iterate** through them and get the text of each:

In [19]:
for i in times:
    print(i.get_text())

00:14:43
00:29:15
00:44:21
00:59:13
01:02:29
01:14:42
01:30:06
01:45:01
01:59:44
02:06:12
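
The split times are cumulative clock times stored as `HH:MM:SS` strings. To do arithmetic on them, for example to see how long each segment took, they first need to be converted to numbers. A minimal sketch, with `to_seconds` and the variable names being illustrative:

```python
# Cumulative split times of the first runner, as printed above.
cumulative = ["00:14:43", "00:29:15", "00:44:21", "00:59:13", "01:02:29",
              "01:14:42", "01:30:06", "01:45:01", "01:59:44", "02:06:12"]

def to_seconds(clock):
    """Convert an 'HH:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return hours * 3600 + minutes * 60 + seconds

totals = [to_seconds(t) for t in cumulative]

# Time spent between consecutive checkpoints (the first segment starts at 0).
segments = [later - earlier for earlier, later in zip([0] + totals, totals)]

print(totals[-1])   # finish time in seconds
print(segments)     # seconds per segment
```

Note that the segments are not all the same length: the half-marathon checkpoint sits between 20K and 25K, so the two segments around it are shorter than 5 km.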


And that is how we obtain all the information for **1 runner**. We now generalize the process to **50 runners**:

A dataframe can be built from a **list of dictionaries**, one per row. Therefore, we will keep a list `all_runners` and, in every **iteration**, append to it a **dictionary** with the information of **one runner**. The keys for the split times come from the following list:

In [20]:
splits=["05K","10K","15K","20K","HALF","25K","30K","35K","40K","Finish"]

In addition, to limit the load on the web server, we add a delay of 2 seconds between requests with the **sleep** function from the **time** module:

In [21]:
from time import sleep

In [22]:
# initialize the list of runners
all_runners = []
# iterate through every person in the runners list
for person in runners:
    # dictionary mapping each field name to its value for this runner
    listing = {}
    # obtain the relative link to the runner's detail page
    incomplete_link = person.find('a').get('href')
    # complete the link by prepending the base URL https://results.chicagomarathon.com/2021/
    link = "https://results.chicagomarathon.com/2021/" + incomplete_link
    # send a GET request to download the HTML content
    info = requests.get(link).text
    # parse with BeautifulSoup; a separate name keeps the list-page `soup` intact
    detail_soup = BeautifulSoup(info, "html.parser")
    # get the list of times of the 10 splits
    time_record = detail_soup.find_all("td", class_="time")
    # get the name, age group, bib number, city and short name, and add them to the dictionary
    listing["Name (CTZ)"] = detail_soup.find("td", class_="f-__fullname last").get_text()
    listing["Age Group"] = detail_soup.find("td", class_="f-age_class last").get_text()
    listing["Bib Number"] = detail_soup.find("td", class_="f-start_no_text last").get_text()
    listing["City, State"] = detail_soup.find("td", class_="f-__city_state last").get_text()
    listing["Short"] = detail_soup.find("td", class_="f-display_name_short last").get_text()
    # match each split to its corresponding time and add them to the dictionary as well
    for i, split in enumerate(splits):
        listing[split] = time_record[i].get_text()
    # add the dictionary to the list of runners
    all_runners.append(listing)
    # pause for 2 seconds before moving on to the next runner
    sleep(2)
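
The loop above stops with an exception as soon as one request or one page lookup fails. One possible way to make it more forgiving is to separate the pacing and error handling from the scraping itself. The sketch below is only an illustration: `scrape_all`, `fetch_one`, and `fake_fetch` are hypothetical names, and the fake fetch function stands in for the real `requests` plus BeautifulSoup code so that the example runs offline.

```python
from time import sleep

def scrape_all(links, fetch_one, delay_seconds=2.0):
    """Call fetch_one(link) for each link, pausing between calls.

    Returns (results, failures); failures holds (link, error message) pairs.
    """
    results, failures = [], []
    for index, link in enumerate(links):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        try:
            results.append(fetch_one(link))
        except Exception as error:  # keep going, but record what went wrong
            failures.append((link, str(error)))
    return results, failures

# Offline stand-in for the real download-and-parse step.
def fake_fetch(link):
    if "bad" in link:
        raise ValueError("page could not be parsed")
    return {"link": link}

results, failures = scrape_all(
    ["?idp=A", "?idp=bad", "?idp=B"], fake_fetch, delay_seconds=0
)
print(len(results), "scraped;", len(failures), "failed")
```

Collecting the failures separately makes it possible to retry just those links afterwards instead of repeating all 50 requests.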

Finally, thanks to the **Pandas** library, we can view our **final** result in a nicer form:

In [23]:
import pandas as pd

In [24]:
df=pd.DataFrame(all_runners)
df

| | Name (CTZ) | Age Group | Bib Number | City, State | Short | 05K | 10K | 15K | 20K | HALF | 25K | 30K | 35K | 40K | Finish |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Tura Abdiwak, Seifu (ETH) | 20-24 | 3 | Iseo | ST | 00:14:43 | 00:29:15 | 00:44:21 | 00:59:13 | 01:02:29 | 01:14:42 | 01:30:06 | 01:45:01 | 01:59:44 | 02:06:12 |
| 1 | Rupp, Galen (USA) | 35-39 | 9 | Portland | GR | 00:14:43 | 00:29:25 | 00:44:23 | 00:59:24 | 01:02:40 | 01:14:44 | 01:30:07 | 01:45:02 | 01:59:53 | 02:06:35 |
| 2 | Kiptanui, Eric (KEN) | 30-34 | 7 | Iten | EK | 00:14:43 | 00:29:17 | 00:44:21 | 00:59:13 | 01:02:29 | 01:14:42 | 01:30:06 | 01:45:01 | 02:00:05 | 02:06:51 |
| 3 | Suzuki, Kengo (JPN) | 25-29 | 5 | Chiba City Chiba | KS | 00:14:44 | 00:29:16 | 00:44:22 | 00:59:15 | 01:02:30 | 01:14:44 | 01:30:07 | 01:45:30 | 02:01:48 | 02:08:50 |
| 4 | Tamru Aredo, Shifera (ETH) | 20-24 | 6 | Iseo | ST | 00:14:36 | 00:29:15 | 00:44:06 | 00:59:10 | 01:02:29 | 01:14:43 | 01:30:08 | 01:45:59 | 02:02:16 | 02:09:39 |
| 5 | Mickow, Colin (USA) | 30-34 | 15 | Oswego | CM | 00:15:28 | 00:31:01 | 00:46:35 | 01:02:40 | 01:06:11 | 01:18:41 | 01:34:24 | 01:50:12 | 02:06:23 | 02:13:31 |
| 6 | Montanez, Nico (USA) | 25-29 | 22 | Mammoth Lakes | NM | 00:15:28 | 00:31:02 | 00:46:35 | 01:02:40 | 01:06:11 | 01:18:42 | 01:34:24 | 01:50:12 | 02:06:25 | 02:13:55 |
| 7 | Kipyego, Reuben Kiprop (KEN) | 25-29 | 2 | Kapsabet | RK | 00:14:43 | 00:29:15 | 00:44:21 | 00:59:13 | 01:02:29 | 01:14:42 | 01:30:07 | 01:46:12 | 02:06:02 | 02:14:24 |
| 8 | Fischer, Reed (USA) | 25-29 | 30 | Boulder | RF | 00:15:52 | 00:31:45 | 00:47:45 | 01:03:45 | 01:07:17 | 01:19:54 | 01:35:43 | 01:51:39 | 02:07:41 | 02:14:41 |
| 9 | Given, Wilkerson (USA) | 30-34 | 16 | Atlanta | WG | 00:15:29 | 00:31:02 | 00:46:35 | 01:02:41 | 01:06:11 | 01:18:42 | 01:34:25 | 01:50:28 | 02:07:17 | 02:14:55 |
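
Once the records are collected, it is usually worth saving them so that the site does not have to be scraped again. With pandas this can be done via `DataFrame.to_csv`. The sketch below shows the same idea with only the standard library's `csv` module, writing a list of dictionaries to an in-memory buffer. The two shortened records are taken from the table above, and `to_csv_text` is an illustrative name.

```python
import csv
import io

def to_csv_text(records):
    """Serialize a list of dicts (all sharing the same keys) as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

# Two records from the table above, reduced to three columns.
sample_runners = [
    {"Name (CTZ)": "Tura Abdiwak, Seifu (ETH)", "Bib Number": "3", "Finish": "02:06:12"},
    {"Name (CTZ)": "Rupp, Galen (USA)", "Bib Number": "9", "Finish": "02:06:35"},
]

csv_text = to_csv_text(sample_runners)
print(csv_text)

# Reading the text back yields the original records.
restored = list(csv.DictReader(io.StringIO(csv_text)))
```

Fields that contain a comma, such as the runner's name, are quoted automatically, so they survive the round trip intact.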


**AMAZING!** We have achieved our objective of obtaining all the detailed information of the **50 runners** using the **requests** and **BeautifulSoup** libraries.