title: Fun Beautiful Soup
date: 2022-01-28 23:14
author: Alex



###  How I Needed To Laugh Locally
Be responsible when scraping sources that quote content creators. That said, the bs4 library is a tool that makes web scraping easy.
### Briefly About the Problem
I once decided to extract some Steven Wright jokes from a web page here: <https://www.laughteronlineuniversity.com/steven-wright-quotes/>.
In general, you parse the HTML and look up the element you need:
```python
from bs4 import BeautifulSoup

html = """
<table>
  <thead>
    <tr>
      <th>Name</th>
      <th>Age</th>
      <th>City</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>John Smith</td>
      <td>35</td>
      <td>New York</td>
    </tr>
    <tr>
      <td>Jane Doe</td>
      <td>27</td>
      <td>Los Angeles</td>
    </tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')
tbody = soup.find('tbody')
print(tbody)
# output:
'''
<tbody>
<tr>
<td>John Smith</td>
<td>35</td>
<td>New York</td>
</tr>
<tr>
<td>Jane Doe</td>
<td>27</td>
<td>Los Angeles</td>
</tr>
</tbody>
'''
```
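
Finding the element is usually only the first step. As a minimal sketch of what might come next, the snippet below turns the same sample table into one dictionary per row. The helper name `table_to_records` is my own and not part of bs4, and it assumes the table has a single header row and a `tbody`.

```python
from bs4 import BeautifulSoup

html = """
<table>
  <thead><tr><th>Name</th><th>Age</th><th>City</th></tr></thead>
  <tbody>
    <tr><td>John Smith</td><td>35</td><td>New York</td></tr>
    <tr><td>Jane Doe</td><td>27</td><td>Los Angeles</td></tr>
  </tbody>
</table>
"""

def table_to_records(table_html):
    """Return one dict per body row, keyed by the header cell texts.

    Hypothetical helper: assumes one header row and a <tbody> element.
    """
    soup = BeautifulSoup(table_html, 'html.parser')
    headers = [th.get_text(strip=True) for th in soup.find_all('th')]
    records = []
    for tr in soup.find('tbody').find_all('tr'):
        cells = [td.get_text(strip=True) for td in tr.find_all('td')]
        records.append(dict(zip(headers, cells)))
    return records

records = table_to_records(html)
for record in records:
    print(record)
```

All values come back as strings, so a column like `Age` would still need converting before any arithmetic.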

Extracting the jokes was pretty simple too, since the page had only one `ol` element. From that `ol` I pulled all the quotes with the `find_all()` method. To strip the HTML tags and print only the text content, use the `.text` attribute:
### Code:

```python
import requests
from bs4 import BeautifulSoup

url = "https://www.laughteronlineuniversity.com/steven-wright-quotes/"
response = requests.get(url)
html_content = response.content
soup = BeautifulSoup(html_content, 'html.parser')
# Find the ol element, there is only one in this case
obody = soup.find('ol')
# Each quote sits in its own li element
li_quotes = obody.find_all('li')
for li in li_quotes:
    print(li.text)
```

```
7 percent of all statistics are made up on the spot.
A clear conscience is usually the sign of a bad memory.
A conclusion is the place where you got tired of thinking.
A conscience is what hurts when all your other parts feel so good.
A cop stopped me for speeding. He said, “Why were you going so fast?” I said, “See this thing my foot is on? It’s called an accelerator. When you push down on it, it sends more gas to the engine. The whole car just takes right off. And see this thing? This steers it.”
A friend of mine once sent me a post card with a picture of the entire planet Earth taken from space. On the back it said, “Wish you were here.”
A lot of people are afraid of heights. Not me, I’m afraid of widths.
All those who believe in psychokinesis raise my hand.
Ambition is a poor excuse for not having enough sense to be lazy.
Bills travel through the mail at twice the speed of checks.
Borrow money from pessimists-they don’t expect it back.
Change is inevitable….except from vending mach
```

### More Realistic Problem Solving
Another time I was trying to analyze train arrival and departure delays.
I had a URL that serves a separate HTML page for each day.
See the code below.

```python
import requests
import datetime
import io
from bs4 import BeautifulSoup


HEADERS = ['Sched', 'Estmd', 'From', 'To', 'TOC', 'Chgs', 'Train ID', 'date']
HEADERS = ','.join(HEADERS)
output = io.StringIO()
output.write(f'{HEADERS}\n')

# Every date from the start of 2023 until today, in the format the site expects
start_date, end_date = datetime.date(2023, 1, 1), datetime.date.today()
delta_days = (end_date - start_date).days
daterange = [(start_date + datetime.timedelta(days)).strftime('%d/%m/%Y')
             for days in range(delta_days + 1)]

def responder(_url):
    """Download the page at _url and return its first table element."""
    response = requests.get(_url)
    html_content = response.content
    soup = BeautifulSoup(html_content, 'html.parser')
    # Find the table element
    return soup.find('table')

def row_parser(_table):
    """Return the cell texts of every row in _table, skipping the header row."""
    # Find all rows in the table
    rows = _table.find_all('tr')
    _tabrows = [i.find_all('td')
                for i in rows[1:]]

    row_profiles = [[item.get_text(strip=True) for item in row]
                    for row in _tabrows]
    return row_profiles

# Only these three days are fetched here; daterange is not used yet
foo = ['01/01/2023',
       '02/01/2023',
       '03/01/2023']

for i in foo:
    _url = f"http://timetablehistory.com/Station.aspx?StationID=1144&Date={i}"
    tbody = responder(_url)
    rows = row_parser(tbody)

    # Append the date to every row
    _rows2 = [_ + [i] for _ in rows]
    # Note: the cells are joined with newlines and the list is written as its
    # repr, so the result is not comma-separated CSV
    subrows = ['\n'.join(_rows) for _rows in _rows2]
    output.write(f'{subrows}\n')
contents = output.getvalue()
```
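
The loop above joins the cells of each row with newlines and writes the whole list as one string, so the result does not line up with the comma-separated header. Below is a minimal sketch of one way to emit real CSV with the standard `csv` module. The helper name `rows_to_csv` and the choice to drop the trailing `View Details` link cell are my assumptions, and the two sample rows are copied from the scraped data shown in this post.

```python
import csv
import io

HEADERS = ['Sched', 'Estmd', 'From', 'To', 'TOC', 'Chgs', 'Train ID', 'date']

def rows_to_csv(rows_by_date):
    """Serialize {date: [row, ...]} into CSV text with a header line.

    Hypothetical helper: each row is a list of cell strings as returned
    by row_parser, and the date is appended as the last column.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(HEADERS)
    for date, rows in rows_by_date.items():
        for row in rows:
            # Keep only as many cells as there are data headers, which
            # drops the trailing 'View Details' link cell (an assumption).
            writer.writerow(row[:len(HEADERS) - 1] + [date])
    return buffer.getvalue()

sample = {
    '01/01/2023': [
        ['09:06', 'CANCELLED', 'OXF', 'PAD', 'GW', '2', 'C95818', 'View Details'],
        ['10:38', '10:39', 'GMV', 'PAD', 'GW', '4', 'C95821', 'View Details'],
    ],
}
csv_text = rows_to_csv(sample)
print(csv_text)
```

Using `csv.writer` rather than a plain `','.join(...)` also takes care of quoting if a cell ever contains a comma.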

The `contents` string built by the scraping code:

"Sched,Estmd,From,To,TOC,Chgs,Train ID,date\n['09:06\\nCANCELLED\\nOXF\\nPAD\\nGW\\n2\\nC95818\\nView Details\\n01/01/2023', '10:11\\nREMOVED\\nPAD\\nOXF\\nGW\\n1\\nC10298\\nView Details\\n01/01/2023', '10:37\\nCANCELLED\\nPAD\\nHFD\\nGW\\n3\\nC95866\\nView Details\\n01/01/2023', '10:38\\n10:39\\nGMV\\nPAD\\nGW\\n4\\nC95821\\nView Details\\n01/01/2023', '11:02\\nREMOVED\\nOXF\\nPAD\\nGW\\n1\\nC95822\\nView Details\\n01/01/2023', '11:11\\nREMOVED\\nPAD\\nOXF\\nGW\\n3\\nC10299\\nView Details\\n01/01/2023', '11:37\\nCANCELLED\\nPAD\\nGMV\\nGW\\n1\\nC95873\\nView Details\\n01/01/2023', '11:38\\n11:37\\nGMV\\nPAD\\nGW\\n3\\nC95823\\nView Details\\n01/01/2023', '12:02\\nREMOVED\\nOXF\\nPAD\\nGW\\n1\\nC95824\\nView Details\\n01/01/2023', '12:11\\nREMOVED\\nPAD\\nOXF\\nGW\\n1\\nC02269\\nView Details\\n01/01/2023', '12:38\\n12:37\\nGMV\\nPAD\\nGW\\n2\\nC95825\\nView Details\\n01/01/2023', '12:40\\nCANCELLED\\nPAD\\nHFD\\nGW\\n1\\nC95867\\nView Details\\n01/01/2023', '13:02\\nREMOVED\\nOXF\\nPAD