# TSL Scraping 

In this notebook, I scrape data on activities to do in Singapore from The Smart Local (TSL).

In [1]:
import pandas as pd 
import re
from scrapingbee import ScrapingBeeClient
from bs4 import BeautifulSoup
import requests

In [2]:
url = 'https://thesmartlocal.com/read/things-to-do-singapore/'
res = requests.get(url)

In [3]:
res.status_code

200

In [4]:
res.text

'<!doctype html>\n<html class="no-js" lang="en-GB">\n<head>\n<meta charset="UTF-8">\n<meta http-equiv="x-ua-compatible" content="ie=edge">\n<meta name="viewport" content="width=device-width, initial-scale=1">\n<meta name="p:domain_verify" content="774ac8a9cb7d4649843c57c40f5f68b0" />\n<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin />\n<link rel="preload" as="style" href="https://fonts.googleapis.com/css?family=Heebo:400,700&display=optional" />\n\n<link rel="alternate" href="https://thesmartlocal.com/" hreflang="x-default" />\n<link rel="alternate" href="https://thesmartlocal.com/" hreflang="en-sg" />\n<link rel="alternate" href="https://thesmartlocal.my/" hreflang="en-my" />\n<link rel="alternate" href="https://thesmartlocal.co.th/" hreflang="en-th" />\n<link rel="alternate" href="https://thesmartlocal.com/vietnam/" hreflang="en-vn" />\n<link rel="alternate" href="https://thesmartlocal.id/" hreflang="en-id" />\n<link rel="alternate" href="https://thesmartlocal.ph/

### Create a BeautifulSoup object

In [5]:
soup = BeautifulSoup(res.text, 'lxml') # parse the HTML with the lxml parser
soup

<!DOCTYPE html>
<html class="no-js" lang="en-GB">
<head>
<meta charset="utf-8"/>
<meta content="ie=edge" http-equiv="x-ua-compatible"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<meta content="774ac8a9cb7d4649843c57c40f5f68b0" name="p:domain_verify"/>
<link crossorigin="" href="https://fonts.gstatic.com" rel="preconnect"/>
<link as="style" href="https://fonts.googleapis.com/css?family=Heebo:400,700&amp;display=optional" rel="preload"/>
<link href="https://thesmartlocal.com/" hreflang="x-default" rel="alternate"/>
<link href="https://thesmartlocal.com/" hreflang="en-sg" rel="alternate"/>
<link href="https://thesmartlocal.my/" hreflang="en-my" rel="alternate"/>
<link href="https://thesmartlocal.co.th/" hreflang="en-th" rel="alternate"/>
<link href="https://thesmartlocal.com/vietnam/" hreflang="en-vn" rel="alternate"/>
<link href="https://thesmartlocal.id/" hreflang="en-id" rel="alternate"/>
<link href="https://thesmartlocal.ph/" hreflang="en-ph" rel="alternate

Scraping all the activity names

In [6]:
listing = soup.find_all('h4')
listing

[<h4>1. Explore Mandai Wildlife Reserve</h4>,
 <h4><b>2. See the S.E.A. Aquarium</b></h4>,
 <h4>3. Visit the Museum of Ice Cream</h4>,
 <h4>4. Swim at Wild Wild Wet</h4>,
 <h4>5. Roller skate at Hi-Roller</h4>,
 <h4>6. Play arcade games at Timezone Westgate</h4>,
 <h4><b>7. Frolic around at Airzone (temporarily closed till Oct 2024)</b></h4>,
 <h4>8. Learn new skills at NLB libraries</h4>,
 <h4>9. Discover new playgrounds in Singapore</h4>,
 <h4>10. Chill out at Snow City</h4>,
 <h4>11. Tour the Jacob Ballas Children’s Garden</h4>,
 <h4>12. Try fishing at Qian Hu Fish Farm</h4>,
 <h4><b>13. Play with felines at The Cat Cafe</b></h4>,
 <h4><b>14. Support the Kitten Sanctuary</b></h4>,
 <h4>15. Relax in Singapore’s newest hotels</h4>,
 <h4>16. Bungy jump at Skypark Sentosa by AJ Hackett</h4>,
 <h4><b>17. Ride the Skyline Luge</b></h4>,
 <h4><b>18. Spend a day at NERF Action Xperience (Closed)</b></h4>,
 <h4><b>19. Retreat to Adventure HQ</b></h4>,
 <h4>20. Go-kart at The Karting Arena</h

In [7]:
# Copy the scraped <h4> tags into a plain list
ls1 = list(listing)
ls1

[<h4>1. Explore Mandai Wildlife Reserve</h4>,
 <h4><b>2. See the S.E.A. Aquarium</b></h4>,
 <h4>3. Visit the Museum of Ice Cream</h4>,
 <h4>4. Swim at Wild Wild Wet</h4>,
 <h4>5. Roller skate at Hi-Roller</h4>,
 <h4>6. Play arcade games at Timezone Westgate</h4>,
 <h4><b>7. Frolic around at Airzone (temporarily closed till Oct 2024)</b></h4>,
 <h4>8. Learn new skills at NLB libraries</h4>,
 <h4>9. Discover new playgrounds in Singapore</h4>,
 <h4>10. Chill out at Snow City</h4>,
 <h4>11. Tour the Jacob Ballas Children’s Garden</h4>,
 <h4>12. Try fishing at Qian Hu Fish Farm</h4>,
 <h4><b>13. Play with felines at The Cat Cafe</b></h4>,
 <h4><b>14. Support the Kitten Sanctuary</b></h4>,
 <h4>15. Relax in Singapore’s newest hotels</h4>,
 <h4>16. Bungy jump at Skypark Sentosa by AJ Hackett</h4>,
 <h4><b>17. Ride the Skyline Luge</b></h4>,
 <h4><b>18. Spend a day at NERF Action Xperience (Closed)</b></h4>,
 <h4><b>19. Retreat to Adventure HQ</b></h4>,
 <h4>20. Go-kart at The Karting Arena</h

Removing the HTML tags from each name

In [8]:
def remove_html_tags(tag_obj):
    # get_text() returns the text inside the tag with all markup stripped
    return tag_obj.get_text()

cleaned_text_list = [remove_html_tags(tag_obj) for tag_obj in ls1]

cleaned_text_list

['1. Explore Mandai Wildlife Reserve',
 '2. See the S.E.A. Aquarium',
 '3. Visit the Museum of Ice Cream',
 '4. Swim at Wild Wild Wet',
 '5. Roller skate at Hi-Roller',
 '6. Play arcade games at Timezone Westgate',
 '7. Frolic around at Airzone (temporarily closed till Oct 2024)',
 '8. Learn new skills at NLB libraries',
 '9. Discover new playgrounds in Singapore',
 '10. Chill out at Snow City',
 '11. Tour the Jacob Ballas Children’s Garden',
 '12. Try fishing at\xa0Qian Hu Fish Farm',
 '13. Play with felines at The Cat Cafe',
 '14. Support the Kitten Sanctuary',
 '15. Relax in Singapore’s newest hotels',
 '16. Bungy jump at Skypark Sentosa by AJ Hackett',
 '17. Ride the Skyline Luge',
 '18. Spend a day at NERF Action Xperience (Closed)',
 '19. Retreat to Adventure HQ',
 '20. Go-kart at The Karting Arena',
 '21.\xa0 Cable-ski at Singapore Wake Park',
 '22. Throw axes at Axe Factor',
 '23. Golf virtually at Five Iron Golf',
 '24. Skydive indoors at iFly Singapore',
 '25. Jump around in 
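
Some of the names above still contain non-breaking spaces (`\xa0`) and all of them carry their list number. The following is a minimal sketch of one way to tidy them, using only the standard library; `normalize_name` is a hypothetical helper and not part of the original notebook.

```python
import re

def normalize_name(raw):
    """Split a heading such as '21.\xa0 Cable-ski at ...' into (number, title)."""
    # Replace non-breaking spaces, then collapse runs of whitespace.
    text = re.sub(r"\s+", " ", raw.replace("\xa0", " ")).strip()
    match = re.match(r"(\d+)\.\s*(.+)", text)
    if match is None:
        # Not a numbered activity heading; return the cleaned text unchanged.
        return None, text
    return int(match.group(1)), match.group(2)

print(normalize_name("21.\xa0 Cable-ski at Singapore Wake Park"))
print(normalize_name("12. Try fishing at\xa0Qian Hu Fish Farm"))
```

Keeping the number separate from the title would also make it easier to sort or join on the activity later.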

Removing the last heading, which is not an activity

In [25]:
cleaned_text_list.pop(-1)

'Things to do in Singapore'
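
Removing by position works only while the stray heading stays last on the page. As an alternative sketch, assuming every activity heading starts with a number followed by a full stop (as in the output above), the list could be filtered by pattern instead. `keep_activities` is a hypothetical helper name.

```python
import re

# Activity headings look like '1. Explore ...'; other headings do not.
NUMBERED = re.compile(r"^\d+\.")

def keep_activities(headings):
    """Return only the headings that start with a list number."""
    return [h for h in headings if NUMBERED.match(h)]

sample = [
    "1. Explore Mandai Wildlife Reserve",
    "2. See the S.E.A. Aquarium",
    "Things to do in Singapore",
]
print(keep_activities(sample))
```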

In [26]:
cleaned_text_list

['1. Explore Mandai Wildlife Reserve',
 '2. See the S.E.A. Aquarium',
 '3. Visit the Museum of Ice Cream',
 '4. Swim at Wild Wild Wet',
 '5. Roller skate at Hi-Roller',
 '6. Play arcade games at Timezone Westgate',
 '7. Frolic around at Airzone (temporarily closed till Oct 2024)',
 '8. Learn new skills at NLB libraries',
 '9. Discover new playgrounds in Singapore',
 '10. Chill out at Snow City',
 '11. Tour the Jacob Ballas Children’s Garden',
 '12. Try fishing at\xa0Qian Hu Fish Farm',
 '13. Play with felines at The Cat Cafe',
 '14. Support the Kitten Sanctuary',
 '15. Relax in Singapore’s newest hotels',
 '16. Bungy jump at Skypark Sentosa by AJ Hackett',
 '17. Ride the Skyline Luge',
 '18. Spend a day at NERF Action Xperience (Closed)',
 '19. Retreat to Adventure HQ',
 '20. Go-kart at The Karting Arena',
 '21.\xa0 Cable-ski at Singapore Wake Park',
 '22. Throw axes at Axe Factor',
 '23. Golf virtually at Five Iron Golf',
 '24. Skydive indoors at iFly Singapore',
 '25. Jump around in 

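Scraping the details of each activity. For each `<h4>` heading, the closest preceding `<p>` tag is collected, so the details of an activity are picked up at the heading that follows it.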

In [None]:
h4_tags = soup.find_all('h4')
ls2 = []
for h4_tag in h4_tags:
    # Get the closest <p> tag that precedes this <h4> tag
    closest_p_tag = h4_tag.find_previous('p')
    # Skip headings with no preceding <p> and paragraphs already collected
    if closest_p_tag is not None and closest_p_tag not in ls2:
        ls2.append(closest_p_tag)
ls2

In [10]:
len(ls2)

131

In [11]:
ls2

[<p>From seeing wildlife animals and birds at the zoo to roller skating and arcade games, there are plenty of fun activities in Singapore to do with the fam.</p>,
 <p><strong><span style="color: #d47978">Contact:</span> </strong>6269 3411<br/>
 <strong><span style="color: #d47978">Address:</span></strong> 80 Mandai Lake Road, Singapore 729826<br/>
 <strong><span style="color: #d47978">Opening hours:</span></strong><br/>
 <strong>Singapore Zoo:</strong> 8.30am-6pm, Daily (Last entry at 5pm)<br/>
 <strong>River Wonders:</strong> 10am-7pm, Daily (Last entry at 6pm)<br/>
 <strong>Night Safari:</strong> 7.15pm-12am, Daily (Last entry at 11.15pm)<br/>
 <strong>Bird Paradise:</strong> 9am-6pm, Daily (Last entry at 5pm)</p>,
 <p><span style="color: #d47978"><strong>Address:</strong> </span>8 Sentosa Gateway, Sentosa Island, Singapore 098269<br/>
 <strong><span style="color: #d47978">Opening hours:</span> </strong>10am-5pm, Daily</p>,
 <p><span style="color: #d47978"><b>Address: </b></span><spa

In [12]:

# Remove HTML tags from each bs4.element.Tag object in the list
cleaned_text_list1 = [remove_html_tags(tag_obj) for tag_obj in ls2]

cleaned_text_list1

['From seeing wildlife animals and birds at the zoo to roller skating and arcade games, there are plenty of fun activities in Singapore to do with the fam.',
 'Contact: 6269 3411\nAddress: 80 Mandai Lake Road, Singapore 729826\nOpening hours:\nSingapore Zoo: 8.30am-6pm, Daily (Last entry at 5pm)\nRiver Wonders: 10am-7pm, Daily (Last entry at 6pm)\nNight Safari: 7.15pm-12am, Daily (Last entry at 11.15pm)\nBird Paradise: 9am-6pm, Daily (Last entry at 5pm)',
 'Address: 8 Sentosa Gateway, Sentosa Island, Singapore 098269\nOpening hours: 10am-5pm, Daily',
 'Address: 100 Loewen Road, Singapore 248837\nOpening hours: Mon & Wed 10am-6pm | Thu-Sun 10am-9pm (Closed on Tuesdays)\n',
 'Address: 1 Pasir Ris Close, Downtown East, Singapore 519599\nOpening hours: Mon & Wed-Fri 12pm-6pm | Sat-Sun 11am-6pm (Closed on Tuesdays)\nContact: 6581 9128',
 'Address: 1 Pasir Ris Close, E!Hub, Market Square @ Downtown East, #05-103, Singapore 519599\nOpening hours: Mon-Thu 11am-6.30pm | Fri 12pm-8pm | Sat-Sun 1

In [13]:
len(cleaned_text_list1)

131

In [27]:
cleaned_text_list1.pop(0)

'From seeing wildlife animals and birds at the zoo to roller skating and arcade games, there are plenty of fun activities in Singapore to do with the fam.'
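
The two `pop` calls shift the lists against each other: the paragraph found before heading *n* describes activity *n - 1*, so the intro paragraph is dropped from the front of the details and the stray heading from the end of the names. The sketch below shows the same shift on toy data, with a length check added because the de-duplication in the scraping loop could, in principle, shorten the details list and misalign the pairs. `pair_headings_with_details` and the sample values are illustrative only.

```python
def pair_headings_with_details(headings, paragraphs):
    """Pair each activity heading with the paragraph found before the NEXT heading."""
    if len(headings) != len(paragraphs):
        raise ValueError(
            f"expected one paragraph per heading, got "
            f"{len(headings)} headings and {len(paragraphs)} paragraphs"
        )
    names = headings[:-1]     # drop the trailing non-activity heading
    details = paragraphs[1:]  # drop the intro paragraph before the first heading
    return list(zip(names, details))

headings = ["1. First activity", "2. Second activity", "Things to do in Singapore"]
paragraphs = ["Intro text", "Address: A", "Address: B"]
pairs = pair_headings_with_details(headings, paragraphs)
print(pairs)
```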

Iterate through both lists in parallel and create a dictionary for each activity, with the activity name as the key and its details as the value

In [28]:
activity_dicts = []

for name, details in zip(cleaned_text_list, cleaned_text_list1):
    # Each detail line looks like 'Label: value'; lines without ': ' are skipped
    detail_lines = details.split('\n')
    detail_dict = {}
    for line in detail_lines:
        if ': ' in line:
            key, value = line.split(': ', 1)
            detail_dict[key] = value

    activity_dict = {name: detail_dict}
    activity_dicts.append(activity_dict)

for activity_dict in activity_dicts:
    print(activity_dict)

{'1. Explore Mandai Wildlife Reserve': {'Contact': '6269 3411', 'Address': '80 Mandai Lake Road, Singapore 729826', 'Singapore Zoo': '8.30am-6pm, Daily (Last entry at 5pm)', 'River Wonders': '10am-7pm, Daily (Last entry at 6pm)', 'Night Safari': '7.15pm-12am, Daily (Last entry at 11.15pm)', 'Bird Paradise': '9am-6pm, Daily (Last entry at 5pm)'}}
{'2. See the S.E.A. Aquarium': {'Address': '8 Sentosa Gateway, Sentosa Island, Singapore 098269', 'Opening hours': '10am-5pm, Daily'}}
{'3. Visit the Museum of Ice Cream': {'Address': '100 Loewen Road, Singapore 248837', 'Opening hours': 'Mon & Wed 10am-6pm | Thu-Sun 10am-9pm (Closed on Tuesdays)'}}
{'4. Swim at Wild Wild Wet': {'Address': '1 Pasir Ris Close, Downtown East, Singapore 519599', 'Opening hours': 'Mon & Wed-Fri 12pm-6pm | Sat-Sun 11am-6pm (Closed on Tuesdays)', 'Contact': '6581 9128'}}
{'5. Roller skate at Hi-Roller': {'Address': '1 Pasir Ris Close, E!Hub, Market Square @ Downtown East, #05-103, Singapore 519599', 'Opening hours': 
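
In the first entry, the bare `Opening hours:` line has no value of its own, so it is skipped and the per-venue hours become top-level keys such as `'Singapore Zoo'`. One possible way to keep them together is sketched below. It assumes that every `Label: value` line after a bare `Label:` line belongs to that label, which holds for the Mandai entry but may not hold for every paragraph on the page. `parse_details` is a hypothetical helper.

```python
def parse_details(text):
    """Parse 'Label: value' lines, nesting entries under a bare 'Label:' line."""
    result = {}
    current = None  # name of the bare label currently collecting sub-entries
    for line in text.split("\n"):
        line = line.strip()
        if not line:
            continue
        if line.endswith(":"):
            # A label with no value on the same line starts a nested group.
            current = line[:-1]
            result[current] = {}
        elif ": " in line:
            key, value = line.split(": ", 1)
            if current is not None:
                result[current][key] = value
            else:
                result[key] = value
    return result

sample = (
    "Contact: 6269 3411\n"
    "Address: 80 Mandai Lake Road, Singapore 729826\n"
    "Opening hours:\n"
    "Singapore Zoo: 8.30am-6pm, Daily (Last entry at 5pm)\n"
    "River Wonders: 10am-7pm, Daily (Last entry at 6pm)"
)
parsed = parse_details(sample)
print(parsed["Opening hours"])
```

With this shape, `'Opening hours'` is present for the entry, although its value is a dictionary rather than a string and would need flattening before going into a single dataframe column.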

In [29]:
activity_dicts

[{'1. Explore Mandai Wildlife Reserve': {'Contact': '6269 3411',
   'Address': '80 Mandai Lake Road, Singapore 729826',
   'Singapore Zoo': '8.30am-6pm, Daily (Last entry at 5pm)',
   'River Wonders': '10am-7pm, Daily (Last entry at 6pm)',
   'Night Safari': '7.15pm-12am, Daily (Last entry at 11.15pm)',
   'Bird Paradise': '9am-6pm, Daily (Last entry at 5pm)'}},
 {'2. See the S.E.A. Aquarium': {'Address': '8 Sentosa Gateway, Sentosa Island, Singapore 098269',
   'Opening hours': '10am-5pm, Daily'}},
 {'3. Visit the Museum of Ice Cream': {'Address': '100 Loewen Road, Singapore 248837',
   'Opening hours': 'Mon & Wed 10am-6pm | Thu-Sun 10am-9pm (Closed on Tuesdays)'}},
 {'4. Swim at Wild Wild Wet': {'Address': '1 Pasir Ris Close, Downtown East, Singapore 519599',
   'Opening hours': 'Mon & Wed-Fri 12pm-6pm | Sat-Sun 11am-6pm (Closed on Tuesdays)',
   'Contact': '6581 9128'}},
 {'5. Roller skate at Hi-Roller': {'Address': '1 Pasir Ris Close, E!Hub, Market Square @ Downtown East, #05-103, 

Remove the first item, as it duplicates an activity in another scraped activity dataframe

In [30]:
activity_dicts.pop(0)

{'1. Explore Mandai Wildlife Reserve': {'Contact': '6269 3411',
  'Address': '80 Mandai Lake Road, Singapore 729826',
  'Singapore Zoo': '8.30am-6pm, Daily (Last entry at 5pm)',
  'River Wonders': '10am-7pm, Daily (Last entry at 6pm)',
  'Night Safari': '7.15pm-12am, Daily (Last entry at 11.15pm)',
  'Bird Paradise': '9am-6pm, Daily (Last entry at 5pm)'}}

Iterate through the list of dictionaries, extract each activity's name and details, and collect them as rows of a new dataframe

In [31]:
rows = []

for item in activity_dicts:
    # Each dictionary holds a single {activity name: details} pair
    activity, details = next(iter(item.items()))

    # Fall back to 'not available' when a field was not scraped
    row = {'Activity': activity}
    row['Contact'] = details.get('Contact', 'not available')
    row['Opening hours'] = details.get('Opening hours', 'not available')
    row['Address'] = details.get('Address', 'not available')
    rows.append(row)

df = pd.DataFrame(rows)
df.set_index('Activity', inplace=True)

df.head()

| Activity | Contact | Opening hours | Address |
|---|---|---|---|
| 2. See the S.E.A. Aquarium | not available | 10am-5pm, Daily | 8 Sentosa Gateway, Sentosa Island, Singapore 0... |
| 3. Visit the Museum of Ice Cream | not available | Mon & Wed 10am-6pm \| Thu-Sun 10am-9pm (Closed ... | 100 Loewen Road, Singapore 248837 |
| 4. Swim at Wild Wild Wet | 6581 9128 | Mon & Wed-Fri 12pm-6pm \| Sat-Sun 11am-6pm (Clo... | 1 Pasir Ris Close, Downtown East, Singapore 51... |
| 5. Roller skate at Hi-Roller | 9694 4094 | Mon-Thu 11am-6.30pm \| Fri 12pm-8pm \| Sat-Sun 1... | 1 Pasir Ris Close, E!Hub, Market Square @ Down... |
| 6. Play arcade games at Timezone Westgate | 6265 1132 | Mon-Thu 11am-10pm \| Fri 11am-11pm \| Sat 10am-1... | 3 Gateway Drive, Westgate #B1-45, Singapore 60... |
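
The row-building step can also be written as a small helper, which makes the default for missing fields explicit. This is a sketch only: `to_row` and `FIELDS` are hypothetical names, and the `missing` parameter is an assumption added so that `None` can be passed instead of the placeholder string if real missing values are preferred.

```python
FIELDS = ("Contact", "Opening hours", "Address")

def to_row(activity_dict, missing="not available"):
    """Flatten a single {name: details} dictionary into one flat row."""
    # Unpacking fails loudly unless the dictionary holds exactly one pair.
    (name, details), = activity_dict.items()
    row = {"Activity": name}
    for field in FIELDS:
        row[field] = details.get(field, missing)
    return row

sample = {
    "2. See the S.E.A. Aquarium": {
        "Address": "8 Sentosa Gateway, Sentosa Island, Singapore 098269",
        "Opening hours": "10am-5pm, Daily",
    }
}
row = to_row(sample)
print(row)
```

A list of such rows can be passed to `pd.DataFrame` in the same way as `rows` in the cell above.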


Exporting scraped dataset

In [32]:
df.to_csv('../datasets/activity.csv')

In [33]:
df.tail(10)

| Activity | Contact | Opening hours | Address |
|---|---|---|---|
| 121. Go shopping in Little India | not available | not available | not available |
| 122. Read at Woods in the Books | 6222 9980 | Wed-Sat 10am-7pm \| Sun-Mon 10am-6pm (Closed on... | 3 Yong Siak Street, Singapore 168642 |
| 123. Go thrifting at Turf Club Road’s Antique Row | not available | not available | not available |
| 124. Shop around at Cat Socrates | 6348 0863 | Mon 11am-6pm \| Tue-Sat 11am-8pm \| Sun & PH 11a... | 448 Joo Chiat Road, Singapore 427661 |
| 125. Revisit People’s Park Centre | 6535 9177 | not available | 101 Upper Cross Street, Singapore 058357 |
| 126. Discover T for Toys$1 | 9021 7376 | Mon-Fri 12pm-6pm \| Sat-Sun 12pm-7pm | 18 Tampines Industrial Crescent, Space @ Tampi... |
| 127. Browse through comic book stores | not available | not available | not available |
| 128. Shop at Don Don Donki | not available | not available | not available |
| 129. Stroll through Bugis Street | 6338 9513 | 10am-10pm, Daily | not available |
| 130. Wine, dine, and shop at Dempsey Hill | not available | not available | Dempsey Road, Singapore 249679 |
