# SecretSingapore Scraping

This notebook scrapes SecretSingapore's list of things to do in Singapore, collecting the name and location of each activity.

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

In [2]:
url = 'https://secretsingapore.co/things-to-do-singapore/'
res = requests.get(url)

In [3]:
res.status_code

200
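
Checking `res.status_code` by eye works in a notebook, but the same check can be made in code so that a failed request stops the run. The following is a minimal sketch only: the helper name `fetch_html`, the 10-second timeout, and the injectable `get` parameter are assumptions for illustration and are not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10, get=requests.get):
    """Fetch a page and return its HTML, failing loudly on an HTTP error.

    `fetch_html`, `timeout=10` and the `get` parameter are hypothetical;
    `get` exists only so the function can be exercised without network access.
    """
    res = get(url, timeout=timeout)
    res.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return res.text
```

With this helper, `fetch_html(url)` would stand in for the request and the manual status check.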

## Create a Beautiful Soup object

In [4]:
soup = BeautifulSoup(res.text, 'lxml')  # parse the HTML with the lxml parser
soup

<!DOCTYPE html>
<html lang="en-GB">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, user-scalable=no, initial-scale=1.0, maximum-scale=1.0, minimum-scale=1.0" name="viewport"/>
<meta content="ie=edge" http-equiv="X-UA-Compatible"/>
<meta content="index, follow, max-image-preview:large, max-snippet:-1, max-video-preview:-1" name="robots"/>
<!-- This site is optimized with the Yoast SEO plugin v22.1 - https://yoast.com/wordpress/plugins/seo/ -->
<title>60+ Best Things To Do In Singapore This Year</title>
<meta content="We've rounded up the best attractions and entertainment experiences in town, providing you the best things to do in Singapore." name="description"/>
<link href="https://secretsingapore.co/things-to-do-singapore/" rel="canonical"/>
<meta content="en_GB" property="og:locale"/>
<meta content="article" property="og:type"/>
<meta content="60 Incredible Things To Do In Singapore At Least Once In Your Life" property="og:title"/>
<meta content="We've rounded up t

In [5]:
listing = soup.find('h2')
listing.text

'1. See the world’s largest public display of Southeast Asian art'

In [6]:
location = soup.find('em')
location.text

'Civic District, Singapore 178957'

Collecting all the activity headings (`<h2>` tags) into a list

In [7]:
act_list = []
listings = soup.find_all('h2')
for listing in listings:
    act_list.append(listing)
act_list

[<h2>1. See the world’s largest public display of Southeast Asian art</h2>,
 <h2>2. Race at Singapore’s only night luge ride</h2>,
 <h2>3. Stroll around Gardens By The Bay</h2>,
 <h2>4. Indulge on famous Singapore Chilli Crab</h2>,
 <h2>5. Battle it out on Asia’s first gamified electric go-karts</h2>,
 <h2>6. Enjoy 360-degree city views</h2>,
 <h2>7. Zoom down Singapore’s longest indoor waterslide</h2>,
 <h2>8. See the underwater world</h2>,
 <h2>9. Play at Asia’s first ski, surf, skate &amp; ski resort</h2>,
 <h2>10. Be mesmerised by a Candlelight concert</h2>,
 <h2>11. Explore Singapore on one of the longest trails</h2>,
 <h2>12. Wander around Chinatown</h2>,
 <h2>13. Get lost in the ArtScience Museum</h2>,
 <h2>14. Be mesmerised by nocturnal creatures</h2>,
 <h2>15. Book an infinity pool experience</h2>,
 <h2>16. Be entertained by a first-rated colourful night show</h2>,
 <h2>17. Hop on Singapore Cable Car</h2>,
 <h2>18. Let the kids run free at Chaos Lab</h2>,
 <h2>19. Ch

Removing all unnecessary tags to leave us with just the activity name

In [8]:
def remove_html_tags(tag_obj):
    return tag_obj.get_text()
cleaned_act_list = [remove_html_tags(tag_obj) for tag_obj in act_list]

cleaned_act_list

['1. See the world’s largest public display of Southeast Asian art',
 '2. Race at Singapore’s only night luge ride',
 '3. Stroll around Gardens By The Bay',
 '4. Indulge on famous Singapore Chilli Crab',
 '5. Battle it out on Asia’s first gamified electric go-karts',
 '6. Enjoy 360-degree city views',
 '7. Zoom down Singapore’s longest indoor waterslide',
 '8. See the underwater world',
 '9. Play at Asia’s first ski, surf, skate & ski resort',
 '10. Be mesmerised by a Candlelight concert',
 '11. Explore Singapore on one of the longest trails',
 '12. Wander around Chinatown',
 '13. Get lost in the ArtScience Museum',
 '14. Be mesmerised by nocturnal creatures',
 '15. Book an infinity pool experience',
 '16. Be entertained by a first-rated colourful night show',
 '17. Hop on Singapore Cable Car',
 '18. Let the kids run free at Chaos Lab',
 '19. Check out the coolest exhibitions',
 '20. Try some escape rooms',
 '21. Explore Jewel Changi Airport',
 '22. Enjoy the city’s best ro
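
The collect-then-clean steps above can also be folded into a single comprehension. This is a sketch on a made-up HTML snippet, not the real page; the names `sample_html`, `sample_soup` and `activities` are hypothetical, and `html.parser` is used only so the example runs without `lxml` installed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page.
sample_html = """
<h2>1. Stroll around Gardens By The Bay</h2>
<p>Some description.</p>
<h2> 2. Wander around Chinatown </h2>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# get_text(strip=True) returns the tag's text with surrounding whitespace removed.
activities = [h2.get_text(strip=True) for h2 in sample_soup.find_all("h2")]
print(activities)
```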

Removing the entry at index 47, as it duplicates an activity already present in another scraped dataset

In [9]:
removed_element = cleaned_act_list.pop(47)  
print(cleaned_act_list)  
print(removed_element)


['1. See the world’s largest public display of Southeast Asian art', '2. Race at Singapore’s only night luge ride', '3. Stroll around Gardens By The Bay', '4. Indulge on famous Singapore Chilli Crab', '5. Battle it out on Asia’s first gamified electric go-karts', '6. Enjoy 360-degree city views', '7. Zoom down Singapore’s longest indoor waterslide', '8. See the underwater world', '9. Play at Asia’s first ski, surf, skate & ski resort', '10. Be mesmerised by a Candlelight concert', '11. Explore Singapore on one of the longest trails', '12. Wander around Chinatown', '13. Get lost in the ArtScience Museum', '14. Be mesmerised by nocturnal creatures', '15. Book an infinity pool experience', '16. Be entertained by a first-rated colourful night show', '17. Hop on Singapore Cable Car', '18. Let the kids run free at Chaos Lab', '19. Check out the coolest exhibitions', '20. Try some escape rooms', '21. Explore Jewel Changi Airport', '22. Enjoy the city’s best rooftop dining', '23. P
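
Popping by position relies on the page keeping its current order. As a sketch of an alternative, an entry can be dropped by matching its text. The titles and the `unwanted` string below are placeholders and are not the entry actually removed above.

```python
# Placeholder data; the real list comes from the scraped <h2> tags.
titles = [
    "1. Wander around Chinatown",
    "2. Try some escape rooms",
    "3. Hop on Singapore Cable Car",
]
unwanted = "Try some escape rooms"  # hypothetical duplicate to drop

# Split the list into the entries to keep and the entries that match.
kept = [t for t in titles if unwanted not in t]
removed = [t for t in titles if unwanted in t]
print(kept)
print(removed)
```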

Removing the empty last element `''`

In [10]:
removed_element1 = cleaned_act_list.pop(-1) 
print(cleaned_act_list)  
print(removed_element1)

['1. See the world’s largest public display of Southeast Asian art', '2. Race at Singapore’s only night luge ride', '3. Stroll around Gardens By The Bay', '4. Indulge on famous Singapore Chilli Crab', '5. Battle it out on Asia’s first gamified electric go-karts', '6. Enjoy 360-degree city views', '7. Zoom down Singapore’s longest indoor waterslide', '8. See the underwater world', '9. Play at Asia’s first ski, surf, skate & ski resort', '10. Be mesmerised by a Candlelight concert', '11. Explore Singapore on one of the longest trails', '12. Wander around Chinatown', '13. Get lost in the ArtScience Museum', '14. Be mesmerised by nocturnal creatures', '15. Book an infinity pool experience', '16. Be entertained by a first-rated colourful night show', '17. Hop on Singapore Cable Car', '18. Let the kids run free at Chaos Lab', '19. Check out the coolest exhibitions', '20. Try some escape rooms', '21. Explore Jewel Changi Airport', '22. Enjoy the city’s best rooftop dining', '23. P
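
Empty headings can also be filtered out wherever they occur, rather than assuming they sit at the end of the list. A small sketch with made-up values (`headings` and `non_empty` are hypothetical names):

```python
# Made-up values standing in for the scraped heading text.
headings = ["1. Wander around Chinatown", "", "2. Try some escape rooms", "   "]

# str.strip() returns '' for blank strings, which is falsy, so they are skipped.
non_empty = [h for h in headings if h.strip()]
print(non_empty)
```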

In [11]:
len(cleaned_act_list)

59

Scraping the location data from the `<em>` tags

In [12]:
loc_list = []
locations = soup.find_all('em')
for item in locations:
    loc_list.append(item)
loc_list

[<em>Civic District, Singapore 178957</em>,
 <em>1 Imbiah Rd, Singapore 099692</em>,
 <em>18 Marina Gardens Dr, Singapore 018953</em>,
 <em> Various Locations </em>,
 <em>54 Palawan Beach Walk, Singapore 098233</em>,
 <em>10 Bayfront Ave, Singapore 018956</em>,
 <em>HomeTeamNS Bedok Reservoir, 900 Bedok North Rd, Singapore</em>,
 <em>RW Sentosa, Singapore </em>,
 <em>Orchard Road, Singapore</em>,
 <em>Various Locations</em>,
 <em>Various Locations</em>,
 <em> Chinatown, Singapore</em>,
 <em>FutureWorld</em>,
 <em> 6 Bayfront Ave, Singapore 018974</em>,
 <em>80 Mandai Lake Rd, Singapore 729826</em>,
 <em>Various Locations </em>,
 <em>50 Beach View, Singapore 09860</em>,
 <em>Various Locations</em>,
 <em>Terminal 2, Changi Airport, Singapore </em>,
 <em>Various Locations</em>,
 <em> Various Locations</em>,
 <em>78 Airport Blvd, Singapore 819666</em>,
 <em>Various Locations</em>,
 <em>36 Siloso Beach Walk, #01-01, Sentosa Island, Singapore</em>,
 <em>Various Locations</em>,
 <em>21 Ju

Removing all unnecessary tags

In [13]:
# remove_html_tags is already defined in cell 8, so it is reused here
cleaned_loc_list = [remove_html_tags(tag_obj) for tag_obj in loc_list]

cleaned_loc_list

['Civic District, Singapore 178957',
 '1 Imbiah Rd, Singapore 099692',
 '18 Marina Gardens Dr, Singapore 018953',
 ' Various Locations\xa0',
 '54 Palawan Beach Walk, Singapore 098233',
 '10 Bayfront Ave, Singapore 018956',
 'HomeTeamNS Bedok Reservoir, 900 Bedok North Rd, Singapore',
 'RW Sentosa, Singapore\xa0',
 'Orchard Road, Singapore',
 'Various Locations',
 'Various Locations',
 ' Chinatown, Singapore',
 'FutureWorld',
 ' 6 Bayfront Ave, Singapore 018974',
 '80 Mandai Lake Rd, Singapore 729826',
 'Various Locations\xa0',
 '50 Beach View, Singapore 09860',
 'Various Locations',
 'Terminal 2, Changi Airport, Singapore\xa0',
 'Various Locations',
 ' Various Locations',
 '78 Airport Blvd, Singapore 819666',
 'Various Locations',
 '36 Siloso Beach Walk, #01-01, Sentosa Island, Singapore',
 'Various Locations',
 '21 Jurong Town Hall Rd, Singapore 609433',
 '8 Sentosa Gateway, Singapore 098269',
 '262 Pasir Panjang Rd, Singapore 118628',
 '13 Dempsey Road, #01-03/04, Singapore 249674',
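
Several of the strings above still carry leading spaces or a trailing non-breaking space (`\xa0`). A minimal sketch of tidying them, assuming a hypothetical helper `normalise_location` and a few values copied from the output:

```python
# A few values copied from the output above.
raw_locations = [
    " Various Locations\xa0",
    "RW Sentosa, Singapore\xa0",
    " Chinatown, Singapore",
]


def normalise_location(text):
    """Replace non-breaking spaces with plain spaces and trim both ends."""
    return text.replace("\xa0", " ").strip()


tidy_locations = [normalise_location(t) for t in raw_locations]
print(tidy_locations)
```

Normalising first means variants such as `' Various Locations\xa0'` and `'Various Locations'` compare equal later on.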


In [14]:
len(cleaned_loc_list)

59
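
Both lists have 59 elements, which matters because they are about to be paired row by row. The length comparison can be made explicit, as in this sketch; `pair_up` is a hypothetical helper and the sample values are illustrative.

```python
def pair_up(activities, locations):
    """Pair each activity with its location, refusing mismatched lengths."""
    if len(activities) != len(locations):
        raise ValueError(
            f"{len(activities)} activities but {len(locations)} locations"
        )
    return list(zip(activities, locations))


pairs = pair_up(["Wander around Chinatown"], ["Chinatown, Singapore"])
print(pairs)
```

Equal lengths do not by themselves prove that each location belongs to the activity on the same row, so a spot check of a few pairs is still worthwhile.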

Combining the activity list and the location list into a dictionary of columns, then building a DataFrame from it with one row per activity

In [15]:
data = {
    'activity': cleaned_act_list,
    'location': cleaned_loc_list
}

df = pd.DataFrame(data)

df.head()

Unnamed: 0,activity,location
0,1. See the world’s largest public display of S...,"Civic District, Singapore 178957"
1,2. Race at Singapore’s only night luge ride,"1 Imbiah Rd, Singapore 099692"
2,3. Stroll around Gardens By The Bay,"18 Marina Gardens Dr, Singapore 018953"
3,4. Indulge on famous Singapore Chilli Crab,Various Locations
4,5. Battle it out on Asia’s first gamified elec...,"54 Palawan Beach Walk, Singapore 098233"


In [16]:
df['activity'] = df['activity'].str.replace(r'^\d+\.\s', '', regex=True)

df.head()

Unnamed: 0,activity,location
0,See the world’s largest public display of Sout...,"Civic District, Singapore 178957"
1,Race at Singapore’s only night luge ride,"1 Imbiah Rd, Singapore 099692"
2,Stroll around Gardens By The Bay,"18 Marina Gardens Dr, Singapore 018953"
3,Indulge on famous Singapore Chilli Crab,Various Locations
4,Battle it out on Asia’s first gamified electri...,"54 Palawan Beach Walk, Singapore 098233"
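
The pattern `^\d+\.\s` matches one or more digits, a literal dot, and one whitespace character, but only at the start of the string. A sketch of the same pattern applied with the standard `re` module to a few sample titles (the variable names are illustrative):

```python
import re

numbered = [
    "1. Stroll around Gardens By The Bay",
    "10. Be mesmerised by a Candlelight concert",
    "6. Enjoy 360-degree city views",
]

# Same pattern as in the pandas call: leading digits, a dot, one whitespace.
prefix = re.compile(r"^\d+\.\s")
stripped = [prefix.sub("", title) for title in numbered]
print(stripped)
```

Because of the `^` anchor, digits inside a title such as `360-degree` are left alone.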


Exporting the scraped data to CSV

In [17]:
df.to_csv('../datasets/ss.csv')
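
By default `to_csv` also writes the row index as an unnamed first column, which appears as `Unnamed: 0` when the file is read back. The sketch below shows the difference `index=False` makes. It uses an in-memory buffer and a made-up one-row frame instead of the real file, and all variable names are illustrative.

```python
import io

import pandas as pd

sample_df = pd.DataFrame({
    "activity": ["Stroll around Gardens By The Bay"],
    "location": ["18 Marina Gardens Dr, Singapore 018953"],
})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
sample_df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False writes only the data columns.
without_index = io.StringIO()
sample_df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```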