<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

# Lab 8.2: Web Scraping

INSTRUCTIONS:

- Run the cells
- Observe and understand the results
- Answer the questions

# Web Scraping in Python (using BeautifulSoup)

## Scraping Rules
1. **Always** check a website’s **Terms and Conditions** before you scrape it. Be careful to read the statements about legal use of data. Usually, the retrieved data should not be used for commercial purposes.
2. **Do not** request data from the website too aggressively with a program (also known as spamming), as this may break the website. Make sure the program behaves in a reasonable manner (i.e. acts like a human). One request for one webpage per second is good practice.
3. The layout of a website may change from time to time, so revisit the site and update the code as needed.
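
Rule 2 can be enforced in code instead of being left to discipline. The following is a minimal sketch, assuming a hypothetical `Throttle` helper (not part of this lab) that sleeps so that consecutive requests are at least `min_interval` seconds apart:

```python
import time


class Throttle:
    """Hypothetical helper: enforce a minimum gap between consecutive requests."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval  # seconds between two requests
        self._last_call = None            # monotonic time of the previous call

    def wait(self):
        """Sleep just long enough to respect min_interval, then record the time."""
        if self._last_call is not None:
            remaining = self.min_interval - (time.monotonic() - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()


throttle = Throttle(min_interval=1.0)
# for url in urls:                    # `urls` is a placeholder list of pages
#     throttle.wait()                 # blocks until one second has passed
#     r = http.request('GET', url)    # `http` as created later in this lab
```

The first call to `wait()` returns immediately; only later calls may sleep.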

## Find a Page
Visit the [Fandom](http://fandom.wikia.com) website, find a wiki that interests you, and pick a page to work with.

Open the web page in the browser and inspect it.

Hover the cursor over the text and follow the shaded box surrounding the main text.

From the result, note which HTML tags enclose the main text; it usually sits a few levels deep.

![image.png](attachment:image.png)

In [1]:
## Import Libraries
import re  # the standard library module is sufficient for the patterns used here

from urllib.parse import unquote
import urllib3
from bs4 import BeautifulSoup

import warnings
warnings.filterwarnings('ignore')

### Define the content to retrieve (webpage's URL)

In [2]:
# specify the url
quote_page = 'https://bigbangtheory.fandom.com/wiki/Barry_Kripke'

### Retrieve the page
- Requires an Internet connection

In [3]:
# query the website and store the returned HTML in the variable `page`
http = urllib3.PoolManager()
r = http.request('GET', quote_page)  # 'GET' is one of several HTTP methods; it retrieves a resource
if r.status == 200:
    page = r.data
    print('Type of the variable \'page\':', page.__class__.__name__)
    print('Page Retrieved. Request Status: %d, Page Size: %d' % (r.status, len(page)))
else:
    print('Some problem occurred. Request Status: %s' % r.status)

Type of the variable 'page': bytes
Page Retrieved. Request Status: 200, Page Size: 517885
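
The same retrieval logic can be wrapped in a function so it is reusable for several pages. This is a sketch only: `fetch_page` is a hypothetical name, and the `timeout` value of 10 seconds is an assumption, not something the lab prescribes.

```python
def fetch_page(http, url, timeout=10.0):
    """Return the page body as bytes, or None if the status is not 200.

    `http` is any object with a urllib3-style `request(method, url, ...)` method,
    for example `urllib3.PoolManager()`.
    """
    r = http.request('GET', url, timeout=timeout)
    if r.status != 200:
        print('Some problem occurred. Request Status: %s' % r.status)
        return None
    return r.data


# Usage with urllib3 (requires an Internet connection):
# import urllib3
# page = fetch_page(urllib3.PoolManager(), quote_page)
```

Returning `None` on failure lets the calling code decide whether to retry, skip the page, or stop.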


### Convert the stream of bytes into a BeautifulSoup representation

In [4]:
# parse the html using beautiful soup and store in variable `soup`
soup = BeautifulSoup(page, 'html.parser')
print('Type of the variable \'soup\':', soup.__class__.__name__)

Type of the variable 'soup': BeautifulSoup
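
To see what the parsed object offers without downloading anything, it can help to parse a tiny hand-written document first. The HTML below is made up for illustration and is not taken from the wiki page:

```python
from bs4 import BeautifulSoup

# A tiny hand-written document standing in for a downloaded page
sample_html = b"""
<html>
  <head><title>Sample Page</title></head>
  <body>
    <p>See <a href="/wiki/Alpha">Alpha</a> and <a href="/wiki/Beta">Beta</a>.</p>
  </body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

print(sample_soup.title.string)                             # text of the <title> tag
print([a.get('href') for a in sample_soup.find_all('a')])   # href of every <a> tag
print(sample_soup.p.get_text())                             # text of the first <p>, tags removed
```

The same three calls (`.title`, `.find_all()`, `.get_text()`) are the ones used on the real page in the cells that follow.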


### Check the content
- The HTML source
- Includes all tags and scripts
- Can be long!

In [6]:
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Barry Kripke | The Big Bang Theory Wiki | Fandom
  </title>
  <script>
   document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );
  </script>
  <script>
   (window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Barry_Kripke","wgTitle":"Barry Kripke","wgCurRevisionId":352395,"wgRevisionId":352395,"wgArticleId":2273,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Characters","Caltech Faculty","Scientists","Physicists","Experimental Physicists","Theoretical Physicists","Particle Physicists","Recurring Characters","Season 2","Season 3","Season 4","Season 5","Season 6","Season 7","Season 8","Season 9","The Big Bang Theory","Kripke","Single","Sheldon

In [5]:
print(soup.prettify()[:1000])

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Barry Kripke | The Big Bang Theory Wiki | Fandom
  </title>
  <script>
   document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );
  </script>
  <script>
   (window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Barry_Kripke","wgTitle":"Barry Kripke","wgCurRevisionId":352395,"wgRevisionId":352395,"wgArticleId":2273,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Characters","Caltech Faculty","Scientists","Physicists","Experimental Physicists","Theoretical Physicists","Particle Physicists","Recurring Characters","Season 2","Season 3","Season 4","Season 5","Season 6","Season 7","Season 8","Season 9","The Big Bang Theory","Kripke","Single","Sheldon

### Check the HTML's Title

In [7]:
print('Title tag :%s:' % soup.title)
print('Title text:%s:' % soup.title.string)  # text only, without the <title> tag

Title tag :<title>Barry Kripke | The Big Bang Theory Wiki | Fandom</title>:
Title text:Barry Kripke | The Big Bang Theory Wiki | Fandom:


### Find the main content
- Check if it is possible to use only the relevant data

In [8]:
soup.find_all('a')

[<a class="fandom-sticky-header__logo" href="//bigbangtheory.fandom.com">
 <img alt="The Big Bang Theory Wiki" height="65" src="https://static.wikia.nocookie.net/bigbangtheory/images/e/e6/Site-logo.png/revision/latest?cb=20210531192123" width="250"/>
 </a>,
 <a class="fandom-sticky-header__sitename" href="//bigbangtheory.fandom.com">The Big Bang Theory Wiki</a>,
 <a data-tracking="custom-level-1" href="#">
 <svg class="wds-icon-tiny wds-icon navigation-item-icon"><use xlink:href="#wds-icons-book-tiny"></use></svg> <span>Explore</span>
 </a>,
 <a data-tracking="explore-main-page" href="https://bigbangtheory.fandom.com/wiki/Main_Page">
 <svg class="wds-icon-tiny wds-icon navigation-item-icon"><use xlink:href="#wds-icons-home-tiny"></use></svg> <span>Main Page</span>
 </a>,
 <a data-tracking="explore-discuss" href="/f">
 <svg class="wds-icon-tiny wds-icon navigation-item-icon"><use xlink:href="#wds-icons-comment-tiny"></use></svg> <span>Discuss</span>
 </a>,
 <a data-tracking="explore-all

In [9]:
soup.find_all('a')[0]  # the first <a> tag on the page

<a class="fandom-sticky-header__logo" href="//bigbangtheory.fandom.com">
<img alt="The Big Bang Theory Wiki" height="65" src="https://static.wikia.nocookie.net/bigbangtheory/images/e/e6/Site-logo.png/revision/latest?cb=20210531192123" width="250"/>
</a>

In [9]:
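# The page no longer contains an <article> tag, so this cell and the two after it
# are commented out; the outputs shown for them come from an earlier run.
#article_tag = 'article'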
#article = soup.find_all(article_tag)[0]
#print('Type of the variable \'article\':', article.__class__.__name__)

Type of the variable 'article': Tag


In [11]:
#article.text

'\n\n\n\n\n\n\n\n\n\n\n\n\n\nwatch\t\t\t\t\t\t01:51\n\nThe Loop (TV)\n\n \n\n\n\n\n\n\n\n\n\tDo you like this video?\t\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n\n\n\n\n\n\n\n\n\nBarry Kripke\n\n\n\n\t\t\tAdult\n\t\t\t\n\t\t\n\t\t\tYoung Adult\n\t\t\t\n\t\t\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nGeneral Information\n\nName\nBarry Kripke\n\n\nBorn\nPossibly May 12\n\n\nGender\nMale\n\n\nReligion\nUnknown\n\n\nNationality\nAmerican\n\n\nOccupation\nPhysicist\n\n\nPortrayed By\nJohn Ross Bowie\n\n\n\nRelationships\n\nRelationships\nAmy Farrah Fowler (crush)Beverly Hofstadter (romantic interest)\n\n\nFamily\nUnknown\n\n\n\nEpisode Guide\n\nFirst episode\n"The Killer Robot Instability"\n\n\nLast episode\nThe Change Constant\n\n\nNumber of episodes\n25\n\n\n\nSeasons Guide\n\nSeasons\nS1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12\n\n\n\nBarry Kripke, Ph.D. is a Caltech plasma-physicist-turned-string-theorist and he is a colleague of Leonard and Sheldon. He has a case of rhotacism, where he pronounces "R" and "L" a

### Get some of the text
- Plain text without HTML tags

In [12]:
# show the first 500 characters after removing redundant newlines
#print(re.sub(r'\n\n+', '\n', article.text)[:500])


watch						01:51
The Loop (TV)
 
	Do you like this video?	
 
Barry Kripke
			Adult
			
		
			Young Adult
			
		
General Information
Name
Barry Kripke
Born
Possibly May 12
Gender
Male
Religion
Unknown
Nationality
American
Occupation
Physicist
Portrayed By
John Ross Bowie
Relationships
Relationships
Amy Farrah Fowler (crush)Beverly Hofstadter (romantic interest)
Family
Unknown
Episode Guide
First episode
"The Killer Robot Instability"
Last episode
The Change Constant
Number of episodes
25
Seasons
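
Because page layouts change, a lookup for the main content is more robust when it tries several containers in turn. The sketch below uses a hypothetical helper, `find_main_content`; the candidate list, including the `mw-parser-output` class, is an assumption that should be confirmed with the browser inspector on the page being scraped.

```python
from bs4 import BeautifulSoup


def find_main_content(soup, candidates):
    """Return the first element matching any (tag, attrs) pair, else the whole soup."""
    for tag_name, attrs in candidates:
        element = soup.find(tag_name, attrs=attrs)
        if element is not None:
            return element
    return soup


# Assumed candidates; check them against the real page before relying on them
CANDIDATES = [
    ('article', {}),
    ('div', {'class': 'mw-parser-output'}),
    ('main', {}),
]

# Made-up HTML standing in for a downloaded page
sample = BeautifulSoup(
    '<div class="nav">menu</div>'
    '<div class="mw-parser-output"><p>Main text</p></div>',
    'html.parser',
)
content = find_main_content(sample, CANDIDATES)
print(content.get_text())
```

Falling back to the whole `soup` keeps later cells working, at the cost of including navigation and footer text.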


### Find the links in the text

In [11]:
for t in soup.find_all('a'):
    print(t)

<a class="fandom-sticky-header__logo" href="//bigbangtheory.fandom.com">
<img alt="The Big Bang Theory Wiki" height="65" src="https://static.wikia.nocookie.net/bigbangtheory/images/e/e6/Site-logo.png/revision/latest?cb=20210531192123" width="250"/>
</a>
<a class="fandom-sticky-header__sitename" href="//bigbangtheory.fandom.com">The Big Bang Theory Wiki</a>
<a data-tracking="custom-level-1" href="#">
<svg class="wds-icon-tiny wds-icon navigation-item-icon"><use xlink:href="#wds-icons-book-tiny"></use></svg> <span>Explore</span>
</a>
<a data-tracking="explore-main-page" href="https://bigbangtheory.fandom.com/wiki/Main_Page">
<svg class="wds-icon-tiny wds-icon navigation-item-icon"><use xlink:href="#wds-icons-home-tiny"></use></svg> <span>Main Page</span>
</a>
<a data-tracking="explore-discuss" href="/f">
<svg class="wds-icon-tiny wds-icon navigation-item-icon"><use xlink:href="#wds-icons-comment-tiny"></use></svg> <span>Discuss</span>
</a>
<a data-tracking="explore-all-pages" href="https

In [12]:
for t in soup.find_all('a'):
    print(t.get('href'))

//bigbangtheory.fandom.com
//bigbangtheory.fandom.com
#
https://bigbangtheory.fandom.com/wiki/Main_Page
/f
https://bigbangtheory.fandom.com/wiki/Special:AllPages
https://bigbangtheory.fandom.com/wiki/Special:Community
/wiki/Blog:Recent_posts
https://bigbangtheory.fandom.com/wiki/Category:Characters
https://bigbangtheory.fandom.com/wiki/Big_Bang_Theory
https://bigbangtheory.fandom.com/wiki/Category:Main_Characters
https://bigbangtheory.fandom.com/wiki/Leonard_Hofstadter
https://bigbangtheory.fandom.com/wiki/Penny_Hofstadter
https://bigbangtheory.fandom.com/wiki/Sheldon_Cooper
https://bigbangtheory.fandom.com/wiki/Amy_Farrah_Fowler
https://bigbangtheory.fandom.com/wiki/Howard_Wolowitz
https://bigbangtheory.fandom.com/wiki/Bernadette_Rostenkowski-Wolowitz
https://bigbangtheory.fandom.com/wiki/Rajesh_Koothrappali
https://bigbangtheory.fandom.com/wiki/Stuart_Bloom
https://bigbangtheory.fandom.com/wiki/Leslie_Winkle
https://bigbangtheory.fandom.com/wiki/Emily_Sweeney
https://bigbangtheory.fa
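
The output mixes absolute URLs, protocol-relative URLs (`//...`) and site-relative paths (`/wiki/...`). If full URLs are needed, `urllib.parse.urljoin` resolves each `href` against the page's own address. A short sketch using a few of the values printed above:

```python
from urllib.parse import urljoin

base_url = 'https://bigbangtheory.fandom.com/wiki/Barry_Kripke'
raw_links = [
    '//bigbangtheory.fandom.com',
    '/f',
    '/wiki/Blog:Recent_posts',
    'https://bigbangtheory.fandom.com/wiki/Main_Page',
    None,  # <a> tags without an href return None from .get('href')
]

# Skip missing hrefs, then resolve everything else against the page URL
absolute_links = [urljoin(base_url, href) for href in raw_links if href]
for link in absolute_links:
    print(link)
```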

In [16]:
# identify the type of tag to retrieve
link_tag = 'a'

# create a list with the links from the `<a>` tag
tag_list = []
for t in soup.find_all(link_tag):
    tag_list.append(t.get('href'))

# List comprehension version:
# tag_list = [t.get('href') for t in soup.find_all(link_tag)]

print('Size of \'tag_list\':', len(tag_list))
tag_list

Size of 'tag_list': 1049


['//bigbangtheory.fandom.com',
 '//bigbangtheory.fandom.com',
 '#',
 'https://bigbangtheory.fandom.com/wiki/Main_Page',
 '/f',
 'https://bigbangtheory.fandom.com/wiki/Special:AllPages',
 'https://bigbangtheory.fandom.com/wiki/Special:Community',
 '/wiki/Blog:Recent_posts',
 'https://bigbangtheory.fandom.com/wiki/Category:Characters',
 'https://bigbangtheory.fandom.com/wiki/Big_Bang_Theory',
 'https://bigbangtheory.fandom.com/wiki/Category:Main_Characters',
 'https://bigbangtheory.fandom.com/wiki/Leonard_Hofstadter',
 'https://bigbangtheory.fandom.com/wiki/Penny_Hofstadter',
 'https://bigbangtheory.fandom.com/wiki/Sheldon_Cooper',
 'https://bigbangtheory.fandom.com/wiki/Amy_Farrah_Fowler',
 'https://bigbangtheory.fandom.com/wiki/Howard_Wolowitz',
 'https://bigbangtheory.fandom.com/wiki/Bernadette_Rostenkowski-Wolowitz',
 'https://bigbangtheory.fandom.com/wiki/Rajesh_Koothrappali',
 'https://bigbangtheory.fandom.com/wiki/Stuart_Bloom',
 'https://bigbangtheory.fandom.com/wiki/Leslie_Winkl

In [17]:
# keep only the links to the wiki itself
wiki_tag_list = []
for link in tag_list:
    if link is not None and link[:6] == '/wiki/':
        wiki_link = link[6:]
        wiki_tag_list.append(wiki_link)

# List comprehension:
# wiki_tag_list = [link[6:] for link in tag_list if link is not None and link[:6] == '/wiki/']

print('Size of \'wiki_tag_list\':', len(wiki_tag_list))
wiki_tag_list

Size of 'wiki_tag_list': 388


['Blog:Recent_posts',
 'Blog:Recent_posts',
 'Blog:Recent_posts',
 'Blog:Recent_posts',
 'Special:Search',
 'Special:Search',
 'Special:Search',
 'Blog:Recent_posts',
 'Blog:Recent_posts',
 'Blog:Recent_posts',
 'Blog:Recent_posts',
 'Barry_Kripke?action=edit',
 'Category:Characters',
 'Category:Caltech_Faculty',
 'Category:Scientists',
 'Category:Physicists',
 'Category:Experimental_Physicists',
 'Category:Theoretical_Physicists',
 'Category:Particle_Physicists',
 'Category:Recurring_Characters',
 'Category:Season_2',
 'Category:Season_3',
 'Category:Season_4',
 'Category:Season_5',
 'Category:Season_6',
 'Category:Season_7',
 'Category:Season_8',
 'Category:Season_9',
 'Category:The_Big_Bang_Theory',
 'Category:Kripke',
 'Category:Single',
 'Category:Sheldon%27s_Mortal_Enemies',
 'Category:Ph.D.',
 'Category:Season_3_Characters',
 'Category:Season_4_Characters',
 'Category:Season_5_Characters',
 'Category:Season_6_Characters',
 'Category:Season_8_Characters',
 'Category:Season_9_Char
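
The slicing with the literal `6` works because `'/wiki/'` is six characters long. An equivalent sketch that derives the length from the prefix itself is shown below; `strip_wiki_prefix` is a hypothetical name and the sample links are a hand-picked subset.

```python
WIKI_PREFIX = '/wiki/'


def strip_wiki_prefix(links, prefix=WIKI_PREFIX):
    """Keep only links that start with `prefix` and return them without it."""
    return [link[len(prefix):] for link in links
            if link is not None and link.startswith(prefix)]


sample_links = [
    '/wiki/Blog:Recent_posts',
    '#',
    None,
    'https://bigbangtheory.fandom.com/wiki/Main_Page',
    '/wiki/Category:Characters',
]
print(strip_wiki_prefix(sample_links))  # ['Blog:Recent_posts', 'Category:Characters']
```

Note that, exactly as in the cell above, absolute URLs that point to the same wiki are dropped because they do not start with `/wiki/`.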

In [19]:
# create a filter for undesired links
# (named `link_filter` to avoid shadowing the built-in `filter`)
link_filter = '(%s)' % '|'.join([
    'Season_',
    'Category:',
    'File:',
    'Help:',
    'Portal:',
    'action=',
    'Special:',
    'Talk:'
])
# remove the links that match the filter
filtered_tag_list = []
for t in wiki_tag_list:
    if not re.search(link_filter, t):
        filtered_tag_list.append(t)

# List comprehension version:
# filtered_tag_list = [t for t in wiki_tag_list if not re.search(link_filter, t)]
print('Size of \'filtered_tag_list\':', len(filtered_tag_list))
filtered_tag_list

Size of 'filtered_tag_list': 264


['Blog:Recent_posts',
 'Blog:Recent_posts',
 'Blog:Recent_posts',
 'Blog:Recent_posts',
 'Blog:Recent_posts',
 'Blog:Recent_posts',
 'Blog:Recent_posts',
 'Blog:Recent_posts',
 'John_Ross_Bowie',
 'Amy_Farrah_Fowler',
 'Beverly_Hofstadter',
 'The_Killer_Robot_Instability',
 'The_Change_Constant',
 'The_Relationship_Diremption',
 'Caltech',
 'String_theory',
 'Leonard_Hofstadter',
 'Sheldon_Cooper',
 'Leonard_Hofstadter',
 'Howard_Wolowitz',
 'Rajesh',
 'Amy_Farrah_Fowler',
 'Kripke_Krippler',
 'M.O.N.T.E.',
 'Caltech',
 'The_Killer_Robot_Instability',
 'Penny',
 'Howard_Wolowitz',
 'Penny',
 'The_Friendship_Algorithm',
 'The_Electric_Can_Opener_Fluctuation',
 'Sheldon%27s_office',
 'The_Cafeteria',
 'Leonard_Hofstadter',
 'Rajesh_Koothrappali',
 'Sheldon_Cooper',
 'President_Siebert',
 'The_Vengeance_Formulation',
 'Apartment_4A',
 'Zack_Johnson',
 'Stuart_Bloom',
 'LeVar_Burton',
 'Raj%27s_apartment',
 'The_Toast_Derivation',
 'Rajesh_Koothrappali',
 'Professor_Rothman',
 'Siri',
 'Th
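
The output still contains `Blog:Recent_posts` because `Blog:` is not one of the alternatives in the pattern. The sketch below shows how the alternation behaves on a few sample names and what happens when `Blog:` is added; adding it is an assumption about what counts as undesired, not part of the original filter.

```python
import re

unwanted = ['Season_', 'Category:', 'File:', 'Help:', 'Portal:',
            'action=', 'Special:', 'Talk:']

# re.escape guards against characters that are special in a regular expression
pattern = re.compile('|'.join(re.escape(u) for u in unwanted))

samples = ['Blog:Recent_posts', 'Category:Kripke',
           'Barry_Kripke?action=edit', 'Sheldon_Cooper']

kept = [s for s in samples if not pattern.search(s)]
print(kept)  # ['Blog:Recent_posts', 'Sheldon_Cooper']

# Assumed extension: also treat blog pages as undesired
extended = re.compile('|'.join(re.escape(u) for u in unwanted + ['Blog:']))
kept_extended = [s for s in samples if not extended.search(s)]
print(kept_extended)  # ['Sheldon_Cooper']
```

Compiling the pattern once avoids rebuilding it for every link in the loop.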

In [20]:
# remove duplicates
unique_tag_list = list(set(filtered_tag_list))
print('Size of \'unique_tag_list\':', len(unique_tag_list))
unique_tag_list

Size of 'unique_tag_list': 197


['Kaley_Cuoco',
 'Howie_Mandel',
 'Summer_Glau',
 'Brian_Patrick_Wade',
 'Local_Sitemap',
 'Mark_Hamill',
 'Tara_Hernandez',
 'Kareem_Abdul-Jabbar',
 'The_Valentino_Submergence',
 'Eric_Gablehauser',
 'The_Geology_Elevation',
 'The_Champagne_Reflection',
 'Maria_Ferrari',
 'The_Social_Group',
 'Leonard_Hofstadter',
 'The_Friendship_Algorithm',
 'Halley_Wolowitz',
 'Matt_Bennett',
 'Bert_Kibbler',
 'Margo_Harshman',
 'Brian_Posehn',
 'Carrie_Fisher',
 'Wil_Wheaton',
 'Jim_Reynolds',
 'Professor_Proton',
 'The_Grant_Allocation_Derivation',
 'Lucy',
 'The_Allowance_Evaporation',
 'The_Vengeance_Formulation',
 'June_Squibb',
 'Bill_Prady',
 'Joshua_Malina',
 'Steve_Holland',
 'Raj%27s_apartment',
 'Dean_Norris',
 'Stephanie_Barnett',
 'Mary_T._Quigley',
 'The_Tesla_Recoil',
 'Neil_deGrasse_Tyson',
 'Steve_Wozniak',
 'Sheldon_Cooper',
 'Sheldon_and_Amy',
 'Christine_Baranski',
 'David_Saltzberg',
 'Katey_Sagal',
 'Kripke_Krippler',
 'Kurt',
 'George_Smoot',
 'Fun_with_Flags',
 'Alex_Jensen'
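
Converting to a `set` removes duplicates but does not keep the order in which the links appeared on the page, which is why the list above looks shuffled. If the original order matters, `dict.fromkeys` is a common alternative; a small sketch with sample values:

```python
filtered = ['Caltech', 'Leonard_Hofstadter', 'Sheldon_Cooper',
            'Leonard_Hofstadter', 'Caltech']

# dict keys are unique and keep insertion order (Python 3.7+)
unique_in_order = list(dict.fromkeys(filtered))
print(unique_in_order)  # ['Caltech', 'Leonard_Hofstadter', 'Sheldon_Cooper']

# If alphabetical order is wanted anyway, sorting the set is enough
unique_sorted = sorted(set(filtered))
print(unique_sorted)    # ['Caltech', 'Leonard_Hofstadter', 'Sheldon_Cooper']
```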

In [21]:
# convert escaped sequences
unquoted_tag_list = [unquote(t) for t in unique_tag_list]
print('Size of \'unquoted_tag_list\':', len(unquoted_tag_list))
unquoted_tag_list

Size of 'unquoted_tag_list': 197


['Kaley_Cuoco',
 'Howie_Mandel',
 'Summer_Glau',
 'Brian_Patrick_Wade',
 'Local_Sitemap',
 'Mark_Hamill',
 'Tara_Hernandez',
 'Kareem_Abdul-Jabbar',
 'The_Valentino_Submergence',
 'Eric_Gablehauser',
 'The_Geology_Elevation',
 'The_Champagne_Reflection',
 'Maria_Ferrari',
 'The_Social_Group',
 'Leonard_Hofstadter',
 'The_Friendship_Algorithm',
 'Halley_Wolowitz',
 'Matt_Bennett',
 'Bert_Kibbler',
 'Margo_Harshman',
 'Brian_Posehn',
 'Carrie_Fisher',
 'Wil_Wheaton',
 'Jim_Reynolds',
 'Professor_Proton',
 'The_Grant_Allocation_Derivation',
 'Lucy',
 'The_Allowance_Evaporation',
 'The_Vengeance_Formulation',
 'June_Squibb',
 'Bill_Prady',
 'Joshua_Malina',
 'Steve_Holland',
 "Raj's_apartment",
 'Dean_Norris',
 'Stephanie_Barnett',
 'Mary_T._Quigley',
 'The_Tesla_Recoil',
 'Neil_deGrasse_Tyson',
 'Steve_Wozniak',
 'Sheldon_Cooper',
 'Sheldon_and_Amy',
 'Christine_Baranski',
 'David_Saltzberg',
 'Katey_Sagal',
 'Kripke_Krippler',
 'Kurt',
 'George_Smoot',
 'Fun_with_Flags',
 'Alex_Jensen',
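
The only visible change in this step is that `%27` became an apostrophe. `unquote` reverses percent-encoding in general and decodes multi-byte sequences as UTF-8 by default. A short sketch; the last sample value is made up to show a non-ASCII character:

```python
from urllib.parse import unquote

escaped = ['Raj%27s_apartment', 'Sheldon%27s_office', 'Kripke_Krippler', 'Caf%C3%A9']
decoded = [unquote(t) for t in escaped]
print(decoded)
```

Names without any `%` sequence, such as `Kripke_Krippler`, pass through unchanged.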


In [22]:
# convert underscore to space
spaced_tag_list = []
for tag in unquoted_tag_list:
    processed_tag = re.sub('_', ' ', tag)
    spaced_tag_list.append(processed_tag)

# spaced_tag_list = [re.sub('_', ' ', t) for t in unquoted_tag_list]
print('Size of \'spaced_tag_list\':', len(spaced_tag_list))
spaced_tag_list

Size of 'spaced_tag_list': 197


['Kaley Cuoco',
 'Howie Mandel',
 'Summer Glau',
 'Brian Patrick Wade',
 'Local Sitemap',
 'Mark Hamill',
 'Tara Hernandez',
 'Kareem Abdul-Jabbar',
 'The Valentino Submergence',
 'Eric Gablehauser',
 'The Geology Elevation',
 'The Champagne Reflection',
 'Maria Ferrari',
 'The Social Group',
 'Leonard Hofstadter',
 'The Friendship Algorithm',
 'Halley Wolowitz',
 'Matt Bennett',
 'Bert Kibbler',
 'Margo Harshman',
 'Brian Posehn',
 'Carrie Fisher',
 'Wil Wheaton',
 'Jim Reynolds',
 'Professor Proton',
 'The Grant Allocation Derivation',
 'Lucy',
 'The Allowance Evaporation',
 'The Vengeance Formulation',
 'June Squibb',
 'Bill Prady',
 'Joshua Malina',
 'Steve Holland',
 "Raj's apartment",
 'Dean Norris',
 'Stephanie Barnett',
 'Mary T. Quigley',
 'The Tesla Recoil',
 'Neil deGrasse Tyson',
 'Steve Wozniak',
 'Sheldon Cooper',
 'Sheldon and Amy',
 'Christine Baranski',
 'David Saltzberg',
 'Katey Sagal',
 'Kripke Krippler',
 'Kurt',
 'George Smoot',
 'Fun with Flags',
 'Alex Jensen',


In [23]:
# order the list
spaced_tag_list.sort()
print('Size of \'spaced_tag_list\':', len(spaced_tag_list))
spaced_tag_list

Size of 'spaced_tag_list': 197


['Aarti Mann',
 'Adam West',
 'Alessandra Torresani',
 'Alex Jensen',
 'Alfred Hofstadter',
 'Alice Amter',
 'Althea Davis',
 'Amy Farrah Fowler',
 'Anthony Del Broccolo',
 'Anthony Rich',
 'Anu',
 'Apartment 4A',
 'Barenaked Ladies',
 'Bernadette Rostenkowski-Wolowitz',
 'Bert Kibbler',
 'Beverly Hofstadter',
 'Bill Nye',
 'Bill Prady',
 'Blog:Recent posts',
 'Brent Spiner',
 'Brian George',
 'Brian Greene',
 'Brian Patrick Wade',
 'Brian Posehn',
 'Brian Thomas Smith',
 'Buzz Aldrin',
 'Caltech',
 'Carol Ann Susi',
 'Carrie Fisher',
 'Casey Sander',
 'Charlie Sheen',
 'Christine Baranski',
 'Chuck Lorre',
 'Chuck Lorre Productions',
 'Cinnamon',
 'Claire',
 'Colonel Richard Williams',
 'Courtney Henggeler',
 'Dan',
 'Dave Goetsch',
 'David Gibbs',
 'David Saltzberg',
 'Dean Norris',
 'Debbie Wolowitz',
 'Denise',
 'Dennis Kim',
 'Dimitri',
 'Dr. Pemberton',
 'Emily Sweeney',
 'Eric Gablehauser',
 'Eric Kaplan',
 'Fun with Flags',
 'George Cooper Jr.',
 'George Cooper Sr.',
 'George S

### Create a filter for unwanted types of articles

In [24]:
# remove the links that start with "The" (a rough way to drop episode titles)
no_episodes_tag_list = []
for tag in spaced_tag_list:
    if not tag.startswith('The'):
        no_episodes_tag_list.append(tag)

# List comprehension version:
# no_episodes_tag_list = [t for t in spaced_tag_list if not t.startswith('The')]

print('Size of \'no_episodes_tag_list\':', len(no_episodes_tag_list))
no_episodes_tag_list

Size of 'no_episodes_tag_list': 166


['Aarti Mann',
 'Adam West',
 'Alessandra Torresani',
 'Alex Jensen',
 'Alfred Hofstadter',
 'Alice Amter',
 'Althea Davis',
 'Amy Farrah Fowler',
 'Anthony Del Broccolo',
 'Anthony Rich',
 'Anu',
 'Apartment 4A',
 'Barenaked Ladies',
 'Bernadette Rostenkowski-Wolowitz',
 'Bert Kibbler',
 'Beverly Hofstadter',
 'Bill Nye',
 'Bill Prady',
 'Blog:Recent posts',
 'Brent Spiner',
 'Brian George',
 'Brian Greene',
 'Brian Patrick Wade',
 'Brian Posehn',
 'Brian Thomas Smith',
 'Buzz Aldrin',
 'Caltech',
 'Carol Ann Susi',
 'Carrie Fisher',
 'Casey Sander',
 'Charlie Sheen',
 'Christine Baranski',
 'Chuck Lorre',
 'Chuck Lorre Productions',
 'Cinnamon',
 'Claire',
 'Colonel Richard Williams',
 'Courtney Henggeler',
 'Dan',
 'Dave Goetsch',
 'David Gibbs',
 'David Saltzberg',
 'Dean Norris',
 'Debbie Wolowitz',
 'Denise',
 'Dennis Kim',
 'Dimitri',
 'Dr. Pemberton',
 'Emily Sweeney',
 'Eric Gablehauser',
 'Eric Kaplan',
 'Fun with Flags',
 'George Cooper Jr.',
 'George Cooper Sr.',
 'George S
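
The individual steps of this lab can be combined into one function. The sketch below is one possible arrangement, not the lab's official solution; `clean_titles` is a hypothetical name and the sample links are hand-picked from the outputs above.

```python
import re
from urllib.parse import unquote


def clean_titles(hrefs, unwanted_pattern, prefix='/wiki/'):
    """Turn raw href values into a sorted list of unique, readable page titles."""
    titles = set()
    for href in hrefs:
        # keep only links to the wiki itself
        if href is None or not href.startswith(prefix):
            continue
        name = href[len(prefix):]
        # drop categories, special pages and similar
        if re.search(unwanted_pattern, name):
            continue
        # decode %xx sequences and replace underscores with spaces
        titles.add(unquote(name).replace('_', ' '))
    # same heuristic as above: titles starting with "The" are treated as episodes
    return sorted(t for t in titles if not t.startswith('The'))


sample_hrefs = ['/wiki/Raj%27s_apartment', '/wiki/The_Cafeteria',
                '/wiki/Category:Single', '/wiki/Caltech', '/wiki/Caltech', '#', None]
result = clean_titles(sample_hrefs, '(Season_|Category:)')
print(result)  # ['Caltech', "Raj's apartment"]
```

The example also shows a limit of the heuristic: `The Cafeteria` is a location rather than an episode, yet it is removed because it starts with "The".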

In [None]:
# For pages that build their content with JavaScript, consider the requests_html library

In [None]:
# A site's robots.txt gives an idea of what may and may not be scraped, for example:
# https://www.sephora.com/robots.txt
# https://www.realestate.com.au/robots.txt
# https://www.amazon.com.au/robots.txt
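
A `robots.txt` file can also be checked from code with the standard library's `urllib.robotparser`. The sketch below parses a made-up file offline; the rules, the `example.com` URLs and the `polite_delay` helper are all assumptions for illustration. For a real site, `set_url(...)` followed by `read()` would download the file instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, not copied from any real site
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 1
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def polite_delay(parser, user_agent='*', default=1.0):
    """Return the Crawl-delay for this user agent, or `default` if none is given."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


print(parser.can_fetch('*', 'https://example.com/wiki/Page'))     # allowed path
print(parser.can_fetch('*', 'https://example.com/private/data'))  # disallowed path
print(polite_delay(parser))
```

Passing a `robots.txt` check does not replace reading the site's Terms and Conditions; it only reports what the site asks crawlers to do.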



---

> © 2021 Institute of Data

---
