
<font color='#2F4F4F'>To use this notebook on Colaboratory, you will need to make a copy of it. Go to File > Save a Copy in Drive. You can then use the new copy that will appear in the new tab.</font>


# <font color='#2F4F4F'>AfterWork Data Science: Web Scraping with Python</font>

## <font color='#2F4F4F'>Prerequisites</font>

In [None]:
# We first import the required libraries
# ---
#
import pandas as pd             # library for data manipulation
import requests                 # library for fetching a web page
from bs4 import BeautifulSoup   # library for extracting contents from a web page

## <font color='#2F4F4F'>Examples</font>

As we go through the following examples, we should keep in mind that although web pages differ, the process for scraping data is largely the same. This means that when scraping data from other web pages, we can reuse the given code with some modifications.

### Example 1: Performing Basic Web Scraping 

In [None]:
# Example 1
# ---
# Scrape data found within the <a class="tag"> HTML tags in the given quotes website.
# ---
# Website: http://quotes.toscrape.com/ 
# ---
# YOUR CODE GOES BELOW
#

**Before we begin, we need to first understand the following concepts:**

1. A website is made up of web pages. These web pages can be either HTML or XML web pages.
2. HTML web pages contain content made up of predefined tags, e.g. `<head>, <body>, <div>, <header>, <section>, <p>, <span>, <a>, <li>`, etc.
3. XML web pages contain content made up of user-defined tags such as `<root>, <name>, <address>, <sector>, <location>`, etc.
4. The desired text data is usually contained within tags in an HTML or an XML web page. For example, the text "hello" can be contained in the `<p>` tag as shown: `<p>hello</p>`. In such a case the `<p>` tag, or *paragraph tag*, has a closing tag `</p>`.
5. HTML tags can have attributes such as class, id, href, etc. A good example would be a paragraph tag with a class attribute that has the value "home": `<p class="home">hello</p>`. These attributes help us specify the elements that we want to work with.
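
The concepts above can be seen on a tiny document. The following is a minimal sketch, assuming a hand-written HTML string (the `id="intro"` attribute is made up for illustration), that reads a tag's name, text and attributes with BeautifulSoup:

```python
from bs4 import BeautifulSoup

# A tiny hand-written HTML document (hypothetical, for illustration only)
html = '<html><body><p class="home" id="intro">hello</p></body></html>'

soup = BeautifulSoup(html, "html.parser")
p_tag = soup.find("p")

tag_name = p_tag.name          # the name of the tag
tag_text = p_tag.get_text()    # the text between <p> and </p>
tag_attrs = p_tag.attrs        # all attributes of the tag as a dictionary

print(tag_name, tag_text, tag_attrs)
```

Note that BeautifulSoup stores the `class` attribute as a list, since an element can carry several classes at once.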



#### Step 1: Obtaining our Data

In [None]:
# We first download our web page from the server that hosts it
# by using the get() method from the requests library.
# Upon doing this, we also check the status_code of our download.
# - A status_code starting with 2 indicates success.
# - A status_code starting with 4 or 5 indicates an error.
# ---
#  
page = requests.get('http://quotes.toscrape.com/')
page

<Response [200]>

In [None]:
# The following code shows the error we get when downloading a non-existent web page.
# ---
# 
page2 = requests.get('http://quotes.toscrape.com/mypage')
page2

<Response [404]>
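
The status code rules described above can be written down as a small function. This is only a sketch: `describe_status` is a hypothetical helper, not part of the requests library, and the labels it returns are our own.

```python
def describe_status(status_code):
    """Return a short label for an HTTP status code (hypothetical helper)."""
    if 200 <= status_code < 300:
        return "success"
    if 400 <= status_code < 500:
        return "client error"      # e.g. 404, the page does not exist
    if 500 <= status_code < 600:
        return "server error"
    return "other"                 # e.g. 1xx and 3xx codes


for code in (200, 404, 503):
    print(code, describe_status(code))
```

With requests, the integer code is available as `page.status_code`, and `page.raise_for_status()` raises an exception for 4xx and 5xx responses.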

In [None]:
# Once we have successfully retrieved our page we can preview our document 
# by printing the first 1600 characters of the HTML document as shown below.
# ---
#
print(page.text[0:1600])

<!DOCTYPE html>
<html lang="en">
<head>
	<meta charset="UTF-8">
	<title>Quotes to Scrape</title>
    <link rel="stylesheet" href="/static/bootstrap.min.css">
    <link rel="stylesheet" href="/static/main.css">
</head>
<body>
    <div class="container">
        <div class="row header-box">
            <div class="col-md-8">
                <h1>
                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>
                </h1>
            </div>
            <div class="col-md-4">
                <p>
                
                    <a href="/login">Login</a>
                
                </p>
            </div>
        </div>
    

<div class="row">
    <div class="col-md-8">

    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
        <span>by <small class="author" itempr

From the output above we can see some of the HTML `<a class="tag" href="">` tags that contain the data that we need. We can also locate the HTML tag that contains our data by using the inspect tool within our browser.
This is demonstrated in this [short video](https://www.youtube.com/watch?v=CwiRPmXhcLY).

#### Step 2: Parsing

In [None]:
# Once we have successfully downloaded our html document, 
# we can parse it and extract our desired text from the <a class="tag"> tags
# as shown below
# ---
#

# We use BeautifulSoup, a popular Python library for web scraping, to parse our HTML document.
# Parsing here means that BeautifulSoup converts the HTML (stored in page.text)
# into a special object, which we call soup, that the library can search and navigate.
# In layman's terms, BeautifulSoup is reading the HTML and making sense of its structure.
# If we were working with an XML document, we would use the XML parser 'xml',
# which requires the lxml library to be installed.
# ---
# 
# 
soup = BeautifulSoup(page.text, "html.parser")

In [None]:
# We can then print out the HTML content of the page formatted nicely, 
# using the prettify() method as shown below:
# ---
#
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Quotes to Scrape
  </title>
  <link href="/static/bootstrap.min.css" rel="stylesheet"/>
  <link href="/static/main.css" rel="stylesheet"/>
 </head>
 <body>
  <div class="container">
   <div class="row header-box">
    <div class="col-md-8">
     <h1>
      <a href="/" style="text-decoration: none">
       Quotes to Scrape
      </a>
     </h1>
    </div>
    <div class="col-md-4">
     <p>
      <a href="/login">
       Login
      </a>
     </p>
    </div>
   </div>
   <div class="row">
    <div class="col-md-8">
     <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
      <span class="text" itemprop="text">
       “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
      </span>
      <span>
       by
       <small class="author" itemprop="author">
        Albert Einstein
       </small>
       <a href="/author/Albert

#### Step 3: Extracting Required Elements

In [None]:
# Let's now extract data found in a specific tag i.e. data found in <a> tags.
# The result is a list of instances of <a> tags found within our document.
# NB: In this case, we will get all <a> tags using the find_all() method
# which will get all the instances of the specified tag in the web page.
# ---
# 
results = soup.find_all('a')
results

[<a href="/" style="text-decoration: none">Quotes to Scrape</a>,
 <a href="/login">Login</a>,
 <a href="/author/Albert-Einstein">(about)</a>,
 <a class="tag" href="/tag/change/page/1/">change</a>,
 <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>,
 <a class="tag" href="/tag/thinking/page/1/">thinking</a>,
 <a class="tag" href="/tag/world/page/1/">world</a>,
 <a href="/author/J-K-Rowling">(about)</a>,
 <a class="tag" href="/tag/abilities/page/1/">abilities</a>,
 <a class="tag" href="/tag/choices/page/1/">choices</a>,
 <a href="/author/Albert-Einstein">(about)</a>,
 <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>,
 <a class="tag" href="/tag/life/page/1/">life</a>,
 <a class="tag" href="/tag/live/page/1/">live</a>,
 <a class="tag" href="/tag/miracle/page/1/">miracle</a>,
 <a class="tag" href="/tag/miracles/page/1/">miracles</a>,
 <a href="/author/Jane-Austen">(about)</a>,
 <a class="tag" href="/tag/aliteracy/page/1/">aliteracy</a>,
 <a class="tag" href

In [None]:
# Determining the number of instances of the <a> tag
# ---
#
len(results)

55

In [None]:
# To get the text within our first <a> tag we can perform our extraction as follows:
# ---
# 

# Method 1
# ---
#
soup.find_all('a')[0].get_text()

# or
# soup.find_all('a')[0].text

'Quotes to Scrape'

In [None]:
# Method 2
# ---
# find() returns only the first matching tag, so no indexing is needed.
# Uncomment the following code
# ---
#
# soup.find('a').get_text()

# or
# soup.find('a').text

In [None]:
# Checking the last item
# ---
# 
results[-1]

<a href="https://scrapinghub.com">Scrapinghub</a>

In [None]:
# We can also specify the tag that we would like to retrieve our text data from,
# in this case getting the text from the tag with the following attributes:
# ---
# - class="tag"
# - href="/tag/change/page/1/"
# This tag would be:
# - <a class="tag" href="/tag/change/page/1/">change</a>
# ---
# 

# We perform the following 
# ---
#
results = soup.find_all('a', attrs={'class':'tag', 'href': '/tag/change/page/1/'})[0].get_text()
results

# Method 2
# ---
# Uncomment the following lines
# ---
#
# results = soup.find_all('a', {'class':'tag', 'href': '/tag/change/page/1/'})[0].get_text()
# results

'change'
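
Filtering by attributes can be written in more than one way. The sketch below, which assumes a small hand-written HTML snippet modelled on the quotes page, compares the `attrs` dictionary used above with the `class_` keyword argument and a CSS selector passed to `select()`:

```python
from bs4 import BeautifulSoup

# Hand-written snippet modelled on the quotes page (for illustration only)
html = """
<div class="tags">
  <a class="tag" href="/tag/change/page/1/">change</a>
  <a class="tag" href="/tag/world/page/1/">world</a>
  <a href="/login">Login</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_attrs = soup.find_all("a", attrs={"class": "tag"})   # dictionary of attributes
by_keyword = soup.find_all("a", class_="tag")           # class_ keyword argument
by_selector = soup.select("a.tag")                      # CSS selector

tag_texts = [tag.get_text() for tag in by_selector]
print(tag_texts)
```

All three calls select the same two `<a class="tag">` elements and skip the login link, which has no class.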

In [None]:
# We can also get the values of the attributes by using the get() method,
# for example result.get('href'), as we do in the loop in the next cell.


In [None]:
# We then create empty lists that we will use to store
# content fetched from the <a> tags
# ---
#
from urllib.parse import urljoin   # standard-library helper for building full links

link_content = []
link_url = []

# Getting all our <a> tags 
# ---
#
results = soup.find_all('a')

# We then loop through these tags
for result in results:
   
    # Getting our text from each tag
    text = result.get_text()

    # We join our domain with the href link that we scrape in order to form
    # a full link. Unlike plain string concatenation, urljoin() leaves links
    # that are already absolute (e.g. https://scrapinghub.com) unchanged.
    link = urljoin('http://quotes.toscrape.com', result.get('href'))

    # Then appending the text to our link_content list
    link_content.append(text)

    # Then appending the link to our link_url list
    link_url.append(link)

In [None]:
# Previewing our link_content list by checking first 10 items
# ---
#
link_content[0:10]

['Quotes to Scrape',
 'Login',
 '(about)',
 'change',
 'deep-thoughts',
 'thinking',
 'world',
 '(about)',
 'abilities',
 'choices']

In [None]:
# Previewing our link_url list by checking first 10 items
# ---
#
link_url[0:10]

['http://quotes.toscrape.com/',
 'http://quotes.toscrape.com/login',
 'http://quotes.toscrape.com/author/Albert-Einstein',
 'http://quotes.toscrape.com/tag/change/page/1/',
 'http://quotes.toscrape.com/tag/deep-thoughts/page/1/',
 'http://quotes.toscrape.com/tag/thinking/page/1/',
 'http://quotes.toscrape.com/tag/world/page/1/',
 'http://quotes.toscrape.com/author/J-K-Rowling',
 'http://quotes.toscrape.com/tag/abilities/page/1/',
 'http://quotes.toscrape.com/tag/choices/page/1/']
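
Most `href` values on this page are relative paths, but the last link on the page (`https://scrapinghub.com`) is already a full URL. A minimal sketch of how `urljoin` from the standard library resolves both kinds of value against a base URL, using three `href` values taken from the outputs above:

```python
from urllib.parse import urljoin

base_url = "http://quotes.toscrape.com"

# Two relative paths and one absolute URL, as found on the quotes page
hrefs = ["/login", "/tag/change/page/1/", "https://scrapinghub.com"]

# urljoin() prefixes relative paths with the base URL
# and returns absolute URLs unchanged
full_links = [urljoin(base_url, href) for href in hrefs]

for link in full_links:
    print(link)
```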

#### Step 4: Saving our Data

In [None]:
# Finally, we save the scraped contents in a dataframe and preview our data as shown
# ---
#
df = pd.DataFrame({"link_content": link_content, "link_url": link_url})
df.head()

| | link_content | link_url |
|---|---|---|
| 0 | Quotes to Scrape | http://quotes.toscrape.com/ |
| 1 | Login | http://quotes.toscrape.com/login |
| 2 | (about) | http://quotes.toscrape.com/author/Albert-Einstein |
| 3 | change | http://quotes.toscrape.com/tag/change/page/1/ |
| 4 | deep-thoughts | http://quotes.toscrape.com/tag/deep-thoughts/p... |


### Example 2: Scraping for Tables

In [None]:
# Example 2
# ---
# Get a list of African cities from the given URL and store the city and respective population.
# ---
# Website URL = https://en.wikipedia.org/wiki/List_of_cities_in_Africa_by_population
# ---
# YOUR CODE GOES BELOW
# 

#### Step 1: Obtaining our Data

In [None]:
# Fetching our data from Wikipedia. A status of 200 means success.
# 
# ---
#
page = requests.get('https://en.wikipedia.org/wiki/List_of_cities_in_Africa_by_population')
page 

<Response [200]>

#### Step 2: Parsing

In [None]:
# Parsing our data using BeautifulSoup
# ---
#
soup = BeautifulSoup(page.text, "html.parser")

Using the browser inspect feature, we identify our source table to have the following tag: 

`<table class="sortable wikitable jquery-tablesorter">`

#### Step 3: Extracting Required Elements

In [None]:
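# The raw HTML that we downloaded has the table tag shown below. The extra
# jquery-tablesorter class seen in the browser inspect tool is added by
# JavaScript after the page loads, so it is not in our downloaded document.
# ---
# <table class="sortable wikitable">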
# ---
#
right_table = soup.find('table', {'class': 'sortable wikitable'}) 

In [None]:
# We then preview it to confirm that we have the right table
#  
print(right_table)

<table class="sortable wikitable">
<tbody><tr>
<th style="width:4">Rank
</th>
<th style="width:60">City
</th>
<th style="width:60">Country
</th>
<th style="width:40">Population
</th>
<th style="width:6">Date of Estimate
</th></tr>
<tr>
<td>1
</td>
<td><b><a href="/wiki/Lagos" title="Lagos">Lagos</a></b>
</td>
<td><span class="flagicon"><img alt="" class="thumbborder" data-file-height="600" data-file-width="1200" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/commons/thumb/7/79/Flag_of_Nigeria.svg/23px-Flag_of_Nigeria.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/7/79/Flag_of_Nigeria.svg/35px-Flag_of_Nigeria.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/7/79/Flag_of_Nigeria.svg/46px-Flag_of_Nigeria.svg.png 2x" width="23"/> </span><a href="/wiki/Nigeria" title="Nigeria">Nigeria</a>
</td>
<td>21,320,000
</td>
<td>2019
</td></tr>
<tr>
<td>2
</td>
<td><i><b><a href="/wiki/Kinshasa" title="Kinshasa">Kinshasa</a></b></i>
</td>
<td><span cla
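
The printed table starts with a header row made of `<th>` cells, followed by data rows made of `<td>` cells. The sketch below assumes a simplified, hand-written copy of the table with only two data rows. It shows that the header row yields no `<td>` cells, which is why it can be skipped when collecting the data:

```python
from bs4 import BeautifulSoup

# Simplified, hand-written copy of the table structure (for illustration only)
html = """
<table class="sortable wikitable">
  <tr><th>Rank</th><th>City</th><th>Country</th><th>Population</th></tr>
  <tr><td>1</td><td>Lagos</td><td>Nigeria</td><td>21,320,000</td></tr>
  <tr><td>2</td><td>Kinshasa</td><td>Democratic Republic of the Congo</td><td>11,860,000</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
rows = soup.find("table", {"class": "sortable wikitable"}).find_all("tr")

# Number of <td> cells per row: the header row has none
cell_counts = [len(row.find_all("td")) for row in rows]

# Keep only the rows that actually contain <td> cells
records = [
    [cell.get_text().strip() for cell in row.find_all("td")]
    for row in rows
    if row.find_all("td")
]

print(cell_counts)
print(records)
```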

In [None]:
# Getting the table body rows
# ---
#
rows = right_table.find_all('tr')

In [None]:
# Getting the required text data from our table body rows
# ---
#
rank = []
cities = []
countries = []
population = [] 

for row in rows:
    cells = row.find_all('td') 

    # The header row has only <th> cells, so we skip any row
    # that does not have the four <td> cells that we need
    if len(cells) >= 4:
        rank.append(cells[0].text.strip())
        cities.append(cells[1].text.strip())
        countries.append(cells[2].text.strip())
        population.append(cells[3].text.strip()) 

In [None]:
# Previewing our lists
# ---
# 
print(rank[0:5])
print(cities[0:5])
print(countries[0:5])
print(population[0:5]) 

['1', '2', '3', '4', '5']
['Lagos', 'Kinshasa', 'Cairo', 'Giza', 'Johannesburg']
['Nigeria', 'Democratic Republic of the Congo', 'Egypt', 'Egypt', 'South Africa']
['21,320,000', '11,860,000', '9,500,000', '8,800,000', '5,640,000']
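
The population values above are strings with thousands separators, so they cannot be summed or sorted numerically as they are. A minimal sketch of one way to convert them to integers, using the five values from the preview:

```python
# The first five scraped population values, as shown in the preview above
population_text = ['21,320,000', '11,860,000', '9,500,000', '8,800,000', '5,640,000']

# Remove the thousands separators, then convert each value to an integer
population_numbers = [int(value.replace(',', '')) for value in population_text]

print(population_numbers)
print(max(population_numbers))
```

The same idea applies to any scraped numeric column: clean the text first, then convert the type.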


#### Step 4: Saving our Data

In [None]:
# Saving the scraped data to a dataframe
# ---
# 
countries_df = pd.DataFrame({"rank": rank, "city": cities, "country": countries, "population": population})
countries_df.sample(10)

| | rank | city | country | population |
|---|---|---|---|---|
| 64 | 65 | Kaduna | Nigeria | 760084 |
| 0 | 1 | Lagos | Nigeria | 21320000 |
| 55 | 56 | Port Elizabeth | South Africa | 876436 |
| 92 | 93 | Ikorodu | Nigeria | 535619 |
| 29 | 30 | Lomé | Togo | 1477658 |
| 88 | 89 | Warri | Nigeria | 557398 |
| 71 | 72 | Pointe-Noire | Republic of the Congo | 715334 |
| 26 | 27 | Kampala | Uganda | 1650800 |
| 28 | 29 | Harare | Zimbabwe | 1485231 |
| 85 | 86 | Kolwezi | Democratic Republic of the Congo | 572942 |


### Example 3: Scraping for Articles

In [None]:
# Example 3 
# ---
# Scrape the given article from the DailyPost Nigeria
# ---
# Website URL = https://dailypost.ng/2020/09/29/danbatta-inaugurates-evaluation-committee-for-2020-research-proposals/
# ---
# YOUR CODE GOES BELOW
# 

#### Step 1: Obtaining our Data

In [None]:
page = requests.get('https://dailypost.ng/2020/09/29/danbatta-inaugurates-evaluation-committee-for-2020-research-proposals/')
page

<Response [200]>

#### Step 2: Parsing

In [None]:
soup = BeautifulSoup(page.text, "html.parser")

#### Step 3: Extracting Required Elements

In [None]:
# Getting our heading content
# ---
# This is our tag:
# <h1 class="mvp-post-title left entry-title" itemprop="headline">
# ---
#
article_heading = soup.find('h1', {'class': 'mvp-post-title left entry-title'}).get_text()
article_heading

'Danbatta inaugurates Evaluation Committee for 2020 research proposals'

In [None]:
# Getting our article content
# ---
# Target tags: 
# All <p> tags contained in <div id="mvp-content-main">
# ---
#
article = soup.find('div', {'id': 'mvp-content-main'})
article

<div class="left relative" id="mvp-content-main">
<div class="code-block code-block-center code-block-5">
<div align="center">
<amp-ad class="i-amphtml-layout-fixed i-amphtml-layout-size-defined" data-multi-size="300x250,320x100,320x50,300x100,300x50" data-multi-size-validation="false" data-slot="/14001636/DP_Leaderboard_1" height="250" i-amphtml-layout="fixed" id="i-amp-1" style="width:320px;height:250px;" type="doubleclick" width="320">
</amp-ad>
</div></div>
<div class="code-block code-block-center code-block-4">
<div align="center">
<!-- AMP AdSlot 1 for Ad Unit 'DP_Leaderboard_1' ### Size: 728x90,320x50,320x100 -->
<amp-ad class="i-amphtml-layout-fixed i-amphtml-layout-size-defined" data-slot="/14001636/DP_Leaderboard_1" height="90" i-amphtml-layout="fixed" id="amp-aslot-1" style="width:728px;height:90px;" type="doubleclick" width="728">
</amp-ad>
</div>
<!-- End --></div>
<p>The Executive Vice Chairman and Chief Executive of the Nigerian Communications Commission, Prof. Umar Garb

In [None]:
# Let's find all the p tags that contain the article text
# ---
#
p_tags = article.find_all('p')

# We then strip all the surrounding whitespace.
# ---
#
p_tags_text = [tag.get_text().strip() for tag in p_tags]
p_tags_text

['The Executive Vice Chairman and Chief Executive of the Nigerian Communications Commission, Prof. Umar Garba Danbatta, has inaugurated a 15-member Evaluation Committee for the assessment of the 2020 Telecommunications based research from Academics in the Nigerian tertiary institutions.',
 'The Committee, chaired by Prof. Mu’azu Bashir, a Professor of Computer and Control Engineering and Head of Computer Engineering Department at Ahmadu Bello University, Zaria was inaugurated at the Commission’s Head Office in Abuja on Wednesday, September 24, 2020.',
 'Speaking during the inauguration, Danbatta said that the initiative speaks to the Commission’s commitment towards encouraging the development of indigenous innovative solutions that impact not only the Telecom industry/ICT sector positively but also the nation as a whole.',
 '“We want to continuously support research projects that can lead to the development of new products and services in the industry as the key enabler of the nation’s

#### Step 4: Saving our Data

In [None]:
# Combine list items into string.
article = ' '.join(p_tags_text)
article

'The Executive Vice Chairman and Chief Executive of the Nigerian Communications Commission, Prof. Umar Garba Danbatta, has inaugurated a 15-member Evaluation Committee for the assessment of the 2020 Telecommunications based research from Academics in the Nigerian tertiary institutions. The Committee, chaired by Prof. Mu’azu Bashir, a Professor of Computer and Control Engineering and Head of Computer Engineering Department at Ahmadu Bello University, Zaria was inaugurated at the Commission’s Head Office in Abuja on Wednesday, September 24, 2020. Speaking during the inauguration, Danbatta said that the initiative speaks to the Commission’s commitment towards encouraging the development of indigenous innovative solutions that impact not only the Telecom industry/ICT sector positively but also the nation as a whole. “We want to continuously support research projects that can lead to the development of new products and services in the industry as the key enabler of the nation’s digital econ
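
The step above keeps the article in a Python string. If the text should also be kept on disk, one option is to write it to a text file. This is only a sketch: the paragraphs are made up, and the file name `article_example.txt` in the system's temporary directory is an assumption chosen for illustration.

```python
import tempfile
from pathlib import Path

# Made-up paragraphs standing in for the scraped p_tags_text list
paragraphs = ["First paragraph of an article.", "", "Second paragraph."]

# Join the paragraphs into one string, skipping empty ones
article_text = " ".join(p for p in paragraphs if p)

# Hypothetical output location in the system's temporary directory
output_path = Path(tempfile.gettempdir()) / "article_example.txt"
output_path.write_text(article_text, encoding="utf-8")

# Read the file back to confirm what was saved
saved_text = output_path.read_text(encoding="utf-8")
print(saved_text)
```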

## <font color='#2F4F4F'>Challenges</font>

### <font color="green">Challenge 1</font>

In [None]:
# Challenge 1
# ---
# Write a Python program to extract h2 tags content from the Y Combinator website.
# --- 
# Website URL = https://www.ycombinator.com/about/
# ---
# YOUR CODE GOES BELOW
# 

In [None]:
# Solution
# ---

# 1. Obtaining our Data
# ---
#
page = requests.get('https://www.ycombinator.com/about/') 

# 2. Parsing
# ---
#
soup = BeautifulSoup(page.text, "html.parser")

# 3. Extracting Required Elements
# ---
#
results = soup.find_all('h2') 
h2_content = [] 
for result in results:
    text = result.get_text()
    h2_content.append(text)

# 4. Saving our Data
# ---
#
df = pd.DataFrame({"h2_content": h2_content})
df.head()

| | h2_content |
|---|---|
| 0 | Y Combinator provides seed funding for startup... |
| 1 | We make small investments in return for small ... |
| 2 | Twice a year, we invest a small amount of mone... |
| 3 | We think founders are most productive when the... |
| 4 | Hacker News |


### <font color="green">Challenge 2</font>

In [None]:
# Challenge 2
# ---
# Write a Python program that will get the first 10 products (not sponsored) on the following e-commerce website.
# Return product title and price (USD).
# ---
# Hint: Retrieve data from the attributes.
# --- 
# Website URL = https://www.jumia.co.ke/space-heaters-accessories/
# ---
# YOUR CODE GOES BELOW
# 

In [None]:
# Solution
# ---
# 

# 1. Obtaining our Data
# ---
#
page = requests.get('https://www.jumia.co.ke/space-heaters-accessories/') 

# 2. Parsing
# ---
#
soup = BeautifulSoup(page.text, "html.parser")

# 3. Extracting Required Elements
# ---
# 
results = soup.find_all('a', {'class': 'core'}) 

# We don't need the last 8 elements
results = results[:-8]

product_title = []
product_price = []

# We then loop through these tags
for result in results:
    title = result.get('data-name') 
    price = result.get('data-price')
 
    product_title.append(title) 
    product_price.append(price) 

# 4. Saving our Data
# ---
#
product_df = pd.DataFrame({"product_title": product_title, "product_price": product_price})
product_df.sample(10)

| | product_title | product_price |
|---|---|---|
| 22 | Quartz Portable Electric Room Heater | 17.9 |
| 15 | Halogen Room Heater With Two Heating Settings ... | 28.92 |
| 30 | Oil Readiator 1000W With 5 Fins | 66.46 |
| 29 | Room Heater | 17.2 |
| 2 | Halogen Room Quartz Heater | 27.36 |
| 31 | 5 Fins Oil Filled Room Heater | 70.37 |
| 9 | Halogen Portable Electric Room Heater | 17.36 |
| 16 | Portable Hot/ Warm/ Electric Room Heater Warme... | 23.45 |
| 1 | Quartz Portable Electric Room Heater | 14.26 |
| 32 | Room Fan Heater- White | 19.51 |
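
In this challenge the data sits in the attributes of each tag rather than in its text. The sketch below assumes a hand-written snippet that imitates the product links: the `href` values are made up, and the real page markup may differ. It reads the `data-name` and `data-price` attributes with `get()`:

```python
from bs4 import BeautifulSoup

# Hand-written snippet imitating the product links (hrefs are made up)
html = """
<a class="core" data-name="Room Heater" data-price="17.2" href="/room-heater.html">Room Heater</a>
<a class="core" data-name="Room Fan Heater- White" data-price="19.51" href="/fan-heater.html">Fan Heater</a>
"""
soup = BeautifulSoup(html, "html.parser")

products = []
for tag in soup.find_all("a", {"class": "core"}):
    title = tag.get("data-name")           # get() returns None if the attribute is missing
    price = float(tag.get("data-price"))   # attribute values are strings, so we convert
    products.append((title, price))

print(products)
```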


### <font color="green">Challenge 3</font>

In [None]:
# Challenge 3
# ---
# Scrape for quotes and author from the given URL then store your data in a pandas dataframe.
# --- 
# Website URL = http://quotes.toscrape.com/
# ---
# YOUR CODE GOES BELOW
# 

In [None]:
# Solution
# ---
# 

# 1. Obtaining our Data
# ---
#
page = requests.get('http://quotes.toscrape.com/') 

# 2. Parsing
# ---
#
soup = BeautifulSoup(page.text, "html.parser")

# 3. Extracting Required Elements
# ---
# 
quote_results   = soup.find_all('span', {'itemprop': 'text'}) 
quotes = []
for result in quote_results:
    quote       = result.get_text()
    quotes.append(quote) 

authors_results = soup.find_all('small', {'itemprop': 'author'}) 
authors = []
for result in authors_results:
    author      = result.get_text() 
    authors.append(author) 

# 4. Saving our Data
# ---
#
quotes_df = pd.DataFrame({"quote": quotes, "author": authors})
quotes_df.head()

| | quote | author |
|---|---|---|
| 0 | “The world as we have created it is a process ... | Albert Einstein |
| 1 | “It is our choices, Harry, that show what we t... | J.K. Rowling |
| 2 | “There are only two ways to live your life. On... | Albert Einstein |
| 3 | “The person, be it gentleman or lady, who has ... | Jane Austen |
| 4 | “Imperfection is beauty, madness is genius and... | Marilyn Monroe |
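
The solution above collects quotes and authors in two separate passes, which relies on both lists having the same length and order. An alternative sketch loops over each `<div class="quote">` block and reads both values from the same block, so each quote stays paired with its author. It assumes a hand-written snippet with made-up quotes and authors.

```python
from bs4 import BeautifulSoup

# Hand-written snippet following the structure of the quotes page
# (the quotes and authors are made up)
html = """
<div class="quote">
  <span class="text" itemprop="text">“Quote one.”</span>
  <span>by <small class="author" itemprop="author">Author A</small></span>
</div>
<div class="quote">
  <span class="text" itemprop="text">“Quote two.”</span>
  <span>by <small class="author" itemprop="author">Author B</small></span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

records = []
for block in soup.find_all("div", {"class": "quote"}):
    # Both values are searched for inside the same quote block
    quote = block.find("span", {"itemprop": "text"}).get_text()
    author = block.find("small", {"itemprop": "author"}).get_text()
    records.append({"quote": quote, "author": author})

print(records)
```

A list of dictionaries like `records` can be passed directly to `pd.DataFrame()`.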


### <font color="green">Challenge 4</font>

In [None]:
# Challenge 4
# ---
# Write a Python program to download IMDB's Popular 100 movie data. 
# Return (movie name, release year and imdb rating).
# --- 
# Website URL = https://www.imdb.com/chart/moviemeter
# ---
# YOUR CODE GOES BELOW
# 

In [None]:
# Solution
# ---
# 

# 1. Obtaining our Data
# ---
#
page = requests.get('https://www.imdb.com/chart/moviemeter') 

# 2. Parsing
# ---
#
soup = BeautifulSoup(page.text, "html.parser")

# 3. Extracting Required Elements
# ---
#

# Get the movies 
movies_results = soup.find_all('td', {'class': 'titleColumn'}) 
movies = []
for movie in movies_results:
    moviename = movie.find('a').get_text()
    movies.append(moviename) 

# Get the release years
release_years = []
for movie in movies_results:
    # we use .strip('()') to remove brackets from start and end of year
    year = movie.find('span').get_text().strip('()')  
    release_years.append(year)

# Get the ratings
ratings = soup.find_all('td', {'class': 'ratingColumn imdbRating'})
movie_ratings = []
for rating in ratings:

    # some of the <td> tags do not have a nested <strong> tag, so we check for this
    strong_tag = rating.find('strong')
    if strong_tag is not None:
        score = strong_tag.get_text()
    else:
        score = ''
    movie_ratings.append(score)

# 4. Saving our Data
# ---
#
movies_df = pd.DataFrame({"movie": movies, "release_year": release_years, "rating": movie_ratings})
movies_df.sample(10)

| | movie | release_year | rating |
|---|---|---|---|
| 47 | The Lie | 2018 | 5.3 |
| 71 | Watchmen | 2009 | 7.6 |
| 13 | After We Collided | 2020 | 5.4 |
| 42 | Beauty and the Beast | 2017 | 7.1 |
| 23 | The Paramedic | 2020 | 5.6 |
| 90 | Twilight | 2008 | 5.2 |
| 61 | Infidel | 2019 | 6.7 |
| 4 | Antebellum | 2020 | 5.5 |
| 83 | The Dark Knight Rises | 2012 | 8.4 |
| 49 | #Alive | 2020 | 6.2 |
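
Checking for a missing nested tag is a pattern that comes up often when scraping. The sketch below wraps it in a small function. `text_or_default` is a hypothetical helper, and the HTML is a hand-written snippet imitating two rating cells, one of them empty.

```python
from bs4 import BeautifulSoup

# Hand-written snippet: one rating cell has a <strong> tag, the other is empty
html = """
<table>
  <tr><td class="ratingColumn imdbRating"><strong>7.6</strong></td></tr>
  <tr><td class="ratingColumn imdbRating"></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")


def text_or_default(parent, tag_name, default=""):
    """Return the text of the first nested tag, or a default if it is missing."""
    child = parent.find(tag_name)
    return child.get_text().strip() if child is not None else default


cells = soup.find_all("td", {"class": "ratingColumn imdbRating"})
ratings = [text_or_default(cell, "strong") for cell in cells]

print(ratings)
```

Without the check, calling `get_text()` on the `None` returned by `find()` would raise an `AttributeError`.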


### <font color="green">Challenge 5</font>

In [None]:
# Challenge 5
# ---
# You are required to scrape the following article found in the given URL.
# --- 
# Website URL = https://nation.africa/kenya/life-and-style/dn2/why-we-re-parenting-under-social-media-limelight-2452890
# ---
# YOUR CODE GOES BELOW
# 

In [None]:
# Solution
# ---
# 

# 1. Obtaining our Data
# ---
#
page = requests.get('https://nation.africa/kenya/life-and-style/dn2/why-we-re-parenting-under-social-media-limelight-2452890') 

# 2. Parsing
# ---
#
soup = BeautifulSoup(page.text, "html.parser")

# 3. Extracting Required Elements
# ---
# 

# Getting our article heading content
# --- 
#
article_heading = soup.find('h1', {'class': 'title-medium'}).get_text() 

article_heading

'Why we’re parenting under the social media limelight'

In [None]:
# Getting our paragraph content
# --- 
#
section = soup.find('div', {'class': 'article-content'}) 
section = section.find('div', {'class': 'grid-container-small'}) 
section = section.find('div', {'class': 'col-1-1'})
section = section.find_all('section')[1] # we select the 2nd section tag
section = section.find_all('p')

p_tags_text = [tag.get_text().strip() for tag in section]
p_tags_text

['If it takes a village to raise a child, then social media only makes the village bigger. Milly Chebby Mwangi, an influencer mum, understood this well when she shared a video of her child bathing in a basin shortly after major surgery, on her Instagram stories.',
 'That her daughter’s scar was healing was a feat she wished to share with her Instagram followers, who had shown her massive support during the trying time. In the same way, they were quick to protect the child’s privacy after the video went live.',
 '“People urged me not to post the baby’s naked image online, saying, it was not a good thing to do,” she tells the DN2 Parenting.',
 'While she would eventually take the video down after the criticism, the incident begged the question: to what extent should children’s private lives be shared online?',
 'From the first steps when a child is learning to walk to when it can throw tantrums, documenting and sharing the child’s life as it grows instils a feeling of solidarity in the p
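
The chain of `find()` calls above can also be expressed as a single CSS selector. This is a sketch on a hand-written snippet that imitates the nesting of the article page. The paragraph text is made up, and the real page may need a different selector.

```python
from bs4 import BeautifulSoup

# Hand-written snippet imitating the nesting of the article page
html = """
<div class="article-content">
  <div class="grid-container-small">
    <div class="col-1-1">
      <section><p>Byline</p></section>
      <section><p> First paragraph. </p><p>Second paragraph.</p></section>
    </div>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# A space in a CSS selector means "descendant of"
sections = soup.select(
    "div.article-content div.grid-container-small div.col-1-1 section"
)

# We select the 2nd section tag, then collect the text of its <p> tags
paragraphs = [p.get_text().strip() for p in sections[1].find_all("p")]

print(paragraphs)
```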

In [None]:
# Combine list items into string
# ---
#
article = ' '.join(p_tags_text)
article

'If it takes a village to raise a child, then social media only makes the village bigger. Milly Chebby Mwangi, an influencer mum, understood this well when she shared a video of her child bathing in a basin shortly after major surgery, on her Instagram stories. That her daughter’s scar was healing was a feat she wished to share with her Instagram followers, who had shown her massive support during the trying time. In the same way, they were quick to protect the child’s privacy after the video went live. “People urged me not to post the baby’s naked image online, saying, it was not a good thing to do,” she tells the DN2 Parenting. While she would eventually take the video down after the criticism, the incident begged the question: to what extent should children’s private lives be shared online? From the first steps when a child is learning to walk to when it can throw tantrums, documenting and sharing the child’s life as it grows instils a feeling of solidarity in the parent. They no lo