## Web Scraping with Pagination

You were asked to scrape the URL https://venturebeat.com. Applying what we have learned so far, this should be straightforward.

In [1]:
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd

In [2]:
# 1. GET THE HTML CONTENT OF THE MAIN PAGE
def geturlhtml(main_url):
    
    # make HTTP request
    r = requests.get(main_url)

    # r.text is a string even when the request fails, so a None check
    # never triggers; raise requests.HTTPError on a 4xx/5xx status instead
    r.raise_for_status()

    # convert to BeautifulSoup object
    html_soup = BeautifulSoup(r.text, "html.parser")
        
    return html_soup

# 2. FROM THE HTML CONTENT OF THE MAIN PAGE GET THE ARTICLE LINKS

def featuredlinks(main_htmldoc):
    featured = main_htmldoc.find('div', class_='FeaturedArticles')
    return [ i['href'] for i in featured.find_all('a') ]

def getarticlelinks(html_doc):
    
    article_links = []
    links = html_doc.find_all('a', class_='ArticleListing__title-link')
    for i in links:
        article_links.append(i['href'])
#     print(len(links))
    return article_links


# 3. GET THE HTML CONTENT FOR EACH ARTICLE LINK AND GET THE ARTICLE TITLE AND TEXT
def gettextfromarticleurl(article_url):
    
    # new requests to individual article pages
    r = requests.get(article_url)
    
    # raise requests.HTTPError on a 4xx/5xx status
    r.raise_for_status()
    
    # convert to beautiful soup object
    html_soup = BeautifulSoup(r.text, "html.parser")
    
#     # grab the category
#     cat_class = 'Label Label--single Label--brand Label__link--brand'
#     article_category = html_soup.find(class_=cat_class).text.strip()
                                      
    # grab the title
    article_title = html_soup.find('h1', class_='article-title').text
    
    # grab the body
    articlecontent = html_soup.find(class_= 'article-content')
    article_text = []
    for i in articlecontent.find_all('p', recursive=False):
        article_text.append(i.text.strip())
    article_text = " ".join(article_text)
    
    return article_title, article_text 

In [3]:
url = 'https://venturebeat.com'

# 1. GET THE HTML CONTENT OF THE MAIN PAGE
html_soup = geturlhtml(url)

# 2. FROM THE HTML CONTENT OF THE MAIN PAGE GET THE ARTICLE LINKS
article_links = featuredlinks(html_soup) + getarticlelinks(html_soup)

# 3. GET THE HTML CONTENT FOR EACH ARTICLE LINK AND GET THE ARTICLE TITLE AND TEXT
data = [gettextfromarticleurl(i) for i in article_links]
article_titles = [i for (i,j) in data]
article_texts = [j for (i,j) in data]
for i in article_titles:
    print(i)

How Merck works with Seeqc to cut through quantum computing hype
Amazon releases DeepRacer software in open source
Campfire raises $8 million to advance AR/VR for product design
CloudCheckr survey says cloud computing adoption is accelerating
Atlassian’s Jira Work Management encourages team collaboration
Gartner says low-code, RPA, and AI driving growth in ‘hyperautomation’
Saas provider BoostUp.ai nabs $6M to support revenue operations
DevOps orchestration platform Opsera raises $15M
AI-powered construction project platform OpenSpace nabs $55M
Viso Trust assesses third-party cybersecurity risk with AI, raises $3M
MessageBird acquires email data platform SparkPost, closes $1B round
Secrets management and authentication platform Akeyless raises $14M
Accenture says IT investments are bearing fruit
Salesforce launches employee upskilling toolkit for businesses
Sysdig raises $189M to monitor containers and apps in the cloud
AMD bets on strong demand for chips as revenue soars 93%
Microsoft

In [4]:
print(len(article_titles))
print(len(article_texts))

43
43


### A. Pagination

Pagination is a web design technique that splits content across multiple pages, presenting large datasets to users in digestible chunks. Common pagination methods include:
- numbered pagination
- infinite scrolling
- next button
- load more buttons, etc. 

While pagination improves the browsing experience, it makes web scraping more difficult.
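As an illustration of the "next button" style, here is a minimal sketch that looks for a link marked `rel="next"` in a page's HTML and resolves it to an absolute URL. It uses only the standard library so it runs on its own. The names `NextLinkParser` and `find_next_page` are hypothetical, and the `rel="next"` markup is an assumption about how a site labels its next-page link, not something confirmed for venturebeat.com.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkParser(HTMLParser):
    """Record the href of the first <a> or <link> tag whose rel contains 'next'."""

    def __init__(self):
        super().__init__()
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        # ignore other tags, and stop looking once a next link is found
        if tag not in ("a", "link") or self.next_href is not None:
            return
        attr_map = dict(attrs)
        # rel may hold several space-separated values, or be missing
        rel_values = (attr_map.get("rel") or "").split()
        if "next" in rel_values and attr_map.get("href"):
            self.next_href = attr_map["href"]


def find_next_page(html_text, current_url):
    """Return the absolute URL of the next page, or None if no next link exists."""
    parser = NextLinkParser()
    parser.feed(html_text)
    if parser.next_href is None:
        return None
    # resolve relative hrefs such as '/page/2/' against the current page
    return urljoin(current_url, parser.next_href)


# example.com is a placeholder site used only for this demonstration
sample_html = '<div><a class="nav" rel="next" href="/page/2/">Next</a></div>'
print(find_next_page(sample_html, "https://example.com/"))
# https://example.com/page/2/
```

A scraper built this way keeps requesting `find_next_page(...)` until it returns `None`, rather than guessing page numbers in advance.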

Let's look at an example. The webpage we want to scrape is https://venturebeat.com. When you scroll to the bottom of the page, you will notice that at some point the URL changes to https://venturebeat.com/page/2/, and this pattern continues with each further page.

So, let's repeat what we did above for page 2 and see whether we get new article links.

In [8]:
# 1. GET THE LIST OF WEBPAGES TO SCRAPE
web_urls = ['https://venturebeat.com', 'https://venturebeat.com/page/2']

# 2. FOR EACH WEBPAGE GET THE ARTICLE LINKS
n_urls = len(web_urls)
all_urls = []

for i in range(0,n_urls):
    html_soup = geturlhtml(web_urls[i])
    if i == 0:
        all_urls.extend( featuredlinks(html_soup) + getarticlelinks(html_soup) )
    else:
        all_urls.extend( getarticlelinks(html_soup) )
print(f'There are {len(all_urls)} article urls to retrieve.')

# 3. FOR EACH ARTICLE LINK GET THE HTML CONTENT - TITLE AND TEXT
data = [gettextfromarticleurl(i) for i in all_urls]
article_titles = [i for (i,j) in data]
article_texts = [j for (i,j) in data]

print(f'Text retrieved for {len(article_texts)} articles')

There are 83 article urls to retrieve.
Text retrieved for 83 articles


Now we can easily extend the list of webpage URLs by generating them from page numbers.

In [10]:
# 1. GET THE LIST OF WEBPAGES
web_urls = ['https://venturebeat.com']

page_no = range(2,15,1)
for i in page_no:
    web_urls.append('https://venturebeat.com/page/'+str(i))
print(web_urls)

['https://venturebeat.com', 'https://venturebeat.com/page/2', 'https://venturebeat.com/page/3', 'https://venturebeat.com/page/4', 'https://venturebeat.com/page/5', 'https://venturebeat.com/page/6', 'https://venturebeat.com/page/7', 'https://venturebeat.com/page/8', 'https://venturebeat.com/page/9', 'https://venturebeat.com/page/10', 'https://venturebeat.com/page/11', 'https://venturebeat.com/page/12', 'https://venturebeat.com/page/13', 'https://venturebeat.com/page/14']
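The same list can be produced by a small reusable helper. This is a sketch only: `build_page_urls` is a hypothetical name, and it assumes the `/page/<n>` URL pattern observed above holds for every page number requested.

```python
def build_page_urls(base_url, last_page):
    """Return the front page URL followed by base_url/page/2 ... base_url/page/last_page."""
    # strip a trailing slash so the pieces join without a double slash
    base = base_url.rstrip("/")
    return [base] + [f"{base}/page/{n}" for n in range(2, last_page + 1)]


web_urls = build_page_urls("https://venturebeat.com", 14)
print(len(web_urls))
# 14
print(web_urls[-1])
# https://venturebeat.com/page/14
```

Keeping the base URL and the last page number as parameters makes it easy to change how many pages are scraped without editing the loop.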


Repeat the steps above for the full list of pages.

In [11]:
# 2. FOR EACH WEBPAGE GET THE ARTICLE LINKS
n_urls = len(web_urls)
all_urls = []

for i in range(0,n_urls):
    html_soup = geturlhtml(web_urls[i])
    if i == 0:
        all_urls.extend( featuredlinks(html_soup) + getarticlelinks(html_soup) )
    else:
        all_urls.extend( getarticlelinks(html_soup) )
print(f'There are {len(all_urls)} article urls to retrieve.')

# 3. FOR EACH ARTICLE LINK GET THE HTML CONTENT - TITLE AND TEXT
data = [gettextfromarticleurl(i) for i in all_urls]
article_titles = [i for (i,j) in data]
article_texts = [j for (i,j) in data]

print(f'Text retrieved for {len(article_texts)} articles')

There are 563 article urls to retrieve.
Text retrieved for 563 articles
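The loop above trusts a fixed page range and keeps every link it finds. A more defensive variant, sketched below, drops duplicate links and stops at the first page that contributes nothing new. The name `collect_article_links` is hypothetical, and whether the 563 links actually contain duplicates is not known from the output. The link-fetching step is passed in as a function so the sketch runs without network access.

```python
def collect_article_links(page_urls, get_links):
    """Gather unique article links page by page.

    get_links is any callable that maps a page URL to a list of article URLs.
    Collection stops at the first page that yields no previously unseen link.
    """
    seen = set()
    ordered = []
    for page_url in page_urls:
        # keep only links not seen on earlier pages
        new_links = [link for link in get_links(page_url) if link not in seen]
        # drop repeats within this page while preserving first-seen order
        new_links = list(dict.fromkeys(new_links))
        if not new_links:
            break
        seen.update(new_links)
        ordered.extend(new_links)
    return ordered


# Stand-in for a real site: page name -> links found on that page
fake_site = {
    "p1": ["a", "b", "b"],
    "p2": ["b", "c"],
    "p3": [],
    "p4": ["d"],
}
print(collect_article_links(["p1", "p2", "p3", "p4"], lambda u: fake_site[u]))
# ['a', 'b', 'c']
```

With the functions defined earlier, the callable could be `lambda u: getarticlelinks(geturlhtml(u))`.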


In [12]:
# 4. CONVERT DATA TO DICTIONARY

data_dictionary = {'url':[], 'title':[], 'text':[]}
tracker = 0
# NOTE: article_links holds only the front-page links collected in In [3];
# loop over all_urls instead to process the links from every page
for i in article_links:
    title, text = gettextfromarticleurl(i)
    data_dictionary['url'].append(i)
    data_dictionary['title'].append(title)
    data_dictionary['text'].append(text)
    tracker += 1
    print('processed ', tracker, ' files')
    
# number of listing pages collected in web_urls
print('no of webpages to scrape: ', len(web_urls))
print('total no. of article links: ', len(article_links))
print('total no. of article titles and texts: ', (len(data_dictionary['title']), 
                                                  len(data_dictionary['text'])) ) 

processed  1  files
processed  2  files
processed  3  files
processed  4  files
processed  5  files
processed  6  files
processed  7  files
processed  8  files
processed  9  files
processed  10  files
processed  11  files
processed  12  files
processed  13  files
processed  14  files
processed  15  files
processed  16  files
processed  17  files
processed  18  files
processed  19  files
processed  20  files
processed  21  files
processed  22  files
processed  23  files
processed  24  files
processed  25  files
processed  26  files
processed  27  files
processed  28  files
processed  29  files
processed  30  files
processed  31  files
processed  32  files
processed  33  files
processed  34  files
processed  35  files
processed  36  files
processed  37  files
processed  38  files
processed  39  files
processed  40  files
processed  41  files
processed  42  files
processed  43  files
no of webpages to scrape:  14
total no. of article links:  43
total no. of article titles and texts:  (43,
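The cell above requests every article a second time, even though the titles and texts were already fetched into `data`. A sketch of reusing those results is shown below: it pairs each URL with its `(title, text)` tuple and builds one dictionary per row. `to_records` is a hypothetical helper, and the demo values are placeholders standing in for `all_urls` and `data`.

```python
def to_records(urls, scraped):
    """Pair each URL with its (title, text) tuple and return a list of row dicts."""
    if len(urls) != len(scraped):
        raise ValueError("urls and scraped must have the same length")
    return [
        {"url": url, "title": title, "text": text}
        for url, (title, text) in zip(urls, scraped)
    ]


# Placeholder values standing in for all_urls and data
demo_urls = ["https://example.com/a", "https://example.com/b"]
demo_data = [("Title A", "Text A"), ("Title B", "Text B")]

rows = to_records(demo_urls, demo_data)
print(rows[0]["title"])
# Title A
```

A list of row dictionaries like this can be passed straight to `pd.DataFrame(rows)`.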

In [13]:
# CONVERT DICTIONARY TO DATAFRAME
df = pd.DataFrame.from_dict(data_dictionary)
print(df.shape)
df.head()

(43, 3)


| | url | title | text |
|---|---|---|---|
| 0 | https://venturebeat.com/2021/04/27/how-merck-w... | How Merck works with Seeqc to cut through quan... | When it comes to grappling with the future of ... |
| 1 | https://venturebeat.com/2021/04/27/amazon-make... | Amazon releases DeepRacer software in open source | In November 2018, Amazon launched AWS DeepRace... |
| 2 | https://venturebeat.com/2021/04/27/campfire-ra... | Campfire raises $8 million to advance AR/VR fo... | Campfire has raised $8 million in funding for ... |
| 3 | https://venturebeat.com/2021/04/28/cloudcheckr... | CloudCheckr survey says cloud computing adopti... | Cloud transformation is moving quickly, accord... |
| 4 | https://venturebeat.com/2021/04/28/atlassians-... | Atlassian’s Jira Work Management encourages te... | At its Team21 conference today, Atlassian unve... |


In [55]:
df.iloc[0,0]

'https://venturebeat.com/2021/04/27/how-merck-works-with-seeqc-to-cut-through-quantum-computing-hype/'

In [56]:
df.iloc[0,1]

'How Merck works with Seeqc to cut through quantum computing hype'

In [57]:
df.iloc[0,2]

'When it comes to grappling with the future of quantum computing, enterprises are scrambling to figure just how seriously they should take this new computing architecture. Many executives are trapped between the anxiety of missing the next wave of innovation and the fear of being played for suckers by people overhyping quantum’s revolutionary potential. That’s why the approach to quantum by pharmaceutical giant Merck offers a clear-eyed roadmap for other enterprises to follow. The company is taking a cautious but informed approach that includes setting up an internal working group and partnering with quantum startup Seeqc to monitor developments while keeping an open mind. According to Philipp Harbach, a theoretical chemist who is head of Merck’s In Silico Research group, a big part of the challenge remains trying to keep expectations of executives reasonable even as startup funding to quantum soars and the hype continues to mount. “We are not evangelists of quantum computers,” Harbach

In [58]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43 entries, 0 to 42
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   url     43 non-null     object
 1   title   43 non-null     object
 2   text    43 non-null     object
dtypes: object(3)
memory usage: 1.1+ KB


In [None]:
df.to_csv('venturebeat.csv', index=False)