## Web Scraping with Pagination

You were asked to scrape https://venturebeat.com. Applying what we have learned so far, this should be straightforward.

In [5]:
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd

In [6]:
# 1. GET THE HTML CONTENT OF THE MAIN PAGE
def geturlhtml(main_url):
    
    # make HTTP request
    r = requests.get(main_url)
    html_content = r.text

    # if the request succeeded and returned some text,
    # convert it to a BeautifulSoup object
    if r.ok and html_content:
        html_soup = BeautifulSoup(html_content, "html.parser")
    else:
        raise Exception('Error getting data from {}'.format(main_url))
        
    return html_soup

# 2. FROM THE HTML CONTENT OF THE MAIN PAGE GET THE ARTICLE LINKS

def getheadlink(main_htmldoc):
    return [ main_htmldoc.find('a', class_='Hero__title-link')['href'] ]

def getarticlelinks(html_doc):
    
    article_links = []
    links = html_doc.find_all('a', class_='ArticleListing__title-link')
    for i in links:
        article_links.append(i['href'])
#     print(len(links))
    return article_links


# 3. GET THE HTML CONTENT FOR EACH ARTICLE LINK AND GET THE ARTICLE TITLE AND TEXT
def gettextfromarticleurl(article_url):
    
    # new requests to individual article pages
    r = requests.get(article_url)
    html_content = r.text
    
    # if the request succeeded, convert to a BeautifulSoup object
    if r.ok and html_content:
        html_soup = BeautifulSoup(html_content, "html.parser")
    else:
        raise Exception('Error getting data from {}'.format(article_url))
    
    # grab the category
    cat_class = 'Label Label--single Label--brand Label__link--brand'
    article_category = html_soup.find(class_=cat_class).text.strip()
                                      
    # grab the title
    article_title = html_soup.find('title').text.strip()
    article_title = re.sub(r" \|.*$", "", article_title)
    
    # grab the body
    articlecontent = html_soup.find(class_= 'article-content')
    article_text = []
    for i in articlecontent.find_all('p'):
        article_text.append(i.text.strip())
    article_text = " ".join(article_text)
    
    return article_category, article_title, article_text 
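
The title step in `gettextfromarticleurl` relies on the regular expression `r" \|.*$"`, which removes everything from the first space-plus-pipe to the end of the string. The sketch below isolates that step so its effect is easy to see. The helper name `clean_title` and the sample strings are made up for illustration; they are not taken from the scraped pages.

```python
import re

def clean_title(raw_title):
    """Strip surrounding whitespace, then drop everything from the first ' |' onward."""
    return re.sub(r" \|.*$", "", raw_title.strip())

# Hypothetical <title> strings, invented for this example
print(clean_title("  Some headline | VentureBeat  "))
print(clean_title("Headline without a suffix"))
```

A title that contains no pipe is returned unchanged apart from the whitespace stripping, so the function is safe to apply to every page.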

In [8]:
url = 'https://venturebeat.com'

# 1. GET THE HTML CONTENT OF THE MAIN PAGE
html_soup = geturlhtml(url)

# 2. FROM THE HTML CONTENT OF THE MAIN PAGE GET THE ARTICLE LINKS
article_links = getheadlink(html_soup) + getarticlelinks(html_soup)

# 3. GET THE HTML CONTENT FOR EACH ARTICLE LINK AND GET THE ARTICLE TITLE AND TEXT
data = [gettextfromarticleurl(i) for i in article_links]
article_categories = [i for (i,j,k) in data]
article_titles = [j for (i,j,k) in data]
article_texts = [k for (i,j,k) in data]
for i in article_titles:
    print(i)

Nvidia's DLSS 2.0 aims to prove the technology is essential
MIT CSAIL's VISTA autonomous vehicles simulator transfers skills learned to the real world
Google's big download for game developers: 116 billion downloads, new Play services, and Stadia updates
Half-Life: Alyx interview -- Reviving an iconic franchise with VR
Google open-sources framework that reduces AI training costs by up to 80%
FDA allows AliveCor's AI ECG to detect coronavirus drug-induced heart problems
Gears Tactics dev Splash Damage is making a Google Stadia exclusive
Half-Life: Alyx review -- A great VR game for the wrong time
Unity Technologies launches cloud-based game simulations for developer playtests
Uber's Enhanced POET creates and solves AI agent training challenges
Google unveils Android Performance Tuner, Android GPU Inspector, and Cloud Firestore for game developers
HyperX ChargePlay Clutch makes mobile gaming less stressful
New York Times acquires Audm, whose narrators turn long-form journalism into audio

In [9]:
print(len(article_titles))
print(len(article_texts))

41
41
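
The cell that builds `data` splits the list of `(category, title, text)` tuples with one list comprehension per field. As an alternative sketch, `zip(*data)` transposes the rows into columns in a single step. The two records below are hypothetical stand-ins for scraped articles.

```python
# Hypothetical scraped records, each a (category, title, text) tuple
data = [
    ("AI", "First headline", "Body of the first article."),
    ("Games", "Second headline", "Body of the second article."),
]

# zip(*data) yields one tuple per field; convert each to a list
article_categories, article_titles, article_texts = (list(col) for col in zip(*data))

print(article_categories)
print(article_titles)
```

Note that this unpacking raises a `ValueError` when `data` is empty, so the comprehension form is the safer choice if a scrape might return nothing.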


### A. Pagination

Pagination is a web design technique that splits content across multiple pages, presenting large datasets to users in a digestible manner. Common pagination methods include:
- numbered pagination
- infinite scrolling
- next button
- load more buttons, etc. 

While pagination improves the browsing experience, it makes web scraping more difficult.
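
Each pagination method calls for a slightly different scraping loop. As one illustration, a site with a next button can be walked by following the next link until there is none. The sketch below is a minimal version under stated assumptions: `follow_next_links` and `get_next_url` are hypothetical names, and a dictionary stands in for a real site so the example runs without network access.

```python
def follow_next_links(start_url, get_next_url, max_pages=10):
    """Visit pages by repeatedly following a 'next' link.

    get_next_url(url) is a caller-supplied function (hypothetical here) that
    returns the URL behind the page's next button, or None on the last page.
    """
    visited = []
    url = start_url
    # Stop on the last page, on a loop back to a seen page, or at the page cap
    while url is not None and url not in visited and len(visited) < max_pages:
        visited.append(url)
        url = get_next_url(url)
    return visited

# Stand-in for a real site: each page maps to the target of its next button
fake_site = {
    "https://example.com/list": "https://example.com/list?page=2",
    "https://example.com/list?page=2": "https://example.com/list?page=3",
    "https://example.com/list?page=3": None,
}

pages = follow_next_links("https://example.com/list", fake_site.get)
print(pages)
```

In a real scraper, `get_next_url` would fetch the page and read the `href` of its next link. The `max_pages` cap guards against sites whose next links never end.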

Let's look at an example. The page we want to scrape is https://venturebeat.com. As you scroll toward the bottom of the page, the URL changes to https://venturebeat.com/page/2/, and the same pattern continues for each further page.
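
The URL pattern just described can be generated programmatically. The sketch below assumes that page 1 is the bare home page URL and that every later page lives at `/page/<n>/`, as observed above. The helper name `build_page_urls` is an illustrative choice.

```python
def build_page_urls(base_url, first_page, last_page):
    """Return listing-page URLs for first_page through last_page, inclusive."""
    base = base_url.rstrip("/")
    urls = []
    for page_no in range(first_page, last_page + 1):
        if page_no == 1:
            # Assumption: the first page is the home page itself
            urls.append(base)
        else:
            urls.append("{}/page/{}/".format(base, page_no))
    return urls

print(build_page_urls("https://venturebeat.com", 1, 3))
```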

So, let's repeat what we did above for the page 2 URL and see whether we get new article links.

In [10]:
main_url = 'https://venturebeat.com/page/2'

# 1. GET THE HTML CONTENT OF THE MAIN PAGE
html_soup = geturlhtml(main_url)

# 2. FROM THE HTML CONTENT OF THE MAIN PAGE GET THE ARTICLE LINKS
article_links2 = getarticlelinks(html_soup)
article_links2

['https://venturebeat.com/2020/03/20/best-cbd-oil/',
 'https://venturebeat.com/2020/03/20/ai-weekly-how-data-scientists-are-helping-to-flatten-the-pandemic-curve/',
 'https://venturebeat.com/2020/03/20/twilio-updates-its-twilioquest-3-game-to-teach-kids-to-code-while-at-home/',
 'https://venturebeat.com/2020/03/20/the-retrobeat-crackdown-is-a-fun-way-to-blast-the-stress-away/',
 'https://venturebeat.com/2020/03/20/diligent-robotics-raises-10-million-for-nurse-assistant-robot-moxi/',
 'https://venturebeat.com/2020/03/20/kahlief-adams-on-scraping-by-in-the-podcast-game-how-games-make-money/',
 'https://venturebeat.com/2020/03/20/4-reasons-you-should-be-moving-toward-zero-trust-security/',
 'https://venturebeat.com/2020/03/20/world-of-warcraft-gives-players-a-100-experience-bonus-through-april-20/',
 'https://venturebeat.com/2020/03/20/probeat-the-rise-and-inevitable-fall-of-microsoft-teams-and-slack/',
 'https://venturebeat.com/2020/03/20/steams-top-20-new-games-for-february-2020/',
 'ht

Now we can apply the same steps to more than one page: collect the article links from a range of pages using their page numbers, then send an HTTP request to each article link to obtain its category, title, and text.
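
When crawling many pages, it helps to pause between requests, skip pages that fail instead of aborting the whole run, and drop duplicate links that appear on more than one page. The following is a minimal sketch of such a loop, not the notebook's own implementation. `collect_article_links` is a hypothetical helper, and the fetch and extract steps are passed in as functions; in this notebook they would be `geturlhtml` and `getarticlelinks`. Here, small stand-ins keep the example runnable offline.

```python
import time

def collect_article_links(page_urls, fetch_page, extract_links, delay_seconds=1.0):
    """Gather unique article links from several listing pages, in first-seen order."""
    seen = set()
    article_links = []
    failed_pages = []
    for n, page_url in enumerate(page_urls):
        if n > 0 and delay_seconds > 0:
            time.sleep(delay_seconds)  # pause between consecutive requests
        try:
            page = fetch_page(page_url)
        except Exception as err:
            # Record the failure and move on to the next page
            failed_pages.append((page_url, str(err)))
            continue
        for link in extract_links(page):
            if link not in seen:
                seen.add(link)
                article_links.append(link)
    return article_links, failed_pages

# Stand-ins for real pages: each "page" is already a list of links
fake_pages = {"p1": ["a", "b"], "p2": ["b", "c"]}

def fake_fetch(url):
    return fake_pages[url]  # raises KeyError for an unknown page

links, failed = collect_article_links(
    ["p1", "p2", "p3"], fake_fetch, lambda page: page, delay_seconds=0
)
print(links)
print(failed)
```

Returning the failed pages alongside the links makes it possible to retry them later without repeating the successful requests.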

In [12]:
# # 1. GET THE HTML CONTENT OF THE MAIN PAGE
# page_no = range(2,51,1)
# page_urls = []
# for i in page_no:
#     page_urls.append('https://venturebeat.com/page/'+str(i))

# html_soups = [ geturlhtml(url) for url in page_urls ]
# html_soups


# # 2. FROM THE HTML CONTENT OF THE MAIN PAGE GET THE ARTICLE LINKS
# head_link = getheadlink(html_soups[0])
# other_links = [ getarticlelinks(html_soup) for html_soup in html_soups ]
# other_links = [eachlink 
#                for eachlist in other_links 
#                for eachlink in eachlist]
# article_links = head_link + other_links


# # 3. GET THE HTML CONTENT FOR EACH ARTICLE LINK AND GET THE ARTICLE TITLE AND TEXT FROM IT

# data_dictionary = {'url':[], 'category':[], 'title':[], 'text':[]}
# tracker = 0
# for i in article_links:
#     category, title, text = gettextfromarticleurl(i)
#     data_dictionary['url'].append(i)
#     data_dictionary['category'].append(category)
#     data_dictionary['title'].append(title)
#     data_dictionary['text'].append(text)
#     tracker += 1
#     print('processed ', tracker, ' files')
    
    
# print('no of webpages to scrape: ', len(page_urls))
# print('header article link: ', head_link)
# print('total no. of article links: ', len(article_links))
# print('total no. of article titles and texts: ', (len(data_dictionary['title']), 
#                                                   len(data_dictionary['text'])) ) 

In [14]:
df = pd.DataFrame.from_dict(data_dictionary)
print(df.shape)
df.head()

| | url | category | title | text |
|---|---|---|---|---|
| 0 | https://venturebeat.com/2020/03/20/despite-set... | AI | Despite setbacks, coronavirus could hasten the... | This week, nearly every major company developi... |
| 1 | https://venturebeat.com/2020/03/19/sensor-towe... | Games | Sensor Tower: U.S. iPhone users spent about $5... | U.S. iPhone users spent an average of about $5... |
| 2 | https://venturebeat.com/2020/03/19/microsoft-u... | Games | Microsoft unveils DirectX 12 Ultimate with imp... | Microsoft is moving on to the next generation ... |
| 3 | https://venturebeat.com/2020/03/19/sea-of-star... | Games | Sea of Stars is a gorgeous retro-RPG from The ... | Sabotage Studios announced Sea of Stars today,... |
| 4 | https://venturebeat.com/2020/03/19/htc-holds-v... | AR/VR | HTC holds virtual media event, sends coronavir... | HTC’s just-concluded Virtual Vive Ecosystem Co... |


In [15]:
df.iloc[0,0]

'https://venturebeat.com/2020/03/20/despite-setbacks-coronavirus-could-hasten-the-adoption-of-autonomous-vehicles-and-delivery-robots/'

In [16]:
df.iloc[0,1]

'AI'

In [17]:
df.iloc[0,2]

'Despite setbacks, coronavirus could hasten the adoption of autonomous vehicles and delivery robots'

In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1961 entries, 0 to 1960
Data columns (total 4 columns):
url         1961 non-null object
category    1961 non-null object
title       1961 non-null object
text        1961 non-null object
dtypes: object(4)
memory usage: 61.4+ KB


In [None]:
df.to_csv('venturebeat.csv', index=False)