#  Web scraping for an NLP task

The goal of this notebook is to build a tool that can scrape text from a given list of websites, in order to use it later for clustering the sites. 

The task indicates that we should get text from the landing page, as well as text from the links contained in the landing page. 

Since many requests will be necessary, some mechanism (rotating user agents, proxies, etc.) has to be in place to avoid being blocked.
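
One simple piece of such a mechanism is pausing for a random interval between requests, so that traffic does not reach a site at a fixed rate. The following is a minimal sketch, assuming a hypothetical helper `polite_pause` and illustrative bounds of 1 to 3 seconds; neither is part of the original notebook.

```python
import random
import time


def polite_pause(min_seconds=1.0, max_seconds=3.0, sleep=time.sleep):
    """Sleep for a random interval and return the chosen delay.

    Hypothetical helper; the default bounds are illustrative assumptions.
    `sleep` is injectable so the function can be exercised without waiting.
    """
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay
```

Calling `polite_pause()` before each request would spread the requests out; suitable bounds depend on the site being scraped.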

As each page contains many links, parallel processing can be implemented in order to speed up the scraping.
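
Because fetching pages is mostly waiting on the network, a pool of threads is one possible way to run several requests at once. This notebook itself uses `multiprocessing.Pool`; the sketch below swaps in `concurrent.futures.ThreadPoolExecutor` as an alternative. The helper name `fetch_all` and the default of 6 workers are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=6):
    """Apply `fetch` to every URL concurrently; results keep the input order.

    `fetch` is any callable that takes a URL and returns its text. Both this
    helper and `max_workers=6` are illustrative assumptions.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fetch, urls))
```

Any function that takes a URL and returns text could be passed as `fetch`.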

The final product should be able to take a list of websites and build text files with the contents of each site. 
Additional parameters could be included to manage, for instance, the parallel processing or some further filtering of the contents.

##  Scraping from one site

Let's use one of the given URLs to get an idea of the kind of websites we have. 

For example, this British pipe supplier: http://www.besseges-vtf.co.uk/

In [77]:
import requests
import random
import time
from bs4 import BeautifulSoup
from urllib.parse import urlparse
from multiprocessing import Pool

In [29]:
def get_header():
    """
    Returns a random header dictionary to be passed to requests. 
    """

    # Headers for user agent rotation.
    # Full headers obtained from httpbin.org.
    # Firefox 84 Ubuntu


    h1 =  {
        "Accept": 	"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding":	"gzip, deflate, br",
        "Accept-Language":	"en-US,en;q=0.5",
        "Connection":	"keep-alive",
        "Host":	"httpbin.org",
        "TE":	"Trailers",
        "Upgrade-Insecure-Requests":	"1",
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0"
      }

    #Firefox 84 Windows 10

    h2 = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8", 
        "Accept-Encoding": "gzip, deflate, br", 
        "Accept-Language": "en-GB,en;q=0.5", 
        "Host": "httpbin.org", 
        "Upgrade-Insecure-Requests": "1", 
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:84.0) Gecko/20100101 Firefox/84.0"
       }

    # Chrome 87 Ubuntu

    h3 = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9", 
        "Accept-Encoding": "gzip, deflate, br", 
        "Accept-Language": "en-US,en;q=0.9,fr;q=0.8,es;q=0.7", 
        "Host": "httpbin.org", 
        "Sec-Fetch-Dest": "document", 
        "Sec-Fetch-Mode": "navigate", 
        "Sec-Fetch-Site": "none", 
        "Sec-Fetch-User": "?1", 
        "Upgrade-Insecure-Requests": "1", 
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36", 
      }

    #Chrome 87 Windows 10

    h4 = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9", 
        "Accept-Encoding": "gzip, deflate", 
        "Accept-Language": "es-419,es;q=0.9,fr;q=0.8,en;q=0.7", 
        "Host": "httpbin.org", 
        "Upgrade-Insecure-Requests": "1", 
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36"
      }

    # Microsoft Edge 87 Windows 10

    h5 = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9", 
        "Accept-Encoding": "gzip, deflate, br", 
        "Accept-Language": "en-US,en;q=0.9", 
        "Host": "httpbin.org", 
        "Sec-Fetch-Dest": "document", 
        "Sec-Fetch-Mode": "navigate", 
        "Sec-Fetch-Site": "none", 
        "Sec-Fetch-User": "?1", 
        "Upgrade-Insecure-Requests": "1", 
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36 Edg/87.0.664.75"
      }

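    headers_list = [h1, h2, h3, h4, h5]

    header = random.choice(headers_list)
    # The dictionaries were captured from httpbin.org, so drop its Host entry;
    # requests sets the correct Host for each target site.
    header.pop("Host", None)

    return header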

In [4]:
def get_links(soup, tags='a'):
    """ Get all the links for the given tags from a parsed page.
    soup: an html page parsed with beautifulsoup.
    tags: string or list of strings indicating the html tags to search.
    """

    # Accept a single tag name as well as a list; iterating over a bare
    # string would loop over its characters.
    if isinstance(tags, str):
        tags = [tags]

    links = set()  # set to store the distinct links found

    for tag in tags:
        for link in soup.find_all(tag, href=True):
            links.add(link['href'])

    return list(links)

In [5]:
def filter_links(home, links_list):
    """
    Takes a home address and a list of links, and filters out links to external sites
    and to some common file types.
    home: string. The URL of the home page.
    links_list: list of strings with the links found on the page, as produced by get_links.
    """
    
    domain = urlparse(home).netloc # domain to check for external links.
    
    # path to include before an internal link. Remove final '/' if present.
    path = home[:-1] if home.endswith('/') else home 

    unwanted_starts = ('javascript:', 'mailto:', 'tel:', '#', '..', '../') 
    
    unwanted_endings = ('.pdf', '.jpg', '.jpeg', '.png', '.gif', '.exe', '.js',
                        '.zip', '.tar', '.gz', '.7z', '.rar'
                       )
    
    filtered_links = list(filter(lambda link: not (link.lower().startswith(unwanted_starts) or 
                                                   link.lower().endswith(unwanted_endings)),links_list
                                )
                         )
    
    # get internal links that don't have the full URL
    internal_links = [link for link in filtered_links if not link.startswith('http') ]

    # Ensure starting '/'  
    for j, intlink in enumerate(internal_links):
        if not intlink.startswith('/'):
            internal_links[j]='/'+intlink
            
    internal_links = [path + intlink for intlink in internal_links]
    
    # removing external links
    filtered_links = list(filter(lambda link: (link.lower().startswith('http') and
                                                domain in link.lower()), filtered_links
                                )
                         )
    
    # include internal links
    filtered_links.extend(internal_links)
    
    # keeping distinct elements only
    
    filtered_links = list(set(filtered_links))
    
    # remove home url if present.    
    try:
        filtered_links.remove(path)
    except(ValueError):
        pass
    try:
        filtered_links.remove(path+'/')
    except(ValueError):
        pass
        
    return filtered_links
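
An alternative way to resolve relative links is `urllib.parse.urljoin`, which also handles paths such as `../about` that `filter_links` discards. The sketch below is not the notebook's method; `resolve_internal` is a hypothetical helper that returns the absolute URL when the link stays on the home page's host and `None` otherwise.

```python
from urllib.parse import urljoin, urlparse


def resolve_internal(home, href):
    """Return the absolute URL for `href` if it stays on `home`'s host, else None.

    Hypothetical helper, shown only as an alternative to the string handling
    in filter_links.
    """
    absolute = urljoin(home, href)
    if urlparse(absolute).netloc == urlparse(home).netloc:
        return absolute
    return None


# Usage with one of the trial URLs (the link paths are made up for illustration).
home = 'http://www.ictsl.net/'
relative = resolve_internal(home, 'contact.html')
rooted = resolve_internal(home, '/products/')
external = resolve_internal(home, 'https://example.com/page')
```

Comparing `netloc` values exactly is stricter than the substring check `domain in link.lower()`, which would also accept an external URL that merely contains the domain somewhere in its path.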
    

In [33]:
def write_page(page, file):
    """ Gets the text of a page and appends it to the specified file.
    The file will be created if it does not exist.
    page: a BeautifulSoup object with the parsed page.
    file: string. Path to the destination file.
    """
    
    
    page_text = page.get_text(separator = '\n', strip=True) 
    
    with open(file,'a') as website_text:
        website_text.write(page_text)
        
    return None
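
Two details of appending may matter here. `open` uses the platform's default encoding unless one is given, which may cause trouble with text such as "GUMMI-GÖTZ". Also, pages appended one after another run together unless a separator is written. The sketch below illustrates both points with a hypothetical helper `append_text` that takes plain text rather than a parsed page.

```python
import os
import tempfile


def append_text(text, path):
    """Append `text` plus a trailing newline to `path`, encoded as UTF-8."""
    with open(path, 'a', encoding='utf-8') as f:
        f.write(text + '\n')


# Usage: two pages appended to the same temporary file stay on separate lines.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'example_site.txt')
    append_text('Gummiformteile und Dichtungen', target)
    append_text('Über uns', target)
    with open(target, encoding='utf-8') as f:
        contents = f.read()
```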

Trial URLs:

* http://www.besseges-vtf.co.uk
* http://lumaquin.com
* https://www.degso.com
* http://www.ictsl.net
* https://barrocorestaurante.mx
* https://www.gummigoetz.de
* http://www.suppliersof.com


In [32]:
# Set landing page URL

MAIN_URL = 'http://www.ictsl.net'

# Named FILES_DIRECTORY because the cells below refer to it by that name.
FILES_DIRECTORY = './site_contents/'

In [7]:
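# Request contents

random_header = get_header()
# Make sure the Host value captured from httpbin.org is not sent to the target site.
random_header.pop('Host', None)

# Headers go in the headers keyword; the second positional argument of requests.get is params.
landing_page = requests.get(MAIN_URL, headers=random_header)
landing_html = BeautifulSoup(landing_page.content, 'html.parser')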

In [19]:
# What happens with requests when a page doesn't exist?
# requests wraps the underlying urllib3 NewConnectionError in
# requests.exceptions.ConnectionError, which is the exception to catch here.

myweb = 'http://www.estawebnoexiste.com.ar'
failed_attempts = []
try:
    landing_page = requests.get(myweb)
except requests.exceptions.ConnectionError:
    failed_attempts.append(myweb)
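
The same idea can be wrapped in a small function so that one unreachable site does not stop a whole run. The sketch below is a hypothetical helper, `fetch_or_record`, which takes the fetching function as a parameter. It catches `OSError`, the base class of Python's built-in `ConnectionError`. To my knowledge the exceptions raised by requests also derive from it, but that is an assumption worth checking against the installed version.

```python
def fetch_or_record(url, fetch, failed):
    """Call `fetch(url)`; on a connection-type error, record the URL and return None.

    Hypothetical helper. OSError is the base class of the built-in
    ConnectionError; whether a given HTTP library's errors derive from it
    is an assumption to verify.
    """
    try:
        return fetch(url)
    except OSError:
        failed.append(url)
        return None


def always_fails(url):
    # Stand-in for a real request to a host that does not exist.
    raise ConnectionError(f'cannot reach {url}')


unreachable = []
result = fetch_or_record('http://www.estawebnoexiste.com.ar', always_fails, unreachable)
```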

In [11]:
# Set a file name for the website and write the text of the main page.
file_name = FILES_DIRECTORY + urlparse(MAIN_URL).netloc

file_name

'./site_contents/www.ictsl.net'
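
Opening a file in append mode creates the file but not its parent directory, so writing to this path fails if `./site_contents/` does not exist yet. A minimal sketch of one way to guard against that, using a hypothetical helper `site_file_path` and a temporary directory so that nothing is written to the working directory:

```python
import os
import tempfile
from urllib.parse import urlparse


def site_file_path(directory, url):
    """Create `directory` if needed and return the path for `url`'s text file."""
    os.makedirs(directory, exist_ok=True)
    return os.path.join(directory, urlparse(url).netloc)


# Usage inside a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, 'site_contents')
    path = site_file_path(out_dir, 'http://www.ictsl.net')
    created = os.path.isdir(out_dir)
    name = os.path.basename(path)
```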

In [13]:
# write_page expects the parsed page first and the file path second.
write_page(landing_html, file_name)

In [None]:
link_list = get_links(landing_html)

In [None]:
link_list

In [None]:
link_list = filter_links(MAIN_URL, link_list)

In [None]:
link_list

In [15]:
import os

In [16]:
os.cpu_count()

8

In [26]:
SITES_LIST = []

with open('./site_lists/01_websites.csv', 'r', newline = '') as f:
    for site in f.readlines():
        SITES_LIST.append(site.strip())

SITES_LIST

['website',
 'http://lumaquin.com',
 'https://www.degso.com',
 'http://www.ictsl.net',
 'https://barrocorestaurante.mx',
 'https://www.gummigoetz.de',
 'http://www.suppliersof.com',
 'https://www.aikolon.fi',
 'http://mikro-technik.com',
 'http://de.jointeflons.com',
 'https://hbc-system.com',
 'http://german.aiflon.com',
 'https://www.briskheat.com',
 'https://www.wmh-herion.de',
 'https://de.industrial-seals.com',
 'http://de.plasticptfe.com',
 'https://www.sandprofile.de',
 'http://de.techoseal.com',
 'http://www.minipack.us',
 'https://www.polyfluor.nl',
 'https://www.miprcorp.com',
 'https://www.mn-net.com',
 'http://www.atio.cz',
 'https://www.zse.de',
 'https://www.klinger-awschultze.de',
 'https://www.aerchs.com',
 'https://www.heckerwerke.de',
 'https://www.yachticon.de',
 'https://www.ahlstrom-munksjo.com',
 'http://de.rilsonindustry.com',
 'https://www.jowat.com',
 'https://www.bostik.com',
 'https://www.hufschmied.net',
 'https://www.schreiber-berlin.de',
 'https://www.stoe

In [28]:
SITES_LIST = SITES_LIST[1:]
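
The first element of the list is the CSV header (`website`), which is why it is sliced off. The `csv` module can handle the header directly. The sketch below assumes a hypothetical helper `read_sites` and uses an in-memory sample instead of `01_websites.csv`.

```python
import csv
import io


def read_sites(handle, column='website'):
    """Return the non-empty values of `column` from a CSV file object."""
    reader = csv.DictReader(handle)
    # `or ''` guards against rows where the column is missing.
    return [row[column].strip() for row in reader if (row.get(column) or '').strip()]


# In-memory sample with the same header as the notebook's file.
sample = io.StringIO('website\nhttp://lumaquin.com\nhttps://www.degso.com\n')
sites = read_sites(sample)
```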

In [59]:
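def scrape_main(site_url):
    """
    Takes the URL of the main site, scrapes the text and the links.
    site_url: string. URL of the desired site.
    Returns a tuple (page_text, page_links).
    """

    random_header = get_header()
    # Make sure the Host value captured from httpbin.org is not sent to the target site.
    random_header.pop('Host', None)

    # Headers go in the headers keyword; the second positional argument of requests.get is params.
    page = requests.get(site_url, headers=random_header)
    soup = BeautifulSoup(page.content, 'html.parser')

    page_text = soup.get_text(separator='\n', strip=True)

    page_links = get_links(soup)
    page_links = filter_links(site_url, page_links)

    return page_text, page_links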

In [74]:
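def scrape_links(link_url):
    """
    Takes the URL of one of the links, scrapes it and returns the text.
    link_url: string. URL of the desired page.
    """

    random_header = get_header()
    # Make sure the Host value captured from httpbin.org is not sent to the target site.
    random_header.pop('Host', None)

    # Headers go in the headers keyword; the second positional argument of requests.get is params.
    page = requests.get(link_url, headers=random_header)
    soup = BeautifulSoup(page.content, 'html.parser')

    page_text = soup.get_text(separator='\n', strip=True)

    print(f'Retrieved text from {link_url}')

    return page_text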

In [57]:
# scrape_main returns (text, links); show only the text.
scrape_main('https://www.gummigoetz.de')[0]

'GUMMI-GÖTZ | Hersteller von Gummiformteilen und Dichtungen\nDonnerstag 21 Januar 2021\nImpressum\nAGB\nDatenschutzerklärung\nHome\nProdukte\nBack\nGummiformartikel\nFenster- und Türdichtungen\nProfile\nFlachdichtungen\nSchaumstoffe\nDicht- und Klebstoffe\nGummi-Metall- Verbindungen\nSchläuche\nDichtungen\nGummirollen\nKunststoffe\nMaterialien\nÜber uns\nKontakt\nBack\nImpressum\nAGB\nDatenschutzerklärung\nvon Profilen und Teilen aus Gummi\n& Individuallösungen\nMaßanfertigungen\nindividuell abgestimmter Produkte\nFertigung hochwertiger,\nHerzlich willkommen\nGUMMI-GÖTZ – Ihr Spezialist für industrielle Gummiformteile aller Art\nSeit 2004 schätzen unsere Kunden insbesondere unsere innovativen Lösungen für technisch hochwertige und individuelle Gummiformartikel.\nNeben einem großen Sortiment an Standardartikeln wie Profilen, Fenster- und Türdichtungen, Schaumstoffen sowie Dicht- und Klebestoffen für den Handwerks- und privaten Bereich finden Sie bei uns vor allem Ansprechpartner für ind

In [84]:
max_links = 10

for site in SITES_LIST[4:5]:

    domain = urlparse(site).netloc

    file_name = FILES_DIRECTORY + domain

    # get links and text from the main site
    text, links_list = scrape_main(site)

    # write text of the main site
    with open(file_name, 'w') as f:
        f.write(text)

    print(f'Text from main page {domain} written to {file_name}')

    # Parallel run: scrape up to max_links links with a pool of 6 workers.
    start_time_p = time.time()
    print('BEGIN PARALLEL')
    if __name__ == '__main__':
        with Pool(6) as p:
            link_texts = p.map(scrape_links, links_list[:max_links])
    duration_p = time.time() - start_time_p
    print('END PARALLEL')

    # Serial run over the same links, for timing comparison.
    start_time_s = time.time()
    print('BEGIN SERIAL')
    text = ''
    for link in links_list[:max_links]:
        link_text = scrape_links(link)
        text += '\n' + link_text
    duration_s = time.time() - start_time_s
    print('END SERIAL')

    # Appending the link texts to the site file is disabled while timing:
    # with open(file_name, 'a') as f:
    #     f.write('\n'.join(link_texts))

Text from main page www.gummigoetz.de written to ./site_contents/www.gummigoetz.de
BEGIN PARALLEL
Retrieved text from https://www.gummigoetz.de/de/produkte/dicht-und-klebstoffeRetrieved text from https://www.gummigoetz.de/de/materialien

Retrieved text from https://www.gummigoetz.de/de/produkte/flachdichtungen
Retrieved text from https://www.gummigoetz.de/de/kontakt-hinweis/datenschutz
Retrieved text from https://www.gummigoetz.de/de/produkte/dichtungen
Retrieved text from https://www.gummigoetz.de/de/kontakt-hinweis/agb
Retrieved text from https://www.gummigoetz.de/de/produkte/schaumstoffe
Retrieved text from https://www.gummigoetz.de/de/produkte/gummi-metall-verbindungen
Retrieved text from https://www.gummigoetz.de/de/
Retrieved text from https://www.gummigoetz.de/de/produkte
END PARALLEL
BEGIN SERIAL
Retrieved text from https://www.gummigoetz.de/de/kontakt-hinweis/datenschutz
Retrieved text from https://www.gummigoetz.de/de/produkte/flachdichtungen
Retrieved text from https://www

In [87]:
print(f'Parallel took {duration_p:.2f}s')
print(f'Serial took {duration_s:.2f}s')

Parallel took 2.65s
Serial took 12.70s
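
The two timings can be turned into a speedup and a rough per-worker efficiency. The sketch below just does the arithmetic on the numbers printed above, with the worker count taken from `Pool(6)`. It reflects a single run over ten links, with the serial pass executed second, so the figures are indicative only.

```python
duration_parallel = 2.65   # seconds, from the output above
duration_serial = 12.70    # seconds, from the output above
workers = 6                # size of the Pool used in the loop

# Speedup: how many times faster the parallel run was.
speedup = duration_serial / duration_parallel
# Efficiency: fraction of the ideal 6x speedup actually achieved.
efficiency = speedup / workers

print(f'Speedup: {speedup:.2f}x, efficiency per worker: {efficiency:.0%}')
```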


In [83]:
duration_s

NameError: name 'duration_s' is not defined

In [68]:
print('\n'.join(['a', 'b', 'c' ]))

a
b
c


In [43]:
mylist = [1,2,3,4,5,6]