# Webscraping Tutorial

This tutorial explains how to use the accompanying webscraping program. The program addresses the customer's specific use case, but with minor modifications it can be applied more generally. Assistance and support for other use cases can be provided on request.

## Technology
Most webscraping strategies rely on a scraping library or module, which are available for many programming languages. In Python, Beautiful Soup and Scrapy are the most popular choices. However, this use case presents a problem that cannot be solved with these tools alone: the website to be scraped is managed by Google Tag Manager, so the relevant HTML is not visible to a library that only downloads the page source.
We have therefore used Selenium to take control of the browser, reveal the HTML elements, and retrieve the necessary information.
Selenium is a very powerful tool because it can fully control and drive the browser. However, it has some limitations and intermittency issues that can make it difficult to work with. One of them is that it has to wait for the page to load fully before it can acquire the complete information, which is a particular problem on slow internet connections. Scraping with Selenium therefore takes somewhat longer, but in return it can reach sites that sit behind logins and tag management systems.
We have used Selenium's Python bindings to address this specific use case.
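
The wait-then-collect idea described above can be sketched roughly as follows. This is not the code of the accompanying program: the function names `wait_until_loaded`, `collect_hrefs` and `scrape_hrefs` are hypothetical, and the sketch assumes that Selenium and a Chrome driver it can locate are installed. It polls `document.readyState`, which only signals that the initial page load has finished; content injected later by scripts may need a longer or more specific wait.

```python
import time


def wait_until_loaded(driver, timeout=30.0, poll=0.5):
    """Poll the browser until document.readyState is 'complete' or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if driver.execute_script("return document.readyState") == "complete":
            return True
        time.sleep(poll)
    return False


def collect_hrefs(driver):
    """Return the href attribute of every <a> element on the current page."""
    # "tag name" is the string value of Selenium's By.TAG_NAME locator strategy
    anchors = driver.find_elements("tag name", "a")
    return [anchor.get_attribute("href") for anchor in anchors]


def scrape_hrefs(url, timeout=30.0):
    """Open url in Chrome, wait for the page to load, and return all hrefs found."""
    # Imported lazily so the helpers above can be used without Selenium installed
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        if not wait_until_loaded(driver, timeout=timeout):
            raise TimeoutError(f"{url} did not finish loading within {timeout} s")
        return collect_hrefs(driver)
    finally:
        # Always close the browser, even if the page failed to load
        driver.quit()
```

Calling `scrape_hrefs("https://wirmarket.wir.ch/de/")` would open a browser window and return a list that can contain `None` entries for anchors without an `href`.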

## Installation
Several Python libraries must be installed before using this software. We have made the program available as a Jupyter notebook, one of the most popular ways of executing Python code, because it shows the results of each piece of code immediately. It sits between a code editor and a full-fledged web application.

Selenium controls the browser through its WebDriver, which communicates with a browser driver that must be installed on the system. We are using the Chrome browser. The following steps explain how to install the latest version of Chrome and ChromeDriver.
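
As a quick sanity check after installation, a minimal sketch like the one below can report whether the driver and browser executables are reachable on the system `PATH`. The helper `find_executables` is hypothetical, and the executable names are assumptions that differ between operating systems and installation methods.

```python
import shutil


def find_executables(names):
    """Map each executable name to its full path on PATH, or None if it is not found."""
    return {name: shutil.which(name) for name in names}


# Executable names are assumptions; adjust them for your operating system
found = find_executables(["chromedriver", "google-chrome", "chromium"])
for name, path in found.items():
    print(f"{name}: {path or 'not found on PATH'}")
```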



In [1]:
import numpy as np
import pandas as pd

def unique_urls(urls_list):
    """
    Removes repeated urls from a list, which might occur due to the use of regex or multiple occurrences on a page
    Args:
        urls_list: list containing urls which might be repetitive entries
    Returns: list containing unique urls
    """
    # convert to set
    urls_set = set(urls_list)
    # convert back to list
    unique_urls_list = list(urls_set)
    
    return unique_urls_list
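
Converting to a `set` removes duplicates but does not keep the original order of the urls. If the order in which links were found matters, one possible alternative is sketched below. The name `unique_urls_ordered` is hypothetical and not part of the accompanying program, and the example urls are made up.

```python
def unique_urls_ordered(urls_list):
    """Remove repeated urls while keeping the first occurrence of each in place."""
    # dict keys are unique and, since Python 3.7, keep their insertion order
    return list(dict.fromkeys(urls_list))


# Made-up urls for illustration only
sample = [
    "https://example.com/b",
    "https://example.com/a",
    "https://example.com/b",
]
deduplicated = unique_urls_ordered(sample)
```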


In [13]:
all_hrefs_df = pd.read_csv("all_hrefs.csv")

In [14]:
all_hrefs_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104806 entries, 0 to 104805
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       98770 non-null  object
dtypes: object(1)
memory usage: 818.9+ KB


In [26]:
all_hrefs = all_hrefs_df.values.tolist()
all_hrefs = [element for sublist in all_hrefs for element in sublist]
print(len(all_hrefs))
print(all_hrefs[:5])

104806
['https://wirmarket.wir.ch/de/', 'https://wirmarket.wir.ch/de/', nan, 'https://www.wir.ch/fileadmin/user_upload/Dokumente/Flyer/wirmarket-werbemoeglichkeiten-bank-wir-de.pdf', 'https://www.wir.ch/fileadmin/user_upload/Dokumente/Flyer/wirmarket-bedienungsanleitung-bank-wir-de.pdf']


In [36]:
all_hrefs_df.duplicated()

104806

In [38]:
all_hrefs_df.drop_duplicates().count()

0    29123
dtype: int64

In [40]:
all_hrefs_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 29124 entries, 0 to 104763
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       29123 non-null  object
dtypes: object(1)
memory usage: 455.1+ KB
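
The two figures above differ by one (29124 rows against 29123 non-null values) because `drop_duplicates` keeps a single row for the missing values, while `count` skips missing values. Note also that `drop_duplicates` returns a new DataFrame, so the result has to be assigned for the change to persist. A minimal sketch on toy data, assuming made-up urls rather than the contents of `all_hrefs.csv`:

```python
import pandas as pd

# Toy data standing in for all_hrefs.csv (hypothetical urls)
df = pd.DataFrame(
    {"0": ["https://example.com/a", "https://example.com/a",
           None, None, "https://example.com/b"]}
)

# drop_duplicates returns a new DataFrame; df itself is left unchanged
deduped = df.drop_duplicates()

n_rows = len(deduped)                    # the repeated missing value is kept once
n_non_null = int(deduped["0"].count())   # count() ignores missing values

# Drop the remaining missing entry to keep only real urls
urls = deduped.dropna()["0"].tolist()
```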


In [22]:
def regex_pattern_urls(pattern, hrefs_list):
    """
    Uses a regex pattern to extract urls of interest out of a list of urls
    Args:
        pattern: regex pattern used for extracting urls from a list of urls
        hrefs_list: list of hrefs to be filtered according to the regex pattern
    Returns: list of urls filtered according to the regex pattern provided
    """
    # import regex library
    import re

    # Example pattern for company profile pages, kept as a comment for reference only;
    # the pattern actually used is passed in through the 'pattern' argument
#     pattern = r'https:\/\/\w+\.wir\.ch\/de\/companyProfile\/profile\/[0-9A-F]{32}\/info\/\?promo=false$'

    urls_of_interest = []

    # loop over the hrefs and keep the part of each href that matches the pattern
    for href in hrefs_list:
        # str() guards against non-string entries such as NaN
        match = re.search(pattern, str(href))
        if match:
            urls_of_interest.append(match.group())
    return urls_of_interest

In [29]:
pattern = r'https:\/\/\w+\.wir\.ch\/de\/companyProfile\/profile\/[0-9A-F]{32}\/info\/\?promo=false$'


company_profile_urls = regex_pattern_urls(pattern=pattern, hrefs_list=all_hrefs)
print(len(all_hrefs))
print(len(company_profile_urls))


104806
67083
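
To illustrate what the pattern accepts and rejects, here is a small sketch. The first two sample entries are taken from the outputs in this notebook; the last three are hypothetical variants added only to show the effect of the `[0-9A-F]{32}` part, the trailing `$`, and the `str()` conversion.

```python
import re

# Same pattern as in the cell above, compiled once so it can be reused
PROFILE_URL = re.compile(
    r'https:\/\/\w+\.wir\.ch\/de\/companyProfile\/profile\/[0-9A-F]{32}\/info\/\?promo=false$'
)

samples = [
    "https://wirmarket.wir.ch/de/",
    "https://wirmarket.wir.ch/de/companyProfile/profile/3E429FAF47432A65E0540010E0244DC9/info/?promo=false",
    # hypothetical: lower-case id, rejected because [0-9A-F] is case-sensitive
    "https://wirmarket.wir.ch/de/companyProfile/profile/3e429faf47432a65e0540010e0244dc9/info/?promo=false",
    # hypothetical: extra query parameter, rejected because of the trailing $
    "https://wirmarket.wir.ch/de/companyProfile/profile/3E429FAF47432A65E0540010E0244DC9/info/?promo=false&lang=de",
    # missing value, as produced by pandas for empty rows
    float("nan"),
]

# str() turns non-string entries such as NaN into text so re.search does not fail
matches = [s for s in samples if PROFILE_URL.search(str(s))]
```

Only the second sample ends up in `matches`.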


In [32]:
print(company_profile_urls[:10])

['https://wirmarket.wir.ch/de/companyProfile/profile/3E429FAF47432A65E0540010E0244DC9/info/?promo=false', 'https://wirmarket.wir.ch/de/companyProfile/profile/3E429FAF47432A65E0540010E0244DC9/info/?promo=false', 'https://wirmarket.wir.ch/de/companyProfile/profile/3E429FAF47432A65E0540010E0244DC9/info/?promo=false', 'https://wirmarket.wir.ch/de/companyProfile/profile/3E429FAF56142A65E0540010E0244DC9/info/?promo=false', 'https://wirmarket.wir.ch/de/companyProfile/profile/3E429FAF56142A65E0540010E0244DC9/info/?promo=false', 'https://wirmarket.wir.ch/de/companyProfile/profile/3E429FAF56142A65E0540010E0244DC9/info/?promo=false', 'https://wirmarket.wir.ch/de/companyProfile/profile/4F0E0429C42D6459E054A0369F14B95F/info/?promo=false', 'https://wirmarket.wir.ch/de/companyProfile/profile/4F0E0429C42D6459E054A0369F14B95F/info/?promo=false', 'https://wirmarket.wir.ch/de/companyProfile/profile/4F0E0429C42D6459E054A0369F14B95F/info/?promo=false', 'https://wirmarket.wir.ch/de/companyProfile/profile/3E

In [33]:
unique_company_profile_urls = unique_urls(company_profile_urls)

In [34]:
print(len(unique_company_profile_urls))

14227
