# Impact Nexus Scraping Exercise

This notebook uses web scraping to collect all available text on a website and stores the output in a JSON file.


### Algorithm Design

This exercise was divided into several steps, which were implemented in separate functions:

- **Step 1**: Check the desired output format provided in the example data.

- **Step 2** (`get_output` function): Set the desired output format.

- **Step 3** (`get_links` function): Retrieve all links found within the top level of the website.

- **Step 4** (`crawl` function): Recursively crawl the website to find all links within the other levels of the website.

- **Step 5** (`get_content` function): Retrieve the text content of a URL.

- **Step 6** (`get_website` function): Scrape an entire website.


Let's start by importing the relevant libraries. 
For the scraping part, we use the `Beautiful Soup` library, although alternatives such as Scrapy are available.

In [71]:
import numpy as np
import pandas as pd 
from bs4 import BeautifulSoup
import requests
import json

**Step 1:** The project assignment includes an example file, from which we establish the desired format of the output file.

In [101]:
# Open example file (the context manager closes it automatically)
with open('Scraping task_Example json_1_data_Agrifarm.json', 'r') as file:
    example = json.load(file)

# Checking format and keys of the file

print(example['website'].keys())
[type(value) for value in example['website'].values()]

#example['website']['external_links'].items()

example['website']['website_content']

dict_keys(['company_name', 'website_content', 'external_links', 'website_url', 'internal_links'])


{'https://agrifarm.dk/en/glaedelig-jul-og-godt-nytaar': 'Translated: Skip to content Home News Projects Concept & Technologies Service Contact Write to us Dansk Search for: Merry Christmas & Happy New Year Home / Information / Merry Christmas & Happy New Year Previous Next View Larger Image Merry Christmas & Happy New Year Agrifarm wishes everyone a Merry Christmas and a Happy New Year. We look forward to more exciting projects with existing and new customers in the new year. We´re having a Christmas holiday from Friday, December 20, 2019 until Monday, January 6, 2020. (both days exclusive) The service phone is of course open and other calls as well as mail are answered if possible to the extent necessary. Bo Rosborg 2019-12-13T16:47:36+01:00 Del Historien Facebook Twitter LinkedIn Email Related Posts Build stage Sweden Gallery Build stage Sweden Quick links Intellifarm Agri AirClean About us Contact us today Adress: Niels Pedersens Alle 2 DK-8830 Tjele +45 89 99 25 77 agri@agrifarm.dk

In [102]:
# Check whether all links are contained in top level of website

toplevel = requests.get(example['website']['website_url']).content.decode('utf-8')

count_in_toplevel = 0

count_not_in_toplevel = 0

for link in example['website']['internal_links']:

    if link in toplevel:
        count_in_toplevel += 1
    else:
        count_not_in_toplevel += 1

print('{} out of {} links found in top level of website'
      .format(count_in_toplevel, count_in_toplevel + count_not_in_toplevel))

21 out of 50 links found in top level of website


Only 21 of the 50 internal links appear on the top-level page, so a recursive approach is needed to find the links on the other levels.

Loading the list of websites to be scraped:

In [103]:
# Formatting input and output data

# Getting websites to be scraped (select header from row 1)
df_web = pd.read_excel('Scraping task_10 Berlin Sustainability_Startups.xlsx', header = 1)

# Checking exercise data
df_web

| | Nr. | id | NAME | WEBSITE | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | 0 | 1803190 | Planetly | https://www.planetly.org/en/ | |
| 1 | 1 | 949865 | Solytic | https://www.solytic.com/en/ | |
| 2 | 3 | 1686088 | Sanity Group | https://sanitygroup.com/ | |
| 3 | 4 | 1520081 | Grandpal | https://grandpal.co/ | |
| 4 | 6 | 1759246 | Inne | https://inne.io/ | |
| 5 | 7 | 1512234 | Nuventura | http://www.nuventura.com | |
| 6 | 9 | 884183 | Grover | https://www.grover.com/de-en | Difficult as webshop |
| 7 | 11 | 964166 | Amboss | https://www.amboss.com/us | |
| 8 | 28 | 95092 | Ubitricity | https://www.ubitricity.co.uk/ | |
| 9 | 36 | 931862 | AiServe Technologies | http://www.aiserve.co/ | |


**Step 2:** Building the `output` file in the desired shape:

In [114]:
def get_output(name, website_url):
    '''
    Set the desired output format.
    
    Arguments:
    `name`: Company/Startup name
    `website_url`: website URL of the institution
    
    '''
    
    output= {'website': {} }
    output['website']['company_name'] = name
    output['website']['website_content'] = {}
    output['website']['external_links'] = {'youtube': [], 'twitter': [], 'facebook': [], 
                                           'linkedin': [], 'other_links': []}
    output['website']['website_url'] = website_url
    output['website']['internal_links'] = []
    return output

# Checking format
get_output(df_web['NAME'][0], df_web['WEBSITE'][0])

{'website': {'company_name': 'Planetly',
  'website_content': {},
  'external_links': {'youtube': [],
   'twitter': [],
   'facebook': [],
   'linkedin': [],
   'other_links': []},
  'website_url': 'https://www.planetly.org/en/',
  'internal_links': []}}
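
A quick way to confirm that this skeleton matches the example file is to compare key sets. The sketch below is a minimal check, assuming the five keys printed from the example file in Step 1. `EXPECTED_KEYS` and `has_expected_shape` are names introduced here for illustration, and `get_output` is repeated as a dict literal so the snippet runs on its own.

```python
# Keys printed from the example file in Step 1
EXPECTED_KEYS = {'company_name', 'website_content', 'external_links',
                 'website_url', 'internal_links'}


def get_output(name, website_url):
    """Same skeleton as in the notebook, written as a dict literal."""
    return {'website': {
        'company_name': name,
        'website_content': {},
        'external_links': {'youtube': [], 'twitter': [], 'facebook': [],
                           'linkedin': [], 'other_links': []},
        'website_url': website_url,
        'internal_links': [],
    }}


def has_expected_shape(output):
    """Return True if the 'website' entry has exactly the expected keys."""
    return set(output['website'].keys()) == EXPECTED_KEYS


skeleton = get_output('Planetly', 'https://www.planetly.org/en/')
print(has_expected_shape(skeleton))
```

Running the same check on each generated dictionary before export would flag a missing or misspelled key early.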

**Steps 3-5:** Creating functions to retrieve information on website content and all links:

In [135]:
def get_links(website_url):
    '''
    Retrieve all links found on the page at the given URL and return them as a list.

    Argument:
    `website_url`: URL of the page to parse
    '''
    # Getting the website
    response = requests.get(website_url)

    # Parsing html website
    soup = BeautifulSoup(response.content, 'html.parser')

    # List of all links
    links = []
    for link in soup.find_all('a', href=True):
        links.append(link['href'])

    return links


def crawl(website_url, url, internal, external):
    '''
    Recursively crawl the given website for links.
    
    Arguments:
    `website_url`: top-level URL to website
                   (only links starting with this are considered internal)
    `url`: starting URL to be crawled
    `internal`: set of internal links (a set guarantees unique links)
    `external`: set of external links (a set guarantees unique links)
    '''
    # Check all links at given URL
    links = get_links(url)

    for link in links:
        # Only follow internal links (starting with the website URL)

        if not link.startswith(website_url):
            if link.startswith('http') and link not in external:
                print('External link:', link)
                external.add(link)
            continue

        # If link is not yet visited, recursively check for more links
        if link in internal:
            continue
        else:
            print('Internal link:', link)
            internal.add(link)
            crawl(website_url, link, internal, external)
            
    # Return all links found
    return internal, external
    
def get_content(suburl):  
    '''
    Retrieve the text content of a URL.
    
    Argument:
    `suburl`: website URL within main website
    
    '''    
    # Getting the website
    req = requests.get(suburl)
    
    # Parsing html website 
    soup = BeautifulSoup(req.content, 'html.parser')
    text = soup.get_text()
    
    # Replace newlines with spaces
    text = text.replace('\n', ' ')
    
    # Remove multiple spaces
    text = ' '.join(text.split())
    
    return text
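
The whitespace clean-up in `get_content` can be tried on a plain string without any network access. This is a small sketch with a made-up sample string, and `normalize_whitespace` is a hypothetical helper name. It suggests that `' '.join(text.split())` already collapses newlines, tabs and repeated spaces, so the explicit `replace('\n', ' ')` step appears to be redundant.

```python
def normalize_whitespace(text):
    """Collapse any run of whitespace (spaces, tabs, newlines) into one space."""
    return ' '.join(text.split())


# Made-up sample resembling the raw result of soup.get_text()
raw = '\n\n  Home \n News\t\tProjects  \n\n Contact \n'

# Two-step version, as in get_content
with_replace = ' '.join(raw.replace('\n', ' ').split())

# One-step version
without_replace = normalize_whitespace(raw)

print(without_replace)
print(with_replace == without_replace)
```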

Testing `crawl` function:

In [147]:
# Create empty sets each time the crawl function is called

internal = set()
external = set()

# Test example (use same URLs to start from top of website and crawl to other levels)
website_url = 'https://www.solytic.com/en/' 
url = 'https://www.solytic.com/en/'

crawl(website_url, url, internal, external)

Internal link: https://www.solytic.com/en/
Internal link: https://www.solytic.com/en/pv-monitoring/
Internal link: https://www.solytic.com/en/prices/
Internal link: https://www.solytic.com/en/platform/
External link: https://marketplace.solytic.com/customer/account/login/referer/aHR0cHM6Ly9tYXJrZXRwbGFjZS5zb2x5dGljLmNvbS8%2C/
External link: https://marketplace.solytic.com/customer/account/create/
External link: https://marketplace.solytic.com/
External link: https://marketplace.solytic.com/checkout/cart/
External link: https://marketplace.solytic.com/catalogsearch/advanced/
External link: https://marketplace.solytic.com/betrieb-wartung.html
External link: https://marketplace.solytic.com/betrieb-wartung/direktvermarktung.html
External link: https://marketplace.solytic.com/betrieb-wartung/e-check-pv.html
External link: https://marketplace.solytic.com/betrieb-wartung/wechselrichter-garantieverlangerung.html
External link: https://marketplace.solytic.com/betrieb-wartung/solaranlagen-reinig

External link: https://www.solytic.com/datenschutzbestimmungen/
External link: http://www.solytic.com
External link: https://www.privacyshield.gov/participant?id=a2zt00000000001L5AAI&status=Active
External link: https://cloud.google.com/maps-platform/terms/maps-controller-terms/
External link: https://www.google.de/intl/de/policies/privacy/
External link: http://tools.google.com/dlpage/gaoptout
External link: http://www.google.com/analytics/terms/de.html
External link: https://www.google.de/settings/ads
External link: http://www.google.com/settings/ads/plugin
External link: https://policies.google.com/technologies/ads?hl=en
External link: http://developers.facebook.com/docs/plugins/
External link: https://www.privacyshield.gov/participant?id=a2zt000000000GnywAAC&status=Active
External link: http://de-de.facebook.com/policy.php
External link: https://www.facebook.com/business/help/651294705016616
External link: https://www.facebook.com/settings?tab=ads
External link: https://www.privacy

Internal link: https://www.solytic.com/en/blog/vattenfall-ewe-invest/
External link: https://www.solytic.com/blog/vattenfall-ewe-investieren/
External link: https://www.facebook.com/sharer/sharer.php?u=https%3A%2F%2Fwww.solytic.com%2Fen%2Fblog%2Fvattenfall-ewe-invest%2F&t=Vattenfall%20and%20EWE%20invest%20in%20Solytic
External link: https://twitter.com/intent/tweet?text=Vattenfall%20and%20EWE%20invest%20in%20Solytic&url=https%3A%2F%2Fwww.solytic.com%2Fen%2Fblog%2Fvattenfall-ewe-invest%2F
External link: https://www.linkedin.com/shareArticle?url=https%3A%2F%2Fwww.solytic.com%2Fen%2Fblog%2Fvattenfall-ewe-invest%2F&title=Vattenfall%20and%20EWE%20invest%20in%20Solytic&mini=true
External link: https://group.vattenfall.com/
External link: https://www.ewe.com/en
Internal link: https://www.solytic.com/en/blog/vattenfall-ewe-invest/#respond
Internal link: https://www.solytic.com/en/blog/microsoft-handelsblatt-energy-awards-2020/
External link: https://www.solytic.com/blog/microsoft-handelsblatt-

Internal link: https://www.solytic.com/en/blog/professional-in-house-pv-monitoring/
External link: https://www.solytic.com/blog/professionell-eigenentwickelt-pv-monitoring/
Internal link: https://www.solytic.com/en/author/mikekondulaen/
Internal link: https://www.solytic.com/en/blog/professional-in-house-pv-monitoring/#respond
External link: https://www.facebook.com/sharer/sharer.php?u=https%3A%2F%2Fwww.solytic.com%2Fen%2Fblog%2Fprofessional-in-house-pv-monitoring%2F&t=Professional%20or%20in-house%20PV%20monitoring%20solution%3F
External link: https://twitter.com/intent/tweet?text=Professional%20or%20in-house%20PV%20monitoring%20solution%3F&url=https%3A%2F%2Fwww.solytic.com%2Fen%2Fblog%2Fprofessional-in-house-pv-monitoring%2F
External link: https://www.linkedin.com/shareArticle?url=https%3A%2F%2Fwww.solytic.com%2Fen%2Fblog%2Fprofessional-in-house-pv-monitoring%2F&title=Professional%20or%20in-house%20PV%20monitoring%20solution%3F&mini=true
External link: https://www.bos-ten.net/de/
Inte

({'https://www.solytic.com/en/',
  'https://www.solytic.com/en/about-us/',
  'https://www.solytic.com/en/about-us/#ir',
  'https://www.solytic.com/en/author/julia-buchterkirchen/',
  'https://www.solytic.com/en/author/mikekondulaen/',
  'https://www.solytic.com/en/author/niko/',
  'https://www.solytic.com/en/blog-en/',
  'https://www.solytic.com/en/blog/',
  'https://www.solytic.com/en/blog/100000-solar-pv-plants-2-years/',
  'https://www.solytic.com/en/blog/100000-solar-pv-plants-2-years/#respond',
  'https://www.solytic.com/en/blog/europas-award/',
  'https://www.solytic.com/en/blog/europas-award/#respond',
  'https://www.solytic.com/en/blog/exel-solar-partnership/',
  'https://www.solytic.com/en/blog/exel-solar-partnership/#respond',
  'https://www.solytic.com/en/blog/how-a-new-data-platform-business-could-energize-a-110-year-old-electric-utility/',
  'https://www.solytic.com/en/blog/how-a-new-data-platform-business-could-energize-a-110-year-old-electric-utility/#respond',
  'https:
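
The result above lists URLs such as `.../europas-award/` and `.../europas-award/#respond` as separate internal links, although the part after `#` is only a fragment within the same page. One possible way to avoid fetching the same page twice is to strip fragments before adding a link to the set. This is a sketch using `urldefrag` from the standard library; `strip_fragment` is a hypothetical helper, not part of the notebook.

```python
from urllib.parse import urldefrag


def strip_fragment(link):
    """Return the link without its '#fragment' part."""
    return urldefrag(link).url


# Links taken from the crawl result shown in this notebook
links = [
    'https://www.solytic.com/en/blog/europas-award/',
    'https://www.solytic.com/en/blog/europas-award/#respond',
    'https://www.solytic.com/en/about-us/',
    'https://www.solytic.com/en/about-us/#ir',
]

# A set comprehension keeps each page only once
unique_pages = sorted({strip_fragment(link) for link in links})
print(unique_pages)
```

Calling such a helper on each link inside `crawl`, before the membership test on `internal`, would likely reduce both the number of requests and the number of duplicated entries in `website_content`.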

**Step 6:** Scraping an entire website and postprocessing the internal and external links:

In [145]:
from urllib.parse import urlparse


def get_website(name, website_url):
    '''
    Scrape an entire website.

    Arguments:
    `name`: Company/Startup name
    `website_url`: website URL of the institution
    '''
    output = get_output(name, website_url)

    # Crawl website for internal and external links
    internal = set()
    external = set()
    crawl(website_url, website_url, internal, external)

    # Postprocessing of internal and external links to the desired shape

    # Convert to list so it can be converted to JSON
    output['website']['internal_links'] = list(internal)

    # External links postprocessing: match on the host name so that variants
    # such as 'https://twitter.com/...' or 'http://de-de.facebook.com/...'
    # are also assigned to their platform
    for link in external:
        host = urlparse(link).hostname or ''
        for platform in ('youtube', 'twitter', 'facebook', 'linkedin'):
            domain = platform + '.com'
            if host == domain or host.endswith('.' + domain):
                output['website']['external_links'][platform].append(link)
                break
        else:
            output['website']['external_links']['other_links'].append(link)

    # Add website content
    for suburl in internal:
        output['website']['website_content'][suburl] = get_content(suburl)

    return output

Export the final output dictionary as a JSON file:

In [146]:
# Uncomment to run for all websites
'''
for index, row in df_web.iterrows():
    output = get_website(row['NAME'], row['WEBSITE'])
    with open(row['NAME'] + '.json', 'w') as f:
        json.dump(output, f)
'''
# Test for one website
output = get_website(df_web['NAME'][1], df_web['WEBSITE'][1])

# Checking json format and output
print(json.dumps(output, indent=2))

Internal link: https://www.solytic.com/en/
Internal link: https://www.solytic.com/en/pv-monitoring/
Internal link: https://www.solytic.com/en/prices/
Internal link: https://www.solytic.com/en/platform/
External link: https://marketplace.solytic.com/customer/account/login/referer/aHR0cHM6Ly9tYXJrZXRwbGFjZS5zb2x5dGljLmNvbS8%2C/
External link: https://marketplace.solytic.com/customer/account/create/
External link: https://marketplace.solytic.com/
External link: https://marketplace.solytic.com/checkout/cart/
External link: https://marketplace.solytic.com/catalogsearch/advanced/
External link: https://marketplace.solytic.com/betrieb-wartung.html
External link: https://marketplace.solytic.com/betrieb-wartung/direktvermarktung.html
External link: https://marketplace.solytic.com/betrieb-wartung/e-check-pv.html
External link: https://marketplace.solytic.com/betrieb-wartung/wechselrichter-garantieverlangerung.html
External link: https://marketplace.solytic.com/betrieb-wartung/solaranlagen-reinig

External link: https://www.solytic.com/impressum/
Internal link: https://www.solytic.com/en/privacy-statement/
External link: https://www.solytic.com/datenschutzbestimmungen/
External link: http://www.solytic.com
External link: https://www.privacyshield.gov/participant?id=a2zt00000000001L5AAI&status=Active
External link: https://cloud.google.com/maps-platform/terms/maps-controller-terms/
External link: https://www.google.de/intl/de/policies/privacy/
External link: http://tools.google.com/dlpage/gaoptout
External link: http://www.google.com/analytics/terms/de.html
External link: https://www.google.de/settings/ads
External link: http://www.google.com/settings/ads/plugin
External link: https://policies.google.com/technologies/ads?hl=en
External link: http://developers.facebook.com/docs/plugins/
External link: https://www.privacyshield.gov/participant?id=a2zt000000000GnywAAC&status=Active
External link: http://de-de.facebook.com/policy.php
External link: https://www.facebook.com/business/h

Internal link: https://www.solytic.com/en/blog/vattenfall-ewe-invest/
External link: https://www.solytic.com/blog/vattenfall-ewe-investieren/
External link: https://www.facebook.com/sharer/sharer.php?u=https%3A%2F%2Fwww.solytic.com%2Fen%2Fblog%2Fvattenfall-ewe-invest%2F&t=Vattenfall%20and%20EWE%20invest%20in%20Solytic
External link: https://twitter.com/intent/tweet?text=Vattenfall%20and%20EWE%20invest%20in%20Solytic&url=https%3A%2F%2Fwww.solytic.com%2Fen%2Fblog%2Fvattenfall-ewe-invest%2F
External link: https://www.linkedin.com/shareArticle?url=https%3A%2F%2Fwww.solytic.com%2Fen%2Fblog%2Fvattenfall-ewe-invest%2F&title=Vattenfall%20and%20EWE%20invest%20in%20Solytic&mini=true
External link: https://group.vattenfall.com/
External link: https://www.ewe.com/en
Internal link: https://www.solytic.com/en/blog/vattenfall-ewe-invest/#respond
Internal link: https://www.solytic.com/en/blog/microsoft-handelsblatt-energy-awards-2020/
External link: https://www.solytic.com/blog/microsoft-handelsblatt-

Internal link: https://www.solytic.com/en/blog/professional-in-house-pv-monitoring/
External link: https://www.solytic.com/blog/professionell-eigenentwickelt-pv-monitoring/
Internal link: https://www.solytic.com/en/author/mikekondulaen/
Internal link: https://www.solytic.com/en/blog/professional-in-house-pv-monitoring/#respond
External link: https://www.facebook.com/sharer/sharer.php?u=https%3A%2F%2Fwww.solytic.com%2Fen%2Fblog%2Fprofessional-in-house-pv-monitoring%2F&t=Professional%20or%20in-house%20PV%20monitoring%20solution%3F
External link: https://twitter.com/intent/tweet?text=Professional%20or%20in-house%20PV%20monitoring%20solution%3F&url=https%3A%2F%2Fwww.solytic.com%2Fen%2Fblog%2Fprofessional-in-house-pv-monitoring%2F
External link: https://www.linkedin.com/shareArticle?url=https%3A%2F%2Fwww.solytic.com%2Fen%2Fblog%2Fprofessional-in-house-pv-monitoring%2F&title=Professional%20or%20in-house%20PV%20monitoring%20solution%3F&mini=true
External link: https://www.bos-ten.net/de/
Inte
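
Once a file has been written, a short check can confirm that it loads back unchanged and give a quick overview of what was collected. The sketch below uses a made-up miniature `output` dictionary and a temporary directory instead of real scraped data; `summarize` is a hypothetical helper introduced for illustration.

```python
import json
import os
import tempfile


def summarize(output):
    """Count pages and links in an output dictionary."""
    site = output['website']
    return {
        'pages_with_content': len(site['website_content']),
        'internal_links': len(site['internal_links']),
        'external_links': {key: len(value)
                           for key, value in site['external_links'].items()},
    }


# Made-up miniature result in the same shape as get_output()
output = {'website': {
    'company_name': 'Example',
    'website_content': {'https://example.com/': 'Home Contact'},
    'external_links': {'youtube': [], 'twitter': ['https://twitter.com/example'],
                       'facebook': [], 'linkedin': [], 'other_links': []},
    'website_url': 'https://example.com/',
    'internal_links': ['https://example.com/'],
}}

# Write the dictionary to a JSON file and read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'Example.json')
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(output, f, indent=2)
    with open(path, 'r', encoding='utf-8') as f:
        reloaded = json.load(f)

print(reloaded == output)
print(summarize(reloaded))
```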

## References
- [Beautiful Soup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [How to Extract All Website Links using BeautifulSoup in Python](https://morioh.com/p/93eb90f9e62a)

## Possible improvements

- Handle broken links
- Set a cut-off for how many internal pages the code should go through
- Follow relative links (for example, those starting with `/`), not only absolute links that start with the website URL
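
The three points above could be combined in one crawler. The following is a minimal sketch, not the notebook's implementation: it replaces recursion with a queue, stops after `max_pages` requests, records URLs whose retrieval fails, and resolves relative links with `urljoin`. The names `crawl_limited`, `fetch_links`, `FAKE_SITE` and `fake_get_links` are assumptions made for illustration. The fetching function is passed in as an argument so the example runs without network access.

```python
from collections import deque
from urllib.parse import urljoin


def crawl_limited(website_url, fetch_links, max_pages=100):
    """
    Breadth-first crawl with a request limit.

    `fetch_links(url)` must return the href values found at `url`
    (for example the notebook's `get_links`) and may raise on failure.
    """
    internal, external, broken = {website_url}, set(), set()
    queue = deque([website_url])
    requests_made = 0

    while queue and requests_made < max_pages:
        url = queue.popleft()
        requests_made += 1
        try:
            hrefs = fetch_links(url)
        except Exception:
            # Broken link: record it, drop it from the internal set, carry on
            broken.add(url)
            internal.discard(url)
            continue

        for href in hrefs:
            # Resolve relative links such as '/about/' against the current page
            link = urljoin(url, href)
            if link.startswith(website_url):
                if link not in internal and link not in broken:
                    internal.add(link)
                    queue.append(link)
            elif link.startswith('http'):
                external.add(link)

    return internal, external, broken


# Made-up in-memory site used instead of real HTTP requests
FAKE_SITE = {
    'https://example.com/': ['/about/', 'https://example.com/blog/',
                             'https://twitter.com/example'],
    'https://example.com/about/': ['/', '/missing/'],
    'https://example.com/blog/': ['post-1/'],
    'https://example.com/blog/post-1/': [],
}


def fake_get_links(url):
    # A KeyError stands in for a failed request
    return FAKE_SITE[url]


internal, external, broken = crawl_limited('https://example.com/', fake_get_links)
print(sorted(internal))
print(sorted(external))
print(sorted(broken))
```

With real requests, `fetch_links` could be the notebook's `get_links`, ideally with a `timeout` argument on `requests.get`, and the `except` clause could be narrowed to `requests.RequestException`. Note that when the limit is reached, `internal` may still contain queued links that were never fetched.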