# Example for website scraping

In this example, we look for Dutch nature areas whose web pages mention certain lookup words. The process is as follows:
1. Collect the URLs of all nature areas from www.natura2000.nl/gebieden.
2. Retrieve all text from each of the nature area pages.
3. Search the text for the lookup words and return the number of occurrences.

Scrape and parse using the `requests` and `BeautifulSoup` libraries. Use `pandas` only for tidying up the output.

In [None]:
from bs4 import BeautifulSoup
import pandas as pd
import requests

We scrape the natura2000 website, which has an overview of all nature areas in the Netherlands.

In [None]:
URL = 'https://www.natura2000.nl'

lookup_words = ['water', 'vijf']

### 1. Get URLs of all nature areas

In [None]:
page = requests.get(URL + '/gebieden')
soup = BeautifulSoup(page.content, "html.parser")

In [None]:
gebieden_urls = [gebied.find('a')['href'] for gebied in 
                 soup.find_all("li", class_="gebieden-row")]
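
The list comprehension above raises a `TypeError` if a matching `li` has no link inside it. As a minimal sketch, the same extraction can be tried out on a small inline HTML string, skipping rows without a link. The markup in `SAMPLE_HTML` and the helper name `extract_area_urls` are made up for illustration; the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure the notebook expects
SAMPLE_HTML = """
<ul>
  <li class="gebieden-row"><a href="/gebieden/area-one">Area One</a></li>
  <li class="gebieden-row"><span>No link here</span></li>
  <li class="other-row"><a href="/elsewhere">Elsewhere</a></li>
  <li class="gebieden-row"><a href="/gebieden/area-two">Area Two</a></li>
</ul>
"""


def extract_area_urls(html):
    """Return the href of the first link in every 'gebieden-row' list item."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for row in soup.find_all("li", class_="gebieden-row"):
        # href=True only matches <a> tags that actually carry an href
        link = row.find("a", href=True)
        if link is not None:
            urls.append(link["href"])
    return urls


sample_urls = extract_area_urls(SAMPLE_HTML)
print(sample_urls)
```

Rows without a link and rows with a different class are skipped, so only the two area paths remain.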

### 2. Parse text on each nature page

In [None]:
gebieden_pages = [
    BeautifulSoup(requests.get(URL + gebied_url).content, "html.parser") 
    for gebied_url in gebieden_urls
]
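
The cell above sends one request per area without a timeout or a status check. A possible, more defensive variant is sketched below. The helper `fetch_pages` and its parameters are hypothetical; it takes the `get` callable as an argument so it can be tried without network access, and the delay and timeout values are arbitrary assumptions.

```python
import time
from urllib.parse import urljoin


def fetch_pages(base_url, paths, get, delay=0.5, timeout=10):
    """Fetch every path relative to base_url and return {path: raw content}.

    `get` is any callable with the signature of `requests.get`, for example
    the `get` method of a `requests.Session`.
    """
    pages = {}
    for i, path in enumerate(paths):
        if i:
            # pause between requests to avoid hammering the server
            time.sleep(delay)
        response = get(urljoin(base_url, path), timeout=timeout)
        # raise an exception on 4xx/5xx instead of parsing an error page
        response.raise_for_status()
        pages[path] = response.content
    return pages
```

In this notebook it could be called as `fetch_pages(URL, gebieden_urls, requests.Session().get)`, after which each value can be passed to `BeautifulSoup` as before.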

In [None]:
# retrieve area name from HTML
gebieden_names = [
    gebied_page.select_one("h1 span").text
    for gebied_page in gebieden_pages
]

# concatenate all HTML text paragraphs on a page into one single string
gebieden_text = [
    " ".join(
        [textbox.text for textbox in 
         gebied_page.find_all("div", class_="field field--name-field-body content-item")]
    ).replace('\n', ' ')
    for gebied_page in gebieden_pages
]
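
Passing a multi-class string to `class_` matches only when the `class` attribute equals that exact string, in that order. A CSS selector on a single class is a more tolerant alternative. The sketch below shows this on an invented HTML fragment; `SAMPLE_PAGE` and `extract_body_text` are illustrative names, not part of the website or the notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the second body div lists its classes in another order
SAMPLE_PAGE = """
<h1><span>Example Area</span></h1>
<div class="field field--name-field-body content-item"><p>First
paragraph.</p></div>
<div class="sidebar">Ignored text.</div>
<div class="content-item field--name-field-body field"><p>Second paragraph.</p></div>
"""


def extract_body_text(html):
    """Join the text of all body divs into one whitespace-normalised string."""
    soup = BeautifulSoup(html, "html.parser")
    # the CSS class selector matches regardless of class order
    boxes = soup.select("div.field--name-field-body")
    # split() without arguments collapses newlines and repeated spaces
    return " ".join(" ".join(box.get_text(" ").split()) for box in boxes)


sample_text = extract_body_text(SAMPLE_PAGE)
print(sample_text)
```

Using `split()` and `join` also replaces the `.replace('\n', ' ')` step, since it removes newlines and double spaces in one go.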

### 3. Return dataframe of areas with a count for each lookup word

In [None]:
def count_word_occurrences(texts: list, word: str) -> list:
    # case-insensitive substring count, so 'water' also matches 'waterkwaliteit'
    return [text.lower().count(word.lower()) for text in texts]

output_df = pd.DataFrame({
    'Area': gebieden_names,
    'URL': [URL + gebied_url for gebied_url in gebieden_urls],
    # one 'count_<word>' column per lookup word
    **{'count_' + word: count_word_occurrences(gebieden_text, word) for word in lookup_words}
})

output_df.to_csv('word_count_per_area.csv')
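
The counts above come from `str.count`, which counts substrings: a lookup word such as 'water' is also counted inside longer words. If only whole words should count, a regular expression with word boundaries is one option. The function `count_whole_word` and the sample texts below are assumptions made for illustration, not part of the original notebook.

```python
import re


def count_whole_word(texts, word):
    """Count case-insensitive whole-word matches of `word` in each text."""
    # \b marks a word boundary; re.escape guards against special characters
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", flags=re.IGNORECASE)
    return [len(pattern.findall(text)) for text in texts]


# Made-up sample texts
sample_texts = ["Water en waterkwaliteit", "Vijf vijfhoeken, vijf"]

whole_word_counts = count_whole_word(sample_texts, "water")
substring_counts = [text.lower().count("water") for text in sample_texts]
print(whole_word_counts, substring_counts)
```

For the first sample text the whole-word count is 1, while the substring count is 2 because 'waterkwaliteit' also contains 'water'. Which behaviour is preferable depends on the question being asked of the data.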

Done.