### Webscraping notes
I used the BeautifulSoup library for web scraping and ran it on an AWS EC2 instance. This part took quite a long time, because I needed to account for many errors (both request errors and problems with the entered URLs), and the code would often run for several hours and then break. I ended up running it in batches first, and then reran it on the whole DataFrame.

I recorded both the homepage content and any status errors, so that I could check whether the URLs provided no longer existed. This also fed into the "freshness" of the client company info, which I will discuss in further notebooks.

In [None]:
import pandas as pd
import numpy as np
import re
import requests
from bs4 import BeautifulSoup
from requests import ReadTimeout, ConnectTimeout, HTTPError, Timeout, ConnectionError

In [None]:
df = pd.read_pickle("./clean_urls")

In [None]:
df.head()

### 1. Scraping function: ping the website to check if it's still alive, scrape homepage, record error message if there is one.

The reason for recording error messages is to later see whether a company's URL returned a 403 (strictly "Forbidden", which I treated as a sign that the website is no longer available) and to note that those clients are potentially no longer there.
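
A minimal sketch of how the recorded values could be labelled when reviewing them later. The function name `describe_status` and the category labels are hypothetical and not part of the original notebook; the only assumption carried over is that `status_code` holds either an integer HTTP code or one of the `"Error1"`/`"Error2"`/`"Error3"` strings produced by the scraping function.

```python
from http import HTTPStatus


def describe_status(status_code):
    """Map a recorded status_code value to a short human-readable label."""
    if isinstance(status_code, str):
        # "Error1", "Error2", "Error3": the request never got a response.
        return "no response (" + status_code + ")"
    try:
        phrase = HTTPStatus(status_code).phrase
    except ValueError:
        phrase = "non-standard code"
    if 200 <= status_code < 300:
        category = "reachable"
    elif status_code in (404, 410):
        category = "page missing"
    elif status_code in (401, 403):
        category = "access refused"
    else:
        category = "other"
    return f"{status_code} {phrase}: {category}"
```

A 403 means the server answered but refused the request, so it may be worth reviewing those rows separately from connection failures before concluding that a client's site is gone.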

In [None]:
def scrape_soup(url):
    """Fetch a URL and return (final_url, status_code, error, soup)."""
    import idna  # imported here so IDNAError can be caught below
    try:
        res = requests.get(url)
        status_code = res.status_code
        url = res.url  # final URL after any redirects
        error = "No error"
        soup = BeautifulSoup(res.content, "html5lib")
    except requests.ConnectionError as e:
        error = "CONNECTION ERROR: " + str(e)
        status_code = "Error1"
        soup = "Error1"
    except idna.IDNAError as e:
        error = "IDNA ERROR:" + str(e)
        status_code = "Error2"
        soup = "Error2"
    except requests.exceptions.ReadTimeout as e:
        error = "TIMEOUT ERROR:" + str(e)
        status_code = "Error3"
        soup = "Error3"

    # Log the outcome for every URL, whether it succeeded or failed.
    print(url, error, status_code)
    return (url, status_code, error, soup)
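
Since the long runs kept breaking on unexpected errors, here is a hedged sketch of a more defensive fetch step. It is not the code I actually ran: the name `fetch_page`, the 10-second timeout and the `"Error4"` label are assumptions added for illustration. It relies on two things about `requests`: a call has no timeout unless one is passed explicitly, and every exception the library raises derives from `requests.exceptions.RequestException`.

```python
import requests


def fetch_page(url, timeout=10):
    """Return (final_url, status_code, error, html), catching requests errors."""
    try:
        # One number applies to both the connect and the read phase;
        # 10 seconds is an arbitrary choice.
        res = requests.get(url, timeout=timeout)
        return (res.url, res.status_code, "No error", res.text)
    except requests.exceptions.Timeout as e:
        # Covers both ConnectTimeout and ReadTimeout.
        return (url, "Error3", "TIMEOUT ERROR: " + str(e), None)
    except requests.exceptions.ConnectionError as e:
        return (url, "Error1", "CONNECTION ERROR: " + str(e), None)
    except requests.exceptions.RequestException as e:
        # Anything else requests raises, e.g. a malformed URL or too many redirects.
        return (url, "Error4", "OTHER ERROR: " + str(e), None)
```

The `Timeout` handler comes first because `ConnectTimeout` is a subclass of both `Timeout` and `ConnectionError`. Parsing with BeautifulSoup is left out here so that a parsing problem cannot interrupt the fetch loop.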

### 2. Scraping (note: I ran this in batches on an AWS EC2 instance).

In [None]:
# DO NOT RUN, UNLESS YOU ARE RERUNNING ON PURPOSE. IT'LL TAKE A REALLY LONG TIME.
results = df['website'].apply(scrape_soup)
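
The batching itself is not shown in this notebook, so the following is only a sketch of one way it could be done, not a record of what I ran. The helper name `scrape_in_batches`, the batch size and the `batches` output directory are all hypothetical. The idea is to pickle each finished batch so that a crash after several hours only costs the batch in progress.

```python
import pickle
from pathlib import Path


def scrape_in_batches(urls, scrape, batch_size=500, out_dir="batches"):
    """Apply `scrape` to `urls` in chunks, pickling each finished chunk."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    results = []
    for start in range(0, len(urls), batch_size):
        path = out / f"batch_{start:07d}.pkl"
        if path.exists():
            # Batch finished in an earlier run: load it instead of re-scraping.
            with path.open("rb") as f:
                batch = pickle.load(f)
        else:
            batch = [scrape(u) for u in urls[start:start + batch_size]]
            with path.open("wb") as f:
                pickle.dump(batch, f)
        results.extend(batch)
    return results
```

With the names used in this notebook, a call would look like `results = scrape_in_batches(df['website'].tolist(), scrape_soup)`, which returns the same list of tuples that the later cells split into columns.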

In [None]:
#split that data into status code (to see whether the website is still running) and the rest (for future processing)
df['url'] = [result[0] for result in results]
df['status_code'] = [result[1] for result in results]
df['error'] = [result[2] for result in results]
df['soup'] = [result[3] for result in results]

### 3. URLs with info from scraping

In [None]:
df_new = pd.read_pickle("./FINAL_pickle_soup")

In [None]:
df_new.status_code.value_counts()

Of all the URLs, just under 90% came back with status code 200, around 5% with status code 403, and the rest with various other error status codes.