# Data Scraping

In [165]:
from bs4 import BeautifulSoup as bs
import requests
import re
import pandas as pd

First, I load a CSV file that I created earlier, which contains the URLs of a few Finnish companies.

In [166]:
#Get data from a csv into a dataframe

df = pd.read_csv('urls_db.csv')
print(df)

                          url
0           https://wolt.com/
1      https://www.nokia.com/
2  https://metacoregames.com/


Second, I define a function that sends an HTTP request to a given page. If the request succeeds (a 2xx status code), it parses the page content into a Beautiful Soup object. If the request fails, it reports the response status code.

In [167]:
def get_request(url):
    page = requests.get(url)
    if 200 <= page.status_code <= 299:
        print("The HTTP request has been successfully completed")
        # Parse the HTML so the page content can be searched
        return bs(page.content, 'html5lib')

    print("The HTTP request has not been successfully completed. \nStatus code: " + str(page.status_code))
    # There is no page to parse, so return None and let the caller skip this URL
    return None
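
As a small illustration of the status check, here is a sketch that classifies a few status codes using only the standard library. The helper name `is_success` is my own and is not part of `requests`. Note that `requests.get` follows redirects by default, so a 3xx code would normally not be the final status seen by `get_request`.

```python
from http import HTTPStatus


def is_success(status_code):
    # 2xx codes mean the request was received, understood and accepted
    return 200 <= status_code <= 299


for code in (200, 204, 301, 404, 500):
    label = "success" if is_success(code) else "not success"
    print(code, HTTPStatus(code).phrase, "->", label)
```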

Then, I create a function that gets the title, page URL and description of the website from its Open Graph meta tags. If no title is found, the text "No title found" is saved instead, and the other two fields fall back in the same way. The function returns a tuple with the three values.

In [168]:
def get_metadata(soup):
    # Look up the Open Graph meta tags; find() returns None when a tag is missing
    title = soup.find("meta", property="og:title")
    url = soup.find("meta", property="og:url")
    description = soup.find("meta", property="og:description")

    # Use the tag's content attribute, or a placeholder when the tag was not found
    title = title["content"] if title else "No title found"
    url = url["content"] if url else "No url found"
    description = description["content"] if description else "No description found"

    return title, url, description
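
To see the fallback in action without making a request, the following sketch runs the same lookup on a small inline HTML string. The markup, the company name "Example Oy" and the helper `og_content` are made up for illustration, and I use the built-in `html.parser` here instead of `html5lib` only to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical page head: it has og:title and og:url but no og:description
html = """
<html><head>
  <meta property="og:title" content="Example Oy">
  <meta property="og:url" content="https://example.com/">
</head><body></body></html>
"""


def og_content(soup, prop, default):
    # Return the content attribute of the meta tag, or the default if absent
    tag = soup.find("meta", property=prop)
    if tag is None:
        return default
    return tag.get("content", default)


soup = BeautifulSoup(html, "html.parser")
title = og_content(soup, "og:title", "No title found")
description = og_content(soup, "og:description", "No description found")
print(title)
print(description)
```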

Next, I create a function to get the address. Because this information is not on the main page of the website, the function looks for the link to the contact page and retrieves that page. There, it searches for text ending in "Finland" and returns the text of the enclosing element if it is found.

If the companies were based outside Finland, this function would need to be replaced by one that recognizes an address by its structure.

In [169]:
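def get_address(soup, url):
    # Find the text node that ends with "Contact"; its parent should be the link
    contact = soup.find(string=re.compile('Contact$'))
    href = contact.parent.get('href') if contact else None
    if not href:
        return "No address found"

    # Keep only the last path segment of the link and append it to the base url
    href = href.split("/")
    href = "/" + href[-1]

    contact_url = url + href
    contact_page = requests.get(contact_url)
    soup2 = bs(contact_page.content, 'html5lib')

    # Take the first element whose text ends with "Finland"
    finland = soup2.find_all(string=re.compile('Finland$'))
    if not finland:
        return "No address found"

    return finland[0].parent.text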
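
Because the URLs in the csv end with a slash, joining them to the link with `+` yields a double slash such as `https://wolt.com//contact`. A possible alternative, sketched below, is `urljoin` from the standard library, which resolves a link against the base URL. The `href` values are hypothetical examples, not links taken from the real pages, and unlike `get_address` this keeps the full path of the link rather than only its last segment.

```python
from urllib.parse import urljoin

base = "https://wolt.com/"

# Plain concatenation, as in get_address
concatenated = base + "/contact"

# urljoin resolves root-relative, relative and absolute links against the base
root_relative = urljoin(base, "/contact")
relative = urljoin("https://www.nokia.com/", "about-us/contact/")
absolute = urljoin(base, "https://wolt.com/en/contact")

print(concatenated)
print(root_relative)
print(relative)
print(absolute)
```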

Afterwards, I iterate over the URLs in the dataframe, get the title, URL, description and address for each one, and save them in lists.

In [170]:
titles = []
urls = []
descriptions = []
addresses = []

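for row in df.index:
    page_url = df['url'][row]
    print(page_url)
    soup = get_request(page_url)

    if soup is None:
        # The request failed: store placeholders so the lists stay aligned with the rows
        title, url, description = "No title found", "No url found", "No description found"
        address = "No address found"
    else:
        title, url, description = get_metadata(soup)
        address = get_address(soup, page_url)

    titles.append(title)
    urls.append(url)
    descriptions.append(description)
    addresses.append(address)

    print("Title: " + title)
    print("Description: " + description)
    print("Address: " + address + "\n")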

https://wolt.com/
The HTTP request has been successfully completed
Title: Wolt Delivery: Food and more – Finland
Description: Wolt delivers from the best restaurants and stores around you.
Address: WoltArkadiankatu 600100 HelsinkiSuomi, Finland

https://www.nokia.com/
The HTTP request has been successfully completed
Title: Home | Nokia
Description: Matkapuhelin-, kiinteiden ja pilvipalveluverkkojen teknologiajohtajana olemme mukana rakentamassa tuottavampaa, vastuullisempaa ja monimuotoisempaa maailmaa.
Address: Visiting address:
Karakaari 7
02610 Espoo, Finland

https://metacoregames.com/
The HTTP request has been successfully completed
Title: Metacore | The game company where players are the closest thing to a boss
Description: We are the game company where players are the closest thing to a boss. That means our dream is to make our players’ dreams come true.
Address: Porkkalankatu 24,00180 HelsinkiFinland
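
The Wolt and Metacore addresses above come out with their lines run together (for example "HelsinkiFinland"), presumably because `.text` joins the strings inside the element without any separator. One way to avoid this, sketched on made-up markup since I have not checked the structure of the real contact pages, is `get_text` with a separator:

```python
from bs4 import BeautifulSoup

# Hypothetical address block with one line per paragraph
html = "<div><p>Arkadiankatu 6</p><p>00100 Helsinki</p><p>Finland</p></div>"
block = BeautifulSoup(html, "html.parser").find("div")

# .text concatenates the pieces directly
joined = block.text

# get_text can strip each piece and put a separator between them
separated = block.get_text(separator=", ", strip=True)

print(joined)
print(separated)
```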



Finally, I populate the dataframe with the lists obtained and export it to a csv file.

In [171]:
df['Title'] = titles
df['Extracted Url'] = urls
df['Description'] = descriptions
df['Address'] = addresses

In [172]:
df.to_csv("company_information2.csv")
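
By default `to_csv` also writes the dataframe index as an unnamed first column. If that column is not wanted, `index=False` leaves it out. The sketch below shows the difference on a one-row dataframe with made-up values; calling `to_csv` without a path returns the csv text as a string, which makes it easy to inspect.

```python
import pandas as pd

# Made-up row, only to compare the two outputs
sample = pd.DataFrame({"url": ["https://example.com/"], "Title": ["Example Oy"]})

with_index = sample.to_csv()
without_index = sample.to_csv(index=False)

print(with_index)
print(without_index)
```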