# Web Scraping from Wikipedia

The goal is to collect information on 6 companies by scraping the wikipedia.org website. The companies are:
- Microsoft
- Salesforce
- BNP Paribas
- HSBC
- Dataiku
- Bouygues Construction

The information we are interested in is:
- Revenue
- Number of employees
- Headquarters
- Social media links
- Website links

To solve the problem we will use the Selenium library. The collected information will then be presented in a dashboard built as a web application, which must include a map locating the head offices.

In [94]:
from selenium import webdriver
import pandas as pd
import numpy as np
import re

## Web Scraping

In [95]:
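from selenium.webdriver.chrome.service import Service

#store the base URL of the English Wikipedia articles
url = 'https://en.wikipedia.org/wiki/'
#start Chrome through the local chromedriver executable (Selenium 4 style)
driver = webdriver.Chrome(service=Service('chromedriver.exe'))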

In [96]:
from selenium.webdriver.common.by import By

companies = ['Microsoft','Salesforce','BNP Paribas','HSBC','Dataiku','Bouygues Construction']
effectif = []      #number of employees
site_web = []      #website
siege_social = []  #headquarters
revenue = []

for company in companies :
    
    #wikipedia extraction
    #go to the company's wiki page
    driver.get(url + company.replace(" ", "_"))
    
    #collect the raw infobox content
    wiki_data = driver.find_elements(By.XPATH, '//table[@class="infobox vcard"]/tbody')
    
    #collect information we need using RegEx
    effectif.append(re.findall(r"Number of employees(.*?)\n",wiki_data[0].text)[0])
    site_web.append(re.findall(r"Website\s\w+\.\w+\.\w+|Website\s\w+\.\w+|Website\s\.*",wiki_data[0].text)[0])
    siege_social.append(re.findall(r"Headquarters(.*?)\n(.*?)\n",wiki_data[0].text)[0])
    revenue.append(re.findall(r"Revenue(.*?)\n",wiki_data[0].text)[0])
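
Each pattern above captures the rest of the line that follows an infobox label, and indexing with `[0]` raises an `IndexError` when a label is absent from the page. The sketch below shows the same idea on a made-up infobox text, returning `None` for a missing field instead. The sample text and the helper name `extract_field` are assumptions for illustration, not part of the scraper.

```python
import re

# Hypothetical text, shaped like the `.text` of a Wikipedia infobox
sample_text = (
    "Headquarters One Example Way\n"
    "Springfield, Example State, U.S.\n"
    "Key people Jane Doe (CEO)\n"
    "Revenue US$12.5 billion (2022)\n"
    "Number of employees 1,234 (2022)\n"
    "Website example.com\n"
)

def extract_field(label, text):
    """Return the rest of the line following `label`, or None if absent."""
    match = re.search(rf"{re.escape(label)}(.*?)\n", text)
    return match.group(1).strip() if match else None

employees = extract_field("Number of employees", sample_text)
revenue = extract_field("Revenue", sample_text)
missing = extract_field("Operating income", sample_text)  # label not in the sample
```

Returning `None` lets the loop continue over the remaining companies and leaves the gap visible in the final table.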
       


## From raw data to preprocessed CSV file

In [97]:
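#create dataframe (one column per scraped field; a dict avoids building a ragged NumPy array)
dataframe = pd.DataFrame({"Name": companies, "Effectif": effectif, "Site Web": site_web, "Revenue": revenue, "Adresse": siege_social})

#preprocessing
#build the site URL; "https://www." is prepended even when the scraped host already starts with "www."
dataframe["Site Web"] = dataframe["Site Web"].apply(lambda x : "https://www." + x.replace("Website","").replace(" ",""))
#keep the headcount before any "(...)" note and strip "," and "+"
dataframe["Effectif"] = dataframe["Effectif"].apply(lambda x : int(x.split("(")[0].replace(',','').replace('+','')))
#drop the "(...)" note after the revenue
dataframe['Revenue'] = dataframe['Revenue'].apply(lambda x : x.split("(")[0])
#join the two headquarters lines and cut off the infobox labels that follow ("Key ...", "Area ...")
dataframe["Adresse"] = dataframe["Adresse"].apply(lambda x : " ".join(map(str, x)).split("Key")[0].split("Area")[0])
#reduce the revenue to digits; removing the decimal point before appending "000" puts values with different numbers of decimals on different scales
dataframe["Revenue"] = dataframe["Revenue"].apply(lambda x : x.replace(".","").replace(" million","").replace(" billion", "000"))
#convert to int; euro amounts are multiplied by 1.04 (presumably a fixed EUR to USD rate)
dataframe["Revenue"] = dataframe["Revenue"].apply(lambda x : int(x.replace("US$", "") if "US$" in x else 1.04*int(x.replace("€", ""))))
#guess the LinkedIn page from the company name
dataframe["LinkedIn"] = dataframe["Name"].apply(lambda x : "https://www.linkedin.com/company/" + x.lower().replace(" ","-") +"/")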
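
The revenue preprocessing above removes the decimal point before appending "000", so "198.3 billion" and "26.49 billion" do not end up in the same unit. A possible alternative, sketched here under assumptions, is to parse the number as a float and multiply it by an explicit scale. The function name `parse_revenue_millions`, the choice of millions of US dollars as the unit, and the "trillion" entry are hypothetical; the 1.04 rate is the one used in the cell above.

```python
import re

# Assumed multipliers, expressing every amount in millions
SCALE_IN_MILLIONS = {"million": 1, "billion": 1_000, "trillion": 1_000_000}

def parse_revenue_millions(raw, eur_to_usd=1.04):
    """Parse strings like 'US$198.3 billion' into millions of US dollars.

    Returns None when the string does not match the expected shape.
    """
    match = re.search(r"(US\$|€)\s*([\d.,]+)\s*(million|billion|trillion)", raw)
    if match is None:
        return None
    currency, number, scale = match.groups()
    amount = float(number.replace(",", "")) * SCALE_IN_MILLIONS[scale]
    if currency == "€":
        # fixed conversion rate, same value as in the preprocessing cell
        amount *= eur_to_usd
    return round(amount, 2)

microsoft_like = parse_revenue_millions("US$198.3 billion (2022)")
euro_amount = parse_revenue_millions("€46.2 billion")
```

It could be applied with `dataframe["Revenue"].apply(parse_revenue_millions)` on the raw scraped strings, before any other revenue cleaning.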


For the map we need to find the geographical coordinates (longitude, latitude) of the scraped addresses.

In [107]:
from geopy import Nominatim
from geopy.extra.rate_limiter import RateLimiter


locator = Nominatim(user_agent="myGeocoder")
location = locator.geocode("Washington, U.S.")

# convenient wrapper that enforces a delay between geocoding calls
geocode = RateLimiter(locator.geocode, min_delay_seconds=2)

def get_geocode(address):
    """
    Get the longitude and latitude of an address.

    If the full address cannot be geocoded, leading
    words are removed one at a time until the address
    is understood by the geopy.Nominatim.geocode function.

    Parameters
    ----------
    address : str

    Returns
    -------
    longitude, latitude : float
        (None, None) if no part of the address can be geocoded.
    """

    while address:
        # geocode once per candidate address (each call is rate limited)
        location = geocode(address)
        if location is not None:
            return location.longitude, location.latitude
        # drop the first word and try again
        address = " ".join(address.split()[1:])

    return None, None
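
To see the word-dropping fallback without calling the Nominatim service, the sketch below runs the same loop against a stand-in geocoder. Everything here is hypothetical: `fake_geocode`, `KNOWN_PLACES`, the made-up address and coordinates, and the helper `geocode_with_fallback`, which also records each attempted address.

```python
from collections import namedtuple

Location = namedtuple("Location", ["longitude", "latitude"])

# Stand-in for a geocoding service: it only knows one made-up place
KNOWN_PLACES = {"Springfield, Example State": Location(-1.0, 2.0)}

def fake_geocode(address):
    """Return a Location for known addresses, None otherwise."""
    return KNOWN_PLACES.get(address)

def geocode_with_fallback(address, geocode_fn):
    """Drop leading words until geocode_fn returns a result."""
    attempts = []
    while address:
        attempts.append(address)
        location = geocode_fn(address)
        if location is not None:
            return (location.longitude, location.latitude), attempts
        # remove the first word and retry with the shorter address
        address = " ".join(address.split()[1:])
    return (None, None), attempts

coords, attempts = geocode_with_fallback(
    "One Example Way Springfield, Example State", fake_geocode
)
```

A shortened address can match a different place than the intended one, so the resulting coordinates are worth checking against the expected city or country.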

In [114]:
longitude = []
latitude = []

for address in dataframe["Adresse"] :
    
    long, lat = get_geocode(address)
    
    longitude.append(long)
    latitude.append(lat)

In [115]:
dataframe["Longitude"] = longitude
dataframe["Latitude"] = latitude

In [116]:
dataframe

| | Name | Effectif | Site Web | Revenue | Adresse | LinkedIn | Longitude | Latitude |
|---|---|---|---|---|---|---|---|---|
| 0 | Microsoft | 221000 | https://www.microsoft.com | 1983000 | One Microsoft Way Redmond, Washington, U.S. | https://www.linkedin.com/company/microsoft/ | -122.17795 | 47.618895 |
| 1 | Salesforce | 73542 | https://www.salesforce.com | 2649000 | Salesforce Tower San Francisco, California, U.S. | https://www.linkedin.com/company/salesforce/ | -60.685076 | -33.029148 |
| 2 | BNP Paribas | 190000 | https://www.group.bnpparibas | 480480 | Boulevard des Italiens, Paris, France | https://www.linkedin.com/company/bnp-paribas/ | 2.339403 | 48.87189 |
| 3 | HSBC | 219697 | https://www.www.hsbc.com | 49552000 | 8 Canada Square London, England, UK | https://www.linkedin.com/company/hsbc/ | -0.01744 | 51.505432 |
| 4 | Dataiku | 1000 | https://www.dataiku.com | 150 | New York City, United States | https://www.linkedin.com/company/dataiku/ | -74.006015 | 40.712728 |
| 5 | Bouygues Construction | 124600 | https://www. | 3909360 | 8th arrondissement, Paris, France | https://www.linkedin.com/company/bouygues-cons... | 2.31765 | 48.87748 |


In [118]:
# forward slashes avoid backslash escape sequences in the path
# (a normal string literal containing "\U" raises a 'unicodeescape' SyntaxError)
dataframe.to_csv("delpha/collected_data/delpha_data.txt")