### Web Scrapping

**Problema real**: A nuestra compañera Tengya su jefe le ha mandado la ardua tarea de recopilar información de la siguiente página WEB: https://biocat.force.com/Catalonialifesciencesdatabase/s/

Hay más de 1600 empresas páginadas de 20 en 20, y para cada una de ellas hay que recopilar los siguientes campos:

In [14]:
campos = ['company', 'location', 'main_sector', 'subsector', 'primary', 'description']

Yendo rápido, podemos hacer una estimaciín del tiempo que nos ocuparía:

$$tiempo\:necesario = 1609\times2\:min = 3218\:min = 53h = 6.7\:jornadas\:laborales$$

**Alternativa**: Utilizar nuestros conocimientos de **SCRAPPING** como **data analysts**


#### Librerias necesarias

conda install -c conda-forge selenium   (o bien pip install selenium)

conda install -c anaconda beautifulsoup4   (o bien pip install beautifulsoup4)

**Es necesario tener le navegador Chrome instalado**

#### Código:

In [15]:
import time
import pandas as pd
import re
from bs4 import BeautifulSoup
from selenium import webdriver

Definimos la URL:

In [16]:
url_base = "https://biocat.force.com/Catalonialifesciencesdatabase/s/"

Vamos a controlar una instancia de Chrome a través de Python:

In [17]:
driver = webdriver.Chrome("./chromedriver.exe")

Nos descargamos toda la información de la página, incluído el html que se genera con el javascript que llega incrustado:

In [18]:
driver.get(url_base)
time.sleep(3)

Para obtener la página en HTML usamos:

In [19]:
html = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
time.sleep(2)

Una vez tenemos el HTML generado ya podemos utilizar BeautifulSoup:

In [20]:
soup = BeautifulSoup(html, "html.parser")

Ahora la gracia está en analizar bien el HTML y donde están las etiquetas de los elementos que nos interesa scrappear.

Para ver el documento HTML de la forma más ordenada posible, podemos utilizar:

In [21]:
from lxml import etree, html

document_root = html.fromstring(str(soup))
print(etree.tostring(document_root, encoding='unicode', pretty_print=True))

<html>
  <head><title>Catalonia Health and Life Sciences Data Platform</title><link href="https://biocat.force.com/Catalonialifesciencesdatabase/resource/1664541382000/BiocatFavicon" rel="shortcut icon"/><meta content="initial-scale=1.0, maximum-scale=5.0, minimum-scale=1.0, minimal-ui" name="viewport"/>
<style>
        .auraMsgMask, #auraErrorMask, body .auraLoadingBox {
            display: none;
        }

        .spaError {
            padding: 10px;
        }

        .spaErrorLink {
            padding: 10px 0;
            display: block
        }

        </style><style/><link class="auraCss" data-href="/Catalonialifesciencesdatabase/s/sfsites/l/%7B%22mode%22%3A%22PROD%22%2C%22app%22%3A%22siteforce%3AcommunityApp%22%2C%22loaded%22%3A%7B%22APPLICATION%40markup%3A%2F%2Fsiteforce%3AcommunityApp%22%3A%22HU7u5uZWBH_8Nshn9opI8g%22%7D%2C%22styleContext%22%3A%7B%22c%22%3A%22webkit%22%2C%22x%22%3A%5B%22isDesktop%22%5D%2C%22tokens%22%3A%5B%22markup%3A%2F%2Fsiteforce%3AserializedTokens%22

Una vez hemos identificado los campos que queremos recoger nos creamos una función que nos lo haga automáticamente:

In [22]:
def biocat_parser(card):
    # title parsing (company name)
    title_line = card.find_all(True, {"class": "card-title"})
    title = re.findall(r'>(.*?)<', str(title_line))
    company = [row for row in title if re.findall("[A-z]", row)][0]
    try:
        # City, Province
        subtitle_line = card.find_all(True, {"class": "card-subtitle"})
        subtitle = re.findall(r'>(.*?)<', str(subtitle_line))
        location = [row for row in subtitle if re.findall("[A-z]", row)][0]

        # Main sector, subsector, Primary therapeutic areas and Description
        section = str(card.find_all(True, {"class": "section-desc"}))
        section = [row for row in re.findall(r'>(.*?)<', section) if re.findall("[A-z]", row)]
        main_sector = section[0]
        subsector = section[1]
        if len(section) == 4:
            primary = section[2]
            description = section[3]
        elif len(section) == 3:
            primary = ''
            description = section[2]
        else:
            return None, company

        return company, location, main_sector, subsector, primary, description
    except:
        return None, company

Tambien tendremos que encontrar el botton de `Next` para que el programa vaya navegando por las distintas paginaciones. Lo podemos encontrar por el texto que contiene:

In [23]:
boton_next = driver.find_element_by_xpath('//button[text()="Next"]')

Nos definimos un DataFrame vació donde iremos guardando la información que queremos recoger sobre las compañías:

In [24]:
columns = ['company', 'location', 'main_sector', 'subsector', 'primary', 'description']
data = pd.DataFrame([], columns=columns)

Ahora, recorreremos todas las página de resultados con el botón `Next`e iremos recogiendo toda la información que nos interese con la función que hemos creado `biocat_parser`:

In [12]:
not_finished = True
potential_conflict_companies = []
potential_conflict_indexes = []
pagination_lengths = []
time.sleep(3)
while not_finished:
    try:
        html = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
        soup = BeautifulSoup(html, "html.parser")
        cards = soup.find_all(True, {"class": "card-container"})
        pagination_lengths.append(len(cards))
        for i, card in enumerate(cards):
            try:
                item_df = pd.DataFrame([list(biocat_parser(card))], columns=columns)
                if item_df.empty or biocat_parser(card) is None:
                    pass
                data = pd.concat([data, item_df], ignore_index=True)
                print(f"Data from company extracted: {item_df['company'].iloc[0]}")
            except:
                try:
                    _, company_conflict = biocat_parser(card)
                except:
                    pass
                potential_conflict_companies.append(f"{company_conflict}")
                print(f"\n********************************** Potential issue in company: "
                      f"{company_conflict}\n")
        driver.find_element_by_xpath('//button[text()="Next"]').click()
        time.sleep(1)
    except:
        not_finished = False

Data from company extracted: AB Medica Group
Data from company extracted: Abac Capital
Data from company extracted: Abac Therapeutics
Data from company extracted: Abamed Pharma
Data from company extracted: AB-Biotics
Data from company extracted: ABC Farma Internacional
Data from company extracted: Abeona Health
Data from company extracted: ABG Patentes
Data from company extracted: Ability Pharma
Data from company extracted: ABLE Human Motion
Data from company extracted: Abzu
Data from company extracted: Accelerate Diagnostics
Data from company extracted: ACCIÓ Agència per a la competitivitat de l'empresa


In [13]:
pd.DataFrame({"conflicts": potential_conflict_companies}).to_csv("conflicts.csv")
data.to_csv("biocat.csv")

In [25]:
not_finished = True
potential_conflict_companies = []
potential_conflict_indexes = []

time.sleep(1)
t = time.time()

while not_finished:
    try:
        html = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
        soup = BeautifulSoup(html, "html.parser")
        cards = soup.find_all(True, {"class": "card-container"})
        for i, card in enumerate(cards):
            try:
                item_df = pd.DataFrame([list(biocat_parser(card))], columns=columns)
                data = pd.concat([data, item_df], ignore_index=True)
                print(f"Data from company extracted: {item_df['company'].iloc[0]}")
            except:
                potential_conflict_indexes.append(i)
                potential_conflict_companies.append(f"{item_df['company'].iloc[0]}")
                print(f"\n********************************** Potential issue in company with index {i}: {item_df['company']}\n")
        driver.find_element_by_xpath('//button[text()="Next"]').click()
        time.sleep(1)
    except:
        not_finished = False

print(f"\n\nTiempo total: {time.time()-t:.2f} segundos")

Data from company extracted: AB Medica Group
Data from company extracted: Abac Capital
Data from company extracted: Abac Therapeutics
Data from company extracted: Abamed Pharma
Data from company extracted: AB-Biotics
Data from company extracted: ABC Farma Internacional
Data from company extracted: Abeona Health
Data from company extracted: ABG Patentes
Data from company extracted: Ability Pharma
Data from company extracted: ABLE Human Motion
Data from company extracted: Abzu
Data from company extracted: Accelerate Diagnostics
Data from company extracted: ACCIÓ Agència per a la competitivitat de l'empresa

********************************** Potential issue in company with index 13: 0    ACCIÓ Agència per a la competitivitat de l'emp...
Name: company, dtype: object

Data from company extracted: Accure Therapeutics
Data from company extracted: ACEFE
Data from company extracted: Acellera
Data from company extracted: Actelion Pharmaceuticals España
Data from company extracted: Acteon Medico

In [26]:
len(data)

1389

In [27]:
len(potential_conflict_companies)

221

In [31]:
len(data) + len(potential_conflict_companies)

1610

**Dato curioso**: Ahora EY tiene información de mi start-up **LEDMOTIVE**!!!

In [28]:
'Ledmotive' in list(data['company'])

True

In [29]:
data[data['company']=="Ledmotive"]

Unnamed: 0,company,location,main_sector,subsector,primary,description
820,Ledmotive,"Sant Adrià del Besos, Barcelona",Supplier &amp; Engineering,Instrumentation,,Ledmotive is specialized in spectrally tunable...


In [30]:
data[data['company']=="Ledmotive"]['description'].iloc[0]

'Ledmotive is specialized in spectrally tunable LED light engines with 10 independent channels. Integrated spectrometer to check the light output in real-time. Spectroscopy, microscopy, medical sciences, biology and optogenetics, photonics scientific fields can take advantage of their light sources and scientific software.'