# Belgian AI Landscape

## Part 1: Scraping and storing data

### 1. Import Modules

In [1]:
import numpy as np
import pandas as pd
import time
from scrapeData import getData, createDriver, getAddress
from concurrent.futures import ThreadPoolExecutor

### 2. Scrape company information from AI4Belgium

In [None]:
# Scrape the AI4Belgium landscape page into a DataFrame
ai_df = getData('https://www.ai4belgium.be/ai-landscape/')

### 3. Search for the company name in Google search and Maps to obtain addresses

    Note: Here we use threading to speed up the address lookups.

In [10]:
# Note: time.process_time() reports CPU time, not wall-clock time;
# time.perf_counter() would measure the elapsed time of the threaded run.
start_time = time.process_time()

# Use as many drivers as there are threads, so each thread gets its own driver
with ThreadPoolExecutor(max_workers=10) as executor:
    result = [executor.submit(getAddress, name) for name in ai_df['Company Name']]

print(round(time.process_time() - start_time, 2), "seconds")

# Collect the addresses in a list using a list comprehension
addresses = [item.result()[0] for item in result]

# Add the addresses as a new column in the DataFrame
ai_df['Address'] = addresses

# Save DataFrame as a CSV file
ai_df.to_csv('data/AILandscape_from_script.csv', index=False)

38.36 seconds
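
The thread-pool pattern used above can be sketched in isolation. This is a minimal sketch, not the project's implementation: `fake_get_address` is a hypothetical stand-in for `getAddress` (assumed here to return an indexable result whose first element is the address), and the fallback to an empty string on failure is an assumption about how a failed lookup could be recorded. It uses `time.perf_counter()` so the printed duration is elapsed wall-clock time.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_get_address(name):
    """Hypothetical stand-in for getAddress: sleeps to mimic network I/O."""
    time.sleep(0.05)
    if name == "Unknown Co":
        raise ValueError("no address found")
    return (f"address of {name}",)


def fetch_all(names, worker, max_workers=10):
    """Run worker over names in a thread pool and keep the input order.

    A failed lookup is stored as '' (an assumption) so that it can be
    counted as missing later instead of aborting the whole run.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(worker, name) for name in names]
    results = []
    for future in futures:
        try:
            results.append(future.result()[0])
        except Exception:
            results.append("")
    return results


names = ["AE Projects", "Agilytic", "Unknown Co", "Aptus"]
start = time.perf_counter()
addresses = fetch_all(names, fake_get_address)
elapsed = time.perf_counter() - start
print(addresses)
print(round(elapsed, 2), "seconds")
```

Because the futures are read back in submission order, the result list lines up with the company names and can be assigned directly as a DataFrame column.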


### 4. Check for Missing Values

    Currently, our script scrapes 312 of the 437 addresses, leaving 125 missing.

In [18]:
(ai_df['Address']=='').sum()

125
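
The comparison with `''` counts only empty strings. Once the CSV is reloaded with `pd.read_csv`, empty fields are read as NaN by default, so the same check would no longer find them. A minimal sketch of a broader check, using a hypothetical helper `is_missing` and made-up sample values:

```python
import math


def is_missing(value):
    """True for None, NaN, or a string that is empty after stripping."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip() == ""


# Sample values for illustration only: one real-looking address, four gaps
sample_addresses = ["Grauwpoort 1, 9000 Gent", "", "   ", None, float("nan")]
missing_count = sum(is_missing(a) for a in sample_addresses)
print(missing_count)
```

On a DataFrame column the helper could be applied with `ai_df['Address'].apply(is_missing).sum()`.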

### 5. Next Steps

1. Automate filling in missing values. Idea: Check other websites for scraping.
2. Visualize the gathered data on a map.
3. Create Streamlit App and deploy.

To address the missing values, we filled in the missing addresses manually while we work on automating this step.

To get the data ready for our app, we preprocess the addresses to extract the zip code and city into separate columns for further filtering.

## Part 2: Cleaning

In [3]:
ai_df_complete = pd.read_csv('data/AI_Landscape_BE_completed.csv')
ai_df_complete.head()

Unnamed: 0,Company Name,Link,Region,Address
0,AE Projects,https://www.ae.be/en/,Flanders,"Interleuvenlaan 27/B, 3001 Leuven"
1,Agilytic,http://www.agilytic.be,Wallonia,"Rue F. Dubois 2, 1310 La Hulpe"
2,Aptus,https://www.aptus.be/#,Flanders,"Meensesteenweg 449, 8501 Kortrijk"
3,Arinti,https://arinti.ai/,Flanders,"Duigemhofstraat 101, 3020 Herent"
4,Around Media,https://www.around.media/,Flanders,"Kortrijksesteenweg 1127/0002, 9051, 9000 Gent"


Most addresses follow the same format (street, then zip code and city), so we can split them on that pattern.

In [4]:
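# Split on the first comma: street on the left, "zip code city" on the right
ai_df_complete[['Street','Location']] = ai_df_complete['Address'].str.split(",", n=1, expand=True)

# Remove leading and trailing whitespace (.str.strip() leaves missing values
# missing instead of turning them into the strings 'None' or 'nan')
ai_df_complete['Street'] = ai_df_complete['Street'].str.strip()
ai_df_complete['Location'] = ai_df_complete['Location'].str.strip()

# Split on the first space: zip code on the left, city on the right
ai_df_complete[['ZipCode','City']] = ai_df_complete['Location'].str.split(" ", n=1, expand=True)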


In [29]:
ai_df_complete.head(20)

Unnamed: 0,Company Name,Link,Region,Address,Street,Location,ZipCode,City
0,AE Projects,https://www.ae.be/en/,Flanders,"Interleuvenlaan 27/B, 3001 Leuven",Interleuvenlaan 27/B,3001 Leuven,3001,Leuven
1,Agilytic,http://www.agilytic.be,Wallonia,"Rue F. Dubois 2, 1310 La Hulpe",Rue F. Dubois 2,1310 La Hulpe,1310,La Hulpe
2,Aptus,https://www.aptus.be/#,Flanders,"Meensesteenweg 449, 8501 Kortrijk",Meensesteenweg 449,8501 Kortrijk,8501,Kortrijk
3,Arinti,https://arinti.ai/,Flanders,"Duigemhofstraat 101, 3020 Herent",Duigemhofstraat 101,3020 Herent,3020,Herent
4,Around Media,https://www.around.media/,Flanders,"Adres: Kortrijksesteenweg 1127/0002, 9051, 900...",Adres: Kortrijksesteenweg 1127/0002,"9051, 9000 Gent",9051,9000 Gent
5,B12 Consulting,https://www.b12-consulting.com/,Wallonia,"Boucle Odon Godart, 2, 1348, Ottignies-Louvain...",Boucle Odon Godart,"2, 1348, Ottignies-Louvain-la-Neuve",2,"1348, Ottignies-Louvain-la-Neuve"
6,Blendr.io,https://www.blendr.io/,Flanders,"Grauwpoort 1, 9000 Gent",Grauwpoort 1,9000 Gent,9000,Gent
7,BOBUP,http://www.bobup.be,Brussels,"Rue de Ransbeek 230, 1120 Bruxelles",Rue de Ransbeek 230,1120 Bruxelles,1120,Bruxelles
8,Boltzmann,http://www.boltzmann.be,Flanders,"Gaston Crommenlaan 12, 9031 Gent",Gaston Crommenlaan 12,9031 Gent,9031,Gent
9,Brainjar,http://www.brainjar.ai,Flanders,"Gaston Geenslaan 11, 3001 Leuven",Gaston Geenslaan 11,3001 Leuven,3001,Leuven
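
Rows 4 and 5 in the output above show where splitting on the first comma goes wrong: addresses with extra commas end up with a house number or a second postal code in `ZipCode`. One possible alternative, sketched here under the assumption that every address ends with a 4-digit Belgian postal code followed by the city name, is to extract both with a regular expression anchored at the end of the string. The names `ZIP_CITY` and `split_address` are hypothetical.

```python
import re

# Assumption: the address ends with "<4-digit postal code> <city>",
# optionally with a comma between the two. The city must not start
# with a digit, so "9051, 9000 Gent" resolves to 9000 / Gent.
ZIP_CITY = re.compile(r"\b(\d{4})\b,?\s+([^\d,][^,]*)$")


def split_address(address):
    """Return (zip_code, city), or (None, None) if the pattern is absent."""
    match = ZIP_CITY.search(address.strip())
    if match is None:
        return (None, None)
    return (match.group(1), match.group(2).strip())


samples = [
    "Interleuvenlaan 27/B, 3001 Leuven",
    "Rue F. Dubois 2, 1310 La Hulpe",
    "Kortrijksesteenweg 1127/0002, 9051, 9000 Gent",
    "Boucle Odon Godart, 2, 1348, Ottignies-Louvain-la-Neuve",
]
for sample in samples:
    print(split_address(sample))
```

The same pattern string can be passed to `Series.str.extract`, which returns one column per capturing group, so it could fill `ZipCode` and `City` directly from `Address`. Addresses that do not match come back empty and can be reviewed by hand.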


In [28]:
ai_df_complete.to_csv('data/AILandscape_cleaned.csv', index = False)