# Belgian AI Landscape

## Part I: Scraping and storing data

### 1. Import Modules

In [6]:
import numpy as np
import pandas as pd
import time
from scrapeData import getData, createDriver, getAddress
from concurrent.futures import ThreadPoolExecutor

### 2. Scrape company information from AI4Belgium

In [None]:
# Scrape the AI4Belgium AI landscape page into a DataFrame
ai_df = getData('https://www.ai4belgium.be/ai-landscape/')

### 3. Search for the company name in Google search and Maps to obtain addresses

    Note: Here we use threading to speed up the address lookups.

In [10]:
# Note: process_time() counts CPU time only, so time spent waiting on the network is not included
start_time = time.process_time()

# Submit one getAddress task per company to a pool of 10 threads (each thread is meant to use its own driver)
with ThreadPoolExecutor(max_workers=10) as executor:
    result = [executor.submit(getAddress, name) for name in ai_df['Company Name']]

print(round(time.process_time() - start_time, 2), "seconds")

# Collect the first element of each result (the address) in a list
addresses = [item.result()[0] for item in result]

# Add the addresses as a new column in the DataFrame
ai_df['Address']  = addresses

# Save DataFrame as a CSV file
ai_df.to_csv('../data/AILandscape_from_script.csv', index = False)

38.36 seconds
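
The threaded lookup pattern can be sketched in isolation as follows. This is a minimal illustration, not the notebook's actual code: `fake_lookup` is a hypothetical stand-in for `getAddress` that only sleeps to imitate network latency, and the company names are made up. It uses `time.perf_counter()`, which measures elapsed wall-clock time, whereas `time.process_time()` excludes time spent sleeping or waiting on I/O.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_lookup(name):
    """Hypothetical stand-in for getAddress: waits like a network call would."""
    time.sleep(0.05)
    return f"{name} street 1"


# Made-up company names for illustration only
names = [f"Company {i}" for i in range(20)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as executor:
    # executor.map yields results in the same order as the inputs
    looked_up = list(executor.map(fake_lookup, names))
elapsed = time.perf_counter() - start

print(f"{elapsed:.2f} seconds for {len(looked_up)} lookups")
```

Because `executor.map` preserves input order, the results can be assigned straight to a DataFrame column, just like the list built from the futures above.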


### 4. Check for Missing Values

    Currently, our script scrapes 312 addresses out of 437 (i.e. 125 are missing).

In [18]:
(ai_df['Address']=='').sum()

125
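
One caveat when repeating this check on a reloaded file: with its default settings, `pd.read_csv` reads empty fields back as `NaN`, so a comparison with `''` no longer finds them. The sketch below uses made-up rows and an in-memory buffer instead of a real file to show a check that covers both cases.

```python
import io

import pandas as pd

# Made-up data: two of the three companies have no address
df = pd.DataFrame({
    "Company Name": ["Acme", "Globex", "Initech"],
    "Address": ["Main Street 1, 1000 Brussels", "", ""],
})
empty_before = (df["Address"] == "").sum()

# Round-trip through CSV using an in-memory buffer
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# After reloading, empty fields are NaN, so test for both forms
empty_after = (reloaded["Address"] == "").sum()
missing_after = (reloaded["Address"].isna() | (reloaded["Address"] == "")).sum()

print(empty_before, empty_after, missing_after)
```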

### 5. Next Steps

1. Automate filling in missing values. Idea: scrape additional websites.
2. Visualize the gathered data on a map.
3. Create a Streamlit app and deploy it.

To address the missing values, we entered the missing addresses manually while we work on automating this step.

To get the data ready for the app, we preprocess the addresses to extract the postal code and city into separate columns for further filtering.

## Part II: Cleaning

In [6]:
ai_df_complete = pd.read_csv('../data/AI_Landscape_BE_completed.csv')
ai_df_complete.head()

Unnamed: 0,Company Name,Link,Region,Address
0,AE Projects,https://www.ae.be/en/,Flanders,"Interleuvenlaan 27/B, 3001 Leuven"
1,Agilytic,http://www.agilytic.be,Wallonia,"Rue F. Dubois 2, 1310 La Hulpe"
2,Aptus,https://www.aptus.be/#,Flanders,"Meensesteenweg 449, 8501 Kortrijk"
3,Arinti,https://arinti.ai/,Flanders,"Duigemhofstraat 101, 3020 Herent"
4,Around Media,https://www.around.media/,Flanders,"Kortrijksesteenweg 1127/0002, 9051, 9000 Gent"


Most addresses share the same format (street, then postal code and city), which we use to split them into separate columns.

In [7]:
# Split each address at the first comma into street and location (postal code + city)
ai_df_complete[['Street','Location']] = ai_df_complete['Address'].str.split(",", n=1, expand=True)

# Remove leading and trailing whitespace (missing values stay missing)
ai_df_complete['Street'] = ai_df_complete['Street'].str.strip()
ai_df_complete['Location'] = ai_df_complete['Location'].str.strip()

# Split the location at the first space into postal code and city
ai_df_complete[['ZipCode','City']] = ai_df_complete['Location'].str.split(" ", n=1, expand=True)
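
Splitting on the first comma and first space works for the common format, but an entry such as "Kortrijksesteenweg 1127/0002, 9051, 9000 Gent" ends up with "9051," as its postal code. A possible alternative, sketched here under the assumption that every address ends with a four-digit Belgian postal code followed by the city, is a small regular expression. The helper name `split_address` is hypothetical and not part of the notebook.

```python
import re

# Four digits, whitespace, then a comma-free city name running to the end of the string
POSTAL_RE = re.compile(r"\b(\d{4})\s+([^,]+?)\s*$")


def split_address(address):
    """Return (street, postal_code, city); postal_code and city are None if not found."""
    street = address.split(",", 1)[0].strip()
    match = POSTAL_RE.search(address)
    if match is None:
        return street, None, None
    return street, match.group(1), match.group(2)


# Addresses taken from the preview above
examples = [
    "Interleuvenlaan 27/B, 3001 Leuven",
    "Rue F. Dubois 2, 1310 La Hulpe",
    "Kortrijksesteenweg 1127/0002, 9051, 9000 Gent",
]
for address in examples:
    print(split_address(address))
```

The function could be applied row by row with `Series.apply` and the resulting tuples expanded into the `Street`, `ZipCode` and `City` columns.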

We drop the rows with missing values, since these companies no longer have an active website and their information is outdated. This affects about 9 companies in total.

In [8]:
ai_df_complete.dropna(inplace=True)

In [9]:
# We add the country to the address in preparation for our visualization
ai_df_complete['Address'] = ai_df_complete['Address'].apply(lambda x: x + ', Belgium')

Finally, we save the cleaned DataFrame as a CSV file.

In [10]:
ai_df_complete.to_csv('../data/AILandscape_cleaned.csv', index = False)

## Part III: Visualizing data on a map

In [11]:
# Geocode the addresses with Nominatim (OpenStreetMap API)
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="http")

# geocode returns None when an address cannot be resolved
ai_df_complete['gcode'] = ai_df_complete['Address'].apply(geolocator.geocode)
print('Missing Coordinates: ' + str(ai_df_complete['gcode'].isna().sum()))

Missing Coordinates: 75
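
The public Nominatim service asks clients to send at most one request per second, so geocoding several hundred addresses in a tight loop may be throttled. geopy provides `geopy.extra.rate_limiter.RateLimiter` for this purpose. The sketch below shows the same idea using only the standard library, with a small cache added. The names `make_polite_geocoder` and `stub_geocode` are hypothetical, and the stub replaces the real geocoder so the example runs offline.

```python
import time


def make_polite_geocoder(geocode, min_delay_seconds=1.0):
    """Wrap a geocode callable so calls are spaced out and repeated addresses are cached."""
    cache = {}
    last_call = None  # time of the most recent real request

    def polite_geocode(address):
        nonlocal last_call
        if address in cache:
            return cache[address]
        if last_call is not None:
            wait = min_delay_seconds - (time.monotonic() - last_call)
            if wait > 0:
                time.sleep(wait)
        result = geocode(address)
        last_call = time.monotonic()
        cache[address] = result
        return result

    return polite_geocode


# Stub geocoder for illustration: records each call and returns made-up coordinates
calls = []


def stub_geocode(address):
    calls.append(address)
    return (50.85, 4.35) if "Brussels" in address else None


geocode = make_polite_geocoder(stub_geocode, min_delay_seconds=0.01)
results = [geocode(a) for a in ["Main Street 1, 1000 Brussels", "Unknown", "Main Street 1, 1000 Brussels"]]
print(results, len(calls))
```

In the notebook, the wrapped function would take the place of `geolocator.geocode` in the `apply` call, with `min_delay_seconds` left at one second.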


In [12]:
# Drop rows whose address could not be geocoded
ai_df_complete.dropna(subset=['gcode'], inplace=True)
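
Because all columns of a DataFrame must have the same length, dropping entries from a single column cannot remove rows from the table. Rows are removed by calling `dropna` on the DataFrame with the `subset` argument. A minimal sketch with made-up data:

```python
import pandas as pd

# Made-up data: the second company could not be geocoded
df = pd.DataFrame({
    "Company Name": ["Acme", "Globex", "Initech"],
    "gcode": ["(50.85, 4.35)", None, "(51.05, 3.72)"],
})

# Row-wise drop: removes every row whose 'gcode' entry is missing
geocoded = df.dropna(subset=["gcode"]).reset_index(drop=True)

print(len(df), "rows before,", len(geocoded), "rows after")
```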

In [14]:
import numpy as np
ai_df_complete['Latitude'] = [g.latitude if g is not None else np.nan for g in ai_df_complete.gcode ]
ai_df_complete['Longitude'] = [g.longitude if g is not None else np.nan for g in ai_df_complete.gcode]

In [15]:
ai_df_complete.to_csv('../data/AILandscape_geocoded.csv', index = False)

In [16]:
# Test the interactive map function
import pandas as pd
from interactiveMap import get_location_interactive

ai_df = pd.read_csv('../data/AILandscape_geocoded.csv')
get_location_interactive(ai_df)
