# Making sublists of companies

With our dataset cleaned up for geolocation queries, the next step is to start working through the criteria. We can begin with the criteria that we can tackle using only the company dataset we have.

1. Designers like to go to design talks and share knowledge. There must be some nearby companies that also do design.
2. Developers like to be near successful tech startups that have raised at least 1 Million dollars.
3. Nobody in the company likes to have companies more than 10 years old within a 2 km radius.

On criterion 1, including only design companies seems very restrictive. We can interpret design more broadly and include companies that make games as well, especially since the company we are placing is a gaming company.

We will ignore criterion 3 for now, since a quick query showed that it severely restricts the number of companies.
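For reference, a radius query of this kind could be expressed with MongoDB's `$near` operator on the GeoJSON `location` field. The sketch below only builds the filter document. It assumes a `2dsphere` index on `location` and an integer `founded_year` field, neither of which is confirmed in this notebook, and `old_companies_near` is a hypothetical helper name.

```python
from datetime import date


def old_companies_near(lon, lat, max_km=2, min_age_years=10, today=None):
    """Build a MongoDB filter for companies older than `min_age_years`
    that lie within `max_km` of the point (lon, lat)."""
    today = today or date.today()
    return {
        "location": {
            "$near": {
                # GeoJSON points are written as [longitude, latitude].
                "$geometry": {"type": "Point", "coordinates": [lon, lat]},
                "$maxDistance": max_km * 1000,  # $maxDistance is in metres
            }
        },
        # Assumption: the founding year is stored as an integer in `founded_year`.
        "founded_year": {"$lt": today.year - min_age_years},
    }
```

The resulting dictionary could then be passed to `db[c].find(...)` for each candidate location, once we come back to this criterion.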


In [1]:
# We can use the following regex for it.
# regex filter for design: {description_1: {$regex: "design" }}


In [2]:
from pymongo import MongoClient
import pandas as pd


In [3]:
client = MongoClient("mongodb://localhost/companies")
db = client.get_database()
c = "companies_wlocation"

In [4]:
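# Note: "game*" matches "gam" followed by zero or more "e",
# so it also matches words such as "gambling".
moneyRaised = list(db[c].find(
    {"description_1": {"$regex": "game*|design"}},
    {"total_money_raised": 1},
))
# moneyRaised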
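To get a feel for what this filter lets through, the same pattern can be tried locally with Python's `re` module. This is only a sketch: the sample descriptions are made up for illustration, and `matches` is a hypothetical helper that mimics the unanchored, case-sensitive search that `$regex` performs by default.

```python
import re

pattern = re.compile("game*|design")


def matches(text):
    """Return True if the pattern occurs anywhere in the text."""
    return pattern.search(text) is not None


# Made-up descriptions, purely for illustration.
samples = [
    "Mobile games studio",
    "Web design agency",
    "Online gambling",
    "Cloud storage",
]
for sample in samples:
    print(sample, matches(sample))
```

If the looser matches turn out to be a problem, a stricter pattern such as `\bgames?\b|design`, possibly with `"$options": "i"` for case-insensitive matching, would be one option.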

Having looked at how the total money raised field is structured, we notice two things:

1. It seems that all of these companies have raised money in USD and are likely based in the US. We will probably need another database if we want to expand our reach beyond the US.
2. We can use a regex to select the companies that have raised at least 1M USD, by matching the `M` in the amount.
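As an alternative to matching on the letter `M`, the amount could be parsed into a number and compared directly. The sketch below assumes the values look like `$1.5M`, `$500k` or `€2B` (a currency symbol, a number and an optional `k`/`M`/`B` suffix), which is an assumption about the data rather than something verified here. `parse_money` is a hypothetical helper.

```python
import re

# Assumed suffixes: thousands, millions, billions.
_MULTIPLIERS = {"": 1, "k": 1e3, "M": 1e6, "B": 1e9}


def parse_money(value):
    """Convert a string such as '$1.5M' into a float.

    Returns None when the string does not follow the assumed format.
    """
    m = re.fullmatch(r"\D*?(\d+(?:\.\d+)?)\s*([kMB]?)", value.strip())
    if m is None:
        return None
    return float(m.group(1)) * _MULTIPLIERS[m.group(2)]


def raised_at_least(value, threshold=1_000_000):
    """True when the parsed amount reaches the threshold."""
    amount = parse_money(value)
    return amount is not None and amount >= threshold
```

Unlike a plain match on `M`, a numeric comparison would also keep amounts expressed in billions. It still ignores the currency, so it only makes sense while everything is in USD.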

In [5]:
# A projection such as ["name", "total_money_raised", "latitude", "longitude", "location"]
# could be added here.
over1M = pd.DataFrame(list(db[c].find({
    "$and": [
        {"description_1": {"$regex": "game*|design"}},
        {"total_money_raised": {"$regex": "M"}},
    ]
})))
over1M.shape
# This leaves us with 85 companies that we can place ours close to.

(85, 51)

In [6]:
over1M.columns
over1M.head()
#over1M.explode('location')

Unnamed: 0,_id,name,permalink,crunchbase_url,homepage_url,blog_url,blog_feed_url,twitter_username,category_code,number_of_employees,...,description_2,address1,address2,zip_code,city,state_code,country_code,latitude,longitude,location
0,5e41272299bd8e3148da766e,Curse,curse,http://www.crunchbase.com/company/curse,http://www.curse.com,,,cursenetwork,games_video,58.0,...,San Francisco,60 Broadway,,94111.0,San Francisco,CA,USA,37.787092,-122.399972,"{'type': 'Point', 'coordinates': [-122.399972,..."
1,5e41272299bd8e3148da7674,Curse,curse,http://www.crunchbase.com/company/curse,http://www.curse.com,,,cursenetwork,games_video,58.0,...,Huntsville,150 West Park Loop NW,,35806.0,Huntsville,AL,USA,,,
2,5e41272299bd8e3148da76bd,Grockit,grockit,http://www.crunchbase.com/company/grockit,http://grockit.com,http://blog.grockit.com,,grockit,social,25.0,...,,500 Third Street,Suite 260,94107.0,San Francisco,CA,USA,37.775196,-122.419204,"{'type': 'Point', 'coordinates': [-122.419204,..."
3,5e41272299bd8e3148da77cf,MocoSpace,mocospace,http://www.crunchbase.com/company/mocospace,http://www.mocospace.com,,,mocospace,games_video,25.0,...,,,,2111.0,Boston,MA,USA,42.350274,-71.058768,"{'type': 'Point', 'coordinates': [-71.058768, ..."
4,5e41272299bd8e3148da7816,OMGPOP,omgpop,http://www.crunchbase.com/company/omgpop,http://omgpop.com,http://blog.iminlikewithyou.com/,http://blog.iminlikewithyou.com/rss,omgpop,games_video,50.0,...,,SoHo,,,New York,NY,USA,40.723384,-74.001704,"{'type': 'Point', 'coordinates': [-74.001704, ..."


In [7]:
import geopandas as gpd
from geopy.distance import distance
from shapely.geometry import Point
import matplotlib.pyplot as plt
from cartoframes.viz import Map, Layer
from cartoframes.viz.helpers import size_continuous_layer
from cartoframes.viz.widgets import histogram_widget
import numpy as np

%matplotlib inline


I got an error when trying to visualize the dataframe, most likely because some latitude/longitude values are missing or invalid. We can filter these out of the dataframe by limiting latitude to the range -90 to 90 and longitude to the range -180 to 180.
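The check itself is simple enough to sketch in plain Python. `is_valid_coordinate` below is a hypothetical helper, not part of the notebook; it uses inclusive bounds, whereas the pandas filter in the next cell uses strict inequalities. In both cases a missing value (`NaN`) fails the test, because any comparison with `NaN` is false.

```python
import math


def is_valid_coordinate(lat, lon):
    """Return True when lat/lon are real numbers inside the valid ranges."""
    for value in (lat, lon):
        if not isinstance(value, (int, float)) or math.isnan(value):
            return False
    return -90 <= lat <= 90 and -180 <= lon <= 180
```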

In [8]:
over1M = over1M[(over1M['latitude'] < 90) & (over1M['latitude'] > -90) & (over1M['longitude'] < 180) & (over1M['longitude'] > -180)]
over1M.to_json("../source/gaming-designover1M.json", orient="records")

OverflowError: Unterminated UTF-8 sequence when encoding string

In [0]:
gdf = gpd.GeoDataFrame(over1M, geometry=gpd.points_from_xy(over1M.longitude, over1M.latitude))
print(f'Type: {type(gdf)}')
gdf.head()
gdf.to_csv(index=False)  # with no path given, to_csv returns the CSV as a string

In [0]:
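# Drop only rows without coordinates; a bare dropna() would discard
# every row that has a NaN in any column.
gdf2 = gdf.dropna(subset=['latitude', 'longitude']).drop(columns=['location'])
gdf2.head()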

In [0]:
gdf2.to_csv("../output/gaming-designcompanies1M.csv")
type(gdf2)

In [0]:
# to_file writes to disk and returns None, so there is nothing to assign.
gdf2[['geometry']].to_file("output.geojson", driver="GeoJSON")

#df = gpd.read_file("../output/gaming-designcompanies1M.json", crs='EPSG:4326')

In [0]:
# GeoJSON coordinates are WGS 84 (EPSG:4326) by specification.
df = gpd.read_file("output.geojson")
df.head()

In [0]:
Map(Layer(df))


# Switching to Folium

Unfortunately, our efforts to visualize these points with cartoframes have not been very successful. We can try to visualize our current list of points in Folium instead.
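One thing to watch when moving between tools is coordinate order: GeoJSON points, like the `location` column in our data, store `[longitude, latitude]`, while Folium expects `[latitude, longitude]`. A minimal sketch of the conversion, with `geojson_point_to_latlon` as a hypothetical helper name:

```python
def geojson_point_to_latlon(point):
    """Convert a GeoJSON Point dict to the [lat, lon] order Folium expects."""
    lon, lat = point["coordinates"][:2]
    return [lat, lon]


# Coordinates of the first row shown in the table above.
example = {"type": "Point", "coordinates": [-122.399972, 37.787092]}
print(geojson_point_to_latlon(example))
```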

In [0]:
import folium
from folium import Choropleth, Circle, Marker
from folium.plugins import HeatMap, MarkerCluster


In [0]:
start_lat = 40.408561
start_lon = -3.6917665
m = folium.Map(location=[start_lat, start_lon], zoom_start=3)
#heat_m

In [0]:
# Take both coordinates from gdf2 so that the pairs stay aligned.
coords = zip(gdf2.longitude, gdf2.latitude)
#print(list(coords))
for lon, lat in coords:
    m.add_child(Marker([lat, lon], icon=folium.Icon(color='green', icon='asterisk')))

In [0]:
m

In [0]:
# In order to get a different type of marker, we would have to look further into the circle option. For now we're not quite getting it to work. 
# link here: https://python-visualization.github.io/folium/quickstart.html
coords = zip(gdf2.longitude, gdf2.latitude)
#print(list(coords))
for lon, lat in coords:
    m.add_child(folium.CircleMarker([lat, lon], radius=10))

Now we have a general idea of where the ~80 companies we selected are located. We can now continue to explore the other conditions.