# Python BI project
## The goal of this project is to decide where (geographically) to found a startup.
### Startup details (a hypothetical company we want to found):
* **Name**: Habits.AI
* **Description:** B2B platform for well-being and productivity. Through artificial intelligence, gamification, and behavioral science, we build healthy cultures that increase your employees’ engagement, productivity, loyalty and health.
* **Category**: Digital Health, well-being, B2B, technology, artificial intelligence & machine learning.
* **Important details/questions about the place**:
    * We want to be close to other tech companies.
    * There should be an entrepreneurial ecosystem around the city and country (events, meetups, networking, community, etc.).
    * There should be big companies nearby (potential clients).
    * How difficult is it to start a company in that city/country?
    * How many startups fail in the city/country?

## First step: organize data about startups around the world that are registered in Crunchbase
Dataset URL: https://www.kaggle.com/arindam235/startup-investments-crunchbase

I imported the dataset into MongoDB to practice using this tool.
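The import step itself is not shown in this notebook. A minimal sketch of one way to do it, assuming the Kaggle CSV is parsed row by row and inserted with `insert_many`: the helper names `csv_to_records` and `load_into_mongo` and the inline sample rows are hypothetical, and header whitespace is deliberately left untouched because the notebook cleans it up later.

```python
import csv
import io


def csv_to_records(csv_text):
    """Parse CSV text into a list of dicts, one per row (keys are the raw headers)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [dict(row) for row in reader]


def load_into_mongo(records, collection):
    """Insert the rows into a pymongo collection (any object with insert_many works)."""
    if records:  # insert_many rejects an empty list, so skip the call in that case
        collection.insert_many(records)
    return len(records)


# Tiny made-up sample standing in for the real CSV file.
sample_csv = "name, market ,city\nAcme, Software ,Madrid\nBeta, Health Care ,\n"
records = csv_to_records(sample_csv)
print(records[0])

# With a local MongoDB running, the load step could look like this:
# import pymongo
# client = pymongo.MongoClient()
# load_into_mongo(records, client.pythonBIproject.StartUpInvestments)
```

Rows without a city arrive as empty strings in this sketch, which matches the `"city": {"$ne": ""}` filter used further down.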

In [1]:
# import libraries
import pymongo
import pandas as pd

In [2]:
# create the Mongo client
cliente = pymongo.MongoClient()

In [4]:
# check the connection
cliente

MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True)

In [7]:
# select the database
bbdd = cliente.pythonBIproject

In [38]:
# select the collection
startups_mongo = bbdd.StartUpInvestments
startups_mongo

Collection(Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), 'pythonBIproject'), 'StartUpInvestments')

In [80]:
# fetch only the startups that DO have a city, and exclude the fields we will not use
startups_lst = list(startups_mongo.find({"city":{"$ne": ""}},
                   {"_id":0,"permalink":0,'round_C': 0,'round_D': 0,
                    'round_E': 0,'round_F': 0,'round_G': 0,'round_H': 0,
                    'post_ipo_equity':0,'post_ipo_debt':0,'secondary_market':0,
                   'product_crowdfunding':0,'debt_financing':0,'convertible_note':0,
                   'equity_crowdfunding':0,'undisclosed':0,'private_equity':0}))
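The query above combines a filter (`$ne` on `city`) with an exclusion projection (every listed field set to `0`). A minimal sketch of the same query built from a list, which keeps the dropped fields in one place. The names `UNUSED_FIELDS` and `exclusion_projection` are hypothetical. Note that `$ne: ""` also matches documents where `city` is missing altogether; the variant below uses `$nin` with `None`, which is intended to skip null or absent values too. Whether such documents exist depends on how the CSV was imported, so treat that part as an assumption.

```python
# Fields the notebook excludes from the query results.
UNUSED_FIELDS = [
    "_id", "permalink",
    "round_C", "round_D", "round_E", "round_F", "round_G", "round_H",
    "post_ipo_equity", "post_ipo_debt", "secondary_market",
    "product_crowdfunding", "debt_financing", "convertible_note",
    "equity_crowdfunding", "undisclosed", "private_equity",
]


def exclusion_projection(fields):
    """Build a MongoDB projection that drops every field in `fields`."""
    return {field: 0 for field in fields}


# Keep documents whose city is neither an empty string nor null/absent.
city_filter = {"city": {"$nin": ["", None]}}
projection = exclusion_projection(UNUSED_FIELDS)

# Against the collection from the previous cell this would be:
# startups_lst = list(startups_mongo.find(city_filter, projection))
print(projection)
```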

In [81]:
# create a dataframe from the saved list
startups = pd.DataFrame(startups_lst)

In [82]:
# preview of the startups dataframe
startups.head()

Unnamed: 0,name,homepage_url,category_list,market,funding_total_usd,status,country_code,state_code,region,city,...,founded_quarter,founded_year,first_funding_at,last_funding_at,seed,venture,angel,grant,round_A,round_B
0,#waywire,http://www.waywire.com,|Entertainment|Politics|Social Media|News|,News,1750000,acquired,USA,NY,New York City,New York,...,2012-Q2,2012.0,2012-06-30 00:00:00,2012-06-30 00:00:00,1750000,0.0,0,0,0,0
1,&TV Communications,http://enjoyandtv.com,|Games|,Games,4000000,operating,USA,CA,Los Angeles,Los Angeles,...,,,2010-06-04 00:00:00,2010-09-23 00:00:00,0,4000000.0,0,0,0,0
2,'Rock' Your Paper,http://www.rockyourpaper.org,|Publishing|Education|,Publishing,40000,operating,EST,,Tallinn,Tallinn,...,2012-Q4,2012.0,2012-08-09 00:00:00,2012-08-09 00:00:00,40000,0.0,0,0,0,0
3,(In)Touch Network,http://www.InTouchNetwork.com,|Electronics|Guides|Coffee|Restaurants|Music|i...,Electronics,1500000,operating,GBR,,London,London,...,2011-Q2,2011.0,2011-04-01 00:00:00,2011-04-01 00:00:00,1500000,0.0,0,0,0,0
4,-R- Ranch and Mine,,|Tourism|Entertainment|Games|,Tourism,60000,operating,USA,TX,Dallas,Fort Worth,...,2014-Q1,2014.0,2014-08-17 00:00:00,2014-09-26 00:00:00,0,0.0,0,0,0,0


In [94]:
# two column names have extra whitespace
startups.columns = startups.columns.str.strip()
startups.columns

Index(['name', 'homepage_url', 'category_list', 'market', 'funding_total_usd',
       'status', 'country_code', 'state_code', 'region', 'city',
       'funding_rounds', 'founded_at', 'founded_month', 'founded_quarter',
       'founded_year', 'first_funding_at', 'last_funding_at', 'seed',
       'venture', 'angel', 'grant', 'round_A', 'round_B'],
      dtype='object')

In [124]:
# let's see which startup categories (markets) exist
lst = startups["market"].value_counts()
#print(lst.to_string())  # print the full list of categories in the table

In [129]:
# every category has a leading and trailing space, so let's clean them up
startups["market"] = startups["market"].str.strip()
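The two whitespace cleanups (column names, then `market` values) can be checked on a toy frame. This is only a sketch with made-up rows. It uses the vectorized `.str.strip()` accessor, which leaves missing values as NaN, whereas calling `x.strip()` inside a plain `apply` would raise on a non-string value.

```python
import pandas as pd

# Made-up rows that mimic the padded headers and values in the dataset.
toy = pd.DataFrame({
    " market ": [" Software ", None, "Apps "],
    "city": ["Madrid", "Lima", "Quito"],
})

# Strip whitespace from the column names.
toy.columns = toy.columns.str.strip()

# Strip whitespace from the values; missing entries stay missing.
toy["market"] = toy["market"].str.strip()

print(toy)
```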

In [137]:
# filter the dataframe down to the categories related to the startup we want to found:
cat_lst = ["Apps","Technology","SaaS","Medical","Health and Wellness","Health Care","Mobile",
           "Software","Medical Devices","Internet of Things","Health Care Information Technology",
           "Productivity Software","Machine Learning","Artificial Intelligence","Healthcare Services",
           "Mobile Health","Health and Insurance"]

startups_filt = startups[startups["market"].isin(cat_lst)]

# preview of the table
startups_filt.head()

Unnamed: 0,name,homepage_url,category_list,market,funding_total_usd,status,country_code,state_code,region,city,...,founded_quarter,founded_year,first_funding_at,last_funding_at,seed,venture,angel,grant,round_A,round_B
5,.Club Domains,http://nic.club/,|Software|,Software,7000000,,USA,FL,Ft. Lauderdale,Oakland Park,...,2011-Q4,2011.0,2013-05-31 00:00:00,2013-05-31 00:00:00,0,7000000.0,0,0,0,7000000
7,004 Technologies,http://004gmbh.de/en/004-interact,|Software|,Software,-,operating,USA,IL,"Springfield, Illinois",Champaign,...,2010-Q1,2010.0,2014-07-24 00:00:00,2014-07-24 00:00:00,0,0.0,0,0,0,0
10,1-4 All,,|Entertainment|Games|Software|,Software,-,operating,USA,NC,NC - Other,Connellys Springs,...,,,2013-04-21 00:00:00,2013-04-21 00:00:00,0,0.0,0,0,0,0
11,1-800-DENTIST,http://www.1800dentist.com,|Health and Wellness|,Health and Wellness,-,operating,USA,CA,Los Angeles,Los Angeles,...,1986-Q1,1986.0,2010-08-19 00:00:00,2010-08-19 00:00:00,0,0.0,0,0,0,0
12,1-800-DOCTORS,http://1800doctors.com,|Health and Wellness|,Health and Wellness,1750000,operating,USA,NJ,Newark,Iselin,...,1984-Q1,1984.0,2011-03-02 00:00:00,2011-03-02 00:00:00,0,0.0,0,0,0,0


In [145]:
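# count startups per country/city and market; with no `values` given, every remaining column is counted
startups_filt.pivot_table(index = ['country_code','city'],
                          columns = "market", aggfunc = "count",
                          margins = True)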

Unnamed: 0_level_0,Unnamed: 1_level_0,angel,angel,angel,angel,angel,angel,angel,angel,angel,angel,...,venture,venture,venture,venture,venture,venture,venture,venture,venture,venture
Unnamed: 0_level_1,market,Apps,Artificial Intelligence,Health Care,Health Care Information Technology,Health and Insurance,Health and Wellness,Healthcare Services,Internet of Things,Machine Learning,Medical,...,Machine Learning,Medical,Medical Devices,Mobile,Mobile Health,Productivity Software,SaaS,Software,Technology,All
country_code,city,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2,Unnamed: 22_level_2
ARE,Dubai,1.0,,,,,,,,,,...,,,,3.0,,,,1.0,,5
ARG,Buenos Aires,,,1.0,,,,,,,,...,,,,9.0,,,,6.0,,16
ARG,Córdoba,,1.0,,,,,,,,,...,,,,,,,,,,1
ARG,Finca Elisa,,,,,,,,,,,...,,,,,,,,1.0,,1
ARG,Mar Del Plata,,,,,,,,,,,...,,,,1.0,,,,,,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ZAF,Cape Town,,,,,,,,,,,...,,,,3.0,,,1.0,3.0,,7
ZAF,Gauteng,,,,,,,,,,,...,,,,1.0,,,,,,1
ZAF,Johannesburg,,,,,,,,,,,...,,,,1.0,,,,,,1
ZAF,Stellenbosch,,,,,,,,,,,...,,,,1.0,,,,,,1
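Because the pivot above is called without `values`, pandas counts the non-null entries of every remaining column, which is why the market columns repeat under `angel` through `venture`. A more compact view counts rows directly. The sketch below uses made-up rows rather than `startups_filt`, and the names `per_city` and `per_market` are illustrative.

```python
import pandas as pd

# Made-up rows with the three columns the count needs.
toy = pd.DataFrame({
    "country_code": ["USA", "USA", "USA", "GBR", "ARG"],
    "city": ["New York", "New York", "Los Angeles", "London", "Buenos Aires"],
    "market": ["Software", "Mobile", "Software", "SaaS", "Mobile"],
})

# Number of startups per (country, city), largest first.
per_city = (
    toy.groupby(["country_code", "city"])
    .size()
    .sort_values(ascending=False)
    .rename("startups")
)

# One block of market columns instead of one block per numeric column.
per_market = pd.crosstab(
    [toy["country_code"], toy["city"]], toy["market"], margins=True
)

print(per_city)
print(per_market)
```

On the real data, the same two calls would take `startups_filt` in place of `toy`.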


In [146]:
# sanity check: look up the single Dubai startup counted under "Apps" in the pivot table
startups_filt[(startups_filt["market"]=="Apps")&(startups_filt["city"]=="Dubai")]

Unnamed: 0,name,homepage_url,category_list,market,funding_total_usd,status,country_code,state_code,region,city,...,founded_quarter,founded_year,first_funding_at,last_funding_at,seed,venture,angel,grant,round_A,round_B
38809,Trippifi,http://www.trippifi.com,|Location Based Services|iPhone|Android|Apps|T...,Apps,150000,operating,ARE,,Dubai,Dubai,...,2014-Q1,2014,2014-01-01 00:00:00,2014-01-01 00:00:00,150000,0.0,0,0,0,0
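One of the questions listed at the start is how many startups fail in each city. A minimal sketch on made-up rows, assuming the `status` column marks failed companies as `closed`. That value is an assumption, since the rows shown in this notebook only contain `acquired` and `operating`.

```python
import pandas as pd

# Made-up rows; "closed" as the failure marker is an assumption.
toy = pd.DataFrame({
    "city": ["Dubai", "Dubai", "London", "London", "London"],
    "status": ["operating", "closed", "operating", "acquired", "closed"],
})

# Per city: total startups, how many are closed, and the closed share.
summary = (
    toy.assign(is_closed=toy["status"].eq("closed"))
    .groupby("city")["is_closed"]
    .agg(startups="count", closed="sum")
)
summary["closed_share"] = summary["closed"] / summary["startups"]

print(summary)
```

Cities with very few startups will show noisy shares, so a minimum count per city is worth applying before comparing them.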
