# Introduction

Every year, thousands of startup companies raise financing from investors. Each time a startup raises money, we refer to the event as a fundraising round. Crunchbase is a website that crowdsources information on the fundraising rounds of many startups. The Crunchbase user community submits, edits, and maintains most of the information in Crunchbase.

In return, Crunchbase makes the data available through a web application and a fee-based API. Before Crunchbase switched to the paid API model, multiple groups went to the site and released the data online. Since the information on the startups and their fundraising rounds is always changing, the dataset we'll be using isn't completely up to date.

Throughout this project, we'll practice working with different memory constraints. In this step, let's assume we only have 10 megabytes of available memory. While `crunchbase-investments.csv` consumes 10.3 megabytes of disk space, we know that pandas often requires 4 to 6 times as much space in memory as the file takes up on disk (especially when there are multiple string columns), so we'll read the file in chunks rather than all at once.
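
One way to check a chunk size against the memory budget is to measure each chunk as it streams past and keep the largest figure. The sketch below is an illustration, not part of the original analysis; the helper name `max_chunk_memory_mb` is hypothetical.

```python
import pandas as pd


def max_chunk_memory_mb(path, chunksize, **read_csv_kwargs):
    """Return the largest in-memory chunk size (in MB) seen while streaming a CSV.

    Hypothetical helper: the file is read chunk by chunk, so only one
    chunk is held in memory at a time.
    """
    largest = 0
    for chunk in pd.read_csv(path, chunksize=chunksize, **read_csv_kwargs):
        # deep=True counts the bytes actually used by string values
        used = chunk.memory_usage(deep=True).sum()
        largest = max(largest, used)
    return largest / (1024 * 1024)
```

For this project, the call would look like `max_chunk_memory_mb('crunchbase-investments.csv', 5000, encoding='ISO-8859-1')`. If the result is above the budget, a smaller `chunksize` is the simplest adjustment.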

In [9]:
import pandas as pd
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1')
df = pd.read_csv('crunchbase-investments.csv', nrows=5, encoding='ISO-8859-1')
df

Unnamed: 0,company_permalink,company_name,company_category_code,company_country_code,company_state_code,company_region,company_city,investor_permalink,investor_name,investor_category_code,investor_country_code,investor_state_code,investor_region,investor_city,funding_round_type,funded_at,funded_month,funded_quarter,funded_year,raised_amount_usd
0,/company/advercar,AdverCar,advertising,USA,CA,SF Bay,San Francisco,/company/1-800-flowers-com,1-800-FLOWERS.COM,,USA,NY,New York,New York,series-a,2012-10-30,2012-10,2012-Q4,2012,2000000
1,/company/launchgram,LaunchGram,news,USA,CA,SF Bay,Mountain View,/company/10xelerator,10Xelerator,finance,USA,OH,Columbus,Columbus,other,2012-01-23,2012-01,2012-Q1,2012,20000
2,/company/utap,uTaP,messaging,USA,,United States - Other,,/company/10xelerator,10Xelerator,finance,USA,OH,Columbus,Columbus,other,2012-01-01,2012-01,2012-Q1,2012,20000
3,/company/zoopshop,ZoopShop,software,USA,OH,Columbus,columbus,/company/10xelerator,10Xelerator,finance,USA,OH,Columbus,Columbus,angel,2012-02-15,2012-02,2012-Q1,2012,20000
4,/company/efuneral,eFuneral,web,USA,OH,Cleveland,Cleveland,/company/10xelerator,10Xelerator,finance,USA,OH,Columbus,Columbus,other,2011-09-08,2011-09,2011-Q3,2011,20000


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   company_permalink       5 non-null      object
 1   company_name            5 non-null      object
 2   company_category_code   5 non-null      object
 3   company_country_code    5 non-null      object
 4   company_state_code      4 non-null      object
 5   company_region          5 non-null      object
 6   company_city            4 non-null      object
 7   investor_permalink      5 non-null      object
 8   investor_name           5 non-null      object
 9   investor_category_code  4 non-null      object
 10  investor_country_code   5 non-null      object
 11  investor_state_code     5 non-null      object
 12  investor_region         5 non-null      object
 13  investor_city           5 non-null      object
 14  funding_round_type      5 non-null      object
 15  funded_at 

In [27]:
# count missing values
mvs = []
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1')
for chunk in chunk_iter:
    mvs.append(chunk.isnull().sum())
    
mvs = pd.concat(mvs)
mvs.groupby(mvs.index).sum().sort_values(ascending=False)

investor_category_code    50427
investor_state_code       16809
investor_city             12480
investor_country_code     12001
raised_amount_usd          3599
company_category_code       643
company_city                533
company_state_code          492
funding_round_type            3
funded_year                   3
funded_month                  3
funded_at                     3
funded_quarter                3
investor_name                 2
investor_permalink            2
investor_region               2
company_region                1
company_permalink             1
company_name                  1
company_country_code          1
dtype: int64

In [34]:
# memory by column
mems = []
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1')
for chunk in chunk_iter:
    mems.append(chunk.memory_usage(deep=True))
    
mems = pd.concat(mems)
mems = mems.groupby(mems.index).sum().sort_values(ascending=False).drop("Index")
mems

investor_permalink        4980548
company_permalink         4057788
investor_name             3915666
company_name              3591326
funded_at                 3542185
company_city              3505926
company_category_code     3421104
company_region            3411585
funding_round_type        3410707
investor_region           3396281
funded_month              3383584
funded_quarter            3383584
company_country_code      3172176
company_state_code        3106051
investor_city             2885083
investor_country_code     2647292
investor_state_code       2476607
investor_category_code     622424
raised_amount_usd          422960
funded_year                422960
dtype: int64

In [35]:
# total memory in MB
mems.sum() / (1024 * 1024)

56.9876070022583

# Clean Data

In [36]:
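# Drop the two permalink columns (the largest memory consumers) and
# investor_category_code (mostly missing values)
drop_cols = ['investor_permalink', 'company_permalink', 'investor_category_code']
# `chunk` is the last chunk left over from the loop above; only its column labels are used
keep_cols = chunk.columns.drop(drop_cols)
date_cols = ['funded_at', 'funded_month', 'funded_quarter']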

In [51]:
mems = []
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1', usecols=keep_cols, parse_dates=date_cols)
for chunk in chunk_iter:
    chunk['raised_amount_usd'] = pd.to_numeric(chunk['raised_amount_usd'])
    mems.append(chunk.memory_usage(deep=True))
    
mems = pd.concat(mems)
mems = mems.groupby(mems.index).sum().sort_values(ascending=False).drop("Index")
mems.sum() / (1024 * 1024)

39.1527214050293

By dropping three columns and parsing the date columns, we reduced the total memory footprint from about 57 MB to about 39 MB.
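
A further reduction that was not tried above is to store low-cardinality string columns, such as `funding_round_type`, as the `category` dtype. The sketch below uses toy data standing in for one chunk, and the helper name `to_category` is hypothetical. Note that categories are inferred per chunk, so chunks with different category sets may fall back to `object` when concatenated.

```python
import pandas as pd


def to_category(chunk, cols):
    """Return a copy of `chunk` with the given string columns stored as `category`."""
    out = chunk.copy()
    for col in cols:
        out[col] = out[col].astype('category')
    return out


# Toy data standing in for one chunk (values are illustrative only)
toy = pd.DataFrame({
    'funding_round_type': ['series-a', 'angel', 'series-a', 'other'] * 250,
    'raised_amount_usd': [2000000.0, 20000.0, 20000.0, 20000.0] * 250,
})

before = toy.memory_usage(deep=True).sum()
after = to_category(toy, ['funding_round_type']).memory_usage(deep=True).sum()
print(before, after)
```

The saving comes from storing each distinct string once and representing every row as a small integer code, so it only pays off for columns with few unique values relative to their length.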

# SQLite

In [52]:
import sqlite3
conn = sqlite3.connect('crunchbase.db')
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1')

for chunk in chunk_iter:
    chunk.to_sql("investments", conn, if_exists='append', index=False)

pd.read_sql('PRAGMA table_info(investments);', conn)
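
Because the loop above uses `if_exists='append'`, re-running the cell adds every row to the table a second time. A minimal sketch of a re-runnable load follows, assuming it is acceptable to recreate the table on each run; the helper name `load_csv_to_sqlite` is hypothetical.

```python
import sqlite3

import pandas as pd


def load_csv_to_sqlite(csv_path, conn, table, chunksize=5000, **read_csv_kwargs):
    """Stream a CSV into a SQLite table, replacing any existing table first.

    Returns the number of rows written.
    """
    total_rows = 0
    chunks = pd.read_csv(csv_path, chunksize=chunksize, **read_csv_kwargs)
    for i, chunk in enumerate(chunks):
        # The first chunk recreates the table; later chunks append to it
        mode = 'replace' if i == 0 else 'append'
        chunk.to_sql(table, conn, if_exists=mode, index=False)
        total_rows += len(chunk)
    return total_rows
```

With the notebook's file this would be called as `load_csv_to_sqlite('crunchbase-investments.csv', conn, 'investments', encoding='ISO-8859-1')`. The returned row count can be compared against `select count(*) from investments` as a sanity check.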

We will use the pandas/SQLite workflow to answer the following questions:

-    What proportion of the total amount of funds did the top 10 companies raise? The bottom 10?
-    Which category of company attracted the most investments?
-    Which funding round was the most popular? Which was the least popular?

In [86]:
# What proportion of the total amount of funds did the top 10 companies raise? The bottom 10?

amt_by_company = pd.read_sql("select company_name, sum(raised_amount_usd) amt from investments where raised_amount_usd is not null group by company_name", conn)
amt_total = pd.read_sql("select sum(raised_amount_usd) amt from investments where raised_amount_usd is not null", conn)

# .iloc[0] extracts the scalar total; calling float() on a Series is deprecated in recent pandas
amt_by_company['percent'] = round(amt_by_company['amt'] / amt_total['amt'].iloc[0], 2)
amt_by_company.sort_values('percent', ascending=False, inplace=True)

amt_by_company

Unnamed: 0,company_name,amt,percent
1828,Clearwire,2.968000e+10,0.04
3653,Groupon,1.018540e+10,0.01
5584,Nanosolar,4.505000e+09,0.01
3004,Facebook,4.154100e+09,0.01
0,#waywire,8.750000e+06,0.00
...,...,...,...
3454,GigaSpaces Technologies,3.300000e+07,0.00
3455,GigaTrust,3.300000e+07,0.00
3456,Gigamon,2.280000e+07,0.00
3457,Gigi Hill,3.000000e+06,0.00


In [91]:
amt_by_company.head(10).percent.sum(), amt_by_company.tail(10).percent.sum()

(0.07, 0.0)

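The top 10 companies accounted for roughly 7% of the total funds raised, while the last 10 rows accounted for roughly 0%. Both figures are approximate because each company's share was rounded to two decimal places before summing. The sort on the rounded `percent` column also leaves the many companies tied at 0.00 in arbitrary order, so the last 10 rows are not necessarily the 10 smallest fundraisers.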
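
A more precise variant sorts by the raw amounts and divides only once, at the end, which avoids rounding each company's share. The sketch below uses toy amounts that are not taken from the dataset, and the helper name `top_bottom_share` is hypothetical.

```python
import pandas as pd


def top_bottom_share(amounts, n=10):
    """Return (top_n_share, bottom_n_share) of the total for a Series of amounts."""
    ordered = amounts.sort_values(ascending=False)
    total = ordered.sum()
    # Sum the raw amounts first, then divide once, so no per-row rounding occurs
    return ordered.head(n).sum() / total, ordered.tail(n).sum() / total


# Toy amounts (illustrative only, not taken from the dataset)
toy_amounts = pd.Series([500.0, 300.0, 100.0, 50.0, 30.0, 20.0])
top, bottom = top_bottom_share(toy_amounts, n=2)
print(round(top, 3), round(bottom, 3))
```

Applied to the notebook's data, the call would be `top_bottom_share(amt_by_company['amt'])`, which does not depend on the `percent` column or on the current row order.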

In [95]:
# Which category of company attracted the most investments?

pd.read_sql("select company_category_code, count(*) from investments where raised_amount_usd is not null group by company_category_code order by 2 desc limit 1", conn)

Unnamed: 0,company_category_code,count(*)
0,software,6861


Software was the #1 investment category, with 6861 investments (counting only rounds with a recorded amount).

In [105]:
# Which funding round was the most popular? Which was the least popular?

funding = pd.read_sql("select funding_round_type from investments where raised_amount_usd is not null", conn)
funding.iloc[:, 0].value_counts()

series-a          13377
series-c+         10764
series-b           8630
venture            8031
angel              7190
other               934
private-equity      312
post-ipo             29
crowdfunding          4
Name: funding_round_type, dtype: int64

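`series-a` was the most popular round type, with 13377 investments; `crowdfunding` was the least popular, with 4. As above, only rounds with a recorded amount are counted.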
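
Under a tight memory budget, the counting can also be done inside SQLite so that only one row per round type is returned to pandas, rather than one row per investment. The sketch below runs against a small in-memory database with toy rows; the `demo` connection and its data are illustrative assumptions, not the notebook's `crunchbase.db`.

```python
import sqlite3

import pandas as pd

# In-memory database with toy rows (illustrative only)
demo = sqlite3.connect(':memory:')
pd.DataFrame({
    'funding_round_type': ['series-a', 'angel', 'series-a', 'other',
                           'angel', 'series-a', 'series-a'],
    'raised_amount_usd': [2e6, 2e4, None, 2e4, 5e4, 1e6, 3e6],
}).to_sql('investments', demo, index=False)

# Let SQLite group and count; rows without an amount are excluded
query = """
    select funding_round_type, count(*) as n
    from investments
    where raised_amount_usd is not null
    group by funding_round_type
    order by n desc
"""
round_counts = pd.read_sql(query, demo)
print(round_counts)
```

The same query string can be passed to `pd.read_sql` with the notebook's `conn` in place of `demo`.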