# Analyzing Startup Fundraising Deals from Crunchbase

In this project, we'll apply techniques for processing large data sets under memory constraints to analyze startup investments from Crunchbase.com.

Every year, thousands of startup companies raise financing from investors. Each time a startup raises money, we refer to the event as a fundraising round. Crunchbase is a website that crowdsources information on the fundraising rounds of many startups. The Crunchbase user community submits, edits, and maintains most of the information in Crunchbase.

In return, Crunchbase makes the data available through a Web application and a fee-based API. Before Crunchbase switched to the paid API model, multiple groups crawled the site and released the data online. Because the information on the startups and their fundraising rounds is always changing, the data set we'll be using isn't completely up to date.

The data set of investments we'll be exploring is current as of October 2013. We can download it from [GitHub](https://github.com/datahoarder/crunchbase-october-2013/blob/master/crunchbase-investments.csv).

Throughout this project, we'll work under __memory constraints__. Let's assume we only have 10 megabytes of available memory. While crunchbase-investments.csv takes up 10.3 megabytes of disk space, pandas often needs 4 to 6 times as much space in memory as the file occupies on disk (especially when there are many string columns).
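The gap between on-disk and in-memory size can be checked on a small sample. The sketch below is only an illustration: it builds a toy CSV in memory (a few values copied from the data set preview; the variable names are made up for this example) and compares its byte length with what pandas reports. `dtype=object` is passed explicitly so that strings are stored as Python objects, which is an assumption about how the real file ends up being loaded.

```python
import io
import pandas as pd

# Toy CSV; the column names and values mirror a few fields of the real file.
csv_text = (
    "company_name,company_category_code,company_city\n"
    "AdverCar,advertising,San Francisco\n"
    "LaunchGram,news,Mountain View\n"
    "ZoopShop,software,columbus\n"
    "eFuneral,web,Cleveland\n"
)

disk_bytes = len(csv_text.encode("utf-8"))
sample = pd.read_csv(io.StringIO(csv_text), dtype=object)

# deep=True also counts the Python string objects, not just the pointers to them.
shallow_bytes = sample.memory_usage(deep=False).sum()
deep_bytes = sample.memory_usage(deep=True).sum()

print(disk_bytes, shallow_bytes, deep_bytes)
```

On a sample this small the fixed overhead dominates, so the ratio is not representative of the full file, but the direction (deep usage well above the raw text size) is the same effect.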

## Previewing first few rows of data

In [1]:
import sqlite3
import pandas as pd
pd.options.display.max_columns = 99

In [2]:
first_five = pd.read_csv("crunchbase-investments.csv", nrows=5, encoding="ISO-8859-1")

In [3]:
first_five

Unnamed: 0,company_permalink,company_name,company_category_code,company_country_code,company_state_code,company_region,company_city,investor_permalink,investor_name,investor_category_code,investor_country_code,investor_state_code,investor_region,investor_city,funding_round_type,funded_at,funded_month,funded_quarter,funded_year,raised_amount_usd
0,/company/advercar,AdverCar,advertising,USA,CA,SF Bay,San Francisco,/company/1-800-flowers-com,1-800-FLOWERS.COM,,USA,NY,New York,New York,series-a,2012-10-30,2012-10,2012-Q4,2012,2000000
1,/company/launchgram,LaunchGram,news,USA,CA,SF Bay,Mountain View,/company/10xelerator,10Xelerator,finance,USA,OH,Columbus,Columbus,other,2012-01-23,2012-01,2012-Q1,2012,20000
2,/company/utap,uTaP,messaging,USA,,United States - Other,,/company/10xelerator,10Xelerator,finance,USA,OH,Columbus,Columbus,other,2012-01-01,2012-01,2012-Q1,2012,20000
3,/company/zoopshop,ZoopShop,software,USA,OH,Columbus,columbus,/company/10xelerator,10Xelerator,finance,USA,OH,Columbus,Columbus,angel,2012-02-15,2012-02,2012-Q1,2012,20000
4,/company/efuneral,eFuneral,web,USA,OH,Cleveland,Cleveland,/company/10xelerator,10Xelerator,finance,USA,OH,Columbus,Columbus,other,2011-09-08,2011-09,2011-Q3,2011,20000


## Determining chunksize to select and length of entire dataset

Let's start with nrows=1000.

In [4]:
first_1k = pd.read_csv("crunchbase-investments.csv", nrows=1000, encoding="ISO-8859-1")
print(first_1k.memory_usage(deep=True).sum() / (1024 * 1024))

1.1114320755004883


Since we have 10 MB of memory available, we'll aim for a dataframe size of around 5 MB (to be on the safe side). The first 1000 rows occupy around 1.11 MB, so let's select chunksize=5000 and check the memory usage of each chunk.
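The arithmetic behind that choice can be sketched as follows. `estimate_chunksize` is a hypothetical helper (it is not part of pandas), and it assumes that rows are roughly uniform in size, which may not hold for every part of the file.

```python
def estimate_chunksize(sample_mb, sample_rows, target_mb):
    """Return how many rows fit in target_mb, assuming rows are about equally large."""
    mb_per_row = sample_mb / sample_rows
    return int(target_mb / mb_per_row)

# Figures from the 1000-row sample measured above.
rows = estimate_chunksize(sample_mb=1.1114320755004883, sample_rows=1000, target_mb=5)
print(rows)
```

The estimate comes out near 4,500 rows. A chunksize of 5000 is a convenient round number that lands a little above the 5 MB target while staying far below the 10 MB limit.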

In [5]:
nrows = 0
chunk_iter = pd.read_csv("crunchbase-investments.csv", chunksize=5000, encoding="ISO-8859-1")
for chunk in chunk_iter:
    print(chunk.memory_usage(deep=True).sum() / (1024 * 1024))
    nrows += chunk.shape[0]

5.579240798950195
5.528232574462891
5.535050392150879
5.5282087326049805
5.52435302734375
5.553458213806152
5.531436920166016
5.5096588134765625
5.396136283874512
4.639497756958008
2.6637144088745117


Every chunk stays below 5.6 MB, well within the 10 MB limit, so chunksize=5000 is acceptable.

In [6]:
print("Dataset has {} rows and {} columns".format(nrows, chunk.shape[1]))

Dataset has 52870 rows and 20 columns


## Missing values, memory usage of columns

Across all of the chunks, we can explore:
- Each column's missing value counts
- Each column's memory footprint
- The total memory footprint of all of the chunks combined
- Which column(s) we can drop because they aren't useful for analysis

In [7]:
cols_missing_val_counts = []
cols_mem_usage_MB = []
total_mem_usage_MB = 0
chunk_iter = pd.read_csv("crunchbase-investments.csv", chunksize=5000, encoding="ISO-8859-1")
for chunk in chunk_iter:
    cols_missing_val_counts.append(chunk.isnull().sum())
    cols_mem_usage_MB.append(chunk.memory_usage(deep=True) / (1024 * 1024))
    # memory_usage() also reports an "Index" entry; drop it so that only the columns are counted.
    # We could choose not to drop it, since the index's share of the memory
    # footprint across all chunks is negligible.
    total_mem_usage_MB += (chunk.memory_usage(deep=True).drop("Index").sum() / (1024 * 1024))
cols_missing_val_counts_combined = pd.concat(cols_missing_val_counts)
final_cols_missing_value_counts = \
cols_missing_val_counts_combined.groupby(cols_missing_val_counts_combined.index).sum()

cols_mem_usage_MB_combined = pd.concat(cols_mem_usage_MB)
final_cols_mem_usage_MB = \
cols_mem_usage_MB_combined.groupby(cols_mem_usage_MB_combined.index).sum()
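
The concat-then-groupby step used above to merge per-chunk results can be seen on a toy example. The numbers below are made up for illustration; only the pattern matches the cell above.

```python
import pandas as pd

# Two per-chunk results, shaped like what chunk.isnull().sum() returns (toy numbers).
chunk_counts = [
    pd.Series({"company_city": 2, "investor_city": 5}),
    pd.Series({"company_city": 1, "investor_city": 7}),
]

# After concat, each column label appears once per chunk.
combined = pd.concat(chunk_counts)

# Grouping by the index collapses the repeated labels into one total per column.
totals = combined.groupby(combined.index).sum()
print(totals)
```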

### Each column's missing value counts

In [8]:
print(final_cols_missing_value_counts.sort_values())

company_country_code          1
company_name                  1
company_permalink             1
company_region                1
investor_region               2
investor_permalink            2
investor_name                 2
funded_quarter                3
funded_at                     3
funded_month                  3
funded_year                   3
funding_round_type            3
company_state_code          492
company_city                533
company_category_code       643
raised_amount_usd          3599
investor_country_code     12001
investor_city             12480
investor_state_code       16809
investor_category_code    50427
dtype: int64


### Each column's memory footprint in MB

In [9]:
print(final_cols_mem_usage_MB.sort_values())

Index                     0.001381
funded_year               0.403366
raised_amount_usd         0.403366
investor_category_code    0.593590
investor_state_code       2.361876
investor_country_code     2.524654
investor_city             2.751430
company_state_code        2.962161
company_country_code      3.025223
funded_quarter            3.226837
funded_month              3.226837
investor_region           3.238946
funding_round_type        3.252704
company_region            3.253541
company_category_code     3.262619
company_city              3.343512
funded_at                 3.378091
company_name              3.424955
investor_name             3.734270
company_permalink         3.869808
investor_permalink        4.749821
dtype: float64


#### Drop memory footprint calculation for the Index

In [10]:
final_cols_mem_usage_MB.drop("Index", inplace=True)
print(final_cols_mem_usage_MB.sort_values())

raised_amount_usd         0.403366
funded_year               0.403366
investor_category_code    0.593590
investor_state_code       2.361876
investor_country_code     2.524654
investor_city             2.751430
company_state_code        2.962161
company_country_code      3.025223
funded_quarter            3.226837
funded_month              3.226837
investor_region           3.238946
funding_round_type        3.252704
company_region            3.253541
company_category_code     3.262619
company_city              3.343512
funded_at                 3.378091
company_name              3.424955
investor_name             3.734270
company_permalink         3.869808
investor_permalink        4.749821
dtype: float64


#### An alternative way to calculate the memory footprint of each column

In [11]:
col_mem_usages = pd.Series(dtype="float64")
chunk_iter = pd.read_csv("crunchbase-investments.csv", chunksize=5000, encoding="ISO-8859-1")
for chunk in chunk_iter:
    if col_mem_usages.empty:
        col_mem_usages = chunk.memory_usage(deep=True) / (1024 * 1024)
    else:
        col_mem_usages += chunk.memory_usage(deep=True) / (1024 * 1024)

# Drop memory footprint calculation for the index.
col_mem_usages.drop("Index", inplace=True)
print(col_mem_usages.sort_values())

raised_amount_usd         0.403366
funded_year               0.403366
investor_category_code    0.593590
investor_state_code       2.361876
investor_country_code     2.524654
investor_city             2.751430
company_state_code        2.962161
company_country_code      3.025223
funded_month              3.226837
funded_quarter            3.226837
investor_region           3.238946
funding_round_type        3.252704
company_region            3.253541
company_category_code     3.262619
company_city              3.343512
funded_at                 3.378091
company_name              3.424955
investor_name             3.734270
company_permalink         3.869808
investor_permalink        4.749821
dtype: float64


### Total memory footprint of all the chunks combined

In [12]:
print(total_mem_usage_MB)

56.9876070022583


Or, alternatively:

In [13]:
print(col_mem_usages.sum())

56.9876070022583


## Selecting columns to drop

We can drop the columns that aren't useful for analysis: `investor_permalink` and `company_permalink`, which only hold the URL paths of the investor and company pages. We can also drop `investor_category_code`, since more than 90% of its values are missing (50427 of 52870 rows).

In [14]:
drop_cols = ["investor_permalink", "company_permalink", "investor_category_code"]

We can later use `df.drop(drop_cols, axis=1, inplace=True)` to actually drop these columns as and when needed.
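
An alternative, sketched here on a toy CSV, is to skip the unwanted columns at read time so that they never occupy memory. `read_csv` accepts a callable for `usecols` that is evaluated against each column name. The CSV text below is a stand-in for the real file; only the header names are taken from the data set.

```python
import io
import pandas as pd

drop_cols = ["investor_permalink", "company_permalink", "investor_category_code"]

# Toy stand-in for the real file; only the header names match the data set.
csv_text = (
    "company_permalink,company_name,investor_permalink,investor_category_code,raised_amount_usd\n"
    "/company/advercar,AdverCar,/company/1-800-flowers-com,,2000000\n"
    "/company/launchgram,LaunchGram,/company/10xelerator,finance,20000\n"
)

# Keep a column only if its name is not in drop_cols.
kept = pd.read_csv(io.StringIO(csv_text), usecols=lambda name: name not in drop_cols)
print(list(kept.columns))
```

The same `usecols` argument can be combined with `chunksize`, so each chunk would arrive without the dropped columns.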

## Selecting Data Types

We are working under memory constraints, so we can't load all the data into main memory at once to analyze it. One solution is to load the data into a SQLite database. The database file can be very large, since it is constrained only by the available disk space. When we need to run an analysis, we can select just the required data into a dataframe (in chunks, if it doesn't fit all at once) and then perform the analysis.

Before we can load the chunks into the SQLite database, we need to select a data type for each column. For some columns, the inferred data type may vary across chunks. We need to identify those columns so that each column is loaded into the SQLite database with one consistent data type.

In [15]:
col_types = {}
chunk_iter = pd.read_csv("crunchbase-investments.csv", chunksize=5000, encoding="ISO-8859-1")
for chunk in chunk_iter:
    for col in chunk.columns:
        if col in col_types:
            col_types[col].add(str(chunk[col].dtype))
        else:
            col_types[col] = set([str(chunk[col].dtype)])

for col, type_ in col_types.items():
    print(col, type_)

company_permalink {'object'}
company_name {'object'}
company_category_code {'object'}
company_country_code {'object'}
company_state_code {'object'}
company_region {'object'}
company_city {'object'}
investor_permalink {'object'}
investor_name {'object'}
investor_category_code {'float64', 'object'}
investor_country_code {'float64', 'object'}
investor_state_code {'float64', 'object'}
investor_region {'object'}
investor_city {'float64', 'object'}
funding_round_type {'object'}
funded_at {'object'}
funded_month {'object'}
funded_quarter {'object'}
funded_year {'float64', 'int64'}
raised_amount_usd {'float64'}
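
A likely explanation for the mixed entries above is that pandas infers dtypes chunk by chunk: a text column that happens to be entirely empty within one chunk is read as `float64` (all `NaN`), and an integer column becomes `float64` as soon as a chunk contains a missing value. This has not been checked against the individual chunks of the real file; the toy CSV below only reproduces the effect.

```python
import io
import pandas as pd

# Toy CSV: investor_city is filled in the first two rows and empty in the last two.
csv_text = (
    "investor_city,funded_year\n"
    "Columbus,2012\n"
    "New York,2011\n"
    ",2012\n"
    ",\n"
)

seen = {"investor_city": set(), "funded_year": set()}
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    for col in chunk.columns:
        seen[col].add(str(chunk[col].dtype))

# Each column ends up with two different dtypes, one per chunk.
print(seen)
```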


In [16]:
first_five

Unnamed: 0,company_permalink,company_name,company_category_code,company_country_code,company_state_code,company_region,company_city,investor_permalink,investor_name,investor_category_code,investor_country_code,investor_state_code,investor_region,investor_city,funding_round_type,funded_at,funded_month,funded_quarter,funded_year,raised_amount_usd
0,/company/advercar,AdverCar,advertising,USA,CA,SF Bay,San Francisco,/company/1-800-flowers-com,1-800-FLOWERS.COM,,USA,NY,New York,New York,series-a,2012-10-30,2012-10,2012-Q4,2012,2000000
1,/company/launchgram,LaunchGram,news,USA,CA,SF Bay,Mountain View,/company/10xelerator,10Xelerator,finance,USA,OH,Columbus,Columbus,other,2012-01-23,2012-01,2012-Q1,2012,20000
2,/company/utap,uTaP,messaging,USA,,United States - Other,,/company/10xelerator,10Xelerator,finance,USA,OH,Columbus,Columbus,other,2012-01-01,2012-01,2012-Q1,2012,20000
3,/company/zoopshop,ZoopShop,software,USA,OH,Columbus,columbus,/company/10xelerator,10Xelerator,finance,USA,OH,Columbus,Columbus,angel,2012-02-15,2012-02,2012-Q1,2012,20000
4,/company/efuneral,eFuneral,web,USA,OH,Cleveland,Cleveland,/company/10xelerator,10Xelerator,finance,USA,OH,Columbus,Columbus,other,2011-09-08,2011-09,2011-Q3,2011,20000


Using `col_types`, we select a dtype for the columns that showed more than one dtype across chunks, and choose a different dtype where that optimizes memory:

In [17]:
# Not including "investor_category_code" as we plan to drop it anyway.
wanted_dtypes = {
    "investor_country_code": "object",
    "investor_state_code": "object",
    "investor_city": "object", 
    "funded_year": "float32"
}
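
A mapping like this can also be handed to `read_csv` through its `dtype` parameter, so every chunk is parsed with the same types and no later conversion is needed. The sketch below assumes a two-column toy CSV rather than the real file, and a shortened mapping.

```python
import io
import pandas as pd

# Shortened version of the mapping, limited to the columns in the toy CSV.
toy_dtypes = {"investor_city": "object", "funded_year": "float32"}

csv_text = (
    "investor_city,funded_year\n"
    "Columbus,2012\n"
    "New York,2011\n"
    ",2012\n"
    ",\n"
)

# dtype= applies the same mapping to every chunk as it is parsed.
dtypes_per_chunk = [
    {col: str(dt) for col, dt in chunk.dtypes.items()}
    for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2, dtype=toy_dtypes)
]
print(dtypes_per_chunk)
```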

## Loading Chunks Into a Table in the SQLite Database

In [18]:
conn = sqlite3.connect("crunchbase.db")
cur = conn.cursor()
cur.execute("DROP TABLE IF EXISTS investments")
chunk_iter = pd.read_csv("crunchbase-investments.csv", chunksize=5000, encoding="ISO-8859-1",
                         parse_dates=["funded_at"])
for chunk in chunk_iter:
    # Dropped columns: "investor_permalink", "company_permalink", "investor_category_code"
    chunk.drop(drop_cols, axis=1, inplace=True)
    for col in wanted_dtypes:
        chunk[col] = chunk[col].astype(wanted_dtypes[col])
    chunk.to_sql("investments", conn, if_exists="append", index=False)
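
A quick sanity check after a load like this is to compare the number of rows in the table with the number of rows written. The sketch below uses a throwaway in-memory database and two toy chunks; `demo_conn`, `expected_rows`, and `stored_rows` are names chosen for this example. On the real data, the count should equal the number of rows counted earlier.

```python
import sqlite3
import pandas as pd

demo_conn = sqlite3.connect(":memory:")  # throwaway database for the sketch

# Two toy "chunks" standing in for the pieces read from the CSV.
chunks = [
    pd.DataFrame({"company_name": ["AdverCar", "LaunchGram"],
                  "raised_amount_usd": [2000000.0, 20000.0]}),
    pd.DataFrame({"company_name": ["uTaP"], "raised_amount_usd": [20000.0]}),
]

expected_rows = 0
for chunk in chunks:
    # if_exists="append" adds rows to the table instead of replacing it.
    chunk.to_sql("investments", demo_conn, if_exists="append", index=False)
    expected_rows += len(chunk)

stored_rows = demo_conn.execute("SELECT COUNT(*) FROM investments").fetchone()[0]
print(stored_rows == expected_rows)
demo_conn.close()
```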

Verify the data types in the SQLite database:

In [19]:
df = pd.read_sql("PRAGMA table_info('investments');", conn)
df

Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,company_name,TEXT,0,,0
1,1,company_category_code,TEXT,0,,0
2,2,company_country_code,TEXT,0,,0
3,3,company_state_code,TEXT,0,,0
4,4,company_region,TEXT,0,,0
5,5,company_city,TEXT,0,,0
6,6,investor_name,TEXT,0,,0
7,7,investor_country_code,TEXT,0,,0
8,8,investor_state_code,TEXT,0,,0
9,9,investor_region,TEXT,0,,0


## Analysis

We can perform some analysis to demonstrate the pandas and SQLite workflow.

### Which category of company attracted the most investments?

For this analysis, the columns of interest are `company_category_code` and `raised_amount_usd`.
We will select these columns from the `investments` table in the `crunchbase.db` database. We will perform the analysis in the dataframe itself, as computations are faster in memory.

In [20]:
query = """
    SELECT company_category_code, raised_amount_usd
    FROM investments;
"""
df = pd.read_sql(query, conn)
df

Unnamed: 0,company_category_code,raised_amount_usd
0,advertising,2000000.0
1,news,20000.0
2,messaging,20000.0
3,software,20000.0
4,web,20000.0
...,...,...
52865,enterprise,3800000.0
52866,mobile,10300000.0
52867,software,350000.0
52868,enterprise,8400000.0


In [21]:
df.groupby("company_category_code").sum().sort_values(by="raised_amount_usd", ascending=False)

company_category_code,raised_amount_usd
biotech,110396400000.0
software,73084520000.0
mobile,64777380000.0
cleantech,52705230000.0
enterprise,45860930000.0
web,40143260000.0
medical,25367110000.0
advertising,25076660000.0
ecommerce,22567220000.0
network_hosting,22419680000.0


### Which investor contributed the most money (across all startups)?

In [22]:
query = """
    SELECT investor_name, raised_amount_usd
    FROM investments;
"""
df = pd.read_sql(query, conn)
df

Unnamed: 0,investor_name,raised_amount_usd
0,1-800-FLOWERS.COM,2000000.0
1,10Xelerator,20000.0
2,10Xelerator,20000.0
3,10Xelerator,20000.0
4,10Xelerator,20000.0
...,...,...
52865,Zohar Gilon,3800000.0
52866,Zohar Gilon,10300000.0
52867,zohar israel,350000.0
52868,Zorba Lieberman,8400000.0


In [23]:
df.groupby("investor_name").sum().sort_values(by="raised_amount_usd", ascending=False)

investor_name,raised_amount_usd
Kleiner Perkins Caufield & Byers,1.121783e+10
New Enterprise Associates,9.692542e+09
Accel Partners,6.472126e+09
Goldman Sachs,6.375459e+09
Sequoia Capital,6.039402e+09
...,...
Charles Crawford,0.000000e+00
New York City Economic Development Council,0.000000e+00
Michael Milken,0.000000e+00
Robin Cremeens,0.000000e+00


### Which investors contributed the most money per startup?

In [24]:
query = """
    SELECT company_name, investor_name, raised_amount_usd
    FROM investments;
"""
df = pd.read_sql(query, conn)
df

Unnamed: 0,company_name,investor_name,raised_amount_usd
0,AdverCar,1-800-FLOWERS.COM,2000000.0
1,LaunchGram,10Xelerator,20000.0
2,uTaP,10Xelerator,20000.0
3,ZoopShop,10Xelerator,20000.0
4,eFuneral,10Xelerator,20000.0
...,...,...,...
52865,Garantia Data,Zohar Gilon,3800000.0
52866,DudaMobile,Zohar Gilon,10300000.0
52867,SiteBrains,zohar israel,350000.0
52868,Comprehend Systems,Zorba Lieberman,8400000.0


In [25]:
grouped = df.groupby("company_name")
company_names = df["company_name"].dropna().unique()
print(len(company_names))


11573


Since there are too many companies to list, we will print the top investors for the first 10 companies only.
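
For a summary across all companies at once, one possible alternative to looping over groups is a single `groupby` on both names. The sketch below uses toy rows in the same three-column shape as the query result above; the amounts are invented, and `per_pair` and `top_investor` are names chosen for this example.

```python
import pandas as pd

# Toy rows in the same shape as the (company, investor, amount) query result.
toy = pd.DataFrame({
    "company_name": ["LaunchGram", "LaunchGram", "uTaP", "LaunchGram"],
    "investor_name": ["10Xelerator", "500 Startups", "10Xelerator", "500 Startups"],
    "raised_amount_usd": [20000.0, 30000.0, 20000.0, 20000.0],
})

# Total per (company, investor) pair, largest amounts first within each company.
per_pair = (
    toy.groupby(["company_name", "investor_name"])["raised_amount_usd"]
    .sum()
    .reset_index()
    .sort_values(["company_name", "raised_amount_usd"], ascending=[True, False])
)

# After sorting, the first row per company is that company's largest investor.
top_investor = per_pair.drop_duplicates("company_name")
print(top_investor)
```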

In [26]:
for n, company in enumerate(company_names):
    grouped_company = grouped.get_group(company)
    sorted_investors = grouped_company.groupby("investor_name")["raised_amount_usd"].sum().sort_values(ascending=False)
    print("Company = {}\n".format(company))
    print(sorted_investors)
    print("\n-------------------------------------------------------------\n")
    if n == 9:
        break

Company = AdverCar

investor_name
TiE Angels                        2000000.0
New Orleans Startup Fund          2000000.0
Jit Saxena                        2000000.0
Canaan Partners                   2000000.0
Branford Castle Private Equity    2000000.0
1-800-FLOWERS.COM                 2000000.0
Name: raised_amount_usd, dtype: float64

-------------------------------------------------------------

Company = LaunchGram

investor_name
500 Startups    50000.0
10Xelerator     20000.0
Name: raised_amount_usd, dtype: float64

-------------------------------------------------------------

Company = uTaP

investor_name
10Xelerator    20000.0
Name: raised_amount_usd, dtype: float64

-------------------------------------------------------------

Company = ZoopShop

investor_name
10Xelerator    20000.0
Name: raised_amount_usd, dtype: float64

-------------------------------------------------------------

Company = eFuneral

investor_name
JumpStart                            250000.0
Innovation F