# Analysing Startup Fundraising Deals From Crunchbase

Every year, thousands of startup companies raise financing from investors. Each time a startup raises money, we refer to the event as a fundraising round. Crunchbase is a website that crowdsources information on the fundraising rounds of many startups. The Crunchbase user community submits, edits, and maintains most of the information in Crunchbase.

In return, Crunchbase makes the data available through a Web application and a fee-based API. Before Crunchbase switched to the paid API model, multiple groups crawled the site and released the data online. Because the information on the startups and their fundraising rounds is always changing, the data set we'll be using isn't completely up to date.

The data set of investments we'll be exploring is current as of October 2013. It is available at [this GitHub repository](https://github.com/datahoarder/crunchbase-october-2013/blob/master/crunchbase-investments.csv).

Throughout this project, we'll practice working with different memory constraints. In this step, let's assume we only have 10 megabytes of available memory. While `crunchbase-investments.csv` consumes 10.3 megabytes of disk space, we know that pandas often requires 4 to 6 times amount of space in memory as the file does on disk (especially when there's many string columns).

In [62]:
import pandas as pd

pd.options.display.max_columns = 99
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1')

mv_list = []
for chunk in chunk_iter:
    mv_list.append(chunk.isnull().sum())
    
combined_mv_vc = pd.concat(mv_list)
unique_combined_mv_vc = combined_mv_vc.groupby(combined_mv_vc.index).sum()
print('Missing values in each column:\n')
print(unique_combined_mv_vc.sort_values(ascending=False))

Missing values in each column:

investor_category_code    50427
investor_state_code       16809
investor_city             12480
investor_country_code     12001
raised_amount_usd          3599
company_category_code       643
company_city                533
company_state_code          492
funding_round_type            3
funded_year                   3
funded_month                  3
funded_at                     3
funded_quarter                3
investor_name                 2
investor_permalink            2
investor_region               2
company_region                1
company_permalink             1
company_name                  1
company_country_code          1
dtype: int64


In [63]:
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1')
series_memory_fp = pd.Series()
for i, chunk in enumerate(chunk_iter):
    mem_usage = chunk.memory_usage(deep=True) / (1024 * 1024)
    if i == 0:
        series_memory_fp = mem_usage
    else:
        series_memory_fp += mem_usage
    print('Chunk {0}: memory usage: {1:.2f} MB; num rows: {2}'.format(i, mem_usage.sum(), chunk.shape[0]))
        
# Drop memory footprint calculation for the index.
series_memory_fp = series_memory_fp.drop('Index')
print('\nMemory usage (MB) for each column:\n')
print(series_memory_fp.sort_values(ascending=False))
print('\nTotal memory footprint for the dataset is {:.3f} MB'.format(series_memory_fp.sum()))

Chunk 0: memory usage: 5.58 MB; num rows: 5000
Chunk 1: memory usage: 5.53 MB; num rows: 5000
Chunk 2: memory usage: 5.54 MB; num rows: 5000
Chunk 3: memory usage: 5.53 MB; num rows: 5000
Chunk 4: memory usage: 5.52 MB; num rows: 5000
Chunk 5: memory usage: 5.55 MB; num rows: 5000
Chunk 6: memory usage: 5.53 MB; num rows: 5000
Chunk 7: memory usage: 5.51 MB; num rows: 5000
Chunk 8: memory usage: 5.40 MB; num rows: 5000
Chunk 9: memory usage: 4.64 MB; num rows: 5000
Chunk 10: memory usage: 2.66 MB; num rows: 2870

Memory usage (MB) for each column:

investor_permalink        4.749821
company_permalink         3.869808
investor_name             3.734270
company_name              3.424955
funded_at                 3.378091
company_city              3.343512
company_category_code     3.262619
company_region            3.253541
funding_round_type        3.252704
investor_region           3.238946
funded_quarter            3.226837
funded_month              3.226837
company_country_code     

In [64]:
# Drop columns representing URL's or containing way too many missing values (>90% missing)
drop_cols = ['investor_permalink', 'company_permalink', 'investor_category_code']
keep_cols = chunk.columns.drop(drop_cols)
keep_cols.tolist()

['company_name',
 'company_category_code',
 'company_country_code',
 'company_state_code',
 'company_region',
 'company_city',
 'investor_name',
 'investor_country_code',
 'investor_state_code',
 'investor_region',
 'investor_city',
 'funding_round_type',
 'funded_at',
 'funded_month',
 'funded_quarter',
 'funded_year',
 'raised_amount_usd']

## Selecting data types

Now that we have a good sense of the missing values, let's get familiar with the column types before adding the data into SQLite.

### Investigation

- column data types
- view data sample
- unique values
- missing values
- checking for integers

In [66]:
# Key: Column name, Value: List of types
col_types = {}
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1', 
                         usecols=keep_cols)

for chunk in chunk_iter:
    for col in chunk.columns:
        if col not in col_types:
            col_types[col] = [str(chunk.dtypes[col])]
        else:
            col_types[col].append(str(chunk.dtypes[col]))
            
uniq_col_types = {}
for k,v in col_types.items():
    uniq_col_types[k] = set(col_types[k])
uniq_col_types

{'company_category_code': {'object'},
 'company_city': {'object'},
 'company_country_code': {'object'},
 'company_name': {'object'},
 'company_region': {'object'},
 'company_state_code': {'object'},
 'funded_at': {'object'},
 'funded_month': {'object'},
 'funded_quarter': {'object'},
 'funded_year': {'float64', 'int64'},
 'funding_round_type': {'object'},
 'investor_city': {'float64', 'object'},
 'investor_country_code': {'float64', 'object'},
 'investor_name': {'object'},
 'investor_region': {'object'},
 'investor_state_code': {'float64', 'object'},
 'raised_amount_usd': {'float64'}}

In [90]:
chunk.head(5)

Unnamed: 0,company_name,company_category_code,company_country_code,company_state_code,company_region,company_city,investor_name,investor_country_code,investor_state_code,investor_region,investor_city,funding_round_type,funded_at,funded_quarter,raised_amount_usd
50000,NuORDER,fashion,USA,CA,Los Angeles,West Hollywood,Mortimer Singer,,,unknown,,series-a,2012-10-01,2012-Q4,3060000.0
50001,ChaCha,advertising,USA,IN,Indianapolis,Carmel,Morton Meyerson,,,unknown,,series-b,2007-10-01,2007-Q4,12000000.0
50002,Binfire,software,USA,FL,Bocat Raton,Bocat Raton,Moshe Ariel,,,unknown,,angel,2008-04-18,2008-Q2,500000.0
50003,Binfire,software,USA,FL,Bocat Raton,Bocat Raton,Moshe Ariel,,,unknown,,angel,2010-01-01,2010-Q1,750000.0
50004,Unified Color,software,USA,CA,SF Bay,South San Frnacisco,Mr. Andrew Oung,,,unknown,,angel,2010-01-01,2010-Q1,


In [86]:
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1', 
                         usecols=keep_cols)

# Key: Column name, Value: list of unique value counts
unique_values = {}
for chunk in chunk_iter:
    for col in chunk.columns:
        if col not in unique_values:
            unique_values[col] = [chunk[col].value_counts()]
        else:
            unique_values[col].append(chunk[col].value_counts())

unique_value_counts = {}
for col,values in unique_values.items():
    df = pd.concat(values)
    unique_value_counts[col] = df.groupby(df.index).sum().size
    
print('Unique values in each column:\n')
print(pd.Series(unique_value_counts).sort_values())

Unique values in each column:

company_country_code         2
funding_round_type           9
company_category_code       43
company_state_code          50
investor_state_code         50
funded_quarter              72
investor_country_code       72
company_region             546
investor_region            585
investor_city              990
company_city              1229
raised_amount_usd         1458
funded_at                 2808
investor_name            10465
company_name             11573
dtype: int64


In [87]:
import numpy as np
def is_integer(val):
    if not np.isnan(val):
        try:
            if str(val).split('.')[1] == 0:
                return True
        except Exception as e:
            print(val)
            print(e)
    else:
        return 'null'
    return False
chunk['raised_amount_usd'].apply(is_integer).value_counts()

False    2580
null      290
Name: raised_amount_usd, dtype: int64

### Converting and reexamining footprint

We can afford to drop the `funded_month` and `funded_year` columns, if we can convert the `funded_at` column to *datetime*. We can also strip the `funded_quarter` columnd and and convert to *category*. 

The following *string* columns should be converted to *category*:

- `company_country_code`
- `funding_round_type`
- `company_category_code`
- `company_state_code`
- `investor_state_code`
- `investor_country_code`
- `company_region`
- `investor_region`
- `investor_city`
- `company_city`

We will keep `raised_amount` as *float* for now, as seems to have decimal values. 

Finally, the remaining columns will stay as *string*:

- `investor_name`
- `company_name`

In [88]:
def strip_quarter(val):
    if isinstance(val, str):
        return val[-2]
    return val

convert_category_cols = {
    'company_country_code': 'category',
    'funding_round_type': 'category',
    'company_category_code': 'category',
    'company_state_code': 'category',
    'investor_state_code': 'category',
    'investor_country_code': 'category',
    'company_region': 'category',
    'investor_region': 'category',
    'investor_city': 'category',
    'company_city': 'category'
}

keep_cols = list(['company_name', 'company_category_code', 'company_country_code', 'company_state_code',
                  'company_region', 'company_city', 'investor_name', 'investor_country_code', 'investor_state_code',
                  'investor_region', 'investor_city', 'funding_round_type', 'funded_at', 'funded_quarter', 'raised_amount_usd'])

chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1', 
                         usecols=keep_cols, dtype=convert_category_cols)

for i, chunk in enumerate(chunk_iter):
    chunk['funded_at'] = pd.to_datetime(chunk['funded_at'], format='%Y-%m-%d')
    chunk['funded_quarter'].apply(strip_quarter)
    chunk['funded_quarter'] = chunk['funded_quarter'].astype('category')
    mem_usage = chunk.memory_usage(deep=True) / (1024 * 1024)
    if i == 0:
        series_memory_fp = mem_usage
    else:
        series_memory_fp += mem_usage
    print('Chunk {0}: memory usage: {1:.2f} MB; num rows: {2}'.format(i, mem_usage.sum(), chunk.shape[0]))
        
# Drop memory footprint calculation for the index.
series_memory_fp = series_memory_fp.drop('Index')
print('\nMemory usage (MB) for each column:\n')
print(series_memory_fp.sort_values(ascending=False))
print('\nTotal memory footprint for the dataset is {:.3f} MB'.format(series_memory_fp.sum()))

Chunk 0: memory usage: 0.97 MB; num rows: 5000
Chunk 1: memory usage: 0.97 MB; num rows: 5000
Chunk 2: memory usage: 0.97 MB; num rows: 5000
Chunk 3: memory usage: 0.94 MB; num rows: 5000
Chunk 4: memory usage: 0.97 MB; num rows: 5000
Chunk 5: memory usage: 0.97 MB; num rows: 5000
Chunk 6: memory usage: 0.97 MB; num rows: 5000
Chunk 7: memory usage: 0.96 MB; num rows: 5000
Chunk 8: memory usage: 0.93 MB; num rows: 5000
Chunk 9: memory usage: 0.84 MB; num rows: 5000
Chunk 10: memory usage: 0.50 MB; num rows: 2870

Memory usage (MB) for each column:

investor_name            3.734270
company_name             3.424955
company_city             0.624051
raised_amount_usd        0.403366
funded_at                0.403366
company_region           0.317376
investor_city            0.300412
investor_region          0.217028
funded_quarter           0.117315
company_category_code    0.091980
company_state_code       0.091649
investor_state_code      0.079806
investor_country_code    0.079145
fun

All of this data type optimisation has enabled us to reduce the memory footprint from almost 57MB, to less than 10MB. This is significant difference.

## Loading chunks into SQLite

Now we're in good shape to start exploring and analyzing the data. The next step is to load each chunk into a table in a SQLite database so we can query the full data set.

In [91]:
import sqlite3
conn = sqlite3.connect('crunchbase.db')
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1', 
                         usecols=keep_cols, dtype=convert_category_cols)

for chunk in chunk_iter:
    chunk['funded_at'] = pd.to_datetime(chunk['funded_at'], format='%Y-%m-%d')
    chunk['funded_quarter'].apply(strip_quarter)
    chunk['funded_quarter'] = chunk['funded_quarter'].astype('category')
    chunk.to_sql("investments", conn, if_exists='append', index=False)

In [96]:
!wc *

     723     2218    25889 Basics.ipynb
       0   263913 10339663 crunchbase-investments.csv
    5866   287787  7294976 crunchbase.db
    6589   553918 17660528 total


The physical size of the databse file is not too different from the csv file, but allows us to query as required. 

## Data exploration and analysis

Now that the data is in SQLite, we can use the pandas SQLite workflow to explore and analyze startup investments. 

*Note:* Each row isn't a unique company, but a unique investment from a single investor. This means that many startups will span multiple rows.

### Most popular company category

In [101]:
query = '''
    SELECT company_category_code category, COUNT(*) investments
    FROM investments
    GROUP BY 1
    ORDER BY 2 DESC;
'''
pd.read_sql(query, conn).head()

Unnamed: 0,category,investments
0,software,7243
1,web,5015
2,biotech,4951
3,enterprise,4489
4,mobile,4067


### Biggest investors

In [134]:
query = '''
    SELECT 
        investor_name investor, 
        SUM(raised_amount_usd) / 1000000000 'total invested ($ bils)', 
        COUNT(DISTINCT(company_name)) 'num companies invested in'
    FROM investments
    GROUP BY 1
    ORDER BY 2 DESC;
'''
pd.read_sql(query, conn).head(10)

Unnamed: 0,investor,total invested ($ bils),num companies invested in
0,Kleiner Perkins Caufield & Byers,11.217826,225
1,New Enterprise Associates,9.692542,283
2,Accel Partners,6.472126,186
3,Goldman Sachs,6.375459,96
4,Sequoia Capital,6.039402,215
5,Intel,5.9692,14
6,Google,5.8088,21
7,Time Warner,5.73,9
8,Comcast,5.669,6
9,Greylock Partners,4.960983,158


In [133]:
query = '''
    SELECT 
        investor_name investor, 
        COUNT(DISTINCT(company_name)) 'num companies invested in', 
        (SUM(raised_amount_usd) / COUNT(DISTINCT(company_name))) / 1000000000 'total invested / company ($ bils)'
    FROM investments
    GROUP BY 1
    ORDER BY 3 DESC;
'''
pd.read_sql(query, conn).head()

Unnamed: 0,investor,num companies invested in,total invested / company ($ bils)
0,BrightHouse,1,4.7
1,Marlin Equity Partners,1,2.6
2,Sprint Nextel,1,2.5
3,GI Partners,1,1.05
4,Comcast,6,0.944833


### Funding round popularity

In [132]:
query = '''
    SELECT 
        funding_round_type, 
        COUNT(raised_amount_usd) 'number of investments',
        SUM(raised_amount_usd)/1000000000 'total raised ($ bil)'
    FROM investments
    GROUP BY 1
    ORDER BY 2 DESC;
'''
pd.read_sql(query, conn)

Unnamed: 0,funding_round_type,number of investments,total raised ($ bil)
0,series-a,13377,86.542151
1,series-c+,10764,265.753464
2,series-b,8630,128.326776
3,venture,8031,130.556496
4,angel,7190,4.962075
5,other,934,18.507258
6,private-equity,312,16.159876
7,post-ipo,29,30.9176
8,crowdfunding,4,0.006491
9,,0,


## Next steps

Here are some ideas for further exploration:

- Repeat the tasks in this guided project using stricter memory constraints (under 1 megabyte).
- Clean and analyze the other Crunchbase data sets from the same GitHub repo.
    - Understand which columns the data sets share, and how the data sets are linked.
    - Create a relational database design that links the data sets together and reduces the overall disk space the database file consumes.
    - Use pandas to populate each table in the database, create the appropriate indexes, and so on.