# Analyzing Startup Fundraising Deals from Crunchbase

In this project, we'll practice using some of the techniques we learned to analyze startup investments from Crunchbase.com.
Every year, thousands of startup companies raise financing from investors. Each time a startup raises money, we refer to the event as a fundraising round. Crunchbase is a website that crowdsources information on the fundraising rounds of many startups. The Crunchbase user community submits, edits, and maintains most of the information in Crunchbase.
In return, Crunchbase makes the data available through a web application and a fee-based API. Before Crunchbase switched to the paid API model, multiple groups went to the site and released the data online. Since the information on the startups and their fundraising rounds is always changing, the dataset we'll be using isn't completely up to date.
The dataset of investments we'll be exploring is from October 2013. You can download it from [GitHub](https://github.com/datahoarder/crunchbase-october-2013/blob/master/crunchbase-investments.csv).

We'll practice working with different memory constraints. In this step, let's assume we only have 10 megabytes of available memory. While `crunchbase-investments.csv` consumes 10.3 megabytes of disk space, we know that pandas typically requires significantly more space in memory than the file does on disk (especially when there are multiple string columns). The exact memory usage can vary depending on the pandas version, data types, and specific operations, but it's often several times larger than the original file size.
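
To make the gap between disk size and in-memory size concrete, here is a minimal sketch on a small synthetic CSV. The column names and values are borrowed from the preview of the file, but the repeated rows are an assumption made purely for illustration, so the ratio it prints says nothing about the real dataset.

```python
import io

import pandas as pd

# Synthetic sample: three rows repeated 1000 times (illustrative only).
header = "company_name,company_country_code,company_state_code\n"
rows = "AdverCar,USA,CA\nLaunchGram,USA,CA\nZoopShop,USA,OH\n" * 1000
csv_text = header + rows

# Size of the text as it would sit on disk.
disk_bytes = len(csv_text.encode("latin1"))

# Size of the parsed DataFrame; deep=True also counts the string payloads.
df = pd.read_csv(io.StringIO(csv_text))
mem_bytes = int(df.memory_usage(deep=True).sum())

print(f"on disk: {disk_bytes} bytes, in memory: {mem_bytes} bytes")
```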

We'll process the data in chunks of 5000 rows and examine each column's missing value count, each column's memory footprint, and the total memory footprint of all chunks combined. We'll also drop columns that aren't needed for the analysis.

In [37]:
import pandas as pd
import numpy as np
import sqlite3

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

In [2]:
# Total rows

chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='latin1')
total_rows = 0
for chunk in chunk_iter:
    total_rows += len(chunk)
print(total_rows)

52870


In [4]:
# First five rows

chunk = pd.read_csv('crunchbase-investments.csv', nrows=5, encoding='latin1') 
print(chunk)
cols = chunk.columns

     company_permalink company_name company_category_code  \
0    /company/advercar     AdverCar           advertising   
1  /company/launchgram   LaunchGram                  news   
2        /company/utap         uTaP             messaging   
3    /company/zoopshop     ZoopShop              software   
4    /company/efuneral     eFuneral                   web   

  company_country_code company_state_code         company_region  \
0                  USA                 CA                 SF Bay   
1                  USA                 CA                 SF Bay   
2                  USA                NaN  United States - Other   
3                  USA                 OH               Columbus   
4                  USA                 OH              Cleveland   

    company_city          investor_permalink      investor_name  \
0  San Francisco  /company/1-800-flowers-com  1-800-FLOWERS.COM   
1  Mountain View        /company/10xelerator        10Xelerator   
2            NaN       

In [13]:
# Missing values

chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='latin1')
missing_counts = [chunk.isna().sum() for chunk in chunk_iter]
combined_counts = pd.concat(missing_counts).groupby(level=0).sum()
print(combined_counts)

company_category_code       643
company_city                533
company_country_code          1
company_name                  1
company_permalink             1
company_region                1
company_state_code          492
funded_at                     3
funded_month                  3
funded_quarter                3
funded_year                   3
funding_round_type            3
investor_category_code    50427
investor_city             12480
investor_country_code     12001
investor_name                 2
investor_permalink            2
investor_region               2
investor_state_code       16809
raised_amount_usd          3599
dtype: int64


In [17]:
# Each column's memory footprint

chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='latin1')
footprint = [chunk.memory_usage(deep=True) for chunk in chunk_iter]
combined_footprint = pd.concat(footprint).groupby(level=0).sum() / (2 ** 20)  # in MB
total_footprint = combined_footprint.sum()
print(combined_footprint)
print(total_footprint)

Index                     0.001381
company_category_code     3.262619
company_city              3.343473
company_country_code      3.025223
company_name              3.424955
company_permalink         3.869808
company_region            3.253503
company_state_code        2.962161
funded_at                 3.378091
funded_month              3.226837
funded_quarter            3.226837
funded_year               0.403366
funding_round_type        3.252704
investor_category_code    0.593590
investor_city             2.751430
investor_country_code     2.524654
investor_name             3.734270
investor_permalink        4.749821
investor_region           3.238946
investor_state_code       2.361876
raised_amount_usd         0.403366
dtype: float64
56.988911628723145


At least one column (`investor_category_code`) has more than 90% missing values (50,427 of 52,870 rows). Additionally, the `investor_permalink` and `company_permalink` columns contribute little to the analysis but take up a fair amount of memory (about 4.7 MB and 3.9 MB, respectively). We'll delete these columns.
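
The 90% threshold can be checked by dividing the missing counts by the total row count. The sketch below reuses a handful of the counts printed above rather than re-reading the file; the variable names `missing_share` and `mostly_missing` are illustrative.

```python
import pandas as pd

total_rows = 52870  # row count reported by the first cell

# Missing-value counts copied from the output above (largest five only).
missing = pd.Series({
    "investor_category_code": 50427,
    "investor_state_code": 16809,
    "investor_city": 12480,
    "investor_country_code": 12001,
    "raised_amount_usd": 3599,
})

# Share of rows with a missing value, per column.
missing_share = missing / total_rows
mostly_missing = missing_share[missing_share > 0.9].index.tolist()

print(missing_share.round(3))
print(mostly_missing)
```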

In [18]:
# Drop columns representing URLs or containing too many missing values (>90% missing)
drop_cols = ['investor_permalink', 'company_permalink', 'investor_category_code']
keep_cols = chunk.columns.drop(drop_cols)

In [20]:
keep_cols.tolist()

['company_name',
 'company_category_code',
 'company_country_code',
 'company_state_code',
 'company_region',
 'company_city',
 'investor_name',
 'investor_country_code',
 'investor_state_code',
 'investor_region',
 'investor_city',
 'funding_round_type',
 'funded_at',
 'funded_month',
 'funded_quarter',
 'funded_year',
 'raised_amount_usd']

## Selecting Data Types

Now that we have a good sense of the missing values, let's get familiar with the column types before adding the data into SQLite.

In [21]:
# Writing the dtype for each column in each chunk into a dictionary
col_types = {}
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='latin1', usecols=keep_cols)

for chunk in chunk_iter:
    for col in chunk.columns:
        if col not in col_types:
            col_types[col] = [str(chunk.dtypes[col])]
        else:
            col_types[col].append(str(chunk.dtypes[col]))

In [22]:
# For each dictionary entry only keep the unique dtypes

uniq_col_types = {col: set(dtypes) for col, dtypes in col_types.items()}
print(uniq_col_types)

{'company_name': {'object'}, 'company_category_code': {'object'}, 'company_country_code': {'object'}, 'company_state_code': {'object'}, 'company_region': {'object'}, 'company_city': {'object'}, 'investor_name': {'object'}, 'investor_country_code': {'object', 'float64'}, 'investor_state_code': {'object', 'float64'}, 'investor_region': {'object'}, 'investor_city': {'object', 'float64'}, 'funding_round_type': {'object'}, 'funded_at': {'object'}, 'funded_month': {'object'}, 'funded_quarter': {'object'}, 'funded_year': {'float64', 'int64'}, 'raised_amount_usd': {'float64'}}
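
Several columns report both a string type and `float64`. One likely explanation is that pandas infers types per chunk, and a chunk in which a column is entirely missing is read as `float64` (all `NaN`). The sketch below reproduces that effect on a tiny hypothetical CSV; the rows are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical sample: the first two rows have no investor_city.
csv_text = (
    "investor_name,investor_city\n"
    "A,\n"
    "B,\n"
    "C,Boston\n"
    "D,Austin\n"
)

# With chunksize=2, the first chunk sees only missing values in investor_city.
dtypes_seen = set()
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    dtypes_seen.add(str(chunk["investor_city"].dtype))

print(dtypes_seen)
```

Passing an explicit `dtype` mapping to `pd.read_csv` is one way to keep the type consistent across chunks.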


It seems as if the different date columns `funded_at`, `funded_month`, `funded_quarter`, `funded_year` contain somewhat redundant information. We can convert `funded_at` into a timestamp, and the `funded_quarter` into a `period` type. The year and month can later still be extracted from the `funded_at` column, and the information about the quarter will be found in the `funded_quarter` column. The rest of those columns can be deleted.
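
As a quick check that nothing is lost by dropping the month and year columns, the sketch below derives the quarter, year, and month from a parsed date column. The sample dates and their ISO format are assumptions; the real `funded_at` values may be formatted differently.

```python
import pandas as pd

# Hypothetical funding dates, including one missing value.
funded_at = pd.Series(["2012-06-30", "2013-01-15", None])

dates = pd.to_datetime(funded_at)      # missing values become NaT
quarters = dates.dt.to_period("Q")     # quarterly periods
years = dates.dt.year                  # recoverable from the timestamp
months = dates.dt.month

print(pd.DataFrame({"date": dates, "quarter": quarters,
                    "year": years, "month": months}))
```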

In [23]:
# Convert the 'funded_at' column to a datetime and convert funded_quarter to a period
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='latin1')
for chunk in chunk_iter:
    chunk['funded_at'] = pd.to_datetime(chunk['funded_at'])
    chunk['funded_quarter'] = chunk['funded_at'].dt.to_period('Q')

In [24]:
# Update keep and drop lists
drop_cols = ['investor_permalink', 'company_permalink', 'investor_category_code', 'funded_month', 'funded_year']
keep_cols = chunk.columns.drop(drop_cols)
keep_cols.tolist()

['company_name',
 'company_category_code',
 'company_country_code',
 'company_state_code',
 'company_region',
 'company_city',
 'investor_name',
 'investor_country_code',
 'investor_state_code',
 'investor_region',
 'investor_city',
 'funding_round_type',
 'funded_at',
 'funded_quarter',
 'raised_amount_usd']

In [26]:
chunk = pd.read_csv('crunchbase-investments.csv', nrows=5, encoding='latin1') 
print(chunk)

     company_permalink company_name company_category_code  \
0    /company/advercar     AdverCar           advertising   
1  /company/launchgram   LaunchGram                  news   
2        /company/utap         uTaP             messaging   
3    /company/zoopshop     ZoopShop              software   
4    /company/efuneral     eFuneral                   web   

  company_country_code company_state_code         company_region  \
0                  USA                 CA                 SF Bay   
1                  USA                 CA                 SF Bay   
2                  USA                NaN  United States - Other   
3                  USA                 OH               Columbus   
4                  USA                 OH              Cleveland   

    company_city          investor_permalink      investor_name  \
0  San Francisco  /company/1-800-flowers-com  1-800-FLOWERS.COM   
1  Mountain View        /company/10xelerator        10Xelerator   
2            NaN       

In [30]:
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='latin1', usecols=keep_cols)
unique_counts = [chunk.nunique() for chunk in chunk_iter]
combined_counts = pd.concat(unique_counts).groupby(level=0).sum()
print(combined_counts)

company_category_code      451
company_city              5181
company_country_code        12
company_name             30586
company_region            2066
company_state_code         494
funded_at                18304
funded_quarter             656
funding_round_type          88
investor_city             2038
investor_country_code      310
investor_name            10485
investor_region           1337
investor_state_code        327
raised_amount_usd         6280
dtype: int64


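Note that these figures are sums of per-chunk unique counts, so a value that appears in several chunks is counted once per chunk and the totals overstate the true number of distinct values. Even so, the following string columns have few enough distinct values to be candidates for the `category` data type: `company_category_code`, `company_country_code`, `company_state_code`, `company_region`, `investor_country_code`, `investor_state_code`, `investor_region`, `funding_round_type`. We will retain `company_name` and `investor_name` as string objects.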
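
If exact distinct counts are needed, one option is to keep a running set of values per column while iterating over the chunks. The sketch below contrasts that with summing `nunique()` per chunk on a tiny hypothetical CSV; the rows and the `seen` and `exact` names are illustrative, and the approach assumes the distinct values of each column fit in memory.

```python
import io

import pandas as pd

# Hypothetical sample in which "USA" and "angel" appear in both chunks.
csv_text = (
    "company_country_code,funding_round_type\n"
    "USA,angel\n"
    "USA,series-a\n"
    "USA,angel\n"
    "CAN,angel\n"
)

seen = {}              # column name -> set of distinct values seen so far
per_chunk_counts = []  # nunique() of each chunk, as in the cell above
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    per_chunk_counts.append(chunk.nunique())
    for col in chunk.columns:
        seen.setdefault(col, set()).update(chunk[col].dropna().unique())

summed = pd.concat(per_chunk_counts).groupby(level=0).sum()
exact = pd.Series({col: len(values) for col, values in seen.items()})

print(summed)  # counts repeated values once per chunk
print(exact)   # true number of distinct values
```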

With regard to the numeric columns, `raised_amount_usd` looks like a candidate for an integer type. Let's implement all these changes and check the memory footprint again.

In [36]:
# Convert to category type and int type
category_cols = ['company_category_code', 'company_country_code', 'company_state_code', 'company_region', 'investor_country_code', 'investor_state_code', 'investor_region', 'funding_round_type']
footprints = []
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='latin1', usecols=keep_cols)
for chunk in chunk_iter:
    for col in category_cols:
        chunk[col] = chunk[col].astype('category')
    chunk['raised_amount_usd'] = pd.to_numeric(chunk['raised_amount_usd'], errors='coerce').astype('Int64')
    footprints.append(chunk.memory_usage(deep=True))

# Calculate and print memory footprints
combined_footprint = pd.concat(footprints).groupby(level=0).sum() / (2 ** 20)  # in MB
total_footprint = combined_footprint.sum()

print(combined_footprint)
print(total_footprint)

Index                    0.001381
company_category_code    0.089798
company_city             3.343473
company_country_code     0.052244
company_name             3.424955
company_region           0.300736
company_state_code       0.089467
funded_at                3.378091
funded_quarter           3.226837
funding_round_type       0.059038
investor_city            2.751430
investor_country_code    0.077566
investor_name            3.734270
investor_region          0.208169
investor_state_code      0.078227
raised_amount_usd        0.453787
dtype: float64
21.26947021484375


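Dropping five columns and converting eight string columns to `category` reduced the overall memory footprint from about 57 MB to about 21 MB, a reduction of roughly 63%. The `funded_at` and `funded_quarter` columns still show their original footprint because the date conversions were not applied in this pass.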
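
The sketch below isolates the two conversions on synthetic data (the values are made up). It shows why `category` shrinks repetitive string columns, and why `raised_amount_usd` grew slightly from 0.403 MB to 0.454 MB: the nullable `Int64` type stores a one-byte missing-value mask next to each 8-byte integer.

```python
import numpy as np
import pandas as pd

n = 10_000

# A low-cardinality string column: plain strings versus category codes.
codes = pd.Series(["USA", "CAN", "GBR", "USA"] * (n // 4))
obj_bytes = codes.memory_usage(index=False, deep=True)
cat_bytes = codes.astype("category").memory_usage(index=False, deep=True)

# A numeric column: float64 versus nullable Int64 (8-byte value + 1-byte mask).
amounts = pd.Series(np.arange(n, dtype="float64"))
float_bytes = amounts.memory_usage(index=False)
int_bytes = amounts.astype("Int64").memory_usage(index=False)

print(f"strings: {obj_bytes} bytes, category: {cat_bytes} bytes")
print(f"float64: {float_bytes} bytes, Int64: {int_bytes} bytes")
```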

## Loading Chunks into SQLite

Now we're in good shape to start exploring and analyzing the data. The next step is to load each chunk into a table in a SQLite database so we can query the full dataset.

In [38]:
conn = sqlite3.connect('crunchbase.db')
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='latin1', usecols=keep_cols)
for chunk in chunk_iter:
    for col in category_cols:
        chunk[col] = chunk[col].astype('category')
    chunk['raised_amount_usd'] = pd.to_numeric(chunk['raised_amount_usd'], errors='coerce').astype('Int64')
    chunk.to_sql("investments", conn, if_exists='append', index=False)

In [40]:
conn = sqlite3.connect('crunchbase.db')
results_df = pd.read_sql("""
PRAGMA table_info(investments);
""", conn)
print(results_df)

    cid                   name     type  notnull dflt_value  pk
0     0           company_name     TEXT        0       None   0
1     1  company_category_code     TEXT        0       None   0
2     2   company_country_code     TEXT        0       None   0
3     3     company_state_code     TEXT        0       None   0
4     4         company_region     TEXT        0       None   0
5     5           company_city     TEXT        0       None   0
6     6          investor_name     TEXT        0       None   0
7     7  investor_country_code     TEXT        0       None   0
8     8    investor_state_code     TEXT        0       None   0
9     9        investor_region     TEXT        0       None   0
10   10          investor_city     TEXT        0       None   0
11   11     funding_round_type     TEXT        0       None   0
12   12              funded_at     TEXT        0       None   0
13   13         funded_quarter     TEXT        0       None   0
14   14      raised_amount_usd  INTEGER 

In [41]:
conn = sqlite3.connect('crunchbase.db')
row_count = pd.read_sql("SELECT COUNT(*) as count FROM investments;", conn)
print(row_count)

   count
0  52870


Loading the table into `crunchbase.db` works: all 52,870 rows were transferred. The data types are less satisfying, though. Most columns are stored as `TEXT`. SQLite has no dedicated `DATE` type or `category` type, so these columns are cast as `TEXT`. At least the `raised_amount_usd` column was stored as `INTEGER`, as intended. Storing the data in a PostgreSQL database, which offers many more data types, could help here.
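
Staying with SQLite, one workaround is to restore the types when reading the data back. The sketch below uses an in-memory database and a hypothetical `investments_demo` table with made-up dates and amounts; it relies on the `parse_dates` argument of `pd.read_sql` to turn the `TEXT` dates back into timestamps.

```python
import sqlite3

import pandas as pd

# Hypothetical rows; dates are stored as plain strings, as in the table above.
df = pd.DataFrame({
    "company_name": ["AdverCar", "LaunchGram"],
    "funded_at": ["2012-10-30", "2012-01-23"],
    "raised_amount_usd": [2000000, 20000],
})

conn = sqlite3.connect(":memory:")  # throwaway database for the demo
df.to_sql("investments_demo", conn, index=False)

# parse_dates converts the TEXT column back to datetime64 on read.
restored = pd.read_sql(
    "SELECT * FROM investments_demo;", conn, parse_dates=["funded_at"]
)
restored["funded_quarter"] = restored["funded_at"].dt.to_period("Q")
conn.close()

print(restored.dtypes)
```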