# Analyzing Startup Fundraising Deals from Crunchbase

# Introduction

Every year, thousands of startup companies raise financing from investors. Each time a startup raises money, we refer to the event as a fundraising round. Crunchbase is a website that crowdsources information on the fundraising rounds of many startups. Throughout this project, we'll practice working with different memory constraints.

# Working with 10 MB of Memory

Let's assume we only have 10 megabytes of available memory. While `crunchbase-investments.csv` consumes 10.3 megabytes of disk space, pandas often requires 4 to 6 times as much space in memory as the file occupies on disk (especially when there are many string columns). Let's explore the data set in chunks to become familiar with its contents.

In [1]:
# Importing required libraries
import pandas as pd
import numpy as np
import sqlite3

In [2]:
# Dictionary to store each column's missing value counts
missing_val_counts = {}
# Dictionary to store each column's memory footprint
memory_footprint = {}

In [3]:
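# Checking which encoding Python uses by default when opening the csv file
# (this is the platform default, not a detection of the file's actual encoding)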
with open("crunchbase-investments.csv") as f:
    print(f)

<_io.TextIOWrapper name='crunchbase-investments.csv' mode='r' encoding='UTF-8'>


In [4]:
# Reading in chunks of 5,000 rows as the data set contains over 50,000 rows
chunk_iter = pd.read_csv("crunchbase-investments.csv", chunksize=5000, encoding = "ISO-8859-1")
for chunk in chunk_iter:
    for col in chunk.columns:
        col_missing_value_counts = chunk[col].isnull().sum()
        col_memory_footprint = chunk[col].memory_usage(deep=True)
        if col in missing_val_counts:
            missing_val_counts[col] += col_missing_value_counts
        else:
            missing_val_counts[col] = col_missing_value_counts
        if col in memory_footprint:
            memory_footprint[col] += col_memory_footprint
        else:
            memory_footprint[col] = col_memory_footprint
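
The same per-chunk accumulation can be sketched more compactly by collecting one Series per chunk and summing them at the end. This is a minimal sketch on made-up inline data: the CSV text and the column names `name`, `city` and `amount` are assumptions for illustration only. It also passes `index=False`, because `memory_usage` counts the index by default.

```python
import io

import pandas as pd

# Made-up stand-in for the real file (illustrative assumption)
csv_text = "name,city,amount\nA,NYC,1.0\nB,,2.0\nC,SF,\nD,,4.0\nE,LA,5.0\n"

missing_parts = []
memory_parts = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # One Series per chunk, indexed by column name
    missing_parts.append(chunk.isnull().sum())
    memory_parts.append(chunk.memory_usage(deep=True, index=False))

# Align the per-chunk Series side by side and add them up column by column
missing_total = pd.concat(missing_parts, axis=1).sum(axis=1)
memory_total = pd.concat(memory_parts, axis=1).sum(axis=1)

print(missing_total)
print(memory_total.sum() / 2**20)  # total footprint in MB
```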

In [5]:
missing_val_counts_series = pd.Series(missing_val_counts).sort_values(ascending=True)
missing_val_counts_series

company_permalink             1
company_name                  1
company_country_code          1
company_region                1
investor_permalink            2
investor_name                 2
investor_region               2
funded_year                   3
funded_quarter                3
funded_month                  3
funded_at                     3
funding_round_type            3
company_state_code          492
company_city                533
company_category_code       643
raised_amount_usd          3599
investor_country_code     12001
investor_city             12480
investor_state_code       16809
investor_category_code    50427
dtype: int64

In [6]:
memory_footprint_series = pd.Series(memory_footprint).sort_values(ascending=True)
memory_footprint_series

raised_amount_usd          424408
funded_year                424408
investor_category_code     623872
investor_state_code       2478055
investor_country_code     2648740
investor_city             2886531
company_state_code        3107499
company_country_code      3173624
funded_month              3385032
funded_quarter            3385032
investor_region           3397729
funding_round_type        3412155
company_region            3413033
company_category_code     3422552
company_city              3507374
funded_at                 3543633
company_name              3592774
investor_name             3917114
company_permalink         4059236
investor_permalink        4981996
dtype: int64

In [7]:
total_mem_mb = memory_footprint_series.sum() / 2**20
total_mem_mb

57.015225410461426

All the chunks combined would consume around 57 MB of memory.
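
As a quick sanity check, we can compare this total with the file size on disk. The sketch below only redoes the arithmetic with the two figures quoted above (10.3 MB on disk, about 57 MB in memory) and the 10 MB budget. The variable names are ours.

```python
disk_mb = 10.3                   # file size on disk, as stated earlier
memory_mb = 57.015225410461426   # total footprint computed in the cell above
budget_mb = 10                   # memory budget assumed for this section

ratio = memory_mb / disk_mb
fits_in_budget = memory_mb <= budget_mb

print(round(ratio, 1))   # 5.5, inside the 4 to 6 times range mentioned earlier
print(fits_in_budget)    # False
```

The full data set does not fit in the budget, which is why we keep working in chunks and trim columns.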

Taking a look at the columns, not all of them are useful for analysis. We will drop the columns listed below and retain the rest.

In [8]:
# Drop columns representing URLs or containing way too many missing values
drop_cols = ["investor_permalink", "company_permalink", "investor_category_code"]
useful_cols = chunk.columns.drop(drop_cols)

In [9]:
useful_cols

Index(['company_name', 'company_category_code', 'company_country_code',
       'company_state_code', 'company_region', 'company_city', 'investor_name',
       'investor_country_code', 'investor_state_code', 'investor_region',
       'investor_city', 'funding_round_type', 'funded_at', 'funded_month',
       'funded_quarter', 'funded_year', 'raised_amount_usd'],
      dtype='object')

# Selecting Data Types

In [10]:
chunk_iter = pd.read_csv("crunchbase-investments.csv", chunksize = 5000, encoding = "ISO-8859-1",
                        usecols = useful_cols)

In [11]:
# Checking the datatypes for each column
col_types = {}

for chunk in chunk_iter:
    for col in chunk.columns:
        if col not in col_types:
            col_types[col] = [str(chunk.dtypes[col])]
        else:
            col_types[col].append(str(chunk.dtypes[col]))

In [12]:
uniq_col_types = {}
for col, dtypes in col_types.items():
    uniq_col_types[col] = set(dtypes)
uniq_col_types

{'company_name': {'object'},
 'company_category_code': {'object'},
 'company_country_code': {'object'},
 'company_state_code': {'object'},
 'company_region': {'object'},
 'company_city': {'object'},
 'investor_name': {'object'},
 'investor_country_code': {'float64', 'object'},
 'investor_state_code': {'float64', 'object'},
 'investor_region': {'object'},
 'investor_city': {'float64', 'object'},
 'funding_round_type': {'object'},
 'funded_at': {'object'},
 'funded_month': {'object'},
 'funded_quarter': {'object'},
 'funded_year': {'float64', 'int64'},
 'raised_amount_usd': {'float64'}}
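
The mixed dtypes above are most likely a side effect of reading in chunks: pandas infers types per chunk, so a text column that is entirely missing in one chunk is read as `float64`, and an integer column with a missing value is promoted to `float64`. Here is a minimal sketch on made-up inline data. The CSV text and the explicit `dtype` mapping are illustrative assumptions, not part of the original analysis.

```python
import io

import pandas as pd

# Made-up data: the second chunk has no city at all and one missing year
csv_text = "investor_city,funded_year\nPalo Alto,2012\nBoston,2013\n,2010\n,\n"


def dtypes_per_column(**read_csv_kwargs):
    """Collect the set of dtypes seen for each column across chunks."""
    seen = {}
    for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2, **read_csv_kwargs):
        for col in chunk.columns:
            seen.setdefault(col, set()).add(str(chunk[col].dtype))
    return seen


# Types inferred independently for each chunk
inferred = dtypes_per_column()

# Passing explicit dtypes keeps every chunk consistent
forced = dtypes_per_column(dtype={"investor_city": "object", "funded_year": "float64"})

print(inferred)
print(forced)
```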

In [13]:
chunk

Unnamed: 0,company_name,company_category_code,company_country_code,company_state_code,company_region,company_city,investor_name,investor_country_code,investor_state_code,investor_region,investor_city,funding_round_type,funded_at,funded_month,funded_quarter,funded_year,raised_amount_usd
50000,NuORDER,fashion,USA,CA,Los Angeles,West Hollywood,Mortimer Singer,,,unknown,,series-a,2012-10-01,2012-10,2012-Q4,2012,3060000.0
50001,ChaCha,advertising,USA,IN,Indianapolis,Carmel,Morton Meyerson,,,unknown,,series-b,2007-10-01,2007-10,2007-Q4,2007,12000000.0
50002,Binfire,software,USA,FL,Bocat Raton,Bocat Raton,Moshe Ariel,,,unknown,,angel,2008-04-18,2008-04,2008-Q2,2008,500000.0
50003,Binfire,software,USA,FL,Bocat Raton,Bocat Raton,Moshe Ariel,,,unknown,,angel,2010-01-01,2010-01,2010-Q1,2010,750000.0
50004,Unified Color,software,USA,CA,SF Bay,South San Frnacisco,Mr. Andrew Oung,,,unknown,,angel,2010-01-01,2010-01,2010-Q1,2010,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52865,Garantia Data,enterprise,USA,CA,SF Bay,Santa Clara,Zohar Gilon,,,unknown,,series-a,2012-08-08,2012-08,2012-Q3,2012,3800000.0
52866,DudaMobile,mobile,USA,CA,SF Bay,Palo Alto,Zohar Gilon,,,unknown,,series-c+,2013-04-08,2013-04,2013-Q2,2013,10300000.0
52867,SiteBrains,software,USA,CA,SF Bay,San Francisco,zohar israel,,,unknown,,angel,2010-08-01,2010-08,2010-Q3,2010,350000.0
52868,Comprehend Systems,enterprise,USA,CA,SF Bay,Palo Alto,Zorba Lieberman,,,unknown,,series-a,2013-07-11,2013-07,2013-Q3,2013,8400000.0


First, looking at the date-related columns, some of them are redundant. The `funded_at` column provides the full date, while `funded_month`, `funded_quarter` and `funded_year` can all be derived from it. These 3 columns can thus be dropped, and `funded_at` can be parsed as a date when reading the file with pandas.
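
To illustrate that nothing is lost, here is a minimal sketch that rebuilds the three columns from a parsed `funded_at` column. It uses three dates taken from the rows shown above. The `derived` frame and its string formats are our own illustration of the `YYYY-MM` and `YYYY-Qn` patterns seen in the output.

```python
import pandas as pd

# Three funded_at values from the chunk displayed above
df = pd.DataFrame(
    {"funded_at": pd.to_datetime(["2012-10-01", "2008-04-18", "2010-01-01"])}
)

year = df["funded_at"].dt.year
quarter = df["funded_at"].dt.quarter

derived = pd.DataFrame(
    {
        "funded_month": df["funded_at"].dt.strftime("%Y-%m"),
        "funded_quarter": year.astype(str) + "-Q" + quarter.astype(str),
        "funded_year": year,
    }
)
print(derived)
```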

In [14]:
useful_cols = useful_cols.drop(["funded_month", "funded_quarter", "funded_year"])

In [15]:
useful_cols

Index(['company_name', 'company_category_code', 'company_country_code',
       'company_state_code', 'company_region', 'company_city', 'investor_name',
       'investor_country_code', 'investor_state_code', 'investor_region',
       'investor_city', 'funding_round_type', 'funded_at',
       'raised_amount_usd'],
      dtype='object')

In [16]:
# Looking at the unique value counts of the text columns
text_cols = ["company_name", "company_category_code", "company_country_code", "company_state_code",
             "company_region", "company_city", "investor_name", "investor_country_code",
            "investor_state_code", "investor_region", "investor_city", "funding_round_type"]

val_counts = {}
chunk_iter = pd.read_csv("crunchbase-investments.csv", chunksize = 5000, encoding = "ISO-8859-1",
                        usecols = useful_cols, parse_dates = ["funded_at"])

for chunk in chunk_iter:
    for col in text_cols:
        col_val_count = chunk[col].value_counts()
        if col in val_counts:
            val_counts[col].append(col_val_count)
        else:
            val_counts[col] = [col_val_count]
            
combined_val_counts = {}            
for col in val_counts:
    combined = pd.concat(val_counts[col])
    combined_val_counts[col] = combined.groupby(combined.index).sum()

In [17]:
combined_val_counts

{'company_name': #waywire          5
 0xdata            1
 1-800-DENTIST     2
 1000memories     10
 100Plus           4
                  ..
 yaM Labs          1
 ybuy              4
 zozi             38
 zulily            6
 zuuka!            3
 Name: company_name, Length: 11573, dtype: int64,
 'company_category_code': 2/7/08                 1
 advertising         3200
 analytics           1863
 automotive           164
 biotech             4951
 cleantech           1948
 consulting           233
 design                55
 ecommerce           2168
 education            783
 enterprise          4489
 fashion              368
 finance              931
 games_video         1893
 government            10
 hardware            1537
 health               670
 hospitality          331
 legal                 87
 local                 22
 manufacturing        310
 medical             1315
 messaging            452
 mobile              4067
 music                287
 nanotech             216
 ...

In [18]:
# Dictionary to store the number of unique values each text column has,
# where key: column name, value: number of unique values
unique_val_counts = {}
for col in combined_val_counts:
    unique_val_counts[col] = len(combined_val_counts[col])

In [19]:
unique_val_counts

{'company_name': 11573,
 'company_category_code': 43,
 'company_country_code': 2,
 'company_state_code': 50,
 'company_region': 546,
 'company_city': 1229,
 'investor_name': 10465,
 'investor_country_code': 72,
 'investor_state_code': 50,
 'investor_region': 585,
 'investor_city': 990,
 'funding_round_type': 9}

Looking at the unique value counts above, `funding_round_type`, with only 9 unique values, is an ideal candidate for conversion to the category dtype. Taking a closer look at the `company_country_code` column, it seems that all the companies in this data set are US companies. Notably, there appears to be an incorrect row in which a date appears where a code is expected: `2/7/08` shows up among the `company_category_code` values above, and `company_country_code` has a second unique value. We will leave the row as is for now and drop it in SQLite instead.
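
To see why a column with few unique values benefits from the category dtype, here is a minimal sketch that compares the memory footprint of the same values stored as plain Python strings and as categories. The four round types come from the rows displayed earlier. The repetition factor is an arbitrary assumption to make the column reasonably long.

```python
import pandas as pd

# A long column built from a handful of distinct values
round_types = ["series-a", "series-b", "angel", "series-c+"] * 2500

as_object = pd.Series(round_types, dtype="object")
as_category = as_object.astype("category")

object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

# The category version stores each distinct string once plus small integer codes
print(object_bytes, category_bytes)
print(len(as_category.cat.categories))  # 4
```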

# Loading into SQLite

We are now in good shape to start exploring and analyzing the data. The next step is to load the data set in chunks into a SQLite database so we can query the full data set.

In [20]:
chunk_iter = pd.read_csv("crunchbase-investments.csv", chunksize = 5000, encoding = "ISO-8859-1",
                        usecols = useful_cols, parse_dates = ["funded_at"], dtype={"funding_round_type":"category"})

In [21]:
# Setting up connection to the sqlite database
conn = sqlite3.connect("crunchbase.db")
for chunk in chunk_iter:
    chunk.to_sql("crunchbase_investments", conn, if_exists="append", index=False)
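
Once the table is loaded, the incorrect row mentioned earlier can be removed with a `DELETE` statement. Here is a minimal sketch against an in-memory database, so it does not touch `crunchbase.db`. The last row, its `None` values, and the `LIKE '%/%/%'` filter for date-like category codes are hypothetical stand-ins; the right filter depends on what the malformed row actually contains.

```python
import sqlite3

import pandas as pd

# Three rows from the output above plus one hypothetical malformed row
rows = pd.DataFrame(
    {
        "company_name": ["NuORDER", "ChaCha", "Binfire", "hypothetical-bad-row"],
        "company_category_code": ["fashion", "advertising", "software", "2/7/08"],
        "company_country_code": ["USA", "USA", "USA", None],
        "raised_amount_usd": [3060000.0, 12000000.0, 500000.0, None],
    }
)

conn = sqlite3.connect(":memory:")

# Append the data two rows at a time, mimicking the chunked load
for start in range(0, len(rows), 2):
    part = rows.iloc[start:start + 2]
    part.to_sql("crunchbase_investments", conn, if_exists="append", index=False)

# Assumed filter: drop rows whose category code looks like a date
conn.execute(
    "DELETE FROM crunchbase_investments WHERE company_category_code LIKE '%/%/%'"
)
conn.commit()

remaining = conn.execute("SELECT COUNT(*) FROM crunchbase_investments").fetchone()[0]
# Each entry is (cid, name, type, notnull, default, pk)
column_info = conn.execute("PRAGMA table_info(crunchbase_investments)").fetchall()
conn.close()

print(remaining)  # 3
print([(name, col_type) for _, name, col_type, *_ in column_info])
```

Because the load uses `if_exists="append"`, re-running the loading cell adds every row a second time, so it is worth checking the row count before querying.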