# Analyzing Startup Fundraising Deals from Crunchbase

Every year, thousands of startup companies raise financing from investors. Each time a startup raises money, we refer to the event as a fundraising round. Crunchbase is a website that crowdsources information on the fundraising rounds of many startups. The Crunchbase user community submits, edits, and maintains most of the information in Crunchbase.

In return, Crunchbase makes the data available through a Web application and a fee-based API. Before Crunchbase switched to the paid API model, multiple groups crawled the site and released the data online. Because the information on the startups and their fundraising rounds is always changing, the data set we'll be using isn't completely up to date.

The data set of investments we'll be exploring is current as of October 2013. You can download it from [GitHub](https://github.com/datahoarder/crunchbase-october-2013).

Because the data set contains over 50,000 rows, we will read it into dataframes in 5,000-row chunks so that each chunk consumes much less than 10 megabytes of memory. Working through the chunks, we will compute:

- each chunk's missing value counts
- each chunk's memory footprint
- the total memory footprint of all of the chunks combined

We will then decide which columns we can drop because they aren't useful for analysis.
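
The per-chunk bookkeeping described above could be sketched as follows. This is a minimal sketch that uses a small in-memory CSV as a hypothetical stand-in for `crunchbase-investments.csv`; the sample rows, the `chunksize=4` setting, and the names `chunk_missing` and `chunk_mb` are assumptions made for illustration only.

```python
import io

import pandas as pd

# Hypothetical stand-in for crunchbase-investments.csv: 10 rows, 2 columns.
# Every fourth row has no raised amount.
csv_text = "company_name,raised_amount_usd\n" + "".join(
    f"company_{i},{'' if i % 4 == 0 else i * 1000}\n" for i in range(10)
)

chunk_missing = []  # number of missing values in each chunk
chunk_mb = []       # memory footprint of each chunk, in megabytes

# chunksize=4 is only for illustration; the notebook uses 5000.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    chunk_missing.append(int(chunk.isnull().sum().sum()))
    chunk_mb.append(chunk.memory_usage(deep=True).sum() / 2**20)

# Total footprint of all chunks combined.
total_mb = sum(chunk_mb)

print(chunk_missing)
print(f"{len(chunk_mb)} chunks, {total_mb:.4f} MB in total")
```

Tracking the footprint per chunk, rather than only the grand total, makes it easy to check that no single chunk comes close to the 10 megabyte limit.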

In [3]:
import pandas as pd

chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1')   

In [4]:
# missing value counts calculation

mv_list = []
for chunk in chunk_iter:
    mv_list.append(chunk.isnull().sum())
    
mv_counts = pd.concat(mv_list)
unique_mv_counts = mv_counts.groupby(mv_counts.index).sum()
unique_mv_counts.sort_values()

company_country_code          1
company_name                  1
company_permalink             1
company_region                1
investor_region               2
investor_permalink            2
investor_name                 2
funded_quarter                3
funded_at                     3
funded_month                  3
funded_year                   3
funding_round_type            3
company_state_code          492
company_city                533
company_category_code       643
raised_amount_usd          3599
investor_country_code     12001
investor_city             12480
investor_state_code       16809
investor_category_code    50427
dtype: int64

Any column with more than 90% of its values missing is unlikely to be of much use for our analysis, so we will drop it. The cell below also drops the two permalink columns.
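
To make the 90% rule concrete, the missing counts can be divided by the number of rows. The sketch below reuses a few of the counts printed above. The row total of 52,870 is an assumption: it is not stated in the text, and is inferred here from the 422,960-byte footprint of the 8-byte numeric columns reported later in the notebook. The names `missing_fraction` and `mostly_missing` are illustrative.

```python
import pandas as pd

# A few of the missing-value counts printed above.
missing_counts = pd.Series({
    "company_name": 1,
    "raised_amount_usd": 3599,
    "investor_country_code": 12001,
    "investor_state_code": 16809,
    "investor_category_code": 50427,
})

# Assumption: total row count, inferred as 422,960 bytes / 8 bytes per value.
total_rows = 52_870

# Share of rows with a missing value, per column.
missing_fraction = missing_counts / total_rows

# Columns whose missing share exceeds the 90% threshold.
mostly_missing = missing_fraction[missing_fraction > 0.9].index.tolist()

print(missing_fraction.round(3))
print(mostly_missing)
```

Under that assumed row count, only `investor_category_code` crosses the threshold among the columns shown here.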

In [16]:
drop_cols = ['investor_permalink', 'company_permalink', 'investor_category_code']  # columns to drop
keep_cols = chunk.columns.drop(drop_cols).to_list()  # columns to keep

In [6]:
# memory footprint per column
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1')

counter = 0
series_memory_fp = pd.Series(dtype='float64')
for chunk in chunk_iter:
    if counter == 0:
        series_memory_fp = chunk.memory_usage(deep=True)
    else:
        series_memory_fp += chunk.memory_usage(deep=True)
    counter += 1

series_memory_fp = series_memory_fp.drop('Index')
series_memory_fp    

company_permalink         4057788
company_name              3591326
company_category_code     3421104
company_country_code      3172176
company_state_code        3106051
company_region            3411545
company_city              3505886
investor_permalink        4980548
investor_name             3915666
investor_category_code     622424
investor_country_code     2647292
investor_state_code       2476607
investor_region           3396281
investor_city             2885083
funding_round_type        3410707
funded_at                 3542185
funded_month              3383584
funded_quarter            3383584
funded_year                422960
raised_amount_usd          422960
dtype: int64

In [8]:
print(f'Total memory consumed is: {series_memory_fp.sum() / (1024 * 1024)}')


Total memory consumed is: 56.98753070831299


In [17]:
type(keep_cols)

list

## Data Types

Let's now get familiar with the column types before adding the data into SQLite. We will:

- Identify the types for each column.
- Identify the numeric columns we can represent using more space-efficient types.
- For text columns:
    - Analyze the unique value counts across all of the chunks to see if we can convert them to a numeric type.
    - See if we can clean any text columns and separate them into multiple numeric columns without adding any overhead when querying.
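
One possible way to count unique values across all of the chunks is to take `value_counts()` per chunk and then combine the results, the same pattern used for the missing-value counts earlier. The sketch below runs on a hypothetical six-row sample; the sample values and the `chunksize=2` setting are assumptions for illustration.

```python
import io

import pandas as pd

# Hypothetical sample with a low-cardinality text column.
kinds = ["series-a", "angel", "series-a", "series-b", "angel", "series-a"]
csv_text = "funding_round_type\n" + "".join(f"{kind}\n" for kind in kinds)

# Count values within each chunk separately.
counts_per_chunk = [
    chunk["funding_round_type"].value_counts()
    for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2)
]

# Stack the per-chunk counts, then sum the entries that share a label.
combined = pd.concat(counts_per_chunk)
value_counts = combined.groupby(combined.index).sum()

print(value_counts.sort_values(ascending=False))
print("unique values:", len(value_counts))
```

A column with only a handful of distinct values relative to its row count is a natural candidate for a more compact representation.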

The overall memory the data consumes should stay under 10 megabytes.
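
As a rough illustration of how type changes affect the footprint, the sketch below builds a hypothetical 5,000-row chunk, converts a repetitive text column to `category`, downcasts a whole-number float column to a small integer type, and compares memory before and after. The data is made up, and the downcast assumes the column has no missing values, which would need to be handled first in the real data.

```python
import pandas as pd

# Hypothetical chunk: 5,000 rows with one text and one numeric column.
chunk = pd.DataFrame({
    "funding_round_type": ["series-a", "angel", "series-b", "venture", "other"] * 1000,
    "funded_year": [2009.0, 2010.0, 2011.0, 2012.0, 2013.0] * 1000,
})

before_mb = chunk.memory_usage(deep=True).sum() / 2**20

optimized = chunk.copy()

# Few distinct strings: store integer codes plus a small lookup table.
optimized["funding_round_type"] = optimized["funding_round_type"].astype("category")

# Assumption: no missing values, so the floats convert cleanly to integers
# and can then be downcast to the smallest integer type that fits.
optimized["funded_year"] = pd.to_numeric(
    optimized["funded_year"].astype("int64"), downcast="integer"
)

after_mb = optimized.memory_usage(deep=True).sum() / 2**20

print(f"before: {before_mb:.3f} MB, after: {after_mb:.3f} MB")
print(optimized.dtypes)
```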

In [22]:
chunk_iter = pd.read_csv('crunchbase-investments.csv', chunksize=5000, encoding='ISO-8859-1', usecols=keep_cols)

col_types = {}

for chunk in chunk_iter:
    for col in chunk.columns:
        if col not in col_types:
            col_types[col] = [str(chunk.dtypes[col])]
        else:
            col_types[col].append(str(chunk.dtypes[col]))

uniq_col_types = {col: set(types) for col, types in col_types.items()}
uniq_col_types

{'company_name': {'object'},
 'company_category_code': {'object'},
 'company_country_code': {'object'},
 'company_state_code': {'object'},
 'company_region': {'object'},
 'company_city': {'object'},
 'investor_name': {'object'},
 'investor_country_code': {'float64', 'object'},
 'investor_state_code': {'float64', 'object'},
 'investor_region': {'object'},
 'investor_city': {'float64', 'object'},
 'funding_round_type': {'object'},
 'funded_at': {'object'},
 'funded_month': {'object'},
 'funded_quarter': {'object'},
 'funded_year': {'float64', 'int64'},
 'raised_amount_usd': {'float64'}}
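
Several columns above report two types. A likely explanation, offered here as an assumption rather than something verified against the file, is that pandas infers types chunk by chunk: a chunk where a text column is entirely empty is read as `float64` (all `NaN`), and an integer column becomes `float64` in any chunk that contains a missing value. The sketch below reproduces the first effect on hypothetical data and shows that passing an explicit `dtype` to `read_csv` keeps the type consistent. The helper `dtypes_per_column` is illustrative.

```python
import io

import pandas as pd

# Hypothetical data: the second chunk has no city values at all.
csv_text = "investor_name,investor_city\na,New York\nb,\nc,\nd,\n"


def dtypes_per_column(**read_kwargs):
    """Collect the set of dtypes each column takes across chunks."""
    seen = {}
    for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2, **read_kwargs):
        for col, dtype in chunk.dtypes.items():
            seen.setdefault(col, set()).add(str(dtype))
    return seen


# Let pandas infer the type of each chunk independently.
inferred = dtypes_per_column()

# Declare the column as text up front.
explicit = dtypes_per_column(dtype={"investor_city": str})

print(inferred["investor_city"])
print(explicit["investor_city"])
```

With inference, the all-empty chunk contributes `float64` alongside the text type; with the explicit `dtype`, every chunk reports the same type.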