
# COGS 108 - Data Checkpoint

## Authors

Omar Abbasi: Project Administration, Conceptualization, Formal Analysis, Visualization, Writing – Original Draft  
Zahir Ali: Conceptualization, Visualization, Software, Writing – Reviewing/Edits  
Adam Hamadene: Research, Formal Analysis  
Mostafa Darwish: Visualization, Writing – Original Draft, Data Curation  
Yasir Rizvi: Data Cleaning 


## Research Question

How do funding size, industry sector, and geographic location influence both the likelihood and timing of startup failure versus acquisition?



## Background and Prior Work

Human ingenuity and accumulated technology are intersecting at an accelerating pace, and much of the resulting shift in the corporate landscape can be traced to one niche: the prevalence and growth of startups. As technical knowledge and tools develop, founders keep finding new ways to leverage and expand upon current technology, including artificial intelligence. Well-known examples include Uber, Robinhood, Stripe, Databricks, Canva, and Slack. Ideas for growth and innovation stem from all fields and are catalyzed by sources as varied as corporate America, educational institutions, and small communities across the country. However, the startups that are widely known today are the ones that crossed the gap between idea and impact; the majority fall short of this hurdle and end up failing.

In this report, we examine a set of variables closely tied to startups and their growth (industry sector, funding, location, and size) and measure how strongly each correlates with a startup's ability to come to fruition. Because the growth of startups is relatively recent and tied to modern technological advances, research into the causes of their successes and failures is still scarce. For example, startup accelerators such as Y Combinator date back only to 2005, so the accelerator-backed funding path that many successful startups follow is very new. We are interested in the underlying details of startup successes and failures in the United States because the topic offers a chance to produce findings in a modern niche that has not received the level of academic study that other corporate fields in America have.

## Hypothesis


Startups with larger funding sizes, operating in high-growth industry sectors, and located within established entrepreneurial ecosystems are less likely to fail and tend to experience longer survival times before either failure or acquisition, whereas startups with smaller funding, in low-growth sectors, or in emerging regions face higher failure risks and shorter time-to-event durations.


## Data

### Data overview

- Dataset #3
  - Dataset Name: Startup Failures
  - Link to the dataset: https://www.kaggle.com/datasets/dagloxkankwanda/startup-failures
  - Number of observations: 815
  - Number of variables: 20
  - Description of the variables most relevant to this project:
    - Name: Name of the startup
    - Years of Operation: Period the startup was active (e.g., 2010–2023)
    - What They Did: Description of the startup's product or service
    - How Much They Raised: Total funding amount (in USD, usually in millions)
    - Why They Failed: Main reason for failure
    - Takeaway: Key lesson or insight from the failure
  - Descriptions of any shortcomings this dataset has with respect to the project:
    - The dataset omits several important metrics, such as employee count, location details, acquisition status, and time-to-failure in months, all of which could strengthen temporal or regional analyses.
    - The "Why They Failed" column condenses complex, multi-faceted causes into short summaries (e.g., "competition," "cash flow issues"), which makes it difficult to capture nuanced or overlapping factors that contribute to business failure.
    - Because the dataset focuses on failures, it provides limited data on successful startups or acquisitions, which restricts direct comparisons between what leads to success and what leads to failure.

Although the Startup Outcomes & Lifespan Dataset is our guiding resource, we will use it together with other datasets that provide quantitative detail on acquisitions, funding sources, and geographic distribution. This dataset covers the basics: which startups failed, how much capital they raised, and how long they operated. Supplementary datasets can extend it with acquisition statistics, startup locations, and investor data. Combining it with general investment and geographic data should let us move from a descriptive account of failure to an analytical one that identifies the factors contributing to both failure and acquisition, since this dataset also offers some insight into startups involved in acquisitions, in addition to those that failed, within the global startup community.
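One possible way to line the datasets up against each other is to join them on a normalized company name. The sketch below is only an illustration under stated assumptions: the helper `name_key`, the `key` column, and the toy rows are hypothetical (they are not real records from either source), and name matching of this kind will miss companies whose names differ by more than punctuation or case.

```python
import re
import pandas as pd

def name_key(name):
    """Hypothetical helper: lowercase a name and drop non-alphanumerics."""
    if not isinstance(name, str):
        return None
    return re.sub(r"[^a-z0-9]", "", name.lower())

# Toy rows standing in for the failure dataset (made-up companies)
failures = pd.DataFrame({
    "Name": ["Acme Pay", "Beta Foods"],
    "How Much They Raised": ["$11M", "$2M"],
})
# Toy rows standing in for a Crunchbase-style dataset (made-up companies)
crunchbase = pd.DataFrame({
    "name": ["acme-pay", "Gamma Labs"],
    "country_code": ["USA", "USA"],
    "status": ["acquired", "operating"],
})

failures["key"] = failures["Name"].map(name_key)
crunchbase["key"] = crunchbase["name"].map(name_key)

# Drop rows without a usable key so missing names cannot match each other
failures = failures.dropna(subset=["key"])
crunchbase = crunchbase.dropna(subset=["key"])

# indicator=True adds a `_merge` column showing which rows found a partner
merged = failures.merge(crunchbase, on="key", how="left", indicator=True)
match_rate = (merged["_merge"] == "both").mean()
print(merged[["Name", "country_code", "status", "_merge"]])
print(f"Share of failure records matched: {match_rate:.0%}")
```

Checking the match rate before any analysis gives a quick sense of whether the datasets overlap enough for location or acquisition fields to be borrowed from one into the other.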



In [6]:
# Run this code every time when you're actively developing modules in .py files.  It's not needed if you aren't making modules
#
## this code is necessary for making sure that any modules we load are updated here 
## when their source code .py files are modified

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [8]:
# Setup code -- this only needs to be run once after cloning the repo!
# this code downloads the data from its source to the `data/00-raw/` directory
# if the data hasn't updated you don't need to do this again!

# if you don't already have these packages (you should!) uncomment this line
# %pip install requests tqdm

import sys
sys.path.append('./modules') # this tells python where to look for modules to import

import get_data # this is where we get the function we need to download data

# replace the urls and filenames in this list with your actual datafiles
# yes you can use Google drive share links or whatever
# format is a list of dictionaries; 
# each dict has keys of 
#   'url' where the resource is located
#   'filename' for the local filename where it will be stored 
datafiles = [
    { 'url': 'https://www.kaggle.com/datasets/arindam235/startup-investments-crunchbase', 'filename':'investments_VC.csv'},
]

get_data.get_raw(datafiles,destination_directory='data/00-raw/')

ModuleNotFoundError: No module named 'get_data'

### Crunchbase Startup Investments Data (to 2015) 

This dataset covers around 50,000 startups and has 39 variables, with details on each startup's funding size, industry, location, and company status. After cleaning the dataset (removing duplicates, handling missing values, and converting funding amounts to numeric values), we made visuals showing the top countries by total funding, the distribution of funding amounts, and total VC funding over time. The results showed that the US dominates global funding, that most startups raise small amounts, and that investment activity peaked around 2010-2012.

Unfortunately, the dataset ends before 2015, which limits its relevance to modern startup trends to an extent. Additionally, Crunchbase data is partly self-reported or crowdsourced, which makes it prone to missing, biased, or inaccurate entries (especially outside of the US and the tech sector).

   


In [None]:
!unzip data.zip


unzip:  cannot find or open data.zip, data.zip.zip or data.zip.ZIP.


In [19]:
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv("investments_VC.csv", encoding="latin1", low_memory=False)


In [None]:

df.drop_duplicates(inplace=True)


In [None]:

# Keep only columns where fewer than 40% of the values are missing
null_threshold = 0.4
df = df.loc[:, df.isna().mean() < null_threshold]

In [None]:
# Label missing location fields explicitly, then zero-fill every other remaining gap
df.fillna({"country_code":"Unknown", "state_code":"Unknown", "city":"Unknown"}, inplace=True)
df.fillna(0, inplace=True)

In [None]:
if "funding_total_usd" in df.columns:
    df["funding_total_usd"] = (
        df["funding_total_usd"]
        .astype(str)
        .str.replace(r"[^\d.]", "", regex=True)
        .replace("", "0")
        .astype(float)
    )


In [None]:
df.to_csv("investments_clean_simple.csv", index=False)


In [None]:
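import re  # required for the column-name search below

# Find the total-funding column by pattern rather than by exact header
amt = next((c for c in df.columns if re.search(r'fund.*total.*usd', c, re.I)), None)
if amt is None: raise KeyError("Funding column not found.")
# Strip non-numeric characters, then coerce; unparseable values become NaN
df[amt] = pd.to_numeric(df[amt].astype(str).str.replace(r'[^\d.]','',regex=True), errors='coerce')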

### Dataset #2
- Dataset Name: Big Startup Success/Fail Dataset from Crunchbase
- Link to the dataset: https://www.kaggle.com/datasets/yanmaksi/big-startup-secsees-fail-dataset-from-crunchbase
- Number of observations: Computed in the code cell below (after loading)
- Number of variables: Computed in the code cell below (after loading)
- Description of the variables most relevant to this project:
  - Funding-related: `funding_total_usd`, `funding_rounds`, `first_funding_at`, `last_funding_at`
  - Outcome/status fields: `status` (operating, acquired, closed, ipo)
  - Company profile: `name`, `category_list`, `country_code`, `state_code`, `region`, `city`
  - Timing variables: `founded_at` and milestone event dates for computing time-to-failure or time-to-acquisition
- Descriptions of any shortcomings this dataset has with respect to the project:
  - Potential label noise: status values may be outdated; "operating" does not guarantee long-term success; "closed" labels may lag behind real closure dates
  - Missing or inconsistent date fields that complicate survival/time-to-event analysis
  - Possible duplicates (companies appearing multiple times under variant naming)
  - Inconsistent or messy industry taxonomy in `category_list`
  - Survivorship and reporting bias inherent to Crunchbase data (overrepresentation of certain regions and funded companies)
- We will cross-examine this dataset against all the other datasets we are using, since our research question is multi-faceted.
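As a rough illustration of how the timing variables listed above could feed a time-to-event analysis, the sketch below counts observations and variables and derives a duration column from `founded_at` and `last_funding_at`. The function name `add_years_to_last_funding`, the derived column, and the toy rows are assumptions made for this example; using the last funding date as a stand-in for a company's last known activity is also an assumption, not something the dataset guarantees.

```python
import pandas as pd

def add_years_to_last_funding(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with dates parsed and a duration column (in years) added."""
    out = df.copy()
    for col in ("founded_at", "last_funding_at"):
        # Unparseable dates become NaT instead of raising an error
        out[col] = pd.to_datetime(out[col], errors="coerce")
    delta = out["last_funding_at"] - out["founded_at"]
    out["years_to_last_funding"] = delta.dt.days / 365.25
    return out

# Toy rows that mimic the column names described above (not real records)
demo = pd.DataFrame({
    "name": ["A", "B", "C"],
    "status": ["closed", "acquired", "operating"],
    "founded_at": ["2010-01-01", "2012-06-15", "not a date"],
    "last_funding_at": ["2012-01-01", "2013-06-15", "2014-01-01"],
})

n_obs, n_vars = demo.shape
status_counts = demo["status"].value_counts()
result = add_years_to_last_funding(demo)

print(f"{n_obs} observations, {n_vars} variables")
print(status_counts)
print(result[["name", "status", "years_to_last_funding"]])
```

Rows whose dates cannot be parsed end up with a missing duration, which makes the extent of the date-quality problem easy to count before deciding how to handle it.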


In [20]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE


### Dataset #3 - Startup Outcomes and Lifespan Dataset

These datasets document startup outcomes across six industries (Finance, Food, Healthcare, Information Technology, Manufacturing, and Retail) and form the basis for our analysis of how funding size, industry sector, and location influence a startup's likelihood and timing of failure versus acquisition. Each record represents one failed startup and includes its "Name," "Years of Operation," "What They Did," "How Much They Raised" (total investor funding in millions of U.S. dollars, e.g., $500M), "Why They Failed" (a textual description of the main cause), and "Takeaway" (a concise lesson summarizing the insight). These attributes let us explore patterns between financial backing, market type, and survival duration, and quantify and compare startup lifespan across sectors.

Although these data provide valuable cross-industry insight, they are limited because they focus exclusively on failed startups sourced from CB Insights' "Startup Failure Post-Mortem," which tends to highlight high-profile companies. This creates a selection bias that primarily captures patterns among well-funded ventures rather than the average startup. The data also lack quantitative details, such as geographic coordinates or acquisition outcomes, that a complete survival analysis would need.

The data was already tidy upon inspection, with every variable in its own column and each observation representing a single startup. To prepare it, we fixed spelling and formatting errors, combined the six core sector files (Finance, Food, Healthcare, Information, Manufacturing, and Retail) for a total of 409 unique observations, and excluded remaining entries that lacked sufficient information or belonged to industries with too few startups to extract meaningful insights.

A check for null values showed minimal missingness, appearing randomly in columns like "Takeaway" and "Why They Failed." We also checked very high funding outliers ($1B or more) in the "How Much They Raised" column and kept them, because they were determined to be legitimate values representing large, real-world funded startups. Finally, the nearly clean dataset required only minor formatting adjustments, like adding underscores or removing trailing whitespace. No rows were dropped because missingness was so scarce, but we still ran `.dropna()` and `.drop_duplicates()` in case any null values or duplicates were present.
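Because funding is stored as text such as `$500M` and the operating period as text such as `2012-2023` or `6 (2015-2021)`, both need to be converted to numbers before lifespan or funding size can be compared. The following is a minimal sketch of one way to do that; the helper `parse_funding_usd`, the derived column names, and the handling of `K` and `B` suffixes are assumptions for illustration, not the cleaning code actually used for this dataset.

```python
import re
import pandas as pd

_MULTIPLIERS = {"K": 1e3, "M": 1e6, "B": 1e9}

def parse_funding_usd(text):
    """Convert strings like '$655M' to a dollar amount; NaN if not parseable."""
    if not isinstance(text, str):
        return float("nan")
    m = re.fullmatch(r"\s*\$?\s*([\d.,]+)\s*([KMB])?\s*", text, flags=re.I)
    if not m:
        return float("nan")
    try:
        value = float(m.group(1).replace(",", ""))
    except ValueError:
        return float("nan")
    suffix = (m.group(2) or "").upper()
    return value * _MULTIPLIERS.get(suffix, 1.0)

# Toy frame using the two text formats that appear in the sector files
demo = pd.DataFrame({
    "how_much_they_raised": ["$655M", "$1.5M", "$0.5M", None],
    "years_of_operation": ["2012-2023", "2 (2010-2012)", "6 (2015-2021)", "unknown"],
})

demo["raised_usd"] = demo["how_much_they_raised"].map(parse_funding_usd)

# Pull the start and end year out of either period format
years = demo["years_of_operation"].str.extract(r"(\d{4})\s*-\s*(\d{4})").astype(float)
demo["start_year"] = years[0]
demo["end_year"] = years[1]
demo["lifespan_years"] = demo["end_year"] - demo["start_year"]

print(demo[["raised_usd", "start_year", "end_year", "lifespan_years"]])
```

Values that do not fit either pattern are left as missing rather than guessed, so they can be reviewed by hand.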



In [30]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE

import pandas as pd 

finance_df = pd.read_csv("data/02-processed/cleaned_dataset_3/Startup_Failure_Finance_clean.csv")
food_df = pd.read_csv("data/02-processed/cleaned_dataset_3/Startup_Failure_Food_clean.csv")
healthcare_df = pd.read_csv("data/02-processed/cleaned_dataset_3/Startup_Failure_Healthcare_Clean.csv")
info_df = pd.read_csv("data/02-processed/cleaned_dataset_3/Startup_Failure_Information_Clean.csv")
manufactures_df = pd.read_csv("data/02-processed/cleaned_dataset_3/Startup_Failure_Manufactures_Clean.csv")
retail_df = pd.read_csv("data/02-processed/cleaned_dataset_3/Startup_Failure_Retail_Clean.csv")
all_df = pd.read_csv("data/02-processed/cleaned_dataset_3/Startup_Failures_clean.csv")


In [31]:
from IPython.display import display

print("Finance Dataset:")
display(finance_df.head())

print("\nFood Dataset:")
display(food_df.head())

print("\nHealthcare Dataset:")
display(healthcare_df.head())

print("\nInformation Dataset:")
display(info_df.head())

print("\nManufactures Dataset:")
display(manufactures_df.head())

print("\nRetail Dataset:")
display(retail_df.head())

print("\nAll Startups Combined Dataset:")
display(all_df.head())

Finance Dataset:


Unnamed: 0,Name,Years of Operation,What They Did,How Much They Raised,Why They Failed,Takeaway
0,Avant,2012-2023,Online personal loans,$655M,Lost to LendingClub and high defaults,Lending needs risk balance
1,Bitpass,2002-2008,Micropayments platform,$2M,Lost to PayPal and low adoption,Micropayments need mass use
2,Cake Financial,2006-2011,Portfolio tracking tool,$3M,Lost to Mint and sold to TradeKing,Finance tools need scale
3,Circle,2013-2023,Crypto payments and stablecoin,$500M,Lost to Coinbase and market shifts,Crypto needs stability
4,Clarity Money,2016-2022,Personal finance app,$11M,Lost to Mint/Acorns and sold to Goldman,Finance apps need edge



Food Dataset:


Unnamed: 0,name,what_they_did,why_they_failed,takeaway,how_much_they_raised,years_of_operation
0,Cafe X,Robotic coffee kiosks,Closed 2021; low adoption; lost to Starbucks,Humans trump robots,$15M,6 (2015-2021)
1,Caviar,Premium food delivery,Sold 2020; couldn't scale; lost to DoorDash,Premium loses to scale,$90M,8 (2012-2020)
2,Chef'd,Meal kit delivery,Closed 2019; high costs; lost to Blue Apron,Costs cook meal kits,$35M,5 (2014-2019)
3,ChowNow,Restaurant ordering platform,Faded 2023; lost to DoorDash,Middlemen get squeezed,$64M,12 (2011-2023)
4,Clover,Plant-based fast food chain,Closed 2020; niche; lost to McDonald's,Niche eats need taste,$20M,8 (2012-2020)



Healthcare Dataset:


Unnamed: 0,name,years_of_operation,what_they_did,how_much_they_raised,why_they_failed,takeaway
0,Aira Health,2015-2019,Personalized asthma/allergy app,$12M,Small user base and cash shortage,Niche apps need big audiences
1,Amino,2013-2021,Doctor search and cost estimation,$45M,Lost to Zocdoc/GoodRx and slow adoption,Narrow focus beats broad
2,Arivale,2015-2019,Personalized health coaching,$50M,High costs and low demand,Premium needs mass market
3,Augmedix,2012-2024,Remote medical scribes,$150M,Lost to software rivals and acquired,Flexibility beats rigidity
4,Avizia,2014-2018,Telemedicine for hospitals,$32M,Outpaced by bigger rivals and acquired,Niche needs a moat



Information Dataset:


Unnamed: 0,name,years_of_operation,what_they_did,how_much_they_raised,why_they_failed,takeaway
0,Airy Labs,2 (2010-2012),Educational mobile games for kids,$1.5M,Shut down in 2012 after chaotic sprint; too ma...,Focus beats frenzy
1,Ask Jeeves,11 (1996-2007),Early search engine with butler mascot,$20M,Faded by 2007; lost to Google's algorithm and ...,Innovation isn't enough
2,Bebo,14 (2005-2019),Social networking site popular in UK,$12.8M,Shut down in 2019; lost to Facebook; AOL misma...,Network effects can crush
3,Burbn,2 (2010-2012),Check-in app with photo-sharing,$0.5M,Closed in 2012 but pivoted to Instagram; too c...,Pivots can save
4,Canvas,6 (2011-2017),Collaborative document editing platform,$9M,Shut down in 2017; lost to Google Docs; Dropbo...,Stand out or drown



Manufactures Dataset:


Unnamed: 0,name,years_of_operation,what_they_did,how_much_they_raised,why_they_failed,takeaway
0,Airware,2011-2018,Drone hardware/software for industry,$70M,Lost to DJI and high costs,Drones need simplicity
1,Anki,2010-2019,AI-powered toy robots,$200M,High costs and competition from Lego/Sphero,Consumer hardware needs mass pricing
2,Aptera Motors,2005-2011,Three-wheeled electric vehicles,$40M,Lost to Tesla and quirky design,EVs need mainstream appeal
3,Aria Insights,2008-2019,Tethered industrial drones,$39M,Small market and lost to DJI/Skydio,Hardware niches need big adopters
4,August Home,2012-2017,Smart locks and doorbells,$73M,Lost to Ring/Nest and acquired,Smart home needs ecosystem power



Retail Dataset:


Unnamed: 0,name,years_of_operation,what_they_did,how_much_they_raised,why_they_failed,takeaway
0,99dresses,3 (2010-2013),Fashion swapping app,$0.5M,Shut down 2013; low retention; funding fell th...,Retention is king
1,Ahalife,7 (2010-2017),Curated luxury goods marketplace,$20M,Closed 2017; high marketing costs; lost to Amazon,Niche doesn't defend
2,AllRomance,10 (2006-2016),E-book retailer for romance novels,$1M,Closed 2016; financial losses; lost to Kindle,Adapt or die
3,Auctionata,6 (2012-2018),Online auction house for art and luxury,$95M,Shut down 2018; high costs; lost to eBay; valu...,Trust and economics matter
4,Augury Books,5 (2012-2017),Indie e-commerce bookstore for poetry,$0.5M,Closed 2017; couldn't scale; lost to Amazon,Niche retail bleeds



All Startups Combined Dataset:


Unnamed: 0,Name,Sector,Years,Period,Start_Year,End_Year
0,99dresses,Retail Trade,3,(2010-2013),2010,2013
1,Ahalife,Retail Trade,7,(2010-2017),2010,2017
2,Airy Labs,Information,2,(2010-2012),2010,2012
3,AllRomance,Retail Trade,10,(2006-2016),2006,2016
4,Ampush,Professional Scientific and Technical Services,13,(2010-2023),2010,2023
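The previews above show that the sector files do not share one schema: the Finance file uses title-case headers such as `Years of Operation`, the others use snake_case, the column order varies, and each file carries an `Unnamed: 0` index column. One way to stack them into a single table is sketched below. The helpers `to_snake_case` and `tidy_sector_frame` and the added `sector` column are hypothetical names for this example, and the two toy frames reuse only the first preview row of the Finance and Food files.

```python
import pandas as pd

def to_snake_case(name: str) -> str:
    """'Years of Operation' -> 'years_of_operation'."""
    return "_".join(name.strip().lower().split())

def tidy_sector_frame(df: pd.DataFrame, sector: str) -> pd.DataFrame:
    """Drop the saved index column, harmonize headers, and tag the sector."""
    out = df.loc[:, ~df.columns.str.startswith("Unnamed")].copy()
    out.columns = [to_snake_case(c) for c in out.columns]
    out["sector"] = sector
    return out

# First preview rows of two sector files, with their differing headers
finance_demo = pd.DataFrame({
    "Unnamed: 0": [0],
    "Name": ["Avant"],
    "Years of Operation": ["2012-2023"],
    "How Much They Raised": ["$655M"],
})
food_demo = pd.DataFrame({
    "Unnamed: 0": [0],
    "name": ["Cafe X"],
    "how_much_they_raised": ["$15M"],
    "years_of_operation": ["6 (2015-2021)"],
})

# concat aligns on column names, so differing column order is not a problem
combined = pd.concat(
    [tidy_sector_frame(finance_demo, "Finance"), tidy_sector_frame(food_demo, "Food")],
    ignore_index=True,
)
print(combined)
```

Tagging each row with its sector before stacking keeps the industry information that is otherwise encoded only in the file name.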


## Ethics

- **Bias & fairness:** Datasets may favor more popular startups or sectors and therefore be biased toward well-known companies. Generalization is also a concern: some areas have a much higher density of startups than others, which could misrepresent the overall picture.
- **Generalization limits:** As mentioned above, Kaggle datasets in particular may overrepresent highly successful startups because their data is easier to access. As a result, findings may not generalize to non-tech or smaller companies. We intend to avoid general claims and instead make specific statements that are contextualized by environment maturity.
- **Data sensitivity:** Although the data gathered is public, names, emails, or other specific data points could be used to re-identify or triangulate an individual or a startup.
- **Non-Consensual Use of Company Information:** Despite the data being public, startups in the dataset did not consent to be analyzed or used for prediction exercises. We will be using aggregated analysis and not single out any companies.
- **Potential Misrepresentation Due to Inaccurate or Incomplete Data:** Startup databases are often incomplete, outdated, or wrong because the data is crowdsourced. We will treat the data as approximate and emphasize uncertainty rather than presenting results as definitive truth.
- **Responsibility to Prevent Harmful Use of Results:** If someone misuses the findings it may influence funding or hiring decisions, or perceptions of certain industries/regions. We will explicitly state that the work should not be used for investment decisions.

## Team Expectations 

For communication, we will text one another over iMessage, which we consider the most efficient and easiest way to reach each other. All members have agreed to respond, with either a message or a reaction, within 3 hours of a text being sent. We will meet twice per week: every Monday morning we will reserve a study space in Geisel Library to meet and provide updates, and every Friday afternoon we will hold a Zoom meeting to consolidate the progress assigned at Monday's meeting.

In regard to tone, we all expect respectful interactions. Even when we disagree, the person who disagrees should explain their reasoning and propose an alternative that they believe is better. We will be concise and to the point while still respecting one another and everyone's ideas.

We will make group decisions by voting, especially for disagreements and changes to our original plan. The project administrator will be in charge of calling the vote, and we will go with the majority, which is always possible because we have 5 people. We will not accept abstentions.

Each person has a specialized role; however, because the roles overlap, we plan to share many of the responsibilities. As a team, we will help each other out whenever possible, especially if one person is struggling with a specific task. We assign roles and tasks based on members' skill sets, which we have already discussed in detail.

We recognize that struggles are inevitable, as we are all busy. Our guideline is that whenever someone is falling behind, they should say so as early as possible, with no hassle or judgment. We would rather know what to fix early in the process than face a last-minute lack of execution. Struggles with specific tasks should be raised immediately, and we will set egos aside to help regardless of role or assigned tasks, in order to prioritize the team as a whole.


## Project Timeline Proposal

### Week 3
- **Monday:** Meeting to brainstorm project topics, confirm individual dataset search responsibilities and initiate the plan for our topic.
- **Friday:** Zoom meeting to vote on and finalize topic.
- **Sunday:** Zoom meeting to discuss roles then consolidate and review the datasets we’ve individually found.

### Week 4
- **Monday:** Meet in Geisel to consolidate datasets and assign roles for our project proposal. Begin working on data cleaning plan, early transformations, and outline visualization goals.
- **Wednesday:** Proofread and finalize proposal.


### Week 5
- **Monday:** Begin data cleaning and preprocessing our datasets.
- **Wednesday:** Continue data cleaning and finalize our structured and processed dataset; share cleaned files with each other.

### Week 6
- **Sunday:** Zoom meeting to compare our processed datasets and make sure everything is consistent.
- **Monday:** Discuss trends and finalize consensus on dataset selection and structure.
- **Wednesday:** Complete initial EDA preparation, finalize plan for visualization types.

### Week 7
- **Monday:** Start building visualizations, assign figure responsibilities to group members.
- **Friday:** Review meeting to make sure visualizations are progressing and discuss results and narrative.

### Week 8
- **Monday:** Compile all visualizations and ensure consistency.
- **Wednesday:** Polish everything, ensure code runs cleanly.

### Week 9
- **Monday:** Final review session, proofread notebook text, verify rubric requirements, and finalize project. Begin preparing video.
- **Wednesday:** Submit final project and recording; complete team evaluation.
