# Data Understanding: Indian Startup Funding Dataset

This notebook focuses on understanding the structure, scope, and quality of the dataset
before performing any statistical analysis.

Nothing is cleaned, imputed, or removed at this stage; the goal is only to observe what the data contains.


In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("../data/startup_funding.csv")

In [3]:
df.head()

|   | Sr No | Date dd/mm/yyyy | Startup Name | Industry Vertical | SubVertical | City Location | Investors Name | InvestmentnType | Amount in USD | Remarks |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 09/01/2020 | BYJU’S | E-Tech | E-learning | Bengaluru | Tiger Global Management | Private Equity Round | 200000000 | |
| 1 | 2 | 13/01/2020 | Shuttl | Transportation | App based shuttle service | Gurgaon | Susquehanna Growth Equity | Series C | 8048394 | |
| 2 | 3 | 09/01/2020 | Mamaearth | E-commerce | Retailer of baby and toddler products | Bengaluru | Sequoia Capital India | Series B | 18358860 | |
| 3 | 4 | 02/01/2020 | https://www.wealthbucket.in/ | FinTech | Online Investment | New Delhi | Vinod Khatumal | Pre-series A | 3000000 | |
| 4 | 5 | 02/01/2020 | Fashor | Fashion and Apparel | Embroiled Clothes For Women | Mumbai | Sprout Venture Partners | Seed Round | 1800000 | |


In [4]:
df.shape

(3044, 10)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3044 entries, 0 to 3043
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Sr No              3044 non-null   int64 
 1   Date dd/mm/yyyy    3044 non-null   object
 2   Startup Name       3044 non-null   object
 3   Industry Vertical  2873 non-null   object
 4   SubVertical        2108 non-null   object
 5   City  Location     2864 non-null   object
 6   Investors Name     3020 non-null   object
 7   InvestmentnType    3040 non-null   object
 8   Amount in USD      2084 non-null   object
 9   Remarks            419 non-null    object
dtypes: int64(1), object(9)
memory usage: 237.9+ KB
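
The `info()` output reports `Amount in USD` as `object` rather than a numeric dtype, which suggests that at least some entries are not plain numbers. A minimal sketch of how those entries could be listed without modifying the data is shown here. The helper name `non_numeric_values` and the sample values are hypothetical and exist only for illustration; in the notebook the helper would be called on `df["Amount in USD"]`.

```python
import pandas as pd

def non_numeric_values(series: pd.Series) -> pd.Series:
    """Return the non-null entries that cannot be parsed as numbers."""
    present = series.dropna()
    # Drop thousands separators and surrounding whitespace before parsing.
    text = present.astype(str).str.replace(",", "", regex=False).str.strip()
    parsed = pd.to_numeric(text, errors="coerce")
    return present[parsed.isna()]

# Made-up sample values, not taken from the dataset.
sample = pd.Series(
    ["200000000", "8,048,394", "undisclosed", None, "1800000"],
    name="Amount in USD",
)
bad = non_numeric_values(sample)
print(bad.tolist())
```

The helper only reads the column, so it stays within the scope of this notebook: it shows what would need cleaning without performing the cleaning.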


## Column Overview

The cells below list the column names and count the missing values in each column.
Both views help frame later analysis and flag potential data quality issues.

In [6]:
df.columns

Index(['Sr No', 'Date dd/mm/yyyy', 'Startup Name', 'Industry Vertical',
       'SubVertical', 'City  Location', 'Investors Name', 'InvestmentnType',
       'Amount in USD', 'Remarks'],
      dtype='object')
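
The column list contains `'City  Location'` with two spaces between the words, which is easy to miss and will raise a `KeyError` if the name is typed with a single space. The sketch below is one possible way to flag such names; the helper `irregular_whitespace` is a hypothetical name, and the column labels are copied from the output above.

```python
import pandas as pd

# Column names copied from the df.columns output.
columns = pd.Index([
    "Sr No", "Date dd/mm/yyyy", "Startup Name", "Industry Vertical",
    "SubVertical", "City  Location", "Investors Name", "InvestmentnType",
    "Amount in USD", "Remarks",
])

def irregular_whitespace(names):
    """Return names with leading, trailing, or repeated whitespace."""
    return [name for name in names if name != " ".join(name.split())]

flagged = irregular_whitespace(columns)
print(flagged)  # ['City  Location']
```

This check does not catch spelling issues such as `InvestmentnType`; those still need a manual read of the list.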

In [7]:
df.isnull().sum().sort_values(ascending=False)

Remarks              2625
Amount in USD         960
SubVertical           936
City  Location        180
Industry Vertical     171
Investors Name         24
InvestmentnType         4
Sr No                   0
Date dd/mm/yyyy         0
Startup Name            0
dtype: int64

### Initial Observations on Missing Data

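Seven of the ten columns contain missing values, which is expected in real-world datasets.
`Remarks` is the sparsest column (2,625 of 3,044 rows missing, about 86%), followed by `Amount in USD` (960, about 32%) and `SubVertical` (936, about 31%).
`City  Location` and `Industry Vertical` are each missing in roughly 6% of rows, while `Investors Name` and `InvestmentnType` are nearly complete.
At this stage, no imputation or removal is performed.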
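
Raw counts are easier to compare when shown next to the share of rows they represent. The following is a small sketch of such a summary; `missing_summary` is a hypothetical helper and the three-column frame is made-up data, standing in for `df`.

```python
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Missing-value count and percentage per column, largest first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)

# Made-up frame for illustration only.
toy = pd.DataFrame({
    "Startup Name": ["A", "B", "C", "D"],
    "Amount in USD": ["100", None, "300", None],
    "Remarks": [None, None, None, "note"],
})
summary = missing_summary(toy)
print(summary)
```

In the notebook, `missing_summary(df)` would produce the same ordering as the cell above, with the percentage added as a second column.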


In [8]:
df.duplicated().sum()

np.int64(0)
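
By default `df.duplicated()` compares every column. If `Sr No` is a unique serial number, as its name and the first rows suggest, no two rows can match in full, so a result of 0 does not rule out repeated funding records. A hedged sketch of a stricter check, on made-up data, compares rows on everything except the serial number:

```python
import pandas as pd

# Made-up rows: the first and third differ only in "Sr No".
toy = pd.DataFrame({
    "Sr No": [1, 2, 3],
    "Startup Name": ["Alpha", "Beta", "Alpha"],
    "Amount in USD": ["100", "200", "100"],
})

# Full-row comparison: the unique serial number hides the repeat.
full_row_dupes = int(toy.duplicated().sum())

# Compare on every column except the serial number.
content_cols = [col for col in toy.columns if col != "Sr No"]
content_dupes = int(toy.duplicated(subset=content_cols).sum())

print(full_row_dupes, content_dupes)  # 0 1
```

Whether this changes the result for the real dataset is not known from the outputs shown here; it is only a check worth running.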

## Key Takeaways from Data Understanding

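- The dataset has 3,044 rows and 10 columns; seven columns have missing values, and no fully duplicated rows were found.
- Inconsistencies are visible even in the first rows and in the schema: one `Startup Name` entry is a URL, and the column names `City  Location` (double space) and `InvestmentnType` are irregular.
- `Amount in USD` and `Date dd/mm/yyyy` are stored as `object`, so both need conversion before numerical or time-based analysis.
- Understanding data structure first helps avoid misleading conclusions later.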
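
Since the date column is stored as text, it may be useful to know how many entries do not follow the `dd/mm/yyyy` pattern named in the column header before any conversion is attempted. The sketch below is one way to list them; `unparseable_dates` is a hypothetical helper, and the malformed sample values are invented for illustration.

```python
import pandas as pd

def unparseable_dates(series: pd.Series, fmt: str = "%d/%m/%Y") -> pd.Series:
    """Return non-null entries that do not match the expected date format."""
    present = series.dropna().astype(str).str.strip()
    parsed = pd.to_datetime(present, format=fmt, errors="coerce")
    return present[parsed.isna()]

# The first two values appear in df.head(); the last two are invented.
sample = pd.Series(
    ["09/01/2020", "13/01/2020", "05/072018", "not a date"],
    name="Date dd/mm/yyyy",
)
bad_dates = unparseable_dates(sample)
print(bad_dates.tolist())
```

Like the other checks in this notebook, this only reports problems and leaves the column unchanged.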

The next step will focus on cleaning only what is necessary for descriptive statistics.