# Overview
Interested in the Indian startup ecosystem, just like me? Want to know what kinds of startups have been funded in the last few years, who the important investors are, and which fields attract the most funding these days?
This dataset is a chance to explore the Indian startup scene. Dive deep into the funding data and derive insights into the future!

This dataset has funding information on Indian startups from January 2015 to August 2017. It includes columns for the funding date, the city the startup is based in, the names of the funders, and the amount invested (in USD).

### Columns include:
<b>Sr No</b> - Serial number<br>
<b>Date dd/mm/yyyy</b> - Date of funding in dd/mm/yyyy format<br>
<b>Startup Name</b> - Name of the startup<br>
<b>Industry Vertical</b> - Industry vertical of the startup<br>
<b>SubVertical</b> - Industry sub-vertical<br>
<b>City Location</b> - City where the startup is based<br>
<b>Investors Name</b> - Name of the investors<br>
<b>InvestmentnType</b> - Type of investment<br>
<b>Amount in USD</b> - Amount invested, in USD<br>
<b>Remarks</b> - Other remarks, if any<br>

# Questions 
1. How does the funding ecosystem change with time?
2. Do cities play a major role in funding?
3. Which industries are favored by investors for funding?
4. Who are the important investors in the Indian Ecosystem?
5. How much funding do startups generally get in India?

In [1]:
# import relevant libraries
import pandas as pd
import numpy as np
from summarytools import dfSummary
import seaborn as sns
import matplotlib.pyplot as plt

import re

In [36]:
# Read data
data = pd.read_csv("startup_funding.csv")

In [37]:
data.head()

Unnamed: 0,Sr No,Date dd/mm/yyyy,Startup Name,Industry Vertical,SubVertical,City Location,Investors Name,InvestmentnType,Amount in USD,Remarks
0,1,9/1/2020,BYJU’S,E-Tech,E-learning,Bengaluru,Tiger Global Management,Private Equity Round,200000000,
1,2,13/01/2020,Shuttl,Transportation,App based shuttle service,Gurgaon,Susquehanna Growth Equity,Series C,8048394,
2,3,9/1/2020,Mamaearth,E-commerce,Retailer of baby and toddler products,Bengaluru,Sequoia Capital India,Series B,18358860,
3,4,2/1/2020,https://www.wealthbucket.in/,FinTech,Online Investment,New Delhi,Vinod Khatumal,Pre-series A,3000000,
4,5,2/1/2020,Fashor,Fashion and Apparel,Embroiled Clothes For Women,Mumbai,Sprout Venture Partners,Seed Round,1800000,


In [22]:
data.shape

(3044, 10)

In [23]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3044 entries, 0 to 3043
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Sr No              3044 non-null   int64 
 1   Date dd/mm/yyyy    3044 non-null   object
 2   Startup Name       3044 non-null   object
 3   Industry Vertical  2873 non-null   object
 4   SubVertical        2108 non-null   object
 5   City  Location     2864 non-null   object
 6   Investors Name     3020 non-null   object
 7   InvestmentnType    3040 non-null   object
 8   Amount in USD      2084 non-null   object
 9   Remarks            419 non-null    object
dtypes: int64(1), object(9)
memory usage: 237.9+ KB
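
The non-null counts above (for example, only 419 of 3,044 rows have a value in `Remarks`) can also be expressed as a missing percentage per column. The snippet below is a minimal sketch on a small made-up frame; `toy` and its values are illustrative assumptions, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; the values are illustrative only.
toy = pd.DataFrame({
    "Startup Name": ["A", "B", "C", "D"],
    "SubVertical": ["x", np.nan, np.nan, "y"],
    "Remarks": [np.nan, np.nan, np.nan, "Series A"],
})

# isna() yields a boolean frame; the mean of each column is its missing fraction.
missing_pct = (toy.isna().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

Running the same two lines on `data` would rank the columns by how incomplete they are, which helps decide what to drop or impute.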


# Univariate Analysis

In [24]:
data.describe()

Unnamed: 0,Sr No
count,3044.0
mean,1522.5
std,878.871435
min,1.0
25%,761.75
50%,1522.5
75%,2283.25
max,3044.0
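
`describe()` only reports `Sr No` here because it is the single numeric column; every other column is stored as text. A possible way to get a comparable summary for the text columns is to exclude numeric types instead. This is a sketch on a made-up frame (`toy` is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Illustrative frame with one numeric and one text column.
toy = pd.DataFrame({
    "Sr No": [1, 2, 3],
    "City Location": ["Bangalore", "Mumbai", "Bangalore"],
})

# Excluding numeric dtypes leaves the text columns, summarized by
# count, number of unique values, most frequent value, and its frequency.
text_summary = toy.describe(exclude=[np.number])
print(text_summary)
```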


In [30]:
dfSummary(data, is_collapsible = False)

No,Variable,Stats / Values,Freqs / (% of Valid),Graph,Missing
1,Sr No [int64],Mean (sd) : 1522.5 (878.9) min < med < max: 1.0 < 1522.5 < 3044.0 IQR (CV) : 1521.5 (1.7),"3,044 distinct values",,0 (0.0%)
2,Date [object],1. 2/2/2015 2. 8/7/2015 3. 30/11/2016 4. 4/10/2016 5. 23/07/2015 6. 1/6/2015 7. 8/2/2016 8. 21/06/2016 9. 22/01/2016 10. 4/5/2016 11. other,"11 (0.4%) 11 (0.4%) 11 (0.4%) 10 (0.3%) 9 (0.3%) 9 (0.3%) 9 (0.3%) 9 (0.3%) 9 (0.3%) 9 (0.3%) 2,947 (96.8%)",,0 (0.0%)
3,Startup Name [object],1. Ola Cabs 2. Swiggy 3. Paytm 4. Meesho 5. Nykaa 6. NoBroker 7. Medinfi 8. UrbanClap 9. Flipkart 10. Grofers 11. other,"8 (0.3%) 8 (0.3%) 7 (0.2%) 6 (0.2%) 6 (0.2%) 6 (0.2%) 6 (0.2%) 6 (0.2%) 5 (0.2%) 5 (0.2%) 2,981 (97.9%)",,0 (0.0%)
4,Industry Vertical [object],1. Consumer Internet 2. Technology 3. eCommerce 4. nan 5. Healthcare 6. Finance 7. ECommerce 8. Logistics 9. E-Commerce 10. Education 11. other,941 (30.9%) 478 (15.7%) 186 (6.1%) 171 (5.6%) 70 (2.3%) 62 (2.0%) 61 (2.0%) 32 (1.1%) 29 (1.0%) 24 (0.8%) 990 (32.5%),,171 (5.6%)
5,SubVertical [object],1. nan 2. Online Lending Platform 3. Online Pharmacy 4. Food Delivery Platform 5. Online Learning Platform 6. Online Education Platform 7. Online Lending 8. Online lending platform 9. Education 10. Online Food Delivery 11. other,"936 (30.7%) 11 (0.4%) 10 (0.3%) 8 (0.3%) 5 (0.2%) 5 (0.2%) 5 (0.2%) 5 (0.2%) 5 (0.2%) 4 (0.1%) 2,050 (67.3%)",,936 (30.7%)
6,City Location [object],1. Bangalore 2. Mumbai 3. New Delhi 4. Gurgaon 5. nan 6. Bengaluru 7. Pune 8. Hyderabad 9. Chennai 10. Noida 11. other,700 (23.0%) 567 (18.6%) 421 (13.8%) 287 (9.4%) 180 (5.9%) 141 (4.6%) 105 (3.4%) 99 (3.3%) 97 (3.2%) 92 (3.0%) 355 (11.7%),,180 (5.9%)
7,Investors Name [object],1. Undisclosed Investors 2. Undisclosed investors 3. Ratan Tata 4. nan 5. Indian Angel Network 6. Kalaari Capital 7. Group of Angel Investors 8. Sequoia Capital 9. Undisclosed Investor 10. Accel Partners 11. other,"39 (1.3%) 30 (1.0%) 25 (0.8%) 24 (0.8%) 23 (0.8%) 16 (0.5%) 15 (0.5%) 15 (0.5%) 12 (0.4%) 12 (0.4%) 2,833 (93.1%)",,24 (0.8%)
8,InvestmentnType [object],1. Private Equity 2. Seed Funding 3. Seed/ Angel Funding 4. Seed / Angel Funding 5. Seed\\nFunding 6. Debt Funding 7. Series A 8. Seed/Angel Funding 9. Series B 10. Series C 11. other,"1,356 (44.5%) 1,355 (44.5%) 60 (2.0%) 47 (1.5%) 30 (1.0%) 25 (0.8%) 24 (0.8%) 23 (0.8%) 20 (0.7%) 14 (0.5%) 90 (3.0%)",,4 (0.1%)
9,Amount in USD [object],"1. nan 2. 10,00,000 3. 5,00,000 4. 20,00,000 5. 30,00,000 6. 50,00,000 7. 1,00,00,000 8. 1,00,000 9. 1,50,000 10. 2,00,000 11. other","960 (31.5%) 165 (5.4%) 108 (3.5%) 69 (2.3%) 66 (2.2%) 66 (2.2%) 60 (2.0%) 57 (1.9%) 45 (1.5%) 44 (1.4%) 1,404 (46.1%)",,960 (31.5%)
10,Remarks [object],1. nan 2. Series A 3. Series B 4. Pre-Series A 5. Series C 6. Series D 7. Strategic Investment 8. Late Stage 9. At the 10 minute million event 10. Strategic Funding 11. other,"2,625 (86.2%) 175 (5.7%) 63 (2.1%) 37 (1.2%) 28 (0.9%) 11 (0.4%) 11 (0.4%) 9 (0.3%) 6 (0.2%) 6 (0.2%) 73 (2.4%)",,"2,625 (86.2%)"
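
The summary shows the same category spelled several ways: `Bangalore` and `Bengaluru`, `eCommerce` / `ECommerce` / `E-Commerce`, and three spacings of `Seed/Angel Funding`. The sketch below shows one possible way to merge such variants before counting. The choice of `Bengaluru` as the canonical city name and the lower-cased vertical key are assumptions made for illustration, not decisions taken in this notebook.

```python
import pandas as pd

# Label variants copied from the summary table above.
city = pd.Series(["Bangalore", "Bengaluru", "Mumbai"])
vertical = pd.Series(["eCommerce", "ECommerce", "E-Commerce", "Technology"])
inv_type = pd.Series(["Seed/ Angel Funding", "Seed / Angel Funding", "Seed/Angel Funding"])

# Map the older spelling onto one canonical name (assumed choice).
city_clean = city.replace({"Bangalore": "Bengaluru"})

# Build a comparison key: lower-case and without hyphens.
vertical_key = vertical.str.lower().str.replace("-", "", regex=False)

# Remove optional whitespace around the slash.
inv_type_clean = inv_type.str.replace(r"\s*/\s*", "/", regex=True)

print(city_clean.value_counts())
print(vertical_key.value_counts())
print(inv_type_clean.value_counts())
```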


### Deep Dive on Date Column 

In [38]:
data.rename(columns={"Date dd/mm/yyyy": "Date"}, inplace=True)

In [39]:
with pd.option_context('display.max_rows', None,):
    print(data["Date"])

0          9/1/2020
1        13/01/2020
2          9/1/2020
3          2/1/2020
4          2/1/2020
5        13/01/2020
6         10/1/2020
7        12/12/2019
8         6/12/2019
9         3/12/2019
10       13/12/2019
11       17/12/2019
12       16/12/2019
13       16/12/2019
14       14/12/2019
15       11/12/2019
16       20/12/2019
17       13/11/2019
18       14/11/2019
19       13/11/2019
20       17/11/2019
21       18/11/2019
22       15/11/2019
23       20/11/2019
24       12/11/2019
25       20/11/2019
26       11/11/2019
27       19/11/2019
28       18/11/2019
29       15/11/2019
30       19/11/2019
31       25/11/2019
32        4/10/2019
33        2/10/2019
34       21/10/2019
35         5/9/2019
36         4/9/2019
37         4/9/2019
38         4/9/2019
39         4/9/2019
40         4/9/2019
41         4/9/2019
42         4/9/2019
43         3/9/2019
44         1/8/2019
45         1/8/2019
46         1/8/2019
47         1/8/2019
48         1/8/2019
49        12/8/2019


In [40]:
# Check for data type
data["Date"].dtypes

dtype('O')

In [42]:
# Fix malformed date strings; dots are escaped because the patterns are treated as regexes
data["Date"] = data["Date"].replace(
    [r"05/072018", r"01/07/015", r"22/01//2015", r"12/05\.2015", r"13/04\.2015", r"15/01\.2015"],
    ["05/07/2018", "01/07/2015", "22/01/2015", "12/05/2015", "13/04/2015", "15/01/2015"],
    regex=True)

In [43]:
# convert from object to datetime
data["Date"] = pd.to_datetime(data["Date"],format = "%d/%m/%Y")

In [44]:
# Check for data type
data["Date"].dtypes

dtype('<M8[ns]')
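
The malformed strings corrected before the conversion were spotted by reading the printed column. As an alternative sketch, parsing with `errors="coerce"` turns anything that does not match the format into `NaT`, so the offending rows can be listed directly. The sample series below reuses two of the malformed values from the cell above; `dates` is an illustrative stand-in for `data["Date"]`.

```python
import pandas as pd

# Two valid dates and two of the malformed strings seen in the data.
dates = pd.Series(["9/1/2020", "13/01/2020", "05/072018", "12/05.2015"])

# Unparseable entries become NaT instead of raising an error.
parsed = pd.to_datetime(dates, format="%d/%m/%Y", errors="coerce")

# Select the original strings that failed to parse.
bad = dates[parsed.isna()]
print(bad.tolist())
```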

In [34]:
# Feature engineering: extract the funding year from the date
data['Year'] = data['Date'].dt.year

# Drop the Date, Sr No and Remarks columns
data.drop(["Date","Sr No", "Remarks"], axis = 1, inplace = True)

In [35]:
data.head(50)

Unnamed: 0,Startup Name,Industry Vertical,SubVertical,City Location,Investors Name,InvestmentnType,Amount in USD,Year
0,BYJU’S,E-Tech,E-learning,Bengaluru,Tiger Global Management,Private Equity Round,200000000,2020
1,Shuttl,Transportation,App based shuttle service,Gurgaon,Susquehanna Growth Equity,Series C,8048394,2020
2,Mamaearth,E-commerce,Retailer of baby and toddler products,Bengaluru,Sequoia Capital India,Series B,18358860,2020
3,https://www.wealthbucket.in/,FinTech,Online Investment,New Delhi,Vinod Khatumal,Pre-series A,3000000,2020
4,Fashor,Fashion and Apparel,Embroiled Clothes For Women,Mumbai,Sprout Venture Partners,Seed Round,1800000,2020
5,Pando,Logistics,"Open-market, freight management platform",Chennai,Chiratae Ventures,Series A,9000000,2020
6,Zomato,Hospitality,Online Food Delivery Platform,Gurgaon,Ant Financial,Private Equity Round,150000000,2020
7,Ecozen,Technology,Agritech,Pune,Sathguru Catalyzer Advisors,Series A,6000000,2019
8,CarDekho,E-Commerce,Automobile,Gurgaon,Ping An Global Voyager Fund,Series D,70000000,2019
9,Dhruva Space,Aerospace,Satellite Communication,Bengaluru,"Mumbai Angels, Ravikanth Reddy",Seed,50000000,2019


### Deep Dive on Startup Name

In [15]:
with pd.option_context('display.max_rows', None,):
    print(data["Startup Name"])

0                                                  BYJU’S
1                                                  Shuttl
2                                               Mamaearth
3                            https://www.wealthbucket.in/
4                                                  Fashor
5                                                   Pando
6                                                  Zomato
7                                                  Ecozen
8                                                CarDekho
9                                            Dhruva Space
10                                                 Rivigo
11                                             Healthians
12                                                Licious
13                                                 InCred
14                                                  Trell
15                                             Rein Games
16                                           Lenskart.com
17            

Name: Startup Name, dtype: object


In [16]:
# Resolve naming discrepancies (a URL and escaped-apostrophe debris) with exact-match replacements
name_fixes = {
    "https://www.wealthbucket.in/": "Wealth Bucket",
    "BYJU\\'S": "BYJU",
    "What\\xe2\\x80\\x99s Up Life": "Up Life",
    "Byju\\xe2\\x80\\x99s": "BYJU",
    "Creator\\xe2\\x80\\x99s Gurukul": "Gurukul",
    "SERV\\xe2\\x80\\x99D": "Serv",
    "BYJU\\xe2\\x80\\x99s": "BYJU",
}
data["Startup Name"] = data["Startup Name"].replace(name_fixes)
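
The names fixed in this cell contain runs such as `\xe2\x80\x99`, which look like the UTF-8 bytes of a right single quotation mark stored as literal text. Assuming that is what happened, a generic alternative to listing names one by one is to decode every such run back into the character it encodes. The helper `decode_escaped_bytes` below is hypothetical and only a sketch; it has not been run against the full dataset.

```python
import re

import pandas as pd

# One or more literal "\xNN" escapes in a row (plain text, not real bytes).
ESCAPED_BYTES = re.compile(r"(?:\\+x[0-9a-fA-F]{2})+")


def decode_escaped_bytes(text: str) -> str:
    """Turn literal '\\xe2\\x80\\x99'-style runs back into the characters they encode."""

    def _decode(match: re.Match) -> str:
        # Collect the hex pairs, rebuild the byte string, and decode it as UTF-8.
        hex_pairs = re.findall(r"x([0-9a-fA-F]{2})", match.group(0))
        raw = bytes(int(pair, 16) for pair in hex_pairs)
        return raw.decode("utf-8", errors="replace")

    return ESCAPED_BYTES.sub(_decode, text)


# Sample values written the same way as in the cell above.
names = pd.Series(["Byju\\xe2\\x80\\x99s", "SERV\\xe2\\x80\\x99D", "Zomato"])
cleaned = names.map(decode_escaped_bytes)
print(cleaned.tolist())
```

Unlike the hand-written mapping, this keeps the full name (with its apostrophe restored), so spelling variants such as `Byju’s` and `BYJU’S` would still need to be merged separately.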

### Deep Dive on Amount

In [46]:
data.rename(columns={"Amount in USD": "Amount"}, inplace=True)
data["Amount"].dtypes

dtype('O')

In [47]:
with pd.option_context('display.max_rows', None,):
    print(data["Amount"])

0               20,00,00,000
1                  80,48,394
2                1,83,58,860
3                  30,00,000
4                  18,00,000
5                  90,00,000
6               15,00,00,000
7                  60,00,000
8                7,00,00,000
9                5,00,00,000
10               2,00,00,000
11               1,20,00,000
12               3,00,00,000
13                 59,00,000
14                 20,00,000
15               5,00,00,000
16              23,10,00,000
17              15,00,00,000
18                  4,86,000
19                 15,00,000
20               undisclosed
21               1,20,00,000
22               2,60,00,000
23               1,74,11,265
24                 13,00,000
25              13,50,00,000
26                  3,00,000
27              22,00,00,000
28               1,58,00,000
29              28,30,00,000
30              20,00,00,000
31            1,00,00,00,000
32               4,50,00,000
33              58,50,00,000
34            

In [48]:
# Strip every non-digit character (commas, "undisclosed", etc.), then convert;
# entries left empty become NaN. Assigning the whole column avoids chained indexing.
data["Amount"] = pd.to_numeric(
    data["Amount"].astype(str).str.replace(r"\D", "", regex=True), errors="coerce")


In [49]:
data["Amount"].dtypes

dtype('float64')
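
With `Amount` numeric and a `Year` column available, the first and last questions (how funding changes over time, and how much startups typically receive) can be approached with a grouped summary. The sketch below is self-contained: it repeats the digit-stripping step on a few amounts written in the same comma style as the printout above, and the year assigned to each row is an assumption made for illustration.

```python
import pandas as pd

# Amounts use Indian digit grouping; "undisclosed" marks a missing value.
toy = pd.DataFrame({
    "Year": [2020, 2020, 2020, 2019, 2019],
    "Amount": ["20,00,00,000", "undisclosed", "30,00,000", "60,00,000", "7,00,00,000"],
})

# Keep digits only, then convert; empty strings turn into NaN.
digits_only = toy["Amount"].str.replace(r"\D", "", regex=True)
toy["Amount"] = pd.to_numeric(digits_only, errors="coerce")

# Per-year summary: all deals, deals with a disclosed amount, total and median.
yearly = toy.groupby("Year")["Amount"].agg(
    deals="size", disclosed="count", total="sum", median="median"
)
print(yearly)
```

The median is reported next to the total because a handful of very large rounds can dominate the yearly sum.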