# Data Analysis Project -- Indian Start-up Funding Analysis
### Business Understanding
Ideas, creativity, and execution are essential for a start-up to flourish. But are they enough? Investors provide start-ups and other entrepreneurial ventures with capital, popularly known as "funding", to think big, grow rich, and leave a lasting impact. In this project, I analyse the funding received by start-ups in India from 2018 to 2021. The dataset provides the data for each year of funding in a separate CSV file. Each file contains the start-ups' details, the funding amounts received, and the investors' information.
### Data Understanding
#### Feature description:
1. Company/Brand: Name of the company/start-up
2. Founded: Year the start-up was founded
3. Sector: Sector of service
4. What it does: Description of the company
5. Founders: Founders of the company
6. Investor: Investors
7. Amount($): Funds raised
8. Stage: Round of funding reached

## Business Questions:
 
1. How have funding amounts varied across sectors of Indian startups from 2018 to 2021, and what do these trends imply for investors seeking to optimize their investment strategies in the Indian startup ecosystem?
2. Which sectors in the Indian startup ecosystem received the highest total funding from 2018 to 2021, and what factors contributed to their success in attracting investment compared to other sectors? Beyond identifying the sectors, this question seeks the underlying reasons for their success, which could include market demand, technological innovation, the regulatory environment, and investor interest.
3. What are the funding trends across the stages of startup development (early-stage, growth-stage, late-stage) within the Indian startup ecosystem, and how do sector, geographic location, and investor type influence these trends? This question considers four dimensions: startup stage (early, growth, and late); sector (which sectors receive funding at each stage); geographic location (regional variations in funding); and investor type (the role of venture capital, angel investors, private equity, and others).
4. Which geographic factors (such as infrastructure, talent pool, economic policies, and market access) influence the correlation between a startup's location and the funding it received, and how do these factors vary across regions?
5. Which factors (such as the amount and stage of funding, investor involvement, business model, and market conditions) influence the relationship between funding amounts and the subsequent success or failure of startups within the Indian ecosystem, and how do these factors vary across sectors and stages of startup development?
## Hypothesis to Test:
 
Given the goal of assessing the investment potential in the Indian startup ecosystem, we hypothesize that:
 
Null Hypothesis (H0): There is no clear pattern in the funding received by Indian startups from 2018 to 2021, and factors like sector, stage, location, and funding amount do not affect startup success.

Alternative Hypothesis (H1): There is a clear pattern in the funding received by Indian startups from 2018 to 2021, and factors like sector, stage, location, and funding amount affect startup success.
## Objectives:
 
1. To assess the overall attractiveness of the Indian startup ecosystem based on funding trends and investor activity from 2018 to 2021.
2. To identify key sectors with high potential for investment based on their funding attractiveness and growth prospects.
3. To evaluate the investment opportunities across different stages of startup development and their risk-return profiles.
4. To analyze the geographical distribution of startups and funding to identify strategic investment locations and regional investment disparities.
5. To determine the correlation between funding amounts received by startups and their subsequent performance, providing insights into potential returns on investment and success rates.
 
These objectives aim to provide a comprehensive evaluation of the investment landscape in the Indian startup ecosystem, helping the team make informed decisions regarding the feasibility and potential of investing in Indian startups.

## Install necessary packages

In [1967]:
%pip install pyodbc

Note: you may need to restart the kernel to use updated packages.


In [1968]:
%pip install python-dotenv

Note: you may need to restart the kernel to use updated packages.


In [1969]:
#pd.set_option('display.max_columns', None)
#pd.set_option('display.max_rows', None)


### Import all the necessary packages
 

In [1970]:
import pyodbc     
from dotenv import dotenv_values    #import the dotenv_values function from the dotenv package
import pandas as pd
import numpy as np
import warnings 
warnings.filterwarnings('ignore')


## Load the datasets to use in this project

In [1971]:
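# These settings belong in the .env file, which should stay out of version control.
# The password is redacted here; never hard-code credentials in a notebook.
SERVER="dap-projects-database.database.windows.net"
LOGIN="LP1_learner"
PASSWORD="<redacted>"
DATABASE="dapDB"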

In [1972]:
# load environment variables from .env file into a dictionary
environment_variables = dotenv_values('.env')
 
# Get the values for the credentials from .env file
database=environment_variables.get("DATABASE")
server=environment_variables.get("SERVER")
login=environment_variables.get("LOGIN")
password=environment_variables.get("PASSWORD")
 
# create a connection string
connection_string=f"DRIVER={{SQL Server}};SERVER={server};DATABASE={database};UID={login};PWD={password}"
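
Because `dict.get` returns `None` for a key that is missing from `.env`, a typo in the file would silently produce a connection string containing the text `None`. A minimal sketch of a guard follows; the helper `build_connection_string` and the `REQUIRED_KEYS` tuple are hypothetical and not part of the original notebook.

```python
# Hypothetical helper: the names below are illustrative assumptions.
REQUIRED_KEYS = ("SERVER", "DATABASE", "LOGIN", "PASSWORD")


def build_connection_string(env):
    """Build the ODBC connection string, failing early on missing settings.

    `env` is any mapping, for example the result of dotenv_values('.env').
    """
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError(f"Missing settings in .env: {', '.join(missing)}")
    return (
        "DRIVER={SQL Server};"
        f"SERVER={env['SERVER']};DATABASE={env['DATABASE']};"
        f"UID={env['LOGIN']};PWD={env['PASSWORD']}"
    )
```

With this in place, `build_connection_string(environment_variables)` could stand in for the f-string above and would raise before `pyodbc.connect` is ever called with incomplete settings.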

In [1973]:
connection = pyodbc.connect(connection_string)

In [1974]:
# List the base tables available in the database
db_query = ''' SELECT *
            FROM INFORMATION_SCHEMA.TABLES
            WHERE TABLE_TYPE = 'BASE TABLE' '''

In [1975]:
# Run the query and load the list of tables into a DataFrame
data1=pd.read_sql(db_query, connection)
 
data1

Unnamed: 0,TABLE_CATALOG,TABLE_SCHEMA,TABLE_NAME,TABLE_TYPE
0,dapDB,dbo,LP1_startup_funding2021,BASE TABLE
1,dapDB,dbo,LP1_startup_funding2020,BASE TABLE


## Exploring the data
### Data cleaning

In [1976]:
query = "select * from dbo.LP1_startup_funding2020"
data = pd.read_sql(query, connection)
data.head()

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,column10
0,Aqgromalin,2019.0,Chennai,AgriTech,Cultivating Ideas for Profit,"Prasanna Manogaran, Bharani C L",Angel investors,200000.0,,
1,Krayonnz,2019.0,Bangalore,EdTech,An academy-guardian-scholar centric ecosystem ...,"Saurabh Dixit, Gurudutt Upadhyay",GSF Accelerator,100000.0,Pre-seed,
2,PadCare Labs,2018.0,Pune,Hygiene management,Converting bio-hazardous waste to harmless waste,Ajinkya Dhariya,Venture Center,,Pre-seed,
3,NCOME,2020.0,New Delhi,Escrow,Escrow-as-a-service platform,Ritesh Tiwari,"Venture Catalysts, PointOne Capital",400000.0,,
4,Gramophone,2016.0,Indore,AgriTech,Gramophone is an AgTech platform enabling acce...,"Ashish Rajan Singh, Harshit Gupta, Nishant Mah...","Siana Capital Management, Info Edge",340000.0,,


In [1977]:
data.describe()

Unnamed: 0,Founded,Amount
count,842.0,801.0
mean,2015.36342,113043000.0
std,4.097909,2476635000.0
min,1973.0,12700.0
25%,2014.0,1000000.0
50%,2016.0,3000000.0
75%,2018.0,11000000.0
max,2020.0,70000000000.0


In [1978]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1055 entries, 0 to 1054
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  1055 non-null   object 
 1   Founded        842 non-null    float64
 2   HeadQuarter    961 non-null    object 
 3   Sector         1042 non-null   object 
 4   What_it_does   1055 non-null   object 
 5   Founders       1043 non-null   object 
 6   Investor       1017 non-null   object 
 7   Amount         801 non-null    float64
 8   Stage          591 non-null    object 
 9   column10       2 non-null      object 
dtypes: float64(2), object(8)
memory usage: 82.5+ KB


In [1979]:
data.dtypes

Company_Brand     object
Founded          float64
HeadQuarter       object
Sector            object
What_it_does      object
Founders          object
Investor          object
Amount           float64
Stage             object
column10          object
dtype: object

In [1980]:
data.shape

(1055, 10)

In [1981]:
data.isna().sum()

Company_Brand       0
Founded           213
HeadQuarter        94
Sector             13
What_it_does        0
Founders           12
Investor           38
Amount            254
Stage             464
column10         1053
dtype: int64

In [1982]:
# Replace dashes (-), empty strings, and "nan" strings in the Founded column with NaN
data["Founded"] = data["Founded"].replace(["-", "", "nan"], np.nan)
data["Founded"].unique()

array([2019., 2018., 2020., 2016., 2008., 2015., 2017., 2014., 1998.,
       2007., 2011., 1982., 2013., 2009., 2012., 1995., 2010., 2006.,
       1978.,   nan, 1999., 1994., 2005., 1973., 2002., 2004., 2001.])

In [1983]:
# Standardize HeadQuarter names (regex=True is required for the "|" alternation)
data["HeadQuarter"] = data["HeadQuarter"].str.replace("Bangalore|Bengaluru", "Bangalore", regex=True)
data["HeadQuarter"] = data["HeadQuarter"].str.replace("Hyderebad", "Hyderabad", regex=False)
data["HeadQuarter"] = data["HeadQuarter"].str.replace("New Delhi|Delhi", "New Delhi", regex=True)
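
As the number of spelling variants grows, a single alias table can be easier to maintain than one `str.replace` call per city. The sketch below is one possible approach under stated assumptions: `CITY_ALIASES` and `standardize_city` are hypothetical names, and unlike the substring replacement above, it only rewrites a value when the whole value matches a known variant.

```python
import re

# Hypothetical alias table: each canonical city maps to a regex of known variants.
CITY_ALIASES = {
    "Bangalore": r"Bangalore|Bengaluru",
    "Hyderabad": r"Hyderebad|Hyderabad",
    "New Delhi": r"New Delhi|Delhi",
}


def standardize_city(name):
    """Return the canonical spelling for a headquarters value, or the input unchanged."""
    if not isinstance(name, str):
        return name  # leave None/NaN untouched
    cleaned = name.strip()
    for canonical, pattern in CITY_ALIASES.items():
        # fullmatch: the entire value must be one of the listed variants
        if re.fullmatch(pattern, cleaned, flags=re.IGNORECASE):
            return canonical
    return cleaned
```

It could be applied column-wide with `data["HeadQuarter"].map(standardize_city)`.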

In [1984]:
# Handling missing values
data["HeadQuarter"] = data["HeadQuarter"].replace(["", "-", "nan"], np.nan)

In [1985]:
# Identify rows where Founders is blank, contains a comma, or is missing
data[(data["Founders"] == " ") | data["Founders"].str.contains(",", na=False) | data["Founders"].isna()]

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,column10
0,Aqgromalin,2019.0,Chennai,AgriTech,Cultivating Ideas for Profit,"Prasanna Manogaran, Bharani C L",Angel investors,200000.0,,
1,Krayonnz,2019.0,Bangalore,EdTech,An academy-guardian-scholar centric ecosystem ...,"Saurabh Dixit, Gurudutt Upadhyay",GSF Accelerator,100000.0,Pre-seed,
4,Gramophone,2016.0,Indore,AgriTech,Gramophone is an AgTech platform enabling acce...,"Ashish Rajan Singh, Harshit Gupta, Nishant Mah...","Siana Capital Management, Info Edge",340000.0,,
5,qZense,2019.0,Bangalore,AgriTech,qZense Labs is building the next-generation Io...,"Rubal Chib, Dr Srishti Batra","Venture Catalysts, 9Unicorns Accelerator Fund",600000.0,Seed,
8,Rupeek,2015.0,Bangalore,FinTech,Rupeek is an online lending platform that spec...,"Amar Prabhu, Ashwin Soni, Sumit Maniyar","KB Investment, Bertelsmann India Investments",45000000.0,Series C,
...,...,...,...,...,...,...,...,...,...,...
1049,Fashor,2017.0,Chennai,Fashion,Women’s fashion and apparel,"Vikram Kankaria, Priyanka Kankaria",Sprout venture partners,1000000.0,Pre Series A,
1051,EpiFi,,,Fintech,It offers customers with a single interface fo...,"Sujith Narayanan, Sumit Gwalani","Sequoia India, Ribbit Capital",13200000.0,Seed Round,
1052,Purplle,2012.0,Mumbai,Cosmetics,Online makeup and beauty products retailer,"Manish Taneja, Rahul Dash",Verlinvest,8000000.0,,
1053,Shuttl,2015.0,New Delhi,Transport,App based bus aggregator serice,"Amit Singh, Deepanshu Malviya",SIG Global India Fund LLP.,8043000.0,Series C,


In [1986]:
# Replace empty strings and lone commas in the Founders column with NaN
data["Founders"] = data["Founders"].replace(['', ','], np.nan)

In [1987]:
#check for null values in the investor column
data[data["Investor"].isnull() | (data["Investor"] == "")|(data["Investor"] == "nan")]

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage,column10
21,SucSEED Indovation,2016.0,Hyderabad,FinTech,SucSEED INDOVATION FACILITATES ACCESS TO CAPIT...,Vikrant Varshney,,5000000.0,,
24,Circle of Angels,2018.0,Gurgaon,FinTech,Circles of Angels aims to solve this issue thr...,Karanpal Singh,,3000000.0,,
44,PointOne Capital,2020.0,Bangalore,Venture capitalist,Pre-seed/Seed focussed VC investor,Mihir Jha,,,,
65,Toppeq,2019.0,Mumbai,SaaS startup,SaaS-based equity management platform,Nandini Sankar,,,Seed,
131,PlayerzPot,2015.0,Mumbai,Gaming startup,"Favorite fantasy cricket, football & kabaddi l...","Yogesh, Mitesh Gangar",,3000000.0,Series A,
151,BlackSoil,2016.0,Mumbai,FinTech,Blacksoil Advisory is an independent boutique ...,Ankur Bansal,,10000000.0,,
162,Jade Forest,,New Delhi,Beverages,NATURALLY THE BEST From zero artificial ingred...,"Punweet Singh, Shuchir Suri",,250000.0,Seed,
237,GoodGamer,2020.0,Bangalore,Gaming,GoodGamer is India's first Daily Fantasy Sport...,Charles Creighton,,2500000.0,Seed,
241,SoOLEGAL,2015.0,New Delhi,LegalTech,SoOLEGAL is a global integrated directory of l...,Manish Kaul,,4000000.0,,
244,Hire Me Car,,Noida,Car Service,India's largest cloud based digital discovery ...,Pankaj Sharma,,,Seed,


In [1988]:
# Replace empty strings and lone commas in the Investor column with NaN
data["Investor"] = data["Investor"].replace(['', ','], np.nan)

In [1989]:
data["Amount"].unique()

array([2.0000000e+05, 1.0000000e+05,           nan, 4.0000000e+05,
       3.4000000e+05, 6.0000000e+05, 4.5000000e+07, 1.0000000e+06,
       2.0000000e+06, 1.2000000e+06, 6.6000000e+08, 1.2000000e+05,
       7.5000000e+06, 5.0000000e+06, 5.0000000e+05, 3.0000000e+06,
       1.0000000e+07, 1.4500000e+08, 1.0000000e+08, 2.1000000e+07,
       4.0000000e+06, 2.0000000e+07, 5.6000000e+05, 2.7500000e+05,
       4.5000000e+06, 1.5000000e+07, 3.9000000e+08, 7.0000000e+06,
       5.1000000e+06, 7.0000000e+08, 2.3000000e+06, 7.0000000e+05,
       1.9000000e+07, 9.0000000e+06, 4.0000000e+07, 7.5000000e+05,
       1.5000000e+06, 7.8000000e+06, 5.0000000e+07, 8.0000000e+07,
       3.0000000e+07, 1.7000000e+06, 2.5000000e+06, 4.0000000e+04,
       3.3000000e+07, 3.5000000e+07, 3.0000000e+05, 2.5000000e+07,
       3.5000000e+06, 2.0000000e+08, 6.0000000e+06, 1.3000000e+06,
       4.1000000e+06, 5.7500000e+05, 8.0000000e+05, 2.8000000e+07,
       1.8000000e+07, 3.2000000e+06, 9.0000000e+05, 2.5000000e

In [1990]:
data.columns

Index(['Company_Brand', 'Founded', 'HeadQuarter', 'Sector', 'What_it_does',
       'Founders', 'Investor', 'Amount', 'Stage', 'column10'],
      dtype='object')

In [1991]:
# Ensure the column is of string type
data["Amount"] = data["Amount"].astype(str)
# Remove commas
data["Amount"] = data["Amount"].str.replace(',', '', regex=False)
# Replace undisclosed amounts (including the misspelled variant) with NaN
data["Amount"] = data["Amount"].replace(["Undisclosed", "Undiclsosed"], np.nan)
# Remove the '$' sign (literal match, not a regex)
data["Amount"] = data["Amount"].str.replace('$', '', regex=False)
# Extract the numeric part and convert to float
data["Amount"] = data["Amount"].str.extract(r'(\d+\.?\d*)', expand=False).astype(float)
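
The same cleaning steps can also be written as one small function, which makes the rules easy to check on individual values. This is a sketch rather than the notebook's method: `parse_amount` is a hypothetical helper, and it is stricter than the `str.extract` call above because it accepts a value only when the whole string is a plain number (strings in scientific notation are not handled).

```python
import re


def parse_amount(value):
    """Convert a raw funding amount such as '$1,200,000' to a float.

    Returns None for missing or non-numeric values such as 'Undisclosed'.
    """
    if value is None:
        return None
    text = str(value).replace("$", "").replace(",", "").strip()
    match = re.fullmatch(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None
```

Applied with `data["Amount"].map(parse_amount)`, the `None` results become missing values in the resulting column.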

In [1992]:
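# Overview of the Stage column
# (without call parentheses, `unique` only displays the bound method; data["Stage"].unique() lists the distinct values)
data["Stage"].unique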

<bound method Series.unique of 0             None
1         Pre-seed
2         Pre-seed
3             None
4             None
           ...    
1050          None
1051    Seed Round
1052          None
1053      Series C
1054      Series A
Name: Stage, Length: 1055, dtype: object>

In [1993]:
# Replace similar stage names
data["Stage"] = data["Stage"].replace({
    'pre series a': 'pre-series a',
    'pre series b': 'pre-series b',
    'pre series c': 'pre-series c',
    'pre seed': 'pre-seed',
    'seed funding': 'seed',
    'seed investment': 'seed',
    'seed round': 'seed',
    'seed a': 'seed',
    'angel round': 'seed',
    'seed round & series a': 'seed',
    'pre seed round': 'pre-seed',
    'pre series a1': 'pre-series a',
    'series a-1': 'series a',
    'series c, d': 'series c',
    'pre-seed round': 'pre-seed' })
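
Note that the keys in this mapping are lowercase, while the Stage values displayed earlier are mixed case ('Pre-seed', 'Seed Round', 'Series C'), and `Series.replace` with a dict matches values exactly. A case-insensitive lookup is one way to make the mapping take effect. The sketch below assumes a hypothetical `STAGE_ALIASES` table with a few of the entries above and a hypothetical `normalize_stage` helper; the canonical spellings on the right-hand side are an assumption.

```python
# Hypothetical alias table: lowercase keys, canonical spellings as values.
STAGE_ALIASES = {
    "pre series a": "Pre-series A",
    "pre series a1": "Pre-series A",
    "pre seed": "Pre-seed",
    "pre-seed round": "Pre-seed",
    "seed round": "Seed",
    "seed funding": "Seed",
    "series a-1": "Series A",
}


def normalize_stage(stage):
    """Map a raw Stage value to its canonical name, ignoring case and extra spaces."""
    if not isinstance(stage, str):
        return stage  # leave None/NaN untouched
    key = " ".join(stage.lower().split())
    # Unknown stages are returned as they were, minus surrounding whitespace
    return STAGE_ALIASES.get(key, stage.strip())
```

It could be applied with `data["Stage"].map(normalize_stage)`.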

In [1994]:
# Handle missing or unknown values
data["Stage"] = data["Stage"].replace([' ', ',', 'nan'], np.nan)

In [1995]:
data['Amount'] = data['Amount'].astype(str)

In [1996]:
# Remove the $ sign and commas, then convert to float; non-numeric values become NaN
data['Amount'] = pd.to_numeric(data['Amount'].str.replace('$', '', regex=False).str.replace(',', '', regex=False), errors='coerce')


In [1997]:


# Founded: replace null values with the median
data['Founded'] = data['Founded'].fillna(data['Founded'].median())

# Sector: replace null values with the most frequent value
data['Sector'] = data['Sector'].fillna(data['Sector'].mode()[0])

# HeadQuarter: label missing values as unknown
data['HeadQuarter'] = data['HeadQuarter'].fillna('HeadQuarter Unknown')

# Founders: label missing values as unknown
data['Founders'] = data['Founders'].fillna('Unknown Founders')

# Investor: fill missing values with "Various Investors"
data['Investor'] = data['Investor'].fillna('Various Investors')

# Amount: fill missing values with the mean of the existing amounts
data['Amount'] = data['Amount'].fillna(data['Amount'].mean())

# Stage: fill missing values with the mode
data['Stage'] = data['Stage'].fillna(data['Stage'].mode()[0])

# Drop the almost entirely empty column10
data = data.drop(columns='column10')

data.head()

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
0,Aqgromalin,2019.0,Chennai,AgriTech,Cultivating Ideas for Profit,"Prasanna Manogaran, Bharani C L",Angel investors,200000.0,Series A
1,Krayonnz,2019.0,Bangalore,EdTech,An academy-guardian-scholar centric ecosystem ...,"Saurabh Dixit, Gurudutt Upadhyay",GSF Accelerator,100000.0,Pre-seed
2,PadCare Labs,2018.0,Pune,Hygiene management,Converting bio-hazardous waste to harmless waste,Ajinkya Dhariya,Venture Center,113043000.0,Pre-seed
3,NCOME,2020.0,New Delhi,Escrow,Escrow-as-a-service platform,Ritesh Tiwari,"Venture Catalysts, PointOne Capital",400000.0,Series A
4,Gramophone,2016.0,Indore,AgriTech,Gramophone is an AgTech platform enabling acce...,"Ashish Rajan Singh, Harshit Gupta, Nishant Mah...","Siana Capital Management, Info Edge",340000.0,Series A
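
The imputation cell fills missing `Amount` values with the mean (113,043,000, visible in the PadCare Labs row), whereas `describe()` reported a median of only 3,000,000. The small sketch below illustrates why the two differ so much when a few very large rounds are present. It uses four amounts from the rows shown above plus the maximum reported by `describe()`; this five-value sample is only an illustration, not the full dataset.

```python
from statistics import mean, median

# Four amounts from the displayed rows plus the maximum from describe().
amounts = [200_000.0, 100_000.0, 400_000.0, 340_000.0, 70_000_000_000.0]

fill_mean = mean(amounts)      # pulled upward by the single very large value
fill_median = median(amounts)  # stays close to the typical round size

print(f"mean:   {fill_mean:,.0f}")
print(f"median: {fill_median:,.0f}")
```

Which statistic suits the analysis is a judgment call; the median is less sensitive to outliers, while the mean preserves the column total.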


In [1999]:
data.isna().sum()

Company_Brand    0
Founded          0
HeadQuarter      0
Sector           0
What_it_does     0
Founders         0
Investor         0
Amount           0
Stage            0
dtype: int64

In [2000]:
data.nunique()

Company_Brand    905
Founded           26
HeadQuarter       75
Sector           302
What_it_does     990
Founders         928
Investor         849
Amount           301
Stage             42
dtype: int64

In [2001]:
# Take a closer look at the duplicated rows, then drop them
data[data.duplicated(keep=False)].sort_values(by='Sector').head(25)
data.drop_duplicates(inplace=True)

In [2002]:
query = "select * from dbo.LP1_startup_funding2021"
data1 = pd.read_sql(query, connection)
data1.head()

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
0,Unbox Robotics,2019.0,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First","$1,200,000",Pre-series A
1,upGrad,2015.0,Mumbai,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management","$120,000,000",
2,Lead School,2012.0,Mumbai,EdTech,LEAD School offers technology based school tra...,"Smita Deorah, Sumeet Mehta","GSV Ventures, Westbridge Capital","$30,000,000",Series D
3,Bizongo,2015.0,Mumbai,B2B E-commerce,Bizongo is a business-to-business online marke...,"Aniket Deb, Ankit Tomar, Sachin Agrawal","CDC Group, IDG Capital","$51,000,000",Series C
4,FypMoney,2021.0,Gurugram,FinTech,"FypMoney is Digital NEO Bank for Teenagers, em...",Kapil Banwari,"Liberatha Kallat, Mukesh Yadav, Dinesh Nagpal","$2,000,000",Seed


In [2004]:
data1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1209 entries, 0 to 1208
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company_Brand  1209 non-null   object 
 1   Founded        1208 non-null   float64
 2   HeadQuarter    1208 non-null   object 
 3   Sector         1209 non-null   object 
 4   What_it_does   1209 non-null   object 
 5   Founders       1205 non-null   object 
 6   Investor       1147 non-null   object 
 7   Amount         1206 non-null   object 
 8   Stage          781 non-null    object 
dtypes: float64(1), object(8)
memory usage: 85.1+ KB


In [2005]:
data1.columns

Index(['Company_Brand', 'Founded', 'HeadQuarter', 'Sector', 'What_it_does',
       'Founders', 'Investor', 'Amount', 'Stage'],
      dtype='object')

In [2006]:
data1.describe()

Unnamed: 0,Founded
count,1208.0
mean,2016.655629
std,4.517364
min,1963.0
25%,2015.0
50%,2018.0
75%,2020.0
max,2021.0


In [2007]:
data1.isna().sum()

Company_Brand      0
Founded            1
HeadQuarter        1
Sector             0
What_it_does       0
Founders           4
Investor          62
Amount             3
Stage            428
dtype: int64

In [2008]:
# Overview of the HeadQuarter column
data1['HeadQuarter'].unique()

array(['Bangalore', 'Mumbai', 'Gurugram', 'New Delhi', 'Hyderabad',
       'Jaipur', 'Ahmadabad', 'Chennai', None,
       'Small Towns, Andhra Pradesh', 'Goa', 'Rajsamand', 'Ranchi',
       'Faridabad, Haryana', 'Gujarat', 'Pune', 'Thane', 'Computer Games',
       'Cochin', 'Noida', 'Chandigarh', 'Gurgaon', 'Vadodara',
       'Food & Beverages', 'Pharmaceuticals\t#REF!', 'Gurugram\t#REF!',
       'Kolkata', 'Ahmedabad', 'Mohali', 'Haryana', 'Indore', 'Powai',
       'Ghaziabad', 'Nagpur', 'West Bengal', 'Patna', 'Samsitpur',
       'Lucknow', 'Telangana', 'Silvassa', 'Thiruvananthapuram',
       'Faridabad', 'Roorkee', 'Ambernath', 'Panchkula', 'Surat',
       'Coimbatore', 'Andheri', 'Mangalore', 'Telugana', 'Bhubaneswar',
       'Kottayam', 'Beijing', 'Panaji', 'Satara', 'Orissia', 'Jodhpur',
       'New York', 'Santra', 'Mountain View, CA', 'Trivandrum',
       'Jharkhand', 'Kanpur', 'Bhilwara', 'Guwahati',
       'Online Media\t#REF!', 'Kochi', 'London',
       'Information Technol

In [2009]:
# Replace the '\t#REF!' artifact in HeadQuarter values (here with the literal text 'NaN', as the output shows)

data1['HeadQuarter'] = data1['HeadQuarter'].replace({'\t#REF!': 'NaN'}, regex=True)
data1['HeadQuarter'].unique()

array(['Bangalore', 'Mumbai', 'Gurugram', 'New Delhi', 'Hyderabad',
       'Jaipur', 'Ahmadabad', 'Chennai', None,
       'Small Towns, Andhra Pradesh', 'Goa', 'Rajsamand', 'Ranchi',
       'Faridabad, Haryana', 'Gujarat', 'Pune', 'Thane', 'Computer Games',
       'Cochin', 'Noida', 'Chandigarh', 'Gurgaon', 'Vadodara',
       'Food & Beverages', 'PharmaceuticalsNaN', 'GurugramNaN', 'Kolkata',
       'Ahmedabad', 'Mohali', 'Haryana', 'Indore', 'Powai', 'Ghaziabad',
       'Nagpur', 'West Bengal', 'Patna', 'Samsitpur', 'Lucknow',
       'Telangana', 'Silvassa', 'Thiruvananthapuram', 'Faridabad',
       'Roorkee', 'Ambernath', 'Panchkula', 'Surat', 'Coimbatore',
       'Andheri', 'Mangalore', 'Telugana', 'Bhubaneswar', 'Kottayam',
       'Beijing', 'Panaji', 'Satara', 'Orissia', 'Jodhpur', 'New York',
       'Santra', 'Mountain View, CA', 'Trivandrum', 'Jharkhand', 'Kanpur',
       'Bhilwara', 'Guwahati', 'Online MediaNaN', 'Kochi', 'London',
       'Information Technology & Services', 'T
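
The output above contains values such as 'PharmaceuticalsNaN' and 'GurugramNaN' because the replacement text is the literal string 'NaN'. If the goal is to drop the spreadsheet artifact and keep the rest of the value, replacing it with an empty string is one option. A minimal sketch, assuming a hypothetical helper `strip_ref_suffix`:

```python
def strip_ref_suffix(value):
    """Remove the tab + '#REF!' spreadsheet artifact from a headquarters value."""
    if not isinstance(value, str):
        return value  # leave None/NaN untouched
    return value.replace("\t#REF!", "").strip()
```

The pandas equivalent would be `data1['HeadQuarter'].str.replace('\t#REF!', '', regex=False)`.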

In [2010]:
# The unique values show that sector names such as Food & Beverages, Pharmaceuticals, and
# Information Technology & Services appear in the HeadQuarter column. Let's extract those rows.

data1[data1['HeadQuarter'].str.contains(
    'Information Technology & Services|Online Media|Pharmaceuticals|Food & Beverages|Computer Games',
    case=False, na=False)]

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
98,FanPlay,2020.0,Computer Games,Computer Games,A real money game app specializing in trivia g...,YC W21,"Pritesh Kumar, Bharat Gupta",Upsparks,$1200000
111,FanPlay,2020.0,Computer Games,Computer Games,A real money game app specializing in trivia g...,YC W21,"Pritesh Kumar, Bharat Gupta",Upsparks,$1200000
241,MasterChow,2020.0,Food & Beverages,Hauz Khas,A ready-to-cook Asian cuisine brand,"Vidur Kataria, Sidhanth Madan",WEH Ventures,$461000,Seed
242,Fullife Healthcare,2009.0,PharmaceuticalsNaN,Primary Business is Development and Manufactur...,Varun Khanna,Morgan Stanley Private Equity Asia,$22000000,Series C,
255,MasterChow,2020.0,Food & Beverages,Hauz Khas,A ready-to-cook Asian cuisine brand,"Vidur Kataria, Sidhanth Madan",WEH Ventures,$461000,Seed
256,Fullife Healthcare,2009.0,PharmaceuticalsNaN,Primary Business is Development and Manufactur...,Varun Khanna,Morgan Stanley Private Equity Asia,$22000000,Series C,
1100,Sochcast,2020.0,Online MediaNaN,Sochcast is an Audio experiences company that ...,"CA Harvinderjit Singh Bhatia, Garima Surana, A...","Vinners, Raj Nayak, Amritaanshu Agrawal",$Undisclosed,,
1176,Peak,2014.0,Information Technology & Services,"Manchester, Greater Manchester",Peak helps the world's smartest companies put ...,Atul Sharma,SoftBank Vision Fund 2,$75000000,Series C


In [2011]:
# Replace sector names found in the HeadQuarter column with NaN (exact matches only)
data1['HeadQuarter'] = data1['HeadQuarter'].replace(['Pharmaceuticals', 'Computer Games', 
    'Food & Beverages', 'Online Media', 'Information Technology & Services'], np.nan)
data1["HeadQuarter"].unique()

array(['Bangalore', 'Mumbai', 'Gurugram', 'New Delhi', 'Hyderabad',
       'Jaipur', 'Ahmadabad', 'Chennai', None,
       'Small Towns, Andhra Pradesh', 'Goa', 'Rajsamand', 'Ranchi',
       'Faridabad, Haryana', 'Gujarat', 'Pune', 'Thane', nan, 'Cochin',
       'Noida', 'Chandigarh', 'Gurgaon', 'Vadodara', 'PharmaceuticalsNaN',
       'GurugramNaN', 'Kolkata', 'Ahmedabad', 'Mohali', 'Haryana',
       'Indore', 'Powai', 'Ghaziabad', 'Nagpur', 'West Bengal', 'Patna',
       'Samsitpur', 'Lucknow', 'Telangana', 'Silvassa',
       'Thiruvananthapuram', 'Faridabad', 'Roorkee', 'Ambernath',
       'Panchkula', 'Surat', 'Coimbatore', 'Andheri', 'Mangalore',
       'Telugana', 'Bhubaneswar', 'Kottayam', 'Beijing', 'Panaji',
       'Satara', 'Orissia', 'Jodhpur', 'New York', 'Santra',
       'Mountain View, CA', 'Trivandrum', 'Jharkhand', 'Kanpur',
       'Bhilwara', 'Guwahati', 'Online MediaNaN', 'Kochi', 'London',
       'The Nilgiris', 'Gandhinagar'], dtype=object)

In [2012]:
#replace misspelled stage names with the appropriate stage names 
data1['Stage'] = data1['Stage'].replace({'Seed+':'Seed','Seies A':'Series A'})
data1['Stage'].unique()

array(['Pre-series A', None, 'Series D', 'Series C', 'Seed', 'Series B',
       'Series E', 'Pre-seed', 'Series A', 'Pre-series B', 'Debt',
       '$1200000', 'Bridge', 'Series F2', 'Series A+', 'Series G',
       'Series F', 'Series H', 'Series B3', 'PE', 'Series F1',
       'Pre-series A1', '$300000', 'Early seed', 'Series D1', '$6000000',
       '$1000000', 'Pre-series', 'Series A2', 'Series I'], dtype=object)

In [2014]:
# Replace the string "nan" with np.nan
data1["Stage"] = data1["Stage"].replace("nan", np.nan)
# Extract the rows with missing values in the Stage column
data1[data1['Stage'].isna()]

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
1,upGrad,2015.0,Mumbai,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management","$120,000,000",
5,Urban Company,2014.0,New Delhi,Home services,Urban Company (Formerly UrbanClap) is a home a...,"Abhiraj Singh Bhal, Raghav Chandra, Varun Khaitan",Vy Capital,"$188,000,000",
6,Comofi Medtech,2018.0,Bangalore,HealthTech,Comofi Medtech is a healthcare robotics startup.,Gururaj KB,"CIIE.CO, KIIT-TBI","$200,000",
8,Vitra.ai,2020.0,Bangalore,Tech Startup,Vitra.ai is an AI-based video translation plat...,Akash Nidhi PS,Inflexor Ventures,Undisclosed,
9,Taikee,2010.0,Mumbai,E-commerce,"Taikee is the ISO-certified, B2B e-commerce pl...","Nidhi Ramachandran, Sachin Chhabra",,"$1,000,000",
...,...,...,...,...,...,...,...,...,...
1172,Peppermint,2019.0,Pune,Industrial Automation,Intelligent Housekeeping Robots for public and...,"Runal Dahiwade, Miraj C Vora","Venture Catalysts, Indian Angel Network",$600000,
1182,Sugar.fit,2021.0,Bangalore,Health,"Innovative technology, compassionate diabetes ...","Shivtosh Kumar, Madan Somasundaram","Cure.fit, Endiya Partners, Tanglin Venture",$10000000,
1192,Geniemode,2021.0,Gurugram,B2B,Transforming global sourcing for retailers & s...,"Amit Sharma, Tanuj Gangwani",Info Edge Ventures,$2000000,
1193,Sapio Analytics,2019.0,Mumbai,Computer Software,Sapio helps government create policies driven ...,"Hardik Somani, Ashwin Srivastava, Shripal Jain...","Rachit Poddar, Rajesh Gupta",$Undisclosed,


In [2015]:
# The Stage column contains amounts, so extract the rows where Stage holds a figure,
# using the $ sign as the condition.
data1[data1['Stage'].str.contains(r'\$', na=False)]

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
98,FanPlay,2020.0,,Computer Games,A real money game app specializing in trivia g...,YC W21,"Pritesh Kumar, Bharat Gupta",Upsparks,$1200000
111,FanPlay,2020.0,,Computer Games,A real money game app specializing in trivia g...,YC W21,"Pritesh Kumar, Bharat Gupta",Upsparks,$1200000
538,Little Leap,2020.0,New Delhi,EdTech,Soft Skills that make Smart Leaders,Holistic Development Programs for children in ...,Vishal Gupta,ah! Ventures,$300000
551,BHyve,2020.0,Mumbai,Human Resources,A Future of Work Platform for diffusing Employ...,Backed by 100x.VC,"Omkar Pandharkame, Ketaki Ogale","ITO Angel Network, LetsVenture",$300000
674,MYRE Capital,2020.0,Mumbai,Commercial Real Estate,Democratising Real Estate Ownership,Own rent yielding commercial properties,Aryaman Vir,,$6000000
677,Saarthi Pedagogy,2015.0,Ahmadabad,EdTech,"India's fastest growing Pedagogy company, serv...",Pedagogy,Sushil Agarwal,"JITO Angel Network, LetsVenture",$1000000


In [2016]:
# Swap the Stage and Amount values on rows where a dollar amount ended up in the Stage column
dollar = data1['Stage'].str.contains(r'\$', na=False)
data1.loc[dollar, ['Stage', 'Amount']] = data1.loc[dollar, ['Amount', 'Stage']].to_numpy()

In [2017]:
# Check which values remain in the Stage column after the swap
data1['Stage'].unique()

array(['Pre-series A', None, 'Series D', 'Series C', 'Seed', 'Series B',
       'Series E', 'Pre-seed', 'Series A', 'Pre-series B', 'Debt',
       'Upsparks', 'Bridge', 'Series F2', 'Series A+', 'Series G',
       'Series F', 'Series H', 'Series B3', 'PE', 'Series F1',
       'Pre-series A1', 'ah! Ventures', 'ITO Angel Network, LetsVenture',
       'Early seed', 'Series D1', 'JITO Angel Network, LetsVenture',
       'Pre-series', 'Series A2', 'Series I'], dtype=object)

In [2018]:
# Replace the stray investor names, and 'PE', with NaN
data1['Stage'] = data1['Stage'].replace(["PE", "ah! Ventures", "ITO Angel Network, LetsVenture",
                     "JITO Angel Network, LetsVenture", "Upsparks"], np.nan)

In [2019]:
#overview of the sector column
data1['Sector'].unique()

array(['AI startup', 'EdTech', 'B2B E-commerce', 'FinTech',
       'Home services', 'HealthTech', 'Tech Startup', 'E-commerce',
       'B2B service', 'Helathcare', 'Renewable Energy', 'Electronics',
       'IT startup', 'Food & Beverages', 'Aeorspace', 'Deep Tech',
       'Dating', 'Gaming', 'Robotics', 'Retail', 'Food', 'Oil and Energy',
       'AgriTech', 'Telecommuncation', 'Milk startup', 'AI Chatbot', 'IT',
       'Logistics', 'Hospitality', 'Fashion', 'Marketing',
       'Transportation', 'LegalTech', 'Food delivery', 'Automotive',
       'SaaS startup', 'Fantasy sports', 'Video communication',
       'Social Media', 'Skill development', 'Rental', 'Recruitment',
       'HealthCare', 'Sports', 'Computer Games', 'Consumer Goods',
       'Information Technology', 'Apparel & Fashion',
       'Logistics & Supply Chain', 'Healthtech', 'Healthcare',
       'SportsTech', 'HRTech', 'Wine & Spirits',
       'Mechanical & Industrial Engineering', 'Spiritual',
       'Financial Services', 'I

In [2020]:
#overview of the investor column
data1['Investor'].unique()

array(['BEENEXT, Entrepreneur First',
       'Unilazer Ventures, IIFL Asset Management',
       'GSV Ventures, Westbridge Capital', 'CDC Group, IDG Capital',
       'Liberatha Kallat, Mukesh Yadav, Dinesh Nagpal', 'Vy Capital',
       'CIIE.CO, KIIT-TBI', 'Inflection Point Ventures',
       'Inflexor Ventures', None,
       '9Unicorns Accelerator Fund, Metaform Ventures',
       'SucSEED Indovation, IIM Calcutta Innovation Park',
       'Safe Planet Medicare', 'Impact Partners, C4D Partners',
       'Tiger Global Management, InnoVen Capital', 'Novo Tellus Capital',
       'Raintree Family Office, ADB arm',
       'Mumbai Angels, Narendra Shyamsukha', 'Paradigm, Kunal Shah',
       'Matrix Partners India, GIC', 'Chiratae Ventures, JAFCO Asia',
       'Mumbai Angels Network, Expert DOJO', 'GVFL',
       'Kotak Mahindra Bank, FMO', 'Kalaari Capital',
       'NB Ventures, IAN Fund',
       'Sequoia Capital India, Hummingbird Ventures',
       'Gaurav Munjal, Snehil Khanor', 'JITO Angel Net

In [2021]:
# Show the rows where the Investor column contains a $ sign (misplaced amounts)
data1[data1['Investor'].str.contains(r'\$', na=False)]

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
242,Fullife Healthcare,2009.0,PharmaceuticalsNaN,Primary Business is Development and Manufactur...,Varun Khanna,Morgan Stanley Private Equity Asia,$22000000,Series C,
256,Fullife Healthcare,2009.0,PharmaceuticalsNaN,Primary Business is Development and Manufactur...,Varun Khanna,Morgan Stanley Private Equity Asia,$22000000,Series C,
257,MoEVing,2021.0,GurugramNaN,MoEVing is India's only Electric Mobility focu...,"Vikash Mishra, Mragank Jain","Anshuman Maheshwary, Dr Srihari Raju Kalidindi",$5000000,Seed,
545,AdmitKard,2016.0,Noida,EdTech,A tech solution for end to end career advisory...,"Vamsi Krishna, Pulkit Jain, Gaurav Munjal\t#REF!",$1000000,Pre-series A,
1100,Sochcast,2020.0,Online MediaNaN,Sochcast is an Audio experiences company that ...,"CA Harvinderjit Singh Bhatia, Garima Surana, A...","Vinners, Raj Nayak, Amritaanshu Agrawal",$Undisclosed,,


In [2022]:
# Replace the misplaced values (a URL and stray amounts) in the Investor column with NaN
data1['Investor'] = data1['Investor'].replace(["http://100x.vc/", "$Undisclosed", "$1000000", "$5000000", "$22000000", "2000000"], np.nan)
data1['Investor'].unique()

array(['BEENEXT, Entrepreneur First',
       'Unilazer Ventures, IIFL Asset Management',
       'GSV Ventures, Westbridge Capital', 'CDC Group, IDG Capital',
       'Liberatha Kallat, Mukesh Yadav, Dinesh Nagpal', 'Vy Capital',
       'CIIE.CO, KIIT-TBI', 'Inflection Point Ventures',
       'Inflexor Ventures', None,
       '9Unicorns Accelerator Fund, Metaform Ventures',
       'SucSEED Indovation, IIM Calcutta Innovation Park',
       'Safe Planet Medicare', 'Impact Partners, C4D Partners',
       'Tiger Global Management, InnoVen Capital', 'Novo Tellus Capital',
       'Raintree Family Office, ADB arm',
       'Mumbai Angels, Narendra Shyamsukha', 'Paradigm, Kunal Shah',
       'Matrix Partners India, GIC', 'Chiratae Ventures, JAFCO Asia',
       'Mumbai Angels Network, Expert DOJO', 'GVFL',
       'Kotak Mahindra Bank, FMO', 'Kalaari Capital',
       'NB Ventures, IAN Fund',
       'Sequoia Capital India, Hummingbird Ventures',
       'Gaurav Munjal, Snehil Khanor', 'JITO Angel Net

In [2023]:
#overview of the Amount column
data1['Amount'].unique()

array(['$1,200,000', '$120,000,000', '$30,000,000', '$51,000,000',
       '$2,000,000', '$188,000,000', '$200,000', 'Undisclosed',
       '$1,000,000', '$3,000,000', '$100,000', '$700,000', '$9,000,000',
       '$40,000,000', '$49,000,000', '$400,000', '$300,000',
       '$25,000,000', '$160,000,000', '$150,000', '$1,800,000',
       '$5,000,000', '$850,000', '$53,000,000', '$500,000', '$1,100,000',
       '$6,000,000', '$800,000', '$10,000,000', '$21,000,000',
       '$7,500,000', '$26,000,000', '$7,400,000', '$1,500,000',
       '$600,000', '$800,000,000', '$17,000,000', '$3,500,000',
       '$15,000,000', '$215,000,000', '$2,500,000', '$350,000,000',
       '$5,500,000', '$83,000,000', '$110,000,000', '$500,000,000',
       '$65,000,000', '$150,000,000,000', '$300,000,000', '$2,200,000',
       '$35,000,000', '$140,000,000', '$4,000,000', '$13,000,000', None,
       '$Undisclosed', '$2000000', '$800000', '$6000000', '$2500000',
       '$9500000', '$13000000', '$5000000', '$8000000',

In [2024]:
data1['Amount'] = data1['Amount'].astype(str)


In [2025]:

# Remove the $ sign and thousands separators literally, then convert to float; non-numeric values become NaN
data1['Amount'] = pd.to_numeric(data1['Amount'].str.replace('$', '', regex=False).str.replace(',', '', regex=False), errors='coerce')
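
The cleaning step above can be wrapped in a small helper. This is a minimal sketch, assuming amounts arrive as strings such as `$1,200,000`, `$Undisclosed`, or `None`; the name `clean_amount` and the sample values are illustrative and not part of the notebook.

```python
import pandas as pd


def clean_amount(series: pd.Series) -> pd.Series:
    """Strip '$' and ',' literally, then coerce anything non-numeric to NaN."""
    cleaned = (
        series.astype(str)
        .str.replace("$", "", regex=False)  # regex=False: '$' is a literal, not an anchor
        .str.replace(",", "", regex=False)
    )
    return pd.to_numeric(cleaned, errors="coerce")


# Illustrative sample, not taken from the dataset
sample = pd.Series(["$1,200,000", "$Undisclosed", None, "$600000"])
amounts = clean_amount(sample)
```

Passing `regex=False` keeps the behaviour the same across pandas versions, since `$` would otherwise be read as an end-of-string anchor wherever the pattern is treated as a regular expression.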


In [2026]:
data1.columns

Index(['Company_Brand', 'Founded', 'HeadQuarter', 'Sector', 'What_it_does',
       'Founders', 'Investor', 'Amount', 'Stage'],
      dtype='object')

In [2027]:
# Amount is numeric at this point, so select the row by company name rather than by a string in Amount
data1.loc[data1["Company_Brand"] == "Saarthi Pedagogy", ["Amount", "Stage"]] = [1000000, np.nan]
data1.loc[data1["Company_Brand"] == "Saarthi Pedagogy"]

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
677,Saarthi Pedagogy,2015.0,Ahmadabad,EdTech,"India's fastest growing Pedagogy company, serv...",Pedagogy,Sushil Agarwal,1000000.0,


In [2028]:
data1.loc[data1["Company_Brand"] == "MoEVing", ["Amount", "Stage", "Investor"]] = ["$5000000", "Seed", np.nan]
data1.loc[data1["Company_Brand"] == "Godamwale", ["Amount", "Stage", "Investor"]] = ["1000000", "Seed", np.nan]
data1.loc[data1["Company_Brand"] == "MoEVing"]

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
257,MoEVing,2021.0,GurugramNaN,MoEVing is India's only Electric Mobility focu...,"Vikash Mishra, Mragank Jain","Anshuman Maheshwary, Dr Srihari Raju Kalidindi",,$5000000,Seed


In [2029]:
data1.loc[data1["Company_Brand"] == "Godamwale"]

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
1148,Godamwale,2016.0,Mumbai,Logistics & Supply Chain,Godamwale is tech enabled integrated logistics...,"Basant Kumar, Vivek Tiwari, Ranbir Nandan",,1000000,Seed


In [2030]:
# Fullife Healthcare: the source row lists $22000000, so restore that amount while realigning the shifted columns
data1.loc[data1["Company_Brand"] == "Fullife Healthcare", ["HeadQuarter","Amount", "Stage", "Investor"]] = [np.nan, "$22000000", "Series C", np.nan]
data1.loc[data1["Company_Brand"] == "Fullife Healthcare"]

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
242,Fullife Healthcare,2009.0,,Primary Business is Development and Manufactur...,Varun Khanna,Morgan Stanley Private Equity Asia,,$22000000,Series C
256,Fullife Healthcare,2009.0,,Primary Business is Development and Manufactur...,Varun Khanna,Morgan Stanley Private Equity Asia,,$22000000,Series C


In [2031]:

# Founded: fill missing values with the median year
data1['Founded'] = data1['Founded'].fillna(data1['Founded'].median())

# HeadQuarter: label missing values explicitly
data1['HeadQuarter'] = data1['HeadQuarter'].fillna('HeadQuarter Unknown')

# Founders: fill missing values with "Unknown Founders"
data1['Founders'] = data1['Founders'].fillna('Unknown Founders')

# Investor: fill missing values with "Various Investors"
data1['Investor'] = data1['Investor'].fillna('Various Investors')

# Amount: replace cells containing a space with 0
data1["Amount"] = data1["Amount"].replace(" ", 0, regex=True)
# Remove commas from the amounts
data1["Amount"] = data1["Amount"].replace(",", "", regex=True)
# Replace "Undisclosed" and its misspelt variants with NaN ($ is escaped so it matches literally)
data1["Amount"] = data1["Amount"].replace(
    ["Undisclosed", "undisclosed", "Undiclsosed", "Undislosed", r"\$undisclosed"],
    np.nan, regex=True)

# Stage: fill missing values with the most frequent stage
data1['Stage'] = data1['Stage'].fillna(data1['Stage'].mode()[0])


data1.head()

Unnamed: 0,Company_Brand,Founded,HeadQuarter,Sector,What_it_does,Founders,Investor,Amount,Stage
0,Unbox Robotics,2019.0,Bangalore,AI startup,Unbox Robotics builds on-demand AI-driven ware...,"Pramod Ghadge, Shahid Memon","BEENEXT, Entrepreneur First",1200000.0,Pre-series A
1,upGrad,2015.0,Mumbai,EdTech,UpGrad is an online higher education platform.,"Mayank Kumar, Phalgun Kompalli, Ravijot Chugh,...","Unilazer Ventures, IIFL Asset Management",120000000.0,Seed
2,Lead School,2012.0,Mumbai,EdTech,LEAD School offers technology based school tra...,"Smita Deorah, Sumeet Mehta","GSV Ventures, Westbridge Capital",30000000.0,Series D
3,Bizongo,2015.0,Mumbai,B2B E-commerce,Bizongo is a business-to-business online marke...,"Aniket Deb, Ankit Tomar, Sachin Agrawal","CDC Group, IDG Capital",51000000.0,Series C
4,FypMoney,2021.0,Gurugram,FinTech,"FypMoney is Digital NEO Bank for Teenagers, em...",Kapil Banwari,"Liberatha Kallat, Mukesh Yadav, Dinesh Nagpal",2000000.0,Seed


In [2032]:
# Convert the 'Amount' column to string (missing values become the text 'nan')
data1["Amount"] = data1["Amount"].apply(str)

# Remove the "$" and replace empty or whitespace-only strings with NaN
data1["Amount"] = data1["Amount"].apply(lambda x: str(x).replace("$",""))
data1["Amount"] = data1["Amount"].replace(r"^\s*$", np.nan, regex=True)

In [2033]:
data1.isna().sum()

Company_Brand    0
Founded          0
HeadQuarter      0
Sector           0
What_it_does     0
Founders         0
Investor         0
Amount           0
Stage            0
dtype: int64

In [2034]:
# Confirm the changes by checking the unique values in the column
data1['Amount'].unique()

array(['1200000.0', '120000000.0', '30000000.0', '51000000.0',
       '2000000.0', '188000000.0', '200000.0', 'nan', '1000000.0',
       '3000000.0', '100000.0', '700000.0', '9000000.0', '40000000.0',
       '49000000.0', '400000.0', '300000.0', '25000000.0', '160000000.0',
       '150000.0', '1800000.0', '5000000.0', '850000.0', '53000000.0',
       '500000.0', '1100000.0', '6000000.0', '800000.0', '10000000.0',
       '21000000.0', '7500000.0', '26000000.0', '7400000.0', '1500000.0',
       '600000.0', '800000000.0', '17000000.0', '3500000.0', '15000000.0',
       '215000000.0', '2500000.0', '350000000.0', '5500000.0',
       '83000000.0', '110000000.0', '500000000.0', '65000000.0',
       '150000000000.0', '300000000.0', '2200000.0', '35000000.0',
       '140000000.0', '4000000.0', '13000000.0', '9500000.0', '8000000.0',
       '12000000.0', '1700000.0', '150000000.0', '100000000.0',
       '225000000.0', '6700000.0', '1300000.0', '20000000.0', '250000.0',
       '52000000.0', '3800

In [2035]:
# Take a closer look at the duplicates, then drop them
data1[data1.duplicated(keep=False)].sort_values(by='Sector').head(25)
data1.drop_duplicates(inplace=True)

In [2036]:
data2 = pd.read_csv("startup_funding2018.csv")
data2.head()

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company
0,TheCollegeFever,"Brand Marketing, Event Promotion, Marketing, S...",Seed,250000,"Bangalore, Karnataka, India","TheCollegeFever is a hub for fun, fiesta and f..."
1,Happy Cow Dairy,"Agriculture, Farming",Seed,"₹40,000,000","Mumbai, Maharashtra, India",A startup which aggregates milk from dairy far...
2,MyLoanCare,"Credit, Financial Services, Lending, Marketplace",Series A,"₹65,000,000","Gurgaon, Haryana, India",Leading Online Loans Marketplace in India
3,PayMe India,"Financial Services, FinTech",Angel,2000000,"Noida, Uttar Pradesh, India",PayMe India is an innovative FinTech organizat...
4,Eunimart,"E-Commerce Platforms, Retail, SaaS",Seed,—,"Hyderabad, Andhra Pradesh, India",Eunimart is a one stop solution for merchants ...


In [2037]:
data2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 526 entries, 0 to 525
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Company Name   526 non-null    object
 1   Industry       526 non-null    object
 2   Round/Series   526 non-null    object
 3   Amount         526 non-null    object
 4   Location       526 non-null    object
 5   About Company  526 non-null    object
dtypes: object(6)
memory usage: 24.8+ KB


In [2038]:
data2.shape

(526, 6)

In [2039]:
data2.dtypes

Company Name     object
Industry         object
Round/Series     object
Amount           object
Location         object
About Company    object
dtype: object

In [2040]:
data2.describe().T

Unnamed: 0,count,unique,top,freq
Company Name,526,525,TheCollegeFever,2
Industry,526,405,—,30
Round/Series,526,21,Seed,280
Amount,526,198,—,148
Location,526,50,"Bangalore, Karnataka, India",102
About Company,526,524,"TheCollegeFever is a hub for fun, fiesta and f...",2


In [2041]:
data2.isna().sum()

Company Name     0
Industry         0
Round/Series     0
Amount           0
Location         0
About Company    0
dtype: int64

In [2042]:
# Find the rows where a missing Industry is recorded as an em dash (—)
data2[data2['Industry']=='—']

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company
58,MissMalini Entertainment,—,Seed,"₹104,000,000","Mumbai, Maharashtra, India",MissMalini Entertainment is a multi-platform n...
105,Jagaran Microfin,—,Debt Financing,"₹550,000,000","Kolkata, West Bengal, India",Jagaran Microfin is a Microfinance institution...
121,FLEECA,—,Seed,—,"Jaipur, Rajasthan, India",FLEECA is a Tyre Care Provider company.
146,WheelsEMI,—,Series B,"$14,000,000","Pune, Maharashtra, India","WheelsEMI is the brand name of NBFC, WheelsEMI..."
153,Fric Bergen,—,Venture - Series Unknown,—,"Alwar, Rajasthan, India",Fric Bergen is a leader in the specialty food ...
174,Deftouch,—,Seed,—,"Bangalore, Karnataka, India",Deftouch is a mobile game development company ...
181,Corefactors,—,Seed,—,"Bangalore, Karnataka, India","Corefactors is a leading campaign management, ..."
210,Cell Propulsion,—,Seed,"₹7,000,000","Bangalore, Karnataka, India",Cell Propulsion is an electric mobility startu...
230,Flathalt,—,Angel,50000,"Gurgaon, Haryana, India",FInd your Customized Home here.
235,dishq,—,Seed,400000,"Bengaluru, Karnataka, India",dishq leverages food science and machine learn...


In [2043]:
# Replace em dashes (—) in all columns with NaN
data2 = data2.replace('—', np.nan)

In [2044]:
# Overview of the Industry column (.unique is not called here, so the full Series is shown instead of the distinct values)
data2['Industry'].unique

<bound method Series.unique of 0      Brand Marketing, Event Promotion, Marketing, S...
1                                   Agriculture, Farming
2       Credit, Financial Services, Lending, Marketplace
3                            Financial Services, FinTech
4                     E-Commerce Platforms, Retail, SaaS
                             ...                        
521     B2B, Business Development, Internet, Marketplace
522                                      Tourism, Travel
523           Food and Beverage, Food Delivery, Internet
524                               Information Technology
525           Biotechnology, Health Care, Pharmaceutical
Name: Industry, Length: 526, dtype: object>

In [2045]:
# Most Industry entries list more than one industry; keep only the first
data2['Industry'] = data2['Industry'].str.split(',').str[0]
data2['Industry'] 

0             Brand Marketing
1                 Agriculture
2                      Credit
3          Financial Services
4        E-Commerce Platforms
                ...          
521                       B2B
522                   Tourism
523         Food and Beverage
524    Information Technology
525             Biotechnology
Name: Industry, Length: 526, dtype: object

In [2046]:
# Check for unique values
data2['Round/Series'].unique()

array(['Seed', 'Series A', 'Angel', 'Series B', 'Pre-Seed',
       'Private Equity', 'Venture - Series Unknown', 'Grant',
       'Debt Financing', 'Post-IPO Debt', 'Series H', 'Series C',
       'Series E', 'Corporate Round', 'Undisclosed',
       'https://docs.google.com/spreadsheets/d/1x9ziNeaz6auNChIHnMI8U6kS7knTr3byy_YBGfQaoUA/edit#gid=1861303593',
       'Series D', 'Secondary Market', 'Post-IPO Equity',
       'Non-equity Assistance', 'Funding Round'], dtype=object)

In [2047]:
# Drop the row that has google docs link
data2 = data2[data2['Round/Series'] != 'https://docs.google.com/spreadsheets/d/1x9ziNeaz6auNChIHnMI8U6kS7knTr3byy_YBGfQaoUA/edit#gid=1861303593']

# Reset the row index
data2.reset_index(drop=True, inplace=True)

In [2048]:
data2.columns

Index(['Company Name', 'Industry', 'Round/Series', 'Amount', 'Location',
       'About Company'],
      dtype='object')

In [2049]:
data2.shape

(525, 6)

In [2050]:
data2['Amount'] = data2['Amount'].astype(str)



In [2051]:
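# Create a separate dataframe for the rupee amounts (copied to avoid SettingWithCopyWarning)
df_rupees = data2[data2['Amount'].astype(str).str.contains('₹')].copy()
# Convert rupees to dollars at the fixed rate of 0.0146 USD per INR
df_rupees['Amount'] = df_rupees['Amount'].apply(lambda x: x.replace('₹','').replace(',','')).astype('float')
df_rupees['Amount'] = df_rupees['Amount']*0.0146

# Update the original dataframe.
# Caution: this mask compares data2's original strings with the converted floats, so it
# selects no rows and the rupee figures in data2 stay unconverted.
data2.loc[data2['Amount'].isin(df_rupees['Amount'])] = df_rupees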
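
One way to apply the rupee-to-dollar conversion is a boolean mask, so that each converted value is written back to the row it came from. The sketch below assumes the same fixed rate of 0.0146; the helper name `rupees_to_dollars` and the sample values are illustrative and not part of the notebook.

```python
import pandas as pd

INR_TO_USD = 0.0146  # fixed rate, same figure as in the cell above


def rupees_to_dollars(amount: pd.Series, rate: float = INR_TO_USD) -> pd.Series:
    """Convert '₹...' strings to dollar floats; leave every other value untouched."""
    amount = amount.astype(str)
    is_rupee = amount.str.contains("₹", regex=False)
    converted = (
        amount[is_rupee]
        .str.replace("₹", "", regex=False)
        .str.replace(",", "", regex=False)
        .astype(float)
        * rate
    )
    result = amount.astype(object)      # object dtype can hold both strings and floats
    result.loc[is_rupee] = converted    # index alignment puts each value on its own row
    return result


# Illustrative sample, not taken from the dataset
sample = pd.Series(["250000", "₹40,000,000", "$14,000,000", "nan"])
converted_sample = rupees_to_dollars(sample)
```

Because the mask is computed once from the original strings, it still identifies the rupee rows after the values themselves have changed.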

In [2052]:
# Convert the 'Amount' column to string
data2["Amount"] = data2["Amount"].apply(str)

# Remove the "$" and replace empty or whitespace-only strings with NaN
data2["Amount"] = data2["Amount"].apply(lambda x: str(x).replace("$",""))
data2["Amount"] = data2["Amount"].replace(r"^\s*$", np.nan, regex=True)

In [2053]:
# Ensure the column is of string type
data2["Amount"] = data2["Amount"].astype(str)
# Remove the rupee sign (the figure itself is left unchanged)
data2['Amount'] = data2['Amount'].str.replace('₹', '', regex=False)
# Remove commas
data2['Amount'] = data2['Amount'].str.replace(',', '', regex=False)
data2['Amount'] = data2['Amount'].astype(float)

In [2054]:
# Replace all NaN values with the column mean
data2['Amount'] = data2['Amount'].fillna(data2['Amount'].mean())

In [2055]:
# Overview of the Location column (.unique is not called here, so the full Series is shown)
data2['Location'].unique

<bound method Series.unique of 0           Bangalore, Karnataka, India
1            Mumbai, Maharashtra, India
2               Gurgaon, Haryana, India
3           Noida, Uttar Pradesh, India
4      Hyderabad, Andhra Pradesh, India
                     ...               
520         Bangalore, Karnataka, India
521             Haryana, Haryana, India
522          Mumbai, Maharashtra, India
523          Mumbai, Maharashtra, India
524          Chennai, Tamil Nadu, India
Name: Location, Length: 525, dtype: object>

In [2056]:
# Keep only the city from the Location column
data2['Location'] = data2['Location'].str.split(',').str[0]
data2.head()

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company
0,TheCollegeFever,Brand Marketing,Seed,250000.0,Bangalore,"TheCollegeFever is a hub for fun, fiesta and f..."
1,Happy Cow Dairy,Agriculture,Seed,40000000.0,Mumbai,A startup which aggregates milk from dairy far...
2,MyLoanCare,Credit,Series A,65000000.0,Gurgaon,Leading Online Loans Marketplace in India
3,PayMe India,Financial Services,Angel,2000000.0,Noida,PayMe India is an innovative FinTech organizat...
4,Eunimart,E-Commerce Platforms,Seed,239797400.0,Hyderabad,Eunimart is a one stop solution for merchants ...


In [2057]:
# Standardise city names: Bengaluru -> Bangalore, New Delhi -> Delhi
data2['Location'] = data2['Location'].replace({'Bengaluru': 'Bangalore', 'New Delhi': 'Delhi'})
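
The city clean-up above can be generalised with an alias map, sketched below. The entries in `CITY_ALIASES` and the helper name `normalise_city` are assumptions made for illustration; the map would need extending for other variants in the data.

```python
import pandas as pd

# Illustrative alias map (an assumption, not part of the notebook)
CITY_ALIASES = {
    "Bengaluru": "Bangalore",
    "Bangalore City": "Bangalore",
    "New Delhi": "Delhi",
}


def normalise_city(location: pd.Series) -> pd.Series:
    """Keep the text before the first comma, trim spaces, then apply the alias map."""
    city = location.str.split(",").str[0].str.strip()
    return city.replace(CITY_ALIASES)


# Illustrative sample in the same "City, State, Country" layout
sample = pd.Series(["Bengaluru, Karnataka, India", "New Delhi", "Pune, Maharashtra, India"])
cities = normalise_city(sample)
```

Keeping the aliases in one dictionary means each dataset can reuse the same mapping instead of repeating individual `replace` calls.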

In [2058]:
data2.isnull().sum()

Company Name      0
Industry         30
Round/Series      0
Amount            0
Location          0
About Company     0
dtype: int64

In [2059]:
# Take a closer look at the duplicates, then drop them
data2[data2.duplicated(keep=False)].sort_values(by='About Company').head(25)
data2.drop_duplicates(inplace=True)

In [2060]:
# Check the unique values in the Location column
data2["Location"].unique()

array(['Bangalore', 'Mumbai', 'Gurgaon', 'Noida', 'Hyderabad', 'Kalkaji',
       'Delhi', 'India', 'Hubli', 'Chennai', 'Mohali', 'Kolkata', 'Pune',
       'Jodhpur', 'Kanpur', 'Ahmedabad', 'Azadpur', 'Haryana', 'Cochin',
       'Faridabad', 'Jaipur', 'Kota', 'Anand', 'Bangalore City',
       'Belgaum', 'Thane', 'Margão', 'Indore', 'Alwar', 'Kannur',
       'Trivandrum', 'Ernakulam', 'Kormangala', 'Uttar Pradesh',
       'Andheri', 'Mylapore', 'Ghaziabad', 'Kochi', 'Powai', 'Guntur',
       'Kalpakkam', 'Bhopal', 'Coimbatore', 'Worli', 'Alleppey',
       'Chandigarh', 'Guindy', 'Lucknow'], dtype=object)

In [2061]:
#Overview of the Stage column
data2['Round/Series'].unique()

array(['Seed', 'Series A', 'Angel', 'Series B', 'Pre-Seed',
       'Private Equity', 'Venture - Series Unknown', 'Grant',
       'Debt Financing', 'Post-IPO Debt', 'Series H', 'Series C',
       'Series E', 'Corporate Round', 'Undisclosed', 'Series D',
       'Secondary Market', 'Post-IPO Equity', 'Non-equity Assistance',
       'Funding Round'], dtype=object)

In [2062]:
data2.columns

Index(['Company Name', 'Industry', 'Round/Series', 'Amount', 'Location',
       'About Company'],
      dtype='object')

In [2063]:
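# Replace the text "nan" with a real NaN. With regex=True the pattern also matches the
# "nan" inside "Debt Financing", so those rows become NaN as well.
data2["Round/Series"] = data2["Round/Series"].replace("nan", np.nan, regex=True)
# Extract the rows with a missing Round/Series value
data2[data2['Round/Series'].isna()]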

Unnamed: 0,Company Name,Industry,Round/Series,Amount,Location,About Company
99,Portea Medical,Health Care,,250000000.0,Bangalore,Portea Medical is the largest and fastest grow...
105,Jagaran Microfin,,,550000000.0,Kolkata,Jagaran Microfin is a Microfinance institution...
114,OneAssist,Financial Services,,2400000.0,Mumbai,OneAssist is a protection & assistance service...
149,Quikr,Classifieds,,550000000.0,Bangalore,Quikr is a free classifieds and online marketp...
359,Drivezy,Automotive,,100000000.0,Bangalore,Drivezy is India's largest vehicle sharing pla...
375,Aye Finance,Finance,,72000000.0,Gurgaon,Aye Finance provides financial services to mic...
383,Yaantra,Information Services,,2000000.0,Delhi,"Yaantra, India’s leading mobile phone repair, ..."
394,Dunzo,Customer Service,,70000000.0,Bangalore,Dunzo is an app that connects you with the nea...
422,Shuttl,Apps,,1000000.0,Gurgaon,Shuttl provides an app-based office shuttle se...
451,Vogo Rentals,Last Mile Transportation,,80000000.0,Kormangala,Vogo is a dockless scooter rental company in I...
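
The difference between a regex replacement and an exact-match replacement can be seen on a small example. This is a sketch with made-up values; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative stage labels, including the literal text "nan"
stages = pd.Series(["Seed", "Debt Financing", "nan", "Series A"], dtype=object)

# Regex replacement: "nan" is searched for inside every string,
# and it also occurs in "Debt Financing" (Fi-nan-cing).
by_regex = stages.replace("nan", np.nan, regex=True)

# Exact-match replacement: only the cell that is literally "nan" changes.
by_exact = stages.replace("nan", np.nan)
```

With the exact match, legitimate stage labels such as "Debt Financing" are preserved and only the placeholder text is turned into a missing value.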


In [2064]:
# Stage: fill missing values with the most frequent round
data2['Round/Series'] = data2['Round/Series'].fillna(data2['Round/Series'].mode()[0])


In [2065]:
# Sector: fill missing values with the most frequent industry
data2['Industry'] = data2['Industry'].fillna(data2['Industry'].mode()[0])

In [2066]:
data3 = pd.read_csv("startup_funding2019.csv")
data3.head()

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage
0,Bombay Shaving,,,Ecommerce,Provides a range of male grooming products,Shantanu Deshpande,Sixth Sense Ventures,"$6,300,000",
1,Ruangguru,2014.0,Mumbai,Edtech,A learning platform that provides topic-based ...,"Adamas Belva Syah Devara, Iman Usman.",General Atlantic,"$150,000,000",Series C
2,Eduisfun,,Mumbai,Edtech,It aims to make learning fun via games.,Jatin Solanki,"Deepak Parekh, Amitabh Bachchan, Piyush Pandey","$28,000,000",Fresh funding
3,HomeLane,2014.0,Chennai,Interior design,Provides interior designing solutions,"Srikanth Iyer, Rama Harinath","Evolvence India Fund (EIF), Pidilite Group, FJ...","$30,000,000",Series D
4,Nu Genes,2004.0,Telangana,AgriTech,"It is a seed company engaged in production, pr...",Narayana Reddy Punyala,Innovation in Food and Agriculture (IFA),"$6,000,000",


In [2068]:
data3.tail()

Unnamed: 0,Company/Brand,Founded,HeadQuarter,Sector,What it does,Founders,Investor,Amount($),Stage
84,Infra.Market,,Mumbai,Infratech,It connects client requirements to their suppl...,"Aaditya Sharda, Souvik Sengupta","Tiger Global, Nexus Venture Partners, Accel Pa...","$20,000,000",Series A
85,Oyo,2013.0,Gurugram,Hospitality,Provides rooms for comfortable stay,Ritesh Agarwal,"MyPreferred Transformation, Avendus Finance, S...","$693,000,000",
86,GoMechanic,2016.0,Delhi,Automobile & Technology,Find automobile repair and maintenance service...,"Amit Bhasin, Kushal Karwa, Nitin Rana, Rishabh...",Sequoia Capital,"$5,000,000",Series B
87,Spinny,2015.0,Delhi,Automobile,Online car retailer,"Niraj Singh, Ramanshu Mahaur, Ganesh Pawar, Mo...","Norwest Venture Partners, General Catalyst, Fu...","$50,000,000",
88,Ess Kay Fincorp,,Rajasthan,Banking,Organised Non-Banking Finance Company,Rajendra Setia,"TPG, Norwest Venture Partners, Evolvence India","$33,000,000",


In [2069]:
data3.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Founded,60.0,2014.533333,2.937003,2004.0,2013.0,2015.0,2016.25,2019.0


In [2070]:
data3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89 entries, 0 to 88
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company/Brand  89 non-null     object 
 1   Founded        60 non-null     float64
 2   HeadQuarter    70 non-null     object 
 3   Sector         84 non-null     object 
 4   What it does   89 non-null     object 
 5   Founders       86 non-null     object 
 6   Investor       89 non-null     object 
 7   Amount($)      89 non-null     object 
 8   Stage          43 non-null     object 
dtypes: float64(1), object(8)
memory usage: 6.4+ KB


In [2071]:
data3.dtypes

Company/Brand     object
Founded          float64
HeadQuarter       object
Sector            object
What it does      object
Founders          object
Investor          object
Amount($)         object
Stage             object
dtype: object

In [2072]:
data3.shape

(89, 9)

In [2073]:
data3.isnull().sum()

Company/Brand     0
Founded          29
HeadQuarter      19
Sector            5
What it does      0
Founders          3
Investor          0
Amount($)         0
Stage            46
dtype: int64

In [2074]:
# Overview of the Founded column (.unique is not called here, so the full Series is shown)
data3['Founded'].unique

<bound method Series.unique of 0        NaN
1     2014.0
2        NaN
3     2014.0
4     2004.0
       ...  
84       NaN
85    2013.0
86    2016.0
87    2015.0
88       NaN
Name: Founded, Length: 89, dtype: float64>

In [2075]:
# Founded is float64 with real NaN values. Fill the NaN rows with 0 as a placeholder,
# then cast to int to drop the decimal part.
data3['Founded'] = data3['Founded'].replace(np.nan, 0)
data3['Founded'] = data3['Founded'].astype(int)
data3['Founded']

0        0
1     2014
2        0
3     2014
4     2004
      ... 
84       0
85    2013
86    2016
87    2015
88       0
Name: Founded, Length: 89, dtype: int32
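
An alternative to a 0 placeholder is pandas' nullable integer dtype, which removes the decimals while keeping missing years missing. The sketch below uses made-up values and illustrative variable names; it is not the notebook's own approach.

```python
import numpy as np
import pandas as pd

# Illustrative founding years with gaps
founded = pd.Series([np.nan, 2014.0, np.nan, 2004.0])

# Nullable integer dtype: whole numbers, with missing years shown as <NA>
founded_int = founded.astype("Int64")

# A median computed afterwards uses only the known years
founded_filled = founded_int.fillna(int(founded_int.median()))
```

Since no zeros enter the column, a later median or mean is not pulled towards 0 by the placeholder rows.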

In [2076]:
data3.columns

Index(['Company/Brand', 'Founded', 'HeadQuarter', 'Sector', 'What it does',
       'Founders', 'Investor', 'Amount($)', 'Stage'],
      dtype='object')

In [2077]:
# HeadQuarter: treat 'New Delhi' and 'Delhi' as the same city,
# then check that no 'New Delhi' rows remain
data3['HeadQuarter'] = data3['HeadQuarter'].replace('New Delhi', 'Delhi')
data3[data3['HeadQuarter']=='New Delhi'].sum()

Company/Brand    0.0
Founded          0.0
HeadQuarter      0.0
Sector           0.0
What it does     0.0
Founders         0.0
Investor         0.0
Amount($)        0.0
Stage            0.0
dtype: float64

In [2078]:
# Remove the $ sign and thousands separators literally, then convert to float; non-numeric values become NaN
data3['Amount($)'] = pd.to_numeric(data3['Amount($)'].str.replace('$', '', regex=False).str.replace(',', '', regex=False), errors='coerce')


In [2079]:
data3['Amount($)'] = data3['Amount($)'].astype(str)
# Replace commas
data3['Amount($)'] = data3['Amount($)'].str.replace(',','')

In [2080]:
# Amount column contains some other values that need to be changed before converting to float
data3['Amount($)'] = data3['Amount($)'].str.replace('Undisclosed','')
data3['Amount($)'] = data3['Amount($)'].replace('',np.nan)
# Change dtype to float
data3['Amount($)'] = data3['Amount($)'].astype(float)

In [2081]:


# Founded: replace null values with the median (the NaN rows were set to 0 earlier, so nothing is left to fill)
data3['Founded'] = data3['Founded'].fillna(data3['Founded'].median())

# Sector: fill missing values with the most frequent sector
data3['Sector'] = data3['Sector'].fillna(data3['Sector'].mode()[0])

# HeadQuarter: label missing values explicitly
data3['HeadQuarter'] = data3['HeadQuarter'].fillna('HeadQuarter Unknown')

# Founders: fill missing values with "Unknown Founders"
data3['Founders'] = data3['Founders'].fillna('Unknown Founders')

# Stage: fill missing values with the most frequent stage
data3['Stage'] = data3['Stage'].fillna(data3['Stage'].mode()[0])

In [2082]:
# Replace all NaN values with the column mean
data3['Amount($)'] = data3['Amount($)'].fillna(data3['Amount($)'].mean())

In [2083]:
# Take a closer look at the duplicates, then drop them
data3[data3.duplicated(keep=False)].sort_values(by='Sector').head(25)
data3.drop_duplicates(inplace=True)

In [2084]:
# Define a common schema for renaming
common_schema = {
    'Company_Brand': 'Company',
    'Company Name': 'Company',
    'Company/Brand': 'Company',
    'Founded': 'Founded',
    'HeadQuarter': 'Headquarter',
    'HeadQuarter': 'Headquarter',
    'Sector': 'Sector',
    'Industry': 'Sector',
    'What_it_does': 'Description',
    'What it does': 'Description',
    'About Company': 'Description',
    'Founders': 'Founders',
    'Investor': 'Investor',
    'Amount': 'Amount',
    'Amount($)': 'Amount',
    'Stage': 'Stage',
    'Round/Series': 'Stage',
    'Location': 'Headquarter',
}

# Tag each dataset with its funding year
data['fundyear'] = 2020
data1['fundyear'] = 2021
data2['fundyear'] = 2018
data3['fundyear'] = 2019


# Rename columns in each dataset
data.rename(columns=common_schema, inplace=True)
data1.rename(columns=common_schema, inplace=True)
data2.rename(columns=common_schema, inplace=True)
data3.rename(columns=common_schema, inplace=True)

# Concatenate datasets
df = pd.concat([data, data1, data2, data3], ignore_index=True)

# Display the result
df.head()


Unnamed: 0,Company,Founded,Headquarter,Sector,Description,Founders,Investor,Amount,Stage,fundyear
0,Aqgromalin,2019.0,Chennai,AgriTech,Cultivating Ideas for Profit,"Prasanna Manogaran, Bharani C L",Angel investors,200000.0,Series A,2020
1,Krayonnz,2019.0,Bangalore,EdTech,An academy-guardian-scholar centric ecosystem ...,"Saurabh Dixit, Gurudutt Upadhyay",GSF Accelerator,100000.0,Pre-seed,2020
2,PadCare Labs,2018.0,Pune,Hygiene management,Converting bio-hazardous waste to harmless waste,Ajinkya Dhariya,Venture Center,113042969.543071,Pre-seed,2020
3,NCOME,2020.0,New Delhi,Escrow,Escrow-as-a-service platform,Ritesh Tiwari,"Venture Catalysts, PointOne Capital",400000.0,Series A,2020
4,Gramophone,2016.0,Indore,AgriTech,Gramophone is an AgTech platform enabling acce...,"Ashish Rajan Singh, Harshit Gupta, Nishant Mah...","Siana Capital Management, Info Edge",340000.0,Series A,2020
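
How the shared rename mapping lines up differently named columns before concatenation can be seen on a small scale. The sketch below uses two toy frames with invented values and an abbreviated mapping; it is an illustration of the technique, not a replacement for the cell above.

```python
import pandas as pd

# Abbreviated mapping in the spirit of common_schema (illustrative subset)
schema = {
    "Company Name": "Company",
    "Company/Brand": "Company",
    "Industry": "Sector",
    "Amount($)": "Amount",
}

# Toy frames with invented values and different column names
a = pd.DataFrame({"Company Name": ["A"], "Industry": ["FinTech"], "Amount": ["1000"]})
b = pd.DataFrame({"Company/Brand": ["B"], "Sector": ["EdTech"], "Amount($)": ["$2,000"]})

frames = []
for year, frame in [(2018, a), (2019, b)]:
    renamed = frame.rename(columns=schema)  # mapping keys absent from a frame are ignored
    renamed["fundyear"] = year
    frames.append(renamed)

combined = pd.concat(frames, ignore_index=True)
print(combined.columns.tolist())
```

Because `rename` ignores keys that a frame does not contain, one mapping can be applied to every yearly file without per-file branching.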


In [2085]:
# Inspect the Amount column
df["Amount"]

0               200000.0
1               100000.0
2       113042969.543071
3               400000.0
4               340000.0
              ...       
2849          20000000.0
2850         693000000.0
2851           5000000.0
2852          50000000.0
2853          33000000.0
Name: Amount, Length: 2854, dtype: object

In [2086]:
df.dtypes

Company         object
Founded        float64
Headquarter     object
Sector          object
Description     object
Founders        object
Investor        object
Amount          object
Stage           object
fundyear         int64
dtype: object

In [2088]:
df.nunique()

Company        2213
Founded          35
Headquarter     136
Sector          594
Description    2690
Founders       1981
Investor       1771
Amount          596
Stage            66
fundyear          4
dtype: int64

In [2089]:
# Strip leading and trailing whitespace from the column names
df.columns= df.columns.str.strip()
df.columns

Index(['Company', 'Founded', 'Headquarter', 'Sector', 'Description',
       'Founders', 'Investor', 'Amount', 'Stage', 'fundyear'],
      dtype='object')

In [2090]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Founded,2330.0,1990.967382,223.600261,0.0,2015.0,2016.0,2019.0,2021.0
fundyear,2854.0,2020.01822,1.087328,2018.0,2020.0,2020.0,2021.0,2021.0


In [2091]:
# Unique values in the Sector column
df['Sector'].unique()

array(['AgriTech', 'EdTech', 'Hygiene management', 'Escrow',
       'Networking platform', 'FinTech', 'Crowdsourcing',
       'Food & Bevarages', 'HealthTech', 'Fashion startup',
       'Food Industry', 'Food Delivery', 'Virtual auditing startup',
       'E-commerce', 'Gaming', 'Work fulfillment', 'AI startup',
       'Telecommunication', 'Logistics', 'Tech Startup', 'Sports',
       'Retail', 'Medtech', 'Tyre management', 'Cloud company',
       'Software company', 'Venture capitalist', 'Renewable player',
       'IoT startup', 'SaaS startup', 'Aero company', 'Marketing company',
       'Retail startup', 'Co-working Startup', 'Finance company',
       'Tech company', 'Solar Monitoring Company',
       'Video sharing platform', 'Gaming startup',
       'Video streaming platform', 'Consumer appliances',
       'Blockchain startup', 'Conversational AI platform', 'Real Estate',
       'SaaS platform', 'AI platform', 'Fusion beverages', 'HR Tech',
       'Job portal', 'Hospitality', 'Digit

In [2092]:
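import re


def sector_redistribution(sector):
    """Map a raw sector label to a broad category.

    Branches are checked in order and the first matching pattern wins;
    matching is case-sensitive. Non-string values map to 'Others'.
    """
    if isinstance(sector, str):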
        if re.search(r'Credit|Financial Services|Lending|Marketplace|FinTech|Accounting|Banking|Venture Capital|Investment|Financial Exchanges|Micro Lending|Wealth Management|Insurance|Crowdfunding|Finance|Impact Investing|Personal Finance|Cryptocurrency|Trading Platform', sector):
            return 'Finance'
        elif re.search(r'Automotive|Air Transportation|Transport|Logistics|Vehicle|Transportation|Railroad|Last Mile Transportation|Electric Vehicle|Ride Sharing|Autonomous Vehicles|Marine Transportation|Battery', sector):
            return 'Transport'
        elif re.search(r'E-Commerce|Retail|Fashion|Jewelry|Shopping|Retail Technology|Marketplace|E-Commerce Platforms|Online Portals|Facilities Support Services|Procurement|Interior Design|Home Decor|Home Improvement|Home Services|Furniture', sector):
            return 'E-Commerce & Retail'
        elif re.search(r'Cloud Infrastructure|PaaS|SaaS|Software|Enterprise Software|Network Hardware|Network Security|Delivery Service|Information Technology|Cloud Computing|Data Analytics|AI|Machine Learning|Analytics|Big Data|IoT|Blockchain|Artificial Intelligence|Digital Marketing|SEO|SEM|Web Development|Digital Media|Media and Entertainment|Social Media|CRM|Virtual Reality|Augmented Reality|Enterprise Resource Planning', sector):
            return 'Technology & IT'
        elif re.search(r'Health Care|Hospital|Medical|Health Diagnostics|Medical Device|Wellness|Personal Health|Health Insurance|Health and Fitness|MedTech|Pharmaceutical|Life Science|Biotechnology|Diabetes|Elder Care|Alternative Medicine|mHealth|Dental|Home Health Care|Nutrition|HealthTech', sector):
            return 'Health & Medical'
        elif re.search(r'Food Delivery|Food and Beverage|Food Processing|Restaurants|Catering|Snack Food|Tea|Organic Food|Food Industry|FoodTech|Cloud Kitchen|Beverages|Fusion Beverages|Food & Nutrition|Food Production|Cooking', sector):
            return 'Food & Beverage'
        elif re.search(r'Advertising|Brand Marketing|Event Promotion|Marketing|Sponsorship|Ticketing|Digital Marketing|Creative Agency|Video Streaming|Broadcasting|News|Publishing|Media|Media Tech|Content Management|Content Publishing|Video Platform', sector):
            return 'Media & Advertising'
        elif re.search(r'Agriculture|AgTech|Farming|Farmers Market|AgriTech|Foodtech|Dairy', sector):
            return 'Agriculture'
        elif re.search(r'Tourism|Travel|TravelTech|Business Travel|Tourism & EV|Travel Accommodations|Hospitality|Hotel|Reservations', sector):
            return 'Travel & Hospitality'
        elif re.search(r'Consulting|Business Development|Advisory|Management Consulting|Outsourcing|Customer Service|Professional Services', sector):
            return 'Consulting & Professional Services'
        elif re.search(r'Education|E-Learning|EdTech|Higher Education|Education Management|Continuing Education|Skill Assessment|Tutoring|STEM Education|Career Planning|Training', sector):
            return 'Education'
        elif re.search(r'Supply Chain Management|Freight Service|Logistics|Delivery|Warehousing|Packaging Services|Supply Chain', sector):
            return 'Logistics & Supply Chain'
        elif re.search(r'Industrial Automation|Manufacturing|Robotics|Automation|Industrial|Mechanical & Industrial Engineering|Production|Factory|Industrial Technology|Automobile Technology', sector):
            return 'Manufacturing & Industrial'
        elif re.search(r'Energy|Renewable Energy|CleanTech|Solar|Electricity|Energy Storage|Environmental Services|GreenTech|Environmental Consulting|Natural Resources|Oil and Gas|Energy Technology', sector):
            return 'Energy & Environmental'
        elif re.search(r'Children|Parenting|Child Care|Preschool Daycare|KidTech', sector):
            return 'Parenting & Child Care'
        elif re.search(r'Sports|Fitness|Health and Fitness|Wellness|Yoga|eSports|Gaming|Video Games|Fantasy Sports|Sporting Goods|SportsTech|Health & Wellness', sector):
            return 'Sports & Fitness'
        elif re.search(r'Fashion|Beauty|Lifestyle|Cosmetics|Apparel|Footwear|Wearables|Fashion Tech|Jewelry|Skincare|Beauty Products|Beauty & Wellness', sector):
            return 'Fashion & Beauty'
        elif re.search(r'Construction|Building|Infrastructure|Real Estate|PropTech|Commercial Real Estate|Property Management|Rental Property|Housing|Home Services|Interior Design', sector):
            return 'Construction & Real Estate'
        elif re.search(r'HR|Human Resources|Staffing|Recruitment|HRTech', sector):
            return 'Human Resources'
        elif re.search(r'Finance|Financial Services|FinTech|Mobile Payments|Payments|Insurance|Insurance Tech|InsureTech|Insurtech|Personal Finance|Wealth Management|Investment|Mutual Funds|Investment Banking|Venture Capital', sector):
            return 'Finance'
        else:
            return 'Others'
    else:
        return 'Others'
   
 
# Apply the sector redistribution function to create a new column
df['redistributed_sector'] = df['Sector'].apply(sector_redistribution)
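
The patterns above are case-sensitive, so a label such as `E-commerce` is not matched by `E-Commerce`, and `Food & Bevarages` is not matched by any of the food keywords. One possible refinement is a table-driven, case-insensitive matcher. The sketch below is an assumption-laden illustration: `SECTOR_RULES` and `classify_sector` are hypothetical names, and the keyword lists are heavily abbreviated rather than equivalent to the function above.

```python
import re

# Abbreviated, hypothetical rule table; the first matching rule wins
SECTOR_RULES = [
    ("Finance", r"fintech|financ|insur|lending|payments"),
    ("E-Commerce & Retail", r"e-?commerce|retail|shopping"),
    # \b keeps 'ai' from matching inside words such as 'retail'
    ("Technology & IT", r"\bai\b|saas|software|iot|blockchain|cloud"),
    ("Food & Beverage", r"food|beverage|bevarage"),  # includes the misspelling seen in the data
]
COMPILED_RULES = [(label, re.compile(pat, re.IGNORECASE)) for label, pat in SECTOR_RULES]


def classify_sector(sector):
    """Return the first matching category, or 'Others' (hypothetical helper)."""
    if not isinstance(sector, str):
        return "Others"
    for label, pattern in COMPILED_RULES:
        if pattern.search(sector):
            return label
    return "Others"


for raw in ["E-commerce", "AI startup", "Food & Bevarages", "Retail", None]:
    print(raw, "->", classify_sector(raw))
```

Keeping the rules in a list makes the matching order explicit, which matters because a label that fits several categories is assigned to whichever rule comes first.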
 

In [2093]:
# Confirm the changes
df.head()

Unnamed: 0,Company,Founded,Headquarter,Sector,Description,Founders,Investor,Amount,Stage,fundyear,redistributed_sector
0,Aqgromalin,2019.0,Chennai,AgriTech,Cultivating Ideas for Profit,"Prasanna Manogaran, Bharani C L",Angel investors,200000.0,Series A,2020,Agriculture
1,Krayonnz,2019.0,Bangalore,EdTech,An academy-guardian-scholar centric ecosystem ...,"Saurabh Dixit, Gurudutt Upadhyay",GSF Accelerator,100000.0,Pre-seed,2020,Education
2,PadCare Labs,2018.0,Pune,Hygiene management,Converting bio-hazardous waste to harmless waste,Ajinkya Dhariya,Venture Center,113042969.543071,Pre-seed,2020,Others
3,NCOME,2020.0,New Delhi,Escrow,Escrow-as-a-service platform,Ritesh Tiwari,"Venture Catalysts, PointOne Capital",400000.0,Series A,2020,Others
4,Gramophone,2016.0,Indore,AgriTech,Gramophone is an AgTech platform enabling acce...,"Ashish Rajan Singh, Harshit Gupta, Nishant Mah...","Siana Capital Management, Info Edge",340000.0,Series A,2020,Agriculture


In [2094]:
# Drop the Description column, which adds no value to this analysis
df.drop(columns='Description', inplace=True)
df.columns

Index(['Company', 'Founded', 'Headquarter', 'Sector', 'Founders', 'Investor',
       'Amount', 'Stage', 'fundyear', 'redistributed_sector'],
      dtype='object')

In [2095]:
df.isna().sum()

Company                   0
Founded                 524
Headquarter               0
Sector                    0
Founders                524
Investor                524
Amount                    0
Stage                     0
fundyear                  0
redistributed_sector      0
dtype: int64
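
To check where the missing values come from, the null counts can be broken down by funding year. This is a minimal sketch on a toy frame with invented values; on the real data the same expression would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy frame in which only the 2018 rows lack Founded and Founders (invented values)
toy = pd.DataFrame({
    "fundyear": [2018, 2018, 2019, 2020],
    "Founded": [np.nan, np.nan, 2015.0, 2019.0],
    "Founders": [np.nan, np.nan, "X", "Y"],
})

# Count missing values per column within each funding year
missing_by_year = toy.drop(columns="fundyear").isna().groupby(toy["fundyear"]).sum()
print(missing_by_year)
```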

In [2097]:
# The 2018 data has no Founded, Founders, or Investor columns, so those values are missing after concatenation
# Founded: fill missing values with the median year
df['Founded'] = df['Founded'].fillna(df['Founded'].median())

# Founders: mark missing values as unknown
df['Founders'] = df['Founders'].fillna('Unknown Founders')

# Investor: mark missing values as unknown
df['Investor'] = df['Investor'].fillna('unknown Investors')

In [2098]:
df.isna().sum()

Company                 0
Founded                 0
Headquarter             0
Sector                  0
Founders                0
Investor                0
Amount                  0
Stage                   0
fundyear                0
redistributed_sector    0
dtype: int64