<div class="alert alert-block alert-info">

<h1 style="font-family:verdana;"> Description:</h1> 

<ul>
<li><p style="font-family:verdana;">
India is one of the fastest-growing economies in the world. In the past decade, a large number of unicorn startups have emerged from the Indian startup ecosystem, with global impact.
</p></li>
    
<li><p style="font-family:verdana;">
In this notebook, we perform exploratory data analysis (EDA) on the Indian startup funding dataset, draw insights from it, and try to answer several questions about the Indian startup ecosystem.
</p></li>   
</ul>

</div>

## Step 0: Import libraries and dataset

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

In [None]:
# Importing dataset
dataset = pd.read_csv('../input/indian-startup-funding/startup_funding.csv')

## Step 1: Understanding the dataset

In [None]:
# Preview dataset
dataset.head()

In [None]:
# Dataset dimensions - (rows, columns)
dataset.shape

In [None]:
# Features data-type
dataset.info()

In [None]:
# Checking for Null values
(dataset.isnull().sum() / dataset.shape[0] * 100).sort_values(ascending = False).round(2).astype(str) + ' %'

<div class="alert alert-block alert-info">

<h3 style="font-family:verdana;"> Observations:</h3>

<ul>
    
<li><p style="font-family:verdana;">
In the dependent feature, i.e. 'Amount in USD', some preprocessing is needed, such as removing commas and handling null values.
</p></li>
    
<li><p style="font-family:verdana;">
The 'Date' feature needs to be cleaned and converted to a numeric form so that it can be used in plots.
</p></li>    

<li><p style="font-family:verdana;">
The 'Remarks' feature contains about 86.23% of null values, so we can drop it.
</p></li>    

</ul>

</div>

## Step 2: Data Preprocessing

In [None]:
# Replacing the commas in 'Amount in USD' feature
dataset['Amount in USD'] = dataset['Amount in USD'].apply(lambda x: str(x).replace(',', ''))

In [None]:
# Fixing the faulty values in 'Amount in USD' feature
# Each listed string is replaced with "0", in the order given
faulty_values = [
    "undisclosed", "Undisclosed", "unknown", "14342000+",
    "\\\\xc2\\\\xa010000000", "\\\\xc2\\\\xa05000000", "\\\\xc2\\\\xa019350000",
    "\\\\xc2\\\\xa0600000", "\\\\xc2\\\\xa020000000", "\\\\xc2\\\\xa0N/A",
    "\\\\xc2\\\\xa016200000", "\\\\xc2\\\\xa0685000", "nan",
]
for value in faulty_values:
    dataset['Amount in USD'] = dataset['Amount in USD'].apply(lambda x : str(x).replace(value, "0"))

In [None]:
# Converting to numeric data-type
dataset['Amount in USD'] = pd.to_numeric(dataset['Amount in USD'])
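
The comma removal, placeholder replacement, and numeric conversion above could also be sketched as a single helper. This is only an illustration on toy values: `clean_amount` and the sample strings are hypothetical, and, unlike the cells above, unparsable entries become `NaN` directly rather than `0`.

```python
import numpy as np
import pandas as pd

def clean_amount(series):
    """Strip thousands separators, then turn anything non-numeric into NaN."""
    text = series.astype(str).str.replace(',', '', regex=False)
    return pd.to_numeric(text, errors='coerce')

# Toy values (hypothetical), mimicking the kinds of entries handled above
toy = pd.Series(['1,000,000', 'undisclosed', 'unknown', '14342000+', np.nan, '50000'])
cleaned = clean_amount(toy)
print(cleaned)
```

Because the notebook later turns `0` back into a null value, coercing straight to `NaN` would reach a similar state in one step, at the cost of not listing the faulty strings explicitly.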

In [None]:
# Checking for most frequent values in 'Amount in USD'
dataset['Amount in USD'].value_counts(normalize = True).head(10).mul(100).round(2).astype(str) + ' %'

In [None]:
# Replacing 0 in 'Amount in USD' with null values
dataset['Amount in USD'] = dataset['Amount in USD'].replace(0, np.nan)

In [None]:
# Replacing null values with mean
dataset['Amount in USD'] = dataset['Amount in USD'].fillna(dataset['Amount in USD'].mean())
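
A small sketch of what mean imputation does, using hypothetical toy amounts rather than the real column: the mean of the column is left unchanged, while the spread shrinks because every filled value sits exactly at the mean.

```python
import numpy as np
import pandas as pd

# Hypothetical toy amounts with one missing entry
amounts = pd.Series([100.0, 200.0, np.nan, 300.0])

# Fill the gap with the mean of the observed values
filled = amounts.fillna(amounts.mean())

print('mean before:', amounts.mean(), 'after:', filled.mean())
print('std before:', amounts.std(), 'after:', filled.std())
```

This is worth keeping in mind when reading later averages: they are computed on a column in which the missing amounts already equal the mean.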

In [None]:
# Fixing the faulty values in 'Date' column
# A single mapping avoids chained-indexing assignment, which pandas may apply to a copy
date_fixes = {
    '12/05.2015': '12/05/2015',
    '13/04.2015': '13/04/2015',
    '15/01.2015': '15/01/2015',
    '22/01//2015': '22/01/2015',
    '05/072018': '05/07/2018',
    '01/07/015': '01/07/2015',
    '\\xc2\\xa010/7/2015': '10/07/2015',
    '\\\\xc2\\\\xa010/7/2015': '10/07/2015',
}
dataset['Date dd/mm/yyyy'] = dataset['Date dd/mm/yyyy'].replace(date_fixes)

In [None]:
# Creating a feature 'Year Month' (YYYYMM) consisting of year and month
# The column is day-first (dd/mm/yyyy), so parse it with dayfirst = True
dates = pd.to_datetime(dataset['Date dd/mm/yyyy'], dayfirst = True)
dataset['Year Month'] = (dates.dt.year * 100) + dates.dt.month
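
The 'Year Month' feature encodes each date as an integer of the form YYYYMM. A minimal sketch on hypothetical toy strings, assuming day-first dates as the column name `Date dd/mm/yyyy` suggests:

```python
import pandas as pd

# Hypothetical dd/mm/yyyy strings
raw = pd.Series(['05/07/2018', '12/05/2015', '31/01/2016'])

# An explicit format makes the day-first interpretation unambiguous
parsed = pd.to_datetime(raw, format='%d/%m/%Y')
year_month = parsed.dt.year * 100 + parsed.dt.month

print(year_month.tolist())
```

With a month-first interpretation, a string such as '05/07/2018' would land in May instead of July, so the parsing convention directly affects the monthly counts.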

In [None]:
# Dropping the 'Remarks' feature as it is mostly null (about 86%)
dataset.drop('Remarks', axis = 1, inplace = True)

In [None]:
# Replacing 'Bengaluru' with the more common name 'Bangalore' in the dataset
dataset['City  Location'] = dataset['City  Location'].replace('Bengaluru', 'Bangalore')

In [None]:
# Replacing 'Undisclosed investors' with a common name 'Undisclosed Investors'
dataset['Investors Name'] = dataset['Investors Name'].replace(
    ['Undisclosed investors', 'Undisclosed Investor', 'undisclosed investors', 'Undisclosed'],
    'Undisclosed Investors')

In [None]:
# Removing the space in 'Ola Cabs' as it gives two different words in WordCloud
dataset['Startup Name'] = dataset['Startup Name'].replace('Ola Cabs', 'OlaCabs')

In [None]:
# Normalising variant spellings of investment types to a common label
dataset['InvestmentnType'] = dataset['InvestmentnType'].replace({
    'Seed/ Angel Funding': 'Seed / Angel Funding',
    'Seed\\\\nFunding': 'Seed Funding',
    'Seed/Angel Funding': 'Seed / Angel Funding',
    'Angel / Seed Funding': 'Seed / Angel Funding',
})
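
Label normalisation of this kind can be checked on a small example. The sketch below uses a hypothetical toy Series and a hypothetical `canonical` mapping; `Series.replace` with a dictionary matches whole values exactly, so unrelated labels are left alone.

```python
import pandas as pd

# Hypothetical toy labels with inconsistent spellings
labels = pd.Series(['Seed/ Angel Funding', 'Seed/Angel Funding',
                    'Angel / Seed Funding', 'Private Equity'])

# Map every known variant to one canonical spelling
canonical = {
    'Seed/ Angel Funding': 'Seed / Angel Funding',
    'Seed/Angel Funding': 'Seed / Angel Funding',
    'Angel / Seed Funding': 'Seed / Angel Funding',
}
normalised = labels.replace(canonical)

print(normalised.value_counts())
```

Comparing `nunique()` before and after the replacement is a quick way to confirm that the variants were actually merged.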

## Step 3: Exploratory Data Analysis

### Q1. How does the funding ecosystem change with time?

In [None]:
# Selecting the 20 months with the most fundings
months = dataset['Year Month'].value_counts().head(20)

In [None]:
print('Average number of fundings per month (20 busiest months):', months.values.mean())

In [None]:
print('Minimum number of fundings in a month (20 busiest months):', months.values.min())

In [None]:
print('Maximum number of fundings in a month:', months.values.max())

In [None]:
# Creating a barplot for Number of fundings made each month
plt.figure(figsize = (20, 7))
sns.barplot(x = months.index, y = months.values, palette = 'colorblind')
plt.title('Number of Fundings each month', fontdict = {'fontname' : 'Monospace', 'fontsize' : 30, 'fontweight' : 'bold'})
plt.xlabel('Months', fontdict = {'fontname' : 'Monospace', 'fontsize' : 20})
plt.ylabel('Number of Fundings', fontdict = {'fontname' : 'Monospace', 'fontsize' : 20})
plt.tick_params(axis = 'x', labelsize = 12)
plt.tick_params(axis = 'y', labelsize = 15)
plt.grid()
plt.show()

<div class="alert alert-block alert-info">

<h3 style="font-family:verdana;">Observations:</h3>

<ul>
<li><p style="font-family:verdana;">In the above bar plot, which covers the 20 months with the most fundings, the number of fundings during 2015 and 2016 averaged around 84 per month, with no drastic increase or decrease.</p></li>
<li><p style="font-family:verdana;">The maximum, 102 fundings, was recorded in April 2016.</p></li>
<li><p style="font-family:verdana;">The minimum among these months, 70 fundings, was recorded in January 2015.</p></li>
</ul>

</div>
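
To look at the trend over time rather than only the busiest months, the monthly counts can be ordered chronologically. A minimal sketch on hypothetical YYYYMM keys, not the real 'Year Month' column:

```python
import pandas as pd

# Hypothetical 'Year Month' keys, one per funding event
year_month = pd.Series([201601, 201501, 201601, 201512, 201501, 201601])

# value_counts orders by frequency; sort_index restores chronological order
per_month = year_month.value_counts().sort_index()

print(per_month)
```

Because the key is an integer of the form YYYYMM, sorting the index numerically is the same as sorting by date.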

### Q2. Do cities play a major role in funding?

In [None]:
# Selecting top 10 cities 
cities = dataset['City  Location'].value_counts().head(10)

In [None]:
# Preview of frequencies of top 10 cities
cities.values

In [None]:
# Preview of names of top 10 cities
cities.index

In [None]:
# Creating a barplot for number of fundings made in each city
plt.figure(figsize = (20, 7))
sns.barplot(x = cities.values, y = cities.index, palette = 'Paired')
plt.title('Number of fundings in each city', fontdict = {'fontname' : 'Monospace', 'fontsize' : 30, 'fontweight' : 'bold'})
plt.xlabel('Number of fundings', fontdict = {'fontname' : 'Monospace', 'fontsize' : 20})
plt.ylabel('Cities', fontdict = {'fontname' : 'Monospace', 'fontsize' : 20})
plt.tick_params(labelsize = 15)
plt.grid()
plt.show()

<div class="alert alert-block alert-info">

<h3 style="font-family:verdana;">Observations:</h3>

<ul>
<li><p style="font-family:verdana;">Cities like Bangalore, Mumbai and New Delhi are the most funded cities accounting for most of the fundings in the Indian startup ecosystem.</p></li>
<li><p style="font-family:verdana;">On the lower end, we have cities like Ahmedabad and Gurugram, whose number of fundings is far below that of the top cities.</p></li>
<li><p style="font-family:verdana;">From the graph, we can infer that a startup based in one of the top cities, such as Bangalore, has a higher probability of getting funded than one based in a less-funded city.</p></li>
</ul>

</div>
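
The claim that a few cities account for most fundings can be quantified as a percentage share. The sketch below uses a hypothetical toy list of cities, so the printed figure says nothing about the real dataset:

```python
import pandas as pd

# Hypothetical toy data: one city label per funding event
cities = pd.Series(['Bangalore'] * 5 + ['Mumbai'] * 3 + ['New Delhi'] + ['Pune'])

counts = cities.value_counts()

# Share of all fundings captured by the two most frequent cities
top2_share = counts.head(2).sum() / counts.sum() * 100

print(round(top2_share, 2), '%')
```

Applied to the full 'City  Location' column instead of only the top 10, the same calculation would show how concentrated the fundings are.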

### Q3. Which industries are favoured by investors for funding?

In [None]:
# Selecting the most frequent industries
industry = dataset['Industry Vertical'].value_counts().head(10)

In [None]:
# Preview of frequencies of top 10 industry types
industry.values

In [None]:
# Preview of the names of top 10 industry types
industry.index

In [None]:
# Creating a pie chart of top 10 industries
plt.figure(figsize = (20, 10))
plt.pie(industry.values, labels = industry.index, startangle = 30, explode = (0 , 0.20, 0, 0, 0, 0, 0, 0, 0, 0), 
        shadow = True, autopct = '%1.1f%%')
plt.axis('equal')
plt.title('Industry-wise distribution', fontdict = {'fontname' : 'Monospace', 'fontsize' : 30, 'fontweight' : 'bold'})
plt.show()

In [None]:
# Selecting the most frequent subverticals
subvertical = dataset['SubVertical'].value_counts().head(10)

In [None]:
# Preview of frequencies of top 10 subverticals
subvertical.values

In [None]:
# Preview of names of top 10 subverticals
subvertical.index

In [None]:
# Creating a donut chart of top 10 Subverticals
plt.figure(figsize = (20, 10))
plt.pie(subvertical.values, labels = subvertical.index, startangle = 90, autopct = '%1.1f%%')
centre_circle = plt.Circle((0, 0), 0.70, fc = 'white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
plt.title('Subvertical-wise distribution', fontdict = {'fontname' : 'Monospace', 'fontsize' : 30, 'fontweight' : 'bold'})
plt.axis('equal')
plt.show()

<div class="alert alert-block alert-info">

<h3 style="font-family:verdana;">Observations:</h3>

<ul>
<li><p style="font-family:verdana;">In the pie chart, we can see that technology and consumer internet startups make up about 74% of the fundings among the top 10 industry verticals.</p></li>
<li><p style="font-family:verdana;">Most of the top subverticals, such as online pharmacies, online lending, and food delivery platforms, are technology-based, as seen in the donut chart.</p></li>
<li><p style="font-family:verdana;">From the above graphs, we can say that consumer-based technology startups are favoured by investors.</p></li>
</ul>

</div>

### Q4. Who are the most important investors in the Indian Ecosystem?

In [None]:
# Selecting the most frequent investors 
investors = dataset['Investors Name'].value_counts().head(10)

In [None]:
# Preview of frequency of top 10 investors
investors.values

In [None]:
# Preview names of top 10 investors
investors.index

In [None]:
# Create a barplot of top 10 investors
plt.figure(figsize = (20, 7))
sns.barplot(x = investors.values, y = investors.index, palette = 'deep')
plt.title('Number of investments made by Top Investors', fontdict = {'fontname' : 'Monospace', 'fontsize' : 30, 'fontweight' : 'bold'})
plt.xlabel('Number of Investments', fontdict = {'fontname' : 'Monospace', 'fontsize' : 20})
plt.ylabel('Investor names', fontdict = {'fontname' : 'Monospace', 'fontsize' : 20})
plt.tick_params(labelsize = 15)
plt.grid()
plt.show()

<div class="alert alert-block alert-info">

<h3 style="font-family:verdana;">Observations:</h3>

<ul>
<li><p style="font-family:verdana;">The top investor in the above graph is undisclosed because, in many startups, investors choose to remain anonymous.</p></li>
<li><p style="font-family:verdana;">The known top investors in the Indian startup ecosystem are Ratan Tata, Indian Angel Network, and Kalaari Capital.</p></li>
</ul>

</div>
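
Counting investors with `value_counts()` treats each cell as one name. If a cell can list several investors separated by commas (an assumption about the data, not something verified here), the names could be split before counting. A minimal sketch with hypothetical investor names:

```python
import pandas as pd

# Hypothetical cells; some list more than one investor
names = pd.Series(['Fund A, Fund B', 'Fund A', 'Fund C,Fund A', 'Fund B'])

# Split on commas, give each investor its own row, and trim whitespace
individual = names.str.split(',').explode().str.strip()
counts = individual.value_counts()

print(counts)
```

Under that assumption, investors who mostly co-invest would be counted once per deal instead of being hidden inside longer strings.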

### Q5. How much funding do startups generally get in India?

In [None]:
# Preview of top 10 most funded startups
dataset['Amount in USD'].sort_values(ascending = False).head(10)

In [None]:
# Preview of details of top 5 most funded startups
dataset.sort_values(by = 'Amount in USD', ascending = False).head(5)

In [None]:
# Calculating average funding received by a startup
dataset['Amount in USD'].mean() 

In [None]:
# Preview of least funded startups
dataset['Amount in USD'].sort_values().head(10)

In [None]:
# Preview of details of least funded startups
dataset.sort_values(by = 'Amount in USD').head(5)

<div class="alert alert-block alert-info">

<h3 style="font-family:verdana;">Observations:</h3>

<ul>
<li><p style="font-family:verdana;">
The highest funded startups are 'Rapido Bike taxi', 'Paytm', and 'Flipkart'.</p></li>
<li><p style="font-family:verdana;">The average funding in the Indian startup ecosystem is about 18.4 million USD (18,429,897 USD).</p></li>
<li><p style="font-family:verdana;">The lowest-funded startups are 'Hostel Dunia', 'Play your sport', and 'Yo Grad' with about 16000 USD each.</p></li>
</ul>

</div>
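
When a few very large rounds sit next to many small ones, the mean and the median can tell quite different stories about what startups "generally" get. A sketch on hypothetical toy amounts, not taken from the dataset:

```python
import pandas as pd

# Hypothetical amounts in USD: several modest rounds and one very large round
amounts = pd.Series([16000, 500000, 1000000, 2000000, 3900000000])

# Compare the two measures of a typical funding amount
summary = amounts.agg(['mean', 'median'])

print(summary)
```

Reporting the median alongside the mean is one way to describe a typical round without letting a handful of outliers dominate.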

### Q6. Which startups have been funded the most times?

In [None]:
# Selecting the startups funded the most number of times
most_funded = dataset['Startup Name'].value_counts().head(20)

In [None]:
# Preview frequencies
most_funded.values

In [None]:
# Preview names 
most_funded.index

In [None]:
# Creating a barplot of startups funded most number of times
plt.figure(figsize = (25, 5))
sns.barplot(x = most_funded.index, y = most_funded.values, palette = 'colorblind')
plt.title('Startups funded the most times', fontdict = {'fontname' : 'Monospace', 'fontsize' : 30, 'fontweight' : 'bold'})
plt.xlabel('Startup Name', fontdict = {'fontname' : 'Monospace', 'fontsize' : 20})
plt.ylabel('Number of times funded', fontdict = {'fontname' : 'Monospace', 'fontsize' : 20})
plt.tick_params(axis = 'x', labelsize = 10)
plt.tick_params(axis = 'y', labelsize = 15)
plt.grid()
plt.show()

In [None]:
from wordcloud import WordCloud, STOPWORDS

In [None]:
most_funded_1 = dataset['Startup Name'].value_counts().head(30)

In [None]:
names = most_funded_1.index

In [None]:
# Creating a wordcloud of startup names
plt.figure(figsize = (20, 7))
wordcloud = WordCloud(max_font_size = 25, width = 300, height = 100).generate(' '.join(names))
plt.title('Startups funded the most times', fontdict = {'fontname' : 'Monospace', 'fontsize' : 30, 'fontweight' : 'bold'})
plt.axis("off")
plt.imshow(wordcloud)
plt.show()

<div class="alert alert-block alert-info">

<h3 style="font-family:verdana;">Observations:</h3>

<ul>
<li><p style="font-family:verdana;">The startups funded the most times are 'Swiggy', 'OlaCabs' and 'Paytm'.</p></li>

</ul>

</div>

### Q7. What are the different types of funding for startups?

In [None]:
# Preview of types of investments sorted by frequency
dataset['InvestmentnType'].value_counts().head(10)

In [None]:
# Selecting 10 most common investment types
investment_type = dataset['InvestmentnType'].value_counts().head(10)

In [None]:
# Creating a Treemap of Investment types
import squarify
plt.figure(figsize = (10, 7))
squarify.plot(sizes = investment_type.values, label = investment_type.index, value = investment_type.values)
plt.title('Investment type distribution', fontdict = {'fontname' : 'Monospace', 'fontsize' : 30, 'fontweight' : 'bold'})
plt.show()

<div class="alert alert-block alert-info">

<h3 style="font-family:verdana;">Observations:</h3>

<ul>
<li><p style="font-family:verdana;">The major types of investments are Private Equity and Seed Funding which account for more than 90% of fundings.</p></li>
<li><p style="font-family:verdana;">Other types of fundings are Debt funding, Seed/Angel funding, Series A, Series B, Series C but they are very rare.</p></li>
</ul>

</div>

<div class="alert alert-block alert-info">

<h2 style="font-family:verdana;">Conclusion:</h2>

<ul>
<li><p style="font-family:verdana;">
Technology-based startups that serve everyday consumers are very likely to receive substantial funding. Flipkart and Paytm are examples of such startups.</p></li>
<li><p style="font-family:verdana;">A large number of funded startups are based in metropolitan cities like Bangalore and Mumbai, which may be due to the large talent pool available in those cities.</p></li>
<li><p style="font-family:verdana;">Most of the fundings are of either the Private Equity or the Seed Funding type.</p></li>
<li><p style="font-family:verdana;">Ratan Tata, Indian Angel Network, and Kalaari Capital are some of the top investors in the Indian startup ecosystem.</p></li>
<li><p style="font-family:verdana;">Flipkart, Paytm and Rapido Bike taxi are among the most funded startups, whereas Ola Cabs and Swiggy were funded the most times.</p></li>
</ul>

</div>