### Import the libraries

In [47]:
import pandas as pd
import numpy as np

### Read the dataset

In [2]:
data = pd.read_csv('startup_funding.csv')
data.head()

Unnamed: 0,SNo,Date,StartupName,IndustryVertical,SubVertical,CityLocation,InvestorsName,InvestmentType,AmountInUSD,Remarks
0,0,01/08/2017,TouchKin,Technology,Predictive Care Platform,Bangalore,Kae Capital,Private Equity,1300000.0,
1,1,02/08/2017,Ethinos,Technology,Digital Marketing Agency,Mumbai,Triton Investment Advisors,Private Equity,,
2,2,02/08/2017,Leverage Edu,Consumer Internet,Online platform for Higher Education Services,New Delhi,"Kashyap Deorah, Anand Sankeshwar, Deepak Jain,...",Seed Funding,,
3,3,02/08/2017,Zepo,Consumer Internet,DIY Ecommerce platform,Mumbai,"Kunal Shah, LetsVenture, Anupam Mittal, Hetal ...",Seed Funding,500000.0,
4,4,02/08/2017,Click2Clinic,Consumer Internet,healthcare service aggregator,Hyderabad,"Narottam Thudi, Shireesh Palle",Seed Funding,850000.0,


### Clean the dataset

We want to find the maximum amount of funding given to any startup in the following regions:

- Bangalore
- NCR (which includes New Delhi, Gurgaon and Noida)
- Mumbai
- Pune
- Hyderabad

First, build the population data by extracting the startups located in these cities.

Before that, normalize the city names by replacing:

- `Delhi` with `New Delhi`
- `bangalore` with `Bangalore`
- `New Delhi`, `Gurgaon` and `Noida` with `NCR`



In [6]:
# Fix the lowercase spelling of Bangalore
data.CityLocation = data.CityLocation.replace(to_replace=["bangalore"], value="Bangalore")
# Order matters: Delhi becomes New Delhi first, then New Delhi is folded into NCR
data.CityLocation = data.CityLocation.replace(to_replace=["Delhi"], value="New Delhi")
data.CityLocation = data.CityLocation.replace(to_replace=["New Delhi", "Gurgaon", "Noida"], value="NCR")
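
As an alternative to chaining several `replace` calls, the same normalization can be sketched with a single dictionary that sends each raw spelling straight to its final label. The toy Series and the name `city_map` below are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy data standing in for the CityLocation column (values are illustrative).
cities = pd.Series(["bangalore", "Delhi", "New Delhi", "Gurgaon", "Noida", "Mumbai"])

# Hypothetical mapping: every raw spelling maps directly to its final label,
# so the result does not depend on the order of the replacements.
city_map = {
    "bangalore": "Bangalore",
    "Delhi": "NCR",
    "New Delhi": "NCR",
    "Gurgaon": "NCR",
    "Noida": "NCR",
}

normalized = cities.replace(city_map)
print(normalized.tolist())
```

Values that do not appear in the mapping, such as `Mumbai`, are left unchanged.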

In [7]:
data.head()

Unnamed: 0,SNo,Date,StartupName,IndustryVertical,SubVertical,CityLocation,InvestorsName,InvestmentType,AmountInUSD,Remarks
0,0,01/08/2017,TouchKin,Technology,Predictive Care Platform,Bangalore,Kae Capital,Private Equity,1300000.0,
1,1,02/08/2017,Ethinos,Technology,Digital Marketing Agency,Mumbai,Triton Investment Advisors,Private Equity,,
2,2,02/08/2017,Leverage Edu,Consumer Internet,Online platform for Higher Education Services,NCR,"Kashyap Deorah, Anand Sankeshwar, Deepak Jain,...",Seed Funding,,
3,3,02/08/2017,Zepo,Consumer Internet,DIY Ecommerce platform,Mumbai,"Kunal Shah, LetsVenture, Anupam Mittal, Hetal ...",Seed Funding,500000.0,
4,4,02/08/2017,Click2Clinic,Consumer Internet,healthcare service aggregator,Hyderabad,"Narottam Thudi, Shireesh Palle",Seed Funding,850000.0,


In [14]:
data.shape

(2372, 10)

Now extract only the rows for the cities `['Bangalore', 'NCR', 'Mumbai', 'Pune', 'Hyderabad']`.

In [11]:
city = data.loc[data['CityLocation'].isin(['Bangalore', 'NCR', 'Mumbai', 'Pune', 'Hyderabad'])]

In [12]:
city.head()

Unnamed: 0,SNo,Date,StartupName,IndustryVertical,SubVertical,CityLocation,InvestorsName,InvestmentType,AmountInUSD,Remarks
0,0,01/08/2017,TouchKin,Technology,Predictive Care Platform,Bangalore,Kae Capital,Private Equity,1300000.0,
1,1,02/08/2017,Ethinos,Technology,Digital Marketing Agency,Mumbai,Triton Investment Advisors,Private Equity,,
2,2,02/08/2017,Leverage Edu,Consumer Internet,Online platform for Higher Education Services,NCR,"Kashyap Deorah, Anand Sankeshwar, Deepak Jain,...",Seed Funding,,
3,3,02/08/2017,Zepo,Consumer Internet,DIY Ecommerce platform,Mumbai,"Kunal Shah, LetsVenture, Anupam Mittal, Hetal ...",Seed Funding,500000.0,
4,4,02/08/2017,Click2Clinic,Consumer Internet,healthcare service aggregator,Hyderabad,"Narottam Thudi, Shireesh Palle",Seed Funding,850000.0,


In [13]:
city.shape

(1937, 10)

In [15]:
city.CityLocation.value_counts()

NCR          703
Bangalore    628
Mumbai       446
Pune          84
Hyderabad     76
Name: CityLocation, dtype: int64

Create strata by drawing 20 random rows from each city.

In [17]:
strata = city.groupby('CityLocation', group_keys=False).apply(lambda x: x.sample(20))

In [18]:
strata.head()

Unnamed: 0,SNo,Date,StartupName,IndustryVertical,SubVertical,CityLocation,InvestorsName,InvestmentType,AmountInUSD,Remarks
870,870,15/07/2016,BankerBay,Consumer Internet,Online Investment Banking platform,Bangalore,undisclosed investors,Private Equity,2000000.0,
1020,1020,10/5/2016,Swiggy,Consumer Internet,Food Delivery Platform,Bangalore,"Norwest Venture Partners, DST Global, Accel Pa...",Private Equity,7000000.0,
430,430,25/01/2017,InstaSafe,Technology,Security-as-a-Service solution provider,Bangalore,ABM Knowledgeware,Private Equity,2200000.0,
1324,1324,25/02/2016,CarveNiche,Consumer Internet,Personalized Learning Solutions & products,Bangalore,"Calcutta Angels, Lead Angels & Others",Seed Funding,,
1614,1614,30/11/2015,QikPod,Ecommerce Delivery locker services,,Bangalore,"Flipkart, Accel Partners, Delhivery, Foxconn",Private Equity,9000000.0,Series A


In [19]:
strata.CityLocation.value_counts()

Bangalore    20
Hyderabad    20
Mumbai       20
NCR          20
Pune         20
Name: CityLocation, dtype: int64
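
If a repeatable draw is wanted, one possible variation is to pass a seed. The sketch below assumes pandas 1.1 or later, where `DataFrameGroupBy.sample` is available, and uses a made-up toy frame and an arbitrary seed of 42 rather than the real dataset.

```python
import pandas as pd

# Toy frame: three cities with unequal group sizes (values are illustrative).
toy = pd.DataFrame({
    "CityLocation": ["NCR"] * 50 + ["Pune"] * 30 + ["Mumbai"] * 40,
    "AmountInUSD": range(120),
})

# Draw the same number of rows from every city; random_state fixes the draw.
toy_strata = toy.groupby("CityLocation").sample(n=20, random_state=42)

# Each city contributes exactly 20 rows, whatever its size in the population.
print(toy_strata["CityLocation"].value_counts().to_dict())
```

Running the cell again with the same `random_state` returns the same rows, which makes any statistic computed on the sample reproducible.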

Clean the `AmountInUSD` column of the population data: drop NA values, remove the thousands separators, and convert it to numeric.

In [24]:
# Take a new Series with missing values dropped, instead of editing a slice of `city` in place
amount = city['AmountInUSD'].dropna()

# Remove thousands separators so the strings can be parsed as numbers
a1 = amount.str.replace(',', '')

a2 = pd.to_numeric(a1)
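
Since the same cleaning is applied to both the population and the sample, the steps can be wrapped in a small helper. This is a minimal sketch: the function name `clean_amount` and the toy values are assumptions for illustration, and the `astype(str)` call is added so the helper also tolerates a column that is already numeric.

```python
import pandas as pd

def clean_amount(column):
    """Drop missing values, strip thousands separators, return a numeric Series."""
    cleaned = column.dropna()  # returns a new Series, the input is untouched
    cleaned = cleaned.astype(str).str.replace(",", "", regex=False)
    return pd.to_numeric(cleaned)

# Toy column with comma-formatted strings and one missing value.
raw = pd.Series(["1,300,000", None, "500,000", "850000"])
amounts = clean_amount(raw)
print(amounts.max())
```

Because `dropna()` returns a new object, this version does not modify a slice of the original DataFrame in place.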


Get the maximum of the population

In [25]:
pop_max = a2.max()

Do the same for the sample strata data

In [26]:
# Same cleaning as for the population, applied to the stratified sample
amount = strata['AmountInUSD'].dropna()

a1 = amount.str.replace(',', '')

a2 = pd.to_numeric(a1)

In [27]:
sample_max = a2.max()

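The sampling error of the stratified sample, taken here as the difference between the population maximum and the sample maximum, is given below. Because the strata are drawn at random, this value changes from run to run.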

In [28]:
pop_max - sample_max

1220000000
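
To illustrate how this error behaves, the sketch below repeats the calculation on a small made-up population with several seeds. The numbers and the seeds are assumptions chosen only for illustration and have no relation to the funding data.

```python
import pandas as pd

# Illustrative population of amounts (not taken from the dataset).
population = pd.Series([5, 12, 40, 7, 100, 33, 21, 60])
pop_max = population.max()

# Sampling error of the maximum for a few different random draws.
errors = []
for seed in range(5):
    sample = population.sample(n=4, random_state=seed)
    errors.append(pop_max - sample.max())

print(errors)
```

The sample maximum can never exceed the population maximum, so the error is always non-negative, and it is zero only when the draw happens to include the largest value.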