### Import the libraries

In [1]:
import pandas as pd
import numpy as np

### Read the dataset

In [2]:
data = pd.read_csv('startup_funding.csv')
data.head()

Unnamed: 0,SNo,Date,StartupName,IndustryVertical,SubVertical,CityLocation,InvestorsName,InvestmentType,AmountInUSD,Remarks
0,0,01/08/2017,TouchKin,Technology,Predictive Care Platform,Bangalore,Kae Capital,Private Equity,1300000.0,
1,1,02/08/2017,Ethinos,Technology,Digital Marketing Agency,Mumbai,Triton Investment Advisors,Private Equity,,
2,2,02/08/2017,Leverage Edu,Consumer Internet,Online platform for Higher Education Services,New Delhi,"Kashyap Deorah, Anand Sankeshwar, Deepak Jain,...",Seed Funding,,
3,3,02/08/2017,Zepo,Consumer Internet,DIY Ecommerce platform,Mumbai,"Kunal Shah, LetsVenture, Anupam Mittal, Hetal ...",Seed Funding,500000.0,
4,4,02/08/2017,Click2Clinic,Consumer Internet,healthcare service aggregator,Hyderabad,"Narottam Thudi, Shireesh Palle",Seed Funding,850000.0,


### Clean the dataset

We want to find the average amount of funding given to startups located in either Bangalore or New Delhi.

Note:
Treat the city name "Delhi" as "New Delhi".

City names are case-sensitive, so check for case variants as well. In some rows the city is written as "bangalore" instead of "Bangalore"; treat these as "Bangalore".



In [3]:
data.CityLocation = data.CityLocation.replace(to_replace=["bangalore"], value="Bangalore")
data.CityLocation = data.CityLocation.replace(to_replace=["Delhi"], value="New Delhi")
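
The two `replace` calls handle exactly the variants named in the note. As a rough sketch of a more general approach, the city column could be normalized by trimming whitespace and title-casing before mapping the "Delhi" alias. The `cities` series below is made-up toy data standing in for `data.CityLocation`, not values taken from the dataset.

```python
import pandas as pd

# Toy stand-in for data.CityLocation; these values are invented for illustration.
cities = pd.Series(["bangalore", "Bangalore", "Delhi", "New Delhi", " BANGALORE ", None])

# Trim whitespace and title-case every name, then map the exact value "Delhi"
# to "New Delhi". Missing values stay missing.
normalized = cities.str.strip().str.title().replace({"Delhi": "New Delhi"})

print(normalized.value_counts())
```

Title-casing is a blunt tool: it would also change names that are legitimately written in another case, so it is worth comparing `value_counts()` before and after applying it to real data.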

In [4]:
data.head()

Unnamed: 0,SNo,Date,StartupName,IndustryVertical,SubVertical,CityLocation,InvestorsName,InvestmentType,AmountInUSD,Remarks
0,0,01/08/2017,TouchKin,Technology,Predictive Care Platform,Bangalore,Kae Capital,Private Equity,1300000.0,
1,1,02/08/2017,Ethinos,Technology,Digital Marketing Agency,Mumbai,Triton Investment Advisors,Private Equity,,
2,2,02/08/2017,Leverage Edu,Consumer Internet,Online platform for Higher Education Services,New Delhi,"Kashyap Deorah, Anand Sankeshwar, Deepak Jain,...",Seed Funding,,
3,3,02/08/2017,Zepo,Consumer Internet,DIY Ecommerce platform,Mumbai,"Kunal Shah, LetsVenture, Anupam Mittal, Hetal ...",Seed Funding,500000.0,
4,4,02/08/2017,Click2Clinic,Consumer Internet,healthcare service aggregator,Hyderabad,"Narottam Thudi, Shireesh Palle",Seed Funding,850000.0,


In [5]:
data.shape

(2372, 10)

Now extract only the data for cities `['Bangalore', 'New Delhi']`

In [6]:
city = data.loc[data['CityLocation'].isin(['Bangalore', 'New Delhi'])]

In [7]:
city.head()

Unnamed: 0,SNo,Date,StartupName,IndustryVertical,SubVertical,CityLocation,InvestorsName,InvestmentType,AmountInUSD,Remarks
0,0,01/08/2017,TouchKin,Technology,Predictive Care Platform,Bangalore,Kae Capital,Private Equity,1300000.0,
2,2,02/08/2017,Leverage Edu,Consumer Internet,Online platform for Higher Education Services,New Delhi,"Kashyap Deorah, Anand Sankeshwar, Deepak Jain,...",Seed Funding,,
5,5,01/07/2017,Billion Loans,Consumer Internet,Peer to Peer Lending platform,Bangalore,Reliance Corporate Advisory Services Ltd,Seed Funding,1000000.0,
8,8,05/07/2017,Jumbotail,eCommerce,online marketplace for food and grocery,Bangalore,"Kalaari Capital, Nexus India Capital Advisors",Private Equity,8500000.0,
11,11,06/07/2017,Minjar,Technology,Cloud Solutions provider,Bangalore,"Blume Ventures, Contrarian Capital India Partn...",Seed Funding,,


In [8]:
city.shape

(1013, 10)

In [9]:
city.CityLocation.value_counts()

Bangalore    628
New Delhi    385
Name: CityLocation, dtype: int64

Now, rather than considering all the startups, take a sample of size 50 (with replacement). Then find the average amount of funding in this sample and calculate the sampling error.

To do this, select rows at random with `replace=True`. The `replace` parameter allows the same row to be selected more than once. Its default value in the `sample()` method is `False`, in which case you cannot select more rows than the DataFrame contains.
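
A minimal sketch of the difference between the two settings, using a three-row toy frame (`toy` is hypothetical data, not part of the startup dataset; `random_state` is set only to make the draw repeatable):

```python
import pandas as pd

# Three-row toy frame (hypothetical data) standing in for `city`.
toy = pd.DataFrame({"StartupName": ["A", "B", "C"], "AmountInUSD": [1.0, 2.0, 3.0]})

# With replacement, the sample may be larger than the frame and rows can repeat.
with_repl = toy.sample(n=5, replace=True, random_state=0)
print(len(with_repl))

# Without replacement, asking for more rows than exist raises ValueError.
try:
    toy.sample(n=5, replace=False)
except ValueError as err:
    print("ValueError:", err)
```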

In [11]:
sample = city.sample(n = 50, replace = True)

In [12]:
sample.head()

Unnamed: 0,SNo,Date,StartupName,IndustryVertical,SubVertical,CityLocation,InvestorsName,InvestmentType,AmountInUSD,Remarks
2037,2037,16/06/2015,Karma Recycling,Electronic Goods recycling service,,New Delhi,"Infuse Ventures, Low Carbon Enterprise Fund",Private Equity,,Series A
1813,1813,05/08/2015,Zocalo,Rental Accomodation finder,,New Delhi,"Sachin Bhatia, Rajesh Sawhney",Seed Funding,,
571,571,26/11/2016,TapChief,Consumer Internet,Professional Expert Advice Platform,Bangalore,"Paytm, Aprameya Radhakrishna, Subramanya Venka...",Seed Funding,,
1497,1497,17/12/2015,Eatfresh,Online marketplace for Chef Meals,,Bangalore,Kalaari Capital,Private Equity,,
509,509,27/12/2016,ShopX,eCommerce,SAAS ECommerce Retail app,Bangalore,Nandan Nilekani,Private Equity,5000000.0,


In [13]:
sample.CityLocation.value_counts()

Bangalore    26
New Delhi    24
Name: CityLocation, dtype: int64

Clean the population data column `AmountInUSD` for NA values and convert it to numeric.

In [14]:
# Drop missing values without modifying the `city` slice in place
amount = city['AmountInUSD'].dropna()

# Strip thousands separators, then convert the strings to numbers
a1 = amount.str.replace(',', '')

a2 = pd.to_numeric(a1)


Get the mean of the population

In [15]:
pop_avg = a2.mean()

Do the same for the sample data

In [16]:
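# Drop missing values without modifying `sample` in place
amount = sample['AmountInUSD'].dropna()

# Strip thousands separators, then convert the strings to numbers
a1 = amount.str.replace(',', '')

a2 = pd.to_numeric(a1)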

In [21]:
sample_avg = a2.mean()
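
The population and the sample go through the same three cleaning steps, so they could be wrapped in one function. The sketch below assumes, as the cells above do, that the amounts are stored as strings with comma thousands separators. The name `clean_amount` and the values in `raw` are hypothetical.

```python
import pandas as pd

def clean_amount(series: pd.Series) -> pd.Series:
    """Drop missing values, strip thousands separators, and convert to numbers."""
    cleaned = series.dropna().astype(str).str.replace(",", "", regex=False)
    return pd.to_numeric(cleaned)

# Hypothetical values in the comma-separated string format assumed above.
raw = pd.Series(["1,300,000", None, "500,000", "850,000"])
amounts = clean_amount(raw)
print(amounts.mean())
```

With such a helper, the two averages would reduce to `clean_amount(city['AmountInUSD']).mean()` and `clean_amount(sample['AmountInUSD']).mean()`, and the second computation would no longer overwrite `a2`.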

The sampling error, the difference between the population mean and the sample mean, is given below.

In [22]:
pop_avg - sample_avg

13564620.03143782
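
The figure above comes from a single random draw, so rerunning the notebook would give a different value. As an illustrative sketch, the snippet below repeats the draw several times on a synthetic, skewed population (generated here with NumPy, not the startup data) to show how much the sampling error can move between samples. The seeds and the lognormal parameters are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic, skewed "funding amounts"; this is not the startup dataset.
rng = np.random.default_rng(42)
population = pd.Series(rng.lognormal(mean=13, sigma=1.5, size=1000))

pop_mean = population.mean()

# Draw several samples of size 50 with replacement and record each error,
# defined as in the notebook: population mean minus sample mean.
errors = [
    pop_mean - population.sample(n=50, replace=True, random_state=seed).mean()
    for seed in range(5)
]
print(errors)
```

Passing `random_state` to `sample()` in the notebook itself would make the reported sampling error reproducible across runs.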