# Statistics Essential: Investment Analysis Assignment

Spark Funds wants to invest in a few companies. Its CEO wants to understand global investment trends so that she can make investment decisions effectively.

# Approach

To complete the analysis for Spark Funds, we will follow the steps of the ***CRISP-DM*** framework. This gives the data analysis, cleaning, and processing a clear structure.

1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Data Modelling
5. Model Evaluation 
6. Model Deployment 

We aim to align the analysis with the steps listed above in order to structure it. The assumption is that we will not be able to follow every step of the CRISP-DM framework, but this analysis is a good start and maps onto a few of them.

## 1. Business Understanding

##### The business objectives and goals of data analysis are pretty straightforward.

>**Business objective**: <br>
The objective is to identify the best sectors, countries, and a suitable investment type for making investments. The overall strategy is to invest where others are investing, implying that the 'best' sectors and countries are the ones 'where most investors are investing'.

## 2. Data Understanding

>**Goals of data analysis**: <br>
Goals are divided into three sub-goals:
    
>>`Investment type analysis`: Comparing the typical investment amounts in the venture, seed, angel, private equity etc. so that Spark Funds can choose the type that is best suited for their strategy.

>>`Country analysis`: Identifying the countries which have been the most heavily invested in the past. These will be Spark Funds’ favourites as well.

>>`Sector analysis`: Understanding the distribution of investments across the eight main sectors.

***Data points provided for analysis***
1. Company details (**companies**) - All the information about the companies for which the analysis needs to be performed
2. Funding round details (**rounds2**) - Data related to the funding rounds
3. Sector classification (**mapping**) - Data matrix mapping each category to a main sector

## 3. Data Preparation

***The steps are divided as follows:***
> a. `Loading Data` <br>
> b. `Filtering and Filling NaN/Blank Data values`<br>
> c. `Cleaning Data` <br>

-----------------

## Checkpoint 1: Data Cleaning 1

### a. Loading Data

**Importing required libraries for data analysis** 

In [2]:
import pandas as pd
import numpy as np
import re

pd.options.display.float_format = '{:20,.2f}'.format

**Reading the data into frames, passing an explicit encoding to avoid decoding errors**

In [8]:
# Candidate encodings that were tried while reading the files
encoding_cp = "cp1252"
encoding_iso = "ISO-8859-1"
encoding_utf = "utf-8"
encoding_latin_1="latin-1"
encoding_latin="latin"
encoding_utf_sig="utf-8-sig"
encoding_unicode="unicode-escape"
encoding_raw_unicode="raw-unicode-escape"

companies = pd.read_csv("../input/companies.txt", sep="\t", encoding = encoding_latin)
rounds2 = pd.read_csv("../input/rounds2.csv",encoding = encoding_latin)
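
The cell above lists several candidate encodings but uses only one of them. As a rough sketch of how the candidates could be tried in order, the helper below (`read_csv_with_fallback` is a hypothetical name, not part of the notebook or of pandas) returns the first encoding that decodes without error. The input is a toy byte string standing in for `companies.txt`. Note that `latin-1` maps every byte to a character, so it never raises a decode error; a wrong choice shows up as garbled characters instead.

```python
import io
import pandas as pd

def read_csv_with_fallback(source_bytes, encodings=("utf-8", "cp1252", "latin-1"), **kwargs):
    """Hypothetical helper: try each encoding in turn, return (frame, encoding used)."""
    for enc in encodings:
        try:
            return pd.read_csv(io.BytesIO(source_bytes), encoding=enc, **kwargs), enc
        except UnicodeDecodeError:
            continue  # this encoding cannot decode the bytes, try the next one
    raise ValueError("none of the candidate encodings could decode the input")

# Toy bytes standing in for companies.txt: "é" encoded as cp1252 is not valid UTF-8.
sample = "permalink\tname\n/organization/café\tCafé\n".encode("cp1252")
frame, used = read_csv_with_fallback(sample, sep="\t")
print(used, frame.shape)
```

Ordering the candidates from strictest to most permissive means a permissive encoding is only used when the stricter ones fail.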

### b. Cleaning Data

**Helper Method**

In [9]:
# A lambda (lambda x: x.lower()) would also work, but a named function avoids repeating it
def lowerCase(name):
    return name.lower()

# Remove special characters: keep only digits, ASCII letters and the symbols - + / .
def cleanNameField(name):
    return "".join(re.findall(r'[0-9a-zA-Z+/.-]', name))

**Formatting fields in the dataframes to lower case for better comparison**

In [10]:
companies["permalink"] = companies["permalink"].apply(lowerCase)
rounds2["company_permalink"] = rounds2["company_permalink"].apply(lowerCase)

# Removing special characters from the key fields so the datasets compare correctly
companies["permalink"] = companies["permalink"].apply(cleanNameField)
rounds2["company_permalink"] = rounds2["company_permalink"].apply(cleanNameField)
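
The same normalisation can also be written with pandas' vectorised string methods instead of `apply`. The sketch below uses a toy frame (the values are made up for illustration) and assumes the same character whitelist as `cleanNameField`.

```python
import pandas as pd

# Toy permalinks with mixed case and one non-ASCII character (illustrative values only).
toy = pd.DataFrame({
    "permalink": ["/Organization/-Fame", "/ORGANIZATION/0-6-Com", "/organization/caf\u00e9-x"]
})

# Lower-case first, then drop every character outside the whitelist: digits, a-z, + / . -
toy["permalink"] = (
    toy["permalink"]
    .str.lower()
    .str.replace(r"[^0-9a-z+/.-]", "", regex=True)
)
print(toy["permalink"].tolist())
```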

In [11]:
rounds2.head(5)

Unnamed: 0,company_permalink,funding_round_permalink,funding_round_type,funding_round_code,funded_at,raised_amount_usd
0,/organization/-fame,/funding-round/9a01d05418af9f794eebff7ace91f638,venture,B,05-01-2015,10000000.0
1,/organization/-qounter,/funding-round/22dacff496eb7acb2b901dec1dfe5633,venture,A,14-10-2014,
2,/organization/-qounter,/funding-round/b44fbb94153f6cdef13083530bb48030,seed,,01-03-2014,700000.0
3,/organization/-the-one-of-them-inc-,/funding-round/650b8f704416801069bb178a1418776b,venture,B,30-01-2014,3406878.0
4,/organization/0-6-com,/funding-round/5727accaeaa57461bd22a9bdd945382d,venture,A,19-03-2008,2000000.0


In [12]:
companies.head(5)

Unnamed: 0,permalink,name,homepage_url,category_list,status,country_code,state_code,region,city,founded_at
0,/organization/-fame,#fame,http://livfame.com,Media,operating,IND,16,Mumbai,Mumbai,
1,/organization/-qounter,:Qounter,http://www.qounter.com,Application Platforms|Real Time|Social Network...,operating,USA,DE,DE - Other,Delaware City,04-09-2014
2,/organization/-the-one-of-them-inc-,"(THE) ONE of THEM,Inc.",http://oneofthem.jp,Apps|Games|Mobile,operating,,,,,
3,/organization/0-6-com,0-6.com,http://www.0-6.com,Curated Web,operating,CHN,22,Beijing,Beijing,01-01-2007
4,/organization/004-technologies,004 Technologies,http://004gmbh.de/en/004-interact,Software,operating,USA,IL,"Springfield, Illinois",Champaign,01-01-2010


# Questions

#### Question 1 : How many unique companies are present in rounds2?

In [13]:
rounds2.company_permalink.nunique()

66368

#### Question 2 : How many unique companies are present in companies?

In [14]:
companies.permalink.nunique()

66368

#### Question 3 : In the companies data frame, which column can be used as the unique key for each company? Write the name of the column.

***permalink*** can be used as the unique key column.

#### Question 4 : Are there any companies in the rounds2 file which are not present in companies?

In [17]:
# Create index using company unique ID
companies_index = pd.Index(companies["permalink"])
rounds2_index = pd.Index(rounds2["company_permalink"])

rounds2_index.difference(companies_index)

# A difference between companies and rounds2 appears only when the data is not cleaned and still has special characters

# No difference once special characters are removed

# Answer is N, no difference

Index([], dtype='object')
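
Another way to run this check, sketched here on toy frames rather than the real data, is a left merge with `indicator=True`: rows marked `left_only` are funding rounds whose company key has no match.

```python
import pandas as pd

# Toy stand-ins for companies and rounds2 (illustrative keys only).
companies_toy = pd.DataFrame({"permalink": ["/organization/a", "/organization/b"]})
rounds_toy = pd.DataFrame(
    {"company_permalink": ["/organization/a", "/organization/a", "/organization/c"]}
)

# indicator=True adds a "_merge" column telling where each row found a match.
check = rounds_toy.merge(
    companies_toy,
    how="left",
    left_on="company_permalink",
    right_on="permalink",
    indicator=True,
)
missing = check.loc[check["_merge"] == "left_only", "company_permalink"].unique()
print(missing)
```

An empty result corresponds to the empty `Index` shown above.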

#### Question 5 : Merging Data Frames [How many observations are present in master_frame ?]

In [18]:
# Renaming the company key column so that both frames share the merge key
companies.rename(columns={'permalink':'company_permalink'},inplace=True)

# Merging both to create master_frame
master_frame = pd.merge(companies,rounds2,how='inner',on='company_permalink')

print(len(master_frame))
print((master_frame.shape))

114949
(114949, 15)


***Cleaning Master Frame for further analysis***

In [19]:
master_frame.category_list = master_frame.category_list.astype(str)

master_frame["status"] = master_frame["status"].apply(lowerCase)
master_frame["funding_round_type"] = master_frame["funding_round_type"].apply(lowerCase)

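# Removing rows with NaN values, as these fields play an important role in the analysis
master_frame = master_frame[pd.notnull(master_frame['country_code'])]
# Note: category_list was cast to str above; in pandas versions where that cast turns NaN
# into the text 'nan', this filter does not drop those rows
master_frame = master_frame[pd.notnull(master_frame['category_list'])]

# raised_amount_usd feeds every aggregate below, so rows without it are removed as well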
master_frame = master_frame[pd.notnull(master_frame['raised_amount_usd'])]
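
The three filters can be collapsed into a single `dropna` call with a `subset`. This is a minimal sketch on a toy frame, assuming the original lower-case column names.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value in each of two key columns (illustrative data).
toy = pd.DataFrame({
    "country_code": ["USA", None, "IND"],
    "category_list": ["Media", "Games", np.nan],
    "raised_amount_usd": [1.0e6, 2.0e6, 3.0e6],
})

# Drop a row if any of the listed key columns is missing.
key_columns = ["country_code", "category_list", "raised_amount_usd"]
cleaned = toy.dropna(subset=key_columns)
print(len(toy), "->", len(cleaned))
```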

In [20]:
# Capitalizing column names for better visuals and differentiation
master_frame.rename(columns=lambda x: x.title(), inplace=True)

master_frame.head()

Unnamed: 0,Company_Permalink,Name,Homepage_Url,Category_List,Status,Country_Code,State_Code,Region,City,Founded_At,Funding_Round_Permalink,Funding_Round_Type,Funding_Round_Code,Funded_At,Raised_Amount_Usd
0,/organization/-fame,#fame,http://livfame.com,Media,operating,IND,16,Mumbai,Mumbai,,/funding-round/9a01d05418af9f794eebff7ace91f638,venture,B,05-01-2015,10000000.0
2,/organization/-qounter,:Qounter,http://www.qounter.com,Application Platforms|Real Time|Social Network...,operating,USA,DE,DE - Other,Delaware City,04-09-2014,/funding-round/b44fbb94153f6cdef13083530bb48030,seed,,01-03-2014,700000.0
4,/organization/0-6-com,0-6.com,http://www.0-6.com,Curated Web,operating,CHN,22,Beijing,Beijing,01-01-2007,/funding-round/5727accaeaa57461bd22a9bdd945382d,venture,A,19-03-2008,2000000.0
6,/organization/01games-technology,01Games Technology,http://www.01games.hk/,Games,operating,HKG,,Hong Kong,Hong Kong,,/funding-round/7d53696f2b4f607a2f2a8cbb83d01839,undisclosed,,01-07-2014,41250.0
7,/organization/0ndine-biomedical-inc,Ondine Biomedical Inc.,http://ondinebio.com,Biotechnology,operating,CAN,BC,Vancouver,Vancouver,01-01-1997,/funding-round/2b9d3ac293d5cdccbecff5c8cb0f327d,seed,,11-09-2009,43360.0


-----------------

## Checkpoint 2: Funding Type Analysis

#### Question [1,2,3,4] : Average funding amount of Venture, Seed, Angel & Private Equity

In [21]:
print(master_frame.Funding_Round_Type.unique())

['venture' 'seed' 'undisclosed' 'convertible_note' 'private_equity'
 'debt_financing' 'angel' 'grant' 'equity_crowdfunding' 'post_ipo_equity'
 'post_ipo_debt' 'product_crowdfunding' 'secondary_market'
 'non_equity_assistance']


In [22]:

venture_funding_frame = master_frame[master_frame["Funding_Round_Type"].isin(["venture"])]
angel_funding_frame = master_frame[master_frame["Funding_Round_Type"].isin(["angel"])]
seed_funding_frame = master_frame[master_frame["Funding_Round_Type"].isin(["seed"])]
private_equity_funding_frame = master_frame[master_frame["Funding_Round_Type"].isin(["private_equity"])]

print("Average Venture Funding : "+ str(venture_funding_frame.Raised_Amount_Usd.mean()))
print("Average Angel Funding : "+ str(angel_funding_frame.Raised_Amount_Usd.mean()))
print("Average Seed Funding : "+ str(seed_funding_frame.Raised_Amount_Usd.mean()))
print("Average Private Equity Funding : "+ str(private_equity_funding_frame.Raised_Amount_Usd.mean()))


Average Venture Funding : 11735779.935191536
Average Angel Funding : 968559.909645358
Average Seed Funding : 748104.4981867847
Average Private Equity Funding : 73618563.61743869
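
The four averages can also be produced in one pass with `groupby`. The sketch below runs on toy amounts (not the real data) and adds the median and count, since a mean alone can be pulled upward by a few very large rounds.

```python
import pandas as pd

# Toy funding rounds (illustrative amounts only).
toy = pd.DataFrame({
    "Funding_Round_Type": ["venture", "venture", "seed", "seed", "angel", "private_equity"],
    "Raised_Amount_Usd": [10e6, 14e6, 0.5e6, 1.5e6, 1e6, 70e6],
})

wanted = ["venture", "angel", "seed", "private_equity"]
summary = (
    toy[toy["Funding_Round_Type"].isin(wanted)]
    .groupby("Funding_Round_Type")["Raised_Amount_Usd"]
    .agg(["mean", "median", "count"])
)
print(summary)
```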


#### Question 5 : Considering that Spark Funds wants to invest between 5 and 15 million USD per investment round, which investment type is the most suitable for them?

In [23]:
# Grouping data by funding round type
funding_round_type = master_frame.groupby('Funding_Round_Type').Raised_Amount_Usd.mean().reset_index()

# Applying the business constraint: keep only types whose average lies within 5 to 15 million USD
funding_round_type = funding_round_type.loc[ (funding_round_type.Raised_Amount_Usd >= 5000000) & 
                                             (funding_round_type.Raised_Amount_Usd <= 15000000)]

# Sorting values by amount
funding_round_type.sort_values(["Raised_Amount_Usd"], axis=0,ascending=False, inplace=True)

# Show the funding types that satisfy the constraint
funding_round_type

Unnamed: 0,Funding_Round_Type,Raised_Amount_Usd
13,venture,11735779.94
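
The range filter above can be expressed more compactly with `Series.between`, which is inclusive on both ends by default. The averages in this sketch are rounded toy values, and `LOW` and `HIGH` are names introduced here for illustration.

```python
import pandas as pd

# Rounded toy averages per funding type (illustrative values only).
means = pd.Series(
    {"seed": 0.7e6, "venture": 11.7e6, "private_equity": 73.6e6},
    name="Raised_Amount_Usd",
)

LOW, HIGH = 5_000_000, 15_000_000  # Spark Funds' per-round investment range in USD
suitable = means[means.between(LOW, HIGH)].sort_values(ascending=False)
print(suitable)
```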


In [24]:
# Reference view only: average amount for every funding type, sorted in descending order
invetment_type_df_temp = master_frame.groupby('Funding_Round_Type').Raised_Amount_Usd.mean().reset_index()
invetment_type_df_temp.sort_values(["Raised_Amount_Usd"], axis=0,ascending=False, inplace=True)
invetment_type_df_temp

Unnamed: 0,Funding_Round_Type,Raised_Amount_Usd
6,post_ipo_debt,169451789.77
10,secondary_market,81527203.55
8,private_equity,73618563.62
7,post_ipo_equity,66018794.26
2,debt_financing,17186403.5
12,undisclosed,15851078.78
13,venture,11735779.94
4,grant,4508472.53
9,product_crowdfunding,1489682.0
1,convertible_note,1331937.83


------------

## Checkpoint 3: Country Analysis

***Venture*** is selected as the most suitable funding type.

In [25]:
# Calculating the total amount raised per country
venture_country_code = venture_funding_frame.groupby("Country_Code").Raised_Amount_Usd.sum().reset_index()
venture_country_code.sort_values(["Raised_Amount_Usd"], axis=0,ascending=False, inplace=True)

# Selecting the top 9 countries
top9 = venture_country_code.head(9)
top9


Unnamed: 0,Country_Code,Raised_Amount_Usd
94,USA,422510842796.0
15,CHN,39835418773.0
29,GBR,20245627416.0
39,IND,14391858718.0
12,CAN,9583332317.0
28,FRA,7259536732.0
42,ISR,6907514579.0
21,DEU,6346959822.0
45,JPN,3363676611.0


#### Identify the top three English-speaking countries in the data frame top9

To get the top English-speaking countries, we can refer to the links below and create a list for comparison.

1. https://www.k-international.com/blog/countries-with-the-most-english-speakers/
2. https://www.ranker.com/list/countries-where-english-is-the-official-language/best-world-journeys



In [26]:
# Sample list created by referencing the links above
english_country_list_codes = ['USA','AUS','CAN','IND','BMU','GBR','NZL','GIB','IRL']

In [27]:
top3_funded = top9.loc[top9.Country_Code.isin(english_country_list_codes)].head(3)
top3_funded

Unnamed: 0,Country_Code,Raised_Amount_Usd
94,USA,422510842796.0
29,GBR,20245627416.0
39,IND,14391858718.0


****************

## Checkpoint 4: Sector Analysis 1

***Creating a new column "Primary_Sector" by extracting the first entry of the Category_List column***


In [28]:
master_frame['Primary_Sector'] = master_frame.Category_List.apply(lambda x:x.split('|')[0].title())
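
A vectorised variant of the same extraction, shown here on a toy Series, uses the `.str` accessor: split on the pipe, take the first element, and title-case it.

```python
import pandas as pd

# Toy category lists in the same pipe-separated format (illustrative values).
categories = pd.Series(["Media|News|Publishing", "Games", "Apps|Games|Mobile"])

# The first entry of each pipe-separated list is treated as the primary sector.
primary = categories.str.split("|").str[0].str.title()
print(primary.tolist())
```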

***Reading and filling NaN data from mapping.csv***

In [31]:
mapping_frame = pd.read_csv("../input/mapping.csv",encoding = "unicode-escape")
mapping_frame = mapping_frame[pd.notnull(mapping_frame['category_list'])]

***Method to replace "0" with "na" in category names where "na" was corrupted to "0". The method returns the string in title case (first letter of each word capitalized).***

In [32]:
def replace_zero_with_na(name):
    index = name.find('0')   # -1 when the name contains no '0'
    # Replace when the name starts with '0', or when the first '0' is not preceded by '.',
    # so that a name containing a version such as "2.0" is left untouched
    if index == 0 or (index > 0 and name[index - 1] != '.'):
        name = name.replace('0', 'na')

    return name.title()

***No optimized solution was found for converting the sector columns into rows, so a brute-force approach is used: two nested for loops iterate over the data frame and build the required data frame.***



In [33]:
# Creating a data frame with the category list and main sector mapping only
list_category = []
list_main = []
main_sectors_list = list(mapping_frame.columns)
main_sectors_list.pop(0)   # drop 'category_list', leaving only the sector columns

# Iterating over the dataframe: in each tuple, row[0] is the index, row[1] the category
# and row[2:] the 0/1 flags of the sector columns
for row in mapping_frame.itertuples():
    for iCount in range(2, len(row)):
        if row[iCount] == 1:
            list_category.append(row[1])
            list_main.append(main_sectors_list[iCount-2])

# Repairing category names in which "na" was replaced by "0"
for iCnt in range(len(list_category)):
    list_category[iCnt] = replace_zero_with_na(list_category[iCnt])

mapping_frame_consolidated = pd.DataFrame({'Primary_Sector': list_category,'Main_Sector': list_main })
mapping_frame_consolidated.head(5)
 



Unnamed: 0,Primary_Sector,Main_Sector
0,3D,Manufacturing
1,3D Printing,Manufacturing
2,3D Technology,Manufacturing
3,Accounting,"Social, Finance, Analytics, Advertising"
4,Active Lifestyle,Health
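
The wide-to-long conversion done with nested loops can probably be expressed with `DataFrame.melt`. The sketch below uses a small toy matrix with three sector columns in place of `mapping.csv`. The intermediate column name `flag` is an assumption introduced here.

```python
import pandas as pd

# Toy one-hot matrix in the same layout as the mapping file (illustrative rows only).
wide = pd.DataFrame({
    "category_list": ["3D", "Accounting", "Active Lifestyle"],
    "Health": [0, 0, 1],
    "Manufacturing": [1, 0, 0],
    "Social, Finance, Analytics, Advertising": [0, 1, 0],
})

# melt turns every sector column into (Main_Sector, flag) rows; keep only the flagged pairs.
long = wide.melt(id_vars="category_list", var_name="Main_Sector", value_name="flag")
mapping_long = (
    long[long["flag"] == 1]
    .drop(columns="flag")
    .rename(columns={"category_list": "Primary_Sector"})
    .sort_values("Primary_Sector")
    .reset_index(drop=True)
)
print(mapping_long)
```

The name repair with `replace_zero_with_na` would still need to be applied to `Primary_Sector` afterwards.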


***Merging final data frame for main analysis***

In [34]:
sector_master_frame = pd.merge(master_frame,mapping_frame_consolidated,how='inner',on='Primary_Sector')
sector_master_frame.head(5)

Unnamed: 0,Company_Permalink,Name,Homepage_Url,Category_List,Status,Country_Code,State_Code,Region,City,Founded_At,Funding_Round_Permalink,Funding_Round_Type,Funding_Round_Code,Funded_At,Raised_Amount_Usd,Primary_Sector,Main_Sector
0,/organization/-fame,#fame,http://livfame.com,Media,operating,IND,16,Mumbai,Mumbai,,/funding-round/9a01d05418af9f794eebff7ace91f638,venture,B,05-01-2015,10000000.0,Media,Entertainment
1,/organization/90min,90min,http://www.90min.com,Media|News|Publishing|Soccer|Sports,operating,GBR,H9,London,London,01-01-2011,/funding-round/21a2cbf6f2fb2a1c2a61e04bf930dfe6,venture,,06-10-2015,15000000.0,Media,Entertainment
2,/organization/90min,90min,http://www.90min.com,Media|News|Publishing|Soccer|Sports,operating,GBR,H9,London,London,01-01-2011,/funding-round/bd626ed022f5c66574b1afe234f3c90d,venture,,07-05-2013,5800000.0,Media,Entertainment
3,/organization/90min,90min,http://www.90min.com,Media|News|Publishing|Soccer|Sports,operating,GBR,H9,London,London,01-01-2011,/funding-round/fd4b15e8c97ee2ffc0acccdbe1a98810,venture,,26-03-2014,18000000.0,Media,Entertainment
4,/organization/a-dance-for-me,A Dance for Me,http://www.adanceforme.com/,Media|News|Photo Sharing|Video,operating,USA,MT,Missoula,Missoula,31-07-2011,/funding-round/9ab9dbd17bf010c79d8415b2c22be6fa,equity_crowdfunding,,26-03-2014,1090000.0,Media,Entertainment


-----


## Checkpoint 5: Sector Analysis 2

#### Business Constraints
> FT = Venture <br>
> Investment between 5 and 15 million USD <br>
> Top 3 countries: USA, GBR and IND <br>

In [35]:
# Creating alias with shorter name for readable code
smf = sector_master_frame

***Creating D1 dataframe***

In [36]:
D1_temp = smf.loc[(smf.Country_Code=="USA") & 
                  (smf.Funding_Round_Type=="venture") & 
                  (smf.Raised_Amount_Usd >=5000000) & (smf.Raised_Amount_Usd <= 15000000), :]

#Total amount invested in each main sector in a separate column
D1_total_sum_investments = D1_temp.groupby("Main_Sector").Raised_Amount_Usd.sum().sort_values(ascending = False).to_frame(name='Total_Sum_Invested')

#Total number (or count) of investments for each main sector in a separate column
D1_total_count_investments = D1_temp.groupby("Main_Sector").Raised_Amount_Usd.count().sort_values(ascending = False).to_frame(name='Counter')

# Merging frames to create the final D1
D1_temp= pd.merge(D1_temp,D1_total_count_investments,how='inner',on='Main_Sector')
D1 = pd.merge(D1_temp,D1_total_sum_investments,how='inner',on='Main_Sector')

D1.head(5)

Unnamed: 0,Company_Permalink,Name,Homepage_Url,Category_List,Status,Country_Code,State_Code,Region,City,Founded_At,Funding_Round_Permalink,Funding_Round_Type,Funding_Round_Code,Funded_At,Raised_Amount_Usd,Primary_Sector,Main_Sector,Counter,Total_Sum_Invested
0,/organization/all-def-digital,All Def Digital,http://alldefdigital.com,Media,operating,USA,CA,Los Angeles,Los Angeles,,/funding-round/452a2342fe720285c3b92e9bd927d9ba,venture,A,06-08-2014,5000000.0,Media,Entertainment,591,5099197982.0
1,/organization/chefs-feed,ChefsFeed,http://www.chefsfeed.com,Media|Mobile|Restaurants|Technology,operating,USA,CA,SF Bay Area,San Francisco,01-01-2012,/funding-round/adca195749ae9ace84684723fbe75e5b,venture,A,26-02-2015,5000000.0,Media,Entertainment,591,5099197982.0
2,/organization/huffingtonpost,The Huffington Post,http://www.huffingtonpost.com,Media|News|Publishing,acquired,USA,NY,New York City,New York,09-05-2005,/funding-round/7f05940c4d2dfecb8e50a0e5720e5065,venture,A,01-08-2006,5000000.0,Media,Entertainment,591,5099197982.0
3,/organization/huffingtonpost,The Huffington Post,http://www.huffingtonpost.com,Media|News|Publishing,acquired,USA,NY,New York City,New York,09-05-2005,/funding-round/9241ae16e08df17ebdc064e49e23035a,venture,B,01-09-2007,5000000.0,Media,Entertainment,591,5099197982.0
4,/organization/matchmine,MatchMine,http://matchmine.com,Media|News|Reviews and Recommendations,closed,USA,MA,Boston,Needham,01-01-2007,/funding-round/41ac526630da57ad6eb9d02431b17657,venture,A,01-09-2007,10000000.0,Media,Entertainment,591,5099197982.0


***Creating D2 dataframe***

In [37]:
D2_temp = smf.loc[(smf.Country_Code=="GBR") & 
                  (smf.Funding_Round_Type=="venture") & 
                  (smf.Raised_Amount_Usd >=5000000) & (smf.Raised_Amount_Usd <= 15000000), :]

D2_total_sum_investments = D2_temp.groupby("Main_Sector").Raised_Amount_Usd.sum().sort_values(ascending = False).to_frame(name='Total_Sum_Invested')
D2_total_count_investments = D2_temp.groupby("Main_Sector").Raised_Amount_Usd.count().sort_values(ascending = False).to_frame(name='Counter')

D2_temp= pd.merge(D2_temp,D2_total_count_investments,how='inner',on='Main_Sector')
D2 = pd.merge(D2_temp,D2_total_sum_investments,how='inner',on='Main_Sector')

D2.head(5)


Unnamed: 0,Company_Permalink,Name,Homepage_Url,Category_List,Status,Country_Code,State_Code,Region,City,Founded_At,Funding_Round_Permalink,Funding_Round_Type,Funding_Round_Code,Funded_At,Raised_Amount_Usd,Primary_Sector,Main_Sector,Counter,Total_Sum_Invested
0,/organization/90min,90min,http://www.90min.com,Media|News|Publishing|Soccer|Sports,operating,GBR,H9,London,London,01-01-2011,/funding-round/21a2cbf6f2fb2a1c2a61e04bf930dfe6,venture,,06-10-2015,15000000.0,Media,Entertainment,56,482784687.0
1,/organization/90min,90min,http://www.90min.com,Media|News|Publishing|Soccer|Sports,operating,GBR,H9,London,London,01-01-2011,/funding-round/bd626ed022f5c66574b1afe234f3c90d,venture,,07-05-2013,5800000.0,Media,Entertainment,56,482784687.0
2,/organization/eutechnyx,Eutechnyx,http://press.eutechnyx.com,Games,operating,GBR,E5,Gateshead,Gateshead,01-01-1987,/funding-round/d2fc787fbc5e4f468dff8b2c557993f1,venture,A,13-05-2010,8800000.0,Games,Entertainment,56,482784687.0
3,/organization/mind-candy,Mind Candy,http://www.mindcandy.com,Games,operating,GBR,H9,London,London,01-01-2003,/funding-round/47df01ed44d7b5916159051e5e32391e,venture,B,01-06-2011,10000000.0,Games,Entertainment,56,482784687.0
4,/organization/mind-candy,Mind Candy,http://www.mindcandy.com,Games,operating,GBR,H9,London,London,01-01-2003,/funding-round/c6a873b4cbdd7ea3d023a771bd3b2f99,venture,A,23-11-2006,10860000.0,Games,Entertainment,56,482784687.0


***Creating D3 dataframe***

In [38]:
D3_temp = smf.loc[(smf.Country_Code=="IND") & 
                  (smf.Funding_Round_Type=="venture") & 
                  (smf.Raised_Amount_Usd >=5000000) & (smf.Raised_Amount_Usd <= 15000000), :]

D3_total_sum_investments = D3_temp.groupby("Main_Sector").Raised_Amount_Usd.sum().sort_values(ascending = False).to_frame(name='Total_Sum_Invested')
D3_total_count_investments = D3_temp.groupby("Main_Sector").Raised_Amount_Usd.count().sort_values(ascending = False).to_frame(name='Counter')

D3_temp= pd.merge(D3_temp,D3_total_count_investments,how='inner',on='Main_Sector')
D3 = pd.merge(D3_temp,D3_total_sum_investments,how='inner',on='Main_Sector')

D3.head(5)

Unnamed: 0,Company_Permalink,Name,Homepage_Url,Category_List,Status,Country_Code,State_Code,Region,City,Founded_At,Funding_Round_Permalink,Funding_Round_Type,Funding_Round_Code,Funded_At,Raised_Amount_Usd,Primary_Sector,Main_Sector,Counter,Total_Sum_Invested
0,/organization/-fame,#fame,http://livfame.com,Media,operating,IND,16,Mumbai,Mumbai,,/funding-round/9a01d05418af9f794eebff7ace91f638,venture,B,05-01-2015,10000000.0,Media,Entertainment,33,280830000.0
1,/organization/dhruva,Dhruva,http://www.dhruva.com/,Games,operating,IND,19,Bangalore,Bangalore,01-01-1997,/funding-round/6035248811c9530b11bd442d9239a0b1,venture,,27-11-2006,5000000.0,Games,Entertainment,33,280830000.0
2,/organization/games2win,Games2Win,http://www.games2win.com,Games,operating,IND,16,Mumbai,Mumbai,01-01-2005,/funding-round/6b024f4906c288c66d1df966e6aeb256,venture,A,29-03-2007,5000000.0,Games,Entertainment,33,280830000.0
3,/organization/games2win,Games2Win,http://www.games2win.com,Games,operating,IND,16,Mumbai,Mumbai,01-01-2005,/funding-round/b095563fd43d1e4fd16da3f4bcd040af,venture,B,30-03-2011,6000000.0,Games,Entertainment,33,280830000.0
4,/organization/pokkt,POKKT,http://www.pokkt.com,Games,operating,IND,16,Mumbai,Mumbai,01-08-2012,/funding-round/adb94c131e001a7438a4695d873d8dc1,venture,B,03-11-2015,5000000.0,Games,Entertainment,33,280830000.0
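
D1, D2 and D3 are built by the same three steps with only the country code changing, so the logic could be factored into one function. The sketch below is one possible shape. `build_country_frame` is a hypothetical helper, and it uses `groupby(...).transform` to attach the per-sector count and sum instead of two merges. It runs on toy data.

```python
import pandas as pd

def build_country_frame(frame, country, funding_type="venture",
                        low=5_000_000, high=15_000_000):
    """Hypothetical helper: filter one country, attach per-sector count and sum."""
    mask = (
        (frame["Country_Code"] == country)
        & (frame["Funding_Round_Type"] == funding_type)
        & frame["Raised_Amount_Usd"].between(low, high)
    )
    out = frame.loc[mask].copy()
    by_sector = out.groupby("Main_Sector")["Raised_Amount_Usd"]
    # transform returns one value per row, aligned with the filtered frame
    out["Counter"] = by_sector.transform("count")
    out["Total_Sum_Invested"] = by_sector.transform("sum")
    return out

# Toy rounds (illustrative only): one is too large, one is seed, one is in another country.
toy = pd.DataFrame({
    "Country_Code": ["USA", "USA", "USA", "GBR", "USA"],
    "Funding_Round_Type": ["venture", "venture", "venture", "venture", "seed"],
    "Raised_Amount_Usd": [5e6, 10e6, 20e6, 8e6, 6e6],
    "Main_Sector": ["Others", "Others", "Others", "Health", "Others"],
})
D1_toy = build_country_frame(toy, "USA")
print(D1_toy)
```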


# Questions 

### 1. Total number of investments (count)

In [None]:
print("Total number of investments for D1 : " + str(D1.Raised_Amount_Usd.count()))
print("Total number of investments for D2 : " + str(D2.Raised_Amount_Usd.count()))
print("Total number of investments for D3 : " + str(D3.Raised_Amount_Usd.count()))

###  2. Total amount of investment (USD)

In [None]:
print("Total sum of investments for D1 : " + str(D1.Raised_Amount_Usd.sum()))
print("Total sum of investments for D2 : " + str(D2.Raised_Amount_Usd.sum()))
print("Total sum of investments for D3 : " + str(D3.Raised_Amount_Usd.sum()))


### 3. Top sector (based on count of investments)

In [None]:
D1_total_count_investments = D1.groupby("Main_Sector").Total_Sum_Invested.count().sort_values(ascending = False).to_frame(name='Total_Count')
D2_total_count_investments = D2.groupby("Main_Sector").Total_Sum_Invested.count().sort_values(ascending = False).to_frame(name='Total_Count')
D3_total_count_investments = D3.groupby("Main_Sector").Total_Sum_Invested.count().sort_values(ascending = False).to_frame(name='Total_Count')

print("Top 3 sector for D1 :")
print(D1_total_count_investments.head(3))
print()
print("Top 3 sector for D2 :")
print(D2_total_count_investments.head(3))
print()
print("Top 3 sector for D3 :")
print(D3_total_count_investments.head(3))
print()


In [None]:
print("Top 3 sector for D1 :")
D1_total_count_investments.head(3)

In [None]:
print("Top 3 sector for D2 :")
D2_total_count_investments.head(3)


In [None]:
print("Top 3 sector for D3 :")
D3_total_count_investments.head(3)


### 4. Second-best sector (based on count of investments)

In [None]:
# See above 

### 5. Third-best sector (based on count of investments)

In [None]:
# See above

### 6. Number of investments in the top sector (refer to point 3)

In [None]:
# See above

###  7. Number of investments in the second-best sector (refer to point 4)



In [None]:
# See above

### 8. Number of investments in the third-best sector (refer to point 5)

In [None]:
# See above

###  9. For the top sector count-wise (point 3), which company received the highest investment?

In [None]:
# Filter by top sector and group by company permalink and name
D1_temp = D1[D1.Main_Sector == "Others"]
D1_temp = D1_temp.groupby(["Company_Permalink","Name"]).Raised_Amount_Usd.sum().reset_index()
D1_temp.sort_values(["Raised_Amount_Usd"], axis=0,ascending=False, inplace=True)
d1_top_list = (D1_temp.iloc[0:1, 1:3].values.tolist())[0]

D2_temp = D2[D2.Main_Sector == "Others"]
D2_temp=D2_temp.groupby(["Company_Permalink","Name"]).Raised_Amount_Usd.sum().reset_index()
D2_temp.sort_values(["Raised_Amount_Usd"], axis=0,ascending=False, inplace=True)
d2_top_list = (D2_temp.iloc[0:1, 1:3].values.tolist())[0]

D3_temp = D3[D3.Main_Sector == "Others"]
D3_temp=D3_temp.groupby(["Company_Permalink","Name"]).Raised_Amount_Usd.sum().reset_index()
D3_temp.sort_values(["Raised_Amount_Usd"], axis=0,ascending=False, inplace=True)
d3_top_list = (D3_temp.iloc[0:1, 1:3].values.tolist())[0]

print("Analysis Results -")
print(" '{0}' company received highest investments for D1 worth total (sum) $".format(d1_top_list[0]),d1_top_list[1])
print(" '{0}' company received highest investments for D2 worth total (sum) $".format(d2_top_list[0]),d2_top_list[1])
print(" '{0}' company received highest investments for D3 worth total (sum) $".format(d3_top_list[0]),d3_top_list[1])

###  10. For the second-best sector count-wise (point 4), which company received the highest investment?





In [None]:
# Filter by second-best sector and group by company permalink and name
D1_temp = D1[D1.Main_Sector == "Social, Finance, Analytics, Advertising"]
D1_temp = D1_temp.groupby(["Company_Permalink","Name"]).Raised_Amount_Usd.sum().reset_index()
D1_temp.sort_values(["Raised_Amount_Usd"], axis=0,ascending=False, inplace=True)
d1_top_list = (D1_temp.iloc[0:1, 1:3].values.tolist())[0]

D2_temp = D2[D2.Main_Sector == "Social, Finance, Analytics, Advertising"]
D2_temp=D2_temp.groupby(["Company_Permalink","Name"]).Raised_Amount_Usd.sum().reset_index()
D2_temp.sort_values(["Raised_Amount_Usd"], axis=0,ascending=False, inplace=True)
d2_top_list = (D2_temp.iloc[0:1, 1:3].values.tolist())[0]

D3_temp = D3[D3.Main_Sector == "Social, Finance, Analytics, Advertising"]
D3_temp=D3_temp.groupby(["Company_Permalink","Name"]).Raised_Amount_Usd.sum().reset_index()
D3_temp.sort_values(["Raised_Amount_Usd"], axis=0,ascending=False, inplace=True)
d3_top_list = (D3_temp.iloc[0:1, 1:3].values.tolist())[0]

print("Analysis Results -")
print(" '{0}' company received highest investments for D1 worth total (sum) $".format(d1_top_list[0]),d1_top_list[1])
print(" '{0}' company received highest investments for D2 worth total (sum) $".format(d2_top_list[0]),d2_top_list[1])
print(" '{0}' company received highest investments for D3 worth total (sum) $".format(d3_top_list[0]),d3_top_list[1])
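
The cells for points 9 and 10 hard-code the sector names ("Others" and "Social, Finance, Analytics, Advertising"). A possible alternative, sketched on toy data, derives the sector from the count ranking so the answer follows the data for each country. `top_company_in_ranked_sector` is a hypothetical helper name.

```python
import pandas as pd

def top_company_in_ranked_sector(frame, rank):
    """Hypothetical helper: pick the sector at the given count rank (0 = top),
    then return (sector, company name, total raised) for its best-funded company."""
    sector = frame["Main_Sector"].value_counts().index[rank]
    totals = (
        frame[frame["Main_Sector"] == sector]
        .groupby(["Company_Permalink", "Name"])["Raised_Amount_Usd"]
        .sum()
    )
    (permalink, name), amount = totals.idxmax(), totals.max()
    return sector, name, amount

# Toy country frame (illustrative only): three "Others" rounds and two "Health" rounds.
toy = pd.DataFrame({
    "Main_Sector": ["Others", "Others", "Others", "Health", "Health"],
    "Company_Permalink": ["/organization/a", "/organization/a", "/organization/b",
                          "/organization/c", "/organization/d"],
    "Name": ["A", "A", "B", "C", "D"],
    "Raised_Amount_Usd": [5e6, 6e6, 10e6, 7e6, 9e6],
})
best = top_company_in_ranked_sector(toy, 0)
second = top_company_in_ranked_sector(toy, 1)
print(best, second)
```

If two sectors have the same count, the order returned by `value_counts` decides the rank, so ties would need an explicit rule.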


-----------

## Checkpoint 6: Plots

### A plot showing the fraction of total investments (globally) in venture, seed, and private equity, and the average amount of investment in each funding type. This chart should make it clear that a certain funding type (FT) is best suited for Spark Funds.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
plotting_frame = sector_master_frame[sector_master_frame["Funding_Round_Type"].isin(
                       ["venture","seed","private_equity"])]

plotting_frame = plotting_frame.loc[(plotting_frame.Raised_Amount_Usd >=5000000) & 
                                    (plotting_frame.Raised_Amount_Usd <= 15000000),:]

In [None]:
sns.boxplot(x='Funding_Round_Type', y='Raised_Amount_Usd', data=plotting_frame)
plt.yscale('log')
plt.show()
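
The plot description asks for two quantities per funding type: the fraction of total investment and the average amount. A minimal sketch of a summary table holding both is shown below on toy amounts. The column names `total`, `average` and `fraction_of_total` are assumptions introduced here. Such a table could then be passed to a bar plot.

```python
import pandas as pd

# Toy rounds (illustrative amounts chosen so the total is 100 million).
toy = pd.DataFrame({
    "Funding_Round_Type": ["venture", "venture", "seed", "private_equity"],
    "Raised_Amount_Usd": [10e6, 20e6, 1e6, 69e6],
})

# Named aggregation gives one row per funding type with its total and average amount.
stats = toy.groupby("Funding_Round_Type")["Raised_Amount_Usd"].agg(total="sum", average="mean")
stats["fraction_of_total"] = stats["total"] / stats["total"].sum()
print(stats[["fraction_of_total", "average"]])
```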

### A plot showing the top 9 countries against the total amount of investments of funding type FT. This should make the top 3 countries (Country 1, Country 2, and Country 3) very clear.

In [None]:
plotting = top9.set_index("Country_Code")
plotting.plot.bar(logy=True);
plotting


### A plot showing the number of investments in the top 3 sectors of the top 3 countries on one chart (for the chosen investment type FT). 

In [None]:
D1_plot = D1

D1_plot = D1_plot.groupby("Main_Sector").Raised_Amount_Usd.count().reset_index()
D1_plot.sort_values(["Raised_Amount_Usd"], axis=0,ascending=False, inplace=True)
D1_plot = D1_plot.head(3)


In [None]:
D2_plot = D2

D2_plot = D2_plot.groupby("Main_Sector").Raised_Amount_Usd.count().reset_index()
D2_plot.sort_values(["Raised_Amount_Usd"], axis=0,ascending=False, inplace=True)
D2_plot = D2_plot.head(3)

In [None]:
D3_plot = D3

D3_plot = D3_plot.groupby("Main_Sector").Raised_Amount_Usd.count().reset_index()
D3_plot.sort_values(["Raised_Amount_Usd"], axis=0,ascending=False, inplace=True)
D3_plot = D3_plot.head(3)

In [None]:
D12 = pd.merge(D1_plot,D2_plot,how='outer',on='Main_Sector')
D123 = pd.merge(D12,D3_plot,how='outer',on='Main_Sector')

In [None]:
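# The merged count columns correspond to D1 (USA), D2 (GBR) and D3 (IND)
D123 = D123.rename(columns={"Raised_Amount_Usd_x": "USA", "Raised_Amount_Usd_y": "GBR" ,"Raised_Amount_Usd": "IND"})
D123= D123.set_index("Main_Sector")
# fillna returns a new frame, so the result is assigned back before plotting
D123 = D123.fillna(0)
D123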

In [None]:
D123.T.plot.bar(logy=True)
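
The sector-by-country count matrix built from three groupbys and two outer merges can also be obtained in one step with `pd.crosstab`. This is only a sketch on toy rows. It assumes the three country frames have been concatenated into one frame, and selecting the top 3 sectors per country would still be a separate step.

```python
import pandas as pd

# Toy rows standing in for the concatenation of D1, D2 and D3 (illustrative only).
toy = pd.DataFrame({
    "Country_Code": ["USA", "USA", "USA", "GBR", "GBR", "IND"],
    "Main_Sector": ["Others", "Others", "Health", "Others", "Health", "Others"],
})

# Rows are sectors, columns are countries, cells are investment counts (0 where absent).
counts = pd.crosstab(toy["Main_Sector"], toy["Country_Code"])
print(counts)
```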