## Objective:
You work for Spark Funds, an asset management company. Spark Funds wants to make investments in a few companies. The CEO of Spark Funds wants to understand the global trends in investments so that she can make investment decisions effectively.


## Business and Data Understanding:
Spark Funds has two minor constraints for investments:

    - It wants to invest between 5 and 15 million USD per round of investment.

    - It wants to invest only in English-speaking countries because of the ease of communication with the companies it would invest in. For the analysis, consider a country to be English-speaking only if English is one of the official languages in that country.
    
## Business objective: 
The objective is to identify the best sectors, countries, and a suitable investment type for making investments. The overall strategy is to invest where others are investing, so the 'best' sectors and countries are the ones where most investors are investing. This pattern is often observed among early-stage startup investors.

In [None]:
# Suppress Warnings

import warnings
warnings.filterwarnings('ignore')

# Import the numpy and pandas packages

import numpy as np
import pandas as pd

## Task 1: Data Cleaning

-  ### Subtask 1.1: Import and read

Load the companies and rounds data (provided on the previous page) into two data frames and name them `companies` and `rounds2` respectively.

In [None]:
#Reading companies.txt with ISO-8859-1 encoding because of special characters, then dropping the non-ASCII characters to resolve the mixed-encoding issue.
companies = pd.read_csv('../input/companies.txt',encoding='ISO-8859-1',sep='\t')
companies.permalink = companies.permalink.str.encode('ISO-8859-1').str.decode('ascii', 'ignore')
companies.name = companies.name.str.encode('ISO-8859-1').str.decode('ascii', 'ignore')
companies.head()

In [None]:
rounds2 = pd.read_csv('../input/rounds2.csv',encoding='ISO-8859-1')
rounds2.company_permalink = rounds2.company_permalink.str.encode('ISO-8859-1').str.decode('ascii', 'ignore')
rounds2.head()
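
The encode/decode round trip used in the two cells above can be illustrated on a small sample. This is a minimal sketch with made-up permalinks (the sample strings are assumptions, not values from the data): encoding to ISO-8859-1 and decoding as ASCII with `'ignore'` drops every non-ASCII character.

```python
import pandas as pd

# Hypothetical permalinks; the second one contains a non-ASCII character.
sample = pd.Series(['/organization/abc', '/organization/caf\u00e9-labs'])

# Encode to bytes as ISO-8859-1, then decode as ASCII and skip undecodable bytes.
cleaned = sample.str.encode('ISO-8859-1').str.decode('ascii', 'ignore')
print(cleaned.tolist())
```

Because characters are dropped rather than transliterated, the same cleanup has to be applied to both files so that their permalinks still match.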

-  ### Subtask 1.2: Understand the Dataset

    - How many unique companies are present in `rounds2`?
    - How many unique companies are present in `companies`?
    - Are there any companies in the `rounds2` file which are not present in `companies`? Answer yes or no: **Y/N**
    - Merge the two data frames so that all variables (columns) in the `companies` frame are added to the `rounds2` data frame. Name the merged frame `master_frame`. How many observations are present in `master_frame`?

In [None]:
#How many unique companies are present in rounds2?
rounds2['company_permalink'] = rounds2['company_permalink'].str.lower()
print(len(rounds2['company_permalink'].unique()))

#Reconfirming -
rounds2['company_permalink'].str.lower().describe()

In [None]:
# How many unique companies are present in companies?
companies['permalink'] = companies['permalink'].str.lower()
print(len(companies['permalink'].unique()))

#Reconfirming -
companies['permalink'].str.lower().describe()

In [None]:
#Are there any companies in the rounds2 file which are not present in companies?
#Compare the unique permalinks as sets: DataFrame.equals also depends on row order
temp1 = set(rounds2.company_permalink.unique())
temp2 = set(companies.permalink.unique())
temp2 == temp1

In [None]:
#Permalinks in rounds2 that are missing from companies (an empty set means the answer is N)
set(rounds2['company_permalink'].unique()).difference(set(companies['permalink'].unique()))
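
The direction of the comparison decides which question is answered. As a minimal sketch on toy frames (the frame names and values here are hypothetical), `isin` can flag the rows of the rounds data whose permalink has no match in the companies data:

```python
import pandas as pd

# Toy stand-ins for the two frames (values are hypothetical).
rounds_demo = pd.DataFrame(
    {'company_permalink': ['/organization/a', '/organization/b', '/organization/x']})
companies_demo = pd.DataFrame(
    {'permalink': ['/organization/a', '/organization/b', '/organization/c']})

# Rows of rounds_demo whose permalink does not appear in companies_demo.
missing = rounds_demo[~rounds_demo['company_permalink'].isin(companies_demo['permalink'])]
print(missing['company_permalink'].tolist())

# 'Y' if at least one company in the rounds data is absent from the companies data.
print('Y' if len(missing) else 'N')
```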

In [None]:
#Merge the two data frames so that all variables (columns) in the companies frame are added to the rounds2 data frame. Name the merged frame master_frame.
master_frame = pd.merge(rounds2, companies, how = 'left', left_on = 'company_permalink', right_on = 'permalink')
len(master_frame.index)
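
A left merge keeps one row per funding round only if each permalink appears once in the companies data. The sketch below, on hypothetical toy frames, uses the `validate` argument of `pd.merge` to make that assumption explicit:

```python
import pandas as pd

# Toy frames; names and amounts are hypothetical.
rounds_demo = pd.DataFrame({'company_permalink': ['/a', '/a', '/b'],
                            'raised_amount_usd': [1e6, 2e6, 5e6]})
companies_demo = pd.DataFrame({'permalink': ['/a', '/b'], 'name': ['A', 'B']})

# validate='m:1' raises a MergeError if 'permalink' is duplicated on the right side.
merged = pd.merge(rounds_demo, companies_demo, how='left',
                  left_on='company_permalink', right_on='permalink',
                  validate='m:1')

# With a unique key on the right, the row count of the left frame is preserved.
print(len(merged) == len(rounds_demo))
```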

-  ### Subtask 1.3: Cleaning the Data

    - Inspecting Null Values
    - Dropping unnecessary columns
    - Dropping unnecessary rows

In [None]:
#Inspecting the Null values , column-wise
master_frame.isnull().sum(axis=0)

In [None]:
#Inspecting the Null values percentage , column-wise
print(round(100*(master_frame.isnull().sum()/len(master_frame.index)), 2))
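
The null percentage can also be computed as the mean of the boolean null mask. A minimal sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with some missing values.
demo = pd.DataFrame({'a': [1.0, np.nan, 3.0, np.nan],
                     'b': ['x', 'y', None, 'z']})

# isnull() yields booleans; their column-wise mean is the fraction of nulls.
null_pct = (100 * demo.isnull().mean()).round(2)
print(null_pct)
```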

- #### Dropping unnecessary columns 

For Spark Funds, the analysis is driven mainly by funding round type, category, and country. Many of the columns in `master_frame` are therefore not needed, so we will drop them.

In [None]:
master_frame = master_frame.drop(['funding_round_code', 'funding_round_permalink', 'funded_at','permalink', 'homepage_url',
                                 'state_code', 'region', 'city', 'founded_at','status'], axis = 1)

In [None]:
#Inspecting the Null values percentage again after deletion, column-wise
print(round(100*(master_frame.isnull().sum()/len(master_frame.index)), 2))

- #### Dropping unnecessary rows

Some of the remaining columns of the `master_frame` dataframe still contain null values. Let's drop the rows where those values are missing and inspect the dataframe again.

In [None]:
#Dropping rows based on null columns
master_frame = master_frame[~(master_frame['raised_amount_usd'].isnull() | master_frame['country_code'].isnull() |
                             master_frame['category_list'].isnull())]
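
The same row filter can be sketched with `dropna(subset=...)`, recording the row count beforehand so the retained percentage does not depend on a hard-coded number. The frame below is hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with nulls in two of the three key columns.
demo = pd.DataFrame({
    'raised_amount_usd': [1e6, np.nan, 3e6, 4e6],
    'country_code': ['USA', 'USA', None, 'IND'],
    'category_list': ['Media', 'Games', 'Health', 'Apps'],
})
rows_before = len(demo)

# Drop a row if any of the listed columns is null.
demo = demo.dropna(subset=['raised_amount_usd', 'country_code', 'category_list'])

# Percentage of rows retained after the drop.
retained_pct = 100 * len(demo) / rows_before
print(retained_pct)
```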

In [None]:
#Percentage of retained rows (114949 is taken as the row count of master_frame before rows were dropped)
print(100*(len(master_frame.index)/114949))

In [None]:
master_frame.shape

## Task 2: Funding Type Analysis

-  ### Subtask 2.1: Retaining the rows with only four investment types.

Spark Funds wants to choose one of four investment types (venture, angel, seed, and private equity) for each potential investment it will make. Let's see which funding types are present in `master_frame` and then retain only the rows with these four investment types.

In [None]:
#Observing the unique funding_round_type
master_frame.funding_round_type.value_counts()

In [None]:
#Retaining the rows with only four investment types
master_frame = master_frame[(master_frame['funding_round_type'] == 'venture') 
                            | (master_frame['funding_round_type'] == 'seed')
                            | (master_frame['funding_round_type'] == 'angel')
                            | (master_frame['funding_round_type'] == 'private_equity')]
master_frame.head()
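
A chain of `==` comparisons joined with `|` can equivalently be written with `isin`. A minimal sketch on a hypothetical column of funding types:

```python
import pandas as pd

# Hypothetical sample of funding round types.
demo = pd.DataFrame({'funding_round_type': ['venture', 'grant', 'seed',
                                            'debt_financing', 'angel', 'private_equity']})

# The four investment types of interest.
chosen_types = ['venture', 'seed', 'angel', 'private_equity']

# Keep only the rows whose type is one of the chosen four.
filtered = demo[demo['funding_round_type'].isin(chosen_types)]
print(filtered['funding_round_type'].tolist())
```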

-  ### Subtask 2.2: Calculate the average investment amount for each of the four funding types.

    - Average funding amount of **venture** type
    - Average funding amount of **seed** type
    - Average funding amount of **angel** type
    - Average funding amount of **private_equity** type

In [None]:
#Converting $ to million $.
master_frame['raised_amount_usd'] = master_frame['raised_amount_usd']/1000000
master_frame.head()

In [None]:
#calculating average investment amount for each of the four funding types.
round(master_frame.groupby('funding_round_type').raised_amount_usd.mean(), 2)
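
One way to read the averages against the 5 to 15 million USD constraint is to keep only the types whose mean falls inside that range. The sketch below uses made-up amounts (in million USD), so its output says nothing about the real data:

```python
import pandas as pd

# Made-up amounts in million USD; not values from the dataset.
demo = pd.DataFrame({
    'funding_round_type': ['venture', 'venture', 'seed', 'seed', 'angel', 'private_equity'],
    'raised_amount_usd': [8.0, 12.0, 0.5, 1.5, 1.0, 70.0],
})

# Average amount per funding type.
avg = demo.groupby('funding_round_type')['raised_amount_usd'].mean()

# Types whose average lies between 5 and 15 (both bounds inclusive).
suitable = avg[avg.between(5, 15)]
print(suitable)
```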

In [None]:
#Retaining rows with only venture type, as Spark Funds wants to invest between 5 and 15 million USD per investment round
master_frame = master_frame[master_frame['funding_round_type'] == 'venture'] 

#Dropping the column 'funding_round_type' as it is going to be venture type from this point forward
master_frame = master_frame.drop(['funding_round_type'], axis = 1)

## Task 3: Country Analysis

-  ### Subtask 3.1: Analysing the countries based on investment amount

    - Spark Funds wants to see the top nine countries which have received the highest total funding (across ALL sectors for the chosen investment type)

    - For the chosen investment type, make a data frame named top9 with the top nine countries (based on the total investment amount each country has received)

In [None]:
top9 = master_frame.pivot_table(values = 'raised_amount_usd', index = 'country_code', aggfunc = 'sum')
top9 = top9.sort_values(by = 'raised_amount_usd', ascending = False)
top9 = top9.iloc[:9, ]
top9
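
The pivot, sort, and slice steps above can also be sketched as a `groupby` followed by `nlargest`. The countries and amounts below are hypothetical, and the sketch takes the top three rather than nine to stay short:

```python
import pandas as pd

# Hypothetical amounts in million USD.
demo = pd.DataFrame({
    'country_code': ['USA', 'USA', 'CHN', 'GBR', 'IND', 'GBR'],
    'raised_amount_usd': [10.0, 5.0, 12.0, 4.0, 3.0, 2.0],
})

# Total amount per country, then the three largest totals in descending order.
top_n = demo.groupby('country_code')['raised_amount_usd'].sum().nlargest(3)
print(top_n)
```

`nlargest` returns a Series; calling `.to_frame()` on the result would give a one-column data frame like the pivot table.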

In [None]:
#Retaining rows with only USA, GBR and IND country codes, as Spark Funds wants to invest only in the top three English-speaking countries.
master_frame = master_frame[(master_frame['country_code'] == 'USA')
                            | (master_frame['country_code'] == 'GBR')
                            | (master_frame['country_code'] == 'IND')]

## Task 4: Sector Analysis 1

-  ### Subtask 4.1: Extract the primary sector of each category

Extract the primary sector value into the *category_list* column. According to the business rule, the first string before the vertical bar is considered the primary sector.

In [None]:
#Extracting the primary sector value
master_frame['category_list'] = master_frame['category_list'].apply(lambda x: x.split('|')[0])
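
The business rule can be illustrated on a few sample strings. This sketch uses the vectorised `.str` accessor instead of a lambda; the category values are made up:

```python
import pandas as pd

# Hypothetical category_list values.
categories = pd.Series(['Media|News|Publishing', 'Biotechnology', 'Apps|Games'])

# Split on the vertical bar and keep the first piece as the primary sector.
primary = categories.str.split('|').str[0]
print(primary.tolist())
```

Unlike `x.split('|')` inside a lambda, the `.str` accessor propagates missing values instead of raising an error, which matters only if nulls have not already been dropped.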

-  ### Subtask 4.2: Map each primary sector to one of the eight main sectors

Use the mapping file 'mapping.csv' to map each primary sector to one of the eight main sectors. (Note that 'Others' is also considered one of the main sectors.)

In [None]:
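#Reading mapping.csv file
mapping = pd.read_csv('../input/mapping.csv')
#The category names appear to have 'na' replaced by '0'; restore 'na', then revert genuine '2.0' values
mapping.category_list = (mapping.category_list.str.replace('0', 'na', regex=False)
                                              .str.replace('2.na', '2.0', regex=False))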
mapping.head()

In [None]:
#Reshaping the mapping dataframe to merge with the master_frame dataframe. Using melt() function to unpivot the table.
mapping = pd.melt(mapping, id_vars =['category_list'], value_vars =['Manufacturing','Automotive & Sports',
                                                              'Cleantech / Semiconductors','Entertainment',
                                                             'Health','News, Search and Messaging','Others',
                                                             'Social, Finance, Analytics, Advertising']) 
mapping = mapping[~(mapping.value == 0)]
mapping = mapping.drop('value', axis = 1)
mapping = mapping.rename(columns = {"variable":"main_sector"})
mapping.head()
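
The reshaping above turns one indicator column per sector into a single `main_sector` column. A minimal sketch on a hypothetical three-row mapping (the sector assignments are illustrative only):

```python
import pandas as pd

# Hypothetical wide mapping: one 0/1 indicator column per main sector.
wide = pd.DataFrame({
    'category_list': ['3D', 'Apps', 'Biotechnology'],
    'Manufacturing': [1, 0, 0],
    'Health': [0, 0, 1],
    'Others': [0, 1, 0],
})

# Unpivot: every (category, sector) pair becomes one row with its indicator value.
long_form = pd.melt(wide, id_vars=['category_list'], var_name='main_sector')

# Keep only the pairs where the indicator is 1, then drop the indicator column.
long_form = long_form[long_form['value'] == 1].drop(columns='value').reset_index(drop=True)
print(long_form)
```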

In [None]:
master_frame = master_frame.merge(mapping, how = 'left', on ='category_list')
master_frame.head()

In [None]:
#List of primary sectors which have no main sectors in the master_frame
print(master_frame[master_frame.main_sector.isnull()].category_list.unique())

In [None]:
#Number of rows with NaN main_sector value
len(master_frame[master_frame.main_sector.isnull()])

In [None]:
#Retaining the rows which have main_sector values
master_frame = master_frame[~(master_frame.main_sector.isnull())]
len(master_frame.index)

## Task 5: Sector Analysis 2

-  ### Subtask 5.1: Create DataFrames D1, D2, D3 based on three countries

    - Create three separate data frames D1, D2 and D3 for each of the three countries containing the observations of funding type FT falling within the 5-15 million USD range. The three data frames should contain:

        - All the columns of the master_frame along with the primary sector and the main sector

        - The total number (or count) of investments for each main sector in a separate column

        - The total amount invested in each main sector in a separate column

In [None]:
D1 = master_frame[(master_frame['country_code'] == 'USA') & 
             (master_frame['raised_amount_usd'] >= 5) & 
             (master_frame['raised_amount_usd'] <= 15)]
#Selecting the column before agg() keeps the columns of D1_gr flat, so it can be merged with D1
D1_gr = D1.groupby('main_sector')['raised_amount_usd'].agg(['sum', 'count']).rename(
    columns={'sum':'Total_amount','count' : 'Total_count'})
D1 = D1.merge(D1_gr, how='left', on ='main_sector')
D1.head()

In [None]:
D2 = master_frame[(master_frame['country_code'] == 'GBR') & 
             (master_frame['raised_amount_usd'] >= 5) & 
             (master_frame['raised_amount_usd'] <= 15)]
#Selecting the column before agg() keeps the columns of D2_gr flat, so it can be merged with D2
D2_gr = D2.groupby('main_sector')['raised_amount_usd'].agg(['sum', 'count']).rename(
    columns={'sum':'Total_amount','count' : 'Total_count'})
D2 = D2.merge(D2_gr, how='left', on ='main_sector')
D2.head()

In [None]:
D3 = master_frame[(master_frame['country_code'] == 'IND') & 
             (master_frame['raised_amount_usd'] >= 5) & 
             (master_frame['raised_amount_usd'] <= 15)]
#Selecting the column before agg() keeps the columns of D3_gr flat, so it can be merged with D3
D3_gr = D3.groupby('main_sector')['raised_amount_usd'].agg(['sum', 'count']).rename(
    columns={'sum':'Total_amount','count' : 'Total_count'})
D3 = D3.merge(D3_gr, how='left', on ='main_sector')
D3.head()
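
The per-sector total and count columns can also be attached without a separate merge, using `groupby(...).transform`. This is a sketch on made-up amounts (million USD), offered as an alternative rather than as the approach used above:

```python
import pandas as pd

# Hypothetical investments within the 5-15 million USD range.
demo = pd.DataFrame({
    'main_sector': ['Others', 'Others', 'Health', 'Others', 'Health'],
    'raised_amount_usd': [5.0, 10.0, 7.5, 15.0, 6.0],
})

grouped = demo.groupby('main_sector')['raised_amount_usd']

# transform() returns one value per original row, aligned to the frame's index.
demo['Total_amount'] = grouped.transform('sum')
demo['Total_count'] = grouped.transform('count')
print(demo)
```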

-  ### Subtask 5.2: Sector-wise Investment Analysis

    - For D1, D2 and D3, analyse the points below:

In [None]:
#Total number of investments (count)
print(D1.raised_amount_usd.count())
print(D2.raised_amount_usd.count())
print(D3.raised_amount_usd.count())

In [None]:
#Total amount of investment (USD)
print(round(D1.raised_amount_usd.sum(), 2))
print(round(D2.raised_amount_usd.sum(), 2))
print(round(D3.raised_amount_usd.sum(), 2))

In [None]:
#Top sector, second-top, third-top for D1 (based on count of investments)
#Number of investments in the top, second-top, third-top sector in D1
D1_gr

In [None]:
#Top sector, second-top, third-top for D2 (based on count of investments)
#Number of investments in the top, second-top, third-top sector in D2
D2_gr

In [None]:
#Top sector, second-top, third-top for D3 (based on count of investments)
#Number of investments in the top, second-top, third-top sector in D3
D3_gr

In [None]:
#For the top sector in USA, which company received the highest investment?
company = D1[D1['main_sector']=='Others']
company = company.pivot_table(values = 'raised_amount_usd', index = 'company_permalink', aggfunc = 'sum')
company = company.sort_values(by = 'raised_amount_usd', ascending = False).head()
print(company.head(1))

#For the second-top sector in USA, which company received the highest investment?
company = D1[D1['main_sector']=='Social, Finance, Analytics, Advertising']
company = company.pivot_table(values = 'raised_amount_usd', index = 'company_permalink', aggfunc = 'sum')
company = company.sort_values(by = 'raised_amount_usd', ascending = False).head()
print(company.head(1))

In [None]:
#For the top sector in GBR, which company received the highest investment?
company = D2[D2['main_sector']=='Others']
company = company.pivot_table(values = 'raised_amount_usd', index = 'company_permalink', aggfunc = 'sum')
company = company.sort_values(by = 'raised_amount_usd', ascending = False).head()
print(company.head(1))

#For the second-top sector in GBR, which company received the highest investment?
company = D2[D2['main_sector']=='Social, Finance, Analytics, Advertising']
company = company.pivot_table(values = 'raised_amount_usd', index = 'company_permalink', aggfunc = 'sum')
company = company.sort_values(by = 'raised_amount_usd', ascending = False).head()
print(company.head(1))

In [None]:
#For the top sector in IND, which company received the highest investment?
company = D3[D3['main_sector']=='Others']
company = company.pivot_table(values = 'raised_amount_usd', index = 'company_permalink', aggfunc = 'sum')
company = company.sort_values(by = 'raised_amount_usd', ascending = False).head()
print(company.head(1))

#For the second-top sector in IND, which company received the highest investment?
company = D3[D3['main_sector']=='News, Search and Messaging']
company = company.pivot_table(values = 'raised_amount_usd', index = 'company_permalink', aggfunc = 'sum')
company = company.sort_values(by = 'raised_amount_usd', ascending = False).head()
print(company.head(1))
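
The six near-identical lookups above follow one pattern, which could be factored into a small helper. The function name `top_company` and the toy data are hypothetical; this is a sketch of the pattern, not the notebook's own code:

```python
import pandas as pd

def top_company(frame, sector):
    """Return (permalink, total amount) of the company with the largest total in a sector."""
    totals = (frame[frame['main_sector'] == sector]
              .groupby('company_permalink')['raised_amount_usd'].sum())
    # idxmax() gives the permalink with the highest summed amount.
    return totals.idxmax(), totals.max()

# Hypothetical investments in million USD.
demo = pd.DataFrame({
    'company_permalink': ['/a', '/a', '/b', '/c'],
    'main_sector': ['Others', 'Others', 'Others', 'Health'],
    'raised_amount_usd': [5.0, 6.0, 10.0, 12.0],
})

print(top_company(demo, 'Others'))
```

With the real frames, a call such as `top_company(D1, 'Others')` would replace one of the repeated blocks.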

## Analysis Result:

- #### Based on the data analysis performed, Spark Funds should invest in -

    - Funding type - `Venture`.
    - Countries - `USA`, `Britain` and `India`, in that order.
    - Top two sectors to invest in are - `Others` and `Social, Finance, Analytics, Advertising` (for India, the second sector analysed above is `News, Search and Messaging`).