# Goal

You work for Spark Funds, an asset management company. Spark Funds wants to make investments in a few sectors.
The CEO of Spark Funds wants to understand the global trends in investments so that she can make investment decisions effectively.

# Business and Data Understanding

Spark Funds has two minor constraints for investments:

It wants to invest between 5 and 15 million USD per round of investment

It wants to invest only in English-speaking countries because of the ease of communication with the companies it would invest in

For your analysis, consider a country to be English speaking only if English is one of the official languages in that country

# Goals of data analysis:

**Files:** 
    
    -companies.txt: A table with basic data of companies
    
    -rounds2.csv: A table with all the rounds details
    
    -mapping.csv: This file maps the numerous category names in the companies table (such as 3D printing, aerospace, agriculture, etc.) to eight broad sector names. The purpose is to simplify the analysis into eight sector buckets, rather than trying to analyse hundreds of them.

**Investment type analysis:** Comparing the typical investment amounts in the venture, seed, angel, private equity etc. so that Spark Funds can choose the type that is best suited for their strategy.

**Country analysis:** Identifying the countries which have been the most heavily invested in the past. These will be Spark Funds’ favourites as well.

**Sector analysis:** Understanding the distribution of investments across the eight main sectors. (Note that we are interested in the eight 'main sectors' provided in the mapping file. The two files — companies and rounds2 — have numerous sub-sector names; hence, you will need to map each sub-sector to its main sector.)




In [None]:
# Importing libraries
import numpy as np
import pandas as pd
import string
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# reading data files
# using encoding = "ISO-8859-1" to avoid pandas encoding error

companies = pd.read_csv("../input/investment-analysis/companies.txt", sep="\t", encoding = "ISO-8859-1")
rounds = pd.read_csv("../input/investment-analysis/rounds2.csv", encoding = "ISO-8859-1")

In [None]:
# Overview of the data in both files
companies.info()
rounds.info()

In the rounds file, the variables funding_round_code and raised_amount_usd contain some missing values, as shown above. We'll deal with them once we understand the data: column names, primary keys of tables etc.

In [None]:
# Checking the shape of the DataFrames
print("Companies File: ", companies.shape)
print("Rounds File", rounds.shape)

In [None]:
#Find Primary Key in both data Frames.
print("Shape of companies: ", companies.shape)
for col in companies.columns:
    print("Unique values in {}:{}".format(col, companies[col].nunique()))

In [None]:
print("Shape of Rounds: ", rounds.shape)
for col in rounds.columns:
    print("Unique value in column: {} is {}".format(col, rounds[col].nunique()))

##### Compare values (Primary Key)

The ```permalink``` column in the companies dataframe should be the unique key of the table, with 66368 unique company permalinks. These 66368 companies should also be present in the rounds file.

Let's first confirm that these 66368 permalinks (which are the URL paths of companies' websites) do not repeat in the column, i.e. that they are unique. Then we will look for permalinks that are:

		a.  present in the companies DataFrame but not in the rounds DataFrame.
		b.  present in the rounds DataFrame but not in the companies DataFrame.
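
As a minimal sketch of the uniqueness check, the toy table below (hypothetical data, not taken from companies.txt) shows why the permalinks are lowercased before comparing: two spellings that differ only in case look unique until they are normalised.

```python
import pandas as pd

# Hypothetical toy table standing in for the companies file.
toy_companies = pd.DataFrame({
    "permalink": ["/organization/a", "/Organization/A", "/organization/b"],
    "name": ["A", "A", "B"],
})

# Case-sensitive check: the two spellings of "a" look distinct.
unique_before = toy_companies["permalink"].is_unique

# After lowercasing, the duplicate becomes visible.
lowered = toy_companies["permalink"].str.lower()
unique_after = lowered.is_unique
duplicates = lowered[lowered.duplicated(keep=False)].tolist()

print(unique_before, unique_after, duplicates)
```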

In [None]:
#Also, let's convert all the entries to lowercase (or uppercase) for uniformity.
# converting all permalinks to lowercase
companies['permalink'] = companies['permalink'].str.lower()
rounds['company_permalink'] = rounds['company_permalink'].str.lower()

In [None]:
# identify the unique number of permalinks in companies
len(companies.permalink.unique())

In [None]:
# look at unique company names in rounds master
# note that the column name in rounds file is different (company_permalink)
len(rounds.company_permalink.unique())

There seem to be 2 extra permalinks in the rounds file which are not present in the companies file. Hopefully this is only a data quality issue: if the difference were genuine, we would have two companies whose investment round details are available but whose metadata (company name, sector etc.) is missing from the companies table.


##### There are two ways to find the mismatch between the columns

###### Method 1:

In [None]:
# compare the two permalink columns as sets to find the mismatch
# companies present in companies master but not in rounds master
set(companies.permalink) - set(rounds.company_permalink )

In [None]:
# companies present in rounds master but not in companies master
set(rounds.company_permalink ) - set(companies.permalink)

###### Method 2:

In [None]:
# companies present in companies master but not in rounds master
companies.loc[~companies['permalink'].isin(rounds['company_permalink']), :]

In [None]:
# companies present in rounds file but not in (~) companies file
rounds.loc[~rounds['company_permalink'].isin(companies['permalink']), :]
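
A third option, sketched here on hypothetical toy frames rather than the real files, is an outer merge with `indicator=True`. pandas then adds a `_merge` column that labels each row as `left_only`, `right_only` or `both`, so both directions of the mismatch come out of one call.

```python
import pandas as pd

# Hypothetical toy frames; the column names mirror the notebook's.
toy_companies = pd.DataFrame(
    {"permalink": ["/organization/a", "/organization/b", "/organization/c"]}
)
toy_rounds = pd.DataFrame(
    {"company_permalink": ["/organization/a", "/organization/a", "/organization/d"]}
)

joined = pd.merge(
    toy_companies, toy_rounds,
    how="outer", left_on="permalink", right_on="company_permalink",
    indicator=True,
)

# Rows found on only one side of the join.
only_in_companies = joined.loc[joined["_merge"] == "left_only", "permalink"].tolist()
only_in_rounds = joined.loc[joined["_merge"] == "right_only", "company_permalink"].tolist()

print(sorted(only_in_companies), only_in_rounds)
```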

Some company permalinks contain weird characters that appear when you import the data file. To confirm whether these characters are actually present in the given data or whether python has introduced them while importing into pandas, let's have a look at the original CSV file in Excel.

Thus, this is most likely a data quality issue we have introduced while reading the data file into python. Specifically, this is most likely caused because of encoding.
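
To illustrate how a wrong codec can introduce such characters, here is a small standard-library sketch. It assumes, purely for illustration, that a name was stored as UTF-8 and then read as ISO-8859-1; the permalink is hypothetical and the real file's encoding may differ.

```python
# Hypothetical permalink containing one non-ASCII character.
original = "/organization/énergie"

# Bytes as they would sit on disk if the file were written as UTF-8 (an assumption).
raw_bytes = original.encode("utf-8")

# Decoding with the wrong codec raises no error, but yields the wrong text.
garbled = raw_bytes.decode("ISO-8859-1")

# Decoding with the matching codec gives the text back intact.
restored = raw_bytes.decode("utf-8")

print(garbled)
print(restored)
```

The point of the sketch is that ISO-8859-1 can decode any byte sequence, so a mismatch shows up as odd characters rather than as an exception.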

First, let's try to figure out the encoding type of this file. Then we can try specifying the encoding type at the time of reading the file. The ```chardet``` library shows the encoding type of a file.

In [None]:
import chardet

rawdata = open('../input/investment-analysis/rounds2.csv', 'rb').read()
result = chardet.detect(rawdata)
charenc = result['encoding']
print(charenc)

# print(result)

Now let's try telling pandas (at the time of importing) the encoding type. Here's a list of various encoding types python can handle: https://docs.python.org/2/library/codecs.html#standard-encodings.

In [None]:
# trying different encodings
# encoding="cp1254" throws an error
# rounds_original = pd.read_csv("rounds2.csv", encoding="cp1254")
# rounds_original.iloc[[29597, 31863, 45176], :]

Apparently, pandas cannot decode "cp1254" in this case.

After trying various other encoding types (in vain), this answer suggested an alternate (and a more intelligent) way: https://stackoverflow.com/questions/45871731/removing-special-characters-in-a-pandas-dataframe.

In [None]:
# strip non-ASCII characters from the permalinks in companies master
companies['permalink'] = companies.permalink.str.encode('utf-8').str.decode('ascii', 'ignore')

# strip non-ASCII characters from the permalinks in rounds master
rounds['company_permalink'] = rounds.company_permalink.str.encode('utf-8').str.decode('ascii', 'ignore')

Thus, the encoding issue seems resolved now.
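
It is worth seeing what this clean-up does to a value. The sketch below uses a hypothetical two-row Series: non-ASCII characters are dropped, not transliterated, so in principle two different names could collapse into the same key.

```python
import pandas as pd

# Hypothetical permalinks, one of them with a non-ASCII character.
toy = pd.Series(["/organization/énergie", "/organization/plain"])

# Same idiom as in the notebook: encode to bytes, then decode as ASCII,
# silently ignoring every byte that is not valid ASCII.
cleaned = toy.str.encode("utf-8").str.decode("ascii", "ignore")

print(cleaned.tolist())
```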

## Missing Value Treatment

Let's now move to missing value treatment. 

Let's have a look at the number of missing values in both the dataframes.

In [None]:
# missing values in companies master
companies.isnull().sum()

In [None]:
# missing values in rounds master
rounds.isnull().sum()

Since there are no missing values in the permalink or company_permalink columns, let's merge the two and then work on the master dataframe.

In [None]:
# merging the two masters
master = pd.merge(companies, rounds, how="inner", left_on="permalink", right_on="company_permalink")
master.head()
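
If you want the merge itself to guard the primary-key assumption, `pd.merge` accepts a `validate` argument. The sketch below uses hypothetical toy frames; with `validate="one_to_many"`, pandas raises a `MergeError` if a permalink repeats on the companies side.

```python
import pandas as pd

# Hypothetical toy frames with the notebook's column names.
toy_companies = pd.DataFrame(
    {"permalink": ["/organization/a", "/organization/b"], "name": ["A", "B"]}
)
toy_rounds = pd.DataFrame({
    "company_permalink": ["/organization/a", "/organization/a", "/organization/b"],
    "raised_amount_usd": [1e6, 5e6, 2e6],
})

toy_master = pd.merge(
    toy_companies, toy_rounds,
    how="inner", left_on="permalink", right_on="company_permalink",
    validate="one_to_many",  # fails loudly if the left key is not unique
)

print(len(toy_master))
```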

Since the columns ```company_permalink``` and ```permalink``` are the same, let's remove one of them.


In [None]:
# print column names
master.columns

In [None]:
# removing redundant columns
master =  master.drop(['company_permalink'], axis=1) 

In [None]:
# column-wise missing values 
master.isnull().sum()

Let's look at the fraction of missing values in the columns.

In [None]:
# summing up the missing values (column-wise) and displaying fraction of NaNs
round(100*(master.isnull().sum()/len(master.index)), 2)
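
The same percentages can be obtained a little more directly, since the mean of a boolean mask is the fraction of `True` values. A minimal sketch on a hypothetical toy frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few missing values.
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0, np.nan], "b": ["x", "y", None, "z"]})

# Mean of the boolean mask = fraction of missing values per column.
pct_missing = (100 * toy.isnull().mean()).round(2)

# The formulation used in the notebook, for comparison.
same_as_notebook = round(100 * (toy.isnull().sum() / len(toy.index)), 2)

print(pct_missing)
```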

Clearly, the column ```funding_round_code``` is useless (with about 73% missing values). Also, for the business objectives given, the columns ```homepage_url```, ```founded_at```, ```state_code```, ```region``` and ```city``` need not be used.

Thus, let's drop these columns.

In [None]:
# dropping columns 
master = master.drop(['funding_round_code', 'homepage_url', 'founded_at', 'state_code', 'region', 'city'], axis=1)
master.head()

In [None]:
# summing up the missing values (column-wise) and displaying fraction of NaNs
round(100*(master.isnull().sum()/len(master.index)), 2)

Note that the column ```raised_amount_usd``` is an important column, since that is the number we want to analyse (compare, means, sum etc.). That needs to be carefully treated. 

Also, the column ```country_code``` will be used for country-wise analysis, and ```category_list``` will be used to merge the dataframe with the main categories.

Let's first see how we can deal with missing values in ```raised_amount_usd```.


In [None]:
# summary stats of raised_amount_usd
master['raised_amount_usd'].describe()

The mean is somewhere around USD 10 million, while the median is only about USD 1m. The min and max values are also miles apart. 

In general, since there is a huge spread in the funding amounts, it will be inappropriate to impute it with a metric such as median or mean. Also, since we have quite a large number of observations, it is wiser to just drop the rows. 

Let's thus remove the rows having NaNs in ```raised_amount_usd```.

In [None]:
# removing NaNs in raised_amount_usd
master = master[~np.isnan(master['raised_amount_usd'])]
round(100*(master.isnull().sum()/len(master.index)), 2)

Let's now look at the column ```country_code```. To see the distribution of the values for categorical variables, it is best to convert them into type 'category'.

In [None]:
country_codes = master['country_code'].astype('category')

# displaying frequencies of each category
country_codes.value_counts().head(10)

In [None]:
# displaying frequencies of each category
country_codes.value_counts().tail(10)

By far the largest number of investments have happened in the USA. We can also see the fractions.

In [None]:
# viewing fractions of counts of country_codes
100*(master['country_code'].value_counts()/len(master.index)).head(10)

In [None]:
# viewing fractions of counts of country_codes
100*(master['country_code'].value_counts()/len(master.index)).tail(10)

Now, we can either delete the rows having ```country_code``` missing (about 6% rows), or we can impute them by ```USA```. Since the number 6 is quite small, and we have a decent amount of data, it may be better to just remove the rows.

**Note that** ```np.isnan``` does not work with arrays of type 'object', it only works with native numpy type (float). Thus, you can use ```pd.isnull()``` instead.
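
The difference can be seen on a small object array (hypothetical values): `np.isnan` raises a `TypeError`, while `pd.isnull` returns the expected mask.

```python
import numpy as np
import pandas as pd

# Object array mixing strings and a missing value, like a country_code column.
codes = np.array(["USA", None, "GBR"], dtype=object)

try:
    np.isnan(codes)
    isnan_failed = False
except TypeError:
    # np.isnan is not defined for object arrays.
    isnan_failed = True

# pd.isnull handles object data and flags None / NaN.
missing_mask = pd.isnull(codes)

print(isnan_failed, missing_mask.tolist())
```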

In [None]:
# removing rows with missing country_codes
master = master[~pd.isnull(master['country_code'])]

# look at missing values
round(100*(master.isnull().sum()/len(master.index)), 2)

Note that the fraction of missing values in the remaining dataframe has also reduced now - only 0.65% in ```category_list```. Let's thus remove those as well.

**Note**
Optionally, you could have simply left the missing values in the dataset and continued the analysis. There is nothing wrong with that. But in this case, since we will use that column later for merging with the 'main_categories', removing the missing values will be quite convenient (and again - we have enough data).

In [None]:
# removing rows with missing category_list values
master = master[~pd.isnull(master['category_list'])]

# look at missing values
round(100*(master.isnull().sum()/len(master.index)), 2)

## Funding Type Analysis

Let's compare the funding amounts across the funding types. Also, we need to impose the constraint that the investment amount should be between 5 and 15 million USD. We will choose the funding type such that the average investment amount falls in this range.

In [None]:
# first, let's filter the master so it only contains the four specified funding types
master = master[(master.funding_round_type == "venture") | 
        (master.funding_round_type == "angel") | 
        (master.funding_round_type == "seed") | 
        (master.funding_round_type == "private_equity") ]
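
The chained `|` conditions can also be written with `isin`, which is easier to extend. A sketch on a hypothetical toy frame, checking that both forms select the same rows:

```python
import pandas as pd

# Hypothetical toy frame with a mix of funding types.
toy = pd.DataFrame({
    "funding_round_type": [
        "venture", "seed", "grant", "angel", "debt_financing", "private_equity",
    ]
})

wanted = ["venture", "angel", "seed", "private_equity"]

# isin-based filter.
filtered = toy[toy["funding_round_type"].isin(wanted)]

# The chained form used in the notebook, for comparison.
chained = toy[(toy.funding_round_type == "venture") |
              (toy.funding_round_type == "angel") |
              (toy.funding_round_type == "seed") |
              (toy.funding_round_type == "private_equity")]

print(filtered["funding_round_type"].tolist())
```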



Now, we have to compute a **representative value of the funding amount** for each type of investment. We can either choose the mean or the median - let's have a look at the distribution of ```raised_amount_usd``` to get a sense of the distribution of data.



In [None]:
# distribution of raised_amount_usd
sns.boxplot(y=master['raised_amount_usd'])
plt.yscale('log')
plt.show()

Let's also look at the summary metrics.

In [None]:
# summary metrics
master['raised_amount_usd'].describe()

Note that there's a significant difference between the mean and the median - USD 9.5m and USD 2m. Let's also compare the summary stats across the four categories.

In [None]:
# comparing summary stats across four categories
sns.boxplot(x='funding_round_type', y='raised_amount_usd', data=master)
plt.yscale('log')
plt.show()

In [None]:
# compare the mean and median values across categories
master.pivot_table(values='raised_amount_usd', columns='funding_round_type', aggfunc=['median', 'mean'])

Note that there's a large difference between the mean and the median values for all four types. For type private_equity, for example, the median is about 20m while the mean is about 70m.

Thus, the choice of the summary statistic will drastically affect the decision (of the investment type). Let's choose median, since there are quite a few extreme values pulling the mean up towards them - but they are not the most 'representative' values.
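
A tiny worked example, with made-up amounts, shows how a single very large round drags the mean far away from the typical value while the median stays put:

```python
import pandas as pd

# Hypothetical funding amounts in USD: four ordinary rounds and one outlier.
amounts = pd.Series([1e6, 2e6, 3e6, 4e6, 500e6])

mean_amount = amounts.mean()      # pulled up by the 500m round
median_amount = amounts.median()  # the middle value, unaffected by the outlier

print(mean_amount, median_amount)
```

Here the mean is 102m even though four of the five rounds are 4m or less, whereas the median is 3m.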



In [None]:
# compare the median investment amount across the types
master.groupby('funding_round_type')['raised_amount_usd'].median().sort_values(ascending=False)

The median investment amount for type 'private_equity' is approx. USD 20m, which is beyond Spark Funds' range of 5-15m. The median of 'venture' type is about USD 5m, which is suitable for them. The median amounts of angel and seed types are lower than their range.

Thus, 'venture' type investment will be most suited to them.

## Country Analysis

Let's now compare the total investment amounts across countries. Note that we'll filter the data for only the 'venture' type investments and then compare the 'total investment' across countries.

###### Group by country (another method)
-  country_group = master_frame.groupby('country_code')

###### Data frame named top9 with the top nine countries (based on the total investment amount each country has received)
-  top9 = country_group['raised_amount_usd'].sum().nlargest(9)
-  top9 = top9.reset_index()
-  top9.rename(columns = {'raised_amount_usd':'Total_Investment'},inplace = True)

###### Updating the top9 table
-  top9['Eng_as_Offical'] = [True,False,True,True,True,False,True,False,False]

###### Creating the top3 table, where English is the official language
-  top3 = top9.loc[top9['Eng_as_Offical'] == True].nlargest(3,'Total_Investment').copy()
-  top3.reset_index(inplace = True,drop=True)
-  top3

In [None]:
# filter the master for venture type investments
master = master[master.funding_round_type=="venture"]

# group by country codes and compare the total funding amounts
country_wise_total = master.groupby('country_code')['raised_amount_usd'].sum().sort_values(ascending=False)
country_wise_total.head()

Let's now extract the top 9 countries from ```country_wise_total```.

In [None]:
# top 9 countries
top_9_countries = country_wise_total[:9]
top_9_countries

Among the top 9 countries, USA, GBR and IND are the top three English speaking countries. Let's filter the dataframe so it contains only the top 3 countries.
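
The selection can also be done in code instead of by eye. The sketch below uses hypothetical totals and an assumed lookup set of English-speaking country codes; both are illustrative, and the set should be checked against an authoritative list of official languages before use.

```python
import pandas as pd

# Hypothetical country-wise totals in USD, sorted in descending order.
totals = pd.Series(
    {"USA": 400e9, "CHN": 40e9, "GBR": 20e9, "IND": 14e9, "CAN": 9e9, "FRA": 7e9}
).sort_values(ascending=False)

# Assumed lookup of countries treated as English speaking (not from the dataset).
english_official = {"USA", "GBR", "IND", "CAN"}

# Keep only English-speaking countries, then take the three largest.
top3_english = totals[totals.index.isin(english_official)].head(3)

print(list(top3_english.index))
```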

In [None]:
# filtering for the top three countries
master = master[(master.country_code=='USA') | (master.country_code=='GBR') | (master.country_code=='IND')]

After filtering for 'venture' investments and the three countries USA, Great Britain and India, the filtered master looks like this.

In [None]:
# filtered master has about 38800 observations
master.info()

One can visually analyse the distribution and the total values of funding amount.

In [None]:
# boxplot to see distributions of funding amount across countries
plt.figure(figsize=(8, 5))
sns.boxplot(x='country_code', y='raised_amount_usd', data=master)
plt.yscale('log')
plt.show()

Now, we have shortlisted the investment type (venture) and the three countries. Let's now choose the sectors.

## Sector Analysis

First, we need to extract the main sector using the column ```category_list```. The category_list column contains values such as 'Biotechnology|Health Care' - in this, 'Biotechnology' is the 'main category' of the company, which we need to use.

Let's extract the main categories in a new column.

In [None]:
# extracting the main category
master.loc[:, 'main_category'] = master['category_list'].apply(lambda x: x.split("|")[0])
master.head(2)
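
An equivalent vectorised form uses the `.str` accessor instead of `apply`. A minimal sketch on hypothetical category strings:

```python
import pandas as pd

# Hypothetical category_list values in the same pipe-separated format.
toy = pd.DataFrame({
    "category_list": ["Biotechnology|Health Care", "Analytics", "Media|News|Publishing"]
})

# Split on the pipe and keep the first element as the main category.
toy["main_category"] = toy["category_list"].str.split("|").str[0]

print(toy["main_category"].tolist())
```

Unlike the lambda, this form passes missing values through as NaN instead of raising an error, although the rows with a missing `category_list` have already been removed at this point.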

We can now drop the ```category_list``` column.

In [None]:
# drop the category_list column
master = master.drop('category_list', axis=1)
master.head(2)

Now, we'll read the ```mapping.csv``` file and merge the main categories with its corresponding column. 

In [None]:
# read mapping file
mapping = pd.read_csv("../input/investment-analysis/mapping.csv", sep=",")
mapping.head()

Firstly, let's get rid of the missing values since we'll not be able to merge those rows anyway. 

In [None]:
# missing values in mapping file
mapping.isnull().sum()

In [None]:
# remove the row with missing values
mapping = mapping[~pd.isnull(mapping['category_list'])]
mapping.isnull().sum()

Now, since we need to merge the mapping file with the main dataframe (master), let's convert the common column to lowercase in both.

In [None]:
# converting common columns to lowercase
mapping['category_list'] = mapping['category_list'].str.lower()
master['main_category'] = master['main_category'].str.lower()

In [None]:
# look at heads
mapping.head(2)

In [None]:
master.head(2)

Let's have a look at the ```category_list``` column of the mapping file. These values will be used to merge with the main master.

In [None]:
mapping['category_list'][:10]

To be able to merge all the ```main_category``` values with the mapping file's ```category_list``` column, all the values in the  ```main_category``` column should be present in the ```category_list``` column of the mapping file.

Let's see if this is true.

In [None]:
# values in main_category column in master which are not in the category_list column in mapping file
master[~master['main_category'].isin(mapping['category_list'])].head(10)

Notice that values such as 'analytics', 'business analytics', 'finance', 'nanotechnology' etc. are not present in the mapping file.

Let's have a look at the values in the mapping file that cannot match the main dataframe master. A quick way to surface them is to list the categories that contain the character '0'.

In [None]:
# values in the category_list column which contain the character '0'
mapping.loc[mapping.category_list.str.contains('0')]

If you look carefully, you'll notice something fishy - there are sectors named *alter0tive medicine*, *a0lytics*, *waste ma0gement*, *veteri0ry*, etc. This is not a *random* quality issue, but rather a pattern. In some strings, the 'na' has been replaced by '0'. This is weird - maybe someone was trying to replace the 'NA' values with '0', and ended up doing this.

Let's treat this problem by replacing '0' with 'na' in the ```category_list``` column.

In [None]:
# replacing '0' with 'na'
mapping['category_list'] = mapping['category_list'].apply(lambda x: x.replace('0', 'na'))
mapping['category_list'].head(20)

# This can also be done using a regex:
# mapping.loc[mapping['category_list'].str.match(r'.*\d+[\w\s]'),'category_list'] = mapping.category_list.str.replace('0','na')
# mapping.loc[mapping.category_list.str.contains('0')]
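
One caveat with a blanket replacement: a category that legitimately contains a zero (a name ending in '2.0', say) would be altered too. The sketch below, on hypothetical category strings, compares the blanket replace with a narrower rule that leaves a '0' alone when it sits next to a digit or follows a dot. The pattern is an illustrative assumption and should be checked against the actual mapping file.

```python
import pandas as pd

# Hypothetical category names: three corrupted ones and one with a genuine zero.
toy = pd.Series(["a0lytics", "alter0tive medicine", "0notechnology", "enterprise 2.0"])

# Blanket replacement, as in the notebook: every '0' becomes 'na'.
blanket = toy.str.replace("0", "na", regex=False)

# Narrower rule: replace '0' only when it is not preceded by a digit or a dot
# and not followed by a digit.
narrow = toy.str.replace(r"(?<![\d.])0(?!\d)", "na", regex=True)

print(blanket.tolist())
print(narrow.tolist())
```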

This looks fine now. Let's now merge the two dataframes.

In [None]:
# merge the masters
master = pd.merge(master, mapping, how='inner', left_on='main_category', right_on='category_list')
master.head()

In [None]:
# let's drop the category_list column since it is the same as main_category
master = master.drop('category_list', axis=1)
master.head()

In [None]:
# look at the column types and names
master.info()

### Converting the 'wide' dataframe to 'long'

You'll notice that the columns representing the main category in the mapping file are originally in the 'wide' format - Automotive & Sports, Cleantech / Semiconductors etc.

They contain the value '1' if the company belongs to that category, else 0. This is quite redundant. We can just as well have a single column named 'sector' holding the name of the category.

Let's convert the master into the long format from the current wide format. First, we'll store the 'value variables' (those which are to be melted) in an array. The rest will then be the 'index variables'.
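
Before applying this to the master, here is a minimal sketch of the wide-to-long conversion on a hypothetical toy frame. It selects the value variables by name, not by column position, so it does not depend on the column order.

```python
import pandas as pd

# Hypothetical wide frame: two id columns plus one 0/1 column per sector.
toy = pd.DataFrame({
    "permalink": ["/organization/a", "/organization/b"],
    "main_category": ["analytics", "biotechnology"],
    "Health": [0, 1],
    "Others": [0, 0],
    "Social, Finance, Analytics, Advertising": [1, 0],
})

id_vars = ["permalink", "main_category"]
# Every column that is not an id variable is a sector indicator.
value_vars = [c for c in toy.columns if c not in id_vars]

# Melt to long format, then keep only the rows where the indicator is 1.
long_toy = pd.melt(toy, id_vars=id_vars, value_vars=value_vars, var_name="sector")
long_toy = long_toy[long_toy["value"] == 1].drop(columns="value")

print(long_toy)
```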

In [None]:
# help(pd.melt)

In [None]:
# store the value and id variables in two separate arrays

# store the value variables (the nine columns coming from the mapping file) as an Index
value_vars = master.columns[9:18]

# take the setdiff() to get the rest of the variables
id_vars = np.setdiff1d(master.columns, value_vars)


In [None]:
# convert into long
long_master = pd.melt(master, 
        id_vars=list(id_vars), 
        value_vars=list(value_vars))

long_master.head()

We can now get rid of the rows where the column 'value' is 0 and then remove that column altogether.

In [None]:
# remove rows having value=0
long_master = long_master[long_master['value']==1]
long_master = long_master.drop('value', axis=1)

In [None]:
# look at the new master
long_master.head()
len(long_master)

In [None]:
# renaming the 'variable' column
long_master = long_master.rename(columns={'variable': 'sector'})

The dataframe now contains only venture type investments in countries USA, IND and GBR, and we have mapped each company to one of the eight main sectors (named 'sector' in the dataframe). 

We can now compute the sector-wise number and the amount of investment in the three countries.

In [None]:
# summarising the sector-wise number and sum of venture investments across three countries

# first, let's also filter for investment range between 5 and 15m
master = long_master[(long_master['raised_amount_usd'] >= 5000000) & (long_master['raised_amount_usd'] <= 15000000)]


In [None]:
# groupby country, sector and compute the count and sum
master.groupby(['country_code', 'sector']).raised_amount_usd.agg(['count', 'sum'])
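
The range filter and the summary can be sketched together on hypothetical toy data. `Series.between` is inclusive on both ends by default, which matches the `>=` and `<=` conditions above, and named aggregation gives the output columns readable names (`count` and `total` are names chosen here for illustration).

```python
import pandas as pd

# Hypothetical toy rows with the notebook's column names.
toy = pd.DataFrame({
    "country_code": ["USA", "USA", "USA", "GBR", "GBR"],
    "sector": ["Others", "Others", "Health", "Others", "Health"],
    "raised_amount_usd": [6e6, 10e6, 20e6, 5e6, 15e6],
})

# Keep rounds between 5m and 15m USD, inclusive on both ends.
in_range = toy[toy["raised_amount_usd"].between(5_000_000, 15_000_000)]

# Count and total per country and sector, largest counts first within a country.
summary = (
    in_range.groupby(["country_code", "sector"])["raised_amount_usd"]
    .agg(count="count", total="sum")
    .reset_index()
    .sort_values(["country_code", "count"], ascending=[True, False])
)

print(summary)
```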

This will be much easier to understand using a plot.

In [None]:
# plotting sector-wise count and sum of investments in the three countries
plt.figure(figsize=(16, 14))

plt.subplot(2, 1, 1)
p = sns.barplot(x='sector', y='raised_amount_usd', hue='country_code', data=master, estimator=np.sum)
p.set_xticklabels(p.get_xticklabels(),rotation=30)
plt.title('Total Invested Amount (USD)')

plt.subplot(2, 1, 2)
q = sns.countplot(x='sector', hue='country_code', data=master)
q.set_xticklabels(q.get_xticklabels(),rotation=30)
plt.title('Number of Investments')


plt.show()

Thus, the top country in terms of the number of investments (and the total amount invested) is the USA. The sectors 'Others', 'Social, Finance, Analytics and Advertising' and 'Cleantech/Semiconductors' are the most heavily invested ones.

In case you don't want to consider 'Others' as a sector, 'News, Search and Messaging' is the next best sector.