## Startup Exploration



Investing in startups is a speculative, intuition-driven venture: investors put money into an idea that is not yet proven but shows the potential to generate returns. Some investors have a direct exit strategy and others do not.

From this standpoint, investors act to protect themselves, which leads to events such as IPOs, closures, acquisitions, or continued operation as a company. Because the venture is speculative, closure is the most expensive event for investors, as there is a high likelihood that they lose their money.


With this background, this notebook seeks to learn the characteristics of the IPO, closure, acquisition, and still-operating events, predict whether a company will undergo one of these events, and finally recommend startups to consider for an investment portfolio.

#### Setting up the S3 environment for this notebook

In [None]:
# import sagemaker
# import boto3
# from sagemaker.amazon.amazon_estimator import get_image_uri
# from sagemaker.session import s3_input, Session

In [None]:
# bucket_name = 'startup-recommender-app'
# my_region = boto3.session.Session().region_name

In [None]:
# s3 = boto3.client('s3')
# existing_buckets = [x['Name'] for x in s3.list_buckets()['Buckets']]

# try:
    # if bucket_name not in existing_buckets:
       # if my_region is None:
           # s3.create_bucket(Bucket=bucket_name)
          #  print('s3 bucket created successfully')
        #else:
           # location = {'LocationConstraint': my_region}
           # s3.create_bucket(Bucket=bucket_name, CreateBucketConfiguration=location)
    #else:
      #  print('Already exists')
#except Exception as e:
   # print('s3 error: ', e)

In [None]:
# Setting up the path for model saving
#prefix = 'models-as-built-in-algo'
#output_path = 's3://{}/{}/output'.format(bucket_name, prefix)
#print(output_path)

In [None]:
#Importing necessary packages to read files
import pandas as pd
import numpy as np

In [None]:
df1 = pd.read_csv('../input/startup-investments/objects.csv')

In [None]:
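# List the columns at the positions flagged by the warning raised by the read above
# (the name `list` is avoided because it would shadow the Python builtin)
columns = df1.columns.to_list()
flagged_positions = [3,7,9,10,17,18,21,22,23,25,26,29,30,33,34,37]
for i in flagged_positions:
    print(columns[i], i)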

In [None]:
#Objects Data
df1 = pd.read_csv('../input/startup-investments/objects.csv', dtype={'parent_id':'object', 'category_code': 'object', 'short_description': 'object', 'description': 'object', 'country_code': 'object', 'state_code': 'object', 'city': 'object', 'created_by': 'object'}, parse_dates=[9, 10, 25, 26, 29, 30, 33, 34])
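
The `dtype` and `parse_dates` arguments used above can be tried out on a small in-memory file. The sketch below is only an illustration: the three rows are made up, and it refers to the date column by name rather than by position, which is an assumption about what is convenient and not what the cell above does.

```python
import io
import pandas as pd

# Hypothetical three-row sample; the column names mirror objects.csv, the values are made up.
sample_csv = io.StringIO(
    "id,parent_id,founded_at\n"
    "c:1,,2005-10-17\n"
    "c:2,c:1,\n"
    "c:3,,2008-07-26\n"
)

sample = pd.read_csv(
    sample_csv,
    dtype={"parent_id": "object"},  # keep a sparse text column as plain objects
    parse_dates=["founded_at"],     # names are less brittle than positions if columns move
)

print(sample.dtypes)
```

Missing dates come back as `NaT`, so the column still gets a datetime dtype and supports the date arithmetic used later for the age feature.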

In [None]:
df1.head()

In [None]:
#Funding Rounds Data
df2 = pd.read_csv('../input/startup-investments/funding_rounds.csv', parse_dates=[3])

In [None]:
#IPOs Data
df3 = pd.read_csv('../input/startup-investments/ipos.csv', parse_dates=[7])
df3.head()

In [None]:
df1['entity_type'].unique()

**Observation**

objects.csv contains data on companies, financial organizations, people, and products.

In [None]:
df5 = df1[df1['entity_type'] == 'Company']

**_Observations_**

Relevant features are selected by intuition, and irrelevant ones are dropped, to form a new DataFrame.

In [None]:
# Creation of df6
df6 = df5[['id','status','normalized_name', 'category_code', 'founded_at', 'closed_at', 'tag_list', 'country_code', 'investment_rounds', 'invested_companies', 'first_funding_at', 'last_funding_at', 'funding_rounds', 'funding_total_usd', 'first_milestone_at', 'last_milestone_at', 'milestones', 'relationships']].copy()

In [None]:
# Looking into the contents of the tag_list feature to check its relevance to the study of this notebook
df5[['tag_list']][~(df5['tag_list'].isna())]

In [None]:
df6.shape

In [None]:
df6.info()

## Feature Engineering

At this point the notebook engineers new features from the selected columns, partly to work around some of the null values.

1. **Age**

In [None]:
# Reference date for operating, acquired and ipo companies (the latest closure date in the data)
BASE_YEAR = df6.closed_at.max()

In [None]:
from datetime import timedelta

days_yr = timedelta(days=365.25)
days_yr

In [None]:
%%time
# Creating the feature age for the companies
age = []

for i in range(df6.shape[0]):
    if (df6.status.iloc[i] == 'operating') or (df6.status.iloc[i] == 'acquired') or (df6.status.iloc[i] == 'ipo'):
        age.append((BASE_YEAR - df6.founded_at.iloc[i])/days_yr)
    else:
        age.append((df6.closed_at.iloc[i] - df6.founded_at.iloc[i])/days_yr)

In [None]:
df6.loc[:, 'age'] = age
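
The row-by-row loop above can also be written as a vectorized expression. The following is a minimal sketch on a made-up three-row frame (the column names follow `df6`; `reference_date` plays the role of `BASE_YEAR`), not a drop-in replacement that has been timed on the real data.

```python
import pandas as pd

# Toy frame with made-up dates; column names follow df6.
toy = pd.DataFrame({
    "status": ["operating", "closed", "ipo"],
    "founded_at": pd.to_datetime(["2010-01-01", "2008-01-01", None]),
    "closed_at": pd.to_datetime([None, "2012-01-01", None]),
})

reference_date = toy["closed_at"].max()      # plays the role of BASE_YEAR
days_per_year = pd.Timedelta(days=365.25)

# Companies that are still alive age until the reference date;
# every other row ages until its own closure date.
alive = toy["status"].isin(["operating", "acquired", "ipo"])
end_date = toy["closed_at"].mask(alive, reference_date)

toy["age"] = (end_date - toy["founded_at"]) / days_per_year
print(toy[["status", "age"]])
```

Rows with a missing `founded_at` end up with a missing age, which matches what the loop produces.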

In [None]:
df6[df6.age < 0]

2. **Category**

In [None]:
df6.category_code.unique(), len(df6.category_code.unique())

In [None]:
# Creating a new feature to generalize the category code

leisure = ['games_video', 'photo_video', 'social', 'hospitality', 'sports', 'fashion', 'messaging', 'music']
bizsupport = ['network_hosting', 'advertising', 'enterprise', 'consulting', 'analytics', 'public_relations', 'security', 'legal']
building = ['cleantech', 'manufacturing', 'semiconductor', 'automotive', 'real_estate', 'nanotech']
petcare = ['pets']
travel = ['travel', 'transportation']
health = ['health', 'medical', 'biotech']
other = ['web', 'other', 'mobile', 'software', 'finance', 'education', 'ecommerce', 'search', 'hardware', 'news', 'government', 'nonprofit', 'local',]

In [None]:
new_catg = []

for i in range(df6.category_code.shape[0]):
    x = df6.category_code.iloc[i]
    if x in leisure:
        new_catg.append('LE')
    elif x in bizsupport:
        new_catg.append('BZ')
    elif x in building:
        new_catg.append('BU')
    elif x in petcare:
        new_catg.append('PC')
    elif x in travel:
        new_catg.append('TR')
    elif x in health:
        new_catg.append('HE')
    else:
        new_catg.append('OT')


In [None]:
df6.loc[:,"category"] = new_catg
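
The same grouping can be expressed as a lookup table and applied with `Series.map`. This is a sketch under stated assumptions: `groups` is a hypothetical, abbreviated subset of the lists defined above, and the sample codes are made up.

```python
import pandas as pd

# Abbreviated, hypothetical subset of the category groups defined above.
groups = {
    "LE": ["games_video", "music"],
    "HE": ["health", "biotech"],
    "TR": ["travel"],
}

# Invert the groups into a code -> label lookup.
code_to_group = {code: label for label, codes in groups.items() for code in codes}

codes = pd.Series(["music", "biotech", "web", None])

# Codes that are unknown or missing fall back to 'OT', like the final else branch.
category = codes.map(code_to_group).fillna("OT")
print(category.tolist())
```

A lookup of this kind keeps the mapping in one place, so a misspelled code shows up as an unexpected `'OT'` count rather than hiding inside a chain of `elif` branches.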

3. **Continent**

In [None]:
# Looking into the number and unique values in the country_code feature
df6.country_code.sort_values().unique(), len(df6.country_code.unique())

In [None]:
# Creating a new feature to generalize the country_code

Africa = ['AGO', 'BDI', 'BEN', 'BWA', 'CIV', 'CMR', 'DZA', 'EGY', 'ETH', 'GHA', 'GIN', 'KEN', 'LSO', 'MAR', 'MDG', 'MUS', 'NAM', 'NER','NGA', 'REU','RWA', 'SDN','SEN', 'SLE', 'SOM','SWZ', 'SYC', 'TUN', 'TZA', 'UGA', 'ZAF', 'ZMB', 'ZWE']
Asia = ['AFG', 'ARE', 'BGD', 'BHR', 'BRN', 'CHN', 'HKG', 'IDN', 'IND', 'IOT', 'IRN', 'IRQ', 'ISR','JOR', 'JPN', 'KAZ', 'KGZ', 'KHM', 'KOR', 'KWT','LAO', 'LBN', 'LKA', 'MAC', 'MDV', 'MMR', 'MYS', 'NPL', 'OMN', 'PAK', 'PCN','PHL','PRK','PST', 'QAT', 'SAU', 'SGP','SYR', 'THA', 'TJK', 'TWN', 'UZB', 'VNM', 'YEM']
Europe = ['AIA', 'ALB', 'AND', 'ARM', 'AUT', 'AZE', 'BEL', 'BGR','BIH', 'BLR', 'CHE', 'CYP', 'CZE', 'DEU', 'DNK','ESP', 'EST', 'FIN', 'FRA', 'GBR', 'GEO', 'GIB', 'GLB', 'GRC', 'HRV', 'HUN', 'IRL', 'ISL', 'ITA', 'LIE', 'LTU','LUX', 'LVA', 'MCO', 'MDA', 'MKD', 'MLT', 'NLD', 'NOR', 'POL', 'PRT', 'ROM', 'RUS', 'SMR', 'SVK', 'SVN','SWE', 'TUR', 'UKR']
North_America = ['ATG', 'BHS','BLZ', 'BMU', 'BRB', 'CAN', 'CRI','CUB','CYM', 'DMA', 'GRD', 'GTM', 'HND', 'HTI', 'JAM', 'MEX', 'MTQ', 'PAN', 'PRI', 'SLV', 'UMI','USA', 'VGB', 'VIR']
South_America = ['ARG', 'BOL', 'BRA', 'CHL', 'COL', 'DOM', 'ECU', 'NIC', 'PER', 'PRY', 'SUR', 'TTO', 'URY','VEN', 'VCT']
Other = ['ANT', 'ARA', 'AUS', 'CSS', 'FST', 'HMI','NCL', 'NFK','NRU', 'NZL']


In [None]:
continent = []

for i in range(df6.country_code.shape[0]):
    x = df6.country_code.iloc[i]
    if x in Africa:
        continent.append('AF')
    elif x in Asia:
        continent.append('AS')
    elif x in Europe:
        continent.append('EU')
    elif x in North_America:
        continent.append('NA')
    elif x in South_America:
        continent.append('SA')
    else:
        continent.append('UT')


In [None]:
df6.loc[:,'continent']= continent

In [None]:
df6[~(df6['founded_at'].isna()) & ~(df6['first_milestone_at'].isna())].shape

In [None]:
df6[~(df6['founded_at'].isna()) & ~(df6['first_milestone_at'].isna()) & ~(df6['first_funding_at'].isna())].shape

**Observation**

From this brief exploration, we lose about half of the data when the first_funding_at feature is also required to be present.

4. **Funding Type**

In [None]:
df2.info()

In [None]:
df2.head()

In [None]:
funding_type = df2.groupby(['object_id', 'funding_round_type'])['funding_round_type'].count().unstack()

In [None]:
funding_type.fillna(value=0, axis=1, inplace=True)

In [None]:
funding_type
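
To make the shape of `funding_type` concrete, here is a small sketch of the same count-and-pivot idea on made-up rounds. It uses `size()` with `unstack(fill_value=0)`, a variation on the `count()`, `unstack()`, `fillna()` sequence above; the data is hypothetical.

```python
import pandas as pd

# Made-up funding rounds; column names follow funding_rounds.csv as used above.
rounds = pd.DataFrame({
    "object_id": ["c:1", "c:1", "c:1", "c:2"],
    "funding_round_type": ["angel", "series-a", "series-a", "venture"],
})

# One row per company, one column per round type, zero where no such round exists.
counts = (
    rounds.groupby(["object_id", "funding_round_type"])
    .size()
    .unstack(fill_value=0)
)
print(counts)
```

Filling with zero during the unstack keeps the counts as integers, whereas filling afterwards leaves them as floats.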

In [None]:
df7 = pd.merge(left=df6, right=funding_type ,how='inner', left_on='id', right_on=funding_type.index)

In [None]:
df7.info()

In [None]:
df7[df7['id'] == 'c:104377'].unstack()

5. **Number of Products**

In [None]:
products = df1[df1['entity_type'] == 'Product']

In [None]:
products.info()

In [None]:
products['status'].unique()

**Observation**

We will create a new feature by re-categorising the status feature

In [None]:
dev = ['alpha', 'beta', 'development']
operating = ['live', 'operating', 'private']
closed = ['closed']

In [None]:
status = []

for i in range(products.shape[0]):
    x = products.status.iloc[i]

    if x in dev:
        status.append('dev')
    elif x in operating:
        status.append('operating')
    elif x in closed:
        status.append('closed')

In [None]:
products  = products.assign(status = status)

In [None]:
products.status.unique()

In [None]:
no_products = products.groupby(['parent_id', 'status'])['status'].count().unstack()
no_products.fillna(0, inplace=True)

In [None]:
no_products.shape

In [None]:
df8 = pd.merge(left=df7, right=no_products, how='left', left_on='id', right_on='parent_id')

**Points to Note**

When merging df7 and no_products, df7 is treated as the main table (a left merge). no_products covers fewer companies than df7, so the merged product columns contain null values. The major assumption is that an operating company has at least one operating product, and that a closed company has at least one closed product and no product in development.

Thus, we will work with these assumptions for the null values created:

1. A company with closed status has 1 closed product
2. A company with closed status has 0 operating product
3. A company with operating status has 1 operating product
4. A company with acquired status has 1 operating product
5. A company with ipo status has 1 operating product
6. Remaining null values in the closed feature are zero, following assumptions 1 to 5
7. Null values in the dev feature are zero, following assumptions 1 to 6
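
The assumptions listed above can be sketched on a toy frame. `toy` is a made-up stand-in for df8, and the sketch assumes that status only takes the four values closed, operating, acquired, and ipo; it condenses the rules rather than reproducing the individual cells.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df8 after the left merge; NaN marks companies with no product rows.
toy = pd.DataFrame({
    "status": ["closed", "operating", "acquired", "ipo", "operating"],
    "closed": [np.nan, np.nan, np.nan, np.nan, 0.0],
    "dev": [np.nan, np.nan, np.nan, np.nan, 2.0],
    "operating": [np.nan, np.nan, np.nan, np.nan, 3.0],
})

is_closed = toy["status"] == "closed"

# Assumptions 1 and 6: a closed company gets one closed product, all others zero.
toy["closed"] = toy["closed"].fillna(is_closed.astype(float))
# Assumptions 2 to 5: zero operating products if closed, otherwise one.
toy["operating"] = toy["operating"].fillna((~is_closed).astype(float))
# Assumption 7: no development-stage products are assumed.
toy["dev"] = toy["dev"].fillna(0)

print(toy)
```

Only the null entries are touched: the last row, which already has product counts, keeps its original values.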

In [None]:
# Assumption 1
df8.loc[(df8.status == 'closed') & (df8.closed.isna()), 'closed'] = 1

In [None]:
#Assumption 2
df8.loc[(df8.status == 'closed') & (df8.operating.isna()), 'operating'] = 0

In [None]:
# Assumption 3
df8.loc[(df8.status == 'operating') & (df8.operating.isna()), 'operating'] = 1

In [None]:
# Assumption 4
df8.loc[(df8.status == 'acquired') & (df8.operating.isna()), 'operating'] = 1

In [None]:
#Assumption 5
df8.loc[(df8.status == 'ipo') & (df8.operating.isna()), 'operating'] = 1

In [None]:
# Assumption 6
df8['closed'] = df8['closed'].fillna(0)  # assign back instead of inplace on a column selection

In [None]:
# Assumption 7
df8['dev'] = df8['dev'].fillna(0)  # assign back instead of inplace on a column selection

## EDA

As a rule of thumb, a data science workflow goes through EDA to glean insight into how the data is structured and whether the features bear on the goal.

For this analysis, we skip this important step and proceed directly to classification and recommender building because of:
1. The disconnect between the .csv files
2. Feature engineering having been done earlier
3. The intuitive selection of features important to the exploration and model building

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df8.info()

In [None]:
# picking columns for model building
data = df8[['status', 'investment_rounds', 'invested_companies', 'funding_rounds', 'funding_total_usd', 'milestones', 'relationships', 'age', 'category', 'continent', 'angel', 'crowdfunding', 'other', 'post-ipo', 'private-equity', 'series-a', 'series-b', 'series-c+', 'venture', 'closed', 'dev', 'operating']].copy()

In [None]:
data.describe()

**Observation**

The distributions of all features look reasonable except for _age_, which has missing and negative values. This notebook will therefore drop the rows with negative age (outliers) and handle the null ages with a categorical feature.

In [None]:
# Determining the rows with negative age values
data[data.age < 0]

In [None]:
#Dropping the rows with negative age values
data.drop((data.loc[data.age < 0 ].index), axis=0, inplace=True)

In [None]:
# Due to null values in age I will create a new feature (age_set)
# Exploration to determine number of classes for the age_set feature
data.age.hist(bins=50)
plt.show()

**Observation**

1. Ages are concentrated below 20, so we adopt two classes (young and old) plus a third class (other) for null values

In [None]:
age_set = []

for i in range(data.shape[0]):
    x = data.age.iloc[i]

    if x <= 20:
        age_set.append('young')
    elif x > 20:
        age_set.append('old')
    else:
        age_set.append('other')


In [None]:
data.loc[:, 'age_set'] = age_set 
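
The binning in the loop above can also be done with `pd.cut`. This is a minimal sketch on made-up ages, assuming negative ages have already been dropped as in the earlier cell.

```python
import numpy as np
import pandas as pd

# Made-up ages, including the boundary value and a missing value.
ages = pd.Series([3.5, 20.0, 42.0, np.nan])

# right=True makes the boundary age 20 fall in 'young', matching `x <= 20` above.
age_set = pd.cut(ages, bins=[-np.inf, 20, np.inf], labels=["young", "old"], right=True)

# pd.cut leaves missing ages as NaN; give them their own 'other' class.
age_set = age_set.cat.add_categories("other").fillna("other")
print(age_set.tolist())
```

The result is a categorical Series, so the three classes are explicit even if one of them happens to be absent from the data.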

#### Label Encoding and One-Hot Encoding

The status, age_set, continent, and category features need to be converted into a numeric form that models can work with.

In [None]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

In [None]:
status_le = LabelEncoder()
ageset_le = OrdinalEncoder()
# scikit-learn >= 1.2 names this argument sparse_output; older releases use sparse=False
continent_oe = OneHotEncoder(drop='first', sparse_output=False)
category_oe = OneHotEncoder(drop='first', sparse_output=False)

status_transformed = status_le.fit_transform(data['status'])
ageset_transformed = ageset_le.fit_transform(data[['age_set']])
continent_transformed = continent_oe.fit_transform(data[['continent']])
category_transformed = category_oe.fit_transform(data[['category']])

In [None]:
category_oe.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

In [None]:
continent_oe.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

In [None]:
continent_ = pd.DataFrame(continent_transformed, columns=['AS', 'EU', 'NA', 'SA', 'UT'])
category_ = pd.DataFrame(category_transformed, columns=['BZ', 'HE', 'LE', 'OT', 'PC', 'TR'])
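
The hard-coded column labels above rely on the categories being sorted alphabetically, with the first one dropped as the baseline. The sketch below shows the same drop-first behaviour using `pd.get_dummies`, which is a swapped-in pandas technique and not the `OneHotEncoder` used in this notebook; the values are made up.

```python
import pandas as pd

# Made-up continent labels using the same two-letter codes as above.
continent = pd.Series(["EU", "AF", "NA", "AS", "EU"], name="continent")

# Categories are sorted alphabetically and the first one ('AF') becomes the baseline.
dummies = pd.get_dummies(continent, drop_first=True, dtype=float)
print(dummies.columns.tolist())
print(dummies)
```

A row belonging to the dropped baseline category is encoded as all zeros, which is why no `AF` column (and no `BU` column for category) appears in the labels.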

In [None]:
data.reset_index(inplace=True)

In [None]:
full_d = pd.concat([data, continent_, category_], axis=1)

In [None]:
full_d = full_d.assign(status = status_transformed, age_set= ageset_transformed)

In [None]:
full_d.drop(['age', 'category', 'continent', 'index'], axis=1, inplace=True)

#### Scaling

Scaling is applied to funding_total_usd because it has extremely large values.

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
funding_scaler = StandardScaler()

funding_transformed = funding_scaler.fit_transform(full_d[['funding_total_usd']])

In [None]:
full_d = full_d.assign(funding_total_usd = funding_transformed)

**Observation**

1. The preprocessing above does not need to be repeated, so a copy of the transformed data will be saved to S3 buckets after splitting it into train and test sets

### Train Test Splitting

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(full_d.drop(['status'], axis=1), full_d['status'], test_size=.3, random_state=42)
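
In this notebook the scaler is fitted on the full dataset before the split. A common variation, shown here only as a sketch on made-up numbers and not as what the cells above do, is to compute the scaling statistics on the training rows and reuse them for the test rows. Plain pandas arithmetic is used in place of `StandardScaler` to keep the example self-contained.

```python
import pandas as pd

# Made-up funding totals; the first three rows stand in for the training split.
funding = pd.Series([1e5, 5e5, 9e5, 2e7], name="funding_total_usd")
train, test = funding.iloc[:3], funding.iloc[3:]

# StandardScaler standardizes with the mean and the population standard deviation (ddof=0).
mean, std = train.mean(), train.std(ddof=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std   # reuse the training statistics on the test rows

print(train_scaled.tolist(), test_scaled.tolist())
```

Reusing the training statistics means a very large test value is scaled relative to what the model saw during training instead of shifting the mean itself.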

In [None]:
# import os

# Saving Train Data to Buckets
pd.concat([y_train, X_train], axis=1).to_csv('./train.csv', index=False)

#s3 = boto3.resource('s3')
#s3.Object(bucket_name, os.path.join(prefix, 'train/train.csv')).upload_file('./train.csv')
# s3_input_train = sagemaker.TrainingInput(s3_data='s3://{}/{}/train'.format(bucket_name,prefix), content_type='csv')

In [None]:
# Saving Test Data to Buckets
pd.concat([y_test, X_test], axis=1).to_csv('./test.csv', index=False)

#s3.Object(bucket_name, os.path.join(prefix, 'test/test.csv')).upload_file('./test.csv')
#s3_input_test = sagemaker.TrainingInput(s3_data='s3://{}/{}/test'.format(bucket_name,prefix), content_type='csv')