Goal : <br>
Startup Success Prediction<br>
<br>
Objective :<br>
Funding Efficiency<br>
Low-Risk Investment<br>
Reduced Potential Loss

In [None]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
data = pd.read_csv("../input/startup-success-prediction/startup data.csv")
data.head()

# Data Cleansing

## Check & Drop Duplicate Data

In [None]:
data.duplicated().sum()

In [None]:
data.duplicated(subset=['name']).sum()

In [None]:
data=data.drop_duplicates(subset=['name'])

In [None]:
data.info()

Data update: after dropping duplicate names, 922 rows and 49 columns remain.

## Drop Unneeded Features

In [None]:
data=data.drop(['Unnamed: 0', 'state_code','Unnamed: 6','latitude', 'longitude', 'zip_code','object_id','status'],axis=1)

data.describe()

## Check Missing Value

In [None]:
data_missing_value = data.isnull().sum().reset_index()
data_missing_value

Missing values in `closed_at` are left unfilled, on the assumption that those companies are still operating.

# Feature Engineering

Add a new column `last_date`, the end date used to compute each startup's age:

1. If the startup is still operating, count up to the end of 2013 (the latest `closed_at` value is 2013-10-30).
2. If the startup has closed, count up to its closing date (feature `closed_at`).
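The two rules above can be checked on a small made-up table before they are applied to the real data. The sketch below uses three hypothetical startups (the names and dates are illustrative assumptions, not rows from the dataset) and assumes an average year length of 365.2425 days.

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from the dataset
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "founded_at": ["2007-01-01", "2010-06-15", "2005-03-01"],
    "closed_at": [None, "2012-06-15", None],
})
toy["founded_at"] = pd.to_datetime(toy["founded_at"])
toy["closed_at"] = pd.to_datetime(toy["closed_at"])

# Rule 1: still operating -> end of 2013; Rule 2: closed -> closing date
toy["last_date"] = toy["closed_at"].fillna(pd.Timestamp("2013-12-31"))

# Age in whole years (365.2425 days = average Gregorian year, an assumption)
toy["age"] = ((toy["last_date"] - toy["founded_at"]).dt.days / 365.2425).round()
print(toy[["name", "last_date", "age"]])
```

Startup B, which closed in 2012, is aged up to its closing date, while A and C are aged up to 2013-12-31.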

In [None]:
data['closed_at'] = pd.to_datetime(data['closed_at'])
data['founded_at'] = pd.to_datetime(data['founded_at'])

data['last_date']=data['closed_at']

data['last_date']=data['last_date'].fillna('2013-12-31')
data['last_date']=pd.to_datetime(data['last_date'])

In [None]:
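data["age"] = data["last_date"]-data["founded_at"]
# Convert the timedelta to whole years (365.2425 days is the average Gregorian year);
# dividing by np.timedelta64(1,'Y') can raise an error in recent pandas versions
data["age"]=round(data["age"].dt.days/365.2425)
data.head()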

## Check for Erroneous Values in the Age Features

In [None]:
data[['founded_at','closed_at', 'age_first_funding_year',
       'age_last_funding_year', 'age_first_milestone_year',
       'age_last_milestone_year','age','labels']].sort_values('age').head(10)

## Drop Rows with Negative Ages

In [None]:
data=data.drop(data[data.age<0].index)
data=data.drop(data[data.age_first_funding_year<0].index)
data=data.drop(data[data.age_last_funding_year<0].index)
data=data.drop(data[data.age_first_milestone_year<0].index)
data=data.drop(data[data.age_last_milestone_year<0].index)

In [None]:
data[['founded_at','closed_at', 'age_first_funding_year',
       'age_last_funding_year', 'age_first_milestone_year',
       'age_last_milestone_year','age','labels']].info()

Data update: 837 rows remain.

## Fill NaN with 0 in Age at First & Last Milestone

In [None]:
data['age_first_milestone_year']=data['age_first_milestone_year'].fillna(0)
data['age_last_milestone_year']=data['age_last_milestone_year'].fillna(0)

# Data Distribution

In [None]:
features = ['age_first_funding_year', 'age_last_funding_year','age_first_milestone_year', 
               'age_last_milestone_year', 'relationships','funding_rounds', 'funding_total_usd',
               'milestones','avg_participants', 'age']

In [None]:
plt.figure(figsize=(15, 7))
for i in range(0, len(features)):
    # One subplot column per feature
    plt.subplot(1, len(features), i+1)
    sns.boxplot(y=data[features[i]],color='green',orient='v')
plt.tight_layout()

In [None]:
data[features].skew(axis=0, skipna=True)<2

The features "age_first_funding_year", "relationships", and "funding_total_usd" are skewed, so they are log-transformed and scaled in the pre-processing step.
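To illustrate why a log transformation helps with skewed features, the sketch below applies `np.log1p` and min-max scaling to a synthetic right-skewed sample. The log-normal sample and its parameters are assumptions chosen for illustration; they are not drawn from the startup data.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample (log-normal); an assumption standing in for funding amounts
rng = np.random.default_rng(42)
raw = pd.Series(rng.lognormal(mean=14, sigma=1.5, size=1000))

# log1p compresses the long right tail
logged = np.log1p(raw)

# Min-max scaling maps the transformed values to the range [0, 1]
scaled = (logged - logged.min()) / (logged.max() - logged.min())

print("skew before:", round(raw.skew(), 2))
print("skew after :", round(scaled.skew(), 2))
```

Min-max scaling changes only the range of the values, not the shape of the distribution, so the reduction in skew comes from the log step.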

# Data Visualization

## Success Rate by Startup Age

In [None]:
data_grp_1=data[data['labels']==1].groupby(['age']).agg({'labels':'count'}).reset_index()
data_grp_1.columns=['age','total_succes']

data_grp_2=data.groupby(['age']).agg({'labels':'count'}).reset_index()
data_grp_2.columns=['age','total']

data_grp_1=data_grp_1.merge(data_grp_2,
                           on='age')
data_grp_1['succes_rate']=round((data_grp_1['total_succes']/data_grp_1['total'])*100,2)

data_grp_1

In [None]:
fig, ax = plt.subplots(figsize=(15,7))

g = sns.barplot(x = 'age',y='succes_rate',data=data_grp_1,ax=ax, 
               palette=sns.color_palette("Blues_d", n_colors=13, desat=1))

x = np.arange(len(data_grp_1))
y = data_grp_1['succes_rate']

for i, v in enumerate(y):
    ax.text(x[i]- 0.1, v+3, str(v)+'%', fontsize = 12, color='gray', fontweight='bold')
    
title = ''' Success Probability by Age

'''
ax.text(4,85,title,horizontalalignment='left',color='black',fontsize=12,fontweight='bold')
    
    
ax.set_ylim(0,100)

ax.set_xticklabels(ax.get_xticklabels(),rotation=0)
plt.tight_layout();

Startups that are at least 4 years old have a success rate above 50%. The older the company, the greater its chance of success.
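Because `labels` is a 0/1 column, the success rate per group is simply the group mean. The sketch below shows this on hypothetical data (the `toy` frame is an assumption for illustration). Unlike an inner merge against the success-only counts, this form keeps groups that have no successes and reports them as 0%.

```python
import pandas as pd

# Hypothetical data: age in years and a 0/1 success label
toy = pd.DataFrame({
    "age":    [1, 1, 2, 2, 2, 4, 4, 4, 4],
    "labels": [0, 0, 0, 1, 0, 1, 1, 1, 0],
})

# The mean of a 0/1 label equals the share of successes in each group
rate = (toy.groupby("age")["labels"]
           .agg(total="count", total_succes="sum", succes_rate="mean")
           .reset_index())
rate["succes_rate"] = (rate["succes_rate"] * 100).round(2)
print(rate)
```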

## Success Rate by Milestones

In [None]:
data_grp_3=data[data['labels']==1].groupby(['milestones']).agg({'labels':'count'}).reset_index()
data_grp_3.columns=['milestones','total_succes']

data_grp_4=data.groupby(['milestones']).agg({'labels':'count'}).reset_index()
data_grp_4.columns=['milestones','total']

data_grp_3=data_grp_3.merge(data_grp_4,
                           on='milestones')
data_grp_3['succes_rate']=round((data_grp_3['total_succes']/data_grp_3['total'])*100,2)

data_grp_3

In [None]:
fig, ax = plt.subplots(figsize=(15,7))

g = sns.barplot(x = 'milestones',y='succes_rate',data=data_grp_3,ax=ax, 
               palette=sns.color_palette("Blues_d", n_colors=13, desat=1))

x = np.arange(len(data_grp_3))
y = data_grp_3['succes_rate']

for i, v in enumerate(y):
    ax.text(x[i]- 0.1, v+3, str(v)+'%', fontsize = 12, color='gray', fontweight='bold')
    
title = ''' Success Probability by Milestones

'''
ax.text(2,85,title,horizontalalignment='left',color='black',fontsize=12,fontweight='bold')
        
ax.set_ylim(0,100)

ax.set_xticklabels(ax.get_xticklabels(),rotation=0)
plt.tight_layout();

Startups with at least 1 milestone have a success rate above 60%. The more milestones a startup has reached, the greater its chance of success.

# Data Pre-Processing

## Log Transformation & Normalization

In [None]:
features=['funding_total_usd','age_first_funding_year','relationships']

In [None]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler

for var in features :
    data['norm_'+var] =np.log1p(data[var])

features_2=['norm_funding_total_usd','norm_age_first_funding_year', 'norm_relationships']

for var in features_2 :
    data[var] = MinMaxScaler().fit_transform(data[var].values.reshape(len(data), 1))

### Before Pre-Processing

In [None]:
plt.figure(figsize=(15, 7))
for i in range(0, len(features)):
    plt.subplot(1, 12, i+1)
    sns.boxplot(y=data[features[i]],color='green',orient='v')
    plt.tight_layout()

### After Pre-Processing

In [None]:
plt.figure(figsize=(15, 7))
for i in range(0, len(features_2)):
    plt.subplot(1, 12, i+1)
    sns.boxplot(y=data[features_2[i]],color='green',orient='v')
    plt.tight_layout()

## Check Skew after Preprocessing

In [None]:
data[features_2].skew(axis=0, skipna=True)<2

## Split Data (Train & Test) and Oversample the Training Data

In [None]:
# Split Feature Vector and Label
X = data[['age_last_funding_year','age_first_milestone_year', 'age_last_milestone_year',
       'funding_rounds','milestones', 'is_CA', 'is_NY',
       'is_MA', 'is_TX', 'is_otherstate', 'is_software', 'is_web', 'is_mobile',
       'is_enterprise', 'is_advertising', 'is_gamesvideo', 'is_ecommerce',
       'is_biotech', 'is_consulting', 'is_othercategory', 'has_VC',
       'has_angel', 'has_roundA', 'has_roundB', 'has_roundC', 'has_roundD',
       'avg_participants', 'is_top500', 'age', 
       'norm_funding_total_usd', 'norm_age_first_funding_year',
       'norm_relationships']]

y = data['labels'] # target / label

#Splitting the data into Train and Test
from sklearn.model_selection import train_test_split 
X_train, X_test,y_train,y_test = train_test_split(X,
                                                y,
                                                test_size = 0.3,
                                                random_state = 42)
# Oversampling
from imblearn import under_sampling, over_sampling

X_train, y_train = over_sampling.RandomOverSampler(random_state=42).fit_resample(X_train, y_train)

# Modeling with AdaBoost

In [None]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score)

# Fit AdaBoost on the oversampled training data and predict the test labels
ab = AdaBoostClassifier(random_state=42)
ab.fit(X_train, y_train)
y_predicted = ab.predict(X_test)

print('\nconfusion matrix')
print(confusion_matrix(y_test, y_predicted))

print('\naccuracy')
print(accuracy_score(y_test, y_predicted))

print('\nclassification report')  # precision, recall, F1 score, and support per class
print(classification_report(y_test, y_predicted))

# ROC AUC computed from the hard 0/1 predictions, not from predicted probabilities
print('ROC Score:', roc_auc_score(y_test, y_predicted))

print("Train Accuracy:", ab.score(X_train, y_train))
print("Test Accuracy:", ab.score(X_test, y_test))

In [None]:
feat_importances = pd.Series(ab.feature_importances_, index=X.columns)
ax = feat_importances.nlargest(10).plot(kind='barh')
ax.invert_yaxis()
plt.xlabel('score')
plt.ylabel('feature')
plt.title('feature importance score')

# Business Simulation on the Test Data

In [None]:
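# Convert predictions and labels to DataFrames so they can be joined to the test features
y_predicted = pd.DataFrame(y_predicted)
y_test = pd.DataFrame(y_test).reset_index(drop=True)

# Bring back the raw funding amount (aligned on the original index), then reset the index
X_test = X_test.copy()
X_test['funding_total_usd'] = data['funding_total_usd']
X_test = X_test.reset_index(drop=True)

# Attach the predicted and actual labels row by row
X_test['y_predicted'] = y_predicted[0]
X_test['y_test'] = y_test['labels']
X_test.head()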

## Funding Management

### Total Actually Successful Startups in the Test Data

In [None]:
y_test[y_test['labels']==1].count()

### Correctly Predicted Successful Startups (True Positives)

In [None]:
X_test[(X_test['y_test']==1)&(X_test['y_predicted']==1)]['y_predicted'].count()

Note: 11 successful startups were not predicted as successful (false negatives).
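All four outcome counts (true positives, false negatives, false positives, true negatives) can be read from a single cross-tabulation of actual against predicted labels. The sketch below uses hypothetical labels; the `toy` frame and its values are assumptions for illustration, not the notebook's test results.

```python
import pandas as pd

# Hypothetical labels and predictions (1 = success, 0 = failure)
toy = pd.DataFrame({
    "y_test":      [1, 1, 1, 1, 0, 0, 0, 1],
    "y_predicted": [1, 1, 0, 1, 0, 1, 0, 0],
})

# Rows: actual class, columns: predicted class
table = pd.crosstab(toy["y_test"], toy["y_predicted"])

tp = int(table.loc[1, 1])  # successful and predicted successful
fn = int(table.loc[1, 0])  # successful but predicted as failure
fp = int(table.loc[0, 1])  # failed but predicted successful
tn = int(table.loc[0, 0])  # failed and predicted as failure

print(table)
print("TP:", tp, "FN:", fn, "FP:", fp, "TN:", tn)
```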

### Total Funding Required without ML

In [None]:
X_test['funding_total_usd'].sum()

### Total Funding Required with ML

In [None]:
X_test[(X_test['y_predicted']==1)]['funding_total_usd'].sum()

If we use the AdaBoost model and acquire only the startups it predicts as successful, the funding required is USD 3.3 billion **(34% funding efficiency)**.

## Risk Management

### Potential Loss without ML

In [None]:
X_test[(X_test['y_test']==0)]['funding_total_usd'].sum()

Acquiring every startup in the test data, including all the failed ones, would result in a loss of USD 1.8 billion.

### Potential Loss with ML (False Positives)

In [None]:
X_test[(X_test['y_test']==0)&(X_test['y_predicted']==1)]['funding_total_usd'].sum()

Acquiring only the startups that the AdaBoost model predicts as successful limits the loss to the failed startups it misclassified (false positives): USD 317 million **(82% of the potential loss saved)**.
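The saving percentage can be reproduced from the rounded figures quoted above. The notebook does not state the formula, so the sketch below assumes the saving is one minus the ratio of the loss with ML to the loss without ML; `saving_ratio` is a hypothetical helper introduced only for this illustration.

```python
def saving_ratio(baseline_usd: float, with_model_usd: float) -> float:
    """Share of the baseline amount that is avoided when the model is used."""
    return 1 - with_model_usd / baseline_usd

# Rounded figures quoted in the text
loss_without_ml = 1.8e9   # USD 1.8 billion
loss_with_ml = 317e6      # USD 317 million

saving = saving_ratio(loss_without_ml, loss_with_ml)
print(f"{saving:.0%}")
```

With these inputs the ratio is about 0.82, which is consistent with the quoted 82%. The same kind of ratio presumably underlies the funding efficiency figure, but that is an assumption.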