<p style="font-size: 100px; text-align: center; color: rgb(212, 69, 0)"> <strong>S</strong>tartup <strong>A</strong>nalysis </p>

<p style="font-size: 60px; text-align: center; color: rgb(62, 61, 60)"> <strong>P</strong>art <strong>T</strong>wo</p>

<p style="text-align: center"><img src="https://i.vimeocdn.com/portrait/4910448_300x300" style="height: 50%; text-align: center"/></p>

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from itertools import chain
# Load functionality to manipulate dataframes
from pyspark import keyword_only
from pyspark.sql import functions as fn
from pyspark.sql.functions import stddev, mean, col
# Functionality for computing features and fitting models
from pyspark.ml import feature, regression, classification, clustering, evaluation, Pipeline, Transformer
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import Tokenizer, VectorAssembler, HashingTF, Word2Vec, StringIndexer, OneHotEncoder
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.ml.param.shared import HasInputCol, HasOutputCol, Param

# LOAD AND IMPORT

In [6]:
%sh wget https://www.dropbox.com/s/zj133oas3xa1mma/master.csv -nv

In [7]:
# load master dataset
dfmaster = sqlContext.read.format("csv").load("file:///databricks/driver/master.csv", delimiter = ",", header = True)

# Preparations and Understanding before Modeling

In [9]:
# create a 0/1 column for acquisitions
dfmaster = dfmaster.\
  withColumn("labelacq", fn.when(col("status") == "acquired","1").otherwise("0"))

In [10]:
# number of rows in master table
print(dfmaster.count())

### Handling NAs and the market column (with too many levels)

In [12]:
# check for missing values 
dfmaster.toPandas().isnull().sum()

In [13]:
# drop the market column: it has too many levels, and category_final gives a better breakdown
dfmaster1 = dfmaster.drop("market")

In [14]:
# drop rows with missing values
dfmaster1drop = dfmaster1.na.drop()

We decided to drop every row with a NULL value because we still had enough observations and did not want to fill in the same value for every observation with a missing entry. In the one test we ran, the AUC was worse anyway. For further analysis, however, it would be worth trying different methods to fill in the missing values.
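As an illustration of what "different methods" could look like, here is a minimal pandas sketch that fills numeric gaps with the column median and categorical gaps with the most frequent value. The toy frame, its values and the helper `impute_simple` are assumptions made for this example; they are not part of the notebook's pipeline.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the master table (values are made up)
toy = pd.DataFrame({
    "age": [3.0, np.nan, 7.0, 5.0],
    "country_code": ["USA", "GBR", None, "USA"],
})

def impute_simple(df):
    """Fill numeric gaps with the column median, categorical gaps with the mode."""
    out = df.copy()
    for name in out.columns:
        if pd.api.types.is_numeric_dtype(out[name]):
            out[name] = out[name].fillna(out[name].median())
        else:
            out[name] = out[name].fillna(out[name].mode().iloc[0])
    return out

imputed = impute_simple(toy)
print(imputed)
```

Whether such an imputation helps would still have to be checked against the validation AUC, as with the row-dropping approach.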

### String indexer, one hot encoder and casting to numerics

In [17]:
# create an index for each categorical variable
# use a pipeline to apply the indexers
list1 = ["country_code","city","quarter_new","investor_country_code","funding_round_type","category_final"]
indexers = [StringIndexer(inputCol=column, outputCol=column+"_index").fit(dfmaster1drop) for column in list1]
pipelineindex = Pipeline(stages=indexers).fit(dfmaster1drop)
dfmasternew = pipelineindex.transform(dfmaster1drop)

In [18]:
# convert string to double for numerical variables
dfmasternew = dfmasternew.\
  withColumn("numeric_funding_rounds", dfmasternew["funding_rounds"].cast("double")).\
  withColumn("numeric_age", dfmasternew["age"].cast("double")).\
  withColumn("numeric_count_investor", dfmasternew["count_investor"].cast("double")).\
  withColumn("numeric_time_to_first_funding", dfmasternew["time_to_first_funding"].cast("double")).\
  withColumn("numeric_total_raised_usd", dfmasternew["total_raised_usd"].cast("double")).\
  withColumn("label", dfmasternew["labelacq"].cast("double"))

In [19]:
# save
dfone = dfmasternew

In [20]:
#display(dfone)

In [21]:
# list of index columns of categorical variables for the onehotencoder
list2 = dfone.columns[16:22]
list2
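The slice `dfone.columns[16:22]` depends on the exact column order of the master table. A minimal sketch of an order-independent alternative, assuming the index columns keep the `_index` suffix used by the `StringIndexer` stages above (the name `index_cols` is hypothetical):

```python
# Categorical columns as listed in list1 above
list1 = ["country_code", "city", "quarter_new",
         "investor_country_code", "funding_round_type", "category_final"]

# Derive the indexed column names from the suffix convention instead of
# relying on their position in dfone.columns
index_cols = [c + "_index" for c in list1]
print(index_cols)
```

If the column order of the input file ever changes, a list built this way still points at the same six columns.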

In [22]:
# create sparse one-hot vectors from the indexed categorical columns
# use a pipeline to apply the encoders
onehotencoder_stages = [OneHotEncoder(inputCol=c, outputCol='onehotencoded_' + c) for c in list2]
pipelineonehot = Pipeline(stages=onehotencoder_stages)
pipeline_mode = pipelineonehot.fit(dfone)
df_coded = pipeline_mode.transform(dfone)

In [23]:
display(df_coded)

### Data split, defining vector assemblers & standard scaler and creating labellist

In [25]:
# split dataset into training, validation and testing datasets
training_df, validation_df, testing_df = df_coded.randomSplit([0.6, 0.3, 0.1])

VECTOR ASSEMBLER

In [27]:
training_df.columns[22:27]

In [28]:
training_df.columns[28:37]

We chose to only standardize the numeric values, because of interpretability. 

In addition, when we standardized all columns, the explained variance was unusually low, as the first figure below shows. The testing performance, however, was the same or slightly better, and the Logistic Regression weights were easier to interpret; you can take a look at this in the additional analysis section at the end of the notebook.

**Explained Variance with all columns standardized**

<img src="https://i.imgur.com/kpjILwW.png"/>

**Explained Variance with only numerical columns standardized**

<img src="https://i.imgur.com/2xKpi46.png"/>

In [30]:
# define vector assembler for the numeric features (standardized in a later stage)
vanum = VectorAssembler(). \
      setInputCols(training_df.columns[22:27]). \
      setOutputCol('features_nonstd')

In [31]:
# define vector assembler for the one-hot encoded categorical features
vacate = VectorAssembler(). \
      setInputCols(training_df.columns[28:37]). \
      setOutputCol('featurescate')

In [32]:
va = VectorAssembler(). \
      setInputCols(['featuresnum','featurescate']). \
      setOutputCol('features')

In [33]:
std = feature.StandardScaler(withMean=True, withStd=True).setInputCol('features_nonstd').setOutputCol('featuresnum')
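The idea of standardizing only the numeric block while leaving the dummy columns untouched can be illustrated outside Spark. The following NumPy sketch uses made-up data and a hypothetical helper `standardize_columns`; it mirrors the intent of the `vanum` and `std` stages and is not the Spark implementation.

```python
import numpy as np

def standardize_columns(X, numeric_idx):
    """Return a copy of X where only the columns in numeric_idx are z-scored."""
    X = np.asarray(X, dtype=float).copy()
    cols = X[:, numeric_idx]
    mean = cols.mean(axis=0)
    std = cols.std(axis=0, ddof=1)  # sample standard deviation
    std[std == 0] = 1.0             # avoid division by zero for constant columns
    X[:, numeric_idx] = (cols - mean) / std
    return X

# Two numeric columns followed by two dummy columns (made-up values)
X = np.array([[1.0, 10.0, 1, 0],
              [2.0, 20.0, 0, 1],
              [3.0, 30.0, 0, 0]])
X_std = standardize_columns(X, [0, 1])
print(X_std)
```

The dummy columns keep their 0/1 coding, which is what makes the corresponding weights easy to read as "effect of belonging to this category".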

In [34]:
#display(va.transform(training_df))

In [35]:
# add a suffix to the investor country codes because they overlap with the country_code labels of the companies
invcc = ['{}_{}'.format(a, "investor") for a in indexers[3].labels]

In [36]:
# define labellist by using the indexer stages for displaying the weights & loadings
labellist = training_df.columns[22:27] + indexers[0].labels + indexers[1].labels + indexers[2].labels + invcc + indexers[4].labels + indexers[5].labels

In [37]:
training_df.columns[28:37]

In [38]:
# null dummy for onehotencoded_country_code_index
print("null dummy for onehotencoded_country_code_index")
print(len(indexers[0].labels))
print(indexers[0].labels[79])
# null dummy for onehotencoded_city_index
print("null dummy for onehotencoded_city_index")
print(indexers[1].labels[1761])
print(len(indexers[1].labels))
# null dummy for onehotencoded_quarter_new_index
print("null dummy for onehotencoded_quarter_new_index")
print(len(indexers[2].labels))
print(indexers[2].labels[3])
# null dummy for onehotencoded_investor_country_code_index
print("null dummy for onehotencoded_investor_country_code_index")
print(len(invcc))
print(invcc[67])
# null dummy for onehotencoded_funding_round_type_index
print("null dummy for onehotencoded_funding_round_type_index")
print(len(indexers[4].labels))
print(indexers[4].labels[12])
# null dummy for onehotencoded_category_final_index
print("null dummy for onehotencoded_category_final_index")
print(len(indexers[5].labels))
print(indexers[5].labels[210])

In [39]:
# delete the null dummies (the last label of each indexer) from labellist
labellist.remove("SRB")
labellist.remove("Aldermaston")
labellist.remove("Q4")
labellist.remove("SVK_investor")
labellist.remove("secondary_market")
labellist.remove("Video")
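`OneHotEncoder` drops the last category of each indexed column by default, which is why one label per categorical variable has to be removed. Removing labels by name depends on the current data and can hit the wrong entry if the same string appears in two label lists. A minimal sketch of a position-based alternative, with a hypothetical helper `labels_without_reference` and made-up label groups:

```python
def labels_without_reference(label_groups):
    """Concatenate label lists, dropping the last label of each group."""
    names = []
    for labels in label_groups:
        names.extend(labels[:-1])
    return names

# Made-up label lists standing in for indexers[i].labels
groups = [["USA", "GBR", "SRB"], ["Q1", "Q2", "Q3", "Q4"]]
names = labels_without_reference(groups)
print(names)
```

In the notebook, the groups would be the `labels` of the six fitted indexers (with the `_investor` suffix applied to the investor country codes), preceded by the numeric column names.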

# Modeling

## RANDOM FOREST

In [42]:
# define binary classification evaluator
bce = BinaryClassificationEvaluator()

We chose to run Random Forest models with 20, 15 and 25 trees.

In [44]:
# define default, 15 trees and 25 trees random forest classifier
rf = RandomForestClassifier(maxBins=10000, featuresCol='features', labelCol='label')
rf15 = RandomForestClassifier(numTrees=15, maxBins=10000, featuresCol='features', labelCol='label')
rf25 = RandomForestClassifier(numTrees=25, maxBins=10000, featuresCol='features', labelCol='label')

In [45]:
# define and fit pipelines with vector assembler and random forest classifier 
rf_pipeline = Pipeline(stages=[vanum, std, vacate, va, rf]).fit(training_df)
rf_pipeline_15 = Pipeline(stages=[vanum, std, vacate, va, rf15]).fit(training_df)
rf_pipeline_25 = Pipeline(stages=[vanum, std, vacate, va, rf25]).fit(training_df)

In [46]:
# apply pipelines to validation dataset
dfrf = rf_pipeline.transform(validation_df)
dfrf_15 = rf_pipeline_15.transform(validation_df)
dfrf_25 = rf_pipeline_25.transform(validation_df)

### Performance

In [48]:
# print the areas under the curve for the different random forest pipelines
print("Random Forest with 20 trees: AUC = {}".format(bce.evaluate(dfrf)))
print("Random Forest 15 trees: AUC = {}".format(bce.evaluate(dfrf_15)))
print("Random Forest 25 trees: AUC = {}".format(bce.evaluate(dfrf_25)))

The AUC is nearly identical for all three models. We always report the AUC alongside the accuracy because our target variable is not well balanced; the AUC summarizes the trade-off between the true positive rate and the false positive rate. The accuracy on the validation dataset looks very good. It is odd, however, that the accuracies are exactly the same even though the raw predictions and probabilities differ. The AUC is harder to judge because there is no general rule for which value is good, but 0.71 seems okay.

In [50]:
# print the accuracies for the different random forest pipelines
# (show() prints the table itself and returns None, so it is not wrapped in print)
dfrf.select(fn.expr('float(label = prediction)').alias('correct')).select(fn.avg('correct').alias("Accuracy for Random Forest with 20 trees")).show()
dfrf_15.select(fn.expr('float(label = prediction)').alias('correct')).select(fn.avg('correct').alias("Accuracy for Random Forest with 15 trees")).show()
dfrf_25.select(fn.expr('float(label = prediction)').alias('correct')).select(fn.avg('correct').alias("Accuracy for Random Forest with 25 trees")).show()

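The accuracy is exactly the same for all three models, which probably means that their predictions do not differ in any significant way. A likely explanation, given the imbalanced target, is that every model assigns almost all companies to the majority class, so the accuracy simply equals the share of that class. The feature importances (see below), however, do differ between the models.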
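A small sketch with made-up scores shows how two models can share the same accuracy and still differ in AUC when the positive class is rare. The functions `accuracy` and `roc_auc` are hypothetical helpers written for this illustration, not Spark functions; `roc_auc` uses the rank interpretation of the AUC (the probability that a random positive gets a higher score than a random negative).

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Share of rows where the predicted class equals the label."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def roc_auc(y_true, scores):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return float(wins / (len(pos) * len(neg)))

# 10 made-up companies, 2 of them acquired
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
scores_a = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.2, 0.3, 0.45])
scores_b = np.array([0.3, 0.1, 0.4, 0.2, 0.1, 0.3, 0.2, 0.4, 0.1, 0.2])

# Both models stay below the 0.5 threshold, so both predict "not acquired" everywhere
pred_a = (scores_a >= 0.5).astype(int)
pred_b = (scores_b >= 0.5).astype(int)

print(accuracy(y, pred_a), accuracy(y, pred_b))
print(roc_auc(y, scores_a), roc_auc(y, scores_b))
```

Both toy models reach the same accuracy because neither ever predicts an acquisition, while the AUC separates them because it looks at the ranking of the scores rather than at the thresholded predictions.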

### Feature Importances

In [53]:
# create spark df with the 20 most important features and their importances, sorted by importance
rfw = spark.createDataFrame(pd.DataFrame(list(zip(labellist, rf_pipeline.stages[4].featureImportances.toArray())),
            columns = ['column', 'importancy']).sort_values('importancy').tail(20))
display(rfw)

In [54]:
# create spark df with the 20 most important features of the 15-tree model, sorted by importance
rf15w = spark.createDataFrame(pd.DataFrame(list(zip(labellist, rf_pipeline_15.stages[4].featureImportances.toArray())),
            columns = ['column', 'importancy']).sort_values('importancy').tail(20))
display(rf15w)

In [55]:
# create spark df with the 20 most important features of the 25-tree model, sorted by importance
rf25w = spark.createDataFrame(pd.DataFrame(list(zip(labellist, rf_pipeline_25.stages[4].featureImportances.toArray())),
            columns = ['column', 'importancy']).sort_values('importancy').tail(20))
display(rf25w)

The dataframe above shows the factors that contribute most to the prediction of an acquisition. However, feature importances do not tell us whether the impact is negative or positive, so we used Logistic Regression to determine the direction of each feature.

<img src="https://i.imgur.com/RH5DaS3.png"/>

*these were the important features of the best-performing RF on the training dataset at that time*

The calculation for this figure is in the Logistic Regression part.

The table shows that a company that is established, is located in the USA, has the majority of its investors from the USA, has venture funding and has a decent investor count has a good chance of getting acquired. The storage market (category) also seems to be hot.

## LOGISTIC REGRESSION

In [58]:
# define logistic regression parameters and pipeline stages 
lambda_par1 = 0.01
alpha_par1 = 0.05
en_lr1 = LogisticRegression().\
        setLabelCol('label').\
        setFeaturesCol('features').\
        setRegParam(lambda_par1).\
        setElasticNetParam(alpha_par1)

# change the parameters of the second classifier below
lambda_par2 = 0.05
alpha_par2 = 0.05
en_lr2 = LogisticRegression().\
        setLabelCol('label').\
        setFeaturesCol('features').\
        setRegParam(lambda_par2).\
        setElasticNetParam(alpha_par2)

# change the parameters of the third classifier below
lambda_par3 = 0.1
alpha_par3 = 0.05
en_lr3 = LogisticRegression().\
        setLabelCol('label').\
        setFeaturesCol('features').\
        setRegParam(lambda_par3).\
        setElasticNetParam(alpha_par3)
        
en_lr_estimator1 = Pipeline(
    stages=[vanum, std, vacate, va, en_lr1])
en_lr_estimator2 = Pipeline(
    stages=[vanum, std, vacate, va, en_lr2])
en_lr_estimator3 = Pipeline(
    stages=[vanum, std, vacate, va, en_lr3])
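For reference, this is how `regParam` (written as $\lambda$) and `elasticNetParam` (written as $\alpha$) enter the objective that the regularized logistic regression minimizes, as we read the Spark documentation. Here $L$ stands for the logistic loss, $w$ for the weight vector and $n$ for the number of training rows; the symbols are our own notation for this sketch.

```latex
\min_{w}\; \frac{1}{n}\sum_{i=1}^{n} L(w;\, x_i, y_i)
\;+\; \lambda\left(\alpha\,\lVert w\rVert_1 \;+\; \frac{1-\alpha}{2}\,\lVert w\rVert_2^2\right)
```

With $\alpha = 0.05$ in all three models, the penalty is dominated by the ridge (L2) term. The small lasso (L1) share is what can push individual weights to exactly 0, and a larger $\lambda$ strengthens both parts.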

*A loop to possibly search for the optimal lambda and alpha*

In [60]:
#lstauc = []
#lsacc = []
#lsi = []
#lsa = []
#for i in range(20):
#  lambda_par3 = 0.01 + i*0.01
#  for a in range(20):
#    alpha_par3 = 0.01 + a*0.01  # use the inner loop variable a, not i
#    en_lr3 = LogisticRegression().\
#    setLabelCol('label').\
#    setFeaturesCol('features').\
#    setRegParam(lambda_par3).\
#    setElasticNetParam(alpha_par3)
#    # va expects 'featuresnum' and 'featurescate', so all preceding stages are needed
#    en_lr_estimator3 = Pipeline(stages=[vanum, std, vacate, va, en_lr3])
#    en_lr_model3 = en_lr_estimator3.fit(training_df)
#    dfmodel3 = en_lr_model3.transform(validation_df)
#    auc = bce.evaluate(dfmodel3)
#    lstauc.append(auc)
#    acc = dfmodel3.select(fn.expr('float(label = prediction)').alias('correct')).select(fn.avg('correct').alias("acc")).toPandas().acc[0]
#    lsacc.append(acc)
#    #lsi.append(i)
#    #lsa.append(a)
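The structure of such a search can be sketched without Spark. In the example below, `grid_search` is a hypothetical helper and `toy_score` is a made-up stand-in for "fit the pipeline with these parameters and return the validation AUC"; only the looping and bookkeeping are meant to carry over.

```python
from itertools import product

def grid_search(lambdas, alphas, score_fn):
    """Evaluate score_fn on every (lambda, alpha) pair and return the best one."""
    results = []
    for lam, alpha in product(lambdas, alphas):
        results.append((lam, alpha, score_fn(lam, alpha)))
    best = max(results, key=lambda r: r[2])
    return best, results

# Hypothetical stand-in for fitting the pipeline and evaluating the validation AUC
def toy_score(lam, alpha):
    return 1.0 - (lam - 0.05) ** 2 - (alpha - 0.1) ** 2

lambdas = [round(0.01 * (i + 1), 2) for i in range(20)]
alphas = [round(0.01 * (a + 1), 2) for a in range(20)]
best, results = grid_search(lambdas, alphas, toy_score)
print(best)
```

Keeping each parameter pair together with its score in one tuple avoids the separate index lists and makes it harder to mix up the two loop variables.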

In [61]:
# fit logistic regression pipelines
en_lr_model1 = en_lr_estimator1.fit(training_df)
en_lr_model2 = en_lr_estimator2.fit(training_df)
en_lr_model3 = en_lr_estimator3.fit(training_df)

In [62]:
# apply pipeline to validation dataset
dfmodel1 = en_lr_model1.transform(validation_df)
dfmodel2 = en_lr_model2.transform(validation_df)
dfmodel3 = en_lr_model3.transform(validation_df)

### Performance

In [64]:
# print areas under the curve of the different logistic regressions
print("Logistic Regression Model 1: AUC = {}".format(bce.evaluate(dfmodel1)))
print("Logistic Regression Model 2: AUC = {}".format(bce.evaluate(dfmodel2)))
print("Logistic Regression Model 3: AUC = {}".format(bce.evaluate(dfmodel3)))

The AUC is higher than for Random Forest, while the accuracy is about the same or slightly lower.

Because the AUC is higher, Logistic Regression seems to perform slightly better than Random Forest.

In [66]:
# print accuracies of the different logistic regressions
# (show() prints the table itself and returns None, so it is not wrapped in print)
dfmodel1.select(fn.expr('float(label = prediction)').alias('correct')).select(fn.avg('correct').alias("Accuracy for Model 1")).show()
dfmodel2.select(fn.expr('float(label = prediction)').alias('correct')).select(fn.avg('correct').alias("Accuracy for Model 2")).show()
dfmodel3.select(fn.expr('float(label = prediction)').alias('correct')).select(fn.avg('correct').alias("Accuracy for Model 3")).show()

### Weights

The weights are of limited use: some categorical features have so many levels that their weights end up close to each other, and a large share of the features (38%) has a weight of exactly 0. We therefore use the weights only as a reference for the direction of the features that Random Forest ranks as important, as already described.

In [69]:

display(spark.createDataFrame(pd.DataFrame(list(zip(labellist, en_lr_model1.stages[4].coefficients.toArray())),
            columns = ['label', 'weight']).sort_values('weight')).agg((fn.count(fn.when(col("weight") == 0, True)) / fn.count(col("label"))).alias("Ratio")))

In [70]:
# create pandas df with the 10 lowest weights of LR model 1 and the corresponding labels
pd.DataFrame(list(zip(labellist, en_lr_model1.stages[4].coefficients.toArray())),
            columns = ['label', 'weight']).sort_values('weight').head(10)

In [71]:
# create pandas df with the 10 highest weights of LR model 1 and the corresponding labels
pd.DataFrame(list(zip(labellist, en_lr_model1.stages[4].coefficients.toArray())),
            columns = ['label', 'weight']).sort_values('weight').tail(10)

In [72]:
# create pandas df with the 10 lowest weights of LR model 2 and the corresponding labels
pd.DataFrame(list(zip(labellist, en_lr_model2.stages[4].coefficients.toArray())),
            columns = ['label', 'weight']).sort_values('weight').head(10)

In [73]:
# create pandas df with the 10 highest weights of LR model 2 and the corresponding labels
pd.DataFrame(list(zip(labellist, en_lr_model2.stages[4].coefficients.toArray())),
            columns = ['label', 'weight']).sort_values('weight').tail(10)

In [74]:
# create pandas df with the 10 lowest weights of LR model 3 and the corresponding labels
pd.DataFrame(list(zip(labellist, en_lr_model3.stages[4].coefficients.toArray())),
            columns = ['label', 'weight']).sort_values('weight').head(10)

In [75]:
# create pandas df with the 10 highest weights of LR model 3 and the corresponding labels
pd.DataFrame(list(zip(labellist, en_lr_model3.stages[4].coefficients.toArray())),
            columns = ['label', 'weight']).sort_values('weight').tail(10)

### Weights of the important features of the RF with 15 trees as reference

In [77]:
# create spark df with all coefficients of LR model 2 (best performance in terms of AUC and accuracy combined)
dflrcoeff = spark.createDataFrame(pd.DataFrame(list(zip(labellist, en_lr_model2.stages[4].coefficients.toArray())),
            columns = ['label', 'weight']).sort_values('weight'))

In [78]:
# list with the 8 most important features based on the RF model with 15 trees (best performance in terms of AUC and accuracy combined)
rf15imp = ["Storage","Messaging","numeric_count_investor","USA","venture","USA_investor","numeric_age","numeric_total_raised_usd"]

In [79]:
# display the weights of the important features from the RF model with 15 trees to see whether they are negative or positive
display(dflrcoeff.where(col("label").isin(rf15imp)).orderBy("weight"))

## NEURAL NETWORKS

### Standard Model

In [82]:
# define neural networks (MultilayerPerceptron) classifier
mlp = classification.MultilayerPerceptronClassifier(seed=0).\
    setStepSize(0.2).\
    setMaxIter(200).\
    setLayers([2137, 2]).\
    setFeaturesCol('features')

In [83]:
# define and fit neural network pipeline 
mlp_simple_model = Pipeline(stages=[vanum, std, vacate, va, mlp]).fit(training_df)

In [84]:
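# define evaluators for accuracy and area under the curve
evaluator = evaluation.MulticlassClassificationEvaluator(metricName="accuracy")
# MulticlassClassificationEvaluator() without metricName reports F1, not AUC, so use
# BinaryClassificationEvaluator (assumes the MLP output has a rawPrediction column, Spark 2.3+)
evaluatorauc = BinaryClassificationEvaluator(metricName="areaUnderROC")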

In [85]:
# apply pipeline to validation dataset
dfnn = mlp_simple_model.transform(validation_df)

In [86]:
# print accuracy and area under the curve for NN
print(evaluator.evaluate(dfnn))
print(evaluatorauc.evaluate(dfnn))

### Models with Hidden Layers

In [88]:
# define neural networks (MultilayerPerceptron) classifier with hidden layers
mlp2 = classification.MultilayerPerceptronClassifier(seed=0).\
    setStepSize(0.2).\
    setMaxIter(200).\
    setFeaturesCol('features').\
    setLayers([2137,30,30, 2])
mlp3 = classification.MultilayerPerceptronClassifier(seed=0).\
    setStepSize(0.2).\
    setMaxIter(200).\
    setFeaturesCol('features').\
    setLayers([2137,10,10, 2])
mlp4 = classification.MultilayerPerceptronClassifier(seed=0).\
    setStepSize(0.2).\
    setMaxIter(200).\
    setFeaturesCol('features').\
    setLayers([2137,20,20, 2])
mlp5 = classification.MultilayerPerceptronClassifier(seed=0).\
    setStepSize(0.2).\
    setMaxIter(200).\
    setFeaturesCol('features').\
    setLayers([2137,30, 2])
mlp6 = classification.MultilayerPerceptronClassifier(seed=0).\
    setStepSize(0.2).\
    setMaxIter(200).\
    setFeaturesCol('features').\
    setLayers([2137,30,30,30, 2])

In [89]:
# define and fit pipeline
mlp2_model = Pipeline(stages=[vanum, std, vacate, va, mlp2]).fit(training_df)
mlp3_model = Pipeline(stages=[vanum, std, vacate, va, mlp3]).fit(training_df)
mlp4_model = Pipeline(stages=[vanum, std, vacate, va, mlp4]).fit(training_df)
mlp5_model = Pipeline(stages=[vanum, std, vacate, va, mlp5]).fit(training_df)
mlp6_model = Pipeline(stages=[vanum, std, vacate, va, mlp6]).fit(training_df)

In [90]:
# apply the fitted models to the validation dataset
dfnn2 = mlp2_model.transform(validation_df)
dfnn3 = mlp3_model.transform(validation_df)
dfnn4 = mlp4_model.transform(validation_df)
dfnn5 = mlp5_model.transform(validation_df)
dfnn6 = mlp6_model.transform(validation_df)

### Performance

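The accuracy of the Neural Network models is a bit worse than that of Random Forest and Logistic Regression (on the validation dataset), while the AUC is considerably higher. This might indicate that Neural Networks perform best despite the lower accuracy. The comparison only holds if the second metric really is the area under the ROC curve: `MulticlassClassificationEvaluator` without a `metricName` reports F1, which is not comparable to the AUC from `BinaryClassificationEvaluator`.

The NN model with 2 hidden layers and 30 neurons each seems to perform best.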

In [93]:
# print accuracy and area under the curve
print("NN Model with 2 hidden layers and 30 neurons each: Accuracy = {}".format(evaluator.evaluate(dfnn2)))
print("NN Model with 2 hidden layers and 30 neurons each: AUC = {}".format(evaluatorauc.evaluate(dfnn2)))
print("____________________________________________________________________________")
print("NN Model with 2 hidden layers and 10 neurons each: Accuracy = {}".format(evaluator.evaluate(dfnn3)))
print("NN Model with 2 hidden layers and 10 neurons each: AUC = {}".format(evaluatorauc.evaluate(dfnn3)))
print("____________________________________________________________________________")
print("NN Model with 2 hidden layers and 20 neurons each: Accuracy = {}".format(evaluator.evaluate(dfnn4)))
print("NN Model with 2 hidden layers and 20 neurons each: AUC = {}".format(evaluatorauc.evaluate(dfnn4)))
print("____________________________________________________________________________")
print("NN Model with 1 hidden layer and 30 neurons: Accuracy = {}".format(evaluator.evaluate(dfnn5)))
print("NN Model with 1 hidden layer and 30 neurons: AUC = {}".format(evaluatorauc.evaluate(dfnn5)))
print("____________________________________________________________________________")
print("NN Model with 3 hidden layers and 30 neurons each: Accuracy = {}".format(evaluator.evaluate(dfnn6)))
print("NN Model with 3 hidden layers and 30 neurons each: AUC = {}".format(evaluatorauc.evaluate(dfnn6)))

## PCA

### Decision of the right Number of PCs

We also performed a dimensionality reduction because of the high number of features, which is mainly caused by the many levels of the categorical features. We chose 200 PCs as a baseline and then looked at the cumulative explained variance. We found that roughly 85 PCs explain 90% of the variance.

In [97]:
# define and fit a PCA model with k = 200 as reference
pcavar = feature.PCA(k=200, inputCol='features', outputCol='pca_feat')
pca_var = Pipeline(
      stages=[vanum, std, vacate, va, pcavar])
pca_var_model = pca_var.fit(df_coded)
varlist = pca_var_model.stages[4].explainedVariance
npvar = np.cumsum(varlist)
pci = list(range(1, 201))
dfvar = spark.createDataFrame(pd.DataFrame({"Number of PCs": pci, "Cumulative Variance Explained": npvar}))
display(dfvar)

**So we reach an explained variance of 90% with roughly 85 PCs**
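Instead of reading the number of PCs off the plot, it can be computed from the explained-variance vector. The sketch below uses made-up variance shares and a hypothetical helper `components_for_variance`; in the notebook the input would be the `explainedVariance` values collected in `varlist`.

```python
import numpy as np

def components_for_variance(explained_variance, threshold=0.9):
    """Smallest number of leading components whose cumulative share reaches threshold."""
    cumulative = np.cumsum(np.asarray(explained_variance, dtype=float))
    # index of the first cumulative value that is >= threshold, converted to a count
    k = int(np.searchsorted(cumulative, threshold)) + 1
    if k > len(cumulative):
        raise ValueError("threshold not reached with the available components")
    return k

# Made-up variance shares of five components
toy_variance = [0.5, 0.25, 0.125, 0.0625, 0.03125]
print(components_for_variance(toy_variance, 0.9))
```

The function raises an error when the fitted components never reach the threshold, which would signal that more than 200 PCs have to be fitted first.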

### Modeling

In [100]:
# define pca feature function with k = 85
pca = feature.PCA(k=85, inputCol='features', outputCol='pca_feat')

In [101]:
# define the parameters and pipelines for the logistic regressions with PCA features
lambda_par1 = 0.01
alpha_par1 = 0.05
en_lr1_pca = LogisticRegression().\
        setLabelCol('label').\
        setFeaturesCol('pca_feat').\
        setRegParam(lambda_par1).\
        setElasticNetParam(alpha_par1)

# change the parameters of the second classifier below
lambda_par2 = 0.05
alpha_par2 = 0.05
en_lr2_pca = LogisticRegression().\
        setLabelCol('label').\
        setFeaturesCol('pca_feat').\
        setRegParam(lambda_par2).\
        setElasticNetParam(alpha_par2)

# change the parameters of the third classifier below
lambda_par3 = 0.1
alpha_par3 = 0.05
en_lr3_pca = LogisticRegression().\
        setLabelCol('label').\
        setFeaturesCol('pca_feat').\
        setRegParam(lambda_par3).\
        setElasticNetParam(alpha_par3)
        
# use the *_pca classifiers so that the final stage reads 'pca_feat' instead of 'features'
en_lr_estimator1_pca = Pipeline(
    stages=[vanum, std, vacate, va, pca, en_lr1_pca])
en_lr_estimator2_pca = Pipeline(
    stages=[vanum, std, vacate, va, pca, en_lr2_pca])
en_lr_estimator3_pca = Pipeline(
    stages=[vanum, std, vacate, va, pca, en_lr3_pca])

In [102]:
# fit the PCA logistic regression pipelines
en_lr_model1_pca = en_lr_estimator1_pca.fit(training_df)
en_lr_model2_pca = en_lr_estimator2_pca.fit(training_df)
en_lr_model3_pca = en_lr_estimator3_pca.fit(training_df)

In [103]:
# apply the pipelines to the validation dataset
dfmodel1_pca = en_lr_model1_pca.transform(validation_df)
dfmodel2_pca = en_lr_model2_pca.transform(validation_df)
dfmodel3_pca = en_lr_model3_pca.transform(validation_df)

In [104]:
#display(dfmodel2_pca)

### Performance

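For some reason, the AUC and accuracies were the same as for the plain Logistic Regression, and so were the raw predictions and probabilities. At first sight this suggests that the dimensionality reduction has basically no impact at all.

So the Logistic Regression with PCA led to the same conclusion as the normal Logistic Regression.

*Exactly identical outputs point to a mistake rather than to a real effect: they are what one would expect if the final pipeline stage reads `features` instead of `pca_feat`, so that the classifier never sees the PCA output. The PCA pipelines have to end in the classifiers defined with `setFeaturesCol('pca_feat')`.*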

In [107]:
display(dfmodel2_pca)

In [108]:
display(dfmodel2)

In [109]:
# print the areas under the curve and accuracies of the different PCA logistic regression models
print("Logistic Regression Model 1: AUC = {}".format(bce.evaluate(dfmodel1_pca)))
print("Logistic Regression Model 2: AUC = {}".format(bce.evaluate(dfmodel2_pca)))
print("Logistic Regression Model 3: AUC = {}".format(bce.evaluate(dfmodel3_pca)))
# show() prints the table itself and returns None, so it is not wrapped in print
dfmodel1_pca.select(fn.expr('float(label = prediction)').alias('correct')).select(fn.avg('correct').alias("Accuracy for Model 1")).show()
dfmodel2_pca.select(fn.expr('float(label = prediction)').alias('correct')).select(fn.avg('correct').alias("Accuracy for Model 2")).show()
dfmodel3_pca.select(fn.expr('float(label = prediction)').alias('correct')).select(fn.avg('correct').alias("Accuracy for Model 3")).show()

### Exploring Loadings

In [111]:
# assign loadings of the first 3 PCs 
pc1 = en_lr_model2_pca.stages[4].pc.toArray()[:, 0].tolist()
pc2 = en_lr_model2_pca.stages[4].pc.toArray()[:, 1].tolist()
pc3 = en_lr_model2_pca.stages[4].pc.toArray()[:, 2].tolist()

In [112]:
# create pandas df with the loadings/PCs and the corresponding labels
pc_loadings = pd.DataFrame([labellist, pc1, pc2, pc3]).T.rename(columns={0: 'label', 
                                                                          1: 'load_pc1',
                                                                          2: 'load_pc2',
                                                                          3: 'load_pc3',})

In [113]:
# create spark df with the 5 highest and lowest loadings of the first PC
load1 = spark.createDataFrame(pd.concat((pc_loadings.sort_values('load_pc1').head(), 
           pc_loadings.sort_values('load_pc1').tail())))
display(load1)

Here we can see the top and bottom 5 features based on the loadings that describe the first PC.

In [115]:
# create spark df with the 5 highest and lowest loadings of the second PC
load2 = spark.createDataFrame(pd.concat((pc_loadings.sort_values('load_pc2').head(), 
           pc_loadings.sort_values('load_pc2').tail())))
display(load2)

Here we can see the top and bottom 5 features based on the loadings that describe the second PC.

In [117]:
# create spark df with the 5 highest and lowest loadings of the third PC
load3 = spark.createDataFrame(pd.concat((pc_loadings.sort_values('load_pc3').head(), 
           pc_loadings.sort_values('load_pc3').tail())))
display(load3)

Here we can see the top and bottom 5 features based on the loadings that describe the third PC.

### Plotting the first two PCs (with our own method)

In [120]:
# create spark df with the loadings of the first two PCs
dfpc_loadings = (spark.createDataFrame(pc_loadings))

In [121]:
dfmodel2_pca_pd = dfmodel2_pca.toPandas()

In [122]:
# loadings of PC1 as numpy array
nppc1 = np.asarray(pc1, dtype=np.float32)
# loadings of PC2 as numpy array
nppc2 = np.asarray(pc2, dtype=np.float32)

pc1list = []
pc2list = []
labellist2 = []

# calculate PC1 and PC2 for the first 500 rows
for i in range(500):    
  feat = dfmodel2_pca_pd.features[i].toArray()
  multi1 = np.multiply(nppc1,feat)
  multi2 = np.multiply(nppc2,feat)
  pc1list.append(sum(multi1))
  pc2list.append(sum(multi2))
  labellist2.append(dfmodel2_pca_pd.labelacq[i])
# the pc lists represent basically the pca_feat column after the PCA transformation
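The row-by-row loop above is a projection of each feature vector onto the loading vectors, which can also be written as a single matrix product. The sketch below uses random made-up data and a hypothetical helper `project` to show that both formulations give the same scores; it assumes the features are available as a dense NumPy array.

```python
import numpy as np

def project(features, loadings, n_components=2):
    """Project the rows of features onto the first n_components loading columns."""
    features = np.asarray(features, dtype=float)
    loadings = np.asarray(loadings, dtype=float)
    return features @ loadings[:, :n_components]

# Made-up data: 4 observations, 3 features, 3 loading vectors
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 3))
loadings = rng.normal(size=(3, 3))

scores = project(features, loadings)

# Same computation written as the element-wise loop from the cell above
loop_pc1 = [sum(np.multiply(loadings[:, 0], row)) for row in features]
print(np.allclose(scores[:, 0], loop_pc1))
```

For the full dataset a dense matrix of this size may not fit in driver memory, which is one reason to stay with the first 500 rows or with the `pca_feat` column computed by Spark.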

In [123]:
dfpcas = spark.createDataFrame(pd.DataFrame({"Acquisition": labellist2, "PC1": pc1list, "PC2": pc2list}))

Below, we plot PC1 and PC2 of the first 500 observations. The color indicates whether the company was acquired.

(*for this training dataset*) The first PC is described by age & time to first funding (+) and number of investors & funding rounds (-). The second PC is described strongly by seed (+) and funding rounds & number of investors (-), and to a lesser degree by age & total raised in USD.

The figure has a cone-like shape with more acquisitions towards the middle of the cone. The majority of the observations gather between 0.5 and 1 (PC2) and between -2 and 0 (PC1). This might indicate that most of the companies received seed funding (PC2), are young, had a short time to first funding, and have a good number of investors and/or funding rounds.

The companies that open up the cone on the right side seem to be older and had a longer time to first funding, and probably no seed funding. The companies on the left side of the cone might be younger than the average company, had a fast first funding and seem to have more funding rounds & investors than the average company.

It is hard to see a pattern for the acquisitions, so more analyses (or the existing analyses as support) would be required to comment on the position of the acquisitions.

In [125]:
display(dfpcas)

### Loadings vs. weights

In [127]:
# filter the coefficients of LR model 2 (dflrcoeff) down to the features with notable loadings
intloadings = ["seed","Q1","Q2","Q3","GBR_investor","GBR","numeric_time_to_first_funding","numeric_age","numeric_total_raised_usd","numeric_count_investor","numeric_funding_rounds","venture","USA_investor"]

dflrcoeff_filtered = dflrcoeff.where(col("label").isin(intloadings)).orderBy(col("weight").desc())
display(dflrcoeff_filtered)

We also tried to compare the loadings with the weights of the features in order to get any kind of insight. 

Maybe it would be possible to bring the loadings and weights of the features on the same scale and then create some sort of likelihood areas in different PC plots. And then its maybe even possible to combine these plots in order to make a prediction. 

*But that would be a project on its own*

In [129]:
# Display loadings and corresponding weights from the Logistic Regression
load1drop = load1.drop("load_pc2","load_pc3").orderBy(col("load_pc1").desc())
load1drop_join = load1drop.join(dflrcoeff_filtered, load1drop["label"] == dflrcoeff_filtered["label"], 'leftouter').drop(dflrcoeff_filtered["label"])
display(load1drop_join.orderBy(col("load_pc1").desc()))

In [130]:
# Display loadings and corresponding weights from the Logistic Regression
load2drop = load2.drop("load_pc1","load_pc3").orderBy(col("load_pc2").desc())
load2drop_join = load2drop.join(dflrcoeff_filtered, load2drop["label"] == dflrcoeff_filtered["label"], 'leftouter').drop(dflrcoeff_filtered["label"])
display(load2drop_join.orderBy(col("load_pc2").desc()))

# TESTING PERFORMANCE

We tested the best performing (by AUC and accuracy) random forest, logistic regression, and neural network models on the test set.

In [133]:
# Best performing random forest model
dfrf_15_test = rf_pipeline_15.transform(testing_df)
print("Random Forest 15 trees: AUC = {}".format(bce.evaluate(dfrf_15_test)))
# show() prints the table itself and returns None, so it is not wrapped in print()
dfrf_15_test.select(fn.expr('float(label = prediction)').alias('correct')).select(fn.avg('correct').alias("Accuracy for Random Forest with 15 trees")).show()
# Best performing logistic regression model
dfmodel2_pca_test = en_lr_model2_pca.transform(testing_df)
print("Logistic Regression Model 2: AUC = {}".format(bce.evaluate(dfmodel2_pca_test)))
# show() prints the table itself and returns None, so it is not wrapped in print()
dfmodel2_pca_test.select(fn.expr('float(label = prediction)').alias('correct')).select(fn.avg('correct').alias("Accuracy for Model 2")).show()
# Best performing neural network model
dfnntest = mlp2_model.transform(testing_df)
print("NN Model with 3 hidden layers and 30 neurons each: Accuracy = {}".format(evaluator.evaluate(dfnntest)))
print("NN Model with 3 hidden layers and 30 neurons each: AUC = {}".format(evaluatorauc.evaluate(dfnntest)))

The AUC and accuracy increased for every model compared with the validation performance. Random forest seems to have the best accuracy but the worst AUC, which should raise a red flag and requires further analysis and explanation. The neural network performs best overall, with a much higher AUC value than the other models.

So it would be worthwhile to try to increase those numbers further by tuning the number of hidden layers and neurons.
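
The red flag mentioned above, best accuracy together with worst AUC, can be illustrated on made-up numbers. The sketch below is not taken from the notebook: the labels, the scores, and the helper names `accuracy` and `auc` are assumptions for illustration. Accuracy only looks at the thresholded prediction, whereas AUC measures how well the scores rank positives above negatives, so on imbalanced data a model that always predicts the majority class can look accurate while ranking poorly.

```python
import numpy as np

def accuracy(labels, scores, threshold=0.5):
    """Share of rows where the thresholded score equals the label."""
    predictions = (np.asarray(scores) >= threshold).astype(int)
    return float(np.mean(predictions == np.asarray(labels)))

def auc(labels, scores):
    """Probability that a random positive is scored above a random negative (ties count half)."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

# Hypothetical, imbalanced toy data: 8 non-acquired companies, 2 acquired ones
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# Model A never crosses the 0.5 threshold and ranks the positives poorly
scores_a = [0.10, 0.20, 0.30, 0.40, 0.35, 0.25, 0.15, 0.45, 0.05, 0.12]
# Model B ranks both positives on top but also flags three negatives
scores_b = [0.10, 0.20, 0.30, 0.60, 0.65, 0.25, 0.15, 0.55, 0.90, 0.80]

acc_a, auc_a = accuracy(labels, scores_a), auc(labels, scores_a)
acc_b, auc_b = accuracy(labels, scores_b), auc(labels, scores_b)
print("Model A: accuracy = {:.2f}, AUC = {:.2f}".format(acc_a, auc_a))
print("Model B: accuracy = {:.2f}, AUC = {:.2f}".format(acc_b, auc_b))
```

In this toy setting Model A wins on accuracy and loses on AUC, which is the same kind of disagreement seen for the random forest. Whether class imbalance is the actual cause here would still have to be checked on the real predictions.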

# Additional Analysis with all columns standardized

As mentioned at the beginning of the notebook, we noticed that logistic regression gives us "bad" weights for the features. When we standardize all of the features (all dummies included, not only the numerical ones), the logistic regression model yields "better" weights, which also make sense when combined with the random forest conclusions.
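
To make concrete what standardizing a 0/1 dummy does, here is a minimal sketch in plain NumPy. The two dummy columns and the helper `standardize` are hypothetical. The sketch assumes centering to mean 0 and scaling by the sample standard deviation, which is what `StandardScaler(withMean=True, withStd=True)` is set up to do below.

```python
import numpy as np

def standardize(column):
    """Center to mean 0 and scale to unit sample standard deviation (ddof=1)."""
    column = np.asarray(column, dtype=float)
    return (column - column.mean()) / column.std(ddof=1)

n_rows = 100
# Hypothetical dummies: a rare city (2 of 100 rows) and a common country flag (50 of 100 rows)
rare_dummy = np.array([1] * 2 + [0] * (n_rows - 2))
common_dummy = np.array([1] * 50 + [0] * (n_rows - 50))

rare_std = standardize(rare_dummy)
common_std = standardize(common_dummy)

for name, values in [("rare", rare_std), ("common", common_std)]:
    print("{:>6}: std = {:.3f}, value for a 1 = {:.2f}, value for a 0 = {:.2f}".format(
        name, values.std(ddof=1), values.max(), values.min()))
```

After scaling, both columns have unit variance, so a regularized model penalizes their weights on the same footing. This may be one reason the weights look more plausible, though the notebook does not verify it.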

In [137]:
training_df.columns[22:27]

In [138]:
training_df.columns[28:37]

In [139]:
# define vector assembler with the features for the modelling
va9 = VectorAssembler(). \
      setInputCols(training_df.columns[22:27] + training_df.columns[28:37]). \
      setOutputCol('features_nonstd')
	  	  
std9 = feature.StandardScaler(withMean=True, withStd=True).setInputCol('features_nonstd').setOutputCol('features')

## Random Forest 25 trees

In [141]:
# define binary classification evaluator
bce = BinaryClassificationEvaluator()

In [142]:
rf9 = RandomForestClassifier(numTrees=25, maxBins=10000, featuresCol='features', labelCol='label')
rf_pipeline_9 = Pipeline(stages=[va9, std9, rf9]).fit(training_df)
dfrf_9 = rf_pipeline_9.transform(validation_df)
print("Random Forest 25 trees: AUC = {}".format(bce.evaluate(dfrf_9)))
# show() prints the table itself and returns None, so it is not wrapped in print()
dfrf_9.select(fn.expr('float(label = prediction)').alias('correct')).select(fn.avg('correct').alias("Accuracy for Random Forest with 25 trees")).show()
rfw9 = spark.createDataFrame(pd.DataFrame(list(zip(labellist, rf_pipeline_9.stages[2].featureImportances.toArray())),
            columns = ['column', 'importancy']).sort_values('importancy').tail(20))

The AUC and accuracy for the random forest are similar to those in the main part.

In [144]:
display(rfw9)

The importances also paint a similar picture.

## Logistic Regression

In [147]:
lambda_par9 = 0.05
alpha_par9 = 0.05
en_lr9 = LogisticRegression().\
        setLabelCol('label').\
        setFeaturesCol('features').\
        setRegParam(lambda_par9).\
        setElasticNetParam(alpha_par9)
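
For reference, and assuming Spark's documented elastic-net parameterization, `regParam` plays the role of $\lambda$ and `elasticNetParam` the role of $\alpha$ in the penalty that is added to the logistic loss for the weight vector $w$:

$$
\lambda \left( \alpha \lVert w \rVert_1 + \frac{1 - \alpha}{2} \lVert w \rVert_2^2 \right)
$$

With $\lambda = 0.05$ and $\alpha = 0.05$ as set above, this becomes $0.0025\,\lVert w \rVert_1 + 0.02375\,\lVert w \rVert_2^2$, so the penalty is dominated by the ridge (L2) term.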

In [148]:
en_lr_estimator9 = Pipeline(stages=[va9, std9, en_lr9])
en_lr_model9 = en_lr_estimator9.fit(training_df)
dfmodel9 = en_lr_model9.transform(validation_df)
print("Logistic Regression Model: AUC = {}".format(bce.evaluate(dfmodel9)))
# show() prints the table itself and returns None, so it is not wrapped in print()
dfmodel9.select(fn.expr('float(label = prediction)').alias('correct')).select(fn.avg('correct').alias("Accuracy for Model")).show()

For logistic regression, too, the picture is similar.

Here the first big difference occurs: the weights with completely standardized features are much "better", because it seems that the standardized categorical features (esp. the cities) get washed out and at the same time are actually fitted better.

In [151]:
pd.DataFrame(list(zip(labellist, en_lr_model9.stages[2].coefficients.toArray())),
            columns = ['label', 'weight']).sort_values('weight').tail(40)

The results tell us that it has a large positive impact when the company is decently established, is located in the USA, has the majority of its investors from the USA, is funded mostly by venture capital, and is perhaps in the advertising market (category).

The negative weights show that the time to first funding should not be too long, meaning a company should attract investors fairly quickly. Too many rounds also seem to have a negative impact. Seed funding seems to be negative as well. This seems to make sense, because banks and venture capital investors regard these kinds of companies as risky investments, meaning the company has to establish itself first (higher age).

<blockquote>
What is 'Seed Capital'
Seed capital is the initial capital used when starting a business, often coming from the founders' personal assets, friends or family, for covering initial operating expenses and attracting venture capitalists. This type of funding is often obtained in exchange for an equity stake in the enterprise, although with less formal contractual overhead than standard equity financing. Because banks and venture capital investors view seed capital as an "at risk" investment by the promoters of a new venture, capital providers may wait until a business is more established before making larger investments of venture capital funding.
  
Source: https://www.investopedia.com/terms/s/seedcapital.asp
</blockquote>

This conclusion seems to be consistent with the results (importances) of the random forest models (main and additional part).

In [153]:
pd.DataFrame(list(zip(labellist, en_lr_model9.stages[2].coefficients.toArray())),
            columns = ['label', 'weight']).sort_values('weight').head(10)
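
One way to read weights like the ones listed above: in a logistic regression, a weight `w` on a standardized feature means that a one-standard-deviation increase of that feature multiplies the odds of the positive class (here: acquisition) by `exp(w)`, all else held fixed. The sketch below uses made-up weights. The labels echo column names from the notebook, but the numbers are hypothetical and not model output.

```python
import numpy as np
import pandas as pd

# Hypothetical weights for illustration only; the real values come from
# en_lr_model9.stages[2].coefficients in the cells above
weights = pd.DataFrame({
    "label": ["numeric_age", "venture", "seed", "numeric_time_to_first_funding"],
    "weight": [0.30, 0.10, -0.15, -0.25],
})

# exp(weight) is the multiplicative change in the odds of acquisition
# for a one-standard-deviation increase of a standardized feature
weights["odds_ratio"] = np.exp(weights["weight"])
weights["pct_change_in_odds"] = (weights["odds_ratio"] - 1.0) * 100.0

print(weights.sort_values("weight", ascending=False).to_string(index=False))
```

Positive weights give odds ratios above 1 and negative weights give odds ratios below 1, which makes the direction and rough size of each effect easier to compare than the raw weights.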

## Neural Networks

In [155]:
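# define evaluators for accuracy and "AUC"
evaluator = evaluation.MulticlassClassificationEvaluator(metricName="accuracy")
# NOTE: MulticlassClassificationEvaluator defaults to metricName="f1",
# so the value reported as "AUC" below is an F1 score, not an area under the ROC curve
evaluatorauc = evaluation.MulticlassClassificationEvaluator()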

mlp9 = classification.MultilayerPerceptronClassifier(seed=0).\
    setStepSize(0.2).\
    setMaxIter(200).\
    setFeaturesCol('features').\
    setLayers([2137,30,30, 2])
	
mlp9_model = Pipeline(stages=[va9, std9, mlp9]).fit(training_df)
dfnn9 = mlp9_model.transform(validation_df)
print("NN Model with 2 hidden layers and 30 neurons each: Accuracy = {}".format(evaluator.evaluate(dfnn9)))
print("NN Model with 2 hidden layers and 30 neurons each: AUC = {}".format(evaluatorauc.evaluate(dfnn9)))

The performance is slightly worse than in the main part, but this could also be caused by a non-optimal number of hidden layers and neurons.
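
To get a feeling for the size of the network being tuned, the following sketch counts its trainable parameters. It assumes a fully connected layout with one bias per neuron; the helper `count_mlp_parameters` is hypothetical, and the layer sizes are the ones passed to `setLayers` above.

```python
def count_mlp_parameters(layers):
    """Count weights and biases of a fully connected network with the given layer sizes."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layers[:-1], layers[1:]))

layers = [2137, 30, 30, 2]  # layer sizes used for mlp9 above
# parameters between each pair of consecutive layers (weights plus biases)
per_layer = [(n_in + 1) * n_out for n_in, n_out in zip(layers[:-1], layers[1:])]
total = count_mlp_parameters(layers)

print("Parameters per layer:", per_layer)
print("Total parameters:", total)
```

Almost all parameters sit between the input layer and the first hidden layer, so the width of the first hidden layer is the setting that changes the model size the most.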

## PCA + Logistic Regression

As mentioned at the beginning, 85 PCs do not cover enough variance when all columns are standardized, so we need to find the appropriate number.

In [159]:
pca91 = feature.PCA(k=1500, inputCol='features', outputCol='pca_feat')
en_lr_estimator91_pca = Pipeline(
    stages=[va9, std9, pca91, en_lr2])
en_lr_model91_pca = en_lr_estimator91_pca.fit(training_df)
varlist9 = en_lr_model91_pca.stages[2].explainedVariance
# cumulative variance of the PCs from index 1000 up to 1499 only
npvar9 = np.cumsum(varlist9[1000:1500])
# add 0.7 as starting point because we start at 1000
# (the value for the first 1000 PCs was found in another analysis)
npvar9 = [x + 0.7 for x in npvar9]
pci9 = list(range(1000, 1500))
dfvar9 = spark.createDataFrame(pd.DataFrame({"Number of PCs": pci9, "Cumulative Variance Explained": npvar9}))
display(dfvar9)

So we reach 90% explained variance at around 1332 PCs, which reduces the dimensionality to roughly 62% of the original features.
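
The search for the number of PCs can also be done without a hand-set offset, by taking the cumulative sum over all components and looking up the first one that reaches the target. The sketch below uses a synthetic variance vector. The values and the name `explained_variance` are assumptions; in the notebook the vector would come from the fitted PCA stage's `explainedVariance`.

```python
import numpy as np

# Synthetic stand-in for the PCA explained-variance ratios (sorted, decreasing)
raw = 1.0 / np.arange(1, 201)
explained_variance = raw / raw.sum()  # ratios sum to 1 when all components are kept

# Cumulative variance explained by the first k components (k = 1, 2, ...)
cumulative = np.cumsum(explained_variance)

target = 0.90
# searchsorted returns the first index whose cumulative value is >= target;
# add 1 to turn the zero-based index into a number of components
n_components = int(np.searchsorted(cumulative, target) + 1)

print("Components needed for {:.0%} variance: {}".format(target, n_components))
print("Share of original features: {:.0%}".format(n_components / len(explained_variance)))
```

This only works if the PCA was fitted with enough components for the cumulative sum to reach the target, which is why a large `k` is needed in the exploratory fit.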

In [161]:
pca9 = feature.PCA(k=1332, inputCol='features', outputCol='pca_feat')
en_lr9_pca = LogisticRegression().\
        setLabelCol('label').\
        setFeaturesCol('pca_feat').\
        setRegParam(lambda_par9).\
        setElasticNetParam(alpha_par9)
# use the logistic regression defined above for the PCA features
en_lr_estimator9_pca = Pipeline(
    stages=[va9, std9, pca9, en_lr9_pca])
en_lr_model9_pca = en_lr_estimator9_pca.fit(training_df)
dfmodel9_pca = en_lr_model9_pca.transform(validation_df)

In [162]:
print("Logistic Regression Model: AUC = {}".format(bce.evaluate(dfmodel9_pca)))
# show() prints the table itself and returns None, so it is not wrapped in print()
dfmodel9_pca.select(fn.expr('float(label = prediction)').alias('correct')).select(fn.avg('correct').alias("Accuracy for Model 1")).show()

This performance also does not change much with all columns standardized and the appropriate number of PCs.

In [164]:
pc1 = en_lr_model9_pca.stages[2].pc.toArray()[:, 0].tolist()
pc2 = en_lr_model9_pca.stages[2].pc.toArray()[:, 1].tolist()
# create pandas df with the loadings/PCs and the corresponding labels
pc_loadings = pd.DataFrame([labellist, pc1, pc2]).T.rename(columns={0: 'label', 
                                                                          1: 'load_pc1',
                                                                          2: 'load_pc2',})

In [165]:
# create spark df with the 5 highest and lowest loadings of the first PC
load1 = spark.createDataFrame(pd.concat((pc_loadings.sort_values('load_pc1').head(), 
           pc_loadings.sort_values('load_pc1').tail())))
display(load1)

In [166]:
# create spark df with the 5 highest and lowest loadings of the second PC
load2 = spark.createDataFrame(pd.concat((pc_loadings.sort_values('load_pc2').head(), 
           pc_loadings.sort_values('load_pc2').tail())))
display(load2)	

## Testing Performance

In [168]:
# Random forest with 25 trees
dfrf_9_test = rf_pipeline_9.transform(testing_df)
print("Random Forest 25 trees: AUC = {}".format(bce.evaluate(dfrf_9_test)))
# show() prints the table itself and returns None, so it is not wrapped in print()
dfrf_9_test.select(fn.expr('float(label = prediction)').alias('correct')).select(fn.avg('correct').alias("Accuracy for Random Forest with 25 trees")).show()
# Logistic regression on the PCA features
dfmodel9_pca_test = en_lr_model9_pca.transform(testing_df)
print("Logistic Regression Model: AUC = {}".format(bce.evaluate(dfmodel9_pca_test)))
dfmodel9_pca_test.select(fn.expr('float(label = prediction)').alias('correct')).select(fn.avg('correct').alias("Accuracy for Model 2")).show()
# Neural network (mlp9 has 2 hidden layers with 30 neurons each)
dfnntest9 = mlp9_model.transform(testing_df)
print("NN Model with 2 hidden layers and 30 neurons each: Accuracy = {}".format(evaluator.evaluate(dfnntest9)))
print("NN Model with 2 hidden layers and 30 neurons each: AUC = {}".format(evaluatorauc.evaluate(dfnntest9)))

With all columns standardized, we come to the same conclusion.

The AUC and accuracy increased for every model compared with the validation performance. Random forest seems to have the best accuracy but the worst AUC, which should raise a red flag and requires further analysis and explanation. The neural network performs best overall, with a much higher AUC value than the other models. Note, however, that the neural network's "AUC" comes from `evaluatorauc`, a `MulticlassClassificationEvaluator` with its default metric (F1), so it is not directly comparable to the AUC values that `bce` reports for the other two models.

So here, too, it would be worthwhile to try to increase those numbers further by tuning the number of hidden layers and neurons.

However, the models seem to perform slightly worse than when only the numerical features are standardized, although the difference is small.