## <font color='#607c8e'>Startup's Success Analysis</font>
<font color='#cb416b'>Data Science Foundation Program BootCamp<br/></font>
Raquel Câmara Porto

<b>Objective: Predicting the profit of a new Startup based on certain features and deciding whether one should invest in a particular startup or not.</b>

### <font color='#3c4142'>Model Building</font>

First, to keep using the variables and dataset from the other notebooks, we need to load them into this one. Jupyter provides the built-in magic command <font color='#9a0eea'><b>%</b></font>run, which executes another notebook or script and makes the variables and functions it defines available here.

In [1]:
# importing variables and libs
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

%run "./2º Exploratory Data Analysis (EDA).ipynb"

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   R&D Spend        50 non-null     float64
 1   Administration   50 non-null     float64
 2   Marketing Spend  50 non-null     float64
 3   State            50 non-null     object 
 4   Profit           50 non-null     float64
dtypes: float64(4), object(1)
memory usage: 2.1+ KB


To use the features and target variables in the machine learning models, we first need to convert them into arrays of values that the models can read. This also means that all categorical data must be transformed into numerical data.

In [2]:
# Turning the variables into an array of the values
features_vals = features.values
target_vals = target.values

In [3]:
# Transforming State column in features variable into a Numerical type with OneHotEncoder()
transformer = ColumnTransformer(transformers=
                                [("features", OneHotEncoder(),[3])],
                                remainder='passthrough')

features_vals = transformer.fit_transform(features_vals.tolist())

In [4]:
# Checking the transformed data
features_vals

array([[0.0, 0.0, 1.0, 165349.2, 136897.8, 471784.1],
       [1.0, 0.0, 0.0, 162597.7, 151377.59, 443898.53],
       [0.0, 1.0, 0.0, 153441.51, 101145.55, 407934.54],
       [0.0, 0.0, 1.0, 144372.41, 118671.85, 383199.62],
       [0.0, 1.0, 0.0, 142107.34, 91391.77, 366168.42],
       [0.0, 0.0, 1.0, 131876.9, 99814.71, 362861.36],
       [1.0, 0.0, 0.0, 134615.46, 147198.87, 127716.82],
       [0.0, 1.0, 0.0, 130298.13, 145530.06, 323876.68],
       [0.0, 0.0, 1.0, 120542.52, 148718.95, 311613.29],
       [1.0, 0.0, 0.0, 123334.88, 108679.17, 304981.62],
       [0.0, 1.0, 0.0, 101913.08, 110594.11, 229160.95],
       [1.0, 0.0, 0.0, 100671.96, 91790.61, 249744.55],
       [0.0, 1.0, 0.0, 93863.75, 127320.38, 249839.44],
       [1.0, 0.0, 0.0, 91992.39, 135495.07, 252664.93],
       [0.0, 1.0, 0.0, 119943.24, 156547.42, 256512.92],
       [0.0, 0.0, 1.0, 114523.61, 122616.84, 261776.23],
       [1.0, 0.0, 0.0, 78013.11, 121597.55, 264346.06],
       [0.0, 0.0, 1.0, 94657.16, 145077.58
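
The column layout of the transformed array may be easier to read on a tiny example. The sketch below uses made-up rows and hypothetical state labels `A`, `B` and `C` (not the notebook's data) and the hypothetical names `toy_rows` and `toy_transformer`. It suggests that the one-hot columns come first, one per category in sorted order, followed by the passthrough columns:

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Made-up rows for illustration only: [spend, state]
toy_rows = [[100.0, "B"], [200.0, "A"], [300.0, "C"], [400.0, "A"]]

toy_transformer = ColumnTransformer(
    transformers=[("state", OneHotEncoder(), [1])],  # encode column 1 only
    remainder="passthrough",                          # keep the other columns
)
toy_encoded = toy_transformer.fit_transform(toy_rows).astype(float)

# Categories learned by the encoder, in the order of the output columns
toy_categories = list(toy_transformer.named_transformers_["state"].categories_[0])

print(toy_categories)
print(toy_encoded)
```

Each row has exactly one `1.0` among the leading columns, marking its category, and the numeric column is carried over unchanged at the end.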

However, the transformed features array still has the <b>object</b> dtype.<br/>
To meet the requirements for model creation, it will be converted into a regular float <b>array</b>.

In [5]:
# Making the features dtype as float values
features_vals = features_vals.astype(float)
features_vals

array([[0.0000000e+00, 0.0000000e+00, 1.0000000e+00, 1.6534920e+05,
        1.3689780e+05, 4.7178410e+05],
       [1.0000000e+00, 0.0000000e+00, 0.0000000e+00, 1.6259770e+05,
        1.5137759e+05, 4.4389853e+05],
       [0.0000000e+00, 1.0000000e+00, 0.0000000e+00, 1.5344151e+05,
        1.0114555e+05, 4.0793454e+05],
       [0.0000000e+00, 0.0000000e+00, 1.0000000e+00, 1.4437241e+05,
        1.1867185e+05, 3.8319962e+05],
       [0.0000000e+00, 1.0000000e+00, 0.0000000e+00, 1.4210734e+05,
        9.1391770e+04, 3.6616842e+05],
       [0.0000000e+00, 0.0000000e+00, 1.0000000e+00, 1.3187690e+05,
        9.9814710e+04, 3.6286136e+05],
       [1.0000000e+00, 0.0000000e+00, 0.0000000e+00, 1.3461546e+05,
        1.4719887e+05, 1.2771682e+05],
       [0.0000000e+00, 1.0000000e+00, 0.0000000e+00, 1.3029813e+05,
        1.4553006e+05, 3.2387668e+05],
       [0.0000000e+00, 0.0000000e+00, 1.0000000e+00, 1.2054252e+05,
        1.4871895e+05, 3.1161329e+05],
       [1.0000000e+00, 0.0000000e+00,

With the data ready, we split the values of both the features and target variables into <b>training</b> and <b>testing</b> sets to feed our models.

In [6]:
# Separating the values into training and testing data
X_train, X_test, Y_train, Y_test = train_test_split(features_vals,
                                                    target_vals,
                                                    test_size=0.2,
                                                    random_state=1)

In [7]:
# Checking the samples shapes
print(X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)

(40, 6) (10, 6) (40,) (10,)
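
As a side note on the split parameters, the sketch below uses arbitrary stand-in arrays (`demo_X` and `demo_y` are hypothetical names, not the notebook's variables) to illustrate two points: `test_size=0.2` holds out 20% of the rows, and a fixed `random_state` should give the same split on every run:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with the same number of rows as the notebook's data (50);
# the values themselves are arbitrary
demo_X = np.arange(100).reshape(50, 2)
demo_y = np.arange(50)

split_a = train_test_split(demo_X, demo_y, test_size=0.2, random_state=1)
split_b = train_test_split(demo_X, demo_y, test_size=0.2, random_state=1)

# test_size=0.2 holds out 20% of the 50 rows, i.e. 10 rows
print(split_a[0].shape, split_a[1].shape)

# The same random_state yields the same shuffled split on every call
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```

Fixing `random_state` is what makes the scores reported in this notebook repeatable for the split itself.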


### <font color='#3c4142'>Decision Tree Regression Model</font>

In [8]:
# Creating the first model using Decision Tree Regression
DTR = DecisionTreeRegressor(min_samples_leaf=3)
DTR.fit(X_train, Y_train)

DecisionTreeRegressor(min_samples_leaf=3)

In [9]:
# Checking the scores of both training and testing samples
print(DTR.score(X_train, Y_train))
print(DTR.score(X_test, Y_test))

0.9610596256645947
0.9810304512387338


Using a min_samples_leaf of 3, we achieved a satisfactory result with a <b>generalized model</b>.<br/>

As the name implies, a generalized model is one that generalizes well from its training data to unseen data and does not suffer from <b>overfitting</b>.<br/>

Overfitting occurs when a model goes <b>too deep</b> in learning the training data, to the point where it starts making predictions based on <b>noise</b> (outliers and randomness). When that happens, the algorithm performs extraordinarily well on its training data but very poorly on any new, unseen data. This is because the model becomes too flexible in learning from the data it is given and, as a consequence, its predictions have higher variance.

A well-regularized model should perform well on unseen data, not only on its training data.
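
To make the idea concrete, here is a minimal sketch on synthetic data (a noisy linear trend, not the startup dataset). The names `deep_tree` and `small_tree` and the choice of `min_samples_leaf=10` are assumptions made for illustration. An unconstrained tree can memorize its training samples, so its training score is expected to be essentially perfect while its testing score drops:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (not the startup dataset): a linear trend plus random noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(0, 5, size=200)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# No constraints: the tree can keep splitting until every leaf is pure
deep_tree = DecisionTreeRegressor(random_state=0).fit(Xd_train, yd_train)
# min_samples_leaf forces each prediction to average several samples
small_tree = DecisionTreeRegressor(min_samples_leaf=10, random_state=0).fit(
    Xd_train, yd_train
)

deep_scores = (deep_tree.score(Xd_train, yd_train), deep_tree.score(Xd_test, yd_test))
small_scores = (small_tree.score(Xd_train, yd_train), small_tree.score(Xd_test, yd_test))
print("unconstrained (train, test):", deep_scores)
print("min_samples_leaf=10 (train, test):", small_scores)
```

The gap between the training and testing scores is the symptom to look for. The constrained tree gives up some training accuracy, which usually narrows that gap.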

### <font color='#3c4142'>Random Forest Regression Model</font>

In [10]:
# Creating the second model using Random Forest Regression
RF = RandomForestRegressor(n_estimators=2, max_depth=3, random_state=8)
RF.fit(X_train, Y_train.ravel())

RandomForestRegressor(max_depth=3, n_estimators=2, random_state=8)

In [11]:
# Checking the scores of both training and testing samples
print(RF.score(X_train, Y_train))
print(RF.score(X_test, Y_test))

0.9673556712742566
0.9225856875164918


In [12]:
# Looking for better results
for i in range(4,10):
    RF=RandomForestRegressor(n_estimators=i, max_depth=3, random_state=8)
    RF.fit(X_train, Y_train.ravel())
    train_score = RF.score(X_train, Y_train)
    test_score = RF.score(X_test, Y_test)

    print("estimator: %d" %i)
    print("- Training Score: %.5f" %train_score)
    print("- Testing Score:  %.5f" %test_score)

estimator: 4
- Training Score: 0.96883
- Testing Score:  0.98413
estimator: 5
- Training Score: 0.97149
- Testing Score:  0.97399
estimator: 6
- Training Score: 0.97392
- Testing Score:  0.96964
estimator: 7
- Training Score: 0.97574
- Testing Score:  0.96183
estimator: 8
- Training Score: 0.97441
- Testing Score:  0.96492
estimator: 9
- Training Score: 0.97468
- Testing Score:  0.96784


We can conclude that with 4 estimators, a max depth of 3 and a random state of 8, a better generalized model could be derived for this particular dataset.<br/>
With this configuration, the testing score was higher than the training score. It means that the model fits the data very well, and the independent variables (X) can explain most of the variation in the dependent variable (Y).<br/>

The score method returns the coefficient of determination <b>R²</b> of the prediction, which indicates the proportion of the variance in the dependent variable that the independent variables can explain. R-squared measures the strength of the <b>relationship</b> between the model and the dependent variable, typically on a 0 to 100% scale, although it can be negative when a model predicts worse than simply using the mean of the target.<br/>
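
For reference, R² can be written as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below is a plain-Python illustration with a hypothetical helper `r_squared` and made-up values. It is not scikit-learn's implementation, only the same definition applied by hand:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Squared errors of the predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared deviations of the true values from their mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


actual = [1.0, 2.0, 3.0]
print(r_squared(actual, [1.0, 2.0, 3.0]))  # perfect predictions -> 1.0
print(r_squared(actual, [2.0, 2.0, 2.0]))  # always predicting the mean -> 0.0
print(r_squared(actual, [3.0, 2.0, 1.0]))  # worse than the mean -> -3.0
```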

With the models in hand, we can now graphically compare the original data with the predictions from both models and reach a conclusion about the startups' profits.