## Model.ipynb

**AUTHOR:** Shiyan Boxer

**DATE:** Dec 28th, 2020

**DESCRIPTION:** Create the model by doing the following:
- Choose relevant columns
- Get dummy data
- Split training and testing data
- Random forest
- Tune model with GridSearchCV
- Test ensembles

**DEPENDENCIES:**
- Python 3.8.6
- pandas
- numpy
- pickle
- sklearn train_test_split
- sklearn RandomForestRegressor
- sklearn GridSearchCV
- sklearn mean_absolute_error

#### Import dependencies

In [1]:
import pandas as pd
import numpy as np
import os

#### Read file into DataFrame

In [2]:
df = pd.read_excel('C://Users//shiya//Documents//Dev//Startup-Success-Predictor//Clean//after.xlsx')
df.head()

Unnamed: 0,permalink,name,homepage_url,category_list,market,funding_total_usd (divide by 1000),status,country_code,state_code,region,...,product_crowdfunding,round_A,round_B,round_C,round_D,round_E,round_F,round_G,round_H,success
0,/organization/advanced-northern-graphite-leade...,Advanced Northern Graphite Leaders,http://www.anglinc.ca,|Clean Technology|,Clean Technology,0,operating,CAN,AB,Sherwood Park,...,0,0,0,0,0,0,0,0,0,1
1,/organization/celebration-creation,Celebration Creation,http://www.celebrationcreation.ca,|Real Estate|,Real Estate,0,operating,CAN,AB,Calgary,...,0,0,0,0,0,0,0,0,0,1
2,/organization/justparts,JustParts,http://www.JustParts.com,|Auto|Marketplaces|E-Commerce|,Marketplaces,0,operating,CAN,AB,Thunder Bay,...,0,0,0,0,0,0,0,0,0,1
3,/organization/knighthaven,KnightHaven,http://www.knighthaven.com/,|Entertainment|Games|,Games,0,operating,CAN,AB,AB - Other,...,0,0,0,0,0,0,0,0,0,1
4,/organization/kotch-international-transportati...,Kotch International Transportation Design Spec...,http://www.kotchexotictours.com,|Transportation|,Transportation,0,operating,CAN,AB,AB - Other,...,0,0,0,0,0,0,0,0,0,1


#### Choose relevant columns (success, funding_total_usd, founded_year) and assign them to df_model

In [41]:
df_model = df[['success','funding_total_usd (divide by 1000)','founded_year']]
print(df_model)

       success  funding_total_usd (divide by 1000)  founded_year
0            1                                   0          2012
1            1                                   0          2011
2            1                                   0          2006
3            1                                   0          2011
4            1                                   0          2014
...        ...                                 ...           ...
22217        1                             9950002          2008
22218        1                             9952199          2012
22219        1                             9957650          1997
22220        1                             9990000          2011
22221        1                             9999999          2000

[22222 rows x 3 columns]


#### Get dummy data

In [42]:
df_dummie = pd.get_dummies(df_model)
print(df_dummie)

       success  funding_total_usd (divide by 1000)  founded_year
0            1                                   0          2012
1            1                                   0          2011
2            1                                   0          2006
3            1                                   0          2011
4            1                                   0          2014
...        ...                                 ...           ...
22217        1                             9950002          2008
22218        1                             9952199          2012
22219        1                             9957650          1997
22220        1                             9990000          2011
22221        1                             9999999          2000

[22222 rows x 3 columns]
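
All three selected columns are numeric, so `pd.get_dummies` returns them unchanged in the output above. As a minimal sketch of what the call does when a categorical column is present, the toy frame below (hypothetical rows, with `country_code` borrowed from the column names of the original file) gets one indicator column per category.

```python
import pandas as pd

# Toy data: hypothetical rows, not taken from after.xlsx
toy = pd.DataFrame({
    "success": [1, 0, 1],
    "founded_year": [2012, 2006, 2011],
    "country_code": ["CAN", "USA", "CAN"],
})

# Numeric columns pass through; the text column is one-hot encoded
toy_dummies = pd.get_dummies(toy)
print(toy_dummies.columns.tolist())
```

The numeric columns keep their names, while `country_code` is replaced by `country_code_CAN` and `country_code_USA`.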


#### Train test split
- Create X (every column except success)
- Create y (the success column)
- Split X and y into a training set (80%) and a test set (20%)

In [43]:
from sklearn.model_selection import train_test_split

X = df_dummie.drop('success', axis=1)
y = df_dummie.success.values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print('X\n', X)
print('\n')
print('y\n', y)
print(X_train,'X_train\n', X_test, 'X_test\n', y_train,'y_train\n', y_test, 'y_test\n')

X
        funding_total_usd (divide by 1000)  founded_year
0                                       0          2012
1                                       0          2011
2                                       0          2006
3                                       0          2011
4                                       0          2014
...                                   ...           ...
22217                             9950002          2008
22218                             9952199          2012
22219                             9957650          1997
22220                             9990000          2011
22221                             9999999          2000

[22222 rows x 2 columns]


y
 [1 1 1 ... 1 1 1]
       funding_total_usd (divide by 1000)  founded_year
19881                              700000          2010
11263                              250000          2012
6271                             15000000          2003
8300                              1199936          2

In [44]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((17777, 2), (17777,), (4445, 2), (4445,))
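
The shapes above follow from `test_size=0.2` applied to 22,222 rows. As a sketch only, and not something this notebook does, passing `stratify=y` to `train_test_split` keeps the share of each class the same in both sets, which can matter when one class is rare. The arrays below are synthetic and their names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 90 ones and 10 zeros (illustrative only)
y_toy = np.array([1] * 90 + [0] * 10)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratify, the 20-row test set keeps the 90/10 class ratio
print(X_tr.shape, X_te.shape, int((y_te == 0).sum()))
```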

#### Linear Regression
- Create an instance of the class LinearRegression
- Fit the model to estimate the intercept 𝑏₀ and the coefficients 𝑏₁ and 𝑏₂ (one per feature), using the training inputs and outputs (X_train and y_train) as the arguments

In [45]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)
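
To make the fitting step concrete, the sketch below recovers the same intercept and coefficients with ordinary least squares via `numpy.linalg.lstsq`. The data are synthetic and every name in it (`X_toy`, `y_toy`, `toy_model`, `beta`) is illustrative, not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))  # two features, like X_train
noise = rng.normal(scale=0.1, size=200)
y_toy = 1.0 + 2.0 * X_toy[:, 0] - 3.0 * X_toy[:, 1] + noise

toy_model = LinearRegression().fit(X_toy, y_toy)

# Least squares by hand: prepend a column of ones for the intercept
A = np.column_stack([np.ones(len(X_toy)), X_toy])
beta, *_ = np.linalg.lstsq(A, y_toy, rcond=None)

print(toy_model.intercept_, toy_model.coef_)
print(beta)  # [b0, b1, b2]
```

Both routes minimise the same sum of squared residuals, so the two printed sets of weights agree up to floating-point error.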

#### Evaluate the model
- Get the coefficient of determination (𝑅²) on the training data
- Get the intercept and the slopes (one per feature)

In [46]:
r_sq = model.score(X_train, y_train) # coefficient of determination (𝑅²) for predictors X_train and response y_train
print('coefficient of determination:', r_sq)
print('\n')
print('intercept:', model.intercept_)
print('\n')
print('slope:', model.coef_)

coefficient of determination: 0.00010736079517037478


intercept: 1.0037458005602544


slope: [ 9.66022749e-12 -2.88902567e-05]
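
`model.score` returns 𝑅² = 1 − SS_res / SS_tot, the share of the variance in the target that the fit explains, so a value close to 0 means the two features explain almost none of the variation in `success`. The sketch below computes 𝑅² by hand on synthetic data (all names are illustrative) and compares it with `score`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X_toy = rng.normal(size=(100, 2))
y_toy = 0.5 * X_toy[:, 0] + rng.normal(size=100)

toy_model = LinearRegression().fit(X_toy, y_toy)
pred = toy_model.predict(X_toy)

ss_res = np.sum((y_toy - pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_toy - y_toy.mean()) ** 2)  # total sum of squares
r_sq_manual = 1 - ss_res / ss_tot

print(r_sq_manual, toy_model.score(X_toy, y_toy))
```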


#### Pickle the model 
- Turn model into API endpoint using Flask (productionization)

In [8]:
import pickle

# Store the model fitted above; the context manager closes the file
pickl = {'model': model}
with open('model_file.p', 'wb') as f:
    pickle.dump(pickl, f)

file_name = "model_file.p"
with open(file_name, 'rb') as pickled:
    data = pickle.load(pickled)
    model = data['model']

model.predict(np.array(list(X_test.iloc[1,:])).reshape(1,-1))[0]

0.945163778298854

In [9]:
X_test = np.array(X_test)

In [10]:
X_test[0]

array([   0, 2013,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    1,    0],
      dtype=int64)

In [47]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, np.around(model.predict(X_test),0))

0.9437570303712036
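
Rounding turns the regression output into class labels, but a linear regression can return values outside the 0 to 1 range, and `np.around` rounds exact halves to the nearest even integer. One possible alternative, offered as an assumption rather than as this notebook's method, is an explicit 0.5 threshold. The prediction values below are made up.

```python
import numpy as np

# Made-up regression outputs, including values outside [0, 1]
raw = np.array([-0.6, 0.2, 0.5, 0.97, 1.6])

rounded = np.around(raw, 0)             # can yield labels such as -1 or 2
thresholded = (raw >= 0.5).astype(int)  # always 0 or 1

print(rounded)
print(thresholded)
```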

In [48]:
predictions = model.predict(X_test)
answer = list(np.around(predictions,0) == y_test)

In [49]:
acc = answer.count(True) / len(answer)
acc

0.9437570303712036

In [50]:
from sklearn.metrics import classification_report


In [52]:
print(classification_report(y_test, np.around(model.predict(X_test),0)))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00       250
           1       0.94      1.00      0.97      4195

   micro avg       0.94      0.94      0.94      4445
   macro avg       0.47      0.50      0.49      4445
weighted avg       0.89      0.94      0.92      4445
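
The report shows that class 0 is never predicted: its precision and recall are both 0, and the overall accuracy equals the share of class 1 in the test set (4195 of 4445). The sketch below rebuilds the labels from the support counts in the report and scores a constant always-predict-1 baseline, which reproduces the same accuracy. The arrays are reconstructed from those counts, not taken from `y_test`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Class counts copied from the support column of the report
y_true = np.array([0] * 250 + [1] * 4195)

# A baseline that ignores the features and always predicts class 1
y_baseline = np.ones_like(y_true)

baseline_acc = accuracy_score(y_true, y_baseline)
print(baseline_acc)
```

With classes this imbalanced, accuracy alone says little about the model; the per-class rows of the report are more informative.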



