## Model.ipynb

**AUTHOR:** Shiyan Boxer

**DATE:** Dec 28th, 2020

**DESCRIPTION:** Create the model by doing the following:
- Choose relevant columns
- Create dummy variables for categorical columns
- Split the data into training and test sets
- Random forest
- Tune model with GridSearchCV
- Test ensembles

**DEPENDENCIES:**
- Python 3.8.6
- pandas
- numpy
- pickle
- sklearn train_test_split
- sklearn LinearRegression
- sklearn RandomForestRegressor
- sklearn GridSearchCV
- sklearn mean_absolute_error

#### Import dependencies

In [1]:
import pandas as pd
import numpy as np
import os

#### Read file into DataFrame

In [3]:
df = pd.read_excel('C://Users//shiya//Documents//Dev//Startup-Success-Predictor//Clean//after.xlsx')
df.head()

Unnamed: 0,permalink,name,homepage_url,category_list,market,funding_total_usd (divide by 1000),status,country_code,state_code,region,...,product_crowdfunding,round_A,round_B,round_C,round_D,round_E,round_F,round_G,round_H,success
0,/organization/advanced-northern-graphite-leade...,Advanced Northern Graphite Leaders,http://www.anglinc.ca,|Clean Technology|,Clean Technology,0,operating,CAN,AB,Sherwood Park,...,0,0,0,0,0,0,0,0,0,1
1,/organization/celebration-creation,Celebration Creation,http://www.celebrationcreation.ca,|Real Estate|,Real Estate,0,operating,CAN,AB,Calgary,...,0,0,0,0,0,0,0,0,0,1
2,/organization/justparts,JustParts,http://www.JustParts.com,|Auto|Marketplaces|E-Commerce|,Marketplaces,0,operating,CAN,AB,Thunder Bay,...,0,0,0,0,0,0,0,0,0,1
3,/organization/knighthaven,KnightHaven,http://www.knighthaven.com/,|Entertainment|Games|,Games,0,operating,CAN,AB,AB - Other,...,0,0,0,0,0,0,0,0,0,1
4,/organization/kotch-international-transportati...,Kotch International Transportation Design Spec...,http://www.kotchexotictours.com,|Transportation|,Transportation,0,operating,CAN,AB,AB - Other,...,0,0,0,0,0,0,0,0,0,1


#### Choose relevant columns (success, funding_total_usd, country_code, founded_year) and assign them to df_model

In [4]:
df_model = df[['success','funding_total_usd (divide by 1000)','country_code','founded_year']]
print(df_model)

       success  funding_total_usd (divide by 1000) country_code  founded_year
0            1                                   0          CAN          2012
1            1                                   0          CAN          2011
2            1                                   0          CAN          2006
3            1                                   0          CAN          2011
4            1                                   0          CAN          2014
...        ...                                 ...          ...           ...
22217        1                             9950002          USA          2008
22218        1                             9952199          USA          2012
22219        1                             9957650          USA          1997
22220        1                             9990000          USA          2011
22221        1                             9999999          USA          2000

[22222 rows x 4 columns]


#### Create dummy variables for the categorical columns

In [5]:
df_dummie = pd.get_dummies(df_model)
print(df_dummie)

       success  funding_total_usd (divide by 1000)  founded_year  \
0            1                                   0          2012   
1            1                                   0          2011   
2            1                                   0          2006   
3            1                                   0          2011   
4            1                                   0          2014   
...        ...                                 ...           ...   
22217        1                             9950002          2008   
22218        1                             9952199          2012   
22219        1                             9957650          1997   
22220        1                             9990000          2011   
22221        1                             9999999          2000   

       country_code_ARE  country_code_ARG  country_code_AUS  country_code_AUT  \
0                     0                 0                 0                 0   
1                    
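
To make the effect of `pd.get_dummies` easier to see than in the wide output above, here is a minimal sketch on a toy frame. The frame `toy` and its values are made up for illustration; only the column names mirror `df_model`.

```python
import pandas as pd

# Hypothetical toy frame: column names mirror df_model, values are invented.
toy = pd.DataFrame({
    "success": [1, 0, 1],
    "country_code": ["CAN", "USA", "CAN"],
    "founded_year": [2012, 2008, 2011],
})

# String columns are expanded into one indicator column per category;
# numeric columns pass through unchanged and stay at the front.
toy_dummies = pd.get_dummies(toy)
print(toy_dummies.columns.tolist())
print(toy_dummies)
```

Because every category gets its own indicator column, the indicators for one variable always sum to 1 per row. If that redundancy becomes a problem for a linear model with an intercept, `pd.get_dummies(..., drop_first=True)` drops one category per variable; whether that is appropriate here is a judgment call the notebook does not make.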

#### Train test split
- Create X (every column except success)
- Create y (the success column)
- Split X and y into a training set (80%) and a test set (20%)

In [6]:
from sklearn.model_selection import train_test_split

X = df_dummie.drop('success', axis=1)  # features: every column except the target
y = df_dummie.success.values  # target as a NumPy array
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print('X\n', X)
print('\n')
print('y\n', y)
print('X_train\n', X_train, '\nX_test\n', X_test, '\ny_train\n', y_train, '\ny_test\n', y_test)

X
        funding_total_usd (divide by 1000)  founded_year  country_code_ARE  \
0                                       0          2012                 0   
1                                       0          2011                 0   
2                                       0          2006                 0   
3                                       0          2011                 0   
4                                       0          2014                 0   
...                                   ...           ...               ...   
22217                             9950002          2008                 0   
22218                             9952199          2012                 0   
22219                             9957650          1997                 0   
22220                             9990000          2011                 0   
22221                             9999999          2000                 0   

       country_code_ARG  country_code_AUS  country_code_AUT  country_cod
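
A quick way to confirm the 80/20 split is to check shapes rather than print the full frames. The sketch below uses small made-up arrays (`X_toy`, `y_toy`) instead of the real data, so the sizes are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, 2 features; row i is [2*i, 2*i + 1] and y is i.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Same arguments as in the cell above: 20% held out, fixed seed for repeatability.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

print("train:", X_tr.shape, y_tr.shape)
print("test: ", X_te.shape, y_te.shape)
```

The rows are shuffled before splitting, but each feature row stays paired with its target value.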

#### Linear Regression

In [7]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()  # create an instance of the LinearRegression class
model.fit(X_train, y_train)  # estimate the intercept and one coefficient per feature from the training data

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)
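
To show what `fit` estimates, here is a minimal sketch on synthetic, noise-free data where the true intercept and coefficients are known. The data and the names `X_demo`, `y_demo`, and `demo_model` are assumptions for illustration and are not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from y = 1 + 2*x1 - 3*x2 with no noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 1.0 + 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1]

demo_model = LinearRegression().fit(X_demo, y_demo)

# intercept_ is a single number; coef_ holds one weight per feature column.
print("intercept:", demo_model.intercept_)
print("coefficients:", demo_model.coef_)
```

With many one-hot country columns, as in `X_train`, `coef_` has one entry per column rather than a single slope.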

#### Test ensembles
- Score the model on the training set (coefficient of determination, R²)
- Print the intercept and the fitted coefficients

In [8]:
r_sq = model.score(X_train, y_train)  # coefficient of determination (R²) of the fit on the training data
print('coefficient of determination:', r_sq)
print('\n')
print('intercept:', model.intercept_)
print('\n')
print('slope:', model.coef_)

coefficient of determination: 0.0023312664843709863


intercept: 1.0236641301809732


slope: [ 9.74564108e-12 -3.11617559e-05  3.88507506e-02  3.88740519e-02
  3.88973335e-02  3.89656435e-02  3.87914871e-02  4.01476961e-07
  3.89734428e-02  3.89944183e-02 -1.19423445e-02  3.86459069e-02
  3.89788384e-02  3.88034049e-02  3.88195747e-02  3.87677759e-02
  3.88314213e-02  3.88819586e-02  3.88755017e-02  3.88402728e-02
  3.89050777e-02  3.89158309e-02  1.83713715e-03  3.89827043e-02
  3.89919933e-02  3.88190987e-02 -2.94409932e-01 -2.17826552e-02
  3.88391286e-02 -1.11499178e-02  3.89208891e-02  3.90160057e-02
  3.89881301e-02  3.88147448e-02  3.89881368e-02 -2.36754835e-07
  3.86422586e-02  3.89598311e-02  3.90591328e-02  3.89968544e-02
 -2.46750630e-01  3.90269654e-02  3.87597667e-02  3.89657833e-02
 -9.61065473e-01 -7.22086803e-02  3.89390629e-02  3.88098757e-02
  3.87848053e-02  3.89655902e-02  3.89171004e-02  3.88721846e-02
  3.89034653e-02 -1.58438064e-02  3.87409916e-02]
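
The score above is computed on the training data. As a sketch of how the held-out set and `mean_absolute_error` from the dependency list could be used instead, the example below runs the same steps on synthetic data. The arrays `X_syn` and `y_syn` and the coefficients used to generate them are assumptions for illustration; the numbers it prints say nothing about the startup data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem: 200 rows, 3 features, small Gaussian noise.
rng = np.random.default_rng(42)
X_syn = rng.normal(size=(200, 3))
y_syn = X_syn @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

reg = LinearRegression().fit(X_tr, y_tr)

# Evaluate on rows the model has not seen during fitting.
y_pred = reg.predict(X_te)
mae = mean_absolute_error(y_te, y_pred)
r2_test = reg.score(X_te, y_te)

print("test MAE:", mae)
print("test R^2:", r2_test)
```

In the notebook itself the equivalent calls would be `model.predict(X_test)` followed by `mean_absolute_error(y_test, ...)`, using the variables created in the train/test split cell.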
