#### Description
- Predicts success/status, which can be "operating", "acquired", or "closed", based on total funding and founding year
- Split the data into 80% training and 20% testing
- Train a linear regression model (and a decision tree classifier)
- Pickle the model, i.e. serialize it to a byte stream


#### Import dependencies

In [1]:
import pandas as pd
import numpy as np

#### Read file into DataFrame
- Open the xlsx file and store the resulting DataFrame in the variable "df"
- Display the first 5 rows of the DataFrame using the head() function

In [2]:
df = pd.read_excel('after.xlsx')
df.head()

Unnamed: 0,permalink,name,homepage_url,category_list,market,funding_total_usd,status,country_code,state_code,region,...,product_crowdfunding,round_A,round_B,round_C,round_D,round_E,round_F,round_G,round_H,success
0,/organization/advanced-northern-graphite-leade...,Advanced Northern Graphite Leaders,http://www.anglinc.ca,|Clean Technology|,Clean Technology,0,operating,CAN,AB,Sherwood Park,...,0,0,0,0,0,0,0,0,0,1
1,/organization/celebration-creation,Celebration Creation,http://www.celebrationcreation.ca,|Real Estate|,Real Estate,0,operating,CAN,AB,Calgary,...,0,0,0,0,0,0,0,0,0,1
2,/organization/justparts,JustParts,http://www.JustParts.com,|Auto|Marketplaces|E-Commerce|,Marketplaces,0,operating,CAN,AB,Thunder Bay,...,0,0,0,0,0,0,0,0,0,1
3,/organization/knighthaven,KnightHaven,http://www.knighthaven.com/,|Entertainment|Games|,Games,0,operating,CAN,AB,AB - Other,...,0,0,0,0,0,0,0,0,0,1
4,/organization/kotch-international-transportati...,Kotch International Transportation Design Spec...,http://www.kotchexotictours.com,|Transportation|,Transportation,0,operating,CAN,AB,AB - Other,...,0,0,0,0,0,0,0,0,0,1


#### Choose relevant columns (success, funding_total_usd, founded_year) and assign the filtered dataset to a variable called "df_model"

In [3]:
df_model = df[['success','funding_total_usd','founded_year']]
print(df_model)

       success  funding_total_usd  founded_year
0            1                  0          2012
1            1                  0          2011
2            1                  0          2006
3            1                  0          2011
4            1                  0          2014
...        ...                ...           ...
22217        1                  0          2008
22218        1                  0          2012
22219        1                  0          1997
22220        1                  0          2011
22221        1                  0          2000

[22222 rows x 3 columns]


#### Get dummy variables for df_model
- Use the pandas [get_dummies()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) function, which converts categorical variables into dummy/indicator variables. The three selected columns are already numeric, so the output below is identical to df_model.

In [4]:
df_dummie = pd.get_dummies(df_model)
print(df_dummie)

       success  funding_total_usd  founded_year
0            1                  0          2012
1            1                  0          2011
2            1                  0          2006
3            1                  0          2011
4            1                  0          2014
...        ...                ...           ...
22217        1                  0          2008
22218        1                  0          2012
22219        1                  0          1997
22220        1                  0          2011
22221        1                  0          2000

[22222 rows x 3 columns]
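
The following is a minimal sketch of what get_dummies() does when a categorical column is present. The toy frame and its values are hypothetical, not taken from after.xlsx, and the column to encode is passed explicitly through the `columns` argument.

```python
import pandas as pd

# Hypothetical toy frame: "country_code" is the only non-numeric column.
toy = pd.DataFrame({
    "success": [1, 0, 1],
    "funding_total_usd": [0, 50000, 120000],
    "country_code": ["CAN", "USA", "CAN"],
})

# Numeric-only selection: there is nothing to encode, so the columns are unchanged.
numeric_only = pd.get_dummies(toy[["success", "funding_total_usd"]])

# Encoding the string column adds one indicator column per category.
encoded = pd.get_dummies(toy, columns=["country_code"])

print(list(numeric_only.columns))
print(list(encoded.columns))
```

If country_code were added to df_model, the same call would be one way to turn it into model inputs.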


#### Train test split
- Create X (every column except "success")
- Create y (the "success" values)
- Split X and y into a training set (80%) and a test set (20%)

In [5]:
from sklearn.model_selection import train_test_split

# Create x and y variables
X = df_dummie.drop('success', axis=1)  # axis=1 drops a column rather than a row
y = df_dummie.success.values

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print the data
print(X, y)
print(X_train, X_test, y_train,y_test)
X_train.shape, y_train.shape, X_test.shape, y_test.shape

       funding_total_usd  founded_year
0                      0          2012
1                      0          2011
2                      0          2006
3                      0          2011
4                      0          2014
...                  ...           ...
22217                  0          2008
22218                  0          2012
22219                  0          1997
22220                  0          2011
22221                  0          2000

[22222 rows x 2 columns] [1 1 1 ... 1 1 1]
       funding_total_usd  founded_year
19881                  0          2010
11263                  0          2012
6271                   0          2003
8300                   0          2012
18798                  0          2012
...                  ...           ...
11964                  0          2012
21575                  0          2009
5390                   0          1999
860                    0          2013
15795                  0          2009

[17777 rows x 2 col

((17777, 2), (17777,), (4445, 2), (4445,))
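
The shapes above follow from the split sizes: 20% of 22222 is 4444.4, which train_test_split rounds up to 4445 test rows, leaving 17777 for training. Because the labels are heavily imbalanced, a stratified split may be worth considering. The sketch below uses hypothetical toy labels (10 zeros, 90 ones) and the `stratify` argument, which the notebook itself does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 10 zeros and 90 ones.
y_toy = np.array([0] * 10 + [1] * 90)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy keeps the class ratio the same in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(X_tr.shape, X_te.shape)
print("zeros in test set:", int((y_te == 0).sum()))
```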

#### Train with Linear Regression
- Create an instance of the class LinearRegression
- Fit the model: estimate the intercept 𝑏₀ and one weight per feature (𝑏₁, 𝑏₂) from the training inputs and outputs (X_train and y_train)

In [6]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)  # estimate the intercept and coefficients from the training data
print(model)

LinearRegression()
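
As a rough illustration of what the fitted object represents, the sketch below checks that a prediction is the intercept plus the weighted sum of the features. The data is synthetic and noise-free, and the names `X_toy`, `y_toy` and `toy_model` are placeholders rather than part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data generated from y = 1.5 + 2.0*x1 - 0.5*x2.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 2))
y_toy = 1.5 + 2.0 * X_toy[:, 0] - 0.5 * X_toy[:, 1]

toy_model = LinearRegression().fit(X_toy, y_toy)

# predict() is equivalent to intercept_ + X @ coef_.
manual = toy_model.intercept_ + X_toy @ toy_model.coef_

print(toy_model.intercept_, toy_model.coef_)
print(np.allclose(manual, toy_model.predict(X_toy)))
```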


#### Train with Decision Tree

In [7]:
# https://scikit-learn.org/stable/modules/tree.html#classification
from sklearn import tree

model_2 = tree.DecisionTreeClassifier()
model_2 = model_2.fit(X_train, y_train)
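
The notebook trains model_2 but does not score it. One possible way to evaluate a decision tree is sketched below on synthetic data. The feature, the labelling rule and the variable names are assumptions made for illustration only.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: one "founded_year"-like feature and a label derived from it.
rng = np.random.default_rng(0)
X_toy = rng.integers(1990, 2015, size=(200, 1))
y_toy = (X_toy[:, 0] >= 2005).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

toy_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# A classifier predicts class labels directly, so no rounding is needed.
tree_pred = toy_tree.predict(X_te)
train_acc = toy_tree.score(X_tr, y_tr)
test_acc = accuracy_score(y_te, tree_pred)

print(train_acc, test_acc)
```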

#### Get Score - Linear Regression
- Get output coefficient of determination (𝑅²)
- Get intercept and slope

In [8]:
r_sq = model.score(X_train, y_train)  # coefficient of determination (𝑅²) on the training data
print('coefficient of determination:', r_sq)
print('\n')
print('intercept:', model.intercept_)
print('\n')
print('slope:', model.coef_)

coefficient of determination: 2.589753489323776e-06


intercept: 1.0414482173331012


slope: [ 0.00000000e+00 -4.75896619e-05]
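
An 𝑅² this close to zero suggests the two features explain essentially none of the variance in "success". For reference, the sketch below computes 𝑅² by hand as 1 − SS_res / SS_tot and compares it with score(). The data is synthetic and the variable names are placeholders.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a weak linear signal plus noise.
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(100, 2))
y_toy = 0.3 * X_toy[:, 0] + rng.normal(size=100)

toy_model = LinearRegression().fit(X_toy, y_toy)
pred = toy_model.predict(X_toy)

ss_res = np.sum((y_toy - pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_toy - y_toy.mean()) ** 2)  # total sum of squares
r_sq_manual = 1.0 - ss_res / ss_tot

print(r_sq_manual, toy_model.score(X_toy, y_toy))
```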


#### Get Classification Report - Linear Regression

In [9]:
from sklearn.metrics import classification_report
print(classification_report(y_test, np.around(model.predict(X_test),0)))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00       250
           1       0.94      1.00      0.97      4195

    accuracy                           0.94      4445
   macro avg       0.47      0.50      0.49      4445
weighted avg       0.89      0.94      0.92      4445



  _warn_prf(average, modifier, msg_start, len(result))


#### Test Accuracy - Linear Regression

In [10]:
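from sklearn.metrics import accuracy_score, precision_score
# Round the continuous regression output to the nearest class label
y_pred = np.around(model.predict(X_test), 0)
accuracy_score(y_test, y_pred) * 100  # computed but not displayed: a cell shows only its last expression
precision_score(y_test, y_pred, average='micro')  # micro-averaged precision equals accuracy here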


0.9437570303712036
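
The score above equals 4195 / 4445, the share of class 1 in the test set. That appears consistent with the rounded predictions all being 1, which would also explain the zero precision and recall for class 0. The sketch below rebuilds that situation from the support counts in the report. The all-ones prediction vector is an assumption, not something read from the model.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report

# Class counts taken from the "support" column of the classification report.
y_true = np.array([0] * 250 + [1] * 4195)

# Assumption: every rounded prediction is 1 (a majority-class baseline).
y_all_ones = np.ones_like(y_true)

baseline_acc = accuracy_score(y_true, y_all_ones)
print(baseline_acc)

# zero_division=0 reports 0.0 for class 0 without emitting a warning.
print(classification_report(y_true, y_all_ones, zero_division=0))
```

A model is only useful here if it beats this baseline, so per-class recall is probably a more informative metric than overall accuracy.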

#### Pickle the model 
- Pickle the model so it can be loaded in the Flask API to be called for making predictions

In [11]:
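import pickle
# The model is already fitted, so store it directly instead of refitting
pickl = {'model': model}
# Use a context manager so the file is closed after writing
with open('model_file' + ".p", "wb") as f:
    pickle.dump(pickl, f)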

file_name = "model_file.p"
with open(file_name, 'rb') as pickled:
    data = pickle.load(pickled)
    model = data['model']
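
A quick way to check that a pickled model survives the round trip is to compare predictions before and after loading. The sketch below does this with a toy model in a temporary directory. The data values are hypothetical and only the dictionary layout ({'model': ...}) mirrors the cell above.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical rows shaped like (funding_total_usd, founded_year).
X_toy = np.array([[0, 2010], [100, 2012], [50, 2005], [10, 1999]], dtype=float)
y_toy = np.array([1, 1, 0, 1], dtype=float)
toy_model = LinearRegression().fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_file.p")
    # Write the fitted model to disk.
    with open(path, "wb") as fh:
        pickle.dump({"model": toy_model}, fh)
    # Read it back, as the consuming service would.
    with open(path, "rb") as fh:
        restored = pickle.load(fh)["model"]

same = np.allclose(restored.predict(X_toy), toy_model.predict(X_toy))
print(same)
```

Pickle files can execute arbitrary code when loaded, so the API should only load model files from a trusted source.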