## Predicting a Startup's Profit/Success Rate using Multiple Linear Regression in Python

In this notebook we build a Multiple Linear Regression model on a dataset describing 50 startups and use it to predict the profit of a new startup from a few features. For venture capitalists, such a prediction can help decide whether or not to invest in a particular startup.
So let's say that you work for a venture capital firm that has hired you as a Data Scientist to derive insights from the data and help predict whether a particular startup would be a safe investment.
We can also derive useful insights by examining what difference it makes if a startup is launched in a particular state, or which startups end up performing better: those that spent more money on marketing, or those whose stellar R&D department led them to large profits and, in turn, fame and success.


### Part 1: Data Preprocessing
Importing the libraries…

In [7]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Importing the dataset

In [8]:
data = pd.read_csv('D:\\Dataset\\50_Startups.csv')
data.shape

(50, 5)

In [9]:
data.head()

|   | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |


The dataset contains the following columns (four features, i.e. independent variables, plus the target, Profit):

In [11]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 5 columns):
R&D Spend          50 non-null float64
Administration     50 non-null float64
Marketing Spend    50 non-null float64
State              50 non-null object
Profit             50 non-null float64
dtypes: float64(4), object(1)
memory usage: 1.8+ KB


In [12]:
# Let's check statistics of dataset
data.describe()

|   | R&D Spend | Administration | Marketing Spend | Profit |
|---|---|---|---|---|
| count | 50.0 | 50.0 | 50.0 | 50.0 |
| mean | 73721.6156 | 121344.6396 | 211025.0978 | 112012.6392 |
| std | 45902.256482 | 28017.802755 | 122290.310726 | 40306.180338 |
| min | 0.0 | 51283.14 | 0.0 | 14681.4 |
| 25% | 39936.37 | 103730.875 | 129300.1325 | 90138.9025 |
| 50% | 73051.08 | 122699.795 | 212716.24 | 107978.19 |
| 75% | 101602.8 | 144842.18 | 299469.085 | 139765.9775 |
| max | 165349.2 | 182645.56 | 471784.1 | 192261.83 |


In [13]:
#Let's check missing values in dataset
data.isnull().sum()

R&D Spend          0
Administration     0
Marketing Spend    0
State              0
Profit             0
dtype: int64

In [None]:
# Let's separate the independent variables (features/predictors) from the dependent variable (target)

In [14]:
X = data.iloc[:,:-1].values
y = data.iloc[:,-1].values

If we take a look at our dataset, we can see that State is a string variable. We cannot feed string variables directly into our machine learning model, as it only works with numbers. To overcome this problem, we use the LabelEncoder object to turn the state names into integers and then create dummy variables using the OneHotEncoder object.
Say we had only 2 states in our dataset, New York and California: the one-hot encoding would then have only 2 columns.
Similarly, for n different states it would have n columns, and each state would be represented by a series of 0s and 1s in which all columns are 0 except for the column of that particular state.
For example, if A, B, C are 3 states, then A = 100, B = 010, C = 001.
By now you should have a good idea of how the OneHotEncoder works.
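To make the A/B/C illustration concrete, here is a minimal sketch using pandas. The `states` series is made up for the illustration and is not part of the startup dataset; `pd.get_dummies` is used only because it shows the encoding in one call, not because it is the method used in this notebook.

```python
import pandas as pd

# Hypothetical column with three states, mirroring the A/B/C example
states = pd.Series(["A", "B", "C", "A"], name="State")

# One column per distinct state; each row has a single 1 in its own state's column
one_hot = pd.get_dummies(states, dtype=int)
print(one_hot)
```

Each row contains exactly one 1, so the row for state A reads 1, 0, 0, the row for B reads 0, 1, 0, and the row for C reads 0, 0, 1.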

### Encoding categorical data
Importing the Label Encoder Class along with OneHotEncoder

In [15]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

Creating an Object of the Label Encoder class

In [16]:
labelencoder = LabelEncoder()

The only categorical column is the state name, which is stored at index 3 of X, so we encode that column.

In [17]:
X[:, 3] = labelencoder.fit_transform(X[:, 3])
onehotencoder = OneHotEncoder(categorical_features = [3])
X = onehotencoder.fit_transform(X).toarray()

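Note: this cell targets an older scikit-learn release. Since version 0.20, OneHotEncoder can encode string categories directly, so the LabelEncoder step is no longer needed, and the categorical_features argument was removed in version 0.22 in favor of ColumnTransformer.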
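On recent scikit-learn versions the encoding cell above will not run as written, because `categorical_features` is no longer accepted. A possible equivalent, sketched here under the assumption that the state column sits at index 3, uses `ColumnTransformer`. The array `X_demo` is a tiny made-up stand-in for `X`, not real rows from the dataset.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Made-up stand-in for X: three numeric columns, then the state at index 3
X_demo = np.array([
    [100.0, 50.0, 200.0, "New York"],
    [90.0, 60.0, 180.0, "California"],
    [80.0, 40.0, 160.0, "Florida"],
], dtype=object)

# One-hot encode column 3 and pass the remaining columns through unchanged.
# sparse_threshold=0 asks for a dense array instead of a sparse matrix.
ct = ColumnTransformer(
    transformers=[("state", OneHotEncoder(), [3])],
    remainder="passthrough",
    sparse_threshold=0,
)
X_encoded = ct.fit_transform(X_demo).astype(float)

# Encoded state columns come first (categories in sorted order),
# followed by the three numeric columns
print(X_encoded)
```

As with the older API, the dummy columns end up at the front of the array, so the later step that drops the first column still applies. `OneHotEncoder(drop="first")` is another option that removes that column during encoding.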


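We drop the first column of X, which holds one of the one-hot encoded state columns. Keeping all of them would make the dummy columns sum to 1 in every row, which is perfectly collinear with the intercept (the dummy variable trap).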

In [18]:
X = X[:, 1:]
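Why drop a dummy column at all? A small NumPy sketch with made-up one-hot rows for three hypothetical states shows the redundancy: with an intercept column plus all three dummies, the design matrix has four columns but only rank 3, whereas dropping one dummy leaves three independent columns.

```python
import numpy as np

# Made-up one-hot columns for three hypothetical states (one row per startup)
dummies = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
])
intercept = np.ones((dummies.shape[0], 1))

# Intercept plus all three dummies: the dummies sum to the intercept column
full = np.hstack([intercept, dummies])
# Intercept plus two dummies, after dropping the first one
reduced = np.hstack([intercept, dummies[:, 1:]])

rank_full = np.linalg.matrix_rank(full)
rank_reduced = np.linalg.matrix_rank(reduced)
print(full.shape[1], rank_full)        # number of columns vs. rank
print(reduced.shape[1], rank_reduced)  # number of columns vs. rank
```

The dropped state becomes the baseline: the coefficients of the remaining dummies are read as differences relative to it.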

### Splitting the dataset into the Training set and Test set
Importing the library and holding out a test set: 80% of the data is used for training and 20% for testing. (This is a single train/test split, not cross-validation.)

In [19]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
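As a quick sanity check of what the split does, the sketch below applies the same call to made-up arrays with 50 rows (`X_demo` and `y_demo` are placeholders, not the startup data). It shows the 40/10 split sizes, that rows of X and y stay paired, and that fixing `random_state` makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data with 50 rows; row i of X_demo is [2*i, 2*i + 1] and y_demo[i] is i
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.arange(50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
# Same arguments and same random_state give the same split again
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

print(X_tr.shape, X_te.shape)        # 40 training rows, 10 test rows
print(np.array_equal(X_te, X_te2))   # identical test sets across the two calls
```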

### Part 2: Fitting our Linear Regression Model
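Fitting Multiple Linear Regression to the Training set
With three numeric features $x_1, x_2, x_3$ and $m-1$ dummy variables $D_1, \dots, D_{m-1}$ for $m$ states, the Linear Regression equation looks like:

$$y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + b_4 D_1 + b_5 D_2 + \dots + b_{m+2} D_{m-1}$$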
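To see how the coefficients of such an equation are estimated, here is a minimal NumPy sketch of ordinary least squares, which is the method `LinearRegression` uses. All data and the "true" coefficients below are made up for illustration; with noise-free data the fit should recover them.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 50

# Three made-up numeric features (x1, x2, x3)
x = rng.uniform(0, 100, size=(n_rows, 3))

# Made-up state labels 0, 1, 2 turned into two dummy columns (D1, D2);
# the first state is the dropped baseline
state = rng.integers(0, 3, size=n_rows)
D = np.eye(3)[state][:, 1:]

# Arbitrary coefficients b0..b5, used only to generate y
b_true = np.array([5.0, 0.8, -0.2, 0.05, 3.0, -1.5])

# Design matrix: a column of ones for b0, then x1..x3, then D1 and D2
A = np.hstack([np.ones((n_rows, 1)), x, D])
y = A @ b_true  # noise-free target

# Least-squares estimate of the coefficients
b_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.round(b_hat, 4))
```

On real data the targets are noisy, so the estimated coefficients only approximate the underlying relationship.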

Importing the Linear Regression Class

In [20]:
from sklearn.linear_model import LinearRegression

Creating an object of the Linear Regression Class

In [21]:
regressor = LinearRegression()

Fit the created object to our training set

In [22]:
regressor.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

### Part 3: Predicting the Test set results

In [23]:
y_pred = regressor.predict(X_test)

Printing the predicted values: to see the difference between predicted and actual results, we also print the actual test values alongside them.

In [24]:
pd.DataFrame({"Actual": y_test, "Predict": y_pred}).head()

|   | Actual | Predict |
|---|---|---|
| 0 | 103282.38 | 103015.201598 |
| 1 | 144259.4 | 132582.277608 |
| 2 | 146121.95 | 132447.738452 |
| 3 | 77798.83 | 71976.098513 |
| 4 | 191050.39 | 178537.482211 |


Do you think the model we have just built is actually the optimal one?
Our model contains all the features, some of which may be statistically insignificant for our predictions. What we need to do is find the subset of independent variables that are in fact helpful to our predictions. As a first check, let's measure how well the current model does on the test set.

In [37]:
from sklearn.metrics import mean_squared_error, r2_score

# Store the results under separate names so the imported functions are not overwritten
mse = mean_squared_error(y_test, y_pred)
print('Mean Squared Error: %.2f' % mse)

r2 = r2_score(y_test, y_pred)
print('R2 Score: %.2f' % r2)

Mean Squared Error: 83502864.03
R2 Score: 0.93
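The MSE is in squared units of Profit, which makes it hard to read directly. Its square root (about 9,138 for the value above) is in the same units as Profit. The sketch below spells out both metrics by hand in NumPy, using only the five Actual/Predict pairs printed earlier, so its numbers will differ from the full test-set results. The helper names `manual_mse` and `manual_r2` are illustrative, not library functions.

```python
import numpy as np

# First five actual/predicted pairs from the table above (not the full test set)
y_true = np.array([103282.38, 144259.40, 146121.95, 77798.83, 191050.39])
y_hat = np.array([103015.201598, 132582.277608, 132447.738452,
                  71976.098513, 178537.482211])

def manual_mse(y, p):
    """Mean of the squared residuals."""
    return float(np.mean((y - p) ** 2))

def manual_r2(y, p):
    """1 minus residual sum of squares over total sum of squares."""
    ss_res = np.sum((y - p) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return float(1 - ss_res / ss_tot)

mse_5 = manual_mse(y_true, y_hat)
rmse_5 = float(np.sqrt(mse_5))  # same units as Profit
r2_5 = manual_r2(y_true, y_hat)
print('MSE: %.2f, RMSE: %.2f, R2: %.2f' % (mse_5, rmse_5, r2_5))
```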
