# Multiple Linear Regression for Startup Success Prediction

Imagine you are an investor looking to understand the factors that contribute to the success of startups. You have a dataset that includes information on R&D Spend, Administration, Marketing Spend, and the State in which the startup is based. Your goal is to predict the profit of a startup based on these factors using Multiple Linear Regression.

> A tutorial on how to use Multiple Linear Regression.
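
For reference, the model fitted in this tutorial has the usual multiple linear regression form. The symbols below ($b_0$, $b_i$, $D$) are notation chosen for this sketch, and the two state indicators assume that California is the dropped baseline category, as in the encoding step later on:

$$
\hat{y}_{\text{Profit}} = b_0 + b_1\, D_{\text{Florida}} + b_2\, D_{\text{New York}} + b_3\, x_{\text{R\&D}} + b_4\, x_{\text{Admin}} + b_5\, x_{\text{Marketing}}
$$

Each $D$ is 1 if the startup is in that state and 0 otherwise; training estimates the coefficients $b_0, \dots, b_5$ by least squares.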

## 0. Data Preprocessing

### 0.1 Importing the libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

### 0.2 Importing the dataset

0.2A Upload the CSV file:



In [None]:
from google.colab import files
uploaded = files.upload()
for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

0.2B Load from a Google Drive path:

In [2]:
# Mount your Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Importing the dataset
dataset = pd.read_csv('/content/drive/My Drive/Colab Notebooks/50_Startups.csv')

print("Number of total records:", len(dataset))
print()
dataset.head()


Mounted at /content/drive
Number of total records: 50



   R&D Spend  Administration  Marketing Spend       State     Profit
0  165349.20       136897.80        471784.10    New York  192261.83
1  162597.70       151377.59        443898.53  California  191792.06
2  153441.51       101145.55        407934.54     Florida  191050.39
3  144372.41       118671.85        383199.62    New York  182901.99
4  142107.34        91391.77        366168.42     Florida  166187.94


### 0.3 Checking for null values

In [3]:
# Check for missing values in each column
dataset.isna().sum()
# Every count is zero, so the dataset has no missing values

R&D Spend          0
Administration     0
Marketing Spend    0
State              0
Profit             0


In [4]:
#Extract some preliminary info about the dataset
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   R&D Spend        50 non-null     float64
 1   Administration   50 non-null     float64
 2   Marketing Spend  50 non-null     float64
 3   State            50 non-null     object 
 4   Profit           50 non-null     float64
dtypes: float64(4), object(1)
memory usage: 2.1+ KB


### 0.4 Splitting into X and y (independent and dependent variables)

In [5]:
X = dataset.drop('Profit', axis=1)  # Features: every column except Profit, which is the target y
print(X)

    R&D Spend  Administration  Marketing Spend       State
0   165349.20       136897.80        471784.10    New York
1   162597.70       151377.59        443898.53  California
2   153441.51       101145.55        407934.54     Florida
3   144372.41       118671.85        383199.62    New York
4   142107.34        91391.77        366168.42     Florida
5   131876.90        99814.71        362861.36    New York
6   134615.46       147198.87        127716.82  California
7   130298.13       145530.06        323876.68     Florida
8   120542.52       148718.95        311613.29    New York
9   123334.88       108679.17        304981.62  California
10  101913.08       110594.11        229160.95     Florida
11  100671.96        91790.61        249744.55  California
12   93863.75       127320.38        249839.44     Florida
13   91992.39       135495.07        252664.93  California
14  119943.24       156547.42        256512.92     Florida
15  114523.61       122616.84        261776.23    New Yo

In [6]:
y = dataset['Profit']
print(y)

0     192261.83
1     191792.06
2     191050.39
3     182901.99
4     166187.94
5     156991.12
6     156122.51
7     155752.60
8     152211.77
9     149759.96
10    146121.95
11    144259.40
12    141585.52
13    134307.35
14    132602.65
15    129917.04
16    126992.93
17    125370.37
18    124266.90
19    122776.86
20    118474.03
21    111313.02
22    110352.25
23    108733.99
24    108552.04
25    107404.34
26    105733.54
27    105008.31
28    103282.38
29    101004.64
30     99937.59
31     97483.56
32     97427.84
33     96778.92
34     96712.80
35     96479.51
36     90708.19
37     89949.14
38     81229.06
39     81005.76
40     78239.91
41     77798.83
42     71498.49
43     69758.98
44     65200.33
45     64926.08
46     49490.75
47     42559.73
48     35673.41
49     14681.40
Name: Profit, dtype: float64


### 0.5 Encoding categorical data

Machine learning models generally require numerical input. The "State" column contains text values ("New York", "California", "Florida"), which must be converted into numbers before the model can use them. This conversion is called encoding. The code below uses one-hot encoding, which creates one 0/1 column per category.



In [7]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_feature = ["State"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                  one_hot,
                                  categorical_feature)],
                                 remainder="passthrough")

transformed_X = transformer.fit_transform(X)
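
To know which encoded column belongs to which state, the fitted encoder can be inspected. The sketch below uses a small made-up frame (`toy_X` and its values are assumptions for illustration, not the tutorial's dataset) with the same `ColumnTransformer` setup:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy frame with the tutorial's column names; the numbers are made up.
toy_X = pd.DataFrame({
    "R&D Spend": [100.0, 200.0, 300.0],
    "State": ["New York", "California", "Florida"],
})

toy_transformer = ColumnTransformer(
    [("one_hot", OneHotEncoder(), ["State"])],
    remainder="passthrough",
)
toy_transformer.fit(toy_X)

# The encoder sorts its categories, so the dummy columns follow this order.
state_categories = toy_transformer.named_transformers_["one_hot"].categories_[0]
print(list(state_categories))
```

The same lookup on the tutorial's `transformer` should show the order of the state columns in `transformed_X`.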

The next cell displays the data after one-hot encoding. The "State" column has been replaced by three binary columns (0, 1 and 2), one per state, followed by the three numeric columns, which are passed through unchanged. OneHotEncoder sorts the categories, so columns 0, 1 and 2 correspond to California, Florida and New York.

Wrapping the array in a pandas DataFrame makes the effect of the encoding easier to read.

In [8]:
# print(transformed_X)
# Converting the array back into a pandas DataFrame prints it more neatly
pd.DataFrame(transformed_X).head()

     0    1    2          3          4          5
0  0.0  0.0  1.0  165349.20  136897.80  471784.10
1  1.0  0.0  0.0  162597.70  151377.59  443898.53
2  0.0  1.0  0.0  153441.51  101145.55  407934.54
3  0.0  0.0  1.0  144372.41  118671.85  383199.62
4  0.0  1.0  0.0  142107.34   91391.77  366168.42


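The next step addresses a common issue in regression called the dummy variable trap.

What is the dummy variable trap?

One-hot encoding creates one column per category, and in every row exactly one of those columns is 1.
The columns therefore always sum to 1, so any one of them can be inferred from the others (for example, California = 1 - Florida - New York).
Combined with the model's intercept, this makes the features perfectly collinear: the coefficients are no longer uniquely determined and become hard to interpret. Dropping one dummy column removes the redundancy without losing information, and the dropped category becomes the baseline against which the others are measured.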
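
The redundancy can be checked numerically. This is a minimal sketch using the state dummies of the first five rows shown above plus an explicit intercept column; the variable names are chosen here for illustration:

```python
import numpy as np

# State dummies (California, Florida, New York) for the first five startups.
dummies = np.array([
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
], dtype=float)
intercept = np.ones((dummies.shape[0], 1))

full = np.hstack([intercept, dummies])            # intercept + all three dummies
reduced = np.hstack([intercept, dummies[:, 1:]])  # intercept + two dummies

# A matrix has full column rank only if no column is a combination of the others.
rank_full = np.linalg.matrix_rank(full)
rank_reduced = np.linalg.matrix_rank(reduced)
print(full.shape[1], rank_full)
print(reduced.shape[1], rank_reduced)
```

With all three dummies the matrix has four columns but only rank three; after dropping one dummy, the number of columns and the rank agree.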

In [9]:
# Avoid the dummy variable trap by deleting the first one-hot encoded column (California)
transformed_X = np.delete(transformed_X, 0, axis=1)
print(transformed_X)

[[0.0000000e+00 1.0000000e+00 1.6534920e+05 1.3689780e+05 4.7178410e+05]
 [0.0000000e+00 0.0000000e+00 1.6259770e+05 1.5137759e+05 4.4389853e+05]
 [1.0000000e+00 0.0000000e+00 1.5344151e+05 1.0114555e+05 4.0793454e+05]
 [0.0000000e+00 1.0000000e+00 1.4437241e+05 1.1867185e+05 3.8319962e+05]
 [1.0000000e+00 0.0000000e+00 1.4210734e+05 9.1391770e+04 3.6616842e+05]
 [0.0000000e+00 1.0000000e+00 1.3187690e+05 9.9814710e+04 3.6286136e+05]
 [0.0000000e+00 0.0000000e+00 1.3461546e+05 1.4719887e+05 1.2771682e+05]
 [1.0000000e+00 0.0000000e+00 1.3029813e+05 1.4553006e+05 3.2387668e+05]
 [0.0000000e+00 1.0000000e+00 1.2054252e+05 1.4871895e+05 3.1161329e+05]
 [0.0000000e+00 0.0000000e+00 1.2333488e+05 1.0867917e+05 3.0498162e+05]
 [1.0000000e+00 0.0000000e+00 1.0191308e+05 1.1059411e+05 2.2916095e+05]
 [0.0000000e+00 0.0000000e+00 1.0067196e+05 9.1790610e+04 2.4974455e+05]
 [1.0000000e+00 0.0000000e+00 9.3863750e+04 1.2732038e+05 2.4983944e+05]
 [0.0000000e+00 0.0000000e+00 9.1992390e+04 1.35495

### 0.6 Splitting the dataset into the Training set and Test set

In [10]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(transformed_X, y, test_size = 0.25, random_state = 2509)

random_state: This parameter seeds the shuffling so that the split is reproducible. With a fixed value (2509 here), the same rows land in the training and test sets every time the code is run. test_size = 0.25 holds out a quarter of the rows for testing.
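
The effect of `random_state` can be verified directly. This sketch uses stand-in arrays (`data` and `target` are placeholders, not the startup data) and splits them twice with the same seed:

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(50).reshape(50, 1)  # stand-in for 50 rows of features
target = np.arange(50)               # stand-in for 50 target values

# Same seed twice: both calls select the same test rows.
_, test_a, _, _ = train_test_split(data, target, test_size=0.25, random_state=2509)
_, test_b, _, _ = train_test_split(data, target, test_size=0.25, random_state=2509)

same_split = np.array_equal(test_a, test_b)
print(same_split, len(test_a))
```

With 50 rows and `test_size=0.25`, scikit-learn rounds the test set up to 13 rows and leaves 37 for training.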

## 1. Training the Multiple Linear Regression model on the Training set

In [11]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
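
After fitting, a `LinearRegression` object exposes the learned parameters as `intercept_` and `coef_`. The sketch below illustrates this on synthetic, noise-free data with known coefficients (the data and the name `toy_model` are assumptions for illustration), so the recovered values can be checked:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
features = rng.uniform(0, 100, size=(40, 2))

# Synthetic target with a known intercept (5) and slopes (2 and -3), no noise.
target = 5 + 2 * features[:, 0] - 3 * features[:, 1]

toy_model = LinearRegression().fit(features, target)
print(toy_model.intercept_)  # estimate of the intercept
print(toy_model.coef_)       # one coefficient per feature column
```

On the tutorial's model, `regressor.coef_` would hold one coefficient per column of `X_train`, in the same column order.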

### 1.1 Score

The score method evaluates the trained Multiple Linear Regression model (regressor) on the test set, that is, on data it did not see during training. For a regressor, score returns the coefficient of determination R², not a classification accuracy; a value of 1.0 means a perfect fit.

In [12]:
# R^2 on the test set
regressor.score(X_test, y_test)

0.984006429174176
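
To make the meaning of this number concrete, R² can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below does this on synthetic data (the data and variable names are assumptions for illustration) and compares the result with `score`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
features = rng.uniform(0, 10, size=(30, 1))
target = 3 * features[:, 0] + rng.normal(0, 1, size=30)  # linear signal plus noise

toy_model = LinearRegression().fit(features, target)
predicted = toy_model.predict(features)

ss_res = np.sum((target - predicted) ** 2)      # residual sum of squares
ss_tot = np.sum((target - target.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

r2_from_score = toy_model.score(features, target)
print(r2_manual, r2_from_score)
```

Both values agree up to floating-point error.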

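Calling score on the training data gives the training R² for comparison.

In [13]:
# R^2 on the training set
regressor.score(X_train, y_train)

0.9377294646372122

The test score (0.984) is higher than the training score (0.938). That is unusual and most likely reflects the small test set in this particular split, so neither number should be over-interpreted.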

## 2. Predicting the Test set results

The predict method applies the trained Multiple Linear Regression model to the test features and returns one predicted profit per row.

In [14]:
y_pred = regressor.predict(X_test)

The next cell evaluates the predictions on the test data with the Mean Squared Error (MSE), the average squared difference between the predicted values (y_pred) and the actual values (y_test). Lower is better, and because the errors are squared, the result is in squared units of Profit.

In [15]:
from sklearn.metrics import mean_squared_error
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')
# The Mean Squared Error is a measure of the average squared difference between your model's
# predicted values (y_pred) and the actual values (y_test).


Mean Squared Error: 24637182.38629506
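
Because the MSE is in squared units, its square root (RMSE) is often easier to read; for the value printed above, it is roughly 4,964 in the units of Profit. The sketch below shows the calculation on three made-up values (the arrays are assumptions for illustration) and checks it against `mean_squared_error`:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

actual = np.array([100.0, 150.0, 200.0])     # made-up true values
predicted = np.array([110.0, 140.0, 205.0])  # made-up predictions

# MSE by hand: mean of the squared differences.
mse_manual = np.mean((actual - predicted) ** 2)
mse_sklearn = mean_squared_error(actual, predicted)

# RMSE brings the error back to the units of the target.
rmse = np.sqrt(mse_sklearn)
print(mse_manual, mse_sklearn, rmse)
```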


### 2.1 Compare Predicted results

The next cell collects the model's predictions (y_pred) and the true values (y_test) in a dictionary (d) so that they can be compared side by side.

In [16]:
#Create a dictionary from the actual test result and predicted result
d = {'y_pred': y_pred, 'y_test': y_test}

pd.DataFrame(d) turns the dictionary into a DataFrame, a table with labelled rows and columns. The row labels are the original dataset indices of the test rows.

- y_pred: the values predicted by the model
- y_test: the actual values from the dataset

In [17]:
pd.DataFrame(d)

           y_pred     y_test
32   98884.371543   97427.84
33  100047.235184   96778.92
47   47766.247901   42559.73
9   154976.558305  149759.96
37   91129.087779   89949.14
8   151755.926389  152211.77
23  112436.195860  108733.99
24  113375.898676  108552.04
17  130706.106786  125370.37
1   189141.730655  191792.06
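
The comparison is easier to judge with explicit error columns. The sketch below rebuilds the first three rows of the table above and adds the signed error and the percentage error; the column names `error` and `pct_error` are chosen here for illustration:

```python
import pandas as pd

# First three rows of the comparison table, copied from the output above.
comparison = pd.DataFrame(
    {
        "y_pred": [98884.371543, 100047.235184, 47766.247901],
        "y_test": [97427.84, 96778.92, 42559.73],
    },
    index=[32, 33, 47],
)

# Positive error means the model over-predicted the profit.
comparison["error"] = comparison["y_pred"] - comparison["y_test"]
comparison["pct_error"] = 100 * comparison["error"] / comparison["y_test"]
print(comparison.round(2))
```

In the notebook, the same two lines could be applied to `pd.DataFrame(d)` to cover every test row.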


This section uses the trained model (regressor) to predict the profit for a single new data point.

`New_Data = [[1,1,165349.20,136897.80,471784.10]]`

This creates a list of lists, that is, a 2D array with one row, holding the features of the startup whose profit we want to predict.
Important: the values must follow the same column order as the training data. After the California column was dropped, the first two values are the Florida and New York dummies, and the next three are "R&D Spend", "Administration", and "Marketing Spend". A valid row has at most one dummy set to 1: (1, 0) for Florida, (0, 1) for New York, and (0, 0) for California. The row used here sets both to 1, which does not correspond to any single state; for the New York startup with these spend figures, the encoding would be (0, 1).

`y_predNew = regressor.predict(New_Data)`

The predict() method takes New_Data as input and returns the predicted profit as an array with one element, which is stored in y_predNew.

`print(y_predNew)`

This prints the predicted profit for the new startup data.

In [None]:
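# Predict for a new data point; the state must be given in encoded form.
# Column order: Florida, New York, R&D Spend, Administration, Marketing Spend.
# Note: [1, 1, ...] sets both state dummies; a single state would be
# [1, 0, ...] (Florida), [0, 1, ...] (New York) or [0, 0, ...] (California).
New_Data = [[1,1,165349.20,136897.80,471784.10]]  # 2D array, because regressor.predict expects one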
y_predNew = regressor.predict(New_Data)
print(y_predNew)

[193288.15380141]
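
Hand-writing the encoded state is error-prone. One way to avoid it, sketched here as an assumption rather than as the tutorial's method, is to bundle the encoder and the regressor in a scikit-learn `Pipeline`, so that new data can be passed with the raw state name. The frame `toy` and its values are made up for illustration:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up training data with one numeric feature and the State column.
toy = pd.DataFrame({
    "R&D Spend": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    "State": ["New York", "California", "Florida",
              "New York", "California", "Florida"],
    "Profit": [25.0, 41.0, 63.0, 85.0, 101.0, 123.0],
})

preprocess = ColumnTransformer(
    [("one_hot", OneHotEncoder(drop="first"), ["State"])],
    remainder="passthrough",
)
model = Pipeline([("preprocess", preprocess), ("regressor", LinearRegression())])
model.fit(toy[["R&D Spend", "State"]], toy["Profit"])

# The new startup is described with the raw state name; the pipeline encodes it.
new_startup = pd.DataFrame({"R&D Spend": [35.0], "State": ["New York"]})
new_prediction = model.predict(new_startup)
print(new_prediction)
```

Because the encoding is learned once during `fit` and reused in `predict`, the dummy columns of new rows always match the training layout.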
