# Intuition

    1. Expression: 
        y = b0 + b1*x1 + b2*x2 + ..... + bn*xn
        
    2. Assumptions:
        a. Linearity
        b. Homoscedasticity
        c. Multivariate Normality
        d. Independence of Errors
        e. Lack of multicollinearity
        
    3. Dummy Variables
        y = b0 + b1*(R&DSpend) + b2*(Admin) + b3*(Marketing) + b4*(State)
        State? How do we multiply b4 by a non-numeric, categorical value?
        Ans: Create dummy variables
            Note: Do not include all of the dummy-variable columns in the regression.
                  Always omit one dummy variable (k categories need only k-1 dummies).
        Hence, with two states: y = b0 + b1*(R&DSpend) + b2*(Admin) + b3*(Marketing) + b4*(D1)
        In practice, with sklearn's LinearRegression we do not have to remove a dummy variable by hand: the predictions are the same when all dummy columns are kept, although the individual dummy coefficients are then not uniquely determined.
        
     4. Understand what the P-Value means
         In short, if a variable's p-value is greater than the significance level (commonly 0.05), there is not enough evidence that the variable contributes to predicting the target variable, so it is a candidate to be dropped.
     
     5. Building a Model
         - Decide which attributes to keep and which to discard, i.e. select the right attributes
         5.1) All-in
         5.2) Backward Elimination (Step-wise Regression) -> Preferable
         5.3) Forward Selection (Step-wise Regression)
         5.4) Bi-directional Elimination (Step-wise Regression)
         5.5) Score Comparison
         
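         Good news! For this example we do not need to pick the features by hand. Note, however, that sklearn's LinearRegression does not discard features itself: it fits one coefficient for every column it is given, so the methods above are still what we use when we want a smaller model.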
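
Backward elimination (5.2) can be sketched as follows. This is a minimal illustration on synthetic data, not the method used later in this notebook: the helper names `ols_p_values` and `backward_elimination`, the feature names, and the 0.05 significance level are all assumptions made for the example, and the p-values come from the standard OLS t-test computed with NumPy and SciPy.

```python
import numpy as np
from scipy import stats


def ols_p_values(X, y):
    """Two-sided t-test p-values for each column of X (X already holds the intercept column)."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = n - k
    sigma2 = resid @ resid / dof                      # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)             # covariance of the coefficients
    t_stats = beta / np.sqrt(np.diag(cov))
    return 2 * stats.t.sf(np.abs(t_stats), dof)


def backward_elimination(X, y, names, sl=0.05):
    """Repeatedly drop the feature with the highest p-value until all are <= sl."""
    X = np.column_stack([np.ones(len(X)), X])         # prepend the intercept column
    names = ["intercept"] + list(names)
    while X.shape[1] > 1:
        p = ols_p_values(X, y)
        worst = int(np.argmax(p[1:])) + 1             # never drop the intercept
        if p[worst] <= sl:
            break
        X = np.delete(X, worst, axis=1)
        names.pop(worst)
    return names


# Synthetic data (hypothetical): y depends on x1 and x2 but not on "noise".
rng = np.random.default_rng(0)
n = 200
x1, x2, noise = rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)
y = 3.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(scale=0.5, size=n)

kept = backward_elimination(np.column_stack([x1, x2, noise]), y, ["x1", "x2", "noise"])
print(kept)
```

The two informative features survive because their p-values are far below the threshold; whether the pure-noise column is dropped depends on the random draw, since about 5% of such columns pass a 0.05 test by chance.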
         
[IMPORTANT]: We do not need to apply feature scaling in Multiple Linear Regression (MLR) or Simple Linear Regression. In the MLR equation each feature is multiplied by its own coefficient, so it does not matter that some features take large values and others small ones: the coefficients adjust to compensate, and the predictions stay the same.
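
A small check of this claim, as a sketch on made-up data: the feature ranges and the helper `fit_predict` are assumptions for illustration, and the fit is plain least squares with an intercept via NumPy rather than the sklearn class used below.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 50
# Hypothetical features on very different scales
X = np.column_stack([rng.uniform(0, 150_000, n), rng.uniform(0, 10, n)])
y = 5_000 + 0.8 * X[:, 0] + 300 * X[:, 1] + rng.normal(scale=100, size=n)


def fit_predict(features, target):
    """Fit least squares with an intercept; return in-sample predictions and coefficients."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return design @ coef, coef


# Standardize each column, then fit both versions
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
pred_raw, coef_raw = fit_predict(X, y)
pred_scaled, coef_scaled = fit_predict(X_scaled, y)

print(np.allclose(pred_raw, pred_scaled))  # do the predictions agree?
print(coef_raw)
print(coef_scaled)
```

The coefficients differ between the two fits, but the predictions agree up to floating-point error, which is the sense in which the coefficients "adjust themselves".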

[IMPORTANT]: Do we need to check the assumptions before getting started? That looks good in theory; in practice, we can simply fit the MLR model without checking them first and then look at the accuracy of the final results. If the results are bad, reject the model; otherwise, we have a usable model.

# Importing the libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Importing the dataset

In [7]:
dataset = pd.read_csv("Chapter_3_50_Startups_Data.csv")
X = dataset.iloc[:, :-1].values 
Y = dataset.iloc[:,-1].values
# print(X, Y)
# iloc selects rows and columns by integer position.
# .values converts the selection into a NumPy array.

# Encoding Categorical Data

In [6]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])],remainder='passthrough')
X = np.array(ct.fit_transform(X))
# np.array() forces the output to be a NumPy array.
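
If we do want to omit one dummy column explicitly, `OneHotEncoder` accepts `drop="first"`. The sketch below uses a small made-up `states` array rather than the dataset, so the values are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical State column with three categories
states = np.array([["New York"], ["California"], ["Florida"], ["California"]])

# drop="first" removes the dummy column of the first (alphabetically sorted) category
encoder = OneHotEncoder(drop="first")
dummies = encoder.fit_transform(states).toarray()

print(encoder.categories_[0])  # categories in the order the encoder uses
print(dummies)                 # two columns for three categories
```

The dropped category ("California" here, since categories are sorted) is represented by a row of zeros.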

# Splitting the Dataset

In [4]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=1)

# Training the Multiple Linear Regression model on Training Set

In [8]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, Y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

# Predicting the Test set results

In [25]:
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)
print(np.concatenate((y_pred.reshape(len(y_pred),1),Y_test.reshape(len(Y_test),1)),1))

[[114664.42 105008.31]
 [ 90593.16  96479.51]
 [ 75692.84  78239.91]
 [ 70221.89  81229.06]
 [179790.26 191050.39]
 [171576.92 182901.99]
 [ 49753.59  35673.41]
 [102276.66 101004.64]
 [ 58649.38  49490.75]
 [ 98272.03  97483.56]]


In [22]:
for i in range(len(X_test)):
    print(y_pred[i], Y_test[i])

114664.41715867177 105008.31
90593.155316208 96479.51
75692.8415157455 78239.91
70221.88679652037 81229.06
179790.25514872276 191050.39
171576.92018520602 182901.99
49753.58752030707 35673.41
102276.65888936335 101004.64
58649.37795761135 49490.75
98272.0256113121 97483.56
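
To summarise how close the predictions are, we can compute R² and the mean absolute error. The sketch below copies the ten predicted/actual pairs printed above (rounded to two decimals) into arrays so that it runs on its own; in the notebook you would pass `y_pred` and `Y_test` directly.

```python
import numpy as np

# Predicted and actual profits, copied from the printed test-set output
y_pred = np.array([114664.42, 90593.16, 75692.84, 70221.89, 179790.26,
                   171576.92, 49753.59, 102276.66, 58649.38, 98272.03])
y_true = np.array([105008.31, 96479.51, 78239.91, 81229.06, 191050.39,
                   182901.99, 35673.41, 101004.64, 49490.75, 97483.56])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot
mae = np.mean(np.abs(y_true - y_pred))

print(f"R^2 = {r2:.3f}, MAE = {mae:.2f}")
```

`sklearn.metrics.r2_score` and `mean_absolute_error` compute the same quantities.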


##### Question 1: How do I use my multiple linear regression model to make a single prediction, for example the profit of a startup with R&D Spend = 160000, Administration Spend = 130000, Marketing Spend = 300000 and State = California?

In [26]:
print(regressor.predict([[1, 0, 0, 160000, 130000, 300000]]))

[180892.25]


Notice that the "California" state was not input as a string in the last column but as "1, 0, 0" in the first three columns. That is because the predict method expects the one-hot-encoded values of the state, and, as we see in the second row of the matrix of features X, "California" was encoded as "1, 0, 0". Be careful to put these values in the first three columns, not the last three, because ColumnTransformer places the encoded columns first and appends the passthrough columns after them.
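
Instead of typing the dummy values by hand, a raw row can be passed through the fitted `ColumnTransformer`. The sketch below is self-contained, so it fits a fresh transformer on three made-up rows (`X_raw` is hypothetical, in the column order R&D Spend, Administration, Marketing Spend, State); `sparse_threshold=0` is added only to guarantee a dense result.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows: R&D Spend, Administration, Marketing Spend, State
X_raw = np.array([
    [150000, 120000, 400000, "New York"],
    [140000, 110000, 350000, "California"],
    [130000, 100000, 300000, "Florida"],
], dtype=object)

ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [3])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
ct.fit(X_raw)

# Encode a new startup exactly as the training rows were encoded
new_startup = np.array([[160000, 130000, 300000, "California"]], dtype=object)
encoded = ct.transform(new_startup).astype(float)
print(encoded)
```

In the notebook, the `ct` fitted earlier could be reused in the same way and the encoded row handed to `regressor.predict`.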

##### Getting the final linear regression equation with the values of the coefficients


In [27]:
print(regressor.coef_)
print(regressor.intercept_)

[-2.85e+02  2.98e+02 -1.24e+01  7.74e-01 -9.44e-03  2.89e-02]
49834.88507323134


Therefore, the equation of our multiple linear regression model is:

Profit = −285 × Dummy State 1 + 298 × Dummy State 2 − 12.4 × Dummy State 3 + 0.774 × R&D Spend − 0.00944 × Administration + 0.0289 × Marketing Spend + 49834.89

Important Note: To get these coefficients we read the "coef_" and "intercept_" attributes of our regressor object. Attributes in Python differ from methods: they are accessed without parentheses and usually hold a simple value or an array of values.
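
The equation can be checked by evaluating it by hand and comparing with `predict`. This is a sketch on synthetic data (the features, true coefficients, and `x_new` are made up), fitted with a fresh `LinearRegression` so that it runs independently of the cells above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical): three numeric features and a noisy linear target
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(30, 3))
y = 10 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=30)

regressor = LinearRegression().fit(X, y)

x_new = np.array([50.0, 20.0, 80.0])
# intercept_ + sum of coef_[i] * x_new[i], i.e. the regression equation
by_hand = regressor.intercept_ + x_new @ regressor.coef_
by_predict = regressor.predict(x_new.reshape(1, -1))[0]

print(by_hand, by_predict)
```

Both numbers agree up to floating-point error, which confirms that `coef_` and `intercept_` fully describe the fitted model.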