In [6]:
import pandas as pd
from pandas import Series, DataFrame
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

In [7]:
%matplotlib inline

In [8]:
# In this multiple regression model, we are given a dataset of 50 startup companies. It lists each company's profit for
# a financial year along with its spending pattern. A venture capital firm has hired you as a data scientist, and you
# need to predict whether it should invest in a given unknown company based on that company's spending pattern.
# Remember, the firm wants to maximize profit. This is not a yes/no classification: you need to predict
# the profit first and then a decision is made from it, so this is a regression problem.
# Let's import our dataset.
dataset = pd.read_csv('/home/rajatgirotra/study/machine_learning/course/MachineLearningA-ZTemplateFolder/Part2_Regression/Section5_MultipleLinearRegression/50_Startups.csv')

In [9]:
# we have 50 observations in our dataset
dataset

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State,Profit
0,165349.2,136897.8,471784.1,New York,192261.83
1,162597.7,151377.59,443898.53,California,191792.06
2,153441.51,101145.55,407934.54,Florida,191050.39
3,144372.41,118671.85,383199.62,New York,182901.99
4,142107.34,91391.77,366168.42,Florida,166187.94
5,131876.9,99814.71,362861.36,New York,156991.12
6,134615.46,147198.87,127716.82,California,156122.51
7,130298.13,145530.06,323876.68,Florida,155752.6
8,120542.52,148718.95,311613.29,New York,152211.77
9,123334.88,108679.17,304981.62,California,149759.96


In [10]:
# Just as the simple linear regression formula is y = mx + b, the multiple linear regression formula is
# y = b0 + b1x1 + b2x2 + b3x3 + ... (where b0 is the y-intercept and b1, b2, b3, etc. are the coefficients)
# Independent variables: x1, x2, x3, etc.
# Dependent variable: y
# Constant: b0
# Coefficients: b1, b2, b3, etc.
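
As a small illustration of the formula above, the sketch below evaluates y = b0 + b1x1 + b2x2 + b3x3 for two rows at once with NumPy. The coefficients and feature values are made up for the example and are not fitted to the startup dataset; `b0`, `b`, `X_demo` and `y_demo` are hypothetical names.

```python
import numpy as np

# Hypothetical coefficients, chosen only to illustrate the formula.
b0 = 50000.0                       # constant (y-intercept)
b = np.array([0.8, -0.03, 0.03])   # b1, b2, b3

# Two made-up observations with three features each (x1, x2, x3).
X_demo = np.array([
    [100000.0, 120000.0, 300000.0],
    [50000.0, 100000.0, 200000.0],
])

# y = b0 + b1*x1 + b2*x2 + b3*x3, evaluated for every row in one step.
y_demo = b0 + X_demo @ b
print(y_demo)
```

Each row of `X_demo` yields one predicted value, so the matrix product replaces an explicit loop over observations.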


In [12]:
# separate features and labels
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

In [15]:
# create training and test data
X_train, X_test, Y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [20]:
# Change categorical data State to quantitative data
# VERY VERY VERY VERY IMPORTANT INFO TO FOLLOW
##########################################################################################
""" 
Converting categorical data to quantitative data is done by adding DUMMY VARIABLES. Assume your multiple linear regression
equation is

y = b0 + b1x1 + b2x2 + b3x3 + ....

where x3 is a categorical column StateName whose value is either NewYork or California. You introduce dummy columns
NewYork and California, each holding 0 or 1 (i.e. like a switch). Each row has a 1 in exactly one of NewYork or California,

so the equation now becomes

y = b0 + b1x1 + b2x2 + b4D4 + b5D5 (where x3 (i.e. the state col) is split into dummy cols D4 = NewYork and D5 = California).

This equation is actually wrong, because the dummy cols behave like a switch. If D4 is 0, then D5 must be 1,
i.e. D5 = 1 - D4 always. D5 carries no information that D4 and the constant do not already carry, so the model cannot
tell the effects of b0, b4 and b5 apart. The equation should really be

y = b0 + b1x1 + b2x2 + b4D4. When D4 is 0, it becomes y = b0 + b1x1 + b2x2 for California. We say that
the coefficient for California is included in the constant b0. Including both dummy cols is the dummy variable trap;
never fall into it.

Always use n-1 dummy variables, where n is the number of categories in the categorical column.




Also, you have read before about how to choose good features. Don't use features that are useless. Using too many features
also makes the model hard to present to a large audience and hard to justify.
There are five methods of building a model:
1) All-in
2) Backward Elimination
3) Forward Selection
4) Bi-directional Elimination
5) Score comparison

Methods 2, 3 and 4 are together referred to as stepwise regression.

1) All-in means you use all the features, because you already know they are your best predictors (prior knowledge):
   a) You know that from domain knowledge
   b) or from your experience (you have built such a model before)
   c) or someone gave you those predictors and asked you to use them
   d) or you are preparing for Backward Elimination
   
2) Backward Elimination (BE): This method has some steps to it.
   a) Step 1: Select a Significance Level (SL), e.g. SL = 0.05 (i.e. 5%)
   b) Step 2: Fit the full model with all possible predictors (All-in)
   c) Step 3: Consider the predictor with the highest P-value. If P > SL, go to step 4, otherwise go to FIN (i.e. finished, your model is ready)
   d) Step 4: Remove the predictor
   e) Step 5: Fit the model again without the predictor
   f) Step 6: Go back to Step 3

3) Forward Selection: More complex than Backward Elimination (BE).
   a) Step 1: Select a Significance Level (SL), e.g. SL = 0.05 (i.e. 5%)
   b) Step 2: For each predictor, fit a simple linear regression model. Then select the model with the lowest P-value.
   Example: let's say you have 4 predictors F1, F2, F3, F4. Then you fit 4 simple linear regression models, one each for F1, F2, F3, F4.
   Let's say F3 had the lowest P-value.
   c) Step 3: Keep this variable (F3), and fit all possible models with one extra predictor added to the ones you already have,
   i.e. we now create linear regression models with two variables where one variable is always F3, so the options are:
   F1F3, F2F3, F4F3.
   d) Step 4: Consider the newly added predictor with the lowest P-value. If P < SL, go to step c), otherwise go to FIN: keep the previous model, without the last predictor you tried.
   Let's say this was F2F3 and the P-value was less than 0.05. So we repeat step c) and fit all possible models with one extra
   predictor added to the ones we already have, i.e. F1F2F3, F4F2F3, which are linear regression models with 3 variables.
   Let's say this time the lowest P-value is for F1F2F3 and it is > 0.05. So we stop, and F2F3 is the final model.

4) Bi-directional Elimination: A combination of 2) and 3).
   a) Step 1: Select an SL to stay and an SL to enter, e.g. SLSTAY = 0.05 and SLENTER = 0.05
   b) Step 2: Perform the next step of Forward Selection (i.e. new variables must have P < SLENTER to enter)
   Same example as above: F3 is the predictor with the lowest P-value. Let's say P-value(F3) = 0.02
   c) Step 3: Perform ALL steps of BE (i.e. old variables must have P < SLSTAY to stay)
   At the first iteration P-value(F3) = 0.02 is < 0.05, so we go to FIN in BE
   d) Step 4: Go to step b). Let's say we get F1F3, F2F3, F4F3 and F2F3 has the lowest P-value, 0.04
   e) Again at step c), the P-value of F2F3 = 0.04 < 0.05, so we go to FIN in BE
   f) Again at step b), you add F1 to get F1F2F3. Then at step c), say F2 now has P > SLSTAY, so you eliminate F2 and are left with F1F3.
   g) When no new variable can enter and no old variable can be removed, you are done.
   
5) Score Comparison: i.e. the brute-force approach.
   a) Select a criterion for your model, e.g. r-squared.
   b) Fit all possible combinations of predictors (2^n - 1 models for n predictors) and calculate the criterion for each.
   c) Choose the model with the best value of the criterion.
   Not a good approach, as the number of models grows exponentially and is very resource consuming.

We will use backward elimination in our study as it is the fastest.
A word on p-value: Fill in this section when you read more about the difference between experimental and theoretical probability.
"""

###########################################################################################
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
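
The Backward Elimination steps listed in the note above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the notebook's own implementation: the helper names `ols_p_values` and `backward_elimination` are hypothetical, the P-values come from an ordinary least squares fit done with NumPy and `scipy.stats`, and the constant column is assumed to always stay in the model.

```python
import numpy as np
from scipy import stats


def ols_p_values(X, y):
    """Fit OLS by least squares and return a two-sided P-value per column of X."""
    n, k = X.shape
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = n - k                                   # residual degrees of freedom
    sigma2 = resid @ resid / dof                  # estimated noise variance
    cov = sigma2 * np.linalg.inv(X.T @ X)         # covariance of the estimates
    t_stats = beta / np.sqrt(np.diag(cov))
    return 2 * stats.t.sf(np.abs(t_stats), dof)


def backward_elimination(X, y, names, sl=0.05):
    """Repeatedly drop the predictor with the highest P-value while it exceeds sl."""
    # Column of ones for b0; assumption: the constant is never removed.
    X = np.column_stack([np.ones(len(y)), X])
    names = ["const"] + list(names)
    while X.shape[1] > 1:
        p = ols_p_values(X, y)                    # Steps 2 and 5: fit the model
        worst = 1 + int(np.argmax(p[1:]))         # Step 3: highest P-value
        if p[worst] <= sl:
            break                                 # FIN: the model is ready
        X = np.delete(X, worst, axis=1)           # Step 4: remove the predictor
        del names[worst]
    return names


# Synthetic data: x1 and x2 drive y, x3 is pure noise.
rng = np.random.default_rng(0)
n = 200
X_demo = rng.normal(size=(n, 3))
y_demo = 3.0 + 2.0 * X_demo[:, 0] - 1.5 * X_demo[:, 1] + rng.normal(size=n)

kept = backward_elimination(X_demo, y_demo, ["x1", "x2", "x3"])
print(kept)
```

The strong predictors `x1` and `x2` survive every round. Whether the noise column `x3` is dropped depends on its P-value in the particular sample, which is why a significance level is a threshold and not a guarantee.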

In [21]:
le_X = LabelEncoder()
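# Note: the categorical_features argument was removed in scikit-learn 0.22, so this line needs an older release.
ohe = OneHotEncoder(categorical_features=[3])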

In [22]:
X_train[:, 3] = le_X.fit_transform(X_train[:, 3])


In [25]:
X_train = ohe.fit_transform(X_train)

In [26]:
X_train

<40x6 sparse matrix of type '<class 'numpy.float64'>'
	with 155 stored elements in COOrdinate format>
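
On scikit-learn releases where `OneHotEncoder` no longer accepts `categorical_features`, one possible equivalent is a `ColumnTransformer` that encodes only the State column and passes the rest through. The sketch below is an assumption about how the cells above could be adapted, not the notebook's own code. It uses the first five rows shown earlier, and `drop="first"` keeps n-1 dummy columns, in line with the dummy variable note above; `demo`, `ct` and `X_encoded` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# First five rows of the dataset as displayed above (features only).
demo = pd.DataFrame({
    "R&D Spend": [165349.2, 162597.7, 153441.51, 144372.41, 142107.34],
    "Administration": [136897.8, 151377.59, 101145.55, 118671.85, 91391.77],
    "Marketing Spend": [471784.1, 443898.53, 407934.54, 383199.62, 366168.42],
    "State": ["New York", "California", "Florida", "New York", "Florida"],
})

# One-hot encode State and drop its first category (California, alphabetically),
# so three states become two dummy columns; numeric columns pass through as-is.
ct = ColumnTransformer(
    transformers=[("state", OneHotEncoder(drop="first"), ["State"])],
    remainder="passthrough",
)
X_encoded = ct.fit_transform(demo)
print(X_encoded.shape)
```

The encoded dummy columns come first in the result, followed by the three numeric columns, so the output has five columns where the earlier cell produced six.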