# Business Problem

There are 50 companies in total, and I will be doing an explanatory analysis of their profit and loss financial statements by examining how much each company spent on its 3 major spending areas: **Research and Development**, **Administration**, **Marketing**. I also examine the state in which each company resides and the _profit_ of that company for that financial year.

Create a model based on the dataset sample that will allow investors to assess which companies they want to invest in to achieve their goal of maximising profit.

### Challenge 
A venture capitalist fund has asked me to analyze this dataset and create a model that will tell the fund which types of companies it should invest in.
1. Their main criterion in selecting a company is its profit, so profit is the dependent variable.
    - I want to understand what drives higher profit: is it R&D spend or one of the other independent variables? Do companies perform better in California than in New York? Does a company that spends more on marketing perform better than one that spends less?
    - Create a regression model based on the dataset sample given.
2. Are there any correlations between **Profit** and the other variables (**R&D, Marketing, Admin, State**, etc.)?

# Multiple Linear Regression Model Concept

### Assumptions of a Linear Regression:
Before building a linear regression model, I need to check that these assumptions hold:
1. Linearity 
2. Lack of multicollinearity 
3. Independence of errors

### Multiple Linear Regression Formula 

$$y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + b_4 D_1$$


# Categorical Variables

State is a categorical variable, so it cannot be added to the above equation directly. To deal with categorical variables, I create **_dummy variables_**.
- **Dummy Variable Trap**: Never include all of your dummy variable columns in your regression model.

### Dummy Variable Trap
The reason I do not include all the dummy variables in my regression model is that they would be strongly correlated: the value of one can be predicted from the others.

If one or more independent variables can be predicted from the others in a regression model, then we have a multicollinearity issue:
- The model cannot distinguish the effect of one dummy variable from the effect of another.
- This is the Dummy Variable Trap.
- If you have 9 dummy variables, only include 8 of them.
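
The trap can be seen numerically. The sketch below is a minimal illustration on a made-up six-row sample (the rows and the `states` variable are assumptions, not taken from the dataset): with an intercept column plus one dummy per state, the columns are linearly dependent, and dropping one dummy removes the dependency.

```python
import numpy as np
import pandas as pd

# Hypothetical toy sample of state labels (not rows from the real dataset)
states = pd.Series(["New York", "California", "Florida",
                    "New York", "Florida", "California"])

# One dummy column per state versus one fewer (first category dropped)
all_dummies = pd.get_dummies(states, dtype=float)
safe_dummies = pd.get_dummies(states, dtype=float, drop_first=True)

# Design matrices that include the intercept column of ones
intercept = np.ones((len(states), 1))
X_trap = np.hstack([intercept, all_dummies.to_numpy()])
X_safe = np.hstack([intercept, safe_dummies.to_numpy()])

# The dummies always sum to the intercept column, so X_trap loses one rank
rank_trap = np.linalg.matrix_rank(X_trap)
rank_safe = np.linalg.matrix_rank(X_safe)
print(X_trap.shape[1], rank_trap)  # 4 columns, rank 3
print(X_safe.shape[1], rank_safe)  # 3 columns, rank 3
```

A rank lower than the number of columns means the coefficients are not uniquely determined, which is exactly the multicollinearity problem described above.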

# Building a Regression Model
Keep only the variables that are statistically significant in predicting the outcome. **_Selecting the right variables is the process of building a good model._**

### Backward Elimination
1. Select a significance level to stay in the model (e.g. SL = 0.05)
2. Fit the full model with all possible predictors 
3. Consider the predictor with the highest P-value. If P > SL, go to step 4; otherwise the model is finished.
4. Remove the predictor. 
5. Fit the model without this variable, then return to step 3.
    - Refitting changes the coefficients (i.e. the constants are going to be different).
    - You need to perform this step because once you remove a variable, it affects all the other variables in your whole regression.









# Prepare the Dataset
Feature scaling is not needed for multiple linear regression: the fitted coefficients adjust to the scale of each feature, so I can fit the MLR to the training set right after data preprocessing.
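
A quick way to see why scaling is optional for this kind of model is to fit ordinary least squares on raw and on standardised features and compare the predictions. The sketch below uses synthetic data and a hypothetical helper `ols_predict`; it is an illustration, not part of the notebook's pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 1000, size=(50, 3))  # synthetic features on a large scale
y = 5.0 + X @ np.array([0.8, -0.1, 0.03]) + rng.normal(scale=1.0, size=50)

def ols_predict(features, target):
    """Fit least squares with an intercept; return coefficients and fitted values."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef, design @ coef

coef_raw, pred_raw = ols_predict(X, y)

# Standardise every feature to zero mean and unit variance, then refit
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
coef_scaled, pred_scaled = ols_predict(X_scaled, y)

# The coefficients differ, but the predictions are the same
print(np.allclose(pred_raw, pred_scaled))
```

Scaling only changes the units in which each coefficient is expressed, not the fitted values.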

In [8]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt 

#Importing the dataset
dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:,:-1].values #independent variables: every column except the last
Y = dataset.iloc[:,4].values #dependent variable: Profit (column index 4)

# Encoding Categorical Variables
I need to encode the **State** independent variable since it is a categorical variable.
- **_Encoding categorical variables must be done before splitting the data set_**
- Fitting and transforming the State column with a LabelEncoder object **changes the text to numbers**

In [9]:
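#Encoding categorical data
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
#State variable below, changing text to numbers
X[:,3] = labelencoder_X.fit_transform(X[:,3])
#Turning the State numbers into dummy variables (one column per state)
#OneHotEncoder(categorical_features=[3]) was removed in scikit-learn 0.22,
#so the State column is selected with a ColumnTransformer instead
ct = ColumnTransformer([('state', OneHotEncoder(), [3])], remainder='passthrough', sparse_threshold=0)
X = ct.fit_transform(X).astype(float)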

# Avoiding Dummy Variable Trap 
I keep the columns from index 1 onward to avoid the trap, since I want one fewer than the 3 dummy variables I have.
- I don't have to do this, since the regression model library takes care of it, but I wanted to explain this trap.

In [10]:
X = X[:,1:]
#No longer contains the first column

# Splitting Dataset into Training Set and Test Set 
Using 20% of the dataset for testing:
- 10 observations in the test set and 40 observations in training set

In [18]:
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.2,random_state=0)

# Fitting Multiple Linear Regression Model to Training Set
Profit is a linear combination of the independent variables.

- I create an object of the class LinearRegression.
- I then fit this regressor object to the training set.

In [41]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train,Y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

# Testing Performance of Multiple Linear Regression Model on Test Set

I predict the test set results by creating a vector of predictions.
- I compare the real profits and the predicted profits for the ten test observations.
- **Y_test contains the real profits**
- **y_pred contains the vector of predictions from the model**

In [23]:
y_pred = regressor.predict(X_test)
print(Y_test, y_pred)
#Comparing real profits and predicted profits
#Results below show a multiple linear dependency relationship

[103282.38 144259.4  146121.95  77798.83 191050.39 105008.31  81229.06
  97483.56 110352.25 166187.94] [103015.20159796 132582.27760815 132447.73845175  71976.09851258
 178537.48221056 116161.24230166  67851.69209676  98791.73374687
 113969.43533013 167921.06569551]
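
The side-by-side comparison can be condensed into single numbers. The sketch below is a minimal example that uses only the ten real and predicted profits printed above and computes the root mean squared error and R squared by hand; the variable names are illustrative.

```python
import numpy as np

# Values copied from the printed output above
y_true = np.array([103282.38, 144259.4, 146121.95, 77798.83, 191050.39,
                   105008.31, 81229.06, 97483.56, 110352.25, 166187.94])
y_hat = np.array([103015.20159796, 132582.27760815, 132447.73845175,
                  71976.09851258, 178537.48221056, 116161.24230166,
                  67851.69209676, 98791.73374687, 113969.43533013,
                  167921.06569551])

residuals = y_true - y_hat
rmse = np.sqrt(np.mean(residuals ** 2))            # typical size of an error
ss_res = np.sum(residuals ** 2)                    # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total variation
r2 = 1 - ss_res / ss_tot

print(round(rmse, 2), round(r2, 4))
```

On these ten observations R squared comes out at roughly 0.93, so the model explains most of the variation in test-set profit.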


# Building to the Optimal Model (Backward Elimination)
Creating a new matrix of optimal features, meaning **_independent variables that have a high impact on what we are predicting (profit)_**
- Removing variables that are not statistically significant from the model.
- Finding an optimal team of independent variables that has a high impact on profit.
- I inspect the summary of my regressor to read the P-values in order to build a more robust model.


I write out the index of each column of X below, since one index is removed at each step afterwards.

The lower the P-value, the more significant the independent variable.

In [31]:
import statsmodels.api as sm
#Unlike sklearn's LinearRegression, statsmodels OLS does not add the constant b0 itself,
#so a column of ones has to be added to the matrix of independent variables

#Prepending a column of 50 ones to X
#Run this cell only once: every extra run prepends another column of ones
X = np.append(arr = np.ones((50,1)).astype(int), values = X ,axis = 1)

#The first column contains 50 ones and corresponds to b0*x0 with x0 = 1
#Displaying the row at index 10
print(X[10,:])

[1.0000000e+00 1.0000000e+00 1.0000000e+00 1.0000000e+00 1.0000000e+00
 1.0000000e+00 0.0000000e+00 1.0191308e+05 1.1059411e+05 2.2916095e+05]


In [38]:
#Optimal Model
X_opt = X[:,[0,1,2,3,4,5]]

#Fitting a Multiple Linear Regression model to the candidate optimal feature matrix X_opt and Y
#Creating a new regressor from the statsmodels library
#This is not the same regressor as the sklearn LinearRegression one
regressor_OLS = sm.OLS(endog = Y, exog = X_opt).fit() 
#OLS is fitted to X_opt and Y

In [39]:
#Removing variables with p-values > 0.05
X_opt = X[:,[0,1,3,4,5]]
regressor_OLS = sm.OLS(endog=Y,exog=X_opt).fit()
regressor_OLS.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.014 |
| Model: | OLS | Adj. R-squared: | -0.007 |
| Method: | Least Squares | F-statistic: | 0.6575 |
| Date: | Tue, 18 Dec 2018 | Prob (F-statistic): | 0.421 |
| Time: | 22:57:03 | Log-Likelihood: | -600.31 |
| No. Observations: | 50 | AIC: | 1205.0 |
| Df Residuals: | 48 | BIC: | 1208.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 2.721e+04 | 1734.185 | 15.689 | 0.000 | 2.37e+04 | 3.07e+04 |
| x1 | 2.721e+04 | 1734.185 | 15.689 | 0.000 | 2.37e+04 | 3.07e+04 |
| x2 | 2.721e+04 | 1734.185 | 15.689 | 0.000 | 2.37e+04 | 3.07e+04 |
| x3 | 2.721e+04 | 1734.185 | 15.689 | 0.000 | 2.37e+04 | 3.07e+04 |
| x4 | 9943.2135 | 1.23e+04 | 0.811 | 0.421 | -1.47e+04 | 3.46e+04 |

| | | | |
|---|---|---|---|
| Omnibus: | 0.077 | Durbin-Watson: | 0.058 |
| Prob(Omnibus): | 0.962 | Jarque-Bera (JB): | 0.123 |
| Skew: | 0.08 | Prob(JB): | 0.94 |
| Kurtosis: | 2.817 | Cond. No. | 1.77e+33 |


In [33]:
#Removing variables
X_opt = X[:,[0,3,4,5]]
regressor_OLS = sm.OLS(endog=Y,exog=X_opt).fit()
regressor_OLS.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.014 |
| Model: | OLS | Adj. R-squared: | -0.007 |
| Method: | Least Squares | F-statistic: | 0.6575 |
| Date: | Tue, 18 Dec 2018 | Prob (F-statistic): | 0.421 |
| Time: | 22:55:19 | Log-Likelihood: | -600.31 |
| No. Observations: | 50 | AIC: | 1205.0 |
| Df Residuals: | 48 | BIC: | 1208.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 3.628e+04 | 2312.246 | 15.689 | 0.000 | 3.16e+04 | 4.09e+04 |
| x1 | 3.628e+04 | 2312.246 | 15.689 | 0.000 | 3.16e+04 | 4.09e+04 |
| x2 | 3.628e+04 | 2312.246 | 15.689 | 0.000 | 3.16e+04 | 4.09e+04 |
| x3 | 9943.2135 | 1.23e+04 | 0.811 | 0.421 | -1.47e+04 | 3.46e+04 |

| | | | |
|---|---|---|---|
| Omnibus: | 0.077 | Durbin-Watson: | 0.058 |
| Prob(Omnibus): | 0.962 | Jarque-Bera (JB): | 0.123 |
| Skew: | 0.08 | Prob(JB): | 0.94 |
| Kurtosis: | 2.817 | Cond. No. | 6.03e+18 |


In [35]:
#Removing variables
X_opt = X[:,[0,3,5]]
regressor_OLS = sm.OLS(endog=Y,exog=X_opt).fit()
regressor_OLS.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.014 |
| Model: | OLS | Adj. R-squared: | -0.007 |
| Method: | Least Squares | F-statistic: | 0.6575 |
| Date: | Tue, 18 Dec 2018 | Prob (F-statistic): | 0.421 |
| Time: | 22:55:31 | Log-Likelihood: | -600.31 |
| No. Observations: | 50 | AIC: | 1205.0 |
| Df Residuals: | 48 | BIC: | 1208.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 5.442e+04 | 3468.369 | 15.689 | 0.000 | 4.74e+04 | 6.14e+04 |
| x1 | 5.442e+04 | 3468.369 | 15.689 | 0.000 | 4.74e+04 | 6.14e+04 |
| x2 | 9943.2135 | 1.23e+04 | 0.811 | 0.421 | -1.47e+04 | 3.46e+04 |

| | | | |
|---|---|---|---|
| Omnibus: | 0.077 | Durbin-Watson: | 0.058 |
| Prob(Omnibus): | 0.962 | Jarque-Bera (JB): | 0.123 |
| Skew: | 0.08 | Prob(JB): | 0.94 |
| Kurtosis: | 2.817 | Cond. No. | 1.17e+18 |


In [40]:
X_opt = X[:,[0,3]]
regressor_OLS = sm.OLS(endog=Y,exog=X_opt).fit()
regressor_OLS.summary()



| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.0 |
| Model: | OLS | Adj. R-squared: | 0.0 |
| Method: | Least Squares | F-statistic: | |
| Date: | Tue, 18 Dec 2018 | Prob (F-statistic): | |
| Time: | 22:59:52 | Log-Likelihood: | -600.65 |
| No. Observations: | 50 | AIC: | 1203.0 |
| Df Residuals: | 49 | BIC: | 1205.0 |
| Df Model: | 0 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 5.601e+04 | 2850.077 | 19.651 | 0.000 | 5.03e+04 | 6.17e+04 |
| x1 | 5.601e+04 | 2850.077 | 19.651 | 0.000 | 5.03e+04 | 6.17e+04 |

| | | | |
|---|---|---|---|
| Omnibus: | 0.018 | Durbin-Watson: | 0.02 |
| Prob(Omnibus): | 0.991 | Jarque-Bera (JB): | 0.068 |
| Skew: | 0.023 | Prob(JB): | 0.966 |
| Kurtosis: | 2.825 | Cond. No. | 4.91e+16 |


# Results 
The optimal team of independent variables that can predict profit with the highest statistical significance is R&D Spend.

### Problems with Results 
Should I keep Marketing Spend? Should I have removed it?
- I will use other metrics such as R Squared and Adjusted R Squared to decide with more certainty whether I should keep the independent variable Marketing Spend.
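
For reference, the usual definition of Adjusted R Squared is sketched below, where $n$ is the number of observations and $p$ is the number of predictors excluding the constant (these symbols are introduced here for illustration):

$$
R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
$$

R Squared never decreases when a predictor is added, while the adjusted version only rises if the new predictor improves the fit by more than the penalty for the extra term. As a check against the summaries above, $R^2 = 0.014$ with $n = 50$ and $p = 1$ gives $1 - 0.986 \times 49/48 \approx -0.007$, which matches the reported Adj. R-squared.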

### Purpose 
The purpose of analysing this problem is to inform investors which startups to invest their money in, by predicting future profits for new startups from the regression model built on the sample in this dataset. It also shows which independent variable has the strongest effect on **profit**, so investors can choose companies wisely based on statistically significant factors that affect profit.

