# About the Dataset
The dataset contains data on 50 business startups from New York, California, and Florida (roughly 17 per state). The variables are Profit, R&D Spend, Administration spending, and Marketing Spend, plus the State of each company.
- We have to build a model that predicts profit from the companies' data.

## Multiple Linear Regression

### 1.1 Importing libraries

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### 1.2 Load dataset

In [None]:

dataset = pd.read_csv('../input/50-startups/50_Startups.csv')
dataset.head()

### 2 EDA

#### 2.1 Identifying the missing values

In [None]:
dataset.isnull().sum()

- There are no null values.

#### 2.2 checking the datatype

In [None]:
dataset.info()

- The __State__ column is of object type. Later we will convert this column into dummy variables.

#### 2.3 Descriptive Analysis

In [None]:
dataset.describe()

#### 2.4 Checking the distribution of 'R&D Spend', 'Administration' & 'Marketing Spend'

In [None]:
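# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True draws the same histogram + density curve
sns.histplot(dataset['R&D Spend'], kde = True, color = 'green')

In [None]:
sns.histplot(dataset['Administration'], kde = True, color = 'red')

In [None]:
sns.histplot(dataset['Marketing Spend'], kde = True, color = 'orange')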

#### 2.5 Checking the relation between the features and the output variable

In [None]:
sns.pairplot(dataset)

- In the pair plot above, R&D Spend has a clear linear relationship with Profit.
- It therefore looks like the most significant feature compared to the others.

#### 2.6 Finding the correlation

In [None]:
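# numeric_only=True skips the non-numeric State column (recent pandas versions raise an error otherwise)
dataset.corr(numeric_only = True)

In [None]:
sns.heatmap(dataset.corr(numeric_only = True), annot = True)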

- In the heatmap above, __R&D Spend__ is highly correlated with __Profit__.
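
The numbers in the correlation matrix are Pearson coefficients. As a minimal sketch of how one such coefficient is computed, the snippet below evaluates it by hand and checks it against `np.corrcoef`. The `spend` and `profit` arrays are made-up values for illustration, not rows from the dataset, and `pearson_r` is a hypothetical helper name.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    return (a_centered @ b_centered) / np.sqrt(
        (a_centered @ a_centered) * (b_centered @ b_centered)
    )

# Hypothetical spend/profit figures, for illustration only.
spend = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
profit = np.array([32.0, 51.0, 74.0, 88.0, 111.0])

r_manual = pearson_r(spend, profit)
r_numpy = np.corrcoef(spend, profit)[0, 1]  # same value from NumPy's correlation matrix
print(r_manual, r_numpy)
```

A value close to 1 means the two variables move together almost linearly, which is what the heatmap shows for R&D Spend and Profit.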

### 3. Data Preparation

#### 3.1 Splitting data into dependent & independent variables

In [None]:

X = dataset.iloc[:,:-1].values

y = dataset.iloc[:,4].values

print(X)

In [None]:
print(y)

#### 3.2 Encoding categorical data

To turn the categorical State column into numbers, we first use the LabelEncoder class. Label encoding alone is not sufficient, because the integer codes imply an order between the states that does not exist and can mislead the model. To remove this problem we then apply OneHotEncoder, which creates one dummy variable per state. Below is the code for it:


In [None]:

from sklearn.preprocessing import LabelEncoder,OneHotEncoder
from sklearn.compose import ColumnTransformer

labelencoder_X = LabelEncoder()
X[:,3] = labelencoder_X.fit_transform(X[:,3])

# State column (index 3): one-hot encode it into 3 dummy columns, one per state;
# the remaining columns are passed through unchanged
ct = ColumnTransformer([("State", OneHotEncoder(), [3])], remainder = 'passthrough')
X = ct.fit_transform(X)
print(X)
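
To make the difference between the two encodings concrete, here is a small NumPy-only sketch. It does not use the scikit-learn classes from the cell above; it only mimics what they produce, and the `states` array is a made-up sample.

```python
import numpy as np

# Hypothetical sample of state labels, for illustration only.
states = np.array(["New York", "California", "Florida", "New York"])

# Label encoding: each category becomes an integer (alphabetical order here),
# which wrongly suggests California < Florida < New York.
categories, codes = np.unique(states, return_inverse=True)
print(categories)
print(codes)

# One-hot encoding: one 0/1 column per category, so no order is implied.
one_hot = np.eye(len(categories))[codes]
print(one_hot)
```

Each row of `one_hot` contains exactly one 1, in the column of the state that row belongs to.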



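#### 3.3 Avoiding the dummy variable trap

The dummy variables of a one-hot encoded column always sum to 1, so together with the intercept they are perfectly collinear. Removing one dummy variable (here the first) avoids this multicollinearity in the model.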

In [None]:

X = X[:,1:]
print(X)
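
The collinearity behind the dummy variable trap can be checked numerically. The sketch below builds a tiny design matrix with three made-up categories and compares its rank with and without the first dummy column; the names `full` and `reduced` are illustrative only.

```python
import numpy as np

codes = np.array([0, 1, 2, 0, 1, 2])   # three hypothetical categories
dummies = np.eye(3)[codes]             # full one-hot block, 3 columns
intercept = np.ones((len(codes), 1))

full = np.hstack([intercept, dummies])             # intercept + all three dummies
reduced = np.hstack([intercept, dummies[:, 1:]])   # first dummy dropped

# A matrix whose rank is lower than its column count has linearly dependent columns.
print(np.linalg.matrix_rank(full), full.shape[1])
print(np.linalg.matrix_rank(reduced), reduced.shape[1])
```

With all dummies kept, the matrix has four columns but only rank three; after dropping one dummy, rank and column count agree.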

#### 3.4 Splitting the dataset into the Training set and Test set

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test, y_train, y_test = train_test_split(X,y,test_size = 1/3, random_state = 0)

### 4 Modeling

#### 4.1 Training the Multiple Linear Regression model on the Training set

In [None]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)


### 5. Making the predictions and evaluating the model

#### 5.1 Predicting the Test set results

In [None]:
y_pred = regressor.predict(X_test)

#### 5.2 Comparing the actual profit with the predicted profit

In [None]:

for actual, predicted in zip(y_test, y_pred):
    print(actual, "      ", predicted)  # compare actual profit vs predicted profit
    

#### 5.3 Evaluating the train & test score performance

In [None]:
print(regressor.score(X_train, y_train))
print(regressor.score(X_test,y_test))
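
For a scikit-learn regressor, `score` returns the coefficient of determination R². As a rough sketch of what that number means, it can be computed by hand from actual and predicted values. The arrays below are made-up figures, and `r_squared` is a hypothetical helper, not part of the notebook's pipeline.

```python
import numpy as np

def r_squared(y_true, y_hat):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y_true - y_hat) ** 2)          # error left after the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of always predicting the mean
    return 1.0 - ss_res / ss_tot

# Hypothetical actual and predicted profits, for illustration only.
actual = np.array([100.0, 120.0, 140.0, 160.0])
predicted = np.array([110.0, 115.0, 135.0, 165.0])
print(r_squared(actual, predicted))
```

A score of 1 means perfect predictions, and a score of 0 means the model does no better than always predicting the mean profit.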



### 6. Finding the optimal model using backward elimination

#### Backward elimination:
Backward elimination is a feature selection technique used while building a machine learning model. It removes the features that do not have a significant effect on the dependent variable. Starting from the full model, we repeatedly drop the feature with the highest p-value, as long as that p-value exceeds the significance level (SL = 0.05), and refit the model.


Unnecessary features increase the complexity of the model. Hence it is good to keep only the most significant features, so that the model stays simple and easier to interpret.
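
The manual procedure can also be written as a loop. The following is only a sketch under stated assumptions: it uses NumPy alone and approximates the p-values with a normal distribution instead of the t distribution that the statsmodels summary reports, so values near the threshold may differ slightly. The helper names (`ols_p_values`, `backward_elimination`) and the synthetic data are hypothetical.

```python
import math
import numpy as np

def ols_p_values(X, y):
    """Two-sided p-values of the OLS coefficients (normal approximation, an assumption)."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    sigma2 = residuals @ residuals / (n - k)    # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)       # covariance of the coefficients
    z = beta / np.sqrt(np.diag(cov))
    return np.array([math.erfc(abs(v) / math.sqrt(2)) for v in z])

def backward_elimination(X, y, sl=0.05):
    """Return the indices of the columns of X that survive the elimination."""
    kept = list(range(X.shape[1]))
    while len(kept) > 1:
        p = ols_p_values(X[:, kept], y)
        worst = int(np.argmax(p))   # position of the least significant column
        if p[worst] <= sl:
            break                   # every remaining column is significant
        del kept[worst]
    return kept

# Synthetic data: column 0 is the constant, column 1 drives y, column 2 is pure noise.
rng = np.random.default_rng(0)
n = 50
x_signal = rng.normal(size=n)
x_noise = rng.normal(size=n)
y_demo = 3.0 + 2.0 * x_signal + rng.normal(scale=0.5, size=n)
X_demo = np.column_stack([np.ones(n), x_signal, x_noise])

kept_columns = backward_elimination(X_demo, y_demo)
print(kept_columns)
```

On the notebook's data the same function would need `X.astype(float)` as input, since the encoded matrix may have object dtype.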

In [None]:

import statsmodels.api as sm
# sm.OLS does not add an intercept (theta_0) by default, so we model it as theta_0 * x0 with x0 = 1:
# create a column of 1's and put it at the start of X.
X = np.append(arr = np.ones((X.shape[0], 1)).astype(int), values = X, axis = 1)  # adds one extra column at the start of X
print(X)


# np.append(arr = X, values = np.ones((X.shape[0], 1)), axis = 1)  # would add the column at the end of X instead

#### 6.1 applying backward elimination

In [None]:
X_opt =X[:, [0,1,2,3,4,5]].astype(float)
regressor_OLS = sm.OLS(y,X_opt).fit()  
regressor_OLS.summary()


In [None]:

X_opt =X[:, [0,1,3,4,5]].astype(float)   # removed x2 (dummy var) because p > SL (significance level = 0.05)
regressor_OLS = sm.OLS(y,X_opt).fit()           #    p > SL (0.990 > 0.05)
regressor_OLS.summary()


In [None]:
X_opt =X[:, [0,3,4,5]].astype(float)   # removed x1 (dummy var) because p > SL (0.953 > 0.05)
regressor_OLS = sm.OLS(y,X_opt).fit()
regressor_OLS.summary()


In [None]:
X_opt =X[:, [0,3,5]].astype(float)   # removed x4 (Administration) because p > SL (0.608 > 0.05)
regressor_OLS = sm.OLS(y,X_opt).fit()
regressor_OLS.summary()


In [None]:
X_opt =X[:, [0,3]].astype(float)   # removed x5 (Marketing Spend) because p > SL (0.060 > 0.05)
regressor_OLS = sm.OLS(y,X_opt).fit()
regressor_OLS.summary()


### 7. Apply the optimal model (R&D Spend only)

####  7.1 Extracting Independent and dependent Variable

In [None]:
  
x_BE= dataset.iloc[:,[0]].values  
y_BE= dataset.iloc[:, -1].values  
  

#### 7.2 Splitting the dataset into training and test set

In [None]:
 
from sklearn.model_selection import train_test_split  
x_BE_train, x_BE_test, y_BE_train, y_BE_test= train_test_split(x_BE, y_BE, test_size= 0.20, random_state=0)  
  

#### 7.3 Fitting the model to the training set

In [None]:
  
from sklearn.linear_model import LinearRegression  
regressor= LinearRegression()  
regressor.fit(np.array(x_BE_train).reshape(-1,1), y_BE_train)  
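
With a single feature, the fitted line has a closed-form solution: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. The sketch below illustrates this on made-up numbers and cross-checks the result with `np.polyfit`; it is an illustration of the formula, not the notebook's fitted model.

```python
import numpy as np

# Hypothetical feature and target values, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

# Least-squares slope and intercept for one feature.
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# np.polyfit with deg=1 returns [slope, intercept] of the same least-squares line.
slope_check, intercept_check = np.polyfit(x, y, deg=1)
print(slope, intercept)
```

After fitting, `regressor.coef_` and `regressor.intercept_` hold the corresponding values for the R&D Spend model.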

####  7.4 Predicting the Test set result

In [None]:

y_pred= regressor.predict(x_BE_test)  


#### 7.5 Checking the score

In [None]:
    
print('Train Score: ', regressor.score(x_BE_train, y_BE_train))  
print('Test Score: ', regressor.score(x_BE_test, y_BE_test))  

#### 7.6 Comparison between actual profit and predicted profit

In [None]:
for actual, predicted in zip(y_BE_test, y_pred):
    print(actual, "      ", predicted)
    

### 8. Visualizing the final result 

#### 8.1 Visualizing R&D Spend against Profit

R&D Spend is a significant variable for the prediction, so we can predict Profit well using this variable alone.
The plot below shows the relation between R&D Spend and Profit.

In [None]:
# scatter instead of a line plot: a line would connect the points in row order
plt.scatter(dataset.iloc[:,0], dataset.iloc[:, 4], color = 'green')
plt.xlabel('R&D Spend')
plt.ylabel('Profit')
plt.title('Relation between R&D Spend and Profit')
plt.grid()

#### 8.2 Visualizing the train set result 

In [None]:
plt.scatter(x_BE_train, y_BE_train, color = 'red')
plt.plot(x_BE_train, regressor.predict(x_BE_train), color = 'blue')
plt.title('R&D spend vs Profit (Training set)')
plt.xlabel('R&D spend')
plt.ylabel('Profit')
plt.grid(color='gold', linestyle='-.', linewidth=0.7)
plt.show()


####  8.3 Visualizing the test set result 

In [None]:
plt.scatter(x_BE_test, y_BE_test, color = 'red')
plt.plot(x_BE_test, regressor.predict(x_BE_test), color = 'blue')
plt.title('R&D spend vs Profit (Test set)')
plt.xlabel('R&D spend')
plt.ylabel('Profit')
plt.grid(color = 'green', linestyle='-.', linewidth=0.7)
plt.show()

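### 9. Conclusion


We obtained this result using only one independent variable (R&D Spend) instead of all four. The model is therefore simpler, and backward elimination indicates that R&D Spend carries most of the predictive signal for Profit. Note that the scores in 5.3 and 7.5 come from different train/test splits (test_size = 1/3 vs 0.20), so they are not directly comparable.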