# **Predicting values with Multiple Linear Regression**

The objective of this problem is to build a model that predicts which startups could give the best return on investment. Such a model could help venture funds decide which startups to invest in.

The solution is organized into the following sections:
* **Section 1** - Data Analysis and Insights
* **Section 2** - Getting the training set and the test set
* **Section 3** - Getting the model
* **Section 4** - Conclusions about the model

### Section 1 - Data Analysis and Insights

In [None]:
#importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

In [None]:
#getting the data
df = pd.read_csv("../input/startup-logistic-regression/50_Startups.csv")

In [None]:
#show the first five elements of the dataframe
df.head()

In [None]:
#segregation of numerical columns and categorical columns

cat_col = df.select_dtypes(include=['object']).columns
num_col = df.select_dtypes(exclude=['object']).columns
df_cat = df[cat_col]
df_num = df[num_col]

In [None]:
#Information about the dataframe
df.info()

In [None]:
#checking the fraction of null values in each column
df_null = df.isna().mean()
df_null.sort_values(ascending = False)

As can be seen, there are no null values in the dataset, so no preprocessing is needed to handle them.

In [None]:
df.dtypes

In [None]:
outliers = ['Profit']
plt.rcParams['figure.figsize'] = [8,8]
sns.boxplot(data=df[outliers], orient="v", palette="Set1" ,whis=1.5,saturation=1, width=0.7)
plt.title("Outliers Variable Distribution", fontsize = 14, fontweight = 'bold')
plt.ylabel("Profit Range", fontweight = 'bold')
plt.xlabel("Continuous Variable", fontweight = 'bold')
df.shape

There is an outlier in the profit. However, it will not be removed because the dataset is small (just 50 entries).
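
The boxplot above flags points with the 1.5 x IQR rule (`whis=1.5`). A minimal sketch of the same rule with NumPy is shown below. The profit values are made up for illustration and are not taken from the dataset.

```python
import numpy as np

# Hypothetical profit values, for illustration only (not the real dataset).
profit = np.array([14_000.0, 90_000.0, 97_000.0, 105_000.0, 110_000.0,
                   118_000.0, 125_000.0, 132_000.0, 141_000.0, 150_000.0])

# First and third quartiles and the interquartile range.
q1, q3 = np.percentile(profit, [25, 75])
iqr = q3 - q1

# Whisker limits: anything outside them is drawn as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outlier_mask = (profit < lower) | (profit > upper)
print("limits:", lower, upper)
print("outliers:", profit[outlier_mask])
```

Applying the same mask to `df['Profit']` would list the rows that the boxplot marks as outliers.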

In [None]:
#checking for duplicates

df.loc[df.duplicated()]

Once again, no preprocessing is necessary because the dataset does not contain any duplicated rows.

In [None]:
#Visualizing the different startup locations
plt.rcParams['figure.figsize'] = [5,5]
#only color is passed: pandas warns when color and colormap are used together
ax=df['State'].value_counts().plot(kind='bar', color="green")
ax.title.set_text('State Locations')
plt.xlabel("Names of the States",fontweight = 'bold')
plt.ylabel("Number of Startups",fontweight = 'bold')

Insight:
* The startups are spread almost evenly across the states

In [None]:
plt.figure(figsize=(8,8))

plt.title('Startup Profit Distribution Plot')
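#distplot is deprecated in recent seaborn versions; histplot with kde=True draws the histogram and the density curve
sns.histplot(df['Profit'], kde=True)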

Insights:
* The distribution of startup profits looks roughly Gaussian. In other words, profits close to the average (around 100k) are the most frequent.

In [None]:
ax = sns.pairplot(df[num_col])

Insights:
* As expected by intuition, the more a startup invests in Research and Development, the greater its profit.
* The marketing spend seems to grow with the profit. The reason could be that the more clients know the product (or service) of the startup, the greater the revenue.
* The administration spend seems to have no relation to the profit.

In [None]:
plt.figure(figsize = (8, 8))
#correlation is computed on the numerical columns only (State is a string column)
sns.heatmap(df[num_col].corr(), cmap="RdYlGn")
plt.show()

The correlation heatmap seems to confirm what the previous plots showed.
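
The values behind such a heatmap are Pearson correlation coefficients, one per pair of columns. The sketch below computes them with NumPy on synthetic data. The variable names and the numbers are assumptions chosen to mimic the pattern seen above, and they are not the real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# Synthetic columns (hypothetical): profit depends on R&D spend,
# while administration spend is generated independently.
rd = rng.uniform(0, 160_000, n)
admin = rng.uniform(50_000, 180_000, n)
profit = 50_000 + 0.8 * rd + rng.normal(0, 8_000, n)

# Rows of the stacked array are variables, so corr is a 3x3 matrix.
corr = np.corrcoef(np.vstack([rd, admin, profit]))

print("R&D vs profit:  ", round(corr[0, 2], 3))
print("Admin vs profit:", round(corr[1, 2], 3))
```

A coefficient close to 1 means a strong linear relation, and a coefficient close to 0 means almost no linear relation.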

In [None]:
plt.figure(figsize=(8, 8))
sns.boxplot(x = 'State', y = 'Profit', data = df)
plt.show()

Insights:
* All the outliers shown are in the state of New York.
* The startups located in the state of California have the widest range between the minimum and the maximum profit.

In [None]:
#Separating the independent variables (X) and the dependent variable (y)
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

In [None]:
#one-hot encoding the categorical column (State, index 3); the other columns pass through unchanged
ct = ColumnTransformer(transformers = [("encoder", OneHotEncoder(), [3])],
                       remainder = "passthrough")
X = np.array(ct.fit_transform(X))
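
After this step the encoded state columns come first and the numerical columns follow, because `ColumnTransformer` places the `remainder` columns last. The sketch below reproduces that layout by hand with NumPy to make the column order explicit. It is an illustration, not scikit-learn's implementation, and the rows are hypothetical.

```python
import numpy as np

# Hypothetical rows in the dataset's column order:
# R&D Spend, Administration, Marketing Spend, State
rows = np.array([
    [165000.0, 137000.0, 472000.0, "New York"],
    [162000.0, 151000.0, 444000.0, "California"],
    [153000.0, 101000.0, 408000.0, "Florida"],
    [144000.0, 119000.0, 383000.0, "New York"],
], dtype=object)

states = rows[:, 3].astype(str)
numeric = rows[:, :3].astype(float)

# Categories in sorted order, as OneHotEncoder orders string categories.
categories = np.unique(states)

# One column per category: 1.0 where the row belongs to it, else 0.0.
one_hot = (states[:, None] == categories[None, :]).astype(float)

# Encoded columns first, untouched numerical columns after them.
X_encoded = np.hstack([one_hot, numeric])

print(categories)
print(X_encoded[0])
```

With this layout, the first three model coefficients belong to California, Florida and New York, and the last three to the spend columns.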

### Section 2 - Getting the training set and the test set

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
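
The split above is random, so each run selects different test rows and the scores reported later change slightly. `train_test_split` accepts a `random_state` argument to fix the shuffle. The sketch below shows the idea with a hand-written helper; `split_indices` is a hypothetical function written for this illustration and is not part of scikit-learn.

```python
import numpy as np

def split_indices(n_samples, test_size, seed):
    """Shuffle row indices with a fixed seed and cut off a test portion."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_test = int(np.ceil(n_samples * test_size))
    # Returns (train indices, test indices).
    return order[n_test:], order[:n_test]

# Same seed, same split; a different seed shuffles the rows differently.
train_a, test_a = split_indices(50, 0.2, seed=0)
train_b, test_b = split_indices(50, 0.2, seed=0)
train_c, test_c = split_indices(50, 0.2, seed=1)

print("test rows (seed 0):", np.sort(test_a))
print("test rows (seed 1):", np.sort(test_c))
```

With 50 rows and `test_size = 0.2`, the test set has only 10 rows, so a single score depends noticeably on which rows are drawn.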

### Section 3 - Getting the Model

In [None]:
regressor = LinearRegression()
regressor.fit(X_train, y_train)

In [None]:
y_pred = regressor.predict(X_test)
np.set_printoptions(precision = 2)
print(np.concatenate((y_pred.reshape(len(y_pred), 1), y_test.reshape(len(y_test), 1)), 1))

Just by looking, the predicted values and the real values seem to be very close to each other.

### Section 4 - Conclusions

In [None]:
#r2_score expects the true values first and the predictions second
r2_score(y_test, y_pred)

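The R2 score is 0.967, which means that the linear regression explains about 96.7% of the variance of the profit in the test set. The exact value changes from run to run because the train/test split is random, but it indicates a very good model to predict future values of profits in startups.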
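
R2 compares the squared prediction errors with the spread of the true values around their mean, so it is not symmetric: `r2_score` in scikit-learn takes the true values first and the predictions second. Below is a minimal sketch of the formula with NumPy. The helper `r2` and the numbers are made up for illustration.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot

# Hypothetical values, for illustration only.
y_true = np.array([100.0, 120.0, 140.0, 160.0])
y_pred = np.array([110.0, 120.0, 130.0, 150.0])

print("correct order:", r2(y_true, y_pred))
print("swapped order:", r2(y_pred, y_true))
```

Swapping the arguments changes the denominator, so the two calls give different scores.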

In [None]:
print("R2 value is " + str(r2_score(y_test, y_pred)))
print("The Mean Squared Error is " + str(mean_squared_error(y_test, y_pred)))

In [None]:
# getting the coefficients of the model
regressor.coef_

In [None]:
#getting the constant of the model
regressor.intercept_

In [None]:
#Visualizing once again the independent variables
X[:5, :]

In [None]:
#Visualizing once again the dataset
df.head()

Defining the symbols used in the equation:

$ x_{1} $ = R&D Spend

$ x_{2} $ = Administration Spend

$ x_{3} $ = Marketing Spend

$ d_{1} $ = Boolean variable to represent if the startup is in the state of California or not

$ d_{2} $ = Boolean variable to represent if the startup is in the state of Florida or not

$ d_{3} $ = Boolean variable to represent if the startup is in the state of New York or not

$y$ = Profit


So, the equation is:

$y = b_{0} + 0.813x_{1} - 0.0161x_{2} + 0.0247x_{3} - 526d_{1} - 28d_{2} + 555d_{3}$

where $b_{0}$ is the constant of the model given by `regressor.intercept_`.
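
The three state coefficients nearly cancel (-526 - 28 + 555 is close to 0). One possible explanation: when all three dummy variables are kept together with a constant term, one of them is redundant, and a minimum-norm least-squares solution spreads the state effect so that the dummy coefficients sum to zero. The sketch below shows this on synthetic data with NumPy. It assumes the model is fitted by centering the data and solving least squares, and none of the numbers come from the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# Synthetic data (hypothetical): one numerical feature and a
# three-level category encoded as three dummy columns.
x = rng.uniform(0, 100, n)
state = np.arange(n) % 3
dummies = np.eye(3)[state]
effect = np.array([-5.0, 0.0, 8.0])
y = 20.0 + 0.8 * x + effect[state] + rng.normal(0, 1.0, n)

X = np.column_stack([dummies, x])

# With a constant column the dummies are redundant: rank 4, not 5.
design = np.column_stack([np.ones(n), X])
rank = np.linalg.matrix_rank(design)

# Center the data, then take the minimum-norm least-squares solution.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
coef, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
intercept = y.mean() - X.mean(axis=0) @ coef

print("rank of design matrix:", rank)
print("dummy coefficients:", coef[:3], "sum:", coef[:3].sum())
print("slope:", coef[3], "intercept:", intercept)
```

Under this reading, each state coefficient is the offset of that state from the average state effect, and the constant term absorbs the average.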

An R2 value of 0.967 indicates that the model can be a useful aid when deciding in which startup a venture fund should invest.
Some interesting points:
* Profit increases with the Research & Development spend. So it is good to invest in companies that have a policy of investing a good part of their profit in Research and Development;
* Profit also increases with the Marketing spend (although not as strongly as with the Research & Development spend). This could happen because clients get to know the portfolio and/or quality of the firm better. Marketing is also responsible for showing the startup to the world.
* The Administration spend shows almost no correlation with the profit. So, according to this model, startups could reduce the spend in this sector without an obvious effect on profit.
