<h1>Profit Prediction from Expenses</h1>

by <b>Santanu Sikder</b>

In [None]:
# Import the necessary modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

<h3>Data Loading and Cleaning</h3>

In [None]:
# Load the dataset containing the various expenses of 50 startups and their profits
df = pd.read_csv("../input/various-expenses-and-the-profits-of-50-startups/50_Startups.csv")
# Preview
df.head()

In [None]:
# Check for missing values
df.info()

There are no missing values, and none of the values are in a non-standard format.
The only thing left is to convert the categorical variable State into a numerical one.

In [None]:
# LabelEncoder from sklearn.preprocessing can be used to convert multiple categories in a categorical variable into numerical values
catToNum = LabelEncoder()
df["State"] = catToNum.fit_transform(df["State"])
# Preview
df.head()
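LabelEncoder maps the three states to 0, 1 and 2, which a linear model reads as an ordered quantity. As a possible alternative (not the approach used in this notebook), the sketch below one-hot encodes a small made-up frame with `pd.get_dummies`. The `toy` data and its values are purely illustrative assumptions.

```python
import pandas as pd

# Hypothetical miniature of the State column (values are illustrative only)
toy = pd.DataFrame({
    "State": ["New York", "California", "Florida", "New York"],
    "Profit": [1.0, 2.0, 3.0, 4.0],
})

# One indicator column per state; drop_first removes the first category
# (California) so the indicators are not redundant with an intercept
encoded = pd.get_dummies(toy, columns = ["State"], drop_first = True, dtype = int)
print(encoded)
```

With this encoding, each remaining state gets its own coefficient relative to the dropped baseline state, instead of a single slope over the codes 0, 1, 2.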

In [None]:
# Change the name of the R&D Spend column to RD and Marketing Spend to Marketing
df.rename(columns = {"R&D Spend" : "RD", "Marketing Spend" : "Marketing"}, inplace = True)
# Preview
df.head()

<h3>Data Analysis and Visualisation</h3>

Let's study the dataframe a bit more using describe and info.

In [None]:
df.describe()

In [None]:
df.info()

To understand the above data better, I'm going to plot a boxplot.

In [None]:
plt.figure(figsize = (14, 8))
# Passing the whole dataframe draws one box per numeric column
sns.boxplot(data = df)

From the above analysis, we see that Marketing costs have the widest range and Administration costs the narrowest. Each type of expense is distributed almost evenly on either side of its median, and none of them has an outlier.
The range of the profits, however, is quite small, and Profit has one outlier at the lower end.

Also, the Research & Development expenses (RD) and the Administration costs have ranges similar to that of the Profits, whereas the range of the Marketing costs is very different (large and varied).
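A boxplot marks a point as an outlier when it falls beyond the whiskers. The sketch below reproduces that rule by hand, assuming seaborn's default whisker length of 1.5 times the interquartile range. The helper name `iqr_outliers` and the `sample` values are hypothetical, not taken from the dataset.

```python
import numpy as np

def iqr_outliers(values, k = 1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype = float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Made-up numbers with one obviously extreme value
sample = [10, 12, 13, 14, 15, 16, 18, 60]
print(iqr_outliers(sample))
```

Applying such a helper to each column (for example `iqr_outliers(df["Profit"])`) is one way to confirm numerically what the plot shows.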

Let's look at the correlation table to check how well each expense predicts the profit in terms of a linear relationship.

In [None]:
df.corr()

This tells us that Administration costs and State have a weak linear relationship with Profit, while RD has a very strong one (quite obvious).
But the State categorical variable MIGHT still be a useful factor in determining Profit when combined with other factors. Let's check how much the profits differ between States.

In [None]:
# Check the differences in the means
# California: 0, Florida: 1, New York: 2
df.groupby("State", as_index = True).mean()["Profit"].to_frame()

In [None]:
# Check the differences in minimums and maximums
df.groupby("State")["Profit"].agg(["min", "max"])

The maximum Profits yielded by the States are almost equal while the minimum values have visible differences.
Time for the boxplots!

In [None]:
sns.boxplot(data = df, x = "State", y = "Profit")

For a deeper understanding, I'll run an F-test and examine the F-values for each pair of groups.

In [None]:
stateGroups = df.groupby("State", as_index = True)
gCal, gFlor, gNY = stateGroups.get_group(0)["Profit"], stateGroups.get_group(1)["Profit"], stateGroups.get_group(2)["Profit"]
# So I got the groups above and now I'll create a list for making the pairing easy during the F-tests
profitGroups = [gCal, gFlor, gNY, gCal]
for i in range(3):
    f_score, p_value = stats.f_oneway(profitGroups[i], profitGroups[i + 1])
    print('''\
    Category pair: (%d, %d)
    F-Score = %f
    P-Value = %f
    '''%(i, (i + 1) % 3, f_score, p_value))

The above results agree with the correlation table: on its own, State is not a good linear predictor of Profit.
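As a side note on the test itself, `stats.f_oneway` also accepts all three groups in one call, and for exactly two groups its F-value equals the square of the pooled-variance t-statistic. The sketch below illustrates both points on synthetic data. The groups `g0`, `g1`, `g2` and their sizes are assumptions standing in for the per-state Profit series.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-ins for the three per-state Profit series
g0 = rng.normal(100, 10, size = 17)
g1 = rng.normal(100, 10, size = 16)
g2 = rng.normal(100, 10, size = 17)

# One-way ANOVA across all three groups in a single call
f_all, p_all = stats.f_oneway(g0, g1, g2)

# With two groups, the F-test matches the equal-variance t-test
f_pair, p_pair = stats.f_oneway(g0, g1)
t_pair, p_t = stats.ttest_ind(g0, g1)

print("All groups: F = %f, p = %f" % (f_all, p_all))
print("Pair (0, 1): F = %f, t^2 = %f" % (f_pair, t_pair ** 2))
```

A single three-group test avoids running several pairwise tests, each of which carries its own chance of a false positive.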

Let's plot the regression plots between Profit and each of the expenses, and also the State!

In [None]:
fig, axes = plt.subplots(2, 2, figsize = (14, 12))
fig.suptitle("Regression plots: Profit vs expenses and State", fontsize = 20)
# Flatten the 2 x 2 grid of axes into a single sequence
axesList = axes.flatten()

for i, axis in enumerate(axesList):
    col = df.columns[i]
    sns.regplot(data = df, x = col, y = "Profit", ax = axis)
    axis.set_title("Profit vs %s %s"%(col, "costs" if col != "State" else "categories"), fontsize = 15)

# plt.savefig("Profit_vs_Expenses.jpg")
plt.show()

<h3>Processing Data for Model Training</h3>

Fortunately, there's no need for standard scaling: ordinary least squares gives the same predictions whether or not the features are rescaled. So I'll perform the train-test split directly.

In [None]:
trainX, testX, trainy, testy = train_test_split(df[df.columns[:-1]], df[["Profit"]], test_size = 1/5, random_state = 0)
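To illustrate why scaling can be skipped for plain linear regression, the sketch below fits the same model on raw and on standardised features and compares the predictions. The data is synthetic, and the column scales and coefficients are arbitrary assumptions chosen only to make the features differ widely in magnitude.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Three synthetic features on very different scales
X = rng.uniform(0, 1, size = (40, 3)) * np.array([1e5, 1e3, 1.0])
y = X @ np.array([0.8, -2.0, 50.0]) + rng.normal(0, 1, size = 40)

# Fit once on the raw features
raw_pred = LinearRegression().fit(X, y).predict(X)

# Fit again on standardised features
X_scaled = StandardScaler().fit_transform(X)
scaled_pred = LinearRegression().fit(X_scaled, y).predict(X_scaled)

# The coefficients differ, but the predictions should agree
print(np.allclose(raw_pred, scaled_pred))
```

This holds for unregularised least squares. Regularised models such as Ridge or Lasso do depend on feature scale.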

<h3>Model Training</h3>

In [None]:
# Instantiate the linear regression object
regr = LinearRegression()

In [None]:
# Train using the training set
regr.fit(trainX, trainy)

In [None]:
# Print out the coefficient matrix (n_targets x n_features) and the intercept vector (n_targets,)
print('''\
Coefficients: %s
Intercepts: %s
'''%(regr.coef_, regr.intercept_))
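The shapes mentioned in the comment depend on how the target is passed. As a small sketch on synthetic data (the arrays `X` and `y` below are assumptions, not the startup data), a 2-D target such as `df[["Profit"]]` yields a coefficient matrix, while a 1-D target yields a flat vector and a scalar intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size = (30, 4))
# Noise-free target so the fitted coefficients are easy to recognise
y = X @ np.array([1.0, 2.0, 3.0, 4.0]) + 5.0

# 2-D target (one column), like df[["Profit"]]
model_2d = LinearRegression().fit(X, y.reshape(-1, 1))
# 1-D target, like df["Profit"]
model_1d = LinearRegression().fit(X, y)

print(model_2d.coef_.shape, model_2d.intercept_.shape)
print(model_1d.coef_.shape, np.ndim(model_1d.intercept_))
```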

<h3>Model Evaluation and Testing our Model</h3>

I'll use MSE, RMSE and R2-Score to evaluate this model.

In [None]:
# Firstly, let's create the trainyCap and testyCap arrays by predicting values based on trainX and testX, respectively
trainyCap = regr.predict(trainX)
testyCap = regr.predict(testX)

Now I'll check the statistical scores in both cases.

In [None]:
# Training set's evaluation
mse = mean_squared_error(trainy, trainyCap)
rmse = np.sqrt(mse) # or mean_squared_error(trainy, trainyCap, squared = False) on scikit-learn versions that accept `squared`
r2 = r2_score(trainy, trainyCap)

In [None]:
# Test set's evaluation
mse2 = mean_squared_error(testy, testyCap)
rmse2 = np.sqrt(mse2) # or mean_squared_error(testy, testyCap, squared = False) on scikit-learn versions that accept `squared`
r22 = r2_score(testy, testyCap)

In [None]:
# Convert the above results into a dataframe
dfEvaluation = pd.DataFrame({"Train" : [mse, rmse, r2], "Test" : [mse2, rmse2, r22]}, index = ["MSE", "RMSE", "R2 Score"])

In [None]:
# View
dfEvaluation
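For reference, the three metrics in the table can be written out by hand. The sketch below computes them with NumPy on a tiny made-up example and checks them against scikit-learn. The arrays `y_true` and `y_pred` are illustrative assumptions, not results from the model above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

# MSE: mean of the squared residuals
mse_manual = np.mean((y_true - y_pred) ** 2)
# RMSE: square root of the MSE, in the same units as the target
rmse_manual = np.sqrt(mse_manual)
# R2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, rmse_manual, r2_manual)
print(mean_squared_error(y_true, y_pred), r2_score(y_true, y_pred))
```

Since RMSE is expressed in the units of Profit, it is usually the easiest of the three to compare against the typical size of the profits.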