# **Predict Restaurant Revenue Using Regression**
**By : Garry Ariel**

We will try to predict restaurant revenue from the information provided in the dataset. The work is divided into the following steps.
1. Analyze and preprocess the dataset
2. Feed the data into a regression model
3. Use the trained regression model from the previous step to predict the revenue of other restaurants

Before we begin analyzing and preprocessing the dataset, we need to import some packages and read the dataset in.

In [None]:
# Import some packages needed
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn import linear_model
from datetime import datetime

# Read the train and test dataset
train_df = pd.read_csv("../input/restaurant-revenue-prediction/train.csv.zip")
test_df = pd.read_csv("../input/restaurant-revenue-prediction/test.csv.zip")

# Remove Id field from train dataset
train_df = train_df.drop(["Id"], axis = 1)

# Look at some train data examples
train_df.head(10)

## **Analyze and Preprocess the Dataset**

In this first step, we will gather as much information as we need from the dataset, starting with the Open Date field. This field tells us when the restaurant started its business. In its raw form it is only a categorical variable with a very large number of unique values, so we first convert it into an integer with a meaningful measurement. We can do it in the following way. First, we take an anchor date. In this case we take the date when this competition launched. Then we calculate how old each restaurant is by counting the number of days between its open date and the anchor date. Open Date now gives us the age of the restaurant, which is measurable. After that, we will draw a scatter plot of Open Date against revenue and try to learn something from their relationship.

In [None]:
# Turn Open Date field into datetime type
train_df["Open Date"] = train_df["Open Date"].astype('datetime64[ns]')

# Get competition date
competition_date = datetime.strptime('2015-09-14', '%Y-%m-%d')

# Convert Open Date field into an integer: the absolute number of days
# between the open date and the competition date
train_df["Open Date"] = (train_df["Open Date"] - competition_date).dt.days.abs()

# Look at the converted Open Date field
train_df[["Open Date"]].head(10)
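
The same date-to-age conversion can be wrapped in a small helper. The following is only a minimal sketch on made-up dates: the function name `days_since_open` and the sample values are assumptions for illustration and are not part of the competition data.

```python
from datetime import datetime

import pandas as pd


def days_since_open(open_dates, anchor):
    """Return the absolute number of whole days between each date and the anchor."""
    parsed = pd.to_datetime(open_dates)
    return (pd.Timestamp(anchor) - parsed).dt.days.abs()


# Hypothetical sample dates, for illustration only
sample_dates = pd.Series(["2015-09-13", "2014-09-14", "2015-09-14"])
anchor_date = datetime(2015, 9, 14)

ages = days_since_open(sample_dates, anchor_date)
print(ages.tolist())
```

Keeping the anchor date as a parameter makes it easy to apply exactly the same conversion to the train and test datasets.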

In [None]:
# Draw scatter plot of Open Date vs Revenue
plt.scatter(train_df["Open Date"], train_df["revenue"])
plt.xlabel("Open Date")
plt.ylabel("Revenue")
plt.show()

First of all, we notice that there are 3 outliers (the points with revenue above 1.25 × 10^7, i.e. 12,500,000). We will remove those 3 rows from the train dataset in the next step. Apart from those outliers, this plot doesn't give us any reliable information. It is likely that there is little to no correlation between Open Date and Revenue. We will talk more about correlation later.

In [None]:
# Get index of those outliers
index_out = train_df[train_df['revenue'] > 12500000].index

# Remove outlier data from train dataset
train_df = train_df.drop(index_out)

# Copy train_df for later observations (an explicit copy, not just a second reference)
for_corr_df = train_df.copy()

# To confirm, we now only have 134 rows
train_df
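
The outlier removal above can equivalently be written as a boolean mask that keeps the rows at or below the cutoff. Here is a minimal sketch on a toy dataframe; the revenue values and the name `REVENUE_CUTOFF` are made up for illustration.

```python
import pandas as pd

# Toy revenue values, for illustration only
toy_df = pd.DataFrame({"revenue": [3_000_000, 13_000_000, 5_500_000, 19_000_000]})

# Same cutoff as used above for the scatter plot outliers
REVENUE_CUTOFF = 12_500_000

# Keep only the rows at or below the cutoff; copy() gives an independent dataframe
kept_df = toy_df[toy_df["revenue"] <= REVENUE_CUTOFF].copy()
print(len(toy_df) - len(kept_df), "rows removed")
```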

Next, we will gather information from the City field. First, we will count how many unique values there are and how often each value occurs. We can use `value_counts` as follows.

In [None]:
# Get the frequency of each unique value on City field
train_df[["City"]].value_counts()

There are 37 unique values, most of which appear only once. Categorical data has to be converted into one-hot-encoding form before we can use it, but turning City into 37 separate features would be too many. So we reduce it to 8 features: each city with a frequency greater than or equal to 4 gets its own class (7 classes), and all remaining cities are categorized as "other". The following code does this and produces the one-hot-encoding form.

In [None]:
# Turn City field into one-hot-encoding form (0/1 integers)
city_dummy_df = pd.get_dummies(train_df[["City"]], prefix = ['City'], dtype = int)

# The 7 cities that keep their own column
top_city_cols = ["City_İstanbul", "City_Ankara", "City_İzmir", "City_Bursa", "City_Samsun", "City_Antalya", "City_Sakarya"]

# Create new column titled City_Other: 1 when the row belongs to none of the 7 cities.
# This is vectorized, which avoids the unreliable chained assignment df[col][index] = value.
city_dummy_df["City_Other"] = (city_dummy_df[top_city_cols].sum(axis = 1) == 0).astype(int)

# Choose only 8 features
city_dummy_df = city_dummy_df[top_city_cols + ["City_Other"]]

# Merge that one-hot-encoding dataframe to train_df
train_df = pd.merge(train_df, city_dummy_df, left_index = True, right_index = True)

# Look at the result
train_df.head(10)
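
Instead of listing the frequent cities by hand, the rare categories can be pooled by a frequency threshold before encoding. The following is a minimal sketch of that idea, not the exact method used above: the helper name `one_hot_with_other` and the toy city labels are assumptions for illustration.

```python
import pandas as pd


def one_hot_with_other(values, min_count, prefix):
    """One-hot encode a Series, pooling categories seen fewer than min_count times."""
    counts = values.value_counts()
    frequent = counts[counts >= min_count].index
    # Replace every rare category by the single label "Other"
    pooled = values.where(values.isin(frequent), "Other")
    return pd.get_dummies(pooled, prefix=prefix, dtype=int)


# Toy city labels, for illustration only
cities = pd.Series(["A", "A", "A", "B", "B", "C", "D"])
encoded = one_hot_with_other(cities, min_count=2, prefix="City")
print(encoded.columns.tolist())
```

With `min_count = 4` on the real City field, this should give the same kind of "7 cities plus other" split as described above, assuming the frequency counts are as stated.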

Next, we will take a look at the City Group field. We will draw a countplot for it, which shows us which unique values exist and how often each value appears.

In [None]:
# Draw a chart about the frequency of each value of City Group field
ax = sns.countplot(x = "City Group", data = train_df)

We can see that there are only 2 unique values, and they are almost balanced. So we will simply turn the field into standard one-hot-encoding form using the following code.

In [None]:
# Turn City Group field into one-hot-encoding form
group_dummy_df = pd.get_dummies(train_df[["City Group"]], prefix = ['Group'])

# Merge that one-hot-encoding dataframe to train_df
train_df = pd.merge(train_df, group_dummy_df, left_index = True, right_index = True)

# Look at the result
train_df.head(10)

Next, we will take a look at the Type field. We will draw a countplot for it, which shows us which unique values exist and how often each value appears.

In [None]:
# Draw a chart about the frequency of each value of Type field (from train dataset and test dataset)
plt.figure(1)
ax = sns.countplot(x = "Type", data = train_df)
plt.figure(2)
ax = sns.countplot(x = "Type", data = test_df)

As we can see, there are 3 unique values in the train dataset, but one of them (DT) appears very rarely. So we will change all DT values into either IL or FC, choosing whichever type has the closer average revenue. Furthermore, the test dataset has 4 unique values for this field, which means the train dataset gives us no information about the MB type. Since MB appears only in the test dataset, where revenue is unknown, we cannot assign it to FC or IL based on revenue. So we will simply turn MB into FC (the most common type) later.

In [None]:
# Get information about revenue's average in each type
rev_avg_df = train_df[["Type", "revenue"]].groupby("Type").mean()
type_freq_df = train_df[["Type", "revenue"]].groupby("Type").count()
rev_info_by_type_df = pd.merge(type_freq_df, rev_avg_df, on = "Type").sort_values(by = ['revenue_x'], ascending = False)
rev_info_by_type_df = rev_info_by_type_df.rename(columns = {"revenue_y": "Average Rev", "revenue_x": "Frequency"})
rev_info_by_type_df

From the result above, we can see that DT's average revenue is closer to IL's than to FC's, so we will replace DT with IL.

In [None]:
# Replace all DT with IL (vectorized; avoids the unreliable chained assignment df[col][index] = value)
train_df["Type"] = train_df["Type"].replace("DT", "IL")

# Turn into one-hot-encoding form
type_dummy_df = pd.get_dummies(train_df[["Type"]], prefix = ['Type'])

# Merge that one-hot-encoding dataframe to train_df
train_df = pd.merge(train_df, type_dummy_df, left_index = True, right_index = True)

# Look at the result
train_df.head(10)
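
The choice of IL as the replacement for DT was read off the table by eye. It can also be computed directly, as in the following minimal sketch. The toy revenue numbers are made up for illustration and are not the competition values.

```python
import pandas as pd

# Toy data, for illustration only
toy_df = pd.DataFrame({
    "Type": ["FC", "FC", "IL", "IL", "DT"],
    "revenue": [4.0, 6.0, 3.0, 3.5, 3.4],
})
rare_type = "DT"

# Average revenue per type
type_means = toy_df.groupby("Type")["revenue"].mean()

# Absolute gap between the rare type's average and every other type's average
gaps = (type_means.drop(rare_type) - type_means[rare_type]).abs()
closest_type = gaps.idxmin()

# Merge the rare type into the closest one
toy_df["Type"] = toy_df["Type"].replace(rare_type, closest_type)
print(closest_type)
```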

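Next, we will look at the correlation between each pair of numerical features (Open Date, P1 - P37, and Revenue) and draw a heatmap. Before we calculate the correlation, we normalize the dataset with min-max scaling. This rescaling is linear, so it puts the features on a common scale without changing the Pearson correlation values.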

In [None]:
# Get sub dataframe of all numerical type fields
numerical_df = for_corr_df.drop(["City", "City Group", "Type"], axis = 1)

# Normalize the dataset
numerical_df_val = numerical_df.values # Returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
numerical_scaled = min_max_scaler.fit_transform(numerical_df_val)
normal_numerical_df = pd.DataFrame(numerical_scaled)

# Get correlation matrix
corr_df = normal_numerical_df.corr()
corr_df = corr_df.abs()

# Draw correlation heatmap for all numerical type fields
plt.figure(figsize = (39, 39))
ax = sns.heatmap(corr_df)
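
Because min-max scaling is a linear rescaling of each column, the Pearson correlation matrix is the same before and after it. The following minimal sketch checks this on synthetic data and also ranks the features by their absolute correlation with the target, which is easier to read than the heatmap row. The column names and generated values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    "age_days": rng.integers(100, 5000, size=50),
    "P2": rng.normal(4.0, 1.5, size=50),
})
toy_df["revenue"] = 1000.0 * toy_df["age_days"] + rng.normal(0.0, 5e5, size=50)

# Correlation on the raw values and on the min-max scaled values
raw_corr = toy_df.corr()
scaled_df = pd.DataFrame(MinMaxScaler().fit_transform(toy_df), columns=toy_df.columns)
scaled_corr = scaled_df.corr()
print(np.allclose(raw_corr.values, scaled_corr.values))

# Features ranked by absolute correlation with the target
ranking = raw_corr["revenue"].drop("revenue").abs().sort_values(ascending=False)
print(ranking)
```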

We want to find features that correlate well with revenue. In the heatmap, darker tiles mean weaker correlation and brighter tiles mean stronger correlation. Row 38 holds the correlation between revenue and every other feature. Most tiles in that row are almost completely black (looking closer, the correlation with revenue is at most around 0.2). Still, a few features are "brighter" than the rest, and we will focus on those: columns 0, 2, and 28, which correspond to Open Date, P2, and P28 respectively. Next we will look at the scatter plots of P2 vs Revenue and P28 vs Revenue.

In [None]:
# Draw scatter plot for each feature vs revenue
features_index = [2, 28]
counter = 1
for index in features_index:
    plt.figure(counter)
    x_field_name = "P" + str(index)
    plt.scatter(train_df[x_field_name], train_df["revenue"])
    plt.xlabel(x_field_name)
    plt.ylabel("Revenue")
    counter += 1

# Draw the plot in a frame
plt.show()

## **Feed the Dataset Into Regression Model**

Before we build the model, we will do feature selection. From our observation, there is no strong correlation between any of P1 - P37 and revenue. P2 and P28 are at least better than the others, so we will choose them as features. Open Date also correlates with revenue more strongly than most of the P1 - P37 columns, so we will include it too. Finally, we will use the one-hot-encoded categorical variables (City, City Group, and Type) as features.

In [None]:
# Feature selection
to_drop = []
for index in list(range(1, 38)):
    if index not in features_index:
        to_drop.append("P" + str(index))
train_df = train_df.drop(to_drop, axis = 1)

# Look at the result
train_df

Now, we will build and train our model. We will use the ordinary linear regression model (`LinearRegression`) from the sklearn package. Before we fit the model, we separate the features from the target and convert both into numpy arrays.

In [None]:
# Create regression object and prepare the dataset
regr = linear_model.LinearRegression()
train_x = np.asanyarray(train_df.drop(['revenue', 'City', 'City Group', 'Type'], axis = 1))
train_y = np.asanyarray(train_df[['revenue']])

# Feed the data into the model
regr.fit(train_x, train_y)
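
One detail of the cell above is that `train_y` is a column vector of shape (n, 1), so the model's predictions will also have shape (n, 1). The following minimal sketch on synthetic data illustrates this. The data and coefficients are made up, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noiseless linear data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0

model_2d = LinearRegression().fit(X, y.reshape(-1, 1))  # column-vector target
model_1d = LinearRegression().fit(X, y)                 # flat target

pred_2d = model_2d.predict(X)
pred_1d = model_1d.predict(X)
print(pred_2d.shape, pred_1d.shape)

# ravel() flattens the column-vector predictions into a 1-D array
print(np.allclose(pred_2d.ravel(), pred_1d))
```

Both fits give the same predictions, so the choice only affects the array shape we have to handle afterwards.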

## **Predict Restaurant Revenue**

Now we are ready to use the model to predict restaurant revenue from the information given in the test dataset. Before we can predict, we need to preprocess the test data the same way we preprocessed the train dataset (convert Open Date, turn City, City Group, and Type into one-hot-encoding form, etc.).

In [None]:
# --------------- Convert Open Date ---------------
# Turn Open Date field into datetime type
test_df["Open Date"] = test_df["Open Date"].astype('datetime64[ns]')

# Convert Open Date field into an integer: the absolute number of days
# between the open date and the competition date
test_df["Open Date"] = (test_df["Open Date"] - competition_date).dt.days.abs()

# Look at the converted Open Date field
test_df[["Open Date"]].head(10)

In [None]:
# --------------- Convert City ---------------
# Turn City field into one-hot-encoding form (0/1 integers)
city_dummy_df = pd.get_dummies(test_df[["City"]], prefix = ['City'], dtype = int)

# The 7 cities that keep their own column (same as for the train dataset)
top_city_cols = ["City_İstanbul", "City_Ankara", "City_İzmir", "City_Bursa", "City_Samsun", "City_Antalya", "City_Sakarya"]

# Create new column titled City_Other: 1 when the row belongs to none of the 7 cities.
# This is vectorized, which avoids the unreliable chained assignment df[col][index] = value.
city_dummy_df["City_Other"] = (city_dummy_df[top_city_cols].sum(axis = 1) == 0).astype(int)

# Choose only 8 features
city_dummy_df = city_dummy_df[top_city_cols + ["City_Other"]]

# Merge that one-hot-encoding dataframe to test_df
test_df = pd.merge(test_df, city_dummy_df, left_index = True, right_index = True)

# Look at the result
test_df.head(10)

In [None]:
# --------------- Convert City Group ---------------
# Turn City Group field into one-hot-encoding form
group_dummy_df = pd.get_dummies(test_df[["City Group"]], prefix = ['Group'])

# Merge that one-hot-encoding dataframe to test_df
test_df = pd.merge(test_df, group_dummy_df, left_index = True, right_index = True)

# Look at the result
test_df.head(10)

In [None]:
# --------------- Convert Type ---------------
# Replace all DT with IL, and all MB with FC
# (vectorized; avoids the unreliable chained assignment df[col][index] = value)
test_df["Type"] = test_df["Type"].replace({"DT": "IL", "MB": "FC"})

# Turn into one-hot-encoding form
type_dummy_df = pd.get_dummies(test_df[["Type"]], prefix = ['Type'])

# Merge that one-hot-encoding dataframe to test_df
test_df = pd.merge(test_df, type_dummy_df, left_index = True, right_index = True)

# Look at the result
test_df.head(10)

In [None]:
# --------------- Feature Selection ---------------
to_drop = []
for index in list(range(1, 38)):
    if index not in features_index:
        to_drop.append("P" + str(index))
test_df = test_df.drop(to_drop, axis = 1)

# Look at the result
test_df.head(10)

In [None]:
# Prepare the dataset
test_x = np.asanyarray(test_df.drop(['Id', 'City', 'City Group', 'Type'], axis = 1))
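
The prediction step relies on the test features having exactly the same columns, in the same order, as the train features. One way to guard against a mismatch is to reindex the test columns against the train columns. The following is a minimal sketch on toy dataframes; the column names and values are made up for illustration.

```python
import pandas as pd

# Toy feature tables, for illustration only
train_features = pd.DataFrame({"City_A": [1, 0], "City_B": [0, 1], "Type_FC": [1, 1]})
test_features = pd.DataFrame({"Type_FC": [1], "City_A": [1], "Type_MB": [0]})

# Keep only the training columns, in training order.
# Columns missing from the test data are created and filled with 0.
aligned_test = test_features.reindex(columns=train_features.columns, fill_value=0)
print(aligned_test.columns.tolist())
```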

We are ready to predict restaurant revenue using our trained regression model. After we get the predictions, we will add them to a pandas dataframe and save the result as a csv file.

In [None]:
# Predict restaurant revenue
y_predict = regr.predict(test_x)

# Save into csv format
test_df["Prediction"] = np.ravel(y_predict)  # flatten the (n, 1) predictions into one column
submit_df = test_df[["Id", "Prediction"]]
submit_df.to_csv("submission.csv", index = False)