# **Unzipping the Dataset**

In [1]:

import zipfile as z

# Open the archive at the source path and extract it into the destination folder.
# The context manager closes the archive automatically, even if extraction fails.
with z.ZipFile("/content/train.csv.zip", "r") as zip_ref:
    zip_ref.extractall("/content/")

FileNotFoundError: [Errno 2] No such file or directory: '/content/train.csv.zip'
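
The traceback above means the archive was not at the expected path when the cell ran. A minimal sketch of a guard around the extraction step follows; `extract_if_present` is a hypothetical helper introduced here for illustration, and the paths are the ones used in the cell above.

```python
import zipfile
from pathlib import Path


def extract_if_present(archive_path, dest_dir):
    """Extract archive_path into dest_dir; return False if the archive is missing.

    Hypothetical helper for illustration, not part of the original notebook.
    """
    archive = Path(archive_path)
    if not archive.is_file():
        print(f"Archive not found: {archive}. Upload it before running this cell.")
        return False
    # The context manager closes the archive even if extraction fails.
    with zipfile.ZipFile(archive, "r") as zip_ref:
        zip_ref.extractall(dest_dir)
    return True


extracted = extract_if_present("/content/train.csv.zip", "/content/")
```

If `extracted` is `False`, the later `pd.read_csv` call would fail as well, so it is worth checking before moving on.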

# **Importing important libraries**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


In [None]:
df=pd.read_csv("/content/train.csv")
df

# **Data Analysis & Data Visualization**

In [None]:
df.shape

This says that the dataset has 137 rows and 43 columns. Simply put, there are 137 data points in the dataset and 43 features, one of which is the target feature.


In [None]:
df.columns

In [None]:

df.isna().sum()

This says there are no missing values, so no imputation is needed. Good for us.

In [None]:
df.info()

From here we see that **4 of the features are object type**, which the model cannot use directly. So they need to be **encoded**.
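
As a sketch of how the columns that need encoding could be listed programmatically rather than read off `df.info()`, the example below uses a toy frame with made-up values (only the column names echo the dataset) and selects everything that is not numeric.

```python
import pandas as pd

# Toy frame with made-up values; only the column names echo the dataset.
toy = pd.DataFrame({
    "City": ["A", "B", "A"],
    "Type": ["FC", "IL", "FC"],
    "P1": [4, 2, 3],
    "revenue": [5.6, 3.1, 4.2],
})

# Columns that are not numeric still need encoding before model fitting.
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_cols)
```

Applied to `df`, the same call should list the four object-type features mentioned above.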

Another observation is that we don't need the feature 'Id': it is only a row identifier and is **not going to give any insight into the revenue**. So we will simply drop it.

In [None]:
# The 'Id' column is only an identifier, so we drop it.
df=df.drop('Id',axis=1)
df

Now I am going to convert the 'Open Date' feature to **datetime format** so that I can extract the month and year from it. I want to do this because the full date by itself doesn't give me any insight into the revenue, but the **month and year surely do.**

In [None]:
df['Open Date'] = pd.to_datetime(df['Open Date'])
df

Here I am extracting the **month** from the feature **'Open Date'**.

In [None]:
df['month'] = df['Open Date'].dt.month

Here I am extracting the **year** from the feature **'Open Date'**.

In [None]:
df['year'] = df['Open Date'].dt.year
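
To make the month and year extraction concrete, here is a small sketch on made-up opening dates. The `MM/DD/YYYY` format string is an assumption about how the dates are stored; the `.dt` accessor is the vectorised pandas way to read date parts from a datetime column.

```python
import pandas as pd

# Made-up opening dates; the MM/DD/YYYY format is an assumption about the data.
toy = pd.DataFrame({"Open Date": ["07/17/1999", "02/14/2008", "12/01/2011"]})
toy["Open Date"] = pd.to_datetime(toy["Open Date"], format="%m/%d/%Y")

# The .dt accessor extracts date parts for the whole column at once.
toy["month"] = toy["Open Date"].dt.month
toy["year"] = toy["Open Date"].dt.year
print(toy[["month", "year"]])
```

Passing an explicit `format` avoids ambiguity between day-first and month-first dates.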

Now we will **drop 'Open Date'** as well, since we have extracted all the information we need from it and it is of no further use.

In [None]:
df=df.drop(['Open Date'],axis=1)
df

Now let's try to **visualize the trends** in month and year to understand how they affect the revenue.

In [None]:
sns.countplot(x='month', data=df)

From the above plot we can see how often each month occurs in the dataset. We have the most data for the last 5 months of the year, and the highest counts are for **August** and **December**. Now let's see which month had the most revenue. For this I will find the **mean of the revenue** for each month.

In [None]:
df.groupby('month')['revenue'].mean()


From here we can see that **January gave the highest mean revenue** to the restaurants, followed by **September** and **October**. Let's plot a bar graph of the same data to visualize these trends.

In [None]:
sns.barplot(x='month', y='revenue', data=df)

So the bar graph gives the same information. Now let's do the same kind of analysis for the newly generated feature **'year'**.

In [None]:
sns.countplot(x='year', data=df)

Here the plot is hard to read because the labels on the x-axis overlap. Let's enlarge the figure a bit!

In [None]:
plt.figure(figsize=(15,6))
sns.countplot(x='year', data=df)
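
Besides enlarging the figure, rotating the tick labels is another way to keep the x-axis readable. The sketch below uses plain matplotlib on a made-up series of years standing in for `df['year']`; it is an illustration of the idea, not the plot produced by the cell above.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up years standing in for df['year'].
years = pd.Series([1999, 2008, 2008, 2011, 2011, 2011, 2013])

# Count rows per year and keep the years in chronological order.
counts = years.value_counts().sort_index()

fig, ax = plt.subplots(figsize=(15, 6))
ax.bar(counts.index.astype(str), counts.values)
ax.set_xlabel("year")
ax.set_ylabel("count")
# Rotate the labels so neighbouring years do not overlap.
ax.tick_params(axis="x", labelrotation=45)
```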

From here we can see that most of the data is from the years **2008-2013**, with the largest share from 2011. The other years contribute very few data points, which is going to affect the results as well.


In [None]:
df.groupby('year')['revenue'].mean()


In [None]:
plt.figure(figsize=(15,6))
sns.barplot(x='year', y='revenue', data=df)

Out of all the years, the highest mean revenue was in **2000**, followed by **1999** and **2005**. Keep in mind that these years contribute very few data points, so their averages are less reliable.
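
Because a yearly mean can rest on very few restaurants, it can help to look at the mean and the number of rows side by side. A minimal sketch on made-up numbers (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Made-up revenues; 1999 has a single row, 2011 has three.
toy = pd.DataFrame({
    "year": [1999, 2008, 2008, 2011, 2011, 2011],
    "revenue": [9.0, 4.0, 6.0, 3.0, 5.0, 4.0],
})

# One row per year with the mean revenue and how many rows it is based on.
summary = toy.groupby("year")["revenue"].agg(["mean", "count"])
print(summary)
```

In this toy frame the year with the highest mean is also the one with a single observation, which is the kind of pattern worth checking for in the real data.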

# **Data Preprocessing**

In [None]:
df['Type'].value_counts()


There are 3 distinct values in the feature **'Type'**. We can encode the values like this:

**FC as 0;**

**IL as 1;**

**DT as 2;**

Which number is assigned to which category is arbitrary.

In [None]:
ty={'FC':0,'IL':1,'DT':2}
df['Type'] = df['Type'].map(ty)
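
Since the integer codes are arbitrary, models that treat inputs as magnitudes (linear regression, nearest neighbours) will still read an order into them. One-hot encoding is a common alternative that avoids this; the sketch below shows it on a toy column and is not what the notebook itself does.

```python
import pandas as pd

# Toy column with the three category labels seen in 'Type'.
toy = pd.DataFrame({"Type": ["FC", "IL", "DT", "FC"]})

# One indicator column per category instead of a single integer code.
one_hot = pd.get_dummies(toy["Type"], prefix="Type")
print(one_hot)
```

Tree-based models are usually less sensitive to this choice, so the integer mapping above remains a reasonable starting point.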


In [None]:
df

In [None]:
df['City Group'].value_counts()


In [None]:
cg={'Big Cities':0,'Other':1}
df['City Group'] = df['City Group'].map(cg)


In [None]:
df

In [None]:
a=df['City'].value_counts()


Here, manually creating the dictionary is inefficient because there are many cities. So we will take the city names from the index of the value counts and use each name as a **key** of the dictionary and its position in the index as the corresponding **value**.

In [None]:
b=a.index

In [None]:
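# Map each city name to its position in the value_counts index.
c = {city: code for code, city in enumerate(b)}
print(c)  # print once, after the dictionary is complete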
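
The mapping step can be sketched on a toy series to show both the resulting dictionary and one caveat: a city that is not in the dictionary (for example, one that only appears in new data) becomes `NaN` after `map`. The city names here are made up.

```python
import pandas as pd

# Made-up city names with distinct frequencies.
cities = pd.Series(["A", "B", "A", "C", "A", "B"])

# value_counts sorts by frequency, so the most common city gets code 0.
ordered = cities.value_counts().index
mapping = {city: code for code, city in enumerate(ordered)}
print(mapping)

# A city that was never seen has no entry and maps to NaN.
encoded = pd.Series(["A", "Z"]).map(mapping)
print(encoded)
```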


In [2]:
c

NameError: name 'c' is not defined

In [None]:
df['City'] = df['City'].map(c)
df

In [None]:
df.info()

Now everything looks just fine. So we can go ahead with the data and start model building.

# **Building the Model**

The first task will be to **split the dataset** into train set and test set.

In [None]:
from sklearn.model_selection import train_test_split

x = df.drop('revenue', axis=1)  # features
y = df['revenue']               # target
# Hold out 30% of the rows as the test set.
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.30)
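
Without a fixed seed, `train_test_split` shuffles differently on every run, so the metrics reported later change from run to run. A minimal sketch on toy arrays shows how `random_state` makes the split repeatable; the value 42 is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 rows, 2 features.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# The same random_state gives the same split on every call.
split_a = train_test_split(X_toy, y_toy, test_size=0.30, random_state=42)
split_b = train_test_split(X_toy, y_toy, test_size=0.30, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)
print(np.array_equal(split_a[1], split_b[1]))
```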


Let's check the dimensions of the train and test sets.

In [None]:
X_train.shape

In [None]:
y_train.shape

In [None]:
X_test.shape

In [None]:
y_test.shape

Before moving ahead, I am importing all the models from scikit-learn and XGBoost.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from xgboost import XGBRegressor

## Linear Regression

In [None]:
lr = LinearRegression() #create the object of the model
lr=lr.fit(X_train,y_train)

In [None]:
from sklearn.metrics import mean_absolute_error,mean_squared_error,r2_score

In [None]:
pred = lr.predict(X_test)
s=mean_absolute_error(y_test,pred)
s1=mean_squared_error(y_test,pred)
s2=r2_score(y_test,pred)

print(f"The MAE with the linear regressor is: {s}")
print(f"The MSE with the linear regressor is: {s1}")
print(f"The R2_Score with the linear regressor is: {s2}")

## Decision Tree Regressor

In [None]:
dtr = DecisionTreeRegressor() #create the object of the model
dtr=dtr.fit(X_train,y_train)

In [None]:
pred = dtr.predict(X_test)
s=mean_absolute_error(y_test,pred)
s1=mean_squared_error(y_test,pred)
s2=r2_score(y_test,pred)

print(f"The MAE with the DT regressor is: {s}")
print(f"The MSE with the DT regressor is: {s1}")
print(f"The R2_Score with the DT regressor is: {s2}")

## Random Forest Regressor

In [None]:
r = RandomForestRegressor() #create the object of the model
r=r.fit(X_train,y_train)

In [None]:
pred = r.predict(X_test)
s=mean_absolute_error(y_test,pred)
s1=mean_squared_error(y_test,pred)
s2=r2_score(y_test,pred)

print(f"The MAE with the RF regressor is: {s}")
print(f"The MSE with the RF regressor is: {s1}")
print(f"The R2_Score with the RF regressor is: {s2}")

## K-Neighbors Regressor

In [None]:
knn=KNeighborsRegressor()
knn=knn.fit(X_train,y_train)

In [None]:
pred = knn.predict(X_test)
s=mean_absolute_error(y_test,pred)
s1=mean_squared_error(y_test,pred)
s2=r2_score(y_test,pred)

print(f"The MAE with the KNN regressor is: {s}")
print(f"The MSE with the KNN regressor is: {s1}")
print(f"The R2_Score with the KNN regressor is: {s2}")

## XGB Regressor

In [None]:
xgb=XGBRegressor()
xgb=xgb.fit(X_train,y_train)

In [None]:
pred = xgb.predict(X_test)
s=mean_absolute_error(y_test,pred)
s1=mean_squared_error(y_test,pred)
s2=r2_score(y_test,pred)

print(f"The MAE with the XGB regressor is: {s}")
print(f"The MSE with the XGB regressor is: {s1}")
print(f"The R2_Score with the XGB regressor is: {s2}")
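
The fit, predict and score steps above are identical for every model, so they can be wrapped in one function and run in a loop. The sketch below does this on synthetic data from `make_regression` (not the restaurant data); `evaluate`, `candidates` and the `*_syn` names are hypothetical and chosen so they do not overwrite the notebook's own variables. XGBoost is left out only to keep the example dependent on scikit-learn alone.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data with the same number of rows as the dataset.
X_syn, y_syn = make_regression(n_samples=137, n_features=10, noise=10.0, random_state=0)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.30, random_state=0
)


def evaluate(model, X_tr, y_tr, X_te, y_te):
    """Fit the model and return its test-set MAE, MSE and R2 (hypothetical helper)."""
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    return {
        "MAE": mean_absolute_error(y_te, pred),
        "MSE": mean_squared_error(y_te, pred),
        "R2": r2_score(y_te, pred),
    }


candidates = {
    "LinearRegression": LinearRegression(),
    "DecisionTree": DecisionTreeRegressor(random_state=0),
    "RandomForest": RandomForestRegressor(random_state=0),
}

# Evaluate every candidate on the same split so the scores are comparable.
scores = {
    name: evaluate(model, Xs_train, ys_train, Xs_test, ys_test)
    for name, model in candidates.items()
}
for name, metrics in scores.items():
    print(name, metrics)
```

Collecting the scores in one dictionary also makes it easy to sort the models by any metric afterwards.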

Of all the models, the random forest regressor gave the lowest error, so it is the best of the five and should be chosen as the final model.
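
With only 137 rows, a ranking based on a single 70/30 split can change from one split to the next. Cross-validation is one way to check how stable the result is. The sketch below runs 5-fold cross-validation for a random forest on synthetic data; the data, the number of folds and `n_estimators=50` are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data with the same number of rows as the dataset.
X_syn, y_syn = make_regression(n_samples=137, n_features=10, noise=10.0, random_state=0)

# Shuffled 5-fold split with a fixed seed so the folds are repeatable.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
neg_mae = cross_val_score(
    RandomForestRegressor(n_estimators=50, random_state=0),
    X_syn,
    y_syn,
    cv=cv,
    scoring="neg_mean_absolute_error",
)

# scikit-learn reports negated errors so that higher is better; flip the sign back.
mae_per_fold = -neg_mae
print(mae_per_fold.mean(), mae_per_fold.std())
```

Running the same procedure for each candidate model and comparing the fold averages would give a more robust basis for picking the final model.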