# The Life Cycle in a Data Science Project

1. **Exploratory Data Analysis.**
2. **Feature Engineering.**
3. **Feature Selection.**
4. **Model Building.**
5. **Model Deployment.**

**In this project we will follow the same cycle.**

In [None]:
from IPython.display import Image
from IPython.core.display import HTML 
Image(url= "https://akm-img-a-in.tosshub.com/sites/btmt/images/stories/zomato-fact-sheet_505_052417055850_111517063712.jpg?size=1200:675")

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
import plotly.figure_factory as ff

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
df = pd.read_csv('/kaggle/input/zomato-bangalore-restaurants/zomato.csv')

In [None]:
df.head()

#### Column description

* **url** : contains the url of the restaurant in the zomato website
* **address** : contains the address of the restaurant in Bengaluru
* **name** : contains the name of the restaurant
* **online_order** : whether online ordering is available in the restaurant or not
* **book_table** : table book option available or not
* **rate** : contains the overall rating of the restaurant out of 5
* **votes** : contains total number of rating for the restaurant as of the above mentioned date
* **phone** : contains the phone number of the restaurant
* **location** : contains the neighborhood in which the restaurant is located
* **rest_type** : restaurant type
* **dish_liked** : dishes people liked in the restaurant
* **cuisines** : food styles, separated by comma
* **cost_two** : contains the approximate cost for meal for two people
* **reviews_list** : list of tuples containing reviews for the restaurant, each tuple
* **menu_item** : contains list of menus available in the restaurant
* **service_type** : type of meal
* **serve_to** : contains the neighborhood in which the restaurant is listed

In [None]:
df.info()

### Checking for the missing values

In [None]:
pd.DataFrame(round(df.isnull().sum()/df.shape[0] * 100,3), columns = ['Missing'])

* The variable `dish_liked` has more than 50% missing values. Dropping the rows where it is missing would discard more than half of the data, so we keep those rows. To simplify the analysis, we drop the unnecessary columns `url`, `address`, and `phone`.
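
The trade-off described above can be sketched on a toy frame. The column values below are made up for illustration and are not taken from the Zomato data; the point is only that dropping rows on a sparse column removes most of the table.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the restaurant data (values are invented)
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "dish_liked": ["Biryani", np.nan, np.nan, np.nan],
    "votes": [10, 20, 30, 40],
})

# Percentage of missing values per column
missing_pct = toy.isnull().mean() * 100

# Dropping every row that has a missing value discards most of the data ...
rows_after_dropna = len(toy.dropna())

# ... while leaving the sparse column in place keeps every row
rows_kept = len(toy)

print(missing_pct)
print(rows_after_dropna, rows_kept)
```

Using `isnull().mean()` is equivalent to dividing the null count by the number of rows, as the cell above does.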

In [None]:
df.drop(['url', 'address', 'phone'], axis=1, inplace = True)

### Renaming a few columns for convenience

In [None]:
df.rename(columns = {"approx_cost(for two people)" : "cost_two", "listed_in(type)" : "service_type", "listed_in(city)" : "serve_to"}, inplace = True)

In [None]:
df.info()

In [None]:
df.columns

# 1. Exploratory Data Analysis.

* As we saw above, the variable `cost_two` has the `object` data type, so we need to convert it to a numeric (`float`) type before we can analyze it.

In [None]:
# Converting the cost_two variable into float
df.cost_two = df.cost_two.astype(str)
df.cost_two = df.cost_two.apply(lambda x : x.replace(',','')).astype(float)

* Normally `astype(float)` alone would convert the variable to a float, but here it fails because the numbers contain commas as thousands separators, e.g. `1,200`. To avoid this, we use a `lambda` with `replace` to strip the commas and then convert to float.
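
A minimal sketch of why the direct cast fails and how stripping the commas fixes it. The series below is invented, and the vectorized `.str.replace` is shown as an alternative to the `apply` plus `lambda` approach used above, not as the notebook's own method.

```python
import numpy as np
import pandas as pd

# Invented cost strings, including a thousands separator and a missing value
raw_cost = pd.Series(["800", "1,200", np.nan, "300"])

# Casting directly fails on "1,200": the comma is not part of a valid number
try:
    raw_cost.astype(float)
    direct_cast_ok = True
except ValueError:
    direct_cast_ok = False

# Strip the commas first, then cast; missing values stay missing
clean_cost = raw_cost.str.replace(",", "", regex=False).astype(float)

print(direct_cast_ok)
print(clean_cost)
```

The `.str` accessor skips missing values, so there is no need to cast the column to `str` first.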

In [None]:
df.rate.unique()

* We need to replace the `NEW` and `-` values in the `rate` column with `NaN` so that, once the `/5` suffix is stripped, the column can be converted from `object` to `float`.

In [None]:
# np.nan is used because the np.NaN alias is not available in recent NumPy releases
df['rate'] = df.rate.replace('NEW', np.nan)
df['rate'] = df.rate.replace('-', np.nan)
df.rate = df.rate.astype(str)

In [None]:
df.rate = df.rate.apply(lambda x : x.replace('/5','')).astype(float)
df.head()
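
The same cleaning can be sketched in a more compact form with `pd.to_numeric(..., errors="coerce")`, which turns anything that is not a number into `NaN` in one step. This is an alternative to the cells above, shown on invented sample strings.

```python
import pandas as pd

# Invented rating strings in the shapes described above
raw_rate = pd.Series(["4.1/5", "3.8 /5", "NEW", "-", None])

# Drop the "/5" suffix and surrounding whitespace
stripped = raw_rate.str.replace("/5", "", regex=False).str.strip()

# Non-numeric leftovers such as "NEW" and "-" become NaN
clean_rate = pd.to_numeric(stripped, errors="coerce")

print(clean_rate)
```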

### Checking the count of each rating category present. 

In [None]:
plt.rcParams['figure.figsize'] = 14,7
sns.countplot(x=df['rate'], palette='Set1')
plt.title("Count plot of the rate variable")
plt.xticks(rotation = 90)
plt.show()

* The `rate` distribution above is **roughly normal with a mean of about 3.7**. The graph shows that the majority of restaurant ratings lie between **3.4 and 4.2**. Very few restaurants have a rating of 4.8.

In [None]:
df.columns

### Plotting a joint plot for `rate` vs `votes`

* **Joint plot** allows us to compare two different variables and see if there is any relationship between these two variables. By using joint plot we can do both univariate and bivariate analysis by plotting the scatter plot (bivariate) and distribution plot (univariate) of two different variables in a single plotting grid

* **Univariate analysis** is the analysis of **one** (“uni”) variable. **Bivariate analysis** is the analysis of exactly **two** variables. **Multivariate analysis** is the analysis of **more than two** variables
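
As a small numeric counterpart to the definitions above, the sketch below summarizes one variable on its own (univariate) and then measures how two variables move together (bivariate). The numbers are invented and only illustrate the two kinds of analysis.

```python
import pandas as pd

# Invented ratings and vote counts
toy = pd.DataFrame({
    "rate": [3.1, 3.6, 3.9, 4.2, 4.5],
    "votes": [20, 150, 400, 900, 2500],
})

# Univariate: summary statistics of a single variable
rate_summary = toy["rate"].describe()

# Bivariate: Pearson correlation between two variables
rate_votes_corr = toy["rate"].corr(toy["votes"])

print(rate_summary)
print(round(rate_votes_corr, 3))
```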

In [None]:
sns.set_style("darkgrid")
# jointplot creates its own figure; its size is controlled by height
sns.jointplot(x = 'rate', y = 'votes', data=df, color = 'darkgreen',height = 8, ratio = 4)

* From the scatter plot we can see that restaurants with higher ratings tend to have more votes. The distribution plot of `votes` on the right side indicates that the majority of votes lie in the 1000-2500 bucket.

### Bar Plot

* A bar plot is one of the most commonly used graphics for representing data. It draws rectangular bars whose length is proportional to the value of the variable. We will analyze the variable `location` to see which areas of Bangalore have the most restaurants.

In [None]:
# Analyzing the number of restaurants with respect to the location

df.location.value_counts().nlargest(10).plot(kind='barh')
plt.title("Number of restaurants by location")
plt.xlabel("Restaurant counts")
plt.show()

* From the above visualization we can see that the largest number of restaurants is located in **BTM**, which makes it one of the most popular residential and commercial areas in Bangalore.

In [None]:
df.columns

In [None]:
df.head()

### Pie Chart

* **Pie chart** is a type of graph in which a circle is divided into sectors that each represent a proportion of the whole.
* With the help of a pie chart we are going to plot the percentage of restaurants that accept online orders.
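
The proportions a pie chart displays can be computed directly with `value_counts(normalize=True)`. The sketch below uses an invented series and assumes the column holds `Yes`/`No` strings. Taking the labels from the index of the result keeps every label attached to its own value, whatever order the counts come out in.

```python
import pandas as pd

# Invented sample; the "Yes"/"No" values are an assumption for illustration
online_order = pd.Series(["Yes", "No", "Yes", "Yes", "No"])

# normalize=True returns fractions instead of raw counts
share = online_order.value_counts(normalize=True) * 100

# Labels and values taken from the same object stay aligned
labels = share.index.tolist()
values = share.tolist()

print(dict(zip(labels, values)))
```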

In [None]:
# Plotting a pie chart for online orders

trace = go.Pie(labels = ['Online_orders', 'No_online_orders'], values = df['online_order'].value_counts(), 
               textfont=dict(size=15), opacity = 0.8,
               marker=dict(colors=['lightskyblue','gold'], 
                           line=dict(color='#000000', width=1.5)))


layout = dict(title =  'Distribution of order variable')
           
fig = dict(data = [trace], layout=layout)
py.iplot(fig)

From the above pie chart we can say that more restaurants offer online ordering than not. Note that the chart counts restaurants, so on its own it does not show that people prefer ordering online to dining in.

In [None]:
df.head()

### Restaurant Listed in

* Let's see in which areas most of the restaurants are listed or deliver to.

In [None]:
# Restaurants to serve to

df.serve_to.value_counts().nlargest(10).plot(kind = 'barh', color = 'r')
plt.title("Number of restaurants listed in")
plt.xlabel("Count")
plt.legend()
plt.show()

* As expected, most of the listed restaurants deliver to the **BTM** location, because this area is home to more than 3000 restaurants. Even though **Koramangala 7th Block** does not have many restaurants, it still ranks second in terms of the number of restaurants that deliver to it.

### Checking whether the online order facility impacts the rating of the restaurants

In [None]:
sns.countplot(x = df['rate'], hue = df['online_order'], palette= 'Set1')
plt.title("Distribution of restaurant rating over online order facility")
plt.show()

* We can observe that restaurants without an online order facility are more likely to have lower ratings than restaurants with one.
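
A count plot shows the shape of the two distributions; a grouped summary puts a number on the difference. The sketch below uses invented ratings and assumes `Yes`/`No` values in `online_order`. It illustrates the technique and does not reproduce the dataset's actual figures.

```python
import pandas as pd

# Invented ratings for restaurants with and without online ordering
toy = pd.DataFrame({
    "online_order": ["Yes", "Yes", "No", "No", "Yes", "No"],
    "rate": [4.0, 3.8, 3.5, 3.9, 4.2, 3.4],
})

# Mean, median and group size of the rating for each group
rating_by_order = toy.groupby("online_order")["rate"].agg(["mean", "median", "count"])

print(rating_by_order)
```

Reporting the group size next to the mean helps judge whether a difference rests on enough restaurants to be meaningful.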

In [None]:
df.rest_type.value_counts().nlargest(20).plot(kind = 'barh')
plt.title("Restaurant type")
plt.xlabel("Count")
plt.legend()
plt.show()

* The above visual shows the top 20 restaurant types. We can see that `Quick Bites` is more popular than the rest of the restaurant types.

In [None]:
df.head()

In [None]:
df.dish_liked.value_counts().nlargest(20).plot(kind = 'barh')
plt.show()

* We can see the top 20 dishes liked by people. In this graph `Biryani` clearly takes the top position compared to the rest of the dishes.

In [None]:
df.head()

### Checking which are the top 20 restaurants in Bangalore.

In [None]:
df.name.value_counts().nlargest(20).plot(kind = 'barh')
plt.legend()
plt.show()

* The most popular restaurant by number of listings is `Cafe Coffee Day`.

In [None]:
df.head()

### Checking whether the online table booking affects the rating of the restaurant

In [None]:
# Plotting a pie chart for table booking

# value_counts() sorts by frequency, so the labels are built from its index
# (assuming the column holds 'Yes'/'No') instead of being hard-coded in a fixed order
book_counts = df['book_table'].value_counts()
book_labels = ['Table_booking_available' if v == 'Yes' else 'No_table_booking_available' for v in book_counts.index]

trace = go.Pie(labels = book_labels, values = book_counts.values, 
               textfont=dict(size=15), opacity = 0.8,
               marker=dict(colors=['lightskyblue','gold'], 
                           line=dict(color='#000000', width=1.5)))


layout = dict(title =  'Distribution of table booking variable')
           
fig = dict(data = [trace], layout=layout)
py.iplot(fig)

* From the above pie chart we can see that only 12.5% of the restaurants have a table booking facility, while 87.5% do not.

* Now let's check how the rating changes depending on whether the restaurant has table booking or not.

In [None]:
plt.figure(figsize=(20,10))
sns.countplot(x = df['book_table'], hue = df['rate'], palette= 'Set1')
plt.title("Distribution of restaurant rating over table booking facility")
plt.show()

* We can clearly see that ratings differ noticeably depending on whether the restaurant offers table booking. Restaurants with a table booking facility tend to have higher ratings than restaurants without one.

## Comparing Biggest Restaurant Chain and Best Restaurant Chain

In [None]:
plt.rcParams['figure.figsize'] = 14,7
plt.subplot(1,2,1)

df.name.value_counts().head().plot(kind = 'barh', color = sns.color_palette("hls", 5))
plt.xlabel("Number Of Restaurants")
plt.title("Biggest Restaurant Chain (Top 5)")

plt.subplot(1,2,2)

df[df['rate'] >= 4.5]['name'].value_counts().nlargest(5).plot(kind = 'barh', color = sns.color_palette("Paired"))
plt.xlabel("Number Of Restaurants")
plt.title("Best Restaurant Chain (Top 5) - Rating 4.5 or higher")
plt.tight_layout()

* The `Cafe Coffee Day` chain has over 90 cafes across the city listed on Zomato. On the other hand, **Truffles**, a burger chain, has the best-rated fast food restaurants (rating of 4.5 or more out of 5): quality over quantity.

* The next time you visit Bangalore and want to check out a good restaurant over a weekend, don't forget to try the food at **Truffles**, **Hammered** and **Mainland China**.


# 2. Feature Engineering.

In [None]:
# checking for null values
df.isnull().sum()

In [None]:
# Replacing the NaN values in rate feature

df['rate'] = df['rate'].fillna(df['rate'].mean())

In [None]:
# Plotting the distribution (histogram with a KDE curve); distplot is deprecated in recent seaborn
sns.histplot(df['rate'], kde=True, color = 'darkgreen')
plt.title('Rating Distribution')
plt.show()

* The rating feature follows a normal distribution

In [None]:
# Checking the mean of cost_two, which is used below to fill the NaN values

df.cost_two.mean()

In [None]:
# Replacing the NaN values for the cost_two feature with mean value

df['cost_two'] = df['cost_two'].fillna(df['cost_two'].mean())
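
Mean imputation is sensitive to outliers, which matters for a cost column where a few expensive restaurants can pull the mean upward. The sketch below compares it with median imputation on invented values. The median is offered only as an alternative to consider; the notebook itself fills with the mean.

```python
import numpy as np
import pandas as pd

# Invented costs with one missing value and one expensive outlier
cost = pd.Series([300.0, 400.0, 500.0, np.nan, 2800.0])

# Mean of the observed values is 1000; the outlier pulls it up
mean_filled = cost.fillna(cost.mean())

# Median of the observed values is 450; it is less affected by the outlier
median_filled = cost.fillna(cost.median())

print(mean_filled.iloc[3], median_filled.iloc[3])
```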

In [None]:
# Plotting the distribution of the cost_two feature
sns.histplot(df['cost_two'], kde=True, color = 'darkgreen')
plt.title('Cost for Two Distribution')
plt.show()

* The cost two feature also follows nearly normal distribution

In [None]:
df.head()

### Converting the categorical columns into integer

* We will perform One Hot Encoding on the `online_order`, `book_table`, `location`, `rest_type`, and `service_type` columns.
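
A minimal sketch of what one-hot encoding produces, on an invented frame. The `prefix` and `dtype=int` arguments are choices made for this illustration: the prefix keeps columns from different source variables from sharing a name, and `dtype=int` yields 0/1 values instead of booleans.

```python
import pandas as pd

# Invented sample rows
toy = pd.DataFrame({
    "book_table": ["Yes", "No", "No"],
    "service_type": ["Delivery", "Buffet", "Delivery"],
})

# Two-level column: drop_first leaves a single 0/1 indicator (here for "Yes")
book_table_flag = pd.get_dummies(toy["book_table"], drop_first=True, dtype=int).iloc[:, 0]

# Multi-level column: one indicator column per category
service_dummies = pd.get_dummies(toy["service_type"], prefix="service_type", dtype=int)

print(book_table_flag.tolist())
print(service_dummies)
```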

In [None]:
df['online_order'] = pd.get_dummies(df['online_order'], drop_first=True)
df.head()

In [None]:
df['book_table'] = pd.get_dummies(df['book_table'], drop_first=True)
df.head()

In [None]:
# Performing One Hot Encoding on rest_type

get_dummies_rest_type = pd.get_dummies(df.rest_type)
get_dummies_rest_type.head(3)

In [None]:
# Performing One Hot Encoding on location

get_dummies_location = pd.get_dummies(df.location)
get_dummies_location.head(3)

In [None]:
# Performing One Hot Encoding on service_type

get_dummies_service_type = pd.get_dummies(df.service_type)
get_dummies_service_type.head(3)

In [None]:
# Concatenating the dataframes
final_df = pd.concat([df,get_dummies_rest_type,get_dummies_service_type, get_dummies_location], axis = 1)
final_df.head()

In [None]:
final_df.head(2)

In [None]:
final_df = final_df.drop(["name","rest_type","location", 'cuisines', 'dish_liked', 'reviews_list'],axis = 1)
final_df.head()

In [None]:
final_df.head()

In [None]:
final_df = final_df.drop(["menu_item","service_type","serve_to"],axis = 1)
final_df.head()

In [None]:
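# numeric_only=True skips the remaining text columns, which recent pandas versions refuse to correlate
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="RdYlGn", annot_kws={"size":15})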

# 3. Feature Selection

In [None]:
# Splitting the features into independent and dependent variables

x = final_df.drop(['rate'], axis = 1)
x.head()

In [None]:
y = final_df['rate']

### Feature importance

* Feature importance gives you a score for each feature of your data, the higher the score the more important or relevant is the feature towards your output variable

* Tree-based regressors expose feature importance through the built-in `feature_importances_` attribute. We will use an Extra Trees Regressor to extract the top 10 features of the dataset.
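
To see what the importance scores mean, the sketch below fits an Extra Trees Regressor on synthetic data where one feature drives the target and the other is pure noise. The feature names, sample size, and `random_state` are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor

# Synthetic data: the target depends on "signal" only
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "signal": rng.normal(size=300),
    "noise": rng.normal(size=300),
})
y = 3.0 * X["signal"] + rng.normal(scale=0.1, size=300)

model = ExtraTreesRegressor(n_estimators=50, random_state=0)
model.fit(X, y)

# Importances are normalized, so they sum to 1 across features
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

The informative feature receives nearly all of the importance, which is the behavior the top-10 plot relies on.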

In [None]:
from sklearn.ensemble import ExtraTreesRegressor

model = ExtraTreesRegressor()
model.fit(x,y)

In [None]:
print(model.feature_importances_)

In [None]:
#plotting graph of feature importances for better visualization
feat_importances = pd.Series(model.feature_importances_, index=x.columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.show()

* From the above graph we can see the top 10 most important features which are very important to train our model and get correct predictions. We will be using these features for our model building.

In [None]:
sns.histplot(df['rate'], kde=True)

# 4. Model Building

In [None]:
# Splitting data into test and train

from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.20)

## Applying Linear Regression Algorithm

In [None]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()

lr.fit(x_train, y_train)

lr_pred = lr.predict(x_test)

In [None]:
r2 = r2_score(y_test,lr_pred)
print('R-Square Score: ',r2*100)

In [None]:
# Calculate the absolute errors
lr_errors = abs(lr_pred - y_test)
# Print out the mean absolute error (MAE), in rating points
print('Mean Absolute Error:', round(np.mean(lr_errors), 2))

In [None]:
# Calculate mean absolute percentage error (MAPE)
mape = 100 * (lr_errors / y_test)
# Calculate and display accuracy
lr_accuracy = 100 - np.mean(mape)
print('Accuracy for Linear Regression is :', round(lr_accuracy, 2), '%.')
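
The error metrics used here can be checked by hand on three invented predictions. In this notebook "accuracy" is defined as 100 minus the mean absolute percentage error (MAPE), which is a convention of this analysis and not a standard regression metric.

```python
import numpy as np

# Invented true and predicted ratings
y_true = np.array([4.0, 3.5, 2.0])
y_pred = np.array([3.8, 3.5, 2.5])

# Absolute errors: roughly 0.2, 0.0 and 0.5
abs_errors = np.abs(y_pred - y_true)

# MAE: average absolute error, in rating points
mae = abs_errors.mean()

# MAPE: average of the errors relative to the true values (5%, 0%, 25%)
mape = 100 * np.mean(abs_errors / y_true)

# "Accuracy" as defined in this notebook
accuracy = 100 - mape

print(round(mae, 3), round(mape, 2), round(accuracy, 2))
```

Because MAPE divides by the true value, it is undefined when a true rating is zero and grows quickly for small true values.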

In [None]:
# Distribution of the residuals (true minus predicted rating)
sns.histplot(y_test-lr_pred, kde=True)

In [None]:
# Plotting the true and the Linear Regression predicted ratings against the third feature column

plt.figure(figsize=(12,7))

plt.scatter(y_test,x_test.iloc[:,2],color="blue",label="True rate")
plt.title("True rate vs Predicted rate",size=20,pad=15)
plt.xlabel('Rating',size = 15)
plt.ylabel(x_test.columns[2],size = 15)
plt.scatter(lr_pred,x_test.iloc[:,2],color="yellow",label="Predicted rate")
plt.legend()

In [None]:
from sklearn.metrics import mean_absolute_error,mean_squared_error

In [None]:
print('mse:',metrics.mean_squared_error(y_test, lr_pred))
print('mae:',metrics.mean_absolute_error(y_test, lr_pred))


## Applying Decision Tree Regressor

In [None]:
from sklearn.tree import DecisionTreeRegressor

# Squared error is the default criterion; the 'mse' alias was removed in newer scikit-learn releases
dtree = DecisionTreeRegressor()
dtree.fit(x_train, y_train)

In [None]:
dtree_pred = dtree.predict(x_test)

In [None]:
r2 = r2_score(y_test,dtree_pred)
print('R-Square Score: ',r2*100)

# Calculate the absolute errors
dtree_errors = abs(dtree_pred - y_test)
# Print out the mean absolute error (MAE), in rating points
print('Mean Absolute Error:', round(np.mean(dtree_errors), 2))

# Calculate mean absolute percentage error (MAPE)
mape = 100 * (dtree_errors / y_test)
# Calculate and display accuracy
dtree_accuracy = 100 - np.mean(mape)
print('Accuracy for Decision tree regressor is :', round(dtree_accuracy, 2), '%.')

In [None]:
# Plotting the true and the Decision Tree predicted ratings against the third feature column

plt.figure(figsize=(12,7))

plt.scatter(y_test,x_test.iloc[:,2],color="blue",label="True rate")
plt.title("True rate vs Predicted rate",size=20,pad=15)
plt.xlabel('Rating',size = 15)
plt.ylabel(x_test.columns[2],size = 15)
plt.scatter(dtree_pred,x_test.iloc[:,2],color="yellow",label="Predicted rate")
plt.legend()

## Applying Random Forest Regressor Algorithm

In [None]:
from sklearn.ensemble import RandomForestRegressor

random_forest_regressor = RandomForestRegressor()
random_forest_regressor.fit(x_train, y_train)

In [None]:
rf_pred = random_forest_regressor.predict(x_test)

In [None]:
r2 = r2_score(y_test,rf_pred)
print('R-Square Score: ',r2*100)

# Calculate the absolute errors
rf_errors = abs(rf_pred - y_test)
# Print out the mean absolute error (MAE), in rating points
print('Mean Absolute Error:', round(np.mean(rf_errors), 2))

# Calculate mean absolute percentage error (MAPE)
mape = 100 * (rf_errors / y_test)
# Calculate and display accuracy
rf_accuracy = 100 - np.mean(mape)
print('Accuracy for random forest regressor is :', round(rf_accuracy, 2), '%.')

In [None]:
# Plotting the true and the Random Forest predicted ratings against the third feature column

plt.figure(figsize=(12,7))

plt.scatter(y_test,x_test.iloc[:,2],color="blue",label="True rate")
plt.title("True rate vs Predicted rate",size=20,pad=15)
plt.xlabel('Rating',size = 15)
plt.ylabel(x_test.columns[2],size = 15)
plt.scatter(rf_pred,x_test.iloc[:,2],color="yellow",label="Predicted rate")
plt.legend()

# 5. Model Deployment

In [None]:
import pickle

In [None]:
# For Linear Regression

# open a file where you want to store the data; the with block closes it after writing
with open('logistic_regression.pkl', 'wb') as file:
    # dump information to that file
    pickle.dump(lr, file)

In [None]:
# For Decision Tree Regressor

# open a file where you want to store the data; the with block closes it after writing
with open('Decision_tree_model.pkl', 'wb') as file:
    # dump information to that file
    pickle.dump(dtree, file)

In [None]:
# For Random Forest Regressor

# open a file where you want to store the data; the with block closes it after writing
with open('Random_forest.pkl', 'wb') as file:
    # dump information to that file
    pickle.dump(random_forest_regressor, file)
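
A saved model is only useful if it can be loaded back, so here is a minimal round-trip sketch. The dictionary below is a stand-in for a fitted model, and the file name and temporary directory are hypothetical; a fitted scikit-learn estimator is pickled and restored the same way.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model; any picklable object round-trips the same way
model_stub = {"name": "demo_model", "coef": [0.5, -1.2]}

# Hypothetical file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "demo_model.pkl")

# Write the object; the with block closes the file when done
with open(path, "wb") as f:
    pickle.dump(model_stub, f)

# Read it back, as a serving script would do before calling predict
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model_stub)
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.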
