### Analyzing the Zomato dataset to get an idea about the factors affecting the establishment of restaurants in Bangalore

    Attribute definitions:
    url: the URL of the restaurant on the Zomato website;
    address: the address of the restaurant in Bengaluru;
    name: the name of the restaurant;
    online_order: whether online ordering is available at the restaurant;
    book_table: whether table booking is available;
    rate: the overall rating of the restaurant out of 5;
    votes: the total number of ratings for the restaurant as of the above-mentioned date;
    phone: the phone number of the restaurant;
    location: the neighborhood in which the restaurant is located;
    rest_type: the restaurant type.
    
    

#### Importing packages

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_theme()  # seaborn's default theme; newer matplotlib releases no longer accept plt.style.use('seaborn')
import numpy as np
from scipy.stats import ttest_ind
import random
import array


# sklearn ->
from sklearn.metrics import r2_score, accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve, confusion_matrix, classification_report
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import mean_squared_error, mean_absolute_error

#### Reading data

In [None]:
df=pd.read_csv("../input/zomato-bangalore-restaurants/zomato.csv")
df


### Preprocessing

Size of the Zomato dataset (total number of cells, rows × columns):

In [None]:
df.size


Shape of the dataset (rows x columns):

In [None]:
df.shape

Show the first five rows of the data:

In [None]:
df.head()

Describe the data

In [None]:
df.describe()

Check data information:

In [None]:
df.info()

#### Handling missing values

Check the number of nulls/missing values in each column:

In [None]:
df.isna().sum()

Dropping rows where cuisines is missing:

In [None]:
df = df[df.cuisines.notna()]
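
A minimal sketch of what this filter does, on a made-up three-row frame (only the column name `cuisines` comes from the dataset; the values are hypothetical). It shows that rows with a missing value are removed rather than filled, and that `dropna` restricted to one column gives the same result.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; only the column name `cuisines` comes from the dataset.
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "cuisines": ["North Indian", np.nan, "Cafe"],
})

# Keep rows whose `cuisines` value is present; nothing is filled with zero.
kept = toy[toy["cuisines"].notna()]

# dropna restricted to one column selects the same rows.
same = toy.dropna(subset=["cuisines"])

print(kept["name"].tolist())
```

Restricting `dropna` with `subset` is one way to avoid discarding rows that are only missing values in unrelated columns.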


In [None]:
df.isna().sum()

List the name of columns:

In [None]:
df.columns



#### Dummy variables

Deleting columns that are not required:


In [None]:
df = df.drop(["url","phone","address","listed_in(city)","menu_item"], axis = 1)


In [None]:
df.head()

Renaming columns:

In [None]:
df.rename(columns={'approx_cost(for two people)': 'avg_cost','listed_in(type)': 'listed_type'}, inplace=True)

df.columns



In [None]:
df.head()

In [None]:
df.info()

#### Transformations:

Converting column avg_cost to float type:

In [None]:
df['avg_cost'] = df['avg_cost'].astype(str)
df['avg_cost'] = df['avg_cost'].apply(lambda x: x.replace(',',''))
df['avg_cost'] = df['avg_cost'].astype(float)
df.info()
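
The same conversion can be sketched with `pd.to_numeric`, which keeps missing values as NaN without the round trip through the string `'nan'`. The sample values below are hypothetical and only imitate the style of the raw cost column.

```python
import pandas as pd

# Hypothetical sample values in the style of the raw cost column.
raw_cost = pd.Series(["800", "1,200", None, "300"])

# Strip the thousands separator, then convert; missing values stay NaN.
clean_cost = pd.to_numeric(raw_cost.str.replace(",", "", regex=False))

print(clean_cost.tolist())
```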

Treating the rating "NEW" as missing, then dropping every row that has a missing value in any column:

In [None]:
df.rate = df.rate.replace("NEW", np.nan)
df.dropna(how ='any', inplace = True)



Converting column rate from the "x/5" string form to a float:

In [None]:
df['rate'] = df['rate'].astype(str)
df['rate'] = df['rate'].apply(lambda x: x.replace('/5',''))
df['rate'] = df['rate'].astype(float)

df.head()
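
As a hedged alternative, the two rating steps can be combined by coercing anything non-numeric to NaN. In the sketch below, "NEW" is the placeholder handled in this notebook, while "-" is only an assumed example of another non-numeric entry.

```python
import pandas as pd

# Hypothetical raw ratings; "NEW" appears in the notebook, "-" is an assumed placeholder.
raw_rate = pd.Series(["4.1/5", "3.8 /5", "NEW", "-"])

# Remove the "/5" suffix and stray spaces, then coerce anything non-numeric to NaN.
rate = pd.to_numeric(
    raw_rate.str.replace("/5", "", regex=False).str.strip(),
    errors="coerce",
)

print(rate.tolist())
```

With `errors="coerce"`, unexpected text becomes NaN instead of raising an error, so such rows can be dropped in a separate, explicit step.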

Listing restaurants with a rating of 4.5 or higher:

In [None]:
# Keep only the name and rate columns for rows rated 4.5 or above, sorted by rating
newdf_rate = df.loc[df['rate'] >= 4.5, ['name', 'rate']]
newdf_rate = newdf_rate.sort_values(by=['rate'])
newdf_rate


#### Visualisation and Exploratory data analysis

In [None]:
df.name.value_counts().head()


Plot restaurant names against their number of outlets:

In [None]:
plt.figure(figsize = (10,5))
ax = df.name.value_counts()[:20].plot(kind = 'bar')
plt.xlabel("Restaurant Name")
plt.ylabel("No. of restaurants")
plt.title('Restaurant Names vs No of locations')

Number of restaurants that do and do not offer online ordering:

In [None]:
df.online_order.value_counts()



Plot Online vs Offline Orders:

In [None]:
plt.figure(figsize=(10,5))
ax = df.online_order.value_counts().plot(kind = 'bar')
plt.xlabel("Online/Offline Orders")
plt.ylabel("Count")
plt.title("Online/Offline Orders Count")


Number of restaurants that do and do not provide table booking:

In [None]:
df.book_table.value_counts()

Plotting Book Table Facility Counts:

In [None]:
plt.figure(figsize=(10,5))
ax = df.book_table.value_counts().plot(kind = 'bar')
plt.xlabel("Book Table Facility")
plt.ylabel("Count")
plt.title("Book Table Facility Counts")

We see that most restaurants do not provide a table booking option. Those that do are few: only around 5k out of almost 50k in total.

In [None]:
df[df['book_table'] == 'No'].rate.describe()

The average rating of the restaurants without table booking option is around 3.8.

In [None]:
df[df['book_table'] == 'Yes'].rate.describe()

There are around 6k restaurants with a table booking facility, and they have an average rating of 4.1, which suggests that restaurants with table booking receive better ratings.

Locations with the highest number of restaurants:

In [None]:
df.location.value_counts().head()

Plot the 15 locations with the highest number of restaurants:

In [None]:
plt.figure(figsize=(10,10))
ax = df.location.value_counts()[:15].plot(kind = 'pie')
plt.title("Location with highest no of restaurant counts")
plt.legend()

Plot the percentage share of restaurants among those 15 locations:

In [None]:
plt.figure(figsize=(10,10))
names = df.location.value_counts()[:15].index
values = df.location.value_counts()[:15].values
explode = [0.1,0,0,0,0,0,0,0,0,0,0,0,0,0,0] #exploding the first one out

plt.pie(values, explode=explode, autopct='%0.1f%%', labels = names) #autopct to display the percent value
plt.title("Percentage of restaurants present in that location")
plt.show()
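
Note that the percentages printed by `autopct` are shares of the plotted slices only, not of all restaurants. A small sketch with hypothetical location labels illustrates the difference:

```python
import pandas as pd

# Hypothetical location labels standing in for df.location.
location = pd.Series(["BTM", "BTM", "BTM", "HSR", "HSR", "Indiranagar", "Whitefield"])

top = location.value_counts().head(2)          # counts of the two largest locations
share_of_top = top / top.sum() * 100           # what autopct prints on the pie
share_of_all = location.value_counts(normalize=True).head(2) * 100  # share of every row

print(share_of_top.round(1).to_dict())
print(share_of_all.round(1).to_dict())
```

Here the largest location makes up 60% of the two plotted slices but a smaller share of all rows, so the pie title should be read with that in mind.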

Number of restaurants of each type:

In [None]:
df.rest_type.value_counts().head()



In [None]:
fig = plt.figure(figsize=(17,5))
retype = sns.countplot(x=df["rest_type"])  # pass the column by keyword, as recent seaborn versions expect
retype.set_xticklabels(retype.get_xticklabels(), rotation=90)
plt.ylabel("Count")
plt.xlabel("Restaurant types")




Plot the 15 most common restaurant types:

In [None]:
plt.figure(figsize=(10,10))
names = df.rest_type.value_counts()[:15].index
values = df.rest_type.value_counts()[:15].values
explode = [0.1,0,0,0,0,0,0,0,0,0,0,0,0,0,0] 

plt.pie(values, explode=explode, autopct='%0.1f%%', labels = names)
plt.title("Percentage of restaurants types")
plt.show()



Most popular cuisines of Bangalore:

In [None]:
cuisines=df['cuisines'].value_counts()[:15]
sns.barplot(x=cuisines.values, y=cuisines.index)
plt.title("Most popular cuisines of Bangalore")
plt.xlabel("Number of restaurants")

In [None]:
df.head()

Most liked dish type

In [None]:
df.dish_liked.value_counts().head(5)

In [None]:
df.head()

### Hypothesis Testing (T-Test for comparing means of two groups)

Checking how restaurants that provide table booking differ from those that do not:

In [None]:
plt.figure(figsize=(10,5))
ax = df.book_table.value_counts().plot(kind = 'bar')
plt.xlabel("Book Table Facility")
plt.ylabel("Count")
plt.title("Book Table Facility Counts")

We see that most restaurants do not provide a table booking option. Those that do are few: only around 5k out of almost 50k in total.

In [None]:
print(df[df['book_table'] == 'No'].rate.describe())

The average rating of the restaurants without table booking option is around 3.8.

In [None]:
print(df[df['book_table'] == 'Yes'].rate.describe())

There are around 6k restaurants with a table booking facility, and they have an average rating of around 4.1, which suggests that restaurants with table booking receive better ratings.

##### Do restaurants with a table booking facility actually receive higher ratings? Let us test this hypothesis:

    H0: The mean rating of restaurants with a table booking facility (μ1) is the same as the mean rating of restaurants without it (μ2).
            
    Null Hypothesis,  H0:  μ1 = μ2 
            
    H1: The mean rating of restaurants with a table booking facility is greater than the mean rating of restaurants without it.
    
    Alternative Hypothesis,  H1:  μ1 > μ2

In [None]:
book_rate = df[['book_table','rate']]
book_rate

In [None]:
means_brate = book_rate.groupby('book_table').mean()
means_brate

In [None]:
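def mean_diff(book_rate, groupby_type, attribute):
    # Group means of `attribute`; groups sort alphabetically, so for book_table
    # this returns mean('No') - mean('Yes')
    means_table = book_rate.groupby(groupby_type)[attribute].mean()
    return (means_table.iloc[0] - means_table.iloc[1])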

In [None]:
observed_diff =  mean_diff(book_rate, 'book_table' ,'rate')
observed_diff

In [None]:
book_rate

In [None]:
shuffle = book_rate.sample(len(book_rate),replace = False)  # shuffle every row, whatever the row count
shuffle

In [None]:
book_rate_os = book_rate.assign(shuffled_book_rate=shuffle['rate'].values)
book_rate_os

In [None]:
differences = np.zeros(5000)
for i in np.arange(5000):
    # Shuffle the ratings, attach them to the original labels, and recompute the group difference
    shuffled = book_rate.sample(len(book_rate),replace = False)
    original_and_shuffled = book_rate.assign(shuffled_book_rate=shuffled['rate'].values )
    difference = mean_diff(original_and_shuffled, 'book_table' ,'shuffled_book_rate')
    differences[i] = difference

differences_df = pd.DataFrame(differences)
   
differences_df



Calculating the empirical p-value (the share of shuffled differences that are at most the observed difference):

In [None]:
empirical_P = np.count_nonzero(differences <= observed_diff)/differences.size

In [None]:
differences_df.hist()
plt.title('Prediction Under Null Hypotheses');
plt.xlabel('Differences between Group Averages',fontsize=15)
plt.ylabel('Units',fontsize=15);
print('Observed Difference:', observed_diff)

plt.scatter(observed_diff, -1, color='red', s=50)

print('Empirical P-value:', empirical_P)

Since the observed difference lies far from the centre of the distribution simulated under the null hypothesis, and the empirical p-value is below 5 percent, we reject the null hypothesis. The mean rating of restaurants with a table booking facility is greater than the mean rating of restaurants without it.
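
The permutation procedure can be sketched end to end on synthetic data. Everything below is an illustration under stated assumptions: the ratings are randomly generated (with a deliberately higher mean for the 'Yes' group), and the helper `diff_no_minus_yes` is a hypothetical name, not part of this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for book_rate: 'Yes' ratings are drawn with a higher mean on purpose.
toy = pd.DataFrame({
    "book_table": ["Yes"] * 200 + ["No"] * 800,
    "rate": np.concatenate([rng.normal(4.1, 0.4, 200), rng.normal(3.8, 0.4, 800)]),
})

def diff_no_minus_yes(labels, rates):
    """Mean rating of the 'No' group minus mean rating of the 'Yes' group."""
    return rates[labels == "No"].mean() - rates[labels == "Yes"].mean()

labels = toy["book_table"].to_numpy()
rates = toy["rate"].to_numpy()
observed = diff_no_minus_yes(labels, rates)

# Shuffle the ratings while keeping the labels fixed, which breaks any real association.
null_diffs = np.array([
    diff_no_minus_yes(labels, rng.permutation(rates)) for _ in range(2000)
])

# One-sided p-value: how often chance alone gives a difference at least this negative.
p_value = np.mean(null_diffs <= observed)
print(observed, p_value)
```

The key design point is that the group labels stay fixed and only the ratings are shuffled, so each simulated difference is computed between the same two groups as the observed one.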

#### Student's t-test:

A t-test is a type of inferential statistic used to determine whether there is a significant difference between the means of two groups.

T-test for the means of two independent samples of scores:

In [None]:
rand_Rate_Booking_No = pd.Series(random.sample(list(df[df['book_table'] == 'No'].rate), 30))
rand_Rate_Booking_Yes = pd.Series(random.sample(list(df[df['book_table'] == 'Yes'].rate), 20))

In [None]:
rand_Rate_Booking_No


Statistics of a random sample of restaurants without table booking facility:

In [None]:
rand_Rate_Booking_No.describe()

Statistics of a random sample of restaurants with table booking facility:

In [None]:
rand_Rate_Booking_Yes.describe()

    Definitions of variables used in calculation of Test Statistic:
    x1 = rand_Rate_Booking_Yes.mean()
    x2 = rand_Rate_Booking_No.mean()
    μ1 = pd.Series(df[df['book_table'] == 'Yes'].rate).mean() #population mean 1
    μ2 = pd.Series(df[df['book_table'] == 'No'].rate).mean() #population mean 2
    n1 = len(rand_Rate_Booking_Yes) #sample size 1
    n2 = len(rand_Rate_Booking_No) #sample size 2
    sd_1 = rand_Rate_Booking_Yes.std()
    sd_2 = rand_Rate_Booking_No.std()
    dof = min((n1 - 1), (n2 - 1))
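
The definitions above do not spell out the test statistic itself. One common form, shown here as an assumption rather than as the notebook's chosen method, divides the difference in sample means by the unpooled standard error. The sample values are hypothetical.

```python
import math
import statistics

# Hypothetical ratings standing in for the two random samples.
sample_yes = [4.2, 4.0, 4.3, 3.9, 4.1]   # with table booking
sample_no = [3.7, 3.9, 3.6, 4.0, 3.8]    # without table booking

x1, x2 = statistics.mean(sample_yes), statistics.mean(sample_no)
sd_1, sd_2 = statistics.stdev(sample_yes), statistics.stdev(sample_no)  # sample std (n - 1)
n1, n2 = len(sample_yes), len(sample_no)

# Under H0 the population means are equal, so (mu1 - mu2) = 0 in the numerator.
standard_error = math.sqrt(sd_1**2 / n1 + sd_2**2 / n2)
t_stat = (x1 - x2) / standard_error
dof = min(n1 - 1, n2 - 1)  # conservative degrees of freedom, as in the definitions above

print(round(t_stat, 3), dof)
```

A large positive `t_stat` relative to the t distribution with `dof` degrees of freedom supports H1: μ1 > μ2.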

In [None]:
#Using function to calculate out p - value:
t_stat, p = ttest_ind(rand_Rate_Booking_Yes,rand_Rate_Booking_No)
print('After running the T-test we get that test statistic is', t_stat,'and p-Value is', p)


This is a two-sided test for the null hypothesis that 2 independent samples have identical average (expected) values. This test assumes that the populations have identical variances by default.
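
Because H1 here is one-sided (μ1 > μ2), a one-sided test that does not assume equal variances may be a closer match. The sketch below uses synthetic samples and assumes SciPy 1.6 or later, where `ttest_ind` accepts the `alternative` argument.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)

# Synthetic samples with the same sizes as the notebook's random samples (20 and 30).
yes_sample = rng.normal(4.1, 0.3, 20)
no_sample = rng.normal(3.8, 0.4, 30)

# Two-sided test with pooled variance (the function's defaults).
t_two, p_two = ttest_ind(yes_sample, no_sample)

# One-sided Welch test: H1 is mean(yes) > mean(no), variances not assumed equal.
t_one, p_one = ttest_ind(yes_sample, no_sample, equal_var=False, alternative="greater")

print(p_two, p_one)
```

With `alternative="greater"`, a small p-value supports only the direction stated in H1, whereas the default two-sided p-value treats a difference in either direction as evidence against H0.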

In [None]:
if (p < 0.05):
    print("Since the p-value is below the 0.05 significance level, we reject the null hypothesis: the samples give evidence that the mean ratings of the two groups differ (a positive test statistic means restaurants with table booking are rated higher). However, we cannot claim causation, as there may be lurking variables.")
else:
    print("Since the p-value is at or above the 0.05 significance level, we fail to reject the null hypothesis: these samples do not give enough evidence that the mean rating of restaurants with table booking differs from that of restaurants without it.")

In [None]:
plt.figure(figsize=(30,10))
plt.subplot(1,2,1)
sns.countplot(x=df[df['book_table'] == 'Yes']['avg_cost'])
plt.xticks(rotation = 60)
plt.title('Cost distribution of restaurants with Booking table option')
plt.subplot(1,2,2)
# plt.figure(figsize=(20,5))
sns.countplot(x=df[df['book_table'] == 'No']['avg_cost'])
plt.xticks(rotation = 60)
plt.title('Cost distribution of restaurants without Booking table option')

On further analysis, restaurants with a table booking option have a higher rating on average (from the hypothesis test) and are costlier than restaurants without one. Their cost starts at Rs. 300, whereas the other group starts at Rs. 40. The graph above also shows that few restaurants cost more than Rs. 3000.

In [None]:
df.head()

In [None]:
zomato = df.copy()

Converting the labels into numeric codes so that they are machine readable:

In [None]:
def Encode(zomato):
    for column in zomato.columns[~zomato.columns.isin(['rate', 'avg_cost', 'votes', 'location'])]:
        zomato[column] = zomato[column].factorize()[0]
    return zomato

zomato = Encode(zomato)
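
A short sketch of what `factorize` returns, using hypothetical values in the style of the `book_table` column. The integer codes follow the order of first appearance, so which label becomes 0 or 1 depends on the data.

```python
import pandas as pd

# Hypothetical values in the style of the book_table column.
col = pd.Series(["Yes", "No", "No", "Yes"])

codes, uniques = col.factorize()

# Codes follow order of first appearance, so here 'Yes' -> 0 and 'No' -> 1.
mapping = dict(zip(uniques, range(len(uniques))))
print(codes.tolist(), mapping)
```

Keeping such a mapping around helps when reading later outputs, such as a confusion matrix, where only the integer codes are shown.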


In [None]:
zomato.info()

Looking at the dataset after transformation:

In [None]:
zomato.head()

Seaborn heatmap function to plot the correlation grid (between different variables):

In [None]:
corr = zomato.corr(numeric_only=True)  # location is still text, so restrict to numeric columns
plt.figure(figsize=(15,8))
sns.heatmap(corr, annot=True)
zomato.columns

Defining the independent variables and dependent variables:

There is a positive correlation between 'votes' and 'rate', and also between 'avg_cost' and 'rate', so these two variables are used as predictors in the model.

In [None]:
x1 = zomato[['votes','avg_cost']]
x1

In [None]:
y = zomato['rate']
y

In [None]:
# y = y.values.reshape(-1,1)
# y.shape

Prediction Model: Linear Regression:

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

Getting Test and Training Set:

In [None]:
x_train,x_test,y_train,y_test=train_test_split(x1,y,test_size=.1,random_state=353)


In [None]:
x_train.shape


In [None]:
y_train.shape

In [None]:
x_test.shape

In [None]:
y_test.shape

#### Preparing a Linear Regression Model:

In [None]:
reg=LinearRegression()
reg.fit(x_train,y_train)



In [None]:
y_pred=reg.predict(x_test)
y_pred

In [None]:
reg.intercept_


In [None]:
reg.coef_

In [None]:
reg.score(x_train, y_train)

In [None]:
mean_squared_error(y_test, y_pred)
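
The score above is computed on the training set, while the error is computed on the test set. A hedged sketch on synthetic data (the features and coefficients below are made up for illustration) shows how both kinds of metric can be reported on the held-out split:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(353)

# Synthetic features and target; the coefficients below are made up for illustration.
X = rng.uniform(0, 1, size=(500, 2))
target = 3.0 + 0.5 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 0.3, 500)

X_tr, X_te, t_tr, t_te = train_test_split(X, target, test_size=0.1, random_state=353)
model = LinearRegression().fit(X_tr, t_tr)
pred = model.predict(X_te)

print("train R2:", model.score(X_tr, t_tr))
print("test R2: ", r2_score(t_te, pred))
print("test MSE:", mean_squared_error(t_te, pred))
print("test MAE:", mean_absolute_error(t_te, pred))
```

Comparing the train and test R² values gives a quick check on whether the model generalises or only fits the training rows.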

#### Building an Ordinary Least Squares regression model:

In [None]:
import statsmodels.api as sm

In [None]:
X1 = sm.add_constant(x1)
ols = sm.OLS(y, X1).fit()

print(ols.summary())

p-values:

The p-values are very low (almost equal to zero), so the association between rate and the two predictors is unlikely to be due to chance.

R² value:

The R² value is 0.188, so only about 18.8 percent of the variability in rating is explained by the average cost for two people and the number of votes given by customers. This is too low to be a satisfactory result for prediction.

#### Logistic Regression Classifier and Confusion Matrix:


Taking the Online Order and Book Table columns for our classification:

In [None]:
x2 = zomato['online_order'].values.reshape(-1,1)
y2 = zomato['book_table']
print(x2,y2)

In [None]:
x_train,x_test,y_train,y_test=train_test_split(x2,y2,test_size=.1,random_state=353)


In [None]:
print(x_train,y_train)

In [None]:
clf = LogisticRegression().fit(x_train, y_train)
clf

In [None]:
y_pred = clf.predict(x_test)
y_pred

In [None]:
clf.predict_proba(x_test)



In [None]:
clf.score(x2, y2)

#### Confusion matrix:

In [None]:
matrix = confusion_matrix(y_test, y_pred, labels=[1,0])
print('Confusion matrix:\n',matrix)

# outcome values order in sklearn
tp, fn, fp, tn = confusion_matrix(y_test,y_pred,labels=[1,0]).reshape(-1)
print('\nOutcome values:\n', tp, fn, fp, tn)

# classification report for precision, recall f1-score and accuracy
matrix = classification_report(y_test,y_pred,labels=[1,0])
print('\nClassification report:\n',matrix)

In [None]:
from sklearn.metrics import f1_score

In [None]:
print("The F-Measure is: ",f1_score(y_test, y_pred))

    F-measure (harmonic mean of the precision and recall) here is around 0.85. 
    (F1 = 2 * (precision * recall) / (precision + recall))
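
The formula can be checked by hand from confusion-matrix counts. The counts below are hypothetical and are not taken from this notebook's output.

```python
# Hypothetical confusion-matrix counts for the positive class.
tp, fn, fp = 80, 20, 10

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found

f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
arithmetic_mean = (precision + recall) / 2

print(round(f1, 4), round(arithmetic_mean, 4))
```

The harmonic mean never exceeds the arithmetic mean, so F1 is pulled towards the weaker of precision and recall.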

In [None]:
sns.heatmap(confusion_matrix(y_test,y_pred),annot=True,fmt = '.4g')

Accuracy score for the Logistic Regression Classifier:

In [None]:
# Compare the true test labels with the classifier's predictions
print("Accuracy score: ",accuracy_score(y_test, y_pred))

For the logistic regression classifier, the accuracy score is printed by the cell above and the F-measure is around 0.85.

ROC curve:

In [None]:
y_score = clf.predict_proba(x_test)[:,1]  # predicted probability of class 1
logit_roc_auc = roc_auc_score(y_test, y_score)  # AUC from probabilities, matching the plotted curve
fpr, tpr, thresholds = roc_curve(y_test, y_score)
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()

#### K-fold cross-validation:

In [None]:
cv = KFold(n_splits=10)
n_scores = cross_val_score(clf, x_train, y_train, cv=cv)
#print(n_scores)
np.mean(n_scores)
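
A self-contained sketch of 10-fold cross-validation on synthetic data. The binary feature and label are randomly generated for illustration, and shuffling the folds is a choice made here, not something this notebook specifies. The splitter only takes effect when it is passed through the `cv` argument.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic binary feature and label, loosely mimicking a one-column classifier input.
X = rng.integers(0, 2, size=(400, 1))
y = np.where(rng.random(400) < 0.8, X[:, 0], 1 - X[:, 0])   # label agrees with X 80% of the time

cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(), X, y, cv=cv)  # cv must be passed explicitly

print(len(scores), scores.mean())
```

One score is returned per fold, so the length of `scores` is a quick way to confirm that the intended number of splits was used.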

#### KNN Classifier:

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
KNN_model = KNeighborsClassifier(n_neighbors = 3)
KNN_model.fit(x_train, y_train)
KNN_pred = KNN_model.predict(x_test)
print("The accuracy score for the KNN classifier with 3 neighbours is: ",accuracy_score(y_test, KNN_pred))
print("\nF measure score with 3 neighbours is: ",f1_score(y_test, KNN_pred))
print("\nClassification report with 3 neighbours is: ",classification_report(y_test, KNN_pred))

In [None]:
KNN_model = KNeighborsClassifier(n_neighbors = 5)
KNN_model.fit(x_train, y_train)
KNN_pred = KNN_model.predict(x_test)
print("The accuracy score for the KNN classifier with 5 neighbours is: ",accuracy_score(y_test, KNN_pred))
print("\nF measure score with 5 neighbours is: ",f1_score(y_test, KNN_pred))
print("\nClassification report with 5 neighbours is: ",classification_report(y_test, KNN_pred))

In [None]:
KNN_model = KNeighborsClassifier(n_neighbors = 7)
KNN_model.fit(x_train, y_train)
KNN_pred = KNN_model.predict(x_test)
print("The accuracy score for the KNN classifier with 7 neighbours is: ",accuracy_score(y_test, KNN_pred))
print("\nF measure score with 7 neighbours is: ",f1_score(y_test, KNN_pred))
print("\nClassification report with 7 neighbours is: ",classification_report(y_test, KNN_pred))

In [None]:
KNN_model = KNeighborsClassifier(n_neighbors = 9)
KNN_model.fit(x_train, y_train)
KNN_pred = KNN_model.predict(x_test)
print("The accuracy score for the KNN classifier with 9 neighbours is: ",accuracy_score(y_test, KNN_pred))
print("\nF measure score with 9 neighbours is: ",f1_score(y_test, KNN_pred))
print("\nClassification report with 9 neighbours is: ",classification_report(y_test, KNN_pred))

#### With 5 or more neighbours, the KNN classifier has an accuracy score of around 74 percent and an F-measure of around 0.85. Hence the number of neighbours is taken as 5. 
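
The comparison across neighbour counts can also be written as a loop. This is a sketch on synthetic two-feature data (this notebook uses a single encoded column), scoring each model on its own predictions with the true labels passed first.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Synthetic two-feature data; the real notebook uses a single encoded column.
X = rng.normal(size=(600, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=600) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=353)

results = {}
for k in (3, 5, 7, 9):
    knn_pred = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr).predict(X_te)
    # Score the KNN predictions themselves, with y_true first and y_pred second.
    results[k] = (accuracy_score(y_te, knn_pred), f1_score(y_te, knn_pred))

for k, (acc, f1) in results.items():
    print(k, round(acc, 3), round(f1, 3))
```

Collecting the scores in one dictionary makes it easier to compare neighbour counts side by side before settling on a value.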

In [None]:
sns.heatmap(confusion_matrix(y_test,KNN_pred),annot=True,fmt = '.4g')  # KNN predictions from the last cell, not the logistic ones