
# Project Python Foundations: FoodHub Data Analysis

### Context

The number of restaurants in New York is increasing day by day. Many students and busy professionals rely on those restaurants because of their hectic lifestyles. Online food delivery is a convenient option for them, since it brings good food from their favorite restaurants. FoodHub, a food aggregator company, offers access to multiple restaurants through a single smartphone app.

The app allows restaurants to receive online orders directly from customers. After the restaurant confirms an order, the app assigns a delivery person from the company to pick it up. The delivery person uses the map to reach the restaurant and waits for the food package. Once the package is handed over, the delivery person confirms the pick-up in the app and travels to the customer's location. After delivering the package, the delivery person confirms the drop-off in the app. The customer can then rate the order in the app. The food aggregator earns money by collecting a fixed margin of each delivery order from the restaurants.

### Objective

The food aggregator company has stored the data of the orders placed by registered customers through its online portal. It wants to analyze the data to understand the demand for different restaurants, which will help it improve the customer experience. Suppose you are hired as a Data Scientist at this company and the Data Science team has shared some key questions that need to be answered. Perform the data analysis to answer these questions and help the company improve its business.

### Data Description

The dataset contains information about each food order. The detailed data dictionary is given below.

### Data Dictionary

* order_id: Unique ID of the order
* customer_id: ID of the customer who ordered the food
* restaurant_name: Name of the restaurant
* cuisine_type: Cuisine ordered by the customer
* cost_of_the_order: Cost of the order
* day_of_the_week: Indicates whether the order is placed on a weekday or weekend (The weekday is from Monday to Friday and the weekend is Saturday and Sunday)
* rating: Rating given by the customer out of 5
* food_preparation_time: Time (in minutes) taken by the restaurant to prepare the food. This is calculated by taking the difference between the timestamps of the restaurant's order confirmation and the delivery person's pick-up confirmation.
* delivery_time: Time (in minutes) taken by the delivery person to deliver the food package. This is calculated by taking the difference between the timestamps of the delivery person's pick-up confirmation and drop-off information.

### Let us start by importing the required libraries

In [None]:
# import libraries for data manipulation
import numpy as np
import pandas as pd

# import libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

### Understanding the structure of the data

In [None]:
# read the data
df = pd.read_csv('foodhub_order.csv')
# returns the first 5 rows
df.head()

|   | order_id | customer_id | restaurant_name | cuisine_type | cost_of_the_order | day_of_the_week | rating | food_preparation_time | delivery_time |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1477147 | 337525 | Hangawi | Korean | 30.75 | Weekend | Not given | 25 | 20 |
| 1 | 1477685 | 358141 | Blue Ribbon Sushi Izakaya | Japanese | 12.08 | Weekend | Not given | 25 | 23 |
| 2 | 1477070 | 66393 | Cafe Habana | Mexican | 12.23 | Weekday | 5 | 23 | 28 |
| 3 | 1477334 | 106968 | Blue Ribbon Fried Chicken | American | 29.2 | Weekend | 3 | 25 | 15 |
| 4 | 1478249 | 76942 | Dirty Bird to Go | American | 11.59 | Weekday | 4 | 25 | 24 |


#### Observations:

The DataFrame has 9 columns as mentioned in the Data Dictionary. Data in each row corresponds to the order placed by a customer.

### **Question 1:** Write the code to check the shape of the dataset and write your observations based on that.

In [None]:
df.shape

(1898, 9)

#### Observations: The dataset has 1898 rows and 9 columns, so it describes 1898 orders.


### Question 2: Write the observations based on the below output from the info() method.

In [None]:
# Use info() to print a concise summary of the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1898 entries, 0 to 1897
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   order_id               1898 non-null   int64  
 1   customer_id            1898 non-null   int64  
 2   restaurant_name        1898 non-null   object 
 3   cuisine_type           1898 non-null   object 
 4   cost_of_the_order      1898 non-null   float64
 5   day_of_the_week        1898 non-null   object 
 6   rating                 1898 non-null   object 
 7   food_preparation_time  1898 non-null   int64  
 8   delivery_time          1898 non-null   int64  
dtypes: float64(1), int64(4), object(4)
memory usage: 133.6+ KB


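#### Observations: Every column has 1898 non-null values, so no entries are recorded as null. Four columns hold integers, one holds floats, and four hold objects. The rating column is stored as an object because it mixes numeric ratings with the text 'Not given', as the first rows above show. The DataFrame uses 133.6+ KB of memory.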


### Question 3: 'restaurant_name', 'cuisine_type', 'day_of_the_week' are object types. Write the code to convert the mentioned features to 'category' and write your observations on the same.

In [None]:
# Converting "object" columns to "category" reduces the memory required to store the DataFrame
# write the code to convert 'restaurant_name', 'cuisine_type', 'day_of_the_week' into categorical data

df['restaurant_name'] = df.restaurant_name.astype('category')
df['cuisine_type'] = df.cuisine_type.astype('category')
df['day_of_the_week'] = df.day_of_the_week.astype('category')

# Use info() to print a concise summary of the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1898 entries, 0 to 1897
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype   
---  ------                 --------------  -----   
 0   order_id               1898 non-null   int64   
 1   customer_id            1898 non-null   int64   
 2   restaurant_name        1898 non-null   category
 3   cuisine_type           1898 non-null   category
 4   cost_of_the_order      1898 non-null   float64 
 5   day_of_the_week        1898 non-null   category
 6   rating                 1898 non-null   object  
 7   food_preparation_time  1898 non-null   int64   
 8   delivery_time          1898 non-null   int64   
dtypes: category(3), float64(1), int64(4), object(1)
memory usage: 102.7+ KB


#### Observations: After converting the three columns to category, memory usage drops from 133.6+ KB to 102.7+ KB. The rating column is still stored as an object.
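The saving comes from storing each distinct string once and replacing the column values with small integer codes. The sketch below illustrates this on a toy Series; the values and the repeat count are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy column (hypothetical values): few distinct strings repeated many times
names = pd.Series(["Shake Shack", "Parm", "Shake Shack", "Parm"] * 250)

# deep=True measures the actual size of the Python strings, not just pointers
as_object = names.astype("object").memory_usage(deep=True)
as_category = names.astype("category").memory_usage(deep=True)

print("object bytes:  ", as_object)
print("category bytes:", as_category)
```

The gain shrinks as the number of distinct values approaches the number of rows, so the conversion pays off mainly for columns such as day_of_the_week or cuisine_type.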


### **Question 4:** Write the code to find the summary statistics and write your observations based on that.

In [None]:
summary = df.describe()
print(summary)

           order_id    customer_id  cost_of_the_order  food_preparation_time  \
count  1.898000e+03    1898.000000        1898.000000            1898.000000   
mean   1.477496e+06  171168.478398          16.498851              27.371970   
std    5.480497e+02  113698.139743           7.483812               4.632481   
min    1.476547e+06    1311.000000           4.470000              20.000000   
25%    1.477021e+06   77787.750000          12.080000              23.000000   
50%    1.477496e+06  128600.000000          14.140000              27.000000   
75%    1.477970e+06  270525.000000          22.297500              31.000000   
max    1.478444e+06  405334.000000          35.410000              35.000000   

       delivery_time  
count    1898.000000  
mean       24.161749  
std         4.972637  
min        15.000000  
25%        20.000000  
50%        25.000000  
75%        28.000000  
max        33.000000  


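#### Observations: describe() returns the count, mean, standard deviation, minimum, maximum, and quartiles (25%, 50%, 75%) for each numerical column. 75% of the orders cost less than about 22.30 dollars. The mean and median are close for food_preparation_time (27.37 vs. 27) and delivery_time (24.16 vs. 25), which suggests roughly symmetric distributions. For cost_of_the_order the mean (16.50) is noticeably higher than the median (14.14), which suggests a right-skewed distribution. order_id and customer_id are identifiers, so their summary statistics are not meaningful.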
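A quick way to check symmetry from summary statistics is to look at the gap between the mean and the median. The sketch below does this with the values printed by describe() above; the variable names are illustrative. On the full dataset, `df.skew(numeric_only=True)` would be a more direct measure.

```python
import pandas as pd

# Mean and 50% (median) values copied from the describe() output
stats = pd.DataFrame(
    {
        "mean": [16.498851, 27.371970, 24.161749],
        "median": [14.14, 27.0, 25.0],
    },
    index=["cost_of_the_order", "food_preparation_time", "delivery_time"],
)

# A clearly positive gap hints at a right skew, a clearly negative gap at a left skew
stats["mean_minus_median"] = stats["mean"] - stats["median"]
print(stats)
```

The gap is largest for cost_of_the_order, where the mean sits more than 2 dollars above the median.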


### **Question 5:** How many orders are not rated?

In [None]:
df['rating'].value_counts()

| rating | count |
|---|---|
| Not given | 736 |
| 5 | 588 |
| 4 | 386 |
| 3 | 188 |


#### Observations: 736 orders are not rated.
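Because the rating column stores text, numeric summaries such as an average rating need a conversion first. One possible approach, sketched on a toy Series with made-up values, is `pd.to_numeric` with `errors="coerce"`, which turns 'Not given' into NaN.

```python
import pandas as pd

# Toy ratings (hypothetical values) in the same text format as the dataset
rating = pd.Series(["Not given", "5", "4", "3", "5"])

# Non-numeric entries such as 'Not given' become NaN
numeric_rating = pd.to_numeric(rating, errors="coerce")

not_rated = int(numeric_rating.isna().sum())
mean_rating = numeric_rating.mean()  # NaN values are skipped by default

print("not rated:", not_rated)
print("mean rating:", mean_rating)
```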


### Exploratory Data Analysis (EDA)




### Question 6: Write the code to find the top 5 restaurants that have received the highest number of orders.

In [None]:
df['restaurant_name'].value_counts()[:5]

| restaurant_name | count |
|---|---|
| Shake Shack | 219 |
| The Meatball Shop | 132 |
| Blue Ribbon Sushi | 119 |
| Blue Ribbon Fried Chicken | 96 |
| Parm | 68 |


#### Observations: The top 5 restaurants are Shake Shack, The Meatball Shop, Blue Ribbon Sushi, Blue Ribbon Fried Chicken, and Parm.


### Question 7: Write the code to find the most popular cuisine on weekends.

In [None]:
df.groupby(by = ["day_of_the_week"])['cuisine_type'].value_counts()


| day_of_the_week | cuisine_type | count |
|---|---|---|
| Weekday | American | 169 |
| Weekday | Japanese | 135 |
| Weekday | Italian | 91 |
| Weekday | Chinese | 52 |
| Weekday | Indian | 24 |
| Weekday | Mexican | 24 |
| Weekday | Middle Eastern | 17 |
| Weekday | Mediterranean | 14 |
| Weekday | Southern | 6 |
| Weekday | French | 5 |


#### Observations: American cuisine is the most popular on weekends based on order count. The output excerpt above shows only the Weekday rows, where American also leads with 169 orders.
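To answer the weekend question directly, one option is to filter to weekend orders before counting. The sketch below uses a small toy DataFrame with made-up rows; only the column names match the dataset.

```python
import pandas as pd

# Toy orders (hypothetical rows) with the same column names as the dataset
toy = pd.DataFrame(
    {
        "day_of_the_week": ["Weekend", "Weekend", "Weekday", "Weekend"],
        "cuisine_type": ["American", "Japanese", "Japanese", "American"],
    }
)

# Keep only weekend orders, then count orders per cuisine
weekend_counts = toy.loc[
    toy["day_of_the_week"] == "Weekend", "cuisine_type"
].value_counts()

# idxmax returns the label with the highest count
top_weekend_cuisine = weekend_counts.idxmax()
print(weekend_counts)
print("most popular on weekends:", top_weekend_cuisine)
```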


### Question 8: Write the code to find the number of total orders where the cost is above 20 dollars. What is the percentage of such orders in the dataset?

In [None]:
df.loc[df.cost_of_the_order > 20.0,'cost_of_the_order'].count()

555

#### Observations: 555 orders cost more than 20 dollars. That is about 29.24% of the 1898 orders in the dataset.
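The question also asks for the percentage of such orders. A minimal sketch: the mean of a boolean mask is the share of True values, so count and percentage come from the same mask. The five costs are the ones shown by df.head(); the last line uses the counts reported above.

```python
import pandas as pd

# Costs of the first five orders, as shown by df.head()
cost = pd.Series([30.75, 12.08, 12.23, 29.2, 11.59])

above_20 = cost > 20                 # boolean mask
n_above = int(above_20.sum())        # number of True values
pct_above = above_20.mean() * 100    # share of True values, in percent

print(n_above, pct_above)

# Same calculation with the full-dataset counts reported above
dataset_pct = 555 / 1898 * 100
print(round(dataset_pct, 2))
```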


### Question 9: Write the code to find the mean delivery time based on this dataset. (1 mark)

In [None]:
df.delivery_time.mean()

24.161749209694417

#### Observations: The mean delivery time is about 24.16 minutes.


### Question 10: Suppose the company has decided to give a free coupon of 15 dollars to the customer who has spent the maximum amount on a single order. Write the code to find the ID of the customer along with the order details.



In [None]:
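# find the highest order cost, then show the full details of that order
max_cost = df['cost_of_the_order'].max()
df.loc[df['cost_of_the_order'] == max_cost]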

|   | order_id | customer_id | restaurant_name | cuisine_type | cost_of_the_order | day_of_the_week | rating | food_preparation_time | delivery_time |
|---|---|---|---|---|---|---|---|---|---|
| 573 | 1477814 | 62359 | Pylos | Mediterranean | 35.41 | Weekday | 4 | 21 | 29 |


#### Observations: The customer who placed the most expensive single order has ID 62359. The order (order_id 1477814) was placed on a weekday at Pylos, a Mediterranean restaurant, and cost 35.41 dollars.






### Question 11: Suppose the company wants to provide a promotional offer in the advertisement of the restaurants. The condition to get the offer is that the restaurants must have a rating count of more than 50 and the average rating should be greater than 4. Write the code to find the restaurants fulfilling the criteria to get the promotional offer.
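A minimal sketch of one way to approach this, shown on toy data: drop the 'Not given' rows, convert the remaining ratings to integers, aggregate the count and mean per restaurant, and filter on both conditions. The restaurant names, ratings, and the lowered count threshold are hypothetical and chosen only so the toy example returns a result; the question itself uses a threshold of 50.

```python
import pandas as pd

# Toy orders (hypothetical) with the same column names as the dataset
toy = pd.DataFrame(
    {
        "restaurant_name": ["A", "A", "A", "B", "B", "C"],
        "rating": ["5", "4", "Not given", "3", "4", "5"],
    }
)

# Keep rated orders only and make the ratings numeric
rated = toy[toy["rating"] != "Not given"].copy()
rated["rating"] = rated["rating"].astype(int)

# Rating count and average rating per restaurant
summary = rated.groupby("restaurant_name")["rating"].agg(["count", "mean"])

MIN_COUNT = 1  # the question uses 50; lowered here for the toy data
MIN_MEAN = 4
eligible = summary[(summary["count"] > MIN_COUNT) & (summary["mean"] > MIN_MEAN)]
print(eligible)
```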

#### Observations:


### Question 12: Suppose the company charges the restaurant 25% on the orders having cost greater than 20 dollars and 15% on the orders having cost greater than 5 dollars. Write the code to find the net revenue generated on all the orders given in the dataset. (2 marks)

In [None]:
df['revenue'] = [order_cost * 0.25 if order_cost > 20 else
                        order_cost * 0.15 if order_cost > 5 else
                        0
                        for order_cost in df['cost_of_the_order']]

print("Revenue= ", df['revenue'].sum())

Revenue=  6166.303


#### Observations: The net revenue is about 6166.30 dollars. Orders costing 5 dollars or less generate no revenue under this scheme.
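The same tiered commission can also be expressed without a Python-level loop. The sketch below uses `np.select`, which picks the first matching condition for each element, so the order of the conditions matters. The costs are four values from df.head() plus the minimum cost from the summary statistics.

```python
import numpy as np
import pandas as pd

# Four costs from df.head() plus the dataset minimum (4.47)
cost = pd.Series([30.75, 12.08, 4.47, 29.2, 11.59])

# Conditions are checked in order: above 20 first, then above 5, else 0
rate = np.select([cost > 20, cost > 5], [0.25, 0.15], default=0.0)

revenue = cost * rate
total_revenue = revenue.sum()
print(revenue.tolist())
print("Revenue =", round(total_revenue, 3))
```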


### Question 13: Suppose the company wants to analyze the total time required to deliver the food. Write the code to find out the percentage of orders that have more than 60 minutes of total delivery time. (2 marks)

Note: The total delivery time is the summation of the food preparation time and delivery time.

In [None]:
df['total_delivery_time'] = df['food_preparation_time'] + df['delivery_time']

c = df[df['total_delivery_time'] > 60].count()[:1]

print (c)

print( c / df.shape[0] * 100)

order_id    200
dtype: int64
order_id    10.537408
dtype: float64


#### Observations: 200 orders took more than 60 minutes in total (preparation plus delivery). That is about 10.54% of all orders.
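An alternative that returns plain numbers instead of a one-element Series is to work with the boolean mask directly. This is a sketch on toy values; the preparation and delivery times below are made up and only stay within the ranges seen in the summary statistics.

```python
import pandas as pd

# Toy times in minutes (hypothetical values)
toy = pd.DataFrame(
    {
        "food_preparation_time": [25, 35, 23, 30, 20],
        "delivery_time": [20, 30, 28, 33, 15],
    }
)

toy["total_delivery_time"] = toy["food_preparation_time"] + toy["delivery_time"]

over_60 = toy["total_delivery_time"] > 60   # boolean mask
n_over_60 = int(over_60.sum())              # count of orders above 60 minutes
pct_over_60 = over_60.mean() * 100          # percentage of such orders

print(n_over_60, pct_over_60)
```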


### Question 14: Suppose the company wants to analyze the delivery time of the orders on weekdays and weekends. Write the code to find the mean delivery time on weekdays and weekends. Write your observations on the results. (2 marks)

In [None]:
df.groupby('day_of_the_week')['delivery_time'].mean()


| day_of_the_week | delivery_time |
|---|---|
| Weekday | 28.340037 |
| Weekend | 22.470022 |


#### Observations: The mean delivery time is about 28.34 minutes on weekdays and about 22.47 minutes on weekends, so weekend deliveries are roughly 6 minutes faster on average.


### Conclusion and Recommendations

### **Question 15:** Write the conclusions and business recommendations derived from the analysis. (3 marks)

#### Key Insights:

*   Shake Shack is the most popular restaurant, with 219 orders.
*   American, Japanese, Italian, and Chinese are the most ordered cuisines.
*   Deliveries are faster on weekends (about 22 minutes on average) than on weekdays (about 28 minutes).
*   More orders are placed on weekends than on weekdays.

#### Business Recommendations:

*   Around 10% of orders take longer than an hour from confirmation to drop-off. FoodHub should work on reducing the total delivery time for these orders.

*   More delivery staff should be scheduled on weekends because of the higher number of orders.

*   736 of the 1898 orders are not rated. Offering coupons or contest entries in exchange for a rating could reduce the number of unrated orders.









**Student Name : Uvini Aveesha Hettiarachchi**

