<a href="https://colab.research.google.com/github/usshaa/SMBDA/blob/main/C-4.2%3A%20Analyze_customer_reviews_and_ratings_of_fast_delivery_agents_in_India_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **1️⃣ Project Overview**
- **Objective**: Analyze customer reviews and ratings of fast delivery agents in India.  
- **Dataset Source**: Kaggle ([India's Fast Delivery Agents Reviews and Ratings](https://www.kaggle.com/datasets/vivekattri/indias-fast-delivery-agents-reviews-and-ratings)).  
- **Technology Stack**: Hive, Spark (Databricks), HDFS/DBFS, Bash (for data download).  
- **Steps Involved**:
  1. Download and extract the dataset.  
  2. Upload the CSV file to DBFS/HDFS.  
  3. Create an **external table** in Hive.  
  4. Create a **cleaned table** by handling missing data & formatting issues.  
  5. Perform **data analysis** using Hive SQL queries.  

---

## **2️⃣ Step 1: Download and Upload Data**
### **Download Dataset (Bash Script)**
Run the following script to **download and extract** the dataset:


In [None]:
%python
!mkdir -p ~/Downloads/delivery_reviews
!curl -L -o ~/Downloads/delivery_reviews/dataset.zip \
  https://www.kaggle.com/api/v1/datasets/download/vivekattri/indias-fast-delivery-agents-reviews-and-ratings

!unzip -o ~/Downloads/delivery_reviews/dataset.zip -d ~/Downloads/delivery_reviews

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  172k  100  172k    0     0   322k      0 --:--:-- --:--:-- --:--:--  322k
Archive:  /root/Downloads/delivery_reviews/dataset.zip
  inflating: /root/Downloads/delivery_reviews/Fast Delivery Agent Reviews.csv  


In [None]:
%python
import os

local_path = "/root/Downloads/delivery_reviews/"
print(os.listdir(local_path))


['dataset.zip', 'Fast Delivery Agent Reviews.csv']


### **Upload Data to Databricks FileStore or HDFS**
**For Databricks (DBFS)**

In [None]:
%python
dbutils.fs.cp("file:/root/Downloads/delivery_reviews/Fast Delivery Agent Reviews.csv",
              "dbfs:/FileStore/tables/delivery_reviews/Fast_Delivery_Agent_Reviews.csv")


Out[56]: True

**For HDFS**

In [None]:
%python
dbutils.fs.mkdirs("dbfs:/user/hive/warehouse/delivery_reviews/")

Out[57]: True

In [None]:
%python
dbutils.fs.cp("file:/root/Downloads/delivery_reviews/",
              "dbfs:/user/hive/warehouse/delivery_reviews/",
              recurse=True)

Out[58]: True

## **3️⃣ Step 2: Create Hive External Table**
### **Create Database**

In [None]:
%sql
CREATE DATABASE IF NOT EXISTS delivery_reviews_db;
USE delivery_reviews_db;

### **Create External Table**

In [None]:
%sql
DROP Table delivery_reviews_raw;

In [None]:
%sql
CREATE EXTERNAL TABLE delivery_reviews_raw (
    Agent_Name STRING,
    Rating FLOAT,
    Review_Text STRING,
    Delivery_Time INT,
    Location STRING,
    Order_Type STRING,
    Customer_Feedback_Type STRING,
    Price_Range STRING,
    Discount_Applied STRING,
    Product_Availability STRING,
    Customer_Service_Rating INT,
    Order_Accuracy STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'dbfs:/FileStore/tables/delivery_reviews/';

**Verify Data**

In [None]:
%sql
SELECT * FROM delivery_reviews_raw LIMIT 5;

Agent_Name,Rating,Review_Text,Delivery_Time,Location,Order_Type,Customer_Feedback_Type,Price_Range,Discount_Applied,Product_Availability,Customer_Service_Rating,Order_Accuracy
Agent Name,,Review Text,,Location,Order Type,Customer Feedback Type,Price Range,Discount Applied,Product Availability,,Order Accuracy
Zepto,4.5,Purpose boy job cup decision girl now get job yard.,58.0,Delhi,Essentials,Neutral,High,Yes,Out of Stock,4.0,Incorrect
Zepto,2.1,Prevent production able both the box school way issue grow action figure one.,25.0,Lucknow,Grocery,Negative,Low,No,Out of Stock,2.0,Correct
JioMart,4.5,Family station listen agreement more kitchen lose hour hour.,54.0,Ahmedabad,Essentials,Neutral,Low,No,Out of Stock,3.0,Correct
JioMart,2.6,World north people area everything enter beyond Democrat beautiful very.,22.0,Chennai,Essentials,Neutral,Low,Yes,In Stock,1.0,Incorrect


## **4️⃣ Step 3: Create a Cleaned Table**
### **Data Cleaning & Transformation**
- Remove duplicate rows  
- Filter out missing or corrupted data  
- Standardize column formats  

In [None]:
%sql
DROP TABLE delivery_reviews_cleaned;

In [None]:
%sql
CREATE TABLE delivery_reviews_cleaned AS
SELECT
    Agent_Name,
    Rating,
    Review_Text,
    Delivery_Time,
    Location,
    Order_Type,
    Customer_Feedback_Type,
    Price_Range,
    Discount_Applied,
    Product_Availability,
    Customer_Service_Rating,
    Order_Accuracy
FROM delivery_reviews_raw
WHERE Agent_Name IS NOT NULL AND Rating IS NOT NULL;

num_affected_rows,num_inserted_rows


**Verify Cleaned Data**

In [None]:
%sql
SELECT COUNT(*) FROM delivery_reviews_cleaned;

count(1)
5000


## **5️⃣ Step 4: Data Analysis Queries**
### **1️⃣ Find the top 5 best-rated delivery agents**

In [None]:
%sql
SELECT Agent_Name, ROUND(AVG(Rating), 2) AS avg_rating
FROM delivery_reviews_cleaned
GROUP BY Agent_Name
ORDER BY avg_rating DESC
LIMIT 5;

Agent_Name,avg_rating
Swiggy Instamart,3.02
Zepto,3.01
Blinkit,2.99
JioMart,2.99


### **2️⃣ Count total reviews per location**

In [None]:
%sql
SELECT Location, COUNT(*) AS total_reviews
FROM delivery_reviews_cleaned
GROUP BY Location
ORDER BY total_reviews DESC;

Location,total_reviews
Kolkata,517
Ahmedabad,515
Pune,515
Delhi,514
Bangalore,513
Mumbai,498
Hyderabad,490
Jaipur,489
Chennai,478
Lucknow,471


### **3️⃣ Find the most common order type**

In [None]:
%sql
SELECT Order_Type, COUNT(*) AS total_orders
FROM delivery_reviews_cleaned
GROUP BY Order_Type
ORDER BY total_orders DESC;

Order_Type,total_orders
Electronics,1008
Food,1003
Essentials,1001
Grocery,995
Pharmacy,993


### **4️⃣ Find the average delivery time per agent**

In [None]:
%sql
SELECT Agent_Name, ROUND(AVG(Delivery_Time), 2) AS avg_delivery_time
FROM delivery_reviews_cleaned
GROUP BY Agent_Name
ORDER BY avg_delivery_time ASC;

Agent_Name,avg_delivery_time
Blinkit,34.65
JioMart,35.03
Zepto,35.06
Swiggy Instamart,35.12


### **5️⃣ Find the impact of discount on ratings**

In [None]:
%sql
SELECT Discount_Applied, ROUND(AVG(Rating), 2) AS avg_rating
FROM delivery_reviews_cleaned
GROUP BY Discount_Applied;

Discount_Applied,avg_rating
No,2.99
Yes,3.02


### **6️⃣ Categorize Ratings into Low, Medium, and High**

In [None]:
%sql
SELECT
    Agent_Name,
    Rating,
    CASE
        WHEN Rating < 3 THEN 'Low'
        WHEN Rating BETWEEN 3 AND 4 THEN 'Medium'
        ELSE 'High'
    END AS rating_category
FROM delivery_reviews_cleaned;

Agent_Name,Rating,rating_category
Zepto,4.5,High
Zepto,2.1,Low
JioMart,4.5,High
JioMart,2.6,Low
Zepto,3.6,Medium
Blinkit,1.9,Low
Blinkit,3.3,Medium
Blinkit,1.5,Low
Zepto,2.8,Low
JioMart,2.5,Low


## **6️⃣ Step 5: Export Data for Visualization**
**Convert Cleaned Data to Parquet**

In [None]:
%sql
DROP TABLE delivery_reviews_parquet;

In [None]:
%sql
CREATE TABLE delivery_reviews_parquet
STORED AS PARQUET AS
SELECT * FROM delivery_reviews_cleaned;

**Export to Local File**

### **Solution: Use Databricks File System (DBFS)**
To **export a Hive table or Parquet file from DBFS to your local machine**, follow these steps.

---

## **Step 1: Verify the Parquet Files in DBFS**
Run this command to check if the files exist:

In [None]:
%sql
SHOW TABLES;

database,tableName,isTemporary
delivery_reviews_db,delivery_reviews_cleaned,False
delivery_reviews_db,delivery_reviews_parquet,False
delivery_reviews_db,delivery_reviews_raw,False


In [None]:
%python
dbutils.fs.ls("dbfs:/user/hive/warehouse/delivery_reviews_db.db/")

Out[59]: [FileInfo(path='dbfs:/user/hive/warehouse/delivery_reviews_db.db/delivery_reviews_cleaned/', name='delivery_reviews_cleaned/', size=0, modificationTime=0),
 FileInfo(path='dbfs:/user/hive/warehouse/delivery_reviews_db.db/delivery_reviews_parquet/', name='delivery_reviews_parquet/', size=0, modificationTime=0)]

In [None]:
%python
dbutils.fs.cp("dbfs:/user/hive/warehouse/delivery_reviews_db.db/delivery_reviews_parquet/", "dbfs:/FileStore/delivery_reviews.parquet", True)

Out[60]: True

In [None]:
%fs ls dbfs:/FileStore/delivery_reviews.parquet

path,name,size,modificationTime
dbfs:/FileStore/delivery_reviews.parquet/_SUCCESS,_SUCCESS,0,1739429030000
dbfs:/FileStore/delivery_reviews.parquet/_committed_2664462527895809174,_committed_2664462527895809174,123,1739428679000
dbfs:/FileStore/delivery_reviews.parquet/_committed_3176065044349503583,_committed_3176065044349503583,123,1739429030000
dbfs:/FileStore/delivery_reviews.parquet/_started_2664462527895809174,_started_2664462527895809174,0,1739428679000
dbfs:/FileStore/delivery_reviews.parquet/_started_3176065044349503583,_started_3176065044349503583,0,1739429030000
dbfs:/FileStore/delivery_reviews.parquet/part-00000-tid-2664462527895809174-caba0d87-a298-4e59-8ce4-fdcc65688a77-40-1-c000.snappy.parquet,part-00000-tid-2664462527895809174-caba0d87-a298-4e59-8ce4-fdcc65688a77-40-1-c000.snappy.parquet,219685,1739428679000
dbfs:/FileStore/delivery_reviews.parquet/part-00000-tid-3176065044349503583-19c97251-cb13-46d1-bdf1-59145e75a9f0-68-1-c000.snappy.parquet,part-00000-tid-3176065044349503583-19c97251-cb13-46d1-bdf1-59145e75a9f0-68-1-c000.snappy.parquet,219685,1739429030000


## **7️⃣ Summary of Commands**
| **Step**  | **Command** |
|-----------|------------|
| Download Data | `curl -L -o ~/Downloads/delivery_reviews/dataset.zip ...` |
| Upload to DBFS | `dbfs cp ~/Downloads/delivery_reviews/*.csv dbfs:/FileStore/tables/delivery_reviews/` |
| Create Database | `CREATE DATABASE delivery_reviews_db;` |
| Create External Table | `CREATE EXTERNAL TABLE delivery_reviews_raw ...` |
| Create Cleaned Table | `CREATE TABLE delivery_reviews_cleaned AS SELECT ...` |
| Count Records | `SELECT COUNT(*) FROM delivery_reviews_cleaned;` |
| Top Agents | `SELECT Agent_Name, AVG(Rating) FROM delivery_reviews_cleaned ...` |
| Orders by Location | `SELECT Location, COUNT(*) FROM delivery_reviews_cleaned ...` |
| Avg Delivery Time | `SELECT Agent_Name, AVG(Delivery_Time) FROM delivery_reviews_cleaned ...` |
| Convert to Parquet | `CREATE TABLE delivery_reviews_parquet STORED AS PARQUET AS SELECT * FROM delivery_reviews_cleaned;` |