# Cleaning the Swiggy Bangalore Dataset

This notebook loads the raw Kaggle CSV, explores it, and applies cleaning steps before we build training data.

In [3]:
import pandas as pd
from pathlib import Path

RAW_CSV = Path("../data/raw/Swiggy Bangalore.csv")
CLEANED_CSV = Path("../data/raw/Swiggy Bangalore - cleaned.csv")

df = pd.read_csv(RAW_CSV)
print(f"Shape: {df.shape}")
df.head(10)

Shape: (10297, 7)


Unnamed: 0,Restaurant Name,Category,Rating,Cost for Two (in Rupees),Area,Offer,URL
0,Khichdi Paradise,"Home Food, Desserts, Beverages, Healthy Food",,250,Arekere,,https://www.swiggy.com/restaurants/khichdi-par...
1,Home Plate by EatFit,"North Indian, Home Food, Healthy Food, Indian,...",3.9,160,Arekere,,https://www.swiggy.com/restaurants/home-plate-...
2,THE GRILL & CO.,"Indian, Tandoor, Biryani",,300,Arekere,,https://www.swiggy.com/restaurants/the-grill-a...
3,555 Darjeeling Unique Asian Cuisine,Asian,,300,Arekere,,https://www.swiggy.com/restaurants/555-darjeel...
4,Momo Guy,"Asian, Tibetan, Desserts, Beverages",,200,Arekere,,https://www.swiggy.com/restaurants/momo-guy-jp...
5,BigBites,"Street Food, Fast Food, Tex-Mex",,300,Arekere,,https://www.swiggy.com/restaurants/bigbites-ar...
6,Domino's Pizza,Pizzas,,400,Arekere,,https://www.swiggy.com/restaurants/dominos-piz...
7,UDTA PUNJABI SWAD,"Indian, Tandoor, Biryani, Chinese",,250,Arekere,,https://www.swiggy.com/restaurants/udta-punjab...
8,Samosa Singh,"Snacks, North Indian, Desserts, Beverages",,150,Arekere,,https://www.swiggy.com/restaurants/samosa-sing...
9,HUNGER TREATS,"Burgers, Snacks",,300,Arekere,,https://www.swiggy.com/restaurants/hunger-trea...


## 1. Explore raw data

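Check dtypes, missing values, and any literal "NA" strings left in the dataset. `pd.read_csv` parses common markers such as `NA` and `N/A` as `NaN` by default, which is why Rating and Offer already load as `float64`.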

In [4]:
print("Dtypes:")
print(df.dtypes)
print("\n--- Columns ---")
print(df.columns.tolist())

Dtypes:
Restaurant Name              object
Category                     object
Rating                      float64
Cost for Two (in Rupees)      int64
Area                         object
Offer                       float64
URL                          object
dtype: object

--- Columns ---
['Restaurant Name', 'Category', 'Rating', 'Cost for Two (in Rupees)', 'Area', 'Offer', 'URL']


In [5]:
# Count literal "NA" strings (not pandas NaN)
for col in df.columns:
    na_count = (df[col].astype(str).str.strip().str.upper() == "NA").sum()
    if na_count > 0:
        print(f"{col}: {na_count} 'NA' strings")
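If the loop above prints nothing, the markers were most likely already parsed as `NaN` at load time, so the missing values are better counted with `isna()`. A minimal sketch on a toy frame (the `toy` data and the `summary` name are illustrative assumptions, not part of the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `df`
toy = pd.DataFrame({
    "Rating": [3.9, np.nan, np.nan, 4.2],
    "Offer": [np.nan, np.nan, np.nan, np.nan],
    "Area": ["Arekere", "HSR", None, "Arekere"],
})

# Count and percentage of missing values per column
missing = toy.isna().sum()
missing_pct = (toy.isna().mean() * 100).round(1)
summary = pd.DataFrame({"missing": missing, "percent": missing_pct})
print(summary)
```

Running the same two lines on `df` would show how sparse Rating and Offer are before any cleaning.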

In [6]:
# Value counts for key columns
print("Restaurant Name - sample (first 5):")
print(df["Restaurant Name"].head())
print("\nArea - value_counts (top 10):")
print(df["Area"].value_counts().head(10))
print("\nCost for Two - describe:")
print(df["Cost for Two (in Rupees)"].describe())

Restaurant Name - sample (first 5):
0                       Khichdi Paradise
1                   Home Plate by EatFit
2                        THE GRILL & CO.
3    555 Darjeeling Unique Asian Cuisine
4                              Momo Guy 
Name: Restaurant Name, dtype: object

Area - value_counts (top 10):
Area
Kadubeesanahalli        588
Marathahalli            585
Rajarajeshwari Nagar    494
Indiranagar             480
Koramangala             480
HSR                     480
Central Bangalore       480
Kammanahalli            480
Whitefield              480
Yelahanka               480
Name: count, dtype: int64

Cost for Two - describe:
count    10297.000000
mean       305.419540
std        184.257752
min          1.000000
25%        200.000000
50%        250.000000
75%        350.000000
max       3500.000000
Name: Cost for Two (in Rupees), dtype: float64


## 2. Cleaning steps

- Replace literal `"NA"`-style strings with missing values in the text columns (Rating and Offer already load as numeric with `NaN`).
- Strip whitespace from all string columns.
- Clean **Category**: remove trailing `...`, normalize spacing.
- Clean **Cost for Two**: coerce to numeric and clip values outside a sensible range (50 to 5000).
- Drop rows with missing or empty **Restaurant Name**.
- Optional: deduplicate by (Restaurant Name, Area).

In [7]:
# Work on a copy
clean = df.copy()

# Strip whitespace, then convert empty strings and literal NA markers to missing values.
# .str.strip() is applied without astype(str), which would turn existing NaN
# into the string "nan" (and pd.NA into "<NA>").
for col in clean.select_dtypes(include=["object"]).columns:
    clean[col] = clean[col].str.strip()
    # Case-insensitive match on "", "NA", or "N/A" as the whole value
    clean[col] = clean[col].replace(r"(?i)^(?:NA|N/A)?$", pd.NA, regex=True)

print("After NA replacement and strip - sample Rating, Offer:")
print(clean[["Rating", "Offer"]].head(10))

After NA replacement and strip - sample Rating, Offer:
   Rating  Offer
0     NaN    NaN
1     3.9    NaN
2     NaN    NaN
3     NaN    NaN
4     NaN    NaN
5     NaN    NaN
6     NaN    NaN
7     NaN    NaN
8     NaN    NaN
9     NaN    NaN
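The strip-and-replace step is easy to check on a toy column before running it on the full frame. A minimal sketch, assuming a case-insensitive pattern for `NA` and `N/A`; the helper name `normalize_missing` and the sample values are hypothetical:

```python
import numpy as np
import pandas as pd

def normalize_missing(series: pd.Series) -> pd.Series:
    """Strip whitespace, then map "", "NA", and "N/A" (any case) to missing."""
    stripped = series.str.strip()  # .str methods leave existing NaN untouched
    return stripped.replace(r"(?i)^(?:NA|N/A)?$", pd.NA, regex=True)

# Hypothetical values; dtype=object mirrors how read_csv stores text columns
sample = pd.Series(
    ["  Momo Guy ", "NA", "", " n/a ", np.nan, "Nando's"],
    dtype=object,
)
result = normalize_missing(sample)
print(result.isna().tolist())
```

Because the pattern is anchored at both ends, a name that merely starts with "Na" (such as "Nando's") is left alone.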


In [8]:
# Clean Category: remove trailing "...", collapse multiple spaces.
# No astype(str) here: .str methods keep missing values missing,
# whereas astype(str) would turn them into the strings "nan" or "<NA>".
clean["Category"] = (
    clean["Category"]
    .str.replace(r"\.{2,}$", "", regex=True)
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)
clean["Category"] = clean["Category"].replace("", pd.NA)
print("Category sample (after clean):")
print(clean["Category"].head(5).tolist())

Category sample (after clean):
['Home Food, Desserts, Beverages, Healthy Food', 'North Indian, Home Food, Healthy Food, Indian, Punjabi, South Indian, Rajasthani', 'Indian, Tandoor, Biryani', 'Asian', 'Asian, Tibetan, Desserts, Beverages']
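Note that the `...` seen in the `head()` tables is pandas truncating long cells for display; the list printed above shows the full string. The two regexes can still be checked on a toy column. A minimal sketch with hypothetical sample strings:

```python
import pandas as pd

# Hypothetical strings: one with a literal trailing ellipsis, one with extra spaces
sample = pd.Series(["Indian, Tandoor...", "Asian,   Tibetan ", "Pizzas"], dtype=object)

cleaned = (
    sample
    .str.replace(r"\.{2,}$", "", regex=True)  # drop two or more dots at the end
    .str.replace(r"\s+", " ", regex=True)     # collapse runs of whitespace
    .str.strip()
)
print(cleaned.tolist())
```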


In [9]:
# Cost for Two: coerce to numeric (unparseable values become NaN)
clean["Cost for Two (in Rupees)"] = pd.to_numeric(clean["Cost for Two (in Rupees)"], errors="coerce")
# Optional: clip to a sensible range. This keeps every row but rewrites values
# below 50 to 50 and values above 5000 to 5000 (here the raw minimum of 1 becomes 50).
clean["Cost for Two (in Rupees)"] = clean["Cost for Two (in Rupees)"].clip(lower=50, upper=5000)
print("Cost for Two - describe after clean:")
print(clean["Cost for Two (in Rupees)"].describe())

Cost for Two - describe after clean:
count    10297.000000
mean       305.499272
std        184.139208
min         50.000000
25%        200.000000
50%        250.000000
75%        350.000000
max       3500.000000
Name: Cost for Two (in Rupees), dtype: float64
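Clipping overwrites implausible costs with the boundary value. An alternative, if those values should be treated as unknown instead, is to mask them to `NaN`. A minimal sketch on hypothetical costs, reusing the 50 and 5000 bounds from the cell above (`LOW`, `HIGH`, and the sample values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical costs, including one too low and one too high
cost = pd.Series([1, 160, 250, 3500, 9000], name="Cost for Two (in Rupees)")
LOW, HIGH = 50, 5000  # same bounds as the clip above

in_range = cost.between(LOW, HIGH)  # inclusive on both ends by default
print(f"Out of range: {(~in_range).sum()} of {len(cost)}")

masked = cost.where(in_range)                 # out-of-range values become NaN
clipped = cost.clip(lower=LOW, upper=HIGH)    # out-of-range values become the bound
```

Counting the out-of-range rows first makes it clear how many values either choice would affect.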


In [10]:
# Drop rows with missing or empty Restaurant Name
before = len(clean)
clean = clean.dropna(subset=["Restaurant Name"])
clean = clean[clean["Restaurant Name"].astype(str).str.len() > 0]
after = len(clean)
print(f"Dropped {before - after} rows with missing/empty Restaurant Name. Rows left: {after}")

Dropped 0 rows with missing/empty Restaurant Name. Rows left: 10297


In [11]:
# Optional: deduplicate by (Restaurant Name, Area) - keep first occurrence
before = len(clean)
clean = clean.drop_duplicates(subset=["Restaurant Name", "Area"], keep="first")
print(f"After dedup (Name + Area): {before} -> {len(clean)} rows ({before - len(clean)} duplicates removed)")

After dedup (Name + Area): 10297 -> 10291 rows (6 duplicates removed)
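Before dropping duplicates it can help to look at them, since `keep="first"` retains whichever row happens to come first, even if a later one has more data. A minimal sketch on a toy frame (the rows are hypothetical):

```python
import pandas as pd

# Hypothetical rows: the first two share the same (Restaurant Name, Area) key
toy = pd.DataFrame({
    "Restaurant Name": ["Momo Guy", "Momo Guy", "Momo Guy", "BigBites"],
    "Area": ["Arekere", "Arekere", "HSR", "Arekere"],
    "Rating": [None, 4.1, 3.8, None],
})
key = ["Restaurant Name", "Area"]

# keep=False flags every member of a duplicate group, not just the repeats
dupes = toy[toy.duplicated(subset=key, keep=False)].sort_values(key)
print(dupes)

deduped = toy.drop_duplicates(subset=key, keep="first")
print(f"{len(toy)} -> {len(deduped)} rows")
```

In this toy case the retained row is the one without a rating, which is the kind of thing worth checking on the six real duplicates.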


In [12]:
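# Normalize Restaurant Name and Area: collapse multiple spaces.
# Note: this runs after the dedup above, so names that differed only in
# internal spacing were not merged; rerun the dedup cell afterwards if that matters.
# No astype(str) here, so a missing Area stays missing instead of becoming a string.
clean["Restaurant Name"] = (
    clean["Restaurant Name"]
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)
clean["Area"] = (
    clean["Area"]
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)
clean["Area"] = clean["Area"].replace("", pd.NA)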
clean.head()

Unnamed: 0,Restaurant Name,Category,Rating,Cost for Two (in Rupees),Area,Offer,URL
0,Khichdi Paradise,"Home Food, Desserts, Beverages, Healthy Food",,250,Arekere,,https://www.swiggy.com/restaurants/khichdi-par...
1,Home Plate by EatFit,"North Indian, Home Food, Healthy Food, Indian,...",3.9,160,Arekere,,https://www.swiggy.com/restaurants/home-plate-...
2,THE GRILL & CO.,"Indian, Tandoor, Biryani",,300,Arekere,,https://www.swiggy.com/restaurants/the-grill-a...
3,555 Darjeeling Unique Asian Cuisine,Asian,,300,Arekere,,https://www.swiggy.com/restaurants/555-darjeel...
4,Momo Guy,"Asian, Tibetan, Desserts, Beverages",,200,Arekere,,https://www.swiggy.com/restaurants/momo-guy-jp...


## 3. Save cleaned CSV

Write the cleaned DataFrame to `data/raw/Swiggy Bangalore - cleaned.csv`. You can point `prepare_data.py` at this file instead of the original if you prefer.

In [13]:
CLEANED_CSV.parent.mkdir(parents=True, exist_ok=True)
clean.to_csv(CLEANED_CSV, index=False)
print(f"Saved cleaned data to {CLEANED_CSV}")
print(f"Final shape: {clean.shape}")

Saved cleaned data to ../data/raw/Swiggy Bangalore - cleaned.csv
Final shape: (10291, 7)
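A quick round-trip check confirms that the saved file reloads with the same shape and that missing values survive as `NaN`. A minimal sketch using a temporary directory and a toy frame, so it does not touch the real files (the file name and rows are hypothetical):

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `clean`
toy = pd.DataFrame({
    "Restaurant Name": ["Momo Guy", "BigBites"],
    "Rating": [np.nan, 4.1],
    "Area": ["Arekere", "HSR"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy - cleaned.csv"
    toy.to_csv(path, index=False)   # missing values are written as empty fields
    reloaded = pd.read_csv(path)    # empty fields are read back as NaN

same_shape = reloaded.shape == toy.shape
same_columns = reloaded.columns.tolist() == toy.columns.tolist()
print(same_shape, same_columns, reloaded["Rating"].isna().sum())
```

The same comparison against `clean` and `CLEANED_CSV` would catch an accidental extra index column or a changed row count.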
