
# **Big Basket 🧺**
  **Forget the days of grocery shopping being a chore! Imagine this: you're lounging on the couch, phone in hand, and with a few taps you've got a truckload (well, maybe a basketful) of fresh produce, pantry staples, and even household essentials on their way to your doorstep. That's the magic of bigbasket, India's one stop grocery shopping destination.**

  **They've got over 20,000 products from all your favorite brands, so you can stock up on everything you need without ever leaving home. Fruits and veggies? Got it. Dairy and meat for that epic dinner party? No problem. Bigbasket even has beauty supplies and cleaning products, so you can basically tackle your entire shopping list in one place. Plus, they have crazy convenient delivery options, so you can ditch the supermarket lines and spend that time doing way cooler things (like prepping for that dinner party!). Bigbasket basically makes grocery shopping a breeze, so you can get back to the fun stuff.**

<hr>

# **About the dataset 📊**

**This dataset is basically a big ol' bunch of info about products, all broken down into 10 easy-peasy pieces:**

  * **Index: This is just a fancy way of saying it's a unique ID for each item, like a fingerprint in the data world.**
  * **Product: The name of the product, just like you'd see it on the website.**
  * **Category: The broad group the product falls into, like groceries or home stuff.**
  * **Sub-Category: This is like zooming in on the category. So, maybe "groceries" becomes "fruits" or "home stuff" becomes "cleaning supplies."**
  * **Brand: Who makes the product? You know, like Nike or that yummy jam brand you love.**
  * **Sale Price: How much you gotta pay for it right now.**
  * **Market Price: This is kind of like a reference point, showing the usual price for the product.**
  * **Type: Another way to classify the product, just for extra organization.**
  * **Rating: What other customers think! This is a number showing how much people liked it.**
  * **Description: The product description shown on the product page, covering what the item is, what it contains, and how it is used.**

**Dataset Link: https://drive.google.com/file/d/1aEuXxadTlHS4d_BBqrhVVurOFI154ATS/view?usp=drive_link**

# **Step 1 - Loading the libraries**

##### **Configuration Libraries**

In [None]:
import warnings
warnings.filterwarnings("ignore")

##### **Classical Libraries**

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt

##### **External Libraries**

In [None]:
!pip install colorama
import colorama
from colorama import Fore, Back, Style

Collecting colorama
  Downloading colorama-0.4.6-py2.py3-none-any.whl.metadata (17 kB)
Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Installing collected packages: colorama
Successfully installed colorama-0.4.6


# **Step 2 - Data Ingestion**

### **Data Loading**

In [None]:
df = pd.read_csv("/content/drive/MyDrive/Datasets/BBData.csv")

### **Data Inspection**

In [None]:
df.head().style.set_properties(
    **{
        "background-color": "#FF9B49",
        "color": "black",
        "border-color": "black",
        "border-style": "solid"
    }
)

Unnamed: 0,index,product,category,sub_category,brand,sale_price,market_price,type,rating,description
0,1,Garlic Oil - Vegetarian Capsule 500 mg,Beauty & Hygiene,Hair Care,Sri Sri Ayurveda,220.0,220.0,Hair Oil & Serum,4.1,"This Product contains Garlic Oil that is known to help proper digestion, maintain proper cholesterol levels, support cardiovascular and also build immunity. For Beauty tips, tricks & more visit https://bigbasket.blog/"
1,2,Water Bottle - Orange,"Kitchen, Garden & Pets",Storage & Accessories,Mastercook,180.0,180.0,Water & Fridge Bottles,2.3,"Each product is microwave safe (without lid), refrigerator safe, dishwasher safe and can also be used for re-heating food and not for cooking. All containers come with airtight lids and a wide variety of attractive colours. Stack these stylish and colourful containers in your kitchen with ease and for a look-good factor."
2,3,"Brass Angle Deep - Plain, No.2",Cleaning & Household,Pooja Needs,Trm,119.0,250.0,Lamp & Lamp Oil,3.4,"A perfect gift for all occasions, be it your mother, sister, in-laws, boss or your friends, this beautiful designer piece wherever placed, is sure to beautify the surroundings Traditional design This type diya has been used for Diwali and All other Festivals for centuries. Sturdy and easy to carry The feet keep it balanced to ensure safety. Wonderful Oil Lamp made in Brass also called as Jyoti. This is a handcrafted piece of Indian brass Deepak."
3,4,Cereal Flip Lid Container/Storage Jar - Assorted Colour,Cleaning & Household,Bins & Bathroom Ware,Nakoda,149.0,176.0,"Laundry, Storage Baskets",3.7,"Multipurpose container with an attractive design and made from food-grade plastic for your hygiene and safety ideal for storing pulses. Grains, spices, and more with easy opening and closing flip-open lid. Strong, durable and transparent body for longevity and easy identification of contents. Multipurpose storage solution for your daily needs stores your everyday food essentials in style with the Nakoda container set. With transparent bodies, you can easily identify your stored items without having to open the lids. These containers are ideal for storing a large variety of items such as food grains, snacks and pulses to sugar, spices, condiments and more. Featuring unique flip-open lids, you can easily open and close this container without any hassles. The Nakoda container is made from high-quality food-grade and BPA-free plastic that is 100% safe for storing food items. You can safely store your food items in this container without worrying about contamination and harmful toxins. As they are constructed using highly durable virgin plastic, this container will last for a long time even with regular use. This container can enhance the overall look of your kitchen decor. Being dishwasher safe, cleaning and maintaining this container is an easy task. You can also use a simple soap solution to manually wash and retain their looks for a long time."
4,5,Creme Soft Soap - For Hands & Body,Beauty & Hygiene,Bath & Hand Wash,Nivea,162.0,162.0,Bathing Bars & Soaps,4.4,"Nivea Creme Soft Soap gives your skin the best care that it must get. The soft bar consists of Vitamins F and Almonds which are really skin gracious and help you get great skin. It provides the skin with moisture and leaves behind flawless and smooth skin. It makes sure that your body is totally free of germs & dirt and at the same time well nourished.For Beauty tips, tricks & more visit https://bigbasket.blog/"


# **Step 3 - Cleaning and Preprocessing Data**

### **Null Check**

In [None]:
df.isnull().sum()

Unnamed: 0,0
index,0
product,1
category,0
sub_category,0
brand,1
sale_price,0
market_price,0
type,0
rating,8626
description,115


**Assumption**
* **For products with a missing rating, we assume they are either new to the inventory or among the lowest sellers**

* **Fill values: rating → 0, description → `NoDescriptionFound`**

**Working with the null values**

In [None]:
df["product"] = df["product"].fillna("NoProductNameFound")
df["brand"] = df["brand"].fillna("NoBrandNameFound")
df["rating"] = df["rating"].fillna(0)
df["description"] = df["description"].fillna("NoDescriptionFound")
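Filling missing ratings with 0 is a modelling choice, and it changes any later average of the column. The sketch below uses a small made-up Series (not the real dataset) to show how far the mean can shift. The `is_rated` flag is a hypothetical helper, not part of the notebook, that keeps unrated products distinguishable from genuinely low-rated ones.

```python
import numpy as np
import pandas as pd

# Toy ratings, invented for illustration: two of five products are unrated
ratings = pd.Series([4.0, 3.0, np.nan, 5.0, np.nan])

# Hypothetical helper flag: True where a real customer rating exists
is_rated = ratings.notna()

# Mean over rated products only (pandas skips NaN by default)
mean_rated_only = ratings.mean()

# Mean after the fill used in this notebook: unrated products count as 0
mean_after_fill = ratings.fillna(0).mean()

print(mean_rated_only, mean_after_fill)
```

If the average rating is later used as a cut-off, it may be worth computing it over rated products only.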

**Rounding off the data**

In [None]:
# Rounding off the sales price for easy understanding
df["sale_price"] = df["sale_price"].round().astype(int)

# Rounding off the market price for easy understanding
df["market_price"] = df["market_price"].round().astype(int)

**Data Inspection**

In [None]:
df.head()

Unnamed: 0,index,product,category,sub_category,brand,sale_price,market_price,type,rating,description
0,1,Garlic Oil - Vegetarian Capsule 500 mg,Beauty & Hygiene,Hair Care,Sri Sri Ayurveda,220,220,Hair Oil & Serum,4.1,This Product contains Garlic Oil that is known...
1,2,Water Bottle - Orange,"Kitchen, Garden & Pets",Storage & Accessories,Mastercook,180,180,Water & Fridge Bottles,2.3,"Each product is microwave safe (without lid), ..."
2,3,"Brass Angle Deep - Plain, No.2",Cleaning & Household,Pooja Needs,Trm,119,250,Lamp & Lamp Oil,3.4,"A perfect gift for all occasions, be it your m..."
3,4,Cereal Flip Lid Container/Storage Jar - Assort...,Cleaning & Household,Bins & Bathroom Ware,Nakoda,149,176,"Laundry, Storage Baskets",3.7,Multipurpose container with an attractive desi...
4,5,Creme Soft Soap - For Hands & Body,Beauty & Hygiene,Bath & Hand Wash,Nivea,162,162,Bathing Bars & Soaps,4.4,Nivea Creme Soft Soap gives your skin the best...


# **Step 4 - Exploratory Data Analysis (EDA)**

### **Task 1 - Find the discount offered on each product relative to its market price**

In [None]:
df["finalized_discounts"] = (((df["market_price"] - df["sale_price"]) / df["market_price"])*100)

In [None]:
df.head(10)

Unnamed: 0,index,product,category,sub_category,brand,sale_price,market_price,type,rating,description,finalized_discounts
0,1,Garlic Oil - Vegetarian Capsule 500 mg,Beauty & Hygiene,Hair Care,Sri Sri Ayurveda,220,220,Hair Oil & Serum,4.1,This Product contains Garlic Oil that is known...,0.0
1,2,Water Bottle - Orange,"Kitchen, Garden & Pets",Storage & Accessories,Mastercook,180,180,Water & Fridge Bottles,2.3,"Each product is microwave safe (without lid), ...",0.0
2,3,"Brass Angle Deep - Plain, No.2",Cleaning & Household,Pooja Needs,Trm,119,250,Lamp & Lamp Oil,3.4,"A perfect gift for all occasions, be it your m...",52.4
3,4,Cereal Flip Lid Container/Storage Jar - Assort...,Cleaning & Household,Bins & Bathroom Ware,Nakoda,149,176,"Laundry, Storage Baskets",3.7,Multipurpose container with an attractive desi...,15.340909
4,5,Creme Soft Soap - For Hands & Body,Beauty & Hygiene,Bath & Hand Wash,Nivea,162,162,Bathing Bars & Soaps,4.4,Nivea Creme Soft Soap gives your skin the best...,0.0
5,6,Germ - Removal Multipurpose Wipes,Cleaning & Household,All Purpose Cleaners,Nature Protect,169,199,Disinfectant Spray & Cleaners,3.3,Stay protected from contamination with Multipu...,15.075377
6,7,Multani Mati,Beauty & Hygiene,Skin Care,Satinance,58,58,Face Care,3.6,Satinance multani matti is an excellent skin t...,0.0
7,8,Hand Sanitizer - 70% Alcohol Base,Beauty & Hygiene,Bath & Hand Wash,Bionova,250,250,Hand Wash & Sanitizers,4.0,70%Alcohol based is gentle of hand leaves skin...,0.0
8,9,Biotin & Collagen Volumizing Hair Shampoo + Bi...,Beauty & Hygiene,Hair Care,StBotanica,1098,1098,Shampoo & Conditioner,3.5,"An exclusive blend with Vitamin B7 Biotin, Hyd...",0.0
9,10,"Scrub Pad - Anti- Bacterial, Regular",Cleaning & Household,"Mops, Brushes & Scrubs",Scotch brite,20,20,"Utensil Scrub-Pad, Glove",4.3,Scotch Brite Anti- Bacterial Scrub Pad thoroug...,0.0
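The discount column is the gap between market price and sale price, expressed as a percentage of the market price. Here is a minimal plain-Python sketch of that formula, checked against two rows from the output above. The zero-price guard is an assumption added for safety; the notebook itself does not handle that case.

```python
def discount_percent(market_price, sale_price):
    """Percentage discount relative to the market price."""
    if market_price == 0:
        # Assumption: treat a zero market price as "no discount" instead of dividing by zero
        return 0.0
    return (market_price - sale_price) / market_price * 100


# Brass lamp row: market 250, sale 119 (about 52.4 %)
print(discount_percent(250, 119))

# Storage jar row: market 176, sale 149 (about 15.34 %)
print(discount_percent(176, 149))
```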


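**In this sample, several products carry no discount (sale price equals market price), while others are discounted by 15% to over 50%. A plausible reading is that slower-moving products are discounted more heavily and regularly purchased ones are not, but the dataset has no sales volumes, so this remains an assumption.**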

#### **Get an overview of the entire data**

In [None]:
print(Back.BLACK + Style.BRIGHT + "Summary of Products" + Style.RESET_ALL)
print(Fore.RED + "Total Number of unique products:" + Style.RESET_ALL + Fore.YELLOW + str(df["product"].nunique()) + Style.RESET_ALL)
print(Fore.RED + "Total Number of unique products categories:" + Style.RESET_ALL + Fore.YELLOW + str(df["category"].nunique())+ Style.RESET_ALL)
print(Fore.RED + "Total Number of unique products sub categories:"  + Style.RESET_ALL + Fore.YELLOW +str(df["sub_category"].nunique())+ Style.RESET_ALL)
print(Fore.RED + "Total Number of unique products type:" + Style.RESET_ALL + Fore.YELLOW + str(df["type"].nunique())+ Style.RESET_ALL)
print(Fore.RED + "Total Number of unique products brands:" + Style.RESET_ALL + Fore.YELLOW + str(df["brand"].nunique())+ Style.RESET_ALL)

Summary of Products
Total Number of unique products:23541
Total Number of unique products categories:11
Total Number of unique products sub categories:90
Total Number of unique products type:426
Total Number of unique products brands:2314


### **Task 2 - Analyze the data based on the Products and Categories to anticipate the demand**

In [None]:
# Grabbing the category and product columns
df_product_category = df[["category", "product"]]

In [None]:
# Drop all the duplicates, as we want to work with distinct data points
df_product_category = df_product_category.drop_duplicates()

In [None]:
# Now, grouping the data by category based on count of the products
df_product_category = df_product_category.groupby("category").agg(product_count = ("product", "count")).reset_index().sort_values("product_count", ascending = False)

In [None]:
# Results
df_product_category.head()

Unnamed: 0,category,product_count
2,Beauty & Hygiene,6839
8,Gourmet & World Food,4109
9,"Kitchen, Garden & Pets",3186
10,Snacks & Branded Foods,2454
4,Cleaning & Household,2411


In [None]:
# Visualize
fig = px.bar(df_product_category, x = "category", y = "product_count", color = "category", title = "Analysis based on category via product count")
fig.show()

**Insights**
  * **We should invest more in the categories that are in demand, to make better use of storage space**
  * **Out of all the categories, `Beauty & Hygiene` has the most distinct products (6,839)**
  * **`Gourmet & World Food` (4,109) and `Kitchen, Garden & Pets` (3,186) follow, completing the top 3**
  * **Assuming catalogue breadth tracks demand, most demand comes from these three categories, since together they hold more products than the remaining categories combined**

### **Task 3 - Analyze on the basis of brand and type to understand, which brands are famous among the consumers**

In [None]:
# Grab the data
df_brand_type = df[["brand", "type"]]

In [None]:
# Drop the duplicates
df_brand_type = df_brand_type.drop_duplicates()

In [None]:
# Grouping the data for the brand based on the count
df_brand_type = df_brand_type.groupby("brand").agg(type_count = ("type", "count")).reset_index().sort_values("type_count", ascending = False)

**Results**

In [None]:
df_brand_type.head(10)

Unnamed: 0,brand,type_count
2296,bb Combo,53
741,Fresho,41
171,BB Home,37
507,Dabur,34
2298,bb Royal,32
502,DP,31
1383,NUTRIWISH,23
1480,Nutty Yogi,23
2165,Urban Platter,21
1589,Patanjali,21


**Visualize**

In [None]:
fig = px.bar(df_brand_type.head(10), x = "brand", y = "type_count", color = "brand", title = "Analysis based on brand via type count")
fig.show()

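* **Planning - bigbasket's own labels (such as bb Combo, BB Home and bb Royal) span the most product types, so we can plan to keep more home-brand products in stock**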
* **Sourcing, Manufacturing**

### **Task 4 - Analyse the data over sub_categories and categories, and try to bring in insight based on demands related to categories**

In [None]:
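# Note: "count" tallies product rows per category, not distinct sub-categories (use "nunique" for that)
subcategory_data = df.groupby("category").agg(subcategory_count = ("sub_category", "count")).reset_index().sort_values("subcategory_count", ascending = False)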

In [None]:
subcategory_data.head()

Unnamed: 0,category,subcategory_count
2,Beauty & Hygiene,7867
8,Gourmet & World Food,4690
9,"Kitchen, Garden & Pets",3580
10,Snacks & Branded Foods,2814
6,"Foodgrains, Oil & Masala",2676
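The numbers above are in the thousands because `"count"` tallies rows, so each figure is the number of products listed in a category rather than the number of different subcategories. A small sketch on made-up data (not the real catalogue) shows the two aggregations side by side; the column names `row_count` and `distinct_subcategories` are illustrative.

```python
import pandas as pd

# Toy catalogue, invented for illustration
toy = pd.DataFrame({
    "category": ["Beauty", "Beauty", "Beauty", "Food", "Food"],
    "sub_category": ["Skin Care", "Skin Care", "Hair Care", "Snacks", "Snacks"],
})

per_category = toy.groupby("category").agg(
    # Number of product rows in the category
    row_count=("sub_category", "count"),
    # Number of different subcategories in the category
    distinct_subcategories=("sub_category", "nunique"),
).reset_index()

print(per_category)
```

Either measure can be useful; the point is only to be explicit about which one a chart is showing.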


**Visualize**

In [None]:
fig = px.pie(subcategory_data, values = "subcategory_count", names = "category", title = "Analysis based on category via subcategory count")
fig.show()

#### **Understand the subcategory distribution in Beauty & Hygiene**

In [None]:
beauty_hygiene_subcategories = df[df['category'] == 'Beauty & Hygiene']['sub_category'].value_counts().reset_index()
beauty_hygiene_subcategories.columns = ['subcategory', 'count']

print(Fore.YELLOW + "Subcategory Distribution in Beauty & Hygiene:" + Style.RESET_ALL)
print(beauty_hygiene_subcategories)

fig = px.bar(beauty_hygiene_subcategories, x='subcategory', y='count', color='subcategory',
             title='Subcategory Distribution in Beauty & Hygiene')
fig.show()

Subcategory Distribution in Beauty & Hygiene:
           subcategory  count
0            Skin Care   2294
1    Health & Medicine   1133
2            Hair Care   1028
3    Fragrances & Deos   1000
4     Bath & Hand Wash    996
5       Men's Grooming    805
6     Feminine Hygiene    285
7            Oral Care    271
8               Makeup     48
9  Mothers & Maternity      7


#### **Understand the subcategory distribution in Gourmet & World Food**

In [None]:
gourmet_world_food_subcategories = df[df['category'] == 'Gourmet & World Food']['sub_category'].value_counts().reset_index()
gourmet_world_food_subcategories.columns = ['subcategory', 'count']

print(Fore.YELLOW + "Subcategory Distribution in Gourmet & World Food:" + Style.RESET_ALL)
print(gourmet_world_food_subcategories)

fig = px.bar(gourmet_world_food_subcategories, x='subcategory', y='count', color='subcategory',
             title='Subcategory Distribution in Gourmet & World Food')
fig.show()

Subcategory Distribution in Gourmet & World Food:
                 subcategory  count
0   Snacks, Dry Fruits, Nuts    840
1         Drinks & Beverages    736
2     Cooking & Baking Needs    693
3     Sauces, Spreads & Dips    667
4      Chocolates & Biscuits    609
5             Dairy & Cheese    253
6      Pasta, Soup & Noodles    251
7             Oils & Vinegar    239
8        Cereals & Breakfast    204
9    Tinned & Processed Food    168
10             Bakery Snacks     14
11      Atta, Flours & Sooji     12
12      Rice & Rice Products      3
13             Mutton & Lamb      1


#### **Understand the subcategory distribution in Kitchen, Garden & Pets**

In [None]:
kitchen_garden_pets_subcategories = df[df['category'] == 'Kitchen, Garden & Pets']['sub_category'].value_counts().reset_index()
kitchen_garden_pets_subcategories.columns = ['subcategory', 'count']

print(Fore.YELLOW + "Subcategory Distribution in Kitchen, Garden & Pets:" + Style.RESET_ALL)
print(kitchen_garden_pets_subcategories)

fig = px.bar(kitchen_garden_pets_subcategories, x='subcategory', y='count', color='subcategory',
             title='Subcategory Distribution in Kitchen, Garden & Pets')
fig.show()

Subcategory Distribution in Kitchen, Garden & Pets:
                subcategory  count
0     Storage & Accessories   1015
1        Crockery & Cutlery    890
2    Pet Food & Accessories    356
3      Cookware & Non Stick    354
4            Steel Utensils    353
5       Kitchen Accessories    330
6  Appliances & Electricals    138
7         Flask & Casserole     48
8                  Bakeware     48
9                 Gardening     48


### **Task 5 - Analyze price of the first category beauty and hygiene in order to understand the price for different sub categories, and also check for the average discount provided in each sub category**

In [None]:
beauty_hygiene_df = df[df['category'] == 'Beauty & Hygiene']

In [None]:
beauty_hygiene_price_discount = beauty_hygiene_df.groupby('sub_category').agg(
    average_sale_price=('sale_price', 'mean'),
    average_market_price=('market_price', 'mean'),
    average_discount=('finalized_discounts', 'mean')
).reset_index()

In [None]:
print(Fore.YELLOW + "Price and Discount Analysis for Beauty & Hygiene Subcategories:" + Style.RESET_ALL)
beauty_hygiene_price_discount

Price and Discount Analysis for Beauty & Hygiene Subcategories:


Unnamed: 0,sub_category,average_sale_price,average_market_price,average_discount
0,Bath & Hand Wash,229.817269,258.277108,9.087102
1,Feminine Hygiene,316.729825,370.624561,10.389217
2,Fragrances & Deos,893.039,1136.047,22.162203
3,Hair Care,383.544747,430.699416,8.279233
4,Health & Medicine,365.691086,382.766108,4.393515
5,Makeup,329.4375,455.333333,24.091967
6,Men's Grooming,322.834783,395.787578,15.918188
7,Mothers & Maternity,333.714286,435.428571,14.767701
8,Oral Care,182.243542,206.826568,9.596557
9,Skin Care,412.098518,482.879686,14.322309
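One thing to keep in mind when reading `average_discount`: it is the mean of the per-product discount percentages, which is not the same as the discount implied by the average prices. The sketch below uses two made-up products to show the gap; the variable names are illustrative.

```python
import pandas as pd

# Toy products, invented for illustration: one cheap item with a deep discount,
# one expensive item with a shallow discount
toy = pd.DataFrame({
    "market_price": [100, 1000],
    "sale_price": [50, 900],
})
toy["discount"] = (toy["market_price"] - toy["sale_price"]) / toy["market_price"] * 100

# Mean of per-product discounts: every product counts equally
mean_of_discounts = toy["discount"].mean()

# Discount implied by the average prices: expensive products weigh more
avg_market = toy["market_price"].mean()
avg_sale = toy["sale_price"].mean()
discount_of_means = (avg_market - avg_sale) / avg_market * 100

print(mean_of_discounts, discount_of_means)
```

This is why the `average_discount` column does not have to match a discount recomputed from `average_sale_price` and `average_market_price` in the same row.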




Here are the key insights based on the preceding analysis:

1.  **Discount Strategy:** The first rows suggest that many products carry no discount while others are discounted heavily. This is consistent with a strategy of clearing slow-moving inventory or incentivizing purchases of less popular items, but the dataset has no sales volumes, so the link between demand and discount is an assumption rather than a finding.

2.  **Product Diversity and Demand:**
    *   `Beauty & Hygiene` has the largest variety of products, indicating it's a major focus area and likely experiences high overall demand.
    *   `Gourmet & World Food` and `Kitchen, Garden & Pets` also have a significant number of products, placing them among the top categories in terms of product offering and potentially demand.
    *   These top 3 categories (`Beauty & Hygiene`, `Gourmet & World Food`, `Kitchen, Garden & Pets`) combined represent a substantial portion of the product catalog, suggesting that focusing on inventory and strategies within these areas is crucial for meeting customer needs.

3.  **Brand Popularity:** The analysis of brands based on the count of `type` associated with them can indicate which brands have a wider range of products or are more prominent in the catalog. The top brands identified from the bar chart likely represent popular or widely distributed brands that customers are familiar with. Further analysis is needed to confirm if 'type count' directly correlates with consumer preference or sales volume, but it's a good indicator of the brand's presence.

4.  **Subcategory Demand within Top Categories:**
    *   Within `Beauty & Hygiene`, `Skin Care` (2,294 products) leads by a wide margin, followed by `Health & Medicine` (1,133), `Hair Care` (1,028), and `Fragrances & Deos` (1,000), indicating strong demand or a wide variety of products in these areas.
    *   Within `Gourmet & World Food`, `Snacks, Dry Fruits, Nuts` (840), `Drinks & Beverages` (736), and `Cooking & Baking Needs` (693) have the most products, suggesting these are popular choices.
    *   Within `Kitchen, Garden & Pets`, `Storage & Accessories` (1,015) and `Crockery & Cutlery` (890) are dominant, with `Pet Food & Accessories` (356) a distant third, pointing to significant product offerings or demand in these areas.

5.  **Pricing and Discount Trends in Beauty & Hygiene:** The analysis of average sale price, market price, and discount for subcategories within `Beauty & Hygiene` reveals pricing strategies within this category. In the table above, `Makeup` (about 24%) and `Fragrances & Deos` (about 22%) carry the highest average discounts, while `Health & Medicine` (about 4%) carries the lowest. These differences could be influenced by product cost, competition, or promotional strategies specific to each subcategory. This information is valuable for optimizing pricing and promotional offers within the `Beauty & Hygiene` section. For instance, subcategories with high average discounts might be areas where inventory needs to be moved or where competition is driving prices down.

In [None]:
fig_price = px.bar(beauty_hygiene_price_discount, x='sub_category', y='average_sale_price', color='sub_category',
                   title='Average Sale Price per Subcategory in Beauty & Hygiene')
fig_price.show()

In [None]:
fig_discount = px.bar(beauty_hygiene_price_discount, x='sub_category', y='average_discount', color='sub_category',
                      title='Average Discount per Subcategory in Beauty & Hygiene')
fig_discount.show()

##### **Understanding the brand distribution in a specific subcategory (Makeup)**

In [None]:
makeup_df = df[df['sub_category'] == 'Makeup']

In [None]:
makeup_brand_counts = makeup_df['brand'].value_counts().reset_index()
makeup_brand_counts.columns = ['brand', 'count']

In [None]:
print(Fore.YELLOW + "Brand Distribution in Makeup Subcategory:" + Style.RESET_ALL)
print(makeup_brand_counts)


fig = px.bar(makeup_brand_counts, x='brand', y='count', color='brand',
             title='Brand Distribution in Makeup Subcategory')
fig.show()


Brand Distribution in Makeup Subcategory:
                  brand  count
0                 Lakme     17
1   Maybelline New York      9
2          Loreal Paris      3
3               Garnier      3
4                 Nivea      2
5              bb Combo      2
6                 Spinz      1
7           Blue heaven      1
8                   Mud      1
9                  Olay      1
10            Cosmetics      1
11           Just Herbs      1
12          Pony Effect      1
13                Ponds      1
14             Himalaya      1
15                   DP      1
16             ColorBar      1
17              Brother      1


In [None]:
# Price distribution for sale_price
fig_price_distribution = px.histogram(df, x='sale_price', nbins=50, title='Distribution of Sale Prices')
fig_price_distribution.show()

In [None]:
# Price distribution for market_price
fig_market_price_distribution = px.histogram(df, x='market_price', nbins=50, title='Distribution of Market Prices')
fig_market_price_distribution.show()


In [None]:
# Scatter plot to see relationship between sale and market price
fig_price_scatter = px.scatter(df, x='market_price', y='sale_price', color='category',
                               title='Sale Price vs. Market Price by Category')
fig_price_scatter.show()

### **Task 6 - Analysis on Sales**

In [None]:
print(Back.GREEN + Style.BRIGHT + "Analysis on Sales Figure" + Style.RESET_ALL)
print("Minimum Sales Price: " + Fore.RED + Style.BRIGHT + str(df["sale_price"].min()) + Style.RESET_ALL)
print("Maximum Sales Price: " + Fore.RED + Style.BRIGHT + str(df["sale_price"].max()) + Style.RESET_ALL)
print("Average Sales Price: " + Fore.RED + Style.BRIGHT + str(round(df["sale_price"].mean())) + Style.RESET_ALL)
print("Median Sales Price: " + Fore.RED + Style.BRIGHT + str(round(df["sale_price"].median())) + Style.RESET_ALL)

Analysis on Sales Figure
Minimum Sales Price: 2
Maximum Sales Price: 12500
Average Sales Price: 323
Median Sales Price: 190


### **Task 7 - Bucket Price Analysis**

In [None]:
range_val = [
    ['1-10', 1, 10],
    ['11-25', 11, 25],
    ['26-50', 26, 50],
    ['51-100', 51, 100],
    ['101-150', 101, 150],
    ['151-200', 151, 200],
    ['201-300', 201, 300],
    ['301-400', 301, 400],
    ['401-500', 401, 500],
    ['501-1000', 501, 1000],
    ['1001-1500', 1001, 1500],
    ['1501-2000', 1501, 2000],
    ['2001-3000', 2001, 3000],
    ['3001-5000', 3001, 5000],
    ['5001-10000', 5001, 10000],
    ['10001-15000', 10001, 15000],
]

In [None]:
# Defining the ranges
range_data = pd.DataFrame(range_val, columns = ["range", "start_price", "end_price"])
# Defining a column that will store the count of products within each price range
range_data["product_count"] = ""

In [None]:
range_data

Unnamed: 0,range,start_price,end_price,product_count
0,1-10,1,10,
1,11-25,11,25,
2,26-50,26,50,
3,51-100,51,100,
4,101-150,101,150,
5,151-200,151,200,
6,201-300,201,300,
7,301-400,301,400,
8,401-500,401,500,
9,501-1000,501,1000,


In [None]:
for idx, rows in range_data.iterrows():
  range_data.at[idx, "product_count"] = len(df["product"][(df["sale_price"]>=rows["start_price"]) & (df["sale_price"]<=rows["end_price"])])

In [None]:
range_data

Unnamed: 0,range,start_price,end_price,product_count
0,1-10,1,10,178
1,11-25,11,25,689
2,26-50,26,50,2232
3,51-100,51,100,4654
4,101-150,101,150,3661
5,151-200,151,200,3196
6,201-300,201,300,4568
7,301-400,301,400,2555
8,401-500,401,500,1693
9,501-1000,501,1000,2773
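The same bucketing can also be done without an explicit loop. Below is a hedged alternative using `pd.cut` on a handful of made-up prices (not the real data). It assumes whole-number prices, which holds here because the prices were rounded to integers earlier. The `edges` and `labels` lists are illustrative and cover only the first few ranges.

```python
import pandas as pd

# Toy sale prices, invented for illustration
prices = pd.Series([5, 10, 11, 40, 75, 100, 101, 250])

# Edges chosen so that (0, 10] matches "1-10", (10, 25] matches "11-25", and so on
edges = [0, 10, 25, 50, 100, 150, 200, 300]
labels = ["1-10", "11-25", "26-50", "51-100", "101-150", "151-200", "201-300"]

# Each bin includes its upper edge by default, like the <= comparison in the loop
buckets = pd.cut(prices, bins=edges, labels=labels)

# sort_index orders the result by bucket rather than by frequency
bucket_counts = buckets.value_counts().sort_index()

print(bucket_counts)
```

The loop version is easier to read step by step; `pd.cut` scans the data once instead of once per range, which may matter on larger catalogues.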


**Visualize**

In [None]:
fig = px.bar(range_data, x = "range", y = "product_count", color = "range", title = "Analysis based on price range via product count")
fig.show()

## **Building Regression Model**

In [None]:
import statsmodels.api as sm

In [None]:
# Define independent and dependent variables
x = df["sale_price"]
y = df["market_price"]

# Add a constant term to the independent variables
x = sm.add_constant(x)

In [None]:
# Create a model
model = sm.OLS(y, x).fit()

In [None]:
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:           market_price   R-squared:                       0.932
Model:                            OLS   Adj. R-squared:                  0.932
Method:                 Least Squares   F-statistic:                 3.753e+05
Date:                Sun, 01 Jun 2025   Prob (F-statistic):               0.00
Time:                        05:54:51   Log-Likelihood:            -1.7756e+05
No. Observations:               27555   AIC:                         3.551e+05
Df Residuals:                   27553   BIC:                         3.551e+05
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          9.6622      1.100      8.785      0.0

In [None]:
# Predict the market price of a product based on sales price
# For example: sale_price = 100
predicted_market_price = model.predict([1, 100])[0]
print(predicted_market_price)

125.13134610766177
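With a single predictor, the fitted model is a straight line: market price ≈ constant + slope × sale price. The call `model.predict([1, 100])` evaluates that line at a sale price of 100, where the leading `1` multiplies the constant term. Here is a minimal sketch of the same idea using the closed-form least-squares formulas on made-up numbers. These are not the notebook's data or coefficients, and `predict_market_price` is an illustrative helper.

```python
# Toy data lying exactly on the line market = 10 + 1.2 * sale
sale = [50.0, 100.0, 150.0, 200.0]
market = [70.0, 130.0, 190.0, 250.0]

n = len(sale)
mean_x = sum(sale) / n
mean_y = sum(market) / n

# slope = covariance(x, y) / variance(x); intercept = mean_y - slope * mean_x
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sale, market))
var_x = sum((x - mean_x) ** 2 for x in sale)
slope = cov_xy / var_x
intercept = mean_y - slope * mean_x


def predict_market_price(sale_price):
    """Evaluate the fitted line at a given sale price."""
    return intercept + slope * sale_price


print(slope, intercept, predict_market_price(100))
```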


In [None]:
predicted_market_price

np.float64(125.13134610766177)

<hr>

# **Step 5 - Building the Recommendation Engine 🎞️**

**Ever scrolled through Amazon and felt like they can read your mind? That's the magic of recommendation systems!  Basically, it's a super-powered filter that picks up on what you search for or buy, then uses that info to suggest similar things you might like. It's like having a personal shopping buddy who remembers your interests and whispers "Hey, check this out!" whenever you're browsing. Pretty cool, right?**

**Ever notice how YouTube seems to know exactly what you want to watch next? That's because it's got a recommendation system working behind the scenes, like a super-smart friend suggesting videos based on what you've watched before. It's the same deal with Netflix – it learns your taste in movies and genres, then whispers in your ear (well, the recommendation bar) with suggestions you might love.**

**This recommendation system magic isn't just for entertainment. It's everywhere!  Scrolling through Facebook or Instagram? Boom, recommendations for new friends and accounts to follow pop up.  Shopping online?  Amazon, BigBasket, and other sites use your past searches and purchases to show you ads and products that might catch your eye. It's like having a personal shopping buddy who remembers what you like and says "Hey, check this out!"**

**So, the next time you see those eerily perfect recommendations, remember – it's not magic, it's just a clever system that helps you discover new things and maybe even find that perfect product (or next binge-worthy video).**

<hr>

**Types of Recommendation System**
  * **`Demographic Filtering`: Imagine you walk into a movie store blindfolded. The salesperson, armed with only your age and maybe your favorite color, recommends the "blockbuster hits" everyone's raving about. That's kind of how demographic filtering works. It uses broad categories like age, gender, or location to suggest movies that are generally popular with similar groups. While it can be a good starting point, it doesn't account for your unique tastes. It's like getting a generic recommendation instead of a friend suggesting a hidden gem they know you'll love.**


  * **`Content Based Filtering` : Ever notice how after watching a funny cat video on YouTube, you get bombarded with suggestions for more feline frolics? That's content-based filtering at work! This system is like a detective, looking at the clues – things like genre, director, or actors for movies – to find items that are similar to what you liked before. The idea is that if you enjoyed something, you'd probably enjoy something else with similar characteristics. It's a good way to discover hidden gems within a category you already love, but it might not introduce you to entirely new things outside your comfort zone.**

  * **`Collaborative Filtering`: Imagine you're at a party and hit it off with someone who has amazing taste in movies. They rave about this hidden gem you've never heard of, and you know you've got to check it out because you trusted their other picks. That's the magic of collaborative filtering! Unlike content-based systems that focus on the movie itself, this one is all about finding users with similar tastes to you. It's like having a secret network of movie buddies who recommend things they know you'll love, based on what they've enjoyed themselves. Pretty cool, right?**
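
**To make the "movie buddies" idea concrete, here is a minimal sketch of user-based collaborative filtering in plain Python. The users, products and ratings are invented for illustration (the BigBasket dataset has no per-user ratings), and `cosine` and `recommend_for` are hypothetical helper names, not part of this notebook.**

```python
from math import sqrt

# Hypothetical ratings: user -> {product: rating}. Invented for illustration only.
ratings = {
    "asha":  {"tea": 5, "biscuits": 4, "soap": 1},
    "ravi":  {"tea": 5, "biscuits": 5, "chips": 4},
    "meera": {"soap": 5, "shampoo": 4, "tea": 1},
}

def cosine(u, v):
    """Cosine similarity between two users over the products both have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[p] * v[p] for p in shared)
    norm_u = sqrt(sum(u[p] ** 2 for p in shared))
    norm_v = sqrt(sum(v[p] ** 2 for p in shared))
    return dot / (norm_u * norm_v)

def recommend_for(user, k=1):
    """Suggest products rated by the k most similar users that `user` has not rated."""
    others = [(cosine(ratings[user], ratings[o]), o) for o in ratings if o != user]
    neighbours = [o for _, o in sorted(others, reverse=True)[:k]]
    seen = set(ratings[user])
    return sorted({p for o in neighbours for p in ratings[o] if p not in seen})

suggestions = recommend_for("asha")
```

**Here `asha` and `ravi` agree on tea and biscuits, so `ravi` is the closest neighbour and the product only he has rated is what gets suggested to `asha`.**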

<hr>

# **Step 6 - Demographic Filtering based Recommendation**

**Clearing out the null values**

In [None]:
df = df.dropna()

**Making the function for the recommendation**

In [None]:
# col: numeric column to rank products by (non-numeric columns fall back to "rating")
# sort_type: passed to `ascending`; True ranks lowest first, False ranks highest first
def demographic_recommedation(col = "rating", sort_type = False):
  rated_data = df.copy()

  # Object (text) columns cannot be ranked meaningfully, so rank by rating instead
  if rated_data[col].dtype == "O":
    col = "rating"

  rated_data = rated_data.sort_values(by = col, ascending = sort_type)
  return rated_data[['product', 'brand', 'sale_price', 'rating']].head(10)

**Implement the function**

In [None]:
demographic_recommedation(col = "sale_price", sort_type = True)

Unnamed: 0,product,brand,sale_price,rating
21312,Serum,Livon,3,2.5
14603,50-50 Timepass Biscuits,Britannia,5,3.9
27413,Layer Cake - Orange,Winkies,5,4.1
22072,Tiger Chocolate Cream Biscuits,Britannia,5,4.2
6014,Good Day Butter Cookies,Britannia,5,4.1
18290,Sugar Coated Chocolate,Cadbury Gems,5,4.2
17640,Hand Wash - Moisture Shield,Savlon,5,4.4
24671,Fulltoss Thai Sriracha,Parle,5,4.1
22178,Tiger Elaichi Cream Biscuits,Britannia,5,4.2
16551,Biscuits - Magix Kreams Choc,Parle,5,3.9


**Here, the top product has a very poor rating (2.5). To avoid recommending poorly rated items, we will filter on a rating threshold so that only well-rated products are considered. The average of the rating column is a good reference point for choosing that threshold.**

<hr>

**Recreating the function with a rating threshold**

In [None]:
df["rating"].mean()

3.9430762698370576

In [None]:
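# col: numeric column to rank products by (non-numeric columns fall back to "rating")
# sort_type: passed to `ascending`; True ranks lowest first, False ranks highest first
def demographic_recommedation(col = "rating", sort_type = False):
  # Keep only products rated 3.5 or higher before ranking
  rated_data = df.loc[df['rating'] >= 3.5].copy()

  # Object (text) columns cannot be ranked meaningfully, so rank by rating instead
  if rated_data[col].dtype == "O":
    col = "rating"

  rated_data = rated_data.sort_values(by = col, ascending = sort_type)
  return rated_data[['product', 'brand', 'sale_price', 'rating']].head(10)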

In [None]:
demographic_recommedation(col = "sale_price", sort_type = True)

Unnamed: 0,product,brand,sale_price,rating
16551,Biscuits - Magix Kreams Choc,Parle,5,3.9
2761,Orbit Sugar-Free Chewing Gum - Lemon & Lime,Wrigleys,5,4.2
15926,Dreams Cup Cake - Choco,Elite,5,3.9
27490,50-50 Timepass Salted Biscuits,Britannia,5,4.2
14538,Cadbury Perk - Chocolate Bar,Cadbury,5,4.2
19202,Bounce Biscuits - Choco Creme,Sunfeast,5,4.2
2978,Sugar Free Chewing Gum - Mixed Fruit,Orbit,5,4.2
14603,50-50 Timepass Biscuits,Britannia,5,3.9
19538,Layer Cake - Chocolate,Winkies,5,4.2
17640,Hand Wash - Moisture Shield,Savlon,5,4.4


**This recommender is better than before, as it no longer returns any product rated below 3.5.**

<hr>

# **Step 7 - Content Based Recommendation**

* **Here we use other features such as Category, Sub Category, Brand, Type and Description for much better recommendations.**
* **We will use NLP to extract useful information from these features, especially Description, so let's understand TF-IDF before using it.**

<hr>

* **Term Frequency (TF) - How often a term appears in a document**
  * **Term frequency measures how often a particular term occurs relative to the document. There are multiple ways of defining it:**
    * **Raw count: the number of times the term appears in the document.**
    * **Length-adjusted frequency: the raw count of occurrences divided by the number of words in the document.**
    * **Logarithmically scaled frequency: e.g. log(1 + raw count).**
    * **Boolean frequency: 1 if the term occurs in the document, 0 if it does not.**

* **In plainer terms, term frequency (TF) is how often a word shows up in a single document. There are a few ways to measure it:**

  * **Simple count: This is just the number of times the word appears in the document. Like counting how many times you say "the" in your essay.**
  * **Adjusted for length: This takes the number of times the word shows up and divides it by the total number of words in the document. This way, you can compare the importance of a word in short documents vs long documents. Imagine you're looking at two recipes, one a one-sentence instruction and another a full page. You'd expect "bake" to show up more in the longer recipe, but this adjusted measure would help you see if it's really important to both recipes.**
  * **Logarithmic scale: This dampens the effect of repetition. A word that appears twenty times is not twenty times as important as a word that appears once, so taking the logarithm of the count helps even things out a bit.**
  * **Boolean: This is a simple yes or no - either the word is in the document or it isn't. This isn't very informative by itself, but it is useful when only the presence of a word matters, not how often it appears.**
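
**The four measures above can be sketched in a few lines of plain Python. The sample description and the helper name `term_frequencies` are made up for illustration; this is not how `TfidfVectorizer` computes TF internally, just a worked example of the definitions.**

```python
import math
from collections import Counter

# A made-up product description, used only for illustration.
doc = "green tea pure green tea leaves".split()
counts = Counter(doc)

def term_frequencies(term):
    """Return the four TF variants for `term` in the sample document."""
    raw = counts[term]  # Counter returns 0 for words that never occur
    return {
        "raw": raw,                        # simple count
        "length_adjusted": raw / len(doc), # count divided by document length
        "log_scaled": math.log(1 + raw),   # dampens the effect of repetition
        "boolean": 1 if raw > 0 else 0,    # present or not
    }

tf_green = term_frequencies("green")
tf_coffee = term_frequencies("coffee")
```

**"green" appears twice in a six-word description, so its raw count is 2 and its length-adjusted frequency is 2/6, while "coffee" never appears and scores 0 on every measure.**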

<hr>

* **Inverse Document Frequency (IDF) - How rare a word is across the whole dataset**
  * **Imagine you have a giant collection of books (the corpus). In a single book (a document), some words will be used very often, like "the" or "and". These aren't very helpful because they appear everywhere.**

  * **Inverse document frequency (IDF) helps us find interesting words. It considers how rare a word is across the entire collection of books. Rare words are more interesting because they tell us more about the specific book we're looking at.**

  * **Here's a simple way to think about IDF:**

    * **Common words (like "the") show up in many books (documents) - low IDF score (not interesting).**
    * **Uncommon words show up in only a few books - high IDF score (more interesting).**

**So, IDF tells us how rare a word is across the whole collection. Multiplying a word's TF in a book by its IDF gives the TF-IDF score, which tells us how special that word is to that particular book.**
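
**As a small worked example of the idea, the sketch below computes the textbook form of IDF, log(N / number of documents containing the word), on three made-up descriptions. The corpus and the helper name `idf` are assumptions for illustration. Note that scikit-learn's `TfidfVectorizer` uses a smoothed variant of this formula by default, so its numbers will differ.**

```python
import math

# Three tiny made-up "documents" standing in for product descriptions.
corpus = [
    "pure green tea leaves",
    "pure herbal hand wash",
    "pure brass lamp",
]
docs = [set(text.split()) for text in corpus]

def idf(term):
    """Textbook IDF: log(N / number of documents containing the term)."""
    containing = sum(term in d for d in docs)
    # Words that appear nowhere get 0.0 here to avoid dividing by zero
    return math.log(len(docs) / containing) if containing else 0.0

idf_pure = idf("pure")    # appears in all 3 documents
idf_brass = idf("brass")  # appears in 1 of 3 documents
```

**"pure" shows up in every description, so its IDF is log(3/3) = 0 and it carries no information. "brass" shows up in only one, so its IDF is log(3), the highest possible in this tiny corpus.**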

In [None]:
import re
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

<hr>

**Implementing the TF-IDF Vectorizer**

In [None]:
tfidf = TfidfVectorizer(stop_words= "english")

tfidf_matrix = tfidf.fit_transform(df["description"])

In [None]:
tfidf_matrix.shape

(27555, 28667)

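**The similarity score tells us how similar two vectors are: 1 means completely similar, and 0 means not similar at all. Because `TfidfVectorizer` L2-normalizes each row by default, the dot product computed by `linear_kernel` is the same as the cosine similarity.**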

In [None]:
sim_matrix = linear_kernel(tfidf_matrix, tfidf_matrix)
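
**Why a plain dot product is enough here: `TfidfVectorizer` scales every row to unit length by default (`norm="l2"`), and for unit-length vectors the dot product and the cosine similarity are the same number. The sketch below checks this on two made-up weight vectors in plain Python; the helper names are hypothetical and only for illustration.**

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Two made-up weight vectors over the same four-word vocabulary.
a = [0.0, 2.0, 1.0, 0.0]
b = [1.0, 2.0, 0.0, 0.0]

cos_ab = cosine(a, b)
dot_normalized = dot(l2_normalize(a), l2_normalize(b))
```

**Both `cos_ab` and `dot_normalized` come out to 0.8, which is why `linear_kernel` can stand in for `cosine_similarity` on this matrix while doing less work.**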

In [None]:
pd.DataFrame(sim_matrix)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,27545,27546,27547,27548,27549,27550,27551,27552,27553,27554
0,1.000000,0.015793,0.009983,0.000000,0.161977,0.019915,0.158756,0.000000,0.000000,0.000000,...,0.016159,0.000000,0.000000,0.052554,0.00000,0.000000,0.029093,0.010760,0.011197,0.000000
1,0.015793,1.000000,0.006983,0.164667,0.000000,0.022942,0.000000,0.105604,0.026234,0.049868,...,0.023936,0.016750,0.000000,0.027744,0.00000,0.000000,0.000000,0.000000,0.000000,0.000000
2,0.009983,0.006983,1.000000,0.022150,0.017539,0.010203,0.000000,0.000000,0.004542,0.000000,...,0.015835,0.011587,0.000000,0.018384,0.00000,0.014780,0.005542,0.006268,0.000000,0.000000
3,0.000000,0.164667,0.022150,1.000000,0.024366,0.061693,0.000000,0.018892,0.006046,0.033071,...,0.050943,0.011493,0.008513,0.013141,0.05364,0.004068,0.000000,0.000000,0.033400,0.002418
4,0.161977,0.000000,0.017539,0.024366,1.000000,0.039197,0.222488,0.265517,0.000000,0.000000,...,0.000000,0.000000,0.006494,0.053084,0.00000,0.010434,0.000000,0.007866,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27550,0.000000,0.000000,0.014780,0.004068,0.010434,0.008962,0.000000,0.000000,0.005557,0.000000,...,0.000000,0.000000,0.000000,0.015390,0.00000,1.000000,0.000000,0.000000,0.021852,0.000000
27551,0.029093,0.000000,0.005542,0.000000,0.000000,0.000000,0.000000,0.000000,0.011547,0.000000,...,0.000000,0.016894,0.000000,0.002885,0.00000,0.000000,1.000000,0.018246,0.014424,0.043180
27552,0.010760,0.000000,0.006268,0.000000,0.007866,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.016228,0.005460,0.007823,0.00000,0.000000,0.018246,1.000000,0.000000,0.000000
27553,0.011197,0.000000,0.000000,0.033400,0.000000,0.020520,0.000000,0.000000,0.009284,0.000000,...,0.000000,0.000000,0.012649,0.036288,0.00000,0.021852,0.014424,0.000000,1.000000,0.000000


In [None]:
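indexes = pd.Series(df.index, index = df["product"])
indexes = indexes[~indexes.index.duplicated(keep = "first")]  # drop repeated product names (drop_duplicates() only compares values)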

In [None]:
indexes

Unnamed: 0_level_0,0
product,Unnamed: 1_level_1
Garlic Oil - Vegetarian Capsule 500 mg,0
Water Bottle - Orange,1
"Brass Angle Deep - Plain, No.2",2
Cereal Flip Lid Container/Storage Jar - Assorted Colour,3
Creme Soft Soap - For Hands & Body,4
...,...
"Wottagirl! Perfume Spray - Heaven, Classic",27550
Rosemary,27551
Peri-Peri Sweet Potato Chips,27552
Green Tea - Pure Original,27553


## **Making Recommendation Engine**

In [None]:
def recommend(product, sim_scores = sim_matrix):
  # Look up the row of the similarity matrix that belongs to this product
  idx = indexes[product]

  # Pair every product position with its similarity to the chosen product
  similarity_score = list(enumerate(sim_scores[idx]))

  # Sort by similarity score in descending order
  similarity_score = sorted(similarity_score, key = lambda x: x[1], reverse = True)

  # Skip the first entry (normally the product itself) and keep the top 10 recommendations
  similarity_score = similarity_score[1:11]

  product_indices = [i[0] for i in similarity_score]

  return df["product"].iloc[product_indices]

In [None]:
recommend("Brass Angle Deep - Plain, No.2")

Unnamed: 0,product
1676,Brass Nanda Stand Goblets - No.1
2161,Brass Kachua Stand Deepam - No.1
2755,"Brass Angle Deep Stand - Plain, No.2"
5399,"Brass Lakshmi Deepam - Plain, No.2"
6519,Brass Kuber Deepam - No.1
10503,Brass Kuber Deepam - No.2
11225,"Brass Angle Deep Stand - Plain, No.3"
11503,"Brass Angle Deep Stand - Plain, No.1"
12698,Brass Kachua Stand Deepam - No.2
18571,Brass Kuber Deepam - No.3
