# 1. intro

## 1.1 about dataset

<p>this dataset is used for market basket analysis, product recommendations and store optimization,
but i'll use it to improve my data analysis skills.
</p>

## 1.2 dataset link

<p>we are going to use <a href="https://www.kaggle.com/datasets/bhavikjikadara/grocery-store-dataset">this</a> dataset</p>

# 2. understand dataset

## 2.1 import needed libraries

In [86]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from typing import List, Union

## 2.2 read dataset

In [2]:
df = pd.read_csv('./GroceryDataset.csv')

## 2.3 getting some information

### 2.3.1 see some rows, and rename all columns

In [3]:
df = df.rename(
    columns={
        'Price': 'price', 
        'Sub Category':"sub_category", 
        'Discount':'discount',
        'Rating':'rating',
        'Title':'title',
        'Currency':'currency',
        'Feature':'feature',
        'Product Description':'description'
    }
)
df.head()

Unnamed: 0,sub_category,price,discount,rating,title,currency,feature,description
0,Bakery & Desserts,$56.99,No Discount,Rated 4.3 out of 5 stars based on 265 reviews.,"David’s Cookies Mile High Peanut Butter Cake, ...",$,"""10"""" Peanut Butter Cake\nCertified Kosher OU-...",A cake the dessert epicure will die for!Our To...
1,Bakery & Desserts,$159.99,No Discount,Rated 5 out of 5 stars based on 1 reviews.,"The Cake Bake Shop 8"" Round Carrot Cake (16-22...",$,Spiced Carrot Cake with Cream Cheese Frosting ...,"Due to the perishable nature of this item, ord..."
2,Bakery & Desserts,$44.99,No Discount,Rated 4.1 out of 5 stars based on 441 reviews.,"St Michel Madeleine, Classic French Sponge Cak...",$,100 count\nIndividually wrapped\nMade in and I...,Moist and buttery sponge cakes with the tradit...
3,Bakery & Desserts,$39.99,No Discount,Rated 4.7 out of 5 stars based on 9459 reviews.,"David's Cookies Butter Pecan Meltaways 32 oz, ...",$,Butter Pecan Meltaways\n32 oz 2-Pack\nNo Prese...,These delectable butter pecan meltaways are th...
4,Bakery & Desserts,$59.99,No Discount,Rated 4.5 out of 5 stars based on 758 reviews.,"David’s Cookies Premier Chocolate Cake, 7.2 lb...",$,"""10"" Four Layer Chocolate Cake\nCertified Kosh...",A cake the dessert epicure will die for!To the...


### 2.3.2 understand feature types

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1757 entries, 0 to 1756
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   sub_category  1757 non-null   object
 1   price         1754 non-null   object
 2   discount      1757 non-null   object
 3   rating        682 non-null    object
 4   title         1757 non-null   object
 5   currency      1752 non-null   object
 6   feature       1739 non-null   object
 7   description   1715 non-null   object
dtypes: object(8)
memory usage: 109.9+ KB


### 2.3.3 feature names

In [5]:
df.columns

Index(['sub_category', 'price', 'discount', 'rating', 'title', 'currency',
       'feature', 'description'],
      dtype='object')

### 2.3.4 dataset shape

In [6]:
df.shape

(1757, 8)

### 2.3.5 check for missing data

In [7]:
df.isnull().sum()

sub_category       0
price              3
discount           0
rating          1075
title              0
currency           5
feature           18
description       42
dtype: int64

## 2.4 analysis results

1. **feature analysis**:
    - *sub_category*: This column categorizes the grocery items into subcategories, providing a detailed classification for easier analysis and organization.
    - *price*: Represents the monetary value of the grocery item, indicating its cost or retail price in the specified currency.
    - *discount*: Reflects any discounts or promotional offers applicable to the respective grocery item, providing insights into pricing strategies.
    - *rating*: Indicates customer satisfaction or product quality based on user ratings, offering a measure of the overall perceived value of the grocery item.
    - *title*: Describes the name or title of the grocery item, providing a concise identifier for easy reference and understanding.
    - *currency*: Specifies the currency in which the prices are denominated, facilitating proper interpretation and comparison of monetary values.
    - *feature*: Includes features or characteristics of the grocery item, offering additional information about its unique attributes or selling points.
    - *description*: Provides a detailed textual description of the grocery item, offering comprehensive information about its specifications, uses, and other relevant details. This column is handy for understanding product details beyond what is captured in other columns.


2. **dataset**:
    - the dataset contains 8 columns, all stored as `object` (string) dtype
    - some columns such as price, discount and rating should be numerical; i'll convert them in the feature engineering part
    - we need a new feature for the number of reviews

3. **missing data**:
    - *price*: i'll fill the 3 missing values using KNN.
    - *rating*: this is actually normal; people often don't bother to rate what they buy
    - *currency*: this is actually important; i may remove the rows that have a null currency
    - *feature*: we don't need this column; it is only free text that tells users what they are buying
    - *description*: we don't need this column; it is only free text that tells users what they are buying
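
the numerical rating and the review-count feature mentioned above could both be pulled from the same string. below is a minimal sketch, assuming every rating follows the "Rated X out of 5 stars based on N reviews." pattern seen in `df.head()`; other formats may exist in the column and would come back as missing. `parse_rating` and `RATING_PATTERN` are hypothetical names, not part of this notebook.

```python
import re
from typing import Optional, Tuple

# assumed format: "Rated 4.3 out of 5 stars based on 265 reviews."
RATING_PATTERN = re.compile(
    r"Rated\s+(\d+(?:\.\d+)?)\s+out of\s+5\s+stars\s+based on\s+([\d,]+)\s+reviews?"
)


def parse_rating(text) -> Tuple[Optional[float], Optional[int]]:
    """Return (stars, review_count); (None, None) if missing or unrecognized."""
    # pandas stores missing values as float NaN, which is not a str
    if not isinstance(text, str):
        return None, None
    match = RATING_PATTERN.search(text)
    if match is None:
        return None, None
    stars = float(match.group(1))
    # tolerate a thousands separator in the review count, just in case
    reviews = int(match.group(2).replace(",", ""))
    return stars, reviews


print(parse_rating("Rated 4.3 out of 5 stars based on 265 reviews."))  # (4.3, 265)
print(parse_rating("Rated 5 out of 5 stars based on 1 reviews."))      # (5.0, 1)
print(parse_rating(float("nan")))                                      # (None, None)
```

if the assumption holds, something like `df['rating'].map(parse_rating)` would give one `(stars, reviews)` tuple per row, which can then be split into two columns.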

## 2.5 getting unique values of dataset

### 2.5.1 total unique values

In [8]:
pd.DataFrame(df.nunique(), columns=['unique values count'])

Unnamed: 0,unique values count
sub_category,19
price,184
discount,42
rating,483
title,1484
currency,1
feature,1401
description,1435


### 2.5.2 getting all features unique values

In [9]:
def get_categorical_features(dataFrame):
    categorical_features = []
    for i in dataFrame:
        if not pd.api.types.is_numeric_dtype(dataFrame[i].dtypes):
            categorical_features.append(i)
    return categorical_features

In [10]:
# first, save the names of the non-numeric features
categorical_features = get_categorical_features(df)
categorical_features

['sub_category',
 'price',
 'discount',
 'rating',
 'title',
 'currency',
 'feature',
 'description']

In [11]:
def get_categorical_unique(name:str, dataFrame:pd.DataFrame):
    print("#"*10, name, "#"*10)
    print(pd.DataFrame(dataFrame[name].unique(), columns=["unique values"]), end="\n\n\n")

In [12]:
get_categorical_unique("sub_category", df)
get_categorical_unique("currency", df)

########## sub_category ##########
                   unique values
0              Bakery & Desserts
1              Beverages & Water
2                      Breakfast
3                          Candy
4              Cleaning Supplies
5                         Coffee
6                           Deli
7                         Floral
8                   Gift Baskets
9                      Household
10    Kirkland Signature Grocery
11  Laundry Detergent & Supplies
12                Meat & Seafood
13                       Organic
14            Pantry & Dry Goods
15      Paper & Plastic Products
16                       Poultry
17                       Seafood
18                        Snacks


########## currency ##########
  unique values
0             $
1           NaN




## 2.6 categorical feature analysis results

- the only non-null value in currency is $
- sub_category has 19 unique values

# 3. feature engineering

In [13]:
df.head()

Unnamed: 0,sub_category,price,discount,rating,title,currency,feature,description
0,Bakery & Desserts,$56.99,No Discount,Rated 4.3 out of 5 stars based on 265 reviews.,"David’s Cookies Mile High Peanut Butter Cake, ...",$,"""10"""" Peanut Butter Cake\nCertified Kosher OU-...",A cake the dessert epicure will die for!Our To...
1,Bakery & Desserts,$159.99,No Discount,Rated 5 out of 5 stars based on 1 reviews.,"The Cake Bake Shop 8"" Round Carrot Cake (16-22...",$,Spiced Carrot Cake with Cream Cheese Frosting ...,"Due to the perishable nature of this item, ord..."
2,Bakery & Desserts,$44.99,No Discount,Rated 4.1 out of 5 stars based on 441 reviews.,"St Michel Madeleine, Classic French Sponge Cak...",$,100 count\nIndividually wrapped\nMade in and I...,Moist and buttery sponge cakes with the tradit...
3,Bakery & Desserts,$39.99,No Discount,Rated 4.7 out of 5 stars based on 9459 reviews.,"David's Cookies Butter Pecan Meltaways 32 oz, ...",$,Butter Pecan Meltaways\n32 oz 2-Pack\nNo Prese...,These delectable butter pecan meltaways are th...
4,Bakery & Desserts,$59.99,No Discount,Rated 4.5 out of 5 stars based on 758 reviews.,"David’s Cookies Premier Chocolate Cake, 7.2 lb...",$,"""10"" Four Layer Chocolate Cake\nCertified Kosh...",A cake the dessert epicure will die for!To the...


In [89]:
def to_float_price(str_list: List[str]) -> np.ndarray:
    """
    convert a list of string prices to float values.

    this function handles several price formats:
    - NaN values are converted to 0.0
    - commas (thousands separators) are removed
    - prices in the "through-" range format are averaged
    - regular prices have the leading currency symbol stripped
    """
    numbers = []

    for i, price_str in enumerate(str_list):
        # missing price: store 0.0, as documented above
        if pd.isna(price_str):
            numbers.append(0.0)
            continue

        # drop thousands separators, e.g. "1,299.99" -> "1299.99"
        price_str = price_str.replace(",", "")

        # range price: average both ends of the range
        if "through-" in price_str:
            prices = [float(p.strip()[1:]) for p in price_str.split("through-")]
            numbers.append(sum(prices) / len(prices))
            continue

        try:
            # skip the leading currency symbol, then parse the number
            numbers.append(float(price_str.strip()[1:]))
        except ValueError as err:
            # fail loudly with the offending value instead of relying on the global df
            raise ValueError(f"cannot parse price at position {i}: {price_str!r}") from err

    return np.array(numbers, dtype=float)

In [90]:
to_float_price(df['price'])

array([ 56.99, 159.99,  44.99, ...,  22.99,  17.49,  21.99])
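
since section 2.4 plans to fill the missing prices later, one option is to keep them as NaN during conversion, so they are still visible to `isnull()` and to an imputer instead of looking like real 0.0 prices. this is only a sketch of that alternative, not the conversion used above; `parse_price` is a hypothetical helper, and the range string in the example is made up to follow the "through-" separator that `to_float_price` checks for.

```python
import math


def parse_price(text, currency_symbol="$"):
    """Convert one price string to float; missing input becomes NaN, not 0.0."""
    # pandas stores missing values as float NaN, which is not a str
    if not isinstance(text, str):
        return math.nan
    # drop thousands separators, e.g. "$1,299.99" -> "$1299.99"
    cleaned = text.replace(",", "")
    # a plain price yields one part, a "through-" range yields two
    parts = [p.strip().lstrip(currency_symbol) for p in cleaned.split("through-")]
    values = [float(p) for p in parts]
    # average of one value is the value itself; a range gives its midpoint
    return sum(values) / len(values)


print(parse_price("$56.99"))                 # 56.99
print(parse_price("$1,299.99"))              # 1299.99
print(parse_price("$10.00 through-$20.00"))  # 15.0
print(parse_price(None))                     # nan
```

applied with something like `df['price'].map(parse_price)`, the 3 missing prices would stay NaN until they are filled.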