## **Data Assessment Summary**

### Dataset Overview
- The dataset **`blinkit_grocery data.xlsx`** contains **8,523 rows** and **12 columns**.
- Column names use title case with spaces (e.g., `Item Fat Content`) and will be converted to snake_case (`item_fat_content`) for consistency and ease of analysis.
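
The renaming described above can be sketched as follows. This is a minimal illustration rather than the notebook's actual cleaning code: the helper name `to_snake_case` and the empty `demo` frame are assumptions introduced here, and only the three column names are taken from the dataset.

```python
import pandas as pd

def to_snake_case(columns):
    """Trim each name, lowercase it, and replace runs of whitespace with '_'."""
    return (
        pd.Index(columns)
        .str.strip()
        .str.lower()
        .str.replace(r"\s+", "_", regex=True)
    )

# Empty stand-in frame (hypothetical) carrying three of the real column names.
demo = pd.DataFrame(columns=["Item Fat Content", "Outlet Identifier", "Sales"])
demo.columns = to_snake_case(demo.columns)
print(list(demo.columns))  # ['item_fat_content', 'outlet_identifier', 'sales']
```

Applied to the loaded frame, the same call would be `data.columns = to_snake_case(data.columns)`.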

---

### Column Description

1. **item_fat_content**  
   Indicates the fat category of an item (e.g., Regular, Low Fat).

2. **item_identifier**  
   Unique identifier for each product.  
   Contains **1,559 unique values** across 8,523 rows, which is expected since the same item can be sold across multiple outlets.

3. **item_type**  
   Describes the category of the item.  
   Contains **16 distinct categories**, including *Fruits and Vegetables, Dairy, Snack Foods, Household,* and *Others*.

4. **outlet_establishment_year**  
   Represents the year an outlet was established.  
   Stored as an integer; not converted to datetime as month/day information is unavailable.

5. **outlet_identifier**  
   Unique identifier for outlets.  
   The dataset contains **10 distinct outlets**.

6. **outlet_location_type**  
   Categorizes outlet locations into **Tier 1, Tier 2, and Tier 3**.

7. **outlet_size**  
   Indicates the physical size of the outlet: **Small, Medium, or High**.

8. **outlet_type**  
   Describes the type of outlet:  
   *Grocery Store, Supermarket Type1, Supermarket Type2,* and *Supermarket Type3*.

9. **item_visibility**  
   Represents how visible an item is within the store.  
   Observed values range from **0 to about 0.33**, where higher values indicate greater visibility.

10. **item_weight**  
    Weight of the item.  
    Contains **1,463 missing values** (about 17% of rows) that require treatment.

11. **sales**  
    Total sales value generated by a specific item at a particular outlet.

12. **rating**  
    Customer satisfaction score for an item at a specific outlet.

---

### Data Quality Assessment

#### Dirty Data Issues
- The **item_fat_content** column contains inconsistent labels such as  
  `Regular`, `reg`, `Low Fat`, `low fat`, and `LF`.  
  These values will be standardized into consistent categories (*Regular* and *Low Fat*).

- The **item_weight** column has **1,463 missing values** out of 8,523 rows (approximately **17%**).  
  Since weight is an item-level attribute:
  - Missing values will be imputed using the **median weight of the same item_identifier** across the outlets where it is recorded.
  - If an item has no recorded weight at any outlet, the **median weight by item_type** will be used instead.
  - The column will be retained for completeness, although it is not expected to be a primary driver of sales.
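
The two cleaning steps above can be sketched as follows. This is one possible implementation under stated assumptions, not necessarily the code used later in the project: it assumes the columns have already been renamed to snake_case, the function names and the `toy` frame are hypothetical, and the item_type fallback is computed from observed (non-imputed) weights only.

```python
import numpy as np
import pandas as pd

# Every label observed in the raw column, mapped to its canonical category.
FAT_CONTENT_MAP = {
    "Regular": "Regular", "reg": "Regular",
    "Low Fat": "Low Fat", "low fat": "Low Fat", "LF": "Low Fat",
}

def standardize_fat_content(df):
    """Collapse the five raw labels into 'Regular' and 'Low Fat'."""
    out = df.copy()
    # Labels missing from the map become NaN, which makes new variants easy to spot.
    out["item_fat_content"] = out["item_fat_content"].map(FAT_CONTENT_MAP)
    return out

def impute_item_weight(df):
    """Fill missing weights by item median, then by item_type median."""
    out = df.copy()
    # Both medians are computed from the observed weights (assumption).
    by_item = out.groupby("item_identifier")["item_weight"].transform("median")
    by_type = out.groupby("item_type")["item_weight"].transform("median")
    # Step 1: same item at other outlets. Step 2: fall back to the item's category.
    out["item_weight"] = out["item_weight"].fillna(by_item).fillna(by_type)
    return out

# Hypothetical toy rows: item B2 has no recorded weight at any outlet.
toy = pd.DataFrame({
    "item_identifier": ["A1", "A1", "B2", "C3", "C3"],
    "item_type": ["Dairy", "Dairy", "Dairy", "Canned", "Canned"],
    "item_fat_content": ["reg", "Regular", "LF", "low fat", "Low Fat"],
    "item_weight": [10.0, np.nan, np.nan, 7.5, np.nan],
})
clean = impute_item_weight(standardize_fat_content(toy))
print(clean[["item_fat_content", "item_weight"]])
```

An explicit mapping is used instead of string heuristics so that an unexpected label surfaces as a missing value rather than being silently assigned to a category.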

#### Messy Data Issues
- Apart from the non-standard column names noted in the overview, no structural or formatting issues were identified.
- The dataset is well-structured and suitable for analysis after cleaning.

---

### Assessment Conclusion
The dataset contains common real-world data quality issues such as inconsistent categorical values and missing numerical data.  
After standardizing categorical fields and handling missing values, the data will be suitable for exploratory data analysis, SQL-based analysis, and dashboard development.


## Library Import

In [3]:
import numpy as np
import pandas as pd

## Data Load

In [5]:
data = pd.read_excel(r"C:\Users\rudra\OneDrive\Documents\GitHub\blinkit-end-to-end-data-analysis\data\raw\blinkit_grocery data.xlsx")

In [6]:
data

| | Item Fat Content | Item Identifier | Item Type | Outlet Establishment Year | Outlet Identifier | Outlet Location Type | Outlet Size | Outlet Type | Item Visibility | Item Weight | Sales | Rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Regular | FDX32 | Fruits and Vegetables | 2012 | OUT049 | Tier 1 | Medium | Supermarket Type1 | 0.100014 | 15.10 | 145.4786 | 5.0 |
| 1 | Low Fat | NCB42 | Health and Hygiene | 2022 | OUT018 | Tier 3 | Medium | Supermarket Type2 | 0.008596 | 11.80 | 115.3492 | 5.0 |
| 2 | Regular | FDR28 | Frozen Foods | 2016 | OUT046 | Tier 1 | Small | Supermarket Type1 | 0.025896 | 13.85 | 165.0210 | 5.0 |
| 3 | Regular | FDL50 | Canned | 2014 | OUT013 | Tier 3 | High | Supermarket Type1 | 0.042278 | 12.15 | 126.5046 | 5.0 |
| 4 | Low Fat | DRI25 | Soft Drinks | 2015 | OUT045 | Tier 2 | Small | Supermarket Type1 | 0.033970 | 19.60 | 55.1614 | 5.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8518 | low fat | NCT53 | Health and Hygiene | 2018 | OUT027 | Tier 3 | Medium | Supermarket Type3 | 0.000000 | NaN | 164.5526 | 4.0 |
| 8519 | low fat | FDN09 | Snack Foods | 2018 | OUT027 | Tier 3 | Medium | Supermarket Type3 | 0.034706 | NaN | 241.6828 | 4.0 |
| 8520 | low fat | DRE13 | Soft Drinks | 2018 | OUT027 | Tier 3 | Medium | Supermarket Type3 | 0.027571 | NaN | 86.6198 | 4.0 |
| 8521 | reg | FDT50 | Dairy | 2018 | OUT027 | Tier 3 | Medium | Supermarket Type3 | 0.107715 | NaN | 97.8752 | 4.0 |


## Data Health Check

In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item Fat Content           8523 non-null   object 
 1   Item Identifier            8523 non-null   object 
 2   Item Type                  8523 non-null   object 
 3   Outlet Establishment Year  8523 non-null   int64  
 4   Outlet Identifier          8523 non-null   object 
 5   Outlet Location Type       8523 non-null   object 
 6   Outlet Size                8523 non-null   object 
 7   Outlet Type                8523 non-null   object 
 8   Item Visibility            8523 non-null   float64
 9   Item Weight                7060 non-null   float64
 10  Sales                      8523 non-null   float64
 11  Rating                     8523 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB


In [9]:
data.describe()

| | Outlet Establishment Year | Item Visibility | Item Weight | Sales | Rating |
|---|---|---|---|---|---|
| count | 8523.0 | 8523.0 | 7060.0 | 8523.0 | 8523.0 |
| mean | 2016.450546 | 0.066132 | 12.857645 | 140.992783 | 3.965857 |
| std | 3.189396 | 0.051598 | 4.643456 | 62.275067 | 0.605651 |
| min | 2011.0 | 0.0 | 4.555 | 31.29 | 1.0 |
| 25% | 2014.0 | 0.026989 | 8.77375 | 93.8265 | 4.0 |
| 50% | 2016.0 | 0.053931 | 12.6 | 143.0128 | 4.0 |
| 75% | 2018.0 | 0.094585 | 16.85 | 185.6437 | 4.2 |
| max | 2022.0 | 0.328391 | 21.35 | 266.8884 | 5.0 |


In [10]:
data.isnull().sum()

Item Fat Content                0
Item Identifier                 0
Item Type                       0
Outlet Establishment Year       0
Outlet Identifier               0
Outlet Location Type            0
Outlet Size                     0
Outlet Type                     0
Item Visibility                 0
Item Weight                  1463
Sales                           0
Rating                          0
dtype: int64

In [11]:
data.nunique()

Item Fat Content                5
Item Identifier              1559
Item Type                      16
Outlet Establishment Year       9
Outlet Identifier              10
Outlet Location Type            3
Outlet Size                     3
Outlet Type                     4
Item Visibility              7880
Item Weight                   415
Sales                        5938
Rating                         39
dtype: int64
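
The counts above show far fewer item identifiers (1,559) than rows (8,523), which fits the reading that each row is one item at one outlet. A quick way to test that reading is to check that no item and outlet pair repeats. The sketch below is illustrative only: the helper `pair_is_unique` and the `toy` frame with its identifiers are hypothetical, and the result for the real dataset is not shown here.

```python
import pandas as pd

def pair_is_unique(df, item_col="Item Identifier", outlet_col="Outlet Identifier"):
    """Return True when no (item, outlet) pair occurs more than once."""
    return not df.duplicated(subset=[item_col, outlet_col]).any()

# Hypothetical rows: item X1 is sold at two outlets, item Y2 at one.
toy = pd.DataFrame({
    "Item Identifier": ["X1", "X1", "Y2"],
    "Outlet Identifier": ["OUT_A", "OUT_B", "OUT_A"],
})
print(toy["Item Identifier"].nunique(), len(toy), pair_is_unique(toy))  # 2 3 True
```

On the loaded frame the call would be `pair_is_unique(data)`. A `False` result would point to duplicate records that need attention before aggregation.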

In [12]:
data['Item Fat Content'].unique()

array(['Regular', 'Low Fat', 'low fat', 'LF', 'reg'], dtype=object)

In [13]:
data['Item Type'].unique()

array(['Fruits and Vegetables', 'Health and Hygiene', 'Frozen Foods',
       'Canned', 'Soft Drinks', 'Household', 'Snack Foods', 'Meat',
       'Breads', 'Hard Drinks', 'Others', 'Dairy', 'Breakfast',
       'Baking Goods', 'Seafood', 'Starchy Foods'], dtype=object)

In [14]:
data['Outlet Location Type'].unique()

array(['Tier 1', 'Tier 3', 'Tier 2'], dtype=object)

In [15]:
data['Outlet Size'].unique()

array(['Medium', 'Small', 'High'], dtype=object)

In [16]:
data['Outlet Type'].unique()

array(['Supermarket Type1', 'Supermarket Type2', 'Grocery Store',
       'Supermarket Type3'], dtype=object)

In [43]:
(data['Item Weight'].isnull().sum() / data.shape[0])*100

17.165317376510618
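
The same percentage can be computed for every column at once, which is handy when more than one column has gaps. The sketch below is a small generalization of the cell above. The helper name `missing_share` and the `toy` values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

def missing_share(df):
    """Return the percentage of missing values per column, largest first."""
    # isna() yields booleans; their column-wise mean is the fraction missing.
    return (df.isna().mean() * 100).sort_values(ascending=False)

# Made-up values: half of the weights are missing, no sales are missing.
toy = pd.DataFrame({
    "Item Weight": [10.0, np.nan, 12.0, np.nan],
    "Sales": [100.0, 110.0, 120.0, 130.0],
})
print(missing_share(toy))
```

Running `missing_share(data)` on the loaded frame should list `Item Weight` first at roughly 17.17%, matching the single-column result above, with all other columns at 0.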