In [1]:
import warnings 
warnings.filterwarnings('ignore')

#### Importing Libraries and Datasets

In [2]:
import pandas as pd

In [3]:
# Load the four datasets
products = pd.read_csv('../Project 1/data/products.csv')
orders = pd.read_csv('../Project 1/data/orders.csv')
order_products = pd.read_csv('../Project 1/data/order_products.csv')
aisles = pd.read_csv('../Project 1/data/aisles.csv')

### Dataset Description
* **products.csv:** Contains details on the products available in the store. Fields include the unique product ID, the aisle in which each product is stored, and the department it belongs to, providing a structured overview of product locations.
* **orders.csv:** Records order-level data. Key fields include unique order and user IDs, the sequence number of each order for its user, the day of the week and hour of the day at which the order was placed, and the number of days since the user's prior order, giving insight into customer ordering habits and timing.
* **order_products.csv:** Lists the products within each order, including order and product IDs, the position at which each item was added to the cart, and a flag marking whether the item was reordered, helping to reveal purchasing patterns and repeat product preferences.
* **aisles.csv:** Contains data on aisles, including unique aisle IDs and their names, offering an additional layer of product location within the store's layout.
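The four files link to one another through shared keys: `order_id`, `product_id`, and `aisle_id`. The following is a minimal sketch of how they could be joined, using small made-up frames (`products_demo`, `aisles_demo`, `order_products_demo` are hypothetical names with illustrative values) rather than the real CSVs.

```python
import pandas as pd

# Toy frames that reuse the real column names; the values are illustrative only
products_demo = pd.DataFrame({
    "product_id": [1, 2, 3],
    "product_name": ["item a", "item b", "item c"],
    "aisle_id": [10, 10, 20],
    "department_id": [1, 1, 2],
})
aisles_demo = pd.DataFrame({"aisle_id": [10, 20], "aisle": ["aisle x", "aisle y"]})
order_products_demo = pd.DataFrame({
    "order_id": [100, 100, 101],
    "product_id": [1, 3, 1],
    "add_to_cart_order": [1, 2, 1],
    "reordered": [0, 1, 1],
})

# Attach product and aisle details to each order line.
# validate="many_to_one" raises if a key is duplicated on the lookup side.
order_lines = (
    order_products_demo
    .merge(products_demo, on="product_id", how="left", validate="many_to_one")
    .merge(aisles_demo, on="aisle_id", how="left", validate="many_to_one")
)
print(order_lines[["order_id", "product_name", "aisle"]])
```

The left joins keep every order line even if a product or aisle lookup fails, so unmatched keys would show up as missing values instead of silently disappearing.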

### products.csv Description

In [20]:
# Displaying the first 5 rows of the dataset
products.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id
0,1,Chocolate Sandwich Cookies,61,19
1,2,All-Seasons Salt,104,13
2,3,Robust Golden Unsweetened Oolong Tea,94,7
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1
4,5,Green Chile Anytime Sauce,5,13


### Key Column Description:
* **product_id:** Unique numerical identifier for each product (49,688 in total), providing a distinct reference across datasets.
* **product_name:** Unique string-based name for each product, but unsuitable for categorical analysis due to high cardinality.
* **aisle_id:** Numerical identifier for aisles, ideal for categorical analysis as it groups multiple products by location.
* **department_id:** Numerical identifier for product departments, well-suited for categorical analysis and departmental comparisons.

In [24]:
#Calculating total unique values for each column
products.nunique()

product_id       49688
product_name     49688
aisle_id           134
department_id       21
dtype: int64

### Key Insights:
* **product_id:** Unique numerical identifier for 49,688 products, used for efficient referencing.
* **product_name:** Has the same number of unique values as product_id (49,688), so each product has its own unique name.
* **aisle_id:** Categorical feature despite being numerical, as it groups products by aisle.
* **department_id:** Categorical feature, representing departments that encompass multiple aisles and products.
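Because aisle_id and department_id are group labels stored as integers, it can help to cast them to pandas' `category` dtype before grouping. This is a minimal sketch on a made-up frame (`products_demo` and its values are hypothetical), not a step taken in the notebook itself.

```python
import pandas as pd

# Toy frame with the real column names; values are illustrative only
products_demo = pd.DataFrame({
    "product_id": [1, 2, 3, 4],
    "aisle_id": [10, 10, 20, 30],
    "department_id": [1, 1, 2, 2],
})

# Mark the grouping IDs as categorical so they are not treated as quantities
products_demo = products_demo.astype(
    {"aisle_id": "category", "department_id": "category"}
)

# Count products per department; observed=True keeps only categories present in the data
products_per_department = (
    products_demo.groupby("department_id", observed=True)["product_id"].count()
)
print(products_per_department)
```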

In [28]:
#Calculating the total missing values for each column
products.isna().sum()

product_id       0
product_name     0
aisle_id         0
department_id    0
dtype: int64

**There are no null or missing values in any columns of this dataset, ensuring data completeness and simplifying the preprocessing steps. This quality facilitates smoother analysis and modeling without the need for imputation or data cleaning related to missing values.**

In [32]:
# Counting fully duplicated rows in the dataset
products.duplicated().sum()

0

**There are no fully duplicated rows in the dataset, so no deduplication is needed during preprocessing. Note that `duplicated()` compares entire rows; it does not check whether individual columns contain repeated values.**
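Called without arguments, `DataFrame.duplicated()` flags only rows that repeat in every column. To check a key column on its own, the `subset` parameter can be used. A minimal sketch on a made-up frame (`demo` and its values are hypothetical):

```python
import pandas as pd

# Toy frame: product_id 2 appears twice, but the two rows differ in product_name
demo = pd.DataFrame({
    "product_id": [1, 2, 2],
    "product_name": ["item a", "item b", "item b (renamed)"],
})

# Whole-row comparison: no row is an exact copy of another
full_row_dupes = demo.duplicated().sum()  # 0

# Key-only comparison: the second occurrence of product_id 2 is flagged
key_dupes = demo.duplicated(subset=["product_id"]).sum()  # 1

print(full_row_dupes, key_dupes)
```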

### aisles.csv Description

In [21]:
# Displaying the first 5 rows of the dataset
aisles.head()

Unnamed: 0,aisle_id,aisle
0,1,prepared soups salads
1,2,specialty cheeses
2,3,energy granola bars
3,4,instant foods
4,5,marinades meat preparation


#### Key Column Description:
* **aisle_id:** Numerical values treated as categorical, representing unique aisle identifiers with multiple products per aisle.
* **aisle:** String names corresponding to aisle_ids, each aisle having a unique name.

In [25]:
#Calculating total unique values for each column
aisles.nunique()

aisle_id    134
aisle       134
dtype: int64

* **aisle_id and aisle:** Both have 134 unique values, which suggests that each aisle_id corresponds to exactly one aisle name.
* No further exploratory analysis is needed here, as this dataset is only a lookup table of aisle IDs and names.
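Equal unique counts are a good sign, but they do not by themselves prove a one-to-one mapping. One way to confirm it is to count distinct partners in both directions. A minimal sketch, using the first three rows shown in the `aisles.head()` output above (`aisles_demo` is a hypothetical name):

```python
import pandas as pd

# First three rows of the aisles table, as displayed by aisles.head()
aisles_demo = pd.DataFrame({
    "aisle_id": [1, 2, 3],
    "aisle": ["prepared soups salads", "specialty cheeses", "energy granola bars"],
})

# Each name should map to one ID, and each ID to one name
ids_per_name = aisles_demo.groupby("aisle")["aisle_id"].nunique()
names_per_id = aisles_demo.groupby("aisle_id")["aisle"].nunique()

is_one_to_one = bool((ids_per_name == 1).all() and (names_per_id == 1).all())
print(is_one_to_one)
```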

In [29]:
#Calculating the total missing values for each column
aisles.isna().sum()

aisle_id    0
aisle       0
dtype: int64

**There are no null or missing values in any columns of this dataset, ensuring data completeness and simplifying the preprocessing steps. This quality facilitates smoother analysis and modeling without the need for imputation or data cleaning related to missing values.** 

In [33]:
# Counting fully duplicated rows in the dataset
aisles.duplicated().sum()

0

**There are no fully duplicated rows in the dataset, so no deduplication is needed during preprocessing.**

### orders.csv Description

In [22]:
# Displaying the first 5 rows of the dataset
orders.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0


### Key Column Description: 
orders.csv dataset contains the following features:
* **order_id:** Unique identifier for each order placed by a user. Every order gets its own ID, regardless of whether the same items were ordered before.
* **user_id:** Identifier for each user of the shopping site. Each user has a unique ID, and an ID repeats across rows when that user has placed multiple orders.
* **eval_set:** Divides the dataset into three categories, "prior", "train", and "test". This split is not relevant to the present analysis.
* **order_number:** The sequence number of the order for that user (1 for the first order, 2 for the second, and so on). It describes a user's engagement with the store, which may in turn reflect their trust in the store.
* **order_dow:** The day of the week on which the order was placed. It indicates when the store is likely to be busy and to need more employee and delivery availability.
* **order_hour_of_day:** The hour of the day at which the order was placed. It indicates the times of day at which the store needs more employees to meet demand.
* **days_since_prior_order:** The number of days between an order and the same user's previous order. It indicates how frequently users return, which helps in planning product availability.
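To see when the store is busiest, orders can be counted by day of week and by hour. A minimal sketch, using the five rows shown in the `orders.head()` output above (`orders_demo` is a hypothetical name; on the full table the same calls would apply to `orders`):

```python
import pandas as pd

# order_dow and order_hour_of_day from the five rows displayed by orders.head()
orders_demo = pd.DataFrame({
    "order_dow": [2, 3, 3, 4, 4],
    "order_hour_of_day": [8, 7, 12, 7, 15],
})

# Orders per day of the week, sorted by day index
orders_by_day = orders_demo["order_dow"].value_counts().sort_index()

# Day-by-hour table of order counts (rows: day of week, columns: hour of day)
orders_by_day_hour = pd.crosstab(
    orders_demo["order_dow"], orders_demo["order_hour_of_day"]
)

print(orders_by_day)
print(orders_by_day_hour)
```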

In [26]:
#Calculating total unique values for each column
orders.nunique()

order_id                  3421083
user_id                    206209
eval_set                        3
order_number                  100
order_dow                       7
order_hour_of_day              24
days_since_prior_order         31
dtype: int64

### Key Insights:
* **order_id:** 3,421,083 unique orders, indicating a high volume of transactions.
* **user_id:** 206,209 unique users, suggesting multiple orders per user.
* **order_number:** 100 unique values, suggesting that the per-user order sequence runs up to about 100 orders for the most active users.
* **order_dow:** 7 unique values, representing the 7 days of the week.
* **order_hour_of_day:** 24 unique values, indicating purchases throughout the day.
* **days_since_prior_order:** 31 unique values, reflecting varying time gaps between orders.
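In the sample rows, order_number counts up 1, 2, 3, ... for a single user. If it is a per-user sequence, each user's highest order_number should equal the number of rows for that user. A minimal sketch of that check on a made-up frame (`orders_demo` and its values are hypothetical):

```python
import pandas as pd

# Toy frame with the real column names; values are illustrative only
orders_demo = pd.DataFrame({
    "order_id": [11, 12, 13, 21, 22],
    "user_id": [1, 1, 1, 2, 2],
    "order_number": [1, 2, 3, 1, 2],
})

per_user = orders_demo.groupby("user_id")["order_number"]
max_order_number = per_user.max()   # highest sequence number per user
order_count = per_user.size()       # number of order rows per user

# True when every user's sequence ends at their order count
is_sequence = bool((max_order_number == order_count).all())
print(max_order_number.to_dict(), is_sequence)
```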

In [30]:
#Calculating the total missing values for each column
orders.isna().sum()

order_id                       0
user_id                        0
eval_set                       0
order_number                   0
order_dow                      0
order_hour_of_day              0
days_since_prior_order    206209
dtype: int64

### Key Insights:

* There are no missing values in any feature except days_since_prior_order.
* days_since_prior_order has 206,209 missing values, which matches the number of unique users (206,209).
* These rows should not be dropped. The match with the user count, together with the empty value on user 1's first order in the sample above, suggests that the value is missing because a user's first order has no prior order to measure from.
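One way to test whether the gaps in days_since_prior_order fall on first orders is to compare the missing-value mask with `order_number == 1`. A minimal sketch on a made-up frame (`orders_demo` and its values are hypothetical; the notebook does not run this check):

```python
import numpy as np
import pandas as pd

# Toy frame with the real column names; values are illustrative only
orders_demo = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "order_number": [1, 2, 3, 1, 2],
    "days_since_prior_order": [np.nan, 15.0, 21.0, np.nan, 7.0],
})

is_missing = orders_demo["days_since_prior_order"].isna()
is_first_order = orders_demo["order_number"].eq(1)

# True when the value is missing on first orders and nowhere else
missing_only_on_first = bool((is_missing == is_first_order).all())

# The number of gaps should then equal the number of users
n_missing = int(is_missing.sum())
n_users = orders_demo["user_id"].nunique()
print(missing_only_on_first, n_missing, n_users)
```

If the check holds on the real table, the missing values could be left as they are or handled explicitly as "no prior order", instead of being imputed with a number.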

In [34]:
# Counting fully duplicated rows in the dataset
orders.duplicated().sum()

0

**There are no fully duplicated rows in the dataset, so no deduplication is needed during preprocessing. Individual columns such as user_id do repeat, which is expected because a user can place many orders.**

### order_products.csv Description

In [23]:
# Displaying the first 5 rows of the dataset
order_products.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,2,33120,1,1
1,2,28985,2,1
2,2,9327,3,0
3,2,45918,4,1
4,2,30035,5,0


#### Key Column Description:
* **order_id:** The ID of the order to which the row belongs. An order_id repeats once for each product in that order, so the repeated values show which products belong to a particular order.
* **product_id:** Numerical identifier of the product, unique to each product.
* **add_to_cart_order:** The position at which the product was added to the user's cart (1 for the first item added, 2 for the second, and so on).
* **reordered:** A categorical flag indicating whether the product was reordered (1) or not (0).

In [27]:
#Calculating total unique values for each column
order_products.nunique()

order_id             3214874
product_id             49677
add_to_cart_order        145
reordered                  2
dtype: int64

#### Key Insights:
* **order_id:** 3,214,874 unique values, fewer than the 3,421,083 orders in orders.csv, so not every order appears in this file. IDs repeat across rows because each order contains one or more products.
* **product_id:** 49,677 unique values, 11 fewer than the 49,688 products in products.csv. Although stored as a number, it is an identifier and not a numerical measure.
* **add_to_cart_order:** 145 unique values, suggesting that the largest orders contain around 145 items. It is best treated as an ordinal position in the cart.
* **reordered:** 2 unique values, indicating a binary categorical variable.
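The unique counts above can be followed up with a few direct checks: which catalogue products never appear in an order, how many items each order contains, and what share of order lines are reorders. A minimal sketch on made-up frames (`products_demo`, `order_products_demo`, and their values are hypothetical):

```python
import pandas as pd

# Toy frames with the real column names; values are illustrative only
products_demo = pd.DataFrame({"product_id": [1, 2, 3, 4]})
order_products_demo = pd.DataFrame({
    "order_id": [100, 100, 101],
    "product_id": [1, 3, 1],
    "reordered": [0, 1, 1],
})

# Products in the catalogue that never appear in any order
was_ordered = products_demo["product_id"].isin(order_products_demo["product_id"])
never_ordered = products_demo.loc[~was_ordered, "product_id"]

# Number of items in each order
items_per_order = order_products_demo.groupby("order_id")["product_id"].count()

# Share of order lines flagged as reorders (mean of a 0/1 column)
reorder_rate = order_products_demo["reordered"].mean()

print(never_ordered.tolist(), items_per_order.to_dict(), round(reorder_rate, 3))
```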

In [31]:
#Calculating the total missing values for each column
order_products.isna().sum()

order_id             0
product_id           0
add_to_cart_order    0
reordered            0
dtype: int64

**There are no null or missing values in any columns of this dataset, ensuring data completeness and simplifying the preprocessing steps. This quality facilitates smoother analysis and modeling without the need for imputation or data cleaning related to missing values.**

In [35]:
# Counting fully duplicated rows in the dataset
order_products.duplicated().sum()

0

**There are no fully duplicated rows in the dataset, so no deduplication is needed during preprocessing. Individual columns such as order_id and product_id do repeat, which is expected because an order contains several products and a product appears in many orders.**

## Conclusion

order_products.csv covers over 3.2 million unique orders and 49,677 distinct products, with each order linked to one or more products. Features like add_to_cart_order and reordered help analyze the sequence in which products are added to the cart and how often purchases are repeated. Temporal features (order day, hour, and days since the prior order) offer insights into user behavior. days_since_prior_order has 206,209 missing values, equal to the number of unique users, most likely because a user's first order has no prior order. Overall, the data is comprehensive for understanding user purchasing patterns, and the missing values in days_since_prior_order appear to reflect the structure of the data and not a data quality problem.