### in the name of Allah
# Shopping Cart Analysis and Recommender System based on ARM (Association Rule Mining)
------------------
## TASK I: Data Preprocessing

This task focuses on preparing the Instacart e-commerce dataset for Association Rule Mining analysis. The preprocessing pipeline ensures data quality and creates a manageable subset for efficient pattern mining. We address common data issues including missing values and irrelevant transactions, while focusing on multi-item baskets essential for meaningful association rule discovery. By sampling 20,000 users, we balance computational feasibility with sufficient data coverage for robust analysis.

### **Step 1: Loading the Data**
Load CSV files:
- **pandas** for small files: `aisles.csv`, `departments.csv`, `products.csv`
- **dask** for large files: `order_products__train.csv`, `orders.csv`, `order_products__prior.csv`

### **Step 2: Data Cleaning Functions**
Define functions:
- **`remove_nulls(df)`**: Remove rows with missing values.
- **`filter_single_item_orders(df)`**: Remove orders with only one product.
- **`filter_by_order_ids(df, order_ids_set)`**: Keep only rows whose `order_id` is in the given set.
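
To make the effect of these three functions concrete, here is a minimal sketch on a made-up table. It uses plain pandas rather than dask, and `filter_single_item_orders` is written with `transform('size')` as a compact variant, so it illustrates the behavior rather than reproducing the notebook's implementation. The toy values are assumptions chosen for illustration only.

```python
import pandas as pd

def remove_nulls(df):
    """Drop every row that contains at least one missing value."""
    return df.dropna()

def filter_single_item_orders(df):
    """Keep only rows that belong to orders with two or more products."""
    # transform('size') broadcasts each order's row count back onto its rows
    items_per_order = df.groupby('order_id')['product_id'].transform('size')
    return df[items_per_order > 1]

def filter_by_order_ids(df, order_ids_set):
    """Keep only rows whose order_id is in the given set."""
    return df[df['order_id'].isin(order_ids_set)]

# Toy data (made up): order 2 has a single item, order 3 has one null product
toy = pd.DataFrame({
    'order_id':   [1, 1, 2, 3, 3, 4, 4],
    'product_id': [10, 11, 12, 13, None, 14, 15],
})

cleaned = remove_nulls(toy)                      # drops the null row of order 3
multi_item = filter_single_item_orders(cleaned)  # orders 2 and 3 now have one item each
selected = filter_by_order_ids(multi_item, {1})  # restricts to order 1

print(multi_item['order_id'].unique().tolist())
print(selected['product_id'].tolist())
```

Note that the order of the steps matters: removing nulls first can turn a two-item order into a single-item one, which the second filter then drops.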

### **Step 3: Preprocessing Execution**
1. **Remove nulls** from all tables.
2. **Remove single-item orders** from order products data.
3. **Sample 20,000 users** randomly from orders.
4. **Extract orders** for sampled users.
5. **Filter order products** to include only sampled orders.
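
Steps 3 to 5 can be sketched on a small, made-up example. This is a pandas-only illustration under stated assumptions: the tables and `n_users = 2` are hypothetical stand-ins for the real data and the 20,000 users, and it samples with `n=` whereas the notebook passes a fraction to dask's `sample`.

```python
import pandas as pd

# Toy tables (made up for illustration)
orders = pd.DataFrame({
    'order_id': [1, 2, 3, 4, 5, 6],
    'user_id':  [100, 100, 200, 300, 300, 400],
})
order_products = pd.DataFrame({
    'order_id':   [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
    'product_id': [10, 11, 10, 12, 13, 14, 10, 15, 11, 12, 16, 17],
})

n_users = 2  # hypothetical stand-in for the 20,000 users sampled in the notebook

# Step 3: sample distinct users, capped at the number of users available
unique_users = orders['user_id'].drop_duplicates()
sampled_users = unique_users.sample(n=min(n_users, len(unique_users)), random_state=42)

# Step 4: keep every order placed by a sampled user
orders_sampled = orders[orders['user_id'].isin(sampled_users)]

# Step 5: keep only the order-product rows that belong to those orders
order_ids_set = set(orders_sampled['order_id'])
order_products_sampled = order_products[order_products['order_id'].isin(order_ids_set)]

print(sorted(sampled_users.tolist()))
print(order_products_sampled['order_id'].nunique())
```

Sampling users rather than orders keeps each sampled user's full purchase history together.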

### **Step 4: Saving Cleaned Data**
Save cleaned data to `./processed_data/`:
- **`aisles_cleaned.csv`**, **`products_cleaned.csv`**, **`departments_cleaned.csv`**
- **`orders_sampled.csv`**: Orders of 20,000 sampled users.
- **`order_products_train_sampled.csv`**, **`order_products_prior_sampled.csv`**
- **`order_products_combined.csv`**: Combined data for basket analysis.

This preprocessing ensures clean, multi-item basket data ready for Association Rule Mining in subsequent tasks.

In [12]:
import os
import numpy as np
import pandas as pd
import dask.dataframe as dd

# ============================================================================
# LOADING DATA FROM CSV FILES
# ============================================================================

# Load small metadata files using pandas
aisles = pd.read_csv('aisles.csv')
products = pd.read_csv('products.csv')
departments = pd.read_csv('departments.csv')

# Load large transactional files using dask
orders = dd.read_csv('orders.csv')
order_products_train = dd.read_csv('order_products__train.csv')
order_products_prior = dd.read_csv('order_products__prior.csv')

print("Data loading completed.")
print(f"Orders: {orders.shape[0].compute():,} rows")
print(f"Order Products Prior: {order_products_prior.shape[0].compute():,} rows")
print(f"Order Products Train: {order_products_train.shape[0].compute():,} rows")

Data loading completed.
Orders: 3,421,083 rows
Order Products Prior: 32,434,489 rows
Order Products Train: 1,384,617 rows


In [None]:
# ============================================================================
# DATA PREPROCESSING FUNCTIONS
# ============================================================================

def remove_nulls(df):
    """Remove rows with null values from dataframe."""
    return df.dropna()


def filter_single_item_orders(df):
    """
    Remove orders that contain only one product item.
    For basket analysis, we need at least two items per order.
    """
    # Count products per order using groupby and size
    item_counts = df.groupby('order_id').size().reset_index()
    item_counts = item_counts.rename(columns={0: 'item_count'})
    
    # Get orders with more than 1 item & filter original dataframe
    multi_item_orders = item_counts[item_counts['item_count'] > 1]
    return df[df['order_id'].isin(multi_item_orders['order_id'])]


def filter_by_order_ids(df, order_ids_set):
    """
    Keep only the rows whose order_id is in order_ids_set.
    Used to restrict order products to the orders of the 20,000 sampled users.
    """
    return df[df['order_id'].isin(order_ids_set)]

In [14]:
# ============================================================================
# TASK 1: DATA PREPROCESSING EXECUTION
# ============================================================================

print("\n" + "="*50)
print("TASK 1: DATA PREPROCESSING")
print("="*50)

# Step 1: Remove null values
print("\n1. Removing null values...")
aisles_cleaned = remove_nulls(aisles)
products_cleaned = remove_nulls(products)
departments_cleaned = remove_nulls(departments)
orders_cleaned = remove_nulls(orders).persist()
order_products_train_cleaned = remove_nulls(order_products_train).persist()
order_products_prior_cleaned = remove_nulls(order_products_prior).persist()

# Step 2: Remove single-item orders
print("2. Removing single-item orders...")
order_products_train_filtered = filter_single_item_orders(order_products_train_cleaned).persist()
order_products_prior_filtered = filter_single_item_orders(order_products_prior_cleaned).persist()

# Step 3: Sample 20,000 users
print("3. Sampling 20,000 random users...")
unique_users_count = orders_cleaned['user_id'].nunique().compute()
frac_value = min(20000 / unique_users_count, 1.0)

sampled_users = orders_cleaned['user_id'].drop_duplicates().sample(
    frac=frac_value, 
    random_state=42
)

# Step 4: Get orders for sampled users
sampled_users_list = sampled_users.compute().tolist()
orders_sampled = orders_cleaned[orders_cleaned['user_id'].isin(sampled_users_list)]

# Get order IDs
sampled_order_ids = orders_sampled['order_id'].compute().tolist()
order_ids_set = set(sampled_order_ids)

# Step 5: Filter order products
print("4. Filtering order products...")
order_products_train_sampled = order_products_train_filtered.map_partitions(
    filter_by_order_ids, 
    order_ids_set,
    meta=order_products_train_filtered._meta
).persist()

order_products_prior_sampled = order_products_prior_filtered.map_partitions(
    filter_by_order_ids,
    order_ids_set,
    meta=order_products_prior_filtered._meta
).persist()

print("\nPreprocessing completed!")


TASK 1: DATA PREPROCESSING

1. Removing null values...
2. Removing single-item orders...
3. Sampling 20,000 random users...
4. Filtering order products...

Preprocessing completed!


In [15]:
# ============================================================================
# SAVING PROCESSED DATA
# ============================================================================

output_folder = './processed_data'
# Create the output folder if it does not exist yet
os.makedirs(output_folder, exist_ok=True)

print(f"\nSaving data to {output_folder}/")

# Save metadata files
aisles_cleaned.to_csv(f'{output_folder}/aisles_cleaned.csv', index=False)
products_cleaned.to_csv(f'{output_folder}/products_cleaned.csv', index=False)
departments_cleaned.to_csv(f'{output_folder}/departments_cleaned.csv', index=False)

# Save transactional data (need to compute dask dataframes first)
orders_sampled_computed = orders_sampled.compute()
orders_sampled_computed.to_csv(f'{output_folder}/orders_sampled.csv', index=False)

order_products_train_sampled_computed = order_products_train_sampled.compute()
order_products_train_sampled_computed.to_csv(
    f'{output_folder}/order_products_train_sampled.csv', 
    index=False
)

order_products_prior_sampled_computed = order_products_prior_sampled.compute()
order_products_prior_sampled_computed.to_csv(
    f'{output_folder}/order_products_prior_sampled.csv', 
    index=False
)

# Save combined data for basket analysis
order_products_combined = dd.concat([order_products_train_sampled, order_products_prior_sampled])
order_products_combined_computed = order_products_combined.compute()
order_products_combined_computed.to_csv(
    f'{output_folder}/order_products_combined.csv', 
    index=False
)

# Final summary
print("\n" + "="*50)
print("TASK 1 SUMMARY")
print("="*50)
print(f"Sampled users: {len(sampled_users_list):,}")
print(f"Sampled orders: {orders_sampled_computed.shape[0]:,}")
print(f"Total products for basket analysis: {order_products_combined_computed.shape[0]:,}")
print("\n✓ Task 1 completed - Data ready for basket analysis (Task 2)")
print("="*50)



Saving data to ./processed_data/

TASK 1 SUMMARY
Sampled users: 20,000
Sampled orders: 312,778
Total products for basket analysis: 3,083,695

✓ Task 1 completed - Data ready for basket analysis (Task 2)


## TASK II: Base Demand Calculation and Initial Strategy

In this task, we calculate the base demand for each product, define an initial pricing and advertising strategy, and define the demand and profit functions. This includes:

1. **Calculate base demand**: The base demand for each product is calculated as the total quantity divided by the number of active months. If no active months are available, total quantity is used.
2. **Assign cost**: Based on the median price of the products, we assign a cost of 5 or 10 units.
3. **Define initial strategy**: The initial price is set to the average price, and advertising budget is initialized.
4. **Save results**: The results are saved for the next simulation stage.
5. **Demand Function (demand_func)**: Calculates the demand based on price, average price of other sellers, advertising budget, and social influence.
6. **Profit Function (profit_func)**: Calculates the profit for each seller based on the demand, price, and advertising budget.
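
The base demand rule in item 1 can be sketched as follows. The function name `base_demand` and its arguments are hypothetical, and treating a missing (`None`) month count like zero is an assumption. The cost rule and the demand and profit functions are not sketched, because the text above does not give their formulas.

```python
def base_demand(total_quantity, active_months):
    """Average monthly quantity; falls back to the total when no active months are recorded."""
    if active_months and active_months > 0:
        return total_quantity / active_months
    # No active months available (0 or None): use the total quantity as-is
    return total_quantity

print(base_demand(120, 6))     # 120 units over 6 active months
print(base_demand(120, 0))     # no active months: falls back to the total
print(base_demand(120, None))  # missing month count treated like zero (assumption)
```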

In [16]:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder

# Load the combined order-product data saved in Task I and keep only 'order_id' and 'product_id'
# (assumed to be the intended input, since Task I does not write a 'cleaned_data.csv')
cleaned_data = pd.read_csv(f'{output_folder}/order_products_combined.csv')
order_products_train_cleaned = cleaned_data[['order_id', 'product_id']]  # Only the relevant columns

# Group by `order_id` to create a list of products for each order
grouped_orders = order_products_train_cleaned.groupby('order_id')['product_id'].apply(list).reset_index(name='product_list')

# Create one-hot encoding for the products in each order using TransactionEncoder
te = TransactionEncoder()
te_ary = te.fit_transform(grouped_orders['product_list'])
one_hot_encoded_df = pd.DataFrame(te_ary, columns=te.columns_)

# Display the first few rows of the one-hot encoded table to verify the transformation, then save it
print(one_hot_encoded_df.head())
one_hot_encoded_df.to_csv(f'{output_folder}/one_hot_encoded_df.csv', index=False)
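
To show what the one-hot encoded basket table looks like, here is a minimal sketch on made-up data. It uses `pd.crosstab` as a pandas-only stand-in for `TransactionEncoder`, so it illustrates the shape of the result rather than the mlxtend call itself. The per-product support at the end is added only to show how the table is typically read.

```python
import pandas as pd

# Toy order-product rows (made up for illustration)
order_products = pd.DataFrame({
    'order_id':   [1, 1, 2, 2, 2, 3, 3],
    'product_id': [10, 11, 10, 12, 13, 11, 12],
})

# Count occurrences per (order, product) pair, then turn counts into booleans:
# one row per order, one column per product, True if the order contains it
one_hot = pd.crosstab(order_products['order_id'], order_products['product_id']) > 0

# Support of a product = fraction of orders that contain it
support = one_hot.mean()

print(one_hot)
print(support)
```

On the real data the dense table has one column per distinct product and can become very large. mlxtend's `TransactionEncoder.transform` documents a `sparse=True` option that may be worth considering in that case.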