## Store demand forecasting using the concept of store attractiveness

Ranking stores based on their properties can indeed make sense, especially if you believe that these properties have a significant impact on the store's performance and customer demand. The rank effectively becomes a summary metric that captures the essence of multiple features, which can be particularly useful if there's a complex interplay between different store attributes that's difficult to model directly.

Using this ranking as a qualitative measure in a demand forecast model can be beneficial. It allows you to incorporate a wide range of store-specific factors into the model without overwhelming it with too many individual features, which could lead to overfitting or increased model complexity.

The rank can serve as a proxy for the 'quality' or 'attractiveness' of a store from a customer's perspective. So, the store rank can be a useful feature in your demand forecast model, but it should be used thoughtfully and in conjunction with other relevant sales and item-related features.

## Sample data creation

Let's create a DataFrame with 100 stores, each with unique IDs and randomly generated features. It then ranks the stores based on their properties after reducing the dimensionality of the features and scaling the results to a 0-1 range. The stores are sorted by their rank in descending order, so you can see which stores are ranked higher based on their properties.

In [1]:
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

# Set a random seed for reproducibility
np.random.seed(42)

# Generate sample random data for stores
num_stores = 100  # Number of stores
store_ids = np.arange(1, num_stores + 1)

# Geofeatures
locations = np.random.choice(['Urban', 'Suburban', 'Rural'], num_stores)
federal_states = np.random.choice(['State1', 'State2', 'State3'], num_stores)
population_densities = np.random.uniform(100, 1000, num_stores)  # Example range
shopping_mall_nearby = np.random.choice([True, False], num_stores)
other_retail_stores_nearby = np.random.choice([True, False], num_stores)

# Store-related features
construction_years = np.random.randint(1990, 2022, num_stores)
store_sizes = np.random.uniform(1000, 5000, num_stores)  # Example range in square meters
store_concepts = np.random.choice(['Concept1', 'Concept2', 'Concept3'], num_stores)
number_of_car_parks = np.random.randint(50, 200, num_stores)

# Create a DataFrame
store_data = pd.DataFrame({
'StoreID': store_ids,
'Location': locations,
'FederalState': federal_states,
'PopulationDensity': population_densities,
'ShoppingMallNearby': shopping_mall_nearby,
'OtherRetailStoresNearby': other_retail_stores_nearby,
'ConstructionYear': construction_years,
'StoreSize': store_sizes,
'StoreConcept': store_concepts,
'NumberOfCarParks': number_of_car_parks
})

# Convert categorical features to numerical values
store_data_encoded = pd.get_dummies(store_data, columns=['Location', 'FederalState', 'StoreConcept'])

# Perform PCA to reduce to 1 component
pca = PCA(n_components=1)
store_features_reduced = pca.fit_transform(store_data_encoded.drop('StoreID', axis=1))

# Use Min-Max scaling to get a rank value between 0 and 1
scaler = MinMaxScaler(feature_range=(0, 1))
store_ranks = scaler.fit_transform(store_features_reduced)

# Add the rank values to the original DataFrame
store_data['Rank'] = store_ranks

In [2]:
# Display the DataFrame sorted by rank
store_data.sort_values(by='Rank', ascending=False)

Unnamed: 0,StoreID,Location,FederalState,PopulationDensity,ShoppingMallNearby,OtherRetailStoresNearby,ConstructionYear,StoreSize,StoreConcept,NumberOfCarParks,Rank
86,87,Rural,State3,672.699356,True,True,2015,1173.603133,Concept1,96,1.000000
38,39,Urban,State3,946.470938,False,False,2017,1223.484619,Concept1,160,0.988750
19,20,Suburban,State2,172.767994,True,False,1991,1213.941187,Concept2,79,0.986179
91,92,Urban,State2,915.488909,False,False,2018,1301.738188,Concept2,127,0.968068
27,28,Rural,State1,818.510612,True,False,1990,1333.137645,Concept1,194,0.959235
...,...,...,...,...,...,...,...,...,...,...,...
42,43,Rural,State2,708.121105,False,False,2009,4874.607776,Concept3,193,0.031962
36,37,Rural,State2,413.799389,False,False,1991,4896.992834,Concept1,92,0.024165
83,84,Urban,State1,377.254713,False,False,1998,4911.602675,Concept1,57,0.020097
95,96,Urban,State2,702.031654,False,True,2010,4983.325500,Concept3,105,0.003462


In [4]:
# Generate sample random data for sales and product-related features
num_products = 10  # Number of products
product_ids = np.arange(1, num_products + 1)
weekly_sales = np.random.uniform(1000, 5000, (num_stores, num_products))
monthly_sales = np.random.uniform(4000, 20000, (num_stores, num_products))
unit_prices = np.random.uniform(1, 100, (num_stores, num_products))
units_sold = np.random.randint(10, 100, (num_stores, num_products))
troc_values = np.random.uniform(0.5, 1.5, (num_stores, num_products))
product_classes = np.random.choice(['Class1', 'Class2', 'Class3'], (num_stores, num_products))
product_ingredients = np.random.choice(['Ingredient1', 'Ingredient2', 'Ingredient3'], (num_stores, num_products))

# Create a DataFrame for sales and product-related features
# Corrected creation of the DataFrame for sales and product-related features
sales_data = pd.DataFrame({
'StoreID': np.repeat(store_ids, num_products),
'ProductID': np.tile(product_ids, num_stores),
'WeeklySales': weekly_sales.flatten(),
'MonthlySales': monthly_sales.flatten(),
'UnitPrice': unit_prices.flatten(),
'UnitsSold': units_sold.flatten(),
'TROC': troc_values.flatten(),
'ProductClass': product_classes.flatten(),  # Removed tiling
'ProductIngredients': product_ingredients.flatten()  # Removed tiling
})

## Train simple regression model

In [5]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Convert categorical features to numerical values
sales_data_encoded = pd.get_dummies(sales_data, columns=['ProductClass', 'ProductIngredients'])

# Merge the store ranks with the sales data
full_data = pd.merge(sales_data_encoded, store_data[['StoreID', 'Rank']], on='StoreID')

# Define the target variable (demand) and features
X = full_data.drop(['StoreID', 'ProductID', 'UnitsSold'], axis=1)  # Exclude identifiers and the target variable
y = full_data['UnitsSold']  # Demand

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict demand on the test set
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")

# Display the coefficients of the model
coefficients = pd.DataFrame(model.coef_, X.columns, columns=['Coefficient'])
print(coefficients)

Mean Squared Error: 695.2943697267051
                                Coefficient
WeeklySales                        0.000806
MonthlySales                      -0.000127
UnitPrice                         -0.034481
TROC                              -0.395004
ProductClass_Class1                1.370191
ProductClass_Class2                0.535589
ProductClass_Class3               -1.905780
ProductIngredients_Ingredient1    -2.179465
ProductIngredients_Ingredient2     0.678476
ProductIngredients_Ingredient3     1.500988
Rank                               3.712348


## Allocation

In [6]:
# Allocation of boxes based on regions and store ranks. Assume 'region_data' is a DataFrame containing the region for each store
region_data = pd.DataFrame({'StoreID': store_ids, 'Region': np.random.choice(['A', 'B', 'C'], num_stores)})

# Merge region information with the full data
full_data_with_region = pd.merge(full_data, region_data, on='StoreID')

# Total boxes available for each region
boxes_per_region = {'A': 1000, 'B': 900, 'C': 700}

# Filter full_data_with_region to only include test set rows
test_set_ids = X_test.index
full_data_with_region_test = full_data_with_region[full_data_with_region.index.isin(test_set_ids)]

# Calculate the weighted demand for the test set
full_data_with_region_test['WeightedDemand'] = full_data_with_region_test['Rank'] * y_pred

# Allocate boxes to stores based on weighted demand
def allocate_boxes(region_group):
  total_weighted_demand = region_group['WeightedDemand'].sum()
  region_group['AllocatedBoxes'] = np.floor(region_group['WeightedDemand'] / total_weighted_demand * boxes_per_region[region_group.name]).astype(int)
  remainder_boxes = boxes_per_region[region_group.name] - region_group['AllocatedBoxes'].sum()
  fractional_parts = region_group['WeightedDemand'] % 1
  stores_sorted_by_fractional = fractional_parts.sort_values(ascending=False).index

  # Distribute the remaining boxes to stores with the highest fractional parts
  for i in range(remainder_boxes):
    store_id = stores_sorted_by_fractional[i]
    region_group.loc[store_id, 'AllocatedBoxes'] += 1

  return region_group

# Apply the allocation function to each region group
allocation_result_test = full_data_with_region_test.groupby('Region').apply(allocate_boxes)

# Display the allocation result
allocation_result_test[['StoreID', 'Region', 'AllocatedBoxes']]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  full_data_with_region_test['WeightedDemand'] = full_data_with_region_test['Rank'] * y_pred


Unnamed: 0_level_0,Unnamed: 1_level_0,StoreID,Region,AllocatedBoxes
Region,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
A,44,5,A,38
A,54,6,A,33
A,55,6,A,33
A,59,6,A,33
A,70,8,A,6
...,...,...,...,...
C,977,98,C,13
C,978,98,C,12
C,985,99,C,5
C,986,99,C,6
