# Task
Impact Stores, popularly known as “The Excellent Store”, is a leading indigenous chain of stores headquartered in Gbagi, Oyo, Nigeria. At the core of the business is a strong sense of excellence and entrepreneurial value. This is evident in all of its 1,500 products, which are available to all segments of the population at customer-friendly prices across 10 stores in different Nigerian cities.

The CEO of the company, Chief A. A. Babatunji, plans to expand the chain to more Nigerian cities in 2025. However, because COVID-19 restrictions have affected the retail business, he wants to better understand which products return higher profits at specific stores so as to inform the expansion plan. For example, a brand of juice sold for N250 at one branch may be sold for N230 at another branch within the chain.

You have been engaged as the new Retail Data Analyst to build a predictive model that estimates the profit returns on each product at a particular store. The CEO needs to understand which product types, market clusters and store types (location, age, size) will give higher profit returns as he plans to expand to more cities in the country.

## This model predicts the item store returns for each product at each store

## Loading the dataset




In [21]:
import pandas as pd

train_df = pd.read_csv('/train.csv')
test_df = pd.read_csv('/train.csv')  # NOTE: this re-reads train.csv; point it at the separate test file if one is provided
sample_submission_df = pd.read_csv('/SampleSubmission.csv')

train_df
test_df
sample_submission_df

Unnamed: 0,Item_Store_ID,Item_Store_Returns
0,DRA59_BABATUNJI010,100
1,DRA59_BABATUNJI013,100
2,DRB01_BABATUNJI013,100
3,DRB13_BABATUNJI010,100
4,DRB13_BABATUNJI013,100
...,...,...
3527,NCZ42_BABATUNJI010,100
3528,NCZ42_BABATUNJI013,100
3529,NCZ42_BABATUNJI049,100
3530,NCZ53_BABATUNJI010,100


## Preprocess data


Handling missing values, encoding categorical features, and scaling numerical features


In [22]:
## Handling missing values
# Fill values are computed on the training data and reused for the test data.
for col in train_df.columns:
    if train_df[col].isnull().any():
        if pd.api.types.is_numeric_dtype(train_df[col]):
            fill_val = train_df[col].mean()      # numeric column: use the mean
        else:
            fill_val = train_df[col].mode()[0]   # categorical column: use the most frequent value
        # Assign the result back instead of calling fillna(inplace=True) on a column selection
        train_df[col] = train_df[col].fillna(fill_val)
        if col in test_df.columns:
            test_df[col] = test_df[col].fillna(fill_val)


train_df.isnull().sum()
test_df.isnull().sum()

Unnamed: 0,0
Item_ID,0
Store_ID,0
Item_Store_ID,0
Item_Weight,0
Item_Sugar_Content,0
Item_Visibility,0
Item_Type,0
Item_Price,0
Store_Start_Year,0
Store_Size,0
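
The same idea can be shown on a toy example: the fill values are learned from the training frame only and then applied to both frames. This is a minimal sketch; `toy_train`, `toy_test`, and their values are made up for illustration and only borrow two column names from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frames; the values are invented, only the column names mirror the dataset.
toy_train = pd.DataFrame({
    "Item_Weight": [10.0, np.nan, 14.0],
    "Store_Size": ["Medium", None, "Medium"],
})
toy_test = pd.DataFrame({
    "Item_Weight": [np.nan, 9.0],
    "Store_Size": [None, "Small"],
})

# Learn the fill values from the training frame only.
fill_values = {
    "Item_Weight": toy_train["Item_Weight"].mean(),    # mean of the non-missing weights
    "Store_Size": toy_train["Store_Size"].mode()[0],   # most frequent category
}

# DataFrame.fillna accepts a dict that maps column names to fill values.
toy_train = toy_train.fillna(fill_values)
toy_test = toy_test.fillna(fill_values)

print(toy_test)
```

Reusing the training statistics on the test frame keeps both frames on the same footing and avoids letting test data influence the preprocessing.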


In [23]:
# One-hot encoding the categorical columns on the combined train and test features so that both frames share the same columns.

combined_df = pd.concat([train_df.drop('Item_Store_Returns', axis=1), test_df.drop('Item_Store_Returns', axis=1)], ignore_index=True)

categorical_cols = combined_df.select_dtypes(include=['object']).columns
combined_df = pd.get_dummies(combined_df, columns=categorical_cols, dummy_na=False)

# .copy() makes independent frames, so later column assignments do not trigger SettingWithCopyWarning
train_processed = combined_df.iloc[:len(train_df)].copy()
test_processed = combined_df.iloc[len(train_df):].copy()

train_processed['Item_Store_Returns'] = train_df['Item_Store_Returns']
# Assignment aligns on index labels; test_processed keeps labels starting at len(train_df), so this column comes out as NaN
test_processed['Item_Store_Returns'] = test_df['Item_Store_Returns']

train_processed.head()
test_processed.head()

Unnamed: 0,Item_Weight,Item_Visibility,Item_Price,Store_Start_Year,Item_ID_DRA12,Item_ID_DRA24,Item_ID_DRA59,Item_ID_DRB01,Item_ID_DRB13,Item_ID_DRB24,...,Store_Size_Medium,Store_Size_Small,Store_Location_Type_Cluster 1,Store_Location_Type_Cluster 2,Store_Location_Type_Cluster 3,Store_Type_Grocery Store,Store_Type_Supermarket Type1,Store_Type_Supermarket Type2,Store_Type_Supermarket Type3,Item_Store_Returns
4990,11.6,0.068535,357.54,2005,True,False,False,False,False,False,...,True,False,False,False,True,True,False,False,False,
4991,11.6,0.040912,355.79,1994,True,False,False,False,False,False,...,False,False,False,False,True,False,True,False,False,
4992,11.6,0.041178,350.79,2014,True,False,False,False,False,False,...,True,False,False,True,False,False,True,False,False,
4993,11.6,0.041113,355.04,2016,True,False,False,False,False,False,...,True,False,False,False,True,False,False,True,False,
4994,11.6,0.0,354.79,2011,True,False,False,False,False,False,...,False,True,False,True,False,False,True,False,False,
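
The empty `Item_Store_Returns` column in the preview above is consistent with how pandas assigns a Series to a DataFrame column: values are matched by index label, not by position. The sketch below reproduces the effect on a toy frame; all names and the choice of `to_numpy()` for positional assignment are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

features = pd.DataFrame({"Item_Price": [357.54, 355.79, 350.79, 355.04]})
target = pd.Series([709.08, 6381.69, 6381.69, 2127.23], name="Item_Store_Returns")

# The second half of the frame keeps its original labels (2 and 3).
second_half = features.iloc[2:].copy()

# A Series with a fresh 0..n-1 index shares no labels with second_half.
fresh_target = target.iloc[2:].reset_index(drop=True)

misaligned = second_half.copy()
misaligned["Item_Store_Returns"] = fresh_target             # matched by label: no overlap, so NaN

aligned = second_half.copy()
aligned["Item_Store_Returns"] = fresh_target.to_numpy()     # plain array: assigned by position

print(misaligned)
print(aligned)
```

For a genuine test set the target is unknown anyway, so a NaN column is harmless there, but the same pattern silently loses data whenever the values are actually needed.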


In [5]:
from sklearn.preprocessing import StandardScaler

numerical_cols = ['Item_Weight', 'Item_Visibility', 'Item_Price', 'Store_Start_Year']

scaler = StandardScaler()
train_processed[numerical_cols] = scaler.fit_transform(train_processed[numerical_cols])
test_processed[numerical_cols] = scaler.transform(test_processed[numerical_cols])

train_processed.head()
test_processed.head()


Unnamed: 0,Item_Weight,Item_Visibility,Item_Price,Store_Start_Year,Item_ID_DRA12,Item_ID_DRA24,Item_ID_DRA59,Item_ID_DRB01,Item_ID_DRB13,Item_ID_DRB24,...,Store_Size_Medium,Store_Size_Small,Store_Location_Type_Cluster 1,Store_Location_Type_Cluster 2,Store_Location_Type_Cluster 3,Store_Type_Grocery Store,Store_Type_Supermarket Type1,Store_Type_Supermarket Type2,Store_Type_Supermarket Type3,Item_Store_Returns
0,-0.303799,0.030513,-0.287047,0.026132,True,False,False,False,False,False,...,True,False,False,False,True,True,False,False,False,709.08
1,-0.303799,-0.490159,-0.301708,-1.301998,True,False,False,False,False,False,...,False,False,False,False,True,False,True,False,False,6381.69
2,-0.303799,-0.485151,-0.343596,1.112784,True,False,False,False,False,False,...,True,False,False,True,False,False,True,False,False,6381.69
3,-0.303799,-0.486372,-0.307991,1.354262,True,False,False,False,False,False,...,True,False,False,False,True,False,False,True,False,2127.23
4,-0.303799,-1.261307,-0.310086,0.750567,True,False,False,False,False,False,...,False,True,False,True,False,False,True,False,False,2481.77


Unnamed: 0,Item_Weight,Item_Visibility,Item_Price,Store_Start_Year,Item_ID_DRA12,Item_ID_DRA24,Item_ID_DRA59,Item_ID_DRB01,Item_ID_DRB13,Item_ID_DRB24,...,Store_Size_Medium,Store_Size_Small,Store_Location_Type_Cluster 1,Store_Location_Type_Cluster 2,Store_Location_Type_Cluster 3,Store_Type_Grocery Store,Store_Type_Supermarket Type1,Store_Type_Supermarket Type2,Store_Type_Supermarket Type3,Item_Store_Returns
4990,-0.303799,0.030513,-0.287047,0.026132,True,False,False,False,False,False,...,True,False,False,False,True,True,False,False,False,
4991,-0.303799,-0.490159,-0.301708,-1.301998,True,False,False,False,False,False,...,False,False,False,False,True,False,True,False,False,
4992,-0.303799,-0.485151,-0.343596,1.112784,True,False,False,False,False,False,...,True,False,False,True,False,False,True,False,False,
4993,-0.303799,-0.486372,-0.307991,1.354262,True,False,False,False,False,False,...,True,False,False,False,True,False,False,True,False,
4994,-0.303799,-1.261307,-0.310086,0.750567,True,False,False,False,False,False,...,False,True,False,True,False,False,True,False,False,
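
As a small illustration of the fit-on-train, transform-on-test pattern used in this cell, the sketch below standardises a single toy price column. The arrays and the name `toy_scaler` are assumptions made for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy price columns (one feature, shape (n_samples, 1)); values are invented.
train_prices = np.array([[100.0], [200.0], [300.0]])
test_prices = np.array([[200.0], [400.0]])

toy_scaler = StandardScaler()
train_scaled = toy_scaler.fit_transform(train_prices)  # learns the training mean and standard deviation
test_scaled = toy_scaler.transform(test_prices)        # reuses the training statistics

print(toy_scaler.mean_, toy_scaler.scale_)
print(train_scaled.ravel())
print(test_scaled.ravel())
```

The test value 200 maps to 0 because it equals the training mean, even though it is not the mean of the test column. Decision trees split on thresholds, so standardising is generally not required for the model used later; it matters more for distance-based or linear models.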


## Feature engineering

Calculating the store age and extracting the item type prefix from the Item_ID column for both train and test dataframes. In the order shown here, these columns are added to train_df and test_df, not to train_processed and test_processed, so the model trained below does not see them unless the encoding step is re-run afterwards.

In [13]:
# Store age relative to 2025, the planned expansion year
train_df['Store_Age'] = 2025 - train_df['Store_Start_Year']
test_df['Store_Age'] = 2025 - test_df['Store_Start_Year']

# The first two characters of Item_ID give a coarse item category (e.g. 'DR', 'NC')
train_df['Item_Type_Combined'] = train_df['Item_ID'].str[:2]
test_df['Item_Type_Combined'] = test_df['Item_ID'].str[:2]

train_df.head()
test_df.head()

Unnamed: 0,Item_ID,Store_ID,Item_Store_ID,Item_Weight,Item_Sugar_Content,Item_Visibility,Item_Type,Item_Price,Store_Start_Year,Store_Size,Store_Location_Type,Store_Type,Item_Store_Returns,Store_Age,Item_Type_Combined
0,DRA12,BABATUNJI010,DRA12_BABATUNJI010,11.6,Low Sugar,0.068535,Soft Drinks,357.54,2005,Medium,Cluster 3,Grocery Store,709.08,20,DR
1,DRA12,BABATUNJI013,DRA12_BABATUNJI013,11.6,Low Sugar,0.040912,Soft Drinks,355.79,1994,High,Cluster 3,Supermarket Type1,6381.69,31,DR
2,DRA12,BABATUNJI017,DRA12_BABATUNJI017,11.6,Low Sugar,0.041178,Soft Drinks,350.79,2014,Medium,Cluster 2,Supermarket Type1,6381.69,11,DR
3,DRA12,BABATUNJI018,DRA12_BABATUNJI018,11.6,Low Sugar,0.041113,Soft Drinks,355.04,2016,Medium,Cluster 3,Supermarket Type2,2127.23,9,DR
4,DRA12,BABATUNJI035,DRA12_BABATUNJI035,11.6,Ultra Low Sugar,0.0,Soft Drinks,354.79,2011,Small,Cluster 2,Supermarket Type1,2481.77,14,DR


## Model selection and training

Choosing the Decision Tree Regressor model and training it on the prepared data.


In [7]:
from sklearn.tree import DecisionTreeRegressor

X_train = train_processed.drop('Item_Store_Returns', axis=1)
y_train = train_processed['Item_Store_Returns']

model = DecisionTreeRegressor(random_state=42)
model.fit(X_train, y_train)

## Prediction

Using the trained model to make predictions on the test data.


In [12]:
test_predictions = model.predict(test_processed.drop('Item_Store_Returns', axis=1))

In [20]:
from sklearn.metrics import mean_squared_error
import numpy as np

# Calculating predictions on the training data to evaluate the model
y_train_pred = model.predict(X_train)

# Calculating the Root Mean Squared Error (RMSE) on the training data.
# This measures fit to rows the model has already seen, not performance on unseen data.
rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))

rmse

np.float64(2.2300290883277154e-14)
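
A training RMSE this close to zero is what an unpruned decision tree typically produces when it memorises its training rows, so it says little about how the model will do on unseen data. A hold-out split gives a more honest estimate. The sketch below uses synthetic data; `X`, `y`, the 80/20 split, and the variable names are assumptions for illustration and not the notebook's data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data: one informative feature plus noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))
y = 3.0 * X[:, 0] + rng.normal(scale=1.0, size=500)

# Hold out 20% of the rows; the tree never sees them during fitting.
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

tree = DecisionTreeRegressor(random_state=42)
tree.fit(X_fit, y_fit)

rmse_fit = np.sqrt(mean_squared_error(y_fit, tree.predict(X_fit)))
rmse_val = np.sqrt(mean_squared_error(y_val, tree.predict(X_val)))

print(f"training RMSE:   {rmse_fit:.3f}")
print(f"validation RMSE: {rmse_val:.3f}")
```

If the validation RMSE is much larger than the training RMSE, limiting the tree (for example with `max_depth` or `min_samples_leaf`) is one common way to reduce the gap.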

## Generating submission file

In [11]:
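submission_df = pd.DataFrame({'Item_Store_ID': test_df['Item_Store_ID'], 'Item_Store_Returns': test_predictions})  # same columns as the sample submission
submission_df.to_csv('submission.csv', index=False)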