# **Store-Sales-Prediction**

Problem Statement -
You are opening a new store at a particular location. Given the store's location, area, size, and other
parameters, predict the overall revenue (sales) generated by the store.

Dataset Details - The data has 8523 rows and 12 variables.

Dataset Description -
Variable - Description
1. Item_Identifier - Unique product ID
2. Item_Weight - Weight of the product
3. Item_Fat_Content - Whether the product is low fat or not
4. Item_Visibility - The % of the total display area of all products in a store allocated to the particular product
5. Item_Type - The category to which the product belongs
6. Item_MRP - Maximum Retail Price (list price) of the product
7. Outlet_Identifier - Unique store ID
8. Outlet_Establishment_Year - The year in which the store was established
9. Outlet_Size - The size of the store in terms of ground area covered
10. Outlet_Location_Type - The type of city in which the store is located
11. Outlet_Type - Whether the outlet is just a grocery store or some sort of supermarket
12. Item_Outlet_Sales - Sales of the product in the particular store. This is the outcome variable to be predicted.

Dataset Link :
https://drive.google.com/drive/folders/1-WqRLkzYFJJMe-e_QeVUAMqpvesZ7cft?usp=sharing

In [None]:
#Import libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import xgboost as xgb
import lightgbm as lgb
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.model_selection import KFold, StratifiedKFold



In [None]:
#load our dataset
df = pd.read_csv('/content/Train.csv')

In [None]:
#checking the rows and columns
df.shape

(8523, 12)

In [None]:
#first five rows
df.head()


,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


In [None]:
#general info of our dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                7060 non-null   float64
 2   Item_Fat_Content           8523 non-null   object 
 3   Item_Visibility            8523 non-null   float64
 4   Item_Type                  8523 non-null   object 
 5   Item_MRP                   8523 non-null   float64
 6   Outlet_Identifier          8523 non-null   object 
 7   Outlet_Establishment_Year  8523 non-null   int64  
 8   Outlet_Size                6113 non-null   object 
 9   Outlet_Location_Type       8523 non-null   object 
 10  Outlet_Type                8523 non-null   object 
 11  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB


In [None]:
#checking the null values in dataset
df.isnull().sum()

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64

In [None]:
#We will fill the null values with zero
df1 =  df.fillna(0)
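
Filling with zero gives the missing `Item_Weight` rows an implausible weight of 0 and stores the missing `Outlet_Size` entries as the integer 0 in a column of strings. One common alternative, sketched here on a small made-up frame and not on the real data, is to impute the numeric column with its median and give the categorical column an explicit label. The toy values and the `'Unknown'` label are assumptions for illustration; every result reported in this notebook was produced with `fillna(0)`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for Train.csv (values are made up for illustration)
toy = pd.DataFrame({
    'Item_Weight': [9.3, np.nan, 17.5, 19.2, np.nan],
    'Outlet_Size': ['Medium', 'Medium', np.nan, 'High', np.nan],
})

imputed = toy.copy()
# Numeric column: replace missing weights with the median of the observed weights
imputed['Item_Weight'] = imputed['Item_Weight'].fillna(imputed['Item_Weight'].median())
# Categorical column: keep missingness visible as its own explicit label
imputed['Outlet_Size'] = imputed['Outlet_Size'].fillna('Unknown')

print(imputed)
print(imputed.isnull().sum())
```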

In [None]:
#checking the null values again
df1.isnull().sum()

Item_Identifier              0
Item_Weight                  0
Item_Fat_Content             0
Item_Visibility              0
Item_Type                    0
Item_MRP                     0
Outlet_Identifier            0
Outlet_Establishment_Year    0
Outlet_Size                  0
Outlet_Location_Type         0
Outlet_Type                  0
Item_Outlet_Sales            0
dtype: int64

In [None]:
#Now we want to check the number of unique values in each column
df1.nunique()

Item_Identifier              1559
Item_Weight                   416
Item_Fat_Content                5
Item_Visibility              7880
Item_Type                      16
Item_MRP                     5938
Outlet_Identifier              10
Outlet_Establishment_Year       9
Outlet_Size                     4
Outlet_Location_Type            3
Outlet_Type                     4
Item_Outlet_Sales            3493
dtype: int64
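
`Item_Fat_Content` is described as a low-fat/regular flag, yet it has 5 unique values, which suggests the same category is spelled in several ways. Below is a minimal sketch of how such labels could be inspected and merged before encoding. The variant spellings in the toy series (`'LF'`, `'low fat'`, `'reg'`) are assumptions; check `df1['Item_Fat_Content'].value_counts()` for the actual ones.

```python
import pandas as pd

# Toy series; the variant spellings are assumed, not taken from the notebook output
fat = pd.Series(['Low Fat', 'Regular', 'LF', 'reg', 'low fat', 'Regular'])

# Inspect the raw labels first
print(fat.value_counts())

# Map every variant onto one of two canonical labels
canonical = {'LF': 'Low Fat', 'low fat': 'Low Fat', 'reg': 'Regular'}
fat_clean = fat.replace(canonical)

print(fat_clean.value_counts())
```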

In [None]:
#We use the describe function to get summary statistics for each numerical column
df1.describe()

,Item_Weight,Item_Visibility,Item_MRP,Outlet_Establishment_Year,Item_Outlet_Sales
count,8523.0,8523.0,8523.0,8523.0,8523.0
mean,10.65059,0.066132,140.992782,1997.831867,2181.288914
std,6.431899,0.051598,62.275067,8.37176,1706.499616
min,0.0,0.0,31.29,1985.0,33.29
25%,6.65,0.026989,93.8265,1987.0,834.2474
50%,11.0,0.053931,143.0128,1999.0,1794.331
75%,16.0,0.094585,185.6437,2004.0,3101.2964
max,21.35,0.328391,266.8884,2009.0,13086.9648


In [None]:
# Handle categorical variables using one-hot encoding
df_encoded = pd.get_dummies(df1, columns=['Item_Fat_Content','Item_Type','Outlet_Identifier','Outlet_Location_Type','Outlet_Type','Outlet_Size'],drop_first = True)

In [None]:
# Scale numerical features using Min-Max scaling
scaler = MinMaxScaler()
df_encoded[['Item_Weight', 'Item_Visibility', 'Item_MRP', 'Outlet_Establishment_Year']]= scaler.fit_transform(df_encoded[['Item_Weight', 'Item_Visibility', 'Item_MRP', 'Outlet_Establishment_Year']])
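
The cell above fits the scaler on every row, so the minimum and maximum of rows that later land in the test split influence the transform. A hedged sketch of the usual alternative follows: split first, fit the scaler on the training rows only, and reuse those statistics for the test rows. The toy `Item_MRP` and `Item_Weight` values are randomly generated, and the names `X_toy` and `scaler_toy` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix (random values) standing in for the numeric columns
rng = np.random.default_rng(42)
X_toy = pd.DataFrame({'Item_MRP': rng.uniform(30, 270, size=100),
                      'Item_Weight': rng.uniform(4, 22, size=100)})

# Split first, so the test rows stay unseen while the scaler is fitted
X_tr, X_te = train_test_split(X_toy, test_size=0.2, random_state=42)

scaler_toy = MinMaxScaler()
X_tr_scaled = scaler_toy.fit_transform(X_tr)   # learn min/max from training rows only
X_te_scaled = scaler_toy.transform(X_te)       # reuse the training min/max

print(X_tr_scaled.min(axis=0), X_tr_scaled.max(axis=0))
print(X_te_scaled.min(axis=0), X_te_scaled.max(axis=0))
```

With this setup, scaled test values can fall slightly outside [0, 1]; that is expected and harmless.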

In [None]:
#split the data into training and testing sets
X = df_encoded.drop(columns=['Item_Identifier','Item_Outlet_Sales'])
y = df_encoded['Item_Outlet_Sales']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# **1 Linear Regression model**

In [None]:
# Initialize and train the Linear Regression model
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)


In [None]:
# Predict using the Linear Regression model
lr_predictions = lr_model.predict(X_test)

# Evaluate the model
lr_mse = mean_squared_error(y_test, lr_predictions)
lr_rmse = np.sqrt(lr_mse)
lr_r2 = r2_score(y_test, lr_predictions)

print(f"Linear Regression - RMSE: {lr_rmse}, R-squared: {lr_r2}")


Linear Regression - RMSE: 1069.5209145014253, R-squared: 0.5791436408349269
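
The same three evaluation lines are repeated for every model below. As a sketch, they can be wrapped in one helper; `evaluate_predictions` is a hypothetical name and the toy arrays are made up. The manual NumPy expressions spell out what the two metrics compute.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def evaluate_predictions(y_true, y_pred):
    """Return (RMSE, R-squared) for a set of predictions."""
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    r2 = r2_score(y_true, y_pred)
    return rmse, r2

# Toy targets and predictions (made-up numbers)
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 320.0, 380.0])

rmse, r2 = evaluate_predictions(y_true, y_pred)

# The same metrics written out by hand
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))
r2_manual = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"RMSE: {rmse:.3f}, R-squared: {r2:.3f}")
```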


# **2 Decision Tree Regressor**

In [None]:
# Initialize and train the Decision Tree Regressor with default hyperparameters
dt_model = DecisionTreeRegressor()
dt_model.fit(X_train, y_train)

In [None]:
dt_predictions = dt_model.predict(X_test)

# Evaluate the model
dt_mse = mean_squared_error(y_test, dt_predictions)
dt_rmse = np.sqrt(dt_mse)
dt_r2 = r2_score(y_test, dt_predictions)

print(f"Decision Tree - RMSE: {dt_rmse}, R-squared: {dt_r2}")

Decision Tree - RMSE: 1476.656377809992, R-squared: 0.1977416952052058


Decision Tree Regressor using hyperparameter tuning

In [None]:
# Define the Decision Tree regressor
dt_model = DecisionTreeRegressor()

# Define hyperparameters and their possible values for tuning
param_grid = {
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Perform grid search cross-validation to find the best hyperparameters
grid_search = GridSearchCV(dt_model, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

# Get the best hyperparameters
best_params_dt = grid_search.best_params_

# Train a Decision Tree model with the best hyperparameters
best_dt_model = DecisionTreeRegressor(**best_params_dt)
best_dt_model.fit(X_train, y_train)


In [None]:
# Predict using the tuned Decision Tree model
dt_predictions = best_dt_model.predict(X_test)

# Calculate RMSE and R-squared
dt_mse = mean_squared_error(y_test, dt_predictions)
dt_rmse = np.sqrt(dt_mse)
dt_r2 = r2_score(y_test, dt_predictions)

print(f"Tuned Decision Tree - RMSE: {dt_rmse}, R-squared: {dt_r2}")


Tuned Decision Tree - RMSE: 1125.264973237588, R-squared: 0.5341297994603899
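
Two details of `GridSearchCV` are relevant here. With the default `refit=True`, the search already retrains the best configuration on all the data passed to `fit` and exposes it as `best_estimator_`, so the manual refit is optional. And because the scorer is `neg_mean_squared_error`, `best_score_` is a negative MSE, so a cross-validated RMSE estimate is `sqrt(-best_score_)`. A small sketch on synthetic data (generated with `make_regression`, not the sales data; the names `X_demo`, `search`, and `tuned_tree` are hypothetical):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, used only to keep the example self-contained
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    {'max_depth': [3, 5, None], 'min_samples_leaf': [1, 4]},
    cv=5,
    scoring='neg_mean_squared_error',
)
search.fit(X_demo, y_demo)

# best_score_ is the mean negative MSE across folds for the winning setting
cv_rmse = np.sqrt(-search.best_score_)
# best_estimator_ has already been refit on all of X_demo, y_demo
tuned_tree = search.best_estimator_

print(search.best_params_, f"CV RMSE: {cv_rmse:.2f}")
```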


# **3 Random Forest Regressor**

In [None]:
# Initialize and train the Random Forest Regressor
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Predict using the Random Forest Regressor
rf_predictions = rf_model.predict(X_test)

# Calculate RMSE and R-squared
rf_mse = mean_squared_error(y_test, rf_predictions)
rf_rmse = np.sqrt(rf_mse)
rf_r2 = r2_score(y_test, rf_predictions)

print(f"Random Forest - RMSE: {rf_rmse}, R-squared: {rf_r2}")


Random Forest - RMSE: 1089.679237526427, R-squared: 0.5631295372674103


Random Forest Regressor using hyperparameter tuning

In [None]:
# Define the Random Forest regressor
rf_model = RandomForestRegressor()

# Define hyperparameters and their possible values for tuning
param_dist = {
    'n_estimators': [100, 200, 300],
    'max_features': [1.0, 'sqrt'],  # 1.0 (all features) replaces 'auto', which newer scikit-learn versions no longer accept
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Perform randomized search cross-validation to find the best hyperparameters
random_search = RandomizedSearchCV(rf_model, param_distributions=param_dist, n_iter=10, cv=5, scoring='neg_mean_squared_error', random_state=42)
random_search.fit(X_train, y_train)

# Get the best hyperparameters
best_params_rf = random_search.best_params_

# Train a Random Forest model with the best hyperparameters
best_rf_model = RandomForestRegressor(**best_params_rf)
best_rf_model.fit(X_train, y_train)




In [None]:
# Predict using the tuned Random Forest model
rf_predictions = best_rf_model.predict(X_test)

# Calculate RMSE and R-squared
rf_mse = mean_squared_error(y_test, rf_predictions)
rf_rmse = np.sqrt(rf_mse)
rf_r2 = r2_score(y_test, rf_predictions)

print(f"Tuned Random Forest - RMSE: {rf_rmse}, R-squared: {rf_r2}")


Tuned Random Forest - RMSE: 1037.4149881890714, R-squared: 0.6040317474654893


# **4 XGBoost Regressor**


In [None]:
import xgboost as xgb

# Initialize and train the XGBoost Regressor
xgb_model = xgb.XGBRegressor(n_estimators=100, random_state=42)
xgb_model.fit(X_train, y_train)

# Predict using the XGBoost Regressor
xgb_predictions = xgb_model.predict(X_test)

# Calculate RMSE and R-squared
xgb_mse = mean_squared_error(y_test, xgb_predictions)
xgb_rmse = np.sqrt(xgb_mse)
xgb_r2 = r2_score(y_test, xgb_predictions)

print(f"XGBoost - RMSE: {xgb_rmse}, R-squared: {xgb_r2}")


XGBoost - RMSE: 1118.5539848499925, R-squared: 0.5396700528857475


XGBoost Regressor using hyperparameter tuning

In [None]:
# Define the XGBoost regressor
xgb_model = xgb.XGBRegressor()

# Define hyperparameters and their possible values for tuning
param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 4, 5],
    'min_child_weight': [1, 3, 5]
}

# Perform grid search cross-validation to find the best hyperparameters
grid_search = GridSearchCV(xgb_model, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

# Get the best hyperparameters
best_params_xgb = grid_search.best_params_

# Train an XGBoost model with the best hyperparameters
best_xgb_model = xgb.XGBRegressor(**best_params_xgb)
best_xgb_model.fit(X_train, y_train)


In [None]:
# Predict using the tuned XGBoost model
xgb_predictions = best_xgb_model.predict(X_test)

# Calculate RMSE and R-squared
xgb_mse = mean_squared_error(y_test, xgb_predictions)
xgb_rmse = np.sqrt(xgb_mse)
xgb_r2 = r2_score(y_test, xgb_predictions)

print(f"Tuned XGBoost - RMSE: {xgb_rmse}, R-squared: {xgb_r2}")


Tuned XGBoost - RMSE: 1034.5173772421506, R-squared: 0.6062406216459273


# **5 Neural Network model**

In [None]:
import tensorflow as tf
from tensorflow import keras

# Define a basic neural network model
model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
    keras.layers.Dense(32, activation='relu'),
    keras.layers.Dense(1)  # Output layer with 1 neuron for regression
])

# Compile the model
model.compile(optimizer='adam', loss='mean_squared_error')

# Train the model
model.fit(X_train, y_train, epochs=50, batch_size=32, verbose=0)

# Predict using the neural network
nn_predictions = model.predict(X_test).flatten()

# Calculate RMSE and R-squared
nn_mse = mean_squared_error(y_test, nn_predictions)
nn_rmse = np.sqrt(nn_mse)
nn_r2 = r2_score(y_test, nn_predictions)

print(f"Neural Network - RMSE: {nn_rmse}, R-squared: {nn_r2}")


Neural Network - RMSE: 1027.7469983395122, R-squared: 0.6113776580657142


Neural network model using hyperparameter tuning

In [None]:
import tensorflow as tf
from tensorflow import keras

# Define a function to create a Keras model for tuning
def create_model(activation='relu', optimizer='adam'):
    model = keras.Sequential([
        keras.layers.Dense(64, activation=activation, input_shape=(X_train.shape[1],)),
        keras.layers.Dense(32, activation=activation),
        keras.layers.Dense(1)  # Output layer with 1 neuron for regression
    ])
    model.compile(optimizer=optimizer, loss='mean_squared_error')
    return model

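# Create a KerasRegressor wrapper for use with GridSearchCV.
# Note: keras.wrappers.scikit_learn is deprecated and has been removed from recent
# TensorFlow releases; the separate SciKeras package provides a replacement wrapper.
nn_model = keras.wrappers.scikit_learn.KerasRegressor(build_fn=create_model, verbose=0)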

# Define hyperparameters and their possible values for tuning
param_grid = {
    'activation': ['relu', 'tanh'],
    'optimizer': ['adam', 'rmsprop'],
    'epochs': [50, 100],
    'batch_size': [32, 64]
}

# Perform grid search cross-validation to find the best hyperparameters
grid_search = GridSearchCV(nn_model, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

# Get the best hyperparameters
best_params_nn = grid_search.best_params_

# Create and train a Neural Network model with the best hyperparameters
best_nn_model = create_model(activation=best_params_nn['activation'], optimizer=best_params_nn['optimizer'])
best_nn_model.fit(X_train, y_train, epochs=best_params_nn['epochs'], batch_size=best_params_nn['batch_size'], verbose=0)


<keras.callbacks.History at 0x7eabada9d270>

In [None]:
# Predict using the tuned Neural Network model
nn_predictions = best_nn_model.predict(X_test).flatten()

# Calculate RMSE and R-squared
nn_mse = mean_squared_error(y_test, nn_predictions)
nn_rmse = np.sqrt(nn_mse)
nn_r2 = r2_score(y_test, nn_predictions)

print(f"Tuned Neural Network - RMSE: {nn_rmse}, R-squared: {nn_r2}")


Tuned Neural Network - RMSE: 1020.1235829454636, R-squared: 0.6171215653684979


In [None]:
import pandas as pd

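# Create a DataFrame to store model results.
# Caution: dt_*, rf_*, xgb_* and nn_* were overwritten by the tuned models above,
# so every row below except Linear Regression reports the tuned metrics.
# The untuned scores are the ones printed in each model's first evaluation cell.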
model_results = pd.DataFrame(columns=['Model', 'RMSE', 'R-squared'])

# Linear Regression
model_results.loc[0] = ['Linear Regression', lr_rmse, lr_r2]

# Decision Tree
model_results.loc[1] = ['Decision Tree', dt_rmse, dt_r2]

# Random Forest
model_results.loc[2] = ['Random Forest', rf_rmse, rf_r2]

# XGBoost
model_results.loc[3] = ['XGBoost', xgb_rmse, xgb_r2]

# Neural Network
model_results.loc[4] = ['Neural Network', nn_rmse, nn_r2]

# Display the results
print(model_results)


               Model         RMSE  R-squared
0  Linear Regression  1069.520915   0.579144
1      Decision Tree  1125.264973   0.534130
2      Random Forest  1037.414988   0.604032
3            XGBoost  1034.517377   0.606241
4     Neural Network  1020.123583   0.617122


In [None]:
import pandas as pd

# Create a DataFrame to store the model results
model_results = pd.DataFrame(columns=['Model', 'RMSE', 'R-squared'])

# Linear Regression
model_results.loc[0] = ['Linear Regression', lr_rmse, lr_r2]

# Decision Tree (with Hyperparameter Tuning)
model_results.loc[1] = ['Tuned Decision Tree', dt_rmse, dt_r2]

# Random Forest (with Hyperparameter Tuning)
model_results.loc[2] = ['Tuned Random Forest', rf_rmse, rf_r2]

# XGBoost (with Hyperparameter Tuning)
model_results.loc[3] = ['Tuned XGBoost', xgb_rmse, xgb_r2]

# Neural Network (Keras) with Hyperparameter Tuning
model_results.loc[4] = ['Tuned Neural Network', nn_rmse, nn_r2]

# Display the results
print(model_results)


                  Model         RMSE  R-squared
0     Linear Regression  1069.520915   0.579144
1   Tuned Decision Tree  1125.264973   0.534130
2   Tuned Random Forest  1037.414988   0.604032
3         Tuned XGBoost  1034.517377   0.606241
4  Tuned Neural Network  1020.123583   0.617122
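
Because the tuned cells reuse names such as `dt_rmse`, the untuned scores are no longer available when the tables are built. One way to keep both, sketched here with synthetic data and scikit-learn models only, is to append each result to a list as soon as it is computed and build the table once at the end. The data, the choice of models, and the names `candidates`, `results`, and `results_table` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, used only to keep the example self-contained
X_demo, y_demo = make_regression(n_samples=300, n_features=6, noise=15.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

candidates = {
    'Linear Regression': LinearRegression(),
    'Decision Tree': DecisionTreeRegressor(random_state=42),
    'Decision Tree (max_depth=5)': DecisionTreeRegressor(max_depth=5, random_state=42),
}

results = []  # one dict per model, so no earlier score gets overwritten
for name, model in candidates.items():
    preds = model.fit(X_tr, y_tr).predict(X_te)
    results.append({
        'Model': name,
        'RMSE': np.sqrt(mean_squared_error(y_te, preds)),
        'R-squared': r2_score(y_te, preds),
    })

# Build the comparison table once, best (lowest RMSE) model first
results_table = pd.DataFrame(results).sort_values('RMSE').reset_index(drop=True)
print(results_table)
```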


# **Conclusion (models trained and evaluated on Train.csv)**

The best model among those considered is the "Tuned Neural Network", which achieves the lowest RMSE (Root Mean Squared Error), 1020.12, and the highest R-squared, 0.617, on the held-out 20% test split. Here's why:

1. **Lowest RMSE**: At 1020.12, its RMSE is below that of Tuned XGBoost (1034.52), Tuned Random Forest (1037.41), Linear Regression (1069.52), and Tuned Decision Tree (1125.26). RMSE is the square root of the mean squared prediction error, so a lower value means the predictions are closer to the actual sales.

2. **Highest R-squared**: Its R-squared of 0.617 is the highest among all models. R-squared measures the proportion of the variance in the target variable that is predictable from the independent variables, so the model explains about 62% of the variance in sales on the test split.

3. **Hyperparameter Tuning**: Tuning lowered the neural network's RMSE from 1027.75 to 1020.12. Grid search selects the best combination among the values tried, so the gain depends on the grid. The effect was much larger for the Decision Tree, whose RMSE fell from 1476.66 to 1125.26.

4. **Overall Performance**: The combination of the lowest RMSE and the highest R-squared makes the "Tuned Neural Network" the best choice here for sales prediction, although its margin over Tuned XGBoost and Tuned Random Forest is small (under 2% in RMSE).

In summary, the "Tuned Neural Network" is the preferred model because it leads on both RMSE and R-squared, indicating the best predictive accuracy and model fit among the models compared.
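
The ranking above rests on a single 80/20 split, and the top three RMSE values differ by less than 2%. Below is a hedged sketch of how k-fold cross-validation could show the spread of each model's error across folds. It runs on synthetic data with two scikit-learn models, so the printed numbers say nothing about the sales data; the names `X_demo`, `cv`, and `cv_summary` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data, used only to keep the example self-contained
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=15.0, random_state=42)
cv = KFold(n_splits=5, shuffle=True, random_state=42)

cv_summary = {}
for name, model in [('Linear Regression', LinearRegression()),
                    ('Random Forest', RandomForestRegressor(n_estimators=50, random_state=42))]:
    # The scorer returns negative RMSE values, one per fold
    scores = cross_val_score(model, X_demo, y_demo, cv=cv,
                             scoring='neg_root_mean_squared_error')
    fold_rmse = -scores
    cv_summary[name] = (fold_rmse.mean(), fold_rmse.std())
    print(f"{name}: RMSE {fold_rmse.mean():.2f} +/- {fold_rmse.std():.2f}")
```

If two models' fold-to-fold spreads overlap heavily, a difference of a few RMSE points on one split is weak evidence for preferring either.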