
# Project: Sales Prediction

This project predicts sales of food items sold at various stores. The goal is to help the retailer understand which product and outlet properties play a crucial role in increasing sales.

## Data Loading and Cleanup

* Load the CSV file to a DataFrame
* Explore and do necessary data preparation and cleanup

### Mounting and Loading

* Mount the drive
* Import libraries
* Load CSV as DataFrame

In [1]:
# install required packages
%pip install datawig



In [2]:
# mount drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
# import libraries
import pandas as pd
import numpy as np
import datawig

In [4]:
# load sales_predictions.csv file to df
filename = '/content/drive/MyDrive/CodingDojo_DS/Project/sales_predictions.csv'
df = pd.read_csv(filename)
df.head()

| | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FDA15 | 9.3 | Low Fat | 0.016047 | Dairy | 249.8092 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 3735.138 |
| 1 | DRC01 | 5.92 | Regular | 0.019278 | Soft Drinks | 48.2692 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 | 443.4228 |
| 2 | FDN15 | 17.5 | Low Fat | 0.01676 | Meat | 141.618 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 2097.27 |
| 3 | FDX07 | 19.2 | Regular | 0.0 | Fruits and Vegetables | 182.095 | OUT010 | 1998 | NaN | Tier 3 | Grocery Store | 732.38 |
| 4 | NCD19 | 8.93 | Low Fat | 0.0 | Household | 53.8614 | OUT013 | 1987 | High | Tier 3 | Supermarket Type1 | 994.7052 |


### DataFrame Shape

Get the number of rows and columns in the DataFrame.

In [5]:
# How many rows and columns? 
df.shape

(8523, 12)

### DataFrame Info

* Get the columns and their dtypes
* Find how many columns have missing values

In [6]:
# dtypes of each variable
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
Item_Identifier              8523 non-null object
Item_Weight                  7060 non-null float64
Item_Fat_Content             8523 non-null object
Item_Visibility              8523 non-null float64
Item_Type                    8523 non-null object
Item_MRP                     8523 non-null float64
Outlet_Identifier            8523 non-null object
Outlet_Establishment_Year    8523 non-null int64
Outlet_Size                  6113 non-null object
Outlet_Location_Type         8523 non-null object
Outlet_Type                  8523 non-null object
Item_Outlet_Sales            8523 non-null float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB


### Explore and Remove Duplicates

* Check for duplicate rows; if any exist, drop them and keep the first occurrence

In [7]:
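# list every column so duplicates are matched on the full row
cols = df.columns.to_list()
# uncomment to drop duplicates, keeping the first occurrence
# df = df.drop_duplicates(subset=cols, keep='first')
# count the rows that duplicate an earlier row
df.duplicated(subset=cols, keep='first').sum()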

0
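
The check above can be tried on a small frame that does contain a repeat. This is a minimal sketch on made-up rows (the `toy` frame is hypothetical and not part of the sales data); it only illustrates how `duplicated` and `drop_duplicates` behave with `keep='first'`.

```python
import pandas as pd

# Hypothetical toy data: the third row repeats the first one.
toy = pd.DataFrame({
    "Item_Identifier": ["FDA15", "DRC01", "FDA15"],
    "Item_MRP": [249.8, 48.3, 249.8],
})

# Count rows that repeat an earlier row across all columns.
n_duplicates = int(toy.duplicated(keep="first").sum())

# Drop the repeats, keeping the first occurrence of each row.
deduped = toy.drop_duplicates(keep="first").reset_index(drop=True)

print(n_duplicates)   # 1
print(deduped.shape)  # (2, 2)
```

Because the count for the sales data is 0, the drop step would leave `df` unchanged there.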

### Identify And Address Missing Values

As the counts below show, two columns have missing values: **Item_Weight** and **Outlet_Size**.

Note that **Outlet_Size** is a categorical variable.

In [8]:
# count the missing values in each column
df.isnull().sum()

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64
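
Raw counts are easier to judge as a share of all rows: 1463 and 2410 out of 8523 are roughly 17% and 28%. The sketch below shows one way to put both figures side by side. It runs on a hypothetical `toy` frame, and the `n_missing` and `pct_missing` column names are illustrative choices.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in two of the three columns.
toy = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, np.nan],
    "Outlet_Size": ["Medium", "Medium", None, "High"],
    "Item_MRP": [249.8, 48.3, 141.6, 182.1],
})

# Missing count and missing percentage per column.
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": (toy.isnull().mean() * 100).round(1),
})

# Keep only the columns that actually have gaps.
missing = missing[missing["n_missing"] > 0]
print(missing)
```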

#### Missing Numerical Variable (MCAR)
**Item_Weight** is treated as missing completely at random (MCAR) because no relationship with the other variables was found. We therefore use single-value imputation and replace each missing value with the column mean.

In [9]:
# replace missing Item_Weight values with the column mean
iw_mean = df['Item_Weight'].mean()
df['Item_Weight'] = df['Item_Weight'].fillna(iw_mean)
df.isnull().sum()


Item_Identifier                 0
Item_Weight                     0
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64
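
One way to probe the MCAR assumption, before the values are filled, is to compare the missing rate of **Item_Weight** across the groups of another column. The sketch below is an illustration on hypothetical toy data; grouping by `Outlet_Identifier` is an assumption made for the example, and any categorical column could be used instead.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: weights are missing only for outlet "B".
toy = pd.DataFrame({
    "Outlet_Identifier": ["A", "A", "B", "B", "C", "C"],
    "Item_Weight": [9.3, 5.9, np.nan, np.nan, 17.5, 8.9],
})

# Flag missing weights, then average the flag within each outlet
# to get the share of missing values per group.
missing_rate = (
    toy.assign(is_missing=toy["Item_Weight"].isnull().astype(int))
       .groupby("Outlet_Identifier")["is_missing"]
       .mean()
)
print(missing_rate)
```

Rates that differ sharply between groups, as in this toy frame, would suggest that the values are not missing completely at random and that a group-aware fill may be a better fit than a single mean.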

#### Imputation Using Datawig

Use a datawig `SimpleImputer` model to predict the missing **Outlet_Size** values.

In [10]:
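# split the data into random train and test sets
df_train, df_test = datawig.utils.random_split(df)

# initialize a SimpleImputer model that predicts Outlet_Size
# note: cols lists every column, including Outlet_Size itself
imputer = datawig.SimpleImputer(
    input_columns=cols,
    output_column='Outlet_Size',
    output_path='imputer_model'
)

# fit the imputer model on the train split
imputer.fit(train_df=df_train, num_epochs=50)

# predict Outlet_Size for the test split only; the result holds those rows
# plus the Outlet_Size_imputed and Outlet_Size_imputed_proba columns
imputed = imputer.predict(df_test)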

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[k1] = value[k2]
2022-04-10 19:17:59,318 [INFO]  NumExpr defaulting to 2 threads.
2022-04-10 19:18:01,629 [INFO]  
2022-04-10 19:18:04,793 [INFO]  Epoch[0] Batch [0-138]	Speed: 706.06 samples/sec	cross-entropy=0.583714	Outlet_Size-accuracy=0.787770
2022-04-10 19:18:09,460 [INFO]  Epoch[0] Train-cross-entropy=0.419880
2022-04-10 19:18:09,465 [INFO]  Epoch[0] Train-Outlet_Size-accuracy=0.880888
2022-04-10 19:18:09,472 [INFO]  Epoch[0] Time cost=7.831
2022-04-10 19:18:09,490 [INFO]  Saved checkpoint to "imputer_model/model-0000.params"
2022-04-10 19:18:10,751 [INFO]  Epoch[0] Validation-cross-entropy=0.189466
2022-04-10 19:18:10,761 [INFO]  Epoch[0] Validation-Outlet_Size-accuracy=0.985887
2022-04-10 19:18:15,862 [INFO]  Epoc

In [11]:
# inspect the rows whose original Outlet_Size was missing
os_missing = imputed['Outlet_Size'].isna()
imputed.loc[os_missing, :]

| | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales | Outlet_Size_imputed | Outlet_Size_imputed_proba |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4167 | FDT40 | 5.985 | Low Fat | 0.095990 | Frozen Foods | 127.3678 | OUT045 | 2002 | NaN | Tier 2 | Supermarket Type1 | 508.6712 | Small | 0.963192 |
| 619 | FDO19 | 17.700 | Regular | 0.016630 | Fruits and Vegetables | 48.1034 | OUT045 | 2002 | NaN | Tier 2 | Supermarket Type1 | 534.6374 | Small | 0.938289 |
| 7759 | NCK30 | 14.850 | Low Fat | 0.102066 | Household | 254.2698 | OUT010 | 1998 | NaN | Tier 3 | Grocery Store | 1775.6886 | Small | 0.998052 |
| 4376 | FDZ02 | 6.905 | Regular | 0.063851 | Dairy | 97.2726 | OUT010 | 1998 | NaN | Tier 3 | Grocery Store | 195.7452 | Small | 0.999012 |
| 3044 | FDR26 | 20.700 | Low Fat | 0.071700 | Dairy | 177.6028 | OUT010 | 1998 | NaN | Tier 3 | Grocery Store | 531.3084 | Small | 0.999617 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1406 | DRG15 | 6.130 | Low Fat | 0.076892 | Dairy | 61.5536 | OUT045 | 2002 | NaN | Tier 2 | Supermarket Type1 | 796.2968 | Small | 0.978833 |
| 6899 | FDX52 | 11.500 | Regular | 0.042088 | Frozen Foods | 192.6820 | OUT045 | 2002 | NaN | Tier 2 | Supermarket Type1 | 3861.6400 | Small | 0.972943 |
| 6400 | FDT43 | 16.350 | Low Fat | 0.034393 | Fruits and Vegetables | 50.8324 | OUT010 | 1998 | NaN | Tier 3 | Grocery Store | 155.7972 | Small | 0.998384 |
| 1520 | FDF12 | 8.235 | Low Fat | 0.082595 | Baking Goods | 149.1076 | OUT045 | 2002 | NaN | Tier 2 | Supermarket Type1 | 1182.4608 | Small | 0.967243 |


In [12]:
# use the imputed DataFrame as the clean DataFrame: drop the original Outlet_Size
# and the probability column, then rename Outlet_Size_imputed to Outlet_Size
cleaned_df = imputed.drop(columns = ['Outlet_Size','Outlet_Size_imputed_proba'])
cleaned_df = cleaned_df.rename(columns={'Outlet_Size_imputed':'Outlet_Size'})
cleaned_df.isna().sum()

Item_Identifier              0
Item_Weight                  0
Item_Fat_Content             0
Item_Visibility              0
Item_Type                    0
Item_MRP                     0
Outlet_Identifier            0
Outlet_Establishment_Year    0
Outlet_Location_Type         0
Outlet_Type                  0
Item_Outlet_Sales            0
Outlet_Size                  0
dtype: int64
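
For comparison, a much simpler baseline than the datawig model is to fill only the missing **Outlet_Size** entries with the most frequent size among similar outlets, leaving observed values untouched. The sketch below is not the approach used in this notebook. It runs on hypothetical toy data, the `group_mode` helper and the `Outlet_Size_filled` column are illustrative names, and grouping by `Outlet_Type` is an assumption.

```python
import pandas as pd

# Hypothetical toy data with one gap per outlet type.
toy = pd.DataFrame({
    "Outlet_Type": ["Grocery Store", "Grocery Store", "Grocery Store",
                    "Supermarket Type1", "Supermarket Type1", "Supermarket Type1"],
    "Outlet_Size": ["Small", "Small", None, "Medium", None, "Medium"],
})


def group_mode(values: pd.Series):
    """Most frequent non-missing value in a group, or None if there is none."""
    modes = values.mode()  # mode() skips missing values by default
    return modes.iloc[0] if not modes.empty else None


# Most common size per outlet type, mapped back onto every row.
mode_by_type = toy.groupby("Outlet_Type")["Outlet_Size"].agg(group_mode)
fallback = toy["Outlet_Type"].map(mode_by_type)

# Fill only the gaps; observed sizes are kept as they are.
toy["Outlet_Size_filled"] = toy["Outlet_Size"].fillna(fallback)
print(toy)
```

A baseline like this gives a reference point for judging whether a learned imputer adds value on the full dataset.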

### Fix Inconsistent Categories of Data

* Check for inconsistencies
* Fix inconsistencies

In [13]:
# check inconsistencies in the categorical values
# np.object was removed from recent NumPy releases; the 'object' string selects the same columns
cat_df = cleaned_df.select_dtypes(include=['object'])
for col in list(cat_df):
  print(col + ":")
  print(" > Count of unique values:" + str(len(cat_df[col].unique())))
  print(cat_df[col].unique())

Item_Identifier:
 > Count of unique values:1069
['FDT40' 'FDT09' 'FDQ25' ... 'FDT43' 'FDF12' 'FDQ10']
Item_Fat_Content:
 > Count of unique values:5
['Low Fat' 'Regular' 'LF' 'low fat' 'reg']
Item_Type:
 > Count of unique values:16
['Frozen Foods' 'Snack Foods' 'Canned' 'Fruits and Vegetables' 'Meat'
 'Soft Drinks' 'Household' 'Hard Drinks' 'Others' 'Dairy' 'Baking Goods'
 'Breads' 'Health and Hygiene' 'Starchy Foods' 'Breakfast' 'Seafood']
Outlet_Identifier:
 > Count of unique values:10
['OUT045' 'OUT049' 'OUT035' 'OUT046' 'OUT018' 'OUT019' 'OUT010' 'OUT013'
 'OUT017' 'OUT027']
Outlet_Location_Type:
 > Count of unique values:3
['Tier 2' 'Tier 1' 'Tier 3']
Outlet_Type:
 > Count of unique values:4
['Supermarket Type1' 'Supermarket Type2' 'Grocery Store'
 'Supermarket Type3']
Outlet_Size:
 > Count of unique values:3
['Small' 'Medium' 'High']


In [14]:
# fix inconsistencies found in Item_Fat_Content column
cleaned_df['Item_Fat_Content'] = cleaned_df['Item_Fat_Content'].replace(['low fat', 'LF', 'reg'],['Low Fat','Low Fat','Regular'])
cleaned_df['Item_Fat_Content'].unique()

array(['Low Fat', 'Regular'], dtype=object)
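
The same cleanup can be written with a dictionary that maps each variant to its canonical label, which keeps every pair visible in one place. This is a minimal sketch on a hypothetical Series holding the five labels found above; `fat_map` is an illustrative name.

```python
import pandas as pd

# The five label variants reported for Item_Fat_Content.
fat = pd.Series(["Low Fat", "Regular", "LF", "low fat", "reg"])

# Map each inconsistent variant to its canonical label.
fat_map = {"LF": "Low Fat", "low fat": "Low Fat", "reg": "Regular"}
fat_clean = fat.replace(fat_map)

print(sorted(fat_clean.unique()))  # ['Low Fat', 'Regular']
```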

### Obtain the Summary Statistics

* Summary statistics of each numerical column

In [15]:
# summary statistics of each numerical column (computed on df, the full dataset)
df.describe()

| | Item_Weight | Item_Visibility | Item_MRP | Outlet_Establishment_Year | Item_Outlet_Sales |
|---|---|---|---|---|---|
| count | 8523.0 | 8523.0 | 8523.0 | 8523.0 | 8523.0 |
| mean | 12.857645 | 0.066132 | 140.992782 | 1997.831867 | 2181.288914 |
| std | 4.226124 | 0.051598 | 62.275067 | 8.37176 | 1706.499616 |
| min | 4.555 | 0.0 | 31.29 | 1985.0 | 33.29 |
| 25% | 9.31 | 0.026989 | 93.8265 | 1987.0 | 834.2474 |
| 50% | 12.857645 | 0.053931 | 143.0128 | 1999.0 | 1794.331 |
| 75% | 16.0 | 0.094585 | 185.6437 | 2004.0 | 3101.2964 |
| max | 21.35 | 0.328391 | 266.8884 | 2009.0 | 13086.9648 |
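
In the statistics above, the median of **Item_Weight** (12.857645) is identical to its mean, which is consistent with many missing values having been filled with the mean. The sketch below uses made-up numbers to show the effect of mean imputation on summary statistics: the mean is preserved, the median can land on the filled value, and the spread shrinks.

```python
import numpy as np
import pandas as pd

# Hypothetical weights with three gaps.
weights = pd.Series([4.0, 8.0, np.nan, np.nan, np.nan, 12.0, 16.0])

# Fill the gaps with the mean of the observed values (10.0 here).
filled = weights.fillna(weights.mean())

print(filled.mean())    # 10.0
print(filled.median())  # 10.0
# The standard deviation drops because the filled values add no spread.
print(weights.std() > filled.std())  # True
```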
