# Prediction of sales

### Problem Statement
This dataset contains sales data for 1559 products across 10 stores in different cities, along with certain attributes of each product and store. The aim is to build a predictive model that estimates the sales of each product at a particular store.

|Variable|Description|
|:-------------|:-------------|
|Item_Identifier|Unique product ID|
|Item_Weight|Weight of product|
|Item_Fat_Content|Whether the product is low fat or not|
|Item_Visibility|The % of total display area of all products in a store allocated to the particular product|
|Item_Type|The category to which the product belongs|
|Item_MRP|Maximum Retail Price (list price) of the product|
|Outlet_Identifier|Unique store ID|
|Outlet_Establishment_Year|The year in which store was established|
|Outlet_Size|The size of the store in terms of ground area covered|
|Outlet_Location_Type|The type of city in which the store is located|
|Outlet_Type|Whether the outlet is just a grocery store or some sort of supermarket|
|Item_Outlet_Sales|Sales of the product in the particular store. This is the outcome variable to be predicted.|

Please note that the data may have missing values, since some stores might not report all the data due to technical glitches. These missing values need to be treated before modelling.
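One possible way to treat such gaps is sketched below on a tiny toy frame. The values are hypothetical; only the column names come from the table above. The strategy (per-product mean for a numeric column, most frequent value for a categorical one) is an assumption for illustration, not necessarily what the preparation notebook does.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; only the column names come from the dataset
toy = pd.DataFrame({
    "Item_Identifier": ["A", "A", "B", "B", "C"],
    "Item_Weight": [9.3, np.nan, 17.5, 17.5, np.nan],
    "Outlet_Size": ["Medium", None, "Small", "Small", "Small"],
})

# Count missing values per column before treating them
missing_before = toy.isna().sum()

# Numeric column: use the mean weight of the same product where one is known,
# then fall back to the overall mean for products with no known weight
per_item_mean = toy.groupby("Item_Identifier")["Item_Weight"].transform("mean")
toy["Item_Weight"] = toy["Item_Weight"].fillna(per_item_mean)
toy["Item_Weight"] = toy["Item_Weight"].fillna(toy["Item_Weight"].mean())

# Categorical column: use the most frequent category
toy["Outlet_Size"] = toy["Outlet_Size"].fillna(toy["Outlet_Size"].mode()[0])

missing_after = toy.isna().sum()
print(missing_before, missing_after, sep="\n\n")
```

Assigning the result of `fillna` back to the column, rather than calling it with `inplace=True` on a column selection, keeps the behaviour the same across pandas versions.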



### Explore the problem in the following stages:

1. Hypothesis Generation – understanding the problem better by brainstorming possible factors that can impact the outcome
2. Data Exploration – looking at categorical and continuous feature summaries and making inferences about the data
3. Data Cleaning – imputing missing values in the data and checking for outliers
4. Feature Engineering – modifying existing variables and creating new ones for analysis
5. Model Building – making predictive models on the data

In [47]:
%run data_prep_exercise.ipynb


Frequency of Category of Item_Fat_Content
Low Fat    5089
Regular    2889
LF          316
reg         117
low fat     112
Name: Item_Fat_Content, dtype: int64

Frequency of Category of Item_Type
Fruits and Vegetables    1232
Snack Foods              1200
Household                 910
Frozen Foods              856
Dairy                     682
Canned                    649
Baking Goods              648
Health and Hygiene        520
Soft Drinks               445
Meat                      425
Breads                    251
Hard Drinks               214
Others                    169
Starchy Foods             148
Breakfast                 110
Seafood                    64
Name: Item_Type, dtype: int64

Frequency of Category of Outlet_Size
Medium    2793
Small     2388
High       932
Name: Outlet_Size, dtype: int64

Frequency of Category of Outlet_Location_Type
Tier 3    3350
Tier 2    2785
Tier 1    2388
Name: Outlet_Location_Type, dtype: int64

Frequency of Category of Outlet_Type
Supermar

In [48]:
%run feature_engineering_exercise.ipynb



In [49]:
data.dtypes

Item_Identifier               object
Item_Weight                  float64
Item_Fat_Content             float64
Item_Visibility              float64
Item_Type                     object
Item_MRP                     float64
Outlet_Identifier              int32
Outlet_Establishment_Year      int64
Outlet_Size                    int64
Outlet_Location_Type           int32
Outlet_Type                    int64
Item_Outlet_Sales            float64
Item_Type_Combined           float64
Outlet_Years                   int64
Baking Goods                   uint8
Breads                         uint8
Breakfast                      uint8
Canned                         uint8
Dairy                          uint8
Frozen Foods                   uint8
Fruits and Vegetables          uint8
Hard Drinks                    uint8
Health and Hygiene             uint8
Household                      uint8
Meat                           uint8
Others                         uint8
Seafood                        uint8
S

In [94]:
# Drop the remaining string-typed (object) columns so only numeric features are left
cat_feats = data.dtypes[data.dtypes == 'object'].index.tolist()
data.drop(cat_feats, axis=1, inplace=True)

In [95]:
data.isna().sum()

Item_Weight                  0
Item_Fat_Content             0
Item_Visibility              0
Item_MRP                     0
Outlet_Identifier            0
Outlet_Establishment_Year    0
Outlet_Size                  0
Outlet_Location_Type         0
Outlet_Type                  0
Item_Outlet_Sales            0
Item_Type_Combined           0
Outlet_Years                 0
Baking Goods                 0
Breads                       0
Breakfast                    0
Canned                       0
Dairy                        0
Frozen Foods                 0
Fruits and Vegetables        0
Hard Drinks                  0
Health and Hygiene           0
Household                    0
Meat                         0
Others                       0
Seafood                      0
Snack Foods                  0
Soft Drinks                  0
Starchy Foods                0
dtype: int64

In [65]:
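# Fill any remaining gaps with the most frequent value of each column.
# Assigning back avoids relying on inplace=True on a column selection.
data['Item_Fat_Content'] = data['Item_Fat_Content'].fillna(data['Item_Fat_Content'].mode()[0])
data['Item_Type_Combined'] = data['Item_Type_Combined'].fillna(data['Item_Type_Combined'].mode()[0])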

In [66]:
data.isna().sum()

Item_Weight                  0
Item_Fat_Content             0
Item_Visibility              0
Item_MRP                     0
Outlet_Identifier            0
Outlet_Establishment_Year    0
Outlet_Size                  0
Outlet_Location_Type         0
Outlet_Type                  0
Item_Outlet_Sales            0
Item_Type_Combined           0
Outlet_Years                 0
Baking Goods                 0
Breads                       0
Breakfast                    0
Canned                       0
Dairy                        0
Frozen Foods                 0
Fruits and Vegetables        0
Hard Drinks                  0
Health and Hygiene           0
Household                    0
Meat                         0
Others                       0
Seafood                      0
Snack Foods                  0
Soft Drinks                  0
Starchy Foods                0
dtype: int64

We covered data preparation and feature engineering two weeks ago. Now it's time to build some predictive models.

## Model Building

## Task
Make a baseline model. A baseline model sets a benchmark against which the performance of later models can be gauged. If a new model performs below the baseline, something has gone wrong, and you should check your data.

To make a baseline model, run a simple regression model without altering the default parameters in sklearn. 
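As a rough illustration of how a benchmark can be read, the sketch below compares a default `LinearRegression` with an even simpler reference that always predicts the training mean (`DummyRegressor`). The data is synthetic and the variable names (`X_demo`, `y_demo`, `naive`, `baseline`) are assumptions made for this example, not part of the exercise.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): 3 features with a linear signal plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Naive reference: always predict the mean of the training target
naive = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
# Baseline: linear regression with sklearn's default parameters
baseline = LinearRegression().fit(X_tr, y_tr)

# score() returns R^2, here measured on rows the models have not seen
naive_r2 = naive.score(X_te, y_te)
baseline_r2 = baseline.score(X_te, y_te)
print(f"naive R^2: {naive_r2:.3f}, baseline R^2: {baseline_r2:.3f}")
```

An R^2 close to zero, or below it, means a model does no better than predicting the mean, which is a useful signal that the features and target deserve a second look.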

In [51]:
from sklearn.linear_model import LinearRegression

In [52]:
y = data.Item_Outlet_Sales


In [74]:
from sklearn.utils import shuffle

# Features are every column except the target
X = data.drop(columns='Item_Outlet_Sales')
# Shuffle features and target together so the rows stay aligned
X, y = shuffle(X, y, random_state=27)
print(X)
print(y)

      Item_Weight  Item_Fat_Content  Item_Visibility  Item_MRP  \
7377    12.600000               0.0         0.096194  210.8612   
3078    12.600000               1.0         0.031599  172.9764   
6505    12.987880               0.0         0.052555  190.3504   
4686    17.350000               0.0         0.056032  102.3016   
3093    11.300000               0.0         0.000000  245.2118   
...           ...               ...              ...       ...   
7192     8.260000               0.0         0.034544  116.0834   
4848     8.710000               1.0         0.072155  183.3924   
3912    12.850000               0.0         0.030790  254.9040   
3768    13.224769               1.0         0.129669  206.8638   
5139    12.277108               0.0         0.110736   35.2874   

      Outlet_Identifier  Outlet_Establishment_Year  Outlet_Size  \
7377                 35                       2004            0   
3078                 45                       2002            1   
6505  

In [84]:
# Fit a linear regression with sklearn's default parameters
reg = LinearRegression()
reg.fit(X, y)

LinearRegression()

In [85]:
print (reg.coef_)

[ 4.08264822e+00 -3.95428497e+01  6.61770158e+01  1.59100469e-01
 -1.90603085e+00 -8.04844844e+11  5.42651972e+01  4.04793649e+00
 -1.89879976e+01  6.66247610e+01 -8.04844844e+11  2.26054349e+10
  2.26054349e+10  2.26054347e+10  2.26054349e+10  2.26054348e+10
  2.26054347e+10  2.26054349e+10  2.26054350e+10  2.26054349e+10
  2.26054349e+10  2.26054348e+10  2.26054349e+10  2.26054346e+10
  2.26054348e+10  2.26054349e+10  2.26054347e+10]
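Assuming the feature columns are in the order shown by `data.isna().sum()`, the two coefficients near -8.05e+11 sit at the positions of `Outlet_Establishment_Year` and `Outlet_Years`, and the sixteen item-type dummy columns all share a value near 2.26e+10. That pattern is consistent with perfectly collinear features, where ordinary least squares has no unique solution. The sketch below reproduces the situation on a toy frame with hypothetical values. The formula used for `Outlet_Years` (a reference year of 2013) is a guess made for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) mimicking two patterns in the feature matrix:
# a full set of dummy columns, and a column derived linearly from another one
toy = pd.DataFrame({
    "Outlet_Establishment_Year": [1985, 1999, 2004, 2009, 1987, 2002],
    "Item_Type": ["Dairy", "Meat", "Breads", "Dairy", "Meat", "Breads"],
    "Item_MRP": [249.8, 48.3, 141.6, 182.1, 53.9, 107.7],
})
# Assumed derivation of Outlet_Years; the reference year 2013 is a guess
toy["Outlet_Years"] = 2013 - toy["Outlet_Establishment_Year"]
dummies = pd.get_dummies(toy["Item_Type"], dtype=float)
features = pd.concat([toy.drop(columns="Item_Type"), dummies], axis=1)

# Add the intercept column that LinearRegression fits implicitly
design = np.column_stack([np.ones(len(features)), features.to_numpy(dtype=float)])
rank = np.linalg.matrix_rank(design)
print(f"{design.shape[1]} columns, rank {rank}")

# Dropping one dummy and the derived column removes both dependencies
reduced = features.drop(columns=["Outlet_Years", "Dairy"])
reduced_design = np.column_stack([np.ones(len(reduced)), reduced.to_numpy(dtype=float)])
reduced_rank = np.linalg.matrix_rank(reduced_design)
print(f"{reduced_design.shape[1]} columns, rank {reduced_rank}")
```

When the rank is lower than the number of columns, individual coefficients are not identifiable, so very large values that cancel each other out are one possible outcome of the solver.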


In [86]:
reg.score(X, y)  # R^2, measured on the same data the model was fit on

0.002930968171742432

## Task
Split your data into an 80% train set and a 20% test set.

In [80]:
from sklearn import model_selection

X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, train_size=0.80, test_size=0.20, random_state=42
)
print(f'{len(X_train)} training samples and {len(X_test)} test samples')

6818 training samples and 1705 test samples


## Task
Use grid_search to find the best value of the parameter `alpha` for Ridge and Lasso regressions from `sklearn`.
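A minimal sketch of this task with `sklearn.model_selection.GridSearchCV` is shown below. It runs on synthetic data rather than the sales dataset, and the candidate `alpha` values are illustrative assumptions, not recommendations.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption), not the sales dataset
rng = np.random.default_rng(27)
X_demo = rng.normal(size=(300, 5))
y_demo = X_demo @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(scale=1.0, size=300)

# Candidate alpha values are illustrative, not tuned for any particular dataset
param_grid = {"alpha": [0.001, 0.01, 0.1, 1.0, 10.0]}

best = {}
for name, estimator in [("ridge", Ridge()), ("lasso", Lasso(max_iter=10_000))]:
    # 5-fold cross-validation over every alpha in the grid, scored by R^2
    search = GridSearchCV(estimator, param_grid, cv=5, scoring="r2")
    search.fit(X_demo, y_demo)
    best[name] = (search.best_params_["alpha"], search.best_score_)
    print(f"{name}: best alpha={best[name][0]}, CV R^2={best[name][1]:.3f}")
```

After fitting, `search.best_estimator_` holds the model refit on all the data passed to `fit` using the winning `alpha`, so it can be used directly for prediction.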

In [96]:
import numpy as np

# Make the folds
k_folds = 5

X_folds, y_folds = np.array_split(X, k_folds), np.array_split(y, k_folds)

fold_sizes = ', '.join([str(len(f)) for f in X_folds])
print(f'The folds are of type {type(X_folds)} and contain {fold_sizes} data points')

# List that will accumulate test performance on each fold
cv_r2 = []

for i in range(k_folds):
    # Make the train/test set for this fold
    X_test = X_folds[i]
    y_test = y_folds[i]
    X_train = [X_folds[j] for j in range(k_folds) if j != i]
    y_train = [y_folds[j] for j in range(k_folds) if j != i]
    X_train = np.concatenate(X_train)
    y_train = np.concatenate(y_train)

The folds are of type <class 'list'> and contain 1705, 1705, 1705, 1704, 1704 data points


In [97]:
# List that will accumulate test performance on each fold
cv_r2 = []
from sklearn.metrics import r2_score

for i in range(k_folds):
    # Make the train/test set for this fold
    X_test = X_folds[i]
    y_test = y_folds[i]
    X_train = [X_folds[j] for j in range(k_folds) if j != i]
    y_train = [y_folds[j] for j in range(k_folds) if j != i]
    X_train = np.concatenate(X_train)
    y_train = np.concatenate(y_train)

    # Train the model
    reg.fit(X_train, y_train)

    # Evaluate the model on the test set
    y_pred = reg.predict(X_test)
    r2 = r2_score(y_test, y_pred)

    cv_r2.append(r2)

print(cv_r2)

[-0.005001250469073737, -0.007326294565279889, -0.004798950023951809, -0.0020255984930457327, 0.00012593928602278037]
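The manual loop can also be written with sklearn's `KFold` splitter. The sketch below, on synthetic data with assumed names (`X_demo`, `y_demo`), checks that an unshuffled `KFold` produces the same kind of contiguous folds as `np.array_split` and that `cross_val_score` returns the same per-fold R^2 values as the hand-written loop.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption); 103 rows so the folds are uneven
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(103, 4))
y_demo = X_demo @ np.array([1.0, 2.0, 0.0, -1.0]) + rng.normal(size=103)

model = LinearRegression()
kf = KFold(n_splits=5)  # no shuffling: folds are contiguous blocks of rows

# Hand-written cross-validation loop driven by the splitter's indices
manual_r2 = []
for train_idx, test_idx in kf.split(X_demo):
    model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_r2.append(r2_score(y_demo[test_idx], model.predict(X_demo[test_idx])))

# The same evaluation in one call
auto_r2 = cross_val_score(LinearRegression(), X_demo, y_demo, cv=kf, scoring="r2")

fold_sizes = [len(test_idx) for _, test_idx in kf.split(X_demo)]
print(fold_sizes)
print(np.allclose(manual_r2, auto_r2))
```

Passing `shuffle=True` together with a `random_state` to `KFold` is an option when the row order might carry structure.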


In [98]:
from sklearn.model_selection import cross_val_score

cv_r2 = cross_val_score(reg, X, y, cv=k_folds, scoring='r2')
print(f'Cross-validated R^2\nMean:\t{cv_r2.mean()}\nStd.:\t{cv_r2.std()}')

Cross-validated R^2
Mean:	-0.0038052308530656777
Std.:	0.002586001561377152


In [99]:
from sklearn.linear_model import ElasticNet
#model with 2 hyperparameters
#combination of lasso and ridge regression penalties

from sklearn.preprocessing import StandardScaler

# Scale features, since feature magnitudes affect how strongly the regularization penalty acts on each weight
X_scaled = StandardScaler().fit_transform(X)

# Hyperparameter settings we want to try
alphas = [0.001, 0.01, 0.1, 1] #lambda (controls effect of penalty)
l1_ratios = [0, 0.25, 0.5, 0.75, 1] #lasso vs ridge penalty (0=ridge, 1=lasso)

# Keep track of the best hyperparameters found so far
best_r2 = -np.inf
best_alpha = None
best_l1_ratio = None

In [91]:
for alpha in alphas:
    for l1_ratio in l1_ratios:
        model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, random_state=27)
        cv_r2 = cross_val_score(model, X_scaled, y, cv=k_folds, scoring='r2')
        if cv_r2.mean() > best_r2:
            best_r2 = cv_r2.mean()
            best_alpha = alpha
            best_l1_ratio = l1_ratio

print(f'The best hyperparameter settings achieve a cross-validated R^2 of: {best_r2}\nAlpha:\t{best_alpha}\nL1 ratio:\t{best_l1_ratio}')



The best hyperparameter settings achieve a cross-validated R^2 of: -0.0009523110631030285
Alpha:	1
L1 ratio:	0
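One refinement worth considering is to put the scaler inside a `Pipeline`, so that it is refit on the training part of every fold instead of seeing the whole dataset before cross-validation. The sketch below shows this with `GridSearchCV` on synthetic data. The step names (`scale`, `enet`) and the data are assumptions for illustration. Pure-ridge settings (`l1_ratio=0`) are left out here because coordinate descent can be slow to converge on them, and sklearn's `Ridge` is the usual choice for a pure L2 penalty.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption) with features on very different scales
rng = np.random.default_rng(27)
X_demo = rng.normal(loc=50.0, scale=[1.0, 10.0, 100.0, 5.0], size=(300, 4))
y_demo = X_demo @ np.array([2.0, 0.3, 0.01, 0.0]) + rng.normal(scale=1.0, size=300)

# The scaler is a pipeline step, so each CV fold fits it on its own training rows
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("enet", ElasticNet(max_iter=10_000, random_state=27)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {
    "enet__alpha": [0.001, 0.01, 0.1, 1],
    "enet__l1_ratio": [0.25, 0.5, 0.75, 1],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="r2")
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```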




## Task
Using the model from grid_search, predict the values in the test set and compare against your benchmark.
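A sketch of how this comparison could look is given below. It uses synthetic data and a `Ridge` grid as stand-ins, so the names (`X_demo`, `y_demo`, `results`) and the numbers it prints are assumptions that say nothing about the sales dataset. The key point is that both the baseline and the tuned model are fit on the training split only and scored on the same held-out test split.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (an assumption), not the sales dataset
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(400, 6))
y_demo = X_demo @ rng.normal(size=6) + rng.normal(scale=1.0, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=0.8, random_state=42)

# Benchmark: default linear regression fit on the training split
baseline = LinearRegression().fit(X_tr, y_tr)

# Tuned model: the grid search only ever sees the training split
search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5, scoring="r2")
search.fit(X_tr, y_tr)  # with the default refit=True, the best model is refit on X_tr

# Score both models on the same held-out test split
results = {}
for name, model in [("baseline", baseline), ("tuned", search.best_estimator_)]:
    pred = model.predict(X_te)
    results[name] = {
        "r2": r2_score(y_te, pred),
        "rmse": float(np.sqrt(mean_squared_error(y_te, pred))),
    }
    print(f"{name}: R^2={results[name]['r2']:.3f}, RMSE={results[name]['rmse']:.3f}")
```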