# Prediction of sales

### Problem Statement
The dataset contains sales data for 1559 products across 10 stores in different cities, along with certain attributes of each product and store. The aim is to build a model that predicts the sales of each product at a particular store.

|Variable|Description|
|:-------------|:-------------|
|Item_Identifier|Unique product ID|
|Item_Weight|Weight of product|
|Item_Fat_Content|Whether the product is low fat or not|
|Item_Visibility|The % of total display area of all products in a store allocated to the particular product|
|Item_Type|The category to which the product belongs|
|Item_MRP|Maximum Retail Price (list price) of the product|
|Outlet_Identifier|Unique store ID|
|Outlet_Establishment_Year|The year in which store was established|
|Outlet_Size|The size of the store in terms of ground area covered|
|Outlet_Location_Type|The type of city in which the store is located|
|Outlet_Type|Whether the outlet is just a grocery store or some sort of supermarket|
|Item_Outlet_Sales|Sales of the product in the particular store. This is the outcome variable to be predicted.|

Note that the data may have missing values, because some stores might not report all the data due to technical glitches. These values need to be treated before modeling.
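One way to treat missing values is sketched below on a made-up toy frame. The flag columns reuse the `Item_Weight_missing` and `Outlet_Size_missing` names that appear in the prepared data loaded later in this notebook; the median/mode imputation is an assumption for illustration, not necessarily what was used to produce that file.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the raw sales data (values are made up)
raw = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, np.nan],
    "Outlet_Size": ["Medium", None, "Small", "Medium"],
})

# Count missing values per column before touching anything
missing_counts = raw.isna().sum()
print(missing_counts)

# Record which rows were missing, so the model can still see that signal
raw["Item_Weight_missing"] = raw["Item_Weight"].isna().astype(int)
raw["Outlet_Size_missing"] = raw["Outlet_Size"].isna().astype(int)

# Impute: median for the numeric column, most frequent value for the categorical one
raw["Item_Weight"] = raw["Item_Weight"].fillna(raw["Item_Weight"].median())
raw["Outlet_Size"] = raw["Outlet_Size"].fillna(raw["Outlet_Size"].mode()[0])
```

Keeping the indicator columns is a design choice: if values are not missing at random, the fact that a value was missing can itself be predictive.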



### Explore the problem in the following stages:

1. Hypothesis Generation – understanding the problem better by brainstorming possible factors that can impact the outcome
2. Data Exploration – looking at categorical and continuous feature summaries and making inferences about the data
3. Data Cleaning – imputing missing values in the data and checking for outliers
4. Feature Engineering – modifying existing variables and creating new ones for analysis
5. Model Building – making predictive models on the data

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv('data_num.csv')

In [3]:
cols_num = df.dtypes[df.dtypes != object].index.tolist()
df = df[cols_num]

In [4]:
df.head()

|Unnamed: 0|Item_Weight|Item_Fat_Content|Item_Visibility|Item_MRP|Outlet_Establishment_Year|Outlet_Size|Outlet_Location_Type|Item_Outlet_Sales|Item_Weight_missing|Outlet_Size_missing|Year_Op|
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
|0|9.3|1|0.016047|249.8092|1999|2|1|3735.138|0|0|22|
|1|5.92|2|0.019278|48.2692|2009|2|3|443.4228|0|0|12|
|2|17.5|1|0.01676|141.618|1999|2|1|2097.27|0|0|22|
|3|19.2|2|0.0|182.095|1998|0|3|732.38|0|1|23|
|4|8.93|0|0.0|53.8614|1987|3|3|994.7052|0|0|34|


We covered data preparation and feature engineering two weeks ago. Now it's time to build some predictive models.

## Model Building

## Task
Make a baseline model. A baseline sets a benchmark against which to gauge the performance of future models. If a new model scores below the baseline, something has gone wrong, and you should check your data.

To make a baseline model, run a simple regression model without altering the default parameters in sklearn. 

In [5]:
X = df[['Item_Weight', 'Item_Visibility', 'Item_MRP', 'Outlet_Establishment_Year']].values
y = df['Item_Outlet_Sales'].values

In [6]:
print(X.shape)
print(y.shape)

(8523, 4)
(8523,)


In [7]:
# scale the features (note: X_scaled is not used below; the models are fit on the unscaled X)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [8]:
# linear regression
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X,y)

LinearRegression()

In [27]:
y_pred = linreg.predict(X)

In [31]:
# Linear Regression R^2 (baseline), computed on the same data the model was fit on
linreg.score(X, y)

0.34232626984602577
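To put a baseline score in context: for regressors, `score` returns the coefficient of determination R^2, and a model that always predicts the mean of `y` scores 0 on the data it was fit on. A minimal sketch on synthetic data (the `*_demo` names and the data are hypothetical, not the sales dataset) compares a constant predictor with a default `LinearRegression`:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data: only the third feature carries signal
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 3.0 * X_demo[:, 2] + rng.normal(size=200)

# Constant predictor: always returns the mean of the training targets
dummy = DummyRegressor(strategy="mean").fit(X_demo, y_demo)
# Default linear regression, as in the baseline above
linreg_demo = LinearRegression().fit(X_demo, y_demo)

r2_dummy = dummy.score(X_demo, y_demo)
r2_linreg = linreg_demo.score(X_demo, y_demo)
print(f"mean predictor R^2: {r2_dummy:.3f}")
print(f"linear regression R^2: {r2_linreg:.3f}")
```

Read this way, an R^2 of about 0.34 means the four features explain roughly a third of the variance in sales on the data the model was fit on.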

## Task
Split your data into an 80% train set and a 20% test set.

In [10]:
# train_test_split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)
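Without a fixed seed, `train_test_split` shuffles differently on every run, so the scores reported in this notebook will vary slightly between runs. A small sketch on toy arrays (the `*_demo` names and the seed value 42 are arbitrary choices for illustration) shows that passing `random_state` makes the split repeatable:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy arrays: 10 samples with 4 features each
X_demo = np.arange(40).reshape(10, 4)
y_demo = np.arange(10)

# Two splits with the same seed select exactly the same rows
split_a = train_test_split(X_demo, y_demo, train_size=0.8, random_state=42)
split_b = train_test_split(X_demo, y_demo, train_size=0.8, random_state=42)

X_train_demo, X_test_demo, y_train_demo, y_test_demo = split_a
print(X_train_demo.shape, X_test_demo.shape)
print(np.array_equal(split_a[3], split_b[3]))
```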

## Task
Use grid_search to find the best value of the parameter `alpha` for Ridge and Lasso regressions from `sklearn`.

In [11]:
# parameters
params = {'alpha': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]}

In [12]:
# Ridge grid search
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
clf_ridge = GridSearchCV(estimator=Ridge(), param_grid=params)
clf_ridge.fit(X_train, y_train)

GridSearchCV(estimator=Ridge(),
             param_grid={'alpha': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8,
                                   0.9]})

In [13]:
# view the best mean cross-validated score (R^2 for regressors, not accuracy)
clf_ridge.best_score_

0.34413874500926755

In [14]:
# view the best parameters for the model found using grid search
print('Best alpha: ', clf_ridge.best_estimator_.alpha)

Best alpha:  0.1


In [15]:
# Lasso grid search
from sklearn.linear_model import Lasso
clf_lasso = GridSearchCV(estimator=Lasso(), param_grid=params, n_jobs=-1)
clf_lasso.fit(X_train, y_train)

GridSearchCV(estimator=Lasso(), n_jobs=-1,
             param_grid={'alpha': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8,
                                   0.9]})

In [16]:
# view the best mean cross-validated score (R^2 for regressors, not accuracy)
clf_lasso.best_score_

0.3441374897075871

In [17]:
# view the best parameters for the model found using grid search
print('Best alpha: ', clf_lasso.best_estimator_.alpha)

Best alpha:  0.1
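Both searches selected 0.1, the smallest value in the grid, which suggests the best `alpha` may lie outside the range tried. Ridge and Lasso penalties are also sensitive to feature scale. A possible follow-up, sketched on synthetic data, is to search a wider log-spaced grid with scaling inside a `Pipeline`, so that the scaler is fit on the training folds only. The step names `scale` and `ridge`, the grid bounds and the data are all assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (not the sales data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4)) * [1.0, 0.05, 60.0, 8.0]
y_demo = 15.0 * X_demo[:, 2] + rng.normal(scale=500.0, size=300)

# Scaling happens inside the pipeline, so each CV fold is scaled separately
pipe = Pipeline([("scale", StandardScaler()), ("ridge", Ridge())])

# Log-spaced grid from 0.001 to 1000; "ridge__alpha" targets the Ridge step
wide_params = {"ridge__alpha": np.logspace(-3, 3, 13)}

search = GridSearchCV(pipe, param_grid=wide_params, cv=5, scoring="r2")
search.fit(X_demo, y_demo)
print(search.best_params_, search.best_score_)
```

If the selected value again sits at an edge of the grid, the range can be extended further in that direction.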


## Task
Using the model from grid_search, predict the values in the test set and compare against your benchmark.

In [40]:
# Ridge
y_pred_ridge = clf_ridge.predict(X_test)

from sklearn.metrics import r2_score
r2_ridge = r2_score(y_test, y_pred_ridge)
r2_ridge

0.33207596239555404

In [39]:
# Lasso
y_pred_lasso = clf_lasso.predict(X_test)
r2_lasso = r2_score(y_test, y_pred_lasso)
r2_lasso

0.3320878802379684
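The baseline R^2 earlier in this notebook was computed on the full dataset the model was fit on, while the Ridge and Lasso scores above come from a held-out test set, so the numbers are not directly comparable. A like-for-like comparison fits every model on the same train split and scores it on the same test split. The sketch below does this on synthetic data and also reports RMSE; all names, the seed and the data are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the sales features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 4))
y_demo = 2.0 * X_demo[:, 2] + rng.normal(size=400)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=0
)

models = {
    "linear": LinearRegression(),  # baseline with default parameters
    "ridge": Ridge(alpha=0.1),
    "lasso": Lasso(alpha=0.1),
}

# Fit each model on the train split and score it on the same test split
results = {}
for name, model in models.items():
    y_hat = model.fit(X_tr, y_tr).predict(X_te)
    rmse = np.sqrt(mean_squared_error(y_te, y_hat))
    results[name] = {"r2": r2_score(y_te, y_hat), "rmse": rmse}
    print(f"{name}: R^2={results[name]['r2']:.3f}, RMSE={rmse:.3f}")
```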