# Prediction of sales

### Problem Statement
This dataset contains sales data for 1559 products across 10 stores in different cities, along with certain attributes of each product and store. The aim is to build a predictive model that estimates the sales of each product at a particular store.

|Variable|Description|
|:-------------|:-------------|
|Item_Identifier|Unique product ID|
|Item_Weight|Weight of product|
|Item_Fat_Content|Whether the product is low fat or not|
|Item_Visibility|The % of total display area of all products in a store allocated to the particular product|
|Item_Type|The category to which the product belongs|
|Item_MRP|Maximum Retail Price (list price) of the product|
|Outlet_Identifier|Unique store ID|
|Outlet_Establishment_Year|The year in which store was established|
|Outlet_Size|The size of the store in terms of ground area covered|
|Outlet_Location_Type|The type of city in which the store is located|
|Outlet_Type|Whether the outlet is just a grocery store or some sort of supermarket|
|Item_Outlet_Sales|Sales of the product in the particular store. This is the outcome variable to be predicted.|

Note that the data may have missing values, because some stores might not report all the data due to technical glitches. These values need to be treated before modeling.
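
As a minimal sketch of one way to treat missing values (not necessarily how `data_num.csv` was prepared), the snippet below flags missing entries in an indicator column and then fills them with the column mean. The toy frame and its values are made up for illustration; the `Item_Weight_missing` name mirrors the indicator column that appears in the data preview.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with one missing weight
toy = pd.DataFrame({
    'Item_Identifier': ['A', 'B', 'C', 'D'],
    'Item_Weight': [9.3, np.nan, 17.5, 8.93],
})

# Count missing values per column before treating them
print(toy.isna().sum())

# Keep a 0/1 flag so the model can still see which rows were imputed
toy['Item_Weight_missing'] = toy['Item_Weight'].isna().astype(int)

# Fill the gaps with the mean of the observed weights
toy['Item_Weight'] = toy['Item_Weight'].fillna(toy['Item_Weight'].mean())
print(toy)
```

Mean imputation is only one option; a median or a per-product average may suit skewed columns better.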

---------------------

### Explore the problem in the following stages:

1. Hypothesis Generation – understanding the problem better by brainstorming possible factors that can impact the outcome
2. Data Exploration – looking at categorical and continuous feature summaries and making inferences about the data.
3. Data Cleaning – imputing missing values in the data and checking for outliers
4. Feature Engineering – modifying existing variables and creating new ones for analysis
5. Model Building – making predictive models on the data

In [1]:
import pandas as pd
import numpy as np

In [2]:
data = pd.read_csv('data_num.csv')

In [4]:
data.head()

|Unnamed: 0|Item_Identifier|Item_Weight|Item_Fat_Content|Item_Visibility|Item_Type|Item_MRP|Outlet_Identifier|Outlet_Establishment_Year|Outlet_Size|Outlet_Location_Type|Outlet_Type|Item_Outlet_Sales|Item_Weight_missing|Broad_Item_Type|Outlet_Size_missing|Year_Op|
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
|0|FDA15|9.3|1|0.016047|Dairy|249.8092|OUT049|1999|2|1|Supermarket Type1|3735.138|0|Food|0|22|
|1|DRC01|5.92|2|0.019278|Soft Drinks|48.2692|OUT018|2009|2|3|Supermarket Type2|443.4228|0|Drinks|0|12|
|2|FDN15|17.5|1|0.01676|Meat|141.618|OUT049|1999|2|1|Supermarket Type1|2097.27|0|Food|0|22|
|3|FDX07|19.2|2|0.0|Fruits and Vegetables|182.095|OUT010|1998|0|3|Grocery Store|732.38|0|Food|1|23|
|4|NCD19|8.93|0|0.0|Household|53.8614|OUT013|1987|3|3|Supermarket Type1|994.7052|0|Non-Consumables|0|34|
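
The `Year_Op` values in the preview above are consistent with subtracting `Outlet_Establishment_Year` from 2021 (for example, 2021 - 1999 = 22). That reference year is inferred from the five rows shown, not stated in the source, so treat the following as a sketch of how such a feature could be derived:

```python
import pandas as pd

# Establishment years copied from the five preview rows
preview = pd.DataFrame({'Outlet_Establishment_Year': [1999, 2009, 1999, 1998, 1987]})

# ASSUMPTION: 2021 is inferred from the preview, not given in the source
REFERENCE_YEAR = 2021

# Years in operation = reference year minus establishment year
preview['Year_Op'] = REFERENCE_YEAR - preview['Outlet_Establishment_Year']
print(preview['Year_Op'].tolist())  # [22, 12, 22, 23, 34]
```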


In [7]:
data.shape

(8523, 16)

In [12]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 16 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                8523 non-null   float64
 2   Item_Fat_Content           8523 non-null   int64  
 3   Item_Visibility            8523 non-null   float64
 4   Item_Type                  8523 non-null   object 
 5   Item_MRP                   8523 non-null   float64
 6   Outlet_Identifier          8523 non-null   object 
 7   Outlet_Establishment_Year  8523 non-null   int64  
 8   Outlet_Size                8523 non-null   int64  
 9   Outlet_Location_Type       8523 non-null   int64  
 10  Outlet_Type                8523 non-null   object 
 11  Item_Outlet_Sales          8523 non-null   float64
 12  Item_Weight_missing        8523 non-null   int64  
 13  Broad_Item_Type            8523 non-null   objec

In [5]:
X = data[['Item_Weight', 'Item_MRP', 'Item_Visibility', 'Year_Op']]
y = data['Item_Outlet_Sales']

We covered dataset preparation and feature engineering two weeks ago, and we built Lasso and Ridge regressions on Monday. Today, we will work with ensemble methods.

-------------------------
### Model Building: Ensemble Models

Try out the different ensemble models (Random Forest Regressor, Gradient Boosting, XGBoost).
- **Note:** Spend some time on the documentation for each of these models.
- **Note:** As you spend time on this challenge, it is suggested to review how each of these models works and how they compare to each other.

Calculate the **mean squared error** on the test set. Explore how different parameters of the model affect the results and the performance of the model. (*Stretch: Create a visualization to display this information*)

- Use GridSearchCV to find the optimal parameters of the models.
- Compare against the Lasso and Ridge Regression models from Monday.

**Questions to answer:**
- Which ensemble model performed the best? 
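
One possible way to line up the ensemble models against the linear baselines is sketched below. It uses scikit-learn's `RandomForestRegressor` and `GradientBoostingRegressor` alongside `Ridge` and `Lasso`, fitted on synthetic data from `make_regression` so that the block runs on its own. The hyperparameter values are arbitrary placeholders, and the `*_demo` and `Xd_*` names are illustrative; on the real data, the train/test split of the sales features would take the place of the synthetic one.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the sales data (4 features, like X in this notebook)
X_demo, y_demo = make_regression(n_samples=300, n_features=4, noise=10.0, random_state=20)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=20
)

# Hyperparameters are placeholders, not tuned values
candidates = {
    'Ridge': Ridge(alpha=1.0),
    'Lasso': Lasso(alpha=1.0),
    'RandomForest': RandomForestRegressor(n_estimators=50, random_state=20),
    'GradientBoosting': GradientBoostingRegressor(n_estimators=50, random_state=20),
}

# Fit each model on the same split and record its test-set MSE
test_mse = {}
for name, model in candidates.items():
    model.fit(Xd_train, yd_train)
    test_mse[name] = mean_squared_error(yd_test, model.predict(Xd_test))

# Lower MSE is better
for name, mse in sorted(test_mse.items(), key=lambda item: item[1]):
    print(f'{name}: {mse:.1f}')
```

Because `make_regression` produces a linear target, the linear models are expected to do well here; the ranking on the sales data may differ.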

In [13]:
import xgboost as xgb
from sklearn import metrics

In [6]:
from sklearn.model_selection import train_test_split

In [16]:
# split data into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=20)

In [22]:
# XGBoost regressor with default parameters; GridSearchCV tunes them below
clf = xgb.XGBRegressor()

In [14]:
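# DMatrix is the input type for XGBoost's native API (xgb.train, xgb.cv);
# the scikit-learn wrapper XGBRegressor used below takes X_train directly
data_dmatrix = xgb.DMatrix(data=X, label=y)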

In [18]:
from sklearn.model_selection import GridSearchCV

In [26]:
hyparams = {
    'n_estimators': [15, 20, 25],
    'max_depth': [1, 2, 3, 6],
    'learning_rate': [0.2, 0.3, 0.4, 0.5]
}

In [27]:
model_to_fit = GridSearchCV(estimator=clf, param_grid=hyparams, n_jobs=-1)
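
To get a feel for the cost of this search: `GridSearchCV` tries every combination in `param_grid`, and since no `cv` argument is passed it uses 5-fold cross-validation (the default in scikit-learn 0.22 and later). With no `scoring` argument, it ranks candidates by the estimator's own `score` method, which is R² for regressors rather than MSE. The small standard-library sketch below counts the resulting fits:

```python
from itertools import product

hyparams = {
    'n_estimators': [15, 20, 25],
    'max_depth': [1, 2, 3, 6],
    'learning_rate': [0.2, 0.3, 0.4, 0.5],
}

# Every combination of the listed values is one candidate model
combinations = list(product(*hyparams.values()))
n_candidates = len(combinations)   # 3 * 4 * 4 = 48

# ASSUMPTION: cv=5, the scikit-learn default when cv is not specified
n_folds = 5
n_fits = n_candidates * n_folds    # 240 cross-validation fits

print(n_candidates, n_fits)        # 48 240
```

With the default `refit=True`, one more fit of the best candidate on the full training set follows the search.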

In [28]:
model_to_fit.fit(X_train, y_train)

GridSearchCV(estimator=XGBRegressor(base_score=None, booster=None,
                                    colsample_bylevel=None,
                                    colsample_bynode=None,
                                    colsample_bytree=None, gamma=None,
                                    gpu_id=None, importance_type='gain',
                                    interaction_constraints=None,
                                    learning_rate=None, max_delta_step=None,
                                    max_depth=None, min_child_weight=None,
                                    missing=nan, monotone_constraints=None,
                                    n_estimators=100, n_jobs=None,
                                    num_parallel_tree=None, random_state=None,
                                    reg_alpha=None, reg_lambda=None,
                                    scale_pos_weight=None, subsample=None,
                                    tree_method=None, validate_parameters=None,
      

In [29]:
model_to_fit.best_estimator_

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.3, max_delta_step=0, max_depth=3,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=20, n_jobs=8, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)
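
To explore how the parameters affect performance, the `cv_results_` attribute of a fitted `GridSearchCV` can be loaded into a DataFrame. The sketch below is an illustration on synthetic data, with scikit-learn's `GradientBoostingRegressor` standing in for `XGBRegressor`, a reduced grid, and `scoring='neg_mean_squared_error'` set explicitly so the scores are in MSE terms; none of these choices come from the original run.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data so the block runs on its own
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=20)

# Reduced, illustrative grid
demo_grid = {'max_depth': [1, 3], 'learning_rate': [0.2, 0.5]}

search = GridSearchCV(
    estimator=GradientBoostingRegressor(n_estimators=20, random_state=20),
    param_grid=demo_grid,
    scoring='neg_mean_squared_error',  # scikit-learn maximizes scores, hence the sign
    cv=3,
)
search.fit(X_demo, y_demo)

# One row per parameter combination
results = pd.DataFrame(search.cv_results_)
results['mean_mse'] = -results['mean_test_score']  # flip the sign back to MSE

# Rows: max_depth, columns: learning_rate, cells: mean cross-validated MSE
mse_table = results.pivot(
    index='param_max_depth', columns='param_learning_rate', values='mean_mse'
)
print(mse_table)
```

A table like this can feed a heatmap or line plot for the stretch visualization.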

In [30]:
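# keep the best estimator found by the search (already refit on X_train with refit=True)
best_model = model_to_fit.best_estimator_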

In [31]:
y_pred = best_model.predict(X_test)



In [33]:
MSE = metrics.mean_squared_error(y_test, y_pred)
print(f'MSE: {MSE}')

MSE: 1521323.0883286567


In [34]:
r2 = metrics.r2_score(y_test, y_pred)
r2

0.46307456249570256
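
The two metrics above are linked: R² equals 1 - MSE / Var(y_test), where Var is the population variance of the test targets, and taking the square root of the MSE gives an error in the same units as `Item_Outlet_Sales` (about 1,233 here). The pure-Python sketch below checks that identity on made-up numbers; the helper names `mse` and `r2` are illustrative, not library functions.

```python
import math

def mse(y_true, y_pred):
    """Mean of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """1 - SS_res / SS_tot, the usual coefficient of determination."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up targets and predictions, for illustration only
y_true = [3.0, 5.0, 7.0, 9.0]
y_hat = [2.5, 5.5, 6.0, 9.5]

# Population variance of the targets
mean_true = sum(y_true) / len(y_true)
variance = sum((t - mean_true) ** 2 for t in y_true) / len(y_true)

# R² is the MSE rescaled by the variance of the targets
assert math.isclose(r2(y_true, y_hat), 1 - mse(y_true, y_hat) / variance)

# RMSE of the run above, in sales units
rmse_reported = math.sqrt(1521323.0883286567)
print(round(rmse_reported, 1))
```

Read this way, an R² of about 0.46 means the model's MSE is a little over half the variance of the test-set sales.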