# Predicting BigMart Sales
We have been participating in the BigMart Sales Practice Problem, provided by Analytics Vidhya, which began on May 25th, 2016, and ends on December 31st, 2018. Data scientists at BigMart have “collected 2013 sales data for 1559 products across 10 stores in different cities”, which participants use to build a model that predicts product sales by store. With this model, BigMart will try to understand the product and store properties that lead to increased sales.

Using data to increase profitability is now a common practice in business: the type, number, and even physical placement of products are no longer arbitrary but are determined by data. This study therefore follows the common business approach to increasing profitability: data-driven decision making.


_**Problem Statement**: “The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities. Also, certain attributes of each product and store have been defined. The aim is to build a predictive model and find out the sales of each product at a particular store. Using this model, BigMart will try to understand the properties of products and stores which play a key role in increasing sales.”_


## Quantitative Questions
### What we are trying to learn:
1. What properties of a store are key in determining store profitability?
2. What properties of products are key in determining product profitability?
3. What products are most profitable in each location?
4. What categories of products are most profitable in each location?
5. What types of location are most profitable?


### Specific hypotheses we will be testing:

This is the initial list of hypotheses we planned to test:
1. Item visibility affects the sales of a product.
    Ex: the more exposed the item, the more sales it produces
2. Outlet size and outlet location type affect the profitability of a store.
    Ex: the larger the store, the higher the profits
3. The profitability of a store depends on the product(s) it offers and its location.
4. An item placed on a bigger shelf, or on a shelf at the height of an average person, is more likely to catch customers’ attention and thus should have more sales.
5. An outlet with a longer history is more likely to have higher profitability because it has had a presence in the area for a longer time.
6. Stores located in tier 1 cities or urban areas should have higher sales because people who live in these areas tend to have higher levels of income.

After further inspection of the structure of the data, we had to exclude some of these hypotheses. The final hypotheses included in the analysis are listed below:

1. Hypothesis 1
2. Hypothesis 2
3. Hypothesis 3


In [None]:
# Include required visualizations for this section

## Data Info
As mentioned above, the data for this challenge was collected and provided to participants by BigMart. Data scientists at BigMart “collected 2013 sales data for 1559 products across 10 stores in different cities.” Each store and product has defined attributes, such as IDs, category type, and visibility. The training dataset has 8,523 rows, and the testing dataset has 5,681 rows. The training data contains both input and output variables; our task is to predict the sales for the testing dataset.

** -Any important features on the dataset that are worth mentioning? Problems/bias with something?-**

## Methods
We will start with a baseline model, which requires no predictive modeling. We simply calculate the mean values for a given category and compare them with each other.
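
A minimal sketch of this baseline is shown here. It uses only the five rows visible in the `all_data.head()` output of this notebook as sample data, so the numbers are illustrative rather than results. Grouping by `Outlet_Type` and scoring with RMSE are assumptions made for the example, not choices stated in this report.

```python
import pandas as pd

# Sample rows copied from the head of data/train.csv (illustration only).
sample = pd.DataFrame({
    "Outlet_Type": [
        "Supermarket Type1", "Supermarket Type2", "Supermarket Type1",
        "Grocery Store", "Supermarket Type1",
    ],
    "Item_Outlet_Sales": [3735.138, 443.4228, 2097.27, 732.38, 994.7052],
})

# Baseline: the mean sales of each category serves as its "prediction".
baseline_means = sample.groupby("Outlet_Type")["Item_Outlet_Sales"].mean()

# Attach the group mean to every row and measure the error (RMSE, assumed metric).
sample["Baseline_Prediction"] = sample["Outlet_Type"].map(baseline_means)
errors = sample["Item_Outlet_Sales"] - sample["Baseline_Prediction"]
rmse = (errors ** 2).mean() ** 0.5

print(baseline_means)
print(f"Baseline RMSE: {rmse:.2f}")
```

Any later model can be compared against this error: a model that does not beat the group means adds little.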

Then we will try a linear regression model when both the dependent and independent variables we are interested in are continuous and a linear fit is appropriate.
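
As a hedged illustration, the sketch below fits an ordinary least squares model with `statsmodels.formula.api`, which this notebook already imports. Regressing `Item_Outlet_Sales` on `Item_Visibility` is an assumption chosen to match the first hypothesis. The five sample rows are copied from the `all_data.head()` output and are far too few to support any conclusion.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Sample rows copied from the head of data/train.csv (illustration only).
sample = pd.DataFrame({
    "Item_Visibility": [0.016047, 0.019278, 0.01676, 0.0, 0.0],
    "Item_Outlet_Sales": [3735.138, 443.4228, 2097.27, 732.38, 994.7052],
})

# Ordinary least squares: continuous outcome, continuous predictor.
model = smf.ols("Item_Outlet_Sales ~ Item_Visibility", data=sample).fit()

# Coefficients: intercept and slope for Item_Visibility.
print(model.params)
# R-squared indicates how well a straight line fits the data.
print(f"R-squared: {model.rsquared:.3f}")
```

On the full training data, `model.summary()` would additionally report standard errors and p-values for each coefficient.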


Finally, we will use a decision tree model to predict sales. The data is split into training and testing sets, which allows us to train the model on the training data and then evaluate it on the testing data. We will keep adjusting the predictors to find the combination that predicts sales most accurately.
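
The workflow described here could look roughly like the following sketch. It assumes scikit-learn, which is not imported elsewhere in this notebook, and the predictor list, `test_size`, `max_depth`, and the RMSE metric are all illustrative assumptions. The five sample rows come from the `all_data.head()` output, so the resulting error is not meaningful.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Sample rows copied from the head of data/train.csv (illustration only).
sample = pd.DataFrame({
    "Item_Visibility": [0.016047, 0.019278, 0.01676, 0.0, 0.0],
    "Item_MRP": [249.8092, 48.2692, 141.618, 182.095, 53.8614],
    "Outlet_Establishment_Year": [1999, 2009, 1999, 1998, 1987],
    "Item_Outlet_Sales": [3735.138, 443.4228, 2097.27, 732.38, 994.7052],
})

# Predictors to adjust between runs (hypothetical starting set).
predictors = ["Item_Visibility", "Item_MRP", "Outlet_Establishment_Year"]

# Hold out part of the labeled data to evaluate the fitted tree.
X_train, X_test, y_train, y_test = train_test_split(
    sample[predictors], sample["Item_Outlet_Sales"],
    test_size=0.4, random_state=0,
)

# Fit a shallow regression tree on the training portion only.
tree = DecisionTreeRegressor(max_depth=2, random_state=0)
tree.fit(X_train, y_train)

# Score the predictions on the held-out rows.
predictions = tree.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, predictions))
print(f"Held-out RMSE: {rmse:.2f}")
```

Changing the `predictors` list and re-running gives one held-out error per predictor set, which is a simple way to compare them.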

In [None]:
# Just including decision tree? Or are we including all the models we have submitted and compare w/ visuals like a4?

## Results

## Discussion

-Include future steps-

In [1]:
# Refer to notebook 4

import numpy as np
import pandas as pd
import seaborn as sns # for visualization
import urllib.request # to load data
from scipy import stats # ANOVA
from scipy.stats import ttest_ind # t-tests
import statsmodels.formula.api as smf # linear modeling
import matplotlib.pyplot as plt # plotting
import matplotlib
matplotlib.style.use('ggplot')
%matplotlib inline 

In [6]:
# Prep data
# %run data-prep.ipynb

In [7]:
# print(train_features.shape)
# print(train_outcome.shape)
# print(test_features.shape)
# train_features.columns

In [21]:
# item visibility, profitability, store location, store size, products
all_data = pd.read_csv('data/train.csv')
all_data.head()
# print(np.shape(all_data))


Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


In [15]:
data = all_data.copy()
data = data[['Item_Outlet_Sales', 'Item_Visibility', 'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type', 'Outlet_Establishment_Year']]
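
The `all_data.head()` output above shows an empty `Outlet_Size` for outlet `OUT010` and an `Item_Visibility` of 0.0 in two rows. Before modeling on the selected columns, it may be worth counting such values. The sketch below does this on the five sample rows; whether zero visibility is a genuine measurement or a placeholder is not established here.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the head of data/train.csv (illustration only).
sample = pd.DataFrame({
    "Item_Outlet_Sales": [3735.138, 443.4228, 2097.27, 732.38, 994.7052],
    "Item_Visibility": [0.016047, 0.019278, 0.01676, 0.0, 0.0],
    "Outlet_Size": ["Medium", "Medium", "Medium", np.nan, "High"],
    "Outlet_Location_Type": ["Tier 1", "Tier 3", "Tier 1", "Tier 3", "Tier 3"],
})

# Count missing entries in each column.
missing_per_column = sample.isna().sum()

# Count rows where visibility is exactly zero.
zero_visibility = (sample["Item_Visibility"] == 0).sum()

print(missing_per_column)
print(f"Rows with zero visibility: {zero_visibility}")
```

Running the same two checks on `data` would show how widespread these values are in the full training set.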