# Project 3 Sample Code



Since we don't expect you to learn an entirely new approach to predictive modeling for this project, we encourage you to work with the models you are already familiar with from Project 1. 

For this project, the easiest approach is to condense the time series data into new features (e.g., engineer a feature for last month's sales, rolling averages, etc.), which then allows you to treat each row as its own independent data point. 

You can then use this month's sales as the label, drop it from your feature dataframe, and fit a regression model. 

This is certainly not the only approach you can take, and we highly encourage you to experiment with alternatives. But if you're stuck, this will give you a framework for getting started.
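To make the idea concrete before touching the real data, here is a minimal sketch on made-up numbers. The dataframe `toy`, its `Sales` column, and the choice of `LinearRegression` are all assumptions for illustration; any regressor from Project 1 would slot in the same way.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up monthly sales for a single hypothetical brand
toy = pd.DataFrame({
    "Months": pd.date_range("2020-01-01", periods=8, freq="MS"),
    "Sales": [100.0, 110.0, 120.0, 130.0, 140.0, 150.0, 160.0, 170.0],
})

# Condense the history into features so each row carries its own past
toy["Previous Month"] = toy["Sales"].shift(1)
toy["Rolling Average"] = toy["Sales"].shift(1).rolling(3).mean()

# Rows without a full three-month history contain NaNs; drop them for this sketch
toy = toy.dropna()

# This month's sales is the label; the engineered columns are the features
X = toy[["Previous Month", "Rolling Average"]]
y = toy["Sales"]

model = LinearRegression().fit(X, y)
prediction = model.predict(X.tail(1))
print(prediction)
```

Note that the features only ever look backwards (`shift(1)`), so the label for a given month never leaks into its own row.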

In [None]:
# import libraries 
import numpy as np 
import pandas as pd  

Let's first import our data.

In [None]:
# import the data 
dataunits = pd.read_csv('data/BrandTotalUnits.csv')
datasales = pd.read_csv('data/BrandTotalSales.csv')
datasales.head(20)

In [None]:
datasales.info()

The first issue is that the data in its current form isn't really useful to us: the months and the numeric columns are stored as strings, so let's convert them.

In [None]:
# First convert our months to datetime
dataunits['Months'] = pd.to_datetime(dataunits['Months'])
# Total Units is stored as a long string with thousands separators,
# so remove the commas and keep the first 8 characters before converting to a number
dataunits['Total Units'] = dataunits['Total Units'].str.replace(',', '', regex=False)
dataunits['Total Units'] = dataunits['Total Units'].str[:8]
dataunits['Total Units'] = pd.to_numeric(dataunits['Total Units'])


dataunits.info()

In [None]:
# First convert our months to datetime
datasales['Months'] = pd.to_datetime(datasales['Months'])
# Total Sales ($) is stored as a long string with thousands separators,
# so remove the commas and keep the first 8 characters before converting to a number
datasales['Total Sales ($)'] = datasales['Total Sales ($)'].str.replace(',', '', regex=False)
datasales['Total Sales ($)'] = datasales['Total Sales ($)'].str[:8]
datasales['Total Sales ($)'] = pd.to_numeric(datasales['Total Sales ($)'])


datasales.info()
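Since both cells above repeat the same cleaning steps, you might prefer to wrap them in a small helper. The sketch below is one possible way to do that; the function name `clean_numeric_column`, its `max_chars` parameter (which mirrors the 8-character trim used above), and the toy values are all hypothetical.

```python
import pandas as pd

def clean_numeric_column(df, column, max_chars=8):
    """Hypothetical helper: strip commas, trim the string, convert to a number."""
    cleaned = df[column].astype(str).str.replace(",", "", regex=False)
    cleaned = cleaned.str[:max_chars]
    # errors="coerce" turns anything unparseable into NaN instead of raising
    return pd.to_numeric(cleaned, errors="coerce")

# Made-up rows shaped like the raw files (the real date format may differ)
toy = pd.DataFrame({
    "Months": ["01/2020", "02/2020"],
    "Total Units": ["1,234.567891234", "98,765.432109876"],
})
toy["Months"] = pd.to_datetime(toy["Months"], format="%m/%Y")
toy["Total Units"] = clean_numeric_column(toy, "Total Units")
print(toy.dtypes)
```

Be aware that trimming to a fixed number of characters would silently cut digits from any value with more than `max_chars` characters before the decimal point, so it is worth checking the largest values in your data first.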

## TimeSeries Feature Engineering 

There are a number of ways to approach this, but given the complexity of multiple brands with overlapping time intervals, what works most easily for me is breaking the dataset up by brand, engineering the features you want for each brand, and then reassembling the new dataframe. 

In [None]:
brands = dataunits["Brands"].unique()
brands

In [None]:
for brand in brands:
    ...
#once you've successfully completed your feature engineering for 
#a single brand you can try wrapping it in a for loop to engineer 
#all brand features

For now I'll attempt to construct some features on a single brand.

In [None]:
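# .copy() gives us an independent dataframe, so adding columns later
# does not trigger pandas' SettingWithCopyWarning
units = dataunits[dataunits.Brands == '101 Cannabis Co.'].copy()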


In [None]:
units

### Feature Engineering

We'll now create two features based on the brand's history. I'm going to take last month's units sold, as well as a rolling average of units sold over the previous three months.

In [None]:
# insert a new column with the previous month's units
units.loc[:, 'Previous Month'] = units.loc[:, 'Total Units'].shift(+1)
# insert another column with the average units over the previous three months
# (shift first so the current month, which is the label, is excluded)
units.loc[:, 'Rolling Average'] = units.loc[:, 'Total Units'].shift(+1).rolling(3).mean()


units
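Once the single-brand version works, one alternative to an explicit for loop is to let `groupby` compute the same two features for every brand at once. This is only a sketch on made-up data (the brands `A` and `B` and their numbers are invented), and it assumes each brand has one row per month with no gaps.

```python
import pandas as pd

# Made-up units for two hypothetical brands
toy_units = pd.DataFrame({
    "Brands": ["A"] * 5 + ["B"] * 5,
    "Months": list(pd.date_range("2020-01-01", periods=5, freq="MS")) * 2,
    "Total Units": [10.0, 20.0, 30.0, 40.0, 50.0, 5.0, 6.0, 7.0, 8.0, 9.0],
})

# Sort so that shifting moves backwards in time within each brand
toy_units = toy_units.sort_values(["Brands", "Months"]).reset_index(drop=True)
by_brand = toy_units.groupby("Brands")["Total Units"]

# Lag and rolling features are computed per brand, so one brand's
# history never spills into the next brand's first rows
toy_units["Previous Month"] = by_brand.shift(1)
toy_units["Rolling Average"] = by_brand.transform(
    lambda s: s.shift(1).rolling(3).mean()
)
print(toy_units)
```

Keep in mind that `shift` moves by rows, not by calendar months, so a brand with missing months would need to be reindexed to a complete monthly range first.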

### Merging Data 

Now that we have only one brand to work with at a time, it's relatively trivial to merge our datasets and pull features from the other datasets. You can use this example.

In [None]:
sales = datasales[datasales.Brand == '101 Cannabis Co.']

sales

In [None]:
units = units.merge(sales, on='Months')

In [None]:
# passing the axis positionally is no longer supported in recent pandas versions
units = units.drop(columns=['Brand'])

In [None]:
units.head()
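If you want to double-check a merge like the one above, `merge` has a couple of optional arguments that can help. The sketch below uses made-up data and a left join, which differs from the default inner join used above: it keeps months that have units but no sales, and `indicator=True` shows where each row came from.

```python
import pandas as pd

# Made-up data: the hypothetical brand has units for 3 months but sales for only 2
months = pd.date_range("2020-01-01", periods=3, freq="MS")
toy_units = pd.DataFrame({"Brands": "A", "Months": months,
                          "Total Units": [10.0, 20.0, 30.0]})
toy_sales = pd.DataFrame({"Brand": "A", "Months": months[:2],
                          "Total Sales ($)": [100.0, 200.0]})

merged = toy_units.merge(
    toy_sales.drop(columns=["Brand"]),  # drop the duplicate brand column up front
    on="Months",
    how="left",              # keep every units row, even without matching sales
    validate="one_to_one",   # raise if a month appears more than once on either side
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)
print(merged)
```

Rows marked `left_only` end up with a missing sales value, which is exactly the kind of case your imputation strategy will need to handle.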

So now I have a dataframe with merged features and engineered features. I now want to read in some brand-specific features to augment my dataset. 

## Brand Features Engineering

Let's see what we have here!

In [None]:
branddetails = pd.read_csv('data/BrandDetails.csv')

In [None]:
branddetails = branddetails[branddetails.Brand == '101 Cannabis Co.']

branddetails.head()

I have a theory that it's important to know whether a company offers inhalable and edible products as part of its product inventory, so I'm going to create binary categorical features.

In [None]:
# Binary flags: 1 if the brand has at least one product in the category, else 0
has_inhaleables = int('Inhaleables' in branddetails['Category L1'].values)
has_edibles = int('Edibles' in branddetails['Category L1'].values)

units['Inhaleables'] = has_inhaleables
units['Edible'] = has_edibles


units

I also believe that a count of the products the brand offers is a useful feature to include. Fortunately that's easy enough to determine!

In [None]:
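# .sum() counts the rows where the comparison is True;
# .count() would count every non-null row regardless of the brand
productcount = (branddetails.Brand == '101 Cannabis Co.').sum()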

productcount

In [None]:
units['ProdCount'] = productcount

units.head()
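The same brand-level features can also be built for every brand in one pass. Here is a rough sketch on made-up rows; it assumes, as above, that each row of the details file is one product, and the brands and the `Topicals` category are invented for illustration.

```python
import pandas as pd

# Made-up product rows for two hypothetical brands
toy_details = pd.DataFrame({
    "Brand": ["A", "A", "A", "B", "B"],
    "Category L1": ["Inhaleables", "Inhaleables", "Edibles", "Edibles", "Topicals"],
})

# Flag each product row, then collapse to one row per brand
flagged = toy_details.assign(
    Inhaleables=toy_details["Category L1"].eq("Inhaleables").astype(int),
    Edible=toy_details["Category L1"].eq("Edibles").astype(int),
)
brand_features = flagged.groupby("Brand").agg(
    Inhaleables=("Inhaleables", "max"),   # 1 if any product is in the category
    Edible=("Edible", "max"),
    ProdCount=("Category L1", "size"),    # number of product rows per brand
).reset_index()
print(brand_features)
```

The resulting one-row-per-brand table could then be merged onto the time series dataframe using the brand column as the key.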

The result is starting to look like a pretty darn good dataframe! We now have merged and engineered timeseries features, along with brand-level features included in our dataframe. 

To complete this work the next steps will be to: 

1. Finalize our feature selection plan.
2. Consolidate these steps into a concise for loop for all brands and then append the results into a single dataframe.
3. Finalize an imputation strategy.
4. Treat the dataset like a typical regression problem, where 'Total Sales ($)' or 'Total Units' can be the label you predict.
5. As always, report your metrics! (And speaking of metrics, I found this handy-dandy helper function that spits out a bunch of useful ones for you...)

In [None]:
import sklearn.metrics as metrics
def regression_results(y_true, y_pred):
    """Print a set of common regression metrics."""
    explained_variance = metrics.explained_variance_score(y_true, y_pred)
    mean_absolute_error = metrics.mean_absolute_error(y_true, y_pred)
    mse = metrics.mean_squared_error(y_true, y_pred)
    median_absolute_error = metrics.median_absolute_error(y_true, y_pred)
    r2 = metrics.r2_score(y_true, y_pred)
    print('explained_variance: ', round(explained_variance, 4))
    # MSLE raises an error on negative values, so only report it when it applies
    if np.min(y_true) >= 0 and np.min(y_pred) >= 0:
        mean_squared_log_error = metrics.mean_squared_log_error(y_true, y_pred)
        print('mean_squared_log_error: ', round(mean_squared_log_error, 4))
    print('r2: ', round(r2, 4))
    print('MAE: ', round(mean_absolute_error, 4))
    print('median_absolute_error: ', round(median_absolute_error, 4))
    print('MSE: ', round(mse, 4))
    print('RMSE: ', round(np.sqrt(mse), 4))
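To tie the remaining steps together, here is a hedged end-to-end sketch on synthetic data: engineer the features, split chronologically, fit a model, and report metrics. The synthetic series, the cutoff date, and the choice of `LinearRegression` are all assumptions for illustration. The chronological split is a deliberate choice here: a random split would let the model train on months that come after the ones it is tested on.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

# Synthetic monthly sales with a trend plus noise (made up for illustration)
rng = np.random.default_rng(0)
months = pd.date_range("2019-01-01", periods=24, freq="MS")
sales = 1000 + 25 * np.arange(24) + rng.normal(0, 10, size=24)
toy = pd.DataFrame({"Months": months, "Total Sales ($)": sales})

# Same style of features as above: last month and a three-month rolling average
toy["Previous Month"] = toy["Total Sales ($)"].shift(1)
toy["Rolling Average"] = toy["Total Sales ($)"].shift(1).rolling(3).mean()
toy = toy.dropna().sort_values("Months")

# Chronological split: train on earlier months, test on later ones
cutoff = pd.Timestamp("2020-07-01")  # hypothetical cutoff
train = toy[toy["Months"] < cutoff]
test = toy[toy["Months"] >= cutoff]

features = ["Previous Month", "Rolling Average"]
label = "Total Sales ($)"
model = LinearRegression().fit(train[features], train[label])
y_pred = model.predict(test[features])

print("MAE:", round(mean_absolute_error(test[label], y_pred), 4))
print("R2:", round(r2_score(test[label], y_pred), 4))
```

With the helper defined in this notebook, the two print lines could be replaced by a single call such as `regression_results(test[label], y_pred)`.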