# Predicting (new) item categories

In this lab, we'll try to find a better way to categorize items than the categories they already have.

First, we need to read in the items.csv and sales_train.csv files:

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd

items = pd.read_csv('./data/kaggle-sales/items.csv.gz')
items.head()

In [None]:
sales = pd.read_csv('./data/kaggle-sales/sales_train.csv.gz', parse_dates=['date'])
sales.head()

The only feature we have for items right now is the text description (item_name). Let's add some aggregate sales data: the mean price and the mean transaction volume per year (the price and txns columns built below). To do this, we'll first have to reformat the sales data to be keyed by item_id and year.

Follow a similar process to the feature engineering lab in order to do this.

In particular, we need to:

 - add a 'year' column
 - group by item_id and year, computing the (mean) price and (sum) txns
 - create a MultiIndex on item_id and year and reindex

In [None]:
# Add the calendar year of each sale
sales['year'] = sales.date.dt.year
# Aggregate per item and year: mean price and total units sold
g = sales.groupby(['item_id', 'year'])
item_sales = pd.concat([
    g.item_price.mean().rename('price'),
    g.item_cnt_day.sum().rename('txns'),
], axis=1)
item_sales.head()

In [None]:
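# Full grid of every item_id crossed with every year in the data
index = pd.MultiIndex.from_product(
    [
        sorted(sales['item_id'].unique()),
        np.arange(sales['year'].min(), sales['year'].max() + 1),
    ], names=['item_id', 'year']
)
# Reindexing adds NaN rows for item/year pairs that had no sales
item_sales = item_sales.reindex(index)
item_sales.head()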

Fill missing txns with 0:

In [None]:
item_sales = item_sales.fillna({'txns': 0})
item_sales.head()

Now we can calculate the mean price and mean transactions per year by grouping *only* by item_id and aggregating with mean:

In [None]:
item_sales = item_sales.groupby(level=0).mean()
item_sales.head()

Create an item_features dataframe that includes the name, price, and transactions, and assign it to data, the name the cells below use.
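One possible way to build this dataframe is to join the per-item aggregates onto the item names. The sketch below is not the lab's reference solution: it uses tiny made-up tables in place of the real items and item_sales, and it assumes an inner join, which drops items that never appear in the sales data. The result is called data because the later cells refer to that name.

```python
import pandas as pd

# Toy stand-ins for the real tables (hypothetical values, for illustration only).
items = pd.DataFrame({
    'item_id': [0, 1, 2],
    'item_name': ['red shirt', 'blue shirt', 'board game'],
    'item_category_id': [10, 10, 20],
})
item_sales = pd.DataFrame(
    {'price': [100.0, 250.0], 'txns': [3.0, 1.5]},
    index=pd.Index([0, 2], name='item_id'),
)

# Join the per-item aggregates onto the item names. The inner join is an
# assumption: it drops items with no sales, whereas a left join would keep
# them with NaN price and txns.
data = items[['item_id', 'item_name']].merge(
    item_sales.reset_index(), on='item_id', how='inner'
)
print(data)
```

Merging on a column (rather than on the index) gives the result a fresh 0..n-1 index, which keeps later column-wise concatenation with NumPy-derived frames simple.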

Now we'll encode our text features as before:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer()
transformer = TfidfTransformer()

text_features = vectorizer.fit_transform(data.item_name)
text_features = transformer.fit_transform(text_features)
text_features

In [None]:
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=10)
truncated_text_features = svd.fit_transform(text_features)

Finally, we create our features dataframe with:

- price
- txns
- text features

In [None]:
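features = pd.concat([
    data[['price', 'txns']],
    # Reuse data's index so the rows align when concatenating along columns
    pd.DataFrame(
        truncated_text_features,
        index=data.index,
        columns=[f'text_{i}' for i in range(truncated_text_features.shape[1])],
    ),
], axis=1)
features.head()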

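## Scaling

Use a StandardScaler to scale the features, so that price, txns, and the text components contribute on comparable scales to the distance-based clustering.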
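A minimal sketch of the scaling step, assuming a features dataframe like the one built above. The data here is randomly generated as a stand-in, and the name scaled_features is a hypothetical choice, not one the lab prescribes.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the lab's features dataframe.
rng = np.random.default_rng(0)
features = pd.DataFrame({
    'price': rng.lognormal(mean=6, sigma=1, size=200),
    'txns': rng.poisson(lam=5, size=200).astype(float),
    'text_0': rng.normal(size=200),
})

scaler = StandardScaler()
# fit_transform returns a NumPy array; wrap it to keep column names and index.
scaled_features = pd.DataFrame(
    scaler.fit_transform(features),
    columns=features.columns,
    index=features.index,
)
print(scaled_features.describe().loc[['mean', 'std']])
```

After scaling, each column has mean 0 and unit variance, so a single eps value means roughly the same thing along every feature axis.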

# Clustering

Use DBSCAN with eps=0.1 to find a clustering of the scaled features.
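A minimal sketch of the DBSCAN step on synthetic data: two tight blobs plus three isolated points, scaled first and then clustered with eps=0.1. The data and the variable names (X, scaled, clusters) are assumptions for illustration; min_samples is left at scikit-learn's default of 5.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two tight synthetic blobs and three isolated points (made-up data).
X = np.vstack([
    rng.normal(loc=0.0, scale=0.05, size=(50, 2)),
    rng.normal(loc=10.0, scale=0.05, size=(50, 2)),
    [[5.0, 5.0], [0.0, 10.0], [10.0, 0.0]],
])
scaled = StandardScaler().fit_transform(X)

# fit_predict returns one label per row; -1 marks points DBSCAN treats as noise.
clusters = DBSCAN(eps=0.1).fit_predict(scaled)

labels, counts = np.unique(clusters, return_counts=True)
print(dict(zip(labels.tolist(), counts.tolist())))
```

On the real item features, a radius of 0.1 in scaled units is quite small, so it is worth checking how many samples end up labeled -1 before interpreting the clusters.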

# Interpreting results

Use a RandomForestClassifier to evaluate feature importance for the chosen clustering, using the cluster labels as the target.

Remember to drop any sample with a cluster label of -1, which DBSCAN assigns to noise points.

Create a plot of the feature importance from the classifier.
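The classifier step might look roughly like the sketch below. It uses randomly generated features and made-up cluster labels (driven by price, with the first ten rows marked as noise) rather than the lab's real clustering; n_estimators, random_state, and the variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 300
# Hypothetical stand-in for the scaled features.
features = pd.DataFrame({
    'price': rng.normal(size=n),
    'txns': rng.normal(size=n),
    'text_0': rng.normal(size=n),
})
# Made-up cluster labels: determined by price, with a few noise points (-1).
clusters = np.where(features['price'] > 0, 1, 0)
clusters[:10] = -1

# Drop noise samples before training: -1 is not a real cluster.
mask = clusters != -1
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(features[mask], clusters[mask])

# Label the importances with the feature names and plot them.
importances = pd.Series(
    clf.feature_importances_, index=features.columns
).sort_values()
ax = importances.plot.barh(title='Feature importance for cluster membership')
```

Because the toy labels depend only on price, that bar should dominate the plot; on the real data, the ranking shows which features DBSCAN effectively separated the items on.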

# Analyzing with a RandomForestRegressor

One of the things we might like to know is how well each of our item features predicts the number of transactions. For this, we can train a RandomForestRegressor and inspect its feature_importances_ attribute.

Train a RandomForestRegressor on the feature data, using txns as the target column (and leaving it out of the inputs).

Create a plot of the feature importance from the regressor.
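A sketch of the regressor step under similar assumptions: the features are randomly generated, and the toy txns column is constructed to depend on text_0 so that the expected ranking is known. Note that txns is removed from the inputs before fitting, since a target left among the features would trivially dominate the importances.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 300
# Hypothetical stand-in for the lab's features dataframe.
features = pd.DataFrame({
    'price': rng.normal(size=n),
    'text_0': rng.normal(size=n),
    'text_1': rng.normal(size=n),
})
# Made-up target: txns depends mostly on text_0, plus a little noise.
features['txns'] = 3 * features['text_0'] + rng.normal(scale=0.1, size=n)

# Separate the target from the inputs.
X = features.drop(columns='txns')
y = features['txns']

reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(X, y)

importances = pd.Series(reg.feature_importances_, index=X.columns).sort_values()
ax = importances.plot.barh(title='Feature importance for predicting txns')
```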