Thank you BEN LEBOVITZ for sharing your Notebook https://www.kaggle.com/code/beezus666/k-means-and-feature-importance-for-articles

# Overview of this notebook
1. Group users and articles in a bunch of k-means clusters
2. Do a simple random forest to take a peek at the features and see what's useful for articles
    1. In the DF that does this, I set up y = number of times an article was bought
    2. This is set up as a very simple regression problem, just to use feature importance to see which features the RF finds useful

The concept of clustering is key to recommender systems. Using cuML seems to be pretty good compared to CPU based solutions. I have a ways to go on this, but maybe it'll give someone else good ideas too.

Bonus... switch between cuML and cuDF and pandas/xgb/scikit/etc where a GPU will help.

# Overview of this notebook
1. Implement a simple Random Forest to take a peek at the features and see what's useful for articles
    1. Create a dataset where y = number of times an article was bought
    2. This is set up as a very simple regression problem, just to use feature importance to see which features the RF finds useful


In [None]:
import numpy as np
import pandas as pd 
pd.options.plotting.backend = "matplotlib"
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
import gc
import cudf
from fastai.tabular.core import add_datepart  # add_datepart fails on cuDF (some issue with weeks?), so the date parts are built by hand below
import cupy as cp
from cuml.cluster import KMeans
from cuml.datasets import make_blobs


# Load and group data
1. Create a count of how many times each item was bought from the transactions CSV so that we can use it later in the articles DF.

2. The main idea is to use a simple Random Forest to predict how many times an article will sell.


First, we need to build the number-of-times-sold feature from the transaction data.
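
As a rough illustration of the target being built here, the sketch below uses pandas on a small made-up transaction table (the rows are invented for the example; the notebook itself uses cuDF on the real CSV). It counts purchases per customer and article, then sums those counts per article.

```python
import pandas as pd

# Toy transactions (hypothetical values, not taken from the competition data)
toy = pd.DataFrame({
    "t_dat": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-02", "2020-01-05"]),
    "customer_id": [1, 1, 2, 2],
    "article_id": [100, 100, 100, 200],
})

# Number of purchases for each (customer, article) pair
per_pair = (
    toy.groupby(["customer_id", "article_id"])["t_dat"]
    .count()
    .reset_index(name="ct")
)

# Total purchases per article: this is the regression target y
per_article = per_pair.groupby("article_id", as_index=False)["ct"].sum()
print(per_article)
```

Summing the per-pair counts gives the same total as counting transaction rows per article directly; the two-step form simply mirrors the cells that follow.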

In [None]:
#some nice ideas on reducing memory: https://www.kaggle.com/c/h-and-m-personalized-fashion-recommendations/discussion/308635
transactions = cudf.read_csv('../input/h-and-m-personalized-fashion-recommendations/transactions_train.csv', parse_dates=['t_dat'])
transactions['customer_id'] = transactions['customer_id'].str[-16:].str.hex_to_int().astype('int64')
transactions['article_id'] = transactions.article_id.astype('int32')
transactions.t_dat = cudf.to_datetime(transactions.t_dat)
transactions = transactions[['t_dat','customer_id','article_id']]
#transactions.to_parquet('train.pqt',index=False)
print( transactions.shape )
transactions.head()

In [None]:
tmp = transactions.groupby(['customer_id','article_id'])['t_dat'].agg('count').reset_index()
tmp.columns = ['customer_id','article_id','ct']
tmp.head(4)

In [None]:
transactions = transactions.merge(tmp,on=['customer_id','article_id'],how='left')
transactions = transactions.sort_values(['ct','t_dat'],ascending=False)
transactions = transactions.drop_duplicates(['customer_id','article_id'])
transactions = transactions.sort_values(['ct','t_dat'],ascending=False)


In [None]:
transactions.sample(3)

**Transform t_dat into separate columns: year, month, day, day of week, day of year, is month end, is month start**
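
For readers without a GPU, a minimal pandas sketch of the same date-part extraction is shown here on two made-up dates. It assumes the pandas `.dt` accessors behave like the cuDF ones used in the next cell, which is the intent of the cuDF API but is not verified in this notebook.

```python
import pandas as pd

# Two hypothetical dates: a month end and the following month start
dates = pd.Series(pd.to_datetime(["2020-01-31", "2020-02-01"]))

parts = pd.DataFrame({
    "year": dates.dt.year,
    "month": dates.dt.month,
    "day": dates.dt.day,
    "dayofweek": dates.dt.dayofweek,      # Monday = 0 ... Sunday = 6
    "dayofyear": dates.dt.dayofyear,
    "is_month_end": dates.dt.is_month_end,
    "is_month_start": dates.dt.is_month_start,
})
print(parts)
```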

In [None]:
transactions['year'] = transactions['t_dat'].dt.year
transactions['month'] = transactions['t_dat'].dt.month
transactions['day'] = transactions['t_dat'].dt.day
transactions['dayofweek'] = transactions['t_dat'].dt.dayofweek
transactions['dayofyear'] = transactions['t_dat'].dt.dayofyear
transactions['is_month_end'] = transactions['t_dat'].dt.is_month_end
transactions['is_month_start'] = transactions['t_dat'].dt.is_month_start
transactions.drop(columns=['t_dat'], inplace = True)

transactions.tail()

We encode the customer ID as categorical codes and save the mapping in cust_cat_df
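
The idea of category codes can be sketched in pandas on a few invented IDs. Each distinct value gets a small integer code, and keeping the categories around lets us map codes back to the original IDs later. The variable names here are illustrative only.

```python
import pandas as pd

# Hypothetical customer IDs (already converted to integers)
ids = pd.Series([30, 10, 30, 20], name="customer_id")

as_cat = ids.astype("category")
codes = as_cat.cat.codes  # position of each value in the sorted list of categories

# Lookup table to put IDs and codes back together later
lookup = pd.DataFrame({"customer_id": ids, "cat_codes": codes}).drop_duplicates()

# Round trip: codes back to the original IDs
recovered = as_cat.cat.categories[codes.to_numpy()]
print(lookup)
```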

In [None]:
transactions['cust_cat']= transactions['customer_id'].astype('category')
transactions['cat_codes'] = transactions['cust_cat'].cat.codes 
cust_cat_df = transactions[['customer_id', 'cust_cat', 'cat_codes']] #save them to put them back together later
print(cust_cat_df.dtypes)
cust_cat_df.head()

In [None]:
transactions.drop(columns=['cust_cat', 'cat_codes'], inplace = True)
transactions.dtypes

# Put number sold into articles DF
Now that everything is grouped and formatted, put it into the articles DF.

We also need to format everything numerically so that the models I'm trying to use will accept the DF.
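
A small pandas sketch of this step, on invented rows: a left merge keeps every article, articles that never sold get a count of 0, and string columns are replaced by category codes so that the whole frame is numeric. The column `colour` is a stand-in for the real string columns.

```python
import pandas as pd

# Hypothetical article metadata and per-article purchase counts
toy_articles = pd.DataFrame({
    "article_id": [100, 200, 300],
    "colour": ["black", "white", "black"],
})
toy_counts = pd.DataFrame({"article_id": [100, 300], "ct": [3, 1]})

# Left merge keeps article 200 even though it has no sales
merged = toy_articles.merge(toy_counts, how="left", on="article_id")
merged["ct"] = merged["ct"].fillna(0)

# Replace the string column with integer category codes, then cast everything to float
merged["colour_code"] = merged["colour"].astype("category").cat.codes
numeric = merged.drop(columns=["colour"]).astype(float)
print(numeric)
```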

In [None]:
# Load the article metadata and drop the free-text description column
articles = cudf.read_csv('../input/h-and-m-personalized-fashion-recommendations/articles.csv')
articles.drop(columns=['detail_desc'], inplace = True)
print(articles.shape)
print(articles.dtypes)
articles

In [None]:
articles.head(2)

In [None]:
cat_names= articles.select_dtypes(include=['object']).columns
cont_names = articles.select_dtypes(include=['int64']).columns
obj_names = articles.select_dtypes(include=['object']).columns

for i in cat_names: articles[i+'_cat']=articles[i].astype('category')
for i in obj_names: articles.drop(columns=[i], inplace = True)

articles.dtypes

In [None]:
articles.head(2)

In [None]:
times_bought = transactions[['article_id', 'ct']]
times_bought = times_bought.groupby('article_id', as_index = False).sum()
times_bought.head()

In [None]:
articles = articles.merge(times_bought,  how='left', on='article_id')
articles['ct'] = articles['ct'].fillna(0)
articles.head()

In [None]:
articles.select_dtypes(include=['category']).columns

In [None]:
cat_names= articles.select_dtypes(include=['category']).columns
article_cat_df = cudf.DataFrame()

for i in cat_names: 
    articles[i+'_cat_code'] = articles[i].cat.codes
    
    # save them to put them back together later
    article_cat_df[i] = articles[i]
    article_cat_df[i+'_cat_code'] = articles[i+'_cat_code']

In [None]:
articles

In [None]:
article_cat_df

In [None]:
articles.columns

In [None]:
# with cuDF this drop seems to need a loop over the columns; it is very fast anyway
for i in cat_names:
    articles.drop(columns=[i], inplace = True)
articles.dtypes

In [None]:
#kmeans can't handle integers, so convert to float
int64s = articles.select_dtypes(include=['int64']).columns
for i in int64s:
    articles[i] = articles[i].astype(float)

# Simple random forest to predict # of sales
Mistake from an earlier version: we want to do a simple RF BEFORE doing k-means so that we can figure out which features are important enough to pass to k-means.

All we're doing here is using RF to predict the number of sales (the "ct" column).
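
The pattern used in the cells below can be sketched on synthetic data. This is an illustration only: the data is random, only the first feature drives the target, and the hyperparameters (`n_estimators=50`, `random_state=0`) are assumptions chosen to keep the example small and repeatable.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: y depends strongly on feature 0, not on features 1 and 2
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 10.0 * X[:, 0] + rng.normal(scale=0.1, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

rf = RandomForestRegressor(n_estimators=50, random_state=0)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

# MSE computed by hand and with scikit-learn should agree
mse_manual = np.mean((y_test - y_pred) ** 2)
mse_sklearn = mean_squared_error(y_test, y_pred)

# Impurity-based importances, one value per feature, summing to 1
importances = rf.feature_importances_
print(mse_manual, mse_sklearn, importances)
```

The importances are relative scores, so they say which features the forest split on most usefully, not how well the model predicts overall; the MSE covers that part.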

In [None]:
# prepare X and y for random forest below
cols_list = articles.columns
cols_list = cols_list.to_list()
cols_list.remove('ct')

In [None]:
# convert to pandas, use scikit learn random forest to see what features are useful
# unfortunately cuML doesn't have feature importance in their random forest yet...
X = articles[cols_list].to_pandas()
y = articles['ct'].to_pandas()

In [None]:
articles.shape

In [None]:
from sklearn.model_selection import train_test_split
from sklearn import metrics

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [None]:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
# fit the model; the feature importances are read from it further down
model.fit(X_train, y_train)

In [None]:
y_pred = model.predict(X_test)

In [None]:
sum((y_test - y_pred) ** 2) / len(y_pred)

In [None]:
print('Mean Squared Error (MSE):', metrics.mean_squared_error(y_test, y_pred))

In [None]:
fi_plot_df = pd.DataFrame({'cols':X.columns, 'imp':model.feature_importances_}).sort_values('imp', ascending=False)    
fi_plot_df.plot(kind="barh", x = 'cols')