# Amazon product ratings prediction

Mission: Create a web app that recommends the top 5 products based on predicted ratings.

# Import libraries and my own functions

In [2]:
import sys
import os

import numpy as np
import pandas as pd

# Get the current working directory
current_dir = os.getcwd()

# Construct the path to the 'src' directory
src_dir = os.path.join(current_dir, 'src')

# Add 'src' directory to the sys.path
sys.path.append(src_dir)

from training import *
from utils import *
from EDA import *

# Show all columns (don't replace some with "...")
pd.set_option('display.max_columns', None)

ModuleNotFoundError: No module named 'training'

# Import data

## Download dataset from kaggle

In [3]:
#!pip install kaggle

In [4]:
os.listdir()

['.config', 'sample_data']

In [5]:
import kaggle
import pandas as pd
import zipfile
import os


data_path = 'amazon_reviews.csv'

if 'amazon_reviews.csv' not in os.listdir():
    # Download dataset
    kaggle.api.dataset_download_files('rogate16/amazon-reviews-2018-full-dataset', path='.', unzip=True)

    # Assuming the file is a CSV, find the downloaded CSV file
    for file in os.listdir('.'):
        if file.endswith('.csv'):
            data_path = file
            break

# Load into Pandas DataFrame
df = pd.read_csv(data_path)
df

OSError: Could not find kaggle.json. Make sure it's located in /root/.kaggle. Or use the environment method.

In [None]:
df.head(10)

In [None]:
df_small = df.sample(10000)
orig_cols = df.columns

# Some domain information and prior research

## on the dataset

About Dataset
This dataset was collected from the open-source Amazon reviews data made available by Jianmo Ni.

Preprocess
The data was originally in JSON, divided into metadata and reviews. I converted the data into data frames and joined the metadata with the reviews before converting the result to a CSV file. No further processing was done afterwards.

Content
This dataset contains full reviews from Amazon in 2018 and consists of 500000+ reviews from 100000+ users. The columns are pretty much self-explanatory, such as userName, itemName, rating, reviewText, etc.

Task
This dataset can be used to build a recommender system, since it has the user-item-rating information. This can also be used for NLP tasks, using the reviewText column.

The "Amazon Reviews 2018 Full Dataset" on Kaggle is a comprehensive collection of Amazon product reviews from the year 2018. This dataset is extensive and is likely to include various features such as the text of the reviews, ratings, product information, user details, and timestamps. It is typically used for analysis in natural language processing, sentiment analysis, and building recommendation systems.

## ## Should we check if there's any time trend in here?

Research on predicting the helpfulness of Amazon product reviews indicates that the most effective features often include content-based aspects of the reviews. These features span categories like lexical, structural, semantic, syntactic, and metadata elements. Key features identified as impactful include review length, unigrams (single words), product ratings, and the degree of detail, which is a function of review length and n-grams. Additionally, features like sentiment analysis, readability, and text surface elements (like the number of words, sentences, and use of punctuation) also play significant roles. The selection of these features, however, can be domain- and platform-dependent, and the effectiveness of individual features or combinations can vary.

## ## A medium article: Amazon Review Rating Prediction with NLP
https://medium.com/data-science-lab-spring-2021/amazon-review-rating-prediction-with-nlp-28a4acdd4352


## #  decided to consider reviews written by verified purchasers to decrease the risk of fraudulent reviews with dubious ratings

## # Only star_rating, review_headline, and review_body columns were considered to reduce feature complexity.

## # review_headline and review_body were concatenated and delimited by a space to further reduce feature complexity

## # Normalize the dataset by converting all the characters to lowercase.

## # convert all whitespace and punctuation into a single space to get rid of any inconsistencies

## #  regex to write a de-contract method that essentially finds and replaces the apostrophe-letter format into a full word. We made an observation that replacing “n’t” to “not” is not viable in all cases. It works for “isn’t” ⇒ “is not” but will break “can’t” ⇒ “ca not”. We created special cases for these situations.

## # remove stopwords to further denoise the input. We used NLTK’s stopwords package to provide us with the list of stopwords. Here, we made an adjustment to avoid the removal of certain negation stop words, namely “not” and “no”, since they do indeed influence sentence meaning. A product that is “not good” is certainly different from a “good” product.

## # Lemmatization: pre-computed embeddings seem to be calculated without stemming so we decided against stemming in preprocessing as well.  

## ## Treating Numbers: considered converting alphanumeric numbers into English words for the sake of consistency. decided we would just keep the original format instead.

### # Stemming: could provide some inaccurate results.
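The preprocessing notes above (lowercasing, de-contracting with special cases, collapsing punctuation and whitespace, removing stopwords while keeping negations) could be sketched roughly as follows. This is an illustration only, not the article's code: the contraction tables and the tiny stopword set are assumptions made for the example (the article used NLTK's stopword list), and `decontract` and `clean_review` are hypothetical helper names.

```python
import re

# Special cases go first: a blanket "n't" -> " not" rule would turn "can't" into "ca not".
SPECIAL_CONTRACTIONS = {"can't": "cannot", "won't": "will not", "shan't": "shall not"}
GENERIC_SUFFIXES = [
    (r"n't\b", " not"),
    (r"'re\b", " are"),
    (r"'ll\b", " will"),
    (r"'ve\b", " have"),
    (r"'m\b", " am"),
]
# Tiny illustrative stopword set; "not" and "no" are deliberately left out.
STOPWORDS = {"a", "an", "the", "is", "it", "i", "this", "to", "and", "of"}


def decontract(text):
    """Expand contractions in already lowercased text."""
    text = text.replace("\u2019", "'")  # normalize curly apostrophes
    for word, full in SPECIAL_CONTRACTIONS.items():
        text = re.sub(rf"\b{re.escape(word)}\b", full, text)
    for pattern, replacement in GENERIC_SUFFIXES:
        text = re.sub(pattern, replacement, text)
    return text


def clean_review(headline, body):
    """Concatenate headline and body, then normalize the text."""
    text = f"{headline} {body}".lower()
    text = decontract(text)
    # Collapse every run of punctuation/whitespace into a single space
    text = re.sub(r"[^a-z0-9]+", " ", text).strip()
    return " ".join(tok for tok in text.split() if tok not in STOPWORDS)


example = clean_review("Can't complain", "It isn't bad!")
print(example)
```

De-contracting has to run before punctuation is stripped, otherwise the apostrophes that mark the contractions are already gone.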

## ## Embeddings: little significant difference in model performance for the simpler encoding schemes like Bag of Words and TF-IDF. However, the pre-computed word-based embeddings performed the best, specifically BERT, which is a pre-computed NLP model from Google that had to be optimized via stochastic gradient descent.

## ## Modeling:

## ### tried a few classification models, although these were quickly proven to be vastly inferior to regression models, based on the nature of the project. Our unsupervised deep learning models employed a variety of the aforementioned embeddings.

## ### used Root-Mean-Square Error (RMSE) as our loss metric, which would tell us on average how many stars away our label was from the actual value

## ### normalized the labels to be from 0 to 1 instead of 1 to 5 by dividing all ratings by five. This means a label of 0.2 equals 1 star, 0.4 equals 2 stars, etc. An RMSE value of 0.1 suggests our labels are predicting a half star away from their actual value, on average. As we will talk about in the shortcomings section, it is virtually impossible to get a test RMSE value close to 0 based on the nature of the problem.

## ### LightGBM: As a baseline for regression RMSE, we encoded the review text with TF-IDF and fit an untuned LightGBM regression model. The RMSE value on the test set was 0.178, i.e. an average of 0.89 stars away from the actual review value

## ### Catboost: We decided to implement a Bag of Words model as we were curious about how well such a model would predict the rating of a review...  a CatBoost regression model on our augmented training dataset for 100 iterations, which gave us an RMSE of about 0.17 on the test dataset, which was comparable to the RMSE we received from one of our better-performing models.

## ### ReLU: we achieved an RMSE of 0.173 on the test set. We could add l1 and l2 regularization or dropout layers to discourage overfitting, but we decided against this as this model will act as a baseline for comparison of other models.

## ### 1D Convolution Layer: achieved an RMSE of 0.160 on the test set.

## ### LSTM/GRU: We achieved 0.142 testing loss

## ### BERT: Once applied to the test set, the RMSE loss was 0.136
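To make the label normalization in the modeling notes concrete: when ratings are divided by five, an RMSE on the normalized scale converts back to stars by multiplying by five. The numbers below are made up for illustration and `rmse` is a hypothetical helper.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root-mean-square error between two equally sized arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Toy ratings in stars; every prediction is off by half a star
stars_true = np.array([5, 4, 1, 3])
stars_pred = np.array([4.5, 4.5, 1.5, 2.5])

rmse_normalized = rmse(stars_true / 5, stars_pred / 5)  # labels scaled to (0, 1]
rmse_in_stars = rmse_normalized * 5                      # back on the star scale
print(rmse_normalized, rmse_in_stars)
```

By the same conversion, the reported 0.136 for BERT corresponds to roughly 0.68 stars.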

## ## Shortcomings:
## ### different people associate star ratings with different sentiment polarities, especially for the 2-, 3-, and 4-star ratings
## ###  we divided all ratings by five in order to standardize true labels between zero and one. However, we realized afterward that this approach improperly restricts predicted ratings that exceed five stars
## ### For the less advanced models, double negation and mixed sentiment was sometimes not factored into the predicted label as much as it should have been. Models that accounted for bi-directional representation did the best with this topic.
## ### Capitalization: The sentence “I did NOT like the product” should likely be labeled with a lower rating than the sentence “I did not like the product.”

## ### Conclusion: Future work could build on these models by increasing the training set, improving pre-processing, and accounting for the shortcomings listed above as best as possible. One could also download a more sizable version of BERT, although computing time would rapidly increase.

## ## Potential features:
## # Sentiment of review
https://towardsdatascience.com/predicting-sentiment-of-amazon-product-reviews-6370f466fa73
https://www.kaggle.com/code/imdevskp/amazon-reviews-sentiment-analysis-and-prediction

## # Mean ratings that connected users gave to this item - but can we rely on having other ratings?
https://www.kaggle.com/code/tsefongwon/graph-analysis-of-amazon-customer-buying-habits/notebook

## # Visual features of the item
## # Cosine Similarity of this item to other items - but can we rely on having other ratings?
https://www.kaggle.com/code/tsefongwon/graph-analysis-of-amazon-customer-buying-habits/notebook

## # prior multi-classification of negative (1-2), neutral (3) and positive (4-5) reviews:
https://medium.com/@jenny6449/predict-the-ratings-of-amazon-products-based-on-customer-reviews-using-machine-learning-b035bcb1c17e

# Define target

In [None]:
df

In [None]:
target = 'rating'
plot_target_bar(df, target)

## # Target is unbalanced towards 5-star ratings - more than 70%

# Define the problem

In [None]:
n = 5
one_user = df.iloc[n].userName
df.iloc[n]

In [None]:
df[df['userName'] == one_user].T

## When given userName, recommend 5 highest predicted ratings.
## The predicted ratings are based on the user's history of ratings, which includes features on the items the user has rated before.
## Assumption 1: verified userNames uniquely identify users (otherwise the user would have to manually fill in his ratings history when asking for recommendation).
## Assumption 2: The user doesn't need us to recommend items she has already rated as good, so we won't recommend them.
## Assumption 3: We cannot trust unverified ratings, since an Amazon seller can give low ratings to his rivals' items, and do so many times.  
## ** If the userName doesn't exist (has no rating history), recommend the top 5 rated items across all users (use rating votes as a rating weight: 0 votes is 1, 1 vote is 2, 2 votes is 3, etc.)
## ** The userNames "Amazon Customer" and "Kindle Customer" account for 7% of all ratings. While they are verified, they do not look like legitimate individual users (neither by name nor by number of reviews). Even so, we will respond to their recommendation requests if they make any.  
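The fallback for unknown userNames (top rated items, with votes as weights) could look roughly like this. It is a sketch under assumptions: the `vote` column is taken to be numeric or parseable with missing values meaning 0 votes, and `top_items_for_new_user` is a hypothetical helper name.

```python
import pandas as pd


def top_items_for_new_user(ratings, k=5):
    """Vote-weighted mean rating per item; a rating with v votes gets weight v + 1."""
    votes = pd.to_numeric(ratings["vote"], errors="coerce").fillna(0)
    weights = votes + 1
    weighted_sum = (ratings["rating"] * weights).groupby(ratings["itemName"]).sum()
    weight_total = weights.groupby(ratings["itemName"]).sum()
    return (weighted_sum / weight_total).sort_values(ascending=False).head(k)


# Toy data: item A has a 5-star rating with 0 votes and a 1-star rating with 2 votes
toy_ratings = pd.DataFrame({
    "itemName": ["A", "A", "B"],
    "rating": [5, 1, 4],
    "vote": [0, 2, None],
})
top = top_items_for_new_user(toy_ratings, k=5)
print(top)
```

A plain weighted mean still favors items with very few ratings, so combining it with a minimum rating count may be worth considering.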

#### # Since we are basing our model on historical ratings, we'll drop items with fewer than 5 ratings


In [None]:
itemName_n_ratings = df.groupby('itemName').size()
itemName_n_ratings.value_counts(normalize=True).sort_index()

In [None]:
item_n_ratings_threshold = 5
itemName_n_ratings.value_counts(normalize=True)[itemName_n_ratings.value_counts(normalize=True).sort_index().index<item_n_ratings_threshold].sum()

In [None]:
items_with_history = itemName_n_ratings[itemName_n_ratings>=item_n_ratings_threshold].index.tolist()
df = df[df.itemName.isin(items_with_history)]

In [None]:
userNames_n_ratings = df.groupby('userName').size()
userNames_n_ratings.value_counts(normalize=True).sort_index()

#### # Same for users - we have too many users with little history, so we'll drop users with fewer than 6 ratings


In [None]:
users_n_ratings_threshold = 6
userNames_n_ratings.value_counts(normalize=True)[userNames_n_ratings.value_counts(normalize=True).sort_index().index<users_n_ratings_threshold].sum()

In [None]:
users_with_history = userNames_n_ratings[userNames_n_ratings>=users_n_ratings_threshold].index.tolist()
df = df[df.userName.isin(users_with_history)]

In [None]:
df.shape, df[['userName', 'itemName']].nunique()
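One caveat with filtering items first and users second: dropping users can push some items back below the item threshold. If both thresholds need to hold at the same time, the two filters can be repeated until nothing changes. The sketch below is an optional variant and not what the cells above do; `filter_min_counts` is a hypothetical helper.

```python
import pandas as pd


def filter_min_counts(df, min_item=5, min_user=6):
    """Repeat the item and user filters until both thresholds hold together."""
    while True:
        n_before = len(df)
        item_counts = df.groupby("itemName")["itemName"].transform("size")
        df = df[item_counts >= min_item]
        user_counts = df.groupby("userName")["userName"].transform("size")
        df = df[user_counts >= min_user]
        if len(df) == n_before:
            return df


# Toy data with low thresholds so the effect is visible
toy = pd.DataFrame({
    "userName": ["u1", "u1", "u2", "u2", "u3", "u3"],
    "itemName": ["a", "b", "a", "b", "a", "c"],
})
filtered = filter_min_counts(toy, min_item=2, min_user=2)
print(filtered)
```

In the toy data, item `c` is dropped first, which leaves `u3` with a single rating, so `u3` is dropped as well.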

# Split to Train, Validation, Test

#### # Split the dataset with 80% for train, 10% for val and 10% for test

In [None]:
target

In [None]:
test_size=0.1
equal_val_test_size=True
X_train, X_test, X_val, y_train, y_test, y_val = split_dataset(df, target_col=target, the_test_size=test_size, equal_val_test_size=equal_val_test_size)

train = pd.concat([X_train, y_train], axis=1)
val = pd.concat([X_val, y_val], axis=1)
test = pd.concat([X_test, y_test], axis=1)

In [None]:
train['userName'].nunique(), val['userName'].nunique(), test['userName'].nunique(),

#### # Data is not stratified on userName - which is not good since we want to predict ratings for each user using his past ratings

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Set the split sizes for validation and test
val_size = 0.1
test_size = 0.1

# Group the data by user
grouped = df.groupby('userName')

# Split each group and store in a list
train_list, val_list, test_list = [], [], []
for name, group in grouped:
        user_train_val, user_test = train_test_split(group, test_size=test_size, random_state=42, shuffle=True)
        user_train, user_val = train_test_split(user_train_val, test_size=val_size/(1-test_size), random_state=42, shuffle=True)

        train_list.append(user_train)
        val_list.append(user_val)
        test_list.append(user_test)

# Concatenate all splits
train = pd.concat(train_list)
val = pd.concat(val_list)
test = pd.concat(test_list)

In [None]:
train['userName'].nunique(), val['userName'].nunique(), test['userName'].nunique(),

In [None]:
train.shape[0]/len(df), val.shape[0]/len(df), test.shape[0]/len(df)
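The per-user split above loops over every user in Python, which can be slow with tens of thousands of users. A vectorized alternative is sketched below; it is a different technique from the `train_test_split` loop (a random within-user percentile rank), it assumes the DataFrame index is unique, and `split_per_user` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def split_per_user(df, val_frac=0.1, test_frac=0.1, seed=42):
    """Split each user's rows into train/val/test by a random within-user rank."""
    rng = np.random.default_rng(seed)
    random_order = pd.Series(rng.random(len(df)), index=df.index)
    # Percentile position of each row inside its user's shuffled history, in (0, 1]
    pos = random_order.groupby(df["userName"]).rank(method="first", pct=True)
    test_mask = pos > 1 - test_frac
    val_mask = (pos > 1 - test_frac - val_frac) & ~test_mask
    return df[~val_mask & ~test_mask], df[val_mask], df[test_mask]


# Toy data: one user with 6 ratings and one with 7
toy = pd.DataFrame({
    "userName": ["u6"] * 6 + ["u7"] * 7,
    "itemName": [f"item_{i}" for i in range(13)],
})
toy_train, toy_val, toy_test = split_per_user(toy)
print(len(toy_train), len(toy_val), len(toy_test))
```

With at least 6 ratings per user, every user lands in all three splits, though the exact split sizes per user can differ slightly from the loop-based version.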

In [None]:
userNames_n_ratings = train.groupby('userName').size()
userNames_n_ratings.value_counts(normalize=True).sort_index()

In [None]:
train.groupby(['itemName']).size().value_counts(normalize=True)

In [None]:
train['itemName'].value_counts().plot()

#### # We still have a long tail of item reviews: over 50% of items have only one review. But that's OK - if the user asking for a recommendation has a similar user who rated an infrequent item, we'll still try to recommend that item.

# Exploratory  Data Analysis, pre-processing (and some feature engineering)

In [None]:
train

## Delve into the features

In [None]:
train.describe(include='all')

### 1. Users

In [None]:
train.groupby(['userName']).size().describe()

#### # There are 55k "unique" users, and a mean of 6.6 ratings per user and a median of 4 (right skewed)

#### # More precisely, the mean number of ratings per user is 4.2, with median of 2 (skewed to the right) - long tail of high amount of ratings, most users don't rate more than 3 items.

In [None]:
users_frequencies = get_col_frequencies(train, col_name='userName', sort_index=False)
users_frequencies

In [None]:
train[train.userName.isin(['Amazon Customer', 'Kindle Customer'])]['verified'].mean()

#### # About 8% of ratings come from Amazon Customer and Kindle Customer, and 97% of them are verified

In [None]:
train['verified'].mean()

#### # Drop the 5% of ratings that are unverified: we cannot trust these ratings as true ratings

In [None]:
train = train[train['verified'] == True]
val = val[val['verified'] == True]
test = test[test['verified'] == True]
assert train['verified'].mean() == 1

#### # Drop the verified column, as it now contains the same value for all rows

In [None]:
train = train.drop(columns='verified')
val = val.drop(columns='verified')
test = test.drop(columns='verified')
assert not 'verified' in train

In [None]:
users_frequencies = get_col_frequencies(train, col_name='userName', sort_index=False)
users_frequencies

In [None]:
users_frequencies[users_frequencies.index.str.lower().str.contains('amazon|kindle')]

#### # There are still a lot (8.5%) of ratings from seemingly non-unique users such as Amazon Customer and Kindle Customer. Since they're verified we won't delete them - but we'll disregard the history of the userNames "Amazon Customer" and "Kindle Customer" if they ever try to use our app for recommendations. The rest of the userNames have a much lower number of ratings, and are therefore accepted as unique.

In [None]:
assert train['userName'].isna().sum() == 0
assert val['userName'].isna().sum() == 0
assert test['userName'].isna().sum() == 0

#### # There are no missing values for userName

### 2. Verified - dropped after keeping only verified ratings

### 3. itemName

In [None]:
train.itemName.describe(include='all')

In [None]:
train.groupby('itemName').size().describe()

#### # There are 88k "unique" items, with an average of 3.9 ratings per item (median of 2; ratings per item is right skewed)

#### # Let's drop completely duplicated rows - those are surely technical problems; we should disregard duplicate reviews given on the same item by the same user on the same date

In [None]:
train.duplicated().sum()

In [None]:
train = train.drop_duplicates()
val = val.drop_duplicates()
test = test.drop_duplicates()
assert train.duplicated().sum() == 0

In [None]:
probably_duplicate_ratings_col =['userName', 'description', 'image', 'brand', 'feature',
                                 'category', 'price', 'rating', 'reviewTime', 'summary', 'reviewText',
                                 'vote']
# train already contains the rating column, so check duplicates on it directly
train.duplicated(subset=probably_duplicate_ratings_col).sum()

#### # drop ratings with same exact rating by same user in same day for same item features

In [None]:
train = train.drop_duplicates(subset=probably_duplicate_ratings_col)
assert train.duplicated(subset=probably_duplicate_ratings_col).sum() == 0

val = val.drop_duplicates(subset=probably_duplicate_ratings_col)
test = test.drop_duplicates(subset=probably_duplicate_ratings_col)

In [None]:
train['itemName'].isna().sum()

#### # Drop the 5 items without a name - we can't recommend them

In [None]:
train = train.dropna(subset='itemName')
assert train['itemName'].isna().sum() == 0

val = val.dropna(subset='itemName')
test = test.dropna(subset='itemName')

In [None]:
train.itemName.describe(include='all')

#### # There are still 88k "unique" items.
#### # Examine more columns that might be identifiers for unique item: brand and price  

In [None]:
col = 'itemName'
groupby_col = 'brand'
get_col_unique_counts_on_groupby_col(train, col, groupby_col, sort_index=True)

#### # 55% of all items have only one brand (no competition with other brands)

In [None]:
# na=False keeps reviews with missing text out of the match
train['reviewText'][train['reviewText'].str.lower().str.contains('price', na=False)]

In [None]:
21511/len(train)

#### # 7% of reviews talk also about prices - this should be a factor for identifying an item

#### Create item_id from itemName, brand and price combinations

In [None]:
train['item_id'] = train['brand'].fillna('NA') + "_" + train['itemName'] + "_" + train['price'].fillna('NA')
train = move_cols_to_first(train, ['userName', 'item_id'])

val['item_id'] = val['brand'].fillna('NA') + "_" + val['itemName'] + "_" + val['price'].fillna('NA')
val = move_cols_to_first(val, ['userName', 'item_id'])
test['item_id'] = test['brand'].fillna('NA') + "_" + test['itemName'] + "_" + test['price'].fillna('NA')
test = move_cols_to_first(test, ['userName', 'item_id'])

train

In [None]:
train.item_id.describe(include='all')

In [None]:
assert train.item_id.isna().sum() == 0
assert train.item_id.str.contains('NA_NA').sum() == 0

#### # Now there are almost 89k unique items, and no NA's

### 4. description

In [None]:
train['description'].describe(include='all')

#### # Create description_n_sentences - some descriptions are lists of several descriptions

In [None]:
import ast

train['description_n_sentences'] = train['description'].fillna('[]').apply(lambda x: len(ast.literal_eval(x)))

val['description_n_sentences'] = val['description'].fillna('[]').apply(lambda x: len(ast.literal_eval(x)))
test['description_n_sentences'] = test['description'].fillna('[]').apply(lambda x: len(ast.literal_eval(x)))

get_col_frequencies(train, 'description_n_sentences')
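The list-length pattern used here is repeated for `description`, `image` and `feature`, and `ast.literal_eval` raises on any value that is not a valid Python literal. A more defensive helper could look like the sketch below. Counting an unparseable value as a single entry is an assumption made for illustration, and `list_len` is a hypothetical name.

```python
import ast

import pandas as pd


def list_len(value):
    """Length of a stringified list; 0 for missing values, 1 for unparseable text."""
    if pd.isna(value):
        return 0
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return 1  # assumption: treat free text as a single entry
    return len(parsed) if isinstance(parsed, (list, tuple)) else 1


toy_descriptions = pd.Series(["['soft', 'durable']", "[]", None, "plain text"])
toy_counts = toy_descriptions.apply(list_len)
print(toy_counts.tolist())
```

Applied as `train['description'].apply(list_len)`, this would also remove the need for the `fillna('[]')` step.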

In [None]:
train['description'].isna().sum()

#### # fill description NA's with NA string

In [None]:
train['description'] = train['description'].fillna("NA")
val['description'] = val['description'].fillna("NA")
test['description'] = test['description'].fillna("NA")
assert train['description'].isna().sum() == 0

#### # Create description_len

In [None]:
train['description_len'] = train['description'].str.len()

val['description_len'] = val['description'].str.len()
test['description_len'] = test['description'].str.len()

get_col_frequencies(train, 'description_len')

In [None]:
train[[col for col in train if 'description' in col]].describe(include='all')

#### # The median description has 1 sentence (mean=2) and 509 characters (mean=749); both are right skewed

In [None]:
import matplotlib.pyplot as plt

train['description_len'].hist(bins=100, density=True)
plt.xlim(0, 10000)
plt.show()

In [None]:
train['description_n_sentences'].hist(bins=100, density=True)
plt.xlim(0, 30)
plt.show()

### 5. image

In [None]:
train['image']

#### # Create n_images - some image values are lists of several images

In [None]:
train['n_images'] = train['image'].apply(lambda x: len(ast.literal_eval(x)))

val['n_images'] = val['image'].apply(lambda x: len(ast.literal_eval(x)))
test['n_images'] = test['image'].apply(lambda x: len(ast.literal_eval(x)))

get_col_frequencies(train, col_name='n_images')

In [None]:
train['n_images'].hist(bins=100, density=True)
plt.xlim(0, 30)
plt.show()

#### # The n_images mode is 6

In [None]:
assert train['image'].isna().sum() == 0

### 6. brand

In [None]:
train['brand'].describe()

#### # There are 21k brands! we'll group them later on by price (luxury, budget, etc.)

In [None]:
get_col_frequencies(train, col_name='brand', sort_index=False)

#### # The leading brand, KONG, is only in 1% of all ratings

In [None]:
get_col_unique_counts_on_groupby_col(train, col='item_id', groupby_col='brand')

In [None]:
train['brand'].isna().sum()

In [None]:
1560/len(train)

#### # There are 0.5% missing brands (2180). Let's add a column for brand_isna, and fill them with the 'NA' string  

In [None]:
train['brand_isna'] = train['brand'].isna()*1

val['brand_isna'] = val['brand'].isna()*1
test['brand_isna'] = test['brand'].isna()*1

In [None]:
train['brand_isna'].describe()

In [None]:
train['brand'] = train['brand'].fillna('NA')

val['brand'] = val['brand'].fillna('NA')
test['brand'] = test['brand'].fillna('NA')

assert train['brand'].isna().sum() == 0

### 7. feature

In [None]:
train['feature']

In [None]:
assert train['feature'].isna().sum() == 0

#### # Create n_features - some feature values are lists of several features

In [None]:
train['n_features'] = train['feature'].apply(lambda x: len(ast.literal_eval(x)))

val['n_features'] = val['feature'].apply(lambda x: len(ast.literal_eval(x)))
test['n_features'] = test['feature'].apply(lambda x: len(ast.literal_eval(x)))

get_col_frequencies(train, col_name='n_features')

In [None]:
#### # Create feature_len
train['feature_len'] = train['feature'].str.len()

val['feature_len'] = val['feature'].str.len()
test['feature_len'] = test['feature'].str.len()

get_col_frequencies(train, col_name='feature_len')

In [None]:
train['n_features'].describe()

#### # The median number of features per rating is 5, and the mean is 5.1

In [None]:
train['n_features'].hist(bins=100, density=True)
plt.xlim(0, 30)
plt.show()

### 8. category

In [None]:
get_col_frequencies(train, col_name='category', sort_index=False)

#### # Most reviews (35%) are about Pet_Supplies!

#### # Merge categories that hold less than 1% of ratings into larger categories, based on similarity to the products that come up in a Google search of amazon <category name>

In [None]:
# Each small category maps directly to its final target, so a single pass is enough for train, val and test
small_cats_to_big_cats_mapper = {'Appliances':'Office_Products', 'Industrial_and_Scientific':'Office_Products','AMAZON_FASHION':'Arts_Crafts_and_Sewing', 'Luxury_Beauty':'Arts_Crafts_and_Sewing', 'All_Beauty':'Arts_Crafts_and_Sewing', 'Software':'Video_Games' ,'Digital_Music':'Musical_Instruments'}

train['category'] = train['category'].map(small_cats_to_big_cats_mapper).fillna(train['category'])

val['category'] = val['category'].map(small_cats_to_big_cats_mapper).fillna(val['category'])
test['category'] = test['category'].map(small_cats_to_big_cats_mapper).fillna(test['category'])

In [None]:
get_col_frequencies(train, col_name='category', sort_index=False)

In [None]:
assert sum(get_col_frequencies(train, col_name='category', sort_index=False)['pct']<0.01) == 0

In [None]:
train['category'].nunique()

#### # One-hot encode the categories

In [None]:
train = pd.concat([train, pd.get_dummies(train['category'], prefix='category', drop_first=True)], axis=1)

val = pd.concat([val, pd.get_dummies(val['category'], prefix='category', drop_first=True)], axis=1)
test = pd.concat([test, pd.get_dummies(test['category'], prefix='category', drop_first=True)], axis=1)
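Calling `pd.get_dummies` separately on train, val and test can produce different columns when a split lacks a category, or when `drop_first` drops a different first category in each split. One way to keep the columns aligned is to let train define them and reindex the other splits, as in the sketch below; `one_hot_like_train` is a hypothetical helper.

```python
import pandas as pd


def one_hot_like_train(train_col, other_col, prefix):
    """One-hot encode other_col using exactly the dummy columns produced from train_col."""
    train_dummies = pd.get_dummies(train_col, prefix=prefix, drop_first=True)
    other_dummies = pd.get_dummies(other_col, prefix=prefix)
    # Keep train's columns only: unseen categories are dropped, missing ones are filled with 0
    other_dummies = other_dummies.reindex(columns=train_dummies.columns, fill_value=0)
    return train_dummies, other_dummies


toy_train_cat = pd.Series(["Pet_Supplies", "Video_Games", "Office_Products"])
toy_val_cat = pd.Series(["Video_Games", "Books"])
toy_train_dummies, toy_val_dummies = one_hot_like_train(toy_train_cat, toy_val_cat, prefix="category")
print(toy_val_dummies)
```

The same alignment would apply to the `brand_price_group` dummies.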

In [None]:
train.shape

### 11. price

In [None]:
train['price']

In [None]:
train['price'].dtypes

#### # change price dtype to float

In [None]:
# regex=False treats "$" as a literal character rather than the regex end-of-string anchor
train['price'] = train['price'].str.replace("$", "", regex=False)
train['price'] = pd.to_numeric(train['price'], errors='coerce').astype(float)

val['price'] = val['price'].str.replace("$", "", regex=False)
val['price'] = pd.to_numeric(val['price'], errors='coerce').astype(float)
test['price'] = test['price'].str.replace("$", "", regex=False)
test['price'] = pd.to_numeric(test['price'], errors='coerce').astype(float)

In [None]:
train['price'].describe()

In [None]:
train['price'].isna().sum()/len(train)

#### # 17% of ratings don't contain price data
#### # Let's mark those ratings with an indicator - price_isna, and see if we can impute them somehow (using the train data)

In [None]:
train['price_isna'] = train['price'].isna()*1

val['price_isna'] = val['price'].isna()*1
test['price_isna'] = test['price'].isna()*1

In [None]:
train['brand_itemName'] = train['brand'] + "_" + train['itemName']

val['brand_itemName'] = val['brand'] + "_" + val['itemName']
test['brand_itemName'] = test['brand'] + "_" + test['itemName']

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Assuming train has columns 'brand_itemName', 'price', and 'reviewTime' where 'reviewTime' is a date

# Convert 'reviewTime' to datetime
train['reviewTime'] = pd.to_datetime(train['reviewTime'])

# Filter out rows where 'price' is missing
filtered_train = train.dropna(subset=['price'])

# Group by 'brand_itemName' and 'reviewTime', then calculate the mean price
grouped_train = filtered_train.groupby(['brand_itemName', 'reviewTime'])['price'].mean().reset_index()

# Plotting
for item in grouped_train['brand_itemName'].unique()[:100]:
    item_train = grouped_train[grouped_train['brand_itemName'] == item]
    plt.plot(item_train['reviewTime'], item_train['price'], label=item)

plt.xlabel('Review Time')
plt.ylabel('Price')
plt.title('Price Change Over Time for Each Item')
plt.show()

#### # It seems prices don't change over time, so we can safely use the train averages over all periods to fill missing prices

In [None]:
import pandas as pd

# Calculate the average price for each item
average_prices_dict = train.dropna(subset='price').groupby('brand_itemName')['price'].mean().to_dict()

# Fill missing prices using map and fillna
train['price'] = train['price'].fillna(train['brand_itemName'].map(average_prices_dict))
val['price'] = val['price'].fillna(val['brand_itemName'].map(average_prices_dict))
test['price'] = test['price'].fillna(test['brand_itemName'].map(average_prices_dict))

In [None]:
train['price'].isna().sum()/len(train)

#### # still a lot of NA's. let's see if we can fill those with the mean prices of ['category', 'brand'] groups

In [None]:
train_group_cat_brand_price = train.groupby(['category', 'brand'])['price'].agg(['mean', 'std'])
train_group_cat_brand_price

In [None]:
train_group_cat_brand_price['mean'].isna().sum()

#### # there are still some NA's for the category+brand price means, will fill them with the category means:

In [None]:
category_price_means = train.groupby(['category']).price.mean().to_dict()

# Extract category level from the multi-index of train_group_cat_brand_price
categories = train_group_cat_brand_price.index.get_level_values('category')

# Map the category means to fill missing mean prices
train_group_cat_brand_price['mean'] = train_group_cat_brand_price['mean'].fillna(pd.Series(categories, index=train_group_cat_brand_price.index).map(category_price_means))

In [None]:
assert train_group_cat_brand_price['mean'].isna().sum() == 0

#### # Now examine the coefficient of variation (CV = std / mean) of train_group_cat_brand_price to see whether the group means make a good approximation of the prices

In [None]:
# CV is the standard deviation relative to the mean: low values mean prices cluster around the group mean
train_group_cat_brand_price['CV'] = train_group_cat_brand_price['std'] / train_group_cat_brand_price['mean']
train_group_cat_brand_price

In [None]:
train_group_cat_brand_price['CV'].describe().round(2)

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Copy the CV values into a float array, then drop inf and NaN (groups with a single price have no std)
cv_values = np.array(train_group_cat_brand_price['CV'], dtype=float)
cv_values[np.isinf(cv_values)] = np.nan
cv_values_cleaned = cv_values[~np.isnan(cv_values)]

plt.figure(figsize=(10, 6))
sns.boxplot(x=cv_values_cleaned)
plt.title('Box Plot of Coefficient of Variation (CV)')
plt.xlabel('CV')
plt.xlim(left=0)
plt.show()

#### It looks like the coefficients of variation of prices within the category+brand groups are low enough - most prices in a group are close to the group's mean price.
#### # Fill the missing prices with those group means

In [None]:
# Convert the group means to a dictionary with a MultiIndex
mean_price_dict = train_group_cat_brand_price['mean'].to_dict()

# Create a MultiIndex in your original DataFrame for mapping
train['category_brand'] = pd.MultiIndex.from_frame(train[['category', 'brand']])

val['category_brand'] = pd.MultiIndex.from_frame(val[['category', 'brand']])
test['category_brand'] = pd.MultiIndex.from_frame(test[['category', 'brand']])

# Map the means and fill in missing values
train['price'] = train['price'].fillna(train['category_brand'].map(mean_price_dict))

val['price'] = val['price'].fillna(val['category_brand'].map(mean_price_dict))
test['price'] = test['price'].fillna(test['category_brand'].map(mean_price_dict))

# Optionally, you can drop the 'category_brand' column if it's no longer needed
train.drop(['category_brand','brand_itemName'], axis=1, inplace=True, errors='ignore')

val.drop(['category_brand','brand_itemName'], axis=1, inplace=True, errors='ignore')
test.drop(['category_brand','brand_itemName'], axis=1, inplace=True, errors='ignore')

In [None]:
assert train['price'].isna().sum()/len(train) == 0

In [None]:
val['price'].isna().sum()/len(val)

#### # Fill the remaining val and test NA's with train's category_price_means

In [None]:
val['price'] = val['price'].fillna(val['category'].map(category_price_means))
test['price'] = test['price'].fillna(test['category'].map(category_price_means))
assert val['price'].isna().sum()/len(val) == 0
assert test['price'].isna().sum()/len(test) == 0
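The three-level price imputation above (item mean, then category+brand mean, then category mean, all learned on train) can be wrapped in a pair of helpers so that the same fitted means are applied to every split. The sketch below is an alternative formulation rather than the cells' exact logic: every level uses only observed train prices, and `fit_price_levels` and `impute_price` are hypothetical names.

```python
import numpy as np
import pandas as pd


def fit_price_levels(train):
    """Mean observed train price at each level, from most to least specific."""
    known = train.dropna(subset=["price"])
    key_sets = [["brand", "itemName"], ["category", "brand"], ["category"]]
    return [(keys, known.groupby(keys)["price"].mean()) for keys in key_sets]


def impute_price(df, levels):
    """Fill missing prices level by level using means fitted on train."""
    price = df["price"].copy()
    for keys, means in levels:
        lookup = means.rename("level_mean").reset_index()
        # A left merge keeps df's row order; lookup keys are unique, so no rows are added
        level_mean = df[keys].merge(lookup, on=keys, how="left")["level_mean"]
        level_mean.index = df.index
        price = price.fillna(level_mean)
    return price


toy_train = pd.DataFrame({
    "brand": ["A", "A", "A", "B", "C"],
    "itemName": ["x", "x", "y", "z", "w"],
    "category": ["Pet", "Pet", "Pet", "Pet", "Pet"],
    "price": [10.0, np.nan, np.nan, 30.0, np.nan],
})
price_levels = fit_price_levels(toy_train)
toy_filled = impute_price(toy_train, price_levels)
print(toy_filled.tolist())
```

Rows whose category never appears with a price in train would still be missing after the last level and would need a global fallback.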

#### # Now, let's Group brands to brand_price_group

In [None]:
brand_price_group_mapper = pd.qcut(train.groupby('brand')['price'].mean(), 10).rename('brand_price_group').to_dict()
train['brand_price_group'] = train['brand'].map(brand_price_group_mapper)

val['brand_price_group'] = val['brand'].map(brand_price_group_mapper)
test['brand_price_group'] = test['brand'].map(brand_price_group_mapper)

train['brand_price_group']

In [None]:
assert train['brand_price_group'].isna().sum() == 0

In [None]:
val['brand_price_group'].isna().sum(), test['brand_price_group'].isna().sum()

In [None]:
train_brand_price_mode = train['brand_price_group'].mode()[0]
train_brand_price_mode

#### # fill NA's in val and test (brands that don't exist in train) with the train brand_price_group mode

In [None]:
val['brand_price_group'] = val['brand_price_group'].fillna(train_brand_price_mode)
test['brand_price_group'] = test['brand_price_group'].fillna(train_brand_price_mode)

assert val['brand_price_group'].isna().sum() == 0
assert test['brand_price_group'].isna().sum() == 0

In [None]:
plt.figure(figsize=(12,8))
sns.countplot(x='brand_price_group', data=train.sort_values(by='brand_price_group'))

plt.tight_layout()

#### # One-hot encode the 10 brand_price_group

In [None]:
train = pd.concat([train, pd.get_dummies(train['brand_price_group'], prefix='brand_price', drop_first=True)], axis=1)

val = pd.concat([val, pd.get_dummies(val['brand_price_group'], prefix='brand_price', drop_first=True)], axis=1)
test = pd.concat([test, pd.get_dummies(test['brand_price_group'], prefix='brand_price', drop_first=True)], axis=1)
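
Calling `pd.get_dummies` separately on each split only yields identical columns when every split contains every group. One possible way to guarantee matching columns, shown here as a sketch and not as the notebook's method, is to treat the train columns as the reference and reindex the other splits to them. The toy frames and names are hypothetical.

```python
import pandas as pd

# Hypothetical toy splits; 'mid' does not occur in val
toy_train = pd.DataFrame({'group': ['low', 'mid', 'high', 'mid']})
toy_val = pd.DataFrame({'group': ['low', 'low', 'high']})

train_dummies = pd.get_dummies(toy_train['group'], prefix='brand_price', drop_first=True)

# Encode val without drop_first, then keep exactly the train columns;
# groups missing from val become all-zero columns
val_dummies = (
    pd.get_dummies(toy_val['group'], prefix='brand_price')
    .reindex(columns=train_dummies.columns, fill_value=0)
    .astype(int)
)

print(train_dummies.columns.tolist())
print(val_dummies)
```

With this approach the model sees the same feature set for every split, regardless of which groups happen to occur in val or test.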

In [None]:
train

### 12. reviewTime

In [None]:
train['reviewTime'] = pd.to_datetime(train['reviewTime'])

val['reviewTime'] = pd.to_datetime(val['reviewTime'])
test['reviewTime'] = pd.to_datetime(test['reviewTime'])

In [None]:
train['reviewTime'].describe()

In [None]:
print(train['reviewTime'].min(), train['reviewTime'].max(),

val['reviewTime'].min(), val['reviewTime'].max(),
test['reviewTime'].min(), test['reviewTime'].max())

#### # There are 277 days in reviewTime, from Jan 2018 to Oct 10.
#### # We saw that prices don't change across dates.
#### # Not sure we'll do anything with those dates - we'll recommend based on all historical ratings, regardless of date

### 13. summary

In [None]:
train['summary']

In [None]:
train['summary'].isna().sum()

#### # add a column for summary_isna and fill NA's with ''

In [None]:
train['summary_isna'] = train['summary'].isna()*1

val['summary_isna'] = val['summary'].isna()*1
test['summary_isna'] = test['summary'].isna()*1

In [None]:
train['summary'] = train['summary'].fillna("")

val['summary'] = val['summary'].fillna("")
test['summary'] = test['summary'].fillna("")
assert train['summary'].isna().sum() == 0

#### # Create numeric features for the summary - summary_len, summary_n_words - even though we might not use this data in the end

In [None]:
train['summary_len'] = train['summary'].str.len()

val['summary_len'] = val['summary'].str.len()
test['summary_len'] = test['summary'].str.len()

In [None]:
train['summary_n_words'] = train['summary'].str.split().str.len()

val['summary_n_words'] = val['summary'].str.split().str.len()
test['summary_n_words'] = test['summary'].str.split().str.len()

In [None]:
train[[col for col in train if 'summary' in col]].describe()

#### # 2% of ratings don't have a summary, the median summary has 12 characters and 2 words - most probably that's just "X stars"

### 14. reviewText

In [None]:
train['reviewText']

In [None]:
train['reviewText'].describe()

In [None]:
train['reviewText'].isna().sum()

#### # add a column for reviewText_isna and fill NA's with ''

In [None]:
train['reviewText_isna'] = train['reviewText'].isna()*1

val['reviewText_isna'] = val['reviewText'].isna()*1
test['reviewText_isna'] = test['reviewText'].isna()*1

In [None]:
train['reviewText'] = train['reviewText'].fillna("")

val['reviewText'] = val['reviewText'].fillna("")
test['reviewText'] = test['reviewText'].fillna("")
assert train['reviewText'].isna().sum() == 0

#### # Create numeric features for the reviewText - reviewText_len, reviewText_n_words - even though we might not use this data in the end

In [None]:
train['reviewText_len'] = train['reviewText'].str.len()

val['reviewText_len'] = val['reviewText'].str.len()
test['reviewText_len'] = test['reviewText'].str.len()

In [None]:
train['reviewText_n_words'] = train['reviewText'].str.split().str.len()

val['reviewText_n_words'] = val['reviewText'].str.split().str.len()
test['reviewText_n_words'] = test['reviewText'].str.split().str.len()

In [None]:
train[[col for col in train if 'reviewText' in col]].describe()

#### # 3% of reviewText values are missing, and the median reviewText has 70 characters in 13 words

### 15. vote

In [None]:
df['vote'].describe()

In [None]:
df['vote'].isna().sum()

#### # vote is a different kind of feature than the other ones - we might use it as weights in the model

## Handle Outliers

In [None]:
train_statistics = train.drop(columns=target).describe(include='all').T
train_statistics

In [None]:
orig_cols

In [None]:
numeric_cols_no_target = [col for col in train.describe().columns if target not in col if col in train.columns]
numeric_cols_no_target

In [None]:

from functools import partial  # needed below; harmless if utils already exports it

# Add outlier column indicator, having 1 for outlier rows
train_numeric_features = numeric_cols_no_target  # When None, assume train dataset and find all relevant columns
train_n_uniques = train[numeric_cols_no_target].nunique()
train_numeric_features = train_n_uniques[train_n_uniques>2].index.tolist()
train, train_outiler_cols = add_outlier_indicators_on_features(train, train_statistics,
                                                               X_train_numeric_features=train_numeric_features,
                                                                 outlier_col_suffix=outlier_col_suffix)

# if outliers exist, update outlier statistics to train_statistics
if len(train_outiler_cols) > 0:
    train_statistics = add_new_features_statistics_to_train_statistics(train, train_statistics, train_outiler_cols)

# Apply outlier indicators on validation and test

# get train outlier columns
train_outiler_cols = get_train_features_with_suffix(train_statistics, the_suffix=outlier_col_suffix)
# if outliers exist in train, add outlier indicators to val and test in those specific features
if len(train_outiler_cols) > 0:
    add_outlier_indicators_on_features_fn = partial(add_outlier_indicators_on_features,
                                                    the_train_statistics=train_statistics,
                                                    X_train_numeric_features=train_outiler_cols,
                                                    outlier_col_suffix=outlier_col_suffix)
    val, _ = add_outlier_indicators_on_features_fn(val)
    test, _ = add_outlier_indicators_on_features_fn(test)

    # Validate outliers detection: Test if train outlier statistics are different from val outlier statistics
    remove_suffix = False
    train_outlier_cols = get_train_features_with_suffix(train_statistics, the_suffix=outlier_col_suffix,
                                                        remove_suffix=remove_suffix)
    remove_suffix = True
    train_orig_outlier_cols = get_train_features_with_suffix(train_statistics, the_suffix=outlier_col_suffix,
                                                             remove_suffix=remove_suffix)
    train_outliers = train.loc[(train[train_outlier_cols] == 1).any(axis=1), train_orig_outlier_cols]
    val_outliers = val.loc[(val[train_outlier_cols] == 1).any(axis=1), train_orig_outlier_cols]
    print(f"\n# The train outliers:\n {train_outliers}")
    trains_dict_to_test = {'val_outliers': val_outliers}
    # train_val_outlier_means_test = test_if_features_statistically_different(train_outliers, trains_dict_to_test,
    #                                                                         alpha=alpha)
    # print('\n# Test if train and validation outliers means are statistically not different:\n',
    #       train_val_outlier_means_test)

# Impute outliers features

train_statistics = add_winsorization_values_to_train_statistics(train.drop(columns=target), train_statistics)
train = pd.concat([winsorize_outliers(train.drop(columns=target), train_statistics), train[target]], axis=1)
val = pd.concat([winsorize_outliers(val.drop(columns=target), train_statistics), val[target]], axis=1)
test = pd.concat([winsorize_outliers(test.drop(columns=target), train_statistics), test[target]], axis=1)
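
The helper functions used above live in `src/` and are not shown here, so the following is only a sketch of the general idea: clipping bounds are learned on train and then applied unchanged to every split. It assumes an IQR rule (1.5 × IQR), which may differ from what `winsorize_outliers` actually does. The function names `fit_winsor_bounds` and `apply_winsor_bounds` are hypothetical.

```python
import pandas as pd

def fit_winsor_bounds(train_df, cols, k=1.5):
    """Compute lower/upper clipping bounds per column from train only (IQR rule, an assumption)."""
    q1 = train_df[cols].quantile(0.25)
    q3 = train_df[cols].quantile(0.75)
    iqr = q3 - q1
    return pd.DataFrame({'lower': q1 - k * iqr, 'upper': q3 + k * iqr})

def apply_winsor_bounds(df, bounds):
    """Clip each column to the train bounds; returns a copy."""
    out = df.copy()
    cols = bounds.index.tolist()
    out[cols] = out[cols].clip(lower=bounds['lower'], upper=bounds['upper'], axis=1)
    return out

# Hypothetical toy data
toy_train = pd.DataFrame({'price': [1.0, 2.0, 3.0, 4.0, 100.0]})
toy_val = pd.DataFrame({'price': [2.5, 500.0, -50.0]})

bounds = fit_winsor_bounds(toy_train, ['price'])
toy_train_w = apply_winsor_bounds(toy_train, bounds)
toy_val_w = apply_winsor_bounds(toy_val, bounds)
print(bounds)
print(toy_val_w)
```

Extreme values in val and test are pulled to the train bounds, so the model never sees a value outside the range it was trained on.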

# Add Collaborative filtering features

In [None]:
import gc
gc.collect()

#### # The following lines will take a few minutes to run...

In [None]:
# Create a user-item matrix
user_item_matrix = train.pivot_table(index='userName', columns='item_id', values='rating')

## User-Based Collaborative Filtering - how similar users rated an item

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Compute cosine similarity between users
user_similarity = cosine_similarity(user_item_matrix.fillna(0))

# Convert to DataFrame
user_similarity_df = pd.DataFrame(user_similarity, index=user_item_matrix.index, columns=user_item_matrix.index)
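
A toy example may help to see what the similarity matrix contains. Because missing ratings are filled with 0, two users with no item in common get similarity 0, and two users with identical ratings get similarity 1. The matrix below is hypothetical data, not taken from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy ratings (rows: users, columns: items); NaN = not rated
toy_matrix = pd.DataFrame(
    [[5.0, 4.0, np.nan],
     [5.0, 4.0, np.nan],
     [np.nan, np.nan, 3.0]],
    index=['u1', 'u2', 'u3'], columns=['i1', 'i2', 'i3'],
)

sim = pd.DataFrame(
    cosine_similarity(toy_matrix.fillna(0)),
    index=toy_matrix.index, columns=toy_matrix.index,
)
print(sim.round(2))
```

Note that `u2` reaches the same similarity to `u1` as `u1` has to itself, so a user is not always the unique top entry in its own similarity column.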

In [None]:
user_similarity_df.shape

In [None]:
def create_similar_users_dict(user_similarity_df, top_n=5):
    similar_users_dict = {}
    for user in user_similarity_df.index:
        # Get top N similar users; drop the user itself explicitly, since ties at
        # similarity 1.0 mean it is not guaranteed to be sorted first
        top_similar = user_similarity_df[user].drop(user).nlargest(top_n)
        similar_users_dict[user] = list(zip(top_similar.index, top_similar.values))
    return similar_users_dict

# Assuming user_similarity_df is your user-user similarity DataFrame
similar_users_dict = create_similar_users_dict(user_similarity_df)

In [None]:
similar_users_dict.keys()

In [None]:
similar_users_dict[' Boo']

In [None]:
train[train['userName'].isin([' Boo','J. V. Robinson'])].sort_values('item_id')

#### # Test similarity matrix - they're indeed similar - both in the items they rated and in their ratings!

In [None]:
train[train['userName'].isin([' Boo','kim tindell'])].sort_values('item_id')

#### # They're less similar

## Create features based on statistics on similar users: this will take about 30 minutes, depending on your hardware...

#### # The following cell is the slow part (about 30 minutes, as noted in the heading above)...

In [None]:
from tqdm import tqdm
import gc

def similar_users_weighted_average_features_for_item(user_id, item_id, similar_users_dict, train_df, n_similar_users=5, features=None, default_val=0):
    if features is None:
        features = ['rating', 'description_len', 'n_images', 'feature_len', 'price', 'summary_len', 'reviewText_len']
    # Check if the user_id exists in the similar_users_dict
    if user_id not in similar_users_dict:
        # Create a Series with default values and correct index names
        default_values = [default_val] * len(features)
        default_series = pd.Series(default_values, index=[f'sim_{n_similar_users}_users_{feature}' for feature in features])
        return default_series

    # Filter for the specific item
    train_df_filtered  = train_df[train_df['item_id'] == item_id]

    # Get top N similar users for the given user
    top_similar_users = similar_users_dict[user_id][:n_similar_users]

    # Create a DataFrame from the similar users list
    similar_users_df = pd.DataFrame(top_similar_users, columns=['userName', 'similarity_score'])

    # Merge with the ratings DataFrame
    item_features = pd.merge(similar_users_df, train_df_filtered , on='userName')

    # Calculate similarity-weighted averages for all features
    weighted_avgs = {}
    similarity_sum = item_features['similarity_score'].sum()
    for feature in features:
        if not item_features.empty and similarity_sum != 0:
            weighted_avg = (item_features[feature] * item_features['similarity_score']).sum() / similarity_sum
        else:
            # None of the similar users rated this item (or all similarities are zero)
            weighted_avg = default_val
        weighted_avgs[feature] = weighted_avg

    # Convert the result to a Series, named the same way as the default Series above
    result_series = pd.Series(list(weighted_avgs.values()), index=[f'sim_{n_similar_users}_users_{feature}' for feature in features])

    return result_series

def process_in_chunks(the_df, similar_users_dict, train_df, features, chunk_size=1000, n_similar_users=5):
    # Create chunks of the DataFrame
    chunks = (the_df.iloc[i:i + chunk_size] for i in range(0, the_df.shape[0], chunk_size))

    # Initialize an empty list to store processed chunks
    processed_chunks = []

    # Iterate through the chunks
    for chunk in tqdm(chunks, total=-(-the_df.shape[0] // chunk_size)):  # ceiling division counts the last partial chunk
        # Apply the function to calculate weighted averages for the features
        weighted_avg_features = chunk.apply(lambda row: similar_users_weighted_average_features_for_item(
            user_id=row['userName'],
            item_id=row['item_id'],
            similar_users_dict=similar_users_dict,
            train_df=train_df,
            features=features,
            n_similar_users=n_similar_users
        ), axis=1)
        # Concatenate the result to the original chunk
        chunk = pd.concat([chunk, weighted_avg_features], axis=1)
        # Append the processed chunk to the list
        processed_chunks.append(chunk)

        # After processing each chunk
        del chunk
        gc.collect()

    # Combine the processed chunks back into a single DataFrame
    return pd.concat(processed_chunks)

# Define the list of features you want to calculate weighted averages for
features = ['rating', 'description_len', 'n_images', 'feature_len', 'price', 'summary_len', 'reviewText_len']


IMPORT_DFS = True
SAVE_DFS = False
# Process each dataset
chunk_size = 500

if IMPORT_DFS:
    # Load the previously computed features from disk
    train_processed = pd.read_csv("train_processed.csv")
    val_processed = pd.read_csv("val_processed.csv")
    test_processed = pd.read_csv("test_processed.csv")
else:
    # Similar-user statistics always come from train, also for val and test
    train_processed = process_in_chunks(train, similar_users_dict, train, features, chunk_size=chunk_size)
    val_processed = process_in_chunks(val, similar_users_dict, train, features, chunk_size=chunk_size)
    test_processed = process_in_chunks(test, similar_users_dict, train, features, chunk_size=chunk_size)

if SAVE_DFS:
    train_processed.to_csv("train_processed.csv", index=False)
    val_processed.to_csv("val_processed.csv", index=False)
    test_processed.to_csv("test_processed.csv", index=False)
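
The core of the feature computation above is a similarity-weighted mean over the similar users who rated the same item. A worked example on hypothetical numbers (the users, scores and ratings below are made up):

```python
import pandas as pd

# Hypothetical similar users of one target user, with their similarity scores
similar_users = pd.DataFrame({
    'userName': ['u2', 'u3', 'u4'],
    'similarity_score': [0.9, 0.5, 0.1],
})
# Train rows for a single item; u4 did not rate it
item_rows = pd.DataFrame({'userName': ['u2', 'u3'], 'rating': [5.0, 3.0]})

# Inner merge keeps only the similar users who rated the item
rated = similar_users.merge(item_rows, on='userName')

# Similarity-weighted mean: sum(sim * rating) / sum(sim)
weighted_rating = (rated['rating'] * rated['similarity_score']).sum() / rated['similarity_score'].sum()
print(round(weighted_rating, 3))
```

Here the result is (0.9 × 5 + 0.5 × 3) / (0.9 + 0.5), so the more similar user `u2` pulls the estimate toward its own rating of 5.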

## Item-Based Collaborative Filtering - ratings the user gave to similar items

In [None]:
train = train_processed
val = val_processed
test = test_processed
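
The cell above only swaps in the processed frames and does not compute any item-based feature. For reference, a rough sketch of what an item-based estimate could look like is given below. Everything in it is an assumption: the toy matrix, the helper name `item_based_estimate`, and the choice of cosine similarity on zero-filled ratings.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy ratings (rows: users, columns: items); NaN = not rated
toy_matrix = pd.DataFrame(
    [[5.0, 4.0, np.nan],
     [4.0, 5.0, 1.0],
     [1.0, 2.0, 5.0]],
    index=['u1', 'u2', 'u3'], columns=['i1', 'i2', 'i3'],
)

# Item-item similarity: transpose so that rows are items
filled = toy_matrix.fillna(0)
item_sim = pd.DataFrame(cosine_similarity(filled.T), index=filled.columns, columns=filled.columns)

def item_based_estimate(user, item, ratings, item_sim):
    """Similarity-weighted mean of the ratings `user` gave to the other items."""
    user_ratings = ratings.loc[user].drop(item).dropna()
    weights = item_sim.loc[item, user_ratings.index]
    if weights.sum() == 0:
        return np.nan
    return float((user_ratings * weights).sum() / weights.sum())

estimate = item_based_estimate('u1', 'i3', toy_matrix, item_sim)
print(round(estimate, 2))
```

The estimate for `u1` on `i3` is a weighted mean of the ratings `u1` gave to `i1` and `i2`, so it necessarily falls between those two ratings.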

# Feature Engineering - from current features

In [None]:
numeric_cols = [col for col in train.describe().columns if 'sim_' not in col]
numeric_cols

In [None]:
train_user_stats_for_new_features = train.groupby('userName')[numeric_cols].agg(['min','mean','median','max','std']).fillna(0)
train_user_stats_for_new_features.columns = [a + "_" + b for a,b in train_user_stats_for_new_features.columns]
train_user_stats_for_new_features = train_user_stats_for_new_features.add_prefix('user_')
train_user_stats_for_new_features

In [None]:
train_item_stats_for_new_features = train.groupby('item_id')[numeric_cols].agg(['min','mean','median','max','std']).fillna(0)
train_item_stats_for_new_features.columns = [a + "_" + b for a,b in train_item_stats_for_new_features.columns]
train_item_stats_for_new_features = train_item_stats_for_new_features.add_prefix('item_')
train_item_stats_for_new_features

In [None]:
train.shape, val.shape, test.shape

In [None]:
train = train.merge(train_user_stats_for_new_features, on='userName', how='left')
train = train.merge(train_item_stats_for_new_features, on='item_id', how='left')

val = val.merge(train_user_stats_for_new_features, on='userName', how='left').fillna(0)
val = val.merge(train_item_stats_for_new_features, on='item_id', how='left').fillna(0)
test = test.merge(train_user_stats_for_new_features, on='userName', how='left').fillna(0)
test = test.merge(train_item_stats_for_new_features, on='item_id', how='left').fillna(0)

train.shape, val.shape, test.shape
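
The pattern used above (aggregate on train, flatten the column names, left-merge into each split) is easier to follow on a tiny example. The frames and names below are hypothetical, and only two statistics are computed.

```python
import pandas as pd

# Hypothetical toy splits; 'u9' is unseen in train
toy_train = pd.DataFrame({
    'userName': ['u1', 'u1', 'u2'],
    'rating': [5.0, 3.0, 4.0],
})
toy_val = pd.DataFrame({'userName': ['u1', 'u9'], 'rating': [4.0, 2.0]})

# Per-user statistics from train only; flatten the (column, statistic) MultiIndex
user_stats = toy_train.groupby('userName')[['rating']].agg(['mean', 'max'])
user_stats.columns = [f'user_{col}_{stat}' for col, stat in user_stats.columns]

# Left merge keeps every val row; users unseen in train get 0, as in the cell above
toy_val = toy_val.merge(user_stats, on='userName', how='left').fillna(0)
print(toy_val)
```

A fill value of 0 lies outside the 1 to 5 rating range, so it also works as an implicit "unseen user" flag. Whether that is preferable to a global train mean is a modeling choice.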

# Drop features that are specific to review (and not to users or items)

In [None]:
train.shape, val.shape, test.shape

In [None]:
users_items_features_and_target = train.columns[train.columns.str.contains('user|item|brand|category')].tolist() + [target]
users_items_features_and_target = [col for col in users_items_features_and_target if col in val.columns if col in test.columns]
train = train[users_items_features_and_target]
val = val[users_items_features_and_target]
test = test[users_items_features_and_target]

In [None]:
train.shape, val.shape, test.shape

In [None]:
users_items_features_and_target

In [None]:
import gc
gc.collect()

# Some more EDA of final features

## NA's and dtypes

In [None]:
# These were already marked 'NA', but the marker probably didn't survive the CSV export and import after the collaborative filtering features (by default pd.read_csv parses the string 'NA' as NaN)
train.loc[train.brand.isna(), 'brand'] = 'NA'

In [None]:
assert train.isna().sum().sum() == 0
assert val.isna().sum().sum() == 0
assert test.isna().sum().sum() == 0

In [None]:
train.info()

## statistics

In [None]:
train.describe(include='all')

## nunique values

In [None]:
print(train.nunique())

## modes frequencies

In [None]:
get_mode_and_freq(train)

## plots

In [None]:
#train = pd.concat([train.reset_index(drop=True), y_train.reset_index(drop=True)], axis=1)
train_small = train.sample(frac=0.01)

### features relationship with features - not executed, takes too long to run

In [None]:
#numeric_cols = train_small.describe().columns
#numeric_cols

In [None]:
#train_small[numeric_cols][:10]

In [None]:
# sns.pairplot()
# plt.show()
# #plt.tight_layout()

### features relationship with target - not executed, takes too long to run

In [None]:
# sns.pairplot(train_small[:100], hue=target)
# plt.show()
# #plt.tight_layout()

In [None]:
train.shape

## Correlations

## Pearson - linear correlation between two continuous variables

In [None]:
corr_features = get_correlation_stats(train, method='pearson', strong_corr_val = 0.5, figsize=(14,10), annot=False)
corr_features

#### # Drop features that are almost perfectly multicollinear (corr over 0.9)

In [None]:
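# Exclude the target (both its row and its column) so features are only compared with other features
corr_features_to_drop = corr_features.drop(columns=target, index=target, errors='ignore')

In [None]:
# Identify highly correlated feature pairs and mark the first feature of each pair for removal
to_drop = set()
for i in range(len(corr_features_to_drop.columns)):
    for j in range(i+1, len(corr_features_to_drop.columns)):
        if abs(corr_features_to_drop.iloc[i, j]) > 0.9:
            colname = corr_features_to_drop.columns[i]
            to_drop.add(colname)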


# Drop identified features from the original DataFrame
train = train.drop(columns=to_drop)
val = val.drop(columns=to_drop)
test = test.drop(columns=to_drop)
train.shape
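
A common vectorized variant of this pairwise check uses the upper triangle of the absolute correlation matrix. The sketch below runs on hypothetical random data. Note that it removes the later column of each highly correlated pair, which is a slightly different choice from the loop above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy features
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'a_copy': a * 2 + 0.01 * rng.normal(size=200),  # almost perfectly correlated with 'a'
    'b': rng.normal(size=200),
})

corr = toy.corr(method='pearson').abs()
# Keep only the upper triangle (k=1 excludes the diagonal) so each pair is checked once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# Drop the later column of every pair whose |corr| exceeds the threshold
to_drop_toy = [col for col in upper.columns if (upper[col] > 0.9).any()]
print(to_drop_toy)
```

Masking the diagonal and the lower triangle ensures a feature is never compared with itself and that only one member of each pair is removed.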

In [None]:
corr_features = get_correlation_stats(train, method='pearson', strong_corr_val = 0.5, figsize=(14,10), annot=False)
corr_features

In [None]:
corr_features[target] if target in corr_features else ""

In [None]:
train.shape, val.shape, test.shape

## Spearman - "rank pearson" - monotonic (not necessarily linear) correlation between two continuous or ordinal variables

In [None]:
#corr_features = get_correlation_stats(X_train, method='spearman', strong_corr_val=0.5, figsize=(14,10), annot=False)
#corr_features

In [None]:
#corr_features[target] if target in corr_features else ""

## Kendall - concordant pairs - monotonic rank correlation between two ordinal variables

In [None]:
#corr_features = get_correlation_stats(X_train, method='kendall', strong_corr_val=0.4, figsize=(14,10), annot=False)
#corr_features

In [None]:
#corr_features[target] if target in corr_features else ""
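
The difference between the three coefficients shows up on a relationship that is monotonic but not linear. A small illustration on made-up data (it relies on SciPy being installed, which pandas needs for the rank-based methods on a Series):

```python
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = x ** 3  # monotonic but non-linear

pearson = x.corr(y, method='pearson')
spearman = x.corr(y, method='spearman')
kendall = x.corr(y, method='kendall')
print(pearson, spearman, kendall)
```

Spearman and Kendall both reach 1 because the ordering of `y` matches the ordering of `x` exactly, while Pearson stays below 1 because the points do not lie on a straight line.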

## # Only the user and item statistics derived from rating have a strong correlation to the target. Perhaps we need feature combinations, new features (user statistics, item statistics, sentiment) or a different approach - a collaborative filtering recommender system

In [None]:
# ## Test if train statistics are different from val and test statistics
# trains_dict_to_test = {'val': val, 'test': test}
# train_val_outlier_means_test = test_if_features_statistically_different(train, trains_dict_to_test, alpha=alpha)
# print('\n# Test if train, validation and test sets means are statistically not different:\n',
#       train_val_outlier_means_test)

In [None]:
#sum(train_val_outlier_means_test['val mean is the same with 99% significance']==True)

In [None]:
#sum(train_val_outlier_means_test['val mean is the same with 99% significance']==False)

In [None]:
#sum(train_val_outlier_means_test['test mean is the same with 99% significance']==True)

In [None]:
#sum(train_val_outlier_means_test['test mean is the same with 99% significance']==False)

## # About half the features have a different distribution in train and test. This might bias predictions, so consider removing those

# Normalize dataset

In [None]:
features = train.drop(columns=target).describe().columns.tolist()
len(features), features

In [None]:
[col for col in features if 'sim' in col]

In [None]:
X_train = train.drop(columns=target)
y_train = train[target]

X_val = val.drop(columns=target)
y_val = val[target]

X_test = test.drop(columns=target)
y_test = test[target]

In [None]:
from sklearn.preprocessing import StandardScaler

# Initialize the scaler
scaler = StandardScaler()

# Fit on training data
scaler.fit(X_train[features])

# Transform the datasets
X_train_scaled = scaler.transform(X_train[features])
X_val_scaled = scaler.transform(X_val[features])
X_test_scaled = scaler.transform(X_test[features])

In [None]:
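# Rebuild DataFrames from the scaled arrays, keeping the feature names and the original row index
# so that the later concat with y_train / y_val / y_test aligns row by row
X_train_scaled = pd.DataFrame(X_train_scaled, columns=features, index=X_train.index)
X_val_scaled = pd.DataFrame(X_val_scaled, columns=features, index=X_val.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=features, index=X_test.index)

X_train_scaled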

In [None]:
X_train_scaled.agg(['mean','std']).round(2)

# final preprocessing

In [None]:
# # fix columns names
# X_train = replace_columns_spaces_with_underscores(X_train)
# X_val = replace_columns_spaces_with_underscores(X_val)
# X_test = replace_columns_spaces_with_underscores(X_test)
# train_statistics = replace_columns_spaces_with_underscores(train_statistics.T).T

# Modeling

## Baseline model - mean rating per user

In [None]:
train = pd.concat([X_train_scaled, y_train], axis=1)
train.index = X_train['userName']

val = pd.concat([X_val_scaled, y_val], axis=1)
val.index = X_val['userName']

test = pd.concat([X_test_scaled, y_test], axis=1)
test.index = X_test['userName']

train[train.index == 'Amazossn Customerccocooper17o']

In [None]:
baseline_pred_dict = train.groupby('userName')[target].mean().to_dict()
baseline_pred_dict

In [None]:
val[~val.index.isin(train.index)]

In [None]:
val.index.nunique()

In [None]:
val[~val.index.isin(train.index)].index.nunique()

In [None]:
218/23751

### # drop 0.9% val users (218) that are not in train - so that the comparison of baseline to other models will be fair

In [None]:
val = val[val.index.isin(train.index)]
X_val = X_val[X_val['userName'].isin(val.index)]
X_val.index = val.index

In [None]:
# val_n_ratings = val.groupby('userName').size()
# val_n_ratings.value_counts(normalize=True)

In [None]:
# val0 = val.copy(deep=True)

In [None]:
## drop val users without 5 ratings - we need to test if prediction is good for user's top 5 ratings

In [None]:
# val = val[val.index.isin(val_n_ratings[val_n_ratings>=5].index)]
# X_val = X_val[X_val.index.isin(val_n_ratings[val_n_ratings>=5].index)]

In [None]:
pred_col = f'{target}_pred_baseline'
val[pred_col] = val.index.map(baseline_pred_dict)
val[[target, pred_col]]
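
The notebook drops validation users who are missing from train before computing this baseline. An alternative, shown here only as a sketch on hypothetical data, is to keep those users and fall back to the global train mean. The names `user_mean` and `global_mean` are made up for the example.

```python
import pandas as pd

# Hypothetical toy splits; 'u9' is unseen in train
toy_train = pd.DataFrame({'userName': ['u1', 'u1', 'u2'], 'rating': [5.0, 3.0, 2.0]})
toy_val = pd.DataFrame({'userName': ['u1', 'u2', 'u9'], 'rating': [4.0, 1.0, 5.0]})

user_mean = toy_train.groupby('userName')['rating'].mean()
global_mean = toy_train['rating'].mean()

# Per-user mean where available, global train mean for users unseen in train
toy_val['rating_pred_baseline'] = toy_val['userName'].map(user_mean).fillna(global_mean)
print(toy_val)
```

Dropping unseen users keeps the comparison between models simple, while a fallback like this one lets the baseline produce a prediction for every validation row.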

In [None]:
from sklearn.metrics import max_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.metrics import r2_score

def get_regression_metrics(the_df, target_col, pred_col, the_metrics_df=None, regression_metrics=None, model_name='baseline'):
    if regression_metrics is None:
        regression_metrics = [mean_absolute_error, max_error, mean_absolute_percentage_error, mean_squared_error, r2_score]
    the_metrics_df = pd.DataFrame() if the_metrics_df is None else the_metrics_df
    the_metrics = {}
    the_metrics['userName'] = the_df.index.nunique()
    the_metrics['item_ids'] = the_df['item_id'].nunique()
    the_metrics[f'mean_{target_col}'] = the_df[target_col].mean()
    for metric in regression_metrics:
        the_metrics[metric.__name__] = metric(the_df[target_col], the_df[pred_col])
    # Mean absolute error per actual rating value, computed without mutating the_df
    error_abs = (the_df[target_col] - the_df[pred_col]).abs()
    bins_error_abs = error_abs.groupby(the_df[target_col].to_numpy()).mean().add_prefix(f"{target_col}_").add_suffix("_error").to_dict()
    the_metrics.update(bins_error_abs)
    the_metrics = pd.DataFrame.from_dict(the_metrics, columns=[model_name], orient='index').T
    the_metrics_df = pd.concat([the_metrics_df, the_metrics], axis=0)


    return the_metrics_df

metrics = pd.DataFrame()
metrics = get_regression_metrics(pd.concat([val, X_val['item_id']], axis=1), target, f'{target}_pred_baseline')
metrics

## Linear regression - worse than baseline: the coefficients are extremely large and many features have zero importance. Since we have many features, we'll try LASSO next for feature selection

In [None]:
from sklearn.linear_model import LinearRegression

reg = LinearRegression(n_jobs=-1).fit(train.drop(columns=target), train[target])
reg.score(train.drop(columns=target), train[target])

In [None]:
pd.set_option('display.float_format', '{:.3f}'.format)
coefs = reg.coef_
features_importance = pd.DataFrame(coefs, columns=['importance'], index=features).round(3)
features_importance['importance_abs'] = features_importance['importance'].abs()
features_importance = features_importance.sort_values('importance_abs', ascending=False)

import matplotlib.pyplot as plt

# `reg` is the fitted Linear Regression model and `features` is the list of feature names
display(features_importance)

# Create a plot
plt.figure(figsize=(10, 6))
plt.barh(features_importance.head(20).index[::-1], features_importance.head(20)[::-1]['importance'])
plt.xlabel('Coefficient Value')
plt.ylabel('Feature')
plt.title('Linear Regression Feature Importance')
plt.show()

In [None]:
pred_col = f'{target}_pred_{reg.__str__()}'
val[pred_col] = reg.predict(val.drop(columns=[col for col in val if col==target or 'pred' in col]))
val[[target]+[col for col in val.columns if 'pred' in col]]

In [None]:
model_name = reg.__str__()
the_df = pd.concat([val, X_val['item_id']], axis=1)
target_col = target
the_metrics_df = metrics
regression_metrics = None
metrics = get_regression_metrics(the_df, target_col, pred_col, the_metrics_df, regression_metrics, model_name)
metrics

## Linear regression - Lasso - worse than baseline, only one important feature - item_rating_mean. Perhaps we need something more complex, like RF

In [None]:
from sklearn import linear_model

reg = linear_model.Lasso(alpha=0.5).fit(train.drop(columns=target), train[target])
reg.score(train.drop(columns=target), train[target])

In [None]:
pd.set_option('display.float_format', '{:.3f}'.format)
coefs = reg.coef_
features_importance = pd.DataFrame(coefs, columns=['importance'], index=features).round(3)
features_importance['importance_abs'] = features_importance['importance'].abs()
features_importance = features_importance.sort_values('importance_abs', ascending=False)

import matplotlib.pyplot as plt

# `reg` is the fitted Lasso model and `features` is the list of feature names
display(features_importance)

# Create a plot
plt.figure(figsize=(10, 6))
plt.barh(features_importance.head(20).index[::-1], features_importance.head(20)[::-1]['importance'])
plt.xlabel('Coefficient Value')
plt.ylabel('Feature')
plt.title('Lasso Regression Feature Importance')
plt.show()

In [None]:
pred_col = f'{target}_pred_{reg.__str__()}'
val[pred_col] = reg.predict(val.drop(columns=[col for col in val if col==target or 'pred' in col]))
val[[target]+[col for col in val.columns if 'pred' in col]]

In [None]:
model_name = reg.__str__()
the_df = pd.concat([val, X_val['item_id']], axis=1)
target_col = target
the_metrics_df = metrics
regression_metrics = None
metrics = get_regression_metrics(the_df, target_col, pred_col, the_metrics_df, regression_metrics, model_name)
metrics

## Random Forest Regressor - for max_depth=2 still worse than baseline, but 3 important features.

In [None]:
from sklearn.ensemble import RandomForestRegressor

regr = RandomForestRegressor(max_depth=2, random_state=0, n_jobs=-1)
regr.fit(train.drop(columns=target), train[target])
regr.score(train.drop(columns=target), train[target])

In [None]:
import matplotlib.pyplot as plt

# `regr` is the trained Random Forest model (max_depth=2)
# and `features` is the list of feature names

# Get feature importances
importances = regr.feature_importances_

# Create a DataFrame for easier handling
feature_importances = pd.DataFrame({'Feature': features, 'Importance': importances})

# Sort by importance
feature_importances = feature_importances.sort_values(by='Importance', ascending=False)

feature_importances = feature_importances.head(20)
# Plot
plt.figure(figsize=(10, 6))
plt.barh(feature_importances['Feature'], feature_importances['Importance'])
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title('Random Forest Feature Importances')
plt.gca().invert_yaxis()  # To display the highest importance at the top
plt.show()

In [None]:
pred_col = f'{target}_pred_{regr.__str__()}'
val[pred_col] = regr.predict(val.drop(columns=[col for col in val if col==target or 'pred' in col]))
val[[target]+[col for col in val.columns if 'pred' in col]]

In [None]:
model_name = regr.__str__()
the_df = pd.concat([val, X_val['item_id']], axis=1)
target_col = target
the_metrics_df = metrics
regression_metrics = None
metrics = get_regression_metrics(the_df, target_col, pred_col, the_metrics_df, regression_metrics, model_name)
metrics

In [None]:
## For max_depth=None, better than baseline on MAE and lower bins, a lot of important features - but might overfit?

In [None]:
from sklearn.ensemble import RandomForestRegressor

regr = RandomForestRegressor(random_state=0, n_jobs=-1)
regr.fit(train.drop(columns=target), train[target])
regr.score(train.drop(columns=target), train[target])

In [None]:
import matplotlib.pyplot as plt

# `regr` is the trained Random Forest model (max_depth=None)
# and `features` is the list of feature names

# Get feature importances
importances = regr.feature_importances_

# Create a DataFrame for easier handling
feature_importances = pd.DataFrame({'Feature': features, 'Importance': importances})

# Sort by importance
feature_importances = feature_importances.sort_values(by='Importance', ascending=False)

feature_importances = feature_importances.head(20)
# Plot
plt.figure(figsize=(10, 6))
plt.barh(feature_importances['Feature'], feature_importances['Importance'])
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title('Random Forest Feature Importances')
plt.gca().invert_yaxis()  # To display the highest importance at the top
plt.show()

In [None]:
pred_col = f'{target}_pred_{regr.__str__()}'
val[pred_col] = regr.predict(val.drop(columns=[col for col in val if col==target or 'pred' in col]))
val[[target]+[col for col in val.columns if 'pred' in col]]

In [None]:
model_name = regr.__str__()
the_df = pd.concat([val, X_val['item_id']], axis=1)
target_col = target
the_metrics_df = metrics
regression_metrics = None
metrics = get_regression_metrics(the_df, target_col, pred_col, the_metrics_df, regression_metrics, model_name)
metrics

## LightGBM - many important features, bad predictions. default hyperparameters are overfitting. We'll do a grid search

In [None]:
import lightgbm as lgb

lgbm = lgb.LGBMRegressor()
old_columns = train.columns.copy(deep=True)
# Replace unsupported characters with an underscore or remove them
train.columns = ["".join(c if c.isalnum() else "_" for c in str(x)) for x in train.columns]

lgbm.fit(train.drop(columns=target), train[target])
lgbm.score(train.drop(columns=target), train[target])

In [None]:
import matplotlib.pyplot as plt

# `lgbm` is the trained LightGBM model
# and `features` is the list of feature names

# Get feature importances
importances = lgbm.feature_importances_

# Create a DataFrame for easier handling
feature_importances = pd.DataFrame({'Feature': features, 'Importance': importances})

# Sort by importance
feature_importances = feature_importances.sort_values(by='Importance', ascending=False)

feature_importances = feature_importances.head(20)
# Plot
plt.figure(figsize=(10, 6))
plt.barh(feature_importances['Feature'], feature_importances['Importance'])
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title(f'LGBM Feature Importances')
plt.gca().invert_yaxis()  # To display the highest importance at the top
plt.show()

In [None]:
pred_col = f'{target}_pred_{lgbm.__str__()}'
old_columns_val = val.columns.copy(deep=True)
# Replace unsupported characters with an underscore or remove them
val.columns = ["".join(c if c.isalnum() else "_" for c in str(x)) for x in val.columns]

val[pred_col] = lgbm.predict(val.drop(columns=[col for col in val if col==target or 'pred' in col]))
val[[target]+[col for col in val.columns if 'pred' in col]]

In [None]:
model_name = lgbm.__str__()  # label the metrics with the LightGBM model, not the earlier Random Forest
the_df = pd.concat([val, X_val['item_id']], axis=1)
target_col = target
the_metrics_df = metrics
regression_metrics = None
metrics = get_regression_metrics(the_df, target_col, pred_col, the_metrics_df, regression_metrics, model_name)
metrics

## LGBM grid search - no better results

In [None]:
from sklearn.model_selection import GridSearchCV

# Define a parameter grid to search
param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 5, 7],
    # Add other parameters here
}

# Initialize the GridSearchCV object
grid_search = GridSearchCV(estimator=lgbm, param_grid=param_grid, cv=3, scoring='neg_mean_squared_error')

# Perform the grid search
grid_search.fit(train.drop(columns=target), train[target])

# Print the best parameters
print("Best parameters found: ", grid_search.best_params_)

# Use the best model for predictions
best_model = grid_search.best_estimator_


In [None]:
best_model.score(train.drop(columns=target), train[target])

In [None]:
import matplotlib.pyplot as plt

# 'best_model' is the best LightGBM estimator found by the grid search
# and 'features' is the list of feature names

# Get feature importances
importances = best_model.feature_importances_

# Create a DataFrame for easier handling
feature_importances = pd.DataFrame({'Feature': features, 'Importance': importances})

# Sort by importance
feature_importances = feature_importances.sort_values(by='Importance', ascending=False)

feature_importances = feature_importances.head(20)
# Plot
plt.figure(figsize=(10, 6))
plt.barh(feature_importances['Feature'], feature_importances['Importance'])
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title(f'LGBM grid searched Feature Importances')
plt.gca().invert_yaxis()  # To display the highest importance at the top
plt.show()

In [None]:
pred_col = f'{target}_pred_{best_model.__str__()}'
val[pred_col] = best_model.predict(val.drop(columns=[col for col in val if col==target or 'pred' in col]))
val[[target]+[col for col in val.columns if 'pred' in col]]

In [None]:
model_name = best_model.__str__()
the_df = pd.concat([val, X_val['item_id']], axis=1)
target_col = target
the_metrics_df = metrics
regression_metrics = None
metrics = get_regression_metrics(the_df, target_col, pred_col, the_metrics_df, regression_metrics, model_name)
metrics

## Final try - Random Forest grid search max_depth (grid search on all is too slow, LGBM much more efficient)

In [None]:
from sklearn.ensemble import RandomForestRegressor
# Range of max_depth values to explore
max_max_depth = 31
max_depth_range = np.arange(3, max_max_depth, 3)  # 3, 6, ..., 30

for max_depth in max_depth_range:
    print(f"max_depth: {max_depth}/{max_max_depth}")
    regr = RandomForestRegressor(max_depth=max_depth, random_state=0, n_jobs=-1)
    regr.fit(train.drop(columns=target), train[target])
    print(regr.score(train.drop(columns=target), train[target]))
    pred_col = f'{target}_pred_{regr.__str__()}'
    val[pred_col] = regr.predict(val.drop(columns=[col for col in val if col==target or 'pred' in col]))
    model_name = regr.__str__()
    the_df = pd.concat([val, X_val['item_id']], axis=1)
    target_col = target
    the_metrics_df = metrics
    regression_metrics = None
    metrics = get_regression_metrics(the_df, target_col, pred_col, the_metrics_df, regression_metrics, model_name)
metrics

In [None]:
metrics.reset_index().drop_duplicates(subset='index')[['index','mean_absolute_error']].sort_values(by='mean_absolute_error')

## We'll choose max_depth=None as it has the minimum mean absolute error.

## The baseline model does have a lower MSE (fewer ratings with large errors) but a higher MAE (generally worse than the RF), and there's also no practical way to use it: it essentially predicts the same fixed rating for every item the user might choose.
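
A small worked example may help show how a constant baseline can win on MSE while losing on MAE. The numbers below are made up for illustration and are not taken from the dataset; `baseline_pred` and `model_pred` are hypothetical predictions.

```python
import numpy as np

# Toy ratings (illustrative, not taken from the dataset)
y_true = np.array([5, 5, 5, 5, 5, 4, 4, 3, 1, 1], dtype=float)

# Baseline: predict the mean rating for everything
baseline_pred = np.full_like(y_true, y_true.mean())

# Hypothetical model: exact on most ratings, badly wrong on the two 1-star ones
model_pred = np.array([5, 5, 5, 5, 5, 4, 4, 3, 5, 5], dtype=float)

def mae(y, p):
    """Mean absolute error."""
    return np.mean(np.abs(y - p))

def mse(y, p):
    """Mean squared error."""
    return np.mean((y - p) ** 2)

print("baseline MAE/MSE:", mae(y_true, baseline_pred), mse(y_true, baseline_pred))
print("model    MAE/MSE:", mae(y_true, model_pred), mse(y_true, model_pred))
```

In this toy case the baseline has the lower MSE because squaring punishes the model's two large misses, while the model has the lower MAE because it is exact on the other eight ratings.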

# Model Validation on test

In [None]:
train['rating_pred_baseline']

In [None]:
from sklearn.ensemble import RandomForestRegressor

regr = RandomForestRegressor(random_state=0, n_jobs=-1)
regr.fit(train.drop(columns=target), train[target])
print(regr.score(train.drop(columns=target), train[target]))

## MAE is higher. As suspected, RF with max_depth=None is overfitting... it might also be a feature selection issue, but there's no time to explore :( Hopefully retraining on the entire dataset mitigates it a little.

In [None]:

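# Name the prediction column after the final model instead of reusing
# the pred_col left over from the max_depth loop above
pred_col = f'{target}_pred_test_{regr.__str__()}'
test[pred_col] = regr.predict(test.drop(columns=[col for col in test.columns if col==target or 'pred' in col]))
model_name = f'test_{regr.__str__()}'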
X_test.index = test.index
the_df = pd.concat([test, X_test['item_id']], axis=1)
target_col = target
the_metrics_df = metrics
regression_metrics = None
metrics = get_regression_metrics(the_df, target_col, pred_col, the_metrics_df, regression_metrics, model_name)
metrics

# Retrain on all data and save to pickle, save metrics to Excel

In [None]:
# Create Entire dataset: train + validation + test
df = pd.concat([train, val[train.columns], test[train.columns]], axis=0)
assert df.isna().sum().sum() == 0

In [None]:
df.shape

In [None]:
from sklearn.ensemble import RandomForestRegressor

regr = RandomForestRegressor(random_state=0, n_jobs=-1)
regr.fit(df.drop(columns=target), df[target])
print(regr.score(df.drop(columns=target), df[target]))

# Get feature importances
importances = regr.feature_importances_

# Create a DataFrame for easier handling
feature_importances = pd.DataFrame({'Feature': features, 'Importance': importances})

# Sort by importance
feature_importances = feature_importances.sort_values(by='Importance', ascending=False)

feature_importances_top = feature_importances.head(20)
# Plot
plt.figure(figsize=(10, 6))
plt.barh(feature_importances_top['Feature'], feature_importances_top['Importance'])
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title('Random Forest Feature Importances on entire dataset')
plt.gca().invert_yaxis()  # To display the highest importance at the top
plt.show()

In [None]:
print("This is our current trained production model:")
print(regr)

filename='amazon_recommendation_RF_model.pickle'
save_model_to_pickle(regr, filename)

metrics.to_excel("models_metrics.xlsx")
print("Finished training pipeline!")
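
For serving, the saved model has to be loaded back. The round trip below is a minimal sketch that assumes the notebook's `save_model_to_pickle` helper (defined in `utils`, not shown here) writes a standard pickle file; `save_to_pickle` and `load_from_pickle` are hypothetical stand-ins, and a plain dict takes the place of the trained model.

```python
import os
import pickle
import tempfile

def save_to_pickle(obj, filename):
    """Hypothetical stand-in for the notebook's save_model_to_pickle helper."""
    with open(filename, "wb") as f:
        pickle.dump(obj, f)

def load_from_pickle(filename):
    """Load an object written by save_to_pickle."""
    with open(filename, "rb") as f:
        return pickle.load(f)

# A plain dict stands in for the trained model in this sketch
fake_model = {"name": "RandomForestRegressor", "random_state": 0}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pickle")
    save_to_pickle(fake_model, path)
    restored = load_from_pickle(path)

print(restored == fake_model)
```

Pickle files can execute arbitrary code when loaded, so only files from a trusted source should be unpickled.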


# How to make predictions in production

In [None]:
df['item_id'] = pd.concat([X_train['item_id'], X_val['item_id'], X_test['item_id']], axis=0, ignore_index=True).values

## Our features fall into two groups: item-specific and user-specific features, which can all be looked up in df,
## and similarity features, which should be calculated on demand using the user similarity matrix, as in "Add collaborative filtering features" above.
## However, for prediction we'll only consider items whose share of all ratings is above 0.1%, so we'll precompute the similarity features for all users against this set of highly rated items.


In [None]:
# Share of all ratings received by each item, computed once
item_rating_share = df['item_id'].value_counts(normalize=True)
highly_rated_items = item_rating_share[item_rating_share > 0.001].index.to_list()
highly_rated_items

In [None]:
df = df.reset_index()
df

In [None]:
import pandas as pd

# Get unique users
unique_users = df['userName'].unique()

# Create a DataFrame from highly_rated_items
highly_rated_df = pd.DataFrame(highly_rated_items, columns=['item_id'])

# Create a DataFrame for all unique users
users_df = pd.DataFrame(unique_users, columns=['userName'])

# Perform Cartesian product
cartesian_product_df = users_df.assign(key=1).merge(highly_rated_df.assign(key=1), on='key').drop('key', axis=1)

# Drop pairs that already exist in df
new_pairs_df = cartesian_product_df[~cartesian_product_df.set_index(['userName', 'item_id']).index.isin(df.set_index(['userName', 'item_id']).index)]

# Now, new_pairs_df contains all new combinations of userName and highly rated item_ids

In [None]:
new_pairs_df
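
The pair-building step can also be sketched with `how="cross"` (available in pandas 1.2 and later) and an indicator-based anti-join. This is an alternative shown on toy data, not the notebook's code; `users_df`, `items_df` and `rated_df` are illustrative names and values.

```python
import pandas as pd

# Toy data (illustrative names and values)
users_df = pd.DataFrame({"userName": ["ann", "bob"]})
items_df = pd.DataFrame({"item_id": ["x", "y"]})
rated_df = pd.DataFrame({"userName": ["ann"], "item_id": ["x"]})

# Cartesian product without a helper key column (pandas >= 1.2)
all_pairs = users_df.merge(items_df, how="cross")

# Anti-join: keep only the pairs that are absent from rated_df
merged = all_pairs.merge(rated_df, on=["userName", "item_id"], how="left", indicator=True)
new_pairs = merged.loc[merged["_merge"] == "left_only", ["userName", "item_id"]].reset_index(drop=True)

print(new_pairs)
```

The `_merge` column produced by `indicator=True` marks each row as `left_only`, `right_only` or `both`, which makes the filter explicit.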

## We created every pair of user and highly rated item that does not already appear in df. Now let's add the user- and item-specific features and calculate the similarity features.

In [None]:
user_cols = [col for col in df.columns if 'user' in col if 'sim' not in col]
user_cols

In [None]:
user_specific_features = df.drop_duplicates('userName')[user_cols]
user_specific_features

In [None]:
new_pairs_df = new_pairs_df.merge(user_specific_features, on=['userName'], how='left')
new_pairs_df

In [None]:
item_cols = [col for col in df.columns if col.startswith('item') or col.startswith('brand') or col.startswith('category') if 'sim' not in col]
item_cols

In [None]:
item_specific_features = move_cols_to_first(df.drop_duplicates('item_id')[item_cols], ['item_id'])
item_specific_features

In [None]:
new_pairs_df = new_pairs_df.merge(item_specific_features, on=['item_id'], how='left')
new_pairs_df

In [None]:
[col for col in train.columns if col not in new_pairs_df.columns]

## We're only missing similarity cols, let's create them using similar_users_dict and the df

In [None]:
chunk_size = 1000
new_pairs_df = process_in_chunks(new_pairs_df, similar_users_dict, df, features, chunk_size=chunk_size)

# This will take too long - and those features are not the most important ones...
# I'll retrain the model without the 6 similarity features!

In [None]:
from sklearn.ensemble import RandomForestRegressor

regr = RandomForestRegressor(random_state=0, n_jobs=-1)
regr.fit(df.drop(columns=[col for col in train.columns if col not in new_pairs_df.columns] + ['userName','item_id']), df[target])

In [None]:
print(regr.score(df.drop(columns=[col for col in train.columns if col not in new_pairs_df.columns] + ['userName','item_id']), df[target]))

In [None]:
df_features = df.drop(columns=[col for col in train.columns if col not in new_pairs_df.columns] + ['userName','item_id']).columns

In [None]:
new_pairs_df[target] = regr.predict(new_pairs_df[df_features])

In [None]:
new_pairs_df.to_csv('user_item_recommendations_ratings_and_features.csv', index=False)

In [None]:
new_pairs_df[['userName','item_id','rating']]

## filter the top 5 recommendations per user

In [None]:
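# Group the predicted ratings by 'userName' and keep the 5 highest-rated unseen items for each user
# (new_pairs_df holds the predictions; df holds the ratings users already gave)
top_5_rated_per_user = new_pairs_df.groupby('userName').apply(lambda x: x.nlargest(5, 'rating')).reset_index(drop=True)[['userName','item_id','rating']]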
top_5_rated_per_user
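
On large frames, `groupby(...).apply` with `nlargest` can be slow. A common alternative is to sort once and take the first rows of each group, sketched here on toy data (the `preds` frame and its values are illustrative, not the notebook's output).

```python
import pandas as pd

# Toy predicted ratings (illustrative values)
preds = pd.DataFrame({
    "userName": ["ann", "ann", "ann", "bob", "bob"],
    "item_id": ["a", "b", "c", "a", "d"],
    "rating": [4.1, 4.9, 3.2, 2.5, 4.0],
})

top_n = 2
top_per_user = (
    preds.sort_values(["userName", "rating"], ascending=[True, False])
    .groupby("userName")
    .head(top_n)            # first top_n rows of each user after sorting
    .reset_index(drop=True)
)
print(top_per_user)
```

Ties are broken by sort order here, so this is not guaranteed to pick the same rows as `nlargest` when ratings are equal.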

In [None]:
assert top_5_rated_per_user[target].mean() > new_pairs_df[target].mean()

In [None]:
top_5_rated_per_user.to_csv('user_item_top_5_recommendations_by_ratings.csv', index=False)

## Save the top 5 items for new users or users without enough ratings (who were dropped at the beginning)

In [None]:
# Most frequently rated item in each category, ordered by rating count
most_popular_items_per_category = (
    pd.concat([X_train[['item_id','category']], X_val[['item_id','category']], X_test[['item_id','category']]], axis=0, ignore_index=True)
    .groupby(['item_id','category']).size().sort_values(ascending=False)
    .reset_index().drop_duplicates(subset=['category'], keep='first')
)
most_popular_items_per_category = most_popular_items_per_category[:5].item_id.str.split('_', expand=True)
most_popular_items_per_category.columns = ['brand','itemName', 'price']
most_popular_items_per_category = move_cols_to_first(most_popular_items_per_category, ['itemName'])
most_popular_items_per_category = most_popular_items_per_category.reset_index(drop=True)
most_popular_items_per_category

In [None]:
most_popular_items_per_category.to_csv('most_popular_items_per_category.csv', index=False)
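
Splitting `item_id` on every `_` yields exactly three columns only when none of the parts contains an underscore. The sketch below shows one way to tolerate underscores inside the item name. It assumes `item_id` has the form brand, item name, price (as the column names above suggest) and that brand and price themselves contain no underscores; the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical item_id values; the second one has an underscore inside the name
item_ids = pd.Series(["Acme_Widget_9.99", "Acme_Super_Widget_19.99"])

# First field = brand, last field = price, everything in between = item name
pattern = r"^(?P<brand>[^_]*)_(?P<itemName>.*)_(?P<price>[^_]*)$"
parts = item_ids.str.extract(pattern)

print(parts)
```

Named groups give the resulting columns their names directly, so no separate column assignment is needed.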

## The final logic of the app

In [None]:
top_5_rated_per_user = pd.read_csv('user_item_top_5_recommendations_by_ratings.csv')
most_popular_items_per_category = pd.read_csv('most_popular_items_per_category.csv')

def get_item_recommendations_for_userName(userName):
    recommended_items = top_5_rated_per_user.loc[top_5_rated_per_user['userName'] == userName, 'item_id'].str.split('_', expand=True)
    if len(recommended_items) > 0:
        recommended_items.columns = ['brand','itemName', 'price']
        recommended_items = move_cols_to_first(recommended_items, ['itemName'])
        recommended_items = recommended_items.reset_index(drop=True)
    else:
        recommended_items = most_popular_items_per_category

    return recommended_items

userName = 'Boo'
get_item_recommendations_for_userName(userName)