## Recommendation Systems

A recommendation system analyses a user's preferences and recommends products that the user is likely to enjoy.
For example, Netflix recommends movies based on what you have watched previously, and LinkedIn recommends jobs you may be interested in based on your profile.

Types of recommendation systems

|Type|Description|
|-|-|
|Content-based filtering|Recommends products whose attributes or features resemble those you have liked in the past. For example, if you have liked horror films before, Netflix recommends more horror films.|
|Collaborative filtering|Works on the assumption that people who had similar preferences in the past will have similar preferences in the future.|
|Demographic-based recommender system|Uses demographic data, usually collected through market research, to recommend products.|
|Utility-based recommender system|Builds a utility function over the products and recommends products based on its output. The benefit is that non-product attributes, such as vendor reliability and product availability, can be factored into the utility function.|
|Knowledge-based recommender system|Recommends items by reasoning about how a particular item meets the user's need.|
|Hybrid recommender system|Combines two or more of the above systems. Common techniques are weighting the outputs of the individual systems, switching between systems depending on the situation, and showing the recommendations from the different systems together.|
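
The weighted variant of a hybrid recommender can be sketched in a few lines. This is a minimal illustration rather than part of the notebook's pipeline: the function name `weighted_hybrid`, the example scores and the 50/50 weighting are all assumptions made for the example.

```python
def weighted_hybrid(scores_a, scores_b, weight_a=0.5):
    """Blend two recommenders' scores; items missing from one side count as 0."""
    weight_b = 1.0 - weight_a
    items = set(scores_a) | set(scores_b)
    blended = {
        item: weight_a * scores_a.get(item, 0.0) + weight_b * scores_b.get(item, 0.0)
        for item in items
    }
    # Highest blended score first; ties broken alphabetically for determinism
    return sorted(blended.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical scores from a content-based and a collaborative recommender
content_scores = {"Movie A": 0.9, "Movie B": 0.4}
collab_scores = {"Movie B": 0.8, "Movie C": 0.6}
ranking = weighted_hybrid(content_scores, collab_scores, weight_a=0.5)
print(ranking)
```

"Movie B" ranks first here because both recommenders score it, which is the kind of agreement a weighted hybrid rewards.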


In [139]:
import pandas as pd
import numpy as np

# NLTK libraries
import nltk
nltk.download('all')
from nltk.corpus import stopwords
from nltk import FreqDist
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet
import re

# Modelling
from sklearn.model_selection import cross_val_score
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics.pairwise import cosine_similarity


## Data Load and cleaning

In [117]:
df = pd.read_csv('https://raw.githubusercontent.com/antrikshsaxena/capstone_solution/main/reviews_dataset.csv')
df.head()

Unnamed: 0,id,brand,categories,manufacturer,name,reviews_date,reviews_didPurchase,reviews_doRecommend,reviews_rating,reviews_text,reviews_title,reviews_userCity,reviews_userProvince,reviews_username,user_sentiment
0,AV13O1A8GV-KLJ3akUyj,Universal Music,"Movies, Music & Books,Music,R&b,Movies & TV,Mo...",Universal Music Group / Cash Money,Pink Friday: Roman Reloaded Re-Up (w/dvd),2012-11-30T06:21:45.000Z,,,5,i love this album. it's very good. more to the...,Just Awesome,Los Angeles,,joshua,Positive
1,AV14LG0R-jtxr-f38QfS,Lundberg,"Food,Packaged Foods,Snacks,Crackers,Snacks, Co...",Lundberg,Lundberg Organic Cinnamon Toast Rice Cakes,2017-07-09T00:00:00.000Z,True,,5,Good flavor. This review was collected as part...,Good,,,dorothy w,Positive
2,AV14LG0R-jtxr-f38QfS,Lundberg,"Food,Packaged Foods,Snacks,Crackers,Snacks, Co...",Lundberg,Lundberg Organic Cinnamon Toast Rice Cakes,2017-07-09T00:00:00.000Z,True,,5,Good flavor.,Good,,,dorothy w,Positive
3,AV16khLE-jtxr-f38VFn,K-Y,"Personal Care,Medicine Cabinet,Lubricant/Sperm...",K-Y,K-Y Love Sensuality Pleasure Gel,2016-01-06T00:00:00.000Z,False,False,1,I read through the reviews on here before look...,Disappointed,,,rebecca,Negative
4,AV16khLE-jtxr-f38VFn,K-Y,"Personal Care,Medicine Cabinet,Lubricant/Sperm...",K-Y,K-Y Love Sensuality Pleasure Gel,2016-12-21T00:00:00.000Z,False,False,1,My husband bought this gel for us. The gel cau...,Irritation,,,walker557,Negative


In [118]:
df.info(), df.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 15 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   id                    30000 non-null  object
 1   brand                 30000 non-null  object
 2   categories            30000 non-null  object
 3   manufacturer          29859 non-null  object
 4   name                  30000 non-null  object
 5   reviews_date          29954 non-null  object
 6   reviews_didPurchase   15932 non-null  object
 7   reviews_doRecommend   27430 non-null  object
 8   reviews_rating        30000 non-null  int64 
 9   reviews_text          30000 non-null  object
 10  reviews_title         29810 non-null  object
 11  reviews_userCity      1929 non-null   object
 12  reviews_userProvince  170 non-null    object
 13  reviews_username      29937 non-null  object
 14  user_sentiment        29999 non-null  object
dtypes: int64(1), object(14)
memory usage

(None, (30000, 15))

In [119]:
df.isna().sum()/len(df)*100

id                       0.000000
brand                    0.000000
categories               0.000000
manufacturer             0.470000
name                     0.000000
reviews_date             0.153333
reviews_didPurchase     46.893333
reviews_doRecommend      8.566667
reviews_rating           0.000000
reviews_text             0.000000
reviews_title            0.633333
reviews_userCity        93.570000
reviews_userProvince    99.433333
reviews_username         0.210000
user_sentiment           0.003333
dtype: float64

In [120]:
# drop columns with mostly missing values (reviews_userCity, reviews_userProvince) and columns not needed for modelling
df.drop(['reviews_userCity', 'reviews_userProvince', 'reviews_didPurchase', 'reviews_doRecommend'], axis=1, inplace=True)

In [121]:
# drop rows with a missing value in any of the columns used later
df = df.dropna(subset=['reviews_text', 'reviews_title', 'reviews_username',
                       'user_sentiment', 'reviews_date', 'manufacturer'])

In [122]:
df.isna().sum()/len(df)*100

id                  0.0
brand               0.0
categories          0.0
manufacturer        0.0
name                0.0
reviews_date        0.0
reviews_rating      0.0
reviews_text        0.0
reviews_title       0.0
reviews_username    0.0
user_sentiment      0.0
dtype: float64

## Data Pre-processing

In [123]:
df.head()

Unnamed: 0,id,brand,categories,manufacturer,name,reviews_date,reviews_rating,reviews_text,reviews_title,reviews_username,user_sentiment
0,AV13O1A8GV-KLJ3akUyj,Universal Music,"Movies, Music & Books,Music,R&b,Movies & TV,Mo...",Universal Music Group / Cash Money,Pink Friday: Roman Reloaded Re-Up (w/dvd),2012-11-30T06:21:45.000Z,5,i love this album. it's very good. more to the...,Just Awesome,joshua,Positive
1,AV14LG0R-jtxr-f38QfS,Lundberg,"Food,Packaged Foods,Snacks,Crackers,Snacks, Co...",Lundberg,Lundberg Organic Cinnamon Toast Rice Cakes,2017-07-09T00:00:00.000Z,5,Good flavor. This review was collected as part...,Good,dorothy w,Positive
2,AV14LG0R-jtxr-f38QfS,Lundberg,"Food,Packaged Foods,Snacks,Crackers,Snacks, Co...",Lundberg,Lundberg Organic Cinnamon Toast Rice Cakes,2017-07-09T00:00:00.000Z,5,Good flavor.,Good,dorothy w,Positive
3,AV16khLE-jtxr-f38VFn,K-Y,"Personal Care,Medicine Cabinet,Lubricant/Sperm...",K-Y,K-Y Love Sensuality Pleasure Gel,2016-01-06T00:00:00.000Z,1,I read through the reviews on here before look...,Disappointed,rebecca,Negative
4,AV16khLE-jtxr-f38VFn,K-Y,"Personal Care,Medicine Cabinet,Lubricant/Sperm...",K-Y,K-Y Love Sensuality Pleasure Gel,2016-12-21T00:00:00.000Z,1,My husband bought this gel for us. The gel cau...,Irritation,walker557,Negative


In [124]:
df.rename(columns={'reviews_username': 'username', 'name' : 'productname'}, inplace=True)

In [125]:
df.head()

Unnamed: 0,id,brand,categories,manufacturer,productname,reviews_date,reviews_rating,reviews_text,reviews_title,username,user_sentiment
0,AV13O1A8GV-KLJ3akUyj,Universal Music,"Movies, Music & Books,Music,R&b,Movies & TV,Mo...",Universal Music Group / Cash Money,Pink Friday: Roman Reloaded Re-Up (w/dvd),2012-11-30T06:21:45.000Z,5,i love this album. it's very good. more to the...,Just Awesome,joshua,Positive
1,AV14LG0R-jtxr-f38QfS,Lundberg,"Food,Packaged Foods,Snacks,Crackers,Snacks, Co...",Lundberg,Lundberg Organic Cinnamon Toast Rice Cakes,2017-07-09T00:00:00.000Z,5,Good flavor. This review was collected as part...,Good,dorothy w,Positive
2,AV14LG0R-jtxr-f38QfS,Lundberg,"Food,Packaged Foods,Snacks,Crackers,Snacks, Co...",Lundberg,Lundberg Organic Cinnamon Toast Rice Cakes,2017-07-09T00:00:00.000Z,5,Good flavor.,Good,dorothy w,Positive
3,AV16khLE-jtxr-f38VFn,K-Y,"Personal Care,Medicine Cabinet,Lubricant/Sperm...",K-Y,K-Y Love Sensuality Pleasure Gel,2016-01-06T00:00:00.000Z,1,I read through the reviews on here before look...,Disappointed,rebecca,Negative
4,AV16khLE-jtxr-f38VFn,K-Y,"Personal Care,Medicine Cabinet,Lubricant/Sperm...",K-Y,K-Y Love Sensuality Pleasure Gel,2016-12-21T00:00:00.000Z,1,My husband bought this gel for us. The gel cau...,Irritation,walker557,Negative


In [126]:
df['sentiment'] = df.user_sentiment.apply(lambda x: 1 if x == "Positive" else 0)
df.drop(['user_sentiment'], axis=1, inplace=True)

In [127]:
df['review'] = df['reviews_title'] + ' ' + df['reviews_text']
df.head()

Unnamed: 0,id,brand,categories,manufacturer,productname,reviews_date,reviews_rating,reviews_text,reviews_title,username,sentiment,review
0,AV13O1A8GV-KLJ3akUyj,Universal Music,"Movies, Music & Books,Music,R&b,Movies & TV,Mo...",Universal Music Group / Cash Money,Pink Friday: Roman Reloaded Re-Up (w/dvd),2012-11-30T06:21:45.000Z,5,i love this album. it's very good. more to the...,Just Awesome,joshua,1,Just Awesome i love this album. it's very good...
1,AV14LG0R-jtxr-f38QfS,Lundberg,"Food,Packaged Foods,Snacks,Crackers,Snacks, Co...",Lundberg,Lundberg Organic Cinnamon Toast Rice Cakes,2017-07-09T00:00:00.000Z,5,Good flavor. This review was collected as part...,Good,dorothy w,1,Good Good flavor. This review was collected as...
2,AV14LG0R-jtxr-f38QfS,Lundberg,"Food,Packaged Foods,Snacks,Crackers,Snacks, Co...",Lundberg,Lundberg Organic Cinnamon Toast Rice Cakes,2017-07-09T00:00:00.000Z,5,Good flavor.,Good,dorothy w,1,Good Good flavor.
3,AV16khLE-jtxr-f38VFn,K-Y,"Personal Care,Medicine Cabinet,Lubricant/Sperm...",K-Y,K-Y Love Sensuality Pleasure Gel,2016-01-06T00:00:00.000Z,1,I read through the reviews on here before look...,Disappointed,rebecca,0,Disappointed I read through the reviews on her...
4,AV16khLE-jtxr-f38VFn,K-Y,"Personal Care,Medicine Cabinet,Lubricant/Sperm...",K-Y,K-Y Love Sensuality Pleasure Gel,2016-12-21T00:00:00.000Z,1,My husband bought this gel for us. The gel cau...,Irritation,walker557,0,Irritation My husband bought this gel for us. ...


## Sentiment Prediction from Text

In this segment we predict sentiment from the review text. Before training, the following preprocessing steps are applied:
1. Lower-case the review text
2. Remove stop words and punctuation
3. Lemmatize
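
The first two steps can be sketched without any dependencies. This is a simplified illustration: the stop-word list below is a tiny made-up sample (the notebook uses NLTK's English list), `basic_clean` is a hypothetical helper name, and lemmatization is left out because it needs NLTK's WordNet data.

```python
import re

# Tiny illustrative stop-word list; the notebook uses NLTK's English list instead
STOP_WORDS = {"i", "this", "it", "is", "the", "a", "very", "to"}

def basic_clean(text, stop_words=STOP_WORDS):
    # 1. lower-case
    text = text.lower()
    # 2a. drop stop words (done before punctuation removal, as in the notebook)
    tokens = [t for t in text.split() if t not in stop_words]
    # 2b. replace punctuation with spaces
    text = re.sub(r"[^\w\s]", " ", " ".join(tokens))
    # collapse the runs of whitespace left behind
    return " ".join(text.split())

cleaned = basic_clean("I love this album. It is VERY good!")
print(cleaned)
```

Because stop words are removed before punctuation, a token such as "it." (with trailing punctuation) would not match the stop-word list; the order of these steps matters.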


In [128]:
s_words = stopwords.words('english')
lemmatizer = WordNetLemmatizer()

# convert nltk tag to wordnet tag for lemmatization
def nltk_tag_to_wordnet_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

def process(sentence, words):
    # lower case
    sentence = sentence.lower()
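    # note: str.replace does not interpret regular expressions, so punctuation cannot be removed with it here;
    # punctuation is stripped by the \W substitution near the end of this function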
    # remove stop words
    sentence = ' '.join([x for x in sentence.split() if x not in (words)])
    # tokenize sentence into (word, nltk_postag)
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    # create a map of (word, wordnettag)
    wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged)
    lemmatized_sentence = []
    for word, tag in wordnet_tagged:
        if tag is None:
            #if there is no available tag, append the token as is
            lemmatized_sentence.append(word)
        else:
            #else use the tag to lemmatize the token
            lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
    sentence = ' '.join(lemmatized_sentence)
    # remove html markup
    sentence=re.sub("(<.*?>)","",sentence)
    # replace non-word characters (punctuation) and digits with spaces
    sentence=re.sub("(\\W|\\d)"," ",sentence)
    # remove whitespace
    sentence=sentence.strip()
    return sentence

df.review = df.review.apply(lambda x: process(x, s_words))

In [129]:
df.head()

Unnamed: 0,id,brand,categories,manufacturer,productname,reviews_date,reviews_rating,reviews_text,reviews_title,username,sentiment,review
0,AV13O1A8GV-KLJ3akUyj,Universal Music,"Movies, Music & Books,Music,R&b,Movies & TV,Mo...",Universal Music Group / Cash Money,Pink Friday: Roman Reloaded Re-Up (w/dvd),2012-11-30T06:21:45.000Z,5,i love this album. it's very good. more to the...,Just Awesome,joshua,1,awesome love album good hip hop side curre...
1,AV14LG0R-jtxr-f38QfS,Lundberg,"Food,Packaged Foods,Snacks,Crackers,Snacks, Co...",Lundberg,Lundberg Organic Cinnamon Toast Rice Cakes,2017-07-09T00:00:00.000Z,5,Good flavor. This review was collected as part...,Good,dorothy w,1,good good flavor review collect part promotion
2,AV14LG0R-jtxr-f38QfS,Lundberg,"Food,Packaged Foods,Snacks,Crackers,Snacks, Co...",Lundberg,Lundberg Organic Cinnamon Toast Rice Cakes,2017-07-09T00:00:00.000Z,5,Good flavor.,Good,dorothy w,1,good good flavor
3,AV16khLE-jtxr-f38VFn,K-Y,"Personal Care,Medicine Cabinet,Lubricant/Sperm...",K-Y,K-Y Love Sensuality Pleasure Gel,2016-01-06T00:00:00.000Z,1,I read through the reviews on here before look...,Disappointed,rebecca,0,disappoint read review look buy one couple lub...
4,AV16khLE-jtxr-f38VFn,K-Y,"Personal Care,Medicine Cabinet,Lubricant/Sperm...",K-Y,K-Y Love Sensuality Pleasure Gel,2016-12-21T00:00:00.000Z,1,My husband bought this gel for us. The gel cau...,Irritation,walker557,0,irritation husband buy gel us gel cause irri...


### Sentiment Classifier training

In [133]:
x=df['review'] 
y=df['sentiment']

In [138]:
y.value_counts()/len(y) * 100

1    88.825002
0    11.174998
Name: sentiment, dtype: float64

In [142]:
# Train test split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

In [146]:
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)

(20696,) (20696,)
(8870,) (8870,)


In [148]:
word_vectorizer = TfidfVectorizer(
    strip_accents='unicode',    # Remove accents and perform other character normalization during the preprocessing step. 
    analyzer='word',            # Whether the feature should be made of word or character n-grams.
    token_pattern=r'\w{1,}',    # Regular expression denoting what constitutes a “token”, only used if analyzer == 'word'
    ngram_range=(1, 3),         # The lower and upper boundary of the range of n-values for different n-grams to be extracted
    stop_words='english',
    sublinear_tf=True)

word_vectorizer.fit(x_train)    # Fitting it on the training set only
# Transforming the train and test datasets
X_train_transformed = word_vectorizer.transform(x_train.tolist())
X_test_transformed = word_vectorizer.transform(x_test.tolist())
# Print the shape of each dataset.
print('X_train_transformed', X_train_transformed.shape)
print('y_train', y_train.shape)
print('X_test_transformed', X_test_transformed.shape)
print('y_test', y_test.shape)

X_train_transformed (20696, 376845)
y_train (20696,)
X_test_transformed (8870, 376845)
y_test (8870,)
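
To make the weighting concrete, here is a dependency-free sketch of TF-IDF with sublinear term frequency. It assumes the formulas scikit-learn documents for its defaults (smoothed idf, L2 normalisation) and handles unigrams only, whereas the vectorizer above also extracts bigrams and trigrams. `tfidf_vectors` and the sample documents are made up for the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF with sublinear tf, smoothed idf and L2 normalisation."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        # Sublinear tf: 1 + ln(count) instead of the raw count
        weights = {t: (1 + math.log(c)) * idf[t] for t, c in counts.items()}
        # Guard against empty documents, whose norm would be 0
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors, idf

docs = ["good good flavor", "good album", "bad flavor"]
vectors, idf = tfidf_vectors(docs)
print(vectors[0])
```

Rare terms such as "bad" receive a larger idf than common ones such as "good", and the sublinear tf keeps the repeated "good" in the first document from counting twice as much.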


In [153]:
## Resampling: randomly oversample the minority class in the training set
df_x_train = pd.DataFrame(x_train)
from imblearn import over_sampling
ros = over_sampling.RandomOverSampler(random_state=0)
# fit_resample expects 2-D input, hence the single-column DataFrame
x_train, y_train = ros.fit_resample(df_x_train, y_train)
y_train.value_counts()

1    18381
0    18381
Name: sentiment, dtype: int64
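
Random oversampling itself is a simple idea: duplicate randomly chosen minority-class rows until the classes are the same size. The sketch below illustrates it under that assumption; `random_oversample` and the sample data are hypothetical, and it is not a reimplementation of imblearn's internals.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until every class matches the largest."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Sample with replacement to fill the gap up to the majority-class size
        extra = [rng.choice(members) for _ in range(target - len(members))]
        out_rows.extend(members + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels

reviews = ["good", "great", "love it", "nice", "awful"]
sentiment = [1, 1, 1, 1, 0]
new_reviews, new_sentiment = random_oversample(reviews, sentiment)
print(Counter(new_sentiment))
```

Only the training split should be resampled, as the notebook does; duplicating rows in the test split would distort the evaluation.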

In [155]:
# re-vectorize the oversampled training text (x_train is now a single-column DataFrame)
X_train_transformed = word_vectorizer.transform(x_train.iloc[:,0].tolist())
X_test_transformed = word_vectorizer.transform(x_test.tolist())

In [157]:
## training using XGBoost classifier
import time
from sklearn.model_selection import RandomizedSearchCV
import xgboost as xgb 

time1 = time.time()
n_estimators = [10,15,20,25,30] 
max_depth = [4,5,6]
max_depth.append(None) # None leaves max_depth at the XGBoost default
# max_features, min_samples_split and min_samples_leaf are scikit-learn tree parameters;
# XGBoost does not use them (see the warning in the output), so only n_estimators and max_depth are effectively tuned
max_features = ['auto', 'sqrt']
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
random_grid = {'n_estimators': n_estimators, 'max_features': max_features,
               'max_depth': max_depth, 'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf}
# use a separate name so that the xgb module is not shadowed
xgb_clf = xgb.XGBClassifier(n_jobs=-1)
xgb_final = RandomizedSearchCV(estimator=xgb_clf, param_distributions=random_grid, n_iter=5, cv=3, 
                               verbose=2, random_state=42, n_jobs=-1)
xgb_final.fit(X_train_transformed,y_train)
time_taken = time.time() - time1
print('Time Taken: {:.2f} seconds'.format(time_taken))

Fitting 3 folds for each of 5 candidates, totalling 15 fits
Parameters: { "max_features", "min_samples_leaf", "min_samples_split" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


Time Taken: 75.71 seconds
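
The "5 candidates, 15 fits" line in the log follows from how a randomized search works: it draws `n_iter` parameter combinations from the grid and cross-validates each one (5 candidates x 3 folds). The sketch below illustrates only the sampling step, assuming combinations are drawn without replacement from a grid of lists; `sample_candidates` is a hypothetical helper, not scikit-learn code.

```python
import itertools
import random

def sample_candidates(param_grid, n_iter, seed=42):
    """Pick n_iter distinct parameter combinations from a grid of lists."""
    keys = sorted(param_grid)
    combos = [dict(zip(keys, values))
              for values in itertools.product(*(param_grid[k] for k in keys))]
    rng = random.Random(seed)
    # Sampling without replacement, capped at the size of the grid
    return rng.sample(combos, min(n_iter, len(combos)))

grid = {"n_estimators": [10, 15, 20, 25, 30], "max_depth": [4, 5, 6, None]}
candidates = sample_candidates(grid, n_iter=5)
for params in candidates:
    print(params)
```

With 20 possible combinations and `n_iter=5`, only a quarter of the grid is tried, which is the trade-off that keeps the search fast.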


In [158]:
xgb_final.best_estimator_

In [159]:
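# Prediction Train Data
# Note: scikit-learn expects (y_true, y_pred); the arguments are passed as (y_pred, y_true) here,
# so precision and recall are swapped in the report below (accuracy is unaffected)
y_pred_train= xgb_final.predict(X_train_transformed)
print("Xgboost Forest Model accuracy", accuracy_score(y_pred_train, y_train))
print(classification_report(y_pred_train, y_train))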

Xgboost Forest Model accuracy 0.8708720961862794
              precision    recall  f1-score   support

           0       0.89      0.86      0.87     18952
           1       0.86      0.88      0.87     17810

    accuracy                           0.87     36762
   macro avg       0.87      0.87      0.87     36762
weighted avg       0.87      0.87      0.87     36762



In [160]:
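# Prediction Test Data
# Note: the arguments are passed as (y_pred, y_true) rather than (y_true, y_pred),
# so precision and recall are swapped in the report below (accuracy is unaffected)
y_pred_test = xgb_final.predict(X_test_transformed)
print("Xgboost Model accuracy", accuracy_score(y_pred_test, y_test))
print(classification_report(y_pred_test, y_test))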

Xgboost Model accuracy 0.8303269447576099
              precision    recall  f1-score   support

           0       0.73      0.37      0.49      1954
           1       0.84      0.96      0.90      6916

    accuracy                           0.83      8870
   macro avg       0.79      0.66      0.69      8870
weighted avg       0.82      0.83      0.81      8870
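
When reading reports like the one above, note that argument order matters: scikit-learn's `classification_report` takes the true labels first, and passing the predictions first swaps precision and recall. The dependency-free sketch below shows the effect on made-up labels; `precision_recall` is a hypothetical helper.

```python
def precision_recall(y_true, y_pred, positive):
    """Precision and recall for one class, given true and predicted labels."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    predicted = sum(p == positive for p in y_pred)   # items predicted as the class
    actual = sum(t == positive for t in y_true)      # items truly in the class
    return tp / predicted, tp / actual

# Made-up labels: three true negatives-class items, of which only one is found
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1, 1, 1]

correct_order = precision_recall(y_true, y_pred, positive=0)
swapped_order = precision_recall(y_pred, y_true, positive=0)
print(correct_order, swapped_order)
```

In the correct order class 0 has perfect precision but low recall; with the arguments swapped the two numbers trade places, while accuracy would be unchanged.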



In [165]:
print("Confusion matrix for train and test set")
cm_test = confusion_matrix(y_test, y_pred_test)
TN = cm_test[0, 0]
FP = cm_test[0, 1]
FN = cm_test[1, 0]
TP = cm_test[1, 1]
# Calculating the sensitivity and specificity for the test set
sensitivity = TP / float(FN + TP)
print("sensitivity for test set: ",sensitivity)
specificity = TN / float(TN + FP)
print("specificity for test set: ",specificity)

Confusion matrix for train and test set
sensitivity for test set:  0.8432939982235756
specificity for test set:  0.7269969666329625


## Recommendation system

In [166]:
df.head()

Unnamed: 0,id,brand,categories,manufacturer,productname,reviews_date,reviews_rating,reviews_text,reviews_title,username,sentiment,review
0,AV13O1A8GV-KLJ3akUyj,Universal Music,"Movies, Music & Books,Music,R&b,Movies & TV,Mo...",Universal Music Group / Cash Money,Pink Friday: Roman Reloaded Re-Up (w/dvd),2012-11-30T06:21:45.000Z,5,i love this album. it's very good. more to the...,Just Awesome,joshua,1,awesome love album good hip hop side curre...
1,AV14LG0R-jtxr-f38QfS,Lundberg,"Food,Packaged Foods,Snacks,Crackers,Snacks, Co...",Lundberg,Lundberg Organic Cinnamon Toast Rice Cakes,2017-07-09T00:00:00.000Z,5,Good flavor. This review was collected as part...,Good,dorothy w,1,good good flavor review collect part promotion
2,AV14LG0R-jtxr-f38QfS,Lundberg,"Food,Packaged Foods,Snacks,Crackers,Snacks, Co...",Lundberg,Lundberg Organic Cinnamon Toast Rice Cakes,2017-07-09T00:00:00.000Z,5,Good flavor.,Good,dorothy w,1,good good flavor
3,AV16khLE-jtxr-f38VFn,K-Y,"Personal Care,Medicine Cabinet,Lubricant/Sperm...",K-Y,K-Y Love Sensuality Pleasure Gel,2016-01-06T00:00:00.000Z,1,I read through the reviews on here before look...,Disappointed,rebecca,0,disappoint read review look buy one couple lub...
4,AV16khLE-jtxr-f38VFn,K-Y,"Personal Care,Medicine Cabinet,Lubricant/Sperm...",K-Y,K-Y Love Sensuality Pleasure Gel,2016-12-21T00:00:00.000Z,1,My husband bought this gel for us. The gel cau...,Irritation,walker557,0,irritation husband buy gel us gel cause irri...


In [169]:
df1 = df[['reviews_rating', 'productname', 'username']]

In [189]:
df1.isnull().any()

reviews_rating    False
productname       False
username          False
dtype: bool

In [190]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 29566 entries, 0 to 29999
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   reviews_rating  29566 non-null  int64 
 1   productname     29566 non-null  object
 2   username        29566 non-null  object
dtypes: int64(1), object(2)
memory usage: 923.9+ KB


In [191]:
df_train, df_test = train_test_split(df1, test_size=0.30, random_state=42)
print(df_train.shape, df_test.shape)

(20696, 3) (8870, 3)


In [192]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20696 entries, 10253 to 24021
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   reviews_rating  20696 non-null  int64 
 1   productname     20696 non-null  object
 2   username        20696 non-null  object
dtypes: int64(1), object(2)
memory usage: 646.8+ KB


In [180]:
# Prepare dummy_train: 0 where the user has already rated the product, 1 otherwise; used later to mask out products the user has already rated
dummy_train = df_train.copy()
dummy_train.reviews_rating = df_train.reviews_rating.apply(lambda x : 0 if x >= 1 else 1)
dummy_train =  dummy_train.pivot_table(index='username', columns='productname', values='reviews_rating').fillna(1)
dummy_train.head()

productname,0.6 Cu. Ft. Letter A4 Size Waterproof 30 Min. Fire File Chest,100:Complete First Season (blu-Ray),2017-2018 Brownline174 Duraflex 14-Month Planner 8 1/2 X 11 Black,"2x Ultra Era with Oxi Booster, 50fl oz","42 Dual Drop Leaf Table with 2 Madrid Chairs""",4C Grated Parmesan Cheese 100% Natural 8oz Shaker,Africa's Best No-Lye Dual Conditioning Relaxer System Super,Alberto VO5 Salon Series Smooth Plus Sleek Shampoo,"All,bran Complete Wheat Flakes, 18 Oz.",Ambi Complexion Cleansing Bar,...,"Vicks Vaporub, Regular, 3.53oz",Voortman Sugar Free Fudge Chocolate Chip Cookies,Wagan Smartac 80watt Inverter With Usb,"Wallmount Server Cabinet (450mm, 9 RU)","Way Basics 3-Shelf Eco Narrow Bookcase Storage Shelf, Espresso - Formaldehyde Free - Lifetime Guarantee",Wedding Wishes Wedding Guest Book,Weleda Everon Lip Balm,Windex Original Glass Cleaner Refill 67.6oz (2 Liter),Yes To Carrots Nourishing Body Wash,Yes To Grapefruit Rejuvenating Body Wash
username,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
00sab00,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
02deuce,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
0325home,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
1085,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
10ten,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [198]:
user_based_matrix = df_train.pivot_table(index='username', columns='productname', values='reviews_rating')
# row wise mean excluding NANs
m = np.nanmean(user_based_matrix, axis=1)
m.shape, user_based_matrix.shape
df_substracted = (user_based_matrix.T-m).T

In [218]:
from sklearn.metrics.pairwise import pairwise_distances
user_correlation = 1 - pairwise_distances(df_substracted.fillna(0), metric='cosine')
user_correlation[np.isnan(user_correlation)] = 0
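
The similarity computed here is a cosine similarity on mean-centred ratings. The idea can be sketched without pandas; the user names and ratings below are made up, and `mean_centre` and `cosine` are hypothetical helpers.

```python
import math

def mean_centre(ratings):
    """Subtract the user's mean rating; unrated items (None) become 0."""
    rated = [r for r in ratings if r is not None]
    mean = sum(rated) / len(rated)
    return [0.0 if r is None else r - mean for r in ratings]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # An all-zero vector (e.g. a user with a single rating) gets similarity 0
    return dot / norm if norm else 0.0

# Hypothetical ratings for three products; None means "not rated"
alice = mean_centre([5, 3, None])
bob = mean_centre([4, 2, None])     # same taste as alice, but rates lower overall
carol = mean_centre([1, 5, None])   # opposite taste

sim_alice_bob = cosine(alice, bob)
sim_alice_carol = cosine(alice, carol)
print(sim_alice_bob, sim_alice_carol)
```

Mean-centring makes alice and bob perfectly similar despite bob's lower ratings. A user with only one rating centres to an all-zero vector and so has no measurable similarity to anyone.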

In [219]:
user_correlation[user_correlation<0] = 0
predictions = np.dot(user_correlation, user_based_matrix.fillna(0))
predictions = np.multiply(predictions, dummy_train)
predictions.shape

(18025, 231)
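
The prediction step is a similarity-weighted sum of other users' ratings, masked so that products a user has already rated score 0. A dependency-free sketch on a made-up two-user example follows; `predict_unrated` is a hypothetical helper, and as in the notebook the result is an unnormalised score rather than a rating on the 1 to 5 scale.

```python
def predict_unrated(similarity, ratings):
    """Score each unrated item as the similarity-weighted sum of all users' ratings."""
    n_users, n_items = len(ratings), len(ratings[0])
    predictions = []
    for u in range(n_users):
        row = []
        for i in range(n_items):
            if ratings[u][i] is not None:
                # Already rated: masked out, like multiplying by dummy_train
                row.append(0.0)
                continue
            # Negative similarities are clipped to 0; missing ratings count as 0
            score = sum(max(similarity[u][v], 0.0) * (ratings[v][i] or 0)
                        for v in range(n_users))
            row.append(score)
        predictions.append(row)
    return predictions

# Hypothetical data: user 0 has not rated product 1, user 1 has rated both
similarity = [[1.0, 0.5], [0.5, 1.0]]
ratings = [[5, None], [4, 3]]
predictions = predict_unrated(similarity, ratings)
print(predictions)
```

User 0's score for product 1 comes entirely from user 1's rating of 3, weighted by their similarity of 0.5.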

### Evaluate

Evaluate by predicting ratings for the products in the test set that users have already rated, then compute the RMSE against the actual ratings.

In [250]:
test = df_test[df_test.username.isin(df_train.username)]
df_test = pd.pivot_table(test, values='reviews_rating', index='username', columns='productname')

In [246]:
df_user_correlation = pd.DataFrame(user_correlation)
df_user_correlation['username'] = df_substracted.index
df_user_correlation.set_index('username', inplace=True)
df_user_correlation.shape

(18025, 18025)

In [247]:
common_users = test.username.tolist()
df_user_correlation.columns = df_substracted.index.to_list()
df_user_correlation_1 = df_user_correlation[df_user_correlation.index.isin(common_users)]
print(df_user_correlation_1.shape)
df_user_correlation_2 = df_user_correlation_1.T[df_user_correlation_1.T.index.isin(common_users)]
df_user_correlation_2.shape

(1649, 18025)


(1649, 1649)

In [252]:
common_user_predicted_ratings = np.dot(df_user_correlation_2, df_test.fillna(0))

In [255]:
dummy_test = test.copy()
dummy_test['reviews_rating'] = dummy_test['reviews_rating'].apply(lambda x: 1 if x>=1 else 0)
dummy_test = dummy_test.pivot_table(index='username', columns='productname', values='reviews_rating').fillna(0)

In [294]:
common_user_predicted_ratings= np.multiply(common_user_predicted_ratings, dummy_test)
common_user_predicted_ratings.head()

productname,0.6 Cu. Ft. Letter A4 Size Waterproof 30 Min. Fire File Chest,100:Complete First Season (blu-Ray),"Aussie Aussome Volume Shampoo, 13.5 Oz","Australian Gold Exotic Blend Lotion, SPF 4","Aveeno Baby Continuous Protection Lotion Sunscreen with Broad Spectrum SPF 55, 4oz","Avery174 Ready Index Contemporary Table Of Contents Divider, 1-8, Multi, Letter",Axe Dry Anti-Perspirant Deodorant Invisible Solid Phoenix,"Banana Boat Sunless Summer Color Self Tanning Lotion, Light To Medium",Bisquick Original Pancake And Baking Mix - 40oz,Black Front Loading Frame Set (8.5x11) Set Of 12,...,Tresemme Kertatin Smooth Infusing Conditioning,Various - Country's Greatest Gospel:Gold Ed (cd),Various - Red Hot Blue:Tribute To Cole Porter (cd),Various Artists - Choo Choo Soul (cd),Vaseline Intensive Care Healthy Hands Stronger Nails,Vaseline Intensive Care Lip Therapy Cocoa Butter,"Vicks Vaporub, Regular, 3.53oz","Way Basics 3-Shelf Eco Narrow Bookcase Storage Shelf, Espresso - Formaldehyde Free - Lifetime Guarantee",Windex Original Glass Cleaner Refill 67.6oz (2 Liter),Yes To Carrots Nourishing Body Wash
username,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
143st,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1witch,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
37f5p,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4 rooms 1 dog lotsa fur,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8ellie24,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [265]:
# scale predicted ratings to the 1-5 range (MinMaxScaler scales each product column separately and ignores NaNs)
from sklearn.preprocessing import MinMaxScaler
X=common_user_predicted_ratings.copy()
X = X[X>0]
scaler = MinMaxScaler(feature_range=(1,5))
scaler.fit(X)
y = scaler.transform(X)
print(y)

[[nan nan nan ... nan nan nan]
 [nan nan nan ... nan nan nan]
 [nan nan nan ... nan nan nan]
 ...
 [nan nan nan ... nan nan nan]
 [nan nan nan ... nan nan nan]
 [nan nan nan ... nan nan nan]]
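
Min-max scaling maps the smallest present value to the lower bound and the largest to the upper bound, leaving missing values missing. The sketch below shows this for a single list of scores; `scale_to_range` is a hypothetical helper, and `None` stands in for NaN.

```python
def scale_to_range(values, low=1.0, high=5.0):
    """Min-max scale the non-missing values to [low, high]; None stays None."""
    present = [v for v in values if v is not None]
    lo, hi = min(present), max(present)
    span = hi - lo
    return [
        None if v is None
        # If all present values are equal, map them to the lower bound
        else (low if span == 0 else low + (v - lo) * (high - low) / span)
        for v in values
    ]

scaled = scale_to_range([None, 0.5, 1.5, 2.5, None])
print(scaled)
```

A column with a single present value has nothing to scale against, so it collapses to the lower bound of the range regardless of the original score.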


  data_min = np.nanmin(X, axis=0)
  data_max = np.nanmax(X, axis=0)


In [288]:
df_test = test.pivot_table(index='username', columns='productname', values='reviews_rating')
df_test.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1649 entries, 143st to zmom
Columns: 109 entries, 0.6 Cu. Ft. Letter A4 Size Waterproof 30 Min. Fire File Chest to Yes To Carrots Nourishing Body Wash
dtypes: float64(109)
memory usage: 1.4+ MB


In [293]:
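total_non_nan = np.count_nonzero(~np.isnan(y))
# The built-in sum() iterates over a DataFrame's column labels (strings) and raises a TypeError,
# so the squared errors are summed with pandas instead; NaNs are skipped by default
squared_error = ((df_test - y) ** 2).sum().sum()
rmse = (squared_error / total_non_nan) ** 0.5
print(rmse)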
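
RMSE is the square root of the mean squared difference between actual and predicted ratings, taken only over entries where both values exist. A dependency-free sketch with made-up numbers follows; the function `rmse` is a hypothetical helper, and `None` stands in for NaN.

```python
import math

def rmse(actual, predicted):
    """RMSE over positions where both the actual and predicted value are present."""
    pairs = [(a, p) for a, p in zip(actual, predicted)
             if a is not None and p is not None]
    return math.sqrt(sum((a - p) ** 2 for a, p in pairs) / len(pairs))

# Only the first two positions have both an actual and a predicted rating
error = rmse([5, 3, None, 1], [4.0, 3.0, 2.0, None])
print(error)
```

Dividing by the number of compared pairs, rather than by the size of the whole matrix, keeps the mostly empty user-product matrix from shrinking the error artificially.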