<a href="https://colab.research.google.com/github/vsancnaj/Skincare-Product-Recommendation-System/blob/main/SkincareCapstone.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

• Examine the Sephora dataset and determine the distribution of the number of reviews per author

## Make connections and load data

In [None]:
# connect to google drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Connect Kaggle
!mkdir -p ~/.kaggle
!cp /content/drive/MyDrive/Springboard/kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

# dataset download
!kaggle datasets download -d nadyinky/sephora-products-and-skincare-reviews

# Unzip file
!unzip /content/sephora-products-and-skincare-reviews.zip

### Import libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# table paths
product_path = '/content/product_info.csv'
review_250path = '/content/reviews_0-250.csv'
review_1250endpath = '/content/reviews_1250-end.csv'
review_250_500path = '/content/reviews_250-500.csv'
review_500_750path ='/content/reviews_500-750.csv'
review_750_1250path = '/content/reviews_750-1250.csv'

In [None]:
# read csv
product_df = pd.read_csv(product_path)
review_250df = pd.read_csv(review_250path,  dtype={1: str })
review_1250enddf = pd.read_csv(review_1250endpath, dtype={1: str })
review_250_500df = pd.read_csv(review_250_500path, dtype={1: str })
review_500_750df = pd.read_csv(review_500_750path, dtype={1: str })
review_750_1250df = pd.read_csv(review_750_1250path, dtype={1: str })

*Note: the range in each review file name counts products, not reviews. For example, `reviews_0-250.csv` contains the reviews collected for 250 products.*

In [None]:
# check the shape of product_df
product_df.shape

*Note: The review tables only contain reviews for **skincare** products, so we want to drop products that do not belong to this category.*

In [None]:
# look at the first 5 rows
review_1250enddf.head()

In [None]:
# drop columns that are unnecessary for our skincare investigation
cols_to_drop = ['Unnamed: 0','hair_color','eye_color','skin_tone']
review_250df.drop(cols_to_drop, axis=1, inplace=True)
review_1250enddf.drop(cols_to_drop, axis=1, inplace=True)
review_250_500df.drop(cols_to_drop, axis=1, inplace=True)
review_500_750df.drop(cols_to_drop, axis=1, inplace=True)
review_750_1250df.drop(cols_to_drop, axis=1, inplace=True)


In [None]:
print('Review tables shape:')
print(review_250df.shape)
print(review_1250enddf.shape)
print(review_250_500df.shape)
print(review_500_750df.shape)
print(review_750_1250df.shape)

In [None]:
# look at the first 5 rows
review_250df.head()

# Exploratory Data Analysis

First, we look at the product categories present in the `product_df` dataframe. We are only interested in skincare products, so it is useful to remove those that are not in our desired category.

In [None]:
# bar plot of products in dataframe
counts = product_df['primary_category'].value_counts()
percentages = (counts / counts.sum())*100

plt.figure(figsize=(10, 6))
percentages.plot(kind='bar', title='Percentage Distribution of Products by their Primary Category',color='blue',alpha=.5)
plt.ylabel('Percentage')
plt.xlabel('Primary Category')
plt.show()

It is good to see that the majority of the products are skincare. We will now remove all the other categories.

In [None]:
# drop all rows that are not skincare as primary category
product_df= product_df[product_df['primary_category']=='Skincare']
print(product_df.shape)

In [None]:
# observe columns after isolation
product_df.columns

In [None]:
# change name of column reviews
product_df.rename(columns={'reviews':'num_reviews',
                          'price_usd':'original_price_usd',
                           'rating':'average_rating'}, inplace=True)

The 'Wellness', 'High Tech Tools', and 'Self Tanners' secondary categories are outside the scope of this investigation. They include items such as supplements and tools like rollers. Our analysis centers on products such as creams, serums, and toners, which are chemically formulated to be applied to the face and to address specific skincare concerns over time. We therefore exclude these categories from the analysis for now.

In [None]:
# drop rows of product_df whose secondary category we won't be using
values_to_drop = ['Self Tanners','High Tech Tools','Wellness','Shop by Concern','Mini Size','Value & Gift Sets']
product_df = product_df[~product_df['secondary_category'].isin(values_to_drop)]

Combining similar categories may make the analysis and visualizations clearer. "Lip Balms & Treatments" are typically regarded as a form of treatment, and separating them from the "Treatments" category does not substantially affect the analysis, so we group both under a single category.

In [None]:
# combining treatments
categories_to_rename = ['Treatments', 'Lip Balms & Treatments']
product_df['secondary_category'] = product_df['secondary_category'].replace(categories_to_rename, 'Treatments & Serums')

In [None]:
# secondary category of product distribution
counts = product_df['secondary_category'].value_counts()
percentages = (counts / counts.sum())*100

plt.figure(figsize=(10, 6))
percentages.plot(kind='bar', title='Percentage Distribution of Products by their Secondary Category',color='blue',alpha=.5)
plt.ylabel('Percentage')
plt.xlabel('Secondary Category')
plt.show()

In [None]:
# drop rows of product_df whose tertiary category we won't be using
values_to_drop = ['Facial Rollers','BB & CC Creams','Face Wipes','Makeup Removers','Holistic Wellness', 'Teeth Whitening','Blotting Papers','Hair Removal','Beauty Supplements']
product_df = product_df[~product_df['tertiary_category'].isin(values_to_drop)]

In [None]:
# tertiary category of product distribution
counts = product_df['tertiary_category'].value_counts()
percentages = (counts / counts.sum())*100

plt.figure(figsize=(10, 6))
percentages.plot(kind='bar', title='Percentage Distribution of Products by their Tertiary Category',color='blue',alpha=.5)
plt.ylabel('Percentage')
plt.xlabel('Tertiary Category')
plt.show()

*Note: While products may be marketed for specific areas of the body, it's important to note that they often contain ingredients that address broader skincare concerns.*

In [None]:
# check duplicated in products table -> product ids
duplicated_products = product_df[product_df.duplicated(['product_id'])]
duplicated_products

In [None]:
# missing values on products table
sum_na = product_df.isna().sum()
percent_missing = sum_na/len(product_df)*100
percent_missing.sort_values(ascending=False)

In [None]:
product_df['product_id'].nunique()

In [None]:
# Concatenate all review DataFrames vertically
all_reviews = pd.concat([review_250df, review_1250enddf, review_250_500df, review_500_750df, review_750_1250df], ignore_index=True)

In [None]:
# get rid of products with no ingredients
product_df = product_df[product_df['ingredients'].notna()]

In [None]:
# missing values on review table
sum_na = all_reviews.isna().sum()
percent_missing = sum_na/len(all_reviews)*100
percent_missing.sort_values(ascending=False)

In [None]:
# impute skin_type by most common
mode_skin_type = all_reviews['skin_type'].mode()[0]  # Get the most common skin_type value
# assign the result back instead of using inplace=True on a column selection
all_reviews['skin_type'] = all_reviews['skin_type'].fillna(mode_skin_type)

# Check if all missing skin_type values have been imputed
missing_skin_type_count = all_reviews['skin_type'].isnull().sum()
print(f"Number of missing skin_type values after imputation: {missing_skin_type_count}")

In [None]:
print(f'Shape of Concatenated Reviews Dataframe: {all_reviews.shape}\nFirst 5 rows:')
all_reviews.head()

In [None]:
# rename columns so they do not clash with the product table columns after merging
all_reviews.rename(columns={'price_usd':'purchase_price_usd',
                           'rating':'author_rating'}, inplace=True)

In [None]:
# merge product table and review table
cols_join = ['product_id','product_name','brand_name']
merged_df = pd.merge(product_df, all_reviews, on=cols_join, how='inner')

We perform an **INNER JOIN** to retain only those products that have received reviews. The reviews table covers more products than remain in our products table, because I filtered the products table by secondary and tertiary category to keep only the products relevant to our skincare analysis. Consequently, the reviews table may contain reviews for products that fall outside the scope of our investigation, and the inner join discards them.
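To make the effect of the join concrete, here is a minimal sketch on two toy tables. The IDs, names, and ratings are made up for illustration and are not taken from the Sephora data. The outer join with `indicator=True` is only an auditing aid for seeing which rows an inner join would discard; it is not part of the notebook's pipeline.

```python
import pandas as pd

# Hypothetical toy tables; all values are invented for illustration.
products = pd.DataFrame({
    'product_id': ['P1', 'P2', 'P3'],
    'product_name': ['Cream', 'Serum', 'Toner'],
    'brand_name': ['A', 'B', 'C'],
})
reviews = pd.DataFrame({
    'product_id': ['P1', 'P1', 'P2', 'P9'],
    'product_name': ['Cream', 'Cream', 'Serum', 'Roller'],
    'brand_name': ['A', 'A', 'B', 'Z'],
    'author_rating': [5, 4, 3, 5],
})

keys = ['product_id', 'product_name', 'brand_name']

# Inner join: keeps only key combinations present in both tables.
inner = pd.merge(products, reviews, on=keys, how='inner')

# Outer join with indicator=True labels each row by the table it came from,
# which shows what the inner join drops on either side.
audit = pd.merge(products, reviews, on=keys, how='outer', indicator=True)
unmatched = audit['_merge'].value_counts()

print(inner)
print(unmatched)
```

In this sketch the unreviewed product `P3` and the review for the out-of-scope product `P9` both disappear from `inner`, which mirrors what happens to filtered-out products in the real merge.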


In [None]:
# new shape of merged tables
print(f'Shape of Merged Product and Review Table:\n{merged_df.shape}\nColumn names:')
merged_df.columns

In [None]:
# missing values on merged table
sum_na = merged_df.isna().sum()
percent_missing = sum_na/len(merged_df)*100
percent_missing.sort_values(ascending=False)

In [None]:
# Calculate the total number of reviews
total_reviews = len(merged_df)

# Calculate the percentage of reviews for each skin type
merged_df['skin_type_percentage'] = merged_df.groupby('skin_type')['skin_type'].transform(lambda x: len(x) / total_reviews * 100)

# Plot the percentage of reviews by skin type
plt.figure(figsize=(10, 6))
sns.barplot(x='skin_type', y='skin_type_percentage', data=merged_df)
plt.title('Percentage of Reviews by Skin Type')
plt.xlabel('Skin Type')
plt.ylabel('Percentage of Reviews (%)')
plt.xticks(rotation=45)
plt.show()


In [None]:
numerical_cols = ['loves_count', 'average_rating', 'num_reviews', 'original_price_usd']


# Pair Plots
pair_plot = sns.pairplot(merged_df[numerical_cols])
pair_plot.fig.suptitle('Pair Plot of Numerical Variables', y=1.02)  # Adjust y position to prevent overlap
plt.show()


In [None]:
for i in range(20):
  print(i,np.sum(merged_df['author_id'].value_counts() > i))

In [None]:
# Define the upper threshold
upper_threshold = 20

# Calculate the count of reviews per unique author
author_review_counts = merged_df['author_id'].value_counts()

# Filter out authors with more than the upper threshold
author_review_counts_filtered = author_review_counts[author_review_counts <= upper_threshold]

# Calculate percentages
total_authors = len(merged_df['author_id'].unique())
percentages = (author_review_counts_filtered.value_counts() / total_authors) * 100

# Plot the histogram or bar plot
plt.figure(figsize=(10, 6))
percentages.plot(kind='bar', color='skyblue')
plt.title(f'Distribution of Number of Reviews per Unique Author (Threshold = {upper_threshold})')
plt.xlabel('Number of Reviews')
plt.ylabel('Percentage of Authors')
plt.xticks(rotation=0)
plt.show()


In [None]:
# # Temporal Analysis of Reviews
# # Convert submission_time to datetime
# merged_df['submission_time'] = pd.to_datetime(merged_df['submission_time'], errors='coerce')
# # Extract year and month for trend analysis
# merged_df['year'] = merged_df['submission_time'].dt.year
# merged_df['month'] = merged_df['submission_time'].dt.month
# # Plotting the number of reviews over time
# time_series = merged_df.groupby(['year', 'month']).size().reset_index(name='counts')
# plt.figure(figsize=(15, 7))
# sns.lineplot(data=time_series, x='month', y='counts', hue='year', marker='o')
# plt.title('Number of Reviews Over Time')
# plt.xlabel('Month')
# plt.ylabel('Number of Reviews')
# plt.legend(title='Year')
# plt.show()

In [None]:
# # Heatmap of Correlations
# plt.figure(figsize=(20, 20))
# sns.heatmap(merged_df.corr(numeric_only=True), annot=True, cmap='coolwarm')
# plt.title('Heatmap of Correlations')
# plt.show()

In [None]:
merged_df.describe().T

# User-based Collaborative Filtering

User-Based Similarity: Focuses on finding similar users to a given user. The idea is that if two users have rated items similarly in the past, they will likely have similar tastes and preferences in the future. This approach can recommend items that similar users have liked but the target user hasn't seen yet.
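As a small illustration of the idea, the sketch below builds a toy user-by-product matrix, computes cosine similarity between users, and scores products with a similarity-weighted sum. The users, products, and ratings are hypothetical, and 0 is assumed to mean "not rated".

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical ratings: rows are users, columns are products, 0 means "not rated".
toy_ratings = pd.DataFrame(
    [[5.0, 4.0, 0.0],
     [5.0, 0.0, 3.0],
     [0.0, 0.0, 4.0]],
    index=['u1', 'u2', 'u3'],
    columns=['pA', 'pB', 'pC'],
)

# Pairwise cosine similarity between the users' rating vectors.
toy_similarity = cosine_similarity(toy_ratings)

# Score each product for each user as a similarity-weighted sum of all users' ratings.
toy_scores = pd.DataFrame(
    np.dot(toy_similarity, toy_ratings),
    index=toy_ratings.index,
    columns=toy_ratings.columns,
)
print(toy_scores.round(2))
```

Here `u1` has not rated `pC`, but `u2` has and shares a high rating of `pA` with `u1`, so `pC` receives a positive score for `u1`. Users with no products in common, such as `u1` and `u3`, have similarity 0 and do not influence each other. Note that these scores are unnormalized weighted sums, so they are useful for ranking rather than as predictions on the 1 to 5 rating scale.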

In [None]:
# remove duplicates
merged_df = merged_df.drop_duplicates(subset=['author_id', 'product_id'])

In [44]:
# Removing products with no values in ratings and reviews columns
merged_df = merged_df.dropna(subset=['num_reviews', 'average_rating'])

# Getting the count of products
product_stats = merged_df.groupby(['product_id'])['product_id'].count().reset_index(name='counts')

# Sorting products by count
product_stats = product_stats.sort_values('counts', ascending=False)

# Cutoff at the 20th percentile of per-product review counts
cutoff = product_stats['counts'].quantile(0.20)

# Keeping only products with more reviews than the cutoff
filtered_products = product_stats.loc[product_stats['counts'] > cutoff]

# Converting product_ids to set
products_set = set(filtered_products['product_id'])

# Keeping products with filtered IDs
merged_df = merged_df.loc[merged_df['product_id'].isin(products_set)]

# Getting the count of reviews by authors
author_stats = merged_df.groupby(['author_id'])['author_id'].count().reset_index(name='counts')

# Sorting authors by count
author_stats = author_stats.sort_values('counts', ascending=False)

# Cutoff at the 95th percentile of per-author review counts
cutoff = author_stats['counts'].quantile(0.95)

# Keeping only the most active authors (more reviews than the cutoff)
filtered_authors = author_stats.loc[author_stats['counts'] > cutoff]

# Converting author_ids to set
authors_set = set(filtered_authors['author_id'])

# Keeping reviews from filtered authors
merged_df = merged_df.loc[merged_df['author_id'].isin(authors_set)]


In [49]:
merged_df.shape

(228006, 40)

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# convert rating to float
merged_df['author_rating'] = merged_df['author_rating'].astype(float)

# isolate the columns that will be used for collaborative filtering
col = ['author_id','product_id','author_rating']
product_ratings = merged_df[col]

product_ratings.head()

Unnamed: 0,author_id,product_id,author_rating
0,5880814443,P439055,5.0
1,1726924575,P439055,5.0
2,1551348158,P439055,5.0
3,8222942765,P439055,5.0
4,2403670662,P439055,2.0


In [None]:
# # Save DataFrame to a CSV file
# product_ratings.to_csv('/content/drive/MyDrive/Springboard/capstones/3. Final Capstone/product_ratings.csv', index=False, encoding='utf-8')

## Take a sample of our Data

In [None]:
# Define the desired sample size per product
your_sample_size_per_product = 15

# Group by 'product_id' and sample a fixed number of rows for each product
sampled_data = product_ratings.groupby('product_id', group_keys=False).apply(lambda x: x.sample(n=min(len(x), your_sample_size_per_product), replace=False))

# Reset the index so the sampled rows are numbered consecutively
sampled_data = sampled_data.reset_index(drop=True)

In [None]:
sampled_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25699 entries, 0 to 25698
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   author_id      25699 non-null  object 
 1   product_id     25699 non-null  object 
 2   author_rating  25699 non-null  float64
dtypes: float64(1), object(2)
memory usage: 602.4+ KB


In [None]:
sampled_data.head(5)

Unnamed: 0,author_id,product_id,author_rating
0,1091565251,P107306,5.0
1,2074982661,P107306,5.0
2,1708204737,P107306,4.0
3,1418363993,P107306,2.0
4,1321143025,P107306,5.0


In [None]:
# unique_values_count = product_ratings['product_id'].nunique()

# print("Number of unique values in the column:", unique_values_count)

In [None]:
from sklearn.model_selection import train_test_split
# split the data and look at the shape of the test and train
X_train, X_test = train_test_split(sampled_data, test_size = 0.30, random_state = 42)

print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)

Training set shape: (17989, 3)
Test set shape: (7710, 3)


In [None]:
# drop duplicate authors that reviewed the same product
X_train = X_train.drop_duplicates(subset=['author_id', 'product_id'])
X_test = X_test.drop_duplicates(subset=['author_id', 'product_id'])

In [None]:
X_train.head()

Unnamed: 0,author_id,product_id,author_rating
10480,21902949040,P448542,5.0
25459,46046914805,P505316,5.0
16257,1468604816,P470217,2.0
10971,22008665052,P450262,4.0
15376,6092231805,P469088,4.0


In [None]:
X_train.shape

(17986, 3)

In [None]:
# pivot the ratings into products
user_data = X_train.pivot(index = 'author_id', columns = 'product_id', values = 'author_rating').fillna(0)
user_data.head(5)

product_id,P107306,P114902,P12045,P122651,P122661,P122718,P122727,P122762,P122767,P122774,...,P505452,P505711,P54509,P6028,P7880,P94421,P94812,P9939,P9940,P9941
author_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
10000770719,,,,,,,,,,,...,,,,,,,,,,
1000235057,,,,,,,,,,,...,,,,,,,,,,
10005363344,,,,,,,,,,,...,,,,,,,,,,
10005372037,,,,,,,,,,,...,,,,,,,,,,
1000899606,,,,,,,,,,,...,,,,,,,,,,


In [None]:
# Count the unrated cells. Missing ratings were filled with 0 when pivoting,
# and real ratings are never 0, so a 0 marks an unrated product.
missing_values_count = (user_data == 0).sum().sum()

# Calculate the total number of values in the DataFrame
total_values = user_data.shape[0] * user_data.shape[1]

# Calculate the sparsity
sparsity = missing_values_count / total_values

print("Sparsity of the data:", sparsity)

Sparsity of the data: 0.9993912552424529


### Create a copy of train and test dataset
These datasets will be used for prediction and evaluation.

`dummy_train` will be used later to predict scores for products the user has not rated yet. To exclude products the user has already rated, we mark them as 0 during prediction; products not yet rated are marked as 1.

`dummy_test` will be used for evaluation. There we only want predictions for products the user has actually rated, so those are marked as 1 and everything else as 0. This is the opposite of `dummy_train`.
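The masking step can be sketched on a tiny example. The authors, products, and predicted scores below are hypothetical and chosen only to show how multiplying by the mask removes already-rated products from the candidates.

```python
import pandas as pd

# Hypothetical long-format ratings, one row per (author, product) pair.
toy_train = pd.DataFrame({
    'author_id': ['u1', 'u1', 'u2'],
    'product_id': ['pA', 'pB', 'pA'],
    'author_rating': [5.0, 3.0, 4.0],
})

# Mask for prediction: 0 where the author already rated the product, 1 elsewhere.
toy_mask = toy_train.copy()
toy_mask['author_rating'] = 0
toy_mask = toy_mask.pivot(index='author_id', columns='product_id',
                          values='author_rating').fillna(1)

# Hypothetical predicted scores with the same shape as the mask.
toy_scores = pd.DataFrame([[4.5, 3.2], [4.1, 2.7]],
                          index=toy_mask.index, columns=toy_mask.columns)

# Element-wise product zeroes out products the author has already rated.
toy_candidates = toy_scores * toy_mask
print(toy_candidates)
```

Only `u2`'s score for `pB` survives, because it is the single (author, product) pair without a rating in `toy_train`.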

In [None]:
dummy_train = X_train.copy()
dummy_test = X_test.copy()

dummy_train['author_rating'] = dummy_train['author_rating'].apply(lambda x: 0 if x > 0 else 1)
dummy_test['author_rating'] = dummy_test['author_rating'].apply(lambda x: 1 if x > 0 else 0)

In [None]:
dummy_train.head()

Unnamed: 0,author_id,product_id,author_rating
322211,1971862942,P12336,0
93142,5987960857,P94812,0
737592,6581846678,P429952,0
187336,5865606966,P416755,0
840917,25537874067,P443845,0


In [None]:
# Products not rated by the user are marked as 1 for prediction
dummy_train = dummy_train.pivot(index = 'author_id', columns = 'product_id', values = 'author_rating').fillna(1)

# Products not rated by the user are marked as 0 for evaluation
dummy_test = dummy_test.pivot(index ='author_id', columns = 'product_id', values = 'author_rating').fillna(0)

In [None]:
dummy_train.sample(10)

product_id,P107306,P114902,P12045,P122651,P122661,P122718,P122727,P122762,P122767,P122774,...,P505452,P505711,P54509,P6028,P7880,P94421,P94812,P9939,P9940,P9941
author_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
7131183798,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
1243993719,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
21477563236,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
30269788335,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
7541500531,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
2066012546,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
1971860270,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
2399730362,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
681169385,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
22639559596,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [None]:
# User Similarity Matrix using Cosine similarity as a similarity measure between Users
user_similarity = cosine_similarity(user_data)
user_similarity[np.isnan(user_similarity)] = 0
print(user_similarity)
print(user_similarity.shape)

In [None]:
user_predicted_ratings = np.dot(user_similarity, user_data)
user_predicted_ratings

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [None]:
user_predicted_ratings.shape

(16328, 1805)

In [None]:
# np.multiply for cell-by-cell multiplication

user_final_ratings = np.multiply(user_predicted_ratings, dummy_train)
user_final_ratings.head()

product_id,P107306,P114902,P12045,P122651,P122661,P122718,P122727,P122762,P122767,P122774,...,P505452,P505711,P54509,P6028,P7880,P94421,P94812,P9939,P9940,P9941
author_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
10000846022,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1000092935,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10001606143,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1000235057,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10003431865,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
X_train[X_train['author_id'] == '28218583828']

Unnamed: 0,author_id,product_id,author_rating
15376,28218583828,P469088,4.0


In [None]:
user_final_ratings.loc['28218583828'].sort_values(ascending = False)[0:5]

product_id
P480160    3.535534
P467250    3.535534
P473726    0.000000
P474110    0.000000
P474109    0.000000
Name: 28218583828, dtype: float64