###**Skincare Recommendation**

Skincare products are among the most popular consumer products worldwide. As the market grows, so does the number of product types and variants that consumers can choose from. This abundance of choice, however, can make it difficult for consumers to pick the product that suits them best.


One solution to this problem is an information retrieval system for skincare products that filters search results accurately based on the keywords and queries entered by the consumer. To produce accurate results, such a system must be able to distinguish skincare products by type, by composition (the ingredients used), and by product description.

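This system is designed to recommend skincare products based on the product type, the composition or ingredients, and the price of the product. Note that the `price` column is dropped during data preparation below, so the current implementation matches only on product name, type, and ingredients.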

####**Preparing the Data**

In [None]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string
import re
from scipy.sparse import hstack

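# Download the tokenizer models and the stopword list used below
nltk.download('punkt')
nltk.download('punkt_tab')  # assumption: needed by word_tokenize on newer NLTK releases
nltk.download('stopwords')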

In [18]:
# Load dataset
df = pd.read_csv('skincare_products_clean.csv')
df.head()

Unnamed: 0,product_name,product_url,product_type,clean_ingreds,price
0,The Ordinary Natural Moisturising Factors + HA...,https://www.lookfantastic.com/the-ordinary-nat...,Moisturiser,"['capric triglyceride', 'cetyl alcohol', 'prop...",£5.20
1,CeraVe Facial Moisturising Lotion SPF 25 52ml,https://www.lookfantastic.com/cerave-facial-mo...,Moisturiser,"['homosalate', 'glycerin', 'octocrylene', 'eth...",£13.00
2,The Ordinary Hyaluronic Acid 2% + B5 Hydration...,https://www.lookfantastic.com/the-ordinary-hya...,Moisturiser,"['sodium hyaluronate', 'sodium hyaluronate', '...",£6.20
3,AMELIORATE Transforming Body Lotion 200ml,https://www.lookfantastic.com/ameliorate-trans...,Moisturiser,"['ammonium lactate', 'c12-15', 'glycerin', 'pr...",£22.50
4,CeraVe Moisturising Cream 454g,https://www.lookfantastic.com/cerave-moisturis...,Moisturiser,"['glycerin', 'cetearyl alcohol', 'capric trigl...",£16.00


**Dropping the 'product_url' and 'price' columns**

In [19]:
df = df.drop(columns=['product_url', 'price'])
df.head()

Unnamed: 0,product_name,product_type,clean_ingreds
0,The Ordinary Natural Moisturising Factors + HA...,Moisturiser,"['capric triglyceride', 'cetyl alcohol', 'prop..."
1,CeraVe Facial Moisturising Lotion SPF 25 52ml,Moisturiser,"['homosalate', 'glycerin', 'octocrylene', 'eth..."
2,The Ordinary Hyaluronic Acid 2% + B5 Hydration...,Moisturiser,"['sodium hyaluronate', 'sodium hyaluronate', '..."
3,AMELIORATE Transforming Body Lotion 200ml,Moisturiser,"['ammonium lactate', 'c12-15', 'glycerin', 'pr..."
4,CeraVe Moisturising Cream 454g,Moisturiser,"['glycerin', 'cetearyl alcohol', 'capric trigl..."


In [20]:
# Concatenate all remaining columns of each row into a single text document
combined_df = df.astype(str).agg(' '.join, axis=1)
combined_df.head()

0    The Ordinary Natural Moisturising Factors + HA...
1    CeraVe Facial Moisturising Lotion SPF 25 52ml ...
2    The Ordinary Hyaluronic Acid 2% + B5 Hydration...
3    AMELIORATE Transforming Body Lotion 200ml Mois...
4    CeraVe Moisturising Cream 454g Moisturiser ['g...
dtype: object
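The combined text above still carries the list syntax of `clean_ingreds` (brackets, quotes, commas), because that column is stored as a stringified Python list. A minimal sketch of flattening it into plain text before joining is shown here; the helper name `ingredients_to_text` and the sample value are hypothetical, not part of the original notebook.

```python
import ast

def ingredients_to_text(raw):
    """Turn a stringified list such as "['glycerin', 'cetyl alcohol']" into plain text."""
    try:
        items = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Fall back to the raw string if it is not a valid Python literal.
        return str(raw)
    if not isinstance(items, (list, tuple)):
        return str(raw)
    return ' '.join(str(item) for item in items)

# Hypothetical sample value, shaped like the `clean_ingreds` column shown above.
sample = "['glycerin', 'cetearyl alcohol', 'sodium hyaluronate']"
print(ingredients_to_text(sample))  # glycerin cetearyl alcohol sodium hyaluronate
```

If adopted, it could be applied with `df['clean_ingreds'].apply(ingredients_to_text)` before the columns are joined; the outputs shown in the rest of this notebook were produced without it.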

**Removing stopwords**

In [21]:
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    # Lowercase the text and split it into tokens
    words = word_tokenize(text.lower())
    # Drop English stopwords
    filtered_words = [word for word in words if word not in stop_words]
    return ' '.join(filtered_words)

In [22]:
removed_stopword = combined_df.apply(remove_stopwords)
removed_stopword.head()

0    ordinary natural moisturising factors + ha 30m...
1    cerave facial moisturising lotion spf 25 52ml ...
2    ordinary hyaluronic acid 2 % + b5 hydration su...
3    ameliorate transforming body lotion 200ml mois...
4    cerave moisturising cream 454g moisturiser [ '...
dtype: object
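The output above shows that stopword removal alone leaves punctuation tokens such as `+`, `%`, and `[` in the text. The following sketch shows one way to strip them as well. To stay runnable without downloading corpora it uses whitespace splitting and a tiny illustrative stopword set; both are assumptions that differ from the NLTK tokenizer and stopword list used in the notebook.

```python
import string

# Small illustrative stopword set (an assumption; the notebook uses NLTK's English list).
STOP_WORDS = {"the", "a", "an", "and", "of", "for", "with"}

def clean_text(text):
    """Lowercase, split on whitespace, and drop stopwords and pure-punctuation tokens."""
    kept = []
    for token in text.lower().split():
        # Trim quotes, brackets, commas, and similar characters at the token edges.
        token = token.strip(string.punctuation)
        if token and token not in STOP_WORDS:
            kept.append(token)
    return ' '.join(kept)

# Hypothetical input, shaped like one row of the combined text.
text = "The Ordinary Hyaluronic Acid 2% + B5 ['sodium hyaluronate', 'glycerin']"
print(clean_text(text))  # ordinary hyaluronic acid 2 b5 sodium hyaluronate glycerin
```

Stripping only at the token edges keeps internal characters, so an ingredient such as `c12-15` stays intact.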

**Tokenizing the Data**

In [23]:
tokenized_data = removed_stopword.apply(word_tokenize)

# Show the tokenization result
tokenized_data.head()

0    [ordinary, natural, moisturising, factors, +, ...
1    [cerave, facial, moisturising, lotion, spf, 25...
2    [ordinary, hyaluronic, acid, 2, %, +, b5, hydr...
3    [ameliorate, transforming, body, lotion, 200ml...
4    [cerave, moisturising, cream, 454g, moisturise...
dtype: object

In [24]:
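# Convert each token list back to a string (its printed list form) so that
# TfidfVectorizer, which expects one string per document, can consume it
tokenized_data = tokenized_data.astype(str)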
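Calling `astype(str)` on a Series of lists stores each list's printed form, brackets and quotes included. The sketch below compares that with a plain space-joined document under the regular expression `(?u)\b\w\w+\b`, which is the default `token_pattern` of `TfidfVectorizer`. The token list is a hypothetical sample.

```python
import re

# Default token_pattern of TfidfVectorizer: runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Hypothetical tokens, shaped like the tokenizer output shown above.
tokens = ['ordinary', 'acid', '2', '%', '+', 'b5', '[', "'capric", 'triglyceride', "'", ']']

as_repr = str(tokens)         # what astype(str) stores: the list's printed form
as_joined = ' '.join(tokens)  # a plain space-separated document

print(TOKEN_PATTERN.findall(as_repr))
print(TOKEN_PATTERN.findall(as_joined))
```

Both calls yield the same terms, because the pattern discards punctuation and single-character tokens either way. The printed-list form therefore does not change the vocabulary, but `' '.join` keeps the stored text readable.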

###**Applying the Indexing Technique**

In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer

# One string per product; this list is also used later to display search results
dataset = tokenized_data.tolist()

# Fit the TF-IDF vocabulary and build the sparse document-term matrix
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(dataset)

# Dense view for inspection only: one row per product, one column per term
tfidf_df = pd.DataFrame(
    vectors.toarray(), columns=vectorizer.get_feature_names_out()
)

In [26]:
tfidf_df.head()

Unnamed: 0,000iu,069,090,094,10,100,1000,100g,100ml,101,...,zealand,zelens,zeolite,zeylanicum,zinc,zingiber,zizanioides,zizanoides,zizyphus,zone
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.111098,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.116991,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [27]:
# Example for skincare #1
print(dataset[0])
tfidf_df.loc[0].sort_values(ascending=False)

['ordinary', 'natural', 'moisturising', 'factors', '+', 'ha', '30ml', 'moisturiser', '[', "'capric", 'triglyceride', "'", ',', "'cetyl", 'alcohol', "'", ',', "'propanediol", "'", ',', "'stearyl", 'alcohol', "'", ',', "'glycerin", "'", ',', "'sodium", 'hyaluronate', "'", ',', "'arganine", "'", ',', "'aspartic", 'acid', "'", ',', "'glycine", "'", ',', "'alanine", "'", ',', "'serine", "'", ',', "'valine", "'", ',', "'isoleucine", "'", ',', "'proline", "'", ',', "'threonine", "'", ',', "'histidine", "'", ',', "'phenylalanine", "'", ',', "'glucose", "'", ',', "'maltose", "'", ',', "'fructose", "'", ',', "'trehalose", "'", ',', "'sodium", 'pca', "'", ',', "'pca", "'", ',', "'sodium", 'lactate', "'", ',', "'urea", "'", ',', "'allantoin", "'", ',', "'linoleic", 'acid', "'", ',', "'oleic", 'acid', "'", ',', "'phytosteryl", 'canola', 'glycerides', "'", ',', "'palmitic", 'acid', "'", ',', "'stearic", 'acid', "'", ',', "'lecithin", "'", ',', "'triolein", "'", ',', "'tocopherol", "'", ',', "'carbom

acid       0.246565
pca        0.197432
factors    0.196617
sodium     0.187877
maltose    0.182143
             ...   
10g        0.000000
10ml       0.000000
11         0.000000
113g       0.000000
069        0.000000
Name: 0, Length: 3713, dtype: float64
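To make the weights above easier to interpret, here is a small pure-Python sketch of TF-IDF on a toy corpus. It follows the smoothed IDF, ln((1 + n) / (1 + df)) + 1, and the L2 row normalization that `TfidfVectorizer` applies with its default settings. The function name `tfidf` and the three documents are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """L2-normalized TF-IDF with smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        # Guard against an empty document, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

# Hypothetical three-product corpus.
docs = [
    "glycerin sodium hyaluronate",
    "glycerin cetyl alcohol",
    "glycerin niacinamide zinc",
]
for vec in tfidf(docs):
    print({t: round(w, 3) for t, w in vec.items()})
```

Because `glycerin` occurs in every document, it receives the lowest weight in each row, while terms unique to one product are weighted highest.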

###**Search Technique: Vector Space Model**

In [28]:
from sklearn.metrics.pairwise import cosine_similarity
# Query from the user
query = input("What are you looking for? ")

# Remove stopwords from the query
cleaned_query = remove_stopwords(query)

# Convert the query into a TF-IDF vector
query_vector = vectorizer.transform([cleaned_query])

# Compute the cosine similarity between the query and every product
cosine_similarities = cosine_similarity(query_vector, vectors)

# Sort products by cosine similarity, highest first
sorted_indices = cosine_similarities[0].argsort()[::-1]

# Show the search results in order of relevance
# (index + 1 is the product's 1-based row number, not its rank)
print("Products relevant to the query:")
for index in sorted_indices[:5]:  # Show the top 5 products
    print(f"Product {index + 1}: {dataset[index]} (Relevance: {cosine_similarities[0][index]:.4f})")

Products relevant to the query:
Product 1: ['ordinary', 'natural', 'moisturising', 'factors', '+', 'ha', '30ml', 'moisturiser', '[', "'capric", 'triglyceride', "'", ',', "'cetyl", 'alcohol', "'", ',', "'propanediol", "'", ',', "'stearyl", 'alcohol', "'", ',', "'glycerin", "'", ',', "'sodium", 'hyaluronate', "'", ',', "'arganine", "'", ',', "'aspartic", 'acid', "'", ',', "'glycine", "'", ',', "'alanine", "'", ',', "'serine", "'", ',', "'valine", "'", ',', "'isoleucine", "'", ',', "'proline", "'", ',', "'threonine", "'", ',', "'histidine", "'", ',', "'phenylalanine", "'", ',', "'glucose", "'", ',', "'maltose", "'", ',', "'fructose", "'", ',', "'trehalose", "'", ',', "'sodium", 'pca', "'", ',', "'pca", "'", ',', "'sodium", 'lactate', "'", ',', "'urea", "'", ',', "'allantoin", "'", ',', "'linoleic", 'acid', "'", ',', "'oleic", 'acid', "'", ',', "'phytosteryl", 'canola', 'glycerides', "'", ',', "'palmitic", 'acid', "'", ',', "'stearic", 'acid', "'", ',', "'lecithin", "'", ',', "'triolein",
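The search cell above ranks products by cosine similarity and prints their raw token lists. As an illustration of the same vector space idea on a small scale, the sketch below ranks a toy catalogue, skips products with zero similarity, and prints the product name with its rank. The names `cosine` and `search`, the product names, and the weights are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_a * norm_b)

def search(query_vec, product_vecs, names, top_k=5):
    """Return (rank, name, score) for the top_k products with a non-zero score."""
    scored = [(cosine(query_vec, vec), name) for vec, name in zip(product_vecs, names)]
    scored = [(score, name) for score, name in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(rank, name, score)
            for rank, (score, name) in enumerate(scored[:top_k], start=1)]

# Hypothetical product names and term weights.
names = ["Serum A", "Lotion B", "Cream C"]
product_vecs = [
    {"hyaluronate": 0.8, "glycerin": 0.6},
    {"glycerin": 1.0},
    {"zinc": 0.7, "niacinamide": 0.7},
]
query_vec = {"hyaluronate": 1.0}

for rank, name, score in search(query_vec, product_vecs, names):
    print(f"{rank}. {name} (relevance: {score:.4f})")
```

Filtering out zero scores means a query with no vocabulary overlap returns an empty result instead of five arbitrary products.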

### Evaluation on the Entire Dataset

In [29]:
# Relevance thresholds
threshold_recommend = 0.150
threshold_ground_truth = 0.180

# A product counts as recommended if its cosine similarity >= threshold_recommend
recommended_relevance = cosine_similarities[0][sorted_indices] >= threshold_recommend

# Proxy ground truth: a product counts as truly relevant if its cosine similarity
# >= threshold_ground_truth. Both labels come from the same scores, so every
# ground-truth item is also recommended and FN is always 0.
ground_truth_relevance = cosine_similarities[0][sorted_indices] >= threshold_ground_truth

# Print recommended_relevance and ground_truth_relevance
print("Recommended Relevance (>= threshold):", recommended_relevance)
print("Ground Truth Relevance (>= threshold):", ground_truth_relevance)

# Count True Positives (TP), False Positives (FP), and False Negatives (FN)
tp = np.sum(np.logical_and(recommended_relevance, ground_truth_relevance))  # True Positives
fp = np.sum(np.logical_and(recommended_relevance, ~ground_truth_relevance))  # False Positives
fn = np.sum(np.logical_and(~recommended_relevance, ground_truth_relevance))  # False Negatives

# Compute Precision, Recall, and F1-Score (0 when the denominator is 0)
precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0
f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

# Evaluation results
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1-Score: {f1}")

Recommended Relevance (>= threshold): [False False False ... False False False]
Ground Truth Relevance (>= threshold): [False False False ... False False False]
Precision: 0
Recall: 0
F1-Score: 0
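In the cell above, both the recommendation labels and the ground truth are thresholds on the same cosine scores, so any product that passes 0.180 also passes 0.150 and the false-negative count cannot be anything but zero. A sketch of the same metrics computed against independently judged relevance labels is shown below. The function name and the product IDs are hypothetical, since the notebook does not include such labels.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall, and F1 for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)   # returned and relevant
    fp = len(retrieved - relevant)   # returned but not relevant
    fn = len(relevant - retrieved)   # relevant but not returned
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Hypothetical product IDs: what the system returned vs. what a person judged relevant.
retrieved_ids = [0, 2, 5, 7]
relevant_ids = [2, 5, 9]

p, r, f = precision_recall_f1(retrieved_ids, relevant_ids)
print(f"Precision: {p:.3f}, Recall: {r:.3f}, F1-Score: {f:.3f}")
# Precision: 0.500, Recall: 0.667, F1-Score: 0.571
```

With independent labels, a relevant product that the system misses (ID 9 here) lowers recall, which the threshold-on-threshold setup cannot capture.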


### Evaluation on the Top 5

In [30]:
# Relevance thresholds
threshold_recommend = 0.155
threshold_ground_truth = 0.180

# A product counts as recommended if its cosine similarity >= threshold_recommend;
# the proxy ground truth uses the stricter threshold_ground_truth
recommended_relevance = cosine_similarities[0] >= threshold_recommend
ground_truth_relevance = cosine_similarities[0] >= threshold_ground_truth

# Indices of the recommended products
relevant_indices = np.where(recommended_relevance)[0]
# Positions (within relevant_indices) of the 5 highest-scoring recommended products
top5_positions = np.argsort(-cosine_similarities[0][relevant_indices])[:5]

# Restrict both label arrays to those top products
recommended_relevance_top = recommended_relevance[relevant_indices][top5_positions]
ground_truth_relevance_top = ground_truth_relevance[relevant_indices][top5_positions]

# Count TP, FP, and FN on the top products
tp = np.sum(np.logical_and(recommended_relevance_top, ground_truth_relevance_top))  # True Positives
fp = np.sum(np.logical_and(recommended_relevance_top, ~ground_truth_relevance_top))  # False Positives
fn = np.sum(np.logical_and(~recommended_relevance_top, ground_truth_relevance_top))  # False Negatives

# Compute Precision, Recall, and F1-Score (0 when the denominator is 0)
precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0
f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

# Evaluation results
print(f"Recommended Relevance (>= threshold): {recommended_relevance_top}")
print(f"Ground Truth Relevance (>= threshold): {ground_truth_relevance_top}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1-Score: {f1}")

Recommended Relevance (>= threshold): []
Ground Truth Relevance (>= threshold): []
Precision: 0
Recall: 0
F1-Score: 0
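The empty arrays above indicate that no product reached the 0.155 threshold for this query, so there was nothing to evaluate. A common way to score a fixed-size result list is precision at k, sketched below against independently judged relevance labels. The function name, the ranking, and the labels are hypothetical.

```python
def precision_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of the top-k ranked items that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_ids[:k]
    if not top_k:
        return 0.0  # nothing was retrieved
    relevant = set(relevant_ids)
    hits = sum(1 for item in top_k if item in relevant)
    # Divide by k, so returning fewer than k items is penalized.
    return hits / k

# Hypothetical ranking (best first) and hand-labeled relevant product IDs.
ranked_ids = [4, 1, 9, 3, 7, 2]
relevant_ids = [1, 3, 8]

print(precision_at_k(ranked_ids, relevant_ids, k=5))  # 0.4
```

Dividing by `k` rather than by the number of returned items is a design choice; it treats a short result list as a partial failure instead of hiding it.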
