In [1]:
import numpy as np
import pandas as pd
import nltk
nltk.download('punkt')
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
amazon_df = pd.read_csv('/content/amazon_product.csv')
amazon_df.head()

Unnamed: 0,id,Title,Description,Category
0,1,Swissmar Capstore Select Storage Rack for 18-...,Swissmar's capstore select 18 storage unit kee...,Home & Kitchen Kitchen & Dining Kitchen Utens...
1,2,Gemini200 Delta CV-880 Gold Crown Livery Airc...,Welcome to the exciting world of GeminiJets! O...,Toys & Games Hobbies Models & Model Kits Pre-...
2,5,Superior Threads 10501-2172 Magnifico Cream P...,"For quilting and embroidery, this product is m...","Arts, Crafts & Sewing Sewing Thread & Floss S..."
3,6,Fashion Angels Color Rox Hair Chox Kit,Experiment with the haute trend of hair chalki...,Beauty & Personal Care Hair Care Hair Colorin...
4,8,Union Creative Giant Killing Figure 05: Daisu...,From Union Creative. Turn your display shelf i...,Toys & Games › Action Figures & Statues › Sta...


In [3]:
amazon_df = amazon_df.drop('id',axis=1)
amazon_df

Unnamed: 0,Title,Description,Category
0,Swissmar Capstore Select Storage Rack for 18-...,Swissmar's capstore select 18 storage unit kee...,Home & Kitchen Kitchen & Dining Kitchen Utens...
1,Gemini200 Delta CV-880 Gold Crown Livery Airc...,Welcome to the exciting world of GeminiJets! O...,Toys & Games Hobbies Models & Model Kits Pre-...
2,Superior Threads 10501-2172 Magnifico Cream P...,"For quilting and embroidery, this product is m...","Arts, Crafts & Sewing Sewing Thread & Floss S..."
3,Fashion Angels Color Rox Hair Chox Kit,Experiment with the haute trend of hair chalki...,Beauty & Personal Care Hair Care Hair Colorin...
4,Union Creative Giant Killing Figure 05: Daisu...,From Union Creative. Turn your display shelf i...,Toys & Games › Action Figures & Statues › Sta...
...,...,...,...
663,Rosemery (Rosemary) - Box of Six 20 Stick Hex...,"Six tubes, each containing 20 sticks of incens...",Home & Kitchen Home Décor Home Fragrance Ince...
664,"InterDesign Linus Stacking Organizer Bin, Ext...",The InterDesign Linus Organizer Bins are stack...,Home & Kitchen Kitchen & Dining Storage & Org...
665,Gourmet Rubber Stamps Diagonal Stripes Stenci...,Gourmet Rubber Stamps-Stencil. This delicious ...,Toys & Games Arts & Crafts Printing & Stamping
666,Spenco RX Arch Cushion Full Length Comfort Su...,"Soft, durable arch support. consumers with gen...",Health & Household › Health Care › Foot Healt...


In [4]:
amazon_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 668 entries, 0 to 667
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Title        668 non-null    object
 1   Description  668 non-null    object
 2   Category     668 non-null    object
dtypes: object(3)
memory usage: 15.8+ KB



#Define a function that tokenizes and stems the text of the product titles and descriptions using the NLTK library.

Overall, this function is useful for preprocessing text data for tasks such as natural language processing and machine learning, where reducing the number of unique words in the text can improve model performance. Stemming can help by reducing each word to its base form, allowing the model to treat variations of the same word (e.g., "run", "running", "ran") as equivalent.

In [5]:
stemmer = SnowballStemmer("english")

def tokenize_stem(text):
    tokens = nltk.word_tokenize(text.lower())
    stem = [stemmer.stem(w) for w in tokens]
    return " ".join(stem)

In [6]:
# amazon_df['stemmed_tokens'] = amazon_df.apply(lambda row: tokenize_stem(row['Title'] + ' ' + row['Description']), axis=1)
amazon_df['stemmed_tokens'] = amazon_df.apply(lambda row: tokenize_stem(row['Category']), axis=1)
amazon_df.head()

Unnamed: 0,Title,Description,Category,stemmed_tokens
0,Swissmar Capstore Select Storage Rack for 18-...,Swissmar's capstore select 18 storage unit kee...,Home & Kitchen Kitchen & Dining Kitchen Utens...,home & kitchen kitchen & dine kitchen utensil ...
1,Gemini200 Delta CV-880 Gold Crown Livery Airc...,Welcome to the exciting world of GeminiJets! O...,Toys & Games Hobbies Models & Model Kits Pre-...,toy & game hobbi model & model kit pre-built &...
2,Superior Threads 10501-2172 Magnifico Cream P...,"For quilting and embroidery, this product is m...","Arts, Crafts & Sewing Sewing Thread & Floss S...","art , craft & sew sew thread & floss sew"
3,Fashion Angels Color Rox Hair Chox Kit,Experiment with the haute trend of hair chalki...,Beauty & Personal Care Hair Care Hair Colorin...,beauti & person care hair care hair color prod...
4,Union Creative Giant Killing Figure 05: Daisu...,From Union Creative. Turn your display shelf i...,Toys & Games › Action Figures & Statues › Sta...,toy & game › action figur & statu › statu & bo...


## Define a Function that returns the cosine similarity between 2 entities ( in  this case , titles and descriptions)

TfidfVectorizer is a text feature extraction tool that creates a numerical representation of text by converting it into a matrix of TF-IDF (Term Frequency-Inverse Document Frequency) features. This matrix can then be used as input for various machine learning algorithms.

The tokenizer parameter in TfidfVectorizer is set to tokenize_stem, which is a custom function defined elsewhere in the code that tokenizes and stems text.

cosine_similarity is a function in scikit-learn that calculates the cosine similarity between two matrices. In this code, it is used to calculate the cosine similarity between two texts represented as TF-IDF matrices generated by TfidfVectorizer.

The cosine_sim function defined in the code takes in two text inputs (txt1 and txt2), uses TfidfVectorizer to generate TF-IDF matrices for them, and then calculates the cosine similarity between these matrices using cosine_similarity. It returns the cosine similarity score as a single value.

### What is Cosine Similarity?

Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them. In the context of natural language processing, it is commonly used to measure the similarity between two text documents. The cosine similarity score ranges from 0 to 1, where 0 means no similarity and 1 means identical. The score is calculated based on the frequency of common words or terms in the two documents. The higher the cosine similarity score, the more similar the documents are.

In [7]:
tfidvectorizer = TfidfVectorizer(tokenizer=tokenize_stem)

def cosine_sim(txt1,txt2):
    tfid_matrix = tfidvectorizer.fit_transform([txt1,txt2])
    return cosine_similarity(tfid_matrix)[0][1]    # returns the cosine similarity of both the texts here the title and description between 0 to 1, 1 identical , 0 dissimilar.

## Define a function that takes a query string and returns a DataFrame with the top 10 most relevant products based on the cosine similarity between the query string and the product titles and descriptions.

The tokenize_stem function is called on the query string, which presumably tokenizes and stems the words in the string. The cosine_sim function is used to compute the cosine similarity between the stemmed query and each product's stemmed_tokens column. This function likely calculates the cosine similarity between two vectors. A new column called similarity is added to the amzon_df dataframe using the apply method. This column contains the cosine similarity between the query and each product's stemmed_tokens column. The amzon_df dataframe is sorted in descending order by the similarity column using the sort_values method. The top 10 rows of the sorted dataframe are selected using the head method and the [['Title','Description','Category']] notation is used to return only the specified columns.

In [8]:
def search_product(query):
    stemmed_query = tokenize_stem(query)
    #calcualting cosine similarity between query and stemmed tokens columns
    amazon_df['similarity'] = amazon_df['stemmed_tokens'].apply(lambda x:cosine_sim(stemmed_query,x))
    res = amazon_df.sort_values(by=['similarity'],ascending=False).head(10)[['Title','Description','Category']]
    return res

In [9]:
amazon_df.head()

Unnamed: 0,Title,Description,Category,stemmed_tokens
0,Swissmar Capstore Select Storage Rack for 18-...,Swissmar's capstore select 18 storage unit kee...,Home & Kitchen Kitchen & Dining Kitchen Utens...,home & kitchen kitchen & dine kitchen utensil ...
1,Gemini200 Delta CV-880 Gold Crown Livery Airc...,Welcome to the exciting world of GeminiJets! O...,Toys & Games Hobbies Models & Model Kits Pre-...,toy & game hobbi model & model kit pre-built &...
2,Superior Threads 10501-2172 Magnifico Cream P...,"For quilting and embroidery, this product is m...","Arts, Crafts & Sewing Sewing Thread & Floss S...","art , craft & sew sew thread & floss sew"
3,Fashion Angels Color Rox Hair Chox Kit,Experiment with the haute trend of hair chalki...,Beauty & Personal Care Hair Care Hair Colorin...,beauti & person care hair care hair color prod...
4,Union Creative Giant Killing Figure 05: Daisu...,From Union Creative. Turn your display shelf i...,Toys & Games › Action Figures & Statues › Sta...,toy & game › action figur & statu › statu & bo...


In [10]:
search_product('Household items')



Unnamed: 0,Title,Description,Category
79,"Lola Rola Sticky Mop, Picks Up Dirt, Dust, an...","You'll get stuck on our Sticky Mop. ""It's Like...",Health & Household Household Supplies Cleanin...
171,"MET-Rx Active Man Multivitamin for Men, with ...",﻿MET-Rx Active Man is designed for men of all ...,Health & Household Sports Nutrition
228,Pedag 101 Princess Cushioning Leather Half Fo...,"LEATHER Art. 172: Durable, naturally tanned sh...",Health & Household Health Care Foot Health In...
291,"Febreze Air Freshener, Noticeables Air Freshe...","Febreze PLUG doesn't just mask odors, it clean...",Health & Household › Household Supplies › Air...
287,Quality Importers Trading Ashtoon Hammersmith...,Ashtoon Hammersmith Iron heavy-duty cigar asht...,Health & Household Household Supplies Tobacco...
340,WalterDrake Bunion Regulator Splint,Right toe bunion regulator. Reduce pain and pr...,Health & Household Health Care Foot Health Bu...
173,"Nutricost D-Aspartic Acid Capsules, (DAA) 180...",D-Aspartic Acid is commonly known as D-AA or D...,Health & Household › Sports Nutrition › Testo...
524,"NOW Supplements, Psyllium Husk 500 mg, 500 Ca...","Take 3 capsules with 8 oz. glass of liquid, 2 ...",Health & Household Vitamins & Dietary Supplem...
384,Visol Ferrara Red Cigar Case - Holds 2-3 Cigars,"Vibrant red, soft leather wraps this crushproo...",Health & Household Household Supplies Tobacco...
401,Puritans Pride Milk Thistle 4:1 Extract 1000 ...,The exceptional benefits of Milk Thistle are d...,Health & Household › Vitamins & Dietary Suppl...
