## Amazon Product Autocomplete

This notebook recommends products based on a user's search query. It implements two approaches:

- Keyword-based search using TheFuzz
- Semantic search using sentence-transformers and Faiss

The steps implemented in this notebook are:

- Data loading and basic analysis
- Data cleaning
- Data preprocessing
- Search implementation
- Testing

For testing, we use three query strings:

- "Fire TV": a straightforward keyword-based search
- "A birthday gift for kids party": a semantic search
- "Dire tablet": a search with a misspelled query

Further improvements needed:

- Optimize search
- Improve semantic search. It currently indexes only the product names extracted from the data; ideally it should cover all names
- More comprehensive testing

<p><span style="font-size:18px"><span style="background-color:#f1c40f">&nbsp;<span style="color:#ffffff"><strong>Data Loading and Basic Analysis</strong></span>&nbsp;</span></span></p>

Steps Taken:

- Load the data source
- Check its size
- Look at the top entries
- Check for null values

In [37]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("product_names.csv")
df.head(5)
pd.options.display.max_colwidth = 300

In [3]:
df.shape

(2397876, 1)

In [4]:
df.head(-5)

Unnamed: 0,Product Name
0,"All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi, 16 GB - Includes Special Offers, Magenta"
1,"Kindle Oasis E-reader with Leather Charging Cover - Merlot, 6 High-Resolution Display (300 ppi), Wi-Fi - Includes Special Offers,,"
2,"Amazon Kindle Lighted Leather Cover,,,\r\nAmazon Kindle Lighted Leather Cover,,,"
3,"Amazon Kindle Lighted Leather Cover,,,\r\nKindle Keyboard,,,"
4,"Kindle Keyboard,,,\r\nKindle Keyboard,,,"
...,...
2397866,"Premium Cotton Towels, Stripe"
2397867,Organic Textured Cotton Towel
2397868,Premium Cotton Towels
2397869,"L.L.Bean Egyptian Cotton Towels, Stripe"


In [5]:
df['Product Name'].isna().sum()

0

<p><span style="font-size:18px"><span style="background-color:#f1c40f">&nbsp;<span style="color:#ffffff"><strong>Data Cleaning</strong></span>&nbsp;</span></span></p>

Steps Taken:

- Remove `\r\n` line breaks
- Split on ',' because the data suggests that entries are comma-separated
- Remove blank values left after the split
- Check for null values

In [6]:
df_cleaned = df['Product Name'].str.replace("\r\n", "").str.split(",")

In [7]:
df_empty_removed = df_cleaned.apply(lambda row: list(filter(None, row)))

In [8]:
df_empty_removed.head(5)

0                                           [All-New Fire HD 8 Tablet,  8 HD Display,  Wi-Fi,  16 GB - Includes Special Offers,  Magenta]
1    [Kindle Oasis E-reader with Leather Charging Cover - Merlot,  6 High-Resolution Display (300 ppi),  Wi-Fi - Includes Special Offers]
2                                                              [Amazon Kindle Lighted Leather Cover, Amazon Kindle Lighted Leather Cover]
3                                                                                  [Amazon Kindle Lighted Leather Cover, Kindle Keyboard]
4                                                                                                      [Kindle Keyboard, Kindle Keyboard]
Name: Product Name, dtype: object
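
To make the cleaning steps concrete, here is a minimal sketch that applies them to a single raw string in plain Python rather than pandas. The helper name `clean_entry` is hypothetical. The optional `strip` flag is an assumption that goes beyond the notebook's code: it would remove the leading spaces visible in the output above (for example ` 8 HD Display`).

```python
def clean_entry(raw, strip=False):
    """Remove line breaks, split on commas, and drop empty parts."""
    parts = raw.replace("\r\n", "").split(",")
    if strip:
        # Optional extra step (not in the notebook): trim surrounding whitespace
        parts = [p.strip() for p in parts]
    return [p for p in parts if p]

raw = "Amazon Kindle Lighted Leather Cover,,,\r\nKindle Keyboard,,,"
print(clean_entry(raw))
```

The result for this string corresponds to row 3 of the cleaned output shown above.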

<p><span style="font-size:18px"><span style="background-color:#f1c40f">&nbsp;<span style="color:#ffffff"><strong>Data Preprocessing</strong></span>&nbsp;</span></span></p>

Steps Taken:

- Get the product name from the first value in the split list. We extract the product name separately for faster searches
- Get the description from the rest of the values in the list
- Check for null values

In [9]:
all_product_names = df_empty_removed.apply(lambda x: x[0])

In [10]:
all_product_description = df_empty_removed.apply(lambda x: x[1:])

In [11]:
df_product_name = pd.DataFrame({'Product Name':all_product_names, 'Description':all_product_description})

In [12]:
df_product_name['Description'] = df_product_name['Description'].apply(lambda row: ' '.join(row))

In [13]:
df_product_name.loc[df_product_name['Product Name'] == df_product_name['Description'], "Description"] = ""

In [14]:
df_product_name.head(5)

Unnamed: 0,Product Name,Description
0,All-New Fire HD 8 Tablet,8 HD Display Wi-Fi 16 GB - Includes Special Offers Magenta
1,Kindle Oasis E-reader with Leather Charging Cover - Merlot,6 High-Resolution Display (300 ppi) Wi-Fi - Includes Special Offers
2,Amazon Kindle Lighted Leather Cover,
3,Amazon Kindle Lighted Leather Cover,Kindle Keyboard
4,Kindle Keyboard,


In [15]:
df_product_name_unique = df_product_name.drop_duplicates()

In [16]:
df_product_name_unique.loc[:,'Product Name Orig'] = df_product_name_unique['Product Name'] + " " + df_product_name_unique['Description']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_product_name_unique.loc[:,'Product Name Orig'] = df_product_name_unique['Product Name'] + " " + df_product_name_unique['Description']
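
The warning above is pandas flagging an assignment on a frame it considers derived from another one. A common way to avoid it is to take an explicit copy before adding columns. The sketch below assumes a small stand-in DataFrame (`df_demo` and its contents are illustrative, not the notebook's data). The final `.str.strip()` is an optional extra that drops the trailing space left when the description is empty.

```python
import pandas as pd

# Illustrative stand-in for df_product_name
df_demo = pd.DataFrame({
    "Product Name": ["Kindle Keyboard", "Kindle Keyboard", "Amazon Fire Tv"],
    "Description": ["", "", "Streaming Media Player"],
})

# .copy() makes the de-duplicated frame independent before new columns are added
unique_demo = df_demo.drop_duplicates().copy()
unique_demo["Product Name Orig"] = (
    unique_demo["Product Name"] + " " + unique_demo["Description"]
).str.strip()
print(unique_demo)
```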


In [17]:
df_product_name_unique.loc[:,'Product Name'] = df_product_name_unique['Product Name'].str.lower()

In [18]:
df_product_name_unique.head(5)

Unnamed: 0,Product Name,Description,Product Name Orig
0,all-new fire hd 8 tablet,8 HD Display Wi-Fi 16 GB - Includes Special Offers Magenta,All-New Fire HD 8 Tablet 8 HD Display Wi-Fi 16 GB - Includes Special Offers Magenta
1,kindle oasis e-reader with leather charging cover - merlot,6 High-Resolution Display (300 ppi) Wi-Fi - Includes Special Offers,Kindle Oasis E-reader with Leather Charging Cover - Merlot 6 High-Resolution Display (300 ppi) Wi-Fi - Includes Special Offers
2,amazon kindle lighted leather cover,,Amazon Kindle Lighted Leather Cover
3,amazon kindle lighted leather cover,Kindle Keyboard,Amazon Kindle Lighted Leather Cover Kindle Keyboard
4,kindle keyboard,,Kindle Keyboard


In [20]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')

# Build the stop-word set once; calling stopwords.words() for every token is slow
STOP_WORDS = set(stopwords.words('english'))

def remove_stop_words(line):
    """Tokenize a line and drop English stop words."""
    tokens = word_tokenize(line)
    return " ".join(word for word in tokens if word not in STOP_WORDS)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\dikshashukla\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\dikshashukla\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [64]:
import time
import swifter
start_time = time.time()
df_product_name_unique.loc[:,'Product Name_sw_rm'] = df_product_name_unique['Product Name'].swifter.apply(lambda x: " ".join([k for k in x.split(" ") if k not in stopwords.words('english') ]))
print("--- %s seconds ---" % (time.time() - start_time))

Pandas Apply:   0%|          | 0/2397438 [00:00<?, ?it/s]

--- 4989.166684150696 seconds ---


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_product_name_unique.loc[:,'Product Name_sw_rm'] = df_product_name_unique['Product Name'].swifter.apply(lambda x: " ".join([k for k in x.split(" ") if k not in stopwords.words('english') ]))
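
Much of the 4989 seconds reported above is likely spent in `stopwords.words('english')`, which the lambda calls again for every token. A minimal sketch of the usual remedy, building the stop-word collection once as a set, follows. To stay self-contained it uses a tiny hard-coded set; in the notebook one would presumably use `set(stopwords.words('english'))` instead. `STOP_WORDS_DEMO` and `remove_stop_words_fast` are hypothetical names.

```python
# Stand-in stop-word set; assumption: swap in set(stopwords.words('english'))
STOP_WORDS_DEMO = {"with", "for", "and", "the", "a", "of"}

def remove_stop_words_fast(text):
    """Drop stop words using a precomputed set (fast membership test per token)."""
    return " ".join(tok for tok in text.split(" ") if tok not in STOP_WORDS_DEMO)

print(remove_stop_words_fast("kindle oasis e-reader with leather charging cover - merlot"))
```

The function can be passed directly to `apply` in place of the lambda, so the stop-word list is no longer rebuilt for each token.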


In [65]:
df_product_name_unique.to_csv('product_details_processed.csv', index=False)

In [67]:
df_product_name_unique.head(5)

Unnamed: 0,Product Name,Description,Product Name Orig,Product Name_sw_rm
0,all-new fire hd 8 tablet,8 HD Display Wi-Fi 16 GB - Includes Special Offers Magenta,All-New Fire HD 8 Tablet 8 HD Display Wi-Fi 16 GB - Includes Special Offers Magenta,all-new fire hd 8 tablet
1,kindle oasis e-reader with leather charging cover - merlot,6 High-Resolution Display (300 ppi) Wi-Fi - Includes Special Offers,Kindle Oasis E-reader with Leather Charging Cover - Merlot 6 High-Resolution Display (300 ppi) Wi-Fi - Includes Special Offers,kindle oasis e-reader leather charging cover - merlot
2,amazon kindle lighted leather cover,,Amazon Kindle Lighted Leather Cover,amazon kindle lighted leather cover
3,amazon kindle lighted leather cover,Kindle Keyboard,Amazon Kindle Lighted Leather Cover Kindle Keyboard,amazon kindle lighted leather cover
4,kindle keyboard,,Kindle Keyboard,kindle keyboard


In [25]:
import re
df_product_name_unique[df_product_name_unique['Product Name'].str.contains('kindle', flags=re.IGNORECASE)].shape

(761, 3)

In [32]:
from sentence_transformers import SentenceTransformer
import faiss
from faiss import write_index, read_index
import os

# Load the encoder once at module level so the query function can reuse it
encoder = SentenceTransformer("paraphrase-mpnet-base-v2")

def getIndexForSearch():
    # Reuse the saved index if it exists; otherwise build it and save it
    if os.path.isfile("large.index"):
        return read_index("large.index")
    text = df_product_name_unique['Product Name'].tolist()
    vectors = encoder.encode(text)
    vector_dimension = vectors.shape[1]
    index = faiss.IndexFlatL2(vector_dimension)
    # Normalize so that L2 distance ranks results like cosine similarity
    faiss.normalize_L2(vectors)
    index.add(vectors)
    write_index(index, "large.index")
    return index

# Used by combined_search below
index_for_search = getIndexForSearch()
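
The index uses `IndexFlatL2` on L2-normalized vectors. For unit vectors, the squared L2 distance equals 2 minus twice the cosine similarity, so ranking by L2 distance is equivalent to ranking by cosine similarity. Below is a small NumPy check of that identity, with random vectors standing in for sentence embeddings (the shapes and seed are arbitrary assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(5, 8)).astype("float32")   # stand-in "embeddings"
query = rng.normal(size=(8,)).astype("float32")

# Scale every vector to unit length, the same effect faiss.normalize_L2 has in place
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query /= np.linalg.norm(query)

cosine = docs @ query
sq_l2 = ((docs - query) ** 2).sum(axis=1)

# On unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(np.allclose(sq_l2, 2 - 2 * cosine, atol=1e-5))
# The nearest vector by L2 is also the most similar by cosine
print(sq_l2.argmin() == cosine.argmax())
```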
    

In [34]:
def getTopResults(search_text_passed, index_for_search):
    # Encode the lowercased query and normalize it like the indexed vectors
    search_vector = encoder.encode(search_text_passed.lower())
    _vector = np.array([search_vector])
    faiss.normalize_L2(_vector)
    # Retrieve the 10 nearest product names
    distances, ann = index_for_search.search(_vector, k=10)
    results = pd.DataFrame({'distances': distances[0], 'ann': ann[0]})
    # Map index positions back to the original product text
    print(df_product_name_unique['Product Name Orig'].iloc[results.ann.tolist()].values)

In [82]:
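from thefuzz import process

def combined_search(search_text):
    # Queries longer than 10 characters use semantic search; shorter ones use fuzzy matching
    if len(search_text) > 10:
        getTopResults(search_text, index_for_search)
    else:
        print(process.extract(search_text, df_product_name_unique["Product Name Orig"], limit=5))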

In [84]:
combined_search("Fire TV")

[('Amazon Echo and Fire TV Power Adapter ', 90, 27), ('Amazon Fire Tv ', 90, 42), ('Fire TV Stick Streaming Media Player Pair Kit ', 90, 75), ('Amazon Fire TV Gaming Edition Streaming Media Player ', 90, 119), ('Amazon Kindle Fire Hd (3rd Generation) 8gb ', 86, 9)]
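
To illustrate why character-level fuzzy matching suits short or misspelled queries, the sketch below ranks a few product names against "dire tablet". It uses `difflib.SequenceMatcher` from the standard library as a stand-in; this is not TheFuzz's scorer, so its scores are not comparable with the output above. `rank_candidates` is a hypothetical helper.

```python
from difflib import SequenceMatcher

def rank_candidates(query, candidates, limit=3):
    """Return (candidate, score) pairs sorted by similarity, best first."""
    q = query.lower()
    scored = [(c, SequenceMatcher(None, q, c.lower()).ratio()) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]

names = ["All-New Fire HD 8 Tablet", "Kindle Keyboard", "Premium Cotton Towels"]
for name, score in rank_candidates("dire tablet", names):
    print(f"{score:.2f}  {name}")
```

Even with the first letter wrong, the tablet entry shares long character runs with the query and ranks first.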


In [83]:
combined_search ("A birthday gift for kids party")

['TREORSI Blank Satin Sash  Plain Sash  Party Decorations  Make Your Own Sash  2 Pack (White)'
 '20Pcs Tissue Paper Pom Poms Pink Flowers Paper Honeycomb Balls Paper Lanterns Hanging Paper Fans for Wedding  Birthday  Baby Shower  Nursery  Bridal Shower Decor'
 'URATOT 72 Pieces Spa Party Supplies Multiple Spa Party Favors for Girls 12 Tote Bags  24 Toe Separators 12 Emery Boards 12 Body Jewels and 12 Colored Hair Clip Braids'
 'senover Mr and Mrs Sign Wedding Sweetheart Table Decorations Mr and Mrs Letters Decorative Letters for Wedding Photo Props Party Banner Decoration，Wedding Shower Gift (Gold Glitter)'
 '24 Make A Dinosaur Stickers For Kids - Great Dino Theme Birthday Party Favors - Fun Craft Project For Children 3+ - Let Your Kids Get Creative & Design Their Favorite Dinosaur Sticker '
 'Graceful Movements Mermaid Watercolor Art Print Legend of The Sea Set of 4(8" x10") Unframed Canvas Print  Great Gift for Girls Bedroom Bathroom Home Decor'
 'FirstKitchen 3.2M/10.5Feet Lace Bunt

In [81]:
combined_search("dire tablet")

['WiWi Womens Comfy Pajama Set Short Sleeve Sleepwear S-4X '
 'Legends Never Die American Pharaoh 2015 Triple Crown Winner Framed Photo Collage  11" X 14"'
 'Nivea Shimmer Radiant Lip Care 0.17 Oz (Pack of 2) '
 'SONGMICS 5x Magnifying Wall Mount Makeup Mirror 8 Inch Two-Sided Swivel Extendable Bathroom Mirror Chrome UBBM513 '
 'THREE PACKS of Astral Cream x 200ml by Astral '
 'Organic Baby Conditioner with Aloe  Coconut Oil  Citrus Essential Oils – Safe  Gentle  Nourishing – Eczema Friendly – Paraben  Dye  Gluten  and Sulfate Free – 8 oz'
 'Avanti Linens Banana Palm Hand Towel  Linen'
 'Nicknocks Wooden Bamboo Dustproof Yarn Bowl Holder with Lid Crochet Wool Storage Tool '
 'VCNY Home Melanie Ruffle Shower Curtain  72x72  White'
 'DECOWALL DA-1406B Animal Hot Air Balloons Kids Wall Decals Wall Stickers Peel and Stick Removable Wall Stickers for Kids Nursery Bedroom Living Room ']
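
Note that "dire tablet" is 11 characters long, so the `len(search_text) > 10` rule in `combined_search` sends it to semantic search instead of fuzzy matching, which may explain the unrelated results above. One possible alternative, sketched here as an assumption rather than the notebook's method, is to route by word count. `choose_search_mode` and the threshold of 3 words are hypothetical.

```python
def choose_search_mode(search_text, max_fuzzy_words=3):
    """Return 'fuzzy' for short queries and 'semantic' for longer ones."""
    word_count = len(search_text.split())
    return "fuzzy" if word_count <= max_fuzzy_words else "semantic"

for q in ["Fire TV", "dire tablet", "A birthday gift for kids party"]:
    print(q, "->", choose_search_mode(q))
```

With this rule, both short test queries would go to fuzzy matching and the longer descriptive query to semantic search. The right threshold would need to be tuned against a broader set of test queries.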
