# Technical Assessment for a Full Stack AI/ML Engineer Role
## Task
*Develop a mini AI-based chatbot system that recommends products based on user queries.
The chatbot should be able to understand the user's text input, process it, and recommend a
list of products.*

# 1. Data Preparation:
*a. Given a mock dataset of products in a CSV format with the following fields:*
**i. Product ID
ii. Product Name
iii. Description
iv. Category
v. Price**
*b. Preprocess the dataset:*
**i. Handle missing values
ii. Tokenization of product descriptions**

In [562]:
import pandas as pd
import nltk
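from nltk.tokenize import word_tokenize

# word_tokenize needs NLTK's Punkt tokenizer data; if it is missing, run
# nltk.download('punkt') once ('punkt_tab' on recent NLTK releases)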

import warnings
warnings.filterwarnings('ignore')

In [563]:
df = pd.read_csv('mini-product-recommender-dataset.csv')
df.head()

Unnamed: 0,Product ID,Product Name,Description,Category,Price
0,1,Smartphone A,"Sleek design, 64GB storage, 12MP camera",Electronics,699.99
1,2,Laptop B,"15.6 inch, 8GB RAM, 256GB SSD",Electronics,999.99
2,3,Casual Shoes C,"Leather, size 10, brown",Fashion,79.99
3,4,Travel Mug D,"Stainless steel, 500ml",Home & Kitchen,14.99
4,5,Eau De Parfum E,"Floral scent, 100ml",Beauty,49.99


In [564]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40 entries, 0 to 39
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Product ID    40 non-null     int64  
 1   Product Name  38 non-null     object 
 2   Description   36 non-null     object 
 3   Category      37 non-null     object 
 4   Price         36 non-null     float64
dtypes: float64(1), int64(1), object(3)
memory usage: 1.7+ KB


In [565]:
df.isnull().sum()

Product ID      0
Product Name    2
Description     4
Category        3
Price           4
dtype: int64

In [566]:
df['Category'].value_counts()

Category
Electronics               8
Fashion                   7
Home & Kitchen            6
Beauty                    3
Books                     2
Grocery                   2
Sports                    2
Toys & Games              2
Movies & TV               1
Musical Instruments       1
Arts & Crafts             1
Health & Personal Care    1
Pets                      1
Name: count, dtype: int64

In [567]:
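# fill missing categories with a placeholder label; assign the result back instead of
# calling fillna(inplace=True) on a column selection, which recent pandas versions warn against
df['Category'] = df['Category'].fillna('Others')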

In [568]:
df.isnull().sum()

Product ID      0
Product Name    2
Description     4
Category        0
Price           4
dtype: int64

In [569]:
# drop rows with missing values
df.dropna(inplace=True)

In [570]:
df.isnull().sum()

Product ID      0
Product Name    0
Description     0
Category        0
Price           0
dtype: int64

In [572]:
# tokenize each product description into word and punctuation tokens (commas become separate tokens)
df['Tokenized_Description'] = df['Description'].apply(word_tokenize)
df.head()


Unnamed: 0,Product ID,Product Name,Description,Category,Price,Tokenized_Description
0,1,Smartphone A,"Sleek design, 64GB storage, 12MP camera",Electronics,699.99,"[Sleek, design, ,, 64GB, storage, ,, 12MP, cam..."
1,2,Laptop B,"15.6 inch, 8GB RAM, 256GB SSD",Electronics,999.99,"[15.6, inch, ,, 8GB, RAM, ,, 256GB, SSD]"
2,3,Casual Shoes C,"Leather, size 10, brown",Fashion,79.99,"[Leather, ,, size, 10, ,, brown]"
3,4,Travel Mug D,"Stainless steel, 500ml",Home & Kitchen,14.99,"[Stainless, steel, ,, 500ml]"
4,5,Eau De Parfum E,"Floral scent, 100ml",Beauty,49.99,"[Floral, scent, ,, 100ml]"
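
The token lists above keep the original casing and include the commas as separate tokens. A minimal sketch of one possible cleanup step follows; `clean_tokens` is a hypothetical helper that is not part of the notebook, and whether to lowercase or drop punctuation is an assumption about what the downstream model needs.

```python
import string


def clean_tokens(tokens):
    """Lowercase tokens and drop those made up only of punctuation.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    punctuation = set(string.punctuation)
    return [
        token.lower()
        for token in tokens
        if not all(char in punctuation for char in token)
    ]


# Tokens as word_tokenize produced them for the first product description
tokens = ["Sleek", "design", ",", "64GB", "storage", ",", "12MP", "camera"]
cleaned = clean_tokens(tokens)
print(cleaned)
```

In the notebook this could be applied with `df['Tokenized_Description'].apply(clean_tokens)`. Tokens such as `15.6` are kept because they contain non-punctuation characters.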


In [573]:
# save the processed data to a CSV file, with a header row and without the index column
df.to_csv('processed_data.csv', index=False, header=True)


In [574]:
# read the processed data
df_new = pd.read_csv('processed_data.csv')

# 2. Model Training:
*a. Use the processed data to train a simple recommendation model, based on
the candidate's preference.
b. The model should take a user's query and output the top 3 product
recommendations based on similarity to product descriptions.*

In [575]:
# import libraries
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import warnings
warnings.filterwarnings('ignore')


In [576]:
# read the processed data
df = pd.read_csv('processed_data.csv')
df.head()

Unnamed: 0,Product ID,Product Name,Description,Category,Price,Tokenized_Description
0,1,Smartphone A,"Sleek design, 64GB storage, 12MP camera",Electronics,699.99,"['Sleek', 'design', ',', '64GB', 'storage', ',..."
1,2,Laptop B,"15.6 inch, 8GB RAM, 256GB SSD",Electronics,999.99,"['15.6', 'inch', ',', '8GB', 'RAM', ',', '256G..."
2,3,Casual Shoes C,"Leather, size 10, brown",Fashion,79.99,"['Leather', ',', 'size', '10', ',', 'brown']"
3,4,Travel Mug D,"Stainless steel, 500ml",Home & Kitchen,14.99,"['Stainless', 'steel', ',', '500ml']"
4,5,Eau De Parfum E,"Floral scent, 100ml",Beauty,49.99,"['Floral', 'scent', ',', '100ml']"
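
The quotes in the `Tokenized_Description` column above show that the token lists come back from the CSV file as plain strings, not Python lists. A minimal sketch of one way to restore them, assuming the column was written by `to_csv` from real lists: pass `ast.literal_eval` as a converter to `pd.read_csv`. The in-memory buffer and the single-row frame stand in for `processed_data.csv` and are for illustration only.

```python
import ast
import io

import pandas as pd

# One-row stand-in for the processed data (illustrative, not the real file)
original = pd.DataFrame({
    "Description": ["Leather, size 10, brown"],
    "Tokenized_Description": [["Leather", ",", "size", "10", ",", "brown"]],
})

# Write to an in-memory buffer instead of processed_data.csv
buffer = io.StringIO()
original.to_csv(buffer, index=False)

# Default read: the list is returned as its string representation
buffer.seek(0)
as_text = pd.read_csv(buffer)

# Read with a converter: the string is parsed back into a list
buffer.seek(0)
restored = pd.read_csv(
    buffer, converters={"Tokenized_Description": ast.literal_eval}
)

print(type(as_text.loc[0, "Tokenized_Description"]))
print(type(restored.loc[0, "Tokenized_Description"]))
```

`ast.literal_eval` only parses Python literals, so it is a safer choice here than `eval`.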


In [577]:
# create a tfidf vectorizer object
tfidf = TfidfVectorizer()


In [578]:
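# fit the vectorizer on the Tokenized_Description column; after the CSV round trip each
# entry is the string form of the token list, which TfidfVectorizer tokenizes itself
tfidf.fit(df['Tokenized_Description'])

In [579]:
# transform the same column into a TF-IDF matrix with one row per product
tfidf_matrix = tfidf.transform(df['Tokenized_Description'])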


In [580]:
# compute the cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
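
To make the similarity matrix concrete, here is a small sketch on four descriptions. The first three are taken from the `df.head()` output above; the fourth is a made-up description added only so that some rows share terms. With the default `TfidfVectorizer` settings, two descriptions score above zero only when they share at least one exact token.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

descriptions = [
    "Leather, size 10, brown",
    "Stainless steel, 500ml",
    "Floral scent, 100ml",
    "Leather strap, stainless steel case",  # hypothetical, for illustration only
]

# One TF-IDF row per description
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(descriptions)

# Pairwise cosine similarity between all descriptions
similarity = cosine_similarity(matrix)

print(similarity.round(2))
```

Each description has similarity 1 with itself. "500ml" and "100ml" are different tokens, so the second and third descriptions score 0 against each other even though both mention a volume.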


In [581]:
# save the cosine similarity matrix to a csv file
pd.DataFrame(cosine_sim).to_csv('cosine_sim.csv', index=False, header=False)

In [582]:
# read the cosine similarity matrix
cosine_sim = pd.read_csv('cosine_sim.csv', header=None)

In [583]:
# create a series of product names
product_names = pd.Series(df['Product Name'])

In [584]:
# function to recommend products based on user query
def recommend_products(query):
    # tokenize the user query into a list of tokens
    query = word_tokenize(query)
    # transform the token list; the vectorizer treats each token as its own document,
    # so the result has one row per query token
    query = tfidf.transform(query)
    # compute the cosine similarity between each query token and all product descriptions
    similarity_scores = cosine_similarity(query, tfidf_matrix)
    # sort each row in ascending order and keep the last 3 indices of row 0,
    # so only the first query token is ranked and the best match comes last
    indices = similarity_scores.argsort()[0][-3:]
    # get the product names corresponding to the indices
    product_names = df['Product Name'].iloc[indices]
    # return the product names
    return product_names

In [585]:
# test the function
recommend_products('leather jacket')

9      Wrist Watch J
2     Casual Shoes C
28        Sandals H1
Name: Product Name, dtype: object

In [586]:
recommend_products('Leather shoes')  # same first token as the previous query (case is ignored), so the same three products are returned

9      Wrist Watch J
2     Casual Shoes C
28        Sandals H1
Name: Product Name, dtype: object
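
Both queries return the same three products because `recommend_products` passes a list of tokens to `tfidf.transform`, which scores each token as a separate document, and then ranks only the first token's row. A minimal sketch of a variant that scores the whole query as one document and returns the best match first is shown below. The function name `recommend_products_whole_query`, the `top_n` parameter, and the `score` column are assumptions for illustration, and the five products are only the rows shown by `df.head()`, not the full dataset.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# First five products from df.head(); the full dataset is not reproduced here
products = pd.DataFrame({
    "Product Name": [
        "Smartphone A", "Laptop B", "Casual Shoes C",
        "Travel Mug D", "Eau De Parfum E",
    ],
    "Description": [
        "Sleek design, 64GB storage, 12MP camera",
        "15.6 inch, 8GB RAM, 256GB SSD",
        "Leather, size 10, brown",
        "Stainless steel, 500ml",
        "Floral scent, 100ml",
    ],
})

# Fit on the raw descriptions; TfidfVectorizer does its own tokenization
vectorizer = TfidfVectorizer()
description_matrix = vectorizer.fit_transform(products["Description"])


def recommend_products_whole_query(query, top_n=3):
    """Rank products by cosine similarity to the whole query (hypothetical variant)."""
    # Wrap the query in a list so it is treated as a single document
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, description_matrix)[0]
    # argsort is ascending, so reverse it to put the best match first
    top_indices = scores.argsort()[::-1][:top_n]
    result = products.iloc[top_indices][["Product Name"]].copy()
    result["score"] = scores[top_indices]
    return result


shoes = recommend_products_whole_query("brown leather shoes")
mug = recommend_products_whole_query("steel mug")
print(shoes)
print(mug)
```

On this small sample most products share no terms with the query and score 0, yet they still fill the remaining slots. Filtering out rows with a zero score before returning is one option if fewer than three relevant matches is acceptable.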