# Technical Assessment for a Full Stack AI/ML Engineer Role
## Task
*Develop a mini AI-based chatbot system that recommends products based on user queries.
The chatbot should be able to understand the user's text input, process it, and recommend a
list of products.*

# 1. Data Preparation:
*a. Given a mock dataset of products in a CSV format with the following fields:*
**i. Product ID
ii. Product Name
iii. Description
iv. Category
v. Price**
*b. Preprocess the dataset:*
**i. Handle missing values
ii. Tokenization of product descriptions**

In [66]:
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize

import warnings
warnings.filterwarnings('ignore')

In [67]:
df = pd.read_csv('mini-product-recommender-dataset.csv')
df.head()

,Product ID,Product Name,Description,Category,Price
0,1,Smartphone A,"Sleek design, 64GB storage, 12MP camera",Electronics,699.99
1,2,Laptop B,"15.6 inch, 8GB RAM, 256GB SSD",Electronics,999.99
2,3,Casual Shoes C,"Leather, size 10, brown",Fashion,79.99
3,4,Travel Mug D,"Stainless steel, 500ml",Home & Kitchen,14.99
4,5,Eau De Parfum E,"Floral scent, 100ml",Beauty,49.99


In [68]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40 entries, 0 to 39
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Product ID    40 non-null     int64  
 1   Product Name  38 non-null     object 
 2   Description   36 non-null     object 
 3   Category      37 non-null     object 
 4   Price         36 non-null     float64
dtypes: float64(1), int64(1), object(3)
memory usage: 1.7+ KB


In [69]:
df.isnull().sum()

Product ID      0
Product Name    2
Description     4
Category        3
Price           4
dtype: int64

In [70]:
df['Category'].value_counts()

Category
Electronics               8
Fashion                   7
Home & Kitchen            6
Beauty                    3
Books                     2
Grocery                   2
Sports                    2
Toys & Games              2
Movies & TV               1
Musical Instruments       1
Arts & Crafts             1
Health & Personal Care    1
Pets                      1
Name: count, dtype: int64

In [71]:
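# label products with no category as 'Others'; assign the result back instead of
# calling fillna(inplace=True) on a column selection, which newer pandas versions warn about
df['Category'] = df['Category'].fillna('Others')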

In [72]:
df.isnull().sum()

Product ID      0
Product Name    2
Description     4
Category        0
Price           4
dtype: int64

In [73]:
# drop rows with missing values
df.dropna(inplace=True)

In [74]:
df.isnull().sum()

In [76]:
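# split each product description into word tokens so it can be matched against a user query
# (word_tokenize needs the NLTK Punkt data, e.g. nltk.download('punkt'); newer NLTK releases use 'punkt_tab')
df['Tokenized_Description'] = df['Description'].apply(word_tokenize)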
df.head()


In [77]:
# save the processed data to a csv file
df.to_csv('processed_data.csv', index=False)


In [78]:
# read the processed data
df = pd.read_csv('processed_data.csv')
df.head()
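
A CSV file stores every cell as text, so after the save and reload above the `Tokenized_Description` column contains strings that look like Python lists rather than actual lists. If real token lists are needed later, one option is to parse each cell with `ast.literal_eval`. The sketch below uses a hypothetical helper, `parse_token_list`, and a sample string assumed to resemble what the column holds for the first product.

```python
import ast

def parse_token_list(cell):
    """Turn the text form of a token list back into a Python list (hypothetical helper)."""
    tokens = ast.literal_eval(cell)  # safely evaluates literals only, no arbitrary code
    if not isinstance(tokens, list):
        raise ValueError(f"expected a list literal, got {type(tokens).__name__}")
    return tokens

# Assumed example of a reloaded cell; the exact tokens depend on the tokenizer output.
stored = "['Sleek', 'design', ',', '64GB', 'storage', ',', '12MP', 'camera']"
tokens = parse_token_list(stored)
print(type(stored).__name__, "->", type(tokens).__name__, len(tokens))
```

On the dataframe this would be applied as `df['Tokenized_Description'].apply(parse_token_list)`.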

# 2. Model Training:
*a. Use the processed data to train a simple recommendation model, based on
the candidate's preference.
b. The model should take a user's query and output the top 3 product
recommendations based on similarity to product descriptions.*

In [79]:
# TF-IDF vectors and cosine similarity: given a user query, the model finds the most similar product descriptions
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity



In [80]:
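# create a TF-IDF vectorizer object
tfidf = TfidfVectorizer()

# after the CSV round trip, Tokenized_Description holds each token list as a string
# (e.g. "['Sleek', 'design', ...]"); TfidfVectorizer re-tokenizes that string and
# ignores the brackets, quotes, and commas

# fit the vectorizer and transform the descriptions in one step
tfidf_descriptions = tfidf.fit_transform(df['Tokenized_Description'])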
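
To make the weighting less of a black box, here is a small pure-Python sketch of TF-IDF with cosine similarity. It is meant to approximate the vectorizer's default behavior (lowercasing, tokens of two or more word characters, smoothed IDF, L2 normalization), not to replace it; the function names are illustrative, and the three descriptions are taken from the `df.head()` output above.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep runs of 2+ word characters (assumed to match the default token pattern).
    return re.findall(r"\b\w\w+\b", text.lower())

def tfidf_vectors(docs):
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in doc_freq.items()}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vec = {t: counts[t] * idf[t] for t in counts}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        # L2-normalize so that a dot product equals cosine similarity.
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors, idf

def cosine(a, b):
    # Both vectors are already unit length, so the dot product is the cosine.
    return sum(w * b.get(t, 0.0) for t, w in a.items())

docs = [
    "Sleek design, 64GB storage, 12MP camera",
    "15.6 inch, 8GB RAM, 256GB SSD",
    "Stainless steel, 500ml",
]
vectors, idf = tfidf_vectors(docs)
print(cosine(vectors[0], vectors[0]), cosine(vectors[0], vectors[1]))
```

Descriptions that share no terms get a similarity of zero, which is why short product descriptions can leave many queries without a meaningful match.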



In [81]:
# create a dataframe with the transformed descriptions, one column per vocabulary term
tfidf_descriptions_df = pd.DataFrame(
    tfidf_descriptions.toarray(),
    index=df.index,
    columns=tfidf.get_feature_names_out(),
)
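
One way the fitted vectorizer could be used to answer a query is sketched below: transform the query with the same vectorizer, compute cosine similarity against every product vector, and keep the three highest scores. This is a minimal sketch rather than the notebook's actual implementation; `recommend`, `products`, and the sample query are assumptions, and the five rows are copied from the `df.head()` output above so the block runs on its own.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sample rows from the dataset preview; the notebook would use the full dataframe.
products = pd.DataFrame({
    "Product Name": ["Smartphone A", "Laptop B", "Casual Shoes C",
                     "Travel Mug D", "Eau De Parfum E"],
    "Description": ["Sleek design, 64GB storage, 12MP camera",
                    "15.6 inch, 8GB RAM, 256GB SSD",
                    "Leather, size 10, brown",
                    "Stainless steel, 500ml",
                    "Floral scent, 100ml"],
})

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(products["Description"])

def recommend(query, top_n=3):
    """Return the top_n products whose descriptions are most similar to the query."""
    query_vec = vectorizer.transform([query])             # same vocabulary as the products
    scores = cosine_similarity(query_vec, matrix).ravel()  # one score per product
    top = scores.argsort()[::-1][:top_n]                   # indices of the highest scores
    return products.iloc[top].assign(score=scores[top])

result = recommend("laptop with 8GB RAM and SSD")
print(result[["Product Name", "score"]])
```

Products with a score of zero share no vocabulary with the query, so a real chatbot would probably filter them out rather than pad the list to three.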

