**Develop a chatbot for the e-commerce domain using the apparel dataset provided separately. The bot is expected to handle several types of scenarios:**

- 1) A user starts a conversation by asking a few things in the beginning, like:
        User: Do you sell boys tops? (or similar question)
        Bot: Answers “yes” or “no” depending on the availability of the item, then asks about relevant attribute(s) to narrow the search and/or engage the user (e.g. which color or type are you looking for?)
        User: ......
        Bot: .......

- 2) The user directly provides a long sentence, such as:
        User: I need a t-shirt for my son in green or blue color
        Bot: Responds with search results or asks about other attributes to narrow the search
        User: ........
        Bot: .......

- 3) The user may be very terse in their input and may only say “blue t-shirt for boys”. This is similar to scenario 2) above but with fewer greetings or formalities from the user. The bot may continue in its usual fashion, as in 2).


The user may add more information in the subsequent responses as the discussion continues. Ensure that you maintain a memory to store the data received/extracted so far.
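
One way to keep such a memory is a small slot store that merges newly extracted attributes into what is already known. The sketch below is only an illustration: the `ConversationMemory` class, its method names, and the example values are assumptions, with slot names borrowed from the dataset columns.

```python
from dataclasses import dataclass, field


@dataclass
class ConversationMemory:
    """Hypothetical slot store that accumulates attributes across turns."""
    slots: dict = field(default_factory=dict)

    def update(self, extracted: dict) -> None:
        # Later turns overwrite earlier values; empty values are ignored.
        for name, value in extracted.items():
            if value:
                self.slots[name] = value

    def missing(self, required=("Gender", "ProductType", "Colour")) -> list:
        # Slots the bot could still ask about to narrow the search.
        return [name for name in required if name not in self.slots]


memory = ConversationMemory()
memory.update({"Gender": "Boys", "ProductType": "Tops"})   # turn 1
memory.update({"Colour": "Blue", "ProductType": None})     # turn 2
print(memory.slots)      # {'Gender': 'Boys', 'ProductType': 'Tops', 'Colour': 'Blue'}
print(memory.missing())  # []
```

The `missing()` helper gives the bot a natural source of follow-up questions: whichever slot is still empty is the next attribute to ask about.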


As discussed in class, the final solution may use a hybrid approach that combines pattern-based and word2vec-based search, which is perfectly fine.
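
A hybrid search of this kind could be sketched as follows: regular expressions pull out known attribute values first, and a similarity score ranks whatever survives the filter. Everything here is an assumption made for illustration: the toy catalogue, the patterns, and the function names are hypothetical, and plain token overlap stands in for a word2vec-style similarity.

```python
import re

# Toy catalogue; rows and values are illustrative, not taken from the dataset.
CATALOGUE = [
    {"title": "boys blue striped tshirt", "Gender": "Boys", "Colour": "Blue"},
    {"title": "boys green polo tshirt", "Gender": "Boys", "Colour": "Green"},
    {"title": "girls pink top", "Gender": "Girls", "Colour": "Pink"},
]

# Pattern-based step: regular expressions for known attribute values.
PATTERNS = {
    "Gender": re.compile(r"\b(boys?|girls?)\b", re.IGNORECASE),
    "Colour": re.compile(r"\b(blue|green|pink)\b", re.IGNORECASE),
}


def normalise(name, value):
    """Map a matched string onto the catalogue's spelling, e.g. 'boy' -> 'Boys'."""
    value = value.capitalize()
    if name == "Gender" and not value.endswith("s"):
        value += "s"
    return value


def extract_attributes(query):
    """Return {attribute: set of values} for every pattern that matches."""
    found = {}
    for name, pattern in PATTERNS.items():
        values = {normalise(name, m) for m in pattern.findall(query)}
        if values:
            found[name] = values
    return found


def token_overlap(query, title):
    """Jaccard overlap of tokens; a stand-in for an embedding similarity."""
    a, b = set(query.lower().split()), set(title.split())
    return len(a & b) / len(a | b) if a | b else 0.0


def hybrid_search(query):
    attrs = extract_attributes(query)
    # Keep products consistent with every extracted attribute.
    candidates = [p for p in CATALOGUE
                  if all(p[name] in values for name, values in attrs.items())]
    # Rank the survivors by the similarity score.
    return sorted(candidates,
                  key=lambda p: token_overlap(query, p["title"]),
                  reverse=True)


print([p["title"] for p in hybrid_search("blue tshirt for boys")])
# ['boys blue striped tshirt']
```

Because `extract_attributes` collects a set of values per attribute, a query such as "green or blue" keeps products of either colour.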


_Deliverables: In addition to submitting working code (preferably with a Streamlit interface), briefly describe the approach you took._

### We train our TF-IDF, SVD, LSA embedding, and Doc2Vec models in this notebook and save them in our repository.

In [1]:
import pandas as pd
import pickle
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import RegexpTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from scipy import sparse
import re
import gensim
from skimage import io
import matplotlib.pyplot as plt
from gensim.test.utils import get_tmpfile

# Load Data

In [2]:
# Load description features
df = pd.read_excel("data/fashion_final.xlsx")
print (df.shape)
df.head(5)

(1272, 8)




Unnamed: 0,SN,Gender,Category,SubCategory,ProductType,Colour,Usage,ProductTitle
0,1.0,Girls,Apparel,Topwear,Tops,White,Casual,Gini and Jony Girls Knit White Top
1,2.0,Girls,Apparel,Topwear,Tops,Black,Casual,Gini and Jony Girls Black Top
2,3.0,Girls,Apparel,Topwear,Tops,Blue,Casual,Gini and Jony Girls Pretty Blossom Blue Top
3,4.0,Girls,Apparel,Topwear,Tops,Pink,Casual,Doodle Kids Girls Pink I love Shopping Top
4,5.0,Girls,Apparel,Bottomwear,Capris,Black,Casual,Gini and Jony Girls Black Capris


In [3]:
def inspect_product(product_id):
    
    single_product = df.query('SN==@product_id')
    print (single_product.ProductTitle)

    print ("Gender:")
    print (single_product.Gender)
    print ("Category:")
    print (single_product.Category)
    print ("SubCategory:")
    print (single_product.SubCategory)
    print ("Colour:")
    print (single_product.Colour)
    print ("Usage:")
    print (single_product.Usage)

In [4]:
inspect_product(product_id=8)

7    Gini and Jony Girls Red Top
Name: ProductTitle, dtype: object
Gender:
7    Girls
Name: Gender, dtype: object
Category:
7    Apparel
Name: Category, dtype: object
SubCategory:
7    Topwear
Name: SubCategory, dtype: object
Colour:
7    Red
Name: Colour, dtype: object
Usage:
7    Casual
Name: Usage, dtype: object


## Clean text

In [5]:
def stem_words(text):
    text = text.split()
    stemmer = SnowballStemmer('english')
    stemmed_words = [stemmer.stem(word) for word in text]
    text = " ".join(stemmed_words)
    return text

def make_lower_case(text):
    return text.lower()

def remove_stop_words(text):
    text = text.split()
    stops = set(stopwords.words("english"))
    text = [w for w in text if w not in stops]
    text = " ".join(text)
    return text

def remove_punctuation(text):
    tokenizer = RegexpTokenizer(r'\w+')
    text = tokenizer.tokenize(text)
    text = " ".join(text)
    return text

In [6]:
df['ProductTitle'] = df.ProductTitle.apply(func=make_lower_case)
df['ProductTitle'] = df.ProductTitle.apply(func=remove_stop_words)
df['ProductTitle'] = df.ProductTitle.apply(func=remove_punctuation)
df['ProductTitle'] = df.ProductTitle.apply(func=stem_words)

In [7]:
df.head(1)

Unnamed: 0,SN,Gender,Category,SubCategory,ProductType,Colour,Usage,ProductTitle
0,1.0,Girls,Apparel,Topwear,Tops,White,Casual,gini joni girl knit white top
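
The four cleaning steps run in a fixed order, and a user query would presumably need the same treatment before it is compared with the cleaned titles. A minimal sketch of chaining such steps is shown below. The step functions and the tiny stop-word set are stand-ins so that the example runs without NLTK; in the notebook the list would hold `make_lower_case`, `remove_stop_words`, `remove_punctuation`, and `stem_words`.

```python
import re
from functools import reduce

# Stand-in stop words (an assumption); NLTK's English list is much longer.
TOY_STOPS = {"a", "for", "in", "my", "i", "the"}


def lower(text):
    return text.lower()


def drop_stops(text):
    return " ".join(w for w in text.split() if w not in TOY_STOPS)


def drop_punctuation(text):
    # Same idea as RegexpTokenizer(r'\w+'): keep runs of word characters.
    return " ".join(re.findall(r"\w+", text))


def clean(text, steps=(lower, drop_stops, drop_punctuation)):
    """Apply each cleaning step in order, feeding the result to the next."""
    return reduce(lambda acc, step: step(acc), steps, text)


print(clean("A blue T-shirt, for my son!"))  # blue t shirt son
```

Note that the `\w+` tokenisation splits "t-shirt" into "t" and "shirt", so hyphenated product words end up as separate tokens.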


# TF-IDF Model

In [8]:
df['full_document'] = df['ProductTitle'] + ' ' + df['Gender'] + ' ' + df['Category']+ ' ' + df['SubCategory'] + ' ' + df['ProductType']+ ' ' + df['Colour'] + ' ' + df['Usage']

In [9]:
#Fit TFIDF 
#Learn vocabulary and tfidf from all ids.
tf = TfidfVectorizer(analyzer='word', 
                     min_df=10,
                     ngram_range=(1, 2),
                     #max_features=1000,
                     stop_words='english')
tf.fit(df['full_document'])

#Transform id products to document-term matrix.
tfidf_matrix = tf.transform(df['full_document'])
with open("models/tfidf_model.pkl", "wb") as f:
    pickle.dump(tf, f)

print (tfidf_matrix.shape)

(1272, 292)
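
The small vocabulary (292 terms) is largely a consequence of `min_df=10` combined with `ngram_range=(1, 2)`: unigrams and bigrams are both candidates, but only those appearing in at least ten documents are kept. The sketch below illustrates the idea on three toy documents. It is a simplification rather than scikit-learn's implementation: it splits on whitespace and skips stop-word removal, and the function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n_min=1, n_max=2):
    """Yield word n-grams for every n in [n_min, n_max]."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def vocabulary(documents, min_df=2):
    """Keep n-grams that occur in at least min_df documents."""
    df = Counter()
    for doc in documents:
        # set(): document frequency counts each n-gram once per document.
        df.update(set(ngrams(doc.split())))
    return sorted(term for term, count in df.items() if count >= min_df)


docs = [
    "girl white top",
    "girl black top",
    "girl black capri",
]
print(vocabulary(docs, min_df=2))  # ['black', 'girl', 'girl black', 'top']
```

Rare terms such as "white" and "capri" disappear here, which is the trade-off of a high `min_df`: a more compact matrix, but uncommon colours or product types may not be represented at all.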


In [10]:
# Compress with SVD
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=25)
latent_matrix = svd.fit_transform(tfidf_matrix)
with open("models/svd_model.pkl", "wb") as f:
    pickle.dump(svd, f)

print (latent_matrix.shape)

(1272, 25)


In [11]:
n = 25 #pick components

#Use elbow and cumulative plot to pick number of components. 

#Need a high amount of variance explained. 
doc_labels = df.ProductTitle
svd_feature_matrix = pd.DataFrame(latent_matrix[:,0:n] ,index=doc_labels)
print (svd_feature_matrix.shape)
svd_feature_matrix.head()

with open("models/lsa_embeddings.pkl", "wb") as f:
    pickle.dump(svd_feature_matrix, f)

(1272, 25)
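
The comment about using a cumulative plot to pick the number of components can be made concrete with a small helper that returns the first component count whose cumulative explained variance reaches a target. This is a sketch under assumptions: the function name and the 0.85 target are hypothetical, and the ratios are made up. In the notebook they would come from `svd.explained_variance_ratio_` after fitting `TruncatedSVD`.

```python
from itertools import accumulate


def components_for_variance(ratios, target=0.85):
    """Smallest number of leading components whose cumulative ratio reaches target."""
    for n, total in enumerate(accumulate(ratios), start=1):
        if total >= target:
            return n
    return None  # target not reached with the available components


# Made-up ratios for illustration only.
example_ratios = [0.40, 0.25, 0.15, 0.10, 0.05]
print(components_for_variance(example_ratios, target=0.85))  # 4
```

A `None` result signals that more components need to be fitted before the target can be met.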


# Doc2Vec Model

Doc2Vec learns each embedding from words together with their neighbouring context, so some word-order information is preserved in the embeddings; "I hate green" and "I love green" will be treated differently.

In [12]:
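descriptions = df.ProductTitle.values.tolist()

# Replace every non-word character with a space, then split into tokens.
# The raw string avoids an invalid-escape warning for "\w".
documents = [re.sub(r"[^\w]", " ", description).split() for description in descriptions]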

In [13]:
print (len(df))
print (len(documents))

1272
1272


In [14]:
df.loc[5]

SN                                                             6.0
Gender                                                       Girls
Category                                                   Apparel
SubCategory                                                Topwear
ProductType                                                   Tops
Colour                                                       White
Usage                                                       Casual
ProductTitle                    doodl kid girl citi chic white top
full_document    doodl kid girl citi chic white top Girls Appar...
Name: 5, dtype: object

In [15]:
formatted_documents = [gensim.models.doc2vec.TaggedDocument(doc, [i]) for i, doc in enumerate(documents)]

model = gensim.models.doc2vec.Doc2Vec(vector_size=25, min_count=5, epochs=1000, seed=0, window=1, dm=1)
model.build_vocab(formatted_documents)

In [16]:
%time model.train(formatted_documents, total_examples=model.corpus_count, epochs=model.epochs)

CPU times: user 12.2 s, sys: 401 ms, total: 12.7 s
Wall time: 12.3 s


In [17]:
# Save the trained model, then reload it to confirm the saved file is usable.
model.save("models/doc2vec_model")
model = gensim.models.doc2vec.Doc2Vec.load("./models/doc2vec_model")

In [18]:
vector = model.infer_vector(doc_words=["this", "is", "a", "test"], epochs=50)
vector

array([ 0.01065145, -0.00836683, -0.00930168,  0.00070612,  0.00284922,
       -0.01154674,  0.01747444, -0.00443355,  0.00369449,  0.00168475,
        0.00875234, -0.01220508, -0.00441751,  0.00090119, -0.00957686,
       -0.0124551 ,  0.01810199,  0.01608706, -0.00217809,  0.01831646,
       -0.00655122, -0.0021663 , -0.00702667,  0.00577031, -0.00783004],
      dtype=float32)

In [19]:
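# model.docvecs.vectors_docs is the gensim 3.x attribute; gensim 4.x exposes the same array as model.dv.vectors.
doctovec_feature_matrix = pd.DataFrame(model.docvecs.vectors_docs, index=df.ProductTitle)
print (doctovec_feature_matrix.shape)
doctovec_feature_matrix.head(3)
with open("models/doctovec_embeddings.pkl", "wb") as f:
    pickle.dump(doctovec_feature_matrix, f)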

(1272, 25)
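
Once the embeddings are saved, a search step could embed the user's query in the same space and rank products by cosine similarity. The sketch below shows that ranking in plain Python on toy 3-dimensional vectors. The function names and vectors are hypothetical; in practice the query vector might come from `model.infer_vector(...)` or from the TF-IDF and SVD models, and `cosine_similarity` from scikit-learn (imported at the top of the notebook) would do the same computation in vectorised form.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec, embeddings, k=2):
    """Return the k labels whose vectors are most similar to query_vec."""
    scored = [(cosine(query_vec, vec), label) for label, vec in embeddings.items()]
    return [label for _, label in sorted(scored, reverse=True)[:k]]


# Toy embeddings; the real ones have 25 dimensions per product title.
toy_embeddings = {
    "girl white top": [0.9, 0.1, 0.0],
    "girl black capri": [0.1, 0.9, 0.0],
    "boy blue tshirt": [0.0, 0.2, 0.9],
}
print(top_k([1.0, 0.2, 0.0], toy_embeddings, k=2))
# ['girl white top', 'girl black capri']
```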
