# Topic Modeling
Prepared by: Yifan Ren, Ricardo Lu, and Dr. Yilu Zhou

Welcome to Lab 4: Topic Modeling. This is the last lab of the semester. We will cover three latent methods for <b>dimension reduction</b> and <b>topic modeling</b>:
1. Latent Semantic Analysis (LSA or LSI)
2. Latent Dirichlet Allocation (LDA)
3. Correlated LDA Topic Model (Optional)


We highly recommend reading the following article to learn more about the first two models: https://towardsdatascience.com/2-latent-methods-for-dimension-reduction-and-topic-modeling-20ff6d7d547

In the same folder, we provide a regular-expression IPython notebook for your reference. Let's get started!

In [1]:
import pandas as pd 
import gensim
from gensim import corpora,models

## Preprocessing 

In [2]:
# Read data
# use read_csv to read csv file, not read_table
df = pd.read_csv('fashion.csv')
df

Unnamed: 0,year,season,brand,author of review,location,time,review text
0,2016,Spring,A Dtacher,Kristin Anderson,NEW YORK,"September 13, 2015",Detachment was the word of the day at A Dtache...
1,2016,Spring,A.F. Vandevorst,Luke Leitch,PARIS,"October 1, 2015",You heard this collection coming long before y...
2,2016,Spring,A.L.C.,Kristin Anderson,NEW YORK,"September 21, 2015",August saw the announcement of big news for A....
3,2016,Spring,A.P.C.,Nicole Phelps,PARIS,"October 3, 2015","They call me the king of basics, Jean Touitou ..."
4,2016,Spring,A.W.A.K.E.,Maya Singer,NEW YORK,"October 21, 2015",Natalia Alaverdian is a designer with a lot of...
...,...,...,...,...,...,...,...
429,2016,Spring,Zo Jordan,Maya Singer,LONDON,"September 19, 2015","Water, water, everywhere, / nor any drop to dr..."
430,2016,Spring,Zuhair Murad,Amy Verner,PARIS,"October 4, 2015","From a new Paris showroom, Zuhair Murad came a..."
431,2016,Spring,1205,Luke Leitch,LONDON,"September 19, 2015",Fashion and Instagram are such (often sacchari...
432,2016,Spring,3.1 Phillip Lim,Maya Singer,NEW YORK,"September 14, 2015",Let other New York City fashion designers toas...


In [3]:
#convert all review text into list format
docs = df['review text'].tolist()
docs[1]

'You heard this collection coming long before you saw it: a gutsy roar that grew to a crescendo as the models rode around the block from backstage into the courtyard of the Facult de Mdecine Paris Descartes. They were riding pillion on a 25-strong lineup of muscle bikesHarleys and Triumphs. This was because, as An Vandevorst explained pre-show: Its a road trip by a woman who lives in the East and has traveled to the West. Hence the mirrored Indian beading of lean biker-touched separates and fabulous goth-influenced saris, and the silver-shot Chinese brocade on cheongsam-biker hybrids. None of these souvenir details was especially literal; a studded, textured, and collarless burgundy jacket looked Chanel-meets-Mongolia (via Antwerp) and made a fine cupola to the elaborately tented blue pliss skirt below.\rVandevorst ticsfrogging, tailoring, leanness, sculpted volume for effectrecurred, most strongly on a set of white looks near the finale that achieved malleable stiffness and an otherwo

In [4]:
# Tokenize the documents.
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
# \w matches letters, digits, and the underscore (roughly [a-zA-Z0-9_] for ASCII text).
# Split the documents into tokens.
tokenizer = RegexpTokenizer(r'\w+')
for idx in range(len(docs)):
    docs[idx] = docs[idx].lower()  # Convert to lowercase.
    docs[idx] = tokenizer.tokenize(docs[idx])  # Split into words.

print(docs[1][:150])

['you', 'heard', 'this', 'collection', 'coming', 'long', 'before', 'you', 'saw', 'it', 'a', 'gutsy', 'roar', 'that', 'grew', 'to', 'a', 'crescendo', 'as', 'the', 'models', 'rode', 'around', 'the', 'block', 'from', 'backstage', 'into', 'the', 'courtyard', 'of', 'the', 'facult', 'de', 'mdecine', 'paris', 'descartes', 'they', 'were', 'riding', 'pillion', 'on', 'a', '25', 'strong', 'lineup', 'of', 'muscle', 'bikesharleys', 'and', 'triumphs', 'this', 'was', 'because', 'as', 'an', 'vandevorst', 'explained', 'pre', 'show', 'its', 'a', 'road', 'trip', 'by', 'a', 'woman', 'who', 'lives', 'in', 'the', 'east', 'and', 'has', 'traveled', 'to', 'the', 'west', 'hence', 'the', 'mirrored', 'indian', 'beading', 'of', 'lean', 'biker', 'touched', 'separates', 'and', 'fabulous', 'goth', 'influenced', 'saris', 'and', 'the', 'silver', 'shot', 'chinese', 'brocade', 'on', 'cheongsam', 'biker', 'hybrids', 'none', 'of', 'these', 'souvenir', 'details', 'was', 'especially', 'literal', 'a', 'studded', 'textured', '
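
To see what the `\w+` pattern keeps and what it discards, the sketch below uses Python's built-in `re.findall`, which for this pattern should split text the same way as `RegexpTokenizer(r'\w+')`. The sample sentence is made up for illustration and is not taken from `fashion.csv`.

```python
import re

# Hypothetical sample sentence (not from the dataset).
sample = "Chanel-meets-Mongolia (via Antwerp): it's a road_trip, 25-strong!"

# \w+ matches runs of letters, digits, and underscores; everything else
# (spaces, hyphens, apostrophes, punctuation) acts as a separator.
tokens = re.findall(r"\w+", sample.lower())
print(tokens)
```

Note that `it's` is split into `it` and `s`; fragments like `s` are removed later by the one-character filter.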

In [5]:
# Remove numbers, but not words that contain numbers.
docs = [[token for token in doc if not token.isnumeric()] for doc in docs]
    
print(docs[1][:50])

['you', 'heard', 'this', 'collection', 'coming', 'long', 'before', 'you', 'saw', 'it', 'a', 'gutsy', 'roar', 'that', 'grew', 'to', 'a', 'crescendo', 'as', 'the', 'models', 'rode', 'around', 'the', 'block', 'from', 'backstage', 'into', 'the', 'courtyard', 'of', 'the', 'facult', 'de', 'mdecine', 'paris', 'descartes', 'they', 'were', 'riding', 'pillion', 'on', 'a', 'strong', 'lineup', 'of', 'muscle', 'bikesharleys', 'and', 'triumphs']
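
As a quick check of the rule "remove numbers, but not words that contain numbers", the sketch below applies the same `isnumeric()` filter to a few made-up tokens. Only tokens consisting entirely of digits are dropped.

```python
# Hypothetical tokens chosen to show both cases.
sample_tokens = ["25", "3d", "1205", "a1", "strong"]

# str.isnumeric() is True only when every character is numeric,
# so mixed tokens such as "3d" and "a1" are kept.
kept_tokens = [t for t in sample_tokens if not t.isnumeric()]
print(kept_tokens)
```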


In [6]:
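# Remove stopwords.
# Build the stopword set once; calling stopwords.words() for every token is very slow.
stop_words = set(stopwords.words('english'))
docs = [[token for token in doc if token not in stop_words] for doc in docs]
print(docs[1][:50])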


['heard', 'collection', 'coming', 'long', 'saw', 'gutsy', 'roar', 'grew', 'crescendo', 'models', 'rode', 'around', 'block', 'backstage', 'courtyard', 'facult', 'de', 'mdecine', 'paris', 'descartes', 'riding', 'pillion', 'strong', 'lineup', 'muscle', 'bikesharleys', 'triumphs', 'vandevorst', 'explained', 'pre', 'show', 'road', 'trip', 'woman', 'lives', 'east', 'traveled', 'west', 'hence', 'mirrored', 'indian', 'beading', 'lean', 'biker', 'touched', 'separates', 'fabulous', 'goth', 'influenced', 'saris']


In [7]:
# Remove words that are only one character.
docs = [[token for token in doc if len(token) > 1] for doc in docs]
print(docs[1][:50])

['heard', 'collection', 'coming', 'long', 'saw', 'gutsy', 'roar', 'grew', 'crescendo', 'models', 'rode', 'around', 'block', 'backstage', 'courtyard', 'facult', 'de', 'mdecine', 'paris', 'descartes', 'riding', 'pillion', 'strong', 'lineup', 'muscle', 'bikesharleys', 'triumphs', 'vandevorst', 'explained', 'pre', 'show', 'road', 'trip', 'woman', 'lives', 'east', 'traveled', 'west', 'hence', 'mirrored', 'indian', 'beading', 'lean', 'biker', 'touched', 'separates', 'fabulous', 'goth', 'influenced', 'saris']


In [8]:
# Lemmatize the documents.
from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
docs = [[lemmatizer.lemmatize(token) for token in doc] for doc in docs]
print(docs[1][:50])

['heard', 'collection', 'coming', 'long', 'saw', 'gutsy', 'roar', 'grew', 'crescendo', 'model', 'rode', 'around', 'block', 'backstage', 'courtyard', 'facult', 'de', 'mdecine', 'paris', 'descartes', 'riding', 'pillion', 'strong', 'lineup', 'muscle', 'bikesharleys', 'triumph', 'vandevorst', 'explained', 'pre', 'show', 'road', 'trip', 'woman', 'life', 'east', 'traveled', 'west', 'hence', 'mirrored', 'indian', 'beading', 'lean', 'biker', 'touched', 'separate', 'fabulous', 'goth', 'influenced', 'sari']


In [9]:
# Compute bigrams.
from gensim.models import Phrases

# Train a bigram model; only word pairs that appear at least 10 times are considered.
bigram = Phrases(docs, min_count=10)


print(bigram[docs[0]])


['detachment', 'word', 'day', 'dtacher', 'yes', 'like', 'label', 'name', 'bien', 'sr', 'designer', 'mona', 'kowalska', 'love', 'high', 'concept', 'one', 'imago', 'today', 'detachment', 'included', 'unconcerned', 'gaze', 'others', 'kowalskas', 'woman', 'appears', 'runway', 'real', 'world', 'dress', 'intensely', 'arty', 'bend', 'taste', 'clothes', 'match', 'make', 'dtacher', 'cultishly', 'beloved', 'brand', 'among', 'certain', 'shopper', 'season', 'kowalska', 'presented', 'lineup', 'relatively', 'playful', 'offering', 'collection', 'opened', 'pair', 'midi', 'dress', 'indonesian', 'inspired', 'floral_print', 'reemerged', 'later', 'imagined', 'allover', 'pop', 'white', 'polka_dot', 'elsewhere', 'came', 'cardigan', 'uncanny', 'kind', 'amoxicillin', 'pink', 'imagined', 'dtacher', 'woman', 'wearing', 'tongue', 'firmly', 'cheek', 'kawakubo', 'esque', 'allover', 'hole', 'boot', 'popcorn', 'knit', 'pretty', 'fun', 'choice', 'use', 'hardier', 'material', 'lent', 'dress', 'eccentric', 'volume', 'a
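
The idea behind bigram detection can be sketched in plain Python: count adjacent token pairs and join the frequent ones with `_`. This is a simplified, count-only illustration. gensim's `Phrases` additionally scores each pair against a `threshold`, which this sketch ignores, and the function names and toy documents here are hypothetical.

```python
from collections import Counter

# Toy tokenized documents (hypothetical, for illustration only).
toy_docs = [
    ["white", "polka", "dot", "dress"],
    ["polka", "dot", "skirt"],
    ["white", "dress", "floral", "print"],
]

def count_bigrams(docs):
    """Count how often each adjacent token pair occurs across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(zip(doc, doc[1:]))
    return counts

def merge_bigrams(doc, counts, min_count):
    """Greedily join adjacent pairs seen at least min_count times with '_'."""
    merged, i = [], 0
    while i < len(doc):
        pair = tuple(doc[i:i + 2])
        if len(pair) == 2 and counts[pair] >= min_count:
            merged.append("_".join(pair))
            i += 2  # skip both tokens of the merged pair
        else:
            merged.append(doc[i])
            i += 1
    return merged

pair_counts = count_bigrams(toy_docs)
merged_doc = merge_bigrams(toy_docs[0], pair_counts, min_count=2)
print(merged_doc)
```

Here only `polka dot` occurs twice, so it is the only pair that gets merged.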

In [10]:
# Append each detected bigram (a token containing '_') to its document, keeping the original unigrams.
for idx in range(len(docs)):
    for token in bigram[docs[idx]]:
        if '_' in token:
            # Token is a bigram, add to document.
            docs[idx].append(token)
            
print(docs[0])

['detachment', 'word', 'day', 'dtacher', 'yes', 'like', 'label', 'name', 'bien', 'sr', 'designer', 'mona', 'kowalska', 'love', 'high', 'concept', 'one', 'imago', 'today', 'detachment', 'included', 'unconcerned', 'gaze', 'others', 'kowalskas', 'woman', 'appears', 'runway', 'real', 'world', 'dress', 'intensely', 'arty', 'bend', 'taste', 'clothes', 'match', 'make', 'dtacher', 'cultishly', 'beloved', 'brand', 'among', 'certain', 'shopper', 'season', 'kowalska', 'presented', 'lineup', 'relatively', 'playful', 'offering', 'collection', 'opened', 'pair', 'midi', 'dress', 'indonesian', 'inspired', 'floral', 'print', 'reemerged', 'later', 'imagined', 'allover', 'pop', 'white', 'polka', 'dot', 'elsewhere', 'came', 'cardigan', 'uncanny', 'kind', 'amoxicillin', 'pink', 'imagined', 'dtacher', 'woman', 'wearing', 'tongue', 'firmly', 'cheek', 'kawakubo', 'esque', 'allover', 'hole', 'boot', 'popcorn', 'knit', 'pretty', 'fun', 'choice', 'use', 'hardier', 'material', 'lent', 'dress', 'eccentric', 'volum

In [11]:
# Remove rare and common tokens.
from gensim.corpora import Dictionary

# Create a dictionary representation of the documents.
dictionary = Dictionary(docs)

# Filter out words that occur in fewer than 10 documents or in more than 80% of the documents.
# This step matters even more on larger corpora.
dictionary.filter_extremes(no_below=10, no_above=0.8)

print(docs[0])

['detachment', 'word', 'day', 'dtacher', 'yes', 'like', 'label', 'name', 'bien', 'sr', 'designer', 'mona', 'kowalska', 'love', 'high', 'concept', 'one', 'imago', 'today', 'detachment', 'included', 'unconcerned', 'gaze', 'others', 'kowalskas', 'woman', 'appears', 'runway', 'real', 'world', 'dress', 'intensely', 'arty', 'bend', 'taste', 'clothes', 'match', 'make', 'dtacher', 'cultishly', 'beloved', 'brand', 'among', 'certain', 'shopper', 'season', 'kowalska', 'presented', 'lineup', 'relatively', 'playful', 'offering', 'collection', 'opened', 'pair', 'midi', 'dress', 'indonesian', 'inspired', 'floral', 'print', 'reemerged', 'later', 'imagined', 'allover', 'pop', 'white', 'polka', 'dot', 'elsewhere', 'came', 'cardigan', 'uncanny', 'kind', 'amoxicillin', 'pink', 'imagined', 'dtacher', 'woman', 'wearing', 'tongue', 'firmly', 'cheek', 'kawakubo', 'esque', 'allover', 'hole', 'boot', 'popcorn', 'knit', 'pretty', 'fun', 'choice', 'use', 'hardier', 'material', 'lent', 'dress', 'eccentric', 'volum
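
What `filter_extremes` does can be approximated with a document-frequency filter: keep a token only if it appears in at least `no_below` documents and in no more than a `no_above` fraction of them. The sketch below is a simplified stand-in with a hypothetical helper name and toy data, not gensim's implementation.

```python
from collections import Counter

def filter_by_doc_freq(docs, no_below, no_above):
    """Return tokens found in >= no_below docs and <= no_above * len(docs) docs."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = no_above * len(docs)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

# Toy corpus (hypothetical): "dress" is in every document, "denim" in only one.
toy_docs = [
    ["dress", "silk", "collection"],
    ["dress", "denim", "collection"],
    ["dress", "silk", "collection", "biker"],
    ["dress", "collection"],
    ["dress", "skirt"],
]
kept = filter_by_doc_freq(toy_docs, no_below=2, no_above=0.8)
print(sorted(kept))
```

Tokens that are too rare carry little signal for a topic, and tokens that appear almost everywhere do not help tell topics apart.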

## Generate Term Document Matrix

In [12]:
# Bag-of-words representation of the documents.
corpus = [dictionary.doc2bow(doc) for doc in docs]

print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))

Number of unique tokens: 1388
Number of documents: 434


In [13]:
print(corpus)


[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 3), (12, 2), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 3), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 2), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 2), (38, 2), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1), (53, 1), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 2), (64, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 1), (73, 2), (74, 1), (75, 1), (76, 1), (77, 1), (78, 1), (79, 2), (80, 1), (81, 3), (82, 1), (83, 1), (84, 1), (85, 1), (86, 1)], [(1, 1), (4, 2), (12, 2), (26, 1), (36, 1), (38, 1), (39, 1), (47, 2), (59, 1), (65, 2), (76, 1), (80, 1), (81, 1), (87, 1), (88, 3), (89, 1), (90, 1), (91, 1), (92, 1), (93, 1), (94, 2), (95, 1), (96, 1), (97, 1), (98, 1), 

In [14]:
print(corpus[0])
# (0, 1) means that the word with id=0 appears once in the first document/review.

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 3), (12, 2), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 3), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 2), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 2), (38, 2), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1), (53, 1), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 2), (64, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 1), (73, 2), (74, 1), (75, 1), (76, 1), (77, 1), (78, 1), (79, 2), (80, 1), (81, 3), (82, 1), (83, 1), (84, 1), (85, 1), (86, 1)]
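
The bag-of-words format above can be reproduced by hand: map every token to an integer id, then count ids per document. The sketch below uses hypothetical helper names and toy documents. It assigns ids to unseen tokens in sorted order within each document, which is an assumption made to mirror the alphabetical ids seen in the output above.

```python
from collections import Counter

def build_token2id(docs):
    """Assign integer ids, giving unseen tokens ids in sorted order per document."""
    token2id = {}
    for doc in docs:
        for token in sorted(set(doc)):
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Convert a token list into sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Toy documents (hypothetical).
toy_docs = [["dress", "clothes", "dress", "boot"], ["boot", "silk"]]
token2id = build_token2id(toy_docs)
toy_bow = [to_bow(doc, token2id) for doc in toy_docs]
print(token2id)
print(toy_bow)
```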


In [15]:
word_counts = [[(dictionary[id], count) for id, count in line] for line in corpus]
print(word_counts[0])


[('ago', 1), ('also', 1), ('among', 1), ('appeal', 1), ('around', 1), ('boot', 1), ('brand', 1), ('came', 1), ('cardigan', 1), ('certain', 1), ('choice', 1), ('clothes', 3), ('collection', 2), ('concept', 1), ('could', 1), ('day', 1), ('designer', 1), ('dot', 1), ('dress', 3), ('elsewhere', 1), ('felt', 1), ('find', 1), ('floral', 1), ('floral_print', 1), ('fun', 1), ('gone', 1), ('high', 1), ('imagined', 2), ('included', 1), ('inspired', 1), ('kind', 1), ('knit', 1), ('label', 1), ('later', 1), ('led', 1), ('lent', 1), ('life', 1), ('like', 2), ('lineup', 2), ('looked', 1), ('love', 1), ('make', 1), ('material', 1), ('midi', 1), ('name', 1), ('offering', 1), ('often', 1), ('one', 1), ('opened', 1), ('others', 1), ('pair', 1), ('pink', 1), ('playful', 1), ('plenty', 1), ('polka', 1), ('polka_dot', 1), ('pop', 1), ('presented', 1), ('pretty', 1), ('print', 1), ('real', 1), ('room', 1), ('runway', 1), ('season', 2), ('shopper', 1), ('show', 1), ('spring', 1), ('spring_collection', 1), ('

In [17]:

# generate a unique token list 
sort_token = sorted(dictionary.items(),key=lambda k:k[0], reverse = False)


In [18]:
unique_token = [token for (ID,token) in sort_token]
#unique_token


In [19]:
import numpy as np
matrix = gensim.matutils.corpus2dense(corpus,num_terms=len(dictionary),dtype = 'int')

matrix = matrix.T #transpose the matrix 
print(matrix)

#convert the numpy matrix into pandas data frame
matrix_df = pd.DataFrame(matrix, columns=unique_token)


[[1 1 1 ... 0 0 0]
 [0 1 0 ... 0 0 0]
 [0 1 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [20]:
#write matrix dataframe into csv
matrix_df#.to_csv('Term_Document_matrix.csv')

Unnamed: 0,ago,also,among,appeal,around,boot,brand,came,cardigan,certain,...,la,robe,casting,voluminous,straightforward,stick,outerwear,fur,pure,parade
0,1,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
1,0,1,0,0,2,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,1,0,0,0,0,1,1,0,1,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,1,0,2,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
429,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
430,0,0,0,0,0,0,0,1,0,0,...,1,0,0,0,0,0,0,0,0,0
431,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
432,0,0,0,0,1,0,1,0,0,0,...,0,0,0,0,1,0,0,0,0,0
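
The step from sparse `(id, count)` pairs to the dense matrix above can be sketched without any library. The helper name and toy corpus below are hypothetical. Note that `corpus2dense` returns a terms-by-documents array, which is why the notebook transposes it, whereas this sketch builds documents-by-terms rows directly.

```python
def bow_to_dense(bow_corpus, num_terms):
    """Expand sparse (token_id, count) pairs into one dense row per document."""
    rows = []
    for bow in bow_corpus:
        row = [0] * num_terms
        for token_id, count in bow:
            row[token_id] = count
        rows.append(row)
    return rows

# Toy bag-of-words corpus (hypothetical) with a vocabulary of 4 tokens.
toy_corpus = [[(0, 1), (1, 1), (2, 2)], [(0, 1), (3, 1)]]
dense = bow_to_dense(toy_corpus, num_terms=4)
print(dense)
```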


## LDA model 

In [22]:
# Train LDA model.
from gensim.models import LdaModel

In [23]:
# Set training parameters.
num_topics = 10
chunksize = 2000
# chunksize is the number of documents to be used in each training chunk.
passes = 20
iterations = 100
eval_every = None  # Don't evaluate model perplexity, takes too much time.


temp = dictionary[0]  # This is only to "load" the dictionary.
id2word = dictionary.id2token
# id2word: mapping from word IDs to words.


# With chunksize = 2000, passes = 20, and iterations = 100, training proceeds as follows:
# - The corpus has only 434 documents, fewer than chunksize, so each pass processes all of them as a single chunk.
# - The model makes 20 passes over the corpus.
# - Within a chunk, each document's topic assignments are updated at most 100 times,
#   moving on to the next document earlier if it has already converged.
# With a smaller chunksize, e.g. 100, each pass would instead be split into rounds:
# documents 0-99, 100-199, 200-299, and so on.
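
How `chunksize` and `passes` divide the work is simple arithmetic. The sketch below uses a hypothetical helper, `training_schedule`, and the document count reported earlier (434). It only illustrates the bookkeeping and is not how gensim is implemented internally.

```python
import math

def training_schedule(num_docs, chunksize, passes):
    """Return (chunks per pass, total chunk updates over all passes)."""
    chunks_per_pass = math.ceil(num_docs / chunksize)
    return chunks_per_pass, chunks_per_pass * passes

# Settings used in this lab: 434 documents, chunksize 2000, 20 passes.
lab_schedule = training_schedule(434, 2000, 20)
# A smaller, hypothetical chunksize for comparison.
small_chunk_schedule = training_schedule(434, 100, 20)
print(lab_schedule, small_chunk_schedule)
```

Because the corpus is smaller than `chunksize`, every pass sees the whole corpus in one chunk. With `chunksize=100` the same corpus would be split into five chunks per pass.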

In [24]:
lda = LdaModel(
    corpus=corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=num_topics,
    passes=passes,
    eval_every=eval_every
)
# alpha: prior on how topics are distributed across documents;
#        'auto' learns an asymmetric prior from the corpus.
# eta:   a-priori belief on the topic-word distribution;
#        'auto' learns an asymmetric prior from the corpus.

In [25]:
# V matrix: the topic-word weights (top words for each topic)
for i,topic in lda.print_topics(10):
    print(f'Top 10 words for topic #{i+1}:')
    print(topic)
    print('\n')

Top 10 words for topic #1:
0.013*"dress" + 0.009*"season" + 0.009*"brand" + 0.007*"designer" + 0.007*"sweater" + 0.007*"knit" + 0.007*"one" + 0.007*"collection" + 0.007*"spring" + 0.007*"style"


Top 10 words for topic #2:
0.016*"collection" + 0.012*"new" + 0.011*"season" + 0.011*"like" + 0.009*"designer" + 0.008*"show" + 0.008*"one" + 0.008*"said" + 0.007*"look" + 0.006*"made"


Top 10 words for topic #3:
0.020*"collection" + 0.013*"made" + 0.011*"look" + 0.010*"clothes" + 0.008*"show" + 0.008*"designer" + 0.008*"well" + 0.008*"de" + 0.007*"year" + 0.006*"fabric"


Top 10 words for topic #4:
0.013*"show" + 0.013*"collection" + 0.013*"dress" + 0.010*"look" + 0.009*"woman" + 0.008*"way" + 0.008*"like" + 0.008*"one" + 0.006*"new" + 0.006*"model"


Top 10 words for topic #5:
0.016*"dress" + 0.012*"collection" + 0.010*"look" + 0.009*"like" + 0.008*"designer" + 0.008*"one" + 0.008*"spring" + 0.007*"new" + 0.007*"white" + 0.007*"skirt"


Top 10 words for topic #6:
0.012*"silk" + 0.012*"feel"

In [26]:
import re
for i,topic in lda.print_topics(10):
    print(f'Top 10 words for topic #{i+1}:')
    print(",".join(re.findall('".*?"',topic)))
    print('\n')

Top 10 words for topic #1:
"dress","season","brand","designer","sweater","knit","one","collection","spring","style"


Top 10 words for topic #2:
"collection","new","season","like","designer","show","one","said","look","made"


Top 10 words for topic #3:
"collection","made","look","clothes","show","designer","well","de","year","fabric"


Top 10 words for topic #4:
"show","collection","dress","look","woman","way","like","one","new","model"


Top 10 words for topic #5:
"dress","collection","look","like","designer","one","spring","new","white","skirt"


Top 10 words for topic #6:
"silk","feel","style","top","trouser","yet","collection","raw","fabric","fluid"


Top 10 words for topic #7:
"dress","collection","de","stripe","black","white","jacket","shirt","cut","color"


Top 10 words for topic #8:
"dress","new","one","designer","season","collection","like","piece","look","brand"


Top 10 words for topic #9:
"spring","dress","silk","also","fashion","collection","shirt","show","designer","way"

In [27]:
top_topics = lda.top_topics(corpus) 

# Average topic coherence is the sum of topic coherences of all topics, divided by the number of topics.
avg_topic_coherence = sum([t[1] for t in top_topics]) / num_topics
print('Average topic coherence: %.4f.' % avg_topic_coherence)

from pprint import pprint
pprint(top_topics)

Average topic coherence: -1.3250.
[([(0.012721112, 'dress'),
   (0.011206143, 'new'),
   (0.010772348, 'one'),
   (0.010460437, 'designer'),
   (0.009447374, 'season'),
   (0.0090580415, 'collection'),
   (0.007098073, 'like'),
   (0.006596353, 'piece'),
   (0.0064962287, 'look'),
   (0.0064402414, 'brand'),
   (0.0059645916, 'show'),
   (0.005743937, 'jacket'),
   (0.0057208245, 'print'),
   (0.005611487, 'fashion'),
   (0.005224773, 'spring'),
   (0.005220289, 'came'),
   (0.005195366, 'time'),
   (0.005116625, 'way'),
   (0.0051039415, 'back'),
   (0.0051032617, 'white')],
  -0.895201041274183),
 ([(0.014981075, 'collection'),
   (0.013693339, 'dress'),
   (0.011592543, 'one'),
   (0.01029165, 'show'),
   (0.009189725, 'designer'),
   (0.008740251, 'like'),
   (0.0074928096, 'look'),
   (0.0070348517, 'piece'),
   (0.007032408, 'new'),
   (0.0068038777, 'fashion'),
   (0.0057941917, 'season'),
   (0.005786032, 'thing'),
   (0.00561043, 'made'),
   (0.00549414, 'print'),
   (0.005276

In [29]:
# Generate U Matrix for LDA model
corpus_lda = lda[corpus] #transform lda model

#convert corpus_lda to numpy matrix
U_matrix_lda = gensim.matutils.corpus2dense(corpus_lda,num_terms=10).T

#write U_matrix into pandas dataframe and output
U_matrix_lda_df = pd.DataFrame(U_matrix_lda)
U_matrix_lda_df#.to_csv('U_matrix_lda.csv')

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,0.0,0.0,0.000000,0.392661,0.604892,0.0,0.00000,0.000000,0.0,0.000000
1,0.0,0.0,0.995600,0.000000,0.000000,0.0,0.00000,0.000000,0.0,0.000000
2,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.00000,0.552833,0.0,0.444824
3,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.00000,0.000000,0.0,0.996923
4,0.0,0.0,0.000000,0.000000,0.996407,0.0,0.00000,0.000000,0.0,0.000000
...,...,...,...,...,...,...,...,...,...,...
429,0.0,0.0,0.150691,0.000000,0.845722,0.0,0.00000,0.000000,0.0,0.000000
430,0.0,0.0,0.000000,0.000000,0.320952,0.0,0.00000,0.676452,0.0,0.000000
431,0.0,0.0,0.000000,0.000000,0.822648,0.0,0.12767,0.000000,0.0,0.046368
432,0.0,0.0,0.163065,0.000000,0.239265,0.0,0.00000,0.000000,0.0,0.596086


In [30]:
print(matrix_df.shape)
print(U_matrix_lda_df.shape)

(434, 1388)
(434, 10)


See what we have achieved! We reduced the number of features from 1,388 to 10!
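
One practical use of the reduced matrix is to label each review with its dominant topic. The sketch below does this in plain Python on the first three rows of the LDA U matrix shown above. With pandas, `U_matrix_lda_df.idxmax(axis=1)` should give the same labels for the full matrix.

```python
# First three rows of the LDA U matrix, copied from the output above.
rows = [
    [0.0, 0.0, 0.0, 0.392661, 0.604892, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.9956, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.552833, 0.0, 0.444824],
]

# For each document, pick the column index (topic) with the largest weight.
dominant = [max(range(len(row)), key=row.__getitem__) for row in rows]
print(dominant)
```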

## LSI model 

### Generate Tf-idf Matrix

In [32]:
# Review: corpus is the bag-of-words representation of each document.
corpus = [dictionary.doc2bow(doc) for doc in docs]
print(corpus[0])

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 3), (12, 2), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 3), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 2), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 2), (38, 2), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1), (53, 1), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 2), (64, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 1), (73, 2), (74, 1), (75, 1), (76, 1), (77, 1), (78, 1), (79, 2), (80, 1), (81, 3), (82, 1), (83, 1), (84, 1), (85, 1), (86, 1)]


In [33]:
# Tfidf Transformation 
#from gensim.models import LsiModel
tfidf = models.TfidfModel(corpus) #fit tfidf model
corpus_tfidf = tfidf[corpus]      #apply model to the corpus 

In [34]:
corpus_tfidf[0]  # the first document 
# (0, 0.11963255332401328): 0.1196 is the tf-idf weight of the token with id = 0

[(0, 0.11963255332401328),
 (1, 0.03946650199631388),
 (2, 0.08930202191258248),
 (3, 0.10174488434454576),
 (4, 0.06455571113652354),
 (5, 0.11181231971029078),
 (6, 0.05560789834344296),
 (7, 0.04445767652571364),
 (8, 0.14827615723342932),
 (9, 0.10064284609379101),
 (10, 0.13152072306091103),
 (11, 0.11914408453910828),
 (12, 0.02017428008006786),
 (13, 0.11963255332401328),
 (14, 0.05903766182884297),
 (15, 0.06365732824804932),
 (16, 0.019766632251364283),
 (17, 0.12139128620519539),
 (18, 0.03321327281086008),
 (19, 0.0955346247928842),
 (20, 0.07199924218437495),
 (21, 0.10064284609379101),
 (22, 0.07782917210688856),
 (23, 0.12139128620519539),
 (24, 0.11181231971029078),
 (25, 0.12515060263169386),
 (26, 0.06107373793437025),
 (27, 0.29655231446685865),
 (28, 0.11326243289479608),
 (29, 0.06641328162835639),
 (30, 0.0753976481481135),
 (31, 0.0606584180031664),
 (32, 0.06987959884043292),
 (33, 0.13638798749653158),
 (34, 0.13905497639616282),
 (35, 0.14190603680421215),
 (36
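
To make the tf-idf weights less of a black box, the sketch below recomputes them by hand on a toy corpus. It assumes what should be gensim's default scheme: raw term count times `log2(N / df)`, with each document vector scaled to unit length. The helper name and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def tfidf_transform(bow_corpus):
    """Weight counts by log2(N / df) and scale each document to unit length."""
    num_docs = len(bow_corpus)
    doc_freq = Counter(token_id for bow in bow_corpus for token_id, _ in bow)
    weighted = []
    for bow in bow_corpus:
        vec = [(tid, tf * math.log2(num_docs / doc_freq[tid])) for tid, tf in bow]
        vec = [(tid, w) for tid, w in vec if w > 0]  # drop tokens found in every document
        norm = math.sqrt(sum(w * w for _, w in vec))
        weighted.append([(tid, w / norm) for tid, w in vec] if norm else [])
    return weighted

# Toy bag-of-words corpus (hypothetical): token 0 appears in all three documents.
toy_corpus = [[(0, 1), (1, 1), (2, 2)], [(0, 1), (3, 1)], [(0, 2), (2, 1)]]
toy_tfidf = tfidf_transform(toy_corpus)
for doc in toy_tfidf:
    print([(tid, round(w, 3)) for tid, w in doc])
```

A token that occurs in every document gets an idf of zero and disappears, which is why very common words such as `collection` receive small weights in the output above.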

In [35]:
for doc in corpus_tfidf :
    print([[dictionary[id], np.around(freq,decimals=3)] for id, freq in doc])

[['ago', 0.12], ['also', 0.039], ['among', 0.089], ['appeal', 0.102], ['around', 0.065], ['boot', 0.112], ['brand', 0.056], ['came', 0.044], ['cardigan', 0.148], ['certain', 0.101], ['choice', 0.132], ['clothes', 0.119], ['collection', 0.02], ['concept', 0.12], ['could', 0.059], ['day', 0.064], ['designer', 0.02], ['dot', 0.121], ['dress', 0.033], ['elsewhere', 0.096], ['felt', 0.072], ['find', 0.101], ['floral', 0.078], ['floral_print', 0.121], ['fun', 0.112], ['gone', 0.125], ['high', 0.061], ['imagined', 0.297], ['included', 0.113], ['inspired', 0.066], ['kind', 0.075], ['knit', 0.061], ['label', 0.07], ['later', 0.136], ['led', 0.139], ['lent', 0.142], ['life', 0.081], ['like', 0.045], ['lineup', 0.185], ['looked', 0.075], ['love', 0.09], ['make', 0.064], ['material', 0.086], ['midi', 0.123], ['name', 0.113], ['offering', 0.104], ['often', 0.085], ['one', 0.021], ['opened', 0.123], ['others', 0.115], ['pair', 0.074], ['pink', 0.077], ['playful', 0.125], ['plenty', 0.095], ['polka',

[['around', 0.06], ['could', 0.055], ['day', 0.059], ['playful', 0.117], ['presented', 0.113], ['pretty', 0.09], ['season', 0.029], ['wearing', 0.094], ['woman', 0.043], ['explained', 0.086], ['made', 0.076], ['skirt', 0.033], ['sometimes', 0.108], ['black', 0.039], ['bloom', 0.276], ['clearly', 0.118], ['fashion', 0.038], ['many', 0.065], ['strongest', 0.123], ['work', 0.062], ['french', 0.118], ['piece', 0.062], ['seems', 0.099], ['silk', 0.043], ['wear', 0.065], ['every', 0.08], ['layered', 0.092], ['motif', 0.085], ['traditional', 0.096], ['certainly', 0.115], ['hard', 0.081], ['pant', 0.051], ['short', 0.136], ['tee', 0.1], ['evening', 0.082], ['gold', 0.09], ['shirt', 0.053], ['sweatshirt', 0.118], ['track', 0.123], ['drawstring', 0.141], ['flower', 0.275], ['go', 0.062], ['across', 0.096], ['printed', 0.066], ['meanwhile', 0.107], ['organza', 0.196], ['left', 0.095], ['decorative', 0.135], ['desert', 0.145], ['lace', 0.06], ['shift', 0.096], ['theme', 0.066], ['version', 0.103],

### Train LSI model

In [36]:
# Train LSI model

from gensim.models import LsiModel


# id2word: ID-to-word mapping
lsi = LsiModel(corpus_tfidf, id2word=dictionary.id2token, num_topics=10)

In [37]:
# V matrix
for i,topic in lsi.print_topics(10):
    print(f'Top 10 words for topic #{i+1}:')
    print(topic)
    print('\n')

Top 10 words for topic #1:
-0.085*"show" + -0.077*"new" + -0.076*"woman" + -0.073*"print" + -0.072*"look" + -0.072*"fashion" + -0.071*"black" + -0.071*"season" + -0.071*"way" + -0.070*"jacket"


Top 10 words for topic #2:
-0.140*"show" + -0.137*"model" + -0.136*"fashion" + -0.109*"young" + -0.106*"people" + 0.100*"cotton" + 0.096*"particularly" + 0.090*"jumpsuit" + 0.088*"knit" + 0.087*"texture"


Top 10 words for topic #3:
-0.219*"denim" + -0.191*"jean" + -0.152*"brand" + 0.141*"gown" + -0.108*"vintage" + 0.101*"black_white" + 0.093*"black" + 0.093*"ruffle" + -0.091*"wear" + 0.085*"red"


Top 10 words for topic #4:
-0.179*"biker" + -0.132*"de" + 0.122*"gown" + 0.108*"new_york" + 0.108*"york" + 0.104*"carpet" + 0.103*"flower" + -0.100*"shoulder" + -0.098*"jacket" + -0.096*"fine"


Top 10 words for topic #5:
-0.156*"flower" + -0.124*"shirt" + 0.109*"sense" + 0.107*"clothes" + -0.106*"embroidered" + -0.096*"lace" + -0.084*"girl" + -0.084*"rose" + -0.082*"sequined" + -0.081*"jean"


Top 1
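
Unlike LDA probabilities, LSI weights can be negative. Words with opposite signs pull a document in opposite directions along the same component, and the overall sign of a component is arbitrary. The sketch below parses the topic #3 string printed above and splits its words by sign. The regular expression is an assumption about the `print_topics` string format.

```python
import re

# Topic #3 string copied from the LSI output above.
topic = ('-0.219*"denim" + -0.191*"jean" + -0.152*"brand" + 0.141*"gown" + '
         '-0.108*"vintage" + 0.101*"black_white" + 0.093*"black" + '
         '0.093*"ruffle" + -0.091*"wear" + 0.085*"red"')

# Extract (word, weight) pairs, then group the words by the sign of the weight.
pairs = [(word, float(weight))
         for weight, word in re.findall(r'(-?\d+\.\d+)\*"(.*?)"', topic)]
negative = [word for word, weight in pairs if weight < 0]
positive = [word for word, weight in pairs if weight > 0]
print("negative side:", negative)
print("positive side:", positive)
```

Read this way, the component seems to contrast casual denim vocabulary with eveningwear vocabulary, although that interpretation is only a suggestion.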

In [38]:

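# Note: this cell calls lda.print_topics, so it reprints the LDA topic words (handy for
# comparing against the LSI topics above); call lsi.print_topics(10) to list the LSI words instead.
for i,topic in lda.print_topics(10):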
    print(f'Top 10 words for topic #{i+1}:')
    print(",".join(re.findall('".*?"',topic)))
    print('\n')

Top 10 words for topic #1:
"dress","season","brand","designer","sweater","knit","one","collection","spring","style"


Top 10 words for topic #2:
"collection","new","season","like","designer","show","one","said","look","made"


Top 10 words for topic #3:
"collection","made","look","clothes","show","designer","well","de","year","fabric"


Top 10 words for topic #4:
"show","collection","dress","look","woman","way","like","one","new","model"


Top 10 words for topic #5:
"dress","collection","look","like","designer","one","spring","new","white","skirt"


Top 10 words for topic #6:
"silk","feel","style","top","trouser","yet","collection","raw","fabric","fluid"


Top 10 words for topic #7:
"dress","collection","de","stripe","black","white","jacket","shirt","cut","color"


Top 10 words for topic #8:
"dress","new","one","designer","season","collection","like","piece","look","brand"


Top 10 words for topic #9:
"spring","dress","silk","also","fashion","collection","shirt","show","designer","way"

In [39]:
# Generate U Matrix for LSI model
corpus_lsi = lsi[corpus_tfidf] #transform lsi model

#convert corpus_lsi to numpy matrix
U_matrix_lsi = gensim.matutils.corpus2dense(corpus_lsi,num_terms=10).T
print(U_matrix_lsi)
#write U_matrix into pandas dataframe and output
pd.DataFrame(U_matrix_lsi)#.to_csv('U_matrix_lsi.csv')

[[-0.28098276  0.03018355 -0.01921144 ...  0.09768897 -0.02725368
  -0.0368022 ]
 [-0.21378566 -0.03748095  0.09627153 ... -0.08487883  0.06971856
  -0.05286683]
 [-0.2581887   0.11200842 -0.065998   ... -0.04327662 -0.03789556
  -0.10610017]
 ...
 [-0.22592361 -0.06174918 -0.02018499 ... -0.09688623  0.03504658
   0.10249774]
 [-0.30440453 -0.01473377  0.0384463  ...  0.02390151 -0.06650899
   0.01236416]
 [-0.26291904  0.03632743 -0.2009411  ... -0.02999881 -0.05907334
   0.04154236]]


Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,-0.280983,0.030184,-0.019211,0.160555,0.058921,0.108053,0.009577,0.097689,-0.027254,-0.036802
1,-0.213786,-0.037481,0.096272,-0.244720,-0.077087,0.149585,0.022516,-0.084879,0.069719,-0.052867
2,-0.258189,0.112008,-0.065998,0.080148,0.069029,0.094309,0.033665,-0.043277,-0.037896,-0.106100
3,-0.301482,-0.021194,-0.142813,-0.149136,0.038401,-0.008035,-0.036274,0.013772,-0.032229,-0.184593
4,-0.262021,0.050145,0.099388,-0.016385,0.087563,-0.019164,0.055110,0.191923,0.129695,0.066944
...,...,...,...,...,...,...,...,...,...,...
429,-0.234227,0.097007,-0.108975,-0.064248,0.085803,-0.059206,0.070988,0.034458,0.063572,0.106582
430,-0.263408,0.072693,0.025383,0.026032,0.039886,0.015990,0.013696,-0.128699,-0.048395,-0.002675
431,-0.225924,-0.061749,-0.020185,-0.080936,0.054701,0.068489,-0.016571,-0.096886,0.035047,0.102498
432,-0.304405,-0.014734,0.038446,-0.123345,0.073677,0.004231,-0.012188,0.023902,-0.066509,0.012364
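
Once every review is a 10-dimensional vector, reviews can be compared with cosine similarity. The sketch below uses the first three rows of the LSI U matrix shown above (rounded as printed). The `cosine` helper is written out for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Rows 0, 1, and 2 of the LSI U matrix, copied from the output above.
doc0 = [-0.280983, 0.030184, -0.019211, 0.160555, 0.058921,
        0.108053, 0.009577, 0.097689, -0.027254, -0.036802]
doc1 = [-0.213786, -0.037481, 0.096272, -0.244720, -0.077087,
        0.149585, 0.022516, -0.084879, 0.069719, -0.052867]
doc2 = [-0.258189, 0.112008, -0.065998, 0.080148, 0.069029,
        0.094309, 0.033665, -0.043277, -0.037896, -0.106100]

sim_01 = cosine(doc0, doc1)
sim_02 = cosine(doc0, doc2)
print(f"review 0 vs review 1: {sim_01:.3f}")
print(f"review 0 vs review 2: {sim_02:.3f}")
```

On these rounded values, review 0 comes out closer to review 2 than to review 1 in the LSI space.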


## Correlated LDA Topic Model (Optional)

In [35]:
import tomotopy as tp

In [36]:
ctm = tp.CTModel(k=10)
# k is the number of topics


In [37]:
# Add each tokenized document to the CTM model.
for doc in docs:
    ctm.add_doc(doc)

# Train for 500 iterations in total, 10 iterations per call.
for i in range(0, 500, 10):
    ctm.train(10)

In [79]:
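# Note: this reuses the name U_matrix_lda_df, so the LDA matrix above is overwritten by the CTM result.
U_matrix_lda_df = pd.DataFrame([doc.get_topic_dist() for doc in ctm.docs])
# get_topic_dist() returns the document's distribution over topics.
U_matrix_lda_df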

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,0.151370,0.082877,0.082877,0.178767,0.048630,0.117123,0.076027,0.082877,0.082877,0.096575
1,0.119375,0.119375,0.144375,0.056875,0.063125,0.094375,0.081875,0.131875,0.056875,0.131875
2,0.126286,0.086286,0.120571,0.114857,0.080571,0.080571,0.172000,0.069143,0.080571,0.069143
3,0.078065,0.084516,0.097419,0.071613,0.129677,0.090968,0.129677,0.116774,0.084516,0.116774
4,0.102027,0.129054,0.068243,0.088514,0.095270,0.102027,0.115541,0.102027,0.081757,0.115541
...,...,...,...,...,...,...,...,...,...,...
429,0.117500,0.092500,0.150833,0.100833,0.067500,0.075833,0.159167,0.100833,0.067500,0.067500
430,0.075625,0.125625,0.169375,0.094375,0.100625,0.119375,0.094375,0.063125,0.106875,0.050625
431,0.090977,0.083459,0.121053,0.075940,0.083459,0.128571,0.128571,0.106015,0.083459,0.098496
432,0.144059,0.094554,0.069802,0.109406,0.079703,0.089604,0.104455,0.124257,0.094554,0.089604
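
A quick sanity check on these document-topic matrices is whether each row sums to 1. The sketch below compares row 0 of the CTM output above with row 0 of the earlier LDA output. The LDA row falls slightly short, most likely because gensim's `LdaModel` drops topics below its `minimum_probability` setting (0.01 by default) when transforming a document.

```python
# Row 0 of the CTM topic distribution, copied from the output above.
ctm_row0 = [0.151370, 0.082877, 0.082877, 0.178767, 0.048630,
            0.117123, 0.076027, 0.082877, 0.082877, 0.096575]
# Row 0 of the LDA U matrix, copied from the earlier LDA output.
lda_row0 = [0.0, 0.0, 0.0, 0.392661, 0.604892, 0.0, 0.0, 0.0, 0.0, 0.0]

ctm_total = sum(ctm_row0)
lda_total = sum(lda_row0)
print(f"CTM row sum: {ctm_total:.6f}")
print(f"LDA row sum: {lda_total:.6f}")
```

The CTM rows are also much flatter than the LDA rows: every topic gets some weight, so picking a dominant topic is less clear-cut.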


In [40]:
def imitate_print(ctm):
    """Format CTM topics like gensim's print_topics: (topic_id, '0.054*"one" + ...')."""
    return [(i, " + ".join(f'{round(p, 3)}*"{w}"' for w, p in ctm.get_topic_words(i)))
            for i in range(10)]

In [88]:
ctm.get_topic_words(0)

[('one', 0.05410395935177803),
 ('way', 0.030450383201241493),
 ('also', 0.028882190585136414),
 ('jacket', 0.02692195028066635),
 ('would', 0.01672869734466076),
 ('long', 0.015421870164573193),
 ('back', 0.01516050472855568),
 ('could', 0.0135923121124506),
 ('hand', 0.01293889805674553),
 ('new_york', 0.011632070876657963)]

In [83]:
import re
for i,topic in imitate_print(ctm):
    print(f'Top 10 words for topic #{i+1}:')
    print(topic)
    print('\n')

Top 10 words for topic #1:
0.054*"one" + 0.03*"way" + 0.029*"also" + 0.027*"jacket" + 0.017*"would" + 0.015*"long" + 0.015*"back" + 0.014*"could" + 0.013*"hand" + 0.012*"new_york"


Top 10 words for topic #2:
0.05*"like" + 0.021*"first" + 0.014*"day" + 0.012*"theme" + 0.011*"printed" + 0.011*"best" + 0.01*"see" + 0.01*"red" + 0.01*"many" + 0.01*"floral"


Top 10 words for topic #3:
0.049*"new" + 0.029*"woman" + 0.024*"came" + 0.02*"pant" + 0.018*"time" + 0.016*"point" + 0.015*"take" + 0.014*"though" + 0.012*"always" + 0.011*"something"


Top 10 words for topic #4:
0.051*"designer" + 0.034*"spring" + 0.03*"clothes" + 0.022*"look" + 0.018*"today" + 0.018*"knit" + 0.013*"two" + 0.012*"work" + 0.011*"idea" + 0.01*"signature"


Top 10 words for topic #5:
0.034*"dress" + 0.034*"piece" + 0.023*"made" + 0.019*"runway" + 0.016*"cut" + 0.015*"leather" + 0.014*"much" + 0.012*"cotton" + 0.012*"shirt" + 0.012*"sense"


Top 10 words for topic #6:
0.018*"high" + 0.018*"line" + 0.016*"girl" + 0.015*"e

In [41]:
import re
for i,topic in imitate_print(ctm):
    print(f'Top 10 words for topic #{i+1}:')
    print(",".join(re.findall('".*?"',topic)))
    print('\n')

Top 10 words for topic #1:
"one","way","also","jacket","would","long","back","could","hand","new_york"


Top 10 words for topic #2:
"like","first","day","theme","printed","best","see","red","many","floral"


Top 10 words for topic #3:
"new","woman","came","pant","time","point","take","though","always","something"


Top 10 words for topic #4:
"designer","spring","clothes","look","today","knit","two","work","idea","signature"


Top 10 words for topic #5:
"dress","piece","made","runway","cut","leather","much","cotton","shirt","sense"


Top 10 words for topic #6:
"high","line","girl","even","yet","trouser","little","blue","color","bit"


Top 10 words for topic #7:
"skirt","fashion","look","brand","well","top","model","still","make","suit"


Top 10 words for topic #8:
"show","thing","shoulder","feel","looked","set","body","found","full","inspiration"


Top 10 words for topic #9:
"season","said","print","black","style","silk","lace","silhouette","denim","inspired"


Top 10 words for topic #1

In [42]:
U_matrix_lda_df#.to_csv('U_matrix_ctm.csv')

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,0.151370,0.082877,0.082877,0.178767,0.048630,0.117123,0.076027,0.082877,0.082877,0.096575
1,0.119375,0.119375,0.144375,0.056875,0.063125,0.094375,0.081875,0.131875,0.056875,0.131875
2,0.126286,0.086286,0.120571,0.114857,0.080571,0.080571,0.172000,0.069143,0.080571,0.069143
3,0.078065,0.084516,0.097419,0.071613,0.129677,0.090968,0.129677,0.116774,0.084516,0.116774
4,0.102027,0.129054,0.068243,0.088514,0.095270,0.102027,0.115541,0.102027,0.081757,0.115541
...,...,...,...,...,...,...,...,...,...,...
429,0.117500,0.092500,0.150833,0.100833,0.067500,0.075833,0.159167,0.100833,0.067500,0.067500
430,0.075625,0.125625,0.169375,0.094375,0.100625,0.119375,0.094375,0.063125,0.106875,0.050625
431,0.090977,0.083459,0.121053,0.075940,0.083459,0.128571,0.128571,0.106015,0.083459,0.098496
432,0.144059,0.094554,0.069802,0.109406,0.079703,0.089604,0.104455,0.124257,0.094554,0.089604
