# Grouping Documents with Averaged Word Embeddings

Now that I have my selected questions and tarot interpretations, I'll use a pre-trained word embedding model to compute an averaged vector for each document. Then I can group similar documents and questions using a similarity measure such as cosine similarity.

The final result will be a series of bundled similar questions and interpretations that I can assign to each card image.
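
For reference, the two quantities used throughout can be written as below. The notation is my own shorthand rather than anything from the source article: $w_1, \dots, w_n$ are the in-vocabulary tokens of a document $d$, and $\mathbf{e}(w)$ is the pre-trained embedding of word $w$.

$$
\mathbf{v}_d = \frac{1}{n}\sum_{i=1}^{n} \mathbf{e}(w_i),
\qquad
\operatorname{sim}(d, d') = \frac{\mathbf{v}_d \cdot \mathbf{v}_{d'}}{\lVert \mathbf{v}_d \rVert \, \lVert \mathbf{v}_{d'} \rVert}
$$

Because every document vector is an average over many common words, the similarities tend to sit close together, so the ranking matters more than the absolute score.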




**Sources & References**
* [Word2Vec to Analyze News Headlines and Predict Article Success - by Charlene Chambliss (from Towards Data Science)](https://towardsdatascience.com/using-word2vec-to-analyze-news-headlines-and-predict-article-success-cdeda5f14751)

## Imports and Setup

In [1]:
# Necessary imports

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import gensim
import gensim.downloader as api

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Importing tarot interpretations

In [3]:
# Import tarot interpretations / meanings from previous notebook

meanings_file = '/content/drive/MyDrive/gpt2/tarot_meanings.txt'

with open(meanings_file) as f:
    meanings = f.read().split('\n')

print(f'Number of meanings in corpus: {len(meanings)}')

Number of meanings in corpus: 266


In [4]:
# sample of output
meanings[:3]

['You have done your fair share of work and still need to be paid some respect, but now is not the time to act in an aggressive way.',
 'Try to keep your plans light over the next few days, especially if they are of a financial nature.',
 'Keep your ideas to yourself for a while and don’t let anyone in on a secret that could be used against you.']

### Loading the pre-trained word embeddings model

In [5]:
model = api.load('glove-wiki-gigaword-200')



A quick test of the model...

In [6]:
model.vector_size

200

In [7]:
model.most_similar('tarot')


[('divination', 0.48255202174186707),
 ('cards', 0.48239564895629883),
 ('smartcard', 0.4747677445411682),
 ('arcana', 0.46141287684440613),
 ('fortunetelling', 0.4498494267463684),
 ('decks', 0.4484478831291199),
 ('numerology', 0.4411526620388031),
 ('card', 0.4390334486961365),
 ('thoth', 0.4360690116882324),
 ('astrology', 0.4214267134666443)]

## Grouping similar documents

This next section will flow as follows: 

1. Text Preprocessing
    * Lowercase, tokenize, and remove stopwords from each document
    * Check to make sure the word exists in the pre-trained GloVe model's vocab -- remove any words that are not present
2. Find a document vector by averaging all the word vectors in a document
3. Create a cosine similarity matrix of vectorized documents
4. Group top *n* documents together by similarity value
5. Repeat for questions
6. Repeat again to group questions with tarot interpretations
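
Steps 2 and 3 can be sketched on toy data before touching the real model. This is only an illustration under stated assumptions: the three-dimensional vectors in `toy_vectors` are made-up numbers rather than GloVe embeddings, and `average_vector` and `cosine` are hypothetical helper names written in plain Python so the example runs without the notebook's dependencies.

```python
import math

# Toy 3-dimensional "embeddings" -- made-up numbers, not real GloVe vectors.
toy_vectors = {
    "trust":  [0.9, 0.1, 0.0],
    "friend": [0.8, 0.2, 0.1],
    "money":  [0.0, 0.9, 0.3],
    "plans":  [0.1, 0.8, 0.4],
}

def average_vector(tokens, vectors):
    """Average the vectors of the tokens that exist in the vocabulary."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("no token in this document has a vector")
    dim = len(known[0])
    return [sum(vec[k] for vec in known) / len(known) for k in range(dim)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# The last document contains one out-of-vocabulary token, which is ignored.
docs = [["trust", "friend"], ["money", "plans"], ["friend", "unknownword"]]
doc_vecs = [average_vector(doc, toy_vectors) for doc in docs]

# Pairwise similarity matrix: symmetric, with 1.0 on the diagonal.
sim_matrix = [[cosine(u, v) for v in doc_vecs] for u in doc_vecs]
```

In this toy setup the first and third documents come out far more similar to each other than either is to the second, which is the behaviour the grouping step relies on.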

In [8]:
# NLTK Imports
import nltk
nltk.download('punkt')

from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

from nltk.tokenize import word_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [9]:
# preprocessing -- lowercase, remove stopwords, keep only alphabetic tokens
def preprocess(text):
    text = text.lower()
    doc = word_tokenize(text)
    doc = [word for word in doc if word not in stop_words]
    doc = [word for word in doc if word.isalpha()]
    return doc

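# Function to help drop documents that have no word vectors in the embedding model
# (model.vocab is the gensim 3.x API; gensim 4.x exposes model.key_to_index instead)
def has_vector_representation(model, doc):
    return not all(word not in model.vocab for word in doc)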

# filter out documents
def filter_docs(corpus, texts, condition_on_doc):
    number_of_docs = len(corpus)
    if texts is not None:
        texts = [text for (text, doc) in zip(texts, corpus) if condition_on_doc(doc)]
    corpus = [doc for doc in corpus if condition_on_doc(doc)] 
    print(f'{number_of_docs - len(corpus)} docs removed')
    return (corpus, texts) 

# preprocess the corpus

def preprocess_corpus(text_list):
    corpus = [preprocess(text) for text in text_list]
    corpus, text_list = filter_docs(corpus, text_list, lambda doc: has_vector_representation(model, doc))
    corpus, text_list = filter_docs(corpus, text_list, lambda doc: (len(doc) != 0))
    return corpus, text_list    
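
To see what the first filter pass keeps and drops, here is a small sketch with a stand-in vocabulary. The set `toy_vocab` and the token lists are invented for illustration; the helper copies the logic of `has_vector_representation` but takes a plain set instead of the gensim model.

```python
# Stand-in vocabulary; in the notebook this role is played by the GloVe model.
toy_vocab = {"trust", "people", "tough"}

def has_vector_representation(vocab, doc):
    # Same logic as the notebook's helper, with a plain set instead of a model.
    return not all(word not in vocab for word in doc)

docs = [
    ["trust", "people"],   # every token known
    ["tough", "zzzz"],     # one known token is enough to keep the doc
    ["zzzz", "qqqq"],      # no known tokens -> dropped
    [],                    # empty after preprocessing -> also dropped
]

kept = [doc for doc in docs if has_vector_representation(toy_vocab, doc)]
removed = len(docs) - len(kept)
```

Since `all()` over an empty list is `True`, an empty document already fails this check. That is consistent with the second filter pass (the `len(doc) != 0` condition) reporting `0 docs removed` every time it runs in this notebook.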

### Processing & grouping tarot interpretations

In [10]:
corpus, text_list = preprocess_corpus(meanings)

0 docs removed
0 docs removed


In [11]:
# Function to take the average of a document's word vectors
def document_vector(model, doc):
    # remove out of vocab words
    doc = [word for word in doc if word in model.vocab]
    return np.mean(model[doc], axis=0)


def average_doc_vec(corpus):
    x = []
    for doc in corpus: # append the vector for each document
        x.append(document_vector(model, doc))
        
    X = np.array(x) # list to array
    return X    

# Function to create cosine similarity matrix from averaged doc vectors
from sklearn.metrics.pairwise import cosine_similarity

def make_cosine_similarity_matrix(corpus, text_list):
    dists = cosine_similarity(average_doc_vec(corpus))
    dists_df = pd.DataFrame(data=dists, index=text_list, columns=text_list)
    return dists_df

In [12]:
# Create cosine similarity matrix
cosine_sim_df = make_cosine_similarity_matrix(corpus, text_list)

# Sample output
cosine_sim_df.iloc[:5, :5]

| | You have done your fair share of work and still need to be paid some respect, but now is not the time to act in an aggressive way. | Try to keep your plans light over the next few days, especially if they are of a financial nature. | Keep your ideas to yourself for a while and don’t let anyone in on a secret that could be used against you. | No matter how badly the first attempt failed you would still be amazed how easily it worked in the end. | Your first task today and over the weekend is to get them to see sense, not just in their words but in your actions. |
|---|---|---|---|---|---|
| You have done your fair share of work and still need to be paid some respect, but now is not the time to act in an aggressive way. | 1.0 | 0.88593 | 0.850314 | 0.87374 | 0.904928 |
| Try to keep your plans light over the next few days, especially if they are of a financial nature. | 0.88593 | 1.0 | 0.857395 | 0.862742 | 0.890857 |
| Keep your ideas to yourself for a while and don’t let anyone in on a secret that could be used against you. | 0.850314 | 0.857395 | 1.0 | 0.843146 | 0.844295 |
| No matter how badly the first attempt failed you would still be amazed how easily it worked in the end. | 0.87374 | 0.862742 | 0.843146 | 1.0 | 0.867176 |
| Your first task today and over the weekend is to get them to see sense, not just in their words but in your actions. | 0.904928 | 0.890857 | 0.844295 | 0.867176 | 1.0 |


Grouping similar documents using the cosine similarity matrix

In [16]:
def find_similar_meanings(idx, top_n, cosine_sim_df):
    # Similarities between document `idx` and every document in the corpus
    docs = cosine_sim_df.iloc[idx, :]
    # Position 0 is normally the document itself (similarity 1.0), so skip it
    # and keep the next top_n entries
    top_matches = docs.sort_values(ascending=False)[1:(top_n+1)]
    return list(top_matches.index)


def group_similar_docs(cosine_sim_df, text_list, top_n):
    
    grouped_docs = []

    for i in range(len(text_list)):
        group = find_similar_meanings(i, top_n, cosine_sim_df)
        group.insert(0, text_list[i])
        grouped_docs.append(group)

    return grouped_docs
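
The `[1:(top_n+1)]` slice assumes the document itself always sorts into first place. If two texts are identical, or floating-point rounding leaves a tie, that is not guaranteed. A minimal sketch of an alternative that excludes the document by position instead is shown here; `top_n_neighbours` is a hypothetical helper working on a plain list of scores, not part of the notebook.

```python
def top_n_neighbours(sim_row, own_idx, top_n):
    """Return indices of the top_n most similar items, skipping own_idx."""
    candidates = [(score, j) for j, score in enumerate(sim_row) if j != own_idx]
    # Sort by descending score; ties fall back to the lower index.
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return [j for _, j in candidates[:top_n]]

# Row 0 of a made-up similarity matrix; item 3 is an exact duplicate of item 0.
row = [1.0, 0.62, 0.91, 1.0, 0.40]
neighbours = top_n_neighbours(row, own_idx=0, top_n=3)
```

Here the duplicate at index 3 is returned as the closest neighbour, and index 0 never appears in its own group regardless of how ties are ordered.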

In [19]:
# Compiled function to be able to do the above in one line

def make_groupings(doc_list, n_items):
    print('Grouping documents...\n')

    corpus, text_list = preprocess_corpus(doc_list)
    cosine_sim_df = make_cosine_similarity_matrix(corpus, text_list)
    grouped_docs_list = group_similar_docs(cosine_sim_df, text_list, n_items)

    print('\nGrouping complete!\n')
    return grouped_docs_list

In [20]:
# Make groups of 4 meanings:
# n_items=3 finds the 3 most similar documents for each document
meanings_grouped = make_groupings(meanings, 3)

Grouping documents...

0 docs removed
0 docs removed

Grouping complete!



In [30]:
print(f'Number of grouped meanings: {len(meanings_grouped)}')

Number of grouped meanings: 266


### Processing & grouping tarot questions

In [21]:
# Importing questions from previous notebook
questions_file = '/content/drive/MyDrive/gpt2/questions.txt'

with open(questions_file) as f:
    questions = f.read().split('\n')

print(f'Number of questions in corpus: {len(questions)}')

Number of questions in corpus: 149


In [22]:
# Sample output
questions[:3]

['Who is it worth fighting for?',
 'Who has not been found?',
 'Why are you really that good at being tough?']

In [23]:
# Make groups of 3 questions: each question plus its 2 most similar questions
questions_grouped = make_groupings(questions, 2)

Grouping documents...

2 docs removed
0 docs removed

Grouping complete!



In [32]:
print(f'Number of grouped questions: {len(questions_grouped)}')

Number of grouped questions: 147


## Grouping Questions and Interpretations

At this point, I have a set of questions in groups of 3 and interpretations in groups of 4. Now I want to group the sets of questions and interpretations together by the cosine similarity score of their averaged document vectors. 

Next, I'll handpick several bundles to match with titles and finally annotate the cards.

In [24]:
def match_meanings_with_questions(questions_grouped, meanings_grouped):

    # Convert list of questions/meanings to single doc
    questions = [' '.join(q) for q in questions_grouped]
    meanings = [' '.join(m) for m in meanings_grouped]

    # preprocess text
    q_corpus, q_text = preprocess_corpus(questions)
    m_corpus, m_text = preprocess_corpus(meanings)

    # get average doc vectors for each
    q_x, m_x = average_doc_vec(q_corpus), average_doc_vec(m_corpus)

    # list of best matches by questions
    matches = []

    for i in range(len(q_text)):
        
        # list of results from passing through questions
        q_sims = []

        q = q_x[i].reshape(1, q_x.shape[1])
        q_txt = q_text[i]

        for j in range(len(m_text)):
            
            m = m_x[j].reshape(1, m_x.shape[1])
            m_txt = m_text[j]

            # cosine similarity calculation
            sim = cosine_similarity(q, m)

            # append results to question loop
            result = (sim[0][0], q_txt, m_txt)
            q_sims.append(result)

        # take best match from meanings and append to matches
        best_result = max(q_sims)
        matches.append(best_result)

    return list(set(matches))            
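
The nested loops above compute one question/meaning pair at a time. The same best-match search can be sketched as a single matrix operation. This is an illustration only: `best_matches`, `q_vecs` and `m_vecs` are hypothetical names, and the two-dimensional vectors are made up to stand in for the averaged document vectors.

```python
import numpy as np

def best_matches(q_vecs, m_vecs):
    """For each question vector, return (best meaning index, cosine score)."""
    # Normalise rows to unit length so a dot product equals cosine similarity.
    q_unit = q_vecs / np.linalg.norm(q_vecs, axis=1, keepdims=True)
    m_unit = m_vecs / np.linalg.norm(m_vecs, axis=1, keepdims=True)
    sims = q_unit @ m_unit.T            # shape: (n_questions, n_meanings)
    best = sims.argmax(axis=1)          # best meaning per question
    return [(int(j), float(sims[i, j])) for i, j in enumerate(best)]

# Made-up 2-D vectors standing in for the averaged document vectors.
q_vecs = np.array([[1.0, 0.0], [0.0, 2.0]])
m_vecs = np.array([[0.0, 1.0], [3.0, 0.1], [1.0, 1.0]])
result = best_matches(q_vecs, m_vecs)
```

As far as I know, calling scikit-learn's `cosine_similarity(q_x, m_x)` with both arrays returns the same questions-by-meanings matrix in one call, so the normalisation step here is only needed to keep the sketch free of that dependency.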
        

In [25]:
matched_groupings = match_meanings_with_questions(questions_grouped, meanings_grouped)

0 docs removed
0 docs removed
0 docs removed
0 docs removed


In [27]:
print(f'Number of groups: {len(matched_groupings)}')

Number of groups: 147


In [28]:
# Create groups to review -- sorted by top similarity scores.

grouped = sorted(matched_groupings, key = lambda tup: tup[0], reverse=True)

for i, group in enumerate(grouped):
    print(f'Group {i}')
    print(f'Score: {group[0]:.4f}')
    print(f'Questions: {group[1]}')
    print(f'Meanings: {group[2]}\n')

Group 0
Score: 0.9900
Questions: Where should we, as rational people, expect other people to be happy in the same way? Where can I really trust people who make the world a better place? Is there something you need to say, something you would like to let others know what you think and how you feel?
Meanings: Let the important people know they can trust you. You should know by now that most important people in the world would like to help you. Don’t worry if you do something today that you think would be of benefit to other people. For now though you need to go right the other way and do something for people you are not really good at.

Group 1
Score: 0.9865
Questions: Where can I really trust people who make the world a better place? Is there something you need to say, something you would like to let others know what you think and how you feel? How might you make of the fact that they are really your friends?
Meanings: Let the important people know they can trust you. You should know by

Lots of repeats in here -- I'll snag a few that I like and repeat the selection process. Then I'll just edit my final choices if it turns out something is being repeated too frequently.
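
I picked the bundles by hand, but the repeats could also be quantified first. The sketch below is not something run in this notebook: it uses made-up `(score, questions, meanings)` tuples in the same shape as `grouped`, counts how often each meanings bundle shows up, and keeps only the highest-scoring match per bundle.

```python
from collections import Counter

# Made-up (score, questions, meanings) tuples in the same shape as `grouped`.
toy_grouped = [
    (0.99, "Q-bundle A", "M-bundle 1"),
    (0.98, "Q-bundle B", "M-bundle 1"),
    (0.97, "Q-bundle C", "M-bundle 2"),
    (0.95, "Q-bundle D", "M-bundle 1"),
]

# Count how many question bundles were matched to each meanings bundle.
meaning_counts = Counter(meanings for _, _, meanings in toy_grouped)

# Keep only the highest-scoring match for each meanings bundle.
seen = set()
deduplicated = []
for score, questions, meanings in sorted(toy_grouped, reverse=True):
    if meanings not in seen:
        seen.add(meanings)
        deduplicated.append((score, questions, meanings))
```

This only removes exact repeats of a whole meanings bundle. Bundles that overlap in some sentences but not all would still need a manual look.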

In [37]:
keep_idx = [0, 4, 6, 10, 15, 26, 32, 33, 37, 44, 
            48, 51, 57, 59, 61, 66, 67, 70, 73, 79,
            81, 82, 84, 89, 91, 97, 98, 101, 104, 105, 
            106, 108, 109, 117, 130, 142, 146]

bundles = [grouped[i] for i in keep_idx]

In [38]:
from nltk.tokenize import sent_tokenize
bundles = [(sent_tokenize(item[1]), sent_tokenize(item[2])) for item in bundles]

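In [35]:
# Stale cell from an earlier run (note the lower execution count). After In [38] each
# bundle is already a (questions, meanings) pair, so re-running this line would raise
# an IndexError on item[2]. Kept commented out for reference.
# bundles = [(item[1], item[2]) for item in bundles]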

In [40]:
len(bundles)

37

In [47]:
for i, bundle in enumerate(bundles):
    print(f'Group {i}')
    print('Questions:')
    for q in bundle[0]:
        print('\t', q)
    print('Meanings:')
    for m in bundle[1]:
        print('\t', m)
    print('')        

Group 0
Questions:
	 Where should we, as rational people, expect other people to be happy in the same way?
	 Where can I really trust people who make the world a better place?
	 Is there something you need to say, something you would like to let others know what you think and how you feel?
Meanings:
	 Let the important people know they can trust you.
	 You should know by now that most important people in the world would like to help you.
	 Don’t worry if you do something today that you think would be of benefit to other people.
	 For now though you need to go right the other way and do something for people you are not really good at.

Group 1
Questions:
	 Why did you get what you wanted?
	 Is there something you need to say, something you would like to let others know what you think and how you feel?
	 Why is it possible that someone is trying to get close to you?
Meanings:
	 Don’t worry if a friend or colleague feels the need to ask you to back out of some sort of financial arrangemen

In [49]:
keep_final = [2, 4, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 27, 28, 32, 36]

final = [bundles[i] for i in keep_final]

In [51]:
for i, bundle in enumerate(final):
    print(f'Group {i}')
    print('Questions:')
    for q in bundle[0]:
        print('\t', q)
    print('Meanings:')
    for m in bundle[1]:
        print('\t', m)
    print('') 

Group 0
Questions:
	 How might their attitude be if they were less likely to come to their senses?
	 How might it be better to leave things as they are for a while?
	 Is there something you need to say, something you would like to let others know what you think and how you feel?
Meanings:
	 You may be tempted to splash out on something but you are advised to look at it in a light-hearted manner – it will certainly do you good.
	 You may not think much of yourself but now is a good time to look more closely at what you are doing – maybe even thinking!
	 Some people seem to expect you to be the kind of person who always wants to be noticed but there is something you must learn about yourself.
	 The less you say too much about a loved one’s behavior the more likely it is they will try to make you feel bad about yourself.

Group 1
Questions:
	 Why did you know the answer already?
	 Is there something you want to say or do but don’t quite know what it is?
	 Is there something you need to sa

Now to handpick the questions and interpretations from these bundles as the final texts for the cards.

In [82]:
# Selected Questions and Interpretations from final output

c00 = [final[0][0][1:], [final[0][1][0]]]
c01 = [final[1][0][:2], final[1][1][:2]]
c02 = [final[2][0], final[2][1][:2]]
c03 = [[final[3][0][0]], [final[3][1][0], final[3][1][1], final[3][1][3]]]
c04 = [final[4][0][:2], [final[4][1][1], final[4][1][3]]]
c05 = [[final[5][0][0], final[5][0][2]], [final[5][1][0], final[5][1][2], final[5][1][3]]]
c06 = [[final[6][0][0], final[6][0][2]], final[6][1][:3]]
c07 = [[final[7][0][0]], [final[7][1][0]]]
c08 = [final[8][0], [final[8][1][3]]]
c09 = [[final[9][0][0]], [final[9][1][0], final[9][1][2]]]
c10 = [[final[10][0][0], final[10][0][2]], final[10][1][2:3]]
c11 = [final[11][0][:2], final[11][1]]
c12 = [[final[12][0][2]], final[12][1][:2]]
c13 = [[final[13][0][0], final[13][0][2]], [final[13][1][0], final[13][1][2]]]
c14 = [final[14][0][:2], [final[14][1][0], final[14][1][3]]]
c15 = [final[15][0], [final[15][1][0], final[15][1][1], final[15][1][3]]]
c16 = [final[16][0][:1], [final[16][1][0], final[16][1][2], final[16][1][3]]]
c17 = [[final[17][0][0], final[17][0][2]], [final[17][1][0], final[17][1][1], final[17][1][3]]]
c18 = [[final[18][0][0], final[18][0][2]], final[18][1][:2]]
c19 = [[final[19][0][0]], final[19][1]]
c20 = [final[20][0][:2], [final[20][1][0], final[20][1][2]]]
c21 = [final[21][0][:2], [final[21][1][0], final[21][1][2]]]

cards = [c00, c01, c02, c03, c04, c05, c06, c07, c08, c09, c10, 
         c11, c12, c13, c14, c15, c16, c17, c18, c19, c20, c21]


In [86]:
for i, card in enumerate(cards):
    print(f'Card {i}')
    print('---------')
    print('Questions')
    for q in card[0]:
        print('* ', q)
    print('Meanings')
    for m in card[1]:
        print('* ', m)
    print('')                      

Card 0
---------
Questions
*  How might it be better to leave things as they are for a while?
*  Is there something you need to say, something you would like to let others know what you think and how you feel?
Meanings
*  You may be tempted to splash out on something but you are advised to look at it in a light-hearted manner – it will certainly do you good.

Card 1
---------
Questions
*  Why did you know the answer already?
*  Is there something you want to say or do but don’t quite know what it is?
Meanings
*  You know what needs to be done.
*  Make sure you know what is going on before you speak.

Card 2
---------
Questions
*  Who will make you smile?
*  Who will go out of your way to make it seem as if everyone is as stupid as you?
*  How might you make of the fact that they are really your friends?
Meanings
*  Try not to overreact to someone who annoys you in any way.
*  If you say the wrong thing to someone today you could make the mistake of saying something that turns them agai

## Matching Names with Annotations

In a separate notebook, I randomly generated 78 possible names for the tarot deck (the 78 figure comes from the number of cards in a tarot deck). I'll manually match these options with my selected annotations for my final set of card names and descriptions.

In [107]:
# Import the generated card names from a separate notebook

name_list_file = '/content/drive/MyDrive/gpt2/name_list.txt'

with open(name_list_file) as f:
    name_list = f.read().split('\n')

print(f'Number of possible names: {len(name_list)}')

Number of possible names: 78


In [113]:
for name in name_list:
    if name[:3] == 'the':
        print(name.title())
    else:
        print("The", name.title())

The Remorseful
The Selfishness
The Wither
The Invigorated
The Heroic
The Hexproof
The Riot
The Suffocated
The People 
The Graft
The Furious
The Nervous
The Persist
The Law
The Divine Madman
The Great Invigorating 
The Cohort
The Fateseal
The Pissed
The Spectacle
The Converge
The Ungrounded
The Beginning
The Perplexed
The Fluttery
The Proliferate
The Bushido
The Great Accumulating 
The Encouraged
The Truth
The Cycling
The Small 
The Sexual Awakening
The Great Exceeding 
The Concording 
The Grouchy
The Toughness
The Sore
The Loyalty
The Realism
The Frightened
The Victorious Hero
The Contracted
The Mindset
The On Edge
The Assumptions
The Planeswalk
The Transform
The Student
The Arguing 
The Reinforce
The Fullness
The Crisis
The Provoke
The Success
The Skill
The Karma
The Listless
The Union
The Arousing
The Polarising Opposition
The Unleash
The Useless
The Humbling 
The Aura Swap
The Will Of The Council
The Heartbroken
The Renewed
The Completion
The Trusting
The Soulshift
The Fortify
The S
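
Some names in the output above carry trailing whitespace (for example `The People `), and the `name[:3] == 'the'` check would also match a word that merely starts with those letters. A hedged sketch of a stricter formatter follows; `format_card_name` and the example inputs are hypothetical, not taken from the name list.

```python
def format_card_name(raw_name):
    """Title-case a name and make sure it starts with 'The'."""
    name = raw_name.strip()              # drop stray leading/trailing spaces
    words = name.split()
    if words and words[0].lower() == "the":
        words = words[1:]                # avoid producing "The The ..."
    return "The " + " ".join(word.capitalize() for word in words)

# Made-up inputs that mimic the quirks visible in the output above.
examples = ["people ", "the divine madman", "theory", "on edge"]
formatted = [format_card_name(name) for name in examples]
```

Stripping the names once at this point would also keep the dictionary keys in the next cell free of accidental trailing spaces.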

In [159]:
card_dict = {
    'The Ungrounded': c00,
    'The Assumptions': c01,
    'The Provoke': c02,
    'The Useless': c03,
    'The Realism': c04,     
    'The On Edge': c05,
    'The Suffocated': c06,
    'The Release': c07,
    'The Arguing': c08,
    'The Great Invigorating': c09,
    'The Proliferate': c10,
    'The Renewed': c11,
    'The Soulshift': c12,
    'The Spectacle': c13,
    'The Aura Swap': c14,
    'The Union': c15,
    'The Persist': c16,
    'The Perplexed': c17,
    'The Fortify': c18,
    'The Great Exceeding': c19,
    'The Sexual Awakening': c20,
    'The Hexproof': c21,
}

Finally, names and annotations!

In [166]:
for name, annotation in card_dict.items():
    print(name)
    print('--------------------')
    print('Questions')
    for q in annotation[0]:
        print('* ', q)
    print('')
    print('Interpretations')
    for i in annotation[1]:
        print('* ', i)
    print('\n * * * \n')

The Ungrounded
--------------------
Questions
*  How might it be better to leave things as they are for a while?
*  Is there something you need to say, something you would like to let others know what you think and how you feel?

Interpretations
*  You may be tempted to splash out on something but you are advised to look at it in a light-hearted manner – it will certainly do you good.

 * * * 

The Assumptions
--------------------
Questions
*  Why did you know the answer already?
*  Is there something you want to say or do but don’t quite know what it is?

Interpretations
*  You know what needs to be done.
*  Make sure you know what is going on before you speak.

 * * * 

The Provoke
--------------------
Questions
*  Who will make you smile?
*  Who will go out of your way to make it seem as if everyone is as stupid as you?
*  How might you make of the fact that they are really your friends?

Interpretations
*  Try not to overreact to someone who annoys you in any way.
*  If you say the

Saving the cards to match with images!

In [175]:
import json

with open('/content/drive/MyDrive/gpt2/named_cards.json', 'w') as outfile:
    json.dump(card_dict, outfile)
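
Before matching the cards with images, it can be worth checking that the file reads back unchanged. The sketch below is a round trip on a made-up card in the same `{name: [questions, meanings]}` shape as `card_dict`. It writes to a temporary directory rather than Drive, and the `ensure_ascii=False` and `indent=2` options are my own additions to keep the curly apostrophes and dashes readable in the file.

```python
import json
import os
import tempfile

# Made-up card in the same {name: [questions, meanings]} shape as card_dict.
toy_cards = {
    "The Example": [
        ["Is there something you don’t quite know?"],
        ["Look closely – it will do you good."],
    ],
}

path = os.path.join(tempfile.mkdtemp(), "named_cards.json")

# ensure_ascii=False keeps ’ and – readable instead of \uXXXX escapes.
with open(path, "w", encoding="utf-8") as outfile:
    json.dump(toy_cards, outfile, ensure_ascii=False, indent=2)

# Read the file back and compare against the original structure.
with open(path, encoding="utf-8") as infile:
    loaded = json.load(infile)
```

Lists of strings survive the round trip as-is, so `loaded` should compare equal to the original dictionary.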