# TF-IDF Weighted Word Embeddings

A common way to represent a document is to average the word embeddings of its tokens. However, this assumes that every token is equally important or relevant to the document. Instead of weighting each token equally, we'll weight each token's embedding by its TF-IDF score in that document.
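In other words, the document vector becomes sum(w_i * v_i) / sum(w_i), where v_i is a token's embedding and w_i its TF-IDF score. As a rough sketch of the difference, here is a toy comparison in plain Python. The two-dimensional vectors, the scores, and the `weighted_average` helper are all made up for illustration; they do not come from spaCy or from the tweets.

```python
def weighted_average(vectors, weights):
    """Return sum(w_i * v_i) / sum(w_i), or a zero vector if the weights sum to 0."""
    dim = len(vectors[0])
    total_weight = sum(weights)
    if total_weight == 0:
        return [0.0] * dim
    return [
        sum(w * v[d] for w, v in zip(weights, vectors)) / total_weight
        for d in range(dim)
    ]


# Hypothetical 2-d embeddings for the tokens "the" and "kindle2"
token_vectors = [[1.0, 0.0], [0.0, 1.0]]

plain = weighted_average(token_vectors, [1.0, 1.0])     # every token counts equally
weighted = weighted_average(token_vectors, [0.1, 0.9])  # made-up TF-IDF scores

print(plain)
print(weighted)
```

With equal weights the result sits halfway between the two tokens; with the made-up TF-IDF scores it is pulled toward the rarer, more informative token.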

In [2]:
import pandas as pd
import spacy

# load spacy en_core_web_md model
nlp = spacy.load("en_core_web_md")

In [4]:
tweets = pd.read_csv("tweets_pandas.csv", encoding="latin1")["tweet"]
tweets = list(tweets.values)
tweets[:5]

['@stellargirl I loooooooovvvvvveee my Kindle2. Not that the DX is cool, but the 2 is fantastic in its own right.',
 'Reading my kindle2...  Love it... Lee childs is good read.',
 'Ok, first assesment of the #kindle2 ...it fucking rocks!!!',
 "@kenburbary You'll love your Kindle2. I've had mine for a few months and never looked back. The new big one is huge! No need for remorse! :)",
 "@mikefish  Fair enough. But i have the Kindle2 and I think it's perfect  :)"]

## Tokenize and Get Average Word Embeddings as Sentence Embeddings (What spaCy Already Does for You)

In [5]:
for tweet in tweets[:5]:  # only the first 5 tweets; running spaCy on all of them takes a long time
    doc = nlp(tweet)  # parse once and reuse the result
    print(doc)
    print(doc.vector[:10])  # first 10 dimensions of the averaged token vectors

@stellargirl I loooooooovvvvvveee my Kindle2. Not that the DX is cool, but the 2 is fantastic in its own right.
[ 0.05852792  0.18400304 -0.16155939 -0.14803118  0.13116674  0.02988832
  0.02218468 -0.18794231 -0.00737952  1.8001341 ]
Reading my kindle2...  Love it... Lee childs is good read.
[-0.02049392  0.27483907 -0.16202064 -0.07091364  0.10088945  0.03455271
  0.10933404 -0.24974571  0.04030607  1.5997756 ]
Ok, first assesment of the #kindle2 ...it fucking rocks!!!
[-0.16872081  0.205246   -0.07031294 -0.10531273 -0.04304165 -0.06574225
  0.0774812  -0.13597105 -0.02863897  1.6457433 ]
@kenburbary You'll love your Kindle2. I've had mine for a few months and never looked back. The new big one is huge! No need for remorse! :)
[-0.03181249  0.16175789 -0.18572699 -0.08025248  0.06440208 -0.05401917
  0.01015068 -0.140078    0.00418231  2.060503  ]
@mikefish  Fair enough. But i have the Kindle2 and I think it's perfect  :)
[-0.04977628  0.20684046 -0.20100766 -0.13358122 -0.00246899 

## Get TF-IDF Weighted Average Word Embeddings (Our Own Algorithm!)

In [156]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(tweets)

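# one row per tweet, one column per vocabulary term
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
tf_idf_lookup_table = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())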
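To get a feel for the numbers stored in this table, here is a rough from-scratch sketch that follows scikit-learn's documented defaults for `TfidfVectorizer`: raw term counts, a smoothed idf of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The toy corpus, the `tfidf` helper, and the whitespace tokenization are assumptions made for illustration; the real vectorizer uses its own token pattern, which drops single-character tokens.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: score} dict per document (hypothetical helper)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = {term: tf * idf[term] for term, tf in counts.items()}
        # L2-normalize the row; fall back to 1.0 for an empty document
        norm = math.sqrt(sum(value * value for value in raw.values())) or 1.0
        rows.append({term: value / norm for term, value in raw.items()})
    return rows


toy_docs = ["love my kindle2", "love my camera", "love this movie"]
scores = tfidf(toy_docs)
print(scores[0])
```

In the first toy document, "kindle2" appears in only one document and so scores higher than "my" (two documents), which in turn scores higher than "love" (all three).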

In [157]:
DOCUMENT_SUM_COLUMN = "DOCUMENT_TF_IDF_SUM"

# terms that have a tf-idf column, collected before the sum column is added;
# a set keeps the per-token membership checks fast
available_tf_idf_scores = {column.lower() for column in tf_idf_lookup_table.columns}

# sum the tf-idf scores for each document
tf_idf_lookup_table[DOCUMENT_SUM_COLUMN] = tf_idf_lookup_table.sum(axis=1)

In [158]:
import numpy as np

tweets_vectors = []
for idx, tweet in enumerate(tweets):  # iterate through each tweet
    tokens = nlp(tweet)  # have spaCy tokenize the tweet text

    # running total of the tf-idf scores used for this document
    total_tf_idf_score_per_document = 0.0

    # running weighted sum of embeddings (300 is the vector width of en_core_web_md)
    running_total_word_embedding = np.zeros(300)
    for token in tokens:  # iterate through each token
        term = token.text.lower()

        # only use tokens that have both a pretrained embedding and a tf-idf score
        if token.has_vector and term in available_tf_idf_scores:
            tf_idf_score = tf_idf_lookup_table.loc[idx, term]
            # print(f"{token} has tf-idf score of {tf_idf_score}")
            running_total_word_embedding += tf_idf_score * token.vector
            total_tf_idf_score_per_document += tf_idf_score

    # divide the weighted sum by the total tf-idf score; if no token qualified,
    # keep the zero vector instead of dividing by zero (which would produce NaNs)
    if total_tf_idf_score_per_document > 0:
        document_embedding = running_total_word_embedding / total_tf_idf_score_per_document
    else:
        document_embedding = running_total_word_embedding
    tweets_vectors.append(document_embedding)


## Check the Similarity of Different Documents

In [159]:
from sklearn.metrics.pairwise import cosine_similarity
similarities = pd.DataFrame(cosine_similarity(tweets_vectors), columns=tweets, index=tweets)
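For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it compares direction rather than magnitude. The sketch below is a plain-Python illustration with made-up vectors, not scikit-learn's implementation. Returning 0.0 for an all-zero vector is a choice made here to avoid dividing by zero.

```python
import math


def cosine(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


print(cosine([1.0, 2.0], [2.0, 4.0]))  # same direction, different length
print(cosine([1.0, 0.0], [0.0, 1.0]))  # perpendicular vectors
```

Vectors pointing the same way score close to 1 regardless of length, and floating-point rounding can leave the result a hair away from exactly 1.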

In [160]:
similarities = similarities.unstack().reset_index()
similarities.columns = ["tweet1", "tweet2", "similarity"]
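# drop self-comparisons, whose similarity is 1 up to floating-point error
similarities = similarities[similarities["similarity"] < 0.9999999999]
# each pair appears twice, as (a, b) and (b, a), with the same score; keep one copy
similarities = similarities.drop_duplicates(subset=["similarity"])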
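One caveat: de-duplicating on the similarity value removes the mirrored (b, a) copy of each pair, but it would also drop two different pairs that happen to share exactly the same score. A possible alternative, sketched below, walks only the upper triangle of the matrix so each unordered pair is visited once. The `unique_pairs` helper and the toy matrix are hypothetical; `matrix` is assumed to be any square, indexable structure such as a nested list or a 2-d array.

```python
from itertools import combinations


def unique_pairs(labels, matrix):
    """Return (label_i, label_j, score) once per unordered pair, highest score first."""
    pairs = [
        (labels[i], labels[j], matrix[i][j])
        for i, j in combinations(range(len(labels)), 2)  # i < j, so no diagonal and no mirrors
    ]
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


toy_labels = ["a", "b", "c"]
toy_matrix = [
    [1.0, 0.8, 0.8],
    [0.8, 1.0, 0.3],
    [0.8, 0.3, 1.0],
]

for first, second, score in unique_pairs(toy_labels, toy_matrix):
    print(first, second, score)
```

Here both pairs with a score of 0.8 are kept, whereas de-duplicating on the score alone would keep only one of them.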

In [163]:
for idx, row in similarities.sort_values(by="similarity", ascending=False).head(50).iterrows():
    print(row["tweet1"])
    print("--" * 10)
    print(row["tweet2"])
    print("\n\n")

@faithbabywear Ooooh, what model are you getting??? I have the 40D and LOVE LOVE LOVE LOVE it!
--------------------
Just got my new toy. Canon 50D. Love love love it!



' Barack Obama shows his funny side " &gt;&gt; http://tr.im/l0gY !! Great speech..
--------------------
I like this guy : ' Barack Obama shows his funny side " &gt;&gt; http://tr.im/l0gY !!



just got back from the movies.  went to see the new night at the museum with rachel.  it was good
--------------------
saw the new Night at the Museum and i loved it. Next is to go see UP in 3D



Just saw the new Night at the Museum movie...it was...okay...lol 7\10
--------------------
saw night at the museum 2 last night.. pretty crazy movie.. but the cast was awesome so it was well worth it. Robin Williams forever!



going to see the new night at the museum  movie with my family oh boy a three year old in the movies fuin
--------------------
just got back from the movies.  went to see the new night at the museum with rachel. 