<div style="display: block; width: 100%; height: 120px;">

<p style="float: left;">
    <span style="font-weight: bold; line-height: 24px; font-size: 16px;">
        DIGHUM160 - Critical Digital Humanities
        <br />
        Digital Hermeneutics 2019
    </span>
    <br >
    <span style="line-height: 22px; font-size: 14px; margin-top: 10px;">
        Week 4-3: USING TF-IDF <br />
        Created by Tom van Nuenen (tom.van_nuenen@kcl.ac.uk)
    </span>
</p>

<img style="width: 240px; height: 120px; float: right; margin: 0 0 0 0;" src="http://www.merritt.edu/wp/histotech/wp-content/uploads/sites/275/2018/08/berkeley-logo.jpg" />
</div>

# Using tf-idf

Today, we'll be using tf-idf to compare different related subreddits, in order to find the most distinctive words for the discourse community we're interested in. We'll also use tf-idf to find similar posts to ones we're interested in.

**After completing this notebook, you will be able to:**
- Understand how tf-idf can be used to compare related datasets
- Use most-distinctive words to aid you in your close reading

Make sure the files "trp-submissions.csv", "seduction-submissions.csv", "dating_advice-submissions.csv", "mgtow-submissions.csv", and "mensrights-submissions.csv" are in the same folder as this notebook.

In [2]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import pandas as pd
import os
import pickle
import re 
import string
import time
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

## Test using toy dataset

Let's try our idea with a toy dataset. Here we have six documents that all mention Python, but each uses the word in a different sense.

In [4]:
document1 = """Python is a 2000 made-for-TV horror movie directed by Richard
Clabaugh. The film features several cult favorite actors, including William
Zabka of The Karate Kid fame, Wil Wheaton, Casper Van Dien, Jenny McCarthy,
Keith Coogan, Robert Englund (best known for his role as Freddy Krueger in the
A Nightmare on Elm Street series of films), Dana Barron, David Bowe, and Sean
Whalen."""

document2 = """Python, from the Greek word (πύθων/πύθωνας), is a genus of
nonvenomous pythons[2] found in Africa and Asia. Currently, 7 species are
recognised.[2] A member of this genus, P. reticulatus, is among the longest
snakes known."""

document3 = """Monty Python (also collectively known as the Pythons) are a British 
surreal comedy group who created the sketch comedy television show Monty Python's 
Flying Circus, which first aired on the BBC in 1969. Forty-five episodes were made 
over four series."""

document4 = """Python is an interpreted, high-level, general-purpose programming language. 
Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes 
code readability with its notable use of significant whitespace. Its language constructs and 
object-oriented approach aim to help programmers write clear, logical code for small and 
large-scale projects."""

document5 = """The Colt Python is a .357 Magnum caliber revolver formerly
manufactured by Colt's Manufacturing Company of Hartford, Connecticut.
It is sometimes referred to as a "Combat Magnum".[1] It was first introduced
in 1955, the same year as Smith & Wesson's M29 .44 Magnum. The now discontinued
Colt Python targeted the premium revolver market segment."""

document6 = """The Pythonidae, commonly known simply as pythons, from the Greek word python 
(πυθων), are a family of nonvenomous snakes found in Africa, Asia, and Australia. 
Among its members are some of the largest snakes in the world. Eight genera and 31
species are currently recognized."""

testList = [document1, document2, document3, document4, document5, document6]

First, we create a matrix of word counts, and transform them into tf-idf values using scikit-learn. 

In [5]:
cv = CountVectorizer(max_df=0.85, decode_error='ignore', stop_words = 'english')
wordCountVector = cv.fit_transform(testList)
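To make the counting step less of a black box, here is a minimal pure-Python sketch of what a count matrix with a `max_df` cut-off looks like. It is not scikit-learn's implementation: `count_terms` and `toy_docs` are hypothetical names, the documents are already tokenized, and stop-word removal is left out. With `max_df=0.85`, a term is dropped when it occurs in more than 85% of the documents, which is why a word like "python" that appears in every document disappears from the vocabulary.

```python
from collections import Counter

def count_terms(docs, max_df=0.85):
    """Build a vocabulary and a count matrix from tokenized documents."""
    n = len(docs)
    # document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    # keep only terms that occur in at most max_df * n documents
    vocab = sorted(t for t in df if df[t] <= max_df * n)
    # one row per document, one column per vocabulary term
    rows = [[Counter(doc)[t] for t in vocab] for doc in docs]
    return vocab, rows

# hypothetical toy documents, already split into tokens
toy_docs = [
    "python snake snake".split(),
    "python comedy group".split(),
    "python language code code".split(),
]
vocab, counts = count_terms(toy_docs)
print(vocab)   # "python" is missing: it occurs in all three documents
print(counts)
```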

In [6]:
tfidfTransformer = TfidfTransformer(smooth_idf=True,use_idf=True)
tfidfTransformer.fit(wordCountVector)

TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)
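The settings printed above (`smooth_idf=True`, `norm='l2'`) correspond to a weighting that can be written out by hand. The sketch below assumes the smoothed formula from the scikit-learn documentation, idf(t) = ln((1 + n) / (1 + df(t))) + 1, multiplies it by the raw term count, and then scales each document vector to unit length. The function name `tfidf` and the toy documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n = len(docs)
    # document frequency of every term
    df = Counter(term for doc in docs for term in set(doc))
    # smoothed inverse document frequency (assumed formula, see lead-in)
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}
    weighted = []
    for doc in docs:
        counts = Counter(doc)
        raw = {t: c * idf[t] for t, c in counts.items()}
        # L2 normalization; fall back to 1.0 for an empty document
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        weighted.append({t: w / norm for t, w in raw.items()})
    return weighted

# hypothetical toy documents, already split into tokens
toy = [
    ["python", "snake", "snake"],
    ["python", "comedy"],
    ["python", "language"],
]
weights = tfidf(toy)
for doc in weights:
    print({t: round(w, 3) for t, w in doc.items()})
```

In every toy document the word that is unique to it outweighs "python", which occurs everywhere and therefore gets the lowest possible idf.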

Next, we need a way to sort the sparse matrix that the TfidfTransformer yields. Don't worry too much about the details for now:

In [7]:
def sortCoo(cooMatrix):
    tuples = zip(cooMatrix.col, cooMatrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)
 
def extractTopN(featureNames, sortedItems, topn=10):
    """get the feature names and tf-idf score of top n items"""
    
    # use only top n items from vector
    sortedItems = sortedItems[:topn]
    scoreVals = []
    featureVals = []
    
    # word index and corresponding tf-idf score
    for idx, score in sortedItems:
        
        #keep track of feature name and its corresponding score
        scoreVals.append(round(score, 3))
        featureVals.append(featureNames[idx])
 
    # build a dict that maps each feature name to its score
    results = {}
    for idx in range(len(featureVals)):
        results[featureVals[idx]] = scoreVals[idx]
    
    return results
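To see what `sortCoo` does without building a real sparse matrix, the sketch below feeds it a stand-in object that only has the two attributes the function reads, `col` (column indices, i.e. word ids) and `data` (the matching scores). The `fakeCoo` object and its values are invented for illustration.

```python
from types import SimpleNamespace

def sortCoo(cooMatrix):
    # pair each column index (word id) with its tf-idf score
    tuples = zip(cooMatrix.col, cooMatrix.data)
    # highest score first; ties are broken by the higher column index
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

# hypothetical stand-in for tfIdfVector.tocoo()
fakeCoo = SimpleNamespace(col=[0, 1, 2, 3], data=[0.2, 0.5, 0.2, 0.1])
ranked = sortCoo(fakeCoo)
print(ranked)  # [(1, 0.5), (2, 0.2), (0, 0.2), (3, 0.1)]
```

`extractTopN` then simply keeps the first `topn` pairs of this list and looks up the word that belongs to each column index.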

In [8]:
# map index to word
# (older scikit-learn versions call this method get_feature_names)
featureNames=cv.get_feature_names_out()
 
# generate tf-idf for the given document
tfIdfVector=tfidfTransformer.transform(cv.transform([document3]))
 
# sort the tf-idf vectors by descending order of scores
sortedItems=sortCoo(tfIdfVector.tocoo())
 
# extract only the top n
keywords=extractTopN(featureNames,sortedItems,5)
 
# print the results
print("\n===Keywords===")
for k in keywords:
    print(k,keywords[k])


===Keywords===
monty 0.425
comedy 0.425
television 0.212
surreal 0.212
sketch 0.212


Looks like it works!

# Load Reddit documents 

We'll start small by importing only 100 posts of each subreddit. It's usually good to start small and see if everything works as expected, then scale up.

In [3]:
trp = pd.read_csv("trp-submissions.csv", lineterminator="\n")

In [None]:
sed = pd.read_csv("seduction-submissions.csv", lineterminator="\n")

In [None]:
mgtow = pd.read_csv("mgtow-submissions.csv", lineterminator="\n")

In [None]:
# Let's just get 100 posts each, based on highest score
trp = trp.sort_values(by=['score'], ascending=False)[:100]
sed = sed.sort_values(by=['score'], ascending=False)[:100]
mgtow = mgtow.sort_values(by=['score'], ascending=False)[:100]

## Tokenizing & POS tagging

Let's use a tokenizer function that also extracts nouns and lemmatizes.

In [None]:
def posFilter(df, name):
    """POS-tags each post in df and appends its lemmatized nouns to <name>-nouns.txt"""
    # drop missing, removed and deleted posts
    clean = df.dropna(subset=['selftext'])
    clean = clean[~clean['selftext'].isin(['[removed]', '[deleted]'])]
    total = len(clean)
    counter = 0
    # mode 'a' appends, so delete the file before re-running this function
    with open(name + "-nouns.txt", 'a', encoding='utf8') as f:
        for text in clean['selftext']:
            text = text.lower()
            # remove tags
            text = re.sub("</?.*?>", " <> ", text)
            # remove special characters and digits
            text = re.sub("(\\d|\\W)+", " ", text)
            # tokenize
            tokens = word_tokenize(text)
            # leave only nouns (the "NN" tag marks singular common nouns)
            pos = nltk.pos_tag(tokens)
            nouns = [tup[0] for tup in pos if tup[1] == "NN"]
            # lemmatize
            lemmas = ' '.join([wordnet_lemmatizer.lemmatize(token) for token in nouns])
            # the trailing space keeps the last noun of one post apart from the first noun of the next
            f.write(lemmas + " ")
            counter += 1
            if counter % 100 == 0:
                print("Saved " + str(counter) + " out of " + str(total) + " entries")

In [None]:
posFilter(trp, "trp")

In [None]:
posFilter(sed, "sed")

In [None]:
posFilter(mgtow, "mgtow")

## Using TF-IDF to find distinctive words

We'll implement scikit-learn's tf-idf functionality to find distinctive words for each document (i.e. subreddit). So we'll treat each subreddit we feed into the `TfidfTransformer` as a document.

In [None]:
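# read the files with the same encoding they were written in
with open("trp-nouns.txt", encoding='utf8') as f:
    trpNouns = f.read()
with open("sed-nouns.txt", encoding='utf8') as f:
    sedNouns = f.read()
with open("mgtow-nouns.txt", encoding='utf8') as f:
    mgtowNouns = f.read()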

In [None]:
redditList = [trpNouns, sedNouns, mgtowNouns]

Now it's up to you. Initialize the `CountVectorizer` and `TfidfTransformer`, and repeat what we did with the test, but this time with this data!

In [None]:
# Your code here








## Using TF-IDF to find similar documents

We can also use tf-idf to work out the similarity between any pair of documents. So given one post or comment, we could see which posts or comments are most similar. This can be useful if you're trying to find other examples of a pattern you have found and want to explore further.

This time, our "documents" will not be entire subreddits, but individual posts (submissions) within one subreddit. We'll run the vectorizer on the raw post text, without the noun filtering and lemmatizing, so that the results remain readable.

In [10]:
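# min_df=1 keeps every term that occurs in at least one post
# (recent scikit-learn versions may reject the integer 0 used originally)
tf = TfidfVectorizer(analyzer='word', ngram_range=(1,3), min_df=1, stop_words='english')
# fillna('') guards against posts that have no text
wordCountVector = tf.fit_transform(trp['selftext'].fillna(''))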

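The `findSimilar` function defined further down finds similar documents using the tf-idf matrix we just created (stored, despite its name, in `wordCountVector`). It uses scikit-learn's `linear_kernel`, which computes a plain dot product. Because `TfidfVectorizer` scales every row to unit length by default, that dot product equals the cosine similarity between two posts.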
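The link between a linear kernel and cosine similarity can be checked with a few lines of plain Python. This is only a sketch: the helper names (`l2_normalize`, `dot`, `cosine`) and the two vectors are made up, and real tf-idf rows are sparse and much longer.

```python
import math

def dot(a, b):
    """Dot product, which is what a linear kernel computes for two vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit length (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec] if norm else vec

def cosine(a, b):
    """Cosine similarity computed directly from its definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# two hypothetical document vectors
a = [3.0, 0.0, 4.0]
b = [1.0, 2.0, 2.0]

plain = cosine(a, b)
via_dot = dot(l2_normalize(a), l2_normalize(b))
print(plain, via_dot)  # both values agree up to floating-point rounding
```

Once the rows are normalized, the cheaper dot product gives the same ranking as the full cosine formula, which is presumably why the notebook uses `linear_kernel`.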

We'll start by finding a post with a clear topic. This one's on picking up women (surprise...)

In [11]:
# .iloc selects by position, so this is the same post as row 1 of the tf-idf matrix
trp['selftext'].iloc[1]

'**A complete guide to picking up 9s and 10s**  \nToday I want to tell you everything I know about getting the highest calibre girls from cold approach.  \nThis guide will cover: frame control, inner game, and passing tests — which I consider to be the holy trinity of “9 and 10 game”.  \nThis guide will NOT cover: body language, pulling, or handling logistics. Obviously, the latter are extremely important, but they’ve been adequately covered elsewhere, and there just isn’t space to include them here.  \n\n\n&nbsp;\n\n\n**My background**  \nPicked on in school, small and sickly, didn’t have a girlfriend until 18. Was dumped by her and spent the first 2 years of college pretty much celibate.  \nGot into redpill ideas through the old “Citizen Renegade” blog (which is now Heartiste). From there stumbled on RSD’s infield videos.  \nStarted going out and approaching regularly. Approach anxiety and ceaseless rejection for months, but I kept at it. The odd success here and there.  \nAfter abou

Now let's find the top document(s) based on cosine similarity:

In [13]:
def findSimilar(wordCountVector, index, top_n = 1):
    cosineSimilarities = linear_kernel(wordCountVector[index:index+1], wordCountVector).flatten()
    relatedDocsIndices = [i for i in cosineSimilarities.argsort()[::-1] if i != index]
    return [(index, cosineSimilarities[index]) for index in relatedDocsIndices][0:top_n]

In [14]:
for index, score in findSimilar(wordCountVector, 1):
    # findSimilar returns row positions, so look the post up by position
    print(score, trp['selftext'].iloc[index])

0.2274080263007057 **“There is little that can withstand a man who can conquer himself.”** – Louis XIV

&nbsp;

What is a strong frame? In any social interaction, one person is reacting more than the other. One person needs more approval, validation, and acceptance from the other person than vice versa. Whoever is reacting less has the stronger frame. Whoever has the stronger frame has more leverage and social power.


&nbsp;

External factors influence this dynamic.  Research has found, for example, that subordinates tend to match the vocal patterns of their superiors, but superiors do not match the vocal patterns of their subordinates.


&nbsp;

However, ultimately, a truly strong frame doesn’t rely on titles or external status, it is something you carry with you wherever you are and with whomever you are interacting with.


&nbsp;

**What A Strong Frame Isn’t**


&nbsp;

A strong frame isn’t being nice to make a girl like you. It isn’t using lines to appear ‘hard-to-get’ to make her

## Exercise: hypothesis generation using distinctive words

- Using the above method, find out the most-distinctive words for this subreddit. Try it out for the other subreddits, too!
- Think about a hypothesis or research question you could construct about this dataset, based on these distinctive words.
- If you have time, look up comparable subreddits to the one you will be looking at, and try to use the API to download some posts from it to run your own comparative test.

Post your thoughts here: https://docs.google.com/document/d/1sSc55WHOYZZVvfDWgL5SEydTQ9mWgbnDumRxPnE2vhw/edit?usp=sharing