# Assignment 03 - BOW, Features and TF-IDF


# Exercises

## Get The Data

Before starting this tutorial we will need some text files to process. To make it as easy as possible, the following commands will download and extract the necessary data.


In [None]:
!curl -c ./cookie -s -L "https://drive.google.com/uc?export=download&id=1dNFLyLBK-0RkAu5Pzb_Yn9VghVl1Lxjf" > /dev/null
!curl -Lb ./cookie "https://drive.google.com/uc?export=download&confirm=`awk '/download/ {print $NF}' ./cookie`&id=1dNFLyLBK-0RkAu5Pzb_Yn9VghVl1Lxjf" -o bdata.zip
!unzip -q bdata.zip

## Part A

As we saw last class, dictionaries can be used to count the frequency of some category of words in text. Here we use sentiment (from the [AFINN sentiment lexicon](http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010)) in time series data of tweets as an example.

This notebook uses data from the AFINN sentiment lexicon; for other dictionaries in wide use, see [MPQA](https://mpqa.cs.pitt.edu/lexicons/subj_lexicon/) (free for use with registration) and [LIWC](http://liwc.wpengine.com) (commercial).

In [None]:
import json
import nltk
import pandas as pd #common alias used in Python
import matplotlib
%matplotlib inline

In [None]:
# function to read in json file of tweets and return a list of (date, tokenized text)
def read_tweets_from_json(filename):
    
    tweets=[]
    with open(filename, encoding="utf-8") as file:
        data=json.load(file)
        for tweet in data:
            created_at=tweet["created_at"]
            date = pd.to_datetime(created_at) # parse the timestamp string into a pandas Timestamp
            text=tweet["text"]
            tokens=nltk.casual_tokenize(text)
            tweets.append((date, tokens))
    return tweets

In [None]:
# read in list of (date, tokens) tweets and record whether each tweet contains 
# a (lowercased) term in the argument dictionary (BOW).
# Return as a pandas dataframe for easier slicing/plotting.
def dictionary_document_count(data, dictionary):
    counted=[]
    for date, tokens in data:
        val=0
        for word in tokens:
            if word.lower() in dictionary:
                val=1
        counted.append((date, val))
    df=pd.DataFrame(counted, columns=['date','document frequency'])
    return df

In [None]:
tweets=read_tweets_from_json("./bdata/trump_tweets.json")

In [None]:
immigration_dictionary=set(["wall", "border", "borders", "immigrants","immigration"])

In [None]:
def plot_time(counts):
    
    # for this exercise, let's just keep tweets published after January 1, 2015
    counts=counts[(counts['date'] > '2015-01-01')]
    
    # counts is a pandas dataframe; let's aggregate the counts by month.  
    # Can also aggregate by "D" for day, "W" for week, "Y" for year.
    means=counts.resample('M', on='date').mean() 
    
    means.plot()

1. The AFINN dictionary is a sentiment lexicon, where each word is rated with an integer affect score between -5 (most negative) and 5 (most positive). Write a function `read_AFINN_dictionary` to read in this file and create two dictionaries (Exercises Part A from Lab 03): one for positive terms and one for negative terms.

How did you decide the cutoff point for positive and negative?
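
One way to approach the parsing step is sketched below. It assumes each line of the AFINN file holds a term and an integer score separated by a tab. The helper `split_by_score` and its `cutoff` parameter are hypothetical names that are not part of the assignment, and the sample lines are made up for illustration. Treating scores above `cutoff` as positive and scores below `-cutoff` as negative is only one possible choice.

```python
def split_by_score(lines, cutoff=0):
    """Split 'term<TAB>score' lines into positive and negative term sets.

    `cutoff` is a hypothetical parameter: terms scoring above `cutoff` count
    as positive, terms scoring below `-cutoff` count as negative.
    """
    positive, negative = set(), set()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the last tab so multi-word terms stay intact.
        term, score = line.rsplit("\t", 1)
        score = int(score)
        if score > cutoff:
            positive.add(term)
        elif score < -cutoff:
            negative.add(term)
    return positive, negative


# Made-up sample lines in the assumed format (not taken from the real file).
sample = ["good\t3", "bad\t-3", "fine\t1", "meh\t-1"]
pos, neg = split_by_score(sample, cutoff=2)
print(sorted(pos), sorted(neg))
```

With a stricter cutoff, mildly scored terms fall into neither set, which is one way to trade coverage for precision.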

In [None]:
def read_AFINN_dictionary(filename):
    positive=[]
    negative=[]
    # TODO: read `filename` and add each term to `positive` or `negative`
    
    return set(positive), set(negative)

In [None]:
positive, negative=read_AFINN_dictionary("./bdata/AFINN-111.txt")

2. Create a plot using the negative sentiment dictionary you created.

3. Create a new dictionary of your own for a concept you'd like to measure in `trump_tweets.json` or `aoc_tweets.json`. The dictionary must contain at least 10 terms; you're free to create one for any category (except sentiment!). Create a plot using that dictionary and data.

4. (Extra) Write a function `print_examples(tweets, dictionary)` that, for each term in your dictionary, finds one tweet containing that term and prints it for your inspection. Is each term used in the sense you expected?
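
A possible starting point for this lookup is sketched below on toy data. The helper name `first_example_per_term` is hypothetical, and the toy tweets only mimic the `(date, tokens)` shape returned by `read_tweets_from_json`.

```python
def first_example_per_term(tweets, dictionary):
    """Map each dictionary term to the first tweet (as text) that contains it."""
    examples = {}
    for date, tokens in tweets:
        lowered = {token.lower() for token in tokens}
        for term in dictionary:
            if term not in examples and term in lowered:
                examples[term] = " ".join(tokens)
    return examples


# Toy (date, tokens) pairs standing in for the output of read_tweets_from_json.
toy_tweets = [
    ("2016-01-01", ["Build", "the", "wall"]),
    ("2016-01-02", ["Secure", "the", "border", "and", "the", "wall"]),
]
found = first_example_per_term(toy_tweets, {"wall", "border", "visa"})
for term in sorted(found):
    print(term, "->", found[term])
```

Terms that never occur (here `visa`) are simply absent from the result, which is itself a useful signal when checking a dictionary.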

In [None]:
def print_examples(data, dictionary):
    # TODO: for each term in `dictionary`, print one tweet from `data` that contains it
    pass

In [None]:
print_examples(tweets, immigration_dictionary)

## Part B

Feature engineering for text classification. Consider the data under `data/bdata/text_classification_sample_data` and the **Feature Extraction** block.

Your task is to create two new feature functions (like `political_dictionary_feature` and `unigram_feature` below), and include them in the `build_features` function.

In [None]:
import sys
from collections import Counter
from sklearn import preprocessing
from sklearn import linear_model
import pandas as pd
from scipy import sparse
import numpy as np

In [None]:
def read_data(filename):
    X=[]
    Y=[]
    with open(filename, encoding="utf-8") as file:
        for line in file:
            cols=line.rstrip().split("\t")
            label=cols[0]
            text=cols[1]
            X.append(text)
            Y.append(label)
    return X, Y

In [None]:
# The directory should contain train.tsv, dev.tsv and test.tsv
directory="data/bdata/text_classification_sample_data"

1. Briefly describe your data (including the categories you're predicting)

In [None]:
trainX, trainY=read_data("%s/train.tsv" % directory)
devX, devY=read_data("%s/dev.tsv" % directory)

In [None]:
def majority_class(trainY, devY):
    labelCounts=Counter()
    for label in trainY:
        labelCounts[label]+=1
    majority=labelCounts.most_common(1)[0][0]
    
    correct=0.
    for label in devY:
        if label == majority:
            correct+=1
            
    print("%s\t%.3f" % (majority, correct/len(devY)))

Here are two examples of features we've computed -- one feature class noting the presence of a word in an external dictionary, and one feature class for the word identity (i.e., unigram).  

We'll implement each feature class as a function that takes a single document as input (as a list of tokens) and returns a dict corresponding to the feature we're creating.

In [None]:
# Here's a sample dictionary we can create by inspecting the output of the Mann-Whitney test (in 2.compare/)
dem_dictionary=set(["republican","cut", "opposition", "Trump"])
repub_dictionary=set(["growth","economy", "Hillary"])

def political_dictionary_feature(tokens):
    feats={}
    for word in tokens:
        if word in dem_dictionary:
            feats["word_in_dem_dictionary"]=1
        if word in repub_dictionary:
            feats["word_in_repub_dictionary"]=1
    return feats

In [None]:
def unigram_feature(tokens):
    feats={}
    for word in tokens:
        feats["UNIGRAM_%s" % word]=1
    return feats

2. Add first new feature function here.  Describe your feature and why you think it will help.

In [None]:
def new_feature_class_one(tokens):
    feats={}
    feats["_FILL_IN_FEATURES_HERE_"]=1
    return feats

3. Add second new feature function here. Describe your feature and why you think it will help.

In [None]:
def new_feature_class_two(tokens):
    feats={}
    feats["_FILL_IN_FEATURES_HERE_"]=1
    return feats

This is the main function to aggregate together all of the information from different feature classes.  

Each document has a feature dict (`feats`), and we'll update that dict with the new dict that each separate feature class is returning. *Hint: make sure that the keys each feature function creates are unique (why?)*
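
The hint can be seen in a toy case: `dict.update` overwrites any key that already exists, so two feature functions that reuse a key silently lose information. The two feature functions below are hypothetical and exist only to show the collision.

```python
def length_feature(tokens):
    # Hypothetical feature: number of tokens, stored under a generic key.
    return {"value": len(tokens)}


def exclamation_feature(tokens):
    # Hypothetical feature: count of "!" tokens, stored under the same generic key.
    return {"value": tokens.count("!")}


tokens = ["what", "a", "day", "!"]

feats = {}
for function in (length_feature, exclamation_feature):
    feats.update(function(tokens))
print(feats)  # only one entry survives: the second update overwrote the first

# Prefixing keys with the feature class name keeps both values.
safe = {}
safe.update({"LENGTH_value": len(tokens)})
safe.update({"EXCLAIM_value": tokens.count("!")})
print(safe)
```

This is why `unigram_feature` prefixes its keys with `UNIGRAM_`.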

In [None]:
def build_features(trainX, feature_functions):
    data=[]
    for doc in trainX:
        feats={}

        # sample text data is already tokenized; if yours is not, do so here
        tokens=doc.split(" ")
        
        # update the document's features with the output of each feature function
        for function in feature_functions:
            feats.update(function(tokens))

        data.append(feats)
    return data

In [None]:
# converts a dictionary of feature names to unique numerical ids
def create_vocab(data):
    feature_vocab={}
    idx=0
    for doc in data:
        for feat in doc:
            if feat not in feature_vocab:
                feature_vocab[feat]=idx
                idx+=1
                
    return feature_vocab

In [None]:
# converts a dictionary of feature names to a sparse representation
# we can fit in a scikit-learn model.  This is important because almost all feature 
# values will be 0 for most documents (note: why?), and we don't want to save them all in 
# memory.

def features_to_ids(data, feature_vocab):
    new_data=sparse.lil_matrix((len(data), len(feature_vocab)))
    for idx,doc in enumerate(data):
        for f in doc:
            if f in feature_vocab:
                new_data[idx,feature_vocab[f]]=doc[f]
    return new_data

In [None]:
# This function evaluates a list of feature functions on the training/dev data arguments
def pipeline(trainX, devX, trainY, devY, feature_functions):
    trainX_feat=build_features(trainX, feature_functions)
    devX_feat=build_features(devX, feature_functions)

    # just create vocabulary from features in *training* data
    feature_vocab=create_vocab(trainX_feat)

    trainX_ids=features_to_ids(trainX_feat, feature_vocab)
    devX_ids=features_to_ids(devX_feat, feature_vocab)
    
    logreg = linear_model.LogisticRegression(C=1.0, solver='lbfgs', penalty='l2', max_iter=10000)
    logreg.fit(trainX_ids, trainY)
    print("Accuracy: %.3f" % logreg.score(devX_ids, devY))  

In [None]:
majority_class(trainY,devY)

4. Explore the impact of different feature functions by evaluating them below:

In [None]:
features=[political_dictionary_feature]
pipeline(trainX, devX, trainY, devY, features)

In [None]:
features=[political_dictionary_feature, unigram_feature]
pipeline(trainX, devX, trainY, devY, features)

In [None]:
features=[new_feature_class_one]
pipeline(trainX, devX, trainY, devY, features)

In [None]:
features=[new_feature_class_two]
pipeline(trainX, devX, trainY, devY, features)

In [None]:
features=[new_feature_class_one, new_feature_class_two]
pipeline(trainX, devX, trainY, devY, features)

In [None]:
features=[unigram_feature, new_feature_class_one, new_feature_class_two]
pipeline(trainX, devX, trainY, devY, features)

5. (Extra) If we preprocess our tokens, do the results change? Create a function to clean your data (e.g., stopword removal, stemming, lemmatization) and see whether it affects your results.
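
As a minimal sketch of such a cleaning step, the snippet below lowercases tokens and drops stopwords and punctuation-only tokens. The function name `clean_tokens` and the tiny `STOPWORDS` set are assumptions made for illustration; a real experiment would likely use a fuller stopword list and possibly a stemmer or lemmatizer.

```python
# A tiny, hand-picked stopword list used only for illustration.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}


def clean_tokens(tokens):
    """Lowercase tokens and drop stopwords and pure-punctuation tokens."""
    cleaned = []
    for token in tokens:
        token = token.lower()
        if token in STOPWORDS:
            continue
        # Skip tokens with no letters or digits (e.g. "," or "!").
        if not any(ch.isalnum() for ch in token):
            continue
        cleaned.append(token)
    return cleaned


print(clean_tokens("The economy is growing , and fast !".split(" ")))
```

A function like this could be applied in `build_features` right after the document is split into tokens, so that every feature function sees the cleaned tokens.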

## Part C

In [None]:
from collections import defaultdict, Counter
import math
import operator
import gzip

In [None]:
window=2
vocabSize=10000

In [None]:
filename="data/bdata/wiki.10K.txt"

wiki_data=open(filename, encoding="utf-8").read().lower().split(" ")


In [None]:
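# We'll only create word representations for the `vocabSize` most frequent words.
# Note: this redefines `create_vocab` from Part B; re-run the Part B cell before calling `pipeline` again.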

def create_vocab(data):
    word_representations={}
    vocab=Counter()
    for i, word in enumerate(data):
        vocab[word]+=1

    topK=[k for k,v in vocab.most_common(vocabSize)]
    for k in topK:
        word_representations[k]=defaultdict(float)
    return word_representations

In [None]:
# word representation for a word = its unigram distributional context (the unigrams that show
# up in a window before and after its occurrence)

# `count_unigram_context` counts each individual unigram in the bag of words around a target
# as a "context" variable

def count_unigram_context(data, word_representations):
    for i, word in enumerate(data):
        if word not in word_representations:
            continue
        start=i-window if i-window > 0 else 0
        end=i+window+1 if i+window+1 < len(data) else len(data)
        for j in range(start, end):
            if i != j:
                word_representations[word][data[j]]+=1

In [None]:
# `count_directional_context` counts the sequence of words before and after the word as a single 
# "context"--and specifies the direction it occurs (to the left or right of the word).

def count_directional_context(data, word_representations):
    for i, word in enumerate(data):
        if word not in word_representations:
            continue
        start=i-window if i-window > 0 else 0
        end=i+window+1 if i+window+1 < len(data) else len(data)
        left="L: %s" % ' '.join(data[start:i])
        right="R: %s" % ' '.join(data[i+1:end])
        
        word_representations[word][left]+=1
        word_representations[word][right]+=1

In [None]:
# normalize each word representation vector so that its L2 norm is 1.
# we do this so that the cosine similarity reduces to a simple dot product

def normalize(word_representations):
    for word in word_representations:
        total=0
        for key in word_representations[word]:
            total+=word_representations[word][key]*word_representations[word][key]
            
        total=math.sqrt(total)
        for key in word_representations[word]:
            word_representations[word][key]/=total
        

In [None]:
def dictionary_dot_product(dict1, dict2):
    dot=0
    for key in dict1:
        if key in dict2:
            dot+=dict1[key]*dict2[key]
    return dot

In [None]:
def find_sim(word_representations, query):
    if query not in word_representations:
        print("'%s' is not in vocabulary" % query)
        return None
    
    scores={}
    for word in word_representations:
        cosine=dictionary_dot_product(word_representations[query], word_representations[word])
        scores[word]=cosine
    return scores

In [None]:
# Find the K words with highest cosine similarity to a query in a set of word_representations
def find_nearest_neighbors(word_representations, query, K):
    scores=find_sim(word_representations, query)
    if scores is not None:
        sorted_x = sorted(scores.items(), key=operator.itemgetter(1), reverse=True)
        for idx, (k, v) in enumerate(sorted_x[:K]):
            print("%s\t%s\t%.5f" % (idx,k,v))

In [None]:
# Let's find the contexts shared between two words that have the most contribution
# to the cosine similarity

def find_shared_contexts(word_representations, query1, query2, K):
    if query1 not in word_representations:
        print("'%s' is not in vocabulary" % query1)
        return None
    
    if query2 not in word_representations:
        print("'%s' is not in vocabulary" % query2)
        return None
    
    context_scores={}
    dict1=word_representations[query1]
    dict2=word_representations[query2]
    
    for key in dict1:
        if key in dict2:
            score=dict1[key]*dict2[key]
            context_scores[key]=score

    sorted_x = sorted(context_scores.items(), key=operator.itemgetter(1), reverse=True)
    #https://docs.python.org/3.6/howto/sorting.html - for more info on sorted()
    
    for idx, (k, v) in enumerate(sorted_x[:K]):
        print("%s\t%s\t%.5f" % (idx,k,v))

1. Fill out a function `scale_tfidf` below.  This function takes as input a dict of word_representations and scales the value for each context in `word_representations[word]` by its tf-idf score.  Use the term frequency for tf and ${N \over |\{d \in D : t \in d\}|}$ for idf.  

Here, tf measures the count of a *context* term for a particular *word*, and idf is based on the number of distinct *words* a particular *context* is seen with (with $N$ the total number of words that have a representation). This function should modify `word_representations` in place.
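
To make the two quantities concrete, here is a small worked example on made-up counts. The `toy` dict and its values are assumptions for illustration; the snippet only computes the idf values and one scaled entry, and does not modify anything in place.

```python
# Toy representations: target word -> {context: count}. Values are made up.
toy = {
    "river": {"the": 4.0, "bank": 2.0},
    "money": {"the": 3.0, "bank": 1.0, "cash": 2.0},
    "tree": {"the": 5.0},
}

N = len(toy)  # number of target words, playing the role of documents

# Document frequency: number of distinct target words each context is seen with.
df = {}
for contexts in toy.values():
    for context in contexts:
        df[context] = df.get(context, 0) + 1

idf = {context: N / count for context, count in df.items()}
print(idf)

# Scaling one entry: tf for ("money", "cash") is 2.0 and idf("cash") is 3/1.
print(toy["money"]["cash"] * idf["cash"])
```

A context seen with every word (here `the`) gets an idf of 1 and is left unchanged, while a context seen with only one word is boosted the most.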

In [None]:
def scale_tfidf(word_representations):
    # TODO: scale each context count in `word_representations[word]` by its idf, in place
    pass

In [None]:
tf_idf_word_representations=create_vocab(wiki_data)
#pipeline for corpus applying tf-idf

In [None]:
word_representations=create_vocab(wiki_data)
#pipeline for corpus without tf-idf

2. Compare the results of tf-idf scaling with the non-scaled results above. How does scaling change the quality of the nearest neighbors, or the sensibility of the significant contexts? Provide examples to support your claims using `find_nearest_neighbors` and `find_shared_contexts`.