# Data Exploration

I began by familiarizing myself with the Reuters dataset. If you don't have the data, run nltk.download('reuters') from a Python session.

In [4]:
import nltk

nltk.corpus.reuters

<CategorizedPlaintextCorpusReader in u'/home/austin/nltk_data/corpora/reuters.zip/reuters/'>

In [5]:
len(nltk.corpus.reuters.fileids())

10788

In [6]:
nltk.corpus.reuters.fileids()[:5]

['test/14826', 'test/14828', 'test/14829', 'test/14832', 'test/14833']

In [7]:
from nltk.corpus import reuters
reuters.fileids()[:5]

['test/14826', 'test/14828', 'test/14829', 'test/14832', 'test/14833']

Each document is tagged with "categories", which I can treat as extra information later on.

In [8]:
print reuters.categories()

test_article = reuters.fileids()[6]
reuters.categories(test_article)

[u'acq', u'alum', u'barley', u'bop', u'carcass', u'castor-oil', u'cocoa', u'coconut', u'coconut-oil', u'coffee', u'copper', u'copra-cake', u'corn', u'cotton', u'cotton-oil', u'cpi', u'cpu', u'crude', u'dfl', u'dlr', u'dmk', u'earn', u'fuel', u'gas', u'gnp', u'gold', u'grain', u'groundnut', u'groundnut-oil', u'heat', u'hog', u'housing', u'income', u'instal-debt', u'interest', u'ipi', u'iron-steel', u'jet', u'jobs', u'l-cattle', u'lead', u'lei', u'lin-oil', u'livestock', u'lumber', u'meal-feed', u'money-fx', u'money-supply', u'naphtha', u'nat-gas', u'nickel', u'nkr', u'nzdlr', u'oat', u'oilseed', u'orange', u'palladium', u'palm-oil', u'palmkernel', u'pet-chem', u'platinum', u'potato', u'propane', u'rand', u'rape-oil', u'rapeseed', u'reserves', u'retail', u'rice', u'rubber', u'rye', u'ship', u'silver', u'sorghum', u'soy-meal', u'soy-oil', u'soybean', u'strategic-metal', u'sugar', u'sun-meal', u'sun-oil', u'sunseed', u'tea', u'tin', u'trade', u'veg-oil', u'wheat', u'wpi', u'yen', u'zinc']


[u'coffee', u'lumber', u'palm-oil', u'rubber', u'veg-oil']

The beginning of each article has a title in all capital letters.

In [9]:
reuters.words(test_article)[:6]

[u'INDONESIAN', u'COMMODITY', u'EXCHANGE', u'MAY', u'EXPAND', u'The']

I'd like to experiment later with weighting these title tokens more heavily than the rest of the text.
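To weight title tokens separately, the title first has to be split from the body. Below is a minimal sketch of one way to do that. The function name `split_headline` and the heuristic are my own assumptions, not an NLTK feature: the body is assumed to start at the first token whose second character is lower case, such as 'The'.

```python
def split_headline(tokens):
    """Split a token list into (headline, body).

    Heuristic (an assumption, not part of NLTK): the body starts at the
    first token whose second character is lower case, e.g. 'The'.
    """
    for i, token in enumerate(tokens):
        if len(token) > 1 and token[1].islower():
            return tokens[:i], tokens[i:]
    # No mixed-case token found: treat the whole document as headline.
    return tokens, []


tokens = ['INDONESIAN', 'COMMODITY', 'EXCHANGE', 'MAY', 'EXPAND', 'The']
headline, body = split_headline(tokens)
print(headline)
print(body)
```

The heuristic is fragile: a body that opens with a single-letter word such as 'A' or with a number will have that token counted as part of the headline.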

## Getting Started

Before doing anything fancy, I should clean up the data.

In [10]:
print reuters.categories(reuters.fileids()[0])
print reuters.words(reuters.fileids()[0])[:10], "\n"

categories = []
documents = []

for file_id in reuters.fileids():
    temp = []
    for category in reuters.categories(file_id):
        temp.append(category.encode('utf-8'))
    categories.append(temp)
    
    temp = []
    for word in reuters.words(file_id):
        temp.append(word.encode('utf-8').lower())
    documents.append(temp)

print categories[0]
print documents[0][:10]

[u'trade']
[u'ASIAN', u'EXPORTERS', u'FEAR', u'DAMAGE', u'FROM', u'U', u'.', u'S', u'.-', u'JAPAN'] 

['trade']
['asian', 'exporters', 'fear', 'damage', 'from', 'u', '.', 's', '.-', 'japan']


I'm intentionally beginning with a very simple model (case is ignored and punctuation gets no special treatment) so that I can add to it and see if and how things improve.
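One cleanup step that could be toggled on later is dropping punctuation-only tokens. The sketch below shows the idea on the first ten tokens of the first document; `normalize` and `is_punctuation` are hypothetical helper names, and "punctuation" here means ASCII punctuation as defined by `string.punctuation`.

```python
import string

PUNCTUATION = set(string.punctuation)


def is_punctuation(token):
    """True when every character of the token is ASCII punctuation."""
    return len(token) > 0 and all(ch in PUNCTUATION for ch in token)


def normalize(tokens, drop_punctuation=False):
    """Lower-case tokens and optionally drop punctuation-only tokens."""
    lowered = [t.lower() for t in tokens]
    if drop_punctuation:
        return [t for t in lowered if not is_punctuation(t)]
    return lowered


raw = ['ASIAN', 'EXPORTERS', 'FEAR', 'DAMAGE', 'FROM', 'U', '.', 'S', '.-', 'JAPAN']
print(normalize(raw))
print(normalize(raw, drop_punctuation=True))
```

Keeping the switch off reproduces the baseline, so any change in ranking can be attributed to the punctuation filter alone.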

## Building the Engine

The next step is to create the inverted index. I'm omitting categories for now.

In [11]:
def createInvertedIndex(documents):
    idx = {}
    
    for i,document in enumerate(documents):
        for word in document:
            if word in idx:
                if i in idx[word]:
                    idx[word][i] += 1
                else:
                    idx[word][i] = 1
            else:
                idx[word] = {i:1}
    return idx
        
idx = createInvertedIndex(documents)
for i,pair in enumerate(idx['thailand']):
    if i < 5:
        print pair,':',idx['thailand'][pair]
    else:
        break

3 : 2
3333 : 1
6758 : 1
1546 : 1
4183 : 1
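For comparison, the same term-to-postings structure can be built with `collections.Counter` and `defaultdict`. This is only a sketch on a made-up three-document corpus; `build_inverted_index` and `toy_documents` are illustrative names, not part of the notebook's code.

```python
from collections import Counter, defaultdict


def build_inverted_index(documents):
    """Map each term to {document number: term frequency}."""
    index = defaultdict(dict)
    for doc_num, document in enumerate(documents):
        # Counter tallies every term of one document in a single pass.
        for term, count in Counter(document).items():
            index[term][doc_num] = count
    return dict(index)


# Made-up documents, already tokenized and lower-cased.
toy_documents = [
    ['thailand', 'rice', 'exports', 'rise'],
    ['rice', 'prices', 'fall', 'as', 'rice', 'stocks', 'grow'],
    ['thailand', 'tin', 'output'],
]
toy_index = build_inverted_index(toy_documents)
print(toy_index['rice'])
print(toy_index['thailand'])
```

Each posting stores the document number and the term's count in that document, which is all the term-frequency search needs.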


It's then possible to evaluate my first search query with term frequency as the only information metric.

In [12]:
def searchTF(query, corpus, idx):
    scores = {}
    
    for word in query.split():
        if word not in idx:
            continue  # skip query terms that never occur in the corpus
        for doc_num in idx[word]:
            if doc_num in scores:
                scores[doc_num] += idx[word][doc_num]
            else:
                scores[doc_num] = idx[word][doc_num]
    
    results = []
    for pair in [[score[0],score[1]] for score in zip(scores.keys(), scores.values())]:
        results.append([pair[1], pair[0]])
        
    return sorted(results, key=lambda x: x[0] * -1)
    

def printResults(results, corpus, n, head=True):
    ''' Helper function to print results
    '''
    if head:    
        print('\nTop %d from recall set of %d items:' % (n,len(results)))
        for r in results[:n]:
            print('\t%0.2f - %s length: %d'%(r[0],corpus[r[1]][:8],len(corpus[r[1]])))
    else:
        print('\nBottom %d from recall set of %d items:' % (n,len(results)))
        for r in results[-n:]:
            print('\t%0.2f - %s length: %d'%(r[0],corpus[r[1]][:8],len(corpus[r[1]])))

idx = createInvertedIndex(documents)
scores = searchTF('hostage crisis', documents, idx)
printResults(scores, documents, 10)


Top 10 from recall set of 88 items:
	5.00 - ['brazil', 'says', 'debt', 'crisis', 'is', 'world', 'problem', 'brazilian'] length: 420
	4.00 - ['portuguese', 'economy', 'remains', 'buoyant', 'despite', 'crisis', 'portugal', "'"] length: 583
	4.00 - ['papandreou', 'shows', '"', 'restricted', 'optimism', '"', 'over', 'crisis'] length: 307
	3.00 - ['treasury', "'", 's', 'baker', 'under', 'fire', 'for', 'wall'] length: 871
	3.00 - ['cash', 'crisis', 'hits', 'ugandan', 'coffee', 'board', 'uganda', "'"] length: 556
	2.00 - ['tropical', 'forest', 'death', 'could', 'spark', 'new', 'debt', 'crisis'] length: 651
	2.00 - ['ec', 'commission', 'given', 'plan', 'to', 'save', 'steel', 'industry'] length: 339
	2.00 - ['ex', '-', 'arco', '&', 'lt', ';', 'arc', '>'] length: 292
	2.00 - ['diplomats', 'call', 'u', '.', 's', '.', 'attack', 'on'] length: 560
	2.00 - ['oecd', 'trade', ',', 'growth', 'seen', 'slowing', 'in', '1987'] length: 763


The resulting document set is less than spectacular. The 9th result looks like it could be an actual article on a hostage crisis, but it, like the others, lacks 'hostage'.

In [13]:
print 'hostage' in documents[scores[8][1]]
print 'crisis' in documents[scores[8][1]]

False
True
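The check above points at how the scoring works: counts are summed over query terms, so a document that repeats one term often can outrank a document that contains every term once. A small sketch with a made-up index (the postings below are invented, and `search_tf` is a condensed stand-in for the search function, not the original):

```python
def search_tf(query, index):
    """Rank documents by the summed term frequency of the query terms."""
    scores = {}
    for term in query.split():
        for doc_num, count in index.get(term, {}).items():
            scores[doc_num] = scores.get(doc_num, 0) + count
    return sorted(scores.items(), key=lambda item: -item[1])


# Invented postings: document 0 repeats 'crisis', document 1 has both terms.
toy_index = {
    'crisis': {0: 5, 1: 1},
    'hostage': {1: 1},
}
ranking = search_tf('hostage crisis', toy_index)
print(ranking)
```

Document 0 scores 5 without containing 'hostage' at all, while document 1 scores only 2 despite matching both terms.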


In [14]:
scores = searchTF('hostage', documents, idx)
printResults(scores, documents, 10)


Top 10 from recall set of 1 items:
	1.00 - ['unocal', '&', 'lt', ';', 'ucl', '>', 'plans', 'increase'] length: 435


The query may not be a fair test, since only one document in the entire corpus contains 'hostage'. I'll try another.
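A quick way to sanity-check a query before running it is to look up how many documents contain each term, which is just the length of that term's postings. A minimal sketch, assuming an index shaped like the one built earlier; the helper name `document_frequencies` and the toy postings are hypothetical.

```python
def document_frequencies(query, index):
    """Return {term: number of documents containing it} for each query term."""
    return {term: len(index.get(term, {})) for term in query.split()}


# Invented postings in the same {term: {doc_num: count}} shape.
toy_index = {
    'hostage': {7: 1},
    'crisis': {0: 2, 3: 1, 7: 1},
}
frequencies = document_frequencies('hostage crisis talks', toy_index)
print(frequencies)
```

A term with a document frequency of 0 or 1 contributes almost nothing to the ranking, which is a hint that the query needs rewording.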

In [115]:
scores = searchTF('u s national debt', documents, idx)
print_results(scores, documents, 5)

69  ECONOMICSPOTLIGHT-U.S . DEFICIT WITH NICs
U. S. trade deficit with Taiwan and Korea is expected to widen this year, despite some economic and currency adjustments by the two newly industrialized countries, economists said." The surpluses that Taiwan and Korea ran with the U. S. in 1986 will 

53  COFFEE TALKS FAILURE SEENPRESSURINGU.S .
of talks on re- establishing International Coffee Organization, ICO, coffee quotas last week may put political pressure on the United States, particularly the State Department, to reassess its position, but the U. S. is unlikely to back away from its basic demand 

50  ASIAN EXPORTERS FEAR DAMAGEFROMU. S .- JAPAN RIFT
trade friction between the U. S. And Japan has raised fears among many of Asia' s exporting nations that the row could inflict far- reaching economic damage, businessmen and officials said. They told Reuter correspondents in Asian capitals a U. S. Move 

47  JAPANREJECTSU.S . OBJECTIONS TO FAIRCHILDSALE A
Ministry official dismissed ar

The results are noticeably better. One thing that stands out, however, is that longer documents seem to dominate the top of the ranking. Another is that every token carries the same weight, when really a token's weight should be diminished by its prevalence in the corpus.

## Improving the Results

First, we can reweight terms by adding IDF to the score; later on, we can adjust document scores by their length. Eyeballing the results below, 'U.S.' seems to appear more often, which is good; its weight has likely been increased significantly by the addition of IDF.
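To get a feel for the weights involved, here is a small sketch of a smoothed IDF, log(N / (1 + df)), evaluated for a few document frequencies. N is the corpus size reported earlier and df = 1 matches 'hostage'; the other df values are made up for illustration, and the function name `idf` is my own.

```python
import math


def idf(document_frequency, num_documents):
    """Smoothed inverse document frequency: log(N / (1 + df))."""
    return math.log(float(num_documents) / (1 + document_frequency))


N = 10788  # corpus size reported by len(reuters.fileids())

# df = 1 matches 'hostage'; the remaining values are illustrative only.
for df in (1, 100, 5000, N):
    print('df=%5d  idf=%.3f' % (df, idf(df, N)))
```

Because of the +1 in the denominator, a term that occurs in every document ends up with a slightly negative weight rather than zero.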

In [113]:
import math

def idfCalc(term, idx, n):
    return math.log( float(n) / (1 + len(idx[term])))

def searchTFIDF(query, corpus, idx):
    scores = {}
    
    for word in query.split():
        if word in idx:
            idf = idfCalc(word, idx, len(corpus))
            
            for doc_num in idx[word]:
                if doc_num in scores:
                    scores[doc_num] += idx[word][doc_num] * idf
                else:
                    scores[doc_num] = idx[word][doc_num] * idf
    
    results = []
    for pair in [[score[0],score[1]] for score in zip(scores.keys(), scores.values())]:
        results.append([pair[1], pair[0]])
        
    return sorted(results, key=lambda x: x[0] * -1)

scores = searchTFIDF('u s national debt', documents, idx)
print_results(scores, documents, 10)

74.2293301544  ECONOMICSPOTLIGHT-U.S . DEFICIT WITH NICs
U. S. trade deficit with Taiwan and Korea is expected to widen this year, despite some economic and currency adjustments by the two newly industrialized countries, economists said." The surpluses that Taiwan and Korea ran with the U. S. in 1986 will 

62.3012977666 U.S . URGES BANKS TO WEIGH PHILIPPINE DEBT PLAN
U. S. is urging reluctant commercial banks to seriously consider accepting a novel Philippine proposal for paying its interest bill and believes the innovation is fully consistent with its Third World debt strategy, a Reagan administration official said. The official' s comments also suggest that 

61.7681979233  COFFEE TALKS FAILURE SEENPRESSURINGU.S .
of talks on re- establishing International Coffee Organization, ICO, coffee quotas last week may put political pressure on the United States, particularly the State Department, to reassess its position, but the U. S. is unlikely to back away from its basic demand 

52.5897

Now to do something about document length. I'll start by simply dividing the score by the length of the document. This should give me a sort of relevance-per-token metric.

In [111]:
def searchTFIDFNorm(query, corpus, idx):
    scores = {}
    
    for word in query.split():
        docs = []
        if word in idx:
            idf = idfCalc(word, idx, len(corpus))
            
            for doc_num in idx[word]:
                if doc_num in scores:
                    scores[doc_num] += idx[word][doc_num] * idf
                else:
                    scores[doc_num] = idx[word][doc_num] * idf
    
    results = []
    for pair in [[score[0],score[1]] for score in zip(scores.keys(), scores.values())]:
        results.append([pair[1] / len(corpus[pair[0]]), pair[0]])
        
    return sorted(results, key=lambda x: x[0] * -1)

scores = searchTFIDFNorm('u s national debt', documents, idx)
print_results(scores, documents, 10)

0.356455395684 
national bank says bought dollars against yen swiss national bank says bought dollars against yen

0.316849240608  VIACOM INTERNATIONAL INC GETS ANOTHER NEW NATIONAL AMUSEMENTS BID VIACOM INTERNATIONAL INC GETS ANOTHER NEW NATIONAL AMUSEMENTSBID

TypeError: 'NoneType' object is not iterable

The results appear to be disastrous. Very short documents are winning out because occurrences of 'national', for example, are barely penalized when the document is only 16 tokens long. Perhaps a more moderate approach, like normalizing by the log of the length, would be effective.
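The arithmetic behind that suggestion can be checked on two documents from the results. The long document's TF-IDF score (74.2293301544) and length (1202 tokens) are taken from the outputs in this notebook; the short document's raw score is backed out from its normalized score under the assumption that it is 16 tokens long. The function names are hypothetical.

```python
import math


def normalize_by_length(score, length):
    """Relevance per token."""
    return score / float(length)


def normalize_by_log_length(score, length):
    """Relevance per log-token; undefined for one-token documents (log(1) == 0)."""
    return score / math.log(length)


long_score, long_length = 74.2293301544, 1202
short_length = 16  # assumed length of the top-ranked short document
short_score = 0.356455395684 * short_length  # back out the raw TF-IDF score

for name, score, length in [('long', long_score, long_length),
                            ('short', short_score, short_length)]:
    print('%-5s per-token %.4f  per-log-token %.4f' % (
        name,
        normalize_by_length(score, length),
        normalize_by_log_length(score, length)))
```

Dividing by raw length puts the 16-token document well ahead, while dividing by the log of the length restores the long document to the top.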

In [109]:
import math

def searchTFIDFNorm(query, corpus, idx):
    scores = {}
    
    for word in query.split():
        docs = []
        if word in idx:
            idf = idfCalc(word, idx, len(corpus))
            
            for doc_num in idx[word]:
                if doc_num in scores:
                    scores[doc_num] += idx[word][doc_num] * idf
                else:
                    scores[doc_num] = idx[word][doc_num] * idf
    
    results = []
    for pair in [[score[0],score[1]] for score in zip(scores.keys(), scores.values())]:
        results.append([pair[1] / math.log(len(corpus[pair[0]])), pair[0]])
        
    return sorted(results, key=lambda x: x[0] * -1)

scores = searchTFIDFNorm('u s national debt', documents, idx)
print_results(scores, documents, 20)

10.467009227  ECONOMICSPOTLIGHT-U.S . DEFICIT WITH NICs
U. S. trade deficit with Taiwan and Korea is expected to widen this year, despite some economic and currency adjustments by the two newly industrialized countries, economists said." The surpluses that Taiwan and Korea ran with the U. S. in 1986 will 

9.16050591129  COFFEE TALKS FAILURE SEENPRESSURINGU.S .
of talks on re- establishing International Coffee Organization, ICO, coffee quotas last week may put political pressure on the United States, particularly the State Department, to reassess its position, but the U. S. is unlikely to back away from its basic demand 

9.04147901591 U.S . URGES BANKS TO WEIGH PHILIPPINE DEBT PLAN
U. S. is urging reluctant commercial banks to seriously consider accepting a novel Philippine proposal for paying its interest bill and believes the innovation is fully consistent with its Third World debt strategy, a Reagan administration official said. The official' s comments also suggest that 

8.173027

In [120]:
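from __future__ import print_function
import sys
import string

def printNicely(x, num_tokens):
    num_lines, words = printNicelyTitle(x, num_tokens)
    printNicelyCorpus(x, num_tokens, words)
    
def printNicelyTitle(x, num_tokens):
    ''' Print the document length and headline; return the remaining budget and body tokens
    '''
    words = reuters.words(reuters.fileids()[x])
    print(len(words), end=' ')
    for i,word in enumerate(words):
        # The body starts at the first token whose second character is lower case
        if len(word) > 1:
            if word[1].islower():
                sys.stdout.write('\n')
                return num_tokens-1, words[i+1:]
        if i < len(words)-1:
            if words[i+1] not in set(string.punctuation) and len(words[i+1]) != 1:
                sys.stdout.write(' ')
        sys.stdout.write(word)
    # Headline-only documents never reach the return above; return an empty body
    # so that the tuple unpacking in printNicely does not raise a TypeError
    sys.stdout.write('\n')
    return num_tokens, []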

def printNicelyCorpus(x, num_tokens, words):
    for i,word in enumerate(words[:num_tokens]): 
        sys.stdout.write(word)
        if i < len(words)-1:
            if words[i+1] not in set(string.punctuation):
                sys.stdout.write(' ')
    sys.stdout.write('\n\n')
    
printNicely(123, 50)

def print_results(scores, corpus, num_results, num_chars=50):
    for pair in scores[:num_results]:
        print(str(pair[0]) + ' ', end='')
        printNicely(pair[1], num_chars)

999 ENERGY/U.S . PETROCHEMICAL INDUSTRY
oil feedstocks, the weakened U. S. dollar and a plant utilization rate approaching 90 pct will propel the streamlined U. S. petrochemical industry to record profits this year, with growth expected through at least 1990, major company executives predicted. This bullish outlook 



In [121]:
scores = searchTFIDFNorm('u s national debt', documents, idx)
print_results(scores, documents, 5, 50)



10.467009227 1202  ECONOMICSPOTLIGHT-U.S . DEFICIT WITH NICs
U. S. trade deficit with Taiwan and Korea is expected to widen this year, despite some economic and currency adjustments by the two newly industrialized countries, economists said." The surpluses that Taiwan and Korea ran with the U. S. in 1986 will 

9.16050591129 848  COFFEE TALKS FAILURE SEENPRESSURINGU.S .
of talks on re- establishing International Coffee Organization, ICO, coffee quotas last week may put political pressure on the United States, particularly the State Department, to reassess its position, but the U. S. is unlikely to back away from its basic demand 

9.04147901591 983 U.S . URGES BANKS TO WEIGH PHILIPPINE DEBT PLAN
U. S. is urging reluctant commercial banks to seriously consider accepting a novel Philippine proposal for paying its interest bill and believes the innovation is fully consistent with its Third World debt strategy, a Reagan administration official said. The official' s comments also suggest th