### Austin Zielinski
# Search Engine

## Data Exploration


I began by familiarizing myself with the Reuters dataset. If you don't have the data, run nltk.download() in a Python session to fetch it.

In [1]:
import nltk

nltk.corpus.reuters

<CategorizedPlaintextCorpusReader in u'/home/austin/nltk_data/corpora/reuters.zip/reuters/'>

In [2]:
len(nltk.corpus.reuters.fileids())

10788

In [3]:
nltk.corpus.reuters.fileids()[:5]

['test/14826', 'test/14828', 'test/14829', 'test/14832', 'test/14833']

In [4]:
from nltk.corpus import reuters
reuters.fileids()[:5]

['test/14826', 'test/14828', 'test/14829', 'test/14832', 'test/14833']

Each document has "categories" which I can treat as special information later on.

In [5]:
print reuters.categories()

test_article = reuters.fileids()[6]
reuters.categories(test_article)

[u'acq', u'alum', u'barley', u'bop', u'carcass', u'castor-oil', u'cocoa', u'coconut', u'coconut-oil', u'coffee', u'copper', u'copra-cake', u'corn', u'cotton', u'cotton-oil', u'cpi', u'cpu', u'crude', u'dfl', u'dlr', u'dmk', u'earn', u'fuel', u'gas', u'gnp', u'gold', u'grain', u'groundnut', u'groundnut-oil', u'heat', u'hog', u'housing', u'income', u'instal-debt', u'interest', u'ipi', u'iron-steel', u'jet', u'jobs', u'l-cattle', u'lead', u'lei', u'lin-oil', u'livestock', u'lumber', u'meal-feed', u'money-fx', u'money-supply', u'naphtha', u'nat-gas', u'nickel', u'nkr', u'nzdlr', u'oat', u'oilseed', u'orange', u'palladium', u'palm-oil', u'palmkernel', u'pet-chem', u'platinum', u'potato', u'propane', u'rand', u'rape-oil', u'rapeseed', u'reserves', u'retail', u'rice', u'rubber', u'rye', u'ship', u'silver', u'sorghum', u'soy-meal', u'soy-oil', u'soybean', u'strategic-metal', u'sugar', u'sun-meal', u'sun-oil', u'sunseed', u'tea', u'tin', u'trade', u'veg-oil', u'wheat', u'wpi', u'yen', u'zinc']


[u'coffee', u'lumber', u'palm-oil', u'rubber', u'veg-oil']

The beginning of each article has a title in all capital letters.

In [6]:
reuters.words(test_article)[:6]

[u'INDONESIAN', u'COMMODITY', u'EXCHANGE', u'MAY', u'EXPAND', u'The']

Later, I'd like to experiment with weighting these title tokens more heavily than the rest of the article.

## Getting Started

Before doing anything fancy, I should clean up the data.

In [7]:
print reuters.categories(reuters.fileids()[0])
print reuters.words(reuters.fileids()[0])[:10], "\n"

categories = []
documents = []

for file_id in reuters.fileids():
    temp = []
    for category in reuters.categories(file_id):
        temp.append(category.encode('utf-8'))
    categories.append(temp)
    
    temp = []
    for word in reuters.words(file_id):
        temp.append(word.encode('utf-8').lower())
    documents.append(temp)

print categories[0]
print documents[0][:10]

[u'trade']
[u'ASIAN', u'EXPORTERS', u'FEAR', u'DAMAGE', u'FROM', u'U', u'.', u'S', u'.-', u'JAPAN'] 

['trade']
['asian', 'exporters', 'fear', 'damage', 'from', 'u', '.', 's', '.-', 'japan']


I'm intentionally beginning with a very simple model (ignoring case, giving punctuation no special treatment, etc.) so that I can add to it and see whether and how things improve.
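The token lists above still contain punctuation such as '.' and '.-'. If those tokens turn out to add noise, an extra cleaning pass could look like the minimal sketch below. The names `normalize_tokens` and `PUNCTUATION` are hypothetical and not part of the notebook.

```python
import string

# Characters treated as punctuation (ASCII only, an assumption of this sketch).
PUNCTUATION = set(string.punctuation)

def normalize_tokens(tokens):
    """Lowercase tokens and drop those made up only of punctuation."""
    cleaned = []
    for token in tokens:
        # Keep the token if at least one character is not punctuation.
        if any(ch not in PUNCTUATION for ch in token):
            cleaned.append(token.lower())
    return cleaned

sample = ['ASIAN', 'EXPORTERS', 'FEAR', 'DAMAGE', 'FROM', 'U', '.', 'S', '.-', 'JAPAN']
print(normalize_tokens(sample))
# ['asian', 'exporters', 'fear', 'damage', 'from', 'u', 's', 'japan']
```

Dropping punctuation changes document lengths, so any length-based normalization applied later would see slightly shorter documents.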

I built a function to print each article neatly.

In [8]:
#from sys.stdout import write
import sys
import string

def printNicely(x, num_tokens):
    num_tokens, words = printNicelyTitle(x, num_tokens)
    printNicelyCorpus(x, num_tokens, words)
    
def printNicelyTitle(x, num_tokens):
    words = reuters.words(reuters.fileids()[x])
    sys.stdout.write(' ')
    for i,word in enumerate(words):
        if len(word) > 2:
            if word[1].islower():
                sys.stdout.write('\n')
                return num_tokens-i, words[i+1:]
        if i < len(words)-1:
            if words[i+1] not in set(string.punctuation) and len(words[i+1]) != 1:
                sys.stdout.write(' ')
        sys.stdout.write(word)
    return 0, ['']

def printNicelyCorpus(x, num_tokens, words):
    for i,word in enumerate(words[:num_tokens]): 
        sys.stdout.write(word)
        if i < len(words)-1:
            if words[i+1] not in set(string.punctuation):
                sys.stdout.write(' ')
    sys.stdout.write('\n\n')
    
printNicely(123, 50)

def print_results(scores, corpus, num_results, num_chars=50):
    for pair in scores[:num_results]:
        sys.stdout.write('%f ' % pair[0])
        printNicely(pair[1], num_chars)
        

 ENERGY/U.S . PETROCHEMICAL INDUSTRY
oil feedstocks, the weakened U. S. dollar and a plant utilization rate approaching 90 pct will propel the streamlined U. S. petrochemical industry to record profits this year, with growth expected through at least 1990, 



## Building the Engine

The next step is to create the inverted index. I'm omitting categories for now.

In [9]:
def createInvertedIndex(documents):
    idx = {}
    
    for i,document in enumerate(documents):
        for word in document:
            if word in idx:
                if i in idx[word]:
                    idx[word][i] += 1
                else:
                    idx[word][i] = 1
            else:
                idx[word] = {i:1}
    return idx
        
idx = createInvertedIndex(documents)
for i,pair in enumerate(idx['thailand']):
    if i < 5:
        print pair,':',idx['thailand'][pair]
    else:
        break

3 : 2
3333 : 1
6758 : 1
1546 : 1
4183 : 1
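The same term-to-document-to-count structure can be written more compactly with the standard library. This is only an alternative sketch; `build_inverted_index` and the toy documents are made up for illustration.

```python
from collections import Counter, defaultdict

def build_inverted_index(documents):
    """Map each term to a Counter of {document number: term frequency}."""
    index = defaultdict(Counter)
    for doc_num, document in enumerate(documents):
        for word in document:
            index[word][doc_num] += 1
    return index

toy_documents = [
    ['thailand', 'exports', 'rice', 'thailand'],
    ['rice', 'prices', 'fall'],
]
toy_index = build_inverted_index(toy_documents)
print(dict(toy_index['thailand']))  # {0: 2}
print(dict(toy_index['rice']))      # {0: 1, 1: 1}
```

One caveat of this design: looking up a missing term with `index[word]` silently creates an empty entry, so membership should be checked with `word in index` first.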


It's then possible to evaluate my first search query with term frequency as the only information metric.

In [10]:
def searchTF(query, corpus, idx):
    scores = {}
    
    for word in query.split():
        if word not in idx:
            continue  # skip query terms that never occur in the corpus
        for doc_num in idx[word]:
            if doc_num in scores:
                scores[doc_num] += idx[word][doc_num]
            else:
                scores[doc_num] = idx[word][doc_num]
    
    results = []
    for pair in [[score[0],score[1]] for score in zip(scores.keys(), scores.values())]:
        results.append([pair[1], pair[0]])
        
    return sorted(results, key=lambda x: x[0] * -1)
    

def printResults(results, corpus, n, head=True):
    ''' Helper function to print results
    '''
    if head:    
        print('\nTop %d from recall set of %d items:' % (n,len(results)))
        for r in results[:n]:
            print('\t%0.2f - %s length: %d'%(r[0],corpus[r[1]][:8],len(corpus[r[1]])))
    else:
        print('\nBottom %d from recall set of %d items:' % (n,len(results)))
        for r in results[-n:]:
            print('\t%0.2f - %s length: %d'%(r[0],corpus[r[1]][:8],len(corpus[r[1]])))

idx = createInvertedIndex(documents)
scores = searchTF('hostage crisis', documents, idx)
print_results(scores, documents, 10)

5.000000   BRAZIL SAYS DEBT CRISIS IS WORLD PROBLEM
Finance Minister Dilson Funaro said his country' s foreign debt crisis could only be solved by changes in the international financial system. Speaking to a business conference he said" It is not Brazil that has to make adjustments with the 

4.000000   PORTUGUESE ECONOMY REMAINS BUOYANT DESPITE CRISIS
' s economy, which has been enjoying one of its most buoyant periods in more than a decade, may now be strong enough to shrug off the country' s latest government crisis, analysts said. But the April 3 ousting 

4.000000   PAPANDREOUSHOWS " RESTRICTEDOPTIMISM " OVER CRISIS
Prime Minister Andreas Papandreou expressed" restricted optimism" about a crisis with Turkey over disputed oil rights in the Aegean Sea. Papandreou was speaking to reporters after briefing opposition political leaders on the latest developments in the row as a 

3.000000  TREASURY' S BAKER UNDER FIRE FOR WALL STREET DROP As
Washington sought to restore investor confide

The resulting document set is less than spectacular. The 9th result looks like it could be an actual article on a hostage crisis, but it, like the others, lacks 'hostage'.

In [11]:
print 'hostage' in documents[scores[8][1]]
print 'crisis' in documents[scores[8][1]]

False
True


In [12]:
scores = searchTF('hostage', documents, idx)
printResults(scores, documents, 10)


Top 10 from recall set of 1 items:
	1.00 - ['unocal', '&', 'lt', ';', 'ucl', '>', 'plans', 'increase'] length: 435


Maybe the query isn't entirely fair, since only one document in the entire corpus contains 'hostage'. I'll try another.

In [13]:
scores = searchTF('u s national debt', documents, idx)
print_results(scores, documents, 5)

69.000000   ECONOMICSPOTLIGHT-U.S . DEFICIT WITH NICs
U. S. trade deficit with Taiwan and Korea is expected to widen this year, despite some economic and currency adjustments by the two newly industrialized countries, economists said." The surpluses that Taiwan and Korea 

53.000000   COFFEE TALKS FAILURE SEENPRESSURINGU.S .
of talks on re- establishing International Coffee Organization, ICO, coffee quotas last week may put political pressure on the United States, particularly the State Department, to reassess its position, but the U. S. 

50.000000   ASIAN EXPORTERS FEAR DAMAGEFROMU. S .- JAPAN RIFT
trade friction between the U. S. And Japan has raised fears among many of Asia' s exporting nations that the row could inflict far- reaching economic damage, businessmen and officials said. They told 

47.000000   JAPANREJECTSU.S . OBJECTIONS TO FAIRCHILDSALE A
Ministry official dismissed arguments made by senior U. S. Government officials seeking to block the sale of a U. S. Microchip mak

The results are noticeably better. One thing that stands out, however, is that longer documents seem to dominate the top results, and every token carries the same weight, when really a token's weight should diminish as it becomes more prevalent in the corpus.

## Improving the Results

First, we can reweight terms by adding IDF to the score, and later adjust the document scores by their length. Eyeballing the results below, 'US' seems to appear more often, which is good; its weight has likely been increased significantly by the addition of IDF.
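Written out, the scoring used in the next cell is the following, where $N$ is the number of documents, $\mathrm{df}(t)$ is the number of documents containing term $t$, and $\mathrm{tf}(t, d)$ is the count of $t$ in document $d$. The notation is mine; the formula itself is read off `idfCalc` and `searchTFIDF`.

$$
\mathrm{idf}(t) = \ln\frac{N}{1 + \mathrm{df}(t)}, \qquad
\mathrm{score}(q, d) = \sum_{t \in q} \mathrm{tf}(t, d)\,\mathrm{idf}(t)
$$

The $1 +$ in the denominator means a term that appears in every document gets a slightly negative IDF rather than exactly zero.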

In [14]:
import math

def idfCalc(term, idx, n):
    return math.log( float(n) / (1 + len(idx[term])))

def searchTFIDF(query, corpus, idx):
    scores = {}
    
    for word in query.split():
        if word in idx:
            idf = idfCalc(word, idx, len(corpus))
            
            for doc_num in idx[word]:
                if doc_num in scores:
                    scores[doc_num] += idx[word][doc_num] * idf
                else:
                    scores[doc_num] = idx[word][doc_num] * idf
    
    results = []
    for pair in [[score[0],score[1]] for score in zip(scores.keys(), scores.values())]:
        results.append([pair[1], pair[0]])
        
    return sorted(results, key=lambda x: x[0] * -1)

scores = searchTFIDF('u s national debt', documents, idx)
print_results(scores, documents, 10)

74.229330   ECONOMICSPOTLIGHT-U.S . DEFICIT WITH NICs
U. S. trade deficit with Taiwan and Korea is expected to widen this year, despite some economic and currency adjustments by the two newly industrialized countries, economists said." The surpluses that Taiwan and Korea 

62.301298  U.S . URGES BANKS TO WEIGH PHILIPPINE DEBT PLAN
U. S. is urging reluctant commercial banks to seriously consider accepting a novel Philippine proposal for paying its interest bill and believes the innovation is fully consistent with its Third World debt strategy, a Reagan administration 

61.768198   COFFEE TALKS FAILURE SEENPRESSURINGU.S .
of talks on re- establishing International Coffee Organization, ICO, coffee quotas last week may put political pressure on the United States, particularly the State Department, to reassess its position, but the U. S. 

52.589726   JAPANREJECTSU.S . OBJECTIONS TO FAIRCHILDSALE A
Ministry official dismissed arguments made by senior U. S. Government officials seeking to bl

Now to do something about document length. I'll start by simply dividing the score by the length of the document. This should give me a sort of relevance-per-token metric.

In [15]:
def searchTFIDFNorm(query, corpus, idx):
    scores = {}
    
    for word in query.split():
        docs = []
        if word in idx:
            idf = idfCalc(word, idx, len(corpus))
            
            for doc_num in idx[word]:
                if doc_num in scores:
                    scores[doc_num] += idx[word][doc_num] * idf
                else:
                    scores[doc_num] = idx[word][doc_num] * idf
    
    results = []
    for pair in [[score[0],score[1]] for score in zip(scores.keys(), scores.values())]:
        results.append([pair[1] / len(corpus[pair[0]]), pair[0]])
        
    return sorted(results, key=lambda x: x[0] * -1)

scores = searchTFIDFNorm('u s national debt', documents, idx)
print_results(scores, documents, 10)

0.356455  
national bank says bought dollars against yen swiss national bank says bought dollars against yen

0.316849   VIACOM INTERNATIONAL INC GETS ANOTHER NEW NATIONAL AMUSEMENTS BID VIACOM INTERNATIONAL INC GETS ANOTHER NEW NATIONAL AMUSEMENTSBID

0.259240   VIACOM SAID IT HAS NEW NATIONALAMUSEMENTS , MCV HOLDINGS BIDS VIACOM SAID IT HAS NEW NATIONALAMUSEMENTS , MCV HOLDINGSBIDS

0.259240   BENEFICIAL CORP TO SELL WESTERN NATIONAL LIFE FOR 275 MLN DLRS BENEFICIAL CORP TO SELL WESTERN NATIONAL LIFE FOR 275 MLNDLRS

0.255533   UNIONNATIONAL &lt ;UNBC > SIGNS DEFINITIVE PACT
National Corp said it signed a definitive agreement under which its First National Bank and Trust Co of Washington unit will merge with& lt; Second National Bank of Masontown >. Under a previously announced merger agreement, each 

0.252366  U.S .VIDEO &lt ;VVCO. O >, FIRST NATIONAL INMERGERU.S .
Vending Corp said it completed acquiring First National Telecommunications INc from First National Entertainment Corp 

The results appear to be disastrous. Very short documents are winning out because, for example, an occurrence of 'national' is barely discounted when the document is only 16 tokens long. Perhaps a more moderate approach, like normalizing by the log of the length, would be effective.
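To see why the two normalizations behave so differently, here is a small worked example. All numbers (the IDF of 3.0, the match counts, and the document lengths) are hypothetical and chosen only to illustrate the effect.

```python
import math

def normalized_scores(raw_score, doc_length):
    """Return the raw score divided by length and by log(length).

    Assumes doc_length > 1, since log(1) is 0.
    """
    return raw_score / float(doc_length), raw_score / math.log(doc_length)

# Hypothetical numbers: one match in a 16-token headline versus
# ten matches in a 500-token article, with an assumed IDF of 3.0.
idf = 3.0
short_by_len, short_by_log = normalized_scores(1 * idf, 16)
long_by_len, long_by_log = normalized_scores(10 * idf, 500)

print(short_by_len > long_by_len)  # True: dividing by length favors the headline
print(short_by_log < long_by_log)  # True: dividing by log(length) favors the article
```

Dividing by the raw length lets a single match in a tiny document outrank ten matches in a long one, while the logarithm grows slowly enough that the article with more matches stays ahead.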

In [16]:
import math

def searchTFIDFNorm(query, corpus, idx):
    scores = {}
    
    for word in query.split():
        docs = []
        if word in idx:
            idf = idfCalc(word, idx, len(corpus))
            
            for doc_num in idx[word]:
                if doc_num in scores:
                    scores[doc_num] += idx[word][doc_num] * idf
                else:
                    scores[doc_num] = idx[word][doc_num] * idf
    
    results = []
    for pair in [[score[0],score[1]] for score in zip(scores.keys(), scores.values())]:
        results.append([pair[1] / math.log(len(corpus[pair[0]])), pair[0]])
        
    return sorted(results, key=lambda x: x[0] * -1)

scores = searchTFIDFNorm('u s national debt', documents, idx)
print_results(scores, documents, 5, 50)

10.467009   ECONOMICSPOTLIGHT-U.S . DEFICIT WITH NICs
U. S. trade deficit with Taiwan and Korea is expected to widen this year, despite some economic and currency adjustments by the two newly industrialized countries, economists said." The surpluses that Taiwan and Korea 

9.160506   COFFEE TALKS FAILURE SEENPRESSURINGU.S .
of talks on re- establishing International Coffee Organization, ICO, coffee quotas last week may put political pressure on the United States, particularly the State Department, to reassess its position, but the U. S. 

9.041479  U.S . URGES BANKS TO WEIGH PHILIPPINE DEBT PLAN
U. S. is urging reluctant commercial banks to seriously consider accepting a novel Philippine proposal for paying its interest bill and believes the innovation is fully consistent with its Third World debt strategy, a Reagan administration 

8.173028   JAPANREJECTSU.S . OBJECTIONS TO FAIRCHILDSALE A
Ministry official dismissed arguments made by senior U. S. Government officials seeking to block

I'm pretty happy with the results, so I think it's time to try some more queries on this setup and ask for feedback.

In [17]:
print '----------Query 1----------'
scores = searchTFIDFNorm('international currency', documents, idx)
print_results(scores, documents, 5, 50)

print '----------Query 2----------'
scores = searchTFIDFNorm('big earthquake', documents, idx)
print_results(scores, documents, 5, 50)

print '----------Query 3----------'
scores = searchTFIDFNorm('earthquake', documents, idx)
print_results(scores, documents, 5, 50)

print '----------Query 4----------'
scores = searchTFIDFNorm('quality of life', documents, idx)
print_results(scores, documents, 5, 50)

----------Query 1----------
6.148997   MIDDLE EAST CURRENCY MARKET SEES KEY CHANGES
East currency dealers meet in Abu Dhabi this weekend at a time of fundamental change in their business, which has seen a growing volume of trade shift from the Arab world to London. The 14th congress of the Inter- Arab 

4.658282  U.S . CORPORATEFINANCE - NEW ZEALAND DLR FRNS
- rate notes denominated in a foreign currency, a relatively new wrinkle on Wall Street, will probably be issued infrequently because the so- called" window of opportunity" opens and closes quickly, traders say

4.308503   VW SAYS 480 MLN MARKS MAXIMUM FOR CURRENCY LOSSES
for Volkswagen AG& lt; VOWG. F >, VW, linked to an alleged foreign currency fraud will not exceed the 480 mln marks provision already made, a VW spokesman said. The spokesman was commenting after VW 

4.228809   DOLLAR EXPECTED TO FALL DESPITE INTERVENTION
bank intervention in the foreign exchange markets succeeded in staunching the dollar' s losses today, but sen

The results point out a few weaknesses in the search engine. First, adjectives like 'big' confuse the results, especially when there aren't many documents containing 'earthquake'.

In [18]:
scores = searchTFIDFNorm('earthquake', documents, idx)
scores2 = searchTFIDFNorm('big', documents, idx)
print len(scores)
len(scores2)

36


95
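Since 'big' matches more documents (95) than 'earthquake' (36), one possible mitigation is to drop low-information words from the query before scoring. The sketch below uses a hand-picked stop list that is purely an assumption; 'big' is included only to illustrate the point, and `strip_stop_words` is a hypothetical helper.

```python
# Hypothetical, hand-picked stop list; a real one would be much longer.
STOP_WORDS = {'a', 'an', 'of', 'the', 'big'}

def strip_stop_words(query):
    """Lowercase the query and drop stop words, keeping the original order."""
    words = query.lower().split()
    kept = [word for word in words if word not in STOP_WORDS]
    # Fall back to the full query if every word was a stop word.
    return kept or words

print(strip_stop_words('big earthquake'))   # ['earthquake']
print(strip_stop_words('quality of life'))  # ['quality', 'life']
```

The filtered words could then be joined with spaces and passed to the existing search functions unchanged.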

Second, it doesn't handle phrases like 'quality of life' well. It does handle keyword-heavy queries well, though.

In [19]:
scores = searchTFIDFNorm('asia commodities', documents, idx)
print_results(scores, documents, 5, 50)

4.527399   NET CHANGE IN EXPORT COMMITMENTS -- USDA
U. S. Agriculture Department gave the net change in export commitments, including sales, cancellations, foreign purchases and cumulative exports, in the current seasons through the week ended April 2, with comparisons, as follows, in 

2.989904   DEAK BUYS JOHNSON MATTHEY COMMODITIES
International, a foreign currency and precious metals firm, announced the acquisition of Johnson Matthey Commodities of New York from Minories Finance Limited, a unit of the Bank of England. The purchase valued at 14. 8 mln dlrs follows the recent 

2.978849   DEAK BUYS JOHNSON MATTHEY COMMODITIES
International, a foreign currency and precious metals firm, announced the acquisition of Johnson Matthey Commodities of New York from Minories Finance Limited, a unit of the Bank of England. The purchase valued at 14. 8 mln dlrs follows the recent 

2.969081  U.S . DOLLAR LOSSES PROPEL BROAD COMMODITY GAINS
from gold to grains to cotton posted solid gains in a f
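Phrases such as 'quality of life' could be handled by checking whether the query tokens appear next to each other in a document. The following is a minimal sketch of that idea as a post-filter on the scored results; `contains_phrase` and `filter_by_phrase` are hypothetical helpers, and the toy corpus is made up.

```python
def contains_phrase(document, phrase_tokens):
    """Return True if phrase_tokens occur contiguously, in order, in document."""
    n = len(phrase_tokens)
    if n == 0:
        return False
    for start in range(len(document) - n + 1):
        if document[start:start + n] == phrase_tokens:
            return True
    return False

def filter_by_phrase(results, corpus, phrase):
    """Keep only [score, doc_num] pairs whose document contains the phrase."""
    phrase_tokens = phrase.lower().split()
    return [pair for pair in results
            if contains_phrase(corpus[pair[1]], phrase_tokens)]

toy_corpus = [
    ['the', 'quality', 'of', 'life', 'improved'],
    ['life', 'of', 'high', 'quality'],
]
toy_results = [[2.0, 1], [1.5, 0]]
print(filter_by_phrase(toy_results, toy_corpus, 'quality of life'))  # [[1.5, 0]]
```

This scans every candidate document linearly; storing token positions in the index would avoid the scan, at the cost of a larger index.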

## Treating the Title Differently

Earlier I hypothesized that I could change the weight of the title to make it more effective; I now plan to implement that by making the total weight of the title equal to the total weight of the document.

My justification is that the title should represent the information presented in the corpus; therefore, I think its information contribution should be weighted the same as the corpus.

In terms of a product on a website, if the item is an iPhone 6, then 'iPhone' and '6' will be very heavily weighted, which is basically what you want. An item with a long description full of spam has to divide that weight among more tokens, and it probably also has a weak description, meaning the title will have low weight and low relevancy.
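One way to read "equal total weight" is that the token weights in each field sum to 1, so each occurrence is worth one over the field's length. The sketch below follows that interpretation, which is my assumption; `balanced_term_weight` and both example items are hypothetical.

```python
def balanced_term_weight(term, title, body):
    """Weight of `term` when title and body each carry a total weight of 1."""
    title_part = title.count(term) / float(len(title)) if title else 0.0
    body_part = body.count(term) / float(len(body)) if body else 0.0
    return title_part + body_part

# A concise listing versus a listing padded with unrelated tokens.
short_item = (['iphone', '6'], ['apple', 'iphone', '6', 'smartphone'])
spam_item = (['cheap', 'phone', 'deal', 'iphone', 'best', 'price'],
             ['buy', 'now'] * 10)

print(balanced_term_weight('iphone', *short_item))  # 0.75 (1/2 from title + 1/4 from body)
print(balanced_term_weight('iphone', *spam_item))   # about 0.167 (1/6 from title only)
```

In a full scorer, each part would also be multiplied by that field's IDF for the term.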

I'm going to make a separate idx for the titles and a separate IDF. First I'll separate titles from corpus using code I engineered for my printing function. Unfortunately I couldn't get it to work for all documents, but I might still be able to demonstrate the concept on a smaller corpus. WARNING: MAY TAKE A WHILE

In [20]:
import numpy

reuters_docs = []
new_documents = []

articles = numpy.random.choice(len(reuters.fileids()), 500, replace=False)

for i in articles:
    reuters_docs.append(reuters.words(reuters.fileids()[i]))


for document in reuters_docs:
    title_corpus = []
    for i,word in enumerate(document):
        if i > 1:
            if len(word) > 2:
                if word[1].islower():
                    title = []
                    for j in range(0, i):
                        title.append(document[j].encode('utf-8'))
                    title_corpus.append(title)
                    corpus = []
                    for j in range(i, len(document)):
                        corpus.append(document[j].lower().encode('utf-8'))
                    title_corpus.append(corpus)
                    
    new_documents.append(title_corpus)

Let's see how it turned out:

In [21]:
for i in range(0, 5):
    print new_documents[i][0]
    print new_documents[i][1][:15]
    print

['WEISFIELD', "'", 'S', 'INC', '&', 'lt', ';', 'WEIS', '>', 'QUARTERLY', 'DIVIDEND']
['qtly', 'div', '12', '-', '1', '/', '2', 'cts', 'vs', '12', '-', '1', '/', '2', 'cts']

['SOVIET', 'INDUSTRIAL', 'OUTPUT', 'UP', 'IN', 'FIRST', 'QUARTER']
['soviet', 'industrial', 'output', 'in', 'the', 'first', 'quarter', 'of', 'this', 'year', 'grew', 'by', '2', '.', '5']

['PAKISTAN', "'", 'S', 'TRADE', 'DEFICIT', 'NARROWS', 'IN', 'FEBRUARY']
['pakistan', "'", 's', 'trade', 'deficit', 'narrowed', 'to', '2', '.', '64', 'billion', 'rupees', '(', 'provisional', ')']

['ADVO', '-', 'SYSTEM', 'INC', '&', 'lt', ';', 'ADVO', '>', 'SEES', 'BREAK', '-', 'EVEN', '2ND', 'QTR', 'ADVO', '-']
['system', 'inc', 'said', 'it', 'could', 'report', 'a', 'break', 'even', 'second', 'quarter', 'ending', 'march', '28', ',']

['&', 'lt', ';', 'BARRICINI', 'FOODS', 'INC', '>', '1ST', 'QTR', 'LOSS']
['shr', 'loss', 'three', 'cts', 'vs', 'loss', 'three', 'cts', 'net', 'loss', '78', ',', '456', 'vs', 'loss']



Now that we have the titles separated, we can treat them as separate information. Unfortunately, I couldn't get the title weighting to equal the corpus weighting, so I had to settle for a separate IDF for titles.

In [35]:
def createInvertedIndex(documents):
    idx_t = {}
    for i,title in enumerate(documents[0]):
        for word in title:
            word = word.lower()
            if word in idx_t:
                if i in idx_t[word]:
                    idx_t[word][i] += 1
                else:
                    idx_t[word][i] = 1
            else:
                idx_t[word] = {i:1}
               
    idx_c = {}
    for i,corpus in enumerate(documents[1]):
        for word in corpus:
            word = word.lower()
            if word in idx_c:
                if i in idx_c[word]:
                    idx_c[word][i] += 1
                else:
                    idx_c[word][i] = 1
            else:
                idx_c[word] = {i:1}
    
    return idx_t, idx_c

import math

def searchTFIDFTitle(query, corpus, idx_t, idx_c):
    scores = {}
    
    for word in query.split():
        word = word.lower()
        if word in idx_t:
            idf = idfCalc(word, idx_t, len(corpus))
            
            for doc_num in idx_t[word]:
                if doc_num in scores:
                    scores[doc_num] += idx_t[word][doc_num] * idf
                else:
                    scores[doc_num] = idx_t[word][doc_num] * idf
        
        if word in idx_c:
            idf = idfCalc(word, idx_c, len(corpus))
            
            for doc_num in idx_c[word]:
                if doc_num in scores:
                    scores[doc_num] += idx_c[word][doc_num] * idf
                else:
                    scores[doc_num] = idx_c[word][doc_num] * idf

    results = []
    for pair in [[score[0],score[1]] for score in zip(scores.keys(), scores.values())]:
        divisor = len(corpus[pair[0]])
        if divisor <= 1:
            # log(1) is 0 and log(0) is undefined, so fall back to no normalization
            divisor = 1
        else:
            divisor = math.log(divisor)
        results.append([pair[1] / divisor, pair[0]])
        
    return sorted(results, key=lambda x: x[0] * -1)
    
idx_t, idx_c = createInvertedIndex(new_documents)
scores = searchTFIDFTitle('u s national debt', new_documents, idx_t, idx_c)
print_results(scores, new_documents, 5, 50)

1.448965   ASIAN EXPORTERS FEAR DAMAGEFROMU. S .- JAPAN RIFT
trade friction between the U. S. And Japan has raised fears among many of Asia' s exporting nations that the row could inflict far- reaching economic damage, businessmen and officials said. They told 

1.341036   INDONESIAN COMMODITY EXCHANGE MAY EXPAND
Indonesian Commodity Exchange is likely to start trading in at least one new commodity, and possibly two, during calendar 1987, exchange chairman Paian Nainggolan said. He told Reuters in a telephone interview that trading in palm oil, sawn timber, 

1.341036   WESTERN MINING TO OPEN NEW GOLD MINE IN AUSTRALIA
Mining Corp Holdings Ltd& lt; WMNG. S>( WMC) said it will establish a new joint venture gold mine in the Northern Territory at a cost of about 21 mln dlrs. The mine, to 

1.341036   BOND CORP STILL CONSIDERING ATLAS MININGBAIL - OUT
Corp Holdings Ltd& lt; BONA. S> and Atlas Consolidated Mining and Development Corp& lt; ATLC. MN> are still holding talks on a bail- out pac

Not quite what I expected, but a little better than normalizing by length.
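For comparison, here is a sketch of building the two indexes so that a document number means the same document in both. It assumes each entry of the input is a (title tokens, body tokens) pair, which is stricter than the `new_documents` structure built above; `build_field_indexes` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def build_field_indexes(split_documents):
    """Build separate title and body indexes keyed by the same document number.

    Assumes each entry is a (title_tokens, body_tokens) pair.
    """
    title_index = defaultdict(Counter)
    body_index = defaultdict(Counter)
    for doc_num, (title, body) in enumerate(split_documents):
        for word in title:
            title_index[word.lower()][doc_num] += 1
        for word in body:
            body_index[word.lower()][doc_num] += 1
    return title_index, body_index

toy = [
    (['SOVIET', 'OUTPUT', 'UP'], ['soviet', 'industrial', 'output', 'grew']),
    (['TRADE', 'DEFICIT', 'NARROWS'], ['trade', 'deficit', 'narrowed']),
]
title_index, body_index = build_field_indexes(toy)
print(dict(title_index['soviet']))  # {0: 1}
print(dict(body_index['trade']))    # {1: 1}
```

With indexes shaped like this, the title and body IDF values can each be computed against the same document count.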

## Conclusions

In making a search engine, I learned that it's easy to improve specific queries but hard to be good in general. Just when you think you've nailed relevancy, a query comes along that breaks it. Also, the more complicated the English entered as a query, the more likely it is to confuse the algorithm unless there are stop-word filters, stemmers, or even a grammar parser.

I would enjoy making a search engine with multithreading or on a cluster, and look forward to implementing one in the future.