# Title: Search Engine Alpha 2.0
<p><b>Abstract:</b> A search engine built on the Reuters data set</p>
<p><b>Authors:</b> Uriel Antonio & Ernesto Louie Cortez</p>
<p><b>Date:</b>    04/30/2016</p>

In [1]:
from xml.etree import ElementTree
import re
from StringIO import StringIO
from bs4 import BeautifulSoup
import os 
import pandas as pd
from collections import Counter

## Data Cleaning

<p><b>Abstract: </b>Reuters data is organized in a generalized markup format. Using BeautifulSoup, each article is scraped for its title, content, date, and place according to the respective markup tags.</p>

<p><b>Challenges: </b>1) We initially assumed that every article would have a date, place, title, and body. During the alpha 1 stage, we found that titles and content became mismatched approximately 30 articles in, because an article with a missing field shifted the lists out of alignment. Conditional statements were added to insert a placeholder whenever a field was missing. 2) The cleaned data was also littered with markup for special symbols, which needed to be removed as well.</p>
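
The two challenges above can be sketched without the full parser. The snippet below is a minimal illustration, not the notebook's actual pipeline: it uses regular expressions instead of BeautifulSoup, the entity pattern differs from the one used in the cleaning cell, and the helpers `strip_entities` and `extract_field` as well as the two sample records are made up for the example. It shows how inserting a placeholder for every missing tag keeps the title and body lists the same length, so position `i` always refers to the same article.

```python
import re

def strip_entities(text):
    """Remove character references such as &lt; or &#3; (illustrative pattern)."""
    return re.sub(r"&#?\w+;", "", text)

def extract_field(record, tag, placeholder):
    """Return the text inside <tag>...</tag>, or placeholder if the tag is absent.

    Hypothetical helper for illustration only.
    """
    match = re.search(r"<%s>(.*?)</%s>" % (tag, tag), record, flags=re.DOTALL)
    return match.group(1).strip() if match else placeholder

# Two made-up records: the second one has no <title> tag.
records = [
    "<reuters><title>bahia cocoa review</title><content>showers continued&#3;</content></reuters>",
    "<reuters><content>body without a headline</content></reuters>",
]

cleaned = [strip_entities(r) for r in records]
titles = [extract_field(r, "title", "Untitled") for r in cleaned]
bodies = [extract_field(r, "content", "No Content.") for r in cleaned]

# Both lists hold one entry per record, so titles[i] and bodies[i] stay aligned.
print(list(zip(titles, bodies)))
```

Without the placeholder, the second record would contribute a body but no title, and every later title would be paired with the wrong body.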



In [2]:
totstring=""

with open("Data/reut2-000.sgm",'r') as inF:
    for line in inF:
        string2=re.sub("&.*?>","",line,flags=re.UNICODE)
        string3=re.sub("\n"," ",string2,flags=re.UNICODE)
        string=re.sub("[^0-9a-zA-Z<>/\s=!-\"\"]+","",string3.lower())
        totstring+=string
    
# Name the parser explicitly so BeautifulSoup does not have to guess one
soup = BeautifulSoup(totstring, "lxml")

items_date=list()
items_places=list()
items_title=list()
items_body=list()


for a in soup.findAll("reuters"):
    # Append a placeholder when a tag is missing so all four lists stay aligned
    if a.date is not None:
        items_date.append(a.date.getText())
    else:
        items_date.append("N/D")
    if a.places is not None:
        items_places.append(a.places.getText())
    else:
        items_places.append("N/L")
    if a.title is not None:
        items_title.append(a.title.getText())
    else:
        items_title.append("Untitled")
    if a.content is not None:
        items_body.append(a.content.getText())
    else:
        items_body.append("No Content.")
  

corpus = items_title[0:25]
print(corpus)


[u'bahia cocoa review', u'standard oil  to form financial unit', u'texas commerce bancshares  files plan', u'talking point/bankamerica  equity offer', u'national average prices for farmerowned reserve', u'argentine 1986/87 grain/oilseed registrations', u'red lion inns files plans offering', u'usx  debt dowgraded by moodys', u'champion products  approves stock split', u'computer terminal systems  completes sale', u'cobanco inc  year net', u'ohio mattress  may have lower 1st qtr net', u'am international inc  2nd qtr jan 31', u'brownforman inc  4th qtr net', u'national intergroup to offer permian units', u'economic spotlight  bankamerica ', u'national health enhancement  new program', u'dean foods  sees strong 4th qtr earnings', u'bonus wheat flour for north yemen   usda', u'credit card disclosure bills introduced', u'hughes capital unit signs pact with bear stearns', u'magma lowers copper 075 cent to 66 cts', u'brownforman  sets stock split ups payout', u'esquire radio and electronics in





In [3]:
def print_results(results, n, head=True):
    ''' Helper function to print the top (head=True) or bottom (head=False) n results.
    Each result is a [score, document index] pair; the index looks up items_title.
    '''
    if head:
        label, subset = 'Top', results[:n]
    else:
        label, subset = 'Bottom', results[-n:]
    print('\n%s %d from recall set of %d items:' % (label, n, len(results)))
    for r in subset:
        print('\t%0.2f - %s' % (r[0], items_title[r[1]]))

## Inverted Index

In [5]:
import math
from collections import Counter

def create_inverted_index(corpus):
    idx={}
    for i, doc in enumerate(corpus):
        for word in doc.split():
            if word in idx:
                if i in idx[word]:
                    idx[word][i] += 1
                else:    
                    idx[word][i] = 1
            else:
                idx[word] = {i:1}
    return idx
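
For a concrete picture of the structure such an index has, here is a small self-contained sketch. It repeats the same counting logic in compact form (using `dict.setdefault`, a stylistic substitution rather than the code above) under the hypothetical name `build_index`, on a three-title toy corpus invented for the example.

```python
def build_index(corpus):
    """Map each word to {document position: occurrence count}."""
    idx = {}
    for i, doc in enumerate(corpus):
        for word in doc.split():
            postings = idx.setdefault(word, {})
            postings[i] = postings.get(i, 0) + 1
    return idx

# Made-up titles; positions 0, 1, 2 act as document ids.
toy_corpus = [
    "japan trade talks",
    "oil prices rise",
    "trade deficit and trade talks",
]
toy_idx = build_index(toy_corpus)

print(toy_idx["trade"])  # postings for "trade": which documents, how many times
print(toy_idx["oil"])
```

The number of keys in a word's postings is its document frequency, and each value is the term frequency in that document, which are the two quantities a TF-IDF score needs.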



## TF-IDF

<p><b>Abstract: </b>Determining the aboutness of a document was improved by combining scores for the title and the content of the article. As a result, articles with numerous term matches in the title rank higher.</p>

<p><b>Challenges: </b>Two issues remain unresolved: synonym matching, and making use of the scraped date and place data.</p>
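
To make the combined scoring concrete, the sketch below works through the arithmetic on a hand-built example. It assumes the weighting tf × log(n / (1 + df)) and adds a body score and a title score per document with `Counter` addition. The tiny indexes and the helper name `tfidf_scores` are invented for illustration, and each index supplies its own document frequency here, which is an assumption of this example and may differ from the function in this notebook.

```python
import math
from collections import Counter

def tfidf_scores(query, index, n_docs):
    """Sum tf * log(n_docs / (1 + df)) over the query terms found in index."""
    scores = Counter()
    for term in query.split():
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(float(n_docs) / (1 + len(postings)))
        for doc, tf in postings.items():
            scores[doc] += tf * idf
    return scores

# Hand-built toy indexes over 4 hypothetical documents: word -> {doc: count}
body_index = {"japan": {0: 3, 2: 1}, "trade": {0: 2}}
title_index = {"japan": {0: 1}}

body = tfidf_scores("japan trade", body_index, 4)
title = tfidf_scores("japan trade", title_index, 4)
combined = body + title  # Counter addition sums the per-document scores

for doc, s in combined.most_common():
    print("doc %d: %.2f" % (doc, s))
```

Document 0 matches both query terms in its body and one of them in its title, so the title score is added on top of its body score and it ranks ahead of document 2, which only matches once in the body.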

In [7]:
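def get_results_tfidf(qry, idx_body, n, idx_title, nn):
    ''' Score documents against qry with TF-IDF over the body and title indexes
    '''
    score = Counter()   # body scores, keyed by document index
    score2 = Counter()  # title scores, keyed by document index
    for term in qry.split():
        if term in idx_body:
            i = math.log(float(n)/(1+len(idx_body[term])))
            for doc in idx_body[term]:
                score[doc] += idx_body[term][doc] * i

        if term in idx_title:
            # Title matches are weighted with the body document frequency;
            # .get() avoids a KeyError when the term occurs only in titles.
            i = math.log(float(n)/(1+len(idx_body.get(term, {}))))
            for doc in idx_title[term]:
                score2[doc] += idx_title[term][doc] * i

    # Counter addition keeps only documents whose combined score is positive
    score = score + score2

    # Build [score, document index] pairs and sort by descending score
    results = [[s, doc] for doc, s in score.items() if s > 0]
    sorted_results = sorted(results, key=lambda t: t[0], reverse=True)
    return sorted_results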

idx_body = create_inverted_index(items_body)
idx_title= create_inverted_index(items_title)


#results = get_results_tfidf('this is the end of the file', idx_body, len(items_body),idx_title, len(items_title))
#results = get_results_tfidf('japan the oil bankrupt this', idx_body, len(items_body),idx_title, len(items_title))
#results = get_results_tfidf('stock market decline', idx_body, len(items_body),idx_title, len(items_title))
results = get_results_tfidf('japan trade talk', idx_body, len(items_body),idx_title, len(items_title))

print_results(results,10)


Top 10 from recall set of 97 items:
	69.71 - japan us set to begin highlevel trade talks
	36.81 - brazil criticises advisory committee structure
	33.75 - china calls for better trade deal with us
	20.90 - us wheat bonus to soviet called dormant
	20.65 - japan likely to let us banks deal securities
	19.39 - us asks japan to end agriculture import controls
	19.39 - us asks japan end agriculture import controls
	17.21 - pemex signs 500 mln dlr japan loan for pipeline
	16.58 - japan to try to open market to us car parts
	14.06 - talks show new canadian confidence group says
