# Search and Ranking

Algorithms for full-text search are among the most important collective intelligence algorithms, and many fortunes have been made by new ideas in this field. It is widely believed that **Google**’s rapid rise from an academic project to the world’s most popular search engine was based largely on the PageRank algorithm, a variation of which you’ll learn about in this chapter.

Throughout this chapter, you’ll learn all the necessary steps to crawl, index, and search a set of pages, and even rank their results in many different ways.

## What’s in a Search Engine?

The first step in creating a search engine is to develop a way to **collect** the documents.
In some cases, this will involve **crawling** (starting with a small set of documents and following links to others) and in other cases it will begin with a fixed collection of documents, perhaps from a corporate intranet.

After you collect the documents, they need to be **indexed**. This usually involves creating a big table of the documents and the locations of all the different words. Depending on the particular application, the documents themselves do not necessarily have to be stored in a database; the index simply has to store a reference (such as a file system path or URL) to their locations.
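
The "big table" of words and locations can be pictured as an inverted index. The following is a minimal in-memory sketch, not the chapter's implementation: the function name `build_index`, the example URLs, and the choice to split on non-alphanumeric characters are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each word to a list of (document reference, word position) pairs."""
    index = defaultdict(list)
    for ref, text in documents.items():
        # Assumed tokenization: lowercase, split on non-alphanumeric characters
        words = [w.lower() for w in re.split(r'\W+', text) if w]
        for position, word in enumerate(words):
            index[word].append((ref, position))
    return index


# Hypothetical documents, keyed by the reference the index would store
documents = {
    'http://example.com/a': 'Search engines index pages',
    'http://example.com/b': 'Pages link to other pages',
}
index = build_index(documents)
print(index['pages'])
```

Note that only the reference (here a URL) is stored alongside each position, so the documents themselves can live anywhere.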

The final step is, of course, **returning** a ranked list of documents from a query. Retrieving every document with a given set of words is fairly straightforward once you have an index, but the real magic is in **how the results are sorted**. This chapter will look at several metrics based on the content of the page, such as word **frequency**, and then cover metrics based on information external to the content of the page, such as the PageRank algorithm, which looks at **how other pages link to the page in question**.
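
To make the idea of a content-based metric concrete, here is a rough sketch of ranking by raw word frequency. It is a simplification under stated assumptions: the helper `frequency_scores` and the sample documents are hypothetical, only documents containing every query word are kept, and the score is an unnormalized count.

```python
import re


def frequency_scores(documents, query):
    """Rank documents that contain every query word by how often those words occur."""
    terms = query.lower().split()
    scores = {}
    for ref, text in documents.items():
        words = [w.lower() for w in re.split(r'\W+', text) if w]
        counts = [words.count(term) for term in terms]
        # Keep only documents that contain all of the query words
        if all(counts):
            scores[ref] = sum(counts)
    # Highest score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical documents for illustration
documents = {
    'doc1': 'Python is a language. Python is popular.',
    'doc2': 'A snake such as a python is not a language.',
    'doc3': 'Ranking results is the hard part.',
}
ranked = frequency_scores(documents, 'python language')
print(ranked)
```

A raw count favors long documents, which is one reason a real ranking function would usually normalize or combine several metrics.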

To work through the examples in this chapter, you’ll need to create a Python module called searchengine, which has two classes: 
> * one for crawling and creating the database, 
> * and the other for doing full-text searches by querying the database. 

The examples will use SQLite, but they can easily be adapted to work with a traditional client-server database.
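
As a rough illustration of what storing an index in SQLite can look like, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The table names (`pages`, `word_locations`), their columns, and the helper `add_page` are assumptions for this example and are not the schema the chapter goes on to define.

```python
import sqlite3

# An in-memory database keeps the example self-contained
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE pages (id INTEGER PRIMARY KEY, url TEXT UNIQUE)')
con.execute('CREATE TABLE word_locations (page_id INTEGER, word TEXT, location INTEGER)')


def add_page(con, url, words):
    """Store a page reference and the position of each of its words."""
    cur = con.execute('INSERT INTO pages (url) VALUES (?)', (url,))
    page_id = cur.lastrowid
    con.executemany(
        'INSERT INTO word_locations (page_id, word, location) VALUES (?, ?, ?)',
        [(page_id, word, i) for i, word in enumerate(words)])
    con.commit()
    return page_id


add_page(con, 'http://example.com/a', ['search', 'engine'])
add_page(con, 'http://example.com/b', ['ranking', 'search', 'results'])

# Find every page (and word position) where 'search' occurs
rows = con.execute(
    'SELECT pages.url, word_locations.location FROM word_locations '
    'JOIN pages ON pages.id = word_locations.page_id '
    'WHERE word_locations.word = ? ORDER BY pages.url', ('search',)).fetchall()
print(rows)
```

Because the queries are plain SQL with parameter placeholders, moving to a client-server database would mostly mean swapping the connection object.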

In [1]:
class crawler:
    # Initialize the crawler with the name of the database
    def __init__(self, dbname):
        pass

    def __del__(self):
        pass

    def dbcommit(self):
        pass

    # Auxiliary function for getting an entry id and adding
    # it if it's not present
    def getentryid(self, table, field, value, createnew=True):
        return None

    # Index an individual page
    def addtoindex(self, url, soup):
        print 'Indexing %s' % url

    # Extract the text from an HTML page (no tags)
    def gettextonly(self, soup):
        return None

    # Separate the words by any non-alphanumeric character
    def separatewords(self, text):
        return None

    # Return True if this url is already indexed
    def isindexed(self, url):
        return False

    # Add a link between two pages
    def addlinkref(self, urlFrom, urlTo, linkText):
        pass

    # Starting with a list of pages, do a breadth-first
    # search to the given depth, indexing pages as we go
    def crawl(self, pages, depth=2):
        pass

    # Create the database tables
    def createindextables(self):
        pass


## A Simple Crawler

I’ll assume for now that you don’t have a big collection of HTML documents sitting on your hard drive waiting to be indexed, so I’ll show you how to build a simple crawler. It will be seeded with a small set of pages to index and will then follow any links on those pages to find other pages, whose links it will also follow. This process is called crawling or spidering.
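
The seed-and-follow process can be tried out without touching the network. The sketch below walks a hypothetical in-memory link graph breadth-first; the function `simulate_crawl` and the page names are assumptions for illustration, and `depth` is taken to mean the number of rounds of pages visited.

```python
def simulate_crawl(link_graph, seeds, depth=2):
    """Breadth-first traversal: return pages in the order they would be indexed."""
    indexed = []
    seen = set(seeds)
    pages = list(seeds)
    for _ in range(depth):
        newpages = []
        for page in pages:
            indexed.append(page)
            # Queue every link not seen before for the next round
            for url in link_graph.get(page, []):
                if url not in seen:
                    seen.add(url)
                    newpages.append(url)
        pages = newpages
    return indexed


# Hypothetical link structure: each page maps to the pages it links to
link_graph = {
    'A': ['B', 'C'],
    'B': ['D'],
    'C': ['A', 'D'],
    'D': ['E'],
}
order = simulate_crawl(link_graph, ['A'], depth=2)
print(order)
```

With `depth=2` the seed page and the pages it links to are visited; the pages discovered in the second round are found but not yet visited.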

To do this, your code will have to download the pages, pass them to the indexer (which you’ll build in the next section), and then parse the pages to find all the links to the pages that have to be crawled next. Fortunately, there are a couple of libraries that can help with this process. 
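
The link-finding step can be sketched with nothing but the standard library. Note the substitution: this example uses Python 3's `html.parser` and `urllib.parse` rather than the libraries the chapter relies on, and the class name `LinkCollector`, the sample HTML, and the example URLs are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the absolute URL of every <a href="..."> seen in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                # Resolve relative links against the page's own URL
                self.links.append(urljoin(self.base_url, href))


# Hypothetical page content; a real crawler would download this
html = ('<p><a href="/wiki/Python">Python</a> and '
        '<a href="http://example.org/">elsewhere</a> '
        '<a name="x">no href here</a></p>')
collector = LinkCollector('http://example.com/index.html')
collector.feed(html)
print(collector.links)
```

Anchors without an `href` attribute are skipped, since they give the crawler nowhere to go.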

For the examples in this chapter, I have set up a copy of several thousand files from Wikipedia, which will remain static at http://kiwitobes.com/wiki. 

You’re free to run the crawler on any set of pages you like, but you can use this site if you want to compare your results to those in this chapter.



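### Using urllib2

urllib2 is a library bundled with Python 2 that downloads a page when you supply its URL. The cell below fetches a page and prints the first 100 characters of its contents.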

In [2]:
import urllib2
# Uncomment the next three lines to route requests through a local proxy
# proxy = 'http://127.0.0.1:3128'
# opener = urllib2.build_opener(urllib2.ProxyHandler({'http': proxy}))
# urllib2.install_opener(opener)
c = urllib2.urlopen('http://www.sina.com.cn')
contents = c.read()
print contents[0:100]

<!DOCTYPE html>
<!-- [ published at 2017-01-05 22:09:32 ] -->
<html>
<head>
    <meta http-equiv="Co


### Crawler Code

Using urllib2 and Beautiful Soup you can build a crawler that will take a list of URLs to index and crawl their links to find other pages to index. 

In [3]:
import urllib2
from BeautifulSoup import BeautifulSoup
from urlparse import urljoin

# Create a list of words to ignore
ignorewords = set(['the', 'of', 'to', 'and', 'a', 'in', 'is', 'it'])

In [4]:
# Starting with a list of pages, do a breadth-first
# search to the given depth, indexing pages as we go.
# This is a method of the crawler class; it replaces the stub above.
def crawl(self, pages, depth=2):
    for i in range(depth):
        newpages = set()
        for page in pages:
            try:
                c = urllib2.urlopen(page)
            except Exception:
                print "Could not open %s" % page
                continue
            try:
                soup = BeautifulSoup(c.read())
                self.addtoindex(page, soup)

                links = soup('a')
                for link in links:
                    if 'href' in dict(link.attrs):
                        # Resolve relative links against the current page
                        url = urljoin(page, link['href'])
                        if url.find("'") != -1:
                            continue
                        url = url.split('#')[0]  # remove location portion
                        if url[0:4] == 'http' and not self.isindexed(url):
                            newpages.add(url)
                        linkText = self.gettextonly(link)
                        self.addlinkref(page, url, linkText)

                self.dbcommit()
            except Exception:
                print "Could not parse page %s" % page

        # The pages found in this round are crawled in the next one
        pages = newpages
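
The URL handling inside the loop is easy to check in isolation. The sketch below folds those steps into one hypothetical helper, `normalize_link`, written for Python 3 (where `urljoin` lives in `urllib.parse`). It is a simplification: in the cell above, links that do not start with `http` are still recorded with `addlinkref`, whereas this helper simply returns `None` for them.

```python
from urllib.parse import urljoin


def normalize_link(page, href):
    """Return the absolute, fragment-free URL for href, or None if it should be skipped."""
    url = urljoin(page, href)       # resolve relative links against the page
    if "'" in url:                  # skip URLs containing a single quote
        return None
    url = url.split('#')[0]         # remove the location (fragment) portion
    if not url.startswith('http'):  # keep only http/https links
        return None
    return url


# Hypothetical page URL for illustration
page = 'http://example.com/wiki/Main.html'
print(normalize_link(page, 'Python.html#History'))
print(normalize_link(page, '/about'))
print(normalize_link(page, 'mailto:someone@example.com'))
```

Stripping the fragment matters because `Python.html#History` and `Python.html` are the same document and should only be indexed once.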

In [5]:
# import mySearchengine
# reload(mySearchengine)
# pagelist=['http://www.google.com']
# crawler=mySearchengine.crawler('')
# crawler.crawl(pagelist)

In [6]:
import urllib2
from BeautifulSoup import BeautifulSoup
from urlparse import urljoin
# Uncomment the next three lines to route requests through a local proxy
# proxy = 'http://127.0.0.1:3128'
# opener = urllib2.build_opener(urllib2.ProxyHandler({'http': proxy}))
# urllib2.install_opener(opener)
c = urllib2.urlopen('http://www.baidu.com')
contents = c.read()
soup = BeautifulSoup(contents)
# Calling the soup with a tag name returns every matching element
spans = soup('span')
for span in spans:
    print span

<span class="bg s_ipt_wr"><input id="kw" name="wd" class="s_ipt" value="" maxlength="255" autocomplete="off" /></span>
<span class="bg s_btn_wr"><input type="submit" id="su" value="百度一下" class="bg s_btn" /></span>
<span class="tools"><span id="mHolder"><div id="mCon"><span>输入法</span></div><ul id="mMenu"><li><a href="javascript:;" name="ime_hw">手写</a></li><li><a href="javascript:;" name="ime_py">拼音</a></li><li class="ln"></li><li><a href="javascript:;" name="ime_cl">关闭</a></li></ul></span></span>
<span id="mHolder"><div id="mCon"><span>输入法</span></div><ul id="mMenu"><li><a href="javascript:;" name="ime_hw">手写</a></li><li><a href="javascript:;" name="ime_py">拼音</a></li><li class="ln"></li><li><a href="javascript:;" name="ime_cl">关闭</a></li></ul></span>
<span>输入法</span>
