# CSE 7337 Information Retrieval and Web Search Project 1


                                                                               Name: Zheng Li
                                                                               Email: zli1@smu.edu
                                                                               Civil and Environmental Engineering

## 1. Project overview

    The goal of this project is to develop a web crawler that retrieves information from web pages. The specific requirements of this project are as follows:
    
    1. The input parameters of the crawler include the number of pages to retrieve and a list of stop words to exclude during text processing.

    2. Retrieve and save text content from URLs (.txt, .htm, .html, .php).

        i) PDF files are not included in this project, since converting a PDF file to text requires extra implementation, and the conversion is not consistent across all types of PDF files.

    3. The crawler should be robust to broken links and report the URLs of broken links.
    
    4. List the URLs of all pages in the test data and report all out-going links of the test data. In addition, display the contents of the <TITLE> tag.

    5. Implement duplicate detection.

    6. Save and list the URLs of graphics (gif, jpg, jpeg, png).
    
    7. Construct a term-document frequency matrix and report the top 20 most common words across the retrieved URLs.


## 2. Crawler structure

    The crawler is built primarily with the Python library Scrapy, together with other libraries (nltk, pandas, beautifulsoup).
    The crawler is configured as follows:
        1) The crawler's agent name is Zheng@SMU.
        2) The download delay of the crawler is 5 seconds.
        3) The crawler follows robots.txt if it exists in the root directory.
        4) The page limit for the crawler is set to 100 in this project.
        5) The stop words for this crawler come from the stop word list of the NLTK (Natural Language Toolkit: https://www.nltk.org/) Python library.
        6) The crawler only crawls URLs within the given domain (e.g. http://lyle.smu.edu/~fmoore).
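
    As a rough illustration, the configuration items above could be collected in one settings dictionary. The keys are standard Scrapy setting names and the values are taken from this report; `crawler_settings` is a hypothetical name, and `CLOSESPIDER_PAGECOUNT` is shown only as one possible way to express the page limit (the notebook itself uses a manual counter in the spider).

```python
# Hypothetical settings dictionary mirroring the configuration list above.
crawler_settings = {
    "USER_AGENT": "Zheng@SMU",       # 1) agent name
    "DOWNLOAD_DELAY": 5.0,           # 2) seconds to wait between requests
    "ROBOTSTXT_OBEY": True,          # 3) respect robots.txt
    "CLOSESPIDER_PAGECOUNT": 100,    # 4) page limit (alternative to a manual counter)
}

# 6) The domain restriction is a spider attribute rather than a setting.
allowed_domains = ["lyle.smu.edu", "s2.smu.edu"]

for key, value in crawler_settings.items():
    print(f"{key} = {value!r}")
```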
        
    URLs storage:
    
    The crawler uses two pipelines during the retrieving process: 
        1) Pipeline 1: save/append the retrieved information in JSON format. Once the crawler is closed, the content is saved as a JSON file. (See class JsonWriterPipeline for more detail.)
            i) Several types of information are saved while crawling: URL (URL), URL_format (string) such as .php, .htm, .txt, Status (numeric), Title (list of strings), Text (string), Images (list), SubLinks (list), ShouldNotRetrieve (list of external links that must not be crawled)
            ii) The majority of the tasks performed in this project are based on the data structure generated from this JSON. 
            iii) An overview of the data table is also included in the notebook. (See the results section.)
        2) Pipeline 2: crawl URLs and extract the target content and sublinks recursively. The crawling process is done in a recursive function. The steps are: 
            i) Crawler starts with a start_url
            ii) Crawler calls the [parse] function to extract text content and sublinks from the URL
            iii) Check whether the content of the URL has already been retrieved; if so, skip this page. (Duplicate detection)
            iv) If not, save the text content to a dictionary with its text as key and the corresponding URL as value.
            v) Loop through its sublinks; for each link, call the [parse] function recursively and repeat steps ii) through iv).
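
    The crawl steps above can be sketched without Scrapy as a small recursive function. This is only an illustration under stated assumptions: `crawl`, `fetch` and the in-memory `site` are hypothetical stand-ins for the spider, the HTTP download and the test pages, and Scrapy schedules requests through its engine rather than calling [parse] directly.

```python
def crawl(start_url, fetch, max_pages=100):
    """Recursive crawl with exact-text duplicate detection.

    `fetch` is a hypothetical callable returning (text, sublinks) for a URL.
    """
    content_seen = {}   # text content -> first URL that produced it
    visited = set()     # URLs already requested
    duplicates = []     # (duplicate_url, original_url) pairs

    def parse(url):
        # Stop at the page limit and never request the same URL twice.
        if url in visited or len(visited) >= max_pages:
            return
        visited.add(url)
        text, sublinks = fetch(url)                      # step ii)
        if text in content_seen:                         # step iii)
            duplicates.append((url, content_seen[text]))
            return
        content_seen[text] = url                         # step iv)
        for link in sublinks:                            # step v)
            parse(link)

    parse(start_url)                                     # step i)
    return content_seen, duplicates


# Toy in-memory "site" with made-up pages.
site = {
    "/index.htm": ("welcome page", ["/a.htm", "/copy.htm"]),
    "/a.htm": ("page a", ["/index.htm"]),
    "/copy.htm": ("welcome page", []),
}
seen, dups = crawl("/index.htm", lambda url: site[url])
print(dups)
```

    In this toy run, `/copy.htm` has the same text as `/index.htm`, so it is reported as a duplicate and its sublinks are not followed.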
     
    Dictionary to store URLs/contents:
    I stored the content of each URL as one giant string and used it as the key of a dictionary, with the URL saved as its value. The reason for this setting is the convenience of implementing duplicate detection (it is easier to search via keys than values). So to find the URL given the content, it is simply URL = dict[content]. 
   
             
    Sublink Extraction:
    
    All the hyperlinks within one URL are extracted using Scrapy's built-in LinkExtractor. The LinkExtractor only extracts links that lead to another text-based 'actual' page (including .txt). Therefore, hyperlinks that lead to certain files (.pdf, .xlsx) are not extracted. (e.g. the link https://s2.smu.edu/~fmoore/index-fall2017.htm will be extracted; the link https://s2.smu.edu/~fmoore/misc/algorithm-1-7.pdf will NOT be extracted.)
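
    The filtering behaviour described above can be imitated with the standard library. The sketch below is not Scrapy's implementation: `HrefCollector`, `extract_page_links` and the two-entry `SKIPPED_EXTENSIONS` set are hypothetical names chosen for illustration, whereas LinkExtractor skips a much longer default list of file extensions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import posixpath

# Illustrative subset of file extensions that should not be followed.
SKIPPED_EXTENSIONS = {".pdf", ".xlsx"}


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_page_links(html, base_url):
    """Return absolute links whose extension is not in SKIPPED_EXTENSIONS."""
    collector = HrefCollector()
    collector.feed(html)
    links = []
    for href in collector.hrefs:
        url = urljoin(base_url, href)
        extension = posixpath.splitext(urlparse(url).path)[1].lower()
        if extension not in SKIPPED_EXTENSIONS:
            links.append(url)
    return links


html = '<a href="index-fall2017.htm">Fall</a> <a href="misc/algorithm-1-7.pdf">PDF</a>'
print(extract_page_links(html, "https://s2.smu.edu/~fmoore/"))
```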
    
    
    Duplicate detection:
    
    An exact duplicate detection scheme has been implemented within this crawler, based on an exact comparison of the visible text content. The crawler reports a duplicate if two different URLs have exactly the same text content. The text content of each URL is stored in a dictionary of the form {'Text Content': URL} (text content is the key and URL is the value). This is a simple, straightforward way to perform duplicate detection, although there are a number of other ways to complete the same task. Note: this implementation only considers the visible text content, so any dissimilarity that lies only in the HTML structure will not be detected.
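
    To make the note about HTML structure concrete, here is a small sketch in which two pages with different markup but the same visible text are reported as duplicates. It assumes a standard-library `HTMLParser` in place of the BeautifulSoup extraction used in the notebook; `VisibleText`, `visible_text`, `find_exact_duplicates` and the sample pages are hypothetical.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text outside of script, style, head and title elements."""

    SKIP = {"script", "style", "head", "title"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    # Join the fragments and collapse runs of whitespace.
    return " ".join(" ".join(parser.parts).split())


def find_exact_duplicates(pages):
    """pages: iterable of (url, html) in crawl order."""
    content_seen = {}   # visible text -> first URL
    duplicates = []
    for url, html in pages:
        text = visible_text(html)
        if text in content_seen:
            duplicates.append((url, content_seen[text]))
        else:
            content_seen[text] = url
    return duplicates


pages = [
    ("/a.htm", "<html><head><title>A</title></head><body><p>Hello world</p></body></html>"),
    ("/b.htm", "<html><body><div><b>Hello</b> world</div></body></html>"),
]
print(find_exact_duplicates(pages))
```

    Because the key is the whole text, memory grows with page size; keying the dictionary on a digest of the text would be one alternative.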
    
    Text processing:
    
    Once the visible text content has been retrieved, I performed several text normalization/tokenization steps, in the following order: 
        1) remove non-breaking spaces e.g. \xa0
        2) remove \n\r\t
        3) lowercase all letters e.g. "C" -> "c"
        4) exclude stop words e.g. "an", "the"
        5) remove extra spaces e.g. "blue   sky" -> "blue sky"
        6) remove special characters e.g. "good!" -> "good"
        7) remove single letters e.g. "r" -> ""
        8) remove any tokens that start with a non-alphabetic character e.g. "4Huang" -> ""
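
    The normalization steps can be condensed into one function, as a sketch. It differs from the notebook code in two labeled ways: stop words are removed token by token instead of by string replacement, and every character outside a-z, 0-9 and underscore is treated as a special character. `STOP_WORDS` is a tiny made-up list, not the NLTK list, and `normalize` is a hypothetical name.

```python
import re
import unicodedata

# Tiny illustrative stop word list (the notebook uses NLTK's English list).
STOP_WORDS = {"a", "an", "the", "is", "of"}


def normalize(text, stop_words=STOP_WORDS):
    # 1) NFKD normalization turns the non-breaking space \xa0 into a plain space.
    text = unicodedata.normalize("NFKD", text)
    # 3) Lowercase all letters.
    text = text.lower()
    # 6) Delete quotes and commas, replace other special characters by a space.
    text = text.translate(str.maketrans("", "", "',\""))
    text = re.sub(r"[^a-z0-9_\s]", " ", text)
    # 2) and 5) split() discards \n, \r, \t and extra spaces.
    tokens = text.split()
    # 4), 7) and 8) drop stop words, single letters and tokens not starting with a letter.
    kept = [
        token for token in tokens
        if token not in stop_words and len(token) > 1 and token[0].isalpha()
    ]
    return " ".join(kept)


print(normalize("The  blue\xa0sky is good! 4Huang r C"))
```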
        
    Extra Implementation:
    
    One extra implementation is that I have overridden Scrapy's robots.txt middleware so that it works for this project. (See class RobotsTxtMiddleware in the source code.) The reasons are as follows:
        1) The default implementation of robots.txt is configured by setting "ROBOTSTXT_ENABLED" and "ROBOTSTXT_OBEY" to True. 
        2) The default code only looks for robots.txt in the ROOT directory of the host, which does not apply to this project: it only requests http://lyle.smu.edu/robots.txt, while for this project the robots.txt is located at http://lyle.smu.edu/~fmoore/robots.txt. 
        3) The edited parts are marked with comments in the source code for reference.
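
    The idea behind the modification can be shown with the standard library's `urllib.robotparser`. This is a simplified sketch, not the middleware itself: `split_for_subdir_robots` is a hypothetical helper, and the robots.txt rules and the `/dontgohere/` path are made up for the example rather than copied from the course site.

```python
from urllib import robotparser
from urllib.parse import urlparse


def split_for_subdir_robots(url):
    """Return (robots_url, path relative to the first sub-directory)."""
    parts = urlparse(url)
    # '/~fmoore/misc/x.htm' -> ['', '~fmoore', 'misc', 'x.htm']
    segments = parts.path.split("/")
    subdir = segments[1] if len(segments) > 1 else ""
    if subdir:
        robots_url = f"{parts.scheme}://{parts.netloc}/{subdir}/robots.txt"
    else:
        robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
    relative_path = "/" + "/".join(segments[2:])
    return robots_url, relative_path


# Made-up robots.txt body, parsed from lines instead of being downloaded.
rp = robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /dontgohere/"])

robots_url, path = split_for_subdir_robots(
    "http://lyle.smu.edu/~fmoore/dontgohere/badfile1.html"
)
print(robots_url)
print(path, rp.can_fetch("Zheng@SMU", path))
```

    The rules are matched against the path relative to the sub-directory, which is why the middleware passes that shortened path to `can_fetch` instead of the full request URL.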
        
        
        

## 3. Source Code

    The source code for this crawler is written in Jupyter Notebook with Python 3 kernel. Please see the source code and its implementation in the following section.

In [1]:
## This notebook is used as the submission for project 1: CSE 7337 Information Retrieval and Web Search
## Author: Zheng Li
## Ph.D candidate
## Civil and Environmental Engineering
## Southern Methodist University
## Part of the code is adapted and modified from online resources in the following:
## https://www.digitalocean.com/community/tutorials/how-to-crawl-a-web-page-with-scrapy-and-python-3
## https://www.jitsejan.nl/using-scrapy-in-jupyter-notebook.html
## https://doc.scrapy.org/en/latest/index.html

In [2]:
# Settings for notebook
try:
    import scrapy
except ImportError:
    # Install the package on first use (notebook shell escape)
    !pip install scrapy
    import scrapy
try:
    import pandas as pd
except ImportError:
    !pip install pandas
    import pandas as pd

from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider,Rule
from scrapy.exceptions import IgnoreRequest
from bs4 import BeautifulSoup
import urllib.request
import unicodedata
import logging
import sys

from nltk.corpus import stopwords
import json
import numpy as np

In [3]:
class JsonWriterPipeline(object):
    def open_spider(self, spider):
        print('json opened')
        self.file = open('crawl_result.j1', 'w')

    def close_spider(self, spider):
        print('json closed')
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item


In [4]:
"""
This is a middleware to respect robots.txt policies. This code is copied from the Scrapy source code. 
I modified this source code so that the program is able to look for a robots.txt that 
is not located in the root directory (e.g. /~fmoore/robots.txt).
"""

import logging

from six.moves.urllib import robotparser

from twisted.internet.defer import Deferred, maybeDeferred
from scrapy.exceptions import NotConfigured, IgnoreRequest
from scrapy.http import Request
from scrapy.utils.httpobj import urlparse_cached
from scrapy.utils.log import failure_to_exc_info
from scrapy.utils.python import to_native_str

logger = logging.getLogger(__name__)


class RobotsTxtMiddleware(object):
    DOWNLOAD_PRIORITY = 1000

    def __init__(self, crawler):
        if not crawler.settings.getbool('ROBOTSTXT_OBEY'):
            raise NotConfigured

        self.crawler = crawler
        self._useragent = crawler.settings.get('USER_AGENT')
        self._parsers = {}

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_request(self, request, spider):
        if request.meta.get('dont_obey_robotstxt'):
            return
        d = maybeDeferred(self.robot_parser, request, spider)
        d.addCallback(self.process_request_2, request, spider)
        return d

    def process_request_2(self, rp, request, spider):
        if len(urlparse_cached(request).path.split('/'))>0: # Added by Zheng
            path = '/'+'/'.join(urlparse_cached(request).path.split('/')[2:]) # Added by Zheng
        else:
            path = '/'+urlparse_cached(request).path # Added by Zheng
        if rp is None:
            return
#         if not rp.can_fetch(to_native_str(self._useragent), request.url):
#       Edit by Zheng, if robots locates at root directory, comment command below and uncomment above command
        if not rp.can_fetch(to_native_str(self._useragent), path):
            logger.debug("Forbidden by robots.txt: %(request)s",
                         {'request': request}, extra={'spider': spider})
            print(100*'*')
            print("URL Forbidden by robots.txt: %s"%request.url)
            print(100*'*')
            self.crawler.stats.inc_value('robotstxt/forbidden')
            raise IgnoreRequest("Forbidden by robots.txt")

    def robot_parser(self, request, spider):
        url = urlparse_cached(request)
        if len(url.path.split('/'))>1:
            subdir = url.path.split('/')[1] # Added by Zheng
            netloc = '/'.join([url.netloc,subdir]) # Added by Zheng
        else:
            netloc = url.netloc # Added by Zheng
#         netloc = url.netloc  # Commented by Zheng if robots located at root directory uncomment this line and comment above two lines
        if netloc not in self._parsers:
            self._parsers[netloc] = Deferred() 
            robotsurl = "%s://%s/robots.txt" % (url.scheme,netloc) 
            robotsreq = Request(
                robotsurl,
                priority=self.DOWNLOAD_PRIORITY,
                meta={'dont_obey_robotstxt': True}
            )
            dfd = self.crawler.engine.download(robotsreq, spider)
            dfd.addCallback(self._parse_robots, netloc)
            dfd.addErrback(self._logerror, robotsreq, spider)
            dfd.addErrback(self._robots_error, netloc)
            self.crawler.stats.inc_value('robotstxt/request_count')

        if isinstance(self._parsers[netloc], Deferred):
            d = Deferred()
            def cb(result):
                d.callback(result)
                return result
            self._parsers[netloc].addCallback(cb)
            return d
        else:
            return self._parsers[netloc]

    def _logerror(self, failure, request, spider):
        if failure.type is not IgnoreRequest:
            logger.error("Error downloading %(request)s: %(f_exception)s",
                         {'request': request, 'f_exception': failure.value},
                         exc_info=failure_to_exc_info(failure),
                         extra={'spider': spider})
        return failure

    def _parse_robots(self, response, netloc):
        self.crawler.stats.inc_value('robotstxt/response_count')
        self.crawler.stats.inc_value(
            'robotstxt/response_status_count/{}'.format(response.status))
        rp = robotparser.RobotFileParser(response.url)
        body = ''
        if hasattr(response, 'text'):
            body = response.text
        else:  # last effort try
            try:
                body = response.body.decode('utf-8')
            except UnicodeDecodeError:
                # If we found garbage, disregard it:,
                # but keep the lookup cached (in self._parsers)
                # Running rp.parse() will set rp state from
                # 'disallow all' to 'allow any'.
                self.crawler.stats.inc_value('robotstxt/unicode_error_count')
        # stdlib's robotparser expects native 'str' ;
        # with unicode input, non-ASCII encoded bytes decoding fails in Python2
        rp.parse(to_native_str(body).splitlines())

        rp_dfd = self._parsers[netloc]
        self._parsers[netloc] = rp
        rp_dfd.callback(rp)

    def _robots_error(self, failure, netloc):
        if failure.type is not IgnoreRequest:
            key = 'robotstxt/exception_count/{}'.format(failure.type)
            self.crawler.stats.inc_value(key)
        rp_dfd = self._parsers[netloc]
        self._parsers[netloc] = None
        rp_dfd.callback(None)

In [5]:
class cse7337Spider(CrawlSpider):
    ###################INPUT PARAMETERS######################
    count_MAX = 100
    # I am using the stop words collection from NLTK package
    stop_words = list(set(stopwords.words('english'))) 
    #########################################################
    name = 'cse7337_spider'
    allowed_domains = ['lyle.smu.edu','s2.smu.edu']
    start_urls = ['http://lyle.smu.edu/~fmoore']
    download_delay = 5.0
    httpcache_enabled = True
    count = 0
    handle_httpstatus_list = [404] 
    content_seen = {}
    custom_settings = {
        'LOG_LEVEL': logging.WARNING,
        'ITEM_PIPELINES': {'__main__.JsonWriterPipeline': 1}, # Used for pipeline 1
        'FEED_FORMAT':'json',                                 # Used for pipeline 2
        'FEED_URI': 'crawl_result.json',                        # Used for pipeline 2
        'ROBOTSTXT_ENABLED':True,
        'ROBOTSTXT_OBEY': True,                                 # obey robots.txt rule
        'DOWNLOADER_MIDDLEWARES' : {'__main__.RobotsTxtMiddleware':1000,
                                   'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware':None
                                   }
                    }            
    def parse(self,response):
        def strip_not_start_with_alphabets(string):
            str_list = string.split()
            new_string=[]
            for i in str_list:
                if i[0].isalpha():
                    new_string.append(i)    
            return ' '.join(new_string)
        def strip_stop_words(string,stop_words):
            for stop_word in stop_words:
                string = string.replace(' '+stop_word+' ',' ')
            return string
        def strip_single_letter(string):
            str_list = string.split()
            new_string=[]
            for i in str_list:
                if len(i)>1:
                    new_string.append(i)
            return ' '.join(new_string)    
        def strip_punctuation(string):
            x = ''.join(string.replace(',',''))
            x = x.replace("'",'')
            x = x.replace('"','')
            x = x.replace(':',' ')
            x = x.replace(';',' ')
            x = x.replace('!',' ')
            x = x.replace('(',' ')
            x = x.replace(')',' ')
            x = x.replace('.',' ')
            x = x.replace('-',' ')
            x = x.replace('/',' ')
            x = x.replace('@',' ')
            x = ' '.join(x.split())
            return x
        def text_processing(response):
            try:
                html = urllib.request.urlopen(response.url).read()  
                soup = BeautifulSoup(html, 'lxml')
                for tag in soup.find_all(['script', 'style','head', 'title']):
                    tag.decompose() 
                text = ' '.join(soup.findAll(text=True))
                #text= soup.getText(strip=False)
            except Exception:
                # Fall back to the text nodes of the Scrapy response
                text=response.xpath('(//text())').extract()
            #################### Text Processing########################
            # 1) remove non-breaking space e.g:\xa0,  
            text = unicodedata.normalize("NFKD",''.join(text))
            # 2) remove \n\r\t
            text = text.replace('\n'," ").replace('\r'," ").replace('\t'," ")
            # 3) lowercase all letters
            text = text.lower()
            # 4) remove stop words
            text = strip_stop_words(text,self.stop_words)
            # 5) remove extra spaces
            text = " ".join(text.split())
            # 6) remove special characters
            text = strip_punctuation(text)
            # 7) remove single letter
            text = strip_single_letter(text)
            # 8) remove any non-space strings that start with non-alphabets
            text = strip_not_start_with_alphabets(text)
            return text
            ############################################################
        if (self.count <self.count_MAX):
            print(100*'-')
            print('retrieve count: ',self.count,' => ',response.url)
            self.count = self.count + 1
            # detect link format
            if response.url.endswith('.txt'):
                url_format = 'txt'
            elif response.url.endswith('.php'):
                url_format = 'php'
            elif response.url.endswith('.htm'):
                url_format = 'htm'
            elif response.url.endswith('.html'):
                url_format = 'html'
            else:
                url_format = 'NaN'
            # extract/process all the text content from url
            text = text_processing(response)
            if text in self.content_seen.keys():
                print(23*'*')
                print('* Duplicate detected! *')
                print(23*'*')
                print(response.url,' duplicates==> ',self.content_seen[text])
            else:
                self.content_seen[text] = response.url   
                # extract all sublinks and image-links
                imgs = response.css('img ::attr(src)').extract()
                href = response.css('a ::attr(href)').extract()
                extractor_all = LinkExtractor(allow_domains=())
                links_all = extractor_all.extract_links(response)
                images = []
                sublinks = []
                for img in imgs:
                    images.append(response.urljoin(img))
                for img in href:
                    if img.endswith(('.jpg','.jpeg','.png','.gif')):
                        images.append(response.urljoin(img))
                for link in links_all:
                    sublinks.append(link.url)

                #Extract only internal links to follow (allow to crawl)allow=r'fr/'
                extractor_internal = LinkExtractor(allow=r'~fmoore/')
                links_internal = extractor_internal.extract_links(response)
                links_external =[]
                for link in links_all:
                    if (link in links_internal):
                        print('before internal link',response.url)
                        print('internal sublinks:',link.url)
                        yield scrapy.Request(link.url, callback=self.parse)
                    else:
                        print('external sublinks:',link.url)
                        links_external.append(link.url)

                yield {
                    'URL':response.url,
                    'URL_format':url_format,
                    'Status':response.status,
                    'Title': response.css('title::text').extract(),
                    'SubLinks':sublinks,
                    'Images': images,
                    'Text':  text,
                    'ShouldNotRetrieve': links_external 
                    }
        else:
            # The exception must be raised, otherwise the spider keeps running
            raise scrapy.exceptions.CloseSpider('Retrieve page limit reached')

In [6]:
def run_crawler():
    import os
    if os.path.isfile('crawl_result.json'):
        os.remove('crawl_result.json')
    if os.path.isfile('crawl_result.j1'):
        os.remove('crawl_result.j1')
    process = CrawlerProcess({
        'USER_AGENT': 'Zheng@SMU'
    })
    process.crawl(cse7337Spider)
    process.start()
run_crawler()

2018-03-18 23:26:27 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapybot)
2018-03-18 23:26:27 [scrapy.utils.log] INFO: Versions: lxml 3.7.3.0, libxml2 2.9.4, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.9.0, Python 3.6.1 |Anaconda custom (x86_64)| (default, May 11 2017, 13:04:09) - [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)], pyOpenSSL 17.0.0 (OpenSSL 1.0.2l  25 May 2017), cryptography 1.8.1, Platform Darwin-15.6.0-x86_64-i386-64bit
2018-03-18 23:26:27 [scrapy.crawler] INFO: Overridden settings: {'FEED_FORMAT': 'json', 'FEED_URI': 'crawl_result.json', 'LOG_LEVEL': 30, 'ROBOTSTXT_OBEY': True, 'USER_AGENT': 'Zheng@SMU'}


json opened
----------------------------------------------------------------------------------------------------
retrieve count:  0  =>  https://s2.smu.edu/~fmoore/
external sublinks: http://lyle.smu.edu
before internal link https://s2.smu.edu/~fmoore/
internal sublinks: http://lyle.smu.edu/~fmoore/index-fall2017.htm
before internal link https://s2.smu.edu/~fmoore/
internal sublinks: https://s2.smu.edu/~fmoore/schedule.htm
----------------------------------------------------------------------------------------------------
retrieve count:  1  =>  https://s2.smu.edu/~fmoore/schedule.htm
before internal link https://s2.smu.edu/~fmoore/schedule.htm
internal sublinks: https://s2.smu.edu/~fmoore/index.htm
before internal link https://s2.smu.edu/~fmoore/schedule.htm
internal sublinks: https://s2.smu.edu/~fmoore/index_duplicate.htm
before internal link https://s2.smu.edu/~fmoore/schedule.htm
internal sublinks: https://s2.smu.edu/~fmoore/does_not_exist.htm
before internal link https://s2.smu.ed

## 4. Results and Discussion

    The raw data was saved locally in JSON format and then read into a pandas DataFrame. The raw data contains the information of each URL that the crawler has requested. 

In [7]:
df = pd.read_json('crawl_result.json')
df = df[['URL','URL_format','Status','Title','Text','Images','SubLinks','ShouldNotRetrieve']]
print(df.shape)
df

(12, 8)


| | URL | URL_format | Status | Title | Text | Images | SubLinks | ShouldNotRetrieve |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | https://s2.smu.edu/~fmoore/ | | 200 | [Freeman Moore - SMU Spring 2018] | spring freeman moore phd email fmoore smu ed s... | [https://s2.smu.edu/~fmoore/SMU-CSE-LOGO-COLOR... | [http://lyle.smu.edu, http://lyle.smu.edu/~fmo... | [http://lyle.smu.edu] |
| 1 | https://s2.smu.edu/~fmoore/schedule.htm | htm | 200 | [SMU CSE 5337/7337 Spring 2018 Schedule] | cse preliminary schedule page maintained lates... | [https://s2.smu.edu/~fmoore/misc/permuterminde... | [https://s2.smu.edu/~fmoore/index.htm, https:/... | [http://www.gedpage.com/soundex.html, http://w... |
| 2 | https://s2.smu.edu/~fmoore/index-fall2017.htm | htm | 200 | [Freeman Moore - SMU Fall 2017] | fall freeman moore phd email fmoore lyle smu e... | [https://s2.smu.edu/~fmoore/SMU-CSE-LOGO-COLOR... | [http://lyle.smu.edu, http://lyle.smu.edu/~fmo... | [http://lyle.smu.edu, http://oracle.com.edgesu... |
| 3 | https://s2.smu.edu/~fmoore/misc/count_letters.txt | txt | 404 | [404 Not Found] | found found requested url misc count_letters t... | [] | [] | [] |
| 4 | https://s2.smu.edu/~fmoore/this_aint_gonna_wor... | htm | 404 | [404 Not Found] | found found requested url this_aint_gonna_work... | [] | [] | [] |
| 5 | https://s2.smu.edu/~fmoore/misc/count_letters_... | txt | 404 | [404 Not Found] | found found requested url misc count_letters_d... | [] | [] | [] |
| 6 | https://s2.smu.edu/~fmoore/does_not_exist.htm | htm | 404 | [404 Not Found] | found found requested url does_not_exist htm f... | [] | [] | [] |
| 7 | https://s2.smu.edu/~fmoore/misc/exam1.html | html | 200 | [CSE 7337 Spring 2018 distance students exam 1... | cse distance student exam location inclass vs ... | [] | [] | [] |
| 8 | https://s2.smu.edu/~fmoore/misc/useragent.php | php | 200 | [CSE 5337/7337 User-Agent] | html user agent information received crawler d... | [https://s2.smu.edu/~fmoore/SMU-CSE-LOGO-COLOR... | [] | [] |
| 9 | https://s2.smu.edu/~fmoore/misc/porter_stemmer... | html | 200 | [Porter Stemmer Online] | javascript porter stemmer online find porter s... | [] | [http://tartarus.org/~martin/PorterStemmer/, h... | [http://tartarus.org/~martin/PorterStemmer/, h... |


## URL of all pages (retrieved including broken links, excluding duplicate links)

    The links shown below are all the links extracted by the Scrapy built-in LinkExtractor and requested by the crawler. They include broken links and exclude duplicate links, which are reported in the duplicate detection section below. Note that links to non-text files are not included.
    
    One thing worth mentioning is that a couple of 'http' URLs have also been crawled but are not shown in the following list. The reason is that the crawler is automatically redirected from those links to 'https', e.g. "http://lyle.smu.edu/~fmoore" is redirected to "https://lyle.smu.edu/~fmoore". Although those links could be stored as well, I did not explicitly ask the crawler to save them, for the sake of simplicity. 

In [8]:
[print(x) for x in df.loc[:,'URL']];

https://s2.smu.edu/~fmoore/
https://s2.smu.edu/~fmoore/schedule.htm
https://s2.smu.edu/~fmoore/index-fall2017.htm
https://s2.smu.edu/~fmoore/misc/count_letters.txt
https://s2.smu.edu/~fmoore/this_aint_gonna_work.htm
https://s2.smu.edu/~fmoore/misc/count_letters_duplicate.txt
https://s2.smu.edu/~fmoore/does_not_exist.htm
https://s2.smu.edu/~fmoore/misc/exam1.html
https://s2.smu.edu/~fmoore/misc/useragent.php
https://s2.smu.edu/~fmoore/misc/porter_stemmer_example.html
https://s2.smu.edu/~fmoore/misc/levenshtein.html
https://s2.smu.edu/~fmoore/index-final.htm


## All out-going links (external links that were not retrieved)

    The out-going links are not allowed to be crawled. This is achieved by setting allowed_domains in the spider and an allow pattern (~fmoore/) in the LinkExtractor. 

In [9]:
out_going_links = []
for x in df.loc[:,'ShouldNotRetrieve']:
    if len(x)>0:
        for link in x:
            out_going_links.append(link) 
[print(x) for x in set(out_going_links)];

http://9ol.es/porter_js_demo.html
http://www.smu.edu/EnrollmentServices/Registrar/Enrollment/FinalExamSchedule/Spring2018
http://www.gedpage.com/soundex.html
http://lyle.smu.edu
http://en.wikipedia.org/wiki/Stop_words
http://lucene.apache.org/core/
http://oracle.com.edgesuite.net/timeline/oracle/
http://en.wikipedia.org/wiki/Tf*idf
http://search.carrot2.org/stable/search
http://tartarus.org/~martin/PorterStemmer/
https://smu.instructure.com
http://en.wikipedia.org/wiki/Document_classification


## Display the contents of the TITLE tag

    The title of each retrieved URL is extracted using the CSS selector 'title::text'.
    
    Note that I did not include titles for links to PDF, image or xlsx files, simply because those files do not have HTML markup.

In [10]:
[print(x.ljust(55),':',y) for x,y in zip(df.loc[:,'URL'],df.loc[:,'Title'])];

https://s2.smu.edu/~fmoore/                             : ['Freeman Moore - SMU Spring 2018']
https://s2.smu.edu/~fmoore/schedule.htm                 : ['SMU CSE 5337/7337 Spring 2018 Schedule']
https://s2.smu.edu/~fmoore/index-fall2017.htm           : ['Freeman Moore - SMU Fall 2017']
https://s2.smu.edu/~fmoore/misc/count_letters.txt       : ['404 Not Found']
https://s2.smu.edu/~fmoore/this_aint_gonna_work.htm     : ['404 Not Found']
https://s2.smu.edu/~fmoore/misc/count_letters_duplicate.txt : ['404 Not Found']
https://s2.smu.edu/~fmoore/does_not_exist.htm           : ['404 Not Found']
https://s2.smu.edu/~fmoore/misc/exam1.html              : ['CSE 7337 Spring 2018 distance students exam 1 location']
https://s2.smu.edu/~fmoore/misc/useragent.php           : ['CSE 5337/7337 User-Agent']
https://s2.smu.edu/~fmoore/misc/porter_stemmer_example.html : ['Porter Stemmer Online']
https://s2.smu.edu/~fmoore/misc/levenshtein.html        : ['Levenshtein Distance demo']
https://s2.smu.edu/~fmoor

## Implement exact duplicate detection, report if any URLs refer to already seen content

    Duplicate detection is implemented, and the duplicate URLs are shown in the program running log above. 

## List all broken links within test data

      Once the crawler sends a request to a broken link, it receives a response status other than 200. So broken links are found by selecting all the requested links whose status is not equal to 200. 

In [11]:
[print(x) for x in df.loc[df['Status']!=200,'URL']];

https://s2.smu.edu/~fmoore/misc/count_letters.txt
https://s2.smu.edu/~fmoore/this_aint_gonna_work.htm
https://s2.smu.edu/~fmoore/misc/count_letters_duplicate.txt
https://s2.smu.edu/~fmoore/does_not_exist.htm


## List the URLs of graphics (gif, jpg, jpeg, png)

    The URLs of graphics are found using a combination of the CSS selectors "img ::attr(src)" and "a ::attr(href)".

In [12]:
from IPython.display import Image
from IPython.core.display import HTML 
graphic = []
for x in df.loc[:,'Images']:
    if len(x)>0:
        for link in x:
            graphic.append(link) 
[print(x) for x in set(graphic)];

https://s2.smu.edu/~fmoore/misc/bigram-example.jpg
https://s2.smu.edu/~fmoore/SMU-CSE-LOGO-COLOR.png
https://s2.smu.edu/~fmoore/SMU-CSE-LOGO-COLOR.gif
https://s2.smu.edu/~fmoore/misc/permutermindex-example.jpg
https://s2.smu.edu/~fmoore/SMU-CSE-LOGO.gif


## Term document Frequency matrix

##  Before/After text normalization, cleaning example

    Please see 2. Crawler structure for more details about text processing. 

In [13]:
docs = df.loc[df['Status']==200,['Text']]
docs.loc[0,'Text']

'spring freeman moore phd email fmoore smu ed spring cse fall cse keep looking course calendar latest information spring th caruth syllabus syllabus contents web site sole responsibility dr freeman moore necessarily represent opinions policies southern methodist university administrator site dr freeman moore may contacted fmoore smu edu'

##  Implement stemming

    The stemming and the following term-document frequency matrix analysis were implemented using the Python libraries NLTK, scikit-learn and pandas.

In [14]:
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
ps = PorterStemmer()
for idx in docs.index:
    stemmed=[]
    for i in word_tokenize(docs.loc[idx,'Text']):
        stemmed.append(ps.stem(i))
    docs.loc[idx,'Text']=' '.join(stemmed)
docs.loc[0,'Text']

'spring freeman moor phd email fmoor smu ed spring cse fall cse keep look cours calendar latest inform spring th caruth syllabu syllabu content web site sole respons dr freeman moor necessarili repres opinion polici southern methodist univers administr site dr freeman moor may contact fmoor smu edu'

In [15]:
from sklearn.feature_extraction.text import CountVectorizer
from collections import Counter
count_vect = CountVectorizer() # an object capable of counting words in a document!
summary_text = docs['Text'].values.tolist()
bag_words = count_vect.fit_transform(summary_text)

In [16]:
print(len(count_vect.vocabulary_),'vocabulary in total')
print(count_vect.vocabulary_)

215 vocabulary in total
{'spring': 175, 'freeman': 71, 'moor': 122, 'phd': 138, 'email': 52, 'fmoor': 69, 'smu': 168, 'ed': 50, 'cse': 37, 'fall': 61, 'keep': 101, 'look': 109, 'cours': 34, 'calendar': 17, 'latest': 103, 'inform': 93, 'th': 192, 'caruth': 20, 'syllabu': 183, 'content': 32, 'web': 206, 'site': 167, 'sole': 170, 'respons': 159, 'dr': 48, 'necessarili': 124, 'repres': 155, 'opinion': 128, 'polici': 140, 'southern': 172, 'methodist': 120, 'univers': 199, 'administr': 1, 'may': 117, 'contact': 31, 'edu': 51, 'preliminari': 144, 'schedul': 162, 'page': 134, 'maintain': 114, 'activ': 0, 'date': 40, 'topic': 196, 'jan': 98, 'overview': 133, 'introduct': 95, 'ir': 97, 'chpt': 23, 'hmwk': 80, 'assign': 10, 'dark': 38, 'articl': 9, 'feb': 63, 'boolean': 13, 'retriev': 160, 'term': 189, 'post': 143, 'list': 107, 'algorithm': 5, 'internet': 94, 'advertis': 2, 'porter': 141, 'stemmer': 178, 'crawl': 35, 'due': 49, 'mercat': 119, 'score': 163, 'vector': 203, 'space': 173, 'model': 12

In [17]:
pd.options.display.max_rows =300
columns=['doc'+str(i+1) for i in range(0,bag_words.shape[0])]
tdf = pd.DataFrame(data=np.transpose(bag_words.toarray()),
                   index=count_vect.get_feature_names(),
                   columns=columns)

####  Term-document Matrix

    I simply named each unique URL doc1, doc2, ... according to its order among the successfully retrieved (status 200) pages.

In [18]:
tdf

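| | doc1 | doc2 | doc3 | doc4 | doc5 | doc6 | doc7 | doc8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| activ | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| administr | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| advertis | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| agent | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| alejandro | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| algorithm | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| alon | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| analysi | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| apr | 0 | 8 | 0 | 0 | 0 | 0 | 0 | 0 |
| articl | 0 | 3 | 1 | 0 | 0 | 0 | 0 | 0 |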


## Report the 20 most common stemmed words and their frequency across documents

In [19]:
tdf['Term Frequency']=tdf[columns].sum(axis=1)

In [20]:
tdf_sorted = tdf.sort_values(by='Term Frequency',ascending=False)
tdf_sorted.head(20) # show top 20 most common stemmed word

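| | doc1 | doc2 | doc3 | doc4 | doc5 | doc6 | doc7 | doc8 | Term Frequency |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| chpt | 0 | 14 | 0 | 0 | 0 | 0 | 0 | 0 | 14 |
| hmwk | 0 | 11 | 0 | 0 | 0 | 0 | 0 | 0 | 11 |
| freeman | 3 | 0 | 3 | 0 | 0 | 0 | 0 | 3 | 9 |
| moor | 3 | 0 | 3 | 0 | 0 | 0 | 0 | 3 | 9 |
| mar | 0 | 9 | 0 | 0 | 0 | 0 | 0 | 0 | 9 |
| cse | 2 | 1 | 2 | 1 | 0 | 0 | 0 | 3 | 9 |
| due | 0 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 8 |
| site | 2 | 0 | 2 | 0 | 0 | 2 | 0 | 2 | 8 |
| may | 1 | 5 | 1 | 0 | 0 | 0 | 0 | 1 | 8 |
| apr | 0 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 8 |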
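
    The 'Term Frequency' column above is the total count of a term over all documents. Document frequency, the number of documents that contain the term, is a different quantity, and a word such as chpt can rank first while occurring in a single document. The sketch below computes both with the standard library instead of CountVectorizer; `term_and_document_frequency` and the three sample documents are hypothetical.

```python
from collections import Counter


def term_and_document_frequency(docs):
    """docs: list of already normalized and stemmed strings."""
    term_freq = Counter()   # total occurrences over all documents
    doc_freq = Counter()    # number of documents containing the term
    for doc in docs:
        tokens = doc.split()
        term_freq.update(tokens)
        doc_freq.update(set(tokens))   # count each term once per document
    return term_freq, doc_freq


# Made-up documents for illustration only.
sample_docs = [
    "spring freeman moor cse cse",
    "cse schedul chpt chpt chpt",
    "fall freeman moor",
]
tf, df_counts = term_and_document_frequency(sample_docs)
for term, count in tf.most_common(3):
    print(term, "term frequency:", count, "document frequency:", df_counts[term])
```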
