# Problem Statement
Using Python's multiprocessing module together with either threading or gevent, the task is to write a web scraper that takes a huge file (1 million rows) as input, with one URL on each line.

The scraper then uses BeautifulSoup to parse the content and checks whether it contains "jquery.js". If it does, the URL is written to the file "accepted.csv"; otherwise it is written to "rejected.csv".

In a nutshell, the scraper classifies whether a site uses jQuery or not.

The scraper should perform as efficiently as possible, utilizing the maximum compute power of any machine (it's fine if some tuning is hardcoded), and it should process all the rows and reach the end state gracefully (i.e. exception handling, honey-trap defense, etc.).

The prerequisite for judging is stable code that runs smoothly for a couple of hundred rows. The deciding factor is the efficiency and speed of processing the batch of 1 million URLs without compromising stability.
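One possible layout for the requirements above, as a minimal sketch rather than the implementation used in this notebook: the input is read lazily in chunks, each chunk goes to a worker process, and each process classifies its URLs with a small thread pool. The names `read_chunks`, `classify_chunk`, `classify_file`, and the `uses_jquery` callback are hypothetical, and the pool sizes are arbitrary assumptions.

```python
import functools
import itertools
import multiprocessing
from multiprocessing.pool import ThreadPool


def read_chunks(lines, chunk_size):
    """Lazily yield lists of at most chunk_size stripped, non-empty URLs."""
    urls = (line.strip() for line in lines)
    urls = (url for url in urls if url)
    while True:
        chunk = list(itertools.islice(urls, chunk_size))
        if not chunk:
            return
        yield chunk


def classify_chunk(urls, uses_jquery, workers=8):
    """Classify one chunk with a thread pool; a failing URL never aborts the chunk."""
    def safe(url):
        try:
            return url, bool(uses_jquery(url))
        except Exception:
            # Assumption: any failure (timeout, bad HTML, ...) counts as rejected
            return url, False
    pool = ThreadPool(workers)
    try:
        return pool.map(safe, urls)
    finally:
        pool.close()
        pool.join()


def classify_file(path, uses_jquery, processes=4, chunk_size=1000):
    """Fan chunks out to worker processes and yield (url, accepted) pairs.

    uses_jquery must be a picklable, top-level function.
    """
    worker = functools.partial(classify_chunk, uses_jquery=uses_jquery)
    pool = multiprocessing.Pool(processes)
    try:
        with open(path) as handle:
            chunks = read_chunks(handle, chunk_size)
            for results in pool.imap_unordered(worker, chunks):
                for item in results:
                    yield item
    finally:
        pool.close()
        pool.join()
```

Because `classify_chunk` catches every exception per URL, one unreachable site cannot stop the run, which is one way to read the "reach the end state gracefully" requirement.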

In [506]:
import mechanize
from BeautifulSoup import BeautifulSoup, SoupStrainer
import urllib2
import pandas as pd
import re
import gevent

# Create Input file through web scraping

In [524]:
urlList = []
def get_url_list(rootURL):
    """
    Scrape URLs from the script, link and anchor tags of a web page
    """
    br = mechanize.Browser()
    # ignore robots
    br.set_handle_robots(False)
    try:
        response = br.open(rootURL)
    except Exception:
        print "Failed to scrape URL: %s" % (rootURL)
        response = None
    if response:
        for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer(('script','link','a'))): 
            url = None
            if link.has_key('src'):
                url = link['src']
            elif link.has_key('href'):
                url = link['href']
            if url:
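                if not re.match("(www|http)", url):
                    # Prefix relative URLs with the root URL
                    url = "/".join([rootURL, url.strip("/")])
                # Also store a version-stripped variant, e.g. jquery-1.3.0.js -> jquery.js.
                # Note: the pattern is applied to the whole URL (first 6 matches), so dots
                # in the host name are removed as well
                if 'jquery' in url:
                    version_removed_url = re.sub('[-0-9.]', '', url, 6)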
                    urlList.append(version_removed_url)
                urlList.append(url)
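The relative-URL handling and the version stripping can also be done with the standard library's URL helpers. The sketch below is an alternative under stated assumptions, not the code that produced the outputs in this notebook: `resolve_url` and `strip_jquery_version` are hypothetical helper names, and the version pattern assumes file names of the form `jquery-<digits>.<digits>...`.

```python
import posixpath
import re

try:
    from urllib.parse import urljoin, urlsplit, urlunsplit  # Python 3
except ImportError:
    from urlparse import urljoin, urlsplit, urlunsplit  # Python 2


def resolve_url(root_url, link):
    """Resolve a possibly relative link against the page it was found on."""
    return urljoin(root_url, link)


def strip_jquery_version(url):
    """Rewrite .../jquery-1.3.0.js as .../jquery.js, leaving the host untouched."""
    parts = urlsplit(url)
    directory, filename = posixpath.split(parts.path)
    # Only the file name is rewritten, so dots in the host name survive
    filename = re.sub(r"^jquery-\d+(\.\d+)*", "jquery", filename)
    return urlunsplit(parts._replace(path=posixpath.join(directory, filename)))
```

Note that `urljoin` treats a link such as `www.example.com/x` (no scheme) as relative, so such links would need separate handling.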

# The inputURLS list can be changed to any set of seed URLs

In [508]:
inputURLS = ['https://code.jquery.com/jquery','http://www.adcuratio.com','http://www.golabs.in','http://www.amazon.in']
for inputURL in inputURLS:
    get_url_list(inputURL)

# Dump all URL into input.csv file

In [509]:
pd.DataFrame({'URL':list(set(urlList))}).to_csv('input.csv',index=None)

# Load input.csv file

In [510]:
inputFile = pd.read_csv('input.csv')

In [511]:
pd.set_option('display.max_colwidth',1000)

In [512]:
inputFile.shape

(534, 1)

In [513]:
inputFile.head()

Unnamed: 0,URL
0,https://code.jquery.com/jquery/jquery-1.4.2.js
1,https://code.jquery.com/jquery/jquery-1.11.0.min.js
2,http://www.golabs.in/static/css/style.css
3,https://code.jquery.com/jquery/jquery-3.0.0.js
4,http://contribute.jquery.org/


# Code to parse input.csv file content

In [514]:
inputCSVURLList = [str(url) for url in inputFile['URL']]
inputURLLength = len(inputCSVURLList)

In [515]:
print "Total No Of URL %s" %(inputURLLength)

Total No Of URL 534


In [516]:
def check_accepted_rejected_url(URL):
    """
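    Return True if the URL contains 'jquery.js', otherwise False
    """
    accepted = False
    # Note: BeautifulSoup parses the string it is given, so passing a URL
    # checks the URL text itself; no page is fetched here
    soup = BeautifulSoup(URL)
    #matched = soup.findAll(text=re.compile('jquery.js'))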
    text = soup.text
    if 'jquery.js' in text:
        accepted = True
    return accepted
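To classify by page content instead, the HTML would first have to be downloaded and then inspected. The sketch below covers only the inspection step and makes two assumptions: it uses the standard library's `HTMLParser` as a stand-in for BeautifulSoup so that it runs without extra packages, and it interprets "contains jquery.js" as "some script tag has a src mentioning jquery.js". `ScriptSrcCollector` and `page_uses_jquery` are hypothetical names.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class ScriptSrcCollector(HTMLParser):
    """Collect the src attribute of every <script> tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def page_uses_jquery(html):
    """Return True if any script src in the HTML mentions jquery.js."""
    collector = ScriptSrcCollector()
    collector.feed(html)
    return any("jquery.js" in src for src in collector.sources)
```

Here `html` is expected to be the downloaded page body as a string; downloading it (with timeouts and error handling) is left out of the sketch.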
         

In [534]:
def classify_accept_rejected_url(startIndex,endIndex):
    """
    Classify each URL in the given slice and append it to accepted.csv or rejected.csv
    """
    print "Start index %s and End index %s" %(startIndex,endIndex)
    inputURList = inputCSVURLList[startIndex:endIndex]
    for url in inputURList:
        is_accepted = check_accepted_rejected_url(url)
        if is_accepted:
            pd.DataFrame({"URL":[url]}).to_csv('accepted.csv',index=None,header=None,mode='a') 
        else:
            pd.DataFrame({"URL":[url]}).to_csv('rejected.csv',index=None,header=None,mode='a')
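Building a one-row DataFrame and reopening the output file for every URL adds overhead that grows with the input size. A lighter alternative, sketched here under the assumption that the classification results are available as `(url, is_accepted)` pairs, opens each file once and appends rows with the standard `csv` module. `write_classified` is a hypothetical helper name.

```python
import csv


def write_classified(results, accepted_path="accepted.csv", rejected_path="rejected.csv"):
    """Append (url, is_accepted) pairs to the two files, opening each file once.

    Returns the number of accepted and rejected URLs written.
    """
    counts = {True: 0, False: 0}
    with open(accepted_path, "a") as accepted, open(rejected_path, "a") as rejected:
        writers = {
            True: csv.writer(accepted, lineterminator="\n"),
            False: csv.writer(rejected, lineterminator="\n"),
        }
        for url, is_accepted in results:
            key = bool(is_accepted)
            writers[key].writerow([url])
            counts[key] += 1
    return counts[True], counts[False]
```

As with `mode='a'` in the pandas calls, the files are appended to, so they should be removed before a fresh run.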

In [539]:
def asynchronous(batchSize=10,batchCount=None):
    """
    Use the gevent module to run the batches as greenlets. The batch size and the
    number of batches to execute are both configurable
    """
    if not batchCount:
        # Round up so that a final partial batch is not dropped
        batchCount = (inputURLLength + batchSize - 1) // batchSize
    threads = [gevent.spawn(classify_accept_rejected_url,batchSize*currentBatch,batchSize*currentBatch+batchSize) for currentBatch in range(0,batchCount)]
    print "Threads ",threads
    gevent.joinall(threads)
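The batch boundaries are easy to check in isolation. With 534 URLs and a batch size of 10, floor division yields 53 batches covering only the first 530 URLs, so a 54th, shorter batch is needed for the remaining 4. The helper below is a small sketch of that arithmetic; `batch_ranges` is a hypothetical name.

```python
def batch_ranges(total, batch_size):
    """Return (start, end) index pairs covering range(total), including a short final batch."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [(start, min(start + batch_size, total))
            for start in range(0, total, batch_size)]
```

Each pair can be passed directly as the `startIndex` and `endIndex` of one batch.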

# Execute program with configurable batch size and total no of batch to execute

In [536]:
# If no batch size is given, the default batch size is used and all records are processed
asynchronous()

Threads  [<Greenlet at 0x7f00c406acd0: classify_accept_rejected_url(0, 10)>, <Greenlet at 0x7f00c406aa50: classify_accept_rejected_url(10, 20)>, <Greenlet at 0x7f00c406a190: classify_accept_rejected_url(20, 30)>, <Greenlet at 0x7f00c4e81d70: classify_accept_rejected_url(30, 40)>, <Greenlet at 0x7f00c4e81c30: classify_accept_rejected_url(40, 50)>, <Greenlet at 0x7f00c4e81eb0: classify_accept_rejected_url(50, 60)>, <Greenlet at 0x7f00c4008eb0: classify_accept_rejected_url(60, 70)>, <Greenlet at 0x7f00c4008e10: classify_accept_rejected_url(70, 80)>, <Greenlet at 0x7f00c4008f50: classify_accept_rejected_url(80, 90)>, <Greenlet at 0x7f00c4008550: classify_accept_rejected_url(90, 100)>, <Greenlet at 0x7f00c4008af0: classify_accept_rejected_url(100, 110)>, <Greenlet at 0x7f00c40080f0: classify_accept_rejected_url(110, 120)>, <Greenlet at 0x7f00c4421410: classify_accept_rejected_url(120, 130)>, <Greenlet at 0x7f00c4421c30: classify_accept_rejected_url(130, 140)>, <Greenlet at 0x7f00c44219b0: c

# Read accepted.csv and rejected.csv to verify the results

In [537]:
accepted_csv = pd.read_csv('accepted.csv',header=None)
accepted_csv

Unnamed: 0,0
0,https://codejquerycom/jquery/jquery.js


In [538]:
rejected_csv = pd.read_csv('rejected.csv',header=None)
rejected_csv

Unnamed: 0,0
0,https://code.jquery.com/jquery/jquery-1.4.2.js
1,https://code.jquery.com/jquery/jquery-1.11.0.min.js
2,http://www.golabs.in/static/css/style.css
3,https://code.jquery.com/jquery/jquery-3.0.0.js
4,http://contribute.jquery.org/
5,https://code.jquery.com/jquery/jquery-1.6.2.js
6,https://code.jquery.com/jquery/jquery-1.2.1.pack.js
7,http://irc.jquery.org/
8,http://www.amazon.in/g-ecx.images-amazon.com
9,http://www.amazon.de/ref=footer_de
