# Web Crawling procedures

## This tutorial provides explanations about web crawling techniques. 
**The first program is amde to extract and crawl all outgoing links from a webpage. This program:**
**first creates a directory for the project**
**Creates a file that stores a list of all the href links on this homepage whose url is provided by the user**
**Next, it crawls each of these links, and once a webpage has been crawled, its url is now moved into the 'crawled' file while all the url(s) taht have yet to be crawled exist in the 'queue' file**
**To ensure that we only crawl one given directory, our program ensures that the domain name of teh pages it crawls matches the domain of the base url. This prevents it from essentially crawling the entire internet.**

**The next part of the tutorial is based on Selenium. We will use Selenium in order to submit assignments automatically to the school's website without having to open it and sign in each time. **

## Creating the crawler

In [1]:
import os

In [2]:
#from urllib.request import urlopen
from urllib import urlopen   #is this the same as from urllib import urlopen

In [3]:
import requests

In [4]:
from urllib2 import urlopen

In [5]:

from urllib2 import urlopen #doesnt work

In [6]:
import parse #from urllib import parse  #FIX THIS
#from urllib.parse import urlparse #fix This

In [8]:
#from urllib2 import parse #doesnot work

In [10]:
#from urllib.parse import urlparse

In [11]:
import threading
from queue import Queue
from html.parser import HTMLParser

In [12]:
import threading#spiders
from queue import Queue#jobs

In [14]:
#a method to create the project directory if it is a new project
def create_project_dir(directory):
    if not os.path.exists(directory):#only create if it doesn't already exist
        print('Creating project '+ directory)
        os.makedirs(directory)

In [15]:
create_project_dir('theFirstProject') #only prints the name of the project when this cell is run for the first time because after that, it is already created and the if statement returns a false

In [16]:
#creating queue and crawled files(if they don't already exist)
def create_data_files(project_name, base_url):
    queue=os.path.join(project_name+ '/queue.txt')
    crawled= os.path.join(project_name+'/crawled.txt')
    if not os.path.isfile(queue):
        write_file(queue, base_url) # so that when the program starts, it has one url in the waiting list
    if not os.path.isfile(crawled):
        write_file(crawled, '') #empty file so that the program knows this url has not been crawled yet
        

In [17]:
#creating the new file:
def write_file(path, data):
    with open(path, 'w') as f:
        f.write(data)
        f.close()

In [18]:
#create_data_files('theFirstProject', 'http://us.asos.com/women/')

In [19]:
#Function to add data onto an existing file
def append_to_file(path, data):
    with open(path, 'a') as file:
        file.write(data+'\n') #each link on a new line  

In [20]:
#function to delete the conetnts of a file
#creates a new file of the same name: i.e: opens that file and erases its contents
def delete_file_contents(path):
    open(path, 'w').close() 
#write mode selected  #pass # do nothing
       

In [21]:
#However, this process is slower than if variables were used. The advantage of this process
#is that if the system accidentally shuts down, we still have teh data we crawled saved kin files
#if variables were being used, the entire process would have to be done all over again

#tehrefore, we will use both: variables and methods

In [22]:
#Read a file and convert each line to set items
def file_to_set(file_name):
    results=set()
    with open(file_name, 'rt') as f:
        for line in f:
            results.add(line.replace('\n', ''))#deletes the newline part of the url
    return results

In [23]:
#iterate through a set, each item in the set will be a new line in the file
def set_to_file(links, file_name):
    with open(file_name, "w") as f:
        for link in sorted(links):#alphabetican order
            f.write(link + "\n")
        

# Collecting the Links on this base webpage (whose url is provided above):

In [24]:
class LinkFinder(HTMLParser):
    def __init__(self, base_url, page_url):
        super().__init__()
        self.base_url = base_url
        self.page_url = page_url
        self.links = set()
    
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for(attribute, value) in attrs: #stored in a tuple
                if attribute == 'href':#collecting the url of this link
                    #converting a relative url to one with a full domain name
                    url = parse.urljoin(self.base_url, value)#if it is a full url, it is saved as it is. Otherwise, the relative url is combined with the base url and then saved 
                    self.links.add(url)
                    
                    
    def page_links(self):
        return self.links
                    
    def error(self, message):
        pass

# Creating the spider class:
**We will have a bunch of links in our waiting list, waiting to be crawled. The spider will hold on to one of these links, connect to that page and grab all of the html of this webpage and feed it to the Linkfinder object which return all of the links found in the html. Once spider has all of the links from this webpage, it makes sure that this link is not already in the waiting list and that it has not already been crawled and then it adds the link to the waiting list. Moving the webpage it has just extracted the links from into the crawled_links file is also the responsibility of the spider so we can make sure a page is not crawled twice.**

In [25]:
class Spider:
   #making aclass variable that is shared among all the spiders 
    project_name = ''
    base_url = ''
    domain_name=''
    #also need to create variables for the queue and crawled files
    queue_file=''#the actual text file
    crawled_file=''#any spider cans et the value of these
    queue=set()#stored on he RAM as a buffer 
    crawled=set()
    def __init__(self, project_name, base_url, domain_name):
        Spider.project_name= project_name
        Spider.base_url = base_url
        Spider.domain_name= domain_name
        Spider.queue_file= Spider.project_name+'/queue.txt'
        Spider.crawled_file= Spider.project_name+'/crawled.txt'
        self.boot()
        self.crawl_page('First spider', Spider.base_url)
        
        
    @staticmethod
    def boot():#the first spider created must create the prohect directory and the 2 data files(queue and crawled)
        create_project_dir(Spider.project_name)
        create_data_files(Spider.project_name, Spider.base_url)#first spider therefore the url to the homepage is passed in
        Spider.queue = file_to_set(Spider.queue_file)
        Spider.crawled = file_to_set(Spider.crawled_file)
        
   

    @staticmethod
    def crawl_page(thread_name, page_url):#adding the base url to the crawled file
        if page_url not in Spider.crawled:
            print(thread_name+'crawling'+page_url)
            print('Queue: '+ str(len(Spider.queue)) + ' | Crawled: '+str(len(Spider.crawled)))
            Spider.add_links_to_queue(Spider.gather_links(page_url))
            Spider.queue.remove(page_url)#removing this page that has just been crawled so it no longer exists in the queue(the waiting list)
            Spider.crawled.add(page_url)
            Spider.update_files()
    
    
#The following function connects to the site, receives the html code which is initially in binary form. It 
#converts this into actual html format, passes it onto the linkFinder which parses through it and extracs all the links from this page 
    @staticmethod
    def gather_links(page_url):
        html_string=''
        try:
            response= urlopen(page_url)
            if 'text/html' in response.getheader('Content-Type'):
                html_bytes=response.read()#convert the 1's and 0's received from the browser into actual html format
                html_string=html_bytes.decode("utf-8")
            finder=LinkFinder(Spider.base_url, page_url)
            finder.feed(html_string)
        except Exception as e:
            print(str(e))
            return set()
        return finder.page_links()
    
    
    
    @staticmethod
    def add_links_to_queue(links):
        for url in links:
            if (url in Spider.queue) or (url in Spider.crawled):
                continue
            if Spider.domain_name != get_domain_name(url):
                continue
            Spider.queue.add(url)
            
 #so that the crawler does not crawl the entire internet, the base url mist be present in all the pages being crawled
              
                        
            
            
    @staticmethod
    def update_files():
        set_to_file(Spider.queue, Spider.queue_file)
        set_to_file(Spider.crawled, Spider.crawled_file)

In [26]:
try:
    from urllib.parse import urlparse
except ImportError:
     from urlparse import urlparse

In [27]:
#Get domain name(example.com)
def get_domain_name(url):
    try:
        results= get_sub_domain_name(url).split('.')
        return results[-2]+'.'+results[-1]#second to the last and the .com
    except:
        return ''

In [28]:
#Get subdomain name (name.example.com)
def get_sub_domain_name(url):
    try:
        return urlparse(url).netloc
    except:
        return''

In [29]:
print(get_domain_name('http://us.asos.com/women/'))

asos.com


In [30]:
PROJECT_NAME='latestProject'#constants convention allcaps
HOMEPAGE='https://www.narscosmetics.com/'
DOMAIN_NAME=get_domain_name(HOMEPAGE)
QUEUE_FILE= PROJECT_NAME+'/queue.txt'
CRAWLED_FILE = PROJECT_NAME+'/crawled.txt'
NUMBER_OF_THREADS= 4 #SPECIFIC TO THE OPERATING SYSTEM

In [31]:
queue= Queue()
Spider(PROJECT_NAME, HOMEPAGE, DOMAIN_NAME)

Creating project latestProject
First spidercrawlinghttps://www.narscosmetics.com/
Queue: 1 | Crawled: 0
addinfourl instance has no attribute 'getheader'


<__main__.Spider instance at 0x000000000722CBC8>

In [32]:
def crawl():
#check if there are items in the queue, if so, crawl them
    queued_links=file_to_set(QUEUE_FILE)
    if len(queued_links)>0:
        print(str(len(queued_links))+ 'links in the queue')
        create_jobs()#each queue link is a new job

In [33]:
def create_jobs():
    for link in file_to_set(QUEUE_FILE):
        queue.put(link)
    queue.join()
    crawl()

In [34]:
#create worker threads which will be killed once the main is exited
def create_workers():
    for _ in range(NUMBER_OF_THREADS):
        t = threading.Thread(target=work)
        t.daemon=True
        t.start()
        

In [35]:
#do the enxt job in the queue
def work():
    while True:
        url= queue.get()
        Spider.crawl_page(threading.current_thread().name, url)
        queue.task_done()

In [36]:
create_workers()
crawl()

## Part 2: creating a program to submit assignments automatically:

In [37]:
# os for file management
import os
file_tup=[]
# Build tuple of (class, file) to turn in
submission_dir = 'completed_assignments'
dir_list = list(os.listdir(submission_dir))
for directory in dir_list:
    file_list = list(os.listdir(os.path.join(submission_dir, directory)))
    if len(file_list) != 0:
        file_tup = (directory, file_list[0])
    
print(file_tup)

('CIS450', 'HW1.txt')


In [38]:
#import selenium
from selenium import webdriver

# Using Chrome to access web
driver = webdriver.Firefox()
driver.get('https://canvas.ksu.edu')