# Web Crawling procedures


**The following program extracts all outgoing links from a webpage and then crawls each of those links. The process is effectively recursive: the program stores every hyperlink found on a page, marks that page as crawled, and then visits each collected link, where it again collects the outgoing links before moving on. 
It first creates a directory for the project, then creates two files named 'crawled' and 'queue'. The queue file starts with the URL of the homepage, which is provided by the user; to keep things user-friendly, this base URL is the only input the process requires. The program crawls the homepage and stores all the links found on it in the queue. Once a webpage has been crawled, its URL is moved into the 'crawled' file, while URLs that have yet to be crawled remain in the 'queue' file. To keep the crawl within one site, the program checks that the domain name of every page it crawls matches the domain of the base URL. This prevents it from crawling the entire internet.**


## Creating the crawler

## The following cell imports the necessary packages for this part of the tutorial. Below is an explanation of why they are needed:
**Import urlparse: This function splits a URL string into its components; companion functions in the same module do the reverse and combine URL components into a URL string. 
It parses the URL into six components and returns a named tuple of 6 items, matching the general structure of a URL: scheme://netloc/path;parameters?query#fragment. Each item is a string, possibly empty. The components are not broken up into smaller parts (for example, the network location is a single string), and % escapes are not expanded. The delimiters (; , ? , #) are not part of the result, except for a leading slash in the path component, which is retained if present. For example, if the URL 'http://www.cwi.nl:80/%7Eguido/Python.html' is parsed using 'urlparse', we get the following result: 
scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html', params='', query='', fragment=''
The params, query and fragment fields are empty because this URL contains none of those parts.**
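**A minimal sketch of the parsing described above, using the same example URL. The variable names are chosen for illustration only.**

```python
from urllib.parse import urlparse, urlunparse

# Split the example URL into its six components.
parts = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')

print(parts.scheme)    # http
print(parts.netloc)    # www.cwi.nl:80
print(parts.path)      # /%7Eguido/Python.html
print(parts.hostname)  # www.cwi.nl (the netloc without the port)
print(parts.port)      # 80

# urlunparse does the reverse: it rebuilds a URL string from the components.
rebuilt = urlunparse(parts)
print(rebuilt)         # http://www.cwi.nl:80/%7Eguido/Python.html
```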

**Import HTMLParser: An HTMLParser receives HTML data and calls handler methods when start tags, end tags, text, comments, and other markup elements are encountered. If we want this parser to behave differently and produce different results, it is possible to subclass it and override its methods to achieve this.**

**Import urlopen: urllib.request is a module for fetching URLs. One of its most important functions is urlopen, which opens a URL given either as a string or as a Request object.
The optional data argument specifies additional data to send to the server, or None if no such data is needed. In Python 3 it must be a bytes object rather than a str. HTTP requests are the only ones that use data; the HTTP request will be a POST instead of a GET when the data parameter is provided. The data should be in the standard application/x-www-form-urlencoded format. The urllib.parse.urlencode() function takes a mapping or a sequence of 2-tuples and returns a string in this format, which must then be encoded to bytes.
The module also offers a slightly more complex interface for handling common situations such as basic authentication, cookies and proxies. It can fetch URLs for several URL schemes using their associated network protocols, e.g. FTP or HTTP. The scheme is identified by the string before the ":" in the URL; for example, "ftp" is the URL scheme of ftp://python.org/.**
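**A small sketch of how the data argument turns a request into a POST. The endpoint and form fields are hypothetical, and the request is only built, never sent, so the example runs without a network connection.**

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical form fields, used only for illustration.
form_fields = {'q': 'web crawling', 'page': 2}

# urlencode returns a str; encode it to bytes before using it as data.
encoded = urlencode(form_fields).encode('ascii')
print(encoded)  # b'q=web+crawling&page=2'

# Hypothetical endpoint. With data supplied, the request method is POST.
post_request = Request('http://example.com/search', data=encoded)
get_request = Request('http://example.com/search')

print(post_request.get_method())  # POST
print(get_request.get_method())   # GET

# urlopen(post_request) would actually send it; that call is left out here.
```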

**Import Queue: This provides a constructor for a FIFO (first in, first out) queue. The queue module defines it as *class queue.Queue(maxsize=0)*, where maxsize is an integer that sets the upper bound on the number of items that can be placed in the queue. If maxsize is less than or equal to zero, the queue size is unlimited.**
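**To illustrate the first in, first out behaviour, here is a short sketch with placeholder strings standing in for URLs.**

```python
from queue import Queue

jobs = Queue()  # maxsize defaults to 0, so there is no upper bound

# Items are put on the queue in this order...
for item in ['page-a', 'page-b', 'page-c']:
    jobs.put(item)

# ...and come back out in the same order.
order = []
while not jobs.empty():
    order.append(jobs.get())

print(order)  # ['page-a', 'page-b', 'page-c']
```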



**Note that the following program requires a Python 3 environment:**

In [2]:
from urllib.parse import urlparse
import os
from html.parser import HTMLParser
from urllib import parse
from urllib.request import urlopen
import threading
from queue import Queue

**Creating a method that creates a project directory if and only if the same project does not already exist. In Python, functions are defined using a 'def' statement. The general form looks like this:**

**def function-name(Parameter list):**

**&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;statements, i.e. the function body**

**The only parameter of this method is directory. Line 3 checks whether this folder already exists. If it does not, the print statement right after the check tells the user that the project is being created, and the last line creates the directory.**

In [3]:
#a method to create the project directory if it is a new project
def create_project_dir(directory):
    if not os.path.exists(directory):#only create if it doesn't already exist
        print('Creating project '+ directory)
        os.makedirs(directory)

In [4]:
create_project_dir('theFirstProject') #only prints the name of the project when this cell is run for the first time because after that, it is already created and the if statement returns a false
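**As an alternative sketch, os.makedirs accepts exist_ok=True, which makes the call safe to repeat without an explicit existence check. The helper name ensure_dir is made up for this example, and a temporary folder is used so nothing is left behind.**

```python
import os
import tempfile

def ensure_dir(directory):
    """Create directory if needed; return True if this call created it."""
    already_there = os.path.isdir(directory)
    os.makedirs(directory, exist_ok=True)  # no error if it already exists
    return not already_there

with tempfile.TemporaryDirectory() as tmp:
    project = os.path.join(tmp, 'theFirstProject')
    first_call = ensure_dir(project)   # the folder is created here
    second_call = ensure_dir(project)  # the folder already exists

print(first_call, second_call)  # True False
```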

**Method that writes to files: The parameter list consists of the file path and the data that has to be written to this file. This method also takes care of closing the file after it has written to it.**

In [5]:
#create a new file (or overwrite an existing one) and write data to it
def write_file(path, data):
    with open(path, 'w') as f: #the with block closes the file automatically
        f.write(data)

**Method to append data to files: It takes the same parameter list as the method above. This method starts writing at the end of the file and ensures that each link is added on a new line by appending the newline character.**

In [6]:
#Function to add data onto an existing file
def append_to_file(path, data):
    with open(path, 'a') as file:
        file.write(data+'\n') #each link on a new line  

**Method to delete the contents of a file: The parameter list consists of 1 parameter, which is the path of the file. This method empties the file by deleting all of its contents.**

In [7]:
#function to delete the contents of a file
#opening an existing file in write mode ('w') erases its contents
def delete_file_contents(path):
    open(path, 'w').close()

**Method to create the crawled and queue files: The first parameter is the name of the project, which is a string, and the second parameter is the base URL, which is usually the URL of the homepage and is provided by the user. Lines 3 and 4 build the paths of the two files: the project folder followed by the name that indicates which of the two files it is. The if blocks that follow make sure that the files are only created if they do not already exist (in case we are running this program for the second time on the same directory). In line 6, the base URL is written to the queue file; this ensures that when the program starts, we have one URL in the waiting list. The crawled file is created empty, since no webpages have been crawled yet.**

In [8]:
#creating queue and crawled files (if they don't already exist)
def create_data_files(project_name, base_url):
    queue = os.path.join(project_name, 'queue.txt')
    crawled = os.path.join(project_name, 'crawled.txt')
    if not os.path.isfile(queue):
        write_file(queue, base_url) # so that when the program starts, it has one url in the waiting list
    if not os.path.isfile(crawled):
        write_file(crawled, '') #empty file because no page has been crawled yet
        

In [9]:
#create_data_files('theFirstProject', 'http://us.asos.com/women/')

**However, reading and writing files at every step is slower than keeping the links only in variables. The advantage of files
is that if the system accidentally shuts down or crashes, we still have the data we crawled saved in the 2 files. If only variables were being used, the entire process would have to be repeated. Therefore, we will use both variables and files in this program.**

**Method to convert a file into a set: The method takes one parameter, which is the filename, and converts the links in the file into a set. A set can contain each item only once, so this ensures that we don't crawl one page multiple times, which would only slow down the process. Line 4 opens the text file for reading ('rt'); the loop then goes through it line by line and adds each link to the set.**

In [10]:
#Read a file and convert each line to set items
def file_to_set(file_name):
    results=set()
    with open(file_name, 'rt') as f:
        for line in f:
            results.add(line.replace('\n', ''))#deletes the newline part of the url
    return results

**Method to convert a set to a file: The first parameter is the set of links and the second is the name of the file. First, the file is cleared by calling the delete_file_contents method. Then we sort the set so that the links are in alphabetical order. The method iterates over these links and writes them to the file, one link per line.**

In [11]:
#iterate through a set, each item in the set will be a new line in the file
def set_to_file(links, file_name):
    delete_file_contents(file_name)
    with open(file_name, "w") as f:
        for link in sorted(links):#alphabetical order
            f.write(link + "\n")
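**The two conversions are meant to be inverses of each other. The sketch below checks that round trip in a temporary folder. It restates compact versions of the two helpers under the hypothetical names save_links and load_links so that it runs on its own, and the URLs are placeholders.**

```python
import os
import tempfile

def save_links(links, file_name):
    # One link per line, in sorted order; 'w' mode truncates any old contents.
    with open(file_name, 'w') as f:
        for link in sorted(links):
            f.write(link + '\n')

def load_links(file_name):
    # Strip the newline from every line and skip blank lines.
    with open(file_name, 'rt') as f:
        return {line.strip() for line in f if line.strip()}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'queue.txt')
    original = {'http://example.com/b', 'http://example.com/a'}
    save_links(original, path)
    restored = load_links(path)
    with open(path) as f:
        first_line = f.readline().strip()

print(restored == original)  # True
print(first_line)            # http://example.com/a
```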
        

# Collecting the Links on this base webpage (whose url is provided above):

**This class is made to inherit from HTMLParser so it has the functionality of an HTMLParser.**

**The first method is the initializer, which calls the initializer of the superclass. It stores the base URL and the URL of the page being parsed, and creates a set named self.links to hold all the links extracted from the page.**

**handle_starttag is a method of HTMLParser. We override it to check whether the tag passed in as a parameter is an 'a' tag, which means we have a link. If it is, a for loop iterates through its attributes (attrs). The attributes of an 'a' tag are the name/value pairs written inside it, for example href or class, and each iteration receives one (attribute, value) tuple. Since we only want to store the URL, we look specifically for the href attribute. Its value is then joined with the base URL using urljoin, so that a relative link becomes a full URL while a link that is already a full URL is kept as it is, and the result is added to the set.**

**page_links is a method that returns the set of links that was created in the initializer.**

**The error method is overridden with an empty body. Older Python 3 releases expected subclasses of the parser to provide this method; on recent versions the override is simply unused.**

In [12]:
class LinkFinder(HTMLParser):
    def __init__(self, base_url, page_url):
        super().__init__()
        self.base_url = base_url
        self.page_url = page_url
        self.links = set()
    
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for(attribute, value) in attrs: #stored in a tuple
                if attribute == 'href':#collecting the url of this link
                    #converting a relative url to one with a full domain name
                    url = parse.urljoin(self.base_url, value)#if it is a full url, it is saved as it is. Otherwise, the relative url is combined with the base url and then saved 
                    self.links.add(url)
                    
                    
    def page_links(self):
        return self.links
                    
    def error(self, message):
        pass
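**The behaviour of urljoin that the class relies on can be seen in isolation. This is a small sketch with a hypothetical base URL, not the crawler's real input.**

```python
from urllib.parse import urljoin

base = 'http://example.com/women/'  # hypothetical base URL

# A relative link is resolved against the base.
relative = urljoin(base, 'dresses.html')
# A link starting with '/' is resolved against the site root.
rooted = urljoin(base, '/sale/')
# A link that is already a full URL is returned unchanged.
absolute = urljoin(base, 'http://other.org/page')

print(relative)  # http://example.com/women/dresses.html
print(rooted)    # http://example.com/sale/
print(absolute)  # http://other.org/page
```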

# Creating the spider class:
**We will have a number of links in our waiting list, waiting to be crawled. The spider takes one of these links, connects to that page, grabs all of its HTML and feeds it to the LinkFinder object, which returns all of the links found in this HTML. Once the spider has all of the links from this webpage, it processes each link and makes sure that it isn't already in the waiting list and that it hasn't already been crawled, after which it adds the link to the waiting list. The Spider class also moves the webpage it has just extracted the links from into the crawled file to ensure that a page is not crawled twice.**

**The class first declares variables for the name of the project, the url of the homepage, the domain name of this url and the queue and crawled files and their corresponding sets. **

**The initializer method sets the values of these variables, calls the boot method and then calls the crawl_page method.**

**The boot method is static, which means we don't need to create an instance of this class in order to call it. It is mainly responsible for calling the methods that create the folder and the 'crawled' and 'queue' files for the project. It also calls the method that converts these files to sets, so that we don't crawl one page more than once.**

**The crawl_page method takes 2 parameters. The first is thread_name, which lets the user know which thread is crawling the page; when the initializer calls crawl_page, it passes in 'First spider' because the program is crawling the first page. The second parameter is the URL of the page to be crawled. The method checks whether this URL is present in the crawled set and, if it isn't, proceeds to crawl it. Line 34 prints the number of links in the queue and the number already crawled, to give the user an idea of how long the process is going to take. The next line extracts all the links on this page and adds them to the queue. Finally, the page we are currently working on is removed from the queue and added to the crawled set.**

**The gather_links method connects to the site and receives the HTML as raw bytes. It decodes these bytes into a string and passes the string to the LinkFinder, which parses it and extracts all the links from the page. The method first creates an empty string variable. The part that connects to the webpage and reads its HTML is wrapped in a try/except block. The if statement checks that the response is HTML and not some other format such as a PDF or an executable. The next lines read the bytes and decode them into a readable HTML string. The method then creates an instance of the LinkFinder class and calls its feed function, passing in the HTML we retrieved; the parser's handler methods are invoked automatically and do not need to be called manually. The except block contains the code that is executed if the code in the try block raises an exception: it prints the exception and returns an empty set, because callers of this method expect a set. Finally, outside of the except block, we return the set of links found on the page.**

**The add_links_to_queue method takes one parameter which is a set of links and is responsible for adding the extracted links to the queue. It contains a for loop to iterate through each url in this set and checks if it is present in either one of the sets (crawled or queue); if so, the continue keyword instructs the method to move on to the next url in the set. There is also a check to ensure that the domain name of the url matches that of the base url, if it doesn’t, we don’t need to crawl this page.**  

**The update_files method calls the methods to convert the queue and crawled sets to text files.**


In [13]:
class Spider:
    #class variables that are shared among all the spiders
    project_name = ''
    base_url = ''
    domain_name=''
    #also need to create variables for the queue and crawled files
    queue_file=''#the actual text file
    crawled_file=''#any spider can set the value of these
    queue=set()#stored in RAM as a buffer
    crawled=set()
    def __init__(self, project_name, base_url, domain_name):
        Spider.project_name= project_name
        Spider.base_url = base_url
        Spider.domain_name= domain_name
        Spider.queue_file= Spider.project_name+'/queue.txt'
        Spider.crawled_file= Spider.project_name+'/crawled.txt'
        self.boot()
        self.crawl_page('First spider', Spider.base_url)
        
        
    @staticmethod
    def boot():#the first spider created must create the project directory and the 2 data files (queue and crawled)
        create_project_dir(Spider.project_name)
        create_data_files(Spider.project_name, Spider.base_url)#first spider therefore the url to the homepage is passed in
        Spider.queue = file_to_set(Spider.queue_file)
        Spider.crawled = file_to_set(Spider.crawled_file)
        
   

    @staticmethod
    def crawl_page(thread_name, page_url):#adding the base url to the crawled file
        if page_url not in Spider.crawled:
            print(thread_name+'crawling'+page_url)
            print('Queue: '+ str(len(Spider.queue)) + ' | Crawled: '+str(len(Spider.crawled)))
            Spider.add_links_to_queue(Spider.gather_links(page_url))
            Spider.queue.discard(page_url)#removing this page that has just been crawled so it no longer exists in the queue (the waiting list); discard does not raise an error if the url is absent
            Spider.crawled.add(page_url)
            Spider.update_files()
    
    
    @staticmethod
    def gather_links(page_url):
        html_string=''
        try:
            response= urlopen(page_url)
            if 'text/html' in response.getheader('Content-Type', ''):#the header often also carries a charset, e.g. 'text/html; charset=utf-8'
                html_bytes=response.read()#read the raw bytes sent by the server
                html_string=html_bytes.decode("utf-8")
            finder=LinkFinder(Spider.base_url, page_url)
            finder.feed(html_string)
        except Exception as e:
            print(str(e))
            return set()
        return finder.page_links()
    
    
    
    @staticmethod
    def add_links_to_queue(links):
        for url in links:
            if (url in Spider.queue) or (url in Spider.crawled):
                continue
            if Spider.domain_name != get_domain_name(url):
                continue
            Spider.queue.add(url)
            
    #so that the crawler does not crawl the entire internet, the domain of every crawled page must match the domain of the base url
              
                        
            
            
    @staticmethod
    def update_files():
        set_to_file(Spider.queue, Spider.queue_file)
        set_to_file(Spider.crawled, Spider.crawled_file)
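**Servers commonly send a Content-Type value that includes a charset suffix, such as 'text/html; charset=utf-8', so an exact comparison with 'text/html' can reject ordinary HTML pages. The following sketch shows one way to compare only the media type. The helper name is_html is an assumption made for this example and is not part of the Spider class.**

```python
def is_html(content_type):
    """Return True when a Content-Type header value denotes an HTML page."""
    if not content_type:  # the header may be missing altogether
        return False
    # Keep only the part before any ';' (parameters such as charset follow it).
    media_type = content_type.split(';')[0].strip().lower()
    return media_type == 'text/html'

print(is_html('text/html; charset=utf-8'))  # True
print(is_html('text/html'))                 # True
print(is_html('application/pdf'))           # False
print(is_html(None))                        # False
```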

In [14]:
try:
    from urllib.parse import urlparse
except ImportError:
     from urlparse import urlparse

**The following method extracts the domain name from the URL that it takes in as a parameter.**

In [15]:
#Get domain name(example.com)
def get_domain_name(url):
    try:
        results= get_sub_domain_name(url).split('.')
        return results[-2]+'.'+results[-1]#second to the last and the .com
    except:
        return ''

**This method calls the urlparse function on the URL that it takes in as a parameter and returns the network location (netloc) component, as explained above.**

In [16]:
#Get subdomain name (name.example.com)
def get_sub_domain_name(url):
    try:
        return urlparse(url).netloc
    except:
        return''

In [17]:
print(get_domain_name('http://us.asos.com/women/'))

asos.com
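**Keeping the last two labels of the host name works for addresses like the one above, but it is only a heuristic. The sketch below restates the idea under the hypothetical name last_two_labels and shows a case where the result is a public suffix rather than the site's domain. It uses the hostname attribute, which leaves out any port number.**

```python
from urllib.parse import urlparse

def last_two_labels(url):
    """Join the last two dot-separated labels of the host name."""
    host = urlparse(url).hostname or ''  # hostname drops any port number
    labels = host.split('.')
    return '.'.join(labels[-2:]) if len(labels) >= 2 else ''

print(last_two_labels('http://us.asos.com/women/'))   # asos.com
print(last_two_labels('http://www.cwi.nl:80/index'))  # cwi.nl
# Two-part suffixes such as .co.uk are not handled by this heuristic:
print(last_two_labels('http://www.bbc.co.uk/news'))   # co.uk
```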


**In the next cell, we provide user input including the project name and base URL. **

In [18]:
PROJECT_NAME='asos_crawled'#constants convention allcaps
HOMEPAGE='http://us.asos.com/women/'
DOMAIN_NAME=get_domain_name(HOMEPAGE)
QUEUE_FILE= PROJECT_NAME+'/queue.txt'
CRAWLED_FILE = PROJECT_NAME+'/crawled.txt'
NUMBER_OF_THREADS= 4 #SPECIFIC TO THE OPERATING SYSTEM

In [19]:
queue= Queue()#Queue is the class that was imported; queue is the thread-safe job queue shared by the worker threads, so we are constructing an instance of it
Spider(PROJECT_NAME, HOMEPAGE, DOMAIN_NAME)#the first spider is created

Creating project asos_crawled
First spidercrawlinghttp://us.asos.com/women/
Queue: 1 | Crawled: 0


<__main__.Spider at 0xa2b93da5c0>

**The crawl method converts the queue file created above to a set. It then checks if there are any links present in this set, if so it prints the number of links present.**

In [20]:
def crawl():
#check if there are items in the queue, if so, crawl them
    queued_links=file_to_set(QUEUE_FILE)
    if len(queued_links)>0:
        print(str(len(queued_links))+ ' links in the queue')
        create_jobs()#each queue link is a new job

**The create_jobs method adds the URLs from the queue file to the thread queue. The join call blocks until every item that was put on the queue has been taken by a worker and marked as done, so the next round of crawl() only starts once the current batch of links has been processed.**

In [21]:
def create_jobs():
    for link in file_to_set(QUEUE_FILE):
        queue.put(link)
    queue.join()
    crawl()

**The create_workers method uses a for loop to create the threads, passing the function each thread should run (work, defined in a later cell) as the target. Setting daemon to True ensures that the thread is killed when the main program exits. We need to call start on each thread to actually start running it.**

In [22]:
#create worker threads which will be killed once the main is exited
def create_workers():
    for _ in range(NUMBER_OF_THREADS):
        t = threading.Thread(target=work)
        t.daemon=True
        t.start()
        

**The work function gets a URL from the queue and calls the crawl_page method, which takes 2 parameters: the first is the name of the thread, to let the user know which thread is doing the work, and the second is the URL that we got from the queue. Calling task_done tells the queue that this URL has been fully processed; queue.join() in create_jobs returns once every queued URL has been marked this way.**

In [23]:
#do the next job in the queue
def work():
    while True:
        url= queue.get()
        Spider.crawl_page(threading.current_thread().name, url)
        queue.task_done()
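**The interplay of get, task_done and join can be tried without any network access. In this sketch the "job" is just squaring a number; the names jobs, results and worker are made up for the example, and a lock protects the shared list.**

```python
import threading
from queue import Queue

jobs = Queue()
results = []
results_lock = threading.Lock()

def worker():
    while True:
        number = jobs.get()            # blocks until a job is available
        with results_lock:             # protect the shared list
            results.append(number * number)
        jobs.task_done()               # mark this job as finished

# Daemon threads are killed automatically when the main program exits.
for _ in range(4):
    threading.Thread(target=worker, daemon=True).start()

for n in range(10):
    jobs.put(n)

jobs.join()                            # returns once every job is marked done
print(sorted(results))  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```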

In [24]:
create_workers()
crawl()

# Web Crawling using Selenium:

**Selenium is a software tool that can be used to automate web browsers. Selenium comprises the Selenium WebDriver, the Selenium IDE and the Selenium Grid. Its most popular use is for functional and regression testing; it automates web applications for testing purposes. It is also the core technology in countless other browser automation tools, APIs and frameworks. Selenium is open source software, which makes it possible for users all over the world to freely use it to test applications. Moreover, Selenium supports Java, Python, C#, PHP, Perl and Ruby, among other programming languages. The operating systems that support Selenium include Windows, Mac, Linux, iOS and Android. Selenium has been updated over the years to work well with all major browsers including Chrome, Firefox, Safari, Opera and even Internet Explorer.**

**Selenium also supports parallel test execution, which means that it can run the same test case on different browsers in parallel, or different test cases on different instances of the same browser. If used with TestNG, Selenium WebDriver can produce HTML reports.**

**In this tutorial, we will be using Selenium web driver. This uses a collection of language specific bindings to drive a browser as per the user’s demands.**

## Creating a program to submit assignments automatically:

**The following program allows users to submit assignments to the school's website without logging in every time a new assignment has to be submitted.**

**First, we need to place the completed assignments in folders named after the classes the assignments belong to. We import os because this module provides a portable way of using operating-system-dependent functionality, such as listing directories, manipulating paths, and creating files and directories.**

**Each assignment is represented by a tuple. A tuple is an immutable sequence whose items may or may not have the same datatype; here it is used as a pair that keeps track of 2 things at once. The first item corresponds to the class name, while the second gives the name of the file to submit. CompletedAssignments is the name of the main folder, which contains one subfolder per class. We use a function from the os module and an explicit cast to form a list of class folders from the main folder 'CompletedAssignments'. We then loop through this list of subfolders and list all the files in each folder. If a folder is not empty, its name and its first file are stored as a tuple in 'file_tup'; since the variable is overwritten each time, it ends up holding the assignment from the last non-empty folder visited.**

In [33]:
# os for file management
import os
 
file_tup=[] # will be replaced by a tuple (class, file to submit)
submission_dir = 'CompletedAssignments' 

dir_list = list(os.listdir(submission_dir)) #casting to list datatype
#the following for loop goes through all the subfolders in the CompletedAssignments folder
for directory in dir_list:
    file_list = list(os.listdir(os.path.join(submission_dir, directory))) #casts the assigments for each class into a list
    if len(file_list) != 0:
        file_tup = (directory, file_list[0]) #the first item in the tuple is the folder name (directory) and the second item is the name of the first file in that folder
    
print(file_tup)

('CIS308', 'Testing_File.rtf')
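**If every file in every class folder should be tracked, the pairs can be collected in a list instead of a single variable. This is a sketch under that assumption: the function name collect_submissions and the folder layout it builds are made up, and a temporary folder stands in for CompletedAssignments.**

```python
import os
import tempfile

def collect_submissions(submission_dir):
    """Return a sorted list of (class_name, file_name) pairs."""
    pairs = []
    for class_name in sorted(os.listdir(submission_dir)):
        class_dir = os.path.join(submission_dir, class_name)
        if not os.path.isdir(class_dir):
            continue  # ignore stray files in the main folder
        for file_name in sorted(os.listdir(class_dir)):
            pairs.append((class_name, file_name))
    return pairs

# Build a throwaway folder layout to try it out (all names are made up).
with tempfile.TemporaryDirectory() as tmp:
    for class_name, file_name in [('CIS308', 'lab07.c'), ('MIS670', 'report.txt')]:
        os.makedirs(os.path.join(tmp, class_name), exist_ok=True)
        open(os.path.join(tmp, class_name, file_name), 'w').close()
    os.makedirs(os.path.join(tmp, 'STAT510'))  # empty folder: contributes nothing
    submissions = collect_submissions(tmp)

print(submissions)  # [('CIS308', 'lab07.c'), ('MIS670', 'report.txt')]
```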


**At this point, the program knows which file has to be submitted and which directory it is located in.**

In [34]:
import selenium
from selenium import webdriver

# Using Firefox to access the web; this browser is controlled by the program we are writing
driver = webdriver.Firefox()
driver.get('https://canvas.ksu.edu')

**This opens up a Firefox browser window and directs us to the canvas login webpage. The user is prompted to enter their ID and password. In order to train the webdriver, we need to provide precise instructions. These include step by step procedures that it can follow i.e: where to click, what to type, etc. To make this work, we will use Selectors. A selector is a unique identifier for an element on a webpage. The selector for any given text, button or link on a webpage can be found by inspecting the html of this feature of the webpage.  **

**The following line of code locates the ID textbox and keeps a reference to it, so that we can type into it in the next step.**

In [35]:
# Select the id box
id_box = driver.find_element_by_name('username')
#id_box now refers to the username textbox

**Now that the program can locate the username textbox, we can instruct it on how to enter the username. We use the send_keys function to do this.**

In [36]:
#types the username provided in the username textbox
id_box.send_keys('Enter username') #Enter username here

In [37]:
# Locates the textbox for entering the password
pass_box = driver.find_element_by_name('password')
# Typing in the password
pass_box.send_keys('Enter password') # Enter password here
# Locates the login button
login_button = driver.find_element_by_name('submit')
# The following command clicks the login button for us
login_button.click()

**The previous cell has led us to the user dashboard, which displays all the current courses; a list of assignments that are due shortly is displayed on the right side. Now we look for the course we want to submit an assignment for. To do this, we refer to the first item in the tuple we created above, which gives the name of the class. A block of if/elif statements then goes through the classes until it finds a match. Once we know the right course, we need to locate its link on the page. To do this, we use 'find_element_by_link_text', another kind of selector, and provide the full name of the course as it appears on the page (found by inspecting the page).**

In [38]:
# Locates the courses button and clicks it
courses_button = driver.find_element_by_id('global_nav_courses_link')
courses_button.click()
# Get the name of the folder
folder = file_tup[0]
    
# Class to select depends on folder
if folder == 'MIS670':
    class_select = driver.find_element_by_link_text('Soc Med Anal/Web Min(14825)')
elif folder == 'STAT510':
    class_select = driver.find_element_by_link_text('Intro Prob & Stat 1(12712)')
elif folder == 'CIS308':
    class_select = driver.find_element_by_link_text('C/C++ Language Lab(10774)')

# Click on the specific class
class_select.click()


**The next step is to find the Assignments link by its link text (which can be found in the page's HTML) and click it.**

In [40]:
#modules_select = driver.find_element_by_link_text('Modules')
#modules_select.click()

assignments_select = driver.find_element_by_link_text('Assignments')
assignments_select.click()


**Now we need to locate the assignment we are trying to submit. Depending on which assignment it is, this may require us to scroll down the page.**

In [41]:
from selenium.webdriver.common.keys import Keys
#driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
#driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

#driver.execute_script("window.scrollTo(0, 1080)")
#driver.execute_script("window.scrollTo(0, 1080)")

In [42]:
#assignment_select = driver.find_element_by_link_text('Python Bootcamp: SocialMediaAnalyticsWebMining')

assignmentSub_select = driver.find_element_by_link_text('Lab 07')
assignmentSub_select.send_keys(Keys.END)
assignmentSub_select.click()

In [43]:
submitPage_select = driver.find_element_by_link_text('Submit Assignment')
submitPage_select.click()

In [44]:
# Choose File button

#choose_file = driver.find_element_by_name('attachments[0][uploaded_data]') #CODE

# Complete path of the file

#file_name= 'Testing_File.rtf' #CODE
#file_location = os.path.join("C:\\CompletedAssignments\MIS670\Testing_File.txt")

#file_location = os.path.join(submission_dir, folder, file_name) #CODE
# Send the file location to the button
#choose_file.send_keys(file_location) #CODE

In [46]:
# Choose File button: send it the complete path of the file
driver.find_element_by_name('attachments[0][uploaded_data]').send_keys(r'C:\Users\Rida\Desktop\MIS670\CompletedAssignments\CIS308\Testing_File.rtf') # raw string so the backslashes are taken literally


**Submitting the assignment and closing the browser...**

In [28]:
# Locate submit button and click
submit_assignment = driver.find_element_by_id('submit_file_button')
submit_assignment.click()

In [47]:
driver.quit()

# Using Selenium to access Facebook:

**The following program is used to log in to Facebook. The next cell opens a Firefox window and loads Facebook's login page.**

In [48]:
import selenium
from selenium import webdriver

# Using Firefox to access the web; this browser is controlled by the program we are writing
fb = webdriver.Firefox()
fb.get('https://www.facebook.com')

In [49]:
#maximizing the window
fb.maximize_window()

**The next cell sets the implicit wait to 20 seconds, so that whenever we look up a web element the driver keeps retrying for up to 20 seconds before giving up. This gives the page's elements time to load before we start using them.**

In [50]:
fb.implicitly_wait(20) #sets implicit wait to 20 seconds for every web element

**The next cell sets the page load timeout: if Facebook has not finished loading within 30 seconds, the driver raises a timeout error instead of continuing to wait.**

In [51]:
fb.set_page_load_timeout(30)

**The webdriver provides a function to capture screenshots and save them as files. It returns True if the screenshot was captured.**

In [52]:
fb.get_screenshot_as_file("fb_initial.png")

True

**Locating the user ID textbox and passing in 'Selenium Driver' as the username using send_keys. Then locating the password textbox and passing in 'python' using send_keys again, and finally locating the login button and clicking it.**

In [53]:
fb.find_element_by_id("email").send_keys("Selenium Driver")
fb.find_element_by_name("pass").send_keys("python")
fb.find_element_by_id("loginbutton").click()

In [54]:
#capturing a screenshot
fb.get_screenshot_as_file("fb_final.png")

True

**Closing the browser window**

In [55]:
#closes the browser window
fb.quit()

# Using Selenium to navigate YouTube:

In [59]:
from selenium import webdriver
driver = webdriver.Firefox()

In [60]:
driver.get("https://www.youtube.com/user/noobtoprofessional")

In [None]:
#locates the link to the YouTube video by its link text; this cell also shows an error:
driver.find_element_by_xpath('//a[contains(text(), "Why you should learn Python Programming")]')

In [62]:
driver.quit()

# Crawling a custom-made webpage by XPath:

In [63]:
from selenium import webdriver
driver = webdriver.Firefox()

In [64]:
driver.get("http://econpy.pythonanywhere.com/ex/001.html") # the url of the webpage 

**Each buyer name sits in a div whose title attribute is 'buyer-name', so the XPath selects on that attribute. We store all the matching elements in a list.**

In [65]:
#The following command returns a list of all the buyers on this webpage
buyers= driver.find_elements_by_xpath('//div[@title="buyer-name"]')

In [66]:
buyers[:2] # we now need to extract the text from this list of buyers for it to make sense to the users

[<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="fe8f5826-574c-430c-b07a-5a9b6737df32", element="01048c61-0262-481d-95f3-6ba1d5502fdd")>,
 <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="fe8f5826-574c-430c-b07a-5a9b6737df32", element="83e5d502-cf6e-44c5-9c0a-322cd0c61e49")>]

**The prices are in span elements whose class attribute is 'item-price'. We store these in a list as well.**

In [67]:
prices = driver.find_elements_by_xpath('//span[@class="item-price"]')


In [68]:
prices[:2]

[<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="fe8f5826-574c-430c-b07a-5a9b6737df32", element="b048a112-86fd-4614-9c96-883775cb2a16")>,
 <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="fe8f5826-574c-430c-b07a-5a9b6737df32", element="501fe731-2c80-45c1-ac4d-1bb23c56ff65")>]

**The buyers list and the prices list must be the same length.**

In [69]:
len(buyers)

20

In [70]:
len(prices)

20

**The following for loop displays each buyer and the price they have paid for the items they bought.**

In [71]:
num = len(buyers)
for i in range(num):
    print(buyers[i].text + " : " + prices[i].text)

Carson Busses : $29.95
Earl E. Byrd : $8.37
Patty Cakes : $15.26
Derri Anne Connecticut : $19.25
Moe Dess : $19.25
Leda Doggslife : $13.99
Dan Druff : $31.57
Al Fresco : $8.49
Ido Hoe : $14.47
Howie Kisses : $15.86
Len Lease : $11.11
Phil Meup : $15.98
Ira Pent : $16.27
Ben D. Rules : $7.50
Ave Sectomy : $50.85
Gary Shattire : $14.26
Bobbi Soks : $5.68
Sheila Takya : $15.00
Rose Tattoo : $114.07
Moe Tell : $10.09
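
The index-based loop above works because both lists have the same length. The following is a small sketch of the same pairing written with `zip`, using plain strings in place of Selenium web elements (the two sample rows are copied from the output above). `parse_price` is a hypothetical helper, not part of Selenium.

```python
def parse_price(text):
    """Convert a price string such as '$29.95' to a float."""
    return float(text.strip().lstrip("$").replace(",", ""))

# Stand-ins for [b.text for b in buyers] and [p.text for p in prices].
buyer_names = ["Carson Busses", "Earl E. Byrd"]
price_texts = ["$29.95", "$8.37"]

# zip walks both lists in step, so no index variable is needed.
purchases = [(name, parse_price(price)) for name, price in zip(buyer_names, price_texts)]
for name, amount in purchases:
    print(f"{name} : ${amount:.2f}")
```

Converting the prices to numbers at this stage makes later steps such as sorting or summing straightforward.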


In [72]:
maxPage= 5
maxPageDigit=3 #3 digit page number is the maximum. i.e: the last page has a 3-digit page-number


**We can store these results in a CSV file. The next cell creates the file and writes the header row.**

In [73]:
with open('result.csv', 'w') as f:
    f.write("Buyers,Price\n") # header row, without stray spaces around the column names

In [74]:
driver = webdriver.Firefox()

**Now we can populate the file using a for loop that iterates through the pages given above. The second line of the code prefixes the appropriate number of zeroes to single- or double-digit page numbers. Next we edit the URL to account for the page number of each page we want to crawl. We then direct the browser to each of these URLs one at a time, create the buyers and prices lists for the page, open the file for appending, and write these lists to the file. After crawling the last page, we close the browser.**

In [75]:
for i in range(1, maxPage+1):
    page_num = (maxPageDigit - len(str(i))) * "0" + str(i) # prefixes zeroes so every page number has maxPageDigit digits
    print(page_num)
    url = "http://econpy.pythonanywhere.com/ex/" + page_num + ".html"
    print(url)
    driver.get(url)
    buyers = driver.find_elements_by_xpath('//div[@title="buyer-name"]')
    prices = driver.find_elements_by_xpath('//span[@class="item-price"]')
    num_page_items = len(buyers)
    with open('result.csv', 'a') as f:
        for j in range(num_page_items): # j, not i, so the page counter is not overwritten
            f.write(buyers[j].text + "," + prices[j].text + "\n")
driver.close()

001
http://econpy.pythonanywhere.com/ex/001.html
002
http://econpy.pythonanywhere.com/ex/002.html
003
http://econpy.pythonanywhere.com/ex/003.html
004
http://econpy.pythonanywhere.com/ex/004.html
005
http://econpy.pythonanywhere.com/ex/005.html
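
The zero-padding step can also be written with the built-in `str.zfill` method. The following is a minimal sketch that assumes the same URL pattern as above; `page_urls` is a hypothetical helper introduced only for illustration.

```python
BASE_URL = "http://econpy.pythonanywhere.com/ex/"

def page_urls(max_page, digits=3):
    """Return the URLs for pages 1..max_page with zero-padded page numbers."""
    return [f"{BASE_URL}{str(page).zfill(digits)}.html" for page in range(1, max_page + 1)]

urls = page_urls(5)
print(urls[0])
print(urls[-1])
```

Building the full list first keeps the URL logic separate from the browser calls, so it can be checked without opening Firefox.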


# Crawling with Beautiful Soup:

**Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with a parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. We will use it to scrape data about graphics cards from the e-commerce website Newegg.**

In [76]:
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq

In [77]:

url= "https://www.newegg.com/Video-Cards-Video-Devices/Category/ID-38?Tpk=graphics%20card"

In [78]:
client = uReq(url)
page_html =client.read()
client.close()

**The constructor for BeautifulSoup requires the html code and the parser that will accompany it during scraping.**

In [79]:
#parsing html
page_soup = soup(page_html, "html.parser")

**The following line displays the header on the page whose URL is given above.**

In [80]:
page_soup.h1

<h1 class="page-title-text">Video Cards &amp; Video Devices</h1>

**Displaying the first paragraph entry on the page whose URL is given above.**

In [81]:
page_soup.p

<p>Newegg.com - A great place to buy computers, computer parts, electronics, software, accessories, and DVDs online. With great prices, fast shipping, and top-rated customer service - once you know, you Newegg.</p>

**Displaying the first span element in the page body.**

In [82]:
page_soup.body.span

<span class="noCSS">Skip to:</span>

**Using BeautifulSoup, we can search a webpage's HTML with the findAll method to find the objects with name 'div', attribute 'class' and value 'item-container'.
The findAll method traverses the HTML, starting at the given point, and finds all the Tag and NavigableString objects that match the criteria we provide. The signature of the findAll method is:**

`findAll(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)`

**These arguments show up over and over again throughout the Beautiful Soup API. The most important arguments are name and the keyword arguments. The name argument restricts the set of tags by name.**

In [83]:
#iterates through each product
containers=page_soup.findAll("div", {"class":"item-container"})

In [84]:
len(containers)

12

**The next 3 cells clean the results from findAll to find the title (brand name of the product).**

**container.a retrieves everything in the a tag including href, img alt, class and title.**

**container.div.div.a retrieves the first a tag nested inside two div elements.**

**container.div.div.a.img["title"] extracts exactly the value of title from img.**

In [85]:
container=containers[0]
container.a #grabs everything in the a tag

<a class="item-img" href="https://www.newegg.com/Product/Product.aspx?Item=N82E16814126189&amp;ignorebbr=1">
<img alt="ASUS ROG Strix Radeon RX 570 O4G Gaming OC Edition GDDR5 DP HDMI DVI VR Ready AMD Graphics Card (ROG-STRIX-RX570-O4G-GAMING)" class=" lazy-img" data-effect="fadeIn" data-src="//images10.newegg.com/NeweggImage/ProductImageCompressAll300/14-126-189-V07.jpg" src="//c1.neweggimages.com/WebResource/Themes/2005/Nest/blank.gif" title="ASUS ROG Strix Radeon RX 570 O4G Gaming OC Edition GDDR5 DP HDMI DVI VR Ready AMD Graphics Card (ROG-STRIX-RX570-O4G-GAMING)">
</img></a>

In [86]:
container.div.div.a

<a class="item-brand" href="https://www.newegg.com/ASUS/BrandStore/ID-1315">
<img alt="ASUS" class=" lazy-img" data-effect="fadeIn" data-src="//images10.newegg.com/Brandimage_70x28//Brand1315.gif" src="//c1.neweggimages.com/WebResource/Themes/2005/Nest/blank.gif" title="ASUS">
</img></a>

In [87]:
container.div.div.a.img["title"] #grabs the company name

'ASUS'

In [88]:
title_container=container.findAll("a", {"class":"item-title"})
title_container
#finds every a tag inside the container whose class is item-title

[<a class="item-title" href="https://www.newegg.com/Product/Product.aspx?Item=N82E16814126189&amp;ignorebbr=1" title="View Details"><i class="icon-premier icon-premier-xsm"></i>ASUS ROG Strix Radeon RX 570 O4G Gaming OC Edition GDDR5 DP HDMI DVI VR Ready AMD Graphics Card (ROG-STRIX-RX570-O4G-GAMING)</a>]

In [89]:
title_container[0].text

'ASUS ROG Strix Radeon RX 570 O4G Gaming OC Edition GDDR5 DP HDMI DVI VR Ready AMD Graphics Card (ROG-STRIX-RX570-O4G-GAMING)'

**In the next cell, the name is 'li' and the attribute is a class 'price-ship'**

In [90]:
shipping_info=container.findAll("li", {"class":"price-ship"})


In [91]:
shipping_info[0]

<li class="price-ship">
        Free Shipping
    </li>

In [92]:
shipping_info[0].text

'\r\n        Free Shipping\r\n    '

In [93]:
shipping_info[0].text.strip()

'Free Shipping'
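
To make the name-and-class matching concrete, here is a rough sketch that imitates it with Python's standard-library `html.parser` instead of Beautiful Soup. It is a stand-in for illustration, not how Beautiful Soup is implemented: `TagTextFinder` is a hypothetical class, the HTML snippet is made up, and the sketch does not handle a matching tag nested inside another tag of the same name.

```python
from html.parser import HTMLParser

class TagTextFinder(HTMLParser):
    """Collect the text of every <name> tag that carries the given class."""

    def __init__(self, name, class_name):
        super().__init__()
        self.name = name
        self.class_name = class_name
        self.inside = False   # True while we are inside a matching tag
        self.matches = []     # one text entry per matching tag

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.name and self.class_name in classes:
            self.inside = True
            self.matches.append("")

    def handle_data(self, data):
        if self.inside:
            self.matches[-1] += data

    def handle_endtag(self, tag):
        if tag == self.name:
            self.inside = False

# Made-up HTML shaped like the shipping entries shown above.
sample_html = """
<ul>
  <li class="price-ship">
        Free Shipping
    </li>
  <li class="price-current">$199.99</li>
  <li class="price-ship">$4.99 Shipping</li>
</ul>
"""

finder = TagTextFinder("li", "price-ship")
finder.feed(sample_html)
shipping = [text.strip() for text in finder.matches]
print(shipping)
```

As in the cells above, the raw text keeps its surrounding whitespace, so `strip()` is still needed to get a clean value.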

**The following cell combines all the concepts used above. After extracting the brand name, product name and shipping charges of each product, we write them to a CSV file to keep everything in order.**

In [94]:
headers = "brand,name,shipping\n"
with open("NewEgg.csv", "w") as f:  # the with block closes the file once the loop has finished
    f.write(headers)
    for container in containers:
        brand = container.div.div.a.img["title"]
        name = container.findAll("a", {"class":"item-title"})
        name_cleaned = name[0].text
        shipping = container.findAll("li", {"class":"price-ship"})
        shipping_cleaned = shipping[0].text.strip()
        print("brand: " + brand)
        print("product's name: " + name_cleaned)
        print("shipping charges: " + shipping_cleaned)
        # commas inside the product name would split the CSV column, so swap them for |
        f.write(brand + "," + name_cleaned.replace(",", "|") + "," + shipping_cleaned + "\n")

brand: ASUS
product's name: ASUS ROG Strix Radeon RX 570 O4G Gaming OC Edition GDDR5 DP HDMI DVI VR Ready AMD Graphics Card (ROG-STRIX-RX570-O4G-GAMING)
shipping charges: Free Shipping
brand: EVGA
product's name: EVGA GeForce GTX 1070 HYBRID GAMING, 08G-P4-6178-KR, 8GB GDDR5, LED, All-In-One Watercooling, DX12 OSD Support (PXOC)
shipping charges: $4.99 Shipping
brand: MSI
product's name: MSI GeForce GTX 1060 DirectX 12 GTX 1060 GAMING X 6G Video Card
shipping charges: $4.99 Shipping
brand: Sapphire Tech
product's name: SAPPHIRE NITRO+ Radeon RX 580 DirectX 12 100411NT+4G-2L Video Card w/ Backplate (UEFI), SAMSUNG MEMORY
shipping charges: $4.99 Shipping
brand: GIGABYTE
product's name: GIGABYTE GeForce GTX 1060 DirectX 12 GV-N1060G1 GAMING-6GD REV 2.0 Video Card
shipping charges: $4.99 Shipping
brand: ZOTAC
product's name: ZOTAC GeForce GTX 1050 DirectX 12 ZT-P10500A-10L Video Card
shipping charges: $4.99 Shipping
brand: ASUS
product's name: ASUS ROG GeForce GTX 1080 STRIX-GTX1080-A8G-GA
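
The cell above replaces commas in product names with `|` so that they do not break the CSV columns. As an alternative sketch, Python's `csv` module quotes such fields automatically and leaves the names unchanged. The row below is copied from the output above, and an in-memory buffer stands in for `NewEgg.csv`.

```python
import csv
import io

rows = [
    ("EVGA",
     "EVGA GeForce GTX 1070 HYBRID GAMING, 08G-P4-6178-KR, 8GB GDDR5, LED, "
     "All-In-One Watercooling, DX12 OSD Support (PXOC)",
     "$4.99 Shipping"),
]

buffer = io.StringIO()  # stands in for open("NewEgg.csv", "w", newline="")
writer = csv.writer(buffer)
writer.writerow(["brand", "name", "shipping"])
writer.writerows(rows)

# Reading the text back shows that the commas inside the name survived intact.
buffer.seek(0)
parsed = list(csv.reader(buffer))
print(parsed[1][1])
```

When writing to a real file, the `csv` documentation recommends opening it with `newline=""` so that line endings are not doubled on Windows.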

# Using Selenium and BeautifulSoup to scrape data from Craigslist:


**The cell below imports the packages needed to use Selenium's webdriver, WebDriverWait, and expected conditions (EC), along with the packages needed for BeautifulSoup.**

**The following program uses Selenium along with BeautifulSoup to narrow the query and scrape Craigslist according to the specifications provided by the user.**

**We create a class whose initializer stores the values provided by the user as instance variables. It also builds the search URL from these values and creates an instance of Selenium's web driver.**
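
The search URL is just a base address followed by a query string. The following is a minimal sketch of assembling it with `urllib.parse.urlencode`, which also escapes any special characters in the values. `build_search_url` is a hypothetical helper, and the parameter values are the ones used later in this section.

```python
from urllib.parse import urlencode

def build_search_url(location, postal, max_price, radius):
    """Assemble the Craigslist search URL from its query parameters."""
    query = urlencode({
        "search_distance": radius,
        "postal": postal,
        "max_price": max_price,
    })
    return f"https://{location}.craigslist.org/search/sss?{query}"

url = build_search_url("ksu", "66502", "1500", "5")
print(url)
```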

**The test method is included for debugging the program. All it does is print the URL.**

**The load_craigslist_url method directs the browser to the URL built in the initializer. The page may not load instantaneously, and if the crawler tries to extract the elements before they are loaded, our program will not retrieve the desired results. To make sure the page has loaded before we extract from it, we use WebDriverWait to wait up to 3 seconds for an expected condition (EC) to be met.**

**In this case, the expected condition is the presence of a specific element on the webpage. We identify the element by its ID, which here is 'searchform'. The searchform element corresponds to the entire webpage, so the crawler will not start extracting until the page has loaded. Here we searched by ID; it is also possible to search by CSS selector, class name, etc. The locator passed to the condition is a pair: the first item (By.ID) specifies the kind of lookup and the second is the value to look for.**

**When the page has loaded, we print a message indicating this. If the wait times out instead, a message is printed saying that the page took too long to load. We can handle this by increasing the delay from 3 seconds (trial and error shows how long is long enough).**
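
The idea behind waiting for a condition can be illustrated without a browser. The following sketch polls a condition until it succeeds or a timeout passes. It is a simplified stand-in written for this tutorial, not Selenium's implementation: `wait_until` and `element_present` are hypothetical names, and the fake condition simply succeeds on its third check.

```python
import time

def wait_until(condition, timeout, poll_interval=0.1):
    """Call condition() repeatedly until it returns a truthy value.

    Raises TimeoutError if timeout seconds pass first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)

# Hypothetical stand-in for a page whose element appears on the third check.
attempts = {"count": 0}

def element_present():
    attempts["count"] += 1
    return "searchform" if attempts["count"] >= 3 else None

try:
    found = wait_until(element_present, timeout=3)
    print("Page is ready:", found)
except TimeoutError:
    print("Loading took too long")
```

The wait returns as soon as the condition is met, so a generous timeout only costs time when the page really is slow.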

**The extract_post_titles method collects all the postings by the class name 'result-row'. It then extracts the text of each posting, stores it in a second list, and returns that list.**

**The extract_post_urls method creates a list for the URLs. It fetches the HTML from the URL built in the initializer and passes it to BeautifulSoup's constructor. Using BeautifulSoup, it is easy to collect only the links from the HTML. To do this, we use a for loop to iterate through each element with the name 'a', attribute 'class', and value 'result-title hdrlnk'. The href attribute of each link is appended to the list.**

**The closing method’s only task is to close the browser.**

In [95]:
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
from urllib.request import urlopen

class CraigsListScraper(object):
    def __init__(self, location, postal, max_price, radius):
        self.location = location
        self.postal = postal
        self.max_price = max_price
        self.radius = radius
        self.url = f"https://{location}.craigslist.org/search/sss?search_distance={radius}&postal={postal}&max_price={max_price}"
        self.driver = webdriver.Firefox()
        self.delay = 3 # wait up to 3 seconds for the page to load

    def test(self):
        print(self.url)

    def load_craigslist_url(self):
        self.driver.get(self.url)
        try:
            wait = WebDriverWait(self.driver, self.delay)
            wait.until(EC.presence_of_element_located((By.ID, "searchform")))
            print("Page is ready")
        except TimeoutException:
            print("Loading took too long") # if this message is displayed, we need to increase the delay

    def extract_post_titles(self):
        all_posts = self.driver.find_elements_by_class_name("result-row") # result-row is the name of the class
        post_title_list = []
        for post in all_posts:
            print(post.text)
            post_title_list.append(post.text)
        return post_title_list

    def extract_post_urls(self):
        url_list = []
        html_page = urlopen(self.url)
        soup = BeautifulSoup(html_page, "lxml")
        for link in soup.findAll("a", {"class":"result-title hdrlnk"}):
            print(link)
            print(link["href"])
            url_list.append(link["href"]) # because we only want the link, not the whole a tag
        return url_list

    def closing(self):
        self.driver.close()

**The following cell drives the program. It sets the location, postal code, maximum price, and search radius, then creates an instance of the CraigsListScraper class by passing in these values. Lastly, it calls the methods in the right order to open the browser, extract the titles of the items, extract their URLs, and close the browser.**

In [96]:
location = "ksu"
postal="66502"
max_price="1500"
radius="5"
scraper= CraigsListScraper(location, postal, max_price, radius)
scraper.test()
scraper.load_craigslist_url() #directs firefox to the url we just created
scraper.extract_post_titles()
scraper.extract_post_urls()
scraper.closing()

https://ksu.craigslist.org/search/sss?search_distance=5&postal=66502&max_price=1500
Page is ready
$800
image 1 of 4
Mar 28 Weld Wheels and Tires $800
$800
image 1 of 4
Mar 28 Weld Wheels and Tires $800 (wic > Manhattan)
$800
image 1 of 4
Mar 28 Weld Wheels and Tires $800 (ksc)
Mar 28 Aqua Cat II ( sailboat ) for sale Manhattan Ks $859 $850 (Manhattan Ks)
$150
image 1 of 3
Mar 28 Patio 4 Seat Set $150
$1000
Mar 28 2010 Polaris 400 AWD ATV Looks Like NEw $1000 (Manhattan)
$75
Mar 28 iPad 2 (64 GB) & ATT LG 3TE $75
$25
image 1 of 2
Mar 28 Victorian mirror with shelf $25 (Manhattan)
$14
image 1 of 3
Mar 28 A BUNCH of Par 64 and 56 Light Fixtures - PRICE DROP!!! $14 (Manhattan, KS)
$50
image 1 of 5
Mar 28 Electric Bass Case Road Runner $50 (Manhattan)
$17
image 1 of 23
Mar 28 DRONES: 17$ Mo/ ($0 Due Today) Fly Now Pay Later DJI™Parrot™Drone $17 (Manhattan)
Mar 28 ISO FEMALE NEW ZEALAND $1 (Manhattan)
$20
image 1 of 2
Mar 28 Book shelf $20 (Manhattan)
$5
image 1 of 3
Mar 28 Study lamp $5 (Ma