# CSE 801a - Intro to Big Data Analysis

## Assignment 01 - Web Crawler

<p><b>The code below implements a web crawler for Assignment Exercise 01 of the CSE 801a course.
    The program takes a static web URL as its entry point and implements a spider that crawls the links present on that page, and then the links present on those linked pages (a Parent - Child - Grandchild style of traversal).</b></p>

<b>*************************************************************************************************</b>

<p>The instructions that follow are meant to help a third person understand the logic behind this code.</p>

<br>
<b>First we'll import all the necessary libraries required to implement our Spider (Web Crawler)</b>

In [1]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse,urljoin
import urllib.request  # also loads urllib.error and urllib.parse, which are used below
import pandas as pd

<b>In the code cell below we supply the entry point of the crawler (the first link it visits).
    We also define 3 empty lists to store, for each crawled link, the:
    <ol>
        <li>Source URL</li>
        <li>Target URL</li>
        <li>Title of Target URL</li>
    </ol>
    <br>
    After that we extract the domain of the website with urlparse (its netloc attribute) and rebuild a standard URL, home_url, from the scheme, netloc and path. home_url is the base against which relative links are resolved, so that relatively linked pages in the same domain can be crawled.</b>

In [3]:
BaseURL = 'https://en.wikipedia.org/wiki/Manchester_United_F.C.'
URL_SOURCE = []
URL_TARGET = []
URL_TITLE  = []
parsed_url = urlparse(BaseURL)
domain = parsed_url.netloc  # Host part of the URL, used later for the same-domain check
# Preparing the base (home) URL against which relative links found on a page are resolved
home_url = parsed_url.scheme + '://' + parsed_url.netloc + parsed_url.path
home_url

'https://en.wikipedia.org/wiki/Manchester_United_F.C.'
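As a small illustration of how the base URL is used, the sketch below resolves a few href values against it with `urljoin`. The href values are hypothetical and chosen only to show the three common cases (root-relative, protocol-relative and absolute); they are not taken from the crawled page.

```python
from urllib.parse import urlparse, urljoin

base_url = 'https://en.wikipedia.org/wiki/Manchester_United_F.C.'
parsed = urlparse(base_url)
home = parsed.scheme + '://' + parsed.netloc + parsed.path

# Hypothetical href values, for illustration only
root_relative = urljoin(home, '/wiki/Everton_F.C.')
protocol_relative = urljoin(home, '//commons.wikimedia.org/wiki/Main_Page')
absolute = urljoin(home, 'https://www.example.org/page')

print(parsed.netloc)       # host of the entry point
print(root_relative)       # resolved against the scheme and host of home
print(protocol_relative)   # keeps its own host, inherits the scheme
print(absolute)            # already absolute, returned unchanged
```

Only the first result stays on the host of the entry point, which is why a same-domain check is still needed after joining.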

<b>In the cell below we create a list of commonly used image extensions. It is used later to filter out hyperlinks that point to an image through an anchor (&lt;a&gt;) tag rather than to an actual webpage</b>

In [4]:
image_list = ['.jpg','.png','.jpeg','.svg','.gif','.tif','.tiff','.bmp','.raw','.cr2','.nef','.orf','.sr2']
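The crawler later tests whether any of these extensions occurs anywhere in a link. A stricter variant, sketched below as an assumption rather than as the method used in this notebook, checks only the end of the URL path. The helper name `looks_like_image`, the shortened extension tuple and the example URLs are all hypothetical.

```python
from urllib.parse import urlparse

# Shortened, hypothetical version of the extension list, as a tuple for str.endswith
image_extensions = ('.jpg', '.jpeg', '.png', '.svg', '.gif')

def looks_like_image(url):
    """Return True when the URL path ends with a known image extension."""
    path = urlparse(url).path.lower()
    return path.endswith(image_extensions)

# Hypothetical URLs, for illustration only
print(looks_like_image('https://en.wikipedia.org/wiki/File:Old_Trafford.JPG'))
print(looks_like_image('https://en.wikipedia.org/wiki/Everton_F.C.'))
print(looks_like_image('https://en.wikipedia.org/wiki/List_of_.png_tools'))
```

The last URL contains `.png` in the middle of its path, so a substring test would treat it as an image while the suffix test does not.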

<b> Below cell is used to define a function "find_title" which will be used to extract the title of the supplied URL (Title of the target page in our case)</b>

In [5]:
def find_title(title_link):
    try:
        with urllib.request.urlopen(title_link) as response:
            page_html = response.read()
        # Naming the parser explicitly avoids the bs4 "no parser specified" warning
        soup_title = BeautifulSoup(page_html, 'html.parser')
        title_of_page = soup_title.title.text
    except Exception:
        # The request failed (e.g. an HTTP error) or the page has no <title>.
        # URLs with a blank title are skipped/ignored later
        title_of_page = ''
    return title_of_page

<b> The cell below implements a function "link_list_generator". It takes a URL, collects all the hyperlinks on that page that point to the same domain, and returns them to the caller as a set of hyperlinks (URLs)</b>

In [6]:
def link_list_generator(URL_LINK):
    with urllib.request.urlopen(URL_LINK) as response:
        all_links = response.read()
    bs_all_links = BeautifulSoup(all_links, 'html.parser')
    required_links = set()
    for link in bs_all_links.body.find_all('a'):
        href = link.attrs.get("href")
        if not href:  # Ignoring anchors without an actual reference
            continue
        if "#" in href:  # Ignoring fragment links, i.e. links to a section of a page
            continue
        # Joining the relative URL with the home URL to obtain the absolute URL
        href = urljoin(home_url, href)
        if domain in href:  # Checking if the URL is in the same domain
            required_links.add(href)
    return required_links
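The same filtering steps can be tried without any network access. The sketch below is an assumption-laden variant, not the notebook's implementation: it swaps BeautifulSoup for the standard library `html.parser`, compares the host exactly instead of using a substring test, and runs on a small hypothetical HTML snippet. The class name `LinkCollector` and the sample markup are invented for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collect same-host links from the <a> tags of an HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.host = urlparse(base_url).netloc
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href')
        if not href or '#' in href:  # skip missing/empty hrefs and fragment links
            return
        absolute = urljoin(self.base_url, href)
        if urlparse(absolute).netloc == self.host:  # exact host comparison
            self.links.add(absolute)

# Hypothetical page content, for illustration only
sample_html = """
<html><body>
  <a href="/wiki/Everton_F.C.">Everton</a>
  <a href="#History">History</a>
  <a href="">empty</a>
  <a>no href</a>
  <a href="https://www.example.org/?ref=en.wikipedia.org">external</a>
</body></html>
"""

collector = LinkCollector('https://en.wikipedia.org/wiki/Manchester_United_F.C.')
collector.feed(sample_html)
print(sorted(collector.links))
```

The external link mentions `en.wikipedia.org` in its query string, so a substring test such as `domain in href` would accept it, while the exact host comparison rejects it.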

In [7]:
print(URL_SOURCE) #To check if list contains anything

[]


<b>The code below seeds the lists with the values for the first iteration, using the entry point URL</b>

In [8]:
if (len(URL_SOURCE) == 0): #Updating the list with first iteration as per the specified requirement
    URL_SOURCE.append('None')
    URL_TARGET.append(BaseURL)
    base_url_title = find_title(BaseURL)
    #print(f'Base {base_url_title}')
    URL_TITLE.append(base_url_title)

<b>The cell below implements our main function <i><u>crawler</u></i>, which takes a URL, performs certain checks and validations, and populates the lists that are of interest to us, namely <i>URL_SOURCE</i>, <i>URL_TARGET</i>, and <i>URL_TITLE</i> (for the target page)</b>

In [9]:
def crawler(URL):
    crawler_count = 0
    if (len(URL_SOURCE) > 0):
        requested_urls = link_list_generator(URL)
        for link1 in requested_urls:
            if link1 in URL_SOURCE:
                continue #Ignoring links already present in source URL
            else:
                # Checking whether the link points to an image instead of an actual page
                image_contains = any(image_check in link1 for image_check in image_list)
                if image_contains:
                    continue #Ignoring image-type URLs
                else:
                    if (len(URL_SOURCE) < 100): #Ensuring that 100 URLS are not already crawled
                        target_url = link1
                        target_title = find_title(target_url)
                        if (crawler_count < 20): #Ensuring that only 20 links are crawled per parent URL
                                                 #Limited Crawling concept
                            if (target_url in URL_SOURCE) or (target_url in URL_TARGET) or (target_title in URL_TITLE):
                                continue #Ignoring target URLS already in Source or target and fetching the next ones
                            else:
                                if target_title == '':
                                    continue #Ignoring URLS without a title
                                else:
                                    #print(f'{crawler_count} : {target_title}')
                                    URL_SOURCE.append(URL)
                                    URL_TARGET.append(target_url)
                                    URL_TITLE.append(target_title)
                                    crawler_count += 1
                        else:
                            if (target_url in URL_SOURCE) or (target_url in URL_TARGET) or (target_title in URL_TITLE):
                                continue
                            else:
                                try:
                                    with urllib.request.urlopen(target_url) as response:
                                        URL_NEW = target_url
                                        crawler(URL_NEW) #Recursively calling the crawler function on the remaining
                                                            #URLs once 20 links have been stored for this URL, to ensure
                                                            #one parent has at most 20 direct children
                                except urllib.error.HTTPError as e:
                                    continue
                    else:
                        break
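The two limits used above (at most 20 children per parent, 100 rows overall) can also be expressed with an explicit queue. The sketch below is a different technique from the recursive function in this notebook: a breadth-first traversal, run on a hypothetical in-memory link graph so that it needs no network access. The names `toy_links` and `bounded_crawl`, and the small limits passed in the call, are assumptions made for illustration.

```python
from collections import deque

# Hypothetical in-memory link graph standing in for real pages
toy_links = {
    'A': ['B', 'C', 'D'],
    'B': ['A', 'E'],
    'C': ['E', 'F'],
    'D': [],
    'E': ['G'],
    'F': [],
    'G': [],
}

def bounded_crawl(start, get_links, max_rows=100, max_children=20):
    """Breadth-first crawl returning (source, target) rows within both limits."""
    rows = [(None, start)]   # the entry point has no source
    seen = {start}
    queue = deque([start])
    while queue and len(rows) < max_rows:
        parent = queue.popleft()
        children = 0
        for target in get_links(parent):
            if len(rows) >= max_rows or children >= max_children:
                break
            if target in seen:   # skip pages that were already recorded
                continue
            seen.add(target)
            rows.append((parent, target))
            queue.append(target)
            children += 1
    return rows

rows = bounded_crawl('A', lambda page: toy_links.get(page, []),
                     max_rows=6, max_children=2)
for source, target in rows:
    print(source, '->', target)
```

With `max_children=2`, page `D` is never recorded because `A` already has two children, and the crawl stops as soon as six rows have been collected.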

<b>Below cell will implement the logic to call the <i><u>crawler</u></i> function using the entry URL</b>

#### Note: The crawler function might take 1-3 minutes to crawl all the required pages, depending on the network speed and how heavy the site being crawled is

In [10]:
crawler(BaseURL)

<b> The code cell below combines the three lists into a single pandas dataframe with 3 columns: url_source holds the list URL_SOURCE, url_target holds URL_TARGET, and page_title_target holds URL_TITLE </b>

In [11]:
df_web_crawler = pd.DataFrame(list(zip(URL_SOURCE, URL_TARGET, URL_TITLE)),columns=['url_source','url_target', 'page_title_target'])

<b> Next we use print, df.head() and df.tail() to check the contents of the list and the dataframe</b>

In [12]:
print(len(URL_SOURCE))

100


In [13]:
df_web_crawler.head(10)

| | url_source | url_target | page_title_target |
|---|---|---|---|
| 0 | | https://en.wikipedia.org/wiki/Manchester_Unite... | Manchester United F.C. - Wikipedia |
| 1 | https://en.wikipedia.org/wiki/Manchester_Unite... | https://en.wikipedia.org/wiki/Portal:English_f... | Portal:English football - Wikipedia |
| 2 | https://en.wikipedia.org/wiki/Manchester_Unite... | https://en.wikipedia.org/wiki/War_Damage_Commi... | War Damage Commission - Wikipedia |
| 3 | https://en.wikipedia.org/wiki/Manchester_Unite... | https://en.wikipedia.org/wiki/Everton_F.C. | Everton F.C. - Wikipedia |
| 4 | https://en.wikipedia.org/wiki/Manchester_Unite... | https://en.wikipedia.org/wiki/Category:Footbal... | Category:Football clubs in England - Wikipedia |
| 5 | https://en.wikipedia.org/wiki/Manchester_Unite... | https://en.wikipedia.org/wiki/Football_Focus | Football Focus - Wikipedia |
| 6 | https://en.wikipedia.org/wiki/Manchester_Unite... | https://en.wikipedia.org/wiki/Singapore_Exchange | Singapore Exchange - Wikipedia |
| 7 | https://en.wikipedia.org/wiki/Manchester_Unite... | https://en.wikipedia.org/wiki/1967_FA_Charity_... | 1967 FA Charity Shield - Wikipedia |
| 8 | https://en.wikipedia.org/wiki/Manchester_Unite... | https://en.wikipedia.org/wiki/Turkish_Airlines | Turkish Airlines - Wikipedia |
| 9 | https://en.wikipedia.org/wiki/Manchester_Unite... | https://en.wikipedia.org/wiki/1985%E2%80%9386_... | 1985–86 European Cup Winners' Cup - Wikipedia |


In [14]:
df_web_crawler.tail(10)

| | url_source | url_target | page_title_target |
|---|---|---|---|
| 90 | https://en.wikipedia.org/wiki/FIFA | https://en.wikipedia.org/wiki/Cornelis_August_... | Cornelis August Wilhelm Hirschman - Wikipedia |
| 91 | https://en.wikipedia.org/wiki/FIFA | https://en.wikipedia.org/wiki/1994_FIFA_World_Cup | 1994 FIFA World Cup - Wikipedia |
| 92 | https://en.wikipedia.org/wiki/FIFA | https://en.wikipedia.org/wiki/RICO | Racketeer Influenced and Corrupt Organizations... |
| 93 | https://en.wikipedia.org/wiki/FIFA | https://en.wikipedia.org/wiki/Valais_Women%27s... | Valais Women's Cup - Wikipedia |
| 94 | https://en.wikipedia.org/wiki/FIFA | https://en.wikipedia.org/wiki/Mediterranean_Fu... | Mediterranean Futsal Cup - Wikipedia |
| 95 | https://en.wikipedia.org/wiki/FIFA | https://en.wikipedia.org/wiki/Burundi_national... | Burundi national football team - Wikipedia |
| 96 | https://en.wikipedia.org/wiki/FIFA | https://en.wikipedia.org/wiki/Qatar_Airways | Qatar Airways - Wikipedia |
| 97 | https://en.wikipedia.org/wiki/FIFA | https://en.wikipedia.org/wiki/WAFU_Zone_B_Wome... | WAFU Zone B Women's Cup - Wikipedia |
| 98 | https://en.wikipedia.org/wiki/FIFA | https://en.wikipedia.org/wiki/Philippines_nati... | Philippines national football team - Wikipedia |
| 99 | https://en.wikipedia.org/wiki/FIFA | https://en.wikipedia.org/wiki/CECAFA_Cup | CECAFA Cup - Wikipedia |


<b> This is the final step of the spider/crawler: it writes the dataframe to a CSV file <i>crawl.csv</i> with UTF-8 encoding so that the user can view the results</b>

In [15]:
df_web_crawler.to_csv('crawl.csv', index=False, encoding = 'utf-8', quoting=2)

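<b> A note on encoding: when the CSV file is opened on a Windows machine (for example in Excel), non-ASCII characters may look garbled even though the encoding was supplied as <i>UTF-8</i> in to_csv. This is most likely not a bug in Python: the file is written as UTF-8 without a byte-order mark, and Windows programs often assume their own charset, cp1252, when reading such a file. cp1252 and UTF-8 agree only on plain ASCII characters, so anything else (such as the en dash in some page titles) is displayed incorrectly. Supplying the encoding as <i>utf-8-sig</i> adds a byte-order mark, which lets Excel detect UTF-8.</b>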
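To see how UTF-8 and cp1252 relate, the following sketch encodes one of the crawled page titles as UTF-8 and then reads the bytes back as cp1252. It uses only the standard library, and the variable names are chosen for illustration.

```python
import codecs

title = "1985–86 European Cup Winners' Cup - Wikipedia"

utf8_bytes = title.encode('utf-8')

# Reading UTF-8 bytes as cp1252 garbles the en dash; the ASCII characters survive
misread = utf8_bytes.decode('cp1252')
print(misread)

# 'utf-8-sig' prefixes a byte-order mark that some Windows programs use to detect UTF-8
with_bom = title.encode('utf-8-sig')
print(with_bom.startswith(codecs.BOM_UTF8))
print(with_bom.decode('utf-8-sig') == title)
```

The two charsets give the same result only for the ASCII part of the title, which is why the en dash is the only character that changes.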