# Web Crawler - ITM 891 Assignment 01 - Rahul Sivakumar

## Web Crawler to get links from Arsenal Football Club's Wikipedia page

In [1]:
%matplotlib inline

Importing the libraries needed for the program

In [2]:
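# Import the submodules explicitly; a bare "import urllib" does not guarantee they are loaded
import urllib.request
import urllib.parse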
import bs4
import pandas as pd
import csv
import requests

Reading the page and parsing it with Beautiful Soup

Link: [Wikipedia Arsenal F.C.](https://en.wikipedia.org/wiki/Arsenal_F.C.)

In [3]:
source_url = 'https://en.wikipedia.org/wiki/Arsenal_F.C.'

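with urllib.request.urlopen(source_url, data=None, timeout=5) as response:
    # Name the parser explicitly so the result does not depend on which parsers are installed
    soup = bs4.BeautifulSoup(response, 'html.parser')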


### Finding all 'a' tags in the soup to get links from them:


In [4]:
a_tags = soup.find_all('a')
a_tags

[<a id="top"></a>,
 <a href="/wiki/Wikipedia:Featured_articles" title="This is a featured article. Click here for more information."><img alt="This is a featured article. Click here for more information." data-file-height="438" data-file-width="462" decoding="async" height="19" src="//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/30px-Cscr-featured.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/40px-Cscr-featured.svg.png 2x" width="20"/></a>,
 <a href="/wiki/Wikipedia:Protection_policy#semi" title="This article is semi-protected."><img alt="Page semi-protected" data-file-height="512" data-file-width="512" decoding="async" height="20" src="//upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/20px-Semi-protection-shackle.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/30px-Se

### Collecting the 'href' attribute of every 'a' tag into a set

In [5]:
unclean_links = set()
for x in a_tags:
    if 'href' in x.attrs:
        unclean_links.add(x.attrs['href'])

unclean_links

{'/wiki/Brazil',
 '/wiki/International_Standard_Serial_Number',
 'https://www.arsenal.com/news/club-names-new-leaders-ivan-heads-italy',
 '#cite_ref-120',
 '/wiki/PSV_Eindhoven',
 '/wiki/Bob_Wall_(football_administrator)',
 '/wiki/Hampton_%26_Richmond_Borough_F.C.',
 'https://books.google.com/books?id=DZOZsXVCvS8C',
 'https://ml.wikipedia.org/wiki/%E0%B4%86%E0%B4%B4%E0%B5%8D%E0%B4%B8%E0%B4%A3%E0%B5%BD_%E0%B4%8E%E0%B4%AB%E0%B5%8D.%E0%B4%B8%E0%B4%BF.',
 'http://www.arsenal.com/news/features/48523/behind-the-numbers-',
 '/wiki/List_of_Arsenal_F.C._players_(25%E2%80%9399_appearances)',
 '/wiki/List_of_Premier_League_highest_scoring_games',
 '/wiki/Library_of_Congress_Control_Number',
 'http://www.thearsenalhistory.com/?p=7726',
 '#cite_ref-235',
 '/wiki/Syst%C3%A8me_universitaire_de_documentation',
 '/wiki/Special:BookSources/978-0-7434-4033-2',
 '/wiki/File:Herbert_Chapman_bust_20050922.jpg',
 'https://en.wikipedia.org/w/index.php?title=Template:G-14&action=edit',
 '/wiki/Belgium',
 'http

In [6]:
len(unclean_links)

1924

### Using urljoin to resolve relative links against the source URL

In [7]:
joined_links = set()
for link in unclean_links:
    joined_link = urllib.parse.urljoin(source_url,link)
    joined_links.add(joined_link)
    
joined_links

{'https://en.wikipedia.org/wiki/PSV_Eindhoven',
 'https://www.arsenal.com/news/club-names-new-leaders-ivan-heads-italy',
 'https://en.wikipedia.org/wiki/Arsenal_F.C.#cite_note-68',
 'https://books.google.com/books?id=DZOZsXVCvS8C',
 'https://en.wikipedia.org/wiki/Arsenal_F.C.#cite_note-mertesacker-207',
 'https://en.wikipedia.org/wiki/1998_FA_Charity_Shield',
 'https://en.wikipedia.org/wiki/List_of_owners_of_English_football_clubs#Premier_League',
 'https://en.wikipedia.org/wiki/1947%E2%80%9348_Football_League',
 'https://ml.wikipedia.org/wiki/%E0%B4%86%E0%B4%B4%E0%B5%8D%E0%B4%B8%E0%B4%A3%E0%B5%BD_%E0%B4%8E%E0%B4%AB%E0%B5%8D.%E0%B4%B8%E0%B4%BF.',
 'http://www.arsenal.com/news/features/48523/behind-the-numbers-',
 'https://en.wikipedia.org/wiki/2007%E2%80%9308_Premier_League',
 'https://en.wikipedia.org/wiki/Special:BookSources/978-0-7535-4661-1',
 'https://en.wikipedia.org/wiki/Template:Football_in_London',
 'https://en.wikipedia.org/wiki/Arsenal_F.C.#cite_note-83',
 'https://en.wikipe
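
A minimal sketch of how `urljoin` treats the three kinds of `href` values that appear in the set above: a root-relative path, a bare fragment, and an already absolute URL. The variable names `examples` and `resolved` are illustrative and not part of the notebook.

```python
from urllib.parse import urljoin

base = 'https://en.wikipedia.org/wiki/Arsenal_F.C.'

# href values taken from the unclean_links output above
examples = [
    '/wiki/Brazil',                                    # root-relative path
    '#cite_ref-120',                                   # fragment only
    'https://books.google.com/books?id=DZOZsXVCvS8C',  # already absolute
]

# Resolve each href against the page it was found on
resolved = {href: urljoin(base, href) for href in examples}

for href, url in resolved.items():
    print(href, '->', url)
```

A root-relative path replaces the path of the base URL, a bare fragment is appended to the base URL, and an absolute URL is returned unchanged.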

### Removing URL fragments so that links to different sections of the same page collapse into one

Storing these defragged links in a new set

In [8]:
defrag_links = set()
for link in joined_links:
    defragged = urllib.parse.urldefrag(link).url
    defrag_links.add(defragged)

defrag_links

{'https://en.wikipedia.org/wiki/PSV_Eindhoven',
 'https://www.arsenal.com/news/club-names-new-leaders-ivan-heads-italy',
 'https://books.google.com/books?id=DZOZsXVCvS8C',
 'https://en.wikipedia.org/wiki/1998_FA_Charity_Shield',
 'https://en.wikipedia.org/wiki/1947%E2%80%9348_Football_League',
 'https://ml.wikipedia.org/wiki/%E0%B4%86%E0%B4%B4%E0%B5%8D%E0%B4%B8%E0%B4%A3%E0%B5%BD_%E0%B4%8E%E0%B4%AB%E0%B5%8D.%E0%B4%B8%E0%B4%BF.',
 'http://www.arsenal.com/news/features/48523/behind-the-numbers-',
 'https://en.wikipedia.org/wiki/2007%E2%80%9308_Premier_League',
 'https://en.wikipedia.org/wiki/Special:BookSources/978-0-7535-4661-1',
 'https://en.wikipedia.org/wiki/Template:Football_in_London',
 'https://en.wikipedia.org/wiki/Cardiff_City_F.C.',
 'http://www.thearsenalhistory.com/?p=7726',
 'https://en.wikipedia.org/w/index.php?title=Template:G-14&action=edit',
 'https://en.wikipedia.org/wiki/1979%E2%80%9380_FA_Cup',
 'https://vec.wikipedia.org/wiki/Arsenal_Football_Club',
 'https://en.wikip
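
A small sketch of why defragmenting shrinks the set: two links that differ only in their `#fragment` become the same string once the fragment is dropped, and the set keeps a single copy. The list `links` is a hand-picked sample from the output above; the names are illustrative.

```python
from urllib.parse import urldefrag

links = [
    'https://en.wikipedia.org/wiki/Arsenal_F.C.#cite_note-68',
    'https://en.wikipedia.org/wiki/Arsenal_F.C.#cite_note-83',
    'https://en.wikipedia.org/wiki/PSV_Eindhoven',
]

# urldefrag returns a (url, fragment) result; keep only the url part
defragged = {urldefrag(link).url for link in links}

print(len(links), '->', len(defragged))
```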

### Keeping only the links on the same host as source_url (en.wikipedia.org in this case)

Again, storing this subset into a new set called parse_links

In [9]:
parse_links = set()
source_parsed = urllib.parse.urlparse(source_url)
for x in defrag_links:
    parsed_link = urllib.parse.urlparse(x)
    if parsed_link.netloc == source_parsed.netloc:
        parse_links.add(x)

In [10]:
parse_links

{'https://en.wikipedia.org/w/index.php?title=Arsenal_F.C.&action=edit',
 'https://en.wikipedia.org/w/index.php?title=Arsenal_F.C.&action=history',
 'https://en.wikipedia.org/w/index.php?title=Arsenal_F.C.&action=info',
 'https://en.wikipedia.org/w/index.php?title=Arsenal_F.C.&oldid=938937943',
 'https://en.wikipedia.org/w/index.php?title=Arsenal_F.C.&printable=yes',
 'https://en.wikipedia.org/w/index.php?title=Special:Book&bookcmd=book_creator&referer=Arsenal+F.C.',
 'https://en.wikipedia.org/w/index.php?title=Special:CiteThisPage&page=Arsenal_F.C.&id=938937943',
 'https://en.wikipedia.org/w/index.php?title=Special:CreateAccount&returnto=Arsenal+F.C.',
 'https://en.wikipedia.org/w/index.php?title=Special:ElectronPdf&page=Arsenal+F.C.&action=show-download-screen',
 'https://en.wikipedia.org/w/index.php?title=Special:UserLogin&returnto=Arsenal+F.C.',
 'https://en.wikipedia.org/w/index.php?title=Template:Arsenal_F.C.&action=edit',
 'https://en.wikipedia.org/w/index.php?title=Template:Arse

In [11]:
len(parse_links)

904
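
A sketch of the comparison used in the filter, wrapped in a helper. The function name `same_host` is hypothetical and not part of the notebook. Note that `netloc` is the full host name, so other language editions of Wikipedia do not match `en.wikipedia.org`.

```python
from urllib.parse import urlparse

def same_host(url, reference):
    """Return True when both URLs share exactly the same network location."""
    return urlparse(url).netloc == urlparse(reference).netloc

source = 'https://en.wikipedia.org/wiki/Arsenal_F.C.'

print(same_host('https://en.wikipedia.org/wiki/PSV_Eindhoven', source))
print(same_host('https://vec.wikipedia.org/wiki/Arsenal_Football_Club', source))
print(same_host('http://www.thearsenalhistory.com/?p=7726', source))
```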

### Checking the Content-Type header of each URL and keeping only those of type 'text/html'

We want only web pages, not other content such as images

In [12]:
%%time

content_types = set()
web_urls = set()

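for x in parse_links:
    try:
        # A HEAD request fetches only the headers, not the page body
        with requests.head(x, timeout=5) as response:
            response.raise_for_status()
            content_types.add(response.headers['Content-Type'])
            if 'text/html' in response.headers['Content-Type']:
                web_urls.add(x)
    except (requests.RequestException, KeyError):
        # Skip URLs that fail, time out, return an error status, or send no Content-Type
        pass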

Wall time: 4min 15s
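
A Content-Type value usually carries parameters after the media type, for example `text/html; charset=UTF-8`. The following is a sketch of a stricter check that compares only the media type and tolerates a missing header. The helper name `is_html` is an assumption for illustration, not something the notebook defines.

```python
def is_html(content_type):
    """Return True when the media type of a Content-Type value is text/html."""
    if not content_type:
        # Header missing or empty
        return False
    # Drop parameters such as "; charset=UTF-8" and normalise case
    media_type = content_type.split(';', 1)[0].strip().lower()
    return media_type == 'text/html'

print(is_html('text/html; charset=UTF-8'))
print(is_html('image/jpeg'))
print(is_html(None))
```

Inside the loop this could be called as `is_html(response.headers.get('Content-Type'))`, since `headers.get` returns `None` when the header is absent.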


In [13]:
web_urls

{'https://en.wikipedia.org/w/index.php?title=Arsenal_F.C.&action=edit',
 'https://en.wikipedia.org/w/index.php?title=Arsenal_F.C.&action=history',
 'https://en.wikipedia.org/w/index.php?title=Arsenal_F.C.&action=info',
 'https://en.wikipedia.org/w/index.php?title=Arsenal_F.C.&oldid=938937943',
 'https://en.wikipedia.org/w/index.php?title=Arsenal_F.C.&printable=yes',
 'https://en.wikipedia.org/w/index.php?title=Special:Book&bookcmd=book_creator&referer=Arsenal+F.C.',
 'https://en.wikipedia.org/w/index.php?title=Special:CiteThisPage&page=Arsenal_F.C.&id=938937943',
 'https://en.wikipedia.org/w/index.php?title=Special:CreateAccount&returnto=Arsenal+F.C.',
 'https://en.wikipedia.org/w/index.php?title=Special:ElectronPdf&page=Arsenal+F.C.&action=show-download-screen',
 'https://en.wikipedia.org/w/index.php?title=Special:UserLogin&returnto=Arsenal+F.C.',
 'https://en.wikipedia.org/w/index.php?title=Template:Arsenal_F.C.&action=edit',
 'https://en.wikipedia.org/w/index.php?title=Template:Arse

In [14]:
len(web_urls)

904

### Storing the set of links as a list and subsetting only the first 99 links

In [15]:
web_urls_list = list(web_urls)

web_urls_req = web_urls_list[0:99]
len(web_urls_req)

99
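
A set has no defined order, and Python randomizes string hashing per process by default, so `list(web_urls)[0:99]` can select a different subset each time the notebook is run from a fresh kernel. A sketch of one way to make the selection repeatable, assuming alphabetical order is acceptable; the sample set `urls` is illustrative.

```python
urls = {
    'https://en.wikipedia.org/wiki/PSV_Eindhoven',
    'https://en.wikipedia.org/wiki/BBC_Sport',
    'https://en.wikipedia.org/wiki/EFL_Cup',
}

# Sorting first gives the same slice on every run
first_two = sorted(urls)[:2]

print(first_two)
```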

In [16]:
web_urls_req

['https://en.wikipedia.org/w/index.php?title=Template:UEFA_Cup_Winners%27_Cup_winners&action=edit',
 'https://en.wikipedia.org/wiki/Forward_(association_football)',
 'https://en.wikipedia.org/wiki/File:Arsenal_1888_squad_photo.jpg',
 'https://en.wikipedia.org/wiki/3D_television',
 'https://en.wikipedia.org/wiki/Tooting_%26_Mitcham_United_F.C.',
 'https://en.wikipedia.org/wiki/PSV_Eindhoven',
 'https://en.wikipedia.org/wiki/BBC_Sport',
 'https://en.wikipedia.org/wiki/2004%E2%80%9305_FA_Cup',
 'https://en.wikipedia.org/wiki/London_derbies',
 'https://en.wikipedia.org/wiki/1998_FA_Charity_Shield',
 'https://en.wikipedia.org/wiki/Meadow_Park_(Borehamwood)',
 'https://en.wikipedia.org/wiki/Patrick_Vieira',
 'https://en.wikipedia.org/wiki/Talk:Arsenal_F.C.',
 'https://en.wikipedia.org/wiki/Premier_League_parachute_and_solidarity_payments',
 'https://en.wikipedia.org/wiki/1937%E2%80%9338_in_English_football',
 'https://en.wikipedia.org/wiki/1947%E2%80%9348_Football_League',
 'https://en.wikip

### Creating a dataframe with one row:

url_source is 'None'

url_target is the source_url we have given 

page_title_target is title of source_url page

In [17]:
source_url_title = soup.title.text
data_crawl = {'url_source':"None",'url_target':source_url,'page_title_target':source_url_title}
crawl_df = pd.DataFrame(data_crawl,index = [0])

In [18]:
crawl_df

| | url_source | url_target | page_title_target |
|---|---|---|---|
| 0 | | https://en.wikipedia.org/wiki/Arsenal_F.C. | Arsenal F.C. - Wikipedia |


### Looping through the list of web URLs and getting the page title for each URL

Storing the source URL, the target URL, and the target page's title as a list and appending this list to the dataframe

In [19]:
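for link in web_urls_req:
    # A timeout keeps one slow page from stalling the whole loop
    with urllib.request.urlopen(link, timeout=5) as response:
        soup_url = bs4.BeautifulSoup(response, 'html.parser')

    # A page may lack a <title>; record an empty string instead of failing
    title = soup_url.title.text if soup_url.title else ''
    crawl_df.loc[len(crawl_df)] = [source_url, link, title]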

In [20]:
crawl_df

| | url_source | url_target | page_title_target |
|---|---|---|---|
| 0 | | https://en.wikipedia.org/wiki/Arsenal_F.C. | Arsenal F.C. - Wikipedia |
| 1 | https://en.wikipedia.org/wiki/Arsenal_F.C. | https://en.wikipedia.org/w/index.php?title=Tem... | Editing Template:UEFA Cup Winners' Cup winners... |
| 2 | https://en.wikipedia.org/wiki/Arsenal_F.C. | https://en.wikipedia.org/wiki/Forward_(associa... | Forward (association football) - Wikipedia |
| 3 | https://en.wikipedia.org/wiki/Arsenal_F.C. | https://en.wikipedia.org/wiki/File:Arsenal_188... | File:Arsenal 1888 squad photo.jpg - Wikipedia |
| 4 | https://en.wikipedia.org/wiki/Arsenal_F.C. | https://en.wikipedia.org/wiki/3D_television | 3D television - Wikipedia |
| ... | ... | ... | ... |
| 95 | https://en.wikipedia.org/wiki/Arsenal_F.C. | https://en.wikipedia.org/wiki/Hanwell_Town_F.C. | Hanwell Town F.C. - Wikipedia |
| 96 | https://en.wikipedia.org/wiki/Arsenal_F.C. | https://en.wikipedia.org/wiki/Special:BookSour... | Book sources - Wikipedia |
| 97 | https://en.wikipedia.org/wiki/Arsenal_F.C. | https://en.wikipedia.org/wiki/Special:BookSour... | Book sources - Wikipedia |
| 98 | https://en.wikipedia.org/wiki/Arsenal_F.C. | https://en.wikipedia.org/wiki/EFL_Cup | EFL Cup - Wikipedia |


### Writing the dataframe to a CSV file, quoting every field

In [21]:
crawl_df.to_csv('crawl.csv',index=False,quoting = csv.QUOTE_ALL)
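
The `quoting` argument takes the constants of the standard `csv` module. A sketch of what `csv.QUOTE_ALL` does, using `csv.writer` directly on an in-memory buffer rather than pandas; the two rows are a small sample in the shape of the dataframe above.

```python
import csv
import io

rows = [
    ['url_source', 'url_target', 'page_title_target'],
    ['https://en.wikipedia.org/wiki/Arsenal_F.C.',
     'https://en.wikipedia.org/wiki/EFL_Cup',
     'EFL Cup - Wikipedia'],
]

# Write to a string buffer so the result can be inspected without touching disk
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL)
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

With `QUOTE_ALL`, every field is wrapped in double quotes, so titles or URLs that contain commas cannot be mistaken for extra columns when the file is read back.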