# <font color='blue'>CCPS 844 Data Mining Module 3</font>

# <font color='red'>Web Scraping</font>

Web scraping (also called web harvesting or web data extraction) is the automated extraction of data from websites.

The Internet provides abundant sources of information for professionals and enthusiasts in many industries. Extracting data from websites by hand, however, can be tedious, especially if you need to retrieve data in the same format every day.

For example, if you follow the stock market, recording closing prices every day can be a pain, especially when you have to open several webpages to do it. You can make this easier by building your own web scraper that retrieves stock indices automatically.

## <font color='red'>Scraping Rules</font>
* Check a website's Terms and Conditions before you scrape it. Read the statements about legal use of data carefully; scraped data usually may not be used for commercial purposes.
* Do not request data from the website too aggressively with your program, as a flood of requests may overload the website. Make sure your program behaves in a reasonable manner (i.e. acts like a human visitor); one request for one webpage per second is good practice.
* The layout of a website may change from time to time, so revisit the site and rewrite your code as needed.
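
The one-request-per-second rule above can be sketched as a small helper that pauses between downloads. This is only an illustration: `fetch_politely`, `fake_fetch`, and the example.com URLs are hypothetical names, and `fake_fetch` stands in for a real download so the sketch runs without network access.

```python
import time

def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, waiting delay_seconds between consecutive requests."""
    pages = {}
    for position, url in enumerate(urls):
        if position > 0:
            time.sleep(delay_seconds)  # pause before every request except the first
        pages[url] = fetch(url)
    return pages

def fake_fetch(url):
    # Stand-in for a real download (hypothetical; no network access happens here)
    return f"<html>contents of {url}</html>"

start = time.monotonic()
pages = fetch_politely(
    ["http://example.com/page1", "http://example.com/page2"], fake_fetch
)
elapsed = time.monotonic() - start
print(len(pages), "pages fetched in about", round(elapsed, 1), "seconds")
```

In a real scraper, `fetch` would be a function that downloads the page; the delay is kept as a parameter so it can be raised for slower or more fragile sites.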

In [1]:
from lxml import html
import requests

In [2]:
# utf-8-sig strips the byte-order mark (BOM) at the start of the file, if present
with open("test.html", "r", encoding="utf-8-sig") as file:
    pg = file.read()
pg

"<html>\n    <head>\n    </head>\n    <body>\n        <p>\n            A status_code of 200 means that the page downloaded successfully. We won't fully dive into status codes here, but a status code starting with a 2 generally indicates success, and a code starting with a 4 or a 5 indicates an error.\n        </p>\n        <p>\n            We can use the BeautifulSoup library to parse this document, and extract the text from the p tag. We first have to import the library, and create an instance of the BeautifulSoup class to parse our document.\n        </p>\n    </body>\n</html>"

In [3]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(pg, 'html.parser')
soup

<html>
<head>
</head>
<body>
<p>
            A status_code of 200 means that the page downloaded successfully. We won't fully dive into status codes here, but a status code starting with a 2 generally indicates success, and a code starting with a 4 or a 5 indicates an error.
        </p>
<p>
            We can use the BeautifulSoup library to parse this document, and extract the text from the p tag. We first have to import the library, and create an instance of the BeautifulSoup class to parse our document.
        </p>
</body>
</html>

In [4]:
soup.find_all('p')

[<p>
             A status_code of 200 means that the page downloaded successfully. We won't fully dive into status codes here, but a status code starting with a 2 generally indicates success, and a code starting with a 4 or a 5 indicates an error.
         </p>,
 <p>
             We can use the BeautifulSoup library to parse this document, and extract the text from the p tag. We first have to import the library, and create an instance of the BeautifulSoup class to parse our document.
         </p>]

In [5]:
pageText = []
for paragraph in soup.find_all('p'):
    pageText.append(paragraph.get_text())
    # list.append(object) adds the object to the end of the list and returns None
    
pageText

["\n            A status_code of 200 means that the page downloaded successfully. We won't fully dive into status codes here, but a status code starting with a 2 generally indicates success, and a code starting with a 4 or a 5 indicates an error.\n        ",
 '\n            We can use the BeautifulSoup library to parse this document, and extract the text from the p tag. We first have to import the library, and create an instance of the BeautifulSoup class to parse our document.\n        ']
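
The extracted paragraphs still carry the newlines and indentation of the HTML source. A minimal sketch of one way to tidy them, using only built-in string methods (the sample string is shortened from the first paragraph above, and the variable names are illustrative):

```python
# Shortened copy of the first extracted paragraph, including its surrounding whitespace
raw_paragraph = "\n            A status_code of 200 means that the page downloaded successfully.\n        "

# split() with no arguments splits on any run of whitespace and drops empty pieces,
# so joining with single spaces collapses newlines and indentation
clean_paragraph = " ".join(raw_paragraph.split())
print(clean_paragraph)
```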

**Select a tokenizer of your choice**
https://www.nltk.org/_modules/nltk/tokenize.html

In [6]:
from nltk.tokenize.punkt import PunktSentenceTokenizer

In [7]:
tknizer = PunktSentenceTokenizer()

In [8]:
# tknizer.tokenize(text) returns a list of the sentences in the text.
# Sentence-ending punctuation ('.', '?', '!') marks candidate boundaries; the period is not the only delimiter
tknizer.tokenize("Hi. How are you. A simple example.")

['Hi.', 'How are you.', 'A simple example.']

In [9]:
tknizer.tokenize(pageText[0])

['\n            A status_code of 200 means that the page downloaded successfully.',
 "We won't fully dive into status codes here, but a status code starting with a 2 generally indicates success, and a code starting with a 4 or a 5 indicates an error."]

In [10]:
from nltk.tokenize.regexp import WhitespaceTokenizer

In [11]:
toknizer = WhitespaceTokenizer()

In [12]:
print(toknizer.tokenize(pageText[0]))

['A', 'status_code', 'of', '200', 'means', 'that', 'the', 'page', 'downloaded', 'successfully.', 'We', "won't", 'fully', 'dive', 'into', 'status', 'codes', 'here,', 'but', 'a', 'status', 'code', 'starting', 'with', 'a', '2', 'generally', 'indicates', 'success,', 'and', 'a', 'code', 'starting', 'with', 'a', '4', 'or', 'a', '5', 'indicates', 'an', 'error.']


In [13]:
words = []
for txt in pageText:
    # A rough plain-Python alternative to the tokenizer (it also lower-cases the text):
    # words.extend(txt.lower().strip().strip('"').strip("'").split())
    words.extend(toknizer.tokenize(txt))
    # list.extend(iterable) appends every element of the iterable to the list and returns None
print(words)

['A', 'status_code', 'of', '200', 'means', 'that', 'the', 'page', 'downloaded', 'successfully.', 'We', "won't", 'fully', 'dive', 'into', 'status', 'codes', 'here,', 'but', 'a', 'status', 'code', 'starting', 'with', 'a', '2', 'generally', 'indicates', 'success,', 'and', 'a', 'code', 'starting', 'with', 'a', '4', 'or', 'a', '5', 'indicates', 'an', 'error.', 'We', 'can', 'use', 'the', 'BeautifulSoup', 'library', 'to', 'parse', 'this', 'document,', 'and', 'extract', 'the', 'text', 'from', 'the', 'p', 'tag.', 'We', 'first', 'have', 'to', 'import', 'the', 'library,', 'and', 'create', 'an', 'instance', 'of', 'the', 'BeautifulSoup', 'class', 'to', 'parse', 'our', 'document.']
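
Whitespace tokenization leaves punctuation attached, so 'document,' and 'document.' end up as different tokens, and 'We' differs from 'we'. One possible clean-up step, sketched here on a handful of tokens taken from the output above (the `normalize` helper is a hypothetical name, not part of NLTK):

```python
import string

tokens = ['We', 'can', 'parse', 'this', 'document,', 'and', 'our', 'document.', "won't"]

def normalize(token):
    """Lower-case a token and strip punctuation from both ends (inner apostrophes stay)."""
    return token.lower().strip(string.punctuation)

normalized = [normalize(t) for t in tokens]
normalized = [t for t in normalized if t]  # drop tokens that consisted only of punctuation
print(normalized)
```

Whether to lower-case or strip punctuation depends on the analysis; for word counts it usually merges variants of the same word.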


In [14]:
dictForDF = {}
for aWord in words:
    dictForDF[aWord] = dictForDF.get(aWord,0)+1
print(dictForDF)

{'A': 1, 'status_code': 1, 'of': 2, '200': 1, 'means': 1, 'that': 1, 'the': 6, 'page': 1, 'downloaded': 1, 'successfully.': 1, 'We': 3, "won't": 1, 'fully': 1, 'dive': 1, 'into': 1, 'status': 2, 'codes': 1, 'here,': 1, 'but': 1, 'a': 5, 'code': 2, 'starting': 2, 'with': 2, '2': 1, 'generally': 1, 'indicates': 2, 'success,': 1, 'and': 3, '4': 1, 'or': 1, '5': 1, 'an': 2, 'error.': 1, 'can': 1, 'use': 1, 'BeautifulSoup': 2, 'library': 1, 'to': 3, 'parse': 2, 'this': 1, 'document,': 1, 'extract': 1, 'text': 1, 'from': 1, 'p': 1, 'tag.': 1, 'first': 1, 'have': 1, 'import': 1, 'library,': 1, 'create': 1, 'instance': 1, 'class': 1, 'our': 1, 'document.': 1}
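
The counting loop above can also be written with `collections.Counter` from the standard library, which builds the same word-to-count mapping and can list the most frequent words. A minimal sketch on a short, made-up word list:

```python
from collections import Counter

# Small illustrative list; in the notebook this would be the `words` list built earlier
sample_words = ['the', 'page', 'the', 'status', 'code', 'a', 'status', 'the']

word_counts = Counter(sample_words)   # counts every distinct word in one pass
print(word_counts.most_common(2))     # the two most frequent words with their counts

# The manual dict.get loop from the cell above gives the same result
manual_counts = {}
for a_word in sample_words:
    manual_counts[a_word] = manual_counts.get(a_word, 0) + 1
print(manual_counts == dict(word_counts))
```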


In [15]:
import pandas as pd
dfWords = pd.DataFrame(words,columns=["words"])
dfWords.head(10)

| | words |
|---|---|
| 0 | A |
| 1 | status_code |
| 2 | of |
| 3 | 200 |
| 4 | means |
| 5 | that |
| 6 | the |
| 7 | page |
| 8 | downloaded |
| 9 | successfully. |


In [16]:
dfc = pd.DataFrame(dfWords.words.value_counts())
dfc

| | words |
|---|---|
| the | 6 |
| a | 5 |
| and | 3 |
| to | 3 |
| We | 3 |
| status | 2 |
| BeautifulSoup | 2 |
| parse | 2 |
| indicates | 2 |
| with | 2 |


In [17]:
dfc.columns

Index(['words'], dtype='object')

In [18]:
dfc = dfc.rename(index=str, columns={"words": "Count"})
dfc

| | Count |
|---|---|
| the | 6 |
| a | 5 |
| and | 3 |
| to | 3 |
| We | 3 |
| status | 2 |
| BeautifulSoup | 2 |
| parse | 2 |
| indicates | 2 |
| with | 2 |
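
The name of the column produced by `value_counts()` depends on the pandas version: the output above shows it named after the source column ("words"), while newer releases are understood to name it "count" instead, in which case renaming "words" would have no effect. A sketch of a version-independent alternative names the counts explicitly before building the frame (the short word list is made up for illustration):

```python
import pandas as pd

sample_words = ['the', 'a', 'the', 'We', 'a', 'the']

# value_counts() returns a Series; rename() with a plain string sets the Series name,
# and to_frame() turns that name into the column label
counts = (
    pd.Series(sample_words, name="words")
    .value_counts()
    .rename("Count")
    .to_frame()
)
print(counts)
```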


In [19]:
print(dfc.to_dict())

{'Count': {'the': 6, 'a': 5, 'and': 3, 'to': 3, 'We': 3, 'status': 2, 'BeautifulSoup': 2, 'parse': 2, 'indicates': 2, 'with': 2, 'starting': 2, 'code': 2, 'an': 2, 'of': 2, 'tag.': 1, 'document,': 1, 'use': 1, 'library': 1, 'our': 1, 'class': 1, 'instance': 1, 'this': 1, 'create': 1, 'p': 1, 'library,': 1, 'import': 1, 'have': 1, 'extract': 1, 'text': 1, 'can': 1, 'first': 1, 'from': 1, 'A': 1, 'error.': 1, '5': 1, '200': 1, 'means': 1, 'that': 1, 'page': 1, 'downloaded': 1, 'successfully.': 1, "won't": 1, 'fully': 1, 'dive': 1, 'into': 1, 'codes': 1, 'here,': 1, 'but': 1, '2': 1, 'generally': 1, 'success,': 1, 'status_code': 1, '4': 1, 'or': 1, 'document.': 1}}


In [20]:
print(dictForDF)

{'A': 1, 'status_code': 1, 'of': 2, '200': 1, 'means': 1, 'that': 1, 'the': 6, 'page': 1, 'downloaded': 1, 'successfully.': 1, 'We': 3, "won't": 1, 'fully': 1, 'dive': 1, 'into': 1, 'status': 2, 'codes': 1, 'here,': 1, 'but': 1, 'a': 5, 'code': 2, 'starting': 2, 'with': 2, '2': 1, 'generally': 1, 'indicates': 2, 'success,': 1, 'and': 3, '4': 1, 'or': 1, '5': 1, 'an': 2, 'error.': 1, 'can': 1, 'use': 1, 'BeautifulSoup': 2, 'library': 1, 'to': 3, 'parse': 2, 'this': 1, 'document,': 1, 'extract': 1, 'text': 1, 'from': 1, 'p': 1, 'tag.': 1, 'first': 1, 'have': 1, 'import': 1, 'library,': 1, 'create': 1, 'instance': 1, 'class': 1, 'our': 1, 'document.': 1}


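**Bag-of-words model** (https://en.wikipedia.org/wiki/Bag-of-words_model)

The word counts above are a bag-of-words representation: a text is described by how often each word occurs, ignoring word order. For example, take two short documents:

1. John likes to watch movies. Mary likes movies too.
2. John also likes to watch football games.

Each document becomes a dictionary of word counts:

1. BoW1 = {"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1};
2. BoW2 = {"John":1,"also":1,"likes":1,"to":1,"watch":1,"football":1,"games":1};

Over the combined vocabulary (John, likes, to, watch, movies, Mary, too, also, football, games), each document becomes a vector of counts:

1. [1, 2, 1, 1, 2, 1, 1, 0, 0, 0]
2. [1, 1, 1, 1, 0, 0, 0, 1, 1, 1]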
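
The two example sentences and their count vectors above can be reproduced with a few lines of standard-library Python. This is a minimal sketch: the `tokenize` helper and the choice to order the vocabulary by first appearance are assumptions made here so that the positions line up with the vectors shown.

```python
import re
from collections import Counter

documents = [
    "John likes to watch movies. Mary likes movies too.",
    "John also likes to watch football games.",
]

def tokenize(text):
    """Split text into words, dropping punctuation."""
    return re.findall(r"\w+", text)

# One bag (word -> count) per document
bags = [Counter(tokenize(doc)) for doc in documents]

# Shared vocabulary, ordered by first appearance across the documents
vocabulary = []
for doc in documents:
    for word in tokenize(doc):
        if word not in vocabulary:
            vocabulary.append(word)

# A Counter returns 0 for words it has not seen, which fills the gaps in each vector
vectors = [[bag[word] for word in vocabulary] for bag in bags]

print(vocabulary)
for vector in vectors:
    print(vector)
```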

In [21]:
# 001.html was copied from http://econpy.pythonanywhere.com/ex/001.html
with open('001.html') as file:
    page = file.read()

**If we had a URL instead of the local 001.html file, we could fetch the page with requests.get:**

Signature: requests.get(url, params=None, \*\*kwargs)
Docstring:
Sends a GET request.

:param url: URL for the new :class:`Request` object.
:param params: (optional) Dictionary or bytes to be sent in the query string for the :class:`Request`.
:param \*\*kwargs: Optional arguments that ``request`` takes.
:return: :class:`Response <Response>` object
:rtype: requests.Response
File:      c:\programdata\anaconda3\lib\site-packages\requests\api.py
Type:      function
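
A minimal sketch of how `requests.get` might be used to download a page instead of reading a local file. The `fetch_page` helper and its `timeout_seconds` parameter are hypothetical names chosen for this example; `timeout` and `raise_for_status()` are part of the requests library.

```python
import requests

def fetch_page(url, timeout_seconds=10):
    """Download a page and return its HTML as text, raising an error on a 4xx/5xx status."""
    response = requests.get(url, timeout=timeout_seconds)
    response.raise_for_status()  # raises requests.HTTPError when the status code signals an error
    return response.text

# Example call (commented out so the sketch does not touch the network):
# page = fetch_page("http://econpy.pythonanywhere.com/ex/001.html")
```

The returned text can be passed to BeautifulSoup or `html.fromstring` in the same way as the contents of the local file.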

In [22]:
soup = BeautifulSoup(page, 'html.parser')
soup

<!DOCTYPE html>

<html>
<head>
<meta charset="utf-8"/>
<title>Items 1 to 20 -- Example Page 1</title>
<script type="text/javascript">
      var _gaq = _gaq || [];
      _gaq.push(['_setAccount', 'UA-23648880-1']);
      _gaq.push(['_trackPageview']);
      _gaq.push(['_setDomainName', 'econpy.org']);
    </script>
</head>
<body>
<div align="center">1, <a href="http://econpy.pythonanywhere.com/ex/002.html">[<font color="green">2</font>]</a>, <a href="http://econpy.pythonanywhere.com/ex/003.html">[<font color="green">3</font>]</a>, <a href="http://econpy.pythonanywhere.com/ex/004.html">[<font color="green">4</font>]</a>, <a href="http://econpy.pythonanywhere.com/ex/005.html">[<font color="green">5</font>]</a></div>
<div title="buyer-info">
<div title="buyer-name">Carson Busses</div>
<span class="item-price">$29.95</span><br/>
</div>
<div title="buyer-info">
<div title="buyer-name">Earl E. Byrd</div>
<span class="item-price">$8.37</span><br/>
</div>
<div title="buyer-info">
<div title="bu

**Signature: html.fromstring(html, base_url=None, parser=None, \*\*kw)**

Docstring:

Parse the html, returning a single element/document.

This tries to minimally parse the chunk of text, without knowing if it
is a fragment or a document.

base_url will set the document's base_url attribute (and the tree's docinfo.URL)

File:      c:\programdata\anaconda3\lib\site-packages\lxml\html\__init__.py

Type:      function

In [23]:
tree = html.fromstring(page)

The buyer names and prices we want sit in elements like these:

<div title="buyer-name">Carson Busses</div>
<span class="item-price">$29.95</span>

In [24]:
#This will create a list of buyers:
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
#This will create a list of prices
prices = tree.xpath('//span[@class="item-price"]/text()')
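
The two XPath queries above return separate lists, which stay aligned only while every buyer block contains exactly one name and one price. A hedged alternative is to loop over each `buyer-info` block and query inside it with a relative path. The sketch below uses a two-record snippet copied from the page source shown earlier; the `rows` variable and the `None` fallback are choices made for this example.

```python
from lxml import html

snippet = """<html><body>
<div title="buyer-info">
<div title="buyer-name">Carson Busses</div>
<span class="item-price">$29.95</span><br/>
</div>
<div title="buyer-info">
<div title="buyer-name">Earl E. Byrd</div>
<span class="item-price">$8.37</span><br/>
</div>
</body></html>"""

tree = html.fromstring(snippet)

rows = []
for block in tree.xpath('//div[@title="buyer-info"]'):
    # "./" makes the query relative to the current buyer-info block
    names = block.xpath('./div[@title="buyer-name"]/text()')
    prices = block.xpath('./span[@class="item-price"]/text()')
    # Use None when a block lacks a name or a price, so records never shift
    rows.append((names[0] if names else None, prices[0] if prices else None))

print(rows)
```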

In [25]:
print('Buyers: ', buyers)
print('Prices: ', prices)

Buyers:  ['Carson Busses', 'Earl E. Byrd', 'Patty Cakes', 'Derri Anne Connecticut', 'Moe Dess', 'Leda Doggslife', 'Dan Druff', 'Al Fresco', 'Ido Hoe', 'Howie Kisses', 'Len Lease', 'Phil Meup', 'Ira Pent', 'Ben D. Rules', 'Ave Sectomy', 'Gary Shattire', 'Bobbi Soks', 'Sheila Takya', 'Rose Tattoo', 'Moe Tell']
Prices:  ['$29.95', '$8.37', '$15.26', '$19.25', '$19.25', '$13.99', '$31.57', '$8.49', '$14.47', '$15.86', '$11.11', '$15.98', '$16.27', '$7.50', '$50.85', '$14.26', '$5.68', '$15.00', '$114.07', '$10.09']


In [26]:
df = pd.DataFrame({"Buyers": buyers, "Prices": prices})
df.head(10)

| | Buyers | Prices |
|---|---|---|
| 0 | Carson Busses | $29.95 |
| 1 | Earl E. Byrd | $8.37 |
| 2 | Patty Cakes | $15.26 |
| 3 | Derri Anne Connecticut | $19.25 |
| 4 | Moe Dess | $19.25 |
| 5 | Leda Doggslife | $13.99 |
| 6 | Dan Druff | $31.57 |
| 7 | Al Fresco | $8.49 |
| 8 | Ido Hoe | $14.47 |
| 9 | Howie Kisses | $15.86 |
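
The scraped prices are strings such as '$29.95', so they cannot be summed or averaged as they are. A small sketch of one way to convert them, using `Decimal` from the standard library to avoid floating-point rounding in money amounts (the three prices are the first three from the output above; the variable names are illustrative):

```python
from decimal import Decimal

price_strings = ['$29.95', '$8.37', '$15.26']

# Remove the leading dollar sign, then parse the remaining digits as an exact decimal
numeric_prices = [Decimal(p.lstrip('$')) for p in price_strings]

total = sum(numeric_prices)
print(total)
```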


In [27]:
df.to_csv("outputCSV.csv", sep=',')

**© Dr. Muhammad Naeem Irfan. Can't be posted on the internet**