# Introduction to Web Scraping in Python

Web scraping is the process of collecting and parsing raw data from the Web, and the Python community has come up with some pretty powerful web scraping tools.
Many disciplines, such as data science, business intelligence, and investigative reporting, can benefit enormously from collecting and analyzing data from websites.

## Scrape and Parse Text From Websites
Collecting data from websites using an automated process is known as web scraping. Some websites explicitly forbid users from scraping their data with automated tools.

**Websites have two main reasons for not allowing web scraping:**
1. To protect their data. For example, Google Maps does not allow users to request too many results in a minute.
2. To prevent overuse of their servers. When bots send many requests, a website's servers slow down, and other users get a slower connection to the website.
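
Many sites publish their scraping rules in a `robots.txt` file, and a polite scraper checks it before sending requests. The sketch below parses such rules with the standard library's `urllib.robotparser`. The rules, the `example.com` URLs, and the user agent name `my-scraper` are all invented for illustration and do not describe any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask whether a given user agent may fetch a given URL
allowed = parser.can_fetch("my-scraper", "http://example.com/profiles/aphrodite")
blocked = parser.can_fetch("my-scraper", "http://example.com/private/data")
delay = parser.crawl_delay("my-scraper")  # seconds to wait between requests, or None

print(allowed, blocked, delay)
```

For a live site, `parser.set_url(...)` followed by `parser.read()` would download the file instead of parsing a local string.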

One useful package for web scraping that you can find in Python’s standard library is [urllib](https://docs.python.org/3/library/urllib.html), which contains tools for working with URLs.
In particular, the **urllib.request** module is for opening and reading URLs.

#### Let's look at an example that uses **urllib**

In [1]:
from urllib.request import urlopen
url = "http://olympus.realpython.org/profiles/aphrodite"
page = urlopen(url)

To extract the HTML from the page:
1. Use the response object's **.read()** method, which returns a sequence of bytes.
2. Use the **.decode()** method on that result to decode the bytes into a string.

In [2]:
page

<http.client.HTTPResponse at 0x7ff490dc0c40>

In [3]:
html_by = page.read()

In [6]:
html_by

b'<html>\n<head>\n<title>Profile: Aphrodite</title>\n</head>\n<body bgcolor="yellow">\n<center>\n<br><br>\n<img src="/static/aphrodite.gif" />\n<h2>Name: Aphrodite</h2>\n<br><br>\nFavorite animal: Dove\n<br><br>\nFavorite color: Red\n<br><br>\nHometown: Mount Olympus\n</center>\n</body>\n</html>\n'

In [7]:
html = html_by.decode("utf-8")
print(html)

<html>
<head>
<title>Profile: Aphrodite</title>
</head>
<body bgcolor="yellow">
<center>
<br><br>
<img src="/static/aphrodite.gif" />
<h2>Name: Aphrodite</h2>
<br><br>
Favorite animal: Dove
<br><br>
Favorite color: Red
<br><br>
Hometown: Mount Olympus
</center>
</body>
</html>
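
The `.decode("utf-8")` call above assumes the page is UTF-8 encoded. As a minimal sketch of a more defensive approach, the hypothetical helper `decode_body` below takes the raw bytes plus an optional charset name and falls back to UTF-8 with replacement characters instead of raising. For a real response, the charset could be read with `page.headers.get_content_charset()`.

```python
def decode_body(raw, charset=None):
    """Decode response bytes to str (hypothetical helper, not part of urllib)."""
    encoding = charset or "utf-8"  # assume UTF-8 when no charset is given
    try:
        return raw.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # Bad bytes or an unknown charset name: mark undecodable bytes
        # with U+FFFD instead of raising
        return raw.decode("utf-8", errors="replace")

sample = b"<title>Profile: Aphrodite</title>"
print(decode_body(sample))
print(decode_body(b"caf\xe9", charset="latin-1"))
```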



#### Let's try to get the title of the webpage
1. Find the index of the opening **\<title>** tag. Because **find()** returns the index where the tag starts, add the length of the tag itself to get the index where the title text begins.
2. Find the index of the closing **\</title>** tag.
3. Get the title by slicing the HTML string between those two indices.

In [13]:
len("<html>\n<head>\n")

14

In [8]:
html

'<html>\n<head>\n<title>Profile: Aphrodite</title>\n</head>\n<body bgcolor="yellow">\n<center>\n<br><br>\n<img src="/static/aphrodite.gif" />\n<h2>Name: Aphrodite</h2>\n<br><br>\nFavorite animal: Dove\n<br><br>\nFavorite color: Red\n<br><br>\nHometown: Mount Olympus\n</center>\n</body>\n</html>\n'

In [10]:
html.find("<title>")

14

In [11]:
html[14:21]

'<title>'

In [12]:
len("<title>")

7

In [14]:
title_index = html.find("<title>")
start_index = title_index + len("<title>")

In [15]:
start_index

21

In [16]:
print(start_index)
print(title_index)

21
14


In [17]:
end_index = html.find("</title>")
print(end_index)

39


In [18]:
title = html[start_index:end_index]
print(title)

Profile: Aphrodite
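
The manual steps above can be wrapped in a small function. This is only a sketch: `extract_between` is a hypothetical helper, and it also covers the case where `find()` returns `-1` because a tag is absent. Plain string searching stays fragile, since it would miss a tag written as `<title >` or `<TITLE>`.

```python
def extract_between(text, start_tag, end_tag):
    """Return the text between start_tag and end_tag, or None if either is missing."""
    tag_index = text.find(start_tag)
    if tag_index == -1:  # str.find() signals "not found" with -1
        return None
    start_index = tag_index + len(start_tag)  # skip past the opening tag itself
    end_index = text.find(end_tag, start_index)
    if end_index == -1:
        return None
    return text[start_index:end_index]

sample_html = "<html>\n<head>\n<title>Profile: Aphrodite</title>\n</head>\n</html>\n"
print(extract_between(sample_html, "<title>", "</title>"))  # Profile: Aphrodite
print(extract_between(sample_html, "<h1>", "</h1>"))        # None
```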


#### Parsing HTML with [**Beautiful Soup**](https://www.crummy.com/software/BeautifulSoup/)

That is a lot of work just to get the title of the page. In the real world, websites are much more complex. There are many dedicated tools for HTML scraping, and one of the most popular libraries for Python is Beautiful Soup.

Beautiful Soup is a Python library designed for quick turnaround projects like screen-scraping.

**Run one of the commands below to install it, depending on your package manager**:
```bash
conda install beautifulsoup4
pip install beautifulsoup4
```

In [19]:
!pip install beautifulsoup4



In [21]:
from bs4 import BeautifulSoup
from urllib.request import urlopen

url = "http://olympus.realpython.org/profiles/aphrodite"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")

In [24]:
soup

<html>
<head>
<title>Profile: Aphrodite</title>
</head>
<body bgcolor="yellow">
<center>
<br/><br/>
<img src="/static/aphrodite.gif"/>
<h2>Name: Aphrodite</h2>
<br/><br/>
Favorite animal: Dove
<br/><br/>
Favorite color: Red
<br/><br/>
Hometown: Mount Olympus
</center>
</body>
</html>

In [26]:
soup.find("h2").text

'Name: Aphrodite'

In [27]:
for x in soup.find_all("h2"):
    print(x.text)

Name: Aphrodite
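
One difference between the two methods matters once pages get less predictable: `find()` returns `None` when nothing matches, while `find_all()` returns an empty list. A short sketch on a made-up HTML snippet:

```python
from bs4 import BeautifulSoup

# Small invented document for illustration
snippet = "<html><body><h2>Name: Aphrodite</h2></body></html>"
soup_demo = BeautifulSoup(snippet, "html.parser")

first_h2 = soup_demo.find("h2")    # a Tag object
missing = soup_demo.find("h3")     # None, because there is no <h3>
all_h3 = soup_demo.find_all("h3")  # empty list-like result

print(first_h2.text)
print(missing is None)
print(len(all_h3))

# Guard before reading .text so a missing tag does not raise AttributeError
heading = missing.text if missing is not None else "(no heading)"
print(heading)
```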


#### The example above that builds the **soup** object does three things
1. Opens a page using **urlopen** from **urllib.request**
2. Reads and decodes the page and saves the result in a variable
3. Creates a BeautifulSoup object and assigns it to the **soup** variable

BeautifulSoup objects have a **.get_text()** method that extracts all the text from the document and automatically removes any HTML tags.

In [28]:
print(soup.get_text())



Profile: Aphrodite





Name: Aphrodite

Favorite animal: Dove

Favorite color: Red

Hometown: Mount Olympus
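
The output above contains many blank lines because the newlines between tags count as text too. As a sketch, `get_text()` accepts a separator and a `strip` flag that together drop the whitespace-only pieces. The HTML string below is a shortened copy of the page above.

```python
from bs4 import BeautifulSoup

page_html = (
    "<html>\n<head>\n<title>Profile: Aphrodite</title>\n</head>\n"
    "<body>\n<h2>Name: Aphrodite</h2>\n<br><br>\nFavorite animal: Dove\n"
    "<br><br>\nFavorite color: Red\n</body>\n</html>\n"
)
soup_text = BeautifulSoup(page_html, "html.parser")

# strip=True trims each text fragment and skips the empty ones;
# the first argument is the string placed between fragments
clean_text = soup_text.get_text("\n", strip=True)
print(clean_text)
```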






To get the title element of the page, use **.title**; add **.string** (or **.text**) to get the text inside it.

In [30]:
soup.title.text

'Profile: Aphrodite'

In [31]:
print(soup.title)
print(soup.title.string)

<title>Profile: Aphrodite</title>
Profile: Aphrodite


You can use **find()** to locate a tag and then read its attributes, such as an image's **src**, with dictionary-style indexing.

In [32]:
image = soup.find("img")

In [33]:
image

<img src="/static/aphrodite.gif"/>

In [34]:
image['src']

'/static/aphrodite.gif'

![Aphrodite profile image](http://olympus.realpython.org/static/aphrodite.gif)
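
The `src` value read above is a path relative to the site, so it has to be combined with the page URL before the image can be downloaded or linked. One way to do that, sketched here with the standard library's `urllib.parse.urljoin`:

```python
from urllib.parse import urljoin

page_url = "http://olympus.realpython.org/profiles/aphrodite"
src = "/static/aphrodite.gif"  # the value of image['src']

# A leading slash makes the path relative to the site root, not the page
image_url = urljoin(page_url, src)
print(image_url)  # http://olympus.realpython.org/static/aphrodite.gif
```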

#### Exercise your web scraping on Unegui.mn
1. Go to https://www.unegui.mn/avto-mashin/-avtomashin-zarna/ and use your browser's inspection tool to see the HTML tags and attributes.
2. Scrape the **title** and **price** of every listing. Scrape only the first page!
3. Save your listings as a pandas DataFrame.

The example below illustrates the final result.

In [36]:
import pandas as pd
titles = ['Toyota FJ Cruiser, 2012/2020', 'Honda Crossroad, 2009/2019']
prices = ['62 сая', '17 сая']  # 'сая' is Mongolian for 'million'
# Build the frame from a dict so each list becomes a column (one row per listing)
results = pd.DataFrame({'titles': titles, 'prices': prices})

In [37]:
results

Unnamed: 0,titles,prices
0,"Toyota FJ Cruiser, 2012/2020",62 сая
1,"Honda Crossroad, 2009/2019",17 сая


In [None]:
# Class attribute noted from the browser's inspection tool:
# announcement-block__price _verified

In [40]:
!pip install requests



In [41]:
import requests
from bs4 import BeautifulSoup

In [64]:
response = requests.get('https://www.unegui.mn/kompyuter-busad/notebook/')

In [65]:
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(response.content, "html.parser")

In [66]:
results = soup.find_all("div", {"class": "announcement-block__price"})

In [67]:
len(results)

60

In [75]:
a_tags = soup.find_all("a", {"class": "announcement-block__title"})

In [82]:
links = []
for a_tag in a_tags:
    actual_link = "https://www.unegui.mn" + a_tag['href']
    links.append(actual_link)

In [90]:
rows = []

for url in links[0:10]:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    # Alternative selectors that read the visible text instead of the metadata:
    # title = soup.find("h1", {"class": "title-announcement"}).text
    # price = soup.find("div", {"class": "announcement-price__cost"}).text
    title_tag = soup.find("meta", {"itemprop": "title"})
    price_tag = soup.find("meta", {"itemprop": "price"})

    # find() returns None when a tag is missing, and indexing None raises
    # "TypeError: 'NoneType' object is not subscriptable", so skip such pages
    if title_tag is None or price_tag is None:
        continue

    rows.append({"title": title_tag["content"], "price": price_tag["content"]})

    # Alternative: read the metadata from each listing block instead
    # results = soup.find_all("div", {"class": "announcement-block__price _verified"})
    # for item in results:
    #     title = item.find("meta", {"itemprop":"name"})['content']
    #     price = item.find("meta", {"itemprop":"price"})['content']

# Build the DataFrame once at the end; DataFrame.append was removed in pandas 2.0
df = pd.DataFrame(rows, columns=["title", "price"])

In [89]:
df['title']

'\n                          \n                            Dell inspirion 15 3511 / 11th i5 8gb 256gb ssd /\n                          \n                        '
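
Text pulled straight from a tag often carries the page's indentation, as in the output above. A minimal sketch of cleaning it with plain string methods:

```python
raw_title = (
    "\n                          \n"
    "                            Dell inspirion 15 3511 / 11th i5 8gb 256gb ssd /\n"
    "                          \n                        "
)

# strip() removes leading and trailing whitespace only
stripped_title = raw_title.strip()
print(stripped_title)

# split() + join() also collapses any runs of whitespace inside the text
clean_title = " ".join(raw_title.split())
print(clean_title)
```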

In [30]:
results[0].find("meta", {"itemprop":"price"})['content']

'890000.00'
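
The price arrives as a string, so it needs converting before prices can be sorted or summed. The sketch below uses `decimal.Decimal`, which avoids floating-point rounding for money values. The helper name `parse_price` is hypothetical, and choosing `Decimal` over `float` is a suggestion rather than something the page requires.

```python
from decimal import Decimal, InvalidOperation

def parse_price(text):
    """Convert a price string such as '890000.00' to Decimal, or None if invalid."""
    try:
        return Decimal(text.strip())
    except InvalidOperation:
        return None

price_value = parse_price("890000.00")
print(price_value)           # 890000.00
print(price_value > 500000)  # True
print(parse_price("n/a"))    # None
```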

1. Get a list of URLs to scrape
2. Loop through the URLs
3. Inside that loop, loop through the listings (65 per page)
4. Grab the data you need (title and price for 65 listings)
5. Append it to a dataframe
6. Go to the next page
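
The steps above could be sketched as follows. Everything site-specific here is an assumption: the `?page=N` URL pattern, the sample HTML, and the helper names `parse_listings` and `scrape_pages` are invented for illustration, and only the two class names come from the cells above. The `fetch` argument is any function that returns a page's HTML for a URL, which keeps the sketch runnable without network access.

```python
import time

import pandas as pd
from bs4 import BeautifulSoup

def parse_listings(html):
    """Return one {'title', 'price'} dict per listing found in a results page."""
    soup = BeautifulSoup(html, "html.parser")
    titles = soup.find_all("a", {"class": "announcement-block__title"})
    prices = soup.find_all("div", {"class": "announcement-block__price"})
    # Assumes titles and prices appear in the same order, one pair per listing
    return [
        {"title": t.get_text(strip=True), "price": p.get_text(strip=True)}
        for t, p in zip(titles, prices)
    ]

def scrape_pages(urls, fetch, delay=0.0):
    """Loop over result pages, collect every listing, and build one DataFrame."""
    rows = []
    for url in urls:
        rows.extend(parse_listings(fetch(url)))
        time.sleep(delay)  # pause between requests to go easy on the server
    return pd.DataFrame(rows, columns=["title", "price"])

# Invented sample page standing in for a real response
SAMPLE = """
<div><a class="announcement-block__title" href="/adv/1/"> Laptop A </a>
<div class="announcement-block__price"> 890000 </div></div>
<div><a class="announcement-block__title" href="/adv/2/"> Laptop B </a>
<div class="announcement-block__price _verified"> 1200000 </div></div>
"""

base = "https://www.unegui.mn/kompyuter-busad/notebook/"
urls = [f"{base}?page={n}" for n in range(1, 3)]  # hypothetical URL pattern
listings = scrape_pages(urls, fetch=lambda url: SAMPLE)
print(listings)
```

With a real site, `fetch` could be `lambda url: requests.get(url).text`, and a `delay` of a second or more keeps the request rate low.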