# Web Scraper Tips

Source: [Web scraping tutorial with Python: tips and tricks (Hacker Noon)](https://hackernoon.com/web-scraping-tutorial-with-python-tips-and-tricks-db070e70e071)

> Web scraping is a technique used to extract data from websites through an automated process.




#### Tips
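 - Use __Inspect__ in the Chrome browser (right-click an element and choose Inspect) to find the tags, classes, and ids that hold the data you want.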

In [29]:
page_response_content = """
<body>
<div id="listings_prices">
 <div class="item">
  <li class="item_name">Watch</li>
  <div class="main_price">Price: $66.68</div>
       <div class="discounted_price">Discounted price: $46.68</div>
   </div>
   <div class="item">
  <li class="item_name">Watch2</li>
  <div class="main_price">Price: $56.68</div>
   </div>
</div>

<div id="uzay">
 <div class="item">
  <li class="item_name">Elma</li>
  <div class="main_price">10 TL</div>
       <div class="discounted_price">Discounted price: $46.68</div>
   </div>
   <div class="item">
  <li class="item_name">Armut</li>
  <div class="main_price">5 TL</div>
   </div>
</div>
</body>
"""

In [30]:
from bs4 import BeautifulSoup
import requests

#page_link ='https://www.website_to_crawl.com'
# fetch the content from url
#page_response_content = requests.get(page_link, timeout=5).content
# parse html
page_content = BeautifulSoup(page_response_content, "html.parser")

# extract all html elements where price is stored
prices = page_content.find_all(class_='main_price')


In [31]:
# you can also select the main_price elements by giving both the tag name and the class
prices = page_content.find_all('div', attrs={'class':'main_price'})
prices

[<div class="main_price">Price: $66.68</div>,
 <div class="main_price">Price: $56.68</div>,
 <div class="main_price">10 TL</div>,
 <div class="main_price">5 TL</div>]

In [33]:
# the same call can match on id instead of class; find_all still returns a list
prices = page_content.find_all('div', attrs={'id':'uzay'})
prices

[<div id="uzay">
 <div class="item">
 <li class="item_name">Elma</li>
 <div class="main_price">10 TL</div>
 <div class="discounted_price">Discounted price: $46.68</div>
 </div>
 <div class="item">
 <li class="item_name">Armut</li>
 <div class="main_price">5 TL</div>
 </div>
 </div>]

In [44]:
# find returns the first matching element itself, not a list
page_content.find(attrs={'id':'uzay'})

<div id="uzay">
<div class="item">
<li class="item_name">Elma</li>
<div class="main_price">10 TL</div>
<div class="discounted_price">Discounted price: $46.68</div>
</div>
<div class="item">
<li class="item_name">Armut</li>
<div class="main_price">5 TL</div>
</div>
</div>

In [45]:
# chain the calls to search for prices only inside the uzay block
page_content.find(attrs={'id':'uzay'}).find_all(attrs={'class':'main_price'})

[<div class="main_price">10 TL</div>, <div class="main_price">5 TL</div>]
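The chained `find` / `find_all` call above can also be written as a single CSS selector with `select`. This is a minimal sketch on a trimmed copy of the sample markup; the names `sample_html` and `uzay_prices` are only illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample HTML used earlier in this notebook.
sample_html = """
<div id="listings_prices">
  <div class="item"><div class="main_price">Price: $66.68</div></div>
</div>
<div id="uzay">
  <div class="item"><div class="main_price">10 TL</div></div>
  <div class="item"><div class="main_price">5 TL</div></div>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# "#uzay .main_price" selects every element with class main_price
# that sits inside the element whose id is uzay.
uzay_prices = [tag.get_text(strip=True) for tag in soup.select("#uzay .main_price")]
print(uzay_prices)  # ['10 TL', '5 TL']
```

Whether a selector or chained calls read better is mostly a matter of taste; both restrict the search to one part of the page.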

## Permissions in robots.txt
 - https://twitter.com/robots.txt
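
A site's robots.txt can be checked from Python before scraping. The sketch below uses the standard library's `urllib.robotparser`. The rules in `sample_robots_txt`, the `example.com` host, the `my-scraper` agent name, and the `is_allowed` helper are all made up for illustration, and the file is parsed from a string so the example runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (not taken from any real site).
sample_robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(sample_robots_txt.splitlines())

def is_allowed(path, agent="my-scraper"):
    """Return True if the parsed rules let `agent` fetch `path`."""
    return parser.can_fetch(agent, "https://example.com" + path)

print(is_allowed("/private/report.html"))  # False
print(is_allowed("/blog/post.html"))       # True
```

For a live site, `parser.set_url(...)` followed by `parser.read()` would replace the `parse` call.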

## User Agent
 - Every time you visit a website, your browser identifies itself to the server through the `User-Agent` request header.
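
To see why the header matters for a scraper, it can help to look at what `requests` sends by default and how a custom value replaces it. This is a small offline sketch: nothing is sent over the network, and the `custom_ua` string is a made-up value for illustration.

```python
import requests

# The value requests announces when no User-Agent header is supplied.
default_ua = requests.utils.default_user_agent()
print(default_ua)  # starts with "python-requests/"

# Hypothetical browser-like string, for illustration only.
custom_ua = "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0"

# prepare() builds the request object without sending it,
# so the outgoing headers can be inspected safely.
prepared = requests.Request(
    "GET", "https://example.com", headers={"User-Agent": custom_ua}
).prepare()
print(prepared.headers["User-Agent"])
```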

In [49]:
# library to generate user agent
from user_agent import generate_user_agent
page_link ='https://uzay00.github.io'
# generate a user agent
headers = {'User-Agent': generate_user_agent(device_type="desktop", os=('mac', 'linux'))}
#headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux i686 on x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.63 Safari/537.36'}
page_response = requests.get(page_link, timeout=5, headers=headers)

In [50]:
page_response

<Response [200]>

In [52]:
page_response_content = page_response.content
page_content = BeautifulSoup(page_response_content, "html.parser")
page_content

<!DOCTYPE HTML>

<!--
	Editorial by HTML5 UP
	html5up.net | @ajlkn
	Free for personal and commercial use under the CCA 3.0 license (html5up.net/license)
-->
<html>
<head>
<title>Dr. Uzay Çetin</title>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1, maximum-scale=1.0, user-scalable=no" name="viewport"/>
<!--[if lte IE 8]><script src="assets/js/ie/html5shiv.js"></script><![endif]-->
<link href="assets/css/main.css" rel="stylesheet"/>
<!--[if lte IE 9]><link rel="stylesheet" href="assets/css/ie9.css" /><![endif]-->
<!--[if lte IE 8]><link rel="stylesheet" href="assets/css/ie8.css" /><![endif]-->
</head>
<body>
<!-- Wrapper -->
<div id="wrapper">
<!-- Main -->
<div id="main">
<div class="inner">
<!-- Header -->
<header id="header">
<a class="logo" href="http://tuvalu.santafe.edu/events/workshops/index.php/Uzay_%C3%87etin"><strong>Dr. Uzay Çetin</strong></a>
<ul class="icons">
<li><a class="icon fa-twitter" href="https://twitter.com/uzay00"><span class="label">Tw

# Get Ekşi Sözlük Entries

In [1]:
from lxml import html
from requests import get, RequestException
from user_agent import generate_user_agent

# base topic URL without a page number; get_entries appends ?p=<page>
url = 'https://eksisozluk.com/mustafa-kemal-ataturk--34712'

ua = generate_user_agent(device_type="desktop", os=('linux', 'mac'))

def get_content(url):
    """Return the response body as bytes, or None if the request fails."""
    try:
        res = get(url, headers={'User-Agent': ua}, timeout=5)
        status = res.status_code
        if status == 200:
            return res.content
        else:
            print(status)
    except RequestException as ex:
        print('Request Error!')
        print(ex)
    except KeyboardInterrupt:
        print('The request is interrupted!')

def get_entries(url, num_fetch):
    """Collect up to num_fetch entry texts, walking the topic page by page."""
    entries = []
    page = 1

    while len(entries) < num_fetch:
        content = get_content(url + '?p=' + str(page))
        if content is None:
            break  # request failed or returned a non-200 status
        html_element = html.document_fromstring(content)
        found = [entry.text_content().strip() for entry in html_element.xpath('.//div[@class="content"]')]
        if not found:
            break  # no entries on this page, so stop instead of looping forever
        entries.extend(found[:num_fetch - len(entries)])
        page += 1

    return entries

entries = get_entries(url, 20)
for i, entry in enumerate(entries):
    print('Entry %d: %s' % (i+1, entry))

Entry 1: bu cografya da dogmus olduguma sevinmem için bir kaç nedenden biri
Entry 2: o olmasaydi biz, "biz" olur muyduk bilemiyorum...
Entry 3: guzel insan keza bu kadarı yeter.
Entry 4: (bkz: 10 kasim)
Entry 5: sarı saçlı mavi gözlü yakışıklı, güzel elli, prezentabl geyikleri arasında kayanayıp gitmeye başlamış, büyük bir lider.kara kaşlı kara gözlü, kambur olsa idi, bu kadar çok saygı duyulmayacak mıydı?yalvarırım şu sarı saçlıyı, şu mavi gözlüyü, rakı-sigara-leblebi, tarlada karga kovalama edebiyatını bi kenara bırakın.ata, alemci akşamcıların jön dostu hareketli korkuluk değildir.maalesef yaptıklarından çok, düşünceleri ile sonsuza dek yaşamayı bilmiş birisidir.çelik erişçi yaklaşımları ile ancak yeni captainhowdyler yaratırsınız.
Entry 6: konuyla ilgili olarak aziz nesin'in bir masal ve hikayesini öneriyorumisimlerini de vermiyorum ki ararken okuyunuz.
Entry 7: dahi insan. ornek alınası kişi. sadece bir kez gormek bile yeterdi.. iyikio bıraktığı ülkeyi bugun görmüyor. yada görüyor
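
The XPath step used in `get_entries` can be tried offline on a small snippet. The markup in `sample_page` is made up to mimic the `div class="content"` structure the scraper looks for; whether the live site still uses that structure is an assumption this sketch cannot confirm.

```python
from lxml import html

# Made-up page with two entry blocks and one unrelated block.
sample_page = """
<html><body>
  <div class="content"> first entry </div>
  <div class="sidebar">not an entry</div>
  <div class="content">second <b>entry</b></div>
</body></html>
"""

tree = html.document_fromstring(sample_page)

# Same expression as in get_entries: every div whose class attribute is exactly "content".
# text_content() flattens nested tags such as <b> into plain text.
texts = [node.text_content().strip() for node in tree.xpath('.//div[@class="content"]')]
print(texts)  # ['first entry', 'second entry']
```

Note that `@class="content"` matches only when the attribute is exactly `content`; an element with several classes would need a different expression.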

In [None]:
#!pip install user_agent