# What is Web Scraping?
Suppose you want some information from a website, say a paragraph about Donald Trump. You could copy and paste it from Wikipedia into your own file. But what if you need a large amount of information from a website as quickly as possible, for example data to train a machine learning algorithm? Copying and pasting does not scale to that. This is where web scraping comes in: a program fetches the pages and extracts the data automatically.<br>
Source : [geeksforgeeks](https://www.geeksforgeeks.org/what-is-web-scraping-and-how-to-use-it/)

# Web Scraping vs API: What’s the best way to extract data?
Web scraping lets you extract data from almost any website using scraping tools, while an API gives you direct, structured access to the specific data its provider chooses to expose.

With web scraping, you can access the data for as long as it is available on the website. With an API, access to the data may be limited or expensive.
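
To make the contrast concrete, here is a minimal sketch using only the Python standard library. Both inputs are made-up strings standing in for what a hypothetical API and a hypothetical web page might return; no real endpoint or site is implied, and the `TextByClass` helper is a name invented for this example.

```python
import json
from html.parser import HTMLParser

# Hypothetical API response: the data already arrives structured as JSON.
api_response = '{"title": "Example story", "points": 120}'

# Hypothetical web page carrying the same information inside HTML markup.
page_html = (
    '<html><body><h1 class="title">Example story</h1>'
    '<span class="points">120</span></body></html>'
)


class TextByClass(HTMLParser):
    """Collect the text of each element that has a class attribute."""

    def __init__(self):
        super().__init__()
        self.current_class = None
        self.found = {}

    def handle_starttag(self, tag, attrs):
        # Remember the class of the tag we just entered (None if it has none).
        self.current_class = dict(attrs).get("class")

    def handle_endtag(self, tag):
        self.current_class = None

    def handle_data(self, data):
        if self.current_class is not None:
            self.found[self.current_class] = data.strip()


# API route: one call to json.loads gives a ready-to-use dictionary.
from_api = json.loads(api_response)

# Scraping route: we parse the markup and pick the pieces out ourselves.
parser = TextByClass()
parser.feed(page_html)
from_page = parser.found

print(from_api["title"], from_api["points"])
print(from_page["title"], from_page["points"])
```

Note that the API hands back `points` as a number, while the scraped value is still a string that we would have to convert ourselves.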



# All About Googlebot? Web crawler? Spider? Huh?
These terms all describe the same thing: a bot that crawls the web. Googlebot is Google’s web crawler, and other search engines have their own. It crawls web pages by following links, finds and reads new and updated content, and suggests what should be added to the index. The index is Google’s brain, where all the knowledge resides. Google uses a huge number of computers to send its crawlers to every nook and cranny of the web to find these pages and see what’s on them.<br>
Source: [Yoast](https://yoast.com/what-is-googlebot/)

# How does Googlebot work?
Googlebot uses sitemaps and databases of links discovered during previous crawls to determine where to go next. Whenever the crawler finds new links on a site, it adds them to the list of pages to visit next. If Googlebot finds changes in the links or broken links, it will make a note of that so the index can be updated. The program determines how often it will crawl pages. To make sure Googlebot can correctly index your site, you need to check its crawlability. If your site is available to crawlers they come around often.<br>
Source: [Yoast](https://yoast.com/what-is-googlebot/)
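
The "list of pages to visit next" idea can be sketched in a few lines. This is only a toy illustration, not how Googlebot is actually implemented: the `LINKS` dictionary is a made-up in-memory website, and `crawl` is a hypothetical helper that visits pages breadth-first and notes links that lead nowhere.

```python
from collections import deque

# A made-up, in-memory "web": each page maps to the links found on it.
# A real crawler would download each page and extract its links instead.
LINKS = {
    "/": ["/about", "/blog"],
    "/about": ["/"],
    "/blog": ["/blog/post-1", "/missing"],
    "/blog/post-1": ["/blog"],
}


def crawl(start):
    """Visit pages breadth-first, returning (visited pages, broken links)."""
    to_visit = deque([start])  # pages waiting to be crawled
    seen = {start}             # pages already queued, to avoid repeats
    visited, broken = [], []

    while to_visit:
        page = to_visit.popleft()
        if page not in LINKS:
            broken.append(page)  # note the broken link instead of crawling it
            continue
        visited.append(page)
        for link in LINKS[page]:
            if link not in seen:
                seen.add(link)
                to_visit.append(link)
    return visited, broken


visited, broken = crawl("/")
print(visited)
print(broken)
```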

# Caution! Read before scraping.
Not all websites allow their data to be scraped. Scraping can cause a spike in traffic and may even overload the website's server. Beyond that, the team behind a website works hard to put its data online, so it is up to them whether to allow scraping.
<br><br>
Note: To see what a website allows crawlers to access, open its robots.txt file by appending /robots.txt to the site's root URL.
* Hacker News [Visit](https://news.ycombinator.com/robots.txt)
* ANI News [Visit](https://aninews.in/robots.txt)
* MoneyControl [Visit](https://www.moneycontrol.com/robots.txt)
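
As a small sketch, Python's standard library can read these rules for you through `urllib.robotparser`. The rules below are made up for illustration, and `my-scraper` and `example.com` are placeholder names; for a live site you would instead point the parser at the real file with `set_url(...)` followed by `read()`.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt rules for illustration; real sites publish their own.
rules = """
User-agent: *
Disallow: /private/
Allow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# Ask whether a crawler with the given user agent may fetch each URL.
allowed_news = parser.can_fetch("my-scraper", "https://example.com/news")
allowed_private = parser.can_fetch("my-scraper", "https://example.com/private/data")

print(allowed_news)
print(allowed_private)
```

Here the first URL is allowed and the second is not, because it falls under the disallowed /private/ path.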

# Web Scraping Using BeautifulSoup

## What is BeautifulSoup?
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.<br>
[Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)


In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [2]:
#installing beautifulsoup4
!pip install beautifulsoup4



In [3]:
#installing requests library
!pip install requests



# How is the requests library used?
The requests module in Python lets you send HTTP requests and read the responses. It is a very useful library with many essential methods and features. HTTP works as a request-response system between a client and a server: the client (here, our script) sends a request, and the server replies with a response.

In [4]:
import requests
from bs4 import BeautifulSoup
# mind the capital letters: Python names are case-sensitive (BeautifulSoup, not beautifulsoup)

In [5]:
# Getting a response from the web with requests is the first step in web scraping:
# requests sends the HTTP request, and the server answers with a response.
response=requests.get("https://news.ycombinator.com/")

In [6]:
response

<Response [200]>

Note: <Response [200]> means the request succeeded; 200 is the HTTP status code for OK.
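
The number inside the brackets is an HTTP status code. As a small sketch, the standard-library `http.HTTPStatus` can translate a code into its standard phrase; the `describe` helper is a name made up for this example. With requests, the same number is available as `response.status_code`, and `response.raise_for_status()` raises an exception for 4xx and 5xx responses.

```python
from http import HTTPStatus


def describe(status_code):
    """Return a short label such as '200 OK' for a numeric HTTP status code."""
    status = HTTPStatus(status_code)
    return f"{status.value} {status.phrase}"


# A success, a client error, and a server error.
labels = [describe(code) for code in (200, 404, 500)]
for label in labels:
    print(label)
```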

In [7]:
# the text attribute holds the page's raw HTML as a string
response.text



# Yes! It is quite messy
The raw HTML is hard to read, so now we'll use BeautifulSoup to get the specific data we want.

In [8]:
soup=BeautifulSoup(response.text,'html.parser')

In [9]:
soup

<html lang="en" op="news"><head><meta content="origin" name="referrer"/><meta content="width=device-width, initial-scale=1.0" name="viewport"/><link href="news.css?2S5u5LU4ipwHluPq7mNu" rel="stylesheet" type="text/css"/>
<link href="favicon.ico" rel="shortcut icon"/>
<link href="rss" rel="alternate" title="RSS" type="application/rss+xml"/>
<title>Hacker News</title></head><body><center><table bgcolor="#f6f6ef" border="0" cellpadding="0" cellspacing="0" id="hnmain" width="85%">
<tr><td bgcolor="#ff6600"><table border="0" cellpadding="0" cellspacing="0" style="padding:2px" width="100%"><tr><td style="width:18px;padding-right:4px"><a href="https://news.ycombinator.com"><img height="18" src="y18.gif" style="border:1px white solid;" width="18"/></a></td>
<td style="line-height:12pt; height:10px;"><span class="pagetop"><b class="hnname"><a href="news">Hacker News</a></b>
<a href="newest">new</a> | <a href="front">past</a> | <a href="newcomments">comments</a> | <a href="ask">ask</a> | <a href

# Let's make it prettier

In [10]:
print(soup.prettify())

<html lang="en" op="news">
 <head>
  <meta content="origin" name="referrer"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <link href="news.css?2S5u5LU4ipwHluPq7mNu" rel="stylesheet" type="text/css"/>
  <link href="favicon.ico" rel="shortcut icon"/>
  <link href="rss" rel="alternate" title="RSS" type="application/rss+xml"/>
  <title>
   Hacker News
  </title>
 </head>
 <body>
  <center>
   <table bgcolor="#f6f6ef" border="0" cellpadding="0" cellspacing="0" id="hnmain" width="85%">
    <tr>
     <td bgcolor="#ff6600">
      <table border="0" cellpadding="0" cellspacing="0" style="padding:2px" width="100%">
       <tr>
        <td style="width:18px;padding-right:4px">
         <a href="https://news.ycombinator.com">
          <img height="18" src="y18.gif" style="border:1px white solid;" width="18"/>
         </a>
        </td>
        <td style="line-height:12pt; height:10px;">
         <span class="pagetop">
          <b class="hnname">
           <a href

In [11]:
#fetching the title of the webpage
print(soup.title)

<title>Hacker News</title>
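
Printing `soup.title` shows the whole tag. If only the text is wanted, a tag's `.string` attribute returns it. The sketch below uses a tiny stand-in page instead of the live response, so it runs without a network request.

```python
from bs4 import BeautifulSoup

# A tiny stand-in page, so the example runs without a network request.
html = "<html><head><title>Hacker News</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

title_name = soup.title.name    # the tag's name
title_text = soup.title.string  # the text inside the tag

print(title_name)
print(title_text)
```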


In [12]:
# fetching the body tag of the webpage, including everything inside it
print(soup.body)

<body><center><table bgcolor="#f6f6ef" border="0" cellpadding="0" cellspacing="0" id="hnmain" width="85%">
<tr><td bgcolor="#ff6600"><table border="0" cellpadding="0" cellspacing="0" style="padding:2px" width="100%"><tr><td style="width:18px;padding-right:4px"><a href="https://news.ycombinator.com"><img height="18" src="y18.gif" style="border:1px white solid;" width="18"/></a></td>
<td style="line-height:12pt; height:10px;"><span class="pagetop"><b class="hnname"><a href="news">Hacker News</a></b>
<a href="newest">new</a> | <a href="front">past</a> | <a href="newcomments">comments</a> | <a href="ask">ask</a> | <a href="show">show</a> | <a href="jobs">jobs</a> | <a href="submit">submit</a> </span></td><td style="text-align:right;padding-right:4px;"><span class="pagetop">
<a href="login?goto=news">login</a>
</span></td>
</tr></table></td></tr>
<tr id="pagespace" style="height:10px" title=""></tr><tr><td><table border="0" cellpadding="0" cellspacing="0" class="itemlist">
<tr class="athing

In [13]:
# fetching the direct children of the body tag as a list
print(soup.body.contents)

[<center><table bgcolor="#f6f6ef" border="0" cellpadding="0" cellspacing="0" id="hnmain" width="85%">
<tr><td bgcolor="#ff6600"><table border="0" cellpadding="0" cellspacing="0" style="padding:2px" width="100%"><tr><td style="width:18px;padding-right:4px"><a href="https://news.ycombinator.com"><img height="18" src="y18.gif" style="border:1px white solid;" width="18"/></a></td>
<td style="line-height:12pt; height:10px;"><span class="pagetop"><b class="hnname"><a href="news">Hacker News</a></b>
<a href="newest">new</a> | <a href="front">past</a> | <a href="newcomments">comments</a> | <a href="ask">ask</a> | <a href="show">show</a> | <a href="jobs">jobs</a> | <a href="submit">submit</a> </span></td><td style="text-align:right;padding-right:4px;"><span class="pagetop">
<a href="login?goto=news">login</a>
</span></td>
</tr></table></td></tr>
<tr id="pagespace" style="height:10px" title=""></tr><tr><td><table border="0" cellpadding="0" cellspacing="0" class="itemlist">
<tr class="athing" id=

In [14]:
# fetching all the div tags in the webpage
print(soup.find_all('div'))

[<div class="votearrow" title="upvote"></div>, <div class="votearrow" title="upvote"></div>, <div class="votearrow" title="upvote"></div>, <div class="votearrow" title="upvote"></div>, <div class="votearrow" title="upvote"></div>, <div class="votearrow" title="upvote"></div>, <div class="votearrow" title="upvote"></div>, <div class="votearrow" title="upvote"></div>, <div class="votearrow" title="upvote"></div>, <div class="votearrow" title="upvote"></div>, <div class="votearrow" title="upvote"></div>, <div class="votearrow" title="upvote"></div>, <div class="votearrow" title="upvote"></div>, <div class="votearrow" title="upvote"></div>, <div class="votearrow" title="upvote"></div>, <div class="votearrow" title="upvote"></div>, <div class="votearrow" title="upvote"></div>, <div class="votearrow" title="upvote"></div>, <div class="votearrow" title="upvote"></div>, <div class="votearrow" title="upvote"></div>, <div class="votearrow" title="upvote"></div>, <div class="votearrow" title="upv

In [15]:
#fetching all the links (a tag) in the data of the webpage
print(soup.find_all('a'))

        Applications are open for YC Summer 2022
      </a>, <a href="newsguidelines.html">Guidelines</a>, <a href="newsfaq.html">FAQ</a>, <a href="lists">Lists</a>, <a href="https://github.com/HackerNews/API">API</a>, <a href="security.html">Security</a>, <a href="http://www.ycombinator.com/legal/">Legal</a>, <a href="http://www.ycombinator.com/apply/">Apply to YC</a>, <a href="mailto:hn@ycombinator.com">Contact</a>]


In [16]:
#fetching the first link (a tag) in the data of the webpage
print(soup.a)

<a href="https://news.ycombinator.com"><img height="18" src="y18.gif" style="border:1px white solid;" width="18"/></a>
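
The tags returned by `find_all` are usually a starting point rather than the end result. As a minimal sketch, the snippet below pulls the link text and the `href` value out of each `a` tag and narrows a search by class. The HTML is a small stand-in modelled on the output above, not the live page.

```python
from bs4 import BeautifulSoup

# A small stand-in snippet modelled on the tags printed above.
html = """
<div class="votearrow" title="upvote"></div>
<a href="newsguidelines.html">Guidelines</a>
<a href="newsfaq.html">FAQ</a>
<a>no href here</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Collect (text, href) pairs, skipping <a> tags that have no href attribute.
links = [
    (a.get_text(strip=True), a["href"])
    for a in soup.find_all("a", href=True)
]
print(links)

# Narrow a search by CSS class; class_ avoids clashing with Python's keyword.
arrows = soup.find_all("div", class_="votearrow")
print(len(arrows))
```

Passing `href=True` keeps only the tags that actually carry that attribute, which avoids a `KeyError` when indexing with `a["href"]`.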


# Conclusion
We have covered the basics of web scraping: checking robots.txt, fetching a page with requests, and navigating the HTML with BeautifulSoup. In the next lecture we'll dive deeper into web scraping.