<h1>Web Scraping</h1>

* Web scraping is the process of gathering information from the Internet.
  
* It is also called web data mining or web harvesting.
  
* The data is usually obtained in an unstructured format and is converted into a structured format after several pre-processing steps.
  
* Web scraping can also be done manually for small web pages by simply copying and pasting the data from the web page.
  
* When we need data at a large scale and from multiple web pages, we use web scrapers.


<h2>Web scraping and crawling</h2>

<h3>Web scraping</h3>

* The act of automatically downloading a web page's data and extracting specific information from it.
  
* The extracted information can be stored pretty much anywhere (database, file, etc.).
  
* Can be implemented at any scale.

<h3>Web crawling</h3>

*  The act of automatically downloading a web page's data, extracting the hyperlinks it contains, and following them.
  
*  The downloaded data is generally stored in an index or a database to make it easily searchable.

*  Mostly done at a large scale.
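
The crawling loop described above (download a page, extract its links, follow them) can be sketched as follows. This is a minimal illustration, not a production crawler: the `PAGES` dictionary is a hypothetical in-memory stand-in for real downloads, and `LinkParser` and `crawl` are names invented for this example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "website" (URL -> HTML). A real crawler would download these pages.
PAGES = {
    "https://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.com/a": '<a href="/b">B</a>',
    "https://example.com/b": '<a href="/">Home</a>',
}


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, pages):
    """Breadth-first crawl: visit a page, extract its links, follow the new ones."""
    seen = {start_url}
    queue = deque([start_url])
    order = []
    while queue:
        url = queue.popleft()
        html = pages.get(url)
        if html is None:
            continue  # page not available, skip it
        order.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return order


visited = crawl("https://example.com/", PAGES)
print(visited)
```

The `seen` set is what keeps the crawler from looping forever when pages link back to each other.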

<h2> Why is web scraping often seen negatively? </h2>

1. It's increasingly being used for business purposes to gain a competitive advantage. So there's often a financial motive behind it.

2. It's often done in complete disregard of copyright laws and of Terms of Service (ToS).

3. It's often done in an abusive manner.
   Example: web scrapers might send many more requests per second than a human would, causing an unexpected load on websites.
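
One common way to avoid this kind of load is to pause between requests. The sketch below assumes a hypothetical `polite_fetch_all` helper and uses a stand-in `fake_fetch` function instead of real downloads; a suitable delay depends on the website and is not specified here.

```python
import time


def polite_fetch_all(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # wait before every request except the first
        results.append(fetch(url))
    return results


# Hypothetical stand-in for a real download function such as requests.get.
def fake_fetch(url):
    return f"content of {url}"


start = time.monotonic()
pages = polite_fetch_all(["/page/1", "/page/2", "/page/3"], fake_fetch, delay_seconds=0.1)
elapsed = time.monotonic() - start
print(len(pages), "pages fetched")
```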


Note: Many individuals and companies run their own web scrapers, so many that it causes headaches for the companies whose websites are scraped, such as social networks (e.g. Facebook, LinkedIn) and online stores (e.g. Amazon). This is probably why Facebook has separate terms for automated data collection.

<h2> Is scraping legal or illegal? </h2>

* Web scraping and crawling aren't illegal by themselves.
  
* Some websites require permission before they can be crawled, and some prohibit crawling altogether in their Terms of Service (ToS).
  
* Example: LinkedIn sued anonymous scrapers on the following grounds: <br>
    1. Violation of the Computer Fraud and Abuse Act (CFAA). <br>
    2. Violation of California Penal Code. <br>
    3. Violation of the Digital Millennium Copyright Act (DMCA). <br>
    4. Breach of contract. <br>
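
Besides the ToS, many websites publish a `robots.txt` file that states which paths automated clients may visit. The sketch below checks such rules with Python's standard `urllib.robotparser`. The rules and URLs are hypothetical, and passing this check is not the same as having legal permission; it only shows one way a scraper can respect a site's stated crawling rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real site publishes its own at /robots.txt.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

allowed_story = parser.can_fetch("*", "https://example.com/story/1/")
allowed_private = parser.can_fetch("*", "https://example.com/private/data")
print(allowed_story, allowed_private)
```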


<h2>Uses of Web Scraping</h2>

1. <b>Marketing</b> : Companies collect information about their products or services from social media websites to gauge public sentiment. Some also extract email addresses from websites and send bulk promotional emails to their owners.
   
2. <b> Content Creation</b> : Gathering information from news articles, research reports, and blog posts helps creators produce quality, trending content.
   
3. <b> Price Comparison </b> : Web scraping can be used to extract prices from multiple e-commerce websites and compare them.
   
4. <b> Job Postings</b> : Web scraping can also be used to collect data on job openings across multiple job portals, which helps both job seekers and recruiters.

5. <b> Data for Machine Learning Projects</b> : Many machine learning projects rely on web scraping to collect their data.

6. <b>Search Engine Optimization (SEO) </b>: Web scraping is widely used by SEO tools such as SEMRush and Majestic to tell businesses how they rank for the search keywords that matter to them.

<h2>Python libraries for scraping</h2>

* Scrapy
* Selenium
* BeautifulSoup
  

<h3> Required libraries </h3>

* pip install requests
* pip install urllib3
* pip install bs4
* pip install lxml

In [1]:
!pip install requests



In [1]:
# Use requests to send an HTTP GET request to https://gorkhapatraonline.com/
import requests
r = requests.get('https://gorkhapatraonline.com/')

In [3]:
# Retrieve the response body as a string with the .text property
# and display its first 600 characters
r.text[:600]

'<!DOCTYPE html>\n<html class="no-js" lang="">\n\n<head>\n\t<meta charset="utf-8"/>\n\t<meta content="ie=edge" http-equiv="x-ua-compatible"/>\n\n\t\t\t<title> Gorkhapatra | Nepal&#039;s First News Organization</title>\n\t\n\t        <meta\n            name="description"\n            content="Gorkhapatra is the oldest national daily newspaper of Nepal. It is run by Gorkhapatra Sansthan. It was launched as a weekly in May\n\t\t\t  1901 and became a daily newspaper in 1961.Gorkhapatra Online is Nepal\'s oldest newspaper which is now available in digital form as well.Nepal\'s digital newspaper, online destination for Nepa'
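
The request above assumes the download succeeds. A slightly more defensive variant sets a timeout and raises an error on a failed response. This is only a sketch: `fetch_html` is a hypothetical helper, and `FakeResponse` and `fake_get` are stand-ins so the example runs without network access.

```python
import requests


def fetch_html(url, get=requests.get, timeout=10):
    """Download a page and return its HTML, raising on a failed request."""
    response = get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx status codes
    return response.text


# Offline demonstration with a hypothetical stand-in for requests.get.
class FakeResponse:
    text = "<html><title>demo</title></html>"

    def raise_for_status(self):
        pass  # pretend the request succeeded


def fake_get(url, timeout):
    return FakeResponse()


html = fetch_html("https://example.com/", get=fake_get)
print(html[:6])
```

With a real site, the call would simply be `fetch_html('https://gorkhapatraonline.com/')`.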

In [4]:
! pip install urllib3



In [5]:
!pip install bs4



In [6]:
# lxml is a high-performance HTML and XML parsing library.
! pip install lxml



In [8]:
import csv
import lxml
import requests
from bs4 import BeautifulSoup

r = requests.get('https://annapurnapost.com/category/opinion/')
# print(r)
soup = BeautifulSoup(r.text, 'lxml')
# print(soup.prettify())

match = soup.title.text
match


'विचार | अन्नपूर्ण पोस्ट्  '

In [10]:
match = soup.div
print(match.prettify())

<div id="fb-root">
</div>



In [9]:
# csv_file = open('news_scrape.csv', 'w')
# csv_writer = csv.writer(csv_file)
# csv_writer.writerow(['headline', 'summary', 'news_link'])

<h3>Extracting news from a single page</h3>

In [12]:
import requests
from bs4 import BeautifulSoup

r = requests.get('https://annapurnapost.com/story/449623/')
soup = BeautifulSoup(r.text, 'lxml')
# print(soup.prettify())
match = soup.title.text
match

'सार्वजनिक ऋणका पाँच सबाल | अन्नपूर्ण पोस्ट्  '

In [23]:
# Scrape the title
title = soup.find('div', class_='news__details-titles')  # container that holds the title, author, and date
# print(title)
p_title = title.h1.text
# print(p_title)
print("The title of the article is:", p_title)
# print(title.prettify())

The title of the article is: 
                    सार्वजनिक ऋणका पाँच सबाल
                


In [24]:
#scrape author name
author_element = title.find('p', class_='author__name')

if author_element:
    author_name = author_element.a.text.strip()
    print("Author's Name:", author_name)
else:
    print("Author's name not found.")

Author's Name: गोपीनाथ मैनाली


In [13]:
#scrape date
date_element = title.find('p', class_='date')

if date_element:
    date = date_element.text.strip()
    print("Date:", date)
else:
    print("Date not found")

Date: पुष २०, २०८० शुक्रबार १०:५८:५९


In [27]:
#scrape body of news article
body = soup.find('div', class_='news__details').text
print(body)


विकासशील मुलुकहरू दोहोरो वित्त समस्यामा हुन्छन्। पहिलो, वित्तीय स्रोतको न्यूनता र दोस्रो, उपलब्ध स्रोतको उपयोगको न्यून सामर्थ्य। नेपाल यसको अपवाद होइन। नेपालले तिर्न बाँकी ऋण करिब २,३४९ अर्ब पुगेको छ। जसमा १,१७० अर्ब बाह्य र अन्तरिक १,१७९ अर्ब छ। यो प्रतिव्यक्तिका हिसाबमा करिब ७९ हजार नाघेको छ। अर्कोतर्फ विकास वित्तका लागि लिइएको ऋण समयमै खर्च हुन नसक्दा पुँजी निर्माणको क्रम पनि सुस्त छ। भएको खर्चको समयमै शोधभर्ना लिन नहुँदा नगद प्रवाह पनि समस्यामा पर्ने गरेको छ।








राजनीतिक तवरबाट सार्वजनिक ऋण बढ्दै गएकोमा आलोचना भइरहेकै छ तर सरकारमा पुगेपछि ऋणलाई सजिलो विकल्पको रूपमा लिने प्रवृत्ति छ। अझ ऋण लिन सक्नुलाई राजनीतिक सफलता मानिन्छ। नेपालको बजेट सबै घाटा बजेट हुन्, राष्ट्रिय बजेटको २५ प्रतिशतभन्दा माथिको हिस्सा सार्वजनिक ऋणले लिँदै आएको छ। २०७८/०७९ को बजेट १,६३२ अर्बमा आन्तरिक राजस्व १,०५० अर्ब, अनुदान ५९.९ अर्ब भई खुद न्यून ५२२ अर्बमध्ये २८३ अर्ब र आन्तरिक ऋण २३९ अर्ब थियो। त्यसपछिका आर्थिक वर्षदेखि आन्तरिक ऋणले बाह्यलाई उछिन्दै आएको छ। चालू आव २०८०/०८१ को बजेटको आकार १,७५१ अर्बमा ४
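
The cells above can be combined into one function that returns all four fields and tolerates missing elements. This is a sketch under assumptions: `sample_html` is hypothetical markup that only imitates the class names used above, and `parse_article` and `text_or_none` are names invented for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML that mimics the class names used in the cells above.
sample_html = """
<div class="news__details-titles">
  <h1>  Sample headline  </h1>
  <p class="author__name"><a href="#">Sample Author</a></p>
  <p class="date">2024-01-05</p>
</div>
<div class="news__details"><p>First paragraph.</p><p>Second paragraph.</p></div>
"""


def text_or_none(element):
    """Return the stripped text of an element, or None if it was not found."""
    return element.get_text(strip=True) if element is not None else None


def parse_article(html):
    """Extract title, author, date, and body; missing parts become None."""
    soup = BeautifulSoup(html, "html.parser")
    header = soup.find("div", class_="news__details-titles")
    body = soup.find("div", class_="news__details")
    return {
        "title": text_or_none(header.h1) if header else None,
        "author": text_or_none(header.find("p", class_="author__name")) if header else None,
        "date": text_or_none(header.find("p", class_="date")) if header else None,
        "body": body.get_text(" ", strip=True) if body else None,
    }


article = parse_article(sample_html)
empty = parse_article("<p>no article here</p>")
print(article["title"], "|", article["author"], "|", article["date"])
```

Stripping the text also removes the surrounding whitespace visible in the printed title above.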

<h3>Using find method</h3>

In [3]:
import csv
import lxml
import requests
from bs4 import BeautifulSoup

r = requests.get('https://annapurnapost.com/category/opinion/')
soup = BeautifulSoup(r.text, 'lxml')

In [9]:
article = soup.find('div', class_='category__news')
headline = article.h3.a.text
print(headline)

हिंसामा सामाजिक मनोविज्ञान


In [10]:
summary = soup.find('div', class_='card__desc')
summary_text = summary.text  # avoid naming a variable `sum`, which shadows the built-in
summary_text

'\n                                \n                                नेपालीमा उखान छ, सुतेको मानिसलाई जगाउन सकिन्छ तर सुतेको जस्तै बहाना गर्नेलाई जगाउन सकिँदैन। यही उखानसँग मिल्दोजुल्दो व्यवहार नेपालको सबैभन्दा पुरानो र ठूलो …\n                                \n                            '
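
`find` returns only the first matching element (or `None` when nothing matches), while `find_all` returns every match. The sketch below shows the difference on a small hypothetical listing page; `demo_html` and the other names are invented for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical listing page with two cards, mimicking the class names above.
demo_html = """
<div class="grid__card"><h3><a href="/story/1/">First headline</a></h3></div>
<div class="grid__card"><h3><a href="/story/2/">Second headline</a></h3></div>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

first_card = demo_soup.find("div", class_="grid__card")     # a single Tag, or None
all_cards = demo_soup.find_all("div", class_="grid__card")  # a list-like ResultSet
missing = demo_soup.find("div", class_="does-not-exist")    # no match

first_title = first_card.h3.a.text
all_titles = [card.h3.a.text for card in all_cards]

print(first_title)
print(all_titles)
print(missing)
```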

<h4>Extract title only</h4>

In [15]:
article =  soup.find_all('div', class_='grid__card')

for i in article:
    # print(i.prettify())
    title = i.h3.a.text
    print(title)

त्रिवि रेकर्डको अन्तर्य
हिंसामा सामाजिक मनोविज्ञान
सार्वजनिक ऋणका पाँच सबाल
जे देखें, त्यही लेखें
उत्पादनसँगै माग बढाउनुपर्‍यो
पीडामा एम्सका नेपाली डाक्टर
भोकको मूल्य
बलमिच्याइँमा विश्वविद्यालय
विश्वव्यापी आर्थिक संकटका बाछिटा
द्वन्द्वपीडित प्राथमिकतामा परेनन्
उकुसमुकुस छन् नागरिक
जलविद्युत् क्षेत्रमा तेस्रो नियामकको उदय
यसरी हटाऔं बारुदी सुरुङको जोखिम
प्रचण्डको कमजोर एक वर्ष
बालकुमारी काण्ड : कोरिया रोजगारी गुम्ने संशय
पुँजीवादको राजनीति
मधुमेहको शिक्षा र सुरक्षा
बालबालिकामा ग्याजेटको असर
इजरायल–हमास युद्ध धेरै नलम्बिने सङ्केत
नयाँ राजनीतिक विचार कहिले ?

                                
                                सियोल : दक्षिण कोरियाको सीमावर्ती टापु योन्प्योङमा उत्तर कोरियाले लगातार तेस्रो दिन ‘प्रत्य…
                                
                            

                                
                                काठमाडौं : गत चैतसम्म नारायणगढ बुटवल सडक खण्डमा जम्मा ५.२९ मिटर सडक कालोपत्रे गरिएको थियो। ८ …
                                
                       

<h4>Extract summary only</h4>

In [16]:
summary = soup.find_all('div', class_='card__desc')

for i in summary:
    # print(i.prettify())
    summary_text = i.text  # avoid naming a variable `sum`, which shadows the built-in
    print(summary_text)


                                
                                नेपालीमा उखान छ, सुतेको मानिसलाई जगाउन सकिन्छ तर सुतेको जस्तै बहाना गर्नेलाई जगाउन सकिँदैन। यही उखानसँग मिल्दोजुल्दो व्यवहार नेपालको सबैभन्दा पुरानो र ठूलो …
                                
                            

                            
                            सृष्टिको रथ धान्ने अर्को पांग्रोलाई हेर्ने दृष्टिकोण समान छ ? अर्थात् नारीलाई हामीले गर्ने व्यवहार …
                            
                        

                            
                            विकासशील मुलुकहरू दोहोरो वित्त समस्यामा हुन्छन्। पहिलो, वित्तीय स्रोतको न्यूनता र दोस्रो, उपलब्ध स्…
                            
                        

                            
                            गाउँमा बिताएका बाल्यकालका स्मरणमध्ये एउटा कुराले अहिले पनि मेरो मगज खलबली रहन्छ। म बसेको गाउँ…
                            
                        

                            
                            सन् २०३५ सम्म २८ हजार म

<h4>Extract link only</h4>

In [17]:
article =  soup.find_all('div', class_='grid__card')

for i in article:
    # print(i.prettify())
    link = i.h3.a['href']
    print(link)

/story/449734/
/story/449733/
/story/449623/
/story/449625/
/story/449559/
/story/449555/
/story/449554/
/story/449486/
/story/449485/
/story/449478/
/story/449429/
/story/449425/
/story/449421/
/story/449414/
/story/449374/
/story/449317/
/story/449316/
/story/449290/
/story/449244/
/story/449196/
/story/449790/
/story/449729/
/story/449746/
/story/449732/
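
The links printed above are relative paths, so they must be combined with the site's base URL before they can be requested. A minimal sketch using the standard library's `urljoin`, with two of the paths from the output above:

```python
from urllib.parse import urljoin

base_url = "https://annapurnapost.com"
relative_links = ["/story/449734/", "/story/449733/"]

# urljoin resolves each relative path against the base URL.
full_links = [urljoin(base_url, link) for link in relative_links]

# A link that is already absolute is returned unchanged.
already_absolute = urljoin(base_url, "https://example.com/other/")

print(full_links)
print(already_absolute)
```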


In [9]:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import csv

# Base URL of the website
base_url = "https://annapurnapost.com"

r = requests.get('https://annapurnapost.com/category/opinion/')
soup = BeautifulSoup(r.text, 'lxml')

# Create a CSV file for writing
with open('article_data.csv', mode='w', newline='', encoding='utf-8') as csv_file:
    writer = csv.writer(csv_file)
    
    # Write the header row
    writer.writerow(['Title', 'Linked Summary', 'Full Link'])

    # Iterate through the links and visit each page
    article = soup.find_all('div', class_='grid__card')
    for i in article:
        relative_link = i.h3.a['href']
        # print(relative_link)
        full_link = urljoin(base_url, relative_link)  # Resolve the relative link against the base URL
        # print(full_link)

        # Send an HTTP GET request to the linked page
        response = requests.get(full_link)
        # print(response)
  
        if response.status_code == 200:
            # Parse the content of the linked page using BeautifulSoup
            linked_soup = BeautifulSoup(response.content, 'html.parser')

            # Extract the title and summary from the linked page
            linked_title = linked_soup.find('div', class_='news__details-titles').h1.text
            # print(linked_title)
            linked_summary = linked_soup.find('div', class_='news__details').text.strip()
            # print(linked_summary)
            # Write the data to the CSV file
            writer.writerow([linked_title, linked_summary, full_link])

        else:
            print("Failed to retrieve the linked page:", response.status_code)
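
Opening the file with `encoding='utf-8'` and `newline=''` is what lets the Nepali text and multi-line fields survive a round trip through CSV. The sketch below writes one row to a temporary file and reads it back. The file name `demo_articles.csv` is hypothetical, and the row reuses a title and link that appear earlier in this notebook.

```python
import csv
import os
import tempfile

rows = [
    {
        "Title": "सार्वजनिक ऋणका पाँच सबाल",
        "Full Link": "https://annapurnapost.com/story/449623/",
    },
]

# Write to a temporary directory so the example does not overwrite article_data.csv.
path = os.path.join(tempfile.mkdtemp(), "demo_articles.csv")

with open(path, mode="w", newline="", encoding="utf-8") as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=["Title", "Full Link"])
    writer.writeheader()
    writer.writerows(rows)

# Read the file back with the same encoding and compare.
with open(path, newline="", encoding="utf-8") as csv_file:
    loaded = list(csv.DictReader(csv_file))

print(loaded[0]["Title"])
```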


In [25]:
!pip install pandas


In [10]:
import pandas as pd

df = pd.read_csv("article_data.csv")

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24 entries, 0 to 23
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Title           24 non-null     object
 1   Linked Summary  24 non-null     object
 2   Full Link       24 non-null     object
dtypes: object(3)
memory usage: 704.0+ bytes


In [15]:
df["Full Link"][0]

'https://annapurnapost.com/story/449800/'
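
The scraped titles still carry the leading and trailing whitespace from the page HTML, so a typical next step is to clean the text columns. The sketch below uses a small hypothetical DataFrame (`df_demo`) with invented rows rather than the real `article_data.csv`.

```python
import pandas as pd

# Hypothetical rows imitating article_data.csv; titles keep whitespace from the page HTML.
df_demo = pd.DataFrame(
    {
        "Title": ["\n      First headline\n   ", "\n      Second headline\n   "],
        "Full Link": [
            "https://annapurnapost.com/story/1/",
            "https://annapurnapost.com/story/2/",
        ],
    }
)

# Remove leading and trailing whitespace from every title.
df_demo["Title"] = df_demo["Title"].str.strip()

titles = df_demo["Title"].tolist()
print(titles)
```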