# Part 1 Web Scraping

Web scraping is an art: one has to study each website and adapt to how that particular site is built and behaves.

The most common tools used for web scraping in Python are listed below.

1. Requests https://requests.readthedocs.io/en/latest/
2. Beautiful Soup https://beautiful-soup-4.readthedocs.io/en/latest/
3. Selenium https://selenium-python.readthedocs.io/
4. Scrapy https://docs.scrapy.org/en/latest/

We will work with the first three; the fourth can be explored in the homework.

We will be scraping 4 websites today:

1. ScrapeThisSite
2. GeeksforGeeks
3. CNBC
4. Hoopshype

Scraping a dynamic website requires different techniques from scraping a static one; these are discussed in the coming sections.

Some websites expose open APIs, which can be used to fetch the data directly without scraping the HTML or XML pages.

In [None]:
# installing the libraries
!pip install requests
!pip install bs4
!pip install selenium
!apt-get update
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin

Hit:1 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:2 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Get:3 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Hit:4 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:5 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:6 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Ign:7 https://r2u.stat.illinois.edu/ubuntu jammy InRelease
Hit:8 https://r2u.stat.illinois.edu/ubuntu jammy Release
Hit:9 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:10 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Get:11 http://security.ubuntu.com/ubuntu jammy-security/main amd64 Packages [2,265 kB]
Hit:12 https://ppa.launchpadcontent.net/saiarcot895/chromium-beta/ubuntu jammy InRelease
Hit:13 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Get:14 http://security.ubuntu.com/ubu

In [None]:
# importing the libraries
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import pandas as pd
import json
from google.colab import drive
import sys

In [None]:
drive.mount("/content/drive/")

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [None]:
url = 'https://www.scrapethissite.com/pages/forms/'

In [None]:
page = requests.get(url)

In [None]:
soup = BeautifulSoup(page.text, 'html.parser')

In [None]:
print(soup)

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<title>Hockey Teams: Forms, Searching and Pagination | Scrape This Site | A public sandbox for learning web scraping</title>
<link href="/static/images/scraper-icon.png" rel="icon" type="image/png"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<meta content="Browse through a database of NHL team stats since 1990. Practice building a scraper that handles common website interface components." name="description"/>
<link crossorigin="anonymous" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/css/bootstrap.min.css" integrity="sha256-MfvZlkHCEqatNoGiOXveE8FIwMzZg4W85qfrfIFBfYc= sha512-dTfge/zgoMYpP7QbHy4gWMEGsbsdZeCXz7irItjcC3sPUFtf0kuFbDz/ixG7ArTxmDjLXDmezHubeNikyKGVyQ==" rel="stylesheet"/>
<link href="https://fonts.googleapis.com/css?family=Lato:400,700" rel="stylesheet" type="text/css"/>
<link href="/static/css/styles.css" rel="stylesheet" type="text/css"/>
<meta content="noindex" name="robot

In [None]:
soup.find('div')

<div class="container">
<div class="col-md-12">
<ul class="nav nav-tabs">
<li id="nav-homepage">
<a class="nav-link hidden-sm hidden-xs" href="/">
<img id="nav-logo" src="/static/images/scraper-icon.png"/>
                                Scrape This Site
                            </a>
</li>
<li id="nav-sandbox">
<a class="nav-link" href="/pages/">
<i class="glyphicon glyphicon-console hidden-sm hidden-xs"></i>
                                Sandbox
                            </a>
</li>
<li id="nav-lessons">
<a class="nav-link" href="/lessons/">
<i class="glyphicon glyphicon-education hidden-sm hidden-xs"></i>
                                Lessons
                            </a>
</li>
<li id="nav-faq">
<a class="nav-link" href="/faq/">
<i class="glyphicon glyphicon-flag hidden-sm hidden-xs"></i>
                                FAQ
                            </a>
</li>
<li class="pull-right" id="nav-login">
<a class="nav-link" href="/login/">
                                Login

In [None]:
soup.find_all('p')

[<p class="lead">
                             Browse through a database of NHL team stats since 1990. Practice building a scraper that handles common website interface components.
                             Take a look at how pagination and search elements change the URL as your browse. Build a web scraper that can conduct searches and paginate through the results.
                         </p>,
 <p>
 <i class="glyphicon glyphicon-education"></i> There are <a href="/lessons/">8 video lessons</a> that show you how to scrape this page.
                         </p>,
 <p>
                             
                                 Data via
                                 <a class="data-attribution" href="http://www.opensourcesports.com/hockey/" target="_blank">http://www.opensourcesports.com/hockey/</a>
 </p>]

In [None]:
soup.find('p', class_ = 'lead').text

'\n                            Browse through a database of NHL team stats since 1990. Practice building a scraper that handles common website interface components.\n                            Take a look at how pagination and search elements change the URL as your browse. Build a web scraper that can conduct searches and paginate through the results.\n                        '

In [None]:
soup.find('p', class_ = 'lead').text.strip()

'Browse through a database of NHL team stats since 1990. Practice building a scraper that handles common website interface components.\n                            Take a look at how pagination and search elements change the URL as your browse. Build a web scraper that can conduct searches and paginate through the results.'
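The stripped text above still has a newline and a run of spaces in the middle, because `strip()` only trims the two ends. A minimal sketch of one way to collapse interior whitespace with built-in string methods follows; the helper name `normalize_whitespace` is made up for this illustration, and the sample string is a shortened stand-in for the scraped paragraph.

```python
def normalize_whitespace(text):
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    # str.split() with no arguments splits on any whitespace run and ignores
    # leading/trailing whitespace, so joining with " " yields one clean line.
    return " ".join(text.split())


# Shortened stand-in for the scraped paragraph (not the full text).
raw = "\n        Browse through a database of NHL team stats since 1990.\n        Take a look at how pagination works.\n    "
print(normalize_whitespace(raw))
```

The same helper can be applied to any `.text` value returned by Beautiful Soup before storing it.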

In [None]:
soup.find_all('th')

[<th>
                             Team Name
                         </th>,
 <th>
                             Year
                         </th>,
 <th>
                             Wins
                         </th>,
 <th>
                             Losses
                         </th>,
 <th>
                             OT Losses
                         </th>,
 <th>
                             Win %
                         </th>,
 <th>
                             Goals For (GF)
                         </th>,
 <th>
                             Goals Against (GA)
                         </th>,
 <th>
                             + / -
                         </th>]

In [None]:
soup.find('th').text.strip()

'Team Name'

### My work for Task 1 and Task 5:

I created a pandas DataFrame from the table available on the website. It serves both Task 1 and Task 5; I discussed this with the professor, and she said that is fine.

In [None]:
import pandas as pd

rows = []
# request pages 1 to 6 with 100 rows per page
for i in range(1, 7):
  url = 'https://www.scrapethissite.com/pages/forms/?page_num={}&per_page=100'.format(i)
  print(url)
  page = requests.get(url)
  soup = BeautifulSoup(page.text, 'html.parser')
  table = soup.find('table')
  # column names come from the header cells
  columns = [th.text.replace("\n", "").strip() for th in table.find_all('th')]
  # skip the header row, then collect the cleaned text of every cell
  for tr in table.find_all('tr')[1:]:
    rows.append([td.text.replace("\n", "").strip() for td in tr.find_all('td')])

df_from_html = pd.DataFrame(rows, columns=columns)
df_from_html

https://www.scrapethissite.com/pages/forms/?page_num=1&per_page=100
https://www.scrapethissite.com/pages/forms/?page_num=2&per_page=100
https://www.scrapethissite.com/pages/forms/?page_num=3&per_page=100
https://www.scrapethissite.com/pages/forms/?page_num=4&per_page=100
https://www.scrapethissite.com/pages/forms/?page_num=5&per_page=100
https://www.scrapethissite.com/pages/forms/?page_num=6&per_page=100


| | Team Name | Year | Wins | Losses | OT Losses | Win % | Goals For (GF) | Goals Against (GA) | + / - |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Boston Bruins | 1990 | 44 | 24 | | 0.55 | 299 | 264 | 35 |
| 1 | Buffalo Sabres | 1990 | 31 | 30 | | 0.388 | 292 | 278 | 14 |
| 2 | Calgary Flames | 1990 | 46 | 26 | | 0.575 | 344 | 263 | 81 |
| 3 | Chicago Blackhawks | 1990 | 49 | 23 | | 0.613 | 284 | 211 | 73 |
| 4 | Detroit Red Wings | 1990 | 34 | 38 | | 0.425 | 273 | 298 | -25 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 577 | Tampa Bay Lightning | 2011 | 38 | 36 | 8 | 0.463 | 235 | 281 | -46 |
| 578 | Toronto Maple Leafs | 2011 | 35 | 37 | 10 | 0.427 | 231 | 264 | -33 |
| 579 | Vancouver Canucks | 2011 | 51 | 22 | 9 | 0.622 | 249 | 198 | 51 |
| 580 | Washington Capitals | 2011 | 42 | 32 | 8 | 0.512 | 222 | 230 | -8 |
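Every cell scraped from HTML arrives as a string, and the OT Losses cell is empty for the early seasons. The sketch below shows one way to turn a scraped row into numbers with plain Python; the helper `to_number` is hypothetical, and the sample row is copied from the first row of the table above.

```python
def to_number(cell):
    """Convert a scraped cell to int or float; return None for an empty cell."""
    cell = cell.strip()
    if cell == "":
        return None
    try:
        return int(cell)
    except ValueError:
        # Values such as "0.55" are not valid integers, so fall back to float.
        return float(cell)


# First row of the table above: team name followed by eight numeric columns.
row = ["Boston Bruins", "1990", "44", "24", "", "0.55", "299", "264", "35"]
typed_row = [row[0]] + [to_number(cell) for cell in row[1:]]
print(typed_row)
```

On a whole DataFrame, `pd.to_numeric(df[column], errors='coerce')` does a similar conversion column by column.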


In [None]:
# URL of the GeeksforGeeks article to scrape
# open the URL in another browser tab to check the information we are extracting
url = "https://www.geeksforgeeks.org/python-programming-language/"

In [None]:
# hitting the url and getting the response
res = requests.get(url)
print(res.status_code)

200


In [None]:
# creating a soup object from the returned html page
sp = BeautifulSoup(res.text, "lxml")
sp

<!DOCTYPE html>
<!--[if IE 7]>
<html class="ie ie7" lang="en-US" prefix="og: http://ogp.me/ns#">
<![endif]--><!--[if IE 8]>
<html class="ie ie8" lang="en-US" prefix="og: http://ogp.me/ns#">
<![endif]--><!--[if !(IE 7) | !(IE 8)  ]><!--><html lang="en-US" prefix="og: http://ogp.me/ns#">
<!--<![endif]-->
<head>
<meta charset="utf-8"/>
<meta content="Data Structures, Algorithms, Python, Java, C, C++, JavaScript, Android Development, SQL, Data Science, Machine Learning, PHP, Web Development, System Design, Tutorial, Technical Blogs, Interview Experience, Interview Preparation, Programming, Competitive Programming, Jobs, Coding Contests, GATE CSE, HTML, CSS, React, NodeJS, Placement, Aptitude, Quiz, Computer Science, Programming Examples, GeeksforGeeks Courses, Puzzles, SSC, Banking, UPSC, Commerce, Finance, CBSE, School, k12, General Knowledge, News, Mathematics, Exams" name="keywords"/>
<meta content="width=device-width, initial-scale=1.0, minimum-scale=0.5, maximum-scale=3.0" name="viewp

In [None]:
# printing it in readable format
print(sp.prettify())

<!DOCTYPE html>
<!--[if IE 7]>
<html class="ie ie7" lang="en-US" prefix="og: http://ogp.me/ns#">
<![endif]-->
<!--[if IE 8]>
<html class="ie ie8" lang="en-US" prefix="og: http://ogp.me/ns#">
<![endif]-->
<!--[if !(IE 7) | !(IE 8)  ]><!-->
<html lang="en-US" prefix="og: http://ogp.me/ns#">
 <!--<![endif]-->
 <head>
  <meta charset="utf-8"/>
  <meta content="Data Structures, Algorithms, Python, Java, C, C++, JavaScript, Android Development, SQL, Data Science, Machine Learning, PHP, Web Development, System Design, Tutorial, Technical Blogs, Interview Experience, Interview Preparation, Programming, Competitive Programming, Jobs, Coding Contests, GATE CSE, HTML, CSS, React, NodeJS, Placement, Aptitude, Quiz, Computer Science, Programming Examples, GeeksforGeeks Courses, Puzzles, SSC, Banking, UPSC, Commerce, Finance, CBSE, School, k12, General Knowledge, News, Mathematics, Exams" name="keywords"/>
  <meta content="width=device-width, initial-scale=1.0, minimum-scale=0.5, maximum-scale=3

In [None]:
# parsing title element of the page
print(sp.title)
print(sp.title.name)
print(sp.title.string)
print(sp.title.parent.name)

<title>Python Tutorial | Learn Python Programming Language (2024)</title>
title
Python Tutorial | Learn Python Programming Language (2024)
head


In [None]:
# extracting the title of the article
print(sp.find("div", {"class" : "article-title"}).text)


Python Tutorial | Learn Python Programming Language



In [None]:
print(sp.find("div"))

<div class="header-main__wrapper">
<a class="gfg-stc" href="#main" style="top:0">Skip to content</a>
<a aria-label="Logo" class="header-main__logo" href="https://www.geeksforgeeks.org/">
<div class="_logo">
<!-- Original Logo -->
<img alt="geeksforgeeks" class="gfg_logo_img" src="https://media.geeksforgeeks.org/gfg-gg-logo.svg" style="height: 30px; width: 80px; max-width: fit-content;"/>
</div>
</a>
<div class="header-main__container">
<!-- for mobile only -->
<!-- For Web view only -->
<ul class="header-main__list"><li aria-expanded="true" class="header-main__list-item Header_1" data-expandable="true" data-parent="false"><span>Tutorials</span><i class="gfg-icon gfg-icon_arrow-down gfg-icon_header"></i><ul class="mega-dropdown Screen_1"><li aria-expanded="true" class="mega-dropdown__list-item" data-expandable="true" data-parent="false"><span>Python Tutorial</span><i class="gfg-icon gfg-icon_arrow-right"></i><ul class="mega-dropdown Screen_2"><li aria-expanded="false" class="mega-dropdo

In [None]:
# extracting the date of the article
# `string=` replaces the deprecated `text=` argument, which triggers a DeprecationWarning
date_label = sp.find('span', class_='strong', string='Last Updated : ')
date_span = date_label.find_next('span')
last_updated_date = date_span.text
print(f"Last Updated Date: {last_updated_date}")


Last Updated Date: 10 Sep, 2024


In [None]:
# extracting the content of the article
# it extracts everything together; the next sections show how to extract the information paragraph by paragraph
print(sp.find("div", {"class" : "text"}).text.strip().replace("\n", " "))

Learning Python is a great choice because it’s beginner-friendly, making it easy for those new to programming to get started. Its versatility allows it to be used in various fields such as web development, data science, and automation. Additionally, Python skills are highly sought after by employers in the tech industry. In this tutorial, you’ll start with the basics, including installing Python and setting up your environment. then move on to understanding syntax, variables, and data types. As you progress, you will learn about control flow using loops and conditional statements, and how to work with data structures like lists and dictionaries. Finally, you’ll explore advanced topics such as object-oriented programming, error handling, and popular libraries. Now, whether you’re a beginner looking to write your first Python program or an experienced developer exploring advanced Python features, this Python tutorial is tailored to guide you through every step of your Python journey and 

In [None]:
# extracting the links found in the bottom of the article for further reading
for tag in sp.find("div", {"class": "gfg-similar-reads-list"}).find_all("a", href=True):
    print(tag.text, "\n", tag["href"], "\n")



Natural Language Processing(NLP) VS  Programming Language
In the world of computers, there are mainly two kinds of languages: Natural Language Processing (NLP) and Programming Languages. NLP is all about understanding human language while programming languages help us to tell computers what to do. But as technology grows, these two areas are starting to overlap in cool ways, changing how we interact with



4 min read

 
 https://www.geeksforgeeks.org/natural-language-processingnlp-vs-programming-language/?ref=oin_asr1 



Python - Fastest Growing Programming Language
There was a time when the word "Python" meant a particularly large snake but now it is a programming language that is all the rage!!! According to the TIOBE Index, Python is the fourth most popular programming language in the world currently and this extraordinary growth is only set to increase as observed by Stack Overflow Trends. So the question



5 min read

 
 https://www.geeksforgeeks.org/python-fastest-growing-pr

In [None]:
# extracting the links found in the bottom of the article for further reading
for link in sp.find('div', class_='gfg-similar-reads-list').find_all('a'):
    print(link.get('href'))

https://www.geeksforgeeks.org/natural-language-processingnlp-vs-programming-language/?ref=oin_asr1
https://www.geeksforgeeks.org/python-fastest-growing-programming-language/?ref=oin_asr2
https://www.geeksforgeeks.org/difference-between-python-and-lua-programming-language/?ref=oin_asr3
https://www.geeksforgeeks.org/python-program-to-find-gsoc-organisations-that-use-a-particular-programming-language/?ref=oin_asr4
https://www.geeksforgeeks.org/how-to-create-a-programming-language-using-python/?ref=oin_asr5
https://www.geeksforgeeks.org/difference-between-go-and-python-programming-language/?ref=oin_asr6
https://www.geeksforgeeks.org/popular-programming-languages/?ref=oin_asr7
https://www.geeksforgeeks.org/facts-about-cython-programming-language/?ref=oin_asr8
https://www.geeksforgeeks.org/tips-and-tricks-for-competitive-programmers-set-2-which-language-should-be-used-for-competitive-programming/?ref=oin_asr9
https://www.geeksforgeeks.org/geeksforgeeks-python-foundation-course-learn-python-i

In [None]:
# extracting information paragraph by paragraph
for tag in sp.find("div", {"class": "text"}).find_all("p"):
    print(tag.text)

Learning Python is a great choice because it’s beginner-friendly, making it easy for those new to programming to get started. Its versatility allows it to be used in various fields such as web development, data science, and automation. Additionally, Python skills are highly sought after by employers in the tech industry.
In this tutorial, you’ll start with the basics, including installing Python and setting up your environment. then move on to understanding syntax, variables, and data types. As you progress, you will learn about control flow using loops and conditional statements, and how to work with data structures like lists and dictionaries. Finally, you’ll explore advanced topics such as object-oriented programming, error handling, and popular libraries.
Now, whether you’re a beginner looking to write your first Python program or an experienced developer exploring advanced Python features, this Python tutorial is tailored to guide you through every step of your Python journey and 

Almost all GeeksforGeeks articles share the same format, so all of them can be scraped with the same code. This repeatability is useful in web scraping: one block of code can collect a lot of information in a structured way.

Other information can be extracted with methods similar to those used for GeeksforGeeks; this is left as homework.

As with GeeksforGeeks, all articles of ScrapeThisSite are similar in structure, so the same code can be used to scrape all of them.

Now working with a dynamic website that has its API open.

Open the CNBC website and search for any topic. If we search for SPORTS the URL looks like this: https://www.cnbc.com/search/?query=SPORTS&qsearchterm=SPORTS and if we search for POLITICS the URL looks like this: https://www.cnbc.com/search/?query=POLITICS&qsearchterm=POLITICS

A pattern is visible here: we can predict what the URL will look like for any other search term, which makes the code reusable and repeatable.
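The URL pattern can be captured in a small helper. This is a sketch built only from the two example URLs in this section; the function name `build_search_url` is made up, and how CNBC treats multi-word terms (here encoded with `+` by `urlencode`) is an assumption that has not been checked against the site.

```python
from urllib.parse import urlencode

SEARCH_BASE = "https://www.cnbc.com/search/"


def build_search_url(topic):
    """Build a CNBC search URL for a topic, following the observed pattern."""
    # The same term is sent in both the `query` and `qsearchterm` parameters.
    params = {"query": topic, "qsearchterm": topic}
    return SEARCH_BASE + "?" + urlencode(params)


for topic in ["SPORTS", "POLITICS"]:
    print(build_search_url(topic))
```

For `SPORTS` and `POLITICS` this reproduces the two URLs quoted in this section.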

The search reports more than 50,000 results for the topic politics, but only 10 are loaded at first. Scrolling down loads the next 10, and so on. To load each new batch, the page calls an API whose URL can be found in the Network tab of the browser's developer tools (Inspect Element): https://api.queryly.com/cnbc/json.aspx?queryly_key=31a35d40a9a64ab3&query=POLITICS&endindex=10&batchsize=10&callback=&showfaceted=false&timezoneoffset=420&facetedfields=formats&facetedkey=formats%7C&facetedvalue=!Press%20Release%7C&additionalindexes=4cd6f71fbf22424d,937d600b0d0d4e23,3bfbe40caee7443e,626fdfcd96444f28

Changing the endindex and batchsize parameters returns different slices of the results.
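One way to step through the results is to generate a URL per batch. The sketch below keeps every fixed parameter exactly as in the request captured above and fills in only `query`, `endindex` and `batchsize`. It assumes that `endindex` is the offset of the results already loaded, so it advances by `batchsize` each time; that reading comes from the captured request and is not confirmed by any documentation. The name `batch_urls` is hypothetical.

```python
from urllib.parse import quote

# Endpoint and fixed parameters copied from the captured request;
# only query, endindex and batchsize are placeholders.
API_TEMPLATE = (
    "https://api.queryly.com/cnbc/json.aspx?queryly_key=31a35d40a9a64ab3"
    "&query={query}&endindex={endindex}&batchsize={batchsize}"
    "&callback=&showfaceted=false&timezoneoffset=420&facetedfields=formats"
    "&facetedkey=formats%7C&facetedvalue=!Press%20Release%7C"
    "&additionalindexes=4cd6f71fbf22424d,937d600b0d0d4e23,3bfbe40caee7443e,626fdfcd96444f28"
)


def batch_urls(query, num_batches, batchsize=10):
    """Return one API URL per batch, advancing endindex by batchsize each time."""
    urls = []
    for batch in range(num_batches):
        urls.append(
            API_TEMPLATE.format(
                query=quote(query),
                endindex=batch * batchsize,  # assumed: offset of results already loaded
                batchsize=batchsize,
            )
        )
    return urls


for url in batch_urls("POLITICS", 3):
    print(url)
```

Each URL can then be passed to `requests.get`, ideally with a short pause between calls.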

The response of this API call is JSON, so when an open API is available there is no need to parse a web page.

In [None]:
# getting the third url
url = "https://api.queryly.com/cnbc/json.aspx?queryly_key=31a35d40a9a64ab3&query=POLITICS&endindex=10&batchsize=10&callback=&showfaceted=false&timezoneoffset=420&facetedfields=formats&facetedkey=formats%7C&facetedvalue=!Press%20Release%7C&additionalindexes=4cd6f71fbf22424d,937d600b0d0d4e23,3bfbe40caee7443e,626fdfcd96444f28"

In [None]:
# hitting the url and getting the response
res = requests.get(url)
print(res.status_code)

200


In [None]:
res.text

'{"metadata":{"q":"politics","totalresults":58922,"pagesize":10,"totalpage":5893,"pagerequested":2,"corrections":[],"stems":["politics"],"suggestions":["politics"],"facetsuggestions":[],"related":[],"resultgenerationtime":"15.6378 ms"},"results":[{"description":"Speaking to CNBC\'s Tania Bryer at the Cannes Lions International Festival of Creativity, Richard Edelman, CEO of the communications firm Edelman, discussed his global consultancy\'s trust barometer sur","cn:lastPubDate":"2024-07-12T09:59:15+0000","dateModified":"2024-07-12T09:59:15+0000","cn:dateline":"","cn:branding":"cnbc","section":"","cn:type":"cnbcvideo","author":"Tania Bryer","cn:source":[],"cn:subtype":"clips","duration":"166","summary":"Speaking to CNBC\'s Tania Bryer at the Cannes Lions International Festival of Creativity, Richard Edelman, CEO of the communications firm Edelman, discussed his global consultancy\'s trust barometer survey and the challenges facing brands in 2024.","expires":"","cn:sectionSubType":[],"c

In [None]:
# getting the description of each news article
for item in json.loads(res.text)["results"]:
    print(item["description"], "\n")

Speaking to CNBC's Tania Bryer at the Cannes Lions International Festival of Creativity, Richard Edelman, CEO of the communications firm Edelman, discussed his global consultancy's trust barometer sur 

Hosted by Brian Sullivan, “Last Call” is a fast-paced, entertaining business show that explores the intersection of money, culture and policy. Tune in Monday through Friday at 7 p.m. ET on CNBC. 

Hosted by Brian Sullivan, “Last Call” is a fast-paced, entertaining business show that explores the intersection of money, culture and policy. Tune in Monday through Friday at 7 p.m. ET on CNBC. 

Helima Croft, RBC Capital Markets global head of commodity strategy, joins 'Closing Bell: Overtime' to discuss oil at 2-month highs, the energy outlook, and hurricane impact. 

TOKYO — U.S. opponents of a Japanese steelmaker's $14.9 billion bid for U.S. Steel cite concerns about national security and a reluctance to relinquish a storied American company. In Japan, the busine 

Hosted by Brian Sulliva
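The parsed response carries more than the description. The sketch below pulls a few fields into a list of dictionaries. The sample payload is a trimmed stand-in: its field names follow the response printed earlier in this section, but most fields and all but one result are omitted, and the description is shortened.

```python
import json

# Trimmed stand-in for the API response (not the full payload).
sample_response = """
{
  "metadata": {"q": "politics", "totalresults": 58922, "pagesize": 10},
  "results": [
    {
      "description": "Speaking to CNBC's Tania Bryer at the Cannes Lions International Festival of Creativity",
      "cn:lastPubDate": "2024-07-12T09:59:15+0000",
      "author": "Tania Bryer"
    }
  ]
}
"""

payload = json.loads(sample_response)
total = payload["metadata"]["totalresults"]

# Keep only the fields of interest; .get() avoids a KeyError if one is missing.
articles = [
    {
        "description": item.get("description", ""),
        "published": item.get("cn:lastPubDate", ""),
        "author": item.get("author", ""),
    }
    for item in payload["results"]
]

print(total)
print(articles[0]["author"], articles[0]["published"])
```

With a live response, `res.json()` from requests returns the same kind of dictionary as `json.loads(res.text)`.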

Finally working with Selenium

In [None]:
!echo | sudo add-apt-repository ppa:saiarcot895/chromium-beta
!sudo apt remove chromium-browser
!sudo snap remove chromium
!sudo apt install chromium-browser -qq
# Chromium (an open-source version of Chrome) and Chromium WebDriver (which allows Selenium to control Chromium).


PPA publishes dbgsym, you may need to include 'main/debug' component
Repository: 'deb https://ppa.launchpadcontent.net/saiarcot895/chromium-beta/ubuntu/ jammy main'
Description:
This PPA contains the latest Chromium Beta builds, with hardware video decoding enabled (hidden behind a flag), and support for Widevine (needed for viewing many DRM-protected videos) enabled.

== Hardware Video Decoding ==

To enable hardware video decoding, start Chromium with the --enable-features=VaapiVideoDecoder argument. To make this persistent, create a file at /etc/chromium-browser/customizations/92-vaapi-hardware-decoding with the following contents:

CHROMIUM_FLAGS="${CHROMIUM_FLAGS} --enable-features=VaapiVideoDecoder"

See also https://wiki.archlinux.org/title/Chromium#Hardware_video_acceleration for more information on VAAPI video decoding support.

=== Widevine Support ===

The packages in this PPA have support for Widevine inside Chromium enabled. However, you still need to copy some files from 

In [None]:
!pip3 install selenium --quiet
!apt-get update
!apt install chromium-chromedriver -qq
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
#Selenium requires a browser driver (in this case, chromedriver) to communicate with the browser. You're installing it using the chromium-chromedriver package and copying it to /usr/bin for easy access.

Hit:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:3 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:4 http://security.ubuntu.com/ubuntu jammy-security InRelease

In [None]:
!pip install --upgrade selenium



In [None]:
!pip install selenium
!apt-get update
!apt-get install -y chromium-chromedriver

Hit:1 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:2 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:3 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:4 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Ign:5 https://r2u.stat.illinois.edu/ubuntu jammy InRelease
Hit:6 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:7 https://r2u.stat.illinois.edu/ubuntu jammy Release
Hit:8 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:9 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:10 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:11 https://ppa.launchpadcontent.net/saiarcot895/chromium-beta/ubuntu jammy InRelease
Hit:12 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Reading package lists... Done
W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu

## Task 2:

Salaries of players

In [None]:
import sys
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.support.ui import WebDriverWait

# on a local machine, download the chromedriver executable and use its path below
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')
chrome_options = webdriver.ChromeOptions()
# with --headless, Chrome runs without opening a visible browser window
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
# the ChromeService class sets up the path to the chromedriver
chrome_service = ChromeService(
    executable_path='/usr/lib/chromium-browser/chromedriver',
    log_path='/dev/null'  # You can change the log path as needed
)
driver = webdriver.Chrome(service=chrome_service,options=chrome_options)

In [None]:
# this code should open hoopshype website in your newly opened chrome window
driver.get('https://hoopshype.com/salaries/players/')
# After setting up the Selenium WebDriver, the browser (in headless mode) navigates to HoopsHype Salaries page

In [None]:
# getting players name list
# wait = WebDriverWait(driver, 10)
# element = wait.until(driver.find_elements("xpath", '//td[@class="name"]'))
players = driver.find_elements("xpath", '//td[@class="name"]')
#uses an XPath selector to find all the HTML elements on the page that have a class attribute "name", which corresponds to the player names.

In [None]:
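players_list = []
try:
  for player in players:
    players_list.append(player.text)
except Exception:
  # usually a stale element reference: the page changed after the elements were located
  print("Stale Element Error")
print(len(set(players_list)))
# note: a set removes duplicate names and does not keep the table order
players_list = set(players_list)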

600
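Converting the names to a `set` removes duplicates but also discards the row order, which matters if names are later paired with salaries by position. A minimal sketch of an order-preserving alternative follows; the names in `scraped_names` are invented for the illustration and are not taken from the site.

```python
# Hypothetical scraped names, including a repeated header cell and a duplicate.
scraped_names = ["Player", "Alice Adams", "Bob Brown", "Player", "Bob Brown", "Cara Cole"]

# dict.fromkeys keeps the first occurrence of each key in insertion order,
# whereas set() makes no promise about order.
unique_in_order = list(dict.fromkeys(scraped_names))
print(unique_in_order)
```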


In [None]:
salaries = []
# walk the salary table row by row; the third cell (td[3]) of each row holds the salary
for i in range(len(players_list)-2):
  salary = driver.find_elements("xpath", '//*[@id="content-container"]/div/div[3]/div[2]/div[2]/div[1]/table/tbody/tr[{}]/td[3]'.format(i+1))
  salaries.append(salary[0].text)
salaries

['$55,761,217',
 '$51,415,938',
 '$51,415,938',
 '$51,179,020',
 '$50,203,930',
 '$49,205,800',
 '$49,205,800',
 '$49,205,800',
 '$49,205,800',
 '$49,205,800',
 '$48,798,677',
 '$48,787,676',
 '$48,787,676',
 '$48,728,845',
 '$43,827,586',
 '$43,219,440',
 '$43,031,940',
 '$43,031,940',
 '$43,031,940',
 '$42,846,615',
 '$42,176,400',
 '$42,176,400',
 '$42,176,400',
 '$42,176,400',
 '$41,000,000',
 '$40,500,000',
 '$40,338,144',
 '$36,725,670',
 '$36,725,670',
 '$36,725,670',
 '$36,637,932',
 '$36,016,200',
 '$36,016,200',
 '$35,859,950',
 '$35,859,950',
 '$35,410,310',
 '$35,147,000',
 '$35,147,000',
 '$34,848,340',
 '$34,848,340',
 '$34,848,340',
 '$34,005,250',
 '$34,005,126',
 '$33,653,846',
 '$33,333,333',
 '$32,500,000',
 '$31,666,667',
 '$30,000,000',
 '$30,000,000',
 '$29,793,104',
 '$29,651,786',
 '$29,347,826',
 '$29,283,801',
 '$29,268,293',
 '$29,000,000',
 '$28,939,680',
 '$27,556,817',
 '$27,173,913',
 '$26,580,000',
 '$26,276,786',
 '$25,892,857',
 '$25,794,643',
 '$25,36
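The salaries come back as display strings such as `'$55,761,217'`, which cannot be sorted or summed as numbers. The sketch below converts them to integers; the helper name `parse_salary` is made up for this example, and the sample values are the first distinct entries of the output above.

```python
def parse_salary(text):
    """Turn a string such as '$55,761,217' into the integer 55761217."""
    # Remove the currency symbol and thousands separators before converting.
    return int(text.replace("$", "").replace(",", "").strip())


# Sample values copied from the output above.
salary_strings = ["$55,761,217", "$51,415,938", "$51,179,020"]
salary_values = [parse_salary(s) for s in salary_strings]
print(salary_values)
```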

## Task 3

### Website 1

In [None]:
# Apart from that choose any 2 websites of your choice and extract meaningful and structured information from there.
import sys
from selenium.webdriver.chrome.service import Service as ChromeService
# on a local machine, download the chromedriver executable and use its path below
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')
chrome_options = webdriver.ChromeOptions()
# with --headless, Chrome runs without opening a visible browser window
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
# the ChromeService class sets up the path to the chromedriver
chrome_service = ChromeService(
    executable_path='/usr/lib/chromium-browser/chromedriver',
    log_path='/dev/null'  # You can change the log path as needed
)
driver = webdriver.Chrome(service=chrome_service,options=chrome_options)

In [None]:
# load the Learn PyTorch site in the headless browser
driver.get('https://www.learnpytorch.io/')

In [None]:
# getting the main content block of the page
main_content = driver.find_elements("xpath", '/html/body/div[3]/main/div/div[3]')
# uses an absolute XPath to select the container that holds the page's main text

In [None]:
main_content[0].text

'Learn PyTorch for Deep Learning: Zero to Mastery book\nWelcome to the second best place on the internet to learn PyTorch (the first being the PyTorch documentation).\nThis is the online book version of the Learn PyTorch for Deep Learning: Zero to Mastery course.\nThis course will teach you the foundations of machine learning and deep learning with PyTorch (a machine learning framework written in Python).\nThe course is video based. However, the videos are based on the contents of this online book.\nFor full code and resources see the course GitHub.\nOtherwise, you can find more about the course below.\nDoes this course cover PyTorch 2.0?\nYes. PyTorch 2.0 is an additive release to previous versions of PyTorch.\nThis means it adds new features on top of the existing baseline features of PyTorch.\nThis course focuses on the baseline features of PyTorch (e.g. you\'re a beginner wanting to get into deep learning/AI).\nOnce you know the fundamentals of PyTorch, PyTorch 2.0 is a quick upgra

In [None]:
# getting all list items from the unordered list
list_of_topics = driver.find_elements("xpath", '/html/body/div[3]/main/div/div[1]/div/div/nav/ul/li/a/span')

# fetch each item's text via the 'innerText' attribute, skipping empty entries
for topic in list_of_topics:
    text = topic.get_attribute('innerText').strip()
    if text:
        print(text)


Home
00. PyTorch Fundamentals
01. PyTorch Workflow Fundamentals
02. PyTorch Neural Network Classification
03. PyTorch Computer Vision
04. PyTorch Custom Datasets
05. PyTorch Going Modular
06. PyTorch Transfer Learning
07. PyTorch Experiment Tracking
08. PyTorch Paper Replicating
09. PyTorch Model Deployment
A Quick PyTorch 2.0 Tutorial
PyTorch Extra Resources
PyTorch Cheatsheet
The Three Most Common Errors in PyTorch
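Since the goal is structured information, the printed titles can be split into a chapter number and a title. The sketch below is one possible way to do that with a regular expression; the `to_record` helper and the record layout are illustrative choices, and the sample list holds a few of the titles printed above.

```python
import re

# A few titles copied from the output printed above.
topics = [
    "Home",
    "00. PyTorch Fundamentals",
    "01. PyTorch Workflow Fundamentals",
    "A Quick PyTorch 2.0 Tutorial",
]

# Two digits, a dot, whitespace, then the title itself.
CHAPTER_PATTERN = re.compile(r"^(\d{2})\.\s+(.*)$")

def to_record(title):
    """Split '00. PyTorch Fundamentals' into a chapter number and a title."""
    match = CHAPTER_PATTERN.match(title)
    if match:
        return {"chapter": int(match.group(1)), "title": match.group(2)}
    # Entries without a leading number (e.g. 'Home') get chapter=None.
    return {"chapter": None, "title": title}

records = [to_record(t) for t in topics]
for record in records:
    print(record)
```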


Other information, such as players' salaries on HoopsHype, can be extracted in the same way; this is left as homework.
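Scraped salary cells arrive as text, so they usually need converting to numbers before analysis. The sketch below assumes the cells look like `$1,234,567`; that format, the `parse_salary` helper, and the sample values are placeholders for illustration, not data taken from the site.

```python
def parse_salary(text):
    """Convert a string such as '$1,234,567' to the integer 1234567."""
    cleaned = text.strip().replace("$", "").replace(",", "")
    # Anything that is not purely digits after cleaning is reported as None.
    return int(cleaned) if cleaned.isdigit() else None

# Placeholder values, not real salary figures.
raw_cells = ["$1,234,567", " $890,000 ", "N/A"]
salaries = [parse_salary(cell) for cell in raw_cells]
print(salaries)
```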

### Website 2

In [None]:
# Homework item 3 (choose any 2 websites): second website
import sys
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService

# chromedriver was placed at this location by the install step at the top of the notebook
sys.path.insert(0, '/usr/lib/chromium-browser/chromedriver')
# run Chrome headless: no window opens, which is required on Colab
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_service = ChromeService(
    executable_path='/usr/lib/chromium-browser/chromedriver',
    log_path='/dev/null'  # You can change the log path as needed
)
driver = webdriver.Chrome(service=chrome_service, options=chrome_options)
# ChromeService points Selenium at the chromedriver executable

In [None]:
# load the Wikipedia main page in the headless Chrome session
driver.get('https://en.wikipedia.org/wiki/Main_Page')
# no browser window appears: the driver runs headless and fetches the page in the background

In [None]:
# getting the text of "Today's featured article" (the element with id "mp-tfa")
main_content = driver.find_elements("xpath", '//*[@id="mp-tfa"]/p')
# the relative XPath matches every <p> directly under the element whose id is "mp-tfa"
main_content[0].text

"An attempted coup took place on September 13, 1964, in South Vietnam against the ruling military junta, led by Nguyễn Khánh (pictured). In the preceding month, Khánh had tried to improve his leadership by declaring a state of emergency, provoking protests and riots. He made concessions to the protesters and removed military officials linked to former President Ngo Dinh Diem, including Lâm Văn Phát and Dương Văn Đức. They responded with a coup, broadcasting their promise to revive Diem's policies. Khánh evaded capture and rallied allies while the U.S. continued their support for his rule. Khánh forced Phát and Đức to capitulate the next morning and various coup leaders appeared at a media conference where they denied that a coup had taken place. To maintain power, Khánh tried to court support from Buddhist activists, who supported negotiations to end the Vietnam War. As the Americans were strongly opposed to such policies, relations with Khánh became strained. (Full article...)"

## Task 4

In [None]:
# asyncio ships with Python 3, so it does not need to be installed from PyPI
!pip install scrapy
!pip install pyppeteer



In [None]:
import scrapy
from scrapy.crawler import CrawlerProcess

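class BasicSpider(scrapy.Spider):
    name = "basic"
    start_urls = ['https://www.learnpytorch.io/']

    def parse(self, response):
        # yield one item per <h1> text node on the page
        for title in response.css('h1::text'):
            data = {'title': title.get()}
            print(data)
            yield data

        # follow the "next" link, if the page has one, and parse it the same way
        next_page = response.css('a.next::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

# write every yielded item to output.json
process = CrawlerProcess(settings={
    "FEEDS": {
        "output.json": {"format": "json"},  # Save data to a JSON file
    },
})

process.crawl(BasicSpider)
# start() blocks until the crawl finishes; the Twisted reactor cannot be restarted,
# so restart the notebook runtime before running this cell a second time
process.start()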

2024-09-13 01:29:35 [scrapy.utils.log] INFO: Scrapy 2.11.2 started (bot: scrapybot)
2024-09-13 01:29:35 [scrapy.utils.log] INFO: Versions: lxml 4.9.4.0, libxml2 2.10.3, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.7.0, Python 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0], pyOpenSSL 24.2.1 (OpenSSL 3.3.2 3 Sep 2024), cryptography 43.0.1, Platform Linux-6.1.85+-x86_64-with-glibc2.35
2024-09-13 01:29:35 [scrapy.addons] INFO: Enabled addons:
[]


See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
  return cls(crawler

{'title': 'Learn PyTorch for Deep Learning: Zero to Mastery book'}
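The spider also writes its items to `output.json` through the `FEEDS` setting, as a JSON list of item dictionaries. The sketch below shows one way to read such a feed back and keep each title once; `load_titles` is a hypothetical helper, and a temporary stand-in file is created so the example runs on its own.

```python
import json
import tempfile
from pathlib import Path

def load_titles(feed_path):
    """Read a JSON feed (a list of item dicts) and return unique titles in order."""
    items = json.loads(Path(feed_path).read_text(encoding="utf-8"))
    seen = set()
    titles = []
    for item in items:
        title = item.get("title", "").strip()
        if title and title not in seen:
            seen.add(title)
            titles.append(title)
    return titles

# Stand-in for output.json; the duplicate entry is deliberate.
sample_items = [
    {"title": "Learn PyTorch for Deep Learning: Zero to Mastery book"},
    {"title": "Learn PyTorch for Deep Learning: Zero to Mastery book"},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    feed_path = Path(tmp_dir) / "output.json"
    feed_path.write_text(json.dumps(sample_items), encoding="utf-8")
    titles = load_titles(feed_path)

print(titles)
```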


# Part 2: PPT Scraping

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
!pip install python-pptx



The python-pptx library is typically used to generate presentations, for example from database content, but its reading features can also be used to extract text from an existing PPT. The example below is very basic and can be extended as needed. Documentation for the library: https://python-pptx.readthedocs.io/en/latest/

In [None]:
# importing library
from pptx import Presentation

In [None]:
# extracting text slide by slide and shape by shape
# open the PPT in parallel to check the outcome
prs = Presentation("/content/drive/MyDrive/Natural Language Processing/HW2/Your big idea.pptx")
for counter_slide, slide in enumerate(prs.slides, start=1):
    print("slide:", counter_slide, "\n")
    counter_content = 1
    for shape in slide.shapes:
        # pictures, tables and other shapes without a text frame have no text to print
        if not shape.has_text_frame:
            continue
        print("content:", counter_content, shape.text, "\n")
        counter_content += 1
    print("\n\n")

slide: 1 

content: 1 Making Presentations That Stick 

content: 2 A guide by Chip Heath & Dan Heath 




slide: 2 

content: 1 Selling your idea 

content: 2 Created in partnership with Chip and Dan Heath, authors of the bestselling book Made To Stick, this template advises users on how to build and deliver a memorable presentation of a new product, service, or idea. 




slide: 3 

content: 1 1. Intro 

content: 2 Choose one approach to grab the audience’s attention right from the start: unexpected, emotional, or simple.
UnexpectedHighlight what’s new, unusual, or surprising.
EmotionalGive people a reason to care.
SimpleProvide a simple unifying message for what is to come 




slide: 4 

content: 1 How many languages do you need to know to communicate with the rest of the world? 




slide: 5 

content: 1 Just one! Your own.
(With a little help from your smart phone) 




slide: 6 

content: 1 The Google Translate app can repeat anything you say in up to NINETY LANGUAGES from G

Other components of the PPT can be extracted in a similar way; see the python-pptx documentation for the relevant shape types.
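In the slide 3 output above, labels and their descriptions run together ("UnexpectedHighlight ..."). A likely cause, stated here as an assumption rather than something verified against this file, is that python-pptx returns a line break inside a paragraph as a vertical-tab character (`"\v"`), which is invisible when printed. The sketch below replaces it with a visible separator; `normalize_shape_text` is a hypothetical helper and the sample string is hand-built to imitate that slide.

```python
def normalize_shape_text(text, separator=": "):
    """Replace vertical tabs (soft line breaks) so joined words stay readable."""
    lines = []
    for line in text.split("\n"):
        # "\v" is assumed to mark a soft line break inside one paragraph.
        parts = [part.strip() for part in line.split("\v") if part.strip()]
        lines.append(separator.join(parts))
    return "\n".join(lines)

# Hand-built sample imitating slide 3; the "\v" placement is an assumption.
sample = (
    "Unexpected\vHighlight what’s new, unusual, or surprising.\n"
    "Emotional\vGive people a reason to care."
)
cleaned = normalize_shape_text(sample)
print(cleaned)
```

In the extraction loop this would be applied as `normalize_shape_text(shape.text)` before printing.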

## Task 6

Extracting a table from a PPT using python-pptx

In [None]:
# extracting text and tables slide by slide
# open the PPT in parallel to check the outcome
prs = Presentation("/content/drive/MyDrive/Natural Language Processing/HW2/Scrape This.pptx")
for counter_slide, slide in enumerate(prs.slides, start=1):
    print("slide:", counter_slide, "\n")
    counter_content = 1
    for shape in slide.shapes:
        # titles, text boxes and other shapes that carry a text frame
        if shape.has_text_frame:
            print("content:", counter_content, shape.text, "\n")
            counter_content += 1

        # table shapes: read the cells row by row
        if shape.has_table:
            for row_number, row in enumerate(shape.table.rows, start=1):
                row_data = [cell.text for cell in row.cells]
                print(f"Row {row_number}: {row_data}")

    print()

slide: 1 

content: 1 Scrape This 

content: 2  


slide: 2 

Row 1: ['Team Name', 'Year', 'Wins', 'Losses', 'OT Losses', 'Win %', 'Goals For (GF)', 'Goals Against (GA)', '+ / -']
Row 2: ['Boston Bruins', '1990', '44', '24', '', '0.55', '299', '264', '35']
Row 3: ['Buffalo Sabres', '1990', '31', '30', '', '0.388', '292', '278', '14']
Row 4: ['Calgary Flames', '1990', '46', '26', '', '0.575', '344', '263', '81']
Row 5: ['Chicago Blackhawks', '1990', '49', '23', '', '0.613', '284', '211', '73']
Row 6: ['Detroit Red Wings', '1990', '34', '38', '', '0.425', '273', '298', '-25']
Row 7: ['Edmonton Oilers', '1990', '37', '37', '', '0.463', '272', '272', '0']



# Part 3: PDF Scraping


This part uses the PyPDF2 library: https://pypi.org/project/PyPDF2/
Here it is used only to extract text; tables and images require other methods.
Extracting text from PDFs is much harder than from web pages or PPTs because a PDF has no inherent structure that can be queried element by element. A PDF stores characters positioned on a page rather than paragraphs or sections, so the library has to reconstruct the reading order from those positions. Scanned PDFs contain only page images, and extracting their text requires optical character recognition (OCR).

In [None]:
# installing the library
!pip install PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


In [None]:
# importing the library
import PyPDF2

In [None]:
# reading the file
pdfFileObj = open('/content/drive/MyDrive/Natural Language Processing/HW2/Evaluation_of_Sentiment_Analysis_in_Finance_From_Lexicons_to_Transformers.pdf', 'rb')


In [None]:
# passing the file to PyPDF
pdfReader = PyPDF2.PdfReader(pdfFileObj)

In [None]:
# getting the number of pages
print(len(pdfReader.pages))

21


In [None]:
# getting the second page (pages are zero-indexed, so index 1 is page 2)
pageObj = pdfReader.pages[1]

In [None]:
# extracting text from that page
print(pageObj.extract_text())

K. Mishev et al.: Evaluation of Sentiment Analysis in Finance: From Lexicons to Transformers
decisions. The sentiments expressed in news and tweets inu-
ence stock prices and brand reputation, hence, constant mea-
surement and tracking of these sentiments is becoming one of
the most important activities for investors. Studies have used
sentiment analysis based on nancial news to forecast stock
prices [6][8], foreign exchange and global nancial market
trends [9], [10] as well as to predict corporate earnings [11].
Given that the nancial sector uses its own jargon, it is
not suitable to apply generic sentiment analysis in nance
because many of the words differ from their general meaning.
For example, ``liability'' is generally a negative word, but
in the nancial domain it has a neutral meaning. The term
``share'' usually has a positive meaning, but in the nancial
domain, share represents a nancial asset or a stock, which
is a neutral word. Furthermore, ``bull'' is neutral in gen
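The extracted text keeps the line wrapping of the PDF column, including words hyphenated across lines ("mea-" / "surement"). The sketch below shows one simple way to undo that wrapping with regular expressions; `join_wrapped_lines` is a hypothetical helper. It is a heuristic: a genuine hyphenated compound that happens to break at the end of a line would be merged too, and it does not restore the characters missing from words such as "nancial".

```python
import re

def join_wrapped_lines(text):
    """Merge words hyphenated across line breaks, then join the lines with spaces."""
    # A hyphen directly before a newline, followed by a lowercase letter, is treated as a wrap.
    text = re.sub(r"(\w)-\n(?=[a-z])", r"\1", text)
    # Remaining newlines inside the paragraph become single spaces.
    return re.sub(r"\s*\n\s*", " ", text).strip()

# Lines copied from the extracted page above.
sample = (
    "ence stock prices and brand reputation, hence, constant mea-\n"
    "surement and tracking of these sentiments is becoming one of\n"
    "the most important activities for investors."
)
joined = join_wrapped_lines(sample)
print(joined)
```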

# Homework

1. As in the GeeksforGeeks demo above, extract information from the ScrapeThisSite pages beyond what has already been demonstrated.

2. Following the example that extracts player names, extract each player's salary from the HoopsHype website.

3. Choose any 2 websites of your choice and extract meaningful, structured information from them.

4. Explore the Scrapy library for web scraping, in addition to the three tools covered in the demo.

5. Pick a website that has tabular data and scrape it using the tools covered in the demo.

(The datasets you collect for the projects will come from text extraction, so make sure to extract usable, structured information.)

6. Extract a table from a PPT using the python-pptx library.
