# Data Handling and Web Scraping

## Data Handling - tutorial
--Prof. Dorien Herremans

This lab consists of two parts: some basic unix commands and web parsing.

We are going to explore some commands that will be extremely useful when dealing with large data files on your computer. If you want to learn more about this, this cheat sheet is a good start: https://www.guru99.com/linux-commands-cheat-sheet.html

These are all Unix commands, and hence will be extremely useful in your bash terminal. However, given the popularity of Google Colab for data science, we illustrate them here in Colab. You can execute this lab in playground mode: File - Run in Playground mode.

**To run Unix commands in Colab, you should precede each statement with an exclamation point. The commands then run in your Colab session, and you can browse the resulting files in the panel on the left.**

Let's start by downloading a small csv dataset using the `wget` command.

In [1]:
!wget http://dorienherremans.com/sites/default/files/parsing/data.csv

--2024-02-08 04:36:47--  http://dorienherremans.com/sites/default/files/parsing/data.csv
Resolving dorienherremans.com (dorienherremans.com)... 184.154.70.198
Connecting to dorienherremans.com (dorienherremans.com)|184.154.70.198|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 949388 (927K) [text/csv]
Saving to: ‘data.csv’


2024-02-08 04:36:47 (11.1 MB/s) - ‘data.csv’ saved [949388/949388]



You can now see the file on the left sidebar under files. You can also visualise the contents using the `cat` command.

In [2]:
!cat data.csv

year,industry_code_ANZSIC,industry_name_ANZSIC,rme_size_grp,variable,value,unit
2011,A,"Agriculture, Forestry and Fishing",a_0,Activity unit,46134,COUNT
2011,A,"Agriculture, Forestry and Fishing",a_0,Rolling mean employees,0,COUNT
2011,A,"Agriculture, Forestry and Fishing",a_0,Salaries and wages paid,279,DOLLARS(millions)
2011,A,"Agriculture, Forestry and Fishing",a_0,"Sales, government funding, grants and subsidies",8187,DOLLARS(millions)
2011,A,"Agriculture, Forestry and Fishing",a_0,Total income,8866,DOLLARS(millions)
2011,A,"Agriculture, Forestry and Fishing",a_0,Total expenditure,7618,DOLLARS(millions)
2011,A,"Agriculture, Forestry and Fishing",a_0,Operating profit before tax,770,DOLLARS(millions)
2011,A,"Agriculture, Forestry and Fishing",a_0,Total assets,55700,DOLLARS(millions)
2011,A,"Agriculture, Forestry and Fishing",a_0,Fixed tangible assets,32155,DOLLARS(millions)
2011,A,"Agriculture, Forestry and Fishing",b_1-5,Activity unit,21777,COUNT
2011,A,"Agriculture, Fore

This is a long file. To save time, we can send the output of the `cat` command as input to the `tail` command, which prints only the last `n` lines of its input.

We can 'pipe' the output of one command to the input of another command by using the pipe sign '|'. Try both commands below. Do they have different output?

In [3]:
!cat data.csv | tail -5
# !tail -n 5 data.csv


2017,all,All Industries,j_Grand_Total,Total income,644159,DOLLARS(millions)
2017,all,All Industries,j_Grand_Total,Total expenditure,560665,DOLLARS(millions)
2017,all,All Industries,j_Grand_Total,Operating profit before tax,78054,DOLLARS(millions)
2017,all,All Industries,j_Grand_Total,Total assets,1968716,DOLLARS(millions)
2017,all,All Industries,j_Grand_Total,Fixed tangible assets,458928,DOLLARS(millions)


More concisely, we can also use the `head` or `tail` command directly on the file. The option `-n` limits the output to `n` lines (`-5` is shorthand for `-n 5`).

In [4]:
!head -5 data.csv

year,industry_code_ANZSIC,industry_name_ANZSIC,rme_size_grp,variable,value,unit
2011,A,"Agriculture, Forestry and Fishing",a_0,Activity unit,46134,COUNT
2011,A,"Agriculture, Forestry and Fishing",a_0,Rolling mean employees,0,COUNT
2011,A,"Agriculture, Forestry and Fishing",a_0,Salaries and wages paid,279,DOLLARS(millions)
2011,A,"Agriculture, Forestry and Fishing",a_0,"Sales, government funding, grants and subsidies",8187,DOLLARS(millions)


For counting lines, words, and bytes we can use `wc` (word count). Without options it prints all three numbers, in that order:

In [5]:
!wc data.csv

 10837  60425 949388 data.csv


Adding the option `-l` prints only the number of lines in the file, and the `-w` option prints only the word count.

In [6]:
!wc -l data.csv

10837 data.csv
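
If you prefer to stay in Python, the three numbers reported by `wc` can be approximated as follows. This is only a minimal sketch, not how `wc` is implemented; the helper name `wc_counts` and the sample text are made up for illustration.

```python
def wc_counts(text: str) -> tuple[int, int, int]:
    """Return (lines, words, bytes) for a string, mimicking the columns of `wc`."""
    lines = text.count("\n")              # wc counts newline characters
    words = len(text.split())             # words are whitespace-separated tokens
    n_bytes = len(text.encode("utf-8"))   # size in bytes, not in characters
    return lines, words, n_bytes


# Hypothetical sample data, not taken from data.csv
sample = "year,value\n2011,46134\n2012,Activity unit\n"
print(wc_counts(sample))
```

Note that a "word" here is anything separated by whitespace, so a comma-separated row without spaces counts as a single word.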


A very useful command is `grep`, which performs a search within files. In this case we search for a term in the input file `data.csv`.

If, at any point, you want to explore the options of a certain command, you can always consult its manual with the `man` command. Let's check how `grep` works:

In [7]:
!man grep

This system has been minimized by removing packages and content that are
not required on a system that users do not log into.

To restore this content, including manpages, you can run the 'unminimize'
command. You will still need to ensure the 'man-db' package is installed.


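Colab's minimized system does not ship the manual pages, as the message above shows, but `man grep` works in a regular terminal. Useful for us are the options `-B` and `-A`, which show us `num` lines before or after each line in which the term is found: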

```
-A num, --after-context=num
-B num, --before-context=num
```



In [8]:
!grep -i -B 1 -A 1 'tangible assets' data.csv

2011,A,"Agriculture, Forestry and Fishing",a_0,Total assets,55700,DOLLARS(millions)
2011,A,"Agriculture, Forestry and Fishing",a_0,Fixed tangible assets,32155,DOLLARS(millions)
2011,A,"Agriculture, Forestry and Fishing",b_1-5,Activity unit,21777,COUNT
--
2011,A,"Agriculture, Forestry and Fishing",b_1-5,Total assets,52666,DOLLARS(millions)
2011,A,"Agriculture, Forestry and Fishing",b_1-5,Fixed tangible assets,31235,DOLLARS(millions)
2011,A,"Agriculture, Forestry and Fishing",c_6-9,Activity unit,1965,COUNT
--
2011,A,"Agriculture, Forestry and Fishing",c_6-9,Total assets,9323,DOLLARS(millions)
2011,A,"Agriculture, Forestry and Fishing",c_6-9,Fixed tangible assets,5482,DOLLARS(millions)
2011,A,"Agriculture, Forestry and Fishing",d_10-19,Activity unit,1140,COUNT
--
2011,A,"Agriculture, Forestry and Fishing",d_10-19,Total assets,6524,DOLLARS(millions)
2011,A,"Agriculture, Forestry and Fishing",d_10-19,Fixed tangible assets,3649,DOLLARS(millions)
2011,A,"Agriculture, Forestry and F
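
To make the effect of `-i`, `-B` and `-A` concrete, here is a rough Python sketch of the same idea. It is a simplification under stated assumptions: the function name `grep_context` and the sample rows are invented for illustration, it does a plain substring match rather than a regex match, and it does not print the `--` separators that `grep` places between groups.

```python
def grep_context(lines, term, before=1, after=1):
    """Return matching lines plus `before`/`after` lines of context (case-insensitive)."""
    keep = set()
    for i, line in enumerate(lines):
        if term.lower() in line.lower():          # like grep -i
            start = max(0, i - before)            # like -B
            stop = min(len(lines), i + after + 1) # like -A
            keep.update(range(start, stop))
    return [lines[i] for i in sorted(keep)]


# Hypothetical rows in the spirit of data.csv
rows = [
    "2011,A,Total income,8866",
    "2011,A,Total assets,55700",
    "2011,A,Fixed tangible assets,32155",
    "2011,A,Activity unit,21777",
    "2011,A,Rolling mean employees,0",
]
for row in grep_context(rows, "tangible assets"):
    print(row)
```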

If we don't just want to search within a file, but also want to change or replace words, we can use `sed`, which works with regex (regular expressions). Regex is a bit more complex; more background reading here: https://www.tutorialspoint.com/unix/unix-regular-expressions.htm

For example, let's have another look at the data file:

In [9]:
!head -5 data.csv

year,industry_code_ANZSIC,industry_name_ANZSIC,rme_size_grp,variable,value,unit
2011,A,"Agriculture, Forestry and Fishing",a_0,Activity unit,46134,COUNT
2011,A,"Agriculture, Forestry and Fishing",a_0,Rolling mean employees,0,COUNT
2011,A,"Agriculture, Forestry and Fishing",a_0,Salaries and wages paid,279,DOLLARS(millions)
2011,A,"Agriculture, Forestry and Fishing",a_0,"Sales, government funding, grants and subsidies",8187,DOLLARS(millions)


We can replace the term 'Agriculture, Forestry' by 'Computer science' and save this as a new file:

In [10]:
!sed -e 's/Agriculture, Forestry/Computer science/g' data.csv > newdata.csv

Let's check the new file:

In [11]:
!head -5 newdata.csv

year,industry_code_ANZSIC,industry_name_ANZSIC,rme_size_grp,variable,value,unit
2011,A,"Computer science and Fishing",a_0,Activity unit,46134,COUNT
2011,A,"Computer science and Fishing",a_0,Rolling mean employees,0,COUNT
2011,A,"Computer science and Fishing",a_0,Salaries and wages paid,279,DOLLARS(millions)
2011,A,"Computer science and Fishing",a_0,"Sales, government funding, grants and subsidies",8187,DOLLARS(millions)
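
The same global substitution can be sketched in Python with the `re` module. This is an illustrative equivalent, not a drop-in replacement: the helper name `substitute` is made up, and Python's regex syntax differs in some details from the basic regular expressions that `sed` uses by default.

```python
import re


def substitute(lines, pattern, replacement):
    """Apply a regex substitution to every line, similar to sed 's/pattern/replacement/g'."""
    compiled = re.compile(pattern)
    return [compiled.sub(replacement, line) for line in lines]


rows = ['2011,A,"Agriculture, Forestry and Fishing",a_0,Activity unit,46134,COUNT']
print(substitute(rows, r"Agriculture, Forestry", "Computer science")[0])
```

Like the `sed` command above, this leaves the original data untouched and returns new lines, which you could then write to a new file.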


Excellent, the change happened as expected. Now we can sort the file. We can use '|' as a 'pipe' to relay the output of one command as the input to the next command. So here, the output of `head` is fed into the `sort` command.


In [12]:
!head -n 5 data.csv | sort

2011,A,"Agriculture, Forestry and Fishing",a_0,Activity unit,46134,COUNT
2011,A,"Agriculture, Forestry and Fishing",a_0,Rolling mean employees,0,COUNT
2011,A,"Agriculture, Forestry and Fishing",a_0,Salaries and wages paid,279,DOLLARS(millions)
2011,A,"Agriculture, Forestry and Fishing",a_0,"Sales, government funding, grants and subsidies",8187,DOLLARS(millions)
year,industry_code_ANZSIC,industry_name_ANZSIC,rme_size_grp,variable,value,unit


Or, slightly more advanced, we can sort on the second column, using ',' as the column separator and ignoring case. The character right after the `-t` flag is used as the column separator, the option `-k` is followed by the number of the column at which the sort key starts (`-k2` compares from the second column to the end of the line), and `-f` tells the `sort` command to ignore upper/lower case. You can check all these options by running `man sort` on a system where the manual pages are installed.

In [13]:
!head -n 5 data.csv | sort -f -t',' -k2

2011,A,"Agriculture, Forestry and Fishing",a_0,Activity unit,46134,COUNT
2011,A,"Agriculture, Forestry and Fishing",a_0,Rolling mean employees,0,COUNT
2011,A,"Agriculture, Forestry and Fishing",a_0,Salaries and wages paid,279,DOLLARS(millions)
2011,A,"Agriculture, Forestry and Fishing",a_0,"Sales, government funding, grants and subsidies",8187,DOLLARS(millions)
year,industry_code_ANZSIC,industry_name_ANZSIC,rme_size_grp,variable,value,unit
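
As a point of comparison, the column-based sort can be approximated in Python with a key function. This is a sketch under assumptions: `sort_by_field` and the sample rows are hypothetical, the split on `,` is naive (it ignores quoting, just like `sort -t','`), and lower-casing is only an approximation of the locale-dependent folding that `sort -f` performs.

```python
def sort_by_field(lines, field=2, sep=","):
    """Sort lines on everything from `field` (1-based) onwards, ignoring case."""
    def key(line):
        parts = line.split(sep)
        # like -k2: the key runs from the given field to the end of the line
        return sep.join(parts[field - 1:]).lower()
    return sorted(lines, key=key)


# Hypothetical rows, not taken from data.csv
rows = ["2013,c,Retail", "2011,B,Mining", "2012,a,Agriculture"]
print(sort_by_field(rows))
```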


We can also find the number of unique words or lines.

```
uniq command options:
-c : shows a count (occurrence) before each line
-d : shows only duplicate lines
-u : shows only unique lines
```

For instance, show a sorted list of all lines with their count:

Note that we use a pipeline again. The `-nr` option makes `sort` order the lines numerically (`-n`) and in reverse, i.e. descending, order (`-r`).

In [14]:
!sort data.csv | uniq -c | sort -nr

      1 year,industry_code_ANZSIC,industry_name_ANZSIC,rme_size_grp,variable,value,unit
      1 2017,S,Other Services,i_Industry_Total,Total income,10454,DOLLARS(millions)
      1 2017,S,Other Services,i_Industry_Total,Total expenditure,9360,DOLLARS(millions)
      1 2017,S,Other Services,i_Industry_Total,Total assets,12938,DOLLARS(millions)
      1 2017,S,Other Services,i_Industry_Total,"Sales, government funding, grants and subsidies",9029,DOLLARS(millions)
      1 2017,S,Other Services,i_Industry_Total,Salaries and wages paid,2964,DOLLARS(millions)
      1 2017,S,Other Services,i_Industry_Total,Rolling mean employees,60798,COUNT
      1 2017,S,Other Services,i_Industry_Total,Operating profit before tax,812,DOLLARS(millions)
      1 2017,S,Other Services,i_Industry_Total,Fixed tangible assets,3514,DOLLARS(millions)
      1 2017,S,Other Services,i_Industry_Total,Activity unit,21576,COUNT
      1 2017,S,Other Services,h_200+,Total income,483,DOLLARS(millions)
      1 2017,S,

Note that this prints the results. To save them to a new file you will need: `!sort data.csv | uniq -c | sort -nr > savefile.csv`, as `>` redirects the output of a command to a file instead of the console.
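
For comparison, the counting that `sort | uniq -c | sort -nr` performs can be sketched in Python with `collections.Counter`. The sample lines are invented for illustration, and unlike the shell pipeline this version keeps all distinct lines in memory.

```python
from collections import Counter

# Hypothetical input lines with some repeats
lines = ["b,2", "a,1", "b,2", "c,3", "b,2", "a,1"]

counts = Counter(lines)                    # roughly: sort | uniq -c
for line, n in counts.most_common():       # roughly: sort -nr (highest count first)
    print(n, line)

# Lines that occur more than once (uniq -d) and exactly once (uniq -u)
duplicates = sorted(line for line, n in counts.items() if n > 1)
unique_only = sorted(line for line, n in counts.items() if n == 1)
print(duplicates, unique_only)
```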

We can also show a sorted list of all duplicate lines:

In [15]:
!sort data.csv | uniq -d


(no duplicate lines to show here though)

Now, imagine you have a 4.4GB csv file. It has over 14 million records and 60 columns. All you need from this file is the sum of all values in one particular column (here, column 1). We can use the `cat` command and pipe the file contents to `awk` for this.

`cat` sends the text to `awk` through the pipe symbol (`|`). `awk` then splits each line into columns on `,`, sums the first column (`$1`), and prints the total as a float with two decimals (`%.2f`).



In [16]:
!man awk

This system has been minimized by removing packages and content that are
not required on a system that users do not log into.

To restore this content, including manpages, you can run the 'unminimize'
command. You will still need to ensure the 'man-db' package is installed.


`awk` is a very powerful command. `-F` indicates the column separator. More info on how awk works: https://opensource.com/article/20/9/awk-ebook

In [17]:
!cat data.csv | awk -F "," '{ sum += $1 } END { printf "%.2f\n", sum }'

21823704.00
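
One caveat: `awk -F ","` splits on every comma, including commas inside quoted fields such as `"Agriculture, Forestry and Fishing"`. That does not matter for column 1, but it shifts the later columns. A CSV-aware alternative can be sketched with Python's `csv` module, which also reads the file row by row rather than loading it all at once. The function name `sum_column` and the sample data below are assumptions made for illustration.

```python
import csv
import io


def sum_column(handle, column=0):
    """Stream a CSV file and sum one column (0-based), skipping the header row."""
    reader = csv.reader(handle)
    next(reader, None)                  # skip the header line
    total = 0.0
    for row in reader:
        try:
            total += float(row[column])
        except (ValueError, IndexError):
            continue                    # ignore blank or non-numeric cells
    return total


# Hypothetical sample; for a real file use: open('data.csv', newline='')
sample = io.StringIO(
    'year,industry_name,value\n'
    '2011,"Agriculture, Forestry and Fishing",46134\n'
    '2012,"Agriculture, Forestry and Fishing",279\n'
)
print(f"{sum_column(sample, column=2):.2f}")
```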


## Exercises (for checkoff)
1. Download the data.csv file from http://dorienherremans.com/sites/default/files/parsing/data.csv
2. How many words are there in the file?
3. Find the record of funds that cost exactly ‘8898’ and print out two lines below as well.
4. Replace all instances of , by ;
5. Show the first 7 lines of the file
6. Count the number of unique lines in the file


In [38]:
!wget http://dorienherremans.com/sites/default/files/parsing/data.csv

--2024-02-08 04:48:09--  http://dorienherremans.com/sites/default/files/parsing/data.csv
Resolving dorienherremans.com (dorienherremans.com)... 184.154.70.198
Connecting to dorienherremans.com (dorienherremans.com)|184.154.70.198|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 949388 (927K) [text/csv]
Saving to: ‘data.csv’


2024-02-08 04:48:09 (11.6 MB/s) - ‘data.csv’ saved [949388/949388]



In [39]:
# word count
!wc -w data.csv

60425 data.csv


In [40]:
# 3. Find the record of funds that cost exactly ‘8898’ and print out two lines below as well.
!grep -i -A 2 '8898' data.csv

2011,D,"Electricity, Gas, Water and Waste Services",h_200+,Rolling mean employees,8898,COUNT
2011,D,"Electricity, Gas, Water and Waste Services",h_200+,Salaries and wages paid,648,DOLLARS(millions)
2011,D,"Electricity, Gas, Water and Waste Services",h_200+,"Sales, government funding, grants and subsidies",10500,DOLLARS(millions)


In [41]:
# 4. Replace all instances of , by ;
!head -5 data.csv
!sed -e 's/,/;/g' data.csv > newdata_2.csv

year,industry_code_ANZSIC,industry_name_ANZSIC,rme_size_grp,variable,value,unit
2011,A,"Agriculture, Forestry and Fishing",a_0,Activity unit,46134,COUNT
2011,A,"Agriculture, Forestry and Fishing",a_0,Rolling mean employees,0,COUNT
2011,A,"Agriculture, Forestry and Fishing",a_0,Salaries and wages paid,279,DOLLARS(millions)
2011,A,"Agriculture, Forestry and Fishing",a_0,"Sales, government funding, grants and subsidies",8187,DOLLARS(millions)


In [42]:
# 5. Show the first 7 lines of the file
!head -7 newdata_2.csv

year;industry_code_ANZSIC;industry_name_ANZSIC;rme_size_grp;variable;value;unit
2011;A;"Agriculture; Forestry and Fishing";a_0;Activity unit;46134;COUNT
2011;A;"Agriculture; Forestry and Fishing";a_0;Rolling mean employees;0;COUNT
2011;A;"Agriculture; Forestry and Fishing";a_0;Salaries and wages paid;279;DOLLARS(millions)
2011;A;"Agriculture; Forestry and Fishing";a_0;"Sales; government funding; grants and subsidies";8187;DOLLARS(millions)
2011;A;"Agriculture; Forestry and Fishing";a_0;Total income;8866;DOLLARS(millions)
2011;A;"Agriculture; Forestry and Fishing";a_0;Total expenditure;7618;DOLLARS(millions)


In [47]:
# 6. Count the number of unique lines in the file
!sort data.csv | uniq -u > savefile.csv

In [48]:
!wc -l savefile.csv

10837 savefile.csv


# Web scraping

Objective: scraping a website or data files, parsing them, and saving the result in a csv file (or database).

Scraping: automatically extracting data.

Parsing: processing the extracted data in a format you can easily make sense of.

Libraries: Beautiful Soup, lxml, jSoup (java),…

Take note of these best practices:
* Check a website's Terms and Conditions before scraping.
* Usually, the data you scrape cannot be used for commercial purposes.
* Don’t put too much stress on the website (e.g. no 10,000 requests a minute), as this may break the website. Make your scraper behave human-like; one request for one webpage per second is good practice.
* Revisit the site from time to time to check whether the layout has changed, and adapt your code if it has.
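
The advice about request rates can be sketched as a small helper that pauses between downloads. This is a minimal illustration, not a complete crawler: `fetch_politely` is a hypothetical name, and the actual download function is passed in as `fetch` so that the sketch does not depend on any particular HTTP library.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call `fetch(url)` for each URL, pausing between consecutive requests."""
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)   # wait before every request after the first
        results[url] = fetch(url)
    return results


# Possible usage with requests (not executed here; list_of_urls is a placeholder):
#   from requests import get
#   pages = fetch_politely(list_of_urls, lambda u: get(u, timeout=10).text)
```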

NOTE: often parsing is not necessary, as companies offer APIs to extract data:
* https://iextrading.com/developer/ -> stock market data API
* https://developer.yahoo.com/boss/search/boss_api_guide/ -> Yahoo news API



When we scrape websites, we make use of html tags to get the info that we need.

Example webpage:

```
<!DOCTYPE html>
<html>
    <head>
    </head>
    <body>
        <h1> First Scraping </h1>
        <p> Hello World </p>
    </body>
</html>
```


Other useful tags include `<a>` for hyperlinks, `<table>` for tables, `<tr>` for table rows, and `<td>` for table cells.

Tags often have `id` and `class` attributes. An `id` should be unique within a page, whereas a `class` can occur multiple times because it typically specifies the style (layout) shared by several elements. We can use both to pinpoint the text we want to parse:

```
<!DOCTYPE html>
<html>
    <head>
    </head>
    <body>
        <h1> First Scraping </h1>
        <p id="first_paragraph" class="plain_text"> Hello World </p>
    </body>
</html>
```
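
As a small preview of how such attributes are used, the sketch below parses the example page with Beautiful Soup and looks the paragraph up by `id` and by `class`. The variable names (`html`, `soup_example`, `by_id`, `by_class`) are chosen for this illustration only.

```python
from bs4 import BeautifulSoup

html = """
<!DOCTYPE html>
<html>
    <head></head>
    <body>
        <h1> First Scraping </h1>
        <p id="first_paragraph" class="plain_text"> Hello World </p>
    </body>
</html>
"""
soup_example = BeautifulSoup(html, "html.parser")

# An id should be unique, so find() returns the single matching tag (or None)
by_id = soup_example.find(id="first_paragraph")

# A class can be shared by several tags, so find_all() returns a list
by_class = soup_example.find_all("p", class_="plain_text")

print(by_id.text.strip())
print(len(by_class))
```

The keyword is spelled `class_` because `class` is a reserved word in Python.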




## Example
Let's parse the site `pythonprogramming.net/parsememcparseface/`. The code below downloads the page's source code and stores it in the `soup` variable, a `BeautifulSoup` object. You can print it to make sure that the download was successful.

Visit [the page](https://pythonprogramming.net/parsememcparseface/) in your web browser too, so you have an idea of what it looks like.

In [49]:
from bs4 import BeautifulSoup
from requests import get

url = 'https://pythonprogramming.net/parsememcparseface/'
response = get(url)
soup = BeautifulSoup(response.text, 'html.parser')

#let's print the entire 'soup' or webpage:
print(soup)


<html>
<head>
<!--
		palette:
		dark blue: #003F72
		yellow: #FFD166
		salmon: #EF476F
		offwhite: #e7d7d7
		Light Blue: #118AB2
		Light green: #7DDF64
		-->
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<title>Python Programming Tutorials</title>
<meta content="Python Programming tutorials from beginner to advanced on a massive variety of topics. All video and text tutorials are free." name="description"/>
<link href="/static/favicon.ico" rel="shortcut icon"/>
<link href="/static/css/materialize.min.css" rel="stylesheet"/>
<link href="https://fonts.googleapis.com/icon?family=Material+Icons" rel="stylesheet"/>
<meta content="3fLok05gk5gGtWd_VSXbSSSH27F2kr1QqcxYz9vYq2k" name="google-site-verification">
<link href="/static/css/bootstrap.css" rel="stylesheet" type="text/css"/>
<!-- Compiled and minified CSS -->
<!-- Compiled and minified JavaScript -->
<script src="https://code.jquery.com/jquery-2.1.4.min.js"></script>
<script src="https://cdnjs.cloudflare.com/aj

Extracting some basic html tags from our soup variable:

In [50]:
# title of the page between <title> tags. This prints the whole element, including the tags themselves
print(soup.title)

# get the tag name (prints 'title')
print(soup.title.name)

# get the value (only prints the text between the title tags)
print(soup.title.string)

# beginning navigation: prints the name of the parent tag that contains title (here head)
print(soup.title.parent.name)

# getting specific elements: prints the first paragraph
print(soup.p)

<title>Python Programming Tutorials</title>
title
Python Programming Tutorials
head
<p class="introduction">Oh, hello! This is a <span style="font-size:115%">wonderful</span> page meant to let you practice web scraping. This page was originally created to help people work with the <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" target="blank"><strong>Beautiful Soup 4</strong></a> library.</p>


Find all paragraphs `<p>`

In [51]:
print(soup.find_all('p'))

[<p class="introduction">Oh, hello! This is a <span style="font-size:115%">wonderful</span> page meant to let you practice web scraping. This page was originally created to help people work with the <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" target="blank"><strong>Beautiful Soup 4</strong></a> library.</p>, <p>The following table gives some general information for the following <code>programming languages</code>:</p>, <p>I think it's clear that, on a scale of 1-10, python is:</p>, <p>Javascript (dynamic data) test:</p>, <p class="jstest" id="yesnojs">y u bad tho?</p>, <p>Whᶐt hαppéns now¿</p>, <p><a href="/sitemap.xml" target="blank"><strong>sitemap</strong></a></p>, <p class="grey-text text-lighten-4">Contact: Harrison@pythonprogramming.net.</p>, <p class="grey-text right" style="padding-right:10px">Programming is a superpower.</p>]


Alternatively we can iterate through all paragraphs:

In [52]:
for paragraph in soup.find_all('p'):
  print(str(paragraph.text))

Oh, hello! This is a wonderful page meant to let you practice web scraping. This page was originally created to help people work with the Beautiful Soup 4 library.
The following table gives some general information for the following programming languages:
I think it's clear that, on a scale of 1-10, python is:
Javascript (dynamic data) test:
y u bad tho?
Whᶐt hαppéns now¿
sitemap
Contact: Harrison@pythonprogramming.net.
Programming is a superpower.


What other html tags are there? Let's get all links:

In [53]:
for url in soup.find_all('a'):
  print(url.get('href'))

/
#
/
/+=1/
/support/
https://goo.gl/7zgAVQ
/login/
/register/
/
/+=1/
/support/
https://goo.gl/7zgAVQ
/login/
/register/
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
/sitemap.xml
/support-donate/
/consulting/
https://www.facebook.com/pythonprogramming.net/
https://twitter.com/sentdex
https://instagram.com/sentdex
/about/tos/
/about/privacy-policy/
https://xkcd.com/353/


Let's print the paragraph (`<p>`) that contains the introduction (class = introduction):

In [54]:
name_box = soup.find('p', attrs={'class':'introduction'})
print(name_box.text)

Oh, hello! This is a wonderful page meant to let you practice web scraping. This page was originally created to help people work with the Beautiful Soup 4 library.


Find the tag whose `id` is `yesnojs`:


In [55]:
thistext = soup.find_all(id="yesnojs")
print(thistext[0].text)

y u bad tho?


Find an element in an HTML table (the points listed for Python), so that we can then append it to a csv file:

In [56]:
table = soup.find('table')
# tr elements are table rows; td elements are the cells within each row
table_rows = table.find_all('tr')

# index 0 is the header row (it yields no td cells), so the Python row is at index 1
pythonrow = table_rows[1]

# find all cells in that row
cells_in_pythonrow = pythonrow.find_all('td')

# the second cell of the row contains our value
ip_python = cells_in_pythonrow[1]

print(ip_python.text)

932914021
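
The same pattern extends from one cell to the whole table. The sketch below uses a small made-up table rather than the live page, so the HTML, the variable names, and the numbers are all assumptions for illustration; it collects the text of every `<td>` cell, row by row.

```python
from bs4 import BeautifulSoup

# Hypothetical table, not the one on the practice page
html = """
<table>
  <tr><th>Name</th><th>Points</th></tr>
  <tr><td>Python</td><td>100</td></tr>
  <tr><td>Lisp</td><td>42</td></tr>
</table>
"""
table_example = BeautifulSoup(html, "html.parser").find("table")

records = []
for tr in table_example.find_all("tr"):
    cells = [td.text.strip() for td in tr.find_all("td")]
    if cells:                  # the header row has only <th> cells, so it is skipped
        records.append(cells)

print(records)
```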


Write to csv:


In [57]:
import csv
from datetime import datetime

# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open('index.csv', 'a', newline='') as csv_file:
  writer = csv.writer(csv_file)
  writer.writerow([datetime.now(), 'Python', ip_python.text])
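
Because the file is opened in append mode, every run adds a row but never a header. One possible refinement, sketched below under assumptions, writes a header only when the file is new or empty. The helper `append_row`, the column names, and the temporary demo file are all hypothetical and not part of the lab.

```python
import csv
import os
import tempfile
from datetime import datetime


def append_row(path, row, header):
    """Append one row to a CSV file, writing the header first if the file is new or empty."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as csv_file:
        writer = csv.writer(csv_file)
        if is_new:
            writer.writerow(header)
        writer.writerow(row)


# Demo in a temporary directory so that no existing file is touched
demo_path = os.path.join(tempfile.mkdtemp(), "index_demo.csv")
header = ["timestamp", "language", "points"]
append_row(demo_path, [datetime.now().isoformat(), "Python", "932914021"], header)
append_row(demo_path, [datetime.now().isoformat(), "Python", "932914021"], header)

with open(demo_path, newline="") as f:
    print(list(csv.reader(f)))
```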

## Exercises (for checkoff)



Parse the page: https://en.wikipedia.org/wiki/Data_science and perform the following operations:



1. Display the second paragraph
2. Display the link text AND url of all links on this page
3. Display the entire reference section (at the bottom)



In [67]:
from bs4 import BeautifulSoup
from requests import get

url = 'https://en.wikipedia.org/wiki/Data_science'
response = get(url)
soup = BeautifulSoup(response.text, 'html.parser')

#let's print the entire 'soup' or webpage:
print(soup)

<!DOCTYPE html>

<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-0 vector-feature-client-preferences-disabled vector-feature-client-prefs-pinned-disabled vector-toc-available" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Data science - Wikipedia</title>
<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width

In [63]:
# Display 2nd paragraph
print(soup.find_all('p')[1].text)

Data science is an interdisciplinary academic field[1] that uses statistics, scientific computing, scientific methods, processes, algorithms and systems to extract or extrapolate knowledge and insights from potentially noisy, structured, or unstructured data.[2]



In [64]:
# Display the link text AND url of all links on this page
for url in soup.find_all('a'):
  print(url.get('href'))

#bodyContent
/wiki/Main_Page
/wiki/Wikipedia:Contents
/wiki/Portal:Current_events
/wiki/Special:Random
/wiki/Wikipedia:About
//en.wikipedia.org/wiki/Wikipedia:Contact_us
https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en
/wiki/Help:Contents
/wiki/Help:Introduction
/wiki/Wikipedia:Community_portal
/wiki/Special:RecentChanges
/wiki/Wikipedia:File_upload_wizard
/wiki/Main_Page
/wiki/Special:Search
/w/index.php?title=Special:CreateAccount&returnto=Data+science
/w/index.php?title=Special:UserLogin&returnto=Data+science
/w/index.php?title=Special:CreateAccount&returnto=Data+science
/w/index.php?title=Special:UserLogin&returnto=Data+science
/wiki/Help:Introduction
/wiki/Special:MyContributions
/wiki/Special:MyTalk
#
#Foundations
#Relationship_to_statistics
#Etymology
#Early_usage
#Modern_usage
#Data_science_and_data_analysis
#History
#See_also
#References
https://ar.wikipedia.org/wiki/%D8%B9%D9%84%D9
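
Exercise 2 asks for the link text as well as the URL, while the cell above prints only the `href`. One way to show both is sketched below on a small made-up HTML snippet (the snippet and the names `links_example` and `pairs` are assumptions for illustration); the same loop should work on the `soup` object of the Wikipedia page.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the real page
html = """
<p>See <a href="/wiki/Statistics">statistics</a> and
<a href="https://example.org/">an external site</a>. <a>No target</a></p>
"""
links_example = BeautifulSoup(html, "html.parser")

pairs = []
for link in links_example.find_all("a"):
    href = link.get("href")      # None when the tag has no href attribute
    if href is None:
        continue
    pairs.append((link.get_text(strip=True), href))

for text, href in pairs:
    print(f"{text} -> {href}")
```

Links that wrap only an image have empty text, so on a real page you may also want to skip pairs whose text is an empty string.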

In [83]:
# Display the entire reference section (at the bottom)
ref = soup.find('ol', attrs={'class':"references"})
print(ref.text)


^ Donoho, David (2017). "50 Years of Data Science". Journal of Computational and Graphical Statistics. 26 (4): 745–766. doi:10.1080/10618600.2017.1384734. S2CID 114558008.

^ Dhar, V. (2013). "Data science and prediction". Communications of the ACM. 56 (12): 64–73. doi:10.1145/2500499. S2CID 6107147. Archived from the original on 9 November 2014. Retrieved 2 September 2015.

^ Danyluk, A.; Leidig, P. (2021). Computing Competencies for Undergraduate Data Science Curricula (PDF). ACM Data Science Task Force Final Report (Report).

^ Mike, Koby; Hazzan, Orit (20 January 2023). "What is Data Science?". Communications of the ACM. 66 (2): 12–13. doi:10.1145/3575663. ISSN 0001-0782.

^ Hayashi, Chikio (1 January 1998). "What is Data Science ? Fundamental Concepts and a Heuristic Example". In Hayashi, Chikio; Yajima, Keiji; Bock, Hans-Hermann; Ohsumi, Noboru; Tanaka, Yutaka; Baba, Yasumasa (eds.). Data Science, Classification, and Related Methods. Studies in Classification, Data Analysis, and