# Data Handling and Web Scraping

## Data Handling - tutorial
--Prof. Dorien Herremans

We are going to explore some commands that will be extremely useful when dealing with large data files on your computer. 

These are all Unix commands, and hence will be extremely useful in your bash terminal. However, given the popularity of Google Colab for data science, we illustrate them here in Colab. You can execute this lab in playground mode: File - Run in Playground mode. 

**To run Unix commands in Colab, precede each statement with an exclamation point. The commands then operate on the files in your Colab session, which you can browse in the sidebar on the left.** 

Let's start by downloading a small csv dataset using the `wget` command. 

In [None]:
!wget http://dorienherremans.com/sites/default/files/parsing/data.csv

--2020-09-28 12:24:25--  http://dorienherremans.com/sites/default/files/parsing/data.csv
Resolving dorienherremans.com (dorienherremans.com)... 99.198.97.250
Connecting to dorienherremans.com (dorienherremans.com)|99.198.97.250|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 949388 (927K) [text/csv]
Saving to: ‘data.csv’


2020-09-28 12:24:26 (1.50 MB/s) - ‘data.csv’ saved [949388/949388]



You can now see the file on the left sidebar under files. You can also visualise the contents using the `cat` command. 

In [None]:
!cat data.csv

year,industry_code_ANZSIC,industry_name_ANZSIC,rme_size_grp,variable,value,unit
2011,A,"Agriculture, Forestry and Fishing",a_0,Activity unit,46134,COUNT
2011,A,"Agriculture, Forestry and Fishing",a_0,Rolling mean employees,0,COUNT
2011,A,"Agriculture, Forestry and Fishing",a_0,Salaries and wages paid,279,DOLLARS(millions)
2011,A,"Agriculture, Forestry and Fishing",a_0,"Sales, government funding, grants and subsidies",8187,DOLLARS(millions)
2011,A,"Agriculture, Forestry and Fishing",a_0,Total income,8866,DOLLARS(millions)
2011,A,"Agriculture, Forestry and Fishing",a_0,Total expenditure,7618,DOLLARS(millions)
2011,A,"Agriculture, Forestry and Fishing",a_0,Operating profit before tax,770,DOLLARS(millions)
2011,A,"Agriculture, Forestry and Fishing",a_0,Total assets,55700,DOLLARS(millions)
2011,A,"Agriculture, Forestry and Fishing",a_0,Fixed tangible assets,32155,DOLLARS(millions)
2011,A,"Agriculture, Forestry and Fishing",b_1-5,Activity unit,21777,COUNT
2011,A,"Agriculture, Fore

This is a long file. To save time, we can send the output of the `cat` command to the `tail` command, which prints only the last lines of its input (the last 5 with `-5`, or equivalently `-n 5`). 

We can 'pipe' the output of one command into the input of another command by using the pipe symbol '|'. Try both commands below. Do they produce different output?

In [None]:
!cat data.csv | tail -5
# !tail -n 5 data.csv
# same output

2017,all,All Industries,j_Grand_Total,Total income,644159,DOLLARS(millions)
2017,all,All Industries,j_Grand_Total,Total expenditure,560665,DOLLARS(millions)
2017,all,All Industries,j_Grand_Total,Operating profit before tax,78054,DOLLARS(millions)
2017,all,All Industries,j_Grand_Total,Total assets,1968716,DOLLARS(millions)
2017,all,All Industries,j_Grand_Total,Fixed tangible assets,458928,DOLLARS(millions)


In [None]:
!tail -n 5 data.csv

2017,all,All Industries,j_Grand_Total,Total income,644159,DOLLARS(millions)
2017,all,All Industries,j_Grand_Total,Total expenditure,560665,DOLLARS(millions)
2017,all,All Industries,j_Grand_Total,Operating profit before tax,78054,DOLLARS(millions)
2017,all,All Industries,j_Grand_Total,Total assets,1968716,DOLLARS(millions)
2017,all,All Industries,j_Grand_Total,Fixed tangible assets,458928,DOLLARS(millions)


More concisely we can also use the head or tail command directly. 

In [None]:
!head -5 data.csv

year,industry_code_ANZSIC,industry_name_ANZSIC,rme_size_grp,variable,value,unit
2011,A,"Agriculture, Forestry and Fishing",a_0,Activity unit,46134,COUNT
2011,A,"Agriculture, Forestry and Fishing",a_0,Rolling mean employees,0,COUNT
2011,A,"Agriculture, Forestry and Fishing",a_0,Salaries and wages paid,279,DOLLARS(millions)
2011,A,"Agriculture, Forestry and Fishing",a_0,"Sales, government funding, grants and subsidies",8187,DOLLARS(millions)
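
For comparison, here is a minimal Python sketch of what `head` and `tail` do. It is not how the Unix tools are implemented; the function names `head` and `tail` and the in-memory `sample` (a few rows copied from the output above) are assumptions made for illustration.

```python
from collections import deque
from itertools import islice
import io

# A small in-memory stand-in for data.csv (rows copied from the output above).
sample = io.StringIO(
    "year,industry_code_ANZSIC,industry_name_ANZSIC,rme_size_grp,variable,value,unit\n"
    '2011,A,"Agriculture, Forestry and Fishing",a_0,Activity unit,46134,COUNT\n'
    '2011,A,"Agriculture, Forestry and Fishing",a_0,Rolling mean employees,0,COUNT\n'
    '2011,A,"Agriculture, Forestry and Fishing",a_0,Salaries and wages paid,279,DOLLARS(millions)\n'
)

def head(lines, n):
    """Return the first n lines without reading the rest of the stream."""
    return list(islice(lines, n))

def tail(lines, n):
    """Return the last n lines, keeping at most n lines in memory."""
    return list(deque(lines, maxlen=n))

first_two = head(sample, 2)
sample.seek(0)  # rewind the stream before reading it again
last_two = tail(sample, 2)

print("".join(first_two), end="")
print("".join(last_two), end="")
```

With a real file you would pass the open file object instead, e.g. `with open('data.csv') as f: head(f, 5)`. Both helpers read line by line, so they also work on files that do not fit in memory.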


For counting words we can use wc (word count): 

In [None]:
!wc data.csv
# output columns: number of lines, words, bytes, file name

 10837  60425 949388 data.csv


Adding the option -l counts the number of lines in the file. 

In [None]:
!wc -l data.csv


10837 data.csv
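
To make the three numbers concrete, the following Python sketch computes them for a short string. The helper name `wc` and the two-line `text` (copied from the top of data.csv) are assumptions for illustration only.

```python
# Two lines copied from the top of data.csv, used as a small test input.
text = (
    "year,industry_code_ANZSIC,industry_name_ANZSIC,rme_size_grp,variable,value,unit\n"
    '2011,A,"Agriculture, Forestry and Fishing",a_0,Activity unit,46134,COUNT\n'
)

def wc(text):
    """Return (lines, words, bytes) for a string, in the same order as wc."""
    n_lines = text.count("\n")           # wc -l counts newline characters
    n_words = len(text.split())          # words are runs separated by whitespace
    n_bytes = len(text.encode("utf-8"))  # wc -c counts bytes, not characters
    return n_lines, n_words, n_bytes

counts = wc(text)
print(counts)
```

Note that a "word" here is anything separated by whitespace, so a whole comma-separated record without spaces counts as a single word. This is why the word count of a csv file is rarely a meaningful number.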


In [None]:
man grep

A very useful command is grep, which searches within files. In this case we search for a term in the input file data.csv. 

In [None]:
!grep -i -B 1 -A 1 'tangible assets' data.csv
# -i : case-insensitive search/pattern matching
# -B NUM : print NUM lines of leading context before matching line
# -A NUM : print NUM lines of trailing context after matching lines

2011,A,"Agriculture, Forestry and Fishing",a_0,Total assets,55700,DOLLARS(millions)
2011,A,"Agriculture, Forestry and Fishing",a_0,Fixed tangible assets,32155,DOLLARS(millions)
2011,A,"Agriculture, Forestry and Fishing",b_1-5,Activity unit,21777,COUNT
--
2011,A,"Agriculture, Forestry and Fishing",b_1-5,Total assets,52666,DOLLARS(millions)
2011,A,"Agriculture, Forestry and Fishing",b_1-5,Fixed tangible assets,31235,DOLLARS(millions)
2011,A,"Agriculture, Forestry and Fishing",c_6-9,Activity unit,1965,COUNT
--
2011,A,"Agriculture, Forestry and Fishing",c_6-9,Total assets,9323,DOLLARS(millions)
2011,A,"Agriculture, Forestry and Fishing",c_6-9,Fixed tangible assets,5482,DOLLARS(millions)
2011,A,"Agriculture, Forestry and Fishing",d_10-19,Activity unit,1140,COUNT
--
2011,A,"Agriculture, Forestry and Fishing",d_10-19,Total assets,6524,DOLLARS(millions)
2011,A,"Agriculture, Forestry and Fishing",d_10-19,Fixed tangible assets,3649,DOLLARS(millions)
2011,A,"Agriculture, Forestry and F

I encourage you to experiment a bit with grep (`man grep` for more options). 
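
If you want to see what the `-i`, `-B` and `-A` options amount to, here is a rough Python sketch. The function `grep_context` is a hypothetical helper, the rows are shortened and partly made up, and Python's `re` syntax is not identical to grep's default pattern syntax, so treat this as an approximation.

```python
import re

def grep_context(lines, pattern, before=0, after=0, ignore_case=False):
    """Return (line_number, line) pairs for matches plus surrounding context.

    Line numbers are 1-based. Overlapping context windows are merged, so no
    line is reported twice. Unlike grep, no '--' separator is produced.
    """
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)
    keep = set()
    for i, line in enumerate(lines):
        if regex.search(line):
            start = max(0, i - before)
            stop = min(len(lines), i + after + 1)
            keep.update(range(start, stop))
    return [(i + 1, lines[i]) for i in sorted(keep)]

# Shortened, illustrative rows (not exact copies of data.csv).
rows = [
    "2011,A,x,a_0,Total assets,55700,DOLLARS(millions)",
    "2011,A,x,a_0,Fixed tangible assets,32155,DOLLARS(millions)",
    "2011,A,x,b_1-5,Activity unit,21777,COUNT",
    "2011,A,x,b_1-5,Rolling mean employees,100,COUNT",
]
hits = grep_context(rows, "TANGIBLE ASSETS", before=1, after=1, ignore_case=True)
for number, line in hits:
    print(number, line)
```

Only the second row matches, and one line of context on each side brings in rows 1 and 3.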

If we don't just want to search in a file but also want to change or replace words, we can use sed. It works with regular expressions (aaah, look this up and be confused ; ). 

For example, let's have another look at the data file:

In [None]:
!head -5 data.csv

year,industry_code_ANZSIC,industry_name_ANZSIC,rme_size_grp,variable,value,unit
2011,A,"Agriculture, Forestry and Fishing",a_0,Activity unit,46134,COUNT
2011,A,"Agriculture, Forestry and Fishing",a_0,Rolling mean employees,0,COUNT
2011,A,"Agriculture, Forestry and Fishing",a_0,Salaries and wages paid,279,DOLLARS(millions)
2011,A,"Agriculture, Forestry and Fishing",a_0,"Sales, government funding, grants and subsidies",8187,DOLLARS(millions)


We can replace the term 'Agriculture, Forestry' by 'Computer science' and save this as a new file: 

In [None]:
# sed : stream editor
# can do insertion, deletion, search and replace (substitution)
!sed -e 's/Agriculture, Forestry/Computer science/g' data.csv > newdata.csv
# -e : edit SCRIPT
# / : delimiter
# s : substitution operation
# /g : global replacement - replace all occurrences of the string in the line
# /NUM : replace only the NUM-th occurrence of the string in the line

Let's check the new file: 

In [None]:
!head -5 newdata.csv

year,industry_code_ANZSIC,industry_name_ANZSIC,rme_size_grp,variable,value,unit
2011,A,"Computer science and Fishing",a_0,Activity unit,46134,COUNT
2011,A,"Computer science and Fishing",a_0,Rolling mean employees,0,COUNT
2011,A,"Computer science and Fishing",a_0,Salaries and wages paid,279,DOLLARS(millions)
2011,A,"Computer science and Fishing",a_0,"Sales, government funding, grants and subsidies",8187,DOLLARS(millions)
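
The effect of the `g` flag is easier to see on a line that contains the search text twice. The sketch below uses Python's `re.sub` as a stand-in for sed; the helper `substitute` and the string `twice` are made up for illustration, and Python's regex dialect differs from sed's in some details.

```python
import re

def substitute(line, pattern, replacement, count=0):
    """Replace matches of pattern in one line; count=0 means replace all."""
    return re.sub(pattern, replacement, line, count=count)

row = '2011,A,"Agriculture, Forestry and Fishing",a_0,Activity unit,46134,COUNT'
replaced = substitute(row, r"Agriculture, Forestry", "Computer science")

# Hypothetical line containing the same text twice, to show the effect of /g.
twice = "Agriculture, Forestry; Agriculture, Forestry"
all_hits = substitute(twice, r"Agriculture, Forestry", "CS")            # like s/.../.../g
first_hit = substitute(twice, r"Agriculture, Forestry", "CS", count=1)  # like s/.../.../

print(replaced)
print(all_hits)
print(first_hit)
```

Without `g`, sed replaces only the first match on each line, which corresponds to `count=1` here.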


Excellent, the change happened as expected... Now we can sort the file: 

In [None]:
!head -n 5 data.csv | sort

2011,A,"Agriculture, Forestry and Fishing",a_0,Activity unit,46134,COUNT
2011,A,"Agriculture, Forestry and Fishing",a_0,Rolling mean employees,0,COUNT
2011,A,"Agriculture, Forestry and Fishing",a_0,Salaries and wages paid,279,DOLLARS(millions)
2011,A,"Agriculture, Forestry and Fishing",a_0,"Sales, government funding, grants and subsidies",8187,DOLLARS(millions)
year,industry_code_ANZSIC,industry_name_ANZSIC,rme_size_grp,variable,value,unit


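Or, slightly more advanced, sort on the second column (-k2), using ',' as the column separator (-t','), case-insensitively (-f):

In [None]:
!head -n 5 data.csv | sort -f -t',' -k2
# -f : ignore case when comparing
# -t',' : use ',' as the field separator
# -k2 : the sort key starts at the second field and runs to the end of the line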

2011,A,"Agriculture, Forestry and Fishing",a_0,Activity unit,46134,COUNT
2011,A,"Agriculture, Forestry and Fishing",a_0,Rolling mean employees,0,COUNT
2011,A,"Agriculture, Forestry and Fishing",a_0,Salaries and wages paid,279,DOLLARS(millions)
2011,A,"Agriculture, Forestry and Fishing",a_0,"Sales, government funding, grants and subsidies",8187,DOLLARS(millions)
year,industry_code_ANZSIC,industry_name_ANZSIC,rme_size_grp,variable,value,unit
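
The key used by `sort -f -t',' -k2` can be imitated in Python as follows. This is only a sketch: `key_from_field` is a hypothetical helper, and Python compares strings by code point, whereas `sort` follows the locale, so the two can order some characters differently.

```python
rows = [
    "year,industry_code_ANZSIC,industry_name_ANZSIC,rme_size_grp,variable,value,unit",
    '2011,A,"Agriculture, Forestry and Fishing",a_0,Salaries and wages paid,279,DOLLARS(millions)',
    '2011,A,"Agriculture, Forestry and Fishing",a_0,Activity unit,46134,COUNT',
]

def key_from_field(line, field, sep=","):
    """Sort key: everything from the given 1-based field onward, lower-cased."""
    parts = line.split(sep)
    return sep.join(parts[field - 1:]).lower()

by_second_field = sorted(rows, key=lambda line: key_from_field(line, 2))
for line in by_second_field:
    print(line)
```

The header ends up last because its second field starts with `i`, which sorts after `a`. The two data rows share the same second field, so the rest of the line decides their order.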


We can also find the number of unique words or lines. 

```
uniq command options: 
-c : shows a count (occurrence) before each line
-d : shows only duplicate lines
-u : shows only unique lines
```

For instance, show a sorted list of all lines with their count. Note that we use a pipeline again: `uniq` only compares adjacent lines, so the input is sorted first.


In [None]:
!sort data.csv | uniq -c | sort -nr 
# -nr : to sort results with lines that occur the most often first
# -c : count - tells how many times a line was repeated by displaying a number as a prefix with the line

Streaming output truncated to the last 5000 lines.
      1 2014,E,Construction,c_6-9,Rolling mean employees,15426,COUNT
      1 2014,E,Construction,c_6-9,Operating profit before tax,215,DOLLARS(millions)
      1 2014,E,Construction,c_6-9,Fixed tangible assets,320,DOLLARS(millions)
      1 2014,E,Construction,c_6-9,Activity unit,2199,COUNT
      1 2014,E,Construction,b_1-5,Total income,8281,DOLLARS(millions)
      1 2014,E,Construction,b_1-5,Total expenditure,7637,DOLLARS(millions)
      1 2014,E,Construction,b_1-5,Total assets,4918,DOLLARS(millions)
      1 2014,E,Construction,b_1-5,"Sales, government funding, grants and subsidies",8167,DOLLARS(millions)
      1 2014,E,Construction,b_1-5,Salaries and wages paid,1626,DOLLARS(millions)
      1 2014,E,Construction,b_1-5,Rolling mean employees,30456,COUNT
      1 2014,E,Construction,b_1-5,Operating profit before tax,768,DOLLARS(millions)
      1 2014,E,Construction,b_1-5,Fixed tangible assets,1630,DOLLARS(millions)

Note that this only prints the results. To save them to a new file, redirect the output: `!sort data.csv | uniq -c | sort -nr > savefile.csv`

We can also show a sorted list of all duplicate lines:

In [None]:
!sort data.csv | uniq -d

(no duplicate lines to show here though)
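
Since data.csv has no duplicate lines, the `uniq` options are easier to see on a tiny made-up input. The sketch below uses `collections.Counter` as a rough Python counterpart; the `lines` list is invented for illustration.

```python
from collections import Counter

# Made-up lines with one duplicate, since data.csv itself has none.
lines = ["b,2", "a,1", "b,2", "c,3"]

counts = Counter(lines)

# Like `sort | uniq -c | sort -nr`: most frequent line first.
most_common = counts.most_common()

# Like `sort | uniq -d`: lines that occur more than once.
duplicates = sorted(line for line, n in counts.items() if n > 1)

# Like `sort | uniq -u`: lines that occur exactly once.
singles = sorted(line for line, n in counts.items() if n == 1)

print(most_common[0], duplicates, singles, len(counts))
```

A `Counter` does not need sorted input, because it counts every line regardless of position. The price is that it keeps all distinct lines in memory.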

Now, imagine you have a 4.4GB csv file. It has over 14 million records and 60 columns. All you need from this file is the sum of all values in one particular column (column 1).

=> `cat` sends the text to `awk` using the pipe symbol (|). `awk` then splits each line into columns on ',', adds up the first column (`$1`), and prints the total as a float with two decimals (%.2f). The header value `year` is not a number, so awk counts it as 0.



In [None]:
!cat data.csv | awk -F "," '{ sum += $1 } END { printf "%.2f\n", sum }'

21823704.00
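
The same streaming sum can be sketched in Python with the standard `csv` module. The function `column_sum` and the three-line `sample` are assumptions for illustration. Unlike `awk -F ","`, `csv.reader` treats a quoted field such as "Agriculture, Forestry and Fishing" as one column, so column indices after such a field differ between the two tools.

```python
import csv
import io

def column_sum(stream, column, skip_header=True):
    """Sum one 0-based column of a CSV stream, reading one row at a time."""
    reader = csv.reader(stream)
    if skip_header:
        next(reader, None)  # discard the header row
    total = 0.0
    for row in reader:
        total += float(row[column])
    return total

# The header and the first two records of data.csv, as an in-memory stand-in.
sample = io.StringIO(
    "year,industry_code_ANZSIC,industry_name_ANZSIC,rme_size_grp,variable,value,unit\n"
    '2011,A,"Agriculture, Forestry and Fishing",a_0,Activity unit,46134,COUNT\n'
    '2011,A,"Agriculture, Forestry and Fishing",a_0,Rolling mean employees,0,COUNT\n'
)
year_sum = column_sum(sample, 0)
sample.seek(0)  # rewind before summing another column
value_sum = column_sum(sample, 5)

print(f"{year_sum:.2f}")
print(f"{value_sum:.2f}")
```

Because rows are processed one at a time, memory use stays small even for a multi-gigabyte file; pass an open file object instead of the `StringIO` buffer.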


## Exercise
1. Download the data.csv file 
2. How many words are there in the file? 
3. Find the record of funds that cost exactly ‘8898’ and print out two lines below as well. 
4. Replace all instances of , by ;
5. Show the first 7 lines of the file
6. Count the number of unique lines in the file


In [None]:
# 1. Download the data.csv file
!wget http://dorienherremans.com/sites/default/files/parsing/data.csv

--2020-09-28 14:05:38--  http://dorienherremans.com/sites/default/files/parsing/data.csv
Resolving dorienherremans.com (dorienherremans.com)... 99.198.97.250
Connecting to dorienherremans.com (dorienherremans.com)|99.198.97.250|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 949388 (927K) [text/csv]
Saving to: ‘data.csv.1’


2020-09-28 14:05:39 (1.59 MB/s) - ‘data.csv.1’ saved [949388/949388]



In [None]:
# 2. How many words are there in the file?
!wc -w data.csv.1
# 60425 words

60425 data.csv.1


In [None]:
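# 3. Find the record of funds that cost exactly '8898' and print out two lines below as well.
# note: '8898' matches anywhere in a line; a stricter pattern such as ',8898,' matches only a whole field
!grep -A 2 '8898' data.csv.1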

2011,D,"Electricity, Gas, Water and Waste Services",h_200+,Rolling mean employees,8898,COUNT
2011,D,"Electricity, Gas, Water and Waste Services",h_200+,Salaries and wages paid,648,DOLLARS(millions)
2011,D,"Electricity, Gas, Water and Waste Services",h_200+,"Sales, government funding, grants and subsidies",10500,DOLLARS(millions)


In [None]:
# 4. Replace all instances of , by ;
!sed -e 's/,/;/g' data.csv.1 > newdata1.csv

In [None]:
# 5. Show the first 7 lines of the file
!head -7 newdata1.csv

year;industry_code_ANZSIC;industry_name_ANZSIC;rme_size_grp;variable;value;unit
2011;A;"Agriculture; Forestry and Fishing";a_0;Activity unit;46134;COUNT
2011;A;"Agriculture; Forestry and Fishing";a_0;Rolling mean employees;0;COUNT
2011;A;"Agriculture; Forestry and Fishing";a_0;Salaries and wages paid;279;DOLLARS(millions)
2011;A;"Agriculture; Forestry and Fishing";a_0;"Sales; government funding; grants and subsidies";8187;DOLLARS(millions)
2011;A;"Agriculture; Forestry and Fishing";a_0;Total income;8866;DOLLARS(millions)
2011;A;"Agriculture; Forestry and Fishing";a_0;Total expenditure;7618;DOLLARS(millions)


In [None]:
# 6. Count the number of unique lines in the file
# note: uniq only merges adjacent identical lines, so in general sort the file first
!uniq -c newdata1.csv

Streaming output truncated to the last 5000 lines.
      1 2014;O;Public Administration and Safety;g_100-199;Total expenditure;59;DOLLARS(millions)
      1 2014;O;Public Administration and Safety;g_100-199;Operating profit before tax;4;DOLLARS(millions)
      1 2014;O;Public Administration and Safety;g_100-199;Total assets;34;DOLLARS(millions)
      1 2014;O;Public Administration and Safety;g_100-199;Fixed tangible assets;6;DOLLARS(millions)
      1 2014;O;Public Administration and Safety;h_200+;Activity unit;12;COUNT
      1 2014;O;Public Administration and Safety;h_200+;Rolling mean employees;7227;COUNT
      1 2014;O;Public Administration and Safety;h_200+;Salaries and wages paid;287;DOLLARS(millions)
      1 2014;O;Public Administration and Safety;h_200+;"Sales; government funding; grants and subsidies";555;DOLLARS(millions)
      1 2014;O;Public Administration and Safety;h_200+;Total income;557;DOLLARS(millions)
      1 2014;O;Public Administration and Safet

In [None]:
# sort first so that identical lines end up next to each other
!sort newdata1.csv | uniq | wc -l

10837


# Web scraping

Objective: scrape a website or data files, parse them, and save the result in a csv file (or database). 

Scraping: automatically extracting data.

Parsing: processing the extracted data into a format you can easily make sense of. 
Libraries: Beautiful Soup, lxml, jsoup (Java), … 

Take note of these best practices: 
* Check a website's Terms and Conditions before scraping.
* Usually, the data you scrape cannot be used for commercial purposes.
* Don't put too much stress on the website (i.e. no 10,000 requests a minute); this may break the website. Make your scraper behave like a human, e.g. one request for one webpage per second is good practice.
* Revisit the site from time to time to check whether the layout has changed, and adapt your code if it has.
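
The "one request per second" advice can be sketched as a small loop that sleeps between requests. Everything named here is an assumption for illustration: `fetch_politely` is a hypothetical helper, the URLs are placeholders, and the `fetch` argument is a stand-in so that the sketch runs without network access (in this notebook it could be `requests.get`).

```python
import time

def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # wait before every request except the first
        results.append(fetch(url))
    return results

# Placeholder URLs and a stand-in fetch function, so the sketch runs offline.
urls = ["https://example.com/a", "https://example.com/b"]
start = time.monotonic()
pages = fetch_politely(urls, fetch=lambda url: "fetched " + url, delay_seconds=0.2)
elapsed = time.monotonic() - start

print(pages)
print(f"took {elapsed:.1f}s")
```

The delay is shortened to 0.2 seconds here only to keep the example quick; the default of one second matches the guideline above.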

NOTE: parsing is often not necessary, as many companies offer APIs to extract data: 
* https://iextrading.com/developer/ -> stock market data API
* https://developer.yahoo.com/boss/search/boss_api_guide/ -> Yahoo news API



When we scrape websites, we make use of html tags to get the info that we need. 

Example webpage: 

```
<!DOCTYPE html> 
<html>
    <head>
    </head>
    <body>
        <h1> First Scraping </h1>
        <p> Hello World </p>
    </body>
</html>
```


Other useful tags include `<a>` for hyperlinks, `<table>` for tables, `<tr>` for table rows, and `<td>` for the cells (columns) within a row.

Tags often have `id` and `class` attributes. An `id` is unique within a page, whereas a `class` can occur multiple times because it usually specifies a style. These attributes are typically used to format the webpage (layout), and we can use them to pinpoint the text we want to parse: 

```
<!DOCTYPE html> 
<html>
    <head>
    </head>
    <body>
        <h1> First Scraping </h1>
        <p id="first_paragraph" class="plain_text"> Hello World </p>
    </body>
</html>
```




## Example
Let's parse the site `pythonprogramming.net/parsememcparseface/`. The code below will extract the website's source code and store it in the soup variable, a BeautifulSoup object. You can print this out to make sure that you were successful. 


In [None]:
from bs4 import BeautifulSoup
from requests import get

url = 'https://pythonprogramming.net/parsememcparseface/'
response = get(url)
soup = BeautifulSoup(response.text, 'html.parser')

#let's print the entire 'soup' or webpage:
print(soup)


<html>
<head>
<!--
		palette:
		dark blue: #003F72
		yellow: #FFD166
		salmon: #EF476F
		offwhite: #e7d7d7
		Light Blue: #118AB2
		Light green: #7DDF64
		-->
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<title>Python Programming Tutorials</title>
<meta content="Python Programming tutorials from beginner to advanced on a massive variety of topics. All video and text tutorials are free." name="description"/>
<link href="/static/favicon.ico" rel="shortcut icon"/>
<link href="/static/css/materialize.min.css" rel="stylesheet"/>
<link href="https://fonts.googleapis.com/icon?family=Material+Icons" rel="stylesheet"/>
<meta content="3fLok05gk5gGtWd_VSXbSSSH27F2kr1QqcxYz9vYq2k" name="google-site-verification">
<link href="/static/css/bootstrap.css" rel="stylesheet" type="text/css"/>
<!-- Compiled and minified CSS -->
<!-- Compiled and minified JavaScript -->
<script src="https://code.jquery.com/jquery-2.1.4.min.js"></script>
<script src="https://cdnjs.cloudflare.com/aj

Extracting some basic html tags from our soup variable: 

In [None]:
# title of the page between <title> tags. This prints the whole element, including the tags themselves
print(soup.title)

#get attributes (prints the name of the tag)
print(soup.title.name)

#get values (only prints the value between the title tags)
print(soup.title.string)

#beginning navigation, prints the parent tag in which title is contained (here head)
print(soup.title.parent.name)

#getting specific values: (prints first paragraph)
print(soup.p)

<title>Python Programming Tutorials</title>
title
Python Programming Tutorials
head
<p class="introduction">Oh, hello! This is a <span style="font-size:115%">wonderful</span> page meant to let you practice web scraping. This page was originally created to help people work with the <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" target="blank"><strong>Beautiful Soup 4</strong></a> library.</p>


Find all paragraphs `<p>`

In [None]:
print(soup.find_all('p'))

[<p class="introduction">Oh, hello! This is a <span style="font-size:115%">wonderful</span> page meant to let you practice web scraping. This page was originally created to help people work with the <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" target="blank"><strong>Beautiful Soup 4</strong></a> library.</p>, <p>The following table gives some general information for the following <code>programming languages</code>:</p>, <p>I think it's clear that, on a scale of 1-10, python is:</p>, <p>Javascript (dynamic data) test:</p>, <p class="jstest" id="yesnojs">y u bad tho?</p>, <p>Whᶐt hαppéns now¿</p>, <p><a href="/sitemap.xml" target="blank"><strong>sitemap</strong></a></p>, <p class="grey-text text-lighten-4">Contact: Harrison@pythonprogramming.net.</p>, <p class="grey-text right" style="padding-right:10px">Programming is a superpower.</p>]


Alternatively we can iterate through all paragraphs:

In [None]:
for paragraph in soup.find_all('p'):
  print(str(paragraph.text))

Oh, hello! This is a wonderful page meant to let you practice web scraping. This page was originally created to help people work with the Beautiful Soup 4 library.
The following table gives some general information for the following programming languages:
I think it's clear that, on a scale of 1-10, python is:
Javascript (dynamic data) test:
y u bad tho?
Whᶐt hαppéns now¿
sitemap
Contact: Harrison@pythonprogramming.net.
Programming is a superpower.


What other html tags are there? Let's get all links: 

In [None]:
for url in soup.find_all('a'):
  print(url.get('href'))

/
#
/
/+=1/
/support/
https://goo.gl/7zgAVQ
/login/
/register/
/
/+=1/
/support/
https://goo.gl/7zgAVQ
/login/
/register/
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
/sitemap.xml
/support-donate/
/consulting/
https://www.facebook.com/pythonprogramming.net/
https://twitter.com/sentdex
https://instagram.com/sentdex
/about/tos/
/about/privacy-policy/
https://xkcd.com/353/
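
Many of the links above are relative (for example `/support/`), so they cannot be requested directly. A minimal sketch of turning them into absolute URLs with the standard library's `urljoin` is shown below; the `hrefs` list is a hand-picked subset of the output above, with one `None` added because tags without an `href` attribute return `None`.

```python
from urllib.parse import urljoin

base = "https://pythonprogramming.net/parsememcparseface/"

# A few hrefs from the output above, plus None for a tag without an href.
hrefs = ["/", "/support/", "/sitemap.xml", "https://xkcd.com/353/", None]

# Resolve each href against the page URL, skipping missing ones.
absolute = [urljoin(base, href) for href in hrefs if href is not None]

for link in absolute:
    print(link)
```

Links that are already absolute, such as the xkcd one, are returned unchanged.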


Let's print the paragraph (`<p>`) that contains the introduction (class = introduction):

In [None]:
name_box = soup.find('p', attrs={'class':'introduction'})
print(name_box.text)

Oh, hello! This is a wonderful page meant to let you practice web scraping. This page was originally created to help people work with the Beautiful Soup 4 library.


Find the tags whose id is yesnojs:


In [None]:
thistext = soup.find_all(id="yesnojs")
print(thistext[0].text)

y u bad tho?


Find elements in an html table (today's points of python) and append this to a csv file: 

In [None]:
table = soup.find('table')
table_rows = table.find_all('tr')
# note: tr elements are table rows, td elements are the cells (columns) within each row

pythonrow = table_rows[1]
# index 1 because the first row of the table shows up empty

cells_in_pythonrow = pythonrow.find_all('td')
# finds all cells in the row

ip_python = cells_in_pythonrow[1]
# the second cell of the row contains our value

print(ip_python.text)

932914021


Write to csv: 


In [None]:
import csv
from datetime import datetime

# newline='' lets the csv module control the line endings itself
with open('index.csv', 'a', newline='') as csv_file:
  writer = csv.writer(csv_file)
  writer.writerow([datetime.now(), 'Python', ip_python.text])
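
To check what ends up in the file, the rows can be written and read back. The sketch below uses an in-memory buffer instead of index.csv so that it leaves no file behind; the timestamps and the second row are made up for illustration.

```python
import csv
import io
from datetime import datetime

buffer = io.StringIO()  # in-memory stand-in for index.csv
writer = csv.writer(buffer)

# Made-up timestamps; the value in the first row is the one printed above.
writer.writerow([datetime(2020, 9, 28, 12, 0), "Python", "932914021"])
writer.writerow([datetime(2020, 9, 29, 12, 0), "Python, the language", "1"])

buffer.seek(0)  # rewind so the rows can be read back
rows = list(csv.reader(buffer))

for row in rows:
    print(row)
```

The second row shows why the `csv` module is preferable to joining strings with commas by hand: a field that itself contains a comma is quoted on writing and restored as a single field on reading.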

## Exercise



Parse the page: https://en.wikipedia.org/wiki/Data_science and perform the following operations: 



1.   Display the second paragraph
2.   Display the link text AND url of all links on this page
3. Display the entire reference section (at the bottom)



In [None]:
# parse the page
from bs4 import BeautifulSoup
from requests import get

url_ex = 'https://en.wikipedia.org/wiki/Data_science'
response_ex = get(url_ex)
soup_ex = BeautifulSoup(response_ex.text, 'html.parser')

# let's print the entire 'soup' or webpage:
# print(soup_ex)


In [None]:
# 1. Display the second paragraph
# note: index 2 selects the third <p> tag; adjust the index if the page layout differs
second_para = soup_ex.find_all('p')[2]
print(second_para)

print("\n")

print(second_para.text)

<p>Data science is a "concept to unify <a href="/wiki/Statistics" title="Statistics">statistics</a>, <a href="/wiki/Data_analysis" title="Data analysis">data analysis</a> and their related methods" in order to "understand and analyze actual phenomena" with data.<sup class="reference" id="cite_ref-3"><a href="#cite_note-3">[3]</a></sup> It uses techniques and theories drawn from many fields within the context of <a href="/wiki/Mathematics" title="Mathematics">mathematics</a>, <a href="/wiki/Statistics" title="Statistics">statistics</a>, <a href="/wiki/Computer_science" title="Computer science">computer science</a>, <a href="/wiki/Domain_knowledge" title="Domain knowledge">domain knowledge</a> and <a href="/wiki/Information_science" title="Information science">information science</a>. <a class="mw-redirect" href="/wiki/Turing_award" title="Turing award">Turing award</a> winner <a href="/wiki/Jim_Gray_(computer_scientist)" title="Jim Gray (computer scientist)">Jim Gray</a> imagined data s

In [None]:
# 2. Display the link text AND url of all links on this page
link_url_ex = soup_ex.find_all('a')

for link in link_url_ex:
  # link text and url, separated by a tab (href is None when the tag has no href attribute)
  print(str(link.text) + "\t" + str(link.get('href')))

	None
Jump to navigation	#mw-head
Jump to search	#searchInput
information science	/wiki/Information_science
Machine learning	/wiki/Machine_learning
data mining	/wiki/Data_mining
Classification	/wiki/Statistical_classification
Clustering	/wiki/Cluster_analysis
Regression	/wiki/Regression_analysis
Anomaly detection	/wiki/Anomaly_detection
AutoML	/wiki/Automated_machine_learning
Association rules	/wiki/Association_rule_learning
Reinforcement learning	/wiki/Reinforcement_learning
Structured prediction	/wiki/Structured_prediction
Feature engineering	/wiki/Feature_engineering
Feature learning	/wiki/Feature_learning
Online learning	/wiki/Online_machine_learning
Semi-supervised learning	/wiki/Semi-supervised_learning
Unsupervised learning	/wiki/Unsupervised_learning
Learning to rank	/wiki/Learning_to_rank
Grammar induction	/wiki/Grammar_induction
Supervised learning	/wiki/Supervised_learning
classification	/wiki/Statistical_classification
regression	/wiki/Regression_analysis
Decision trees	/wi

In [None]:
# 3. Display the entire reference section (at the bottom)

# class="reflist columns references-column-width"
# class = references
ref_ex = soup_ex.find('ol', attrs={'class':'references'})
ref_ex

# ref_ex1 = soup_ex.find('div', attrs={'reflist columns references-column-width'})
# ref_ex1

<ol class="references">
<li id="cite_note-1"><span class="mw-cite-backlink"><b><a href="#cite_ref-1">^</a></b></span> <span class="reference-text"><cite class="citation journal cs1" id="CITEREFDhar2013">Dhar, V. (2013). <a class="external text" href="http://cacm.acm.org/magazines/2013/12/169933-data-science-and-prediction/fulltext" rel="nofollow">"Data science and prediction"</a>. <i>Communications of the ACM</i>. <b>56</b> (12): 64–73. <a class="mw-redirect" href="/wiki/Doi_(identifier)" title="Doi (identifier)">doi</a>:<a class="external text" href="https://doi.org/10.1145%2F2500499" rel="nofollow">10.1145/2500499</a>. <a class="mw-redirect" href="/wiki/S2CID_(identifier)" title="S2CID (identifier)">S2CID</a> <a class="external text" href="https://api.semanticscholar.org/CorpusID:6107147" rel="nofollow">6107147</a>. <a class="external text" href="https://web.archive.org/web/20141109113411/http://cacm.acm.org/magazines/2013/12/169933-data-science-and-prediction/fulltext" rel="nofollow