# Tasks

In this part of the demo, we will extract information from two websites:

- https://en.wikipedia.org/wiki/International_court
- https://www.europarl.europa.eu/meps/en/full-list/a


# Load packages

In [1]:
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
import pandas as pd

In [4]:
!brew install lxml

==> Searching for similarly named formulae and casks...
==> Formulae
html-xml-utils      perl-xml-parser     python-lxml         libxmlb

To install html-xml-utils, run:
  brew install html-xml-utils


In [7]:
!brew install python-lxml

==> Downloading https://ghcr.io/v2/homebrew/core/python-lxml/manifests/4.9.3-2
######################################################################### 100.0%
==> Fetching python-lxml
==> Downloading https://ghcr.io/v2/homebrew/core/python-lxml/blobs/sha256:331bb6
######################################################################### 100.0%
==> Pouring python-lxml--4.9.3.sonoma.bottle.2.tar.gz
🍺  /usr/local/Cellar/python-lxml/4.9.3: 303 files, 15.3MB
==> Running `brew cleanup python-lxml`...
Disable this behaviour by setting HOMEBREW_NO_INSTALL_CLEANUP.
Hide these hints with HOMEBREW_NO_ENV_HINTS (see `man brew`).


# International court

- Scenario 1

## Extract the first table using `pandas`

In [8]:
df_tables = pd.read_html("https://en.wikipedia.org/wiki/International_court")

In [9]:
len(df_tables)

4

In [11]:
df_ic = df_tables[0]

In [12]:
df_ic

Unnamed: 0,Name,Subject matter and scope,Headquarters,Years active
0,African Court on Human and Peoples' Rights,Human rights within the African Union,"Addis Ababa, Ethiopia (2006–7) Arusha, Tanzani...",2006–present
1,Appellate Body of the World Trade Organization,Trade disputes within the World Trade Organiza...,"Geneva, Switzerland",1995–present
2,Benelux Court of Justice,Trade disputes within the Benelux,"Brussels, Belgium",1975–present
3,Caribbean Court of Justice,General disputes within the Caribbean Community,"Port of Spain, Trinidad and Tobago",2005–present
4,CIS Economic Court,Trade disputes and interpretation of treaties ...,"Minsk, Belarus",1994–present
5,COMESA Court of Justice,Trade disputes within the Common Market for Ea...,"Khartoum, Sudan",1998–present
6,Common Court of Justice and Arbitration of the...,Interpretation of OHADA treaties and uniform laws,"Abidjan, Ivory Coast",1998–present
7,Court of Justice of the Andean Community,Trade disputes within the Andean Community,"Quito, Ecuador",1983–present
8,Court of the Eurasian Economic Union,Trade disputes and interpretation of treaties ...,"Minsk, Belarus",2015–present
9,East African Court of Justice,Interpretation of East African Community treaties,"Arusha, Tanzania",2001–present


## Modifying text

We will try to extract the year of foundation.

### Simple method

- use `*.str.slice()`
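
A minimal sketch of the slicing approach, assuming the founding year always occupies the first four characters of `Years active`. The sample values are copied from the table above; the variable names `years_active` and `founded` are only illustrative.

```python
import pandas as pd

# Sample values copied from the 'Years active' column shown above.
years_active = pd.Series(["2006–present", "1995–present", "1975–present"])

# Keep characters 0 to 3 of each string, then convert them to integers.
founded = years_active.str.slice(0, 4).astype(int)
print(founded.tolist())
```

Slicing by position only works while every value starts with a four-digit year, which is one reason to prefer a regular expression for less regular text.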

### Using regular expression

Regular expressions are a powerful tool for extracting and modifying text. Several good introductions are available:

1. LinkedIn Learning (NLP with Python for Machine Learning Essential Training)
  - https://www.linkedin.com/learning/nlp-with-python-for-machine-learning-essential-training/what-are-regular-expressions
  - https://www.linkedin.com/learning/nlp-with-python-for-machine-learning-essential-training/learning-how-to-use-regular-expressions
  - https://www.linkedin.com/learning/nlp-with-python-for-machine-learning-essential-training/regular-expression-replacements

2. YouTube
  - https://www.youtube.com/watch?v=K8L6KVGG-7o
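
As a small illustration, the pattern used in the cells of this section can be tried on a single string with Python's built-in `re` module. This is only a sketch; the sample string is one value from the `Years active` column.

```python
import re

# (\d{4}) captures four digits, `.` matches any single character
# (here the dash between the years), and (.+) captures the rest.
pattern = re.compile(r"(\d{4}).(.+)")

match = pattern.match("2006–present")
start, end = match.groups()
print(start, end)
```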

In [13]:
years = df_ic['Years active']

In [14]:
df_years = years.str.extract(r'(\d{4}).(.+)')

In [15]:
df_years.rename({0: "start", 1: "end"}, axis = 1, inplace = True)

In [16]:
df_ic['Founded'] = df_ic['Years active'].str.extract(r'^(\d{4})').astype(int)

In [17]:
df_ic

Unnamed: 0,Name,Subject matter and scope,Headquarters,Years active,Founded
0,African Court on Human and Peoples' Rights,Human rights within the African Union,"Addis Ababa, Ethiopia (2006–7) Arusha, Tanzani...",2006–present,2006
1,Appellate Body of the World Trade Organization,Trade disputes within the World Trade Organiza...,"Geneva, Switzerland",1995–present,1995
2,Benelux Court of Justice,Trade disputes within the Benelux,"Brussels, Belgium",1975–present,1975
3,Caribbean Court of Justice,General disputes within the Caribbean Community,"Port of Spain, Trinidad and Tobago",2005–present,2005
4,CIS Economic Court,Trade disputes and interpretation of treaties ...,"Minsk, Belarus",1994–present,1994
5,COMESA Court of Justice,Trade disputes within the Common Market for Ea...,"Khartoum, Sudan",1998–present,1998
6,Common Court of Justice and Arbitration of the...,Interpretation of OHADA treaties and uniform laws,"Abidjan, Ivory Coast",1998–present,1998
7,Court of Justice of the Andean Community,Trade disputes within the Andean Community,"Quito, Ecuador",1983–present,1983
8,Court of the Eurasian Economic Union,Trade disputes and interpretation of treaties ...,"Minsk, Belarus",2015–present,2015
9,East African Court of Justice,Interpretation of East African Community treaties,"Arusha, Tanzania",2001–present,2001


## Save the data

In [18]:
df_ic.to_csv("data_international_court.csv")

# List of MEPs

In this part of the demo, we will create a list of the Members of the European Parliament (MEPs).

The base URL (the list of MEPs whose family names start with the letter 'a') is:
https://www.europarl.europa.eu/meps/en/full-list/a

## Extract names

- Scenario 2
  - `bs.select()` then `item.get_text()`

In [19]:
url = "https://www.europarl.europa.eu/meps/en/full-list/a"
html = urlopen(url)
bs = BeautifulSoup(html, "html.parser")
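
Some servers reject requests that do not carry a browser-like `User-Agent` header. Whether this site does is an assumption, not something shown above. If `urlopen(url)` fails with an HTTP error, one possible workaround is to wrap the URL in a `Request` (already imported at the top). The header value below is a made-up example.

```python
from urllib.request import Request

url = "https://www.europarl.europa.eu/meps/en/full-list/a"

# Hypothetical User-Agent string; any descriptive value can be used.
req = Request(url, headers={"User-Agent": "Mozilla/5.0 (demo scraper)"})

# The Request object can be passed to urlopen() in place of the bare URL:
# html = urlopen(req)
print(req.full_url)
```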

In [20]:
names = bs.select('')

SelectorSyntaxError: Expected a selector at position 0
  line 1:

^
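
The error above appears because `select()` needs a non-empty CSS selector. The sketch below shows the `select()` then `get_text()` pattern on a small, made-up HTML snippet. The class names `member` and `member-name` are hypothetical and are not taken from the real page; the actual selector has to be found by inspecting the page source.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the class names are NOT taken from the real page.
sample_html = """
<div class="member"><span class="member-name">Alice EXAMPLE</span></div>
<div class="member"><span class="member-name">Bob SAMPLE</span></div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# select() takes a CSS selector and returns a list of matching tags.
items = soup.select("div.member span.member-name")

# get_text() returns the text inside each tag; strip=True trims whitespace.
mep_name = [item.get_text(strip=True) for item in items]
print(mep_name)
```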

## Extract political groups and country names

- Same as above

## Extract party name

- Same as above

## Extract link to the individual pages

- `bs.select()` then `item['attribute_name']` (for a link, `item['href']`)
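
A minimal sketch of reading an attribute from each selected tag, again on made-up HTML. The class name `member-link`, the paths, and the `example.org` base URL are hypothetical. `urljoin` from the standard library turns relative links into absolute ones.

```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Hypothetical markup; class name and paths are placeholders.
sample_html = """
<a class="member-link" href="/members/1">Alice EXAMPLE</a>
<a class="member-link" href="/members/2">Bob SAMPLE</a>
"""
soup = BeautifulSoup(sample_html, "html.parser")
base_url = "https://example.org"  # placeholder base URL

# item["href"] reads the href attribute of each <a> tag.
mep_link = [urljoin(base_url, item["href"]) for item in soup.select("a.member-link")]
print(mep_link)
```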

## Combine

In [None]:
df_meps = pd.DataFrame(mep_name, columns = ['names'])
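
The cell above assumes that `mep_name` already holds the scraped names. As a sketch with placeholder values, several lists of equal length can be combined into one data frame by passing a dictionary. The list `mep_country` and all values are hypothetical.

```python
import pandas as pd

# Placeholder values standing in for the lists scraped in the steps above.
mep_name = ["Alice EXAMPLE", "Bob SAMPLE"]
mep_country = ["Country A", "Country B"]

# Each dictionary key becomes a column; all lists must have the same length.
df_meps = pd.DataFrame({"names": mep_name, "country": mep_country})
print(df_meps.shape)
```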

## Add more variables

## Save the data
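
A minimal sketch of saving the result, mirroring the `to_csv` call used for the international court table. The file name `data_meps.csv` and the sample rows are assumptions made for illustration.

```python
import pandas as pd

# Placeholder data frame standing in for the combined MEP table.
df_meps = pd.DataFrame({"names": ["Alice EXAMPLE", "Bob SAMPLE"]})

# index=False leaves the row index out of the file.
df_meps.to_csv("data_meps.csv", index=False)

# Read the file back to check that it was written as expected.
df_check = pd.read_csv("data_meps.csv")
print(df_check.shape)
```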