# Web Scraping with Beautiful Soup

* * * 

### Icons used in this notebook
🔔 **Question**: A quick question to help you understand what's going on.<br>
🥊 **Challenge**: Interactive exercise. We'll work through these in the workshop!<br>
⚠️ **Warning**: Heads-up about tricky stuff or common mistakes.<br>
💡 **Tip**: How to do something a bit more efficiently or effectively.<br>
🎬 **Demo**: Showing off something more advanced – so you know what Python can be used for!<br>

### Learning Objectives
1. [Reflection: To Scrape Or Not To Scrape](#when)
2. [Extracting and Parsing HTML](#extract)
3. [Scraping the Illinois General Assembly](#scrape)

<a id='when'></a>

# To Scrape Or Not To Scrape

When we'd like to access data from the web, we should first check whether the website we are interested in offers a Web API. Platforms like Twitter, Reddit, and the New York Times offer APIs. **Check out D-Lab's [Python Web APIs](https://github.com/dlab-berkeley/Python-Web-APIs) workshop if you want to learn how to use APIs.**

However, there are often cases when a Web API does not exist. In these cases, we may have to resort to web scraping, where we extract the underlying HTML from a web page and directly obtain the information we want. There are several packages in Python we can use to accomplish these tasks. We'll focus on two packages: Requests and Beautiful Soup.

Our case study will be scraping information on the [state senators of Illinois](http://www.ilga.gov/senate), as well as the [list of bills](http://www.ilga.gov/senate/SenatorBills.asp?MemberID=1911&GA=98&Primary=True) each senator has sponsored. Before we get started, peruse these websites to take a look at their structure.

## Installation

We will use two main packages: [Requests](http://docs.python-requests.org/en/latest/user/quickstart/) and [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/). Go ahead and install these packages, if you haven't already:

In [1]:
%pip install requests
# Installs the Requests package, which is used to make HTTP requests
# (for example, with the GET or POST methods)

Collecting requests
  Using cached requests-2.32.5-py3-none-any.whl.metadata (4.9 kB)
Collecting charset_normalizer<4,>=2 (from requests)
  Using cached charset_normalizer-3.4.3-cp313-cp313-win_amd64.whl.metadata (37 kB)
Collecting idna<4,>=2.5 (from requests)
  Using cached idna-3.10-py3-none-any.whl.metadata (10 kB)
Collecting urllib3<3,>=1.21.1 (from requests)
  Using cached urllib3-2.5.0-py3-none-any.whl.metadata (6.5 kB)
Collecting certifi>=2017.4.17 (from requests)
  Using cached certifi-2025.8.3-py3-none-any.whl.metadata (2.4 kB)
Using cached requests-2.32.5-py3-none-any.whl (64 kB)
Using cached charset_normalizer-3.4.3-cp313-cp313-win_amd64.whl (107 kB)
Using cached idna-3.10-py3-none-any.whl (70 kB)
Using cached urllib3-2.5.0-py3-none-any.whl (129 kB)
Using cached certifi-2025.8.3-py3-none-any.whl (161 kB)
Installing collected packages: urllib3, idna, charset_normalizer, certifi, requests


In [2]:
%pip install beautifulsoup4
# Installs the beautifulsoup4 package, which extracts
# information from HTML and XML documents

Collecting beautifulsoup4
  Using cached beautifulsoup4-4.13.5-py3-none-any.whl.metadata (3.8 kB)
Collecting soupsieve>1.2 (from beautifulsoup4)
  Using cached soupsieve-2.7-py3-none-any.whl.metadata (4.6 kB)
Collecting typing-extensions>=4.0.0 (from beautifulsoup4)
  Downloading typing_extensions-4.15.0-py3-none-any.whl.metadata (3.3 kB)
Using cached beautifulsoup4-4.13.5-py3-none-any.whl (105 kB)
Using cached soupsieve-2.7-py3-none-any.whl (36 kB)
Downloading typing_extensions-4.15.0-py3-none-any.whl (44 kB)
Installing collected packages: typing-extensions, soupsieve, beautifulsoup4


We'll also install the `lxml` package, which helps support some of the parsing that Beautiful Soup performs:

In [3]:
%pip install lxml
# Installs the lxml package, which processes
# HTML and XML documents

Collecting lxml
  Using cached lxml-6.0.1-cp313-cp313-win_amd64.whl.metadata (3.9 kB)
Using cached lxml-6.0.1-cp313-cp313-win_amd64.whl (4.0 MB)
Installing collected packages: lxml
Successfully installed lxml-6.0.1
Note: you may need to restart the kernel to use updated packages.


In [None]:
# Import required libraries
from bs4 import BeautifulSoup  # the BeautifulSoup class from the bs4 package
from datetime import datetime  # the datetime class from the datetime module
import requests  # the Requests library
import time  # the time module

<a id='extract'></a>

# Extracting and Parsing HTML 

In order to successfully scrape and analyse HTML, we'll go through the following four steps:
1. Make a GET request
2. Parse the page with Beautiful Soup
3. Search for HTML elements
4. Get attributes and text of these elements

## Step 1: Make a GET Request to Obtain a Page's HTML

We can use the Requests library to:

1. Make a GET request to the page, and
2. Read in the webpage's HTML code.

The process of making a request and obtaining a result resembles that of the Web API workflow. Now, however, we're making a request directly to the website, and we're going to have to parse the HTML ourselves. This is in contrast to being provided data organized into a more straightforward `JSON` or `XML` output.

In [4]:
# Make a GET request
req = requests.get('http://www.ilga.gov/senate/default.asp')  # send a GET request to the URL
# Read the content of the server’s response
src = req.text  # the response body (the page's HTML source) as a string
# View some output
print(src[:1000])  # print the first 1000 characters of the HTML

NameError: name 'requests' is not defined
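A slightly more defensive version of the request above can be sketched as follows. This is a minimal sketch, not part of the workshop code: the helper name `fetch_html`, the 10-second timeout, and the injectable `get` argument are all assumptions made for illustration. It relies only on the `timeout` argument of `requests.get` and the `raise_for_status()` method of the response object.

```python
import requests


def fetch_html(url, timeout=10, get=requests.get):
    """Return the HTML of `url` as a string, raising on HTTP errors.

    `fetch_html`, the default timeout, and the `get` parameter are
    hypothetical choices for this sketch, not part of the notebook.
    """
    # A timeout stops the call from hanging forever on a slow server.
    response = get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of
    # quietly handing an error page to the parser.
    response.raise_for_status()
    return response.text
```

With this helper, `src = fetch_html('http://www.ilga.gov/senate/default.asp')` would play the same role as `req.text` in the cell above.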

## Step 2: Parse the Page with Beautiful Soup

Now, we use the `BeautifulSoup` class to parse the response into an HTML tree. This returns an object (called a **soup object**) which contains all of the HTML in the original document.

If you run into an error about a parser library, make sure you've installed the `lxml` package to provide Beautiful Soup with the necessary parsing tools.

In [5]:
# Parse the response into an HTML tree
soup = BeautifulSoup(src, 'lxml')  # parse the HTML stored in src; 'lxml' names
# the parser that Beautiful Soup should use
# Take a look
print(soup.prettify()[:1000])  # prettify() formats the HTML with indentation and
# line breaks to make it more readable; we print its first 1000 characters

NameError: name 'BeautifulSoup' is not defined

The output looks pretty similar to the above, but now it's organized in a `soup` object which allows us to more easily traverse the page.
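To see what a soup object offers without depending on a live website, here is a minimal sketch that parses a tiny hand-written document. The HTML string and the name `soup_demo` are invented for illustration. It uses Python's built-in `html.parser` rather than `lxml`, which is an alternative parser choice and not what the rest of this notebook uses.

```python
from bs4 import BeautifulSoup

# A tiny hand-written document, standing in for a downloaded page.
html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# 'html.parser' ships with Python, so this runs even without lxml.
soup_demo = BeautifulSoup(html, "html.parser")

# Tags can be reached as attributes of the soup object.
print(soup_demo.title.text)  # Demo
print(soup_demo.p.text)      # Hello
```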

## Step 3: Search for HTML Elements

Beautiful Soup has a number of functions to find useful components on a page. Beautiful Soup lets you find elements by their:

1. HTML tags
2. HTML Attributes
3. CSS Selectors

Let's search first for **HTML tags**. 

The method `find_all` searches the `soup` tree for all the elements with a particular HTML tag, and returns all of those elements.

What does the example below do?

In [None]:
# Find all elements with a certain tag
a_tags = soup.find_all("a")  # find every "a" (link) tag in the document
# and store the results in a list-like object called a_tags
print(a_tags[:10])  # print the first 10 elements of a_tags

NameError: name 'soup' is not defined

Because `find_all()` is the most popular method in the Beautiful Soup search API, you can use a shortcut for it. If you treat the BeautifulSoup object as though it were a function, then it’s the same as calling `find_all()` on that object. 

These two lines of code are equivalent:

In [None]:
a_tags = soup.find_all("a")  # find every "a" (link) tag in the document
a_tags_alt = soup("a")  # calling the soup object directly is shorthand for find_all()
print(a_tags[0])  # print the first element of a_tags
print(a_tags_alt[0])  # print the first element of a_tags_alt

NameError: name 'soup' is not defined

How many links did we obtain?

In [8]:
print(len(a_tags))  # print the number of elements in a_tags

NameError: name 'a_tags' is not defined

That's a lot! Many elements on a page will have the same HTML tag. For instance, if you search for everything with the `a` tag, you're likely to get many hits, many of which you might not want. Remember, the `a` tag defines a hyperlink, so you'll usually find many on any given page.

What if we wanted to search for HTML tags with certain attributes, such as particular CSS classes? 

We can do this by passing an additional argument to `find_all`. In the example below, we find all the `a` tags and keep only those with `class_="sidemenu"`.

In [None]:
# Get only the 'a' tags in 'sidemenu' class
side_menus = soup("a", class_="sidemenu")  # shorthand for find_all(); keeps only
# the "a" tags whose class is "sidemenu" (the links in the page's side menu)
side_menus[:5]  # show the first 5 matches

NameError: name 'soup' is not defined
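The class filter can also be tried on a small hand-written snippet. This is a minimal sketch with invented HTML (the links and the name `demo` are not from the ILGA page), parsed with the built-in `html.parser`. It shows that `class_` matches any element that has the given class among its classes, and that the tag name still restricts the search.

```python
from bs4 import BeautifulSoup

html = """
<a class="sidemenu" href="/one">One</a>
<a class="mainmenu" href="/two">Two</a>
<a class="sidemenu extra" href="/three">Three</a>
<p class="sidemenu">Not a link</p>
"""
demo = BeautifulSoup(html, "html.parser")

# `class_` has a trailing underscore because `class` is a reserved word in Python.
side_links = demo.find_all("a", class_="sidemenu")

# The <p> is skipped (wrong tag) and "Two" is skipped (wrong class).
print([link.text for link in side_links])  # ['One', 'Three']
```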

Another, often more concise, way to search for elements on a website is via a **CSS selector**. For this we use a different method called `select()`. Pass a CSS selector string to `.select()` to get all elements that match that selector.

In the example above, we can use `"a.sidemenu"` as a CSS selector, which returns all `a` tags with class `sidemenu`.

In [10]:
# Get elements with "a.sidemenu" CSS Selector.
selected = soup.select("a.sidemenu")  # select every "a" tag with class "sidemenu"
selected[:5]  # show the first 5 matches

NameError: name 'soup' is not defined
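CSS selectors can express more than a tag plus a class. The following minimal sketch uses invented HTML (the `nav` id and the links are assumptions for illustration) to compare a tag-and-class selector, a descendant selector, and `select_one`, which returns only the first match.

```python
from bs4 import BeautifulSoup

html = """
<div id="nav">
  <a class="sidemenu" href="/one">One</a>
  <a class="mainmenu" href="/two">Two</a>
</div>
<a class="sidemenu" href="/three">Three</a>
"""
demo = BeautifulSoup(html, "html.parser")

# Tag plus class: every <a> with class "sidemenu".
everywhere = [a.text for a in demo.select("a.sidemenu")]
# Descendant selector: only the "sidemenu" links inside the element with id "nav".
inside_nav = [a.text for a in demo.select("#nav a.sidemenu")]
# select_one returns the first match (or None) instead of a list.
main_href = demo.select_one("a.mainmenu")["href"]

print(everywhere)  # ['One', 'Three']
print(inside_nav)  # ['One']
print(main_href)   # /two
```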

## 🥊 Challenge: Find All

Use BeautifulSoup to find all the `a` elements with class `mainmenu`.

In [None]:
# YOUR CODE HERE


## Step 4: Get Attributes and Text of Elements

Once we identify elements, we want to access the information in those elements. Usually, this means two things:

1. Text
2. Attributes

Getting the text inside an element is easy. All we have to do is use the `text` member of a `tag` object:

In [None]:
# Get all sidemenu links as a list
side_menu_links = soup.select("a.sidemenu")  # select the links in the side menu

# Examine the first link
first_link = side_menu_links[0]  # store the first element of side_menu_links
print(first_link)  # print the tag

# What class is this variable?
print('Class: ', type(first_link))  # print the type of first_link

NameError: name 'soup' is not defined

It's a Beautiful Soup tag! This means it has a `text` member:

In [12]:
print(first_link.text)  # print the text inside the tag

NameError: name 'first_link' is not defined

Sometimes we want the value of certain attributes. This is particularly relevant for `a` tags, or links, where the `href` attribute tells us where the link goes.

💡 **Tip**: You can access a tag’s attributes by treating the tag like a dictionary:

In [13]:
print(first_link['href'])  # print the URL stored in the tag's href attribute

NameError: name 'first_link' is not defined
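Dictionary-style access raises a `KeyError` when the attribute is missing, which can interrupt a loop over many tags. A minimal sketch of the safer alternatives, using invented HTML (the two links below are not from the ILGA page): `.get()` returns `None` for a missing attribute, and `get_text(strip=True)` trims surrounding whitespace from the text.

```python
from bs4 import BeautifulSoup

html = '<a class="sidemenu" href="/senate/default.asp">  Members  </a><a>No link</a>'
demo = BeautifulSoup(html, "html.parser")
first, second = demo.find_all("a")

# Dictionary-style access works when the attribute exists.
print(first["href"])
# get_text(strip=True) removes the whitespace around the text.
print(first.get_text(strip=True))
# .get() returns None instead of raising KeyError for a missing attribute.
print(second.get("href"))
```

Checking for `None` before using the value is one way to skip anchors that have no `href`.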

## 🥊 Challenge: Extract specific attributes

Extract all `href` attributes for each `mainmenu` URL.

In [None]:
# YOUR CODE HERE


<a id='scrape'></a>

# Scraping the Illinois General Assembly

Believe it or not, those are really the fundamental tools you need to scrape a website. Once you spend more time familiarizing yourself with HTML and CSS, then it's simply a matter of understanding the structure of a particular website and intelligently applying the tools of Beautiful Soup and Python.

Let's apply these skills to scrape the [Illinois 98th General Assembly](http://www.ilga.gov/senate/default.asp?GA=98).

Specifically, our goal is to scrape information on each senator, including their name, district, and party.

## Scrape and Soup the Webpage

Let's scrape and parse the webpage, using the tools we learned in the previous section.

In [None]:
# Make a GET request
req = requests.get('http://www.ilga.gov/senate/default.asp?GA=98')  # send a GET request to the URL
# Read the content of the server’s response
src = req.text  # the response body (the page's HTML source) as a string
# Soup it
soup = BeautifulSoup(src, "lxml")  # parse the HTML in src using the lxml parser

NameError: name 'requests' is not defined

## Search for the Table Elements

Our goal is to obtain the elements in the table on the webpage. Remember: rows are identified by the `tr` tag. Let's use `find_all` to obtain these elements.

In [15]:
# Get all table row elements
rows = soup.find_all("tr")  # find every "tr" (table row) tag in the document
# and store the results in rows
len(rows)  # number of rows found

NameError: name 'soup' is not defined

⚠️ **Warning**: Keep in mind: `find_all` gets *all* the elements with the `tr` tag. We only want some of them. If we use the 'Inspect' function in Google Chrome and look carefully, then we can use some CSS selectors to get just the rows we're interested in. Specifically, we want the inner rows of the table:

In [None]:
# Returns every ‘tr tr tr’ css selector in the page
rows = soup.select('tr tr tr')  # CSS descendant selector: every <tr> nested
# inside a <tr> that is itself nested inside another <tr>

for row in rows[:5]:  # loop over the first 5 rows
    print(row, '\n')  # print each row followed by a blank line

NameError: name 'soup' is not defined

It looks like we want everything after the first two rows. Let's work with a single row to start, and build our loop from there.

In [17]:
example_row = rows[2]  # store the third row (index 2) in example_row
print(example_row.prettify())  # print the row's HTML, formatted by prettify()

NameError: name 'rows' is not defined

Let's break this row down into its component cells/columns using the `select` method with CSS selectors. Looking closely at the HTML, there are a couple of ways we could do this.

* We could identify the cells by their tag `td`.
* We could use the class name `.detail`.
* We could combine both and use the selector `td.detail`.

In [19]:
for cell in example_row.select('td'):  # every cell (td) in example_row
    print(cell)  # print the cell
print()  # blank line to separate this loop's output from the next

for cell in example_row.select('.detail'):  # every element with class "detail" in example_row
    print(cell)
print()

for cell in example_row.select('td.detail'):  # every td with class "detail" in example_row
    print(cell)
print()

NameError: name 'example_row' is not defined

We can confirm that these are all the same.

In [None]:
assert example_row.select('td') == example_row.select('.detail') == example_row.select('td.detail')
# Checks that the three selectors return the same elements. This only holds
# if every <td> in the row has the class "detail" and no other element
# in the row has that class.

NameError: name 'example_row' is not defined
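The three selectors agree only when every cell carries the class and nothing else does. A minimal sketch on hand-written HTML (the row below is invented for illustration and is not taken from the ILGA page) shows how they can differ:

```python
from bs4 import BeautifulSoup

html = """
<table><tr>
  <td class="detail">Name</td>
  <td class="detail">District</td>
  <td class="header">Spacer</td>
  <td><span class="detail">Note</span></td>
</tr></table>
"""
row = BeautifulSoup(html, "html.parser").tr

n_td = len(row.select("td"))                # every cell in the row
n_detail = len(row.select(".detail"))       # anything with class "detail", including the <span>
n_td_detail = len(row.select("td.detail"))  # only <td> cells with class "detail"

print(n_td, n_detail, n_td_detail)  # 4 3 2
```

Here `td.detail` is the narrowest of the three, which is why combining tag and class is usually the safest choice when a page's markup is not fully known.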

Let's use the selector `td.detail` to be as specific as possible.

In [21]:
# Select only those 'td' tags with class 'detail' 
detail_cells = example_row.select('td.detail')  # list of the row's <td class="detail"> cells
detail_cells


NameError: name 'example_row' is not defined

Most of the time, we're interested in the actual **text** of a website, not its tags. Recall that to get the text of an HTML element, we use the `text` member:

In [22]:
# Keep only the text in each of those cells
row_data = [cell.text for cell in detail_cells]  # list of the text in each cell
# of detail_cells

print(row_data)  # print row_data

NameError: name 'detail_cells' is not defined

Looks good! Now we just use our basic Python knowledge to get the elements of this list that we want. Remember, we want the senator's name, their district, and their party.

In [23]:
print(row_data[0])  # Name: first element (index 0)
print(row_data[3])  # District: fourth element (index 3)
print(row_data[4])  # Party: fifth element (index 4)

NameError: name 'row_data' is not defined
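Scraped text often carries stray whitespace, and numbers arrive as strings. The sketch below cleans a made-up row before storing it. The list `row_data_demo` and its values are invented placeholders that only mimic the five-column layout used above; they are not real data from the site.

```python
# Hypothetical row with the same column layout as above:
# name, two link columns, district, party.
row_data_demo = ["  Jane Doe ", "Bills", "Committees", " 12", "D "]

name = row_data_demo[0].strip()   # drop leading/trailing whitespace
district = int(row_data_demo[3])  # int() ignores surrounding whitespace
party = row_data_demo[4].strip()

senator_demo = (name, district, party)
print(senator_demo)  # ('Jane Doe', 12, 'D')
```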

## Getting Rid of Junk Rows

We saw at the beginning that not all of the rows we got actually correspond to a senator. We'll need to do some cleaning before we can proceed forward. Take a look at some examples:

In [24]:
print('Row 0:\n', rows[0], '\n')  # label, then the first row (index 0), then a blank line
print('Row 1:\n', rows[1], '\n')  # label, then the second row (index 1), then a blank line
print('Last Row:\n', rows[-1])  # label, then the last row (index -1)

NameError: name 'rows' is not defined

When we write our for loop, we only want it to apply to the relevant rows. So we'll need to filter out the irrelevant rows. The way to do this is to compare some of these to the rows we do want, see how they differ, and then formulate that in a conditional.

As you can imagine, there are a lot of possible ways to do this, and it'll depend on the website. We'll show some here to give you an idea of how to do this.

In [25]:
# Bad rows
print(len(rows[0]))  # number of direct children of the first row (index 0)
print(len(rows[1]))  # number of direct children of the second row (index 1)

# Good rows
print(len(rows[2]))  # number of direct children of the third row (index 2)
print(len(rows[3]))  # number of direct children of the fourth row (index 3)

NameError: name 'rows' is not defined

Perhaps good rows have a length of 5. Let's check:

In [None]:
good_rows = [row for row in rows if len(row) == 5]
# Build a new list, good_rows, containing only the rows whose length is exactly 5
# Let's check some rows
print(good_rows[0], '\n')  # first element of good_rows, then a blank line
print(good_rows[-2], '\n')  # second-to-last element, then a blank line
print(good_rows[-1])  # last element of good_rows

NameError: name 'rows' is not defined

We found a footer row in our list that we'd like to avoid. Let's try something else:

In [28]:
rows[2].select('td.detail')
# Take the third row (index 2) and return all of its <td> cells with class "detail"

NameError: name 'rows' is not defined

In [None]:
# Bad row
print(rows[-1].select('td.detail'), '\n')
# Print the <td class="detail"> cells found in the last row (rows[-1]),
# followed by a blank line

# Good row
print(rows[5].select('td.detail'), '\n')
# Print the <td class="detail"> cells found in the sixth row (rows[5]),
# followed by a blank line

# How about this?
good_rows = [row for row in rows if row.select('td.detail')]
# Keep a row only if select() finds at least one <td> with class "detail" in it;
# an empty result is falsy, so rows without such cells are dropped

print("Checking rows...\n")  # print a message followed by a blank line
print(good_rows[0], '\n')  # first element of good_rows, then a blank line
print(good_rows[-1])  # last element of good_rows

NameError: name 'rows' is not defined

Looks like we found something that worked!
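The filter works because `select()` returns an empty list when nothing matches, and an empty list is falsy in Python. A minimal sketch on an invented table (the names and numbers are placeholders, not real senators) makes the effect visible:

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr><td class="heading">Senator</td><td class="heading">District</td></tr>
  <tr><td class="detail">Jane Doe</td><td class="detail">12</td></tr>
  <tr><td class="detail">John Roe</td><td class="detail">34</td></tr>
  <tr><td class="footer">Page generated automatically</td></tr>
</table>
"""
demo = BeautifulSoup(html, "html.parser")
all_rows = demo.select("tr")

# Rows with no 'td.detail' cells produce an empty (falsy) result and are dropped.
kept = [row for row in all_rows if row.select("td.detail")]
kept_text = [[cell.text for cell in row.select("td.detail")] for row in kept]

print(len(all_rows), len(kept))  # 4 2
print(kept_text)                 # [['Jane Doe', '12'], ['John Roe', '34']]
```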

## Loop it All Together

Now that we've seen how to get the data we want from one row, as well as filter out the rows we don't want, let's put it all together into a loop.

In [31]:
# Define storage list
members = []  # empty list
# Get rid of junk rows
valid_rows = [row for row in rows if row.select('td.detail')]
# Keep a row only if it contains at least one <td> with class "detail"
# Loop through all rows
for row in valid_rows:
    # Select only those 'td' tags with class 'detail'
    detail_cells = row.select('td.detail')
    # Keep only the text in each of those cells
    row_data = [cell.text for cell in detail_cells]
    # Collect information
    name = row_data[0]
    district = int(row_data[3])
    party = row_data[4]
    # Store in a tuple
    senator = (name, district, party)
    # Append to list
    members.append(senator)

NameError: name 'rows' is not defined

In [32]:
# Should be 61
len(members)  # number of elements in members

0

Let's take a look at what we have in `members`.

In [33]:
print(members[:5])  # print the first five elements of members

[]


## 🥊  Challenge: Get `href` elements pointing to members' bills 

The code above retrieves information on:  

- the senator's name,
- their district number,
- and their party.

We now want to retrieve the URL for each senator's list of bills. Each URL will follow a specific format. 

The format for the list of bills for a given senator is:

`http://www.ilga.gov/senate/SenatorBills.asp?GA=98&MemberID=[MEMBER_ID]&Primary=True`

to get something like:

`http://www.ilga.gov/senate/SenatorBills.asp?MemberID=1911&GA=98&Primary=True`

in which `MEMBER_ID=1911`. 

You should be able to see that, unfortunately, `MEMBER_ID` is not currently something pulled out in our scraping code.

Your initial task is to modify the code above so that we also **retrieve the full URL which points to the corresponding page of primary-sponsored bills**, for each member, and return it along with their name, district, and party.

Tips: 

* To do this, you will want to get the appropriate anchor element (`<a>`) in each legislator's row of the table. You can again use the `.select()` method on the `row` object in the loop to do this — similar to the command that finds all of the `td.detail` cells in the row. Remember that we only want the link to the legislator's bills, not the committees or the legislator's profile page.
* The anchor elements' HTML will look like `<a href="/senate/Senator.asp/...">Bills</a>`. The string in the `href` attribute contains the **relative** link we are after. You can access an attribute of a BeautifulSoup `Tag` object the same way you access a Python dictionary: `anchor['attributeName']`. See the <a href="http://www.crummy.com/software/BeautifulSoup/bs4/doc/#tag">documentation</a> for more details.
* There are a _lot_ of different ways to use BeautifulSoup to get things done. Whatever you need to do to pull the `href` out is fine.
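One possible way to turn a relative link into a full URL, once you have it, is `urljoin` from Python's standard library. This is a sketch of that single step only, not the challenge solution, and it is one option among several (plain string concatenation also works). The relative path below is assumed from the example URL shown above; the variable names are illustrative.

```python
from urllib.parse import urljoin

# The page on which the relative link was found.
page_url = "http://www.ilga.gov/senate/default.asp?GA=98"
# A relative href in the format described above (MemberID from the example URL).
relative_href = "/senate/SenatorBills.asp?MemberID=1911&GA=98&Primary=True"

# urljoin keeps the scheme and host of page_url and swaps in the new path and query.
full_url = urljoin(page_url, relative_href)
print(full_url)
# http://www.ilga.gov/senate/SenatorBills.asp?MemberID=1911&GA=98&Primary=True
```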

The code has been partially filled out for you. Fill it in where it says `#YOUR CODE HERE`. Save the path into an object called `full_path`.

In [None]:
# Make a GET request
req = requests.get('http://www.ilga.gov/senate/default.asp?GA=98')  # send a GET request to the URL
# Read the content of the server’s response
src = req.text  # the response body (the page's HTML source) as a string
# Soup it
soup = BeautifulSoup(src, "lxml")  # parse the HTML in src using the lxml parser
# Create empty list to store our data
members = []  # empty list

# Returns every ‘tr tr tr’ css selector in the page
rows = soup.select('tr tr tr')
# CSS descendant selector: finds every <tr> nested inside other <tr> elements,
# which is how this page's nested tables hold the rows we want
# Get rid of junk rows
rows = [row for row in rows if row.select('td.detail')]
# Keep a row only if it contains at least one <td> with class "detail"

# Loop through all rows
for row in rows:
    # Select only those 'td' tags with class 'detail'
    detail_cells = row.select('td.detail') 
    # Keep only the text in each of those cells
    row_data = [cell.text for cell in detail_cells]
    # Collect information
    name = row_data[0]
    district = int(row_data[3])
    party = row_data[4]

    # YOUR CODE HERE
    full_path = ''

    # Store in a tuple
    senator = (name, district, party, full_path)
    # Append to list
    members.append(senator)

NameError: name 'requests' is not defined

In [None]:
# Uncomment to test 
# members[:5]

## 🥊  Challenge: Modularize Your Code

Turn the code above into a function that accepts a URL, scrapes the URL for its senators, and returns a list of tuples containing information about each senator. 

In [35]:
# YOUR CODE HERE
def get_members(url):  # function that scrapes and returns the members
    return [___]


In [37]:
# Test your code
url = 'http://www.ilga.gov/senate/default.asp?GA=98'  # URL of the senate page
senate_members = get_members(url)  # scrape the members listed at the URL
len(senate_members)  # number of members returned

1

## 🥊 Take-home Challenge: Writing a Scraper Function

We want to scrape the webpages corresponding to the bills sponsored by each senator.

Write a function called `get_bills(url)` to parse a given bills URL. This will involve:

  - requesting the URL using the <a href="http://docs.python-requests.org/en/latest/">`requests`</a> library
  - using the features of the `BeautifulSoup` library to find all of the `<td>` elements with the class `billlist`
  - return a _list_ of tuples, each with:
      - description (2nd column)
      - chamber (S or H) (3rd column)
      - the last action (4th column)
      - the last action date (5th column)
      
This function has been partially completed. Fill in the rest.

In [None]:
def get_bills(url):  # scrape the bills listed at the given URL
    src = requests.get(url).text  # GET the page and keep its HTML as a string
    soup = BeautifulSoup(src, 'lxml')  # parse the HTML into a tree of objects
    # that can be searched easily
    rows = soup.select('tr')  # select every <tr> (table row) element in the
    # parsed document
    bills = []  # empty list to collect the bills
    for row in rows:
        # YOUR CODE HERE
        bill_id =
        description =
        chamber =
        last_action =
        last_action_date =
        bill = (bill_id, description, chamber, last_action, last_action_date)
        bills.append(bill)
    return bills

SyntaxError: invalid syntax (2408095071.py, line 14)

In [None]:
# Uncomment to test your code
# test_url = senate_members[0][3]
# get_bills(test_url)[0:5]

### Scrape All Bills

Finally, create a dictionary `bills_dict` which maps a district number (the key) onto a list of bills (the value) coming from that district. You can do this by looping over all of the senate members in `senate_members` and calling `get_bills()` for each of their associated bill URLs.

**NOTE:** please call the function `time.sleep(1)` for each iteration of the loop, so that we don't destroy the state's web site.
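The pausing pattern itself can be sketched independently of the scraper. In the minimal sketch below, `scrape_politely`, `urls_by_key`, and `fetch` are hypothetical names introduced for illustration; `fetch` stands in for a function such as `get_bills`, and the one-second default mirrors the note above.

```python
import time


def scrape_politely(urls_by_key, fetch, delay=1.0):
    """Call `fetch(url)` for each entry, pausing `delay` seconds after each request.

    All names here are illustrative assumptions, not part of the notebook.
    """
    results = {}
    for key, url in urls_by_key.items():
        results[key] = fetch(url)
        # Pause so that requests are spread out rather than sent in a burst.
        time.sleep(delay)
    return results
```

Passing the fetching function as an argument keeps the loop easy to try out with a stand-in before pointing it at the real site.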

In [None]:
# YOUR CODE HERE


In [None]:
# Uncomment to test your code
# bills_dict[52]