<a href="https://colab.research.google.com/github/yotam-biu/tutorial6/blob/main/scraping_solution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Read `HTML`

## Dependency

**Import the necessary libraries**
* Begin by importing the requests library to send HTTP requests.
* Then, import BeautifulSoup from the bs4 module to parse HTML content.
* Finally, import the pandas library as pd to work with tabular data.

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

## Read as a normal HTML file

**requests**

1. Get the URL address
To begin, you need to define or obtain the URL from which you want to fetch content. Use this [link](https://people.mbi.ucla.edu/sumchan/codon_table.html) to get to the website.


2. Use the `requests.get()` function to make a GET request
The `requests.get(url)` function takes the URL you want to retrieve data from as its argument. Here, pass the `url` variable defined in the first step.
`requests.get()` sends an HTTP GET request to that URL and returns a response object. The response object contains information about the request, including the content of the page.

3. Print the content of the URL using the `.content` attribute of the response object.

In [None]:
url = "https://people.mbi.ucla.edu/sumchan/codon_table.html"
url_content = requests.get(url).content
# url_content
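
The request step above can be wrapped with basic error handling. This is a minimal sketch rather than part of the original solution: the helper name `fetch_html` and the 10-second timeout are illustrative assumptions. It relies on `raise_for_status()`, which raises an exception for 4xx/5xx responses so that an error page is not parsed by mistake.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the raw bytes of the page at `url` (hypothetical helper)."""
    # The timeout (in seconds) is an assumed value; without one,
    # requests may wait indefinitely for a server that never answers.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx status codes.
    response.raise_for_status()
    return response.content


# Example call (needs network access):
# url_content = fetch_html("https://people.mbi.ucla.edu/sumchan/codon_table.html")
```

The returned bytes can be passed to BeautifulSoup exactly like `url_content` in the cell above.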

**BeautifulSoup**

1. Use BeautifulSoup to parse the HTML content you received from `response.content`. This is done by creating a BeautifulSoup object, passing the content as the first argument and the parser type (`'html.parser'`) as the second argument: `soup = BeautifulSoup(url_content, 'html.parser')`

2. How do you extract the first `<table>` element from the parsed content?

3. From the table you got in the last step, how do you extract all the `<tr>` (table row) elements?

4. For the rows you got in the last step, how do you access the first row of the table? After printing it, what problem do you see with the first row?

In [None]:
soup = BeautifulSoup(url_content, 'html.parser')

table = soup.find('table')      # first <table> in the document
rows = table.find_all('tr')     # rows of that table, as asked in step 3
rows[0]

## Read with the forgiving `html5lib` parser

Parse the same content again, this time passing `'html5lib'` instead of `'html.parser'` as the second argument to BeautifulSoup. That is the only change.
The `html5lib` parser is more forgiving than `html.parser` and is better at handling malformed HTML, such as tags that are never closed.

In [None]:
soup = BeautifulSoup(url_content, 'html5lib')
table = soup.find('table')

rows = table.find_all('tr')
rows[0]
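
To see why the parser choice matters, the two parsers can be compared on a small malformed table. This is a hedged sketch: `broken_html` is made-up markup (not the real page) and `cells_per_row` is a hypothetical helper. It assumes the `html5lib` package is installed.

```python
from bs4 import BeautifulSoup

# Made-up malformed table: no <td> or <tr> is ever closed.
broken_html = "<table><tr><td>UUU<td>Phe<tr><td>UUA<td>Leu</table>"


def cells_per_row(html, parser):
    """Count the <td> cells found inside each <tr>, using the given parser."""
    soup = BeautifulSoup(html, parser)
    return [len(row.find_all("td")) for row in soup.find("table").find_all("tr")]


print("html.parser:", cells_per_row(broken_html, "html.parser"))
print("html5lib:   ", cells_per_row(broken_html, "html5lib"))
```

With `html5lib`, each row reports two cells because the parser closes the open tags the way a browser would. With `html.parser`, the unclosed tags stay nested, so the first row also swallows the cells of the rows that follow it.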


## Extract the data cells

**Understand the Row Structure**

1. Write code to inspect the contents of the first row (`rows[0]`) and the second row (`rows[1]`) to determine the structure of the table. For each row:

  * Start by extracting the individual cells in the row using the `.find_all('td')` method.
  * Create an empty list to store the text content of each cell.
  * Loop through the extracted cells and append their `.text` content to the list.

2. Notice how the first row behaves differently compared to subsequent rows. What are the differences?





In [None]:
cells = rows[0].find_all('td')
cells_data = []
for cell in cells:
    cells_data.append(cell.text)
cells_data

In [None]:
cells = rows[1].find_all('td')  # second row, for comparison with rows[0]
cells_data = []
for cell in cells:
    cells_data.append(cell.text)
cells_data
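
The loop in the two cells above can also be written as a list comprehension. The sketch below runs on a made-up two-row table so that it works without network access. `row_texts` is a hypothetical helper, and it uses `get_text(strip=True)` instead of `.text` to drop surrounding whitespace, which is a choice made here, not something the exercise requires.

```python
from bs4 import BeautifulSoup

# Made-up two-row table standing in for the real page.
sample_html = """
<table>
  <tr><td> Codon </td><td> Amino acid </td></tr>
  <tr><td>UUU</td><td>Phe</td></tr>
</table>
"""


def row_texts(row):
    """Return the stripped text of every <td> cell in a row (hypothetical helper)."""
    return [cell.get_text(strip=True) for cell in row.find_all("td")]


sample_rows = BeautifulSoup(sample_html, "html.parser").find("table").find_all("tr")
print(row_texts(sample_rows[0]))
print(row_texts(sample_rows[1]))
```

On the real page, calling the same helper on `rows[0]` and `rows[1]` gives a quick side-by-side view of how the first row differs from the others.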

## Building a Dictionary

Your objective is to create a dictionary that maps data from the first column (key) to the fourth column (value) of a table.  Skip rows that don't have enough data.  

1. Initialize a Dictionary. Start by creating an empty dictionary where the keys and values will be stored.  

2. Loop Through Rows. Iterate through all the rows of the table, starting from the second row (skip the header row).  

3. Check for Incomplete Rows. For each row, check if it has at least four cells. If a row has fewer than four cells, skip it and move to the next row.  

4. Extract Key and Value. For rows with enough cells:  
  - Take the text content of the first cell as the key.  
  - Take the text content of the fourth cell as the value.  

5. Add to Dictionary. Add the key-value pair to the dictionary.  

6. Verify Your Work. After processing all rows, inspect the dictionary to ensure it contains the expected data.  

---

#### Final Questions to Reflect On:  
1. What kind of data is being used as keys and values in the dictionary?  
2. How does skipping incomplete rows help ensure data accuracy?  
3. What potential issues might arise if the table format changes?  



In [None]:

codons2aa = {}
for row in rows[1:]:  # skip the header row
    cells = row.find_all('td')
    if len(cells) < 4:  # skip rows that have no fourth cell
        continue
    key = cells[0].text
    value = cells[3].text
    codons2aa[key] = value



codons2aa
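
To check the skipping logic in isolation, the same steps can be run on a small made-up table that contains one short row. This is a sketch under stated assumptions: the sample markup and the helper `table_to_dict` (with its `key_col` and `value_col` parameters) are hypothetical and not part of the original solution.

```python
from bs4 import BeautifulSoup

# Made-up table: a header row, two complete rows, and one short row.
sample_html = """
<table>
  <tr><td>Codon</td><td>Name</td><td>Abbr3</td><td>Abbr1</td></tr>
  <tr><td>UUU</td><td>Phenylalanine</td><td>Phe</td><td>F</td></tr>
  <tr><td>note</td></tr>
  <tr><td>AUG</td><td>Methionine</td><td>Met</td><td>M</td></tr>
</table>
"""


def table_to_dict(rows, key_col=0, value_col=3):
    """Map the text of one column to another, skipping rows that are too short."""
    needed = max(key_col, value_col) + 1  # minimum number of cells per row
    mapping = {}
    for row in rows:
        cells = row.find_all("td")
        if len(cells) < needed:
            continue  # incomplete row: indexing it would raise IndexError
        key = cells[key_col].get_text(strip=True)
        mapping[key] = cells[value_col].get_text(strip=True)
    return mapping


sample_rows = BeautifulSoup(sample_html, "html.parser").find_all("tr")
print(table_to_dict(sample_rows[1:]))  # sample_rows[0] is the header
```

The short row is dropped silently. Without the length check, `cells[3]` would raise an `IndexError` on that row.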



## Read with pandas

In [None]:
url = "https://people.mbi.ucla.edu/sumchan/codon_table.html"  # URL or path to the HTML table.

df = pd.read_html(url, header=None)[0]  # Read the first table from the HTML.

df.columns = ["Codon", "Full Name", "3-Letter Abbreviation", "1-Letter Abbreviation", "Frequency"]  # Assign column names.

df.head()  # Display the first few rows of the DataFrame.
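
`pd.read_html` can also be tried offline on an HTML string wrapped in `StringIO`. The sketch below uses a made-up table and assumes that a parser backend for pandas is installed (`lxml`, or `bs4` together with `html5lib`). The names `sample_html` and `sample_df` are illustrative.

```python
from io import StringIO

import pandas as pd

# Made-up table; the first row holds the column names.
sample_html = """
<table>
  <tr><td>Codon</td><td>Letter</td></tr>
  <tr><td>UUU</td><td>F</td></tr>
  <tr><td>AUG</td><td>M</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds.
tables = pd.read_html(StringIO(sample_html), header=0)
sample_df = tables[0]
print(sample_df)
```

Here `header=0` tells pandas to use the first row as column names, whereas the cell above passes `header=None` and assigns the names by hand.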