

# **Hands-on Lab: Web Scraping**


Estimated time needed: **30 to 45** minutes


## Objectives


In this lab you will perform the following:


* Extract information from a given website.
* Write the scraped data to a CSV file.


## Extract information from the given web site
You will extract the data from the website below:


In [3]:
# This URL contains the data you need to scrape
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/Programming_Languages.html"

The data you need to scrape is the **name of the programming language** and the **average annual salary**.<br> It is a good idea to open the URL in your web browser and study the contents of the web page before you start to scrape.


Import the required libraries


In [4]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

Download the web page at the URL


In [7]:
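try:
    # A timeout keeps the cell from hanging indefinitely if the server does not respond
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # Raise an exception for 4xx/5xx status codes
    html_content = response.text  # Get the HTML content as text

except requests.exceptions.RequestException as e:
    print(f"Error during request: {e}")
    raise  # Stop the cell here; exit() would shut down the notebook kernel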

Create a soup object


In [8]:
soup = BeautifulSoup(html_content, "html.parser")
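
Before working on the real page, it can help to see how a soup object is navigated on a tiny input. The sketch below parses a made-up two-row table; the HTML string, the column layout, and the `sample_` names are assumptions for illustration only and are not taken from the lab's page.

```python
from bs4 import BeautifulSoup

# Hypothetical table, invented for this illustration (not the lab's page).
sample_html = """
<table>
  <tr><td>No.</td><td>Language</td><td>Other</td><td>Salary</td></tr>
  <tr><td>1</td><td>Python</td><td>-</td><td>$114,383</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first matching tag (or None); find_all() returns all matches.
sample_rows = sample_soup.find("table").find_all("tr")

# get_text(strip=True) returns the cell text without surrounding whitespace.
first_data_cells = [cell.get_text(strip=True) for cell in sample_rows[1].find_all("td")]

print(len(sample_rows))
print(first_data_cells)
```

Printing the cells of one row like this is a quick way to confirm which index holds the language name and which holds the salary.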

Scrape the `Language name` and `annual average salary`.


In [9]:
table = soup.find('table')
if table is None:
    # Stop the cell here; exit() would shut down the notebook kernel
    raise ValueError("Error: Could not find the table on the page.")

# Extract the data (language name and average annual salary)
data = []  # List to store the scraped data
# Iterate over the rows, skipping the header row.
# The header row is usually a <tr> tag containing <th> tags, and
# the data rows are <tr> tags containing <td> tags.

for row in table.find_all('tr')[1:]:  # [1:] slices the list to skip the first row (header)
    columns = row.find_all('td')
    if len(columns) >= 4:  # Ensure we have enough columns
        language_name = columns[1].get_text(strip=True)  # Index 1 (second column): language name
        salary_str = columns[3].get_text(strip=True)     # Index 3 (fourth column): salary

        # Clean the salary: remove the dollar sign and thousands separators
        salary_str = salary_str.replace('$', '').replace(',', '')

        # Convert salary to a number (handling potential errors)
        try:
            salary = float(salary_str)
        except ValueError:
            print(f"Warning: Could not convert salary to float for: {language_name}")
            salary = None  # Or use some other default, like 0, or skip this row

        data.append([language_name, salary])  # Store as a list of lists
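
The cleaning and conversion step inside the loop can also be factored into a small helper so it can be checked on its own. This is a minimal sketch, assuming salaries look like `$114,383`; the name `parse_salary` is hypothetical and not part of the lab.

```python
from typing import Optional


def parse_salary(text: str) -> Optional[float]:
    """Convert a string such as '$114,383' to a float.

    Returns None when the cleaned text is not a valid number.
    """
    cleaned = text.strip().replace("$", "").replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None


print(parse_salary("$114,383"))  # 114383.0
print(parse_salary("n/a"))       # None
```

Returning `None` for unparseable values keeps the row in the data, and pandas will store it as a missing value in the salary column.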

Save the scraped data into a file named *popular-languages.csv*


In [11]:
df = pd.DataFrame(data, columns=['Language Name', 'Average Annual Salary'])

df.to_csv("popular-languages.csv", index=False)
print("Data saved to popular-languages.csv")

print(df)

Data saved to popular-languages.csv
  Language Name  Average Annual Salary
0        Python               114383.0
1          Java               101013.0
2             R                92037.0
3    Javascript               110981.0
4         Swift               130801.0
5           C++               113865.0
6            C#                88726.0
7           PHP                84727.0
8           SQL                84793.0
9            Go                94082.0
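
One way to confirm that a file was written as expected is to read it back with `pd.read_csv` and compare the shape and column names. The sketch below does a round trip with two rows copied from the output above; it writes to a temporary directory, which is an assumption made so the example does not overwrite the lab's file.

```python
import os
import tempfile

import pandas as pd

# Two rows copied from the printed output; the temporary path is for illustration only.
sample_df = pd.DataFrame(
    [["Python", 114383.0], ["Java", 101013.0]],
    columns=["Language Name", "Average Annual Salary"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "popular-languages.csv")
    sample_df.to_csv(csv_path, index=False)  # index=False omits the row-number column
    reloaded = pd.read_csv(csv_path)

print(reloaded.shape)
print(list(reloaded.columns))
```

The same check can be run on the lab's own file by calling `pd.read_csv("popular-languages.csv")` after the save cell.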


## Authors


Ramesh Sannareddy


### Other Contributors


Rav Ahuja


## Change Log


|  Date (YYYY-MM-DD) |  Version | Changed By  |  Change Description |
|---|---|---|---|
| 2020-10-17  | 0.1  | Ramesh Sannareddy  |  Created initial version of the lab |


 Copyright &copy; 2020 IBM Corporation. This notebook and its source code are released under the terms of the [MIT License](https://cognitiveclass.ai/mit-license/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDA0321ENSkillsNetwork928-2022-01-01).
