<p style="text-align:center">
    <a href="https://skills.network/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDA0321ENSkillsNetwork928-2022-01-01" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Hands-on Lab: Web Scraping**


Estimated time needed: **30 to 45** minutes


## Objectives


In this lab you will perform the following:


* Extract information from a given website.
* Write the scraped data to a CSV file.


## Extract information from the given web site
You will extract the data from the web page at the URL below:


In [1]:
#this url contains the data you need to scrape
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/Programming_Languages.html"

The data you need to scrape is the **name of the programming language** and the **average annual salary**.<br> It is a good idea to open the URL in your web browser and study the contents of the web page before you start to scrape.


Import the required libraries


In [2]:
# Your code here
import requests
from bs4 import BeautifulSoup
import pandas as pd


Download the web page at the URL


In [3]:
#your code goes here
# Download the page; `url` was defined in the first cell
response = requests.get(url, timeout=30)
response.raise_for_status()  # stop early on an HTTP error status

Create a soup object


In [4]:
#your code goes here
# Download the HTML page and parse it with BeautifulSoup
import requests
from bs4 import BeautifulSoup

response = requests.get(url, timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Quick sanity check: show the page title and the first couple table headers
page_title = soup.title.get_text(strip=True) if soup.title else None
print(page_title)

first_table = soup.find("table")
header_texts = []
if first_table:
    th_vals = first_table.find_all("th")
    header_texts = [th.get_text(strip=True) for th in th_vals]
print(header_texts[:10])

Salary survey results of programming languages
[]
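
The empty list printed above suggests that the table contains no `<th>` elements, so its header row is presumably built from ordinary `<td>` cells. A minimal sketch of walking the rows directly with BeautifulSoup follows. The inline `sample_html` is a hypothetical stand-in that mirrors the first rows of the page (column order taken from the `pd.read_html` output further down), and `extract_language_salary` is an illustrative helper name, not part of the lab.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page.
# Assumption: the header row uses <td> cells, like the data rows.
sample_html = """
<table>
  <tr><td>No.</td><td>Language</td><td>Created By</td><td>Average Annual Salary</td><td>Learning Difficulty</td></tr>
  <tr><td>1</td><td>Python</td><td>Guido van Rossum</td><td>$114,383</td><td>Easy</td></tr>
  <tr><td>2</td><td>Java</td><td>James Gosling</td><td>$101,013</td><td>Easy</td></tr>
</table>
"""


def extract_language_salary(html):
    """Return (language, salary) text pairs from the first table, skipping the header row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    pairs = []
    if table is None:
        return pairs
    for row in table.find_all("tr")[1:]:  # the first row holds the column titles
        cells = row.find_all("td")
        if len(cells) < 4:
            continue  # skip rows that do not have a salary cell
        language = cells[1].get_text(strip=True)
        salary = cells[3].get_text(strip=True)
        pairs.append((language, salary))
    return pairs


pairs = extract_language_salary(sample_html)
print(pairs)
# [('Python', '$114,383'), ('Java', '$101,013')]
```

With the real page, the same helper could be called as `extract_language_salary(response.text)`, assuming the live table keeps this column order.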


Scrape the `Language name` and `annual average salary`.


In [6]:
#your code goes here
# Extract the table into a pandas DataFrame
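import pandas as pd
from io import StringIO

# Wrap the HTML in StringIO: recent pandas versions deprecate passing a literal HTML string
tables_list = pd.read_html(StringIO(response.text))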
print(len(tables_list))

# Assume the first table is the programming languages table
lang_df = tables_list[0].copy()

# Clean column names a bit
lang_df.columns = [str(col).strip() for col in lang_df.columns]

print(lang_df.head())



# Basic cleaning: standardize column names and keep the most relevant columns
col_map = {}
for col in lang_df.columns:
    col_l = str(col).strip().lower()
    if "language" in col_l:
        col_map[col] = "language"
    if "average" in col_l and "salary" in col_l:
        col_map[col] = "average_salary"
    if col_l in ["year", "years"]:
        col_map[col] = "year"

lang_df = lang_df.rename(columns=col_map)

# If we didn't detect salary column, keep everything and just continue
print(lang_df.columns.tolist())

# Try to coerce salary to numeric if present
if "average_salary" in lang_df.columns:
    lang_df["average_salary"] = (
        lang_df["average_salary"].astype(str)
        .str.replace(",", "", regex=False)
        .str.replace("$", "", regex=False)
    )
    lang_df["average_salary"] = pd.to_numeric(lang_df["average_salary"], errors="coerce")

print(lang_df.head())

1
     0           1                             2                      3  \
0  No.    Language                    Created By  Average Annual Salary   
1    1      Python              Guido van Rossum               $114,383   
2    2        Java                 James Gosling               $101,013   
3    3           R  Robert Gentleman, Ross Ihaka                $92,037   
4    4  Javascript                      Netscape               $110,981   

                     4  
0  Learning Difficulty  
1                 Easy  
2                 Easy  
3                 Hard  
4                 Easy  
['0', '1', '2', '3', '4']
     0           1                             2                      3  \
0  No.    Language                    Created By  Average Annual Salary   
1    1      Python              Guido van Rossum               $114,383   
2    2        Java                 James Gosling               $101,013   
3    3           R  Robert Gentleman, Ross Ihaka                $92,037
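
In the output above, the column names are `0` to `4` and the real titles ("Language", "Average Annual Salary", and so on) sit in data row 0. The renaming loop therefore finds nothing to rename, and the salary column is never converted to numbers. One way to handle this is to promote the first row to the header before cleaning. The sketch below uses a hypothetical `raw_df` that copies a few rows from the output above in place of `tables_list[0]`; `clean_df` is an illustrative name.

```python
import pandas as pd

# Hypothetical stand-in for tables_list[0]: the header row arrived as data row 0
raw_df = pd.DataFrame([
    ["No.", "Language", "Created By", "Average Annual Salary", "Learning Difficulty"],
    ["1", "Python", "Guido van Rossum", "$114,383", "Easy"],
    ["2", "Java", "James Gosling", "$101,013", "Easy"],
    ["3", "R", "Robert Gentleman, Ross Ihaka", "$92,037", "Hard"],
])

# Promote the first row to column names, then drop it from the data
clean_df = raw_df.iloc[1:].reset_index(drop=True)
clean_df.columns = [str(name).strip() for name in raw_df.iloc[0]]

# Keep only the two requested columns and give them short names
clean_df = clean_df[["Language", "Average Annual Salary"]].rename(
    columns={"Language": "language", "Average Annual Salary": "average_salary"}
)

# Strip "$" and "," before converting the salary text to numbers
clean_df["average_salary"] = pd.to_numeric(
    clean_df["average_salary"].str.replace("[$,]", "", regex=True),
    errors="coerce",
)

print(clean_df)
```

Passing `header=0` to `pd.read_html` may achieve the same promotion at load time, though that has not been checked against this page.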

Save the scraped data into a file named *popular-languages.csv*


In [7]:
# your code goes here
# Save the scraped data to the CSV file named in the lab instructions
output_csv = "popular-languages.csv"
lang_df.to_csv(output_csv, index=False)
output_csv

'popular-languages.csv'
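
To confirm that the saved file holds what the lab asks for, one option is to read it back and compare. The sketch below is illustrative only: `result_df` is a hypothetical two-row frame containing just the requested columns, and the file is written to a temporary directory so that the real *popular-languages.csv* is not overwritten.

```python
import os
import tempfile

import pandas as pd

# Hypothetical cleaned data with just the two requested columns
result_df = pd.DataFrame(
    {"language": ["Python", "Java"], "average_salary": [114383, 101013]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "popular-languages.csv")
    # index=False keeps the row index out of the file
    result_df.to_csv(csv_path, index=False)
    # Read the file back while the temporary directory still exists
    round_trip = pd.read_csv(csv_path)

print(round_trip)
```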

In [8]:
# Quick visualization: top languages by average salary if the column exists
import matplotlib.pyplot as plt

if "average_salary" in lang_df.columns and "language" in lang_df.columns:
    plot_df = lang_df.dropna(subset=["average_salary"]).sort_values("average_salary", ascending=False).head(10)
    plt.figure(figsize=(10, 5))
    plt.bar(plot_df["language"].astype(str), plot_df["average_salary"])
    plt.xticks(rotation=45, ha="right")
    plt.ylabel("Average Salary")
    plt.title("Top 10 Languages by Average Salary")
    plt.tight_layout()
    plt.show()
else:
    print(lang_df.head())

     0           1                             2                      3  \
0  No.    Language                    Created By  Average Annual Salary   
1    1      Python              Guido van Rossum               $114,383   
2    2        Java                 James Gosling               $101,013   
3    3           R  Robert Gentleman, Ross Ihaka                $92,037   
4    4  Javascript                      Netscape               $110,981   

                     4  
0  Learning Difficulty  
1                 Easy  
2                 Easy  
3                 Hard  
4                 Easy  


## Authors


Ramesh Sannareddy


### Other Contributors


Rav Ahuja


## Change Log


|  Date (YYYY-MM-DD) |  Version | Changed By  |  Change Description |
|---|---|---|---|
| 2020-10-17  | 0.1  | Ramesh Sannareddy  |  Created initial version of the lab |


 Copyright &copy; 2020 IBM Corporation. This notebook and its source code are released under the terms of the [MIT License](https://cognitiveclass.ai/mit-license/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDA0321ENSkillsNetwork928-2022-01-01).
