# Scraping data in tabular format

1. Identify the Target Website: Choose the website from which you want to scrape the table data.
2. Inspect the Web Page: Use your browser's developer tools to inspect the HTML structure of the web page containing the table.
3. Locate the Table Element: Identify the HTML `<table>` element that contains the data you want to scrape.
4. Understand the Table Structure: Analyze the structure of the table, including its rows (`<tr>`), its header cells (`<th>`) and data cells (`<td>`), and any header or footer sections.
5. Choose a Web Scraping Tool: Select a web scraping tool or library such as BeautifulSoup or Scrapy in Python.
6. Write Scraping Code: Use the chosen tool to write code that programmatically extracts data from the HTML content of the table.
7. Parse the Table Data: Utilize the tool's methods to parse and extract data from the table, navigating through its rows and columns.
8. Handle Pagination (if applicable): If the table spans multiple pages, implement logic to navigate through pagination and scrape data from each page.
9. Clean and Format the Data: Process the extracted data to remove any unnecessary characters, format it into a structured format, and handle missing or inconsistent values.
10. Save the Data: Choose a suitable storage format such as CSV, JSON, or a database, and save the scraped table data for further analysis or use.
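
Step 8 is not exercised in the notebook below, because the table used there sits on a single page. A minimal sketch of a pagination loop follows. It assumes a hypothetical `fetch_page(page_number)` callable that returns the rows of one page and an empty list once the pages run out; both the name and the stopping rule are illustrative assumptions, not part of any library.

```python
def scrape_all_pages(fetch_page, max_pages=100):
    """Collect rows from consecutive pages until a page comes back empty.

    fetch_page is a hypothetical callable: page number (1-based) -> list of rows.
    max_pages is a safety cap so a misbehaving site cannot loop forever.
    """
    all_rows = []
    for page_number in range(1, max_pages + 1):
        rows = fetch_page(page_number)
        if not rows:  # an empty page marks the end of the table
            break
        all_rows.extend(rows)
    return all_rows


# Stand-in for real HTTP requests: three fake pages of table rows.
fake_pages = {
    1: [["Mandarin Chinese", "941"], ["Spanish", "486"]],
    2: [["English", "380"], ["Hindi", "345"]],
    3: [["Bengali", "237"]],
}
rows = scrape_all_pages(lambda n: fake_pages.get(n, []))
print(len(rows))  # 5
```

In a real scraper, `fetch_page` would wrap the request and parsing code for one page, and the stopping rule would depend on how the site signals its last page.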

### Installing the libraries is required, but I am using Jupyter Notebook, where they come preinstalled.

In [1]:
# Installing BeautifulSoup
# Uncomment below to install
#!pip install beautifulsoup4

In [2]:
# Installing requests
# Uncomment below to install
#!pip install requests

In [3]:
# Importing Libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [4]:

# Send a request to the web page
url = "https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers"
response = requests.get(url)

In [5]:
# Check the response status code
if response.status_code == 200:
    print('Good to go')
else:
    print(f'Response code: {response.status_code}')
    print('Not Good to Go')


Good to go
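
As an alternative to checking `status_code` by hand, `requests` can raise an exception on a 4xx or 5xx response through `raise_for_status()`. A minimal sketch, in which the helper name `fetch_html` and the 10-second timeout are illustrative choices rather than anything this notebook prescribes:

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page HTML, or raise requests.HTTPError on a 4xx/5xx status."""
    # The timeout keeps the call from hanging indefinitely on an unresponsive server.
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text
```

Calling `fetch_html(url)` with the URL defined earlier would either return the HTML or stop with an error, so later cells never run on a failed response.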


In [6]:
# Parse the HTML content
soup = BeautifulSoup(response.text, "html.parser")

In [7]:
# Pretty-print the parsed HTML
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-0 vector-feature-client-preferences-disabled vector-feature-client-prefs-pinned-disabled vector-feature-night-mode-disabled skin-theme-clientpref-day vector-toc-available" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of languages by number of native speakers - Wikipedia
  </title>
  <script>
   (function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1
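
Printing the whole document produces a very long output. A lighter check is to look at the page title and count the tables. The sketch below does this on a small inline HTML string that stands in for `response.text`; the sample markup is an assumption for illustration.

```python
from bs4 import BeautifulSoup

# Small stand-in for response.text so the snippet runs without a network call.
sample_html = """
<html>
  <head><title>List of languages by number of native speakers - Wikipedia</title></head>
  <body>
    <table><tr><th>Language</th></tr><tr><td>Spanish</td></tr></table>
    <table><tr><th>Branch</th></tr><tr><td>Romance</td></tr></table>
  </body>
</html>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

page_title = sample_soup.title.get_text(strip=True)  # text inside <title>
table_count = len(sample_soup.find_all("table"))     # number of <table> elements
print(page_title)
print(table_count)
```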

In [8]:
# Select the first table on the page, which holds the data we need
table = soup.find_all("table")[0]
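
Indexing with `[0]` depends on the target being the first table on the page. A sketch of a more explicit selection by CSS class follows. It assumes the target table carries the class `wikitable`, which is common on Wikipedia but is not confirmed by the output shown in this notebook, and it uses an inline HTML sample instead of the live page.

```python
from bs4 import BeautifulSoup

# Inline sample: a layout table followed by the data table (assumed class "wikitable").
sample_html = """
<table class="infobox"><tr><td>Not the data we want</td></tr></table>
<table class="wikitable sortable">
  <tr><th>Language</th><th>Branch</th></tr>
  <tr><td>Spanish</td><td>Romance</td></tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# class_ matches any one of the element's classes, so "wikitable sortable" qualifies.
data_table = sample_soup.find("table", class_="wikitable")
if data_table is None:
    raise ValueError("No table with class 'wikitable' found")

first_header = data_table.find("th").get_text(strip=True)
print(first_header)
```

Checking for `None` makes a changed page layout fail loudly instead of producing an `AttributeError` further down.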

In [9]:
# Extract data from the table
data = []
for row in table.find_all("tr"):
    row_data = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
    data.append(row_data)

In [16]:
# Convert data to DataFrame
df = pd.DataFrame(data[1:], columns=data[0])
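
`pd.DataFrame(data[1:], columns=data[0])` treats the first row as the header and expects the remaining rows to line up with it. Rows with a different number of cells, for example spacer rows or rows with merged cells, do not fit that assumption. A minimal sketch of a more defensive conversion, run on a made-up list of rows (the helper name `rows_to_dataframe` is hypothetical):

```python
import pandas as pd


def rows_to_dataframe(rows):
    """Build a DataFrame from scraped rows, keeping only rows that match the header width."""
    header, *body = rows
    # Keep rows with exactly one value per header column; this drops empty or merged-cell rows.
    clean_body = [row for row in body if len(row) == len(header)]
    return pd.DataFrame(clean_body, columns=header)


sample_rows = [
    ["Language", "Branch"],
    ["Spanish", "Romance"],
    [],                       # e.g. a spacer row with no cells
    ["English", "Germanic"],
    ["Merged cell"],          # e.g. a row using colspan
]
sample_df = rows_to_dataframe(sample_rows)
print(sample_df.shape)  # (2, 2)
```

Dropping rows silently can hide data, so it may be worth logging the skipped rows when using this on a real table.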

In [20]:
# Display the DataFrame
df

Unnamed: 0,Language,Native speakers(in millions),Language family,Branch
0,Mandarin Chinese,941,Sino-Tibetan,Sinitic
1,Spanish,486,Indo-European,Romance
2,English,380,Indo-European,Germanic
3,Hindi,345,Indo-European,Indo-Aryan
4,Bengali,237,Indo-European,Indo-Aryan
5,Portuguese,236,Indo-European,Romance
6,Russian,148,Indo-European,Balto-Slavic
7,Japanese,123,Japonic,Japanese
8,Yue Chinese,86,Sino-Tibetan,Sinitic
9,Vietnamese,85,Austroasiatic,Vietic
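
Every value scraped this way is a string, including the speaker counts. The cleaning step from the list at the top (step 9) could look like the sketch below. It rebuilds a few rows from the output above by hand, and the footnote marker `[1]` in the sample is an invented example of the kind of residue a scraped table can contain.

```python
import pandas as pd

# A few rows copied from the output above; the "[1]" marker is made up for illustration.
sample_df = pd.DataFrame(
    {
        "Language": ["Mandarin Chinese", "Spanish", "English"],
        "Native speakers(in millions)": ["941", "486[1]", "380"],
    }
)

speakers = "Native speakers(in millions)"
# Remove bracketed footnote markers such as "[1]" before converting.
cleaned = sample_df[speakers].str.replace(r"\[.*?\]", "", regex=True)
# errors="coerce" turns anything still non-numeric into NaN instead of raising.
sample_df[speakers] = pd.to_numeric(cleaned, errors="coerce")

print(sample_df[speakers].sum())
```

Once the column is numeric, sorting, summing, and plotting behave as expected instead of operating on text.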


In [21]:
# Export the data to an Excel file
df.to_excel("books.xlsx")
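
`to_excel` depends on an Excel writer package such as `openpyxl` being installed. Where that is not available, CSV is an alternative that needs nothing beyond pandas. A minimal sketch, assuming a stand-in DataFrame and a hypothetical file name `languages.csv` written to a temporary directory:

```python
import os
import tempfile

import pandas as pd

# Stand-in for the scraped DataFrame.
sample_df = pd.DataFrame(
    {"Language": ["Spanish", "English"], "Branch": ["Romance", "Germanic"]}
)

# Write into a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "languages.csv")  # hypothetical file name
    # index=False leaves out the row numbers, which would otherwise become an extra column.
    sample_df.to_csv(csv_path, index=False)
    reloaded = pd.read_csv(csv_path)

print(list(reloaded.columns))
```

The same `index=False` argument is accepted by `to_excel`, should the row numbers be unwanted in the spreadsheet as well.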