# Web Scraping

Q. What is Web Scraping?

Web scraping is the process of **extracting information from websites**. It involves fetching the web page, parsing its HTML content, and extracting the desired information. 

![image-2.png](attachment:image-2.png)

This technique is used to gather data for various purposes, such as research, analysis, or building datasets.

#### Web Scraping Vs Web Crawling

![image.png](attachment:image.png)

- Crawler: A crawler, also known as a spider or bot, is a program that systematically browses the internet, following links from one page to another. Search engines use crawlers to index web pages and keep their databases up to date. A crawler needs only a crawl agent. **Deduplication is mandatory**, because the same page can be reached through many different links. A well-behaved crawler follows the site's `robots.txt` file.
- Scraper: A scraper is a program or script designed to extract specific data from websites. It focuses on fetching and parsing the content of a given page to extract the relevant information. It needs a crawl agent together with a parser. **Deduplication is not necessary**. Scrapers often ignore the `robots.txt` file, although respecting it is the safer practice (see the legality notes below).

Q. What is Web Scraping used for?

Web Scraping has multiple applications across various industries. Let’s check out some of these now!

1. Price Monitoring: Companies scrape product data for their own and competing products to see how it affects their pricing strategies. They can use this data to set optimal prices and maximize revenue.

2. Market Research: High-quality scraped data, obtained in large volumes, lets companies analyze consumer trends comprehensively. These insights guide strategic business decisions.

3. News Monitoring: Scraping news sites yields detailed reports on current events, which is particularly valuable for companies that are frequently in the news or rely on daily updates for operational decisions.

4. Sentiment Analysis: By scraping social media platforms such as Facebook and Twitter, companies gain insight into the general sentiment around their products, which helps with product development and staying ahead of competitors.

5. Email Marketing: Web scraping supports email marketing by collecting email addresses from various sources. Companies can then run targeted email campaigns that reach a broader audience with promotional and marketing messages.


Q. Legality of Web Scraping

The legality of web scraping is a complex and evolving area, and it depends on various factors such as the website's terms of service, the nature of the data being scraped, and the jurisdiction. Here are key points to consider:

1. Terms of Service: Many websites have terms of service that explicitly prohibit web scraping. If you violate these terms, you may face legal consequences.

2. Prefer an API: If the company whose data you want provides an API, use it instead of scraping.

3. Robots.txt: Some websites use the robots.txt file to communicate whether web crawlers are allowed or not. Ignoring the directives in the robots.txt file might have legal implications.

4. Copyright and Intellectual Property: Scraping copyrighted content without permission may lead to legal issues. Ensure that you have the right to access and use the data you scrape.

5. Scraping Rate: Send requests at a slow rate and follow the website's rules, so that your scraper does not overload its servers.
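
As a minimal sketch of the `robots.txt` point above, Python's standard `urllib.robotparser` module can check whether a path is allowed before you request it. The rules and the `my-scraper` user-agent name below are made up for illustration; against a real site you would call `set_url(...)` and `read()` to load its actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, inlined so the example runs offline.
# For a real site: parser.set_url("https://example.com/robots.txt"); parser.read()
rules = """
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# "my-scraper" is an assumed user-agent name, not one used in this notebook.
allowed_public = parser.can_fetch("my-scraper", "https://example.com/companies")
allowed_private = parser.can_fetch("my-scraper", "https://example.com/private/data")

print(allowed_public, allowed_private)
```

A scraper could skip any URL for which `can_fetch` returns `False`.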

# Web Scraping Cheatcodes 2024

Let's discuss how to use `requests` and `BeautifulSoup`, two key libraries commonly used for web scraping.

- Requests: Getting data (HTML):
To work with the HTML, we first have to get it as a string by using the `get()` function in the `requests` module.

- BeautifulSoup (bs4):
Beautiful Soup is a well-suited module for parsing and traversing HTML code. We can easily target any div, table, td, tr, class, id, etc.

Note: This cheat code serves as a reference for addressing any challenges encountered when scraping websites in the future.
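
To make the targeting idea concrete, here is a small sketch that parses a made-up HTML string (not taken from any real site) and selects elements by tag, class, and id. It uses the built-in `html.parser` backend, so it does not depend on `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet used only for illustration.
html = """
<div id="main">
  <h1 class="title">Demo page</h1>
  <table>
    <tr><td>row 1</td></tr>
    <tr><td>row 2</td></tr>
  </table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

heading = soup.find("h1", class_="title").text        # target by tag + class
container = soup.find(id="main")                       # target by id
cells = [td.text for td in container.find_all("td")]  # every <td> inside the container

print(heading)
print(cells)
```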

### Importing Library

In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

## 1. Scraping data from a Container

##### Steps you can follow:

1. Inspect the HTML Structure:
* Identify the container `<div>`, `<section>`, or other relevant tag that holds the data.

2. Use BeautifulSoup:
* Fetch the HTML content using requests.
* Parse the HTML with BeautifulSoup.

3. Find the Container:
* Locate the container using find or find_all methods.

4. Extract Data:
* Extract the required information within the container.

In [5]:
requests.get("https://www.ambitionbox.com/list-of-companies?campaign=desktop_nav&page=1")

<Response [403]>

#### If response code is 403

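A `403` response means the server refused the request. Many websites reject the default `requests` client, so you have to send `headers` containing a browser-like `User-Agent` along with the request.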

headers={'User-Agent':'Mozilla/5.0 (Windows NT 6.3; Win 64 ; x64) Apple WeKit /537.36(KHTML , like Gecko) Chrome/80.0.3987.162 Safari/537.36'}

Note: `headers` let us send the request so that it looks like it comes from a person using a browser rather than from a bot.
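
One way to avoid repeating the header on every call is a `requests.Session` that carries it by default. The sketch below only prepares a request (nothing is sent over the network) so the outgoing headers can be inspected; the `User-Agent` value is a well-formed variant of the string above, and `https://example.com/` is a placeholder URL.

```python
import requests

# Assumed browser-like User-Agent; any current browser string would serve.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/80.0.3987.162 Safari/537.36"
    )
}

session = requests.Session()
session.headers.update(HEADERS)  # every request made through this session sends these headers

# Build the request without sending it, to see the merged headers.
prepared = session.prepare_request(requests.Request("GET", "https://example.com/"))
print(prepared.headers["User-Agent"])

# With network access, session.get(url) would send the same headers, and
# response.raise_for_status() would raise an exception on a 403 or any other 4xx/5xx status.
```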

In [6]:
url = "https://www.ambitionbox.com/list-of-companies?campaign=desktop_nav&page=1"

In [16]:
HEADERS = {'User-Agent':'Mozilla/5.0 (Windows NT 6.3; Win 64 ; x64) Apple WeKit /537.36(KHTML , like Gecko) Chrome/80.0.3987.162 Safari/537.36'}
response = requests.get(url,headers=HEADERS)
response.status_code

200

Now that the `status_code` of the response is `200`, we can scrape the desired information from the website by creating a soup.

### Parsing Data

Once the HTML is fetched using requests, the next step is to parse the HTML content. For that we will use Python’s BeautifulSoup module, which creates a tree-like structure for our DOM. This line parses the data:

In [19]:
soup = BeautifulSoup(response.content, "lxml")
soup

<!DOCTYPE html>
<html data-n-head="%7B%22lang%22:%7B%22ssr%22:%22en%22%7D%7D" data-n-head-ssr="" lang="en">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width,initial-scale=1,minimum-scale=1" name="viewport"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<style>@media only screen and (min-width:767px){.trp-img{width:400px!important;max-width:400px!important}}#ab-body{pointer-events:none}</style>
<script>window.dataLayer=window.dataLayer||[],window.gtag=window.gtag||function(){window.dataLayer.push(arguments)},gtag("js",new Date)</script>
<title>List of companies in India | AmbitionBox</title><meta content="2024 AmbitionBox" data-n-head="ssr" name="copyright"/><meta content="1 day" data-n-head="ssr" name="revisit-after"/><meta content="AmbitionBox" data-n-head="ssr" name="application-name"/><meta content="EN" data-n-head="ssr" name="content-language"/><meta content="462822053404-hphug4pkahqljh2tc96g35at47o4isv2.apps.googleusercontent.com" data-n-head="ssr" name="

We can use `prettify` method to make the HTML or XML content more readable and formatted.

In [20]:
print(soup.prettify())

<!DOCTYPE html>
<html data-n-head="%7B%22lang%22:%7B%22ssr%22:%22en%22%7D%7D" data-n-head-ssr="" lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width,initial-scale=1,minimum-scale=1" name="viewport"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <style>
   @media only screen and (min-width:767px){.trp-img{width:400px!important;max-width:400px!important}}#ab-body{pointer-events:none}
  </style>
  <script>
   window.dataLayer=window.dataLayer||[],window.gtag=window.gtag||function(){window.dataLayer.push(arguments)},gtag("js",new Date)
  </script>
  <title>
   List of companies in India | AmbitionBox
  </title>
  <meta content="2024 AmbitionBox" data-n-head="ssr" name="copyright"/>
  <meta content="1 day" data-n-head="ssr" name="revisit-after"/>
  <meta content="AmbitionBox" data-n-head="ssr" name="application-name"/>
  <meta content="EN" data-n-head="ssr" name="content-language"/>
  <meta content="462822053404-hphug4pkahqljh2tc96g35at47o4isv2.app

### HTML Tree Traversal (Targeting Data)

HTML tree traversal means travelling through the branches of the tree (the HTML tags), targeting the branches we want, and scraping them.

Types of Objects commonly used are:
1. Tag : type(title)
2. NavigableString: type(title.string)
3. BeautifulSoup: type(soup)
4. Comment: the HTML comment data
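
The four object types can be seen side by side in a short sketch. The HTML string here is invented for the example, and `html.parser` is used so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Made-up HTML with a title and an HTML comment.
html = "<html><head><title>Demo page</title></head><body><!-- a note --><p>Hi</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

title = soup.title               # a Tag
title_text = title.string        # a NavigableString (the text inside the tag)
comment = soup.body.contents[0]  # a Comment (a special kind of NavigableString)

for obj in (title, title_text, soup, comment):
    print(type(obj).__name__, repr(str(obj)))
```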

In [21]:
# Finding all the `<h1>` tags in the HTML document represented by the `soup` object. 
# It then selects the first `<h1>` tag found (index 0), retrieves the text content within
# that tag, and removes any leading or trailing whitespace using the `strip()` method.
soup.find_all('h1')[0].text.strip()

'List of companies in India'

Let's find out the names of the companies

In [24]:
for i in soup.find_all("h2", class_ = "companyCardWrapper__companyName"):
    print(i.text.strip())

TCS
Accenture
Cognizant
Wipro
HDFC Bank
ICICI Bank
Infosys
Capgemini
HCLTech
Tech Mahindra
Genpact
Axis Bank
Teleperformance
Concentrix Corporation
Reliance Jio
Amazon
IBM
Larsen & Toubro Limited
Reliance Retail
HDB Financial Services


Trying to scrape "rating" data

In [35]:
for i in soup.find_all("div", class_ = "companyCardWrapper__companyRating"):
    print(i.text.strip())

3.8
4.0
3.9
3.8
3.9
4.0
3.8
3.8
3.6
3.7
3.9
3.8
3.6
3.9
4.0
4.1
4.1
4.0
3.9
4.0


Let's scrape other data

In [37]:
for i in soup.find_all("div", class_ = "companyCardWrapper__interLinkingWrapper"):
    print(i.text.strip())

IT Services & Consulting | 1 Lakh+ Employees | Public | 56 years old | Mumbai +338 more
IT Services & Consulting | 1 Lakh+ Employees | Public | 35 years old | Dublin +168 more
IT Services & Consulting | 1 Lakh+ Employees | Forbes Global 2000 | 30 years old | Teaneck. New Jersey. +153 more
IT Services & Consulting | 1 Lakh+ Employees | Public | 79 years old | Bangalore/Bengaluru +274 more
Banking | 1 Lakh+ Employees | Public | 30 years old | Mumbai +1516 more
Banking | 1 Lakh+ Employees | Public | 30 years old | Mumbai +1270 more
IT Services & Consulting | 1 Lakh+ Employees | Public | 43 years old | Bengaluru/Bangalore +173 more
IT Services & Consulting | 1 Lakh+ Employees | Public | 57 years old | Paris +129 more
IT Services & Consulting | 1 Lakh+ Employees | Public | 33 years old | Noida +179 more
IT Services & Consulting | 1 Lakh+ Employees | Public | 38 years old | Pune +260 more
IT Services & Consulting | 1 Lakh+ Employees | Public | 27 years old | New York +104 more
Banking | 50k-

In [40]:
# Your scraping code
data = []
for i in soup.find_all("div", class_="companyCardWrapper__interLinkingWrapper"):
    text = i.text.strip()

    # Extracting and cleaning data
    parts = text.split("|")
    ctype = parts[0].strip()
    no_of_employees = parts[1].strip()
    hq = parts[-1].split("+")[0].strip()

    # Do something with the extracted data (e.g., print or store in a DataFrame)
    print("Company Type:", ctype)
    print("No. of Employees:", no_of_employees)
    print("HQ:", hq)
    print("-" * 20)

    # If you want to store the data in a DataFrame, you can create a list and then convert it to a DataFrame
    # For example:
    data.append([ctype, no_of_employees, hq])

# Create a DataFrame from the collected data
df = pd.DataFrame(data, columns=['Company Type', 'No. of Employees', 'HQ'])

# Display the DataFrame
df

Company Type: IT Services & Consulting
No. of Employees: 1 Lakh+ Employees
HQ: Mumbai
--------------------
Company Type: IT Services & Consulting
No. of Employees: 1 Lakh+ Employees
HQ: Dublin
--------------------
Company Type: IT Services & Consulting
No. of Employees: 1 Lakh+ Employees
HQ: Teaneck. New Jersey.
--------------------
Company Type: IT Services & Consulting
No. of Employees: 1 Lakh+ Employees
HQ: Bangalore/Bengaluru
--------------------
Company Type: Banking
No. of Employees: 1 Lakh+ Employees
HQ: Mumbai
--------------------
Company Type: Banking
No. of Employees: 1 Lakh+ Employees
HQ: Mumbai
--------------------
Company Type: IT Services & Consulting
No. of Employees: 1 Lakh+ Employees
HQ: Bengaluru/Bangalore
--------------------
Company Type: IT Services & Consulting
No. of Employees: 1 Lakh+ Employees
HQ: Paris
--------------------
Company Type: IT Services & Consulting
No. of Employees: 1 Lakh+ Employees
HQ: Noida
--------------------
Company Type: IT Services & Consu

Unnamed: 0,Company Type,No. of Employees,HQ
0,IT Services & Consulting,1 Lakh+ Employees,Mumbai
1,IT Services & Consulting,1 Lakh+ Employees,Dublin
2,IT Services & Consulting,1 Lakh+ Employees,Teaneck. New Jersey.
3,IT Services & Consulting,1 Lakh+ Employees,Bangalore/Bengaluru
4,Banking,1 Lakh+ Employees,Mumbai
5,Banking,1 Lakh+ Employees,Mumbai
6,IT Services & Consulting,1 Lakh+ Employees,Bengaluru/Bangalore
7,IT Services & Consulting,1 Lakh+ Employees,Paris
8,IT Services & Consulting,1 Lakh+ Employees,Noida
9,IT Services & Consulting,1 Lakh+ Employees,Pune
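
The splitting logic can also be wrapped in a small helper so that a card with fewer fields does not raise an `IndexError`. This is only a sketch: `parse_company_info` is a hypothetical name, and it assumes the same `type | employees | ... | HQ +N more` layout seen in the output above.

```python
def parse_company_info(text):
    """Split 'type | employees | ... | HQ +N more' into its fields.

    A field that is not present comes back as None instead of raising IndexError.
    """
    parts = [p.strip() for p in text.split("|")]
    company_type = parts[0] or None
    employees = parts[1] if len(parts) > 1 else None
    # The HQ is assumed to be the last field, followed by "+N more".
    hq = parts[-1].split("+")[0].strip() if len(parts) > 2 else None
    return {"Company Type": company_type, "No. of Employees": employees, "HQ": hq}


sample = "IT Services & Consulting | 1 Lakh+ Employees | Public | 56 years old | Mumbai +338 more"
print(parse_company_info(sample))
print(parse_company_info("Banking"))
```

A list of such dictionaries can be passed directly to `pd.DataFrame(...)`.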


### Working with the whole container

In [47]:
# Let's find all the company details at once
company_details = soup.find_all("div", class_ = "companyCardWrapper")
len(company_details)

20

In [50]:
# Creating empty lists

company_name = []
company_rating = []
company_type = []
no_of_employees = []
hq = []

for i in company_details:
    
    # Extracting company's name
    company_name.append(i.find("h2", class_ = "companyCardWrapper__companyName").text.strip())
    
    # Extracting company's rating
    company_rating.append(i.find("div", class_ = "companyCardWrapper__companyRating").text.strip())
    
    # Extracting company type, no. of employees and HQ details from the wrapper
    text = i.find("div", class_ = "companyCardWrapper__interLinkingWrapper").text.strip()

    # Extracting and cleaning data from the wrapper
    parts = text.split("|")
    ctype = parts[0].strip()
    no_of_employee = parts[1].strip()
    headq = parts[-1].split("+")[0].strip()

    company_type.append(ctype)
    no_of_employees.append(no_of_employee) 
    hq.append(headq)

# Create a DataFrame from the collected data
df = pd.DataFrame({'comapny_name':company_name,
   'company_rating':company_rating,
   'company_type':company_type,
   'head_quarters':hq,
   'No_of_Employee':no_of_employees,
   })


# Display the DataFrame
df

Unnamed: 0,comapny_name,company_rating,company_type,head_quarters,No_of_Employee
0,TCS,3.8,IT Services & Consulting,Mumbai,1 Lakh+ Employees
1,Accenture,4.0,IT Services & Consulting,Dublin,1 Lakh+ Employees
2,Cognizant,3.9,IT Services & Consulting,Teaneck. New Jersey.,1 Lakh+ Employees
3,Wipro,3.8,IT Services & Consulting,Bangalore/Bengaluru,1 Lakh+ Employees
4,HDFC Bank,3.9,Banking,Mumbai,1 Lakh+ Employees
5,ICICI Bank,4.0,Banking,Mumbai,1 Lakh+ Employees
6,Infosys,3.8,IT Services & Consulting,Bengaluru/Bangalore,1 Lakh+ Employees
7,Capgemini,3.8,IT Services & Consulting,Paris,1 Lakh+ Employees
8,HCLTech,3.6,IT Services & Consulting,Noida,1 Lakh+ Employees
9,Tech Mahindra,3.7,IT Services & Consulting,Pune,1 Lakh+ Employees


### Creating dataframe for all the pages

In [51]:
# Creating an empty DataFrame to collect the results
final=pd.DataFrame()

# Scraping pages 1 to 9 only (range(1,10) stops before 10)
for j in range(1,10):
    HEADERS = {'User-Agent':'Mozilla/5.0 (Windows NT 6.3; Win 64 ; x64) Apple WeKit /537.36(KHTML , like Gecko) Chrome/80.0.3987.162 Safari/537.36'}
    
    webpage=requests.get('https://www.ambitionbox.com/list-of-companies?campaign=desktop_nav&page={}'.format(j), headers=HEADERS)
    soup=BeautifulSoup(webpage.content,'lxml')
    company_details = soup.find_all("div", class_ = "companyCardWrapper")
  
    company_name = []
    company_rating = []
    company_type = []
    no_of_employees = []
    hq = []

    for i in company_details:
        
        # Extracting company's name
        try: 
            company_name.append(i.find("h2", class_ = "companyCardWrapper__companyName").text.strip())
        except:
            company_name.append(np.nan)
            
        # Extracting company's rating
        
        try: 
            company_rating.append(i.find("div", class_ = "companyCardWrapper__companyRating").text.strip())
        except:
            company_rating.append(np.nan)
        
        # Extracting company type, no. of employees and HQ details from the wrapper
        try: 
            text = i.find("div", class_ = "companyCardWrapper__interLinkingWrapper").text.strip()
            
            # Extracting and cleaning data from the wrapper
            parts = text.split("|")
            ctype = parts[0].strip()
            no_of_employee = parts[1].strip()
            headq = parts[-1].split("+")[0].strip()

            company_type.append(ctype)
            no_of_employees.append(no_of_employee) 
            hq.append(headq)
        
        except:
            company_type.append(np.nan)
            no_of_employees.append(np.nan)
            hq.append(np.nan)
            
    # Create a DataFrame from the collected data
    df = pd.DataFrame({'comapny_name':company_name,
    'company_rating':company_rating,
    'company_type':company_type,
    'head_quarters':hq,
    'No_of_Employee':no_of_employees,
    })


    # Append this page's rows to the final DataFrame
    # (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
    final=pd.concat([final, df], ignore_index=True)


In [52]:
final.shape

(180, 5)
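
In line with the scraping-rate advice earlier, a pause between page requests keeps the load on the server low. The sketch below separates building the page URLs from fetching them; `page_urls`, `fetch_politely`, and the 2-second default delay are assumptions made for illustration, and no request is actually sent here.

```python
import time

BASE_URL = "https://www.ambitionbox.com/list-of-companies?campaign=desktop_nav&page={}"


def page_urls(first_page, last_page):
    """Return the listing URLs for first_page..last_page, inclusive."""
    return [BASE_URL.format(n) for n in range(first_page, last_page + 1)]


def fetch_politely(fetch, urls, delay_seconds=2.0):
    """Call fetch(url) for each URL, sleeping between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # pause so the server is not hit back-to-back
        results.append(fetch(url))
    return results


urls = page_urls(1, 9)
print(len(urls), urls[0])
```

With network access, something like `fetch_politely(lambda u: requests.get(u, headers=HEADERS), urls)` would return one response per page.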

In [53]:
final

Unnamed: 0,comapny_name,company_rating,company_type,head_quarters,No_of_Employee
0,TCS,3.8,IT Services & Consulting,Mumbai,1 Lakh+ Employees
1,Accenture,4.0,IT Services & Consulting,Dublin,1 Lakh+ Employees
2,Cognizant,3.9,IT Services & Consulting,Teaneck. New Jersey.,1 Lakh+ Employees
3,Wipro,3.8,IT Services & Consulting,Bangalore/Bengaluru,1 Lakh+ Employees
4,HDFC Bank,3.9,Banking,Mumbai,1 Lakh+ Employees
...,...,...,...,...,...
175,Abbott Healthcare,4.1,Pharma,Illinois City,10k-50k Employees
176,Team Lease,3.9,Recruitment,Bangalore/Bengaluru,1k-5k Employees
177,VE Commercial Vehicles,4.0,Automobile,Gurgaon/Gurugram,1k-5k Employees
178,Ford Motor,4.4,Automobile,Dearborn,5k-10k Employees


In [54]:
final.to_csv("Companies Data.csv")
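
By default `to_csv` also writes the row index, which comes back as an `Unnamed: 0` column when the file is read again. A small sketch with made-up two-row data (using in-memory text instead of a file) shows the difference `index=False` makes.

```python
import io

import pandas as pd

# Made-up data standing in for the scraped DataFrame.
df = pd.DataFrame({"name": ["TCS", "Accenture"], "rating": [3.8, 4.0]})

with_index = df.to_csv()                # default: the index is written as an unnamed first column
without_index = df.to_csv(index=False)  # only the real columns are written

cols_with_index = pd.read_csv(io.StringIO(with_index)).columns.tolist()
cols_without_index = pd.read_csv(io.StringIO(without_index)).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```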

## 2. Scraping data from a Table

##### Steps you can follow:

1. Inspect the HTML Structure:
* Right-click on the webpage and choose "Inspect" to view the HTML structure.
* Identify the `<table>` tag that contains the data you want to scrape.

2. Use BeautifulSoup:
* Use the requests library to fetch the webpage's HTML content.
* Use BeautifulSoup to parse the HTML.

3. Find the Table:
* Use BeautifulSoup's find or find_all methods to locate the `<table>` tag.

4. Extract Data from Rows and Columns:
* Iterate through rows and columns within the table to extract the desired data.

In [3]:
# Fetch webpage content
webpage = requests.get("https://finasko.com/fortune-100-companies/")

In [4]:
soup=BeautifulSoup(webpage.content,'lxml')

In [28]:
company_details = soup.find_all("tr")

In [29]:
company_details

[<tr><td><strong>Companies</strong></td><td><strong>Sector</strong></td></tr>,
 <tr><td>1 Walmart</td><td>Retail</td></tr>,
 <tr><td>2 Amazon</td><td>Retail</td></tr>,
 <tr><td>3 Exxon Mobil</td><td>Energy</td></tr>,
 <tr><td>4 Apple</td><td>Technology</td></tr>,
 <tr><td>5 UnitedHealth Group</td><td>Health Care</td></tr>,
 <tr><td>6 CVS Health</td><td>Health Care</td></tr>,
 <tr><td>7 Berkshire Hathaway</td><td>Financial</td></tr>,
 <tr><td>8 Alphabet</td><td>Technology</td></tr>,
 <tr><td>9 McKesson</td><td>Health Care</td></tr>,
 <tr><td>10 Chevron</td><td>Energy</td></tr>,
 <tr><td>11 AmerisourceBergen</td><td>Health Care</td></tr>,
 <tr><td>12 Costco Wholesale</td><td>Retail</td></tr>,
 <tr><td>13 Microsoft </td><td>Technology </td></tr>,
 <tr><td>14 Cardinal Health </td><td>Health Care </td></tr>,
 <tr><td>15 Cigna</td><td>Health Care</td></tr>,
 <tr><td>16 Marathon Petroleum</td><td>Energy</td></tr>,
 <tr><td>17 Phillips 66</td><td>Energy</td></tr>,
 <tr><td>18 Valero Energy </t

In [23]:
# Extract company details
company_details = soup.find_all("tr")[1:]  # Skip the header row ("Companies", "Sector")

# Extract data into lists
companies = []
sectors = []

for row in company_details:
    columns = row.find_all("td")
    companies.append(columns[0].get_text(strip=True).split(' ', 1)[1])  # Removing serial number
    sectors.append(columns[1].get_text(strip=True))

# Create a DataFrame
data = {'Companies': companies, 'Sectors': sectors}
df = pd.DataFrame(data)

# Display the DataFrame
df

Unnamed: 0,Companies,Sectors
0,Walmart,Retail
1,Amazon,Retail
2,Exxon Mobil,Energy
3,Apple,Technology
4,UnitedHealth Group,Health Care
...,...,...
95,United Airlines Holdings,Aviation
96,Thermo Fisher Scientific,Technology
97,Qualcomm,Telecommunication
98,Abbott Laboratories,Health Care


Here's what each part of the code does for the "Companies" column:

`columns[0]`: This accesses the first element (index 0) of the list `columns`, i.e. the first `<td>` of the row, which holds the serial number followed by the company name.

`.get_text(strip=True)`: This extracts the text content from the HTML element; `strip=True` removes any leading or trailing whitespace.

`.split(' ', 1)`: This splits the text using a space (' ') as the separator, but performs only one split: the second argument `1` limits the split to the first space. This matters because some company names contain spaces, and we want to split only at the first space to separate the serial number.

`[1]`: This retrieves the second part of the split text (index 1), which is the company name without the serial number.
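
A quick worked example on one of the rows makes the effect of the second argument visible. The last line shows `str.partition` as a possible alternative (not what the notebook uses) that does not raise an error when a cell contains no space.

```python
raw = "5 UnitedHealth Group"

every_space = raw.split(" ")     # splits at every space
first_space = raw.split(" ", 1)  # splits at the first space only
name = first_space[1]            # the company name without the serial number

# partition always returns three parts, even if the separator is missing.
serial, _, name_via_partition = raw.partition(" ")

print(every_space)
print(first_space)
print(name, "|", name_via_partition)
```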

## 3. Scraping data from Links

##### Steps you can follow:

1. Inspect the HTML Structure:
* Identify the `<div>` tag that contains the unordered lists (`<ul>`).

2. Use BeautifulSoup: 
* Fetch the HTML content.
* Parse the HTML with BeautifulSoup.

3. Extract List Items:
* Extract individual list items using find_all('li')


In [62]:
url = requests.get("https://en.wikipedia.org/wiki/Category:21st-century_Indian_businesspeople")
url

<Response [200]>

In [63]:
soup = BeautifulSoup(url.content,'lxml')
soup

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-disabled vector-feature-custom-font-size-clientpref-0 vector-feature-client-preferences-disabled vector-feature-client-prefs-pinned-disabled vector-toc-not-available" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Category:21st-century Indian businesspeople - Wikipedia</title>
<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-client

In [65]:
details = soup.find('div', class_='mw-category mw-category-columns')
details

<div class="mw-category mw-category-columns"><div class="mw-category-group"><h3>A</h3>
<ul><li><a href="/wiki/Anuradha_Acharya" title="Anuradha Acharya">Anuradha Acharya</a></li>
<li><a href="/wiki/Gautam_Adani" title="Gautam Adani">Gautam Adani</a></li>
<li><a href="/wiki/Karan_Adani" title="Karan Adani">Karan Adani</a></li>
<li><a href="/wiki/Pranav_Adani" title="Pranav Adani">Pranav Adani</a></li>
<li><a href="/wiki/Priti_Adani" title="Priti Adani">Priti Adani</a></li>
<li><a href="/wiki/Anu_Aga" title="Anu Aga">Anu Aga</a></li>
<li><a href="/wiki/Ashutosh_Agashe" title="Ashutosh Agashe">Ashutosh Agashe</a></li>
<li><a href="/wiki/Dnyaneshwar_Agashe" title="Dnyaneshwar Agashe">Dnyaneshwar Agashe</a></li>
<li><a href="/wiki/Mandar_Agashe" title="Mandar Agashe">Mandar Agashe</a></li>
<li><a href="/wiki/Sheetal_Agashe" title="Sheetal Agashe">Sheetal Agashe</a></li>
<li><a href="/wiki/Prem_Akkaraju" title="Prem Akkaraju">Prem Akkaraju</a></li>
<li><a href="/wiki/Vikram_Akula" title="Vik

In [66]:
for i in details.find_all("li"):
    print(i.text)

Anuradha Acharya
Gautam Adani
Karan Adani
Pranav Adani
Priti Adani
Anu Aga
Ashutosh Agashe
Dnyaneshwar Agashe
Mandar Agashe
Sheetal Agashe
Prem Akkaraju
Vikram Akula
Tina Ambani
Achal Bakeri
Vinita Bali
Kavita K. Barjatya
Mukta Barve
Ritu Beri
Vallabh Bhanshali
Uddhab Bharali
Shobhana Bhartia
Hridayeshwar Singh Bhati
Mukesh Bhatt
Arundhati Bhattacharya
Jeroo Billimoria
William Nanda Bissell
Shabbir Boxwala
Brijmohan Lall Munjal
Urvashi Butalia
Shoba Chandrasekhar
Juhi Chawla
Pamela Chopra
Santosh Choubey
Seetha Coleman-Kammula
Sylvester da Cunha
Jyoti Deshpande
Sulabha Deshpande
Dega Deva Kumar Reddy
Kanika Dewan
Tanya Dubash
C. Aswani Dutt
Priyanka Dutt
Krishna Ella
Kailash Chandra Gahtori
Galla Aruna Kumari
S. George
Manjula Ghattamaneni
Sumita Ghosh
Jyoti Gogte
Namita Gokhale
Divya Gokulnath
Gita Gopinath
Suhas Gopinath
Dharampal Gulati
Anita Gupta
Lalita D. Gupte
Dheeraj Hinduja
Gopichand Hinduja
Prakash Hinduja
S. P. Hinduja
Zuboni Hümtsoe
Shahnaz Husain
Anshu Jain
Devaki Jain
Tus

In [70]:
for i in details.find_all("a"):
    print("https://en.wikipedia.org" + i.get("href"))

https://en.wikipedia.org/wiki/Anuradha_Acharya
https://en.wikipedia.org/wiki/Gautam_Adani
https://en.wikipedia.org/wiki/Karan_Adani
https://en.wikipedia.org/wiki/Pranav_Adani
https://en.wikipedia.org/wiki/Priti_Adani
https://en.wikipedia.org/wiki/Anu_Aga
https://en.wikipedia.org/wiki/Ashutosh_Agashe
https://en.wikipedia.org/wiki/Dnyaneshwar_Agashe
https://en.wikipedia.org/wiki/Mandar_Agashe
https://en.wikipedia.org/wiki/Sheetal_Agashe
https://en.wikipedia.org/wiki/Prem_Akkaraju
https://en.wikipedia.org/wiki/Vikram_Akula
https://en.wikipedia.org/wiki/Tina_Ambani
https://en.wikipedia.org/wiki/Achal_Bakeri
https://en.wikipedia.org/wiki/Vinita_Bali
https://en.wikipedia.org/wiki/Kavita_K._Barjatya
https://en.wikipedia.org/wiki/Mukta_Barve
https://en.wikipedia.org/wiki/Ritu_Beri
https://en.wikipedia.org/wiki/Vallabh_Bhanshali
https://en.wikipedia.org/wiki/Uddhab_Bharali
https://en.wikipedia.org/wiki/Shobhana_Bhartia
https://en.wikipedia.org/wiki/Hridayeshwar_Singh_Bhati

In [72]:
business_persons = []
for i in details.find_all("li"):
    business_persons.append(i.text)

persons_link = []
for i in details.find_all("a"):
    persons_link.append("https://en.wikipedia.org" + i.get("href"))

In [73]:
business_persons_data=pd.DataFrame(zip(business_persons, persons_link),columns=['BusinessPersonName','DetailsLinks'])
business_persons_data.shape

(197, 2)
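
Relative `href` values can also be turned into absolute URLs with the standard library's `urljoin`, which takes care of the slash between the host and the path. A minimal sketch using two of the `href` values shown above:

```python
from urllib.parse import urljoin

BASE = "https://en.wikipedia.org/"

# hrefs as they appear in the category page: relative, starting with "/"
hrefs = ["/wiki/Anuradha_Acharya", "/wiki/Gautam_Adani"]

links = [urljoin(BASE, href) for href in hrefs]
print(links)
```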

## 4. Scraping data from Multiple Links

In [74]:
# List of URLs
urls = [
    "https://en.wikipedia.org/wiki/Category:20th-century_Indian_businesspeople",
    "https://en.wikipedia.org/w/index.php?title=Category:20th-century_Indian_businesspeople&pagefrom=Pande%2C+Anku%0AAnku+Pande#mw-pages",
    "https://en.wikipedia.org/wiki/Category:20th-century_Indian_businesswomen",
    "https://en.wikipedia.org/wiki/Category:21st-century_Indian_businesspeople",
    "https://en.wikipedia.org/wiki/Category:21st-century_Indian_businesswomen"
]

# Initialize an empty list to store data
business_persons_data = []

# Iterate through each URL
for url in urls:
    # Fetch the HTML content
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'lxml')
    
    # Find the details container
    details = soup.find('div', class_='mw-category mw-category-columns')

    # Initialize lists to store individual details
    business_persons = []
    persons_link = []

    # Extract data from HTML structure
    for li in details.find_all("li"):
        business_persons.append(li.text)

    for a in details.find_all("a"):
        persons_link.append("https://en.wikipedia.org" + a.get("href"))

    # Combine data into a list of tuples
    business_persons_data.extend(zip(business_persons, persons_link))

# Create a DataFrame
columns = ['BusinessPersonName', 'DetailsLinks']
business_persons_df = pd.DataFrame(business_persons_data, columns=columns)

# Remove duplicates based on the "DetailsLinks" column
business_persons_df.drop_duplicates(subset='DetailsLinks', inplace=True)

# Display the shape of the DataFrame
print(business_persons_df.shape)

(404, 2)


In [76]:
business_persons_df["DetailsLinks"][0]

'https://en.wikipedia.org/wiki/M._M._Abdul_Hameed'

In [77]:
business_persons_df.to_csv("business_persons - India.csv")
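
When looping over several pages, one of them may lack the expected container, in which case `soup.find(...)` returns `None` and the loop fails. The sketch below wraps the per-page extraction in a hypothetical helper, `extract_people`, that returns an empty list in that case; the sample HTML is made up to mimic the structure printed earlier, and `html.parser` is used to keep the example free of extra dependencies.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_people(html, base="https://en.wikipedia.org/"):
    """Return (name, absolute_link) pairs from one category page.

    Returns an empty list when the expected container is missing,
    instead of failing on None.find_all.
    """
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_="mw-category")
    if container is None:
        return []
    return [
        (a.get_text(strip=True), urljoin(base, a["href"]))
        for a in container.find_all("a", href=True)
    ]


# Made-up HTML that mimics the structure of the category listing.
sample = """
<div class="mw-category mw-category-columns"><ul>
<li><a href="/wiki/Gautam_Adani" title="Gautam Adani">Gautam Adani</a></li>
</ul></div>
"""
print(extract_people(sample))
print(extract_people("<p>no category listing here</p>"))
```

Taking the name and the link from the same `<a>` tag also keeps each pair aligned without relying on `zip`.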

## 5. Scraping data from "Images"

##### Steps you can follow:
1. Send a Request to the Website:
* Use the requests library to send a GET request to the website and retrieve the HTML content.

2. Parse HTML with BeautifulSoup:
* Parse the HTML content using BeautifulSoup to navigate and extract information.

3. Find Image Tags:
* Locate the image tags in the HTML. Images are often represented using the `<img>` tag.

4. Extract Image URLs:
* Extract the src attribute from each image tag to get the URL of the image.

5. Download Images:
* Use the requests library again to download the images. Save the images to your local machine.

In [3]:
food_url = "https://www.google.com/search?sca_esv=76c68a2d4a39925c&rlz=1C1CHBF_enIN1043IN1043&hl=en-GB&sxsrf=ACQVn089iTRO_gQJzx8NgYbZC1jQlqK2IA:1706868516549&q=food+images&tbm=isch&source=lnms&sa=X&ved=2ahUKEwjMpO3ctIyEAxW0SmwGHfkwAsMQ0pQJegQICxAB&biw=1396&bih=663&dpr=1.38"

food_webpage = requests.get(food_url) 
food_webpage

<Response [200]>

In [4]:
soup = BeautifulSoup(food_webpage.content, 'lxml')

In [7]:
img = soup.find_all("img")[1:]
img

[<img alt="" class="DS1iW" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcT3jVkAzZHOi1XEP9RltnYgAwnrGtH33NrcwyMwAZAZy2uW_Ds36mijG595Iw&amp;s"/>,
 <img alt="" class="DS1iW" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRXIv828DUIoPF5YXTbX8hO-aAv0Tu8DGq-dODwZ-gtJwVhZdKg-inlILANnw&amp;s"/>,
 <img alt="" class="DS1iW" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRVG6Nr1BDBfnP_H-sbF4E2KlJwrP1XW5-DuIospW9bg6yK5QJj1eQwdmWdow&amp;s"/>,
 <img alt="" class="DS1iW" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQ5R0uilEH5jbjo7HTJ7BEDAc2PVCFiBWW8LY0cNDfm5WWkNYj6q8jd0kQ-sg&amp;s"/>,
 <img alt="" class="DS1iW" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQVum3v_VU0bXTOjvziV2lIsf2iNYMjJn_PW2F0skrD8-711FFgpjSc7K0wcg&amp;s"/>,
 <img alt="" class="DS1iW" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRFO9K9sISNG7CiSL4Y6E0tPRhpHER0f3eDRsjRYIYuMFR9zKvNDj9Ln4dmuzc&amp;s"/>,
 <img alt="" class="DS1iW" src="https://encrypted-tbn0.gs

In [8]:
for i in img:
    print(i.get("src"))

https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcT3jVkAzZHOi1XEP9RltnYgAwnrGtH33NrcwyMwAZAZy2uW_Ds36mijG595Iw&s
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRXIv828DUIoPF5YXTbX8hO-aAv0Tu8DGq-dODwZ-gtJwVhZdKg-inlILANnw&s
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRVG6Nr1BDBfnP_H-sbF4E2KlJwrP1XW5-DuIospW9bg6yK5QJj1eQwdmWdow&s
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQ5R0uilEH5jbjo7HTJ7BEDAc2PVCFiBWW8LY0cNDfm5WWkNYj6q8jd0kQ-sg&s
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQVum3v_VU0bXTOjvziV2lIsf2iNYMjJn_PW2F0skrD8-711FFgpjSc7K0wcg&s
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRFO9K9sISNG7CiSL4Y6E0tPRhpHER0f3eDRsjRYIYuMFR9zKvNDj9Ln4dmuzc&s
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQgFVR3gTFB4l2-xnH0hphY62nUBQYbfgGZ43OYh9VSQXElFmPGhXLdviKANQ&s
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRi0Ev9Xx3OrypR_T2QruoERoY6-pkxOUQUJ5wk51103GW_uFO6COTvD2vA4Q&s
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSwcw3f9rZ

In [9]:
import os

food_url = "https://www.google.com/search?sca_esv=76c68a2d4a39925c&rlz=1C1CHBF_enIN1043IN1043&hl=en-GB&sxsrf=ACQVn089iTRO_gQJzx8NgYbZC1jQlqK2IA:1706868516549&q=food+images&tbm=isch&source=lnms&sa=X&ved=2ahUKEwjMpO3ctIyEAxW0SmwGHfkwAsMQ0pQJegQICxAB&biw=1396&bih=663&dpr=1.38"

food_webpage = requests.get(food_url) 
soup = BeautifulSoup(food_webpage.content, 'html.parser')

img_tags = soup.find_all("img")[1:]

# Create a folder to save the images
folder_path = "food_images"
os.makedirs(folder_path, exist_ok=True)

for i, img_tag in enumerate(img_tags):
    img_url = img_tag.get("src")
    img_data = requests.get(img_url).content

    # Save the image to the folder
    with open(os.path.join(folder_path, f"image_{i+1}.jpg"), "wb") as img_file:
        img_file.write(img_data)

    print(f"Image {i+1} saved.")

Image 1 saved.
Image 2 saved.
Image 3 saved.
Image 4 saved.
Image 5 saved.
Image 6 saved.
Image 7 saved.
Image 8 saved.
Image 9 saved.
Image 10 saved.
Image 11 saved.
Image 12 saved.
Image 13 saved.
Image 14 saved.
Image 15 saved.
Image 16 saved.
Image 17 saved.
Image 18 saved.
Image 19 saved.
Image 20 saved.
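
The loop above saves every file with a `.jpg` extension and tries to download every `src` it finds. As a hedged sketch, two small helpers (hypothetical names, with a deliberately short mapping of MIME types) could skip sources that are not plain URLs and choose the extension from the response's `Content-Type` header.

```python
# Small, non-exhaustive mapping chosen for this example.
EXTENSIONS = {
    "image/jpeg": ".jpg",
    "image/png": ".png",
    "image/gif": ".gif",
    "image/webp": ".webp",
}


def is_downloadable(src):
    """True for http(s) URLs; False for missing, relative, or inline data: sources."""
    return bool(src) and src.startswith(("http://", "https://"))


def extension_for(content_type, default=".jpg"):
    """Pick a file extension from a Content-Type header such as 'image/png; charset=binary'."""
    mime = (content_type or "").split(";")[0].strip().lower()
    return EXTENSIONS.get(mime, default)


print(is_downloadable("data:image/gif;base64,R0lGOD"))
print(extension_for("image/png"))
```

Inside the download loop, `is_downloadable(img_url)` could guard the request, and `extension_for(response.headers.get("Content-Type"))` could replace the hard-coded `.jpg`.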


These are the main types of scraping that I have personally encountered.