# Web Scraping

Web scraping is one of the most common ways to collect data. In simple terms, it means fetching the HTML code of a webpage and scanning it for the data you need.  
![](https://rukminim1.flixcart.com/image/312/312/kfpq5jk0-0/headphone/c/n/6/rockerz-400-rockerz-410-boat-original-imafw45vhyrax3zj.jpeg?q=70)

**[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/)** and **[Selenium](https://www.selenium.dev/)** are two of the most widely used packages for scraping data.  
In this notebook we'll see how to use Beautiful Soup to get reviews of the **[boAt Rockerz 400](https://www.flipkart.com/boat-rockerz-400-bluetooth-headset/p/itm14d0416b87d55)**.  
**Let's get started.**

## Importing modules
The **[Requests](https://requests.readthedocs.io/en/master/)** module is used to fetch the HTML code for a given URL.

**Note**: *Not all webpages can be requested this way. For example, most social media sites do not allow scraping because of privacy concerns. To collect data from them, you need access to their developer APIs.*
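
One common way sites publish their crawling rules is a `robots.txt` file. The note above does not mention it, so treat this as an assumption about how you might check permissions. The sketch below uses Python's standard `urllib.robotparser` on a made-up `robots.txt` (the rules and the `example.com` URLs are hypothetical) so that it runs without a network connection.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, written inline so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /account/
Allow: /
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A review page is allowed by these made-up rules, an account page is not.
print(is_allowed(ROBOTS_TXT, "https://example.com/product-reviews/123"))
print(is_allowed(ROBOTS_TXT, "https://example.com/account/orders"))
```

For a real site you would download its actual `robots.txt` first. Passing a check like this does not replace reading the site's terms of use.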

In [None]:
import requests 
from bs4 import BeautifulSoup 
from tqdm import tqdm

## Setting variables

In [None]:
URL = "https://www.flipkart.com/boat-rockerz-400-bluetooth-headset/product-reviews/itm14d0416b87d55?pid=ACCEJZXYKSG2T9GS&lid=LSTACCEJZXYKSG2T9GSVY4ZIC&marketplace=FLIPKART&page=1"

### Requesting desired Webpage

In [None]:
r = requests.get(URL)    
soup = BeautifulSoup(r.content, 'html.parser') 
print(soup.prettify()[6000:7000])

If you know HTML, this might look familiar.  
Next we'll see how to get our data.
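
Before parsing, it can help to confirm that the request actually succeeded, since a blocked or failed request may still return some HTML. Here is a minimal sketch of that idea. The helper name `fetch_soup`, its `get` parameter and the 10 second timeout are our own choices for illustration, not part of the original notebook.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, get=requests.get, timeout=10):
    """Download url and return the parsed HTML, raising if the server reports an error."""
    response = get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return BeautifulSoup(response.content, "html.parser")
```

With this helper, `soup = fetch_soup(URL)` would replace the two lines that call `requests.get` and `BeautifulSoup` directly. The `get` parameter only exists so that the function can be tried with a stand-in instead of a real network call.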

## Extracting data

A webpage can be divided into many components and sub-components. At times it is a complex grid structure that needs to be decoded.  
1. You can view the structure by pressing `Ctrl + Shift + C`, which opens the browser's element inspector.
2. Now if you hover over any review, you'll notice that each block is named `col._2wzgFH.K0kLPL`.
![](https://github.com/kabirnagpal/Web-Scraping/blob/main/Images/div-name.png?raw=true)

3. Each block is further divided into multiple rows. The first row contains the rating, while the second contains the actual review. 
![](https://github.com/kabirnagpal/Web-Scraping/blob/main/Images/rating.png?raw=true)
![](https://github.com/kabirnagpal/Web-Scraping/blob/main/Images/review.png?raw=true)
We'll follow exactly the same approach to extract the data.

In [None]:
# Extracting all review blocks
# Note: col._2wzgFH.K0kLPL denotes three classes, namely 'col', '_2wzgFH' and 'K0kLPL'
# In HTML this is written as 'col _2wzgFH K0kLPL'
# This can also be seen in bullet 3

row = soup.find_all('div',attrs={'class':'col _2wzgFH K0kLPL'})
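
Passing the whole string `'col _2wzgFH K0kLPL'` asks Beautiful Soup to match the class attribute exactly as written. A CSS selector is an alternative that matches any `div` carrying all three classes, regardless of order or extra classes. The sketch below compares the two on a small made-up HTML snippet (the snippet and the `extra` class are invented for illustration).

```python
from bs4 import BeautifulSoup

# Made-up HTML imitating the structure described above; real pages are far larger.
SAMPLE_HTML = """
<div class="col _2wzgFH K0kLPL">
  <div class="row"><div>5</div></div>
  <div class="row"><div>Great sound</div></div>
</div>
<div class="K0kLPL col _2wzgFH extra">
  <div class="row"><div>3</div></div>
  <div class="row"><div>Average</div></div>
</div>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Exact-string match: only blocks whose class attribute is literally this string
exact = sample_soup.find_all("div", attrs={"class": "col _2wzgFH K0kLPL"})

# CSS selector: any div that has all three classes, in any order
by_selector = sample_soup.select("div.col._2wzgFH.K0kLPL")

print(len(exact), len(by_selector))
```

Which one is preferable depends on the page. The exact match is stricter, while the selector keeps working if the site reorders or adds classes.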

In [None]:
# list to store data
dataset = []

# iteration over all blocks
for i in row: 
    
    # finding all rows within the block
    sub_row = i.find_all('div',attrs={'class':'row'})
        
    # extracting text from 1st and 2nd row
    rating = sub_row[0].find('div').text
    review = sub_row[1].find('div').text
    
    # appending to data
    dataset.append({'review': review , 'rating' : rating})

dataset[:5]
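
The loop above assumes every block has at least two rows, each containing a `div`. If a block is laid out differently, `sub_row[1]` raises an `IndexError`. A more defensive version might look like the sketch below, where `extract_review` is a hypothetical helper and the HTML is made up. Unlike the loop above, it also strips surrounding whitespace from the text.

```python
from bs4 import BeautifulSoup


def extract_review(block):
    """Return {'review', 'rating'} for one review block, or None if the layout differs."""
    rows = block.find_all("div", attrs={"class": "row"})
    if len(rows) < 2:
        return None
    rating_div = rows[0].find("div")
    review_div = rows[1].find("div")
    if rating_div is None or review_div is None:
        return None
    return {"review": review_div.text.strip(), "rating": rating_div.text.strip()}


# Made-up HTML: the second block is missing its review row on purpose.
SAMPLE_HTML = """
<div class="col _2wzgFH K0kLPL">
  <div class="row"><div>4</div></div>
  <div class="row"><div> Good bass </div></div>
</div>
<div class="col _2wzgFH K0kLPL">
  <div class="row"><div>5</div></div>
</div>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
blocks = sample_soup.find_all("div", attrs={"class": "col _2wzgFH K0kLPL"})

# Keep only the blocks that could be parsed
sample_dataset = [item for item in map(extract_review, blocks) if item is not None]
print(sample_dataset)
```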

## Iterating over multiple Pages

In [None]:
dataset = []

# iterating over 50 pages of reviews (range's upper bound is exclusive)
for i in tqdm(range(1, 51)):

    URL = f"https://www.flipkart.com/boat-rockerz-400-bluetooth-headset/product-reviews/itm14d0416b87d55?pid=ACCEJZXYKSG2T9GS&lid=LSTACCEJZXYKSG2T9GSVY4ZIC&marketplace=FLIPKART&page={i}"
    r = requests.get(URL)    
    soup = BeautifulSoup(r.content, 'html.parser') 

    cols = soup.find_all('div',attrs={'class':'col _2wzgFH K0kLPL'})

    for col in cols:
        row = col.find_all('div',attrs={'class':'row'})

        rating = row[0].find('div').text
        review = row[1].find('div').text

        dataset.append({'review': review , 'rating' : rating})
len(dataset)
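
When looping over many pages, it is considerate to pause between requests, and it saves time to stop once a page returns no reviews. The sketch below shows that control flow on its own. `scrape_pages`, `fetch_reviews` and the delay value are hypothetical names and settings, and the assumption that an empty page marks the end of the reviews is ours.

```python
import time


def scrape_pages(fetch_reviews, max_pages=50, delay_seconds=1.0):
    """Collect reviews page by page, pausing after each page and stopping at the first empty one."""
    collected = []
    for page in range(1, max_pages + 1):
        reviews = fetch_reviews(page)
        if not reviews:
            # Assumption: an empty page means there are no further reviews
            break
        collected.extend(reviews)
        time.sleep(delay_seconds)
    return collected


# Stand-in for a real page scraper: pretends only 3 pages of reviews exist.
def fake_fetch(page):
    if page <= 3:
        return [{"review": f"review from page {page}", "rating": "5"}]
    return []


all_reviews = scrape_pages(fake_fetch, max_pages=50, delay_seconds=0)
print(len(all_reviews))
```

In the notebook, `fetch_reviews` would wrap the request and parsing steps from the loop above for a single page number.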

In [None]:
import pandas as pd
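# Uncomment the next line to save the scraped reviews to a CSV file
# pd.DataFrame(dataset).to_csv('data.csv',index=False)
# Load a previously saved copy of the scraped data
data = pd.read_csv('../input/flipkart-customer-review-and-rating/data.csv')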
data.head()
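
Scraped ratings arrive as text, so it is often useful to convert them to numbers before saving. The sketch below builds a DataFrame from a small made-up list in the same shape as `dataset`, converts the ratings, and writes and reads the CSV through an in-memory buffer so that no file is touched. The sample reviews are invented.

```python
import io

import pandas as pd

# Made-up rows in the same shape as the scraped dataset
sample_dataset = [
    {"review": "Great sound", "rating": "5"},
    {"review": "Decent for the price", "rating": "4"},
    {"review": "Love the bass", "rating": "5"},
]

df = pd.DataFrame(sample_dataset)

# Turn the rating text into numbers; unparseable values would become NaN
df["rating"] = pd.to_numeric(df["rating"], errors="coerce")

# Round trip through CSV using an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored["rating"].value_counts())
```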