## Web scraping using Beautiful Soup

### Import libraries

In [1]:
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd

### Insert the URL and send a request to the server

If we obtain status code 200, the request succeeded and we received the website's source code.

A list of HTTP status codes is available
[here](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes).

In [2]:
url = "https://www.decathlon.in/camping/tents-shelters-15687?id=15687&type=c"
response = requests.get(url)
print(response)  # A 200 status code means the request was successful

<Response [200]>
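
Printing the response works, but the status can also be checked programmatically. The sketch below is a minimal illustration that uses only the standard library; the helper name `describe_status` is hypothetical and is not part of `requests`. With a real response you would pass `response.status_code` to it.

```python
from http import HTTPStatus

def describe_status(code):
    """Return (is_success, phrase) for an HTTP status code (hypothetical helper)."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = "Unknown"  # code is not in the standard registry
    # 2xx codes mean the request succeeded
    return 200 <= code < 300, phrase

print(describe_status(200))
print(describe_status(404))
```

`requests` also provides `response.raise_for_status()`, which raises an exception for 4xx and 5xx responses, so a failed request does not go unnoticed.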


To parse the page we need its source code and a parser. There are several options; the most commonly used are `html.parser` (included with Python) and `lxml` (installed separately).

In [3]:
html = response.content  # Source code (raw bytes)
soup = bs(html, 'lxml')  # Parse it into a BeautifulSoup object
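
If `lxml` is not installed, Beautiful Soup raises `FeatureNotFound`. The following is a minimal sketch of a fallback to the built-in parser; the helper name `make_soup` and the inline HTML are assumptions made for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(html, preferred="lxml"):
    """Parse html with the preferred parser, falling back to html.parser."""
    try:
        return BeautifulSoup(html, preferred)
    except FeatureNotFound:
        # lxml is a third-party package; html.parser ships with Python
        return BeautifulSoup(html, "html.parser")

demo_soup = make_soup("<html><head><title>Demo</title></head><body></body></html>")
print(demo_soup.title.string)
```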

To check that it worked, we display the title found in the website's source code.

In [4]:
soup.title

<title>Buy Tent Online | Decathlon</title>

For this step you need to browse the website's source code (F12 in your preferred browser) to identify the HTML elements that hold the information. In this case, a `div` tag with class `card h-full` contains the useful information for each tent on the website.

To get a list of all products, we use the `find_all` method. If you only need the first match, use `find` instead.

In [5]:
# find all the products on the page
products = soup.find_all('div', class_='card h-full')
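
To illustrate the difference between `find` and `find_all`, here is a small sketch on made-up HTML (not the real Decathlon markup). It also shows one detail worth knowing: a `class_` string that contains several classes is matched against the exact value of the `class` attribute, so the order of the classes matters, whereas a CSS selector passed to `select` does not depend on the order.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only
demo_html = """
<div class="card h-full"><p>Tent A</p></div>
<div class="h-full card"><p>Tent B</p></div>
<div class="banner"><p>Ad</p></div>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

first_card = demo_soup.find("div", class_="card h-full")       # first match only
exact_cards = demo_soup.find_all("div", class_="card h-full")  # exact attribute string
css_cards = demo_soup.select("div.card.h-full")                # both classes, any order

print(first_card.p.text)
print(len(exact_cards), len(css_cards))
```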

The next step is to search inside each product and find the tag and class that hold each specific piece of information (for example, the `a` tag and its `href` attribute give the URL of the product's detail page). We collect the data in lists, to combine them into a DataFrame later.

In [6]:
names = []
brands = []
currencies = []
prices = []
ratings = []
reviews = []

for product in products:
    # find the product name
    name = product.find('p', class_='capitalize text-14 lg:text-14 whitespace-nowrap overflow-ellipsis overflow-hidden mt-1').text.strip()
    names.append(name)
    
    # find the product brand
    brand = product.find('div', {'class':'font-semibold text-grey-900 lg:text-16 GBwGxwDUZb'}).text.strip()
    brands.append(brand)
        
    # find the currency of the product price
    currency = product.find('div','relative px-2 sm:px-2 py-0.5 bg-yellow-400 text-14 whitespace-nowrap lg:text-16').find_all('span')[0].text.strip()
    currencies.append(currency)
    
    # find the product price
    price = product.find('div','relative px-2 sm:px-2 py-0.5 bg-yellow-400 text-14 whitespace-nowrap lg:text-16').find_all('span')[1].text.strip()
    prices.append(price)
    
    # find the average rating of the product
    rating = product.find('span', {'class':'ml-1 font-semibold text-blue-500 text-12 lg:text-14'}).text.strip()
    ratings.append(rating)
    
    # find the link to the product's detail page
    review = product.find('a')['href']
    review = 'https://www.decathlon.in' + review
    reviews.append(review)
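
One caveat with chaining `.find(...).text`: `find` returns `None` when nothing matches, so a product card without, say, a rating would raise an `AttributeError`. A possible way to guard against that is sketched below; the helper name `text_or_none` and the simplified markup are assumptions for illustration, not the site's actual HTML.

```python
from bs4 import BeautifulSoup

def text_or_none(parent, name, class_):
    """Return the stripped text of the first matching tag, or None if absent."""
    tag = parent.find(name, class_=class_)
    return tag.get_text(strip=True) if tag is not None else None

# Simplified, made-up markup: the second card has no rating
demo_html = """
<div class="card"><p class="name"> Tent A </p><span class="rating">4.5</span></div>
<div class="card"><p class="name">Tent B</p></div>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")
demo_cards = demo_soup.find_all("div", class_="card")

demo_names = [text_or_none(card, "p", "name") for card in demo_cards]
demo_ratings = [text_or_none(card, "span", "rating") for card in demo_cards]
print(demo_names, demo_ratings)
```

Appending `None` for a missing value keeps all lists the same length, which the DataFrame construction requires.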

We create a DataFrame from the lists.

In [7]:
df = pd.DataFrame.from_dict({'Product':names, 'Brand':brands, 'Rating':ratings, 'Currency':currencies,
                        'Price':prices, 'Description':reviews})

In [8]:
print(df.shape)
df.sample(5)

(20, 6)


| | Product | Brand | Rating | Currency | Price | Description |
|---|---|---|---|---|---|---|
| 19 | Protective Groundsheet MT500 3-Person Tent | FORCLAZ | 4.8 | ₹ | 999 | https://www.decathlon.in/p/8581938/tents-shelt... |
| 15 | Camping tent - MH100 - 3-person - Fresh | QUECHUA | 4.5 | ₹ | 6999 | https://www.decathlon.in/p/8641760/tents-shelt... |
| 11 | Camping hoop tent - Arpenaz 6.3 - 6-Person - 3... | QUECHUA | 3.9 | ₹ | 22999 | https://www.decathlon.in/p/8603881/tents-shelt... |
| 13 | Camping Tarp MH100 Blue | QUECHUA | 4.4 | ₹ | 2299 | https://www.decathlon.in/p/8544366/tents-shelt... |
| 3 | Camping Living Room with poles - Arpenaz Base ... | QUECHUA | 4.4 | ₹ | 11999 | https://www.decathlon.in/p/8648391/tents-shelt... |
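
Because every value was read with `.text`, the `Rating` and `Price` columns hold strings rather than numbers. A minimal sketch of converting them with `pd.to_numeric` follows; the small `demo_df` is made-up data that mirrors the shape of the scraped columns.

```python
import pandas as pd

# Made-up rows shaped like the scraped data (all values are strings)
demo_df = pd.DataFrame({
    "Rating": ["4.8", "4.5", None],
    "Price": ["999", "6999", "22999"],
})

# errors="coerce" turns values that cannot be parsed into NaN instead of raising
demo_df["Rating"] = pd.to_numeric(demo_df["Rating"], errors="coerce")
demo_df["Price"] = pd.to_numeric(demo_df["Price"], errors="coerce")

print(demo_df.dtypes)
print(demo_df["Price"].mean())
```

After the conversion, numeric operations such as sorting by price or averaging ratings behave as expected.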


### Bonus! Get the products from several pages

There may be many tents, in which case the website paginates the results. Here the URL accepts a page parameter at the end, for example `&page=2` for page 2.

In [9]:
urls = []
page = 1
while page < 3:
    url = f"https://www.decathlon.in/camping/tents-shelters-15687?id=15687&type=c&page={page}"
    urls.append(url)
    print(url)
    page = page + 1

https://www.decathlon.in/camping/tents-shelters-15687?id=15687&type=c&page=1
https://www.decathlon.in/camping/tents-shelters-15687?id=15687&type=c&page=2
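
As an alternative to formatting the query string by hand, the page URLs could be built with the standard library's `urlencode`. This is only a sketch; the names `BASE_URL` and `page_url` are hypothetical, and the parameters are the ones visible in the URL above.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.decathlon.in/camping/tents-shelters-15687"

def page_url(page):
    """Build the listing URL for a given page number (hypothetical helper)."""
    query = urlencode({"id": 15687, "type": "c", "page": page})
    return f"{BASE_URL}?{query}"

demo_urls = [page_url(page) for page in range(1, 3)]  # pages 1 and 2
for demo_url in demo_urls:
    print(demo_url)
```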


We create a function to collect the data from the 2 pages of the website. It only needs the URL of each page, so we can call it once per page.

In [10]:
def scrap_tents(url):
    response = requests.get(url)
    html = response.content  # Source code
    soup = bs(html, 'lxml')
    products = soup.find_all('div', class_='card h-full')
    
    for product in products:
        # find the product name
        name = product.find('p', class_='capitalize text-14 lg:text-14 whitespace-nowrap overflow-ellipsis overflow-hidden mt-1').text.strip()
        names.append(name)

        # find the product brand
        brand = product.find('div', {'class':'font-semibold text-grey-900 lg:text-16 GBwGxwDUZb'}).text.strip()
        brands.append(brand)

        # find the currency of the product price
        currency = product.find('div','relative px-2 sm:px-2 py-0.5 bg-yellow-400 text-14 whitespace-nowrap lg:text-16').find_all('span')[0].text.strip()
        currencies.append(currency)

        # find the product price
        price = product.find('div','relative px-2 sm:px-2 py-0.5 bg-yellow-400 text-14 whitespace-nowrap lg:text-16').find_all('span')[1].text.strip()
        prices.append(price)

        # find the average rating of the product
        rating = product.find('span', {'class':'ml-1 font-semibold text-blue-500 text-12 lg:text-14'}).text.strip()
        ratings.append(rating)

        # find the link to the product's detail page
        review = product.find('a')['href']
        review = 'https://www.decathlon.in' + review
        reviews.append(review)    
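
The function above fills the module-level lists as a side effect. One possible variation, sketched here under simplifying assumptions, separates parsing from downloading and returns the rows instead, which makes the parsing easy to try on a small HTML string. The names `parse_cards` and `SITE_ROOT` and the markup are hypothetical; the link is resolved with `urljoin` rather than string concatenation.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

SITE_ROOT = "https://www.decathlon.in"

def parse_cards(html, card_class="card"):
    """Parse product cards from HTML text into a list of dicts."""
    page_soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in page_soup.find_all("div", class_=card_class):
        link = card.find("a")
        rows.append({
            "Product": card.find("p").get_text(strip=True),
            # urljoin resolves the relative href against the site root
            "Description": urljoin(SITE_ROOT, link["href"]) if link is not None else None,
        })
    return rows

# Made-up markup for illustration only
demo_html = '<div class="card"><a href="/p/123/demo-tent">details</a><p>Demo tent</p></div>'
demo_rows = parse_cards(demo_html)
print(demo_rows)
```

A list of dicts like `demo_rows` can be passed directly to `pd.DataFrame(...)`, with the dict keys becoming the column names.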

We iterate over every URL in the list `urls` and save the data in the lists.

In [11]:
names = []
brands = []
currencies = []
prices = []
ratings = []
reviews = []

[scrap_tents(url) for url in urls]

[None, None]
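
The `[None, None]` output appears because the list comprehension collects the return value of each call, and the function returns nothing. A plain loop avoids that and leaves room for a short pause between requests. The sketch below assumes a hypothetical helper `scrape_all` and an arbitrary one-second default delay; a stand-in callable replaces the real scraping function so the example runs without network access.

```python
import time

def scrape_all(page_urls, scrape_page, delay_seconds=1.0):
    """Call scrape_page(url) for each URL, pausing between consecutive calls."""
    for index, page_url in enumerate(page_urls):
        if index > 0:
            time.sleep(delay_seconds)  # avoid sending requests back to back
        scrape_page(page_url)

# Stand-in for the real scraping function: just record which URLs were visited
visited = []
scrape_all(["page-1", "page-2"], visited.append, delay_seconds=0)
print(visited)
```

In the notebook, `scrape_all(urls, scrap_tents)` would play the role of the list comprehension.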

Next, we convert the data into a DataFrame.

In [12]:
df2 = pd.DataFrame.from_dict({'Product':names, 'Brand':brands, 'Rating':ratings, 'Currency':currencies,
                        'Price':prices, 'Description':reviews})

In [13]:
print(df2.shape)
df2.sample(5)

(29, 6)


| | Product | Brand | Rating | Currency | Price | Description |
|---|---|---|---|---|---|---|
| 18 | Camping tent 2 Seconds Easy - 3-P - Fresh&Black | QUECHUA | 4.6 | ₹ | 16999 | https://www.decathlon.in/p/8576110/tents-shelt... |
| 11 | Camping hoop tent - Arpenaz 6.3 - 6-Person - 3... | QUECHUA | 3.9 | ₹ | 22999 | https://www.decathlon.in/p/8603881/tents-shelt... |
| 25 | Ultra-light reflective trekking guy ropes | FORCLAZ | 4.3 | ₹ | 399 | https://www.decathlon.in/p/8527882/tents-shelt... |
| 28 | 3-Person Trekking Tent MT900 Ultralight | FORCLAZ | 4.0 | ₹ | 22999 | https://www.decathlon.in/p/8586319/tents-shelt... |
| 26 | Trekking dome tent - 2-person - MT900 Minimal ... | FORCLAZ | 4.6 | ₹ | 19999 | https://www.decathlon.in/p/8736725/tents-shelt... |


In [15]:
#df2.sample(5).to_markdown() #get a markdown table