### **Exercise: collecting data from web pages**

With the following code we crawl the Flipkart site to collect information about every laptop it has on sale, specifically:
 - Model
 - Specs
 - Current price
 - Original price
 - Discount percentage
 - User rating

The first thing we did was check the site's crawling rules in /robots.txt to make sure the site allows this data to be collected.
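The robots.txt check described above can also be done in code. The following is a minimal sketch using the standard library's `urllib.robotparser`. The rules in `SAMPLE_ROBOTS_TXT` and the helper name `is_allowed` are assumptions made up for illustration; they are not Flipkart's actual rules, which have to be read from the site itself.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration;
# the real rules must be read from the site's own /robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout
Disallow: /account
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


laptops_url = "https://www.flipkart.com/laptops/pr?sid=6bo%2Cb5g&marketplace=FLIPKART&page=1"
print(is_allowed(SAMPLE_ROBOTS_TXT, laptops_url))
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://www.flipkart.com/checkout"))
```

To check the live file instead of a sample string, `RobotFileParser` also offers `set_url()` followed by `read()`, which downloads and parses the file in one step.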

In [19]:
# Import the required libraries

from bs4 import BeautifulSoup
import requests
import pandas as pd

In [20]:
# Create the empty lists that will hold the scraped information

descriptions=[]
products=[]
prices=[]
original_prices=[] 
ratings=[]
discounts=[]

In [21]:
# Loop over the 45 result pages of the laptop search on Flipkart
# so we can extract the information for every product listed


for n in range(1, 46):
    url = "https://www.flipkart.com/laptops/pr?sid=6bo%2Cb5g&marketplace=FLIPKART&page=" + str(n)
    r = requests.get(url)
    data = r.text
    soup = BeautifulSoup(data, 'html.parser')

    # Locate the anchor (by its class) that wraps each laptop's information,
    # then find, also by class, the divs holding the name, price, rating
    # and description of each laptop

    for a in soup.find_all('a', href=True, attrs={'class': '_1fQZEK'}):
        name = a.find('div', attrs={'class': '_4rR01T'})
        price = a.find('div', attrs={'class': '_30jeq3 _1_WHN1'})
        original_price = a.find('div', attrs={'class': '_3I9_wc _27UcVY'})
        discount = a.find('div', attrs={'class': '_3Ay6Sb'})
        rating = a.find('div', attrs={'class': '_3LWZlK'})
        description = a.find('div', attrs={'class': 'fMghEO'})

        # Append to the lists, separating the fields that are always
        # present from those that are sometimes missing

        products.append(name.text)
        prices.append(price.text)
        descriptions.append(description)
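        # find() returns None for a missing field, so accessing an attribute
        # on it raises AttributeError; in that case we store None
        try:
            discounts.append(str(discount.span))
        except AttributeError:
            discounts.append(None)
        try:
            original_prices.append(original_price.text)
        except AttributeError:
            original_prices.append(None)
        try:
            ratings.append(float(rating.text))
        except (AttributeError, ValueError):
            ratings.append(None)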
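The handling of optional fields in the loop can also be expressed with a small helper instead of repeated `try` blocks. This is only a sketch of an alternative, not the approach used in the notebook: the helper name `text_or_none`, the sample markup and its class names are all hypothetical and chosen for illustration.

```python
from bs4 import BeautifulSoup


def text_or_none(parent, tag, css_class):
    """Return the stripped text of the first matching child, or None if absent."""
    node = parent.find(tag, attrs={"class": css_class})
    return node.get_text(strip=True) if node is not None else None


# Hypothetical markup with made-up class names, used only for illustration
html = """
<a class="card" href="/p/1">
  <div class="name">Laptop A</div>
  <div class="price">10000</div>
</a>
"""
card = BeautifulSoup(html, "html.parser").find("a", attrs={"class": "card"})

print(text_or_none(card, "div", "name"))    # field present in the sample markup
print(text_or_none(card, "div", "rating"))  # field absent in the sample markup
```

The explicit `is not None` check makes the missing-field case visible in one place, so a genuine bug elsewhere is not silently swallowed by an exception handler.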
        

Because the specs (descriptions) come as lists inside the div on the web page, at this point each element we hold is a *bs4.element.Tag*. We therefore keep using BeautifulSoup to iterate over each element of the **descriptions** list and extract the text of each *li* tag. This turns **descriptions** into a list of lists.

In [22]:
# enumerate gives us the position in the main list, so the list of specs built
# in each iteration can be stored in the corresponding slot

for i, ul in enumerate(descriptions):
    # Collect the text of every <li> item inside the current Tag
    descriptions[i] = [li.text for li in ul.find_all('li')]
        

In [23]:
# Convert to a DataFrame and display it

flipk_data = pd.DataFrame({
    'Model': products,
    'Specs': descriptions,
    'Current Price': prices,
    'Original Price': original_prices,
    'Discount': discounts,
    'Rating': ratings,
})
flipk_data.tail(10)

Unnamed: 0,Model,Specs,Current Price,Original Price,Discount,Rating
974,Acer Nitro 5 Core i5 12th Gen - (16 GB/512 GB ...,"[Intel Core i5 Processor (12th Gen), 16 GB DDR...","₹73,990","₹90,999",<span>18% off</span>,4.3
975,Infinix ZEROBOOK 13 Intel Core i9 13th Gen - (...,"[Intel Core i9 Processor (13th Gen), 32 GB LPD...","₹81,990","₹1,49,900",<span>45% off</span>,
976,HP Chromebook MediaTek Kompanio 500 - (4 GB/64...,"[MediaTek MediaTek Kompanio 500 Processor, 4 G...","₹16,990","₹25,451",<span>33% off</span>,3.8
977,HP 15s Intel Core i5 12th Gen - (8 GB/512 GB S...,"[Intel Core i5 Processor (12th Gen), 8 GB DDR4...","₹53,990","₹67,832",<span>20% off</span>,
978,ASUS Vivobook S 14 Flip Ryzen 5 Hexa Core R5-5...,"[AMD Ryzen 5 Hexa Core Processor, 8 GB DDR4 RA...","₹58,500","₹79,000",<span>25% off</span>,
979,Avita Liber Core i7 8th Gen - (8 GB/256 GB SSD...,"[Intel Core i7 Processor (8th Gen), 8 GB DDR4 ...","₹67,990","₹79,990",<span>15% off</span>,
980,ASUS Vivobook 15 Core i5 11th Gen - (8 GB/512 ...,"[Intel Core i5 Processor (11th Gen), 8 GB DDR4...","₹40,990","₹69,990",<span>41% off</span>,
981,MSI Core i5 13th Gen - (8 GB/512 GB SSD/Window...,"[Intel Core i5 Processor (13th Gen), 8 GB DDR4...","₹54,990","₹64,990",<span>15% off</span>,3.0
982,DELL Core i9 12th Gen - (16 GB/1 TB SSD/Window...,"[Intel Core i9 Processor (12th Gen), 16 GB DDR...","₹1,75,018","₹2,35,073",<span>25% off</span>,
983,MSI Core i5 12th Gen - (8 GB/512 GB SSD/Window...,"[Intel Core i5 Processor (12th Gen), 8 GB LPDD...","₹89,990","₹1,06,990",<span>15% off</span>,


In [24]:
# Clean the Discount column to remove the span tag

flipk_data.Discount=flipk_data.Discount.map(lambda x: x[6:9], na_action='ignore')
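The slice `x[6:9]` works because `<span>` is six characters long and the discounts shown have two digits; a single-digit value such as `<span>5% off</span>` would come out as `5% ` with a trailing space. A possible alternative, sketched here under the assumption that the percentage is always a run of digits followed by `%`, is a regular expression. The helper name `parse_discount` is hypothetical, and unlike the cell above it returns an integer rather than a string like `18%`.

```python
import re

# Matches the first run of digits that is followed by a percent sign
PERCENT_RE = re.compile(r"(\d+)\s*%")


def parse_discount(raw):
    """Return the discount as an int (e.g. 18), or None when it is missing."""
    if raw is None:
        return None
    match = PERCENT_RE.search(str(raw))
    return int(match.group(1)) if match else None


print(parse_discount("<span>18% off</span>"))
print(parse_discount("<span>5% off</span>"))
print(parse_discount(None))
```

It could be applied to the column with `flipk_data.Discount.map(parse_discount)`, which would also make the values directly usable in numeric comparisons.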

In [25]:
# Check the result

flipk_data.tail(10)

Unnamed: 0,Model,Specs,Current Price,Original Price,Discount,Rating
974,Acer Nitro 5 Core i5 12th Gen - (16 GB/512 GB ...,"[Intel Core i5 Processor (12th Gen), 16 GB DDR...","₹73,990","₹90,999",18%,4.3
975,Infinix ZEROBOOK 13 Intel Core i9 13th Gen - (...,"[Intel Core i9 Processor (13th Gen), 32 GB LPD...","₹81,990","₹1,49,900",45%,
976,HP Chromebook MediaTek Kompanio 500 - (4 GB/64...,"[MediaTek MediaTek Kompanio 500 Processor, 4 G...","₹16,990","₹25,451",33%,3.8
977,HP 15s Intel Core i5 12th Gen - (8 GB/512 GB S...,"[Intel Core i5 Processor (12th Gen), 8 GB DDR4...","₹53,990","₹67,832",20%,
978,ASUS Vivobook S 14 Flip Ryzen 5 Hexa Core R5-5...,"[AMD Ryzen 5 Hexa Core Processor, 8 GB DDR4 RA...","₹58,500","₹79,000",25%,
979,Avita Liber Core i7 8th Gen - (8 GB/256 GB SSD...,"[Intel Core i7 Processor (8th Gen), 8 GB DDR4 ...","₹67,990","₹79,990",15%,
980,ASUS Vivobook 15 Core i5 11th Gen - (8 GB/512 ...,"[Intel Core i5 Processor (11th Gen), 8 GB DDR4...","₹40,990","₹69,990",41%,
981,MSI Core i5 13th Gen - (8 GB/512 GB SSD/Window...,"[Intel Core i5 Processor (13th Gen), 8 GB DDR4...","₹54,990","₹64,990",15%,3.0
982,DELL Core i9 12th Gen - (16 GB/1 TB SSD/Window...,"[Intel Core i9 Processor (12th Gen), 16 GB DDR...","₹1,75,018","₹2,35,073",25%,
983,MSI Core i5 12th Gen - (8 GB/512 GB SSD/Window...,"[Intel Core i5 Processor (12th Gen), 8 GB LPDD...","₹89,990","₹1,06,990",15%,


In [26]:
# Save our data

flipk_data.to_csv('FlipK_Data.csv',index=False)

In [27]:
# And load it back to check that it was saved correctly

ruta='FlipK_Data.csv'
data=pd.read_csv(ruta,sep=',')
data

Unnamed: 0,Model,Specs,Current Price,Original Price,Discount,Rating
0,Primebook 4G Android Based MediaTek MT8788 - (...,"['MediaTek MediaTek MT8788 Processor', '4 GB L...","₹13,990","₹27,990",50%,4.0
1,Primebook 4G Android Based MediaTek MT8788 - (...,"['MediaTek MediaTek MT8788 Processor', '4 GB L...","₹12,990","₹24,990",48%,
2,ASUS Vivobook 15 Core i3 11th Gen - (8 GB/512 ...,"['Intel Core i3 Processor (11th Gen)', '8 GB D...","₹34,990","₹49,990",30%,
3,ASUS TUF Gaming A15 Ryzen 5 Hexa Core AMD R5-4...,"['AMD Ryzen 5 Hexa Core Processor', '8 GB DDR4...","₹52,990","₹75,990",30%,4.3
4,realme Book (Slim) Core i3 11th Gen - (8 GB/25...,"['Stylish & Portable Thin and Light Laptop', '...","₹31,999","₹54,999",41%,
...,...,...,...,...,...,...
979,Avita Liber Core i7 8th Gen - (8 GB/256 GB SSD...,"['Intel Core i7 Processor (8th Gen)', '8 GB DD...","₹67,990","₹79,990",15%,
980,ASUS Vivobook 15 Core i5 11th Gen - (8 GB/512 ...,"['Intel Core i5 Processor (11th Gen)', '8 GB D...","₹40,990","₹69,990",41%,
981,MSI Core i5 13th Gen - (8 GB/512 GB SSD/Window...,"['Intel Core i5 Processor (13th Gen)', '8 GB D...","₹54,990","₹64,990",15%,3.0
982,DELL Core i9 12th Gen - (16 GB/1 TB SSD/Window...,"['Intel Core i9 Processor (12th Gen)', '16 GB ...","₹1,75,018","₹2,35,073",25%,
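Note that in the reloaded table the Specs values carry quotes around each item: CSV stores every cell as text, so each list comes back as the string representation of a list. A minimal sketch of recovering real lists with the standard library's `ast.literal_eval` follows; the helper name `parse_specs` and the sample string are hypothetical.

```python
import ast


def parse_specs(cell):
    """Turn the string form of a list, as stored in the CSV, back into a list."""
    if isinstance(cell, list):
        return cell  # already a list, nothing to do
    return ast.literal_eval(cell)


# Hypothetical cell value in the same shape as the reloaded Specs column
stored = "['spec one', 'spec two']"
specs = parse_specs(stored)

print(type(specs).__name__, len(specs))
```

With pandas this could be applied as `data['Specs'].map(parse_specs)`, or at load time through the `converters` argument of `pd.read_csv`.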


In [28]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 984 entries, 0 to 983
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Model           984 non-null    object 
 1   Specs           984 non-null    object 
 2   Current Price   984 non-null    object 
 3   Original Price  971 non-null    object 
 4   Discount        968 non-null    object 
 5   Rating          476 non-null    float64
dtypes: float64(1), object(5)
memory usage: 46.2+ KB
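The `info()` output shows both price columns with dtype `object`, meaning they are still strings such as `₹73,990`. If numeric prices were needed later, one possible conversion is sketched below; it assumes every price is a rupee sign followed by digits with comma separators, and the helper name `parse_rupees` is hypothetical.

```python
def parse_rupees(raw):
    """Convert a price string such as '₹1,75,018' to an int; None if missing."""
    if not isinstance(raw, str):
        return None  # covers None and the NaN floats pandas uses for missing cells
    digits = raw.replace("₹", "").replace(",", "").strip()
    return int(digits) if digits else None


print(parse_rupees("₹73,990"))
print(parse_rupees("₹1,75,018"))
print(parse_rupees(None))
```

Removing every comma rather than relying on a fixed grouping keeps the helper independent of the Indian digit grouping seen in values like `₹1,75,018`.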
