# Web Scraping Lab

This notebook contains a set of web scraping exercises to practise your scraping skills.

**Tips:**

- Check the response status code for each request to ensure you have obtained the intended content.
- Print the response text in each request to understand the kind of info you are getting and its format.
- Check for patterns in the response text to extract the data/info requested in each question.
- Visit each url and take a look at its source through Chrome DevTools. You'll need to identify the html tags, special class names etc. used for the html content you are expected to extract.
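
As a minimal sketch of the first tip, the helper below classifies a status code using only the standard library. `describe_status` is a hypothetical name, not part of `requests`; with `requests` you would pass it `response.status_code`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return (is_success, phrase) for an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # The code is not known to the standard library
        phrase = 'Unknown'
    return 200 <= code < 300, phrase


for code in (200, 404, 503):
    ok, phrase = describe_status(code)
    print(code, phrase, 'OK to parse' if ok else 'do not parse')
```

Only a 2xx code suggests that the body is the page you asked for; anything else is worth inspecting before you try to extract data from it.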

- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide) documentation 
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [lxml lib](https://lxml.de/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

#### Below are the libraries and modules you may need. `requests`,  `BeautifulSoup` and `pandas` are imported for you. If you prefer to use additional libraries feel free to uncomment them.

In [1]:
# This is what needs to be installed in order to do web scraping

#pip install beautifulsoup4          # required for bs4
#pip install selenium                # required for web scraping
#pip install webdriver-manager       # required for web scraping

In [2]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
import requests as req
import time

In [3]:
'''
This is not needed in recent versions of Selenium
'''

#PATH = 'driver/chromedriver'

'\nThis is not needed in recent versions of Selenium\n'

In [4]:
'''
Without needing to set PATH, we can open the Chrome driver
'''

driver = webdriver.Chrome()

#### Download, parse (using BeautifulSoup), and print the content from the Trending Developers page from GitHub:

In [5]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/developers'

In [6]:
driver.get(url)

#### Display the names of the trending developers retrieved in the previous step.

Your output should be a Python list of developer names. Each name should not contain any html tag.

**Instructions:**

1. Find out the html tag and class names used for the developer names. You can achieve this using Chrome DevTools.

1. Use BeautifulSoup to extract all the html elements that contain the developer names.

1. Use string manipulation techniques to replace whitespaces and linebreaks (i.e. `\n`) in the *text* of each html element. Use a list to store the clean names.

1. Print the list of names.

Your output should look like below:

```
['trimstray (@trimstray)',
 'joewalnes (JoeWalnes)',
 'charlax (Charles-AxelDein)',
 'ForrestKnight (ForrestKnight)',
 'revery-ui (revery-ui)',
 'alibaba (Alibaba)',
 'Microsoft (Microsoft)',
 'github (GitHub)',
 'facebook (Facebook)',
 'boazsegev (Bo)',
 'google (Google)',
 'cloudfetch',
 'sindresorhus (SindreSorhus)',
 'tensorflow',
 'apache (TheApacheSoftwareFoundation)',
 'DevonCrawford (DevonCrawford)',
 'ARMmbed (ArmMbed)',
 'vuejs (vuejs)',
 'fastai (fast.ai)',
 'QiShaoXuan (Qi)',
 'joelparkerhenderson (JoelParkerHenderson)',
 'torvalds (LinusTorvalds)',
 'CyC2018',
 'komeiji-satori (神楽坂覚々)',
 'script-8']
 ```
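
The whitespace clean-up described in step 3 can be sketched as follows. The raw strings are hypothetical stand-ins for the text of an HTML element, and `clean_name` is an illustrative helper name.

```python
def clean_name(raw_text):
    """Collapse every run of whitespace (spaces, tabs, line breaks) into one space."""
    return ' '.join(raw_text.split())


# Hypothetical raw strings, shaped like text pulled from an HTML element
raw_names = ['\n\n      Linus Torvalds\n\n    ', '  Sindre\n Sorhus ']
names = [clean_name(raw) for raw in raw_names]
print(names)
```

The sample output above drops the spaces inside each name as well; `''.join(raw_text.split())` would reproduce that behaviour.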

In [7]:
tarjetas = driver.find_elements(By.CSS_SELECTOR, 'div.col-md-6')
tarjetas

[<selenium.webdriver.remote.webelement.WebElement (session="2f9aa57b5d1170687145b9c5ae6c2548", element="f.D32591034E7012EA4E6E57205D60598E.d.46EF595AD47ECD030BF5CC4231020573.e.16")>,
 <selenium.webdriver.remote.webelement.WebElement (session="2f9aa57b5d1170687145b9c5ae6c2548", element="f.D32591034E7012EA4E6E57205D60598E.d.46EF595AD47ECD030BF5CC4231020573.e.17")>,
 <selenium.webdriver.remote.webelement.WebElement (session="2f9aa57b5d1170687145b9c5ae6c2548", element="f.D32591034E7012EA4E6E57205D60598E.d.46EF595AD47ECD030BF5CC4231020573.e.18")>,
 <selenium.webdriver.remote.webelement.WebElement (session="2f9aa57b5d1170687145b9c5ae6c2548", element="f.D32591034E7012EA4E6E57205D60598E.d.46EF595AD47ECD030BF5CC4231020573.e.19")>,
 <selenium.webdriver.remote.webelement.WebElement (session="2f9aa57b5d1170687145b9c5ae6c2548", element="f.D32591034E7012EA4E6E57205D60598E.d.46EF595AD47ECD030BF5CC4231020573.e.20")>,
 <selenium.webdriver.remote.webelement.WebElement (session="2f9aa57b5d1170687145b9c5a

In [8]:
'''
There are 50 cards in total, but we only want the cards shown on the left,
that is, numbers 0, 2, 4, 6, etc.
'''

len(tarjetas)

50

In [9]:
'''
To check the claim in the previous cell, we count the cards two at a time,
which should return 25
'''

len(tarjetas[::2]) # The slice must go inside 'len'
print(len(tarjetas[::2]))

'''
Rebind tarjetas to the subset of cards we are interested in
'''

tarjetas = tarjetas[::2]

25
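
The step slice used above can be checked offline with a plain list. This is only a sketch: the strings stand in for the 50 elements, which are assumed to alternate between left and right cards as observed on the page.

```python
# Stand-in for the 50 elements returned by find_elements (assumed to alternate
# left card / right card, as observed on the page)
cards = [f'card_{i}' for i in range(50)]

left_cards = cards[::2]   # indices 0, 2, 4, ..., 48
print(len(left_cards))    # 25
print(left_cards[:3])     # ['card_0', 'card_2', 'card_4']
```

If the page layout changes so that the elements no longer alternate, the slice would silently pick the wrong cards, so a length check like the one above is a cheap safeguard.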


In [10]:
'''
Extract the first entry as an example, using "element" instead of "elements"
'''

usuario_0 = tarjetas[0].find_element(By.TAG_NAME, 'h1').text
print(usuario_0)
print()
nickname_0 = tarjetas[0].find_element(By.TAG_NAME, 'p').text
print(nickname_0)

Pete

epwalsh


In [11]:
'''
- For each element t in tarjetas, find the name ("h1") and try to find the username ("p").
- If the username is not found, leave it empty.
- Build a dictionary with the name and the username.
- Append dictio to the list res.
'''

from selenium.common.exceptions import NoSuchElementException

res = []

for t in tarjetas:
    nombre = t.find_element(By.TAG_NAME,'h1').text
    
    try:
        usuario = t.find_element(By.TAG_NAME, 'p').text
    except NoSuchElementException:   # this card has no <p> username element
        usuario = ''
    
    dictio = {'nombre': nombre, 'usuario': usuario}
    
    res.append(dictio)

In [12]:
res

[{'nombre': 'Pete', 'usuario': 'epwalsh'},
 {'nombre': 'Charles Packer', 'usuario': 'cpacker'},
 {'nombre': 'Vectorized', 'usuario': 'Vectorized'},
 {'nombre': 'Oliver', 'usuario': 'SchrodingersGat'},
 {'nombre': 'Dessalines', 'usuario': 'dessalines'},
 {'nombre': 'lllyasviel', 'usuario': ''},
 {'nombre': 'weiyang', 'usuario': 'wy-z'},
 {'nombre': 'SoftFever', 'usuario': 'SoftFever'},
 {'nombre': 'Chakshu Gautam', 'usuario': 'ChakshuGautam'},
 {'nombre': 'Lee Robinson', 'usuario': 'leerob'},
 {'nombre': '三咲雅 · Misaki Masa', 'usuario': 'sxyazi'},
 {'nombre': 'Laurent Mazare', 'usuario': 'LaurentMazare'},
 {'nombre': 'Stephen Celis', 'usuario': 'stephencelis'},
 {'nombre': 'Deshraj Yadav', 'usuario': 'deshraj'},
 {'nombre': 'James Newton-King', 'usuario': 'JamesNK'},
 {'nombre': 'George Hotz', 'usuario': 'geohot'},
 {'nombre': 'J. Nick Koston', 'usuario': 'bdraco'},
 {'nombre': 'Jorge O. Castro', 'usuario': 'castrojo'},
 {'nombre': 'DaniPopes', 'usuario': ''},
 {'nombre': 'Sebastian Rasc

In [13]:
len(res)

25

In [14]:
driver.close()

#### Display the trending Python repositories in GitHub

The steps to solve this problem are similar to the previous one, except that you need to find the repository names instead of the developer names.

In [15]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/python?since=daily'

In [16]:
driver = webdriver.Chrome()

In [17]:
driver.get(url)

In [18]:
tarjetas = driver.find_elements(By.CSS_SELECTOR, 'h2.h3.lh-condensed')
tarjetas

[<selenium.webdriver.remote.webelement.WebElement (session="8835be40f7435e91226f3f2b83aebdcb", element="f.545CBAAF25D945B9897DD7C91BF45EC0.d.5F1FECAAA0BD1863D9DDAD4B0CA3A1DC.e.18")>,
 <selenium.webdriver.remote.webelement.WebElement (session="8835be40f7435e91226f3f2b83aebdcb", element="f.545CBAAF25D945B9897DD7C91BF45EC0.d.5F1FECAAA0BD1863D9DDAD4B0CA3A1DC.e.19")>,
 <selenium.webdriver.remote.webelement.WebElement (session="8835be40f7435e91226f3f2b83aebdcb", element="f.545CBAAF25D945B9897DD7C91BF45EC0.d.5F1FECAAA0BD1863D9DDAD4B0CA3A1DC.e.20")>,
 <selenium.webdriver.remote.webelement.WebElement (session="8835be40f7435e91226f3f2b83aebdcb", element="f.545CBAAF25D945B9897DD7C91BF45EC0.d.5F1FECAAA0BD1863D9DDAD4B0CA3A1DC.e.21")>,
 <selenium.webdriver.remote.webelement.WebElement (session="8835be40f7435e91226f3f2b83aebdcb", element="f.545CBAAF25D945B9897DD7C91BF45EC0.d.5F1FECAAA0BD1863D9DDAD4B0CA3A1DC.e.22")>,
 <selenium.webdriver.remote.webelement.WebElement (session="8835be40f7435e91226f3f2b8

In [19]:
len(tarjetas)

25

In [20]:
tarjetas[0].text

'layerdiffusion / sd-forge-layerdiffusion'

In [21]:
res = []

for t in tarjetas:
    res.append(t.text)

In [22]:
res

['layerdiffusion / sd-forge-layerdiffusion',
 'naver / dust3r',
 'AUTOMATIC1111 / stable-diffusion-webui',
 'pydantic / FastUI',
 'bigcode-project / starcoder2',
 'allenai / OLMo',
 'kyegomez / BitNet',
 'embedchain / embedchain',
 'mini-sora / minisora',
 'liguodongiot / llm-action',
 'lllyasviel / Fooocus',
 'tinygrad / tinygrad',
 'lllyasviel / stable-diffusion-webui-forge',
 'Sinaptik-AI / pandas-ai',
 'freqtrade / freqtrade',
 'roboflow / supervision',
 'donnemartin / system-design-primer',
 'microsoft / unilm',
 'ltdrdata / ComfyUI-Impact-Pack',
 'BatsResearch / bonito',
 'huggingface / alignment-handbook',
 'alexta69 / metube',
 'VikParuchuri / marker',
 'majacinka / crewai-experiments',
 'smicallef / spiderfoot']

In [23]:
len(res)

25

In [24]:
driver.close()

#### Display all the image links from Walt Disney wikipedia page

In [25]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/Walt_Disney'

In [26]:
driver = webdriver.Chrome()

In [27]:
driver.get(url)

In [28]:
imagenes = driver.find_elements(By.TAG_NAME, 'img')
imagenes

[<selenium.webdriver.remote.webelement.WebElement (session="63e874f91f2fd03f4df7fd7d7dfdff84", element="f.6CEDE42D94AB4A53AF525C511A340B55.d.D2BC6620EEC8A30588E36E6F4CEE4DAB.e.18")>,
 <selenium.webdriver.remote.webelement.WebElement (session="63e874f91f2fd03f4df7fd7d7dfdff84", element="f.6CEDE42D94AB4A53AF525C511A340B55.d.D2BC6620EEC8A30588E36E6F4CEE4DAB.e.19")>,
 <selenium.webdriver.remote.webelement.WebElement (session="63e874f91f2fd03f4df7fd7d7dfdff84", element="f.6CEDE42D94AB4A53AF525C511A340B55.d.D2BC6620EEC8A30588E36E6F4CEE4DAB.e.20")>,
 <selenium.webdriver.remote.webelement.WebElement (session="63e874f91f2fd03f4df7fd7d7dfdff84", element="f.6CEDE42D94AB4A53AF525C511A340B55.d.D2BC6620EEC8A30588E36E6F4CEE4DAB.e.61")>,
 <selenium.webdriver.remote.webelement.WebElement (session="63e874f91f2fd03f4df7fd7d7dfdff84", element="f.6CEDE42D94AB4A53AF525C511A340B55.d.D2BC6620EEC8A30588E36E6F4CEE4DAB.e.62")>,
 <selenium.webdriver.remote.webelement.WebElement (session="63e874f91f2fd03f4df7fd7d7

In [29]:
'''
- There are 35 images on this Wikipedia page
'''

len(imagenes)

35

In [30]:
'''
- To extract the image of Walt Disney, that is, one specific image:
    . Copy its XPath
    . Query it from the driver itself
    . Ask the driver to find it and read its attribute
'''

xpath = '//*[@id="mw-content-text"]/div[1]/table[1]/tbody/tr[2]/td/span/a/img'

img_wd = driver.find_element(By.XPATH, xpath).get_attribute('src')
img_wd

'https://upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG'
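
The exercise asks for every image link, while the cell above reads a single one. A possible sketch for the full list follows; `image_links` and `FakeImage` are hypothetical names, and `FakeImage` only mimics the `get_attribute` method of a Selenium `WebElement` so the example runs without a browser.

```python
def image_links(elements):
    """Collect the non-empty `src` attribute of each element, in page order."""
    links = []
    for element in elements:
        src = element.get_attribute('src')
        if src:                      # skip images without a usable src
            links.append(src)
    return links


class FakeImage:
    """Hypothetical stand-in for a Selenium WebElement, for offline testing."""

    def __init__(self, src):
        self._src = src

    def get_attribute(self, name):
        return self._src if name == 'src' else None


sample = [FakeImage('https://example.org/a.png'), FakeImage(None),
          FakeImage('https://example.org/b.jpg')]
print(image_links(sample))
```

In the notebook, `image_links(imagenes)` would have to run while the browser session is still open, that is, before `driver.close()`.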

In [31]:
driver.close()

#### Retrieve an arbitrary Wikipedia page of "Python" and create a list of links on that page

In [32]:
# This is the url you will scrape in this exercise
url ='https://en.wikipedia.org/wiki/Python' 

In [33]:
driver = webdriver.Chrome()

In [34]:
driver.get(url)

In [35]:
aes = driver.find_elements(By.TAG_NAME, 'a')
aes

[<selenium.webdriver.remote.webelement.WebElement (session="54cf4402bb65c996c89ed288b06df4b3", element="f.1B53143419B0F0D64D00B6837882032F.d.5B29DB2BC68CBA61414F1C31709C4F02.e.92")>,
 <selenium.webdriver.remote.webelement.WebElement (session="54cf4402bb65c996c89ed288b06df4b3", element="f.1B53143419B0F0D64D00B6837882032F.d.5B29DB2BC68CBA61414F1C31709C4F02.e.93")>,
 <selenium.webdriver.remote.webelement.WebElement (session="54cf4402bb65c996c89ed288b06df4b3", element="f.1B53143419B0F0D64D00B6837882032F.d.5B29DB2BC68CBA61414F1C31709C4F02.e.94")>,
 <selenium.webdriver.remote.webelement.WebElement (session="54cf4402bb65c996c89ed288b06df4b3", element="f.1B53143419B0F0D64D00B6837882032F.d.5B29DB2BC68CBA61414F1C31709C4F02.e.95")>,
 <selenium.webdriver.remote.webelement.WebElement (session="54cf4402bb65c996c89ed288b06df4b3", element="f.1B53143419B0F0D64D00B6837882032F.d.5B29DB2BC68CBA61414F1C31709C4F02.e.96")>,
 <selenium.webdriver.remote.webelement.WebElement (session="54cf4402bb65c996c89ed288b

In [36]:
'''
- We cannot extract the attribute from the whole list at once, but we can do it
for each element.
'''

href = aes[0].get_attribute('href')
href

'https://en.wikipedia.org/wiki/Python#bodyContent'

In [37]:
'''
- With a list comprehension, we collect the "href" attribute of every link
element found on the page.
'''

href_completa = [a.get_attribute('href') for a in aes]
href_completa[:10]

['https://en.wikipedia.org/wiki/Python#bodyContent',
 'https://en.wikipedia.org/wiki/Main_Page',
 'https://en.wikipedia.org/wiki/Wikipedia:Contents',
 'https://en.wikipedia.org/wiki/Portal:Current_events',
 'https://en.wikipedia.org/wiki/Special:Random',
 'https://en.wikipedia.org/wiki/Wikipedia:About',
 'https://en.wikipedia.org/wiki/Wikipedia:Contact_us',
 'https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en',
 'https://en.wikipedia.org/wiki/Help:Contents',
 'https://en.wikipedia.org/wiki/Help:Introduction']
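
A list built this way may contain `None` entries (anchors without an `href`), duplicates, and in-page `#fragment` links. One way to tidy it, sketched with the standard library only; `unique_links` is an illustrative name and the sample list is hypothetical apart from the first two URLs.

```python
from urllib.parse import urldefrag


def unique_links(hrefs):
    """Drop missing hrefs and '#fragment' suffixes, keeping first-seen order."""
    seen = set()
    result = []
    for href in hrefs:
        if not href:                 # <a> elements without href yield None
            continue
        url = urldefrag(href).url    # strip the '#...' part
        if url and url not in seen:
            seen.add(url)
            result.append(url)
    return result


sample = [
    'https://en.wikipedia.org/wiki/Python#bodyContent',
    'https://en.wikipedia.org/wiki/Main_Page',
    None,
    'https://en.wikipedia.org/wiki/Main_Page',
]
print(unique_links(sample))
```

Applied to the notebook's data, the call would be `unique_links(href_completa)`.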

In [38]:
driver.close()

#### Number of Titles that have changed in the United States Code since its last release point 

In [39]:
# This is the url you will scrape in this exercise
url = 'http://uscode.house.gov/download/download.shtml'

In [40]:
driver = webdriver.Chrome()

In [41]:
driver.get(url)

In [42]:
'''
- We extract every element that matches "div.usctitlechanged" using
CSS_SELECTOR.
'''

negrita = driver.find_elements(By.CSS_SELECTOR,'div.usctitlechanged')
negrita

[<selenium.webdriver.remote.webelement.WebElement (session="aec11e5cf450f98c5049029269bc7a1f", element="f.A402E0CF3EBA0F397A3B965C016F2D83.d.F689EBBECAAA5185D7D685833027F40B.e.30")>,
 <selenium.webdriver.remote.webelement.WebElement (session="aec11e5cf450f98c5049029269bc7a1f", element="f.A402E0CF3EBA0F397A3B965C016F2D83.d.F689EBBECAAA5185D7D685833027F40B.e.31")>,
 <selenium.webdriver.remote.webelement.WebElement (session="aec11e5cf450f98c5049029269bc7a1f", element="f.A402E0CF3EBA0F397A3B965C016F2D83.d.F689EBBECAAA5185D7D685833027F40B.e.32")>,
 <selenium.webdriver.remote.webelement.WebElement (session="aec11e5cf450f98c5049029269bc7a1f", element="f.A402E0CF3EBA0F397A3B965C016F2D83.d.F689EBBECAAA5185D7D685833027F40B.e.33")>,
 <selenium.webdriver.remote.webelement.WebElement (session="aec11e5cf450f98c5049029269bc7a1f", element="f.A402E0CF3EBA0F397A3B965C016F2D83.d.F689EBBECAAA5185D7D685833027F40B.e.34")>,
 <selenium.webdriver.remote.webelement.WebElement (session="aec11e5cf450f98c504902926

In [43]:
len(negrita)

6

In [44]:
'''
- We read the first title shown in bold to check that the right element
was selected.
'''

negrita[0].text

'Title 2 - The Congress'

In [45]:
'''
- We create an empty list and, for each element "n" in the list "negrita",
append the text of that element to the list res
'''

res = []

for n in negrita:
    
    res.append(n.text)

In [46]:
res

['Title 2 - The Congress',
 'Title 5 - Government Organization and Employees ٭',
 'Title 6 - Domestic Security',
 'Title 18 - Crimes and Criminal Procedure ٭',
 'Title 19 - Customs Duties',
 'Title 42 - The Public Health and Welfare']
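
Since the exercise asks for a number, the count is simply the length of the list. The sketch below also strips the trailing star marker, assuming it is the character U+066D; the three titles are copied from the output above.

```python
# Titles copied from the output above; the marker is assumed to be U+066D
titles = [
    'Title 2 - The Congress',
    'Title 5 - Government Organization and Employees \u066d',
    'Title 6 - Domestic Security',
]

# Remove trailing spaces and the star marker, if present
clean_titles = [title.rstrip(' \u066d') for title in titles]

print(len(clean_titles))
print(clean_titles)
```

With the full list from the notebook, `len(res)` gives the number of changed titles (6 in the run shown).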

In [47]:
driver.close()

#### A Python list with the top ten FBI's Most Wanted names 

In [48]:
# This is the url you will scrape in this exercise
url = 'https://www.fbi.gov/wanted/topten'

In [49]:
driver = webdriver.Chrome()

In [50]:
driver.get(url)

In [51]:
fugitivos = driver.find_elements(By.TAG_NAME, 'h3')
fugitivos

[<selenium.webdriver.remote.webelement.WebElement (session="8a657083979f2e9856fb074362e7d28c", element="f.5699A39899B902E3D235225072BD06BC.d.2FE5140025F518A97DF1DC016C63E6E9.e.28")>,
 <selenium.webdriver.remote.webelement.WebElement (session="8a657083979f2e9856fb074362e7d28c", element="f.5699A39899B902E3D235225072BD06BC.d.2FE5140025F518A97DF1DC016C63E6E9.e.29")>,
 <selenium.webdriver.remote.webelement.WebElement (session="8a657083979f2e9856fb074362e7d28c", element="f.5699A39899B902E3D235225072BD06BC.d.2FE5140025F518A97DF1DC016C63E6E9.e.30")>,
 <selenium.webdriver.remote.webelement.WebElement (session="8a657083979f2e9856fb074362e7d28c", element="f.5699A39899B902E3D235225072BD06BC.d.2FE5140025F518A97DF1DC016C63E6E9.e.31")>,
 <selenium.webdriver.remote.webelement.WebElement (session="8a657083979f2e9856fb074362e7d28c", element="f.5699A39899B902E3D235225072BD06BC.d.2FE5140025F518A97DF1DC016C63E6E9.e.32")>,
 <selenium.webdriver.remote.webelement.WebElement (session="8a657083979f2e9856fb07436

In [52]:
len(fugitivos)

12

In [53]:
fugitivos[0].text

'BHADRESHKUMAR CHETANBHAI PATEL'

In [54]:
res = []

for f in fugitivos:
    res.append(f.text)

In [55]:
fugitivos = res[:-2]
fugitivos

['BHADRESHKUMAR CHETANBHAI PATEL',
 'ALEJANDRO ROSALES CASTILLO',
 'DONALD EUGENE FIELDS II',
 'RUJA IGNATOVA',
 'WILVER VILLEGAS-PALOMINO',
 "VITEL'HOMME INNOCENT",
 'ARNOLDO JIMENEZ',
 'ALEXIS FLORES',
 'OMAR ALEXANDER CARDENAS',
 'YULAN ADONAY ARCHAGA CARIAS']

In [56]:
len(fugitivos)

10

In [57]:
driver.close()

####  20 latest earthquakes info (date, time, latitude, longitude and region name) by the EMSC as a pandas dataframe

In [69]:
# This is the url you will scrape in this exercise
url = 'https://www.emsc-csem.org/Earthquake/'

In [70]:
driver = webdriver.Chrome()

In [71]:
driver.get(url)

In [72]:
import pandas as pd
import numpy as np

In [73]:
# Dismiss the cookie consent banner; click() returns None, so nothing needs to be stored
driver.find_element(By.XPATH, '//*[@id="cookieConsentContainer"]/a/div').click()

In [74]:
tabla = driver.find_elements(By.TAG_NAME, 'table')[2]
tabla

<selenium.webdriver.remote.webelement.WebElement (session="7fcc16dad475336e512469a2272854eb", element="f.E3F7D540F1E5A6D848FD0E4B154F8A9B.d.D4BBD92073B7707D5E68368FA71509BB.e.113")>

In [75]:
encabezado = tabla.find_element(By.TAG_NAME, 'thead')
encabezado

<selenium.webdriver.remote.webelement.WebElement (session="7fcc16dad475336e512469a2272854eb", element="f.E3F7D540F1E5A6D848FD0E4B154F8A9B.d.D4BBD92073B7707D5E68368FA71509BB.e.114")>

In [76]:
palabras_enc = [e.text for e in encabezado.find_elements(By.TAG_NAME, 'th')]
palabras_enc = palabras_enc[2:]
palabras_enc

['Date & Time\nUTC',
 'Lat.\ndegrees',
 'Lon.\ndegrees',
 'Depth\nkm',
 'Mag.[+]',
 'Region']

In [77]:
cuerpo = tabla.find_element(By.TAG_NAME, 'tbody')

In [78]:
filas = cuerpo.find_elements(By.TAG_NAME, 'tr')

In [79]:
filas[0].text.split('\n')

['2024-03-03 15:54:59',
 '20 min ago',
 '11.240 125.680 44 3.1 SAMAR, PHILIPPINES']

In [80]:
filas[0].find_elements(By.TAG_NAME, 'td')

[<selenium.webdriver.remote.webelement.WebElement (session="7fcc16dad475336e512469a2272854eb", element="f.E3F7D540F1E5A6D848FD0E4B154F8A9B.d.D4BBD92073B7707D5E68368FA71509BB.e.224")>,
 <selenium.webdriver.remote.webelement.WebElement (session="7fcc16dad475336e512469a2272854eb", element="f.E3F7D540F1E5A6D848FD0E4B154F8A9B.d.D4BBD92073B7707D5E68368FA71509BB.e.225")>,
 <selenium.webdriver.remote.webelement.WebElement (session="7fcc16dad475336e512469a2272854eb", element="f.E3F7D540F1E5A6D848FD0E4B154F8A9B.d.D4BBD92073B7707D5E68368FA71509BB.e.226")>,
 <selenium.webdriver.remote.webelement.WebElement (session="7fcc16dad475336e512469a2272854eb", element="f.E3F7D540F1E5A6D848FD0E4B154F8A9B.d.D4BBD92073B7707D5E68368FA71509BB.e.227")>,
 <selenium.webdriver.remote.webelement.WebElement (session="7fcc16dad475336e512469a2272854eb", element="f.E3F7D540F1E5A6D848FD0E4B154F8A9B.d.D4BBD92073B7707D5E68368FA71509BB.e.228")>,
 <selenium.webdriver.remote.webelement.WebElement (session="7fcc16dad475336e5124

In [81]:
[e.text for e in filas[0].find_elements(By.TAG_NAME, 'td')]

['',
 '',
 '',
 '2024-03-03 15:54:59\n20 min ago',
 '11.240',
 '125.680',
 '44',
 '',
 '3.1',
 'SAMAR, PHILIPPINES']

In [82]:
len([e.text for e in filas[0].find_elements(By.TAG_NAME, 'td')])

10

In [83]:
f = [e.text for e in filas[0].find_elements(By.TAG_NAME, 'td')][1:]

res = []

for e in f:
    
    if e!='':       
        res.append(e)
        
res

['2024-03-03 15:54:59\n20 min ago',
 '11.240',
 '125.680',
 '44',
 '3.1',
 'SAMAR, PHILIPPINES']

In [84]:
len(res)

6

In [85]:
data = []

for f in filas:   
    tmp = []    
    d = [e.text for e in f.find_elements(By.TAG_NAME, 'td')][2:]

    for casilla in d:

        if casilla!='':
            tmp.append(casilla)

    data.append(tmp)

In [86]:
data

[['2024-03-03 15:54:59\n20 min ago',
  '11.240',
  '125.680',
  '44',
  '3.1',
  'SAMAR, PHILIPPINES'],
 ['2024-03-03 15:49:19\n26 min ago',
  '-8.860',
  '112.480',
  '28',
  '4.3',
  'JAVA, INDONESIA'],
 ['2024-03-03 15:47:02\n28 min ago',
  '28.108',
  '-16.263',
  '21',
  '2.2',
  'CANARY ISLANDS, SPAIN REGION'],
 ['2024-03-03 15:28:25\n47 min ago',
  '34.890',
  '24.120',
  '3',
  '2.7',
  'CRETE, GREECE'],
 ['2024-03-03 15:19:07\n56 min ago', '38.160', '22.820', '5', '2.1', 'GREECE'],
 ['2024-03-03 15:13:49\n1 hr 01 min ago',
  '37.488',
  '36.992',
  '7',
  '2.1',
  'CENTRAL TURKEY'],
 ['2024-03-03 15:07:42\n1 hr 08 min ago',
  '-5.051',
  '102.966',
  '62',
  '5.1',
  'SOUTHERN SUMATRA, INDONESIA'],
 ['2024-03-03 15:03:36\n1 hr 12 min ago',
  '38.380',
  '20.480',
  '5',
  '2.0',
  'GREECE'],
 ['2024-03-03 14:50:42\n1 hr 25 min ago',
  '38.035',
  '37.641',
  '11',
  '2.0',
  'CENTRAL TURKEY'],
 ['2024-03-03 14:43:55\n1 hr 31 min ago',
  '38.043',
  '37.659',
  '7',
  '4.3',
  

In [87]:
df = pd.DataFrame(data, columns = palabras_enc)
df

| | Date & Time\nUTC | Lat.\ndegrees | Lon.\ndegrees | Depth\nkm | Mag.[+] | Region |
|---|---|---|---|---|---|---|
| 0 | 2024-03-03 15:54:59\n20 min ago | 11.240 | 125.680 | 44 | 3.1 | SAMAR, PHILIPPINES |
| 1 | 2024-03-03 15:49:19\n26 min ago | -8.860 | 112.480 | 28 | 4.3 | JAVA, INDONESIA |
| 2 | 2024-03-03 15:47:02\n28 min ago | 28.108 | -16.263 | 21 | 2.2 | CANARY ISLANDS, SPAIN REGION |
| 3 | 2024-03-03 15:28:25\n47 min ago | 34.890 | 24.120 | 3 | 2.7 | CRETE, GREECE |
| 4 | 2024-03-03 15:19:07\n56 min ago | 38.160 | 22.820 | 5 | 2.1 | GREECE |
| ... | ... | ... | ... | ... | ... | ... |
| 95 | 2024-03-03 08:30:28\n7 hr 45 min ago | 38.390 | 20.490 | 14 | 2.1 | GREECE |
| 96 | 2024-03-03 08:28:51\n7 hr 46 min ago | 38.400 | 20.480 | 6 | 2.7 | GREECE |
| 97 | 2024-03-03 08:27:47\n7 hr 47 min ago | 38.400 | 20.464 | 9 | 3.2 | GREECE |
| 98 | 2024-03-03 08:25:12\n7 hr 50 min ago | 38.400 | 20.480 | 3 | 3.0 | GREECE |


In [88]:
df[:20]

| | Date & Time\nUTC | Lat.\ndegrees | Lon.\ndegrees | Depth\nkm | Mag.[+] | Region |
|---|---|---|---|---|---|---|
| 0 | 2024-03-03 15:54:59\n20 min ago | 11.24 | 125.68 | 44 | 3.1 | SAMAR, PHILIPPINES |
| 1 | 2024-03-03 15:49:19\n26 min ago | -8.86 | 112.48 | 28 | 4.3 | JAVA, INDONESIA |
| 2 | 2024-03-03 15:47:02\n28 min ago | 28.108 | -16.263 | 21 | 2.2 | CANARY ISLANDS, SPAIN REGION |
| 3 | 2024-03-03 15:28:25\n47 min ago | 34.89 | 24.12 | 3 | 2.7 | CRETE, GREECE |
| 4 | 2024-03-03 15:19:07\n56 min ago | 38.16 | 22.82 | 5 | 2.1 | GREECE |
| 5 | 2024-03-03 15:13:49\n1 hr 01 min ago | 37.488 | 36.992 | 7 | 2.1 | CENTRAL TURKEY |
| 6 | 2024-03-03 15:07:42\n1 hr 08 min ago | -5.051 | 102.966 | 62 | 5.1 | SOUTHERN SUMATRA, INDONESIA |
| 7 | 2024-03-03 15:03:36\n1 hr 12 min ago | 38.38 | 20.48 | 5 | 2.0 | GREECE |
| 8 | 2024-03-03 14:50:42\n1 hr 25 min ago | 38.035 | 37.641 | 11 | 2.0 | CENTRAL TURKEY |
| 9 | 2024-03-03 14:43:55\n1 hr 31 min ago | 38.043 | 37.659 | 7 | 4.3 | CENTRAL TURKEY |


In [89]:
driver.close()
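
The rows collected in `data` still hold the date and time in one string, followed by the "min ago" text, and every value is a string. A possible clean-up that yields the fields named in the exercise title is sketched below. `parse_row` is a hypothetical helper, and it assumes each row has the six fields shown in the output of `data`.

```python
from datetime import datetime


def parse_row(row):
    """Turn one scraped row (list of six strings) into a typed dict."""
    timestamp = row[0].split('\n')[0]          # drop the '20 min ago' part
    moment = datetime.strptime(timestamp, '%Y-%m-%d %H:%M:%S')
    return {
        'date': moment.date().isoformat(),
        'time': moment.time().isoformat(),
        'latitude': float(row[1]),
        'longitude': float(row[2]),
        'region': row[5],
    }


sample = ['2024-03-03 15:54:59\n20 min ago', '11.240', '125.680',
          '44', '3.1', 'SAMAR, PHILIPPINES']
print(parse_row(sample))
```

In the notebook, `pd.DataFrame([parse_row(r) for r in data[:20]])` would then build the DataFrame of the 20 latest earthquakes, provided no row is shorter than expected.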

#### Count number of tweets by a given Twitter account.

You will need to include a ***try/except block*** for account names not found. 
<br>***Hint:*** the program should count the number of tweets for any provided account

In [90]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/Nanosecso?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor'

In [91]:
driver = webdriver.Chrome()

In [92]:
driver.get(url)

In [94]:
xpath = '//*[@id="react-root"]/div/div/div[2]/main/div/div/div/div/div/div[1]/div[1]/div/div/div/div/div/div[2]/div/div'

driver.find_element(By.XPATH, xpath).text

'20,9 mil posts'

#### Number of followers of a given twitter account

You will need to include a ***try/except block*** in case the account name is not found. 
<br>***Hint:*** the program should count the followers for any provided account

In [95]:
xpath = '//*[@id="react-root"]/div/div/div[2]/main/div/div/div/div/div/div[3]/div/div/div/div/div[5]/div[2]/a/span[1]/span'

driver.find_element(By.XPATH, xpath).text

'185,1 mil'
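
Both values come back as localized text rather than numbers. The sketch below converts them to integers under the assumption that the page uses the Spanish format seen above: a decimal comma, a dot as thousands separator, and an optional `mil` (thousand) or `M` (million) suffix. `parse_count` is a hypothetical helper.

```python
from decimal import Decimal

# Suffixes assumed from the outputs in this notebook
MULTIPLIERS = {'mil': 1_000, 'M': 1_000_000}


def parse_count(text):
    """Convert strings such as '20,9 mil posts' into an integer."""
    parts = text.split()
    # '1.234,5' -> '1234.5': drop thousands dots, then use a decimal point
    number = Decimal(parts[0].replace('.', '').replace(',', '.'))
    multiplier = MULTIPLIERS.get(parts[1], 1) if len(parts) > 1 else 1
    return int(number * multiplier)


print(parse_count('20,9 mil posts'))
print(parse_count('185,1 mil'))
```

`Decimal` is used instead of `float` so that a value like 20.9 is multiplied exactly. A browser running in another language would render these labels differently, so the suffix table would need adjusting.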

In [96]:
driver.close()

#### List all language names and number of related articles in the order they appear in wikipedia.org

In [97]:
# This is the url you will scrape in this exercise
url = 'https://www.wikipedia.org/'

In [98]:
driver = webdriver.Chrome()

In [99]:
driver.get(url)

In [100]:
xpath = '//*[@id="www-wikipedia-org"]/main/nav[1]'

caja = driver.find_elements(By.XPATH, xpath)
caja

[<selenium.webdriver.remote.webelement.WebElement (session="9f35148b24b9d0e9d64c339175b2636f", element="f.F8851D82AAB712DCC8406CD256DEFA39.d.5BCF93898D8E1C5D081276124762E23B.e.14")>]

In [101]:
idiomas = []

for e in caja:
    strong = e.find_elements(By.TAG_NAME, 'strong')
    for idioma in strong:
        idiomas.append(idioma.text)

In [102]:
idiomas

['Español',
 'English',
 'Русский',
 '日本語',
 'Deutsch',
 'Français',
 'Italiano',
 '中文',
 'فارسی',
 'Português']

In [103]:
articulos = []

for e in caja:
    small = e.find_elements(By.TAG_NAME, 'small')
    for num_articulo in small:
        articulos.append(num_articulo.text)

In [104]:
articulos

['1.935.000+ artículos',
 '6,790,000+ articles',
 '1 966 000+ статей',
 '1,405,000+ 記事',
 '2.887.000+ Artikel',
 '2\u202f595\u202f000+ articles',
 '1.850.000+ voci',
 '1,406,000+ 条目 / 條目',
 '۹۹۴٬۰۰۰+ مقاله',
 '1.119.000+ artigos']

In [105]:
dict(zip(idiomas, articulos))

{'Español': '1.935.000+ artículos',
 'English': '6,790,000+ articles',
 'Русский': '1 966 000+ статей',
 '日本語': '1,405,000+ 記事',
 'Deutsch': '2.887.000+ Artikel',
 'Français': '2\u202f595\u202f000+ articles',
 'Italiano': '1.850.000+ voci',
 '中文': '1,406,000+ 条目 / 條目',
 'فارسی': '۹۹۴٬۰۰۰+ مقاله',
 'Português': '1.119.000+ artigos'}
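
Each language formats its article count with different separators. If plain integers are preferred, one option is to keep only the decimal digits that precede the `+` sign. `article_count` is a hypothetical helper; the labels are copied from the output above.

```python
def article_count(label):
    """Extract the integer before the '+' sign, ignoring any group separators."""
    number_part = label.split('+')[0]
    digits = ''.join(ch for ch in number_part if ch.isdecimal())
    return int(digits)


labels = {
    'Español': '1.935.000+ artículos',
    'English': '6,790,000+ articles',
    'Français': '2\u202f595\u202f000+ articles',
}
counts = {language: article_count(label) for language, label in labels.items()}
print(counts)
```

Because `int()` accepts Unicode decimal digits, the same helper should also handle the Persian label without a translation step.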

In [109]:
#pip install -U deep-translator

In [107]:
from deep_translator import GoogleTranslator

In [108]:
for k,v in dict(zip(idiomas, articulos)).items():
    
    print(GoogleTranslator(source='auto', target='es').translate(k))
    print(GoogleTranslator(source='auto', target='es').translate(v))

Español
1.935.000+ artículos
Inglés
6.790.000+ artículos
ruso
1.966.000+ artículos
japonés
1.405.000+ artículos
Alemán
2,887,000+ Artículo
Francés
2595000+ artículos
italiano
1.850.000+ entradas
Chino
1.406.000+ entradas/entradas
farsi
994.000+ artículos
portugués
1.119.000+ artículos


In [110]:
driver.close()

#### Top 10 languages by number of native speakers stored in a Pandas Dataframe

In [111]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [112]:
driver = webdriver.Chrome()

In [113]:
driver.get(url)

In [114]:
xpath = '//*[@id="mw-content-text"]/div[1]/table[1]'

tabla = driver.find_elements(By.XPATH, xpath)
tabla

[<selenium.webdriver.remote.webelement.WebElement (session="358ef17ae13ea7da1ff4838377d1896e", element="f.9D4A6F04C0CCD23C536B183BCE986A90.d.275AD73B433AB31E5F218C4B86E48354.e.52")>]

In [115]:
encabezado = tabla[0].find_elements(By.CLASS_NAME, 'headerSort')
encabezado

[<selenium.webdriver.remote.webelement.WebElement (session="358ef17ae13ea7da1ff4838377d1896e", element="f.9D4A6F04C0CCD23C536B183BCE986A90.d.275AD73B433AB31E5F218C4B86E48354.e.53")>,
 <selenium.webdriver.remote.webelement.WebElement (session="358ef17ae13ea7da1ff4838377d1896e", element="f.9D4A6F04C0CCD23C536B183BCE986A90.d.275AD73B433AB31E5F218C4B86E48354.e.54")>,
 <selenium.webdriver.remote.webelement.WebElement (session="358ef17ae13ea7da1ff4838377d1896e", element="f.9D4A6F04C0CCD23C536B183BCE986A90.d.275AD73B433AB31E5F218C4B86E48354.e.55")>,
 <selenium.webdriver.remote.webelement.WebElement (session="358ef17ae13ea7da1ff4838377d1896e", element="f.9D4A6F04C0CCD23C536B183BCE986A90.d.275AD73B433AB31E5F218C4B86E48354.e.56")>]

In [116]:
res = []

for e in encabezado: 
    res.append(e.text)

In [117]:
res

['Language', 'Native speakers\n(in millions)', 'Language family', 'Branch']

In [118]:
lista_sin_n = [elemento.replace('\n', ' ') for elemento in res]
lista_sin_n

['Language', 'Native speakers (in millions)', 'Language family', 'Branch']

In [119]:
filas = driver.find_elements(By.TAG_NAME, 'tr')
filas[1].text.split()

['Mandarin', 'Chinese', '941', 'Sino-Tibetan', 'Sinitic']

In [120]:
idiomas = []

# Iterate over the rows at indices 1 to 10 (10 elements)
for i in range(1, 11):
    
    # Split each row and append the result to the list 'idiomas'
    idiomas.append(filas[i].text.split())

In [121]:
idiomas

[['Mandarin', 'Chinese', '941', 'Sino-Tibetan', 'Sinitic'],
 ['Spanish', '486', 'Indo-European', 'Romance'],
 ['English', '380', 'Indo-European', 'Germanic'],
 ['Hindi', '345', 'Indo-European', 'Indo-Aryan'],
 ['Bengali', '237', 'Indo-European', 'Indo-Aryan'],
 ['Portuguese', '236', 'Indo-European', 'Romance'],
 ['Russian', '148', 'Indo-European', 'Balto-Slavic'],
 ['Japanese', '123', 'Japonic', 'Japanese'],
 ['Yue', 'Chinese', '86', 'Sino-Tibetan', 'Sinitic'],
 ['Vietnamese', '85', 'Austroasiatic', 'Vietic']]

In [122]:
# New list to hold the rows with the language name rejoined
resultados_combinados = []

# split() broke 'Mandarin Chinese' and 'Yue Chinese' into two tokens; rejoin them
for sublist in idiomas:
    if 'Chinese' in sublist and ('Mandarin' in sublist or 'Yue' in sublist):
        # Merge the first two tokens into a single language name
        resultado_combinado = [sublist[0] + ' ' + sublist[1]] + sublist[2:]
        resultados_combinados.append(resultado_combinado)
    else:
        # Any other row already has one token per column
        resultados_combinados.append(sublist)

# 'resultados_combinados' now holds four fields per row
resultados_combinados

[['Mandarin Chinese', '941', 'Sino-Tibetan', 'Sinitic'],
 ['Spanish', '486', 'Indo-European', 'Romance'],
 ['English', '380', 'Indo-European', 'Germanic'],
 ['Hindi', '345', 'Indo-European', 'Indo-Aryan'],
 ['Bengali', '237', 'Indo-European', 'Indo-Aryan'],
 ['Portuguese', '236', 'Indo-European', 'Romance'],
 ['Russian', '148', 'Indo-European', 'Balto-Slavic'],
 ['Japanese', '123', 'Japonic', 'Japanese'],
 ['Yue Chinese', '86', 'Sino-Tibetan', 'Sinitic'],
 ['Vietnamese', '85', 'Austroasiatic', 'Vietic']]
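
The merge above names the two affected languages explicitly. A more general variant, sketched here, relies on the assumption that the last three tokens of every row (speakers, family, branch) are single words, so everything before them is the language name. `merge_language_tokens` is a hypothetical helper.

```python
def merge_language_tokens(tokens, trailing_columns=3):
    """Rejoin the language name, assuming the last three tokens are one per column."""
    name = ' '.join(tokens[:-trailing_columns])
    return [name] + tokens[-trailing_columns:]


rows = [
    ['Mandarin', 'Chinese', '941', 'Sino-Tibetan', 'Sinitic'],
    ['Spanish', '486', 'Indo-European', 'Romance'],
]
print([merge_language_tokens(row) for row in rows])
```

This breaks if a family or branch name contains a space, in which case reading each table cell separately would be the safer route.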

In [123]:
df = pd.DataFrame(resultados_combinados, columns = lista_sin_n)
df

| | Language | Native speakers (in millions) | Language family | Branch |
|---|---|---|---|---|
| 0 | Mandarin Chinese | 941 | Sino-Tibetan | Sinitic |
| 1 | Spanish | 486 | Indo-European | Romance |
| 2 | English | 380 | Indo-European | Germanic |
| 3 | Hindi | 345 | Indo-European | Indo-Aryan |
| 4 | Bengali | 237 | Indo-European | Indo-Aryan |
| 5 | Portuguese | 236 | Indo-European | Romance |
| 6 | Russian | 148 | Indo-European | Balto-Slavic |
| 7 | Japanese | 123 | Japonic | Japanese |
| 8 | Yue Chinese | 86 | Sino-Tibetan | Sinitic |
| 9 | Vietnamese | 85 | Austroasiatic | Vietic |


In [124]:
driver.close()

### BONUS QUESTIONS

#### IMDB's Top 250 data (movie name, initial release, director name and stars) as a pandas dataframe

In [125]:
# This is the url you will scrape in this exercise 
url = 'https://www.imdb.com/chart/top'

In [126]:
driver = webdriver.Chrome()

In [127]:
driver.get(url)

In [128]:
peliculas = driver.find_element(By.XPATH, '//*[@id="__next"]/main/div/div[3]/section/div/div[2]/div/ul')
peliculas

<selenium.webdriver.remote.webelement.WebElement (session="f24cf57a65fb61ed05db88333bdcb415", element="f.EF2BCCD542EACB1FD9DCFAD87E7060CD.d.13564E3DB4BE40CA49A502F130B84A18.e.70")>

In [129]:
len(peliculas.find_elements(By.TAG_NAME,'li'))

250

In [130]:
pel = peliculas.find_elements(By.TAG_NAME,'li')
pel

[<selenium.webdriver.remote.webelement.WebElement (session="f24cf57a65fb61ed05db88333bdcb415", element="f.EF2BCCD542EACB1FD9DCFAD87E7060CD.d.13564E3DB4BE40CA49A502F130B84A18.e.71")>,
 <selenium.webdriver.remote.webelement.WebElement (session="f24cf57a65fb61ed05db88333bdcb415", element="f.EF2BCCD542EACB1FD9DCFAD87E7060CD.d.13564E3DB4BE40CA49A502F130B84A18.e.72")>,
 <selenium.webdriver.remote.webelement.WebElement (session="f24cf57a65fb61ed05db88333bdcb415", element="f.EF2BCCD542EACB1FD9DCFAD87E7060CD.d.13564E3DB4BE40CA49A502F130B84A18.e.73")>,
 <selenium.webdriver.remote.webelement.WebElement (session="f24cf57a65fb61ed05db88333bdcb415", element="f.EF2BCCD542EACB1FD9DCFAD87E7060CD.d.13564E3DB4BE40CA49A502F130B84A18.e.74")>,
 <selenium.webdriver.remote.webelement.WebElement (session="f24cf57a65fb61ed05db88333bdcb415", element="f.EF2BCCD542EACB1FD9DCFAD87E7060CD.d.13564E3DB4BE40CA49A502F130B84A18.e.75")>,
 <selenium.webdriver.remote.webelement.WebElement (session="f24cf57a65fb61ed05db88333

In [131]:
pel[0].text.split('\n')

['1. Cadena perpetua', '1994', '2h 22m', '13', '9,3', ' (2,9 M)', 'Puntuar']
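
The first item mixes the ranking with the title, and the rating uses a decimal comma because the page was served in Spanish. A minimal sketch of how both could be turned into typed values follows; the function names `split_rank_title` and `parse_rating` are hypothetical helpers, not something the notebook defines.

```python
import re


def split_rank_title(text):
    """Split '1. Cadena perpetua' into (1, 'Cadena perpetua')."""
    match = re.match(r'\s*(\d+)\.\s+(.*)', text)
    if match is None:
        # No leading 'N. ' prefix: return the text as the title only
        return None, text.strip()
    return int(match.group(1)), match.group(2).strip()


def parse_rating(text):
    """Convert a decimal-comma rating such as '9,3' to a float."""
    return float(text.strip().replace(',', '.'))


print(split_rank_title('5. 12 hombres sin piedad'))
print(parse_rating('9,3'))
```

Anchoring the pattern at the start keeps titles that begin with a number, such as '12 hombres sin piedad', intact.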

In [132]:
df = pd.DataFrame([e.text.split('\n') for e in pel])
df

Unnamed: 0,0,1,2,3,4,5,6
0,1. Cadena perpetua,1994,2h 22m,13,93,"(2,9 M)",Puntuar
1,2. El padrino,1972,2h 55m,18,92,(2 M),Puntuar
2,3. El caballero oscuro,2008,2h 32m,12,90,"(2,8 M)",Puntuar
3,4. El padrino parte II,1974,3h 22m,18,90,"(1,4 M)",Puntuar
4,5. 12 hombres sin piedad,1957,1h 36m,A,90,(856 mil),Puntuar
...,...,...,...,...,...,...,...
245,246. El gigante de hierro,1999,1h 26m,A,81,(225 mil),Puntuar
246,247. Criadas y señoras,2011,2h 26m,A,81,(489 mil),Puntuar
247,248. Sucedió una noche,1934,1h 45m,A,81,(112 mil),Puntuar
248,249. Los cuatrocientos golpes,1959,1h 32m,A,81,(127 mil),Puntuar


In [133]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6
0,1. Cadena perpetua,1994,2h 22m,13,93,"(2,9 M)",Puntuar
1,2. El padrino,1972,2h 55m,18,92,(2 M),Puntuar
2,3. El caballero oscuro,2008,2h 32m,12,90,"(2,8 M)",Puntuar
3,4. El padrino parte II,1974,3h 22m,18,90,"(1,4 M)",Puntuar
4,5. 12 hombres sin piedad,1957,1h 36m,A,90,(856 mil),Puntuar


In [134]:
driver.close()

#### Movie name, year and a brief summary of the top 10 random movies (IMDB) as a pandas dataframe.

In [135]:
#This is the url you will scrape in this exercise
url = 'http://www.imdb.com/chart/top'

In [136]:
driver = webdriver.Chrome()

In [137]:
driver.get(url)

In [138]:
driver.find_element(By.XPATH, '//*[@id="list-view-option-detailed"]').click()

In [139]:
pelis = driver.find_element(By.XPATH, '//*[@id="__next"]/main/div/div[3]/section/div/div[2]/div/ul')
pelis

<selenium.webdriver.remote.webelement.WebElement (session="8802e245680969d1d201e2e1b14de9cd", element="f.5489EB9DAC9613CE115D9E18B0C62DF3.d.EE13D3F69DDD01F0F06803C83D4C2D98.e.67")>

In [140]:
pel = pelis.find_elements(By.TAG_NAME,'li')
len(pel)

75

In [141]:
pel[0].text.split('\n')

['1. Cadena perpetua',
 '1994',
 '2h 22m',
 '13',
 '9,3',
 ' (2,9 M)',
 'Puntuar',
 'Andy Dufresne es encarcelado por matar a su esposa y al amante de esta. Tras una dura adaptación, intenta mejorar las condiciones de la prisión y dar esperanza a sus compañeros.',
 'DirectorFrank DarabontEstrellasTim RobbinsMorgan FreemanBob Gunton',
 'Votos2.865.062']

In [142]:
nombre = pel[0].find_element(By.TAG_NAME, 'h3').text
nombre

'1. Cadena perpetua'

In [143]:
año = pel[0].find_elements(By.TAG_NAME, 'span')[1].text
año

'1994'

In [144]:
dur = pel[0].find_elements(By.TAG_NAME, 'span')[2].text
dur

'2h 22m'
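
The duration is scraped as text. If a numeric column is wanted later, a sketch like the one below could convert it to minutes. It assumes the only formats are the ones visible in this notebook ('2h 22m', '2h'); `duration_to_minutes` is an illustrative name.

```python
import re


def duration_to_minutes(text):
    """Convert strings such as '2h 22m' or '2h' to a number of minutes."""
    hours = re.search(r'(\d+)\s*h', text)
    minutes = re.search(r'(\d+)\s*m', text)
    if hours is None and minutes is None:
        raise ValueError(f'Unrecognised duration: {text!r}')
    total = 0
    if hours is not None:
        total += int(hours.group(1)) * 60
    if minutes is not None:
        total += int(minutes.group(1))
    return total


for sample in ['2h 22m', '2h', '1h 45m']:
    print(sample, '->', duration_to_minutes(sample))
```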

In [145]:
edad = pel[0].find_elements(By.TAG_NAME, 'span')[3].text
edad

'13'

In [146]:
est = pel[0].find_elements(By.TAG_NAME, 'span')[4].text
est

'9,3\n (2,9 M)\nPuntuar'

In [147]:
est.split('\n')

['9,3', ' (2,9 M)', 'Puntuar']

In [148]:
reco = est.split('\n')[0]
usuarios = est.split('\n')[1].strip().replace('(', '').replace(')', '')

In [149]:
reco

'9,3'

In [150]:
usuarios

'2,9 M'
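
The abbreviated count is still a string. A possible way to turn it into an integer is sketched below, assuming the only suffixes are the two that appear in this Spanish-locale output ('M' for millions, 'mil' for thousands). The `MULTIPLIERS` table and the function name are assumptions made for illustration.

```python
from decimal import Decimal

# Suffixes seen in the scraped text; other locales would need other entries
MULTIPLIERS = {'M': 1_000_000, 'mil': 1_000}


def parse_abbreviated_count(text):
    """Convert ' (2,9 M)' or '856 mil' to an approximate integer count."""
    parts = text.strip().strip('()').split()
    # Decimal avoids binary floating-point rounding when scaling 2.9 by a million
    number = Decimal(parts[0].replace(',', '.'))
    multiplier = MULTIPLIERS[parts[1]] if len(parts) > 1 else 1
    return int(number * multiplier)


for sample in [' (2,9 M)', '856 mil', '2 M']:
    print(repr(sample), '->', parse_abbreviated_count(sample))
```

The result is only as precise as the abbreviation; the exact vote count comes from a separate element further down.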

In [151]:
desc = pel[0].find_element(By.CSS_SELECTOR, 'div.ipc-html-content-inner-div').text
desc

'Andy Dufresne es encarcelado por matar a su esposa y al amante de esta. Tras una dura adaptación, intenta mejorar las condiciones de la prisión y dar esperanza a sus compañeros.'

In [152]:
director = pel[0].find_element(By.CSS_SELECTOR, 'a.ipc-link.ipc-link--base.dli-director-item').text
director

'Frank Darabont'

In [153]:
estrellas = [e.text for e in pel[0].find_elements(By.TAG_NAME, 'span')[11:]][:-1]
estrellas

['Estrellas', 'Tim Robbins', 'Morgan Freeman', 'Bob Gunton']
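
The slice keeps the 'Estrellas' label as the first item, so it also ends up in any string built from this list. A small sketch of one way to drop it, assuming the label is always the first element when present (`drop_label` is a hypothetical helper):

```python
def drop_label(items, label='Estrellas'):
    """Return the items without a leading label such as 'Estrellas'."""
    if items and items[0] == label:
        return items[1:]
    return list(items)


names = drop_label(['Estrellas', 'Tim Robbins', 'Morgan Freeman', 'Bob Gunton'])
print(', '.join(names))
```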

In [154]:
votos = pel[0].find_element(By.CSS_SELECTOR, 'div.sc-f24f1c5c-0.cPpOqU').text[5:]
votos

'2.865.062'
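
The `[5:]` slice works because the label 'Votos' happens to be five characters long, and the number keeps its dot thousands separators. A hedged alternative that removes the label by name and returns an integer might look like this; `parse_votes` and its `prefix` parameter are illustrative.

```python
def parse_votes(text, prefix='Votos'):
    """Convert 'Votos2.865.062' (or '2.865.062') to the integer 2865062."""
    if text.startswith(prefix):
        text = text[len(prefix):]
    # Dots are thousands separators in this locale, so they can be removed
    return int(text.strip().replace('.', ''))


print(parse_votes('Votos2.865.062'))
print(parse_votes('856.383'))
```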

In [155]:
data = []

for p in pel:
    
    # Look up the <span> elements once per movie instead of once per field
    spans = p.find_elements(By.TAG_NAME, 'span')
    
    # The rating span holds three lines: rating, user count and the 'Puntuar' label
    est = spans[4].text.split('\n')
    
    dictio = {'nombre': p.find_element(By.TAG_NAME, 'h3').text,
              'año': spans[1].text,
              'duracion': spans[2].text,
              'edad': spans[3].text,
              'reco': est[0],
              'usuarios': est[1].strip().replace('(', '').replace(')', ''),
              'desc': p.find_element(By.CSS_SELECTOR, 'div.ipc-html-content-inner-div').text,
              'director': p.find_element(By.CSS_SELECTOR, 'a.ipc-link.ipc-link--base.dli-director-item').text,
              'estrellas': '-'.join([e.text for e in spans[11:]][:-1]),
              'votos': p.find_element(By.CSS_SELECTOR, 'div.sc-f24f1c5c-0.cPpOqU').text[5:]
             }
    
    data.append(dictio)

In [156]:
df = pd.DataFrame(data)
df.head()

Unnamed: 0,nombre,año,duracion,edad,reco,usuarios,desc,director,estrellas,votos
0,1. Cadena perpetua,1994,2h 22m,13,93,"2,9 M",Andy Dufresne es encarcelado por matar a su es...,Frank Darabont,Estrellas-Tim Robbins-Morgan Freeman-Bob Gunton,2.865.062
1,2. El padrino,1972,2h 55m,18,92,2 M,El envejecido patriarca de una dinastía del cr...,Francis Ford Coppola,Estrellas-Marlon Brando-Al Pacino-James Caan,1.995.660
2,3. El caballero oscuro,2008,2h 32m,12,90,"2,8 M",Cuando la amenaza conocida como el Joker causa...,Christopher Nolan,Estrellas-Christian Bale-Heath Ledger-Aaron Ec...,2.846.250
3,4. El padrino parte II,1974,3h 22m,18,90,"1,4 M",Se retratan los inicios de la vida y la carrer...,Francis Ford Coppola,Estrellas-Al Pacino-Robert De Niro-Robert Duvall,1.354.041
4,5. 12 hombres sin piedad,1957,1h 36m,A,90,856 mil,Un miembro del jurado trata de evitar un error...,Sidney Lumet,Estrellas-Henry Fonda-Lee J. Cobb-Martin Balsam,856.383


In [157]:
df.tail()

Unnamed: 0,nombre,año,duracion,edad,reco,usuarios,desc,director,estrellas,votos
70,71. El caballero oscuro: La leyenda renace,2012,2h 44m,12,84,"1,8 M",Ocho años después del reinado de anarquía del ...,Christopher Nolan,Estrellas-Christian Bale-Tom Hardy-Anne Hathaway,1.813.165
71,72. ¿Teléfono rojo? Volamos hacia Moscú,1964,1h 35m,18,84,515 mil,Un general enajenado provoca un potencial holo...,Stanley Kubrick,Estrellas-Peter Sellers-George C. Scott-Sterli...,515.120
72,73. American Beauty,1999,2h 2m,18,83,"1,2 M",Un padre de clase media sexualmente frustrado ...,Sam Mendes,Estrellas-Kevin Spacey-Annette Bening-Thora Birch,1.204.792
73,74. Old Boy,2003,2h,18,83,630 mil,Tras ser secuestrado y aprisionado durante 15 ...,Park Chan-wook,Estrellas-Choi Min-sik-Yoo Ji-tae-Kang Hye-jeong,629.858
74,75. Coco,2017,1h 45m,A,84,583 mil,El aspirante a músico Miguel le planta cara a ...,Lee Unkrich,Estrellas-Anthony Gonzalez-Gael García Bernal-...,582.532


In [158]:
driver.close()

#### Book name, price and stock availability as a pandas dataframe.

In [159]:
# This is the url you will scrape in this exercise. 
# It is a fictional bookstore created to be scraped. 
url = 'http://books.toscrape.com/'

In [160]:
driver = webdriver.Chrome()

In [161]:
driver.get(url)

In [162]:
libros = driver.find_element(By.TAG_NAME, 'ol')
libros

<selenium.webdriver.remote.webelement.WebElement (session="1189d45675730cc05f46a442796b5a65", element="f.83A6527936B88D33A3A47277FC2A41DF.d.080E2E445FD4D72828B3F643A53DE906.e.48")>

In [163]:
libros = libros.find_elements(By.TAG_NAME, 'li')

In [164]:
nombre = libros[0].find_element(By.TAG_NAME, 'h3').find_element(By.TAG_NAME, 'a').get_attribute('title')
nombre

'A Light in the Attic'

In [165]:
precio  = float(libros[0].find_elements(By.TAG_NAME, 'p')[1].text[1:])
precio

51.77
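
Dropping the first character with `[1:]` assumes the currency symbol is exactly one character long. As a sketch under that caveat, a regular expression can pick out the number whatever precedes it; `parse_price` is a hypothetical helper and the second sample string is made up for illustration.

```python
import re


def parse_price(text):
    """Extract the numeric part of a price string such as '£51.77'."""
    match = re.search(r'\d+(?:\.\d+)?', text)
    if match is None:
        raise ValueError(f'No number found in {text!r}')
    return float(match.group())


print(parse_price('£51.77'))
print(parse_price('EUR 50.10'))
```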

In [166]:
stock = libros[0].find_elements(By.TAG_NAME, 'p')[2].text
stock

'In stock'

In [167]:
estrellas = libros[0].find_elements(By.TAG_NAME, 'p')[0].get_attribute('class').split()[-1]
estrellas

'Three'

In [168]:
data = []

for l in libros:
    
    dictio = {'nombre': l.find_element(By.TAG_NAME, 'h3').find_element(By.TAG_NAME, 'a').get_attribute('title'),
              'precio': float(l.find_elements(By.TAG_NAME, 'p')[1].text[1:]),
              'stock': l.find_elements(By.TAG_NAME, 'p')[2].text,
              'estrellas': l.find_elements(By.TAG_NAME, 'p')[0].get_attribute('class').split()[-1]
             }
    
    data.append(dictio)

In [169]:
df = pd.DataFrame(data)
df

Unnamed: 0,nombre,precio,stock,estrellas
0,A Light in the Attic,51.77,In stock,Three
1,Tipping the Velvet,53.74,In stock,One
2,Soumission,50.1,In stock,One
3,Sharp Objects,47.82,In stock,Four
4,Sapiens: A Brief History of Humankind,54.23,In stock,Five
5,The Requiem Red,22.65,In stock,One
6,The Dirty Little Secrets of Getting Your Dream...,33.34,In stock,Four
7,The Coming Woman: A Novel Based on the Life of...,17.93,In stock,Three
8,The Boys in the Boat: Nine Americans and Their...,22.6,In stock,Four
9,The Black Maria,52.15,In stock,One


In [170]:
df.estrellas.unique()

array(['Three', 'One', 'Four', 'Five', 'Two'], dtype=object)

In [171]:
dictio = {'Three': 3, 'One': 1, 'Four': 4, 'Five': 5, 'Two': 2}

df.estrellas = df.estrellas.apply(lambda x: dictio[x])

In [172]:
df

Unnamed: 0,nombre,precio,stock,estrellas
0,A Light in the Attic,51.77,In stock,3
1,Tipping the Velvet,53.74,In stock,1
2,Soumission,50.1,In stock,1
3,Sharp Objects,47.82,In stock,4
4,Sapiens: A Brief History of Humankind,54.23,In stock,5
5,The Requiem Red,22.65,In stock,1
6,The Dirty Little Secrets of Getting Your Dream...,33.34,In stock,4
7,The Coming Woman: A Novel Based on the Life of...,17.93,In stock,3
8,The Boys in the Boat: Nine Americans and Their...,22.6,In stock,4
9,The Black Maria,52.15,In stock,1


In [173]:
# all pages

data = []

# range(49) reads pages 1 to 49; the page reached by the final click is never scraped
for i in range(49):
    
    libros = driver.find_element(By.TAG_NAME, 'ol').find_elements(By.TAG_NAME, 'li')
    
    for l in libros:

        dictio = {'nombre': l.find_element(By.TAG_NAME, 'h3').find_element(By.TAG_NAME, 'a').get_attribute('title'),
                  'precio': float(l.find_elements(By.TAG_NAME, 'p')[1].text[1:]),
                  'stock': l.find_elements(By.TAG_NAME, 'p')[2].text,
                  'estrellas': l.find_elements(By.TAG_NAME, 'p')[0].get_attribute('class').split()[-1]

                 }

        data.append(dictio)
        
    
    # Click 'next' and wait briefly for the following page to load
    boton = driver.find_element(By.CSS_SELECTOR, '#default > div > div > div > div > section > div:nth-child(2) > div > ul > li.next > a')

    boton.click()

    time.sleep(1)
    
    
    
df = pd.DataFrame(data)

# Map the star rating words to integers
dictio = {'Three': 3, 'One': 1, 'Four': 4, 'Five': 5, 'Two': 2}

df.estrellas = df.estrellas.apply(lambda x: dictio[x]) 


df

Unnamed: 0,nombre,precio,stock,estrellas
0,A Light in the Attic,51.77,In stock,3
1,Tipping the Velvet,53.74,In stock,1
2,Soumission,50.10,In stock,1
3,Sharp Objects,47.82,In stock,4
4,Sapiens: A Brief History of Humankind,54.23,In stock,5
...,...,...,...,...
975,Icing (Aces Hockey #2),40.44,In stock,4
976,"Hawkeye, Vol. 1: My Life as a Weapon (Hawkeye #1)",45.24,In stock,3
977,Having the Barbarian's Baby (Ice Planet Barbar...,34.96,In stock,4
978,"Giant Days, Vol. 1 (Giant Days #1-4)",56.76,In stock,4
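
A fixed page count has to be adjusted by hand whenever the catalogue changes. One alternative, sketched here without Selenium so that it runs anywhere, is to keep reading pages until there is no "next" link left. `collect_pages`, `FakeSite` and the sample data are stand-ins invented for this illustration, not part of the notebook.

```python
def collect_pages(read_page, go_to_next, max_pages=100):
    """Collect rows page by page until go_to_next() reports no further page."""
    rows = []
    for _ in range(max_pages):  # upper bound as a safety net against endless loops
        rows.extend(read_page())
        if not go_to_next():
            break
    return rows


class FakeSite:
    """Stand-in for a browser session: a list of pages and a cursor."""

    def __init__(self, pages):
        self.pages = pages
        self.current = 0

    def read_page(self):
        return list(self.pages[self.current])

    def go_to_next(self):
        if self.current + 1 >= len(self.pages):
            return False  # the last page has no 'next' link
        self.current += 1
        return True


site = FakeSite([['a', 'b'], ['c', 'd'], ['e']])
print(collect_pages(site.read_page, site.go_to_next))
```

With a real driver, `go_to_next` could call `driver.find_elements(By.CSS_SELECTOR, 'li.next > a')`, which returns an empty list when nothing matches, click the first result if there is one, and return `False` otherwise. Every page is read before the loop looks for the link, so the last page is included.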
