# Web Scraping Lab

This notebook contains a set of web scraping exercises to practise your scraping skills.

**Tips:**

- Check the response status code for each request to ensure you have obtained the intended content.
- Print the response text in each request to understand the kind of info you are getting and its format.
- Check for patterns in the response text to extract the data/info requested in each question.
- Visit each URL and inspect its source with Chrome DevTools. You'll need to identify the HTML tags, class names, etc. used for the content you are expected to extract.

- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide) documentation 
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [lxml lib](https://lxml.de/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)
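
The first tip can be made concrete with a small status check. The sketch below is only illustrative: `describe_status` and `is_success` are hypothetical helper names, and in the exercises the integer would come from `response.status_code` on a `requests` response rather than from a hard-coded tuple.

```python
from http import HTTPStatus


def describe_status(status_code):
    """Return a readable label such as '404 Not Found' for a standard HTTP status code."""
    status = HTTPStatus(status_code)
    return f"{status.value} {status.phrase}"


def is_success(status_code):
    """True for 2xx codes, the range that signals the request succeeded."""
    return 200 <= status_code < 300


# In the exercises the code would come from requests.get(url).status_code
for code in (200, 404, 503):
    verdict = "status looks fine" if is_success(code) else "inspect before parsing"
    print(f"{describe_status(code)}: {verdict}")
```

A 2xx code only says the server answered successfully, so printing part of the response text is still worthwhile to confirm the page is the one you expected.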

#### Below are the libraries and modules you may need. `requests`,  `BeautifulSoup` and `pandas` are imported for you. If you prefer to use additional libraries feel free to uncomment them.

In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
# from pprint import pprint
# from lxml import html
# from lxml.html import fromstring
# import urllib.request
# from urllib.request import urlopen
# import random
# import re
# import scrapy

#### Download, parse (using BeautifulSoup), and print the content from the Trending Developers page from GitHub:

In [3]:
# This is the url you will scrape in this exercise
developers_url = 'https://github.com/trending/developers'

In [4]:
#your code

trending_developers_soup = BeautifulSoup(requests.get(developers_url).content, "lxml")

print(trending_developers_soup)

<!DOCTYPE html>
<html data-a11y-animated-images="system" data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
<link href="https://avatars.githubusercontent.com" rel="preconnect"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/light-5178aee0ee76.css" integrity="sha512-UXiu4O52iBFkqt6Kx5t+pqHYP2/LWWIw9+l5ia74TWw+xPzpH44BFfAQp7yzCe0XFGZa72Xiqyml6tox1KkUjw==" media="all" rel="stylesheet"/><link crossorigin="anonymous" href="https://github.githubassets.com/assets/dark-217d4f9c8e70.css" integrity="sha512-IX1PnI5wWBz8Kgb1JI0f2QFa/WuRQQHJHe0vkKinQ

#### Display the names of the trending developers retrieved in the previous step.

Your output should be a Python list of developer names. Each name should not contain any html tag.

**Instructions:**

1. Find out the html tag and class names used for the developer names. You can achieve this using Chrome DevTools.

1. Use BeautifulSoup to extract all the html elements that contain the developer names.

1. Use string manipulation techniques to replace whitespaces and linebreaks (i.e. `\n`) in the *text* of each html element. Use a list to store the clean names.

1. Print the list of names.

Your output should look like below:

```
['trimstray (@trimstray)',
 'joewalnes (JoeWalnes)',
 'charlax (Charles-AxelDein)',
 'ForrestKnight (ForrestKnight)',
 'revery-ui (revery-ui)',
 'alibaba (Alibaba)',
 'Microsoft (Microsoft)',
 'github (GitHub)',
 'facebook (Facebook)',
 'boazsegev (Bo)',
 'google (Google)',
 'cloudfetch',
 'sindresorhus (SindreSorhus)',
 'tensorflow',
 'apache (TheApacheSoftwareFoundation)',
 'DevonCrawford (DevonCrawford)',
 'ARMmbed (ArmMbed)',
 'vuejs (vuejs)',
 'fastai (fast.ai)',
 'QiShaoXuan (Qi)',
 'joelparkerhenderson (JoelParkerHenderson)',
 'torvalds (LinusTorvalds)',
 'CyC2018',
 'komeiji-satori (神楽坂覚々)',
 'script-8']
 ```
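
Step 3 above can be handled with `str.split()`, which splits on any run of whitespace, including `\n`. The sketch below shows two variants on made-up raw strings: `collapse_whitespace` keeps a single space between words, while `remove_whitespace` drops every space, which is what the sample output above appears to do. Both function names are hypothetical.

```python
def collapse_whitespace(raw_text):
    """Trim the ends and squeeze every inner run of whitespace into one space."""
    return " ".join(raw_text.split())


def remove_whitespace(raw_text):
    """Drop all whitespace, including the spaces between words."""
    return "".join(raw_text.split())


# Made-up examples of the text an HTML element might contain
raw_names = ["\n\n      Linus\n      Torvalds\n", "  script-8 \n"]

print([collapse_whitespace(name) for name in raw_names])
print([remove_whitespace(name) for name in raw_names])
```

Which variant to use depends on whether the names should stay readable (`'Linus Torvalds'`) or match the compact sample output (`'LinusTorvalds'`).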

In [5]:
#your code

developer_articles = trending_developers_soup.find_all("article", class_= "Box-row d-flex")

developer_names = [developer_article.find("h1").text for developer_article in developer_articles]

developer_names_no_spaces = [developer_name.strip() for developer_name in developer_names]

print(developer_names_no_spaces)

['Olivier Halligon', 'Adrian Wälchli', "Bruno D'Luka", 'Michael Waskom', 'Samuel Berthe', 'James N', 'Lee Robinson', 'George Mamadashvili', 'Ariel Mashraki', 'Fred K. Schott', 'visiky', 'J. Nick Koston', 'Juliette', 'Benjamin Pasero', 'Muntashir Al-Islam', 'metonym', 'Tõnis Tiigi', 'Abdelkader Boudih', 'Patrick Collins', 'Piotr Machowski', 'João Pedro Schmitz', 'Bryan C Guner', 'James Newton-King', 'enjoy-digital', 'Florian Roth']
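
Passing a multi-class string such as `"Box-row d-flex"` to `class_` matches only that exact attribute string, so the lookup can silently return nothing if the page adds or reorders classes. A CSS selector is one possible alternative. The sketch below runs on a made-up HTML fragment, not on GitHub's real markup, and uses the built-in `html.parser` so no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Made-up fragment: the first article lists its classes in a different order plus an extra one
html = """
<article class="d-flex Box-row extra"><h1>
    Ada Lovelace
</h1></article>
<article class="Box-row d-flex"><h1> Alan Turing </h1></article>
"""

soup = BeautifulSoup(html, "html.parser")

# Exact-string match on the class attribute: only the second article qualifies
exact = soup.find_all("article", class_="Box-row d-flex")

# CSS selector: matches any article carrying both classes, in any order
flexible = soup.select("article.Box-row.d-flex")

names = [article.find("h1").get_text(strip=True) for article in flexible]
print(len(exact), len(flexible), names)
```

`get_text(strip=True)` also removes the surrounding whitespace, so no separate `strip()` pass is needed.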


#### Display the trending Python repositories in GitHub

The steps to solve this problem are similar to the previous one, except that you need to find the repository names instead of the developer names.

In [6]:
# This is the url you will scrape in this exercise
repositories_url = 'https://github.com/trending/python?since=daily'

In [7]:
#your code

trending_repositories_soup = BeautifulSoup(requests.get(repositories_url).content, "lxml")

print(trending_repositories_soup)

repository_articles = trending_repositories_soup.find_all("article", class_= "Box-row")

print(len(repository_articles))

repository_names = [repository_article.find("h1").text for repository_article in repository_articles]

repository_names_no_spaces = [repository_name.strip().replace(" ", "").replace("\n", "") for repository_name in repository_names]

print(repository_names_no_spaces)

<!DOCTYPE html>
<html data-a11y-animated-images="system" data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
<link href="https://avatars.githubusercontent.com" rel="preconnect"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/light-5178aee0ee76.css" integrity="sha512-UXiu4O52iBFkqt6Kx5t+pqHYP2/LWWIw9+l5ia74TWw+xPzpH44BFfAQp7yzCe0XFGZa72Xiqyml6tox1KkUjw==" media="all" rel="stylesheet"/><link crossorigin="anonymous" href="https://github.githubassets.com/assets/dark-217d4f9c8e70.css" integrity="sha512-IX1PnI5wWBz8Kgb1JI0f2QFa/WuRQQHJHe0vkKinQ

#### Display all the image links from the Walt Disney Wikipedia page

In [8]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/Walt_Disney'

In [9]:
#your code

base_url = "https://en.wikipedia.org"

disney_soup = BeautifulSoup(requests.get(url).content, "lxml")

first_image_element = disney_soup.find("td", class_= "infobox-image")

first_image_url = base_url + first_image_element.find("a")["href"]

image_divs = disney_soup.find_all("div", class_= "thumbinner")

print(len(image_divs))

image_urls = [base_url + image_div.find("a")["href"] for image_div in image_divs]

image_urls.insert(0, first_image_url)

print(image_urls)

14
['https://en.wikipedia.org/wiki/File:Walt_Disney_1946.JPG', 'https://en.wikipedia.org/wiki/File:Walt_Disney_envelope_ca._1921.jpg', 'https://en.wikipedia.org/wiki/File:Newman_Laugh-O-Gram_(1921).webm', 'https://en.wikipedia.org/wiki/File:Trolley_Troubles_poster.jpg', 'https://en.wikipedia.org/wiki/File:Steamboat-willie.jpg', 'https://en.wikipedia.org/wiki/File:Walt_Disney_1935.jpg', 'https://en.wikipedia.org/wiki/File:Walt_Disney_Snow_white_1937_trailer_screenshot_(13).jpg', 'https://en.wikipedia.org/wiki/File:Disney_drawing_goofy.jpg', 'https://en.wikipedia.org/wiki/File:DisneySchiphol1951.jpg', 'https://en.wikipedia.org/wiki/File:WaltDisneyplansDisneylandDec1954.jpg', 'https://en.wikipedia.org/wiki/File:Walt_disney_portrait_right.jpg', 'https://en.wikipedia.org/wiki/File:Walt_Disney_Grave.JPG', 'https://en.wikipedia.org/wiki/File:Roy_O._Disney_with_Company_at_Press_Conference.jpg', 'https://en.wikipedia.org/wiki/File:Disney_Oscar_1953_(cropped).jpg', 'https://en.wikipedia.org/wiki

#### Retrieve an arbitrary Wikipedia page of "Python" and create a list of links on that page

In [10]:
# This is the url you will scrape in this exercise
url ='https://en.wikipedia.org/wiki/Python' 

In [11]:
#your code

base_url = "https://en.wikipedia.org"

wiki_soup = BeautifulSoup(requests.get(url).content, "lxml")

first_div_element = wiki_soup.find("div", id= "mw-content-text")

inner_div_element = first_div_element.find("div", class_="mw-parser-output")

h2_elements = inner_div_element.find_all("h2")

ul_elements = [h2_element.find_next_sibling("ul") for h2_element in h2_elements]

ul_elements_clean = [ul_element for ul_element in ul_elements if ul_element is not None]

a_elements_lists = [ul_element_clean.find_all("a") for ul_element_clean in ul_elements_clean]

urls = [base_url + a_element["href"] for a_elements_list in a_elements_lists for a_element in a_elements_list]

wiki_python_page_soup = BeautifulSoup(requests.get(urls[0]).content, "lxml")

a_elements = wiki_python_page_soup.find_all("a")

a_elements_clean = [a_element for a_element in a_elements if (a_element.has_attr("href") and "/" in a_element["href"])]

a_elements_no_images = [a_element_clean for a_element_clean in a_elements_clean if not a_element_clean.has_attr("class")]

url_list = [base_url + a_element_no_images["href"] for a_element_no_images in a_elements_no_images]

print(url_list)
print(len(url_list))






['https://en.wikipedia.org/wiki/Pythonides', 'https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Lead_section#Length', 'https://en.wikipedia.org/wiki/Wikipedia:Summary_style', 'https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Lead_section#Provide_an_accessible_overview', 'https://en.wikipedia.org/wiki/Taxonomy_(biology)', 'https://en.wikipedia.org/wiki/Template:Taxonomy/Pythonidae', 'https://en.wikipedia.org/wiki/Animal', 'https://en.wikipedia.org/wiki/Chordate', 'https://en.wikipedia.org/wiki/Reptile', 'https://en.wikipedia.org/wiki/Squamata', 'https://en.wikipedia.org/wiki/Snake', 'https://en.wikipedia.org/wiki/Pythonoidea', 'https://en.wikipedia.org/wiki/Leopold_Fitzinger', 'https://en.wikipedia.org/wiki/Synonym_(taxonomy)', 'https://en.wikipedia.org/wiki/Family_(biology)', 'https://en.wikipedia.org/wiki/Venomous_snake', 'https://en.wikipedia.org/wiki/Snake', 'https://en.wikipedia.org/wiki/Genus', 'https://en.wikipedia.org/wiki/Species', 'https://en.wikipedia.org/w/index
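
Concatenating `base_url` with each `href` works for root-relative links such as `/wiki/Snake`, but it produces malformed URLs when an `href` is already absolute or protocol-relative. One possible alternative is `urllib.parse.urljoin`, sketched below on made-up `href` values.

```python
from urllib.parse import urljoin

page_url = "https://en.wikipedia.org/wiki/Python"

# Made-up href values covering the forms commonly seen in anchor tags
hrefs = [
    "/wiki/Pythonidae",               # root-relative
    "//upload.wikimedia.org/a.png",   # protocol-relative
    "https://www.python.org/",        # already absolute
    "#See_also",                      # fragment on the same page
]

# urljoin resolves each href against the page it was found on
absolute_urls = [urljoin(page_url, href) for href in hrefs]

for href, absolute in zip(hrefs, absolute_urls):
    print(f"{href!r:35} -> {absolute}")
```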

#### Number of Titles that have changed in the United States Code since its last release point 

In [12]:
# This is the url you will scrape in this exercise
code_url = 'http://uscode.house.gov/download/download.shtml'

In [13]:
#your code

code_soup = BeautifulSoup(requests.get(code_url).content, "lxml")

changed_title_elements = code_soup.find_all("div", class_="usctitlechanged")

changed_titles = [changed_title_element.text.strip() for changed_title_element in changed_title_elements]

print(changed_titles)

['Title 31 - Money and Finance ٭', 'Title 34 - Crime Control and Law Enforcement']


#### A Python list with the top ten FBI's Most Wanted names 

In [14]:
# This is the url you will scrape in this exercise
fbi_url = 'https://www.fbi.gov/wanted/topten'

In [15]:
#your code 

fbi_soup = BeautifulSoup(requests.get(fbi_url).content, "lxml")

wanted_list = fbi_soup.find_all("li", class_= "portal-type-person castle-grid-block-item")

wanted_names = [wanted.find("h3").text.strip() for wanted in wanted_list]

print(wanted_names)

['BHADRESHKUMAR CHETANBHAI PATEL', 'OMAR ALEXANDER CARDENAS', 'ALEJANDRO ROSALES CASTILLO', 'RUJA IGNATOVA', 'JASON DEREK BROWN', 'ARNOLDO JIMENEZ', 'ALEXIS FLORES', 'JOSE RODOLFO VILLARREAL-HERNANDEZ', 'YULAN ADONAY ARCHAGA CARIAS', 'RAFAEL CARO-QUINTERO']


####  20 latest earthquakes info (date, time, latitude, longitude and region name) by the EMSC as a pandas dataframe

In [16]:
# This is the url you will scrape in this exercise
earthquake_url = 'https://www.emsc-csem.org/Earthquake/'

In [17]:
#your code

import re

columns = ["date", "time", "latitude", "longitude", "region"]

earthquake_soup = BeautifulSoup(requests.get(earthquake_url).content, "lxml")

table_body = earthquake_soup.find("tbody", id = "tbody")

table_rows = table_body.find_all("tr")

data_for_df = []

for table_row in table_rows[:20]:
    data = []
    date_and_hour_element = table_row.find("td", class_= "tabev6")
    date_and_hour = date_and_hour_element.find("a").text.split("\xa0\xa0\xa0")
    date = date_and_hour[0]
    time = date_and_hour[1]
    position_numbers = table_row.find_all("td", class_= "tabev1")
    position_letters = table_row.find_all("td", class_= "tabev2")
    position_and_letters = list(zip(position_numbers, position_letters))
    latitude = position_and_letters[0][0].text + position_and_letters[0][1].text
    longitude = position_and_letters[1][0].text + position_and_letters[1][1].text
    region = table_row.find("td", class_="tb_region").text
    data.append(date)
    data.append(time)
    data.append(latitude.replace("\xa0", " ").strip())
    data.append(longitude.replace("\xa0", " ").strip())
    data.append(region.replace("\xa0", " ").strip())
    data_for_df.append(data)
    
earthquakes = pd.DataFrame(data_for_df, columns = columns)

print(earthquakes)

          date        time latitude longitude  \
0   2022-08-06  12:43:50.8  32.51 N  142.41 E   
1   2022-08-06  12:41:21.8  35.46 N    3.61 W   
2   2022-08-06  12:30:04.0   7.61 S  122.42 E   
3   2022-08-06  12:23:45.7  36.19 N  141.13 E   
4   2022-08-06  12:21:42.1  35.34 N    3.63 W   
5   2022-08-06  12:01:55.0  17.88 N  120.52 E   
6   2022-08-06  11:57:42.3  35.58 N    3.61 W   
7   2022-08-06  11:48:25.0  12.36 N   87.74 W   
8   2022-08-06  11:48:02.4  51.24 N  160.92 E   
9   2022-08-06  11:39:48.6  35.61 N    3.53 W   
10  2022-08-06  11:33:53.0  15.80 N   94.36 W   
11  2022-08-06  11:18:31.9  35.77 N    3.16 W   
12  2022-08-06  11:03:55.5  35.64 N    3.58 W   
13  2022-08-06  11:00:34.0  24.30 S   67.09 W   
14  2022-08-06  10:59:03.0  11.18 N   86.96 W   
15  2022-08-06  10:52:36.5  41.36 N  125.90 W   
16  2022-08-06  10:44:32.9  61.30 N  146.66 W   
17  2022-08-06  10:13:14.6  44.48 N  115.18 W   
18  2022-08-06  10:12:54.0  35.67 N    3.41 W   
19  2022-08-06  10:0
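
The latitude and longitude columns above are stored as text such as `32.51 N`. If numeric coordinates are needed later (for sorting or plotting), one way to convert them is sketched below; `to_signed_degrees` is a hypothetical helper, and it assumes every value has the form "number, whitespace, hemisphere letter".

```python
def to_signed_degrees(coordinate):
    """Convert text such as '3.61 W' to a signed float (-3.61)."""
    # str.split() with no arguments also treats non-breaking spaces as whitespace
    value, hemisphere = coordinate.split()
    # South and west are negative by convention
    sign = -1.0 if hemisphere.upper() in ("S", "W") else 1.0
    return sign * float(value)


samples = ["32.51 N", "142.41 E", "7.61\xa0S", "3.61 W"]
print([to_signed_degrees(sample) for sample in samples])
```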

#### Display the date, days, title, city, country of the next 25 hackathon events as a Pandas dataframe table

In [18]:
# This is the url you will scrape in this exercise
hack_url ='https://mlh.io/seasons/na-2020/events'

In [19]:
#your code

from datetime import datetime, timedelta

day = timedelta(days=1)

columns = ["Title", "Dates", "Days", "City", "State"]

hack_soup = BeautifulSoup(requests.get(hack_url).content, "lxml")

event_items = hack_soup.find_all("a", class_="event-link")

events_for_df = []

for event_item in event_items[0:25]:
    event_title = event_item.find("h3").text
    event_date_string = event_item.find("p", class_="event-date").text.strip()
    event_dates = event_item.find_all("meta")
    event_days = (datetime.strptime(event_dates[1]["content"], "%Y-%m-%d").date() - datetime.strptime(event_dates[0]["content"], "%Y-%m-%d").date()) + day
    event_city = event_item.find("span", itemprop = "city").text
    event_state = event_item.find("span", itemprop = "state").text
    event_info = [event_title, event_date_string, event_days.days, event_city, event_state]
    events_for_df.append(event_info)
    
events_df = pd.DataFrame(events_for_df, columns = columns)
print(events_df)

                                 Title            Dates  Days            City  \
0                              HackMTY  Aug 24th - 25th     2       Monterrey   
1                        Citizen Hacks    Sep 6th - 8th     3         Toronto   
2                             PennApps    Sep 6th - 8th     3    Philadelphia   
3    Hackathon de Futuras Tecnologías     Sep 7th - 8th     2         Torreón   
4                       Hack the North  Sep 13th - 15th     3        Waterloo   
5                             HopHacks  Sep 13th - 15th     3       Baltimore   
6                        BigRed//Hacks  Sep 20th - 22nd     3          Ithaca   
7                             HackRice  Sep 20th - 21st     2         Houston   
8                            SBUHacks   Sep 20th - 21st     2     Stony Brook   
9                           ShellHacks  Sep 20th - 22nd     3           Miami   
10                            sunhacks  Sep 20th - 22nd     3           Tempe   
11                    Kent H
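
The `Days` column counts both the first and the last day of an event, which is why one day is added to the date difference. A minimal sketch of that calculation, using a hypothetical `inclusive_days` helper and made-up ISO dates:

```python
from datetime import date


def inclusive_days(start_iso, end_iso):
    """Number of calendar days an event spans, counting both endpoints."""
    start = date.fromisoformat(start_iso)
    end = date.fromisoformat(end_iso)
    return (end - start).days + 1


# Made-up dates in the same YYYY-MM-DD format as the meta tags
print(inclusive_days("2019-09-06", "2019-09-08"))
print(inclusive_days("2019-08-24", "2019-08-25"))
```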

#### Count number of tweets by a given Twitter account.

You will need to include a ***try/except block*** for account names not found. 
<br>***Hint:*** the program should count the number of tweets for any provided account

In [21]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url

# NOTE: Selenium is used because requests does not return the full content; the page's JavaScript needs to run as well

account_name = "RajadoresFutbol"

twitter_url = 'https://twitter.com/'

final_url = twitter_url + account_name

In [15]:
#your code

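from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')

driver = webdriver.Chrome("chromedriver.exe", options = options)

driver.implicitly_wait(10)
driver.get(final_url)

try:
    # The tweet counter sits in the profile header; these class names are auto-generated and may change
    number_of_tweets = driver.find_element(By.CSS_SELECTOR, '.css-901oao.css-1hf3ou5.r-14j79pv.r-37j5jr.r-n6v787.r-16dba41.r-1cwl3u0.r-bcqeeo.r-qvutc0').text
    print(number_of_tweets)
except NoSuchElementException:
    # Raised when the account does not exist or the counter is not rendered
    print(f"Could not find a tweet count for account '{account_name}'")
finally:
    print("finished")
    driver.quit()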

1.501 Tweets
finished


#### Number of followers of a given twitter account

You will need to include a ***try/except block*** in case the account name is not found. 
<br>***Hint:*** the program should count the followers for any provided account

In [22]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url

# We'll use the same url as in the previous question

# NOTE: Selenium is used because requests does not return the full content; the page's JavaScript needs to run as well


In [14]:
#your code

from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')

driver = webdriver.Chrome("chromedriver.exe", options = options)

driver.implicitly_wait(10)
driver.get(final_url)

try:
    possible_elements = driver.find_elements(By.CSS_SELECTOR, '.css-1dbjc4n.r-13awgt0.r-18u37iz.r-1w6e6rj')

    # Collect every candidate span inside the profile header blocks
    flat_spans = [
        span
        for possible_element in possible_elements
        for span in possible_element.find_elements(By.CSS_SELECTOR, "span.css-901oao.css-16my406.r-poiln3.r-bcqeeo.r-qvutc0")
    ]

    # The follower count was the fifth span when this was written; the position may change
    followers = flat_spans[4]
    print(followers.text)
except IndexError:
    # find_elements returns an empty list when nothing matches, e.g. for an unknown account
    print(f"Could not find a follower count for account '{account_name}'")
finally:
    print("finished")
    driver.quit()

46,2 mil
finished


#### List all language names and number of related articles in the order they appear in wikipedia.org

In [24]:
# This is the url you will scrape in this exercise
wiki_url = 'https://www.wikipedia.org/'

In [25]:
#your code

wiki_soup = BeautifulSoup(requests.get(wiki_url).content, "lxml")

central_element = wiki_soup.find("div", class_="central-featured")

language_elements = central_element.find_all("a")

language_and_articles = [(language_element.find("strong").text, language_element.find("small").text.replace("\xa0", ".")) for language_element in language_elements]

print(language_and_articles)

[('English', '6.458.000+ articles'), ('日本語', '1.314.000+ 記事'), ('Русский', '1.798.000+ статей'), ('Deutsch', '2.667.000+ Artikel'), ('Español', '1.755.000+ artículos'), ('Français', '2.400.000+ articles'), ('Italiano', '1.742.000+ voci'), ('中文', '1.256.000+ 条目 / 條目'), ('Português', '1.085.000+ artigos'), ('Polski', '1.512.000+ haseł')]
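
The article counts above are still text. If they need to be compared or sorted, they can be converted to integers. The sketch below is one way to do it, assuming the characters between digits are only thousands separators (a dot or a non-breaking space) and that the number precedes the `+` sign; `parse_article_count` is a hypothetical helper.

```python
import re


def parse_article_count(label):
    """Extract the integer from text such as '6.458.000+ articles'."""
    # Keep only the part before the plus sign, then drop every non-digit character
    number_part = label.split("+")[0]
    digits = re.sub(r"\D", "", number_part)
    return int(digits)


labels = ["6.458.000+ articles", "1\xa0314\xa0000+ 記事", "2.667.000+ Artikel"]
print([parse_article_count(label) for label in labels])
```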


#### A list with the different kinds of datasets available on data.gov.uk 

In [26]:
# This is the url you will scrape in this exercise
gov_url = 'https://data.gov.uk/'

In [27]:
#your code 

gov_soup = BeautifulSoup(requests.get(gov_url).content, "lxml")

data_block = gov_soup.find("ul", class_= "govuk-list dgu-topics__list")

data_urls = data_block.find_all("a", class_= "govuk-link")

data_types = [data_url.text for data_url in data_urls]

print(data_types)

['Business and economy', 'Crime and justice', 'Defence', 'Education', 'Environment', 'Government', 'Government spending', 'Health', 'Mapping', 'Society', 'Towns and cities', 'Transport', 'Digital service performance', 'Government reference data']


#### Top 10 languages by number of native speakers stored in a Pandas Dataframe

In [28]:
# This is the url you will scrape in this exercise
wiki_languages_url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [29]:
#your code

wiki_languages_soup = BeautifulSoup(requests.get(wiki_languages_url).content, "lxml")

columns = ["Language", "Speakers"]

table_element = wiki_languages_soup.find("tbody")

language_elements = table_element.find_all("tr")

language_elements.pop(0)

language_rows = [language_element.find_all("td") for language_element in language_elements]

language_speakers = [[language_row[1].text.replace("\n", ""), language_row[2].text.replace("\n", "") + " millions"] for language_row in language_rows]

final_dataframe = pd.DataFrame(language_speakers[:10], columns = columns)

print(final_dataframe)

                              Language        Speakers
0                     Mandarin Chinese  929.0 millions
1                              Spanish  474.7 millions
2                              English  372.9 millions
3  Hindi (Sanskritised Hindustani)[11]  343.9 millions
4                              Bengali  233.7 millions
5                           Portuguese  232.4 millions
6                              Russian  154.0 millions
7                             Japanese  125.3 millions
8                  Western Punjabi[12]   92.7 millions
9                          Yue Chinese   85.2 millions


### BONUS QUESTIONS

#### Scrape a certain number of tweets of a given Twitter account.

In [1]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
# NOTE: Selenium is used because requests does not return the full content; the page's JavaScript needs to run as well


account_name = "RajadoresFutbol"

twitter_url = 'https://twitter.com/'

final_url = twitter_url + account_name

In [13]:
# your code

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')

driver = webdriver.Chrome("chromedriver.exe", options = options)

driver.implicitly_wait(10)
driver.get(final_url)

tweet_elements = driver.find_elements(By.XPATH, '//div[@data-testid="tweetText"]')

for tweet_element in tweet_elements:
    print(tweet_element.text)
    print("*********************")

print("finished")
driver.close()

Desde mañana, a pesar de seguir rajando de aquellos que han pasado por nuestro fútbol, ampliaremos el espectro y analizaremos jugadores de otras ligas, de la Champions, entrenadores, presidentes, estadios... Cualquier cosa relacionada con el fútbol susceptible de ser rajada.
*********************
La radio. Cuando los chinos no veían La Liga, había jornadas con 5 partidos a la misma hora. Poner la radio era vivir al límite. Sonaba la sintonía de gol y tú enchufando el desfibrilador. Comentaristas agonizando a 250 pulsaciones y el balón a 2km de las porterías. Puritos Dux.
*********************
finished


#### IMDB's Top 250 data (movie name, Initial release, director name and stars) as a pandas dataframe

In [32]:
# This is the url you will scrape in this exercise 
top_url = 'https://www.imdb.com/chart/top'

In [42]:
# your code
# WARNING: THIS CELL TAKES A FEW MINUTES TO RUN

top_soup = BeautifulSoup(requests.get(top_url).content, "lxml")

columns = ["Movie name", "Initial release", "Director", "Stars"]

table_element = top_soup.find("tbody", class_ = "lister-list")

movies_base_url = "https://www.imdb.com"

table_rows = table_element.find_all("tr")

movies_for_df = []

for table_row in table_rows:
    movie_name_element = table_row.find("td", class_= "titleColumn")
    movie_name_url_element = movie_name_element.find("a")
    movie_name = movie_name_url_element.text
    movie_name_url = movies_base_url + movie_name_url_element["href"]
    movie_release = movie_name_element.find("span").text
    rating = table_row.find("td", class_="ratingColumn imdbRating").text.strip()

    movie_soup = BeautifulSoup(requests.get(movie_name_url).content, "lxml")
    director_element = movie_soup.find("div", class_= "sc-fa02f843-0 fjLeDR")
    director = director_element.find("a").text
    movie_info = [movie_name, movie_release, director, rating]
    
    movies_for_df.append(movie_info)

movies_df = pd.DataFrame(movies_for_df, columns = columns)

movies_df.head()

                          Movie name  Initial release              Director  Stars
0                  Sueño de Libertad           (1994)        Frank Darabont    9.2
1                         El Padrino           (1972)  Francis Ford Coppola    9.2
2  Batman - El caballero de la noche           (2008)     Christopher Nolan    9.0
3                El Padrino 2ª Parte           (1974)  Francis Ford Coppola    9.0
4                12 hombres en pugna           (1957)          Sidney Lumet    8.9


#### Movie name, year and a brief summary of the top 10 random movies (IMDB) as a pandas dataframe.

In [43]:
#This is the url you will scrape in this exercise
random_url = 'http://www.imdb.com/chart/top'

In [52]:
#your code

import random

random_soup = BeautifulSoup(requests.get(random_url).content, "lxml")

columns = ["Movie name", "Initial release", "Summary"]

random_table_element = random_soup.find("tbody", class_ = "lister-list")

random_movies_base_url = "https://www.imdb.com"

random_table_rows = random_table_element.find_all("tr")

random.shuffle(random_table_rows)

random_movies_for_df = []

for random_table_row in random_table_rows[:10]:
    random_movie_name_element = random_table_row.find("td", class_= "titleColumn")
    random_movie_name_url_element = random_movie_name_element.find("a")
    random_movie_name = random_movie_name_url_element.text
    random_movie_name_url = random_movies_base_url + random_movie_name_url_element["href"]
    random_movie_release = random_movie_name_element.find("span").text
    
    random_movie_soup = BeautifulSoup(requests.get(random_movie_name_url).content, "lxml")
    random_summary = random_movie_soup.find("span", class_= "sc-16ede01-0 fMPjMP").text
    random_movie_info = [random_movie_name, random_movie_release, random_summary]
    
    random_movies_for_df.append(random_movie_info)
    
random_movies_df = pd.DataFrame(random_movies_for_df, columns = columns)

print(random_movies_df)

                     Movie name Initial release  \
0              El séptimo sello          (1957)   
1                  Forrest Gump          (1994)   
2  La maldición del Perla Negra          (2003)   
3                         Rocky          (1976)   
4         Intriga internacional          (1959)   
5                       El Pibe          (1921)   
6                 The Lion King          (1994)   
7                     Toy Story          (1995)   
8                      La caída          (2004)   
9              Relatos salvajes          (2014)   

                                             Summary  
0  A knight returning to Sweden after the Crusade...  
1  The presidencies of Kennedy and Johnson, the V...  
2  Blacksmith Will Turner teams up with eccentric...  
3  A small-time Philadelphia boxer gets a supreme...  
4  A New York City advertising executive goes on ...  
5  The Tramp cares for an abandoned child, but ev...  
6  Lion prince Simba and his father are targeted ... 
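
`random.shuffle` reorders the list of rows in place before the first ten are sliced off. When the original order should be preserved, `random.sample` is a possible alternative: it returns a new list of unique picks. The sketch below uses made-up strings as a stand-in for the scraped table rows.

```python
import random

# Stand-in for the 250 table rows scraped from the chart
rows = [f"movie-{rank}" for rank in range(1, 251)]

# random.sample returns a new list of unique picks and leaves `rows` untouched
picked = random.sample(rows, k=10)

print(picked)
print(rows[:3])  # still in chart order
```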

#### Find the live weather report (temperature, wind speed, description and weather) of a given city.

In [17]:
#https://openweathermap.org/current
#city = input('Enter the city:')
#url = 'http://api.openweathermap.org/data/2.5/weather?'+'q='+city+'&APPID=b35975e18dc93725acb092f7272cc6b8&units=metric'

#NOTE: Selenium is used because the API key does not work

weather_url = "https://openweathermap.org/city/3435910"

In [28]:
# your code
# Selenium is required here

options = Options()
options.add_argument('--headless')

driver = webdriver.Chrome("chromedriver.exe", options = options)

driver.implicitly_wait(10)
driver.get(weather_url)

temperature_element = driver.find_element(By.CLASS_NAME, "current-temp")
sensation_element = driver.find_element(By.CLASS_NAME, "bold")
weather_list = driver.find_elements(By.XPATH, "//li[@data-v-3208ab85]")

print(temperature_element.text)
print(sensation_element.text)

for weather_element in weather_list:
    print(weather_element.text)
    print()

print("finished")

driver.quit()


13°C
Feels like 12°C. Scattered clouds. Fresh Breeze
8.2m/s ESE

1027hPa

Humidity:
66%

UV:
1

Dew point:
7°C

Visibility:
10.0km

finished


#### Book name, price and stock availability as a pandas dataframe.

In [71]:
# This is the url you will scrape in this exercise. 
# It is a fictional bookstore created to be scraped. 
books_url = 'http://books.toscrape.com/'

In [76]:
#your code
# WARNING: This cell takes a couple of minutes to run

all_books = []
all_articles = []

columns = ["Title", "Price", "Availability"]

for i in range(1,51):
    books_catalogue_url = f"https://books.toscrape.com/catalogue/page-{i}.html"
    books_soup = BeautifulSoup(requests.get(books_catalogue_url).content, "lxml")
    article_elements = books_soup.find_all("article")
    all_articles.append(article_elements)

articles_flat = [article for article_elements in all_articles for article in article_elements]

for article_flat in articles_flat:
    title_element = article_flat.find("h3")
    title = title_element.find("a")["title"]
    price = article_flat.find("p", class_= "price_color").text
    stock = article_flat.find("p", class_= "instock availability").text.strip()
    book_info = [title, price, stock]
    all_books.append(book_info)

all_books_df = pd.DataFrame(all_books, columns = columns)

all_books_df.head()


                                   Title   Price  Availability
0                   A Light in the Attic  £51.77      In stock
1                     Tipping the Velvet  £53.74      In stock
2                             Soumission  £50.10      In stock
3                          Sharp Objects  £47.82      In stock
4  Sapiens: A Brief History of Humankind  £54.23      In stock
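
Two small pieces of this exercise can be tried without any network access: building the 50 catalogue page URLs and turning a price string into a number. The sketch below assumes prices always look like `£51.77`; `catalogue_page_url` and `parse_price` are hypothetical helper names.

```python
def catalogue_page_url(page_number):
    """Build the URL of one catalogue page, following the pattern used in the cell above."""
    return f"https://books.toscrape.com/catalogue/page-{page_number}.html"


def parse_price(price_text):
    """Convert text such as '£51.77' to a float."""
    return float(price_text.strip().lstrip("£"))


page_urls = [catalogue_page_url(number) for number in range(1, 51)]

print(page_urls[0])
print(page_urls[-1])
print(parse_price("£51.77"))
```

With numeric prices, the `Price` column can then be sorted or aggregated like any other pandas column.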
