# Web Scraping Lab

This notebook contains some web scraping exercises for practising your scraping skills.

**Tips:**

- Check the response status code for each request to ensure you have obtained the intended content.
- Print the response text in each request to understand the kind of info you are getting and its format.
- Check for patterns in the response text to extract the data/info requested in each question.
- Visit the URLs below and inspect their source code with Chrome DevTools. You'll need to identify the HTML tags, special class names, etc. used in the HTML content you are expected to extract.

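The first tip can be sketched as a small guard that runs before any parsing. The helper name `ensure_ok` is hypothetical, and a `SimpleNamespace` stands in for the object returned by `requests.get(url)` so that the snippet runs offline; it relies only on the `status_code` and `text` attributes that `requests` responses provide.

```python
from types import SimpleNamespace


def ensure_ok(response):
    """Return the response body, or raise if the status code is not 200."""
    if response.status_code != 200:
        raise RuntimeError(f"Unexpected status code: {response.status_code}")
    return response.text


# Stand-ins for real responses; in the notebook this would be requests.get(url).
good = SimpleNamespace(status_code=200, text="<html><body>ok</body></html>")
missing = SimpleNamespace(status_code=404, text="Not Found")

print(ensure_ok(good)[:6])  # peek at the start of the body

try:
    ensure_ok(missing)
except RuntimeError as err:
    print(err)
```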
**Resources**:
- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide)
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [lxml lib](https://lxml.de/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

#### Below are the libraries and modules you may need. `requests`, `BeautifulSoup`, `pandas`, `numpy` and `tqdm` are already imported for you. If you prefer to use additional libraries, feel free to do so.

In [800]:
import requests
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
from tqdm.auto import tqdm

#### Download, parse (using BeautifulSoup), and print the content from the Trending Developers page from GitHub:

In [None]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/developers'

In [None]:
# your code here
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')

In [None]:
print(soup)

#### Display the names of the trending developers retrieved in the previous step.

Your output should be a Python list of developer names. No name should contain any HTML tags.

**Instructions:**

1. Find out the HTML tag and class names used for the developer names. You can do this with Chrome DevTools.

1. Use BeautifulSoup to extract all the HTML elements that contain the developer names.

1. Use string manipulation techniques to remove whitespace and line breaks (i.e. `\n`) from the *text* of each HTML element. Use a list to store the clean names.

1. Print the list of names.

Your output should look like below:

```
['trimstray (@trimstray)',
 'joewalnes (JoeWalnes)',
 'charlax (Charles-AxelDein)',
 'ForrestKnight (ForrestKnight)',
 'revery-ui (revery-ui)',
 'alibaba (Alibaba)',
 'Microsoft (Microsoft)',
 'github (GitHub)',
 'facebook (Facebook)',
 'boazsegev (Bo)',
 'google (Google)',
 'cloudfetch',
 'sindresorhus (SindreSorhus)',
 'tensorflow',
 'apache (TheApacheSoftwareFoundation)',
 'DevonCrawford (DevonCrawford)',
 'ARMmbed (ArmMbed)',
 'vuejs (vuejs)',
 'fastai (fast.ai)',
 'QiShaoXuan (Qi)',
 'joelparkerhenderson (JoelParkerHenderson)',
 'torvalds (LinusTorvalds)',
 'CyC2018',
 'komeiji-satori (神楽坂覚々)',
 'script-8']
 ```
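The string clean-up in step 3 can be sketched as below. The helper names `clean_name` and `format_developer` are hypothetical, and the raw strings are made up to resemble the text of scraped elements. Note that the expected output above has all whitespace removed, including spaces inside full names.

```python
def clean_name(raw_text):
    """Remove every whitespace character (spaces, tabs, line breaks)."""
    return "".join(raw_text.split())


def format_developer(username, full_name=""):
    """Combine a username and an optional full name as 'user (FullName)'."""
    user = clean_name(username)
    name = clean_name(full_name)
    return f"{user} ({name})" if name else user


# Made-up raw strings shaped like the text of scraped elements.
raw_pairs = [("\n   charlax \n", "\n  Charles-Axel Dein\n"), ("\n tensorflow\n", "")]
names = [format_developer(user, full) for user, full in raw_pairs]
print(names)
```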

In [None]:
#ok, so the short name is in an 'a' tag with 'class' = 'link-gray'
crude_hacker_name = soup.find_all('a', attrs={'class' : 'link-gray'})

#One problem is that other elements also have 'link-gray' among their classes; these need to be removed from crude_hacker_name
tofilter = soup.find_all('a',attrs={'class':"py-2 lh-condensed-ultra d-block link-gray no-underline f5"})
tofilter = tofilter + soup.find_all('a',attrs={'class':"py-2 pb-0 lh-condensed-ultra d-block link-gray no-underline f5"})

#any element of crude_hacker_name that is in this tofilter list is not kept
clean_hacker_names = [name for name in crude_hacker_name if name not in tofilter]

#and now we select only the text:
hacker_names = [name.text.strip() for name in clean_hacker_names]

#testing
print(hacker_names[:5])

In [None]:
#the complete name is in <h1 class="h3 lh-condensed">
crude_full_names = soup.find_all('h1', attrs={'class' : "h3 lh-condensed"})
full_names = [full_name.text.strip() for full_name in crude_full_names]

#testing
print(full_names[:5])

In [None]:
#Notice that there are some words at the end of the list that are not related to programmer names:
print(hacker_names[-4:])

#after pairing hacker_names with full_names, those words won't be carried over.

In [None]:
#NOW THE ANSWER:
list_of_names = [f'{hacker} ({full})' for hacker, full in zip(hacker_names, full_names)]
list_of_names

#### Display the trending Python repositories in GitHub.

The steps to solve this problem are similar to the previous one, except that you need to find the repository names instead of developer names.

In [None]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/python?since=daily'

In [None]:
# We first make the soup
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')
print(soup)

In [None]:
#one interesting thing is that, on GitHub, the link to a repository has the form "/name_of_person/name_of_repository"

#all the repositories are in 'h1' tags, since they are a kind of "header" of each cell

repos_crude = soup.find_all('h1')

#now, every 'h1' has a child 'a' tag whose "href" holds the owner and the name of the repository:
person_and_repos = [repo.find_all('a')[0]['href'] for repo in repos_crude if (len(repo.find_all('a')) != 0)]

#we now print the repositories

print('The trending repositories are:\n')
for text in person_and_repos:
    a = text.split('/')[2]
    b = text.split('/')[1]
    print(f'repository: {a} | creator: {b}')

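Splitting an href of the form `/owner/repository` can also be wrapped in a small function that rejects paths with a different shape. This is only a sketch: `split_repo_href` is a hypothetical name and the sample hrefs are invented.

```python
def split_repo_href(href):
    """Split an href shaped like '/owner/repository' into (owner, repository)."""
    parts = href.strip("/").split("/")
    if len(parts) != 2:
        raise ValueError(f"Not an owner/repository path: {href!r}")
    owner, repository = parts
    return owner, repository


# Hypothetical hrefs; real ones come from the <a> tags inside each <h1>.
for href in ["/example-user/example-repo", "/another-org/tool"]:
    owner, repository = split_repo_href(href)
    print(f"repository: {repository} | creator: {owner}")
```

Raising on unexpected paths makes it easier to notice when an `<h1>` on the page links to something other than a repository.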
#### Display all the image links from the Walt Disney Wikipedia page.

In [None]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/Walt_Disney'

In [None]:
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')
print(soup)

In [None]:
#SUPPOSE WE ONLY WANT THE IMAGES/PHOTOGRAPHS, not the tiny images that appear at the bottom, like country flags, stars, etc.

#all images are inside a div tag with a class called "thumbinner", with the exception of the first ones
images_thumbinner = soup.find_all('div',attrs={'class':"thumbinner"})

#image links are stored in the 'src' attribute of each img tag
images = ['https:' + thumbinner.find_all('img')[0]['src'] for thumbinner in images_thumbinner]

#However, there are some other images inside a table tag: <table class="infobox biography vcard"...
#those are the images of Walt Disney's portrait and his signature
html_chuncks = soup.find_all('table')
table_images = [html_chunck.find_all('a',attrs={'class':'image'})[0] for html_chunck in html_chuncks
                                    if (len(html_chunck.find_all('a',attrs={'class':'image'})) != 0)]

#and now we extract the 'src' again:
images_2 = ['https:' + html_chunck.find_all('img')[0]['src'] for html_chunck in table_images]

#now we join everything:
images = images_2+images
print('important image links are:\n')
print(images)

#### Retrieve an arbitrary Wikipedia page of "Python" and create a list of links on that page.

In [None]:
# This is the url you will scrape in this exercise
url ='https://en.wikipedia.org/wiki/Python' 

In [None]:
# your code here
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')

In [None]:
#we will need a helper function
def try_href(html_chunk):
    '''Return the href attribute of an html chunk, or None if it has none.'''
    return html_chunk.get('href')

In [None]:
#links are in anchor ('a') tags, so we get all of them and try to read each href
html_chunks = soup.find_all('a')

#we first create a list of links
crude_links = []
for html_chunk in html_chunks:
    crude_links.append(try_href(html_chunk))

#we now remove the None values
links = [link for link in crude_links if isinstance(link, str)]

#and now we prepend "https://en.wikipedia.org" to the links that start with '/'
links = ['https://en.wikipedia.org'+link if link.startswith('/') else link for link in links]

#and now we drop everything that does not start with 'https://'
links = [link for link in links if link.startswith('https://')]

#now we print the links:
print('the found links are:\n')
links

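As an alternative to prefixing strings by hand, `urllib.parse.urljoin` from the standard library resolves relative hrefs against the page URL, including protocol-relative ones that start with `//`. The sketch below uses made-up hrefs and a hypothetical helper name. Unlike the cell above, it resolves fragment-only links such as `#mw-head` rather than dropping them.

```python
from urllib.parse import urljoin, urlparse

BASE_URL = "https://en.wikipedia.org/wiki/Python"


def absolute_https_links(hrefs, base_url=BASE_URL):
    """Resolve hrefs against base_url and keep only the https ones, in order."""
    links = []
    for href in hrefs:
        if not href:  # skip None and empty strings
            continue
        absolute = urljoin(base_url, href)
        if urlparse(absolute).scheme == "https":
            links.append(absolute)
    return links


# Made-up hrefs covering the common cases found in anchor tags.
sample_hrefs = ["/wiki/Pythonidae", "#mw-head", None,
                "//en.wiktionary.org/wiki/Python", "mailto:someone@example.org"]
print(absolute_https_links(sample_hrefs))
```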
#### Find the number of titles that have changed in the United States Code since its last release point.

In [None]:
# This is the url you will scrape in this exercise
url = 'http://uscode.house.gov/download/download.shtml'

In [None]:
# your code here
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')

In [None]:
#according to the page, "Titles in bold have been changed since the last release point."
#also, according to the html, the changed titles have the class usctitlechanged (<div class="usctitlechanged"...)

html_chunks = soup.find_all('div',attrs={'class':'usctitlechanged'})

#from those chunks, we need to:
#   1 - take the text
#   2 - strip "\n"s and spaces
#   3 - remove ' ٭' at the end of the title, if present

titles = [chunk.text.strip() for chunk in html_chunks]
titles = [title.split(' ٭')[0] if title.endswith(' ٭') else title for title in titles]

print(f'{len(titles)} titles have changed:\n')
titles

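The three clean-up steps listed in the comments (take the text, strip whitespace, drop the trailing marker) can be folded into one helper, with the count taken from the length of the result. This is a sketch with invented title strings; `clean_title` is a hypothetical name.

```python
def clean_title(text, marker="٭"):
    """Strip surrounding whitespace and a trailing marker character, if any."""
    return text.strip().rstrip(marker).rstrip()


# Made-up element texts; real ones come from the scraped <div> elements.
raw_titles = ["\n  Title 1 - Example Heading ٭\n", "\n  Title 2 - Another Heading\n"]
titles = [clean_title(text) for text in raw_titles]
print(len(titles), titles)
```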
#### Create a Python list with the names of the FBI's top ten Most Wanted.

In [None]:
# This is the url you will scrape in this exercise
url = 'https://www.fbi.gov/wanted/topten'

In [None]:
# your code here
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')

In [None]:
#the names of the wanted people are stored in the html in h3 tags with class title (<h3 class="title">)

#after getting them, we extract the text and clean it with strip:

html_chunks = soup.find_all('h3',attrs={'class':'title'})

names = [chunk.text.strip() for chunk in html_chunks]

print('the most wanted fellows are:\n')
names


#### Display info on the 20 latest earthquakes (date, time, latitude, longitude and region name) from the EMSC as a pandas dataframe.

In [None]:
# This is the url you will scrape in this exercise
url = 'https://www.emsc-csem.org/Earthquake/'

In [None]:
# your code here
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')


#We first find all tables and select the fourth one, since it is the one in the html that has our info!
table = soup.find_all('table')[3]

# The head of the table (column names) and the rows (table body) are located in the tags
# <thead> and <tbody>, respectively.
head = table.find_all('thead')[0]
body = table.find_all('tbody')[0]

#now we look at the header of the table... The column names are located in <th class="th2"...
#we will store the text from those chunks as columns
headers_chunks = head.find_all('th',attrs={'class':'th2'})
columns = [headers_chunk.text for headers_chunk in headers_chunks]

#now we look at the body of the table: all the info is in <td> tags
data = [chunk.text for chunk in body.find_all('td')]

In [None]:
#now we put it into a dataframe: each row has 13 elements
ncols = 13
nrows = len(data) // ncols

df = pd.DataFrame(np.array(data).reshape((nrows, ncols)))
#From the df, we can see that columns 0, 1, 2, 9 and 12 can be dropped

df = df.iloc[:,[3,4,5,6,7,8,10,11]]

#we now adapt our pandas dataframe with new column names, based on the columns obtained previously:
columns = ['Date & Time UTC', 'Lat deg', 'Lat', 'Long deg', 'Long','Depth km','Mag','Region name']
df.columns = columns

#and now we clean the column 'Date & Time UTC'
df.iloc[:,0] = df.iloc[:,0].apply(lambda x: x.split('.')[0])
df.iloc[:,0] = df.iloc[:,0].apply(lambda x: x.split('earthquake')[1])

In [None]:
#Our pandas dataframe is ready!!!
df

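Turning the flat list of cell texts into rows does not have to go through `np.array(...).reshape(...)`. A plain-Python sketch of the same grouping is shown below; `chunk_rows` is a hypothetical helper and the cell values are invented. It also fails loudly when the number of cells is not a multiple of the column count, which usually means the page layout has changed.

```python
def chunk_rows(cells, ncols):
    """Group a flat list of cell values into rows of ncols items each."""
    if len(cells) % ncols != 0:
        raise ValueError(f"{len(cells)} cells do not fill rows of {ncols} columns")
    return [cells[i:i + ncols] for i in range(0, len(cells), ncols)]


# Made-up cell values standing in for the text of the <td> elements.
flat = ["2020-01-01", "35.1", "N", "2020-01-02", "12.4", "S"]
rows = chunk_rows(flat, ncols=3)
print(rows)
```

The resulting list of rows can be passed to `pd.DataFrame(rows, columns=...)`.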
#### Count the number of tweets by a given Twitter account.
Ask the user for the handle (@handle) of a Twitter account. You will need to include a ***try/except block*** for account names not found.
<br>***Hint:*** the program should count the number of tweets for any provided account.

In [None]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/'

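The try/except requirement can be sketched without touching the network. Everything below is illustrative: `FAKE_ACCOUNTS`, `fetch_profile` and `AccountNotFound` are hypothetical stand-ins for whatever request and parsing code ends up retrieving the profile.

```python
class AccountNotFound(Exception):
    """Raised when a handle does not correspond to a known account."""


# Hypothetical stand-in for data a real request would return.
FAKE_ACCOUNTS = {"example_user": {"tweets": 1234, "followers": 56}}


def fetch_profile(handle):
    """Look up a profile; a real version would request and parse a page."""
    try:
        return FAKE_ACCOUNTS[handle]
    except KeyError:
        raise AccountNotFound(handle) from None


def count_tweets(raw_handle):
    """Return the tweet count for a handle, or None if it is not found."""
    handle = raw_handle.strip().lstrip("@")
    try:
        return fetch_profile(handle)["tweets"]
    except AccountNotFound:
        print(f"Account @{handle} was not found.")
        return None


print(count_tweets("@example_user"))
print(count_tweets("@no_such_account"))
```

In the notebook, the argument would come from `input()`, and the same pattern applies to the follower count.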
#### Number of followers of a given Twitter account
Ask the user for the handle (@handle) of a Twitter account. You will need to include a ***try/except block*** for account names not found.
<br>***Hint:*** the program should count the followers for any provided account.

In [None]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/'

#### List all language names and number of related articles in the order they appear in wikipedia.org.

In [None]:
# This is the url you will scrape in this exercise
url = 'https://www.wikipedia.org/'

In [None]:
# your code here
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')

In [None]:
#just look for <div class="central-featured"> and the anchor (<a>) tags inside it
central = soup.find_all('div',attrs={'class':'central-featured'})[0]
crude_languages = central.find_all('a')

#now separate the language names from the article numbers
lang_name = [lang.find_all('strong')[0].text for lang in crude_languages]
art_numb = [lang.find_all('bdi')[0].text for lang in crude_languages]

#now generate the pandas dataframe; rows stay in the order they appear on the page
wiki_langs = pd.DataFrame({'language':lang_name, 'articles number':art_numb})
wiki_langs

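The article counts are scraped as text, so comparing or sorting them numerically needs a conversion first. The sketch below assumes strings shaped like `6 000 000+`, possibly with non-breaking spaces or commas as separators; the samples are made up and `parse_article_count` is a hypothetical helper.

```python
import re


def parse_article_count(text):
    """Convert text such as '6 000 000+' to an int by keeping only the digits."""
    digits = re.sub(r"\D", "", text)
    if not digits:
        raise ValueError(f"No digits found in {text!r}")
    return int(digits)


# Made-up count strings, including a non-breaking space as a separator.
samples = ["6 000 000+", "1\u00a0800\u00a0000+", "900,000+"]
print([parse_article_count(s) for s in samples])
```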
#### Create a list with the different kinds of datasets available on data.gov.uk.

In [None]:
# This is the url you will scrape in this exercise
url = 'https://data.gov.uk/'

In [None]:
# your code here
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')

In [None]:
#all the info needed is in <div class="grid-row dgu-topics">
mini_soup = soup.find_all('div',attrs={'class':"grid-row dgu-topics"})[0]

#datasets are within <h2> tags
datasets = [element.text for element in mini_soup.find_all('h2')]

#descriptions are within <p> tags
dsets_desc = [element.text for element in mini_soup.find_all('p')]

In [None]:
#finally create our pandas dataframe to summarize
datasetes = pd.DataFrame({'Dataset':datasets,'Description':dsets_desc})
datasetes

#### Display the top 10 languages by number of native speakers stored in a pandas dataframe.

In [None]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [None]:
# your code here
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')

In [None]:
#first we find all tables and select the table with the numbers (which is the first one)
table = soup.find_all('table')[0]

#we separate the body
body = table.find_all('tbody')[0]

#we extract the columns
columns = [element.text.strip() for element in body.find_all('th')]

#now we collect all row values
data = [element.text.strip() for element in body.find_all('td')]

#now we put it into a dataframe: each row has 5 elements.
nrows = int(len(data)/5)
ncols = 5

df = pd.DataFrame(np.array(data).reshape((nrows, ncols)),columns = columns)

print('showing the top 10 languages')
df.head(10)

## Bonus
#### Scrape a certain number of tweets of a given Twitter account.

In [None]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/'

In [None]:
# your code here

#### Display IMDB's top 250 data (movie name, initial release, director name and stars) as a pandas dataframe.

In [None]:
# This is the url you will scrape in this exercise 
url = 'https://www.imdb.com/chart/top'

In [None]:
# your code here
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')

In [None]:
#first we find all tables and select the table with the numbers (which is the first one)
table = soup.find_all('table')[0]

#we separate the head
head = table.find_all('thead')[0]

#we separate the body
body = table.find_all('tbody')[0]

In [None]:
#we extract the columns
columns = [element.text.strip() for element in head.find_all('th')]

#now we collect all row values
data = [element.text.strip() for element in body.find_all('td')]

#now we put it into a dataframe: each row has 5 elements.
nrows = int(len(data)/5)
ncols = 5

df = pd.DataFrame(np.array(data).reshape((nrows, ncols)),columns = columns)

#filtering the important columns
df = df.iloc[:,[1,2]]

#creating a year column
df['Year'] = df.iloc[:,0].apply(lambda x: x.split('\n')[-1])

#cleaning the name of the title
df.iloc[:,0] = df.iloc[:,0].apply(lambda x: x.split('\n')[1])

#rename columns
df.columns =['Title', 'IMDb Rating', 'Year']

In [None]:
#getting director, etc...
#In body.find_all('a'), only the odd-indexed elements hold this information

#the 'title' attribute gives information about the director and two actors
info_minisoup = body.find_all('a')

#getting only directors
directors = [info_minisoup[i]['title'].split(' (dir.)')[0] for i in range(len(info_minisoup)) if (i % 2 == 1)]

#updating the dataframe
df['director'] = directors

#reordering the dataframe
imdb_df = df.iloc[:,[0,2,3,1]]
imdb_df

#### Display the movie name, year and a brief summary of the top 10 random movies (IMDB) as a pandas dataframe.

In [814]:
#This is the url you will scrape in this exercise
url = 'http://www.imdb.com/chart/top'
import numpy as np

In [None]:
# your code here
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')

In [None]:
#first we find all tables and select the table with the numbers (which is the first one)
table = soup.find_all('table')[0]

#we separate the head
head = table.find_all('thead')[0]

#we separate the body
body = table.find_all('tbody')[0]

In [815]:
#we select the movies:
movies = body.find_all('td',attrs={'class':'titleColumn'})

#and the indices of 10 random movies, sampled without repetition:
random_idx = np.random.choice(len(movies), size=10, replace=False)
random_idx

In [818]:
np.random.randint(10,)

3

In [None]:
#getting the links to all the movie pages:
minisoup = soup.find_all('td',attrs={'class':'titleColumn'})
links = ['https://www.imdb.com' + chunck.find_all('a')[0]['href'] for chunck in minisoup]

In [None]:
#testing with the first link
response = requests.get(links[0])
html = response.content
subsoup = BeautifulSoup(html, 'html.parser')

In [None]:
# summary text is in <div class="summary_text">
subsoup.find_all('div',attrs={'class':'summary_text'})[0].text.strip()

In [802]:
#collect all summary texts, but only from the first 10!
sum_text = []

for link in tqdm(links[:10]):
    response = requests.get(link)
    html = response.content
    subsoup = BeautifulSoup(html, 'html.parser')

    sum_text.append(subsoup.find_all('div',attrs={'class':'summary_text'})[0].text.strip())

In [808]:
pd.DataFrame(sum_text,columns=['synopsis'])

| | synopsis |
|---|---|
| 0 | Two imprisoned men bond over a number of years... |
| 1 | The aging patriarch of an organized crime dyna... |
| 2 | The early life and career of Vito Corleone in ... |
| 3 | When the menace known as the Joker wreaks havo... |
| 4 | A jury holdout attempts to prevent a miscarria... |
| 5 | In German-occupied Poland during World War II,... |
| 6 | Gandalf and Aragorn lead the World of Men agai... |
| 7 | The lives of two mob hitmen, a boxer, a gangst... |
| 8 | A bounty hunting scam joins two men in an unea... |
| 9 | A meek Hobbit from the Shire and eight compani... |

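The cells above collect summaries for the first ten links. To pick ten movies at random instead, one option is sampling without replacement from the link list. The sketch below uses the standard library's `random` module with a hypothetical link list and helper name; the optional seed only makes the choice repeatable.

```python
import random


def pick_random(items, k=10, seed=None):
    """Return k distinct items chosen at random; seed makes it repeatable."""
    rng = random.Random(seed)
    return rng.sample(items, k)


# Hypothetical link list standing in for the 250 movie links.
sample_links = [f"https://www.imdb.com/title/example{i}/" for i in range(250)]
chosen = pick_random(sample_links, k=10, seed=0)
print(len(chosen), len(set(chosen)))
```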

#### Find the live weather report (temperature, wind speed, description and weather) of a given city.

In [None]:
#https://openweathermap.org/current
city = input('Enter the city: ')
url = 'http://api.openweathermap.org/data/2.5/weather?'+'q='+city+'&APPID=b35975e18dc93725acb092f7272cc6b8&units=metric'

In [None]:
# your code here

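A possible starting point is sketched below. It builds the URL with `urllib.parse.urlencode` so that city names with spaces or accents are encoded correctly, then picks the requested fields out of the decoded JSON. The field names reflect my understanding of the current-weather response and should be checked against the linked documentation. `YOUR_API_KEY` is a placeholder and the sample payload is hand-written.

```python
from urllib.parse import urlencode

BASE_URL = "http://api.openweathermap.org/data/2.5/weather"


def build_weather_url(city, api_key, units="metric"):
    """Build the request URL, percent-encoding the query parameters."""
    query = urlencode({"q": city, "APPID": api_key, "units": units})
    return f"{BASE_URL}?{query}"


def summarize_weather(payload):
    """Pick the requested fields out of a decoded JSON response."""
    return {
        "temperature": payload["main"]["temp"],
        "wind_speed": payload["wind"]["speed"],
        "weather": payload["weather"][0]["main"],
        "description": payload["weather"][0]["description"],
    }


# "YOUR_API_KEY" is a placeholder; the sample payload below is hand-written.
print(build_weather_url("São Paulo", "YOUR_API_KEY"))
sample = {"main": {"temp": 21.5}, "wind": {"speed": 3.6},
          "weather": [{"main": "Clouds", "description": "scattered clouds"}]}
print(summarize_weather(sample))
```

With a real key, `requests.get(build_weather_url(city, key)).json()` would supply the payload.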
#### Find the book name, price and stock availability as a pandas dataframe.

In [None]:
# This is the url you will scrape in this exercise. 
# It is a fictional bookstore created to be scraped. 
url = 'http://books.toscrape.com/'

In [None]:
# your code here