# Scraping Exercises - Part 1 - 2

You just had your first scraping experience, using Selectors to parse Capgemini's webpages.

As mentioned earlier, scraping has two parts:
- Parsing: extracting information from a webpage
- Crawling: navigating from page to page

Here we are going to practice parsing again, but on a much more `scrapable` page. You'll soon understand why ;) !

## 1. Make necessary imports

In [1]:
import requests
from scrapy.selector import Selector

## 2. Get some information from École Polytechnique's website

This second exercise is closer to a real scraping approach. In the next part of the course, you will learn how to go from page to page.
To prepare for that, we ask you to extract some information from [this page](https://www.polytechnique.edu/fr/actualités).

In [2]:
# Fetch the Polytechnique news page and build a selector from it
url = 'https://www.polytechnique.edu/fr/actualités'
html = requests.get(url).content
sel = Selector(text=html)

__<font color='Blue'>
Exercise 1: Using selectors, extract the dates of all the articles on the page
</font>__

In [3]:
# Extract the dates of articles
xpath = '//div[@class="field-item even"]'
dates = sel.xpath(xpath).xpath('text()').extract()

# Displaying part of the response
dates[:5]

['30 Janvier 2020',
 '28 Janvier 2020',
 '28 Janvier 2020',
 '27 Janvier 2020',
 '27 Janvier 2020']
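
The dates come back as French strings, which do not sort or compare chronologically. A minimal sketch of one way to turn them into `datetime.date` objects is shown here. The names `FRENCH_MONTHS` and `parse_french_date` are hypothetical helpers, not part of the exercise, and the sketch assumes every date follows the `day Month year` pattern seen in the output above.

```python
from datetime import date

# French month names mapped to month numbers (lowercase for case-insensitive lookup)
FRENCH_MONTHS = {
    "janvier": 1, "février": 2, "mars": 3, "avril": 4,
    "mai": 5, "juin": 6, "juillet": 7, "août": 8,
    "septembre": 9, "octobre": 10, "novembre": 11, "décembre": 12,
}


def parse_french_date(raw):
    """Turn a string such as '30 Janvier 2020' into a datetime.date.

    Assumption: the string is exactly 'day Month year' with a numeric day.
    """
    day, month_name, year = raw.split()
    return date(int(year), FRENCH_MONTHS[month_name.lower()], int(day))


# Sample values copied from the outputs shown in this notebook
sample_dates = ["30 Janvier 2020", "28 Janvier 2020", "19 Décembre 2019"]
parsed_dates = [parse_french_date(d) for d in sample_dates]
print([d.isoformat() for d in parsed_dates])
```

A dictionary lookup is used instead of `datetime.strptime` with `%B` because the latter depends on the locale installed on the machine.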

__<font color='Blue'>
Exercise 2: Using selectors, extract the links of all the articles on the page
</font>__

In [4]:
# Extract the URLs of the articles
xpath = '//div[@class="field-item even"]//h3'
liens = sel.xpath(xpath).css('::attr(href)').extract()

# Displaying part of the response
liens[:5]

['/fr/content/lx-la-plus-internationale-des-universites-francaises-selon-le-classement',
 '/fr/content/lecole-polytechnique-fixe-ses-priorites-pour-lannee-2020',
 '/fr/content/lamplification-laser-35-ans-apres-des-potentialites-encore-inexploitees',
 '/fr/content/lx-remporte-pour-la-premiere-fois-le-concours-de-programmation-swerc',
 '/fr/content/colloque-la-multidisciplinarite-pour-mieux-comprendre-les-reseaux-0']

__<font color='Blue'>
Exercise 3: Using selectors, extract the link to the following page of articles
</font>__

In [5]:
# Extracting the link of the next page
css_locator = 'li.pager-next ::attr(href)'
next_page = sel.css(css_locator).extract_first()

# Display the response
print(next_page)

/fr/actualit%C3%A9s?page=1
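
The extracted link is relative and percent-encoded, so it cannot be requested as-is. The sketch below shows one way to resolve it against the listing URL using only the standard library. `resolve_next_page` is a hypothetical helper. It also covers the case where `extract_first()` returns `None`, which is what happens when no pager link matches (for example on the last page).

```python
from urllib.parse import unquote, urljoin

LISTING_URL = "https://www.polytechnique.edu/fr/actualités"


def resolve_next_page(current_url, next_href):
    """Return the absolute URL of the next page, or None when there is no next link."""
    if next_href is None:
        return None
    return urljoin(current_url, next_href)


next_url = resolve_next_page(LISTING_URL, "/fr/actualit%C3%A9s?page=1")
print(next_url)           # absolute URL, still percent-encoded
print(unquote(next_url))  # human-readable form, with the accent decoded
print(resolve_next_page(LISTING_URL, None))
```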


## 3. Doing the same article by article

The purpose of this part is to get closer to a real scraping approach.
What you are going to do now is:
- Get a list of HTML objects
- Parse them one by one

__<font color='Blue'>
Exercise 4: Using selectors, get a list of the articles on the page
</font>__

In [6]:
# Extract a list of selectors, one per article
xpath = '//li[contains(@class,"views-row")]'
articles = sel.xpath(xpath)[:-1]

# Keep a single article to work on (index 1, i.e. the second one)
article = articles[1]
article 

<Selector xpath='//li[contains(@class,"views-row")]' data='<li class="views-row views-row-2 view...'>

__<font color='Blue'>
Exercise 5: For a given article, extract the URL of its dedicated webpage
</font>__

In [7]:
# Get the url of the article
css_locator = 'div.field-items ::attr(href)'
article.css(css_locator).extract_first()

'/fr/content/lecole-polytechnique-fixe-ses-priorites-pour-lannee-2020'

__<font color='Blue'>
Exercise 6: For a given article, extract the title and date
</font>__

_Hint: you can do it with a single selector_

In [8]:
# Get title and date information 
css_locator = 'div.field-items ::text'
rep = article.css(css_locator).extract()
titre = rep[0]
date = rep[1]

# Displaying results
print('Date :\t', date)
print('Titre :\t', titre)

Date :	 28 Janvier 2020
Titre :	 L’École polytechnique fixe ses priorités pour l’année 2020 


__<font color='Blue'>
Exercise 7: For a given article, extract the short description related to it
</font>__

In [9]:
# Get short description
css_locator = 'p ::text'
content = article.css(css_locator).extract_first()

# Displaying results
print(content)

Les priorités de l’X pour 2020 s’inscrivent dans sa stratégie de s’affirmer comme une institution d’enseignement et de recherche scientifique et technologique de rang mondial dans le cadre de l’Institut Polytechnique de Paris. L'École se donne les moyens de ses ambitions, en préservant sa singularité, liée à son ancrage militaire historique.


__<font color='Blue'>
Exercise 8: For a given article, extract the #hashtags mentioning related topics
</font>__


In [10]:
# Get mentions
css_locator = 'strong ::text'
subjects = article.css(css_locator).extract()

# Clean mentions
subjects = [subject for subject in subjects if subject not in ['#', ', ']]

# Display mentions
subjects

['Institutionnel', 'Campus']
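
The cleaning step above removes two exact strings, which stops working if the page adds a space or a line break around a separator. A slightly more tolerant sketch strips separator characters from every text node and drops whatever ends up empty. The `raw_subjects` list is a hypothetical raw extraction, inferred from the filter used above, and `clean_subjects` is an illustrative helper.

```python
# Hypothetical raw text nodes, as they might come out of the 'strong ::text' selector
raw_subjects = ["#", "Institutionnel", ", ", "#", "Campus"]


def clean_subjects(raw):
    """Strip '#', commas and whitespace from each text node, then drop empty strings."""
    stripped = (text.strip(" ,#\n\t") for text in raw)
    return [text for text in stripped if text]


subjects = clean_subjects(raw_subjects)
print(subjects)
```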

In [11]:
# Get the links associated with the mentions
css_locator = 'strong ::attr(href)'
links = article.css(css_locator).extract()

# Displaying mentions links
print(links)

['/fr/type/institutionnel', '/fr/type/campus']


## 4. (Bonus) Looking at an article page


(This is quite similar, but it lets us get the full content of the article.)

In [12]:
# Getting the selector
url = 'https://www.polytechnique.edu/fr/content/marie-paule-cani-professeure-lx-elue-lacademie-des-sciences'
html = requests.get(url).content
sel = Selector(text=html)

In [13]:
# Extracting title
xpath = '//div[@class="field-item even"]//h1//text()'
title = sel.xpath(xpath).extract_first()
print(title)

Marie-Paule Cani, professeure à l’X, élue à l’Académie des sciences 


In [14]:
# Extracting date
xpath = '//div[@class="content-date"]//text()'
date = sel.xpath(xpath).extract_first()
print(date)

19 Décembre 2019


In [15]:
# Extracting all content
xpath = '//div[@class="field-item even"]//p//text()'
content = sel.xpath(xpath).extract()[:-7]

# Display information
print(len(content), 'paragraphs')
print('\nFirst one : \n', content[1])

6 paragraphs

First one : 
 Ancienne élève de l’Ecole normale supérieure, agrégée de mathématiques, Marie-Paule Cani se passionne dès 1987 pour les images de synthèse et l’animation es mondes virtuels et soutient une thèse en informatique graphique en 1990.


## Congratulations !

__Now that you have all the selectors, you can think about a spider that uses them__
- As there are many ways to write a selector, yours might differ from ours.
- You can try swapping yours in and re-running the spider ;) !