## Project Brief

You are a data analyst volunteering for the city's media library, which has set up a website to sell some of the books in its stock. The library wants to analyze the characteristics of its customers to better understand the trends in their book reservations and purchases.

The media library gives you access to its site, http://books.toscrape.com/index.html, and asks you to collect and analyze its data.


**Project steps**:

1. **Data collection**: 
* Use the `requests` library to send HTTP requests to the website that lists the books and retrieve the HTML content of the page.

* Using the `Beautiful Soup` library, parse the site's HTML and extract the relevant information: walk through the HTML, identify the target tags (those that contain the book data, such as `<div>` or `<li>`) and extract fields such as the book title, category, average review rating, number of copies in stock, price, etc.

2. **Data cleaning and preparation**: clean the values, convert data types where necessary, handle missing values, etc.

3. **Data storage**: 
* Propose a suitable SQL database model. 
* Create the database schema and the tables that will store the cleaned book data.

4. **Data analysis**: 

* Carry out an exploratory data analysis: identify relevant KPIs, create charts, compute descriptive statistics, identify trends, etc., to help the media library with its customer study. 
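
As a starting point for the storage step, here is a minimal sketch of one possible schema, using Python's built-in `sqlite3` module and an in-memory database. The table names (`categories`, `books`), the column names and the sample rows are assumptions made for illustration; the brief does not prescribe them.

```python
import sqlite3

# Hypothetical schema: one table for categories, one for books.
SCHEMA = """
CREATE TABLE categories (
    category_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL UNIQUE
);
CREATE TABLE books (
    book_id      INTEGER PRIMARY KEY,
    title        TEXT NOT NULL,
    category_id  INTEGER REFERENCES categories(category_id),
    rating       INTEGER CHECK (rating BETWEEN 1 AND 5),
    price        REAL,
    availability TEXT
);
"""

conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
conn.executescript(SCHEMA)

# Insert one illustrative row per table.
conn.execute("INSERT INTO categories (name) VALUES (?)", ("Travel",))
conn.execute(
    "INSERT INTO books (title, category_id, rating, price, availability) "
    "VALUES (?, ?, ?, ?, ?)",
    ("Example title", 1, 3, 51.77, "In stock"),
)

# Join the two tables to read the book back with its category name.
rows = conn.execute(
    "SELECT b.title, c.name, b.rating, b.price "
    "FROM books AS b JOIN categories AS c USING (category_id)"
).fetchall()
print(rows)
```

Keeping categories in their own table avoids repeating the category name on every book row. Storing the rating as an integer makes averages straightforward to compute in SQL.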



### Import the libraries

In [1]:
import requests
from bs4 import BeautifulSoup

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Data collection

#### Retrieve the books from the first page

In [2]:
# URL of the main page
url = "http://books.toscrape.com/"

# Fetch the page
response = requests.get(url)

# What is the response object?
response

<Response [200]>

The object is the server's HTTP response to the client's `requests.get(url)` call.

In [3]:
# What are its attributes?
response.__attrs__

['_content',
 'status_code',
 'headers',
 'url',
 'history',
 'encoding',
 'reason',
 'cookies',
 'elapsed',
 'request']

The response is an object that holds all of these attributes.

In [4]:
# Example
print(response.status_code)
print()
print(response.headers)
print()
print(response.content)

200

{'Date': 'Tue, 05 Aug 2025 08:36:45 GMT', 'Content-Type': 'text/html', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Last-Modified': 'Wed, 08 Feb 2023 21:02:32 GMT', 'ETag': 'W/"63e40de8-c85e"', 'Content-Encoding': 'br'}



In [6]:
# Use the Beautiful Soup library to read and parse HTML documents
soup = BeautifulSoup(response.content, 'html.parser')
soup

<!DOCTYPE html>

<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en-us"> <!--<![endif]-->
<head>
<title>
    All products | Books to Scrape - Sandbox
</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta content="24th Jun 2016 09:29" name="created"/>
<meta content="" name="description"/>
<meta content="width=device-width" name="viewport"/>
<meta content="NOARCHIVE,NOCACHE" name="robots"/>
<!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
<!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
<link href="static/oscar/favicon.ico" rel="shortcut icon"/>
<link href="static/oscar/css/styles.css" rel="stylesheet" type="text/css"/>
<link href="s

In [10]:
# How do we access the elements of the page?
temp = soup.find_all('h3')
print(type(temp))
print(len(temp))
print()
print(temp[0])
print(temp[1])
print(temp[2])
print(temp[3])

print(temp[-1])

<class 'bs4.element.ResultSet'>
20

<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
<h3><a href="catalogue/tipping-the-velvet_999/index.html" title="Tipping the Velvet">Tipping the Velvet</a></h3>
<h3><a href="catalogue/soumission_998/index.html" title="Soumission">Soumission</a></h3>
<h3><a href="catalogue/sharp-objects_997/index.html" title="Sharp Objects">Sharp Objects</a></h3>
<h3><a href="catalogue/its-only-the-himalayas_981/index.html" title="It's Only the Himalayas">It's Only the Himalayas</a></h3>


In [11]:
# How do we access the attributes of a tag?
temp_elt = temp[0]
print(temp_elt.a.attrs['href'])
print(temp_elt.a.get('href'))

catalogue/a-light-in-the-attic_1000/index.html
catalogue/a-light-in-the-attic_1000/index.html
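
The two access styles above differ when an attribute is missing, and the link text is not the same thing as the `title` attribute. The sketch below illustrates both points on one `<h3>` element copied from the earlier output, so it runs without a network call. The `data-foo` attribute name is hypothetical and chosen only because it is absent.

```python
from bs4 import BeautifulSoup

# One <h3> element copied from the output above.
html = (
    '<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" '
    'title="A Light in the Attic">A Light in the ...</a></h3>'
)
h3 = BeautifulSoup(html, "html.parser").h3

link_text = h3.a.get_text()     # visible text, truncated by the site
full_title = h3.a.get("title")  # full title stored in the attribute
missing = h3.a.get("data-foo")  # .get() returns None for an absent attribute

print(link_text)
print(full_title)
print(missing)
```

With `h3.a.attrs['data-foo']` the same lookup would raise a `KeyError`, so `.get()` is the more forgiving choice when an attribute may be absent.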


#### Retrieve the list of categories

In [12]:
url = "http://books.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

In [13]:
# Build a list with the href links of all the categories
categories_urls = []
categorie_list = soup.find('ul', {'class': 'nav-list'})
categories = categorie_list.find_all('a')
for elt in categories:
    categories_urls.append('http://books.toscrape.com/' + elt.get('href'))

categories_urls

['http://books.toscrape.com/catalogue/category/books_1/index.html',
 'http://books.toscrape.com/catalogue/category/books/travel_2/index.html',
 'http://books.toscrape.com/catalogue/category/books/mystery_3/index.html',
 'http://books.toscrape.com/catalogue/category/books/historical-fiction_4/index.html',
 'http://books.toscrape.com/catalogue/category/books/sequential-art_5/index.html',
 'http://books.toscrape.com/catalogue/category/books/classics_6/index.html',
 'http://books.toscrape.com/catalogue/category/books/philosophy_7/index.html',
 'http://books.toscrape.com/catalogue/category/books/romance_8/index.html',
 'http://books.toscrape.com/catalogue/category/books/womens-fiction_9/index.html',
 'http://books.toscrape.com/catalogue/category/books/fiction_10/index.html',
 'http://books.toscrape.com/catalogue/category/books/childrens_11/index.html',
 'http://books.toscrape.com/catalogue/category/books/religion_12/index.html',
 'http://books.toscrape.com/catalogue/category/books/nonfictio
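
String concatenation works here because every category href is relative to the site root. As a hedged alternative, the standard library's `urllib.parse.urljoin` resolves a relative href against the page it was found on, which also covers hrefs containing `../`. In this sketch the second href (`../../../some-book_1/index.html`) is a made-up example, not a link taken from the site.

```python
from urllib.parse import urljoin

base_url = "http://books.toscrape.com/"

# A relative href as found in the category menu.
href = "catalogue/category/books/travel_2/index.html"
category_url = urljoin(base_url, href)

# Relative hrefs are resolved against the page they appear on,
# so '../' segments are handled as well (hypothetical href below).
page_url = "http://books.toscrape.com/catalogue/category/books/travel_2/index.html"
book_url = urljoin(page_url, "../../../some-book_1/index.html")

print(category_url)
print(book_url)
```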

#### Retrieve the books from all the pages

In [14]:

# Function that extracts the book information from a given page
def extract_book_info(page_url):
    books = []
    response = requests.get(page_url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        # Each book sits in an <article class="product_pod"> element
        livres = soup.find_all('article', {'class': 'product_pod'})
        for livre in livres:
            # The title attribute holds the full title (the link text is truncated)
            titre = livre.h3.a.get('title')
            # The rating word is the second CSS class of the star-rating paragraph
            rating = livre.find('p', {'class': 'star-rating'}).get('class')[1]
            # [2:] drops the currency prefix, which is two characters long in response.text
            prix = livre.select('div p.price_color')[0].text[2:]
            disponibilite = livre.select('div p.availability')[0].text.strip()
            books.append({'title': titre, 'rating': rating, 'price': prix, 'availability': disponibilite})
    # Built outside the if block so that a failed request returns an empty
    # DataFrame instead of raising UnboundLocalError
    return pd.DataFrame(books)
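
The `[2:]` slice in `extract_book_info` depends on how the page was decoded. The response headers printed earlier list `Content-Type: text/html` without a charset. In that situation `requests` falls back to ISO-8859-1 when building `response.text`, which would explain why the pound sign takes two characters. The sketch below reproduces the effect offline; `price_to_float` is a hypothetical helper, not part of the notebook.

```python
# The pound sign is two bytes in UTF-8.
raw = "£51.77".encode("utf-8")

as_latin1 = raw.decode("iso-8859-1")  # wrong codec: two characters before the digits
as_utf8 = raw.decode("utf-8")         # right codec: a single pound sign

print(repr(as_latin1))
print(repr(as_utf8))

# Slicing depends on the codec; stripping the non-numeric prefix does not.
def price_to_float(text):
    return float(text.lstrip("Â£"))

print(price_to_float(as_latin1), price_to_float(as_utf8))
```

Passing `response.content` to `BeautifulSoup`, as the earlier cells do, should let the parser pick up the UTF-8 charset declared in the page's `<meta>` tag. In that case the price prefix would be a single character and a fixed `[2:]` slice would cut into the number, which is why a prefix strip is the safer choice.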

In [20]:
p=extract_book_info('http://books.toscrape.com/catalogue/page-2.html')
p

| | title | rating | price | availability |
|---|---|---|---|---|
| 0 | In Her Wake | One | 12.84 | In stock |
| 1 | How Music Works | Two | 37.32 | In stock |
| 2 | Foolproof Preserving: A Guide to Small Batch J... | Three | 30.52 | In stock |
| 3 | Chase Me (Paris Nights #2) | Five | 25.27 | In stock |
| 4 | Black Dust | Five | 34.53 | In stock |
| 5 | Birdsong: A Story in Pictures | Three | 54.64 | In stock |
| 6 | America's Cradle of Quarterbacks: Western Penn... | Three | 22.5 | In stock |
| 7 | Aladdin and His Wonderful Lamp | Three | 53.13 | In stock |
| 8 | Worlds Elsewhere: Journeys Around Shakespeareâ... | Five | 40.3 | In stock |
| 9 | Wall and Piece | Four | 44.18 | In stock |


In [16]:
print(p.shape[0])
p.head()

20


| | title | rating | price | availability |
|---|---|---|---|---|
| 0 | In Her Wake | One | 12.84 | In stock |
| 1 | How Music Works | Two | 37.32 | In stock |
| 2 | Foolproof Preserving: A Guide to Small Batch J... | Three | 30.52 | In stock |
| 3 | Chase Me (Paris Nights #2) | Five | 25.27 | In stock |
| 4 | Black Dust | Five | 34.53 | In stock |


In [21]:
print(p.shape[1])

4


In [24]:
print(p.shape)

(20, 4)


In [25]:
# Get the total number of pages on the site

nb_pages = int(soup.find('li', {'class': 'current'}).get_text().split()[-1])

print(nb_pages)

50
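
The one-liner above takes the last whitespace-separated token of the pager text. A slightly more defensive variant can be sketched with a regular expression, assuming the pager reads like `Page 1 of 50` (an assumption about the page's wording, consistent with the result of 50 above). The function name `parse_page_count` is hypothetical.

```python
import re

def parse_page_count(pager_text):
    """Return the total page count from text such as 'Page 1 of 50'."""
    match = re.search(r"Page\s+(\d+)\s+of\s+(\d+)", pager_text)
    if match is None:
        # Fail loudly rather than convert an unrelated token to int
        raise ValueError(f"unexpected pager text: {pager_text!r}")
    return int(match.group(2))

print(parse_page_count("\n    Page 1 of 50\n"))
```

It could be called as `parse_page_count(soup.find('li', {'class': 'current'}).get_text())`, which raises a clear error if the page layout changes.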


In [35]:
# Loop over all the pages to collect all 1000 books
#nb_pages = 50

all_books_2 = []
for i in range(nb_pages):
    url = f'http://books.toscrape.com/catalogue/page-{i+1}.html'
    print(url)
    df = extract_book_info(url)
    all_books_2.append(df)

all_books_2

http://books.toscrape.com/catalogue/page-1.html
http://books.toscrape.com/catalogue/page-2.html
http://books.toscrape.com/catalogue/page-3.html
http://books.toscrape.com/catalogue/page-4.html
http://books.toscrape.com/catalogue/page-5.html
http://books.toscrape.com/catalogue/page-6.html
http://books.toscrape.com/catalogue/page-7.html
http://books.toscrape.com/catalogue/page-8.html
http://books.toscrape.com/catalogue/page-9.html
http://books.toscrape.com/catalogue/page-10.html
http://books.toscrape.com/catalogue/page-11.html
http://books.toscrape.com/catalogue/page-12.html
http://books.toscrape.com/catalogue/page-13.html
http://books.toscrape.com/catalogue/page-14.html
http://books.toscrape.com/catalogue/page-15.html
http://books.toscrape.com/catalogue/page-16.html
http://books.toscrape.com/catalogue/page-17.html
http://books.toscrape.com/catalogue/page-18.html
http://books.toscrape.com/catalogue/page-19.html
http://books.toscrape.com/catalogue/page-20.html
http://books.toscrape.com/cat

[                                                title rating  price  \
 0                                A Light in the Attic  Three  51.77   
 1                                  Tipping the Velvet    One  53.74   
 2                                          Soumission    One  50.10   
 3                                       Sharp Objects   Four  47.82   
 4               Sapiens: A Brief History of Humankind   Five  54.23   
 5                                     The Requiem Red    One  22.65   
 6   The Dirty Little Secrets of Getting Your Dream...   Four  33.34   
 7   The Coming Woman: A Novel Based on the Life of...  Three  17.93   
 8   The Boys in the Boat: Nine Americans and Their...   Four  22.60   
 9                                     The Black Maria    One  52.15   
 10     Starving Hearts (Triangular Trade Trilogy, #1)    Two  13.99   
 11                              Shakespeare's Sonnets   Four  20.66   
 12                                        Set Me Free   Five  1

In [27]:
print(type(all_books_2))
print(len(all_books_2))
print(all_books_2[0].shape[0])

<class 'list'>
50
20


In [34]:
all_books_2

[                                                title rating  price  \
 0                                A Light in the Attic  Three  51.77   
 1                                  Tipping the Velvet    One  53.74   
 2                                          Soumission    One  50.10   
 3                                       Sharp Objects   Four  47.82   
 4               Sapiens: A Brief History of Humankind   Five  54.23   
 5                                     The Requiem Red    One  22.65   
 6   The Dirty Little Secrets of Getting Your Dream...   Four  33.34   
 7   The Coming Woman: A Novel Based on the Life of...  Three  17.93   
 8   The Boys in the Boat: Nine Americans and Their...   Four  22.60   
 9                                     The Black Maria    One  52.15   
 10     Starving Hearts (Triangular Trade Trilogy, #1)    Two  13.99   
 11                              Shakespeare's Sonnets   Four  20.66   
 12                                        Set Me Free   Five  1

In [None]:
df_books_ = pd.concat(all_books_2)
df_books_.shape

| | title | rating | price | availability |
|---|---|---|---|---|
| 0 | A Light in the Attic | Three | 51.77 | In stock |
| 1 | Tipping the Velvet | One | 53.74 | In stock |
| 2 | Soumission | One | 50.10 | In stock |
| 3 | Sharp Objects | Four | 47.82 | In stock |
| 4 | Sapiens: A Brief History of Humankind | Five | 54.23 | In stock |
| ... | ... | ... | ... | ... |
| 15 | Alice in Wonderland (Alice's Adventures in Won... | One | 55.53 | In stock |
| 16 | Ajin: Demi-Human, Volume 1 (Ajin: Demi-Human #1) | Four | 57.06 | In stock |
| 17 | A Spy's Devotion (The Regency Spies of London #1) | Five | 16.97 | In stock |
| 18 | 1st to Die (Women's Murder Club #1) | One | 53.98 | In stock |


In [29]:
df_books_.head()

| | title | rating | price | availability |
|---|---|---|---|---|
| 0 | A Light in the Attic | Three | 51.77 | In stock |
| 1 | Tipping the Velvet | One | 53.74 | In stock |
| 2 | Soumission | One | 50.1 | In stock |
| 3 | Sharp Objects | Four | 47.82 | In stock |
| 4 | Sapiens: A Brief History of Humankind | Five | 54.23 | In stock |
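
The cleaning step of the brief could start from this frame. The sketch below is one possible approach rather than the notebook's own method. It rebuilds a continuous index (the concatenated frame repeats each page's 0 to 19 index), maps the rating words to integers and converts prices to floats. The two small frames stand in for scraped pages and reuse values from the outputs above; `rating_map` is a hypothetical name.

```python
import pandas as pd

# Two small frames standing in for two scraped pages (values taken from the outputs above).
page_1 = pd.DataFrame({"title": ["A Light in the Attic", "Tipping the Velvet"],
                       "rating": ["Three", "One"],
                       "price": ["51.77", "53.74"],
                       "availability": ["In stock", "In stock"]})
page_2 = pd.DataFrame({"title": ["In Her Wake"], "rating": ["One"],
                       "price": ["12.84"], "availability": ["In stock"]})

# ignore_index=True gives one continuous index instead of a per-page index.
df = pd.concat([page_1, page_2], ignore_index=True)

# Map the rating words to integers; unknown words become NaN.
rating_map = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}
df["rating"] = df["rating"].map(rating_map)

# Convert prices to floats; unparsable values become NaN instead of raising.
df["price"] = pd.to_numeric(df["price"], errors="coerce")

print(df.dtypes)
print(df)
```

Numeric ratings and prices make the later descriptive statistics (means, distributions, correlations) possible without further conversion.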


In [36]:
print('All_books:', type(all_books_2), 'of', len(all_books_2), 'pages, with', all_books_2[0].shape[0], 'books per page')

All_books: <class 'list'> of 50 pages, with 20 books per page


In [None]:
# Show all the books on the last page (there are 20)
all_books_2[-1]

| | title | rating | price | availability |
|---|---|---|---|---|
| 0 | Frankenstein | Two | 38.0 | In stock |
| 1 | Forever Rockers (The Rocker #12) | Three | 28.8 | In stock |
| 2 | Fighting Fate (Fighting #6) | Three | 39.24 | In stock |
| 3 | Emma | Two | 32.93 | In stock |
| 4 | Eat, Pray, Love | Three | 51.32 | In stock |
| 5 | Deep Under (Walker Security #1) | Five | 47.09 | In stock |
| 6 | Choosing Our Religion: The Spiritual Lives of ... | Four | 28.42 | In stock |
| 7 | Charlie and the Chocolate Factory (Charlie Buc... | Three | 22.85 | In stock |
| 8 | Charity's Cross (Charles Towne Belles #4) | One | 41.24 | In stock |
| 9 | Bright Lines | Five | 39.07 | In stock |


In [53]:
# Print the books of every page, one DataFrame per page
nb_pages = len(all_books_2)

for i in range(nb_pages):
    url = f'http://books.toscrape.com/catalogue/page-{i+1}.html'
    print(extract_book_info(url))

                                                title rating  price  \
0                                A Light in the Attic  Three  51.77   
1                                  Tipping the Velvet    One  53.74   
2                                          Soumission    One  50.10   
3                                       Sharp Objects   Four  47.82   
4               Sapiens: A Brief History of Humankind   Five  54.23   
5                                     The Requiem Red    One  22.65   
6   The Dirty Little Secrets of Getting Your Dream...   Four  33.34   
7   The Coming Woman: A Novel Based on the Life of...  Three  17.93   
8   The Boys in the Boat: Nine Americans and Their...   Four  22.60   
9                                     The Black Maria    One  52.15   
10     Starving Hearts (Triangular Trade Trilogy, #1)    Two  13.99   
11                              Shakespeare's Sonnets   Four  20.66   
12                                        Set Me Free   Five  17.46   
13  Sc