<a href="https://colab.research.google.com/github/yunuserbas/Scraping/blob/main/3_LC_Scraping_IMDB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Scraping

Web scraping consists of extracting data from websites and saving it so that it can be analyzed or reused in some other way.

In [1]:
import pandas as pd
import re
import requests
from bs4 import BeautifulSoup

# IMDB TOP 50

The 50 Best Movies Ever Made

In [2]:
response = requests.get("https://www.imdb.com/list/ls055386972/")
response

<Response [200]>

In [19]:
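# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(response.content, "html.parser")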

We would like to gather the title, the runtime, and the release year of each film into a DataFrame.

First, we need to locate the tag and the class that hold the information we want.

Inspecting the page with the cursor shows that the title, the year, and the runtime all sit inside a `div` tag with the class `lister-item-content`.
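
As a minimal sketch of that lookup, the snippet below parses a small hand-written HTML fragment instead of the live page. The fragment is an assumption that imitates the structure described above (a `div` with class `lister-item-content` holding an `h3`, a year span, and a runtime span); the real IMDb markup may differ or change. The names `SAMPLE_HTML`, `sample_soup`, `sample_titles`, and `sample_years` are illustrative only.

```python
from bs4 import BeautifulSoup

# Hand-written fragment imitating the structure described above
# (an assumption, not a copy of the live IMDb page).
SAMPLE_HTML = """
<div class="lister-item-content">
  <h3><a href="#">The Godfather</a>
      <span class="lister-item-year text-muted unbold">(1972)</span></h3>
  <span class="runtime">175 min</span>
</div>
<div class="lister-item-content">
  <h3><a href="#">Taken</a>
      <span class="lister-item-year text-muted unbold">(I) (2008)</span></h3>
  <span class="runtime">90 min</span>
</div>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# One div per film; every lookup below is relative to that div.
blocks = sample_soup.find_all("div", class_="lister-item-content")

# .string returns the text of the tag without its attributes.
sample_titles = [block.find("h3").find("a").string for block in blocks]
sample_years = [block.find(class_="lister-item-year").string for block in blocks]

print(sample_titles)
print(sample_years)
```

Searching inside each `div` rather than across the whole page keeps the title, year, and runtime of one film together.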

# The title

In [4]:
# For each film, go to the h3 tag inside the div with class "lister-item-content".
# The h3 holds a link (an <a> tag with an href attribute); we only want its text, so we read .string.
for movie in soup.find_all("div", class_="lister-item-content"):
  print(movie.find("h3").find("a").string)

The Godfather
Schindler's List
12 Angry Men
Life Is Beautiful
The Good, the Bad and the Ugly
The Shawshank Redemption
The Pursuit of Happyness
Seven Samurai
The Intouchables
Central Station
Requiem for a Dream
A Beautiful Mind
Hachi: A Dog's Tale
Taken
My Sassy Girl
Amores perros
The Shining
Apocalypto
Gladiator
Cast Away
The Dark Knight
The Pianist
Titanic
3-Iron
Braveheart
It's a Wonderful Life
Spring, Summer, Fall, Winter... and Spring
Alien
Memories of Murder
The Return
I Saw the Devil
Children of Heaven
A Separation
The Sixth Sense
A Moment to Remember
Departures
The Road Home
Saving Private Ryan
The Bridge on the River Kwai
Ben-Hur
The Exorcist
The Secret in Their Eyes
Léon: The Professional
The Green Mile
Gran Torino
Kill Bill: Vol. 1
Jurassic Park
Terminator 2: Judgment Day
Back to the Future
Finding Nemo


# The year

The year is held in the element with `class="lister-item-year text-muted unbold"`.

In [5]:
# This retrieves the years, but with the parentheses still attached.
for movie in soup.find_all("div", class_ = "lister-item-content"):
  print(movie.find(class_="lister-item-year text-muted unbold").string)

(1972)
(1993)
(1957)
(1997)
(1966)
(1994)
(2006)
(1954)
(2011)
(1998)
(2000)
(2001)
(2009)
(I) (2008)
(2001)
(2000)
(1980)
(2006)
(2000)
(2000)
(2008)
(2002)
(1997)
(2004)
(1995)
(1946)
(2003)
(1979)
(2003)
(2003)
(2010)
(1997)
(2011)
(1999)
(2004)
(2008)
(1999)
(1998)
(1957)
(1959)
(1973)
(2009)
(1994)
(1999)
(2008)
(2003)
(1993)
(1991)
(1985)
(2003)


In [6]:
for movie in soup.find_all("div", class_ = "lister-item-content"):
  print(movie.find(class_="lister-item-year text-muted unbold").string.strip('(').strip(')').strip("I").strip(')').strip(") ("))

1972
1993
1957
1997
1966
1994
2006
1954
2011
1998
2000
2001
2009
2008
2001
2000
1980
2006
2000
2000
2008
2002
1997
2004
1995
1946
2003
1979
2003
2003
2010
1997
2011
1999
2004
2008
1999
1998
1957
1959
1973
2009
1994
1999
2008
2003
1993
1991
1985
2003


In [7]:
# Chained strips work here, but they are tedious and fragile: any new character would break them. A regular expression is a better fit.
for movie in soup.find_all("div", class_ = "lister-item-content"):
  print(movie.find(class_="lister-item-year text-muted unbold").string.strip('(').strip(")").strip("I").strip(") ("))



1972
1993
1957
1997
1966
1994
2006
1954
2011
1998
2000
2001
2009
2008
2001
2000
1980
2006
2000
2000
2008
2002
1997
2004
1995
1946
2003
1979
2003
2003
2010
1997
2011
1999
2004
2008
1999
1998
1957
1959
1973
2009
1994
1999
2008
2003
1993
1991
1985
2003


In [8]:
# Search each string for a run of 4 digits.
# re.search returns a match object; .group() gives the matched text.

for movie in soup.find_all("div", class_="lister-item-content"):
  print(re.search(r"\d{4}", movie.find(class_="lister-item-year text-muted unbold").string).group())



1972
1993
1957
1997
1966
1994
2006
1954
2011
1998
2000
2001
2009
2008
2001
2000
1980
2006
2000
2000
2008
2002
1997
2004
1995
1946
2003
1979
2003
2003
2010
1997
2011
1999
2004
2008
1999
1998
1957
1959
1973
2009
1994
1999
2008
2003
1993
1991
1985
2003
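
To see why the regular expression is more robust than chained `strip` calls, the sketch below runs both on a few sample year strings. `extract_year` is a hypothetical helper, and the last sample, `(2014 Video)`, is an invented string used only to show a case where stripping a fixed set of characters is not enough.

```python
import re

YEAR_PATTERN = re.compile(r"\d{4}")  # four consecutive digits

def extract_year(raw):
    """Return the first four-digit run in `raw` as an int, or None if there is none."""
    match = YEAR_PATTERN.search(raw)
    return int(match.group()) if match else None

samples = ["(1972)", "(I) (2008)", "(2014 Video)"]
for raw in samples:
    # Same chain of strips as in the cell above.
    stripped = raw.strip("(").strip(")").strip("I").strip(") (")
    print(repr(raw), "-> strip:", repr(stripped), "| regex:", extract_year(raw))
```

Returning `None` when nothing matches avoids the `AttributeError` that calling `.group()` on a failed search would raise.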


# The runtime

The runtime is held in the element with `class="runtime"`.

In [9]:
# Here a simple strip is enough.
for movie in soup.find_all("div", class_= "lister-item-content"):
  print(movie.find(class_ = "runtime").string.strip("min"))

175 
195 
96 
116 
178 
142 
117 
207 
112 
110 
102 
135 
93 
90 
137 
154 
146 
139 
155 
143 
152 
150 
194 
88 
178 
130 
103 
117 
132 
110 
144 
89 
123 
107 
144 
130 
89 
169 
161 
212 
122 
129 
110 
189 
116 
111 
127 
137 
116 
100 


In [10]:
# The same can be done with a regular expression, which makes the strip unnecessary.

for movie in soup.find_all("div", class_="lister-item-content"):
  print(re.search(r"\d+", movie.find(class_="runtime").string).group())  # r"\d+": one or more digits

175
195
96
116
178
142
117
207
112
110
102
135
93
90
137
154
146
139
155
143
152
150
194
88
178
130
103
117
132
110
144
89
123
107
144
130
89
169
161
212
122
129
110
189
116
111
127
137
116
100
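
One detail worth keeping in mind: `str.strip("min")` removes any of the characters `m`, `i`, `n` from both ends of the string, not the literal suffix `"min"`, which is why a trailing space remains in the first output above. A small sketch, using a hypothetical helper `parse_runtime` and assuming the runtime text looks like `"<number> min"`:

```python
import re

def parse_runtime(raw):
    """Return the runtime in minutes as an int, or None if no digits are found."""
    match = re.search(r"\d+", raw)
    return int(match.group()) if match else None

print(repr("175 min".strip("min")))   # the space before "min" is kept
print(repr("175 min".strip(" min")))  # adding a space to the character set removes it too
print(parse_runtime("175 min"))
```

`int()` tolerates surrounding whitespace, so either form converts cleanly, but the regex version does not depend on the exact unit text.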


# DataFrame from an empty list

Now that each piece of information can be retrieved individually, let's combine them into a DataFrame:

In [11]:
movies = []

for movie in soup.find_all("div", class_ = "lister-item-content"):
  title = movie.find("h3").find("a").string
  duration = int(movie.find(class_ = "runtime").string.strip(" min"))
  year = int(re.search(r"\d{4}", movie.find(class_ = "lister-item-year text-muted unbold").string).group())
  movies.append({"title": title, "duration": duration, "year": year})

movies

[{'title': 'The Godfather', 'duration': 175, 'year': 1972},
 {'title': "Schindler's List", 'duration': 195, 'year': 1993},
 {'title': '12 Angry Men', 'duration': 96, 'year': 1957},
 {'title': 'Life Is Beautiful', 'duration': 116, 'year': 1997},
 {'title': 'The Good, the Bad and the Ugly', 'duration': 178, 'year': 1966},
 {'title': 'The Shawshank Redemption', 'duration': 142, 'year': 1994},
 {'title': 'The Pursuit of Happyness', 'duration': 117, 'year': 2006},
 {'title': 'Seven Samurai', 'duration': 207, 'year': 1954},
 {'title': 'The Intouchables', 'duration': 112, 'year': 2011},
 {'title': 'Central Station', 'duration': 110, 'year': 1998},
 {'title': 'Requiem for a Dream', 'duration': 102, 'year': 2000},
 {'title': 'A Beautiful Mind', 'duration': 135, 'year': 2001},
 {'title': "Hachi: A Dog's Tale", 'duration': 93, 'year': 2009},
 {'title': 'Taken', 'duration': 90, 'year': 2008},
 {'title': 'My Sassy Girl', 'duration': 137, 'year': 2001},
 {'title': 'Amores perros', 'duration': 154, '

In [12]:
movie_df = pd.DataFrame(movies)
movie_df

|   | title | duration | year |
|---|---|---|---|
| 0 | The Godfather | 175 | 1972 |
| 1 | Schindler's List | 195 | 1993 |
| 2 | 12 Angry Men | 96 | 1957 |
| 3 | Life Is Beautiful | 116 | 1997 |
| 4 | The Good, the Bad and the Ugly | 178 | 1966 |
| 5 | The Shawshank Redemption | 142 | 1994 |
| 6 | The Pursuit of Happyness | 117 | 2006 |
| 7 | Seven Samurai | 207 | 1954 |
| 8 | The Intouchables | 112 | 2011 |
| 9 | Central Station | 110 | 1998 |


# Alternative: DataFrame from an empty dictionary

In [13]:
movies_dict = {'title': [], 'duration': [], 'year': []}
for movie in soup.find_all("div", class_="lister-item-content"):
    movies_dict['title'].append(movie.find("h3").find("a").string)
    movies_dict['duration'].append(int(movie.find(class_="runtime").string.strip(' min')))
    movies_dict['year'].append(int(re.search(r"\d{4}", movie.find(class_="lister-item-year text-muted unbold").string).group()))
print(movies_dict['title'][0:2])

pd.DataFrame(movies_dict)

['The Godfather', "Schindler's List"]


|   | title | duration | year |
|---|---|---|---|
| 0 | The Godfather | 175 | 1972 |
| 1 | Schindler's List | 195 | 1993 |
| 2 | 12 Angry Men | 96 | 1957 |
| 3 | Life Is Beautiful | 116 | 1997 |
| 4 | The Good, the Bad and the Ugly | 178 | 1966 |
| 5 | The Shawshank Redemption | 142 | 1994 |
| 6 | The Pursuit of Happyness | 117 | 2006 |
| 7 | Seven Samurai | 207 | 1954 |
| 8 | The Intouchables | 112 | 2011 |
| 9 | Central Station | 110 | 1998 |
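
Both constructions should yield the same table. As a quick check on two rows taken from the output above (the names `rows`, `columns`, `df_from_rows`, and `df_from_columns` are illustrative and not part of the notebook):

```python
import pandas as pd

# List of dicts: one dict per row.
rows = [
    {"title": "The Godfather", "duration": 175, "year": 1972},
    {"title": "Schindler's List", "duration": 195, "year": 1993},
]

# Dict of lists: one list per column.
columns = {
    "title": ["The Godfather", "Schindler's List"],
    "duration": [175, 195],
    "year": [1972, 1993],
}

df_from_rows = pd.DataFrame(rows)
df_from_columns = pd.DataFrame(columns)

# equals() compares shape, values, and dtypes.
print(df_from_rows.equals(df_from_columns))
```

The list of dicts keeps each film's fields together while scraping; the dict of lists mirrors the final column layout. Which one to use is mostly a matter of taste.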


# IMDB Top 250

We would now like to scrape several pages, not just the first one.

Searching Google for `top 250 imdb` leads to this page:

https://www.imdb.com/search/title/?groups=top_250&sort=user_rating

In [14]:
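def data_page(page):
  # Each results page holds 50 films; "start" is the 1-based rank of the first film on the page.
  # "start" must be its own parameter: a key such as "desc&start" would be percent-encoded, not split.
  response = requests.get("https://www.imdb.com/search/title/",
               params = {"groups": "top_250",
                        "sort": "user_rating,desc",
                        "start": 1 + page * 50})
  soup = BeautifulSoup(response.content, "html.parser")
  return soup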
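
To make the pagination explicit, the sketch below computes the offset for each page and encodes the query string with the standard library. `build_query` and `PAGE_SIZE` are hypothetical names; `groups` and `sort` come from the search URL above, while `start` as the offset parameter and 50 results per page are assumptions based on the function above. Each parameter needs its own key, because a key containing `&` is percent-encoded rather than split into two parameters.

```python
from urllib.parse import urlencode

PAGE_SIZE = 50  # results per page (assumption)

def build_query(page):
    """Return the encoded query string for a zero-based page index."""
    params = {
        "groups": "top_250",
        "sort": "user_rating,desc",
        "start": 1 + page * PAGE_SIZE,  # 1-based rank of the first result on the page
    }
    return urlencode(params)

for page in range(3):
    print(build_query(page))

# A key that contains "&" is escaped, so it does not create a separate parameter.
print(urlencode({"desc&start": 51}))
```

Printing the encoded query before sending any request is a cheap way to confirm that every page asks for a different offset.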

In [15]:
def get_movies(soup):   
  movies = []
  for movie in soup.find_all("div", class_ = "lister-item-content"):
    title = movie.find("h3").find("a").string
    duration = int(movie.find(class_ = "runtime").string.strip("min"))
    year = int(re.search(r"\d{4}", movie.find(class_ = "lister-item-year text-muted unbold").string).group())
    movies.append({"title": title, "duration": duration, "year": year})
  return movies

In [16]:
all_movies = []

for page in range(5):
  print(f"Page{page + 1} ...")
  soup = data_page(page)
  all_movies += get_movies(soup)
print("Done")

Page1 ...
Page2 ...
Page3 ...
Page4 ...
Page5 ...
Done
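
If some pages overlap or come back empty, the loop above would silently accumulate duplicates. One possible guard, sketched with hypothetical names (`collect_movies`, `fetch_page`, `FAKE_PAGES`) and a fake page source instead of real requests, is to skip entries already seen and stop at the first empty page:

```python
def collect_movies(fetch_page, max_pages):
    """Gather movie dicts page by page, skipping duplicates.

    `fetch_page(page)` must return a list of dicts with "title" and "year" keys.
    """
    seen = set()
    collected = []
    for page in range(max_pages):
        movies = fetch_page(page)
        if not movies:  # an empty page means there is nothing left to read
            break
        for movie in movies:
            key = (movie["title"], movie["year"])
            if key not in seen:
                seen.add(key)
                collected.append(movie)
    return collected

# Fake pages standing in for parsed results; page 1 repeats an entry from page 0.
FAKE_PAGES = [
    [{"title": "The Godfather", "year": 1972}, {"title": "Alien", "year": 1979}],
    [{"title": "Alien", "year": 1979}, {"title": "Inception", "year": 2010}],
    [],
]

result = collect_movies(lambda page: FAKE_PAGES[page], max_pages=5)
print([movie["title"] for movie in result])
```

In the notebook, the fake source could be replaced by something like `lambda page: get_movies(data_page(page))`. Comparing `len(result)` with the expected 250 then gives a quick sanity check on the pagination.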


In [17]:
all_movies

[{'title': 'The Shawshank Redemption', 'duration': 142, 'year': 1994},
 {'title': 'The Godfather', 'duration': 175, 'year': 1972},
 {'title': 'The Dark Knight', 'duration': 152, 'year': 2008},
 {'title': 'The Lord of the Rings: The Return of the King',
  'duration': 201,
  'year': 2003},
 {'title': "Schindler's List", 'duration': 195, 'year': 1993},
 {'title': 'The Godfather Part II', 'duration': 202, 'year': 1974},
 {'title': '12 Angry Men', 'duration': 96, 'year': 1957},
 {'title': 'Jai Bhim', 'duration': 164, 'year': 2021},
 {'title': 'Pulp Fiction', 'duration': 154, 'year': 1994},
 {'title': 'Inception', 'duration': 148, 'year': 2010},
 {'title': 'The Lord of the Rings: The Two Towers',
  'duration': 179,
  'year': 2002},
 {'title': 'Fight Club', 'duration': 139, 'year': 1999},
 {'title': 'The Lord of the Rings: The Fellowship of the Ring',
  'duration': 178,
  'year': 2001},
 {'title': 'Forrest Gump', 'duration': 142, 'year': 1994},
 {'title': 'The Good, the Bad and the Ugly', 'du

In [18]:
pd.DataFrame(all_movies)

|   | title | duration | year |
|---|---|---|---|
| 0 | The Shawshank Redemption | 142 | 1994 |
| 1 | The Godfather | 175 | 1972 |
| 2 | The Dark Knight | 152 | 2008 |
| 3 | The Lord of the Rings: The Return of the King | 201 | 2003 |
| 4 | Schindler's List | 195 | 1993 |
| ... | ... | ... | ... |
| 245 | Back to the Future | 116 | 1985 |
| 246 | Apocalypse Now | 147 | 1979 |
| 247 | Alien | 117 | 1979 |
| 248 | Once Upon a Time in the West | 165 | 1968 |
