# WEB SCRAPING PROJECT - DIA 1 - Ciné Zodiaque
#### HIEN Victor - LUTTENBACHER Léa - MENU Victor

### Libraries

In [149]:
import requests
import time
import pandas as pd
import re
import random
import numpy as np
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

# Context and objective

Since the early 2000s, society has undergone a major digital transformation across many sectors. The video-entertainment world is no exception: cinema has adapted and evolved to meet the new challenges this transformation brings. As a result, films are now available on many media and platforms, whether in physical cinemas or on streaming platforms such as Netflix or Disney+.

However, despite all these innovations aimed at a better user experience, one problem persists, and most of us have surely experienced it: choosing a film. Although the personalization and suggestions offered by streaming sites are relevant, it is hard to find a neutral, global film database that brings together all the different criteria.

Review and rating sites already exist and offer rating and ranking systems, but their scores are biased and often unrepresentative, because they are specific to each site. What's more, most of the criteria remain very generic and are poorly suited to certain consumer expectations, particularly with regard to CSR (corporate social responsibility, "RSE" in French).

We have chosen to focus on two CSR issues that are in line with our values and that meet two needs:

- The cross-cultural dimension of cinema:
All too often, streaming platforms limit cinema to Hollywood films, which narrows the range of choices available to the user. A cross-cultural offering would address this.

- Civic and ethical commitment regarding VSS (sexist and sexual violence, from the French "violences sexistes et sexuelles"):
With the many controversies, scandals and court cases of recent years, the press has brought to light a problem that is too often hushed up within the film industry: VSS in the production world. As a result, many people wish to boycott films produced by people accused of VSS, but streaming platforms do not give them this information. This is the second need we address.


# Our Project

To address these issues, we decided to create our own movie recommendation site, called Ciné ZoDIAque.

Ciné ZoDIAque is a site with an interface listing numerous movies and their main characteristics: Name, Year of release, Genre, Duration, Synopsis, Director, Crew, Distributor, VSS and Rating (Note_Z, the mean of the scores given by the different sites, and Note_U, the mean of the user scores collected from those sites).
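
Because the source sites do not all rate on the same scale, averaging raw scores would give more weight to the sites with larger scales. The sketch below shows one possible way to compute a mean on a common 0-10 scale. The function name `mean_on_common_scale` and the maximum values passed in the example are assumptions made for illustration, not the project's actual method.

```python
def mean_on_common_scale(scores, target_max=10.0):
    """Average (value, max_value) pairs after rescaling each one to 0..target_max.

    `scores` is a list of (value, max_value) tuples; entries whose value is
    None (score missing on that site) are ignored.
    """
    rescaled = [
        value / max_value * target_max
        for value, max_value in scores
        if value is not None
    ]
    if not rescaled:
        return None  # no score available on any site
    return sum(rescaled) / len(rescaled)


# Hypothetical example: a /10 score, a /100 score and a /5 score for one film.
note_z = mean_on_common_scale([(8.7, 10), (82, 100), (3.4, 5)])
print(round(note_z, 2))
```

Skipping missing values means a film rated on only one site still gets a score, although that score is based on less information.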

This information will be scraped from various film recommendation and rating sites.

To respect the cross-cultural criterion and thus limit bias, we want to offer an international variety of films, broadening the selection and opening it up to different cultures around the world. To achieve this, our ratings will be retrieved from several sites based in different countries.

Then, to respect our VSS criterion, we decided not only to offer a tool for viewing the people accused of VSS (with a link to the source), but also to let users filter out films made by people accused of VSS.

More technically, our site will offer a general interface, a search engine and a film filtering capability.

# I) Scraping the movie websites

As we've explained, the first step is to scrape movies from selected, reliable sites: SensCritique (FR), Allocine (EU), IMDB (US), RottenTomatoes (WD) and Metacritic (WD).

Here, we detail the different stages of scraping for each site, and the difficulties encountered.

Generally speaking, we used two libraries to scrape information: BeautifulSoup and Selenium. BeautifulSoup lets us retrieve the desired information through HTML tags, which we identify by inspecting the source of the pages in question. To obtain more information, it is sometimes necessary to perform an action on the site, so we use Selenium to simulate human actions such as scrolling or clicking a button.

The first difficulty was identifying the information and the corresponding tags, then extracting them. In addition, a site may deny access when many pages are scraped, hence the use of request headers. Finally, we had to account for elements that may be missing from a page and handle that case in our code.
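
The handling of missing elements described above can be factored into a small helper. This is a minimal sketch: the name `text_or_default` is hypothetical, and `SimpleNamespace` objects stand in for the tags that `soup.find(...)` would return, so that the example runs without any network access.

```python
from types import SimpleNamespace


def text_or_default(element, default="N/A"):
    """Return the stripped text of a parsed element, or `default` if it is missing.

    `element` is expected to be what BeautifulSoup's find()/select_one()
    returns: an object with a `.text` attribute, or None when nothing matched.
    """
    if element is None:
        return default
    return element.text.strip()


# Stand-ins for a matched tag and for a failed lookup.
found = SimpleNamespace(text="  Drame  ")
missing = None

print(text_or_default(found))                            # the tag exists
print(text_or_default(missing, "Genre non disponible"))  # the tag is absent
```

With such a helper, each `x.text.strip() if x else "..."` expression in the cells below could be written as a single call.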

### Scraping the website "SensCritique": https://www.senscritique.com/films/tops/top111

As you'll see below, we first tried to scrape all 111 films of the top list and their information. However, using only BeautifulSoup, only 50 movies are scraped.

In [20]:
url = "https://www.senscritique.com/films/tops/top111"
response = requests.get(url)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    films_data = []

    films = soup.find_all(class_='sc-4495ecbb-1 gmNpOm')

    for film in films:
        title = film.find('a', class_='sc-e6f263fc-0 sc-a0949da7-1 cTitej eGjRhz sc-4495ecbb-3 hCRsTs').text.strip()
        release_date = film.find('span', {'data-testid': 'date-release'}).text.strip()
        genre = film.find('span', {'data-testid': 'genres'}).text.strip()
        rating = film.find('div', class_='sc-8251ce8c-5 bVyLNx globalRating').text.strip()
        duration_element = film.find('span', {'data-testid': 'duration'})
        duration = duration_element.text.strip() if duration_element else "Durée non disponible"
        creator_element = film.find('a', class_='sc-e6f263fc-0 sc-a0949da7-0 GItpw eShzae')
        creator_span = creator_element.find('span') if creator_element else None
        creator = creator_span.text.strip() if creator_span else "Créateur non disponible"
        movie_link = film.find('a', class_='sc-e6f263fc-0 sc-a0949da7-1 cTitej eGjRhz sc-4495ecbb-3 hCRsTs')['href']
        movie_response = requests.get(f"https://www.senscritique.com{movie_link}")
        if movie_response.status_code == 200:
            movie_soup = BeautifulSoup(movie_response.text, 'html.parser')
            synopsis_element = movie_soup.find('p', {'data-testid': 'synopsis'})
            synopsis = synopsis_element.text.strip() if synopsis_element else "Synopsis non disponible"
        else:
            synopsis = "Synopsis non disponible (échec de la requête)"

        films_data.append([title, release_date, genre, rating, duration, creator, synopsis])

    columns = ["Titre", "Date de sortie", "Genre", "Note", "Durée", "Créateur", "Synopsis"]
    df = pd.DataFrame(films_data, columns=columns)
    print(f"Nombre total de films scrappés : {len(films_data)}")

else:
    print(f"La requête a échoué avec le code d'état {response.status_code}")

Nombre total de films scrappés : 50


By analyzing the site, we see that only 50 films are displayed initially; the remaining 61 are loaded when you reach the bottom of the page. So we use Selenium with a Chrome driver to simulate scrolling and retrieve all the information.

In [21]:
options = webdriver.ChromeOptions()

options.add_argument('--ignore-certificate-errors')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('enable-automation')
options.add_argument('--window-size=1920,1080')
options.add_argument('--disable-extensions')
options.add_argument('--dns-prefetch-disable')
options.add_argument('--disable-gpu')
options.add_argument('--remote-debugging-port=9222')

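from selenium.webdriver.chrome.service import Service  # Selenium 4 replaces executable_path with a Service object

chromedriver_path = 'C:/Users/vim17/Desktop/A5/ChromeDriver/chromedriver.exe'
driver = webdriver.Chrome(service=Service(chromedriver_path), options=options)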

url = "https://www.senscritique.com/films/tops/top111"
driver.get(url)

for i in range(3):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)

page_source = driver.page_source
soup = BeautifulSoup(page_source, 'html.parser')
films_data = []
films = soup.select('.sc-4495ecbb-1.gmNpOm')

for film in films:
    title = film.select_one('.sc-e6f263fc-0.sc-a0949da7-1.cTitej.eGjRhz.sc-4495ecbb-3.hCRsTs').text.strip()
    release_date = film.select_one('span[data-testid="date-release"]').text.strip()

    genre_element = film.select_one('span[data-testid="genres"]')
    genre = genre_element.text.strip() if genre_element else "Genre non disponible"

    rating_element = film.select_one('.sc-8251ce8c-5.bVyLNx.globalRating')
    rating = rating_element.text.strip() if rating_element else "Note non disponible"

    duration_element = film.select_one('span[data-testid="duration"]')
    duration = duration_element.text.strip() if duration_element else "Durée non disponible"

    creator_element = film.find('a', class_='sc-e6f263fc-0 sc-a0949da7-0 GItpw eShzae')
    creator_span = creator_element.find('span') if creator_element else None
    creator = creator_span.text.strip() if creator_span else "Créateur non disponible"

    movie_link = film.select_one('.sc-e6f263fc-0.sc-a0949da7-1.cTitej.eGjRhz.sc-4495ecbb-3.hCRsTs')['href']
    movie_response = requests.get(f"https://www.senscritique.com{movie_link}")
    if movie_response.status_code == 200:
        movie_soup = BeautifulSoup(movie_response.text, 'html.parser')
        synopsis_element = movie_soup.select_one('p[data-testid="synopsis"]')
        synopsis = synopsis_element.text.strip() if synopsis_element else "Synopsis non disponible"
    else:
        synopsis = "Synopsis non disponible (échec de la requête)"

    films_data.append([title, release_date, genre, rating, duration, creator, synopsis])

columns = ["Titre", "Date de sortie", "Genre", "Note", "Durée", "Créateur", "Synopsis"]
df_sc = pd.DataFrame(films_data, columns=columns)

print(f"Nombre total de films scrappés : {len(films_data)}")

driver.quit()


Nombre total de films scrappés : 111
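
The fixed number of scrolls used above is enough for this page. A more general variant, sketched below, keeps scrolling until the page height stops growing. The function name `scroll_until_stable` and its parameters are hypothetical; the only Selenium call it relies on is `driver.execute_script`, already used above.

```python
import time


def scroll_until_stable(driver, pause=2.0, max_rounds=20):
    """Scroll to the bottom repeatedly until the document height stops changing.

    Returns the number of scrolls performed. `max_rounds` is a safety bound
    for pages that never stop loading content.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the lazily loaded items time to appear
        new_height = driver.execute_script("return document.body.scrollHeight")
        rounds += 1
        if new_height == last_height:
            break  # nothing new was loaded: we reached the real bottom
        last_height = new_height
    return rounds
```

In the cell above, the `for i in range(3)` loop could be replaced by `scroll_until_stable(driver)`.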


Once the site has been scraped, we display the resulting DataFrame after some initial cleaning to make it easier to read.

In [22]:
df_sc['Date de sortie'] = df_sc['Date de sortie'].str.extract(r'(\d{1,2} \w+ \d{4})')
df_sc.head()

| | Titre | Date de sortie | Genre | Note | Durée | Créateur | Synopsis |
|---|---|---|---|---|---|---|---|
| 0 | Threat Level Midnight (2011) | 17 février 2011 | Comédie | 8.7 | 30 min. | Tucker Gates | Après 11 ans de préparation, Michael Scott nou... |
| 1 | Douze Hommes en colère (1957) | 4 octobre 1957 | Policier, Drame | 8.7 | 1 h 36 min. | Sidney Lumet | Un jeune homme d’origine modeste est accusé du... |
| 2 | Harakiri (1962) | 24 juillet 1963 | Drame | 8.6 | 2 h 13 min. | Masaki Kobayashi | Au XVIIe siècle, le Japon n’est plus en guerre... |
| 3 | Blade Runner : The Final Cut (2007) | 5 octobre 2007 | Science-fiction | 8.6 | 1 h 57 min. | Ridley Scott | Dans les dernières années du 20ème siècle, des... |
| 4 | Le Bon, la Brute et le Truand (1966) | 8 mars 1968 | Western, Aventure | 8.5 | 2 h 59 min. | Sergio Leone | Un chasseur de primes rejoint deux hommes dans... |
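
The release dates extracted above are still French text. As a possible next cleaning step, the sketch below converts such a string into a `datetime.date`. The helper `parse_french_date` and the `FRENCH_MONTHS` table are illustrative names, not part of the notebook.

```python
from datetime import date

# French month names as they appear in the scraped dates.
FRENCH_MONTHS = {
    "janvier": 1, "février": 2, "mars": 3, "avril": 4, "mai": 5, "juin": 6,
    "juillet": 7, "août": 8, "septembre": 9, "octobre": 10,
    "novembre": 11, "décembre": 12,
}


def parse_french_date(raw):
    """Convert a string such as '17 février 2011' into a date, or None if it does not parse."""
    parts = raw.strip().lower().split()
    if len(parts) != 3 or parts[1] not in FRENCH_MONTHS:
        return None
    try:
        return date(int(parts[2]), FRENCH_MONTHS[parts[1]], int(parts[0]))
    except ValueError:
        return None  # non-numeric day or year, or an impossible date


print(parse_french_date("17 février 2011"))
```

Applied to the column, this could be written as `df_sc['Date de sortie'].map(parse_french_date)`, provided missing values are handled first.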


To harmonize all the DataFrames, we rename the columns so that they share the same names across sites.

In [23]:
new_sc = {
    'Titre': 'Title',
    'Date de sortie': 'Date',
    'Genre':'Genre',
    'Note':'Note_Sc',
    'Durée':'Duration',
    'Créateur':'Director',
    'Synopsis':'Synopsis'
}

df_sc.rename(columns=new_sc, inplace=True)
df_sc.head()

| | Title | Date | Genre | Note_Sc | Duration | Director | Synopsis |
|---|---|---|---|---|---|---|---|
| 0 | Threat Level Midnight (2011) | 17 février 2011 | Comédie | 8.7 | 30 min. | Tucker Gates | Après 11 ans de préparation, Michael Scott nou... |
| 1 | Douze Hommes en colère (1957) | 4 octobre 1957 | Policier, Drame | 8.7 | 1 h 36 min. | Sidney Lumet | Un jeune homme d’origine modeste est accusé du... |
| 2 | Harakiri (1962) | 24 juillet 1963 | Drame | 8.6 | 2 h 13 min. | Masaki Kobayashi | Au XVIIe siècle, le Japon n’est plus en guerre... |
| 3 | Blade Runner : The Final Cut (2007) | 5 octobre 2007 | Science-fiction | 8.6 | 1 h 57 min. | Ridley Scott | Dans les dernières années du 20ème siècle, des... |
| 4 | Le Bon, la Brute et le Truand (1966) | 8 mars 1968 | Western, Aventure | 8.5 | 2 h 59 min. | Sergio Leone | Un chasseur de primes rejoint deux hommes dans... |


We then save the DataFrame locally.

Total movies scraped from this site: 111

In [37]:
df_sc.to_csv('C:/Users/vim17/Desktop/WebScrapping2024/df_sc.csv', index=False)

### Scraping the website "IMDb": https://www.imdb.com/search/title/?sort=user_rating,desc&groups=top_1000&count=100

First, we try to scrape the site to obtain the top 1000 films and their information.

In [24]:
url = "https://www.imdb.com/search/title/?sort=user_rating,desc&groups=top_1000&count=100"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}

response = requests.get(url, headers=headers)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')

    data = []

    for movie in soup.find_all('div', class_='sc-b4e41383-4 efDLHx dli-parent'):
        title_element = movie.find('h3', class_='ipc-title__text')
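        # Year and duration share the same class: take them by position, without failing if one is missing
        metadata_items = movie.find_all('span', class_='sc-1e00898e-8 hsHAHC dli-title-metadata-item')
        year_element = metadata_items[0] if metadata_items else None
        duration_element = metadata_items[1] if len(metadata_items) > 1 else None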

        imdb_rating_element = movie.find('span', class_='ipc-rating-star--base')
        imdb_rating = imdb_rating_element.text.strip() if imdb_rating_element else 'N/A'

        metascore_element = movie.find('span', class_='sc-b0901df4-0 bcQdDJ metacritic-score-box')
        synopsis_element = movie.find('div', class_='ipc-html-content-inner-div')

        title = title_element.text.strip() if title_element else 'N/A'
        year = year_element.text.strip() if year_element else 'N/A'
        duration = duration_element.text.strip() if duration_element else 'N/A'
        metascore = metascore_element.text.strip() if metascore_element else 'N/A'
        synopsis = synopsis_element.text.strip() if synopsis_element else 'N/A'

        data.append([title, year, duration, imdb_rating, metascore, synopsis])

    columns = ['Title', 'Year', 'Duration', 'IMDb Rating', 'Metascore', 'Synopsis']
    df_IMDB1 = pd.DataFrame(data, columns=columns)

else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")

In [25]:
df_IMDB1['Title'] = df_IMDB1['Title'].str.replace(r'^\d+\.\s', '', regex=True)  # strip the leading rank, e.g. "1. "
df_IMDB1


| | Title | Year | Duration | IMDb Rating | Metascore | Synopsis |
|---|---|---|---|---|---|---|
| 0 | Les évadés | 1994 | 2h 22m | 9.3 (2.8M) | 82 | Over the course of several years, two convicts... |
| 1 | 12th Fail | 2023 | 2h 27m | 9.2 (81K) | | The real-life story of IPS Officer Manoj Kumar... |
| 2 | Le Parrain | 1972 | 2h 55m | 9.2 (2M) | 100 | Don Vito Corleone, head of a mafia family, dec... |
| 3 | The Dark Knight : Le Chevalier noir | 2008 | 2h 32m | 9.0 (2.8M) | 84 | When the menace known as the Joker wreaks havo... |
| 4 | Le Seigneur des anneaux : Le Retour du roi | 2003 | 3h 21m | 9.0 (1.9M) | 94 | Gandalf and Aragorn lead the World of Men agai... |
| ... | ... | ... | ... | ... | ... | ... |
| 95 | Heat | 1995 | 2h 50m | 8.3 (708K) | 76 | A group of high-end professional thieves start... |
| 96 | Old Boy | 2003 | 2h | 8.3 (624K) | 78 | After being kidnapped and imprisoned for fifte... |
| 97 | Will Hunting | 1997 | 2h 6m | 8.3 (1M) | 71 | Will Hunting, a janitor at M.I.T., has a gift ... |
| 98 | Requiem for a Dream | 2000 | 1h 42m | 8.3 (888K) | 71 | The drug-induced utopias of four Coney Island ... |


We note that only 100 titles are scraped. Observing the site, we see that it loads in stages: it displays only the first 100 films, and a "100 more" button loads the next 100, and so on up to 1000.
So we use Selenium to get around this problem, scrolling down and clicking the button whenever it is present.

In [26]:
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('enable-automation')
options.add_argument('--window-size=1920,1080')
options.add_argument('--disable-extensions')
options.add_argument('--dns-prefetch-disable')
options.add_argument('--disable-gpu')
options.add_argument('--remote-debugging-port=9222')

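from selenium.webdriver.chrome.service import Service  # Selenium 4 replaces executable_path with a Service object

chromedriver_path = 'C:/Users/vim17/Desktop/A5/ChromeDriver/chromedriver.exe'
driver = webdriver.Chrome(service=Service(chromedriver_path), options=options)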

url = "https://www.imdb.com/search/title/?sort=user_rating,desc&groups=top_1000&count=100"
driver.get(url)

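click_count = 0
attempts = 0
max_attempts = 30  # safety bound: stop even if the button never appears
while click_count < 9 and attempts < max_attempts:
    attempts += 1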
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)

    see_more_button = driver.find_elements(By.XPATH, '//button[contains(@class, "ipc-see-more__button")]'
                                                  '[contains(@class, "ipc-btn--theme-base")]'
                                                  '[contains(@class, "ipc-btn--on-accent2")]')

    if see_more_button:
        driver.execute_script("arguments[0].click();", see_more_button[0])
        click_count += 1
        print(f"Clicked {click_count} times")
        time.sleep(3)
    else:
        print("Button not found, continuing to scroll.")

page_source = driver.page_source
soup = BeautifulSoup(page_source, 'html.parser')

data = []

for movie in soup.find_all('div', class_='sc-b4e41383-4 efDLHx dli-parent'):
    title_element = movie.find('h3', class_='ipc-title__text')
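    # Year and duration share the same class: take them by position, without failing if one is missing
    metadata_items = movie.find_all('span', class_='sc-1e00898e-8 hsHAHC dli-title-metadata-item')
    year_element = metadata_items[0] if metadata_items else None
    duration_element = metadata_items[1] if len(metadata_items) > 1 else None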

    imdb_rating_element = movie.find('span', class_='ipc-rating-star--base')
    imdb_rating = imdb_rating_element.text.strip() if imdb_rating_element else 'N/A'

    metascore_element = movie.find('span', class_='sc-b0901df4-0 bcQdDJ metacritic-score-box')
    synopsis_element = movie.find('div', class_='ipc-html-content-inner-div')

    title = title_element.text.strip() if title_element else 'N/A'
    year = year_element.text.strip() if year_element else 'N/A'
    duration = duration_element.text.strip() if duration_element else 'N/A'
    metascore = metascore_element.text.strip() if metascore_element else 'N/A'
    synopsis = synopsis_element.text.strip() if synopsis_element else 'N/A'

    data.append([title, year, duration, imdb_rating, metascore, synopsis])

columns = ['Title', 'Year', 'Duration', 'IMDb Rating', 'Metascore', 'Synopsis']
df_imdb = pd.DataFrame(data, columns=columns)

driver.quit()


Clicked 1 times
Clicked 2 times
Clicked 3 times
Clicked 4 times
Clicked 5 times
Clicked 6 times
Clicked 7 times
Clicked 8 times
Clicked 9 times


Once the site has been scraped, we display the resulting DataFrame after some initial cleaning to make it easier to read.

In [27]:
df_imdb['Title'] = df_imdb['Title'].str.replace(r'^\d+\.\s', '', regex=True)  # strip the leading rank, e.g. "1. "
df_imdb.head()


| | Title | Year | Duration | IMDb Rating | Metascore | Synopsis |
|---|---|---|---|---|---|---|
| 0 | Les évadés | 1994 | 2h 22m | 9,3 (2,8 M) | 82.0 | Le banquier Andy Dufresne est arrêté pour avoi... |
| 1 | 12th Fail | 2023 | 2h 27m | 9,2 (81 k) | | L'histoire réelle de Manoj Kumar Sharma, offic... |
| 2 | Le Parrain | 1972 | 2h 55m | 9,2 (2 M) | 100.0 | Le patriarche vieillissant d'une dynastie de l... |
| 3 | The Dark Knight : Le Chevalier noir | 2008 | 2h 32m | 9,0 (2,8 M) | 84.0 | Lorsqu'une menace mieux connue sous le nom du ... |
| 4 | Le Seigneur des anneaux : Le Retour du roi | 2003 | 3h 21m | 9,0 (1,9 M) | 94.0 | Gandalf et Aragorn mènent le monde des hommes ... |
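
The "IMDb Rating" column mixes the rating and the number of votes, and its format depends on the locale of the client ("9.3 (2.8M)" with requests, "9,3 (2,8 M)" with the Selenium browser). The sketch below shows one possible way to split it into two numeric values. The helper name `parse_imdb_rating` is hypothetical.

```python
import re

_MULTIPLIERS = {"k": 1_000, "m": 1_000_000}
_RATING_PATTERN = re.compile(
    r"\s*(\d+(?:[.,]\d+)?)\s*\(\s*(\d+(?:[.,]\d+)?)\s*([kKmM]?)\s*\)"
)


def parse_imdb_rating(raw):
    """Split '9,3 (2,8 M)' or '9.3 (2.8M)' into (rating, votes).

    Returns (None, None) when the string does not match, e.g. 'N/A'.
    """
    match = _RATING_PATTERN.match(raw)
    if match is None:
        return None, None
    rating = float(match.group(1).replace(",", "."))
    votes = float(match.group(2).replace(",", "."))
    votes *= _MULTIPLIERS.get(match.group(3).lower(), 1)
    return rating, int(round(votes))


print(parse_imdb_rating("9,3 (2,8 M)"))
print(parse_imdb_rating("9.2 (81K)"))
```

Accepting both decimal separators lets the same function handle the output of either scraping method.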


To harmonize all the DataFrames, we rename the columns so that they share the same names across sites.

In [28]:
new_imdb = {
    'Title': 'Title',
    'Year': 'Date',
    'Duration':'Duration',
    'IMDb Rating':'Note_Imdb',
    'Metascore':'User_score_1',
    'Synopsis':'Synopsis'
}

df_imdb.rename(columns=new_imdb, inplace=True)
df_imdb.head()

| | Title | Date | Duration | Note_Imdb | User_score_1 | Synopsis |
|---|---|---|---|---|---|---|
| 0 | Les évadés | 1994 | 2h 22m | 9,3 (2,8 M) | 82.0 | Le banquier Andy Dufresne est arrêté pour avoi... |
| 1 | 12th Fail | 2023 | 2h 27m | 9,2 (81 k) | | L'histoire réelle de Manoj Kumar Sharma, offic... |
| 2 | Le Parrain | 1972 | 2h 55m | 9,2 (2 M) | 100.0 | Le patriarche vieillissant d'une dynastie de l... |
| 3 | The Dark Knight : Le Chevalier noir | 2008 | 2h 32m | 9,0 (2,8 M) | 84.0 | Lorsqu'une menace mieux connue sous le nom du ... |
| 4 | Le Seigneur des anneaux : Le Retour du roi | 2003 | 3h 21m | 9,0 (1,9 M) | 94.0 | Gandalf et Aragorn mènent le monde des hommes ... |


We then save the DataFrame locally.

Total movies scraped from this site: 1000

In [147]:
df_imdb.to_csv('C:/Users/vim17/Desktop/ZooDiaque/df_imdb.csv', index=False)

### Scraping the website "Allocine": https://www.allocine.fr/films/

First, we try to scrape this site to obtain as many films as possible and their information.

In [102]:
url = "https://www.allocine.fr/films/"
response = requests.get(url)

if response.status_code == 200:
    soup = BeautifulSoup(response.content, 'html.parser')

    movies = []

    for movie_div in soup.find_all('div', class_='card entity-card entity-card-list cf'):
        title = movie_div.find('a', class_='meta-title-link').text.strip()
        date = movie_div.find('span', class_='date').text.strip()

        duration_text = movie_div.find('div', class_='meta-body-info').text.strip()
        duration_match = re.search(r'(\d+h \d+min)', duration_text)
        duration = duration_match.group(1) if duration_match else 'N/A'

        genres_element = movie_div.find('div', class_='meta-body').find('div', class_='meta-body-item meta-body-info')
        genres = [genre.text.strip() for genre in genres_element.find_all('a')] if genres_element else []

        rating_items = movie_div.find_all('div', class_='rating-item-content')
        press_rating = rating_items[0].find('span', class_='stareval-note').text.strip() if rating_items and len(rating_items) > 0 else 'N/A'
        audience_rating = rating_items[1].find('span', class_='stareval-note').text.strip() if rating_items and len(rating_items) > 1 else 'N/A'

        director_element = movie_div.find('div', class_='meta-body-item meta-body-direction ')
        director_a = director_element.find('a', class_='xXx dark-grey-link') if director_element else None
        director = director_a.text.strip() if director_a else 'N/A'

        actors_element = movie_div.find('div', class_='meta-body-actor')
        actors = [a.text.strip() for a in actors_element.find_all('a')] if actors_element else []

        synopsis_div = movie_div.find('div', class_='synopsis')
        synopsis_element = synopsis_div.find('div', class_='content-txt') if synopsis_div else None
        synopsis = synopsis_element.text.strip() if synopsis_element else 'N/A'

        movies.append({
            'Title': title,
            'Date': date,
            'Duration': duration,
            'Genres': genres,
            'Director': director,
            'Actors': actors,
            'Synopsis': synopsis,
            'Press Rating': press_rating,
            'Audience Rating': audience_rating
        })

    df_allo = pd.DataFrame(movies)

else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")


In [103]:
df_allo.head()

| | Title | Date | Duration | Genres | Director | Actors | Synopsis | Press Rating | Audience Rating |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Wonka | 13 décembre 2023 | 1h 57min | [] | | [Timothée Chalamet, Calah Lane] | Découvrez la jeunesse de Willy Wonka, l’extrao... | 34 | 38 |
| 1 | Les Trois Mousquetaires: Milady | 13 décembre 2023 | 1h 55min | [] | | [François Civil, Vincent Cassel] | Du Louvre au Palais de Buckingham, des bas-fon... | 33 | 37 |
| 2 | Chasse gardée | 20 décembre 2023 | 1h 41min | [] | | [Didier Bourdon, Hakim Jemili] | Dans un village sans histoire, une maison de r... | 30 | 32 |
| 3 | Vermines | 27 décembre 2023 | 1h 45min | [] | | [Théo Christine, Sofia Lesaffre] | Face à une invasion d'araignées, les habitants... | 36 | 38 |
| 4 | La Tresse | 29 novembre 2023 | 1h 59min | [] | | [Kim Raver, Fotinì Peluso] | Trois vies, trois femmes, trois continents. Tr... | 24 | 42 |


We notice that only 14 titles are scraped. Observing the site, we see that the list is split across a large number of pages whose URLs are identical except for the page number. So all we need to do is loop over the page numbers with requests and BeautifulSoup.
Another option would have been to use Selenium to move to the next page.


In [31]:
all_movies = []

page_number = 1
max_pages = 200

while page_number <= max_pages:
    url = f"https://www.allocine.fr/films/?page={page_number}"

    response = requests.get(url)

    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')

        movies_on_page = soup.find_all('div', class_='card entity-card entity-card-list cf')
        if not movies_on_page:
            break

        for movie_div in movies_on_page:
            title = movie_div.find('a', class_='meta-title-link').text.strip()
            date = movie_div.find('span', class_='date').text.strip()

            duration_text = movie_div.find('div', class_='meta-body-info').text.strip()
            duration_match = re.search(r'(\d+h \d+min)', duration_text)
            duration = duration_match.group(1) if duration_match else 'N/A'

            
            genres_element = movie_div.find('div', class_='meta-body').find('div', class_='meta-body-item meta-body-info')
            genres = [genre.text.strip() for genre in genres_element.find_all('a')] if genres_element else []


            rating_items = movie_div.find_all('div', class_='rating-item-content')
            press_rating = rating_items[0].find('span', class_='stareval-note').text.strip() if rating_items and len(rating_items) > 0 else 'N/A'
            audience_rating = rating_items[1].find('span', class_='stareval-note').text.strip() if rating_items and len(rating_items) > 1 else 'N/A'


            director_element = movie_div.find('div', class_='meta-body-direction')
            director_a = director_element.find('a', class_='blue-link') if director_element else None
            director = director_a.text.strip() if director_a else 'N/A'

            actors_element = movie_div.find('div', class_='meta-body-actor')
            actors = [a.text.strip() for a in actors_element.find_all('a')] if actors_element else []

            synopsis_div = movie_div.find('div', class_='synopsis')
            synopsis_element = synopsis_div.find('div', class_='content-txt') if synopsis_div else None
            synopsis = synopsis_element.text.strip() if synopsis_element else 'N/A'

            all_movies.append({
            'Title': title,
            'Date': date,
            'Duration': duration,
            'Genres': genres,
            'Director': director,
            'Actors': actors,
            'Synopsis': synopsis,
            'Press Rating': press_rating,
            'Audience Rating': audience_rating
            })

        page_number += 1
    else:
        print(f"Failed to retrieve the page. Status code: {response.status_code}")
        break

df_allocine = pd.DataFrame(all_movies)

print("Scrapping terminé")

Scrapping terminé
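
The loop above sends up to 200 requests back to back. A possible refinement, sketched here, separates the pagination logic from the parsing and adds a random pause between pages. `crawl_pages` and its `fetch_page` argument are hypothetical names; `fetch_page` stands for any function that returns the list of films parsed from one page.

```python
import random
import time


def crawl_pages(fetch_page, max_pages=200, min_delay=1.0, max_delay=3.0):
    """Call fetch_page(1), fetch_page(2), ... and collect the results.

    Stops at the first empty page or after `max_pages`, and waits a random
    delay between pages to avoid hammering the site.
    """
    results = []
    for page_number in range(1, max_pages + 1):
        items = fetch_page(page_number)
        if not items:
            break  # an empty page means we went past the last one
        results.extend(items)
        time.sleep(random.uniform(min_delay, max_delay))
    return results


# Stand-in for a real page parser: 4 pages of 3 films each, then nothing.
def fake_fetch(page_number):
    return [f"film-{page_number}-{i}" for i in range(3)] if page_number <= 4 else []


films = crawl_pages(fake_fetch, min_delay=0, max_delay=0)
print(len(films))
```

In the notebook, `fetch_page` would wrap the `requests.get` call and the parsing of the `movie_div` elements.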


In [32]:
df_allocine

| | Title | Date | Duration | Genres | Director | Actors | Synopsis | Press Rating | Audience Rating |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Wonka | 13 décembre 2023 | 1h 57min | [] | | [Timothée Chalamet, Calah Lane] | Découvrez la jeunesse de Willy Wonka, l’extrao... | 34 | 38 |
| 1 | Les Trois Mousquetaires: Milady | 13 décembre 2023 | 1h 55min | [] | | [François Civil, Vincent Cassel] | Du Louvre au Palais de Buckingham, des bas-fon... | 33 | 37 |
| 2 | Chasse gardée | 20 décembre 2023 | 1h 41min | [] | | [Didier Bourdon, Hakim Jemili] | Dans un village sans histoire, une maison de r... | 30 | 32 |
| 3 | Vermines | 27 décembre 2023 | 1h 45min | [] | | [Théo Christine, Sofia Lesaffre] | Face à une invasion d'araignées, les habitants... | 36 | 38 |
| 4 | La Tresse | 29 novembre 2023 | 1h 59min | [] | | [Kim Raver, Fotinì Peluso] | Trois vies, trois femmes, trois continents. Tr... | 24 | 42 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2995 | Hitchcock | 6 février 2013 | 1h 38min | [] | | [Anthony Hopkins, Helen Mirren] | Alfred Hitchcock, réalisateur reconnu et admir... | 26 | 34 |
| 2996 | Saindhav | 13 janvier 2024 | 2h 18min | [] | | [Venkatesh Daggubati, Nawazuddin Siddiqui] | Saindhav, parent célibataire, vit avec sa fill... | 30 | -- |
| 2997 | Nikita | 21 février 1990 | 1h 57min | [] | | [Anne Parillaud, Tchéky Karyo] | Le braquage d'une pharmacie par une bande de j... | 28 | 37 |
| 2998 | La Mort aux trousses | 21 octobre 1959 | 2h 16min | [] | | [Cary Grant, Eva Marie Saint] | Le publiciste Roger Tornhill se retrouve par e... | 46 | 43 |


To harmonize all the DataFrames, we rename the columns so that they share the same names across sites.

In [33]:
new_allocine = {
    'Title': 'Title',
    'Date': 'Date',
    'Duration':'Duration',
    'Genres':'Genres',
    'Director': 'Director',
    'Actors':'Crew',
    'Synopsis':'Synopsis',
    'Press Rating':'Note_Ac',
    'Audience Rating':'User_score_2'
}

df_allocine.rename(columns=new_allocine, inplace=True)
df_allocine.head()

| | Title | Date | Duration | Genres | Director | Crew | Synopsis | Note_Ac | User_score_2 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Wonka | 13 décembre 2023 | 1h 57min | [] | | [Timothée Chalamet, Calah Lane] | Découvrez la jeunesse de Willy Wonka, l’extrao... | 34 | 38 |
| 1 | Les Trois Mousquetaires: Milady | 13 décembre 2023 | 1h 55min | [] | | [François Civil, Vincent Cassel] | Du Louvre au Palais de Buckingham, des bas-fon... | 33 | 37 |
| 2 | Chasse gardée | 20 décembre 2023 | 1h 41min | [] | | [Didier Bourdon, Hakim Jemili] | Dans un village sans histoire, une maison de r... | 30 | 32 |
| 3 | Vermines | 27 décembre 2023 | 1h 45min | [] | | [Théo Christine, Sofia Lesaffre] | Face à une invasion d'araignées, les habitants... | 36 | 38 |
| 4 | La Tresse | 29 novembre 2023 | 1h 59min | [] | | [Kim Raver, Fotinì Peluso] | Trois vies, trois femmes, trois continents. Tr... | 24 | 42 |
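
Renaming the columns harmonizes the headers, but the `Duration` values still follow a different format on each site ("1 h 36 min." on SensCritique, "2h 22m" on IMDb, "1h 57min" on Allocine). The sketch below shows one possible way to convert them all to minutes. The helper name `duration_to_minutes` is hypothetical.

```python
import re


def duration_to_minutes(raw):
    """Convert '1 h 36 min.', '2h 22m', '1h 57min', '2h' or '30 min.' to minutes.

    Returns None when no duration can be found (e.g. 'N/A').
    """
    hours = re.search(r"(\d+)\s*h", raw)
    minutes = re.search(r"(\d+)\s*m", raw)
    if hours is None and minutes is None:
        return None
    total = int(hours.group(1)) * 60 if hours else 0
    total += int(minutes.group(1)) if minutes else 0
    return total


for example in ["1 h 36 min.", "2h 22m", "1h 57min", "2h", "30 min.", "N/A"]:
    print(example, "->", duration_to_minutes(example))
```

A numeric duration makes it possible to compare or merge the same film across the three DataFrames.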


We then save the DataFrame locally.

Total movies scraped from this site: 3000

In [146]:
df_allocine.to_csv('C:/Users/vim17/Desktop/ZooDiaque/df_allocine.csv', index=False)

### Scraping the website "RottenTomatoes": https://www.rottentomatoes.com/browse/

To scrape this site, we first analyzed the URLs of the various pages. As there are many pages for each genre, we proceed as follows: we use Selenium and BeautifulSoup to browse all the genres, retrieve the titles from the first 5 pages of each genre, and collect the URL of each movie.

In [63]:
genres = ["ACTION", "ADVENTURE", "ANIMATION", "ANIME", "BIOGRAPHY", "COMEDY", "CRIME", "DOCUMENTARY", "DRAMA",
          "ENTERTAINMENT", "FAITH & SPIRITUALITY", "FANTASY", "GAME SHOW", "LGBTQ+", "HEALTH & WELLNESS", "HISTORY",
          "HOLIDAY", "HORROR", "HOUSE & GARDEN", "KIDS & FAMILY", "MUSIC", "MUSICAL", "MYSTERY & THRILLER", "NATURE",
          "NEWS", "REALITY", "ROMANCE", "SCI-FI", "SHORT", "SOAP", "SPECIAL INTEREST", "SPORTS", "STAND-UP", "TALK SHOW",
          "TRAVEL", "VARIETY", "WAR", "WESTERN"]

all_titles = []
all_links = []
all_genres = []

user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Edge/91.0.864.37 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Firefox/89.0.2',
]

options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('enable-automation')
options.add_argument('--window-size=1920,1080')
options.add_argument('--disable-extensions')
options.add_argument('--dns-prefetch-disable')
options.add_argument('--disable-gpu')
options.add_argument('--remote-debugging-port=9222')

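# Selenium 4 takes the driver path through a Service object;
# the executable_path argument was deprecated and later removed.
from selenium.webdriver.chrome.service import Service

chromedriver_path = 'C:/Users/vim17/Desktop/A5/ChromeDriver/chromedriver.exe'
driver = webdriver.Chrome(service=Service(chromedriver_path), options=options)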

all_url_types = []

for genre in genres:
    for url_type in ["movies_at_home", "movies_in_theaters"]:
        url = f"https://www.rottentomatoes.com/browse/{url_type}/genres:{genre.lower()}~sort:popular?page=5"
        driver.get(url)
        
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        # Record the browse section per movie: computing it once after the loop
        # would label every row with the last url_type visited.
        url_type_label = "Movies at Home" if url_type == "movies_at_home" else "Movies in Theaters"

        for movie in soup.find_all('a', {'data-qa': 'discovery-media-list-item-caption'}):
            title = movie.find('span', {'data-qa': 'discovery-media-list-item-title'}).text.strip()
            link = "https://www.rottentomatoes.com" + movie['href']
            all_titles.append(title)
            all_links.append(link)
            all_genres.append(genre)
            all_url_types.append(url_type_label)

driver.quit()
df_rt = pd.DataFrame({'Genre': all_genres, 'Title': all_titles, 'Link': all_links, 'URL Type': all_url_types})
df_rt = df_rt.drop_duplicates(subset=['Title', 'Link'])
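
The `user_agents` list in the cell above is defined but never handed to the browser. As a minimal sketch, assuming the intent was to rotate user agents between runs, one entry could be picked at random and turned into Chrome's `--user-agent` switch. The helper name `user_agent_argument` is hypothetical, and only two of the strings are repeated here.

```python
import random

# Two of the strings from the `user_agents` list above.
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Firefox/89.0.2",
]

def user_agent_argument(agents, rng=random):
    """Return a Chrome command-line switch carrying one randomly chosen user agent."""
    return f"--user-agent={rng.choice(agents)}"

# A seeded generator keeps the demo reproducible.
argument = user_agent_argument(user_agents, random.Random(0))
print(argument)
```

In the notebook this would be applied with `options.add_argument(user_agent_argument(user_agents))` before the driver is created.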

Once this is done, we visit the page URL of each title and scrape all the important information.

In [64]:
all_years = []
all_runtimes = []
all_tomatometer_scores = []
all_audience_scores = []
all_synopses = []
all_distributors = []
all_directors = []
all_genres_all = []
all_crew = []

for index, row in df_rt.iterrows():       
    url = row['Link']
    response = requests.get(url)
    
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        
        release_date_theaters_tag = soup.find('b', {'data-qa': 'movie-info-item-label'}, text='Release Date (Theaters):')
        release_date_theaters = release_date_theaters_tag.find_next('span', {'data-qa': 'movie-info-item-value'}).text.strip() if release_date_theaters_tag else None
        release_date_streaming_tag = soup.find('b', {'data-qa': 'movie-info-item-label'}, text='Release Date (Streaming):')
        release_date_streaming = release_date_streaming_tag.find_next('span', {'data-qa': 'movie-info-item-value'}).text.strip() if release_date_streaming_tag else None
        year = release_date_theaters or release_date_streaming
        
        runtime_tag = soup.find('b', {'data-qa': 'movie-info-item-label'}, text='Runtime:')
        runtime = runtime_tag.find_next('span', {'data-qa': 'movie-info-item-value'}).text.strip() if runtime_tag else None

        score_board_tag = soup.find('score-board-deprecated', {'data-qa': 'score-panel'})
        tomatometer_score = score_board_tag.attrs.get('tomatometerscore') if score_board_tag else None
        audience_score = score_board_tag.attrs.get('audiencescore') if score_board_tag else None
        
        synopsis_tag = soup.find('p', {'data-qa': 'movie-info-synopsis'})
        synopsis = synopsis_tag.text.strip() if synopsis_tag else None
        
        distributor_tag = soup.find('b', {'data-qa': 'movie-info-item-label'}, text='Distributor:')
        distributor = distributor_tag.find_next('span', {'data-qa': 'movie-info-item-value'}).text.strip() if distributor_tag else None
        
        director_tag = soup.find('b', {'data-qa': 'movie-info-item-label'}, text='Director:')
        director = director_tag.find_next('span', {'data-qa': 'movie-info-item-value'}).text.strip() if director_tag else None
        
        genres_all_tag = soup.find('span', {'class': 'genre', 'data-qa': 'movie-info-item-value'})
        genres_all = [genre.strip() for genre in genres_all_tag.text.split(',')] if genres_all_tag else None
        
        crew_tags = soup.find_all('div', {'class': 'metadata'})
        crew = [c.find('p').text.strip() for c in crew_tags] if crew_tags else None
        
        all_years.append(year)
        all_runtimes.append(runtime)
        all_tomatometer_scores.append(tomatometer_score)
        all_audience_scores.append(audience_score)
        all_synopses.append(synopsis)
        all_distributors.append(distributor)
        all_directors.append(director)
        all_genres_all.append(genres_all)
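        all_crew.append(crew)
    else:
        # Append placeholders so every list keeps one entry per row of df_rt.
        for values in (all_years, all_runtimes, all_tomatometer_scores, all_audience_scores, all_synopses,
                       all_distributors, all_directors, all_genres_all, all_crew):
            values.append(None)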

df_extra_info = pd.DataFrame({
    'Year': all_years,
    'Runtime': all_runtimes,
    'Tomatometer Score': all_tomatometer_scores,
    'Audience Score': all_audience_scores,
    'Synopsis': all_synopses,
    'Distributor': all_distributors,
    'Director': all_directors,
    'All Genres': all_genres_all,
    'Crew': all_crew})

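# drop_duplicates left gaps in df_rt's index; reset it so rows line up with df_extra_info.
df_rt_f = pd.concat([df_rt.reset_index(drop=True), df_extra_info], axis=1)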
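
The parallel lists above stay aligned with `df_rt` only if every iteration appends exactly one value to each list. A minimal sketch of an alternative, assuming one dictionary per movie is acceptable: each link always produces a row, and a failed page yields a row of `None` values. `collect_details` and `fake_fetch` are hypothetical names; `fake_fetch` stands in for the requests and BeautifulSoup step, and the URLs are placeholders.

```python
def collect_details(links, fetch_details, fields):
    """Return one dict per link; pages that fail produce a row of None values."""
    rows = []
    for link in links:
        try:
            details = fetch_details(link)
        except Exception:
            details = None
        row = {field: None for field in fields}
        if details:
            row.update({k: v for k, v in details.items() if k in fields})
        rows.append(row)
    return rows

# Hypothetical stand-in for the real page download and parsing.
def fake_fetch(link):
    if link.endswith("/missing"):
        raise RuntimeError("HTTP 404")
    return {"Year": "Jan 12, 2024", "Runtime": "1h 44m"}

links = ["https://example.org/m/film-a", "https://example.org/m/missing"]
rows = collect_details(links, fake_fetch, fields=["Year", "Runtime", "Director"])
print(rows)
```

Building the frame with `pd.DataFrame(rows, index=df_rt.index)` would then give it the same index as `df_rt`, so a column-wise `pd.concat` lines up row by row.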

We then do some cleaning and display the df.

In [65]:
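# Drop the browse genre as well: 'All Genres' is renamed to 'Genre' in the next cell,
# and keeping both would create two columns with the same name.
df_rt_f = df_rt_f.drop(columns=['Genre', 'Link'])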
df_rt_f['Year'] = df_rt_f['Year'].str.replace('\n', '', regex=False)
df_rt_f['Year'] = df_rt_f['Year'].str.replace('limited', '', regex=False)
df_rt_f['Year'] = df_rt_f['Year'].str.replace('wide', '', regex=False)
df_rt_f
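
The three `str.replace` calls above strip known suffixes from the date string. A minimal sketch of a more direct alternative, assuming only the year is needed downstream: pull the first four-digit year out of the raw text with a regular expression. The helper name `extract_year` and the sample inputs are illustrative assumptions about what the scraped strings look like.

```python
import re

def extract_year(raw_date):
    """Return the first four-digit year (1800-2099) in a scraped date string, or None."""
    if not raw_date:
        return None
    match = re.search(r"\b(?:18|19|20)\d{2}\b", raw_date)
    return int(match.group(0)) if match else None

print(extract_year("Jan 12, 2024\n limited"))
print(extract_year("29 novembre 2023"))
```

With pandas this could be applied as `df_rt_f['Year'].map(extract_year)`.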

In order to harmonize all df's, we have renamed the columns to make them easier to understand.

In [66]:
new_rt = {
    'Titre': 'Title',
    'Year':'Date',
    'Runtime':'Duration',
    'Tomatometer Score':'Note_Rt',
    'Audience Score':'User_score_3',
    'Synopsis':'Synopsis',
    'Distributor':'Distributor',
    'Director':'Director',
    'All Genres':'Genre',
    'Crew':'Crew'
}

df_rt_f.rename(columns=new_rt, inplace=True)
df_rt_f.head()

Unnamed: 0,Title,Date,Duration,Note_Rt,User_score_3,Synopsis,Distributor,Director,Genre,Crew
0,Lift,"Jan 12, 2024",1h 44m,32.0,33.0,An international heist crew is recruited to pr...,Netflix,F. Gary Gray,"['Action', 'Comedy', 'Crime', 'Drama']","['Kevin Hart', 'Gugu Mbatha-Raw', ""Vincent D'O..."
1,Napoleon,"Nov 22, 2023",2h 38m,58.0,59.0,"""Napoleon"" is a spectacle-filled action epic t...",Apple Original Films / Columbia Pictures,Ridley Scott,"['Biography', 'History', 'Drama', 'War', 'Acti...","['Joaquin Phoenix', 'Vanessa Kirby', 'Ben Mile..."
2,Rebel Moon: Part One - A Child of Fire,"Dec 15, 2023",2h 15m,23.0,59.0,"From Zack Snyder, the filmmaker behind 300, Ma...",Netflix,Zack Snyder,"['Sci-fi', 'Action', 'Adventure', 'Drama', 'Fa...","['Sofia Boutella', 'Djimon Hounsou', 'Ed Skrei..."
3,The Hunger Games: The Ballad of Songbirds & Sn...,"Nov 17, 2023",2h 37m,64.0,89.0,Experience the story of THE HUNGER GAMES -- 64...,Lionsgate,Francis Lawrence,"['Action', 'Adventure', 'Sci-fi']","['Tom Blyth', 'Rachel Zegler', 'Peter Dinklage..."
4,The Creator,"Sep 29, 2023",2h 13m,66.0,76.0,"From writer/director Gareth Edwards (""Rogue On...",20th Century Studios,Gareth Edwards,"['Sci-fi', 'Action', 'Adventure']","['John David Washington', 'Gemma Chan', 'Ken W..."


We then save the df locally.

Total movies scraped from this site: 1750

In [145]:
df_rt_f.to_csv('C:/Users/vim17/Desktop/ZooDiaque/df_rt_f.csv', index=False)

### Scraping of the website "Metacritic" : https://www.metacritic.com/browse/movie/?releaseYearMin=1910&releaseYearMax=2024&page=1

To scrape this site, we proceed in the same way as before: after analyzing the site, we first retrieve, from the first 100 pages, every title together with the link to that movie's page.

In [45]:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

all_titles = []
all_release_dates = []
all_metascores = []
all_urls = []

for page_num in range(1, 101):
    url = f"https://www.metacritic.com/browse/movie/?releaseYearMin=1910&releaseYearMax=2024&page={page_num}"
    response = requests.get(url, headers=headers)

    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        product_containers = soup.find_all('a', class_='c-finderProductCard_container g-color-gray80 u-grid')

        for container in product_containers:
            card = container.find_parent('div', class_='c-finderProductCard')
            
            title_div = card.find('div', {'data-title': True})
            title = title_div.get('data-title') if title_div else None

            release_date_span = card.find('span', class_='u-text-uppercase')
            release_date = release_date_span.text.strip() if release_date_span else None

            meta_score_div = card.find('div', class_='c-siteReviewScore_xsmall')
            meta_score = meta_score_div.find('span').text.strip() if meta_score_div else None

            url = "https://www.metacritic.com" + container.get('href') if container and 'href' in container.attrs else None

            all_titles.append(title)
            all_release_dates.append(release_date)
            all_metascores.append(meta_score)
            all_urls.append(url)
    else:
        print(f"Failed to retrieve page {page_num}. Status code:", response.status_code)

df_mc = pd.DataFrame({'Title': all_titles, 'Release Date': all_release_dates, 'Metascore': all_metascores, 'URL': all_urls})

Then, for the URL of each title, we once again retrieve the remaining information we need.

In [46]:
session = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
session.headers.update(headers)

all_user_scores = []
all_summaries = []
all_directors = []
all_production_companies = []
all_durations = []
all_genres = []
all_cast_crew = []

df_mc_2 = df_mc.head(1000)

for index, row in df_mc_2.iterrows():
    url = row['URL']
    
    try:
        response = session.get(url)
        response.raise_for_status()
        
        soup = BeautifulSoup(response.text, 'html.parser')

        user_score_tag = soup.find('div', {'class': 'c-siteReviewScore_user'})
        user_score = user_score_tag.find('span').text.strip() if user_score_tag else None

        summary_tag = soup.find('span', {'class': 'c-productDetails_description'})
        summary = summary_tag.text.strip() if summary_tag else None

        director_tag = soup.find('div', {'class': 'c-crewList'})
        director = director_tag.find('a', {'class': 'c-crewList_link'}).text.strip() if director_tag else None

        production_details_tag = soup.find('div', {'class': 'c-ProductionDetails'})

        production_companies_tag = production_details_tag.find('span', {'class': 'g-text-bold'}, text='Production Company')
        production_companies = [company.strip() for company in production_companies_tag.find_next('span').text.split(',')] if production_companies_tag else None

        duration_tag = production_details_tag.find('span', {'class': 'g-text-bold'}, text='Duration')
        duration = duration_tag.find_next('span').text.strip() if duration_tag else None

        genres_tag = production_details_tag.find('span', {'class': 'g-text-bold'}, text='Genres')
        genres = [genre.strip() for genre in genres_tag.find_next('ul').text.split(',')] if genres_tag else None

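        # Release dates were already collected from the listing pages, so they are not appended here.
        all_user_scores.append(user_score)
        all_summaries.append(summary)
        all_directors.append(director)
        all_production_companies.append(production_companies)
        all_durations.append(duration)
        all_genres.append(genres)

    except Exception as e:
        print(f"Failed to retrieve details for {row['Title']}. Error: {e}")
        # Append placeholders so the lists stay aligned with the rows of df_mc_2.
        for values in (all_user_scores, all_summaries, all_directors,
                       all_production_companies, all_durations, all_genres):
            values.append(None)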

df_mc_extra_info = pd.DataFrame({
    'User Score': all_user_scores,
    'Summary': all_summaries,
    'Director': all_directors,
    'Production Companies': all_production_companies,
    'Duration': all_durations,
    'Genres': all_genres
})

df_mc_f = pd.concat([df_mc_2, df_mc_extra_info], axis=1)

We display the df.

In [47]:
df_mc_f = df_mc_f.drop(columns=['URL'])
df_mc_f

Unnamed: 0,Title,Release Date,Metascore,User Score,Summary,Director,Production Companies,Duration,Genres
0,Dekalog (1988),"Mar 22, 1996",100,7.6,This masterwork by Krzysztof Kieślowski is one...,Krzysztof Kieslowski,"[Telewizja Polska (TVP), Zespol Filmowy ""Tor"",...",9 h 32 m,[Drama]
1,Rear Window,"Sep 1, 1954",100,8.7,A wheelchair-bound photographer spies on his n...,Alfred Hitchcock,[Alfred J. Hitchcock Productions],1 h 52 m,[Mystery\n \n Thriller]
2,The Godfather,"Mar 24, 1972",100,9.3,Francis Ford Coppola's epic features Marlon Br...,Francis Ford Coppola,"[Paramount Pictures, Albert S. Ruddy Productio...",2 h 55 m,[Crime\n \n Drama]
3,The Conformist,"Oct 22, 1970",100,7.4,"Set in Rome in the 1930s, this re-release of B...",Bernardo Bertolucci,"[Mars Film, Marianne Productions, Maran Film]",1 h 53 m,[Drama]
4,The Leopard (re-release),"Aug 13, 2004",100,7.8,"Set in Sicily in 1860, Luchino Visconti's spec...",Luchino Visconti,"[Titanus, Société Nouvelle Pathé Cinéma, Socié...",3 h 6 m,[Drama\n \n History]
...,...,...,...,...,...,...,...,...,...
995,Momma's Man,"Aug 22, 2008",84,tbd,Momma’s Man chronicles the increasingly anxiou...,Azazel Jacobs,[Artists Public Domain],1 h 34 m,[Comedy\n \n Drama]
996,You Were Never Really Here,"Apr 6, 2018",84,7.3,"A traumatized veteran, unafraid of violence, t...",Lynne Ramsay,"[Why Not Productions, Film4, British Film Inst...",1 h 29 m,[Crime\n \n Drama\n \n ...
997,In the Name of the Father,"Dec 29, 1993",84,8.1,Academy Award winner Daniel Day-Lewis gives an...,Jim Sheridan,"[Hell's Kitchen Films, Universal Pictures]",2 h 13 m,[Biography\n \n Crime\n \...
998,The Eternal Memory,"Aug 11, 2023",84,9.3,Augusto and Paulina have been together for 25 ...,Maite Alberdi,"[Fabula, Micromundo Producciones, Chicken And ...",1 h 25 m,[Documentary]


In order to harmonize all df's, we have renamed the columns to make them easier to understand.

In [48]:
new_mc = {
    'Titre': 'Title',
    'Release Date' : 'Date',
    'Metascore':'Note_Mc',
    'User Score':'User_score_4',
    'Summary':'Synopsis',
    'Director':'Director',
    'Production Companies':'Distributor',
    'Duration':'Duration',
    'Genres':'Genre'
}

df_mc_f.rename(columns=new_mc, inplace=True)
df_mc_f.head()

Unnamed: 0,Title,Date,Note_Mc,User_score_4,Synopsis,Director,Distributor,Duration,Genre
0,Dekalog (1988),"Mar 22, 1996",100,7.6,This masterwork by Krzysztof Kieślowski is one...,Krzysztof Kieslowski,"[Telewizja Polska (TVP), Zespol Filmowy ""Tor"",...",9 h 32 m,[Drama]
1,Rear Window,"Sep 1, 1954",100,8.7,A wheelchair-bound photographer spies on his n...,Alfred Hitchcock,[Alfred J. Hitchcock Productions],1 h 52 m,[Mystery\n \n Thriller]
2,The Godfather,"Mar 24, 1972",100,9.3,Francis Ford Coppola's epic features Marlon Br...,Francis Ford Coppola,"[Paramount Pictures, Albert S. Ruddy Productio...",2 h 55 m,[Crime\n \n Drama]
3,The Conformist,"Oct 22, 1970",100,7.4,"Set in Rome in the 1930s, this re-release of B...",Bernardo Bertolucci,"[Mars Film, Marianne Productions, Maran Film]",1 h 53 m,[Drama]
4,The Leopard (re-release),"Aug 13, 2004",100,7.8,"Set in Sicily in 1860, Luchino Visconti's spec...",Luchino Visconti,"[Titanus, Société Nouvelle Pathé Cinéma, Socié...",3 h 6 m,[Drama\n \n History]


We then save the df locally.

Total movies scraped from this site: 1000

In [144]:
df_mc_f.to_csv('C:/Users/vim17/Desktop/ZooDiaque/df_mc_f.csv', index=False)

# II) Merging the movie datasets

In this section, we clean, harmonize and merge the df's to obtain a complete dataset for our site, gathering all the information retrieved so far.

Each of the following cells prepares one df for merging.

To do this, we apply several cleaning steps to the datasets:

-We give all the df's the same columns, except for the rating columns, which are specific to each site.

-We replace null values.

-We set every column to the correct format, in terms of both type and content.
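
The duration strings differ from site to site (for example `1h 59min`, `9 h 32 m` or `30m` in the outputs of this notebook). As a minimal sketch of the "correct format" step, assuming a numeric duration in minutes is useful for the final dataset, a single regular expression can cover these variants. The helper name `duration_to_minutes` is hypothetical and is not part of the cells below.

```python
import re

DURATION = re.compile(r"\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*m(?:in)?)?\.?\s*")

def duration_to_minutes(text):
    """Convert strings such as '1h 59min' or '9 h 32 m' to minutes; None if unparseable."""
    if not text:
        return None
    match = DURATION.fullmatch(text)
    if not match or not any(match.groups()):
        return None
    hours = int(match.group(1) or 0)
    minutes = int(match.group(2) or 0)
    return hours * 60 + minutes

for sample in ["1h 59min", "9 h 32 m", "30m", "2h 13m", "N/A"]:
    print(sample, "->", duration_to_minutes(sample))
```

A numeric column also makes it possible to sort or filter by length, which the text form does not allow.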

In [120]:
df_S = df_sc.copy()
df_S['Title'] = df_S['Title'].str.replace(r'\([^)]*\)', '', regex=True)
df_S['Date'] = df_S['Date'].str[-4:]
df_S['Note_Sc'] = pd.to_numeric(df_S['Note_Sc'], errors='coerce')
df_S['Duration'] = df_S['Duration'].str.replace(r' min', 'm', regex=True)
df_S['Duration'] = df_S['Duration'].str.replace(r' h', 'h', regex=True)
df_S['Duration'] = df_S['Duration'].str.rstrip('.')
df_S['Crew'] = ''
df_S['Distributor'] = ''
df_S = df_S.apply(lambda col: col.str.replace(r'N/A', 'NaN', regex=True) if col.dtype == 'O' else col)
df_S = df_S[['Title', 'Date', 'Genre', 'Duration', 'Synopsis', 'Director', 'Crew', 'Distributor', 'Note_Sc']]
df_S.head()

Unnamed: 0,Title,Date,Genre,Duration,Synopsis,Director,Crew,Distributor,Note_Sc
0,Threat Level Midnight,2011,Comédie,30m,"Après 11 ans de préparation, Michael Scott nou...",Tucker Gates,,,8.7
1,Douze Hommes en colère,1957,"Policier, Drame",1h 36m,Un jeune homme d’origine modeste est accusé du...,Sidney Lumet,,,8.7
2,Harakiri,1963,Drame,2h 13m,"Au XVIIe siècle, le Japon n’est plus en guerre...",Masaki Kobayashi,,,8.6
3,Blade Runner : The Final Cut,2007,Science-fiction,1h 57m,"Dans les dernières années du 20ème siècle, des...",Ridley Scott,,,8.6
4,"Le Bon, la Brute et le Truand",1968,"Western, Aventure",2h 59m,Un chasseur de primes rejoint deux hommes dans...,Sergio Leone,,,8.5


In [127]:
df_I = df_imdb.copy()
df_I['Note_Imdb'] = df_I['Note_Imdb'].str.replace(r'\([^)]*\)', '', regex=True)
df_I['Note_Imdb'] = df_I['Note_Imdb'].str.replace(r',', '.', regex=True)
df_I['Note_Imdb'] = df_I['Note_Imdb'].str.replace(r' ', '', regex=True)
df_I['Note_Imdb'] = df_I['Note_Imdb'].str[:3]
df_I['Note_Imdb'] = pd.to_numeric(df_I['Note_Imdb'], errors='coerce')
df_I['User_score_1'] = pd.to_numeric(df_I['User_score_1'], errors='coerce')
df_I['User_score_1'] = df_I['User_score_1'].apply(lambda x: x / 10 if pd.notna(x) else np.nan)
df_I['Genre'] = ''
df_I['Director'] = ''
df_I['Crew'] = ''
df_I['Distributor'] = ''
df_I = df_I.apply(lambda col: col.str.replace(r'N/A', 'NaN', regex=True) if col.dtype == 'O' else col)
df_I = df_I[['Title', 'Date', 'Genre', 'Duration', 'Synopsis', 'Director', 'Crew', 'Distributor', 'Note_Imdb', 'User_score_1']]
df_I.head()

Unnamed: 0,Title,Date,Genre,Duration,Synopsis,Director,Crew,Distributor,Note_Imdb,User_score_1
0,Les évadés,1994,,2h 22m,Le banquier Andy Dufresne est arrêté pour avoi...,,,,9.3,8.2
1,12th Fail,2023,,2h 27m,"L'histoire réelle de Manoj Kumar Sharma, offic...",,,,9.2,
2,Le Parrain,1972,,2h 55m,Le patriarche vieillissant d'une dynastie de l...,,,,9.2,10.0
3,The Dark Knight : Le Chevalier noir,2008,,2h 32m,Lorsqu'une menace mieux connue sous le nom du ...,,,,9.0,8.4
4,Le Seigneur des anneaux : Le Retour du roi,2003,,3h 21m,Gandalf et Aragorn mènent le monde des hommes ...,,,,9.0,9.4


In [72]:
df_A = df_allocine.copy()
df_A['Date'] = df_A['Date'].str[-4:]
df_A['Duration'] = df_A['Duration'].str.replace(r' min', 'm', regex=True)
df_A['Genre'] = df_A['Genres'].apply(lambda x: ', '.join(x))
df_A = df_A.drop(columns=['Genres'])
df_A['Crew'] = df_A['Crew'].apply(lambda x: ', '.join(x))
df_A['Note_Ac'] = pd.to_numeric(df_A['Note_Ac'].str.replace(',', '.'), errors='coerce')
df_A['User_score_2'] = pd.to_numeric(df_A['User_score_2'].str.replace(',', '.'), errors='coerce')
df_A['Distributor']=''
df_A = df_A.apply(lambda col: col.str.replace(r'N/A', 'NaN', regex=True) if col.dtype == 'O' else col)
df_A = df_A[['Title', 'Date', 'Genre', 'Duration', 'Synopsis', 'Director', 'Crew', 'Distributor', 'Note_Ac', 'User_score_2']]
df_A.head()

Unnamed: 0,Title,Date,Genre,Duration,Synopsis,Director,Crew,Distributor,Note_Ac,User_score_2
0,Wonka,2023,,1h 57min,"Découvrez la jeunesse de Willy Wonka, l’extrao...",,"Timothée Chalamet, Calah Lane",,3.4,3.8
1,Les Trois Mousquetaires: Milady,2023,,1h 55min,"Du Louvre au Palais de Buckingham, des bas-fon...",,"François Civil, Vincent Cassel",,3.3,3.7
2,Chasse gardée,2023,,1h 41min,"Dans un village sans histoire, une maison de r...",,"Didier Bourdon, Hakim Jemili",,3.0,3.2
3,Vermines,2023,,1h 45min,"Face à une invasion d'araignées, les habitants...",,"Théo Christine, Sofia Lesaffre",,3.6,3.8
4,La Tresse,2023,,1h 59min,"Trois vies, trois femmes, trois continents. Tr...",,"Kim Raver, Fotinì Peluso",,2.4,4.2


In [137]:
df_T = df_rt_f.copy()
#df_T['Date'] = df_T['Date'].str.replace(r' ', '', regex=True)
df_T['Date'] = df_T['Date'].str[-5:]
df_T['Genre'] = df_T['Genre'].apply(lambda x: ', '.join(x) if isinstance(x, list) else x)
df_T['Crew'] = df_T['Crew'].apply(lambda x: ', '.join(x) if isinstance(x,list) else x)
df_T['Note_Rt'] = pd.to_numeric(df_T['Note_Rt'], errors='coerce')
df_T['Note_Rt'] = df_T['Note_Rt'].apply(lambda x: x / 10 if pd.notna(x) else np.nan)
df_T['User_score_3'] = pd.to_numeric(df_T['User_score_3'], errors='coerce')
df_T['User_score_3'] = df_T['User_score_3'].apply(lambda x: x / 10 if pd.notna(x) else np.nan)
df_T = df_T.apply(lambda col: col.str.replace(r'N/A', 'NaN', regex=True) if col.dtype == 'O' else col)
df_T = df_T[['Title', 'Date', 'Genre', 'Duration', 'Synopsis', 'Director', 'Crew', 'Distributor', 'Note_Rt', 'User_score_3']]
df_T.head()

Unnamed: 0,Title,Date,Genre,Duration,Synopsis,Director,Crew,Distributor,Note_Rt,User_score_3
0,Lift,2024,"['Action', 'Comedy', 'Crime', 'Drama']",1h 44m,An international heist crew is recruited to pr...,F. Gary Gray,"['Kevin Hart', 'Gugu Mbatha-Raw', ""Vincent D'O...",Netflix,3.2,3.3
1,Napoleon,2023,"['Biography', 'History', 'Drama', 'War', 'Acti...",2h 38m,"""Napoleon"" is a spectacle-filled action epic t...",Ridley Scott,"['Joaquin Phoenix', 'Vanessa Kirby', 'Ben Mile...",Apple Original Films / Columbia Pictures,5.8,5.9
2,Rebel Moon: Part One - A Child of Fire,2023,"['Sci-fi', 'Action', 'Adventure', 'Drama', 'Fa...",2h 15m,"From Zack Snyder, the filmmaker behind 300, Ma...",Zack Snyder,"['Sofia Boutella', 'Djimon Hounsou', 'Ed Skrei...",Netflix,2.3,5.9
3,The Hunger Games: The Ballad of Songbirds & Sn...,2023,"['Action', 'Adventure', 'Sci-fi']",2h 37m,Experience the story of THE HUNGER GAMES -- 64...,Francis Lawrence,"['Tom Blyth', 'Rachel Zegler', 'Peter Dinklage...",Lionsgate,6.4,8.9
4,The Creator,2023,"['Sci-fi', 'Action', 'Adventure']",2h 13m,"From writer/director Gareth Edwards (""Rogue On...",Gareth Edwards,"['John David Washington', 'Gemma Chan', 'Ken W...",20th Century Studios,6.6,7.6


In [133]:
df_M = df_mc_f.copy()
df_M['Title'] = df_M['Title'].str.replace(r'\([^)]*\)', '', regex=True)
df_M['Date'] = df_M['Date'].str.replace(r' ', '', regex=True)
df_M['Date'] = df_M['Date'].str[-4:]
df_M['Note_Mc'] = pd.to_numeric(df_M['Note_Mc'], errors='coerce')
df_M['Note_Mc'] = df_M['Note_Mc'].apply(lambda x: x / 10 if pd.notna(x) else np.nan)
df_M['User_score_4'] = pd.to_numeric(df_M['User_score_4'], errors='coerce')
df_M['Distributor'] = df_M['Distributor'].apply(lambda x: ', '.join(x) if isinstance(x,list) else x)
df_M['Genre'] = df_M['Genre'].apply(lambda x: ', '.join(x) if isinstance(x, list) else x)
df_M['Genre'] = df_M['Genre'].str.replace(r'\n', ', ', regex=True)
df_M['Duration'] = df_M['Duration'].str.replace(r' ,', '', regex=True)
df_M['Duration'] = df_M['Duration'].str.replace(r' h', 'h', regex=True)
df_M['Crew']=''
df_M = df_M.apply(lambda col: col.str.replace(r'N/A', 'NaN', regex=True) if col.dtype == 'O' else col)
df_M = df_M[['Title', 'Date', 'Genre', 'Duration', 'Synopsis', 'Director', 'Crew', 'Distributor', 'Note_Mc', 'User_score_4']]
df_M.head()

Unnamed: 0,Title,Date,Genre,Duration,Synopsis,Director,Crew,Distributor,Note_Mc,User_score_4
0,Dekalog,1996,Drama,9h 32 m,This masterwork by Krzysztof Kieślowski is one...,Krzysztof Kieslowski,,"Telewizja Polska (TVP), Zespol Filmowy ""Tor"", ...",10.0,7.6
1,Rear Window,1954,"Mystery, , Thriller",1h 52 m,A wheelchair-bound photographer spies on his n...,Alfred Hitchcock,,Alfred J. Hitchcock Productions,10.0,8.7
2,The Godfather,1972,"Crime, , Drama",2h 55 m,Francis Ford Coppola's epic features Marlon Br...,Francis Ford Coppola,,"Paramount Pictures, Albert S. Ruddy Production...",10.0,9.3
3,The Conformist,1970,Drama,1h 53 m,"Set in Rome in the 1930s, this re-release of B...",Bernardo Bertolucci,,"Mars Film, Marianne Productions, Maran Film",10.0,7.4
4,The Leopard,2004,"Drama, , History",3h 6 m,"Set in Sicily in 1860, Luchino Visconti's spec...",Luchino Visconti,,"Titanus, Société Nouvelle Pathé Cinéma, Sociét...",10.0,7.8


Once that's done, we merge the df's. A simple merge would not work here, so we start from one dataset and fill in its missing values from the other datasets.

We then create our two new variables, Note_Z and Note_U: the averages of the site scores and of the user scores, respectively, across the different sources.
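
For an average across sites to be meaningful, every score has to be on the same scale first. The sketch below works on raw site scores and assumes the native maxima listed in `scales`; in particular, the AlloCiné values shown earlier appear to be marks out of 5. The helper name `mean_on_ten` and the sample values are illustrative only.

```python
def mean_on_ten(scores, scales):
    """Average the available scores after rescaling each one to a 0-10 range."""
    rescaled = [
        value * 10 / scales[name]
        for name, value in scores.items()
        if value is not None
    ]
    return round(sum(rescaled) / len(rescaled), 1) if rescaled else None

# Assumed native maximum of each site score (not taken from the notebook).
scales = {"Note_Sc": 10, "Note_Imdb": 10, "Note_Ac": 5, "Note_Rt": 100, "Note_Mc": 100}

# Illustrative values: 3.4 out of 5 and 32 out of 100, with one missing score.
example = {"Note_Ac": 3.4, "Note_Rt": 32, "Note_Mc": None}
note_z = mean_on_ten(example, scales)
print(note_z)
```

Missing scores are skipped rather than counted as zero, which matches the default behaviour of `DataFrame.mean(axis=1)`.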

In [160]:
df_all = df_T.copy()
df_all['Note_Sc'] = None
df_all['Note_Imdb'] = None
df_all['Note_Rt'] = None
df_all['Note_Mc'] = None

dfs_to_combine = [df_A, df_I, df_M, df_S]
for df in dfs_to_combine:
    for col in df.columns:
        if col not in df_all.columns:
            df_all[col] = None
        df_all[col] = df_all[col].combine_first(df[col])

df_all['Genre'] = df_all['Genre'].fillna(pd.NA)
df_all['Note_Z'] = df_all[['Note_Sc', 'Note_Imdb', 'Note_Ac', 'Note_Rt', 'Note_Mc']].mean(axis=1).round(1)
df_all['Note_U'] = df_all[['User_score_1', 'User_score_2', 'User_score_3', 'User_score_4']].mean(axis=1).round(1)

final_columns = ['Note_Z', 'Note_U', 'Title', 'Date', 'Genre', 'Duration', 'Synopsis', 'Director', 'Crew', 'Distributor',
                 'Note_Sc', 'Note_Imdb', 'Note_Ac', 'Note_Rt', 'Note_Mc',  'User_score_1', 'User_score_2', 'User_score_3', 'User_score_4']

df_movies = df_all[final_columns]
df_movies = df_movies.drop(columns=['Note_Sc', 'Note_Imdb', 'Note_Ac', 'Note_Rt', 'Note_Mc', 'User_score_1', 'User_score_2','User_score_3','User_score_4'])
df_movies = df_movies.sort_values(by='Note_Z', ascending = False)
df_movies['Distributor'] = df_movies['Distributor'].fillna(pd.NA)
df_movies.head()

Unnamed: 0,Note_Z,Note_U,Title,Date,Genre,Duration,Synopsis,Director,Crew,Distributor
0,7.8,5.7,Lift,2024,"['Action', 'Comedy', 'Crime', 'Drama']",1h 44m,An international heist crew is recruited to pr...,F. Gary Gray,"['Kevin Hart', 'Gugu Mbatha-Raw', ""Vincent D'O...",Netflix
12,7.8,7.4,Mission: Impossible - Dead Reckoning Part One,2023,"['Action', 'Adventure', 'Mystery & thriller']",2h 43m,In Mission: Impossible - Dead Reckoning Part O...,Christopher McQuarrie,"['Tom Cruise', 'Hayley Atwell', 'Ving Rhames',...",Paramount Pictures
3,7.8,7.1,The Hunger Games: The Ballad of Songbirds & Sn...,2023,"['Action', 'Adventure', 'Sci-fi']",2h 37m,Experience the story of THE HUNGER GAMES -- 64...,Francis Lawrence,"['Tom Blyth', 'Rachel Zegler', 'Peter Dinklage...",Lionsgate
1,7.8,6.1,Napoleon,2023,"['Biography', 'History', 'Drama', 'War', 'Acti...",2h 38m,"""Napoleon"" is a spectacle-filled action epic t...",Ridley Scott,"['Joaquin Phoenix', 'Vanessa Kirby', 'Ben Mile...",Apple Original Films / Columbia Pictures
6,7.8,7.3,The Equalizer 3,2023,"['Action', 'Mystery & thriller']",1h 49m,Since giving up his life as a government assas...,Antoine Fuqua,"['Denzel Washington', 'Dakota Fanning', 'David...",Columbia Pictures
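
One caveat about the merge cell above: `combine_first` aligns two Series on their index labels, not on the `Title` column, so row 0 of one site is combined with row 0 of another even when the films differ. A minimal sketch of a title-keyed alternative follows, assuming that matching on a lightly normalised title is acceptable. `normalise_title` and `fill_from` are hypothetical helpers, and the demo titles and scores are placeholders. Titles that differ between languages would still not match.

```python
import pandas as pd

def normalise_title(title):
    """Lower-case and collapse whitespace so the same title matches across sites."""
    return " ".join(str(title).lower().split())

def fill_from(base, other, columns):
    """Fill missing values of `base` from rows of `other` that share a title."""
    key = base["Title"].map(normalise_title)
    lookup = (
        other.assign(_key=other["Title"].map(normalise_title))
        .drop_duplicates("_key")
        .set_index("_key")
    )
    result = base.copy()
    for col in columns:
        candidate = key.map(lookup[col])  # value from `other` for the same title
        result[col] = result[col].fillna(candidate) if col in result.columns else candidate
    return result

# Placeholder data, not scraped values.
df_base_demo = pd.DataFrame({"Title": ["Film A", "Film B"], "Note_Rt": [5.8, 3.2]})
df_other_demo = pd.DataFrame({"Title": ["film b ", "Film C"], "Note_Mc": [4.0, 10.0]})
merged = fill_from(df_base_demo, df_other_demo, ["Note_Mc"])
print(merged)
```

Only "Film B" receives a `Note_Mc` value here, because it is the only title present in both frames.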


We thus obtain our first almost complete df, and save it.

In [161]:
df_movies.to_csv('C:/Users/vim17/Desktop/ZooDiaque/df_movies.csv', index=False)

# III) Scraping of the VSS Websites

As a second step, and as explained in our specifications, we looked into VSS and searched for reliable sources that publish lists on the subject. Our aim is to obtain a list of names with their source, which we can then cross-reference with our df_movies.

We encountered a number of difficulties here, which underline the importance of our problem: information about VSS is hard to find. When we simply search for a person's name on the Internet, information about the accusations against them is rarely on the first page, and lists of accused people are few or nonexistent. This reinforces our desire to offer this option, in line with our values.

So far, we have identified three such sites, and we hope to enrich the database with other sources in the future.
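
The cross-referencing with df_movies mentioned above could look like the following minimal sketch. It assumes that `Director` holds a single name and that `Crew` is a plain comma-separated string; the function name `listed_people_in` and all names in the demo are placeholders.

```python
def listed_people_in(movie, listed_names):
    """Return the listed names credited as director or crew of one movie row."""
    crew = movie.get("Crew") or ""
    people = [movie.get("Director") or ""] + crew.split(",")
    # Normalise spacing and case before comparing.
    credited = {" ".join(p.split()).casefold() for p in people if p.strip()}
    return sorted(
        name for name in listed_names
        if " ".join(name.split()).casefold() in credited
    )

# Placeholder row and list, not real data.
movie = {"Title": "Example Film", "Director": "Jane Placeholder", "Crew": "Sam Sample, Alex Example"}
listed = ["Alex Example", "Someone Else"]
matches = listed_people_in(movie, listed)
print(matches)
```

Exact matching on full names is deliberately strict: it avoids flagging a film because of a partial or similar name, at the cost of missing spelling variants.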

### Scraping of the following websites:
https://www.senscritique.com/liste/les_porcs/2207124

https://www.journaldequebec.com/2017/11/01/accusations-de-nature-sexuelle-au-banc-des-accuses

https://www.radiofrance.fr/franceinter/a-hollywood-la-liste-des-harceleurs-presumes-s-allonge-8515300

For each of these sites, we retrieve the names identified by tags or bold text, and process them.

In [106]:
url_base = "https://www.senscritique.com/liste/les_porcs/2207124?page="
all_people = []

total_pages = 2

for page_number in range(1, total_pages + 1):
    url = url_base + str(page_number)
    response = requests.get(url)

    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')

        for person_div in soup.find_all('div', class_='sc-ff17bd40-0 hlYJbi'):
            name = person_div.find('h3', class_='sc-e6f263fc-0 cTitel').find('a').text.strip()
            all_people.append({'Name': name})

    else:
        print(f"Failed to retrieve the page {url}. Status code: {response.status_code}")

df_vss1 = pd.DataFrame(all_people)

In [107]:
df_vss1['Url']=url_base
df_vss1.head()

Unnamed: 0,Name,Url
0,Harvey Weinstein,https://www.senscritique.com/liste/les_porcs/2...
1,Kevin Spacey,https://www.senscritique.com/liste/les_porcs/2...
2,Jean Lassalle,https://www.senscritique.com/liste/les_porcs/2...
3,Louis C.K.,https://www.senscritique.com/liste/les_porcs/2...
4,James Franco,https://www.senscritique.com/liste/les_porcs/2...


In [108]:
url = "https://www.journaldequebec.com/2017/11/01/accusations-de-nature-sexuelle-au-banc-des-accuses"
response = requests.get(url)

if response.status_code == 200:
    soup = BeautifulSoup(response.content, 'html.parser')

    bold_texts = []

    for strong_tag in soup.find_all('strong'):
        bold_texts.append(' '.join(strong_tag.text.strip().split()[:2]))

    df_vss2 = pd.DataFrame({'BoldTexts': bold_texts})
    df_vss2['BoldTexts'] = df_vss2['BoldTexts'].str[:50]
    df_vss2 = df_vss2.iloc[1:]
    df_vss2['BoldTexts'] = df_vss2['BoldTexts'].str.replace(r'Acteur', '', regex=True)

else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")

In [109]:
df_vss2['Url']=url
df_vss2.rename(columns={'BoldTexts':'Name','Url':'Url'}, inplace=True)
df_vss2

Unnamed: 0,Name,Url
1,Harvey Weinstein,https://www.journaldequebec.com/2017/11/01/acc...
2,James Toback,https://www.journaldequebec.com/2017/11/01/acc...
3,Kevin Spacey,https://www.journaldequebec.com/2017/11/01/acc...
4,Brett Ratner,https://www.journaldequebec.com/2017/11/01/acc...
5,Terry Richardson,https://www.journaldequebec.com/2017/11/01/acc...
6,Mark Halperin,https://www.journaldequebec.com/2017/11/01/acc...
7,Roy Price,https://www.journaldequebec.com/2017/11/01/acc...
8,Andy Dick,https://www.journaldequebec.com/2017/11/01/acc...
9,Dustin Hoffman,https://www.journaldequebec.com/2017/11/01/acc...
10,Steven Seagal,https://www.journaldequebec.com/2017/11/01/acc...


In [110]:
url = "https://www.radiofrance.fr/franceinter/a-hollywood-la-liste-des-harceleurs-presumes-s-allonge-8515300"
response = requests.get(url)

if response.status_code == 200:
    soup = BeautifulSoup(response.content, 'html.parser')

    bold_texts = []

    for strong_tag in soup.find_all('strong'):
        bold_texts.append(strong_tag.text.strip())

    df_vss3 = pd.DataFrame({'BoldTexts': bold_texts})
    df_vss3 = df_vss3[df_vss3['BoldTexts'].str.contains(r'[A-Z]') & ~df_vss3['BoldTexts'].str.contains(r'\bhommes\b|\bTimes\b|\bactrices\b|\bHollywood\b|\bVariety\b|\bauteure\b|\benvironnement\b|\bfils\b')]
    df_vss3['BoldTexts'] = df_vss3['BoldTexts'].apply(lambda x: ' '.join(word for word in x.split() if word.istitle()))
    df_vss3['BoldTexts'] = df_vss3['BoldTexts'].str.replace(r'[,-]', '', regex=True)
    df_vss3['BoldTexts'] = df_vss3['BoldTexts'].str.replace(r'La Star Trek', '', regex=True)
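    # Word boundaries keep the pattern from also cutting "Le" out of the middle of a name.
    df_vss3['BoldTexts'] = df_vss3['BoldTexts'].str.replace(r'\bLe\b', '', regex=True).str.strip()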
else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")


In [111]:
df_vss3['Url']=url
df_vss3.rename(columns={'BoldTexts': 'Name'}, inplace=True)
df_vss3

Unnamed: 0,Name,Url
0,Harvey Weinstein,https://www.radiofrance.fr/franceinter/a-holly...
2,Kevin Spacey,https://www.radiofrance.fr/franceinter/a-holly...
4,George Takei,https://www.radiofrance.fr/franceinter/a-holly...
6,James Toback,https://www.radiofrance.fr/franceinter/a-holly...
11,Brett Ratner,https://www.radiofrance.fr/franceinter/a-holly...
13,Steven Seagal,https://www.radiofrance.fr/franceinter/a-holly...
15,Ben Affleck,https://www.radiofrance.fr/franceinter/a-holly...
17,Dustin Hoffman,https://www.radiofrance.fr/franceinter/a-holly...
21,Jeremy Piven,https://www.radiofrance.fr/franceinter/a-holly...
23,Louis C.K.,https://www.radiofrance.fr/franceinter/a-holly...
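
The scraping cells above all follow the same pattern: collect the text of every `<strong>` tag, keep the capitalised words, and discard bold text that is not a name. As a minimal sketch, that pattern could be factored into one helper. The function name `extract_bold_names`, the `stop_words` parameter, the two-word minimum and the sample HTML (with made-up names) are all assumptions for illustration and are not part of the original notebook.

```python
from bs4 import BeautifulSoup


def extract_bold_names(html, stop_words=()):
    """Return capitalised text found in <strong> tags, skipping stop words."""
    soup = BeautifulSoup(html, "html.parser")
    names = []
    for tag in soup.find_all("strong"):
        text = tag.get_text(strip=True)
        # Keep only capitalised words and trim stray commas or hyphens
        words = [w.strip(",-") for w in text.split() if w.istitle()]
        candidate = " ".join(words)
        # Assumption: a person's name has at least two capitalised words
        if len(words) >= 2 and not any(s in candidate for s in stop_words):
            names.append(candidate)
    return names


# Hypothetical HTML snippet with made-up names, so no network call is needed
sample_html = """
<p><strong>Jane Doe</strong>, productrice, et
<strong>John Smith</strong> sont mentionnés par <strong>Variety</strong>.</p>
"""
print(extract_bold_names(sample_html, stop_words=("Variety",)))
```

In the notebook, `html` would be `response.content` and `stop_words` the site-specific terms currently listed in each regular expression.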


Once these three sites have been scraped, we concatenate the results and remove duplicates to obtain a single list of everyone accused of VSS (according to the sources used).

In [112]:
df_combined = pd.concat([df_vss1, df_vss2, df_vss3], ignore_index=True)
df_combined = df_combined.drop_duplicates(subset='Name')
df_combined = df_combined.sort_values(by='Name', ignore_index=True)
df_combined.loc[:1, 'Name'] = df_combined.loc[:1, 'Name'].str.replace(r' ', '', regex=True)
df_combined = df_combined.sort_values(by='Name', ignore_index=True)
df_vss = df_combined

df_vss

Unnamed: 0,Name,Url
0,Andy Dick,https://www.senscritique.com/liste/les_porcs/2...
1,Anthox Colaboy,https://www.senscritique.com/liste/les_porcs/2...
2,Arnold Schwarzenegger,https://www.senscritique.com/liste/les_porcs/2...
3,Ben Affleck,https://www.senscritique.com/liste/les_porcs/2...
4,Bertrand Cantat,https://www.senscritique.com/liste/les_porcs/2...
5,Bill Cosby,https://www.senscritique.com/liste/les_porcs/2...
6,Billy Dee Williams,https://www.senscritique.com/liste/les_porcs/2...
7,Brett Ratner,https://www.journaldequebec.com/2017/11/01/acc...
8,Bryan Singer,https://www.senscritique.com/liste/les_porcs/2...
9,Casey Affleck,https://www.senscritique.com/liste/les_porcs/2...
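
`drop_duplicates(subset='Name')` only removes exact matches, so the same person can survive twice if the sources differ in case, accents or spacing. The sketch below shows one possible way to deduplicate on a normalised key instead. The helper `normalise_name`, the toy frames and their placeholder names and URLs are assumptions for illustration, not data from the scraped sources.

```python
import unicodedata

import pandas as pd


def normalise_name(name):
    """Lower-case, strip accents and collapse whitespace (for comparison only)."""
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    return " ".join(ascii_name.lower().split())


# Toy frames standing in for df_vss1, df_vss2 and df_vss3 (made-up names and URLs)
frames = [
    pd.DataFrame({"Name": ["Jane Doe", "John Smith"], "Url": ["source-a"] * 2}),
    pd.DataFrame({"Name": ["jane  doe ", "Alex Martin"], "Url": ["source-b"] * 2}),
]

combined = pd.concat(frames, ignore_index=True)
combined["key"] = combined["Name"].map(normalise_name)
deduped = (
    combined.drop_duplicates(subset="key")  # keeps the first spelling encountered
    .drop(columns="key")
    .sort_values(by="Name", ignore_index=True)
)
print(deduped["Name"].tolist())
```

The normalised key is used only for comparison, so the displayed name keeps the spelling of the first source in which it appeared.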


We save this list for future use.

In [142]:
df_vss.to_csv('C:/Users/vim17/Desktop/ZooDiaque/df_vss.csv', index=False)

# IV) Final dataset: merging the movies and VSS datasets

The last phase of the scraping work is to merge the movies and VSS datasets. The sole aim is to add a boolean VSS column (True/False) that flags a film whenever a name from the VSS list appears in its synopsis, director, crew or distributor fields. This column gives us our blacklist.

In [139]:
df_movies_f = df_movies.copy()

df_movies_f['VSS'] = False

for index, row in df_movies_f.iterrows():
    text_to_search = f"{row['Synopsis']} {row['Director']} {row['Crew']} {row['Distributor']}"
    if any(df_vss['Name'].apply(lambda name: name.lower() in text_to_search.lower())):
        df_movies_f.at[index, 'VSS'] = True

df_movies_f

Unnamed: 0,Note_Z,Note_U,Title,Date,Genre,Duration,Synopsis,Director,Crew,Distributor,VSS
623,6.8,6.4,"""Sr.""",2023,"['Holiday', 'Drama', 'Mystery & thriller']",38m,"On Christmas Eve, a young RAF pilot flying hom...",Iain Softley,"['Ben Radcliffe', 'Steven Mackintosh', 'John T...",Disney+,False
1156,3.2,3.2,(500) Days of Summer,2023,,1h 45min,"Quelque part dans le Nord de la France, Juliet...",,"Juliette Jouan, Raphaël Thiéry",,False
1672,3.3,5.3,(500) jours ensemble,2000,"['Kids & family', 'Musical', 'Animation']",1h 29m,"Two con-men (Kevin Kline, Kenneth Branagh) get...","Bibo Bergeron, \n ...","['Kevin Kline', 'Kenneth Branagh', 'Rosie Pere...",DreamWorks SKG,False
1551,1.4,6.3,100% bio,2019,"['Horror', 'Mystery & thriller']",2h 25m,"A young American couple, their relationship fo...",Ari Aster,"['Florence Pugh', 'Jack Reynor', 'William Jack...",A24,False
1238,2.8,3.6,12 Strong,2023,,1h 43min,"Marie-Line, 20 ans, est une serveuse énergique...",,"Louane Emera, Michel Blanc",,False
...,...,...,...,...,...,...,...,...,...,...,...
1391,2.3,4.5,xXx : Reactivated,2021,"['Kids & family', 'Comedy', 'Animation', 'Adve...",1h 23m,Growing up in an orphanage in the British coun...,Goro Miyazaki,"['Richard E. Grant', 'Kacey Musgraves', 'Dan S...",GKIDS,False
1620,3.1,5.9,À l'intérieur,2021,"['Documentary', 'Music']",2h 21m,"""Billie Eilish: The World's A Little Blurry"" t...",R.J. Cutler,"['Billie Eilish', 'Finneas', 'R.J. Cutler', 'R...",NEON,False
1669,3.0,4.8,À mon seul désir,2015,"['Musical', 'Comedy']",1h 55m,It's been three years since the Barden Bellas ...,Elizabeth Banks,"['Anna Kendrick', 'Rebel Wilson', 'Hailee Stei...",Universal Pictures,False
1450,3.5,6.2,Ága,2020,"['Documentary', 'Drama', 'Lgbtq+']",1h 23m,A former baseball player keeps her lesbian rel...,Chris Bolan,"['Pat Henschel', 'Chris Bolan', 'Chris Bolan',...",,False
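
The loop above checks every name against every film with a plain substring test, which is slow on large tables and can flag a film when a listed name is only part of a longer name. A possible vectorised alternative is sketched below: it builds a single regular expression with word boundaries and applies it once per film. The toy `movies` and `accused` frames and all names in them are made up for illustration, and this is not the method used to produce the table above.

```python
import re

import pandas as pd

# Toy data standing in for df_movies and df_vss (all values are made up)
movies = pd.DataFrame({
    "Title": ["Film A", "Film B"],
    "Synopsis": ["A heist goes wrong.", "A quiet family drama."],
    "Director": ["JOHN SMITH", "Jane Doe"],
    "Crew": ["['Alex Martin']", "['Ben Exampleson']"],
    "Distributor": ["Studio X", None],
})
accused = pd.DataFrame({"Name": ["John Smith", "Ben Example"]})

# One alternation pattern; \b prevents "Ben Example" from matching "Ben Exampleson"
pattern = r"\b(?:" + "|".join(re.escape(n) for n in accused["Name"].dropna()) + r")\b"

# Join the searched columns into one string per film (missing values become "")
searchable = (
    movies[["Synopsis", "Director", "Crew", "Distributor"]]
    .fillna("")
    .astype(str)
    .agg(" ".join, axis=1)
)
movies["VSS"] = searchable.str.contains(pattern, case=False, regex=True)
print(movies[["Title", "VSS"]])
```

With this toy data only "Film A" is flagged. The sketch assumes the list of names is not empty; an empty list would need a separate guard because the pattern would then match any text.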


We save the final DataFrame for use on our website.

In [141]:
df_movies_f.to_csv('C:/Users/vim17/Desktop/ZooDiaque/df_movies_f.csv', index=False)

# V) Conclusion on Scraping and the Project

This project allowed us to highlight several issues that matter to us, in particular how hard VSS-related information is to access, and to build a movie site in line with our values.

We encountered a number of difficulties related to scraping (headers, number of pages, tags to identify, etc.), cleaning (formatting, deletions, etc.) and merging DataFrames, as well as to the limited information available on VSS, which we hope to enrich in the future.

This project was very instructive and improved our scraping skills and our command of the BeautifulSoup and Selenium libraries. We also developed a more rigorous approach to scraping, learning to work out which information to retrieve and how to retrieve it efficiently.

We now invite you to take a look at the website to see how we visualize and exploit this data.