
# **Web Scraping & Data Handling Challenge**

### **Website:**
JustWatch -  https://www.justwatch.com/in/movies?release_year_from=2000


### **Description:**

JustWatch is a popular platform that allows users to search for movies and TV shows across multiple streaming services like Netflix, Amazon Prime, Hulu, etc. For this assignment, you will be required to scrape movie and TV show data from JustWatch using Selenium, Python, and BeautifulSoup. Extract data from HTML, not by directly calling their APIs. Then, perform data filtering and analysis using Pandas, and finally, save the results to a CSV file.

### **Tasks:**

**1. Web Scraping:**

Use BeautifulSoup to scrape the following data from JustWatch:

   **a. Movie Information:**

      - Movie title
      - Release year
      - Genre
      - IMDb rating
      - Streaming services available (Netflix, Amazon Prime, Hulu, etc.)
      - URL to the movie page on JustWatch

   **b. TV Show Information:**

      - TV show title
      - Release year
      - Genre
      - IMDb rating
      - Streaming services available (Netflix, Amazon Prime, Hulu, etc.)
      - URL to the TV show page on JustWatch

  **c. Scope:**

```
   - Scrape data for at least 50 movies and 50 TV shows.
   - You can choose the entry point (e.g., starting with popular movies,
     or a specific genre, etc.) to ensure a diverse dataset.
```


**2. Data Filtering & Analysis:**

   After scraping the data, use Pandas to perform the following tasks:

   **a. Filter movies and TV shows based on specific criteria:**

   ```
      - Only include movies and TV shows released in the last 2 years (from the current date).
      - Only include movies and TV shows with an IMDb rating of 7 or higher.
```

   **b. Data Analysis:**

   ```
      - Calculate the average IMDb rating for the scraped movies and TV shows.
      - Identify the top 5 genres that have the highest number of available movies and TV shows.
      - Determine the streaming service with the most significant number of offerings.
      
   ```   
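
The filtering and analysis steps above could be sketched roughly as follows. This is only an illustration: the column names (`title`, `release_year`, `imdb_rating`, `genres`, `streaming`) and the sample rows are made up, and "last 2 years" is approximated at year granularity because only the release year is scraped.

```python
from datetime import date

import pandas as pd

# Made-up sample rows; real data would come from the scraper.
this_year = date.today().year
df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "release_year": [this_year, this_year - 1, this_year - 5, this_year],
    "imdb_rating": [8.1, 6.4, 7.9, 7.0],
    "genres": ["Drama, Crime", "Comedy", "Drama", "Comedy, Drama"],
    "streaming": ["Netflix", "Netflix, Hulu", "Hulu", "Netflix"],
})

# Filter: released within the last 2 years and rated 7 or higher.
is_recent = df["release_year"] >= this_year - 2
is_well_rated = df["imdb_rating"] >= 7
filtered = df[is_recent & is_well_rated]

# Average IMDb rating over all scraped rows.
average_rating = df["imdb_rating"].mean()


def top_values(column: pd.Series, n: int) -> pd.Series:
    """Count comma-separated values in a column and return the n most common."""
    return column.str.split(",").explode().str.strip().value_counts().head(n)


top_genres = top_values(df["genres"], 5)      # top 5 genres by number of titles
top_service = top_values(df["streaming"], 1)  # service with the most titles
```

Storing genres and services as comma-separated strings keeps one row per title; `explode()` then expands them only for counting.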

**3. Data Export:**

```
   - Dump the filtered and analysed data into a CSV file for further processing and reporting.

   - Keep the CSV file in your Drive folder and share the Drive link in the Colab notebook, with view access set to anyone with the link.
```

**Submission:**
```
- Submit a link to your Colab made for the assignment.

- The Colab should contain your Python script (.py format only) with clear
  comments explaining the scraping, filtering, and analysis process.

- Your code shouldn't have any errors and should be executable in one go.

- Before the conclusion, include your dataset Drive link in the notebook.
```



**Note:**

1. Properly handle errors and exceptions during web scraping to ensure a robust script.

2. Make sure your code is well-structured, easy to understand, and follows Python best practices.

3. The assignment will be evaluated based on the correctness of the scraped data, accuracy of data filtering and analysis, and the overall quality of the Python code.
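
One way to approach the error-handling note above is a small retry wrapper around any step that can fail, such as a page load. The helper below is a hypothetical sketch (the name `with_retries` and its parameters are not part of the assignment), and in real use the `except` clause should be narrowed to the errors you actually expect.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def with_retries(action: Callable[[], T], attempts: int = 3, delay: float = 1.0) -> Optional[T]:
    """Call `action` up to `attempts` times; return None if every attempt raises."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as error:  # narrow this to the errors you expect
            print(f"Attempt {attempt} failed: {error}")
            if attempt < attempts:
                time.sleep(delay)  # brief pause before trying again
    return None
```

For example, `with_retries(lambda: requests.get(url, timeout=10))` would return the response, or `None` after three failed attempts.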








# **Start The Project**

## **Task 1: Web Scraping**

In [None]:
# Install all necessary libraries
!pip install bs4
!pip install requests



In [None]:
# Import all necessary libraries
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import numpy as np

## **Scraping Movies Data**

In [None]:
%%shell
# Set up for running selenium in Google Colab (the cell magic must be the first line of the cell)
sudo apt -y update
sudo apt install -y wget curl unzip
wget http://archive.ubuntu.com/ubuntu/pool/main/libu/libu2f-host/libu2f-udev_1.1.4-1_all.deb
dpkg -i libu2f-udev_1.1.4-1_all.deb
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
dpkg -i google-chrome-stable_current_amd64.deb
CHROME_DRIVER_VERSION=`curl -sS chromedriver.storage.googleapis.com/LATEST_RELEASE`
wget -N https://chromedriver.storage.googleapis.com/$CHROME_DRIVER_VERSION/chromedriver_linux64.zip -P /tmp/
unzip -o /tmp/chromedriver_linux64.zip -d /tmp/
chmod +x /tmp/chromedriver
mv /tmp/chromedriver /usr/local/bin/chromedriver
pip install selenium

Hit:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Get:3 http://security.ubuntu.com/ubuntu jammy-security InRelease [110 kB]
Hit:4 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:5 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [119 kB]
Hit:6 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:7 https://ppa.lau



In [None]:
!pip install chromedriver-autoinstaller

import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')

import time
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
import chromedriver_autoinstaller

# setup chrome options
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless') # ensure GUI is off
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

# set path to chromedriver as per your configuration
chromedriver_autoinstaller.install()



'/usr/local/lib/python3.10/dist-packages/chromedriver_autoinstaller/123/chromedriver'

In [None]:
# set the target URL
url = 'https://www.justwatch.com/in/movies?release_year_from=2000'

# set up the webdriver
driver = webdriver.Chrome(options=chrome_options)

In [None]:
driver.get(url)

In [None]:
driver

<selenium.webdriver.chrome.webdriver.WebDriver (session="4f45e81640dc7b6285625b562c439176")>

In [None]:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

In [None]:
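# Wait up to 10 seconds for the title links to be present before collecting them
title_list = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.XPATH, "//a[@class='title-list-grid__item--link']"))
)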
title_list

[<selenium.webdriver.remote.webelement.WebElement (session="4f45e81640dc7b6285625b562c439176", element="f.6B25A0906C3A84CCFC3B0BFF991B5859.d.CDB33A07129696021F4E1994713DDAA5.e.18")>,
 <selenium.webdriver.remote.webelement.WebElement (session="4f45e81640dc7b6285625b562c439176", element="f.6B25A0906C3A84CCFC3B0BFF991B5859.d.CDB33A07129696021F4E1994713DDAA5.e.19")>,
 <selenium.webdriver.remote.webelement.WebElement (session="4f45e81640dc7b6285625b562c439176", element="f.6B25A0906C3A84CCFC3B0BFF991B5859.d.CDB33A07129696021F4E1994713DDAA5.e.20")>,
 <selenium.webdriver.remote.webelement.WebElement (session="4f45e81640dc7b6285625b562c439176", element="f.6B25A0906C3A84CCFC3B0BFF991B5859.d.CDB33A07129696021F4E1994713DDAA5.e.21")>,
 <selenium.webdriver.remote.webelement.WebElement (session="4f45e81640dc7b6285625b562c439176", element="f.6B25A0906C3A84CCFC3B0BFF991B5859.d.CDB33A07129696021F4E1994713DDAA5.e.22")>,
 <selenium.webdriver.remote.webelement.WebElement (session="4f45e81640dc7b6285625b562

## **Fetching Movie URLs**

In [None]:
# Collect the href of every title link found on the listing page
movies=[]
for i in title_list:
    name=i.get_attribute('href')
    movies.append(name)
movies

['https://www.justwatch.com/in/movie/hanu-man',
 'https://www.justwatch.com/in/movie/oppenheimer',
 'https://www.justwatch.com/in/movie/untitled-shahid-kapoor-kriti-sanon-film',
 'https://www.justwatch.com/in/movie/fighter-2022',
 'https://www.justwatch.com/in/movie/poor-things',
 'https://www.justwatch.com/in/movie/anatomie-dune-chute',
 'https://www.justwatch.com/in/movie/bramayugam',
 'https://www.justwatch.com/in/movie/dune-2021',
 'https://www.justwatch.com/in/movie/animal-2022',
 'https://www.justwatch.com/in/movie/merry-christmas-2024',
 'https://www.justwatch.com/in/movie/12th-fail',
 'https://www.justwatch.com/in/movie/anyone-but-you',
 'https://www.justwatch.com/in/movie/road-house-2024',
 'https://www.justwatch.com/in/movie/anweshippin-kandethum',
 'https://www.justwatch.com/in/movie/murder-mubarak',
 'https://www.justwatch.com/in/movie/manjummel-boys',
 'https://www.justwatch.com/in/movie/black-magic-2024',
 'https://www.justwatch.com/in/movie/article-370',
 'https://www.ju

## **Scraping Movie Title**

In [None]:
# The last path segment of each URL is a slug (e.g. 'dune-2021'), used here as a stand-in for the title
def extract_movie_name_from_url(url):
    return url.split('/')[-1]

# Extract movie names (slugs) from the URLs
movie_names = [extract_movie_name_from_url(url) for url in movies]

In [None]:
movie_names

['hanu-man',
 'oppenheimer',
 'untitled-shahid-kapoor-kriti-sanon-film',
 'fighter-2022',
 'poor-things',
 'anatomie-dune-chute',
 'bramayugam',
 'dune-2021',
 'animal-2022',
 'merry-christmas-2024',
 '12th-fail',
 'anyone-but-you',
 'road-house-2024',
 'anweshippin-kandethum',
 'murder-mubarak',
 'manjummel-boys',
 'black-magic-2024',
 'article-370',
 'aattam',
 'salaar',
 'premalu',
 'sam-bahadur',
 'eagle-2024',
 'kung-fu-panda',
 'damsel-2023',
 'dune-part-two',
 'the-beekeeper-2024',
 'laapataa-ladies',
 'the-crew-2024',
 'madame-web',
 'the-kerala-story',
 'godzilla-minus-one',
 'godzilla-x-kong-the-new-empire',
 '365-days',
 'abraham-ozler',
 'dunki',
 'aquaman-and-the-lost-kingdom',
 'vadakkupatti-ramasamy',
 'the-holdovers',
 'the-zone-of-interest',
 'lover-2024',
 'main-atal-hoon',
 'blackberry',
 'ferrari',
 'mission-chapter-1',
 'migration',
 'the-gentlemen',
 'junior-2023',
 'harry-potter-and-the-philosophers-stone',
 'untitled-cord-jefferson-film',
 'barbie-2023',
 'ssmb-
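
The values above are URL slugs rather than display titles. A rough clean-up could look like the sketch below; `slug_to_title` is a hypothetical helper, and it assumes that a trailing four-digit year (as in `road-house-2024`) is a disambiguation suffix rather than part of the title.

```python
import re


def slug_to_title(slug: str) -> str:
    """Turn a URL slug such as 'road-house-2024' into 'Road House'."""
    # Drop a trailing four-digit year, assumed to be a disambiguation suffix.
    slug = re.sub(r"-(19|20)\d{2}$", "", slug)
    # Replace hyphens with spaces and capitalize each word.
    return " ".join(word.capitalize() for word in slug.split("-"))


print(slug_to_title("road-house-2024"))
print(slug_to_title("12th-fail"))
```

This is only an approximation: a title that really ends in a year-like number would lose it, and some slugs differ from the displayed title altogether, so reading the title from the movie page itself is more reliable.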

## **Scraping Release Year**

In [None]:
# Visit each movie page and read the release year
Release_year = []
for movie_url in movies:
    driver.get(movie_url)

    try:
        # Find the release year element
        release_year_element = driver.find_element(By.XPATH, "//span[@class='text-muted']")
        release_year = release_year_element.text.strip("()")
    except Exception:
        # Element missing (or the page failed to load): record a placeholder
        release_year = "Release Year not found"
    Release_year.append({"Movie URL": movie_url, "Release Year": release_year})

for cur_movie in Release_year:
    print(cur_movie)

{'Movie URL': 'https://www.justwatch.com/in/movie/hanu-man', 'Release Year': ''}
{'Movie URL': 'https://www.justwatch.com/in/movie/oppenheimer', 'Release Year': ''}
{'Movie URL': 'https://www.justwatch.com/in/movie/untitled-shahid-kapoor-kriti-sanon-film', 'Release Year': ''}
{'Movie URL': 'https://www.justwatch.com/in/movie/fighter-2022', 'Release Year': ''}
{'Movie URL': 'https://www.justwatch.com/in/movie/poor-things', 'Release Year': ''}
{'Movie URL': 'https://www.justwatch.com/in/movie/anatomie-dune-chute', 'Release Year': ''}
{'Movie URL': 'https://www.justwatch.com/in/movie/bramayugam', 'Release Year': ''}
{'Movie URL': 'https://www.justwatch.com/in/movie/dune-2021', 'Release Year': ''}
{'Movie URL': 'https://www.justwatch.com/in/movie/animal-2022', 'Release Year': ''}
{'Movie URL': 'https://www.justwatch.com/in/movie/merry-christmas-2024', 'Release Year': ''}
{'Movie URL': 'https://www.justwatch.com/in/movie/12th-fail', 'Release Year': ''}
{'Movie URL': 'https://www.justwatch.c

In [None]:
# Extract release years from the Release_year list
movie_release_years = [year["Release Year"] for year in Release_year]

In [None]:
movie_release_years

['',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 'Release Year not found',
 '',
 'Release Year not found',
 'Release Year not found',
 'Release Year not found',
 'Release Year not found',
 'Release Year not found',
 'Release Year not found',
 'Release Year not found',
 'Release Year not found',
 'Release Year not found',
 '',
 'Release Year not found',
 'Release Year not found',
 'Release Year not found',
 'Release Year not found',
 'Release Year not found',
 'Release Year not found',
 'Release Year not found',
 '',
 'Release Year not found',
 'Release Year not found',
 'Release Year not found',
 'Release Year not found',
 'Release Year not found',
 'Release Year not found',
 'Release Year not found',
 'Release Year not found',
 'Release Year not found',
 'Release Year not found',
 'Release Year not found',
 'Release Year not found',
 '',
 'Release Year not found',
 'Release Year not found',
 'Release Year not found',
 'Rele
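
The list above mixes empty strings with a placeholder message, which is awkward to filter on later. A minimal sketch of turning whatever text was scraped into an integer year or `None` is shown below; `parse_year` is a hypothetical helper and assumes the year appears as four digits somewhere in the text, for example `(2023)`.

```python
import re
from typing import Optional

# Four-digit years from 1900 to 2099, delimited by word boundaries.
YEAR_PATTERN = re.compile(r"\b(19|20)\d{2}\b")


def parse_year(text: str) -> Optional[int]:
    """Return the first four-digit year found in `text`, or None if there is none."""
    match = YEAR_PATTERN.search(text or "")
    return int(match.group(0)) if match else None
```

With this, `[parse_year(y) for y in movie_release_years]` would give a list of the same length in which every missing year is `None`.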

## **Scraping Genres**

In [None]:
# Scrape the genres listed on a movie page
def fetch_genre(url):
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')

        # Find the 'Genres' subheading, then the value element that follows it
        genres_heading = soup.find('h3', class_='detail-infos__subheading', string='Genres')
        if genres_heading:
            genres_element = genres_heading.find_next_sibling('div', class_='detail-infos__value')
            if genres_element:
                genres = genres_element.text.strip()

                # Split the genres into a list
                return [genre.strip() for genre in genres.split(',')]
    return None

# Fetch genres for each URL in 'movies' and collect them in 'movie_genres'.
# Note: extend() flattens everything into one list, so its length will not match len(movies).
movie_genres = []

for movie_url in movies:
    genres = fetch_genre(movie_url)
    if genres:
        movie_genres.extend(genres)

# Print the list of all genres
print(movie_genres)


['Mystery & Thriller', 'Crime', 'Drama', 'Comedy', 'Romance', 'Fantasy', 'Science-Fiction', 'Comedy', 'Drama', 'Comedy', 'Action & Adventure', 'Fantasy', 'Action & Adventure', 'Science-Fiction', 'Action & Adventure', 'Drama', 'History', 'War & Military', 'Comedy']


In [None]:
movie_genres

['Mystery & Thriller',
 'Crime',
 'Drama',
 'Comedy',
 'Romance',
 'Fantasy',
 'Science-Fiction',
 'Comedy',
 'Drama',
 'Comedy',
 'Action & Adventure',
 'Fantasy',
 'Action & Adventure',
 'Science-Fiction',
 'Action & Adventure',
 'Drama',
 'History',
 'War & Military',
 'Comedy']
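
Because the loop above flattens all genres into one list and skips pages where nothing was found, the result cannot be lined up with `movies`. One possible alternative, sketched here with a hypothetical helper `collect_aligned`, keeps exactly one entry per URL and uses `None` for missing values.

```python
from typing import Callable, List, Optional


def collect_aligned(
    urls: List[str],
    fetch: Callable[[str], Optional[List[str]]],
) -> List[Optional[str]]:
    """Return one entry per URL: the joined values, or None when nothing was found."""
    results: List[Optional[str]] = []
    for url in urls:
        values = fetch(url)
        # Join multiple values into one string so each URL maps to a single cell.
        results.append(", ".join(values) if values else None)
    return results
```

Calling `collect_aligned(movies, fetch_genre)` would then return a list whose length equals `len(movies)`.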

## **Scraping IMDb Rating**

In [None]:
# Scrape the IMDb rating shown on a movie page
def fetch_imdb(url):
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')

        # Find the 'Rating' subheading, then the value element that follows it
        imdb_heading = soup.find('h3', class_='detail-infos__subheading', string='Rating')
        if imdb_heading:
            imdb_element = imdb_heading.find_next_sibling('div', class_='detail-infos__value')
            if imdb_element:
                imdb = imdb_element.text.strip()

                # Split on commas and strip surrounding whitespace
                return [part.strip() for part in imdb.split(',')]
    return None

# Fetch the rating for each URL in 'movies' and collect the results in 'movie_imdb'.
# Note: movies without a rating are skipped, so the list can be shorter than len(movies).
movie_imdb = []

for movie_url in movies:
    imdb = fetch_imdb(movie_url)
    if imdb:
        movie_imdb.extend(imdb)


In [None]:
movie_imdb

['6.2',
 '6.1  (69k)',
 '7.5  (69k)',
 '7.6  (232k)',
 '5.8  (1k)',
 '8.7  (2m)',
 '8.1  (591k)']
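
The scraped strings combine the rating with what looks like a vote count, so they need parsing before any numeric filter such as "7 or higher". The sketch below uses a hypothetical helper `parse_rating`; reading the `k` and `m` suffixes as thousands and millions is an assumption.

```python
import re
from typing import Optional, Tuple

# Matches '6.2' as well as '7.6  (232k)'.
RATING_PATTERN = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*(?:\(([\d.]+)([km]?)\))?\s*$", re.IGNORECASE
)
# Assumed meaning of the suffixes: k = thousand, m = million.
MULTIPLIERS = {"": 1, "k": 1_000, "m": 1_000_000}


def parse_rating(text: str) -> Tuple[Optional[float], Optional[int]]:
    """Split a string like '7.6  (232k)' into (rating, votes); votes is None if absent."""
    match = RATING_PATTERN.match(text)
    if not match:
        return None, None
    rating = float(match.group(1))
    if match.group(2) is None:
        return rating, None
    votes = int(round(float(match.group(2)) * MULTIPLIERS[match.group(3).lower()]))
    return rating, votes
```

Strings that do not fit the pattern come back as `(None, None)` rather than raising, which keeps a scraping loop running.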

## **Fetching Streaming Service Details**

In [None]:
# Record the first streaming provider icon found on each movie page
Provider = []

for movie_url in movies:
    driver.get(movie_url)

    try:
        # Find the provider element; its alt text holds the provider name
        provider_element = driver.find_element(By.XPATH, "//img[@class='offer__icon']")
        provider_name = provider_element.get_attribute('alt')
    except Exception:
        # No provider icon on the page (or the page failed to load)
        provider_name = "Provider not found"

    Provider.append({"Movie URL": movie_url, "Provider Name": provider_name})

for stream in Provider:
    print(stream)

{'Movie URL': 'https://www.justwatch.com/in/movie/hanu-man', 'Provider Name': 'Provider not found'}
{'Movie URL': 'https://www.justwatch.com/in/movie/oppenheimer', 'Provider Name': 'Provider not found'}
{'Movie URL': 'https://www.justwatch.com/in/movie/untitled-shahid-kapoor-kriti-sanon-film', 'Provider Name': 'Provider not found'}
{'Movie URL': 'https://www.justwatch.com/in/movie/fighter-2022', 'Provider Name': 'Provider not found'}
{'Movie URL': 'https://www.justwatch.com/in/movie/poor-things', 'Provider Name': 'Provider not found'}
{'Movie URL': 'https://www.justwatch.com/in/movie/anatomie-dune-chute', 'Provider Name': 'Provider not found'}
{'Movie URL': 'https://www.justwatch.com/in/movie/bramayugam', 'Provider Name': 'Sony Liv'}
{'Movie URL': 'https://www.justwatch.com/in/movie/dune-2021', 'Provider Name': 'Provider not found'}
{'Movie URL': 'https://www.justwatch.com/in/movie/animal-2022', 'Provider Name': 'Provider not found'}
{'Movie URL': 'https://www.justwatch.com/in/movie/me

In [None]:
Movie_streaming = [movie_info['Provider Name'] for movie_info in Provider]


In [None]:
Movie_streaming

['Provider not found',
 'Provider not found',
 'Provider not found',
 'Provider not found',
 'Provider not found',
 'Provider not found',
 'Sony Liv',
 'Provider not found',
 'Provider not found',
 'Provider not found',
 'Provider not found',
 'Provider not found',
 'Provider not found',
 'Provider not found',
 'Provider not found',
 'Provider not found',
 'Provider not found',
 'Provider not found',
 'Provider not found',
 'Netflix',
 'Provider not found',
 'Provider not found',
 'Provider not found',
 'Provider not found',
 'Provider not found',
 'Provider not found',
 'Provider not found',
 'Provider not found',
 'Provider not found',
 'Provider not found',
 'Provider not found',
 'Provider not found',
 'Provider not found',
 'Provider not found',
 'Provider not found',
 'Provider not found',
 'Provider not found',
 'Provider not found',
 'Provider not found',
 'Provider not found',
 'Provider not found',
 'Provider not found',
 'Provider not found',
 'Provider not found',
 'Provide
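
Placeholder strings such as `'Provider not found'` would be counted as if they were a real provider in any later analysis. A small sketch of replacing them with `None` is given below; `normalize_missing` and the `SENTINELS` set are hypothetical names, with the placeholder texts taken from the loops above.

```python
from typing import Iterable, List, Optional

# Placeholder values used by the scraping loops in this notebook.
SENTINELS = {"", "Provider not found", "Release Year not found"}


def normalize_missing(values: Iterable[str]) -> List[Optional[str]]:
    """Strip whitespace and replace placeholder strings with None."""
    cleaned = (value.strip() for value in values)
    return [None if value in SENTINELS else value for value in cleaned]
```

Pandas treats `None` in an object column as a missing value, so methods like `value_counts()` skip it by default.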

In [None]:
driver.quit()

## **Now Creating Movies DataFrame**

In [None]:
print(len(movie_names))
print(len(movie_release_years))
print(len(movie_genres))
print(len(movies))
print(len(movie_imdb))
print(len(Movie_streaming))

100
100
19
100
7
100


In [None]:
# Creating a DataFrame of the scraped movie data
import pandas as pd
Movies_data=pd.DataFrame()
Movies_data['Movie Names']=movie_names[:80]
Movies_data['Release Year']=movie_release_years[:80]
Movies_data['Movie Genres']=movie_genres[:80]
Movies_data['Movie IMDb Ratings']=movie_imdb[:80]
Movies_data['Movie Url']=movies[:80]
Movies_data['Movie Streaming Service']=Movie_streaming[:80]


ValueError: Length of values (19) does not match length of index (80)
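
The error above arises because the genre and rating lists are shorter than the others (19 and 7 entries against 100). A possible guard, sketched with a hypothetical helper `build_records`, checks that all columns have the same length before combining them row by row.

```python
from typing import Any, Dict, List, Sequence


def build_records(columns: Dict[str, Sequence[Any]]) -> List[Dict[str, Any]]:
    """Combine equal-length columns into a list of row dictionaries."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        # Report every column's length so the mismatch is easy to spot.
        raise ValueError(f"Column lengths differ: {lengths}")
    names = list(columns)
    return [dict(zip(names, row)) for row in zip(*columns.values())]


records = build_records({
    "Movie Names": ["hanu-man", "oppenheimer"],
    "Movie Url": [
        "https://www.justwatch.com/in/movie/hanu-man",
        "https://www.justwatch.com/in/movie/oppenheimer",
    ],
})
```

The resulting list of dictionaries can be passed straight to `pd.DataFrame(records)`. This only helps once each scraping step returns one value per movie, with `None` where nothing was found.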