# **Web Scraping & Data Handling Challenge**



### **Website:**
JustWatch -  https://www.justwatch.com/in/movies?release_year_from=2000


### **Description:**

JustWatch is a popular platform that allows users to search for movies and TV shows across multiple streaming services like Netflix, Amazon Prime, Hulu, etc. For this assignment, you will be required to scrape movie and TV show data from JustWatch using Selenium, Python, and BeautifulSoup. Extract data from HTML, not by directly calling their APIs. Then, perform data filtering and analysis using Pandas, and finally, save the results to a CSV file.

### **Tasks:**

**1. Web Scraping:**

Use BeautifulSoup to scrape the following data from JustWatch:

   **a. Movie Information:**

      - Movie title
      - Release year
      - Genre
      - IMDb rating
      - Streaming services available (Netflix, Amazon Prime, Hulu, etc.)
      - URL to the movie page on JustWatch

   **b. TV Show Information:**

      - TV show title
      - Release year
      - Genre
      - IMDb rating
      - Streaming services available (Netflix, Amazon Prime, Hulu, etc.)
      - URL to the TV show page on JustWatch

  **c. Scope:**

```
   - Scrape data for at least 50 movies and 50 TV shows.
   - You can choose the entry point (e.g., starting with popular movies,
     or a specific genre) to ensure a diverse dataset.
```


**2. Data Filtering & Analysis:**

   After scraping the data, use Pandas to perform the following tasks:

   **a. Filter movies and TV shows based on specific criteria:**

   ```
      - Only include movies and TV shows released in the last 2 years (from the current date).
      - Only include movies and TV shows with an IMDb rating of 7 or higher.
```

   **b. Data Analysis:**

   ```
      - Calculate the average IMDb rating for the scraped movies and TV shows.
      - Identify the top 5 genres that have the highest number of available movies and TV shows.
      - Determine the streaming service with the largest number of offerings.
      
   ```   
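
The filtering and analysis steps above can be sketched with Pandas as follows. This is a minimal illustration on a hand-made frame, not a reference solution: the column names, the sample rows, and the `filter_recent_high_rated` helper are all assumptions, and "last 2 years" is interpreted here as a release year no earlier than the current year minus two.

```python
import pandas as pd

# Hypothetical sample rows; a real run would use the scraped data instead.
df = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Release Year": ["2024", "2021", "2023", "N/A"],
    "IMDb Rating": ["8.1", "7.6", "6.0", "7.4"],
    "Genres": ["Drama, Crime", "Drama", "Comedy", "Drama, Romance"],
    "Streaming Services": [["netflix"], ["netflix", "zee5"], ["zee5"], ["netflix"]],
})

def filter_recent_high_rated(frame, current_year, min_rating=7.0, years=2):
    """Keep rows released within `years` of `current_year` and rated >= `min_rating`."""
    out = frame.copy()
    # 'N/A' and other non-numeric strings become NaN and fail both comparisons.
    out["Release Year"] = pd.to_numeric(out["Release Year"], errors="coerce")
    out["IMDb Rating"] = pd.to_numeric(out["IMDb Rating"], errors="coerce")
    mask = (out["Release Year"] >= current_year - years) & (out["IMDb Rating"] >= min_rating)
    return out[mask]

filtered = filter_recent_high_rated(df, current_year=2024)

# Average rating over all rows that have a numeric rating.
average_rating = pd.to_numeric(df["IMDb Rating"], errors="coerce").mean()

# Split the comma-separated genre string into one row per genre, then count.
top_genres = df["Genres"].str.split(", ").explode().value_counts().head(5)

# One row per (title, service) pair; pick the most frequent service.
top_service = df["Streaming Services"].explode().value_counts().idxmax()

print(filtered["Title"].tolist(), average_rating, top_service)
```

Passing `current_year` explicitly keeps the filter reproducible; in a notebook it could be taken from `datetime.date.today().year` instead.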

**3. Data Export:**

```
   - Dump the filtered and analysed data into a CSV file for further processing and reporting.

   - Keep the CSV file in your Drive folder and share the Drive link in the Colab notebook, with view access set to anyone with the link.
```
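
The export step could look roughly like the sketch below. The frame contents and the `to_csv_text` helper are hypothetical; the only Pandas call involved is `DataFrame.to_csv`. The sketch writes to an in-memory buffer so it runs anywhere, and list-valued cells are joined into plain strings first so the CSV stays readable.

```python
import io
import pandas as pd

# Hypothetical filtered frame; list-valued cells are common after scraping.
filtered = pd.DataFrame({
    "Title": ["A", "B"],
    "IMDb Rating": [8.1, 7.6],
    "Streaming Services": [["netflix", "zee5"], ["netflix"]],
})

def to_csv_text(frame):
    """Return the frame as CSV text with list cells flattened to strings."""
    out = frame.copy()
    # Join list cells so the CSV holds 'netflix, zee5' rather than a Python repr.
    out["Streaming Services"] = out["Streaming Services"].apply(", ".join)
    buffer = io.StringIO()
    out.to_csv(buffer, index=False)  # index=False omits the row-number column
    return buffer.getvalue()

csv_text = to_csv_text(filtered)
print(csv_text)
```

Passing a file path to `to_csv` instead of a buffer writes the file to disk. Leaving out `index=False` is the usual reason an `Unnamed: 0` column appears when the file is read back.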

**Submission:**
```
- Submit a link to your Colab made for the assignment.

- The Colab should contain your Python script (.py format only) with clear
  comments explaining the scraping, filtering, and analysis process.

- Your code shouldn't have any errors and should be executable in one go.

- Before the Conclusion, include your dataset Drive link in the notebook.
```



**Note:**

1. Properly handle errors and exceptions during web scraping to ensure a robust script.

2. Make sure your code is well-structured, easy to understand, and follows Python best practices.

3. The assignment will be evaluated based on the correctness of the scraped data, accuracy of data filtering and analysis, and the overall quality of the Python code.
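
One way to approach the error-handling note is a small retry wrapper around the HTTP call. The sketch below is an assumption-laden illustration, not part of the assignment: `fetch_with_retries`, `FakeResponse`, and `flaky_get` are hypothetical names, and the fetch function is passed in as a parameter so the example runs without network access.

```python
import time

def fetch_with_retries(url, get, retries=3, backoff=0.5, sleep=time.sleep):
    """Call `get(url)` until it returns status 200 or the attempts run out.

    `get` is any callable returning an object with `status_code` and
    `content` attributes (for example `requests.get`). Returns the content,
    or None if every attempt failed.
    """
    for attempt in range(1, retries + 1):
        try:
            response = get(url)
            if response.status_code == 200:
                return response.content
            print(f"Attempt {attempt}: status {response.status_code} for {url}")
        except Exception as error:  # network errors, timeouts, and so on
            print(f"Attempt {attempt}: {error} for {url}")
        if attempt < retries:
            sleep(backoff * attempt)  # wait a little longer after each failure
    return None


# Stand-in for a flaky server so the sketch runs offline.
class FakeResponse:
    def __init__(self, status_code, content=b""):
        self.status_code = status_code
        self.content = content

calls = []

def flaky_get(url):
    calls.append(url)
    if len(calls) == 1:
        raise ConnectionError("temporary failure")
    if len(calls) == 2:
        return FakeResponse(503)
    return FakeResponse(200, b"<html>ok</html>")

html = fetch_with_retries("https://example.com/page", flaky_get, sleep=lambda s: None)
print(html)
```

In the notebook, `get` could be `functools.partial(requests.get, headers=headers, timeout=10)`, so that a stalled connection raises instead of hanging the loop.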








# **Start The Project**

## **Task 1: Web Scraping**

In [None]:
# Install all necessary libraries
!pip install bs4
!pip install requests



In [None]:
# Import all necessary libraries
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import numpy as np

## **Scraping Movie Data**

In [None]:
url = 'https://www.justwatch.com/in/movies?release_year_from=2000'


# headers to simulate a browser visit
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36'
}

## Hint : Use the following code to extract the film urls
# movie_links = soup.find_all('a', href=True)
# movie_urls = [link['href'] for link in movie_links if '/movie/' in link['href']]

# url_list=[]
# for x in movie_urls:
#   url_list.append('https://www.justwatch.com'+x)
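
The hint above can be written as a small helper that also removes repeated links. This is a sketch only: `build_title_urls` and the sample hrefs are hypothetical, and `urllib.parse.urljoin` is used in place of string concatenation so hrefs that are already absolute are left intact.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.justwatch.com"

def build_title_urls(hrefs, marker="/movie/"):
    """Return absolute, de-duplicated URLs for hrefs that contain `marker`."""
    seen = set()
    urls = []
    for href in hrefs:
        if marker not in href:
            continue  # skip links that do not point at a title page
        absolute = urljoin(BASE_URL, href)
        if absolute not in seen:
            seen.add(absolute)
            urls.append(absolute)
    return urls

# Hypothetical hrefs as they might appear on the listing page.
sample_hrefs = [
    "/in/movie/stree-2",
    "/in/provider/netflix",
    "/in/movie/stree-2",
    "/in/movie/red-one",
]
print(build_title_urls(sample_hrefs))
```

Calling the helper with `marker="/tv-show/"` would apply the same logic to TV show links.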

## **Fetching Movie URLs**

In [None]:
response = requests.get(url, headers=headers)

if response.status_code == 200:
    print("Page fetched successfully!")
    soup = BeautifulSoup(response.content, 'html.parser')

    movie_elements = soup.select('a.title-list-grid__item--link')

    movie_urls = ['https://www.justwatch.com' + element['href'] for element in movie_elements if 'href' in element.attrs]

    print(f"Found {len(movie_urls)} movie URLs:")
    for movie_url in movie_urls[:100]:  # Print the first 100 for preview
        print(movie_url)
else:
    print(f"Failed to fetch the page. Status code: {response.status_code}")

Page fetched successfully!
Found 100 movie URLs:
https://www.justwatch.com/in/movie/pushpa-the-rule-part-2
https://www.justwatch.com/in/movie/lucky-baskhar
https://www.justwatch.com/in/movie/bhool-bhulaiyaa-3
https://www.justwatch.com/in/movie/pushpa
https://www.justwatch.com/in/movie/venom-3-2024
https://www.justwatch.com/in/movie/all-we-imagine-as-light
https://www.justwatch.com/in/movie/amaran-2024
https://www.justwatch.com/in/movie/red-one
https://www.justwatch.com/in/movie/stree-2
https://www.justwatch.com/in/movie/ore-dake-level-up-na-ken-reawakening
https://www.justwatch.com/in/movie/kanguva
https://www.justwatch.com/in/movie/singham-again-2024-0
https://www.justwatch.com/in/movie/kishkkindha-kandam
https://www.justwatch.com/in/movie/the-substance
https://www.justwatch.com/in/movie/bagheera-2024
https://www.justwatch.com/in/movie/ntr-30
https://www.justwatch.com/in/movie/the-wild-robot
https://www.justwatch.com/in/movie/deadpool-3
https://www.justwatch.com/in/movie/sookshma-dars

## **Scraping Movie Title**

In [None]:
# Extract and clean the movie titles from URLs
movie_titles = [movie.split('/')[-1].replace('-', ' ').title() for movie in movie_urls]

# Print the extracted and formatted movie titles
print(f"Found {len(movie_titles)} movie titles:")
for title in movie_titles[:10]:  # Print the first 10 for preview
    print(title)


Found 100 movie titles:
Pushpa The Rule Part 2
Lucky Baskhar
Bhool Bhulaiyaa 3
Pushpa
Venom 3 2024
All We Imagine As Light
Amaran 2024
Red One
Stree 2
Ore Dake Level Up Na Ken Reawakening
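
Titles derived from the URL slug carry disambiguation suffixes, as in "Venom 3 2024" and "Amaran 2024" above. A possible cleanup is sketched below, assuming the suffix is always a trailing four-digit year with an optional numeric counter; `title_from_slug` is a hypothetical helper. It is a heuristic: a slug can differ from the title shown on the page, and a slug whose real title ends in a year (such as `scam-1992`) would lose that year, so reading the page heading is the more reliable route.

```python
import re

# Trailing 4-digit year, optionally followed by a counter ("-2024" or "-2024-0").
YEAR_SUFFIX = re.compile(r"-(?:19|20)\d{2}(?:-\d+)?$")

def title_from_slug(url):
    """Approximate a display title from the last path segment of a title URL."""
    slug = url.rstrip("/").split("/")[-1]
    slug = YEAR_SUFFIX.sub("", slug)
    return slug.replace("-", " ").title()

print(title_from_slug("https://www.justwatch.com/in/movie/venom-3-2024"))
print(title_from_slug("https://www.justwatch.com/in/movie/singham-again-2024-0"))
print(title_from_slug("https://www.justwatch.com/in/movie/stree-2"))
```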


## **Scraping Release Year**

In [None]:
# Write Your Code here

import time

# Initialize a list to store release year data
release_years = []

# Headers to mimic a browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
}

# Loop through each movie URL to fetch its release year
for movie_url in movie_urls:
    try:
        # Fetch the HTML content of the movie URL
        response = requests.get(movie_url, headers=headers)

        if response.status_code == 200:
            # Parse the HTML content with BeautifulSoup
            soup = BeautifulSoup(response.content, 'html.parser')

            try:
                # Find the release year element
                release_year_element = soup.find('span', class_="release-year")  # Update class if needed
                release_year = release_year_element.get_text(strip=True).strip("()")
            except AttributeError:
                release_year = 'N/A'

            # Append the movie URL and release year to the list
            release_years.append({'Movie URL': movie_url, 'Release Year': release_year})
        else:
            print(f"Failed to fetch the HTML content for {movie_url} with status code {response.status_code}")
    except Exception as e:
        print(f"An error occurred for {movie_url}: {e}")

    # Add a delay to prevent being blocked
    time.sleep(0.05)

# Print the final result
print("Extracted release years for movies:")
for movie_data in release_years[:10]:  # Print the first 10 for preview
    print(movie_data)



Extracted release years for movies:
{'Movie URL': 'https://www.justwatch.com/in/movie/pushpa-the-rule-part-2', 'Release Year': '2024'}
{'Movie URL': 'https://www.justwatch.com/in/movie/lucky-baskhar', 'Release Year': '2024'}
{'Movie URL': 'https://www.justwatch.com/in/movie/bhool-bhulaiyaa-3', 'Release Year': '2024'}
{'Movie URL': 'https://www.justwatch.com/in/movie/pushpa', 'Release Year': '2021'}
{'Movie URL': 'https://www.justwatch.com/in/movie/venom-3-2024', 'Release Year': '2024'}
{'Movie URL': 'https://www.justwatch.com/in/movie/all-we-imagine-as-light', 'Release Year': '2024'}
{'Movie URL': 'https://www.justwatch.com/in/movie/amaran-2024', 'Release Year': '2024'}
{'Movie URL': 'https://www.justwatch.com/in/movie/red-one', 'Release Year': '2024'}
{'Movie URL': 'https://www.justwatch.com/in/movie/stree-2', 'Release Year': '2024'}
{'Movie URL': 'https://www.justwatch.com/in/movie/ore-dake-level-up-na-ken-reawakening', 'Release Year': '2024'}


## **Scraping Genres**

In [None]:
# Write Your Code here
# Initialize a list to store genre data
movie_genres = []

# Loop through each movie URL to fetch its genres
for movie_url in movie_urls:
    try:
        # Fetch the HTML content of the movie URL
        response = requests.get(movie_url, headers=headers)

        if response.status_code == 200:
            # Parse the HTML content with BeautifulSoup
            soup = BeautifulSoup(response.content, 'html.parser')

            try:
                # Locate the genres section
                genres_heading = soup.find('h3', class_='detail-infos__subheading', string='Genres')
                if genres_heading:
                    genres_div = genres_heading.find_next_sibling('div', class_='detail-infos__value')
                    genres = genres_div.get_text(strip=True) if genres_div else 'N/A'
                else:
                    genres = 'N/A'
            except AttributeError:
                genres = 'N/A'

            # Append the movie URL and genres to the list
            movie_genres.append({'Movie URL': movie_url, 'Genres': genres})
        else:
            print(f"Failed to fetch the HTML content for {movie_url} with status code {response.status_code}")
    except Exception as e:
        print(f"An error occurred for {movie_url}: {e}")

    # Add a delay to prevent being blocked
    time.sleep(0.05)

# Print the final result
print("Extracted genres for movies:")
for movie_data in movie_genres[:10]:  # Print the first 10 for preview
    print(movie_data)


Extracted genres for movies:
{'Movie URL': 'https://www.justwatch.com/in/movie/pushpa-the-rule-part-2', 'Genres': 'Mystery & Thriller, Crime, Action & Adventure, Drama'}
{'Movie URL': 'https://www.justwatch.com/in/movie/lucky-baskhar', 'Genres': 'Crime, Drama, Mystery & Thriller'}
{'Movie URL': 'https://www.justwatch.com/in/movie/bhool-bhulaiyaa-3', 'Genres': 'Horror, Comedy'}
{'Movie URL': 'https://www.justwatch.com/in/movie/pushpa', 'Genres': 'Action & Adventure, Drama, Mystery & Thriller, Crime'}
{'Movie URL': 'https://www.justwatch.com/in/movie/venom-3-2024', 'Genres': 'Mystery & Thriller, Science-Fiction, Action & Adventure'}
{'Movie URL': 'https://www.justwatch.com/in/movie/all-we-imagine-as-light', 'Genres': 'Drama, Romance'}
{'Movie URL': 'https://www.justwatch.com/in/movie/amaran-2024', 'Genres': 'Action & Adventure, Drama, War & Military'}
{'Movie URL': 'https://www.justwatch.com/in/movie/red-one', 'Genres': 'Comedy, Action & Adventure, Fantasy'}
{'Movie URL': 'https://www.ju

## **Scraping IMDb Rating**

In [None]:
# Write Your Code here
# Initialize a list to store IMDb rating data
movie_imdb_ratings = []

# Loop through each movie URL to fetch its IMDb rating
for movie_url in movie_urls:
    try:
        # Fetch the HTML content of the movie URL
        response = requests.get(movie_url, headers=headers)

        if response.status_code == 200:
            # Parse the HTML content with BeautifulSoup
            soup = BeautifulSoup(response.content, 'html.parser')

            try:
                # Locate the IMDb rating section
                imdb_rating_div = soup.find('div', class_='jw-scoring-listing__rating--group jw-scoring-listing__rating--link')
                if imdb_rating_div:
                    imdb_rating = imdb_rating_div.find('div').get_text(strip=True).split()[0]  # Get the first part of the text (rating)
                else:
                    imdb_rating = 'N/A'
            except AttributeError:
                imdb_rating = 'N/A'

            # Append the movie URL and IMDb rating to the list
            movie_imdb_ratings.append({'Movie URL': movie_url, 'IMDb Rating': imdb_rating})
        else:
            print(f"Failed to fetch the HTML content for {movie_url} with status code {response.status_code}")
    except Exception as e:
        print(f"An error occurred for {movie_url}: {e}")

    # Add a delay to prevent being blocked
    time.sleep(0.05)

# Print the final result
print("Extracted IMDb ratings for movies:")
for movie_data in movie_imdb_ratings[:10]:  # Print the first 10 for preview
    print(movie_data)


Extracted IMDb ratings for movies:
{'Movie URL': 'https://www.justwatch.com/in/movie/pushpa-the-rule-part-2', 'IMDb Rating': '6.5'}
{'Movie URL': 'https://www.justwatch.com/in/movie/lucky-baskhar', 'IMDb Rating': '8.1'}
{'Movie URL': 'https://www.justwatch.com/in/movie/bhool-bhulaiyaa-3', 'IMDb Rating': '5.1'}
{'Movie URL': 'https://www.justwatch.com/in/movie/pushpa', 'IMDb Rating': '7.6'}
{'Movie URL': 'https://www.justwatch.com/in/movie/venom-3-2024', 'IMDb Rating': '6.0'}
{'Movie URL': 'https://www.justwatch.com/in/movie/all-we-imagine-as-light', 'IMDb Rating': '7.4'}
{'Movie URL': 'https://www.justwatch.com/in/movie/amaran-2024', 'IMDb Rating': '8.3'}
{'Movie URL': 'https://www.justwatch.com/in/movie/red-one', 'IMDb Rating': '6.5'}
{'Movie URL': 'https://www.justwatch.com/in/movie/stree-2', 'IMDb Rating': '7.0'}
{'Movie URL': 'https://www.justwatch.com/in/movie/ore-dake-level-up-na-ken-reawakening', 'IMDb Rating': '8.1'}
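
Each per-field loop in this notebook downloads every movie page again. A possible alternative is to parse all fields from a page that has been fetched once. The sketch below assumes the same class names the cells above rely on; `SAMPLE_HTML` is a hypothetical, heavily simplified stand-in for a real page, and `detail_value` and `parse_title_page` are illustrative helper names.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified page; real JustWatch markup is larger and may differ.
SAMPLE_HTML = """
<h1 class="title-detail-hero__details__title">Example Movie
  <span class="release-year">(2024)</span></h1>
<div class="jw-scoring-listing__rating--group jw-scoring-listing__rating--link">
  <div>8.1 (12k)</div>
</div>
<h3 class="detail-infos__subheading">Genres</h3>
<div class="detail-infos__value">Drama, Crime</div>
"""

def detail_value(soup, heading):
    """Text of the value block that follows an info heading, or 'N/A'."""
    h3 = soup.find("h3", class_="detail-infos__subheading", string=heading)
    value = h3.find_next_sibling("div", class_="detail-infos__value") if h3 else None
    return value.get_text(strip=True) if value else "N/A"

def parse_title_page(html):
    """Extract several fields from one already-downloaded page."""
    soup = BeautifulSoup(html, "html.parser")
    year = soup.find("span", class_="release-year")
    rating = soup.find("div", class_="jw-scoring-listing__rating--group")
    inner = rating.find("div") if rating else None
    rating_text = inner.get_text(strip=True) if inner else ""
    return {
        "Release Year": year.get_text(strip=True).strip("()") if year else "N/A",
        "Genres": detail_value(soup, "Genres"),
        "IMDb Rating": rating_text.split()[0] if rating_text else "N/A",
        "Runtime": detail_value(soup, "Runtime"),
    }

record = parse_title_page(SAMPLE_HTML)
print(record)
```

With this shape, one loop over `movie_urls` would call `requests.get` once per movie and append a single dictionary per page, which also keeps all fields for a movie together.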


## **Scraping Runtime/Duration**

In [None]:
# Initialize a list to store runtime data
movie_runtimes = []

# Loop through each movie URL to fetch its runtime
for movie_url in movie_urls:
    try:
        # Fetch the HTML content of the movie URL
        response = requests.get(movie_url, headers=headers)

        if response.status_code == 200:
            # Parse the HTML content with BeautifulSoup
            soup = BeautifulSoup(response.content, 'html.parser')

            try:
                # Locate the runtime section
                runtime_heading = soup.find('h3', class_='detail-infos__subheading', string='Runtime')
                if runtime_heading:
                    runtime_div = runtime_heading.find_next_sibling('div', class_='detail-infos__value')
                    runtime = runtime_div.get_text(strip=True) if runtime_div else 'N/A'
                else:
                    runtime = 'N/A'
            except AttributeError:
                runtime = 'N/A'

            # Append the movie URL and runtime to the list
            movie_runtimes.append({'Movie URL': movie_url, 'Runtime': runtime})
        else:
            print(f"Failed to fetch the HTML content for {movie_url} with status code {response.status_code}")
    except Exception as e:
        print(f"An error occurred for {movie_url}: {e}")

    # Add a delay to prevent being blocked
    time.sleep(0.05)

# Print the final result
print("Extracted runtimes for movies:")
for movie_data in movie_runtimes[:10]:  # Print the first 10 for preview
    print(movie_data)


Extracted runtimes for movies:
{'Movie URL': 'https://www.justwatch.com/in/movie/pushpa-the-rule-part-2', 'Runtime': '3h 18min'}
{'Movie URL': 'https://www.justwatch.com/in/movie/lucky-baskhar', 'Runtime': '2h 50min'}
{'Movie URL': 'https://www.justwatch.com/in/movie/bhool-bhulaiyaa-3', 'Runtime': '2h 38min'}
{'Movie URL': 'https://www.justwatch.com/in/movie/pushpa', 'Runtime': '2h 59min'}
{'Movie URL': 'https://www.justwatch.com/in/movie/venom-3-2024', 'Runtime': '1h 49min'}
{'Movie URL': 'https://www.justwatch.com/in/movie/all-we-imagine-as-light', 'Runtime': '1h 58min'}
{'Movie URL': 'https://www.justwatch.com/in/movie/amaran-2024', 'Runtime': '2h 47min'}
{'Movie URL': 'https://www.justwatch.com/in/movie/red-one', 'Runtime': '2h 4min'}
{'Movie URL': 'https://www.justwatch.com/in/movie/stree-2', 'Runtime': '2h 27min'}
{'Movie URL': 'https://www.justwatch.com/in/movie/ore-dake-level-up-na-ken-reawakening', 'Runtime': '1h 56min'}
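
Runtime strings such as `'3h 18min'` are awkward to sort or average. A minimal sketch for converting them to minutes follows; it assumes the `Nh Mmin` format seen in the output above, and `runtime_to_minutes` is a hypothetical helper name.

```python
import re

# Optional hours part followed by an optional minutes part.
RUNTIME_PATTERN = re.compile(r"^(?:(\d+)\s*h)?\s*(?:(\d+)\s*min)?$")

def runtime_to_minutes(text):
    """Convert strings such as '3h 18min' to minutes; return None if unparseable."""
    match = RUNTIME_PATTERN.match(text.strip())
    if not match or not any(match.groups()):
        return None  # covers 'N/A', empty strings, and unexpected formats
    hours, minutes = (int(part) if part else 0 for part in match.groups())
    return hours * 60 + minutes

for sample in ["3h 18min", "2h 4min", "45min", "N/A"]:
    print(sample, "->", runtime_to_minutes(sample))
```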


## **Scraping Age Rating**

In [None]:
# Write Your Code here
# Initialize a list to store age rating data
movie_age_ratings = []

# Loop through each movie URL to fetch its age rating
for movie_url in movie_urls:
    try:
        # Fetch the HTML content of the movie URL
        response = requests.get(movie_url, headers=headers)

        if response.status_code == 200:
            # Parse the HTML content with BeautifulSoup
            soup = BeautifulSoup(response.content, 'html.parser')

            try:
                # Locate the age rating section
                age_rating_heading = soup.find('h3', class_='detail-infos__subheading', string='Age rating')
                if age_rating_heading:
                    age_rating_div = age_rating_heading.find_next_sibling('div', class_='detail-infos__value')
                    age_rating = age_rating_div.get_text(strip=True) if age_rating_div else 'N/A'
                else:
                    age_rating = 'N/A'
            except AttributeError:
                age_rating = 'N/A'

            # Append the movie URL and age rating to the list
            movie_age_ratings.append({'Movie URL': movie_url, 'Age Rating': age_rating})
        else:
            print(f"Failed to fetch the HTML content for {movie_url} with status code {response.status_code}")
    except Exception as e:
        print(f"An error occurred for {movie_url}: {e}")

    # Add a small delay to prevent being blocked
    time.sleep(0.05)

# Print the final result
print("Extracted age ratings for movies:")
for movie_data in movie_age_ratings[:10]:  # Print the first 10 for preview
    print(movie_data)


Extracted age ratings for movies:
{'Movie URL': 'https://www.justwatch.com/in/movie/pushpa-the-rule-part-2', 'Age Rating': 'N/A'}
{'Movie URL': 'https://www.justwatch.com/in/movie/lucky-baskhar', 'Age Rating': 'UA'}
{'Movie URL': 'https://www.justwatch.com/in/movie/bhool-bhulaiyaa-3', 'Age Rating': 'UA'}
{'Movie URL': 'https://www.justwatch.com/in/movie/pushpa', 'Age Rating': 'UA'}
{'Movie URL': 'https://www.justwatch.com/in/movie/venom-3-2024', 'Age Rating': 'N/A'}
{'Movie URL': 'https://www.justwatch.com/in/movie/all-we-imagine-as-light', 'Age Rating': 'A'}
{'Movie URL': 'https://www.justwatch.com/in/movie/amaran-2024', 'Age Rating': 'N/A'}
{'Movie URL': 'https://www.justwatch.com/in/movie/red-one', 'Age Rating': 'N/A'}
{'Movie URL': 'https://www.justwatch.com/in/movie/stree-2', 'Age Rating': 'UA'}
{'Movie URL': 'https://www.justwatch.com/in/movie/ore-dake-level-up-na-ken-reawakening', 'Age Rating': 'N/A'}


## **Fetching Production Countries Details**

In [None]:
# Initialize a list to store production country data
movie_production_countries = []

# Loop through each movie URL to fetch its production country
for movie_url in movie_urls:
    try:
        # Fetch the HTML content of the movie URL
        response = requests.get(movie_url, headers=headers)

        if response.status_code == 200:
            # Parse the HTML content with BeautifulSoup
            soup = BeautifulSoup(response.content, 'html.parser')

            try:
                # Locate the production country section
                production_country_heading = soup.find('h3', class_='detail-infos__subheading', string=lambda text: 'production country' in text.lower() if text else False)

                if production_country_heading:
                    # Find the next sibling <div> containing the country name
                    production_country_div = production_country_heading.find_next_sibling('div', class_='detail-infos__value')
                    if production_country_div:
                        production_country = production_country_div.get_text(strip=True)
                    else:
                        production_country = 'N/A'
                else:
                    production_country = 'N/A'

            except AttributeError:
                production_country = 'N/A'

            # Append the movie URL and production country to the list
            movie_production_countries.append({'Movie URL': movie_url, 'Production Country': production_country})
        else:
            print(f"Failed to fetch the HTML content for {movie_url} with status code {response.status_code}")
    except Exception as e:
        print(f"An error occurred for {movie_url}: {e}")

    # Add a small delay to prevent being blocked
    time.sleep(0.05)

# Print the final result
print("Extracted production countries for movies:")
for movie_data in movie_production_countries[:10]:  # Print the first 10 for preview
    print(movie_data)


Extracted production countries for movies:
{'Movie URL': 'https://www.justwatch.com/in/movie/pushpa-the-rule-part-2', 'Production Country': 'India'}
{'Movie URL': 'https://www.justwatch.com/in/movie/lucky-baskhar', 'Production Country': 'India'}
{'Movie URL': 'https://www.justwatch.com/in/movie/bhool-bhulaiyaa-3', 'Production Country': 'India'}
{'Movie URL': 'https://www.justwatch.com/in/movie/pushpa', 'Production Country': 'India'}
{'Movie URL': 'https://www.justwatch.com/in/movie/venom-3-2024', 'Production Country': 'United States'}
{'Movie URL': 'https://www.justwatch.com/in/movie/all-we-imagine-as-light', 'Production Country': 'Netherlands, France, India, Italy, Luxembourg'}
{'Movie URL': 'https://www.justwatch.com/in/movie/amaran-2024', 'Production Country': 'India'}
{'Movie URL': 'https://www.justwatch.com/in/movie/red-one', 'Production Country': 'United States'}
{'Movie URL': 'https://www.justwatch.com/in/movie/stree-2', 'Production Country': 'India'}
{'Movie URL': 'https://www.

## **Fetching Streaming Service Details**

In [None]:
import requests
from bs4 import BeautifulSoup
import time

# Initialize a list to store streaming service data
movie_streaming_services = []

# Loop through each movie URL to fetch its streaming service details
for movie_url in movie_urls:
    try:
        # Fetch the HTML content of the movie URL
        response = requests.get(movie_url, headers=headers)

        if response.status_code == 200:
            # Parse the HTML content with BeautifulSoup
            soup = BeautifulSoup(response.content, 'html.parser')

            try:
                # Find all the offer links which represent streaming services
                offers = soup.find_all('a', class_='offer')

                streaming_services = []

                # Loop through each offer and extract the streaming service name
                for offer in offers:
                    # Find the image tag that holds the provider logo URL in 'src' attribute
                    img_tag = offer.find('img', class_='provider-icon')
                    if img_tag:
                        img_url = img_tag.get('src')  # Get the 'src' attribute (logo URL)
                        if img_url:
                            # Extract the service name from the logo URL
                            service_name = img_url.split('/')[-1].split('.')[0]  # e.g., 'amazonprimevideo'
                            streaming_services.append(service_name)

                # If no streaming service is found, use 'N/A'
                if not streaming_services:
                    streaming_services.append('N/A')

            except AttributeError:
                streaming_services = ['N/A']

            # Append the movie URL and its streaming services to the list
            movie_streaming_services.append({'Movie URL': movie_url, 'Streaming Services': streaming_services})
        else:
            print(f"Failed to fetch the HTML content for {movie_url} with status code {response.status_code}")
    except Exception as e:
        print(f"An error occurred for {movie_url}: {e}")

    # Add a small delay to prevent being blocked
    time.sleep(0.05)

# Print the final result
print("Extracted streaming services for movies:")
for movie_data in movie_streaming_services[:10]:  # Print the first 10 for preview
    print(movie_data)


Extracted streaming services for movies:
{'Movie URL': 'https://www.justwatch.com/in/movie/pushpa-the-rule-part-2', 'Streaming Services': ['bookmyshow', 'amazonprimevideo']}
{'Movie URL': 'https://www.justwatch.com/in/movie/lucky-baskhar', 'Streaming Services': ['netflix', 'bookmyshow', 'amazonprimevideo']}
{'Movie URL': 'https://www.justwatch.com/in/movie/bhool-bhulaiyaa-3', 'Streaming Services': ['bookmyshow', 'amazonprimevideo']}
{'Movie URL': 'https://www.justwatch.com/in/movie/pushpa', 'Streaming Services': ['amazonprimevideo', 'amazonprimevideo', 'amazon', 'amazonprimevideo', 'bookmyshow']}
{'Movie URL': 'https://www.justwatch.com/in/movie/venom-3-2024', 'Streaming Services': ['itunes', 'zee5', 'amazon', 'amazonprimevideo', 'itunes', 'itunes']}
{'Movie URL': 'https://www.justwatch.com/in/movie/all-we-imagine-as-light', 'Streaming Services': ['bookmyshow', 'amazonprimevideo']}
{'Movie URL': 'https://www.justwatch.com/in/movie/amaran-2024', 'Streaming Services': ['netflix', 'bookmy
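
The output above lists some providers several times for one movie, presumably because a provider can appear under more than one offer on the page. A small sketch for removing the repeats while keeping first-seen order is shown below. `unique_services` and `service_from_logo_url` are hypothetical helpers, and the logo URL in the example is invented for illustration.

```python
def service_from_logo_url(img_url):
    """Take the file name without its extension as the service name."""
    return img_url.split("/")[-1].split(".")[0]

def unique_services(services):
    """Drop repeated names while keeping first-seen order."""
    return list(dict.fromkeys(services))

# Values copied from the 'pushpa' row in the output above.
raw = ["amazonprimevideo", "amazonprimevideo", "amazon", "amazonprimevideo", "bookmyshow"]
print(unique_services(raw))

# Invented logo URL, only to show the string handling.
print(service_from_logo_url("https://images.example/icon/123/s100/netflix.webp"))
```

De-duplicating before the analysis step matters because repeated names would otherwise inflate the count for a service.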

## **Now Creating Movies DataFrame**

In [None]:
import pandas as pd

# Initialize a list to store all movie data
movies_data = []

# Build URL-keyed lookups so that a failed request in one scraping loop
# cannot shift later values onto the wrong movie
def lookup_by_url(records, field):
    return {item['Movie URL']: item[field] for item in records}

year_dict = lookup_by_url(release_years, 'Release Year')
genre_dict = lookup_by_url(movie_genres, 'Genres')
rating_dict = lookup_by_url(movie_imdb_ratings, 'IMDb Rating')
runtime_dict = lookup_by_url(movie_runtimes, 'Runtime')
age_rating_dict = lookup_by_url(movie_age_ratings, 'Age Rating')
country_dict = lookup_by_url(movie_production_countries, 'Production Country')
services_dict = lookup_by_url(movie_streaming_services, 'Streaming Services')

# Combine all data into a structured format
# (movie_titles was derived from movie_urls, so the two lists are aligned)
for movie_url, movie_title in zip(movie_urls, movie_titles):
    movie_data = {
        'Movie URL': movie_url,
        'Movie Title': movie_title,
        'Release Year': year_dict.get(movie_url, 'N/A'),
        'Genres': genre_dict.get(movie_url, 'N/A'),
        'IMDB Rating': rating_dict.get(movie_url, 'N/A'),
        'Runtime': runtime_dict.get(movie_url, 'N/A'),
        'Age Rating': age_rating_dict.get(movie_url, 'N/A'),
        'Production Country': country_dict.get(movie_url, 'N/A'),
        'Streaming Services': services_dict.get(movie_url, ['N/A'])
    }

    # Append each movie's data to the list
    movies_data.append(movie_data)

# Convert the list of movies data into a pandas DataFrame
df = pd.DataFrame(movies_data)

# Print the DataFrame to check
print(df.head(10))  # Display the first 10 rows for preview


                                           Movie URL  \
0  https://www.justwatch.com/in/movie/pushpa-the-...   
1   https://www.justwatch.com/in/movie/lucky-baskhar   
2  https://www.justwatch.com/in/movie/bhool-bhula...   
3          https://www.justwatch.com/in/movie/pushpa   
4    https://www.justwatch.com/in/movie/venom-3-2024   
5  https://www.justwatch.com/in/movie/all-we-imag...   
6     https://www.justwatch.com/in/movie/amaran-2024   
7         https://www.justwatch.com/in/movie/red-one   
8         https://www.justwatch.com/in/movie/stree-2   
9  https://www.justwatch.com/in/movie/ore-dake-le...   

                            Movie Title Release Year  \
0                Pushpa The Rule Part 2         2024   
1                         Lucky Baskhar         2024   
2                     Bhool Bhulaiyaa 3         2024   
3                                Pushpa         2021   
4                          Venom 3 2024         2024   
5               All We Imagine As Light        

In [None]:
df.head()

| | Movie URL | Movie Title | Release Year | Genres | IMDB Rating | Runtime | Age Rating | Production Country | Streaming Services |
|---|---|---|---|---|---|---|---|---|---|
| 0 | https://www.justwatch.com/in/movie/pushpa-the-... | Pushpa The Rule Part 2 | 2024 | Mystery & Thriller, Crime, Action & Adventure,... | 6.5 | 3h 18min | | India | [bookmyshow, amazonprimevideo] |
| 1 | https://www.justwatch.com/in/movie/lucky-baskhar | Lucky Baskhar | 2024 | Crime, Drama, Mystery & Thriller | 8.1 | 2h 50min | UA | India | [netflix, bookmyshow, amazonprimevideo] |
| 2 | https://www.justwatch.com/in/movie/bhool-bhula... | Bhool Bhulaiyaa 3 | 2024 | Horror, Comedy | 5.1 | 2h 38min | UA | India | [bookmyshow, amazonprimevideo] |
| 3 | https://www.justwatch.com/in/movie/pushpa | Pushpa | 2021 | Action & Adventure, Drama, Mystery & Thriller,... | 7.6 | 2h 59min | UA | India | [amazonprimevideo, amazonprimevideo, amazon, a... |
| 4 | https://www.justwatch.com/in/movie/venom-3-2024 | Venom 3 2024 | 2024 | Mystery & Thriller, Science-Fiction, Action & ... | 6.0 | 1h 49min | | United States | [itunes, zee5, amazon, amazonprimevideo, itune... |


## **Scraping TV Show Data**

In [None]:
# Specifying the URL from which tv show related data will be fetched
tv_url='https://www.justwatch.com/in/tv-shows?release_year_from=2000'


## **Fetching TV Show URL Details**

In [None]:
# Write Your Code here

# Headers to mimic a browser request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36'
}

# Fetch the webpage
response = requests.get(tv_url, headers=headers)

if response.status_code == 200:
    print("Page fetched successfully!")
    soup = BeautifulSoup(response.content, 'html.parser')

    # Select TV show elements using their CSS selectors
    tv_elements = soup.select('a.title-list-grid__item--link')

    # Extract URLs and titles for TV shows
    tv_urls = ['https://www.justwatch.com' + element['href'] for element in tv_elements if 'href' in element.attrs]
    tv_titles = [url.split('/')[-1].replace('-', ' ').title() for url in tv_urls]

    # Combine titles and URLs
    tv_shows = list(zip(tv_titles, tv_urls))

    # Print up to the first 100 TV shows for preview
    print(f"Found {len(tv_shows)} TV shows:")
    for title, url in tv_shows[:100]:
        print(f"{title}: {url}")
else:
    print(f"Failed to fetch the page. Status code: {response.status_code}")

Page fetched successfully!
Found 100 TV shows:
Thukra Ke Mera Pyaar: https://www.justwatch.com/in/tv-show/thukra-ke-mera-pyaar
From: https://www.justwatch.com/in/tv-show/from
The Day Of The Jackal: https://www.justwatch.com/in/tv-show/the-day-of-the-jackal
Mismatched: https://www.justwatch.com/in/tv-show/mismatched
Mirzapur: https://www.justwatch.com/in/tv-show/mirzapur
Secret Level: https://www.justwatch.com/in/tv-show/secret-level
Scam 1992: https://www.justwatch.com/in/tv-show/scam-1992
Dune The Sisterhood: https://www.justwatch.com/in/tv-show/dune-the-sisterhood
The Penguin: https://www.justwatch.com/in/tv-show/the-penguin
Black Doves: https://www.justwatch.com/in/tv-show/black-doves
These Black Black Eyes: https://www.justwatch.com/in/tv-show/these-black-black-eyes
Game Of Thrones: https://www.justwatch.com/in/tv-show/game-of-thrones
Solo Leveling 2024: https://www.justwatch.com/in/tv-show/solo-leveling-2024
Aindham Vedham: https://www.justwatch.com/in/tv-show/aindham-vedham
Yello

## **Fetching TV Show Title Details**

In [None]:
# Initialize a list to store TV show titles
tv_titles = []

# Loop through each TV show URL to fetch its title
for tv_url in tv_urls:
    try:
        # Fetch the HTML content of the TV show URL
        response = requests.get(tv_url, headers=headers)

        if response.status_code == 200:
            # Parse the HTML content with BeautifulSoup
            soup = BeautifulSoup(response.content, 'html.parser')

            try:
                # Locate the title element
                title_element = soup.find('h1', class_="title-detail-hero__details__title")
                if title_element:
                    title = title_element.get_text(strip=True).split('(')[0].strip()  # Extract title before year
                else:
                    title = 'N/A'
            except AttributeError:
                title = 'N/A'

            # Append the title to the list
            tv_titles.append({'TV Show URL': tv_url, 'Title': title})
        else:
            print(f"Failed to fetch the HTML content for {tv_url} with status code {response.status_code}")
    except Exception as e:
        print(f"An error occurred for {tv_url}: {e}")

    # Add a small delay to prevent being blocked
    time.sleep(0.05)

# Print the extracted titles
print("Extracted TV show titles:")
for tv_show in tv_titles[:10]:  # Print the first 10 extracted titles
    print(tv_show)


Extracted TV show titles:
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/thukra-ke-mera-pyaar', 'Title': 'Thukra Ke Mera Pyaar'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/from', 'Title': 'From'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/the-day-of-the-jackal', 'Title': 'The Day of the Jackal'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/mismatched', 'Title': 'Mismatched'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/mirzapur', 'Title': 'Mirzapur'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/secret-level', 'Title': 'Secret Level'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/scam-1992', 'Title': 'Scam 1992: The Harshad Mehta Story'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/dune-the-sisterhood', 'Title': 'Dune: Prophecy'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/the-penguin', 'Title': 'The Penguin'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/black-doves', 'Title': 'Black Doves'}
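
The title loop above, like the loops in the sections that follow, downloads every TV show page from scratch, so each URL is requested once per field. A minimal sketch of a fetch-once cache is shown below. The helper name `get_page`, the `cache` dictionary, and the injected `fetch` callable (a stand-in for something like `requests.get(url, headers=headers).text`) are assumptions for illustration and are not part of the original notebook.

```python
from typing import Callable, Dict, Optional


def get_page(
    url: str,
    cache: Dict[str, Optional[str]],
    fetch: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Return the HTML for `url`, calling `fetch` only the first time it is requested."""
    if url not in cache:
        cache[url] = fetch(url)
    return cache[url]


# Offline demonstration with a fake fetcher (hypothetical, no network access).
download_log = []


def fake_fetch(url: str) -> str:
    download_log.append(url)
    return f"<html><!-- {url} --></html>"


page_cache: Dict[str, Optional[str]] = {}
get_page("https://example.com/show", page_cache, fake_fetch)
get_page("https://example.com/show", page_cache, fake_fetch)
print(f"Downloads performed: {len(download_log)}")
```

With a cache like this, the per-field loops could parse the stored HTML instead of issuing a new request for every field.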


## **Fetching Release Year**

In [None]:
# Write Your Code here
# Initialize a list to store release year data
tv_release_years = []

# Loop through each TV show URL to fetch its release year
for tv_url in tv_urls:  # Iterate over every collected URL
    try:
        # Fetch the HTML content of the TV show URL
        response = requests.get(tv_url, headers=headers)

        if response.status_code == 200:
            # Parse the HTML content with BeautifulSoup
            soup = BeautifulSoup(response.content, 'html.parser')

            try:
                # Locate the release year element
                release_year_element = soup.find('span', class_="release-year")
                if release_year_element:
                    release_year = release_year_element.get_text(strip=True).strip("()")  # Remove parentheses
                else:
                    release_year = 'N/A'
            except AttributeError:
                release_year = 'N/A'

            # Append the release year to the list
            tv_release_years.append({'TV Show URL': tv_url, 'Release Year': release_year})
        else:
            print(f"Failed to fetch the HTML content for {tv_url} with status code {response.status_code}")
    except Exception as e:
        print(f"An error occurred for {tv_url}: {e}")

    # Add a small delay to prevent being blocked
    time.sleep(0.05)

# Print the extracted release years
print("Extracted release years for TV shows:")
for tv_show in tv_release_years[:10]:  # Print the first 10 extracted release years
    print(tv_show)


Extracted release years for TV shows:
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/thukra-ke-mera-pyaar', 'Release Year': '2024'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/from', 'Release Year': '2022'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/the-day-of-the-jackal', 'Release Year': '2024'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/mismatched', 'Release Year': '2020'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/mirzapur', 'Release Year': '2018'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/secret-level', 'Release Year': '2024'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/scam-1992', 'Release Year': '2020'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/dune-the-sisterhood', 'Release Year': '2024'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/the-penguin', 'Release Year': '2024'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/black-doves', 'Release Year': '2024'}
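
The loops in this notebook pause for 0.05 seconds between requests and simply skip a URL when a request fails. One possible alternative is to retry a failed request with an increasing delay. The sketch below is an assumption-laden illustration: `fetch_with_retry`, its parameters, and the injected `fetch` and `sleep` callables are hypothetical names, not part of the original code.

```python
import time
from typing import Callable, Optional


def fetch_with_retry(
    url: str,
    fetch: Callable[[str], Optional[str]],
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call `fetch(url)` up to `retries` times, doubling the wait after each failure."""
    for attempt in range(retries):
        try:
            result = fetch(url)
            if result is not None:
                return result
        except Exception as exc:  # Broad on purpose: the fetcher is caller-supplied
            print(f"Attempt {attempt + 1} failed for {url}: {exc}")
        # Wait before the next attempt, but not after the final one
        if attempt < retries - 1:
            sleep(base_delay * (2 ** attempt))
    return None
```

Here `fetch` is expected to return the page text on success and `None` (or raise) on failure; passing a custom `sleep` makes the helper easy to test without real waiting.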


## **Fetching TV Show Genre Details**

In [None]:
# Write Your Code here
# Initialize a list to store genre data
tv_genres = []

# Loop through each TV show URL to fetch its genres
for tv_url in tv_urls:  # Iterate over every collected URL
    try:
        # Fetch the HTML content of the TV show URL
        response = requests.get(tv_url, headers=headers)

        if response.status_code == 200:
            # Parse the HTML content with BeautifulSoup
            soup = BeautifulSoup(response.content, 'html.parser')

            try:
                # Locate the genres section
                genres_heading = soup.find('h3', class_='detail-infos__subheading', string='Genres')
                if genres_heading:
                    genres_div = genres_heading.find_next_sibling('div', class_='detail-infos__value')
                    genres = genres_div.get_text(strip=True) if genres_div else 'N/A'
                else:
                    genres = 'N/A'
            except AttributeError:
                genres = 'N/A'

            # Append the genre data to the list
            tv_genres.append({'TV Show URL': tv_url, 'Genres': genres})
        else:
            print(f"Failed to fetch the HTML content for {tv_url} with status code {response.status_code}")
    except Exception as e:
        print(f"An error occurred for {tv_url}: {e}")

    # Add a small delay to prevent being blocked
    time.sleep(0.05)

# Print the extracted genres
print("Extracted genres for TV shows:")
for tv_show in tv_genres[:10]:  # Print the first 10 extracted genres
    print(tv_show)


Extracted genres for TV shows:
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/thukra-ke-mera-pyaar', 'Genres': 'Romance'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/from', 'Genres': 'Mystery & Thriller, Drama, Horror, Science-Fiction'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/the-day-of-the-jackal', 'Genres': 'Action & Adventure, Crime, Drama, Mystery & Thriller'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/mismatched', 'Genres': 'Comedy, Drama, Romance'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/mirzapur', 'Genres': 'Action & Adventure, Drama, Mystery & Thriller, Crime'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/secret-level', 'Genres': 'Animation, Action & Adventure, Fantasy, Science-Fiction'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/scam-1992', 'Genres': 'Crime, Drama, Mystery & Thriller'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/dune-the-sisterhood', 'Genres': 'Drama, Science-Fiction, Action 

## **Fetching IMDb Rating Details**

In [None]:
# Write Your Code here
# Initialize a list to store IMDb rating data
tv_imdb_ratings = []

# Loop through each TV show URL to fetch its IMDb rating
for tv_url in tv_urls:  # Iterate over every collected URL
    try:
        # Fetch the HTML content of the TV show URL
        response = requests.get(tv_url, headers=headers)

        if response.status_code == 200:
            # Parse the HTML content with BeautifulSoup
            soup = BeautifulSoup(response.content, 'html.parser')

            try:
                # Locate the IMDb rating element
                imdb_rating_element = soup.find('span', class_='imdb-score')
                if imdb_rating_element:
                    imdb_rating = imdb_rating_element.get_text(strip=True).split()[0]  # Extract the numerical rating
                else:
                    imdb_rating = 'N/A'
            except AttributeError:
                imdb_rating = 'N/A'

            # Append the IMDb rating data to the list
            tv_imdb_ratings.append({'TV Show URL': tv_url, 'IMDb Rating': imdb_rating})
        else:
            print(f"Failed to fetch the HTML content for {tv_url} with status code {response.status_code}")
    except Exception as e:
        print(f"An error occurred for {tv_url}: {e}")

    # Add a small delay to prevent being blocked
    time.sleep(0.05)

# Print the extracted IMDb ratings
print("Extracted IMDb ratings for TV shows:")
for tv_show in tv_imdb_ratings[:10]:  # Print the first 10 extracted IMDb ratings
    print(tv_show)


Extracted IMDb ratings for TV shows:
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/thukra-ke-mera-pyaar', 'IMDb Rating': '6.6'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/from', 'IMDb Rating': '7.8'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/the-day-of-the-jackal', 'IMDb Rating': '8.2'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/mismatched', 'IMDb Rating': '5.9'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/mirzapur', 'IMDb Rating': '8.4'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/secret-level', 'IMDb Rating': '7.7'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/scam-1992', 'IMDb Rating': '9.2'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/dune-the-sisterhood', 'IMDb Rating': '7.3'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/the-penguin', 'IMDb Rating': '8.7'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/black-doves', 'IMDb Rating': '7.2'}


## **Fetching Age Rating Details**

In [None]:
# Write Your Code here
# Initialize a list to store age rating data
tv_age_ratings = []

# Loop through each TV show URL to fetch its age rating
for tv_url in tv_urls:
    try:
        # Fetch the HTML content of the TV show URL
        response = requests.get(tv_url, headers=headers)

        if response.status_code == 200:
            # Parse the HTML content with BeautifulSoup
            soup = BeautifulSoup(response.content, 'html.parser')

            try:
                # Locate the age rating section
                age_rating_heading = soup.find('h3', class_='detail-infos__subheading', string='Age rating')
                if age_rating_heading:
                    age_rating_div = age_rating_heading.find_next_sibling('div', class_='detail-infos__value')
                    age_rating = age_rating_div.get_text(strip=True) if age_rating_div else 'N/A'
                else:
                    age_rating = 'N/A'
            except AttributeError:
                age_rating = 'N/A'

            # Append the age rating data to the list
            tv_age_ratings.append({'TV Show URL': tv_url, 'Age Rating': age_rating})
        else:
            print(f"Failed to fetch the HTML content for {tv_url} with status code {response.status_code}")
    except Exception as e:
        print(f"An error occurred for {tv_url}: {e}")

    # Add a small delay to prevent being blocked
    time.sleep(0.05)

# Print the extracted age ratings
print("Extracted age ratings for TV shows:")
for tv_show in tv_age_ratings[:10]:  # Print the first 10 extracted age ratings
    print(tv_show)


Extracted age ratings for TV shows:
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/thukra-ke-mera-pyaar', 'Age Rating': 'N/A'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/from', 'Age Rating': 'N/A'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/the-day-of-the-jackal', 'Age Rating': 'N/A'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/mismatched', 'Age Rating': 'N/A'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/mirzapur', 'Age Rating': 'A'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/secret-level', 'Age Rating': 'N/A'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/scam-1992', 'Age Rating': 'N/A'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/dune-the-sisterhood', 'Age Rating': 'N/A'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/the-penguin', 'Age Rating': 'N/A'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/black-doves', 'Age Rating': 'N/A'}
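
Most age ratings above come back as 'N/A'. The pages may simply not list one, but it is also possible (an unverified assumption) that the exact `string='Age rating'` match misses a heading whose text carries extra whitespace or nested markup. The sketch below shows a more tolerant lookup that compares normalized heading text, in the spirit of the substring match used in the production country section. The helper name `get_detail` and the sample markup are hypothetical; the markup only mimics the class names used in this notebook and is not copied from JustWatch.

```python
from bs4 import BeautifulSoup


def get_detail(soup: BeautifulSoup, label: str) -> str:
    """Return the value next to the first detail heading whose text contains `label`."""
    for heading in soup.find_all('h3', class_='detail-infos__subheading'):
        # Collapse whitespace so " Age  rating " and nested tags still compare cleanly
        heading_text = ' '.join(heading.get_text(' ', strip=True).split())
        if label.lower() in heading_text.lower():
            value = heading.find_next_sibling('div', class_='detail-infos__value')
            return value.get_text(strip=True) if value else 'N/A'
    return 'N/A'


# Made-up markup for an offline demonstration
sample_html = """
<div>
  <h3 class="detail-infos__subheading"> Age rating </h3>
  <div class="detail-infos__value">A</div>
</div>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')
print(get_detail(sample_soup, 'Age rating'))
print(get_detail(sample_soup, 'Runtime'))
```

The same helper could serve the genre, runtime, and production country lookups, since they all follow the heading-plus-value layout assumed here.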


## **Fetching Production Country Details**

In [None]:
# Write Your Code here
# Initialize a list to store production country data
tv_production_countries = []

# Loop through each TV show URL to fetch its production country
for tv_url in tv_urls:  # Iterate over every collected URL
    try:
        # Fetch the HTML content of the TV show URL
        response = requests.get(tv_url, headers=headers)

        if response.status_code == 200:
            # Parse the HTML content with BeautifulSoup
            soup = BeautifulSoup(response.content, 'html.parser')

            try:
                # Locate the production country section
                production_country_heading = soup.find(
                    'h3',
                    class_='detail-infos__subheading',
                    string=lambda text: 'Production country' in text if text else False
                )
                if production_country_heading:
                    production_country_div = production_country_heading.find_next_sibling('div', class_='detail-infos__value')
                    production_country = production_country_div.get_text(strip=True) if production_country_div else 'N/A'
                else:
                    production_country = 'N/A'
            except AttributeError:
                production_country = 'N/A'

            # Append the production country data to the list
            tv_production_countries.append({'TV Show URL': tv_url, 'Production Country': production_country})
        else:
            print(f"Failed to fetch the HTML content for {tv_url} with status code {response.status_code}")
    except Exception as e:
        print(f"An error occurred for {tv_url}: {e}")

    # Add a small delay to prevent being blocked
    time.sleep(0.05)

# Print the extracted production countries
print("Extracted production countries for TV shows:")
for tv_show in tv_production_countries[:10]:  # Print the first 10 extracted production countries
    print(tv_show)


Extracted production countries for TV shows:
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/thukra-ke-mera-pyaar', 'Production Country': 'India'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/from', 'Production Country': 'United States'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/the-day-of-the-jackal', 'Production Country': 'United Kingdom, United States'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/mismatched', 'Production Country': 'India'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/mirzapur', 'Production Country': 'India'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/secret-level', 'Production Country': 'United States'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/scam-1992', 'Production Country': 'India'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/dune-the-sisterhood', 'Production Country': 'Hungary, United States, Canada'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/the-penguin', 'Production Cou

## **Fetching Streaming Service Details**

In [None]:
# Write Your Code here
# Initialize a list to store streaming service data
tv_streaming_services = []

# Loop through each TV show URL to fetch its streaming service details
for tv_url in tv_urls:  # Iterate over every collected URL
    try:
        # Fetch the HTML content of the TV show URL
        response = requests.get(tv_url, headers=headers)

        if response.status_code == 200:
            # Parse the HTML content with BeautifulSoup
            soup = BeautifulSoup(response.content, 'html.parser')

            try:
                # Find all the offer links which represent streaming services
                offers = soup.find_all('a', class_='offer')

                streaming_services = []

                # Loop through each offer and extract the streaming service name
                for offer in offers:
                    # Find the image tag that holds the provider logo URL in 'src' attribute
                    img_tag = offer.find('img', class_='provider-icon')
                    if img_tag:
                        img_url = img_tag.get('src')  # Get the 'src' attribute (logo URL)
                        if img_url:
                            # Extract the service name from the logo URL
                            service_name = img_url.split('/')[-1].split('.')[0]  # e.g., 'amazonprimevideo'
                            streaming_services.append(service_name)

                # If no streaming service is found, use 'N/A'
                if not streaming_services:
                    streaming_services.append('N/A')

            except AttributeError:
                streaming_services = ['N/A']

            # Append the streaming service data to the list
            tv_streaming_services.append({'TV Show URL': tv_url, 'Streaming Services': streaming_services})
        else:
            print(f"Failed to fetch the HTML content for {tv_url} with status code {response.status_code}")
    except Exception as e:
        print(f"An error occurred for {tv_url}: {e}")

    # Add a small delay to prevent being blocked
    time.sleep(0.05)

# Print the extracted streaming services
print("Extracted streaming services for TV shows:")
for tv_show in tv_streaming_services[:10]:  # Print the first 10 extracted streaming services
    print(tv_show)


Extracted streaming services for TV shows:
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/thukra-ke-mera-pyaar', 'Streaming Services': ['amazonprimevideo']}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/from', 'Streaming Services': ['amazonprimevideo', 'amazonprimevideo', 'amazonprimevideo']}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/the-day-of-the-jackal', 'Streaming Services': ['jiocinema', 'amazonprimevideo']}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/mismatched', 'Streaming Services': ['netflix', 'amazonprimevideo']}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/mirzapur', 'Streaming Services': ['amazonprimevideo', 'amazonprimevideo', 'amazonprimevideo', 'amazonprimevideo']}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/secret-level', 'Streaming Services': ['amazonprimevideo', 'amazonprimevideo', 'amazonprimevideo']}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/scam-1992', 'Streaming Services': ['sonyliv', 'vimoviesand
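
The output above repeats the same provider several times for one show (for example, three `amazonprimevideo` entries for "From"), because every offer link contributes its logo name. A minimal sketch for collapsing the repeats while keeping first-seen order follows; `unique_services` is a hypothetical helper name, not part of the original notebook.

```python
from typing import Iterable, List


def unique_services(services: Iterable[str]) -> List[str]:
    """Drop repeated provider names while keeping first-seen order."""
    # dict keys are unique and preserve insertion order
    return list(dict.fromkeys(services))


print(unique_services(['amazonprimevideo', 'amazonprimevideo', 'amazonprimevideo']))
print(unique_services(['netflix', 'amazonprimevideo', 'netflix']))
```

Applying this to `streaming_services` before appending would keep one entry per provider in each row.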

## **Fetching Duration Details**

In [None]:
# Write Your Code here
# Initialize a list to store runtime data
tv_runtimes = []

# Loop through each TV show URL to fetch its runtime
for tv_url in tv_urls:  # Iterate over every collected URL
    try:
        # Fetch the HTML content of the TV show URL
        response = requests.get(tv_url, headers=headers)

        if response.status_code == 200:
            # Parse the HTML content with BeautifulSoup
            soup = BeautifulSoup(response.content, 'html.parser')

            try:
                # Locate the runtime section
                runtime_heading = soup.find('h3', class_='detail-infos__subheading', string='Runtime')
                if runtime_heading:
                    runtime_div = runtime_heading.find_next_sibling('div', class_='detail-infos__value')
                    runtime = runtime_div.get_text(strip=True) if runtime_div else 'N/A'
                else:
                    runtime = 'N/A'
            except AttributeError:
                runtime = 'N/A'

            # Append the runtime data to the list
            tv_runtimes.append({'TV Show URL': tv_url, 'Runtime': runtime})
        else:
            print(f"Failed to fetch the HTML content for {tv_url} with status code {response.status_code}")
    except Exception as e:
        print(f"An error occurred for {tv_url}: {e}")

    # Add a small delay to prevent being blocked
    time.sleep(0.05)

# Print the extracted runtimes
print("Extracted runtimes for TV shows:")
for tv_show in tv_runtimes[:10]:  # Print the first 10 extracted runtimes
    print(tv_show)


Extracted runtimes for TV shows:
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/thukra-ke-mera-pyaar', 'Runtime': '23min'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/from', 'Runtime': '51min'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/the-day-of-the-jackal', 'Runtime': '51min'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/mismatched', 'Runtime': '36min'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/mirzapur', 'Runtime': '50min'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/secret-level', 'Runtime': '15min'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/scam-1992', 'Runtime': '52min'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/dune-the-sisterhood', 'Runtime': '1h 3min'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/the-penguin', 'Runtime': '58min'}
{'TV Show URL': 'https://www.justwatch.com/in/tv-show/black-doves', 'Runtime': '54min'}
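
The runtimes above are strings in two shapes, such as '51min' and '1h 3min', which cannot be averaged or sorted numerically as they stand. The sketch below converts them to minutes. The helper name `runtime_to_minutes` is hypothetical, and the pattern assumes only the two shapes seen in the output.

```python
import re
from typing import Optional


def runtime_to_minutes(runtime: str) -> Optional[int]:
    """Convert strings such as '51min' or '1h 3min' to minutes; None if unparseable."""
    match = re.fullmatch(r'\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*min)?\s*', runtime)
    # Reject non-matching text (e.g. 'N/A') and strings with neither hours nor minutes
    if not match or not any(match.groups()):
        return None
    hours = int(match.group(1) or 0)
    minutes = int(match.group(2) or 0)
    return hours * 60 + minutes


for text in ['23min', '1h 3min', 'N/A']:
    print(text, '->', runtime_to_minutes(text))
```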


## **Creating TV Show DataFrame**

In [None]:
# Write Your Code here
import pandas as pd

# Merge all the extracted data into a single DataFrame
tv_show_data = []

# Combine data by list position; this assumes every list holds the same URLs in the same order
for i in range(len(tv_titles)):
    try:
        tv_show_data.append({
            'TV Show URL': tv_titles[i]['TV Show URL'],
            'Title': tv_titles[i]['Title'],
            'Release Year': tv_release_years[i]['Release Year'] if i < len(tv_release_years) else 'N/A',
            'Genres': tv_genres[i]['Genres'] if i < len(tv_genres) else 'N/A',
            'IMDb Rating': tv_imdb_ratings[i]['IMDb Rating'] if i < len(tv_imdb_ratings) else 'N/A',
            'Age Rating': tv_age_ratings[i]['Age Rating'] if i < len(tv_age_ratings) else 'N/A',
            'Production Country': tv_production_countries[i]['Production Country'] if i < len(tv_production_countries) else 'N/A',
            'Streaming Services': ', '.join(tv_streaming_services[i]['Streaming Services']) if i < len(tv_streaming_services) else 'N/A',
            'Runtime': tv_runtimes[i]['Runtime'] if i < len(tv_runtimes) else 'N/A',
        })
    except IndexError as e:
        print(f"An error occurred while merging data: {e}")
        continue

# Create the DataFrame
tv_show_df = pd.DataFrame(tv_show_data)

# Display the DataFrame
print(tv_show_df.head())

# Save to a CSV file
tv_show_df.to_csv('tv_show_data.csv', index=False)
print("TV show data successfully saved to 'tv_show_data.csv'.")


                                         TV Show URL                  Title  \
0  https://www.justwatch.com/in/tv-show/thukra-ke...   Thukra Ke Mera Pyaar   
1          https://www.justwatch.com/in/tv-show/from                   From   
2  https://www.justwatch.com/in/tv-show/the-day-o...  The Day of the Jackal   
3    https://www.justwatch.com/in/tv-show/mismatched             Mismatched   
4      https://www.justwatch.com/in/tv-show/mirzapur               Mirzapur   

  Release Year                                             Genres IMDb Rating  \
0         2024                                            Romance         6.6   
1         2022  Mystery & Thriller, Drama, Horror, Science-Fic...         7.8   
2         2024  Action & Adventure, Crime, Drama, Mystery & Th...         8.2   
3         2020                             Comedy, Drama, Romance         5.9   
4         2018  Action & Adventure, Drama, Mystery & Thriller,...         8.4   

  Age Rating             Production Co
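
The merge above pairs records by list position. Because each scraping loop skips a URL when its request fails, one failed request would leave that list shorter and shift every later row. A sketch of merging on the URL instead is shown below; `merge_by_url` and the short sample records are hypothetical and use only the standard library.

```python
from typing import Any, Dict, List


def merge_by_url(*field_lists: List[Dict[str, Any]], key: str = 'TV Show URL') -> List[Dict[str, Any]]:
    """Combine per-field record lists into one record per URL, filling gaps with 'N/A'."""
    merged: Dict[str, Dict[str, Any]] = {}
    columns: List[str] = []
    for records in field_lists:
        for record in records:
            row = merged.setdefault(record[key], {key: record[key]})
            for name, value in record.items():
                if name != key:
                    row[name] = value
                    if name not in columns:
                        columns.append(name)
    # Give every row the same columns so a DataFrame built from it has no holes
    return [{key: url, **{c: row.get(c, 'N/A') for c in columns}} for url, row in merged.items()]


# Sample records: the rating for the first URL is missing, as if its request had failed
titles = [{'TV Show URL': 'u1', 'Title': 'A'}, {'TV Show URL': 'u2', 'Title': 'B'}]
ratings = [{'TV Show URL': 'u2', 'IMDb Rating': '8.0'}]
for row in merge_by_url(titles, ratings):
    print(row)
```

The returned list can be passed to `pd.DataFrame(...)` in the same way as `tv_show_data`.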

In [None]:
tv_show_df.head()

Unnamed: 0,TV Show URL,Title,Release Year,Genres,IMDb Rating,Age Rating,Production Country,Streaming Services,Runtime
0,https://www.justwatch.com/in/tv-show/thukra-ke...,Thukra Ke Mera Pyaar,2024,Romance,6.6,,India,amazonprimevideo,23min
1,https://www.justwatch.com/in/tv-show/from,From,2022,"Mystery & Thriller, Drama, Horror, Science-Fic...",7.8,,United States,"amazonprimevideo, amazonprimevideo, amazonprim...",51min
2,https://www.justwatch.com/in/tv-show/the-day-o...,The Day of the Jackal,2024,"Action & Adventure, Crime, Drama, Mystery & Th...",8.2,,"United Kingdom, United States","jiocinema, amazonprimevideo",51min
3,https://www.justwatch.com/in/tv-show/mismatched,Mismatched,2020,"Comedy, Drama, Romance",5.9,,India,"netflix, amazonprimevideo",36min
4,https://www.justwatch.com/in/tv-show/mirzapur,Mirzapur,2018,"Action & Adventure, Drama, Mystery & Thriller,...",8.4,A,India,"amazonprimevideo, amazonprimevideo, amazonprim...",50min


In [None]:
tv_show_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   TV Show URL         100 non-null    object
 1   Title               100 non-null    object
 2   Release Year        100 non-null    object
 3   Genres              100 non-null    object
 4   IMDb Rating         100 non-null    object
 5   Age Rating          100 non-null    object
 6   Production Country  100 non-null    object
 7   Streaming Services  100 non-null    object
 8   Runtime             100 non-null    object
dtypes: object(9)
memory usage: 7.2+ KB


## **Task 2 :- Data Filtering & Analysis**

In [None]:
# Write Your Code here
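
The two filters from the task description (released in the last 2 years, IMDb rating of 7 or higher) could be applied along the lines of the sketch below. It is an illustration under assumptions: `filter_recent_high_rated` is a hypothetical helper, the column names follow `tv_show_df`, and since only a release year is scraped, "last 2 years" is approximated as a release year no earlier than the current year minus 2.

```python
from datetime import date

import pandas as pd


def filter_recent_high_rated(frame, year_col='Release Year', rating_col='IMDb Rating',
                             years=2, min_rating=7.0, today=None):
    """Keep rows released within `years` of `today` with a rating of at least `min_rating`."""
    today = today or date.today()
    out = frame.copy()
    # Scraped values are strings; 'N/A' and other non-numeric text become NaN
    out[year_col] = pd.to_numeric(out[year_col], errors='coerce')
    out[rating_col] = pd.to_numeric(out[rating_col], errors='coerce')
    mask = (out[year_col] >= today.year - years) & (out[rating_col] >= min_rating)
    return out[mask].reset_index(drop=True)


# A few rows taken from the TV show output earlier in this notebook
sample = pd.DataFrame({
    'Title': ['Thukra Ke Mera Pyaar', 'From', 'The Day of the Jackal', 'Mirzapur'],
    'Release Year': ['2024', '2022', '2024', '2018'],
    'IMDb Rating': ['6.6', '7.8', '8.2', '8.4'],
})
print(filter_recent_high_rated(sample, today=date(2024, 12, 31)))
```

Rows with a missing year or rating compare as false and are dropped. The movie DataFrame may need a different `rating_col`, since the cell below refers to its column as 'IMDB Rating'.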


## **Calculating Mean IMDb Ratings for Both Movies and TV Shows**

In [None]:
# Convert IMDb Rating columns to numeric
df['IMDB Rating'] = pd.to_numeric(df['IMDB Rating'], errors='coerce')
tv_show_df['IMDb Rating'] = pd.to_numeric(tv_show_df['IMDb Rating'], errors='coerce')

# Calculate the mean IMDb rating
mean_movie_rating = df['IMDB Rating'].mean()
mean_tvshow_rating = tv_show_df['IMDb Rating'].mean()

# Print results
print(f"Mean IMDb Rating for Movies: {mean_movie_rating:.2f}")
print(f"Mean IMDb Rating for TV Shows: {mean_tvshow_rating:.2f}")


Mean IMDb Rating for Movies: 6.93
Mean IMDb Rating for TV Shows: 7.80


## **Analyzing Top Genres**

In [None]:
# Write Your Code here
from collections import Counter

# Split the genres for movies
movie_genres = df['Genres'].dropna().str.split(',').explode().str.strip()
movie_genre_counts = Counter(movie_genres)

# Split the genres for TV shows
tvshow_genres = tv_show_df['Genres'].dropna().str.split(',').explode().str.strip()
tvshow_genre_counts = Counter(tvshow_genres)

# Combine both counters for total counts across movies and TV shows
total_genre_counts = movie_genre_counts + tvshow_genre_counts

# Convert the counters to DataFrames for better presentation
movie_genres_df = pd.DataFrame(movie_genre_counts.items(), columns=['Genre', 'Count']).sort_values(by='Count', ascending=False)
tvshow_genres_df = pd.DataFrame(tvshow_genre_counts.items(), columns=['Genre', 'Count']).sort_values(by='Count', ascending=False)
total_genres_df = pd.DataFrame(total_genre_counts.items(), columns=['Genre', 'Count']).sort_values(by='Count', ascending=False)

# Display the results
print("Unique genres and their counts in Movies:")
print(movie_genres_df)

print("\nUnique genres and their counts in TV Shows:")
print(tvshow_genres_df)

print("\nTotal unique genres and their counts in Movies and TV Shows combined:")
print(total_genres_df)

# Optional: Save results to CSV files
movie_genres_df.to_csv('movie_genres_counts.csv', index=False)
tvshow_genres_df.to_csv('tvshow_genres_counts.csv', index=False)
total_genres_df.to_csv('total_genres_counts.csv', index=False)


Unique genres and their counts in Movies:
                 Genre  Count
3                Drama     66
0   Mystery & Thriller     50
2   Action & Adventure     49
1                Crime     21
5               Comedy     19
4               Horror     17
6      Science-Fiction     15
7              Romance     13
9              Fantasy     12
11       Kids & Family     10
10           Animation      6
13      Made in Europe      4
12             History      3
8       War & Military      2
14               Sport      1

Unique genres and their counts in TV Shows:
                 Genre  Count
2                Drama     81
5   Action & Adventure     42
1   Mystery & Thriller     41
4      Science-Fiction     36
6                Crime     29
9              Fantasy     24
7               Comedy     24
8            Animation     16
0              Romance     10
3               Horror      8
11             History      4
13       Kids & Family      4
15      War & Military      3
12          R
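
The task asks for the top 5 genres, while the tables above list every genre. A small sketch that returns only the top entries is given below. `top_genres` is a hypothetical helper; unlike the cell above, it skips the 'N/A' placeholder so that missing genres are not counted as a genre.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_genres(genre_strings: Iterable[str], n: int = 5) -> List[Tuple[str, int]]:
    """Count comma-separated genres across all titles and return the `n` most common."""
    counts: Counter = Counter()
    for text in genre_strings:
        if not text or text == 'N/A':
            continue  # Skip titles with no genre information
        counts.update(g.strip() for g in text.split(',') if g.strip())
    return counts.most_common(n)


# Illustrative input in the same comma-separated shape as the 'Genres' column
example = ['Drama, Crime', 'Drama', 'N/A', 'Crime, Drama, Comedy']
print(top_genres(example, n=2))
```

Passing `tv_show_df['Genres']` or the movie DataFrame's genre column would give the per-type ranking; concatenating both gives the combined one.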

In [None]:
# Let's visualize it using bar charts
import plotly.express as px

# Create visualizations for Movies, TV Shows, and Combined Genres

# Movies Genre Counts Visualization
fig_movies = px.bar(
    movie_genres_df,
    x='Genre',
    y='Count',
    title='Unique Genres and Their Counts in Movies',
    labels={'Genre': 'Movie Genres', 'Count': 'Count'},
    template='plotly_white',
    text='Count'
)
fig_movies.update_traces(textposition='outside')
fig_movies.show()

# TV Shows Genre Counts Visualization
fig_tvshows = px.bar(
    tvshow_genres_df,
    x='Genre',
    y='Count',
    title='Unique Genres and Their Counts in TV Shows',
    labels={'Genre': 'TV Show Genres', 'Count': 'Count'},
    template='plotly_white',
    text='Count'
)
fig_tvshows.update_traces(textposition='outside')
fig_tvshows.show()

# Combined Genre Counts Visualization
fig_combined = px.bar(
    total_genres_df,
    x='Genre',
    y='Count',
    title='Unique Genres and Their Counts (Movies & TV Shows Combined)',
    labels={'Genre': 'Combined Genres', 'Count': 'Count'},
    template='plotly_white',
    text='Count'
)
fig_combined.update_traces(textposition='outside')
fig_combined.show()


## **Finding Predominant Streaming Service**

In [None]:
from collections import Counter
import pandas as pd

# Helper function to process streaming services
def process_streaming_services(services_column):
    # Ensure all values are strings and split into unique services
    services = services_column.dropna().apply(lambda x: list(set(service.strip().strip("[]'") for service in str(x).split(','))))
    # Flatten the list of services
    flattened_services = [service for sublist in services for service in sublist]
    return flattened_services

# Process streaming services for movies
movie_services = process_streaming_services(df['Streaming Services'])
movie_service_counts = Counter(movie_services)

# Process streaming services for TV shows
tvshow_services = process_streaming_services(tv_show_df['Streaming Services'])
tvshow_service_counts = Counter(tvshow_services)

# Combine both counters for total counts across movies and TV shows
total_service_counts = movie_service_counts + tvshow_service_counts

# Convert counters to DataFrames
movie_services_df = pd.DataFrame(movie_service_counts.items(), columns=['Streaming Service', 'Count']).sort_values(by='Count', ascending=False)
tvshow_services_df = pd.DataFrame(tvshow_service_counts.items(), columns=['Streaming Service', 'Count']).sort_values(by='Count', ascending=False)
total_services_df = pd.DataFrame(total_service_counts.items(), columns=['Streaming Service', 'Count']).sort_values(by='Count', ascending=False)

# Display results
print("Movie Streaming Services Count:")
print(movie_services_df)

print("\nTV Show Streaming Services Count:")
print(tvshow_services_df)

print("\nTotal Streaming Services Count:")
print(total_services_df)


Movie Streaming Services Count:
       Streaming Service  Count
0       amazonprimevideo    100
3                 amazon     34
1             bookmyshow     23
2                netflix     20
5                 itunes     17
8                hotstar     10
17             jiocinema      9
11         vimoviesandtv      8
4                   zee5      7
10                   aha      4
18               tatasky      3
13         lionsgateplay      2
9            hungamaplay      1
7                   mubi      1
12               sonyliv      1
14  appletvlionsgateplay      1
15   amazonlionsgateplay      1
16                sunnxt      1
6             amazonmubi      1
19               filmbox      1

TV Show Streaming Services Count:
       Streaming Service  Count
0       amazonprimevideo     97
2                netflix     36
1              jiocinema     18
7                hotstar      8
9       amazonanimetimes      8
5      amazoncrunchyroll      7
4          vimoviesandtv      3
6    

In [None]:
# Let's visualize it using bar charts
import plotly.express as px

# Visualization for Movie Streaming Services
fig_movies = px.bar(
    movie_services_df,
    x='Streaming Service',
    y='Count',
    title='Movie Streaming Services Count',
    labels={'Streaming Service': 'Service', 'Count': 'Number of Movies'},
    template='plotly_white',
    text='Count'
)
fig_movies.update_traces(textposition='outside')
fig_movies.show()

# Visualization for TV Show Streaming Services
fig_tvshows = px.bar(
    tvshow_services_df,
    x='Streaming Service',
    y='Count',
    title='TV Show Streaming Services Count',
    labels={'Streaming Service': 'Service', 'Count': 'Number of TV Shows'},
    template='plotly_white',
    text='Count'
)
fig_tvshows.update_traces(textposition='outside')
fig_tvshows.show()

# Visualization for Combined Streaming Services
fig_combined = px.bar(
    total_services_df,
    x='Streaming Service',
    y='Count',
    title='Total Streaming Services Count (Movies & TV Shows Combined)',
    labels={'Streaming Service': 'Service', 'Count': 'Total Count'},
    template='plotly_white',
    text='Count'
)
fig_combined.update_traces(textposition='outside')
fig_combined.show()


## **Task 3 :- Data Export**

In [None]:
# Save the movies DataFrame in CSV format
import pandas as pd

# Export the DataFrame to a CSV file
csv_file_name = "movies_data.csv"
df.to_csv(csv_file_name, index=False)

print(f"DataFrame successfully exported to {csv_file_name}")

In [None]:
# Save the TV show DataFrame in CSV format (this is the full, unfiltered DataFrame)
import pandas as pd

# Export the DataFrame to a CSV file
csv_file_name = "tv_show_data.csv"
tv_show_df.to_csv(csv_file_name, index=False)

print(f"DataFrame successfully exported to {csv_file_name}")

DataFrame successfully exported to tv_show_data.csv


# **Dataset Drive Link (View Access with Anyone) -**

# ***Congratulations!!! You have completed your Assignment.***