# **Web Scraping & Data Handling Challenge**



### **Website:**
JustWatch -  https://www.justwatch.com/in/movies?release_year_from=2000


### **Description:**

JustWatch is a popular platform that allows users to search for movies and TV shows across multiple streaming services like Netflix, Amazon Prime, Hulu, etc. For this assignment, you will be required to scrape movie and TV show data from JustWatch using Selenium, Python, and BeautifulSoup. Extract data from HTML, not by directly calling their APIs. Then, perform data filtering and analysis using Pandas, and finally, save the results to a CSV file.

### **Tasks:**

**1. Web Scraping:**

Use BeautifulSoup to scrape the following data from JustWatch:

   **a. Movie Information:**

      - Movie title
      - Release year
      - Genre
      - IMDb rating
      - Streaming services available (Netflix, Amazon Prime, Hulu, etc.)
      - URL to the movie page on JustWatch

   **b. TV Show Information:**

      - TV show title
      - Release year
      - Genre
      - IMDb rating
      - Streaming services available (Netflix, Amazon Prime, Hulu, etc.)
      - URL to the TV show page on JustWatch

  **c. Scope:**

```
   - Scrape data for at least 50 movies and 50 TV shows.
   - You can choose the entry point (e.g., starting with popular movies,
     or a specific genre, etc.) to ensure a diverse dataset.
```


**2. Data Filtering & Analysis:**

   After scraping the data, use Pandas to perform the following tasks:

   **a. Filter movies and TV shows based on specific criteria:**

   ```
      - Only include movies and TV shows released in the last 2 years (from the current date).
      - Only include movies and TV shows with an IMDb rating of 7 or higher.
```
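
The two filters above can be sketched in Pandas as follows. This is a minimal sketch, not the required solution: the function name `filter_recent_high_rated` and its `today` parameter are hypothetical, the column names `release_year` and `imdb_rating` match the DataFrame built later in this notebook, and "last 2 years" is approximated by calendar year because only the release year is scraped.

```python
import pandas as pd

def filter_recent_high_rated(df, min_rating=7.0, years_back=2, today=None):
    """Keep rows released within `years_back` years and rated >= `min_rating`."""
    if today is None:
        today = pd.Timestamp.today()
    cutoff_year = today.year - years_back
    # Scraped values are strings; placeholders such as 'NA' become NaN.
    year = pd.to_numeric(df["release_year"], errors="coerce")
    rating = pd.to_numeric(df["imdb_rating"], errors="coerce")
    # Comparisons against NaN are False, so rows with missing data are dropped.
    mask = (year >= cutoff_year) & (rating >= min_rating)
    return df[mask].reset_index(drop=True)

# Small illustrative frame (made-up rows, not scraped data)
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "release_year": ["2024", "2019", "2025", "NA"],
    "imdb_rating": ["7.5", "8.2", "6.1", "7.9"],
})
recent = filter_recent_high_rated(sample, today=pd.Timestamp("2025-01-15"))
print(recent)
```

Passing `today` explicitly keeps the result reproducible; leaving it out uses the current date, as the task asks.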

   **b. Data Analysis:**

   ```
      - Calculate the average IMDb rating for the scraped movies and TV shows.
      - Identify the top 5 genres that have the highest number of available movies and TV shows.
      - Determine the streaming service with the largest number of offerings.
      
   ```   
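
One possible way to compute the three statistics above is sketched here. The helper names `split_values` and `summarize` are hypothetical. The sketch assumes the `genre` and `streaming_service` columns hold comma-separated strings, as in the scraped output later in this notebook, and it counts each provider at most once per title.

```python
import pandas as pd

def split_values(cell):
    """Split a comma-separated cell into a de-duplicated list, skipping 'NA'."""
    parts = [p.strip() for p in str(cell).split(",")]
    return list(dict.fromkeys(p for p in parts if p and p != "NA"))

def summarize(df):
    """Return the mean IMDb rating, top 5 genres and most common provider."""
    rating = pd.to_numeric(df["imdb_rating"], errors="coerce")
    # explode() turns each list element into its own row so it can be counted.
    genres = df["genre"].apply(split_values).explode()
    services = df["streaming_service"].apply(split_values).explode()
    return {
        "mean_imdb": round(float(rating.mean()), 2),
        "top_genres": genres.value_counts().head(5),
        "top_service": services.value_counts().idxmax(),
    }

# Small illustrative frame (made-up rows, not scraped data)
sample = pd.DataFrame({
    "imdb_rating": ["8.0", "7.0", "NA"],
    "genre": ["Drama, Crime", "Drama", "Comedy, Drama"],
    "streaming_service": ["netflix, netflix", "hotstar, netflix", "zee5"],
})
summary = summarize(sample)
print(summary["mean_imdb"], summary["top_service"])
```

Missing ratings are ignored by `mean()`, so the average reflects only titles that actually have a score.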

**3. Data Export:**

```
   - Dump the filtered and analysed data into a CSV file for further processing and reporting.

   - Keep the CSV file in your Drive folder and share the Drive link in the Colab notebook, with view access granted to anyone who has the link.
```

**Submission:**
```
- Submit a link to your Colab made for the assignment.

- The Colab should contain your Python script (.py format only) with clear
  comments explaining the scraping, filtering, and analysis process.

- Your code should run without errors and be executable in one go.

- Place your dataset Drive link in the notebook before the conclusion.
```



**Note:**

1. Properly handle errors and exceptions during web scraping to ensure a robust script.

2. Make sure your code is well-structured, easy to understand, and follows Python best practices.

3. The assignment will be evaluated based on the correctness of the scraped data, accuracy of data filtering and analysis, and the overall quality of the Python code.
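
As an illustration of the first note, a small fetch helper can centralise timeouts, status checks and retries. This is a sketch under assumptions: `fetch_soup`, its parameters and the injectable `get` argument are hypothetical names chosen for this example, and the retry count and delay are arbitrary.

```python
import time
import requests
from bs4 import BeautifulSoup

def fetch_soup(url, headers=None, retries=3, delay=1.0, get=requests.get):
    """Return parsed HTML for `url`, or None if every attempt fails."""
    for attempt in range(1, retries + 1):
        try:
            response = get(url, headers=headers, timeout=10)
            response.raise_for_status()  # turn 4xx/5xx responses into exceptions
            return BeautifulSoup(response.text, "html.parser")
        except requests.RequestException as exc:
            print(f"Attempt {attempt} failed for {url}: {exc}")
            if attempt < retries:
                time.sleep(delay)  # brief pause before trying again
    return None
```

Catching `requests.RequestException` rather than using a bare `except` lets programming errors surface instead of being silently converted to `'NA'`.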








# **Start The Project**

## **Task 1: Web Scraping**

In [None]:
# Install the required libraries
!pip install bs4
!pip install requests

Collecting bs4
  Downloading bs4-0.0.2-py2.py3-none-any.whl.metadata (411 bytes)
Downloading bs4-0.0.2-py2.py3-none-any.whl (1.2 kB)
Installing collected packages: bs4
Successfully installed bs4-0.0.2


In [None]:
# Import the required libraries
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import numpy as np

## **Scraping Movies Data**

In [None]:
# Specifying the URL from which movies related data will be fetched

# Requests without browser-like headers were rejected with a 403 error, so these headers are sent with every request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
}
url='https://www.justwatch.com/in/movies?release_year_from=2000'

# Sending an HTTP GET request to the URL
response=requests.get(url, headers= headers)
# Parsing the HTML content using BeautifulSoup with the 'html.parser'
soup=BeautifulSoup(response.text,'html.parser')
# Printing the prettified HTML content
print(soup.prettify())

<!DOCTYPE html>
<html data-vue-meta="%7B%22dir%22:%7B%22ssr%22:%22ltr%22%7D,%22lang%22:%7B%22ssr%22:%22en%22%7D%7D" data-vue-meta-server-rendered="" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta charset="utf-8" data-vue-meta="ssr"/>
  <meta content="IE=edge" data-vue-meta="ssr" httpequiv="X-UA-Compatible"/>
  <meta content="viewport-fit=cover, width=device-width, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0, user-scalable=no" data-vue-meta="ssr" name="viewport"/>
  <meta content="JustWatch" data-vue-meta="ssr" property="og:site_name"/>
  <meta content="794243977319785" data-vue-meta="ssr" property="fb:app_id"/>
  <meta content="/appassets/img/JustWatch_logo_with_claim.png" data-vmid="og:image" data-vue-meta="ssr" property="og:image"/>
  <meta content="606" data-vmid="og:image:width" data-vue-meta="ssr" property="og:image:width"/>
  <meta content="302" data-vmid="og:image:height" data-vue-meta="ssr" pro

## **Fetching Movie URLs**

In [None]:

## Hint : Use the following code to extract the film urls
movie_links = soup.find_all('a', href=True)
movie_urls = [link['href'] for link in movie_links if '/movie/' in link['href']]
url_list=[]
for x in movie_urls:
  url_list.append('https://www.justwatch.com'+x)
print(len(url_list))
print(url_list)

110
['https://www.justwatch.com/in/movie/pushpa-the-rule-part-2', 'https://www.justwatch.com/in/movie/marco-2024', 'https://www.justwatch.com/in/movie/sookshma-darshini', 'https://www.justwatch.com/in/movie/bhool-bhulaiyaa-3', 'https://www.justwatch.com/in/movie/ore-dake-level-up-na-ken-reawakening', 'https://www.justwatch.com/in/movie/all-we-imagine-as-light', 'https://www.justwatch.com/in/movie/lucky-baskhar', 'https://www.justwatch.com/in/movie/your-fault', 'https://www.justwatch.com/in/movie/singham-again-2024-0', 'https://www.justwatch.com/in/movie/game-changer-2023', 'https://www.justwatch.com/in/movie/venom-3-2024', 'https://www.justwatch.com/in/movie/stree-2', 'https://www.justwatch.com/in/movie/the-substance', 'https://www.justwatch.com/in/movie/mufasa-the-lion-king', 'https://www.justwatch.com/in/movie/pushpa', 'https://www.justwatch.com/in/movie/the-wild-robot', 'https://www.justwatch.com/in/movie/untitled-murad-khetani-varun-dhawan-project', 'https://www.justwatch.com/in/mo
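
The `'/movie/'` filter above keeps every matching anchor, so the same title can appear more than once if the page links to it from several places, and 110 links may not mean 110 distinct movies. A minimal sketch of removing repeats while keeping the original order (the helper name `unique_in_order` is hypothetical):

```python
def unique_in_order(urls):
    """Drop repeated URLs while keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(urls))

links = [
    "https://www.justwatch.com/in/movie/stree-2",
    "https://www.justwatch.com/in/movie/the-substance",
    "https://www.justwatch.com/in/movie/stree-2",
]
print(unique_in_order(links))
```

Applying this to `url_list` before the per-field loops would also avoid fetching the same page twice.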

## **Scraping Movie Title**

In [None]:
# Fetch each movie page and read its <h1> heading as the title
def scrape_movie_title(url):
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    title_element = soup.find('h1')
    # soup.find returns None when the page has no <h1>
    if title_element is None:
        return None
    return title_element.text.strip() or None

movie_titles = []
for movie_url in url_list:
    title = scrape_movie_title(movie_url)
    # Append a placeholder so the list stays aligned with url_list
    movie_titles.append(title if title else 'NA')
print(len(movie_titles))
movie_titles

110


['Pushpa 2 - The Rule (2024)',
 'Marco (2024)',
 'Sookshma Darshini (2024)',
 'Bhool Bhulaiyaa 3 (2024)',
 'Solo Leveling -ReAwakening- (2024)',
 'All We Imagine as Light (2024)',
 'Lucky Baskhar (2024)',
 'Culpa Tuya (2024)',
 'Singham Again (2024)',
 'Game Changer (2025)',
 'Venom: The Last Dance (2024)',
 'Stree 2: Sarkate Ka Aatank (2024)',
 'The Substance (2024)',
 'Mufasa: The Lion King (2024)',
 'Pushpa: The Rise (2021)',
 'The Wild Robot (2024)',
 'Baby John (2024)',
 'Anora (2024)',
 '365 Days (2020)',
 'The Sabarmati Report (2024)',
 'Pani (2024)',
 'Amaran (2024)',
 'Viduthalai Part 2 (2024)',
 'Gladiator II (2024)',
 'Red One (2024)',
 'Babygirl (2024)',
 'Kill (2024)',
 'Kraven the Hunter (2024)',
 'Carry-On (2024)',
 'Bhairathi Ranagal (2024)',
 'Rifle Club (2024)',
 'Kishkindha Kaandam (2024)',
 'Bagheera (2024)',
 'UI (2024)',
 'Viduthalai: Part I (2023)',
 'Wicked (2024)',
 'Heretic (2024)',
 'My Fault (2023)',
 'Girls Will Be Girls (2024)',
 'Salaar (2023)',
 'Bachhal

## **Scraping Release Year**

In [None]:
# Read the release year from the 'release-year' span on each movie page
#import time
movie_release_years=[]
for url in url_list:
  try:
    response=requests.get(url,headers=headers)
    soup=BeautifulSoup(response.text,'html.parser')
    # The span text looks like '(2024)'; strip the parentheses
    year=soup.find_all('span',class_='release-year')[0].text.strip('()')
  except Exception:
    # Request failed or the span is missing
    year='NA'
  movie_release_years.append(year)
  #time.sleep(1)
print(movie_release_years)
print(len(movie_release_years))

['2024', '2024', '2024', '2024', '2024', '2024', '2024', '2024', '2024', '2025', '2024', '2024', '2024', '2024', '2021', '2024', '2024', '2024', '2020', '2024', '2024', '2024', '2024', '2024', '2024', '2024', '2024', '2024', '2024', '2024', '2024', '2024', '2024', '2024', '2023', '2024', '2024', '2023', '2024', '2023', '2024', '2024', '2024', '2001', '2024', '2024', '2024', '2024', '2025', '2022', '2015', '2024', '2024', '2014', '2019', '2024', '2001', '2021', '2019', '2023', '2024', '2024', '2024', '2024', '2024', '2024', '2024', '2024', '2019', '2017', '2013', '2019', '2025', '2012', '2019', '2024', '2024', '2023', '2024', '2024', '2008', '2024', '2023', '2024', '2024', '2025', '2024', '2024', '2023', '2018', '2024', '2024', '2024', '2024', '2025', '2018', '2016', '2024', '2025', '2024', '2024', '2024', '2024', '2024', '2024', '2024', '2025', '2025', '2015', '2025']
110
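
The scraped headings already carry the year in parentheses (for example `'Pushpa 2 - The Rule (2024)'`), so the title and year can also be separated with the `re` module imported earlier. A minimal sketch, assuming every heading ends with a four-digit year in parentheses; `split_title_year` is a hypothetical helper name:

```python
import re

# Lazily capture the title, then a trailing four-digit year in parentheses
TITLE_YEAR = re.compile(r"^(?P<title>.*?)\s*\((?P<year>\d{4})\)\s*$")

def split_title_year(heading):
    """Return (title, year) from a heading such as 'Kill (2024)'."""
    match = TITLE_YEAR.match(heading)
    if match is None:
        # No trailing year found: keep the heading as the title
        return heading.strip(), None
    return match.group("title"), int(match.group("year"))

print(split_title_year("Pushpa 2 - The Rule (2024)"))
```

This gives a clean title column and a numeric year without a second request per page.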


## **Scraping Genres**

In [None]:
# Read the 'Genres' entry from the detail-info blocks of each movie page
movie_genre = []
for url in url_list:
    # Reset per movie so a missing field does not reuse the previous movie's value
    genre = 'NA'
    try:
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.text, 'html.parser')
        for x in soup.find_all('div', class_='poster-detail-infos'):
            if x.find_all('h3')[0].text == 'Genres':
                genre = x.find_all('span')[0].text
    except Exception:
        genre = 'NA'
    movie_genre.append(genre)
print(len(movie_genre))
print(movie_genre)

110
['Action & Adventure, Drama, Mystery & Thriller, Crime', 'Action & Adventure, Crime, Mystery & Thriller', 'Mystery & Thriller, Comedy', 'Comedy, Horror', 'Action & Adventure, Fantasy, Animation', 'Drama, Romance', 'Crime, Drama, Mystery & Thriller', 'Drama, Romance', 'Drama, Mystery & Thriller, Crime, Action & Adventure', 'Drama, Mystery & Thriller, Action & Adventure', 'Action & Adventure, Science-Fiction, Mystery & Thriller', 'Comedy, Horror', 'Horror, Science-Fiction, Drama', 'Fantasy, Animation, Drama, Kids & Family, Action & Adventure', 'Mystery & Thriller, Action & Adventure, Drama, Crime', 'Science-Fiction, Animation, Action & Adventure, Kids & Family', 'Drama, Mystery & Thriller, Action & Adventure', 'Comedy, Drama, Romance', 'Drama, Romance, Made in Europe', 'Mystery & Thriller, Crime, Drama', 'Crime, Mystery & Thriller, Drama', 'War & Military, Action & Adventure, Drama', 'Crime, Drama, Mystery & Thriller, Action & Adventure, History', 'Action & Adventure, Drama', 'Action

## **Scraping IMDb Rating**

In [None]:
# Read the IMDb score from the 'imdb-score' span on each movie page
imbd_scores = []
for url in url_list:
    try:
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.text, 'html.parser')
        # The first whitespace-separated token of the span text is the numeric score
        imbd_score = soup.find_all('span', class_='imdb-score')[0].text.strip().split()[0]
    except Exception:
        # Request failed or the page shows no IMDb score
        imbd_score = 'NA'
    imbd_scores.append(imbd_score)
print(len(imbd_scores))
print(imbd_scores)

110
['6.3', '7.5', '7.9', '4.7', '8.4', '7.2', '8.0', '5.2', '4.9', '6.6', '6.0', '7.0', '7.3', '6.7', '7.6', '8.2', '6.4', '7.8', '3.3', '6.5', '6.8', '8.2', '8.3', '6.6', '6.3', '6.5', '7.5', '5.4', '6.5', '7.2', '7.1', '8.0', '6.7', '7.3', '8.3', '7.7', '7.0', '6.2', '7.1', '6.6', '8.7', '4.6', '7.0', '7.7', '7.6', '7.8', '6.4', '7.6', '6.0', '8.2', '4.2', '7.2', '8.1', '8.7', '6.8', '7.9', '7.7', '6.2', '7.1', '6.1', '7.2', '6.9', '8.0', '7.4', '8.1', '6.9', '7.2', '6.5', '7.0', '6.8', '8.2', '8.2', '7.9', '6.5', '7.0', '8.4', '8.5', '7.7', '4.3', '7.7', '7.6', '7.0', '6.7', '5.2', '7.2', '7.8', '7.4', '6.9', '5.1', '7.7', '7.0', '6.3', '6.8', '7.1', '6.0', '7.3', '7.3', '6.5', '8.7', '6.8', '6.3', '7.5', '7.9', '4.7', '8.4', '4.3', '5.6', 'NA', '7.5', 'NA']


## **Scraping Runtime/Duration**

In [None]:
# Read the 'Runtime' entry from the detail-info blocks of each movie page
import time
movie_runtime = []
for url in url_list:
  # Reset per movie so a missing field does not reuse the previous movie's value
  runtime='NA'
  try:
    response = requests.get(url, headers=headers)
    soup=BeautifulSoup(response.text,'html.parser')
    for x in soup.find_all('div',class_='poster-detail-infos'):
      if x.find_all('h3')[0].text=='Runtime':
        runtime=x.find_all('div')[0].text
  except Exception:
    runtime='NA'
  movie_runtime.append(runtime)
  #time.sleep(1)
print(len(movie_runtime))
print(movie_runtime)

110
['3h 46min', '2h 25min', '2h 22min', '2h 38min', '1h 56min', '1h 54min', '2h 28min', '2h 0min', '2h 25min', '2h 44min', '1h 49min', '2h 27min', '2h 21min', '1h 58min', '2h 59min', '1h 42min', '2h 42min', '2h 19min', '1h 54min', '2h 7min', '2h 23min', '2h 47min', '2h 30min', '2h 28min', '2h 4min', '1h 54min', '1h 45min', '2h 7min', '2h 0min', '2h 12min', '1h 53min', '2h 13min', '2h 38min', '2h 12min', '2h 26min', '2h 41min', '1h 51min', '1h 57min', '1h 59min', '2h 55min', '2h 20min', '2h 34min', '1h 48min', '1h 45min', '2h 13min', '2h 38min', '2h 18min', '2h 8min', '1h 54min', '2h 46min', '2h 5min', '1h 50min', '2h 44min', '2h 49min', '1h 58min', '1h 24min', '2h 32min', '1h 52min', '2h 56min', '3h 21min', '2h 3min', '1h 54min', '2h 29min', '2h 0min', '3h 35min', '2h 14min', '1h 30min', '1h 43min', '1h 50min', '2h 12min', '3h 0min', '2h 56min', '1h 52min', '1h 23min', '2h 7min', '2h 21min', '2h 47min', '2h 44min', '2h 35min', '2h 9min', '2h 4min', '1h 40min', '1h 37min', '2h 32min', 
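
Runtime strings such as `'3h 46min'` are easier to filter and average once converted to minutes. A minimal sketch, assuming the values follow the `'<hours>h <minutes>min'` pattern seen in the output above; `runtime_to_minutes` is a hypothetical helper name:

```python
import re

def runtime_to_minutes(text):
    """Convert '3h 46min' style strings to total minutes, or None if unparseable."""
    match = re.fullmatch(r"\s*(?:(\d+)h)?\s*(?:(\d+)min)?\s*", text)
    # Reject strings that match neither an hour part nor a minute part
    if match is None or not any(match.groups()):
        return None
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes

print(runtime_to_minutes("3h 46min"))  # 226
```

Placeholders such as `'NA'` return `None`, which Pandas treats as a missing value.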

## **Scraping Age Rating**

In [None]:
# Read the 'Age rating' entry from the detail-info blocks of each movie page
import time
movie_agerating=[]
for url in url_list:
  # Reset per movie so a missing field does not reuse the previous movie's value
  age_rating='NA'
  try:
    response=requests.get(url,headers=headers)
    soup=BeautifulSoup(response.text,'html.parser')
    for x in soup.find_all('div', class_='poster-detail-infos'):
      if x.find_all('h3')[0].text=='Age rating':
        age_rating=x.find_all('div')[0].text
  except Exception:
    age_rating='NA'
  movie_agerating.append(age_rating)
  time.sleep(1)  # pause between requests
print(len(movie_agerating))
print(movie_agerating)

110
['', 'A', 'UA', 'UA', 'UA', 'A', 'UA', 'UA', 'UA', 'UA', 'UA', 'UA', 'UA', 'UA', 'UA', 'U', 'UA', 'UA', 'UA', 'UA', 'UA', 'UA', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'UA', 'UA', 'UA', 'UA', 'UA', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'UA', 'UA', 'A', 'A', 'A', 'UA', 'A', 'A', 'UA', 'UA', 'UA', 'UA', 'UA', 'U', 'U', 'U', 'U', 'UA', 'A', 'UA', 'UA', 'A', 'A', 'A', 'UA', 'UA', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'UA', 'UA', 'U', 'UA', 'A', 'A', 'UA', 'UA', 'UA', 'A', 'A', 'UA', 'UA', 'UA', 'UA', 'UA', 'U', 'UA', 'UA', 'A', 'U', 'UA', 'UA', 'UA', 'UA', 'A', 'UA', 'UA', 'UA', 'UA', 'UA', 'UA', 'UA', 'UA']


## **Fetching Production Countries Details**

In [None]:
# Read the 'Production country' entry from the detail-info blocks of each movie page
Production_countries = []
for url in url_list:
    # Reset per movie so a missing field does not reuse the previous movie's value
    Production_country = 'NA'
    try:
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.text, 'html.parser')
        for x in soup.find_all('div', class_='poster-detail-infos'):
            if x.find('h3', class_='poster-detail-infos__subheading').text == 'Production country':
                Production_country = x.find('div', class_='poster-detail-infos__value').text
    except Exception:
        Production_country = 'NA'
    Production_countries.append(Production_country)
print(len(Production_countries))
print(Production_countries)

110
['India', 'India', 'India', 'India', 'Japan', 'France, India, Italy, Luxembourg, Netherlands', 'India', 'Spain', 'India', 'India', 'United States', 'India', 'United Kingdom, United States, France', 'United States', 'India', 'United States, Japan', 'India', 'United States', 'Poland', 'India', 'India', 'India', 'India', 'United States, United Kingdom, Morocco, Canada, Malta', 'United States', 'Netherlands, United States', 'India', 'United States', 'United States', 'India', 'India', 'India', 'India', 'India', 'India', 'United States', 'Canada, United States', 'Spain', 'India, Norway, United States, France', 'India', 'India', 'India', 'United Kingdom, France', 'Mexico', 'United States', 'India', 'India', 'United States', 'United States', 'India', 'United States', 'Japan, United States', 'India', 'United Kingdom, Canada, United States', 'United States', 'Latvia, Belgium, France', 'United Kingdom, United States', 'India, United States', 'India', 'India', 'India', 'India', 'India', 'Unite

## **Fetching Streaming Service Details**

In [None]:
# Collect the provider names shown on each movie page from the provider icon file names
movie_streaming_services = []
for url in url_list:
    try:
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.text, 'html.parser')
        # Extract 'src' and process to get the name of the streaming service
        name = [
            x['src'].split('/')[-1].split('.')[0]  # Extract the last part of the URL and remove file extension
            for x in soup.find_all('img', class_="provider-icon wide icon") if 'src' in x.attrs
        ]
    except Exception as e:
        print(f"Error for URL {url}: {e}")
        # Fall back to a placeholder when the page cannot be fetched or parsed
        name = ['NA']
    movie_streaming_services.append(", ".join(name))
print(len(movie_streaming_services))
print(movie_streaming_services)

110
['amazonprimevideo', 'amazonprimevideo', 'hotstar, amazonprimevideo', 'netflix, amazonprimevideo', 'amazonprimevideo', 'hotstar, amazonprimevideo', 'netflix, amazonprimevideo', 'amazonprimevideo, amazonprimevideo, amazonprimevideo', 'amazonprimevideo, amazonprimevideo, amazon, amazonprimevideo, bookmyshow', 'amazonprimevideo', 'itunes, zee5, amazon, amazonprimevideo, itunes, itunes', 'amazonprimevideo, amazonprimevideo, amazon, amazonprimevideo', 'mubi, amazonmubi, amazonprimevideo, amazon', 'bookmyshow, amazonprimevideo', 'amazonprimevideo, amazonprimevideo, amazon, amazonprimevideo', 'itunes, zee5, amazon, amazonprimevideo, itunes, itunes, bookmyshow', 'bookmyshow, amazonprimevideo', 'amazonprimevideo', 'netflix, amazonprimevideo', 'zee5, bookmyshow, amazonprimevideo', 'bookmyshow, amazonprimevideo', 'netflix, amazonprimevideo', 'amazonprimevideo, amazonprimevideo, amazon, amazonprimevideo', 'itunes, amazon, itunes, amazonprimevideo, itunes, bookmyshow', 'amazonprimevideo, amazon
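
The output above repeats some providers within a single title (for example `'itunes, zee5, amazon, amazonprimevideo, itunes, itunes'`), presumably because the same icon appears under several offer types on the page. A minimal sketch of collapsing the repeats before counting offerings; `unique_providers` is a hypothetical helper name:

```python
def unique_providers(joined):
    """Remove repeated provider names from a comma-separated string."""
    names = [p.strip() for p in joined.split(",")]
    # dict.fromkeys keeps the first occurrence of each name, in order
    return ", ".join(dict.fromkeys(n for n in names if n))

print(unique_providers("itunes, zee5, amazon, amazonprimevideo, itunes, itunes"))
```

Without this step, a provider that lists one title under both rent and buy would be counted more than once for that title.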

## **Now Creating Movies DataFrame**

In [None]:
info_dict = {
    'url': url_list,
    'title': movie_titles,
    'release_year': movie_release_years,
    'genre': movie_genre,
    'imdb_rating': imbd_scores,
    'runtime': movie_runtime,
    'age_rating': movie_agerating,
    'production_country': Production_countries,
    'streaming_service': movie_streaming_services
}
movie_data = pd.DataFrame(info_dict)
movie_data

In [None]:
# Save the movie data to a CSV file
movie_data.to_csv('movie_data.csv')
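
Each list above was built by a separate loop over `url_list`, so every movie page is downloaded once per field. One alternative is to parse a page once and read all detail fields from it. The sketch below is an assumption-laden illustration: `parse_detail_infos` is a hypothetical helper, and it assumes each `poster-detail-infos` block holds one `poster-detail-infos__subheading` heading and one `poster-detail-infos__value` element, which is only confirmed above for the production country field.

```python
from bs4 import BeautifulSoup

def parse_detail_infos(soup):
    """Map each info heading (e.g. 'Genres') to its text value."""
    infos = {}
    for block in soup.find_all("div", class_="poster-detail-infos"):
        heading = block.find("h3", class_="poster-detail-infos__subheading")
        value = block.find("div", class_="poster-detail-infos__value")
        if heading is not None and value is not None:
            infos[heading.get_text(strip=True)] = value.get_text(" ", strip=True)
    return infos

# Hand-written HTML that mimics the assumed page structure
html = """
<div class="poster-detail-infos">
  <h3 class="poster-detail-infos__subheading">Genres</h3>
  <div class="poster-detail-infos__value"><span>Drama, Romance</span></div>
</div>
<div class="poster-detail-infos">
  <h3 class="poster-detail-infos__subheading">Runtime</h3>
  <div class="poster-detail-infos__value">2h 0min</div>
</div>
"""
infos = parse_detail_infos(BeautifulSoup(html, "html.parser"))
print(infos.get("Genres", "NA"), infos.get("Age rating", "NA"))
```

Looking fields up with `infos.get(name, 'NA')` gives a per-page default, so a missing field cannot inherit the previous title's value.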

## **Scraping TV  Show Data**

In [None]:
# Specifying the URL from which tv show related data will be fetched
tv_url='https://www.justwatch.com/in/tv-shows?release_year_from=2000'

# Requests without browser-like headers were rejected with a 403 error, so these headers are sent with every request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
}
# Sending an HTTP GET request to the URL
response=requests.get(tv_url, headers=headers)
# Parsing the HTML content using BeautifulSoup with the 'html.parser'
soup=BeautifulSoup(response.text,'html.parser')
# Printing the prettified HTML content
print(soup.prettify())

<!DOCTYPE html>
<html data-vue-meta="%7B%22dir%22:%7B%22ssr%22:%22ltr%22%7D,%22lang%22:%7B%22ssr%22:%22en%22%7D%7D" data-vue-meta-server-rendered="" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta charset="utf-8" data-vue-meta="ssr"/>
  <meta content="IE=edge" data-vue-meta="ssr" httpequiv="X-UA-Compatible"/>
  <meta content="viewport-fit=cover, width=device-width, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0, user-scalable=no" data-vue-meta="ssr" name="viewport"/>
  <meta content="JustWatch" data-vue-meta="ssr" property="og:site_name"/>
  <meta content="794243977319785" data-vue-meta="ssr" property="fb:app_id"/>
  <meta content="/appassets/img/JustWatch_logo_with_claim.png" data-vmid="og:image" data-vue-meta="ssr" property="og:image"/>
  <meta content="606" data-vmid="og:image:width" data-vue-meta="ssr" property="og:image:width"/>
  <meta content="302" data-vmid="og:image:height" data-vue-meta="ssr" pro

## **Fetching TV Show URL Details**

In [None]:
tv_show_links = []
# Find all anchor elements that link to a TV show page (adjust the selector if needed)
for x in soup.find_all('a', class_="title-list-grid__item--link"):
    # Prefix the relative href with the site root to get an absolute URL
    tv_show_links.append('https://www.justwatch.com' + x['href'])
# Print the extracted URLs and their count
print(tv_show_links)
print(len(tv_show_links))

['https://www.justwatch.com/in/tv-show/squid-game', 'https://www.justwatch.com/in/tv-show/thukra-ke-mera-pyaar', 'https://www.justwatch.com/in/tv-show/paatal-lok', 'https://www.justwatch.com/in/tv-show/solo-leveling-2024', 'https://www.justwatch.com/in/tv-show/the-day-of-the-jackal', 'https://www.justwatch.com/in/tv-show/from', 'https://www.justwatch.com/in/tv-show/mismatched', 'https://www.justwatch.com/in/tv-show/alice-in-borderland', 'https://www.justwatch.com/in/tv-show/mirzapur', 'https://www.justwatch.com/in/tv-show/black-warrant', 'https://www.justwatch.com/in/tv-show/game-of-thrones', 'https://www.justwatch.com/in/tv-show/mrs-fletcher', 'https://www.justwatch.com/in/tv-show/attack-on-titan', 'https://www.justwatch.com/in/tv-show/bigg-boss', 'https://www.justwatch.com/in/tv-show/what-if-2021', 'https://www.justwatch.com/in/tv-show/the-penguin', 'https://www.justwatch.com/in/tv-show/mastram', 'https://www.justwatch.com/in/tv-show/aindham-vedham', 'https://www.justwatch.com/in/tv-

## **Fetching TV Show Title Details**

In [None]:
# Fetch each TV show page and read its hero <h1> heading as the title
def scrape_tv_title(tv_url):
    response = requests.get(tv_url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    title_element = soup.find('h1',class_='title-detail-hero__details__title')
    # soup.find returns None when the heading is missing
    if title_element is None:
        return None
    return title_element.text.strip() or None

tv_titles = []
for tv_url in tv_show_links:
    title = scrape_tv_title(tv_url)
    # Append a placeholder so the list stays aligned with tv_show_links
    tv_titles.append(title if title else 'NA')
print(len(tv_titles))
tv_titles

100


['Squid Game (2021)',
 'Thukra Ke Mera Pyaar (2024)',
 'Paatal Lok (2020)',
 'Solo Leveling (2024)',
 'The Day of the Jackal (2024)',
 'From (2022)',
 'Mismatched (2020)',
 'Alice in Borderland (2020)',
 'Mirzapur (2018)',
 'Black Warrant (2025)',
 'Game of Thrones (2011)',
 'Mrs. Fletcher (2019)',
 'Attack on Titan (2013)',
 'Bigg Boss (2006)',
 'What If...? (2021)',
 'The Penguin (2024)',
 'Mastram (2020)',
 'Aindham Vedham (2024)',
 'American Primeval (2025)',
 'XO, Kitty (2023)',
 'Squid Game: The Challenge (2023)',
 'Secret Level (2024)',
 'Scam 1992: The Harshad Mehta Story (2020)',
 'Dune: Prophecy (2024)',
 'Farzi (2023)',
 'Fake Profile (2023)',
 'Breaking Bad (2008)',
 "Adam's Sweet Agony (2024)",
 'Mayfair Witches (2023)',
 'When the Phone Rings (2024)',
 'Taaza Khabar (2023)',
 'Yellowstone (2018)',
 'Landman (2024)',
 'Panchayat (2020)',
 'The Rookie (2018)',
 'The Boys (2019)',
 'Silo (2023)',
 'House of the Dragon (2022)',
 'Naruto (2002)',
 'Money Heist (2017)',
 'High 

In [None]:
import time
tvshow_title = []
for tv_url in tv_show_links:
  try:
    response = requests.get(tv_url, headers = headers)
    soup = BeautifulSoup(response.text,'html.parser')
    title=soup.find('h1',class_='title-detail-hero__details__title').text.strip()
  except:
    title='NA'
  tvshow_title.append(title)
  time.sleep(1)
print(tvshow_title)
print(len(tvshow_title))

['Squid Game (2021)', 'Thukra Ke Mera Pyaar (2024)', 'Paatal Lok (2020)', 'Solo Leveling (2024)', 'The Day of the Jackal (2024)', 'From (2022)', 'Mismatched (2020)', 'Alice in Borderland (2020)', 'Mirzapur (2018)', 'Black Warrant (2025)', 'Game of Thrones (2011)', 'Mrs. Fletcher (2019)', 'Attack on Titan (2013)', 'Bigg Boss (2006)', 'What If...? (2021)', 'The Penguin (2024)', 'Mastram (2020)', 'Aindham Vedham (2024)', 'American Primeval (2025)', 'XO, Kitty (2023)', 'Squid Game: The Challenge (2023)', 'Secret Level (2024)', 'Scam 1992: The Harshad Mehta Story (2020)', 'Dune: Prophecy (2024)', 'Farzi (2023)', 'Fake Profile (2023)', 'Breaking Bad (2008)', "Adam's Sweet Agony (2024)", 'Mayfair Witches (2023)', 'When the Phone Rings (2024)', 'Taaza Khabar (2023)', 'Yellowstone (2018)', 'Landman (2024)', 'Panchayat (2020)', 'The Rookie (2018)', 'The Boys (2019)', 'Silo (2023)', 'House of the Dragon (2022)', 'Naruto (2002)', 'Money Heist (2017)', 'High Potential (2024)', 'Beast Games (2024)',

## **Fetching Release Year**

In [None]:
# Read the release year from the 'release-year' span on each TV show page
#import time
Release_Years= []
for tv_url in tv_show_links:
  try:
    response = requests.get(tv_url, headers = headers)
    soup = BeautifulSoup(response.text,'html.parser')
    # The span text looks like '(2024)'; strip the parentheses
    year=soup.find_all('span',class_='release-year')[0].text.strip('()')
  except Exception:
    year='NA'
  Release_Years.append(year)
  #time.sleep(1)
print(Release_Years)
print(len(Release_Years))

['2021', '2024', '2020', '2024', '2024', '2022', '2020', '2020', '2018', '2025', '2011', '2019', '2013', '2006', '2021', '2024', '2020', '2024', '2025', '2023', '2023', '2024', '2020', '2024', '2023', '2023', '2008', '2024', '2023', '2024', '2023', '2018', '2024', '2020', '2018', '2019', '2023', '2022', '2002', '2017', '2024', '2024', '2019', '2024', '2024', '2023', '2024', '2024', '2024', '2025', '2022', '2025', '2017', '2018', '2025', '2017', '2024', '2020', '2010', '2018', '2024', '2007', '2022', '2020', '2020', '2020', '2024', '2024', '2021', '2022', '2018', '2014', '2019', '2024', '2019', '2021', '2022', '2020', '2025', '2023', '2019', '2020', '2025', '2024', '2009', '2023', '2014', '2014', '2024', '2018', '2024', '2016', '2018', '2024', '2022', '2016', '2022', '2004', '2013', '2025']
100


## **Fetching TV Show Genre Details**

In [None]:
# Read the 'Genres' entry from the detail-info blocks of each TV show page
tv_show_genre = []
for url in tv_show_links:
    # Reset per show so a missing field does not reuse the previous show's value
    genre = 'NA'
    try:
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.text, 'html.parser')
        for x in soup.find_all('div', class_='poster-detail-infos'):
            if x.find_all('h3')[0].text == 'Genres':
                genre = x.find_all('span')[0].text
    except Exception:
        genre = 'NA'
    tv_show_genre.append(genre)
print(len(tv_show_genre))
print(tv_show_genre)

100
['Action & Adventure, Mystery & Thriller, Drama', 'Drama, Romance', 'Crime, Drama, Mystery & Thriller', 'Animation, Action & Adventure, Fantasy, Science-Fiction', 'Action & Adventure, Crime, Drama, Mystery & Thriller', 'Horror, Science-Fiction, Mystery & Thriller, Drama', 'Comedy, Drama, Romance', 'Drama, Horror, Science-Fiction, Mystery & Thriller, Action & Adventure', 'Crime, Action & Adventure, Drama, Mystery & Thriller', 'Crime, Drama', 'Drama, Action & Adventure, Science-Fiction, Fantasy', 'Comedy, Drama', 'Animation, Action & Adventure, Drama, Fantasy, Horror, Science-Fiction', 'Reality TV', 'Science-Fiction, Action & Adventure, Animation', 'Crime, Drama, Fantasy', 'Drama, Comedy, Fantasy', 'Mystery & Thriller', 'Western, Drama, Action & Adventure, Mystery & Thriller', 'Comedy, Drama, Romance', 'Reality TV', 'Fantasy, Science-Fiction, Animation, Action & Adventure', 'Crime, Drama, Mystery & Thriller', 'Action & Adventure, Drama, Science-Fiction', 'Drama, Crime, Mystery & Thri

## **Fetching IMDb Rating Details**

In [None]:
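# Read the IMDb score from the 'imdb-score' span on each TV show page
# Note: this reuses the name imbd_scores, replacing the movie scores list
imbd_scores = []
for url in tv_show_links:
    try:
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.text, 'html.parser')
        # The first whitespace-separated token of the span text is the numeric score
        imbd_score = soup.find_all('span', class_='imdb-score')[0].text.strip().split()[0]
    except Exception:
        # Request failed or the page shows no IMDb score
        imbd_score = 'NA'
    imbd_scores.append(imbd_score)
print(len(imbd_scores))
print(imbd_scores)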

100
['8.0', '6.7', '8.1', '8.3', '8.2', '7.8', '5.9', '7.7', '8.4', '8.1', '9.2', '7.1', '9.1', '3.6', '7.4', '8.6', '6.9', '7.3', '8.2', '6.5', '5.8', '7.4', '9.2', '7.3', '8.3', '5.7', '9.5', 'NA', '6.2', '5.5', '8.1', '8.6', '8.3', '9.0', '8.0', '8.7', '8.1', '8.3', '8.4', '8.2', '7.6', '5.2', '8.6', '6.1', '8.6', '6.8', '7.2', '8.3', '8.4', '7.6', '8.7', '6.5', '8.1', '3.4', '8.0', '8.7', '7.8', '6.5', '5.3', '4.8', '8.5', '8.7', '7.6', '8.6', 'NA', '6.6', '7.2', '7.3', '8.7', '6.8', '8.2', '8.9', '9.3', '8.1', '7.8', '9.0', '8.0', '8.5', '8.3', 'NA', '8.3', 'NA', '6.1', '4.6', '8.5', '7.5', '7.8', '7.5', '7.8', '7.7', '5.5', '8.7', '7.8', '7.2', '9.0', '8.6', '8.1', '8.3', '8.7', '7.1']


## **Fetching Age Rating Details**

In [None]:
# Read the 'Age rating' entry from the detail-info blocks of each TV show page
import time
tv_show_agerating=[]
for url in tv_show_links:
  # Reset per show so a missing field does not reuse the previous show's value
  age_rating='NA'
  try:
    response=requests.get(url,headers=headers)
    soup=BeautifulSoup(response.text,'html.parser')
    for x in soup.find_all('div', class_='poster-detail-infos'):
      if x.find_all('h3')[0].text=='Age rating':
        age_rating=x.find_all('div')[0].text
  except Exception:
    age_rating='NA'
  tv_show_agerating.append(age_rating)
  time.sleep(1)  # pause between requests
print(len(tv_show_agerating))
print(tv_show_agerating)

100
['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'U', 'U', 'UA', 'UA', 'UA', 'UA', 'UA', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'U', 'U', 'U', 'A', 'A', 'A', 'A', 'U', 'U', 'U', 'U', 'U', 'U', 'U', 'U', 'U', 'U', 'U', 'U', 'A', 'A', 'A', 'A', 'A', 'A', 'U', 'U', 'U', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'U', 'A', 'A']


## **Fetching Production Country details**

In [None]:
# Read the 'Production country' entry from the detail-info blocks of each TV show page
tvshow_productioncountries = []
for url in tv_show_links:
    # Reset per show so a missing field does not reuse the previous show's value
    Production_country = 'NA'
    try:
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.text, 'html.parser')
        for x in soup.find_all('div', class_='poster-detail-infos'):
            if x.find('h3', class_='poster-detail-infos__subheading').text == 'Production country':
                Production_country = x.find('div', class_='poster-detail-infos__value').text
    except Exception:
        Production_country = 'NA'
    tvshow_productioncountries.append(Production_country)
print(len(tvshow_productioncountries))
print(tvshow_productioncountries)

100
['South Korea', 'India', 'India', 'Japan, South Korea', 'United Kingdom, United States', 'United States', 'India', 'Japan', 'India', 'India', 'United States', 'United States', 'Japan', 'India', 'United States', 'United States', 'India', 'India', 'United States', 'United States', 'United Kingdom', 'United States', 'India', 'Canada, Hungary, United States', 'India', 'Colombia', 'United States', 'Japan', 'United States', 'South Korea', 'India', 'United States', 'United States', 'India', 'United States', 'United States', 'United States', 'United States', 'Japan', 'Spain', 'United States', 'United States', 'Japan', 'Canada, United States', 'United States', 'United States', 'United Kingdom', 'United States', 'Colombia', 'United Kingdom', 'United States', 'India', 'United States', 'India', 'Japan', 'Germany', 'United States', 'Mexico', 'United States', 'India', 'Japan', 'Japan', 'South Korea', 'India', 'India', 'India', 'United States', 'United States', 'India', 'India', 'India', 'United 

## **Fetching Streaming Service details**

In [None]:
# Collect the provider names shown on each TV show page from the provider icon file names
tvshow_streaming_services = []
for url in tv_show_links:
    try:
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.text, 'html.parser')
        # Extract 'src' and process to get the name of the streaming service
        name = [
            x['src'].split('/')[-1].split('.')[0]  # Extract the last part of the URL and remove file extension
            for x in soup.find_all('img', class_="provider-icon wide icon") if 'src' in x.attrs
        ]
    except Exception as e:
        print(f"Error for URL {url}: {e}")
        # Fall back to a placeholder when the page cannot be fetched or parsed
        name = ['NA']
    tvshow_streaming_services.append(", ".join(name))
print(len(tvshow_streaming_services))
print(tvshow_streaming_services)

100
['netflix, amazonprimevideo', 'hotstar, amazonprimevideo', 'amazonprimevideo, amazonprimevideo, amazonprimevideo, amazonprimevideo', 'amazoncrunchyroll, amazonprimevideo', 'jiocinema, amazonprimevideo', 'amazonprimevideo, amazonprimevideo, amazonprimevideo', 'netflix, amazonprimevideo', 'netflix, amazonprimevideo', 'amazonprimevideo, amazonprimevideo, amazonprimevideo', 'netflix, amazonprimevideo', 'jiocinema, amazonprimevideo', 'amazonprimevideo', 'amazonprimevideo, amazonanimetimes, amazonprimevideo, amazonprimevideo', 'jiocinema, tatasky, amazonprimevideo, jiocinema', 'hotstar, amazonprimevideo', 'jiocinema, amazonprimevideo', 'vimoviesandtv, amazonprimevideo', 'zee5, zee5, amazonprimevideo', 'netflix, amazonprimevideo', 'netflix, amazonprimevideo', 'netflix, amazonprimevideo', 'amazonprimevideo, amazonprimevideo, amazonprimevideo', 'sonyliv, vimoviesandtv, amazonprimevideo', 'jiocinema, amazonprimevideo', 'amazonprimevideo, amazonprimevideo, amazonprimevideo, amazonprimevideo',
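
The output above repeats some providers for a single title (for example `amazonprimevideo` several times for one show). A minimal sketch of an order-preserving de-duplication that could be applied to the list of names before joining; `unique_services` is a hypothetical helper, not part of the original notebook:

```python
def unique_services(names):
    """Return provider names with duplicates removed, keeping first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(names))

# Sample value copied from the scraped output above
raw = ['jiocinema', 'tatasky', 'amazonprimevideo', 'jiocinema']
print(", ".join(unique_services(raw)))  # prints: jiocinema, tatasky, amazonprimevideo
```

Applying this before `", ".join(...)` would keep one entry per provider for each title, which also makes later counts per service easier to interpret.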

## **Fetching Duration Details**

In [None]:
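# Fetch the runtime from each TV show page
import time
tv_show_runtime = []
for url in tv_show_links:
    # Reset for every URL so a missing field never reuses the previous show's value
    runtime = 'NA'
    try:
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.text, 'html.parser')
        for x in soup.find_all('div', class_='poster-detail-infos'):
            heading = x.find('h3')
            if heading and heading.text == 'Runtime':
                runtime = x.find_all('div')[0].text
                break
    except Exception as e:
        print(f"Error for URL {url}: {e}")
    tv_show_runtime.append(runtime)
    # time.sleep(1)  # uncomment to pause between requests
print(len(tv_show_runtime))
print(tv_show_runtime)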

100
['57min', '23min', '43min', '24min', '51min', '51min', '36min', '54min', '50min', '44min', '58min', '34min', '25min', '1h 13min', '31min', '58min', '28min', '34min', '50min', '30min', '50min', '15min', '52min', '1h 5min', '56min', '40min', '47min', '3min', '43min', '1h 8min', '32min', '51min', '55min', '35min', '43min', '1h 1min', '50min', '1h 2min', '23min', '57min', '43min', '47min', '27min', '28min', '59min', '51min', '54min', '50min', '1h 4min', '1h 0min', '48min', '38min', '49min', '44min', '24min', '56min', '23min', '34min', '40min', '22min', '24min', '23min', '1h 2min', '46min', '35min', '43min', '37min', '49min', '57min', '40min', '24min', '1h 1min', '1h 5min', '24min', '49min', '41min', '40min', '24min', '51min', '45min', '58min', '47min', '45min', '24min', '21min', '50min', '43min', '42min', '49min', '48min', '42min', '1h 1min', '46min', '52min', '24min', '59min', '49min', '43min', '58min', '40min']
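
The runtimes above are stored as text such as `'57min'` or `'1h 13min'`, which cannot be averaged or compared directly. A small sketch of converting them to minutes, assuming only the two formats seen in the output; `runtime_to_minutes` is a hypothetical helper name:

```python
import re

def runtime_to_minutes(text):
    """Convert strings such as '1h 13min' or '57min' to total minutes; None if unparseable."""
    match = re.fullmatch(r'\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*min)?\s*', text)
    # Reject strings where neither an hour part nor a minute part was found
    if not match or not any(match.groups()):
        return None
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes

samples = ['57min', '1h 13min', '1h 0min', 'NA']
print([runtime_to_minutes(s) for s in samples])  # prints: [57, 73, 60, None]
```

Placeholders such as `'NA'` map to `None`, so they can be treated as missing values once the column is loaded into Pandas.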


## **Creating TV Show DataFrame**

In [None]:
# Write Your Code here
info_dict = {
    'url': tv_show_links,
    'title': tvshow_title,
    'release_year': Release_Years,
    'genre': tv_show_genre,
    'imdb_rating': imbd_scores,
    'runtime': tv_show_runtime,
    'age_rating': tv_show_agerating,
    'production_country': tvshow_productioncountries,
    'streaming_service': tvshow_streaming_services,
}
tv_data=pd.DataFrame(info_dict)
tv_data

In [None]:
# Save the TV show data without the numeric index column
tv_data.to_csv('tv_data.csv', index=False)
tv_data

In [None]:
final_data=pd.concat([movie_data,tv_data],axis=0)
final_data
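
The analysis cells further down group by a `type` column, while the `info_dict` shown above has no `type` key. One possible way to label each row before concatenating is sketched below, assuming the labels `'movie'` and `'tv_show'` (hypothetical values) and using small stand-in DataFrames rather than the scraped data:

```python
import pandas as pd

# Stand-in frames; the real notebook would use movie_data and tv_data
movies = pd.DataFrame({'title': ['Movie A', 'Movie B'], 'imdb_rating': ['7.0', 'NA']})
shows = pd.DataFrame({'title': ['Show A'], 'imdb_rating': ['8.0']})

# Tag each source with its type, then stack them with a fresh 0..n-1 index
combined = pd.concat(
    [movies.assign(type='movie'), shows.assign(type='tv_show')],
    ignore_index=True,
)

# 'NA' placeholders become NaN and are skipped by mean()
combined['imdb_rating'] = pd.to_numeric(combined['imdb_rating'], errors='coerce')
mean_by_type = combined.groupby('type')['imdb_rating'].mean()
print(mean_by_type)
```

With a label like this in place, a single `groupby('type')` yields one mean per category instead of requiring separate cells for movies and TV shows.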

In [None]:
final_data.to_csv('final_data.csv')

## **Task 2 :- Data Filtering & Analysis**

In [None]:
# Only include movies and TV shows released in the last 2 years (from the current date)
from datetime import datetime
# Work on a copy so the scraped data stays untouched
final_data1 = final_data.copy()
# Convert 'imdb_rating' to float, treating the 'NA' placeholder as missing
final_data1['imdb_rating'] = final_data1['imdb_rating'].replace('NA', float("NaN"))
final_data1['imdb_rating'] = final_data1['imdb_rating'].astype(float)
# Parse 'release_year' as a date for proper comparison
# (assumes four-digit year values such as 2023; anything else becomes NaT)
final_data1['release_year'] = pd.to_datetime(final_data1['release_year'].astype(str), format='%Y', errors='coerce')
# Define the current date and the cutoff two calendar years earlier
current_date = datetime.now()
two_years_ago = current_date - pd.DateOffset(years=2)
# Filter the DataFrame to include only movies and TV shows released in the last 2 years
final_data1_ry = final_data1[final_data1['release_year'] >= two_years_ago]
final_data1_ry
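
Because only the release year is scraped, the cell above compares January 1 of that year with today's date minus two years, so titles from the earliest eligible calendar year can be dropped depending on when the notebook runs. A sketch of a year-level comparison as an alternative, under the assumption that "last 2 years" means the current calendar year and the two before it; `released_in_last_years` is a hypothetical helper:

```python
from datetime import date

def released_in_last_years(release_year, years=2, today=None):
    """Return True if release_year falls within the last `years` calendar years."""
    today = today or date.today()
    try:
        year = int(str(release_year).strip())
    except ValueError:
        return False  # placeholders such as 'NA' are treated as not recent
    return today.year - years <= year <= today.year

run_date = date(2024, 6, 15)  # fixed date so the example is reproducible
print([released_in_last_years(y, today=run_date) for y in ['2024', '2022', '2021', 'NA']])
# prints: [True, True, False, False]
```

Applied to the raw year column, this could be used as a boolean mask, for example `final_data['release_year'].apply(released_in_last_years)`.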

In [None]:
# Only include titles with the age rating 'U'
final_data1_ar = final_data1[final_data1['age_rating'] == 'U']
final_data1_ar

In [None]:
# Only include movies and TV shows with an IMDb rating of 7 or higher
final_data1_ir = final_data1[final_data1['imdb_rating'] >= 7]
final_data1_ir

In [None]:
# Apply all filters together: released in the last 2 years, IMDb rating of 7 or higher, age rating 'U'
data_filtered = final_data1[
    (final_data1['release_year'] >= two_years_ago)
    & (final_data1['imdb_rating'] >= 7)
    & (final_data1['age_rating'] == 'U')
]
data_filtered

In [None]:
# Work on copies so the original DataFrames stay unchanged
final_data1 = final_data.copy()
movie_data1 = movie_data.copy()
tv_data1 = tv_data.copy()

In [None]:
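# Preview the first rows, then show column types and non-null counts
print(final_data1.head())
print("\n")
# info() prints its report itself and returns None, so it is not wrapped in print()
final_data1.info()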

In [None]:
# Check the number of rows and columns
print(final_data1.shape)

(210, 9)


In [None]:
# Count the null values in each column
final_data1.isnull().sum()

In [None]:
# Calculate the number of duplicate rows
final_data1.duplicated().sum()

4

In [None]:
# statistical report
final_data1.describe()

In [None]:
# Convert 'imdb_rating' column to numeric, handling non-numeric values
final_data1['imdb_rating'] = final_data1['imdb_rating'].replace('NA', float("NaN"))
final_data1['imdb_rating'] = final_data1['imdb_rating'].astype(float)
# Calculate the mean IMDb rating for movies and TV shows combined
mean_rating = final_data1.groupby('type')['imdb_rating'].mean()
print("Mean Imdb Rating")
print(mean_rating)

Mean Imdb Rating
type
imbd_scores    7.285294
Name: imdb_rating, dtype: float64


## **Calculating Mean IMDb Ratings for Both Movies and TV Shows**

In [None]:
# Convert 'imdb_rating' column to numeric, handling non-numeric values
movie_data1['imdb_rating'] = movie_data1['imdb_rating'].replace('NA', float("NaN"))
movie_data1['imdb_rating'] = movie_data1['imdb_rating'].astype(float)
# Calculate the mean IMDb rating for movies
movie_mean_rating = movie_data1.groupby('type')['imdb_rating'].mean()
print("Mean Imdb Rating")
print(movie_mean_rating)

Mean Imdb Rating
type
imbd_scores    6.975926
Name: imdb_rating, dtype: float64


In [None]:
# Convert 'imdb_rating' column to numeric, handling non-numeric values
tv_data1['imdb_rating'] = tv_data1['imdb_rating'].replace('NA', float("NaN"))
tv_data1['imdb_rating'] = tv_data1['imdb_rating'].astype(float)
# Calculate the mean IMDb rating for TV shows
tv_rating = tv_data1.groupby('type')['imdb_rating'].mean()
print("Mean Imdb Rating")
print(tv_rating)

Mean Imdb Rating
type
imbd_scores    7.633333
Name: imdb_rating, dtype: float64


## **Analyzing Top Genres**

In [None]:
# Top 5 genre combinations for movies, then for TV shows
# (each distinct genre string, e.g. 'Drama, Romance', is counted as one category)
movie_genre_count = movie_data1['genre'].value_counts().sort_values(ascending=False)
top_5movie_genres = movie_genre_count.head(5)
print(top_5movie_genres)
tv_genre_count = tv_data1['genre'].value_counts().sort_values(ascending=False)
top_5tv_genres = tv_genre_count.head(5)
print(top_5tv_genres)

genre
Drama, Romance                                   8
Comedy                                           4
Comedy, Horror                                   3
Action & Adventure, Crime, Mystery & Thriller    3
Crime, Drama, Mystery & Thriller                 3
Name: count, dtype: int64
genre
Drama                               6
Comedy, Drama, Romance              4
Crime, Drama                        4
Reality TV                          4
Crime, Drama, Mystery & Thriller    4
Name: count, dtype: int64


In [None]:
# Let's visualize it using a word cloud
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS
# Combine genre strings from both movie and TV show data
all_genres = " ".join(movie_data1['genre'].astype(str).tolist() + tv_data1['genre'].astype(str).tolist())
# Create a WordCloud object
wordcloud = WordCloud(width=800, height=400, background_color='white', stopwords=STOPWORDS).generate(all_genres)
# Display the word cloud using matplotlib
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

In [None]:
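# Top 5 genre strings across movies and TV shows combined
# (assumed definition: 'top_5_genres_visulalize' is not defined in the cells shown)
top_5_genres_visulalize = pd.concat([movie_data1['genre'], tv_data1['genre']]).value_counts().head(5)
plt.bar(x=top_5_genres_visulalize.index, height=top_5_genres_visulalize.values)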
plt.xlabel('Genre')
plt.ylabel('Number of Movies and TV Shows')
plt.title('Top 5 Genres with the Most Movies and TV Shows')
plt.xticks(rotation=45, ha='right')
plt.show()

## **Finding Predominant Streaming Service**

In [None]:
# Count how often each streaming-service string occurs for movies
streaming_service_counts = movie_data1['streaming_service'].value_counts().sort_values(ascending=False)
streaming_service_counts

In [None]:
# Count how often each streaming-service string occurs for TV shows
streaming_service_counts = tv_data1['streaming_service'].value_counts().sort_values(ascending=False)
streaming_service_counts
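
`value_counts()` on the raw column counts whole strings such as `netflix, amazonprimevideo`, not individual services. One way to find the single service with the most titles is to split each string and explode the result. The sketch below uses a small stand-in Series with values taken from the scraped output; treating `'NA'` as missing and counting a service once per title are assumptions:

```python
import pandas as pd

# Stand-in for the 'streaming_service' column of the combined data
services = pd.Series([
    'netflix, amazonprimevideo',
    'amazonprimevideo, amazonprimevideo, amazonprimevideo',
    'jiocinema, amazonprimevideo',
    'NA',
])

# Split each string and keep each service once per title
per_title = (
    services[services != 'NA']
    .str.split(',')
    .apply(lambda names: sorted({n.strip() for n in names}))
)

# One row per (title, service) pair, then count titles per service
service_counts = per_title.explode().value_counts()
print(service_counts.idxmax())  # prints: amazonprimevideo
```

The same steps applied to the concatenated movie and TV show columns would give one count per service across the whole dataset.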

In [None]:
# Let's visualize it using a word cloud
from wordcloud import WordCloud, STOPWORDS
# Combine streaming service strings from both movie and TV show data
all_services = " ".join(movie_data1['streaming_service'].astype(str).tolist() + tv_data1['streaming_service'].astype(str).tolist())
# Create a WordCloud object
wordcloud = WordCloud(width=800, height=400, background_color='white', stopwords=STOPWORDS).generate(all_services)
# Display the word cloud using matplotlib
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

## **Task 3 :- Data Export**

In [None]:
# Save the final DataFrame as Final_Data.csv
final_data.to_csv('Final_Data.csv', index=False)

In [None]:
# Save the filtered data as data_filtered.csv
data_filtered.to_csv('data_filtered.csv', index=False)

# **Dataset Drive Link (View Access with Anyone) -**

# ***Congratulations!!! You have completed your Assignment.***