# **Web Scraping & Data Handling Challenge**



### **Website:**
JustWatch -  https://www.justwatch.com/in/movies?release_year_from=2000


### **Description:**

JustWatch is a popular platform that allows users to search for movies and TV shows across multiple streaming services like Netflix, Amazon Prime, Hulu, etc. For this assignment, you will be required to scrape movie and TV show data from JustWatch using Selenium, Python, and BeautifulSoup. Extract data from HTML, not by directly calling their APIs. Then, perform data filtering and analysis using Pandas, and finally, save the results to a CSV file.

### **Tasks:**

**1. Web Scraping:**

Use BeautifulSoup to scrape the following data from JustWatch:

   **a. Movie Information:**

      - Movie title
      - Release year
      - Genre
      - IMDb rating
      - Streaming services available (Netflix, Amazon Prime, Hulu, etc.)
      - URL to the movie page on JustWatch

   **b. TV Show Information:**

      - TV show title
      - Release year
      - Genre
      - IMDb rating
      - Streaming services available (Netflix, Amazon Prime, Hulu, etc.)
      - URL to the TV show page on JustWatch

  **c. Scope:**

```
   - Scrape data for at least 50 movies and 50 TV shows.
   - You can choose the entry point (e.g., starting with popular movies,
     or a specific genre, etc.) to ensure a diverse dataset.
```


**2. Data Filtering & Analysis:**

   After scraping the data, use Pandas to perform the following tasks:

   **a. Filter movies and TV shows based on specific criteria:**

   ```
      - Only include movies and TV shows released in the last 2 years (from the current date).
      - Only include movies and TV shows with an IMDb rating of 7 or higher.
```
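
The two filters above can be sketched in Pandas as follows. This is only an illustration under stated assumptions: the column name `imdb_rating` and the helper `filter_recent_high_rated` are hypothetical, and "last 2 years" is approximated at year granularity because only the release year is scraped.

```python
import pandas as pd

def filter_recent_high_rated(df, current_year, min_rating=7.0, years_back=2):
    """Keep rows released within `years_back` years and rated >= `min_rating`."""
    # Coerce to numbers; unparseable values such as 'NA' become NaN and drop out.
    year = pd.to_numeric(df["release_year"], errors="coerce")
    rating = pd.to_numeric(df["imdb_rating"], errors="coerce")  # hypothetical column
    mask = (year >= current_year - years_back) & (rating >= min_rating)
    return df[mask].reset_index(drop=True)

# Tiny made-up sample, only to show the behaviour of the filter
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "release_year": ["2024", "2018", "NA", "2023"],
    "imdb_rating": [8.1, 7.9, 7.5, 6.4],
})
recent = filter_recent_high_rated(sample, current_year=2024)
print(recent)
```

In a real run, `current_year` could be taken from `pd.Timestamp.now().year` instead of being hard-coded.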

   **b. Data Analysis:**

   ```
      - Calculate the average IMDb rating for the scraped movies and TV shows.
      - Identify the top 5 genres that have the highest number of available movies and TV shows.
      - Determine the streaming service with the largest number of offerings.
      
   ```   
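
A minimal sketch of the three analysis steps, assuming genres and streaming services are stored as comma-separated strings (as they are in the scraped lists later in this notebook). The column names `imdb_rating`, `genre`, and `streaming`, and the helper `summarize`, are hypothetical.

```python
import pandas as pd

def summarize(df):
    """Return the mean rating, the top-5 genres, and the most common service."""
    mean_rating = pd.to_numeric(df["imdb_rating"], errors="coerce").mean()
    # Split the comma-separated strings so that each genre/service gets its own row
    genres = df["genre"].str.split(",").explode().str.strip()
    services = df["streaming"].str.split(",").explode().str.strip()
    services = services[services != ""]  # titles with no offers contribute nothing
    top_genres = genres.value_counts().head(5)
    top_service = services.value_counts().idxmax()
    return mean_rating, top_genres, top_service

# Made-up sample data for illustration only
sample = pd.DataFrame({
    "imdb_rating": [8.0, 7.0, 6.0],
    "genre": ["Drama, Crime", "Drama, Comedy", "Drama"],
    "streaming": ["Netflix , Hotstar", "Netflix", ""],
})
mean_rating, top_genres, top_service = summarize(sample)
print(mean_rating, top_genres, top_service, sep="\n")
```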

**3. Data Export:**

```
   - Dump the filtered and analysed data into a CSV file for further processing and reporting.

   - Save the CSV file in your Drive folder and share the Drive link in the Colab notebook, with view access granted to anyone with the link.
```

**Submission:**
```
- Submit a link to your Colab made for the assignment.

- The Colab should contain your Python script (.py format only) with clear
  comments explaining the scraping, filtering, and analysis process.

- Your code should run without errors, from start to finish, in one go.

- Place your dataset Drive link in the notebook before the conclusion.
```



**Note:**

1. Properly handle errors and exceptions during web scraping to ensure a robust script.

2. Make sure your code is well-structured, easy to understand, and follows Python best practices.

3. The assignment will be evaluated based on the correctness of the scraped data, accuracy of data filtering and analysis, and the overall quality of the Python code.
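
For note 1, one possible way to make the requests more robust is a shared `requests.Session` that retries transient failures and a fetch helper that returns `None` instead of raising. This is a sketch, not a required design: the helper names `make_session` and `fetch_html`, the retry count, and the list of retried status codes are all assumptions.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session(user_agent, retries=3, backoff=1.0):
    """Build a requests.Session that retries transient HTTP failures."""
    retry = Retry(
        total=retries,
        backoff_factor=backoff,  # wait progressively longer between attempts
        status_forcelist=[429, 500, 502, 503, 504],  # assumed set of retryable codes
    )
    session = requests.Session()
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    session.headers.update({"User-Agent": user_agent})
    return session

def fetch_html(session, url, timeout=10):
    """Return the page HTML, or None if the request ultimately fails."""
    try:
        response = session.get(url, timeout=timeout)
        response.raise_for_status()
        return response.text
    except requests.RequestException as exc:
        print(f"Skipping {url}: {exc}")
        return None

session = make_session("Mozilla/5.0")
```

A caller would then check for `None` before parsing, for example `html = fetch_html(session, url)` followed by `if html is not None: ...`.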








# **Start The Project**

## **Task 1: Web Scraping**

In [None]:
# Install all necessary libraries
!pip install bs4
!pip install requests

Collecting bs4
  Downloading bs4-0.0.2-py2.py3-none-any.whl.metadata (411 bytes)
Downloading bs4-0.0.2-py2.py3-none-any.whl (1.2 kB)
Installing collected packages: bs4
Successfully installed bs4-0.0.2


In [None]:
# Import all necessary libraries
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import numpy as np

## **Scraping Movies Data**

In [None]:
def fetch_movie_urls(url):
    """Download a JustWatch listing page and return it as a BeautifulSoup object."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    }
    response = requests.get(url, headers=headers, timeout=30)
    # Raise on failure instead of returning a tuple, so the caller never
    # mistakes an error message for a parsed page
    if response.status_code != 200:
        raise RuntimeError(f"Failed to retrieve the page, status code: {response.status_code}")
    return BeautifulSoup(response.text, 'html.parser')


url = 'https://www.justwatch.com/in/movies?release_year_from=2000'
soup = fetch_movie_urls(url)
print(soup.prettify())

## Hint : Use the following code to extract the film urls
# movie_links = soup.find_all('a', href=True)
# movie_urls = [link['href'] for link in movie_links if '/movie/' in link['href']]

# url_list=[]
# for x in movie_urls:
#   url_list.append('https://www.justwatch.com'+x)

<!DOCTYPE html>
<html data-vue-meta="%7B%22dir%22:%7B%22ssr%22:%22ltr%22%7D,%22lang%22:%7B%22ssr%22:%22en%22%7D%7D" data-vue-meta-server-rendered="" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta charset="utf-8" data-vue-meta="ssr"/>
  <meta content="IE=edge" data-vue-meta="ssr" httpequiv="X-UA-Compatible"/>
  <meta content="viewport-fit=cover, width=device-width, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0, user-scalable=no" data-vue-meta="ssr" name="viewport"/>
  <meta content="JustWatch" data-vue-meta="ssr" property="og:site_name"/>
  <meta content="794243977319785" data-vue-meta="ssr" property="fb:app_id"/>
  <meta content="/appassets/img/JustWatch_logo_with_claim.png" data-vmid="og:image" data-vue-meta="ssr" property="og:image"/>
  <meta content="606" data-vmid="og:image:width" data-vue-meta="ssr" property="og:image:width"/>
  <meta content="302" data-vmid="og:image:height" data-vue-meta="ssr" pro

In [None]:
headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    }

In [None]:
# Write Your Code here
url='https://www.justwatch.com/in/movies?release_year_from=2000'
content=requests.get(url,headers=headers)
soup=BeautifulSoup(content.text,'html.parser')
soup.prettify()



## **Fetching Movie URLs**

In [None]:
movie_url=[]
for x in soup.find_all('a',class_='title-list-grid__item--link'):
  movie_url.append("https://www.justwatch.com"+x['href'])
len(movie_url)
movie_url


['https://www.justwatch.com/in/movie/project-k',
 'https://www.justwatch.com/in/movie/kill-2024',
 'https://www.justwatch.com/in/movie/munjha',
 'https://www.justwatch.com/in/movie/maharaja-2024',
 'https://www.justwatch.com/in/movie/deadpool-3',
 'https://www.justwatch.com/in/movie/stree-2',
 'https://www.justwatch.com/in/movie/stree',
 'https://www.justwatch.com/in/movie/chandu-champion',
 'https://www.justwatch.com/in/movie/kingdom-of-the-planet-of-the-apes',
 'https://www.justwatch.com/in/movie/aadujeevitham',
 'https://www.justwatch.com/in/movie/agent',
 'https://www.justwatch.com/in/movie/deadpool',
 'https://www.justwatch.com/in/movie/dune-part-two',
 'https://www.justwatch.com/in/movie/the-ministry-of-ungentlemanly-warfare',
 'https://www.justwatch.com/in/movie/indian-2',
 'https://www.justwatch.com/in/movie/phir-aayi-hasseen-dillruba',
 'https://www.justwatch.com/in/movie/aavesham-2024',
 'https://www.justwatch.com/in/movie/laila-majnu',
 'https://www.justwatch.com/in/movie/th

## **Scraping Movie Title**

In [None]:
# Fetch each movie page and read the title from its first <h1>
movie_titles=[]
for url in movie_url:
  try:
    content=requests.get(url,headers=headers)
    soup=BeautifulSoup(content.text,'html.parser')
    title=soup.find_all('h1')[0].text
  except Exception:
    # The request failed or the page has no <h1>
    title='NA'
  movie_titles.append(title)

print(movie_titles)



[' Kalki 2898-AD (2024)', ' Kill (2024)', ' Munjya (2024)', ' Maharaja (2024)', ' Deadpool & Wolverine (2024)', ' Stree 2: Sarkate Ka Aatank (2024)', ' Stree (2018)', ' Chandu Champion (2024)', ' Kingdom of the Planet of the Apes (2024)', ' The Goat Life (2024)', ' Agent (2023)', ' Deadpool (2016)', ' Dune: Part Two (2024)', ' The Ministry of Ungentlemanly Warfare (2024)', ' Indian 2 (2024)', ' Phir Aayi Hasseen Dillruba (2024)', ' Aavesham (2024)', ' Laila Majnu (2018)', ' The Gangster, the Cop, the Devil (2019)', ' The Fall Guy (2024)', ' 365 Days (2020)', 'NA', 'NA', ' Weapon (2024)', ' Bhaiyya Ji (2024)', ' Furiosa: A Mad Max Saga (2024)', 'NA', ' Je Jatt Vigad Gya (2024)', 'NA', ' A Quiet Place: Day One (2024)', ' Mr. & Mrs. Mahi (2024)', 'NA', ' Love Lies Bleeding (2024)', 'NA', ' Savi (2024)', ' Harom Hara (2024)', ' Dune (2021)', ' Maharshi (2019)', 'NA', 'NA', ' Laapataa Ladies (2024)', ' Maharaj (2024)', ' Inside Out 2 (2024)', 'NA', ' Despicable Me 4 (2024)', ' Oppenheimer (
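
The scraped titles carry a leading space and the release year in parentheses. A small helper along these lines could separate the two; the name `split_title_year` and the regular expression are illustrative assumptions, not part of the original notebook.

```python
import re

# Title text, then a four-digit year in parentheses at the end of the string
TITLE_YEAR = re.compile(r"^\s*(?P<title>.*?)\s*\((?P<year>\d{4})\)\s*$")

def split_title_year(raw):
    """Split ' Kill (2024)' into ('Kill', 2024); the year is None if absent."""
    match = TITLE_YEAR.match(raw)
    if match is None:
        return raw.strip(), None
    return match.group("title"), int(match.group("year"))

print(split_title_year(" Kalki 2898-AD (2024)"))
print(split_title_year("NA"))
```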

## **Scraping Release Year**

In [None]:
# Read the release year from the 'release-year' span and drop the parentheses
movie_year=[]
for url in movie_url:
  try:
    content=requests.get(url,headers=headers)
    soup=BeautifulSoup(content.text,'html.parser')
    year=soup.find_all('span',class_='release-year')[0].text.strip('()')
  except Exception:
    # The request failed or the span is missing
    year='NA'
  movie_year.append(year)

print(movie_year)

['2024', '2024', '2024', '2024', '2024', '2024', '2018', '2024', '2024', '2024', '2023', '2016', '2024', '2024', '2024', '2024', '2024', '2018', '2019', '2024', '2020', '2024', '2024', '2024', '2024', '2024', 'NA', '2024', '2024', '2024', '2024', 'NA', '2024', 'NA', '2024', '2024', 'NA', 'NA', 'NA', '2024', 'NA', 'NA', 'NA', '2024', '2024', '2023', '2023', '2022', '2024', '2024', '2024', '2024', '2023', 'NA', '2024', '2024', '2023', '2024', '2016', '2023', '2023', '2024', '2021', '2024', '2015', '2024', '2024', '2011', '2024', '2024', '2001', 'NA', '2024', '2024', '2024', '2024', '2002', '2024', '2024', '2015', '2017', '2018', '2004', '2024', 'NA', 'NA', '2018', '2014', 'NA', '2024', 'NA', '2022', '2015', '2013', '2013', '2018', '2023', 'NA', '2022', 'NA']


## **Scraping Genres**

In [None]:
# Read the genre list from the 'detail-infos' block whose heading is 'Genres'
movie_genre=[]
for url in movie_url:
    genre='NA'  # reset for every page so a missing field never reuses the previous movie's value
    try:
        content=requests.get(url,headers=headers)
        soup=BeautifulSoup(content.text,'html.parser')
        for x in soup.find_all('div',class_='detail-infos'):
            if x.find_all('h3')[0].text=='Genres':
              genre=x.find_all('span')[0].text
    except Exception:
        genre='NA'
    movie_genre.append(genre)
print(movie_genre)


['Drama, Action & Adventure, Fantasy, Science-Fiction, Mystery & Thriller', 'Action & Adventure, Crime, Drama, Mystery & Thriller', 'Comedy, Horror', 'Drama, Mystery & Thriller, Action & Adventure, Crime', 'Action & Adventure, Comedy, Science-Fiction', 'Comedy, Horror', 'Horror, Comedy, Drama', 'Drama, History, Sport, War & Military, Action & Adventure', 'Science-Fiction, Action & Adventure, Drama, Mystery & Thriller', 'Drama', 'Mystery & Thriller, Action & Adventure', 'Comedy, Action & Adventure', 'Action & Adventure, Science-Fiction, Drama', 'Action & Adventure, Comedy, War & Military', 'Action & Adventure, Drama, Mystery & Thriller', 'Mystery & Thriller, Romance, Crime, Drama', 'Comedy, Action & Adventure', 'Drama, Romance', 'Mystery & Thriller, Action & Adventure, Crime', 'Comedy, Drama, Romance, Action & Adventure', 'Drama, Romance, Made in Europe', 'Drama', 'Action & Adventure, Drama, Mystery & Thriller', 'Science-Fiction, Mystery & Thriller, Action & Adventure', 'Action & Advent

## **Scraping IMDb Rating**

In [None]:
# Read the rating text from the 'poster-detail-infos' block whose heading is 'Rating'
movie_rating=[]
for url in movie_url:
  rating='NA'  # reset for every page so a missing rating never reuses the previous movie's value
  try:
    content=requests.get(url,headers=headers)
    soup=BeautifulSoup(content.text,'html.parser')
    for x in soup.find_all('div',class_='poster-detail-infos'):
      if x.find_all('h3')[0].text=='Rating':
        rating=x.find_all('div')[0].text
  except Exception:
    rating='NA'
  movie_rating.append(rating)
print(movie_rating)

['8.1  (222k)95%', '8.1  (222k)95%', '8.1  (222k)95%', '8.1  (222k)95%', '8.0  (240k)78%', '7.9  (16k)67%', '7.9  (16k)67%', '7.9  (16k)67%', '7.9  (16k)67%', '7.9  (16k)67%', '4.2  (1k)', '4.2  (1k)', '4.2  (1k)', '4.2  (1k)', '4.2  (1k)', '4.2  (1k)', '4.2  (1k)', '4.2  (1k)', '4.2  (1k)', '4.2  (1k)', '4.2  (1k)', '4.2  (1k)', '4.2  (1k)', '4.2  (1k)', '4.2  (1k)', '7.6  (170k)90%', '7.6  (170k)90%', '7.6  (170k)90%', '7.6  (170k)90%', '6.4  (78k)86%', '6.4  (78k)86%', '6.4  (78k)86%', '6.4  (78k)86%', '6.4  (78k)86%', '6.4  (78k)86%', '6.4  (78k)86%', '6.4  (78k)86%', '6.4  (78k)86%', '6.4  (78k)86%', '6.7  (76k)74%', '6.7  (76k)74%', '6.7  (76k)74%', '7.8  (107k)91%', '7.8  (107k)91%', '6.2  (32k)56%', '6.2  (32k)56%', '6.2  (32k)56%', '6.2  (32k)56%', '6.2  (32k)56%', '6.2  (32k)56%', '6.2  (32k)56%', '6.2  (32k)56%', '6.2  (32k)56%', '4.3  (9k)42%', '6.1  (98k)54%', '6.1  (98k)54%', '6.1  (98k)54%', '6.1  (98k)54%', '8.3  (215k)89%', '8.3  (215k)89%', '8.3  (215k)89%', '7.0  (53
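
The raw rating strings combine a score, a vote count, and sometimes a percentage, so they cannot be compared with 7 directly. The sketch below extracts the leading number as a float, assuming that the leading number is the IMDb score; the helper name `parse_imdb_score` is hypothetical.

```python
import re

def parse_imdb_score(raw):
    """Return the leading score as a float, or None if the string has no number."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)", raw)
    return float(match.group(1)) if match else None

for text in ["8.1  (222k)95%", "4.2  (1k)", "3.9 ", "NA"]:
    print(repr(text), "->", parse_imdb_score(text))
```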

## **Scraping Runtime/Duration**

In [None]:
# Read the runtime from the 'detail-infos' block whose heading is 'Runtime'
movie_runtime=[]
for url in movie_url:
    runtime='NA'  # reset for every page so a missing field never reuses the previous movie's value
    try:
        content=requests.get(url,headers=headers)
        soup=BeautifulSoup(content.text,'html.parser')
        for x in soup.find_all('div',class_='detail-infos'):
          if x.find_all('h3')[0].text=='Runtime':
            runtime=x.find_all('div')[0].text
    except Exception:
        runtime='NA'
    movie_runtime.append(runtime)
print(movie_runtime)


['2h 55min', '1h 45min', '2h 3min', '2h 30min', '2h 8min', '2h 27min', '2h 8min', '2h 22min', '2h 25min', '2h 0min', '2h 0min', '1h 48min', '2h 46min', '2h 2min', '3h 0min', '2h 13min', '2h 38min', '2h 19min', '1h 50min', '2h 6min', '1h 54min', '2h 3min', '2h 25min', '2h 0min', '2h 30min', '2h 29min', '1h 55min', '2h 12min', '2h 0min', '1h 39min', '2h 19min', '2h 0min', '1h 44min', '2h 55min', '2h 3min', '2h 34min', '2h 35min', '2h 56min', '2h 56min', '2h 3min', '2h 2min', '2h 12min', '1h 37min', '2h 39min', '1h 34min', '3h 0min', '2h 4min', '2h 36min', '2h 28min', '2h 15min', '2h 10min', '1h 34min', '2h 30min', '2h 34min', '1h 55min', '1h 49min', '2h 37min', '1h 55min', '2h 41min', '2h 26min', '2h 26min', '1h 41min', '2h 15min', '2h 30min', '2h 30min', '2h 28min', '2h 36min', '2h 39min', '2h 39min', '1h 55min', '2h 32min', '2h 50min', '2h 50min', '2h 50min', '2h 50min', '2h 50min', '2h 50min', '2h 13min', '2h 20min', '2h 20min', '2h 3min', '2h 3min', '1h 48min', '1h 48min', '1h 48min'
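
Runtimes arrive as strings such as `'2h 55min'` or `'50min'`. If a numeric duration is needed for analysis, a conversion along these lines could be used; `runtime_to_minutes` is a hypothetical helper and assumes only the `Nh` / `Nmin` formats seen in the output.

```python
import re

def runtime_to_minutes(raw):
    """Convert '2h 55min' or '50min' to total minutes; return None if unparseable."""
    match = re.fullmatch(r"\s*(?:(\d+)h)?\s*(?:(\d+)min)?\s*", raw)
    if match is None or not any(match.groups()):
        return None
    hours, minutes = (int(part) if part else 0 for part in match.groups())
    return hours * 60 + minutes

for text in ["2h 55min", "50min", "1h 2min", "NA"]:
    print(text, "->", runtime_to_minutes(text))
```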

## **Scraping Age Rating**

In [None]:
# Read the age rating from the 'detail-infos' block whose heading is 'Age rating'
movie_agerating=[]
for url in movie_url:
    age_rating='NA'  # reset for every page so a missing field never reuses the previous movie's value
    try:
        content=requests.get(url,headers=headers)
        soup=BeautifulSoup(content.text,'html.parser')
        for x in soup.find_all('div',class_='detail-infos'):
            if x.find_all('h3')[0].text=='Age rating':
              age_rating=x.find_all('div')[0].text
    except Exception:
      age_rating='NA'
    movie_agerating.append(age_rating)
print(movie_agerating)

['UA', 'A', 'A', 'A', 'A', 'UA', 'UA', 'UA', 'UA', 'UA', 'UA', 'A', 'A', 'A', 'UA', 'UA', 'UA', 'UA', 'UA', 'UA', 'UA', 'U', 'A', 'UA', 'UA', 'A', 'A', 'A', 'UA', 'UA', 'U', 'A', 'A', 'A', 'UA', 'A', 'UA', 'UA', 'UA', 'UA', 'UA', 'UA', 'U', 'U', 'U', 'UA', 'UA', 'UA', 'A', 'UA', 'A', 'U', 'UA', 'U', 'UA', 'UA', 'UA', 'UA', 'U', 'U', 'A', 'A', 'A', 'UA', 'U', 'U', 'U', 'UA', 'UA', 'A', 'U', 'UA', 'UA', 'UA', 'U', 'UA', 'U', 'UA', 'UA', 'UA', 'A', 'A', 'A', 'A', 'UA', 'UA', 'UA', 'A', 'A', 'UA', 'UA', 'UA', 'UA', 'UA', 'A', 'A', 'A', 'A', 'A', 'A']


## **Fetching Production Countries Details**

In [None]:
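# Read the production country from the matching 'detail-infos' block
movie_country=[]
for url in movie_url:
    country='NA'  # reset for every page so a missing field never reuses the previous movie's value
    try:
        content=requests.get(url,headers=headers)
        soup=BeautifulSoup(content.text,'html.parser')
        for x in soup.find_all('div',class_='detail-infos'):
            if x.find_all('h3')[0].text==' Production country ':
              country=x.find_all('div')[0].text
    except Exception:
      country='NA'  # the original assigned 'runtime' here by mistake
    movie_country.append(country)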
print(movie_country)


['India', 'India', 'India', 'India', 'United States', 'India', 'India', 'India', 'United States', 'United States, India', 'India', 'United States', 'United States', 'Turkey, United States, United Kingdom', 'India', 'India', 'India', 'India', 'South Korea', 'Canada, United States, Australia', 'Poland', 'India', 'India', 'India', 'India', 'Australia, United States', 'United States', 'India', 'India', 'United States, United Kingdom, Canada', 'India', 'United States', 'United States, United Kingdom', 'India', 'India', 'India', 'United States', 'India', 'India', 'United States', 'India', 'India', 'United States', 'India', 'United States', 'United Kingdom, United States', 'Germany, Japan', 'India', 'India', 'India', 'United States', 'United States', 'India', 'Thailand, China, India', 'United States', 'United States', 'Canada, United States', 'United States', 'India, United States', 'India', 'India', 'Canada, United States', 'India', 'India', 'United States', 'India', 'India', 'India', 'India

## **Fetching Streaming Service Details**

In [None]:
# Collect the provider names from the alt text of each offer icon
movie_stream_service=[]
for url in movie_url:
    try:
        content=requests.get(url,headers=headers)
        soup=BeautifulSoup(content.text,'html.parser')
        names=[x['alt'] for x in soup.find_all('img',class_='offer__icon')]
    except Exception:
      names=['NA']  # a list, so join() yields 'NA' rather than 'N , A'
    movie_stream_service.append(" , ".join(names))
print(movie_stream_service)


['Amazon Prime Video , Amazon Prime Video , Netflix , Amazon Video , Bookmyshow', '', 'Hotstar', 'Netflix , Bookmyshow', 'Bookmyshow', 'Bookmyshow', 'Apple TV , Hotstar , Apple TV , Apple TV', 'Amazon Prime Video , Amazon Prime Video , Bookmyshow', 'Apple TV , Hotstar , Apple TV , Amazon Video , Apple TV', 'Netflix', '', 'Apple TV , Hotstar , Amazon Video , Apple TV , Apple TV', 'Apple TV , Jio Cinema , Amazon Video , Apple TV , Apple TV', 'Amazon Prime Video , Amazon Prime Video', 'Netflix', 'Netflix', 'Amazon Prime Video , Amazon Prime Video , Hotstar , Amazon Video', 'Zee5', '', 'Apple TV , Zee5 , Amazon Video , Apple TV , Apple TV', 'Netflix', 'Amazon Prime Video , Amazon Prime Video , Amazon Video', 'Amazon Prime Video , Amazon Prime Video , Sun Nxt , Amazon Video , Bookmyshow', 'Amazon Prime Video , Amazon Prime Video , aha', 'Zee5', 'Apple TV , Amazon Video , Apple TV , Apple TV', 'Apple TV , Zee5 , Amazon Video , Apple TV , Apple TV', '', 'Amazon Prime Video , Amazon Prime Vide
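
Several entries repeat the same provider (for example `Apple TV` three times for one title), which would inflate per-service counts in the analysis step. One possible clean-up, using a hypothetical helper `unique_services`, keeps the first occurrence of each name and preserves order.

```python
def unique_services(names):
    """Drop repeated and empty names while preserving first-seen order."""
    return list(dict.fromkeys(name.strip() for name in names if name.strip()))

raw = "Apple TV , Hotstar , Apple TV , Amazon Video , Apple TV"
cleaned = ", ".join(unique_services(raw.split(",")))
print(cleaned)
```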

## **Now Creating Movies DataFrame**

In [None]:
# Write Your Code here
info={'movie_url':movie_url,
'movie_name':movie_titles,
'release_year':movie_year,
'movie_rating':movie_rating,
'movie_genre':movie_genre,
'movie_runtime':movie_runtime,
'movie_agerating':movie_agerating,
'movie_country':movie_country,
'movie_stream_service' :movie_stream_service}

data=pd.DataFrame(info)


In [None]:
data.head(30)

Unnamed: 0,movie_url,movie_name,release_year,movie_rating,movie_genre,movie_runtime,movie_agerating,movie_country,movie_stream_service
0,https://www.justwatch.com/in/movie/project-k,Kalki 2898-AD (2024),2024.0,8.1 (222k)95%,"Drama, Action & Adventure, Fantasy, Science-Fi...",2h 55min,UA,India,"Amazon Prime Video , Amazon Prime Video , Netf..."
1,https://www.justwatch.com/in/movie/kill-2024,Kill (2024),2024.0,8.1 (222k)95%,"Action & Adventure, Crime, Drama, Mystery & Th...",1h 45min,A,India,
2,https://www.justwatch.com/in/movie/munjha,Munjya (2024),2024.0,8.1 (222k)95%,"Comedy, Horror",2h 3min,A,India,Hotstar
3,https://www.justwatch.com/in/movie/maharaja-2024,Maharaja (2024),2024.0,8.1 (222k)95%,"Drama, Mystery & Thriller, Action & Adventure,...",2h 30min,A,India,"Netflix , Bookmyshow"
4,https://www.justwatch.com/in/movie/deadpool-3,Deadpool & Wolverine (2024),2024.0,8.0 (240k)78%,"Action & Adventure, Comedy, Science-Fiction",2h 8min,A,United States,Bookmyshow
5,https://www.justwatch.com/in/movie/stree-2,Stree 2: Sarkate Ka Aatank (2024),2024.0,7.9 (16k)67%,"Comedy, Horror",2h 27min,UA,India,Bookmyshow
6,https://www.justwatch.com/in/movie/stree,Stree (2018),2018.0,7.9 (16k)67%,"Horror, Comedy, Drama",2h 8min,UA,India,"Apple TV , Hotstar , Apple TV , Apple TV"
7,https://www.justwatch.com/in/movie/chandu-cham...,Chandu Champion (2024),2024.0,7.9 (16k)67%,"Drama, History, Sport, War & Military, Action ...",2h 22min,UA,India,"Amazon Prime Video , Amazon Prime Video , Book..."
8,https://www.justwatch.com/in/movie/kingdom-of-...,Kingdom of the Planet of the Apes (2024),2024.0,7.9 (16k)67%,"Science-Fiction, Action & Adventure, Drama, My...",2h 25min,UA,United States,"Apple TV , Hotstar , Apple TV , Amazon Video ,..."
9,https://www.justwatch.com/in/movie/aadujeevitham,The Goat Life (2024),2024.0,7.9 (16k)67%,Drama,2h 0min,UA,"United States, India",Netflix
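
Each field above is collected in its own loop over `movie_url`, so every page is downloaded once per field. A possible alternative is to download each page once and extract several fields from the same soup. The sketch below parses an HTML string with hypothetical helpers `detail_value` and `parse_title_page`; the sample markup is made up and modelled only on the class names used in this notebook, so the real page structure may differ.

```python
from bs4 import BeautifulSoup

def detail_value(soup, label, tag="div"):
    """Return the text of the first `tag` in the 'detail-infos' block headed `label`."""
    for block in soup.find_all("div", class_="detail-infos"):
        heading = block.find("h3")
        if heading is None or heading.get_text(strip=True) != label:
            continue
        value = block.find(tag)
        return value.get_text(strip=True) if value else "NA"
    return "NA"

def parse_title_page(html):
    """Extract several fields from one detail page in a single pass."""
    soup = BeautifulSoup(html, "html.parser")
    h1 = soup.find("h1")
    year = soup.find("span", class_="release-year")
    return {
        "title": h1.get_text(" ", strip=True) if h1 else "NA",
        "release_year": year.get_text(strip=True).strip("()") if year else "NA",
        "genre": detail_value(soup, "Genres", tag="span"),
        "runtime": detail_value(soup, "Runtime"),
        "age_rating": detail_value(soup, "Age rating"),
        "services": [img["alt"] for img in soup.find_all("img", class_="offer__icon")],
    }

# Made-up markup for illustration; not copied from JustWatch
sample_html = """
<h1> Kill <span class="release-year">(2024)</span></h1>
<div class="detail-infos"><h3>Genres</h3><span>Crime, Drama</span></div>
<div class="detail-infos"><h3>Runtime</h3><div>1h 45min</div></div>
<img class="offer__icon" alt="Hotstar">
"""
record = parse_title_page(sample_html)
print(record)
```

With this shape, one loop over the URLs can build a list of dictionaries that `pd.DataFrame` accepts directly, and a field that is missing on one page cannot pick up a value from another page.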


## **Scraping TV Show Data**

In [None]:
# Specifying the URL from which tv show related data will be fetched
tv_url='https://www.justwatch.com/in/tv-shows?release_year_from=2000'
# Sending an HTTP GET request to the URL
page=requests.get(tv_url,headers=headers)
# Parsing the HTML content using BeautifulSoup with the 'html.parser'
soup=BeautifulSoup(page.text,'html.parser')
# Printing the prettified HTML content
print(soup.prettify())

<!DOCTYPE html>
<html data-vue-meta="%7B%22dir%22:%7B%22ssr%22:%22ltr%22%7D,%22lang%22:%7B%22ssr%22:%22en%22%7D%7D" data-vue-meta-server-rendered="" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta charset="utf-8" data-vue-meta="ssr"/>
  <meta content="IE=edge" data-vue-meta="ssr" httpequiv="X-UA-Compatible"/>
  <meta content="viewport-fit=cover, width=device-width, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0, user-scalable=no" data-vue-meta="ssr" name="viewport"/>
  <meta content="JustWatch" data-vue-meta="ssr" property="og:site_name"/>
  <meta content="794243977319785" data-vue-meta="ssr" property="fb:app_id"/>
  <meta content="/appassets/img/JustWatch_logo_with_claim.png" data-vmid="og:image" data-vue-meta="ssr" property="og:image"/>
  <meta content="606" data-vmid="og:image:width" data-vue-meta="ssr" property="og:image:width"/>
  <meta content="302" data-vmid="og:image:height" data-vue-meta="ssr" pro

## **Fetching TV Show URL Details**

In [None]:
# Write Your Code here
show_url=[]
for x in soup.find_all('a',class_="title-list-grid__item--link"):
    show_url.append('https://www.justwatch.com'+ x['href'])
print(show_url)
print(len(show_url))


['https://www.justwatch.com/in/tv-show/mirzapur', 'https://www.justwatch.com/in/tv-show/house-of-the-dragon', 'https://www.justwatch.com/in/tv-show/adams-sweet-agony', 'https://www.justwatch.com/in/tv-show/gyaarah-gyaarah', 'https://www.justwatch.com/in/tv-show/the-boys', 'https://www.justwatch.com/in/tv-show/game-of-thrones', 'https://www.justwatch.com/in/tv-show/panchayat', 'https://www.justwatch.com/in/tv-show/sweet-home', 'https://www.justwatch.com/in/tv-show/apharan', 'https://www.justwatch.com/in/tv-show/x-x-x-uncensored', 'https://www.justwatch.com/in/tv-show/shekhar-home', 'https://www.justwatch.com/in/tv-show/attack-on-titan', 'https://www.justwatch.com/in/tv-show/shogun-2024', 'https://www.justwatch.com/in/tv-show/batman-caped-crusader', 'https://www.justwatch.com/in/tv-show/elite', 'https://www.justwatch.com/in/tv-show/the-umbrella-academy', 'https://www.justwatch.com/in/tv-show/demon-slayer-kimetsu-no-yaiba', 'https://www.justwatch.com/in/tv-show/tribhuvan-mishra-ca-topper'

## **Fetching TV Show Title Details**

In [None]:
# Fetch each show page and read the title from its first <h1>
show_titles=[]

for url in show_url:
    try:
        content=requests.get(url,headers=headers)
        soup=BeautifulSoup(content.text,'html.parser')
        title=soup.find_all('h1')[0].text
    except Exception:
      # The request failed or the page has no <h1>
      title='NA'
    show_titles.append(title)
print(len(show_titles))
print(show_titles)



100
[' Mirzapur (2018)', ' House of the Dragon (2022)', " Adam's Sweet Agony (2024)", ' Gyaarah Gyaarah (2024)', ' The Boys (2019)', ' Game of Thrones (2011)', ' Panchayat (2020)', ' Sweet Home (2020)', ' Apharan (2018)', ' XXX: Uncensored (2018)', ' Shekhar Home (2024)', ' Attack on Titan (2013)', ' Shōgun (2024)', ' Batman: Caped Crusader (2024)', ' Elite (2018)', ' The Umbrella Academy (2019)', ' Demon Slayer: Kimetsu no Yaiba (2019)', ' Tribhuvan Mishra CA Topper (2024)', ' Shahmaran (2023)', ' Mad Men (2007)', ' Mastram (2020)', ' Presumed Innocent (2024)', ' Money Heist (2017)', ' Bigg Boss OTT (2021)', ' The Bear (2022)', ' Farzi (2023)', ' Asur: Welcome to Your Dark Side (2020)', ' Bigg Boss (2006)', " A Good Girl's Guide to Murder (2024)", ' Gullak (2019)', ' Aashram (2020)', ' Unsolved Mysteries (2020)', ' Breaking Bad (2008)', ' The Family Man (2019)', ' Terror Tuesday: Extreme (2024)', ' Stranger Things (2016)', ' Evil (2019)', ' College Romance (2018)', ' The Rookie (2018)

## **Fetching Release Year**

In [None]:
# Read the release year from the 'release-year' span and drop the parentheses
show_year=[]
for url in show_url:
  try:
    content=requests.get(url,headers=headers)
    soup=BeautifulSoup(content.text,'html.parser')
    year=soup.find_all('span',class_='release-year')[0].text.strip('()')
  except Exception:
    # The request failed or the span is missing
    year='NA'
  show_year.append(year)

print(show_year)

['2018', '2022', '2024', '2024', '2019', '2011', '2020', '2020', '2018', '2018', '2024', '2013', '2024', '2024', '2018', '2019', '2019', '2024', '2023', '2007', '2020', '2024', '2017', '2021', '2022', '2023', '2020', '2006', '2024', '2019', '2020', '2020', '2008', '2019', '2024', '2016', '2019', '2018', '2018', '2021', '2020', '2017', '2022', '2018', '2020', '2010', '2024', '2022', '2022', '2024', '2019', '2013', '2024', '2024', '2020', '2018', '2008', '2010', '2002', '2005', '2014', '2007', '2018', '2019', '2017', '2024', '2014', '2009', '2021', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', '2024', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', '2021', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', '2016', 'NA', 'NA']


## **Fetching TV Show Genre Details**

In [None]:
# Read the genre list from the 'detail-infos' block whose heading is 'Genres'
show_genre=[]
for url in show_url:
    genre='NA'  # reset for every page so a missing field never reuses the previous show's value
    try:
        content=requests.get(url,headers=headers)
        soup=BeautifulSoup(content.text,'html.parser')
        for x in soup.find_all('div',class_='detail-infos'):
            if x.find_all('h3')[0].text=='Genres':
              genre=x.find_all('span')[0].text
    except Exception:
        genre='NA'
    show_genre.append(genre)
print(show_genre)

['Action & Adventure, Drama, Crime, Mystery & Thriller', 'Action & Adventure, Science-Fiction, Drama, Fantasy, Romance', 'Animation', 'Fantasy, Drama, Science-Fiction', 'Science-Fiction, Action & Adventure, Comedy, Crime, Drama', 'Action & Adventure, Science-Fiction, Drama, Fantasy', 'Drama, Comedy', 'Science-Fiction, Drama, Fantasy, Horror, Mystery & Thriller', 'Action & Adventure, Crime, Mystery & Thriller, Drama', 'Comedy, Drama, Romance', 'Drama, Crime', 'Horror, Animation, Action & Adventure, Drama, Fantasy, Science-Fiction', 'War & Military, Drama, History', 'Action & Adventure, Crime, Kids & Family, Fantasy, Science-Fiction, Animation', 'Drama, Mystery & Thriller, Crime', 'Science-Fiction, Action & Adventure, Comedy, Drama, Fantasy', 'Animation, Action & Adventure, Science-Fiction, Mystery & Thriller, Fantasy', 'Comedy, Crime, Drama, Mystery & Thriller', 'Mystery & Thriller, Action & Adventure, Drama, Science-Fiction, Romance, Fantasy', 'Drama', 'Drama, Fantasy, Comedy', 'Drama,

## **Fetching IMDb Rating Details**

In [None]:
# Read the rating text from the 'poster-detail-infos' block whose heading is 'Rating'
show_rating=[]
for url in show_url:
  rating='NA'  # reset for every page so a missing rating never reuses the previous show's value
  try:
    content=requests.get(url,headers=headers)
    soup=BeautifulSoup(content.text,'html.parser')
    for x in soup.find_all('div',class_='poster-detail-infos'):
      if x.find_all('h3')[0].text=='Rating':
        rating=x.find_all('div')[0].text
  except Exception:
    rating='NA'
  show_rating.append(rating)
print(show_rating)

['8.2  (72k)75%', '8.2  (72k)75%', '8.2  (72k)75%', '8.2  (72k)75%', '8.2  (72k)75%', '8.2  (72k)75%', '8.2  (72k)75%', '8.2  (72k)75%', '8.2  (72k)75%', '8.2  (72k)75%', '8.2  (72k)75%', '8.2  (72k)75%', '8.2  (72k)75%', '8.2  (72k)75%', '8.2  (72k)75%', '8.2  (72k)75%', '8.2  (72k)75%', '8.2  (72k)75%', '8.2  (72k)75%', '8.2  (72k)75%', '6.9  (2k)', '6.9  (2k)', '6.9  (2k)', '1.9  (1k)', '1.9  (1k)', '1.9  (1k)', '1.9  (1k)', '3.6  (4k)', '3.6  (4k)', '3.6  (4k)', '6.6  (57k)', '6.6  (57k)', '6.6  (57k)', '6.6  (57k)', '6.6  (57k)', '6.6  (57k)', '6.6  (57k)', '6.6  (57k)', '8.0  (79k)', '6.0  (15k)77%', '6.0  (15k)77%', '6.0  (15k)77%', '6.0  (15k)77%', '6.0  (15k)77%', '6.0  (15k)77%', '6.0  (15k)77%', '7.9  (1k)', '7.9  (1k)', '7.9  (1k)', '7.9  (1k)', '7.9  (1k)', '7.9  (1k)', '7.9  (1k)', '7.9  (1k)', '7.9  (1k)', '7.9  (1k)', '3.9 ', '3.9 ', '3.9 ', '3.9 ', '3.9 ', '3.9 ', '3.9 ', '3.9 ', '3.9 ', '3.9 ', '3.9 ', '3.9 ', '3.9 ', '3.9 ', '3.9 ', '3.9 ', '3.9 ', '3.9 ', '3.9 ', '3

## **Fetching Age Rating Details**

In [None]:
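# Read the age rating from the 'detail-infos' block whose heading is 'Age rating'
show_agerating=[]
for url in show_url:
    age_rating='NA'  # reset for every page so a missing field never reuses the previous show's value
    try:
        content=requests.get(url,headers=headers)
        soup=BeautifulSoup(content.text,'html.parser')
        for x in soup.find_all('div',class_='detail-infos'):
            if x.find_all('h3')[0].text=='Age rating':
              age_rating=x.find_all('div')[0].text
    except Exception:
      age_rating='NA'
    # Append this show's age rating (the original appended the whole show_rating list)
    show_agerating.append(age_rating)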
print(show_agerating)

[['8.2  (72k)75%', '8.2  (72k)75%', '8.2  (72k)75%', '8.2  (72k)75%', '8.2  (72k)75%', '8.2  (72k)75%', '8.2  (72k)75%', '8.2  (72k)75%', '8.2  (72k)75%', '8.2  (72k)75%', '8.2  (72k)75%', '8.2  (72k)75%', '8.2  (72k)75%', '8.2  (72k)75%', '8.2  (72k)75%', '8.2  (72k)75%', '8.2  (72k)75%', '8.2  (72k)75%', '8.2  (72k)75%', '8.2  (72k)75%', '6.9  (2k)', '6.9  (2k)', '6.9  (2k)', '1.9  (1k)', '1.9  (1k)', '1.9  (1k)', '1.9  (1k)', '3.6  (4k)', '3.6  (4k)', '3.6  (4k)', '6.6  (57k)', '6.6  (57k)', '6.6  (57k)', '6.6  (57k)', '6.6  (57k)', '6.6  (57k)', '6.6  (57k)', '6.6  (57k)', '8.0  (79k)', '6.0  (15k)77%', '6.0  (15k)77%', '6.0  (15k)77%', '6.0  (15k)77%', '6.0  (15k)77%', '6.0  (15k)77%', '6.0  (15k)77%', '7.9  (1k)', '7.9  (1k)', '7.9  (1k)', '7.9  (1k)', '7.9  (1k)', '7.9  (1k)', '7.9  (1k)', '7.9  (1k)', '7.9  (1k)', '7.9  (1k)', '3.9 ', '3.9 ', '3.9 ', '3.9 ', '3.9 ', '3.9 ', '3.9 ', '3.9 ', '3.9 ', '3.9 ', '3.9 ', '3.9 ', '3.9 ', '3.9 ', '3.9 ', '3.9 ', '3.9 ', '3.9 ', '3.9 ', '

## **Fetching Production Country details**

In [None]:
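# Read the production country from the matching 'detail-infos' block
show_country=[]
for url in show_url:
    country='NA'  # reset for every page so a missing field never reuses the previous show's value
    try:
        content=requests.get(url,headers=headers)
        soup=BeautifulSoup(content.text,'html.parser')
        for x in soup.find_all('div',class_='detail-infos'):
            if x.find_all('h3')[0].text==' Production country ':
              country=x.find_all('div')[0].text
    except Exception:
      country='NA'  # the original assigned 'runtime' here by mistake
    show_country.append(country)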
print(show_country)

['India', 'United States', 'Japan', 'India', 'United States', 'United States', 'India', 'South Korea', 'India', 'India', 'India', 'Japan', 'United States', 'United States', 'Spain', 'United States', 'Japan', 'India', 'Turkey', 'United States', 'India', 'United States', 'Spain', 'India', 'United States', 'United States', 'India', 'India', 'Germany, United Kingdom', 'India', 'India', 'United States', 'United States', 'India', 'Thailand', 'United States', 'United States', 'India', 'United States', 'United States', 'United States', 'United States', 'United States', 'India', 'United States', 'United States', 'India', 'India', 'United States', 'United States', 'United States', 'United Kingdom', 'United Kingdom', 'Spain', 'Spain', 'United States', 'United States', 'United States', 'Japan', 'United States', 'United States', 'Japan', 'United States', 'United States', 'United States', 'United States', 'United States', 'United States', 'South Korea', 'United States', 'United Kingdom, United State

## **Fetching Streaming Service details**

In [None]:
# Collect the streaming service names listed on each TV show page
show_stream_service=[]
for url in show_url:  # iterate over show pages, not movie pages
    try:
        content=requests.get(url,headers=headers)
        soup=BeautifulSoup(content.text,'html.parser')
        names=[x['alt'] for x in soup.find_all('img',class_='offer__icon')]
    except Exception:
        names=['NA']  # a list, so join() does not split the string into 'N , A'
    show_stream_service.append(" , ".join(names))
print(show_stream_service)

['Amazon Prime Video , Amazon Prime Video , Netflix , Amazon Video , Bookmyshow', '', 'Hotstar', 'Netflix , Bookmyshow', 'Bookmyshow', 'Bookmyshow', 'Apple TV , Hotstar , Apple TV , Apple TV', 'Amazon Prime Video , Amazon Prime Video , Bookmyshow', 'Apple TV , Hotstar , Apple TV , Amazon Video , Apple TV', 'Netflix', '', 'Apple TV , Hotstar , Amazon Video , Apple TV , Apple TV', 'Apple TV , Jio Cinema , Amazon Video , Apple TV , Apple TV', 'Amazon Prime Video , Amazon Prime Video', 'Netflix', 'Netflix', 'Amazon Prime Video , Amazon Prime Video , Hotstar , Amazon Video', 'Zee5', '', 'Apple TV , Zee5 , Amazon Video , Apple TV , Apple TV', 'Netflix', 'Amazon Prime Video , Amazon Prime Video , Amazon Video', 'Amazon Prime Video , Amazon Prime Video , Sun Nxt , Amazon Video , Bookmyshow', 'Amazon Prime Video , Amazon Prime Video , aha', 'Zee5', 'Apple TV , Amazon Video , Apple TV , Apple TV', 'Apple TV , Zee5 , Amazon Video , Apple TV , Apple TV', '', 'Amazon Prime Video , Amazon Prime Vide

## **Fetching Duration Details**

In [None]:
# Fetch the runtime listed on each TV show page
show_runtime=[]
for url in show_url:
    runtime='NA'  # reset per URL so a missing runtime never reuses the previous value
    try:
        content=requests.get(url,headers=headers)
        soup=BeautifulSoup(content.text,'html.parser')
        for x in soup.find_all('div',class_='detail-infos'):
            heading=x.find('h3')
            if heading and heading.text.strip()=='Runtime':
                runtime=x.find_all('div')[0].text
    except Exception:
        runtime='NA'
    show_runtime.append(runtime)
print(show_runtime)

['50min', '1h 2min', '3min', '43min', '1h 1min', '58min', '35min', '58min', '24min', '22min', '42min', '25min', '59min', '25min', '49min', '51min', '26min', '57min', '49min', '49min', '28min', '43min', '50min', '1h 28min', '34min', '56min', '47min', '1h 16min', '44min', '30min', '43min', '45min', '47min', '45min', '43min', '1h 1min', '49min', '31min', '43min', '51min', '52min', '56min', '38min', '44min', '31min', '1h 28min', '26min', '46min', '50min', '59min', '43min', '58min', '56min', '46min', '24min', '48min', '45min', '46min', '23min', '24min', '1h 1min', '24min', '35min', '58min', '43min', '43min', '45min', '45min', '55min', '55min', '55min', '55min', '55min', '55min', '55min', '55min', '55min', '55min', '55min', '1h 2min', '1h 2min', '1h 2min', '1h 2min', '1h 2min', '1h 2min', '1h 2min', '1h 2min', '1h 2min', '1h 2min', '1h 2min', '1h 2min', '1h 2min', '1h 2min', '1h 2min', '1h 2min', '1h 2min', '1h 2min', '1h 2min', '1h 2min', '1h 2min']
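
The scraped runtimes are strings such as `'50min'` and `'1h 2min'`, which cannot be compared or averaged directly. A minimal sketch of converting them to minutes follows; `runtime_to_minutes` is a hypothetical helper that is not part of the original notebook, and it assumes only the two formats seen in the output above.

```python
import re
from typing import Optional

# Matches an optional "<n>h" part followed by an optional "<n>min" part.
_RUNTIME_RE = re.compile(r"^\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*min)?\s*$")

def runtime_to_minutes(raw: str) -> Optional[int]:
    """Convert '1h 2min' or '50min' to total minutes; return None otherwise."""
    match = _RUNTIME_RE.match(raw or "")
    if not match or (match.group(1) is None and match.group(2) is None):
        return None  # e.g. 'NA' or an empty string
    hours = int(match.group(1) or 0)
    minutes = int(match.group(2) or 0)
    return hours * 60 + minutes

samples = ["50min", "1h 2min", "3min", "NA"]
converted = [runtime_to_minutes(s) for s in samples]
print(converted)
```

Values that do not match either format come back as `None`, so they can be dropped or inspected before any analysis.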


## **Creating TV Show DataFrame**

In [None]:
# Write Your Code here
info={'show_url':show_url,
'show_name':show_titles,
'release_year':show_year,
'show_rating':show_rating,
'show_genre':show_genre,
'show_runtime':show_runtime,
'show_agerating':show_agerating,
'show_country':show_country,
'show_stream_service' :show_stream_service}

show_data=pd.DataFrame(info)
show_data.head(5)

Unnamed: 0,show_url,show_name,release_year,show_rating,show_genre,show_runtime,show_agerating,show_country,show_stream_service
0,https://www.justwatch.com/in/tv-show/mirzapur,Mirzapur (2018),2018,8.2 (72k)75%,"Action & Adventure, Drama, Crime, Mystery & Th...",50min,"[8.2 (72k)75%, 8.2 (72k)75%, 8.2 (72k)75%, ...",India,"Amazon Prime Video , Amazon Prime Video , Netf..."
1,https://www.justwatch.com/in/tv-show/house-of-...,House of the Dragon (2022),2022,8.2 (72k)75%,"Action & Adventure, Science-Fiction, Drama, Fa...",1h 2min,"[8.2 (72k)75%, 8.2 (72k)75%, 8.2 (72k)75%, ...",United States,
2,https://www.justwatch.com/in/tv-show/adams-swe...,Adam's Sweet Agony (2024),2024,8.2 (72k)75%,Animation,3min,"[8.2 (72k)75%, 8.2 (72k)75%, 8.2 (72k)75%, ...",Japan,Hotstar
3,https://www.justwatch.com/in/tv-show/gyaarah-g...,Gyaarah Gyaarah (2024),2024,8.2 (72k)75%,"Fantasy, Drama, Science-Fiction",43min,"[8.2 (72k)75%, 8.2 (72k)75%, 8.2 (72k)75%, ...",India,"Netflix , Bookmyshow"
4,https://www.justwatch.com/in/tv-show/the-boys,The Boys (2019),2019,8.2 (72k)75%,"Science-Fiction, Action & Adventure, Comedy, C...",1h 1min,"[8.2 (72k)75%, 8.2 (72k)75%, 8.2 (72k)75%, ...",United States,Bookmyshow


## **Task 2 :- Data Filtering & Analysis**

In [None]:
# Write Your Code here
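
The filtering step asks for titles released in the last 2 years with an IMDb rating of 7 or higher. One possible sketch is shown below. It assumes the rating has already been converted to a number and that only the release year is available, so "last 2 years" is approximated at year granularity. The function name, the `imdb_rating` column, and the demo rows are illustrative and not taken from the scraped data.

```python
from datetime import date
from typing import Optional

import pandas as pd

def filter_recent_high_rated(df: pd.DataFrame, year_col: str, rating_col: str,
                             years_back: int = 2, min_rating: float = 7.0,
                             today: Optional[date] = None) -> pd.DataFrame:
    """Keep rows released within `years_back` years and rated >= `min_rating`."""
    today = today or date.today()
    # Coerce to numbers; unparseable values become NaN and fail both comparisons.
    years = pd.to_numeric(df[year_col], errors="coerce")
    ratings = pd.to_numeric(df[rating_col], errors="coerce")
    mask = (years >= today.year - years_back) & (ratings >= min_rating)
    return df[mask].reset_index(drop=True)

# Made-up rows for illustration only.
demo = pd.DataFrame({
    "show_name": ["A", "B", "C", "D"],
    "release_year": ["2024", "2018", "2023", "2024"],
    "imdb_rating": [8.2, 8.0, 6.6, None],
})
# `today` is pinned here so the example gives the same result whenever it runs.
result = filter_recent_high_rated(demo, "release_year", "imdb_rating",
                                  today=date(2024, 9, 1))
print(result)
```

Leaving `today` unset uses the current date, which matches the "from the current date" wording of the task.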

## **Calculating Mean IMDb Ratings for both Movies and TV Shows**

In [None]:
# Write Your Code here
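
The rating column holds raw strings such as `'8.2  (72k)75%'`, so the leading number has to be extracted before a mean can be computed. The following is a sketch under that assumption; `mean_rating` is a hypothetical helper, and the demo values are copied from the rating strings printed earlier in the notebook.

```python
import pandas as pd

def mean_rating(raw_ratings: pd.Series) -> float:
    """Average the leading numeric part of each scraped rating string."""
    # Extract the first decimal number; rows without one become NaN.
    leading = raw_ratings.astype(str).str.extract(r"^\s*(\d+(?:\.\d+)?)", expand=False)
    numeric = pd.to_numeric(leading, errors="coerce")
    return float(numeric.mean())  # mean() skips NaN by default

demo = pd.Series(["8.2  (72k)75%", "6.9  (2k)", "3.9 ", "NA"])
average = mean_rating(demo)
print(round(average, 2))
```

The same helper could be applied to `data['movie_rating']` and `show_data['show_rating']`, assuming those columns hold strings in this format.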


## **Analyzing Top Genres**

In [None]:
# Write Your Code here
data['movie_genre'].value_counts().head(5)

Unnamed: 0_level_0,count
movie_genre,Unnamed: 1_level_1
"Comedy, Horror, Mystery & Thriller",8
"Romance, Comedy, Drama",6
"Drama, Romance",5
"Comedy, Drama, Romance",5
Drama,5


In [None]:

show_data['show_genre'].value_counts().head(5)

Unnamed: 0_level_0,count
show_genre,Unnamed: 1_level_1
"Comedy, Drama, Action & Adventure",17
"Drama, Romance, Mystery & Thriller, Crime",12
"Drama, Mystery & Thriller",9
"Comedy, Drama",6
Drama,4
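
The counts above treat each comma-separated combination as one genre, so `"Drama, Romance"` and `"Drama"` are tallied separately. To rank individual genres, the strings can be split first. A minimal sketch, with `top_genres` as a hypothetical helper and made-up demo rows:

```python
import pandas as pd

def top_genres(genre_col: pd.Series, n: int = 5) -> pd.Series:
    """Count individual genres across comma-separated genre strings."""
    genres = (
        genre_col.dropna()
        .str.split(",")   # "Drama, Romance" -> ["Drama", " Romance"]
        .explode()        # one row per genre
        .str.strip()      # remove the padding left by the split
    )
    genres = genres[genres != ""]  # ignore empty entries
    return genres.value_counts().head(n)

demo = pd.Series([
    "Comedy, Drama, Romance",
    "Drama, Romance",
    "Drama",
    "Comedy, Horror, Mystery & Thriller",
])
counts = top_genres(demo)
print(counts)
```

Calling `top_genres(data['movie_genre'])` or `top_genres(show_data['show_genre'])` would give per-genre totals instead of per-combination totals.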


## **Finding Predominant Streaming Service**

In [None]:
# Write Your Code here
data['movie_stream_service'].value_counts().head(5)

Unnamed: 0_level_0,count
movie_stream_service,Unnamed: 1_level_1
,17
Netflix,12
Bookmyshow,5
"Amazon Prime Video , Amazon Prime Video",5
Hotstar,4


In [None]:

show_data['show_stream_service'].value_counts().head(5)

Unnamed: 0_level_0,count
show_stream_service,Unnamed: 1_level_1
,39
Netflix,10
"Amazon Prime Video , Amazon Prime Video",5
Zee5,3
"Apple TV , Amazon Video , Apple TV , Apple TV",3

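The counts above rank whole strings, so the empty string comes first and repeated names such as `"Amazon Prime Video , Amazon Prime Video"` form their own category. A sketch that counts each service once per title is given below; `service_counts` is a hypothetical helper, the demo rows are made up, and it assumes no service name contains a comma.

```python
import pandas as pd

def service_counts(service_col: pd.Series) -> pd.Series:
    """Count how many titles each streaming service offers."""
    # Split each row, trim padding, and de-duplicate names within a title.
    per_title = service_col.fillna("").map(
        lambda s: sorted({name.strip() for name in s.split(",") if name.strip()})
    )
    # Titles with no service yield an empty list, which explode() turns into NaN.
    return per_title.explode().dropna().value_counts()

demo = pd.Series([
    "Amazon Prime Video , Amazon Prime Video , Netflix",
    "",
    "Netflix",
    "Apple TV , Hotstar , Apple TV",
])
counts = service_counts(demo)
print(counts)
print("Most offerings:", counts.idxmax())
```

`counts.idxmax()` then names the service with the most offerings, which is what the task asks for.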

## **Task 3 :- Data Export**

In [None]:
# Save the final movie and TV show DataFrames as CSV files
data.to_csv('movies_data.csv',index=False)
show_data.to_csv('show_data.csv',index=False)

# **Dataset Drive Link (View Access with Anyone) -**

In [None]:
Movies_data - https://drive.google.com/file/d/1UMe80SI9Tz0nOLj8BlrJJEid_lxzN68p/view?usp=sharing

Show_data - https://drive.google.com/file/d/1IRYTrbWToY05Mje1dBKJu4xC9zgeYxLU/view?usp=sharing

# ***Congratulations!!! You have completed your Assignment.***