# **Web Scraping & Data Handling Challenge**



### **Website:**
JustWatch -  https://www.justwatch.com/in/movies?release_year_from=2000


### **Description:**

JustWatch is a popular platform that allows users to search for movies and TV shows across multiple streaming services like Netflix, Amazon Prime, Hulu, etc. For this assignment, you will be required to scrape movie and TV show data from JustWatch using Selenium, Python, and BeautifulSoup. Extract data from HTML, not by directly calling their APIs. Then, perform data filtering and analysis using Pandas, and finally, save the results to a CSV file.

### **Tasks:**

**1. Web Scraping:**

Use BeautifulSoup to scrape the following data from JustWatch:

   **a. Movie Information:**

      - Movie title
      - Release year
      - Genre
      - IMDb rating
      - Streaming services available (Netflix, Amazon Prime, Hulu, etc.)
      - URL to the movie page on JustWatch

   **b. TV Show Information:**

      - TV show title
      - Release year
      - Genre
      - IMDb rating
      - Streaming services available (Netflix, Amazon Prime, Hulu, etc.)
      - URL to the TV show page on JustWatch

  **c. Scope:**

```
   - Scrape data for at least 50 movies and 50 TV shows.
   - You can choose the entry point (e.g., starting with popular movies
     or a specific genre) to ensure a diverse dataset.
```


**2. Data Filtering & Analysis:**

   After scraping the data, use Pandas to perform the following tasks:

   **a. Filter movies and TV shows based on specific criteria:**

   ```
      - Only include movies and TV shows released in the last 2 years (from the current date).
      - Only include movies and TV shows with an IMDb rating of 7 or higher.
```

   **b. Data Analysis:**

   ```
      - Calculate the average IMDb rating for the scraped movies and TV shows.
      - Identify the top 5 genres that have the highest number of available movies and TV shows.
      - Determine the streaming service with the largest number of offerings.
      
   ```   
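
The filtering and analysis steps above can be sketched in Pandas as follows. This is a minimal illustration, not the required solution: the sample rows, the column names (`release_year`, `imdb_rating`, `genre`, `streaming_service`) and the helper `filter_recent_high_rated` are all hypothetical stand-ins for whatever the scraper produces.

```python
import pandas as pd

# Hypothetical sample rows standing in for the scraped data.
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "release_year": ["2024", "2019", "2023", "not found"],
    "imdb_rating": ["8.1", "7.5", "7.0", "7.9"],
    "genre": ["Drama, History", "Drama", "Comedy, Drama", "Comedy"],
    "streaming_service": ["Netflix", "Hotstar", "Netflix", "Hotstar"],
})


def filter_recent_high_rated(df, current_year, years_back=2, min_rating=7.0):
    """Keep rows released within `years_back` years and rated >= `min_rating`."""
    out = df.copy()
    # Scraped values are strings; placeholders such as "not found" become NaN.
    out["release_year"] = pd.to_numeric(out["release_year"], errors="coerce")
    out["imdb_rating"] = pd.to_numeric(out["imdb_rating"], errors="coerce")
    keep = (out["release_year"] >= current_year - years_back) & (
        out["imdb_rating"] >= min_rating
    )
    return out[keep].reset_index(drop=True)


filtered = filter_recent_high_rated(sample, current_year=2024)

# Average rating of the rows that survived the filter.
average_rating = filtered["imdb_rating"].mean()

# One row per (title, genre) pair, then count how often each genre occurs.
genre_counts = filtered["genre"].str.split(",").explode().str.strip().value_counts()
top_genres = genre_counts.head(5)

# Streaming service that appears most often.
top_service = filtered["streaming_service"].value_counts().idxmax()
```

Converting with `errors="coerce"` means rows with missing values drop out of the comparison instead of raising an error. In a real run, `current_year` could be taken from `datetime.date.today().year`.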

**3. Data Export:**

```
   - Dump the filtered and analysed data into a CSV file for further processing and reporting.

   - Keep the CSV file in your Drive folder and share the Drive link in the Colab notebook, with view access set to anyone with the link.
```

**Submission:**
```
- Submit a link to your Colab made for the assignment.

- The Colab should contain your Python script (.py format only) with clear
  comments explaining the scraping, filtering, and analysis process.

- Your code should have no errors and should run end to end in one go.

- Include the Drive link to your dataset in the notebook, just before the conclusion.
```



**Note:**

1. Properly handle errors and exceptions during web scraping to ensure a robust script.

2. Make sure your code is well-structured, easy to understand, and follows Python best practices.

3. The assignment will be evaluated based on the correctness of the scraped data, accuracy of data filtering and analysis, and the overall quality of the Python code.








# **Start The Project**

## **Task 1: Web Scraping**

In [None]:
# Install the required libraries
!pip install bs4
!pip install requests

Collecting bs4
  Downloading bs4-0.0.2-py2.py3-none-any.whl (1.2 kB)
Installing collected packages: bs4
Successfully installed bs4-0.0.2


In [None]:
# Import the required libraries
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import numpy as np
import time

## **Scraping Movie Data**

In [None]:
# Specifying the URL from which movies related data will be fetched
url='https://www.justwatch.com/in/movies?release_year_from=2000'

response = requests.get(url)

soup =  BeautifulSoup(response.text, "html.parser")

print(soup.prettify())

<!DOCTYPE html>
<html data-vue-meta="%7B%22dir%22:%7B%22ssr%22:%22ltr%22%7D,%22lang%22:%7B%22ssr%22:%22en%22%7D%7D" data-vue-meta-server-rendered="" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta charset="utf-8" data-vue-meta="ssr"/>
  <meta content="IE=edge" data-vue-meta="ssr" httpequiv="X-UA-Compatible"/>
  <meta content="viewport-fit=cover, width=device-width, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0, user-scalable=no" data-vue-meta="ssr" name="viewport"/>
  <meta content="JustWatch" data-vue-meta="ssr" property="og:site_name"/>
  <meta content="794243977319785" data-vue-meta="ssr" property="fb:app_id"/>
  <meta content="/appassets/img/JustWatch_logo_with_claim.png" data-vmid="og:image" data-vue-meta="ssr" property="og:image"/>
  <meta content="606" data-vmid="og:image:width" data-vue-meta="ssr" property="og:image:width"/>
  <meta content="302" data-vmid="og:image:height" data-vue-meta="ssr" pro

## **Fetching Movie URLs**

In [None]:
# Write Your Code here
list_of_link = []

main_link = r"https://www.justwatch.com"

anchor_tag = soup.find_all("a", class_ = "title-list-grid__item--link")

for tag in anchor_tag:
    href = tag.get("href")
    movie_link = main_link + href
    list_of_link.append(movie_link)

print(len(list_of_link))
print(list_of_link)

100
['https://www.justwatch.com/in/movie/hanu-man', 'https://www.justwatch.com/in/movie/untitled-shahid-kapoor-kriti-sanon-film', 'https://www.justwatch.com/in/movie/oppenheimer', 'https://www.justwatch.com/in/movie/fighter-2022', 'https://www.justwatch.com/in/movie/anatomie-dune-chute', 'https://www.justwatch.com/in/movie/bramayugam', 'https://www.justwatch.com/in/movie/poor-things', 'https://www.justwatch.com/in/movie/animal-2022', 'https://www.justwatch.com/in/movie/dune-2021', 'https://www.justwatch.com/in/movie/manjummel-boys', 'https://www.justwatch.com/in/movie/premalu', 'https://www.justwatch.com/in/movie/road-house-2024', 'https://www.justwatch.com/in/movie/12th-fail', 'https://www.justwatch.com/in/movie/anyone-but-you', 'https://www.justwatch.com/in/movie/murder-mubarak', 'https://www.justwatch.com/in/movie/the-crew-2024', 'https://www.justwatch.com/in/movie/dune-part-two', 'https://www.justwatch.com/in/movie/aattam', 'https://www.justwatch.com/in/movie/black-magic-2024', 'ht
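
The link-building loop above concatenates strings and fails if an anchor has no `href`. A slightly more defensive variant is sketched below, assuming the same `title-list-grid__item--link` class; the HTML snippet and the helper name `extract_links` are hypothetical and only serve to make the example runnable without a network call.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.justwatch.com"

# Hypothetical listing markup; the real page is much larger.
listing_html = """
<a class="title-list-grid__item--link" href="/in/movie/oppenheimer"></a>
<a class="title-list-grid__item--link" href="/in/movie/oppenheimer"></a>
<a class="title-list-grid__item--link"></a>
<a class="title-list-grid__item--link" href="/in/movie/dune-2021"></a>
"""


def extract_links(html, base_url=BASE_URL):
    """Return absolute, de-duplicated title links in page order."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for tag in soup.find_all("a", class_="title-list-grid__item--link"):
        href = tag.get("href")
        if not href:
            continue  # skip anchors without a target instead of raising TypeError
        full_link = urljoin(base_url, href)
        if full_link not in links:
            links.append(full_link)
    return links


movie_links = extract_links(listing_html)
```

`urljoin` handles both relative and already-absolute `href` values, so the function keeps working if the site changes how it writes its links.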

## **Scraping Movie Title**

In [None]:
list_of_movies_title = []
for link in list_of_link:
    try:
        response = requests.get(link)
        soup = BeautifulSoup(response.text, "html.parser")

        title_tag = soup.find("h1")
        title = title_tag.text.split("(")[0].strip()
        list_of_movies_title.append(title)
    except Exception as e:
        print(e)
    time.sleep(3)
print(len(list_of_movies_title))
print(list_of_movies_title)


100
['Hanu-Man', 'Teri Baaton Mein Aisa Uljha Jiya', 'Oppenheimer', 'Fighter', 'Anatomy of a Fall', 'Bramayugam', 'Poor Things', 'Animal', 'Dune', 'Manjummel Boys', 'Premalu', 'Road House', '12th Fail', 'Anyone But You', 'Murder Mubarak', 'Crew', 'Dune: Part Two', 'Aattam', 'Shaitaan', 'Kung Fu Panda', 'Article 370', 'Sam Bahadur', 'Merry Christmas', 'Godzilla x Kong: The New Empire', 'The Beekeeper', 'Madame Web', 'Laapataa Ladies', 'Salaar', 'Eagle', '365 Days', 'Kung Fu Panda 4', 'Anweshippin Kandethum', 'Abraham Ozler', 'The Kerala Story', 'Aquaman and the Lost Kingdom', 'Godzilla vs. Kong', 'Madgaon Express', 'Lover', 'Godzilla Minus One', 'Ferrari', 'Gaami', 'Dunki', 'Main Atal Hoon', 'Mission: Chapter 1', 'Joker', 'Migration', 'DJ Tillu', 'The Holdovers', 'The Goat Life', 'Damsel', "Harry Potter and the Philosopher's Stone", 'Operation Valentine', 'Chaari 111', 'She Said', 'Zara Hatke Zara Bachke', 'Yodha', 'The Gentlemen', 'Por', 'Jawan', 'Vadakkupatti Ramasamy', 'Red Eye', 'Ja
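
Each of the scraping cells in this notebook requests every detail page again, once per field. One way to avoid the repeated requests is to cache the HTML per URL, sketched below. The helper `make_cached_fetcher` and the stand-in `fake_fetch` are hypothetical; in the notebook the wrapped function would be something like `lambda url: requests.get(url).text`.

```python
def make_cached_fetcher(fetch):
    """Wrap `fetch(url) -> html` so each URL is downloaded at most once."""
    cache = {}

    def get(url):
        if url not in cache:
            cache[url] = fetch(url)
        return cache[url]

    return get


# Stand-in for a real HTTP call, so the sketch runs without network access.
calls = []


def fake_fetch(url):
    calls.append(url)
    return f"<h1>Page for {url}</h1>"


get_html = make_cached_fetcher(fake_fetch)
first = get_html("https://example.com/a")
second = get_html("https://example.com/a")  # served from the cache
```

With a cache in place, the title, year, genre and rating cells can all parse the same downloaded HTML, which also reduces the chance of being rate limited.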

## **Scraping Release Year**

In [None]:
list_of_release_year = []

for link in list_of_link:
    try:
        response = requests.get(link)
        soup = BeautifulSoup(response.text, "html.parser")
        year_tag = soup.find("h1")
        year = year_tag.text.split("(")[1].strip()
        list_of_release_year.append(year[:-1])
    except Exception as e:
        print(e)
    time.sleep(3)

print(len(list_of_release_year))
print(list_of_release_year)



100
['2024', '2024', '2023', '2024', '2023', '2024', '2023', '2023', '2021', '2024', '2024', '2024', '2023', '2023', '2024', '2024', '2024', '2024', '2024', '2008', '2024', '2023', '2024', '2024', '2024', '2024', '2024', '2023', '2024', '2020', '2024', '2024', '2024', '2023', '2023', '2021', '2024', '2024', '2023', '2023', '2024', '2023', '2024', '2024', '2019', '2023', '2022', '2023', '2024', '2024', '2001', '2024', '2024', '2022', '2023', '2024', '2020', '2024', '2023', '2024', '2005', '2024', '2011', '2023', '2013', '2024', '2024', '2024', '2013', '2022', '2024', '2014', '2024', '2024', '2016', '2023', '2023', '2019', '2017', '2023', '2021', '2023', '2014', '2023', '2024', '2022', '2023', '2024', '2018', '2024', '2023', '2021', '2023', '2024', '2019', '2016', '2024', '2018', '2014', '2020']


## **Scraping Genres**

In [None]:
# Write Your Code here
list_of_movies_generes = []
for link in list_of_link:
    try:
        response = requests.get(link)
        soup = BeautifulSoup(response.text, "html.parser")
        genere_heading = soup.find("h3", class_ = "detail-infos__subheading", string = "Genres")
        if genere_heading:
            genere = genere_heading.find_next("div", class_="detail-infos__value")
            list_of_movies_generes.append(genere.text)

        else:
            list_of_movies_generes.append("genere not found")


    except Exception as e:
        print(e)

print(len(list_of_movies_generes))
print(list_of_movies_generes)

100
['Comedy, Science-Fiction, Fantasy, Action & Adventure', 'Romance, Science-Fiction, Comedy, Drama', 'Drama, History', 'Mystery & Thriller, War & Military, Action & Adventure', 'Mystery & Thriller, Crime, Drama', 'Horror, Mystery & Thriller', 'Comedy, Science-Fiction, Drama, Romance', 'Crime, Drama, Action & Adventure, Mystery & Thriller', 'Science-Fiction, Action & Adventure, Drama', 'Mystery & Thriller', 'Romance, Comedy', 'Mystery & Thriller, Action & Adventure', 'Drama', 'Comedy, Romance', 'Comedy, Crime, Mystery & Thriller, Romance', 'Comedy, Drama', 'Action & Adventure, Science-Fiction, Drama', 'Drama', 'Horror, Mystery & Thriller', 'Action & Adventure, Animation, Comedy, Kids & Family, Fantasy', 'Action & Adventure, Drama, Mystery & Thriller', 'Drama, War & Military, History', 'Drama, Mystery & Thriller', 'Science-Fiction, Mystery & Thriller, Action & Adventure, Fantasy', 'Drama, Action & Adventure, Mystery & Thriller', 'Fantasy, Action & Adventure, Science-Fiction, Mystery &

## **Scraping IMDb Rating**

In [None]:
list_of_movies_rating = []
for link in list_of_link:
  try:
    response = requests.get(link)
    # On HTTP 429 (Too Many Requests), wait a fixed 5 seconds and retry once
    if response.status_code == 429:
        time.sleep(5)  # Adjust this delay if the site keeps throttling
        response = requests.get(link)  # Retry the request

    soup = BeautifulSoup(response.text, "html.parser")

    rating_tag = soup.find_all("div", class_="jw-scoring-listing__rating")

    # The second rating block is assumed to hold the IMDb score;
    # guard against pages that have fewer than two blocks
    if len(rating_tag) > 1:
      rating = rating_tag[1].text.split("(")[0]
      list_of_movies_rating.append(rating.strip())
    else:
      list_of_movies_rating.append("Rating Not Found")

  except Exception as e:
    print(e)
    # Keep the list aligned with list_of_link even when a request fails
    list_of_movies_rating.append("Rating Not Found")
  # time.sleep(4)

print(list_of_movies_rating)
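
The cell above retries once after a fixed delay when the server answers with HTTP 429. A retry with a growing delay can be sketched as below. The helper `get_with_backoff` and its parameters are hypothetical; `fetch` is any callable that returns an object with a `status_code` attribute, for example `requests.get`.

```python
import time


def get_with_backoff(fetch, url, max_retries=3, base_delay=5.0, sleep=time.sleep):
    """Call fetch(url); while the answer is HTTP 429, wait and retry.

    The wait doubles on every attempt: base_delay, 2 * base_delay, 4 * base_delay, ...
    At most `max_retries` retries are made; the last response is returned either way.
    """
    response = fetch(url)
    for attempt in range(max_retries):
        if response.status_code != 429:
            break
        sleep(base_delay * 2 ** attempt)
        response = fetch(url)
    return response
```

In the notebook this could be called as `get_with_backoff(requests.get, link)`. The `sleep` parameter exists only so the waiting can be replaced in tests.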

## **Scrapping Runtime/Duration**

In [None]:
# Write Your Code here
list_of_runtime = []

for link in list_of_link:
    try:
        response = requests.get(link)
        soup = BeautifulSoup(response.text, "html.parser")
        runtime_tag = soup.find("h3", class_= "detail-infos__subheading",string = "Runtime")
        if runtime_tag:
            runtime = runtime_tag.find_next("div", class_ ="detail-infos__value")
            list_of_runtime.append(runtime.text)
        else:
            list_of_runtime.append("ratings not found")
    except Exception as e:
        print(e)
print(len(list_of_runtime))
print(list_of_runtime)

## **Scrapping Age Rating**

In [None]:
# Write Your Code here
list_of_age_rating = []
for link in list_of_link:
    try:
        response = requests.get(link)
        soup =  BeautifulSoup(response.text, "html.parser")
        age_rating_tag = soup.find("h3", class_ = "detail-infos__subheading", string = "Age rating")

        if age_rating_tag:
            age_rating = age_rating_tag.find_next("div", class_ = "detail-infos__value")
            list_of_age_rating.append(age_rating.text)

        else:
            list_of_age_rating.append("age rating not found")
    except Exception as e:
        print(e)
print(len(list_of_age_rating))
print(list_of_age_rating)



## **Fetching Production Countries Details**

In [None]:
# Write Your Code here
list_of_production_details = []

for link in list_of_link:
    try:
        response = requests.get(link)
        soup = BeautifulSoup(response.text, "html.parser")
        production_country_tag = soup.find("h3", class_ = "detail-infos__subheading",string = " Production country ")

        if production_country_tag:
            production_country = production_country_tag.find_next("div", class_ = "detail-infos__value")
            list_of_production_details.append(production_country.text)
        else:
            list_of_production_details.append("country not details not found")
    except Exception as e:
        print(e)

print(len(list_of_production_details))
print(list_of_production_details)

## **Fetching Streaming Service Details**

In [None]:
# Write Your Code here
list_of_streaming_service_details = []

for link in list_of_link:
    try:
        response = requests.get(link)
        soup = BeautifulSoup(response.text, "html.parser")
        streaming_sevice_tag = soup.find("img", class_ = "offer__icon")

        if streaming_sevice_tag:
            streaming_sevice = streaming_sevice_tag.get("alt")
            list_of_streaming_service_details.append(streaming_sevice)
        else:
            list_of_streaming_service_details.append("streaming service not found")
    except  Exception as e:
        print(e)

print(len(list_of_streaming_service_details))
print(list_of_streaming_service_details)

100
['streaming service not found', 'streaming service not found', 'streaming service not found', 'streaming service not found', 'streaming service not found', 'streaming service not found', 'streaming service not found', 'Netflix', 'streaming service not found', 'streaming service not found', 'streaming service not found', 'streaming service not found', 'streaming service not found', 'streaming service not found', 'streaming service not found', 'streaming service not found', 'streaming service not found', 'streaming service not found', 'streaming service not found', 'streaming service not found', 'streaming service not found', 'streaming service not found', 'streaming service not found', 'streaming service not found', 'streaming service not found', 'streaming service not found', 'streaming service not found', 'Netflix', 'streaming service not found', 'streaming service not found', 'streaming service not found', 'streaming service not found', 'streaming service not found', 'streaming s
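
The cell above reads only the first `offer__icon` image, so a title offered on several services yields at most one name. A sketch that collects every distinct service is shown below, assuming the same `offer__icon` class and that the service name sits in the `alt` attribute, as in the cell above. The HTML snippet and the helper `extract_streaming_services` are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical offer markup, used only to make the sketch runnable.
offers_html = """
<div>
  <img class="offer__icon" alt="Netflix">
  <img class="offer__icon" alt="Amazon Prime Video">
  <img class="offer__icon" alt="Netflix">
  <img class="offer__icon">
</div>
"""


def extract_streaming_services(html):
    """Return the distinct service names found on a page, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    services = []
    for img in soup.find_all("img", class_="offer__icon"):
        name = (img.get("alt") or "").strip()
        if name and name not in services:
            services.append(name)
    return services


services = extract_streaming_services(offers_html)
```

The many "streaming service not found" entries in the output above may mean that the offers are not present in the HTML returned by `requests`, for example because they are rendered in the browser. That is an assumption; if it holds, a browser-driven fetch such as Selenium (mentioned in the assignment description) would be needed before this parsing step.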

## **Now Creating Movies DataFrame**

In [None]:
# Write Your Code here
movie_dict_df = {
    "Movie-title": list_of_movies_title,
    "Release-year": list_of_release_year,
    "Genre": list_of_movies_generes,
    "IMBD-rating": list_of_movies_rating,
    "Runtime": list_of_runtime,
    "Country": list_of_production_details,
    "streaming_sevice": list_of_streaming_service_details,
    "Age-rating": list_of_age_rating,
    "Links": list_of_link
}
movie_df = pd.DataFrame(movie_dict_df)


movie_df




Unnamed: 0,Movie-title,Release-year,Genre,IMBD-rating,Runtime,Country,streaming_sevice,Age-rating,Links
0,Hanu-Man,2024,"Comedy, Science-Fiction, Fantasy, Action & Adv...",8.0,ratings not found,country not details not found,streaming service not found,age rating not found,https://www.justwatch.com/in/movie/hanu-man
1,Teri Baaton Mein Aisa Uljha Jiya,2024,"Romance, Science-Fiction, Comedy, Drama",6.5,ratings not found,country not details not found,streaming service not found,age rating not found,https://www.justwatch.com/in/movie/untitled-sh...
2,Oppenheimer,2023,"Drama, History",8.3,ratings not found,country not details not found,streaming service not found,age rating not found,https://www.justwatch.com/in/movie/oppenheimer
3,Fighter,2024,"Mystery & Thriller, War & Military, Action & A...",6.4,ratings not found,country not details not found,streaming service not found,age rating not found,https://www.justwatch.com/in/movie/fighter-2022
4,Anatomy of a Fall,2023,"Mystery & Thriller, Crime, Drama",7.7,2h 32min,country not details not found,streaming service not found,age rating not found,https://www.justwatch.com/in/movie/anatomie-du...
...,...,...,...,...,...,...,...,...,...
95,La La Land,2016,genere not found,8.0,ratings not found,country not details not found,streaming service not found,age rating not found,https://www.justwatch.com/in/movie/la-la-land
96,Family Star,2024,genere not found,5.4,ratings not found,country not details not found,streaming service not found,age rating not found,https://www.justwatch.com/in/movie/family-star
97,Tomb Raider,2018,genere not found,6.3,ratings not found,country not details not found,streaming service not found,age rating not found,https://www.justwatch.com/in/movie/tomb-raider
98,John Wick,2014,genere not found,7.4,ratings not found,country not details not found,streaming service not found,age rating not found,https://www.justwatch.com/in/movie/john-wick
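
Building the DataFrame from separate lists only works if every list has exactly one entry per link. Several cells above append nothing when an exception occurs, which can shorten a list or shift its values against the other columns. One way around this is to build one record per link, sketched below. The helpers `build_rows` and `fake_parse` are hypothetical; `parse_page` stands for any function that turns a link into a dict of fields.

```python
import pandas as pd


def build_rows(links, parse_page):
    """Return one dict per link; failed pages still produce a row."""
    rows = []
    for link in links:
        try:
            record = parse_page(link)
        except Exception as exc:
            # Record the failure instead of silently dropping the row.
            record = {"error": str(exc)}
        record["link"] = link
        rows.append(record)
    return rows


# Stand-in parser so the sketch runs without network access.
def fake_parse(link):
    if link.endswith("broken"):
        raise ValueError("page could not be parsed")
    return {"title": link.rsplit("/", 1)[-1], "year": "2024"}


links = [
    "https://example.com/a",
    "https://example.com/broken",
    "https://example.com/c",
]
df = pd.DataFrame(build_rows(links, fake_parse))
```

Missing fields show up as `NaN` in the affected row only, so the remaining rows stay aligned with their links.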


## **Scraping TV Show Data**

In [None]:
tv_url = r"https://www.justwatch.com/in/tv-shows?release_year_from=2000"

response = requests.get(tv_url, headers = {"User-Agent": "chrome/123.0.6312.106"})  # the HTTP header name is "User-Agent"
soup = BeautifulSoup(response.text, "html.parser")

print(soup.prettify())

<!DOCTYPE html>
<html data-vue-meta="%7B%22dir%22:%7B%22ssr%22:%22ltr%22%7D,%22lang%22:%7B%22ssr%22:%22en%22%7D%7D" data-vue-meta-server-rendered="" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta charset="utf-8" data-vue-meta="ssr"/>
  <meta content="IE=edge" data-vue-meta="ssr" httpequiv="X-UA-Compatible"/>
  <meta content="viewport-fit=cover, width=device-width, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0, user-scalable=no" data-vue-meta="ssr" name="viewport"/>
  <meta content="JustWatch" data-vue-meta="ssr" property="og:site_name"/>
  <meta content="794243977319785" data-vue-meta="ssr" property="fb:app_id"/>
  <meta content="/appassets/img/JustWatch_logo_with_claim.png" data-vmid="og:image" data-vue-meta="ssr" property="og:image"/>
  <meta content="606" data-vmid="og:image:width" data-vue-meta="ssr" property="og:image:width"/>
  <meta content="302" data-vmid="og:image:height" data-vue-meta="ssr" pro

## **Fetching TV Show URLs**

In [None]:
list_of_tvshow_url = []

main_link = r'https://www.justwatch.com'

a_tag = soup.find_all("a", class_= "title-list-grid__item--link")

for tag in a_tag:
    href = tag.get("href")
    full_link = main_link + href
    list_of_tvshow_url.append(full_link)

print(len(list_of_tvshow_url))
print(list_of_tvshow_url)


100
['https://www.justwatch.com/in/tv-show/shogun-2024', 'https://www.justwatch.com/in/tv-show/mirzapur', 'https://www.justwatch.com/in/tv-show/3-body-problem', 'https://www.justwatch.com/in/tv-show/panchayat', 'https://www.justwatch.com/in/tv-show/game-of-thrones', 'https://www.justwatch.com/in/tv-show/the-gentlemen', 'https://www.justwatch.com/in/tv-show/sunflower-2021', 'https://www.justwatch.com/in/tv-show/solo-leveling-2024', 'https://www.justwatch.com/in/tv-show/maharani-2021', 'https://www.justwatch.com/in/tv-show/maamla-legal-hai', 'https://www.justwatch.com/in/tv-show/attack-on-titan', 'https://www.justwatch.com/in/tv-show/apharan', 'https://www.justwatch.com/in/tv-show/inspector-rishi', 'https://www.justwatch.com/in/tv-show/invincible', 'https://www.justwatch.com/in/tv-show/jujutsu-kaisen', 'https://www.justwatch.com/in/tv-show/halo', 'https://www.justwatch.com/in/tv-show/save-the-tigers', 'https://www.justwatch.com/in/tv-show/farzi', 'https://www.justwatch.com/in/tv-show/rip

## **Fetching TV Show Titles**

In [None]:


list_of_tv_show_title = []

for link in list_of_tvshow_url:
    try:
        response = requests.get(link)
        soup = BeautifulSoup(response.text, "html.parser")
        title_tvshow_tag = soup.find("h1")
        if title_tvshow_tag:
            title_text = title_tvshow_tag.text.split("(")[0].strip()
            list_of_tv_show_title.append(title_text)
        else:
            print(f"No title found for URL: {link}")
    except Exception as e:
        print(f"Error scraping URL {link}: {e}")

    time.sleep(3)

print(len(list_of_tv_show_title))
print(list_of_tv_show_title)


100
['Shōgun', 'Mirzapur', '3 Body Problem', 'Panchayat', 'Game of Thrones', 'The Gentlemen', 'Sunflower', 'Solo Leveling', 'Maharani', 'Maamla Legal Hai', 'Attack on Titan', 'Apharan', 'Inspector Rishi', 'Invincible', 'Jujutsu Kaisen', 'Halo', 'Saving the Tigers', 'Farzi', 'Ripley', 'Young Sheldon', 'Mastram', 'Avatar: The Last Airbender', 'Queen of Tears', 'A Gentleman in Moscow', 'True Detective', 'The Great Indian Kapil Show', 'Lootere', 'Loki', 'Reacher', 'Scam 1992', 'Parasyte: The Grey', 'The Family Man', 'Naruto', 'Naruto Shippūden', 'Supersex', 'Gandii Baat', 'Yellowstone', 'Turning Point: The Bomb and the Cold War', '9-1-1', 'Money Heist', 'Euphoria', 'Fallout', 'House of the Dragon', 'Gullak', 'The Rookie', 'Under the Dome', "X-Men '97", 'Breaking Bad', 'Peaky Blinders', 'Lucifer', 'The Vampire Diaries', 'Aashram', 'Dehati Ladke', 'Dark Desire', 'Testament: The Story of Moses', 'Modern Family', "Grey's Anatomy", 'The Good Doctor', 'Asur: Welcome to Your Dark Side', 'Young Ro

## **Fetching Release Year**

In [None]:
link = "https://www.justwatch.com/in/tv-show/shogun-2024"
response = requests.get(link)
soup = BeautifulSoup(response.text, "html.parser")
release_year_tag = soup.find("h1")
release_year = release_year_tag.text.split()[1].strip("()")
print(release_year)



2024


In [None]:
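# First attempt: split()[1] returns the second whitespace-separated word of the
# heading, which is the year only when the title is a single word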
list_of_tv_show_relase_year = []

for link in list_of_tvshow_url:
    try:
        response = requests.get(link)
        soup = BeautifulSoup(response.text, "html.parser")
        release_year_tag = soup.find("h1")
        if release_year_tag:
            release_year = release_year_tag.text.split()[1].strip("()").strip()
            list_of_tv_show_relase_year.append(release_year)
        else:
            list_of_tv_show_relase_year.append("release year not found")


    except Exception as e:
        print(e)

print(len(list_of_tv_show_relase_year))
print(list_of_tv_show_relase_year)



100
['2024', '2018', 'Body', '2020', 'of', 'Gentlemen', '2021', 'Leveling', '2021', 'Legal', 'on', '2018', 'Rishi', '2021', 'Kaisen', '2022', 'the', '2023', '2024', 'Sheldon', '2020', 'The', 'of', 'Gentleman', 'Detective', 'Great', '2024', '2021', '2022', '1992', 'The', 'Family', '2002', 'Shippūden', '2024', 'Baat', '2018', 'Point:', '2018', 'Heist', '2019', '2024', 'of', '2019', 'Rookie', 'the', "'97", 'Bad', 'Blinders', '2016', 'Vampire', 'release year not found', 'Ladke', 'Desire', 'The', 'Family', 'Anatomy', 'Good', 'Welcome', 'Royals', 'Freelancer', '2020', '2024', 'Flash', 'Things', 'Last', 'MAGIC', '2010', 'of', '2014', '2024', '2023', 'Man', 'Slayer:', 'Legend', 'Day', 'Legacy', 'Boys', '2021', 'Railway', 'The', 'Police', 'Vice', '2022', 'Serpent', '2022', '2018', 'Girls', 'to', '2022', 'of', 'of', 'World', 'Call', '2023', 'Boss', '2006', 'Bear', 'Break', 'Horror']


In [None]:
# Write Your Code here
list_of_tv_show_release_year = []
for link in list_of_tvshow_url :
  try:
    response = requests.get(link)
    # Check if the response status code is 429 (Too Many Requests)
    if response.status_code == 429:
        # On a 429 response, wait a fixed 5 seconds and retry once
        time.sleep(5)  # Wait for 5 seconds (you can adjust this value)
        response = requests.get(link)  # Retry the request

    soup = BeautifulSoup(response.text, "html.parser")

    release_year_heading = soup.find("h1")

    if release_year_heading:
      release_year = release_year_heading.text.split("(")[1].strip()[:-1]
    else:
      release_year = "release_year not found"

    list_of_tv_show_release_year.append(release_year)

  except Exception as e:
    print(e)

  # time.sleep(4)
print(len(list_of_tv_show_release_year))
print(list_of_tv_show_release_year)

100
['2024', '2018', '2024', '2020', '2011', '2024', '2021', '2024', '2021', '2024', '2013', '2018', '2024', '2021', '2020', '2022', '2023', '2023', '2024', '2017', '2020', '2024', '2024', '2024', '2014', '2024', '2024', '2021', '2022', '2020', '2024', '2019', '2002', '2007', '2024', '2018', '2018', '2024', '2018', '2017', '2019', '2024', '2022', '2019', '2018', '2013', '2023', '2008', '2013', '2016', '2009', '2020', '2023', '2020', '2024', '2009', '2005', '2017', '2020', '2021', '2023', '2020', '2024', '2014', '2016', '2023', '2023', '2010', '2011', '2014', '2024', '2023', '2015', '2019', '2021', '2024', '2023', '2019', '2021', '2023', '2005', '2024', '2022', '2022', '2021', '2022', '2018', 'BGDC', '2009', '2022', '2024', '2024', '2024', '2015', '2023', '2006', '2006', '2022', '2005', '2011']
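
Splitting on the first `(` still returns a non-year value when the title itself contains parentheses, as the `'BGDC'` entry in the output above suggests. A regular expression anchored at the end of the heading is one way to handle this. The sketch below assumes headings of the form `Title (YYYY)`; the helper `split_title_and_year` and the sample headings in the usage lines are hypothetical.

```python
import re

# Title, then a four-digit year in parentheses at the very end of the text.
HEADING_PATTERN = re.compile(r"^(?P<title>.*?)\s*\((?P<year>\d{4})\)\s*$")


def split_title_and_year(heading_text):
    """Return (title, year) from 'Title (YYYY)'; year is None if absent."""
    text = heading_text.strip()
    match = HEADING_PATTERN.match(text)
    if match:
        return match.group("title"), match.group("year")
    return text, None


simple = split_title_and_year("3 Body Problem (2024)")
nested = split_title_and_year("Some Show (XYZ) (2022)")
missing = split_title_and_year("No Year Here")
```

Because one call returns both values, the title and the release year can be taken from the same `h1` text in a single pass instead of two separate loops.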


## **Fetching TV Show Genre Details**

In [None]:
# Write Your Code here
link = "https://www.justwatch.com/in/tv-show/shogun-2024"
response = requests.get(link)
soup = BeautifulSoup(response.text, "html.parser")

genere_tag = soup.find("h3", class_ = "detail-infos__subheading", string = "Genres")
genere = genere_tag.find_next("div", class_ = "detail-infos__value")
print(genere.text)

Drama, War & Military, History


In [None]:
list_of_tvshow_genere = []

for link in list_of_tvshow_url:
    try:
        response = requests.get(link)
        soup = BeautifulSoup(response.text, "html.parser")
        genere_tag = soup.find("h3", class_ = "detail-infos__subheading", string = "Genres")

        if genere_tag:
            genere = genere_tag.find_next("div", class_ = "detail-infos__value").text
            list_of_tvshow_genere.append(genere)
        else:
            list_of_tvshow_genere.append("genere not found")

    except Exception as e:
        print(e)

print(len(list_of_tvshow_genere))
print(list_of_tvshow_genere)



100
['Drama, War & Military, History', 'Crime, Action & Adventure, Drama, Mystery & Thriller', 'Science-Fiction, Mystery & Thriller, Drama, Fantasy', 'Comedy, Drama', 'Drama, Fantasy, Action & Adventure, Science-Fiction', 'Action & Adventure, Comedy, Crime, Drama', 'Comedy, Crime', 'Animation, Action & Adventure, Fantasy, Science-Fiction', 'Drama', 'Drama, Comedy', 'Drama, Fantasy, Horror, Animation, Action & Adventure, Science-Fiction', 'Drama, Action & Adventure, Crime, Mystery & Thriller', 'Horror, Action & Adventure, Drama, Mystery & Thriller', 'Drama, Animation, Science-Fiction, Action & Adventure, Fantasy, Mystery & Thriller', 'Fantasy, Mystery & Thriller, Animation, Action & Adventure, Science-Fiction', 'Action & Adventure, Science-Fiction, Mystery & Thriller, War & Military', 'Comedy, Drama, Mystery & Thriller', 'Crime, Drama, Mystery & Thriller', 'Crime, Drama, Mystery & Thriller', 'Comedy, Kids & Family', 'Drama, Comedy, Fantasy', 'Science-Fiction, Action & Adventure, Comedy,

## **Fetching IMDb Rating Details**

In [None]:
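link = "https://www.justwatch.com/in/tv-show/shogun-2024"
response = requests.get(link)
# Parse the fetched page; html_content is not defined anywhere in this notebook
soup = BeautifulSoup(response.text, 'html.parser')

# Same selector as in the movie section; the second block is assumed to hold the IMDb score
rating_tags = soup.find_all("div", class_="jw-scoring-listing__rating")
if len(rating_tags) > 1:
    print(rating_tags[1].text.split("(")[0].strip())
else:
    print("rating not found")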

58k


In [None]:
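list_of_tvshow_imdb_rating = []

for link in list_of_tvshow_url:
    try:
        response = requests.get(link)
        # Parse each page's own HTML; parsing one fixed snippet returns the same value for every show
        soup = BeautifulSoup(response.text, 'html.parser')
        # Same selector as in the movie section; the second block is assumed to hold the IMDb score
        rating_tags = soup.find_all("div", class_="jw-scoring-listing__rating")
        if len(rating_tags) > 1:
            list_of_tvshow_imdb_rating.append(rating_tags[1].text.split("(")[0].strip())
        else:
            list_of_tvshow_imdb_rating.append("rating not found")
    except Exception as e:
        print(e)
        list_of_tvshow_imdb_rating.append("rating not found")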

print(len(list_of_tvshow_imdb_rating))
print(list_of_tvshow_imdb_rating)

100
['58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k', '58k']
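
The `58k` value above looks like a vote count rather than a score, which suggests that the rating text has the form `score (votes)`. Under that assumption, both parts can be pulled out with one regular expression, as sketched below. The helper `parse_rating` and the sample strings are hypothetical.

```python
import re

# A number such as 8 or 8.7, followed by a parenthesised label such as (58k).
RATING_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*\(([^)]*)\)")


def parse_rating(text):
    """Return (score, votes_label) from text like '8.7 (58k)', else (None, None)."""
    match = RATING_PATTERN.search(text)
    if not match:
        return None, None
    return float(match.group(1)), match.group(2).strip()


score, votes = parse_rating("8.7 (58k)")
no_score, no_votes = parse_rating("n/a")
```

Returning the score as a `float` means the later filter on ratings of 7 or higher needs no extra conversion.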


## **Fetching Age Rating Details**

In [None]:
# Write Your Code here
link = r"https://www.justwatch.com/in/tv-show/game-of-thrones"
response = requests.get(link)
soup = BeautifulSoup(response.text, 'html.parser')

age_rating_tag = soup.find("h3", class_ = "detail-infos__subheading" ,string = "Age rating")
age_rating = age_rating_tag.find_next("div", class_ = "detail-infos__value")

print(age_rating.text)


A


In [None]:
import requests
from bs4 import BeautifulSoup

list_of_tvshow_age_rating = []

for link in list_of_tvshow_url:
    try:
        response = requests.get(link)
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            age_rating_tag = soup.find("h3", class_="detail-infos__subheading", string="Age rating")
            if age_rating_tag:
                age_rating = age_rating_tag.find_next("div", class_="detail-infos__value")
                if age_rating:
                    list_of_tvshow_age_rating.append(age_rating.text.strip())
                else:
                    list_of_tvshow_age_rating.append("Age rating not found")
            else:
                list_of_tvshow_age_rating.append("Age rating tag not found")
        else:
            list_of_tvshow_age_rating.append("Failed to fetch page")
    except requests.RequestException as e:
        print("Request failed:", e)
        list_of_tvshow_age_rating.append("Request failed")
    except Exception as e:
        print("An error occurred:", e)
        list_of_tvshow_age_rating.append("An error occurred")

print(list_of_tvshow_age_rating)



['Age rating tag not found', 'Age rating tag not found', 'A', 'Age rating tag not found', 'A', 'A', 'A', 'Age rating tag not found', 'UA', 'Age rating tag not found', 'UA', 'Age rating tag not found', 'A', 'Age rating tag not found', 'Age rating tag not found', 'Age rating tag not found', 'Age rating tag not found', 'Age rating tag not found', 'Age rating tag not found', 'U', 'Age rating tag not found', 'Age rating tag not found', 'Age rating tag not found', 'Age rating tag not found', 'U', 'Age rating tag not found', 'Age rating tag not found', 'Age rating tag not found', 'A', 'Age rating tag not found', 'Age rating tag not found', 'Age rating tag not found', 'Age rating tag not found', 'U', 'Age rating tag not found', 'A', 'Age rating tag not found', 'Age rating tag not found', 'Age rating tag not found', 'Age rating tag not found', 'A', 'A', 'A', 'Age rating tag not found', 'Age rating tag not found', 'Age rating tag not found', 'Age rating tag not found', 'U', 'A', 'U', 'U', 'Age r

## **Fetching Production Country details**

In [None]:
# Write Your Code here
link = "https://www.justwatch.com/in/tv-show/shogun-2024"
response = requests.get(link)
soup = BeautifulSoup(response.text, 'html.parser')
production_country_tag = soup.find("h3", class_ ="detail-infos__subheading", string = " Production country ")
production_country = production_country_tag.find_next("div", class_ = "detail-infos__value")

print(production_country.text)

United States


In [None]:
list_of_tvshow_production_country_details = []
for link in list_of_tvshow_url:
    response = requests.get(link)
    soup = BeautifulSoup(response.text, 'html.parser')
    production_country_tag = soup.find("h3", class_ ="detail-infos__subheading", string = " Production country ")
    if production_country_tag:
        production_country = production_country_tag.find_next("div", class_ = "detail-infos__value")
        if production_country:
            list_of_tvshow_production_country_details.append(production_country.text)
        else:
            list_of_tvshow_production_country_details.append("Unknown")
    else:
        list_of_tvshow_production_country_details.append("Not Found")

print(list_of_tvshow_production_country_details)


['United States', 'India', 'United States', 'India', 'United States, United Kingdom', 'United Kingdom, United States', 'India', 'Japan, South Korea', 'India', 'India', 'Japan', 'India', 'India', 'United States', 'United States, Japan', 'United States', 'India', 'India', 'United States', 'United States', 'India', 'United States', 'South Korea', 'United Kingdom', 'United States', 'India', 'India', 'United States', 'United States', 'India', 'South Korea', 'India', 'Japan', 'Japan', 'Italy', 'India', 'United States', 'United States', 'United States', 'Spain', 'United States', 'United States', 'United States', 'India', 'United States', 'United States', 'United States', 'United States', 'United Kingdom', 'Not Found', 'Not Found', 'India', 'India', 'Not Found', 'United States', 'Not Found', 'United States', 'Not Found', 'Not Found', 'Sweden', 'Not Found', 'Not Found', 'Not Found', 'United States', 'Not Found', 'Not Found', 'Japan', 'Not Found', 'Not Found', 'United States', 'Not Found', 'Not 

## **Fetching Streaming Service details**

In [None]:
# Write Your Code here
list_of_streaming_service_details = []

for link in list_of_tvshow_url:
    try:
        response = requests.get(link)
        soup = BeautifulSoup(response.text, 'html.parser')
        # The alt text of the first offer icon holds the streaming service name
        streaming_service_tag = soup.find("img", class_="offer__icon")
        if streaming_service_tag:
            streaming_service = streaming_service_tag.get("alt", "").strip()
            list_of_streaming_service_details.append(streaming_service)
        else:
            list_of_streaming_service_details.append("streaming service not found")

    except Exception as e:
        print(e)
        # Append a placeholder so this list stays aligned with list_of_tvshow_url
        list_of_streaming_service_details.append("Request failed")

print(list_of_streaming_service_details)





['Hotstar', 'Amazon Prime Video', 'Netflix', 'Amazon Prime Video', 'Jio Cinema', 'Netflix', 'VI movies and tv', 'Crunchyroll', 'Sony Liv', 'Netflix', 'Amazon Prime Video', 'Jio Cinema', 'Amazon Prime Video', 'Amazon Prime Video', 'Crunchyroll', 'streaming service not found', 'Hotstar', 'Amazon Prime Video', 'Netflix', 'Amazon Prime Video', 'streaming service not found', 'Netflix', 'streaming service not found', 'Amazon Prime Video', 'streaming service not found', 'streaming service not found', 'Hotstar', 'streaming service not found', 'streaming service not found', 'streaming service not found', 'Netflix', 'streaming service not found', 'streaming service not found', 'Crunchyroll', 'streaming service not found', 'Alt Balaji', 'streaming service not found', 'Netflix', 'streaming service not found', 'streaming service not found', 'Jio Cinema', 'streaming service not found', 'Jio Cinema', 'streaming service not found', 'streaming service not found', 'streaming service not found', 'streami

## **Fetching Duration Details**

In [None]:
link = "https://www.justwatch.com/in/tv-show/shogun-2024"
response = requests.get(link)
soup = BeautifulSoup(response.text, 'html.parser')
duration_tag = soup.find("h3", class_ = "detail-infos__subheading", string = "Runtime")
duration = duration_tag.find_next("div", class_ = "detail-infos__value")
print(duration.text)

58min
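
The runtime is scraped as a display string, which cannot be averaged or sorted numerically. A minimal sketch of a converter, assuming every value follows the `58min` / `1h 27min` pattern seen in the scraped output; `runtime_to_minutes` is a hypothetical helper name, not part of the original notebook:

```python
import re

def runtime_to_minutes(text):
    """Convert a runtime string such as '1h 27min' or '58min' to minutes.

    Returns None when the text holds no hour or minute figure
    (for example a 'not found' placeholder).
    """
    match = re.fullmatch(r"\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*min)?\s*", text)
    if not match or not any(match.groups()):
        return None
    hours = int(match.group(1) or 0)
    minutes = int(match.group(2) or 0)
    return hours * 60 + minutes

print(runtime_to_minutes("58min"))
print(runtime_to_minutes("1h 27min"))
print(runtime_to_minutes("runtime Not found"))
```

Returning `None` for placeholders lets pandas treat those rows as missing values once the column is numeric.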


In [None]:
import time

list_of_tv_show_runtime = []
for link in list_of_tvshow_url:
    try:
        response = requests.get(link)
        # Retry once after a pause if the server answers 429 (Too Many Requests)
        if response.status_code == 429:
            time.sleep(5)
            response = requests.get(link)

        soup = BeautifulSoup(response.text, "html.parser")

        runtime_heading = soup.find("h3", class_="detail-infos__subheading", string="Runtime")
        if runtime_heading:
            runtime = runtime_heading.find_next("div", class_="detail-infos__value").text.strip()
        else:
            runtime = "runtime Not found"
        list_of_tv_show_runtime.append(runtime)

    except Exception as e:
        print(e)
        # Keep the list aligned with list_of_tvshow_url when a request fails
        list_of_tv_show_runtime.append("runtime Not found")
    # time.sleep(4)


print(list_of_tv_show_runtime)

['58min', '50min', '56min', '33min', '57min', '50min', '37min', '24min', '44min', '32min', '25min', '24min', '45min', '49min', '23min', '51min', '27min', '56min', '56min', '19min', '28min', '54min', '1h 27min', '49min', '1h 1min', '53min', '45min', '49min', '48min', '52min', '50min', '45min', '23min', '24min', '50min', '44min', '50min', '1h 8min', '44min', '50min', '58min', '59min', '1h 2min', '29min', '42min', '43min', '32min', '47min', '58min', '47min', '42min', '43min', '24min', '34min', '1h 25min', '21min', '48min', '43min', '47min', '46min', '50min', '46min', '50min', '42min', '1h 1min', '58min', '23min', '54min', '44min', '52min', '45min', '28min', '24min', '26min', '21min', '29min', '46min', '1h 1min', '39min', '59min', '24min', '38min', '57min', '53min', '57min', '40min', '1h 4min', '50min', '46min', '50min', '36min', '57min', '1h 3min', '50min', '47min', '1h 15min', '53min', '34min', '50min', '44min']
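
The loop above retries a request once after a fixed five-second pause. If the site keeps answering 429, retrying a few times with growing delays may be more reliable. A sketch under that assumption; `fetch_with_retry` and its parameters are hypothetical, and `fetch` stands in for `requests.get` (a stub is used here so the example runs without network access):

```python
import time
from types import SimpleNamespace

def fetch_with_retry(fetch, url, max_attempts=3, base_delay=5.0, sleep=time.sleep):
    """Call fetch(url) until the response is not a 429 or attempts run out.

    `fetch` is any callable returning an object with a `status_code`
    attribute. The pause doubles after each 429 response.
    """
    response = None
    for attempt in range(max_attempts):
        response = fetch(url)
        if response.status_code != 429:
            return response
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return response

# Stubbed demonstration: two 429 responses followed by a 200.
statuses = iter([429, 429, 200])
delays = []
result = fetch_with_retry(
    lambda url: SimpleNamespace(status_code=next(statuses)),
    "https://www.justwatch.com/in/tv-show/shogun-2024",
    sleep=delays.append,  # record the pauses instead of actually sleeping
)
print(result.status_code, delays)
```

In the notebook the call would look like `fetch_with_retry(requests.get, link)`.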


## **Creating TV Show DataFrame**

In [None]:
# Write Your Code here
import pandas as pd
tv_show_dict = {
    "Title": list_of_tv_show_title,
    "Release-year": list_of_tv_show_relase_year,
    "Genere": list_of_tvshow_genere,
    "Country": list_of_tvshow_production_country_details,
    "Runtime": list_of_tv_show_runtime,
    "Streaming-Service": list_of_streaming_service_details,
    "links": list_of_tvshow_url
}
tv_show_df = pd.DataFrame(tv_show_dict)
tv_show_df

NameError: name 'list_of_tv_show_title' is not defined
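
Apart from undefined names, `pd.DataFrame` also raises a `ValueError` when the lists in the dictionary have different lengths, which can happen if a scraping loop skips an append after a failed request. A small pre-check, sketched here with a hypothetical helper `check_equal_lengths`, reports the length of every list before the DataFrame is built:

```python
def check_equal_lengths(columns):
    """Return the shared length of all lists in `columns`, or raise ValueError."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return next(iter(lengths.values()), 0)

# Illustrative data only; the real lists come from the scraping cells.
sample = {
    "Title": ["Show A", "Show B"],
    "Runtime": ["58min", "50min"],
}
print(check_equal_lengths(sample))
```

Calling `check_equal_lengths(tv_show_dict)` before `pd.DataFrame(tv_show_dict)` would point to the list that came up short.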

## **Task 2 :- Data Filtering & Analysis**

In [None]:
# Write Your Code here
movie_df

NameError: name 'movie_df' is not defined

In [None]:
df = movie_df.copy()
df

Unnamed: 0,Movie-title,Release-year,Genre,IMBD-rating,Runtime,Country,streaming_sevice,Age-rating,Links
0,Hanu-Man,2024,"Comedy, Science-Fiction, Fantasy, Action & Adv...",8.0,2h 39min,country not details not found,streaming service not found,UA,https://www.justwatch.com/in/movie/hanu-man
1,Teri Baaton Mein Aisa Uljha Jiya,2024,"Romance, Science-Fiction, Comedy, Drama",6.6,2h 21min,India,Amazon Prime Video,age rating not found,https://www.justwatch.com/in/movie/untitled-sh...
2,Oppenheimer,2023,"Drama, History",8.3,3h 0min,country not details not found,streaming service not found,UA,https://www.justwatch.com/in/movie/oppenheimer
3,Fighter,2024,"Action & Adventure, Mystery & Thriller, War & ...",6.4,2h 47min,country not details not found,streaming service not found,UA,https://www.justwatch.com/in/movie/fighter-2022
4,Anatomy of a Fall,2023,"Mystery & Thriller, Crime, Drama",7.7,2h 32min,France,Amazon Prime Video,age rating not found,https://www.justwatch.com/in/movie/anatomie-du...
...,...,...,...,...,...,...,...,...,...
95,Spider-Man: No Way Home,2021,"Drama, Kids & Family, Romance, Action & Advent...",8.2,2h 28min,United States,streaming service not found,UA,https://www.justwatch.com/in/movie/family-star
96,Uri: The Surgical Strike,2019,"Fantasy, Action & Adventure, Mystery & Thriller",8.2,2h 18min,country not details not found,Zee5,age rating not found,https://www.justwatch.com/in/movie/tomb-raider
97,Past Lives,2023,"Action & Adventure, Mystery & Thriller, Crime",7.9,1h 46min,"South Korea, United States",streaming service not found,age rating not found,https://www.justwatch.com/in/movie/john-wick
98,Welcome Home,2020,"Action & Adventure, Drama, History, War & Mili...",7.4,2h 5min,country not details not found,streaming service not found,A,https://www.justwatch.com/in/movie/uri-the-sur...


## **Calculating Mean IMDb Ratings for both Movies and TV Shows**

In [None]:
from datetime import datetime

df["Release-year"] = df["Release-year"].astype(int)
current_year = datetime.now().year

recent_movies = df[df["Release-year"] > current_year - 2]

In [None]:
from datetime import datetime
current_year = datetime.now().year

recent_movie = df[df["Release-year"]>=current_year-1]
recent_movie

Unnamed: 0,Movie-title,Release-year,Genre,IMBD-rating,Runtime,Country,streaming_sevice,Age-rating,Links
0,Hanu-Man,2024,"Comedy, Science-Fiction, Fantasy, Action & Adv...",8.0,2h 39min,country not details not found,streaming service not found,UA,https://www.justwatch.com/in/movie/hanu-man
1,Teri Baaton Mein Aisa Uljha Jiya,2024,"Romance, Science-Fiction, Comedy, Drama",6.6,2h 21min,India,Amazon Prime Video,age rating not found,https://www.justwatch.com/in/movie/untitled-sh...
2,Oppenheimer,2023,"Drama, History",8.3,3h 0min,country not details not found,streaming service not found,UA,https://www.justwatch.com/in/movie/oppenheimer
3,Fighter,2024,"Action & Adventure, Mystery & Thriller, War & ...",6.4,2h 47min,country not details not found,streaming service not found,UA,https://www.justwatch.com/in/movie/fighter-2022
4,Anatomy of a Fall,2023,"Mystery & Thriller, Crime, Drama",7.7,2h 32min,France,Amazon Prime Video,age rating not found,https://www.justwatch.com/in/movie/anatomie-du...
...,...,...,...,...,...,...,...,...,...
90,Masthu Shades Unnai Ra,2024,"Comedy, Drama, Kids & Family, Romance",7.8,2h 22min,country not details not found,Amazon Prime Video,age rating not found,https://www.justwatch.com/in/movie/rocky-aur-r...
92,Rocky Aur Rani Kii Prem Kahaani,2023,"Fantasy, Action & Adventure, Science-Fiction, ...",6.5,2h 48min,India,streaming service not found,age rating not found,https://www.justwatch.com/in/movie/saint-seiya...
93,Knights of the Zodiac,2023,"Action & Adventure, Drama, Mystery & Thriller,...",4.4,1h 53min,country not details not found,Apple TV,age rating not found,https://www.justwatch.com/in/movie/captain-miller
97,Past Lives,2023,"Action & Adventure, Mystery & Thriller, Crime",7.9,1h 46min,"South Korea, United States",streaming service not found,age rating not found,https://www.justwatch.com/in/movie/john-wick


In [None]:
df["IMBD-rating"] = df["IMBD-rating"].astype(float)
highest_rated_movie = df[df["IMBD-rating"] >= 7]
highest_rated_movie.head()

Unnamed: 0,Movie-title,Release-year,Genre,IMBD-rating,Runtime,Country,streaming_sevice,Age-rating,Links
0,Hanu-Man,2024,"Comedy, Science-Fiction, Fantasy, Action & Adv...",8.0,2h 39min,country not details not found,streaming service not found,UA,https://www.justwatch.com/in/movie/hanu-man
2,Oppenheimer,2023,"Drama, History",8.3,3h 0min,country not details not found,streaming service not found,UA,https://www.justwatch.com/in/movie/oppenheimer
4,Anatomy of a Fall,2023,"Mystery & Thriller, Crime, Drama",7.7,2h 32min,France,Amazon Prime Video,age rating not found,https://www.justwatch.com/in/movie/anatomie-du...
5,Bramayugam,2024,"Horror, Mystery & Thriller",8.0,ratings not found,country not details not found,streaming service not found,UA,https://www.justwatch.com/in/movie/bramayugam
6,Poor Things,2023,"Science-Fiction, Romance, Comedy, Drama",7.9,2h 22min,country not details not found,streaming service not found,A,https://www.justwatch.com/in/movie/poor-things
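
The two filters above are applied separately, the mean rating named in the section heading is never computed, and the export cell in Task 3 refers to a `filter_movies` variable that no cell shown here defines. A minimal sketch that combines both criteria and computes the mean, assuming the column names `Release-year` and `IMBD-rating` from the table above; `filter_recent_high_rated` is a hypothetical helper, and the year is passed in explicitly so the result is reproducible:

```python
import pandas as pd

def filter_recent_high_rated(df, current_year, years=2, min_rating=7.0):
    """Keep rows from the last `years` calendar years rated at least `min_rating`."""
    # errors="coerce" turns non-numeric placeholder text into NaN;
    # NaN fails both comparisons, so such rows are dropped.
    year = pd.to_numeric(df["Release-year"], errors="coerce")
    rating = pd.to_numeric(df["IMBD-rating"], errors="coerce")
    return df[(year > current_year - years) & (rating >= min_rating)]

# A few rows copied from the movie table shown above.
sample = pd.DataFrame({
    "Movie-title": ["Hanu-Man", "Teri Baaton Mein Aisa Uljha Jiya",
                    "Oppenheimer", "Fighter", "Spider-Man: No Way Home"],
    "Release-year": [2024, 2024, 2023, 2024, 2021],
    "IMBD-rating": [8.0, 6.6, 8.3, 6.4, 8.2],
})

filter_movies = filter_recent_high_rated(sample, current_year=2024)
mean_rating = pd.to_numeric(sample["IMBD-rating"], errors="coerce").mean()

print(filter_movies["Movie-title"].tolist())
print(round(mean_rating, 2))
```

On the full data the call would be `filter_recent_high_rated(df, datetime.now().year)`.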


## **Analyzing Top Genres**

In [None]:
# WordCloud comes from the third-party `wordcloud` package (pip install wordcloud)
from wordcloud import WordCloud
import matplotlib.pyplot as plt

text = " ".join(df["Genre"])
wordcloud = WordCloud(width=1000, height=600, background_color="white").generate(text)

plt.figure(figsize=(8, 5))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()


ModuleNotFoundError: No module named 'wordcloud'
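
A word cloud gives a visual impression but not the ranked counts the task asks for, and it depends on the third-party `wordcloud` package. A minimal pandas sketch that counts genres directly, assuming each `Genre` cell is a comma-separated string as in the table above; `top_genres` is a hypothetical helper:

```python
import pandas as pd

def top_genres(genre_column, n=5):
    """Count individual genres in a column of comma-separated genre strings."""
    # Split each cell into a list, give every genre its own row, trim spaces.
    genres = genre_column.str.split(",").explode().str.strip()
    genres = genres[genres != ""]
    return genres.value_counts().head(n)

# Genre strings copied from rows of the movie table.
sample = pd.Series([
    "Drama, History",
    "Mystery & Thriller, Crime, Drama",
    "Romance, Science-Fiction, Comedy, Drama",
    "Horror, Mystery & Thriller",
])
print(top_genres(sample))
```

On the full data the call would be `top_genres(df["Genre"])`. Splitting on commas keeps multi-word genres such as `Mystery & Thriller` intact.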

In [None]:
# Let's visualize it using a word cloud


## **Finding Predominant Streaming Service**

In [None]:
# Write Your Code here
streaming_service_count = df.groupby("streaming_sevice").size().reset_index(name="Count")
streaming_service_count_sorted = streaming_service_count.sort_values(by="Count", ascending=False)
streaming_service_count_sorted

Unnamed: 0,streaming_sevice,Count
8,streaming service not found,72
0,Amazon Prime Video,11
1,Apple TV,5
4,Netflix,5
2,Bookmyshow,2
6,Zee5,2
3,Hotstar,1
5,Sun Nxt,1
7,aha,1


In [None]:
streaming_service_counts = df.groupby('streaming_sevice').size().reset_index(name="Count")

streaming_service_counts_sorted = streaming_service_counts.sort_values(by="Count", ascending=False)
streaming_service_counts_sorted

Unnamed: 0,streaming_sevice,Count
8,streaming service not found,72
0,Amazon Prime Video,11
1,Apple TV,5
4,Netflix,5
2,Bookmyshow,2
6,Zee5,2
3,Hotstar,1
5,Sun Nxt,1
7,aha,1
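
In both tables the top row is the placeholder `streaming service not found`, not an actual provider. A short sketch that drops the placeholder before ranking, assuming the placeholder text matches exactly; `top_streaming_service` is a hypothetical helper:

```python
import pandas as pd

PLACEHOLDER = "streaming service not found"

def top_streaming_service(services, placeholder=PLACEHOLDER):
    """Return (service, count) for the most frequent real streaming service."""
    counts = services[services != placeholder].value_counts()
    if counts.empty:
        return None, 0
    return counts.index[0], int(counts.iloc[0])

# Rebuilds part of the count table above: 72 placeholders, 11 Amazon Prime Video,
# 5 Apple TV, 5 Netflix.
sample = pd.Series(
    [PLACEHOLDER] * 72
    + ["Amazon Prime Video"] * 11
    + ["Apple TV"] * 5
    + ["Netflix"] * 5
)
print(top_streaming_service(sample))
```

On the full data the call would be `top_streaming_service(df["streaming_sevice"])`.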


In [None]:
# Let's visualize it using a word cloud
text = " ".join(df["streaming_sevice"])

wordcloud = WordCloud(background_color="white").generate(text)

plt.figure(figsize=(8, 5))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

NameError: name 'WordCloud' is not defined

In [None]:
text = " ".join(df["streaming_sevice"])

wordcloud = WordCloud(width=1000, height=600, background_color="white").generate(text)
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

NameError: name 'WordCloud' is not defined

## **Task 3 :- Data Export**

In [None]:
# Save the full DataFrame in CSV format
df.to_csv("Orignal_DF.csv", index=False)
print("Export Done!")

Export Done!


In [None]:
# Save the filtered data in CSV format
filter_movies.to_csv("filter_movies.csv", index=False)
print("Export Done!")

NameError: name 'filter_movies' is not defined

# **Dataset Drive Link (View Access with Anyone) -**

# ***Congratulations!!! You have completed your Assignment.***