Reference link for web scraping: https://www.youtube.com/watch?v=O68xT4dE_zU&t=793s

#Importing modules

In [2]:
import pandas as pd
from bs4 import BeautifulSoup
import requests

#Request page source from URL

In [3]:
season1_url = 'https://www.imdb.com/title/tt0445883/episodes/?season=1'
season1_page=requests.get(season1_url)
season1_page


<Response [403]>

When we request the season 1 URL from Python, we get <Response [403]>. The HTTP 403 Forbidden status code indicates that the server understood the request but refuses to authorize it.
This most likely happens because requests sends a default User-Agent of the form python-requests/<version>, which identifies the client as a script rather than a browser, and some websites reject such requests.
Ref: https://scrapeops.io/web-scraping-playbook/403-forbidden-error-web-scraping/

To fix this issue, we send a browser-like (fake) User-Agent header, as described in the above reference.

In [4]:
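# Show the headers that requests sends by default (httpbin echoes them back)
r = requests.get('http://httpbin.org/headers')
print(r.text)
# Browser-like User-Agent to send with the IMDb requests
HEADERS = {'User-Agent': 'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148'}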


{
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.31.0", 
    "X-Amzn-Trace-Id": "Root=1-65df4bf8-563b2a7200ef177d30634e2d"
  }
}
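
The output above shows that requests announces itself as python-requests/2.31.0 by default. As a rough sketch of what passing headers=HEADERS changes, the snippet below builds a request without sending it and inspects the User-Agent it would carry. The example.com URL is a placeholder that is not part of the original notebook, and preparing the request by hand is only for illustration; in the notebook it is enough to call requests.get(url, headers=HEADERS).

```python
import requests

# Same browser-like User-Agent string as in the notebook
HEADERS = {'User-Agent': 'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148'}

# What requests sends when no User-Agent is given, e.g. 'python-requests/2.31.0'
default_ua = requests.utils.default_user_agent()

# Build (but do not send) a request to see which headers it would carry.
# 'https://example.com/' is a placeholder URL used only for this illustration.
session = requests.Session()
prepared = session.prepare_request(
    requests.Request('GET', 'https://example.com/', headers=HEADERS)
)
sent_ua = prepared.headers['User-Agent']

print('default:', default_ua)
print('sent   :', sent_ua)
```

The headers passed for a single request take precedence over the session defaults, so the server sees the browser-like string instead of python-requests.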



Since there are 8 seasons to scrape, we define a function that extracts the season number, episode number, title, air date and IMDb rating of every episode on a season page and returns them as lists. We then combine the lists from all 8 seasons in a dictionary.

In [5]:
def scrape_imdb_episodes(season_soup):
  """Return lists of season, episode number, title, air date and IMDb rating for one season page."""
  seasons= []
  episode_numbers = []
  episode_titles = []
  date_aired = []
  imdb_ratings = []

  # These class names are copied from IMDb's markup and may change when the site is updated
  scraped_episodes= season_soup.find_all('div',class_='ipc-title__text')
  episode_dates = season_soup.find_all('span',class_='sc-f2169d65-10 iZXnmI')
  episode_ratings= season_soup.find_all('span',class_='ipc-rating-star ipc-rating-star--base ipc-rating-star--imdb ratingGroup--imdb-rating')

  for ep in scraped_episodes:
    # Heading text has the form 'S1.E1 ∙ Shahrukh & Kajol'
    episode_info = ep.get_text().strip()
    season_episode, title = episode_info.split('∙', 1)
    season, episode = season_episode.split('.')[:2]
    seasons.append(season.strip())
    episode_numbers.append(episode.strip())
    episode_titles.append(title.strip())

  for date in episode_dates:
    date_info = date.get_text().strip()
    date_aired.append(date_info)

  for rating in episode_ratings:
    # Keep only the score in front of the '/'
    rating_info = rating.get_text().strip()
    imdb_ratings.append(rating_info.split('/')[0].strip())

  return seasons,episode_numbers,episode_titles,date_aired,imdb_ratings
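
The string handling inside the function can be tried on its own, without fetching a page. The sketch below applies the same splitting to one heading taken from the results further down. The helper names parse_episode_heading and parse_rating are hypothetical, and the raw rating text '8.3/10' is an assumption about what the page contains; the notebook only shows that the score precedes a '/'.

```python
def parse_episode_heading(heading):
    """Split a heading like 'S1.E1 ∙ Title' into (season, episode, title)."""
    # Everything before the first '∙' is the 'S1.E1' prefix, the rest is the title
    prefix, title = heading.split('∙', 1)
    season, episode = prefix.strip().split('.', 1)
    return season.strip(), episode.strip(), title.strip()


def parse_rating(rating_text):
    """Keep only the score in front of the '/' in the rating text."""
    return rating_text.split('/')[0].strip()


heading_parts = parse_episode_heading('S1.E1 ∙ Shahrukh & Kajol')
score = parse_rating('8.3/10')  # assumed raw text, see the note above

print(heading_parts)  # ('S1', 'E1', 'Shahrukh & Kajol')
print(score)          # 8.3
```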


In [6]:
base_url = 'https://www.imdb.com/title/tt0445883/episodes/?season={}'
kwk_data = {'Season':[],'Episode':[],'Title':[],'Date':[],'IMDB Rating':[]}
# Loop over seasons 1 to 8 and collect every episode into kwk_data
for season_num in range(1,9):
  season_url = base_url.format(season_num)
  # Send the browser-like User-Agent so the request is not rejected with 403
  season_page = requests.get(season_url,headers=HEADERS)
  season_page.raise_for_status()  # stop early if the page could not be fetched
  season_soup = BeautifulSoup(season_page.content,"html.parser")
  seasons,episode_numbers,episode_titles,date_aired,imdb_ratings = scrape_imdb_episodes(season_soup)
  kwk_data['Season'].extend(seasons)
  kwk_data['Episode'].extend(episode_numbers)
  kwk_data['Title'].extend(episode_titles)
  kwk_data['Date'].extend(date_aired)
  kwk_data['IMDB Rating'].extend(imdb_ratings)
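
Titles, dates and ratings are collected by three separate find_all calls, so the lists can end up with different lengths if, for example, an episode has no rating yet. pandas raises a ValueError when a DataFrame is built from a dictionary whose lists differ in length, so a quick check beforehand can save some confusion. The following is a minimal sketch; the helper names column_lengths and is_rectangular and the incomplete sample dictionary are made up for illustration.

```python
def column_lengths(data):
    """Map each column name to the number of values collected for it."""
    return {name: len(values) for name, values in data.items()}


def is_rectangular(data):
    """True when every column holds the same number of values."""
    return len(set(column_lengths(data).values())) <= 1


# Hypothetical sample in which the second episode is missing its rating
sample = {
    'Season': ['S1', 'S1'],
    'Episode': ['E1', 'E2'],
    'Title': ['Shahrukh & Kajol', 'Kareena and Rani'],
    'Date': ['Fri, Nov 19, 2004', 'Fri, Nov 26, 2004'],
    'IMDB Rating': ['8.3'],
}

lengths = column_lengths(sample)
print(lengths)
print(is_rectangular(sample))  # False
```

Running column_lengths(kwk_data) after the loop shows at a glance which column, if any, is short.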


Creating a DataFrame from the dictionary and exporting it to an Excel file.

In [8]:
kwk_df = pd.DataFrame(kwk_data)

In [9]:
kwk_df.head(5)

   Season  Episode  Title                              Date               IMDB Rating
0  S1      E1       Shahrukh & Kajol                   Fri, Nov 19, 2004  8.3
1  S1      E2       Kareena and Rani                   Fri, Nov 26, 2004  8.3
2  S1      E3       Saif and Preity                    Fri, Dec 3, 2004   7.6
3  S1      E4       Sanjay Leela Bhansali & Aishwarya  Sat, Dec 11, 2004  7.4
4  S1      E5       Fardeen Khan & Zayed Khan          Fri, Dec 17, 2004  6.1
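
All values in the table are still strings, because they come straight from the page text. If typed values are preferred before exporting, the dates and ratings can be converted with the standard library, as in the sketch below. The helper name parse_air_date is hypothetical, and the format string assumes an English locale, since %a and %b match day and month abbreviations in the current locale.

```python
from datetime import datetime


def parse_air_date(text):
    """Turn a string like 'Fri, Nov 19, 2004' into a datetime.date."""
    # %a = abbreviated weekday, %b = abbreviated month, %d = day, %Y = four-digit year
    return datetime.strptime(text, '%a, %b %d, %Y').date()


first_aired = parse_air_date('Fri, Nov 19, 2004')
first_rating = float('8.3')

print(first_aired)   # 2004-11-19
print(first_rating)  # 8.3
```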


In [10]:
kwk_df.to_excel('Koffee with Karan Data.xlsx')