### IMDb datasets downloaded from https://datasets.imdbws.com

## Schema definition at https://www.imdb.com/interfaces/

### title.akas.tsv.gz - Contains the following information for titles:
1. titleId (string) - a tconst, an alphanumeric unique identifier of the title
1. ordering (integer) – a number to uniquely identify rows for a given titleId
1. title (string) – the localized title
1. region (string) - the region for this version of the title
1. language (string) - the language of the title
1. types (array) - Enumerated set of attributes for this alternative title. One or more of the following: "alternative", "dvd", "festival", "tv", "video", "working", "original", "imdbDisplay". New values may be added in the future without warning
1. attributes (array) - Additional terms to describe this alternative title, not enumerated
1. isOriginalTitle (boolean) – 0: not original title; 1: original title

### title.basics.tsv.gz - Contains the following information for titles:
1. tconst (string) - alphanumeric unique identifier of the title
1. titleType (string) – the type/format of the title (e.g. movie, short, tvSeries, tvEpisode, video, etc.)
1. primaryTitle (string) – the more popular title / the title used by the filmmakers on promotional materials at the point of release
1. originalTitle (string) - original title, in the original language
1. isAdult (boolean) - 0: non-adult title; 1: adult title
1. startYear (YYYY) – represents the release year of a title. In the case of TV Series, it is the series start year
1. endYear (YYYY) – TV Series end year. '\N' for all other title types
1. runtimeMinutes – primary runtime of the title, in minutes
1. genres (string array) – includes up to three genres associated with the title

### title.crew.tsv.gz – Contains the director and writer information for all the titles in IMDb. Fields include:
1. tconst (string) - alphanumeric unique identifier of the title
1. directors (array of nconsts) - director(s) of the given title
1. writers (array of nconsts) – writer(s) of the given title

### title.episode.tsv.gz – Contains the tv episode information. Fields include:
1. tconst (string) - alphanumeric identifier of episode
1. parentTconst (string) - alphanumeric identifier of the parent TV Series
1. seasonNumber (integer) – season number the episode belongs to
1. episodeNumber (integer) – episode number of the tconst in the TV series

### title.principals.tsv.gz – Contains the principal cast/crew for titles
1. tconst (string) - alphanumeric unique identifier of the title
1. ordering (integer) – a number to uniquely identify rows for a given tconst
1. nconst (string) - alphanumeric unique identifier of the name/person
1. category (string) - the category of job that person was in
1. job (string) - the specific job title if applicable, else '\N'
1. characters (string) - the name of the character played if applicable, else '\N'

### title.ratings.tsv.gz – Contains the IMDb rating and votes information for titles
1. tconst (string) - alphanumeric unique identifier of the title
1. averageRating – weighted average of all the individual user ratings
1. numVotes - number of votes the title has received

### name.basics.tsv.gz – Contains the following information for names:
1. nconst (string) - alphanumeric unique identifier of the name/person
1. primaryName (string)– name by which the person is most often credited
1. birthYear – in YYYY format
1. deathYear – in YYYY format if applicable, else '\N'
1. primaryProfession (array of strings)– the top-3 professions of the person
1. knownForTitles (array of tconsts) – titles the person is known for
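
The schema above marks missing values with `\N` and stores array fields as comma-separated strings. A minimal sketch of handling both with pandas follows, using a two-row inline sample in place of the real file. The `parse_imdb_tsv` helper, its parameters, and the choice to disable quote handling are illustrative assumptions, not part of the IMDb documentation.

```python
import csv
import io

import pandas as pd

# Two illustrative rows in the title.basics layout; the real file is far larger.
SAMPLE_TSV = (
    "tconst\ttitleType\tprimaryTitle\toriginalTitle\tisAdult\t"
    "startYear\tendYear\truntimeMinutes\tgenres\n"
    "tt0000001\tshort\tCarmencita\tCarmencita\t0\t1894\t\\N\t1\tDocumentary,Short\n"
    "tt9916848\ttvEpisode\tEpisode #3.17\tEpisode #3.17\t0\t2010\t\\N\t\\N\t"
    "Action,Drama,Family\n"
)


def parse_imdb_tsv(source, array_columns=(), int_columns=()):
    """Read an IMDb-style TSV (hypothetical helper).

    Only the literal \\N marker is treated as missing, so a title such as
    "NA" or "None" survives as text.
    """
    df = pd.read_csv(
        source,
        sep="\t",
        dtype=str,                # read everything as text first
        na_values=["\\N"],        # the schema's missing-value marker
        keep_default_na=False,    # do not treat "NA", "null", etc. as missing
        quoting=csv.QUOTE_NONE,   # assumption: quotes in titles are literal text
    )
    for col in array_columns:
        # "Documentary,Short" -> ["Documentary", "Short"]; missing stays missing
        df[col] = df[col].str.split(",")
    for col in int_columns:
        # Nullable integer dtype keeps missing values without falling back to float
        df[col] = pd.to_numeric(df[col]).astype("Int64")
    return df


basics = parse_imdb_tsv(
    io.StringIO(SAMPLE_TSV),
    array_columns=["genres"],
    int_columns=["startYear", "endYear", "runtimeMinutes"],
)
print(basics[["tconst", "startYear", "endYear", "runtimeMinutes", "genres"]])
```

The same function should accept a path such as `compressed_tsv/title.basics.tsv.gz` instead of the `StringIO` object, since `pd.read_csv` infers gzip compression from the file extension.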

## Import data into local in-memory server

https://mybinder.org/v2/gh/apache/spark/v3.1.2-rc1?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb

In [484]:
import pyspark.sql
import pandas as pd

# # create spark session
# spark = pyspark.sql.SparkSession \
#         .builder \
#         .appName("IMDB_DATA") \
#         .getOrCreate()



In [486]:
# name_basics_raw = pd.read_csv("compressed_tsv/name.basics.tsv.gz" , sep='\t', low_memory = False)
# title_episode_raw = pd.read_csv("compressed_tsv/title.episode.tsv.gz" , sep='\t', low_memory = False)
# title_ratings_raw = pd.read_csv("compressed_tsv/title.ratings.tsv.gz" , sep='\t', low_memory = False)
# title_crew_raw = pd.read_csv("compressed_tsv/title.crew.tsv.gz" , sep='\t', low_memory = False)
# title_basics_raw = pd.read_csv("compressed_tsv/title.basics.tsv.gz" , sep='\t', low_memory = False)
# title_akas_raw = pd.read_csv("compressed_tsv/title.akas.tsv.gz" , sep='\t', low_memory = False)
# title_principals_raw = pd.read_csv("compressed_tsv/title.principals.tsv.gz" , sep='\t', low_memory = False)
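
The seven commented-out reads above differ only in the file name, so they can be folded into one loop. The sketch below assumes the files sit in a `compressed_tsv` directory as in the paths above; `load_imdb_tables`, `IMDB_FILES`, and the `nrows` option are hypothetical names introduced for illustration.

```python
from pathlib import Path

import pandas as pd

# Dataset names as listed in the schema section (without the .tsv.gz suffix).
IMDB_FILES = [
    "name.basics",
    "title.akas",
    "title.basics",
    "title.crew",
    "title.episode",
    "title.principals",
    "title.ratings",
]


def load_imdb_tables(directory="compressed_tsv", names=IMDB_FILES, nrows=None):
    """Load each available <name>.tsv.gz into a dict of DataFrames.

    Keys replace the dot with an underscore, e.g. "title_ratings".
    Files that are not present in the directory are skipped.
    """
    tables = {}
    for name in names:
        path = Path(directory) / f"{name}.tsv.gz"
        if not path.exists():
            continue
        tables[name.replace(".", "_")] = pd.read_csv(
            path,
            sep="\t",
            na_values="\\N",   # the schema's missing-value marker
            low_memory=False,
            nrows=nrows,       # set a small number to preview large files
        )
    return tables
```

Calling `load_imdb_tables(nrows=1000)` would give a quick preview of every file without reading the full datasets into memory.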

## Getting top Bollywood movies from IMDb using web scraping

In [487]:
# Use BeautifulSoup to get the latest data

In [513]:
from bs4 import BeautifulSoup
import requests
import re

In [562]:
bollywood_title_all = []
bollywood_year_all = []
bollywood_title_links_all = []

In [563]:
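# Helper to build the URL of a later results page.
# a = 0 gives the second page (start=251), a = 1 the third (start=501), and so on.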
def url_gen(a):
    return 'https://www.imdb.com/search/title/?countries=in&locations=India&count=250&start=' \
                + str( (251) + (a * 250) ) \
                + '&ref_=adv_nxt'
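
The `start` query parameter appears to be the 1-based rank of the first result on a page, so with 250 results per page the second page begins at 251. A small sketch of that offset arithmetic, assuming this reading of the URL is right (`page_start` is a hypothetical helper, not part of the notebook):

```python
def page_start(page, page_size=250):
    """Return the 1-based rank of the first result on a 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return 1 + (page - 1) * page_size


starts = [page_start(p) for p in range(1, 5)]
print(starts)  # [1, 251, 501, 751]
```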

In [564]:
def miner(url = 'https://www.imdb.com/search/title/?countries=in&locations=India&count=250'
          , bollywood_title_all = bollywood_title_all
          , bollywood_year_all = bollywood_year_all
          , bollywood_title_links_all = bollywood_title_links_all):
    
    # Download one page of IMDb search results for titles from India
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')

    # use select method and enter the selector path of the element
    bollywood_title = (soup.select('div.lister-item-content > h3 > a'))
    bollywood_year = (soup.select('div.lister-item-content > h3 > span.lister-item-year.text-muted.unbold'))
    bollywood_title_links = [a.attrs.get('href') for a in soup.select('div.lister-item-content > h3 > a')]

    # convert bs4.element into string
    for i in range(0, len(bollywood_title)):
        bollywood_title[i] = bollywood_title[i].text

    for i in range(0, len(bollywood_year)):
        bollywood_year[i] = bollywood_year[i].text

    bollywood_title_all += bollywood_title
    bollywood_year_all += bollywood_year
    bollywood_title_links_all += bollywood_title_links 
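
`miner` collects titles, years, and links in three separate passes and appends them to module-level lists, so a result with no year element would leave the lists with different lengths. A sketch of an alternative that parses one result at a time and returns rows instead follows. The `parse_titles` name and the inline HTML sample are illustrative assumptions, the selectors follow the ones used above, and `html.parser` is swapped in for `lxml` only to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

# Illustrative markup shaped like the selectors used above; not a real response.
SAMPLE_HTML = """
<div class="lister-item-content">
  <h3 class="lister-item-header">
    <a href="/title/tt10295212/">Shershaah</a>
    <span class="lister-item-year text-muted unbold">(2021)</span>
  </h3>
</div>
"""


def parse_titles(html):
    """Return one dict per search result, keeping title, year, and link together."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for header in soup.select("div.lister-item-content > h3"):
        link = header.select_one("a")
        year = header.select_one("span.lister-item-year")
        rows.append(
            {
                "title": link.get_text(strip=True) if link else None,
                "release_year": year.get_text(strip=True) if year else None,
                "title_link": link.get("href") if link else None,
            }
        )
    return rows


rows = parse_titles(SAMPLE_HTML)
print(rows)
```

A list of dicts like this can be passed straight to `pd.DataFrame(rows)`, and rows from several pages can be combined with `list.extend`.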

In [565]:
# call the miner with default page
miner()

# # iterate over the next 9 pages (url_gen(0) is the second page, start=251):
# for i in range(0, 9):
#     miner(url = url_gen(i))
#     print('mined no.', i + 2, 'page' )

In [566]:
dict_for_df = {"title":bollywood_title_all, "release_year":bollywood_year_all, "title_link":bollywood_title_links_all}

In [569]:
scraped_bollywood = pd.DataFrame(dict_for_df)
# raw strings avoid the invalid-escape warning for '\d'
scraped_bollywood.release_year = scraped_bollywood.release_year.str.extract(r'(\d+)')
scraped_bollywood.title_link = scraped_bollywood.title_link.str.extract(r'(\d+)')
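
`str.extract` with one capture group keeps the first match in each string. The sketch below runs that step on a few assumed raw values (the exact strings IMDb returns may differ). It also shows a variant pattern, `(tt\d+)`, that keeps the `tt` prefix so the result matches `tconst` as stored in the datasets; that variant is a suggestion, not what the notebook does.

```python
import pandas as pd

# Assumed shapes of the scraped strings before cleaning.
raw = pd.DataFrame(
    {
        "release_year": ["(2021)", "(2019– )", "(I) (2015)"],
        "title_link": ["/title/tt10295212/", "/title/tt9544034/", "/title/tt2302966/"],
    }
)

clean = pd.DataFrame(
    {
        # first run of four digits, i.e. the (start) year
        "release_year": raw["release_year"].str.extract(r"(\d{4})", expand=False),
        # full identifier including the "tt" prefix
        "tconst": raw["title_link"].str.extract(r"(tt\d+)", expand=False),
    }
)
print(clean)
```

With `expand=False` and a single group, `str.extract` returns a Series rather than a one-column DataFrame, which keeps the column assignment explicit.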

In [570]:
scraped_bollywood

Unnamed: 0,title,release_year,title_link
0,Shershaah,2021,10295212
1,Mimi,2021,10895576
2,#Home,2021,10534500
3,Boomika,2021,11073148
4,The Family Man,2019,9544034
...,...,...,...
245,Anand,1971,0066763
246,Baadshaho,2017,4906960
247,I,2015,2302966
248,Iruvar,1997,0119385


In [None]:
scraped_bollywood.describe().T

In [None]:
scraped_bollywood.title.value_counts()

In [571]:
title_basics_raw = pd.read_csv("compressed_tsv/title.basics.tsv.gz" , sep='\t', low_memory = False)

In [573]:
title_basics_raw.tconst = title_basics_raw.tconst.str.extract(r'(\d+)')
title_basics_raw

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance"
3,0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short"
4,0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"
...,...,...,...,...,...,...,...,...,...
8230930,9916848,tvEpisode,Episode #3.17,Episode #3.17,0,2010,\N,\N,"Action,Drama,Family"
8230931,9916850,tvEpisode,Episode #3.19,Episode #3.19,0,2010,\N,\N,"Action,Drama,Family"
8230932,9916852,tvEpisode,Episode #3.20,Episode #3.20,0,2010,\N,\N,"Action,Drama,Family"
8230933,9916856,short,The Wind,The Wind,0,2015,\N,27,Short
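
With `tconst` reduced to its digits, it has the same form as `title_link` in the scraped frame, so the two can be joined. The notebook does not show that step; the sketch below is one plausible way to do it, on small stand-in frames whose values (including the `titleType` entries) are assumptions for illustration.

```python
import pandas as pd

# Stand-ins for scraped_bollywood and title_basics_raw after the digit extraction.
scraped = pd.DataFrame(
    {
        "title": ["Shershaah", "Anand"],
        "release_year": ["2021", "1971"],
        "title_link": ["10295212", "0066763"],
    }
)
basics = pd.DataFrame(
    {
        "tconst": ["0066763", "0000001"],
        "titleType": ["movie", "short"],
        "primaryTitle": ["Anand", "Carmencita"],
    }
)

# Left join keeps every scraped title; `_merge` flags rows with no dataset match.
merged = scraped.merge(
    basics,
    left_on="title_link",
    right_on="tconst",
    how="left",
    indicator=True,
)
print(merged[["title", "tconst", "titleType", "_merge"]])
```

Rows marked `left_only` are scraped titles with no counterpart in the dataset snapshot, which can happen when the download is older than the scraped page.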
