## IMDb datasets downloaded from https://datasets.imdbws.com

## Schema definition at https://www.imdb.com/interfaces/

### title.akas.tsv.gz - Contains the following information for titles:
1. titleId (string) - a tconst, an alphanumeric unique identifier of the title
1. ordering (integer) - a number to uniquely identify rows for a given titleId
1. title (string) - the localized title
1. region (string) - the region for this version of the title
1. language (string) - the language of the title
1. types (array) - enumerated set of attributes for this alternative title. One or more of the following: "alternative", "dvd", "festival", "tv", "video", "working", "original", "imdbDisplay". New values may be added in the future without warning
1. attributes (array) - additional terms to describe this alternative title, not enumerated
1. isOriginalTitle (boolean) - 0: not original title; 1: original title
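
As a minimal sketch of how these columns behave once loaded, the snippet below reads a few made-up rows in the `title.akas` layout and looks up the title listed for one region. The rows, the `tt9000001` identifier and the helper name `title_for_region` are invented for illustration. It assumes `\N` marks missing values here as it does in the other files, and it switches off pandas' default NA strings so that a region code such as `NA` stays a string.

```python
import io
import pandas as pd

# Made-up rows in the title.akas column layout (not real IMDb data).
sample_akas = (
    "titleId\tordering\ttitle\tregion\tlanguage\ttypes\tattributes\tisOriginalTitle\n"
    "tt9000001\t1\tExample Film\t\\N\t\\N\toriginal\t\\N\t1\n"
    "tt9000001\t2\tBeispielfilm\tDE\t\\N\timdbDisplay\t\\N\t0\n"
    "tt9000001\t3\tExample Film\tNA\ten\timdbDisplay\t\\N\t0\n"
)

akas = pd.read_csv(
    io.StringIO(sample_akas),
    sep="\t",
    na_values="\\N",        # only the literal \N counts as missing
    keep_default_na=False,  # otherwise the region code "NA" would become NaN
)

def title_for_region(df, title_id, region):
    """Hypothetical helper: first title listed for (title_id, region), or None."""
    rows = df[(df["titleId"] == title_id) & (df["region"] == region)]
    rows = rows.sort_values("ordering")
    return rows["title"].iloc[0] if not rows.empty else None

german_title = title_for_region(akas, "tt9000001", "DE")
original_rows = akas[akas["isOriginalTitle"] == 1]
print(german_title, len(original_rows))
```

Sorting by `ordering` before taking the first row makes the result deterministic when a region has more than one entry.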

### title.basics.tsv.gz - Contains the following information for titles:
1. tconst (string) - alphanumeric unique identifier of the title
1. titleType (string) - the type/format of the title (e.g. movie, short, tvSeries, tvEpisode, video, etc.)
1. primaryTitle (string) - the more popular title / the title used by the filmmakers on promotional materials at the point of release
1. originalTitle (string) - original title, in the original language
1. isAdult (boolean) - 0: non-adult title; 1: adult title
1. startYear (YYYY) - represents the release year of a title. In the case of TV Series, it is the series start year
1. endYear (YYYY) - TV Series end year. '\N' for all other title types
1. runtimeMinutes - primary runtime of the title, in minutes
1. genres (string array) - includes up to three genres associated with the title
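
The year, runtime and genre columns need a little handling after loading. The sketch below uses made-up rows and assumes that `genres` is stored as a comma-separated string and that `\N` marks missing values. It converts the numeric columns to pandas' nullable `Int64` type and expands `genres` to one row per genre.

```python
import io
import pandas as pd

# Made-up rows in the title.basics column layout (not real IMDb data).
sample_basics = (
    "tconst\ttitleType\tprimaryTitle\toriginalTitle\tisAdult\tstartYear\tendYear\truntimeMinutes\tgenres\n"
    "tt9000001\tmovie\tExample Film\tExample Film\t0\t1999\t\\N\t101\tDrama,Romance\n"
    "tt9000002\ttvSeries\tExample Show\tExample Show\t0\t2005\t2009\t\\N\tComedy\n"
    "tt9000003\tshort\tUntitled Example\tUntitled Example\t0\t\\N\t\\N\t7\t\\N\n"
)
basics = pd.read_csv(
    io.StringIO(sample_basics), sep="\t", na_values="\\N", keep_default_na=False
)

# Columns with missing values load as floats; Int64 keeps them as integers with <NA>.
for col in ["startYear", "endYear", "runtimeMinutes"]:
    basics[col] = basics[col].astype("Int64")

# Turn "Drama,Romance" into ["Drama", "Romance"], then into one row per genre.
basics["genres"] = basics["genres"].str.split(",")
genre_rows = basics.explode("genres").dropna(subset=["genres"])
genre_counts = genre_rows["genres"].value_counts()
print(genre_counts)
```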

### title.crew.tsv.gz - Contains the director and writer information for all the titles in IMDb. Fields include:
1. tconst (string) - alphanumeric unique identifier of the title
1. directors (array of nconsts) - director(s) of the given title
1. writers (array of nconsts) - writer(s) of the given title
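
Because `directors` and `writers` hold several nconsts per row, per-person analysis needs one row per (title, person) pair. The following is a rough sketch on made-up rows, assuming the ids are comma-separated; `explode_ids` is a hypothetical helper name.

```python
import io
import pandas as pd

# Made-up rows in the title.crew column layout (not real IMDb data).
sample_crew = (
    "tconst\tdirectors\twriters\n"
    "tt9000001\tnm9000001,nm9000002\tnm9000003\n"
    "tt9000002\tnm9000001\t\\N\n"
    "tt9000003\t\\N\tnm9000003,nm9000001\n"
)
crew = pd.read_csv(
    io.StringIO(sample_crew), sep="\t", na_values="\\N", keep_default_na=False
)

def explode_ids(df, column):
    """Hypothetical helper: one row per (tconst, nconst) pair for an id-list column."""
    pairs = df[["tconst", column]].dropna(subset=[column]).copy()
    pairs[column] = pairs[column].str.split(",")
    pairs = pairs.explode(column).rename(columns={column: "nconst"})
    return pairs.reset_index(drop=True)

director_pairs = explode_ids(crew, "directors")
titles_per_director = director_pairs.groupby("nconst")["tconst"].nunique()
print(titles_per_director)
```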

### title.episode.tsv.gz - Contains the TV episode information. Fields include:
1. tconst (string) - alphanumeric identifier of episode
1. parentTconst (string) - alphanumeric identifier of the parent TV Series
1. seasonNumber (integer) - season number the episode belongs to
1. episodeNumber (integer) - episode number of the tconst in the TV series
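
One way to use these fields is to count episodes per season of a series. The sketch below runs on made-up rows and assumes that some episodes carry `\N` instead of a season or episode number; those rows are left out of the count.

```python
import io
import pandas as pd

# Made-up rows in the title.episode column layout (not real IMDb data).
sample_episodes = (
    "tconst\tparentTconst\tseasonNumber\tepisodeNumber\n"
    "tt9000011\ttt9000010\t1\t1\n"
    "tt9000012\ttt9000010\t1\t2\n"
    "tt9000013\ttt9000010\t2\t1\n"
    "tt9000014\ttt9000010\t\\N\t\\N\n"
)
episodes = pd.read_csv(
    io.StringIO(sample_episodes), sep="\t", na_values="\\N", keep_default_na=False
)

# Nullable integers keep the numbers as ints even though some are missing.
for col in ["seasonNumber", "episodeNumber"]:
    episodes[col] = episodes[col].astype("Int64")

numbered = episodes.dropna(subset=["seasonNumber", "episodeNumber"])
episodes_per_season = (
    numbered.groupby(["parentTconst", "seasonNumber"])
    .size()
    .rename("episodeCount")
    .reset_index()
)
print(episodes_per_season)
```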

### title.principals.tsv.gz - Contains the principal cast/crew for titles
1. tconst (string) - alphanumeric unique identifier of the title
1. ordering (integer) - a number to uniquely identify rows for a given tconst
1. nconst (string) - alphanumeric unique identifier of the name/person
1. category (string) - the category of job that person was in
1. job (string) - the specific job title if applicable, else '\N'
1. characters (string) - the name of the character played if applicable, else '\N'

### title.ratings.tsv.gz - Contains the IMDb rating and votes information for titles
1. tconst (string) - alphanumeric unique identifier of the title
1. averageRating - weighted average of all the individual user ratings
1. numVotes - number of votes the title has received
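
Ratings only become readable once they are joined to titles on `tconst`. This is a small sketch with made-up rows: it joins a cut-down `title.basics` to `title.ratings` and keeps titles above a vote threshold. `MIN_VOTES` is an arbitrary value picked for the example, not something the dataset prescribes.

```python
import io
import pandas as pd

# Made-up rows (not real IMDb data); only the columns needed for the join.
sample_basics = (
    "tconst\ttitleType\tprimaryTitle\n"
    "tt9000001\tmovie\tExample Film\n"
    "tt9000002\tmovie\tAnother Example\n"
    "tt9000003\tshort\tShort Example\n"
)
sample_ratings = (
    "tconst\taverageRating\tnumVotes\n"
    "tt9000001\t8.1\t25000\n"
    "tt9000002\t9.4\t12\n"
)
read_opts = dict(sep="\t", na_values="\\N", keep_default_na=False)
basics = pd.read_csv(io.StringIO(sample_basics), **read_opts)
ratings = pd.read_csv(io.StringIO(sample_ratings), **read_opts)

# Inner join: titles without a ratings row (tt9000003 here) drop out.
# validate raises if tconst is unexpectedly duplicated on either side.
rated = basics.merge(ratings, on="tconst", how="inner", validate="one_to_one")

MIN_VOTES = 1000  # arbitrary threshold chosen for this example
top_rated = (
    rated[rated["numVotes"] >= MIN_VOTES]
    .sort_values(["averageRating", "numVotes"], ascending=False)
    .reset_index(drop=True)
)
print(top_rated)
```

Filtering on `numVotes` first keeps titles with a handful of votes from dominating a ranking by `averageRating`.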

### name.basics.tsv.gz - Contains the following information for names:
1. nconst (string) - alphanumeric unique identifier of the name/person
1. primaryName (string) - name by which the person is most often credited
1. birthYear - in YYYY format
1. deathYear - in YYYY format if applicable, else '\N'
1. primaryProfession (array of strings) - the top-3 professions of the person
1. knownForTitles (array of tconsts) - titles the person is known for
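
The `nconst` key links this file to `title.principals`. The sketch below attaches names to credits and expands `knownForTitles` to one row per title. All rows are made up, and it assumes `knownForTitles` is comma-separated. Passing `quoting=csv.QUOTE_NONE` is a precaution so that quote characters inside fields such as `characters` are read literally.

```python
import csv
import io
import pandas as pd

# Made-up rows in the name.basics and title.principals layouts (not real IMDb data).
sample_names = (
    "nconst\tprimaryName\tbirthYear\tdeathYear\tprimaryProfession\tknownForTitles\n"
    "nm9000001\tAlex Example\t1970\t\\N\tactor,producer\ttt9000001,tt9000002\n"
    "nm9000002\tSam Sample\t1935\t2010\tdirector\ttt9000001\n"
)
sample_principals = (
    "tconst\tordering\tnconst\tcategory\tjob\tcharacters\n"
    'tt9000001\t1\tnm9000001\tactor\t\\N\t["Example Hero"]\n'
    "tt9000001\t2\tnm9000002\tdirector\t\\N\t\\N\n"
)
read_opts = dict(
    sep="\t", na_values="\\N", keep_default_na=False, quoting=csv.QUOTE_NONE
)
names = pd.read_csv(io.StringIO(sample_names), **read_opts)
principals = pd.read_csv(io.StringIO(sample_principals), **read_opts)

# Left join keeps every credit even if the person is missing from name.basics.
credits = principals.merge(names[["nconst", "primaryName"]], on="nconst", how="left")
credits = credits.sort_values(["tconst", "ordering"]).reset_index(drop=True)

# One row per (person, known-for title).
known_for = names[["nconst", "knownForTitles"]].dropna(subset=["knownForTitles"]).copy()
known_for["knownForTitles"] = known_for["knownForTitles"].str.split(",")
known_for = known_for.explode("knownForTitles").reset_index(drop=True)
print(credits[["tconst", "category", "primaryName"]])
```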

## Load the data into in-memory pandas DataFrames

In [2]:
import pandas as pd

In [None]:
# The IMDb files use '\N' for missing values. keep_default_na=False stops pandas
# from also reading strings such as "NA" or "None" as missing.
read_opts = dict(sep='\t', na_values='\\N', keep_default_na=False, low_memory=False)

name_basics_raw = pd.read_csv("compressed_tsv/name.basics.tsv.gz", **read_opts)
episode_raw = pd.read_csv("compressed_tsv/title.episode.tsv.gz", **read_opts)
ratings_raw = pd.read_csv("compressed_tsv/title.ratings.tsv.gz", **read_opts)
crew_raw = pd.read_csv("compressed_tsv/title.crew.tsv.gz", **read_opts)
title_basics_raw = pd.read_csv("compressed_tsv/title.basics.tsv.gz", **read_opts)
akas_raw = pd.read_csv("compressed_tsv/title.akas.tsv.gz", **read_opts)
principals_raw = pd.read_csv("compressed_tsv/title.principals.tsv.gz", **read_opts)
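
Loading every file whole can use a lot of memory. As one possible alternative, the sketch below streams a gzipped TSV in chunks and keeps only the rows and columns of interest. `read_filtered` is a hypothetical helper, and the tiny temporary file with made-up rows exists only so the example runs without the real downloads.

```python
import gzip
import os
import tempfile
import pandas as pd

def read_filtered(path, keep, usecols=None, chunksize=100_000):
    """Hypothetical helper: stream a (gzipped) TSV and keep rows where keep(chunk) is True."""
    with pd.read_csv(
        path, sep="\t", na_values="\\N", keep_default_na=False,
        usecols=usecols, chunksize=chunksize,
    ) as reader:
        parts = [chunk[keep(chunk)] for chunk in reader]
    return pd.concat(parts, ignore_index=True)

# Made-up rows in a cut-down title.basics layout (not real IMDb data).
sample = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\n"
    "tt9000001\tmovie\tExample Film\t1999\n"
    "tt9000002\tshort\tShort Example\t2001\n"
    "tt9000003\tmovie\tAnother Example\t\\N\n"
    "tt9000004\ttvEpisode\tEpisode Example\t2010\n"
    "tt9000005\tmovie\tThird Example\t2015\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "title.basics.tsv.gz")
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        fh.write(sample)
    movies_only = read_filtered(
        path,
        keep=lambda chunk: chunk["titleType"] == "movie",
        usecols=["tconst", "titleType", "primaryTitle"],
        chunksize=2,  # tiny on purpose, so the sample spans several chunks
    )

print(movies_only)
```

pandas infers gzip compression from the `.gz` suffix, so the same call works on the downloaded files with a larger `chunksize`.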

In [None]:
from bs4 import BeautifulSoup
import requests
import re
 
 
# Download up to 250 IMDb title-search results filtered to India
url = 'https://www.imdb.com/search/title/?countries=in&locations=India&count=250'
response = requests.get(url)
response.raise_for_status()  # stop early on an HTTP error
soup = BeautifulSoup(response.text, 'lxml')

In [None]:
from bs4 import BeautifulSoup
import requests
import re


# Download the IMDb Top 250 chart page
url = 'https://www.imdb.com/chart/top'
response = requests.get(url)
response.raise_for_status()  # stop early on an HTTP error
soup = BeautifulSoup(response.text, 'lxml')

# All selectors below assume the table-based chart markup
# (td.titleColumn, td.posterColumn) that this cell was written against.
movies = soup.select('td.titleColumn')
links = [a.attrs.get('href') for a in soup.select('td.titleColumn a')]
crew = [a.attrs.get('title') for a in soup.select('td.titleColumn a')]

ratings = [b.attrs.get('data-value')
           for b in soup.select('td.posterColumn span[name=ir]')]

# Vote counts sit in the span named "nv"; the <strong> element in
# td.ratingColumn has no data-value attribute.
votes = [b.attrs.get('data-value')
         for b in soup.select('td.posterColumn span[name=nv]')]

# Collect one dict per movie. The name avoids shadowing the built-in list.
movie_list = []

# Iterate over the movies to extract each movie's details
for index, cell in enumerate(movies):

    # The cell text looks like '1. Title (1994)':
    # split it into 'place', 'title', 'year'
    movie_string = ' '.join(cell.get_text().split())
    match = re.match(r'(\d+)\.\s+(.*)\s+\((\d{4})\)$', movie_string)
    if match is None:
        continue  # skip rows that do not fit the expected pattern
    place, movie_title, year = match.groups()
    data = {"movie_title": movie_title,
            "year": year,
            "place": place,
            "star_cast": crew[index],
            "rating": ratings[index],
            "vote": votes[index],
            "link": links[index]}
    movie_list.append(data)

# Print each movie's details with its rating
for movie in movie_list:
    print(movie['place'], '-', movie['movie_title'], '(' + movie['year'] +
          ') -', 'Starring:', movie['star_cast'], '- Rating:', movie['rating'])
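
To connect scraped rows to the dataset files described earlier, the title identifier can be pulled out of each link. The sketch below assumes the hrefs have the form `/title/tt.../`, which is an assumption about the page markup rather than something this notebook verifies. `tconst_from_href` is a hypothetical helper name and the sample hrefs are made up.

```python
import re

# A tconst is "tt" followed by digits; look for it right after "/title/".
TCONST_RE = re.compile(r"/title/(tt\d+)")

def tconst_from_href(href):
    """Hypothetical helper: return the tconst embedded in a title link, or None."""
    match = TCONST_RE.search(href or "")
    return match.group(1) if match else None

# Made-up hrefs for illustration.
sample_links = ["/title/tt9000001/?ref_=example", "/name/nm9000001/", None]
tconsts = [tconst_from_href(link) for link in sample_links]
print(tconsts)
```

The extracted values can then serve as the join key against the `tconst` column of `title.basics` or `title.ratings`.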

In [None]:
movies