# How I did the analysis

This is just a jupyter notebook containing all the python code I used to do my analysis.

I won't be uploading all the raw data I used, so you won't be able to reproduce the analysis
fully (sorry, but it's not my data to redistribute). But I will upload the final dataset I used
for my analysis, the list i got from Bookmyshow and some other stuff.

Pretty certain no one will go through the whole notebook, so will just summarise everything I 
did:</br>
<br />
* Found out which movies were released in the US from 2013 to August 2018, movies which had at least one American production company involved<br>
<br />
* Total number of releases in that period came up to 5,628.<br>
<br />
* Scraped the data from the-numbers.com, chose them because imdb 
    doesn't have the release dates for all movies for some reason.
    If I remember correctly, boxofficemojo does have the release
    dates, but in a less scraping-friendly format. So decided to go with the-numbers.com. Did use boxofficemojo.com and imdb.com for other things though.<br>
<br />
* Took out movies which earned less than USD 500K at the US box office in that period, keeps the focus of the analysis on mainstream Hollywood output as well as bigger independent movies made outside the Hollywood system.<br>
<br />
* BookMyShow very generously gave me a list of all the English-language movies that were released from 2013 to 2018. This will help us figure out which Hollywood movies were released here or not. There were a few errors in the list (for example, some movies that weren't released like the fifty shades movies were still given release dates of December XX, 2018 or December XX, 2017) but I think I've managed to deal with most of the errors. Used google, critics reviews and news reports of the time to cross-check some of the bookmyshow data. <br>
<br />
* Then I just matched movies from one list (movies that made more than USD 500K at US box office and had at least one American production company involved) to the Bookmyshow list (all English-language movies released in India) to figure out which movies were released or not. <br>
<br />
* Now if you look through the code I try to come up with my own way of matching titles. A colleague told me later I could have used some fuzzy match library to do it instead. Some of my code later uses the fuzzywuzzy library.<br>
<br />
* There was some manual intervention needed when titles couldn't be matched. If movie titles can't be matched, you can't use the bookmyshow list. For example, 'Pirates Of The Carribean: Dead Men Tell No Tales' was the title in the US, 'Pirates Of The Carribean: Salazar’s Revenge' was the title in India, so things like that needed to be accounted for.<br>
<br />
* After the matching was done, wanted to find the genres. First went with imdb's data dump at https://datasets.imdbws.com/ . Later realised that they just have 3 genres listed for each movie and those data dumps are limited in other ways too. Some movies could have as many as 5-6 genres on the main IMDB website. Used the IMDBpy library to get the data I needed from the site.<br>
<br />
* Deleted some movies that were not made primarily in the English language. Used IMDBpy to figure out which movies didn't have English listed as a language. Did some cross-checking here too.<br>
<br />
* Removed documentaries, short films, standup performances, promotional releases etc.<br>
<br />
* Got various production companies for each movie from the-numbers.com , did some online research to figure out which companies are part of the various movie studios and conglomerates. For example, there are three distributors associated with Lionsgate -- "Focus Features","Universal" and "Gramercy". Had to read up a bit to figure out things like that.<br />
<br />
* After all this, just did some basic number crunching to figure out what percentages of movies in terms of genre, US distributor etc. weren't released in India. <br>



In [7]:
#the urls and titles were taken from pages on the-numbers.com
#for example, this was the page for movies produced or co-produced by US companies and released in 2018
    # https://www.the-numbers.com/United-States/movies/year/2018

#will be using the urls later on to scrape pages from the-numbers.com 
    
import csv
from lxml import html

labels = ['number','title','url','year']

with open("data/titles_urls.csv", "w") as filex:   
    wr = csv.writer(filex, delimiter = ',' , quotechar = '"' )
    wr.writerow(labels)
    
file_open_location = 'data/us_movies_2013_2018.html'

with open(file_open_location) as filex:
    sourcex = filex.read()
    treex = html.document_fromstring(sourcex)

    rows = treex.xpath("/html/body/table//tr")
    
    number = 0
    
    for row in rows[1:]:
        number += 1
        title = row.xpath("td[1]/font/a/text()")[0]
        url = row.xpath("td[1]/font/a/@href")[0]
        year = row.xpath("td[7]/text()")[0]
        
        new_row = [number,title,url,year]

        with open("data/titles_urls.csv", "a") as filez:
            wr = csv.writer(filez, delimiter = ',' , quotechar = '"' )
            wr.writerow(new_row)
    
print('DONE AND DONE')

DONE AND DONE


In [None]:
#domestic release schedules for the US were taken from pages on the-numbers.com
    #For example, this is the 2018 release schedule for movies in the US 
    #https://www.the-numbers.com/movies/release-schedule/2018


In [31]:
#in the cell blocks below, i'll be taking out the 2018 movies from the titles_urls csv 
    #the ones that were released after july 31, 2018
    #The thinking being that if a movie was released after that, it might take some time to come to India
    #Don't think things work like that anymore but just wanted to allow for that possibility
    #That if a movie was released after July 31 in the US but hasn't come to India by Sept, there is still
    # a possibility that it could be released in October
    #So july 31 is kind of a cut-off

   #Note - changed the cutoff later from july 31 to august 31, covers the whole summer period then
    
import pandas as pd

title_url_df = pd.read_csv("data/titles_urls.csv")

title_url_df.head()

Unnamed: 0,number,title,url,year
0,1,Frozen,https://www.the-numbers.com/movie/Frozen-(2013...,2013
1,2,Iron Man 3,https://www.the-numbers.com/movie/Iron-Man-3#t...,2013
2,3,Despicable Me 2,https://www.the-numbers.com/movie/Despicable-M...,2013
3,4,The Hobbit: The Desolation of Smaug,https://www.the-numbers.com/movie/Hobbit-The-D...,2013
4,5,The Hunger Games: Catching Fire,https://www.the-numbers.com/movie/Hunger-Games...,2013


In [32]:
title_url_df.dtypes

number     int64
title     object
url       object
year       int64
dtype: object

In [33]:
title_url_df_2018 = title_url_df[title_url_df['year'] == 2018 ]

title_url_df_not_2018 = title_url_df[title_url_df['year'] != 2018 ]

In [34]:
title_url_df_2018.shape

(989, 4)

In [35]:
title_url_df_2018.head()

Unnamed: 0,number,title,url,year
5061,5062,Avengers: Infinity War,https://www.the-numbers.com/movie/Avengers-Inf...,2018
5062,5063,Black Panther,https://www.the-numbers.com/movie/Black-Panthe...,2018
5063,5064,Jurassic World: Fallen Kingdom,https://www.the-numbers.com/movie/Jurassic-Wor...,2018
5064,5065,Incredibles 2,https://www.the-numbers.com/movie/Incredibles-...,2018
5065,5066,Mission: Impossible—Fallout,https://www.the-numbers.com/movie/Mission-Impo...,2018


In [36]:
num_lines = sum(1 for line in open('data/2018_titles_released_US_b4_july_31.txt'))
num_lines

655

In [37]:
with open("data/2018_titles_released_US_b4_july_31.txt", "r", encoding='utf-8') as ins:
    released_b4_july_31_list = []
    for line in ins:
        line = line.replace(u'\xa0', u' ')
        line = line.strip()
        released_b4_july_31_list.append(line)
        
released_b4_july_31_list[3:10]

['Blame',
 'Bob le flambeur',
 'Day of the Dead: Bloodline',
 "Devil's Gate",
 'Django',
 'Goldbuster',
 'Hanson and the Beast']

In [38]:
title_url_df_2018_b4_jul_31 = title_url_df_2018[title_url_df_2018['title']\
                                                .isin(released_b4_july_31_list)]
title_url_df_2018_b4_jul_31.shape

(350, 4)

In [30]:
pd.set_option('display.height', 500)
pd.set_option('display.max_rows', 500)

height has been deprecated.



In [41]:
title_url_df_2018_after_jul_31 = title_url_df_2018[~title_url_df_2018['title']\
                                                .isin(released_b4_july_31_list)]
title_url_df_2018_after_jul_31['title']

5069                                              The Meg
5082                                              The Nun
5083                                    Crazy Rich Asians
5086                                    Christopher Robin
5098                                      Xie Bu Ya Zheng
5099                                                Alpha
5104                                       BlacKkKlansman
5107                                The Spy Who Dumped Me
5110                                         The Predator
5113                                              Mile 22
5116                                          Slender Man
5119                                            Searching
5124                                    The Darkest Minds
5127                                           Peppermint
5129                                The Happytime Murders
5130                                       A Simple Favor
5138    The Guernsey Literary and Potato Peel Pie Society
5140          

In [42]:
title_url_df_shorter = pd.concat([title_url_df_not_2018, title_url_df_2018_b4_jul_31])

In [43]:
title_url_df_shorter.shape

(5411, 4)

In [44]:
title_url_df_shorter.head()

Unnamed: 0,number,title,url,year
0,1,Frozen,https://www.the-numbers.com/movie/Frozen-(2013...,2013
1,2,Iron Man 3,https://www.the-numbers.com/movie/Iron-Man-3#t...,2013
2,3,Despicable Me 2,https://www.the-numbers.com/movie/Despicable-M...,2013
3,4,The Hobbit: The Desolation of Smaug,https://www.the-numbers.com/movie/Hobbit-The-D...,2013
4,5,The Hunger Games: Catching Fire,https://www.the-numbers.com/movie/Hunger-Games...,2013


In [45]:
title_url_df_shorter.to_csv('data/titles_urls_shorter.csv', index=False)

In [None]:
#DO IT ALL OVER AGAIN, WE'RE AT LEAST 150 titles short, there should be at least 500, 
    #we only have 350 2018 titles

In [1]:
import pandas as pd

titles_urls_df = pd.read_csv('data/titles_urls_final.csv',encoding='iso-8859-1', converters={'number': str})
titles_urls_df.head()

Unnamed: 0,number,title,url,year
0,1,Frozen,https://www.the-numbers.com/movie/Frozen-(2013...,2013
1,2,Iron Man 3,https://www.the-numbers.com/movie/Iron-Man-3#t...,2013
2,3,Despicable Me 2,https://www.the-numbers.com/movie/Despicable-M...,2013
3,4,The Hobbit: The Desolation of Smaug,https://www.the-numbers.com/movie/Hobbit-The-D...,2013
4,5,The Hunger Games: Catching Fire,https://www.the-numbers.com/movie/Hunger-Games...,2013


In [2]:
titles_urls_df.shape

(5566, 4)

In [3]:
import os
os.mkdir('movie_pages')

In [5]:
import requests
import csv
from lxml import html
import re
from random import randint
import time
from datetime import datetime,date

headersx = {
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
"accept-encoding": "gzip, deflate, br",
"accept-language": "en-US,en;q=0.9,en-GB;q=0.8",
"cache-control": "no-cache",
"cookie": "IN_HASH=tab%3Dyear; gsScrollPos-236254187=; IN_HASH=tab%3Dyear; gsScrollPos-236254186=; gsScrollPos-236249552=0; gsScrollPos-236252306=0",
"pragma": "no-cache",
"upgrade-insecure-requests": "1",
"user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 Safari/537.36"
}

for index,row in titles_urls_df.iterrows():
    url = row['url']
    number = row['number']
    year = row['year']
    
    try:
        responsex = requests.get(url, headers = headersx, timeout = 5)
        
    except requests.ConnectionError:
        print("OOPS!! Connection Error for title_number {}".format(number))
        print('this is the url: {}'.format(url))
        continue
    except requests.Timeout:
        print("OOPS!! Timeout Error for title_number {}".format(number))
        print('this is the url: {}'.format(url))
        continue
    except requests.RequestException:
        print("OOPS!! General Error for title_number {}".format(number))
        print('this is the url: {}'.format(url))
        continue
    except requests.HTTPError:
        print("OOPS!! HTTP Error for title_number {}".format(number))
        print('this is the url: {}'.format(url))
        continue
        
    else:
        delay = randint(5,9)

        time.sleep(delay)
        
        file_save_locationz =\
        'movie_pages/'+ number + '_' + str(year) + '.html'

        with open(file_save_locationz, 'wb') as f:
            f.write(responsex.content)

        print("done {}".format(file_save_locationz))
                    
print('done and done')

done movie_pages/0001_2013.html
done movie_pages/0002_2013.html
done movie_pages/0003_2013.html
done movie_pages/0004_2013.html
done movie_pages/0005_2013.html
done movie_pages/0006_2013.html
done movie_pages/0007_2013.html
done movie_pages/0008_2013.html
done movie_pages/0009_2013.html
done movie_pages/0010_2013.html
done movie_pages/0011_2013.html
done movie_pages/0012_2013.html
done movie_pages/0013_2013.html
done movie_pages/0014_2013.html
done movie_pages/0015_2013.html
done movie_pages/0016_2013.html
done movie_pages/0017_2013.html
done movie_pages/0018_2013.html
done movie_pages/0019_2013.html
done movie_pages/0020_2013.html
done movie_pages/0021_2013.html
done movie_pages/0022_2013.html
done movie_pages/0023_2013.html
done movie_pages/0024_2013.html
done movie_pages/0025_2013.html
done movie_pages/0026_2013.html
done movie_pages/0027_2013.html
done movie_pages/0028_2013.html
done movie_pages/0029_2013.html
done movie_pages/0030_2013.html
done movie_pages/0031_2013.html
done mov

done movie_pages/0254_2013.html
done movie_pages/0255_2013.html
done movie_pages/0256_2013.html
done movie_pages/0257_2013.html
done movie_pages/0258_2013.html
done movie_pages/0259_2013.html
done movie_pages/0260_2013.html
done movie_pages/0261_2013.html
done movie_pages/0262_2013.html
done movie_pages/0263_2013.html
done movie_pages/0264_2013.html
done movie_pages/0265_2013.html
done movie_pages/0266_2013.html
done movie_pages/0267_2013.html
done movie_pages/0268_2013.html
done movie_pages/0269_2013.html
done movie_pages/0270_2013.html
done movie_pages/0271_2013.html
done movie_pages/0272_2013.html
done movie_pages/0273_2013.html
done movie_pages/0274_2013.html
done movie_pages/0275_2013.html
done movie_pages/0276_2013.html
done movie_pages/0277_2013.html
done movie_pages/0278_2013.html
done movie_pages/0279_2013.html
done movie_pages/0280_2013.html
done movie_pages/0281_2013.html
done movie_pages/0282_2013.html
done movie_pages/0283_2013.html
done movie_pages/0284_2013.html
done mov

done movie_pages/0508_2013.html
done movie_pages/0509_2013.html
done movie_pages/0510_2013.html
done movie_pages/0511_2013.html
done movie_pages/0512_2013.html
done movie_pages/0513_2013.html
done movie_pages/0514_2013.html
done movie_pages/0515_2013.html
done movie_pages/0516_2013.html
done movie_pages/0517_2013.html
done movie_pages/0518_2013.html
done movie_pages/0519_2013.html
done movie_pages/0520_2013.html
done movie_pages/0521_2013.html
done movie_pages/0522_2013.html
done movie_pages/0523_2013.html
done movie_pages/0524_2013.html
done movie_pages/0525_2013.html
done movie_pages/0526_2013.html
done movie_pages/0527_2013.html
done movie_pages/0528_2013.html
done movie_pages/0529_2013.html
done movie_pages/0530_2013.html
done movie_pages/0531_2013.html
done movie_pages/0532_2013.html
done movie_pages/0533_2013.html
done movie_pages/0534_2013.html
done movie_pages/0535_2013.html
done movie_pages/0536_2013.html
done movie_pages/0537_2013.html
done movie_pages/0538_2013.html
done mov

done movie_pages/0765_2013.html
done movie_pages/0766_2013.html
done movie_pages/0767_2013.html
done movie_pages/0768_2013.html
done movie_pages/0769_2013.html
done movie_pages/0770_2013.html
done movie_pages/0771_2013.html
done movie_pages/0772_2013.html
done movie_pages/0773_2013.html
done movie_pages/0774_2013.html
done movie_pages/0775_2013.html
done movie_pages/0776_2013.html
done movie_pages/0777_2013.html
done movie_pages/0778_2013.html
done movie_pages/0779_2013.html
done movie_pages/0780_2013.html
done movie_pages/0781_2013.html
done movie_pages/0782_2013.html
done movie_pages/0783_2013.html
done movie_pages/0784_2013.html
done movie_pages/0785_2013.html
done movie_pages/0786_2013.html
done movie_pages/0787_2014.html
done movie_pages/0788_2014.html
done movie_pages/0789_2014.html
done movie_pages/0790_2014.html
done movie_pages/0791_2014.html
done movie_pages/0792_2014.html
done movie_pages/0793_2014.html
done movie_pages/0794_2014.html
done movie_pages/0795_2014.html
done mov

done movie_pages/1022_2014.html
done movie_pages/1023_2014.html
done movie_pages/1024_2014.html
done movie_pages/1025_2014.html
done movie_pages/1026_2014.html
done movie_pages/1027_2014.html
done movie_pages/1028_2014.html
done movie_pages/1029_2014.html
done movie_pages/1030_2014.html
done movie_pages/1031_2014.html
done movie_pages/1032_2014.html
done movie_pages/1033_2014.html
done movie_pages/1034_2014.html
done movie_pages/1035_2014.html
done movie_pages/1036_2014.html
done movie_pages/1037_2014.html
done movie_pages/1038_2014.html
done movie_pages/1039_2014.html
done movie_pages/1040_2014.html
done movie_pages/1041_2014.html
done movie_pages/1042_2014.html
done movie_pages/1043_2014.html
done movie_pages/1044_2014.html
done movie_pages/1045_2014.html
done movie_pages/1046_2014.html
done movie_pages/1047_2014.html
done movie_pages/1048_2014.html
done movie_pages/1049_2014.html
done movie_pages/1050_2014.html
done movie_pages/1051_2014.html
done movie_pages/1052_2014.html
done mov

done movie_pages/1279_2014.html
done movie_pages/1280_2014.html
done movie_pages/1281_2014.html
done movie_pages/1282_2014.html
done movie_pages/1283_2014.html
done movie_pages/1284_2014.html
done movie_pages/1285_2014.html
done movie_pages/1286_2014.html
done movie_pages/1287_2014.html
done movie_pages/1288_2014.html
done movie_pages/1289_2014.html
done movie_pages/1290_2014.html
done movie_pages/1291_2014.html
done movie_pages/1292_2014.html
done movie_pages/1293_2014.html
done movie_pages/1294_2014.html
done movie_pages/1295_2014.html
done movie_pages/1296_2014.html
done movie_pages/1297_2014.html
done movie_pages/1298_2014.html
done movie_pages/1299_2014.html
done movie_pages/1300_2014.html
done movie_pages/1301_2014.html
done movie_pages/1302_2014.html
done movie_pages/1303_2014.html
done movie_pages/1304_2014.html
done movie_pages/1305_2014.html
done movie_pages/1306_2014.html
done movie_pages/1307_2014.html
done movie_pages/1308_2014.html
done movie_pages/1309_2014.html
done mov

done movie_pages/1536_2014.html
done movie_pages/1537_2014.html
done movie_pages/1538_2014.html
done movie_pages/1539_2014.html
done movie_pages/1540_2014.html
done movie_pages/1541_2014.html
done movie_pages/1542_2014.html
done movie_pages/1543_2014.html
done movie_pages/1544_2014.html
done movie_pages/1545_2014.html
done movie_pages/1546_2014.html
done movie_pages/1547_2014.html
done movie_pages/1548_2014.html
done movie_pages/1549_2014.html
done movie_pages/1550_2014.html
done movie_pages/1551_2014.html
done movie_pages/1552_2014.html
done movie_pages/1553_2014.html
done movie_pages/1554_2014.html
done movie_pages/1555_2014.html
done movie_pages/1556_2014.html
done movie_pages/1557_2014.html
done movie_pages/1558_2014.html
done movie_pages/1559_2014.html
done movie_pages/1560_2014.html
done movie_pages/1561_2014.html
done movie_pages/1562_2014.html
done movie_pages/1563_2014.html
done movie_pages/1564_2014.html
done movie_pages/1565_2014.html
done movie_pages/1566_2014.html
done mov

done movie_pages/1793_2015.html
done movie_pages/1794_2015.html
done movie_pages/1795_2015.html
done movie_pages/1796_2015.html
done movie_pages/1797_2015.html
done movie_pages/1798_2015.html
done movie_pages/1799_2015.html
done movie_pages/1800_2015.html
done movie_pages/1801_2015.html
done movie_pages/1802_2015.html
done movie_pages/1803_2015.html
done movie_pages/1804_2015.html
done movie_pages/1805_2015.html
done movie_pages/1806_2015.html
done movie_pages/1807_2015.html
done movie_pages/1808_2015.html
done movie_pages/1809_2015.html
done movie_pages/1810_2015.html
done movie_pages/1811_2015.html
done movie_pages/1812_2015.html
done movie_pages/1813_2015.html
done movie_pages/1814_2015.html
done movie_pages/1815_2015.html
done movie_pages/1816_2015.html
done movie_pages/1817_2015.html
done movie_pages/1818_2015.html
done movie_pages/1819_2015.html
done movie_pages/1820_2015.html
done movie_pages/1821_2015.html
done movie_pages/1822_2015.html
done movie_pages/1823_2015.html
done mov

done movie_pages/2050_2015.html
done movie_pages/2051_2015.html
done movie_pages/2052_2015.html
done movie_pages/2053_2015.html
done movie_pages/2054_2015.html
done movie_pages/2055_2015.html
done movie_pages/2056_2015.html
done movie_pages/2057_2015.html
done movie_pages/2058_2015.html
done movie_pages/2059_2015.html
done movie_pages/2060_2015.html
done movie_pages/2061_2015.html
done movie_pages/2062_2015.html
done movie_pages/2063_2015.html
done movie_pages/2064_2015.html
done movie_pages/2065_2015.html
done movie_pages/2066_2015.html
done movie_pages/2067_2015.html
done movie_pages/2068_2015.html
done movie_pages/2069_2015.html
done movie_pages/2070_2015.html
done movie_pages/2071_2015.html
done movie_pages/2072_2015.html
done movie_pages/2073_2015.html
done movie_pages/2074_2015.html
done movie_pages/2075_2015.html
done movie_pages/2076_2015.html
done movie_pages/2077_2015.html
done movie_pages/2078_2015.html
done movie_pages/2079_2015.html
done movie_pages/2080_2015.html
done mov

done movie_pages/2307_2015.html
done movie_pages/2308_2015.html
done movie_pages/2309_2015.html
done movie_pages/2310_2015.html
done movie_pages/2311_2015.html
done movie_pages/2312_2015.html
done movie_pages/2313_2015.html
done movie_pages/2314_2015.html
done movie_pages/2315_2015.html
done movie_pages/2316_2015.html
done movie_pages/2317_2015.html
done movie_pages/2318_2015.html
done movie_pages/2319_2015.html
done movie_pages/2320_2015.html
done movie_pages/2321_2015.html
done movie_pages/2322_2015.html
done movie_pages/2323_2015.html
done movie_pages/2324_2015.html
done movie_pages/2325_2015.html
done movie_pages/2326_2015.html
done movie_pages/2327_2015.html
done movie_pages/2328_2015.html
done movie_pages/2329_2015.html
done movie_pages/2330_2015.html
done movie_pages/2331_2015.html
done movie_pages/2332_2015.html
done movie_pages/2333_2015.html
done movie_pages/2334_2015.html
done movie_pages/2335_2015.html
done movie_pages/2336_2015.html
done movie_pages/2337_2015.html
done mov

done movie_pages/2564_2015.html
done movie_pages/2565_2015.html
done movie_pages/2566_2015.html
done movie_pages/2567_2015.html
done movie_pages/2568_2015.html
done movie_pages/2569_2015.html
done movie_pages/2570_2015.html
done movie_pages/2571_2015.html
done movie_pages/2572_2015.html
done movie_pages/2573_2015.html
done movie_pages/2574_2015.html
done movie_pages/2575_2015.html
done movie_pages/2576_2015.html
done movie_pages/2577_2015.html
done movie_pages/2578_2015.html
done movie_pages/2579_2015.html
done movie_pages/2580_2015.html
done movie_pages/2581_2015.html
done movie_pages/2582_2015.html
done movie_pages/2583_2015.html
done movie_pages/2584_2015.html
done movie_pages/2585_2015.html
done movie_pages/2586_2015.html
done movie_pages/2587_2015.html
done movie_pages/2588_2015.html
done movie_pages/2589_2015.html
done movie_pages/2590_2015.html
done movie_pages/2591_2015.html
done movie_pages/2592_2015.html
done movie_pages/2593_2015.html
done movie_pages/2594_2015.html
done mov

done movie_pages/2821_2015.html
done movie_pages/2822_2015.html
done movie_pages/2823_2015.html
done movie_pages/2824_2015.html
done movie_pages/2825_2015.html
done movie_pages/2826_2015.html
done movie_pages/2827_2015.html
done movie_pages/2828_2015.html
done movie_pages/2829_2015.html
done movie_pages/2830_2015.html
done movie_pages/2831_2015.html
done movie_pages/2832_2015.html
done movie_pages/2833_2015.html
done movie_pages/2834_2015.html
done movie_pages/2835_2015.html
done movie_pages/2836_2015.html
done movie_pages/2837_2015.html
done movie_pages/2838_2015.html
done movie_pages/2839_2015.html
done movie_pages/2840_2015.html
done movie_pages/2841_2015.html
done movie_pages/2842_2015.html
done movie_pages/2843_2015.html
done movie_pages/2844_2015.html
done movie_pages/2845_2015.html
done movie_pages/2846_2015.html
done movie_pages/2847_2015.html
done movie_pages/2848_2015.html
done movie_pages/2849_2015.html
done movie_pages/2850_2015.html
done movie_pages/2851_2015.html
done mov

done movie_pages/3075_2016.html
done movie_pages/3076_2016.html
done movie_pages/3077_2016.html
done movie_pages/3078_2016.html
done movie_pages/3079_2016.html
done movie_pages/3080_2016.html
done movie_pages/3081_2016.html
done movie_pages/3082_2016.html
done movie_pages/3083_2016.html
done movie_pages/3084_2016.html
done movie_pages/3085_2016.html
done movie_pages/3086_2016.html
done movie_pages/3087_2016.html
done movie_pages/3088_2016.html
done movie_pages/3089_2016.html
done movie_pages/3090_2016.html
done movie_pages/3091_2016.html
done movie_pages/3092_2016.html
done movie_pages/3093_2016.html
done movie_pages/3094_2016.html
done movie_pages/3095_2016.html
done movie_pages/3096_2016.html
done movie_pages/3097_2016.html
done movie_pages/3098_2016.html
done movie_pages/3099_2016.html
done movie_pages/3100_2016.html
done movie_pages/3101_2016.html
done movie_pages/3102_2016.html
done movie_pages/3103_2016.html
done movie_pages/3104_2016.html
done movie_pages/3105_2016.html
done mov

done movie_pages/3329_2016.html
done movie_pages/3330_2016.html
done movie_pages/3331_2016.html
done movie_pages/3332_2016.html
done movie_pages/3333_2016.html
done movie_pages/3334_2016.html
done movie_pages/3335_2016.html
done movie_pages/3336_2016.html
done movie_pages/3337_2016.html
done movie_pages/3338_2016.html
done movie_pages/3339_2016.html
done movie_pages/3340_2016.html
done movie_pages/3341_2016.html
done movie_pages/3342_2016.html
done movie_pages/3343_2016.html
done movie_pages/3344_2016.html
done movie_pages/3345_2016.html
done movie_pages/3346_2016.html
done movie_pages/3347_2016.html
done movie_pages/3348_2016.html
done movie_pages/3349_2016.html
done movie_pages/3350_2016.html
done movie_pages/3351_2016.html
done movie_pages/3352_2016.html
done movie_pages/3353_2016.html
done movie_pages/3354_2016.html
done movie_pages/3355_2016.html
done movie_pages/3356_2016.html
done movie_pages/3357_2016.html
done movie_pages/3358_2016.html
done movie_pages/3359_2016.html
done mov

done movie_pages/3586_2016.html
done movie_pages/3587_2016.html
done movie_pages/3588_2016.html
done movie_pages/3589_2016.html
done movie_pages/3590_2016.html
done movie_pages/3591_2016.html
done movie_pages/3592_2016.html
done movie_pages/3593_2016.html
done movie_pages/3594_2016.html
done movie_pages/3595_2016.html
done movie_pages/3596_2016.html
done movie_pages/3597_2016.html
done movie_pages/3598_2016.html
done movie_pages/3599_2016.html
done movie_pages/3600_2016.html
done movie_pages/3601_2016.html
done movie_pages/3602_2016.html
done movie_pages/3603_2016.html
done movie_pages/3604_2016.html
done movie_pages/3605_2016.html
done movie_pages/3606_2016.html
done movie_pages/3607_2016.html
done movie_pages/3608_2016.html
done movie_pages/3609_2016.html
done movie_pages/3610_2016.html
done movie_pages/3611_2016.html
done movie_pages/3612_2016.html
done movie_pages/3613_2016.html
done movie_pages/3614_2016.html
done movie_pages/3615_2016.html
done movie_pages/3616_2016.html
done mov

done movie_pages/3843_2016.html
done movie_pages/3844_2016.html
done movie_pages/3845_2016.html
done movie_pages/3846_2016.html
done movie_pages/3847_2016.html
done movie_pages/3848_2016.html
done movie_pages/3849_2016.html
done movie_pages/3850_2016.html
done movie_pages/3851_2016.html
done movie_pages/3852_2016.html
done movie_pages/3853_2016.html
done movie_pages/3854_2016.html
done movie_pages/3855_2016.html
done movie_pages/3856_2016.html
done movie_pages/3857_2017.html
done movie_pages/3858_2017.html
done movie_pages/3859_2017.html
done movie_pages/3860_2017.html
done movie_pages/3861_2017.html
done movie_pages/3862_2017.html
done movie_pages/3863_2017.html
done movie_pages/3864_2017.html
done movie_pages/3865_2017.html
done movie_pages/3866_2017.html
done movie_pages/3867_2017.html
done movie_pages/3868_2017.html
done movie_pages/3869_2017.html
done movie_pages/3870_2017.html
done movie_pages/3871_2017.html
done movie_pages/3872_2017.html
done movie_pages/3873_2017.html
done mov

done movie_pages/4097_2017.html
done movie_pages/4098_2017.html
done movie_pages/4099_2017.html
done movie_pages/4100_2017.html
done movie_pages/4101_2017.html
done movie_pages/4102_2017.html
done movie_pages/4103_2017.html
done movie_pages/4104_2017.html
done movie_pages/4105_2017.html
done movie_pages/4106_2017.html
done movie_pages/4107_2017.html
done movie_pages/4108_2017.html
done movie_pages/4109_2017.html
done movie_pages/4110_2017.html
done movie_pages/4111_2017.html
done movie_pages/4112_2017.html
done movie_pages/4113_2017.html
done movie_pages/4114_2017.html
done movie_pages/4115_2017.html
done movie_pages/4116_2017.html
done movie_pages/4117_2017.html
done movie_pages/4118_2017.html
done movie_pages/4119_2017.html
done movie_pages/4120_2017.html
done movie_pages/4121_2017.html
done movie_pages/4122_2017.html
done movie_pages/4123_2017.html
done movie_pages/4124_2017.html
done movie_pages/4125_2017.html
done movie_pages/4126_2017.html
done movie_pages/4127_2017.html
done mov

done movie_pages/4354_2017.html
done movie_pages/4355_2017.html
done movie_pages/4356_2017.html
done movie_pages/4357_2017.html
done movie_pages/4358_2017.html
done movie_pages/4359_2017.html
done movie_pages/4360_2017.html
done movie_pages/4361_2017.html
done movie_pages/4362_2017.html
done movie_pages/4363_2017.html
done movie_pages/4364_2017.html
done movie_pages/4365_2017.html
done movie_pages/4366_2017.html
done movie_pages/4367_2017.html
done movie_pages/4368_2017.html
done movie_pages/4369_2017.html
done movie_pages/4370_2017.html
done movie_pages/4371_2017.html
done movie_pages/4372_2017.html
done movie_pages/4373_2017.html
done movie_pages/4374_2017.html
done movie_pages/4375_2017.html
done movie_pages/4376_2017.html
done movie_pages/4377_2017.html
done movie_pages/4378_2017.html
done movie_pages/4379_2017.html
done movie_pages/4380_2017.html
done movie_pages/4381_2017.html
done movie_pages/4382_2017.html
done movie_pages/4383_2017.html
done movie_pages/4384_2017.html
done mov

done movie_pages/4611_2017.html
done movie_pages/4612_2017.html
done movie_pages/4613_2017.html
done movie_pages/4614_2017.html
done movie_pages/4615_2017.html
done movie_pages/4616_2017.html
done movie_pages/4617_2017.html
done movie_pages/4618_2017.html
done movie_pages/4619_2017.html
done movie_pages/4620_2017.html
done movie_pages/4621_2017.html
done movie_pages/4622_2017.html
done movie_pages/4623_2017.html
done movie_pages/4624_2017.html
done movie_pages/4625_2017.html
done movie_pages/4626_2017.html
done movie_pages/4627_2017.html
done movie_pages/4628_2017.html
done movie_pages/4629_2017.html
done movie_pages/4630_2017.html
done movie_pages/4631_2017.html
done movie_pages/4632_2017.html
done movie_pages/4633_2017.html
done movie_pages/4634_2017.html
done movie_pages/4635_2017.html
done movie_pages/4636_2017.html
done movie_pages/4637_2017.html
done movie_pages/4638_2017.html
done movie_pages/4639_2017.html
done movie_pages/4640_2017.html
done movie_pages/4641_2017.html
done mov

done movie_pages/4865_2017.html
done movie_pages/4866_2017.html
done movie_pages/4867_2017.html
done movie_pages/4868_2017.html
done movie_pages/4869_2017.html
done movie_pages/4870_2017.html
done movie_pages/4871_2017.html
done movie_pages/4872_2017.html
done movie_pages/4873_2017.html
done movie_pages/4874_2017.html
done movie_pages/4875_2017.html
done movie_pages/4876_2017.html
done movie_pages/4877_2017.html
done movie_pages/4878_2017.html
done movie_pages/4879_2017.html
done movie_pages/4880_2017.html
done movie_pages/4881_2017.html
done movie_pages/4882_2017.html
done movie_pages/4883_2017.html
done movie_pages/4884_2017.html
done movie_pages/4885_2017.html
done movie_pages/4886_2017.html
done movie_pages/4887_2017.html
done movie_pages/4888_2017.html
done movie_pages/4889_2017.html
done movie_pages/4890_2017.html
done movie_pages/4891_2017.html
done movie_pages/4892_2017.html
done movie_pages/4893_2017.html
done movie_pages/4894_2017.html
done movie_pages/4895_2017.html
done mov

done movie_pages/5122_2018.html
done movie_pages/5123_2018.html
done movie_pages/5124_2018.html
done movie_pages/5125_2018.html
done movie_pages/5126_2018.html
done movie_pages/5127_2018.html
done movie_pages/5128_2018.html
done movie_pages/5129_2018.html
done movie_pages/5130_2018.html
done movie_pages/5131_2018.html
done movie_pages/5132_2018.html
done movie_pages/5133_2018.html
done movie_pages/5134_2018.html
done movie_pages/5135_2018.html
done movie_pages/5136_2018.html
done movie_pages/5137_2018.html
done movie_pages/5138_2018.html
done movie_pages/5139_2018.html
done movie_pages/5140_2018.html
done movie_pages/5141_2018.html
done movie_pages/5142_2018.html
done movie_pages/5143_2018.html
done movie_pages/5144_2018.html
done movie_pages/5145_2018.html
done movie_pages/5146_2018.html
done movie_pages/5147_2018.html
done movie_pages/5148_2018.html
done movie_pages/5149_2018.html
done movie_pages/5150_2018.html
done movie_pages/5151_2018.html
done movie_pages/5152_2018.html
done mov

done movie_pages/5379_2018.html
done movie_pages/5380_2018.html
done movie_pages/5381_2018.html
done movie_pages/5382_2018.html
done movie_pages/5383_2018.html
done movie_pages/5384_2018.html
done movie_pages/5385_2018.html
done movie_pages/5386_2018.html
done movie_pages/5387_2018.html
done movie_pages/5388_2018.html
done movie_pages/5389_2018.html
done movie_pages/5390_2018.html
done movie_pages/5391_2018.html
done movie_pages/5392_2018.html
done movie_pages/5393_2018.html
done movie_pages/5394_2018.html
done movie_pages/5395_2018.html
done movie_pages/5396_2018.html
done movie_pages/5397_2018.html
done movie_pages/5398_2018.html
done movie_pages/5399_2018.html
done movie_pages/5400_2018.html
done movie_pages/5401_2018.html
done movie_pages/5402_2018.html
done movie_pages/5403_2018.html
done movie_pages/5404_2018.html
done movie_pages/5405_2018.html
done movie_pages/5406_2018.html
done movie_pages/5407_2018.html
done movie_pages/5408_2018.html
done movie_pages/5409_2018.html
done mov

In [25]:
import glob
import csv
from lxml import html
import re
from datetime import datetime

labels = ['number','year','title','domestic_box_office','international_box_office',
          'production_budget','domestic_release_date','domestic_distributor','mpaa_rating','franchise',
          'source','genre','production_method','creative_type','production_companies',
          'production_countries']

with open("data/scraped_data.csv", "w") as filex:   
    wr = csv.writer(filex, delimiter = ',' , quotechar = '"' )
    wr.writerow(labels)

search_string = 'movie_pages/*_*.html'

for filename in glob.glob(search_string):
    
    print(filename)
    
    numberx = (re.findall(r"movie_pages/(\d{4})_20\d{2}.html", filename))[0]
    
    with open(filename, encoding='utf-8') as filex:
        
        sourcex = filex.read()
        treex = html.document_fromstring(sourcex)
        
        title_year_raw = treex.xpath("//*[@id='main']/div/h1/text()")[0]
        
        year = (re.findall(r".*\s\((.*)\)", title_year_raw))[0]
        
        title = (re.findall(r"(.*)\s\(.*\)", title_year_raw))[0]
        
        domestic_box_office_raw = treex.xpath("//*[@id='movie_finances']/tr[2]/td[2]/text()")[0]
        domestic_box_office = domestic_box_office_raw.replace(",","").replace("$","")
        domestic_box_office = int(domestic_box_office)
            
        #sometimes domestic box office can be zero
        
        international_box_office_raw = treex.xpath("//*[@id='movie_finances']/tr[2]/td[2]/text()")[0]
        international_box_office = international_box_office_raw.replace(",","").replace("$","")
        international_box_office = int(international_box_office)
        
        try:
            production_budget_raw = treex.xpath\
("//*[@id='summary']//b[contains(text(),'Production') and contains(text(),'Budget')]/following::td[1]/text()")[0]
        
        except:
            print('problem with production budget on filename {}'.format(filename))
            production_budget = 0
            
        else:
            production_budget = production_budget_raw.replace(",","").replace("$","")
            production_budget = int(production_budget)
            
        
        domestic_release_date = ""

        for number in [1,2,3,4]:
            string =\
"//*[@id='summary']//b[contains(text(),'Domestic') and contains(text(),'Releases')]/following::td[1]/text()["+\
str(number) + "]"
            try:
                domestic_release_date_raw = treex.xpath(string)[0]
            except:
                continue
            else:
                if 'Wide' in domestic_release_date_raw:
                    try:
                        domestic_release_date = (re.findall(r"(.*)\,.*", domestic_release_date_raw))[0]
                        domestic_release_date = domestic_release_date[:-2]
                        domestic_release_date = domestic_release_date + " " + year
                    
                        domestic_release_date = datetime.strptime(domestic_release_date, '%B %d %Y')\
                                                        .strftime('%Y-%m-%d')
                    except:
                        domestic_release_date = ""
                        
                    stringx = \
"//*[@id='summary']//b[contains(text(),'Domestic') and contains(text(),'Releases')]/following::td[1]/a["+\
str(number) + "]/text()"
                    try:
                        domestic_distributor_raw = treex.xpath(stringx)[0]
                    except:
                        domestic_distributor = ""
                    else:
                        domestic_distributor = domestic_distributor_raw
                            
                        
        if domestic_release_date == "":
            
            for number in [1,2,3,4]:
                string =\
"//*[@id='summary']//b[contains(text(),'Domestic') and contains(text(),'Releases')]/following::td[1]/text()["+\
    str(number) + "]"
                try:
                    domestic_release_date_raw = treex.xpath(string)[0]
                except:
                    continue
                else:
                    if 'Limited' in domestic_release_date_raw:
                        try:
                            domestic_release_date = (re.findall(r"(.*)\,.*", domestic_release_date_raw))[0]
                            domestic_release_date = domestic_release_date[:-2]
                            domestic_release_date = domestic_release_date + " " + year
                        
                            domestic_release_date = datetime.strptime(domestic_release_date, '%B %d %Y')\
                                                            .strftime('%Y-%m-%d')
                        except:
                            domestic_release_date = ""
                            
                        stringz = \
"//*[@id='summary']//b[contains(text(),'Domestic') and contains(text(),'Releases')]/following::td[1]/a["+\
str(number) + "]/text()"
                        try:
                            domestic_distributor_raw = treex.xpath(stringz)[0]
                        except:
                            domestic_distributor = ""
                        else:
                            domestic_distributor = domestic_distributor_raw
                            
                            
        try:
            mpaa_rating =\
                treex.xpath\
("//*[@id='summary']//b[contains(text(),'MPAA') and contains(text(),'Rating')]/following::td[1]/a/text()")[0]
        except:
            mpaa_rating = ""
    
        try:
            franchise_raw = treex.xpath\
("//*[@id='summary']//b[contains(text(),'Franchise')]/following::td[1]/a/text()")[0]
        
        except:
            franchise = ""
        else:
            franchise = franchise_raw
        
        try:
            source = treex.xpath\
    ("//*[@id='summary']//b[contains(text(),'Source')]/following::td[1]/a/text()")[0]
        except:
            source = ""
        
        try:
            genre = treex.xpath\
    ("//*[@id='summary']//b[contains(text(),'Genre')]/following::td[1]/a/text()")[0]
        except:
            genre = ""
            
        try:
            production_method = treex.xpath\
    ("//*[@id='summary']//b[contains(text(),'Production') and contains(text(),'Method')]/following::td[1]/a/text()")[0]
        except:
            production_method = ""
        
        try:
            creative_type = treex.xpath\
    ("//*[@id='summary']//b[contains(text(),'Creative') and contains(text(),'Type')]/following::td[1]/a/text()")[0]
        except:
            creative_type = ""
            
        try:
            production_companies_raw = treex.xpath\
("//*[@id='summary']//b[contains(text(),'Production') and contains(text(),'Companies')]/following::td[1]//a")
            production_companies_list = []
            for a in production_companies_raw:
                production_company_single = a.xpath("text()")[0]
                production_companies_list.append(production_company_single)
                
        except:
            production_companies = ""
        else:
            production_companies = ",".join(production_companies_list)

        
        try:
            production_countries_raw = treex.xpath\
("//*[@id='summary']//b[contains(text(),'Production') and contains(text(),'Countries')]/following::td[1]//a")
            production_countries_list = []
            for a in production_countries_raw:
                production_country_single = a.xpath("text()")[0]
                production_countries_list.append(production_country_single)

        except:
            production_countries = ""
            
        else:
            production_countries = ",".join(production_countries_list)
        
        new_row = [numberx, year, title, domestic_box_office, international_box_office,
                   production_budget, domestic_release_date, domestic_distributor, mpaa_rating, franchise,
                   source, genre, production_method, creative_type, production_companies,
                   production_countries]

        with open("data/scraped_data.csv", "a") as filez:
            wr = csv.writer(filez, delimiter = ',' , quotechar = '"' )
            wr.writerow(new_row)
    
print("DONE AND DONE")

movie_pages/4025_2017.html
problem with production budget on filename movie_pages/4025_2017.html
movie_pages/0269_2013.html
problem with production budget on filename movie_pages/0269_2013.html
movie_pages/4428_2017.html
problem with production budget on filename movie_pages/4428_2017.html
movie_pages/0650_2013.html
problem with production budget on filename movie_pages/0650_2013.html
movie_pages/1882_2015.html
problem with production budget on filename movie_pages/1882_2015.html
movie_pages/1187_2014.html
problem with production budget on filename movie_pages/1187_2014.html
movie_pages/2230_2015.html
problem with production budget on filename movie_pages/2230_2015.html
movie_pages/0559_2013.html
problem with production budget on filename movie_pages/0559_2013.html
movie_pages/0164_2013.html
movie_pages/1451_2014.html
problem with production budget on filename movie_pages/1451_2014.html
movie_pages/4608_2017.html
problem with production budget on filename movie_pages/4608_2017.html
mov

movie_pages/2772_2015.html
movie_pages/3157_2016.html
problem with production budget on filename movie_pages/3157_2016.html
movie_pages/1476_2014.html
problem with production budget on filename movie_pages/1476_2014.html
movie_pages/5240_2018.html
problem with production budget on filename movie_pages/5240_2018.html
movie_pages/2714_2015.html
movie_pages/0504_2013.html
problem with production budget on filename movie_pages/0504_2013.html
movie_pages/2337_2015.html
movie_pages/2927_2016.html
movie_pages/4406_2017.html
problem with production budget on filename movie_pages/4406_2017.html
movie_pages/0870_2014.html
movie_pages/1173_2014.html
problem with production budget on filename movie_pages/1173_2014.html
movie_pages/1362_2014.html
problem with production budget on filename movie_pages/1362_2014.html
movie_pages/1370_2014.html
problem with production budget on filename movie_pages/1370_2014.html
movie_pages/0618_2013.html
problem with production budget on filename movie_pages/0618_20

movie_pages/0908_2014.html
movie_pages/1125_2014.html
movie_pages/1587_2014.html
problem with production budget on filename movie_pages/1587_2014.html
movie_pages/0323_2013.html
problem with production budget on filename movie_pages/0323_2013.html
movie_pages/3293_2016.html
problem with production budget on filename movie_pages/3293_2016.html
movie_pages/2121_2015.html
problem with production budget on filename movie_pages/2121_2015.html
movie_pages/5367_2018.html
problem with production budget on filename movie_pages/5367_2018.html
movie_pages/5296_2018.html
problem with production budget on filename movie_pages/5296_2018.html
movie_pages/4278_2017.html
problem with production budget on filename movie_pages/4278_2017.html
movie_pages/1245_2014.html
problem with production budget on filename movie_pages/1245_2014.html
movie_pages/5125_2018.html
problem with production budget on filename movie_pages/5125_2018.html
movie_pages/4139_2017.html
problem with production budget on filename mov

problem with production budget on filename movie_pages/4024_2017.html
movie_pages/0039_2013.html
movie_pages/1805_2015.html
movie_pages/0052_2013.html
movie_pages/4560_2017.html
problem with production budget on filename movie_pages/4560_2017.html
movie_pages/2124_2015.html
problem with production budget on filename movie_pages/2124_2015.html
movie_pages/5306_2018.html
problem with production budget on filename movie_pages/5306_2018.html
movie_pages/1045_2014.html
problem with production budget on filename movie_pages/1045_2014.html
movie_pages/0873_2014.html
problem with production budget on filename movie_pages/0873_2014.html
movie_pages/4832_2017.html
problem with production budget on filename movie_pages/4832_2017.html
movie_pages/1753_2015.html
movie_pages/5383_2018.html
problem with production budget on filename movie_pages/5383_2018.html
movie_pages/5501_2018.html
problem with production budget on filename movie_pages/5501_2018.html
movie_pages/1548_2014.html
problem with produc

problem with production budget on filename movie_pages/1157_2014.html
movie_pages/4770_2017.html
problem with production budget on filename movie_pages/4770_2017.html
movie_pages/4446_2017.html
problem with production budget on filename movie_pages/4446_2017.html
movie_pages/0160_2013.html
movie_pages/2936_2016.html
movie_pages/2956_2016.html
movie_pages/5356_2018.html
problem with production budget on filename movie_pages/5356_2018.html
movie_pages/1477_2014.html
problem with production budget on filename movie_pages/1477_2014.html
movie_pages/4831_2017.html
problem with production budget on filename movie_pages/4831_2017.html
movie_pages/1360_2014.html
problem with production budget on filename movie_pages/1360_2014.html
movie_pages/5402_2018.html
problem with production budget on filename movie_pages/5402_2018.html
movie_pages/0503_2013.html
problem with production budget on filename movie_pages/0503_2013.html
movie_pages/3016_2016.html
movie_pages/3143_2016.html
problem with produc

movie_pages/1988_2015.html
problem with production budget on filename movie_pages/1988_2015.html
movie_pages/3819_2016.html
problem with production budget on filename movie_pages/3819_2016.html
movie_pages/1016_2014.html
problem with production budget on filename movie_pages/1016_2014.html
movie_pages/4823_2017.html
problem with production budget on filename movie_pages/4823_2017.html
movie_pages/0759_2013.html
problem with production budget on filename movie_pages/0759_2013.html
movie_pages/4949_2017.html
problem with production budget on filename movie_pages/4949_2017.html
movie_pages/0727_2013.html
problem with production budget on filename movie_pages/0727_2013.html
movie_pages/0102_2013.html
movie_pages/3932_2017.html
problem with production budget on filename movie_pages/3932_2017.html
movie_pages/3907_2017.html
movie_pages/2378_2015.html
problem with production budget on filename movie_pages/2378_2015.html
movie_pages/0483_2013.html
problem with production budget on filename mov

movie_pages/4181_2017.html
problem with production budget on filename movie_pages/4181_2017.html
movie_pages/0717_2013.html
problem with production budget on filename movie_pages/0717_2013.html
movie_pages/3003_2016.html
movie_pages/2207_2015.html
problem with production budget on filename movie_pages/2207_2015.html
movie_pages/1403_2014.html
problem with production budget on filename movie_pages/1403_2014.html
movie_pages/0645_2013.html
problem with production budget on filename movie_pages/0645_2013.html
movie_pages/0577_2013.html
problem with production budget on filename movie_pages/0577_2013.html
movie_pages/2380_2015.html
problem with production budget on filename movie_pages/2380_2015.html
movie_pages/2193_2015.html
problem with production budget on filename movie_pages/2193_2015.html
movie_pages/5289_2018.html
problem with production budget on filename movie_pages/5289_2018.html
movie_pages/4325_2017.html
problem with production budget on filename movie_pages/4325_2017.html
mov

movie_pages/1634_2014.html
problem with production budget on filename movie_pages/1634_2014.html
movie_pages/4518_2017.html
problem with production budget on filename movie_pages/4518_2017.html
movie_pages/2985_2016.html
movie_pages/1651_2014.html
problem with production budget on filename movie_pages/1651_2014.html
movie_pages/1262_2014.html
problem with production budget on filename movie_pages/1262_2014.html
movie_pages/5261_2018.html
movie_pages/5418_2018.html
problem with production budget on filename movie_pages/5418_2018.html
movie_pages/1380_2014.html
problem with production budget on filename movie_pages/1380_2014.html
movie_pages/1425_2014.html
problem with production budget on filename movie_pages/1425_2014.html
movie_pages/5434_2018.html
movie_pages/4987_2017.html
problem with production budget on filename movie_pages/4987_2017.html
movie_pages/1459_2014.html
problem with production budget on filename movie_pages/1459_2014.html
movie_pages/1378_2014.html
problem with produc

movie_pages/0188_2013.html
movie_pages/4763_2017.html
problem with production budget on filename movie_pages/4763_2017.html
movie_pages/0658_2013.html
problem with production budget on filename movie_pages/0658_2013.html
movie_pages/4720_2017.html
problem with production budget on filename movie_pages/4720_2017.html
movie_pages/0226_2013.html
problem with production budget on filename movie_pages/0226_2013.html
movie_pages/3267_2016.html
problem with production budget on filename movie_pages/3267_2016.html
movie_pages/1662_2014.html
problem with production budget on filename movie_pages/1662_2014.html
movie_pages/2328_2015.html
problem with production budget on filename movie_pages/2328_2015.html
movie_pages/3422_2016.html
problem with production budget on filename movie_pages/3422_2016.html
movie_pages/0996_2014.html
problem with production budget on filename movie_pages/0996_2014.html
movie_pages/5290_2018.html
problem with production budget on filename movie_pages/5290_2018.html
mov

problem with production budget on filename movie_pages/3924_2017.html
movie_pages/4317_2017.html
problem with production budget on filename movie_pages/4317_2017.html
movie_pages/0486_2013.html
problem with production budget on filename movie_pages/0486_2013.html
movie_pages/1143_2014.html
problem with production budget on filename movie_pages/1143_2014.html
movie_pages/2255_2015.html
problem with production budget on filename movie_pages/2255_2015.html
movie_pages/1446_2014.html
problem with production budget on filename movie_pages/1446_2014.html
movie_pages/1417_2014.html
problem with production budget on filename movie_pages/1417_2014.html
movie_pages/3165_2016.html
problem with production budget on filename movie_pages/3165_2016.html
movie_pages/5105_2018.html
problem with production budget on filename movie_pages/5105_2018.html
movie_pages/1090_2014.html
problem with production budget on filename movie_pages/1090_2014.html
movie_pages/0225_2013.html
problem with production budget

movie_pages/3131_2016.html
problem with production budget on filename movie_pages/3131_2016.html
movie_pages/0680_2013.html
problem with production budget on filename movie_pages/0680_2013.html
movie_pages/4419_2017.html
problem with production budget on filename movie_pages/4419_2017.html
movie_pages/1835_2015.html
movie_pages/3999_2017.html
problem with production budget on filename movie_pages/3999_2017.html
movie_pages/2596_2015.html
problem with production budget on filename movie_pages/2596_2015.html
movie_pages/2819_2015.html
movie_pages/3291_2016.html
problem with production budget on filename movie_pages/3291_2016.html
movie_pages/1221_2014.html
problem with production budget on filename movie_pages/1221_2014.html
movie_pages/0221_2013.html
problem with production budget on filename movie_pages/0221_2013.html
movie_pages/3543_2016.html
problem with production budget on filename movie_pages/3543_2016.html
movie_pages/0182_2013.html
movie_pages/0875_2014.html
movie_pages/2818_20

movie_pages/4410_2017.html
problem with production budget on filename movie_pages/4410_2017.html
movie_pages/2733_2015.html
problem with production budget on filename movie_pages/2733_2015.html
movie_pages/4041_2017.html
problem with production budget on filename movie_pages/4041_2017.html
movie_pages/2619_2015.html
problem with production budget on filename movie_pages/2619_2015.html
movie_pages/5028_2017.html
problem with production budget on filename movie_pages/5028_2017.html
movie_pages/1218_2014.html
problem with production budget on filename movie_pages/1218_2014.html
movie_pages/2506_2015.html
problem with production budget on filename movie_pages/2506_2015.html
movie_pages/0708_2013.html
problem with production budget on filename movie_pages/0708_2013.html
movie_pages/2807_2015.html
problem with production budget on filename movie_pages/2807_2015.html
movie_pages/4524_2017.html
problem with production budget on filename movie_pages/4524_2017.html
movie_pages/0233_2013.html
pro

movie_pages/0915_2014.html
movie_pages/0732_2013.html
problem with production budget on filename movie_pages/0732_2013.html
movie_pages/2734_2015.html
problem with production budget on filename movie_pages/2734_2015.html
movie_pages/4985_2017.html
problem with production budget on filename movie_pages/4985_2017.html
movie_pages/1018_2014.html
problem with production budget on filename movie_pages/1018_2014.html
movie_pages/3483_2016.html
problem with production budget on filename movie_pages/3483_2016.html
movie_pages/4108_2017.html
problem with production budget on filename movie_pages/4108_2017.html
movie_pages/4710_2017.html
problem with production budget on filename movie_pages/4710_2017.html
movie_pages/5152_2018.html
problem with production budget on filename movie_pages/5152_2018.html
movie_pages/5002_2017.html
problem with production budget on filename movie_pages/5002_2017.html
movie_pages/4081_2017.html
problem with production budget on filename movie_pages/4081_2017.html
mov

problem with production budget on filename movie_pages/5347_2018.html
movie_pages/5010_2017.html
problem with production budget on filename movie_pages/5010_2017.html
movie_pages/5292_2018.html
problem with production budget on filename movie_pages/5292_2018.html
movie_pages/1637_2014.html
problem with production budget on filename movie_pages/1637_2014.html
movie_pages/3911_2017.html
movie_pages/1812_2015.html
movie_pages/2033_2015.html
problem with production budget on filename movie_pages/2033_2015.html
movie_pages/3942_2017.html
problem with production budget on filename movie_pages/3942_2017.html
movie_pages/5404_2018.html
problem with production budget on filename movie_pages/5404_2018.html
movie_pages/2774_2015.html
problem with production budget on filename movie_pages/2774_2015.html
movie_pages/4440_2017.html
problem with production budget on filename movie_pages/4440_2017.html
movie_pages/1442_2014.html
problem with production budget on filename movie_pages/1442_2014.html
mov

movie_pages/2173_2015.html
problem with production budget on filename movie_pages/2173_2015.html
movie_pages/3967_2017.html
movie_pages/3115_2016.html
problem with production budget on filename movie_pages/3115_2016.html
movie_pages/4474_2017.html
problem with production budget on filename movie_pages/4474_2017.html
movie_pages/5086_2018.html
problem with production budget on filename movie_pages/5086_2018.html
movie_pages/2597_2015.html
problem with production budget on filename movie_pages/2597_2015.html
movie_pages/3039_2016.html
movie_pages/0882_2014.html
movie_pages/1842_2015.html
movie_pages/0838_2014.html
movie_pages/2423_2015.html
problem with production budget on filename movie_pages/2423_2015.html
movie_pages/0895_2014.html
movie_pages/4901_2017.html
problem with production budget on filename movie_pages/4901_2017.html
movie_pages/4974_2017.html
problem with production budget on filename movie_pages/4974_2017.html
movie_pages/4566_2017.html
problem with production budget on f

movie_pages/2721_2015.html
problem with production budget on filename movie_pages/2721_2015.html
movie_pages/5432_2018.html
movie_pages/4979_2017.html
problem with production budget on filename movie_pages/4979_2017.html
movie_pages/2637_2015.html
movie_pages/4349_2017.html
problem with production budget on filename movie_pages/4349_2017.html
movie_pages/1848_2015.html
movie_pages/5454_2018.html
problem with production budget on filename movie_pages/5454_2018.html
movie_pages/1621_2014.html
problem with production budget on filename movie_pages/1621_2014.html
movie_pages/0578_2013.html
problem with production budget on filename movie_pages/0578_2013.html
movie_pages/4883_2017.html
problem with production budget on filename movie_pages/4883_2017.html
movie_pages/2392_2015.html
movie_pages/2839_2015.html
problem with production budget on filename movie_pages/2839_2015.html
movie_pages/1296_2014.html
problem with production budget on filename movie_pages/1296_2014.html
movie_pages/2016_20

movie_pages/1233_2014.html
problem with production budget on filename movie_pages/1233_2014.html
movie_pages/4414_2017.html
problem with production budget on filename movie_pages/4414_2017.html
movie_pages/2098_2015.html
movie_pages/3361_2016.html
problem with production budget on filename movie_pages/3361_2016.html
movie_pages/3343_2016.html
problem with production budget on filename movie_pages/3343_2016.html
movie_pages/4159_2017.html
problem with production budget on filename movie_pages/4159_2017.html
movie_pages/2479_2015.html
problem with production budget on filename movie_pages/2479_2015.html
movie_pages/4810_2017.html
problem with production budget on filename movie_pages/4810_2017.html
movie_pages/0962_2014.html
problem with production budget on filename movie_pages/0962_2014.html
movie_pages/0970_2014.html
problem with production budget on filename movie_pages/0970_2014.html
movie_pages/0239_2013.html
problem with production budget on filename movie_pages/0239_2013.html
mov

problem with production budget on filename movie_pages/2175_2015.html
movie_pages/4249_2017.html
problem with production budget on filename movie_pages/4249_2017.html
movie_pages/0892_2014.html
movie_pages/0387_2013.html
problem with production budget on filename movie_pages/0387_2013.html
movie_pages/5021_2017.html
problem with production budget on filename movie_pages/5021_2017.html
movie_pages/1682_2014.html
problem with production budget on filename movie_pages/1682_2014.html
movie_pages/3592_2016.html
movie_pages/2400_2015.html
problem with production budget on filename movie_pages/2400_2015.html
movie_pages/5099_2018.html
problem with production budget on filename movie_pages/5099_2018.html
movie_pages/4594_2017.html
problem with production budget on filename movie_pages/4594_2017.html
movie_pages/4733_2017.html
problem with production budget on filename movie_pages/4733_2017.html
movie_pages/0652_2013.html
problem with production budget on filename movie_pages/0652_2013.html
mov

movie_pages/4395_2017.html
problem with production budget on filename movie_pages/4395_2017.html
movie_pages/5264_2018.html
problem with production budget on filename movie_pages/5264_2018.html
movie_pages/0437_2013.html
problem with production budget on filename movie_pages/0437_2013.html
movie_pages/2116_2015.html
problem with production budget on filename movie_pages/2116_2015.html
movie_pages/1209_2014.html
problem with production budget on filename movie_pages/1209_2014.html
movie_pages/4838_2017.html
problem with production budget on filename movie_pages/4838_2017.html
movie_pages/0468_2013.html
problem with production budget on filename movie_pages/0468_2013.html
movie_pages/1547_2014.html
problem with production budget on filename movie_pages/1547_2014.html
movie_pages/4892_2017.html
problem with production budget on filename movie_pages/4892_2017.html
movie_pages/2266_2015.html
problem with production budget on filename movie_pages/2266_2015.html
movie_pages/2675_2015.html
mov

problem with production budget on filename movie_pages/1043_2014.html
movie_pages/0796_2014.html
movie_pages/0563_2013.html
problem with production budget on filename movie_pages/0563_2013.html
movie_pages/3095_2016.html
problem with production budget on filename movie_pages/3095_2016.html
movie_pages/5179_2018.html
movie_pages/4463_2017.html
problem with production budget on filename movie_pages/4463_2017.html
movie_pages/3105_2016.html
problem with production budget on filename movie_pages/3105_2016.html
movie_pages/1836_2015.html
movie_pages/1041_2014.html
problem with production budget on filename movie_pages/1041_2014.html
movie_pages/5188_2018.html
movie_pages/5167_2018.html
problem with production budget on filename movie_pages/5167_2018.html
movie_pages/5279_2018.html
problem with production budget on filename movie_pages/5279_2018.html
movie_pages/3308_2016.html
problem with production budget on filename movie_pages/3308_2016.html
movie_pages/3700_2016.html
problem with produc

movie_pages/3498_2016.html
problem with production budget on filename movie_pages/3498_2016.html
movie_pages/1826_2015.html
movie_pages/0183_2013.html
problem with production budget on filename movie_pages/0183_2013.html
movie_pages/0067_2013.html
movie_pages/1408_2014.html
problem with production budget on filename movie_pages/1408_2014.html
movie_pages/2929_2016.html
movie_pages/4793_2017.html
problem with production budget on filename movie_pages/4793_2017.html
movie_pages/4027_2017.html
problem with production budget on filename movie_pages/4027_2017.html
movie_pages/1598_2014.html
problem with production budget on filename movie_pages/1598_2014.html
movie_pages/4232_2017.html
problem with production budget on filename movie_pages/4232_2017.html
movie_pages/2728_2015.html
problem with production budget on filename movie_pages/2728_2015.html
movie_pages/4177_2017.html
problem with production budget on filename movie_pages/4177_2017.html
movie_pages/1658_2014.html
problem with produc

problem with production budget on filename movie_pages/4265_2017.html
movie_pages/4559_2017.html
problem with production budget on filename movie_pages/4559_2017.html
movie_pages/3836_2016.html
problem with production budget on filename movie_pages/3836_2016.html
movie_pages/1450_2014.html
problem with production budget on filename movie_pages/1450_2014.html
movie_pages/1062_2014.html
problem with production budget on filename movie_pages/1062_2014.html
movie_pages/1963_2015.html
problem with production budget on filename movie_pages/1963_2015.html
movie_pages/2133_2015.html
problem with production budget on filename movie_pages/2133_2015.html
movie_pages/3480_2016.html
problem with production budget on filename movie_pages/3480_2016.html
movie_pages/1079_2014.html
problem with production budget on filename movie_pages/1079_2014.html
movie_pages/1080_2014.html
movie_pages/4390_2017.html
problem with production budget on filename movie_pages/4390_2017.html
movie_pages/1329_2014.html
pro

movie_pages/2231_2015.html
problem with production budget on filename movie_pages/2231_2015.html
movie_pages/5151_2018.html
problem with production budget on filename movie_pages/5151_2018.html
movie_pages/4138_2017.html
problem with production budget on filename movie_pages/4138_2017.html
movie_pages/3302_2016.html
problem with production budget on filename movie_pages/3302_2016.html
movie_pages/0818_2014.html
movie_pages/1254_2014.html
problem with production budget on filename movie_pages/1254_2014.html
movie_pages/3282_2016.html
problem with production budget on filename movie_pages/3282_2016.html
movie_pages/2324_2015.html
problem with production budget on filename movie_pages/2324_2015.html
movie_pages/1531_2014.html
problem with production budget on filename movie_pages/1531_2014.html
movie_pages/4019_2017.html
problem with production budget on filename movie_pages/4019_2017.html
movie_pages/5456_2018.html
problem with production budget on filename movie_pages/5456_2018.html
mov

problem with production budget on filename movie_pages/5230_2018.html
movie_pages/0334_2013.html
problem with production budget on filename movie_pages/0334_2013.html
movie_pages/2704_2015.html
problem with production budget on filename movie_pages/2704_2015.html
movie_pages/4086_2017.html
problem with production budget on filename movie_pages/4086_2017.html
movie_pages/0894_2014.html
movie_pages/3141_2016.html
problem with production budget on filename movie_pages/3141_2016.html
movie_pages/3079_2016.html
problem with production budget on filename movie_pages/3079_2016.html
movie_pages/0683_2013.html
problem with production budget on filename movie_pages/0683_2013.html
movie_pages/0829_2014.html
movie_pages/5129_2018.html
problem with production budget on filename movie_pages/5129_2018.html
movie_pages/0237_2013.html
problem with production budget on filename movie_pages/0237_2013.html
movie_pages/3823_2016.html
problem with production budget on filename movie_pages/3823_2016.html
mov

movie_pages/2321_2015.html
problem with production budget on filename movie_pages/2321_2015.html
movie_pages/5388_2018.html
problem with production budget on filename movie_pages/5388_2018.html
movie_pages/2705_2015.html
problem with production budget on filename movie_pages/2705_2015.html
movie_pages/2602_2015.html
problem with production budget on filename movie_pages/2602_2015.html
movie_pages/0663_2013.html
movie_pages/1770_2015.html
movie_pages/2699_2015.html
problem with production budget on filename movie_pages/2699_2015.html
movie_pages/1449_2014.html
problem with production budget on filename movie_pages/1449_2014.html
movie_pages/3088_2016.html
problem with production budget on filename movie_pages/3088_2016.html
movie_pages/3656_2016.html
problem with production budget on filename movie_pages/3656_2016.html
movie_pages/1197_2014.html
problem with production budget on filename movie_pages/1197_2014.html
movie_pages/0820_2014.html
movie_pages/4251_2017.html
problem with produc

problem with production budget on filename movie_pages/4018_2017.html
movie_pages/1724_2014.html
movie_pages/4355_2017.html
problem with production budget on filename movie_pages/4355_2017.html
movie_pages/1534_2014.html
problem with production budget on filename movie_pages/1534_2014.html
movie_pages/3454_2016.html
problem with production budget on filename movie_pages/3454_2016.html
movie_pages/0425_2013.html
problem with production budget on filename movie_pages/0425_2013.html
movie_pages/3228_2016.html
problem with production budget on filename movie_pages/3228_2016.html
movie_pages/3865_2017.html
movie_pages/4160_2017.html
problem with production budget on filename movie_pages/4160_2017.html
movie_pages/5489_2018.html
problem with production budget on filename movie_pages/5489_2018.html
movie_pages/1182_2014.html
problem with production budget on filename movie_pages/1182_2014.html
movie_pages/4739_2017.html
problem with production budget on filename movie_pages/4739_2017.html
mov

movie_pages/2757_2015.html
problem with production budget on filename movie_pages/2757_2015.html
movie_pages/3479_2016.html
problem with production budget on filename movie_pages/3479_2016.html
movie_pages/5295_2018.html
problem with production budget on filename movie_pages/5295_2018.html
movie_pages/4575_2017.html
problem with production budget on filename movie_pages/4575_2017.html
movie_pages/5220_2018.html
problem with production budget on filename movie_pages/5220_2018.html
movie_pages/4909_2017.html
problem with production budget on filename movie_pages/4909_2017.html
movie_pages/5559_2018.html
problem with production budget on filename movie_pages/5559_2018.html
movie_pages/2778_2015.html
problem with production budget on filename movie_pages/2778_2015.html
movie_pages/5409_2018.html
problem with production budget on filename movie_pages/5409_2018.html
movie_pages/0613_2013.html
problem with production budget on filename movie_pages/0613_2013.html
movie_pages/4120_2017.html
pro

movie_pages/0073_2013.html
movie_pages/3021_2016.html
movie_pages/5505_2018.html
movie_pages/0890_2014.html
movie_pages/5365_2018.html
problem with production budget on filename movie_pages/5365_2018.html
movie_pages/3312_2016.html
problem with production budget on filename movie_pages/3312_2016.html
movie_pages/2822_2015.html
problem with production budget on filename movie_pages/2822_2015.html
movie_pages/3315_2016.html
problem with production budget on filename movie_pages/3315_2016.html
movie_pages/2781_2015.html
problem with production budget on filename movie_pages/2781_2015.html
movie_pages/3950_2017.html
problem with production budget on filename movie_pages/3950_2017.html
movie_pages/2320_2015.html
problem with production budget on filename movie_pages/2320_2015.html
movie_pages/3581_2016.html
problem with production budget on filename movie_pages/3581_2016.html
movie_pages/3765_2016.html
problem with production budget on filename movie_pages/3765_2016.html
movie_pages/1122_20

movie_pages/4239_2017.html
problem with production budget on filename movie_pages/4239_2017.html
movie_pages/2608_2015.html
problem with production budget on filename movie_pages/2608_2015.html
movie_pages/0143_2013.html
problem with production budget on filename movie_pages/0143_2013.html
movie_pages/2162_2015.html
problem with production budget on filename movie_pages/2162_2015.html
movie_pages/4029_2017.html
problem with production budget on filename movie_pages/4029_2017.html
movie_pages/0640_2013.html
problem with production budget on filename movie_pages/0640_2013.html
movie_pages/3031_2016.html
movie_pages/0914_2014.html
problem with production budget on filename movie_pages/0914_2014.html
movie_pages/4465_2017.html
problem with production budget on filename movie_pages/4465_2017.html
movie_pages/0280_2013.html
problem with production budget on filename movie_pages/0280_2013.html
movie_pages/3621_2016.html
problem with production budget on filename movie_pages/3621_2016.html
mov

problem with production budget on filename movie_pages/3662_2016.html
movie_pages/4870_2017.html
problem with production budget on filename movie_pages/4870_2017.html
movie_pages/1992_2015.html
problem with production budget on filename movie_pages/1992_2015.html
movie_pages/5291_2018.html
problem with production budget on filename movie_pages/5291_2018.html
movie_pages/3533_2016.html
problem with production budget on filename movie_pages/3533_2016.html
movie_pages/5020_2017.html
problem with production budget on filename movie_pages/5020_2017.html
movie_pages/4806_2017.html
problem with production budget on filename movie_pages/4806_2017.html
movie_pages/4248_2017.html
problem with production budget on filename movie_pages/4248_2017.html
movie_pages/1148_2014.html
problem with production budget on filename movie_pages/1148_2014.html
movie_pages/1755_2015.html
movie_pages/4276_2017.html
problem with production budget on filename movie_pages/4276_2017.html
movie_pages/3954_2017.html
mov

problem with production budget on filename movie_pages/2756_2015.html
movie_pages/0302_2013.html
movie_pages/3585_2016.html
problem with production budget on filename movie_pages/3585_2016.html
movie_pages/3913_2017.html
movie_pages/1857_2015.html
movie_pages/4073_2017.html
problem with production budget on filename movie_pages/4073_2017.html
movie_pages/3750_2016.html
problem with production budget on filename movie_pages/3750_2016.html
movie_pages/0773_2013.html
problem with production budget on filename movie_pages/0773_2013.html
movie_pages/4703_2017.html
problem with production budget on filename movie_pages/4703_2017.html
movie_pages/5363_2018.html
problem with production budget on filename movie_pages/5363_2018.html
movie_pages/0846_2014.html
movie_pages/0975_2014.html
problem with production budget on filename movie_pages/0975_2014.html
movie_pages/2240_2015.html
movie_pages/5115_2018.html
problem with production budget on filename movie_pages/5115_2018.html
movie_pages/2212_20

problem with production budget on filename movie_pages/3065_2016.html
movie_pages/1021_2014.html
problem with production budget on filename movie_pages/1021_2014.html
movie_pages/2299_2015.html
problem with production budget on filename movie_pages/2299_2015.html
movie_pages/2478_2015.html
movie_pages/3112_2016.html
problem with production budget on filename movie_pages/3112_2016.html
movie_pages/0491_2013.html
problem with production budget on filename movie_pages/0491_2013.html
movie_pages/1216_2014.html
problem with production budget on filename movie_pages/1216_2014.html
movie_pages/3713_2016.html
problem with production budget on filename movie_pages/3713_2016.html
movie_pages/4003_2017.html
problem with production budget on filename movie_pages/4003_2017.html
movie_pages/5427_2018.html
movie_pages/3364_2016.html
problem with production budget on filename movie_pages/3364_2016.html
movie_pages/3840_2016.html
problem with production budget on filename movie_pages/3840_2016.html
mov

movie_pages/2396_2015.html
problem with production budget on filename movie_pages/2396_2015.html
movie_pages/3185_2016.html
problem with production budget on filename movie_pages/3185_2016.html
movie_pages/0495_2013.html
problem with production budget on filename movie_pages/0495_2013.html
movie_pages/3661_2016.html
problem with production budget on filename movie_pages/3661_2016.html
movie_pages/5462_2018.html
problem with production budget on filename movie_pages/5462_2018.html
movie_pages/2430_2015.html
problem with production budget on filename movie_pages/2430_2015.html
movie_pages/2856_2015.html
problem with production budget on filename movie_pages/2856_2015.html
movie_pages/3083_2016.html
movie_pages/0093_2013.html
movie_pages/1167_2014.html
problem with production budget on filename movie_pages/1167_2014.html
movie_pages/1999_2015.html
movie_pages/5325_2018.html
problem with production budget on filename movie_pages/5325_2018.html
movie_pages/3778_2016.html
problem with produc

movie_pages/4183_2017.html
problem with production budget on filename movie_pages/4183_2017.html
movie_pages/3944_2017.html
movie_pages/3739_2016.html
problem with production budget on filename movie_pages/3739_2016.html
movie_pages/0330_2013.html
problem with production budget on filename movie_pages/0330_2013.html
movie_pages/4450_2017.html
problem with production budget on filename movie_pages/4450_2017.html
movie_pages/3791_2016.html
problem with production budget on filename movie_pages/3791_2016.html
movie_pages/0011_2013.html
movie_pages/3832_2016.html
problem with production budget on filename movie_pages/3832_2016.html
movie_pages/0112_2013.html
movie_pages/1925_2015.html
problem with production budget on filename movie_pages/1925_2015.html
movie_pages/3440_2016.html
problem with production budget on filename movie_pages/3440_2016.html
movie_pages/4704_2017.html
problem with production budget on filename movie_pages/4704_2017.html
movie_pages/4457_2017.html
problem with produc

problem with production budget on filename movie_pages/3614_2016.html
movie_pages/2117_2015.html
problem with production budget on filename movie_pages/2117_2015.html
movie_pages/0223_2013.html
movie_pages/3738_2016.html
problem with production budget on filename movie_pages/3738_2016.html
movie_pages/3811_2016.html
problem with production budget on filename movie_pages/3811_2016.html
movie_pages/2890_2016.html
movie_pages/3200_2016.html
problem with production budget on filename movie_pages/3200_2016.html
movie_pages/4913_2017.html
problem with production budget on filename movie_pages/4913_2017.html
movie_pages/2706_2015.html
problem with production budget on filename movie_pages/2706_2015.html
movie_pages/4694_2017.html
problem with production budget on filename movie_pages/4694_2017.html
movie_pages/4820_2017.html
problem with production budget on filename movie_pages/4820_2017.html
movie_pages/3697_2016.html
problem with production budget on filename movie_pages/3697_2016.html
mov

movie_pages/2665_2015.html
movie_pages/5381_2018.html
problem with production budget on filename movie_pages/5381_2018.html
movie_pages/4760_2017.html
problem with production budget on filename movie_pages/4760_2017.html
movie_pages/1452_2014.html
problem with production budget on filename movie_pages/1452_2014.html
movie_pages/0634_2013.html
problem with production budget on filename movie_pages/0634_2013.html
movie_pages/5534_2018.html
problem with production budget on filename movie_pages/5534_2018.html
movie_pages/0679_2013.html
problem with production budget on filename movie_pages/0679_2013.html
movie_pages/1659_2014.html
problem with production budget on filename movie_pages/1659_2014.html
movie_pages/2653_2015.html
problem with production budget on filename movie_pages/2653_2015.html
movie_pages/4014_2017.html
problem with production budget on filename movie_pages/4014_2017.html
movie_pages/4228_2017.html
problem with production budget on filename movie_pages/4228_2017.html
mov

movie_pages/4632_2017.html
problem with production budget on filename movie_pages/4632_2017.html
movie_pages/3703_2016.html
problem with production budget on filename movie_pages/3703_2016.html
movie_pages/3506_2016.html
problem with production budget on filename movie_pages/3506_2016.html
movie_pages/0492_2013.html
problem with production budget on filename movie_pages/0492_2013.html
movie_pages/3673_2016.html
problem with production budget on filename movie_pages/3673_2016.html
movie_pages/4692_2017.html
problem with production budget on filename movie_pages/4692_2017.html
movie_pages/2555_2015.html
problem with production budget on filename movie_pages/2555_2015.html
movie_pages/5323_2018.html
problem with production budget on filename movie_pages/5323_2018.html
movie_pages/1089_2014.html
problem with production budget on filename movie_pages/1089_2014.html
movie_pages/2681_2015.html
problem with production budget on filename movie_pages/2681_2015.html
movie_pages/1728_2014.html
pro

movie_pages/3989_2017.html
problem with production budget on filename movie_pages/3989_2017.html
movie_pages/1605_2014.html
problem with production budget on filename movie_pages/1605_2014.html
movie_pages/4649_2017.html
problem with production budget on filename movie_pages/4649_2017.html
movie_pages/3857_2017.html
movie_pages/0090_2013.html
movie_pages/1884_2015.html
problem with production budget on filename movie_pages/1884_2015.html
movie_pages/0821_2014.html
movie_pages/5299_2018.html
problem with production budget on filename movie_pages/5299_2018.html
movie_pages/1327_2014.html
problem with production budget on filename movie_pages/1327_2014.html
movie_pages/1863_2015.html
movie_pages/2453_2015.html
problem with production budget on filename movie_pages/2453_2015.html
movie_pages/1833_2015.html
problem with production budget on filename movie_pages/1833_2015.html
movie_pages/4525_2017.html
problem with production budget on filename movie_pages/4525_2017.html
movie_pages/2043_20

problem with production budget on filename movie_pages/2820_2015.html
movie_pages/2571_2015.html
problem with production budget on filename movie_pages/2571_2015.html
movie_pages/1202_2014.html
problem with production budget on filename movie_pages/1202_2014.html
movie_pages/5331_2018.html
problem with production budget on filename movie_pages/5331_2018.html
movie_pages/1211_2014.html
movie_pages/3400_2016.html
problem with production budget on filename movie_pages/3400_2016.html
movie_pages/1976_2015.html
problem with production budget on filename movie_pages/1976_2015.html
movie_pages/5438_2018.html
movie_pages/0458_2013.html
problem with production budget on filename movie_pages/0458_2013.html
movie_pages/3049_2016.html
problem with production budget on filename movie_pages/3049_2016.html
movie_pages/3877_2017.html
movie_pages/5133_2018.html
problem with production budget on filename movie_pages/5133_2018.html
movie_pages/4487_2017.html
problem with production budget on filename mov

movie_pages/5197_2018.html
problem with production budget on filename movie_pages/5197_2018.html
movie_pages/4657_2017.html
problem with production budget on filename movie_pages/4657_2017.html
movie_pages/4902_2017.html
problem with production budget on filename movie_pages/4902_2017.html
movie_pages/2447_2015.html
problem with production budget on filename movie_pages/2447_2015.html
movie_pages/2727_2015.html
problem with production budget on filename movie_pages/2727_2015.html
movie_pages/1948_2015.html
problem with production budget on filename movie_pages/1948_2015.html
movie_pages/2027_2015.html
problem with production budget on filename movie_pages/2027_2015.html
movie_pages/4680_2017.html
problem with production budget on filename movie_pages/4680_2017.html
movie_pages/5327_2018.html
problem with production budget on filename movie_pages/5327_2018.html
movie_pages/5267_2018.html
problem with production budget on filename movie_pages/5267_2018.html
movie_pages/5387_2018.html
pro

movie_pages/5218_2018.html
problem with production budget on filename movie_pages/5218_2018.html
movie_pages/4411_2017.html
problem with production budget on filename movie_pages/4411_2017.html
movie_pages/5081_2018.html
problem with production budget on filename movie_pages/5081_2018.html
movie_pages/5238_2018.html
problem with production budget on filename movie_pages/5238_2018.html
movie_pages/3404_2016.html
problem with production budget on filename movie_pages/3404_2016.html
movie_pages/1510_2014.html
problem with production budget on filename movie_pages/1510_2014.html
movie_pages/2880_2015.html
problem with production budget on filename movie_pages/2880_2015.html
movie_pages/2343_2015.html
problem with production budget on filename movie_pages/2343_2015.html
movie_pages/0782_2013.html
problem with production budget on filename movie_pages/0782_2013.html
movie_pages/5546_2018.html
problem with production budget on filename movie_pages/5546_2018.html
movie_pages/5269_2018.html
pro

movie_pages/4658_2017.html
problem with production budget on filename movie_pages/4658_2017.html
movie_pages/0145_2013.html
movie_pages/0336_2013.html
problem with production budget on filename movie_pages/0336_2013.html
movie_pages/1746_2015.html
movie_pages/3325_2016.html
problem with production budget on filename movie_pages/3325_2016.html
movie_pages/0498_2013.html
problem with production budget on filename movie_pages/0498_2013.html
movie_pages/1522_2014.html
problem with production budget on filename movie_pages/1522_2014.html
movie_pages/3897_2017.html
movie_pages/1394_2014.html
problem with production budget on filename movie_pages/1394_2014.html
movie_pages/4195_2017.html
problem with production budget on filename movie_pages/4195_2017.html
movie_pages/0287_2013.html
problem with production budget on filename movie_pages/0287_2013.html
movie_pages/0866_2014.html
movie_pages/1086_2014.html
movie_pages/2661_2015.html
problem with production budget on filename movie_pages/2661_20

movie_pages/2568_2015.html
problem with production budget on filename movie_pages/2568_2015.html
movie_pages/0218_2013.html
problem with production budget on filename movie_pages/0218_2013.html
movie_pages/0561_2013.html
problem with production budget on filename movie_pages/0561_2013.html
movie_pages/1562_2014.html
problem with production budget on filename movie_pages/1562_2014.html
movie_pages/1851_2015.html
problem with production budget on filename movie_pages/1851_2015.html
movie_pages/0089_2013.html
movie_pages/1653_2014.html
problem with production budget on filename movie_pages/1653_2014.html
movie_pages/5130_2018.html
problem with production budget on filename movie_pages/5130_2018.html
movie_pages/2383_2015.html
problem with production budget on filename movie_pages/2383_2015.html
movie_pages/5562_2018.html
problem with production budget on filename movie_pages/5562_2018.html
movie_pages/5486_2018.html
problem with production budget on filename movie_pages/5486_2018.html
mov

movie_pages/5461_2018.html
problem with production budget on filename movie_pages/5461_2018.html
movie_pages/4272_2017.html
problem with production budget on filename movie_pages/4272_2017.html
movie_pages/3370_2016.html
problem with production budget on filename movie_pages/3370_2016.html
movie_pages/3471_2016.html
problem with production budget on filename movie_pages/3471_2016.html
movie_pages/2670_2015.html
problem with production budget on filename movie_pages/2670_2015.html
movie_pages/1920_2015.html
problem with production budget on filename movie_pages/1920_2015.html
movie_pages/4376_2017.html
problem with production budget on filename movie_pages/4376_2017.html
movie_pages/1560_2014.html
problem with production budget on filename movie_pages/1560_2014.html
movie_pages/5078_2018.html
problem with production budget on filename movie_pages/5078_2018.html
movie_pages/5036_2017.html
problem with production budget on filename movie_pages/5036_2017.html
movie_pages/1435_2014.html
pro

movie_pages/1053_2014.html
problem with production budget on filename movie_pages/1053_2014.html
movie_pages/3677_2016.html
problem with production budget on filename movie_pages/3677_2016.html
movie_pages/1594_2014.html
problem with production budget on filename movie_pages/1594_2014.html
movie_pages/1998_2015.html
movie_pages/0478_2013.html
problem with production budget on filename movie_pages/0478_2013.html
movie_pages/2752_2015.html
problem with production budget on filename movie_pages/2752_2015.html
movie_pages/1825_2015.html
movie_pages/3901_2017.html
movie_pages/1026_2014.html
problem with production budget on filename movie_pages/1026_2014.html
movie_pages/5314_2018.html
problem with production budget on filename movie_pages/5314_2018.html
movie_pages/4482_2017.html
problem with production budget on filename movie_pages/4482_2017.html
movie_pages/0137_2013.html
problem with production budget on filename movie_pages/0137_2013.html
movie_pages/2519_2015.html
problem with produc

movie_pages/5366_2018.html
problem with production budget on filename movie_pages/5366_2018.html
movie_pages/2750_2015.html
problem with production budget on filename movie_pages/2750_2015.html
movie_pages/3386_2016.html
problem with production budget on filename movie_pages/3386_2016.html
movie_pages/4864_2017.html
problem with production budget on filename movie_pages/4864_2017.html
movie_pages/5352_2018.html
problem with production budget on filename movie_pages/5352_2018.html
movie_pages/0216_2013.html
problem with production budget on filename movie_pages/0216_2013.html
movie_pages/4339_2017.html
problem with production budget on filename movie_pages/4339_2017.html
movie_pages/5380_2018.html
problem with production budget on filename movie_pages/5380_2018.html
movie_pages/3518_2016.html
problem with production budget on filename movie_pages/3518_2016.html
movie_pages/1015_2014.html
problem with production budget on filename movie_pages/1015_2014.html
movie_pages/0677_2013.html
pro

problem with production budget on filename movie_pages/4800_2017.html
movie_pages/2459_2015.html
problem with production budget on filename movie_pages/2459_2015.html
movie_pages/2923_2016.html
movie_pages/4778_2017.html
problem with production budget on filename movie_pages/4778_2017.html
movie_pages/0804_2014.html
movie_pages/3382_2016.html
problem with production budget on filename movie_pages/3382_2016.html
movie_pages/4006_2017.html
movie_pages/5258_2018.html
movie_pages/2275_2015.html
problem with production budget on filename movie_pages/2275_2015.html
movie_pages/1116_2014.html
problem with production budget on filename movie_pages/1116_2014.html
movie_pages/1498_2014.html
problem with production budget on filename movie_pages/1498_2014.html
movie_pages/4116_2017.html
problem with production budget on filename movie_pages/4116_2017.html
movie_pages/3959_2017.html
movie_pages/3419_2016.html
problem with production budget on filename movie_pages/3419_2016.html
movie_pages/2405_20

movie_pages/0268_2013.html
problem with production budget on filename movie_pages/0268_2013.html
movie_pages/0566_2013.html
problem with production budget on filename movie_pages/0566_2013.html
movie_pages/1009_2014.html
problem with production budget on filename movie_pages/1009_2014.html
movie_pages/0290_2013.html
problem with production budget on filename movie_pages/0290_2013.html
movie_pages/2960_2016.html
movie_pages/4112_2017.html
problem with production budget on filename movie_pages/4112_2017.html
movie_pages/0606_2013.html
problem with production budget on filename movie_pages/0606_2013.html
movie_pages/0842_2014.html
movie_pages/3420_2016.html
problem with production budget on filename movie_pages/3420_2016.html
movie_pages/0484_2013.html
problem with production budget on filename movie_pages/0484_2013.html
movie_pages/1410_2014.html
problem with production budget on filename movie_pages/1410_2014.html
movie_pages/5332_2018.html
problem with production budget on filename mov

problem with production budget on filename movie_pages/4212_2017.html
movie_pages/3678_2016.html
problem with production budget on filename movie_pages/3678_2016.html
movie_pages/2316_2015.html
problem with production budget on filename movie_pages/2316_2015.html
movie_pages/1118_2014.html
movie_pages/0513_2013.html
problem with production budget on filename movie_pages/0513_2013.html
movie_pages/4087_2017.html
problem with production budget on filename movie_pages/4087_2017.html
movie_pages/1745_2015.html
movie_pages/1165_2014.html
problem with production budget on filename movie_pages/1165_2014.html
movie_pages/1975_2015.html
problem with production budget on filename movie_pages/1975_2015.html
movie_pages/0262_2013.html
problem with production budget on filename movie_pages/0262_2013.html
movie_pages/4522_2017.html
problem with production budget on filename movie_pages/4522_2017.html
movie_pages/0784_2013.html
problem with production budget on filename movie_pages/0784_2013.html
mov

movie_pages/3878_2017.html
movie_pages/5437_2018.html
problem with production budget on filename movie_pages/5437_2018.html
movie_pages/0312_2013.html
problem with production budget on filename movie_pages/0312_2013.html
movie_pages/1864_2015.html
movie_pages/2707_2015.html
problem with production budget on filename movie_pages/2707_2015.html
movie_pages/4125_2017.html
problem with production budget on filename movie_pages/4125_2017.html
movie_pages/0097_2013.html
movie_pages/4034_2017.html
problem with production budget on filename movie_pages/4034_2017.html
movie_pages/4207_2017.html
problem with production budget on filename movie_pages/4207_2017.html
movie_pages/4960_2017.html
problem with production budget on filename movie_pages/4960_2017.html
movie_pages/4844_2017.html
problem with production budget on filename movie_pages/4844_2017.html
movie_pages/3809_2016.html
problem with production budget on filename movie_pages/3809_2016.html
movie_pages/1776_2015.html
movie_pages/4549_20

movie_pages/2080_2015.html
problem with production budget on filename movie_pages/2080_2015.html
movie_pages/2932_2016.html
movie_pages/4461_2017.html
problem with production budget on filename movie_pages/4461_2017.html
movie_pages/1985_2015.html
problem with production budget on filename movie_pages/1985_2015.html
movie_pages/1007_2014.html
problem with production budget on filename movie_pages/1007_2014.html
movie_pages/2032_2015.html
problem with production budget on filename movie_pages/2032_2015.html
movie_pages/0972_2014.html
problem with production budget on filename movie_pages/0972_2014.html
movie_pages/1361_2014.html
problem with production budget on filename movie_pages/1361_2014.html
movie_pages/1993_2015.html
problem with production budget on filename movie_pages/1993_2015.html
movie_pages/1911_2015.html
problem with production budget on filename movie_pages/1911_2015.html
movie_pages/2325_2015.html
problem with production budget on filename movie_pages/2325_2015.html
mov

movie_pages/1031_2014.html
problem with production budget on filename movie_pages/1031_2014.html
movie_pages/1901_2015.html
problem with production budget on filename movie_pages/1901_2015.html
movie_pages/2206_2015.html
movie_pages/2039_2015.html
problem with production budget on filename movie_pages/2039_2015.html
movie_pages/0122_2013.html
movie_pages/5052_2017.html
problem with production budget on filename movie_pages/5052_2017.html
movie_pages/2887_2016.html
movie_pages/1557_2014.html
problem with production budget on filename movie_pages/1557_2014.html
movie_pages/3072_2016.html
movie_pages/2990_2016.html
movie_pages/3276_2016.html
problem with production budget on filename movie_pages/3276_2016.html
movie_pages/1937_2015.html
problem with production budget on filename movie_pages/1937_2015.html
movie_pages/3032_2016.html
movie_pages/3921_2017.html
problem with production budget on filename movie_pages/3921_2017.html
movie_pages/5493_2018.html
problem with production budget on f

movie_pages/0104_2013.html
movie_pages/0485_2013.html
problem with production budget on filename movie_pages/0485_2013.html
movie_pages/0649_2013.html
problem with production budget on filename movie_pages/0649_2013.html
movie_pages/4535_2017.html
problem with production budget on filename movie_pages/4535_2017.html
movie_pages/0916_2014.html
problem with production budget on filename movie_pages/0916_2014.html
movie_pages/4557_2017.html
problem with production budget on filename movie_pages/4557_2017.html
movie_pages/1552_2014.html
problem with production budget on filename movie_pages/1552_2014.html
movie_pages/2276_2015.html
movie_pages/4373_2017.html
problem with production budget on filename movie_pages/4373_2017.html
movie_pages/4491_2017.html
problem with production budget on filename movie_pages/4491_2017.html
movie_pages/3660_2016.html
problem with production budget on filename movie_pages/3660_2016.html
movie_pages/2398_2015.html
problem with production budget on filename mov

In [26]:
import pandas as pd

scraped_data_df = pd.read_csv('data/scraped_data.csv', converters={'number': str})
scraped_data_df.head()

Unnamed: 0,number,year,title,domestic_box_office,international_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,source,genre,production_method,creative_type,production_companies,production_countries
0,4025,2017,Brad’s Status,2133158,2133158,0,2017-09-15,Annapurna Pictures,R,,Original Screenplay,Comedy,Live Action,Contemporary Fiction,,United States
1,269,2013,Free Angela & All Political Prisoners,129102,129102,0,2013-04-05,CODEBLACK Films/Lionsgate,Not Rated,,Based on Real Life Events,Documentary,Live Action,Factual,"REALside Productions,De Films en Aiguille,Dire...",United States
2,4428,2017,The Long Home,0,0,0,,CODEBLACK Films/Lionsgate,R,,Based on Fiction Book/Short Story,Drama,Live Action,Historical Fiction,Rabbit Bandini Productions,United States
3,650,2013,Big Ass Spider,0,0,0,2013-10-18,Epic Pictures Group,PG-13,,Original Screenplay,Comedy,Live Action,Science Fiction,Epic Pictures Group,United States
4,1882,2015,Apparition,0,0,0,,Epic Pictures Group,Not Rated,,Original Screenplay,Thriller/Suspense,Live Action,Fantasy,,United States


In [27]:
scraped_data_df.shape

(5566, 16)

In [28]:
scraped_data_df.dtypes

number                      object
year                         int64
title                       object
domestic_box_office          int64
international_box_office     int64
production_budget            int64
domestic_release_date       object
domestic_distributor        object
mpaa_rating                 object
franchise                   object
source                      object
genre                       object
production_method           object
creative_type               object
production_companies        object
production_countries        object
dtype: object

In [37]:
len(scraped_data_df)

5566

In [47]:
len(scraped_data_df[scraped_data_df['domestic_box_office'] >= 500000])

1078

In [48]:
scraped_data_df_500000 = scraped_data_df[scraped_data_df['domestic_box_office'] >= 500000]

In [49]:
scraped_data_df_500000.to_csv('data/scraped_data_500000.csv', index=False)

In [None]:
#by keeping the minimum US box office at half million, ensure big arthouse/indie movies aren't filtered out

In [170]:
scraped_data_df_500000 = pd.read_csv("data/scraped_data_500000.csv")

In [171]:
scraped_data_df_500000.head()

Unnamed: 0,number,year,title,domestic_box_office,international_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,source,genre,production_method,creative_type,production_companies,production_countries
0,4025,2017,Brad’s Status,2133158,2133158,0,2017-09-15,Annapurna Pictures,R,,Original Screenplay,Comedy,Live Action,Contemporary Fiction,,United States
1,164,2013,The East,2274649,2274649,6500000,2013-05-31,Fox Searchlight,PG-13,,Original Screenplay,Drama,Live Action,Contemporary Fiction,Scott Free Films,United States
2,5442,2018,The Catcher Was A Spy,714205,714205,0,2018-06-22,IFC Films,R,,Based on Fiction Book/Short Story,Drama,Live Action,Historical Fiction,"Animus Films,PalmStar Media,Serena Films,Windy...",United States
3,3062,2016,Miles Ahead,2610896,2610896,0,2016-04-01,Sony Pictures Classics,R,,Based on Real Life Events,Drama,Live Action,Dramatization,"Bifrost Pictures,Sony Pictures Classics,Miles ...",United States
4,12,2013,World War Z,202359711,202359711,190000000,2013-06-21,Paramount Pictures,PG-13,World War Z,Based on Fiction Book/Short Story,Action,Live Action,Science Fiction,"Skydance Productions,Hemisphere Media Capital,...",United States


In [172]:
scraped_data_df_500000.dtypes

number                       int64
year                         int64
title                       object
domestic_box_office          int64
international_box_office     int64
production_budget            int64
domestic_release_date       object
domestic_distributor        object
mpaa_rating                 object
franchise                   object
source                      object
genre                       object
production_method           object
creative_type               object
production_companies        object
production_countries        object
dtype: object

In [173]:
#number of movies in this list for which we don't have estimates of production budgets 

len(scraped_data_df_500000[scraped_data_df_500000['production_budget'] == 0])

320

In [174]:
bms_release_list = pd.read_csv("data/bookmyshow_list.csv")
bms_release_list

Unnamed: 0,Release_Year,EventGroup_strTitle,Event_dtmReleaseDate
0,2013,Hansel And Gretel Witch Hunters,01/01/2013 00:00:00
1,2013,7 Days In Slow Motion,04/01/2013 00:00:00
2,2013,Chinese Zodiac,04/01/2013 00:00:00
3,2013,The Impossible,04/01/2013 00:00:00
4,2013,Tropical Races 3,04/01/2013 00:00:00
5,2013,Universal Soldier: Day Of Reckoning,04/01/2013 00:00:00
6,2013,Comic Vampire,05/01/2013 00:00:00
7,2013,Dinosaur Reappearance,05/01/2013 00:00:00
8,2013,Roller Coaster & Dinosaur Adventure,05/01/2013 00:00:00
9,2013,Gangster Squad,11/01/2013 00:00:00


In [175]:
bms_release_list.dtypes

Release_Year             int64
EventGroup_strTitle     object
Event_dtmReleaseDate    object
dtype: object

In [176]:
bms_release_list.rename(columns={'Release_Year': 'release_year',
                                  'EventGroup_strTitle':'title',
                                  'Event_dtmReleaseDate':'release_date'}, inplace=True)

In [177]:
bms_release_list.dtypes

release_year     int64
title           object
release_date    object
dtype: object

In [178]:
#useful snippets
#scraped_data_df_500000.loc[scraped_data_df_500000['title'].str.contains(":")]
#len(scraped_data_df[(scraped_data_df['international_box_office'] == 0) &
  #                 (scraped_data_df['domestic_box_office'] == 0)])
#     scraped_data_df.loc[(scraped_data_df['domestic_box_office'] != 0) &
#                    (scraped_data_df['production_budget'] == 0)]

In [179]:
bms_release_list.loc[india_release_list['title'].str.contains("Pirates")]

Unnamed: 0,release_year,title,release_date
77,2013,Pirates and Roller Coaster Ride,24/04/2013 00:00:00
105,2013,Predator & Pirates,29/05/2013 00:00:00
557,2014,Pirates,01/04/2014 00:00:00
712,2014,Pirates and Wild Coaster,01/09/2014 00:00:00
886,2015,Pirates and Crazy Snowmobile,15/02/2015 00:00:00
1608,2017,Pirates of the Caribbean: Salazar`s Revenge,26/05/2017 00:00:00


In [180]:
import unicodedata
import re
from lxml import html
from random import randint
import time
import itertools


def simplify_title(data):

    separator_list = ["/","-","–","—",")","(","\\",":","[","]","&"]

    for separator in separator_list:
        if separator in data:
            data = data.replace(separator," ")

    for before, after in (('ß', 'ss'), ('Æ', 'AE'), ('æ', 'ae'), ('Œ', 'OE'),\
                          ('œ', 'oe'),('ﬁ','fi'),('ﬂ','fl'),('ø','oe'),('Ð','D'),('Þ','TH')\
                          # put any more transformations here...
                          ):
        data = data.replace(before, after)
        
    data = unicodedata.normalize('NFKD', data).encode('ASCII', 'ignore')
    data = str(data,'utf-8')

    data = re.sub(r'[^a-zA-Z0-9]', ' ', data)
    data = re.sub(r'\s+', ' ', data)
    data = data.strip()
    data = data.lower()

    return data

for index,row in scraped_data_df_500000.iterrows():
    
    print(index)
    
    scraped_title = simplify_title(row['title'])

    scraped_title_words = scraped_title.split(" ") #we get a list of words here
    
    scraped_title_year = row['year']
    
    for indexx,rowx in bms_release_list.iterrows():
        
        bms_title = simplify_title(rowx['title'])

        bms_title_words = bms_title.split(' ')
        
        bms_title_year = rowx['release_year']
        
        if ((all(word in bms_title_words for word in scraped_title_words)) &
                            (bms_title_year == scraped_title_year)):
            
            scraped_data_df_500000.loc[index,'all_words_there'] = 'yes'
            
            scraped_data_df_500000.loc[index,'all_words_there_bms_csv_text'] = rowx['title']
            
            break
            
        else:
            
            continue
        
print("DONE AND DONE")

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
27

In [181]:
scraped_data_df_500000.head()

Unnamed: 0,number,year,title,domestic_box_office,international_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,source,genre,production_method,creative_type,production_companies,production_countries,all_words_there,all_words_there_bms_csv_text
0,4025,2017,Brad’s Status,2133158,2133158,0,2017-09-15,Annapurna Pictures,R,,Original Screenplay,Comedy,Live Action,Contemporary Fiction,,United States,,
1,164,2013,The East,2274649,2274649,6500000,2013-05-31,Fox Searchlight,PG-13,,Original Screenplay,Drama,Live Action,Contemporary Fiction,Scott Free Films,United States,yes,The East
2,5442,2018,The Catcher Was A Spy,714205,714205,0,2018-06-22,IFC Films,R,,Based on Fiction Book/Short Story,Drama,Live Action,Historical Fiction,"Animus Films,PalmStar Media,Serena Films,Windy...",United States,,
3,3062,2016,Miles Ahead,2610896,2610896,0,2016-04-01,Sony Pictures Classics,R,,Based on Real Life Events,Drama,Live Action,Dramatization,"Bifrost Pictures,Sony Pictures Classics,Miles ...",United States,,
4,12,2013,World War Z,202359711,202359711,190000000,2013-06-21,Paramount Pictures,PG-13,World War Z,Based on Fiction Book/Short Story,Action,Live Action,Science Fiction,"Skydance Productions,Hemisphere Media Capital,...",United States,yes,World War Z


In [182]:
import unicodedata
import re
from lxml import html
from random import randint
import time
import itertools


def simplify_title(data):

    separator_list = ["/","-","–","—",")","(","\\",":","[","]","&"]

    for separator in separator_list:
        if separator in data:
            data = data.replace(separator," ")

    for before, after in (('ß', 'ss'), ('Æ', 'AE'), ('æ', 'ae'), ('Œ', 'OE'),\
                          ('œ', 'oe'),('ﬁ','fi'),('ﬂ','fl'),('ø','oe'),('Ð','D'),('Þ','TH')\
                          # put any more transformations here...
                          ):
        data = data.replace(before, after)
        
    data = unicodedata.normalize('NFKD', data).encode('ASCII', 'ignore')
    data = str(data,'utf-8')

    data = re.sub(r'[^a-zA-Z0-9]', ' ', data)
    data = re.sub(r'\s+', ' ', data)
    data = data.strip()
    data = data.lower()

    return data

for index,row in scraped_data_df_500000.iterrows():
    
    print(index)
    
    scraped_title = simplify_title(row['title'])
    
    scraped_title_year = row['year']

    for indexx,rowx in bms_release_list.iterrows():
        
        bms_title = simplify_title(rowx['title'])
        
        bms_title_year = rowx['release_year']

        if ((bms_title == scraped_title) & (bms_title_year == scraped_title_year)):
            
            scraped_data_df_500000.loc[index,'exact_match'] = 'yes'
            
            scraped_data_df_500000.loc[index,'exact_match_bms_csv_text'] = rowx['title']
            
            break
            
        else:
            
            continue
        
print("DONE AND DONE")

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
27

In [183]:
#for some reason we get Nan results with len(scraped_data_df_500000['exact_match'] == 'yes') 
    #so not using it

len(scraped_data_df_500000.loc[scraped_data_df_500000['exact_match'] == 'yes'])

618

In [184]:
#the code below from https://chrisalbon.com/machine_learning/preprocessing_text/remove_stop_words/

# Load library
from nltk.corpus import stopwords

# You will have to download the set of stop words the first time
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/shijith/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [185]:
stop_words = stopwords.words('english')
stop_words[:5]

['i', 'me', 'my', 'myself', 'we']

In [186]:
len(stop_words)

179

In [187]:
import unicodedata
import re
from lxml import html
from random import randint
import time
import itertools


def simplify_title(data):

    separator_list = ["/","-","–","—",")","(","\\",":","[","]","&"]

    for separator in separator_list:
        if separator in data:
            data = data.replace(separator," ")

    for before, after in (('ß', 'ss'), ('Æ', 'AE'), ('æ', 'ae'), ('Œ', 'OE'),\
                          ('œ', 'oe'),('ﬁ','fi'),('ﬂ','fl'),('ø','oe'),('Ð','D'),('Þ','TH')\
                          # put any more transformations here...
                          ):
        data = data.replace(before, after)
        
    data = unicodedata.normalize('NFKD', data).encode('ASCII', 'ignore')
    data = str(data,'utf-8')

    data = re.sub(r'[^a-zA-Z0-9]', ' ', data)
    data = re.sub(r'\s+', ' ', data)
    data = data.strip()
    data = data.lower()

    return data

for index,row in scraped_data_df_500000.iterrows():
    
    print(index)
    
    scraped_title = simplify_title(row['title'])

    scraped_title_words = scraped_title.split(" ") #we get a list of words here
    
    scraped_title_words_remaining = [word for word in scraped_title_words if word not in stop_words]
    
    scraped_title_year = row['year']
    
    for indexx,rowx in bms_release_list.iterrows():
        
        bms_title = simplify_title(rowx['title'])

        bms_title_words = bms_title.split(' ')
        
        bms_title_words_remaining = [word for word in bms_title_words if word not in stop_words]
        
        bms_title_year = rowx['release_year']
        
        if ((set(bms_title_words_remaining) == set(scraped_title_words_remaining)) &
                            (bms_title_year == scraped_title_year)):
            
            scraped_data_df_500000.loc[index,'after_stopwords_removed'] = 'yes'
            
            scraped_data_df_500000.loc[index,'after_stopwords_removed_bms_csv_text'] = rowx['title']
            
            break
            
        else:
            
            continue
        
print("DONE AND DONE")

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
27

In [188]:
import unicodedata
import re
from lxml import html
from random import randint
import time
import itertools


def simplify_title(data):

    separator_list = ["/","-","–","—",")","(","\\",":","[","]","&"]

    for separator in separator_list:
        if separator in data:
            data = data.replace(separator," ")

    for before, after in (('ß', 'ss'), ('Æ', 'AE'), ('æ', 'ae'), ('Œ', 'OE'),\
                          ('œ', 'oe'),('ﬁ','fi'),('ﬂ','fl'),('ø','oe'),('Ð','D'),('Þ','TH')\
                          # put any more transformations here...
                          ):
        data = data.replace(before, after)
        
    data = unicodedata.normalize('NFKD', data).encode('ASCII', 'ignore')
    data = str(data,'utf-8')

    data = re.sub(r'[^a-zA-Z0-9]', ' ', data)
    data = re.sub(r'\s+', ' ', data)
    data = data.strip()
    data = data.lower()

    return data

for index,row in scraped_data_df_500000.iterrows():
    
    print(index)
    
    scraped_title = simplify_title(row['title'])

    scraped_title_words = scraped_title.split(" ") #we get a list of words here
    
    scraped_title_year = row['year']
    
    for indexx,rowx in bms_release_list.iterrows():
        
        bms_title = simplify_title(rowx['title'])

        bms_title_words = bms_title.split(' ')
        
        bms_title_year = rowx['release_year']
        
        if ((set(bms_title_words) == set(scraped_title_words)) &
                            (bms_title_year == scraped_title_year)):
            
            scraped_data_df_500000.loc[index,'words_same'] = 'yes'
            
            scraped_data_df_500000.loc[index,'words_same_bms_csv_text'] = rowx['title']
            
            break
            
        else:
            
            continue
        
print("DONE AND DONE")

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
27

In [189]:
scraped_data_df_500000.columns

Index(['number', 'year', 'title', 'domestic_box_office',
       'international_box_office', 'production_budget',
       'domestic_release_date', 'domestic_distributor', 'mpaa_rating',
       'franchise', 'source', 'genre', 'production_method', 'creative_type',
       'production_companies', 'production_countries', 'all_words_there',
       'all_words_there_bms_csv_text', 'exact_match',
       'exact_match_bms_csv_text', 'after_stopwords_removed',
       'after_stopwords_removed_bms_csv_text', 'words_same',
       'words_same_bms_csv_text'],
      dtype='object')

In [190]:
len(scraped_data_df_500000.loc[scraped_data_df_500000['exact_match'] == 'yes'])

618

In [191]:
len(scraped_data_df_500000.loc[scraped_data_df_500000['all_words_there'] == 'yes'])

633

In [192]:
len(scraped_data_df_500000.loc[scraped_data_df_500000['words_same'] == 'yes'])

618

In [193]:
len(scraped_data_df_500000.loc[scraped_data_df_500000['after_stopwords_removed'] == 'yes'])

630

In [194]:
len(scraped_data_df_500000.loc[(scraped_data_df_500000['all_words_there'] == 'yes') |
                    (scraped_data_df_500000['after_stopwords_removed'] == 'yes') |
                    (scraped_data_df_500000['words_same'] == 'yes') |
                    (scraped_data_df_500000['exact_match'] == 'yes') 
                              ])

642

In [195]:
#changing number of rows that can be displayed in pandas

pd.set_option('display.max_rows', 700)

In [206]:
master_list_df = scraped_data_df_500000.loc[(scraped_data_df_500000['all_words_there'] == 'yes') |
                    (scraped_data_df_500000['after_stopwords_removed'] == 'yes') |
                    (scraped_data_df_500000['words_same'] == 'yes') |
                    (scraped_data_df_500000['exact_match'] == 'yes') 
                              ]

In [207]:
len(scraped_data_df_500000)

1078

In [208]:
len(master_list_df)

642

In [209]:
master_list_df_inverse = scraped_data_df_500000.loc[~((scraped_data_df_500000['all_words_there'] == 'yes') |
                    (scraped_data_df_500000['after_stopwords_removed'] == 'yes') |
                    (scraped_data_df_500000['words_same'] == 'yes') |
                    (scraped_data_df_500000['exact_match'] == 'yes')) 
                              ]

In [210]:
len(master_list_df_inverse)

436

In [197]:
# master_list_df.reset_index(drop=True,inplace=True)

In [211]:
master_list_df.columns

Index(['number', 'year', 'title', 'domestic_box_office',
       'international_box_office', 'production_budget',
       'domestic_release_date', 'domestic_distributor', 'mpaa_rating',
       'franchise', 'source', 'genre', 'production_method', 'creative_type',
       'production_companies', 'production_countries', 'all_words_there',
       'all_words_there_bms_csv_text', 'exact_match',
       'exact_match_bms_csv_text', 'after_stopwords_removed',
       'after_stopwords_removed_bms_csv_text', 'words_same',
       'words_same_bms_csv_text'],
      dtype='object')

In [212]:
# column_drop_list = ['domestic_box_office',
#        'international_box_office', 'production_budget',
#        'domestic_release_date', 'domestic_distributor', 'mpaa_rating',
#        'franchise', 'source', 'genre', 'production_method', 'creative_type',
#        'production_companies', 'production_countries']

# master_list_df_for_check = master_list_df.drop(column_drop_list, axis=1) #no inplace=True

In [220]:
master_list_df_further_check = master_list_df.loc[\
                                        (master_list_df['exact_match'] != 'yes')]

In [221]:
master_list_df_confirmed = master_list_df.loc[\
                                        (master_list_df['exact_match'] == 'yes')]

In [222]:
master_list_df_confirmed['released_in_india'] = 'yes'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [223]:
master_list_df_inverse['released_in_india'] = 'no'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [224]:
master_list_df_further_check

Unnamed: 0,number,year,title,domestic_box_office,international_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,...,production_companies,production_countries,all_words_there,all_words_there_bms_csv_text,exact_match,exact_match_bms_csv_text,after_stopwords_removed,after_stopwords_removed_bms_csv_text,words_same,words_same_bms_csv_text
66,4054,2017,Inhumans,500000,500000,0,,Parkside Releasing,Not Rated,,...,,United States,yes,Marvel`s Inhumans,,,,,,
199,1835,2015,Criminal,14708696,14708696,31500000,2015-04-15,Lionsgate,R,,...,"Summit Entertainment,Millennium Films,Bendersp...","United Kingdom,United States",yes,Criminal Activities,,,,,,
215,3030,2016,Ratchet and Clank,8813410,8813410,20000000,2016-04-29,Focus Features,PG,,...,"Blockade,Rainmaker Entertainment,Gramercy Pict...",United States,,,,,yes,Ratchet & Clank,,
241,156,2013,No,2341226,2341226,0,2013-02-15,Sony Pictures Classics,R,,...,"Fabula,Participant Media,Funny Balloons","Chile,France,United States",,,,,yes,We Are What We Are,,
307,2949,2016,Why Him?,60323786,60323786,38000000,2016-12-23,20th Century Fox,R,,...,"21 Laps Entertainment,Red Hour Productions,TSG...",United States,,,,,yes,Me Before You,,
322,835,2014,Planes: Fire and Rescue,59157732,59157732,50000000,2014-07-18,Walt Disney,PG,Planes,...,"Walt Disney Pictures,DisneyToon Studios,Prana ...",United States,,,,,yes,Planes: Fire & Rescue,,
353,4008,2017,The Wall,1803064,1803064,3000000,2017-05-12,Roadside Attractions,R,,...,"Amazon Studios,Hypnotic Pictures",United States,yes,The Great Wall,,,,,,
385,5188,2018,The Hurricane Heist,6115824,6115824,40000000,2018-03-09,Entertainment Studios Motion Pictures,PG-13,,...,"Foresight Unlimited,Parkside Pictures,Windfall...",United States,,,,,yes,Hurricane Heist,,
410,4052,2017,The Little Hours,1647175,1647175,0,2017-06-30,Gunpowder & Sky,R,,...,,United States,,,,,yes,Little Hours,,
471,1817,2015,Woman in Gold,33307793,33307793,11000000,2015-04-01,Weinstein Co.,PG-13,,...,Origin Pictures Productions,"United Kingdom,United States",yes,The Woman in Gold,,,yes,The Woman in Gold,,


In [225]:
master_list_df_further_check['released_in_india'] = 'yes'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [227]:
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

In [226]:
master_list_df_further_check

Unnamed: 0,number,year,title,domestic_box_office,international_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,...,production_countries,all_words_there,all_words_there_bms_csv_text,exact_match,exact_match_bms_csv_text,after_stopwords_removed,after_stopwords_removed_bms_csv_text,words_same,words_same_bms_csv_text,released_in_india
66,4054,2017,Inhumans,500000,500000,0,,Parkside Releasing,Not Rated,,...,United States,yes,Marvel`s Inhumans,,,,,,,yes
199,1835,2015,Criminal,14708696,14708696,31500000,2015-04-15,Lionsgate,R,,...,"United Kingdom,United States",yes,Criminal Activities,,,,,,,yes
215,3030,2016,Ratchet and Clank,8813410,8813410,20000000,2016-04-29,Focus Features,PG,,...,United States,,,,,yes,Ratchet & Clank,,,yes
241,156,2013,No,2341226,2341226,0,2013-02-15,Sony Pictures Classics,R,,...,"Chile,France,United States",,,,,yes,We Are What We Are,,,yes
307,2949,2016,Why Him?,60323786,60323786,38000000,2016-12-23,20th Century Fox,R,,...,United States,,,,,yes,Me Before You,,,yes
322,835,2014,Planes: Fire and Rescue,59157732,59157732,50000000,2014-07-18,Walt Disney,PG,Planes,...,United States,,,,,yes,Planes: Fire & Rescue,,,yes
353,4008,2017,The Wall,1803064,1803064,3000000,2017-05-12,Roadside Attractions,R,,...,United States,yes,The Great Wall,,,,,,,yes
385,5188,2018,The Hurricane Heist,6115824,6115824,40000000,2018-03-09,Entertainment Studios Motion Pictures,PG-13,,...,United States,,,,,yes,Hurricane Heist,,,yes
410,4052,2017,The Little Hours,1647175,1647175,0,2017-06-30,Gunpowder & Sky,R,,...,United States,,,,,yes,Little Hours,,,yes
471,1817,2015,Woman in Gold,33307793,33307793,11000000,2015-04-01,Weinstein Co.,PG-13,,...,"United Kingdom,United States",yes,The Woman in Gold,,,yes,The Woman in Gold,,,yes


In [228]:
change_list = [199,241,307,353,494,618,624,881,1015]

for index_number in change_list:
    master_list_df_further_check.loc[index_number,'released_in_india'] = 'no'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


In [229]:
master_list_df_further_check

Unnamed: 0,number,year,title,domestic_box_office,international_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,...,production_countries,all_words_there,all_words_there_bms_csv_text,exact_match,exact_match_bms_csv_text,after_stopwords_removed,after_stopwords_removed_bms_csv_text,words_same,words_same_bms_csv_text,released_in_india
66,4054,2017,Inhumans,500000,500000,0,,Parkside Releasing,Not Rated,,...,United States,yes,Marvel`s Inhumans,,,,,,,yes
199,1835,2015,Criminal,14708696,14708696,31500000,2015-04-15,Lionsgate,R,,...,"United Kingdom,United States",yes,Criminal Activities,,,,,,,no
215,3030,2016,Ratchet and Clank,8813410,8813410,20000000,2016-04-29,Focus Features,PG,,...,United States,,,,,yes,Ratchet & Clank,,,yes
241,156,2013,No,2341226,2341226,0,2013-02-15,Sony Pictures Classics,R,,...,"Chile,France,United States",,,,,yes,We Are What We Are,,,no
307,2949,2016,Why Him?,60323786,60323786,38000000,2016-12-23,20th Century Fox,R,,...,United States,,,,,yes,Me Before You,,,no
322,835,2014,Planes: Fire and Rescue,59157732,59157732,50000000,2014-07-18,Walt Disney,PG,Planes,...,United States,,,,,yes,Planes: Fire & Rescue,,,yes
353,4008,2017,The Wall,1803064,1803064,3000000,2017-05-12,Roadside Attractions,R,,...,United States,yes,The Great Wall,,,,,,,no
385,5188,2018,The Hurricane Heist,6115824,6115824,40000000,2018-03-09,Entertainment Studios Motion Pictures,PG-13,,...,United States,,,,,yes,Hurricane Heist,,,yes
410,4052,2017,The Little Hours,1647175,1647175,0,2017-06-30,Gunpowder & Sky,R,,...,United States,,,,,yes,Little Hours,,,yes
471,1817,2015,Woman in Gold,33307793,33307793,11000000,2015-04-01,Weinstein Co.,PG-13,,...,"United Kingdom,United States",yes,The Woman in Gold,,,yes,The Woman in Gold,,,yes


In [230]:
master_list_df_reconstituted = pd.concat([master_list_df_inverse, master_list_df_confirmed, master_list_df_further_check])
len(master_list_df_reconstituted)

1078

In [231]:
# add pirates caribbean title

#know for a fact that pirates of the caribbean was released under another name outside India, so changing that entry specifically

master_list_df_reconstituted.loc[master_list_df_reconstituted['title'].str.contains("Pirates")]

Unnamed: 0,number,year,title,domestic_box_office,international_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,...,production_countries,all_words_there,all_words_there_bms_csv_text,exact_match,exact_match_bms_csv_text,after_stopwords_removed,after_stopwords_removed_bms_csv_text,words_same,words_same_bms_csv_text,released_in_india
77,3867,2017,Pirates of the Caribbean: Dead Men Tell No Tales,172558876,172558876,230000000,2017-05-26,Walt Disney,PG-13,Pirates of the Caribbean,...,United States,,,,,,,,,no


In [232]:
master_list_df_reconstituted.loc[77,'released_in_india'] = 'yes'

In [233]:
master_list_df_reconstituted.loc[master_list_df_reconstituted['title'].str.contains("Pirates")]

Unnamed: 0,number,year,title,domestic_box_office,international_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,...,production_countries,all_words_there,all_words_there_bms_csv_text,exact_match,exact_match_bms_csv_text,after_stopwords_removed,after_stopwords_removed_bms_csv_text,words_same,words_same_bms_csv_text,released_in_india
77,3867,2017,Pirates of the Caribbean: Dead Men Tell No Tales,172558876,172558876,230000000,2017-05-26,Walt Disney,PG-13,Pirates of the Caribbean,...,United States,,,,,,,,,yes


In [238]:
master_list_df_reconstituted.head()

Unnamed: 0,number,year,title,domestic_box_office,international_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,...,production_countries,all_words_there,all_words_there_bms_csv_text,exact_match,exact_match_bms_csv_text,after_stopwords_removed,after_stopwords_removed_bms_csv_text,words_same,words_same_bms_csv_text,released_in_india
0,4025,2017,Brad’s Status,2133158,2133158,0,2017-09-15,Annapurna Pictures,R,,...,United States,,,,,,,,,no
2,5442,2018,The Catcher Was A Spy,714205,714205,0,2018-06-22,IFC Films,R,,...,United States,,,,,,,,,no
3,3062,2016,Miles Ahead,2610896,2610896,0,2016-04-01,Sony Pictures Classics,R,,...,United States,,,,,,,,,no
5,3145,2016,Unsullied,510957,510957,1500000,2016-04-22,Indican Pictures,R,,...,United States,,,,,,,,,no
6,19,2013,The Hangover 3,112200072,112200072,103000000,2013-05-23,Warner Bros.,R,Hangover,...,United States,,,,,,,,,no


In [241]:
master_list_df_reconstituted.to_csv('data/india_release_check_v1.csv', encoding='utf_8_sig')

In [236]:
index_number_list = list(range(1,1079))
for index,row in master_list_df_reconstituted.iterrows():
    if index not in index_number_list:
        print(index)
        
print('done')

0
done


In [242]:
master_list_df_reconstituted.head()

Unnamed: 0,number,year,title,domestic_box_office,international_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,...,production_countries,all_words_there,all_words_there_bms_csv_text,exact_match,exact_match_bms_csv_text,after_stopwords_removed,after_stopwords_removed_bms_csv_text,words_same,words_same_bms_csv_text,released_in_india
0,4025,2017,Brad’s Status,2133158,2133158,0,2017-09-15,Annapurna Pictures,R,,...,United States,,,,,,,,,no
2,5442,2018,The Catcher Was A Spy,714205,714205,0,2018-06-22,IFC Films,R,,...,United States,,,,,,,,,no
3,3062,2016,Miles Ahead,2610896,2610896,0,2016-04-01,Sony Pictures Classics,R,,...,United States,,,,,,,,,no
5,3145,2016,Unsullied,510957,510957,1500000,2016-04-22,Indican Pictures,R,,...,United States,,,,,,,,,no
6,19,2013,The Hangover 3,112200072,112200072,103000000,2013-05-23,Warner Bros.,R,Hangover,...,United States,,,,,,,,,no


In [244]:
master_list_df_reconstituted.replace({"`":"'"}, regex=True)

Unnamed: 0,number,year,title,domestic_box_office,international_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,...,production_countries,all_words_there,all_words_there_bms_csv_text,exact_match,exact_match_bms_csv_text,after_stopwords_removed,after_stopwords_removed_bms_csv_text,words_same,words_same_bms_csv_text,released_in_india
0,4025,2017,Brad’s Status,2133158,2133158,0,2017-09-15,Annapurna Pictures,R,,...,United States,,,,,,,,,no
2,5442,2018,The Catcher Was A Spy,714205,714205,0,2018-06-22,IFC Films,R,,...,United States,,,,,,,,,no
3,3062,2016,Miles Ahead,2610896,2610896,0,2016-04-01,Sony Pictures Classics,R,,...,United States,,,,,,,,,no
5,3145,2016,Unsullied,510957,510957,1500000,2016-04-22,Indican Pictures,R,,...,United States,,,,,,,,,no
6,19,2013,The Hangover 3,112200072,112200072,103000000,2013-05-23,Warner Bros.,R,Hangover,...,United States,,,,,,,,,no
7,4000,2017,Norman: The Moderate Rise and Tragic Fall of a...,3814868,3814868,0,2017-04-14,Sony Pictures Classics,R,,...,"Israel,United States",,,,,,,,,no
8,1955,2015,Faith of Our Fathers,1004105,1004105,0,2015-07-01,Pure Flix / Samuel Goldwyn Films,PG-13,,...,United States,,,,,,,,,no
9,209,2013,Muscle Shoals,696241,696241,0,2013-09-27,Magnolia Pictures,PG,,...,United States,,,,,,,,,no
10,939,2014,CitizenFour,2800870,2800870,0,2014-10-24,RADiUS-TWC,R,,...,United States,,,,,,,,,no
12,1006,2014,On Any Sunday: The Next Chapter,509916,509916,0,2014-11-07,Hannover House,PG,On Any Sunday,...,United States,,,,,,,,,no


In [245]:
master_list_df_reconstituted.replace({"’":"'"}, regex=True)

Unnamed: 0,number,year,title,domestic_box_office,international_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,...,production_countries,all_words_there,all_words_there_bms_csv_text,exact_match,exact_match_bms_csv_text,after_stopwords_removed,after_stopwords_removed_bms_csv_text,words_same,words_same_bms_csv_text,released_in_india
0,4025,2017,Brad's Status,2133158,2133158,0,2017-09-15,Annapurna Pictures,R,,...,United States,,,,,,,,,no
2,5442,2018,The Catcher Was A Spy,714205,714205,0,2018-06-22,IFC Films,R,,...,United States,,,,,,,,,no
3,3062,2016,Miles Ahead,2610896,2610896,0,2016-04-01,Sony Pictures Classics,R,,...,United States,,,,,,,,,no
5,3145,2016,Unsullied,510957,510957,1500000,2016-04-22,Indican Pictures,R,,...,United States,,,,,,,,,no
6,19,2013,The Hangover 3,112200072,112200072,103000000,2013-05-23,Warner Bros.,R,Hangover,...,United States,,,,,,,,,no
7,4000,2017,Norman: The Moderate Rise and Tragic Fall of a...,3814868,3814868,0,2017-04-14,Sony Pictures Classics,R,,...,"Israel,United States",,,,,,,,,no
8,1955,2015,Faith of Our Fathers,1004105,1004105,0,2015-07-01,Pure Flix / Samuel Goldwyn Films,PG-13,,...,United States,,,,,,,,,no
9,209,2013,Muscle Shoals,696241,696241,0,2013-09-27,Magnolia Pictures,PG,,...,United States,,,,,,,,,no
10,939,2014,CitizenFour,2800870,2800870,0,2014-10-24,RADiUS-TWC,R,,...,United States,,,,,,,,,no
12,1006,2014,On Any Sunday: The Next Chapter,509916,509916,0,2014-11-07,Hannover House,PG,On Any Sunday,...,United States,,,,,,,,,no


In [252]:
master_list_df_reconstituted['released_in_india_2nd_check'] = master_list_df_reconstituted['released_in_india']

In [253]:
master_list_df_reconstituted['released_in_india_2nd_check_match_1'] = master_list_df_reconstituted['released_in_india']

In [254]:
master_list_df_reconstituted['released_in_india_2nd_check_match_2'] = master_list_df_reconstituted['released_in_india']

In [255]:
master_list_df_reconstituted['released_in_india_2nd_check_match_3'] = master_list_df_reconstituted['released_in_india']

In [263]:
#just backing up the df to a csv 

master_list_df_reconstituted.to_csv('data/india_release_check_v2.csv')

In [292]:
bms_release_list = pd.read_csv("data/bookmyshow_list.csv")
bms_release_list

Unnamed: 0,Release_Year,EventGroup_strTitle,Event_dtmReleaseDate
0,2013,Hansel And Gretel Witch Hunters,01/01/2013 00:00:00
1,2013,7 Days In Slow Motion,04/01/2013 00:00:00
2,2013,Chinese Zodiac,04/01/2013 00:00:00
3,2013,The Impossible,04/01/2013 00:00:00
4,2013,Tropical Races 3,04/01/2013 00:00:00
5,2013,Universal Soldier: Day Of Reckoning,04/01/2013 00:00:00
6,2013,Comic Vampire,05/01/2013 00:00:00
7,2013,Dinosaur Reappearance,05/01/2013 00:00:00
8,2013,Roller Coaster & Dinosaur Adventure,05/01/2013 00:00:00
9,2013,Gangster Squad,11/01/2013 00:00:00


In [293]:
bms_release_list.dtypes

Release_Year             int64
EventGroup_strTitle     object
Event_dtmReleaseDate    object
dtype: object

In [294]:
bms_release_list.rename(columns={'Release_Year': 'release_year',
                                  'EventGroup_strTitle':'title',
                                  'Event_dtmReleaseDate':'release_date'}, inplace=True)

In [295]:
bms_release_list.dtypes

release_year     int64
title           object
release_date    object
dtype: object

In [330]:
master_list_df_reconstituted = pd.read_csv("data/india_release_check_v2.csv")

In [331]:
master_list_df_reconstituted['released_in_india_2nd_check_match_4'] = master_list_df_reconstituted['released_in_india']

master_list_df_reconstituted['released_in_india_2nd_check_match_5'] = master_list_df_reconstituted['released_in_india']

master_list_df_reconstituted['released_in_india_2nd_check_match_6'] = master_list_df_reconstituted['released_in_india']

master_list_df_reconstituted['released_in_india_2nd_check_match_7'] = master_list_df_reconstituted['released_in_india']

master_list_df_reconstituted['released_in_india_2nd_check_match_8'] = master_list_df_reconstituted['released_in_india']

In [332]:
master_list_df_reconstituted['released_in_india_2nd_check_match_1_bms_csv_text'] = ""

master_list_df_reconstituted['released_in_india_2nd_check_match_2_bms_csv_text'] = ""

master_list_df_reconstituted['released_in_india_2nd_check_match_3_bms_csv_text'] = ""

master_list_df_reconstituted['released_in_india_2nd_check_match_4_bms_csv_text'] = ""

master_list_df_reconstituted['released_in_india_2nd_check_match_5_bms_csv_text'] = ""

master_list_df_reconstituted['released_in_india_2nd_check_match_6_bms_csv_text'] = ""

master_list_df_reconstituted['released_in_india_2nd_check_match_7_bms_csv_text'] = ""

master_list_df_reconstituted['released_in_india_2nd_check_match_8_bms_csv_text'] = ""

In [333]:
# Load library
from nltk.corpus import stopwords

# You will have to download the set of stop words the first time
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/shijith/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [334]:
stop_words = stopwords.words('english')

In [335]:
master_list_df_reconstituted.shape

(1078, 43)

In [336]:
master_list_df_reconstituted_yes = master_list_df_reconstituted[master_list_df_reconstituted['released_in_india']=='yes']

In [337]:
master_list_df_reconstituted_no = master_list_df_reconstituted[master_list_df_reconstituted['released_in_india']=='no']

In [339]:
#master_list_df_reconstituted_no

import unicodedata
import re
from lxml import html
from random import randint
import time
import itertools


def simplify_title(data):

    separator_list = ["/","-","–","—",")","(","\\",":","[","]","&"]

    for separator in separator_list:
        if separator in data:
            data = data.replace(separator," ")

    for before, after in (('ß', 'ss'), ('Æ', 'AE'), ('æ', 'ae'), ('Œ', 'OE'),\
                          ('œ', 'oe'),('ﬁ','fi'),('ﬂ','fl'),('ø','oe'),('Ð','D'),('Þ','TH')\
                          # put any more transformations here...
                          ):
        data = data.replace(before, after)
        
    data = unicodedata.normalize('NFKD', data).encode('ASCII', 'ignore')
    data = str(data,'utf-8')

    data = re.sub(r'[^a-zA-Z0-9]', ' ', data)
    data = re.sub(r'\s+', ' ', data)
    data = data.strip()
    data = data.lower()

    return data

for index,row in master_list_df_reconstituted_no.iterrows():
    
    print(index)
    
    scraped_title = simplify_title(row['title'])

    scraped_title_words = scraped_title.split(" ") #we get a list of words here
    
    scraped_title_words = [word for word in scraped_title_words if word not in stop_words]
    
    len_scraped_title_words = len(scraped_title_words)
    
    if len_scraped_title_words == 0:
        
        print('no words to search for title -- {}'.format(row['title']))
        
        continue
        
    elif len_scraped_title_words == 1:
        
        for indexx,rowx in bms_release_list.iterrows():

            bms_title = simplify_title(rowx['title'])

            bms_title_words = bms_title.split(' ')
            
            bms_title_words = [word for word in bms_title_words if word not in stop_words]
            
            if (set(bms_title_words) == set(scraped_title_words)):
                
                master_list_df_reconstituted_no.loc[index,'released_in_india_2nd_check_match_1'] = 'yes'
                
                master_list_df_reconstituted_no.loc[index,'released_in_india_2nd_check_match_1_bms_csv_text'] = rowx['title']
                
                break
            
            else:
                
                continue
            
        continue
        
    elif len_scraped_title_words > 1:
        
        scraped_title_words_combo_list = list(itertools.combinations(scraped_title_words, (len_scraped_title_words - 1))) #you get a list of tuples here
        
        #if you want to understand what's going on here, list(itertools.combinations([1,2,3], 2)) gives us [(1, 2), (1, 3), (2, 3)]
        
        iteratorx = 0
        
        #you have the pass statement available to you as well

        #what happens if we don't even get a 1st match?

        for indexx,rowx in bms_release_list.iterrows():

            bms_title = simplify_title(rowx['title'])

            bms_title_words = bms_title.split(' ')
            
            bms_title_words = [word for word in bms_title_words if word not in stop_words]
            
            if iteratorx == 0:
            
                for combo in scraped_title_words_combo_list:

                    if all(word in bms_title_words for word in combo):

                        master_list_df_reconstituted_no.loc[index,'released_in_india_2nd_check_match_1'] = 'yes'

                        master_list_df_reconstituted_no.loc[index,'released_in_india_2nd_check_match_1_bms_csv_text'] = rowx['title']

                        iteratorx = 1

                        break

                    else:

                        continue
                
                continue
            
            elif iteratorx == 1:
                
                for combo in scraped_title_words_combo_list:

                    if all(word in bms_title_words for word in combo):

                        master_list_df_reconstituted_no.loc[index,'released_in_india_2nd_check_match_2'] = 'yes'

                        master_list_df_reconstituted_no.loc[index,'released_in_india_2nd_check_match_2_bms_csv_text'] = rowx['title']

                        iteratorx = 2

                        break

                    else:

                        continue
                
                continue
            
            elif iteratorx == 2:
                
                for combo in scraped_title_words_combo_list:

                    if all(word in bms_title_words for word in combo):

                        master_list_df_reconstituted_no.loc[index,'released_in_india_2nd_check_match_3'] = 'yes'

                        master_list_df_reconstituted_no.loc[index,'released_in_india_2nd_check_match_3_bms_csv_text'] = rowx['title']

                        iteratorx = 3

                        break

                    else:

                        continue
                
                continue

            elif iteratorx == 3:
                
                for combo in scraped_title_words_combo_list:

                    if all(word in bms_title_words for word in combo):

                        master_list_df_reconstituted_no.loc[index,'released_in_india_2nd_check_match_4'] = 'yes'

                        master_list_df_reconstituted_no.loc[index,'released_in_india_2nd_check_match_4_bms_csv_text'] = rowx['title']

                        iteratorx = 4

                        break

                    else:

                        continue
                
                continue
                
                
            elif iteratorx == 4:
                
                for combo in scraped_title_words_combo_list:

                    if all(word in bms_title_words for word in combo):

                        master_list_df_reconstituted_no.loc[index,'released_in_india_2nd_check_match_5'] = 'yes'

                        master_list_df_reconstituted_no.loc[index,'released_in_india_2nd_check_match_5_bms_csv_text'] = rowx['title']

                        iteratorx = 5

                        break

                    else:

                        continue
                
                continue

            elif iteratorx == 5:
                
                for combo in scraped_title_words_combo_list:

                    if all(word in bms_title_words for word in combo):

                        master_list_df_reconstituted_no.loc[index,'released_in_india_2nd_check_match_6'] = 'yes'

                        master_list_df_reconstituted_no.loc[index,'released_in_india_2nd_check_match_6_bms_csv_text'] = rowx['title']

                        iteratorx = 6

                        break

                    else:

                        continue
                
                continue
                
            elif iteratorx == 6:
                
                for combo in scraped_title_words_combo_list:

                    if all(word in bms_title_words for word in combo):

                        master_list_df_reconstituted_no.loc[index,'released_in_india_2nd_check_match_7'] = 'yes'

                        master_list_df_reconstituted_no.loc[index,'released_in_india_2nd_check_match_7_bms_csv_text'] = rowx['title']

                        iteratorx = 7

                        break

                    else:

                        continue
                
                continue

            elif iteratorx == 7:
                
                for combo in scraped_title_words_combo_list:

                    if all(word in bms_title_words for word in combo):

                        master_list_df_reconstituted_no.loc[index,'released_in_india_2nd_check_match_8'] = 'yes'

                        master_list_df_reconstituted_no.loc[index,'released_in_india_2nd_check_match_8_bms_csv_text'] = rowx['title']

                        iteratorx = 8

                        break

                    else:

                        continue
                
                continue

            elif iteratorx == 8:

                break

        continue

print("DONE AND DONE")

0
1


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
2

In [340]:
master_list_df_reconstituted_no.columns

Index(['Unnamed: 0', 'number', 'year', 'title', 'domestic_box_office',
       'international_box_office', 'production_budget',
       'domestic_release_date', 'domestic_distributor', 'mpaa_rating',
       'franchise', 'source', 'genre', 'production_method', 'creative_type',
       'production_companies', 'production_countries', 'all_words_there',
       'all_words_there_bms_csv_text', 'exact_match',
       'exact_match_bms_csv_text', 'after_stopwords_removed',
       'after_stopwords_removed_bms_csv_text', 'words_same',
       'words_same_bms_csv_text', 'released_in_india',
       'released_in_india_2nd_check', 'released_in_india_2nd_check_match_1',
       'released_in_india_2nd_check_match_2',
       'released_in_india_2nd_check_match_3',
       'released_in_india_2nd_check_match_4',
       'released_in_india_2nd_check_match_5',
       'released_in_india_2nd_check_match_6',
       'released_in_india_2nd_check_match_7',
       'released_in_india_2nd_check_match_8',
       'released_in_

In [341]:
master_list_df_reconstituted_no_for_check = master_list_df_reconstituted_no.loc[(master_list_df_reconstituted_no['released_in_india_2nd_check_match_1'] == 'yes') |
                                                                                (master_list_df_reconstituted_no['released_in_india_2nd_check_match_2'] == 'yes') |
                                                                                (master_list_df_reconstituted_no['released_in_india_2nd_check_match_3'] == 'yes') |
                                                                                (master_list_df_reconstituted_no['released_in_india_2nd_check_match_4'] == 'yes') |
                                                                                (master_list_df_reconstituted_no['released_in_india_2nd_check_match_5'] == 'yes') |
                                                                                (master_list_df_reconstituted_no['released_in_india_2nd_check_match_6'] == 'yes') |
                                                                                (master_list_df_reconstituted_no['released_in_india_2nd_check_match_7'] == 'yes') |
                                                                                (master_list_df_reconstituted_no['released_in_india_2nd_check_match_8'] == 'yes') ]

In [348]:
from IPython.core.display import HTML
HTML("<style>.rendered_html th {max-width: 50px;}</style>")

In [342]:
master_list_df_reconstituted_no_for_check.loc[:, ['title',
                                                  'released_in_india_2nd_check_match_1_bms_csv_text',
                                                  'released_in_india_2nd_check_match_2_bms_csv_text',
                                                  'released_in_india_2nd_check_match_3_bms_csv_text',
                                                  'released_in_india_2nd_check_match_4_bms_csv_text',
                                                  'released_in_india_2nd_check_match_5_bms_csv_text',
                                                  'released_in_india_2nd_check_match_6_bms_csv_text',
                                                  'released_in_india_2nd_check_match_7_bms_csv_text',
                                                  'released_in_india_2nd_check_match_8_bms_csv_text']]

Unnamed: 0,title,released_in_india_2nd_check_match_1_bms_csv_text,released_in_india_2nd_check_match_2_bms_csv_text,released_in_india_2nd_check_match_3_bms_csv_text,released_in_india_2nd_check_match_4_bms_csv_text,released_in_india_2nd_check_match_5_bms_csv_text,released_in_india_2nd_check_match_6_bms_csv_text,released_in_india_2nd_check_match_7_bms_csv_text,released_in_india_2nd_check_match_8_bms_csv_text
1,The Catcher Was A Spy,Spy,The Spy Who Dumped Me,,,,,,
4,The Hangover 3,Tropical Races 3,Iron Man 3,3 Geezers,The Hangover Part III,Dungeons And Dragons 3,Dhooom 3.5,Nurse 3-D,3 Days To Kill
11,The Case for Christ,A Case Of You,,,,,,,
12,God's Not Dead,Evil Dead,Evil Dead,Detention Of The Dead,Only God Forgives,God Loves Uganda,Birth Of The Living Dead,God's Not Dead,Son of God
13,"Three Billboards Outside Ebbing, Missouri","Three Billboards Outside Ebbing, Missouri",,,,,,,
14,Wind River,Brothers Of The Wind,,,,,,,
15,Love & Mercy,All the Boys Love Mandy Lane,I'm In Love With A Church Girl,Last Love,The Falls: Testament Of Love,Love & Air Sex,Endless Love,"Love, Rosie",Love & Friendship
16,Girls Trip,Asian School Girls,Chiang Mai Deep Trip,,,,,,
18,What Maisie Knew,The Man Who Knew Infinity,,,,,,,
22,Fifty Shades Freed,Fifty Shades of Grey,Fifty Shades Darker,,,,,,


In [349]:
#there's a title 'why him' that was released in the US but isn't showing up, i guess because both 'why' and 'him' are stop words, so just manually correcting for this title
#am getting the index number for it

master_list_df_reconstituted_no.loc[master_list_df_reconstituted_no['title'].str.contains("Why Him")]

Unnamed: 0.1,Unnamed: 0,number,year,title,domestic_box_office,international_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,...,released_in_india_2nd_check_match_7,released_in_india_2nd_check_match_8,released_in_india_2nd_check_match_1_bms_csv_text,released_in_india_2nd_check_match_2_bms_csv_text,released_in_india_2nd_check_match_3_bms_csv_text,released_in_india_2nd_check_match_4_bms_csv_text,released_in_india_2nd_check_match_5_bms_csv_text,released_in_india_2nd_check_match_6_bms_csv_text,released_in_india_2nd_check_match_7_bms_csv_text,released_in_india_2nd_check_match_8_bms_csv_text
1058,307,2949,2016,Why Him?,60323786,60323786,38000000,2016-12-23,20th Century Fox,R,...,no,no,,,,,,,,


In [350]:
#got this correction list by going through the table above
#so i have like 8 potential matches in the bookmyshow list for each movie released domestically in the US (and which grossed at least $500,000 at the box office)
#if it seems like one of the 8 matches is actually a match, have entered the index number in this list, the correction_list

correction_list = [4,12,13,24,29,32,42,43,48,50,51,53,55,56,59,61,62,64,69,72,78,80,86,90,92,95,101,104,106,114,120,121,128,129,132,134,137,138,139,140,145,146,148,149,157,159,166,172,173,176,178,182,186,
                187,188,190,197,203,205,206,208,209,212,215,221,225,233,
239,241,243,244,250,255,256,257,258,259,262,264,265,266,271,272,274,277,278,281,282,283,286,296,297,299,316,317,320,322,327,329,331,338,340,343,348,354,359,361,366,367,370,371,373,376,379,383,384,
                385,386,387,388,389,390,393,394,396,401,407,409,
414,419,421,425,428,431,434,1055,1060,1065,1069,1058]

In [351]:
master_list_df_reconstituted_no.shape

(444, 43)

In [353]:
len(master_list_df_reconstituted_no.loc[(master_list_df_reconstituted_no['released_in_india_2nd_check'] == 'yes')])

0

In [354]:
len(master_list_df_reconstituted_no.loc[(master_list_df_reconstituted_no['released_in_india_2nd_check'] == 'no')])

444

In [355]:
for index_number in correction_list:
    master_list_df_reconstituted_no.loc[index_number,'released_in_india_2nd_check'] = 'yes'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


In [356]:
len(master_list_df_reconstituted_no.loc[(master_list_df_reconstituted_no['released_in_india_2nd_check'] == 'yes')])

140

In [357]:
len(master_list_df_reconstituted_no.loc[(master_list_df_reconstituted_no['released_in_india_2nd_check'] == 'no')])

304

In [358]:
master_list_df_reconstituted_no.loc[(master_list_df_reconstituted_no['released_in_india_2nd_check'] == 'yes')]

Unnamed: 0.1,Unnamed: 0,number,year,title,domestic_box_office,international_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,...,released_in_india_2nd_check_match_7,released_in_india_2nd_check_match_8,released_in_india_2nd_check_match_1_bms_csv_text,released_in_india_2nd_check_match_2_bms_csv_text,released_in_india_2nd_check_match_3_bms_csv_text,released_in_india_2nd_check_match_4_bms_csv_text,released_in_india_2nd_check_match_5_bms_csv_text,released_in_india_2nd_check_match_6_bms_csv_text,released_in_india_2nd_check_match_7_bms_csv_text,released_in_india_2nd_check_match_8_bms_csv_text
4,6,19,2013,The Hangover 3,112200072,112200072,103000000,2013-05-23,Warner Bros.,R,...,yes,yes,Tropical Races 3,Iron Man 3,3 Geezers,The Hangover Part III,Dungeons And Dragons 3,Dhooom 3.5,Nurse 3-D,3 Days To Kill
12,23,870,2014,God's Not Dead,60755732,60755732,1150000,2014-03-21,Pure Flix Entertainment,PG,...,yes,yes,Evil Dead,Evil Dead,Detention Of The Dead,Only God Forgives,God Loves Uganda,Birth Of The Living Dead,God's Not Dead,Son of God
13,25,3904,2017,"Three Billboards Outside Ebbing, Missouri",54513740,54513740,12000000,2017-12-01,Fox Searchlight,R,...,no,no,"Three Billboards Outside Ebbing, Missouri",,,,,,,
24,55,964,2014,Awake: The Life of Yogananda,1511750,1511750,0,2014-10-10,Counterpoint Films & Self-Realization Fellowship,PG,...,no,no,Awake: The Life of Yogananda,,,,,,,
29,68,3961,2017,Wish Upon,14301505,14301505,12000000,2017-07-14,Broad Green Pictures,PG-13,...,no,no,Once Upon a Time in Venice,Wish Upon,Death Wish,,,,,
32,80,1796,2015,Spotlight,45055776,45055776,20000000,2015-11-06,Open Road,R,...,no,no,Spotlight,,,,,,,
42,103,3900,2017,Daddy's Home 2,104029443,104029443,70000000,2017-11-10,Paramount Pictures,PG-13,...,no,no,Daddy's Home,Daddy's Home 2,,,,,,
43,106,1890,2015,Truth,2541854,2541854,0,2015-10-16,Sony Pictures Classics,R,...,no,no,Truth,,,,,,,
48,119,1000,2014,Jodorowsky's Dune,646512,646512,0,2014-03-21,Sony Pictures Classics,PG-13,...,no,no,Jodorowsky's Dune,,,,,,,
50,123,1923,2015,The Diary of a Teenage Girl,1477002,1477002,2000000,2015-08-07,Sony Pictures Classics,R,...,no,no,The Diary Of A Teenage Girl,,,,,,,


In [359]:
master_list_df_reconstituted_no.loc[:, ['title',
                                                  'released_in_india',
                                                  'released_in_india_2nd_check']]

Unnamed: 0,title,released_in_india,released_in_india_2nd_check
0,Brad's Status,no,no
1,The Catcher Was A Spy,no,no
2,Miles Ahead,no,no
3,Unsullied,no,no
4,The Hangover 3,no,yes
5,Norman: The Moderate Rise and Tragic Fall of a...,no,no
6,Faith of Our Fathers,no,no
7,Muscle Shoals,no,no
8,CitizenFour,no,no
9,On Any Sunday: The Next Chapter,no,no


In [363]:
#so colons in the title names were throwing my search off
#below is a list of all the films with the colon mark ':' in their titles and have been marked as not released in India
#will be checking this list manually and compiling another correction list

master_list_df_reconstituted_no.loc[(master_list_df_reconstituted_no['title'].str.contains(":")) & (master_list_df_reconstituted_no['released_in_india_2nd_check'] == 'no')]

Unnamed: 0.1,Unnamed: 0,number,year,title,domestic_box_office,international_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,...,released_in_india_2nd_check_match_7,released_in_india_2nd_check_match_8,released_in_india_2nd_check_match_1_bms_csv_text,released_in_india_2nd_check_match_2_bms_csv_text,released_in_india_2nd_check_match_3_bms_csv_text,released_in_india_2nd_check_match_4_bms_csv_text,released_in_india_2nd_check_match_5_bms_csv_text,released_in_india_2nd_check_match_6_bms_csv_text,released_in_india_2nd_check_match_7_bms_csv_text,released_in_india_2nd_check_match_8_bms_csv_text
5,7,4000,2017,Norman: The Moderate Rise and Tragic Fall of a...,3814868,3814868,0,2017-04-14,Sony Pictures Classics,R,...,no,no,,,,,,,,
9,12,1006,2014,On Any Sunday: The Next Chapter,509916,509916,0,2014-11-07,Hannover House,PG,...,no,no,,,,,,,,
21,48,165,2013,The Met: Live in HD - Aida,2800000,2800000,0,,First Run Features,Not Rated,...,no,no,,,,,,,,
75,168,4077,2017,May it Last: A Portrait of the Avett Brothers,725286,725286,0,2017-09-12,Oscilloscope Pictures,Not Rated,...,no,no,,,,,,,,
83,180,4045,2017,Mark Felt: The Man Who Brought Down the White ...,768646,768646,0,2017-09-29,Sony Pictures Classics,PG-13,...,no,no,,,,,,,,
100,230,3081,2016,Hillsong: Let Hope Rise,2394725,2394725,0,2016-09-16,Pure Flix Entertainment,PG,...,no,no,,,,,,,,
102,238,924,2014,America: Imagine a World Without Her,14444502,14444502,0,2014-07-02,Lionsgate,PG-13,...,no,no,,,,,,,,
108,253,5189,2018,God's Not Dead: A Light in Darkness,5728940,5728940,0,2018-03-30,Pure Flix Entertainment,PG,...,no,no,,,,,,,,
110,256,3035,2016,The Beatles: Eight Days a Week,2934445,2934445,0,2016-09-16,Abramorama Films,Not Rated,...,no,no,,,,,,,,
112,258,3103,2016,The Music of Strangers: Yo-Yo Ma and the Silk ...,1169214,1169214,0,2016-06-10,The Orchard,PG-13,...,no,no,,,,,,,,


In [362]:
master_list_df_reconstituted_no.loc[master_list_df_reconstituted_no['title'].str.contains("\(")]

Unnamed: 0.1,Unnamed: 0,number,year,title,domestic_box_office,international_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,...,released_in_india_2nd_check_match_7,released_in_india_2nd_check_match_8,released_in_india_2nd_check_match_1_bms_csv_text,released_in_india_2nd_check_match_2_bms_csv_text,released_in_india_2nd_check_match_3_bms_csv_text,released_in_india_2nd_check_match_4_bms_csv_text,released_in_india_2nd_check_match_5_bms_csv_text,released_in_india_2nd_check_match_6_bms_csv_text,released_in_india_2nd_check_match_7_bms_csv_text,released_in_india_2nd_check_match_8_bms_csv_text
429,1061,850,2014,Birdman or (The Unexpected Virtue of Ignorance),42340598,42340598,18000000,2014-10-17,Fox Searchlight,R,...,no,no,,,,,,,,


In [364]:
len(master_list_df_reconstituted_no.loc[(master_list_df_reconstituted_no['released_in_india_2nd_check'] == 'yes')])

140

In [365]:
#looked at the table above, checked the csv manually, found which movies were released and which weren't

correction_list_2 = [429,83,169,234,245,301,311,364,432]

for index_number in correction_list_2:
    master_list_df_reconstituted_no.loc[index_number,'released_in_india_2nd_check'] = 'yes'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


In [366]:
len(master_list_df_reconstituted_no.loc[(master_list_df_reconstituted_no['released_in_india_2nd_check'] == 'yes')])

149

In [367]:
#time to reconstitute the dataframe again

master_list_df_reconstituted_no.shape

(444, 43)

In [368]:
master_list_df_reconstituted_yes.shape

(634, 43)

In [369]:
master_list_df_reconstituted_again = pd.concat([master_list_df_reconstituted_yes, master_list_df_reconstituted_no])
len(master_list_df_reconstituted_again)

1078

In [371]:
master_list_df_reconstituted_again.columns

Index(['Unnamed: 0', 'number', 'year', 'title', 'domestic_box_office',
       'international_box_office', 'production_budget',
       'domestic_release_date', 'domestic_distributor', 'mpaa_rating',
       'franchise', 'source', 'genre', 'production_method', 'creative_type',
       'production_companies', 'production_countries', 'all_words_there',
       'all_words_there_bms_csv_text', 'exact_match',
       'exact_match_bms_csv_text', 'after_stopwords_removed',
       'after_stopwords_removed_bms_csv_text', 'words_same',
       'words_same_bms_csv_text', 'released_in_india',
       'released_in_india_2nd_check', 'released_in_india_2nd_check_match_1',
       'released_in_india_2nd_check_match_2',
       'released_in_india_2nd_check_match_3',
       'released_in_india_2nd_check_match_4',
       'released_in_india_2nd_check_match_5',
       'released_in_india_2nd_check_match_6',
       'released_in_india_2nd_check_match_7',
       'released_in_india_2nd_check_match_8',
       'released_in_

In [372]:
master_list_df_reconstituted_again.drop('Unnamed: 0', axis=1, inplace=True)

In [373]:
master_list_df_reconstituted_again.columns

Index(['number', 'year', 'title', 'domestic_box_office',
       'international_box_office', 'production_budget',
       'domestic_release_date', 'domestic_distributor', 'mpaa_rating',
       'franchise', 'source', 'genre', 'production_method', 'creative_type',
       'production_companies', 'production_countries', 'all_words_there',
       'all_words_there_bms_csv_text', 'exact_match',
       'exact_match_bms_csv_text', 'after_stopwords_removed',
       'after_stopwords_removed_bms_csv_text', 'words_same',
       'words_same_bms_csv_text', 'released_in_india',
       'released_in_india_2nd_check', 'released_in_india_2nd_check_match_1',
       'released_in_india_2nd_check_match_2',
       'released_in_india_2nd_check_match_3',
       'released_in_india_2nd_check_match_4',
       'released_in_india_2nd_check_match_5',
       'released_in_india_2nd_check_match_6',
       'released_in_india_2nd_check_match_7',
       'released_in_india_2nd_check_match_8',
       'released_in_india_2nd_chec

In [374]:
master_list_df_reconstituted_again.to_csv('data/india_release_check_v3', index = False)

In [375]:
master_list_df_reconstituted_again['domestic_box_office_ranked'] = master_list_df_reconstituted_again.groupby(['year'])['domestic_box_office'].rank(ascending=False)

In [376]:
#master_list_df_reconstituted_no.loc[(master_list_df_reconstituted_no['released_in_india_2nd_check'] == 'yes')]

master_list_df_reconstituted_again.loc[master_list_df_reconstituted_again['domestic_box_office_ranked'] == 1]

Unnamed: 0,number,year,title,domestic_box_office,international_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,...,released_in_india_2nd_check_match_8,released_in_india_2nd_check_match_1_bms_csv_text,released_in_india_2nd_check_match_2_bms_csv_text,released_in_india_2nd_check_match_3_bms_csv_text,released_in_india_2nd_check_match_4_bms_csv_text,released_in_india_2nd_check_match_5_bms_csv_text,released_in_india_2nd_check_match_6_bms_csv_text,released_in_india_2nd_check_match_7_bms_csv_text,released_in_india_2nd_check_match_8_bms_csv_text,domestic_box_office_ranked
762,5,2013,The Hunger Games: Catching Fire,424668047,424668047,130000000,2013-11-22,Lionsgate,PG-13,Hunger Games,...,yes,,,,,,,,,1.0
900,5117,2018,Black Panther,700059566,700059566,200000000,2018-02-16,Walt Disney,PG-13,Marvel Cinematic Universe,...,yes,,,,,,,,,1.0
1036,2887,2016,Rogue One: A Star Wars Story,532177324,532177324,200000000,2016-12-16,Walt Disney,PG-13,Star Wars,...,yes,,,,,,,,,1.0
244,799,2014,American Sniper,350126372,350126372,58000000,2014-01-16,Warner Bros.,R,,...,yes,An American Promise,American Hustle,American Sniper,American Ultra,American Pastoral,American Honey,American Assassin,American Made,1.0
311,3857,2017,Star Wars Ep. VIII: The Last Jedi,620181382,620181382,317000000,2017-12-15,Walt Disney,PG-13,Star Wars,...,no,,,,,,,,,1.0
432,1736,2015,Star Wars Ep. VII: The Force Awakens,936662225,936662225,306000000,2015-12-18,Walt Disney,PG-13,Star Wars,...,no,,,,,,,,,1.0


In [386]:
master_list_df_reconstituted_again.to_csv('data/india_release_check_v4', index = False)

In [387]:
master_list_df_reconstituted_again_no = master_list_df_reconstituted_again.loc[master_list_df_reconstituted_again['released_in_india_2nd_check'] != 'yes']

In [389]:
stop_words

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [390]:
#doing one more check, just this time I'm checking if all the words from the bookmyshow title are there in the official US title to get us matches

import unicodedata
import re
from lxml import html
from random import randint
import time
import itertools


def simplify_title(data):

    separator_list = ["/","-","–","—",")","(","\\",":","[","]","&"]

    for separator in separator_list:
        if separator in data:
            data = data.replace(separator," ")

    for before, after in (('ß', 'ss'), ('Æ', 'AE'), ('æ', 'ae'), ('Œ', 'OE'),\
                          ('œ', 'oe'),('ﬁ','fi'),('ﬂ','fl'),('ø','oe'),('Ð','D'),('Þ','TH')\
                          # put any more transformations here...
                          ):
        data = data.replace(before, after)
        
    data = unicodedata.normalize('NFKD', data).encode('ASCII', 'ignore')
    data = str(data,'utf-8')

    data = re.sub(r'[^a-zA-Z0-9]', ' ', data)
    data = re.sub(r'\s+', ' ', data)
    data = data.strip()
    data = data.lower()

    return data

for index,row in master_list_df_reconstituted_again_no.iterrows():
    
#     print(index)
    
    scraped_title = simplify_title(row['title'])

    scraped_title_words = scraped_title.split(" ")
    
    for indexx,rowx in bms_release_list.iterrows():

        bms_title = simplify_title(rowx['title'])

        bms_title_words = bms_title.split(' ')

        if all(word in scraped_title_words for word in bms_title_words): #the order for this was reversed in my earlier searches

            print('This title from BookMyShow india release list -- {} -- could be a match for this official US title -- {}. The index number for official US title in master_list_xxx_dataframe is {}'\
                  .format(rowx['title'],row['title'],index))

            break

        else:

            continue
        
print("DONE AND DONE")

This title from BookMyShow india release list -- Spy -- could be a match for this official US title -- The Catcher Was A Spy. The index number for official US title in master_list_xxx_dataframe is 1
This title from BookMyShow india release list -- Room -- could be a match for this official US title -- War Room. The index number for official US title in master_list_xxx_dataframe is 27
This title from BookMyShow india release list -- Pete's Dragon -- could be a match for this official US title -- Pete's Dragon. The index number for official US title in master_list_xxx_dataframe is 44
This title from BookMyShow india release list -- All the Money in the World -- could be a match for this official US title -- All the Money in the World. The index number for official US title in master_list_xxx_dataframe is 45
This title from BookMyShow india release list -- Fifty Shades Darker -- could be a match for this official US title -- Fifty Shades Darker. The index number for official US title in m

In [391]:
#not taking into account multiple matches

#doing one more check, just this time I'm checking if all the words from the bookmyshow title are there in the official US title to get us matches

import unicodedata
import re
from lxml import html
from random import randint
import time
import itertools


def simplify_title(data):

    separator_list = ["/","-","–","—",")","(","\\",":","[","]","&"]

    for separator in separator_list:
        if separator in data:
            data = data.replace(separator," ")

    for before, after in (('ß', 'ss'), ('Æ', 'AE'), ('æ', 'ae'), ('Œ', 'OE'),\
                          ('œ', 'oe'),('ﬁ','fi'),('ﬂ','fl'),('ø','oe'),('Ð','D'),('Þ','TH')\
                          # put any more transformations here...
                          ):
        data = data.replace(before, after)
        
    data = unicodedata.normalize('NFKD', data).encode('ASCII', 'ignore')
    data = str(data,'utf-8')

    data = re.sub(r'[^a-zA-Z0-9]', ' ', data)
    data = re.sub(r'\s+', ' ', data)
    data = data.strip()
    data = data.lower()

    return data

for index,row in master_list_df_reconstituted_again_no.iterrows():
    
#     print(index)
    
    scraped_title = simplify_title(row['title'])

    scraped_title_words = scraped_title.split(" ")
    
#     scraped_title_words = [word for word in scraped_title_words if word not in stop_words]
    
    
    for indexx,rowx in bms_release_list.iterrows():

        bms_title = simplify_title(rowx['title'])

        bms_title_words = bms_title.split(' ')

        if set(scraped_title_words) == set(bms_title_words): #the order for this was reversed in my earlier searches

            print('This title from BookMyShow india release list -- {} -- could be a match for this official US title -- {}. The index number for official US title in master_list_xxx_dataframe is {}'\
                  .format(rowx['title'],row['title'],index))

            break

        else:

            continue
        
print("DONE AND DONE")


This title from BookMyShow india release list -- Pete's Dragon -- could be a match for this official US title -- Pete's Dragon. The index number for official US title in master_list_xxx_dataframe is 44
This title from BookMyShow india release list -- All the Money in the World -- could be a match for this official US title -- All the Money in the World. The index number for official US title in master_list_xxx_dataframe is 45
This title from BookMyShow india release list -- Fifty Shades Darker -- could be a match for this official US title -- Fifty Shades Darker. The index number for official US title in master_list_xxx_dataframe is 73
This title from BookMyShow india release list -- Live By Night -- could be a match for this official US title -- Live by Night. The index number for official US title in master_list_xxx_dataframe is 249
This title from BookMyShow india release list -- Patriots Day -- could be a match for this official US title -- Patriots Day. The index number for offici

In [392]:
#doing one more check, just this time I'm checking if all the words from the bookmyshow title are there in the official US title to get us matches

import unicodedata
import re
from lxml import html
from random import randint
import time
import itertools


def simplify_title(data):

    separator_list = ["/","-","–","—",")","(","\\",":","[","]","&"]

    for separator in separator_list:
        if separator in data:
            data = data.replace(separator," ")

    for before, after in (('ß', 'ss'), ('Æ', 'AE'), ('æ', 'ae'), ('Œ', 'OE'),\
                          ('œ', 'oe'),('ﬁ','fi'),('ﬂ','fl'),('ø','oe'),('Ð','D'),('Þ','TH')\
                          # put any more transformations here...
                          ):
        data = data.replace(before, after)
        
    data = unicodedata.normalize('NFKD', data).encode('ASCII', 'ignore')
    data = str(data,'utf-8')

    data = re.sub(r'[^a-zA-Z0-9]', ' ', data)
    data = re.sub(r'\s+', ' ', data)
    data = data.strip()
    data = data.lower()

    return data

for index,row in master_list_df_reconstituted_again_no.iterrows():
    
#     print(index)
    
    scraped_title = simplify_title(row['title'])

    scraped_title_words = scraped_title.split(" ")
    
    for indexx,rowx in bms_release_list.iterrows():

        bms_title = simplify_title(rowx['title'])

        bms_title_words = bms_title.split(' ')

        if all(word in bms_title_words for word in scraped_title_words):

            print('This title from BookMyShow india release list -- {} -- could be a match for this official US title -- {}. The index number for official US title in master_list_xxx_dataframe is {}'\
                  .format(rowx['title'],row['title'],index))

            break

        else:

            continue
        
print("DONE AND DONE")

This title from BookMyShow india release list -- Pete's Dragon -- could be a match for this official US title -- Pete's Dragon. The index number for official US title in master_list_xxx_dataframe is 44
This title from BookMyShow india release list -- All the Money in the World -- could be a match for this official US title -- All the Money in the World. The index number for official US title in master_list_xxx_dataframe is 45
This title from BookMyShow india release list -- Fifty Shades Darker -- could be a match for this official US title -- Fifty Shades Darker. The index number for official US title in master_list_xxx_dataframe is 73
This title from BookMyShow india release list -- Beauty and the Beast -- could be a match for this official US title -- Beast. The index number for official US title in master_list_xxx_dataframe is 97
This title from BookMyShow india release list -- Beyond All Boundaries -- could be a match for this official US title -- Boundaries. The index number for o

In [396]:
master_list_df_reconstituted_again_no.loc[master_list_df_reconstituted_again_no['title'].str.contains("It")]

Unnamed: 0,number,year,title,domestic_box_office,international_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,...,released_in_india_2nd_check_match_8,released_in_india_2nd_check_match_1_bms_csv_text,released_in_india_2nd_check_match_2_bms_csv_text,released_in_india_2nd_check_match_3_bms_csv_text,released_in_india_2nd_check_match_4_bms_csv_text,released_in_india_2nd_check_match_5_bms_csv_text,released_in_india_2nd_check_match_6_bms_csv_text,released_in_india_2nd_check_match_7_bms_csv_text,released_in_india_2nd_check_match_8_bms_csv_text,domestic_box_office_ranked
349,917,2014,And So It Goes,15160801,15160801,18000000,2014-07-25,Clarius Entertainment,PG-13,,...,no,,,,,,,,,116.0
381,5193,2018,Itzhak,618626,618626,0,2018-03-09,Greenwich,Not Rated,,...,no,,,,,,,,,98.0
408,994,2014,Life Itself,810454,810454,0,2014-07-04,Magnolia Pictures,R,,...,no,Life,,,,,,,,176.0


In [397]:
master_list_df_reconstituted_again.loc[master_list_df_reconstituted_again['title'].str.contains("Life Itself")]

Unnamed: 0,number,year,title,domestic_box_office,international_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,...,released_in_india_2nd_check_match_8,released_in_india_2nd_check_match_1_bms_csv_text,released_in_india_2nd_check_match_2_bms_csv_text,released_in_india_2nd_check_match_3_bms_csv_text,released_in_india_2nd_check_match_4_bms_csv_text,released_in_india_2nd_check_match_5_bms_csv_text,released_in_india_2nd_check_match_6_bms_csv_text,released_in_india_2nd_check_match_7_bms_csv_text,released_in_india_2nd_check_match_8_bms_csv_text,domestic_box_office_ranked
408,994,2014,Life Itself,810454,810454,0,2014-07-04,Magnolia Pictures,R,,...,no,Life,,,,,,,,176.0


In [398]:
master_list_df_reconstituted_again.columns

Index(['number', 'year', 'title', 'domestic_box_office',
       'international_box_office', 'production_budget',
       'domestic_release_date', 'domestic_distributor', 'mpaa_rating',
       'franchise', 'source', 'genre', 'production_method', 'creative_type',
       'production_companies', 'production_countries', 'all_words_there',
       'all_words_there_bms_csv_text', 'exact_match',
       'exact_match_bms_csv_text', 'after_stopwords_removed',
       'after_stopwords_removed_bms_csv_text', 'words_same',
       'words_same_bms_csv_text', 'released_in_india',
       'released_in_india_2nd_check', 'released_in_india_2nd_check_match_1',
       'released_in_india_2nd_check_match_2',
       'released_in_india_2nd_check_match_3',
       'released_in_india_2nd_check_match_4',
       'released_in_india_2nd_check_match_5',
       'released_in_india_2nd_check_match_6',
       'released_in_india_2nd_check_match_7',
       'released_in_india_2nd_check_match_8',
       'released_in_india_2nd_chec

In [399]:
correction_list_3 = [44,45,249,433, 85,382]

for index_number in correction_list_3:
    master_list_df_reconstituted_again.loc[index_number,'released_in_india_2nd_check'] = 'yes'

In [400]:
master_list_df_reconstituted_again_no = master_list_df_reconstituted_again.loc[master_list_df_reconstituted_again['released_in_india_2nd_check'] != 'yes']

In [401]:
import unicodedata
import re
from lxml import html
from random import randint
import time
import itertools


def simplify_title(data):

    separator_list = ["/","-","–","—",")","(","\\",":","[","]","&"]

    for separator in separator_list:
        if separator in data:
            data = data.replace(separator," ")

    for before, after in (('ß', 'ss'), ('Æ', 'AE'), ('æ', 'ae'), ('Œ', 'OE'),\
                          ('œ', 'oe'),('ﬁ','fi'),('ﬂ','fl'),('ø','oe'),('Ð','D'),('Þ','TH')\
                          # put any more transformations here...
                          ):
        data = data.replace(before, after)
        
    data = unicodedata.normalize('NFKD', data).encode('ASCII', 'ignore')
    data = str(data,'utf-8')

    data = re.sub(r'[^a-zA-Z0-9]', ' ', data)
    data = re.sub(r'\s+', ' ', data)
    data = data.strip()
    data = data.lower()

    return data

for index,row in master_list_df_reconstituted_again_no.iterrows():
    
#     print(index)
    
    scraped_title = simplify_title(row['title'])

    scraped_title_words = scraped_title.split(" ")
    
    for indexx,rowx in bms_release_list.iterrows():

        bms_title = simplify_title(rowx['title'])

        bms_title_words = bms_title.split(' ')

        if set(scraped_title_words) == set(bms_title_words):

            print('This title from BookMyShow india release list -- {} -- could be a match for this official US title -- {}. The index number for official US title in master_list_xxx_dataframe is {}'\
                  .format(rowx['title'],row['title'],index))

            break

        else:

            continue
        
print("DONE AND DONE")

This title from BookMyShow india release list -- Fifty Shades Darker -- could be a match for this official US title -- Fifty Shades Darker. The index number for official US title in master_list_xxx_dataframe is 73
DONE AND DONE


In [12]:
master_list_df_reconstituted_again.to_csv('data/india_release_check_v5.csv', index = False)

In [62]:
import pandas as pd

master_list_df_reconstituted_again = pd.read_csv('data/india_release_check_v5.csv')

In [63]:
master_list_df_reconstituted_again.columns

Index(['number', 'year', 'title', 'domestic_box_office',
       'international_box_office', 'production_budget',
       'domestic_release_date', 'domestic_distributor', 'mpaa_rating',
       'franchise', 'source', 'genre', 'production_method', 'creative_type',
       'production_companies', 'production_countries', 'all_words_there',
       'all_words_there_bms_csv_text', 'exact_match',
       'exact_match_bms_csv_text', 'after_stopwords_removed',
       'after_stopwords_removed_bms_csv_text', 'words_same',
       'words_same_bms_csv_text', 'released_in_india',
       'released_in_india_2nd_check', 'released_in_india_2nd_check_match_1',
       'released_in_india_2nd_check_match_2',
       'released_in_india_2nd_check_match_3',
       'released_in_india_2nd_check_match_4',
       'released_in_india_2nd_check_match_5',
       'released_in_india_2nd_check_match_6',
       'released_in_india_2nd_check_match_7',
       'released_in_india_2nd_check_match_8',
       'released_in_india_2nd_chec

In [64]:
master_list_df_reconstituted_again['released_in_india_3rd_check'] = master_list_df_reconstituted_again['released_in_india_2nd_check']

In [65]:
master_list_df_reconstituted_again['released_in_india_3rd_check_match_1'] = ""

master_list_df_reconstituted_again['released_in_india_3rd_check_match_2'] = ""

master_list_df_reconstituted_again['released_in_india_3rd_check_match_3'] = ""

master_list_df_reconstituted_again['released_in_india_3rd_check_match_4'] = ""

master_list_df_reconstituted_again['released_in_india_3rd_check_match_5'] = ""

master_list_df_reconstituted_again['released_in_india_3rd_check_match_6'] = ""

master_list_df_reconstituted_again['released_in_india_3rd_check_match_7'] = ""

master_list_df_reconstituted_again['released_in_india_3rd_check_match_8'] = ""

In [66]:
master_list_df_reconstituted_again['released_in_india_3rd_check_match_1_bms_csv_text'] = ""

master_list_df_reconstituted_again['released_in_india_3rd_check_match_2_bms_csv_text'] = ""

master_list_df_reconstituted_again['released_in_india_3rd_check_match_3_bms_csv_text'] = ""

master_list_df_reconstituted_again['released_in_india_3rd_check_match_4_bms_csv_text'] = ""

master_list_df_reconstituted_again['released_in_india_3rd_check_match_5_bms_csv_text'] = ""

master_list_df_reconstituted_again['released_in_india_3rd_check_match_6_bms_csv_text'] = ""

master_list_df_reconstituted_again['released_in_india_3rd_check_match_7_bms_csv_text'] = ""

master_list_df_reconstituted_again['released_in_india_3rd_check_match_8_bms_csv_text'] = ""

In [67]:
master_list_df_reconstituted_again_yes = master_list_df_reconstituted_again.loc[master_list_df_reconstituted_again['released_in_india_3rd_check'] == 'yes']

master_list_df_reconstituted_again_no = master_list_df_reconstituted_again.loc[master_list_df_reconstituted_again['released_in_india_3rd_check'] != 'yes']

In [68]:
bms_release_list = pd.read_csv("data/bookmyshow_list.csv")

In [69]:
bms_release_list.rename(columns={'Release_Year': 'release_year',
                                  'EventGroup_strTitle':'title',
                                  'Event_dtmReleaseDate':'release_date'}, inplace=True)

In [70]:
#this is the same check as I did above, just that i'm not using all the stopwords, just a minimal list

import unicodedata
import re
from lxml import html
from random import randint
import time
import itertools


def simplify_title(data):

    separator_list = ["/","-","–","—",")","(","\\",":","[","]","&"]

    for separator in separator_list:
        if separator in data:
            data = data.replace(separator," ")

    for before, after in (('ß', 'ss'), ('Æ', 'AE'), ('æ', 'ae'), ('Œ', 'OE'),\
                          ('œ', 'oe'),('ﬁ','fi'),('ﬂ','fl'),('ø','oe'),('Ð','D'),('Þ','TH')\
                          # put any more transformations here...
                          ):
        data = data.replace(before, after)
        
    data = unicodedata.normalize('NFKD', data).encode('ASCII', 'ignore')
    data = str(data,'utf-8')

    data = re.sub(r'[^a-zA-Z0-9]', ' ', data)
    data = re.sub(r'\s+', ' ', data)
    data = data.strip()
    data = data.lower()

    return data

for index,row in master_list_df_reconstituted_again_no.iterrows():
    
    print(index)
    
    scraped_title = simplify_title(row['title'])

    scraped_title_words = scraped_title.split(" ") #we get a list of words here
    
    minimal_stop_word_list = ['the','a','of','I','to','in']
    
    if any(elem in scraped_title_words for elem in minimal_stop_word_list):
        
        scraped_title_words = [x for x in scraped_title_words if x not in minimal_stop_word_list]
    
    len_scraped_title_words = len(scraped_title_words)
    
    if len_scraped_title_words == 0:
        
        print('no words to search for title -- {}'.format(row['title']))
        
        continue
        
    elif len_scraped_title_words == 1:
        
        for indexx,rowx in bms_release_list.iterrows():

            bms_title = simplify_title(rowx['title'])

            bms_title_words = bms_title.split(' ')
            
            if (set(bms_title_words) == set(scraped_title_words)):
                
                master_list_df_reconstituted_again_no.loc[index,'released_in_india_3rd_check_match_1'] = 'yes'
                
                master_list_df_reconstituted_again_no.loc[index,'released_in_india_3rd_check_match_1_bms_csv_text'] = rowx['title']
                
                break
            
            else:
                
                continue
            
        continue
        
    elif len_scraped_title_words > 1:
        
        scraped_title_words_combo_list = list(itertools.combinations(scraped_title_words, (len_scraped_title_words - 1))) #you get a list of tuples here
        
        #if you want to understand what's going on here, list(itertools.combinations([1,2,3], 2)) gives us [(1, 2), (1, 3), (2, 3)]
        
        iteratorx = 0
        
        #you have the pass statement available to you as well

        #what happens if we don't even get a 1st match?

        for indexx,rowx in bms_release_list.iterrows():

            bms_title = simplify_title(rowx['title'])

            bms_title_words = bms_title.split(' ')
            
            if iteratorx == 0:
            
                for combo in scraped_title_words_combo_list:

                    if all(word in bms_title_words for word in combo):

                        master_list_df_reconstituted_again_no.loc[index,'released_in_india_3rd_check_match_1'] = 'yes'

                        master_list_df_reconstituted_again_no.loc[index,'released_in_india_3rd_check_match_1_bms_csv_text'] = rowx['title']

                        iteratorx = 1

                        break

                    else:

                        continue
                
                continue
            
            elif iteratorx == 1:
                
                for combo in scraped_title_words_combo_list:

                    if all(word in bms_title_words for word in combo):

                        master_list_df_reconstituted_again_no.loc[index,'released_in_india_3rd_check_match_2'] = 'yes'

                        master_list_df_reconstituted_again_no.loc[index,'released_in_india_3rd_check_match_2_bms_csv_text'] = rowx['title']

                        iteratorx = 2

                        break

                    else:

                        continue
                
                continue
            
            elif iteratorx == 2:
                
                for combo in scraped_title_words_combo_list:

                    if all(word in bms_title_words for word in combo):

                        master_list_df_reconstituted_again_no.loc[index,'released_in_india_3rd_check_match_3'] = 'yes'

                        master_list_df_reconstituted_again_no.loc[index,'released_in_india_3rd_check_match_3_bms_csv_text'] = rowx['title']

                        iteratorx = 3

                        break

                    else:

                        continue
                
                continue

            elif iteratorx == 3:
                
                for combo in scraped_title_words_combo_list:

                    if all(word in bms_title_words for word in combo):

                        master_list_df_reconstituted_again_no.loc[index,'released_in_india_3rd_check_match_4'] = 'yes'

                        master_list_df_reconstituted_again_no.loc[index,'released_in_india_3rd_check_match_4_bms_csv_text'] = rowx['title']

                        iteratorx = 4

                        break

                    else:

                        continue
                
                continue
                
                
            elif iteratorx == 4:
                
                for combo in scraped_title_words_combo_list:

                    if all(word in bms_title_words for word in combo):

                        master_list_df_reconstituted_again_no.loc[index,'released_in_india_3rd_check_match_5'] = 'yes'

                        master_list_df_reconstituted_again_no.loc[index,'released_in_india_3rd_check_match_5_bms_csv_text'] = rowx['title']

                        iteratorx = 5

                        break

                    else:

                        continue
                
                continue

            elif iteratorx == 5:
                
                for combo in scraped_title_words_combo_list:

                    if all(word in bms_title_words for word in combo):

                        master_list_df_reconstituted_again_no.loc[index,'released_in_india_3rd_check_match_6'] = 'yes'

                        master_list_df_reconstituted_again_no.loc[index,'released_in_india_3rd_check_match_6_bms_csv_text'] = rowx['title']

                        iteratorx = 6

                        break

                    else:

                        continue
                
                continue
                
            elif iteratorx == 6:
                
                for combo in scraped_title_words_combo_list:

                    if all(word in bms_title_words for word in combo):

                        master_list_df_reconstituted_again_no.loc[index,'released_in_india_3rd_check_match_7'] = 'yes'

                        master_list_df_reconstituted_again_no.loc[index,'released_in_india_3rd_check_match_7_bms_csv_text'] = rowx['title']

                        iteratorx = 7

                        break

                    else:

                        continue
                
                continue

            elif iteratorx == 7:
                
                for combo in scraped_title_words_combo_list:

                    if all(word in bms_title_words for word in combo):

                        master_list_df_reconstituted_again_no.loc[index,'released_in_india_3rd_check_match_8'] = 'yes'

                        master_list_df_reconstituted_again_no.loc[index,'released_in_india_3rd_check_match_8_bms_csv_text'] = rowx['title']

                        iteratorx = 8

                        break

                    else:

                        continue
                
                continue

            elif iteratorx == 8:

                break

        continue

print("DONE AND DONE")

634
635
636
637
639
640
641
642
643
644
645
648


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


649
650
651
652
653
654
655
656
657
659
660
661
662
664
666
667
668
669
670
671
672
673
674
679
680
682
685
687
690
691
693
696
698
699
700
701
703
704
706
707
708
709
710
712
714
715
717
720
721
722
724
726
727
729
730
731
732
733
735
736
738
740
741
742
743
744
745
746
748
749
750
751
752
755
756
757
758
759
760
763
764
766
768
769
774
775
776
777
780
783
784
785
786
787
788
789
791
793
794
795
796
797
798
800
801
803
804
807
808
810
812
813
814
816
817
818
822
824
825
826
827
828
829
831
832
833
834
835
837
840
843
844
846
847
849
850
851
852
853
855
856
857
859
860
861
862
863
864
865
868
869
870
871
873
875
879
880
881
884
885
886
887
893
894
896
900
901
902
903
906
908
909
912
913
917
918
920
921
922
923
924
925
926
927
928
931
933
935
936
937
938
939
940
941
942
943
945
946
947
948
951
952
954
956
957
958
959
961
963
965
966
967
968
969
970
972
974
975
977
978
979
980
982
983
984
985
986
988
989
990
991
993
995
996
998
1001
1002
1005
1007
1008
1010
1011
1013
1014
1024
1025
1028


In [71]:
pd.set_option('display.max_rows', 500)

In [72]:
master_list_df_reconstituted_again_no_for_check = master_list_df_reconstituted_again_no.loc[(master_list_df_reconstituted_again_no['released_in_india_3rd_check_match_1'] == 'yes') |
                                                 (master_list_df_reconstituted_again_no['released_in_india_3rd_check_match_2'] == 'yes') |
                                                 (master_list_df_reconstituted_again_no['released_in_india_3rd_check_match_3'] == 'yes') |
                                                 (master_list_df_reconstituted_again_no['released_in_india_3rd_check_match_4'] == 'yes') |
                                                 (master_list_df_reconstituted_again_no['released_in_india_3rd_check_match_5'] == 'yes') |
                                                 (master_list_df_reconstituted_again_no['released_in_india_3rd_check_match_6'] == 'yes') |
                                                 (master_list_df_reconstituted_again_no['released_in_india_3rd_check_match_7'] == 'yes') |
                                                 (master_list_df_reconstituted_again_no['released_in_india_3rd_check_match_8'] == 'yes') ]

In [73]:
master_list_df_reconstituted_again_no_for_check.loc[:, ['title',
                                                  'released_in_india_3rd_check_match_1_bms_csv_text',
                                                  'released_in_india_3rd_check_match_2_bms_csv_text',
                                                  'released_in_india_3rd_check_match_3_bms_csv_text',
                                                  'released_in_india_3rd_check_match_4_bms_csv_text',
                                                  'released_in_india_3rd_check_match_5_bms_csv_text',
                                                  'released_in_india_3rd_check_match_6_bms_csv_text',
                                                  'released_in_india_3rd_check_match_7_bms_csv_text',
                                                  'released_in_india_3rd_check_match_8_bms_csv_text']]

Unnamed: 0,title,released_in_india_3rd_check_match_1_bms_csv_text,released_in_india_3rd_check_match_2_bms_csv_text,released_in_india_3rd_check_match_3_bms_csv_text,released_in_india_3rd_check_match_4_bms_csv_text,released_in_india_3rd_check_match_5_bms_csv_text,released_in_india_3rd_check_match_6_bms_csv_text,released_in_india_3rd_check_match_7_bms_csv_text,released_in_india_3rd_check_match_8_bms_csv_text
648,Wind River,Brothers Of The Wind,,,,,,,
649,Love & Mercy,All the Boys Love Mandy Lane,I'm In Love With A Church Girl,Last Love,The Falls: Testament Of Love,Love & Air Sex,Endless Love,"Love, Rosie",Love & Friendship
650,Girls Trip,Asian School Girls,Chiang Mai Deep Trip,,,,,,
656,Fifty Shades Freed,Fifty Shades of Grey,Fifty Shades Darker,,,,,,
661,War Room,World War Z,I Declare War,Lost War of Jurassic Park,Echoes of War,Polar Region War,Room,Green Room,The Huntsman: Winter's War
667,Tulip Fever,Comedy Fever,,,,,,,
669,A Beautiful Planet,Beautiful Creatures,Escape from Planet Earth,SOS Planet,Dawn of the Planet of the Apes,SOS Planet,She's Beautiful When She's Angry,Planet of the Machines & Wild Coaster,Dino Planet
670,Fifty Shades of Black,Fifty Shades of Grey,Fifty Shades Darker,,,,,,
671,Range 15,The 15:17 to Paris,,,,,,,
674,The Fluffy Movie,Scary Movie 5,Dinosaur Movie,Dislecksia: The Movie,The Lego Movie,The SpongeBob Movie: Sponge Out of Water,Smosh: The Movie,Shaun The Sheep Movie,Snoopy and Charlie Brown: The Peanuts Movie


In [None]:
#cool, have done another check, no corrections to be made, think we've got all the titles that have been released

In [74]:
import pandas as pd

master_list_df_reconstituted_again = pd.read_csv('data/india_release_check_v5.csv')

In [79]:
master_list_df_reconstituted_again.columns

Index(['number', 'year', 'title', 'domestic_box_office',
       'international_box_office', 'production_budget',
       'domestic_release_date', 'domestic_distributor', 'mpaa_rating',
       'franchise', 'source', 'genre', 'production_method', 'creative_type',
       'production_companies', 'production_countries', 'all_words_there',
       'all_words_there_bms_csv_text', 'exact_match',
       'exact_match_bms_csv_text', 'after_stopwords_removed',
       'after_stopwords_removed_bms_csv_text', 'words_same',
       'words_same_bms_csv_text', 'released_in_india',
       'released_in_india_2nd_check', 'released_in_india_2nd_check_match_1',
       'released_in_india_2nd_check_match_2',
       'released_in_india_2nd_check_match_3',
       'released_in_india_2nd_check_match_4',
       'released_in_india_2nd_check_match_5',
       'released_in_india_2nd_check_match_6',
       'released_in_india_2nd_check_match_7',
       'released_in_india_2nd_check_match_8',
       'released_in_india_2nd_chec

In [81]:
column_drop_list = ['all_words_there',
       'all_words_there_bms_csv_text', 'exact_match',
       'exact_match_bms_csv_text', 'after_stopwords_removed',
       'after_stopwords_removed_bms_csv_text', 'words_same',
       'words_same_bms_csv_text', 'released_in_india', 'released_in_india_2nd_check_match_1',
       'released_in_india_2nd_check_match_2',
       'released_in_india_2nd_check_match_3',
       'released_in_india_2nd_check_match_4',
       'released_in_india_2nd_check_match_5',
       'released_in_india_2nd_check_match_6',
       'released_in_india_2nd_check_match_7',
       'released_in_india_2nd_check_match_8',
       'released_in_india_2nd_check_match_1_bms_csv_text',
       'released_in_india_2nd_check_match_2_bms_csv_text',
       'released_in_india_2nd_check_match_3_bms_csv_text',
       'released_in_india_2nd_check_match_4_bms_csv_text',
       'released_in_india_2nd_check_match_5_bms_csv_text',
       'released_in_india_2nd_check_match_6_bms_csv_text',
       'released_in_india_2nd_check_match_7_bms_csv_text',
       'released_in_india_2nd_check_match_8_bms_csv_text']


In [None]:
master_list_df_reconstituted_again.drop(column_drop_list, axis=1, inplace=True)

In [None]:
master_list_df_reconstituted_again.to_csv('data/india_release_check_v6.csv', index=False)

In [1]:
import pandas as pd

master_list_df_reconstituted_again = pd.read_csv('data/india_release_check_v6.csv')

In [2]:
master_list_df_reconstituted_again.columns

Index(['number', 'year', 'title', 'domestic_box_office',
       'international_box_office', 'production_budget',
       'domestic_release_date', 'domestic_distributor', 'mpaa_rating',
       'franchise', 'source', 'genre', 'production_method', 'creative_type',
       'production_companies', 'production_countries',
       'released_in_india_2nd_check', 'domestic_box_office_ranked'],
      dtype='object')

In [3]:
master_list_df_reconstituted_again['source'].unique()

array(['Based on Theme Park Ride', 'Original Screenplay',
       'Based on Fiction Book/Short Story', 'Based on Real Life Events',
       'Based on Comic/Graphic Novel', 'Remake', 'Based on TV',
       'Based on Game', 'Based on Factual Book/Article', 'Based on Toy',
       'Based on Folk Tale/Legend/Fairytale', 'Based on Religious Text',
       'Based on Movie', 'Spin-Off', 'Based on Short Film',
       'Based on Play', 'Based on Web Series', 'Based on Musical Group',
       nan, 'Compilation', 'Based on Musical or Opera', 'Based on Song'], dtype=object)

In [4]:
master_list_df_reconstituted_again['genre'].unique()

array(['Adventure', 'Drama', 'Action', 'Thriller/Suspense', 'Horror',
       'Comedy', 'Concert/Performance', 'Romantic Comedy', 'Musical',
       'Black Comedy', 'Western', 'Documentary', 'Multiple Genres'], dtype=object)

In [5]:
master_list_df_reconstituted_again['production_method'].unique()

array(['Live Action', 'Animation/Live Action', 'Digital Animation',
       'Stop-Motion Animation', nan, 'Multiple Production Methods'], dtype=object)

In [6]:
master_list_df_reconstituted_again['creative_type'].unique()

array(['Fantasy', 'Contemporary Fiction', 'Science Fiction',
       'Dramatization', 'Super Hero', 'Kids Fiction', 'Factual',
       'Historical Fiction', 'Multiple Creative Types', nan], dtype=object)

In [18]:
master_list_df_reconstituted_again[master_list_df_reconstituted_again['domestic_box_office_ranked'] <= 50].groupby(['released_in_india_2nd_check', 'genre']).size()

released_in_india_2nd_check  genre            
no                           Comedy                4
                             Drama                 4
                             Romantic Comedy       2
                             Thriller/Suspense     2
yes                          Action               72
                             Adventure            80
                             Black Comedy          2
                             Comedy               44
                             Drama                35
                             Horror               18
                             Musical               9
                             Thriller/Suspense    26
                             Western               2
dtype: int64

In [29]:
top_number = 50

df = pd.DataFrame(columns=["genre", "total","yes","no","yes %","no %"])

for genre in master_list_df_reconstituted_again['genre'].unique():
    movies_in_genre = len(master_list_df_reconstituted_again[(master_list_df_reconstituted_again['genre'] == genre) &
                         (master_list_df_reconstituted_again['domestic_box_office_ranked'] <= top_number)])
    movies_in_genre_released_india_yes = len(master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again['genre'] == genre) &
                                                                                    (master_list_df_reconstituted_again['domestic_box_office_ranked'] <= top_number) &
                                                                                    (master_list_df_reconstituted_again['released_in_india_2nd_check'] == 'yes') ])
    movies_in_genre_released_india_no = len(master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again['genre'] == genre) &
                                                                                   (master_list_df_reconstituted_again['domestic_box_office_ranked'] <= top_number) &
                                                                                   (master_list_df_reconstituted_again['released_in_india_2nd_check'] != 'yes') ])
    if movies_in_genre != 0:
        movies_in_genre_released_india_yes_percent = (movies_in_genre_released_india_yes/movies_in_genre)*100
        movies_in_genre_released_india_yes_percent = round(movies_in_genre_released_india_yes_percent,1)
        movies_in_genre_released_india_no_percent = (movies_in_genre_released_india_no/movies_in_genre)*100
        movies_in_genre_released_india_no_percent = round(movies_in_genre_released_india_no_percent,1)
    else:
        movies_in_genre_released_india_yes_percent = 0
        movies_in_genre_released_india_no_percent = 0
    
    df = df.append({"genre": genre,
                    "total": movies_in_genre,
                    "yes": movies_in_genre_released_india_yes,
                    "no": movies_in_genre_released_india_no,
                    "yes %": movies_in_genre_released_india_yes_percent,
                    "no %": movies_in_genre_released_india_no_percent
                    }, ignore_index=True)

#     print("if we look at the top {} movies in terms of US domestic gross, there were {} movies in the genre {}.".format(top_number, movies_in_genre,genre,))
#     print("{} movies were released in India, that is {} %. {} movies were not released in India, that is {} %.".format(movies_in_genre_released_india_yes,movies_in_genre_released_india_yes_percent,
#                                                                                                                       movies_in_genre_released_india_no,movies_in_genre_released_india_no_percent))
#     print("\n")
df
# print("### DONE AND DONE ###")    

Unnamed: 0,genre,total,yes,no,yes %,no %
0,Adventure,80,80,0,100.0,0.0
1,Drama,39,35,4,89.7,10.3
2,Action,72,72,0,100.0,0.0
3,Thriller/Suspense,28,26,2,92.9,7.1
4,Horror,18,18,0,100.0,0.0
5,Comedy,48,44,4,91.7,8.3
6,Concert/Performance,0,0,0,0.0,0.0
7,Romantic Comedy,2,0,2,0.0,100.0
8,Musical,9,9,0,100.0,0.0
9,Black Comedy,2,2,0,100.0,0.0


In [22]:
#this gives us number of nan values in each column

master_list_df_reconstituted_again.isnull().sum()

number                           0
year                             0
title                            0
domestic_box_office              0
international_box_office         0
production_budget                0
domestic_release_date           15
domestic_distributor             9
mpaa_rating                      0
franchise                      836
source                           1
genre                            0
production_method                1
creative_type                    1
production_companies            54
production_countries             0
released_in_india_2nd_check      0
domestic_box_office_ranked       0
dtype: int64

In [23]:
#this gets us non-zero values in each column, note that Nan values are counted as non-zero

master_list_df_reconstituted_again.astype(bool).sum(axis=0)

number                         1078
year                           1078
title                          1078
domestic_box_office            1078
international_box_office       1078
production_budget               758
domestic_release_date          1078
domestic_distributor           1078
mpaa_rating                    1078
franchise                      1078
source                         1078
genre                          1078
production_method              1078
creative_type                  1078
production_companies           1078
production_countries           1078
released_in_india_2nd_check    1078
domestic_box_office_ranked     1078
dtype: int64

In [50]:
master_list_df_reconstituted_again.iloc[649]

number                                                                   1866
year                                                                     2015
title                                                            Love & Mercy
domestic_box_office                                                  12551031
international_box_office                                             12551031
production_budget                                                           0
domestic_release_date                                              2015-06-05
domestic_distributor                                     Roadside Attractions
mpaa_rating                                                             PG-13
franchise                                                                 NaN
source                                              Based on Real Life Events
genre                                                                   Drama
production_method                                               

In [None]:
#just realised that the international box_office column is a duplicate of the domestic box office column, yikes, that must have happened during the scraping, after checking the pages, the domestic box office numbers
    #are ok, but the international box office numbers are wrong, don't need that column anyway, so am just going to drop it from the data frame
    


In [51]:
master_list_df_reconstituted_again.drop("international_box_office", axis=1, inplace=True)

In [52]:
master_list_df_reconstituted_again.to_csv('data/india_release_check_v7.csv', index=False)

In [73]:
#of all the movies where you have production budget information, find the ones with the biggest production budgets that weren't released in India


master_list_df_reconstituted_again['production_budget'].max()

330600000

In [104]:
import numpy

bins = numpy.arange(start=0,stop=350000000,step=25000000)
list(bins)


[0,
 25000000,
 50000000,
 75000000,
 100000000,
 125000000,
 150000000,
 175000000,
 200000000,
 225000000,
 250000000,
 275000000,
 300000000,
 325000000]

In [107]:
master_list_df_reconstituted_again[master_list_df_reconstituted_again['production_budget'] > 0]['production_budget'].min()

100000

In [108]:
master_list_df_reconstituted_again[master_list_df_reconstituted_again['production_budget'] > 0]['production_budget'].max()

330600000

In [141]:
bins =   [0,100000, 25000000,50000000, 75000000, 100000000, 125000000, 150000000, 175000000, 200000000, 225000000, 250000000, 275000000, 300000000, 325000000,350000000]

In [142]:
len(bins)

16

In [143]:
labels = ['no info','100k-25m','25m-50m','50m-75m','75m-100m','100m-125m','125m-150m','150m-175m','175m-200m','200m-225m','225m-250m','250m-275m', '275m-300m','300m-325m','325m-350m']

In [145]:
master_list_df_reconstituted_again['production_budget_bins'] = pd.cut(master_list_df_reconstituted_again['production_budget'], bins = bins, labels = labels, right = False, include_lowest = True)

In [146]:
master_list_df_reconstituted_again.to_csv('data/india_release_check_v8.csv', index=False)

In [159]:
master_list_df_reconstituted_again['domestic_box_office'].min()

500000

In [160]:
master_list_df_reconstituted_again['domestic_box_office'].max()

936662225

In [162]:
bins =   [500000, 50000000, 100000000, 150000000, 200000000, 250000000, 300000000, 350000000,400000000, 450000000,500000000, 550000000,600000000, 650000000,700000000, 750000000,800000000, 850000000,900000000, 950000000]

In [163]:
labels = ['50k-50m','50m-100m','100m-150m','150m-200m','200m-250m','250m-300m','300m-350m','350m-400m','400m-450m','450m-500m','500m-550m','550m-600m','600m-650m','650m-700m','700m-750m',
         '750m-800m','800m-850m','850m-900m','900m-950m']

In [166]:
master_list_df_reconstituted_again['domestic_box_office_bins'] = pd.cut(master_list_df_reconstituted_again['domestic_box_office'], bins = bins, labels = labels, right = False, include_lowest = True)

In [186]:
master_list_df_reconstituted_again.to_csv('data/india_release_check_v8.csv', index=False)

In [192]:
#just doing a check to see if we've got the year right, 

for index, row in master_list_df_reconstituted_again.iterrows():
    release_year = row['year']
    release_year = str(release_year)
    release_date = row['domestic_release_date']
    try:
        release_date[:4]
    except:
        print('no release date for {}'.format(row['number']))
    else:
        release_year_from_date = release_date[:4]
        if release_year != release_year_from_date:
            print('release year mismatch for {}'.format(row['number']))
            print(release_year)
            print(release_year_from_date)

print('done and done')

no release date for 148
no release date for 4054
no release date for 165
no release date for 3042
no release date for 4017
no release date for 3057
no release date for 4023
no release date for 1938
no release date for 3085
no release date for 3098
no release date for 5270
no release date for 3070
no release date for 4021
no release date for 4030
no release date for 927
done and done


In [79]:
#just adding a director column to help match titles better

import glob
from lxml import html

for index, row in master_list_df_reconstituted_again.iterrows():
    
    number = row['number']
    
    print(number)
    
    search_string = 'movie_pages/' + str(number).zfill(4) + '_*.html'

    for filename in glob.glob(search_string):

        with open(filename, encoding='utf-8') as filex:

            sourcex = filex.read()

            treex = html.document_fromstring(sourcex)

            try:
                treex.xpath("//td[@itemprop='director']//span[1]/text()")[0]
            except:
                print('no director for movie number {}'.format(number))
                director = ''
            else:
                director = treex.xpath("//td[@itemprop='director']//span[1]/text()")[0]
                
            try:
                treex.xpath("//td[@itemprop='director']//span[1]/text()")[1]
            except:
                pass
            else:
                director_two = treex.xpath("//td[@itemprop='director']//span[1]/text()")[1]
                director = director + ',' + director_two
                print('multiple directors for movie number {}'.format(number))

            master_list_df_reconstituted_again.loc[index,'director'] = director

print('done and done')

3867
multiple directors for movie number 3867
164
12
1834
851
3023
2895
3986
3074
217
3899
2927
837
3918
87
1786
2994
2903
1798
3104
2991
800
173
868
908
2896
5344
874
4015
5177
72
2987
1777
5503
5118
902
39
52
1753
69
2905
830
1877
40
2906
multiple directors for movie number 2906
22
833
18
3874
3014
103
130
160
2936
2956
129
3007
3100
797
multiple directors for movie number 797
176
3992
3858
3037
2977
3024
102
3907
1781
1737
2911
2935
2899
98
191
2909
1856
3953
68
3003
1754
1828
1787
1790
multiple directors for movie number 1790
64
2986
7
106
5434
3941
28
multiple directors for movie number 28
933
809
5062
848
5183
1772
63
3004
multiple directors for movie number 3004
3924
3022
3939
3047
2978
1766
1801
812
1760
1823
24
3131
182
875
790
834
multiple directors for movie number 834
872
1905
2941
889
126
3947
141
62
3875
131
3905
1768
23
793
multiple directors for movie number 793
869
1897
862
1807
3909
multiple directors for movie number 3909
2928
1784
132
60
2988
3029
1740
multiple dire

1762
5345
5348
1886
5437
97
2995
994
1821
4030
5342
1985
972
3933
1941
836
135
3134
17
3120
3072
1937
927
214
5429
867
multiple directors for movie number 867
921
104
850
3019
903
1736
2982
3943
4035
1835
156
2949
4008
3923
880
4042
197
4050
done and done


In [80]:
master_list_df_reconstituted_again.to_csv('data/india_release_check_v8.csv', index=False)

In [1]:
import pandas as pd

master_list_df_reconstituted_again = pd.read_csv('data/india_release_check_v8.csv')

In [2]:
master_list_df_reconstituted_again.columns

Index(['number', 'year', 'title', 'domestic_box_office', 'production_budget',
       'domestic_release_date', 'domestic_distributor', 'mpaa_rating',
       'franchise', 'source', 'genre', 'production_method', 'creative_type',
       'production_companies', 'production_countries',
       'released_in_india_2nd_check', 'domestic_box_office_ranked',
       'production_budget_bins', 'domestic_box_office_bins', 'director'],
      dtype='object')

In [3]:
master_list_df_reconstituted_again.head()

Unnamed: 0,number,year,title,domestic_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,source,genre,production_method,creative_type,production_companies,production_countries,released_in_india_2nd_check,domestic_box_office_ranked,production_budget_bins,domestic_box_office_bins,director
0,3867,2017,Pirates of the Caribbean: Dead Men Tell No Tales,172558876,230000000,2017-05-26,Walt Disney,PG-13,Pirates of the Caribbean,Based on Theme Park Ride,Adventure,Live Action,Fantasy,"Walt Disney Pictures,Jerry Bruckheimer Films",United States,yes,19.0,225m-250m,150m-200m,"Joachim Ronnin,Espen Sandberg"
1,164,2013,The East,2274649,6500000,2013-05-31,Fox Searchlight,PG-13,,Original Screenplay,Drama,Live Action,Contemporary Fiction,Scott Free Films,United States,yes,157.0,100k-25m,50k-50m,
2,12,2013,World War Z,202359711,190000000,2013-06-21,Paramount Pictures,PG-13,World War Z,Based on Fiction Book/Short Story,Action,Live Action,Science Fiction,"Skydance Productions,Hemisphere Media Capital,...",United States,yes,13.0,175m-200m,200m-250m,
3,1834,2015,Ricki and the Flash,26839498,18000000,2015-08-07,Sony Pictures,PG-13,,Original Screenplay,Drama,Live Action,Contemporary Fiction,"Marc Platt Productions,Badwill Entertainment ,...",United States,yes,86.0,100k-25m,50k-50m,Jonathan Demme
4,851,2014,Transcendence,23022309,100000000,2014-04-18,Warner Bros.,PG-13,,Original Screenplay,Thriller/Suspense,Live Action,Science Fiction,"Straight Up Films,DMG Entertainment",United States,yes,105.0,100m-125m,50k-50m,


In [4]:
#taking the password from a text file
with open('passwords_ignore.txt') as f:
    password = f.readline()

In [5]:
login_string = 'postgresql://postgres:' + password + '@localhost/imdb2'

In [6]:
from imdb import IMDb

ia = IMDb('s3', login_string, adultSearch=False)

In [113]:
#REWORKED CODE

import unidecode
import re

def simplify_title(data):

    separator_list = ["/","-","–","—",")","(","\\",":","[","]","&",","]

    for separator in separator_list:
        if separator in data:
            data = data.replace(separator," ")

    data = unidecode.unidecode(data)
    data = re.sub(r'[^a-zA-Z0-9]', ' ', data)
    data = re.sub(r'\s+', ' ', data)
    data = data.strip()
    data = data.lower()

    return data

master_list_df_reconstituted_again['genres_imdb'] = None

for index,row in master_list_df_reconstituted_again.iterrows():
# for index,row in master_list_df_reconstituted_again.iloc[74:75].iterrows():
    
#     print('\n')
#     print(index)
    number = row['number']
    
    title = row['title']
    title_simplified = simplify_title(title)
    title_simplified_words = title_simplified.split(' ')
#     print('this is title_simplified_words {}'.format(title_simplified_words))

    year = row['year']
    year = str(year)
#     print('this is year {}'.format(year))
    
    director = row['director']
    
    if str(director) != 'nan':
        director_simplified = simplify_title(director)
        director_simplified_words = director_simplified.split(' ')
    else:
        director_simplified_words = ['ZZZZZZ']
    
#     print('this is director_simplified_words {}'.format(director_simplified_words))
    
    results  = ia.search_movie(title,results = 200)

    for result in results:
        result_ID = result.movieID
        movie = ia.get_movie(result_ID)
        try:
            movie['kind']
        except:
            pass
        else:
            kindx = movie['kind']
            if kindx == 'movie': #it could be tv episode too, so filtering them out

                titlex = movie['title']
                titlex_string_simplified = simplify_title(titlex)
    #             print('this is titlex_string_simplified {}'.format(titlex_string_simplified))

                try:
                    movie['year']
                except:
                    yearx = '0000'
                else:
                    yearx = movie['year']
                    yearx = str(yearx)
    #             print('this is yearx {}'.format(yearx))

                try:
                    movie['director']
                except:
                    directors_string_simplified = 'XXXXXX'
                else:
                    director_list = movie['director']
                    director_name_list = []

                    for director_item in director_list:
                        director_id = director_item.personID
                        director_object = ia.get_person(director_id)
                        director_name = director_object['name']
                        director_name_list.append(director_name)

                    directors_string = ",".join(director_name_list)
                    directors_string_simplified = simplify_title(directors_string)

    #             print('this is directors_string_simplified {}'.format(directors_string_simplified))

                true_count = [(all(substring in directors_string_simplified for substring in director_simplified_words)),
                              (all(substring in titlex_string_simplified for substring in title_simplified_words)),
                              (yearx == year)].count(True)

                if true_count == 3:
                    genres_imdb_raw = movie['genres']
                    genres_imdb = ','.join(genres_imdb_raw)
                    master_list_df_reconstituted_again.loc[index,'genres_imdb'] = genres_imdb
                    break

                elif true_count == 2:
                    genres_imdb_raw = movie['genres']
                    genres_imdb = ','.join(genres_imdb_raw)
                    master_list_df_reconstituted_again.loc[index,'genres_imdb'] = genres_imdb
    #                     print('only 2 of 3 conditions true for title "{}" and number {}'.format(title,number))
    #                     print('\n')
                    break
    
    if pd.isnull(master_list_df_reconstituted_again.loc[index,'genres_imdb']):
        print('no genres listed for {}, the number for the movie is {}, the index is {}'.format(title, number, index))
        print('\n')
        
print('\n')
print('done and done')

no genres listed for Unfriended, the number for the movie is 1812, the index is 149


no genres listed for Ratchet and Clank, the number for the movie is 3030, the index is 620


no genres listed for Planes: Fire and Rescue, the number for the movie is 835, the index is 621


no genres listed for Norman: The Moderate Rise and Tragic Fall of a New York Fixer, the number for the movie is 4000, the index is 639


no genres listed for The Met: Live in HD - Aida, the number for the movie is 165, the index is 655


no genres listed for 2013 Oscar Shorts, the number for the movie is 178, the index is 657


no genres listed for 2014 Oscar Shorts, the number for the movie is 959, the index is 682


no genres listed for 2018 Oscar Shorts, the number for the movie is 5126, the index is 703


no genres listed for The Music of Strangers: Yo-Yo Ma and the Silk Road Ensemble, the number for the movie is 3103, the index is 745


no genres listed for 56 Up, the number for the movie is 208, the index is

In [138]:
#so for a number of these movies, i searched manually on the website and got their imdb IDs, i can use these imdb IDs with the local database to retrieve genre information
#there are some documentaries in this list too, will decide later on whether to keep them or not
#will use this dict later on to input genres into the main dataframe

manual_genres_input = {
'1812':{'title':"Unfriended",'imdbID':"1691917",'genres':''},
'3030':{'title':"Ratchet and Clank",'imdbID':"2865120",'genres':''},
'0835':{'title':"Planes: Fire and Rescue",'imdbID':"2980706",'genres':''},
'4000':{'title':"Norman: The Moderate Rise and Tragic Fall of a New York Fixer",'imdbID':"4191702",'genres':''},
'3103':{'title':"The Music of Strangers: Yo-Yo Ma and the Silk Road Ensemble",'imdbID':"3549206",'genres':''},
'0208':{'title':"56 Up",'imdbID':"2147134",'genres':''},
'0920':{'title':"Tyler Perry's The Single Moms Club",'imdbID':"2465140",'genres':''},
'0184':{'title':"L'attentat",'imdbID':"0787442",'genres':''},
'1994':{'title':"Deli Man: The Movie",'imdbID':"4239548",'genres':''},
'0035':{'title':"Disney Planes",'imdbID':"1691917",'genres':''},
'0968':{'title':"Men, Women and Children",'imdbID':"3179568",'genres':''},
'0927':{'title':"Island of Lemurs: Madagascar",'imdbID':"3231010",'genres':''}
}

In [140]:
for key, value in manual_genres_input.items():
    imdb_id = manual_genres_input[key]['imdbID']
    moviex = ia.get_movie(imdb_id)
    moviex_genres_raw = moviex['genres']
    moviex_genres = ','.join(moviex_genres_raw)
    manual_genres_input[key]['genres'] = moviex_genres
    
print('done and done')

done and done


In [141]:
manual_genres_input

{'0035': {'genres': 'adventure,animation,comedy',
  'imdbID': '1691917',
  'title': 'Disney Planes'},
 '0184': {'genres': 'drama', 'imdbID': '0787442', 'title': "L'attentat"},
 '0208': {'genres': 'documentary', 'imdbID': '2147134', 'title': '56 Up'},
 '0835': {'genres': 'adventure,animation,comedy',
  'imdbID': '2980706',
  'title': 'Planes: Fire and Rescue'},
 '0920': {'genres': 'comedy,drama',
  'imdbID': '2465140',
  'title': "Tyler Perry's The Single Moms Club"},
 '0927': {'genres': 'adventure,biography,documentary',
  'imdbID': '3231010',
  'title': 'Island of Lemurs: Madagascar'},
 '0968': {'genres': 'comedy,drama',
  'imdbID': '3179568',
  'title': 'Men, Women and Children'},
 '1812': {'genres': 'adventure,animation,comedy',
  'imdbID': '1691917',
  'title': 'Unfriended'},
 '1994': {'genres': 'documentary',
  'imdbID': '4239548',
  'title': 'Deli Man: The Movie'},
 '3030': {'genres': 'action,adventure,animation',
  'imdbID': '2865120',
  'title': 'Ratchet and Clank'},
 '3103': {

In [142]:
master_list_df_reconstituted_again.dtypes

number                           int64
year                             int64
title                           object
domestic_box_office              int64
production_budget                int64
domestic_release_date           object
domestic_distributor            object
mpaa_rating                     object
franchise                       object
source                          object
genre                           object
production_method               object
creative_type                   object
production_companies            object
production_countries            object
released_in_india_2nd_check     object
domestic_box_office_ranked     float64
production_budget_bins          object
domestic_box_office_bins        object
director                        object
genres_imdb                     object
dtype: object

In [143]:
master_list_df_reconstituted_again.to_csv('data/india_release_check_v8.csv', index=False)

In [147]:
for key, value in manual_genres_input.items():
    genres_imdb = manual_genres_input[key]['genres']
    number = int(key)
    master_list_df_reconstituted_again.loc[master_list_df_reconstituted_again['number'] == number,'genres_imdb'] = genres_imdb

print('done and done')

done and done


In [148]:
master_list_df_reconstituted_again.loc[master_list_df_reconstituted_again['number'] == 968]['genres_imdb']

1033    comedy,drama
Name: genres_imdb, dtype: object

In [149]:
#deleting rows from dataframe which cant possibly be movies

delete_row_dict = {
'0165':{'title':"The Met: Live in HD - Aida"},
'0178':{'title':"2013 Oscar Shorts"},
'0959':{'title':"2014 Oscar Shorts"},
'5126':{'title':"2018 Oscar Shorts"},
'1917':{'title':"2015 Oscar Shorts"},
'4023':{'title':"Mayweather vs. McGregor"},
'1938':{'title':"Game of Thrones: The IMAX Experience"},
'4027':{'title':"2017 Oscar Shorts"},
'3085':{'title':"The Met: Live in HD - Madama Butterfly"},
'3098':{'title':"The Harry Potter IMAX Marathon"},
'1947':{'title':"The Met: Live in HD - Tannhauser"},
'3086':{'title':"The Met: Live in HD - Manon Lescaut"}
}

In [150]:
delete_row_list = [165,178,959,5126,1917,4023,1938,4027,3085,3098,1947,3086]

In [151]:
master_list_df_reconstituted_again = master_list_df_reconstituted_again[~(master_list_df_reconstituted_again['number'].isin(delete_row_list))]

In [152]:
master_list_df_reconstituted_again.to_csv('data/india_release_check_v9.csv', index=False)

In [153]:
genres_imdb_list = list(master_list_df_reconstituted_again['genres_imdb'].unique())
print(len(genres_imdb_list))
genres_imdb_list = [x for x in genres_imdb_list if str(x) != 'nan']
final_list = []
for genres_imdb in genres_imdb_list:
    list_to_add = [x.strip() for x in genres_imdb.split(",")]
    final_list.extend(list_to_add)
    
final_list = list(set(final_list))
print(len(final_list))

197
23


In [154]:
final_list

['',
 'musical',
 'action',
 'sci-fi',
 'war',
 'documentary',
 'news',
 'music',
 'adventure',
 'sport',
 'horror',
 'drama',
 'history',
 'fantasy',
 'thriller',
 'family',
 'western',
 'animation',
 'biography',
 'comedy',
 'romance',
 'mystery',
 'crime']

In [167]:
master_list_df_reconstituted_again.loc[master_list_df_reconstituted_again['genres_imdb']=='']

Unnamed: 0,number,year,title,domestic_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,source,...,production_method,creative_type,production_companies,production_countries,released_in_india_2nd_check,domestic_box_office_ranked,production_budget_bins,domestic_box_office_bins,director,genres_imdb
354,975,2014,Tracks,509815,0,2014-09-19,Weinstein Co.,PG-13,,Based on Real Life Events,...,Live Action,Dramatization,See-Saw Films,United States,yes,188.0,no info,50k-50m,John Curran,
804,5443,2018,Boundaries,701828,0,2018-06-22,Sony Pictures Classics,R,,Original Screenplay,...,Live Action,Contemporary Fiction,"Automatik,Oddfellows Entertainment,Stage 6 Fil...","Canada,United States",no,97.0,no info,50k-50m,Shana Feste,


In [168]:
manual_genres_input = {
'0975':{'title':"Tracks",'imdbID':"2167266",'genres':''},
'5443':{'title':"Boundaries",'imdbID':"5686062",'genres':''}
}

for key, value in manual_genres_input.items():
    imdb_id = manual_genres_input[key]['imdbID']
    moviex = ia.get_movie(imdb_id)
    moviex_genres_raw = moviex['genres']
    moviex_genres = ','.join(moviex_genres_raw)
    manual_genres_input[key]['genres'] = moviex_genres

for key, value in manual_genres_input.items():
    genres_imdb = manual_genres_input[key]['genres']
    number = int(key)
    master_list_df_reconstituted_again.loc[master_list_df_reconstituted_again['number'] == number,'genres_imdb'] = genres_imdb

print('done and done')

done and done


In [165]:
len_df = len(master_list_df_reconstituted_again.loc[master_list_df_reconstituted_again['genres_imdb'] == None])
len_df

0

In [180]:
master_list_df_reconstituted_again.to_csv('data/india_release_check_v9.csv', index=False)

In [183]:
#3080 stands for a row in the dataframe for '2016 Oscar Shorts' , am guessing it's a compilation of shorts nominated for the oscar and they're shown together as one package
#not counting this as a movie

delete_row_list_2 = [3080]

master_list_df_reconstituted_again = master_list_df_reconstituted_again[~(master_list_df_reconstituted_again['number'].isin(delete_row_list_2))]

master_list_df_reconstituted_again.to_csv('data/india_release_check_v9.csv', index=False)

In [209]:
#RUN THIS AT NIGHT

from imdb import IMDb

import unidecode
import re
from random import randint
import time

ib = IMDb(accessSystem='http', adultSearch=False)

def simplify_title(data):

    separator_list = ["/","-","–","—",")","(","\\",":","[","]","&",","]

    for separator in separator_list:
        if separator in data:
            data = data.replace(separator," ")

    data = unidecode.unidecode(data)
    data = re.sub(r'[^a-zA-Z0-9]', ' ', data)
    data = re.sub(r'\s+', ' ', data)
    data = data.strip()
    data = data.lower()

    return data

master_list_df_reconstituted_again['genres_imdb_check'] = None

for index,row in master_list_df_reconstituted_again.iterrows():
    number = row['number']
    print('page number is {}'.format(number))
    
    title = row['title']
    title_simplified = simplify_title(title)
    title_simplified_words = title_simplified.split(' ')

    year = row['year']
    year = str(year)
    
    title_year = title + " " + year
    
    director = row['director']
    
    if str(director) != 'nan':
        director_simplified = simplify_title(director)
        director_simplified_words = director_simplified.split(' ')
    else:
        director_simplified_words = ['ZZZZZZ']
    
    results  = ib.search_movie(title_year)

    delay = 5
    time.sleep(delay)

    for result in results:
        result_ID = result.movieID
        movie = ia.get_movie(result_ID)  #using ia, the local postgres db instead

        try:
            movie['kind']
        except:
            pass
        else:
            kindx = movie['kind']
            if kindx == 'movie': #it could be tv episode too, so filtering them out

                titlex = movie['title']
                titlex_string_simplified = simplify_title(titlex)

                try:
                    movie['year']
                except:
                    yearx = '0000'
                else:
                    yearx = movie['year']
                    yearx = str(yearx)

                try:
                    movie['director']
                except:
                    directors_string_simplified = 'XXXXXX'
                else:
                    director_list = movie['director']
                    director_name_list = []

                    for director_item in director_list:
                        director_id = director_item.personID
                        director_object = ia.get_person(director_id)   #using ia instead of ib, that is local db instead of website

                        director_name = director_object['name']
                        director_name_list.append(director_name)

                    directors_string = ",".join(director_name_list)
                    directors_string_simplified = simplify_title(directors_string)

                true_count = [(all(substring in directors_string_simplified for substring in director_simplified_words)),
                              (all(substring in titlex_string_simplified for substring in title_simplified_words)),
                              (yearx == year)].count(True)

                if true_count == 3:
                    genres_imdb_raw = movie['genres']
                    genres_imdb = ','.join(genres_imdb_raw)
                    master_list_df_reconstituted_again.loc[index,'genres_imdb_check'] = genres_imdb
                    break

                elif true_count == 2:
                    genres_imdb_raw = movie['genres']
                    genres_imdb = ','.join(genres_imdb_raw)
                    master_list_df_reconstituted_again.loc[index,'genres_imdb_check'] = genres_imdb
                    break
    
    if pd.isnull(master_list_df_reconstituted_again.loc[index,'genres_imdb_check']):
        print('no genres listed for {}, the number for the movie is {}, the index is {}'.format(title, number, index))
        print('\n')
        
print('\n')
print('done and done')

master_list_df_reconstituted_again.to_csv('data/india_release_check_v10.csv', index=False)

page number is 3867
page number is 164
page number is 12
page number is 1834
page number is 851
page number is 3023
page number is 2895
page number is 3986
page number is 3074
no genres listed for Green Room, the number for the movie is 3074, the index is 8


page number is 217
page number is 3899
page number is 2927
page number is 837
page number is 3918
page number is 87
page number is 1786
page number is 2994
page number is 2903
page number is 1798
page number is 3104
page number is 2991
page number is 800
page number is 173
page number is 868
no genres listed for Begin Again, the number for the movie is 868, the index is 23


page number is 908
page number is 2896
page number is 5344
page number is 874
page number is 4015
no genres listed for Sleight, the number for the movie is 4015, the index is 28


page number is 5177
page number is 72
page number is 2987
page number is 1777
page number is 5503
page number is 5118
page number is 902
page number is 39
page number is 52
page numb

page number is 57
page number is 3964
page number is 2904
page number is 2951
page number is 5268
page number is 88
page number is 3969
page number is 865
page number is 81
page number is 56
page number is 918
page number is 1802
page number is 3944
page number is 11
page number is 112
page number is 864
page number is 1775
page number is 3910
page number is 871
page number is 1792
page number is 5439
page number is 20
page number is 824
page number is 2890
page number is 3976
page number is 5176
page number is 1747
page number is 110
page number is 5338
page number is 3066
page number is 185
page number is 5181
page number is 876
page number is 893
page number is 5507
page number is 1902
page number is 3977
page number is 5173
page number is 78
page number is 82
page number is 1859
page number is 1750
page number is 2968
page number is 3861
page number is 117
page number is 1806
page number is 3970
page number is 2892
page number is 905
page number is 5515
page number is 1758
page num

page number is 1837
page number is 924
page number is 5272
page number is 49
page number is 4081
page number is 2917
page number is 4065
page number is 5189
page number is 4060
page number is 3035
page number is 1850
page number is 3103
no genres listed for The Music of Strangers: Yo-Yo Ma and the Silk Road Ensemble, the number for the movie is 3103, the index is 745


page number is 1932
no genres listed for 99 Homes, the number for the movie is 1932, the index is 746


page number is 3949
page number is 194
page number is 5185
page number is 1939
page number is 208
no genres listed for 56 Up, the number for the movie is 208, the index is 751


page number is 5347
page number is 900
page number is 2973
page number is 4053
page number is 1005
page number is 174
page number is 166
no genres listed for Filly Brown, the number for the movie is 166, the index is 758


page number is 74
page number is 3967
page number is 3039
page number is 142
no genres listed for Frances Ha, the number fo

page number is 135
page number is 3134
page number is 17
page number is 3120
page number is 3072
page number is 1937
page number is 927
no genres listed for Island of Lemurs: Madagascar, the number for the movie is 927, the index is 1056


page number is 214
page number is 5429
page number is 867
page number is 921
page number is 104
page number is 850
page number is 3019
page number is 903
page number is 1736
no genres listed for Star Wars Ep. VII: The Force Awakens, the number for the movie is 1736, the index is 1065


page number is 2982
page number is 3943
page number is 4035
page number is 1835
page number is 156
no genres listed for No, the number for the movie is 156, the index is 1070


page number is 2949
page number is 4008
page number is 3923
page number is 880
page number is 4042
page number is 197
page number is 4050


done and done


In [223]:
#just filling in the genres that haven't been got
#went to the imdb website directly to deal with this

#below are the page numbers and the imdb urls
#will be using the imdb id contained within the url to get the genre information

correction_dict_again = {
'3074':"https://www.imdb.com/title/tt4062536/?ref_=nv_sr_1",
'868':"https://www.imdb.com/title/tt1980929/?ref_=nv_sr_1",
'4015':"https://www.imdb.com/title/tt4573516/?ref_=nv_sr_1",
'160':"https://www.imdb.com/title/tt1491044/?ref_=nv_sr_1",
'933':"https://www.imdb.com/title/tt2170299/?ref_=nv_sr_1",
'1812':"https://www.imdb.com/title/tt3713166/?ref_=nv_sr_2",
'119':"https://www.imdb.com/title/tt1853739/?ref_=nv_sr_1",
'5259':"https://www.imdb.com/title/tt2231461/?ref_=nv_sr_1",
'883':"https://www.imdb.com/title/tt1798709/?ref_=nv_sr_1",
'3058':"https://www.imdb.com/title/tt4501454/?ref_=nv_sr_1",
'963':"https://www.imdb.com/title/tt2692904/?ref_=nv_sr_1",
'885':"https://www.imdb.com/title/tt2388715/?ref_=nv_sr_1",
'1855':"https://www.imdb.com/title/tt3235888/?ref_=nv_sr_1",
'114':"https://www.imdb.com/title/tt1935179/?ref_=nv_sr_1",
'1832':"https://www.imdb.com/title/tt3316960/?ref_=nv_sr_1",
'916':"https://www.imdb.com/title/tt2458776/?ref_=nv_sr_7",
'835':"https://www.imdb.com/title/tt2980706/?ref_=nv_sr_1",
'3062':"https://www.imdb.com/title/tt0790770/?ref_=nv_sr_1",
'3145':"https://www.imdb.com/title/tt3139764/?ref_=nv_sr_1",
'4000':"https://www.imdb.com/title/tt4191702/?ref_=nv_sr_1",
'3137':"https://www.imdb.com/title/tt3859304/?ref_=nv_sr_3",
'1866':"https://www.imdb.com/title/tt0903657/?ref_=nv_sr_1",
'977':"https://www.imdb.com/title/tt2012665/?ref_=nv_sr_1",
'153':"https://www.imdb.com/title/tt1389096/?ref_=nv_sr_1",
'3140':"https://www.imdb.com/title/tt4632316/?ref_=nv_sr_5",
'1961':"https://www.imdb.com/title/tt3518012/?ref_=nv_sr_2",
'3103':"https://www.imdb.com/title/tt3549206/?ref_=nv_sr_1",
'1932':"https://www.imdb.com/title/tt2891174/?ref_=nv_sr_1",
'208':"https://www.imdb.com/title/tt2147134/?ref_=nv_sr_1",
'166':"https://www.imdb.com/title/tt1869425/?ref_=nv_sr_1",
'142':"https://www.imdb.com/title/tt2347569/?ref_=nv_sr_1",
'3057':"https://www.imdb.com/title/tt2401878/?ref_=nv_sr_1",
'1946':"https://www.imdb.com/title/tt4157220/?ref_=nv_sr_1",
'3136':"https://www.imdb.com/title/tt4113346/?ref_=nv_sr_4",
'1971':"https://www.imdb.com/title/tt3264102/?ref_=nv_sr_2",
'920':"https://www.imdb.com/title/tt2465140/?ref_=nv_sr_1",
'184':"https://www.imdb.com/title/tt0787442/?ref_=fn_al_tt_1",
'4031':"https://www.imdb.com/title/tt4778988/?ref_=nv_sr_1",
'1910':"https://www.imdb.com/title/tt2933544/?ref_=nv_sr_1",
'4011':"https://www.imdb.com/title/tt4420704/?ref_=nv_sr_2",
'3857':"https://www.imdb.com/title/tt2527336/?ref_=nv_sr_1",
'1841':"https://www.imdb.com/title/tt3774114/?ref_=nv_sr_1",
'980':"https://www.imdb.com/title/tt2479800/?ref_=nv_sr_1",
'1994':"https://www.imdb.com/title/tt4239548/?ref_=nv_sr_1",
'3902':"https://www.imdb.com/title/tt4425200/?ref_=nv_sr_2",
'925':"https://www.imdb.com/title/tt1967545/?ref_=nv_sr_1",
'186':"https://www.imdb.com/title/tt1595656/?ref_=nv_sr_1",
'5193':"https://www.imdb.com/title/tt6186232/?ref_=nv_sr_2",
'163':"https://www.imdb.com/title/tt2294677/?ref_=nv_sr_6",
'1985':"https://www.imdb.com/title/tt2788716/?ref_=nv_sr_1",
'1941':"https://www.imdb.com/title/tt1445208/?ref_=nv_sr_1",
'927':"https://www.imdb.com/title/tt3231010/?ref_=fn_al_tt_1",
'1736':"https://www.imdb.com/title/tt2488496/?ref_=nv_sr_1",
'156':"https://www.imdb.com/title/tt2059255/?ref_=nm_flmg_prd_11"
}

In [224]:
import re

for key, value in correction_dict_again.items():
    number = int(key)
    imdb_id = re.findall(r"https:\/\/www\.imdb\.com\/title\/tt(.*)\/\?ref.*", value)[0]
    moviex = ia.get_movie(imdb_id)
    genres_imdb_raw = moviex['genres']
    genres_imdb_check = ','.join(genres_imdb_raw)
    
    master_list_df_reconstituted_again.loc[master_list_df_reconstituted_again['number'] == number,'genres_imdb_check'] = genres_imdb_check

print('done and done')

done and done


In [227]:
for index,row in master_list_df_reconstituted_again.iterrows():
    genres_imdb = row['genres_imdb']
    genres_imdb_check = row['genres_imdb_check']
    
    number = row['number']
    title = row['title']
    
    if genres_imdb != genres_imdb_check:
        print('Genres mismatch. The page number is {},title is "{}" and the index number is {}'.format(number,title,index))
        
print('done and done')

Genres mismatch. The page number is 2991,title is "The Witch" and the index number is 20
Genres mismatch. The page number is 98,title is "Homefront" and the index number is 72
Genres mismatch. The page number is 1754,title is "Home" and the index number is 79
Genres mismatch. The page number is 1787,title is "Sisters" and the index number is 81
Genres mismatch. The page number is 5434,title is "Adrift" and the index number is 87
Genres mismatch. The page number is 60,title is "Last Vegas" and the index number is 136
Genres mismatch. The page number is 1812,title is "Unfriended" and the index number is 149
Genres mismatch. The page number is 5262,title is "Truth or Dare" and the index number is 182
Genres mismatch. The page number is 3033,title is "The Darkness" and the index number is 192
Genres mismatch. The page number is 3866,title is "Coco" and the index number is 195
Genres mismatch. The page number is 3887,title is "Split" and the index number is 199
Genres mismatch. The page num

In [229]:
correction_dict_v3 = {
'2991':{'title':"The Witch", 'imdb_url':'https://www.imdb.com/title/tt4263482/?ref_=fn_al_tt_1'},
'98':{'title':"Homefront", 'imdb_url':'https://www.imdb.com/title/tt2312718/?ref_=nv_sr_1'},
'1754':{'title':"Home", 'imdb_url':'https://www.imdb.com/title/tt2224026/?ref_=fn_al_tt_2'},
'1787':{'title':"Sisters", 'imdb_url':'https://www.imdb.com/title/tt6258050/?ref_=nv_sr_2'},
'5434':{'title':"Adrift", 'imdb_url':'https://www.imdb.com/title/tt6306064/?ref_=nv_sr_1'},
'60':{'title':"Last Vegas", 'imdb_url':'https://www.imdb.com/title/tt1204975/?ref_=nv_sr_1'},
'1812':{'title':"Unfriended", 'imdb_url':'https://www.imdb.com/title/tt3713166/?ref_=nv_sr_2'},
'5262':{'title':"Truth or Dare", 'imdb_url':'https://www.imdb.com/title/tt6772950/?ref_=nv_sr_1'},
'3033':{'title':"The Darkness", 'imdb_url':'https://www.imdb.com/title/tt1878841/?ref_=nv_sr_1'},
'3866':{'title':"Coco", 'imdb_url':'https://www.imdb.com/title/tt2380307/?ref_=nv_sr_1'},
'3887':{'title':"Split", 'imdb_url':'https://www.imdb.com/title/tt4972582/?ref_=nv_sr_1'},
'8':{'title':"Gravity", 'imdb_url':'https://www.imdb.com/title/tt1454468/?ref_=nv_sr_2'},
'1836':{'title':"Ex Machina", 'imdb_url':'https://www.imdb.com/title/tt0470752/?ref_=nv_sr_1'},
'2939':{'title':"Lights Out", 'imdb_url':'https://www.imdb.com/title/tt4786282/?ref_=nv_sr_1'},
'820':{'title':"Hercules", 'imdb_url':'https://www.imdb.com/title/tt1267297/?ref_=nv_sr_1'},
'1751':{'title':"San Andreas", 'imdb_url':'https://www.imdb.com/title/tt2126355/?ref_=nv_sr_1'},
'85':{'title':"The Call", 'imdb_url':'https://www.imdb.com/title/tt1911644/?ref_=nv_sr_2'},
'3952':{'title':"Sleepless", 'imdb_url':'https://www.imdb.com/title/tt2072233/?ref_=nv_sr_2'},
'975':{'title':"Tracks", 'imdb_url':'https://www.imdb.com/title/tt2167266/?ref_=nv_sr_1'},
'202':{'title':"Generation Iron", 'imdb_url':'https://www.imdb.com/title/tt2205904/?ref_=nv_sr_2'},
'918':{'title':"Addicted", 'imdb_url':'https://www.imdb.com/title/tt2205401/?ref_=nv_sr_1'},
'3944':{'title':"The Circle", 'imdb_url':'https://www.imdb.com/title/tt4287320/?ref_=nv_sr_2'},
'3976':{'title':"Gold", 'imdb_url':'https://www.imdb.com/title/tt1800302/?ref_=nv_sr_4'},
'1747':{'title':"Cinderella", 'imdb_url':'https://www.imdb.com/title/tt1661199/?ref_=nv_sr_1'},
'1825':{'title':"Max", 'imdb_url':'https://www.imdb.com/title/tt3369806/?ref_=nv_sr_3'},
'4080':{'title':"Wilson", 'imdb_url':'https://www.imdb.com/title/tt1781058/?ref_=nv_sr_1'},
'3071':{'title':"Demolition", 'imdb_url':'https://www.imdb.com/title/tt1172049/?ref_=nv_sr_2'},
'1793':{'title':"The Visit", 'imdb_url':'https://www.imdb.com/title/tt3567288/?ref_=nv_sr_1'},
'2990':{'title':"The Forest", 'imdb_url':'https://www.imdb.com/title/tt3387542/?ref_=nv_sr_1'},
'4054':{'title':"Inhumans", 'imdb_url':'https://www.imdb.com/title/tt4154858/?ref_=nv_sr_1'},
'4070':{'title':"Landline", 'imdb_url':'https://www.imdb.com/title/tt5737862/?ref_=nv_sr_1'},
'1796':{'title':"Spotlight", 'imdb_url':'https://www.imdb.com/title/tt1895587/?ref_=nv_sr_1'},
'1926':{'title':"Cake", 'imdb_url':'https://www.imdb.com/title/tt3442006/?ref_=nv_sr_2'},
'5349':{'title':"Beast", 'imdb_url':'https://www.imdb.com/title/tt5628302/?ref_=nv_sr_3'},
'5272':{'title':"The Rider", 'imdb_url':'https://www.imdb.com/title/tt6217608/?ref_=nv_sr_2'},
'3991':{'title':"Let There Be Light", 'imdb_url':'https://www.imdb.com/title/tt5804314/?ref_=nv_sr_1'},
'3107':{'title':"Lazer Team", 'imdb_url':'https://www.imdb.com/title/tt3864024/?ref_=nv_sr_1'},
'3136':{'title':"City of Gold", 'imdb_url':'https://www.imdb.com/title/tt4113346/?ref_=nv_sr_4'},
'818':{'title':"Paddington", 'imdb_url':'https://www.imdb.com/title/tt1109624/?ref_=nv_sr_2'},
'3988':{'title':"Stronger", 'imdb_url':'https://www.imdb.com/title/tt3881784/?ref_=nv_sr_1'},
'3998':{'title':"Lowriders", 'imdb_url':'https://www.imdb.com/title/tt1366338/?ref_=nv_sr_1'},
'1944':{'title':"Black Sea", 'imdb_url':'https://www.imdb.com/title/tt2261331/?ref_=nv_sr_1'},
'1820':{'title':"The Night Before", 'imdb_url':'https://www.imdb.com/title/tt3530002/?ref_=nv_sr_1'},
'1940':{'title':"Meet the Patels", 'imdb_url':'https://www.imdb.com/title/tt2378401/?ref_=nv_sr_1'},
'3025':{'title':"Unforgettable", 'imdb_url':'https://www.imdb.com/title/tt3462710/?ref_=nv_sr_2'},
'985':{'title':"The Song", 'imdb_url':'https://www.imdb.com/title/tt2517044/?ref_=nv_sr_4'},
'4063':{'title':"Step", 'imdb_url':'https://www.imdb.com/title/tt5758404/?ref_=nv_sr_3'},
'4051':{'title':"The Comedian", 'imdb_url':'https://www.imdb.com/title/tt1967614/?ref_=nv_sr_1'},
'99':{'title':"The Family", 'imdb_url':'https://www.imdb.com/title/tt2404311/?ref_=nv_sr_2'},
'1887':{'title':"True Story", 'imdb_url':'https://www.imdb.com/title/tt2273657/?ref_=nv_sr_4'},
'189':{'title':"Disconnect", 'imdb_url':'https://www.imdb.com/title/tt1433811/?ref_=nv_sr_1'},
'5348':{'title':"The Seagull", 'imdb_url':'https://www.imdb.com/title/tt4682136/?ref_=nv_sr_1'},
'836':{'title':"Ride Along", 'imdb_url':'https://www.imdb.com/title/tt1408253/?ref_=nv_sr_2'},
'135':{'title':"Emperor", 'imdb_url':'https://www.imdb.com/title/tt2103264/?ref_=nv_sr_7'},
'903':{'title':"Big Eyes", 'imdb_url':'https://www.imdb.com/title/tt1126590/?ref_=nv_sr_1'},
'1835':{'title':"Criminal", 'imdb_url':'https://www.imdb.com/title/tt3014866/?ref_=nv_sr_4'},
'4008':{'title':"The Wall", 'imdb_url':'https://www.imdb.com/title/tt4218696/?ref_=nv_sr_1'},
'3923':{'title':"The Star", 'imdb_url':'https://www.imdb.com/title/tt4587656/?ref_=nv_sr_1'},
'880':{'title':"Wild", 'imdb_url':'https://www.imdb.com/title/tt2305051/?ref_=nv_sr_6'},
'197':{'title':"Phantom", 'imdb_url':'https://www.imdb.com/title/tt1922685/?ref_=nm_flmg_wr_2'},
'4050':{'title':"Lucky", 'imdb_url':'https://www.imdb.com/title/tt5859238/?ref_=nv_sr_3'}
}

In [230]:
import re

for key, value in correction_dict_v3.items():
    number = int(key)
    imdb_url = correction_dict_v3[key]['imdb_url']
    imdb_id = re.findall(r"https:\/\/www\.imdb\.com\/title\/tt(.*)\/\?ref.*", imdb_url)[0]
    moviex = ia.get_movie(imdb_id)
    genres_imdb_raw = moviex['genres']
    genres_imdb_check = ','.join(genres_imdb_raw)
    
    master_list_df_reconstituted_again.loc[master_list_df_reconstituted_again['number'] == number,'genres_imdb_check'] = genres_imdb_check
    master_list_df_reconstituted_again.loc[master_list_df_reconstituted_again['number'] == number,'genres_imdb'] = genres_imdb_check

print('done and done')

done and done


In [231]:
master_list_df_reconstituted_again.to_csv('data/india_release_check_v10.csv', index=False)

In [232]:
for index,row in master_list_df_reconstituted_again.iterrows():
    genres_imdb = row['genres_imdb']
    genres_imdb_check = row['genres_imdb_check']
    
    number = row['number']
    title = row['title']
    
    if genres_imdb != genres_imdb_check:
        print('Genres mismatch. The page number is {},title is "{}" and the index number is {}'.format(number,title,index))
        
print('done and done')

done and done


In [233]:
#deleting some foreign movies that have crept into the list
#and also movies that are actually just promotional releases in theatres for tv shows, such as inhumans

# 5127	2017	Una mujer fantástica -- spanish
# 4019	2017	Una mujer fantástica -- spanish 
# 125	2013	La Vie d'Adèle – Chapitres 1 & 2 
# 184	2013	L'attentat
# 4054	2017	Inhumans

delete_row_list_2 = [5127, 4019,125,184,4054]

master_list_df_reconstituted_again = master_list_df_reconstituted_again[~(master_list_df_reconstituted_again['number'].isin(delete_row_list_2))]

master_list_df_reconstituted_again.to_csv('data/india_release_check_v10.csv', index=False)

In [236]:
#just checking if any titles are duplicates

titles = master_list_df_reconstituted_again['title']
master_list_df_reconstituted_again[titles.isin(titles[titles.duplicated()])]

Unnamed: 0,number,year,title,domestic_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,source,...,creative_type,production_companies,production_countries,released_in_india_2nd_check,domestic_box_office_ranked,production_budget_bins,domestic_box_office_bins,director,genres_imdb,genres_imdb_check
805,5271,2017,You Were Never Really Here,2523610,0,2017-04-06,Amazon Studios,R,,Based on Fiction Book/Short Story,...,Contemporary Fiction,,"France,United Kingdom,United States",yes,148.5,no info,50k-50m,Lynne Ramsay,"drama,mystery,thriller","drama,mystery,thriller"
949,3990,2017,You Were Never Really Here,2523610,0,2017-04-06,Amazon Studios,R,,Based on Fiction Book/Short Story,...,Contemporary Fiction,,"France,United Kingdom,United States",yes,148.5,no info,50k-50m,Lynne Ramsay,"drama,mystery,thriller","drama,mystery,thriller"


In [237]:
delete_row_list_3 = [5271]

master_list_df_reconstituted_again = master_list_df_reconstituted_again[~(master_list_df_reconstituted_again['number'].isin(delete_row_list_3))]

master_list_df_reconstituted_again.to_csv('data/india_release_check_v10.csv', index=False)

In [1]:
#START HERE

correction_dict_4 = {
'3074':"https://www.imdb.com/title/tt4062536/?ref_=nv_sr_1",
'868':"https://www.imdb.com/title/tt1980929/?ref_=nv_sr_1",
'4015':"https://www.imdb.com/title/tt4573516/?ref_=nv_sr_1",
'160':"https://www.imdb.com/title/tt1491044/?ref_=nv_sr_1",
'933':"https://www.imdb.com/title/tt2170299/?ref_=nv_sr_1",
'1812':"https://www.imdb.com/title/tt3713166/?ref_=nv_sr_2",
'119':"https://www.imdb.com/title/tt1853739/?ref_=nv_sr_1",
'5259':"https://www.imdb.com/title/tt2231461/?ref_=nv_sr_1",
'883':"https://www.imdb.com/title/tt1798709/?ref_=nv_sr_1",
'3058':"https://www.imdb.com/title/tt4501454/?ref_=nv_sr_1",
'963':"https://www.imdb.com/title/tt2692904/?ref_=nv_sr_1",
'885':"https://www.imdb.com/title/tt2388715/?ref_=nv_sr_1",
'1855':"https://www.imdb.com/title/tt3235888/?ref_=nv_sr_1",
'114':"https://www.imdb.com/title/tt1935179/?ref_=nv_sr_1",
'1832':"https://www.imdb.com/title/tt3316960/?ref_=nv_sr_1",
'916':"https://www.imdb.com/title/tt2458776/?ref_=nv_sr_7",
'835':"https://www.imdb.com/title/tt2980706/?ref_=nv_sr_1",
'3062':"https://www.imdb.com/title/tt0790770/?ref_=nv_sr_1",
'3145':"https://www.imdb.com/title/tt3139764/?ref_=nv_sr_1",
'4000':"https://www.imdb.com/title/tt4191702/?ref_=nv_sr_1",
'3137':"https://www.imdb.com/title/tt3859304/?ref_=nv_sr_3",
'1866':"https://www.imdb.com/title/tt0903657/?ref_=nv_sr_1",
'977':"https://www.imdb.com/title/tt2012665/?ref_=nv_sr_1",
'153':"https://www.imdb.com/title/tt1389096/?ref_=nv_sr_1",
'3140':"https://www.imdb.com/title/tt4632316/?ref_=nv_sr_5",
'1961':"https://www.imdb.com/title/tt3518012/?ref_=nv_sr_2",
'3103':"https://www.imdb.com/title/tt3549206/?ref_=nv_sr_1",
'1932':"https://www.imdb.com/title/tt2891174/?ref_=nv_sr_1",
'208':"https://www.imdb.com/title/tt2147134/?ref_=nv_sr_1",
'166':"https://www.imdb.com/title/tt1869425/?ref_=nv_sr_1",
'142':"https://www.imdb.com/title/tt2347569/?ref_=nv_sr_1",
'3057':"https://www.imdb.com/title/tt2401878/?ref_=nv_sr_1",
'1946':"https://www.imdb.com/title/tt4157220/?ref_=nv_sr_1",
'3136':"https://www.imdb.com/title/tt4113346/?ref_=nv_sr_4",
'1971':"https://www.imdb.com/title/tt3264102/?ref_=nv_sr_2",
'920':"https://www.imdb.com/title/tt2465140/?ref_=nv_sr_1",
'184':"https://www.imdb.com/title/tt0787442/?ref_=fn_al_tt_1",
'4031':"https://www.imdb.com/title/tt4778988/?ref_=nv_sr_1",
'1910':"https://www.imdb.com/title/tt2933544/?ref_=nv_sr_1",
'4011':"https://www.imdb.com/title/tt4420704/?ref_=nv_sr_2",
'3857':"https://www.imdb.com/title/tt2527336/?ref_=nv_sr_1",
'1841':"https://www.imdb.com/title/tt3774114/?ref_=nv_sr_1",
'980':"https://www.imdb.com/title/tt2479800/?ref_=nv_sr_1",
'1994':"https://www.imdb.com/title/tt4239548/?ref_=nv_sr_1",
'3902':"https://www.imdb.com/title/tt4425200/?ref_=nv_sr_2",
'925':"https://www.imdb.com/title/tt1967545/?ref_=nv_sr_1",
'186':"https://www.imdb.com/title/tt1595656/?ref_=nv_sr_1",
'5193':"https://www.imdb.com/title/tt6186232/?ref_=nv_sr_2",
'163':"https://www.imdb.com/title/tt2294677/?ref_=nv_sr_6",
'1985':"https://www.imdb.com/title/tt2788716/?ref_=nv_sr_1",
'1941':"https://www.imdb.com/title/tt1445208/?ref_=nv_sr_1",
'927':"https://www.imdb.com/title/tt3231010/?ref_=fn_al_tt_1",
'1736':"https://www.imdb.com/title/tt2488496/?ref_=nv_sr_1",
'156':"https://www.imdb.com/title/tt2059255/?ref_=nm_flmg_prd_11",
'2991':"https://www.imdb.com/title/tt4263482/?ref_=fn_al_tt_1",
'98':"https://www.imdb.com/title/tt2312718/?ref_=nv_sr_1",
'1754':"https://www.imdb.com/title/tt2224026/?ref_=fn_al_tt_2",
'1787':"https://www.imdb.com/title/tt6258050/?ref_=nv_sr_2",
'5434':"https://www.imdb.com/title/tt6306064/?ref_=nv_sr_1",
'60':"https://www.imdb.com/title/tt1204975/?ref_=nv_sr_1",
'1812':"https://www.imdb.com/title/tt3713166/?ref_=nv_sr_2",
'5262':"https://www.imdb.com/title/tt6772950/?ref_=nv_sr_1",
'3033':"https://www.imdb.com/title/tt1878841/?ref_=nv_sr_1",
'3866':"https://www.imdb.com/title/tt2380307/?ref_=nv_sr_1",
'3887':"https://www.imdb.com/title/tt4972582/?ref_=nv_sr_1",
'8':"https://www.imdb.com/title/tt1454468/?ref_=nv_sr_2",
'1836':"https://www.imdb.com/title/tt0470752/?ref_=nv_sr_1",
'2939':"https://www.imdb.com/title/tt4786282/?ref_=nv_sr_1",
'820':"https://www.imdb.com/title/tt1267297/?ref_=nv_sr_1",
'1751':"https://www.imdb.com/title/tt2126355/?ref_=nv_sr_1",
'85':"https://www.imdb.com/title/tt1911644/?ref_=nv_sr_2",
'3952':"https://www.imdb.com/title/tt2072233/?ref_=nv_sr_2",
'975':"https://www.imdb.com/title/tt2167266/?ref_=nv_sr_1",
'202':"https://www.imdb.com/title/tt2205904/?ref_=nv_sr_2",
'918':"https://www.imdb.com/title/tt2205401/?ref_=nv_sr_1",
'3944':"https://www.imdb.com/title/tt4287320/?ref_=nv_sr_2",
'3976':"https://www.imdb.com/title/tt1800302/?ref_=nv_sr_4",
'1747':"https://www.imdb.com/title/tt1661199/?ref_=nv_sr_1",
'1825':"https://www.imdb.com/title/tt3369806/?ref_=nv_sr_3",
'4080':"https://www.imdb.com/title/tt1781058/?ref_=nv_sr_1",
'3071':"https://www.imdb.com/title/tt1172049/?ref_=nv_sr_2",
'1793':"https://www.imdb.com/title/tt3567288/?ref_=nv_sr_1",
'2990':"https://www.imdb.com/title/tt3387542/?ref_=nv_sr_1",
'4054':"https://www.imdb.com/title/tt4154858/?ref_=nv_sr_1",
'4070':"https://www.imdb.com/title/tt5737862/?ref_=nv_sr_1",
'1796':"https://www.imdb.com/title/tt1895587/?ref_=nv_sr_1",
'1926':"https://www.imdb.com/title/tt3442006/?ref_=nv_sr_2",
'5349':"https://www.imdb.com/title/tt5628302/?ref_=nv_sr_3",
'5272':"https://www.imdb.com/title/tt6217608/?ref_=nv_sr_2",
'3991':"https://www.imdb.com/title/tt5804314/?ref_=nv_sr_1",
'3107':"https://www.imdb.com/title/tt3864024/?ref_=nv_sr_1",
'3136':"https://www.imdb.com/title/tt4113346/?ref_=nv_sr_4",
'818':"https://www.imdb.com/title/tt1109624/?ref_=nv_sr_2",
'3988':"https://www.imdb.com/title/tt3881784/?ref_=nv_sr_1",
'3998':"https://www.imdb.com/title/tt1366338/?ref_=nv_sr_1",
'1944':"https://www.imdb.com/title/tt2261331/?ref_=nv_sr_1",
'1820':"https://www.imdb.com/title/tt3530002/?ref_=nv_sr_1",
'1940':"https://www.imdb.com/title/tt2378401/?ref_=nv_sr_1",
'3025':"https://www.imdb.com/title/tt3462710/?ref_=nv_sr_2",
'985':"https://www.imdb.com/title/tt2517044/?ref_=nv_sr_4",
'4063':"https://www.imdb.com/title/tt5758404/?ref_=nv_sr_3",
'4051':"https://www.imdb.com/title/tt1967614/?ref_=nv_sr_1",
'99':"https://www.imdb.com/title/tt2404311/?ref_=nv_sr_2",
'1887':"https://www.imdb.com/title/tt2273657/?ref_=nv_sr_4",
'189':"https://www.imdb.com/title/tt1433811/?ref_=nv_sr_1",
'5348':"https://www.imdb.com/title/tt4682136/?ref_=nv_sr_1",
'836':"https://www.imdb.com/title/tt1408253/?ref_=nv_sr_2",
'135':"https://www.imdb.com/title/tt2103264/?ref_=nv_sr_7",
'903':"https://www.imdb.com/title/tt1126590/?ref_=nv_sr_1",
'1835':"https://www.imdb.com/title/tt3014866/?ref_=nv_sr_4",
'4008':"https://www.imdb.com/title/tt4218696/?ref_=nv_sr_1",
'3923':"https://www.imdb.com/title/tt4587656/?ref_=nv_sr_1",
'880':"https://www.imdb.com/title/tt2305051/?ref_=nv_sr_6",
'197':"https://www.imdb.com/title/tt1922685/?ref_=nm_flmg_wr_2",
'4050':"https://www.imdb.com/title/tt5859238/?ref_=nv_sr_3"
}

In [2]:
correction_dict_4_keys = list(correction_dict_4.keys())

In [3]:
from imdb import IMDb

with open('passwords_ignore.txt') as f:
    password = f.readline()

login_string = 'postgresql://postgres:' + password + '@localhost/imdb2'

ia = IMDb('s3', login_string, adultSearch=False)

In [4]:
ib = IMDb(accessSystem='http', adultSearch=False)

In [5]:
from imdb import IMDb
import unidecode
import re
from random import randint
import time

def simplify_title(data):

    separator_list = ["/","-","–","—",")","(","\\",":","[","]","&",","]

    for separator in separator_list:
        if separator in data:
            data = data.replace(separator," ")

    data = unidecode.unidecode(data)
    data = re.sub(r'[^a-zA-Z0-9]', ' ', data)
    data = re.sub(r'\s+', ' ', data)
    data = data.strip()
    data = data.lower()

    return data

# master_list_df_reconstituted_again['id_imdb'] = None
# master_list_df_reconstituted_again['languages_imdb'] = None
# master_list_df_reconstituted_again['genres_imdb_3'] = None

for index,row in master_list_df_reconstituted_again.iterrows():
#     print('\n')
    number = row['number']
    print('\n')
    print('page number is {}'.format(number))
    number_string = str(number)
    
    title = row['title']
    
    
    try:

        if number_string in correction_dict_4_keys:

            url_imdb = correction_dict_4[number_string]
            id_imdb = re.findall(r"https:\/\/www\.imdb\.com\/title\/tt(.*)\/\?ref.*", url_imdb)[0]

            movie_http = ib.get_movie(id_imdb)
            
            delay = randint(5,9)
            time.sleep(delay)


            languages_imdb_raw = movie_http['languages']
            languages_imdb = ','.join(languages_imdb_raw)
            if 'English' not in languages_imdb_raw:
                print('movie for page number {} and title "{}" not in English'.format(number,title))
            elif 'English' != languages_imdb_raw[0]:
                print('movie for page number {} and title "{}" does not have English listed as first language'.format(number,title))

    #         master_list_df_reconstituted_again.loc[index,'languages_imdb'] = languages_imdb
    #         master_list_df_reconstituted_again.loc[index,'id_imdb'] = id_imdb
    #         movie = ia.get_movie(id_imdb)
    #         genres_imdb_raw = movie['genres']
    #         genres_imdb = ','.join(genres_imdb_raw)
    #         master_list_df_reconstituted_again.loc[index,'genres_imdb_3'] = genres_imdb

        else:

            title_simplified = simplify_title(title)
            title_simplified_words = title_simplified.split(' ')

            year = row['year']
            year = str(year)

            title_year = title + " " + year

            director = row['director']

            if str(director) != 'nan':
                director_simplified = simplify_title(director)
                director_simplified_words = director_simplified.split(' ')
            else:
                director_simplified_words = ['ZZZZZZ']

            results  = ib.search_movie(title_year)

            for result in results:

                result_ID = result.movieID
                movie = ia.get_movie(result_ID)  #using ia, the local postgres db instead

                try:
                    movie['kind']
                except:
                    pass
                else:
                    kindx = movie['kind']

                    if kindx == 'movie': #it could be tv episode too, so filtering them out

                        titlex = movie['title']
                        titlex_string_simplified = simplify_title(titlex)

                        try:
                            movie['year']
                        except:
                            yearx = '0000'
                        else:
                            yearx = movie['year']
                            yearx = str(yearx)

                        try:
                            movie['director']
                        except:
                            directors_string_simplified = 'XXXXXX'
                        else:
                            director_list = movie['director']
                            director_name_list = []

                            for director_item in director_list:
                                director_id = director_item.personID
                                director_object = ia.get_person(director_id)   #using ia instead of ib, that is local db instead of website

                                director_name = director_object['name']
                                director_name_list.append(director_name)

                            directors_string = ",".join(director_name_list)
                            directors_string_simplified = simplify_title(directors_string)

                        true_count = [(all(substring in directors_string_simplified for substring in director_simplified_words)),
                                    (all(substring in titlex_string_simplified for substring in title_simplified_words)),
                                    (yearx == year)].count(True)

                        if true_count >= 2:

                            movie_http = ib.get_movie(result_ID)

                            languages_imdb_raw = movie_http['languages']
                            if 'English' not in languages_imdb_raw:
                                print('movie for page number {} and title "{}" not in English'.format(number,title))
                            elif 'English' != languages_imdb_raw[0]:
                                print('movie for page number {} and title "{}" does not have English listed as first language'.format(number,title))
    #                         languages_imdb = ','.join(languages_imdb_raw)
    #                         master_list_df_reconstituted_again.loc[index,'languages_imdb'] = languages_imdb

    #                         master_list_df_reconstituted_again.loc[index,'id_imdb'] = result_ID

    #                         genres_imdb_raw = movie['genres']
    #                         genres_imdb = ','.join(genres_imdb_raw)
    #                         master_list_df_reconstituted_again.loc[index,'genres_imdb_3'] = genres_imdb

                            break

    #                     elif true_count == 2:

    #                         movie_http = ib.get_movie(result_ID)

    #                         languages_imdb_raw = movie_http['languages']
    #                         languages_imdb = ','.join(languages_imdb_raw)
    #                         master_list_df_reconstituted_again.loc[index,'languages_imdb'] = languages_imdb

    #                         master_list_df_reconstituted_again.loc[index,'id_imdb'] = result_ID

    #                         genres_imdb_raw = movie['genres']
    #                         genres_imdb = ','.join(genres_imdb_raw)
    #                         master_list_df_reconstituted_again.loc[index,'genres_imdb_3'] = genres_imdb

    #                         break
    except:
        print('error for movie of page number {} and title "{}"'.format(number,title))
        continue
        
#     if pd.isnull(master_list_df_reconstituted_again.loc[index,'languages_imdb']):
#         print('no languages listed for {}, the number for the movie is {}, the index is {}'.format(title, number, index))

#     if pd.isnull(master_list_df_reconstituted_again.loc[index,'id_imdb']):
#         print('no imdb id listed for {}, the number for the movie is {}, the index is {}'.format(title, number, index))

#     if pd.isnull(master_list_df_reconstituted_again.loc[index,'genres_imdb_3']):
#         print('no imdb id listed for {}, the number for the movie is {}, the index is {}'.format(title, number, index))

print('\n')
print('done and done')

# master_list_df_reconstituted_again.to_csv('data/india_release_check_v11.csv', index=False)

NameError: name 'master_list_df_reconstituted_again' is not defined

In [64]:
master_list_df_reconstituted_again.to_csv('data/india_release_check_v11.csv', index=False)

In [2]:
import pandas as pd

master_list_df_reconstituted_again = pd.read_csv('data/india_release_check_v11.csv')

master_list_df_reconstituted_again.head()

Unnamed: 0,number,year,title,domestic_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,source,...,creative_type,production_companies,production_countries,released_in_india_2nd_check,domestic_box_office_ranked,production_budget_bins,domestic_box_office_bins,director,genres_imdb,genres_imdb_check
0,3867,2017,Pirates of the Caribbean: Dead Men Tell No Tales,172558876,230000000,2017-05-26,Walt Disney,PG-13,Pirates of the Caribbean,Based on Theme Park Ride,...,Fantasy,"Walt Disney Pictures,Jerry Bruckheimer",United States,yes,19.0,225m-250m,150m-200m,"Joachim Ronnin,Espen Sandberg","action,adventure,fantasy","action,adventure,fantasy"
1,164,2013,The East,2274649,6500000,2013-05-31,Fox Searchlight,PG-13,,Original Screenplay,...,Contemporary Fiction,Scott Free Films,United States,yes,157.0,100k-25m,50k-50m,Zal Batmanglij,"adventure,drama,thriller","adventure,drama,thriller"
2,12,2013,World War Z,202359711,190000000,2013-06-21,Paramount Pictures,PG-13,World War Z,Based on Fiction Book/Short Story,...,Science Fiction,"Skydance Productions,Hemisphere Media Capital,...",United States,yes,13.0,175m-200m,200m-250m,Marc Forster,"action,adventure,horror","action,adventure,horror"
3,1834,2015,Ricki and the Flash,26839498,18000000,2015-08-07,Sony Pictures,PG-13,,Original Screenplay,...,Contemporary Fiction,"Marc Platt Productions,Badwill Entertainment ,...",United States,yes,86.0,100k-25m,50k-50m,Jonathan Demme,"comedy,drama,music","comedy,drama,music"
4,851,2014,Transcendence,23022309,100000000,2014-04-18,Warner Bros.,PG-13,,Original Screenplay,...,Science Fiction,"Straight Up Films,DMG Entertainment",United States,yes,105.0,100m-125m,50k-50m,Wally Pfister,"drama,mystery,romance","drama,mystery,romance"


In [None]:
#do it again looking at whether english is the first language in the list, if it isn't, make it known

In [3]:
master_list_df_reconstituted_again.columns

Index(['number', 'year', 'title', 'domestic_box_office', 'production_budget',
       'domestic_release_date', 'domestic_distributor', 'mpaa_rating',
       'franchise', 'source', 'genre', 'production_method', 'creative_type',
       'production_companies', 'production_countries',
       'released_in_india_2nd_check', 'domestic_box_office_ranked',
       'production_budget_bins', 'domestic_box_office_bins', 'director',
       'genres_imdb', 'genres_imdb_check'],
      dtype='object')

In [4]:
master_list_df_reconstituted_again.drop(['domestic_box_office_ranked','genres_imdb'], axis=1, inplace=True)

In [5]:
master_list_df_reconstituted_again.columns

Index(['number', 'year', 'title', 'domestic_box_office', 'production_budget',
       'domestic_release_date', 'domestic_distributor', 'mpaa_rating',
       'franchise', 'source', 'genre', 'production_method', 'creative_type',
       'production_companies', 'production_countries',
       'released_in_india_2nd_check', 'production_budget_bins',
       'domestic_box_office_bins', 'director', 'genres_imdb_check'],
      dtype='object')

In [6]:
master_list_df_reconstituted_again.shape

(1059, 20)

In [7]:
master_list_df_reconstituted_again.dtypes

number                          int64
year                            int64
title                          object
domestic_box_office             int64
production_budget               int64
domestic_release_date          object
domestic_distributor           object
mpaa_rating                    object
franchise                      object
source                         object
genre                          object
production_method              object
creative_type                  object
production_companies           object
production_countries           object
released_in_india_2nd_check    object
production_budget_bins         object
domestic_box_office_bins       object
director                       object
genres_imdb_check              object
dtype: object

In [8]:
just_august_df = pd.read_csv('data/india_release_just_aug.csv')

In [9]:
just_august_df.dtypes

number                           int64
year                             int64
title                           object
domestic_box_office              int64
production_budget                int64
domestic_release_date           object
domestic_distributor            object
mpaa_rating                     object
franchise                      float64
source                          object
genre                           object
production_method               object
creative_type                   object
production_companies            object
production_countries            object
released_in_india_2nd_check     object
production_budget_bins          object
domestic_box_office_bins        object
director                        object
genres_imdb_check               object
dtype: object

In [10]:
just_august_df['franchise']

0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
5    NaN
6    NaN
7    NaN
8    NaN
9    NaN
10   NaN
11   NaN
12   NaN
13   NaN
14   NaN
15   NaN
16   NaN
17   NaN
18   NaN
19   NaN
20   NaN
21   NaN
22   NaN
Name: franchise, dtype: float64

In [11]:
master_list_df_reconstituted_again = pd.concat([master_list_df_reconstituted_again, just_august_df], ignore_index=True)

In [12]:
master_list_df_reconstituted_again.to_csv('data/india_release_check_v11.csv', index=False)

In [None]:
#ok so having english not listed first doesnt mean that the movie's primary language isnt english, case in point - operation finale from 2018 is an english movie but has french listed first
#but put that check in anyway

In [6]:
#START HERE

correction_dict_4 = {
'3074':"https://www.imdb.com/title/tt4062536/?ref_=nv_sr_1",
'868':"https://www.imdb.com/title/tt1980929/?ref_=nv_sr_1",
'4015':"https://www.imdb.com/title/tt4573516/?ref_=nv_sr_1",
'160':"https://www.imdb.com/title/tt1491044/?ref_=nv_sr_1",
'933':"https://www.imdb.com/title/tt2170299/?ref_=nv_sr_1",
'1812':"https://www.imdb.com/title/tt3713166/?ref_=nv_sr_2",
'119':"https://www.imdb.com/title/tt1853739/?ref_=nv_sr_1",
'5259':"https://www.imdb.com/title/tt2231461/?ref_=nv_sr_1",
'883':"https://www.imdb.com/title/tt1798709/?ref_=nv_sr_1",
'3058':"https://www.imdb.com/title/tt4501454/?ref_=nv_sr_1",
'963':"https://www.imdb.com/title/tt2692904/?ref_=nv_sr_1",
'885':"https://www.imdb.com/title/tt2388715/?ref_=nv_sr_1",
'1855':"https://www.imdb.com/title/tt3235888/?ref_=nv_sr_1",
'114':"https://www.imdb.com/title/tt1935179/?ref_=nv_sr_1",
'1832':"https://www.imdb.com/title/tt3316960/?ref_=nv_sr_1",
'916':"https://www.imdb.com/title/tt2458776/?ref_=nv_sr_7",
'835':"https://www.imdb.com/title/tt2980706/?ref_=nv_sr_1",
'3062':"https://www.imdb.com/title/tt0790770/?ref_=nv_sr_1",
'3145':"https://www.imdb.com/title/tt3139764/?ref_=nv_sr_1",
'4000':"https://www.imdb.com/title/tt4191702/?ref_=nv_sr_1",
'3137':"https://www.imdb.com/title/tt3859304/?ref_=nv_sr_3",
'1866':"https://www.imdb.com/title/tt0903657/?ref_=nv_sr_1",
'977':"https://www.imdb.com/title/tt2012665/?ref_=nv_sr_1",
'153':"https://www.imdb.com/title/tt1389096/?ref_=nv_sr_1",
'3140':"https://www.imdb.com/title/tt4632316/?ref_=nv_sr_5",
'1961':"https://www.imdb.com/title/tt3518012/?ref_=nv_sr_2",
'3103':"https://www.imdb.com/title/tt3549206/?ref_=nv_sr_1",
'1932':"https://www.imdb.com/title/tt2891174/?ref_=nv_sr_1",
'208':"https://www.imdb.com/title/tt2147134/?ref_=nv_sr_1",
'166':"https://www.imdb.com/title/tt1869425/?ref_=nv_sr_1",
'142':"https://www.imdb.com/title/tt2347569/?ref_=nv_sr_1",
'3057':"https://www.imdb.com/title/tt2401878/?ref_=nv_sr_1",
'1946':"https://www.imdb.com/title/tt4157220/?ref_=nv_sr_1",
'3136':"https://www.imdb.com/title/tt4113346/?ref_=nv_sr_4",
'1971':"https://www.imdb.com/title/tt3264102/?ref_=nv_sr_2",
'920':"https://www.imdb.com/title/tt2465140/?ref_=nv_sr_1",
'184':"https://www.imdb.com/title/tt0787442/?ref_=fn_al_tt_1",
'4031':"https://www.imdb.com/title/tt4778988/?ref_=nv_sr_1",
'1910':"https://www.imdb.com/title/tt2933544/?ref_=nv_sr_1",
'4011':"https://www.imdb.com/title/tt4420704/?ref_=nv_sr_2",
'3857':"https://www.imdb.com/title/tt2527336/?ref_=nv_sr_1",
'1841':"https://www.imdb.com/title/tt3774114/?ref_=nv_sr_1",
'980':"https://www.imdb.com/title/tt2479800/?ref_=nv_sr_1",
'1994':"https://www.imdb.com/title/tt4239548/?ref_=nv_sr_1",
'3902':"https://www.imdb.com/title/tt4425200/?ref_=nv_sr_2",
'925':"https://www.imdb.com/title/tt1967545/?ref_=nv_sr_1",
'186':"https://www.imdb.com/title/tt1595656/?ref_=nv_sr_1",
'5193':"https://www.imdb.com/title/tt6186232/?ref_=nv_sr_2",
'163':"https://www.imdb.com/title/tt2294677/?ref_=nv_sr_6",
'1985':"https://www.imdb.com/title/tt2788716/?ref_=nv_sr_1",
'1941':"https://www.imdb.com/title/tt1445208/?ref_=nv_sr_1",
'927':"https://www.imdb.com/title/tt3231010/?ref_=fn_al_tt_1",
'1736':"https://www.imdb.com/title/tt2488496/?ref_=nv_sr_1",
'156':"https://www.imdb.com/title/tt2059255/?ref_=nm_flmg_prd_11",
'2991':"https://www.imdb.com/title/tt4263482/?ref_=fn_al_tt_1",
'98':"https://www.imdb.com/title/tt2312718/?ref_=nv_sr_1",
'1754':"https://www.imdb.com/title/tt2224026/?ref_=fn_al_tt_2",
'1787':"https://www.imdb.com/title/tt6258050/?ref_=nv_sr_2",
'5434':"https://www.imdb.com/title/tt6306064/?ref_=nv_sr_1",
'60':"https://www.imdb.com/title/tt1204975/?ref_=nv_sr_1",
'1812':"https://www.imdb.com/title/tt3713166/?ref_=nv_sr_2",
'5262':"https://www.imdb.com/title/tt6772950/?ref_=nv_sr_1",
'3033':"https://www.imdb.com/title/tt1878841/?ref_=nv_sr_1",
'3866':"https://www.imdb.com/title/tt2380307/?ref_=nv_sr_1",
'3887':"https://www.imdb.com/title/tt4972582/?ref_=nv_sr_1",
'8':"https://www.imdb.com/title/tt1454468/?ref_=nv_sr_2",
'1836':"https://www.imdb.com/title/tt0470752/?ref_=nv_sr_1",
'2939':"https://www.imdb.com/title/tt4786282/?ref_=nv_sr_1",
'820':"https://www.imdb.com/title/tt1267297/?ref_=nv_sr_1",
'1751':"https://www.imdb.com/title/tt2126355/?ref_=nv_sr_1",
'85':"https://www.imdb.com/title/tt1911644/?ref_=nv_sr_2",
'3952':"https://www.imdb.com/title/tt2072233/?ref_=nv_sr_2",
'975':"https://www.imdb.com/title/tt2167266/?ref_=nv_sr_1",
'202':"https://www.imdb.com/title/tt2205904/?ref_=nv_sr_2",
'918':"https://www.imdb.com/title/tt2205401/?ref_=nv_sr_1",
'3944':"https://www.imdb.com/title/tt4287320/?ref_=nv_sr_2",
'3976':"https://www.imdb.com/title/tt1800302/?ref_=nv_sr_4",
'1747':"https://www.imdb.com/title/tt1661199/?ref_=nv_sr_1",
'1825':"https://www.imdb.com/title/tt3369806/?ref_=nv_sr_3",
'4080':"https://www.imdb.com/title/tt1781058/?ref_=nv_sr_1",
'3071':"https://www.imdb.com/title/tt1172049/?ref_=nv_sr_2",
'1793':"https://www.imdb.com/title/tt3567288/?ref_=nv_sr_1",
'2990':"https://www.imdb.com/title/tt3387542/?ref_=nv_sr_1",
'4054':"https://www.imdb.com/title/tt4154858/?ref_=nv_sr_1",
'4070':"https://www.imdb.com/title/tt5737862/?ref_=nv_sr_1",
'1796':"https://www.imdb.com/title/tt1895587/?ref_=nv_sr_1",
'1926':"https://www.imdb.com/title/tt3442006/?ref_=nv_sr_2",
'5349':"https://www.imdb.com/title/tt5628302/?ref_=nv_sr_3",
'5272':"https://www.imdb.com/title/tt6217608/?ref_=nv_sr_2",
'3991':"https://www.imdb.com/title/tt5804314/?ref_=nv_sr_1",
'3107':"https://www.imdb.com/title/tt3864024/?ref_=nv_sr_1",
'3136':"https://www.imdb.com/title/tt4113346/?ref_=nv_sr_4",
'818':"https://www.imdb.com/title/tt1109624/?ref_=nv_sr_2",
'3988':"https://www.imdb.com/title/tt3881784/?ref_=nv_sr_1",
'3998':"https://www.imdb.com/title/tt1366338/?ref_=nv_sr_1",
'1944':"https://www.imdb.com/title/tt2261331/?ref_=nv_sr_1",
'1820':"https://www.imdb.com/title/tt3530002/?ref_=nv_sr_1",
'1940':"https://www.imdb.com/title/tt2378401/?ref_=nv_sr_1",
'3025':"https://www.imdb.com/title/tt3462710/?ref_=nv_sr_2",
'985':"https://www.imdb.com/title/tt2517044/?ref_=nv_sr_4",
'4063':"https://www.imdb.com/title/tt5758404/?ref_=nv_sr_3",
'4051':"https://www.imdb.com/title/tt1967614/?ref_=nv_sr_1",
'99':"https://www.imdb.com/title/tt2404311/?ref_=nv_sr_2",
'1887':"https://www.imdb.com/title/tt2273657/?ref_=nv_sr_4",
'189':"https://www.imdb.com/title/tt1433811/?ref_=nv_sr_1",
'5348':"https://www.imdb.com/title/tt4682136/?ref_=nv_sr_1",
'836':"https://www.imdb.com/title/tt1408253/?ref_=nv_sr_2",
'135':"https://www.imdb.com/title/tt2103264/?ref_=nv_sr_7",
'903':"https://www.imdb.com/title/tt1126590/?ref_=nv_sr_1",
'1835':"https://www.imdb.com/title/tt3014866/?ref_=nv_sr_4",
'4008':"https://www.imdb.com/title/tt4218696/?ref_=nv_sr_1",
'3923':"https://www.imdb.com/title/tt4587656/?ref_=nv_sr_1",
'880':"https://www.imdb.com/title/tt2305051/?ref_=nv_sr_6",
'197':"https://www.imdb.com/title/tt1922685/?ref_=nm_flmg_wr_2",
'4050':"https://www.imdb.com/title/tt5859238/?ref_=nv_sr_3"
}

In [7]:
correction_dict_4_keys = list(correction_dict_4.keys())

In [8]:
#delete the movies that don't have english listed


In [9]:
import pandas as pd

master_list_df_reconstituted_again = pd.read_csv("data/india_release_check_v11.csv")

In [10]:
from imdb import IMDb

with open('passwords_ignore.txt') as f:
    password = f.readline()

login_string = 'postgresql://postgres:' + password + '@localhost/imdb2'

ia = IMDb('s3', login_string, adultSearch=False)

In [11]:
ib = IMDb(accessSystem='http', adultSearch=False)

In [None]:

master_list_df_reconstituted_again['genres_imdb_final'] = None
master_list_df_reconstituted_again['languages_imdb'] = None
master_list_df_reconstituted_again['id_imdb'] = None

for index,row in master_list_df_reconstituted_again.iterrows():

    number = row['number']
    print('\n')
    print('page number is {}'.format(number))
    number_string = str(number)

    title = row['title']

    try:
        if number_string in correction_dict_4_keys:

            url_imdb = correction_dict_4[number_string]
            # print('got here')
            id_imdb = re.findall(r"https:\/\/www\.imdb\.com\/title\/tt(.*)\/\?ref.*", url_imdb)[0]

            movie_http = ib.get_movie(id_imdb)

            delay = randint(5,9)
            time.sleep(delay)

            languages_imdb_raw = movie_http['languages']
            languages_imdb = ','.join(languages_imdb_raw)

            master_list_df_reconstituted_again.loc[index,'languages_imdb'] = languages_imdb

            master_list_df_reconstituted_again.loc[index,'id_imdb'] = id_imdb

            movie = ia.get_movie(id_imdb)
            genres_imdb_raw = movie['genres']
            genres_imdb = ','.join(genres_imdb_raw)
            master_list_df_reconstituted_again.loc[index,'genres_imdb_final'] = genres_imdb

        else:
            # print('am here now')

            year = row['year']
            year = str(year)

            director = row['director']

            if ',' in director:
                director = re.findall(r"(.*)\,.*",director)[0]
            elif str(director) == 'nan':
                director = 'ZZZZZZ'

            results  = ib.search_movie(title,results = 20)

            delay = randint(5,9)
            time.sleep(delay)

            title_director_year = []
            title_director = []
            title_year = []
            director_year = []

            result_counter = -1

            for result in results:
                result_counter += 1

                result_ID = result.movieID
                movie = ia.get_movie(result_ID)  #using ia, the local postgres db here instead of the main website

                try:
                    movie['kind']
                except:
                    pass
                else:
                    kindx = movie['kind']

                    if kindx == 'movie': #it could be tv episode too, so filtering them out

                        titlex = movie['title']

                        try:
                            movie['year']
                        except:
                            yearx = '0000'
                        else:
                            yearx = movie['year']
                            yearx = str(yearx)

                        try:
                            movie['director']
                        except:
                            directorx = 'XXXXXX'
                        else:
                            director_list = movie['director']
                            director_item = director_list[0]
                            director_id = director_item.personID
                            director_object = ia.get_person(director_id)
                            directorx = director_object['name']

                        if ((fuzz.token_sort_ratio(director,directorx) >= 85) and (fuzz.token_sort_ratio(title,titlex) >= 85) and
                        (yearx == year)):
                            title_director_year.append(result_counter)
                        elif ((fuzz.token_sort_ratio(director,directorx) >= 85) and (fuzz.token_sort_ratio(title,titlex) >= 85)):
                            title_director.append(result_counter)
                        elif ((fuzz.token_sort_ratio(title,titlex) >= 85) and (yearx == year)):
                            title_year.append(result_counter)
                        # elif ((fuzz.token_sort_ratio(director,directorx) >= 85) and (yearx == year)):
                        #     director_year.append(result_counter)

                continue

            def fillin_data(listx):

                index_number = listx[0]

                result_ID = results[index_number].movieID

                master_list_df_reconstituted_again.loc[index,'id_imdb'] = result_ID

                movie_http = ib.get_movie(result_ID)

                delay = randint(5,9)
                time.sleep(delay)

                languages_imdb_raw = movie_http['languages']
                languages_imdb = ','.join(languages_imdb_raw)
                master_list_df_reconstituted_again.loc[index,'languages_imdb'] = languages_imdb

                movie = ia.get_movie(result_ID)

                genres_imdb_raw = movie['genres']
                genres_imdb = ','.join(genres_imdb_raw)
                master_list_df_reconstituted_again.loc[index,'genres_imdb_final'] = genres_imdb


            if len(title_director_year) > 0:
                fillin_data(title_director_year)
            else:
                if len(title_director) > 0:
                    fillin_data(title_director)
                else:
                    if len(title_year) > 0:
                        fillin_data(title_year)
                    # else:
                    #     if len(director_year) > 0:
                    #         fillin_data(director_year)

    except:
        print('error for movie of page number {} and title "{}"'.format(number,title))
        continue

print('\n')
print('done and done')

master_list_df_reconstituted_again.to_csv('data/india_release_check_v12.csv', index=False)



page number is 3867
error for movie of page number 3867 and title "Pirates of the Caribbean: Dead Men Tell No Tales"


page number is 164
error for movie of page number 164 and title "The East"


page number is 12
error for movie of page number 12 and title "World War Z"


page number is 1834
error for movie of page number 1834 and title "Ricki and the Flash"


page number is 851
error for movie of page number 851 and title "Transcendence"


page number is 3023
error for movie of page number 3023 and title "The Birth of a Nation"


page number is 2895
error for movie of page number 2895 and title "Suicide Squad"


page number is 3986
error for movie of page number 3986 and title "My Cousin Rachel"


page number is 3074
got here


page number is 217
error for movie of page number 217 and title "The Reluctant Fundamentalist"


page number is 3899
error for movie of page number 3899 and title "Baywatch"


page number is 2927
error for movie of page number 2927 and title "The BFG"


page

2018-10-18 00:17:53,883 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': 104, 'errmsg': 'Connection reset by peer', 'url': 'http://www.imdb.com/title/tt6306064/plotsummary', 'proxy': '', 'exception type': 'IOError', 'original exception': ConnectionResetError(104, 'Connection reset by peer')},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/site-packages/imdb/parser/http/__init__.py", line 231, in retrieve_unicode
    response = uopener.open(url)
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 526, in open
    response = self._open(req, data)
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 544, in _open
    '_open', req)
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py"



page number is 3941
error for movie of page number 3941 and title "Logan Lucky"


page number is 28
error for movie of page number 28 and title "Cloudy with a Chance of Meatballs 2"


page number is 933
got here


page number is 809
error for movie of page number 809 and title "Noah"


page number is 5062
error for movie of page number 5062 and title "Insidious: The Last Key"


page number is 848
error for movie of page number 848 and title "Horrible Bosses 2"


page number is 5183
error for movie of page number 5183 and title "Isle of Dogs"


page number is 1772
error for movie of page number 1772 and title "Creed"


page number is 63
error for movie of page number 63 and title "Escape Plan"


page number is 3004
error for movie of page number 3004 and title "Whiskey Tango Foxtrot"


page number is 3924
error for movie of page number 3924 and title "The Mountain Between Us"


page number is 3022
error for movie of page number 3022 and title "The Edge of Seventeen"


page number is 3



page number is 2919
error for movie of page number 2919 and title "Sully"


page number is 1843
error for movie of page number 1843 and title "Victor Frankenstein"


page number is 3866
got here


page number is 53
error for movie of page number 53 and title "2 Guns"


page number is 892
error for movie of page number 892 and title "3 Days to Kill"


page number is 3903
error for movie of page number 3903 and title "Ghost in the Shell"


page number is 3887
got here


In [None]:
#dont know why but the browser started hanging , just copied the whole thing into a python script and ran it on the command line


In [1]:
import pandas as pd

master_list_df_reconstituted_again = pd.read_csv('data/india_release_check_v12.csv')

In [11]:
master_list_df_reconstituted_again.loc[master_list_df_reconstituted_again['genres_imdb_final'].isnull()]['number']

155     1811
209     2970
388      112
504     2921
590     2932
630     4004
632      108
670      952
736     3035
789      170
830      911
835     3936
844       65
850       48
855     1759
861     2913
956      995
996     2969
999       35
1006      96
1020      97
1034    3120
1039    5429
Name: number, dtype: int64

In [3]:
master_list_df_reconstituted_again.loc[master_list_df_reconstituted_again['id_imdb'].isnull()]

Unnamed: 0,number,year,title,domestic_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,source,...,production_companies,production_countries,released_in_india_2nd_check,production_budget_bins,domestic_box_office_bins,director,genres_imdb_check,genres_imdb_final,languages_imdb,id_imdb
155,1811,2015,The Longest Ride,37446117,34000000,2015-04-10,20th Century Fox,PG-13,,Based on Fiction Book/Short Story,...,"Fox 2000 Pictures,Temple Hill Entertainment",United States,yes,25m-50m,50k-50m,"George Tillman, Jr","drama,romance",,,
209,2970,2016,13 Hours: The Secret Soldiers of Benghazi,52853219,50000000,2016-01-15,Paramount Pictures,R,,Based on Factual Book/Article,...,"Paramount Pictures,3 Arts Entertainment,Bay Films",United States,yes,50m-75m,50m-100m,Michael Bay,"action,drama,history",,,
388,112,2013,Kevin Hart: Let Me Explain,32244051,2500000,2013-07-03,Lionsgate,R,,Based on Real Life Events,...,,United States,yes,100k-25m,50k-50m,,"comedy,documentary",,,
504,2921,2016,Ghostbusters,128350574,144000000,2016-07-15,Sony Pictures,PG-13,Ghostbusters,Remake,...,"Columbia Pictures,Village Roadshow Productions...",United States,yes,125m-150m,100m-150m,Paul Feig,"action,comedy,fantasy",,,
590,2932,2016,The Divergent Series: Allegiant,66184051,110000000,2016-03-18,Lionsgate,PG-13,The Divergent Series,Based on Fiction Book/Short Story,...,"Summit Entertainment,Red Wagon Entertainment",United States,yes,100m-125m,50m-100m,Robert Schwentke,"action,adventure,mystery",,,
630,4004,2017,An Inconvenient Sequel,3496795,1000000,2017-07-28,Paramount Vantage,PG,An Inconvenient Truth,Based on Real Life Events,...,,United States,yes,100k-25m,50k-50m,"Bonni Cohen,Jon Shenk",documentary,,,
632,108,2013,21 and Over,25682380,13000000,2013-03-01,Relativity,R,,Original Screenplay,...,"Relativity Media,SkyLand Entertainment,Virgin,...",United States,yes,100k-25m,50k-50m,"Jon Lucas,Scott Moore",comedy,,,
670,952,2014,The Fluffy Movie,2827393,0,2014-07-25,Open Road,PG-13,,Based on Real Life Events,...,"Gulfstream Pictures,FluffyShop,ArsonHouse Prod...",United States,no,no info,50k-50m,Manny Rodriguez,comedy,,,
736,3035,2016,The Beatles: Eight Days a Week,2934445,0,2016-09-16,Abramorama Films,Not Rated,,Based on Real Life Events,...,"White Horse Pictures,Imagine Entertainment,App...",United States,no,no info,50k-50m,Ron Howard,"documentary,music",,,
789,170,2013,Girl Most Likely,1378591,0,2013-07-19,Roadside Attractions,PG-13,,Original Screenplay,...,"Maven Pictures,Anonymous Content,Ambush Entert...",United States,no,no info,50k-50m,"Robert Pulcini,Shari Springer Berman",comedy,,,


In [4]:
master_list_df_reconstituted_again.loc[master_list_df_reconstituted_again['languages_imdb'].isnull()]

Unnamed: 0,number,year,title,domestic_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,source,...,production_companies,production_countries,released_in_india_2nd_check,production_budget_bins,domestic_box_office_bins,director,genres_imdb_check,genres_imdb_final,languages_imdb,id_imdb
155,1811,2015,The Longest Ride,37446117,34000000,2015-04-10,20th Century Fox,PG-13,,Based on Fiction Book/Short Story,...,"Fox 2000 Pictures,Temple Hill Entertainment",United States,yes,25m-50m,50k-50m,"George Tillman, Jr","drama,romance",,,
209,2970,2016,13 Hours: The Secret Soldiers of Benghazi,52853219,50000000,2016-01-15,Paramount Pictures,R,,Based on Factual Book/Article,...,"Paramount Pictures,3 Arts Entertainment,Bay Films",United States,yes,50m-75m,50m-100m,Michael Bay,"action,drama,history",,,
388,112,2013,Kevin Hart: Let Me Explain,32244051,2500000,2013-07-03,Lionsgate,R,,Based on Real Life Events,...,,United States,yes,100k-25m,50k-50m,,"comedy,documentary",,,
504,2921,2016,Ghostbusters,128350574,144000000,2016-07-15,Sony Pictures,PG-13,Ghostbusters,Remake,...,"Columbia Pictures,Village Roadshow Productions...",United States,yes,125m-150m,100m-150m,Paul Feig,"action,comedy,fantasy",,,
590,2932,2016,The Divergent Series: Allegiant,66184051,110000000,2016-03-18,Lionsgate,PG-13,The Divergent Series,Based on Fiction Book/Short Story,...,"Summit Entertainment,Red Wagon Entertainment",United States,yes,100m-125m,50m-100m,Robert Schwentke,"action,adventure,mystery",,,
630,4004,2017,An Inconvenient Sequel,3496795,1000000,2017-07-28,Paramount Vantage,PG,An Inconvenient Truth,Based on Real Life Events,...,,United States,yes,100k-25m,50k-50m,"Bonni Cohen,Jon Shenk",documentary,,,
632,108,2013,21 and Over,25682380,13000000,2013-03-01,Relativity,R,,Original Screenplay,...,"Relativity Media,SkyLand Entertainment,Virgin,...",United States,yes,100k-25m,50k-50m,"Jon Lucas,Scott Moore",comedy,,,
670,952,2014,The Fluffy Movie,2827393,0,2014-07-25,Open Road,PG-13,,Based on Real Life Events,...,"Gulfstream Pictures,FluffyShop,ArsonHouse Prod...",United States,no,no info,50k-50m,Manny Rodriguez,comedy,,,
736,3035,2016,The Beatles: Eight Days a Week,2934445,0,2016-09-16,Abramorama Films,Not Rated,,Based on Real Life Events,...,"White Horse Pictures,Imagine Entertainment,App...",United States,no,no info,50k-50m,Ron Howard,"documentary,music",,,
789,170,2013,Girl Most Likely,1378591,0,2013-07-19,Roadside Attractions,PG-13,,Original Screenplay,...,"Maven Pictures,Anonymous Content,Ambush Entert...",United States,no,no info,50k-50m,"Robert Pulcini,Shari Springer Berman",comedy,,,


In [5]:
master_list_df_reconstituted_again.dtypes

number                           int64
year                             int64
title                           object
domestic_box_office              int64
production_budget                int64
domestic_release_date           object
domestic_distributor            object
mpaa_rating                     object
franchise                       object
source                          object
genre                           object
production_method               object
creative_type                   object
production_companies            object
production_countries            object
released_in_india_2nd_check     object
production_budget_bins          object
domestic_box_office_bins        object
director                        object
genres_imdb_check               object
genres_imdb_final               object
languages_imdb                  object
id_imdb                        float64
dtype: object

In [14]:
fill_dict = {
"1811":"https://www.imdb.com/title/tt2726560/?ref_=nv_sr_1",
"2970":"https://www.imdb.com/title/tt4172430/?ref_=nv_sr_1",
"112":"https://www.imdb.com/title/tt2609912/?ref_=nv_sr_1",
"2921":"https://www.imdb.com/title/tt1289401/?ref_=nv_sr_2",
"2932":"https://www.imdb.com/title/tt3410834/?ref_=nv_sr_2",
"4004":"https://www.imdb.com/title/tt6322922/?ref_=nv_sr_1",
"108":"https://www.imdb.com/title/tt1711425/?ref_=nv_sr_1",
"952":"https://www.imdb.com/title/tt3532608/?ref_=nv_sr_1",
"3035":"https://www.imdb.com/title/tt2531318/?ref_=nv_sr_1",
"170":"https://www.imdb.com/title/tt1698648/?ref_=nv_sr_1",
"911":"https://www.imdb.com/title/tt0884726/?ref_=nv_sr_1",
"3936":"https://www.imdb.com/title/tt6217804/?ref_=nv_sr_1",
"65":"https://www.imdb.com/title/tt2378281/?ref_=nv_sr_1",
"48":"https://www.imdb.com/title/tt3063516/?ref_=nv_sr_1",
"1759":"https://www.imdb.com/title/tt2908446/?ref_=nv_sr_1",
"2913":"https://www.imdb.com/title/tt3065204/?ref_=nv_sr_1",
"995":"https://www.imdb.com/title/tt1986769/?ref_=nv_sr_1",
"2969":"https://www.imdb.com/title/tt5325452/?ref_=nv_sr_2",
"35":"https://www.imdb.com/title/tt1691917/?ref_=nv_sr_1",
"96":"https://www.imdb.com/title/tt2070862/?ref_=ttfc_fc_tt",
"97":"https://www.imdb.com/title/tt2609758/?ref_=nv_sr_1",
"3120":"https://www.imdb.com/title/tt4700756/?ref_=nm_flmg_cin_3",
"5429":"https://www.imdb.com/title/tt5164214/?ref_=nv_sr_1"
}

In [15]:
from imdb import IMDb

with open('passwords_ignore.txt') as f:
    password = f.readline()

login_string = 'postgresql://postgres:' + password + '@localhost/imdb2'

ia = IMDb('s3', login_string, adultSearch = False)

ib = IMDb(accessSystem = 'http', adultSearch = False)

In [17]:
import re
from random import randint
import time

for key, value in fill_dict.items():
    
    print(key)
    number = int(key)
    id_imdb = re.findall(r"https:\/\/www\.imdb\.com\/title\/tt(.*)\/\?ref.*", value)[0]

    try:
        movie_http = ib.get_movie(id_imdb)

        delay = randint(5,9)
        time.sleep(delay)

        languages_imdb_raw = movie_http['languages']
        languages_imdb = ','.join(languages_imdb_raw)

        master_list_df_reconstituted_again.loc[master_list_df_reconstituted_again['number'] == number,'languages_imdb'] = languages_imdb

        master_list_df_reconstituted_again.loc[master_list_df_reconstituted_again['number'] == number,'id_imdb'] = id_imdb

        movie = ia.get_movie(id_imdb)
        genres_imdb_raw = movie['genres']
        genres_imdb = ','.join(genres_imdb_raw)
        master_list_df_reconstituted_again.loc[master_list_df_reconstituted_again['number'] == number,'genres_imdb_final'] = genres_imdb
    
    except:
        print('error for page number {}'.format(key))
        
print('done and done')

1811
2970
112
2921
2932
4004
108
952
3035
170
911
3936
65
48
1759
2913
995
2969
35
96
97
3120
error for page number 3120
5429
done and done


In [18]:
master_list_df_reconstituted_again.to_csv('data/india_release_check_v12.csv', index=False)

In [19]:
master_list_df_reconstituted_again.loc[master_list_df_reconstituted_again['genres_imdb_final'].isnull()]

Unnamed: 0,number,year,title,domestic_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,source,...,production_companies,production_countries,released_in_india_2nd_check,production_budget_bins,domestic_box_office_bins,director,genres_imdb_check,genres_imdb_final,languages_imdb,id_imdb
1034,3120,2016,Believe,890303,3500000,2016-12-02,Smith Global Media,PG,,Original Screenplay,...,"Power of 3,Smith Global Management",United States,no,100k-25m,50k-50m,Billy Dickson,drama,,,4700760.0


In [20]:
master_list_df_reconstituted_again.loc[master_list_df_reconstituted_again['id_imdb'].isnull()]

Unnamed: 0,number,year,title,domestic_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,source,...,production_companies,production_countries,released_in_india_2nd_check,production_budget_bins,domestic_box_office_bins,director,genres_imdb_check,genres_imdb_final,languages_imdb,id_imdb


In [21]:
master_list_df_reconstituted_again.loc[master_list_df_reconstituted_again['languages_imdb'].isnull()]

Unnamed: 0,number,year,title,domestic_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,source,...,production_companies,production_countries,released_in_india_2nd_check,production_budget_bins,domestic_box_office_bins,director,genres_imdb_check,genres_imdb_final,languages_imdb,id_imdb
1034,3120,2016,Believe,890303,3500000,2016-12-02,Smith Global Media,PG,,Original Screenplay,...,"Power of 3,Smith Global Management",United States,no,100k-25m,50k-50m,Billy Dickson,drama,,,4700760.0


In [27]:
movie_http = ib.get_movie('4700756') 

master_list_df_reconstituted_again.loc[master_list_df_reconstituted_again['number'] == 3120,'languages_imdb'] = 'English'
master_list_df_reconstituted_again.loc[master_list_df_reconstituted_again['number'] == 3120,'id_imdb'] = '4700756'
master_list_df_reconstituted_again.loc[master_list_df_reconstituted_again['number'] == 3120,'genres_imdb_final'] = movie_http['genres']

print('done and done')

done and done


In [28]:
master_list_df_reconstituted_again['id_imdb'] = master_list_df_reconstituted_again['id_imdb'].astype(int)

In [29]:
master_list_df_reconstituted_again.loc[master_list_df_reconstituted_again['genres_imdb_final'].isnull()]

Unnamed: 0,number,year,title,domestic_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,source,...,production_companies,production_countries,released_in_india_2nd_check,production_budget_bins,domestic_box_office_bins,director,genres_imdb_check,genres_imdb_final,languages_imdb,id_imdb


In [30]:
master_list_df_reconstituted_again.loc[master_list_df_reconstituted_again['languages_imdb'].isnull()]

Unnamed: 0,number,year,title,domestic_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,source,...,production_companies,production_countries,released_in_india_2nd_check,production_budget_bins,domestic_box_office_bins,director,genres_imdb_check,genres_imdb_final,languages_imdb,id_imdb


In [31]:
master_list_df_reconstituted_again.to_csv('data/india_release_check_v12.csv', index=False)

In [32]:
master_list_df_reconstituted_again.loc[~(master_list_df_reconstituted_again['languages_imdb'].str.contains('English', na=False, regex=False))]

Unnamed: 0,number,year,title,domestic_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,source,...,production_companies,production_countries,released_in_india_2nd_check,production_budget_bins,domestic_box_office_bins,director,genres_imdb_check,genres_imdb_final,languages_imdb,id_imdb
691,3069,2016,The Eagle Huntress,3168664,0,2016-11-02,Sony Pictures Classics,G,,Based on Real Life Events,...,"Sony Pictures Classics,19340 Productions,Artem...","Mongolia,United Kingdom,United States",no,no info,50k-50m,Otto Bell,"adventure,documentary,sport","adventure,documentary,sport",Kazakh,3882074
695,3133,2016,A Tale of Love and Darkness,572212,0,2016-08-19,Focus World,PG-13,,Based on Factual Book/Article,...,"Movie Plus,Ram Bergman Productions,Keshet,Avi ...","Israel,United States",no,no info,50k-50m,Natalie Portman,"biography,drama,history","biography,drama,history",Hebrew,1135989
787,3012,2016,No Manches Frida,11528613,0,2016-09-02,Lionsgate,PG-13,,Remake,...,"Pantelion Films,Lionsgate,Alcon Entertainment,...",United States,no,no info,50k-50m,Nacho G. Velilla,comedy,comedy,Spanish,5259966
839,1001,2014,A Girl Walks Home Alone at Night,544032,0,2014-11-21,Kino Lorber,Not Rated,,Original Screenplay,...,"Say Ahh Productions,Black Light District,Logan...",United States,no,no info,50k-50m,Ana Lily Amirpour,"drama,horror","drama,horror",Persian,2326554
920,4011,2017,KEDi,2834262,0,2017-02-10,Oscilloscope Pictures,Not Rated,,Based on Real Life Events,...,,"Turkey,United States",no,no info,50k-50m,Ceyda Torun,documentary,documentary,Turkish,4420704
1051,156,2013,No,2341226,0,2013-02-15,Sony Pictures Classics,R,,Based on Real Life Events,...,"Fabula,Participant Media,Funny Balloons","Chile,France,United States",no,no info,50k-50m,Pablo Larrain,drama,drama,Spanish,2059255


In [33]:
#removing the movies which aren't in English

master_list_df_reconstituted_again = master_list_df_reconstituted_again.loc[master_list_df_reconstituted_again['languages_imdb'].str.contains('English', na=False, regex=False)]

In [34]:
#went through the list of titles, deleting two titles which are confirmed Spanish-lanaguage movies-- 'Ladrones' and 'No se Aceptan Devoluciones'-- even if english is listed in the imdb languages

master_list_df_reconstituted_again = master_list_df_reconstituted_again.loc[~((master_list_df_reconstituted_again['number'] == 65) | (master_list_df_reconstituted_again['number'] == 1906))]

In [35]:
master_list_df_reconstituted_again.shape

(1074, 23)

In [36]:
master_list_df_reconstituted_again.to_csv('data/india_release_check_v13.csv', index=False)

In [37]:
master_list_df_reconstituted_again.loc[master_list_df_reconstituted_again['genres_imdb_final'] != master_list_df_reconstituted_again['genres_imdb_check']]

Unnamed: 0,number,year,title,domestic_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,source,...,production_companies,production_countries,released_in_india_2nd_check,production_budget_bins,domestic_box_office_bins,director,genres_imdb_check,genres_imdb_final,languages_imdb,id_imdb
315,3980,2017,The Promise,8224288,90000000,2017-04-21,Open Road,PG-13,,Original Screenplay,...,"Survival Pictures,Mike Medavoy Productions","Spain,United States",yes,75m-100m,50k-50m,Terry George,"drama,horror,thriller","drama,history","English,Armenian,German,French,Turkish",4776998
637,19,2013,The Hangover 3,112200072,103000000,2013-05-23,Warner Bros.,R,Hangover,Original Screenplay,...,"Warner Bros.,Legendary Pictures,Green Hat Film...",United States,yes,100m-125m,100m-150m,Todd Phillips,"comedy,crime",comedy,English,1119646
976,4021,2017,Jeepers Creepers 3,2235162,0,,Weinstein Co.,Not Rated,Jeepers Creepers,Original Screenplay,...,"American Zoetrope,Odyssey Media,The Cartel",United States,no,no info,50k-50m,Victor Salva,"action,horror,mystery","horror,mystery",English,263488
1034,3120,2016,Believe,890303,3500000,2016-12-02,Smith Global Media,PG,,Original Screenplay,...,"Power of 3,Smith Global Management",United States,no,100k-25m,50k-50m,Billy Dickson,drama,Drama,English,4700756
1070,5582,2018,A.X.L.,6500473,0,2018-08-24,Global Road,PG,,Based on Short Film,...,"Global Road Entertainment,Lakeshore Entertainm...",United States,no,no info,50k-50m,Oliver Daly,,"adventure,family,sci-fi",English,5709188


In [38]:
master_list_df_reconstituted_again.drop(['genres_imdb_check'], axis = 1, inplace = True)

In [2]:
master_list_df_reconstituted_again.loc[master_list_df_reconstituted_again['number'] == 3120,'genres_imdb_final'] = 'drama'

In [52]:
production_company_list = list(master_list_df_reconstituted_again['production_companies'].unique())
production_company_list = [x for x in production_company_list if str(x) != 'nan']
final_list = []
for production_company in production_company_list:
    list_to_add = [x.strip() for x in production_company.split(",")]
    final_list.extend(list_to_add)
    
final_list = list(set(final_list))

for production_company_A in final_list:
    for production_company_B in final_list:
        if production_company_B in production_company_A:
            if production_company_A != production_company_B:
                print('this item in list "{}" as a string contains another item "{}"'.format(production_company_A,production_company_B))    

this item in list "LLC Primeredian Entertainment" as a string contains another item "LLC"
this item in list "Park Pictures Features" as a string contains another item "Park Pictures"
this item in list "Peak Distribution Partners LLC" as a string contains another item "LLC"
this item in list "Film 44" as a string contains another item "Film 4"
this item in list "KGB Media" as a string contains another item "KG"
this item in list "Jellyfish Bloom" as a string contains another item "Bloom"
this item in list "PalmStar Media Capital" as a string contains another item "PalmStar Media"
this item in list "Sony Pictures Animation" as a string contains another item "Sony Pictures"
this item in list "Scholastic Entertainment Inc." as a string contains another item "Inc."
this item in list "Dentsu Inc." as a string contains another item "Inc."
this item in list "Riche-Ludwig" as a string contains another item "Riche"
this item in list "3DTK Inc." as a string contains another item "Inc."
this item 

In [55]:
value_replace_dict = {
"Park Pictures Features":"Park Pictures",
"PalmStar Media Capital":"PalmStar Media",
"Jerry Bruckheimer Films":"Jerry Bruckheimer",
"Rat Pack Film Produktion":"Rat Pack",
"Spring Creek Prod":"Spring Creek",
"Constantin Film International":"Constantin Film",
"Pathe Distribution":"Pathe",
"Image Nation Abu Dhabi":"Image Nation",
"Hasbro Studios":"Hasbro"
}

master_list_df_reconstituted_again['production_companies'].replace(value_replace_dict, inplace=True, regex=True)

In [56]:
production_company_list = list(master_list_df_reconstituted_again['production_companies'].unique())
production_company_list = [x for x in production_company_list if str(x) != 'nan']
final_list = []
for production_company in production_company_list:
    list_to_add = [x.strip() for x in production_company.split(",")]
    final_list.extend(list_to_add)
    
final_list = list(set(final_list))

for production_company_A in final_list:
    for production_company_B in final_list:
        if production_company_B in production_company_A:
            if production_company_A != production_company_B:
                print('this item in list "{}" as a string contains another item "{}"'.format(production_company_A,production_company_B))    

this item in list "LLC Primeredian Entertainment" as a string contains another item "LLC"
this item in list "Peak Distribution Partners LLC" as a string contains another item "LLC"
this item in list "Film 44" as a string contains another item "Film 4"
this item in list "KGB Media" as a string contains another item "KG"
this item in list "Jellyfish Bloom" as a string contains another item "Bloom"
this item in list "Sony Pictures Animation" as a string contains another item "Sony Pictures"
this item in list "Scholastic Entertainment Inc." as a string contains another item "Inc."
this item in list "Dentsu Inc." as a string contains another item "Inc."
this item in list "Riche-Ludwig" as a string contains another item "Riche"
this item in list "3DTK Inc." as a string contains another item "Inc."
this item in list "20th Century Fox Animation" as a string contains another item "20th Century Fox"
this item in list "87Eleven Inc." as a string contains another item "Inc."
this item in list "Son

In [9]:
production_company_list = list(master_list_df_reconstituted_again['production_companies'].unique())
production_company_list = [x for x in production_company_list if str(x) != 'nan']
final_list = []
for production_company in production_company_list:
    list_to_add = [x.strip() for x in production_company.split(",")]
    final_list.extend(list_to_add)
    
final_list = list(set(final_list))
final_list = [x for x in final_list if x != 'LLC']
final_list = [x for x in final_list if x != 'Inc.']

with open('data/production_company_list.txt', 'w') as f:
    for item in final_list:
        f.write("%s\n" % item)
# final_list

In [17]:
#removing releases of stand-up performances such as that of Kevin Hart

master_list_df_reconstituted_again = master_list_df_reconstituted_again.loc[master_list_df_reconstituted_again['genre'] != 'Concert/Performance']

In [45]:
#1746 - fifty shades grey, don't know why but bookmyshow list says that it was released in India. All news reports from the time say it was banned

correction_list = [1746]

In [46]:
for numberx in correction_list:
    master_list_df_reconstituted_again.loc[master_list_df_reconstituted_again['number'] == numberx, 'released_in_india_2nd_check'] = 'no'

print('done and done')

done and done


In [47]:
master_list_df_reconstituted_again.to_csv('data/india_release_check_v15.csv', index=False)

In [16]:
# import pandas as pd

# master_list_df_reconstituted_again = pd.read_csv('data/india_release_check_v14.csv')

In [61]:
master_list_df_reconstituted_again.dtypes

number                          int64
year                            int64
title                          object
domestic_box_office             int64
production_budget               int64
domestic_release_date          object
domestic_distributor           object
mpaa_rating                    object
franchise                      object
source                         object
genre                          object
production_method              object
creative_type                  object
production_companies           object
production_countries           object
released_in_india_2nd_check    object
production_budget_bins         object
domestic_box_office_bins       object
director                       object
genres_imdb_final              object
languages_imdb                 object
id_imdb                         int64
dtype: object

In [64]:
master_list_df_reconstituted_again.shape

(1069, 22)

In [67]:
master_list_df_reconstituted_again_yes_documentaries = master_list_df_reconstituted_again[(master_list_df_reconstituted_again['genre'] == 'Documentary') |
                                                                                        (master_list_df_reconstituted_again['genres_imdb_final'].str.contains('documentary', na=False, regex=False))]

In [68]:
master_list_df_reconstituted_again_yes_documentaries.shape

(64, 22)

In [69]:
master_list_df_reconstituted_again_no_documentaries = master_list_df_reconstituted_again[~((master_list_df_reconstituted_again['genre'] == 'Documentary') |
                                                                                        (master_list_df_reconstituted_again['genres_imdb_final'].str.contains('documentary', na=False, regex=False)))]

In [70]:
master_list_df_reconstituted_again_no_documentaries.shape

(1005, 22)

In [71]:
master_list_df_reconstituted_again_no_documentaries.to_csv('data/india_release_check_v16.csv', index=False)

In [72]:
master_list_df_reconstituted_again_yes_documentaries.to_csv('data/india_release_check_v16_documentaries.csv', index=False)

In [1]:
import pandas as pd

master_list_df_reconstituted_again = pd.read_csv('data/india_release_check_v16.csv')

In [2]:
master_list_df_reconstituted_again.shape

(1005, 22)

In [5]:
category = 'genre' #this is the genre categorisation from the-numbers.com

df = pd.DataFrame(columns=["category", '2013','2014','2015','2016','2017','2018','total not','total','2013 %','2014 %','2015 %','2016 %','2017 %','2018 %','total not %'])

category_name_list = master_list_df_reconstituted_again[category].unique()
category_name_list = [x for x in category_name_list if str(x) != 'nan']

for category_name in category_name_list:
    
    df = df.append({"category": category_name,
                    '2013':None,
                    '2014':None,
                    '2015':None,
                    '2016':None,
                    '2017':None,
                    '2018':None,
                    'total not':None,
                    'total':None,
                    '2013 %':None,
                    '2014 %':None,
                    '2015 %':None,
                    '2016 %':None,
                    '2017 %':None,
                    '2018 %':None,
                    'total not %':None
                    }, ignore_index=True)
    
    movies_in_category_name_total_released_india_no = len(master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again[category] == category_name) & 
                                                                            (master_list_df_reconstituted_again['released_in_india_2nd_check'] != 'yes')])

    movies_in_category_name_total = len(master_list_df_reconstituted_again.loc[master_list_df_reconstituted_again[category] == category_name])

    movies_in_category_name_total_released_india_no_percent = (movies_in_category_name_total_released_india_no/movies_in_category_name_total)*100
    movies_in_category_name_total_released_india_no_percent = round(movies_in_category_name_total_released_india_no_percent,1)
    
    df.loc[df['category'] == category_name, "total" ] = int(movies_in_category_name_total)
    df.loc[df['category'] == category_name, "total not" ] = int(movies_in_category_name_total_released_india_no)
    df.loc[df['category'] == category_name, 'total not %' ] = movies_in_category_name_total_released_india_no_percent

    for year in range(2013,2019):
        
        year_string = str(year)
        
        year_percent_string = str(year) + ' %'
        
        movies_in_category_name_released_india_no = len(master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again[category] == category_name)  &
                                                                            (master_list_df_reconstituted_again['released_in_india_2nd_check'] != 'yes') &
                                                                            (master_list_df_reconstituted_again['year'] == year)])

        movies_in_category_name = len(master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again[category] == category_name) &
                                                                                       (master_list_df_reconstituted_again['year'] == year) ])
        if movies_in_category_name != 0:
            movies_in_category_name_released_india_no_percent = (movies_in_category_name_released_india_no/movies_in_category_name)*100
            movies_in_category_name_released_india_no_percent = round(movies_in_category_name_released_india_no_percent,1)
            
            df.loc[df['category'] == category_name, year_string ] = movies_in_category_name_released_india_no
            df.loc[df['category'] == category_name, year_percent_string ] = movies_in_category_name_released_india_no_percent
            
df.sort_values('total not',ascending=False)

Unnamed: 0,category,2013,2014,2015,2016,2017,2018,total not,total,2013 %,2014 %,2015 %,2016 %,2017 %,2018 %,total not %
1,Drama,13.0,17,23.0,13.0,28,26.0,120,328,28.9,27.9,36.5,20.3,46.7,74.3,36.6
5,Comedy,8.0,8,10.0,8.0,10,5.0,49,161,27.6,32.0,30.3,23.5,40.0,33.3,30.4
3,Thriller/Suspense,5.0,1,1.0,2.0,2,8.0,19,121,18.5,5.9,5.0,10.0,10.0,47.1,15.7
6,Romantic Comedy,1.0,5,1.0,1.0,0,3.0,11,34,12.5,55.6,25.0,20.0,0.0,75.0,32.4
4,Horror,1.0,2,0.0,0.0,3,0.0,6,65,12.5,20.0,0.0,0.0,21.4,0.0,9.2
0,Adventure,1.0,1,0.0,2.0,0,1.0,5,123,4.8,4.2,0.0,7.4,0.0,9.1,4.1
2,Action,1.0,0,1.0,1.0,0,1.0,4,132,4.0,0.0,5.0,4.5,0.0,4.5,3.0
8,Black Comedy,1.0,0,0.0,1.0,1,1.0,4,18,25.0,0.0,0.0,33.3,16.7,50.0,22.2
7,Musical,0.0,0,0.0,1.0,0,0.0,1,12,0.0,0.0,0.0,33.3,0.0,0.0,8.3
9,Western,0.0,1,0.0,0.0,0,,1,9,0.0,100.0,0.0,0.0,0.0,,11.1


In [6]:
category = 'creative_type'

df = pd.DataFrame(columns=["category", '2013','2014','2015','2016','2017','2018','total not','total','2013 %','2014 %','2015 %','2016 %','2017 %','2018 %','total not %'])

category_name_list = master_list_df_reconstituted_again[category].unique()
category_name_list = [x for x in category_name_list if str(x) != 'nan']

for category_name in category_name_list:
    
    df = df.append({"category": category_name,
                    '2013':None,
                    '2014':None,
                    '2015':None,
                    '2016':None,
                    '2017':None,
                    '2018':None,
                    'total not':None,
                    'total':None,
                    '2013 %':None,
                    '2014 %':None,
                    '2015 %':None,
                    '2016 %':None,
                    '2017 %':None,
                    '2018 %':None,
                    'total not %':None
                    }, ignore_index=True)
    
    movies_in_category_name_total_released_india_no = len(master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again[category] == category_name) & 
                                                                            (master_list_df_reconstituted_again['released_in_india_2nd_check'] != 'yes')])

    movies_in_category_name_total = len(master_list_df_reconstituted_again.loc[master_list_df_reconstituted_again[category] == category_name])

    movies_in_category_name_total_released_india_no_percent = (movies_in_category_name_total_released_india_no/movies_in_category_name_total)*100
    movies_in_category_name_total_released_india_no_percent = round(movies_in_category_name_total_released_india_no_percent,1)
    
    df.loc[df['category'] == category_name, "total" ] = int(movies_in_category_name_total)
    df.loc[df['category'] == category_name, "total not" ] = int(movies_in_category_name_total_released_india_no)
    df.loc[df['category'] == category_name, 'total not %' ] = movies_in_category_name_total_released_india_no_percent

    for year in range(2013,2019):
        
        year_string = str(year)
        
        year_percent_string = str(year) + ' %'
        
        movies_in_category_name_released_india_no = len(master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again[category] == category_name)  &
                                                                            (master_list_df_reconstituted_again['released_in_india_2nd_check'] != 'yes') &
                                                                            (master_list_df_reconstituted_again['year'] == year)])

        movies_in_category_name = len(master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again[category] == category_name) &
                                                                                       (master_list_df_reconstituted_again['year'] == year) ])
        if movies_in_category_name != 0:
            movies_in_category_name_released_india_no_percent = (movies_in_category_name_released_india_no/movies_in_category_name)*100
            movies_in_category_name_released_india_no_percent = round(movies_in_category_name_released_india_no_percent,1)
            
            df.loc[df['category'] == category_name, year_string ] = movies_in_category_name_released_india_no
            df.loc[df['category'] == category_name, year_percent_string ] = movies_in_category_name_released_india_no_percent
            
df.sort_values('total not',ascending=False)

Unnamed: 0,category,2013,2014,2015,2016,2017,2018,total not,total,2013 %,2014 %,2015 %,2016 %,2017 %,2018 %,total not %
1,Contemporary Fiction,24,24.0,20.0,14.0,30.0,29.0,141,487,26.7,30.4,23.0,18.4,34.5,42.6,29.0
3,Dramatization,1,3.0,10.0,9.0,10.0,11.0,44,139,6.2,17.6,35.7,30.0,32.3,64.7,31.7
6,Historical Fiction,2,4.0,4.0,1.0,4.0,2.0,17,88,16.7,25.0,25.0,4.2,25.0,50.0,19.3
0,Fantasy,1,2.0,2.0,1.0,0.0,1.0,7,83,5.9,8.3,13.3,5.0,0.0,50.0,8.4
2,Science Fiction,1,2.0,0.0,2.0,0.0,2.0,7,101,6.2,11.8,0.0,10.5,0.0,16.7,6.9
5,Kids Fiction,1,0.0,0.0,1.0,0.0,0.0,2,72,7.7,0.0,0.0,6.2,0.0,0.0,2.8
4,Super Hero,0,0.0,0.0,1.0,0.0,0.0,1,31,0.0,0.0,0.0,12.5,0.0,0.0,3.2
7,Multiple Creative Types,1,0.0,,,,,1,2,100.0,0.0,,,,,50.0
8,Factual,0,,,,1.0,,1,2,0.0,,,,100.0,,50.0


In [7]:
category = 'source'

df = pd.DataFrame(columns=["category", '2013','2014','2015','2016','2017','2018','total not','total','2013 %','2014 %','2015 %','2016 %','2017 %','2018 %','total not %'])

category_name_list = master_list_df_reconstituted_again[category].unique()
category_name_list = [x for x in category_name_list if str(x) != 'nan']

for category_name in category_name_list:
    
#     print(category_name)
    
    df = df.append({"category": category_name,
                    '2013':None,
                    '2014':None,
                    '2015':None,
                    '2016':None,
                    '2017':None,
                    '2018':None,
                    'total not':None,
                    'total':None,
                    '2013 %':None,
                    '2014 %':None,
                    '2015 %':None,
                    '2016 %':None,
                    '2017 %':None,
                    '2018 %':None,
                    'total not %':None
                    }, ignore_index=True)
    
    movies_in_category_name_total_released_india_no = len(master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again[category] == category_name) & 
                                                                            (master_list_df_reconstituted_again['released_in_india_2nd_check'] != 'yes')])

    movies_in_category_name_total = len(master_list_df_reconstituted_again.loc[master_list_df_reconstituted_again[category] == category_name])

    movies_in_category_name_total_released_india_no_percent = (movies_in_category_name_total_released_india_no/movies_in_category_name_total)*100
    movies_in_category_name_total_released_india_no_percent = round(movies_in_category_name_total_released_india_no_percent,1)
    
    df.loc[df['category'] == category_name, "total" ] = int(movies_in_category_name_total)
    df.loc[df['category'] == category_name, "total not" ] = int(movies_in_category_name_total_released_india_no)
    df.loc[df['category'] == category_name, 'total not %' ] = movies_in_category_name_total_released_india_no_percent

    for year in range(2013,2019):
        
        year_string = str(year)
        
        year_percent_string = str(year) + ' %'
        
        movies_in_category_name_released_india_no = len(master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again[category] == category_name)  &
                                                                            (master_list_df_reconstituted_again['released_in_india_2nd_check'] != 'yes') &
                                                                            (master_list_df_reconstituted_again['year'] == year)])

        movies_in_category_name = len(master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again[category] == category_name) &
                                                                                       (master_list_df_reconstituted_again['year'] == year) ])
        if movies_in_category_name != 0:
            movies_in_category_name_released_india_no_percent = (movies_in_category_name_released_india_no/movies_in_category_name)*100
            movies_in_category_name_released_india_no_percent = round(movies_in_category_name_released_india_no_percent,1)
            
            df.loc[df['category'] == category_name, year_string ] = movies_in_category_name_released_india_no
            df.loc[df['category'] == category_name, year_percent_string ] = movies_in_category_name_released_india_no_percent
            
df.sort_values('total not',ascending=False)

Unnamed: 0,category,2013,2014,2015,2016,2017,2018,total not,total,2013 %,2014 %,2015 %,2016 %,2017 %,2018 %,total not %
1,Original Screenplay,20.0,27.0,23.0,16.0,27.0,27.0,140,522,22.2,31.0,22.8,16.7,30.3,45.8,26.8
3,Based on Real Life Events,1.0,1.0,4.0,4.0,7.0,7.0,24,89,5.9,12.5,28.6,19.0,38.9,63.6,27.0
2,Based on Fiction Book/Short Story,4.0,3.0,2.0,1.0,7.0,5.0,22,157,16.7,10.0,8.0,3.1,26.9,25.0,14.0
8,Based on Factual Book/Article,1.0,3.0,4.0,3.0,4.0,1.0,16,62,14.3,25.0,26.7,30.0,28.6,25.0,25.8
15,Based on Play,3.0,0.0,2.0,0.0,,,5,12,60.0,0.0,66.7,0.0,,,41.7
14,Based on Short Film,0.0,1.0,1.0,1.0,,1.0,4,7,0.0,100.0,50.0,50.0,,100.0,57.1
4,Based on Comic/Graphic Novel,2.0,0.0,0.0,2.0,0.0,0.0,4,51,15.4,0.0,0.0,22.2,0.0,0.0,7.8
11,Based on Religious Text,,0.0,,0.0,0.0,2.0,2,7,,0.0,,0.0,0.0,100.0,28.6
18,Based on Song,,,,,,1.0,1,1,,,,,,100.0,100.0
5,Remake,0.0,0.0,0.0,0.0,0.0,1.0,1,26,0.0,0.0,0.0,0.0,0.0,25.0,3.8


In [8]:
category = 'mpaa_rating'

df = pd.DataFrame(columns=["category", '2013','2014','2015','2016','2017','2018','total not','total','2013 %','2014 %','2015 %','2016 %','2017 %','2018 %','total not %'])

category_name_list = master_list_df_reconstituted_again[category].unique()

for category_name in category_name_list:
    
    df = df.append({"category": category_name,
                    '2013':None,
                    '2014':None,
                    '2015':None,
                    '2016':None,
                    '2017':None,
                    '2018':None,
                    'total not':None,
                    'total':None,
                    '2013 %':None,
                    '2014 %':None,
                    '2015 %':None,
                    '2016 %':None,
                    '2017 %':None,
                    '2018 %':None,
                    'total not %':None
                    }, ignore_index=True)
    
    movies_in_category_name_total_released_india_no = len(master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again[category] == category_name) & #should I change this to str.contains()
                                                                            (master_list_df_reconstituted_again['released_in_india_2nd_check'] != 'yes')])

    movies_in_category_name_total = len(master_list_df_reconstituted_again.loc[master_list_df_reconstituted_again[category] == category_name])

    movies_in_category_name_total_released_india_no_percent = (movies_in_category_name_total_released_india_no/movies_in_category_name_total)*100
    movies_in_category_name_total_released_india_no_percent = round(movies_in_category_name_total_released_india_no_percent,1)

    df.loc[df['category'] == category_name, "total" ] = int(movies_in_category_name_total)
    df.loc[df['category'] == category_name, "total not" ] = int(movies_in_category_name_total_released_india_no)
    df.loc[df['category'] == category_name, 'total not %' ] = movies_in_category_name_total_released_india_no_percent

    for year in range(2013,2019):
        
        year_string = str(year)
        
        year_percent_string = str(year) + ' %'
        
        movies_in_category_name_released_india_no = len(master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again[category] == category_name)  &
                                                                            (master_list_df_reconstituted_again['released_in_india_2nd_check'] != 'yes') &
                                                                            (master_list_df_reconstituted_again['year'] == year)])

        movies_in_category_name = len(master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again[category] == category_name) &
                                                                                       (master_list_df_reconstituted_again['year'] == year) ])
        if movies_in_category_name != 0:
            movies_in_category_name_released_india_no_percent = (movies_in_category_name_released_india_no/movies_in_category_name)*100
            movies_in_category_name_released_india_no_percent = round(movies_in_category_name_released_india_no_percent,1)
            
            df.loc[df['category'] == category_name, year_string ] = movies_in_category_name_released_india_no
            df.loc[df['category'] == category_name, year_percent_string ] = movies_in_category_name_released_india_no_percent
            
df.sort_values('total not',ascending=False)

Unnamed: 0,category,2013,2014,2015,2016,2017,2018,total not,total,2013 %,2014 %,2015 %,2016 %,2017 %,2018 %,total not %
1,R,19.0,15.0,17,18.0,26,23.0,118,445,23.8,21.7,21.0,20.9,32.9,46.0,26.5
0,PG-13,11.0,17.0,12,5.0,7,12.0,64,406,15.9,22.4,16.9,6.4,10.6,26.1,15.8
2,PG,1.0,3.0,6,5.0,9,9.0,33,144,5.0,11.1,24.0,17.9,33.3,52.9,22.9
4,Not Rated,,,1,1.0,3,1.0,6,6,,,100.0,100.0,100.0,100.0,100.0
3,G,0.0,0.0,0,,0,,0,4,0.0,0.0,0.0,,0.0,,0.0


In [9]:
labels = ['50k-50m','50m-100m','100m-150m','150m-200m','200m-250m','250m-300m','300m-350m','350m-400m','400m-450m','450m-500m','500m-550m','550m-600m','600m-650m','650m-700m','700m-750m',
         '750m-800m','800m-850m','850m-900m','900m-950m']

category = 'domestic_box_office_bins'

df = pd.DataFrame(columns=["category", '2013','2014','2015','2016','2017','2018','total not','total','2013 %','2014 %','2015 %','2016 %','2017 %','2018 %','total not %'])

for category_name in labels:
    
    df = df.append({"category": category_name,
                    '2013': None,
                    '2014': None,
                    '2015':None,
                    '2016':None,
                    '2017':None,
                    '2018':None,
                    'total not':None,
                    'total':None,
                    '2013 %':None,
                    '2014 %':None,
                    '2015 %':None,
                    '2016 %':None,
                    '2017 %':None,
                    '2018 %':None,
                    'total not %':None
                    }, ignore_index=True)
    
    movies_in_category_name_total_released_india_no = len(master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again[category] == category_name) &
                                                                            (master_list_df_reconstituted_again['released_in_india_2nd_check'] != 'yes')])

    movies_in_category_name_total = len(master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again[category] == category_name)])
    
    if movies_in_category_name_total != 0:

        movies_in_category_name_total_released_india_no_percent = (movies_in_category_name_total_released_india_no/movies_in_category_name_total)*100
        movies_in_category_name_total_released_india_no_percent = round(movies_in_category_name_total_released_india_no_percent,1)

        df.loc[df['category'] == category_name, "total" ] = int(movies_in_category_name_total)
        df.loc[df['category'] == category_name, "total not" ] = int(movies_in_category_name_total_released_india_no)
        df.loc[df['category'] == category_name, 'total not %' ] = movies_in_category_name_total_released_india_no_percent

        for year in range(2013,2019):

            year_string = str(year)

            year_percent_string = str(year) + ' %'

            movies_in_category_name_released_india_no = len(master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again[category] == category_name)  &
                                                                                (master_list_df_reconstituted_again['released_in_india_2nd_check'] != 'yes') &
                                                                                (master_list_df_reconstituted_again['year'] == year)])

            movies_in_category_name = len(master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again[category] == category_name) &
                                                                                           (master_list_df_reconstituted_again['year'] == year) ])
            if movies_in_category_name != 0:
                movies_in_category_name_released_india_no_percent = (movies_in_category_name_released_india_no/movies_in_category_name)*100
                movies_in_category_name_released_india_no_percent = round(movies_in_category_name_released_india_no_percent,1)

                df.loc[df['category'] == category_name, year_string ] = movies_in_category_name_released_india_no
                df.loc[df['category'] == category_name, year_percent_string ] = movies_in_category_name_released_india_no_percent

df

Unnamed: 0,category,2013,2014,2015,2016,2017,2018,total not,total,2013 %,2014 %,2015 %,2016 %,2017 %,2018 %,total not %
0,50k-50m,29.0,33.0,34.0,28.0,43.0,42.0,209.0,662.0,28.4,30.8,27.4,21.9,35.2,53.2,31.6
1,50m-100m,2.0,1.0,1.0,1.0,0.0,2.0,7.0,164.0,6.1,3.0,3.8,2.9,0.0,12.5,4.3
2,100m-150m,0.0,1.0,0.0,0.0,2.0,1.0,4.0,69.0,0.0,7.7,0.0,0.0,16.7,12.5,5.8
3,150m-200m,0.0,0.0,1.0,0.0,0.0,0.0,1.0,40.0,0.0,0.0,8.3,0.0,0.0,0.0,2.5
4,200m-250m,0.0,0.0,0.0,0.0,0.0,0.0,0.0,25.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,250m-300m,0.0,0.0,0.0,0.0,0.0,,0.0,10.0,0.0,0.0,0.0,0.0,0.0,,0.0
6,300m-350m,,0.0,0.0,0.0,0.0,0.0,0.0,10.0,,0.0,0.0,0.0,0.0,0.0,0.0
7,350m-400m,0.0,0.0,0.0,0.0,0.0,,0.0,8.0,0.0,0.0,0.0,0.0,0.0,,0.0
8,400m-450m,0.0,,,0.0,0.0,0.0,0.0,7.0,0.0,,,0.0,0.0,0.0,0.0
9,450m-500m,,,0.0,0.0,,,0.0,2.0,,,0.0,0.0,,,0.0


In [82]:
master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again['released_in_india_2nd_check'] != 'yes') &
                                      (master_list_df_reconstituted_again['domestic_box_office_bins'] == '50m-100m')]

Unnamed: 0,number,year,title,domestic_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,source,...,creative_type,production_companies,production_countries,released_in_india_2nd_check,production_budget_bins,domestic_box_office_bins,director,genres_imdb_final,languages_imdb,id_imdb
657,1805,2015,War Room,67790117,3000000,2015-08-28,Sony Pictures,PG,,Original Screenplay,...,Contemporary Fiction,"Tri-Star Pictures,Faithstep Films,Provident Fi...",United States,no,100k-25m,50m-100m,Alex Kendrick,drama,English,3832914
796,5175,2018,I Can Only Imagine,83482352,7000000,2018-03-16,Roadside Attractions,PG,,Based on Song,...,Dramatization,"Lionsgate,Erwin Brothers Entertainment,South W...",United States,no,100k-25m,50m-100m,"Andrew Erwin,Jon Erwin","biography,drama,family",English,6450186
827,2952,2016,Neighbors 2: Sorority Rising,55340730,35000000,2016-05-20,Universal,R,Neighbors,Original Screenplay,...,Contemporary Fiction,"Point Grey,Good Universe,Universal Pictures,Pe...",United States,no,25m-50m,50m-100m,Nicholas Stoller,comedy,English,4438848
922,5340,2018,Overboard,50316123,12000000,2018-05-04,Lionsgate,PG-13,,Remake,...,Contemporary Fiction,"Pantelion Films,Metro-Goldwyn-Mayer Pictures,3...",United States,no,100k-25m,50m-100m,Rob Greenberg,"comedy,romance","English,Norwegian,Spanish,French",1563742
999,96,2013,Tyler Perry's Temptation,51975354,0,2013-03-29,Lionsgate,PG-13,,Based on Play,...,Contemporary Fiction,"Lionsgate,TPS Company",United States,no,no info,50m-100m,Tyler Perry,"drama,thriller",English,2070862
1013,97,2013,Tyler Perry's A Madea Christmas,52543354,25000000,2013-12-13,Lionsgate,PG-13,Madea,Based on Play,...,Contemporary Fiction,"Lionsgate,Tyler Perry Studios",United States,no,25m-50m,50m-100m,Tyler Perry,"comedy,drama",English,2609758
1033,867,2014,Think Like a Man Too,65028687,24000000,2014-06-20,Sony Pictures,PG-13,Think Like a Man,Based on Factual Book/Article,...,Contemporary Fiction,"Screen Gems,LStar Capital,Will Packer Productions",United States,no,100k-25m,50m-100m,"Tim Story,Keith Merryman","comedy,romance",English,2239832


In [4]:
master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again['released_in_india_2nd_check'] != 'yes') &
                                      (master_list_df_reconstituted_again['domestic_box_office_bins'] == '100m-150m')]

Unnamed: 0,number,year,title,domestic_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,source,...,creative_type,production_companies,production_countries,released_in_india_2nd_check,production_budget_bins,domestic_box_office_bins,director,genres_imdb_final,languages_imdb,id_imdb
636,3906,2017,Girls Trip,115108515,28000000,2017-07-21,Universal,R,,Original Screenplay,...,Contemporary Fiction,"Will Packer Productions,Universal Pictures,Per...",United States,no,25m-50m,100m-150m,Malcolm D. Lee,"comedy,drama",English,3564472
641,5119,2018,Fifty Shades Freed,100407760,55000000,2018-02-09,Universal,R,Fifty Shades of Grey,Based on Fiction Book/Short Story,...,Contemporary Fiction,"Michael De Luca,Universal Pictures,Perfect Wor...",United States,no,50m-75m,100m-150m,James Foley,"drama,romance,thriller",English,4477536
677,3880,2017,Fifty Shades Darker,114434010,55000000,2017-02-10,Universal,R,Fifty Shades of Grey,Based on Fiction Book/Short Story,...,Contemporary Fiction,"Universal Pictures,Perfect World Pictures,Mich...",United States,no,50m-75m,100m-150m,James Foley,"drama,romance",English,4465564
959,836,2014,Ride Along,134202565,25000000,2014-01-17,Universal,PG-13,Ride Along,Original Screenplay,...,Contemporary Fiction,"Cube Vision,Rainforest Films,Relativity Media",United States,no,25m-50m,100m-150m,Tim Story,"action,comedy,crime","English,Ukrainian",1408253


In [5]:
master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again['released_in_india_2nd_check'] != 'yes') &
                                      (master_list_df_reconstituted_again['domestic_box_office_bins'] == '150m-200m')]

Unnamed: 0,number,year,title,domestic_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,source,...,creative_type,production_companies,production_countries,released_in_india_2nd_check,production_budget_bins,domestic_box_office_bins,director,genres_imdb_final,languages_imdb,id_imdb
466,1746,2015,Fifty Shades of Grey,166167230,40000000,2015-02-13,Universal,R,Fifty Shades of Grey,Based on Fiction Book/Short Story,...,Contemporary Fiction,"Focus Features,Michael De Luca,Trigger Street ...",United States,no,25m-50m,150m-200m,Sam Taylor-Johnson,"drama,romance,thriller",English,2322441


In [84]:
#of all the movies where you have production budget information, find the ones with the biggest production budgets that weren't released in India

labels = ['no info','100k-25m','25m-50m','50m-75m','75m-100m','100m-125m','125m-150m','150m-175m','175m-200m','200m-225m','225m-250m','250m-275m', '275m-300m','300m-325m','325m-350m']

category = 'production_budget_bins'

df = pd.DataFrame(columns=["category", '2013','2014','2015','2016','2017','2018','total not','total','2013 %','2014 %','2015 %','2016 %','2017 %','2018 %','total not %'])

labels_culled = [x for x in labels if x != 'no info']

for category_name in labels_culled:
    
    df = df.append({"category": category_name,
                    '2013': None,
                    '2014': None,
                    '2015':None,
                    '2016':None,
                    '2017':None,
                    '2018':None,
                    'total not':None,
                    'total':None,
                    '2013 %':None,
                    '2014 %':None,
                    '2015 %':None,
                    '2016 %':None,
                    '2017 %':None,
                    '2018 %':None,
                    'total not %':None
                    }, ignore_index=True)
    
    movies_in_category_name_total_released_india_no = len(master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again[category] == category_name) &
                                                                            (master_list_df_reconstituted_again['released_in_india_2nd_check'] != 'yes')])

    movies_in_category_name_total = len(master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again[category] == category_name)])

    movies_in_category_name_total_released_india_no_percent = (movies_in_category_name_total_released_india_no/movies_in_category_name_total)*100
    movies_in_category_name_total_released_india_no_percent = round(movies_in_category_name_total_released_india_no_percent,1)

    df.loc[df['category'] == category_name, "total" ] = int(movies_in_category_name_total)
    df.loc[df['category'] == category_name, "total not" ] = int(movies_in_category_name_total_released_india_no)
    df.loc[df['category'] == category_name, 'total not %' ] = movies_in_category_name_total_released_india_no_percent

    for year in range(2013,2019):
        
        year_string = str(year)
        
        year_percent_string = str(year) + ' %'
        
        movies_in_category_name_released_india_no = len(master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again[category] == category_name)  &
                                                                            (master_list_df_reconstituted_again['released_in_india_2nd_check'] != 'yes') &
                                                                            (master_list_df_reconstituted_again['year'] == year)])

        movies_in_category_name = len(master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again[category] == category_name) &
                                                                                       (master_list_df_reconstituted_again['year'] == year) ])
        if movies_in_category_name != 0:
            movies_in_category_name_released_india_no_percent = (movies_in_category_name_released_india_no/movies_in_category_name)*100
            movies_in_category_name_released_india_no_percent = round(movies_in_category_name_released_india_no_percent,1)
            
            df.loc[df['category'] == category_name, year_string ] = movies_in_category_name_released_india_no
            df.loc[df['category'] == category_name, year_percent_string ] = movies_in_category_name_released_india_no_percent

df            
# df.sort_values('total not',ascending=False)

#you should have a caveat that these are all the movies that earned at least $500,000 at the US box office

Unnamed: 0,category,2013,2014,2015,2016,2017,2018,total not,total,2013 %,2014 %,2015 %,2016 %,2017 %,2018 %,total not %
0,100k-25m,12.0,17.0,9.0,14.0,9.0,7.0,68,311,20.3,26.6,17.0,18.9,22.0,35.0,21.9
1,25m-50m,5.0,1.0,1.0,2.0,2.0,1.0,12,162,14.3,4.0,2.9,6.5,8.7,7.7,7.4
2,50m-75m,0.0,0.0,0.0,0.0,1.0,1.0,2,93,0.0,0.0,0.0,0.0,8.3,12.5,2.2
3,75m-100m,0.0,0.0,0.0,0.0,0.0,0.0,0,36,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,100m-125m,0.0,0.0,0.0,0.0,0.0,0.0,0,35,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,125m-150m,0.0,0.0,0.0,0.0,0.0,0.0,0,30,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,150m-175m,0.0,0.0,0.0,0.0,0.0,0.0,0,32,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,175m-200m,0.0,0.0,0.0,0.0,0.0,,0,26,0.0,0.0,0.0,0.0,0.0,,0.0
8,200m-225m,0.0,0.0,0.0,0.0,0.0,0.0,0,13,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,225m-250m,0.0,,,,0.0,,0,2,0.0,,,,0.0,,0.0


In [85]:
master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again['released_in_india_2nd_check'] != 'yes') &
                                      (master_list_df_reconstituted_again['production_budget_bins'] == '25m-50m')]

Unnamed: 0,number,year,title,domestic_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,source,...,creative_type,production_companies,production_countries,released_in_india_2nd_check,production_budget_bins,domestic_box_office_bins,director,genres_imdb_final,languages_imdb,id_imdb
474,1746,2015,Fifty Shades of Grey,166167230,40000000,2015-02-13,Universal,R,Fifty Shades of Grey,Based on Fiction Book/Short Story,...,Contemporary Fiction,"Focus Features,Michael De Luca,Trigger Street ...",United States,no,25m-50m,150m-200m,Sam Taylor-Johnson,"drama,romance,thriller",English,2322441
649,3906,2017,Girls Trip,115108515,28000000,2017-07-21,Universal,R,,Original Screenplay,...,Contemporary Fiction,"Will Packer Productions,Universal Pictures,Per...",United States,no,25m-50m,100m-150m,Malcolm D. Lee,"comedy,drama",English,3564472
663,3996,2017,Tulip Fever,2455635,25000000,2017-09-01,Weinstein Co.,R,,Based on Fiction Book/Short Story,...,Historical Fiction,"Worldview Entertainment,Weinstein Company,Ruby...","United Kingdom,United States",no,25m-50m,50k-50m,Jason Chadwick,"drama,history,romance",English,491203
688,5261,2018,I Feel Pretty,48795601,32000000,2018-04-20,STX Entertainment,PG-13,,Original Screenplay,...,Contemporary Fiction,"Voltage Pictures,Wonderland Sound and Vision,H...",United States,no,25m-50m,50k-50m,"Abby Kohn,Marc Silverstein",comedy,English,6791096
702,133,2013,Paranoia,7388654,40000000,2013-08-16,Relativity,PG-13,,Original Screenplay,...,Contemporary Fiction,,United States,no,25m-50m,50k-50m,Robert Luketic,"drama,thriller",English,1413495
750,74,2013,Snowpiercer,4563029,40000000,2013-06-27,RADiUS-TWC,R,,Based on Comic/Graphic Novel,...,Science Fiction,"Moho Films,Opus Picture","Republic of Korea,United States",no,25m-50m,50k-50m,Joon-ho Bong,"action,drama,sci-fi","English,Korean,French,Japanese,Czech,German",1706620
827,2952,2016,Neighbors 2: Sorority Rising,55340730,35000000,2016-05-20,Universal,R,Neighbors,Original Screenplay,...,Contemporary Fiction,"Point Grey,Good Universe,Universal Pictures,Pe...",United States,no,25m-50m,50m-100m,Nicholas Stoller,comedy,English,4438848
836,3021,2016,The Infiltrator,15436808,47500000,2016-07-13,Broad Green Pictures,R,,Based on Factual Book/Article,...,Dramatization,"Bank Leumi,Lipsync Productions,Good Films",United States,no,25m-50m,50k-50m,Brad Furman,"biography,crime,drama","English,Spanish",1355631
872,128,2013,Dead Man Down,10895295,30000000,2013-03-08,FilmDistrict,R,,Original Screenplay,...,Contemporary Fiction,"FilmDistrict,IM Global,WWE Studios,Original Fi...",United States,no,25m-50m,50k-50m,Niels Arden Oplev,"action,crime,drama","English,French,Albanian,Spanish,Hungarian",2101341
921,99,2013,The Family,36918811,30000000,2013-09-13,Relativity,R,,Based on Fiction Book/Short Story,...,Contemporary Fiction,"Relativity Media,EuropaCorp,TF1 Film Productio...",United States,no,25m-50m,50k-50m,Luc Besson,"comedy,crime,thriller","English,French,Italian",2404311


In [86]:
master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again['released_in_india_2nd_check'] != 'yes') &
                                      (master_list_df_reconstituted_again['production_budget_bins'] == '50m-75m')]

Unnamed: 0,number,year,title,domestic_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,source,...,creative_type,production_companies,production_countries,released_in_india_2nd_check,production_budget_bins,domestic_box_office_bins,director,genres_imdb_final,languages_imdb,id_imdb
654,5119,2018,Fifty Shades Freed,100407760,55000000,2018-02-09,Universal,R,Fifty Shades of Grey,Based on Fiction Book/Short Story,...,Contemporary Fiction,"Michael De Luca,Universal Pictures,Perfect Wor...",United States,no,50m-75m,100m-150m,James Foley,"drama,romance,thriller",English,4477536
698,3880,2017,Fifty Shades Darker,114434010,55000000,2017-02-10,Universal,R,Fifty Shades of Grey,Based on Fiction Book/Short Story,...,Contemporary Fiction,"Universal Pictures,Perfect World Pictures,Mich...",United States,no,50m-75m,100m-150m,James Foley,"drama,romance",English,4465564


In [197]:
pd.set_option('display.max_rows', 500)

In [87]:
category = 'domestic_distributor'

df = pd.DataFrame(columns=["category", '2013','2014','2015','2016','2017','2018','total not','total','2013 %','2014 %','2015 %','2016 %','2017 %','2018 %','total not %'])

category_name_list = master_list_df_reconstituted_again[category].unique()
category_name_list = [x for x in category_name_list if str(x) != 'nan']

for category_name in category_name_list:
    
    df = df.append({"category": category_name,
                    '2013':None,
                    '2014':None,
                    '2015':None,
                    '2016':None,
                    '2017':None,
                    '2018':None,
                    'total not':None,
                    'total':None,
                    '2013 %':None,
                    '2014 %':None,
                    '2015 %':None,
                    '2016 %':None,
                    '2017 %':None,
                    '2018 %':None,
                    'total not %':None
                    }, ignore_index=True)
    
    movies_in_category_name_total_released_india_no = len(master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again[category] == category_name) & #.str.contains(category_name, na=False, regex=False) 
                                                                            (master_list_df_reconstituted_again['released_in_india_2nd_check'] != 'yes')])

    movies_in_category_name_total = len(master_list_df_reconstituted_again.loc[master_list_df_reconstituted_again[category] == category_name])

    movies_in_category_name_total_released_india_no_percent = (movies_in_category_name_total_released_india_no/movies_in_category_name_total)*100
    movies_in_category_name_total_released_india_no_percent = round(movies_in_category_name_total_released_india_no_percent,1)

    df.loc[df['category'] == category_name, "total" ] = int(movies_in_category_name_total)
    df.loc[df['category'] == category_name, "total not" ] = int(movies_in_category_name_total_released_india_no)
    df.loc[df['category'] == category_name, 'total not %' ] = movies_in_category_name_total_released_india_no_percent

    for year in range(2013,2019):
        
        year_string = str(year)
        
        year_percent_string = str(year) + ' %'
        
        movies_in_category_name_released_india_no = len(master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again[category] == category_name)  &
                                                                            (master_list_df_reconstituted_again['released_in_india_2nd_check'] != 'yes') &
                                                                            (master_list_df_reconstituted_again['year'] == year)])

        movies_in_category_name = len(master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again[category] == category_name) &
                                                                                       (master_list_df_reconstituted_again['year'] == year) ])
        if movies_in_category_name != 0:
            movies_in_category_name_released_india_no_percent = (movies_in_category_name_released_india_no/movies_in_category_name)*100
            movies_in_category_name_released_india_no_percent = round(movies_in_category_name_released_india_no_percent,1)
            
            df.loc[df['category'] == category_name, year_string ] = movies_in_category_name_released_india_no
            df.loc[df['category'] == category_name, year_percent_string ] = movies_in_category_name_released_india_no_percent
            
df.sort_values('total not',ascending=False)

Unnamed: 0,category,2013,2014,2015,2016,2017,2018,total not,total,2013 %,2014 %,2015 %,2016 %,2017 %,2018 %,total not %
9,Roadside Attractions,5,3,2,3,5,5,23,30,62.5,100,100,50,83.3,100,76.7
18,Sony Pictures Classics,0,2,4,4,4,4,18,35,0,50,57.1,57.1,57.1,80,51.4
7,Lionsgate,3,0,1,3,3,5,15,83,20,0,7.7,13.6,25,55.6,18.1
5,A24,2,1,1,1,3,3,11,30,66.7,25,25,16.7,33.3,75,36.7
30,Pure Flix Entertainment,,0,2,1,2,2,7,10,,0,100,50,66.7,100,70
8,Universal,0,1,1,1,2,2,7,87,0,6.7,5.3,5.9,15.4,20,8
11,Weinstein Co.,2,1,1,0,3,,7,31,25,11.1,12.5,0,75,,22.6
41,Freestyle Releasing,1,1,3,1,,,6,7,100,50,100,100,,,85.7
3,Sony Pictures,0,3,1,0,1,1,6,87,0,16.7,7.1,0,7.1,10,6.9
12,Focus Features,0,1,2,1,0,2,6,33,0,16.7,33.3,14.3,0,66.7,18.2


In [88]:
master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again['domestic_distributor'] == 'Roadside Attractions')&
                                  (master_list_df_reconstituted_again['released_in_india_2nd_check'] != 'yes')]

Unnamed: 0,number,year,title,domestic_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,source,...,creative_type,production_companies,production_countries,released_in_india_2nd_check,production_budget_bins,domestic_box_office_bins,director,genres_imdb_final,languages_imdb,id_imdb
648,1866,2015,Love & Mercy,12551031,0,2015-06-05,Roadside Attractions,PG-13,,Based on Real Life Events,...,Dramatization,"River Road Entertainment,Battle Mountain Films",United States,no,no info,50k-50m,Bill Pohlad,"biography,drama,music",English,903657
664,4026,2017,Wonderstruck,1033632,0,2017-10-20,Roadside Attractions,PG,,Based on Fiction Book/Short Story,...,Historical Fiction,"Killer Films,Amazon Studios",United States,no,no info,50k-50m,Todd Haynes,"drama,mystery","Spanish,English",5208216
675,153,2013,Stand Up Guys,3310031,0,2013-02-01,Roadside Attractions,R,,Original Screenplay,...,Contemporary Fiction,"Lionsgate,Sidney Kimmel Entertainment,Lakeshor...",United States,no,no info,50k-50m,Fisher Stevens,"comedy,crime,thriller","English,Belarusian",1389096
721,5349,2018,Beast,800365,0,2018-05-11,Roadside Attractions,R,,Original Screenplay,...,Contemporary Fiction,"30 West,Roadside Attractions,Film 4,BFI,Agile ...",United States,no,no info,50k-50m,Michael Pearce,"crime,drama,mystery",English,5628302
756,1951,2015,Where Hope Grows,1156000,0,2015-05-15,Roadside Attractions,PG-13,,Original Screenplay,...,Contemporary Fiction,"Godspeed Pictures,Attic Light Films,The Matadors",United States,no,no info,50k-50m,Chris Dowling,"drama,family",English,3200980
778,947,2014,Words and Pictures,2171257,0,2014-05-23,Roadside Attractions,PG-13,,Original Screenplay,...,Contemporary Fiction,"Latitude Productions,Lascaux Films",United States,no,no info,50k-50m,Fred Schepisi,"comedy,drama,romance",English,2380331
785,170,2013,Girl Most Likely,1378591,0,2013-07-19,Roadside Attractions,PG-13,,Original Screenplay,...,Contemporary Fiction,"Maven Pictures,Anonymous Content,Ambush Entert...",United States,no,no info,50k-50m,"Robert Pulcini,Shari Springer Berman",comedy,"English,French,Romanian,Mandarin",1698648
796,5175,2018,I Can Only Imagine,83482352,7000000,2018-03-16,Roadside Attractions,PG,,Based on Song,...,Dramatization,"Lionsgate,Erwin Brothers Entertainment,South W...",United States,no,100k-25m,50m-100m,"Andrew Erwin,Jon Erwin","biography,drama,family",English,6450186
810,3988,2017,Stronger,4211129,0,2017-09-22,Roadside Attractions,R,,Based on Factual Book/Article,...,Dramatization,"Lionsgate,Mandeville Films,Roadside Attraction...",United States,no,no info,50k-50m,David Gordon Green,"biography,drama",English,3881784
830,157,2013,Much Ado About Nothing,4328850,0,2013-06-07,Roadside Attractions,PG-13,,Based on Play,...,Contemporary Fiction,Bellwether Films,United States,no,no info,50k-50m,Joss Whedon,"comedy,drama,romance",English,2094064


In [89]:
master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again['domestic_distributor'] == 'Sony Pictures Classics')&
                                  (master_list_df_reconstituted_again['released_in_india_2nd_check'] != 'yes')]

Unnamed: 0,number,year,title,domestic_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,source,...,creative_type,production_companies,production_countries,released_in_india_2nd_check,production_budget_bins,domestic_box_office_bins,director,genres_imdb_final,languages_imdb,id_imdb
635,3062,2016,Miles Ahead,2610896,0,2016-04-01,Sony Pictures Classics,R,,Based on Real Life Events,...,Dramatization,"Bifrost Pictures,Sony Pictures Classics,Miles ...",United States,no,no info,50k-50m,Don Cheadle,"biography,drama,music",English,790770
638,4000,2017,Norman: The Moderate Rise and Tragic Fall of a...,3814868,0,2017-04-14,Sony Pictures Classics,R,,Original Screenplay,...,Contemporary Fiction,"Tadmor Group,Cold Iron Pictures,Blackbird,Movi...","Israel,United States",no,no info,50k-50m,Joseph Cedar,"drama,thriller","English,Hebrew",4191702
643,3137,2016,The Bronze,615816,3500000,2016-03-18,Sony Pictures Classics,R,,Original Screenplay,...,Contemporary Fiction,"Duplass Brothers,Stage 6 Films",United States,no,100k-25m,50k-50m,Bryan Buckley,"comedy,drama,sport",English,3859304
693,958,2014,Love is Strange,2262555,0,2014-08-22,Sony Pictures Classics,R,,Original Screenplay,...,Contemporary Fiction,"Parts and Labor,Faliro House Productions,Film ...",United States,no,no info,50k-50m,Ira Sachs,"drama,romance","English,Russian",2639344
727,5272,2018,The Rider,2389150,0,2018-04-13,Sony Pictures Classics,R,,Original Screenplay,...,Contemporary Fiction,"Highwayman Films,Caviar",United States,no,no info,50k-50m,Chloe Zhao,"drama,western",English,6217608
729,4081,2017,Brigsby Bear,515765,0,2017-07-28,Sony Pictures Classics,PG-13,,Original Screenplay,...,Contemporary Fiction,"YL Pictures,Lonely Island,3311 Productions,Lor...",United States,no,no info,50k-50m,Dave McCary,"comedy,drama",English,5805752
735,1850,2015,Irrational Man,4030360,0,2015-07-17,Sony Pictures Classics,R,,Original Screenplay,...,Contemporary Fiction,Perdido,United States,no,no info,50k-50m,Woody Allen,"comedy,drama",English,3715320
766,1879,2015,Grandma,6980524,0,2015-08-21,Sony Pictures Classics,R,,Original Screenplay,...,Contemporary Fiction,"1821 Pictures,Depth of Field",United States,no,no info,50k-50m,Paul Weitz,"comedy,drama",English,4270516
789,5443,2018,Boundaries,701828,0,2018-06-22,Sony Pictures Classics,R,,Original Screenplay,...,Contemporary Fiction,"Automatik,Oddfellows Entertainment,Stage 6 Fil...","Canada,United States",no,no info,50k-50m,Shana Feste,"comedy,drama",English,5686062
805,1929,2015,Infinitely Polar Bear,1430655,0,2015-06-19,Sony Pictures Classics,R,,Original Screenplay,...,Historical Fiction,"Paper Street Films,Park Pictures,Bad Robot,KGB...",United States,no,no info,50k-50m,Maya Forbes,"comedy,drama,romance",English,1969062


In [90]:
master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again['domestic_distributor'] == 'Lionsgate')&
                                  (master_list_df_reconstituted_again['released_in_india_2nd_check'] != 'yes')]

Unnamed: 0,number,year,title,domestic_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,source,...,creative_type,production_companies,production_countries,released_in_india_2nd_check,production_budget_bins,domestic_box_office_bins,director,genres_imdb_final,languages_imdb,id_imdb
658,3045,2016,Compadres,3127773,3000000,2016-04-22,Lionsgate,R,,Original Screenplay,...,Contemporary Fiction,"Pantelion Films,Lionsgate,Televisa Cine,Draco ...","Mexico,United States",no,100k-25m,50k-50m,Enrique Begne,"action,comedy,crime",English,3367294
741,1939,2015,Freeheld,546201,7000000,2015-10-02,Lionsgate,PG-13,,Based on Short Film,...,Dramatization,"Masproduction,Shiny Penny,Lieutenant Films,Hig...",United States,no,100k-25m,50k-50m,Peter Sollett,"biography,drama,romance",English,1658801
749,166,2013,Filly Brown,2850357,1250000,2013-04-19,Lionsgate,R,,Original Screenplay,...,Contemporary Fiction,Lionsgate,United States,no,100k-25m,50k-50m,"Youssef Delara,Michael D. Olmos","drama,music",English,1869425
764,3965,2017,The Glass Castle,17273059,0,2017-08-11,Lionsgate,PG-13,,Based on Factual Book/Article,...,Dramatization,"Lionsgate,Gil Netter Productions",United States,no,no info,50k-50m,Destin Daniel Cretton,"biography,drama",English,2378507
780,5179,2018,Acrimony,43549096,20000000,2018-03-30,Lionsgate,R,,Original Screenplay,...,Contemporary Fiction,"Tyler Perry Studios,Lionsgate",United States,no,100k-25m,50k-50m,Tyler Perry,thriller,English,6063050
794,5266,2018,Traffik,9186156,0,2018-04-20,Lionsgate,R,,Original Screenplay,...,Contemporary Fiction,"Hidden Empire Film Group,Third Eye Motion Pict...",United States,no,no info,50k-50m,Deon Taylor,thriller,English,5670152
829,3925,2017,How to Be a Latin Lover,32149404,10000000,2017-04-28,Lionsgate,PG-13,,Original Screenplay,...,Contemporary Fiction,"3Pas Studios,Pantelion Films,Lionsgate,Videocine",United States,no,100k-25m,50k-50m,Ken Marino,"comedy,drama","English,Spanish",4795124
831,3936,2017,Tyler Perry's Boo 2! A Madea Halloween,47319572,20000000,2017-10-20,Lionsgate,PG-13,Madea,Original Screenplay,...,Contemporary Fiction,"Lionsgate,Tyler Perry Studios",United States,no,100k-25m,50k-50m,Tyler Perry,"comedy,horror",English,6217804
839,3013,2016,Middle School: The Worst Years of My Life,20007149,8500000,2016-10-07,Lionsgate,PG,,Based on Fiction Book/Short Story,...,Kids Fiction,"CBS Films,Participant Media,James Patterson En...",United States,no,100k-25m,50k-50m,Steve Carr,"animation,comedy,family","English,Khmer",4981636
912,3036,2016,The Perfect Match,9669521,5000000,2016-03-11,Lionsgate,R,,Original Screenplay,...,Contemporary Fiction,"Lionsgate,Codeblack Films,Jorva Entertainment ...",United States,no,100k-25m,50k-50m,Bille Woodruff,"comedy,romance",English,4871980


In [39]:
domestic_distributor_list = list(master_list_df_reconstituted_again['domestic_distributor'].unique())
domestic_distributor_list = [x for x in domestic_distributor_list if str(x) != 'nan']


final_list = []
for domestic_distributor in domestic_distributor_list:
    list_to_add = [x.strip() for x in domestic_distributor.split(",")]
    final_list.extend(list_to_add)
    
final_list = list(set(final_list))
final_list = [x for x in final_list if x != 'LLC']
final_list = [x for x in final_list if x != 'Inc.']

with open('data/domestic_distributor_list.txt', 'w') as f:
    for item in final_list:
        f.write("%s\n" % item)

#went into the domestics distributor list text file and made some changes by hand

In [40]:
distributor_dict = {
    
    "disney": ["Walt Disney"],
    
    "time_warner": ["Warner Bros."],

    "21st_century_fox": ["Fox Searchlight","20th Century Fox"],
    
    "nbc_universal": ["Focus Features","Universal","Gramercy"],
    
    "sony": ["Sony Pictures Classics","Sony Pictures"],
    
    "viacom": ["Paramount Pictures"],
    
    "lionsgate": ["Roadside Attractions","Lionsgate","Codeblack Entertainment"]
}


In [41]:
#break-up of movies not released in India by us distributor (note that in some cases, the international or indian rights might be with another distributor)
#also that the domestic distributor may not be the one associated with the parent company , split deals are apparently commonplace too

import re

category = 'domestic_distributor'

df = pd.DataFrame(columns=["category", '2013','2014','2015','2016','2017','2018','total not','total','2013 %','2014 %','2015 %','2016 %','2017 %','2018 %','total not %'])

for key, value in distributor_dict.items():

    category_name = key

    division_list = value

    master_list_df_reconstituted_again_dupli = master_list_df_reconstituted_again

    df = df.append({"category": category_name,
                    '2013': None,
                    '2014': None,
                    '2015':None,
                    '2016':None,
                    '2017':None,
                    '2018':None,
                    'total not':None,
                    'total':None,
                    '2013 %':None,
                    '2014 %':None,
                    '2015 %':None,
                    '2016 %':None,
                    '2017 %':None,
                    '2018 %':None,
                    'total not %':None
                    }, ignore_index=True)

    movies_in_category_name_total_released_india_no = len(master_list_df_reconstituted_again_dupli.loc[(master_list_df_reconstituted_again_dupli[category].str.contains(('|'.join(division_list)), na=False, regex=True)) &
                                                                            (master_list_df_reconstituted_again_dupli['released_in_india_2nd_check'] != 'yes')])

    movies_in_category_name_total = len(master_list_df_reconstituted_again_dupli.loc[(master_list_df_reconstituted_again_dupli[category].str.contains(('|'.join(division_list)), na=False, regex=True))])

    movies_in_category_name_total_released_india_no_percent = (movies_in_category_name_total_released_india_no/movies_in_category_name_total)*100
    movies_in_category_name_total_released_india_no_percent = round(movies_in_category_name_total_released_india_no_percent,1)

    df.loc[df['category'] == category_name, "total" ] = int(movies_in_category_name_total)
    df.loc[df['category'] == category_name, "total not" ] = int(movies_in_category_name_total_released_india_no)
    df.loc[df['category'] == category_name, 'total not %' ] = movies_in_category_name_total_released_india_no_percent

    for year in range(2013,2019):

        year_string = str(year)

        year_percent_string = str(year) + ' %'

        movies_in_category_name_released_india_no = len(master_list_df_reconstituted_again_dupli.loc[(master_list_df_reconstituted_again_dupli[category].str.contains(('|'.join(division_list)), na=False, regex=True))  &
                                                                            (master_list_df_reconstituted_again_dupli['released_in_india_2nd_check'] != 'yes') &
                                                                            (master_list_df_reconstituted_again_dupli['year'] == year)])

        movies_in_category_name = len(master_list_df_reconstituted_again_dupli.loc[(master_list_df_reconstituted_again_dupli[category].str.contains(('|'.join(division_list)), na=False, regex=True)) &
                                                                                       (master_list_df_reconstituted_again_dupli['year'] == year) ])
        if movies_in_category_name != 0:
            movies_in_category_name_released_india_no_percent = (movies_in_category_name_released_india_no/movies_in_category_name)*100
            movies_in_category_name_released_india_no_percent = round(movies_in_category_name_released_india_no_percent,1)

            df.loc[df['category'] == category_name, year_string ] = movies_in_category_name_released_india_no
            df.loc[df['category'] == category_name, year_percent_string ] = movies_in_category_name_released_india_no_percent

df.sort_values('total not',ascending=False)

#you should have a caveat that these are all the movies that earned at least $500,000 at the US box office

Unnamed: 0,category,2013,2014,2015,2016,2017,2018,total not,total,2013 %,2014 %,2015 %,2016 %,2017 %,2018 %,total not %
6,lionsgate,8,6,3,6,8,10,41,118,34.8,33.3,17.6,21.4,44.4,71.4,34.7
4,sony,0,5,5,4,5,5,24,122,0.0,22.7,23.8,16.0,23.8,33.3,19.7
3,nbc_universal,0,2,3,2,2,4,13,121,0.0,9.5,12.0,8.0,11.1,30.8,10.7
2,21st_century_fox,2,1,1,0,1,1,6,109,9.5,4.8,5.0,0.0,4.5,14.3,5.5
5,viacom,0,2,1,0,0,0,3,61,0.0,18.2,10.0,0.0,0.0,0.0,4.9
1,time_warner,0,1,0,1,0,0,2,111,0.0,4.3,0.0,5.0,0.0,0.0,1.8
0,disney,0,0,0,0,0,0,0,57,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [42]:
division_list = distributor_dict['time_warner']

master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again['domestic_distributor'].str.contains(('|'.join(division_list)), na=False, regex=True)) &
                                                                            (master_list_df_reconstituted_again['released_in_india_2nd_check'] != 'yes')]

Unnamed: 0,number,year,title,domestic_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,source,...,creative_type,production_companies,production_countries,released_in_india_2nd_check,production_budget_bins,domestic_box_office_bins,director,genres_imdb_final,languages_imdb,id_imdb
827,3020,2016,Keanu,20591853,15000000,2016-04-29,Warner Bros.,R,,Original Screenplay,...,Contemporary Fiction,"New Line Cinema,RatPac Entertainment,Dune Ente...",United States,no,100k-25m,50k-50m,Peter Atencio,"action,comedy,drama",English,4139124
860,953,2014,The Good Lie,2722209,0,2014-10-03,Warner Bros.,PG-13,,Original Screenplay,...,Contemporary Fiction,"Black Label Media,Imagine Entertainment,Relian...",United States,no,no info,50k-50m,Philippe Falardeau,"biography,drama",English,2652092


In [43]:
division_list = distributor_dict['viacom']

master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again['domestic_distributor'].str.contains(('|'.join(division_list)), na=False, regex=True)) &
                                                                            (master_list_df_reconstituted_again['released_in_india_2nd_check'] != 'yes')]

Unnamed: 0,number,year,title,domestic_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,source,...,creative_type,production_companies,production_countries,released_in_india_2nd_check,production_budget_bins,domestic_box_office_bins,director,genres_imdb_final,languages_imdb,id_imdb
828,906,2014,Top Five,25317379,12000000,2014-12-12,Paramount Pictures,R,,Original Screenplay,...,Contemporary Fiction,IACF,United States,no,100k-25m,50k-50m,Chris Rock,"comedy,romance",English,2784678
865,1863,2015,Scouts Guide to the Zombie Apocalypse,3703046,15000000,2015-10-30,Paramount Pictures,R,,Original Screenplay,...,Fantasy,"Paramount Pictures,Broken Road",United States,no,100k-25m,50k-50m,Christopher Landon,"action,comedy,horror",English,1727776
946,968,2014,"Men, Women and Children",705908,16000000,2014-10-01,Paramount Pictures,R,,Original Screenplay,...,Contemporary Fiction,Right of Way Films,United States,no,100k-25m,50k-50m,Jason Reitman,"comedy,drama",English,3179568


In [44]:
division_list = distributor_dict['21st_century_fox']

master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again['domestic_distributor'].str.contains(('|'.join(division_list)), na=False, regex=True)) &
                                                                            (master_list_df_reconstituted_again['released_in_india_2nd_check'] != 'yes')]

Unnamed: 0,number,year,title,domestic_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,source,...,creative_type,production_companies,production_countries,released_in_india_2nd_check,production_budget_bins,domestic_box_office_bins,director,genres_imdb_final,languages_imdb,id_imdb
684,5263,2018,Super Troopers 2,30617396,0,2018-04-20,20th Century Fox,R,Super Troopers,Original Screenplay,...,Contemporary Fiction,"Broken Lizard Industries,Fox Searchlight Pictu...",United States,no,no info,50k-50m,Jay Chandrasekhar,"comedy,crime,mystery",English,859635
698,4059,2017,Patti Cake$,800148,1000000,2017-08-18,Fox Searchlight,R,,Original Screenplay,...,Contemporary Fiction,"TSG Entertainment,Fox Searchlight Pictures,RT ...",United States,no,100k-25m,50k-50m,Geremy Jasper,"drama,music",English,6288250
740,121,2013,Enough Said,17550872,8000000,2013-09-18,Fox Searchlight,PG-13,,Original Screenplay,...,Contemporary Fiction,"Likely Story,TSG Entertainment",United States,no,100k-25m,50k-50m,Nicole Holofcener,"comedy,drama,romance",English,2390361
758,1898,2015,Mistress America,2500431,0,2015-08-14,Fox Searchlight,R,,Original Screenplay,...,Contemporary Fiction,"Fox Searchlight Pictures,RT Features",United States,no,no info,50k-50m,Noah Baumbach,"comedy,drama",English,2872462
861,140,2013,Stoker,1703125,12000000,2013-03-01,Fox Searchlight,R,,Original Screenplay,...,Contemporary Fiction,"Scott Free Films,Indian Paintbrush","United Kingdom,United States",no,100k-25m,50k-50m,Park Chan-wook,"drama,thriller","English,French,Italian",1682180
967,921,2014,The Pyramid,2756333,0,2014-12-05,20th Century Fox,R,,Original Screenplay,...,Fantasy,"Fox International Productions,Silvatar Media",United States,no,no info,50k-50m,Gregory Levasseur,"action,adventure,horror",English,2799166


In [91]:
 production_company_dict = {
"20th Century Fox":["20th Century Fox Animation"],
"Bloom":["Jellyfish Bloom"],
"Film 4":["Film 44"],
# "Inc.":["3DTK Inc.","87Eleven Inc.","Dentsu Inc.","Scholastic Entertainment Inc."],
"KG":["KGB Media"],
# "LLC":["Corner Piece Capital LLC","Dreamline Pictures LLC","Greater Productions LLC","Hammond Entertainment LLC","LLC Primeredian Entertainment","Peak Distribution Partners LLC"],
"Riche":["Riche-Ludwig"],
"Sony Pictures":["Sony Pictures Animation","Sony Pictures Classics"]
}

In [89]:
pd.set_option('display.max_rows', 500)

In [92]:
#movies from which production company do we get to see the least in India?
import re

production_company_list = list(master_list_df_reconstituted_again['production_companies'].unique())
production_company_list = [x for x in production_company_list if str(x) != 'nan']
final_list = []
for production_company in production_company_list:
    list_to_add = [x.strip() for x in production_company.split(",")]
    final_list.extend(list_to_add)
    
final_list = list(set(final_list))
final_list = [x for x in final_list if x != 'LLC']
final_list = [x for x in final_list if x != 'Inc.']

category = 'production_companies'

df = pd.DataFrame(columns=["category", '2013','2014','2015','2016','2017','2018','total not','total','2013 %','2014 %','2015 %','2016 %','2017 %','2018 %','total not %'])

for category_name in final_list:

    if category_name in list(production_company_dict.keys()):
        avoid_list = production_company_dict[category_name]
        avoid_list = [re.escape(m) for m in avoid_list]
        master_list_df_reconstituted_again_dupli = master_list_df_reconstituted_again.loc[~(master_list_df_reconstituted_again['production_companies'].str.contains(('|'.join(avoid_list)),na=False,regex=True))] #regex true here
    else:
        master_list_df_reconstituted_again_dupli = master_list_df_reconstituted_again
    
    df = df.append({"category": category_name,
                    '2013': None,
                    '2014': None,
                    '2015':None,
                    '2016':None,
                    '2017':None,
                    '2018':None,
                    'total not':None,
                    'total':None,
                    '2013 %':None,
                    '2014 %':None,
                    '2015 %':None,
                    '2016 %':None,
                    '2017 %':None,
                    '2018 %':None,
                    'total not %':None
                    }, ignore_index=True)
    
    movies_in_category_name_total_released_india_no = len(master_list_df_reconstituted_again_dupli.loc[(master_list_df_reconstituted_again_dupli[category].str.contains(category_name, na=False, regex=False)) &
                                                                            (master_list_df_reconstituted_again_dupli['released_in_india_2nd_check'] != 'yes')])

    movies_in_category_name_total = len(master_list_df_reconstituted_again_dupli.loc[(master_list_df_reconstituted_again_dupli[category].str.contains(category_name, na=False, regex=False))])

    movies_in_category_name_total_released_india_no_percent = (movies_in_category_name_total_released_india_no/movies_in_category_name_total)*100
    movies_in_category_name_total_released_india_no_percent = round(movies_in_category_name_total_released_india_no_percent,1)

    df.loc[df['category'] == category_name, "total" ] = int(movies_in_category_name_total)
    df.loc[df['category'] == category_name, "total not" ] = int(movies_in_category_name_total_released_india_no)
    df.loc[df['category'] == category_name, 'total not %' ] = movies_in_category_name_total_released_india_no_percent

    for year in range(2013,2019):
        
        year_string = str(year)
        
        year_percent_string = str(year) + ' %'
        
        movies_in_category_name_released_india_no = len(master_list_df_reconstituted_again_dupli.loc[(master_list_df_reconstituted_again_dupli[category].str.contains(category_name, na=False, regex=False))  &
                                                                            (master_list_df_reconstituted_again_dupli['released_in_india_2nd_check'] != 'yes') &
                                                                            (master_list_df_reconstituted_again_dupli['year'] == year)])

        movies_in_category_name = len(master_list_df_reconstituted_again_dupli.loc[(master_list_df_reconstituted_again_dupli[category].str.contains(category_name, na=False, regex=False)) &
                                                                                       (master_list_df_reconstituted_again_dupli['year'] == year) ])
        if movies_in_category_name != 0:
            movies_in_category_name_released_india_no_percent = (movies_in_category_name_released_india_no/movies_in_category_name)*100
            movies_in_category_name_released_india_no_percent = round(movies_in_category_name_released_india_no_percent,1)
            
            df.loc[df['category'] == category_name, year_string ] = movies_in_category_name_released_india_no
            df.loc[df['category'] == category_name, year_percent_string ] = movies_in_category_name_released_india_no_percent
            
df.sort_values('total not',ascending=False)

#you should have a caveat that these are all the movies that earned at least $500,000 at the US box office


Unnamed: 0,category,2013,2014,2015,2016,2017,2018,total not,total,2013 %,2014 %,2015 %,2016 %,2017 %,2018 %,total not %
1232,Lionsgate,5,0,0,2,5,2,14,45,45.5,0,0,15.4,71.4,50,31.1
421,Pure Flix,,1,3,1,2,2,9,11,,50,100,50,100,100,81.8
1108,IM Global,3,0,0,3,1,,7,17,50,0,0,60,50,,41.2
478,Roadside Attractions,1,,,2,2,1,6,8,50,,,66.7,100,100,75
463,Perfect World Pictures,,,,1,2,2,5,16,,,,16.7,40,40,31.2
36,Duplass Brothers,,2,2,1,,,5,5,,100,100,100,,,100
281,LD Entertainment,1,,,0,0,4,5,9,100,,,0,0,100,55.6
739,Universal Pictures,0,,0,1,2,2,5,52,0,,0,6.2,20,22.2,9.6
440,Anonymous Content,1,0,1,0,,2,4,11,50,0,33.3,0,,100,36.4
865,Affirm Films,,1,1,0,1,1,4,6,,100,100,0,100,100,66.7


In [93]:
master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again['production_companies'].str.contains('Lionsgate', na=False, regex=False))&
                                  (master_list_df_reconstituted_again['released_in_india_2nd_check'] != 'yes')]

Unnamed: 0,number,year,title,domestic_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,source,...,creative_type,production_companies,production_countries,released_in_india_2nd_check,production_budget_bins,domestic_box_office_bins,director,genres_imdb_final,languages_imdb,id_imdb
658,3045,2016,Compadres,3127773,3000000,2016-04-22,Lionsgate,R,,Original Screenplay,...,Contemporary Fiction,"Pantelion Films,Lionsgate,Televisa Cine,Draco ...","Mexico,United States",no,100k-25m,50k-50m,Enrique Begne,"action,comedy,crime",English,3367294
675,153,2013,Stand Up Guys,3310031,0,2013-02-01,Roadside Attractions,R,,Original Screenplay,...,Contemporary Fiction,"Lionsgate,Sidney Kimmel Entertainment,Lakeshor...",United States,no,no info,50k-50m,Fisher Stevens,"comedy,crime,thriller","English,Belarusian",1389096
712,4017,2017,Film Stars Don't Die in Liverpool,1024266,0,,Adler & Associates,R,,Based on Real Life Events,...,Dramatization,"Eon Productions,IM Global,Lionsgate,Synchronis...","United Kingdom,United States",no,no info,50k-50m,Paul McGuigan,"biography,drama,romance",English,5711148
749,166,2013,Filly Brown,2850357,1250000,2013-04-19,Lionsgate,R,,Original Screenplay,...,Contemporary Fiction,Lionsgate,United States,no,100k-25m,50k-50m,"Youssef Delara,Michael D. Olmos","drama,music",English,1869425
764,3965,2017,The Glass Castle,17273059,0,2017-08-11,Lionsgate,PG-13,,Based on Factual Book/Article,...,Dramatization,"Lionsgate,Gil Netter Productions",United States,no,no info,50k-50m,Destin Daniel Cretton,"biography,drama",English,2378507
780,5179,2018,Acrimony,43549096,20000000,2018-03-30,Lionsgate,R,,Original Screenplay,...,Contemporary Fiction,"Tyler Perry Studios,Lionsgate",United States,no,100k-25m,50k-50m,Tyler Perry,thriller,English,6063050
796,5175,2018,I Can Only Imagine,83482352,7000000,2018-03-16,Roadside Attractions,PG,,Based on Song,...,Dramatization,"Lionsgate,Erwin Brothers Entertainment,South W...",United States,no,100k-25m,50m-100m,"Andrew Erwin,Jon Erwin","biography,drama,family",English,6450186
810,3988,2017,Stronger,4211129,0,2017-09-22,Roadside Attractions,R,,Based on Factual Book/Article,...,Dramatization,"Lionsgate,Mandeville Films,Roadside Attraction...",United States,no,no info,50k-50m,David Gordon Green,"biography,drama",English,3881784
829,3925,2017,How to Be a Latin Lover,32149404,10000000,2017-04-28,Lionsgate,PG-13,,Original Screenplay,...,Contemporary Fiction,"3Pas Studios,Pantelion Films,Lionsgate,Videocine",United States,no,100k-25m,50k-50m,Ken Marino,"comedy,drama","English,Spanish",4795124
831,3936,2017,Tyler Perry's Boo 2! A Madea Halloween,47319572,20000000,2017-10-20,Lionsgate,PG-13,Madea,Original Screenplay,...,Contemporary Fiction,"Lionsgate,Tyler Perry Studios",United States,no,100k-25m,50k-50m,Tyler Perry,"comedy,horror",English,6217804


In [94]:
master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again['production_companies'].str.contains('Pure Flix', na=False, regex=False))&
                                  (master_list_df_reconstituted_again['released_in_india_2nd_check'] != 'yes')]

Unnamed: 0,number,year,title,domestic_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,source,...,creative_type,production_companies,production_countries,released_in_india_2nd_check,production_budget_bins,domestic_box_office_bins,director,genres_imdb_final,languages_imdb,id_imdb
639,1955,2015,Faith of Our Fathers,1004105,0,2015-07-01,Pure Flix / Samuel Goldwyn Films,PG-13,,Original Screenplay,...,Contemporary Fiction,"Downes Brothers Productions,Pure Flix,Oak Wate...",United States,no,no info,50k-50m,Carey Scott,drama,English,1322393
644,3968,2017,The Case for Christ,14678714,0,2017-04-07,Pure Flix Entertainment,PG,,Based on Real Life Events,...,Dramatization,"Pure Flix,Triple Horse Studios",United States,no,no info,50k-50m,Jon Gunn,"biography,drama,history",English,6113488
650,5125,2018,Samson,4719928,0,2018-02-16,Pure Flix Entertainment,PG-13,,Based on Religious Text,...,Dramatization,"Pure Flix,Boomtown Films","South Africa,United States",no,no info,50k-50m,Bruce Macdonald,"action,drama",English,6951892
701,1868,2015,Do You Believe?,12985600,2300000,2015-03-20,Pure Flix Entertainment,PG-13,,Original Screenplay,...,Contemporary Fiction,"Pure Flix,10 West Studios,Believe Entertainment",United States,no,100k-25m,50k-50m,Jonathan M. Gunn,drama,English,4056738
732,5189,2018,God's Not Dead: A Light in Darkness,5728940,0,2018-03-30,Pure Flix Entertainment,PG,God's Not Dead,Original Screenplay,...,Contemporary Fiction,"Pure Flix,GND Media Group",United States,no,no info,50k-50m,Michael Mason,drama,English,6652708
840,928,2014,Moms' Night Out,10429707,5000000,2014-05-09,Sony Pictures,PG,,Original Screenplay,...,Contemporary Fiction,"Tri-Star Pictures,Affirm Films,Provident Films...",United States,no,100k-25m,50k-50m,"Andrew Erwin,Jon Erwin",comedy,English,3014666
846,4029,2017,A Question of Faith,2587072,0,2017-09-29,Pure Flix Entertainment,PG,,Original Screenplay,...,Contemporary Fiction,"Pure Flix,Silver Linings Entertainment",United States,no,no info,50k-50m,Kevan Otto,drama,English,6054650
899,3089,2016,I'm Not Ashamed,2082980,1500000,2016-10-21,Pure Flix Entertainment,PG-13,,Based on Real Life Events,...,Dramatization,"Pure Flix,Visible Pictures",United States,no,100k-25m,50k-50m,Brian Baugh,"biography,drama",English,4950110
1006,1867,2015,Woodlawn,14394097,13000000,2015-10-16,Pure Flix Entertainment,PG,,Based on Real Life Events,...,Dramatization,"Pure Flix,Provident Films",United States,no,100k-25m,50k-50m,"The Erwin Brothers,The Erwin Brothers","drama,sport",English,4183692


In [95]:
master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again['production_companies'].str.contains('IM Global', na=False, regex=False))&
                                  (master_list_df_reconstituted_again['released_in_india_2nd_check'] != 'yes')]

Unnamed: 0,number,year,title,domestic_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,source,...,creative_type,production_companies,production_countries,released_in_india_2nd_check,production_budget_bins,domestic_box_office_bins,director,genres_imdb_final,languages_imdb,id_imdb
635,3062,2016,Miles Ahead,2610896,0,2016-04-01,Sony Pictures Classics,R,,Based on Real Life Events,...,Dramatization,"Bifrost Pictures,Sony Pictures Classics,Miles ...",United States,no,no info,50k-50m,Don Cheadle,"biography,drama,music",English,790770
666,3016,2016,Fifty Shades of Black,11686940,5000000,2016-01-29,Open Road,R,,Based on Movie,...,Contemporary Fiction,"IM Global,Open Road Films,Baby Way",United States,no,100k-25m,50k-50m,Michael Tiddes,comedy,English,4667094
696,188,2013,The Lords of Salem,1165881,1500000,2013-04-19,Anchor Bay Entertainment,R,,Original Screenplay,...,Fantasy,"Haunted Movie,IM Global","Canada,United Kingdom,United States",no,100k-25m,50k-50m,Rob Zombie,"horror,thriller",English,1731697
712,4017,2017,Film Stars Don't Die in Liverpool,1024266,0,,Adler & Associates,R,,Based on Real Life Events,...,Dramatization,"Eon Productions,IM Global,Lionsgate,Synchronis...","United Kingdom,United States",no,no info,50k-50m,Paul McGuigan,"biography,drama,romance",English,5711148
860,94,2013,A Haunted House,40041683,2500000,2013-01-11,Open Road,R,A Haunted House,Original Screenplay,...,Multiple Creative Types,"Open Road Films,IM Global,Endgame Entertainmen...",United States,no,100k-25m,50k-50m,Michel Tiddes,"comedy,fantasy","English,Spanish",2243537
872,128,2013,Dead Man Down,10895295,30000000,2013-03-08,FilmDistrict,R,,Original Screenplay,...,Contemporary Fiction,"FilmDistrict,IM Global,WWE Studios,Original Fi...",United States,no,25m-50m,50k-50m,Niels Arden Oplev,"action,crime,drama","English,French,Albanian,Spanish,Hungarian",2101341
967,3051,2016,Southside with You,6304223,0,2016-08-26,Roadside Attractions,PG-13,,Original Screenplay,...,Dramatization,"IM Global,Miramax Films,Roadside Attractions",United States,no,no info,50k-50m,Richard Tanne,"biography,drama,history",English,4258698


In [91]:
# production_company_list = list(master_list_df_reconstituted_again['production_companies'].unique())
# production_company_list = [x for x in production_company_list if (str(x) != 'nan') ]
# final_list = []
# for production_company in production_company_list:
#     list_to_add = [x.strip() for x in production_company.split(",")]
#     final_list.extend(list_to_add)
    
# final_list = list(set(final_list))
# final_list = [x for x in final_list if x != 'LLC']
# final_list = [x for x in final_list if x != 'Inc.']

# #movies from which production company do we get to see the least in India?

# category = 'production_companies'

# df = pd.DataFrame(columns=["category", '2013','2014','2015','2016','2017','2018','total not','total','2013 %','2014 %','2015 %','2016 %','2017 %','2018 %','total not %'])

# for category_name in final_list:
    
#     df = df.append({"category": category_name,
#                     '2013': None,
#                     '2014': None,
#                     '2015':None,
#                     '2016':None,
#                     '2017':None,
#                     '2018':None,
#                     'total not':None,
#                     'total':None,
#                     '2013 %':None,
#                     '2014 %':None,
#                     '2015 %':None,
#                     '2016 %':None,
#                     '2017 %':None,
#                     '2018 %':None,
#                     'total not %':None
#                     }, ignore_index=True)
    
#     movies_in_category_name_total_released_india_no = len(master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again[category].str.contains(category_name, na=False, regex=False)) &
#                                                                             (master_list_df_reconstituted_again['released_in_india_2nd_check'] != 'yes')])

#     movies_in_category_name_total = len(master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again[category].str.contains(category_name, na=False, regex=False))])

#     movies_in_category_name_total_released_india_no_percent = (movies_in_category_name_total_released_india_no/movies_in_category_name_total)*100
#     movies_in_category_name_total_released_india_no_percent = round(movies_in_category_name_total_released_india_no_percent,1)

#     df.loc[df['category'] == category_name, "total" ] = int(movies_in_category_name_total)
#     df.loc[df['category'] == category_name, "total not" ] = int(movies_in_category_name_total_released_india_no)
#     df.loc[df['category'] == category_name, 'total not %' ] = movies_in_category_name_total_released_india_no_percent

#     for year in range(2013,2019):
        
#         year_string = str(year)
        
#         year_percent_string = str(year) + ' %'
        
#         movies_in_category_name_released_india_no = len(master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again[category].str.contains(category_name, na=False, regex=False))  &
#                                                                             (master_list_df_reconstituted_again['released_in_india_2nd_check'] != 'yes') &
#                                                                             (master_list_df_reconstituted_again['year'] == year)])

#         movies_in_category_name = len(master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again[category].str.contains(category_name, na=False, regex=False)) &
#                                                                                        (master_list_df_reconstituted_again['year'] == year) ])
#         if movies_in_category_name != 0:
#             movies_in_category_name_released_india_no_percent = (movies_in_category_name_released_india_no/movies_in_category_name)*100
#             movies_in_category_name_released_india_no_percent = round(movies_in_category_name_released_india_no_percent,1)
            
#             df.loc[df['category'] == category_name, year_string ] = movies_in_category_name_released_india_no
#             df.loc[df['category'] == category_name, year_percent_string ] = movies_in_category_name_released_india_no_percent
            
# df.sort_values('total not',ascending=False)

# #you should have a caveat that these are all the movies that earned at least $500,000 at the US box office

In [11]:
studio_dict = {
    
    "disney": ["Walt Disney Pictures","Walt Disney Animation Studios","Disney Nature","DisneyToon Studios","Disney-Pixar","Marvel Studios","Lucasfilm","Touchstone Pictures","A&E Indiefilms"],
    
    "time_warner": ["Warner Bros.","Warner Animation Group","New Line Cinema","DC Films","DC Comics","Castle Rock Entertainment","CNN Films","HBO Films","HBO Documentary Films","Flagship Entertainment"],

    "21st_century_fox": ["Fox Searchlight Pictures","Twentieth Century Fox","20th Century Fox Animation","Fox International Productions","20th Century Fox","Fox 2000 Pictures","Blue Sky Studios",
                        "Regency Enterprises","New Regency"],
    
    "nbc_universal": ["Focus Features","Universal Pictures","Gramercy Pictures","Working Title Films","Bullwinkle Studios","DreamWorks Animation","Illumination Entertainment","Focus World","Oriental DreamWorks"],
    
    "sony": ["Sony Pictures Classics","Columbia Pictures","Sony Pictures","Sony Pictures Animation","Screen Gems","Affirm Films","Stage 6 Films","Tri-Star Pictures"],
    
    "viacom": ["Paramount Vantage","Paramount Pictures","Nickelodeon Films","MTV Films","Awesomeness Films","BET Films"],
    
    "lionsgate": ["Roadside Attractions","Lionsgate","Codeblack Films","Good Universe","Pantelion Films","Summit Entertainment","Summit Premiere"]
}

#the ownership of some of these studios have changed hands over the years, sticking to how the situation is now for simplicity's sake
#dreamworks pictures is an independent production company
#also allowing for studios where the parent company has a minority stake

In [97]:
#break-up of movies not released in India by conglomerate

import re

category = 'production_companies'

df = pd.DataFrame(columns=["category", '2013','2014','2015','2016','2017','2018','total not','total','2013 %','2014 %','2015 %','2016 %','2017 %','2018 %','total not %'])

for key, value in studio_dict.items():

    category_name = key

    division_list = value

    master_list_df_reconstituted_again_dupli = master_list_df_reconstituted_again

    df = df.append({"category": category_name,
                    '2013': None,
                    '2014': None,
                    '2015':None,
                    '2016':None,
                    '2017':None,
                    '2018':None,
                    'total not':None,
                    'total':None,
                    '2013 %':None,
                    '2014 %':None,
                    '2015 %':None,
                    '2016 %':None,
                    '2017 %':None,
                    '2018 %':None,
                    'total not %':None
                    }, ignore_index=True)

    movies_in_category_name_total_released_india_no = len(master_list_df_reconstituted_again_dupli.loc[(master_list_df_reconstituted_again_dupli[category].str.contains(('|'.join(division_list)), na=False, regex=True)) &
                                                                            (master_list_df_reconstituted_again_dupli['released_in_india_2nd_check'] != 'yes')])

    movies_in_category_name_total = len(master_list_df_reconstituted_again_dupli.loc[(master_list_df_reconstituted_again_dupli[category].str.contains(('|'.join(division_list)), na=False, regex=True))])

    movies_in_category_name_total_released_india_no_percent = (movies_in_category_name_total_released_india_no/movies_in_category_name_total)*100
    movies_in_category_name_total_released_india_no_percent = round(movies_in_category_name_total_released_india_no_percent,1)

    df.loc[df['category'] == category_name, "total" ] = int(movies_in_category_name_total)
    df.loc[df['category'] == category_name, "total not" ] = int(movies_in_category_name_total_released_india_no)
    df.loc[df['category'] == category_name, 'total not %' ] = movies_in_category_name_total_released_india_no_percent

    for year in range(2013,2019):

        year_string = str(year)

        year_percent_string = str(year) + ' %'

        movies_in_category_name_released_india_no = len(master_list_df_reconstituted_again_dupli.loc[(master_list_df_reconstituted_again_dupli[category].str.contains(('|'.join(division_list)), na=False, regex=True))  &
                                                                            (master_list_df_reconstituted_again_dupli['released_in_india_2nd_check'] != 'yes') &
                                                                            (master_list_df_reconstituted_again_dupli['year'] == year)])

        movies_in_category_name = len(master_list_df_reconstituted_again_dupli.loc[(master_list_df_reconstituted_again_dupli[category].str.contains(('|'.join(division_list)), na=False, regex=True)) &
                                                                                       (master_list_df_reconstituted_again_dupli['year'] == year) ])
        if movies_in_category_name != 0:
            movies_in_category_name_released_india_no_percent = (movies_in_category_name_released_india_no/movies_in_category_name)*100
            movies_in_category_name_released_india_no_percent = round(movies_in_category_name_released_india_no_percent,1)

            df.loc[df['category'] == category_name, year_string ] = movies_in_category_name_released_india_no
            df.loc[df['category'] == category_name, year_percent_string ] = movies_in_category_name_released_india_no_percent

df.sort_values('total not',ascending=False)

#you should have a caveat that these are all the movies that earned at least $500,000 at the US box office

Unnamed: 0,category,2013,2014,2015,2016,2017,2018,total not,total,2013 %,2014 %,2015 %,2016 %,2017 %,2018 %,total not %
6,lionsgate,5,0,0,6,6,7,24,88,27.8,0.0,0.0,22.2,46.2,58.3,27.3
4,sony,0,2,2,3,3,2,12,81,0.0,33.3,25.0,13.6,17.6,16.7,14.8
3,nbc_universal,0,0,2,2,2,2,8,88,0.0,0.0,15.4,7.7,10.5,20.0,9.1
1,time_warner,0,1,0,2,1,0,4,81,0.0,25.0,0.0,9.1,6.7,0.0,4.9
2,21st_century_fox,0,1,1,0,1,1,4,72,0.0,12.5,5.9,0.0,5.3,20.0,5.6
5,viacom,0,1,1,1,0,0,3,47,0.0,50.0,14.3,6.2,0.0,0.0,6.4
0,disney,0,0,0,0,0,0,0,52,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [98]:
division_list = studio_dict['lionsgate']

master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again['production_companies'].str.contains(('|'.join(division_list)), na=False, regex=True)) &
                                                                            (master_list_df_reconstituted_again['released_in_india_2nd_check'] != 'yes')]

Unnamed: 0,number,year,title,domestic_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,source,...,creative_type,production_companies,production_countries,released_in_india_2nd_check,production_budget_bins,domestic_box_office_bins,director,genres_imdb_final,languages_imdb,id_imdb
658,3045,2016,Compadres,3127773,3000000,2016-04-22,Lionsgate,R,,Original Screenplay,...,Contemporary Fiction,"Pantelion Films,Lionsgate,Televisa Cine,Draco ...","Mexico,United States",no,100k-25m,50k-50m,Enrique Begne,"action,comedy,crime",English,3367294
675,153,2013,Stand Up Guys,3310031,0,2013-02-01,Roadside Attractions,R,,Original Screenplay,...,Contemporary Fiction,"Lionsgate,Sidney Kimmel Entertainment,Lakeshor...",United States,no,no info,50k-50m,Fisher Stevens,"comedy,crime,thriller","English,Belarusian",1389096
712,4017,2017,Film Stars Don't Die in Liverpool,1024266,0,,Adler & Associates,R,,Based on Real Life Events,...,Dramatization,"Eon Productions,IM Global,Lionsgate,Synchronis...","United Kingdom,United States",no,no info,50k-50m,Paul McGuigan,"biography,drama,romance",English,5711148
721,5349,2018,Beast,800365,0,2018-05-11,Roadside Attractions,R,,Original Screenplay,...,Contemporary Fiction,"30 West,Roadside Attractions,Film 4,BFI,Agile ...",United States,no,no info,50k-50m,Michael Pearce,"crime,drama,mystery",English,5628302
749,166,2013,Filly Brown,2850357,1250000,2013-04-19,Lionsgate,R,,Original Screenplay,...,Contemporary Fiction,Lionsgate,United States,no,100k-25m,50k-50m,"Youssef Delara,Michael D. Olmos","drama,music",English,1869425
764,3965,2017,The Glass Castle,17273059,0,2017-08-11,Lionsgate,PG-13,,Based on Factual Book/Article,...,Dramatization,"Lionsgate,Gil Netter Productions",United States,no,no info,50k-50m,Destin Daniel Cretton,"biography,drama",English,2378507
780,5179,2018,Acrimony,43549096,20000000,2018-03-30,Lionsgate,R,,Original Screenplay,...,Contemporary Fiction,"Tyler Perry Studios,Lionsgate",United States,no,100k-25m,50k-50m,Tyler Perry,thriller,English,6063050
794,5266,2018,Traffik,9186156,0,2018-04-20,Lionsgate,R,,Original Screenplay,...,Contemporary Fiction,"Hidden Empire Film Group,Third Eye Motion Pict...",United States,no,no info,50k-50m,Deon Taylor,thriller,English,5670152
796,5175,2018,I Can Only Imagine,83482352,7000000,2018-03-16,Roadside Attractions,PG,,Based on Song,...,Dramatization,"Lionsgate,Erwin Brothers Entertainment,South W...",United States,no,100k-25m,50m-100m,"Andrew Erwin,Jon Erwin","biography,drama,family",English,6450186
810,3988,2017,Stronger,4211129,0,2017-09-22,Roadside Attractions,R,,Based on Factual Book/Article,...,Dramatization,"Lionsgate,Mandeville Films,Roadside Attraction...",United States,no,no info,50k-50m,David Gordon Green,"biography,drama",English,3881784


In [99]:
division_list = studio_dict['sony']

master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again['production_companies'].str.contains(('|'.join(division_list)), na=False, regex=True)) &
                                                                            (master_list_df_reconstituted_again['released_in_india_2nd_check'] != 'yes')]

Unnamed: 0,number,year,title,domestic_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,source,...,creative_type,production_companies,production_countries,released_in_india_2nd_check,production_budget_bins,domestic_box_office_bins,director,genres_imdb_final,languages_imdb,id_imdb
635,3062,2016,Miles Ahead,2610896,0,2016-04-01,Sony Pictures Classics,R,,Based on Real Life Events,...,Dramatization,"Bifrost Pictures,Sony Pictures Classics,Miles ...",United States,no,no info,50k-50m,Don Cheadle,"biography,drama,music",English,790770
643,3137,2016,The Bronze,615816,3500000,2016-03-18,Sony Pictures Classics,R,,Original Screenplay,...,Contemporary Fiction,"Duplass Brothers,Stage 6 Films",United States,no,100k-25m,50k-50m,Bryan Buckley,"comedy,drama,sport",English,3859304
657,1805,2015,War Room,67790117,3000000,2015-08-28,Sony Pictures,PG,,Original Screenplay,...,Contemporary Fiction,"Tri-Star Pictures,Faithstep Films,Provident Fi...",United States,no,100k-25m,50m-100m,Alex Kendrick,drama,English,3832914
692,4002,2017,All Saints,5802208,2000000,2017-08-25,Sony Pictures,PG,,Based on Real Life Events,...,Dramatization,"Affirm Films,Provident Films",United States,no,100k-25m,50k-50m,Steve Gomer,drama,English,4663548
699,4043,2017,Professor Marston & The Wonder Women,1585362,0,2017-10-13,Annapurna Pictures,R,,Based on Real Life Events,...,Dramatization,"Annapurna Pictures,Stage 6 Films,Topple Entert...",United States,no,no info,50k-50m,Angela Robinson,"biography,drama",English,6133130
740,5185,2018,"Paul, Apostle of Christ",17547999,5000000,2018-03-23,Sony Pictures,PG-13,,Based on Religious Text,...,Dramatization,"Affirm Films,Giving Films,Outside Da Box Films...",United States,no,100k-25m,50k-50m,Andrew Hyatt,"drama,history",English,7388562
789,5443,2018,Boundaries,701828,0,2018-06-22,Sony Pictures Classics,R,,Original Screenplay,...,Contemporary Fiction,"Automatik,Oddfellows Entertainment,Stage 6 Fil...","Canada,United States",no,no info,50k-50m,Shana Feste,"comedy,drama",English,5686062
840,928,2014,Moms' Night Out,10429707,5000000,2014-05-09,Sony Pictures,PG,,Original Screenplay,...,Contemporary Fiction,"Tri-Star Pictures,Affirm Films,Provident Films...",United States,no,100k-25m,50k-50m,"Andrew Erwin,Jon Erwin",comedy,English,3014666
904,4091,2017,Novitiate,580346,0,2017-10-27,Sony Pictures Classics,R,,Original Screenplay,...,Historical Fiction,"Sony Pictures Classics,Maven Pictures",United States,no,no info,50k-50m,Maggie Betts,drama,"English,American Sign Language,Latin",4513316
920,1833,2015,The Lady in the Van,10021175,0,2015-01-15,Sony Pictures Classics,PG-13,,Based on Play,...,Dramatization,"Tri-Star Pictures,BBC Films",United States,no,no info,50k-50m,Nicholas Hytner,"biography,comedy,drama","English,French",3722070


In [100]:
division_list = studio_dict['nbc_universal']

master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again['production_companies'].str.contains(('|'.join(division_list)), na=False, regex=True)) &
                                                                            (master_list_df_reconstituted_again['released_in_india_2nd_check'] != 'yes')]

Unnamed: 0,number,year,title,domestic_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,source,...,creative_type,production_companies,production_countries,released_in_india_2nd_check,production_budget_bins,domestic_box_office_bins,director,genres_imdb_final,languages_imdb,id_imdb
474,1746,2015,Fifty Shades of Grey,166167230,40000000,2015-02-13,Universal,R,Fifty Shades of Grey,Based on Fiction Book/Short Story,...,Contemporary Fiction,"Focus Features,Michael De Luca,Trigger Street ...",United States,no,25m-50m,150m-200m,Sam Taylor-Johnson,"drama,romance,thriller",English,2322441
649,3906,2017,Girls Trip,115108515,28000000,2017-07-21,Universal,R,,Original Screenplay,...,Contemporary Fiction,"Will Packer Productions,Universal Pictures,Per...",United States,no,25m-50m,100m-150m,Malcolm D. Lee,"comedy,drama",English,3564472
654,5119,2018,Fifty Shades Freed,100407760,55000000,2018-02-09,Universal,R,Fifty Shades of Grey,Based on Fiction Book/Short Story,...,Contemporary Fiction,"Michael De Luca,Universal Pictures,Perfect Wor...",United States,no,50m-75m,100m-150m,James Foley,"drama,romance,thriller",English,4477536
698,3880,2017,Fifty Shades Darker,114434010,55000000,2017-02-10,Universal,R,Fifty Shades of Grey,Based on Fiction Book/Short Story,...,Contemporary Fiction,"Universal Pictures,Perfect World Pictures,Mich...",United States,no,50m-75m,100m-150m,James Foley,"drama,romance",English,4465564
827,2952,2016,Neighbors 2: Sorority Rising,55340730,35000000,2016-05-20,Universal,R,Neighbors,Original Screenplay,...,Contemporary Fiction,"Point Grey,Good Universe,Universal Pictures,Pe...",United States,no,25m-50m,50m-100m,Nicholas Stoller,comedy,English,4438848
901,1910,2015,5 Flights Up,1020921,0,2015-05-08,Focus Features,PG-13,,Original Screenplay,...,Contemporary Fiction,"Revelations Entertainment,Latitude Entertainme...",United States,no,no info,50k-50m,Richard Loncraine,drama,English,2933544
985,5341,2018,Breaking In,46383120,6000000,2018-05-11,Universal,PG-13,,Original Screenplay,...,Contemporary Fiction,"Will Packer Productions,Practical Pictures,Uni...",United States,no,100k-25m,50k-50m,James McTeigue,"action,crime,drama",English,7137846
1002,3046,2016,The Young Messiah,6469813,16800000,2016-03-11,Focus Features,PG-13,,Based on Comic/Graphic Novel,...,Dramatization,"Focus Features,Ocean Blue Entertainment,1492 P...",United States,no,100k-25m,50k-50m,Cyrus Nowrasteh,"drama,fantasy",English,1002563


In [15]:
all_big_studio_divisions = []

for key, value in studio_dict.items():
    all_big_studio_divisions.extend(value)

In [16]:
len(master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again['production_companies'].str.contains(('|'.join(all_big_studio_divisions)), na=False, regex=True)) &
                                        (master_list_df_reconstituted_again['released_in_india_2nd_check'] == 'yes')])

436

In [17]:
len(master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again['production_companies'].str.contains(('|'.join(all_big_studio_divisions)), na=False, regex=True)) &
                                        (master_list_df_reconstituted_again['released_in_india_2nd_check'] != 'yes')])

53

In [18]:
len(master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again['production_companies'].str.contains(('|'.join(all_big_studio_divisions)), na=False, regex=True))])

489

In [19]:
len(master_list_df_reconstituted_again.loc[~(master_list_df_reconstituted_again['production_companies'].str.contains(('|'.join(all_big_studio_divisions)), na=False, regex=True))])

516

In [20]:
len(master_list_df_reconstituted_again)

1005

In [101]:
def not_released_number(categoryx, binx):

    for key, value in studio_dict.items():

        category_name = key

        division_list = value 

        master_list_df_reconstituted_again_dupli = master_list_df_reconstituted_again

        movies_in_category_name_total_released_india_no = len(master_list_df_reconstituted_again_dupli.loc[(master_list_df_reconstituted_again_dupli['production_companies']\
                                                                                                            .str.contains(('|'.join(division_list)), na=False, regex=True)) &
                                                                                                            (master_list_df_reconstituted_again_dupli['released_in_india_2nd_check'] != 'yes') &
                                                                                                            (master_list_df_reconstituted_again_dupli[categoryx] == binx) ])

        print('For the "{}" studio cluster, the number of movies in the {} range of {} which were not released in India is {}'.format(key,binx,categoryx,movies_in_category_name_total_released_india_no))
        

not_released_number('domestic_box_office_bins','50m-100m')

For the "disney" studio cluster, the number of movies in the 50m-100m range of domestic_box_office_bins which were not released in India is 0
For the "time_warner" studio cluster, the number of movies in the 50m-100m range of domestic_box_office_bins which were not released in India is 0
For the "21st_century_fox" studio cluster, the number of movies in the 50m-100m range of domestic_box_office_bins which were not released in India is 0
For the "nbc_universal" studio cluster, the number of movies in the 50m-100m range of domestic_box_office_bins which were not released in India is 1
For the "sony" studio cluster, the number of movies in the 50m-100m range of domestic_box_office_bins which were not released in India is 2
For the "viacom" studio cluster, the number of movies in the 50m-100m range of domestic_box_office_bins which were not released in India is 0
For the "lionsgate" studio cluster, the number of movies in the 50m-100m range of domestic_box_office_bins which were not releas

In [102]:
division_listx = studio_dict["lionsgate"]

master_list_df_reconstituted_again_dupli.loc[(master_list_df_reconstituted_again_dupli['production_companies']\
                                                                                                            .str.contains(('|'.join(division_listx)), na=False, regex=True)) &
                                                                                                            (master_list_df_reconstituted_again_dupli['released_in_india_2nd_check'] != 'yes') &
                                                                                                            (master_list_df_reconstituted_again_dupli['domestic_box_office_bins'] == '50m-100m') ]

Unnamed: 0,number,year,title,domestic_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,source,...,creative_type,production_companies,production_countries,released_in_india_2nd_check,production_budget_bins,domestic_box_office_bins,director,genres_imdb_final,languages_imdb,id_imdb
796,5175,2018,I Can Only Imagine,83482352,7000000,2018-03-16,Roadside Attractions,PG,,Based on Song,...,Dramatization,"Lionsgate,Erwin Brothers Entertainment,South W...",United States,no,100k-25m,50m-100m,"Andrew Erwin,Jon Erwin","biography,drama,family",English,6450186
827,2952,2016,Neighbors 2: Sorority Rising,55340730,35000000,2016-05-20,Universal,R,Neighbors,Original Screenplay,...,Contemporary Fiction,"Point Grey,Good Universe,Universal Pictures,Pe...",United States,no,25m-50m,50m-100m,Nicholas Stoller,comedy,English,4438848
922,5340,2018,Overboard,50316123,12000000,2018-05-04,Lionsgate,PG-13,,Remake,...,Contemporary Fiction,"Pantelion Films,Metro-Goldwyn-Mayer Pictures,3...",United States,no,100k-25m,50m-100m,Rob Greenberg,"comedy,romance","English,Norwegian,Spanish,French",1563742
999,96,2013,Tyler Perry's Temptation,51975354,0,2013-03-29,Lionsgate,PG-13,,Based on Play,...,Contemporary Fiction,"Lionsgate,TPS Company",United States,no,no info,50m-100m,Tyler Perry,"drama,thriller",English,2070862
1013,97,2013,Tyler Perry's A Madea Christmas,52543354,25000000,2013-12-13,Lionsgate,PG-13,Madea,Based on Play,...,Contemporary Fiction,"Lionsgate,Tyler Perry Studios",United States,no,25m-50m,50m-100m,Tyler Perry,"comedy,drama",English,2609758


In [103]:
division_listx = studio_dict["sony"]

master_list_df_reconstituted_again_dupli.loc[(master_list_df_reconstituted_again_dupli['production_companies']\
                                                                                                            .str.contains(('|'.join(division_listx)), na=False, regex=True)) &
                                                                                                            (master_list_df_reconstituted_again_dupli['released_in_india_2nd_check'] != 'yes') &
                                                                                                            (master_list_df_reconstituted_again_dupli['domestic_box_office_bins'] == '50m-100m') ]

Unnamed: 0,number,year,title,domestic_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,source,...,creative_type,production_companies,production_countries,released_in_india_2nd_check,production_budget_bins,domestic_box_office_bins,director,genres_imdb_final,languages_imdb,id_imdb
657,1805,2015,War Room,67790117,3000000,2015-08-28,Sony Pictures,PG,,Original Screenplay,...,Contemporary Fiction,"Tri-Star Pictures,Faithstep Films,Provident Fi...",United States,no,100k-25m,50m-100m,Alex Kendrick,drama,English,3832914
1033,867,2014,Think Like a Man Too,65028687,24000000,2014-06-20,Sony Pictures,PG-13,Think Like a Man,Based on Factual Book/Article,...,Contemporary Fiction,"Screen Gems,LStar Capital,Will Packer Productions",United States,no,100k-25m,50m-100m,"Tim Story,Keith Merryman","comedy,romance",English,2239832


In [13]:
division_listx = studio_dict["nbc_universal"]

master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again['production_companies']\
                                                                                                            .str.contains(('|'.join(division_listx)), na=False, regex=True)) &
                                                                                                            (master_list_df_reconstituted_again['released_in_india_2nd_check'] != 'yes') &
                                                                                                            (master_list_df_reconstituted_again['domestic_box_office_bins'] == '50m-100m') ]

Unnamed: 0,number,year,title,domestic_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,source,...,creative_type,production_companies,production_countries,released_in_india_2nd_check,production_budget_bins,domestic_box_office_bins,director,genres_imdb_final,languages_imdb,id_imdb
782,2952,2016,Neighbors 2: Sorority Rising,55340730,35000000,2016-05-20,Universal,R,Neighbors,Original Screenplay,...,Contemporary Fiction,"Point Grey,Good Universe,Universal Pictures,Pe...",United States,no,25m-50m,50m-100m,Nicholas Stoller,comedy,English,4438848


In [104]:
not_released_number('domestic_box_office_bins','100m-150m')

For the "disney" studio cluster, the number of movies in the 100m-150m range of domestic_box_office_bins which were not released in India is 0
For the "time_warner" studio cluster, the number of movies in the 100m-150m range of domestic_box_office_bins which were not released in India is 0
For the "21st_century_fox" studio cluster, the number of movies in the 100m-150m range of domestic_box_office_bins which were not released in India is 0
For the "nbc_universal" studio cluster, the number of movies in the 100m-150m range of domestic_box_office_bins which were not released in India is 3
For the "sony" studio cluster, the number of movies in the 100m-150m range of domestic_box_office_bins which were not released in India is 0
For the "viacom" studio cluster, the number of movies in the 100m-150m range of domestic_box_office_bins which were not released in India is 0
For the "lionsgate" studio cluster, the number of movies in the 100m-150m range of domestic_box_office_bins which were not

In [105]:
division_listx = studio_dict["nbc_universal"]

master_list_df_reconstituted_again_dupli.loc[(master_list_df_reconstituted_again_dupli['production_companies']\
                                                                                                            .str.contains(('|'.join(division_listx)), na=False, regex=True)) &
                                                                                                            (master_list_df_reconstituted_again_dupli['released_in_india_2nd_check'] != 'yes') &
                                                                                                            (master_list_df_reconstituted_again_dupli['domestic_box_office_bins'] == '100m-150m') ]

Unnamed: 0,number,year,title,domestic_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,source,...,creative_type,production_companies,production_countries,released_in_india_2nd_check,production_budget_bins,domestic_box_office_bins,director,genres_imdb_final,languages_imdb,id_imdb
649,3906,2017,Girls Trip,115108515,28000000,2017-07-21,Universal,R,,Original Screenplay,...,Contemporary Fiction,"Will Packer Productions,Universal Pictures,Per...",United States,no,25m-50m,100m-150m,Malcolm D. Lee,"comedy,drama",English,3564472
654,5119,2018,Fifty Shades Freed,100407760,55000000,2018-02-09,Universal,R,Fifty Shades of Grey,Based on Fiction Book/Short Story,...,Contemporary Fiction,"Michael De Luca,Universal Pictures,Perfect Wor...",United States,no,50m-75m,100m-150m,James Foley,"drama,romance,thriller",English,4477536
698,3880,2017,Fifty Shades Darker,114434010,55000000,2017-02-10,Universal,R,Fifty Shades of Grey,Based on Fiction Book/Short Story,...,Contemporary Fiction,"Universal Pictures,Perfect World Pictures,Mich...",United States,no,50m-75m,100m-150m,James Foley,"drama,romance",English,4465564


In [106]:
not_released_number('production_budget_bins','50m-75m')

For the "disney" studio cluster, the number of movies in the 50m-75m range of production_budget_bins which were not released in India is 0
For the "time_warner" studio cluster, the number of movies in the 50m-75m range of production_budget_bins which were not released in India is 0
For the "21st_century_fox" studio cluster, the number of movies in the 50m-75m range of production_budget_bins which were not released in India is 0
For the "nbc_universal" studio cluster, the number of movies in the 50m-75m range of production_budget_bins which were not released in India is 2
For the "sony" studio cluster, the number of movies in the 50m-75m range of production_budget_bins which were not released in India is 0
For the "viacom" studio cluster, the number of movies in the 50m-75m range of production_budget_bins which were not released in India is 0
For the "lionsgate" studio cluster, the number of movies in the 50m-75m range of production_budget_bins which were not released in India is 0


In [107]:
division_listx = studio_dict["nbc_universal"]

master_list_df_reconstituted_again_dupli.loc[(master_list_df_reconstituted_again_dupli['production_companies']\
                                                                                                            .str.contains(('|'.join(division_listx)), na=False, regex=True)) &
                                                                                                            (master_list_df_reconstituted_again_dupli['released_in_india_2nd_check'] != 'yes') &
                                                                                                            (master_list_df_reconstituted_again_dupli['production_budget_bins'] == '50m-75m') ]

Unnamed: 0,number,year,title,domestic_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,source,...,creative_type,production_companies,production_countries,released_in_india_2nd_check,production_budget_bins,domestic_box_office_bins,director,genres_imdb_final,languages_imdb,id_imdb
654,5119,2018,Fifty Shades Freed,100407760,55000000,2018-02-09,Universal,R,Fifty Shades of Grey,Based on Fiction Book/Short Story,...,Contemporary Fiction,"Michael De Luca,Universal Pictures,Perfect Wor...",United States,no,50m-75m,100m-150m,James Foley,"drama,romance,thriller",English,4477536
698,3880,2017,Fifty Shades Darker,114434010,55000000,2017-02-10,Universal,R,Fifty Shades of Grey,Based on Fiction Book/Short Story,...,Contemporary Fiction,"Universal Pictures,Perfect World Pictures,Mich...",United States,no,50m-75m,100m-150m,James Foley,"drama,romance",English,4465564


In [108]:
not_released_number('production_budget_bins','25m-50m')

For the "disney" studio cluster, the number of movies in the 25m-50m range of production_budget_bins which were not released in India is 0
For the "time_warner" studio cluster, the number of movies in the 25m-50m range of production_budget_bins which were not released in India is 0
For the "21st_century_fox" studio cluster, the number of movies in the 25m-50m range of production_budget_bins which were not released in India is 0
For the "nbc_universal" studio cluster, the number of movies in the 25m-50m range of production_budget_bins which were not released in India is 3
For the "sony" studio cluster, the number of movies in the 25m-50m range of production_budget_bins which were not released in India is 0
For the "viacom" studio cluster, the number of movies in the 25m-50m range of production_budget_bins which were not released in India is 0
For the "lionsgate" studio cluster, the number of movies in the 25m-50m range of production_budget_bins which were not released in India is 2


In [109]:
division_listx = studio_dict["nbc_universal"]

master_list_df_reconstituted_again_dupli.loc[(master_list_df_reconstituted_again_dupli['production_companies']\
                                                                                                            .str.contains(('|'.join(division_listx)), na=False, regex=True)) &
                                                                                                            (master_list_df_reconstituted_again_dupli['released_in_india_2nd_check'] != 'yes') &
                                                                                                            (master_list_df_reconstituted_again_dupli['production_budget_bins'] == '25m-50m') ]

Unnamed: 0,number,year,title,domestic_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,source,...,creative_type,production_companies,production_countries,released_in_india_2nd_check,production_budget_bins,domestic_box_office_bins,director,genres_imdb_final,languages_imdb,id_imdb
474,1746,2015,Fifty Shades of Grey,166167230,40000000,2015-02-13,Universal,R,Fifty Shades of Grey,Based on Fiction Book/Short Story,...,Contemporary Fiction,"Focus Features,Michael De Luca,Trigger Street ...",United States,no,25m-50m,150m-200m,Sam Taylor-Johnson,"drama,romance,thriller",English,2322441
649,3906,2017,Girls Trip,115108515,28000000,2017-07-21,Universal,R,,Original Screenplay,...,Contemporary Fiction,"Will Packer Productions,Universal Pictures,Per...",United States,no,25m-50m,100m-150m,Malcolm D. Lee,"comedy,drama",English,3564472
827,2952,2016,Neighbors 2: Sorority Rising,55340730,35000000,2016-05-20,Universal,R,Neighbors,Original Screenplay,...,Contemporary Fiction,"Point Grey,Good Universe,Universal Pictures,Pe...",United States,no,25m-50m,50m-100m,Nicholas Stoller,comedy,English,4438848


In [110]:
division_listx = studio_dict["lionsgate"]

master_list_df_reconstituted_again_dupli.loc[(master_list_df_reconstituted_again_dupli['production_companies']\
                                                                                                            .str.contains(('|'.join(division_listx)), na=False, regex=True)) &
                                                                                                            (master_list_df_reconstituted_again_dupli['released_in_india_2nd_check'] != 'yes') &
                                                                                                            (master_list_df_reconstituted_again_dupli['production_budget_bins'] == '25m-50m') ]

Unnamed: 0,number,year,title,domestic_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,source,...,creative_type,production_companies,production_countries,released_in_india_2nd_check,production_budget_bins,domestic_box_office_bins,director,genres_imdb_final,languages_imdb,id_imdb
827,2952,2016,Neighbors 2: Sorority Rising,55340730,35000000,2016-05-20,Universal,R,Neighbors,Original Screenplay,...,Contemporary Fiction,"Point Grey,Good Universe,Universal Pictures,Pe...",United States,no,25m-50m,50m-100m,Nicholas Stoller,comedy,English,4438848
1013,97,2013,Tyler Perry's A Madea Christmas,52543354,25000000,2013-12-13,Lionsgate,PG-13,Madea,Based on Play,...,Contemporary Fiction,"Lionsgate,Tyler Perry Studios",United States,no,25m-50m,50m-100m,Tyler Perry,"comedy,drama",English,2609758


In [6]:
#-- do a list for critically appreciated movies, either movies that have done well at film festivals, or that have won oscars or indepedent spirit awards
#since there's just 48 movies, just checked the csv manually instead of through code

#looking at the independent spirit awards, 48 movies with best feature, director or screenplay nominations from 2014 to 2018. 18 of those 48 were never released in India. That's around 38 percent 
#Put another way, 2 out of 5 critically acclaimed independent American movies (from our selection of 48 movies) never found their way to an Indian screen
#note that "call me by your name" had censorship concerns

#history of winners from https://s3.amazonaws.com/SA_SubForm_etc/2018_SA_NomsWinners_040218.pdf 

#find out how many of them had major studio backing in some form
    #find out which production companies these were

import pandas as pd

indy_award_nominees_df = pd.read_csv('data/independent_spirit_awards_nominees_2014_2018.csv')
indy_award_nominees_df.head()

Unnamed: 0,year,title,nominee,released_in_india
0,2014,12 Years A Slave,best_feature,yes
1,2014,All Is Lost,best_feature,yes
2,2014,Before Midnight,best_director_or_screenplay,yes
3,2014,Blue Jasmine,best_director_or_screenplay,yes
4,2014,Enough Said,best_director_or_screenplay,no


In [8]:
studio_dict = {
    
    "disney": ["Walt Disney Pictures","Walt Disney Animation Studios","Disney Nature","DisneyToon Studios","Disney-Pixar","Marvel Studios","Lucasfilm","Touchstone Pictures","A&E Indiefilms"],
    
    "time_warner": ["Warner Bros.","Warner Animation Group","New Line Cinema","DC Films","DC Comics","Castle Rock Entertainment","CNN Films","HBO Films","HBO Documentary Films","Flagship Entertainment"],

    "21st_century_fox": ["Fox Searchlight Pictures","Twentieth Century Fox","20th Century Fox Animation","Fox International Productions","20th Century Fox","Fox 2000 Pictures","Blue Sky Studios",
                        "Regency Enterprises","New Regency"],
    
    "nbc_universal": ["Focus Features","Universal Pictures","Gramercy Pictures","Working Title Films","Bullwinkle Studios","DreamWorks Animation","Illumination Entertainment","Focus World","Oriental DreamWorks"],
    
    "sony": ["Sony Pictures Classics","Columbia Pictures","Sony Pictures","Sony Pictures Animation","Screen Gems","Affirm Films","Stage 6 Films","Tri-Star Pictures"],
    
    "viacom": ["Paramount Vantage","Paramount Pictures","Nickelodeon Films","MTV Films","Awesomeness Films","BET Films"],
    
    "lionsgate": ["Roadside Attractions","Lionsgate","Codeblack Films","Good Universe","Pantelion Films","Summit Entertainment","Summit Premiere"]
}


distributor_dict = {
    
    "disney": ["Walt Disney"],
    
    "time_warner": ["Warner Bros."],

    "21st_century_fox": ["Fox Searchlight","20th Century Fox"],
    
    "nbc_universal": ["Focus Features","Universal","Gramercy"],
    
    "sony": ["Sony Pictures Classics","Sony Pictures"],
    
    "viacom": ["Paramount Pictures"],
    
    "lionsgate": ["Roadside Attractions","Lionsgate","Codeblack Entertainment"]
}

In [9]:
from fuzzywuzzy import fuzz

import pprint

pp = pprint.PrettyPrinter(indent=4)

division_count = {
    
    "disney": {'total': 0, 'released':0,'not_released':0},
    
    "time_warner": {'total': 0, 'released':0,'not_released':0},

    "21st_century_fox": {'total': 0, 'released':0,'not_released':0},
    
    "nbc_universal": {'total': 0, 'released':0,'not_released':0},
    
    "sony": {'total': 0, 'released':0,'not_released':0},
    
    "viacom": {'total': 0, 'released':0,'not_released':0},
    
    "lionsgate": {'total': 0, 'released':0,'not_released':0}
}

scraped_data_df = pd.read_csv('data/scraped_data.csv', converters={'number': str})

for index,row in indy_award_nominees_df.iterrows():
    
    title = row['title']
    print('\n')
    print(title)
    year = row['year']
    year = int(year)
    year_list = [year-2, year-1, year]
    released_in_india = row['released_in_india']
    
    for indexx, rowx in master_list_df_reconstituted_again.iterrows():
        titlex = rowx['title']
        yearx = rowx['year']
        yearx = int(yearx)
        
        production_companies = rowx['production_companies']
        
        if ((fuzz.token_sort_ratio(title, titlex) >= 85) and (yearx in year_list)):
            
            print("possible match for title '{}' from indy nominee list is '{}' from master list".format(title,titlex))
            
            for key, value in studio_dict.items():
                
                group_name = key
                
                division_list = value  
    
                if str(production_companies) != 'nan':
                
                    if any(x in production_companies for x in division_list):
                        
                        print('division of {} movie studio involved in title {}'.format(group_name,title))
                        
                        division_count[group_name]['total'] += 1

                        if released_in_india == 'yes':

                            division_count[group_name]['released'] += 1

                        else:

                            division_count[group_name]['not_released'] += 1
#                     else:
#                         print('no divisions of {} movie studio involved in title "{}"'.format(group_name,title))
                else:
                    
                    print('no production companies for title "{}" from indy nominee list'.format(title))
                    
            break
        
        else:
            
            continue
    
print('\n')
pp.pprint(division_count)
print('\n')
print('done and done')



12 Years A Slave
possible match for title '12 Years A Slave' from indy nominee list is '12 Years a Slave' from master list
division of 21st_century_fox movie studio involved in title 12 Years A Slave


All Is Lost
possible match for title 'All Is Lost' from indy nominee list is 'All is Lost' from master list
division of lionsgate movie studio involved in title All Is Lost


Before Midnight
possible match for title 'Before Midnight' from indy nominee list is 'Before Midnight' from master list
division of time_warner movie studio involved in title Before Midnight


Blue Jasmine
possible match for title 'Blue Jasmine' from indy nominee list is 'Blue Jasmine' from master list
division of sony movie studio involved in title Blue Jasmine


Enough Said
possible match for title 'Enough Said' from indy nominee list is 'Enough Said' from master list


Frances Ha
possible match for title 'Frances Ha' from indy nominee list is 'Frances Ha' from master list


Inside Llewyn Davis
possible match fo

In [10]:
from fuzzywuzzy import fuzz

import pprint

pp = pprint.PrettyPrinter(indent=4)

division_count = {
    
    "disney": {'total': 0, 'released':0,'not_released':0},
    
    "time_warner": {'total': 0, 'released':0,'not_released':0},

    "21st_century_fox": {'total': 0, 'released':0,'not_released':0},
    
    "nbc_universal": {'total': 0, 'released':0,'not_released':0},
    
    "sony": {'total': 0, 'released':0,'not_released':0},
    
    "viacom": {'total': 0, 'released':0,'not_released':0},
    
    "lionsgate": {'total': 0, 'released':0,'not_released':0}
}

scraped_data_df = pd.read_csv('data/scraped_data.csv', converters={'number': str})

for index,row in indy_award_nominees_df.iterrows():
    
    title = row['title']
    print('\n')
    print(title)
    year = row['year']
    year = int(year)
    year_list = [year-2, year-1, year]
    released_in_india = row['released_in_india']
    
    for indexx, rowx in scraped_data_df.iterrows():
        titlex = rowx['title']
        yearx = rowx['year']
        yearx = int(yearx)
        
        production_companies = rowx['production_companies']
        
        if ((fuzz.token_sort_ratio(title, titlex) >= 85) and (yearx in year_list)):
            
            print("possible match for title '{}' from indy nominee list is '{}' from master list".format(title,titlex))
            
            for key, value in studio_dict.items():
                
                group_name = key
                
                division_list = value  
    
                if str(production_companies) != 'nan':
                
                    if any(x in production_companies for x in division_list):
                        
                        print('division of {} movie studio involved in title {}'.format(group_name,title))
                        
                        division_count[group_name]['total'] += 1

                        if released_in_india == 'yes':

                            division_count[group_name]['released'] += 1

                        else:

                            division_count[group_name]['not_released'] += 1
#                     else:
#                         print('no divisions of {} movie studio involved in title "{}"'.format(group_name,title))
                else:
                    
                    print('no production companies for title "{}" from indy nominee list'.format(title))
                    
            break
        
        else:
            
            continue
    
print('\n')
pp.pprint(division_count)
print('\n')
print('done and done')



12 Years A Slave
possible match for title '12 Years A Slave' from indy nominee list is '12 Years a Slave' from master list
division of 21st_century_fox movie studio involved in title 12 Years A Slave


All Is Lost
possible match for title 'All Is Lost' from indy nominee list is 'All is Lost' from master list
division of lionsgate movie studio involved in title All Is Lost


Before Midnight
possible match for title 'Before Midnight' from indy nominee list is 'Before Midnight' from master list
division of time_warner movie studio involved in title Before Midnight


Blue Jasmine
possible match for title 'Blue Jasmine' from indy nominee list is 'Blue Jasmine' from master list
division of sony movie studio involved in title Blue Jasmine


Enough Said
possible match for title 'Enough Said' from indy nominee list is 'Enough Said' from master list


Frances Ha
possible match for title 'Frances Ha' from indy nominee list is 'Frances Ha' from master list


Inside Llewyn Davis
possible match fo

In [11]:
from fuzzywuzzy import fuzz

import pprint

pp = pprint.PrettyPrinter(indent=4)

division_count = {

    "disney": {'total': 0, 'released':0,'not_released':0},

    "time_warner": {'total': 0, 'released':0,'not_released':0},

    "21st_century_fox": {'total': 0, 'released':0,'not_released':0},

    "nbc_universal": {'total': 0, 'released':0,'not_released':0},

    "sony": {'total': 0, 'released':0,'not_released':0},

    "viacom": {'total': 0, 'released':0,'not_released':0},

    "lionsgate": {'total': 0, 'released':0,'not_released':0}
}

scraped_data_df = pd.read_csv('data/scraped_data.csv', converters={'number': str})

for index,row in indy_award_nominees_df.iterrows():

    title = row['title']
    print('\n')
    print(title)
    year = row['year']
    year = int(year)
    year_list = [year-2, year-1, year]
    released_in_india = row['released_in_india']

    for indexx, rowx in scraped_data_df.iterrows():
        titlex = rowx['title']
        yearx = rowx['year']
        yearx = int(yearx)

        domestic_distributor = rowx['domestic_distributor']

        if ((fuzz.token_sort_ratio(title, titlex) >= 85) and (yearx in year_list)):

            print("possible match for title '{}' from indy nominee list is '{}' from master list".format(title,titlex))
            
            indy_award_nominees_df.loc[index,'domestic_distributor'] = domestic_distributor

            for key, value in distributor_dict.items():

                group_name = key

                division_list = value

                if str(domestic_distributor) != 'nan':

                    if any(x in domestic_distributor for x in division_list):

                        print('division of {} movie studio involved in distribution of title "{}"'.format(group_name,title))
                        
                        indy_award_nominees_df.loc[index,'domestic_distributor'] = group_name

                        division_count[group_name]['total'] += 1

                        if released_in_india == 'yes':

                            division_count[group_name]['released'] += 1

                        else:

                            division_count[group_name]['not_released'] += 1

                        break
                        
                    else:
                        
                        continue
                        
                else:

                    print('no us distributors for title "{}" from indy nominee list'.format(title))

            break

        else:

            continue

print('\n')
pp.pprint(division_count)
print('\n')

indy_award_nominees_df.to_csv('data/independent_spirit_awards_nominees_with_distributor.csv', index=False)

print('done and done')



12 Years A Slave
possible match for title '12 Years A Slave' from indy nominee list is '12 Years a Slave' from master list
division of 21st_century_fox movie studio involved in distribution of title "12 Years A Slave"


All Is Lost
possible match for title 'All Is Lost' from indy nominee list is 'All is Lost' from master list
division of lionsgate movie studio involved in distribution of title "All Is Lost"


Before Midnight
possible match for title 'Before Midnight' from indy nominee list is 'Before Midnight' from master list
division of sony movie studio involved in distribution of title "Before Midnight"


Blue Jasmine
possible match for title 'Blue Jasmine' from indy nominee list is 'Blue Jasmine' from master list
division of sony movie studio involved in distribution of title "Blue Jasmine"


Enough Said
possible match for title 'Enough Said' from indy nominee list is 'Enough Said' from master list
division of 21st_century_fox movie studio involved in distribution of title "Enou

In [12]:
len(master_list_df_reconstituted_again.loc[master_list_df_reconstituted_again['released_in_india_2nd_check'] == 'yes'])

784

In [13]:
len(master_list_df_reconstituted_again.loc[master_list_df_reconstituted_again['released_in_india_2nd_check'] != 'yes'])

221

In [27]:
master_list_df_reconstituted_again.loc[master_list_df_reconstituted_again['released_in_india_2nd_check'] == 'yes'].groupby('genre').size()

genre
Action               128
Adventure            118
Black Comedy          14
Comedy               112
Drama                208
Horror                59
Multiple Genres        1
Musical               11
Romantic Comedy       23
Thriller/Suspense    102
Western                8
dtype: int64

In [14]:
master_list_df_reconstituted_again.loc[master_list_df_reconstituted_again['released_in_india_2nd_check'] != 'yes'].groupby('genre').size()

genre
Action                 4
Adventure              5
Black Comedy           4
Comedy                49
Drama                120
Horror                 6
Multiple Genres        1
Musical                1
Romantic Comedy       11
Thriller/Suspense     19
Western                1
dtype: int64

In [15]:
master_list_df_reconstituted_again.dtypes

number                          int64
year                            int64
title                          object
domestic_box_office             int64
production_budget               int64
domestic_release_date          object
domestic_distributor           object
mpaa_rating                    object
franchise                      object
source                         object
genre                          object
production_method              object
creative_type                  object
production_companies           object
production_countries           object
released_in_india_2nd_check    object
production_budget_bins         object
domestic_box_office_bins       object
director                       object
genres_imdb_final              object
languages_imdb                 object
id_imdb                         int64
dtype: object

In [None]:
master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again['genres_imdb_final'] == 'music')&
                                      (master_list_df_reconstituted_again['domestic_box_office_bins'])]

In [17]:
master_list_df_reconstituted_again['domestic_box_office_bins'].unique()

array(['150m-200m', '50k-50m', '200m-250m', '300m-350m', '50m-100m',
       '100m-150m', '500m-550m', '650m-700m', '250m-300m', '400m-450m',
       '350m-400m', '450m-500m', '600m-650m', '700m-750m', '900m-950m'], dtype=object)

In [23]:
for binx in master_list_df_reconstituted_again['domestic_box_office_bins'].unique():
    lengthx = len(master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again['genres_imdb_final'].str.contains('music', na=False, regex=False))&
                                      (master_list_df_reconstituted_again['domestic_box_office_bins']==binx) &
                                        (master_list_df_reconstituted_again['released_in_india_2nd_check'] != 'yes')])
    
    print('for binx {}, number of movies not released in music genre is {}'.format(binx,lengthx))    

for binx 150m-200m, number of movies not released in music genre is 0
for binx 50k-50m, number of movies not released in music genre is 16
for binx 200m-250m, number of movies not released in music genre is 0
for binx 300m-350m, number of movies not released in music genre is 0
for binx 50m-100m, number of movies not released in music genre is 0
for binx 100m-150m, number of movies not released in music genre is 0
for binx 500m-550m, number of movies not released in music genre is 0
for binx 650m-700m, number of movies not released in music genre is 0
for binx 250m-300m, number of movies not released in music genre is 0
for binx 400m-450m, number of movies not released in music genre is 0
for binx 350m-400m, number of movies not released in music genre is 0
for binx 450m-500m, number of movies not released in music genre is 0
for binx 600m-650m, number of movies not released in music genre is 0
for binx 700m-750m, number of movies not released in music genre is 0
for binx 900m-950m, nu

In [24]:
master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again['genres_imdb_final'].str.contains('music', na=False, regex=False))&
                                        (master_list_df_reconstituted_again['released_in_india_2nd_check'] != 'yes')].sort_values('domestic_box_office',ascending=False)

Unnamed: 0,number,year,title,domestic_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,source,...,creative_type,production_companies,production_countries,released_in_india_2nd_check,production_budget_bins,domestic_box_office_bins,director,genres_imdb_final,languages_imdb,id_imdb
895,5068,2018,Forever My Girl,16376066,3500000,2018-01-19,Roadside Attractions,PG,,Based on Fiction Book/Short Story,...,Contemporary Fiction,"LD Entertainment,Liddell Entertainment",United States,no,100k-25m,50k-50m,Bethany Ashton Wolf,"drama,music,romance",English,4103724
765,923,2014,Beyond the Lights,14618727,7000000,2014-11-14,Relativity,PG-13,,Original Screenplay,...,Contemporary Fiction,"Relativity Media,Undisputed Cinema,Homegrown P...",United States,no,100k-25m,50k-50m,Gina Prince-Bythewood,"drama,music,romance",English,3125324
635,1866,2015,Love & Mercy,12551031,0,2015-06-05,Roadside Attractions,PG-13,,Based on Real Life Events,...,Dramatization,"River Road Entertainment,Battle Mountain Films",United States,no,no info,50k-50m,Bill Pohlad,"biography,drama,music",English,903657
986,5586,2018,"Juliet, Naked",3423155,0,2018-08-17,Roadside Attractions,R,,Based on Fiction Book/Short Story,...,Contemporary Fiction,"Apatow Productions,Bona Fide,Los Angeles Media...","United Kingdom,United States",no,no info,50k-50m,Jesse Peretz,"comedy,drama,music",English,5607096
716,166,2013,Filly Brown,2850357,1250000,2013-04-19,Lionsgate,R,,Original Screenplay,...,Contemporary Fiction,Lionsgate,United States,no,100k-25m,50k-50m,"Youssef Delara,Michael D. Olmos","drama,music",English,1869425
694,951,2014,The Identical,2827666,0,2014-09-05,Freestyle Releasing,PG,,Original Screenplay,...,Historical Fiction,"City of Peace Films,The Identical Production C...",United States,no,no info,50k-50m,Dustin Marcellino,"drama,music",English,2326574
625,3062,2016,Miles Ahead,2610896,0,2016-04-01,Sony Pictures Classics,R,,Based on Real Life Events,...,Dramatization,"Bifrost Pictures,Sony Pictures Classics,Miles ...",United States,no,no info,50k-50m,Don Cheadle,"biography,drama,music",English,790770
691,5441,2018,Hearts Beat Loud,2386254,2000000,2018-06-08,Gunpowder & Sky,PG-13,,Original Screenplay,...,Contemporary Fiction,"Burn Later,Houston King Productions,Park Pictu...",United States,no,100k-25m,50k-50m,Brett Haley,"drama,music",English,7158430
623,4025,2017,Brad's Status,2133158,0,2017-09-15,Annapurna Pictures,R,,Original Screenplay,...,Contemporary Fiction,,United States,no,no info,50k-50m,Mike White,"comedy,drama,music",English,5884230
797,143,2013,Unfinished Song,1698952,0,2013-06-21,Weinstein Co.,PG-13,,Original Screenplay,...,Contemporary Fiction,"Coolmore Productions,Aegis Film Fund,Film Hous...","Germany,United States",no,no info,50k-50m,Paul Andrew Williams,"comedy,drama,music",English,1047011


In [26]:
master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again['genres_imdb_final'].str.contains('sport', na=False, regex=False))&
                                        (master_list_df_reconstituted_again['released_in_india_2nd_check'] != 'yes')].sort_values('domestic_box_office', ascending=False)

Unnamed: 0,number,year,title,domestic_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,source,...,creative_type,production_companies,production_countries,released_in_india_2nd_check,production_budget_bins,domestic_box_office_bins,director,genres_imdb_final,languages_imdb,id_imdb
945,1867,2015,Woodlawn,14394097,13000000,2015-10-16,Pure Flix Entertainment,PG,,Based on Real Life Events,...,Dramatization,"Pure Flix,Provident Films",United States,no,100k-25m,50k-50m,"The Erwin Brothers,The Erwin Brothers","drama,sport",English,4183692
879,5265,2018,The Miracle Season,10230620,0,2018-04-06,LD Entertainment,PG,,Based on Real Life Events,...,Dramatization,LD Entertainment,United States,no,no info,50k-50m,Sean McNamara,"drama,sport",English,5427194
937,168,2013,Home Run,2859955,1200000,2013-04-19,Samuel Goldwyn Films,PG-13,,Original Screenplay,...,Contemporary Fiction,"Impact Productions,Hero Productions",United States,no,100k-25m,50k-50m,David Boyd,"drama,sport",English,2051894
791,1924,2015,My All-American,2246000,20000000,2015-11-13,Clarius Entertainment,PG,,Based on Factual Book/Article,...,Dramatization,"Clarius Entertainment,Anthem Ventures,Anthem P...",United States,no,100k-25m,50k-50m,Angelo Pizzo,"biography,drama,sport",English,3719896
838,3091,2016,Greater,2000093,9000000,2016-08-26,Hammond Entertainment,PG,,Based on Real Life Events,...,Dramatization,"Hammond Entertainment LLC,Greater Productions LLC",United States,no,100k-25m,50k-50m,David Hunt,"biography,family,sport",English,2950418
889,4049,2017,Slamma Jamma,1687000,0,2017-03-24,Riverrain,PG,,Original Screenplay,...,Contemporary Fiction,,United States,no,no info,50k-50m,Timothy A. Chey,"drama,sport",English,5319866
630,3137,2016,The Bronze,615816,3500000,2016-03-18,Sony Pictures Classics,R,,Original Screenplay,...,Contemporary Fiction,"Duplass Brothers,Stage 6 Films",United States,no,100k-25m,50k-50m,Bryan Buckley,"comedy,drama,sport",English,3859304
715,1005,2014,23 Blast,549185,1000000,2014-10-24,Abramorama Films,PG-13,,Original Screenplay,...,Contemporary Fiction,,United States,no,100k-25m,50k-50m,Dylan Baker,"drama,family,sport",English,2304459


In [28]:
master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again['genres_imdb_final'].str.contains('romance', na=False, regex=False))&
                                        (master_list_df_reconstituted_again['released_in_india_2nd_check'] != 'yes')].sort_values('domestic_box_office', ascending=False)

Unnamed: 0,number,year,title,domestic_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,source,...,creative_type,production_companies,production_countries,released_in_india_2nd_check,production_budget_bins,domestic_box_office_bins,director,genres_imdb_final,languages_imdb,id_imdb
466,1746,2015,Fifty Shades of Grey,166167230,40000000,2015-02-13,Universal,R,Fifty Shades of Grey,Based on Fiction Book/Short Story,...,Contemporary Fiction,"Focus Features,Michael De Luca,Trigger Street ...",United States,no,25m-50m,150m-200m,Sam Taylor-Johnson,"drama,romance,thriller",English,2322441
677,3880,2017,Fifty Shades Darker,114434010,55000000,2017-02-10,Universal,R,Fifty Shades of Grey,Based on Fiction Book/Short Story,...,Contemporary Fiction,"Universal Pictures,Perfect World Pictures,Mich...",United States,no,50m-75m,100m-150m,James Foley,"drama,romance",English,4465564
641,5119,2018,Fifty Shades Freed,100407760,55000000,2018-02-09,Universal,R,Fifty Shades of Grey,Based on Fiction Book/Short Story,...,Contemporary Fiction,"Michael De Luca,Universal Pictures,Perfect Wor...",United States,no,50m-75m,100m-150m,James Foley,"drama,romance,thriller",English,4477536
966,867,2014,Think Like a Man Too,65028687,24000000,2014-06-20,Sony Pictures,PG-13,Think Like a Man,Based on Factual Book/Article,...,Contemporary Fiction,"Screen Gems,LStar Capital,Will Packer Productions",United States,no,100k-25m,50m-100m,"Tim Story,Keith Merryman","comedy,romance",English,2239832
868,5340,2018,Overboard,50316123,12000000,2018-05-04,Lionsgate,PG-13,,Remake,...,Contemporary Fiction,"Pantelion Films,Metro-Goldwyn-Mayer Pictures,3...",United States,no,100k-25m,50m-100m,Rob Greenberg,"comedy,romance","English,Norwegian,Spanish,French",1563742
779,1829,2015,Love the Coopers,26302731,18000000,2015-11-13,CBS Films,PG-13,,Original Screenplay,...,Contemporary Fiction,"CBS Films,Imagine Entertainment,Groundswell Pr...",United States,no,100k-25m,50k-50m,Jessie Nelson,"comedy,fantasy,romance",English,2279339
828,906,2014,Top Five,25317379,12000000,2014-12-12,Paramount Pictures,R,,Original Screenplay,...,Contemporary Fiction,IACF,United States,no,100k-25m,50k-50m,Chris Rock,"comedy,romance",English,2784678
740,121,2013,Enough Said,17550872,8000000,2013-09-18,Fox Searchlight,PG-13,,Original Screenplay,...,Contemporary Fiction,"Likely Story,TSG Entertainment",United States,no,100k-25m,50k-50m,Nicole Holofcener,"comedy,drama,romance",English,2390361
895,5068,2018,Forever My Girl,16376066,3500000,2018-01-19,Roadside Attractions,PG,,Based on Fiction Book/Short Story,...,Contemporary Fiction,"LD Entertainment,Liddell Entertainment",United States,no,100k-25m,50k-50m,Bethany Ashton Wolf,"drama,music,romance",English,4103724
898,917,2014,And So It Goes,15160801,18000000,2014-07-25,Clarius Entertainment,PG-13,,Original Screenplay,...,Contemporary Fiction,"Castle Rock Entertainment,Rob Reiner,Alan Grei...",United States,no,100k-25m,50k-50m,Rob Reiner,"comedy,drama,romance",English,2465146


In [31]:
master_list_df_reconstituted_again.sort_values('domestic_box_office', ascending=False)

Unnamed: 0,number,year,title,domestic_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,source,...,creative_type,production_companies,production_countries,released_in_india_2nd_check,production_budget_bins,domestic_box_office_bins,director,genres_imdb_final,languages_imdb,id_imdb
972,1736,2015,Star Wars Ep. VII: The Force Awakens,936662225,306000000,2015-12-18,Walt Disney,PG-13,Star Wars,Original Screenplay,...,Science Fiction,"Lucasfilm,Bad Robot",United States,yes,300m-325m,900m-950m,J.J. Abrams,"action,adventure,fantasy",English,2488496
457,5117,2018,Black Panther,700059566,200000000,2018-02-16,Walt Disney,PG-13,Marvel Cinematic Universe,Based on Comic/Graphic Novel,...,Super Hero,Marvel Studios,United States,yes,200m-225m,700m-750m,Ryan Coogler,"action,adventure,sci-fi","Swahili,Nama,English,Xhosa,Korean",1825683
184,5257,2018,Avengers: Infinity War,678815482,300000000,2018-04-27,Walt Disney,PG-13,Marvel Cinematic Universe,Based on Comic/Graphic Novel,...,Super Hero,Marvel Studios,United States,yes,300m-325m,650m-700m,"Joe Russo,Anthony Russo","action,adventure,fantasy",English,4154756
67,1737,2015,Jurassic World,652270625,215000000,2015-06-12,Universal,PG-13,Jurassic Park,Based on Fiction Book/Short Story,...,Science Fiction,"Universal Pictures,Amblin Entertainment,Legend...",United States,yes,200m-225m,650m-700m,Colin Trevorrow,"action,adventure,sci-fi",English,369610
864,3857,2017,Star Wars Ep. VIII: The Last Jedi,620181382,317000000,2017-12-15,Walt Disney,PG-13,Star Wars,Original Screenplay,...,Science Fiction,"Lucasfilm,Walt Disney Pictures",United States,yes,300m-325m,600m-650m,Rian Johnson,"action,adventure,fantasy",English,2527336
359,5427,2018,Incredibles 2,606354358,200000000,2018-06-15,Walt Disney,PG,The Incredibles,Original Screenplay,...,Kids Fiction,Disney-Pixar,United States,yes,200m-225m,600m-650m,Brad Bird,"action,adventure,animation",English,3606756
593,2887,2016,Rogue One: A Star Wars Story,532177324,200000000,2016-12-16,Walt Disney,PG-13,Star Wars,Spin-Off,...,Science Fiction,Lucasfilm,United States,yes,200m-225m,500m-550m,Gareth Edwards,"action,adventure,sci-fi",English,3748528
60,3858,2017,Beauty and the Beast,504014165,160000000,2017-03-17,Walt Disney,PG,,Remake,...,Fantasy,"Walt Disney Pictures,Mandeville Films",United States,yes,150m-175m,500m-550m,Bill Condon,"family,fantasy,musical",English,2771200
529,2888,2016,Finding Dory,486295561,200000000,2016-06-17,Walt Disney,PG,Finding Nemo,Original Screenplay,...,Kids Fiction,Disney-Pixar,United States,yes,200m-225m,450m-500m,Andrew Stanton,"adventure,animation,comedy","English,Indonesian",2277860
298,1739,2015,Avengers: Age of Ultron,459005868,330600000,2015-05-01,Walt Disney,PG-13,Marvel Cinematic Universe,Based on Comic/Graphic Novel,...,Super Hero,Marvel Studios,United States,yes,325m-350m,450m-500m,Joss Whedon,"action,adventure,sci-fi","English,Korean",2395427


In [32]:
master_list_df_reconstituted_again['production_budget_bins'].unique()

array(['225m-250m', '100k-25m', '175m-200m', '100m-125m', 'no info',
       '50m-75m', '125m-150m', '25m-50m', '150m-175m', '75m-100m',
       '200m-225m', '250m-275m', '275m-300m', '300m-325m', '325m-350m'], dtype=object)

In [34]:
len(master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again['production_budget_bins'] != 'no info')
                                           #&
#                                         (master_list_df_reconstituted_again['released_in_india_2nd_check'] != 'yes')
                                      ])

753

In [111]:
master_list_df_reconstituted_again_yes = master_list_df_reconstituted_again[master_list_df_reconstituted_again['released_in_india_2nd_check'] == 'yes']

In [112]:
master_list_df_reconstituted_again_yes.to_csv('data/india_releases_yes_manual_check.csv', index=False)

In [6]:
from imdb import IMDb

with open('passwords_ignore.txt') as f:
    password = f.readline()

login_string = 'postgresql://postgres:' + password + '@localhost/imdb2'

ia = IMDb('s3', login_string, adultSearch=False)

In [16]:
resultsx  = ia.search_movie("Inconvenient Sequel",results = 30)

resultsx

[<Movie id:179239[s3] title:_L'inconveniente (None)_>,
 <Movie id:7529858[s3] title:_An Inconvenient Special (None)_>,
 <Movie id:4990898[s3] title:_El Inconveniente (None)_>,
 <Movie id:1664121[s3] title:_Inconvenient Truths (None)_>,
 <Movie id:1655563[s3] title:_An Inconvenient Egg (None)_>,
 <Movie id:1013982[s3] title:_An Inconvenient Lie (None)_>,
 <Movie id:497512[s3] title:_Inconvenience Store (None)_>,
 <Movie id:8355272[s3] title:_An Inconvenient Answer (None)_>,
 <Movie id:1428564[s3] title:_In Convenience (None)_>,
 <Movie id:805551[s3] title:_In Convenience (None)_>,
 <Movie id:165831[s3] title:_Inconvenienced (None)_>,
 <Movie id:8466576[s3] title:_An Inconvenient Ruth (None)_>,
 <Movie id:5031442[s3] title:_An Inconvenient Stay (None)_>,
 <Movie id:3141888[s3] title:_An Inconvenient Fish (None)_>,
 <Movie id:1391471[s3] title:_An Inconvenient Game (None)_>,
 <Movie id:1248130[s3] title:_An Inconvenient Head (None)_>,
 <Movie id:1283287[s3] title:_An Inconvenient Penguin 

In [7]:
moviez = ia.get_movie('4700756')
moviez['kind']

'movie'

In [10]:
moviez['genres']

['drama']

In [1]:
import pandas as pd

scraped_data_df = pd.read_csv('data/scraped_data.csv', converters={'number': str})

In [2]:
scraped_data_df.dtypes

number                      object
year                         int64
title                       object
domestic_box_office          int64
international_box_office     int64
production_budget            int64
domestic_release_date       object
domestic_distributor        object
mpaa_rating                 object
franchise                   object
source                      object
genre                       object
production_method           object
creative_type               object
production_companies        object
production_countries        object
dtype: object

In [5]:
len(scraped_data_df[scraped_data_df['domestic_box_office']>500000])

1077

In [1]:
from imdb import IMDb
from random import randint
import time

ib = IMDb(accessSystem = 'http', adultSearch = False)



In [4]:
movie_http = ib.get_movie('5013056')

In [5]:
movie_http['genres']

['Action', 'Drama', 'History', 'Thriller', 'War']

In [7]:
import pandas as pd

master_list_df_reconstituted_again = pd.read_csv('data/india_release_check_v16.csv')

In [8]:
from imdb import IMDb
from random import randint
import time

ib = IMDb(accessSystem = 'http', adultSearch = False)

for index, row in master_list_df_reconstituted_again.iterrows():
    
    number = row['number']
    id_imdb = row['id_imdb']
    id_imdb = str(id_imdb)
    print(number)

    try:
        movie_http = ib.get_movie(id_imdb)

        delay = randint(6,9)
        time.sleep(delay)

        genres_raw = movie_http['genres']
        genres_imdb = ','.join(genres_raw)

        master_list_df_reconstituted_again.loc[master_list_df_reconstituted_again['number'] == number,'genres_imdb_extended'] = genres_imdb
    
    except:
        print('error for page number {}'.format(number))

master_list_df_reconstituted_again.to_csv('data/india_release_check_v17.csv', index=False)

print('done and done')

3867
164
12
1834
851
3023
2895
3986
3074
217
3899
2927
837
3918
1786
2994
2903
1798
3104
2991
800
173
868
908
2896
5344
874
4015
5177
72
2987
1777
5503
5118
902
39
52
1753
69
2905
830
1877
40
2906
22
833
18
3874
3014
103
130
160
2936
2956
129


2018-11-01 16:09:25,347 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt1814621/plotsummary', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(gaierror(-2, 'Name or service not known'),)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chu

3007
3100
797
176
3992
3858
3037
2977
3024


2018-11-01 16:12:18,796 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt1374989/plotsummary', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(gaierror(-2, 'Name or service not known'),)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chu

102
3907
1781
1737
2911
2935
2899
98
191
2909
1856
3953
68
3003


2018-11-01 16:17:17,153 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt1712261/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(gaierror(-2, 'Name or service not known'),)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunk

error for page number 3003
1754
1828
1787
1790
64
2986
7
106
5434
3941
28
933
809
5062
848
5183
1772
63
3004
3924
3022
3939
3047
2978
1766
1801
812
1760
1823
24
3131
182
875
790
834
872
1905
2941
889
126
3947
141
62
3875


2018-11-01 16:33:17,774 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt3450958/plotsummary', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(timeout('timed out',),)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)

131
3905
1768
23
793
869
862
1807
3909
2928
1784
132
60
2988
3029
1740
915
139
827
2886
798
2979
84
5431
3911
1812
3942
1831
190
840
3973
1811
144
791
3119
2998
882
1842
838
895
3886
3881
954
792
31
1771
5432
2971
2922
1
1763
847
1880
55
2901
1873
1830
2947
5262
1803
5260
119
5257
9
1741
95
109
852
3033
2919
1843
3866
53
892
3903
3887
2972
3015
5436
3948
2931
8
3966
913
1858
2970
34
5259
802
930
3872
3891


2018-11-01 17:03:08,958 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt3890160/plotsummary', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(gaierror(-2, 'Name or service not known'),)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chu

3026
3958
796
1836
2997


2018-11-01 17:04:48,891 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt1389139/plotsummary', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(gaierror(-2, 'Name or service not known'),)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chu

5506
946
30


2018-11-01 17:05:45,983 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt0848537/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(gaierror(-2, 'Name or service not known'),)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunk

2018-11-01 17:05:56,006 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/__init__.py:714: caught an exception retrieving or parsing "plot" info set for mopID "0848537" (accessSystem: http)
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1026, in _send_output
    self.send(msg)
  File "/home/shijith/anaconda3/lib/python

error for page number 30
3077
2965
934
1826
67
2929
845
3869
177
25
2944
3997
5067
2967
5337
27
2939
2897
813
1742
1813
1875
3883
21
2933
3884
2940
841
794
1756
1915
134
861
883
151
888
1804
829
4
2964
3951
3908
150
805
801
105
1770
820
1853
879
3034
3882
1789


2018-11-01 17:22:32,695 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt1823672/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(gaierror(-2, 'Name or service not known'),)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunk

error for page number 1789
1773
3946
3018


2018-11-01 17:23:58,141 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt2547584/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(timeout('timed out',),)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
 

2018-11-01 17:24:28,286 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/__init__.py:714: caught an exception retrieving or parsing "plot" info set for mopID "2547584" (accessSystem: http)
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1026, in _send_output
    self.send(msg)
  File "/home/shijith/anaconda3/lib/python

error for page number 3018
1800
878
1872
3865
3058
5433
2955


2018-11-01 17:26:51,864 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt1292566/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': timeout('The read operation timed out',)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/site-packages/imdb/parser/http/__init__.py", line 232, in retrieve_unicode
    content = response.read()
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 456, in read
    return self._readall_chunked()
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 563, in _readall_chunked
    chunk_left = self._get_chunk_left()
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 546, in _get_chunk_left
    chunk_left = self._read_next_chunk_size()
  File "/home/shijith/anaconda3/lib/python3.6/http/cli

error for page number 2955
3972
937
963
825
1822
3945
5063
3863


2018-11-01 17:30:16,039 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt3896198/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(timeout('timed out',),)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
 

2018-11-01 17:30:46,073 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/__init__.py:714: caught an exception retrieving or parsing "plot" info set for mopID "3896198" (accessSystem: http)
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1026, in _send_output
    self.send(msg)
  File "/home/shijith/anaconda3/lib/python

error for page number 3863
3043


2018-11-01 17:31:29,482 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt2649554/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(timeout('timed out',),)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
 

2018-11-01 17:31:39,506 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/__init__.py:714: caught an exception retrieving or parsing "plot" info set for mopID "2649554" (accessSystem: http)
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1026, in _send_output
    self.send(msg)
  File "/home/shijith/anaconda3/lib/python

error for page number 3043
884
5187


2018-11-01 17:32:30,494 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt7153766/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(gaierror(-2, 'Name or service not known'),)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunk

2018-11-01 17:32:40,520 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/__init__.py:714: caught an exception retrieving or parsing "plot" info set for mopID "7153766" (accessSystem: http)
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1026, in _send_output
    self.send(msg)
  File "/home/shijith/anaconda3/lib/python

error for page number 5187
2948


2018-11-01 17:32:56,547 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt4094724/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(gaierror(-2, 'Name or service not known'),)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunk

2018-11-01 17:33:06,571 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/__init__.py:714: caught an exception retrieving or parsing "plot" info set for mopID "4094724" (accessSystem: http)
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1026, in _send_output
    self.send(msg)
  File "/home/shijith/anaconda3/lib/python

error for page number 2948
922


2018-11-01 17:33:25,595 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt1791528/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(gaierror(-2, 'Name or service not known'),)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunk

2018-11-01 17:33:35,620 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/__init__.py:714: caught an exception retrieving or parsing "plot" info set for mopID "1791528" (accessSystem: http)
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1026, in _send_output
    self.send(msg)
  File "/home/shijith/anaconda3/lib/python

error for page number 922
1739
1751
14
3984
3879
50


2018-11-01 17:35:50,367 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt2302755/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(timeout('timed out',),)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
 

error for page number 50
860
73
890
3950
5508
1849
3922


2018-11-01 17:38:12,947 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt5816682/plotsummary', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(gaierror(-2, 'Name or service not known'),)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chu

3980
3055
5504
901
3002


2018-11-01 17:40:01,429 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt3381008/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(timeout('timed out',),)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
 

error for page number 3002
1899
2996
5182
815
788
3868
1749
5
3031


2018-11-01 17:43:45,953 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt2980210/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(timeout('timed out',),)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
 

error for page number 3031
914
1869
85
3927
3038
3093
926
4036
2961
70
1824
3952
806
1755
3954
1818
1797
885
3885
5064
1744
2958
887


2018-11-01 17:51:37,430 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt2870612/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(timeout('timed out',),)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
 

error for page number 887
3913
1857
975


2018-11-01 17:52:59,981 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt2167266/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(timeout('timed out',),)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
 

error for page number 975
5180
2926
2993
1795
2894
47
1839
3112
5427
3962
3871
3981
37
817
5174
93
1757
13
57
3964


2018-11-01 17:59:55,442 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt4695012/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(timeout('timed out',),)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
 

error for page number 3964
2904
2951
5268
88
3969
865
81


2018-11-01 18:02:29,823 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt0765446/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(timeout('timed out',),)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
 

error for page number 81
56
918
1802
3944
11
864
1775


2018-11-01 18:05:46,914 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt2381941/plotsummary', 'proxy': '', 'exception type': 'IOError', 'original exception': timeout('The read operation timed out',)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/site-packages/imdb/parser/http/__init__.py", line 232, in retrieve_unicode
    content = response.read()
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 456, in read
    return self._readall_chunked()
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 563, in _readall_chunked
    chunk_left = self._get_chunk_left()
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 546, in _get_chunk_left
    chunk_left = self._read_next_chunk_size()
  File "/home/shijith/anaconda3/lib/python3.6/http/c

3910


2018-11-01 18:06:54,273 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt5308322/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': timeout('The read operation timed out',)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/site-packages/imdb/parser/http/__init__.py", line 232, in retrieve_unicode
    content = response.read()
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 456, in read
    return self._readall_chunked()
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 563, in _readall_chunked
    chunk_left = self._get_chunk_left()
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 546, in _get_chunk_left
    chunk_left = self._read_next_chunk_size()
  File "/home/shijith/anaconda3/lib/python3.6/http/cli

error for page number 3910
871


2018-11-01 18:08:20,597 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt0365907/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(timeout('_ssl.c:825: The handshake operation timed out',),)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_bo

error for page number 871
1792
5439
20
824
2890
3976
5176


2018-11-01 18:11:11,411 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt2557478/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(timeout('timed out',),)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
 

2018-11-01 18:11:41,476 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/__init__.py:714: caught an exception retrieving or parsing "plot" info set for mopID "2557478" (accessSystem: http)
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1026, in _send_output
    self.send(msg)
  File "/home/shijith/anaconda3/lib/python

error for page number 5176
1747
110
5338
185
5181
876
893
5507
1902
3977
5173
78
82
1859
1750
2968
3861
117
1806
3970


2018-11-01 18:18:57,598 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt3922818/plotsummary', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(timeout('timed out',),)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)

2892
905
5515
1758
3893
2930


2018-11-01 18:21:15,976 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt4624424/plotsummary', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(gaierror(-2, 'Name or service not known'),)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chu

2902
1814
2957
941
2953
2975
828


2018-11-01 18:23:44,939 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt2203939/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(gaierror(-2, 'Name or service not known'),)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunk

error for page number 828
1855
3870
90
821


2018-11-01 18:25:42,115 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt1234721/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(timeout('timed out',),)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
 

error for page number 821
2889
3888
3892
158
5123
3076


2018-11-01 18:28:35,891 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt4193394/plotsummary', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(timeout('timed out',),)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)

3


2018-11-01 18:29:13,042 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt1690953/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(timeout('timed out',),)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
 

2018-11-01 18:29:43,228 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/__init__.py:714: caught an exception retrieving or parsing "plot" info set for mopID "1690953" (accessSystem: http)
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1026, in _send_output
    self.send(msg)
  File "/home/shijith/anaconda3/lib/python

error for page number 3
2966
831
2945
1845
823
5438
3877
1769
2907
5510


2018-11-01 18:32:46,406 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt7424200/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(gaierror(-2, 'Name or service not known'),)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunk

error for page number 5510
849
5066
856


2018-11-01 18:34:03,882 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt2980648/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(timeout('timed out',),)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
 

error for page number 856
1953
2916
2891
3934


2018-11-01 18:35:41,547 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt5462602/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': timeout('The read operation timed out',)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/site-packages/imdb/parser/http/__init__.py", line 232, in retrieve_unicode
    content = response.read()
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 456, in read
    return self._readall_chunked()
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 563, in _readall_chunked
    chunk_left = self._get_chunk_left()
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 546, in _get_chunk_left
    chunk_left = self._read_next_chunk_size()
  File "/home/shijith/anaconda3/lib/python3.6/http/cli

error for page number 3934
2898
2908


2018-11-01 18:36:58,329 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt1679335/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(gaierror(-2, 'Name or service not known'),)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunk

error for page number 2908
5117
3082
3101
965
886
3994
3862
1785
145
1746
866
858
814
120
29
5190
3860
114
3912
1752
3890
2984
2950


2018-11-01 18:45:17,840 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt2304933/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': timeout('The read operation timed out',)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/site-packages/imdb/parser/http/__init__.py", line 232, in retrieve_unicode
    content = response.read()
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 456, in read
    return self._readall_chunked()
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 563, in _readall_chunked
    chunk_left = self._get_chunk_left()
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 546, in _get_chunk_left
    chunk_left = self._read_next_chunk_size()
  File "/home/shijith/anaconda3/lib/python3.6/http/cli

error for page number 2950
854
2954
3063
1851
89
2900
5430
1819
1861
816
1767
3876
3889
5339
1840
1865
2921
71
795
124


2018-11-01 18:53:12,415 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt2209418/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(timeout('timed out',),)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
 

error for page number 124
811
4013
5120
92
3926
41
853
3916
2924
1893
127
5336
1825
147
3109
2989
5121
61
803
1815
844
2
36
5509
3938
45
3919
891
1816
2888
909
1907
3930
172
2923
804
4006


2018-11-01 19:05:14,199 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt1412528/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(timeout('timed out',),)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
 

error for page number 4006
5258
3959
5428
138
101
4080
3895
3915
956
5122
2960
842
118
819
1794
1743
859
807
1832
3864
899
3071
1765
26
1774
1745
15
16
4012
1844
810
1783
1847
3878
1864
1776
3059
2981
5065
881
1881
1846
808
5186
910
2932
826
863
1919
896
787
1793
3873
2893
2963


2018-11-01 19:24:30,755 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt3717252/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': timeout('The read operation timed out',)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/site-packages/imdb/parser/http/__init__.py", line 232, in retrieve_unicode
    content = response.read()
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 456, in read
    return self._readall_chunked()
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 566, in _readall_chunked
    value.append(self._safe_read(chunk_left))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 612, in _safe_read
    chunk = self.fp.read(min(amt, MAXAMOUNT))
  File "/home/shijith/anaconda3/lib/python3.6/socket.

error for page number 2963
122


2018-11-01 19:25:48,507 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt2034139/plotsummary', 'proxy': '', 'exception type': 'IOError', 'original exception': timeout('The read operation timed out',)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/site-packages/imdb/parser/http/__init__.py", line 232, in retrieve_unicode
    content = response.read()
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 456, in read
    return self._readall_chunked()
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 563, in _readall_chunked
    chunk_left = self._get_chunk_left()
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 546, in _get_chunk_left
    chunk_left = self._read_next_chunk_size()
  File "/home/shijith/anaconda3/lib/python3.6/http/c

2887
2990
3032
3921
789
855
857
33
897
107
1788
59
2959
1913
83
5178
2938
3030
835
5188


2018-11-01 19:33:08,090 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt5360952/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(timeout('timed out',),)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
 

2018-11-01 19:33:38,153 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/__init__.py:714: caught an exception retrieving or parsing "plot" info set for mopID "5360952" (accessSystem: http)
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1026, in _send_output
    self.send(msg)
  File "/home/shijith/anaconda3/lib/python

error for page number 5188
4052
1817
5184
6
38


2018-11-01 19:35:42,855 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt1428538/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(timeout('timed out',),)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
 

2018-11-01 19:36:13,049 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/__init__.py:714: caught an exception retrieving or parsing "plot" info set for mopID "1428538" (accessSystem: http)
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1026, in _send_output
    self.send(msg)
  File "/home/shijith/anaconda3/lib/python

error for page number 38
1738


2018-11-01 19:36:51,186 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt2820852/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(timeout('timed out',),)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
 

error for page number 1738
1761
10
77
108
4025


2018-11-01 19:39:15,652 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt5884230/plotsummary', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(timeout('timed out',),)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)

5442
3062
3145
19
4000
1955
3137
3968
870
3904
3940
1866
3906
5125
192
940
4070
5119
4024
1805
3045
3961
115
1796
3127
3996
4026
3016
3143


2018-11-01 19:48:29,120 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt4687276/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': timeout('The read operation timed out',)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/site-packages/imdb/parser/http/__init__.py", line 232, in retrieve_unicode
    content = response.read()
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 456, in read
    return self._readall_chunked()
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 563, in _readall_chunked
    chunk_left = self._get_chunk_left()
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 546, in _get_chunk_left
    chunk_left = self._read_next_chunk_size()
  File "/home/shijith/anaconda3/lib/python3.6/http/cli

error for page number 3143
977
3900
1890


2018-11-01 19:49:39,370 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt3859076/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(timeout('timed out',),)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
 

error for page number 1890
2942
3932
153
4022
1923


2018-11-01 19:51:41,990 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt3172532/plotsummary', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(timeout('timed out',),)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)

2920


2018-11-01 19:52:18,107 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt4846340/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(timeout('timed out',),)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
 

error for page number 2920
2976
1926
46
2980
948


2018-11-01 19:54:09,637 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt2910274/plotsummary', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(timeout('timed out',),)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)

2985
5261
3008
843
4002
958
149
2974
188
111
3880
4043
1868
133


2018-11-01 19:58:40,495 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt1413495/plotsummary', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(timeout('timed out',),)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)

3935


2018-11-01 19:59:19,709 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt1389072/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(timeout('timed out',),)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
 

error for page number 3935
3982
943
5263


2018-11-01 20:00:40,156 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt0859635/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(timeout('timed out',),)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
 

error for page number 5263
4045
44
1778
4017
3999
2915
5441
4010


2018-11-01 20:03:27,442 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt5093026/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': timeout('The read operation timed out',)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/site-packages/imdb/parser/http/__init__.py", line 231, in retrieve_unicode
    response = uopener.open(url)
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 532, in open
    response = meth(req, response)
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 564, in error
    result = self._call_chain(*args)
  File "/home/shijith/anaconda3/lib/python3.6/urllib/req

2018-11-01 20:03:57,592 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/__init__.py:714: caught an exception retrieving or parsing "plot" info set for mopID "5093026" (accessSystem: http)
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1026, in _send_output
    self.send(msg)
  File "/home/shijith/anaconda3/lib/python

error for page number 4010
4040
951
2983
5349
4041
4059
1837
5272
49
4081


2018-11-01 20:07:06,598 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt5805752/plotsummary', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(gaierror(-2, 'Name or service not known'),)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chu

2917
4065
5189


2018-11-01 20:08:18,256 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt6652708/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': timeout('The read operation timed out',)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/site-packages/imdb/parser/http/__init__.py", line 231, in retrieve_unicode
    response = uopener.open(url)
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 532, in open
    response = meth(req, response)
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 564, in error
    result = self._call_chain(*args)
  File "/home/shijith/anaconda3/lib/python3.6/urllib/req

error for page number 5189
4060
1850
1932
3949
5185
1939
900
2973
4053
1005


2018-11-01 20:12:04,947 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt2304459/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': timeout('The read operation timed out',)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/site-packages/imdb/parser/http/__init__.py", line 232, in retrieve_unicode
    content = response.read()
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 456, in read
    return self._readall_chunked()
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 563, in _readall_chunked
    chunk_left = self._get_chunk_left()
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 546, in _get_chunk_left
    chunk_left = self._read_next_chunk_size()
  File "/home/shijith/anaconda3/lib/python3.6/http/cli

error for page number 1005
166
74
3967
3039
142
1950
3052
1951
1848
3960
3991


2018-11-01 20:16:28,818 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt5804314/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': timeout('The read operation timed out',)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/site-packages/imdb/parser/http/__init__.py", line 232, in retrieve_unicode
    content = response.read()
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 456, in read
    return self._readall_chunked()
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 563, in _readall_chunked
    chunk_left = self._get_chunk_left()
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 546, in _get_chunk_left
    chunk_left = self._read_next_chunk_size()
  File "/home/shijith/anaconda3/lib/python3.6/http/cli

error for page number 3991
3859
3978
822


2018-11-01 20:18:25,532 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt2084970/plotsummary', 'proxy': '', 'exception type': 'IOError', 'original exception': timeout('The read operation timed out',)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/site-packages/imdb/parser/http/__init__.py", line 232, in retrieve_unicode
    content = response.read()
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 456, in read
    return self._readall_chunked()
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 563, in _readall_chunked
    chunk_left = self._get_chunk_left()
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 546, in _get_chunk_left
    chunk_left = self._read_next_chunk_size()
  File "/home/shijith/anaconda3/lib/python3.6/http/c

2937


2018-11-01 20:19:01,676 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt4160708/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(timeout('timed out',),)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
 

error for page number 2937
3965
3107
1879
3067
3928


2018-11-01 20:21:00,310 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt4131800/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(timeout('timed out',),)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
 

error for page number 3928
3057
1878


2018-11-01 20:22:01,289 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt1018765/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(timeout('_ssl.c:825: The handshake operation timed out',),)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_bo

2018-11-01 20:22:31,413 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/__init__.py:714: caught an exception retrieving or parsing "plot" info set for mopID "1018765" (accessSystem: http)
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1026, in _send_output
    self.send(msg)
  File "/home/shijith/anaconda3/lib/python

error for page number 1878
3894


2018-11-01 20:23:07,562 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt1753383/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(timeout('timed out',),)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
 

error for page number 3894
5264
121
947
123
5179
832
5124
4009
42
170
2962
5443
1827
982
1871
5266
818
5175
2934
1898
1852
169
894
1929
3929
3092
923
1970
3988
3998
3914
5346
1944
1954
3917
5516
1820
1780
920
100
1829
4032
911
2952
3931
3925
157
3936
3963
5512
3021
5505
1924
3013
928
66
3025
48
143
4029
162
5192
1759
985
3957
4075
919
799
2913
187
3017
79
94
936
978
938
43
1808
2914
931
846
4056
4046
2999
128
3001
2912
4020
3020
906
159
1779
3920
4031
2946
898
974
3000
1791
3091
3011
51
3068
5191
997
2910
4014
3040
3089
3050
1910
1003
1960
4091
3896
3009
1981
839
5513
4051
3036
953
140
1969
3989
3857
1863
1833
99
5340
3990
86
935
3049
1841
3096
3898
1870
1948
5267
5265
58
3087
1887
5269
32
3027
877
1894
1782
4049
3897
215
995
76
971
5068
5440


2018-11-01 21:09:57,823 CRITICAL [imdbpy] /home/shijith/anaconda3/lib/python3.6/site-packages/imdb/_exceptions.py:34: IMDbDataAccessError exception raised; args: ({'errcode': None, 'errmsg': 'None', 'url': 'http://www.imdb.com/title/tt6212478/reference', 'proxy': '', 'exception type': 'IOError', 'original exception': URLError(gaierror(-2, 'Name or service not known'),)},); kwds: {}
Traceback (most recent call last):
  File "/home/shijith/anaconda3/lib/python3.6/urllib/request.py", line 1318, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/shijith/anaconda3/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunk

error for page number 5440
2918
917
5514
980
987
5511
1764
3070
3979
189
1838
3051
3078
4021
1952
3902
993
3044
3901
950
932
1904
1799
1909
179
3048
925
175
5341
3005
186
2969
3975
2925
35
1748
54
1918
3955
3937
168
96
1874
80
3046
91
1975
163
1867
968
1762
5345
5348
1886
5437
97
2995
1821
5342
1985
3933
1941
836
135
17
3120
3072
1937
5429
867
921
104
850
3019
903
1736
2982
3943
4035
1835
2949
4008
3923
880
197
4050
5583
5574
5590
5586
5567
5569
5580
5568
5591
5579
5570
5582
5571
5593
5581
5578
5585
5575
5573
5576
5572
5577
done and done


In [6]:
import pandas as pd

master_list_df_reconstituted_again = pd.read_csv('data/india_release_check_v18.csv')

master_list_df_reconstituted_again.head()

Unnamed: 0,number,year,title,domestic_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,source,...,production_companies,production_countries,released_in_india_2nd_check,production_budget_bins,domestic_box_office_bins,director,genres_imdb_final,languages_imdb,id_imdb,genres_imdb_extended
0,3867,2017,Pirates of the Caribbean: Dead Men Tell No Tales,172558876,230000000,2017-05-26,Walt Disney,PG-13,Pirates of the Caribbean,Based on Theme Park Ride,...,"Walt Disney Pictures,Jerry Bruckheimer",United States,yes,225m-250m,150m-200m,"Joachim Ronnin,Espen Sandberg","action,adventure,fantasy","English,Spanish",1790809,"Action,Adventure,Fantasy"
1,164,2013,The East,2274649,6500000,2013-05-31,Fox Searchlight,PG-13,,Original Screenplay,...,Scott Free Films,United States,yes,100k-25m,50k-50m,Zal Batmanglij,"adventure,drama,thriller","English,American Sign Language",1869716,"Adventure,Drama,Thriller"
2,12,2013,World War Z,202359711,190000000,2013-06-21,Paramount Pictures,PG-13,World War Z,Based on Fiction Book/Short Story,...,"Skydance Productions,Hemisphere Media Capital,...",United States,yes,175m-200m,200m-250m,Marc Forster,"action,adventure,horror","English,Spanish,Hebrew,Arabic",816711,"Action,Adventure,Horror,Sci-Fi,Thriller"
3,1834,2015,Ricki and the Flash,26839498,18000000,2015-08-07,Sony Pictures,PG-13,,Original Screenplay,...,"Marc Platt Productions,Badwill Entertainment ,...",United States,yes,100k-25m,50k-50m,Jonathan Demme,"comedy,drama,music",English,3623726,"Comedy,Drama,Music"
4,851,2014,Transcendence,23022309,100000000,2014-04-18,Warner Bros.,PG-13,,Original Screenplay,...,"Straight Up Films,DMG Entertainment",United States,yes,100m-125m,50k-50m,Wally Pfister,"drama,mystery,romance",English,2209764,"Drama,Mystery,Romance,Sci-Fi,Thriller"


In [7]:
#note that the languages from the http website are capitalised

master_list_df_reconstituted_again['genres_imdb_extended'].isnull().sum().sum()

1

In [8]:
from imdb import IMDb
from random import randint
import time

ib = IMDb(accessSystem = 'http', adultSearch = False)

for index, row in master_list_df_reconstituted_again.iterrows():
    
    genres_ext_value = row['genres_imdb_extended']
    
    if str(genres_ext_value) == 'nan':

        number = row['number']
        id_imdb = row['id_imdb']
        id_imdb = str(id_imdb)
        print(number)

        try:
            movie_http = ib.get_movie(id_imdb)

            delay = randint(6,9)
            time.sleep(delay)

            genres_raw = movie_http['genres']
            genres_imdb = ','.join(genres_raw)

            master_list_df_reconstituted_again.loc[master_list_df_reconstituted_again['number'] == number,'genres_imdb_extended'] = genres_imdb

        except:
            print('error for page number {}'.format(number))
    
    else:
        continue

master_list_df_reconstituted_again.to_csv('data/india_release_check_v19.csv', index=False)

print('done and done')

2937
done and done


In [None]:
import pandas as pd

master_list_df_reconstituted_again = pd.read_csv('data/india_release_check_v19.csv')

master_list_df_reconstituted_again.head()

In [9]:
genres_imdb_list = list(master_list_df_reconstituted_again['genres_imdb_extended'].unique())
print(len(genres_imdb_list))
genres_imdb_list = [x for x in genres_imdb_list if str(x) != 'nan']
final_list = []
for genres_imdb in genres_imdb_list:
    list_to_add = [x.strip() for x in genres_imdb.split(",")]
    final_list.extend(list_to_add)
    
final_list = list(set(final_list))
print(len(final_list))

329
20


In [10]:
#movies from which genre do we get to see the least in India?

category = 'genres_imdb_extended'

df = pd.DataFrame(columns=["category", '2013','2014','2015','2016','2017','2018','total not','total','2013 %','2014 %','2015 %','2016 %','2017 %','2018 %','total not %'])

for category_name in final_list:
    
    df = df.append({"category": category_name,
                    '2013': None,
                    '2014': None,
                    '2015':None,
                    '2016':None,
                    '2017':None,
                    '2018':None,
                    'total not':None,
                    'total':None,
                    '2013 %':None,
                    '2014 %':None,
                    '2015 %':None,
                    '2016 %':None,
                    '2017 %':None,
                    '2018 %':None,
                    'total not %':None
                    }, ignore_index=True)
    
    movies_in_category_name_total_released_india_no = len(master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again[category].str.contains(category_name, na=False, regex=False)) &
                                                                            (master_list_df_reconstituted_again['released_in_india_2nd_check'] != 'yes')])

    movies_in_category_name_total = len(master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again[category].str.contains(category_name, na=False, regex=False))])

    movies_in_category_name_total_released_india_no_percent = (movies_in_category_name_total_released_india_no/movies_in_category_name_total)*100
    movies_in_category_name_total_released_india_no_percent = round(movies_in_category_name_total_released_india_no_percent,1)
    
    df.loc[df['category'] == category_name, "total"] = int(movies_in_category_name_total)
    df.loc[df['category'] == category_name, "total not" ] = int(movies_in_category_name_total_released_india_no)
    df.loc[df['category'] == category_name, 'total not %' ] = movies_in_category_name_total_released_india_no_percent

    for year in range(2013,2019):
        
        year_string = str(year)
        
        year_percent_string = str(year) + ' %'
        
        movies_in_category_name_released_india_no = len(master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again[category].str.contains(category_name, na=False, regex=False))  &
                                                                            (master_list_df_reconstituted_again['released_in_india_2nd_check'] != 'yes') &
                                                                            (master_list_df_reconstituted_again['year'] == year)])

        movies_in_category_name = len(master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again[category].str.contains(category_name, na=False, regex=False)) &
                                                                                       (master_list_df_reconstituted_again['year'] == year) ])
        if movies_in_category_name != 0:
            movies_in_category_name_released_india_no_percent = (movies_in_category_name_released_india_no/movies_in_category_name)*100
            movies_in_category_name_released_india_no_percent = round(movies_in_category_name_released_india_no_percent,1)
            
            df.loc[df['category'] == category_name, year_string ] = movies_in_category_name_released_india_no
            df.loc[df['category'] == category_name, year_percent_string ] = movies_in_category_name_released_india_no_percent
            
df.sort_values('total not %',ascending=False)

#you should have a caveat that these are all the movies that earned at least $500,000 at the US box office

Unnamed: 0,category,2013,2014,2015,2016,2017,2018,total not,total,2013 %,2014 %,2015 %,2016 %,2017 %,2018 %,total not %
18,Romance,7,16,8,8,7,8,54,169,26.9,40.0,25.8,23.5,31.8,50.0,32.0
6,Drama,24,29,32,18,40,34,177,570,26.4,29.9,28.1,18.0,36.4,58.6,31.1
9,Sport,1,1,2,2,2,1,9,32,20.0,16.7,25.0,33.3,40.0,50.0,28.1
7,Music,2,3,1,4,2,5,17,63,25.0,20.0,8.3,33.3,25.0,62.5,27.0
5,History,2,1,3,1,3,4,14,53,50.0,16.7,21.4,7.1,30.0,80.0,26.4
3,Biography,1,2,6,8,8,5,30,117,6.2,11.8,24.0,33.3,32.0,50.0,25.6
0,Comedy,12,14,14,15,14,18,87,362,20.3,24.1,20.9,19.7,23.0,43.9,24.0
15,War,3,1,1,0,1,2,8,35,60.0,11.1,25.0,0.0,14.3,66.7,22.9
11,Western,0,1,0,0,0,1,2,11,0.0,50.0,0.0,0.0,0.0,100.0,18.2
4,Crime,4,1,1,4,4,11,25,160,11.8,4.2,3.8,15.4,15.4,45.8,15.6


In [None]:
#are there some tags that appear more often for movies? Have a feeling certain tages like 'drama' get overused and get penalised because so much gets put under it

In [16]:
master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again['genres_imdb_extended'].str.contains('Sport', na=False, regex=False)) &
                                      (master_list_df_reconstituted_again['released_in_india_2nd_check'] != 'yes')].sort_values('domestic_box_office',ascending=False)

Unnamed: 0,number,year,title,domestic_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,source,...,production_companies,production_countries,released_in_india_2nd_check,production_budget_bins,domestic_box_office_bins,director,genres_imdb_final,languages_imdb,id_imdb,genres_imdb_extended
945,1867,2015,Woodlawn,14394097,13000000,2015-10-16,Pure Flix Entertainment,PG,,Based on Real Life Events,...,"Pure Flix,Provident Films",United States,no,100k-25m,50k-50m,"The Erwin Brothers,The Erwin Brothers","drama,sport",English,4183692,"Drama,Sport"
879,5265,2018,The Miracle Season,10230620,0,2018-04-06,LD Entertainment,PG,,Based on Real Life Events,...,LD Entertainment,United States,no,no info,50k-50m,Sean McNamara,"drama,sport",English,5427194,"Drama,Sport"
937,168,2013,Home Run,2859955,1200000,2013-04-19,Samuel Goldwyn Films,PG-13,,Original Screenplay,...,"Impact Productions,Hero Productions",United States,no,100k-25m,50k-50m,David Boyd,"drama,sport",English,2051894,"Drama,Sport"
791,1924,2015,My All-American,2246000,20000000,2015-11-13,Clarius Entertainment,PG,,Based on Factual Book/Article,...,"Clarius Entertainment,Anthem Ventures,Anthem P...",United States,no,100k-25m,50k-50m,Angelo Pizzo,"biography,drama,sport",English,3719896,"Biography,Drama,Sport"
838,3091,2016,Greater,2000093,9000000,2016-08-26,Hammond Entertainment,PG,,Based on Real Life Events,...,"Hammond Entertainment LLC,Greater Productions LLC",United States,no,100k-25m,50k-50m,David Hunt,"biography,family,sport",English,2950418,"Biography,Family,Sport"
889,4049,2017,Slamma Jamma,1687000,0,2017-03-24,Riverrain,PG,,Original Screenplay,...,,United States,no,no info,50k-50m,Timothy A. Chey,"drama,sport",English,5319866,"Drama,Sport"
630,3137,2016,The Bronze,615816,3500000,2016-03-18,Sony Pictures Classics,R,,Original Screenplay,...,"Duplass Brothers,Stage 6 Films",United States,no,100k-25m,50k-50m,Bryan Buckley,"comedy,drama,sport",English,3859304,"Comedy,Drama,Sport"
804,4075,2017,Tommy's Honour,569306,0,2017-04-14,Roadside Attractions,PG,,Based on Factual Book/Article,...,"Gutta Percha Productions,Creative Scotland,Tim...",United States,no,no info,50k-50m,Jason Connery,"biography,drama,romance",English,3467914,"Biography,Drama,Romance,Sport"
715,1005,2014,23 Blast,549185,1000000,2014-10-24,Abramorama Films,PG-13,,Original Screenplay,...,,United States,no,100k-25m,50k-50m,Dylan Baker,"drama,family,sport",English,2304459,"Drama,Family,Sport"


In [12]:
master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again['genres_imdb_extended'].str.contains('Drama', na=False, regex=False))].sort_values('domestic_box_office',ascending=False)

Unnamed: 0,number,year,title,domestic_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,source,...,production_companies,production_countries,released_in_india_2nd_check,production_budget_bins,domestic_box_office_bins,director,genres_imdb_final,languages_imdb,id_imdb,genres_imdb_extended
391,2890,2016,The Jungle Book,364001123,175000000,2016-04-15,Walt Disney,PG,,Based on Fiction Book/Short Story,...,"Walt Disney Pictures,Fairview Entertainment",United States,yes,175m-200m,350m-400m,Jon Favreau,"adventure,drama,family",English,3040964,"Adventure,Drama,Family,Fantasy"
241,1742,2015,Inside Out,356461711,175000000,2015-06-19,Walt Disney,PG,,Original Screenplay,...,Disney-Pixar,United States,yes,175m-200m,350m-400m,Pete Docter,"adventure,animation,comedy",English,2096673,"Animation,Adventure,Comedy,Drama,Family,Fantasy"
806,799,2014,American Sniper,350126372,58000000,2014-01-16,Warner Bros.,R,,Based on Factual Book/Article,...,"Mad Chance,22nd & Indiana ,Malpaso Productions",United States,yes,50m-75m,350m-400m,Clint Eastwood,"action,biography,drama","English,Arabic",2179136,"Action,Biography,Drama,History,Thriller,War"
321,3868,2017,It,327481748,35000000,2017-09-08,Warner Bros.,R,,Remake,...,"Lin Pictures,Vertigo Entertainment,KatzSmith P...",United States,yes,25m-50m,300m-350m,Andy Muschietti,"drama,horror,thriller",English,1396484,"Drama,Horror,Thriller"
203,8,2013,Gravity,274092705,110000000,2013-10-04,Warner Bros.,PG-13,,Original Screenplay,...,"Warner Bros.,Esperanto Filmoj,Heyday Films",United States,yes,100m-125m,250m-300m,Alfonso Cuarón,"drama,sci-fi,thriller","English,Greenlandic",1454468,"Drama,Sci-Fi,Thriller"
345,1744,2015,The Martian,228433663,108000000,2015-10-02,20th Century Fox,PG-13,,Based on Fiction Book/Short Story,...,"Scott Free Films,Kinberg Genre",United States,yes,100m-125m,200m-250m,Ridley Scott,"adventure,drama,sci-fi","English,Mandarin",3659388,"Adventure,Drama,Sci-Fi"
428,3870,2017,Logan,226277068,127000000,2017-03-03,20th Century Fox,R,Wolverine,Based on Comic/Graphic Novel,...,"Marvel Studios,TSG Entertainment,Donners' Company",United States,yes,125m-150m,200m-250m,James Mangold,"action,drama,sci-fi","English,Spanish",3315342,"Action,Drama,Sci-Fi,Thriller"
57,797,2014,Big Hero 6,222527828,165000000,2014-11-07,Walt Disney,PG,,Original Screenplay,...,Walt Disney Animation Studios,United States,yes,150m-175m,200m-250m,"Don Hall,Chris Williams","action,adventure,animation",English,2245084,"Animation,Action,Adventure,Comedy,Drama,Family..."
250,794,2014,Dawn of the Planet of the Apes,208545589,170000000,2014-07-11,20th Century Fox,PG-13,Planet of the Apes,Remake,...,"Chernin Entertainment,TSG Entertainment,Ingeni...",United States,yes,150m-175m,200m-250m,Matt Reeves,"action,adventure,drama","English,American Sign Language",2103281,"Action,Adventure,Drama,Sci-Fi"
394,1747,2015,Cinderella,201151353,95000000,2015-03-13,Walt Disney,PG,,Based on Folk Tale/Legend/Fairytale,...,"Allison Shearmur,Beaglepug,Kinberg Genre",United States,yes,75m-100m,200m-250m,Kenneth Branagh,"drama,family,fantasy",English,1661199,"Drama,Family,Fantasy,Romance"


In [11]:
master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again['genres_imdb_extended'].str.contains('Drama', na=False, regex=False))].sort_values('domestic_box_office',ascending=False)

Unnamed: 0,number,year,title,domestic_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,source,...,production_companies,production_countries,released_in_india_2nd_check,production_budget_bins,domestic_box_office_bins,director,genres_imdb_final,languages_imdb,id_imdb,genres_imdb_extended
391,2890,2016,The Jungle Book,364001123,175000000,2016-04-15,Walt Disney,PG,,Based on Fiction Book/Short Story,...,"Walt Disney Pictures,Fairview Entertainment",United States,yes,175m-200m,350m-400m,Jon Favreau,"adventure,drama,family",English,3040964,"Adventure,Drama,Family,Fantasy"
241,1742,2015,Inside Out,356461711,175000000,2015-06-19,Walt Disney,PG,,Original Screenplay,...,Disney-Pixar,United States,yes,175m-200m,350m-400m,Pete Docter,"adventure,animation,comedy",English,2096673,"Animation,Adventure,Comedy,Drama,Family,Fantasy"
806,799,2014,American Sniper,350126372,58000000,2014-01-16,Warner Bros.,R,,Based on Factual Book/Article,...,"Mad Chance,22nd & Indiana ,Malpaso Productions",United States,yes,50m-75m,350m-400m,Clint Eastwood,"action,biography,drama","English,Arabic",2179136,"Action,Biography,Drama,History,Thriller,War"
321,3868,2017,It,327481748,35000000,2017-09-08,Warner Bros.,R,,Remake,...,"Lin Pictures,Vertigo Entertainment,KatzSmith P...",United States,yes,25m-50m,300m-350m,Andy Muschietti,"drama,horror,thriller",English,1396484,"Drama,Horror,Thriller"
203,8,2013,Gravity,274092705,110000000,2013-10-04,Warner Bros.,PG-13,,Original Screenplay,...,"Warner Bros.,Esperanto Filmoj,Heyday Films",United States,yes,100m-125m,250m-300m,Alfonso Cuarón,"drama,sci-fi,thriller","English,Greenlandic",1454468,"Drama,Sci-Fi,Thriller"
345,1744,2015,The Martian,228433663,108000000,2015-10-02,20th Century Fox,PG-13,,Based on Fiction Book/Short Story,...,"Scott Free Films,Kinberg Genre",United States,yes,100m-125m,200m-250m,Ridley Scott,"adventure,drama,sci-fi","English,Mandarin",3659388,"Adventure,Drama,Sci-Fi"
428,3870,2017,Logan,226277068,127000000,2017-03-03,20th Century Fox,R,Wolverine,Based on Comic/Graphic Novel,...,"Marvel Studios,TSG Entertainment,Donners' Company",United States,yes,125m-150m,200m-250m,James Mangold,"action,drama,sci-fi","English,Spanish",3315342,"Action,Drama,Sci-Fi,Thriller"
57,797,2014,Big Hero 6,222527828,165000000,2014-11-07,Walt Disney,PG,,Original Screenplay,...,Walt Disney Animation Studios,United States,yes,150m-175m,200m-250m,"Don Hall,Chris Williams","action,adventure,animation",English,2245084,"Animation,Action,Adventure,Comedy,Drama,Family..."
250,794,2014,Dawn of the Planet of the Apes,208545589,170000000,2014-07-11,20th Century Fox,PG-13,Planet of the Apes,Remake,...,"Chernin Entertainment,TSG Entertainment,Ingeni...",United States,yes,150m-175m,200m-250m,Matt Reeves,"action,adventure,drama","English,American Sign Language",2103281,"Action,Adventure,Drama,Sci-Fi"
394,1747,2015,Cinderella,201151353,95000000,2015-03-13,Walt Disney,PG,,Based on Folk Tale/Legend/Fairytale,...,"Allison Shearmur,Beaglepug,Kinberg Genre",United States,yes,75m-100m,200m-250m,Kenneth Branagh,"drama,family,fantasy",English,1661199,"Drama,Family,Fantasy,Romance"


In [17]:
master_list_df_reconstituted_again.dtypes

number                          int64
year                            int64
title                          object
domestic_box_office             int64
production_budget               int64
domestic_release_date          object
domestic_distributor           object
mpaa_rating                    object
franchise                      object
source                         object
genre                          object
production_method              object
creative_type                  object
production_companies           object
production_countries           object
released_in_india_2nd_check    object
production_budget_bins         object
domestic_box_office_bins       object
director                       object
genres_imdb_final              object
languages_imdb                 object
id_imdb                         int64
genres_imdb_extended           object
dtype: object

In [30]:
big_studio_list = ["Disney","Warner","Fox","Universal","Columbia","Sony","Paramount","Lionsgate"]

year_list = list(range(2013,2019))

df_list = []

for studiox in big_studio_list:
    
#     print(studiox)
    
    for year in year_list:
        
#         print(year)
        
        df_filtered = master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again['year'] == year) &
                                        (master_list_df_reconstituted_again['production_companies'].str.contains(studiox, na=False, regex=False))]

        try:
            df_to_append = df_filtered.ix[df_filtered['domestic_box_office'].argmax()]

            df_list.append(df_to_append)
        except:
            continue

df_appended_complete = pd.concat(df_list, axis = 1)

# df_appended_complete.reset_index(drop=True,inplace=True)

df_appended_complete.T

.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated


Unnamed: 0,number,year,title,domestic_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,source,...,production_companies,production_countries,released_in_india_2nd_check,production_budget_bins,domestic_box_office_bins,director,genres_imdb_final,languages_imdb,id_imdb,genres_imdb_extended
171,1,2013,Frozen,400738009,150000000,2013-11-27,Walt Disney,PG,Frozen,Based on Folk Tale/Legend/Fairytale,...,Walt Disney Animation Studios,United States,yes,150m-175m,400m-450m,"Chris Buck,Jennifer Lee","adventure,animation,comedy","English,Norwegian",2294629,"Animation,Adventure,Comedy,Family,Fantasy,Musical"
57,797,2014,Big Hero 6,222527828,165000000,2014-11-07,Walt Disney,PG,,Original Screenplay,...,Walt Disney Animation Studios,United States,yes,150m-175m,200m-250m,"Don Hall,Chris Williams","action,adventure,animation",English,2245084,"Animation,Action,Adventure,Comedy,Drama,Family..."
241,1742,2015,Inside Out,356461711,175000000,2015-06-19,Walt Disney,PG,,Original Screenplay,...,Disney-Pixar,United States,yes,175m-200m,350m-400m,Pete Docter,"adventure,animation,comedy",English,2096673,"Animation,Adventure,Comedy,Drama,Family,Fantasy"
529,2888,2016,Finding Dory,486295561,200000000,2016-06-17,Walt Disney,PG,Finding Nemo,Original Screenplay,...,Disney-Pixar,United States,yes,200m-225m,450m-500m,Andrew Stanton,"adventure,animation,comedy","English,Indonesian",2277860,"Animation,Adventure,Comedy,Family"
864,3857,2017,Star Wars Ep. VIII: The Last Jedi,620181382,317000000,2017-12-15,Walt Disney,PG-13,Star Wars,Original Screenplay,...,"Lucasfilm,Walt Disney Pictures",United States,yes,300m-325m,600m-650m,Rian Johnson,"action,adventure,fantasy",English,2527336,"Action,Adventure,Fantasy,Sci-Fi"
359,5427,2018,Incredibles 2,606354358,200000000,2018-06-15,Walt Disney,PG,The Incredibles,Original Screenplay,...,Disney-Pixar,United States,yes,200m-225m,600m-650m,Brad Bird,"action,adventure,animation",English,3606756,"Animation,Action,Adventure,Comedy,Family,Sci-Fi"
185,9,2013,Man of Steel,291045518,225000000,2013-06-14,Warner Bros.,PG-13,Superman,Based on Comic/Graphic Novel,...,"Warner Bros.,Legendary Pictures,Syncopy",United States,yes,225m-250m,250m-300m,Zack Snyder,"action,adventure,fantasy",English,770828,"Action,Adventure,Fantasy,Sci-Fi"
518,803,2014,The Lego Movie,257784718,60000000,2014-02-07,Warner Bros.,PG,Lego,Based on Toy,...,"Vertigo Entertainment,Lin Pictures,Warner Anim...","Australia,United States",yes,50m-75m,250m-300m,"Phil Lord,Christopher Miller","action,adventure,animation",English,1490017,"Animation,Action,Adventure,Comedy,Family,Fantasy"
94,1772,2015,Creed,109767581,37000000,2015-11-25,Warner Bros.,PG-13,Rocky,Original Screenplay,...,"Metro-Goldwyn-Mayer Pictures,Warner Bros.,New ...",United States,yes,25m-50m,100m-150m,Ryan Coogler,"drama,sport","English,Spanish",3076658,"Drama,Sport"
414,2892,2016,Batman v Superman: Dawn of Justice,330360194,250000000,2016-03-25,Warner Bros.,PG-13,Man of Steel,Based on Comic/Graphic Novel,...,"Warner Bros.,RatPac Entertainment,Dune Enterta...",United States,yes,250m-275m,300m-350m,Zack Snyder,"action,adventure,fantasy",English,2975590,"Action,Adventure,Fantasy,Sci-Fi"


In [31]:
big_distributor_list = ["Disney","Warner","Fox","Universal","Sony","Paramount","Lionsgate"]

year_list = list(range(2013,2019))

df_list = []

for distributorx in big_distributor_list:
    
#     print(studiox)
    
    for year in year_list:
        
#         print(year)
        
        df_filtered = master_list_df_reconstituted_again.loc[(master_list_df_reconstituted_again['year'] == year) &
                                        (master_list_df_reconstituted_again['domestic_distributor'].str.contains(distributorx, na=False, regex=False))]

        try:
            df_to_append = df_filtered.ix[df_filtered['domestic_box_office'].argmax()]

            df_list.append(df_to_append)
        except:
            continue

df_appended_complete = pd.concat(df_list, axis = 1)

# df_appended_complete.reset_index(drop=True,inplace=True)

df_appended_complete.T

.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated


Unnamed: 0,number,year,title,domestic_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,source,...,production_companies,production_countries,released_in_india_2nd_check,production_budget_bins,domestic_box_office_bins,director,genres_imdb_final,languages_imdb,id_imdb,genres_imdb_extended
521,2,2013,Iron Man 3,408992272,200000000,2013-05-03,Walt Disney,PG-13,Iron Man,Based on Comic/Graphic Novel,...,"Marvel Studios,Paramount Pictures,DMG Entertai...",United States,yes,200m-225m,400m-450m,Shane Black,"action,adventure,sci-fi",English,1300854,"Action,Adventure,Sci-Fi"
597,789,2014,Guardians of the Galaxy,333172112,170000000,2014-08-01,Walt Disney,PG-13,Marvel Cinematic Universe,Based on Comic/Graphic Novel,...,Marvel Studios,United States,yes,150m-175m,300m-350m,James Gunn,"action,adventure,comedy",English,2015381,"Action,Adventure,Comedy,Sci-Fi"
972,1736,2015,Star Wars Ep. VII: The Force Awakens,936662225,306000000,2015-12-18,Walt Disney,PG-13,Star Wars,Original Screenplay,...,"Lucasfilm,Bad Robot",United States,yes,300m-325m,900m-950m,J.J. Abrams,"action,adventure,fantasy",English,2488496,"Action,Adventure,Fantasy,Sci-Fi"
593,2887,2016,Rogue One: A Star Wars Story,532177324,200000000,2016-12-16,Walt Disney,PG-13,Star Wars,Spin-Off,...,Lucasfilm,United States,yes,200m-225m,500m-550m,Gareth Edwards,"action,adventure,sci-fi",English,3748528,"Action,Adventure,Sci-Fi"
864,3857,2017,Star Wars Ep. VIII: The Last Jedi,620181382,317000000,2017-12-15,Walt Disney,PG-13,Star Wars,Original Screenplay,...,"Lucasfilm,Walt Disney Pictures",United States,yes,300m-325m,600m-650m,Rian Johnson,"action,adventure,fantasy",English,2527336,"Action,Adventure,Fantasy,Sci-Fi"
457,5117,2018,Black Panther,700059566,200000000,2018-02-16,Walt Disney,PG-13,Marvel Cinematic Universe,Based on Comic/Graphic Novel,...,Marvel Studios,United States,yes,200m-225m,700m-750m,Ryan Coogler,"action,adventure,sci-fi","Swahili,Nama,English,Xhosa,Korean",1825683,"Action,Adventure,Sci-Fi"
185,9,2013,Man of Steel,291045518,225000000,2013-06-14,Warner Bros.,PG-13,Superman,Based on Comic/Graphic Novel,...,"Warner Bros.,Legendary Pictures,Syncopy",United States,yes,225m-250m,250m-300m,Zack Snyder,"action,adventure,fantasy",English,770828,"Action,Adventure,Fantasy,Sci-Fi"
806,799,2014,American Sniper,350126372,58000000,2014-01-16,Warner Bros.,R,,Based on Factual Book/Article,...,"Mad Chance,22nd & Indiana ,Malpaso Productions",United States,yes,50m-75m,350m-400m,Clint Eastwood,"action,biography,drama","English,Arabic",2179136,"Action,Biography,Drama,History,Thriller,War"
299,1751,2015,San Andreas,155190832,110000000,2015-05-29,Warner Bros.,PG-13,,Original Screenplay,...,"FPS,New Line Cinema,Village Roadshow Productio...",United States,yes,100m-125m,150m-200m,Brad Peyton,"action,adventure,drama",English,2126355,"Action,Adventure,Drama,Thriller"
414,2892,2016,Batman v Superman: Dawn of Justice,330360194,250000000,2016-03-25,Warner Bros.,PG-13,Man of Steel,Based on Comic/Graphic Novel,...,"Warner Bros.,RatPac Entertainment,Dune Enterta...",United States,yes,250m-275m,300m-350m,Zack Snyder,"action,adventure,fantasy",English,2975590,"Action,Adventure,Fantasy,Sci-Fi"


In [30]:
import pandas as pd

distributor_dict = {
    
    "disney": ["Walt Disney"],
    
    "time_warner": ["Warner Bros."],

    "21st_century_fox": ["Fox Searchlight","20th Century Fox"],
    
    "nbc_universal": ["Focus Features","Universal","Gramercy"],
    
    "sony": ["Sony Pictures Classics","Sony Pictures"],
    
    "viacom": ["Paramount Pictures"],
    
    "lionsgate": ["Roadside Attractions","Lionsgate","Codeblack Entertainment"]
}


master_list_df_reconstituted_again = pd.read_csv('data/india_release_check_v19.csv')

master_list_df_reconstituted_again.head()

Unnamed: 0,number,year,title,domestic_box_office,production_budget,domestic_release_date,domestic_distributor,mpaa_rating,franchise,source,...,production_companies,production_countries,released_in_india_2nd_check,production_budget_bins,domestic_box_office_bins,director,genres_imdb_final,languages_imdb,id_imdb,genres_imdb_extended
0,3867,2017,Pirates of the Caribbean: Dead Men Tell No Tales,172558876,230000000,2017-05-26,Walt Disney,PG-13,Pirates of the Caribbean,Based on Theme Park Ride,...,"Walt Disney Pictures,Jerry Bruckheimer",United States,yes,225m-250m,150m-200m,"Joachim Ronnin,Espen Sandberg","action,adventure,fantasy","English,Spanish",1790809,"Action,Adventure,Fantasy"
1,164,2013,The East,2274649,6500000,2013-05-31,Fox Searchlight,PG-13,,Original Screenplay,...,Scott Free Films,United States,yes,100k-25m,50k-50m,Zal Batmanglij,"adventure,drama,thriller","English,American Sign Language",1869716,"Adventure,Drama,Thriller"
2,12,2013,World War Z,202359711,190000000,2013-06-21,Paramount Pictures,PG-13,World War Z,Based on Fiction Book/Short Story,...,"Skydance Productions,Hemisphere Media Capital,...",United States,yes,175m-200m,200m-250m,Marc Forster,"action,adventure,horror","English,Spanish,Hebrew,Arabic",816711,"Action,Adventure,Horror,Sci-Fi,Thriller"
3,1834,2015,Ricki and the Flash,26839498,18000000,2015-08-07,Sony Pictures,PG-13,,Original Screenplay,...,"Marc Platt Productions,Badwill Entertainment ,...",United States,yes,100k-25m,50k-50m,Jonathan Demme,"comedy,drama,music",English,3623726,"Comedy,Drama,Music"
4,851,2014,Transcendence,23022309,100000000,2014-04-18,Warner Bros.,PG-13,,Original Screenplay,...,"Straight Up Films,DMG Entertainment",United States,yes,100m-125m,50k-50m,Wally Pfister,"drama,mystery,romance",English,2209764,"Drama,Mystery,Romance,Sci-Fi,Thriller"


In [31]:
import re

for index,row in master_list_df_reconstituted_again.iterrows():
    #numberx = row['number']
    #print(numberx)

    domestic_distributor_string = row['domestic_distributor']
    #print(domestic_distributor_string)

    if str(domestic_distributor_string) != 'nan':
        
        for key, value in distributor_dict.items():

            category_name = key

            division_list = value

            
            if any(i in domestic_distributor_string for i in division_list):
                

                distributor_us = category_name
                

                master_list_df_reconstituted_again.loc[index,'distributor_us'] = distributor_us
                
                break

            else:
                master_list_df_reconstituted_again.loc[index,'distributor_us'] = 'others'
                continue
        
        continue
        
    else:
        master_list_df_reconstituted_again.loc[index,'distributor_us'] = 'no_info'
        continue
        

master_list_df_reconstituted_again.to_csv('data/india_release_check_v20.csv', index=False)

print('DONE AND DONE')

DONE AND DONE


In [None]:
#QUESTIONS, TO-DOs

#how do you deal with the problem of multiple genres

#of the movies that haven't been released, find if they've applied for CBFC certificates
    #if no certificate, does that mean we don't get to see it on netflix, amazon either?
    #make sure that the movies that werent released couldnt get released because of the censor board

#jave a section at the top of the jupyter notebook, with a summary/overview of what all steps were taken in collecting data and cleaning the dataset

In [None]:
#NOTES FOR GITHUB METHODOLOGY SECTION

#when i am talking about domestic box office and domestic release date, the data is for the US and not India, just stuck to that labelling because that's the label used in the pages i scraped the data from
    #avoid any possible confusion while writing the code

#dont forget to mention bookmyshow list has all the english movies released in india, including those made in india. Just mention this in methodology notes


In [None]:
#TAKEN CARE OF, DONE, NOT BOTHERING WITH

#should I change the minimum box office gross to $1 million? even the promotional release of the TV show inhumans got $500,000 in one day. My bar is too low.

#have a feeling my focus should shift from production companies to distributors, the official oscars pdf has the nominations categorised by us distributor not production company

#also fifty shades movies, were they released in India or not
#https://www.hindustantimes.com/hollywood/despite-cuts-censor-blocks-fifty-shades-of-grey-in-india/story-W88JEi3GStWxMAiJvjGpyH.html
#http://www.rediff.com/movies/report/fifty-shades-darker-banned-in-india/20170310.htm


#have a feeling a few non-English movies may have slipped through, but think I've cleaned the dataset the best I can
#if a froeign title has an english language title for the american market, cant do anything about it

#shouldn't you talk about how much money these movies could possibly make in India? No, that makes it a business story

#do a manual check of the bookmyshow list UPDATE: nah forget about it, the bookmyshow list even has indian english movies, no way I can keep out titles based on a glance

#do one with just the boutique divisions of major studios

#whatever happeneded to that inhumans theatrical realease? has that gotten into the list?

#get a csv of films marked with yes for released in India

#split the datasheet into two parts, one with just the documentaries and one without documentaries
#so one will be feature-length films and the other will be documentaries

#seems i got the str(director) as != 'nan' instead of == 'nan'

#see if you have taken care of the atlas shrugged and religous titles
#until i have definite proof that they haven't been released (proof in the form of news articles etc.) I've pretty much used the list of titles from bookmyshow as is

#leaving them in , have a feeling they may have never been released in India, dont remember any buzz about them 
#who knows there may have been parts of the country where they were released
#will give the bookmyshow list the benefit of the doubt
#i thought I would exclude movies that have 0 critic/user ratings/reviews on the bookmyshow website, but anchorman 2, a movie which i'm sure would have been released in India, has 0 critic/user ratings/reviews

#weed out movies that aren't movies

#should i have documentaries in my movie list 
#what do i do about documentaries?

#i think I will have to go row by row and remove things like the game of thrones Imax experience, mayweather vs. macgregor etc. it's only a 1000 rows you can do it
#exclude stand up specials like Kevin Hart as well

#also there's a batman: killing joke in there, how do we exclude them programatically? Answer is there isn't, at least not with the attributes and metadata available