## Homework 7.0: BBC Movie List Scraping and Regex

In 2016 the BBC polled 177 film critics to get their picks for the best films of the century so far. While the BBC's [aggregate poll](http://www.bbc.com/culture/story/20160819-the-21st-centurys-100-greatest-films) is interesting, the long list including everyone who voted is perhaps more revealing from the data standpoint:

https://www.bbc.com/culture/article/20160819-the-21st-centurys-100-greatest-films-who-voted

How do you wrangle this data? That is the central challenge you'll be dealing with this week. You need to use Beautiful Soup to find each critic, as well as the list of movies that immediately follows them, and then use regular expressions to split the critic information and the movie information into the most useful possible data structure. What should the data structure be? That is up to you to figure out.



### Getting started: Data Architecture

The central challenge of this assignment is figuring out how you are going to set up your table (list of dictionaries) from this long list of critics and movies. What will each row be? What will the columns be in each row? How can you set it up so that you have the most useful table possible?

Some things to think about: what are the main categories of analysis? Try to design a schema that gives you a table you can run solid aggregations on in pandas. Think about how you can transform the main source into one large table that can be aggregated and grouped.
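To make the schema question concrete, here is a minimal sketch of one possible layout: one row per (critic, film) vote. The field names and the `rank` column are assumptions for illustration, not the required answer; the sample values are taken from the first entries on the page.

```python
from collections import Counter

# One row per (critic, film) vote. Field names are illustrative assumptions;
# "rank" is the position of the film in that critic's list.
rows = [
    {"critic": "Simon Abrams", "crit_org": "Freelance film critic", "crit_cn": "US",
     "rank": 1, "movie": "Mulholland Drive", "director": "David Lynch", "m_year": 2001},
    {"critic": "Simon Abrams", "crit_org": "Freelance film critic", "crit_cn": "US",
     "rank": 2, "movie": "In the Mood for Love", "director": "Wong Kar-wai", "m_year": 2000},
    {"critic": "Sam Adams", "crit_org": "Freelance film critic", "crit_cn": "US",
     "rank": 1, "movie": "In the Mood for Love", "director": "Wong Kar-wai", "m_year": 2000},
]

# Because every row is a single vote, counting rows answers most questions.
votes_per_movie = Counter(row["movie"] for row in rows)

# Deduplicate (critic, country) pairs first so each critic is counted once.
critics_per_country = Counter(
    country for _, country in {(row["critic"], row["crit_cn"]) for row in rows}
)

print(votes_per_movie.most_common(1))
print(critics_per_country)
```

The critic details repeat on every row, which looks wasteful but is what lets a single table be grouped by film, director, country, or organization without any joins.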

### STEP 1

The first thing you need to do is scrape the page. 

https://www.bbc.com/culture/article/20160819-the-21st-centurys-100-greatest-films-who-voted

Okay let's begin! (Note: I have set up the first few cells so that you can run requests once AND save the HTML page as a local file. You then load that local file and do the scraping on it. That way you only need to run requests once, ever!)
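As an alternative to commenting cells out by hand, the "fetch once, then reuse the local file" idea can be sketched as a small helper. The function name `load_html` and its `fetch` parameter are hypothetical; `fetch` stands in for whatever call downloads the page and returns bytes.

```python
from pathlib import Path

def load_html(path, fetch):
    """Return the page HTML, calling fetch() only if no local copy exists.

    `fetch` is a hypothetical zero-argument callable that returns the page
    as bytes, for example: lambda: requests.get(my_url).content
    """
    path = Path(path)
    if not path.exists():
        # First run only: download and save the raw bytes.
        path.write_bytes(fetch())
    # Every later run reads the saved copy and never touches the network.
    return path.read_text(encoding="utf-8")
```

With this in place the same cell can be re-run safely, because the download happens only when `bbc.html` is missing.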



In [1]:
##Import your libraries: Beautiful soup, requests, and re (For regular expressions)
import requests
from bs4 import BeautifulSoup
import re

In [2]:
#RUN THIS ONE TIME
#THEN COMMENT-OUT ALL OF THIS CODE
my_url = "https://www.bbc.com/culture/article/20160819-the-21st-centurys-100-greatest-films-who-voted"
raw_html = requests.get(my_url).content

In [3]:
#WRITING THE HTML FILE TO A LOCAL HTML FILE
#RUN THIS ONE TIME, THEN COMMENT-OUT ALL OF THIS CODE
with open('bbc.html', 'wb+') as f:
    f.write(raw_html)

In [4]:
#If you have run requests already--START HERE
# Open with an explicit encoding and close the file automatically
with open("bbc.html", "r", encoding="utf-8") as f:
    local_html = f.read()
local_html

'<!DOCTYPE html><html lang="en-GB"><head><meta charSet="utf-8"/><meta name="viewport" content="width=device-width"/><title>The 21st Century’s 100 greatest films: Who voted?</title><meta property="og:title" content="The 21st Century’s 100 greatest films: Who voted?"/><meta name="twitter:title" content="The 21st Century’s 100 greatest films: Who voted?"/><meta name="description" content="We polled 177 critics from around the world – here is how they voted."/><meta property="og:description" content="We polled 177 critics from around the world – here is how they voted."/><meta name="twitter:description" content="We polled 177 critics from around the world – here is how they voted."/><meta property="og:image" content="https://ychef.files.bbci.co.uk/624x351/p04548r6.jpg"/><meta name="twitter:image:src" content="https://ychef.files.bbci.co.uk/624x351/p04548r6.jpg"/><meta name="twitter:card" content="summary_large_image"/><meta name="msapplication-TileColor" content="#da532c"/><meta name="them

In [5]:
# take the saved HTML and put the page into Beautiful Soup
soup_doc = BeautifulSoup(local_html, "html.parser")
print(soup_doc.prettify())

<!DOCTYPE html>
<html lang="en-GB">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width" name="viewport"/>
  <title>
   The 21st Century’s 100 greatest films: Who voted?
  </title>
  <meta content="The 21st Century’s 100 greatest films: Who voted?" property="og:title"/>
  <meta content="The 21st Century’s 100 greatest films: Who voted?" name="twitter:title"/>
  <meta content="We polled 177 critics from around the world – here is how they voted." name="description"/>
  <meta content="We polled 177 critics from around the world – here is how they voted." property="og:description"/>
  <meta content="We polled 177 critics from around the world – here is how they voted." name="twitter:description"/>
  <meta content="https://ychef.files.bbci.co.uk/624x351/p04548r6.jpg" property="og:image"/>
  <meta content="https://ychef.files.bbci.co.uk/624x351/p04548r6.jpg" name="twitter:image:src"/>
  <meta content="summary_large_image" name="twitter:card"/>
  <meta content="#da532c" nam

In [6]:
#Using beautiful soup find the tag that contains 
#the entire list of critics and movies
#Make a variable (like full_list) that holds all that information 
full_list = soup_doc.article.find_all('p')
full_list

[<p class="sc-eb7bd5f6-0 fYAfXe"><b class="sc-7dcfb11b-0 kVRnKf" id="we-polled-177-critics-from-around-the-world-–-here-is-how-they-voted.">We polled 177 critics from around the world – here is how they voted.<!-- --></b></p>,
 <p class="sc-eb7bd5f6-0 fYAfXe">Communicating with 177 film critics is a time-consuming process. But for every critic who participated – and many more were invited – it wasn’t just a matter of lending their expertise; it was about sharing their passion. The critics who participated hail from 36 countries: 81 from the US, 19 from the UK, five each from Canada, Cuba, France, and Germany, and four each from Australia, Colombia, India, Israel and Italy. Lebanon, the UAE, China, Bangladesh, Chile, Namibia, Kazakhstan and many others are represented too. Of the 177 critics, 55 are women and 122 are men. We present their votes here in alphabetical order.<!-- --></p>,
 <p class="sc-eb7bd5f6-0 fYAfXe"><b class="sc-7dcfb11b-0 kVRnKf" id="simon-abrams-–-freelance-film-crit

In [7]:
div_list = soup_doc.article.find_all('div')
len(div_list)

508

In [8]:
div_list[9]

<div class="sc-18fde0d6-0 dlWCEZ" data-component="text-block"><p class="sc-eb7bd5f6-0 fYAfXe"><b class="sc-7dcfb11b-0 kVRnKf" id="sam-adams-–-freelance-film-critic-(us)">Sam Adams – Freelance film critic (US)<!-- --></b></p></div>

In [9]:
allb = soup_doc.find_all('b')
len(allb)
for b in allb:
    print(b.text)

We polled 177 critics from around the world – here is how they voted.
Simon Abrams – Freelance film critic (US)
Sam Adams – Freelance film critic (US)
Thelma Adams – Freelance film critic (US)
Arturo Aguilar – Rolling Stone Mexico (Mexico)
Matthew Anderson – BBC Culture (UK)
Tim Appelo – The Wrap (US)
Adriano Aprà – Film historian (Italy)
Michael Arbeiter – Nerdist (US)
Ali Arikan – Dipnot TV (Turkey)
Michael Atkinson – The Village Voice (US)
Ana Maria Bahiana – Freelance film critic (Brazil)
Cameron Bailey – Toronto Film Festival (Canada)
Lindsay Baker – BBC Culture (UK)
Miriam Bale – Freelance film critic (US)
Nicholas Barber – BBC Culture (UK)
Diego Batlle – La Nacion (Argentina)
NT Binh – Positif (France)
Lizelle Bisschoff – University of Glasgow (UK)
Christian Blauvelt – BBC Culture (US)
Mahen Bonetti – African Film Festival Inc (US)
Andreas Borcholte – Spiegel Online (Germany)
Utpal Borpujari – Freelance film critic (India)
Richard Brody – The New Yorker (US)
Hannah Brown – Jerus

**STEP 2** Using Beautiful Soup figure out how to separate the entries.


In [10]:
full_list[2]
full_list[-9]
#len(full_list)

<p class="sc-eb7bd5f6-0 fYAfXe">10. City of God (Fernando Meirelles and Kátia Lund, 2002)<!-- --></p>

**STEP 3** This is where all the magic has to happen: you need to find a way to loop through all of the elements (the list you just got from find_all()) and pull out the critics and their lists of movies.

Set up a loop that PRINTS critics and movies: you need to set it up so that you're getting the critic string followed by their movies.

So just print out the lines along with a label like "CRITIC" or "MOVIE" to make sure that the loop is recognizing the two categories differently.
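Here is a small sketch of the labeling idea, working on plain text rather than on Beautiful Soup elements. It assumes that every movie line starts with a rank such as `1. ` and that no critic line does; this is a swapped-in technique for illustration, and checking for a `<b>` tag on the element is another way to make the same split. The helper name `label` is hypothetical.

```python
import re

# Assumption: movie lines begin with a one- or two-digit rank and a period.
MOVIE_LINE = re.compile(r"^\d{1,2}\.\s")

def label(text):
    """Return "MOVIE" for numbered lines and "CRITIC" for everything else."""
    return "MOVIE" if MOVIE_LINE.match(text) else "CRITIC"

sample = [
    "Simon Abrams – Freelance film critic (US)",
    "1. Mulholland Drive (David Lynch, 2001)",
    "2. In the Mood for Love (Wong Kar-wai, 2000)",
]

for text in sample:
    # Pad the label so the two categories line up in the printout.
    print(f"{label(text):<6} | {text}")
```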


In [11]:
##Write your loop for STEP 3 here
for entry in full_list[2:-8]:
    if entry.b:
        print(entry.text)
    else:
        print(entry.text)
        print ("-" * 90) 

Simon Abrams – Freelance film critic (US)
1. Mulholland Drive (David Lynch, 2001)
------------------------------------------------------------------------------------------
2. In the Mood for Love (Wong Kar-wai, 2000)
------------------------------------------------------------------------------------------
3. The Tree of Life (Terrence Malick, 2011)
------------------------------------------------------------------------------------------
4. Yi Yi: A One and a Two (Edward Yang, 2000)
------------------------------------------------------------------------------------------
5. Goodbye to Language (Jean-Luc Godard, 2014)
------------------------------------------------------------------------------------------
6. The White Meadows (Mohammad Rasoulof, 2009)
------------------------------------------------------------------------------------------
7. Night Across the Street (Raoul Ruiz, 2012)
------------------------------------------------------------------------------------------
8. Cer

**STEP 4**
If your loop is successfully isolating those two categories: now it's time to parse each with regular expressions (separately). This will need to happen inside the loop--for every critic, and then (in STEP 5) for every movie. But FIRST, just **focus on getting the critics name, organization, and country** in isolation (outside of the loops).

Once you think you have your regular expressions working, bring them into a loop (just for CRITICS) and see how well they work.

Inside the loop, once you have critic_info, make a regular expression that pulls out the name of the critic and store it in a variable called critic_name:

`critic_name = re.findall(regex, critic_info)[0]`

Do the same thing for critic_org and critic_cn

As you go print(critic_name) then print(critic_org), etc.--to make sure you're getting the results. I provided a cell below for you to practice your regular expressions before you put them into the loop.

In [12]:
#Practice/Build your regular expressions here
import re
crit_sample = "Arturo Aguilar – Rolling Stone Mexico (Mexico)"
regex_for_name = r"^([^–]+)"
regex_for_org = r"–([^(]+)\("
regex_for_cn = r"\((.+)\)$"
name = re.findall(regex_for_cn,crit_sample)
name[0]

'Mexico'
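The three separate patterns can also be folded into one pattern with named groups. This is a sketch under the assumption that every critic line has the shape `Name – Organization (Country)` with an en dash as the separator; `parse_critic` is a hypothetical helper name.

```python
import re

# One pattern, three named groups. The separator is an en dash, as on the page.
CRITIC_RE = re.compile(
    r"^(?P<name>.+?)\s+–\s+(?P<org>.+)\s+\((?P<country>[^()]+)\)$"
)

def parse_critic(line):
    """Return a dict with name, org and country, or None if the line does not fit."""
    match = CRITIC_RE.match(line.strip())
    return match.groupdict() if match else None

print(parse_critic("Arturo Aguilar – Rolling Stone Mexico (Mexico)"))
```

Returning `None` instead of indexing `[0]` on an empty list means an unexpected line can be reported rather than crashing the loop with an `IndexError`.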

In [13]:
#Take your working loop from step three
#And put it here With the regular expression parsing inside it
for entry in full_list[2:-8]:
    if entry.b:
        regex_for_name = r"^([^–]+)"
        regex_for_org = r"–([^(]+)\("
        regex_for_cn = r"\((.+)\)$"
        name = re.findall(regex_for_name,entry.text)[0]
        org = re.findall(regex_for_org,entry.text)[0]
        cn = re.findall(regex_for_cn,entry.text)[0]
        print(name + "|||" + org+ "|||"+ cn)

Simon Abrams ||| Freelance film critic |||US
Sam Adams ||| Freelance film critic |||US
Thelma Adams ||| Freelance film critic |||US
Arturo Aguilar ||| Rolling Stone Mexico |||Mexico
Matthew Anderson ||| BBC Culture |||UK
Tim Appelo ||| The Wrap |||US
Adriano Aprà ||| Film historian |||Italy
Michael Arbeiter ||| Nerdist |||US
Ali Arikan ||| Dipnot TV |||Turkey
Michael Atkinson ||| The Village Voice |||US
Ana Maria Bahiana ||| Freelance film critic |||Brazil
Cameron Bailey ||| Toronto Film Festival |||Canada
Lindsay Baker ||| BBC Culture |||UK
Miriam Bale ||| Freelance film critic |||US
Nicholas Barber ||| BBC Culture |||UK
Diego Batlle ||| La Nacion |||Argentina
NT Binh ||| Positif |||France
Lizelle Bisschoff ||| University of Glasgow |||UK
Christian Blauvelt ||| BBC Culture |||US
Mahen Bonetti ||| African Film Festival Inc |||US
Andreas Borcholte ||| Spiegel Online |||Germany
Utpal Borpujari ||| Freelance film critic |||India
Richard Brody ||| The New Yorker |||US
Hannah Brown ||| Jeru

**STEP 5**
Now you need to get your **movie info**. You will want to use the same loop you have been working on (in STEP 6), and get the name of each movie along with the critic information.

But **FIRST**: practice your regular expressions and make sure that they're going to work before you bring them into the loop.


In [14]:
#Practice/Build your regular expressions here
movie_sample = "1. Zero Dark Thirty (Kathryn Bigelow, 2012)"
movie_harder = "7. 4 Months, 3 Weeks & 2 Days (Cristian Mungiu, 2007)"
regex_for_mname = r"^\d{1,2}\. (.+)[(][^(]+[)]$"
regex_for_dir = r"\(([^(]+),\s+[^,(]+\)$"
regex_for_year = r",\s+(\d{4})\)$"
#what else should you extract???
#set up all regexes here
movie_name = re.findall(regex_for_year,movie_harder)
movie_name[0].strip()

'2007'
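One possible answer to "what else should you extract?" is the rank at the start of each line. The sketch below captures rank, title, director, and year in a single pattern with named groups. It assumes the final parenthesis always has the form `(Director, YYYY)`; `parse_movie` is a hypothetical helper name.

```python
import re

MOVIE_RE = re.compile(
    r"^(?P<rank>\d{1,2})\.\s+"      # the critic's ranking, e.g. "7."
    r"(?P<title>.+)\s+"             # title; may itself contain commas or digits
    r"\((?P<director>[^()]+),\s+"   # director(s) inside the final parenthesis
    r"(?P<year>\d{4})\)$"           # four-digit year, then the closing parenthesis
)

def parse_movie(line):
    """Return a dict of rank, title, director, year, or None if the line does not fit."""
    match = MOVIE_RE.match(line.strip())
    if match is None:
        return None
    info = match.groupdict()
    info["rank"] = int(info["rank"])
    info["year"] = int(info["year"])
    return info

print(parse_movie("7. 4 Months, 3 Weeks & 2 Days (Cristian Mungiu, 2007)"))
```

Any line for which `parse_movie` returns `None` is worth printing and inspecting by hand, since it probably has an irregular year or an extra parenthesis.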

**STEP 6**
You're almost there!!! Now that you have working regular expressions, put them in your inner loop to get the movie name.

So now the entire loop should be getting critic information and movie information all separated as separate columns/properties.

Build this loop(s) using print() on the first one or two critic selections. Just to make sure you are pulling out the right data.




In [15]:
#Get that loop working here

for entry in full_list[2:-8]:
    if entry.b:
        regex_for_name = r"^([^–]+)"
        regex_for_org = r"–([^(]+)\("
        regex_for_cn = r"\((.+)\)$"
        name = re.findall(regex_for_name,entry.text)[0]
        org = re.findall(regex_for_org,entry.text)[0]
        cn = re.findall(regex_for_cn,entry.text)[0]
        print(name+ "|||" + org+ "|||"+cn)
    else:
        regex_for_mname = r"^\d{1,2}\. (.+)\([^(]+\)$"
        regex_for_dir = r"\(([^(]+),\s+[^,(]+\)$"
        regex_for_year = r",\s+([^,]+)\)$"
        # regex_for_multi_paren = r".+\(.+\("
        # regex_for_odddate = r", .{4}\)$"
        #what else should you extract???
        movie_name = re.findall(regex_for_mname,entry.text)[0]
        movie_dir = re.findall(regex_for_dir,entry.text)[0]
        movie_year = re.findall(regex_for_year,entry.text)[0]
        print(movie_name+ "|||" + movie_dir+ "|||"+movie_year)
        
        # paren = re.findall(regex_for_odddate,entry.text)
        # if paren:
        #     print(paren)

Simon Abrams ||| Freelance film critic |||US
Mulholland Drive |||David Lynch|||2001
In the Mood for Love |||Wong Kar-wai|||2000
The Tree of Life |||Terrence Malick|||2011
Yi Yi: A One and a Two |||Edward Yang|||2000
Goodbye to Language |||Jean-Luc Godard|||2014
The White Meadows |||Mohammad Rasoulof|||2009
Night Across the Street |||Raoul Ruiz|||2012
Certified Copy |||Abbas Kiarostami|||2010
Sparrow |||Johnnie To|||2008
Fados |||Carlos Saura|||2007
Sam Adams ||| Freelance film critic |||US
In the Mood for Love |||Wong Kar-wai|||2000
Eternal Sunshine of the Spotless Mind |||Michel Gondry|||2004
Syndromes and a Century |||Apichatpong Weerasethakul|||2006
Spirited Away |||Hayao Miyazaki|||2001
The Act of Killing |||Joshua Oppenheimer|||2012
The Grand Budapest Hotel |||Wes Anderson|||2014
The New World |||Terrence Malick|||2004
Certified Copy |||Abbas Kiarostami|||2010
The World |||Jia Zhangke|||2004
Elephant |||Gus Van Sant|||2003
Thelma Adams ||| Freelance film critic |||US
Zero Dark Thi

**STEP 7**
This is the final step of the hardest part! 

The final step is building a list of dictionaries of all this information.

So you need to have a loop that gets everything out, but you also need to figure out **how you want to organize what you're pulling out.** What should a row look like in your table?
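As a sketch of what a list of dictionaries could look like, the loop below remembers the most recent critic and attaches that critic to every movie line that follows. It runs on a handful of text lines copied from the page rather than on the soup elements, and the compact regexes and key names are assumptions for illustration.

```python
import re

lines = [
    "Simon Abrams – Freelance film critic (US)",
    "1. Mulholland Drive (David Lynch, 2001)",
    "2. In the Mood for Love (Wong Kar-wai, 2000)",
    "Sam Adams – Freelance film critic (US)",
    "1. In the Mood for Love (Wong Kar-wai, 2000)",
]

critic_re = re.compile(r"^(.+?)\s+–\s+(.+)\s+\(([^()]+)\)$")
movie_re = re.compile(r"^\d{1,2}\.\s+(.+)\s+\(([^()]+),\s+(\d{4})\)$")

rows = []
current_critic = None  # the critic whose list we are currently inside

for line in lines:
    movie = movie_re.match(line)
    if movie and current_critic is not None:
        # A movie row is the movie fields plus a copy of the current critic fields.
        rows.append({
            "movie": movie.group(1),
            "director": movie.group(2),
            "m_year": movie.group(3),
            **current_critic,
        })
        continue
    critic = critic_re.match(line)
    if critic:
        current_critic = {
            "critic": critic.group(1),
            "crit_org": critic.group(2),
            "crit_cn": critic.group(3),
        }

for row in rows:
    print(row)
```

A list of dictionaries like this can be passed straight to `pd.DataFrame(rows)`, and the keys become the column names.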




In [16]:
#figure out how you're going to collect your clean information
list_of_movies = []

#loop through the beautiful soup elements
#and use the regexes you developed above to get each unit of info
for entry in full_list[2:-8]:
    if entry.b:
        regex_for_name = r"^([^–]+)"
        regex_for_org = r"–([^(]+)\("
        regex_for_cn = r"\((.+)\)$"
        name = re.findall(regex_for_name,entry.text)[0].strip()
        org = re.findall(regex_for_org,entry.text)[0].strip()
        cn = re.findall(regex_for_cn,entry.text)[0].strip()
        print(name+ "|||" + org+ "|||"+cn)
    else:
        regex_for_mname = r"^\d{1,2}\. (.+)\([^(]+\)$"
        regex_for_dir = r"\(([^(]+),\s+[^,(]+\)$"
        regex_for_year = r",\s+([^,]+)\)$"
        regex_for_multi_paren = r".+\(.+\("
        #what else should you extract???
        movie_name = re.findall(regex_for_mname,entry.text)[0].strip()
        movie_dir = re.findall(regex_for_dir,entry.text)[0].strip()
        movie_year = re.findall(regex_for_year,entry.text)[0].strip()
        
        new_movie_entry = [movie_name,movie_dir,movie_year,name,org,cn]
        list_of_movies.append(new_movie_entry)
        
#Try to figure out how you want to append things
#That is, how you want to organize your data

Simon Abrams|||Freelance film critic|||US
Sam Adams|||Freelance film critic|||US
Thelma Adams|||Freelance film critic|||US
Arturo Aguilar|||Rolling Stone Mexico|||Mexico
Matthew Anderson|||BBC Culture|||UK
Tim Appelo|||The Wrap|||US
Adriano Aprà|||Film historian|||Italy
Michael Arbeiter|||Nerdist|||US
Ali Arikan|||Dipnot TV|||Turkey
Michael Atkinson|||The Village Voice|||US
Ana Maria Bahiana|||Freelance film critic|||Brazil
Cameron Bailey|||Toronto Film Festival|||Canada
Lindsay Baker|||BBC Culture|||UK
Miriam Bale|||Freelance film critic|||US
Nicholas Barber|||BBC Culture|||UK
Diego Batlle|||La Nacion|||Argentina
NT Binh|||Positif|||France
Lizelle Bisschoff|||University of Glasgow|||UK
Christian Blauvelt|||BBC Culture|||US
Mahen Bonetti|||African Film Festival Inc|||US
Andreas Borcholte|||Spiegel Online|||Germany
Utpal Borpujari|||Freelance film critic|||India
Richard Brody|||The New Yorker|||US
Hannah Brown|||Jerusalem Post|||Israel
Luke Buckmaster|||The Guardian/BBC Culture|||Austra

In [17]:
##Take a peek at your final list of lists
list_of_movies
len(list_of_movies)
list_of_movies[0:44]

[['Mulholland Drive',
  'David Lynch',
  '2001',
  'Simon Abrams',
  'Freelance film critic',
  'US'],
 ['In the Mood for Love',
  'Wong Kar-wai',
  '2000',
  'Simon Abrams',
  'Freelance film critic',
  'US'],
 ['The Tree of Life',
  'Terrence Malick',
  '2011',
  'Simon Abrams',
  'Freelance film critic',
  'US'],
 ['Yi Yi: A One and a Two',
  'Edward Yang',
  '2000',
  'Simon Abrams',
  'Freelance film critic',
  'US'],
 ['Goodbye to Language',
  'Jean-Luc Godard',
  '2014',
  'Simon Abrams',
  'Freelance film critic',
  'US'],
 ['The White Meadows',
  'Mohammad Rasoulof',
  '2009',
  'Simon Abrams',
  'Freelance film critic',
  'US'],
 ['Night Across the Street',
  'Raoul Ruiz',
  '2012',
  'Simon Abrams',
  'Freelance film critic',
  'US'],
 ['Certified Copy',
  'Abbas Kiarostami',
  '2010',
  'Simon Abrams',
  'Freelance film critic',
  'US'],
 ['Sparrow',
  'Johnnie To',
  '2008',
  'Simon Abrams',
  'Freelance film critic',
  'US'],
 ['Fados',
  'Carlos Saura',
  '2007',
  'Sim

In [18]:
# for mov in list_of_movies:
#     if mov[1].startswith("Mool"):
#         print(mov)

for mov in list_of_movies:
    if re.search(r".+\d{4}",mov[2]):
        print(mov[2])

Sembèène 2004


One year value came through with a stray fragment of the director's name attached. It could be fixed here, like this, but I am going to fix it in pandas.

In [19]:
for mov in list_of_movies:
    if re.search(r".+\d{4}",mov[2]):
        mov[1] = mov[1]+" "+mov[2].split(" ")[0]
        print(mov[1])
        mov[2] = mov[2].split(" ")[1]
        print(mov[2])

Ousmane Sembèène
2004


If you made it this far, yay!


And now, let's bring that into PANDAS!

In [20]:
import numpy as np
import pandas as pd
col_names = ['movie', 'director', 'm_year', 'critic','crit_org','crit_cn']
df = pd.DataFrame.from_records(list_of_movies, columns=col_names)

In [21]:
df

Unnamed: 0,movie,director,m_year,critic,crit_org,crit_cn
0,Mulholland Drive,David Lynch,2001,Simon Abrams,Freelance film critic,US
1,In the Mood for Love,Wong Kar-wai,2000,Simon Abrams,Freelance film critic,US
2,The Tree of Life,Terrence Malick,2011,Simon Abrams,Freelance film critic,US
3,Yi Yi: A One and a Two,Edward Yang,2000,Simon Abrams,Freelance film critic,US
4,Goodbye to Language,Jean-Luc Godard,2014,Simon Abrams,Freelance film critic,US
...,...,...,...,...,...,...
1765,The Lives of Others,Florian Henckel von Donnersmarck,2006,Raymond Zhou,China Daily,China
1766,Still Life,Jia Zhangke,2006,Raymond Zhou,China Daily,China
1767,Birdman,Alejandro González Iñárritu,2014,Raymond Zhou,China Daily,China
1768,Infernal Affairs,Andrew Lau and Alan Mak,2002,Raymond Zhou,China Daily,China


In [22]:
#most popular films
df['movie'].value_counts().head(15)

movie
In the Mood for Love                     49
Mulholland Drive                         47
There Will Be Blood                      35
Spirited Away                            34
Boyhood                                  30
Eternal Sunshine of the Spotless Mind    29
A Separation                             28
The Tree of Life                         23
Yi Yi: A One and a Two                   22
No Country For Old Men                   21
Inside Llewyn Davis                      20
Children of Men                          18
4 Months, 3 Weeks & 2 Days               17
Pan's Labyrinth                          17
The Act of Killing                       16
Name: count, dtype: int64

In [23]:
#most unpopular films
m_count = df['movie'].value_counts()
m_count[m_count<10]

movie
The Dark Knight                                9
Certified Copy                                 9
Margaret                                       9
Uncle Boonmee Who Can Recall His Past Lives    9
Timbuktu                                       9
                                              ..
Story of My Death                              1
Stranger by the Lake                           1
Even If She Had Been a Criminal...             1
Heart of a Dog                                 1
Lust, Caution                                  1
Name: count, Length: 561, dtype: int64

In [24]:
#critics per country!
df.groupby('crit_cn')['critic'].nunique() # the number of unique values in the 'critic' column for each group created by groupby('crit_cn')

crit_cn
Argentina        2
Australia        4
Austria          2
Bangladesh       1
Belgium          1
Brazil           1
Canada           5
Chile            2
China            1
Colombia         4
Cuba             5
Egypt            1
France           5
Germany          5
Hong Kong        1
India            5
Indonesia        1
Israel           4
Italy            4
Japan            1
Kazakhstan       1
Lebanon          3
Mexico           2
Namibia          1
Philippines      1
Qatar            1
Senegal          1
Singapore        2
South Africa     1
South Korea      2
Switzerland      1
Taiwan           1
Turkey           2
UAE              3
UK              18
US              82
Name: critic, dtype: int64

In [25]:
#back up your results!!!
df.to_csv(r'backup_BBC1.csv', index = False)

In [26]:
df_new = pd.read_csv("backup_BBC1.csv")

In [27]:
d_list = list(df_new['director'].unique())
d_list.sort() # alphabetically
d_list

['Abbas Kiarostami',
 'Abdellatif Kechiche',
 'Abderrahmane Sissako',
 'Adam Curtis',
 'Adam McKay',
 'Agnieszka Holland',
 'Agnès Jaoui',
 'Agnès Varda',
 'Aki Kaurismäki',
 'Alain Cavalier',
 'Alain Gomis',
 'Alain Guiraudie',
 'Alain Resnais',
 'Albert Serra',
 'Alejandro González Iñárritu',
 'Aleksandr Sokurov',
 'Aleksey Fedorchenko',
 'Aleksey German',
 'Alex Garland',
 'Alexander Payne',
 'Alfonso Cuarón',
 'Amma Asante',
 'Ana Lily Amirpour',
 'Andrea Arnold',
 'Andrew Adamson and Vicky Jenson',
 'Andrew Dominik',
 'Andrew Dosunmu',
 'Andrew Haigh',
 'Andrew Lau and Alan Mak',
 'Andrew Stanton',
 'Andrew Stanton and Lee Unkrich',
 'Andrey Zvyagintsev',
 'Andrzej Wajda',
 'Andrzej Zulawski',
 'André Singer',
 'Ang Lee',
 'Annemarie Jacir',
 'Anthony and Joe Russo',
 'Anurag Kashyap',
 'Apichatpong Weerasethakul',
 'Ari Folman',
 'Arnaud Desplechin',
 'Asghar Farhadi',
 'Ashutosh Gowariker',
 'Asif Kapadia',
 'Ava DuVernay',
 'Avi Nesher',
 'Bahman Ghobadi',
 'Bart Layton',
 'Baz

**Getting the names separated**

In [28]:
pd.set_option('display.max_rows', None)
df_new[df_new['director'].str.contains(r"\band\b",regex=True, case=False)]

Unnamed: 0,movie,director,m_year,critic,crit_org,crit_cn
50,No Country For Old Men,Joel and Ethan Coen,2007,Tim Appelo,The Wrap,US
54,Finding Nemo,Andrew Stanton and Lee Unkrich,2003,Tim Appelo,The Wrap,US
60,These Encounters of Theirs,Danièle Huillet and Jean-Marie Straub,2006,Adriano Aprà,Film historian,Italy
68,Terra,Marco De Angelis and Antonio Di Trapani,2015,Adriano Aprà,Film historian,Italy
69,Oh! Man,Yervant Gianikian and Angela Ricci Lucchi,2004,Adriano Aprà,Film historian,Italy
77,Inside Llewyn Davis,Joel and Ethan Coen,2013,Michael Arbeiter,Nerdist,US
86,A Serious Man,Joel and Ethan Coen,2009,Ali Arikan,Dipnot TV,Turkey
103,City of God,Fernando Meirelles and Kátia Lund,2002,Ana Maria Bahiana,Freelance film critic,Brazil
108,No Country For Old Men,Joel and Ethan Coen,2007,Ana Maria Bahiana,Freelance film critic,Brazil
133,This Is the End,Evan Goldberg and Seth Rogen,2013,Miriam Bale,Freelance film critic,US


This lambda function just sends each cell (director) to the dirs_names() function.

Note that here I am testing to make sure the function is working; I'm not saving this work yet.

In [29]:
df_new['director'].value_counts().head(15)

director
Paul Thomas Anderson    52
Joel and Ethan Coen     52
Wong Kar-wai            51
David Lynch             48
Richard Linklater       39
Michael Haneke          35
Hayao Miyazaki          35
Terrence Malick         32
Asghar Farhadi          31
David Fincher           31
Michel Gondry           30
Wes Anderson            28
Christopher Nolan       28
Alfonso Cuarón          24
Edward Yang             22
Name: count, dtype: int64

In [30]:
import re
def dirs_names(dirs):
    each_word = re.split(r"\s+",dirs)
    if len(each_word) > 1 and each_word[1] == "and":
        each_word[0] = each_word[0] + " " + each_word[-1]
        print(' '.join(each_word))
        return ' '.join(each_word)
    else:
        return dirs

In [31]:
df_new['director'].apply(lambda x: dirs_names(x))

Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Jean-Pierre Dardenne and Luc Dardenne
Joel Coen and Ethan Coen
Josh Safdie and Benny Safdie
Joel Coen and Ethan Coen
Jean-Pierre Dardenne and Luc Dardenne
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Jean-Pierre Dardenne and Luc Dardenne
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Jean-Pierre Dardenne and Luc Dardenne
Joel Coen and Ethan Coen
Jean-Pierre Dardenne and Luc Dardenne
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Jean-Pierre Dardenne and Luc Dardenne
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Jean-Pierre Darden

0                                             David Lynch
1                                            Wong Kar-wai
2                                         Terrence Malick
3                                             Edward Yang
4                                         Jean-Luc Godard
5                                       Mohammad Rasoulof
6                                              Raoul Ruiz
7                                        Abbas Kiarostami
8                                              Johnnie To
9                                            Carlos Saura
10                                           Wong Kar-wai
11                                          Michel Gondry
12                              Apichatpong Weerasethakul
13                                         Hayao Miyazaki
14                                     Joshua Oppenheimer
15                                           Wes Anderson
16                                        Terrence Malick
17            

Looking for directors with a single name, because that function assumes the pattern is always a first name followed by "and".

In [32]:
df_new[df_new['director'].str.contains(r"^\S+$",regex=True, case=False)]

Unnamed: 0,movie,director,m_year,critic,crit_org,crit_cn


Oh-oh, problem! Let's fix!

In [33]:
df_new["director"].iat[172] = df_new["director"].iat[172] + " " + df_new["m_year"].iat[172].split(" ")[0]

AttributeError: 'numpy.int64' object has no attribute 'split'

In [34]:
df_new["director"].iat[172]

'Ousmane Sembèène'

In [35]:
df_new["m_year"].iat[172] = df_new["m_year"].iat[172].split(" ")[1]

AttributeError: 'numpy.int64' object has no attribute 'split'

In [36]:
df_new["m_year"].iat[172]

np.int64(2004)
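The `AttributeError` above is what you would expect if `m_year` came back from the CSV as integers: `pd.read_csv` infers a numeric dtype when every value in a column looks like a number, and integers have no `.split()`. A minimal sketch, using a tiny made-up CSV string in place of `backup_BBC1.csv`, shows the difference that the `dtype` argument makes.

```python
import io

import pandas as pd

# A stand-in for the backup file (assumption: two columns are enough to show the effect).
csv_text = "director,m_year\nOusmane Sembène,2004\n"

# Default behaviour: the column is inferred as an integer dtype.
inferred = pd.read_csv(io.StringIO(csv_text))
print(inferred["m_year"].dtype)

# Forcing text: each value is a Python str again, so string methods work.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"m_year": str})
print(type(as_text["m_year"].iat[0]))
```

If the year was already cleaned up in the list before the CSV was written, there is nothing left to split and the integer column can simply be kept as it is.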

In [37]:
df_new[df_new['director'].str.contains(r"\bOusmane\b",regex=True, case=False)]

Unnamed: 0,movie,director,m_year,critic,crit_org,crit_cn
172,Moolaadé,Ousmane Sembèène,2004,Lizelle Bisschoff,University of Glasgow,UK
190,Moolaadé,Ousmane Sembène,2004,Mahen Bonetti,African Film Festival Inc,US
395,Moolaadé,Ousmane Sembène,2004,Lindiwe Dovey,University of London,UK
1010,Moolaadé,Ousmane Sembène,2004,Hans-Christian Mahnke,AfricAvenir.org,Namibia
1536,Moolaadé,Ousmane Sembène,2004,Yael Shuv,Time Out Tel Aviv,Israel


In [None]:
#Whaaatttt???

In [38]:
df_new['director'] = df_new['director'].str.replace('Sembèène','Sembène')

In [39]:
df_new[df_new['director'].str.contains(r"\bOusmane\b",regex=True, case=False)]

Unnamed: 0,movie,director,m_year,critic,crit_org,crit_cn
172,Moolaadé,Ousmane Sembène,2004,Lizelle Bisschoff,University of Glasgow,UK
190,Moolaadé,Ousmane Sembène,2004,Mahen Bonetti,African Film Festival Inc,US
395,Moolaadé,Ousmane Sembène,2004,Lindiwe Dovey,University of London,UK
1010,Moolaadé,Ousmane Sembène,2004,Hans-Christian Mahnke,AfricAvenir.org,Namibia
1536,Moolaadé,Ousmane Sembène,2004,Yael Shuv,Time Out Tel Aviv,Israel


**Okay...**

So after that tangent, I'm gonna go ahead and update the directors!!

Here I am saving the work, updating the director column.


In [40]:
df_new['director']=df_new['director'].apply(lambda x: dirs_names(x))

Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Jean-Pierre Dardenne and Luc Dardenne
Joel Coen and Ethan Coen
Josh Safdie and Benny Safdie
Joel Coen and Ethan Coen
Jean-Pierre Dardenne and Luc Dardenne
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Jean-Pierre Dardenne and Luc Dardenne
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Jean-Pierre Dardenne and Luc Dardenne
Joel Coen and Ethan Coen
Jean-Pierre Dardenne and Luc Dardenne
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Jean-Pierre Dardenne and Luc Dardenne
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Jean-Pierre Darden

And checking to make sure it came out right!

In [41]:
df_new[df_new['director'].str.contains(r"\bCoen\b",regex=True, case=False)]

Unnamed: 0,movie,director,m_year,critic,crit_org,crit_cn
50,No Country For Old Men,Joel Coen and Ethan Coen,2007,Tim Appelo,The Wrap,US
77,Inside Llewyn Davis,Joel Coen and Ethan Coen,2013,Michael Arbeiter,Nerdist,US
86,A Serious Man,Joel Coen and Ethan Coen,2009,Ali Arikan,Dipnot TV,Turkey
108,No Country For Old Men,Joel Coen and Ethan Coen,2007,Ana Maria Bahiana,Freelance film critic,Brazil
137,Inside Llewyn Davis,Joel Coen and Ethan Coen,2013,Miriam Bale,Freelance film critic,US
189,A Serious Man,Joel Coen and Ethan Coen,2009,Christian Blauvelt,BBC Culture,US
204,No Country For Old Men,Joel Coen and Ethan Coen,2007,Andreas Borcholte,Spiegel Online,Germany
232,No Country For Old Men,Joel Coen and Ethan Coen,2007,Hannah Brown,Jerusalem Post,Israel
242,"O Brother, Where Art Thou?",Joel Coen and Ethan Coen,2000,Luke Buckmaster,The Guardian/BBC Culture,Australia
260,Inside Llewyn Davis,Joel Coen and Ethan Coen,2013,Monica Castillo,New York Times Watching,US


Now I need to deal with multiple directors by looking for commas.

In [42]:
df_new[df_new['director'].str.contains(r",\s+",regex=True, case=False)]

Unnamed: 0,movie,director,m_year,critic,crit_org,crit_cn
144,Madagascar 3: Europe's Most Wanted,"Eric Darnell, Tom McGrath and Conrad Vernon",2012,Nicholas Barber,BBC Culture,UK
399,7 Letters,"Boo Junfeng, Eric Khoo, Jack Neo, K. Rajagopal...",2015,Lindiwe Dovey,University of London,UK
1396,"Monsters, Inc.","Pete Docter, David Silverman and Lee Unkrich",2001,Jonathan Romney,Freelance film critic,UK


In [43]:
#beware of oxford commas!!!!
df_new[df_new['director'].str.contains(r",\s+\band\b",regex=True, case=False)]

Unnamed: 0,movie,director,m_year,critic,crit_org,crit_cn


Replacing all the commas with ' and ' so that I have a consistent separator for every multiple director cell.

In [44]:
df_new['director']=df_new['director'].str.replace(r",\s+",' and ',regex=True)
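
The simple comma replacement is safe here only because the Oxford-comma check came back empty. As a rough sketch of what would happen otherwise, the snippet below runs the same pattern on one director string from the table and on a hypothetical Oxford-comma variant that does not occur in this dataset. The `SEPARATOR` pattern and the `split_directors` helper are my own illustration, not part of the notebook.

```python
import re

# Director string as it appears in the table, plus a hypothetical
# Oxford-comma variant that does NOT occur in this dataset.
plain = "Pete Docter, David Silverman and Lee Unkrich"
oxford = "Pete Docter, David Silverman, and Lee Unkrich"

# The pattern used in the notebook: every comma becomes ' and '.
simple = re.compile(r",\s+")
print(simple.sub(" and ", plain))
print(simple.sub(" and ", oxford))  # an Oxford comma leaves a doubled 'and'

# A more defensive pattern (an assumption, not the notebook's code):
# treat ',', ', and ' and ' and ' as the same separator.
SEPARATOR = re.compile(r"\s*,\s*(?:and\s+)?|\s+and\s+")

def split_directors(cell):
    """Split a director cell into a list of individual names."""
    return [name.strip() for name in SEPARATOR.split(cell) if name.strip()]

print(split_directors(plain))
print(split_directors(oxford))
```

Splitting straight into a list like this would skip the intermediate ' and ' string, at the cost of a pattern that is harder to read.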

In [45]:
df_new[df_new['director'].str.contains(r"\bUnkrich\b",regex=True, case=False)]

Unnamed: 0,movie,director,m_year,critic,crit_org,crit_cn
54,Finding Nemo,Andrew Stanton and Lee Unkrich,2003,Tim Appelo,The Wrap,US
126,Toy Story 3,Lee Unkrich,2010,Lindsay Baker,BBC Culture,UK
495,Toy Story 3,Lee Unkrich,2010,Javier Porta Fouz,La Nacion,Argentina
719,Finding Nemo,Andrew Stanton and Lee Unkrich,2003,Ann Hornaday,The Washington Post,US
1343,Finding Nemo,Andrew Stanton and Lee Unkrich,2003,Sam Rigby,BBC Culture,UK
1396,"Monsters, Inc.",Pete Docter and David Silverman and Lee Unkrich,2001,Jonathan Romney,Freelance film critic,UK
1567,Toy Story 3,Lee Unkrich,2010,Eric D Snider,Freelance film critic,US


In [46]:
df_new['director'].apply(lambda x: len(re.findall(r'\band\b',x))+1)

0       1
1       1
2       1
3       1
4       1
5       1
6       1
7       1
8       1
9       1
10      1
11      1
12      1
13      1
14      1
15      1
16      1
17      1
18      1
19      1
20      1
21      1
22      1
23      1
24      1
25      1
26      1
27      1
28      1
29      1
30      1
31      1
32      1
33      1
34      1
35      1
36      1
37      1
38      1
39      1
40      1
41      1
42      1
43      1
44      1
45      1
46      1
47      1
48      1
49      1
50      2
51      1
52      1
53      1
54      2
55      1
56      1
57      1
58      1
59      1
60      2
61      1
62      1
63      1
64      1
65      1
66      1
67      1
68      2
69      2
70      1
71      1
72      1
73      1
74      1
75      1
76      1
77      2
78      1
79      1
80      1
81      1
82      1
83      1
84      1
85      1
86      2
87      1
88      1
89      1
90      1
91      1
92      1
93      1
94      1
95      1
96      1
97      1
98      1
99      1


Now I am transforming the director cells into lists using split, with ' and ' as the separator.

In [47]:
df_new['director']=df_new['director'].str.split(' and ')

In [48]:
df_new.iloc[[1396]]

Unnamed: 0,movie,director,m_year,critic,crit_org,crit_cn
1396,"Monsters, Inc.","[Pete Docter, David Silverman, Lee Unkrich]",2001,Jonathan Romney,Freelance film critic,UK


In [49]:
df_new

Unnamed: 0,movie,director,m_year,critic,crit_org,crit_cn
0,Mulholland Drive,[David Lynch],2001,Simon Abrams,Freelance film critic,US
1,In the Mood for Love,[Wong Kar-wai],2000,Simon Abrams,Freelance film critic,US
2,The Tree of Life,[Terrence Malick],2011,Simon Abrams,Freelance film critic,US
3,Yi Yi: A One and a Two,[Edward Yang],2000,Simon Abrams,Freelance film critic,US
4,Goodbye to Language,[Jean-Luc Godard],2014,Simon Abrams,Freelance film critic,US
5,The White Meadows,[Mohammad Rasoulof],2009,Simon Abrams,Freelance film critic,US
6,Night Across the Street,[Raoul Ruiz],2012,Simon Abrams,Freelance film critic,US
7,Certified Copy,[Abbas Kiarostami],2010,Simon Abrams,Freelance film critic,US
8,Sparrow,[Johnnie To],2008,Simon Abrams,Freelance film critic,US
9,Fados,[Carlos Saura],2007,Simon Abrams,Freelance film critic,US


This is fun...! I can use those lists to count the number of directors for a movie, like, why not?

In [50]:
#make dir numbers

df_new['nm_dir'] = df_new['director'].apply(len)

In [51]:
df_new[df_new['nm_dir']>2]

Unnamed: 0,movie,director,m_year,critic,crit_org,crit_cn,nm_dir
144,Madagascar 3: Europe's Most Wanted,"[Eric Darnell, Tom McGrath, Conrad Vernon]",2012,Nicholas Barber,BBC Culture,UK,3
399,7 Letters,"[Boo Junfeng, Eric Khoo, Jack Neo, K. Rajagopa...",2015,Lindiwe Dovey,University of London,UK,7
1396,"Monsters, Inc.","[Pete Docter, David Silverman, Lee Unkrich]",2001,Jonathan Romney,Freelance film critic,UK,3


But, more importantly, let's use **explode()**

This is why we put the director names into a list: explode() takes that list and makes a separate row for each element, repeating the other columns on every new row. This way we are "unwinding" the multiple directors.

This makes a new data frame with more rows than the original, one for each movie, critic and director combination.
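
To see the split-then-explode idea in isolation, here is a small sketch on a two-row toy table. The names `toy` and `toy_large` are made up for this example; the movie and director values are taken from the table above.

```python
import pandas as pd

# Two rows taken from the table above; the second has three directors.
toy = pd.DataFrame({
    "movie": ["Toy Story 3", "Monsters, Inc."],
    "director": ["Lee Unkrich",
                 "Pete Docter and David Silverman and Lee Unkrich"],
})

# Step 1: turn each cell into a list of names.
toy["director"] = toy["director"].str.split(" and ")

# Step 2: one row per list element; the other columns are repeated.
toy_large = toy.explode("director")
print(toy_large)

# The original index labels are repeated too, so reset the index if
# unique row labels are needed afterwards.
toy_large = toy_large.reset_index(drop=True)
print(toy_large.shape)
```

Note that the exploded rows keep the index label of the row they came from, which is why the filtered tables in this notebook can show the same index number more than once.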

In [52]:
df_large = df_new.explode('director')

In [53]:
df_large.shape

(1891, 7)

In [36]:
df_large[df_large['director'].str.contains(r"\bCoen\b",regex=True, case=False)]

Unnamed: 0,movie,director,m_year,critic,crit_org,crit_cn
50,No Country For Old Men,Joel and Ethan Coen,2007,Tim Appelo,The Wrap,US
77,Inside Llewyn Davis,Joel and Ethan Coen,2013,Michael Arbeiter,Nerdist,US
86,A Serious Man,Joel and Ethan Coen,2009,Ali Arikan,Dipnot TV,Turkey
108,No Country For Old Men,Joel and Ethan Coen,2007,Ana Maria Bahiana,Freelance film critic,Brazil
137,Inside Llewyn Davis,Joel and Ethan Coen,2013,Miriam Bale,Freelance film critic,US
189,A Serious Man,Joel and Ethan Coen,2009,Christian Blauvelt,BBC Culture,US
204,No Country For Old Men,Joel and Ethan Coen,2007,Andreas Borcholte,Spiegel Online,Germany
232,No Country For Old Men,Joel and Ethan Coen,2007,Hannah Brown,Jerusalem Post,Israel
242,"O Brother, Where Art Thou?",Joel and Ethan Coen,2000,Luke Buckmaster,The Guardian/BBC Culture,Australia
260,Inside Llewyn Davis,Joel and Ethan Coen,2013,Monica Castillo,New York Times Watching,US


Now we can get better aggregations with one director per row.

In [54]:
df_large['director'].value_counts().head(15)

director
Ethan Coen              52
Joel Coen               52
Paul Thomas Anderson    52
Wong Kar-wai            51
David Lynch             48
Richard Linklater       39
Hayao Miyazaki          35
Michael Haneke          35
Terrence Malick         32
Asghar Farhadi          31
David Fincher           31
Michel Gondry           30
Christopher Nolan       28
Wes Anderson            28
Alfonso Cuarón          24
Name: count, dtype: int64

In [55]:
df_large.groupby('movie')['critic'].nunique().sort_values(ascending=False).reset_index(name='count').head(15)

Unnamed: 0,movie,count
0,In the Mood for Love,49
1,Mulholland Drive,47
2,There Will Be Blood,35
3,Spirited Away,34
4,Boyhood,30
5,Eternal Sunshine of the Spotless Mind,29
6,A Separation,28
7,The Tree of Life,23
8,Yi Yi: A One and a Two,22
9,No Country For Old Men,21


Now I can get a much better director list!

In [61]:
d_list = list(df_large['director'].unique())
d_list.sort()
d_list

['Abbas Kiarostami',
 'Abdellatif Kechiche',
 'Abderrahmane Sissako',
 'Adam Curtis',
 'Adam McKay',
 'Agnieszka Holland',
 'Agnès Jaoui',
 'Agnès Varda',
 'Aki Kaurismäki',
 'Alain Cavalier',
 'Alain Gomis',
 'Alain Guiraudie',
 'Alain Resnais',
 'Alan Mak',
 'Albert Serra',
 'Alejandro González Iñárritu',
 'Aleksandr Sokurov',
 'Aleksey Fedorchenko',
 'Aleksey German',
 'Alex Garland',
 'Alexander Payne',
 'Alfonso Cuarón',
 'Amma Asante',
 'Ana Lily Amirpour',
 'Andrea Arnold',
 'Andrew Adamson',
 'Andrew Dominik',
 'Andrew Dosunmu',
 'Andrew Haigh',
 'Andrew Lau',
 'Andrew Stanton',
 'Andrey Zvyagintsev',
 'Andrzej Wajda',
 'Andrzej Zulawski',
 'André Singer',
 'Ang Lee',
 'Angela Ricci Lucchi',
 'Annemarie Jacir',
 'Anthony Russo',
 'Antonio Di Trapani',
 'Anurag Kashyap',
 'Apichatpong Weerasethakul',
 'Ari Folman',
 'Arnaud Desplechin',
 'Asghar Farhadi',
 'Ashutosh Gowariker',
 'Asif Kapadia',
 'Ava DuVernay',
 'Avi Nesher',
 'Bahman Ghobadi',
 'Bart Layton',
 'Baz Luhrmann',
 

## NEXT STEP

Getting more data!!

In [57]:
url = "https://www.imdb.com/search/name/?name=David%20Lynch"
#no headers on this request yet, so IMDb refuses it with a 403
raw_html = requests.get(url).content
print(raw_html)

b'<html>\r\n<head><title>403 Forbidden</title></head>\r\n<body>\r\n<center><h1>403 Forbidden</h1></center>\r\n</body>\r\n</html>\r\n'


In [62]:
for director in d_list:
    url = "https://www.imdb.com/search/name/?name=" + director.replace(" ","%20")
    print(url)

https://www.imdb.com/search/name/?name=Abbas%20Kiarostami
https://www.imdb.com/search/name/?name=Abdellatif%20Kechiche
https://www.imdb.com/search/name/?name=Abderrahmane%20Sissako
https://www.imdb.com/search/name/?name=Adam%20Curtis
https://www.imdb.com/search/name/?name=Adam%20McKay
https://www.imdb.com/search/name/?name=Agnieszka%20Holland
https://www.imdb.com/search/name/?name=Agnès%20Jaoui
https://www.imdb.com/search/name/?name=Agnès%20Varda
https://www.imdb.com/search/name/?name=Aki%20Kaurismäki
https://www.imdb.com/search/name/?name=Alain%20Cavalier
https://www.imdb.com/search/name/?name=Alain%20Gomis
https://www.imdb.com/search/name/?name=Alain%20Guiraudie
https://www.imdb.com/search/name/?name=Alain%20Resnais
https://www.imdb.com/search/name/?name=Alan%20Mak
https://www.imdb.com/search/name/?name=Albert%20Serra
https://www.imdb.com/search/name/?name=Alejandro%20González%20Iñárritu
https://www.imdb.com/search/name/?name=Aleksandr%20Sokurov
https://www.imdb.com/search/name/?name
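
Replacing spaces with `%20` by hand leaves accented letters (and characters such as `&`) exactly as typed. HTTP clients often encode these on the fly, but it may be safer to do it explicitly. A small sketch using the standard library's `urllib.parse.quote`; the `search_url` helper is a name I made up for this example:

```python
from urllib.parse import quote

BASE_URL = "https://www.imdb.com/search/name/?name="

def search_url(name):
    """Build a search URL with the name percent-encoded as UTF-8."""
    return BASE_URL + quote(name)

# Names taken from the director list above.
for director in ["David Lynch", "Agnès Varda", "Aki Kaurismäki"]:
    print(search_url(director))
```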

In [None]:
{"director":"David Lynch","link":"/name/nm0000186/?ref_=sr_t_1"}

In [None]:
#step 1: loop through director list and search for the link
#get a list of dictionaries (table) with just name and link to imdb page

In [None]:
#step 2: loop through that list of dicts,
#going to each individual page and adding to the dict
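
Both steps hinge on the `nm...` identifier inside the link. As a hedged sketch, the helpers below pull that identifier out of an href like the one in the example dictionary and rebuild a clean profile URL from it. `imdb_id` and `profile_url` are hypothetical names, and the URL shape is assumed from the links seen in this notebook.

```python
import re

# Matches the identifier in links such as /name/nm0000186/?ref_=sr_t_1
NM_ID = re.compile(r"/name/(nm\d+)")

def imdb_id(href):
    """Return the 'nm...' identifier inside an IMDb name link, or None."""
    match = NM_ID.search(href or "")
    return match.group(1) if match else None

def profile_url(href):
    """Rebuild a profile URL without the tracking query string."""
    nm = imdb_id(href)
    return f"https://www.imdb.com/name/{nm}/" if nm else None

print(imdb_id("/name/nm0000186/?ref_=sr_t_1"))
print(profile_url("/name/nm0000186/?ref_=sr_t_1"))
print(profile_url(None))  # a search with no result stays None
```

Returning `None` instead of raising keeps the loop going when a search finds nothing, which matches the `"link": None` rows used later on.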

In [59]:
raw_html = requests.get(url).content
raw_html

b'<html>\r\n<head><title>403 Forbidden</title></head>\r\n<body>\r\n<center><h1>403 Forbidden</h1></center>\r\n</body>\r\n</html>\r\n'

Getting more director info from IMDb

In [60]:
#tell imdb you are using a browser!!
head={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'}
#possible urls to use
# url ="https://www.imdb.com/search/title/?name=spirited%20away&title_type=feature"
url = "https://www.imdb.com/search/name/?name=David%20Lynch"
#add headers to request
raw_html = requests.get(url,headers=head).content
#save html file
with open('imdb.html', 'wb+') as f:
    f.write(raw_html)
soup_doc = BeautifulSoup(raw_html, "html.parser")

## New code starts here: 

### Grab the IMDb links of all directors

In [50]:
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd

# Base IMDb search URL
base_url = "https://www.imdb.com/search/name/?name="

# Adding headers to mimic browser request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'
}

# Build search URL for each director and extract their IMDb page link
director_links = []

for director in d_list:
    search_url = base_url + director.replace(" ", "%20")
    response = requests.get(search_url, headers=headers)
    soup = BeautifulSoup(response.content, "html.parser")
    
    # Find the first result's link (adjust if IMDb changes structure)
    link_tag = soup.find('a', href=re.compile(r'^/name/nm'))
    if link_tag:
        director_links.append({
            "director": director,
            "link": "https://www.imdb.com" + link_tag['href']
        })
    else:
        director_links.append({
            "director": director,
            "link": None
        })

# Convert to DataFrame
df_directors = pd.DataFrame(director_links)
print(df_directors)

                                              director  \
0                                     Abbas Kiarostami   
1                                  Abdellatif Kechiche   
2                                 Abderrahmane Sissako   
3                                          Adam Curtis   
4                                           Adam McKay   
5                                    Agnieszka Holland   
6                                          Agnès Jaoui   
7                                          Agnès Varda   
8                                       Aki Kaurismäki   
9                                       Alain Cavalier   
10                                         Alain Gomis   
11                                     Alain Guiraudie   
12                                       Alain Resnais   
13                                        Albert Serra   
14                         Alejandro González Iñárritu   
15                                   Aleksandr Sokurov   
16            

### Grab the IMDb links of the directors of the 15 most popular movies

In [65]:
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd

# Base IMDb search URL
base_url = "https://www.imdb.com/search/name/?name="

# Adding headers to mimic browser request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'
}

# Most Popular Films Top 15 (sorted by most mentions to least)
popular_films = (
    df_large.groupby('movie')['critic']
    .nunique()
    .sort_values(ascending=False)
    .reset_index(name='count')
    .head(15)
)

# Extract ordered top 15 movies
top_15_movies = popular_films['movie'].tolist()

# List of directors for the top 15 movies, keeping the order of movies
top_movie_directors = (
    df_large[df_large['movie'].isin(top_15_movies)]
    .drop_duplicates(subset=['movie'])
    .set_index('movie')
    .loc[top_15_movies, 'director']
    .tolist()
)

# Build search URL for each director and extract their IMDb page link
director_links = []

for director in top_movie_directors:
    search_url = base_url + director.replace(" ", "%20")
    response = requests.get(search_url, headers=headers)
    soup = BeautifulSoup(response.content, "html.parser")
    
    # Find the first result's link (adjust if IMDb changes structure)
    link_tag = soup.find('a', href=re.compile(r'^/name/nm'))
    if link_tag:
        director_links.append({
            "director": director,
            "link": "https://www.imdb.com" + link_tag['href']
        })
    else:
        director_links.append({
            "director": director,
            "link": None
        })

# Print the director links in the order of top 15 movies
print("Top 15 Movie Directors and their IMDb Links (Ordered by Movie Mentions):")
for movie, entry in zip(top_15_movies, director_links):
    print(f"Movie: {movie}, Director: {entry['director']}, Link: {entry['link']}")

Top 15 Movie Directors and their IMDb Links (Ordered by Movie Mentions):
Movie: In the Mood for Love, Director: Wong Kar-wai, Link: https://www.imdb.com/name/nm0939182/?ref_=sr_i_1
Movie: Mulholland Drive, Director: David Lynch, Link: https://www.imdb.com/name/nm0000186/?ref_=sr_i_1
Movie: There Will Be Blood, Director: Paul Thomas Anderson, Link: https://www.imdb.com/name/nm0000759/?ref_=sr_i_1
Movie: Spirited Away, Director: Hayao Miyazaki, Link: https://www.imdb.com/name/nm0594503/?ref_=sr_i_1
Movie: Boyhood, Director: Richard Linklater, Link: https://www.imdb.com/name/nm0000500/?ref_=sr_i_1
Movie: Eternal Sunshine of the Spotless Mind, Director: Michel Gondry, Link: https://www.imdb.com/name/nm0327273/?ref_=sr_i_1
Movie: A Separation, Director: Asghar Farhadi, Link: https://www.imdb.com/name/nm1410815/?ref_=sr_i_1
Movie: The Tree of Life, Director: Terrence Malick, Link: https://www.imdb.com/name/nm0000517/?ref_=sr_i_1
Movie: Yi Yi: A One and a Two, Director: Edward Yang, Link: htt

### Build the bio links for the directors of the 15 most popular movies

In [67]:
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd

# Base IMDb search URL
base_url = "https://www.imdb.com/search/name/?name="

# Adding headers to mimic browser request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'
}

# Most Popular Films Top 15 (sorted by most mentions to least)
popular_films = (
    df_large.groupby('movie')['critic']
    .nunique()
    .sort_values(ascending=False)
    .reset_index(name='count')
    .head(15)
)

# Extract ordered top 15 movies
top_15_movies = popular_films['movie'].tolist()

# List of directors for the top 15 movies, keeping the order of movies
top_movie_directors = (
    df_large[df_large['movie'].isin(top_15_movies)]
    .drop_duplicates(subset=['movie'])
    .set_index('movie')
    .loc[top_15_movies, 'director']
    .tolist()
)

# Build search URL for each director and extract their IMDb bio page link
director_bio_links = []

for director in top_movie_directors:
    search_url = base_url + director.replace(" ", "%20")
    response = requests.get(search_url, headers=headers)
    soup = BeautifulSoup(response.content, "html.parser")
    
    link_tag = soup.find('a', href=re.compile(r'^/name/nm'))
    if link_tag:
        nm_id = re.search(r'/name/(nm\d+)/', link_tag['href']).group(1)  # Extract the nm ID
        bio_link = f"https://www.imdb.com/name/{nm_id}/bio/?ref_=nm_ov_bio_sm"
        director_bio_links.append({
            "director": director,
            "bio_link": bio_link
        })
    else:
        director_bio_links.append({
            "director": director,
            "bio_link": None
        })

# Print the director bio links in the order of top 15 movies
print("Top 15 Movie Directors and their IMDb Bio Links (Ordered by Movie Mentions):")
for movie, entry in zip(top_15_movies, director_bio_links):
    print(f"Movie: {movie}, Director: {entry['director']}, Bio Link: {entry['bio_link']}")

Top 15 Movie Directors and their IMDb Bio Links (Ordered by Movie Mentions):
Movie: In the Mood for Love, Director: Wong Kar-wai, Bio Link: https://www.imdb.com/name/nm0939182/bio/?ref_=nm_ov_bio_sm
Movie: Mulholland Drive, Director: David Lynch, Bio Link: https://www.imdb.com/name/nm0000186/bio/?ref_=nm_ov_bio_sm
Movie: There Will Be Blood, Director: Paul Thomas Anderson, Bio Link: https://www.imdb.com/name/nm0000759/bio/?ref_=nm_ov_bio_sm
Movie: Spirited Away, Director: Hayao Miyazaki, Bio Link: https://www.imdb.com/name/nm0594503/bio/?ref_=nm_ov_bio_sm
Movie: Boyhood, Director: Richard Linklater, Bio Link: https://www.imdb.com/name/nm0000500/bio/?ref_=nm_ov_bio_sm
Movie: Eternal Sunshine of the Spotless Mind, Director: Michel Gondry, Bio Link: https://www.imdb.com/name/nm0327273/bio/?ref_=nm_ov_bio_sm
Movie: A Separation, Director: Asghar Farhadi, Bio Link: https://www.imdb.com/name/nm1410815/bio/?ref_=nm_ov_bio_sm
Movie: The Tree of Life, Director: Terrence Malick, Bio Link: https:
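
Each of these cells repeats the same IMDb searches from scratch. In the same spirit as saving the BBC page to a local file, one option is to cache every lookup on disk so a name is only ever requested once. This is just a sketch under my own assumptions: `cached_lookup`, the `director_links.json` filename and the `fetch` argument are all hypothetical, and the demo uses a stand-in function instead of a real request.

```python
import json
import os
import tempfile
import time
from pathlib import Path

def cached_lookup(names, fetch, cache_path="director_links.json", delay=1.0):
    """Return {name: result}, calling fetch(name) only for uncached names.

    fetch is any function that takes a name and returns a JSON-friendly
    value (for example a link string, or None).
    """
    path = Path(cache_path)
    cache = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    for name in names:
        if name in cache:
            continue
        cache[name] = fetch(name)
        # Save after every lookup so an interrupted run loses nothing.
        path.write_text(json.dumps(cache, ensure_ascii=False, indent=2),
                        encoding="utf-8")
        time.sleep(delay)  # pause between requests to go easy on the server
    return {name: cache[name] for name in names}

# Demo with a stand-in fetch function (no network access needed).
calls = []

def fake_fetch(name):
    calls.append(name)
    return "link-for-" + name

demo_path = os.path.join(tempfile.mkdtemp(), "demo_cache.json")
first = cached_lookup(["David Lynch", "Wong Kar-wai"], fake_fetch, demo_path, delay=0)
second = cached_lookup(["David Lynch", "Edward Yang"], fake_fetch, demo_path, delay=0)
print(calls)  # David Lynch is fetched only once
```

In the notebook, `fetch` would be a function wrapping the search request and the `soup.find(...)` call, so the later cells could reuse the saved links instead of searching again.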

### Extract the country of birth of the directors of the 15 most popular movies

In [76]:
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd

# Base IMDb search URL
base_url = "https://www.imdb.com/search/name/?name="

# Adding headers to mimic browser request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'
}

# Most Popular Films Top 15 (already sorted by most mentions to least)
popular_films = (
    df_large.groupby('movie')['critic']
    .nunique()
    .sort_values(ascending=False)
    .reset_index(name='count')
    .head(15)
)

# Extract ordered top 15 movies
top_15_movies = popular_films['movie'].tolist()

# List of directors for the top 15 movies, keeping the order of movies
top_movie_directors = (
    df_large[df_large['movie'].isin(top_15_movies)]
    .drop_duplicates(subset=['movie'])
    .set_index('movie')
    .loc[top_15_movies, 'director']
    .tolist()
)

# Build search URL for each director and extract their IMDb bio page link
director_bio_links = []

for director in top_movie_directors:
    search_url = base_url + director.replace(" ", "%20")
    response = requests.get(search_url, headers=headers)
    soup = BeautifulSoup(response.content, "html.parser")
    
    # Find the first result's link (adjust if IMDb changes structure)
    link_tag = soup.find('a', href=re.compile(r'^/name/nm'))
    if link_tag:
        nm_id = re.search(r'/name/(nm\d+)/', link_tag['href']).group(1)  # Extract the nm ID
        bio_link = f"https://www.imdb.com/name/{nm_id}/bio/?ref_=nm_ov_bio_sm"
        director_bio_links.append({
            "director": director,
            "bio_link": bio_link
        })
    else:
        director_bio_links.append({
            "director": director,
            "bio_link": None
        })

# Extract country of birth from each director's IMDb bio page
director_countries = []

for entry in director_bio_links:
    bio_link = entry['bio_link']
    if bio_link:
        response = requests.get(bio_link, headers=headers)
        soup = BeautifulSoup(response.content, "html.parser")
        
        # Find the "Born" section
        born_section = soup.find('li', {'id': 'born'})
        if born_section:
            birth_place_tag = born_section.find_all('a', href=True)
            if birth_place_tag:
                # Combine all parts of the birthplace
                birth_place = ", ".join([tag.text.strip() for tag in birth_place_tag])
                # Extract only the country part
                country = birth_place.split(",")[-1].strip()
            else:
                country = "Unknown"
        else:
            country = "Unknown"
    else:
        country = "Unknown"
    
    director_countries.append({
        "director": entry['director'],
        "bio_link": bio_link,
        "country": country
    })

# Print results
print("Top 15 Directors and their Countries of Birth:")
for movie, entry in zip(top_15_movies, director_countries):
    print(f"Movie: {movie}, Director: {entry['director']}, Country: {entry['country']}")

Top 15 Directors and their Countries of Birth:
Movie: In the Mood for Love, Director: Wong Kar-wai, Country: China
Movie: Mulholland Drive, Director: David Lynch, Country: USA
Movie: There Will Be Blood, Director: Paul Thomas Anderson, Country: USA
Movie: Spirited Away, Director: Hayao Miyazaki, Country: Japan
Movie: Boyhood, Director: Richard Linklater, Country: USA
Movie: Eternal Sunshine of the Spotless Mind, Director: Michel Gondry, Country: France
Movie: A Separation, Director: Asghar Farhadi, Country: Iran
Movie: The Tree of Life, Director: Terrence Malick, Country: USA
Movie: Yi Yi: A One and a Two, Director: Edward Yang, Country: China
Movie: No Country For Old Men, Director: Joel Coen, Country: USA
Movie: Inside Llewyn Davis, Director: Joel Coen, Country: USA
Movie: Children of Men, Director: Alfonso Cuarón, Country: Mexico
Movie: Pan's Labyrinth, Director: Guillermo Del Toro, Country: Mexico
Movie: 4 Months, 3 Weeks & 2 Days, Director: Cristian Mungiu, Country: Romania
Movie:

### Extract the city and country of birth of the directors of the 15 most popular movies

In [101]:
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd

# Base IMDb search URL
base_url = "https://www.imdb.com/search/name/?name="

# Adding headers to mimic browser request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'
}

# Most Popular Films Top 15 (already sorted by most mentions to least)
popular_films = (
    df_large.groupby('movie')['critic']
    .nunique()
    .sort_values(ascending=False)
    .reset_index(name='count')
    .head(15)
)

# Extract ordered top 15 movies
top_15_movies = popular_films['movie'].tolist()

# List of directors for the top 15 movies, keeping the order of movies
top_movie_directors = (
    df_large[df_large['movie'].isin(top_15_movies)]
    .drop_duplicates(subset=['movie'])
    .set_index('movie')
    .loc[top_15_movies, 'director']
    .tolist()
)

# Build search URL for each director and extract their IMDb bio page link
director_bio_links = []

for director in top_movie_directors:
    search_url = base_url + director.replace(" ", "%20")
    response = requests.get(search_url, headers=headers)
    soup = BeautifulSoup(response.content, "html.parser")
    
    # Find the first result's link (adjust if IMDb changes structure)
    link_tag = soup.find('a', href=re.compile(r'^/name/nm'))
    if link_tag:
        nm_id = re.search(r'/name/(nm\d+)/', link_tag['href']).group(1)  # Extract the nm ID
        bio_link = f"https://www.imdb.com/name/{nm_id}/bio/?ref_=nm_ov_bio_sm"
        director_bio_links.append({
            "director": director,
            "bio_link": bio_link
        })
    else:
        director_bio_links.append({
            "director": director,
            "bio_link": None
        })

# Extract full birth location (cleaned) from each director's IMDb bio page
director_birth_locations = []

for entry in director_bio_links:
    bio_link = entry['bio_link']
    if bio_link:
        response = requests.get(bio_link, headers=headers)
        soup = BeautifulSoup(response.content, "html.parser")
        
        # Find the "Born" section
        born_section = soup.find('li', {'id': 'born'})
        if born_section:
            # Extract text
            full_born_text = born_section.get_text(strip=True)
            
            # Remove 'Born' and the date
            birth_place = re.sub(r"^Born.*?·", "", full_born_text).strip()
        else:
            birth_place = "Unknown"
    else:
        birth_place = "Unknown"
    
    director_birth_locations.append({
        "director": entry['director'],
        "bio_link": bio_link,
        "birth_place": birth_place
    })

# Print results
print("Top 15 Directors and their Clean Birth Locations:")
for movie, entry in zip(top_15_movies, director_birth_locations):
    print(f"Movie: {movie}, Director: {entry['director']}, Birth Place: {entry['birth_place']}")

Top 15 Directors and their Clean Birth Locations:
Movie: In the Mood for Love, Director: Wong Kar-wai, Birth Place: Shanghai, China
Movie: Mulholland Drive, Director: David Lynch, Birth Place: Missoula, Montana, USA
Movie: There Will Be Blood, Director: Paul Thomas Anderson, Birth Place: Studio City, California, USA
Movie: Spirited Away, Director: Hayao Miyazaki, Birth Place: Tokyo, Japan
Movie: Boyhood, Director: Richard Linklater, Birth Place: Houston, Texas, USA
Movie: Eternal Sunshine of the Spotless Mind, Director: Michel Gondry, Birth Place: Versailles, Seine-et-Oise [now Yvelines], France
Movie: A Separation, Director: Asghar Farhadi, Birth Place: Khomeyni Shahr, Isfahan, Iran
Movie: The Tree of Life, Director: Terrence Malick, Birth Place: Ottawa, Illinois, USA
Movie: Yi Yi: A One and a Two, Director: Edward Yang, Birth Place: Shanghai, China
Movie: No Country For Old Men, Director: Joel Coen, Birth Place: Minneapolis, Minnesota, USA
Movie: Inside Llewyn Davis, Director: Joel C
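
Once the birth place is a clean string, it can be broken into separate columns. The sketch below assumes the pattern seen in the output above (city first, country last, anything in between treated as a region). `split_birth_place` is a hypothetical helper, and the rule will not hold for every place name.

```python
def split_birth_place(birth_place):
    """Split 'City, Region, Country' into a small dict.

    The last comma-separated part is taken as the country and the first
    as the city; anything in between is kept as the region.
    """
    if not birth_place or birth_place == "Unknown":
        return {"city": None, "region": None, "country": None}
    parts = [part.strip() for part in birth_place.split(",")]
    country = parts[-1]
    city = parts[0] if len(parts) > 1 else None
    region = ", ".join(parts[1:-1]) or None
    return {"city": city, "region": region, "country": country}

# Strings taken from the output above, plus the "Unknown" fallback.
for place in ["Shanghai, China",
              "Missoula, Montana, USA",
              "Versailles, Seine-et-Oise [now Yvelines], France",
              "Unknown"]:
    print(split_birth_place(place))
```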

### Dataframe

In [95]:
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd

# Function to normalize country names
def normalize_country_name(country_name):
    country_mapping = {
        "USA": "US",
        "United States": "US",
        "United Kingdom": "UK",
    }
    return country_mapping.get(country_name, country_name)  # Default to the original name if not found

# Normalize the 'dir_cn' and 'crit_cn' columns in the DataFrame
def normalize_country_columns(df):
    df['dir_cn'] = df['dir_cn'].apply(normalize_country_name)
    df['crit_cn'] = df['crit_cn'].apply(lambda x: ', '.join(
        [normalize_country_name(country.strip()) for country in x.split(",")])
    )
    return df

# Base IMDb search URL
base_url = "https://www.imdb.com/search/name/?name="

# Adding headers to mimic browser request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'
}

# Extract Top 15 Movies
popular_films = (
    df_large.groupby('movie')['critic']
    .nunique()
    .sort_values(ascending=False)
    .reset_index(name='count')
    .head(15)
)

# Extract ordered top 15 movies
top_15_movies = popular_films['movie'].tolist()

# List of directors for the top 15 movies
top_movie_directors = (
    df_large[df_large['movie'].isin(top_15_movies)]
    .drop_duplicates(subset=['movie'])
    .set_index('movie')
    .loc[top_15_movies, 'director']
    .tolist()
)

# Build search URL for each director and extract their IMDb bio page link
director_bio_links = []

for director in top_movie_directors:
    search_url = base_url + director.replace(" ", "%20")
    response = requests.get(search_url, headers=headers)
    soup = BeautifulSoup(response.content, "html.parser")
    
    # Find the first result's link (adjust if IMDb changes structure)
    link_tag = soup.find('a', href=re.compile(r'^/name/nm'))
    if link_tag:
        nm_id = re.search(r'/name/(nm\d+)/', link_tag['href']).group(1)  # Extract the nm ID
        bio_link = f"https://www.imdb.com/name/{nm_id}/bio/?ref_=nm_ov_bio_sm"
        director_bio_links.append({
            "director": director,
            "bio_link": bio_link
        })
    else:
        director_bio_links.append({
            "director": director,
            "bio_link": None
        })

# Extract Director Countries
director_countries = []

for entry in director_bio_links:
    bio_link = entry['bio_link']
    if bio_link:
        response = requests.get(bio_link, headers=headers)
        soup = BeautifulSoup(response.content, "html.parser")
        
        # Find the "Born" section
        born_section = soup.find('li', {'id': 'born'})
        if born_section:
            birth_place_tag = born_section.find_all('a', href=True)
            if birth_place_tag:
                birth_place = ", ".join([tag.text.strip() for tag in birth_place_tag])
                country = birth_place.split(",")[-1].strip()
            else:
                country = "Unknown"
        else:
            country = "Unknown"
    else:
        country = "Unknown"
    
    director_countries.append({
        "director": entry['director'],
        "bio_link": bio_link,
        "country": country
    })

# Map Director Countries to DataFrame
director_country_map = {entry['director']: entry['country'] for entry in director_countries}

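# Create Aggregated DataFrame
# df_top_movies_directors is not defined anywhere in this cell, so it is rebuilt
# here (assumed to be the rows of df_large for the top 15 movies). The .copy()
# makes it an independent frame before a new column is added to it.
df_top_movies_directors = df_large[df_large['movie'].isin(top_15_movies)].copy()
df_top_movies_directors['dir_cn'] = df_top_movies_directors['director'].map(director_country_map)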
aggregated_df = (
    df_top_movies_directors.groupby('movie').agg({
        'director': 'first',
        'm_year': 'first',
        'dir_cn': 'first',
        'critic': lambda x: ', '.join(x.unique()),
        'crit_cn': lambda x: ', '.join(x.unique()),
    })
    .reset_index()
)
aggregated_df['count'] = df_top_movies_directors.groupby('movie')['critic'].nunique().values
aggregated_df = aggregated_df[['movie', 'director', 'm_year', 'dir_cn', 'count', 'critic', 'crit_cn']]
aggregated_df = aggregated_df.sort_values(by='count', ascending=False)

# Normalize country names in the aggregated DataFrame
aggregated_df = normalize_country_columns(aggregated_df)

# Save to CSV
aggregated_df.to_csv("top_15_movies_details.csv", index=False)

# Print the aggregated dataframe
print("Aggregated DataFrame with Normalized Country Names:")
print(aggregated_df)

Aggregated DataFrame with Normalized Country Names:
                                    movie              director  m_year  \
6                    In the Mood for Love          Wong Kar-wai    2000   
8                        Mulholland Drive           David Lynch    2001   
13                    There Will Be Blood  Paul Thomas Anderson    2007   
11                          Spirited Away        Hayao Miyazaki    2001   
2                                 Boyhood     Richard Linklater    2014   
4   Eternal Sunshine of the Spotless Mind         Michel Gondry    2004   
1                            A Separation        Asghar Farhadi    2011   
12                       The Tree of Life       Terrence Malick    2011   
14                 Yi Yi: A One and a Two           Edward Yang    2000   
9                  No Country For Old Men             Joel Coen    2007   
7                     Inside Llewyn Davis             Joel Coen    2013   
3                         Children of Men       

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_top_movies_directors['dir_cn'] = df_top_movies_directors['director'].map(director_country_map)
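
The warning above is what pandas emits when a column is assigned on a DataFrame that was itself produced by filtering another one. The sketch below shows one common way to avoid it: take an explicit `.copy()` of the filtered rows before adding the column. It assumes `df_top_movies_directors` was built by filtering a larger frame; the `df_large_demo` rows and `country_map_demo` lookup are made-up stand-ins, not the scraped data.

```python
import pandas as pd

# Toy stand-in for the larger table (hypothetical rows, not the scraped data)
df_large_demo = pd.DataFrame({
    "movie": ["Boyhood", "Boyhood", "Holy Motors"],
    "director": ["Richard Linklater", "Richard Linklater", "Leos Carax"],
    "critic": ["Critic A", "Critic B", "Critic C"],
})

# Hypothetical director -> country lookup
country_map_demo = {"Richard Linklater": "US", "Leos Carax": "France"}

# Filtering returns a slice of the original; .copy() makes it independent
subset = df_large_demo[df_large_demo["movie"] == "Boyhood"].copy()

# Adding a column to the explicit copy leaves the original frame untouched
subset["dir_cn"] = subset["director"].map(country_map_demo)

print(subset)
```

With the copy in place, the new column lives only on `subset`, and the source frame keeps its original columns.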


In [None]:
### Updated DataFrame including each director's specific city of birth

In [102]:
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd

# Function to normalize country names
def normalize_country_name(country_name):
    country_mapping = {
        "USA": "US",
        "United States": "US",
        "United Kingdom": "UK",
    }
    return country_mapping.get(country_name, country_name)  # Default to the original name if not found

# Normalize the 'dir_cn' and 'crit_cn' columns in the DataFrame
def normalize_country_columns(df):
    df['dir_cn'] = df['dir_cn'].apply(normalize_country_name)
    df['crit_cn'] = df['crit_cn'].apply(lambda x: ', '.join(
        [normalize_country_name(country.strip()) for country in x.split(",")])
    )
    return df

# Base IMDb search URL
base_url = "https://www.imdb.com/search/name/?name="

# Adding headers to mimic browser request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'
}

# Extract Top 15 Movies
popular_films = (
    df_large.groupby('movie')['critic']
    .nunique()
    .sort_values(ascending=False)
    .reset_index(name='count')
    .head(15)
)

# Extract ordered top 15 movies
top_15_movies = popular_films['movie'].tolist()

# List of directors for the top 15 movies
top_movie_directors = (
    df_large[df_large['movie'].isin(top_15_movies)]
    .drop_duplicates(subset=['movie'])
    .set_index('movie')
    .loc[top_15_movies, 'director']
    .tolist()
)

# Build search URL for each director and extract their IMDb bio page link
director_bio_links = []

for director in top_movie_directors:
    # Percent-encode the name so spaces and special characters are URL-safe
    search_url = base_url + requests.utils.quote(director)
    response = requests.get(search_url, headers=headers)
    soup = BeautifulSoup(response.content, "html.parser")
    
    # Find the first result's link and pull the nm ID out of its href
    link_tag = soup.find('a', href=re.compile(r'^/name/nm\d+'))
    nm_match = re.search(r'/name/(nm\d+)', link_tag['href']) if link_tag else None
    # Store None when no usable link was found so the next step can skip it
    bio_link = (
        f"https://www.imdb.com/name/{nm_match.group(1)}/bio/?ref_=nm_ov_bio_sm"
        if nm_match else None
    )
    director_bio_links.append({"director": director, "bio_link": bio_link})

# Extract Director Full Birth Locations
director_birth_locations = []

for entry in director_bio_links:
    bio_link = entry['bio_link']
    if bio_link:
        response = requests.get(bio_link, headers=headers)
        soup = BeautifulSoup(response.content, "html.parser")
        
        # Find the "Born" section
        born_section = soup.find('li', {'id': 'born'})
        if born_section:
            # Extract text
            full_born_text = born_section.get_text(strip=True)
            birth_place = re.sub(r"^Born.*?·", "", full_born_text).strip()  # Remove 'Born' and date
        else:
            birth_place = "Unknown"
    else:
        birth_place = "Unknown"
    
    director_birth_locations.append({
        "director": entry['director'],
        "bio_link": bio_link,
        "birth_place": birth_place
    })

# Map birth places to directors
director_birth_map = {entry['director']: entry['birth_place'] for entry in director_birth_locations}

# Map each director's full birth place into the 'dir_cn' column (replacing the country-only value)
df_top_movies_directors['dir_cn'] = df_top_movies_directors['director'].map(director_birth_map)

# Create Aggregated DataFrame
aggregated_df = (
    df_top_movies_directors.groupby('movie').agg({
        'director': 'first',
        'm_year': 'first',
        'dir_cn': 'first',
        'critic': lambda x: ', '.join(x.unique()),
        'crit_cn': lambda x: ', '.join(x.unique()),
    })
    .reset_index()
)
# groupby('movie') sorts by movie in both calls, so the positional .values line up row for row
aggregated_df['count'] = df_top_movies_directors.groupby('movie')['critic'].nunique().values
aggregated_df = aggregated_df[['movie', 'director', 'm_year', 'dir_cn', 'count', 'critic', 'crit_cn']]

# Normalize country names in the aggregated DataFrame
aggregated_df = normalize_country_columns(aggregated_df)

# Save to CSV
aggregated_df.to_csv("top_15_movies_details.csv", index=False)

# Print the aggregated dataframe
print("Aggregated DataFrame with Normalized Country Names and Full Birth Locations:")
print(aggregated_df)

Aggregated DataFrame with Normalized Country Names and Full Birth Locations:
                                    movie              director  m_year  \
0              4 Months, 3 Weeks & 2 Days       Cristian Mungiu    2007   
1                            A Separation        Asghar Farhadi    2011   
2                                 Boyhood     Richard Linklater    2014   
3                         Children of Men        Alfonso Cuarón    2006   
4   Eternal Sunshine of the Spotless Mind         Michel Gondry    2004   
5                             Holy Motors            Leos Carax    2012   
6                    In the Mood for Love          Wong Kar-wai    2000   
7                     Inside Llewyn Davis             Joel Coen    2013   
8                        Mulholland Drive           David Lynch    2001   
9                  No Country For Old Men             Joel Coen    2007   
10                        Pan's Labyrinth    Guillermo Del Toro    2006   
11                     

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_top_movies_directors['dir_cn'] = df_top_movies_directors['director'].map(director_birth_map)
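
The birth-place extraction in the cell above relies on one regular expression to strip the "Born" label and date. The sketch below isolates that step so it can be tried on a single string. The sample text is a hypothetical guess at what `get_text(strip=True)` returns; the real IMDb markup may differ. `split_born_text` is a helper name introduced here for illustration.

```python
import re

def split_born_text(full_born_text):
    """Return (birth_place, country) from a flattened 'Born' entry."""
    # Drop everything up to and including the first '·' separator
    birth_place = re.sub(r"^Born.*?·", "", full_born_text).strip()
    # Treat the last comma-separated segment as the country
    country = birth_place.split(",")[-1].strip()
    return birth_place, country

# Hypothetical sample text; the real page text may be laid out differently
sample = "BornJanuary 20, 1946·Missoula, Montana, USA"
print(split_born_text(sample))
```

If the text contains no `·` separator, the substitution changes nothing and the whole string is returned as the birth place, so it is worth spot-checking a few results by eye.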


In [105]:
import pandas as pd
import json
import random

# Load the aggregated DataFrame
aggregated_df = pd.read_csv("top_15_movies_details.csv")

# Detailed coordinates for director birth places, as [longitude, latitude] (GeoJSON order)
detailed_location_coordinates = {
    "Iasi, Romania": [27.58056, 47.15845],
    "Khomeyni Shahr, Isfahan, Iran": [51.5274, 32.7007],
    "Houston, Texas, USA": [-95.3698, 29.7604],
    "Mexico City, Distrito Federal, Mexico": [-99.1332, 19.4326],
    "Versailles, Seine-et-Oise [now Yvelines], France": [2.1333, 48.8014],
    "Suresnes, Seine [now Hauts-de-Seine], France": [2.2291, 48.8716],
    "Shanghai, China": [121.4737, 31.2304],
    "Minneapolis, Minnesota, USA": [-93.2650, 44.9778],
    "Missoula, Montana, USA": [-113.9966, 46.8721],
    "Guadalajara, Jalisco, Mexico": [-103.3918, 20.6597],
    "Tokyo, Japan": [139.6917, 35.6895],
    "Ottawa, Illinois, USA": [-88.8426, 41.3456],
    "Studio City, California, USA": [-118.3871, 34.1486]
}

# Country-level coordinates for critics, as [longitude, latitude] (GeoJSON order)
country_coordinates = {
    "US": [-95.712891, 37.09024],
    "France": [2.213749, 46.227638],
    "UK": [-3.435973, 55.378051],
    "Japan": [138.252924, 36.204824],
    "China": [104.195397, 35.86166],
    "Iran": [53.688046, 32.427908],
    "Mexico": [-102.552784, 23.634501],
    "Romania": [24.96676, 45.943161],
    "Turkey": [35.243322, 39.066162],
    "Argentina": [-63.616672, -38.416097],
    "India": [78.96288, 20.593684],
    "Australia": [133.775136, -25.274398],
    "Cuba": [-77.781167, 21.521757],
    "Belgium": [4.469936, 50.503887],
    "Colombia": [-74.297333, 4.570868],
    "Germany": [10.451526, 51.165691],
    "Lebanon": [35.862285, 33.854721],
    "South Korea": [127.766922, 35.907757],
    "Singapore": [103.819836, 1.352083],
    "Taiwan": [120.960515, 23.69781],
    "Qatar": [51.183884, 25.354826],
    "Chile": [-71.542969, -35.675147],
    "Israel": [34.851612, 31.046051],
    "Italy": [12.56738, 41.87194],
    "Switzerland": [8.227512, 46.818188],
    "Bangladesh": [90.356331, 23.684994],
    "Indonesia": [113.921327, -0.789275],
    "Austria": [14.550072, 47.516231],
    "Egypt": [30.802498, 26.820553],
    "Philippines": [121.774017, 12.879721],
    "Brazil": [-51.92528, -14.235004],
    "Hong Kong": [114.109497, 22.396428],
    "Canada": [-106.346771, 56.130366],
    "Kazakhstan": [66.923684, 48.019573],
    "UAE": [53.847818, 23.424076],
    "Senegal": [-14.452362, 14.497401],
}

# Assign a random hex color to each movie (random colors are not guaranteed to be distinct)
def generate_colors(n):
    return ["#" + ''.join(random.choices("0123456789ABCDEF", k=6)) for _ in range(n)]

movie_colors = generate_colors(len(aggregated_df))
color_map = {movie: color for movie, color in zip(aggregated_df["movie"], movie_colors)}

# Initialize GeoJSON structure
geojson_data = {
    "type": "FeatureCollection",
    "features": []
}

# Track location usage to slightly offset overlapping points
location_usage = {}

# Build GeoJSON for directors and critics
for _, row in aggregated_df.iterrows():
    movie_color = color_map.get(row["movie"], "#000000")  # Default color if movie is not found

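    # Add director's point (a missing birth place becomes "Unknown" and maps to [0, 0])
    dir_cn = row.get("dir_cn", "Unknown")
    director_location = dir_cn.strip() if isinstance(dir_cn, str) else "Unknown"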
    director_coordinates = detailed_location_coordinates.get(director_location, [0, 0])
    usage_count = location_usage.get(director_location, 0)
    director_coordinates = [
        director_coordinates[0] + usage_count * 0.02,
        director_coordinates[1] + usage_count * 0.02,
    ]
    location_usage[director_location] = usage_count + 1

    geojson_data["features"].append({
        "type": "Feature",
        "properties": {
            "name": row["movie"],
            "group_name": f"Director - {row['director']}",
            "group_id": 1,
            "headline": f"{row['movie']} by {row['director']}",
            "article": f"<p>{row['movie']} directed by {row['director']} with {row['count']} mentions.</p>",
            "color": movie_color,
            "radius": 10
        },
        "geometry": {
            "type": "Point",
            "coordinates": director_coordinates
        }
    })

    # Add critics' points
    critic_countries = [country.strip() for country in row["crit_cn"].split(",")]
    for critic_country in critic_countries:
        critic_coordinates = country_coordinates.get(critic_country, [0, 0])
        geojson_data["features"].append({
            "type": "Feature",
            "properties": {
                "name": row["movie"],
                "group_name": f"Critics from {critic_country}",
                "group_id": 2,
                "headline": f"Critics from {critic_country} mentioned '{row['movie']}'",
                "article": f"<p>Critics from {critic_country} mentioned '{row['movie']}'</p>",
                "color": movie_color,
                "radius": 5
            },
            "geometry": {
                "type": "Point",
                "coordinates": critic_coordinates
            }
        })

# Save GeoJSON to a JavaScript file
with open("geo-data.js", "w") as f:
    f.write(f"infoData = {json.dumps(geojson_data, indent=2)};")

print("GeoJSON file created: geo-data.js")

GeoJSON file created: geo-data.js
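
Because the loop above falls back to `[0, 0]` whenever a location is missing from the lookup dictionaries, it can help to scan the finished features before mapping them. The following is a minimal sketch of such a check, not part of the original notebook; `find_suspect_points` and the `demo` collection are hypothetical names and data used only for illustration.

```python
def find_suspect_points(geojson):
    """Return (name, reason) pairs for point features with doubtful coordinates."""
    suspects = []
    for feature in geojson["features"]:
        # GeoJSON positions are ordered [longitude, latitude]
        lon, lat = feature["geometry"]["coordinates"]
        name = feature["properties"].get("name", "?")
        if not (-180 <= lon <= 180 and -90 <= lat <= 90):
            suspects.append((name, "out of range"))
        elif abs(lon) < 0.5 and abs(lat) < 0.5:
            # Near [0, 0]: most likely the lookup fallback, plus any small offset
            suspects.append((name, "fallback [0, 0]"))
    return suspects

# Hypothetical features: one valid, one fallback, one with swapped coordinates
demo = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"name": "ok"},
         "geometry": {"type": "Point", "coordinates": [139.6917, 35.6895]}},
        {"type": "Feature", "properties": {"name": "missing"},
         "geometry": {"type": "Point", "coordinates": [0, 0]}},
        {"type": "Feature", "properties": {"name": "swapped"},
         "geometry": {"type": "Point", "coordinates": [35.6895, 139.6917]}},
    ],
}
print(find_suspect_points(demo))
```

A range check only catches swapped pairs whose latitude ends up outside ±90, so points that land in an unexpected country still need a visual check on the map.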


In [None]:
# this map uses geo-data copy 2.js