## Homework 7.0: BBC Movie List Scraping and Regex

In 2016 the BBC polled 177 film critics to get their picks for the best films of the century so far. While the BBC's [aggregate poll](http://www.bbc.com/culture/story/20160819-the-21st-centurys-100-greatest-films) is interesting, the long list showing how everyone voted is perhaps more revealing from a data standpoint:

https://www.bbc.com/culture/article/20160819-the-21st-centurys-100-greatest-films-who-voted

How do I wrangle this data? That is the central challenge you'll be dealing with this week. You need to use Beautiful Soup to find each critic, as well as the list of movies that immediately follows them, and then use regular expressions to split the critic information and the movie information into the most useful possible data structure. What should the data structure be? That is up to you to figure out.



### Getting started: Data Architecture

The central challenge of this assignment is figuring out how you are going to set up your table (list of dictionaries) from this long list of critics and movies. What will each row be? What will the columns be in each row? How can you set it up so that you have the most useful table possible?

Some things to think about: what are the main categories of analysis? Try to design a schema that will give you a table on which you can run solid aggregations in pandas. Think about how you can transform the main source into one large table that can be aggregated and grouped.
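
One possible shape, offered as a sketch rather than the required answer, is one row per vote: a single critic choosing a single movie. The key names and the extra `rank` field below are illustrative assumptions; the sample values come from the first two critics on the page.

```python
from collections import Counter

# One row per vote: a single critic choosing a single movie.
# Key names (and the extra "rank" field) are illustrative assumptions.
rows = [
    {"critic": "Simon Abrams", "crit_org": "Freelance film critic", "crit_cn": "US",
     "rank": 1, "movie": "Mulholland Drive", "director": "David Lynch", "m_year": 2001},
    {"critic": "Simon Abrams", "crit_org": "Freelance film critic", "crit_cn": "US",
     "rank": 2, "movie": "In the Mood for Love", "director": "Wong Kar-wai", "m_year": 2000},
    {"critic": "Sam Adams", "crit_org": "Freelance film critic", "crit_cn": "US",
     "rank": 1, "movie": "In the Mood for Love", "director": "Wong Kar-wai", "m_year": 2000},
]

# Because every row has the same keys, counting and grouping stay simple.
votes_per_movie = Counter(row["movie"] for row in rows)
print(votes_per_movie.most_common(1))  # [('In the Mood for Love', 2)]
```

Repeating the critic details on every row looks redundant, but it is what lets a single table answer questions about movies, directors, critics, and countries at once.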

### STEP 1

The first thing you need to do is scrape the page. 

https://www.bbc.com/culture/article/20160819-the-21st-centurys-100-greatest-films-who-voted

Okay let's begin! (Note: I have set up the first few cells so that you can run requests once AND save the HTML page as a local file. Then you load that local file and do the scraping on it. That way you only need to run requests once (ever)!)
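
The same "download once, then read the local copy" idea can also be written as a small helper instead of commenting cells in and out. This is only a sketch: `load_html` and its `fetch` argument are hypothetical names, and a fake download function stands in for `requests.get(...).content` so the example runs without touching the network.

```python
from pathlib import Path
import tempfile

def load_html(path, fetch):
    """Return the cached page at `path`, calling fetch() only if the file is missing."""
    cache = Path(path)
    if not cache.exists():
        # fetch() must return bytes, like requests.get(url).content does
        cache.write_bytes(fetch())
    return cache.read_text(encoding="utf-8")

# Fake download used only for this demonstration; it counts how often it is called.
calls = []
def fake_fetch():
    calls.append(1)
    return "<p><b>Simon Abrams – Freelance film critic (US)</b></p>".encode("utf-8")

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "bbc.html"
    first = load_html(target, fake_fetch)   # file missing: "downloads" and saves
    second = load_html(target, fake_fetch)  # file present: reads the local copy

print(len(calls))  # 1
```

In the notebook, the call would look something like `load_html("bbc.html", lambda: requests.get(my_url).content)`.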



In [3]:
##Import your libraries: Beautiful soup, requests, and re (For regular expressions)
import requests
from bs4 import BeautifulSoup
import re


In [4]:
#RUN THIS ONE TIME
#THEN COMMENT-OUT ALL OF THIS CODE
# my_url = "https://www.bbc.com/culture/article/20160819-the-21st-centurys-100-greatest-films-who-voted"
# raw_html = requests.get(my_url).content


In [6]:
#WRITING THE HTML FILE TO A LOCAL HTML FILE
#RUN THIS ONE TIME, THEN COMMENT-OUT ALL OF THIS CODE
# with open('bbc.html', 'wb+') as f:
#     f.write(raw_html)

In [7]:
#If you have run requests already--START HERE
#Read the saved page back in; the with-block closes the file for us
#(encoding="utf-8" assumes the saved page is UTF-8)
with open("bbc.html", "r", encoding="utf-8") as f:
    local_html = f.read()

In [8]:
# put the locally saved HTML page into beautiful soup
soup_doc = BeautifulSoup(local_html, "html.parser")

In [9]:
#Using beautiful soup find the tag that contains 
#the entire list of critics and movies
#Make a variable (like full_list) that holds all that information 
full_list = soup_doc.article.find_all('p')

In [10]:
div_list = soup_doc.article.find_all('div')
len(div_list)

508

In [11]:
allb = soup_doc.find_all('b')
len(allb)
for b in allb[:3]:
    print(b.text)

We polled 177 critics from around the world – here is how they voted.
Simon Abrams – Freelance film critic (US)
Sam Adams – Freelance film critic (US)


**STEP 2** Using Beautiful Soup figure out how to separate the entries.


In [10]:
full_list[2]
full_list[-9]
len(full_list)

1957

**STEP 3** This is where all the magic has to happen: you need to find a way to loop through all of the elements (the list you just got from find_all()) and pull out the critics and their lists of movies.

Set up a loop that PRINTS critics and movies: you need to set it up so that you're getting each critic string followed by their movies.

So just print out the lines along with a print message like "CRITIC" or "MOVIE" to make sure that the loop is recognizing the two categories differently.


In [11]:
##Write your loop for STEP 3 here
for entry in full_list[2:-8]:
    if entry.b:
        print(entry.text)
    else:
        print(entry.text)
        print("------------")


Simon Abrams – Freelance film critic (US)
1. Mulholland Drive (David Lynch, 2001)
------------
2. In the Mood for Love (Wong Kar-wai, 2000)
------------
3. The Tree of Life (Terrence Malick, 2011)
------------
4. Yi Yi: A One and a Two (Edward Yang, 2000)
------------
5. Goodbye to Language (Jean-Luc Godard, 2014)
------------
6. The White Meadows (Mohammad Rasoulof, 2009)
------------
7. Night Across the Street (Raoul Ruiz, 2012)
------------
8. Certified Copy (Abbas Kiarostami, 2010)
------------
9. Sparrow (Johnnie To, 2008)
------------
10. Fados (Carlos Saura, 2007)
------------
Sam Adams – Freelance film critic (US)
1. In the Mood for Love (Wong Kar-wai, 2000)
------------
2. Eternal Sunshine of the Spotless Mind (Michel Gondry, 2004)
------------
3. Syndromes and a Century (Apichatpong Weerasethakul, 2006)
------------
4. Spirited Away (Hayao Miyazaki, 2001)
------------
5. The Act of Killing (Joshua Oppenheimer, 2012)
------------
6. The Grand Budapest Hotel (Wes Anderson, 2014

**STEP 4**
If your loop is successfully isolating those two categories, it's time to parse each with regular expressions (separately). This will need to happen inside the loop: for every critic, and then (in STEP 5) for every movie. But FIRST, just **focus on getting the critic's name, organization, and country** in isolation (outside of the loops).

Once you think you have your regular expressions working, bring them into a loop (just for CRITICS) and see how well they work.

Inside the loop, once you have critic_info, make a regular expression that pulls out the name of the critic and store it in a variable called critic_name:

`critic_name = re.findall(regex,critic_info)[0]`

Do the same thing for critic_org and critic_cn

As you go print(critic_name) then print(critic_org), etc.--to make sure you're getting the results. I provided a cell below for you to practice your regular expressions before you put them into the loop.
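
As an alternative to three separate patterns, the critic line can be parsed in one pass with named groups. This is a sketch, not the expected solution: `CRITIC_RE` and `parse_critic` are hypothetical names, and the pattern assumes every critic line looks like `Name – Organization (Country)` with an en dash as the separator.

```python
import re

# name, then " – " (en dash), then organization, then "(country)" at the end of the line
CRITIC_RE = re.compile(r"^(?P<name>.+?)\s+–\s+(?P<org>.+)\s+\((?P<country>[^()]+)\)$")

def parse_critic(line):
    """Return a dict with name/org/country, or None if the line does not fit the pattern."""
    match = CRITIC_RE.match(line.strip())
    return match.groupdict() if match else None

print(parse_critic("Arturo Aguilar – Rolling Stone Mexico (Mexico)"))
# {'name': 'Arturo Aguilar', 'org': 'Rolling Stone Mexico', 'country': 'Mexico'}
print(parse_critic("1. Zero Dark Thirty (Kathryn Bigelow, 2012)"))
# None
```

Returning `None` for a line that does not match makes odd entries easy to spot, whereas `re.findall(...)[0]` raises an `IndexError` on the first one.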

In [12]:
#Practice/Build your regular expressions here
import re
crit_sample = "Arturo Aguilar – Rolling Stone Mexico (Mexico)"
regex_for_name = r"^([^–]+)"
regex_for_org = r"–([^(]+)\("
regex_for_cn = r"\((.+)\)$"
name = re.findall(regex_for_cn,crit_sample)
name[0]


'Mexico'

In [13]:
#Take your working loop from step three
#And put it here With the regular expression parsing inside it
for entry in full_list[2:-8]:
    if entry.b:
        regex_for_name = r"^([^–]+)"
        regex_for_org = r"–([^(]+)\("
        regex_for_cn = r"\((.+)\)$"
        name = re.findall(regex_for_name,entry.text)[0]
        org = re.findall(regex_for_org,entry.text)[0]
        cn = re.findall(regex_for_cn,entry.text)[0]
        print(name + "|||" + org+ "|||"+ cn)


Simon Abrams ||| Freelance film critic |||US
Sam Adams ||| Freelance film critic |||US
Thelma Adams ||| Freelance film critic |||US
Arturo Aguilar ||| Rolling Stone Mexico |||Mexico
Matthew Anderson ||| BBC Culture |||UK
Tim Appelo ||| The Wrap |||US
Adriano Aprà ||| Film historian |||Italy
Michael Arbeiter ||| Nerdist |||US
Ali Arikan ||| Dipnot TV |||Turkey
Michael Atkinson ||| The Village Voice |||US
Ana Maria Bahiana ||| Freelance film critic |||Brazil
Cameron Bailey ||| Toronto Film Festival |||Canada
Lindsay Baker ||| BBC Culture |||UK
Miriam Bale ||| Freelance film critic |||US
Nicholas Barber ||| BBC Culture |||UK
Diego Batlle ||| La Nacion |||Argentina
NT Binh ||| Positif |||France
Lizelle Bisschoff ||| University of Glasgow |||UK
Christian Blauvelt ||| BBC Culture |||US
Mahen Bonetti ||| African Film Festival Inc |||US
Andreas Borcholte ||| Spiegel Online |||Germany
Utpal Borpujari ||| Freelance film critic |||India
Richard Brody ||| The New Yorker |||US
Hannah Brown ||| Jeru

**STEP 5**
Now you need to get your **movie info**. You will want to use the same loop you have been working on (in STEP 6), and get the name of each movie along with the critic information.

But **FIRST**: practice your regular expressions and make sure that they're going to work before you bring them into the loop.
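
A single pattern with named groups is one way to pull out every part of a movie line at once, including the rank at the start. This is a sketch under the assumption that each line looks like `N. Title (Director, YYYY)`; `MOVIE_RE` and `parse_movie` are hypothetical names, and the third test line is made up to show what a non-matching entry does.

```python
import re

# rank, title, then "(director, four-digit year)" anchored at the end of the line
MOVIE_RE = re.compile(
    r"^(?P<rank>\d{1,2})\.\s+(?P<title>.+)\s+\((?P<director>[^()]+),\s*(?P<year>\d{4})\)$"
)

def parse_movie(line):
    """Return a dict with rank/title/director/year, or None if the line does not fit."""
    match = MOVIE_RE.match(line.strip())
    if match is None:
        return None
    row = match.groupdict()
    row["rank"] = int(row["rank"])
    row["year"] = int(row["year"])
    return row

print(parse_movie("1. Zero Dark Thirty (Kathryn Bigelow, 2012)"))
print(parse_movie("7. 4 Months, 3 Weeks & 2 Days (Cristian Mungiu, 2007)"))
print(parse_movie("3. Some Title (No Year Here)"))  # hypothetical malformed line -> None
```

Anchoring the director and year to the final parenthesis keeps the digits and commas inside a title such as "4 Months, 3 Weeks & 2 Days" from confusing the match.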


In [14]:
#Practice/Build your regular expressions here
movie_sample = "1. Zero Dark Thirty (Kathryn Bigelow, 2012)"
movie_harder = "7. 4 Months, 3 Weeks & 2 Days (Cristian Mungiu, 2007)"
regex_for_mname = r"^\d{1,2}\. (.+)[(][^(]+[)]$"
regex_for_dir = r"\(([^(]+),\s+[^,(]+\)$"
regex_for_year = r",\s+(\d{4})\)$"
#what else should you extract???
#set up all regexes here
movie_name = re.findall(regex_for_year,movie_harder)
movie_name[0].strip()


'2007'

**STEP 6**
You're almost there!!! Now that you have working regular expressions, put them in your inner loop to get the movie name.

So now the entire loop should be getting critic information and movie information all separated as separate columns/properties.

Build this loop using print() on the first one or two critics' selections, just to make sure you are pulling out the right data.




In [15]:
#Get that loop working here

for entry in full_list[2:-8]:
    if entry.b:
        regex_for_name = r"^([^–]+)"
        regex_for_org = r"–([^(]+)\("
        regex_for_cn = r"\((.+)\)$"
        name = re.findall(regex_for_name,entry.text)[0]
        org = re.findall(regex_for_org,entry.text)[0]
        cn = re.findall(regex_for_cn,entry.text)[0]
        # print(name+ "|||" + org+ "|||"+cn)
    else:
        regex_for_mname = r"^\d{1,2}\. (.+)\([^(]+\)$"
        regex_for_dir = r"\(([^(]+),\s+[^,(]+\)$"
        regex_for_year = r",\s+([^,]+)\)$"
        # regex_for_multi_paren = r".+\(.+\("
        # regex_for_odddate = r", .{4}\)$"
        #what else should you extract???
        movie_name = re.findall(regex_for_mname,entry.text)[0]
        movie_dir = re.findall(regex_for_dir,entry.text)[0]
        movie_year = re.findall(regex_for_year,entry.text)[0]
        print(movie_name+ "|||" + movie_dir+ "|||"+movie_year)
        # paren = re.findall(regex_for_odddate,entry.text)
        # if paren:
        #     print(paren)

Mulholland Drive |||David Lynch|||2001

**STEP 7**
This is the final step of the hardest part! 

The final step is building a list of dictionaries of all this information.

So you need to have a loop that gets everything out, but you also need to figure out **how you want to organize what you're pulling out.** What should a row look like in your table?
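
A minimal sketch of the list-of-dictionaries version follows. It assumes the one-row-per-vote layout and the column names used in the pandas section later in this notebook. The `entries` list is a stand-in for the Beautiful Soup elements: each tuple holds "does this paragraph contain a bold tag?" (`entry.b` in the notebook) and the paragraph text (`entry.text`).

```python
import re

critic_re = re.compile(r"^(.+?)\s+–\s+(.+)\s+\(([^()]+)\)$")
movie_re = re.compile(r"^\d{1,2}\.\s+(.+)\s+\(([^()]+),\s*(\d{4})\)$")

# Stand-in for full_list: (has_bold_tag, text) pairs copied from the page.
entries = [
    (True, "Simon Abrams – Freelance film critic (US)"),
    (False, "1. Mulholland Drive (David Lynch, 2001)"),
    (False, "2. In the Mood for Love (Wong Kar-wai, 2000)"),
    (True, "Sam Adams – Freelance film critic (US)"),
    (False, "1. In the Mood for Love (Wong Kar-wai, 2000)"),
]

rows = []
current_critic = None  # remembered until the next critic line appears
for is_critic, text in entries:
    if is_critic:
        match = critic_re.match(text)
        current_critic = match.groups() if match else None
    elif current_critic is not None:
        match = movie_re.match(text)
        if match:
            movie, director, year = match.groups()
            name, org, country = current_critic
            rows.append({"movie": movie, "director": director, "m_year": int(year),
                         "critic": name, "crit_org": org, "crit_cn": country})

print(len(rows))  # 3
print(rows[2]["critic"])  # Sam Adams
```

A list of dictionaries like this can be passed straight to `pd.DataFrame(rows)`, and the keys become the column names.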




In [16]:
#figure out how you're going to collect your clean information
list_of_movies = []

#loop through the beautiful soup elements
#and use the regexes you developed above to get each unit of info
for entry in full_list[2:-8]:
    if entry.b:
        regex_for_name = r"^([^–]+)"
        regex_for_org = r"–([^(]+)\("
        regex_for_cn = r"\((.+)\)$"
        name = re.findall(regex_for_name,entry.text)[0].strip()
        org = re.findall(regex_for_org,entry.text)[0].strip()
        cn = re.findall(regex_for_cn,entry.text)[0].strip()
        # print(name+ "|||" + org+ "|||"+cn)
    else:
        regex_for_mname = r"^\d{1,2}\. (.+)\([^(]+\)$"
        regex_for_dir = r"\(([^(]+),\s+[^,(]+\)$"
        regex_for_year = r",\s+([^,]+)\)$"
        regex_for_multi_paren = r".+\(.+\("
        #what else should you extract???
        movie_name = re.findall(regex_for_mname,entry.text)[0].strip()
        movie_dir = re.findall(regex_for_dir,entry.text)[0].strip()
        movie_year = re.findall(regex_for_year,entry.text)[0].strip()
        new_movie_entry = [movie_name,movie_dir,movie_year,name,org,cn]
        list_of_movies.append(new_movie_entry)
        

#Try to figure out how you want to append things
#That is, how you want to organize your data

    

In [17]:
##Take a peek at your final lists of lists
list_of_movies
len(list_of_movies)
list_of_movies[22:44]

[['The Grand Budapest Hotel',
  'Wes Anderson',
  '2014',
  'Thelma Adams',
  'Freelance film critic',
  'US'],
 ['Stories We Tell',
  'Sarah Polley',
  '2012',
  'Thelma Adams',
  'Freelance film critic',
  'US'],
 ['Casino Royale',
  'Martin Campbell',
  '2006',
  'Thelma Adams',
  'Freelance film critic',
  'US'],
 ['Eternal Sunshine of the Spotless Mind',
  'Michel Gondry',
  '2004',
  'Thelma Adams',
  'Freelance film critic',
  'US'],
 ['Tabu',
  'Miguel Gomes',
  '2012',
  'Thelma Adams',
  'Freelance film critic',
  'US'],
 ['Snow White',
  'Pablo Berger',
  '2012',
  'Thelma Adams',
  'Freelance film critic',
  'US'],
 ['Frozen River',
  'Courtney Hunt',
  '2008',
  'Thelma Adams',
  'Freelance film critic',
  'US'],
 ['Gosford Park',
  'Robert Altman',
  '2001',
  'Thelma Adams',
  'Freelance film critic',
  'US'],
 ['In the Mood for Love',
  'Wong Kar-wai',
  '2000',
  'Arturo Aguilar',
  'Rolling Stone Mexico',
  'Mexico'],
 ['Mulholland Drive',
  'David Lynch',
  '2001',
 

In [18]:
# for mov in list_of_movies:
#     if mov[1].startswith("Mool"):
#         print(mov)

for mov in list_of_movies:
    if re.search(r".+\d{4}",mov[2]):
        print(mov[2])

Sembèène 2004


The director's surname ended up in the year field for this entry. It can be fixed here, as in the next cell, or later in pandas.

In [19]:
for mov in list_of_movies:
    if re.search(r".+\d{4}",mov[2]):
        mov[1] = mov[1]+" "+mov[2].split(" ")[0]
        print(mov[1])
        mov[2] = mov[2].split(" ")[1]
        print(mov[2])

Ousmane Sembèène
2004


If you made it this far, yay!


And now, let's bring that into PANDAS!

In [20]:
import numpy as np
import pandas as pd
col_names = ['movie', 'director', 'm_year', 'critic','crit_org','crit_cn']
df = pd.DataFrame.from_records(list_of_movies, columns=col_names)

In [21]:
df

Unnamed: 0,movie,director,m_year,critic,crit_org,crit_cn
0,Mulholland Drive,David Lynch,2001,Simon Abrams,Freelance film critic,US
1,In the Mood for Love,Wong Kar-wai,2000,Simon Abrams,Freelance film critic,US
2,The Tree of Life,Terrence Malick,2011,Simon Abrams,Freelance film critic,US
3,Yi Yi: A One and a Two,Edward Yang,2000,Simon Abrams,Freelance film critic,US
4,Goodbye to Language,Jean-Luc Godard,2014,Simon Abrams,Freelance film critic,US
...,...,...,...,...,...,...
1765,The Lives of Others,Florian Henckel von Donnersmarck,2006,Raymond Zhou,China Daily,China
1766,Still Life,Jia Zhangke,2006,Raymond Zhou,China Daily,China
1767,Birdman,Alejandro González Iñárritu,2014,Raymond Zhou,China Daily,China
1768,Infernal Affairs,Andrew Lau and Alan Mak,2002,Raymond Zhou,China Daily,China


In [22]:
#most popular films
df['movie'].value_counts().head(15)

movie
In the Mood for Love                     49
Mulholland Drive                         47
There Will Be Blood                      35
Spirited Away                            34
Boyhood                                  30
Eternal Sunshine of the Spotless Mind    29
A Separation                             28
The Tree of Life                         23
Yi Yi: A One and a Two                   22
No Country For Old Men                   21
Inside Llewyn Davis                      20
Children of Men                          18
4 Months, 3 Weeks & 2 Days               17
Pan's Labyrinth                          17
The Act of Killing                       16
Name: count, dtype: int64

In [23]:
m_count = df['movie'].value_counts()
m_count[m_count<10]

movie
The Dark Knight                                9
Certified Copy                                 9
Margaret                                       9
Uncle Boonmee Who Can Recall His Past Lives    9
Timbuktu                                       9
                                              ..
Story of My Death                              1
Stranger by the Lake                           1
Even If She Had Been a Criminal...             1
Heart of a Dog                                 1
Lust, Caution                                  1
Name: count, Length: 561, dtype: int64

In [24]:
#critics per country!
df.groupby('crit_cn')['critic'].nunique()

crit_cn
Argentina        2
Australia        4
Austria          2
Bangladesh       1
Belgium          1
Brazil           1
Canada           5
Chile            2
China            1
Colombia         4
Cuba             5
Egypt            1
France           5
Germany          5
Hong Kong        1
India            5
Indonesia        1
Israel           4
Italy            4
Japan            1
Kazakhstan       1
Lebanon          3
Mexico           2
Namibia          1
Philippines      1
Qatar            1
Senegal          1
Singapore        2
South Africa     1
South Korea      2
Switzerland      1
Taiwan           1
Turkey           2
UAE              3
UK              18
US              82
Name: critic, dtype: int64

In [25]:
#back up your results!!!
df.to_csv(r'backup_BBC1.csv', index = False)

In [26]:
df_new = pd.read_csv("backup_BBC1.csv")

In [27]:
d_list = list(df_new['director'].unique())
d_list.sort()
d_list

['Abbas Kiarostami',
 'Abdellatif Kechiche',
 'Abderrahmane Sissako',
 'Adam Curtis',
 'Adam McKay',
 'Agnieszka Holland',
 'Agnès Jaoui',
 'Agnès Varda',
 'Aki Kaurismäki',
 'Alain Cavalier',
 'Alain Gomis',
 'Alain Guiraudie',
 'Alain Resnais',
 'Albert Serra',
 'Alejandro González Iñárritu',
 'Aleksandr Sokurov',
 'Aleksey Fedorchenko',
 'Aleksey German',
 'Alex Garland',
 'Alexander Payne',
 'Alfonso Cuarón',
 'Amma Asante',
 'Ana Lily Amirpour',
 'Andrea Arnold',
 'Andrew Adamson and Vicky Jenson',
 'Andrew Dominik',
 'Andrew Dosunmu',
 'Andrew Haigh',
 'Andrew Lau and Alan Mak',
 'Andrew Stanton',
 'Andrew Stanton and Lee Unkrich',
 'Andrey Zvyagintsev',
 'Andrzej Wajda',
 'Andrzej Zulawski',
 'André Singer',
 'Ang Lee',
 'Annemarie Jacir',
 'Anthony and Joe Russo',
 'Anurag Kashyap',
 'Apichatpong Weerasethakul',
 'Ari Folman',
 'Arnaud Desplechin',
 'Asghar Farhadi',
 'Ashutosh Gowariker',
 'Asif Kapadia',
 'Ava DuVernay',
 'Avi Nesher',
 'Bahman Ghobadi',
 'Bart Layton',
 'Baz

**Getting the names separated**

In [28]:
pd.set_option('display.max_rows', None)
df_new[df_new['director'].str.contains(r"\band\b",regex=True, case=False)]

Unnamed: 0,movie,director,m_year,critic,crit_org,crit_cn
50,No Country For Old Men,Joel and Ethan Coen,2007,Tim Appelo,The Wrap,US
54,Finding Nemo,Andrew Stanton and Lee Unkrich,2003,Tim Appelo,The Wrap,US
60,These Encounters of Theirs,Danièle Huillet and Jean-Marie Straub,2006,Adriano Aprà,Film historian,Italy
68,Terra,Marco De Angelis and Antonio Di Trapani,2015,Adriano Aprà,Film historian,Italy
69,Oh! Man,Yervant Gianikian and Angela Ricci Lucchi,2004,Adriano Aprà,Film historian,Italy
77,Inside Llewyn Davis,Joel and Ethan Coen,2013,Michael Arbeiter,Nerdist,US
86,A Serious Man,Joel and Ethan Coen,2009,Ali Arikan,Dipnot TV,Turkey
103,City of God,Fernando Meirelles and Kátia Lund,2002,Ana Maria Bahiana,Freelance film critic,Brazil
108,No Country For Old Men,Joel and Ethan Coen,2007,Ana Maria Bahiana,Freelance film critic,Brazil
133,This Is the End,Evan Goldberg and Seth Rogen,2013,Miriam Bale,Freelance film critic,US


The lambda function used in In [31] just sends each director cell to the dirs_names() function, which is defined in In [30].

Note that here I am testing to make sure the function is working, I'm not saving this work yet.

In [29]:
df_new['director'].value_counts().head(15)

director
Paul Thomas Anderson    52
Joel and Ethan Coen     52
Wong Kar-wai            51
David Lynch             48
Richard Linklater       39
Michael Haneke          35
Hayao Miyazaki          35
Terrence Malick         32
Asghar Farhadi          31
David Fincher           31
Michel Gondry           30
Wes Anderson            28
Christopher Nolan       28
Alfonso Cuarón          24
Edward Yang             22
Name: count, dtype: int64

In [30]:
import re
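def dirs_names(dirs):
    # Expand a shared surname: "Joel and Ethan Coen" -> "Joel Coen and Ethan Coen"
    each_word = re.split(r"\s+",dirs)
    # Assumes the credit starts "<first name> and ...", so the second word must be "and"
    if len(each_word) > 1 and each_word[1] == "and":
        # Borrow the last word of the credit as the first director's surname
        each_word[0] = each_word[0] + " " + each_word[-1]
        print(' '.join(each_word))
        return ' '.join(each_word)
    else:
        return dirs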
        

In [31]:
df_new['director'].apply(lambda x: dirs_names(x))

Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Jean-Pierre Dardenne and Luc Dardenne
Joel Coen and Ethan Coen
Josh Safdie and Benny Safdie
Joel Coen and Ethan Coen
Jean-Pierre Dardenne and Luc Dardenne
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Jean-Pierre Dardenne and Luc Dardenne
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Jean-Pierre Dardenne and Luc Dardenne
Joel Coen and Ethan Coen
Jean-Pierre Dardenne and Luc Dardenne
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Jean-Pierre Dardenne and Luc Dardenne
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Jean-Pierre Darden

0                                             David Lynch
1                                            Wong Kar-wai
2                                         Terrence Malick
3                                             Edward Yang
4                                         Jean-Luc Godard
5                                       Mohammad Rasoulof
6                                              Raoul Ruiz
7                                        Abbas Kiarostami
8                                              Johnnie To
9                                            Carlos Saura
10                                           Wong Kar-wai
11                                          Michel Gondry
12                              Apichatpong Weerasethakul
13                                         Hayao Miyazaki
14                                     Joshua Oppenheimer
15                                           Wes Anderson
16                                        Terrence Malick
17            

Looking for directors with a single name because that function is assuming that it is always First Name "and"

In [32]:
df_new[df_new['director'].str.contains(r"^\S+$",regex=True, case=False)]

Unnamed: 0,movie,director,m_year,critic,crit_org,crit_cn


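Uh-oh, potential problem! The search comes back empty because the one stray entry (row 172) was already repaired in In [19], and read_csv then parsed m_year as integers. The fix below is guarded so that it only runs if the year cell is still a string like "Sembèène 2004".

In [33]:
# Only repair the director if m_year still holds the stray surname as a string
if isinstance(df_new.at[172, "m_year"], str):
    df_new.at[172, "director"] = df_new.at[172, "director"] + " " + df_new.at[172, "m_year"].split(" ")[0]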

In [34]:
df_new["director"].iat[172]

'Ousmane Sembèène'

In [35]:
# m_year is already an integer after the CSV round trip, so only split if it is still a string
if isinstance(df_new.at[172, "m_year"], str):
    df_new.at[172, "m_year"] = df_new.at[172, "m_year"].split(" ")[1]

In [36]:
df_new["m_year"].iat[172]

2004

In [37]:
df_new[df_new['director'].str.contains(r"\bOusmane\b",regex=True, case=False)]


Unnamed: 0,movie,director,m_year,critic,crit_org,crit_cn
172,Moolaadé,Ousmane Sembèène,2004,Lizelle Bisschoff,University of Glasgow,UK
190,Moolaadé,Ousmane Sembène,2004,Mahen Bonetti,African Film Festival Inc,US
395,Moolaadé,Ousmane Sembène,2004,Lindiwe Dovey,University of London,UK
1010,Moolaadé,Ousmane Sembène,2004,Hans-Christian Mahnke,AfricAvenir.org,Namibia
1536,Moolaadé,Ousmane Sembène,2004,Yael Shuv,Time Out Tel Aviv,Israel


In [39]:
df_new['director'] = df_new['director'].str.replace('Sembèène','Sembène')


In [40]:
df_new[df_new['director'].str.contains(r"\bOusmane\b",regex=True, case=False)]


Unnamed: 0,movie,director,m_year,critic,crit_org,crit_cn
172,Moolaadé,Ousmane Sembène,2004,Lizelle Bisschoff,University of Glasgow,UK
190,Moolaadé,Ousmane Sembène,2004,Mahen Bonetti,African Film Festival Inc,US
395,Moolaadé,Ousmane Sembène,2004,Lindiwe Dovey,University of London,UK
1010,Moolaadé,Ousmane Sembène,2004,Hans-Christian Mahnke,AfricAvenir.org,Namibia
1536,Moolaadé,Ousmane Sembène,2004,Yael Shuv,Time Out Tel Aviv,Israel


**Okay...**

So after that tangent, I'm gonna go ahead and update the directors!!

Here I am saving the work, updating the director column.


In [41]:
df_new['director']=df_new['director'].apply(lambda x: dirs_names(x))

Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Jean-Pierre Dardenne and Luc Dardenne
Joel Coen and Ethan Coen
Josh Safdie and Benny Safdie
Joel Coen and Ethan Coen
Jean-Pierre Dardenne and Luc Dardenne
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Jean-Pierre Dardenne and Luc Dardenne
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Jean-Pierre Dardenne and Luc Dardenne
Joel Coen and Ethan Coen
Jean-Pierre Dardenne and Luc Dardenne
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Jean-Pierre Dardenne and Luc Dardenne
Joel Coen and Ethan Coen
Joel Coen and Ethan Coen
Jean-Pierre Darden

And checking to make sure it came out right!

In [42]:
df_new[df_new['director'].str.contains(r"\bCoen\b",regex=True, case=False)]

Unnamed: 0,movie,director,m_year,critic,crit_org,crit_cn
50,No Country For Old Men,Joel Coen and Ethan Coen,2007,Tim Appelo,The Wrap,US
77,Inside Llewyn Davis,Joel Coen and Ethan Coen,2013,Michael Arbeiter,Nerdist,US
86,A Serious Man,Joel Coen and Ethan Coen,2009,Ali Arikan,Dipnot TV,Turkey
108,No Country For Old Men,Joel Coen and Ethan Coen,2007,Ana Maria Bahiana,Freelance film critic,Brazil
137,Inside Llewyn Davis,Joel Coen and Ethan Coen,2013,Miriam Bale,Freelance film critic,US
189,A Serious Man,Joel Coen and Ethan Coen,2009,Christian Blauvelt,BBC Culture,US
204,No Country For Old Men,Joel Coen and Ethan Coen,2007,Andreas Borcholte,Spiegel Online,Germany
232,No Country For Old Men,Joel Coen and Ethan Coen,2007,Hannah Brown,Jerusalem Post,Israel
242,"O Brother, Where Art Thou?",Joel Coen and Ethan Coen,2000,Luke Buckmaster,The Guardian/BBC Culture,Australia
260,Inside Llewyn Davis,Joel Coen and Ethan Coen,2013,Monica Castillo,New York Times Watching,US


Now I need to deal with multiple directors by looking for commas.

In [43]:
df_new[df_new['director'].str.contains(r",\s+",regex=True, case=False)]

Unnamed: 0,movie,director,m_year,critic,crit_org,crit_cn
144,Madagascar 3: Europe's Most Wanted,"Eric Darnell, Tom McGrath and Conrad Vernon",2012,Nicholas Barber,BBC Culture,UK
399,7 Letters,"Boo Junfeng, Eric Khoo, Jack Neo, K. Rajagopal...",2015,Lindiwe Dovey,University of London,UK
1396,"Monsters, Inc.","Pete Docter, David Silverman and Lee Unkrich",2001,Jonathan Romney,Freelance film critic,UK


In [44]:
#beware of oxford commas!!!!
df_new[df_new['director'].str.contains(r",\s+\band\b",regex=True, case=False)]

Unnamed: 0,movie,director,m_year,critic,crit_org,crit_cn


Replacing all the commas with ' and ' so that I have a consistent separator for every multiple director cell.

In [45]:
df_new['director']=df_new['director'].str.replace(r",\s+",' and ',regex=True)

In [46]:
df_new[df_new['director'].str.contains(r"\bUnkrich\b",regex=True, case=False)]


Unnamed: 0,movie,director,m_year,critic,crit_org,crit_cn
54,Finding Nemo,Andrew Stanton and Lee Unkrich,2003,Tim Appelo,The Wrap,US
126,Toy Story 3,Lee Unkrich,2010,Lindsay Baker,BBC Culture,UK
495,Toy Story 3,Lee Unkrich,2010,Javier Porta Fouz,La Nacion,Argentina
719,Finding Nemo,Andrew Stanton and Lee Unkrich,2003,Ann Hornaday,The Washington Post,US
1343,Finding Nemo,Andrew Stanton and Lee Unkrich,2003,Sam Rigby,BBC Culture,UK
1396,"Monsters, Inc.",Pete Docter and David Silverman and Lee Unkrich,2001,Jonathan Romney,Freelance film critic,UK
1567,Toy Story 3,Lee Unkrich,2010,Eric D Snider,Freelance film critic,US


In [47]:
df_new['director'].apply(lambda x: len(re.findall(r'\band\b',x))+1)

0       1
1       1
2       1
3       1
4       1
5       1
6       1
7       1
8       1
9       1
10      1
11      1
12      1
13      1
14      1
15      1
16      1
17      1
18      1
19      1
20      1
21      1
22      1
23      1
24      1
25      1
26      1
27      1
28      1
29      1
30      1
31      1
32      1
33      1
34      1
35      1
36      1
37      1
38      1
39      1
40      1
41      1
42      1
43      1
44      1
45      1
46      1
47      1
48      1
49      1
50      2
51      1
52      1
53      1
54      2
55      1
56      1
57      1
58      1
59      1
60      2
61      1
62      1
63      1
64      1
65      1
66      1
67      1
68      2
69      2
70      1
71      1
72      1
73      1
74      1
75      1
76      1
77      2
78      1
79      1
80      1
81      1
82      1
83      1
84      1
85      1
86      2
87      1
88      1
89      1
90      1
91      1
92      1
93      1
94      1
95      1
96      1
97      1
98      1
99      1


Now I am transforming the Director cells into lists using split

In [48]:
df_new['director']=df_new['director'].str.split(' and ')
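
The clean-up steps so far (expanding a shared surname, unifying the separators, splitting into a list) could also be folded into one helper. The sketch below is a consolidation I am suggesting, not the approach used in this notebook; `split_directors` is a hypothetical name, and it only expands the surname for the two-director "First and First Last" pattern.

```python
import re

def split_directors(credit):
    """Turn a director credit into a list of full names."""
    # Split on commas (with an optional Oxford "and") or on a standalone "and"
    parts = [p for p in re.split(r"\s*,\s*(?:and\s+)?|\s+and\s+", credit.strip()) if p]
    # "Joel and Ethan Coen": the first part is a lone first name, so borrow the surname
    if len(parts) == 2 and " " not in parts[0]:
        parts[0] = parts[0] + " " + parts[1].split()[-1]
    return parts

print(split_directors("Joel and Ethan Coen"))
# ['Joel Coen', 'Ethan Coen']
print(split_directors("Eric Darnell, Tom McGrath and Conrad Vernon"))
# ['Eric Darnell', 'Tom McGrath', 'Conrad Vernon']
print(split_directors("Andrew Lau and Alan Mak"))
# ['Andrew Lau', 'Alan Mak']
```

It could be applied with `df_new['director'].apply(split_directors)` on the original strings, but check the results against the step-by-step version before relying on it.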

In [49]:
df_new.iloc[[1396]]

Unnamed: 0,movie,director,m_year,critic,crit_org,crit_cn
1396,"Monsters, Inc.","[Pete Docter, David Silverman, Lee Unkrich]",2001,Jonathan Romney,Freelance film critic,UK


In [50]:
df_new

Unnamed: 0,movie,director,m_year,critic,crit_org,crit_cn
0,Mulholland Drive,[David Lynch],2001,Simon Abrams,Freelance film critic,US
1,In the Mood for Love,[Wong Kar-wai],2000,Simon Abrams,Freelance film critic,US
2,The Tree of Life,[Terrence Malick],2011,Simon Abrams,Freelance film critic,US
3,Yi Yi: A One and a Two,[Edward Yang],2000,Simon Abrams,Freelance film critic,US
4,Goodbye to Language,[Jean-Luc Godard],2014,Simon Abrams,Freelance film critic,US
5,The White Meadows,[Mohammad Rasoulof],2009,Simon Abrams,Freelance film critic,US
6,Night Across the Street,[Raoul Ruiz],2012,Simon Abrams,Freelance film critic,US
7,Certified Copy,[Abbas Kiarostami],2010,Simon Abrams,Freelance film critic,US
8,Sparrow,[Johnnie To],2008,Simon Abrams,Freelance film critic,US
9,Fados,[Carlos Saura],2007,Simon Abrams,Freelance film critic,US


This is fun...! I can use those lists to count the number of directors for a movie, like, why not?

In [51]:
#make dir numbers

df_new['nm_dir'] = df_new['director'].apply(lambda x: len(x))

In [52]:
df_new[df_new['nm_dir']>2]

Unnamed: 0,movie,director,m_year,critic,crit_org,crit_cn,nm_dir
144,Madagascar 3: Europe's Most Wanted,"[Eric Darnell, Tom McGrath, Conrad Vernon]",2012,Nicholas Barber,BBC Culture,UK,3
399,7 Letters,"[Boo Junfeng, Eric Khoo, Jack Neo, K. Rajagopa...",2015,Lindiwe Dovey,University of London,UK,7
1396,"Monsters, Inc.","[Pete Docter, David Silverman, Lee Unkrich]",2001,Jonathan Romney,Freelance film critic,UK,3


But, more importantly, let's use **explode()**

This is why we put the director names into a list. explode() allows us to then take that list and make separate rows for each element in the list. This way we are "unwinding" the multiple directors.

This is making a new data frame that will have more rows.
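
Here is a toy illustration of what explode() does, using a two-row frame built by hand (the name `toy` and the reduced set of columns are just for this sketch).

```python
import pandas as pd

toy = pd.DataFrame({
    "movie": ["No Country For Old Men", "Mulholland Drive"],
    "director": [["Joel Coen", "Ethan Coen"], ["David Lynch"]],
    "critic": ["Tim Appelo", "Simon Abrams"],
})

# Each list element gets its own row; the other columns are repeated.
long_df = toy.explode("director")
print(long_df)
print(long_df.shape)  # (3, 3)

# Counting per director now works, and counting votes per movie still works
# as long as you count unique critics rather than rows.
director_counts = long_df["director"].value_counts()
votes = long_df.groupby("movie")["critic"].nunique()
```

Note that the exploded rows keep the index of the row they came from (0, 0, 1 here), which is why the same index value shows up twice for the Coen brothers; passing `ignore_index=True` to explode() renumbers the rows instead.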

In [53]:
df_large = df_new.explode('director')

In [54]:
df_large.shape

(1891, 7)

In [55]:
df_large[df_large['director'].str.contains(r"\bCoen\b",regex=True, case=False)]

Unnamed: 0,movie,director,m_year,critic,crit_org,crit_cn,nm_dir
50,No Country For Old Men,Joel Coen,2007,Tim Appelo,The Wrap,US,2
50,No Country For Old Men,Ethan Coen,2007,Tim Appelo,The Wrap,US,2
77,Inside Llewyn Davis,Joel Coen,2013,Michael Arbeiter,Nerdist,US,2
77,Inside Llewyn Davis,Ethan Coen,2013,Michael Arbeiter,Nerdist,US,2
86,A Serious Man,Joel Coen,2009,Ali Arikan,Dipnot TV,Turkey,2
86,A Serious Man,Ethan Coen,2009,Ali Arikan,Dipnot TV,Turkey,2
108,No Country For Old Men,Joel Coen,2007,Ana Maria Bahiana,Freelance film critic,Brazil,2
108,No Country For Old Men,Ethan Coen,2007,Ana Maria Bahiana,Freelance film critic,Brazil,2
137,Inside Llewyn Davis,Joel Coen,2013,Miriam Bale,Freelance film critic,US,2
137,Inside Llewyn Davis,Ethan Coen,2013,Miriam Bale,Freelance film critic,US,2


Now we can get better aggregations with one director per row.

In [56]:
df_large['director'].value_counts().head(15)

director
Ethan Coen              52
Joel Coen               52
Paul Thomas Anderson    52
Wong Kar-wai            51
David Lynch             48
Richard Linklater       39
Hayao Miyazaki          35
Michael Haneke          35
Terrence Malick         32
Asghar Farhadi          31
David Fincher           31
Michel Gondry           30
Christopher Nolan       28
Wes Anderson            28
Alfonso Cuarón          24
Name: count, dtype: int64

In [57]:
df_large.groupby('movie')['critic'].nunique().sort_values(ascending=False).reset_index(name='count')

Unnamed: 0,movie,count
0,In the Mood for Love,49
1,Mulholland Drive,47
2,There Will Be Blood,35
3,Spirited Away,34
4,Boyhood,30
5,Eternal Sunshine of the Spotless Mind,29
6,A Separation,28
7,The Tree of Life,23
8,Yi Yi: A One and a Two,22
9,No Country For Old Men,21


In [59]:
d_list = list(df_large['director'].unique())
d_list.sort()
d_list

['Abbas Kiarostami',
 'Abdellatif Kechiche',
 'Abderrahmane Sissako',
 'Adam Curtis',
 'Adam McKay',
 'Agnieszka Holland',
 'Agnès Jaoui',
 'Agnès Varda',
 'Aki Kaurismäki',
 'Alain Cavalier',
 'Alain Gomis',
 'Alain Guiraudie',
 'Alain Resnais',
 'Alan Mak',
 'Albert Serra',
 'Alejandro González Iñárritu',
 'Aleksandr Sokurov',
 'Aleksey Fedorchenko',
 'Aleksey German',
 'Alex Garland',
 'Alexander Payne',
 'Alfonso Cuarón',
 'Amma Asante',
 'Ana Lily Amirpour',
 'Andrea Arnold',
 'Andrew Adamson',
 'Andrew Dominik',
 'Andrew Dosunmu',
 'Andrew Haigh',
 'Andrew Lau',
 'Andrew Stanton',
 'Andrey Zvyagintsev',
 'Andrzej Wajda',
 'Andrzej Zulawski',
 'André Singer',
 'Ang Lee',
 'Angela Ricci Lucchi',
 'Annemarie Jacir',
 'Anthony Russo',
 'Antonio Di Trapani',
 'Anurag Kashyap',
 'Apichatpong Weerasethakul',
 'Ari Folman',
 'Arnaud Desplechin',
 'Asghar Farhadi',
 'Ashutosh Gowariker',
 'Asif Kapadia',
 'Ava DuVernay',
 'Avi Nesher',
 'Bahman Ghobadi',
 'Bart Layton',
 'Baz Luhrmann',
 

In [60]:
df_large.head()

Unnamed: 0,movie,director,m_year,critic,crit_org,crit_cn,nm_dir
0,Mulholland Drive,David Lynch,2001,Simon Abrams,Freelance film critic,US,1
1,In the Mood for Love,Wong Kar-wai,2000,Simon Abrams,Freelance film critic,US,1
2,The Tree of Life,Terrence Malick,2011,Simon Abrams,Freelance film critic,US,1
3,Yi Yi: A One and a Two,Edward Yang,2000,Simon Abrams,Freelance film critic,US,1
4,Goodbye to Language,Jean-Luc Godard,2014,Simon Abrams,Freelance film critic,US,1


In [61]:
df_large.to_csv('preped_BBC.csv',index=False)

### For Javascript reading

In [None]:
#tell imdb you are using a browser!!
head={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'}
#add headers to request
raw_html = requests.get(url,headers=head).content
#save html file
with open('imdb.html', 'wb+') as f:
    f.write(raw_html)
soup_doc = BeautifulSoup(raw_html, "html.parser")

## NEXT STEP

Getting more data!!

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

In [2]:
url = "https://www.imdb.com/search/name/?name=David%20Lynch"
#no headers yet: IMDb answers the default requests client with 403 Forbidden
raw_html = requests.get(url).content
print(raw_html)

b'<html>\r\n<head><title>403 Forbidden</title></head>\r\n<body>\r\n<center><h1>403 Forbidden</h1></center>\r\n</body>\r\n</html>\r\n'


In [3]:
df = pd.read_csv('preped_BBC.csv')

In [4]:
d_list=list(df['director'].unique())

In [5]:
d_list.sort()

In [6]:
for director in d_list[:10]:
    url = "https://www.imdb.com/search/name/?name=" + director.replace(" ","%20")
    print(url)

https://www.imdb.com/search/name/?name=Abbas%20Kiarostami
https://www.imdb.com/search/name/?name=Abdellatif%20Kechiche
https://www.imdb.com/search/name/?name=Abderrahmane%20Sissako
https://www.imdb.com/search/name/?name=Adam%20Curtis
https://www.imdb.com/search/name/?name=Adam%20McKay
https://www.imdb.com/search/name/?name=Agnieszka%20Holland
https://www.imdb.com/search/name/?name=Agnès%20Jaoui
https://www.imdb.com/search/name/?name=Agnès%20Varda
https://www.imdb.com/search/name/?name=Aki%20Kaurismäki
https://www.imdb.com/search/name/?name=Alain%20Cavalier
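
The URLs above only replace spaces, so accented characters such as the "è" in Agnès stay raw. requests usually percent-encodes these itself when it sends the request, but a sketch of doing it explicitly with the standard library could look like this (the helper name `imdb_name_search_url` is made up for illustration):

```python
from urllib.parse import quote

def imdb_name_search_url(name):
    # quote() percent-encodes spaces and non-ASCII characters (as UTF-8 bytes).
    return "https://www.imdb.com/search/name/?name=" + quote(name)

print(imdb_name_search_url("Agnès Varda"))
# https://www.imdb.com/search/name/?name=Agn%C3%A8s%20Varda
```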


In [None]:
{"director":"David Lynch","link":"/name/nm0000186/?ref_=sr_t_1"}

In [None]:
#step 1: loop through director list and search for the link
#get a list of dictionaries (table) with just name and link to imdb page

In [7]:
head={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'}

In [8]:
l_d = []
for director in d_list:
    d = {}
    url = "https://www.imdb.com/search/name/?name=" + director.replace(" ","%20")
    raw_html = requests.get(url,headers = head).content
    #debug copy of the most recent page (overwritten on every iteration)
    with open('bbc.html', 'wb+') as f:
        f.write(raw_html)
    soup_doc = BeautifulSoup(raw_html, "html.parser")
    try:
        #take the first search result; find() returns None when nothing matches
        link = soup_doc.find('div', attrs={'data-testid': 'nlib-title'}).a['href']
    except (AttributeError, TypeError, KeyError):
        link = 'N/A'
    d['director'] = director
    d['link'] = link
    l_d.append(d)
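
The loop above writes every page to the same file, so only the last one survives and a rerun hits IMDb again for every director. A possible alternative, sketched here with hypothetical helpers (`cache_path` and `get_html` are not part of the notebook), keeps one cached file per name and only fetches when that file is missing. The `fetch` argument is any function that takes a URL and returns bytes.

```python
import re
from pathlib import Path

def cache_path(name, cache_dir="html_cache"):
    # Hypothetical helper: "David Lynch" -> html_cache/david_lynch.html
    slug = re.sub(r"\W+", "_", name.strip().lower()).strip("_")
    return Path(cache_dir) / f"{slug}.html"

def get_html(name, url, fetch, cache_dir="html_cache"):
    """Return page bytes, calling fetch(url) only when no cached copy exists."""
    path = cache_path(name, cache_dir)
    if path.exists():
        return path.read_bytes()
    raw_html = fetch(url)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(raw_html)
    return raw_html
```

In the notebook this might be called as `get_html(director, url, lambda u: requests.get(u, headers=head).content)`, assuming `requests` and `head` are defined as in the cells above.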

In [None]:
#step 2: loop through that list of dicts,
#going to each individual page and adding to the dict

Getting more Director Info from IMDB

In [9]:
l_d[0]

{'director': 'Abbas Kiarostami', 'link': '/name/nm0452102/?ref_=sr_t_1'}

In [10]:
#tell imdb you are using a browser!!
head={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'}
#possible urls to use
# url ="https://www.imdb.com/search/title/?name=spirited%20away&title_type=feature"
url = "https://www.imdb.com/search/name/?name=David%20Lynch"
#add headers to request
raw_html = requests.get(url,headers=head).content

with open('imdb.html', 'wb+') as f:
    f.write(raw_html)
soup_doc = BeautifulSoup(raw_html, "html.parser")

## Get Birth Year of Director

In [45]:
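l_y = []
for entry in l_d:
    director = entry['director']
    #links already start with "/", so strip it to avoid a double slash
    url = "https://www.imdb.com/" + entry['link'].lstrip('/')
    raw_html = requests.get(url,headers = head).content
    #debug copy of the most recent page (overwritten on every iteration)
    with open('bbc.html', 'wb+') as f:
        f.write(raw_html)
    soup_doc = BeautifulSoup(raw_html, "html.parser")
    try:
        #generated class name: brittle, breaks whenever IMDb changes its markup
        year = soup_doc.find_all('span', class_='sc-59a43f1c-2 bMLVWg')[1].text.split(', ')[-1]
    except IndexError:
        #fewer than two matching spans (e.g. a 'N/A' link or a missing birth date)
        year = 'N/A'
    d = {}
    d['director'] = director
    d['director_birth'] = year
    l_y.append(d)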
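
The `split(', ')[-1]` step assumes the span text ends in a year, as in "June 22, 1940" (that exact format is an assumption here; the notebook does not show the raw text). Since this homework is about regular expressions, a slightly more defensive sketch could look for a four-digit year directly. The helper name `extract_year` is made up for illustration.

```python
import re

def extract_year(text):
    # Look for a standalone four-digit year between 1800 and 2099.
    match = re.search(r"\b(18|19|20)\d{2}\b", text)
    return int(match.group(0)) if match else None

print(extract_year("June 22, 1940"))     # 1940
print(extract_year("January 20, 1946"))  # 1946
print(extract_year("N/A"))               # None
```

Returning `None` instead of the string 'N/A' would also let pandas treat the column as numeric with missing values.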

In [51]:
year_df = pd.DataFrame(l_y)
year_df.head()

Unnamed: 0,director,year
0,Abbas Kiarostami,1940
1,Abdellatif Kechiche,1960
2,Abderrahmane Sissako,1961
3,Adam Curtis,1955
4,Adam McKay,1968


In [57]:
df = df.merge(year_df, on='director', how='left')
df.head()

Unnamed: 0,movie,director,m_year,critic,crit_org,crit_cn,nm_dir,year
0,Mulholland Drive,David Lynch,2001,Simon Abrams,Freelance film critic,US,1,1946
1,In the Mood for Love,Wong Kar-wai,2000,Simon Abrams,Freelance film critic,US,1,1956
2,The Tree of Life,Terrence Malick,2011,Simon Abrams,Freelance film critic,US,1,1943
3,Yi Yi: A One and a Two,Edward Yang,2000,Simon Abrams,Freelance film critic,US,1,1947
4,Goodbye to Language,Jean-Luc Godard,2014,Simon Abrams,Freelance film critic,US,1,1930


In [7]:
df.to_csv('preped_BBC.csv',index=False)

## Find the language/genre of the movie

In [11]:
l_m = list(df['movie'].unique())
l_m.sort()

In [33]:
l_m[:3]

['12 Years a Slave', '2046', '24 Hour Party People']

In [44]:
movie_link = []
for film in l_m:
    url = "https://www.imdb.com/find/?q=" + film.replace(" ","%20")
    raw_html = requests.get(url,headers = head).content
    with open('bbc.html', 'wb+') as f:
        f.write(raw_html)
    soup_doc = BeautifulSoup(raw_html, "html.parser")
    #take the first title result; it may not be the film the critic meant
    link = soup_doc.find('section', attrs={"data-testid":"find-results-section-title"}).find('div',class_='ipc-metadata-list-summary-item__tc').a['href']
    link = "https://www.imdb.com" + link
    d={}
    d['movie'] = film
    d['link'] = link
    movie_link.append(d)

In [45]:
movie_link[0]

{'movie': '12 Years a Slave',
 'link': 'https://www.imdb.com/title/tt2024544/?ref_=fn_all_ttl_1'}

In [46]:
df = pd.DataFrame(movie_link)
df.to_csv('movies_links.csv',index=False)

In [47]:
df.head()

Unnamed: 0,movie,link
0,12 Years a Slave,https://www.imdb.com/title/tt2024544/?ref_=fn_...
1,2046,https://www.imdb.com/title/tt0212712/?ref_=fn_...
2,24 Hour Party People,https://www.imdb.com/title/tt0274309/?ref_=fn_...
3,25th Hour,https://www.imdb.com/title/tt0307901/?ref_=fn_...
4,3-Iron,https://www.imdb.com/title/tt1300854/?ref_=fn_...


In [48]:
languages = []
problem = []
for movie in movie_link:
    url = movie['link']
    try:
        head={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'}
        raw_html = requests.get(url,headers=head).content
        with open('imdb.html', 'wb+') as f:
            f.write(raw_html)
        soup_doc = BeautifulSoup(raw_html, "html.parser")
        language = soup_doc.find('section', attrs={"data-testid":"Details"}).find('li',attrs={"data-testid":"title-details-languages"}).a.text
    except Exception as e:
        language = 'N/A'
        print(e)
        print(movie)
        problem.append(movie)
    d = {}
    d['movie'] = movie['movie']
    d['language'] = language
    languages.append(d)

'NoneType' object has no attribute 'a'
{'movie': '35 Shots of Rum', 'link': 'https://www.imdb.com/title/tt33274024/?ref_=fn_all_ttl_1'}
'NoneType' object has no attribute 'a'
{'movie': 'A Pigeon Sat on a Branch Reflecting on Existence', 'link': 'https://www.imdb.com/title/tt30936380/?ref_=fn_all_ttl_1'}
'NoneType' object has no attribute 'a'
{'movie': 'Cuba: An African Odyssey', 'link': 'https://www.imdb.com/title/tt1106760/?ref_=fn_all_ttl_1'}
'NoneType' object has no attribute 'a'
{'movie': 'Daylight Moon', 'link': 'https://www.imdb.com/title/tt0360504/?ref_=fn_all_ttl_1'}
'NoneType' object has no attribute 'a'
{'movie': 'Death Proof – Grindhouse', 'link': 'https://www.imdb.com/title/tt34025154/?ref_=fn_all_ttl_1'}
'NoneType' object has no attribute 'a'
{'movie': 'Father of My Children', 'link': 'https://www.imdb.com/title/tt34256213/?ref_=fn_all_ttl_1'}
'NoneType' object has no attribute 'a'
{'movie': 'In the Mood for Love', 'link': 'https://www.imdb.com/title/tt20833922/?ref_=fn_al

## Look at the Error

In [36]:
problem[0]

{'movie': '35 Shots of Rum',
 'link': 'https://www.imdb.com/title/tt33274024/?ref_=fn_all_ttl_1'}

In [40]:
url = 'https://www.imdb.com/find/?q=Aurora'
raw_html = requests.get(url,headers = head).content
with open('bbc.html', 'wb+') as f:
    f.write(raw_html)
soup_doc = BeautifulSoup(raw_html, "html.parser")

In [43]:
link = soup_doc.find('section', attrs={"data-testid":"find-results-section-title"}).find('div',class_='ipc-metadata-list-summary-item__tc').a['href']
link = "https://www.imdb.com" + link
link

'https://www.imdb.com/title/tt19394520/?ref_=fn_all_ttl_1'

In [49]:
for p in problem:
    if 'name' in p['link']:
        print(p)

In [32]:
df = pd.DataFrame(problem)
df.to_csv('Problems.csv',index=False)

In [50]:
df_l = pd.DataFrame(languages)
df_l.head()

Unnamed: 0,movie,language
0,12 Years a Slave,English
1,2046,Cantonese
2,24 Hour Party People,English
3,25th Hour,English
4,3-Iron,English


In [53]:
df = pd.read_csv('preped_BBC.csv')

In [54]:
df = df.merge(df_l, on='movie', how='left')

In [55]:
df.head(10)

Unnamed: 0,movie,director,m_year,critic,crit_org,crit_cn,nm_dir,director_birth,language_x,language_y
0,Mulholland Drive,David Lynch,2001,Simon Abrams,Freelance film critic,US,1,1946.0,English,English
1,In the Mood for Love,Wong Kar-wai,2000,Simon Abrams,Freelance film critic,US,1,1956.0,,
2,The Tree of Life,Terrence Malick,2011,Simon Abrams,Freelance film critic,US,1,1943.0,English,English
3,Yi Yi: A One and a Two,Edward Yang,2000,Simon Abrams,Freelance film critic,US,1,1947.0,Mandarin,Mandarin
4,Goodbye to Language,Jean-Luc Godard,2014,Simon Abrams,Freelance film critic,US,1,1930.0,French,French
5,The White Meadows,Mohammad Rasoulof,2009,Simon Abrams,Freelance film critic,US,1,1972.0,Persian,Persian
6,Night Across the Street,Raoul Ruiz,2012,Simon Abrams,Freelance film critic,US,1,,,
7,Certified Copy,Abbas Kiarostami,2010,Simon Abrams,Freelance film critic,US,1,1940.0,French,French
8,Sparrow,Johnnie To,2008,Simon Abrams,Freelance film critic,US,1,1955.0,English,English
9,Fados,Carlos Saura,2007,Simon Abrams,Freelance film critic,US,1,1932.0,Portuguese,Portuguese
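
The `language_x` and `language_y` columns appear because the CSV that was loaded already had a `language` column from an earlier run, so the merge found the name on both sides and added suffixes. One possible way to keep a single column, sketched on made-up toy frames (not necessarily what was done in this notebook), is to drop the stale column before merging:

```python
import pandas as pd

# Toy frames (illustrative only): `main` already carries a stale 'language' column.
main = pd.DataFrame({"movie": ["2046", "Fados"], "language": ["Cantonese", None]})
lookup = pd.DataFrame({"movie": ["2046", "Fados"], "language": ["Cantonese", "Portuguese"]})

# Merging as-is produces suffixed duplicates.
clash = main.merge(lookup, on="movie", how="left")
print(list(clash.columns))  # ['movie', 'language_x', 'language_y']

# Dropping the stale column first keeps a single 'language' column.
clean = main.drop(columns=["language"], errors="ignore").merge(lookup, on="movie", how="left")
print(list(clean.columns))  # ['movie', 'language']
```

`errors="ignore"` makes the drop harmless on a first run, when the column does not exist yet.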


In [56]:
df.to_csv('preped_BBC.csv',index=False)

### How many different languages are there?

In [2]:
import pandas as pd

In [3]:
df = pd.read_csv('preped_BBC.csv')

In [4]:
df['language'].nunique()

38

In [5]:
df['language'].unique()

array(['English', nan, 'Mandarin', 'French', 'Persian', 'Portuguese',
       'Thai', 'Japanese', 'Indonesian', 'Spanish', 'Italian',
       'Cantonese', 'Russian', 'German', 'Bambara', 'Korean', 'Nepali',
       'Swedish', 'Amharic', 'Hindi', 'Hungarian', 'Kannada', 'Bosnian',
       'Hebrew', 'Filipino', 'Danish', 'Swahili', 'Turkish', 'Greek',
       'Polish', 'Maya', 'Kabuverdianu', 'Arabic', 'Romanian', 'Kurdish',
       'Chinese', 'Low German', 'Czech', 'Bengali'], dtype=object)

### What is the relative percentage of each language?

In [127]:
df_l.head()

Unnamed: 0,movie,language
0,12 Years a Slave,English
1,2046,Cantonese
2,24 Hour Party People,English
3,25th Hour,English
4,3-Iron,English


In [129]:
df_l[df_l['language'] !='N/A']['language'].value_counts()

language
English         340
French           45
Spanish          21
Japanese         15
Italian          12
Portuguese       11
Hindi            10
None              9
Arabic            8
Mandarin          8
Persian           7
Korean            7
German            6
Hebrew            6
Swedish           5
Russian           5
Thai              4
Turkish           4
Hungarian         3
Romanian          3
Cantonese         3
Indonesian        2
Danish            2
Chinese           2
Bambara           1
Bengali           1
Swahili           1
Low German        1
Amharic           1
Bosnian           1
Nepali            1
Kurdish           1
Czech             1
Polish            1
Filipino          1
Greek             1
Kabuverdianu      1
Maya              1
Kannada           1
Name: count, dtype: int64

In [134]:
df_l[df_l['language'] !='N/A']['language'].value_counts(normalize=True)

language
English         0.614828
French          0.081374
Spanish         0.037975
Japanese        0.027125
Italian         0.021700
Portuguese      0.019892
Hindi           0.018083
None            0.016275
Arabic          0.014467
Mandarin        0.014467
Persian         0.012658
Korean          0.012658
German          0.010850
Hebrew          0.010850
Swedish         0.009042
Russian         0.009042
Thai            0.007233
Turkish         0.007233
Hungarian       0.005425
Romanian        0.005425
Cantonese       0.005425
Indonesian      0.003617
Danish          0.003617
Chinese         0.003617
Bambara         0.001808
Bengali         0.001808
Swahili         0.001808
Low German      0.001808
Amharic         0.001808
Bosnian         0.001808
Nepali          0.001808
Kurdish         0.001808
Czech           0.001808
Polish          0.001808
Filipino        0.001808
Greek           0.001808
Kabuverdianu    0.001808
Maya            0.001808
Kannada         0.001808
Name: proportion

### Where do the critics come from?

In [136]:
df['crit_cn'].unique()

array(['US', 'Mexico', 'UK', 'Italy', 'Turkey', 'Brazil', 'Canada',
       'Argentina', 'France', 'Germany', 'India', 'Israel', 'Australia',
       'Cuba', 'Colombia', 'Senegal', 'South Korea', 'Philippines',
       'Belgium', 'Egypt', 'Chile', 'Lebanon', 'Japan', 'South Africa',
       'Austria', 'Kazakhstan', 'UAE', 'Hong Kong', 'Namibia',
       'Singapore', 'Switzerland', 'Bangladesh', 'Indonesia', 'Taiwan',
       'Qatar', 'China'], dtype=object)

In [141]:
df['critic'].nunique()

177

In [140]:
df.groupby('crit_cn')['critic'].nunique().sort_values(ascending=False)

crit_cn
US              82
UK              18
Cuba             5
India            5
Canada           5
Germany          5
France           5
Australia        4
Israel           4
Italy            4
Colombia         4
Lebanon          3
UAE              3
Austria          2
Turkey           2
South Korea      2
Singapore        2
Mexico           2
Argentina        2
Chile            2
Japan            1
South Africa     1
China            1
Hong Kong        1
Taiwan           1
Switzerland      1
Brazil           1
Indonesia        1
Kazakhstan       1
Senegal          1
Qatar            1
Philippines      1
Egypt            1
Belgium          1
Bangladesh       1
Namibia          1
Name: critic, dtype: int64

### Language diversity of the picks, by critics' country

In [144]:
# How many different film languages did the critics from each country recommend?
df.groupby('crit_cn')['language'].nunique()

crit_cn
Argentina        7
Australia       12
Austria          9
Bangladesh       6
Belgium          6
Brazil           4
Canada           8
Chile            6
China            6
Colombia         8
Cuba            18
Egypt            6
France          13
Germany         13
Hong Kong        7
India           11
Indonesia        9
Israel          11
Italy           14
Japan            5
Kazakhstan       4
Lebanon         12
Mexico           8
Namibia          6
Philippines      6
Qatar            5
Senegal          5
Singapore        8
South Africa     5
South Korea      6
Switzerland      7
Taiwan           6
Turkey           8
UAE             10
UK              20
US              29
Name: language, dtype: int64
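
These raw counts grow with the number of critics in each country (the US has 82 critics, several countries have one), so they are hard to compare directly. One rough way to adjust, sketched here on made-up toy rows, is to divide the number of distinct languages by the number of distinct critics. The column name `languages_per_critic` is an illustrative choice, and the ratio is only a crude indicator, not an established diversity measure.

```python
import pandas as pd

# Toy data (illustrative only): one row per critic pick.
toy = pd.DataFrame({
    "crit_cn":  ["US", "US", "US", "US", "Japan", "Japan"],
    "critic":   ["A", "A", "B", "B", "C", "C"],
    "language": ["English", "French", "English", "Korean", "Japanese", "English"],
})

# Named aggregation: distinct critics and distinct languages per country.
summary = toy.groupby("crit_cn").agg(
    n_critics=("critic", "nunique"),
    n_languages=("language", "nunique"),
)
summary["languages_per_critic"] = summary["n_languages"] / summary["n_critics"]
print(summary)
```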

In [148]:
# most recommended languages for each country
df[df['language'] !='N/A'].groupby('crit_cn')['language'].value_counts().head(50)

crit_cn     language  
Argentina   English       11
            Italian        2
            Japanese       2
            Spanish        2
            Korean         1
            Thai           1
Australia   English       20
            French         8
            Japanese       3
            Persian        2
            Cantonese      1
            Italian        1
            None           1
            Portuguese     1
            Russian        1
            Spanish        1
            Thai           1
Austria     English        7
            Spanish        3
            French         2
            Korean         2
            German         1
            None           1
            Romanian       1
            Thai           1
Bangladesh  English        3
            Japanese       1
            Korean         1
            Persian        1
            Russian        1
Belgium     Spanish        3
            English        2
            French         1
            Hungaria