You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This was a 'mirror' of the IMDb database, stored in a mariadb
database. The data are downloaded from one of the IMDb FTP
mirrors, then processed to remove TV shows, porn, shorts,
and 'insignificant' films (based on number of ratings).
The database and scraper are housed in docker containers
and orchestrated via docker-compose
I note that as of the end of December 2017, IMDB have shuttered their
FTP site from which this project downloaded data. Since movies are no
long my bread and butter, I will put this project into mothballs.
Initialization
Prepare the srv directory where we will store the database
and the make state:
Then use docker-compose to initialize the database.
docker-compose build
docker-compose up -d
sleep 2s && docker-compose ps
It may take a while to pull the image for mariadb. To check on the status
of your images, run
docker-compose ps
You should see something like the following:
Name Command State Ports
-----------------------------------------------------------------------------------------
imdbmirror_imdb_1 /usr/bin/make noop Exit 0
imdbmirror_mariastate_1 /bin/true Exit 0
imdbmirror_mysqldb_1 /docker-entrypoint.sh mysqld Up 0.0.0.0:23306->3306/tcp
imdbmirror_state_1 /bin/true Exit 0
Download
The commands to download, transform, and store the IMDb data are
run via a Makefile within the docker image imdb. The main command
of this docker image is make, so y
be run via:
alias imake='docker-compose run --rm imdb'
imake help
Creating a mirror works by the following steps:
Downloading files from the FTP mirror.
Processing them with perl to remove TV shows.
Stuffing them into a sqlite file with imdb2sql.py.
Removing porn, unpopular titles and shorts, as well as stranded actors and
directors.
Stuffing the remaining data into mariadb, interpreting some of the fields.
You can run them via:
# download:
imake -j 4 downloaded
# process
imake -j 6 procd
# raw sqlite
imake raw_sqlite
# remove porn and so on:
imake sqlite
# stuff into mariadb:
imake stuffed
Note This can take a good long while. Depending on your internet connection, the
download might take 20 minutes or more. Processing into sqlite may take around 10 to
15 minutes, and then post-processing and loading into mariadb may take an hour. Also,
intermediate files will be kept around, saved in the ./srv/imdb/state directory.
Having intermediate files allows you to change the code and rerun without forcing a
new download.
Now connect to the database and poke around:
mysql -s --host=0.0.0.0 --port=23306 --user=moe --password=movies4me IMDB
Shiny frontend
The docker-compose command now also supports a shiny database front end. If you wish to
turn it off, comment out the following stanza from docker-compose.yml (I do not know
of a better solution at the moment):
# comment it out by putting '#' in front, like a nice yaml file.
shiny:
extends:
file: common.yml
service: common
build: ./shiny
links:
- mysqldb:mysqldb
ports:
- 0.0.0.0:19191:5555
If you do want to use the shiny frontend, once the image is running, go to
localhost port 19191, and poke around. You can
grep for title, cast members, directors, restrict by votes per year and
production year. When you select a row from the table in the titles tab,
details on that film will become available in the other tabs.
'Gung Ho!': The Story of Carlson's Makin Island Raiders
movie_admissions:
movie_id
id
amount
locale
end_date
title
249155
1
1225
FRANCE
2016-01-26
Fanny
249155
2
1128
FRANCE
2016-01-19
Fanny
532035
3
92169
SPAIN
NA
Nobleza baturra
166918
4
1152
FRANCE
2016-01-26
César
166918
5
1094
FRANCE
2016-01-19
César
124437
6
727
FRANCE
2015-04-14
Camille
movie_budgets:
movie_id
id
units
amount
title
2
1
$
500
#
15
2
$
500
#2 Chick
25
3
$
1000
#30
41
4
€
1500
#99
44
5
€
10000
#ambigot
55
6
$
20000
#bars4Justice
movie_companies:
movie_id
id
company_id
company_type_id
note
title
25
5
4
1
(2015) (USA) (video)
#30
39
7
6
1
(2007) (India) (all media)
#73, Shaanthi Nivaasa
70
9
8
1
(2016) (Mexico) (theatrical)
#BuscandoaInés
115
11
10
1
(2014) (USA) (all media) (Internet)
#Hacked
122
12
11
1
(2015) (worldwide) (video)
#HeroIsTheNewBlack!
124
13
12
1
(2013) (Russia) (all media)
#Hommes: True Story of the Thieves' World
movie_first_US_release:
movie_id
date
title
2
2014-01-20
#
6
2015-12-12
#1 at the Apocalypse Box Office
7
2014-04-30
#1 Beauty Nail Salon
11
2015-12-01
#10007
12
2015-07-30
#1137
15
2014-06-17
#2 Chick
movie_gross:
movie_id
id
units
amount
locale
reissue
end_date
title
800610
1
ITL
2.4e+08
ITALY
0
NA
The Silent Enemy
44754
2
$
3.3e+06
USA
0
NA
All Quiet on the Western Front
58263
3
$
1.0e+06
USA
0
NA
Anna Christie
58263
4
$
4.9e+05
WORLDWIDE
0
NA
Anna Christie
185421
5
$
7.8e+04
USA
0
2001-12-09
Der blaue Engel
185421
6
$
6.9e+04
USA
0
2001-11-11
Der blaue Engel
movie_info:
movie_id
id
info_type_id
info
note
title
269
370439
15
Dawn Divine: He could have killed me too, you know?::Joe Collins: That's right.::Dawn Divine: I hate him!::Joe Collins: Waste of time!::Dawn Divine: I don't care! I could kill him!::Joe Collins: Hey, stealing's a business, not a crusade.
NA
$
269
370440
15
Granich: Can you gets these bottles into Copenhagan?::Candy Man: What is it?::Granich: Pure, consentrated, acid. From one ounce, they will make 300,000 capsules of LSD. If you put one of these into the water supply, all of Hamburg takes the trip!
NA
$
269
370441
15
[on the telephone]::Sarge: Yeah, I know the line is busy. Find out if Miss Divine is talking, or if the phone is out of order. Hello? Operator? [hangs up] Damn, Hamburg operators. Why the hell don't they learn English?
NA
$
269
370442
15
Sarge: [in an irritable voice as he disgustedly watches the Candy Man lean back and forth and crane his neck in various directions and climb around to look on different sides of the closet walls and tap the panelling with his knuckles] What're ya BUILDIN'?
NA
$
269
370443
15
Mr. Kessel: [seeing Miss Devine's heavily-loaded grocery bag] Ahhhh, Fraulein! Soooo much to eat for such a little girl::Dawn Divine: [giggling] I have no willpower!
NA
$
269
370444
15
Sarge: [seeing that Miss Devine is exiting the booth that he was planning to go into] Hey - - what're you doing in there?::Dawn Divine: [flashing a bright mischievous smile and replying in a pretend furtive whisper] Robbing a bank!
NA
$
movie_info_idx:
movie_id
id
info_type_id
info
note
title
39
881598
99
1000001103
NA
#73, Shaanthi Nivaasa
39
881599
100
117
NA
#73, Shaanthi Nivaasa
39
881600
101
6.4
NA
#73, Shaanthi Nivaasa
41
881607
99
0.......08
NA
#99
41
881608
100
13
NA
#99
41
881609
101
9.2
NA
#99
movie_keyword:
movie_id
id
keyword_id
title
62
115
111
#Bitch, les filles et la violence
62
116
112
#Bitch, les filles et la violence
62
117
113
#Bitch, les filles et la violence
62
118
114
#Bitch, les filles et la violence
62
119
115
#Bitch, les filles et la violence
75
120
116
#Coachella
movie_opening_weekend:
movie_id
id
units
amount
locale
screens
end_date
title
185421
1
$
4808
USA
1
2001-07-22
Der blaue Engel
185421
2
$
6451
USA
1
2001-07-15
Der blaue Engel
405541
3
£
1901
UK
1
2004-02-15
L'âge d'or
464463
4
$
6123
USA
1
1931-03-14
M
488326
5
$
7554
USA
1
2015-08-09
Metropolitan
411775
6
$
7526
USA
1
2012-05-13
La grande illusion
movie_premiere_US_release:
movie_id
date
title
55
2015-10-01
#bars4Justice
244
2016-01-24
#UnitedWeWin.
251
2016-12-31
#WeFilmLA
269
1971-12-15
$
336
2008-10-09
& Teller
489
1935-04-18
'G' Men
movie_raw_runtimes:
movie_id
rt
title
7
5
#1 Beauty Nail Salon
6
11
#1 at the Apocalypse Box Office
11
10
#10007
18
7
#23
25
6
#30
28
11
#47
movie_release_dates:
movie_id
id
locale
date
is_premiere
note
title
2
1
USA
2014-01-20
0
(DVD premiere)
#
7
2
USA
2014-04-30
0
NA
#1 Beauty Nail Salon
6
3
AUSTRALIA
2015-07-31
0
NA
#1 at the Apocalypse Box Office
6
4
USA
2015-12-12
0
NA
#1 at the Apocalypse Box Office
11
5
USA
2015-12-01
0
NA
#10007
12
6
USA
2015-07-30
0
NA
#1137
movie_rentals:
movie_id
id
units
amount
locale
estimated
ex_usa
title
44754
1
$
1500000
USA
0
0
All Quiet on the Western Front
745215
2
$
1218000
WORLDWIDE
0
0
The Divorcee
348571
3
$
551000
USA
0
0
In Gay Madrid
348571
4
$
398000
NON-USA
0
0
In Gay Madrid
892605
5
$
2300000
WORLDWIDE
0
0
Whoopee!
145575
6
$
1500000
USA
0
0
City Lights
movie_runtime:
movie_id
runtime
minruntime
maxruntime
nruntimes
title
6
11
11
11
1
#1 at the Apocalypse Box Office
7
5
5
5
1
#1 Beauty Nail Salon
11
10
10
10
1
#10007
18
7
7
7
1
#23
25
6
6
6
1
#30
28
11
11
11
1
#47
movie_votes:
movie_id
id
updated_at
votes
rating
vote_mean
vote_sd
vote_se
vote1
vote2
vote3
vote4
vote5
vote6
vote7
vote8
vote9
vote10
title
39
1
2016-05-17 13:19:07
117
6.4
6.7
3.2
0.30
14
5
5
5
5
5
14
14
5
32
#73, Shaanthi Nivaasa
41
2
2016-05-17 13:19:07
13
9.2
9.5
2.0
0.56
5
0
0
0
0
0
0
0
5
89
#99
45
3
2016-05-17 13:19:07
17
7.0
7.1
2.3
0.56
5
0
0
14
5
0
24
24
14
14
#AmeriCan
57
4
2016-05-17 13:19:07
9
4.9
5.0
3.1
1.05
30
0
13
0
0
0
30
13
13
0
#Beings
120
5
2016-05-17 13:19:07
6
8.2
8.3
3.2
1.32
16
0
0
0
0
0
0
16
0
68
#Help
125
6
2016-05-17 13:19:07
1198
3.4
3.9
2.7
0.08
25
15
15
15
5
5
5
5
5
5
#Horror
movie_weekend_gross:
movie_id
id
units
amount
locale
screens
end_date
days_open
title
185421
1
$
7806
USA
1
2001-12-09
147
Der blaue Engel
185421
2
$
1854
USA
1
2001-11-11
119
Der blaue Engel
185421
3
$
8134
USA
1
2001-11-04
112
Der blaue Engel
185421
4
$
3430
USA
1
2001-08-12
28
Der blaue Engel
185421
5
$
4808
USA
1
2001-07-22
7
Der blaue Engel
185421
6
$
6451
USA
1
2001-07-15
0
Der blaue Engel
Cast by budget
Here we grab all movies with budget information (n.b. these are sometimes
bogus, because IMDb users can enter the information without much confirmation.)
then sum movie budgets for each actor and actress appearing in the movie,
then find the top actor and actress by birth year. Here we list the top
actor and actress by birth year, with total budget in billions, and the number
of films:
library(RMySQL)
library(dplyr)
library(knitr)
imcon<- src_mysql(host='0.0.0.0',user='moe',password='movies4me',dbname='IMDB',port=23306)
capt<- dbGetQuery(imcon$con,'SET NAMES utf8')
# values over 300M are likely bogusbudgets<- tbl(imcon,'movie_budgets') %>%
filter(units=='$') %>%
select(movie_id,amount) %>%
filter(amount<3e8)
role_info<- tbl(imcon,'role_type') %>%
filter(role%in% c('actor','actress')) %>%
select(role_id)
sumbudget<- tbl(imcon,'cast_info') %>%
inner_join(role_info,by='role_id') %>%
select(movie_id,person_id,role_id) %>%
inner_join(budgets,by='movie_id') %>%
group_by(person_id) %>%
summarize(totroles=n(),
sumamount=sum(amount)) %>%
ungroup()
aggregated<-sumbudget %>%
inner_join(tbl(imcon,'name') %>%
select(person_id,name,dob,gender) %>%
filter(!is.null(dob)),by='person_id') %>%
collect()
topcast<-aggregated %>%
mutate(yob=as.numeric(gsub('^((19|20)\\d{2}).+','\\1',dob))) %>%
filter(!is.na(yob)) %>%
group_by(yob,gender) %>%
arrange(desc(sumamount)) %>%
summarize(totbudget=first(sumamount) /1e9,
topname=first(name),
toproles=first(totroles)) %>%
ungroup()
# I wish there were an easy way to do this via tidyr,# and maybe there is...topm<-topcast %>%
select(yob,topname,toproles,totbudget,gender) %>%
filter(gender=='m')
topf<-topcast %>%
select(yob,topname,toproles,totbudget,gender) %>%
filter(gender=='f')
topc<- inner_join(topm %>% select(-gender),topf %>% select(-gender),by='yob') %>%
rename(actor_total_budget=totbudget.x,
actor_name=topname.x,
actor_roles=toproles.x,
actress_total_budget=totbudget.y,
actress_name=topname.y,
actress_roles=toproles.y)
topc %>%
filter(yob>1945,yob<=1995) %>%
kable()
yob
actor_name
actor_roles
actor_total_budget
actress_name
actress_roles
actress_total_budget
1946
Welker, Frank
137
8.27
Sarandon, Susan
52
1.63
1947
Ratzenberger, John
35
2.88
Marshall, Mona
28
2.34
1948
Jackson, Samuel L.
88
4.80
Bates, Kathy
37
1.70
1949
Broadbent, Jim
42
2.48
Weaver, Sigourney
43
1.76
1950
Coltrane, Robbie
25
2.06
Walters, Julie
16
1.31
1951
Williams, Robin
58
2.56
MacNeille, Tress
35
1.82
1952
Neeson, Liam
63
3.66
Newman, Laraine
32
2.77
1953
Molina, Alfred
35
1.90
Lockhart, Anne
29
1.28
1954
Rabson, Jan
32
2.47
Russo, Rene
20
1.16
1955
Willis, Bruce
76
3.41
Goldberg, Whoopi
45
1.38
1956
Hanks, Tom
53
3.59
Staunton, Imelda
20
1.20
1957
Castellaneta, Dan
38
2.37
Hoffman, Bridget
30
1.83
1958
Oldman, Gary
42
3.20
Pfeiffer, Michelle
32
1.29
1959
Morshower, Glenn
39
1.97
Thompson, Emma
27
1.66
1960
Weaving, Hugo
32
3.30
Moore, Julianne
51
1.64
1961
Murphy, Eddie
58
3.38
Hunt, Bonnie
20
1.52
1962
Baker, Dee Bradley
58
3.37
Cusack, Joan
28
1.31
1963
Harnell, Jess
42
4.39
Shue, Elisabeth
24
0.59
1964
Papajohn, Michael
52
3.88
Janssen, Famke
31
1.51
1965
Downey Jr., Robert
53
3.50
Lane, Diane
25
1.39
1966
Favreau, Jon
30
2.15
Berry, Halle
36
2.60
1967
Ruffalo, Mark
38
2.24
Kidman, Nicole
38
1.95
1968
Vernon, Conrad
26
3.75
Liu, Lucy
26
1.29
1969
Black, Jack
47
2.58
Blanchett, Cate
33
2.48
1970
Damon, Matt
57
2.97
Weisz, Rachel
23
1.48
1971
Renner, Jeremy
27
2.59
Gugino, Carla
30
1.81
1972
Johnson, Dwayne
33
1.91
Diaz, Cameron
35
1.90
1973
Marsden, James
37
2.08
Wiig, Kristen
22
1.28
1974
Bale, Christian
33
1.93
Banks, Elizabeth
33
1.80
1975
Maguire, Tobey
21
1.81
Jolie, Angelina
34
2.32
1976
Cumberbatch, Benedict
19
1.79
Witherspoon, Reese
29
1.21
1977
Bloom, Orlando
17
1.71
Washington, Kerry
20
0.86
1978
Mackie, Anthony
37
2.64
Saldana, Zoe
25
1.32
1979
Evans, Luke
12
1.62
Dawson, Rosario
33
1.25
1980
Tatum, Channing
31
1.58
Green, Eva
9
0.96
1981
Evans, Chris
29
3.20
Portman, Natalie
33
1.51
1982
Rogen, Seth
34
1.84
Dunst, Kirsten
36
1.44
1983
Hemsworth, Chris
18
2.03
Kunis, Mila
21
1.11
1984
Curry, Graham
16
1.71
Johansson, Scarlett
39
3.26
1985
Lutz, Kellan
14
0.70
Gadot, Gal
10
1.29
1986
LaBeouf, Shia
20
1.49
Fox, Megan
11
0.82
1987
Felton, Tom
14
1.28
Page, Ellen
12
0.82
1988
Integra, Raiden
15
1.56
Stone, Emma
16
0.94
1989
Hoult, Nicholas
13
1.31
Olsen, Elizabeth
7
1.36
1990
Taylor-Johnson, Aaron
14
1.08
Lawrence, Jennifer
18
1.42
1991
Keynes, Skandar
3
0.56
Wright, Bonnie
7
1.03
1992
Sabara, Daryl
19
1.60
Watters, Bailee
8
0.81
1993
Bright, Cameron
16
0.80
Robb, AnnaSophia
11
0.42
1994
Melling, William
5
0.70
Fanning, Dakota
21
0.86
1995
Jackson, Billy
6
0.58
Hanratty, Sammi
14
0.87
Try again
That kind of works. However, the cast_info table also contains a field
nr_order that lists the order of an actor/actress in some films.
(In some titles, the nr_order field is not populated, or lists
each cast member as 1.) Presumably lower order equates to higher billing
in the cast and perhaps higher salary. Here we proceed to estimate
total earned salary for actors and actresses by assuming salary is inversely
proportional to the nr_order, as follows:
For each film, fill out any missing nr_order with the largest known
value for that film.
Then substitute all missing nr_order values with a 1 (this only
occurs for films with no recorded values for nr_order).
Rescale the nr_order for each film so that they sum to 1.0 for
each film.
Multiply the rescaled nr_order by one-seventh the budget of the film.
Aggregate over actors and actresses, then group by birth year and
gender.
library(RMySQL)
library(dplyr)
library(knitr)
imcon<- src_mysql(host='0.0.0.0',user='moe',password='movies4me',dbname='IMDB',port=23306)
capt<- dbGetQuery(imcon$con,'SET NAMES utf8')
# values over 300M are likely bogusbudgets<- tbl(imcon,'movie_budgets') %>%
filter(units=='$') %>%
select(movie_id,amount) %>%
filter(amount<3e8)
role_info<- tbl(imcon,'role_type') %>%
filter(role%in% c('actor','actress')) %>%
select(role_id)
aggbudget<- tbl(imcon,'cast_info') %>%
inner_join(role_info,by='role_id') %>%
select(movie_id,person_id,role_id,nr_order) %>%
inner_join(budgets,by='movie_id') %>%
collect()
# fill with maxfix1<-aggbudget %>%
group_by(movie_id) %>%
mutate(nr_order=pmin(nr_order,max(nr_order,na.rm=TRUE),na.rm=TRUE)) %>%
ungroup()
# replace with 1fix2<-fix1 %>%
mutate(nr_order=pmin(nr_order,1,na.rm=TRUE))
# rescalefix3<-fix2 %>%
mutate(scale_order=1/ (nr_order)) %>%
group_by(movie_id) %>%
mutate(scale_order=scale_order/ sum(scale_order,na.rm=FALSE)) %>%
ungroup()
# multiply by budget, divide by 7, aggregate over peoplebbyname<-fix3 %>%
mutate(salary=(1.0/7.0) *scale_order*amount) %>%
group_by(person_id) %>%
summarize(totsalary=sum(salary),
totroles=n()) %>%
ungroup()
# get name infonaminfo<- tbl(imcon,'name') %>%
select(person_id,name,dob,gender) %>%
filter(!is.null(dob)) %>%
collect()
topcast<-bbyname %>%
inner_join(naminfo,by='person_id') %>%
mutate(yob=as.numeric(gsub('^((19|20)\\d{2}).+','\\1',dob))) %>%
filter(!is.na(yob)) %>%
group_by(yob,gender) %>%
arrange(desc(totsalary)) %>%
summarize(totbudget=first(totsalary) /1e6,
topname=first(name),
toproles=first(totroles)) %>%
ungroup()
# I wish there were an easy way to do this via tidyr,# and maybe there is...topm<-topcast %>%
select(yob,topname,toproles,totbudget,gender) %>%
filter(gender=='m')
topf<-topcast %>%
select(yob,topname,toproles,totbudget,gender) %>%
filter(gender=='f')
topc<- inner_join(topm %>% select(-gender),topf %>% select(-gender),by='yob') %>%
rename(actor_total_budget=totbudget.x,
actor_name=topname.x,
actor_roles=toproles.x,
actress_total_budget=totbudget.y,
actress_name=topname.y,
actress_roles=toproles.y)
topc %>%
filter(yob>1945,yob<=1995) %>%
kable()
yob
actor_name
actor_roles
actor_total_budget
actress_name
actress_roles
actress_total_budget
1946
Welker, Frank
137
23.09
Darling, Jennifer
20
3.97
1947
Ratzenberger, John
35
8.68
Marshall, Mona
28
6.29
1948
Jackson, Samuel L.
88
10.23
Bates, Kathy
37
3.89
1949
Broadbent, Jim
42
5.56
Weaver, Sigourney
43
5.20
1950
Perlman, Ron
49
6.45
Walters, Julie
16
3.87
1951
Williams, Robin
58
6.21
MacNeille, Tress
35
4.08
1952
Cummings, Jim
56
14.40
Newman, Laraine
32
7.55
1953
Molina, Alfred
35
4.29
Soucie, Kath
30
4.39
1954
Rabson, Jan
32
6.68
O'Hara, Catherine
30
3.94
1955
Willis, Bruce
76
7.48
Goldberg, Whoopi
45
3.22
1956
Hanks, Tom
53
6.42
Staunton, Imelda
20
2.94
1957
Buscemi, Steve
63
5.56
Hoffman, Bridget
30
4.74
1958
Oldman, Gary
42
6.65
Pfeiffer, Michelle
32
3.78
1959
Kilmer, Val
55
5.60
Thompson, Emma
27
4.69
1960
Banderas, Antonio
51
6.70
Moore, Julianne
51
3.77
1961
Murphy, Eddie
58
7.78
Hunt, Bonnie
20
3.30
1962
Baker, Dee Bradley
58
7.48
Yeoh, Michelle
17
3.64
1963
Harnell, Jess
42
10.18
Shue, Elisabeth
24
2.11
1964
Bergen, Bob
42
10.25
Bullock, Sandra
32
4.84
1965
Stiller, Ben
48
5.90
Lane, Diane
25
2.52
1966
Sturgis, Gary Anthony
19
20.92
Berry, Halle
36
4.62
1967
Statham, Jason
32
25.44
Kidman, Nicole
38
4.50
1968
Vernon, Conrad
26
12.60
Liu, Lucy
26
3.90
1969
Black, Jack
47
6.54
Blanchett, Cate
33
5.42
1970
Whiting, Al
12
29.91
Weisz, Rachel
23
4.01
1971
Stepanek, Brian
14
5.85
Poehler, Amy
25
2.93
1972
Law, Jude
29
4.74
Diaz, Cameron
35
3.80
1973
McKidd, Kevin
17
4.11
Wiig, Kristen
22
4.92
1974
Ribisi, Giovanni
37
3.56
Banks, Elizabeth
33
2.40
1975
Sabongui, Patrick
21
2.98
Jolie, Angelina
34
6.34
1976
Reynolds, Ryan
26
5.38
Witherspoon, Reese
29
2.66
1977
Bloom, Orlando
17
3.64
Wahlgren, Kari
9
2.64
1978
Hader, Bill
31
4.38
Mendez, Denny
4
3.25
1979
Carew, John
1
3.29
Hudson, Kate
25
2.68
1980
Tatum, Channing
31
2.86
Aguilera, Christina
5
2.71
1981
Miller, T.J.
16
4.86
Solomon, Christine
4
3.81
1982
Rogen, Seth
34
5.21
Dunst, Kirsten
36
2.57
1983
Zelocchi, Enzo
10
23.59
Ebanks, Selita
3
3.72
1984
Curry, Graham
16
2.17
Johansson, Scarlett
39
4.77
1985
Duke, Clark
10
2.84
Kendrick, Anna
22
2.57
1986
LaBeouf, Shia
20
2.52
Kanda, Sayaka
3
1.66
1987
Felton, Tom
14
2.53
Liu, Yifei
5
4.76
1988
Integra, Raiden
15
1.97
Stone, Emma
16
3.43
1989
Mintz-Plasse, Christopher
19
4.82
Panettiere, Hayden
13
1.67
1990
Jefferies, Marc John
16
2.01
Stewart, Kristen
24
3.15
1991
Keynes, Skandar
3
1.39
Jackson, Amy
2
1.77
1992
Hale, Alex
4
45.86
Olszanska, Michalina
3
3.88
1993
Bright, Cameron
16
1.66
Robb, AnnaSophia
11
0.99
1994
Jones, Tyler Patrick
7
1.59
Fanning, Dakota
21
2.53
1995
Majors, Austin
6
0.94
Hanratty, Sammi
14
1.44
Dominant actors over time
Let us consider this budget calculation another way: compute the pseudo-salary for each
actor and actress as a function of production year of the film,
then compute a 9 year rolling mean for each actor and actress, and
plot the dominant players over time.
library(RMySQL)
library(dplyr)
library(knitr)
imcon<- src_mysql(host='0.0.0.0',user='moe',password='movies4me',dbname='IMDB',port=23306)
capt<- dbGetQuery(imcon$con,'SET NAMES utf8')
# values over 300M are likely bogusbudgets<- tbl(imcon,'movie_budgets') %>%
filter(units=='$') %>%
select(movie_id,amount) %>%
filter(amount<3e8)
role_info<- tbl(imcon,'role_type') %>%
filter(role%in% c('actor','actress')) %>%
select(role_id)
movie_yr<- tbl(imcon,'title') %>%
select(movie_id,production_year) %>%
rename(year=production_year)
aggbudget<- tbl(imcon,'cast_info') %>%
inner_join(role_info,by='role_id') %>%
select(movie_id,person_id,role_id,nr_order) %>%
inner_join(budgets,by='movie_id') %>%
inner_join(movie_yr,by='movie_id') %>%
collect()
# fill with maxfix1<-aggbudget %>%
group_by(movie_id) %>%
mutate(nr_order=pmin(nr_order,max(nr_order,na.rm=TRUE),na.rm=TRUE)) %>%
ungroup()
# replace with 1fix2<-fix1 %>%
mutate(nr_order=pmin(nr_order,1,na.rm=TRUE))
# rescalefix3<-fix2 %>%
mutate(scale_order=1/ (nr_order)) %>%
group_by(movie_id) %>%
mutate(scale_order=scale_order/ sum(scale_order,na.rm=FALSE)) %>%
ungroup()
# multiply by budget, divide by 7, aggregate over people, years, summingbbynameyr<-fix3 %>%
mutate(salary=(1.0/7.0) *scale_order*amount) %>%
group_by(person_id,year) %>%
summarize(totsalary=sum(salary),
totroles=n()) %>%
ungroup()
# this is godawful slow and terrible: compute salary on a rolling windowpyrs<-1945:2016rollsum<-function(yrnum,tots,window=9) {
yrv<- rep(0,length(pyrs))
isin<-pyrs%in%yrnumyrv[isin] <-tots[yrnum%in%pyrs]
cyrv<- cumsum(yrv)
rv<-data.frame(yr=pyrs[window:length(pyrs)],dsum=diff(cyrv,lag=window-1))
rv[rv$dsum>0,]
}
aggbyyr<-bbynameyr %>%
filter(totsalary>1000) %>%
arrange(year) %>%
group_by(person_id) %>%
do(rollsum(.$year,.$totsalary)) %>%
ungroup() %>%
filter(dsum>0)
Kval<-1# select anyone who has been in the top K in any year, by gendertopK<-aggbyyr %>%
inner_join(naminfo %>% select(person_id,gender),by='person_id') %>%
arrange(desc(dsum)) %>%
group_by(yr,gender) %>%
summarize(cutoff=nth(dsum,Kval)) %>%
ungroup()
intopK<-aggbyyr %>%
inner_join(naminfo %>% select(person_id,gender),by='person_id') %>%
inner_join(topK,by=c('yr','gender')) %>%
filter(dsum>=cutoff) %>%
distinct(person_id)
# now get the top topdogs<-aggbyyr %>%
inner_join(naminfo %>% select(person_id,name,dob,gender),by='person_id') %>%
inner_join(intopK %>% select(person_id),by='person_id')
topdogs %>%
group_by(yr,gender) %>%
arrange(desc(dsum)) %>%
summarize(tname=first(name)) %>%
ungroup() %>%
rename(name=tname) %>%
kable()