# First stab at sorting actors by average movie gross

We use [requests](http://www.python-requests.org/en/latest/) and [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/) to scrape the actor rankings from [BoxOfficeMojo](http://boxofficemojo.com). We then load the top 50 grossing actors into a [pandas](http://pandas.pydata.org/) DataFrame and sort it by average gross per movie rather than by total gross. Finally, we use [pickle](https://docs.python.org/2/library/pickle.html) to save the DataFrame and the raw page HTML locally.

In [16]:
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
import pickle

In [5]:
page1 = requests.get('http://www.boxofficemojo.com/people/?view=Actor&sort=sumgross&p=.htm')
page1.status_code

200

In [6]:
page1text = page1.text

In [7]:
soup1 = bs(page1text)

In [8]:
print soup1.prettify()

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">
 <head>
  <title>
   Box Office Mojo - People Index
  </title>
  <link charset="utf-8" href="/css/mojo.css?1" media="screen" rel="stylesheet" title="no title" type="text/css"/>
  <link charset="utf-8" href="/css/mojo.css?1" media="print" rel="stylesheet" title="no title" type="text/css"/>
 </head>
 <body>
  <iframe frameborder="0" height="1" id="sis_pixel_sitewide" marginheight="0" marginwidth="0" style="display: none;" width="1">
  </iframe>
  <script>
   setTimeout(function(){
        try{
            //sis3.0 pixel
            var cacheBust = Math.random() * 10000000000000000,
                url_sis3 = 'http://s.amazon-adsystem.com/iu3?',
                params_sis3 = [
                    "d=boxofficemojo.com",
                    "cb=" + cacheBust
                ];

            (document.getElementById('sis_pixel_sitewide')).src = url_sis3 + params_sis3.join('&

In [9]:
my_table1 = soup1.find_all('table')[3].contents
print my_table1[0].prettify()

<tr bgcolor="#dcdcdc">
 <td align="center">
  <font size="2">
   Row
  </font>
 </td>
 <td align="center">
  <font size="2">
   <a href="/people/?view=Actor&amp;sort=person&amp;order=ASC&amp;p=.htm">
    Person
    <br/>
    (Click to view)
   </a>
  </font>
 </td>
 <td align="center">
  <font size="2">
   <a href="/people/?view=Actor&amp;sort=sumgross&amp;order=ASC&amp;p=.htm">
    <b>
     Total Gross
    </b>
   </a>
  </font>
 </td>
 <td align="center" colspan="2">
  <font size="2">
   <a href="/people/?view=Actor&amp;sort=nummovies&amp;order=DESC&amp;p=.htm">
    # Movies
   </a>
   /
  </font>
  <font size="2">
   <a href="/people/?view=Actor&amp;sort=avggross&amp;order=DESC&amp;p=.htm">
    Average
   </a>
  </font>
 </td>
 <td align="center">
  <font size="2">
   <a href="/people/?view=Actor&amp;sort=title&amp;order=ASC&amp;p=.htm">
    #1 Picture
   </a>
  </font>
 </td>
 <td align="center">
  <font size="2">
   <a href="/people/?view=Actor&amp;sort=gross&amp;order=DESC&amp;p=

In [10]:
for cell in my_table1[0].find_all('td'):
    for string in cell.strings:
        print string,
    print '\n'
    
print len(my_table1)

Row 

Person (Click to view) 

Total Gross 

# Movies  /  Average 

#1 Picture 

Gross 

101


In [11]:
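def strip_dollar(s):
    """Convert a dollar string such as '$1,234.5' to a float; return any other string unchanged."""
    # startswith() also copes with an empty string, where s[0] would raise IndexError.
    if s.startswith('$'):
        return float(s[1:].replace(',', ''))
    return s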
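
A quick check of what the helper does may be useful here. The sketch below repeats the same logic so that it runs on its own; the input strings are assumed formats (the `'$2,681.8'` style is inferred from the numbers in the final table, not copied from the page). Note that a plain count such as `'14'` has no dollar sign, so it comes back as a string rather than a number.

```python
def strip_dollar(s):
    # Same logic as the helper above, repeated so this block runs on its own.
    if s[0] == '$':
        return float(s[1:].replace(',', ''))
    return s

# Assumed input formats, modelled on the values in the final table.
total = strip_dollar('$2,681.8')    # dollar amount: converted to a float
name = strip_dollar('Emma Watson')  # ordinary text: returned unchanged
count = strip_dollar('14')          # no dollar sign: stays a string

print(total)
print(name)
print(repr(count))
```

If the movie counts are needed as numbers, the column would have to be converted separately after the DataFrame is built.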

In [12]:
# Collect the text of every cell in each data row (my_table1[0] is the header row).
rows = []
for child in my_table1[1:]:
    if child == u'\n':
        continue  # skip the newline strings that sit between <tr> tags
    row = []
    for cell in child.find_all('td'):
        for string in cell.strings:
            row.append(strip_dollar(str(string)))
    rows.append(row)
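
The extraction loop above can be tried out without hitting the site. The following is a minimal sketch on a hypothetical three-column HTML snippet (the markup and row numbers are made up for illustration; only the names and totals echo the final table). It assumes the `html.parser` backend and uses `isinstance(child, Tag)` in place of the comparison with `u'\n'`, which is one possible way to skip the whitespace strings between rows.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical snippet: a small table with newlines between the rows.
html = (
    "<table>\n"
    "<tr><td>3</td><td>Tom Hanks</td><td>$4,264.2</td></tr>\n"
    "<tr><td>25</td><td>Emma Watson</td><td>$2,681.8</td></tr>\n"
    "</table>"
)

soup = BeautifulSoup(html, 'html.parser')
children = soup.find('table').contents  # <tr> tags interleaved with '\n' strings

demo_rows = []
for child in children:
    if not isinstance(child, Tag):
        continue  # whitespace-only string between rows
    # One stripped text value per cell keeps the columns aligned.
    demo_rows.append([td.get_text(strip=True) for td in child.find_all('td')])

print(len(children))
print(demo_rows)
```

Taking one value per `<td>` differs slightly from iterating over `cell.strings`, which can yield several strings for a cell that contains nested tags.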

In [13]:
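# Drop the leading "Row" column from every scraped row, then build the DataFrame.
table = pd.DataFrame([row[1:] for row in rows], columns=['actor', 'total_gross', 'num_movies', 'av_gross', 'highest_movie', 'movie_gross'])
# Sort by average gross per movie, highest first (sort_values replaces the removed DataFrame.sort).
table.sort_values(by='av_gross', ascending=False, inplace=True)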
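
To illustrate why the sort key matters, here is a small sketch that ranks three actors both ways. The figures are copied from the output table in this notebook; the variable names are illustrative, `num_movies` is entered as an integer for simplicity, and the code assumes a pandas version that provides `sort_values`.

```python
import pandas as pd

# Three rows copied from the notebook's output table.
demo = pd.DataFrame(
    [
        ['Tom Hanks', 4264.2, 42, 101.5],
        ['Emma Watson', 2681.8, 14, 191.6],
        ['Will Smith', 2814.3, 22, 127.9],
    ],
    columns=['actor', 'total_gross', 'num_movies', 'av_gross'],
)

# Ranking by total gross favours actors with long filmographies.
by_total = list(demo.sort_values(by='total_gross', ascending=False)['actor'])
# Ranking by average gross favours actors with fewer, bigger films.
by_average = list(demo.sort_values(by='av_gross', ascending=False)['actor'])

print(by_total)
print(by_average)
```

With these three rows the order reverses completely: Tom Hanks leads on total gross but comes last on average gross per movie.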

In [14]:
table

|    | actor | total_gross | num_movies | av_gross | highest_movie | movie_gross |
|---:|:------|------------:|-----------:|---------:|:--------------|------------:|
| 24 | Emma Watson | 2681.8 | 14 | 191.6 | Harry Potter / Deathly Hallows (P2) | 381.0 |
| 36 | Daniel Radcliffe | 2449.2 | 13 | 188.4 | Harry Potter / Deathly Hallows (P2) | 381.0 |
| 43 | Rupert Grint | 2390.5 | 13 | 183.9 | Harry Potter / Deathly Hallows (P2) | 381.0 |
| 17 | Orlando Bloom | 2815.8 | 17 | 165.6 | Dead Man's Chest | 423.3 |
| 18 | Will Smith | 2814.3 | 22 | 127.9 | Independence Day | 306.2 |
| 32 | Bradley Cooper | 2487.6 | 23 | 108.2 | American Sniper | 350.1 |
| 47 | Steve Carell | 2291.6 | 22 | 104.2 | Despicable Me 2 | 368.1 |
| 12 | Ian McKellen | 3150.3 | 31 | 101.6 | Return of the King | 377.8 |
| 2 | Tom Hanks | 4264.2 | 42 | 101.5 | Toy Story 3 | 415.0 |
| 4 | Eddie Murphy | 3810.4 | 38 | 100.3 | Shrek 2 | 441.2 |


In [18]:
# Pickle data is binary, so open the file in 'wb' mode rather than text mode.
with open('top50.pkl', 'wb') as picklefile:
    pickle.dump([page1text, table], picklefile)
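
Reading the data back follows the same pattern in reverse. The sketch below is a self-contained round trip using placeholder values and a temporary file (the file name `top50_demo.pkl` and both sample objects are illustrative stand-ins for `page1text` and `table`); it assumes the file is opened in binary mode for both writing and reading.

```python
import os
import pickle
import tempfile

# Placeholder stand-ins for the page HTML and the scraped table.
page_html = '<html><body>placeholder page</body></html>'
sample_rows = [['Emma Watson', 2681.8], ['Tom Hanks', 4264.2]]

# Write to a temporary directory so the demo leaves top50.pkl untouched.
path = os.path.join(tempfile.mkdtemp(), 'top50_demo.pkl')

with open(path, 'wb') as picklefile:
    pickle.dump([page_html, sample_rows], picklefile)

# Load the list back and unpack it in the order it was saved.
with open(path, 'rb') as picklefile:
    loaded_html, loaded_rows = pickle.load(picklefile)

print(loaded_html == page_html)
print(loaded_rows == sample_rows)
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.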