Project Luther
Week 2 & Week 3



Back story:
Using information we scrape from the web, can we build linear regression models from which we can learn about the movie industry?

Data:
* acquisition: web scraping
* storage: flat files
* sources: boxofficemojo.com, any other publicly available information

Skills:
* basics of the web (requests, HTML)
* web scraping
* numpy and pandas
* statsmodels, scikit-learn

Analysis:
* linear regression

Deliverable/communication:
* organized project repository
* slide presentation
* visual and oral communication in presentations
* write-up of process and results

Design:
* iterative design process
* scoping
* "MVP"s and building outward

More information:
* We'll learn about web scraping using two popular tools - BeautifulSoup and Selenium. You'll have to know the very basics of HTML. We'll also be evolving the way we use IPython notebooks—during this project we'll begin to use the notebook as a development scratchpad, where we test things out through interactive scripting, but then solidify our work in python modules with reusable functions and classes.

* We'll practice using linear regression. We'll have a first taste of feature selection, this time based on our intuition and some trial and error, and we'll build and refine our models.

* We'll work in groups for brainstorming and design, and code sharing will be highly encouraged, but the final projects will be individual.

* This project will really give you the freedom to challenge yourself, no matter your skill level. Find your boundaries, meet them, and push them a little further.

* We are very excited to see what you will learn and do for Project Luther!

In [62]:
# pseudocode and outline

# 1) webscrape boxoffice mojo to gather data
# 2) random and fixed effects probit models by actor, using actors as dummy variables

In [405]:
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import random
from datetime import datetime
from __future__ import division
import requests
import re
from bs4 import BeautifulSoup
import time
from sklearn import cross_validation, datasets, tree, linear_model, grid_search
from sklearn.tree import export_graphviz, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier,  RandomForestRegressor,  ExtraTreesClassifier
from sklearn.cross_validation import KFold,StratifiedKFold

STAR WARS


OK, we've seen how to scrape individual pieces of data (gross income, genre, actors, directors, title) from a single movie page.

Now I'm turning these into a set of helper functions that can be applied to any given movie.

In [2]:
def title(soup):
    try:
        title = soup.body.find_all('td')[2].find_all('td')[1].find_all('b')[0]
        title_string = str(title)
        title_string = title_string.replace('<b>','').replace('</b>','').replace('<br/>',' ')
        return title_string
    except:
        return None

def director(soup):
    person_regex = r'([A-Z].?[A-Z]?.?[^A-Z]* [A-Z][^A-Z]*?)<'
    try:
        director = soup.find(text=re.compile('Director'))
        director = director.findParent().findParent().findParent().findNextSibling()
        director_string = str(director)
        return re.findall(person_regex, director_string)
    except:
        return ['None']

def actors(soup):
    person_regex = r'([A-Z].?[A-Z]?.?[^A-Z]* [A-Z][^A-Z]*?)<'
    try:
        actors = soup.find(text=re.compile('Actor'))
        actors = actors.findParent().findParent().findParent().findNextSibling()
        actor_string = str(actors)
        return re.findall(person_regex, actor_string)
    except:
        return ['None']

def genre(soup):
    try:
        genre = soup.find(text=re.compile('Genre: '))
        genre = genre.findNextSibling().text
        return genre
    except:
        return 'None'
    
def dom_gross(soup):
    try:
        dom_string = soup.find(text=re.compile('Domestic:'))
        dtg = dom_string.findParent().findParent().findNextSibling().text
        dtg = dtg.replace('$','').replace(',','').replace('/','')
        dtg_num = int(dtg)
        return dtg_num
    except:
        return None 

def for_gross(soup):
    try:
        foreign_string = soup.find(text=re.compile('Foreign:'))
        fg = foreign_string.findParent().findParent().findNextSibling().text
        fg = fg.replace('$','').replace(',','').replace('/','')
        fg_num = int(fg)
        return fg_num
    except:
        return None

def get_year(soup):
    date_regex = r'(\w+)</a>'
    try:
        date_string = soup.find(text=re.compile('Release Date:'))
        date = date_string.findNextSibling().findChild().findChild()
        date = str(date)
        date = re.findall(date_regex, date)
        year = int(date[0])
        return year
    except:
        return None

def budget(soup):
    try: 
        budget = soup.find(text=re.compile('Production Budget: '))
        budget = budget.findNextSibling().text
        return budget
    except:
        return 'None'
    
# gathers every scraped field (budget included) for one movie page
def single_movie_dict(soup):
    return {'Title': title(soup), 'Genre': genre(soup),
            'Actors': actors(soup), 'Director': director(soup),
            'Domestic Gross': dom_gross(soup), 'Foreign Gross': for_gross(soup),
            'Year': get_year(soup), 'Budget': budget(soup)}
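
A quick check of `person_regex` on a made-up HTML fragment suggests a limitation worth knowing before modelling by actor: the pattern captures two capitalized words, so a three-part name loses its first word. The fragment below is hypothetical and is not copied from Box Office Mojo.

```python
import re

# Same pattern as in director() and actors().
person_regex = r'([A-Z].?[A-Z]?.?[^A-Z]* [A-Z][^A-Z]*?)<'

# Hypothetical fragment, shaped like a run of linked names.
sample = '<a>Jeff Bridges</a><a>J.J. Abrams</a><a>Philip Seymour Hoffman</a>'

names = re.findall(person_regex, sample)
print(names)
```

The third name comes back as 'Seymour Hoffman', which matches the truncated names visible in the scraped output (for example 'Lee Jones' and 'Earl Jones'). Any per-actor counts built on these strings inherit that truncation.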

Testing across an arbitrary group of movies from different years

In [3]:
#single_movie_dict testing

url = ['http://boxofficemojo.com/movies/?id=biglebowski.htm', 'http://boxofficemojo.com/movies/?id=starwars7.htm', 'http://www.boxofficemojo.com/movies/?id=avatar.htm', 'http://www.boxofficemojo.com/movies/?id=batmanrobin.htm', 'http://www.boxofficemojo.com/movies/?id=ipman3.htm']

for i in range(len(url)):
    response = requests.get(url[i])
    page = response.text
    soup = BeautifulSoup(page)
    print single_movie_dict(soup)



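UserWarning: No parser was explicitly specified, so BeautifulSoup used the best available HTML parser for this system ("lxml"). To get rid of this warning, change BeautifulSoup([your markup]) to BeautifulSoup([your markup], "lxml").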


{'Director': ['Joel Coen'], 'Foreign Gross': None, 'Actors': ['Jeff Bridges', 'John Goodman', 'Julianne Moore', 'Steve Buscemi', 'Seymour Hoffman', 'Tara Reid', 'Sam Elliott*'], 'Domestic Gross': 17451873, 'Title': 'The Big Lebowski', 'Genre': u'Crime Comedy', 'Year': 1998, 'Budget': u'N/A'}
{'Director': ['J.J. Abrams'], 'Foreign Gross': 1131561399, 'Actors': ['John Boyega', 'Daisy Ridley', 'Adam Driver', 'Oscar Isaac', 'Andy Serkis', 'Domhnall Gleeson', 'Max von Sydow', 'Harrison Ford', 'Carrie Fisher', 'Mark Hamill', 'Anthony Daniels', 'Peter Mayhew', 'Kenny Baker', 'Warwick Davis', "Lupita Nyong'o"], 'Domestic Gross': 936662225, 'Title': 'Star Wars: The Force Awakens', 'Genre': u'Sci-Fi Fantasy', 'Year': 2015, 'Budget': u'$245 million'}
{'Director': ['James Cameron'], 'Foreign Gross': 2027457462, 'Actors': ['Sam Worthington', 'Zoe Saldana', 'Sigourney Weaver', 'Michelle Rodriguez', 'Giovanni Ribisi', 'David Moore'], 'Domestic Gross': 760507625, 'Title': 'Avatar', 'Genre': u'Sci-Fi A

Now we need to compile a list of URLs from a separate scrape.
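
The link-extraction step can be tried in isolation before hitting the site. This is a minimal sketch on a hypothetical HTML fragment (the studio link and the anchor text are invented); the pattern here escapes the dot before `htm` so it only matches a literal period.

```python
import re

# Hypothetical fragment shaped like the yearly chart's link list.
sample_html = (
    '<a href="/movies/?id=starwars7.htm">Star Wars: The Force Awakens</a>'
    '<a href="/studio/chart/?studio=buenavista.htm">BV</a>'
    '<a href="/movies/?id=avatar.htm">Avatar</a>'
)

# \W stands in for the literal "?" before id=.
movies_regex = r'a href="/movies/\Wid=(\w+\.htm)'

movie_ids = re.findall(movies_regex, sample_html)
url_base = 'http://boxofficemojo.com/movies/?id='
movie_urls = [url_base + movie_id for movie_id in movie_ids]
print(movie_urls)
```

Only the two `/movies/` links survive; the studio link is skipped because it does not match the path in the pattern.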

In [4]:
# compiler functions to scrape movie urls, first (up to) 200 per year

def top_100(year):
    url = 'http://www.boxofficemojo.com/yearly/chart/?yr='+str(year)+'&p=.htm'
    response = requests.get(url)
    page = response.text
    soup = BeautifulSoup(page)
    # the movie links are read from the fourth table on the page
    movies = soup.find_all("table")[3].find_all('a')
    # \W matches the "?" before id=; the dot in ".htm" is escaped
    movies_regex = r'a href="/movies/\Wid=(\w+\.htm)'
    url_list = re.findall(movies_regex, str(movies))
    return url_list

def second_100(year):
    url = 'http://www.boxofficemojo.com/yearly/chart/?page=2&view=releasedate&view2=domestic&yr='+str(year)+'&p=.htm'
    response = requests.get(url)
    page = response.text
    soup = BeautifulSoup(page)
    # same table and pattern as top_100, applied to page 2 of the chart
    movies = soup.find_all("table")[3].find_all('a')
    movies_regex = r'a href="/movies/\Wid=(\w+\.htm)'
    url_list = re.findall(movies_regex, str(movies))
    return url_list


top_100(2016)
# second_100(1980)

['pixar2015.htm',
 'marvel2016.htm',
 'illumination2015.htm',
 'junglebook2015.htm',
 'deadpool2016.htm',
 'disney2016.htm',
 'superman2015.htm',
 'dc2016.htm',
 'bourne5.htm',
 'startrek2016.htm',
 'kungfupanda3.htm',
 'ghostbusters2016.htm',
 'centralintelligence.htm',
 'tarzan2016.htm',
 'untitledlucasmoore.htm',
 'angrybirds.htm',
 'sully.htm',
 'id42.htm',
 'conjuring2.htm',
 'sausageparty.htm',
 'ridealong2.htm',
 'dontbreathe.htm',
 'tmnt2016.htm',
 'purge3.htm',
 'alice2.htm',
 'petesdragon2016.htm',
 'badrobot2016.htm',
 'newline0116.htm',
 'allegiant.htm',
 'nowyouseeme2.htm',
 'iceage5.htm',
 'michelledarnell.htm',
 'themagnificentseven.htm',
 'londonhasfallen.htm',
 'miraclesfromheaven.htm',
 'mybigfatgreekwedding2.htm',
 'mebeforeyou.htm',
 'bfg.htm',
 'universalcomedy2016.htm',
 'theshallows.htm',
 'barbershop3.htm',
 '13hoursthesecretsoldiersofbenghazi.htm',
 'huntsman.htm',
 'warcraft.htm',
 'howtobesingle.htm',
 'kuboandthetwostrings.htm',
 'mikeanddave.htm',
 'armsand

**SIDE NOTE THAT IS VERY IMPORTANT** If we want to use actors or directors as features across the whole movie set, we're going to need a special kind of sampling later in the analysis. This is called stratified sampling: the data is split so that each group shows up in the training and test sets in similar proportions.
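
To make the idea concrete, here is a minimal sketch of a stratified split using only the standard library. It assumes one label per row (for example, a single lead actor), which is a simplification: the real data has several actors per movie. The function name and the toy labels are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx) with each label split in roughly the same ratio."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_label):
        members = by_label[label]
        rng.shuffle(members)
        n_test = int(round(len(members) * test_fraction))
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical strata: the lead actor of each of 20 movies.
labels = ['Ford'] * 10 + ['Streep'] * 10
train_idx, test_idx = stratified_split(labels)
print('train: %d, test: %d' % (len(train_idx), len(test_idx)))
```

Each label contributes the same share of its rows to the test set, so no actor ends up only in the training rows or only in the test rows.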

In [5]:
# function to add all the movies in a year to a list
def movies_in_year(year):
    year_list = []
    url_base = 'http://boxofficemojo.com/movies/?id='
    # first (up to) 100 movies, then the second (up to) 100
    for movie_id in top_100(year) + second_100(year):
        response = requests.get(url_base + movie_id)
        page = response.text
        soup = BeautifulSoup(page)
        year_list.append(single_movie_dict(soup))
        time.sleep(.1)  # short pause between requests
    return year_list

In [6]:
year = 1990
x = movies_in_year(year)
print year
print "# of movies made:"
print len(x)

1990
# of movies made:
199


THE BIG COMPILER

In [79]:
# Fails due to the sheer number of requests being sent at once
# Instead of using this construction we will break things down into subgroups

# all_movies=[]
# for year in range (1980,2017):
#     print year
#     count +=1
#     all_movies = all_movies + movies_in_year(year)
# print len(all_movies)

Breaking it into slightly smaller chunks to avoid the chunked-encoding error
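
An alternative to fixed ten-minute pauses is to retry a failed chunk with a growing delay. This is a sketch under stated assumptions, not what the notebook ran: `fetch_with_retries` and `flaky_fetch` are hypothetical names, and `flaky_fetch` stands in for `movies_in_year` so the example runs without network access.

```python
import time

def fetch_with_retries(fetch, arg, max_tries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(arg); on failure wait base_delay, 2*base_delay, ... and retry."""
    for attempt in range(max_tries):
        try:
            return fetch(arg)
        except Exception:
            if attempt == max_tries - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)

# Stand-in for movies_in_year: fails twice, then succeeds.
calls = {'n': 0}
def flaky_fetch(year):
    calls['n'] += 1
    if calls['n'] < 3:
        raise IOError('simulated connection error')
    return ['movie from %d' % year]

waits = []  # record the requested delays instead of actually sleeping
result = fetch_with_retries(flaky_fetch, 1980, sleep=waits.append)
print(result)
print(waits)
```

In real use it would be narrower to catch `requests.exceptions.RequestException` rather than `Exception`, and to leave `sleep` at its default.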



In [15]:
all_movies=[]

In [14]:
for year in range (1980,1990):
    print year
    all_movies = all_movies + movies_in_year(year)

1980


KeyboardInterrupt: 

In [16]:
for year in range (1980,1990):
    print year
    all_movies = all_movies + movies_in_year(year)
time.sleep(600)
for year in range (1990,2000):
    print year
    all_movies = all_movies + movies_in_year(year)
time.sleep(600)
for year in range (2000,2010):
    print year
    all_movies = all_movies + movies_in_year(year)
time.sleep(600)
for year in range (2010,2017):
    print year
    all_movies = all_movies + movies_in_year(year)
print len(all_movies)

1980
1981
1982
1983
1984
1985
1986
1987
1988
1989
1990
1991
1992
1993
1994
1995
1996
1997
1998
1999
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
7040



In [17]:
all_movies[0]

{'Actors': ['Kenny Baker',
  'Anthony Daniels',
  'Peter Mayhew',
  'Dee Williams',
  'Mark Hamill',
  'Harrison Ford',
  'Carrie Fisher'],
 'Budget': u'$18 million',
 'Director': ['Irvin Kershner'],
 'Domestic Gross': 290475067,
 'Foreign Gross': 247900000,
 'Genre': u'Sci-Fi Fantasy',
 'Title': 'The Empire Strikes Back',
 'Year': None}

In [18]:
len(all_movies)

7040

In [19]:
df = pd.DataFrame(all_movies)
df[100:200]

Unnamed: 0,Actors,Budget,Director,Domestic Gross,Foreign Gross,Genre,Title,Year
100,[None],,[Mike Nichols],2261507.0,,Unknown,Gilda Live,
101,[None],,[None],2128395.0,,Unknown,Windows,
102,[None],,[None],2086905.0,,Romantic Comedy,Just Tell Me What You Want,
103,[None],,[None],2013193.0,,Unknown,Bon Voyage Charlie Brown,
104,[Jodie Foster],,[None],1817720.0,,Unknown,Carny,
105,[None],,[None],21411158.0,,Horror Thriller,When a Stranger Calls (Re-issue),
106,[None],,[None],1175855.0,,Unknown,Why Would I Lie?,
107,[None],,[None],1047454.0,,Unknown,Nijinsky,
108,[None],,[None],954046.0,,Unknown,Heart Beat,
109,"[Paul Simon, Blair Brown, Rip Torn]",,[M. Young],843215.0,,Unknown,One-Trick Pony,


In [12]:
df[:100]

Unnamed: 0,Actors,Budget,Director,Domestic Gross,Foreign Gross,Genre,Title,Year
0,"[Kenny Baker, Anthony Daniels, Peter Mayhew, D...",$18 million,[Irvin Kershner],290475067.0,247900000.0,Sci-Fi Fantasy,The Empire Strikes Back,
1,[Lily Tomlin],,[None],103290500.0,,Comedy,9 to 5,
2,[T. Nelson*],,[Sidney Poitier],101300000.0,,Comedy,Stir Crazy,
3,[Leslie Nielsen],$3.5 million,"[Jim Abrahams, David Zucker, Jerry Zucker]",83453539.0,,Comedy,Airplane!,
4,[Clint Eastwood],,[None],70687344.0,,Action Comedy,Any Which Way You Can,
5,"[Albert Brooks*, Goldie Hawn, T. Nelson*]",,[None],69847348.0,,Comedy,Private Benjamin,
6,"[Sissy Spacek, Lee Jones]",,[Michael Apted],67182787.0,,Music Drama,Coal Miner's Daughter,
7,"[Burt Reynolds, Sally Field]",,[None],66132626.0,,Action Comedy,Smokey and the Bandit II,
8,[None],,[None],58853106.0,,Romance,The Blue Lagoon,
9,"[John Belushi, Dan Aykroyd, Carrie Fisher*]",,[John Landis],57229890.0,58000000.0,Comedy,The Blues Brothers,


In [294]:
# testing how to access a single actor's movies

for i in range(len(df)):
    if 'Jason Statham' in df['Actors'][i]:
        print df['Title'][i] + " " + str(df['Domestic Gross'][i])

Lock, Stock and Two Smoking Barrels 3753929.0
Snatch 30328156.0
Turn It Up 1247949.0
The One 43905746.0
John Carpenter's Ghosts of Mars 8709640.0
The Transporter 25296447.0
The Italian Job 106128601.0
Collateral 101005703.0
Cellular 32003620.0
Transporter 2 43095856.0
Crank 27838408.0
WAR 22486409.0
Death Race 36316032.0
Transporter 3 31715062.0
The Bank Job 30060660.0
In the Name of the King: A Dungeon Siege Tale 4775656.0
Crank: High Voltage 13684249.0
The Expendables 103068524.0
Gnomeo and Juliet 99967670.0
The Mechanic 29121498.0
Killer Elite 25124966.0
The Expendables 2 85028192.0
Safe (2012) 17142080.0
Homefront 20158898.0
Parker 17616641.0
The Expendables 3 39322544.0
Furious 7 353007020.0
Spy 110825712.0
Mechanic: Resurrection 20866493.0


In [20]:
# fixing NaN entries for 1980
# .loc assigns in place and avoids the chained-assignment warning;
# label slices are inclusive, so 0:115 covers rows 0 through 115
df.loc[0:115, 'Year'] = 1980


In [21]:
# rows 116 through 226 are the 1981 releases
df.loc[116:226, 'Year'] = 1981
df[115:230]


Unnamed: 0,Actors,Budget,Director,Domestic Gross,Foreign Gross,Genre,Title,Year
115,"[Phil Daniels, Jon Finch, Jonathan Pryce]",,[Brian Gibson],2471.0,,Music Drama,Breaking Glass,1980.0
116,"[Harrison Ford, Alfred Molina*]",$18 million,[Steven Spielberg],248159971.0,141766000.0,Period Adventure,Raiders of the Lost Ark,1981.0
117,"[Katharine Hepburn, Henry Fonda, Jane Fonda, D...",,[Mark Rydell],119285432.0,,Drama,On Golden Pond,1981.0
118,[Christopher Reeve],$54 million,[None],108185706.0,,Action / Adventure,Superman II,1981.0
119,[None],,[None],95461682.0,,Romantic Comedy,Arthur,1981.0
120,[Bill Murray],,[Ivan Reitman],85297000.0,,Comedy,Stripes,1981.0
121,"[Burt Reynolds, Roger Moore, Jackie Chan*]",,[None],72179579.0,,Comedy,The Cannonball Run,1981.0
122,[Ian Holm],,[None],58972904.0,,Sports Drama,Chariots of Fire,1981.0
123,[Roger Moore],,[None],54812802.0,,Action,For Your Eyes Only,1981.0
124,[Alan Alda],,[None],50427646.0,,Comedy,The Four Seasons,1981.0



In [23]:
df.to_pickle('df_with_prod_values.pkl')

### WE WILL NOW PLAY IN A NEW DATAFRAME TO KEEP THIS ONE SACROSANCT

In [503]:
ndf = df.copy()
len(ndf)

7040

In [39]:
# checking how many movies have a budget and actors
count = 0

for i in range(len(ndf)):
    if ndf['Budget'][i] != 'N/A' and ndf['Actors'][i] != ['None']:
        count +=1
print count

2429


# Dropping all movies from the dataset with no actors or budget info to avoid 'lack of data' bias.

### Acknowledge that this biases the dataset towards larger budget, hype movies and away from "indie arthouse" style movies.

Removing actors also removes special editions and re-releases.
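
The same three drop rules can also be written as boolean masks. The sketch below assumes pandas and numpy are available and uses a hypothetical four-row frame in place of `ndf`; only the first row's values are taken from the scraped data.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row stand-in for ndf.
toy = pd.DataFrame({
    'Title': ['A', 'B', 'C', 'D'],
    'Actors': [['Jack Nicholson'], ['None'], [], ['Paul Newman']],
    'Budget': ['$19 million', '$5 million', '$1 million', 'N/A'],
    'Domestic Gross': [44360123.0, 1.0, 2.0, np.nan],
})

# One mask per rule: has real actors, has a gross figure, has a budget.
has_actors = toy['Actors'].map(lambda a: a != ['None'] and len(a) > 0)
has_gross = toy['Domestic Gross'].notnull()
has_budget = toy['Budget'] != 'N/A'

kept = toy[has_actors & has_gross & has_budget].reset_index(drop=True)
print(kept['Title'].tolist())
```

Keeping each rule as a named mask also makes it easy to count how many rows each rule removes on its own.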

In [504]:
# After three hours of trying for elegance, this falls back to an explicit loop.
count = 0
ndf_indexes_to_drop = []
for i in range(len(ndf)):
    # drop movies with no actors
    if ndf['Actors'][i] == ['None'] or ndf['Actors'][i] == []:
        ndf_indexes_to_drop.append(i)
    # drop movies with no domestic gross data (7 not covered above)
    elif np.isnan(ndf['Domestic Gross'][i]):
        ndf_indexes_to_drop.append(i)
    # drop movies with no production budget data
    elif ndf['Budget'][i] == 'N/A':
        ndf_indexes_to_drop.append(i)

# print count
# print ndf_indexes_to_drop

# reduced = ndf.drop(ndf.index[ndf_indexes_to_drop])
# reduced = reduced.reset_index(drop=True)
# print len(reduced)
# reduced[:20]

ndf.drop(ndf.index[ndf_indexes_to_drop], inplace=True)
ndf = ndf.reset_index(drop=True)
print len(ndf)
ndf[:20]

2419


Unnamed: 0,Actors,Budget,Director,Domestic Gross,Foreign Gross,Genre,Title,Year
0,"[Kenny Baker, Anthony Daniels, Peter Mayhew, D...",$18 million,[Irvin Kershner],290475067.0,247900000.0,Sci-Fi Fantasy,The Empire Strikes Back,1980.0
1,[Leslie Nielsen],$3.5 million,"[Jim Abrahams, David Zucker, Jerry Zucker]",83453539.0,,Comedy,Airplane!,1980.0
2,[Jack Nicholson],$19 million,[Stanley Kubrick],44360123.0,,Horror,The Shining,1980.0
3,[Michael Caine],$6.5 million,[De Palma],31899000.0,,Thriller,Dressed to Kill,1980.0
4,[Michael Caine],$22 million,[None],15716828.0,,Adventure,The Island (1980),1980.0
5,[Kurt Russell],$8 million,[Robert Zemeckis],11715321.0,,Comedy,Used Cars,1980.0
6,"[Woody Allen, Sharon Stone]",$10 million,[Woody Allen],10389003.0,,Comedy,Stardust Memories,1980.0
7,[Paul Newman],$20 million,[None],3763988.0,,Action Thriller,When Time Ran Out,1980.0
8,"[Jeff Bridges, Mickey Rourke, Christopher Walken]",$44 million,[Michael Cimino],3484331.0,,Western,Heaven's Gate,1980.0
9,"[Harrison Ford, Alfred Molina]",$18 million,[Steven Spielberg],248159971.0,141766000.0,Period Adventure,Raiders of the Lost Ark,1981.0


In [505]:
len(ndf)

2419

## Now to fix the formatting of some variables - actors, budget

In [506]:
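# strips the '*' that marks voice-acting entries so each actor appears under a single name

lst = ['Albert Brooks*', 'Goldie Hawn', 'T. Nelson*']
def voice_acting(lst):
    # build a new list rather than mutating the caller's list
    return [name.replace("*", '') for name in lst]
print voice_acting(lst)

# assign the result back so the cleaned names are stored in the frame
ndf['Actors'] = ndf['Actors'].map(voice_acting)
ndf['Actors']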

['Albert Brooks', 'Goldie Hawn', 'T. Nelson']


0       [Kenny Baker, Anthony Daniels, Peter Mayhew, D...
1                                        [Leslie Nielsen]
2                                        [Jack Nicholson]
3                                         [Michael Caine]
4                                         [Michael Caine]
5                                          [Kurt Russell]
6                             [Woody Allen, Sharon Stone]
7                                           [Paul Newman]
8       [Jeff Bridges, Mickey Rourke, Christopher Walken]
9                          [Harrison Ford, Alfred Molina]
10                                    [Christopher Reeve]
11      [Warren Beatty, Diane Keaton, Jack Nicholson, ...
12                                       [Drew Barrymore]
13      [Dustin Hoffman, Jessica Lange, Bill Murray, S...
14                                        [Leonard Nimoy]
15                                   [Sylvester Stallone]
16                                         [Jeff Bridges]
17            

In [507]:
# now we clean up production budgets
def budget_str_to_int(mstring):
    """Turn a budget string such as '$3.5 million' into an integer number of dollars."""
    try:
        mstring = mstring.replace('$','').replace(',','').replace(' ','')
        if 'million' in mstring:
            mstring = mstring.replace('million','')
            if '.' in mstring:
                msplit = mstring.split('.')
                # one power of ten less for each digit after the decimal point
                power = -len(msplit[1])
                return int(''.join(msplit)) * 10**(6+power)
            return int(mstring) * 10**6
        return int(float(mstring))
    except (AttributeError, ValueError):
        # non-string input or an unparseable value such as 'N/A'
        return np.nan

budget_str_to_int('$245 million')

ndf['Budget'] = ndf['Budget'].map(budget_str_to_int)

In [508]:
ndf.sort_values(['Domestic Gross'], ascending=False)

Unnamed: 0,Actors,Budget,Director,Domestic Gross,Foreign Gross,Genre,Title,Year
2251,"[John Boyega, Daisy Ridley, Adam Driver, Oscar...",245000000,[J.J. Abrams],936662225.0,1.131561e+09,Sci-Fi Fantasy,Star Wars: The Force Awakens,2015.0
260,"[Kate Winslet, Billy Zane, Kathy Bates, Bill P...",200000000,[James Cameron],658672302.0,1.528100e+09,Romance,Titanic,1997.0
2252,"[Nick Robinson, Omar Sy, Chris Pratt, Dallas H...",150000000,[Colin Trevorrow],652270625.0,1.018130e+09,Sci-Fi Horror,Jurassic World,2015.0
1925,"[Downey, Jr., Chris Hemsworth, Chris Evans, Je...",220000000,[Joss Whedon],623357910.0,8.962000e+08,Action / Adventure,Marvel's The Avengers,2012.0
1411,"[Christian Bale, Heath Ledger, Aaron Eckhart, ...",185000000,[Christopher Nolan],534858444.0,4.697000e+08,Action / Adventure,The Dark Knight,2008.0
368,"[Anthony Daniels, Liam Neeson, Natalie Portman...",115000000,[George Lucas],474544677.0,5.525000e+08,Sci-Fi Fantasy,Star Wars: Episode I - The Phantom Menace,1999.0
2253,"[Downey, Jr., Chris Hemsworth, Mark Ruffalo, C...",250000000,[Joss Whedon],459005868.0,9.464080e+08,Action / Adventure,Avengers: Age of Ultron,2015.0
1926,"[Christian Bale, Michael Caine, Anne Hathaway,...",250000000,[Christopher Nolan],448139099.0,6.368000e+08,Action Thriller,The Dark Knight Rises,2012.0
982,"[Mike Myers, Cameron Diaz, Eddie Murphy, Anton...",150000000,"[Andrew Adamson, Kelly Asbury, Conrad Vernon]",441226247.0,4.786125e+08,Animation,Shrek 2,2004.0
12,[Drew Barrymore],10500000,[Steven Spielberg],435110554.0,3.578000e+08,Family Adventure,E.T.: The Extra-Terrestrial,1982.0


In [509]:
# now we adjust budgets and domestic grosses for inflation to 2017 dollars, assuming a flat 2.5% annual rate

ndf['Adjusted Domestic']=ndf['Domestic Gross']*1.025**(2017-ndf['Year'])
ndf['Adjusted Budget']=ndf['Budget']*1.025**(2017-ndf['Year'])
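
The adjustment is plain compounding: each amount is multiplied by 1.025 once for every year between the release year and 2017. A minimal worked example follows; the function name is hypothetical, and the flat 2.5% rate is the notebook's own simplification rather than a published price index.

```python
def adjust_for_inflation(amount, year, base_year=2017, annual_rate=0.025):
    """Compound `amount` forward from `year` to `base_year` at a flat annual rate."""
    return amount * (1 + annual_rate) ** (base_year - year)

# E.T. (1982): domestic gross of 435110554 as scraped.
adjusted = adjust_for_inflation(435110554, 1982)
print(round(adjusted / 1e9, 2))
```

The result is a little over one billion dollars, in line with the Adjusted Domestic value the notebook reports for that film. A year-by-year CPI table would be more accurate than a single rate, at the cost of another data source.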

In [510]:
ndf.sort_values(['Adjusted Domestic'], ascending=False)[:10]

Unnamed: 0,Actors,Budget,Director,Domestic Gross,Foreign Gross,Genre,Title,Year,Adjusted Domestic,Adjusted Budget
260,"[Kate Winslet, Billy Zane, Kathy Bates, Bill P...",200000000,[James Cameron],658672302.0,1528100000.0,Romance,Titanic,1997.0,1079311000.0,327723300.0
12,[Drew Barrymore],10500000,[Steven Spielberg],435110554.0,357800000.0,Family Adventure,E.T.: The Extra-Terrestrial,1982.0,1032607000.0,24918650.0
2251,"[John Boyega, Daisy Ridley, Adam Driver, Oscar...",245000000,[J.J. Abrams],936662225.0,1131561000.0,Sci-Fi Fantasy,Star Wars: The Force Awakens,2015.0,984080800.0,257403100.0
152,"[Matthew Broderick, Taylor Thomas, Earl Jones ...",45000000,"[Roger Allers, Rob Minkoff]",422783777.0,545700000.0,Animation,The Lion King,1994.0,746048800.0,79407480.0
368,"[Anthony Daniels, Liam Neeson, Natalie Portman...",115000000,[George Lucas],474544677.0,552500000.0,Sci-Fi Fantasy,Star Wars: Episode I - The Phantom Menace,1999.0,740127700.0,179360800.0
132,"[Sam Neill, Jeff Goldblum, Laura Dern, Richard...",63000000,[Steven Spielberg],402453882.0,626700000.0,Sci-Fi Horror,Jurassic Park,1993.0,727928800.0,113949700.0
0,"[Kenny Baker, Anthony Daniels, Peter Mayhew, D...",18000000,[Irvin Kershner],290475067.0,247900000.0,Sci-Fi Fantasy,The Empire Strikes Back,1980.0,724255600.0,44880280.0
18,"[Tony Cox, Mark Hamill, Harrison Ford, Carrie ...",32500000,[None],309306177.0,165800000.0,Sci-Fi Fantasy,Return of the Jedi,1983.0,716143400.0,75247970.0
1925,"[Downey, Jr., Chris Hemsworth, Chris Evans, Je...",220000000,[Joss Whedon],623357910.0,896200000.0,Action / Adventure,Marvel's The Avengers,2012.0,705272300.0,248909800.0
2252,"[Nick Robinson, Omar Sy, Chris Pratt, Dallas H...",150000000,[Colin Trevorrow],652270625.0,1018130000.0,Sci-Fi Horror,Jurassic World,2015.0,685291800.0,157593800.0


# SAVE POINT

In [511]:
ndf.to_pickle('ndf.pkl')

## Building Genre Dummy Variables

In [512]:
def genre_cleanup(string):
    string = string.replace('/', '').replace('-','')
    return string

ndf['Genre'] = ndf['Genre'].map(genre_cleanup)

In [513]:
genre_dummies = ndf['Genre'].str.get_dummies(sep=' ')
genre_dummies[:10]

Unnamed: 0,Action,Adventure,Animation,Comedy,Concert,Crime,Documentary,Drama,Epic,Family,...,Musical,Period,Romance,Romantic,SciFi,Sports,Thriller,Unknown,War,Western
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
9,0,1,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
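
To see what this encoding does without pandas, here is a small standard-library sketch of the same idea: clean each genre string, split it on whitespace, and give every distinct word its own 0/1 column. The function name is hypothetical; the three genre strings appear in the scraped data.

```python
def multi_hot(genres):
    """Return (columns, rows): one 0/1 column per distinct word in `genres`."""
    cleaned = [g.replace('/', '').replace('-', '') for g in genres]
    tokens = [set(g.split()) for g in cleaned]
    columns = sorted(set().union(*tokens))
    rows = [[int(col in row) for col in columns] for row in tokens]
    return columns, rows

columns, rows = multi_hot(['Sci-Fi Fantasy', 'Action / Adventure', 'Action Thriller'])
print(columns)
for row in rows:
    print(row)
```

A movie tagged 'Action / Adventure' gets a 1 in both the Action and the Adventure columns, which is why a single film can switch on several genre dummies at once.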


## Building Actor Dummy Variables

In [514]:
# builds a table we can join later, which includes dummy variables for each actor- drops us to around 2k dummies
actor_dummies = ndf['Actors'].map(lambda a: ', '.join(a)).str.get_dummies(sep=', ')

In [271]:
print 'Affleck: '+ str(actor_dummies['Ben Affleck'].sum())
print 'Cage: '+ str(actor_dummies['Nicolas Cage'].sum())
print 'Cumberbatch: '+ str(actor_dummies['Benedict Cumberbatch'].sum())
print 'Damon: '+ str(actor_dummies['Matt Damon'].sum())
print 'Eastwood: '+ str(actor_dummies['Clint Eastwood'].sum())
print 'Ford: '+ str(actor_dummies['Harrison Ford'].sum())
print 'Freeman: '+ str(actor_dummies['Morgan Freeman'].sum())
print 'Knightley: '+ str(actor_dummies['Keira Knightley'].sum())
print 'Hanks: '+ str(actor_dummies['Tom Hanks'].sum())
print 'Johansson: '+ str(actor_dummies['Scarlett Johansson'].sum())
print 'Neeson: '+ str(actor_dummies['Liam Neeson'].sum())
print 'Oldman: '+ str(actor_dummies['Gary Oldman'].sum())
print 'Reeves: '+ str(actor_dummies['Keanu Reeves'].sum())
print 'Rickman: '+ str(actor_dummies['Alan Rickman'].sum())
print 'Saldana: ' + str(actor_dummies['Zoe Saldana'].sum())
print 'Stallone: '+ str(actor_dummies['Sylvester Stallone'].sum())
print 'Statham: '+ str(actor_dummies['Jason Statham'].sum())
print 'Streep: '+ str(actor_dummies['Meryl Streep'].sum())
print 'Snipes: '+ str(actor_dummies['Wesley Snipes'].sum())

Affleck: 26
Cage: 34
Cumberbatch: 7
Damon: 33
Eastwood: 7
Ford: 24
Freeman: 37
Knightley: 10
Hanks: 24
Johansson: 22
Neeson: 33
Oldman: 23
Reeves: 17
Rickman: 17
Saldana: 18
Stallone: 18
Statham: 17
Streep: 21
Snipes: 8


In [91]:
actor_dummies[:5]

Unnamed: 0,A. Fox,Aaron Eckhart,Aaron Johnson,Aaron Paul,Aasif Mandvi,Abbie Cornish,Abigail Breslin,"Actor&amp;id=50cent.htm"">50 Cent",Adam Beach,Adam Brody,...,Zach Galifianakis,Zach Gilford,Zachary Gordon,Zachary Levi,Zachary Quinto,Zhang Ziyi,Zoe Bell,Zoe Kazan,Zoe Saldana,Zooey Deschanel
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [515]:
# we have 1900+ actors for a mere 2400 movies, so we need to cut down to a more manageable number
# we keep only actors with at least 10 movies. we'll miss you, wesley snipes.
actor_dummies = actor_dummies.loc[:, actor_dummies.sum() >= 10]
actor_dummies[:10]

Unnamed: 0,Aaron Eckhart,Abigail Breslin,Adam Sandler,Adam Scott,Adrien Brody,Al Pacino,Alan Arkin,Alan Rickman,Alec Baldwin,Alfred Molina,...,Warwick Davis,Will Ferrell,Will Smith,Willem Dafoe,William Fichtner,Winona Ryder,Woody Harrelson,Zach Galifianakis,Zoe Saldana,Zooey Deschanel
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
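
Joining each cast list with ', ' and splitting it again on ', ' appears to break names that contain a comma themselves, such as 'Downey, Jr.'. One way around this is to count actors straight from the lists. The sketch below uses only the standard library; the function names and the toy cast lists are hypothetical.

```python
from collections import Counter

def frequent_actors(actor_lists, min_movies=10):
    """Return actors credited in at least `min_movies` of the given cast lists."""
    counts = Counter(name for cast in actor_lists for name in set(cast))
    return sorted(name for name, n in counts.items() if n >= min_movies)

def actor_row(cast, columns):
    """0/1 indicators for one movie, in the order given by `columns`."""
    present = set(cast)
    return [int(name in present) for name in columns]

# Hypothetical cast lists; the real ones would come from ndf['Actors'].
casts = [['Downey, Jr.', 'Chris Evans']] * 3 + [['Chris Evans']] + [['Tony Cox']]
columns = frequent_actors(casts, min_movies=3)
print(columns)
print(actor_row(casts[0], columns))
```

Because names are never joined into one string, 'Downey, Jr.' stays a single column, and the frequency threshold is applied before any wide table is built.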


# Now we join our two dummy tables to our main table

In [516]:
aam = pd.merge(ndf, genre_dummies, how='outer', right_index=True, left_index=True)
combined = pd.merge(aam, actor_dummies, how='outer', right_index=True, left_index=True)
combined

Unnamed: 0,Actors,Budget,Director,Domestic Gross,Foreign Gross,Genre,Title,Year,Adjusted Domestic,Adjusted Budget,...,Warwick Davis,Will Ferrell,Will Smith,Willem Dafoe,William Fichtner,Winona Ryder,Woody Harrelson,Zach Galifianakis,Zoe Saldana,Zooey Deschanel
0,"[Kenny Baker, Anthony Daniels, Peter Mayhew, D...",18000000,[Irvin Kershner],290475067.0,247900000.0,SciFi Fantasy,The Empire Strikes Back,1980.0,7.242556e+08,4.488028e+07,...,0,0,0,0,0,0,0,0,0,0
1,[Leslie Nielsen],3500000,"[Jim Abrahams, David Zucker, Jerry Zucker]",83453539.0,,Comedy,Airplane!,1980.0,2.080788e+08,8.726720e+06,...,0,0,0,0,0,0,0,0,0,0
2,[Jack Nicholson],19000000,[Stanley Kubrick],44360123.0,,Horror,The Shining,1980.0,1.106053e+08,4.737363e+07,...,0,0,0,0,0,0,0,0,0,0
3,[Michael Caine],6500000,[De Palma],31899000.0,,Thriller,Dressed to Kill,1980.0,7.953533e+07,1.620677e+07,...,0,0,0,0,0,0,0,0,0,0
4,[Michael Caine],22000000,[None],15716828.0,,Adventure,The Island (1980),1980.0,3.918753e+07,5.485367e+07,...,0,0,0,0,0,0,0,0,0,0
5,[Kurt Russell],8000000,[Robert Zemeckis],11715321.0,,Comedy,Used Cars,1980.0,2.921038e+07,1.994679e+07,...,0,0,0,0,0,0,0,0,0,0
6,"[Woody Allen, Sharon Stone]",10000000,[Woody Allen],10389003.0,,Comedy,Stardust Memories,1980.0,2.590341e+07,2.493349e+07,...,0,0,0,0,0,0,0,0,0,0
7,[Paul Newman],20000000,[None],3763988.0,,Action Thriller,When Time Ran Out,1980.0,9.384935e+06,4.986697e+07,...,0,0,0,0,0,0,0,0,0,0
8,"[Jeff Bridges, Mickey Rourke, Christopher Walken]",44000000,[Michael Cimino],3484331.0,,Western,Heaven's Gate,1980.0,8.687652e+06,1.097073e+08,...,0,0,0,0,0,0,0,0,0,0
9,"[Harrison Ford, Alfred Molina]",18000000,[Steven Spielberg],248159971.0,141766000.0,Period Adventure,Raiders of the Lost Ark,1981.0,6.036579e+08,4.378564e+07,...,0,0,0,0,0,0,0,0,0,0
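As a small illustration of the index-aligned merge above, here is a sketch on toy data. The frames `main`, `genre_toy`, and `actor_toy` are made-up stand-ins, not the project's tables. When all frames share the same row index, `pd.concat(..., axis=1)` should give the same columns and values as the chained merges.

```python
import pandas as pd

# Hypothetical stand-ins for the main table and the two dummy tables.
main = pd.DataFrame({'Title': ['A', 'B', 'C'], 'Budget': [10, 20, 30]})
genre_toy = pd.DataFrame({'Comedy': [1, 0, 0], 'Horror': [0, 1, 0]})
actor_toy = pd.DataFrame({'Tom Hanks': [0, 0, 1]})

# Chain two merges on the row index, as in the cell above.
merged = pd.merge(main, genre_toy, how='outer', left_index=True, right_index=True)
merged = pd.merge(merged, actor_toy, how='outer', left_index=True, right_index=True)

# With identical indexes, a column-wise concat is a shorter way to write the same thing.
concat = pd.concat([main, genre_toy, actor_toy], axis=1)

same_columns = list(merged.columns) == list(concat.columns)
same_values = bool((merged.values == concat.values).all())
print(merged.shape, same_columns, same_values)
```

If the indexes did not line up, the outer merge would introduce NaN rows, so it is worth checking the row count after joining.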


In [None]:
# 2419 rows by 427 columns; we can work with this.

In [518]:
combined = combined.reset_index(drop=True)
ncombined = combined.copy()
print ncombined.shape
ncombined.to_pickle('pre_model.pkl')

(2419, 427)
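A quick sanity check after pickling is to read the file back and compare it with the original. The sketch below does this on a tiny made-up frame; the file name `pre_model_demo.pkl` is hypothetical and is written to a temporary directory.

```python
import os
import tempfile

import pandas as pd

# Tiny made-up frame standing in for the combined table.
frame = pd.DataFrame({'Adjusted Budget': [4.488028e7, 8.726720e6], 'Comedy': [0, 1]})

# Write to a temporary location so nothing in the project folder is touched.
path = os.path.join(tempfile.mkdtemp(), 'pre_model_demo.pkl')
frame.to_pickle(path)

# Read it back and confirm that values, dtypes, and labels survived the round trip.
restored = pd.read_pickle(path)
round_trip_ok = restored.equals(frame)
print(restored.shape, round_trip_ok)
```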


# WE NOW HAVE A DATASET (if not a great one) SO LET'S PUT SOME MODELS ON OUR MODELS

In [519]:
print ncombined.shape

y = ncombined['Adjusted Domestic']
print y.shape
X = ncombined.drop(['Actors', 'Director','Domestic Gross', 'Foreign Gross', 'Genre', 'Title', 'Year', 'Budget', 'Adjusted Domestic'], axis=1)
print X.shape
# print X.shape()
# print y.shape()

(2419, 427)
(2419,)
(2419, 418)


In [453]:
ncombined.index[1]

1

In [524]:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.3)

kf = KFold(len(X_train), n_folds=5, shuffle=True, random_state=0)
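The `cross_validation` module used above belongs to older scikit-learn releases. As a sketch of the same split in newer releases (assuming `sklearn.model_selection` is available), the splitter takes `n_splits` instead of the number of rows, and the data is passed later through `split`. The arrays here are random toy data.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

# Random toy data standing in for X and y.
rng = np.random.RandomState(0)
X_demo = rng.rand(100, 4)
y_demo = rng.rand(100)

# Hold out 30% for testing, as in the cell above.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Five shuffled folds over the training rows only.
kf_demo = KFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = [len(val_idx) for _, val_idx in kf_demo.split(X_tr)]
print(X_tr.shape, X_te.shape, fold_sizes)
```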

In [455]:
parameters = {'normalize':(True,False),
              'alpha':np.logspace(4,6,50)}
grid_searcher = grid_search.GridSearchCV(lasso, parameters,cv=kf)
grid_searcher.fit(X_train, y_train)  # kf was built from len(X_train), so tune on the training split only
print grid_searcher.best_estimator_
print grid_searcher.best_estimator_.score(X_test,y_test)
print grid_searcher.best_params_

KeyboardInterrupt: 

In [345]:
parameters = {'normalize':(True,False),
              'alpha':np.logspace(-2,6,50),
            'l1_ratio': np.linspace(0.01,.99,10)}
grid_searcher = grid_search.GridSearchCV(linear_model.ElasticNet(), parameters,cv=kf)
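# Caution: fitting on the full X lets the refit best_estimator_ see the X_test rows,
# so the score printed below is optimistic.
grid_searcher.fit(X, y)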
print grid_searcher.best_estimator_
print grid_searcher.best_estimator_.score(X_test,y_test)
print grid_searcher.best_params_

ElasticNet(alpha=0.021209508879201904, copy_X=True, fit_intercept=True,
      l1_ratio=0.22777777777777777, max_iter=1000, normalize=False,
      positive=False, precompute=False, random_state=None,
      selection='cyclic', tol=0.0001, warm_start=False)
0.332988042393
{'normalize': False, 'alpha': 0.021209508879201904, 'l1_ratio': 0.22777777777777777}
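For comparison, here is a sketch of an elastic net grid search that tunes on the training split only and scores on the held-out rows. It assumes a newer scikit-learn, where `normalize` is no longer an estimator argument, so scaling is done with a `StandardScaler` step in a `Pipeline`. The data comes from `make_regression` and the grid values are arbitrary, not the ones searched above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the movie table.
X_demo, y_demo = make_regression(n_samples=120, n_features=8, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Scaling lives inside the pipeline so each CV fold is scaled using its own training rows.
pipe = Pipeline([('scale', StandardScaler()), ('model', ElasticNet(max_iter=10000))])
grid = {'model__alpha': np.logspace(-2, 1, 4), 'model__l1_ratio': [0.1, 0.5, 0.9]}

search = GridSearchCV(pipe, grid, cv=KFold(n_splits=5, shuffle=True, random_state=0))
search.fit(X_tr, y_tr)              # the test rows are never seen during tuning
test_r2 = search.score(X_te, y_te)  # R^2 of the refit best model on held-out rows
print(search.best_params_, test_r2)
```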


In [525]:
models = {}
# models['lin_reg'] = linear_model.LinearRegression()
# models['ridge'] = linear_model.Ridge()
models['lasso'] = linear_model.Lasso(alpha=430000)
models['elasticnet'] = linear_model.ElasticNet(alpha=0.02, l1_ratio=0.22)
# models['tree'] = tree.DecisionTreeRegressor(min_samples_split=10, max_depth=4)

In [447]:
models

{'lasso': Lasso(alpha=430000, copy_X=True, fit_intercept=True, max_iter=1000,
    normalize=False, positive=False, precompute=False, random_state=None,
    selection='cyclic', tol=0.0001, warm_start=False)}

In [526]:
for name,model in models.iteritems():
    model.fit(X_train,y_train)
    print('Model: '+name)
    print("Score: " + str(model.score(X_test,y_test)))
    print("Constant:" + str(model.intercept_))
    sorted_features = sorted(zip(X.columns,model.coef_), key=lambda tup: abs(tup[1]), reverse=True)
    for feature in sorted_features:
        print(feature)
        
    print("")
    
# shuffler = cross_validation.ShuffleSplit(2419, test_size=.3)

# for name,model in models.iteritems():
#     score = cross_validation.cross_val_score(model, X, y, n_jobs=2, cv=shuffler)
#     print('Model: ' + name)
#     print(score)
#     print("Score: " + str(np.mean(score)) + " with STD: " + str(np.std(score)))
#     sorted_features = sorted(zip(X_train.columns,model.coef_), key=lambda tup: abs(tup[1]), reverse=True)
#     for feature in sorted_features:
#         print(feature)
        
#     print("")

Model: elasticnet
Score: 0.324662397535
Constant:31730478.0991
(u'Animation', 39831545.203055821)
('Harrison Ford', 38047307.916384868)
('Tom Hanks', 36746986.773138084)
('L. Jackson', 36133339.737239197)
('Mike Myers', 31069014.623224329)
(u'Unknown', -30328846.725090779)
('Alan Rickman', 29002968.480123021)
('Cameron Diaz', 23666802.245962985)
('Jack Nicholson', 23339671.058775261)
('Uma Thurman', -22214940.218161255)
('Bradley Cooper', 21824787.850643318)
('Keira Knightley', 21755456.572748676)
('Gwyneth Paltrow', 21017966.918635096)
('Kellan Lutz', 20953433.856165919)
('Tom Cruise', 20702200.571627311)
(u'Adventure', 20595972.40015395)
('Natalie Portman', 20521071.20175232)
('Anna Kendrick', 20163625.164675895)
('Billy Burke', 19964978.211857259)
('Sandra Bullock', 19878996.922651567)
('Orlando Bloom', 19215126.073842749)
('Kathy Bates', 18893865.14439036)
('Michael Keaton', 18509228.337607607)
('Elizabeth Banks', 18483659.475299582)
('Steve Carell', 18174386.55290623)
('Liam Neeso

In [461]:
shuffler = cross_validation.ShuffleSplit(len(X), test_size = 0.3)

for name,model in models.iteritems():
    score = cross_validation.cross_val_score(model, X, y, n_jobs=3, cv=shuffler)
    print('Model: ' + name)
    print(score)
    print("Score: " + str(np.mean(score)) + " with STD: " + str(np.std(score)))
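    # cross_val_score fits clones of the model, so the original is still unfitted here
    # and coef_ is None; call model.fit(X, y) first if the coefficients are needed.
    sorted_features = sorted(zip(X.columns,model.coef_), key=lambda tup: abs(tup[1]), reverse=True)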
    for feature in sorted_features:
        print(feature)
        
    print("")

Model: elasticnet
[ 0.33987414  0.35744652  0.35629157  0.33054131  0.34774414  0.352767
  0.29999961  0.35802353  0.28324209  0.29582143]
Score: 0.332175134853 with STD: 0.0271441454396


TypeError: zip argument #2 must support iteration
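The error above is consistent with `cross_val_score` fitting copies of the estimator rather than the object passed in. The sketch below shows this on toy data with a newer scikit-learn, where an unfitted `Lasso` has no `coef_` attribute at all (the older release used here reports `None`). The feature names `a`, `b`, `c` are made up.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

# Toy data: the target depends strongly on the first column and weakly on the third.
rng = np.random.RandomState(0)
X_demo = rng.rand(60, 3)
y_demo = X_demo @ np.array([2.0, 0.0, -1.0]) + 0.01 * rng.rand(60)

model = Lasso(alpha=0.001)
scores = cross_val_score(model, X_demo, y_demo, cv=5)

# The scoring ran on clones, so the original estimator is still unfitted.
fitted_after_cv = hasattr(model, 'coef_')

# An explicit fit populates coef_, after which the ranking works.
model.fit(X_demo, y_demo)
ranked = sorted(zip(['a', 'b', 'c'], model.coef_), key=lambda t: abs(t[1]), reverse=True)
print(fitted_after_cv, ranked)
```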

In [444]:
print model.coef_

None


In [399]:
rf = RandomForestRegressor(n_estimators=100, max_features=7)
rf.fit(X_train, y_train)
print(sum(rf.feature_importances_))
print "Score: " + str(rf.score(X_test, y_test))
# Rank features by importance, largest first
sorted_features = sorted(zip(X.columns, rf.feature_importances_), key=lambda x: x[1], reverse=True)
for feature in sorted_features:
    print(feature)

1.0
Score: 0.363238482975
('Adjusted Budget', 0.19660296414777917)
(u'Adventure', 0.026551374534381434)
('Drew Barrymore', 0.022044102666294689)
(u'Drama', 0.017843731083553298)
(u'Action', 0.017339264722556244)
(u'Comedy', 0.016585948315445975)
('Harrison Ford', 0.01521362205029414)
(u'SciFi', 0.014656049292423357)
(u'Animation', 0.013897417393145355)
(u'Romance', 0.011965656352527426)
('Tom Hanks', 0.010556050859639227)
(u'Family', 0.010019222962998453)
(u'Fantasy', 0.0093281336223671688)
('Kate Winslet', 0.00824788319699102)
('Dallas Howard', 0.0079627232376048279)
('Alan Rickman', 0.00767376341099716)
('Kathy Bates', 0.0068662278021145136)
(u'Thriller', 0.0068190030261964666)
('Tobey Maguire', 0.0064831848465622312)
('L. Jackson', 0.0063758167114366859)
('Elizabeth Banks', 0.0063463477536653736)
('Tom Cruise', 0.0058365541394931565)
(u'Horror', 0.0055546091077179364)
('Gary Oldman', 0.0054964695857674488)
('Jeff Goldblum', 0.0053990718219172366)
('Bonnie Hunt', 0.005369422820682897
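One alternative way to rank importances is to wrap them in a `pandas.Series` indexed by column name, which makes sorting and slicing the top entries straightforward. This is a sketch on synthetic data; the column names `budget`, `noise_a`, and `noise_b` are invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data where only the 'budget' column drives the target.
rng = np.random.RandomState(0)
X_demo = pd.DataFrame(rng.rand(200, 3), columns=['budget', 'noise_a', 'noise_b'])
y_demo = 5.0 * X_demo['budget'] + 0.1 * rng.rand(200)

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X_demo, y_demo)

# Label each importance with its column name, then sort largest first.
importances = pd.Series(forest.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)
print(importances.head(10))
```

Importances from a fitted forest are normalized to sum to 1, which matches the `1.0` printed by the cell above.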

In [403]:
print len(ncombined.columns)
for i in ncombined.columns:
    print i
# strat_sample = ncombined.loc(ncombined[ncombined.columns[i]])
# sKf = cross_validation.StratifiedKFold(len(strat_sample), n_folds=2, shuffle=True)

427
Actors
Budget
Director
Domestic Gross
Foreign Gross
Genre
Title
Year
Adjusted Domestic
Adjusted Budget
Action
Adventure
Animation
Comedy
Concert
Crime
Documentary
Drama
Epic
Family
Fantasy
Foreign
Historical
Horror
Music
Musical
Period
Romance
Romantic
SciFi
Sports
Thriller
Unknown
War
Western
Aaron Eckhart
Abigail Breslin
Adam Sandler
Adam Scott
Adrien Brody
Al Pacino
Alan Arkin
Alan Rickman
Alec Baldwin
Alfred Molina
Amanda Peet
Amanda Seyfried
Amber Heard
Amy Adams
Andy Samberg
Andy Serkis
Angela Bassett
Angelina Jolie
Anna Faris
Anna Kendrick
Anna Paquin
Anne Hathaway
Anne Moss
Annette Bening
Anthony Anderson
Anthony Hopkins
Anthony Mackie
Anton Yelchin
Antonio Banderas
Arnold Schwarzenegger
Ashley Judd
Ashton Kutcher
Barry Pepper
Ben Affleck
Ben Foster
Ben Kingsley
Ben Stiller
Benjamin Bratt
Bernie Mac
Bette Midler
Bill Murray
Bill Nighy
Billy Burke
Bob Thornton
Bonham Carter
Bonnie Hunt
Brad Pitt
Bradley Cooper
Brendan Fraser
Brendan Gleeson
Brian Cox
Brittany Murphy
Bruce Gr

# Ignore Below this line

## Some Work with some major genres dropped

In [527]:
no_animation = ncombined.copy()


In [529]:
print len(no_animation)
print no_animation.shape

2419
(2419, 427)


In [530]:
no_animation_to_drop= []
# Collect the row positions of movies tagged Unknown, Animation, or Concert
for i in range(len(no_animation)):
    if no_animation['Unknown'][i] == 1:
        no_animation_to_drop.append(i)
    elif no_animation['Animation'][i] == 1:
        no_animation_to_drop.append(i)
    elif no_animation['Concert'][i] == 1:
        no_animation_to_drop.append(i)

no_animation.drop(no_animation.index[no_animation_to_drop], inplace=True)
no_animation = no_animation.reset_index(drop=True)
print len(no_animation)
no_animation

2331


Unnamed: 0,Actors,Budget,Director,Domestic Gross,Foreign Gross,Genre,Title,Year,Adjusted Domestic,Adjusted Budget,...,Warwick Davis,Will Ferrell,Will Smith,Willem Dafoe,William Fichtner,Winona Ryder,Woody Harrelson,Zach Galifianakis,Zoe Saldana,Zooey Deschanel
0,"[Kenny Baker, Anthony Daniels, Peter Mayhew, D...",18000000,[Irvin Kershner],290475067.0,247900000.0,SciFi Fantasy,The Empire Strikes Back,1980.0,7.242556e+08,4.488028e+07,...,0,0,0,0,0,0,0,0,0,0
1,[Leslie Nielsen],3500000,"[Jim Abrahams, David Zucker, Jerry Zucker]",83453539.0,,Comedy,Airplane!,1980.0,2.080788e+08,8.726720e+06,...,0,0,0,0,0,0,0,0,0,0
2,[Jack Nicholson],19000000,[Stanley Kubrick],44360123.0,,Horror,The Shining,1980.0,1.106053e+08,4.737363e+07,...,0,0,0,0,0,0,0,0,0,0
3,[Michael Caine],6500000,[De Palma],31899000.0,,Thriller,Dressed to Kill,1980.0,7.953533e+07,1.620677e+07,...,0,0,0,0,0,0,0,0,0,0
4,[Michael Caine],22000000,[None],15716828.0,,Adventure,The Island (1980),1980.0,3.918753e+07,5.485367e+07,...,0,0,0,0,0,0,0,0,0,0
5,[Kurt Russell],8000000,[Robert Zemeckis],11715321.0,,Comedy,Used Cars,1980.0,2.921038e+07,1.994679e+07,...,0,0,0,0,0,0,0,0,0,0
6,"[Woody Allen, Sharon Stone]",10000000,[Woody Allen],10389003.0,,Comedy,Stardust Memories,1980.0,2.590341e+07,2.493349e+07,...,0,0,0,0,0,0,0,0,0,0
7,[Paul Newman],20000000,[None],3763988.0,,Action Thriller,When Time Ran Out,1980.0,9.384935e+06,4.986697e+07,...,0,0,0,0,0,0,0,0,0,0
8,"[Jeff Bridges, Mickey Rourke, Christopher Walken]",44000000,[Michael Cimino],3484331.0,,Western,Heaven's Gate,1980.0,8.687652e+06,1.097073e+08,...,0,0,0,0,0,0,0,0,0,0
9,"[Harrison Ford, Alfred Molina]",18000000,[Steven Spielberg],248159971.0,141766000.0,Period Adventure,Raiders of the Lost Ark,1981.0,6.036579e+08,4.378564e+07,...,0,0,0,0,0,0,0,0,0,0
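The row-dropping loop above can also be sketched as a boolean mask over the genre dummy columns. The toy frame below is made up; only the column names `Unknown`, `Animation`, and `Concert` come from the notebook.

```python
import pandas as pd

# Made-up frame with the three genre dummy columns used above.
toy = pd.DataFrame({
    'Title': ['a', 'b', 'c', 'd'],
    'Unknown': [0, 1, 0, 0],
    'Animation': [0, 0, 1, 0],
    'Concert': [0, 0, 0, 0],
})

genres_to_drop = ['Unknown', 'Animation', 'Concert']

# True for any row flagged in at least one of the listed genres.
flagged = toy[genres_to_drop].eq(1).any(axis=1)

# Keep the unflagged rows and renumber the index from zero.
kept = toy.loc[~flagged].reset_index(drop=True)
print(kept)
```

This avoids tracking row positions by hand, so the mask cannot drift out of step with the frame it filters.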


In [531]:
for i in no_animation.columns:
    print i

Actors
Budget
Director
Domestic Gross
Foreign Gross
Genre
Title
Year
Adjusted Domestic
Adjusted Budget
Action
Adventure
Animation
Comedy
Concert
Crime
Documentary
Drama
Epic
Family
Fantasy
Foreign
Historical
Horror
Music
Musical
Period
Romance
Romantic
SciFi
Sports
Thriller
Unknown
War
Western
Aaron Eckhart
Abigail Breslin
Adam Sandler
Adam Scott
Adrien Brody
Al Pacino
Alan Arkin
Alan Rickman
Alec Baldwin
Alfred Molina
Amanda Peet
Amanda Seyfried
Amber Heard
Amy Adams
Andy Samberg
Andy Serkis
Angela Bassett
Angelina Jolie
Anna Faris
Anna Kendrick
Anna Paquin
Anne Hathaway
Anne Moss
Annette Bening
Anthony Anderson
Anthony Hopkins
Anthony Mackie
Anton Yelchin
Antonio Banderas
Arnold Schwarzenegger
Ashley Judd
Ashton Kutcher
Barry Pepper
Ben Affleck
Ben Foster
Ben Kingsley
Ben Stiller
Benjamin Bratt
Bernie Mac
Bette Midler
Bill Murray
Bill Nighy
Billy Burke
Bob Thornton
Bonham Carter
Bonnie Hunt
Brad Pitt
Bradley Cooper
Brendan Fraser
Brendan Gleeson
Brian Cox
Brittany Murphy
Bruce Greenw

In [532]:
print no_animation.shape

y = no_animation['Adjusted Domestic']
print y.shape
X = no_animation.drop(['Actors', 'Director','Domestic Gross', 'Foreign Gross', 'Genre', 'Title', 'Year', 'Budget', 'Adjusted Domestic', 'Animation', 'Unknown', 'Concert'], axis=1)
print X.shape

(2331, 427)
(2331,)
(2331, 415)


In [541]:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.3)

# sKf = StratifiedKFold(X, n_folds=2, shuffle=True, random_state=0)
kf = KFold(len(X_train), n_folds=5, shuffle=True, random_state=0)

In [489]:
parameters = {'normalize':(True,False),
              'alpha':np.logspace(3,6,50)}
grid_searcher = grid_search.GridSearchCV(lasso, parameters,cv=kf)
grid_searcher.fit(X_train, y_train)  # tune on the training rows only; X_test stays held out
print grid_searcher.best_estimator_
print grid_searcher.best_estimator_.score(X_test,y_test)
print grid_searcher.best_params_

KeyboardInterrupt: 

In [484]:
models = {}
# models['lin_reg'] = linear_model.LinearRegression()
# models['ridge'] = linear_model.Ridge()
models['lasso'] = linear_model.Lasso(alpha=75000)
models['elasticnet'] = linear_model.ElasticNet(alpha=1.1, l1_ratio=0.98)
# models['tree'] = tree.DecisionTreeRegressor(min_samples_split=10, max_depth=4)

In [540]:
for name,model in models.iteritems():
    model.fit(X_train,y_train)
    print('Model: '+name)
    print("Score: " + str(model.score(X_test,y_test)))
    print("Constant:" + str(model.intercept_))
    sorted_features = sorted(zip(X.columns,model.coef_), key=lambda tup: abs(tup[1]), reverse=True)
    for feature in sorted_features:
        print(feature)
        
    print("")

Model: elasticnet
Score: 0.380850484882
Constant:25972493.6482
('Tom Hanks', 46922428.278743222)
('Harrison Ford', 43461842.89533174)
('Carrie Fisher', 36414958.146582603)
('L. Jackson', 32178821.395652346)
('Elizabeth Banks', 29790805.165425994)
('Drew Barrymore', 29291588.901411317)
('Andy Serkis', 29084994.977037877)
('Kathy Bates', 26703640.003090765)
('Natalie Portman', 24605320.576903656)
('Tyrese Gibson', 20200835.104312744)
('Billy Burke', 20200256.653717168)
('Jack Nicholson', 19729767.890426598)
('Sandra Bullock', 18649315.956886694)
('Anna Kendrick', 18377784.956935979)
('Gwyneth Paltrow', 18265208.632517882)
('Keira Knightley', 18213027.782507226)
('Kellan Lutz', 18087739.7402124)
('Bill Murray', 17707612.264248736)
('Tom Cruise', 17401323.56382243)
('Downey', 17236357.305731457)
('Charlize Theron', -16772134.967278853)
('William Fichtner', 16577410.547722405)
('Alan Rickman', 16547281.095601074)
('Cillian Murphy', 16323829.917937189)
('Jennifer Lawrence', 15698735.98473793

In [305]:
parameters = {'normalize':(True,False),
              'alpha':np.logspace(-2,3,30),
            'l1_ratio': np.linspace(0.01,.99,10)}
grid_searcher = grid_search.GridSearchCV(linear_model.ElasticNet(), parameters,cv=kf)
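# Caution: X contains the X_test rows, so the refit best_estimator_ has already seen them
# and the test score printed below is inflated.
grid_searcher.fit(X, y)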
print grid_searcher.best_estimator_
print grid_searcher.best_estimator_.score(X_test,y_test)
print grid_searcher.best_params_

ElasticNet(alpha=1.1721022975334805, copy_X=True, fit_intercept=True,
      l1_ratio=0.98999999999999999, max_iter=1000, normalize=False,
      positive=False, precompute=False, random_state=None,
      selection='cyclic', tol=0.0001, warm_start=False)
0.51130854105
{'normalize': False, 'alpha': 1.1721022975334805, 'l1_ratio': 0.98999999999999999}
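For reference when reading `alpha` and `l1_ratio`, the objective that scikit-learn documents for `ElasticNet` can be written as follows, with $n$ the number of samples, $w$ the coefficient vector, $\alpha$ = `alpha`, and $\rho$ = `l1_ratio` (the intercept is left out for brevity):

```latex
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2
  \; + \; \alpha \rho \, \lVert w \rVert_1
  \; + \; \frac{\alpha (1 - \rho)}{2} \, \lVert w \rVert_2^2
```

With the selected `l1_ratio` of about 0.99, the penalty is almost entirely the L1 term, so this fit behaves much like a lasso.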


# Cell below used for presentation results

In [545]:
# Score each model with the 5-fold splitter kf defined above
for name,model in models.iteritems():
    score = cross_validation.cross_val_score(model, X_train, y_train, n_jobs=1, cv=kf)
    print('Model: ' + name)
    print(score)
    print("Score: " + str(np.mean(score)) + " with STD: " + str(np.std(score)))
    # coef_ comes from the earlier model.fit(X_train, y_train) call; cross_val_score only fits clones
    sorted_features = sorted(zip(X.columns,model.coef_), key=lambda tup: abs(tup[1]), reverse=True)
    for feature in sorted_features:
        print(feature)
        
    print("")

Model: elasticnet
[ 0.3604229   0.33072843  0.39899747  0.40574193  0.22986682]
Score: 0.345151509523 with STD: 0.0637218723594
('Tom Hanks', 46922428.278743222)
('Harrison Ford', 43461842.89533174)
('Carrie Fisher', 36414958.146582603)
('L. Jackson', 32178821.395652346)
('Elizabeth Banks', 29790805.165425994)
('Drew Barrymore', 29291588.901411317)
('Andy Serkis', 29084994.977037877)
('Kathy Bates', 26703640.003090765)
('Natalie Portman', 24605320.576903656)
('Tyrese Gibson', 20200835.104312744)
('Billy Burke', 20200256.653717168)
('Jack Nicholson', 19729767.890426598)
('Sandra Bullock', 18649315.956886694)
('Anna Kendrick', 18377784.956935979)
('Gwyneth Paltrow', 18265208.632517882)
('Keira Knightley', 18213027.782507226)
('Kellan Lutz', 18087739.7402124)
('Bill Murray', 17707612.264248736)
('Tom Cruise', 17401323.56382243)
('Downey', 17236357.305731457)
('Charlize Theron', -16772134.967278853)
('William Fichtner', 16577410.547722405)
('Alan Rickman', 16547281.095601074)
('Cillian Mur

# Unused below this point:

# Non Action, SciFi, Fantasy, Adventure, Animation

In [493]:
noguns = ncombined.copy()

In [494]:
noguns_to_drop= []
for i in range(len(noguns)):
    if noguns['Unknown'][i] == 1:
        noguns_to_drop.append(i)
    elif noguns['Action'][i] == 1:
        noguns_to_drop.append(i)
    elif noguns['SciFi'][i] == 1:
        noguns_to_drop.append(i)
    elif noguns['Animation'][i] == 1:
        noguns_to_drop.append(i)
    elif noguns['Adventure'][i] == 1:
        noguns_to_drop.append(i)
    elif noguns['Fantasy'][i] == 1:
        noguns_to_drop.append(i)
        
noguns.drop(noguns.index[noguns_to_drop], inplace=True)
noguns = noguns.reset_index(drop=True)
print len(noguns)
noguns[:10]

1565


Unnamed: 0,Actors,Budget,Director,Domestic Gross,Foreign Gross,Genre,Title,Year,Adjusted Domestic,Adjusted Budget,...,Zach Galifianakis,Zach Gilford,Zachary Gordon,Zachary Levi,Zachary Quinto,Zhang Ziyi,Zoe Bell,Zoe Kazan,Zoe Saldana,Zooey Deschanel
0,[Leslie Nielsen],3500000,"[Jim Abrahams, David Zucker, Jerry Zucker]",83453539.0,,Comedy,Airplane!,1980.0,208078800.0,8726720.0,...,0,0,0,0,0,0,0,0,0,0
1,[Jack Nicholson],19000000,[Stanley Kubrick],44360123.0,,Horror,The Shining,1980.0,110605300.0,47373630.0,...,0,0,0,0,0,0,0,0,0,0
2,[Michael Caine],6500000,[De Palma],31899000.0,,Thriller,Dressed to Kill,1980.0,79535330.0,16206770.0,...,0,0,0,0,0,0,0,0,0,0
3,[Kurt Russell],8000000,[Robert Zemeckis],11715321.0,,Comedy,Used Cars,1980.0,29210380.0,19946790.0,...,0,0,0,0,0,0,0,0,0,0
4,"[Woody Allen, Sharon Stone]",10000000,[Woody Allen],10389003.0,,Comedy,Stardust Memories,1980.0,25903410.0,24933490.0,...,0,0,0,0,0,0,0,0,0,0
5,"[Jeff Bridges, Mickey Rourke, Christopher Walken]",44000000,[Michael Cimino],3484331.0,,Western,Heaven's Gate,1980.0,8687652.0,109707300.0,...,0,0,0,0,0,0,0,0,0,0
6,"[Warren Beatty, Diane Keaton, Jack Nicholson, ...",32000000,[None],40382659.0,,Historical Drama,Reds,1981.0,98232240.0,77841130.0,...,0,0,0,0,0,0,0,0,0,0
7,"[Dustin Hoffman, Jessica Lange, Bill Murray, S...",21000000,[Sydney Pollack],177200000.0,,Comedy,Tootsie,1982.0,420532000.0,49837310.0,...,0,0,0,0,0,0,0,0,0,0
8,"[Jack Nicholson, Albert Brooks]",8000000,[L. Brooks],108423489.0,,Drama,Terms of Endearment,1983.0,251035300.0,18522580.0,...,0,0,0,0,0,0,0,0,0,0
9,[Tom Cruise],6200000,[None],63541777.0,,Comedy,Risky Business,1983.0,147119700.0,14355000.0,...,0,0,0,0,0,0,0,0,0,0


In [495]:
print noguns.shape

y = noguns['Adjusted Domestic']
print y.shape
X = noguns.drop(['Actors', 'Director','Domestic Gross', 'Foreign Gross', 'Genre', 'Title', 'Year', 'Budget', 'Adjusted Domestic', 'Animation', 'Action', 'Adventure', 'Fantasy', 'SciFi', 'Unknown'], axis=1)
print X.shape

X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.3)

kf = KFold(len(X_train), n_folds=5, shuffle=True, random_state=0)

(1565, 1964)
(1565,)
(1565, 1949)


In [323]:
parameters = {'normalize':(True,False),
              'alpha':np.logspace(3,6,50)}
grid_searcher = grid_search.GridSearchCV(lasso, parameters,cv=kf)
grid_searcher.fit(X, y)
print grid_searcher.best_estimator_
print grid_searcher.best_estimator_.score(X_test,y_test)
print grid_searcher.best_params_

Lasso(alpha=323745.75428176398, copy_X=True, fit_intercept=True,
   max_iter=1000, normalize=False, positive=False, precompute=False,
   random_state=None, selection='cyclic', tol=0.0001, warm_start=False)
0.351089874994
{'normalize': False, 'alpha': 323745.75428176398}


In [324]:
parameters = {'normalize':(True,False),
              'alpha':np.logspace(-3,3,30),
            'l1_ratio': np.linspace(0.01,.99,10)}
grid_searcher = grid_search.GridSearchCV(linear_model.ElasticNet(), parameters,cv=kf)
grid_searcher.fit(X, y)
print grid_searcher.best_estimator_
print grid_searcher.best_estimator_.score(X_test,y_test)
print grid_searcher.best_params_

ElasticNet(alpha=0.02807216203941177, copy_X=True, fit_intercept=True,
      l1_ratio=0.22777777777777777, max_iter=1000, normalize=False,
      positive=False, precompute=False, random_state=None,
      selection='cyclic', tol=0.0001, warm_start=False)
0.395572271854
{'normalize': False, 'alpha': 0.02807216203941177, 'l1_ratio': 0.22777777777777777}


In [496]:
models = {}
# models['lin_reg'] = linear_model.LinearRegression()
# models['ridge'] = linear_model.Ridge()
models['lasso'] = linear_model.Lasso(alpha=323000)
models['elasticnet'] = linear_model.ElasticNet(alpha=.02, l1_ratio=0.22)
# models['tree'] = tree.DecisionTreeRegressor(min_samples_split=10, max_depth=4)

In [497]:
for name,model in models.iteritems():
    model.fit(X_train,y_train)
    print('Model: '+name)
    print("Score: " + str(model.score(X_test,y_test)))
    print("Constant:" + str(model.intercept_))
    sorted_features = sorted(zip(X.columns,model.coef_), key=lambda tup: abs(tup[1]), reverse=True)
#     for feature in sorted_features:
#         print(feature)
        
    print("")

Model: elasticnet
Score: 0.225455012169
Constant:27877584.8345

Model: lasso
Score: 0.153930391145
Constant:29721599.5183



In [498]:
shuffler = cross_validation.ShuffleSplit(len(X_train))

for name,model in models.iteritems():
    score = cross_validation.cross_val_score(model, X_train, y_train, n_jobs=1, cv=shuffler)
    print('Model: ' + name)
    print(score)
    print("Score: " + str(np.mean(score)) + " with STD: " + str(np.std(score)))
    sorted_features = sorted(zip(X.columns,model.coef_), key=lambda tup: abs(tup[1]), reverse=True)
    for feature in sorted_features:
        print(feature)
        
    print("")

Model: elasticnet
[ 0.23422083  0.11043525  0.21370191  0.17206562  0.18405645  0.24363795
  0.20387615  0.29873817 -0.07942703  0.34617283]
Score: 0.19274781427 with STD: 0.110001691657
('Tom Hanks', 48886837.436442867)
('Sally Field', 36032206.951474689)
('Billy Zane', 33463990.719319113)
('Bernard Hill', 30908399.873620663)
('Ioan Gruffudd', 30401603.416047588)
('Bill Paxton', 30048176.387829367)
('Kathy Bates', 28367640.68496212)
('Monica Bellucci', 26580790.578939322)
('Tom Cruise', 21425348.402241264)
('Robin Wright', 20636486.771453496)
('Julia Roberts', 20535540.991714902)
('Jack Nicholson', 20503921.134194512)
('Night Shyamalan', 18861887.539305132)
('Macaulay Culkin', 18243416.049060244)
('Toni Collette', 18131631.519510027)
('Gary Sinise', 18092670.537019882)
('Patrick Swayze', 17398428.134050533)
('Whoopi Goldberg', 17398428.134050533)
('Bill Murray', 17279016.138279431)
('Gooding', 17203805.94073987)
('Kate Winslet', 16846029.245335575)
('Joel Osment', 16804087.338366058)
