<!DOCTYPE html>
<html class="no-js">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<title>And the Oscar goes to...</title>
<meta name="description" content="">
<meta name="viewport" content="width=device-width">
<link rel="stylesheet" href="css/bootstrap.min.css">
<style>
body {
padding-top: 50px;
padding-bottom: 20px;
}
</style>
<link rel="stylesheet" href="css/bootstrap-theme.min.css">
<link rel="stylesheet" href="css/bootstrap.min.css">
<link rel="stylesheet" href="css/styles.css">
<link href="img/favicon.ico" rel="shortcut icon">
<script src="//ajax.googleapis.com/ajax/libs/jquery/1.10.1/jquery.min.js"></script>
<script src="js/jquery-1.10.1.min.js"></script>
<script src="js/bootstrap.min.js"></script>
<script src="js/scripts.js"></script>
<!--[if lt IE 9]>
<script src="//html5shiv.googlecode.com/svn/trunk/html5.js"></script>
<script>window.html5 || document.write('<script src="js/vendor/html5shiv.js"><\/script>')</script>
<![endif]-->
</head>
<body>
<div class='wrapper'>
<div class="navbar navbar-inverse navbar-fixed-top">
<div class="container">
<div class="navbar-header">
<button type="button" class="navbar-toggle" data-toggle="collapse" data-target=".navbar-collapse">
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class='navbar-brand' href="index.html">Home</a>
</div>
<div class="navbar-collapse collapse">
<ul class="nav navbar-nav">
<li><a href="getdata.html">Getting the Data</a></li>
<li><a href="visdata.html">Visualizing the Data</a></li>
<li><a href="stat.html">Statistical Analysis</a></li>
<li><a href="predict.html">Predictions</a></li>
<li><a href="closing_remarks.html">Closing Remarks</a></li>
</ul>
</div><!--/.navbar-collapse -->
</div>
</div>
<!-- Page header -->
<div class="header">
<div class="container">
<h2>Getting the Data</h2>
</div>
</div>
<div class="container">
<h4>Getting the Movies</h4>
<div>
<p>To run our analysis, we need a dataset of movies for each year that are candidates for an Oscar. For this purpose, we took a subset of the movies that Box Office Mojo lists as the top movies in a given year. On the assumption that the movies most likely to win an Oscar are among those with the highest gross revenue, we selected the top 200 movies from each year as candidates for an Oscar (awarded in the following year).</p>
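<p>As a rough illustration of the selection step above, the following Python sketch keeps the highest-grossing records for each year. The record fields (<code>title</code>, <code>year</code>, <code>gross</code>), the function name <code>top_n_by_year</code>, and the sample values are hypothetical; the project itself worked from Box Office Mojo's yearly lists, and its actual code may differ.</p>
<pre>
```python
from collections import defaultdict


def top_n_by_year(movies, n=200):
    """Return {year: the n highest-grossing movie records for that year}."""
    by_year = defaultdict(list)
    for movie in movies:
        by_year[movie["year"]].append(movie)
    # Sort each year's movies by gross revenue, highest first, and keep n.
    return {
        year: sorted(group, key=lambda m: m["gross"], reverse=True)[:n]
        for year, group in by_year.items()
    }


# Made-up records, for illustration only.
sample = [
    {"title": "A", "year": 2012, "gross": 300.0},
    {"title": "B", "year": 2012, "gross": 120.0},
    {"title": "C", "year": 2012, "gross": 450.0},
    {"title": "D", "year": 2011, "gross": 80.0},
]
top_two = top_n_by_year(sample, n=2)
```
</pre>
<p>With <code>n=200</code>, the same call would keep the top 200 records per year.</p>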
<p>The data collected via get_top_movies by year included the IMDBid (the reference ID used across the different segments of this project) for approximately 90% of the top 200 movies in each year. The remaining 10% of IMDBids were identified manually, and these manual determinations were recorded in two new files: 'revised_id_data.csv' and 'completedata.csv'.</p>
<p>Of the top 200 movies from each year, approximately 5% were excluded from further analysis. The most common reasons for exclusion were:</p>
<ul>
<li>Re-releases of prior movies (often around the holidays)</li>
<li>Re-releases of movies with visual enhancements (a very common practice for Disney movies)</li>
<li>Exclusive releases of movies in IMAX theaters</li>
<li>One-time limited showings of events such as concerts (see Fathom Events)</li>
</ul>
<p>When a movie was excluded from analysis, its IMDBid was set to 'None', and these rows of the dataframe were then removed. In addition, all dollar values were converted to 2013 dollars, accounting for inflation as reported by <a href="http://usinflation.org/us-inflation-rate" target="_blank">usinflationrates.org</a>.</p>
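<p>The exclusion and inflation steps described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the helper names, the record fields, and the rates are placeholders rather than the project's code or real inflation figures, and the convention that the rate for year Y covers the price change from year Y-1 to year Y is assumed.</p>
<pre>
```python
def to_2013_dollars(amount, year, annual_rates, target_year=2013):
    """Compound annual inflation rates to express amount in target-year dollars.

    annual_rates maps a year to that year's inflation rate as a fraction
    (0.02 means 2%). Assumed convention: the rate for year Y is the price
    change from year Y-1 to year Y.
    """
    factor = 1.0
    for y in range(year + 1, target_year + 1):
        factor *= 1.0 + annual_rates[y]
    return amount * factor


def drop_excluded(movies):
    """Keep only the records whose ID is not the 'None' marker."""
    return [m for m in movies if m["imdb_id"] != "None"]


# Placeholder rates and records, not real figures.
rates = {2012: 0.02, 2013: 0.01}
movies = [
    {"title": "Kept", "imdb_id": "tt0000001", "year": 2011, "gross": 100.0},
    {"title": "Re-release", "imdb_id": "None", "year": 2012, "gross": 50.0},
]
kept = drop_excluded(movies)
# Add an inflation-adjusted column to each surviving record.
adjusted = [
    dict(m, gross_2013=to_2013_dollars(m["gross"], m["year"], rates))
    for m in kept
]
```
</pre>
<p>A value already in 2013 dollars passes through unchanged, because the loop over years is empty in that case.</p>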
</div>
<hr>
<h4>Getting the Movie Reviews</h4>
<div>
<p>To construct DataFrames for our movie review and rating data, we first needed to pull the data from the web. We chose the following sources:</p>
<ul>
<li>Rotten Tomatoes</li>
<li>Metacritic</li>
<li>IMDB</li>
</ul>
<p>When pulling the data, we took only the reviews and ratings that are specific to the site in question. For example, for IMDB we took only the IMDB <i>user</i> reviews, because the critic reviews are linked from other sites like Rotten Tomatoes and Metacritic.</p>
<p>From these sources, we scraped our review data and stored it in DataFrames of the following structure:</p>
<table class="table-bordered">
<thead><tr><th>Critic</th><th>Normalized Score</th><th>Quote</th><th>ID</th><th>Title</th><th>Source</th><th>Overall Score</th><th>Year</th></tr></thead>
<tbody><tr><td>Name of the critic</td><td>Score normalized to the maximum score for this source</td><td>The text of the review</td><td>IMDB ID</td><td>Title of the movie</td><td>Source of the review</td><td>The overall score for this movie on this site (if there is one)</td><td>Year that the movie is from</td></tr></tbody>
</table>
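<p>As a minimal sketch of one row in the structure above, the following Python builds a record with the same eight columns. It assumes that "normalized" means the raw score divided by the source's maximum score; the <code>MAX_SCORE</code> values, the helper name <code>make_review_row</code>, and the sample review are illustrative assumptions, not the project's actual code or data.</p>
<pre>
```python
# Assumed maximum raw score per source (illustrative values).
MAX_SCORE = {"Metacritic": 100.0, "IMDB": 10.0}


def make_review_row(critic, raw_score, quote, imdb_id, title, source,
                    overall_score, year):
    """Build one review record with the eight columns from the table above."""
    return {
        "Critic": critic,
        # Scale the raw score to the 0-1 range for its source.
        "Normalized Score": raw_score / MAX_SCORE[source],
        "Quote": quote,
        "ID": imdb_id,
        "Title": title,
        "Source": source,
        "Overall Score": overall_score,
        "Year": year,
    }


# Made-up review, for illustration only.
row = make_review_row("Example Critic", 8.0, "Made-up quote.", "tt0000001",
                      "Example Movie", "IMDB", 7.5, 2012)
```
</pre>
<p>Putting every source on the same scale is what lets scores from different sites be compared in one column.</p>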
</div>
<br>
<div><a class="btn btn-default slide btn-success" href="visdata.html">Seeing the Data »</a></div>
<hr>
<footer>
<p>Presented By: Nick Perkons, Mike Rizzo, Julia Careaga, & Ibrahim Khan</p>
</footer>
</div> <!-- /container -->
</div> <!-- wrapper -->
</body>
</html>