#Analysis of Film Locations

##Table of Contents
* [Analysis of Film Locations](#Analysis-of-Film-Locations)
    * [1. Overview](#1.-Overview)
    * [2. Data Acquisition & Management](#2.-Data-Acquisition-&-Management)
        * [2.1. Approach 1](#2.1.-Approach-1)
        * [2.2. Approach 2](#2.2.-Approach-2) 
    * [3. Statistics](#3.-Statistics)
        * [3.1. Summary Statistics of Data Set](#3.1.-Summary-Statistics-of-Data-Set)
        * [3.2. Logistic Regression](#3.2.-Logistic-Regression)
        * [3.3. K-Nearest Neighbours](#3.3.-K-Nearest-Neighbours)
        * [3.4. Principal Component Analysis](#3.4.-Principal-Component-Analysis)


#1. Overview

In this project, we study the relationship between film locations and other factors like ratings and genres. Data will be scraped from [IMDB](http://www.imdb.com). Statistical and machine learning analytical tools will be used in our study. Visualisation of results will be done via graphical plots and Tableau.

#2. Data Acquisition & Management

We first scrape the data from IMDB, which gives us the following information about each movie: budget, country, critic ratings, duration, genre, gross earnings, language, location, name, opening weekend earnings, release dates, url, user ratings, user ratings count and year. We want a data set of 12,000 titles, with our analysis focused on the most recent titles. As an example, there are 8,997 titles in 2014. However, some of these are missing location data. Hence, we scrape as far back in time as necessary to obtain the desired data set size.

The ratings and genres of each title do not require much cleaning. We merely remove titles with no ratings. However, because the locations of each title are user-contributed, they require much cleaning. For example, there are many cases of multiple similar entries that are worded differently. In some cases, locations are specified down to the street, while others merely state the country. Our first task is therefore to clean the locations variable. After that, we split our data into three sets for training, validation and testing in the ratio 4:1:1.
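The 4:1:1 split can be sketched as follows. This is only a minimal illustration: the function name `split_titles`, the fixed seed and the use of integer indices in place of real titles are all assumptions, not part of the project.

```python
import random

def split_titles(titles, ratios=(4, 1, 1), seed=0):
    """Shuffle the titles and split them into training, validation and test sets."""
    shuffled = list(titles)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is reproducible
    total = sum(ratios)
    n_train = len(shuffled) * ratios[0] // total
    n_val = len(shuffled) * ratios[1] // total
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # any remainder ends up in the test set
    return train, validation, test

# Integer indices stand in for the 12,000 scraped titles.
train, validation, test = split_titles(range(12000))
print(len(train), len(validation), len(test))  # 8000 2000 2000
```

Shuffling before splitting avoids putting all of the most recent titles into a single set, which matters if the data is stored in scraping order.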

In cleaning up the variable "locations", there are two approaches for consideration.

##2.1. Approach 1

In this approach, we restrict "locations" to cities; in particular, cities belonging to the top filming locations. Here, the top $T$ filming locations are the cities that appear most frequently in our data set. To find them, we split each "locations" entry by commas into phrases and collect the distinct phrases in a set. We then remove the elements of this set that are not cities, as defined by [Wikipedia](https://en.wikipedia.org/wiki/List_of_towns_and_cities_with_100,000_or_more_inhabitants/country:_A-B). Finally, we find the $T$ remaining elements that occur most frequently in our data set.
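A minimal sketch of this approach is shown below. The set `KNOWN_CITIES` is a hypothetical stand-in for the Wikipedia-derived city list, and counting each city at most once per title is an assumption about how frequency is measured.

```python
from collections import Counter

# Hypothetical stand-in for the list of cities taken from Wikipedia.
KNOWN_CITIES = {"Boston", "New York", "London"}

def top_cities(location_strings, known_cities, t):
    """Return the t most frequent cities, counting each city once per title."""
    counts = Counter()
    for raw in location_strings:
        # Split on commas and drop duplicates and empty phrases within a title.
        phrases = {p.strip() for p in raw.split(",") if p.strip()}
        # Keep only the phrases that are recognised as cities.
        counts.update(phrases & known_cities)
    return counts.most_common(t)

sample = [
    "Newbury Street, Boston, USA, Boston, USA",
    "Brooklyn, New York, USA",
    "Boston, USA",
    "London, UK",
]
print(top_cities(sample, KNOWN_CITIES, 1))  # [('Boston', 2)]
```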

##2.2. Approach 2

In this approach, we split the entries of "locations" by commas and treat each unique phrase as a distinct location. For example, suppose we have "Newbury Street, Boston, USA, Boston, USA". We split it into "Newbury Street", "Boston" and "USA". Later on, in our analysis, we treat each of these as a distinct location.
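The splitting step in the example above could look like this. The helper name `split_locations` is illustrative only, and keeping the phrases in first-seen order is a design choice rather than a requirement.

```python
def split_locations(raw):
    """Split a raw locations string on commas, keeping unique phrases in first-seen order."""
    unique = []
    for phrase in raw.split(","):
        phrase = phrase.strip()
        if phrase and phrase not in unique:
            unique.append(phrase)
    return unique

print(split_locations("Newbury Street, Boston, USA, Boston, USA"))
# ['Newbury Street', 'Boston', 'USA']
```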

#3. Statistics

##3.1. Summary Statistics of Data Set

To give a better idea of the data set that we are working with, it helps to have some plots of our data. We can look at a histogram of the ratings. Depending on the skewness of the ratings, we can use a threshold to turn ratings into a binary variable "Category". For instance, if half of the titles have ratings above 7.5, we can categorise titles with ratings above 7.5 as "Good" and the rest as "Bad".
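As a rough sketch, the thresholding could be written as below. Falling back to the median when no threshold is given is an assumption that mirrors the "half of the titles" idea, and the sample ratings are made up.

```python
from statistics import median

def categorise(ratings, threshold=None):
    """Label each rating "Good" if it is above the threshold, otherwise "Bad"."""
    if threshold is None:
        threshold = median(ratings)  # splits the titles roughly in half
    return ["Good" if rating > threshold else "Bad" for rating in ratings]

ratings = [6.1, 7.9, 8.3, 5.4, 7.6, 7.2]  # made-up ratings
print(categorise(ratings, threshold=7.5))
# ['Bad', 'Good', 'Good', 'Bad', 'Good', 'Bad']
```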

We can also look at bar plots of the number of titles for each city and each genre. This would give us some sense of how the number of titles varies across cities and genres. We might also want to look at the relation between cities and genres. For example, what is the probability that a title was filmed in New York, conditional on it being a drama? We might be able to write a function that takes parameters $X$ and $Y$, each representing a city or genre, and returns $P(X \mid Y)$.
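One possible shape for such a function is sketched here, assuming each title is represented as a set of tags that mixes its cities and genres. That representation, the function name and the sample data are all hypothetical.

```python
def conditional_probability(titles, x, y):
    """Estimate P(x | y) as the share of titles tagged y that are also tagged x."""
    with_y = [tags for tags in titles if y in tags]
    if not with_y:
        return None  # P(x | y) is undefined when no title has y
    return sum(1 for tags in with_y if x in tags) / len(with_y)

# Made-up titles, each a set of city and genre tags.
titles = [
    {"New York", "Drama"},
    {"New York", "Comedy"},
    {"London", "Drama"},
    {"Boston", "Drama", "Crime"},
]
# One of the three dramas was filmed in New York.
print(conditional_probability(titles, "New York", "Drama"))
```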

We can present some of these results using Tableau. For example, we can present the number of titles per city on a map in Tableau.

##3.2. Logistic Regression

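We wish to study the effects of locations and genre on rating category. We let $Y$ denote Category (with $Y = 1$ for "Good" and $Y = 0$ for "Bad"), $L_c$ the dummy variable for city $c$, and $G_g$ the dummy variable for genre $g$, where there are $C$ cities and $G$ genres.

Our regression equation will take the following form:

$$ P(Y = 1) = F\left(\beta_0 + \sum_{c=1}^{C} \beta_c L_c + \sum_{g=1}^{G} \gamma_g G_g\right) $$

where $\beta_0$ is the intercept parameter, $\beta_c$ and $\gamma_g$ are the coefficients of the city and genre dummy variables, and $F$ is the logistic function defined as: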

$$ F(x) = \frac{e^x}{e^x + 1} $$

After estimating the coefficients, we can remove the statistically insignificant dummy variables, re-estimate the model and evaluate it on the test set. Then, we compare the accuracy of predicting $Y$ on the training and test sets. If the accuracy on the training set is very high while the accuracy on the test set is very low, it suggests that our model is overfitted.
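The fit-and-compare step can be illustrated with a small self-contained sketch. Plain batch gradient descent is used here as a stand-in for whatever estimation routine the project ends up using, significance testing is not shown, and the toy dummy-variable rows and all function names are assumptions.

```python
import math

def logistic(x):
    """Numerically stable logistic function F(x) = e^x / (e^x + 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def predict_proba(weights, row):
    """weights[0] is the intercept; the rest pair up with the dummy variables."""
    return logistic(weights[0] + sum(w * v for w, v in zip(weights[1:], row)))

def fit(rows, labels, steps=2000, rate=0.5):
    """Estimate the intercept and coefficients by gradient descent on the log-loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(weights)
        for row, label in zip(rows, labels):
            error = predict_proba(weights, row) - label
            grad[0] += error
            for j, value in enumerate(row):
                grad[j + 1] += error * value
        weights = [w - rate * g / len(rows) for w, g in zip(weights, grad)]
    return weights

def accuracy(weights, rows, labels):
    """Share of rows whose predicted category (cut-off 0.5) matches the label."""
    hits = sum((predict_proba(weights, r) >= 0.5) == bool(y) for r, y in zip(rows, labels))
    return hits / len(rows)

# Toy data, columns: city dummy (New York), city dummy (London), genre dummy (Drama).
train_rows = [[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0], [1, 0, 1], [0, 1, 0]]
train_labels = [1, 1, 0, 0, 1, 0]
test_rows = [[1, 0, 0], [0, 1, 1]]
test_labels = [1, 0]

weights = fit(train_rows, train_labels)
print("train accuracy:", accuracy(weights, train_rows, train_labels))
print("test accuracy:", accuracy(weights, test_rows, test_labels))
```

A large gap between the two printed accuracies on real data would be the overfitting signal described above; on this toy data the classes are separable by city, so no gap is expected.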

##3.3. K-Nearest Neighbours

Next, we use non-parametric kNN classification, with the distance between titles defined by their locations only. We fit the classifier on the training set and then use the validation set to find the optimal K. Then, we look at the accuracy of prediction on the test set and compare it with the accuracy on the validation set. A significant difference would indicate overfitting as well.
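A minimal sketch of this procedure follows. The text does not fix a distance measure, so Jaccard distance between location sets is used purely as an assumption; the function names and the toy data are likewise hypothetical.

```python
from collections import Counter

def jaccard_distance(a, b):
    """One minus the ratio of shared locations to all locations in the two sets."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def knn_predict(train, query, k):
    """Majority category among the k training titles closest to the query locations."""
    nearest = sorted(train, key=lambda item: jaccard_distance(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, data, k):
    """Share of titles in data whose category is predicted correctly."""
    return sum(knn_predict(train, locs, k) == label for locs, label in data) / len(data)

def best_k(train, validation, candidates):
    """Pick the k with the highest validation accuracy; the smallest k wins ties."""
    return max(candidates, key=lambda k: (accuracy(train, validation, k), -k))

# Toy data: (set of locations, category).
train = [
    ({"Boston", "USA"}, "Good"),
    ({"Boston", "USA"}, "Good"),
    ({"New York", "USA"}, "Good"),
    ({"London", "UK"}, "Bad"),
    ({"London", "UK"}, "Bad"),
    ({"Paris", "France"}, "Bad"),
]
validation = [({"Boston"}, "Good"), ({"London", "UK"}, "Bad")]
test = [({"New York", "USA"}, "Good"), ({"Paris", "France"}, "Bad")]

k = best_k(train, validation, [1, 3, 5])
print("chosen k:", k)
print("validation accuracy:", accuracy(train, validation, k))
print("test accuracy:", accuracy(train, test, k))
```

Odd values of K are used so that a two-category vote cannot tie.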

##3.4. Principal Component Analysis

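Finally, we use PCA for its dimensionality reduction property. Taking the location dummy variables as inputs, PCA produces principal components, each a linear combination of the locations, and tells us how much of the variance in the data each component explains. Components that explain little variance are candidates for removal before classification.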

Alternatively, should we use PCA to reduce the dimensions first and then apply kNN to the reduced data? This remains an open question.
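To make the idea of a component's explained variance concrete, here is a small pure-Python sketch that finds only the first principal component by power iteration. It is an illustration under assumptions, not the project's method: a numerical library would normally be used instead, and the toy dummy-variable rows and function names are made up.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix of the columns of rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centred = [[row[j] - means[j] for j in range(d)] for row in rows]
    return [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)]
            for i in range(d)]

def mat_vec(matrix, vec):
    """Multiply a square matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def first_component(cov, iterations=200):
    """Approximate the leading eigenvector and eigenvalue of cov by power iteration."""
    # Arbitrary start vector; it must not be orthogonal to the leading eigenvector.
    vec = [float(j + 1) for j in range(len(cov))]
    for _ in range(iterations):
        vec = mat_vec(cov, vec)
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    eigenvalue = sum(v * w for v, w in zip(vec, mat_vec(cov, vec)))
    return vec, eigenvalue

# Toy location dummies, columns: Boston, New York, London.
rows = [[1, 1, 1], [1, 1, 0], [0, 0, 1], [0, 0, 0]]
cov = covariance_matrix(rows)
component, eigenvalue = first_component(cov)
total_variance = sum(cov[j][j] for j in range(len(cov)))
explained_ratio = eigenvalue / total_variance
print([round(c, 3) for c in component], round(explained_ratio, 3))
```

On this toy data the first two columns always take the same value, so the first component loads equally on them and accounts for about two thirds of the total variance.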