#Reception of Films from Locations and Genres

##Table of Contents
* [Reception of Films from Locations and Genres](#Reception-of-Films-from-Locations-and-Genres)
    * [1. Overview](#1.-Overview)
    * [2. Data Acquisition & Management](#2.-Data-Acquisition-&-Management)
        * [2.1. Approach 1](#2.1.-Approach-1)
        * [2.2. Approach 2](#2.2.-Approach-2) 
    * [3. Analyses](#3.-Analyses)
        * [3.1. Summary Statistics of Data Set](#3.1.-Summary-Statistics-of-Data-Set)
        * [3.2. Principal Component Analysis](#3.2.-Principal-Component-Analysis)
        * [3.3. Train Test Splitting](#3.3.-Train-Test-Splitting)
        * [3.4. Logistic Regression](#3.4.-Logistic-Regression)
        * [3.5. K-Nearest Neighbours](#3.5.-K-Nearest-Neighbours)


#1. Overview

In this project, we study the relationship between film reception and factors such as filming locations and genres. Data will be scraped from [IMDB](http://www.imdb.com). Principal Component Analysis (PCA) will be used to reduce the dimensionality of our data set. Then, logistic regression and k-Nearest Neighbours (kNN) will be used for classification, and the accuracy of the two classifiers will be compared. Results will be visualised with graphical plots and Tableau.

#2. Data Acquisition & Management

We first scrape IMDB, which gives us the following information about each film title: budget, country, critic ratings, duration, genre, gross earnings, language, location, name, opening weekend earnings, release dates, URL, user ratings, user ratings count and year. Though we initially planned to work with a data set of 12,000 titles, a preliminary test run of PCA with around 18,000 location and genre features indicated that this was infeasible. Instead, we reduce the size of our data to 10,000 titles. To do so, we scrape IMDB for all film titles from 2009 to 2014. Titles without user ratings, film locations and genres are then removed from our data set. Subsequently, we pick 10,000 titles randomly from the remaining pool.

First, we convert user ratings to a binary feature where "1" indicates that the film was well-received and "0" indicates otherwise. The threshold rating for a well-received film is neither fixed at "5" nor chosen at random. Instead, we look at the spread of user ratings across our data: films with ratings in the top 50% are assigned "1" while films in the bottom 50% are assigned "0". Though this conversion of user ratings might raise problems, the intention is to have a balanced number of titles in both classes. Additionally, the technique can be applied to larger data sets, where the threshold is again determined by the data at hand.
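The top 50% / bottom 50% split amounts to thresholding at the median rating. A minimal sketch of that idea, using only the standard library, is given here. The function name `binarise_ratings` and the sample ratings are hypothetical, and counting a rating equal to the median as "1" is an assumption that the text above does not specify.

```python
from statistics import median

def binarise_ratings(ratings):
    """Label each rating 1 if it is at or above the median, else 0."""
    threshold = median(ratings)
    # Assumption: a rating exactly equal to the median counts as well-received.
    return [1 if rating >= threshold else 0 for rating in ratings]

# Made-up user ratings, for illustration only.
sample_ratings = [3.2, 5.0, 6.1, 6.8, 7.4, 8.9]
labels = binarise_ratings(sample_ratings)
print(labels)
```

With many tied ratings around the median, the two classes will not be exactly equal in size, so the balance should be checked on the real data.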

For locations, substantial data cleaning is required, largely because film locations are user-contributed. We therefore face problems such as spelling mistakes, multiple entries of similar details, and details with trivial differences that are irrelevant to our study. For example, some locations are specific down to the street while others merely state the country. In cleaning up the feature "locations", we considered two approaches, which are elaborated below.

##2.1. Approach 1

We restrict "locations" to cities and countries. Using a list of cities that we acquired online and a list of countries across the world, we split the "locations" entry of each title at the commas into phrases. We do this because most entries take the form (specific location), (city), (country), although various titles have anomalous or repeated entries. Phrases that appear in the list of cities or countries are kept while other location details are discarded. This resulted in some problems; one particularly interesting issue was that many countries share their names with cities.

For example, China is also the name of a city in Texas. As a result, a film shot in China, Texas would be recorded as being filmed in the country China as well. However, since cities that share a name with a country are not widely known, we assume that such occurrences are rare anomalies. In addition, including countries mitigates this problem: our hypothetical film will be recorded as being filmed in both China and the USA. The USA entry is correct, and only the China entry is erroneous.
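The filtering step of Approach 1, including the China, Texas ambiguity, can be illustrated with a short sketch. The function name `keep_known_places` and the tiny lookup sets are hypothetical stand-ins for the real city and country lists, which are not reproduced here.

```python
def keep_known_places(raw_locations, cities, countries):
    """Split a raw location string at commas and keep known cities or countries."""
    kept = []
    for phrase in raw_locations.split(","):
        phrase = phrase.strip()
        is_known = phrase in cities or phrase in countries
        if is_known and phrase not in kept:
            kept.append(phrase)
    return kept

# Hypothetical lookup sets, for illustration only.
cities = {"Boston", "Beijing", "China"}  # "China" is also a city in Texas
countries = {"USA", "China"}

kept = keep_known_places("Main Street, China, Texas, USA", cities, countries)
print(kept)
```

Here the phrase "China" is kept, but nothing in the result says whether it matched as a city or as a country, which is exactly the ambiguity described above.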

While Approach 2 below was also considered, Approach 1 was taken instead due to the complexity of the latter.


##2.2. Approach 2

Here, we split the entries of "locations" by commas and treat each unique phrase as a distinct location. For example, suppose we have "Newbury Street, Boston, USA, Boston, USA". We split them into "Newbury Street", "Boston" and "USA". From a data set of around 18,000 titles, we end up with around 19,000 distinct locations as features. Then, we rely on PCA to reduce the number of features into principal components.
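A minimal sketch of this splitting step, and of turning the resulting phrases into one indicator feature per distinct location, might look as follows. The names `split_locations`, `raw_entries` and the second film entry are hypothetical, and the real pipeline may differ in details such as case handling.

```python
def split_locations(raw_locations):
    """Return the unique comma-separated phrases in first-seen order."""
    phrases = []
    for phrase in raw_locations.split(","):
        phrase = phrase.strip()
        if phrase and phrase not in phrases:
            phrases.append(phrase)
    return phrases

# Hypothetical raw entries keyed by title.
raw_entries = {
    "Film A": "Newbury Street, Boston, USA, Boston, USA",
    "Film B": "Paris, France",
}

locations = {title: split_locations(text) for title, text in raw_entries.items()}

# Every distinct phrase across all titles becomes one feature column.
vocabulary = sorted({p for phrases in locations.values() for p in phrases})
features = {
    title: [int(place in phrases) for place in vocabulary]
    for title, phrases in locations.items()
}
print(vocabulary)
print(features)
```

Because every distinct phrase becomes a column, the number of features grows quickly with the number of titles.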

However, preliminary tests showed that the computational demands of this approach were too great. We intended PCA to reduce the features to the number of components that explains 90% of the variation in our features. While it was possible to find that number with our data set, the first few test runs simply took too long, prompting us to turn to the first approach instead.

#3. Analyses

##3.1. Summary Statistics of Data Set

To give a better picture of our data set, we will produce some graphical plots of our data, such as histograms across countries and genres. The large number of cities in our data means that a histogram across all cities should be avoided, but we can still study the frequency of the top few cities. Film reception can also be presented as a bar plot.

Potentially, we might want to look at the relation between locations and genres. For example, what is the probability that a title was filmed in New York, conditional on it being a drama? We might be able to write a function that takes parameters $X$ and $Y$, each representing a feature (a city, country or genre), and returns $Probability(X|Y)$.
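One possible shape for such a function is sketched here, estimating $Probability(X|Y)$ as the share of titles with feature $Y$ that also have feature $X$. The function name `conditional_probability`, the representation of each title as a set of feature names, and the sample titles are all assumptions made for illustration.

```python
def conditional_probability(titles, x, y):
    """Estimate P(x | y) as the share of titles with feature y that also have x."""
    with_y = [features for features in titles if y in features]
    if not with_y:
        return None  # P(x | y) is undefined when no title has feature y
    with_both = sum(1 for features in with_y if x in features)
    return with_both / float(len(with_y))

# Made-up titles, each represented as a set of city, country and genre names.
titles = [
    {"New York", "USA", "Drama"},
    {"London", "UK", "Drama"},
    {"New York", "USA", "Comedy"},
    {"New York", "USA", "Drama", "Crime"},
]

p = conditional_probability(titles, "New York", "Drama")
print(p)
```

In this toy data, three titles are dramas and two of those were filmed in New York, so the estimate is two thirds.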

Other visualizations will be done using Tableau. For example, we can present the number of titles per city on a map in Tableau since we have city coordinates from the list of cities acquired.

##3.2. Principal Component Analysis

We begin the analysis by using PCA to reduce the number of dimensions (features) in the data. Our variation cutoff will be 5%. So if the variation explained falls below 5% after a certain component, we will remove all remaining dimensions.

In [None]:
from sklearn.decomposition import PCA

#Replace 60 with desired n later
pca = PCA(n_components=60)

#X should consist only of independent variables, leave out Y
new_X = pca.fit_transform(df[X])

In [None]:
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum()) #This should be above 90% at desired n

In [None]:
#Loop to find the smallest n_components that explains more than 95% of variation

N_pca = None #Stays None if no n in the range reaches the cutoff
for n in range(1, 1001):
    pca = PCA(n_components=n)
    new_X = pca.fit_transform(df[X])
    if pca.explained_variance_ratio_.sum() > 0.95:
        N_pca = n
        print("Optimal n:", n)
        break

In [None]:
#Apply N_pca to data set
pca = PCA(n_components=N_pca)

new_X = pca.fit_transform(df[X])

In [None]:
#The transformed data

new_X

This gives us a data set with reduced dimension. Now, the features are the principal components.
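The number of components can also be read off a single fit, since `explained_variance_ratio_` lists the components in decreasing order of explained variance. The sketch here uses only the standard library. The helper `components_needed` and the example ratios are hypothetical; in practice the ratios would come from one fit with all components, for instance `PCA().fit(df[X]).explained_variance_ratio_`.

```python
from itertools import accumulate

def components_needed(variance_ratios, threshold=0.95):
    """Return the smallest n whose first n ratios sum to more than threshold.

    Returns None if the threshold is never exceeded.
    """
    for n, cumulative in enumerate(accumulate(variance_ratios), start=1):
        if cumulative > threshold:
            return n
    return None

# Made-up ratios standing in for pca.explained_variance_ratio_.
example_ratios = [0.5, 0.25, 0.125, 0.0625, 0.0625]

n_for_90 = components_needed(example_ratios, threshold=0.9)
n_for_95 = components_needed(example_ratios)
print(n_for_90, n_for_95)
```

scikit-learn's `PCA` also accepts a fraction, such as `PCA(n_components=0.95, svd_solver="full")`, which selects the number of components for that share of variance without a manual search.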

##3.3. Train Test Splitting

Here, we split our data set into training and test sets. Our data set consists of 10,000 titles. We want training and testing data in the ratio 3:1. So, the sizes of the two sets will be 7,500 and 2,500, respectively.

In [None]:
from sklearn.model_selection import train_test_split
#new_X holds the principal components from section 3.2
Xlr, Xtestlr, ylr, ytestlr = train_test_split(new_X, df[Y], train_size=0.75, random_state=1)

##3.4. Logistic Regression

We wish to study the effects of locations and genres on film reception. Since we have reduced locations and genres to principal components, we perform a logistic regression of film reception on the principal components. We denote $Y$ = Reception and $PC_c$ = the score on principal component $c$.

Our regression equation will take the following form:

$$ P(Y_i=1) = F(\beta_0 + \delta_1 PC_{1i} + \delta_2 PC_{2i} + \dots + \delta_C PC_{Ci})$$

where $\beta_0$ is the intercept parameter, $\delta_c$ are the slope parameters for the principal components and $F$ is the logistic function defined as:

$$ F(x) = \frac{e^x}{e^x + 1} $$
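As a worked illustration of the two formulas, the sketch here evaluates $P(Y_i=1)$ for a single title. The coefficient values and component scores are made up for the example and are not estimates from our data; the function names are likewise hypothetical.

```python
import math

def logistic(x):
    """F(x) = e^x / (e^x + 1), written in the equivalent form 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def predict_probability(intercept, slopes, components):
    """P(Y=1) for one title, given its principal component scores."""
    linear = intercept + sum(d * pc for d, pc in zip(slopes, components))
    return logistic(linear)

# Made-up parameters: beta_0 and (delta_1, delta_2), with scores (PC_1, PC_2).
beta_0 = 0.0
deltas = [1.0, -2.0]
scores = [2.0, 1.0]

# The linear term is 0 + 1*2 - 2*1 = 0, so the probability is F(0).
probability = predict_probability(beta_0, deltas, scores)
print(probability)
```

A title is then classified as well-received when this probability exceeds a cutoff, commonly 0.5.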

After estimating the coefficients, we apply the coefficient estimates on the test set. Then, we compare the accuracy of predicting $Y$ on both the training and test data.

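In the code below, we first find the best value of the parameter C for the logistic regression from the training set using 5-fold cross-validation. Here C is the inverse regularisation strength of scikit-learn's `LogisticRegression`, not the number of components $C$ in the equation above.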

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

clf=LogisticRegression()
parameters = {"C": [0.0001, 0.001, 0.1, 1, 10, 100]}
fitmodel = GridSearchCV(clf, param_grid=parameters, cv=5, scoring="accuracy")
fitmodel.fit(Xlr, ylr)
fitmodel.best_estimator_, fitmodel.best_params_, fitmodel.best_score_, fitmodel.cv_results_

In the following code, we perform the logistic regression using the best parameter C, fit our training set and then test on the test data.

In [None]:
clfl2=LogisticRegression(C=fitmodel.best_params_['C'])
clfl2.fit(Xlr, ylr)
ypred2=clfl2.predict(Xtestlr) #Predicted labels for the test set
accuracy_score(ytestlr, ypred2) #Same value as clfl2.score(Xtestlr, ytestlr)

##3.5. K-Nearest Neighbours

Next, we use the non-parametric kNN classifier, with distance measured in the space of the principal components. We fit it on the training data and use cross-validation to find the optimal K. Then, we look at the prediction accuracy on the test data and compare it with the accuracy on the training data.
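To make the idea concrete, here is a minimal pure-Python sketch of kNN prediction by majority vote with Euclidean distance. It is only an illustration of the technique, not the scikit-learn implementation used in this section, and the function names and toy points are hypothetical.

```python
import math
from collections import Counter

def euclidean(p, q):
    """Euclidean distance between two points of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_predict(train_points, train_labels, query, k):
    """Predict the majority label among the k training points nearest to query."""
    by_distance = sorted(
        (euclidean(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy points in a two-component space, with made-up reception labels.
train_points = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1), (1.2, 0.8)]
train_labels = [0, 0, 1, 1, 1]

near_origin = knn_predict(train_points, train_labels, (0.2, 0.1), k=3)
near_ones = knn_predict(train_points, train_labels, (1.0, 0.9), k=3)
print(near_origin, near_ones)
```

Using an odd K with two classes avoids tied votes, which is one reason to search over odd values only.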

This is the code for performing kNN classification:

In [None]:
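from sklearn.neighbors import KNeighborsClassifier

#Reuse the split from section 3.3 under the names used in this section
Xtrain, Xtest, ytrain, ytest = Xlr, Xtestlr, ylr, ytestlr

nbrs = 5 #Placeholder number of neighbours; tuned by cross-validation below
clf = KNeighborsClassifier(n_neighbors=nbrs)
clf = clf.fit(Xtrain, ytrain)
accuracy = clf.score(Xtrain, ytrain) #Accuracy on the training data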

This is the code to find the optimal k neighbours from training set:

In [None]:
gs = GridSearchCV(KNeighborsClassifier(), param_grid={"n_neighbors": range(1,40,2)}, cv=5)
gs.fit(Xtrain, ytrain)
gs.best_params_

Then we apply the classifier with the optimal K to the test set and compare its accuracy with that of the logistic regression.

In [None]:
clfknn=KNeighborsClassifier(n_neighbors=gs.best_params_["n_neighbors"])
clfknn.fit(Xtrain, ytrain)
clfknn.score(Xtest, ytest)


In [None]:
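import matplotlib.pyplot as plt

#points_plot is a helper assumed to be defined elsewhere; it is not part of matplotlib or scikit-learn
plt.figure()
ax=plt.gca()
points_plot(ax, Xtrain, Xtest, ytrain, ytest, clfknn.fit(Xtrain, ytrain), alpha=0.3, psize=20)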