#Ratings from Locations and Genre

##Table of Contents
* [Ratings from Locations and Genre](#Ratings-from-Locations-and-Genre)
    * [1. Overview](#1.-Overview)
    * [2. Data Acquisition & Management](#2.-Data-Acquisition-&-Management)
        * [2.1. Approach 1](#2.1.-Approach-1)
        * [2.2. Approach 2](#2.2.-Approach-2) 
    * [3. Statistics](#3.-Statistics)
        * [3.1. Summary Statistics of Data Set](#3.1.-Summary-Statistics-of-Data-Set)
        * [3.2. Principal Component Analysis](#3.2.-Principal-Component-Analysis)
        * [3.3. Train Test Splitting](#3.3.-Train-Test-Splitting)
        * [3.4. Logistic Regression](#3.4.-Logistic-Regression)
        * [3.5. K-Nearest Neighbours](#3.5.-K-Nearest-Neighbours)


#1. Overview

In this project, we study the relationship between film locations and other factors like ratings and genres. Data will be scraped from [IMDB](http://www.imdb.com). Statistical and machine learning analytical tools will be used in our study. Visualisation of results will be done via graphical plots and Tableau.

#2. Data Acquisition & Management

We first scrape data from IMDB, which provides the following information for each movie: budget, country, critic ratings, duration, genre, gross earnings, language, location, name, opening weekend earnings, release dates, URL, user ratings, user ratings count and year. Although we initially wanted a data set of 12,000 titles, we lowered that number to keep the project feasible, so we focus on 6,000 random samples of the most recent titles. As an example, there are 8,997 titles in 2014. However, some of these are missing location data, so we scraped as far back as we could to obtain the desired data set size.

The ratings and genres of each title do not require much cleaning; we merely remove titles with no ratings. The locations of each title, however, are user-contributed and require substantial cleaning. For example, the same place often appears in several differently worded entries, and some locations are specific down to the street while others merely state the country. Our first task is therefore to clean the locations variable. After that, we split our data into two sets for training and testing in the ratio 3:1.

In cleaning up the variable "locations", there are two approaches for consideration.

##2.1. Approach 1

In this approach, we restrict "locations" to cities, in particular those among the top filming locations. The top $T$ filming locations are the cities that appear most frequently in our data set. We split each "locations" entry by commas into phrases and add them to a set. Then, we remove the elements of this set that are not cities, as defined by [Wikipedia](https://en.wikipedia.org/wiki/List_of_towns_and_cities_with_100,000_or_more_inhabitants/country:_A-B). Finally, we find the $T$ most frequent of the remaining elements in our data set.
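
The counting step could be sketched roughly as follows. This is only an illustration: it assumes each title's "locations" field is a comma-separated string and that the set of valid city names has already been extracted from the Wikipedia list. The names `top_cities` and `known_cities` and the sample records are hypothetical.

```python
from collections import Counter

def top_cities(locations_per_title, known_cities, T):
    """Return the T most frequent city names across all titles."""
    counts = Counter()
    for locations in locations_per_title:
        # Split on commas, trim whitespace, and drop empty phrases
        phrases = {p.strip() for p in locations.split(",") if p.strip()}
        # Keep only phrases recognised as cities; count each city once per title
        counts.update(phrases & known_cities)
    return [city for city, _ in counts.most_common(T)]

# Hypothetical sample data, not taken from the scraped data set
sample = [
    "Newbury Street, Boston, USA",
    "Boston, USA",
    "Times Square, New York, USA",
]
cities = {"Boston", "New York", "London"}
print(top_cities(sample, cities, T=1))
```

Counting each city at most once per title is a choice made for this sketch; counting every mention would instead favour titles with many duplicate entries.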

##2.2. Approach 2

In this approach, we split the entries of "locations" by commas and treat each unique phrase as a distinct location. For example, suppose we have "Newbury Street, Boston, USA, Boston, USA". We split it into "Newbury Street", "Boston" and "USA". Later on, in our analysis, we treat each of these as a distinct location.

We will proceed with this approach so as to avoid mistakenly introducing bias into our analysis. PCA will be used to reduce the number of locations and genres later.
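
A minimal sketch of this splitting step, and of turning the resulting phrases into 0/1 dummy variables, might look like the following. The helper names `split_locations` and `to_dummies` are hypothetical, and the sketch assumes whitespace around phrases is not meaningful.

```python
def split_locations(locations):
    """Split a raw "locations" string into unique phrases, keeping first-seen order."""
    seen = []
    for phrase in locations.split(","):
        phrase = phrase.strip()
        if phrase and phrase not in seen:
            seen.append(phrase)
    return seen

def to_dummies(titles):
    """Build one 0/1 indicator column per distinct location phrase."""
    per_title = [split_locations(t) for t in titles]
    columns = sorted({p for phrases in per_title for p in phrases})
    rows = [[int(c in phrases) for c in columns] for phrases in per_title]
    return columns, rows

print(split_locations("Newbury Street, Boston, USA, Boston, USA"))
# ['Newbury Street', 'Boston', 'USA']
```

Each row returned by `to_dummies` can then serve as the location part of the feature vector for one title.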

#3. Statistics

##3.1. Summary Statistics of Data Set

To give a better idea of the data set that we are working with, it helps to plot the data. We can look at a histogram of the ratings. Depending on the skewness of the ratings, we can use a threshold to convert them into a binary variable "Category". For instance, if half of the titles have ratings above 7.5, we can categorise titles with ratings above 7.5 as "Good" and the rest as "Bad".
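
As a rough illustration of the thresholding idea, the sketch below labels ratings relative to a threshold and falls back to the median when none is given. Using the median as the default, and sending ratings exactly at the threshold to "Bad", are assumptions of this sketch; the function name and sample ratings are hypothetical.

```python
from statistics import median

def categorise(ratings, threshold=None):
    """Label each rating "Good" if strictly above the threshold, else "Bad".

    If no threshold is given, the median rating is used (an assumption)."""
    if threshold is None:
        threshold = median(ratings)
    labels = ["Good" if r > threshold else "Bad" for r in ratings]
    return threshold, labels

# Hypothetical ratings, not taken from the scraped data set
ratings = [5.1, 6.4, 7.5, 8.0, 8.8]
print(categorise(ratings))
```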

We can also look at bar plots of the number of titles for each city and each genre. This gives us a sense of how the counts differ across cities and genres. We might also want to look at the relation between cities and genres. For example, what is the probability that a title was filmed in New York, conditional on it being a drama? We might be able to write a function that takes parameters $X$ and $Y$, each representing a city or genre, and returns $P(X|Y)$.
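
One possible shape for such a function is sketched below. It assumes each title is represented as a set of tags (its cities and genres) and estimates $P(X|Y)$ as a simple relative frequency; the function name and the sample titles are hypothetical.

```python
def conditional_probability(titles, x, y):
    """Estimate P(x | y) as the share of titles tagged y that are also tagged x.

    Each title is represented as a set of tags (cities and genres)."""
    with_y = [tags for tags in titles if y in tags]
    if not with_y:
        return None  # P(x | y) is undefined when no title has y
    return sum(1 for tags in with_y if x in tags) / len(with_y)

# Hypothetical titles, not taken from the scraped data set
titles = [
    {"New York", "Drama"},
    {"New York", "Comedy"},
    {"Boston", "Drama"},
    {"London", "Drama", "Comedy"},
]
print(conditional_probability(titles, "New York", "Drama"))
```

In this toy sample, one of the three dramas was filmed in New York, so the estimate is one third.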

We can present some of these results using Tableau. For example, we can present the number of titles per city on a map in Tableau.

##3.2. Principal Component Analysis

We begin the analysis by using PCA to reduce the number of dimensions of the independent variables. We keep the smallest number of components whose cumulative explained variance exceeds 95%, and discard the remaining components, which together account for less than 5% of the variation. Then, we split our data set into train and test sets.

In [None]:
from sklearn.decomposition import PCA

#Replace 60 with desired n later
pca = PCA(n_components=60)

#X should consist only of independent variables, leave out Y
new_X = pca.fit_transform(df[X])

In [None]:
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum()) #This should be above 95% at desired n

In [None]:
#Find the smallest n_components that explains more than 95% of variation
import numpy as np

#Fit once with all components, then accumulate the explained variance
pca = PCA()
pca.fit(df[X])
cum_var = np.cumsum(pca.explained_variance_ratio_)
N_pca = int(np.argmax(cum_var > 0.95)) + 1
print("Optimal n:", N_pca)

In [None]:
#Apply N_pca to data set
pca = PCA(n_components=N_pca)

new_X = pca.fit_transform(df[X])

In [None]:
#The N components

pca.components_

This gives us the principal components for the optimal n.

##3.3. Train Test Splitting

Here, we split our data set into train and test sets.

In [None]:
from sklearn.model_selection import train_test_split

#To split the PCA-reduced features instead, pass new_X in place of df[X]
Xlr, Xtestlr, ylr, ytestlr = train_test_split(df[X], df[Y], train_size=0.75, random_state=1)

##3.4. Logistic Regression

We wish to study the effects of locations and genre on category. We denote $ Y = $ Category, $L_c$ = Dummy Variable for City $c$, $G_g$ = Dummy Variable for Genre $g$.

Our regression equation will take the following form:

$$ P(Y_i=1) = F(\beta_0 + \delta_1 L_{1i} + \delta_2 L_{2i} + ... + \delta_C L_{Ci} + \gamma_1 G_{1i} + \gamma_2 G_{2i} + ... + \gamma_G G_{Gi})$$

where $\beta_0$ is the intercept parameter, $\delta_c$ are the slope parameters for the locations, $\gamma_g$ are the slope parameters for the genres and $F$ is the logistic function defined as:

$$ F(x) = \frac{e^x}{e^x + 1} $$
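
To make the formula concrete, the sketch below evaluates $F$ and the resulting $P(Y_i=1)$ for a single title. The coefficient values and the helper name `predict_probability` are made up for illustration and are not estimates from our data.

```python
import math

def logistic(x):
    """F(x) = e^x / (e^x + 1), written in a form that avoids overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (z + 1.0)

def predict_probability(intercept, coefs, dummies):
    """P(Y=1) for one title given its 0/1 location and genre dummies."""
    linear = intercept + sum(b * d for b, d in zip(coefs, dummies))
    return logistic(linear)

# Hypothetical coefficients: intercept, one city dummy and one genre dummy
print(predict_probability(-0.5, [1.2, -0.7], [1, 0]))
```

Here the title has the city dummy switched on and the genre dummy switched off, so the linear index is $-0.5 + 1.2 = 0.7$ and the predicted probability is $F(0.7)$.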

After estimating the coefficients on the training set, we apply the coefficient estimates to the test set. Then, we compare the accuracy of predicting $Y$ on the training and test sets. If the accuracy on the training set is very high while the accuracy on the test set is very low, it suggests that our model is overfitted.

In the code below, we first find the best value of the regularisation parameter C for the logistic regression, using 5-fold cross-validation on the training set.

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

clf = LogisticRegression()
parameters = {"C": [0.0001, 0.001, 0.1, 1, 10, 100]}
fitmodel = GridSearchCV(clf, param_grid=parameters, cv=5, scoring="accuracy")
fitmodel.fit(Xlr, ylr)
fitmodel.best_estimator_, fitmodel.best_params_, fitmodel.best_score_, fitmodel.cv_results_

In the following code, we fit the logistic regression on the training set using the best parameter C and then evaluate it on the test set.

In [None]:
clfl2 = LogisticRegression(C=fitmodel.best_params_['C'])
clfl2.fit(Xlr, ylr)
ypred2 = clfl2.predict(Xtestlr)
#accuracy_score expects (y_true, y_pred)
accuracy_score(ytestlr, ypred2)

##3.5. K-Nearest Neighbours

Next, we use non-parametric kNN classification, with distance defined by the location and genre variables. We fit the classifier on the training set and use cross-validation to find the optimal K. Then, we look at the accuracy of prediction on the test set and compare it with the accuracy on the training set. As with the logistic regression, a significant difference would indicate overfitting.

This is the code for performing kNN classification:

In [None]:
from sklearn.neighbors import KNeighborsClassifier

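import matplotlib.pyplot as plt

def knn_classify(X, y, nbrs, plotit=True):
    #Fit kNN with nbrs neighbours and score it on the data it was fitted on
    clf = KNeighborsClassifier(n_neighbors=nbrs)
    clf = clf.fit(X, y)
    accuracy = clf.score(X, y)
    if plotit:
        print("Accuracy: %0.2f" % (accuracy))
        plt.figure()
        ax = plt.gca()
        #points_plot is a plotting helper assumed to be defined elsewhere in the notebook;
        #Xlr, Xtestlr, ylr, ytestlr come from the train/test split in section 3.3
        points_plot(ax, Xlr, Xtestlr, ylr, ytestlr, clf, alpha=0.3, psize=20)
    return nbrs, accuracy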

This is the code to find the optimal number of neighbours k from the training set:

In [None]:
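from sklearn.model_selection import GridSearchCV

def optimize_nbrs(clf, parameters, Xtrain, ytrain, n_folds=5):
    #Cross-validated grid search over the candidate parameters, e.g. {"n_neighbors": [...]}
    gs = GridSearchCV(clf, param_grid=parameters, cv=n_folds)
    gs.fit(Xtrain, ytrain)
    return gs.best_params_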

Then we apply the tuned kNN classifier to the test set and compare its accuracy with that of the logistic regression.
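
The comparison could be sketched end to end as follows. Since the scraped data set is not available here, the sketch uses synthetic data from `make_classification` as a stand-in; the candidate values of k and the variable names are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the dummy-variable data set (an assumption)
X, y = make_classification(n_samples=400, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=1)

# Choose k by 5-fold cross-validation on the training set only
gs = GridSearchCV(KNeighborsClassifier(),
                  param_grid={"n_neighbors": [1, 3, 5, 11, 21]}, cv=5)
gs.fit(X_train, y_train)
knn = gs.best_estimator_  # refitted on the full training set with the best k

logit = LogisticRegression().fit(X_train, y_train)

# Record (train accuracy, test accuracy) for each model
results = {}
for name, model in [("kNN", knn), ("logistic", logit)]:
    results[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print("%s: train=%.2f test=%.2f" % ((name,) + results[name]))
```

A large gap between the train and test figures for either model would point to overfitting, while the two test figures give the comparison between kNN and logistic regression.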