# Discussion about the imbalanced nature of the data and how you want to address it
Imbalance in classification problems arises when the classes in our training set occur with very different frequencies. For example, say in a simple case we have two classes, A and B, and 100 training points, of which 99 are class A and 1 is class B. A classifier can then reach 99% training accuracy by always predicting class A while learning almost nothing about class B, so its predictions for the minority class will suffer badly. While this is an extreme case, we see the same problem in our current movie dataset.

For the IMDb5000 dataset from Kaggle in particular, the movies are very imbalanced across all the combinations of genres, with a few combinations accounting for the vast majority of the movies in the dataset and the rest of the genre combinations appearing only once or twice.
[Insert histogram to show that the data is very right-skewed]. 
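
As a rough illustration of how the skew could be measured, the sketch below counts how often each genre combination occurs. The sample strings are made up, and the pipe-delimited format is an assumption about how genres are stored.

```python
from collections import Counter

# Hypothetical sample of pipe-delimited genre strings, one per movie.
genre_strings = [
    "Action|Adventure",
    "Drama",
    "Drama",
    "Comedy|Romance",
    "Drama",
    "Action|Adventure",
    "Documentary|Music",
]

def combination_counts(strings):
    """Count how often each unordered genre combination occurs."""
    return Counter(frozenset(s.split("|")) for s in strings)

counts = combination_counts(genre_strings)

# Sort combinations from most to least frequent to expose the long tail.
ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
for combo, n in ranked:
    print(sorted(combo), n)
```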

There are two main approaches to deal with imbalanced multi-label datasets: (1) classifier-dependent solutions, where the training algorithm itself accounts for the imbalance, and (2) preprocessing techniques, which are applied to the data before training the classifiers.

However, the simplest (and perhaps best) way to overcome imbalance is to collect balanced data in the first place. For instance, we could use the IMDbPY Python package to scrape the metadata of 100 movies from each of the 26 genres. Thus, instead of addressing imbalance during our model-building process, we can ensure that imbalance is never a problem to begin with.
# Description of your data
Our data currently consists of a pandas DataFrame in which each row represents one movie in our training set. Each row contains the values of our predictor variables for that movie as well as one indicator variable per genre, which is 1 if the movie falls within the corresponding genre and 0 otherwise.

We generated these indicator variables using one-hot encoding of each movie’s listed genres against all possible genres we observed in our dataset.
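
A minimal sketch of this encoding step is shown below. The toy frame, the column names `title` and `genres`, and the pipe-delimited genre format are assumptions for illustration, not our actual schema.

```python
import pandas as pd

# Hypothetical toy frame; "title" and "genres" are assumed column names.
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "genres": ["Action|Adventure", "Drama", "Action|Drama|Romance"],
})

# One indicator column per genre observed anywhere in the data.
indicators = movies["genres"].str.get_dummies(sep="|")

# Keep the list-of-genres form next to the indicator columns.
movies["genre_list"] = movies["genres"].str.split("|")
encoded = pd.concat([movies, indicators], axis=1)
print(encoded)
```

Because a movie can set several indicators to 1 at once, each row is a multi-hot vector rather than a strict one-hot vector.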

## Possible Additional Data Sources

Additional textual data could be another source of rich features. Specifically, we are thinking of using critic reviews from [Rotten Tomatoes](https://www.blog.pythonlibrary.org/2013/11/06/python-101-how-to-grab-data-from-rottentomatoes/) and film information (i.e., synopses) from [Wikipedia](https://www.programmableweb.com/api/wikipedia). We could then generate additional low-dimensional features for each movie through topic modelling techniques such as [Latent Dirichlet Allocation](http://www.cs.columbia.edu/~blei/papers/BleiNgJordan2003.pdf) (LDA) or document embedding via [doc2vec](https://cs.stanford.edu/~quocle/paragraph_vector.pdf).
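
To make the LDA idea concrete, here is a small sketch using scikit-learn. The documents are invented stand-ins for scraped synopses, and the choice of 3 topics is arbitrary.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for scraped synopses or reviews.
documents = [
    "a detective investigates a murder in a small town",
    "two strangers fall in love during a summer in paris",
    "a soldier leads a dangerous mission behind enemy lines",
    "a detective hunts a killer while falling in love",
]

# Bag-of-words counts are the usual input to LDA.
vectorizer = CountVectorizer(stop_words="english")
word_counts = vectorizer.fit_transform(documents)

# n_components is the number of topics; 3 is an arbitrary choice here.
lda = LatentDirichletAllocation(n_components=3, random_state=0)
topic_features = lda.fit_transform(word_counts)

# Each row is one document's topic mixture, usable as extra predictors.
print(topic_features.shape)
```
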
# What does your choice of Y look like?
We represent the Y variable as multiple labels per observation, because each movie can have more than one genre. For example, a movie can have genres Mystery, Romance, and Action all at once. This makes the task a multi-label classification problem. We choose to represent the Y variable both as a column of lists of genres for each movie in the data table and as a one-hot encoded representation with one indicator variable per genre. Both representations have their pros and cons, and the software packages we plan to use take one or the other as input.

We plan on using both problem transformation approaches and algorithm adaptation approaches to perform multi-label classification. We plan on using modern variations of the Binary Relevance (BR) method (commonly referred to as One vs. Rest), like Stacked BR or Classifier Chains. We will also use two variations of the Label Powerset (LP) method: Random k-Labelsets (RAkEL) and Ensembles of Pruned Sets (EPS). RAkEL and EPS help correct the tendency of LP to produce too many classes. Between these methods, we anticipate RAkEL and EPS will outperform the BR-based methods, as seen in prevailing academic literature (see [here](https://books-google-com.ezp-prod1.hul.harvard.edu/books?hl=en&lr=&id=OHp3sRnZD-oC&oi=fnd&pg=PA325&dq=multilabel+classification&ots=oESKwHmxa7&sig=ijM-CxoUylvxWfAIQevKFyIzPE4#v=onepage&q=multilabel%20classification&f=false) and [here](https://pdfs.semanticscholar.org/97b3/a052f93ad52a2c6e46be89c5e134c4ec6bf8.pdf)). We may also use other problem transformation methods like Calibrated Label Ranking (CLR) or Copy-Weight Classifiers (CW) if time permits. We plan on using the multi-label adapted algorithms for KNN, decision trees / random forests / boosting, and SVMs.
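
As a sketch of the two BR-style approaches, the example below fits plain Binary Relevance and a classifier chain with scikit-learn. The data is synthetic, logistic regression as the base learner is only a placeholder, and the LP-based methods are not shown.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import ClassifierChain

# Synthetic stand-in data: 200 "movies", 5 features, 3 "genres".
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
Y = np.column_stack([
    (X[:, 0] > 0).astype(int),
    (X[:, 0] + X[:, 1] > 0).astype(int),  # correlated with the first label
    (X[:, 2] > 0.5).astype(int),
])
X_train, X_test, Y_train, Y_test = X[:150], X[150:], Y[:150], Y[150:]

# Binary Relevance: one independent classifier per label.
br = OneVsRestClassifier(LogisticRegression(max_iter=1000))
br.fit(X_train, Y_train)

# Classifier chain: each classifier also sees the earlier labels as features.
chain = ClassifierChain(LogisticRegression(max_iter=1000), order=[0, 1, 2], random_state=0)
chain.fit(X_train, Y_train)

br_pred = br.predict(X_test)
chain_pred = chain.predict(X_test)
print(br_pred.shape, chain_pred.shape)
```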

To evaluate our multi-label classifier, we intend to use a suite of metrics, including Hamming Loss, 0/1 Loss (implemented as Exact Match), Coverage, Ranking Loss, Macro and Micro AUC, Macro and Micro F1, and others. We chose to use multiple metrics because each of them has its own advantages (see [here](https://users.ics.aalto.fi/jesse/talks/Multilabel-Part01.pdf) and [here](https://books-google-com.ezp-prod1.hul.harvard.edu/books?hl=en&lr=&id=1bpEifVEi2MC&oi=fnd&pg=PA64&dq=multi+label+classification&ots=WyK60fxhME&sig=yT5pOsPZYQzh7sTMHaZ1-7hbXe4#v=onepage&q=multi%20label%20classification&f=false)).
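
Most of these metrics are available in scikit-learn. The sketch below computes them on made-up labels and scores for 4 movies and 3 genres; the 0.5 cut-off used to turn scores into predictions is an assumption.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, coverage_error, f1_score,
    hamming_loss, label_ranking_loss, roc_auc_score,
)

# Hypothetical ground truth and model scores for 4 movies and 3 genres.
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y_score = np.array([
    [0.9, 0.2, 0.6],
    [0.1, 0.8, 0.3],
    [0.7, 0.4, 0.2],
    [0.3, 0.1, 0.9],
])
y_pred = (y_score >= 0.5).astype(int)  # assumed threshold of 0.5

scores = {
    # Metrics computed on hard 0/1 predictions.
    "hamming_loss": hamming_loss(y_true, y_pred),
    "exact_match": accuracy_score(y_true, y_pred),  # subset accuracy
    "macro_f1": f1_score(y_true, y_pred, average="macro"),
    "micro_f1": f1_score(y_true, y_pred, average="micro"),
    # Metrics computed on the real-valued scores.
    "coverage": coverage_error(y_true, y_score),
    "ranking_loss": label_ranking_loss(y_true, y_score),
    "macro_auc": roc_auc_score(y_true, y_score, average="macro"),
    "micro_auc": roc_auc_score(y_true, y_score, average="micro"),
}
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

In this toy example the ranking-based metrics are perfect while Exact Match is only 0.75, because one relevant genre scores below the threshold. This is one reason to report several metrics side by side.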

## Deep Learning

For the deep learning portion of the project, one natural way to represent Y is to have $$k=26$$ output nodes (one for each of the k individual genres) in the final layer of the network. Each of these output nodes would have a sigmoid activation with binary cross-entropy loss, and each would be fully connected to all of the nodes in the previous layer. Each output node would have an activation threshold (for example, 0.5), and a movie would belong to a particular genre if the output node corresponding to that genre exceeded the threshold. This approach allows a movie to be classified into multiple genres, which addresses the multilabel classification problem.
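
The output layer described above can be sketched in plain NumPy. This is only a forward pass with random weights, not a trained network; the hidden width of 8 and the random inputs are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy over all examples and output nodes."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)))

k = 26          # one output node per genre, as in the text
hidden_dim = 8  # hypothetical width of the previous layer

rng = np.random.RandomState(0)
hidden = rng.normal(size=(4, hidden_dim))        # activations for 4 movies
W = rng.normal(scale=0.1, size=(hidden_dim, k))  # fully connected weights
b = np.zeros(k)

probs = sigmoid(hidden @ W + b)        # independent per-genre probabilities
predicted = (probs > 0.5).astype(int)  # threshold each node separately

y_true = rng.randint(0, 2, size=(4, k))  # random placeholder labels
loss = binary_cross_entropy(y_true, probs)
print(predicted.sum(axis=1), round(loss, 4))
```

Since every node is thresholded on its own, any number of genres (including none) can be switched on for a given movie.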

As noted in the literature, the main drawback of this approach is that it does not take correlations between labels into account. This matters for our specific task, since certain genres, such as Action and Adventure, are correlated. Nevertheless, this approach may still perform respectably despite this limitation, so we will experiment with it before moving on to more complicated methods.

Time permitting, we plan to also explore more sophisticated neural net-specific multilabel classification methods. In particular, [backpropagation for multi-label learning](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.130.7318&rep=rep1&type=pdf) (BP-MLL) seems promising, as it was developed specifically to address the correlation issue of the individual binary classifier approach. The main concern at this point with BP-MLL is the feasibility of integrating it with Keras, which we will have to research in further detail.
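
As we understand the BP-MLL formulation, its error function penalizes every pair in which an irrelevant genre scores close to or above a relevant one. The NumPy sketch below is our reading of that pairwise loss, not a Keras integration or a reference implementation; the function name is our own.

```python
import numpy as np

def bpmll_loss(y_true, outputs):
    """Pairwise ranking loss in the style of BP-MLL.

    For each example, average exp(-(c_k - c_l)) over all pairs where
    label k is relevant and label l is not, then sum over examples.
    Examples with no relevant or no irrelevant labels are skipped,
    since no such pair exists for them.
    """
    total = 0.0
    for labels, c in zip(y_true, outputs):
        pos = c[labels == 1]
        neg = c[labels == 0]
        if len(pos) == 0 or len(neg) == 0:
            continue
        diffs = pos[:, None] - neg[None, :]  # c_k - c_l for every pair
        total += float(np.exp(-diffs).mean())
    return total

labels = np.array([[1, 0, 1]])
well_ranked = np.array([[0.9, -0.8, 0.7]])    # relevant genres score higher
badly_ranked = np.array([[-0.9, 0.8, -0.7]])  # irrelevant genre scores higher

print(bpmll_loss(labels, well_ranked), bpmll_loss(labels, badly_ranked))
```

The loss depends on the relative ordering of genre scores within a movie, which is how it captures interactions between labels that independent sigmoid outputs ignore.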

# Which features do you choose for X and why? 

We currently intend to use the following features for X, based on the conclusions from our visualizations for milestone 1:
* Number of faces in posters
* Facebook likes on movie page
* IMDb score
* Director

We are also considering using the following features:
* Gross
* Content rating
* Budget
* Number of critics that reviewed the movie

We are omitting the countries the movies were created in as a predictor variable because there didn’t appear to be a statistically significant difference in movie genres based on country from our Milestone 1 exploratory data analysis. 


Depending on the scores we obtain during the model-fitting process and any new metadata we obtain from further scraping with the IMDbPY Python package and the Rotten Tomatoes API, we may adjust this set of predictors in future milestones.

# How do you sample your data, how many samples, and why?

Because we can scrape more data and build a dataset that is balanced across individual genres, we can use our entire dataset without any special resampling scheme. In terms of how many samples, we collect 100 movies for each individual genre so that every genre has enough examples for modeling. If need be, we will scrape more.
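
If we end up with a larger scraped pool than we need, drawing a fixed number of movies per genre could look like the sketch below. The function name, the toy frame, and the indicator column names are all hypothetical.

```python
import pandas as pd

def sample_per_genre(movies, genre_columns, n_per_genre, seed=0):
    """Draw up to n_per_genre movies for each genre indicator column."""
    picks = []
    for genre in genre_columns:
        pool = movies[movies[genre] == 1]
        picks.append(pool.sample(n=min(n_per_genre, len(pool)), random_state=seed))
    # A movie with several genres can be drawn more than once; keep one copy.
    combined = pd.concat(picks)
    return combined[~combined.index.duplicated()]

# Hypothetical pool of 6 movies with two genre indicator columns.
pool = pd.DataFrame({
    "Action": [1, 1, 1, 0, 0, 0],
    "Drama":  [0, 0, 1, 1, 1, 1],
})
sampled = sample_per_genre(pool, ["Action", "Drama"], n_per_genre=2)
print(sampled)
print(sampled[["Action", "Drama"]].sum())
```

Because a movie can carry several genres, the per-genre totals in the result can exceed the requested number, so it is worth checking the final counts.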
