# Classification Project

## Overview

You will be analyzing a movie dataset. Your goals in this project are
+ to build classifiers that predict the genre of a movie--action or romance--based on the frequencies with which certain words appear in the movie script and
+ to analyze the performance of these classifiers.

Project components:
1. **Data exploration and feature selection**

    To get a sense of which features can help distinguish an action movie from a romance movie, you will need to do some initial exploration and visualization of your training dataset.  Your project should include at least three data visualizations that help justify how you design your classifier.<br><br>
    
2. **Development of your own classifier**
    
    Your task here is to build your own "simple classifier"; it should use at least three features and at least two if-else statements.  This classifier should be informed by your initial data exploration (step 1 above).<br><br>
    
3. **Assessment of your classifier**

    Assess how your classifier does in predicting the genres of the movies in your test dataset.  You should measure the accuracy of your classifier.  Then, you should use one other metric (this could be one of the metrics we have discussed in class or one that you construct yourself).  If you use a metric that you created yourself, please include a brief explanation of your idea.<br><br>
    
4. **Comparison of the performance of your classifier to that of a k-Nearest Neighbor Classifier**

    Use a kNN classifier with a particular choice of $k$ to predict the genre of the movies in your test dataset and assess the quality of the predictions.  Explain your choice of $k$.  Compare the performance of your classifier to that of the kNN classifier; comment on the result.<br><br>
    
5. **Writeup (Summary and Conclusion)**

    Your writeup should include

    + An explanation of your data visualizations (why you picked the pairs of words to plot, etc.) and an interpretation of each visualization
    + An explanation of your simple classifier, including how the data visualizations you produced inform the simple classifier(s) you constructed
    + A discussion of the metrics you use to assess the performance of the classifiers
    + A discussion of the performance of your simple classifier(s), in terms of accuracy and one other metric
    + A discussion of the performance of the kNN classifier, including what $k$ value you chose and why, in terms of accuracy and one other metric
    + A summary of how the simple classifier(s) you constructed compares to the kNN classifier, which classifier you would recommend to use, the shortcomings of your chosen classifiers, and any ideas for future improvements

## Team Information

Team members: (enter your names here)

NetIDs: 

## 0. Load data and libraries; split data into training and test data

In [1]:
# libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# load data
moviesdata = pd.read_csv('../datasets/movies.csv')

In [3]:
# size of data
print(moviesdata.shape) # 242 rows (one per movie), 5006 columns

(242, 5006)


In [4]:
# Decide how many rows will be used for training and testing (for example, 30% for test)

# two ways (depending on course objectives)
# 1. the short way, using sklearn
#from sklearn.model_selection import train_test_split
#training_data, test_data = train_test_split(moviesdata, test_size=0.3)


# 2. do it "from scratch"
# Split moviesdata into two
#   scramble rows of the movies data
moviesdata_scrambled = moviesdata.sample(frac=1, random_state = 1).reset_index(drop=True)

#   split
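num_rows = len(moviesdata_scrambled)
num_test = int(round(0.3 * num_rows))  # integer count, since .iloc needs integer positions

training_rows_indices = np.arange(num_test, num_rows)
test_rows_indices = np.arange(0, num_test)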

training_data = moviesdata_scrambled.iloc[training_rows_indices, : ]
test_data = moviesdata_scrambled.iloc[test_rows_indices, : ]

In [5]:
# check the number of rows in the training and test data
print(training_data.shape)
print(test_data.shape)

(169, 5006)
(73, 5006)


## 1. Data exploration and feature selection
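
One possible starting point for feature selection is to rank words by how far apart their average frequencies are in the two genres, then plot the top-ranked words. The sketch below runs on a small synthetic table, not on the real data: the column names `Genre`, `kill`, `love`, and `the`, the genre labels, and the scoring rule are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for training_data; all column names here are hypothetical.
rng = np.random.default_rng(0)
n = 60
genre = np.array(["action"] * 30 + ["romance"] * 30)
toy = pd.DataFrame({
    "Genre": genre,
    "kill": np.where(genre == "action", 0.004, 0.001) + rng.normal(0, 0.0005, n),
    "love": np.where(genre == "action", 0.001, 0.005) + rng.normal(0, 0.0005, n),
    "the": 0.05 + rng.normal(0, 0.0005, n),
})

def rank_words_by_separation(df, label_col="Genre"):
    """Score each word column by |gap between per-genre means| / overall std."""
    words = [c for c in df.columns if c != label_col]
    means = df.groupby(label_col)[words].mean()
    gap = (means.loc["action"] - means.loc["romance"]).abs()
    score = gap / df[words].std()
    return score.sort_values(ascending=False)

ranking = rank_words_by_separation(toy)
print(ranking)
```

Words near the top of such a ranking are reasonable candidates for scatter plots or histograms colored by genre; a word like `the`, which is about equally common in both genres, should land near the bottom.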

## 2. Development of your own classifier
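
As a rough illustration of the required shape (at least three features, at least two if-else statements), here is a minimal sketch of a rule-based classifier. The words `love`, `kill`, and `gun` and both thresholds are hypothetical placeholders; your own choices should come from your data exploration.

```python
def simple_classifier(row):
    """Predict 'action' or 'romance' from three word frequencies.

    The feature names and thresholds are illustrative assumptions only.
    """
    if row["love"] > 0.003:
        return "romance"
    else:
        if row["kill"] + row["gun"] > 0.002:
            return "action"
        else:
            return "romance"

# Three made-up movies, each described by its word frequencies.
examples = [
    {"love": 0.005, "kill": 0.000, "gun": 0.000},
    {"love": 0.001, "kill": 0.002, "gun": 0.001},
    {"love": 0.001, "kill": 0.000, "gun": 0.001},
]
predictions = [simple_classifier(r) for r in examples]
print(predictions)
```

Because the function only indexes `row` by name, it can also be applied to every row of a DataFrame with `DataFrame.apply(simple_classifier, axis=1)`, provided the DataFrame has columns with those names.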

## 3. Assessment of your classifier
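
The following sketch shows one way to compute accuracy together with a second metric, here recall for one genre. The label lists are made up for illustration, and treating `"action"` as the positive class is an arbitrary assumption.

```python
def accuracy(actual, predicted):
    """Fraction of movies whose predicted genre matches the actual genre."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

def recall(actual, predicted, positive="action"):
    """Fraction of truly `positive` movies that were predicted `positive`."""
    hits = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    total = sum(a == positive for a in actual)
    return hits / total if total else float("nan")

# Made-up labels for eight movies.
actual = ["action", "action", "action", "romance",
          "romance", "romance", "romance", "romance"]
predicted = ["action", "romance", "action", "romance",
             "romance", "action", "romance", "romance"]

acc = accuracy(actual, predicted)
rec = recall(actual, predicted)
print(acc, rec)
```

Here 6 of the 8 predictions are correct, and 2 of the 3 action movies are found. Reporting a per-genre metric next to accuracy is useful when one genre is more common than the other, since accuracy alone can hide poor performance on the rarer genre.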

## 4. k-Nearest Neighbor Classifier
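
For reference, a from-scratch kNN prediction can be sketched as below: compute the Euclidean distance from a query movie to every training movie, take the $k$ closest, and return the majority genre. The two-feature training points are made up, and `knn_predict` is a hypothetical helper name, not part of any library.

```python
from collections import Counter

import numpy as np

def knn_predict(train_X, train_y, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    distances = np.sqrt(((train_X - query) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Made-up training set: two word frequencies per movie.
train_X = np.array([
    [0.004, 0.001], [0.005, 0.000], [0.003, 0.001],   # action-like
    [0.001, 0.005], [0.000, 0.004], [0.001, 0.006],   # romance-like
])
train_y = np.array(["action"] * 3 + ["romance"] * 3)

pred_a = knn_predict(train_X, train_y, np.array([0.004, 0.0005]), k=3)
pred_r = knn_predict(train_X, train_y, np.array([0.0005, 0.005]), k=3)
print(pred_a, pred_r)
```

With two genres, an odd $k$ avoids tied votes. If your course allows it, scikit-learn's `KNeighborsClassifier` offers the same idea as a library class.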

## 5. Writeup (Summary and Conclusion)

(Your writeup goes here)