# Introduction

Some movies contain intense violence, drug use, nudity, and harsh language, so it is important that parents know a film's content before watching it with their child. For this reason, the Motion Picture Association established the widely adopted Classification and Rating Administration, or “CARA”, in 1968: to inform parents about the maturity of a movie's content by sorting films into bins (*Motion Picture Association*, 2020). The bins are as follows:

* **G** (General Audiences)
* **PG** (Parental Guidance Suggested)
* **PG-13** (Parents Strongly Cautioned)
* **R** (Restricted)
* **NC-17** (Adults Only)

This led us to ask whether a movie's category could be predicted before it is officially rated. Such a prediction might help CARA analysts anticipate where a movie may fall before reviewing it. After exploring various options, we settled on the following predictive question.

**Given a movie’s run time, votes, and IMDb rating, can we predict the category?**

The dataset we will use is IMDb Top 100 Movies (Pathak, 2023). Using this list of the top 100 movies from 1972 to 2015 according to IMDb (*IMDb*, n.d.), we will predict each movie's category from the variables stated above. The dataset is available at https://www.kaggle.com/datasets/themrityunjaypathak/imdb-top-100-movies.

# Preliminary exploratory data analysis

### Loading Libraries

In [None]:
library(tidyverse)
library(stringr)
library(repr)
# Install any missing packages once from the console (e.g. install.packages("xlsx"))
# instead of reinstalling them every time the notebook runs.
library("xlsx")
library(rvest)
library(tidymodels)
options(repr.matrix.max.rows = 6)
library(rjson)
source('tests.R')
source("cleanup.R")


### Reading and Tidying
* We downloaded the dataset from Kaggle (https://www.kaggle.com/datasets/themrityunjaypathak/imdb-top-100-movies) and uploaded it to Jupyter as `movies.csv`.
* We then use `select` to keep the columns relevant to our question.
* The `filter` command excludes movies labelled *Approved*, *Passed*, and *GP*, which are not among the five CARA categories listed above.
* `mutate` converts the category to a factor, removes "min" from run time, converts run time to a numeric variable, and rescales votes to millions.

In [5]:
# The Kaggle download URL returns an HTML page rather than the CSV,
# so we read the local copy that was saved as movies.csv.
movies_data <- read_csv("movies.csv")

changed_movies <- movies_data |>
    # keep only the response and the three predictors
    select(category, run_time, votes, imdb_rating) |>
    # drop labels that are not current CARA categories
    filter(category != "Approved", category != "Passed", category != "GP") |>
    mutate(category = as_factor(category),
           # strip the literal " min" suffix, then convert to numeric
           run_time = as.numeric(str_remove(run_time, "\\s*min")),
           # express votes in millions
           votes = votes / 1000000)
changed_movies


### Splitting

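* We allocate 75% of the data to training and 25% to testing, stratified by category so that both sets have similar category proportions.
* With only 91 rows, this split keeps most of the data for training while still holding out enough observations to estimate accuracy on movies the model has not seen.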

In [None]:
set.seed(2019)
movies_split <- initial_split(changed_movies, prop = 0.75, strata = category)  
movies_train <- training(movies_split)   
movies_test <- testing(movies_split)
movies_train

### Summary Table

This table shows the number of observations in each category of the training data. 

In [None]:
movies_train |>
  group_by(category) |>
  summarize(count = n())

### Plotting our Analysis

The plots below show how two of the predictors relate to IMDb rating, with points color coded by category. The first plot shows a modest positive relationship between votes and IMDb rating. The second plot shows no clear association between run time and IMDb rating.

In [None]:
set.seed(2019)
library(gridExtra)
options(repr.plot.height = 8, repr.plot.width = 20)

# stat_ellipse() (from ggplot2) draws a data ellipse from the x and y aesthetics;
# geom_ellipse() belongs to ggforce and needs explicit centre and axis aesthetics.
movies_plot_1 <- ggplot(changed_movies, aes(x = votes, y = imdb_rating)) +
    geom_point(aes(color = category)) +
    stat_ellipse() +
    ggtitle("Effect of Votes on IMDb Rating, Color Coded by Category") +
    labs(x = "Votes (millions)", y = "IMDb Rating (scale of 1 to 10)", color = "Category") +
    theme(text = element_text(size = 18))

movies_plot_2 <- ggplot(changed_movies, aes(x = run_time, y = imdb_rating)) +
    geom_point(aes(color = category)) +
    stat_ellipse() +
    ggtitle("Effect of Run Time on IMDb Rating, Color Coded by Category") +
    labs(x = "Run Time (minutes)", y = "IMDb Rating (scale of 1 to 10)", color = "Category") +
    theme(text = element_text(size = 18))

# show the two plots side by side
grid.arrange(movies_plot_1, movies_plot_2, ncol = 2)

# Methods

### Conducting the Analysis

We plan to use K-nearest neighbors (KNN) classification to assign each movie to a CARA category based on its *IMDb rating*, *run time*, and *votes*. We chose these three columns because they are numerical, and KNN relies on distances between numerical predictors. The other numerical column, *year of release*, does not show a clear connection with a movie's category. The columns *movie name* and *genre* are not numerical, so they cannot be used directly in a distance calculation. Because the data is split into training and testing sets, we will then be able to evaluate the model's accuracy on movies it has not seen.
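
To make the idea concrete, the following is a minimal base R sketch of how KNN assigns a category: standardize the predictors, measure the distance from the new movie to every training movie, and take a majority vote among the K closest. The toy values and the helper `knn_predict` are hypothetical and exist only for illustration; they are not taken from the IMDb dataset, and the actual analysis would rely on tidymodels instead.

```r
# Hypothetical toy data for illustration only (NOT from the IMDb dataset).
toy_train <- data.frame(
  run_time    = c(95, 100, 150, 160, 175),   # minutes
  votes       = c(0.4, 0.5, 1.8, 2.0, 2.4),  # millions
  imdb_rating = c(8.1, 8.0, 8.8, 9.0, 9.2),
  category    = c("PG", "PG", "R", "R", "R")
)
toy_new <- data.frame(run_time = 155, votes = 1.9, imdb_rating = 8.9)

predictors <- c("run_time", "votes", "imdb_rating")

# Hypothetical helper: classify one new observation by majority vote
# among its k nearest training observations.
knn_predict <- function(train, new_obs, predictors, k = 3) {
  # Standardize with the training means and standard deviations, so that
  # run time (minutes) does not dominate votes (millions) in the distance.
  centers <- colMeans(train[predictors])
  scales  <- sapply(train[predictors], sd)
  train_scaled <- scale(train[predictors], center = centers, scale = scales)
  new_scaled   <- scale(new_obs[predictors], center = centers, scale = scales)

  # Euclidean distance from the new observation to each training row.
  differences <- sweep(train_scaled, 2, as.numeric(new_scaled))
  distances   <- sqrt(rowSums(differences^2))

  # Majority vote among the k nearest rows (ties go to the first label).
  nearest     <- order(distances)[seq_len(k)]
  vote_counts <- table(train$category[nearest])
  names(vote_counts)[which.max(vote_counts)]
}

predicted <- knn_predict(toy_train, toy_new, predictors, k = 3)
print(predicted)
```

In this toy example the three closest movies are all rated R, so the new movie is assigned to R.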

### Visualizing the Results

One way we plan to visualize the results is by plotting accuracy against the number of neighbors (K) to find the best value. We will do this by completing the following steps.

1. Setting up a nearest-neighbors model specification with the number of neighbors marked for tuning.
2. Designing the recipe and standardizing the predictors with `step_scale` and `step_center`.
3. Using `vfold_cv` for cross-validation and creating a sequence of K values to try, increasing in fixed steps (e.g., by 5).
4. Creating a workflow, tuning it over the K values, and gathering the results with `collect_metrics()`.
5. Filtering the results for accuracy using the `.metric` column and arranging them in descending order.
6. Constructing a scatterplot of *Accuracy Estimate* against *Number of Nearest Neighbors (K)*.
7. Choosing the K value we want to use based on the graph.
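
The steps above could be sketched roughly as follows with tidymodels. This is only an outline under stated assumptions: `sim_train` is simulated placeholder data standing in for `movies_train`, the five folds and the K grid from 1 to 26 are illustrative choices, and the `kknn` package is assumed to be installed for the `"kknn"` engine.

```r
library(tidymodels)  # the "kknn" engine also needs the kknn package installed

set.seed(2019)

# Simulated placeholder data (NOT the IMDb dataset); in the real analysis
# this would be movies_train.
sim_train <- tibble(
  category    = factor(rep(c("PG", "PG-13", "R"), each = 30)),
  run_time    = rnorm(90, mean = rep(c(110, 130, 150), each = 30), sd = 15),
  votes       = rnorm(90, mean = rep(c(0.6, 1.0, 1.4), each = 30), sd = 0.3),
  imdb_rating = rnorm(90, mean = 8.3, sd = 0.3)
)

# Step 1: model specification with the number of neighbors left to tune.
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# Step 2: recipe that standardizes all predictors.
knn_recipe <- recipe(category ~ run_time + votes + imdb_rating, data = sim_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# Step 3: cross-validation folds and the K values to try (steps of 5).
folds  <- vfold_cv(sim_train, v = 5, strata = category)
k_grid <- tibble(neighbors = seq(from = 1, to = 26, by = 5))

# Steps 4 and 5: tune the workflow, then keep accuracy, best first.
accuracies <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = folds, grid = k_grid) |>
  collect_metrics() |>
  filter(.metric == "accuracy") |>
  arrange(desc(mean))

# Step 6: accuracy estimate against the number of neighbors.
accuracy_plot <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
  geom_point() +
  geom_line() +
  labs(x = "Number of Nearest Neighbors (K)", y = "Accuracy Estimate")

accuracies
```

Step 7 then amounts to reading `accuracy_plot` and picking a K near the peak where neighboring K values give similar accuracy, rather than a single isolated spike.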

# Expected outcomes and significance

### What do you expect to find?

We anticipate that the model will predict movie categories with moderate accuracy. After the model is trained and tested, our hope is that CARA could use this or a similar system to increase efficiency, and along the way, improve the model's accuracy.

### What impact could such findings have?

This could increase the speed at which movies are categorized. With further development, and as accuracy increases with more training data, CARA could rely solely on this system to predict which bin a movie would fall into. This development would save money and allow more movies to be rated.

### What future questions could this lead to?

1. To what degree could a reputable rating system like CARA rely on this prediction model with its current accuracy?
2. How frequently would the model need to be updated to maintain accuracy for new movies as popularity and run time for various categories change?
3. How will this affect the jobs of CARA analysts?

# References

Film ratings. Motion Picture Association. (2020, April 30). Retrieved March 11, 2023, from
https://www.motionpictures.org/film-ratings/ 

IMDb. (2023, February 15). IMDb Ratings FAQ. IMDb. Retrieved March 11, 2023, from
https://help.imdb.com/article/imdb/track-movies-tv/ratings-faq/G67Y87TFYYP6TWAV# 

IMDb. (n.d.). Ratings, reviews, and where to watch the best movies & TV shows. IMDb. Retrieved March 11, 2023, from
https://www.imdb.com/ 

Lemoine, A. (2021, January 19). What does 'IMDB' mean? Dictionary.com. Retrieved March 11, 2023, from
https://www.dictionary.com/e/acronyms/imdb/ 

MPA Film Ratings History. The Classification and Rating System (CARA). (n.d.). Retrieved March 11, 2023, from
https://50th.filmratings.com/core/ 

Pathak, M. (2023, January 11). IMDb top 100 movies. Kaggle. Retrieved March 11, 2023, from
https://www.kaggle.com/datasets/themrityunjaypathak/imdb-top-100-movies 