# DS5001 Final Report

Name: Theodore Thormann (nxb5kp@virginia.edu) <br>
Class: DS5001 Spring '23

## Introduction

Horror films have been a staple of cinema since the first horror movie, “Le Manoir du Diable” (in English, “The House of the Devil”), directed by the French filmmaker Georges Méliès in 1896. Since then, the genre has experienced ups and downs over the decades and has branched into a plethora of subgenres, including the slasher, the demonic possession film, and the paranormal film, among many others. Until recently, the academic community largely dismissed horror films as drivel. However, a recent revival of the genre has led many academics to examine the horror films of the past in a more scholarly light. Kendall Phillips's book “Projected Fears: Horror Films and American Culture” offers a strong thesis on why horror films deserve the academic spotlight in discussions of American culture. He writes, “…while any given film can be frightening to any given individual, certain films become the touchstone of fear for an entire generation. It is as if, at certain points, a particular film so captures our cultural anxieties and concerns that our collective fears seem projected onto the screen before us.” (Phillips). <br>
<br>
Phillips touches on an interesting point about certain films that transcend their genre and strike fear into an entire generation. While Phillips posits this generational fear, this paper asks whether those fears can be detected through text analysis, and whether certain films from one generation register across all generations. This question of fear over time is not completely novel, as John Kenneth Muir’s scholarly book series investigating horror films by decade shows. While most of this project looks at seminal horror films from the 1980s, 1990s, and 2000s as a whole, at points it examines each of these generations more closely.

## Source Data

The corpus used for this project is titled “Film Corpus 2.0” and was obtained from the Baskin Engineering Lab at the University of California, Santa Cruz. The corpus contains the complete text files for 1068 films. The corpus authors originally obtained the .txt files in 2015 by scraping imsdb.com, a film script database. The original corpus contains 149 horror film scripts.

Thirty-seven horror films were chosen for this analysis, for several reasons. Primarily, they use conventional scene slug lines such as INT, EXT, or MONTAGE to denote scene changes. Scene changes were a crucial delimiter in my analysis and were used to divide screenplays much as chapters divide novels. A screenplay also needed a minimum of 50 scenes to be chosen; this cutoff was arbitrary. The average film in this corpus has 153 scenes and 22,654 tokens.
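
The scene-splitting step described above can be sketched as follows. This is a minimal illustration, not the code used in the project: the regular expression, the function names `split_scenes` and `has_enough_scenes`, and the sample text are all assumptions made for the example.

```python
import re

# Assumed slug-line pattern: a line that begins with INT, EXT, or MONTAGE.
# The project's actual pattern is not given in this report.
SLUG_RE = re.compile(r"^[ \t]*(?:INT|EXT|MONTAGE)\b", re.MULTILINE)


def split_scenes(script_text):
    """Split a screenplay into scenes, each starting at a slug line."""
    starts = [m.start() for m in SLUG_RE.finditer(script_text)]
    if not starts:
        return []
    ends = starts[1:] + [len(script_text)]
    # Text before the first slug line (e.g. "FADE IN:") is discarded.
    return [script_text[s:e].strip() for s, e in zip(starts, ends)]


def has_enough_scenes(script_text, min_scenes=50):
    """Apply the minimum-scene cutoff (50 in this report)."""
    return len(split_scenes(script_text)) >= min_scenes


sample = """FADE IN:

INT. KITCHEN - NIGHT
A phone rings.

EXT. BACKYARD - NIGHT
Something moves in the dark.

MONTAGE - THE SEARCH
Flashlights sweep the woods.
"""

scenes = split_scenes(sample)
print(len(scenes))
print(scenes[0].splitlines()[0])
```

Anchoring the pattern to the start of a line keeps words such as "INTERIOR" or mid-sentence uses of "EXT" from being counted as scene changes.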

I also chose these 37 films because they all come from the three decades analyzed in this paper: the 1980s, the 1990s, and the 2000s. Films released between 1980 and 1989 were placed in the 1980s category, films released between 1990 and 1999 in the 1990s category, and films released between 2000 and 2010 in the 2000s category. Two films from 2010, Insidious and Legion, were present in my corpus and were placed in the 2000s decade for this analysis. There are 8 films in the 1980s category, 10 films in the 1990s category, and 19 films in the 2000s category.
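
The decade rule can be written as a small function. This is a sketch only; the function name `assign_decade` and the integer labels 1980, 1990, and 2000 are assumptions chosen to match the table names used later in this report.

```python
def assign_decade(year):
    """Map a release year to the decade label used in this report."""
    if 1980 <= year <= 1989:
        return 1980
    if 1990 <= year <= 1999:
        return 1990
    if 2000 <= year <= 2010:
        # 2010 releases (Insidious, Legion) are grouped with the 2000s.
        return 2000
    raise ValueError(f"Year {year} is outside the 1980-2010 range")


print(assign_decade(1984), assign_decade(1996), assign_decade(2010))
```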

A link to the 37 .txt screenplay files can be found here at UVA Box: https://virginia.box.com/s/jmrw18xum970zrjkupuk6e7yy0claq6k

## Data Model

These tables can be accessed at the following link: https://virginia.box.com/s/4zfi385yvykwvuui179slxfubf22t5cq

This project consists of 21 CSV tables.

### Data Table Descriptions

#### CORPUS & CORPUS_1980 & CORPUS_1990 & CORPUS_2000:
This table uses the OHCO (ordered hierarchy of content objects) of movie_id, scene_id, sent_num, and token_num as its index. The purpose of this table is to capture all tokens present in the corpus and split them into different levels using the OHCO. This table also contains the part of speech of each token, assigned with the Natural Language Toolkit (NLTK) automatic POS tagger in Python.

| Column    | Type  | Description                                                                            |
|-----------|-------|----------------------------------------------------------------------------------------|
| movie_id  | int   | The ID of the associated film in the corpus                                            |
| scene_id  | int   | The scene number in the film                                                           |
| sent_num  | int   | The sentence number in the scene                                                       |
| token_num | int   | The token number in the sentence                                                       |
| pos_tuple | object | The token and its NLTK tagged part of speech                                           |
| pos       | object   | The part of speech associated with the token                                           |
| token_str | object   | A token in the corpus                                                                  |
| term_str  | object   | A normalized version of a token in the corpus in lowercase with any characters removed |

CORPUS contains all the films in the corpus.
CORPUS_1980 & CORPUS_1990 & CORPUS_2000 contain only films for that decade.
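
To illustrate how the OHCO index nests, here is a minimal sketch that builds CORPUS-style rows from scene text. It is not the project's pipeline: it uses simple regular expressions in place of NLTK's sentence splitter and POS tagger (so the `pos` columns are omitted), and the zero-based numbering, the normalization rule for `term_str`, and the function name `build_corpus_rows` are assumptions.

```python
import re


def build_corpus_rows(movies):
    """Build token rows keyed by (movie_id, scene_id, sent_num, token_num).

    movies: dict mapping movie_id -> list of scene strings.
    """
    rows = []
    for movie_id, scenes in movies.items():
        for scene_id, scene in enumerate(scenes):
            # Stand-in sentence splitter: break after ., !, or ?
            sentences = [s for s in re.split(r"(?<=[.!?])\s+", scene.strip()) if s]
            for sent_num, sent in enumerate(sentences):
                for token_num, token_str in enumerate(sent.split()):
                    # Assumed normalization: lowercase, drop non-alphanumerics.
                    term_str = re.sub(r"[\W_]+", "", token_str.lower())
                    rows.append({
                        "movie_id": movie_id,
                        "scene_id": scene_id,
                        "sent_num": sent_num,
                        "token_num": token_num,
                        "token_str": token_str,
                        "term_str": term_str,
                    })
    return rows


rows = build_corpus_rows({0: ["The door opens. Nobody's there!"]})
for row in rows:
    print(row)
```

Because every row carries all four index levels, grouping by a prefix of the index (for example movie_id and scene_id) yields the scene-level bags of words used by the later tables.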

#### VOCAB & VOCAB_1980 & VOCAB_1990 & VOCAB_2000:
This table uses term_str (every term present in the corpus) as the index. It includes information about how each term appears in the corpus.

| Column         | Type  | Description                                                                  |
|----------------|-------|------------------------------------------------------------------------------|
| term_str       | object   | A term in the corpus                                                         |
| n              | int   | The number of times that term appears in the corpus                          |
| n_chars        | int   | The number of characters of the term                                         |
| p              | float | The probability of that term appearing in the corpus                         |
| i              | float | The inverse log of the probability                                           |
| max_pos        | object   | The most frequently associated part-of-speech category for each token        |
| n_pos          | object   | The number of parts of speech associated with that token                     |
| cat_pos        | list  | A concatenated list of all the parts of speech associated with that token    |
| stop           | int   | 1 indicates the token is a stopword, 0 indicates the token is not a stopword |
| stem_porter    | object   | Porter stemming algorithm applied to the term                                |
| stem_snowball  | object   | Snowball stemming algorithm applied to the term                              |
| stem_lancaster | object   | Lancaster stemming algorithm applied to the term                             |
| tfidf          | float | Term frequency-inverse document frequency                                    |
| dfidf          | float | Document frequency-inverse document frequency                                    |

VOCAB covers all the films in the corpus.
VOCAB_1980 & VOCAB_1990 & VOCAB_2000 contain only the vocabulary of films from that decade.
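
The count-based VOCAB columns can be derived from a list of terms as in the sketch below. The report does not give the exact formula for `i`; this example assumes it is the base-2 log of the inverse probability, log2(1/p), and the function name `build_vocab` is hypothetical.

```python
import math
from collections import Counter


def build_vocab(terms):
    """Compute n, n_chars, p, and i for every distinct term."""
    counts = Counter(terms)
    total = sum(counts.values())
    vocab = {}
    for term, n in counts.items():
        p = n / total
        vocab[term] = {
            "n": n,                   # occurrences in the corpus
            "n_chars": len(term),     # length of the term
            "p": p,                   # relative frequency
            "i": math.log2(1 / p),    # assumed definition of i
        }
    return vocab


vocab = build_vocab(["the", "door", "the", "knife"])
print(vocab["the"])
print(vocab["door"])
```

Under this assumption, rarer terms receive a higher `i`, which is why it is often read as a measure of how informative a term is.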


#### LIB:
This table contains the metadata of the project. The source_file_path column contains paths specific to the author's machine and will need to be altered for use on other machines.

| Column           | Type | Description                       |
|------------------|------|-----------------------------------|
| movie_id         | int  | The ID associated with that movie |
| movie_title      | object  | The title of the movie            |
| source_file_path | object  | The path to the source file       |
| year             | int  | The year of the film's release    |
| decade           | int  | The decade of the film's release  |
| movie_len        | int  | The number of tokens in the movie |
| n_scenes         | int  | The number of scenes in the movie |

#### DOC:
This table also contains the metadata for the project with a `title` column added. This table is mostly used in figures produced for the interpretation of the corpus.

| Column           | Type | Description                                |
|------------------|------|--------------------------------------------|
| movie_id         | int  | The ID associated with that movie          |
| scene_id         | int  | The scene number in the movie              |
| movie_title      | object  | The title of the movie                     |
| source_file_path | object  | The path to the source file                |
| year             | int  | The year of the film's release             |
| decade           | int  | The decade of the film's release           |
| movie_len        | int  | The number of tokens in the movie          |
| n_scenes         | int  | The number of scenes in the movie          |
| title            | object  | The movie title concatenated with the year |


#### TFIDF (Term Frequency-Inverse Document Frequency):
This table is used to quantify the importance of each term in a collection of documents. It is constructed by calculating two values for each term in each document: the term frequency (TF), which measures how frequently the term appears in the document, and the inverse document frequency (IDF), which measures how important the term is across the entire collection of documents. A high TFIDF score indicates that the term is both frequent in the document and rare in the collection, making it a good indicator of the document's content.

This table has a movie_id column, a scene_id column, and one column for each of the top 1000 terms by TFIDF in the corpus. Each cell holds that term's TFIDF value in that scene.

| Column   | Type | Description                                               |
|----------|------|-----------------------------------------------------------|
| movie_id | int  | The movie ID associated with a particular film            |
| scene_id | int  | The scene number in that film                             |
| term     | float  | One column per term (top 1000 by TFIDF). Each value is that term's TFIDF in that scene. |
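
A minimal sketch of the TFIDF calculation with scenes as documents follows. The report does not state which TF and IDF variants were used, so this example assumes relative term frequency within the scene and a base-2 IDF, log2(N/DF). The function name `tfidf` and the toy scenes are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Compute TFIDF per document.

    docs: dict mapping (movie_id, scene_id) -> list of terms.
    Returns a dict mapping the same keys -> {term: tfidf}.
    """
    n_docs = len(docs)
    # Document frequency: number of scenes containing each term.
    df = Counter()
    for terms in docs.values():
        df.update(set(terms))
    table = {}
    for doc_id, terms in docs.items():
        counts = Counter(terms)
        total = len(terms)
        table[doc_id] = {
            term: (count / total) * math.log2(n_docs / df[term])
            for term, count in counts.items()
        }
    return table


docs = {
    (0, 0): ["knife", "door", "knife", "night"],
    (0, 1): ["door", "phone"],
}
scores = tfidf(docs)
print(scores[(0, 0)])
print(scores[(0, 1)])
```

In this toy example "door" appears in every scene, so its IDF is zero and it drops out as a distinguishing term, while "knife" scores highest in the first scene.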

#### DFIDF (Document Frequency-Inverse Document Frequency):

| Column   | Type  | Description          |
|----------|-------|----------------------|
| term_str | object   | A term in the corpus |
| DFIDF    | float | DFIDF of term        |
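
The report does not define how DFIDF is computed. One common reading, assumed here, is the document frequency of a term multiplied by its base-2 IDF: DF × log2(N/DF). The sketch below follows that assumption; the function name `dfidf` and the toy scenes are hypothetical.

```python
import math
from collections import Counter


def dfidf(docs):
    """Compute DF * log2(N / DF) for each term.

    docs: list of term lists, one per document (scene).
    """
    n_docs = len(docs)
    df = Counter()
    for terms in docs:
        df.update(set(terms))
    return {term: d * math.log2(n_docs / d) for term, d in df.items()}


dfidf_scores = dfidf([
    ["knife", "door"],
    ["door", "phone"],
    ["door", "knife"],
    ["night"],
])
print(dfidf_scores)
```

Under this definition a term that appears in every document scores zero, and terms that appear in a moderate share of documents score highest.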

#### COMPS:
This is a table of principal components to be used in principal component analysis (PCA).

| Column   | Type  | Description                                              |
|----------|-------|----------------------------------------------------------|
| pc_id    | object   | ID number associated with that principal component       |
| eig_val  | float | The eigenvalue associated with that principal component  |
| term_str | object   | A term in the corpus                                     |

#### LOADINGS:
This table contains information on how much the associated term contributes to the PCA.

| Column   | Type  | Description                                                         |
|----------|-------|---------------------------------------------------------------------|
| term_str | object   | A term in the corpus                                                |
| pc_val   | float | The contribution of the index term to the principal component value |

#### DCM (Document Component Matrix):
This table contains each scene's score on each principal component, indexed by film and scene.

| Column   | Type  | Description                                                         |
|----------|-------|---------------------------------------------------------------------|
| movie_id | int   | The movie ID associated with a particular film                      |
| scene_id | int   | The scene number in that film                                       |
| pc_id    | float | One column per principal component; each value is the scene's score on that component |
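
To show how the three PCA tables relate in a standard PCA, here is a minimal pure-Python sketch that extracts only the first principal component by power iteration. It is an illustration under assumptions, not the method used in the project: the function name `first_component`, the fixed starting vector, and the toy matrix are hypothetical, and a real analysis would normally use a linear algebra library.

```python
import math


def first_component(rows, iterations=200):
    """Return (eig_val, loadings, scores) for the first principal component.

    rows: list of equal-length numeric lists (documents x terms).
    """
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    centered = [[r[j] - means[j] for j in range(k)] for r in rows]
    # Sample covariance matrix of the term columns.
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(k)]
           for a in range(k)]
    # Power iteration. A fixed start vector can fail if it happens to be
    # orthogonal to the leading eigenvector; acceptable for a sketch.
    vec = [1.0 / math.sqrt(k)] * k
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(k)) for a in range(k)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            break
        vec = [x / norm for x in nxt]
    cov_vec = [sum(cov[a][b] * vec[b] for b in range(k)) for a in range(k)]
    eig_val = sum(v * cv for v, cv in zip(vec, cov_vec))      # COMPS.eig_val
    scores = [sum(c[j] * vec[j] for j in range(k)) for c in centered]  # DCM
    return eig_val, vec, scores                                # vec: LOADINGS


eig_val, loadings, doc_scores = first_component([[1, 1], [2, 2], [3, 3]])
print(eig_val)
print(loadings)
print(doc_scores)
```

In this reading, the eigenvalue corresponds to a COMPS row, the vector of term weights to a LOADINGS column, and the per-document projections to a DCM column.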

#### THETA:
This table shows how much each scene in the movie relates to the given topic of the topic model.

| Column            | Type  | Description                                                                    |
|-------------------|-------|--------------------------------------------------------------------------------|
| movie_id          | int   | The movie ID associated with a particular film                                 |
| scene_id          | int   | The scene number in that film                                                  |
| topic_association | float | How much each scene in the movie relates to the given topic of the topic model |

#### PHI:
This table shows how much each term contributes to a given topic.

| Column   | Type | Description             |
|----------|------|-------------------------|
| topic_id | object  | The ID of a given topic |
| term_str | object  | A term in the corpus    |

#### TOPICS:
This table shows every topic and the words associated with that topic.

| Column   | Type | Description             |
|----------|------|-------------------------|
| topic_id | object  | The ID of a given topic |
| term_associations | object  | Words associated with the topic   |

#### sentiment_polarity:
This table shows the sentiment of a movie from -1 to 1 with -1 being negative sentiment and 1 being positive sentiment. Sentiment is a measure of the emotional tone expressed in the language used in the text.

| Column      | Type  | Description                                                                                                       |
|-------------|-------|-------------------------------------------------------------------------------------------------------------------|
| movie_title | object   | The title of a film                                                                   |
| sentiment   | float | The overall sentiment of a movie between -1 and 1 with -1 being negative sentiment and 1 being positive sentiment |

#### emotions:
This table shows how much each film is associated with one of eight different emotions. The eight emotions are from Plutchik's Wheel of Emotions, which identifies eight primary emotions.

| Column       | Type  | Description                                                                                                       |
|--------------|-------|-------------------------------------------------------------------------------------------------------------------|
| movie_title  | object   | The title of a film                                                                                               |
| anger        | float | Amount the film is associated with anger on a scale of 0-1                                                        |
| anticipation | float | Amount the film is associated with anticipation on a scale of 0-1                                                 |
| disgust      | float | Amount the film is associated with disgust on a scale of 0-1                                                      |
| fear         | float | Amount the film is associated with fear on a scale of 0-1                                                         |
| joy          | float | Amount the film is associated with joy on a scale of 0-1                                                          |
| sadness      | float | Amount the film is associated with sadness on a scale of 0-1                                                      |
| surprise     | float | Amount the film is associated with surprise on a scale of 0-1                                                     |
| trust        | float | Amount the film is associated with trust on a scale of 0-1                                                        |
| sentiment    | float | The overall sentiment of a movie between -1 and 1 with -1 being negative sentiment and 1 being positive sentiment |
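
The report does not say which lexicon or scoring rule produced these values. As a hedged illustration only, the sketch below scores a film against a tiny made-up word-emotion lexicon and reports, for each of the eight emotions, the share of lexicon-matched terms that carry it. The lexicon `TOY_LEXICON`, the function name `emotion_profile`, and the scoring rule are all assumptions.

```python
from collections import Counter

EMOTIONS = ["anger", "anticipation", "disgust", "fear",
            "joy", "sadness", "surprise", "trust"]

# Hypothetical toy lexicon: term -> set of associated emotions.
TOY_LEXICON = {
    "knife": {"fear", "anger"},
    "scream": {"fear", "surprise"},
    "friend": {"joy", "trust"},
}


def emotion_profile(terms, lexicon=TOY_LEXICON):
    """Return each emotion's share of the lexicon-matched terms (0 to 1)."""
    matched = [t for t in terms if t in lexicon]
    if not matched:
        return {emotion: 0.0 for emotion in EMOTIONS}
    counts = Counter(e for t in matched for e in lexicon[t])
    return {emotion: counts[emotion] / len(matched) for emotion in EMOTIONS}


profile = emotion_profile(["the", "knife", "scream", "friend", "scream"])
print(profile)
```

Dividing by the number of matched terms keeps every value between 0 and 1, consistent with the scale given in the table, regardless of screenplay length.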


## Exploration

### Cluster Analysis

Hierarchical clustering was performed on all films in my corpus using five different distance measures.

City block distance was used on the corpus without norming, with weighted linkage, visualized below:

(City block distance)

Cosine distance was used on the corpus without norming, with Ward linkage, visualized below:

(Cosine distance)

Euclidean distance was used on the L2-normalized corpus, with Ward linkage, visualized below:

(Euclidean distance)

Jaccard distance was used on the L0-normalized corpus, with weighted linkage, visualized below:

(Jaccard distance)

Jensen-Shannon distance was used on the L1-normalized corpus, with weighted linkage, visualized below:

(Jensen-Shannon)
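
The five measures and three normalizations named above can be sketched in plain Python as follows. This is an illustration of the definitions, not the project's code: the function names and toy vectors are hypothetical, the Jensen-Shannon version assumes log base 2 and returns the square root of the divergence, and the linkage step (Ward or weighted) is not shown.

```python
import math


def l0_normalize(v):
    """Binary presence/absence vector."""
    return [1 if x > 0 else 0 for x in v]


def l1_normalize(v):
    """Scale so the values sum to 1 (a probability distribution)."""
    total = sum(v)
    return [x / total for x in v]


def l2_normalize(v):
    """Scale to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cityblock(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)


def jaccard(a, b):
    """Jaccard distance on binary vectors: 1 - |intersection| / |union|."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 1 - both / either


def jensen_shannon(p, q):
    """Square root of the Jensen-Shannon divergence (log base 2)."""
    m = [(x + y) / 2 for x, y in zip(p, q)]

    def kl(u, w):
        return sum(x * math.log2(x / y) for x, y in zip(u, w) if x > 0)

    return math.sqrt((kl(p, m) + kl(q, m)) / 2)


a, b = [3, 0, 1], [1, 2, 0]  # toy term-count vectors for two films
print(cityblock(a, b))
print(cosine(a, b))
print(euclidean(l2_normalize(a), l2_normalize(b)))
print(jaccard(l0_normalize(a), l0_normalize(b)))
print(jensen_shannon(l1_normalize(a), l1_normalize(b)))
```

One consequence worth noting: on L2-normalized vectors the squared Euclidean distance equals twice the cosine distance, so those two dendrograms rank film pairs in the same order.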

## Interpretation