## 1. Scripts and IMDb Metadata

### Scripts and Sentiment Metrics

Our ```scripts.csv``` dataset contains every line of spoken dialogue from every Seinfeld episode, along with the emotional makeup of each line. ```get_final_data.py``` uses ```load_data.py``` to load all script data from [Kaggle](https://www.kaggle.com/thec03u5/seinfeld-chronicles). The raw data is then cleaned, reindexed, and reformatted to follow the episode naming and numbering conventions used by IMDb and Netflix.

*Refer to the comments in ```load_data.py``` for details on reindexing.*

Sentiment analysis is then applied to each line of dialogue using the ```text2emotion``` Python package, in ```./scripts/precompute_tools/sentiment.py```. Each row in our data receives a score for each of five emotions: 'Happy', 'Angry', 'Surprise', 'Sad', and 'Fear'.
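As a rough illustration of the per-line scoring step, the sketch below attaches the five emotion columns to each line of dialogue. It is not the code in ```sentiment.py```: the names ```score_dialogue``` and ```toy_scorer``` are hypothetical, and the scorer is passed in as a parameter so the example runs without any third-party package.

```python
from typing import Callable, Dict, List

# Emotion columns as they appear in scripts.csv.
EMOTIONS = ("Happy", "Angry", "Surprise", "Sad", "Fear")


def score_dialogue(
    lines: List[str],
    scorer: Callable[[str], Dict[str, float]],
) -> List[Dict[str, object]]:
    """Return one row per line of dialogue with a score for each emotion.

    `scorer` is any function mapping text to a dict of emotion scores.
    Emotions missing from the scorer's output default to 0.0.
    """
    rows = []
    for line in lines:
        scores = scorer(line)
        row: Dict[str, object] = {"Dialogue": line}
        for emotion in EMOTIONS:
            row[emotion] = float(scores.get(emotion, 0.0))
        rows.append(row)
    return rows


def toy_scorer(text: str) -> Dict[str, float]:
    """Hypothetical stand-in scorer, used only to keep this sketch runnable."""
    return {"Happy": 1.0 if "!" in text else 0.0}


rows = score_dialogue(["Yes, it was purple!"], toy_scorer)
```

With ```text2emotion``` installed, its ```get_emotion``` function could be passed as the scorer in place of ```toy_scorer```.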

The final dialogue data, ```scripts```, is then stored as a ```.csv``` file in ```'./an_analysis_of_nothing/static/data'``` to be reformatted into a Pandas DataFrame with the following columns:

| Character   | Dialogue | SEID | Season | EpisodeNo | Happy | Angry | Surprise | Sad | Fear | numWords |
| ----------- | -------- | ---- | ------ | --------- | ----- | ----- | -------- | --- | ---- | -------- |
| JERRY | 'Yes, it was purple..' | S01E01 | 1 | 1 | .24 | 0 | 0 | .41 | .35 | 189 |
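The ```SEID``` value in the sample row appears to combine the season and episode number into a single key. The sketch below assumes a two-digit, zero-padded ```SxxExx``` pattern inferred from that one row; ```make_seid``` and ```parse_seid``` are hypothetical helpers, not functions from ```load_data.py```.

```python
import re
from typing import Tuple


def make_seid(season: int, episode: int) -> str:
    """Build an episode key such as 'S01E01' (assumed zero-padded format)."""
    return f"S{season:02d}E{episode:02d}"


def parse_seid(seid: str) -> Tuple[int, int]:
    """Split an episode key back into (season, episode)."""
    match = re.fullmatch(r"S(\d{2})E(\d{2})", seid)
    if match is None:
        raise ValueError(f"Unexpected SEID format: {seid!r}")
    return int(match.group(1)), int(match.group(2))


pilot = make_seid(1, 1)
season, episode = parse_seid(pilot)
```

A key like this makes it straightforward to join dialogue rows to episode-level rows that carry the same ```SEID``` column.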



### IMDb Metadata

Our ```metadata.csv``` dataset contains basic and advanced metadata for each episode of Seinfeld. Basic IMDb metadata is parsed and cleaned by ```load_data.py``` from [IMDb Interfaces](https://www.imdb.com/interfaces/). We then call ```_scrape_epi_pages.py```, which uses the Python ```requests``` package to scrape episode descriptions, user-generated plot summaries, and user-generated plot keywords from the IMDb page of each episode.

The final metadata, ```meta```, is then stored as a ```.csv``` file in ```'./an_analysis_of_nothing/static/data'``` to be reformatted into a Pandas DataFrame with the following columns:

| Season | EpisodeNo | AirDate | Writers | Director | SEID | tconst | Title | EpiNo_Netflix | runtimeMinutes | numVotes | averageRating | Description | Summaries | keyWords |
| ------ | --------- | ------- | ------- | -------- | ---- | ------ | ----- | ------------- | -------------- | -------- | ------------- | ----------- | --------- | -------- |
| 1 | 1 | 5-Jul-89 | Larry David, Jerry Seinfeld | Art Wolff | S01E01 | tt0098286 | Good News, Bad News | 1 | 23 | 6031 | 7.4 | 'Jerry and George argue... ' | ['In this episode, ...'] | ['tv series pilot', 'cafe', 'waitress', ..] |
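The ```Summaries``` and ```keyWords``` cells in the sample row look like Python lists written out as text. If they are stored that way in the ```.csv``` file (an assumption based only on the sample row), they would need to be converted back into lists after loading. A minimal sketch, using a hypothetical helper named ```parse_list_cell```:

```python
import ast
from typing import List


def parse_list_cell(cell: str) -> List[str]:
    """Convert a list-like string from a CSV cell into a list of strings.

    ast.literal_eval only evaluates Python literals, so it is a safer
    choice than eval for text read from a file.
    """
    value = ast.literal_eval(cell)
    if not isinstance(value, list):
        raise ValueError(f"Expected a list literal, got: {cell!r}")
    return [str(item) for item in value]


# Illustrative cell value, shortened from the sample row above.
keywords = parse_list_cell("['tv series pilot', 'cafe', 'waitress']")
```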


## 2. Search Query Vectors

Additional NumPy vectors for episode querying are also generated by ```get_final_data.py```, which calls ```create_corpus_embeddings``` in ```./scripts/precompute_tools/query_vectors.py```. This function uses a Sentence Transformer model to generate dialogue embeddings and stores them as sharded NumPy array files.

These dialogue feature vectors are computed in advance so that querying is fast: there is no need to re-encode every line of dialogue for each new search query. The result is a 54590x384 torch tensor, stored as 10 ```.npy``` files located in ```./an_analysis_of_nothing/static/data/dialogue_tensors```.
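To show why precomputed shards make queries cheap, the sketch below splits an embedding matrix into 10 ```.npy``` files, reloads them, and ranks rows against a query vector. Several details are assumptions rather than the project's actual behavior: the shard file names, the helper names, and the use of cosine similarity for ranking. Random vectors stand in for real Sentence Transformer embeddings, and the matrix is much smaller than the real 54590x384 one.

```python
import tempfile
from pathlib import Path

import numpy as np


def save_shards(embeddings: np.ndarray, out_dir: Path, n_shards: int = 10) -> None:
    """Split the embedding matrix row-wise and save each piece as a .npy file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for i, shard in enumerate(np.array_split(embeddings, n_shards)):
        # Hypothetical naming scheme; zero-padding keeps the files sortable.
        np.save(out_dir / f"shard_{i:02d}.npy", shard)


def load_shards(out_dir: Path) -> np.ndarray:
    """Reload all shards in order and stack them back into one matrix."""
    paths = sorted(out_dir.glob("shard_*.npy"))
    return np.concatenate([np.load(p) for p in paths], axis=0)


def top_k(query: np.ndarray, corpus: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k corpus rows most similar to the query (cosine)."""
    corpus_norm = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    scores = corpus_norm @ query_norm
    return np.argsort(-scores)[:k]


# Random stand-in for dialogue embeddings (200 rows, 384 dimensions).
rng = np.random.default_rng(0)
corpus = rng.normal(size=(200, 384)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp:
    shard_dir = Path(tmp) / "dialogue_tensors"
    save_shards(corpus, shard_dir)
    restored = load_shards(shard_dir)

# Only the query needs encoding at search time; here row 42 plays that role.
best = top_k(corpus[42], restored, k=3)
```

At query time only the search string has to be encoded; the dialogue side of the comparison is a single matrix product against the reloaded array.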